CN101393566A - Information tracking and detecting method and system based on network structure user pattern of behavior - Google Patents



Publication number
CN101393566A
Authority
CN
China
Prior art keywords
network
data
user
unit
user model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008102268029A
Other languages
Chinese (zh)
Inventor
刘云
张立
李勇
沈波
张振江
贾凡
程辉
丁飞
司夏萌
张海峰
朱国东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CNA2008102268029A priority Critical patent/CN101393566A/en
Publication of CN101393566A publication Critical patent/CN101393566A/en
Pending legal-status Critical Current


Abstract

The invention provides a method and a system for tracking and detecting information based on network structure and user behavior patterns. The method comprises the following steps: first, all target information in an Internet forum is obtained, and information such as the titles, contents, user names and publication times of the relevant posts is extracted from it; second, a network construction algorithm reconstructs the network structure of the Internet forum from the extracted information, building an overall graph of the forum and calculating its relevant characteristics; third, a fitness estimation algorithm, an activeness estimation algorithm and a hotness estimation algorithm are applied to obtain the hot topics in the Internet forum. Because the method takes into account both the network structure of the Internet forum and the importance of its users, it can quickly identify sensitive and controversial hot topics in the forum while reducing the amount of calculation.

Description

Information tracking and detection method and system based on network structure and user behavior patterns
Technical field
The present invention relates to network information analysis technology and, more specifically, to an information tracking and detection method and system based on network structure and user behavior patterns.
Background art
The rapid development of information technology has made mass data storage possible, and the information explosion has become one of the major problems facing the IT field. How to obtain valuable information quickly and efficiently from massive network data is a problem in urgent need of a solution. Broadband, optical fiber and 3G technologies have made users' access to data increasingly smooth, and the information bottleneck caused by limited bandwidth is gradually receding. As network applications have grown richer, static applications have been followed by more and more dynamic ones, such as bulletin board systems (BBS), weblogs (blogs) and Wikipedia. Traditional portal sites have also upgraded their systems, greatly extending sites that were originally built on static content and rapidly increasing the number of users they attract. The posts, topics, replies, comments and opinions that these users publish every day are scattered across the various applications on the network, and the amount of information grows day by day.
People cannot process such a massive amount of network information by themselves, so extracting useful and meaningful topics from it has become an important need. Topic detection is a research subfield of Topic Detection and Tracking (TDT); its purpose is to organize and explore massive amounts of text and to identify specific topics within it. Topic detection automatically merges separate pieces of information into clusters, with each cluster holding different pieces of information on similar topics. Researchers from the US Defense Advanced Research Projects Agency (DARPA), the University of Massachusetts, Carnegie Mellon University and Dragon Systems completed a pilot study in 1998. A topic is defined as a seminal event or activity together with the events or activities directly related to it. A topic can therefore be regarded as consisting of a series of events or activities. The TDT 2004 documentation describes events and activities in more detail: an event is something that happens at a specific time and place, with its necessary preconditions and unavoidable consequences, whereas an activity is a series of related events with a common focus that take place within a certain period and at a certain place.
Prior research on topic tracking and detection differs somewhat from the present invention, in which information tracking and detection mainly means finding hot topics in Internet forums. An Internet forum is essentially a Web application that hosts topic discussions and publishes user-generated content. Internet forums are also commonly called Web forums, bulletin board systems, discussion boards, or simply boards or forums. In general, "forum" refers to the discussion community as a whole, while "board" refers to a sub-forum within that community that usually covers topics in one particular area. The threads within a board are usually organized in some fixed form that varies from forum to forum; the most common are chronological organization and organization by topic.
According to statistics from the China Internet Network Information Center (CNNIC), by the end of 2008 China had nearly 253 million Internet users, including 214 million broadband users, who accounted for 84.7% of the total. As a steady stream of network applications becomes popular in China, more and more young people are taking up applications that offer interaction and entertainment. Although applications such as Myspace and Facebook are popular worldwide, and China has similar social networking applications of various kinds, the class of network application that Chinese Internet users use most, and most widely, is still the traditional bulletin board system and the newer Internet forum. In China, the number of registered users across all Internet forums has exceeded 3 billion (each Internet user can register in multiple forums), 80% of websites in the country run their own Internet forum, and these forums together receive more than 1.6 billion page views (PV) per day. More than 10 million posts are published in these forums every day. Although these posts contain a large amount of spam and harmful information, their overall volume is striking.
According to reference 1 (Kumaran G, Allan J. Text classification and named entities for new event detection. Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval. 2004. 297-304), a hot topic can be defined as a topic that appears frequently within a given period of time. Kumaran et al. also describe the popularity of a topic in terms of two main factors: the first is the frequency with which a popular keyword appears in a document, and the second is the number of documents that contain the keyword. Such weighting-based methods of hot topic discovery are very effective at capturing the important, representative keywords in a document. Among the many methods for evaluating the importance of keywords, TF-IDF (Term Frequency-Inverse Document Frequency) is very common, and the TF*IDF method appeared after it. Both methods require a large amount of calculation. Because current topic tracking and detection algorithms are not designed around the actual characteristics of Internet forums and are computationally expensive, they cannot quickly identify sensitive and controversial topics in network information.
Summary of the invention
To address the above problems, the present invention proposes a hot topic discovery method for Internet forums based on network structure and user behavior patterns, which can quickly detect hot topics in an Internet forum and reduce the amount of calculation.
The invention is mainly concerned with how to detect or extract hot topics of interest from a given Internet forum, and proposes a method and a system for doing so. The method and system use the theory of complex networks to analyze the relations between users in an Internet forum and also analyze the users' behavior patterns; both aspects differ from the prior art. In the Internet forum setting, the invention is highly efficient and accurate.
To overcome the deficiencies of the prior art, the invention provides an information tracking and detection method based on network structure and user behavior patterns, comprising the following steps:
A. extracting network data;
B. constructing a network structure;
C. calculating first user model data according to the network data and the network structure;
D. calculating second user model data according to the network data and the network structure;
E. obtaining a detection result according to the network data, the network structure, the first user model data and the second user model data.
According to another aspect of the invention, in step A, web pages are fetched and stored by a web crawler unit, and the network data are parsed and extracted by an information extraction unit.
According to another aspect of the invention, in step B the network structure is constructed in the form of a graph; the graph is an undirected graph, and each user in the network structure corresponds to one node of the graph.
According to another aspect of the invention, the first user model data are fitness estimates obtained by a fitness estimation algorithm, the second user model data are activeness estimates obtained by an activeness estimation algorithm, and the detection result is a hotness estimate obtained by a hotness estimation algorithm.
The present invention also provides an information tracking and detection system based on network structure and user behavior patterns, characterized in that the system comprises:
a web crawler unit, for fetching and storing the web pages of a target website;
an information extraction unit, for extracting the required network data;
a universal data access unit, for storing the network data extracted by the information extraction unit in a database and for reading the data already stored in the database;
a network construction unit, which constructs the network structure from the extracted network data;
a first user model data estimation unit, for estimating the first user model data of the network nodes;
a second user model data estimation unit, for estimating the second user model data of the network nodes;
a detection result acquisition unit, for obtaining a detection result according to the network data, the network structure, the first user model data and the second user model data.
According to another aspect of the invention, the system further comprises a template management unit, for creating, modifying and deleting predefined templates;
the web crawler unit accesses the target website according to a URL, fetches the web pages of the target website, and stores the fetched page data in the local file system;
the information extraction unit can work sequentially or in parallel with the web crawler unit; it matches the stored web pages against the templates predefined in the template management unit and then obtains the required network data according to the data to be extracted and the data patterns defined in the matching template.
According to another aspect of the invention, the network construction unit constructs the network structure in the form of a graph; the graph is an undirected graph, and each user in the network structure corresponds to one node of the graph.
According to another aspect of the invention, the first user model data are fitness estimates obtained by a fitness estimation algorithm, the second user model data are activeness estimates obtained by an activeness estimation algorithm, and the detection result is a hotness estimate obtained by a hotness estimation algorithm.
Brief description of the drawings
Fig. 1 is a schematic diagram of the structure of an Internet forum post according to one embodiment of the present invention.
Fig. 2 is the degree distribution curve of the network according to one embodiment of the present invention.
Fig. 3 is a schematic diagram of the degree distribution curve of a network generated using formula (4) according to one embodiment of the present invention.
Fig. 4 is a chart of the number of nodes added per day according to one embodiment of the present invention.
Fig. 5 is a chart of the number of edges added per day according to one embodiment of the present invention.
Fig. 6 is a chart of the number of edges added per node per day according to one embodiment of the present invention.
Fig. 7 is the distribution of the number of posts created by users.
Fig. 8 is a workflow diagram according to one embodiment of the present invention.
Fig. 9 is a schematic diagram of the system architecture according to one embodiment of the present invention.
Detailed description of the embodiments
To further explain the principles and features of the present invention, the invention is described in detail below with reference to the drawings and specific embodiments.
The information tracking and detection method based on network structure and user behavior patterns according to one embodiment of the invention comprises the following.
First, the network data of the Internet forum are extracted and the network structure is established.
An Internet forum generally consists of several boards, each of which may in turn contain sub-boards or many posts; a sub-board generally contains the relevant posts directly. According to one embodiment of the invention, news items, forum posts, blog entries and the like that appear on the network about a given subject are collectively referred to as "posts". Fig. 1 is a schematic diagram of the structure of an Internet forum post according to one embodiment of the present invention.
As shown in Fig. 1, a typical post comprises the board name (Board Name), title (Title), poster (User Name), content (Content), quoted content (Replied To) and the posting or editing time (Time of Post or Edit). The board name indicates where the post was published. The title and content are the most important parts of a post and are the parts that most topic detection methods focus on. The user name is the identifier by which the poster is recognized in the Internet forum; it may be a character string, a number or the like. In general, user names within an Internet forum may not be duplicated. The time and date indicate when the post was published. Some Internet forums allow published posts to be edited, and such forums generally also show when the post was last edited. If a reply or quotation relation exists between posts, it is usually reflected in the body of the post, where all or part of the replied-to or quoted content is shown above or below the main content.
Because user names within an Internet forum may not be duplicated, users can be linked into a network through the posts they publish. The degree distribution is one of the key characteristics of a graph or network, so studies of networks generally start from the degree distribution.
In the present invention, G denotes a graph. A graph G is an ordered pair (V, E), where V is called the vertex set and E the edge set. They may also be written V(G) and E(G).
Each element of E is a pair (x, y), where x, y ∈ V. If the two endpoints of an edge are the same vertex, the edge is called a loop.
If a direction is assigned to every edge of a graph, the resulting graph is called a directed graph. In a directed graph, the edges incident to a node are divided into outgoing edges and incoming edges. Conversely, a graph whose edges have no direction is called an undirected graph.
Preferably, according to one embodiment of the invention, an undirected graph is used to represent the network, thereby establishing the network structure of the Internet forum.
The degree of a vertex is the number of edges incident to that vertex; the degree of a vertex v is written d(v). Clearly:
$$\sum_{v \in V} d(v) = 2|E| \tag{1}$$
In a directed graph, the degree of a vertex is divided into in-degree and out-degree. The in-degree of a vertex is the number of incoming edges incident to it, and the out-degree is the number of outgoing edges incident to it.
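As a small illustration of the degree definition and formula (1), the following Python sketch counts vertex degrees in an undirected simple graph and checks that they sum to twice the number of edges. The function name `degrees` and the sample edge list are hypothetical and are not taken from the patent.

```python
from collections import Counter


def degrees(edges):
    """Return {vertex: degree} for an undirected graph given as (u, v) pairs.

    Assumes a simple graph: no loops and no repeated edges.
    """
    deg = Counter()
    for u, v in edges:
        deg[u] += 1  # each edge contributes one to each endpoint
        deg[v] += 1
    return dict(deg)


# Hypothetical four-user example.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
deg = degrees(edges)

# Formula (1): the degrees sum to twice the number of edges.
assert sum(deg.values()) == 2 * len(edges)
```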
In the Internet forum considered in one embodiment of the invention, each user corresponds to one node of the undirected graph. Assume that the graph has no loops and that at most one edge exists between any two vertices; then, if the edge set E contains a pair (i, j), at least one reply relation between posts exists between node i and node j.
According to one embodiment of the invention, all the data of a medium-sized forum on the Internet were obtained; the forum has about 20,000 registered users in total and nearly 700,000 posts.
In addition, the information tracking and detection system based on network structure and user behavior patterns according to one embodiment of the invention, described below, is used to obtain the relevant data from the Internet forum. These data mainly include the user name, user ID, number of posts, topic identifier, post content and posting time. The system uses these data to construct the whole network and calculates the degree distribution of that network.
The network formation model is defined as follows:
● Node: each distinct user ID that posts in any discussion board counts as one node; repeated occurrences of the same user ID are not counted again;
● Edge: if a reply relation exists between two user IDs, an edge is considered to exist between the corresponding nodes;
● Self-loop: the network is assumed to have no self-loops; a self-loop that would be formed when a user replies to his or her own original post is ignored;
● Multiple edges: multiple edges formed by repeated reply relations between users are not considered; at most one edge is considered to exist between any two users.
The degree of each node is obtained directly by counting, whereas the degree distribution must be calculated after the degrees of all nodes in the network have been obtained. The degree distribution is in effect the probability p(k) that a node of degree k occurs in the whole network.
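The calculation of p(k) described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `degree_distribution` and the sample edges are hypothetical, and users who have no reply relation at all (degree 0) do not appear because only edges are passed in.

```python
from collections import Counter


def degree_distribution(edges):
    """Return {k: p(k)}, the fraction of nodes whose degree is k.

    `edges` is an iterable of (u, v) pairs of an undirected simple graph.
    Nodes that appear in no edge are not counted (an assumption of this sketch).
    """
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    n = len(deg)                   # number of nodes with at least one edge
    counts = Counter(deg.values())  # how many nodes have each degree k
    return {k: c / n for k, c in sorted(counts.items())}


# Hypothetical star-shaped reply network: user "a" exchanged replies with b, c, d.
p = degree_distribution([("a", "b"), ("a", "c"), ("a", "d")])
```

Here three of the four nodes have degree 1 and one node has degree 3, so `p` maps 1 to 0.75 and 3 to 0.25.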
Fig. 2 is the degree distribution curve of the network according to one embodiment of the present invention. The horizontal axis represents the node degree k; the vertical axis represents the probability p(k) that a node of degree k occurs in the whole network. The axes of the inset in Fig. 2 have the same meaning as those of the main plot, but are drawn on a log-log scale; the log-log plot is one of the key tools for examining whether a network has the scale-free property.
As can be seen from Fig. 2, the degree distribution of the reply relation network is basically the same as that of the original BA scale-free network. The reply relation network is the network formed by the reply relations between users of the Internet forum; the original BA scale-free network is the scale-free network first proposed by Barabási et al. (Barabási, Albert-László and Réka Albert, "Emergence of scaling in random networks", Science, 286:509-512, October 15, 1999). Both the main plot and the inset in Fig. 2 share the graphical features of the original BA scale-free network. The degree distribution of the BA scale-free network obeys a power law, which can be expressed by formula (2):
$$P(k) \propto k^{-r} \tag{2}$$
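The text does not say how the exponent r of formula (2) is fitted. One simple approach, shown here only as a rough sketch under that assumption, is an ordinary least-squares fit of log p(k) against log k, whose negated slope is taken as r. The function name `fit_power_law_exponent` is hypothetical, and this kind of log-log regression is known to be a crude estimator on noisy data.

```python
import math


def fit_power_law_exponent(pk):
    """Estimate r in p(k) ~ k^(-r) by least squares on the log-log points.

    `pk` maps degree k to probability p(k); entries with k <= 0 or p <= 0
    are skipped. At least two distinct degrees are required.
    """
    pts = [(math.log(k), math.log(p)) for k, p in pk.items() if k > 0 and p > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return -sxy / sxx  # slope of the fitted line is -r


# Synthetic check on an exact (unnormalized) k^-3 curve, not on forum data.
synthetic = {k: k ** -3.0 for k in range(1, 50)}
r = fit_power_law_exponent(synthetic)
```

On the synthetic curve the fitted value of `r` is 3 up to floating-point error; real degree data would of course scatter around the fitted line.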
In the BA scale-free network, r in formula (2) is 3, whereas the value calculated from Fig. 2 for the reply relation network of the Internet forum is r = 2.28937 ± 0.01321. The difference between the r values of the two degree distributions is caused by the different preferential attachment probabilities Π used in constructing the networks. In the BA scale-free network, the preferential attachment probability Π is defined as shown in formula (3):
$$\Pi_i = \frac{\eta_i k_i}{\sum_j \eta_j k_j} \tag{3}$$
In the network constructed according to one embodiment of the present invention, the preferential attachment probability Π is as shown in formula (4):
$$\Pi_i(t) = \frac{\eta_i k_i (t - t_i + 1)^{-\alpha}}{\sum_j \eta_j k_j (t - t_j + 1)^{-\alpha}} \tag{4}$$
In formula (4), t denotes the evolution step and $\eta_i$ denotes the fitness of node i; for its detailed definition see reference 2 (Lu G. Old School BBS: The Chinese Social Networking Phenomenon: http://www.readwriteweb.com/archives/bbs_china_social_networking.php). $k_i$ denotes the degree of node i, and $\alpha$ denotes the decay factor. Formula (4) can be used either to generate a network from a small number of initial nodes or, given an existing network, to estimate the fitness of a given node.
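To make formula (4) concrete, the sketch below computes the attachment probability of every node at step t. It assumes that $t_i$ is the step at which node i joined the network, which the text does not state explicitly; the function name `attachment_probabilities` and the sample values are likewise hypothetical.

```python
def attachment_probabilities(nodes, t, alpha):
    """Evaluate formula (4) for every node at evolution step t.

    `nodes` maps a node name to (eta, k, t_i): fitness, degree and the step
    at which the node joined (assumed meaning of t_i, with t_i <= t).
    At least one node must have a non-zero weight.
    """
    weights = {
        name: eta * k * (t - t_i + 1) ** (-alpha)
        for name, (eta, k, t_i) in nodes.items()
    }
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Two nodes with equal fitness and degree; "old" joined at step 1, "new" at step 10.
nodes = {"old": (1.0, 4, 1), "new": (1.0, 4, 10)}
with_decay = attachment_probabilities(nodes, t=10, alpha=1)
no_decay = attachment_probabilities(nodes, t=10, alpha=0)
```

With `alpha=1` the older node is penalized by its age, so the newer node receives the larger probability; with `alpha=0` the age term vanishes and the expression reduces to formula (3), giving both nodes 0.5.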
Fig. 3 is a schematic diagram of the degree distribution curve of a network generated using formula (4) according to one embodiment of the present invention. The horizontal axis in Fig. 3 represents the node degree k; the vertical axis represents the probability p(k) that a node of degree k occurs in the whole network. The axes of the inset in Fig. 3 have the same meaning as those of the main plot, but are drawn on a log-log scale.
For a given network, formula (4) can be used to estimate the fitness of a node; in the network of the present invention, a node corresponds to a user of the Internet forum. Once the fitness of a user (that is, of a node) has been obtained, it can be used to estimate the hotness of the topics started by that user. Fitness can therefore be used not only to detect hot topics but also for some prediction tasks.
Next, the user behavior patterns of the Internet forum are analyzed.
Users of an Internet forum can be roughly divided into two classes: active users and inactive users. Active users log in to the forum frequently and regularly to browse and read the relevant information, whereas inactive users do so irregularly and less often. In terms of publishing and creating posts, active users frequently publish posts to discuss all kinds of questions, while inactive users rarely do.
To better understand the attributes of users in the Internet forum, the present invention compiles statistics on the reply relation network produced from the forum data. Fig. 4 is a chart of the number of nodes added per day according to one embodiment of the present invention. Fig. 5 is a chart of the number of edges added per day according to one embodiment of the present invention. Fig. 6 is a chart of the number of edges added per node per day according to one embodiment of the present invention.
Fig. 4 shows the number of nodes added each day; the horizontal axis represents the day and the vertical axis the number of nodes added. Fig. 5 shows the number of edges added each day; the horizontal axis represents the day and the vertical axis the number of edges added. As Figs. 4 and 5 show, neither the daily increase in nodes nor the daily increase in edges follows a clear pattern; both are essentially random. It is difficult to describe the evolution of users and posts in the Internet forum accurately from Figs. 4 and 5.
Fig. 6 shows the average number of edges added per node each day; the horizontal axis represents the day and the vertical axis the average number of links per node. As with Figs. 4 and 5, the variation of this feature shows no clear pattern and is difficult to describe with an exact expression.
Fig. 7 is the distribution of the number of posts created by users; the horizontal axis represents the number of posts and the vertical axis the relative frequency. As Fig. 7 shows, the Internet forum has a few super users who have created a large number of posts, while a large number of users have created only a few. In one embodiment of the invention, the most active user in the forum created more than 7,000 posts, while 40% of users created only one post.
Finally, the hot topics in the Internet forum are determined.
A hot topic is a topic that, within a certain period, appears frequently in the Internet forum and involves many active users. Let $h_t$ denote the hotness of a topic, where the subscript t denotes the topic; $h_t$ is then defined by the following formula:
$$h_t = \frac{\sum_{i \in u_t} f_i a_i}{T_t} \tag{5}$$
In formula (5), $T_t$ denotes the duration of the topic in the Internet forum, $u_t$ denotes the set of all users who discussed the topic, $f_i$ denotes the fitness of user i, and $a_i$ denotes the activeness of user i. To simplify the calculation, the number of posts a user published during the forum's active period can be used to represent the user's activeness, and the preferential attachment probability Π can be used to represent the user's fitness. The hotness of a given topic is thus determined mainly by the importance of the different users that the topic attracts during its lifetime.
Pseudocode for the algorithms involved in each of the above steps is given below.
1. Network construction algorithm
The following pseudocode is the network construction algorithm:
init user list V
init connection list E
put all users into set V
foreach user i in V
    if user i has a connection with user j != i in V and (i, j) is not in E
        put (i, j) in E
    end if
loop
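The network construction pseudocode above could be rendered in Python roughly as follows. This is a minimal sketch under stated assumptions: the input format (a list of user names plus a list of (replier, original author) pairs) and the function name `build_reply_network` are hypothetical, and each undirected edge is stored as a `frozenset` so that (i, j) and (j, i) are treated as the same connection.

```python
def build_reply_network(users, replies):
    """Build the undirected reply network (V, E).

    `users` is an iterable of user names; `replies` is an iterable of
    (replier, original_author) pairs. Self-replies and repeated replies
    between the same two users add no extra edges.
    """
    V = set(users)
    E = set()
    for replier, author in replies:
        if replier == author:
            continue  # ignore self-loops
        if replier in V and author in V:
            E.add(frozenset((replier, author)))  # a set keeps at most one edge per pair
    return V, E


# Hypothetical sample: "dan" registered but never took part in a reply.
V, E = build_reply_network(
    ["ann", "bob", "cat", "dan"],
    [("ann", "bob"), ("bob", "ann"), ("ann", "ann"), ("cat", "bob")],
)
```

Using a set of `frozenset` pairs is one way to enforce the "no self-loops, no multiple edges" rules of the formation model without an explicit membership test.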
2. Fitness estimation algorithm
The following pseudocode is the fitness estimation algorithm:
construct the network (V, E)
foreach user i in V
    fitness = 0
    foreach timestep t in user i's life
        fitness = fitness + (connections in t / (life time - t))
    loop
loop
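A literal Python reading of the fitness pseudocode might look like the sketch below. It assumes that a user's life is split into time steps numbered 0 to L-1 (with L the life time), so the divisor `life time - t` is never zero, and that "connections in t" means the number of new connections gained in step t. Both the indexing and the name `estimate_fitness` are assumptions of this sketch.

```python
def estimate_fitness(connections_per_step):
    """Sum, over each time step t, connections gained in t divided by (L - t).

    `connections_per_step[t]` is the number of connections the user gained in
    step t, for t = 0 .. L-1, where L is the user's life time in steps.
    Later steps have smaller divisors, so recent connections weigh more.
    """
    life_time = len(connections_per_step)
    return sum(c / (life_time - t) for t, c in enumerate(connections_per_step))


# Hypothetical user alive for three steps: 3 new connections, then 0, then 1.
fitness = estimate_fitness([3, 0, 1])  # 3/3 + 0/2 + 1/1
```

For the sample user the three terms are 1.0, 0.0 and 1.0, so the estimate is 2.0.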
3. Activeness estimation algorithm
The following pseudocode is the activeness estimation algorithm:
construct the network (V, E)
foreach user i in V
    p = count(posts i created)
    d = count(timesteps i registered)
    activeness = p / d
loop
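The activeness pseudocode is simply posts per registered time step, which can be sketched in Python as follows. The input format (a mapping from user name to a (post count, registered steps) pair) and the name `estimate_activeness` are hypothetical.

```python
def estimate_activeness(user_stats):
    """Return {user: posts created / time steps registered}.

    `user_stats` maps a user name to (post_count, registered_steps).
    Users with no registered time steps are skipped to avoid dividing by zero.
    """
    return {
        user: posts / steps
        for user, (posts, steps) in user_stats.items()
        if steps > 0
    }


# Hypothetical sample: "ann" wrote 30 posts in 10 steps, "bob" 2 posts in 8 steps.
activeness = estimate_activeness({"ann": (30, 10), "bob": (2, 8), "new": (0, 0)})
```

For the sample data "ann" gets 3.0 and "bob" gets 0.25, while "new" is left out because no time step has elapsed since registration.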
4. Hotness estimation algorithm
The following pseudocode is the hotness estimation algorithm:
construct the network (V, E)
fitness estimation
activeness estimation
foreach topic t in topics
    users = users involved in t
    hotness = 0
    foreach user u in users
        hotness = hotness + fitness(u) * activeness(u)
    loop
    hotness = hotness / timesteps t involved
loop
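Combining the previous estimates, the hotness pseudocode corresponds to formula (5) and can be sketched in Python as below. The function name `topic_hotness`, the dictionary inputs and the sample numbers are hypothetical; each participating user is counted once, following the definition of $u_t$ as a set of users.

```python
def topic_hotness(participants, fitness, activeness, duration):
    """Formula (5): sum of fitness * activeness over the topic's users, over duration.

    `participants` lists the users who posted in the topic (duplicates allowed),
    `fitness` and `activeness` map user names to their estimates, and
    `duration` is the number of time steps the topic lasted (must be > 0).
    """
    users = set(participants)  # u_t is a set, so count each user once
    total = sum(fitness[u] * activeness[u] for u in users)
    return total / duration


# Hypothetical topic lasting 5 steps with two participants.
fitness = {"ann": 2.0, "bob": 0.5}
activeness = {"ann": 3.0, "bob": 1.0}
hotness = topic_hotness(["ann", "bob", "ann"], fitness, activeness, duration=5)
```

For the sample topic the sum is 2.0 × 3.0 + 0.5 × 1.0 = 6.5, and dividing by the duration of 5 steps gives a hotness of 1.3.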
Experiments were carried out to verify the effectiveness of the algorithms. According to one embodiment of the invention, the algorithms are implemented by an information tracking and detection system based on network structure and user behavior patterns, the system comprising:
a web crawler unit, for fetching and storing the web pages of a target website;
an information extraction unit, for matching the fetched web pages against predefined templates to obtain the specific data required for topic hotness analysis;
a universal data access unit, for storing data in and reading data from a database;
a template management unit, for creating, modifying and deleting templates;
a network construction unit, for constructing the network structure;
a fitness estimation unit, for estimating the fitness of the network nodes;
an activeness estimation unit, for estimating the activeness of the network nodes;
a hotness estimation unit, for estimating the hotness of topics in the Internet forum.
The web crawler unit accesses the target website starting from a given initial URL and fetches its web pages. While a page is being accessed, it is parsed for URLs, and the URLs found in the page are placed in a URL queue. When the web crawler unit has finished fetching the page at the current URL, it goes on to fetch the next URL in the queue; the fetched page data are stored in the local file system. Software with functions similar to the web crawler unit exists in the prior art, mainly the crawlers of the major commercial search engines and some open-source software with similar functions (such as Nutch).
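The queue-driven crawl described above can be sketched with the Python standard library. This is an illustrative sketch, not the patent's crawler: the page download is abstracted as a caller-supplied `fetch` function, pages are kept in a dictionary rather than the local file system, and the `seen` set that prevents revisiting a URL is an added assumption. The names `LinkParser` and `crawl` and the `forum.example` URLs are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl from start_url; fetch(url) returns HTML or None."""
    queue = deque([start_url])  # the URL queue
    seen = {start_url}          # assumption: never enqueue a URL twice
    pages = {}                  # stands in for the local file system
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue            # page could not be fetched
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical in-memory "website" used instead of real network access.
site = {
    "http://forum.example/": '<a href="/t/1">1</a><a href="/t/2">2</a>',
    "http://forum.example/t/1": '<a href="/">home</a>',
    "http://forum.example/t/2": "",
}
pages = crawl("http://forum.example/", site.get)
```

Passing `fetch` as a parameter keeps the sketch testable offline; a real deployment would supply a function that downloads the page and would also restrict the crawl to the target site.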
The information extraction unit can work sequentially or in parallel with the web crawler unit. When it runs, the information extraction unit directly processes the web pages that the web crawler unit has fetched and stored locally. First, it identifies each locally stored page and determines which template suits it. Then it selects the appropriate template according to that determination; because the template defines the data to be extracted and the patterns of those data, applying the template to the page yields the required data. In one embodiment of the present invention, the data extracted from the Internet forum include, but are not limited to, user name, user ID, number of posts, topic ID, post content, and posting time. Data types include, but are not limited to, string, integer, and date-time.
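A template of the kind described above can be approximated with regular expressions. The sketch below is hypothetical throughout: the HTML layout, the field names, the `POST_TEMPLATE` structure, and the functions `select_template` and `extract` are invented for illustration and do not reflect the template format actually used by the system.

```python
import re
from datetime import datetime

# Hypothetical template: a pattern that recognises the page type, plus one
# (pattern, converter) pair per field to extract.
POST_TEMPLATE = {
    "match": re.compile(r'<div class="post"'),
    "fields": {
        "user_id": (re.compile(r'data-user-id="(\d+)"'), int),
        "user_name": (re.compile(r'<span class="author">(.*?)</span>'), str),
        "topic_id": (re.compile(r'data-topic-id="(\d+)"'), int),
        "posted_at": (
            re.compile(r"<time>(.*?)</time>"),
            lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M"),
        ),
        "content": (re.compile(r'<p class="body">(.*?)</p>', re.S), str),
    },
}


def select_template(page, templates):
    """Return the first template whose 'match' pattern occurs in the page."""
    for template in templates:
        if template["match"].search(page):
            return template
    return None


def extract(page, templates):
    """Apply the matching template and return the extracted fields as a dict."""
    template = select_template(page, templates)
    if template is None:
        return None  # no template fits this page
    record = {}
    for name, (pattern, convert) in template["fields"].items():
        found = pattern.search(page)
        record[name] = convert(found.group(1)) if found else None
    return record
```

The converters cover the three data types named in the text: string, integer, and date-time.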
The general data access unit stores the data obtained by information extraction in the database and can also read data back; data reads are mainly used for later calculations.
The template management unit is mainly used to create, modify, and delete templates in the system.
The network construction unit builds the network from the network data extracted by the information extraction unit, using the network construction algorithm described above. The processed data is then passed to the fitness estimation unit and the activeness estimation unit, and the hotness estimation unit further processes the results of those two units to obtain the hotness of each topic.
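As a rough illustration of the data structure the network construction unit produces, the sketch below builds an undirected graph with one node per user. The linking rule used here (two users are connected when they have posted in the same topic) is an assumption made only for this example; the actual rule is the network construction algorithm defined earlier in the description. The function name `build_user_graph` and the `(topic_id, user_id)` input layout are likewise hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_user_graph(posts):
    """Build an undirected user graph as {user_id: set of neighbouring user_ids}.

    posts -- iterable of (topic_id, user_id) pairs (assumed layout)
    """
    participants = defaultdict(set)  # topic_id -> users who posted in it
    for topic_id, user_id in posts:
        participants[topic_id].add(user_id)

    graph = {}
    for users in participants.values():
        for user in users:
            graph.setdefault(user, set())  # every user becomes a node
        # Assumed linking rule: connect every pair of users sharing a topic.
        for u, v in combinations(sorted(users), 2):
            graph[u].add(v)  # undirected: store the edge in both directions
            graph[v].add(u)
    return graph
```

An adjacency-set representation keeps neighbour lookups cheap, which is convenient for node-level estimates such as fitness and activeness.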
This system is built with .NET technology and can run on any platform that supports the Common Language Runtime (CLR). The main function of the system is to collect information about users and posts from the Internet forum that serves as the data source. The experimental results show that the user fitness algorithm, the user activeness algorithm, and the network construction algorithm included in the hot-topic detection method of an embodiment of the present invention are all effective. Of course, those skilled in the art will appreciate that any suitable programming software can be used to implement the system, and that the choice of Internet forum includes but is not limited to domestic forums.
According to an embodiment of the present invention, all data used in the experiment were collected from a domestic forum, comprising nearly 700,000 posts and nearly 20,000 users. The data structure of every post is essentially the same, as shown in Figure 1 above. Preferably, after the forum post data is obtained by the foregoing method, it is stored in a standard relational database. A corresponding data table must be created in this database according to the post content of the Internet forum; the table should have columns such as user name, post subject, post content, and publication and edit times. These data are obtained by the system described above. Of course, those skilled in the art will appreciate that any suitable database can be used, including but not limited to a relational database.
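The post table described above might look like the following sketch. SQLite and Python are used here purely for illustration (the original system was built on .NET, and the text allows any suitable database); the table name, column names, and helper functions are assumptions.

```python
import sqlite3

# Columns follow the ones named in the text: user name, post subject,
# post content, and publication and edit times (stored as ISO 8601 text).
SCHEMA = """
CREATE TABLE IF NOT EXISTS posts (
    post_id   INTEGER PRIMARY KEY,
    user_name TEXT NOT NULL,
    subject   TEXT NOT NULL,
    content   TEXT NOT NULL,
    posted_at TEXT NOT NULL,
    edited_at TEXT
)
"""


def open_store(path=":memory:"):
    """Open the database and make sure the posts table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_post(conn, user_name, subject, content, posted_at, edited_at=None):
    """Store one extracted post; the transaction commits on success."""
    with conn:
        conn.execute(
            "INSERT INTO posts (user_name, subject, content, posted_at, edited_at)"
            " VALUES (?, ?, ?, ?, ?)",
            (user_name, subject, content, posted_at, edited_at),
        )


def posts_by_user(conn, user_name):
    """Read back a user's posts in chronological order."""
    cursor = conn.execute(
        "SELECT subject, content, posted_at FROM posts"
        " WHERE user_name = ? ORDER BY posted_at",
        (user_name,),
    )
    return cursor.fetchall()
```

Storing timestamps as ISO 8601 strings lets plain text ordering double as chronological ordering.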
Table 1 shows the fitness results obtained by the method of an embodiment of the present invention. It lists the ten users with the highest fitness. Each user ID in the table represents a unique user; to protect the privacy of the forum's users, Table 1 shows user IDs rather than user names. The fitness values were calculated with the fitness estimation algorithm introduced above. User IDs are large integers that increase monotonically, so a user with a small ID registered in the forum earlier and a user with a large ID registered later. As Table 1 shows, how early a user registered has no direct positive correlation with that user's fitness.
Table 1: Top ten users by fitness
Figure A200810226802D00181
Figure A200810226802D00191
Table 2 shows the activeness results obtained by the method of an embodiment of the present invention. It lists the ten users with the highest activeness values; the user IDs are the same as those used in Table 1 and are not described again here. Only one user in Table 2, the one ranked 5th, also appears in Table 1, where that user ranks 8th. This shows that activeness and fitness differ considerably; both are important parameters reflecting a user's importance in the Internet forum.
Table 2: Top ten users by activeness
Table 3 shows the hotness results obtained by the method of an embodiment of the present invention. It lists the ten hottest topics in the data source. Like a user ID, a topic ID is a unique identifier, used to identify one topic. The hotness values in Table 3 were calculated with the hotness estimation algorithm introduced above.
Table 3: Top ten hot topics
Figure A200810226802D00201
The main content of the ten hottest topics is briefly listed below, together with the related data for each topic. To protect personal privacy, part of the content has been concealed; the concealed content consists mainly of the names of certain persons.
1. This topic mainly consists of quarrels among some very active users in the Internet forum. The data source contains 97 posts related to this topic, with 29 different users participating.
2. This topic is mainly a discussion of a certain past statesman. The data source contains 50 posts related to this topic, with 26 users participating.
3. This topic is mainly a discussion of another past statesman. The data source contains 45 posts related to this topic, with 30 users participating.
4. This topic mainly discusses political events that took place from the 1960s to the 1970s. The data source contains 45 related posts, with 29 users participating.
5. This topic mainly discusses a political article published by a user of this Internet forum. The data source contains 117 related posts, with 69 users participating.
6. This topic mainly discusses how this Internet forum should develop. The data source contains 29 related posts, with 19 users participating.
7. This topic mainly consists of quarrels between two large groups of users in this Internet forum. The data source contains 86 posts related to this topic, with 35 different users participating.
8. This topic mainly consists of quarrels that once took place between two other large groups of users in this Internet forum. The data source contains 20 posts related to this topic, with 16 different users participating.
9. This topic mainly discusses the wife of a former statesman. The data source contains 86 related posts, with 37 users participating.
10. This topic mainly discusses China's economic model. The data source contains 36 related posts, with 24 users participating.
As the summaries above show, the number of posts discussing a topic and the number of participating users give an intuitive and simple picture of how popular the topic is, but the hot topics in the experimental results of an embodiment of the present invention were obtained entirely by the topic hotness estimation algorithm of the present invention. From the summaries one can infer that these are topics that attract many responses and much attention in the Internet forum; such topics are often controversial and politically sensitive.
In summary, according to an embodiment of the present invention, the overall workflow for discovering hot topics is shown in Figure 8.
S01: Extract network data. Use the method described above to extract from the Internet forum the data used to build the network.
S03: Build the network structure. Using the data obtained in S01, create the network structure of the Internet forum with the network construction algorithm of an embodiment of the present invention described above.
S05: Estimate fitness. Using the data and network structure obtained in S01 and S03, calculate the fitness of each user in the network with the fitness estimation algorithm of an embodiment of the present invention described above, and thereby judge the user's importance.
S07: Estimate activeness. Using the data and network structure obtained in S01 and S03, calculate the activeness of each user in the network with the activeness estimation algorithm of an embodiment of the present invention described above, and thereby judge the user's importance.
S09: Estimate hotness. Using the data, network structure, fitness, and activeness obtained in S01 to S07, determine the hot topics with the hotness estimation algorithm of an embodiment of the present invention described above.
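The workflow S01 to S09 amounts to a fixed pipeline in which each stage consumes the results of the earlier ones. The sketch below shows only that wiring; the four stage functions are passed in as parameters because their algorithms are defined elsewhere in the description, and the name `find_hot_topics`, its signature, and the `top_n` parameter are assumptions.

```python
def find_hot_topics(posts, build_network, estimate_fitness,
                    estimate_activeness, estimate_hotness, top_n=10):
    """Run steps S03 to S09 on the data extracted in S01.

    posts -- the network data extracted in S01
    The four callables are hypothetical stand-ins for the algorithms
    described earlier; estimate_hotness must return {topic_id: hotness}.
    """
    network = build_network(posts)                     # S03
    fitness = estimate_fitness(posts, network)         # S05
    activeness = estimate_activeness(posts, network)   # S07
    hotness = estimate_hotness(posts, network, fitness, activeness)  # S09
    # Rank topics from hottest to coldest and keep the first top_n.
    ranked = sorted(hotness.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]
```

S05 and S07 depend only on S01 and S03, so in principle they could run in either order or concurrently; S09 must wait for both.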
Although several embodiments of the present invention have been described above, those skilled in the art should understand that these embodiments are merely illustrative, and that various omissions, substitutions, and changes may be made to the details of the above method and system without departing from the principle and essence of the present invention. For example, combining the above units and/or method steps so as to perform substantially the same function in substantially the same way and achieve substantially the same result falls within the scope of the present invention. The scope of the present invention is therefore limited only by the appended claims.

Claims (8)

1. An information tracking and detection method based on network-structure user behavior patterns, characterized in that the method comprises the following steps:
A. extracting network data;
B. building a network structure;
C. calculating first user pattern data according to the network data and the network structure;
D. calculating second user pattern data according to the network data and the network structure;
E. obtaining a detection result according to the network data, the network structure, the first user pattern data, and the second user pattern data.
2. The method according to claim 1, characterized in that:
in step A, web pages are fetched and stored by a web crawler unit;
the network data is analyzed and extracted by an information extraction unit.
3. The method according to claim 1, characterized in that: in step B, the network structure is built in the form of a graph, the graph is an undirected graph, and each user in the network structure corresponds to one node in the graph.
4. The method according to claim 1, characterized in that: the first user pattern data is a fitness estimate obtained by a fitness estimation algorithm, the second user pattern data is an activeness estimate obtained by an activeness estimation algorithm, and the detection result is a hotness estimate obtained by a hotness estimation algorithm.
5. An information tracking and detection system based on network-structure user behavior patterns, characterized in that the system comprises:
a web crawler unit, used to fetch and store the web pages of a target website;
an information extraction unit, used to extract the required network data;
a general data access unit, used to store the network data extracted by the information extraction unit in a database and to read the data already stored in the database;
a network construction unit, which builds a network structure from the extracted network data;
a first user pattern data estimation unit, used to estimate the first user pattern data of network nodes;
a second user pattern data estimation unit, used to estimate the second user pattern data of network nodes;
a detection result acquisition unit, used to obtain a detection result according to the network data, the network structure, the first user pattern data, and the second user pattern data.
6. The system according to claim 5, characterized in that:
the system further comprises a template management unit, used to create, modify, and delete predefined templates;
the web crawler unit accesses the target website according to a URL address, retrieves the web pages of the target website, and stores the fetched page data in the local file system;
the information extraction unit can work sequentially or in parallel with the web crawler unit; the information extraction unit matches the stored web pages against the templates predefined in the template management unit, and then obtains the required network data according to the data and data patterns that the matched template defines for extraction.
7. The system according to claim 5, characterized in that: the network construction unit builds the network structure in the form of a graph, the graph is an undirected graph, and each user in the network structure corresponds to one node in the graph.
8. The system according to claim 5, characterized in that: the first user pattern data is a fitness estimate obtained by a fitness estimation algorithm, the second user pattern data is an activeness estimate obtained by an activeness estimation algorithm, and the detection result is a hotness estimate obtained by a hotness estimation algorithm.
CNA2008102268029A 2008-11-17 2008-11-17 Information tracking and detecting method and system based on network structure user pattern of behavior Pending CN101393566A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2008102268029A CN101393566A (en) 2008-11-17 2008-11-17 Information tracking and detecting method and system based on network structure user pattern of behavior

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2008102268029A CN101393566A (en) 2008-11-17 2008-11-17 Information tracking and detecting method and system based on network structure user pattern of behavior

Publications (1)

Publication Number Publication Date
CN101393566A true CN101393566A (en) 2009-03-25

Family

ID=40493859

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008102268029A Pending CN101393566A (en) 2008-11-17 2008-11-17 Information tracking and detecting method and system based on network structure user pattern of behavior

Country Status (1)

Country Link
CN (1) CN101393566A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20090325