CN105447065A - Method for generating social media timeline structured data - Google Patents


Info

Publication number
CN105447065A
Authority
CN
China
Prior art keywords
micro
blog information
information
time
blog
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410521961.7A
Other languages
Chinese (zh)
Inventor
于程程
夏帆
钱卫宁
周傲英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University
Priority to CN201410521961.7A
Publication of CN105447065A
Legal status: Pending

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The present invention discloses a method for generating social media timeline structured data. The method comprises: building a model from the microblog messages posted and forwarded by users in social media; constructing a framework consisting of a first buffer pool and a second buffer pool, in which the model simulates the posting time of each user's next microblog message and that message is stored in the first buffer pool; removing the earliest-posted microblog message from the first buffer pool and, if it is a forwarded microblog message, obtaining the source microblog message that it forwards and the forwarding information so as to form a complete microblog message; storing the microblog message in the second buffer pool, which transfers microblog messages that fall outside a given time window to a storage device and builds an index; and deploying the framework in the social media system to build timeline structured data for the microblog messages in that system. The method generates timeline data, and the resulting microblog data stream, efficiently and at high throughput.

Description

A method for generating social media timeline structured data
Technical Field
The present invention belongs to the field of database technology and relates in particular to a method for generating social media timeline structured data.
Background
With the continuing spread of information technology through society and the development of Web 2.0 technology, social media has become increasingly popular and plays an ever more important role in daily life. At the same time, the volume of data in social media is growing exponentially.
Social media platforms have hundreds of millions of users. These users create messages (microblogs), producing large amounts of semi-structured or unstructured user-generated data; the data are therefore large in volume, complex, and unstructured. These data also give social scientists and psychologists a source for studying user behavior. Managing and mining social media data effectively is thus a challenge for both academia and industry.
Social media data are essentially a series of unstructured, time-ordered messages. In addition, because of the forwarding and reply mechanisms, messages can be linked: one message may forward or reply to another. Managing and analyzing social media is therefore a matter of processing these time-series data with a set of related, simple models.
As social media has flourished, more and more applications have appeared on social media platforms, and efficient management and analysis of social media data has become the cornerstone of a successful social media application. Benchmarking is an important means of measuring system performance. Several benchmarks already exist for comparing and evaluating these technologies or systems, such as LinkedBench and BSMA. To test these technologies more effectively, a generator is needed that can produce "realistic" data flexibly and efficiently. Such a data generator can also improve understanding of collective behavior in social media: for example, comparing generated data with real data makes it possible to check whether the assumptions behind the generator are correct.
Existing data generation techniques cannot be applied directly to social media timeline structured data, because their data models differ, comparable work lacks forwarding information, and they cannot generate timeline data flexibly. To generate "realistic" social media timeline structured data flexibly and efficiently, the present invention proposes a method for generating social media timeline structured data.
Summary of the Invention
The present invention discloses a method for generating social media timeline structured data, comprising the following steps:
Information preprocessing step: building a model from the microblog messages that a user posts and forwards in social media, the model being used to simulate the posting time of the next microblog message and to determine whether each microblog message is one that the user forwards from others or one that the user posts and others forward;
Framework construction step: constructing a framework consisting of a first buffer pool and a second buffer pool, wherein the framework uses the model to simulate the posting time of each user's next microblog message and to determine whether each microblog message is a forwarded message or a message that others forward, the first buffer pool caches the next microblog message after the current time, the second buffer pool stores the historical microblog messages before the current time, and an index of the microblog messages is built by updating and maintaining the first and second buffer pools;
Data generation step: deploying the framework in a social media system and using the framework to build timeline structured data for the microblog messages in the system;
wherein building the timeline comprises: 1) storing the next microblog message in the first buffer pool; 2) removing the microblog message with the earliest posting time from the first buffer pool and, if it is a forwarded microblog message, obtaining the source microblog message being forwarded and the forwarding information to form a complete microblog message; 3) storing the microblog message in the second buffer pool, the second buffer pool transferring microblog messages that fall outside a given time window to a storage device and building an index.
In the proposed method, the information preprocessing step uses a non-homogeneous Poisson process to simulate the posting time of the next microblog message, as follows. Step a1: count the total number of microblog messages the user posts in a time interval and derive the user's average posting rate. Step a2: divide the time interval into two or more time periods, compute the user's posting frequency parameter within each period, and record these parameters as the time adjustment function. Step a3: combine the average rate with the time adjustment function and its frequency parameters by multiplication to simulate the posting time of the next microblog message.
In the proposed method, obtaining the forwarding information during timeline construction comprises the following steps. Step b1: obtain the source microblog messages, posted by other users, that the user can forward. Step b2: narrow the source microblog messages to a time range, where the range is determined by a delay drawn from the forwarding-delay distribution by inverse transform sampling. Step b3: determine the forwarding information of the source microblog messages using the following probability:
$$P(m \to n) = \frac{D(n) + 1}{\sum_{i \in F'_u} \left( D(i) + 1 \right)}$$
where m is the microblog message posted by the user, n is a source microblog message that may be forwarded, P(m → n) is the probability that m forwards n, D(n) is the number of times the source microblog message n has been forwarded at the moment m is posted, F'_u is the set of source microblog messages that remain after the time-range restriction, and i is any microblog message in F'_u.
In the proposed method, the social media system is a file system with a distributed architecture. In the data generation step, one master node and multiple slave nodes are set up to suit the distributed file system and to produce data at high throughput. The master node assigns partitions of the social network to the slave nodes; each slave node uses the framework to build local timeline data for the microblog messages of the users in its partition; and the master node merges the local timeline data of all slave nodes to generate the timeline structured data.
In the proposed method, in the data generation step, when a microblog message on a slave node is a forwarded message whose forwarding information lies outside that node's partition, the slave node notifies the master node. The master node assigns the task of determining the forwarding information to the slave node that corresponds to it, the determined forwarding information is returned to the master node, and the master node uses it to complete the forwarded microblog message.
In the proposed method, in the data generation step, an asynchronous model is used to process the data of the distributed file system. Under the asynchronous model, when a microblog message being processed on a local slave node needs the pointer to its forwarding information to be determined on a remote node, the local slave node sends the pointer-determination task to the corresponding remote slave node. The local slave node does not block waiting for the exchange with the remote slave node but goes on to process the next microblog message.
In the proposed method, in the data generation step, a delayed update strategy is used to process the data of the distributed file system. Under the delayed update strategy, when a slave node is to determine the forwarding information of a microblog message whose posting time is later than the posting time of the next microblog message generated by the slave node, the slave node continues to generate microblog messages. Only when the posting time of that microblog message is equal to or earlier than the posting time of the next microblog message generated by the slave node does the slave node determine its forwarding information, using the single-node pointer-determination method.
In the present invention:
A complete microblog message is represented by a triple <t, u, f>, where t is the posting time of the microblog, u is its publisher, and f is a pointer. When the microblog is an original post, f is empty; when the microblog is a forward, f points to the forwarding information.
An incomplete microblog message is a microblog message <t, u> that lacks the pointer of the above triple.
A forwarded microblog message is defined as follows: if a microblog message m posted by a user forwards a microblog message n posted by someone else, then m is the forwarded microblog message and n is the source microblog message being forwarded.
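The triple and its incomplete form map naturally onto two small record types. The following is a minimal sketch only; the names `Incomplete`, `Message`, and `complete` are hypothetical and do not come from the patent, and it assumes that the pointer f refers directly to the source message.

```python
from typing import NamedTuple, Optional


class Incomplete(NamedTuple):
    """Incomplete microblog message <t, u>: the pointer is not yet known."""
    t: float  # posting time
    u: int    # publisher (user id)


class Message(NamedTuple):
    """Complete microblog message <t, u, f>."""
    t: float
    u: int
    f: Optional["Message"] = None  # None for an original post (assumed encoding)


def complete(pending: Incomplete, source: Optional[Message]) -> Message:
    """Attach the forwarding pointer to an incomplete message."""
    return Message(pending.t, pending.u, source)
```

Here an original post is simply a `Message` whose `f` is `None`, while a forward carries a reference to the message it forwards.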
The beneficial effects of the present invention are as follows. It provides a framework that generates social media timeline structured data efficiently, driven by user-configurable data parameters, and the framework can be deployed in a distributed environment to increase throughput. The posting and forwarding behavior of users is modeled by analyzing the behavioral characteristics of social media users. Timeline data are then generated from timeline parameters that are either supplied by the user or measured from real data, using the resulting generative model and the distributed generator framework; this general framework can be deployed in a distributed environment to raise data generation throughput.
Brief Description of the Drawings
Fig. 1 is a flowchart of the method of the present invention for generating social media timeline structured data.
Fig. 2 is a schematic diagram of the timeline data stream generation framework in the method.
Fig. 3 is a diagram of the distributed architecture of the method.
Fig. 4 is the communication flowchart of the master node in distributed generation.
Fig. 5 is the communication flowchart of a slave node in distributed generation.
Fig. 6 is the flowchart of local timeline generation on a slave node in distributed generation.
Fig. 7 is the flowchart of global timeline generation on the master node in distributed generation.
Detailed Description
The present invention is described in further detail below with reference to specific embodiments and the drawings. Except where specifically noted below, the procedures, conditions, and experimental methods used to implement the invention are general and common knowledge in the art, and the invention places no particular restriction on them.
As shown in Fig. 1, the method of the present invention for generating social media timeline structured data comprises the following steps:
Information preprocessing step: building a model from the microblog messages that a user posts and forwards in social media, the model being used to simulate the posting time of the next microblog message and to determine whether each microblog message is one that the user forwards from others or one that the user posts and others forward;
Framework construction step: constructing a framework consisting of a first buffer pool and a second buffer pool, wherein the first buffer pool caches the next microblog message after the current time, the framework uses the model to simulate the posting time of each user's next microblog message and to determine whether each microblog message is a forwarded message or a message that others forward, the second buffer pool stores the historical microblog messages before the current time, and an index of the microblog messages is built by updating and maintaining the first and second buffer pools;
Data generation step: deploying the framework in a social media system and using the framework to build timeline structured data for the microblog messages in the system;
wherein building the timeline comprises: 1) storing the next microblog message in the first buffer pool; 2) removing the microblog message with the earliest posting time from the first buffer pool and, if it is a forwarded microblog message, obtaining the source microblog message being forwarded and the forwarding information to form a complete microblog message; 3) storing the microblog message in the second buffer pool, the second buffer pool transferring microblog messages that fall outside a given time window to a storage device and building an index.
The technical content of the invention is explained further below through its detailed implementation steps.
(1) Generating microblog messages
The present invention uses a non-homogeneous Poisson process to simulate the way users post microblog messages. This part determines the time point t and the author u of each microblog message.
The posting activity of each user is modeled as a non-homogeneous Poisson process {N(t, u), t ≥ 0}, with a different intensity function for each user, where N(t, u) is the number of microblog messages that user u has posted before time t. The intensity function of user u is defined as:
$$\lambda_u(t) = \lambda_u \times f(t)$$
The intensity function $\lambda_u(t)$ is determined by two factors: 1) the base posting rate $\lambda_u$ of each user, and 2) the time adjustment function $f(t)$. $\lambda_u$ is the average number of messages that user u posts per second, and each user has their own $\lambda_u$. $f(t)$ is the coefficient that adjusts $\lambda_u$ for different time periods; that is, a user's posting frequency varies over time. The time adjustment function is defined as:
$$f(t) = D_t \times H_t$$
where $D_t$ is the day coefficient at time t, which takes one of 7 values, one for each day of the week, and $H_t$ is the hour coefficient at time t, which takes one of 24 values, one for each hour of the day. Observation of posting frequency in real user data shows that it varies periodically. This model captures in a simple way how a user's posting frequency changes with the day of the week and the hour of the day.
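As an illustration of how the base rate and the day and hour coefficients might be obtained from observed posting times, the sketch below counts posts per weekday and per hour and normalises the counts so that 1.0 means average activity. The normalisation, the function names, and the time conventions (seconds, with the interval starting at midnight on day index 0 and spanning whole weeks) are assumptions made for the example; the patent does not specify them.

```python
from collections import Counter

HOUR = 3600
DAY = 24 * HOUR


def estimate_parameters(timestamps, start, end):
    """Estimate (lambda_u, D, H) from one user's posting times in [start, end).

    Assumptions (not from the source): times are in seconds, `start` is
    midnight on day index 0, and the interval spans whole weeks.
    """
    n = len(timestamps)
    if n == 0:
        raise ValueError("need at least one posting time")
    base_rate = n / (end - start)  # lambda_u: average posts per second
    day_counts = Counter(int((t - start) // DAY) % 7 for t in timestamps)
    hour_counts = Counter(int((t - start) // HOUR) % 24 for t in timestamps)
    # Normalise so that 1.0 means "average activity" for a day or an hour.
    day_coef = [day_counts[d] * 7 / n for d in range(7)]
    hour_coef = [hour_counts[h] * 24 / n for h in range(24)]
    return base_rate, day_coef, hour_coef


def intensity(t, start, base_rate, day_coef, hour_coef):
    """Evaluate lambda_u(t) = lambda_u * D_t * H_t."""
    d = int((t - start) // DAY) % 7
    h = int((t - start) // HOUR) % 24
    return base_rate * day_coef[d] * hour_coef[h]
```

With these conventions, a user who posts mostly on one weekday gets a day coefficient above 1.0 for that day and below 1.0 for the others.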
A thinning algorithm is used to simulate the non-homogeneous Poisson process; it generates a non-homogeneous Poisson process from time samples of an ordinary (homogeneous) Poisson process. Suppose there is a rate $\lambda_I$ such that $\lambda_I \ge \lambda_u(t)$ for all t. Candidate posting times are generated at random from a Poisson process with rate $\lambda_I$, and each candidate time is accepted as a posting time of user u in the non-homogeneous Poisson process with probability $\lambda_u(t) / \lambda_I$. For the proposed model:
$$\lambda_I = \lambda_u \times \mathrm{MAX}(D_t) \times \mathrm{MAX}(H_t)$$
where $\mathrm{MAX}(D_t)$ and $\mathrm{MAX}(H_t)$ are the largest coefficient values of $D_t$ and $H_t$, respectively.
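The acceptance-rejection procedure described above (commonly called thinning) can be sketched as follows. This is an illustration rather than the patent's own pseudocode: the function name `next_post_time`, the use of seconds since an arbitrary Monday midnight, and the `rng` parameter are assumptions.

```python
import random

HOUR = 3600
DAY = 24 * HOUR


def next_post_time(t_now, base_rate, day_coef, hour_coef, rng=random):
    """Draw the next posting time after t_now by thinning.

    base_rate is lambda_u (posts per second); day_coef has 7 entries and
    hour_coef has 24. Time is in seconds from an assumed day-0 midnight.
    """
    lam_max = base_rate * max(day_coef) * max(hour_coef)  # lambda_I
    if lam_max <= 0:
        raise ValueError("intensity must be positive somewhere")
    t = t_now
    while True:
        # Candidate from the homogeneous process with rate lambda_I.
        t += rng.expovariate(lam_max)
        d = int(t // DAY) % 7
        h = int(t // HOUR) % 24
        lam_t = base_rate * day_coef[d] * hour_coef[h]  # lambda_u(t)
        # Accept the candidate with probability lambda_u(t) / lambda_I.
        if rng.random() * lam_max < lam_t:
            return t
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when generated data are compared with real data.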
With this method, the time point of each user's next microblog message can be generated from the time point of the message that the user is currently generating. The pseudocode of the algorithm is shown in Table 1.
Table 1. Pseudocode for simulating microblog message posting times
(2) Generating forwarding information
The present invention uses the distribution of the number of times a microblog message is forwarded (usually a power-law distribution) and a time delay function to choose the forwarded microblog message from among previously posted microblog messages. This part determines the forwarding information in each microblog message entry.
After a microblog message has been generated by the method above, it must be decided whether the message is a forwarded microblog message and, if it is, its forwarding information must be determined. Whether a microblog message is a forwarded message is decided using the forwarding probability $rp_u$ assigned to each user. Each user's posting rate and forwarding probability can be generated by inverse transform sampling from the cumulative joint probability distribution of $\lambda_u$ and $rp_u$.
When a microblog message is judged to be a forwarded microblog message, its forwarding information must be determined. In social media, users usually forward microblog messages from the timeline feed of the users they follow; the feed of user u is denoted $F_u$. In most social media, $F_u$ has a fixed length L, which means that very old messages cannot be forwarded because newer messages push them out of $F_u$.
To implement this forwarding mechanism, the forwarding information is determined as follows:
1. Obtain $F_u$, the source microblog messages that user u can forward.
2. Narrow $F_u$ to a time range to obtain $F'_u$. The time range is determined by a delay drawn from the forwarding-delay distribution by inverse transform sampling.
3. Determine the forwarding information of microblog message m using the following probability:
$$P(m \to n) = \frac{D(n) + 1}{\sum_{i \in F'_u} \left( D(i) + 1 \right)}$$
where m is the microblog message posted by the user, n is a source microblog message that may be forwarded, P(m → n) is the probability that m forwards n, D(n) is the number of times the source microblog message n has been forwarded at the moment m is posted, $F'_u$ is the set of source microblog messages that remain after the time-range restriction, and i is any microblog message in $F'_u$.
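The three steps can be sketched as below. Several details are assumptions made for the example and are not stated in the source: the delay distribution is given as a tabulated CDF, the time range is taken to be the `delay` seconds immediately before the posting time of m, feed entries are `(time, message_id)` pairs with unique ids, and all function names are hypothetical.

```python
import random


def sample_delay(delay_cdf, rng=random):
    """Inverse transform sampling from a tabulated CDF.

    delay_cdf: list of (delay_seconds, cumulative_probability), ascending.
    """
    x = rng.random()
    for delay, cumulative in delay_cdf:
        if x <= cumulative:
            return delay
    return delay_cdf[-1][0]


def forward_probabilities(candidates, forward_count):
    """P(m -> n) = (D(n) + 1) / sum over i in F'_u of (D(i) + 1)."""
    weights = [forward_count.get(n, 0) + 1 for n in candidates]
    total = sum(weights)
    return {n: w / total for n, w in zip(candidates, weights)}


def choose_source(t_m, feed, forward_count, delay_cdf, rng=random):
    """Pick the source message for a forward posted at time t_m."""
    delay = sample_delay(delay_cdf, rng)
    # Step 2: keep only feed entries inside the assumed time range.
    candidates = [n for t, n in feed if t_m - delay <= t < t_m]
    if not candidates:
        return None
    # Step 3: choose a source in proportion to D(n) + 1.
    probs = forward_probabilities(candidates, forward_count)
    weights = [probs[n] for n in candidates]
    source = rng.choices(candidates, weights=weights, k=1)[0]
    forward_count[source] = forward_count.get(source, 0) + 1
    return source
```

Adding 1 to each forward count gives messages that have never been forwarded a non-zero chance, while already popular messages are favoured, which is what pushes the forward counts toward a heavy-tailed distribution.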
(3) Building the framework that generates timeline structured data
The present invention continuously outputs time-ordered microblog messages. A framework built from two buffer pools, the first buffer pool NextTweet and the second buffer pool RecentTweet, produces this time-ordered output efficiently and thereby builds the timeline structured data.
NextTweet buffer pool: this pool stores, for each user, the next microblog message that the user will post after the current time, without forwarding information. Algorithm 1 yields the posting time of a user's next microblog message from the time of that user's current message, and the pair <t, u> of each user is stored in this pool. All entries in the pool are kept sorted by time.
RecentTweet buffer pool: this pool stores the recent historical microblog messages generated before the current time. Its size is controlled by a given window $t_w$: microblog messages whose posting time is earlier than $t - t_w$ are removed and flushed to secondary storage, where they are indexed by posting time.
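A minimal sketch of the RecentTweet pool with its time window is given below. The class layout is an assumption: secondary storage is stood in for by an in-memory list, the index is a sorted list of posting times searched with `bisect`, and the method names are hypothetical.

```python
import bisect
from collections import deque


class RecentTweet:
    """Sliding-window pool of recent messages (t, u, f), oldest first."""

    def __init__(self, window):
        self.window = window  # t_w
        self.pool = deque()   # messages still inside the window
        self.storage = []     # stand-in for secondary storage
        self.index = []       # posting times of stored messages, ascending

    def add(self, message):
        """Insert a message and flush everything older than t - t_w."""
        t = message[0]
        self.pool.append(message)
        while self.pool and self.pool[0][0] < t - self.window:
            old = self.pool.popleft()
            self.storage.append(old)
            self.index.append(old[0])

    def stored_between(self, t_from, t_to):
        """Look up flushed messages by posting time through the index."""
        lo = bisect.bisect_left(self.index, t_from)
        hi = bisect.bisect_right(self.index, t_to)
        return self.storage[lo:hi]
```

Because messages arrive in time order, both the pool and the stored list stay sorted without any explicit sorting step.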
Generating the timeline data stream is a process of continually updating these two buffer pools. Before data are generated, the start and end times of the timeline and the number of users are specified. Algorithm 2, shown in Table 2, first initializes the NextTweet buffer pool by inserting the first microblog message of each user into NextTweet.
Table 2. Pseudocode for initializing the first buffer pool
The pseudocode that iteratively updates the two buffer pools and generates the timeline data stream is shown as Algorithm 3 in Table 3. The update repeatedly removes the first microblog message from the NextTweet buffer pool and decides whether it is a forwarded message. If it is, its forwarding information is determined from the social network information and the historical microblog data; the complete microblog message is then inserted into the RecentTweet buffer pool. The next microblog message to be posted by the same publisher is inserted into NextTweet, and the microblog messages in RecentTweet that fall outside the window are removed, written to secondary storage, and indexed. This process is iterated until NextTweet is empty.
Table 3. Pseudocode for updating the two buffer pools to generate timeline data
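The initialization and update loop described above can be sketched as follows. This is an illustration, not the patent's own pseudocode. NextTweet is modelled as a min-heap of `(t, u)` pairs, and the three callbacks `next_time`, `is_forward`, and `pick_source` are hypothetical placeholders for the posting-time model, the forwarding decision, and the source selection.

```python
import heapq
from collections import deque


def generate_timeline(users, t_start, t_end, window,
                      next_time, is_forward, pick_source):
    """Return all generated messages (t, u, f) in posting-time order."""
    # Initialisation: one pending <t, u> per user in NextTweet.
    next_tweet = []
    for u in users:
        t = next_time(u, t_start)
        if t < t_end:
            heapq.heappush(next_tweet, (t, u))

    recent = deque()  # RecentTweet: messages inside the window
    stored = []       # stand-in for secondary storage
    while next_tweet:
        # Remove the earliest pending message.
        t, u = heapq.heappop(next_tweet)
        f = pick_source(u, t, recent) if is_forward(u) else None
        recent.append((t, u, f))
        # Flush messages that have fallen outside the window.
        while recent and recent[0][0] < t - window:
            stored.append(recent.popleft())
        # Schedule the same publisher's next message.
        t_next = next_time(u, t)
        if t_next < t_end:
            heapq.heappush(next_tweet, (t_next, u))

    stored.extend(recent)  # flush what is left at the end
    return stored
```

A heap keeps NextTweet ordered by time at logarithmic cost per update, which is one simple way to satisfy the requirement that the pool stay sorted.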
The framework that generates the data stream of all users' microblog posts is described below with reference to Fig. 2. First, <m_1, m_2, m_n> in the figure are inserted into NextTweet, completing the initialization of Algorithm 2. The two buffer pools are then updated continually according to Algorithm 3 to generate the timeline data stream. For example, when data generation has reached time t_1, all round dots to the right of t_1 represent the entries held in NextTweet at that moment, and all square dots between t_1' and t_1 are the messages stored in RecentTweet. Message m_4 is removed from NextTweet and, once its forwarding information has been determined, the complete microblog message m_4 is inserted into RecentTweet. The generated data have now reached time t_2, and RecentTweet must hold the messages between t_2' and t_2. The diagonally hatched portion, that is, the messages that fall outside the window, is therefore removed from RecentTweet and written to secondary storage. The next microblog message <t_m5, u_3> to be posted by user u_3 is then inserted into NextTweet. These operations are repeated until NextTweet is empty and all data have been generated.
(4) Generating timeline data in a distributed system
When timeline data are generated, determining forwarding information requires frequent access to the social network and frequent queries against the RecentTweet buffer pool. When the social network is very large, access to it and the capacity of RecentTweet become the bottlenecks to producing data at high throughput. The present invention therefore provides a distributed generation method to solve these problems.
Fig. 3 shows the distributed architecture of the timeline generator. It consists of one master node and multiple slave nodes. The master node splits the social network according to the number of slave nodes and assigns one partition to each slave node. Each slave node generates the microblog messages of the users in its social network partition using the timeline data stream generation framework, producing a local timeline. The master node then merges the local timelines from the different slave nodes into a global timeline, which is the final output. Distributed generation of timeline data differs from single-machine generation in the following respects:
1. Each slave node stores only one social network partition. Partitions do not overlap, but they are still connected to one another. When forwarding information is generated, if all the users that a user follows are in the same partition, the slave node determines the forwarding information exactly as on a single machine. If even one followed user is in another partition, however, the local slave node sends the query that determines the forwarding information to a remote slave node, and the determination of that forwarding information is delayed.
2. On each slave node, the social network partition is held in memory. The present invention simply partitions the social network with an existing graph partitioning method and uses only the partitioning result.
3. The master node merges the local timelines from the different slave nodes into a global timeline, that is, the timeline structured data.
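The merge in item 3 is a k-way merge of lists that are already sorted by posting time. The sketch below uses Python's `heapq.merge`; the function name `merge_timelines` and the `(t, u, f)` tuple layout are assumptions made for the example.

```python
import heapq


def merge_timelines(local_timelines):
    """Merge per-slave timelines, each sorted by posting time, into one.

    Each message is a tuple (t, u, f); only t is used for ordering.
    """
    return list(heapq.merge(*local_timelines, key=lambda m: m[0]))
```

Because each local timeline is already ordered, the merge only ever compares the current head of each list, so the master node does not need to hold or re-sort the full data set.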
When forwarding information is determined in a distributed environment, some microblog messages on one slave node may forward microblog messages on other slave nodes. Slave nodes with connected partitions therefore need to exchange information; in other words, a slave node may need to query the historical microblog messages on other slave nodes to determine the forwarding information of a locally generated microblog message.
Generating forwarded microblog messages amounts to determining the forwarding information of each such message. If the $F_u$ of user u lies entirely within one social network partition, the corresponding slave node determines the user's forwarding information exactly as in the single-node case. If $F_u$ is spread over two or more partitions, one slave node is chosen and the task of determining the forwarding information is sent to it; that slave node determines the result from its relevant historical microblog messages and then sends the result back to the master node. A slave node is chosen as follows. By derivation, the probability that user u selects slave node s to determine the pointer f of a forwarded microblog message generated by u is:
$$\mathrm{pick}(u, s) = \frac{\sum_{i \in F(u, s)} \lambda_i}{\sum_{j \in F(u)} \lambda_j}$$
where F(u, s) is the set of users followed by u that reside on s, and $\lambda_i$ and $\lambda_j$ are the average per-second posting rates of users i and j.
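The selection probability can be computed directly from the posting rates of the users that u follows. In the sketch below, F(u) is assumed to be the set of all users followed by u (the source does not define it explicitly), and the names `pick_probabilities`, `pick_slave`, `partition_of`, and `rate` are hypothetical.

```python
import random


def pick_probabilities(followees, partition_of, rate):
    """Return {slave: pick(u, s)} for the users that u follows.

    followees: users followed by u (assumed to be F(u)).
    partition_of: user -> slave node holding that user's partition.
    rate: user -> average posts per second (lambda).
    """
    total = sum(rate[v] for v in followees)
    probs = {}
    for v in followees:
        s = partition_of[v]
        probs[s] = probs.get(s, 0.0) + rate[v] / total
    return probs


def pick_slave(followees, partition_of, rate, rng=random):
    """Sample one slave node according to pick(u, s)."""
    probs = pick_probabilities(followees, partition_of, rate)
    slaves = list(probs)
    weights = [probs[s] for s in slaves]
    return rng.choices(slaves, weights=weights, k=1)[0]
```

A slave that holds the more active followed users is chosen more often, which matches the intuition that the message being forwarded is more likely to have been posted there.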
The distributed data generation method also uses an asynchronous model and a delayed update strategy.
Asynchronous model: when a forwarded microblog message needs its pointer f to be determined remotely, the local node sends the task to the corresponding remote slave node and then continues to generate new microblog messages without further interaction with the remote node.
Delayed update strategy: on each slave node, the forwarding-determination tasks received from remote nodes are placed in a queue and processed in the order received. When the forwarding information of a microblog message m is to be determined, if the posting time $t_m$ of m is later than the posting time $t_{m'}$ of the latest microblog message m' generated by this node (that is, $t_m > t_{m'}$), the task waits while the node continues to generate new microblog messages; only when $t_m \le t_{m'}$ is the task resolved with the single-node pointer-determination method. This waiting delays the update of the forwarded counts of some microblog messages, but experiments show that the strategy does not affect the distribution of the generated data.
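The delayed update strategy can be pictured as a first-in, first-out queue whose head is released only once local generation has caught up with the task's posting time. The sketch below is one possible reading; the class name `DelayedTaskQueue` and its methods are hypothetical, and a task is reduced to a `(t_m, task_id)` pair.

```python
from collections import deque


class DelayedTaskQueue:
    """Remote forwarding tasks held until local generation catches up."""

    def __init__(self):
        self.tasks = deque()               # (t_m, task_id) in arrival order
        self.latest_local = float("-inf")  # posting time of the latest local message

    def receive(self, t_m, task_id):
        """Queue a task sent by a remote node."""
        self.tasks.append((t_m, task_id))

    def advance(self, t_local):
        """Record a newly generated local message and return the ready tasks.

        Tasks are released in arrival order; the head of the queue blocks
        later tasks until its posting time is <= the local time.
        """
        self.latest_local = t_local
        ready = []
        while self.tasks and self.tasks[0][0] <= self.latest_local:
            ready.append(self.tasks.popleft())
        return ready
```

Each released task would then be resolved with the single-node source selection, using only the history the node has already generated.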
Fig. 4 shows the communication flowchart of the master node. The master node first establishes a connection with each slave node and communicates with each one separately. It then specifies the generation parameters and the social-network partitions and sends them to the slave nodes. Once a slave node has received this information, it can start generating its local timeline data, and it sends various messages to the master node while doing so; the master node therefore keeps listening to every slave node and receives their messages in its communication processing part. The master node runs one listener thread per slave node to receive and process that node's messages. When the received message is a complete microblog message, it is stored in order so that it can later be merged into the final global timeline. When it is a forwarding task, the master node first chooses a slave node that can solve the task, sends the task to the chosen node, and then records the task as an incomplete microblog message, together with the complete microblog messages, in the microblog message list of the originating slave node in order of publication time; the list is kept in time order so that the master node can read the local timeline data of each slave node sequentially. When it is a forwarding result, the master node records the result. In other words, the master node passes forwarding tasks from each slave node on to other slave nodes, and once those nodes have processed the tasks, they send the results back to the master node. When all of the above messages from a slave node have been received, the receiving task for that slave node is complete.
Fig. 5 shows the communication flowchart of a slave node. The slave node first establishes a connection with the master node, then receives the parameters and data transmitted by the remote master node and initializes the parameters it needs to produce microblog data locally. The slave node then starts listening for messages from the master node. When the received message is a forwarding task, the slave node determines the forwarding information using the forwarding generation model and its local data, sends the task result back to the master node, and continues listening. If the received message is not a forwarding task but a notice that all forwarding tasks have been sent, the local receiving communication module terminates; otherwise it continues listening.
Fig. 6 shows the flowchart by which a slave node produces local timeline data. First, the first microblog message that each user will publish after the current time is placed in a buffer pool NextTweet. The node then loops: it extracts the microblog message with the smallest time from this set and determines its forwarding status (whether it is a forwarding microblog message and, if so, whether its corresponding forwarding information is stored in the local storage device), sends the corresponding information to the master node, and records the microblog message in the buffer pool RecentTweet. The system then generates the next microblog message of that message's publisher; if the time of this next message is earlier than the end time of the timeline to be generated, the new message is added to NextTweet. Historical data that falls outside the given window is then removed from RecentTweet, and the next iteration begins. The loop ends when NextTweet is empty, at which point all local timeline data has been generated.
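The loop of Fig. 6 can be sketched in Python as follows. This is a simplified illustration, not the patented implementation: inter-arrival times are drawn from a homogeneous exponential distribution rather than the non-homogeneous Poisson process of the method, the forwarding decision and the message to the master node are reduced to appending to a list, and the names `generate_local_timeline`, `rate`, `end_time` and `window` are hypothetical.

```python
import heapq
import random
from collections import deque
from typing import Deque, Dict, List, Optional, Sequence, Tuple

Tweet = Tuple[float, str]  # (publication time, user)


def generate_local_timeline(users: Sequence[str],
                            rate: Dict[str, float],
                            end_time: float,
                            window: float,
                            rng: Optional[random.Random] = None,
                            start_time: float = 0.0
                            ) -> Tuple[List[Tweet], List[Tweet]]:
    """Return (local timeline, final contents of RecentTweet)."""
    rng = rng or random.Random()
    next_tweet: List[Tweet] = []  # NextTweet: min-heap ordered by time
    for u in users:
        t = start_time + rng.expovariate(rate[u])
        if t < end_time:
            heapq.heappush(next_tweet, (t, u))

    recent_tweet: Deque[Tweet] = deque()  # RecentTweet: sliding history window
    timeline: List[Tweet] = []
    while next_tweet:
        # Extract the microblog message with the smallest time.
        t, u = heapq.heappop(next_tweet)
        timeline.append((t, u))       # stands in for "send to the master node"
        recent_tweet.append((t, u))
        # Generate the publisher's next message; keep it only if it falls
        # before the end time of the timeline.
        t_next = t + rng.expovariate(rate[u])
        if t_next < end_time:
            heapq.heappush(next_tweet, (t_next, u))
        # Remove history that has fallen outside the given window.
        while recent_tweet and recent_tweet[0][0] < t - window:
            recent_tweet.popleft()
    return timeline, list(recent_tweet)
```

Keeping exactly one pending message per user in the heap bounds NextTweet by the number of users, while the window bounds RecentTweet, so memory use does not grow with the length of the timeline.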
Fig. 7 is the flowchart by which the master node generates the global timeline. First, the first entry of each slave node's microblog message list is taken to form a set. If this set is empty, the global timeline has been fully generated. Otherwise, the microblog message with the smallest time is removed from the set. If that message is incomplete, meaning that it is a forwarding microblog message whose forwarding information has not yet been determined, the corresponding forwarding information is looked up in the forwarding task results recorded from the slave nodes, the message is completed, and it is then written to the storage device; if the message is already complete, it is written directly to the storage device. These operations are repeated until all local timelines have been consumed.
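The merge of Fig. 7 amounts to a k-way merge over time-ordered lists, with incomplete forwarding messages completed from the recorded task results. The following Python sketch illustrates this under assumed data shapes: each record is a hypothetical tuple `(time, msg_id, pointer, is_forward)`, `forward_results` maps a message id to its resolved pointer, and the returned list stands in for writing to the storage device.

```python
import heapq
from typing import Dict, Iterator, List, Optional, Sequence, Tuple

Record = Tuple[float, str, Optional[str], bool]  # (time, msg_id, pointer, is_forward)


def merge_global_timeline(local_lists: Sequence[Sequence[Record]],
                          forward_results: Dict[str, str]
                          ) -> List[Tuple[float, str, Optional[str]]]:
    """Merge per-slave lists (each sorted by time) into one global timeline."""
    iterators: List[Iterator[Record]] = [iter(lst) for lst in local_lists]
    heap: List[Tuple[float, int, Record]] = []
    # Form the initial set from the first entry of every slave list.
    for idx, it in enumerate(iterators):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first[0], idx, first))

    out: List[Tuple[float, str, Optional[str]]] = []
    while heap:
        # Remove the message with the smallest time from the set.
        _, idx, (t, msg_id, pointer, is_forward) = heapq.heappop(heap)
        if is_forward and pointer is None:
            # Incomplete message: complete it from the recorded task results.
            pointer = forward_results[msg_id]
        out.append((t, msg_id, pointer))  # stands in for "write to storage"
        # Refill the set from the list the message came from.
        nxt = next(iterators[idx], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt[0], idx, nxt))
    return out
```

For example, merging `[(1.0, "a1", None, False), (4.0, "a2", None, True)]` with `[(2.0, "b1", "a1", True), (3.0, "b2", None, False)]` and the result map `{"a2": "b2"}` yields the four messages in time order, with `a2` completed to point at `b2`.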
The scope of protection of the present invention is not limited to the above embodiments. Variations and advantages that those skilled in the art can conceive without departing from the spirit and scope of the inventive concept are all included in the present invention, and the scope of protection is defined by the appended claims.

Claims (7)

1. A method for generating social media timeline structured data, characterized in that it comprises the following steps:
Information preprocessing step: building a model from the microblog messages that a user publishes and forwards in social media, the model being used to simulate the publication time of the next microblog message and to determine whether each microblog message is a microblog message forwarded by the user or a microblog message published by the user and forwarded by others;
Framework building step: building a framework consisting of a first buffer pool and a second buffer pool, the framework using the model to simulate the publication time of the next microblog message of each user and to determine whether each microblog message is a forwarding microblog message or a microblog message forwarded by others, the first buffer pool being used to cache the next microblog messages after the current time, the second buffer pool being used to store the historical microblog messages before the current time, and an index of the microblog messages being built by updating and maintaining the first buffer pool and the second buffer pool;
Data generation step: configuring the framework in a social media system and using the framework to build timeline structured data for the microblog messages in the system;
wherein the timeline construction step comprises:
1) storing the next microblog message in the first buffer pool;
2) removing the microblog message with the earliest publication time from the first buffer pool and, if that microblog message is a forwarding microblog message, obtaining the forwarded source microblog message and the forwarding information to form a complete microblog message;
3) storing the microblog message in the second buffer pool, the second buffer pool transferring the microblog messages that fall outside a given time window to a storage device and building an index.
2. The method for generating social media timeline structured data as claimed in claim 1, characterized in that, in the information preprocessing step, a non-homogeneous Poisson process is used to simulate the publication time of the next microblog message, the non-homogeneous Poisson process comprising the following steps:
Step a1: counting the total number of microblog messages published by the user in a time interval and deriving the mean rate at which the user publishes microblog messages;
Step a2: dividing the time interval into two or more time periods and computing the frequency parameter of the user's microblog publications within each time period, recorded as a time adjustment function;
Step a3: combining the mean rate with the time adjustment function and its frequency parameters, and simulating the publication time of the next microblog message by multiplication.
3. The method for generating social media timeline structured data as claimed in claim 1, characterized in that, in the timeline construction step, obtaining the forwarding information comprises the following steps:
Step b1: obtaining the source microblog messages, published by others, that the user forwards;
Step b2: setting an initial time range and using the time range to narrow down the source microblog messages, a delay being determined by inverse transformation according to the forwarding delay distribution, thereby determining the time range;
Step b3: determining the forwarding information of the source microblog message using a probability value, the probability value being expressed by the following formula:
$$P(m \rightarrow n) = \frac{D(n) + 1}{\sum_{i \in F'_u} \left( D(i) + 1 \right)}$$
where m denotes the microblog message published by the user, n denotes the forwarded source microblog message, P(m → n) is the probability that m forwards n, D(n) denotes the number of times the source microblog message n has been forwarded at the time the user's microblog message m is published, F'_u denotes the source microblog messages remaining after narrowing by the time range, and i denotes any microblog message in F'_u.
4. The method for generating social media timeline structured data as claimed in claim 1, characterized in that the social media system is a file system with a distributed architecture, and in the data generation step a master node and multiple slave nodes are set up to suit the distributed file system and to produce data at high throughput;
the master node is used to assign the partitions of the social media to the slave nodes, each slave node uses the framework to build local timeline data for the microblog messages of the users in its partition, and the master node generates the timeline structured data by merging the local timeline data of the slave nodes.
5. The method for generating social media timeline structured data as claimed in claim 4, characterized in that, in the data generation step, when a microblog message on a slave node is a forwarding microblog message and its forwarding information is not in the slave node's own partition, the master node is notified, the master node designates the slave node corresponding to the forwarding information to carry out the task of determining the forwarding information, the determined forwarding information is returned to the master node, and the master node uses the forwarding information to complete the forwarding microblog message.
6. The method for generating social media timeline structured data as claimed in claim 4, characterized in that, in the data generation step, an asynchronous model is used to process the data of the distributed file system, the asynchronous model meaning that: when a microblog message processed by the local slave node needs its forwarding information to be determined from another node, the forwarding information is expressed as a pointer, the local slave node sends the task of determining the pointer to the corresponding remote slave node, and the local slave node does not pause to wait for data exchange with the remote slave node but continues to process the next microblog message.
7. The method for generating social media timeline structured data as claimed in claim 4 or 6, characterized in that, in the data generation step, a delayed update strategy is used to process the data of the distributed file system, the delayed update strategy meaning that: when a slave node is to determine the forwarding information of a microblog message and the publication time of that microblog message is later than the publication time of the next microblog message produced by the slave node, the slave node continues to produce microblog messages until the publication time of that microblog message is equal to or earlier than the publication time of the next microblog message produced by the slave node, and only then does the slave node determine the forwarding information of the microblog message using the single-node pointer determination method.
CN201410521961.7A 2014-09-30 2014-09-30 Method for generating social media timeline structured data Pending CN105447065A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410521961.7A CN105447065A (en) 2014-09-30 2014-09-30 Method for generating social media timeline structured data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410521961.7A CN105447065A (en) 2014-09-30 2014-09-30 Method for generating social media timeline structured data

Publications (1)

Publication Number Publication Date
CN105447065A true CN105447065A (en) 2016-03-30

Family

ID=55557247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410521961.7A Pending CN105447065A (en) 2014-09-30 2014-09-30 Method for generating social media timeline structured data

Country Status (1)

Country Link
CN (1) CN105447065A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106157155A (en) * 2016-07-27 2016-11-23 北京大学 Social media information based on map metaphor propagates visual analysis method and system
CN112000709A (en) * 2020-07-17 2020-11-27 微梦创科网络科技(中国)有限公司 Method and device for batch mining of total exposure of social media information
CN112347056A (en) * 2021-01-08 2021-02-09 北京东方通软件有限公司 Automatic file generation method based on time axis

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102637182A (en) * 2011-02-15 2012-08-15 北京大学 Method for analyzing interactive evolution of core user information of Web social network
US20140236931A1 (en) * 2013-11-19 2014-08-21 Share Rocket, Inc. Systems and Methods for Simultaneous Display of Related Social Media Analysis Within a Time Frame

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHENGCHENG YU et al.: "BSMA-Gen: A Parallel Synthetic Data Generator for Social Media Timeline Structures", DASFAA 2014: DATABASE SYSTEMS FOR ADVANCED APPLICATIONS *
CHENGCHENG YU et al.: "On efficiently generating realistic social media timeline structures", SSDBM 14 PROCEEDINGS OF THE 26TH INTERNATIONAL CONFERENCE ON SCIENTIFIC AND STATISTICAL DATABASE MANAGEMENT *
FAN XIA et al.: "BSMA: a benchmark for analytical queries over social media data", PROCEEDINGS OF THE VLDB ENDOWMENT *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106157155A (en) * 2016-07-27 2016-11-23 北京大学 Social media information based on map metaphor propagates visual analysis method and system
CN106157155B (en) * 2016-07-27 2022-07-19 北京大学 Social media information propagation visualization analysis method and system based on map metaphor
CN112000709A (en) * 2020-07-17 2020-11-27 微梦创科网络科技(中国)有限公司 Method and device for batch mining of total exposure of social media information
CN112000709B (en) * 2020-07-17 2023-10-24 微梦创科网络科技(中国)有限公司 Social media information total exposure batch mining method and device
CN112347056A (en) * 2021-01-08 2021-02-09 北京东方通软件有限公司 Automatic file generation method based on time axis
CN112347056B (en) * 2021-01-08 2021-07-02 北京东方通软件有限公司 Automatic file generation method based on time axis

Similar Documents

Publication Publication Date Title
CN107038162B (en) Real-time data query method and system based on database log
CN106156810B (en) General-purpose machinery learning algorithm model training method, system and calculate node
CN104969213B (en) Data flow for low latency data access is split
CN105550225B (en) Index structuring method, querying method and device
CN104090901B (en) A kind of method that data are processed, device and server
CN105608194A (en) Method for analyzing main characteristics in social media
CN107038222A (en) Database caches implementation method and its system
CN104021205B (en) Method and device for establishing microblog index
CN104317789A (en) Method for building passenger social network
CN108363643A (en) A kind of HDFS copy management methods based on file access temperature
CN104216889B (en) Data dissemination analyzing and predicting method and system based on cloud service
CN110020046B (en) Data capturing method and device
CN105760279A (en) Method and system for generating fault early warning relevance tree of distributed database cluster
CN105447065A (en) Method for generating social media timeline structured data
JP2010514033A5 (en)
CN105208093B (en) The structure system of resource pool is calculated in a kind of cloud operating system
CN110737432B (en) Script aided design method and device based on root list
CN101635001B (en) Method and apparatus for extracting information from a database
CN104036039A (en) Parallel processing method and system of data
CN110110863A (en) A kind of distributed machines study tune ginseng system based on celery
CN106990913B (en) A kind of distributed approach of extensive streaming collective data
CN105426407A (en) Web data acquisition method based on content analysis
US8392466B2 (en) Method and apparatus for automated processing of a data stream
Barros A modular representation of fluid stochastic petri nets.
CN104932982B (en) A kind of Compilation Method and relevant apparatus of message memory access

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160330