Disclosure of Invention
The microblog information capturing method and device of the invention aim to obtain as much effective microblog information as possible with limited capture resources.
To this end, the embodiments of the invention provide the following technical solutions:
A microblog information capturing method comprises the following steps:
acquiring microblog users to be captured, and judging the type of each microblog user to be captured;
if the microblog user to be captured is an active user, calculating a capture period for the microblog user to be captured, and predicting capture time points according to the capture period so as to capture microblog information;
and if the microblog user to be captured is an inactive user, acquiring the capture state of the microblog user to be captured and the remaining capture user amount, and, if the capture state indicates that microblog information can be captured and the remaining capture user amount is not zero, capturing the microblog information of the microblog user to be captured.
Preferably, the acquiring the microblog user to be captured includes:
selecting at least one authenticated user as a seed user, and adding the seed user as an unprocessed user to a user list;
judging whether the unprocessed user has a subordinate user:
if yes, acquiring a subordinate user of the unprocessed user, adding the subordinate user to the user list, and setting the state of the unprocessed user as processed; taking the subordinate user as an unprocessed user, and continuing to execute the step of judging whether the unprocessed user has the subordinate user;
if not, the status of the unprocessed user is set to processed.
Preferably, the acquiring the subordinate user of the unprocessed user includes:
acquiring the subordinate user through the user relationship network of the unprocessed user; or,
capturing users who comment on and/or forward the microblogs posted by the unprocessed user, and taking these users as the subordinate users.
Preferably, the determining the type of the microblog user to be captured includes:
determining the user activity according to the frequency with which the microblog user to be captured posts microblogs;
judging the type of the microblog user to be captured according to a preset active value and the user activity, and if the user activity is not smaller than the preset active value, judging that the microblog user to be captured is an active user; otherwise, judging that the microblog user to be captured is an inactive user.
Preferably, the determining the user activity according to the frequency with which the microblog user to be captured posts microblogs includes:
calculating the average posting interval of the user according to the microblogs posted by the microblog user to be captured;
and looking up the activity corresponding to the average posting interval in a preset database.
A microblog information capturing device, the device comprising:
the first acquisition unit is used for acquiring microblog users to be captured;
the first judging unit is used for judging the type of the microblog user to be captured, which is acquired by the first acquiring unit;
the calculating unit is used for calculating the capture period of the microblog user to be captured when the first judging unit judges that the microblog user to be captured is an active user;
the capturing unit is used for predicting capture time points according to the capture period so as to capture microblog information;
the second acquisition unit is used for acquiring the capture state of the microblog user to be captured and the remaining capture user amount when the first judging unit judges that the microblog user to be captured is an inactive user;
the capturing unit is further configured to capture the microblog information of the microblog user to be captured when the capture state indicates that microblog information can be captured and the remaining capture user amount is not zero.
Preferably, the first acquiring unit includes:
a selecting unit configured to select at least one authenticated user as a seed user and add the seed user as an unprocessed user to a user list;
a second judging unit configured to judge whether the unprocessed user has a subordinate user:
a third acquisition unit configured to acquire a subordinate user of the unprocessed user when the second judgment unit judges that the unprocessed user has the subordinate user,
an adding unit configured to add the subordinate user to the user list, and set a state of the unprocessed user as processed; taking the subordinate user as an unprocessed user, and informing the second judging unit to continuously judge whether the unprocessed user has the subordinate user;
a setting unit configured to set a state of the unprocessed user as processed when the second determination unit determines that the unprocessed user does not have a subordinate user.
Preferably, the third obtaining unit is specifically configured to obtain the subordinate user through the user relationship network of the unprocessed user; or,
the third obtaining unit is specifically configured to capture users who comment on and/or forward the microblogs posted by the unprocessed user, and take these users as the subordinate users.
Preferably, the first judging unit includes:
the determining unit is used for determining the user activity according to the frequency with which the microblog user to be captured posts microblogs;
the judging subunit is used for judging the type of the microblog user to be captured according to a preset active value and the user activity, and if the user activity is not smaller than the preset active value, judging that the microblog user to be captured is an active user; otherwise, judging that the microblog user to be captured is an inactive user.
Preferably, the calculation unit includes:
the calculating subunit is used for calculating the average posting interval of the user according to the microblogs posted by the microblog user to be captured;
and the searching unit is used for looking up the activity corresponding to the average posting interval in a preset database.
According to the microblog information capturing method and device of the invention, as many microblog users to be captured as possible are first mined to serve as the processing objects of the invention, and the processing objects are then classified according to their activity. If a processing object is an active user, the user's microblog posting behavior is statistically analyzed and a capture period is set according to that behavior, so that capture time points can be predicted from the capture period and information can be captured in a targeted manner. If a processing object is an inactive user, whether to capture information is decided according to the user's current capture state and the current remaining capture user amount. By treating different types of users differently, the method and device allocate and use capture resources reasonably and improve resource utilization; at the same time, more microblog information can be captured in each capture pass, which improves information capturing efficiency.
Detailed Description
To enable those skilled in the art to better understand the solution of the invention, the embodiments of the invention are described in detail below with reference to the accompanying drawings and implementations.
To extract news hotspots or analyze user interests, the microblog information posted by users should be captured promptly and comprehensively. In the prior art, however, the major microblog platforms limit the number and frequency of information captures. If information is captured in the same way for different types of microblog users, for example for active users who post, forward and comment on microblogs every day and for inactive users who seldom log in, capture resources are allocated and used unreasonably and microblog information is captured inefficiently. To improve information capturing efficiency and make full use of limited capture resources to obtain more effective microblog information quickly and accurately, the microblog information capturing scheme of the invention is provided. The scheme analyzes the type of each microblog user to be captured and processes users of different types differently. A specific implementation process of the invention is explained below.
Referring to fig. 1, a flowchart of a microblog information capturing method according to the invention is shown. The method may include:
Step 101, obtaining microblog users to be captured, and judging the type of each microblog user to be captured.
Because each major microblog platform limits the capture resources available per day, different capture schemes should be formulated for different types of users so that more effective microblog information can be captured with the limited capture resources.
First, the microblog users to be captured are acquired, that is, microblog users are mined so as to determine as many information capture objects as possible. One implementation of acquiring the microblog users to be captured in this step is shown in the flowchart of fig. 2 and may include:
Step 201, selecting at least one authenticated user as a seed user, and adding the seed user as an unprocessed user to a user list.
Step 202, judging whether the unprocessed user has a subordinate user; if yes, executing step 203, and if not, executing step 205.
Step 203, obtaining the subordinate users of the unprocessed user, adding the subordinate users to the user list, and setting the state of the unprocessed user to processed.
Step 204, taking the subordinate users as unprocessed users, and returning to step 202.
Step 205, setting the state of the unprocessed user to processed.
Microblog users can be roughly divided into two types: authenticated users and ordinary users. To mine as many microblog users as possible, the seed users are selected from authenticated users with large influence and complex user relationship networks. As one implementation of determining the seed users, users can be captured from the page of a microblog celebrity hall, for example by taking the top 100 users in the influence ranking or the popularity ranking as the seed users. Alternatively, authenticated users under a certain category can be captured in a targeted manner according to marketing needs; for example, if a travel product is to be promoted, the authenticated users under the travel category can be captured as the seed users. The invention does not limit the specific manner of determining the seed users from among the authenticated users.
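As a minimal sketch of the seed selection just described, the following Python snippet picks the top-ranked authenticated users, optionally restricted to one category. The `AuthenticatedUser` record, its `influence_rank` field and the function name are hypothetical and introduced only for illustration; the invention does not prescribe any particular data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthenticatedUser:
    """Hypothetical record for an authenticated microblog user."""
    user_id: str
    category: str
    influence_rank: int  # assumption: 1 means most influential


def select_seed_users(users, top_n=100, category=None):
    """Return up to top_n authenticated users ordered by influence rank,
    optionally limited to a single category (e.g. "travel")."""
    candidates = [u for u in users if category is None or u.category == category]
    candidates.sort(key=lambda u: u.influence_rank)
    return candidates[:top_n]


users = [
    AuthenticatedUser("a", "travel", 3),
    AuthenticatedUser("b", "sports", 1),
    AuthenticatedUser("c", "travel", 2),
]
travel_seeds = select_seed_users(users, top_n=1, category="travel")
overall_seeds = select_seed_users(users, top_n=2)
```

Here `travel_seeds` corresponds to the targeted, category-based selection and `overall_seeds` to the ranking-based selection.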
After the seed users are determined, they can be added to a user list as unprocessed users, whether each unprocessed user has subordinate users is judged, and the following processing is performed:
(1) If the unprocessed user has no subordinate user, the unprocessed user is a bottom node, indicating that the microblog users directly or indirectly related to the seed user have now all been mined, and the state of the unprocessed user can be directly marked as processed.
(2) If the unprocessed user has subordinate users, the unprocessed user is not a bottom node, and recursive mining should continue on the basis of the subordinate users. In this case, the following processing may be performed:
a. marking the state of the unprocessed user as processed;
b. adding the subordinate users of the unprocessed user to the user list;
c. marking the state of the subordinate users as unprocessed, so that recursive mining can continue on the basis of these users.
After these 3 processing actions, unprocessed users remain in the user list, so the process should return to step 202: the subordinate users are treated as unprocessed users, whether they in turn have subordinate users is judged, and the corresponding processing is performed according to the judgment result, which is not described again here.
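The mining loop of steps 201 to 205 can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the implementation of the invention: it uses an iterative queue in place of recursion, it skips users that are already in the user list so that cyclic relationships terminate, and it accepts an optional `max_users` cap. The skip rule, the cap and the `get_subordinates` callback are all assumptions added for the example.

```python
from collections import deque


def mine_users(seed_users, get_subordinates, max_users=None):
    """Return a dict mapping every mined user to "processed".

    get_subordinates(user) is a hypothetical callback returning the
    subordinate users of `user` (an empty list for a bottom node).
    """
    state = {}
    queue = deque()
    for seed in seed_users:
        if seed not in state:
            state[seed] = "unprocessed"   # step 201: add seeds as unprocessed
            queue.append(seed)
    while queue:
        user = queue.popleft()
        for sub in get_subordinates(user):            # steps 202 and 203
            if sub in state:
                continue                  # assumption: never re-add a known user
            if max_users is not None and len(state) >= max_users:
                break                     # assumption: optional cap on list size
            state[sub] = "unprocessed"    # step 204: subordinate becomes unprocessed
            queue.append(sub)
        state[user] = "processed"         # steps 203 and 205
    return state


relations = {"seed": ["u1", "u2"], "u1": ["u3", "seed"], "u2": [], "u3": []}
mined = mine_users(["seed"], lambda u: relations.get(u, []))
capped = mine_users(["seed"], lambda u: relations.get(u, []), max_users=2)
```

In the example, `u1` lists `seed` as a subordinate; the membership check prevents the loop from revisiting it.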
It should be noted that the invention provides two implementations for acquiring the subordinate users of an unprocessed user, which are explained separately below.
(1) Acquiring the subordinate users through the user relationship network of the unprocessed user.
The user relationship network refers to the relationships between microblog users and is generally represented as a node graph, in which nodes represent microblog users and a line connecting two nodes represents a relationship between the two users. On a microblog platform, user A can follow user B and receive the microblogs posted by user B that interest user A; in this case user A is a fan (follower) of user B and, correspondingly, user B is followed by user A.
As one implementation of acquiring the user relationship network, the API of the microblog open platform can be called to obtain the following list and the fan list of an unprocessed user. Since the users in the following list and the fan list are mined through the unprocessed user, they can be referred to as subordinate users of the unprocessed user.
(2) Capturing users who comment on and/or forward the microblogs posted by the unprocessed user, and taking them as the subordinate users.
Even when user A and user B have no following or fan relationship, user A may still forward and/or comment on a microblog posted by user B. In this case user A and user B can be considered to be associated, and user A can likewise be regarded as a subordinate user of user B. Therefore, as another implementation of acquiring the subordinate users, the users who forward and/or comment on the microblogs posted by the unprocessed user can be captured.
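The two ways of acquiring subordinate users can be combined as in the sketch below. The two callbacks stand in for platform access and are purely hypothetical: no real open-platform API call is shown, and either source could equally be used on its own, as the text allows.

```python
def get_subordinate_users(user_id, relation_source, interaction_source):
    """Collect subordinate users of user_id from both sources.

    relation_source(user_id)    -> (following_list, fan_list)      (hypothetical)
    interaction_source(user_id) -> users who commented on or
                                   forwarded the user's microblogs (hypothetical)
    """
    following, fans = relation_source(user_id)
    interacting = interaction_source(user_id)
    seen = set()
    ordered = []
    for candidate in [*following, *fans, *interacting]:
        # Drop the user itself and any duplicates, keeping first-seen order.
        if candidate != user_id and candidate not in seen:
            seen.add(candidate)
            ordered.append(candidate)
    return ordered


def fake_relations(user_id):
    return (["b", "c"], ["c", "d"])      # stub data, not real platform output


def fake_interactions(user_id):
    return ["d", "e", "a"]               # stub data, not real platform output


subs = get_subordinate_users("a", fake_relations, fake_interactions)
```

User `e` appears only through the interaction source, which illustrates how way (2) can find associated users that the relationship network misses.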
The microblog users mined in the manner described above can be regarded as the processing objects of the invention, namely the microblog users to be captured. To process these users differently, their types must be identified. In the invention, microblog users are divided into active users and inactive users; active users are few in number, while inactive users are many. The invention provides a different processing mode for each type: active users are processed according to step 102, and inactive users are processed according to step 103, as explained below.
The implementation of determining the user type is described later with reference to fig. 3.
Step 102, if the microblog user to be captured is an active user, calculating a capture period for the microblog user to be captured, and predicting capture time points according to the capture period so as to capture microblog information.
As described above, active users are few in number but provide a large amount of microblog information. Given this characteristic, the microblog posting behavior of each active user can be analyzed one by one, a corresponding capture period can be set for the user according to that behavior, and information can then be captured in a targeted manner at the capture time points predicted from the capture period (that is, the time points at which the user is likely to post a microblog).
It should be noted that the capture period determined for an active user may be a fixed period or a variable period.
That is, for a given active user, the average interval at which the user posts microblogs within a unit of time (such as an hour, a day or a week) can be obtained by statistically analyzing the user's historical microblogs; a fixed capture period is calculated based on this average interval, and the capture time points are predicted according to the fixed capture period. Here, the average posting interval within the unit of time can be understood as the behavior characteristic of the user.
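A fixed capture period could be derived as in the following sketch. The text states only that the period is calculated based on the average posting interval; setting the period equal to that average is an assumption made here for simplicity, and the function names are illustrative.

```python
from datetime import datetime, timedelta


def average_posting_interval(post_times):
    """Mean gap between consecutive posts, from the first to the last post."""
    ordered = sorted(post_times)
    if len(ordered) < 2:
        raise ValueError("at least two posts are needed to compute an interval")
    return (ordered[-1] - ordered[0]) / (len(ordered) - 1)


def predict_capture_times(last_capture, period, count):
    """Next `count` capture time points spaced by a fixed period."""
    return [last_capture + period * (i + 1) for i in range(count)]


posts = [
    datetime(2024, 1, 1, 8),
    datetime(2024, 1, 1, 12),
    datetime(2024, 1, 1, 20),
]
# Assumption: the fixed capture period equals the average posting interval.
period = average_posting_interval(posts)
times = predict_capture_times(datetime(2024, 1, 1, 20), period, 2)
```

With three posts spanning twelve hours, the average interval and therefore the period is six hours, so the next two capture points fall six and twelve hours after the last capture.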
Alternatively, for a given active user, the busy periods and idle periods in which the user posts microblogs within a unit of time (such as an hour, a day or a week) can be obtained by statistically analyzing the user's historical microblogs; different capture periods are set for the busy periods and the idle periods, and information is captured with a variable period. For example, if statistics show that an active user frequently posts microblogs at lunchtime, while riding the subway, or in the evening, these time periods can be defined as busy periods; if the user rarely posts microblogs during working hours or during rest hours at night, these time periods can be defined as idle periods. The user's microblog posting behavior over a day is thus obtained, the capture periods for that day can be set accordingly, and the capture time points for the same day of the following week can be predicted from the set capture periods so as to capture microblog information.
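The variable-period approach can be illustrated with the sketch below. The hour-level granularity, the threshold of two posts per hour for a busy period, and the 30-minute and 4-hour capture periods are all assumed values chosen for the example; the text does not fix any of them.

```python
from collections import Counter
from datetime import datetime, timedelta


def find_busy_hours(post_times, min_posts=2):
    """Hours of the day in which the user posted at least min_posts times."""
    counts = Counter(t.hour for t in post_times)
    return {hour for hour, n in counts.items() if n >= min_posts}


def next_capture_time(t, busy, busy_period, idle_period):
    """Advance by the busy period inside a busy hour, otherwise by the idle
    period, but never jump past the start of an upcoming busy hour."""
    if t.hour in busy:
        return t + busy_period
    candidate = t + idle_period
    probe = t.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    while probe < candidate:
        if probe.hour in busy:
            return probe
        probe += timedelta(hours=1)
    return candidate


def daily_capture_times(day_start, busy,
                        busy_period=timedelta(minutes=30),
                        idle_period=timedelta(hours=4)):
    """Capture time points for the 24 hours starting at day_start."""
    times, t = [], day_start
    end = day_start + timedelta(days=1)
    while t < end:
        times.append(t)
        t = next_capture_time(t, busy, busy_period, idle_period)
    return times


# Posts observed on one Monday (illustrative data).
history = [
    datetime(2024, 6, 3, 12, 5), datetime(2024, 6, 3, 12, 40),
    datetime(2024, 6, 3, 18, 10), datetime(2024, 6, 3, 18, 55),
    datetime(2024, 6, 3, 9, 0),
]
busy = find_busy_hours(history)
# Predict capture time points for the same weekday of the following week.
schedule = daily_capture_times(datetime(2024, 6, 10), busy)
```

The schedule is dense during the two busy hours and sparse elsewhere, which is the intent of setting different capture periods for busy and idle periods.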
It should be noted that, in determining the capture period, the factors that may affect its length include at least: the weight of each historical microblog, the influence of the user (which can be represented by the number of fans and the number of mentions), the quality of the microblogs posted by the user (which can be represented by the number of times they are forwarded), and the capture resources (which are limited by the platform being captured), among others; these are not described in detail here.
Step 103, if the microblog user to be captured is an inactive user, acquiring the capture state of the microblog user to be captured and the remaining capture user amount, and, if the capture state indicates that microblog information can be captured and the remaining capture user amount is not zero, capturing the microblog information of the microblog user to be captured.
As described above, inactive users are many in number but provide little microblog information. If their information were captured according to a capture period (fixed or variable) as in step 102, capture resources would be wasted and little information might be captured. A different capture scheme is therefore provided for inactive users.
First, a capture interval, for example 2 months, is set to indicate the current capture state of an inactive user. During the capture interval, the capture state of the user is not capturable; when the capture interval has elapsed, the capture state of the user is capturable. For example, suppose that information is captured for an inactive user on 06.12 (which may be regarded as the capture starting point of the user). When it is judged on 06.13 whether information needs to be captured for the user, it is known that the user's microblog information was captured only the previous day, so there is no need to capture it again for the time being; that is, the capture state of the user on 06.13 is not capturable. The judgment is repeated day by day in this manner (other time units may of course be used, and the invention is not limited in this respect) until, on 08.12, after the 2-month interval, the capture state of the user is judged to be capturable and the next information capture is performed.
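The capture-state judgment in the 06.12 / 08.12 example can be sketched as follows. The year is hypothetical (the text gives only month and day), the state strings are illustrative, and clamping to the last day of a shorter month is an assumption about how calendar months are added.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Add calendar months, clamping the day to the target month's length."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def capture_state(last_capture, today, interval_months=2):
    """Return "capturable" once the capture interval has elapsed."""
    due = add_months(last_capture, interval_months)
    return "capturable" if today >= due else "not capturable"


last = date(2024, 6, 12)                              # capture starting point
state_next_day = capture_state(last, date(2024, 6, 13))
state_after_interval = capture_state(last, date(2024, 8, 12))
```

Evaluating `capture_state` once per day reproduces the day-by-day judgment described above; a different time unit would only change how often the function is called.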
Second, a capture user amount is set according to the API permissions to limit the daily capture upper limit, that is, how many inactive users can be captured each day, for example ten million inactive users.
After these two parameters are set, whether information can currently be captured for a microblog user to be captured can be judged as follows: judge whether the capture state of the microblog user to be captured is capturable; if so, further judge whether the current remaining capture user amount is zero; if it is not zero, information can be captured for the microblog user to be captured, and the remaining capture user amount is reduced by 1 when the microblog information is captured, so that the judgment for subsequent inactive users remains accurate.
That is, for an inactive user, if the capture state is not capturable, or the current remaining capture user amount is zero, no information is captured for that user.
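The two-condition judgment for inactive users can be sketched as below. The class name, the state strings and the `capture_fn` callback are hypothetical; the logic follows the text: capture only when the state is capturable and the remaining amount is not zero, then reduce the remaining amount by 1.

```python
class InactiveUserScheduler:
    """Tracks the remaining capture user amount for the current day."""

    def __init__(self, daily_amount):
        self.remaining = daily_amount

    def try_capture(self, user_id, state, capture_fn):
        """Capture the user's microblog information if allowed.

        Returns True when a capture was performed, False otherwise.
        """
        if state != "capturable":
            return False              # still inside the capture interval
        if self.remaining == 0:
            return False              # daily upper limit already reached
        capture_fn(user_id)           # hypothetical capture routine
        self.remaining -= 1           # keep later judgments accurate
        return True


captured = []
scheduler = InactiveUserScheduler(daily_amount=2)
candidates = [
    ("u1", "capturable"),
    ("u2", "not capturable"),
    ("u3", "capturable"),
    ("u4", "capturable"),
]
results = [scheduler.try_capture(uid, st, captured.append) for uid, st in candidates]
```

In this run `u2` is skipped because of its state and `u4` is skipped because the daily amount of 2 has been used up.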
It should be noted that, because the capture user amount is limited, the microblog information of some inactive users whose capture state is capturable may not be captured as expected. For this reason, different capture intervals or capture starting points may be set so that inactive users are processed in a staggered manner; the limited capture resources can then be used to process as many inactive users as possible, improving resource utilization and the efficiency of capturing effective information.
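One simple way to stagger capture starting points is a round-robin assignment of start offsets, sketched below. This particular assignment rule, the use of days as the unit and the function names are assumptions for illustration; the text only states that different capture intervals or starting points may be set.

```python
def staggered_start_offsets(user_ids, interval_days):
    """Assign each user a start offset in [0, interval_days) in round-robin
    order, so that roughly the same number of users becomes due each day."""
    return {uid: i % interval_days for i, uid in enumerate(sorted(user_ids))}


def users_due(offsets, day_index, interval_days):
    """Users whose capture interval elapses on the given day index."""
    return [uid for uid, off in offsets.items()
            if day_index % interval_days == off]


inactive_users = ["u0", "u1", "u2", "u3", "u4", "u5"]
offsets = staggered_start_offsets(inactive_users, interval_days=3)
due_day_0 = users_due(offsets, 0, 3)
due_day_4 = users_due(offsets, 4, 3)
```

With six users and a three-day interval, two users become due on each day instead of all six on the same day, so a daily amount of 2 would suffice to serve everyone.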
Referring to fig. 3, a flow of determining the user type according to the invention is shown, which may include:
Step 301, determining the user activity according to the frequency with which the microblog user to be captured posts microblogs.
Step 302, judging the type of the microblog user to be captured according to a preset active value and the user activity: if the user activity is not less than the preset active value, judging that the microblog user to be captured is an active user; otherwise, judging that the microblog user to be captured is an inactive user.
The method determines the activity of a user mainly according to whether the user posts microblogs and how frequently. If the user has not posted any microblog, the user is directly defined as an inactive user; if the user has posted microblogs, the activity of the user is determined according to the posting frequency, which can be implemented by the process shown in fig. 4, including:
Step 401, calculating the average posting interval of the user according to the microblogs posted by the microblog user to be captured;
Step 402, looking up the activity corresponding to the average posting interval in a preset database.
This embodiment represents the posting frequency of the user by the posting interval, which in turn reflects the activity of the user. In a specific implementation, a database storing the correspondence between posting intervals and activity can be established, and after the posting interval of a user is calculated, the corresponding activity can be determined by table lookup. It should be noted that posting intervals and activity may be in one-to-one correspondence, that is, each posting interval corresponds to one activity value; alternatively, they may be in many-to-one correspondence, that is, a plurality of posting intervals correspond to one activity value, in which case the activity can be regarded as an activity level. The invention is not limited in this respect.
After the user activity is obtained, it is compared with the preset active value: if the user activity is smaller than the preset active value, the user is judged to be an inactive user; if the user activity is greater than or equal to the preset active value, the user is judged to be an active user.
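The type determination of figs. 3 and 4 can be sketched as follows, using the many-to-one (activity level) variant. The table contents, the level scale and the preset active value are invented for the example, and an in-memory list stands in for the preset database; treating a user with a single post as inactive is likewise an assumption, since no interval can be computed.

```python
from datetime import datetime, timedelta

# Hypothetical lookup table: (largest average interval, activity level).
ACTIVITY_TABLE = [
    (timedelta(hours=6), 5),
    (timedelta(days=1), 4),
    (timedelta(days=7), 3),
    (timedelta(days=30), 2),
]
LOWEST_ACTIVITY = 1
PRESET_ACTIVE_VALUE = 4   # assumed threshold


def average_posting_interval(post_times):
    """Step 401: mean gap between consecutive posts."""
    ordered = sorted(post_times)
    return (ordered[-1] - ordered[0]) / (len(ordered) - 1)


def lookup_activity(interval):
    """Step 402: map an average posting interval to an activity level."""
    for upper_bound, level in ACTIVITY_TABLE:
        if interval <= upper_bound:
            return level
    return LOWEST_ACTIVITY


def classify_user(post_times, preset=PRESET_ACTIVE_VALUE):
    """Steps 301 and 302: return "active" or "inactive"."""
    if len(post_times) < 2:
        # No posts: inactive per the text. One post: inactive by assumption.
        return "inactive"
    activity = lookup_activity(average_posting_interval(post_times))
    return "active" if activity >= preset else "inactive"


daily_poster = [datetime(2024, 1, d, 8) for d in (1, 2, 3)]
rare_poster = [datetime(2024, 1, 1), datetime(2024, 1, 15)]
daily_type = classify_user(daily_poster)
rare_type = classify_user(rare_poster)
```

The comparison uses "not less than", so a user whose activity equals the preset active value is classified as active, matching step 302.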
Correspondingly, the invention further provides a microblog information capturing device. Referring to fig. 5, a schematic diagram of the microblog information capturing device according to the invention is shown, and the device may include:
a first obtaining unit 501, configured to obtain microblog users to be captured;
a first determining unit 502, configured to determine the type of the microblog user to be captured that is obtained by the first obtaining unit;
a calculating unit 503, configured to calculate a capture period for the microblog user to be captured when the first determining unit determines that the microblog user to be captured is an active user;
a capturing unit 504, configured to predict capture time points according to the capture period so as to capture microblog information;
a second obtaining unit 505, configured to obtain the capture state of the microblog user to be captured and the remaining capture user amount when the first determining unit determines that the microblog user to be captured is an inactive user;
the capturing unit 504 is further configured to capture the microblog information of the microblog user to be captured when the capture state indicates that microblog information can be captured and the remaining capture user amount is not zero.
Referring to fig. 6, a schematic diagram of a first obtaining unit in the present invention is shown, which may include:
a selecting unit 601, configured to select at least one authenticated user as a seed user, and add the seed user as an unprocessed user to a user list;
a second determining unit 602, configured to determine whether the unprocessed user has a subordinate user;
a third acquiring unit 603, configured to acquire the subordinate users of the unprocessed user when the second determining unit determines that the unprocessed user has a subordinate user;
an adding unit 604, configured to add the subordinate users to the user list, set the state of the unprocessed user to processed, take the subordinate users as unprocessed users, and notify the second determining unit 602 to continue determining whether the unprocessed user has a subordinate user;
a setting unit 605, configured to set the state of the unprocessed user to processed when the second determining unit determines that the unprocessed user does not have a subordinate user.
The third acquiring unit may acquire the subordinate users in either of the following two ways: acquiring the subordinate users through the user relationship network of the unprocessed user; or capturing users who comment on and/or forward the microblogs posted by the unprocessed user and taking them as the subordinate users.
Referring to fig. 7, a schematic diagram of a first judging unit in the present invention is shown, which may include:
a determining unit 701, configured to determine user activity according to the frequency of issuing the microblog by the microblog user to be captured;
a determining subunit 702, configured to determine the type of the microblog user to be captured according to a preset active value and the user activity, and if the user activity is not less than the preset active value, determine that the microblog user to be captured is an active user; otherwise, judging that the microblog user to be captured is an inactive user.
Referring to fig. 8, a schematic diagram of a computing unit of the present invention is shown, which may include:
a calculating subunit 801, configured to calculate the average posting interval of the user according to the microblogs posted by the microblog user to be captured;
a searching unit 802, configured to look up the activity corresponding to the average posting interval in a preset database.
The foregoing is merely a preferred embodiment of the invention and is not intended to limit the invention in any form. Although the invention has been disclosed above by way of preferred embodiments, these are not intended to limit it. Using the methods and technical content disclosed above, those skilled in the art can make many possible variations and modifications to the technical solution of the invention, or modify it into equivalent embodiments, without departing from the scope of the technical solution of the invention. Therefore, any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the invention, without departing from the content of the technical solution of the invention, still falls within the scope of protection of the technical solution of the invention.