CN105243122A

CN105243122A - Social software based data acquisition method and apparatus

Info

Publication number: CN105243122A
Application number: CN201510633010.3A
Authority: CN
Inventors: 李鹏
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2015-09-29
Filing date: 2015-09-29
Publication date: 2016-01-13

Abstract

The present invention provides a social software based data acquisition method and apparatus. The method comprises: S1: selecting at least one registered user in target social software, and adding a user identifier corresponding to the at least one registered user into a crawling queue; S2: according to the user identifiers in the crawling queue, crawling webpage data and friend lists of the users corresponding to the user identifiers one by one; and S3: adding each user identifier in the crawled friend lists into the crawling queue, returning to execute step S2, and ending when a preset condition is met. According to the present scheme, the number of users corresponding to the crawled webpage data can be larger, thereby improving the accuracy of an analysis result.

Description

Social software based data acquisition method and apparatus

Technical field

The present invention relates to the technical field of computer networks, and in particular to a social software based data acquisition method and apparatus.

Background art

With the rapid development of computer network technology, the volume of data that users generate on computer networks keeps growing. By acquiring the data that a large number of users generate on the network, the information those users pay attention to can be obtained through analysis, so that future network development can be planned according to the users' interests.

Existing data acquisition methods use a web crawler to capture the webpage data that users access. The more target users the acquired webpage data covers, the more accurate the analysis of that data is. How to acquire the webpage data of more target users, and thereby improve the accuracy of the analysis result, has therefore become an urgent problem.

Summary of the invention

In view of this, the present invention provides a social software based data acquisition method and apparatus, so as to acquire the webpage data of more target users.

The present invention provides a social software based data acquisition method, comprising:

S1: selecting at least one registered user in target social software, and adding the user identifier corresponding to each selected registered user to a crawling queue;

S2: according to the user identifiers in the crawling queue, crawling, one by one, the webpage data and the friend list of the user corresponding to each user identifier;

S3: adding each user identifier in the crawled friend lists to the crawling queue, and returning to step S2, ending when a preset condition is met.

Preferably, the method further comprises:

storing the crawled webpage data of the user corresponding to each user identifier in a database;

and/or,

adding each user identifier that has been added to the crawling queue to the database.

Preferably, adding each user identifier in the crawled friend lists to the crawling queue comprises:

comparing each user identifier in the crawled friend lists, one by one, with the user identifiers already added to the database, and adding to the crawling queue only those user identifiers not yet stored in the database.

Preferably,

the method further comprises: marking the degree of separation of each user identifier added to the crawling queue, wherein the degree of separation of the user identifiers first added to the crawling queue is 1, and the degree of separation marked for each user identifier in the crawled friend list of a target user is 1 greater than the degree of separation of that target user's identifier;

the preset condition being met comprises: the degree of separation of the user identifiers added to the crawling queue reaching a set value.

Preferably, crawling the webpage data and the friend list of the user corresponding to each user identifier comprises:

dividing the user identifiers in the crawling queue into multiple Map tasks and assigning the Map tasks to at least two servers, the at least two servers crawling in parallel the Map tasks assigned to them, and, after processing ends, performing a Reduce merge on the data processed by each server.

The present invention also provides a social software based data acquisition apparatus, comprising:

a selection unit, configured to select at least one registered user in target social software;

an adding unit, configured to add the user identifier corresponding to each selected registered user to a crawling queue;

a crawling unit, configured to crawl, one by one and according to the user identifiers in the crawling queue, the webpage data and the friend list of the user corresponding to each user identifier;

the adding unit being further configured to add each user identifier in the crawled friend lists to the crawling queue and to trigger the crawling unit to perform the corresponding operation, and to stop triggering the crawling unit when a preset condition is met.

Preferably, the apparatus further comprises:

a sending unit, configured to store the crawled webpage data of the user corresponding to each user identifier in a database;

and/or,

the sending unit being configured to store each user identifier that has been added to the crawling queue in the database.

Preferably, the adding unit is specifically configured to compare each user identifier in the crawled friend lists, one by one, with the user identifiers already added to the database, and to add to the crawling queue only those user identifiers not yet stored in the database.

Preferably,

the apparatus further comprises: a marking unit, configured to mark the degree of separation of each user identifier added to the crawling queue, wherein the degree of separation of the user identifiers first added to the crawling queue is 1, and the degree of separation marked for each user identifier in the crawled friend list of a target user is 1 greater than the degree of separation of that target user's identifier.

Preferably, the crawling unit is specifically configured to divide the user identifiers in the crawling queue into multiple Map tasks and assign the Map tasks to at least two servers, the at least two servers crawling in parallel the Map tasks assigned to them, and, after processing ends, to perform a Reduce merge on the data processed by each server.

Embodiments of the present invention provide a social software based data acquisition method and apparatus that crawl the webpage data of users by exploiting the large number of registered users in social software and the friend relationships between them. Because social software has a large number of registered users, the crawled webpage data covers correspondingly more users, which can improve the accuracy of the analysis result.

Brief description of the drawings

Fig. 1 is a flowchart of a method provided by an embodiment of the present invention;

Fig. 2 is a flowchart of a method provided by another embodiment of the present invention;

Fig. 3 is a hardware structure diagram of the device in which the data acquisition apparatus provided by an embodiment of the present invention is located;

Fig. 4 is a schematic structural diagram of the data acquisition apparatus provided by an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of the data acquisition apparatus provided by another embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

As shown in Fig. 1, an embodiment of the present invention provides a social software based data acquisition method, which may comprise the following steps:

Step 101: select at least one registered user in target social software, and add the user identifier corresponding to each selected registered user to a crawling queue;

Step 102: according to the user identifiers in the crawling queue, crawl, one by one, the webpage data and the friend list of the user corresponding to each user identifier;

Step 103: add each user identifier in the crawled friend lists to the crawling queue, and return to step 102, ending when a preset condition is met.
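Steps 101 to 103 can be sketched as a simple queue-driven loop. The sketch below is illustrative only: `FAKE_NETWORK`, `fetch_user`, and the `max_crawls` limit are hypothetical stand-ins (the source names no API and leaves the preset condition open), and an in-memory dictionary replaces real network requests.

```python
from collections import deque

# Hypothetical stand-in for the target social software:
# user identifier -> (webpage data, friend list).
FAKE_NETWORK = {
    "A": ("page of A", ["B", "C"]),
    "B": ("page of B", ["A", "D"]),
    "C": ("page of C", ["A"]),
    "D": ("page of D", ["B"]),
}


def fetch_user(user_id):
    """Placeholder for a real crawler request; returns (webpage data, friend list)."""
    return FAKE_NETWORK[user_id]


def crawl(seed_ids, max_crawls):
    """Run steps 101-103 until `max_crawls` users have been crawled (assumed condition)."""
    queue = deque(seed_ids)                    # step 101: seed the crawling queue
    crawled_order = []
    while queue and len(crawled_order) < max_crawls:
        user_id = queue.popleft()
        page, friends = fetch_user(user_id)    # step 102: webpage data and friend list
        crawled_order.append(user_id)
        queue.extend(friends)                  # step 103: enqueue every friend identifier
    return crawled_order


print(crawl(["A"], max_crawls=5))
```

Because this basic form enqueues every friend identifier without checking whether it was seen before, user A is crawled a second time in the run above; the loop ends only because of the assumed crawl limit.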

In this embodiment of the present invention, the webpage data of users is crawled by exploiting the large number of registered users in social software and the friend relationships between them. Because social software has a large number of registered users, the crawled webpage data covers correspondingly more users, which can improve the accuracy of the analysis result.

In a preferred embodiment of the present invention, to avoid crawling the webpage data of the same user repeatedly, each user identifier in the crawled friend lists may be compared, one by one, with the user identifiers already added to the database, and only the user identifiers not yet stored in the database are added to the crawling queue. This reduces the amount of computation in the subsequent process.
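A minimal sketch of this comparison step follows. It assumes a Python `set` as a stand-in for the database of recorded user identifiers; the function name `enqueue_new_friends` is hypothetical and not taken from the source.

```python
from collections import deque


def enqueue_new_friends(friend_ids, stored_ids, queue):
    """Queue only identifiers absent from `stored_ids`, recording each one as it is queued."""
    added = []
    for friend_id in friend_ids:
        if friend_id not in stored_ids:    # compare against the stored identifiers
            stored_ids.add(friend_id)      # record the identifier so it is queued only once
            queue.append(friend_id)
            added.append(friend_id)
    return added


stored_ids = {"A", "B"}                    # identifiers already recorded
queue = deque(["B"])                       # identifiers still waiting to be crawled
added = enqueue_new_friends(["A", "C", "D", "C"], stored_ids, queue)
print(added, list(queue))
```

Recording an identifier at the moment it is queued, rather than when it is crawled, also filters out duplicates that appear within a single friend list or in the friend lists of two users crawled close together.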

In a preferred embodiment of the present invention, the crawling of users' webpage data does not go on without end. According to the theory of six degrees of separation, also known as the small-world theory, no more than six people separate any two strangers; that is, any two strangers can become acquainted through at most five intermediaries. Therefore, the degree of separation of each user identifier added to the crawling queue is marked, wherein the degree of separation of the user identifiers first added to the crawling queue is 1, and the degree of separation marked for each user identifier in the crawled friend list of a target user is 1 greater than the degree of separation of that target user's identifier. The preset condition being met may comprise: the degree of separation of the user identifiers added to the crawling queue reaching a set value. This set value may be 6. When the degree of separation of the user identifiers added to the crawling queue is 6, this may indicate that the webpage data of all registered users in the social software has been crawled.
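The degree marking can be illustrated with the following sketch. The names `crawl_with_degrees` and `get_friends` are hypothetical, and the toy friend graph stands in for crawled friend lists. Seed users receive degree 1, each newly discovered friend receives its discoverer's degree plus 1, and friend lists are no longer fetched for users whose degree has reached the set value.

```python
from collections import deque


def crawl_with_degrees(seed_ids, get_friends, max_degree=6):
    """Return {user identifier: degree of separation} for every user reached."""
    degree = {user_id: 1 for user_id in seed_ids}   # first users added are degree 1
    queue = deque(seed_ids)
    while queue:
        user_id = queue.popleft()
        if degree[user_id] >= max_degree:
            continue                                # set value reached: do not expand further
        for friend_id in get_friends(user_id):
            if friend_id not in degree:             # keep the first (smallest) degree assigned
                degree[friend_id] = degree[user_id] + 1
                queue.append(friend_id)
    return degree


# Toy chain of friendships: A - B - C - D.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(crawl_with_degrees(["A"], lambda uid: chain[uid], max_degree=3))
```

With a set value of 3, user D is never reached, because the friend list of user C (degree 3) is not fetched.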

In a preferred embodiment of the present invention, because the volume of crawled webpage data is large, the user identifiers in the crawling queue may be divided into multiple Map tasks, and the Map tasks assigned to at least two servers. The at least two servers crawl in parallel the Map tasks assigned to them, and after processing ends, a Reduce merge is performed on the data processed by each server. This improves the efficiency of webpage data crawling.
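The Map and Reduce division can be sketched on a single machine as follows. This is an assumption-laden illustration, not the deployment the source describes: worker threads from Python's standard `concurrent.futures` module stand in for the servers, and `fetch_page`, `split_into_map_tasks`, `map_task`, and `merge` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def fetch_page(user_id):
    """Placeholder for a real crawler request."""
    return f"page of {user_id}"


def split_into_map_tasks(user_ids, num_tasks):
    """Divide the queued identifiers into `num_tasks` roughly equal Map tasks."""
    return [user_ids[i::num_tasks] for i in range(num_tasks)]


def map_task(user_ids):
    """Work done by one server: crawl every identifier in its task."""
    return {user_id: fetch_page(user_id) for user_id in user_ids}


def merge(left, right):
    """Reduce step: combine the results of two servers."""
    merged = dict(left)
    merged.update(right)
    return merged


def parallel_crawl(user_ids, num_workers=2):
    tasks = split_into_map_tasks(user_ids, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(map_task, tasks))   # Map tasks run in parallel
    return reduce(merge, partial_results, {})               # Reduce merge of all results


print(parallel_crawl(["A", "B", "C"], num_workers=2))
```

In an actual cluster the Map tasks would be dispatched to separate machines by a framework such as Hadoop; the thread pool here only mimics that division of work.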

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

As shown in Fig. 2, an embodiment of the present invention provides a social software based data acquisition method, which may comprise the following steps:

Step 201: set the preset condition that must be met for crawling to end.

The crawling of users' webpage data does not go on without end. According to the theory of six degrees of separation, also known as the small-world theory, no more than six people separate any two strangers; that is, any two strangers can become acquainted through at most five intermediaries.

In this embodiment, the preset condition that must be met for crawling to end needs to be set. The preset condition may be set as follows:

1. The degree of separation of the user identifiers added to the crawling queue reaches a set value a.

According to the theory of six degrees of separation, the set value a may be 6; it may of course also be another value not less than 6.

2. The number of user identifiers corresponding to the crawled webpage data reaches a set value b.

The set value b may be 100,000,000. The larger the set value b, the more accurate the analysis result obtained from the webpage data of those b users.
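The two conditions can be captured in a small configuration object, sketched below. The class name `StopCondition` and its fields are hypothetical, and treating either condition as sufficient on its own is an assumption; the source lists the two conditions without stating how they combine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StopCondition:
    max_degree: Optional[int] = 6             # set value a (None disables condition 1)
    max_users: Optional[int] = 100_000_000    # set value b (None disables condition 2)

    def is_met(self, current_degree, crawled_users):
        """Return True once either enabled condition has been reached."""
        if self.max_degree is not None and current_degree >= self.max_degree:
            return True
        if self.max_users is not None and crawled_users >= self.max_users:
            return True
        return False


default = StopCondition()
print(default.is_met(current_degree=3, crawled_users=10))   # neither value reached
print(default.is_met(current_degree=6, crawled_users=10))   # set value a reached
```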

Step 202: determine the target social software.

In this embodiment, the target social software may be any kind of social software available on current networks, such as Facebook, Twitter, Instagram, Weibo, or QQ.

Step 203: select at least one registered user in the target social software, add the user identifier corresponding to each selected registered user to the crawling queue, and mark the degree of separation of each selected registered user as 1.

Where the preset condition for ending the crawl is condition 1 above, the degree of separation of the at least one registered user first selected from the target social software may be marked as 1.

In this embodiment, to make it easier to crawl the users' webpage data in the subsequent process, the user identifier corresponding to each selected registered user may be added to the crawling queue.

Step 204: according to the user identifiers in the crawling queue, crawl, one by one, the webpage data and the friend list of the user corresponding to each user identifier.

In this embodiment, a web crawler may be used to crawl the webpage data and the friend list information of each user. A web crawler, also called a web spider, web robot, or web chaser, is a program or script that automatically captures web information according to certain rules. While the web crawler runs, all the information that a user produces on the network can be crawled, and the crawled information can then be analyzed to determine what the user pays attention to.

In this embodiment, because the volume of crawled webpage data is large, the user identifiers in the crawling queue may be divided into multiple Map tasks, and the Map tasks assigned to at least two servers. The at least two servers crawl in parallel the Map tasks assigned to them, and after processing ends, a Reduce merge is performed on the data processed by each server. This improves the efficiency of webpage data crawling.

Further, to store and record the crawled content, the crawled webpage data of the user corresponding to each user identifier may be stored in a database; and/or each user identifier that has been added to the crawling queue may be added to the database.
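The storage described above might look like the following sketch, which uses Python's standard `sqlite3` module with an in-memory database. The source does not name a database product or schema, so the table names `queued_ids` and `pages` and all function names are assumptions made for illustration.

```python
import sqlite3


def open_store():
    """Create an in-memory database with one table per kind of stored record."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE queued_ids (user_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE pages (user_id TEXT PRIMARY KEY, content TEXT)")
    return conn


def record_queued_id(conn, user_id):
    """Record an identifier that has been added to the crawling queue (idempotent)."""
    conn.execute("INSERT OR IGNORE INTO queued_ids (user_id) VALUES (?)", (user_id,))


def store_page(conn, user_id, content):
    """Store, or overwrite, the crawled webpage data of one user."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (user_id, content) VALUES (?, ?)",
        (user_id, content),
    )


def is_queued(conn, user_id):
    """Check whether an identifier has already been recorded."""
    row = conn.execute(
        "SELECT 1 FROM queued_ids WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row is not None


conn = open_store()
record_queued_id(conn, "A")
store_page(conn, "A", "page of A")
print(is_queued(conn, "A"), is_queued(conn, "B"))
```

Keeping the queued identifiers in their own table lets the comparison in the next step be a single primary-key lookup, independent of how much webpage data has been stored.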

Step 205: compare each user identifier in the crawled friend lists, one by one, with the user identifiers already added to the database, and add to the crawling queue only those user identifiers not yet stored in the database; then return to step 204, ending when the preset condition is met.

In this embodiment, the degree of separation is marked for the user identifiers each time they are added to the crawling queue, wherein the degree of separation marked for each user identifier in the crawled friend list of a target user is 1 greater than the degree of separation of that target user's identifier.

For example, the registered user A selected in step 203 is added to the crawling queue, and the degree of separation of registered user A is 1. In this step, when the crawled friends of registered user A (user A1, user A2, user A3, ..., user An) are added to the crawling queue, the degree of separation of each of them may be marked as 2. When the crawled friends of user A1 (user A11, user A12, user A13, ..., user A1n) are added to the crawling queue, the degree of separation of each of them may be marked as 3, and so on.

When the preset condition is met, for example when the degree of separation of the user identifiers added to the crawling queue is 6, the acquisition of the friend lists of the users whose degree of separation is 6 is stopped.

In this embodiment, three looping thread groups may be used, which respectively perform the following three functions: crawling the queue to be crawled, generating the new queue to be crawled, and saving the crawled data.
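One possible arrangement of the three looping thread groups is sketched below with Python's standard `threading` and `queue` modules. Everything here is an assumption made for illustration: the in-memory `NETWORK`, the function `run_pipeline`, the use of several crawler threads but a single thread for each of the other two groups, and the shutdown scheme based on `None` sentinels.

```python
import queue
import threading

# Hypothetical stand-in for the social software: id -> (webpage data, friend list).
NETWORK = {
    "A": ("page of A", ["B", "C"]),
    "B": ("page of B", ["A", "D"]),
    "C": ("page of C", ["A"]),
    "D": ("page of D", ["B"]),
}


def run_pipeline(seed_ids, num_crawlers=2):
    pending = queue.Queue()    # identifiers waiting to be crawled
    fetched = queue.Queue()    # friend lists waiting to be turned into new work
    to_save = queue.Queue()    # (identifier, webpage data) waiting to be saved
    seen = set(seed_ids)       # touched only by the enqueue thread after start
    saved = {}                 # touched only by the save thread

    def crawl_loop():          # group 1: crawl the queue to be crawled
        while (user_id := pending.get()) is not None:
            page, friends = NETWORK[user_id]
            to_save.put((user_id, page))
            fetched.put(friends)

    def enqueue_loop():        # group 2: generate the new queue to be crawled
        while (friends := fetched.get()) is not None:
            for friend_id in friends:
                if friend_id not in seen:
                    seen.add(friend_id)
                    pending.put(friend_id)
            pending.task_done()   # a user counts as done once its friends are queued

    def save_loop():           # group 3: save the crawled data
        while (item := to_save.get()) is not None:
            saved[item[0]] = item[1]

    threads = [threading.Thread(target=crawl_loop) for _ in range(num_crawlers)]
    threads += [threading.Thread(target=enqueue_loop), threading.Thread(target=save_loop)]
    for user_id in seed_ids:
        pending.put(user_id)
    for thread in threads:
        thread.start()
    pending.join()             # returns when every queued user has been fully processed
    for _ in range(num_crawlers):
        pending.put(None)      # sentinels stop each loop
    fetched.put(None)
    to_save.put(None)
    for thread in threads:
        thread.join()
    return saved


print(sorted(run_pipeline(["A"])))
```

Calling `task_done()` only after a user's friends have been queued keeps the unfinished-task count above zero while any new work can still appear, so `pending.join()` cannot return early.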

As shown in Fig. 3 and Fig. 4, an embodiment of the present invention provides a social software based data acquisition apparatus. The apparatus embodiment may be implemented in software, in hardware, or in a combination of the two. At the hardware level, Fig. 3 is a hardware structure diagram of the device in which the social software based data acquisition apparatus of this embodiment is located. In addition to the processor, memory, network interface, and non-volatile storage shown in Fig. 3, the device in which the apparatus is located may generally include other hardware, such as a forwarding chip responsible for processing packets. Taking a software implementation as an example, as shown in Fig. 4, the apparatus in the logical sense is formed by the CPU of the device in which it is located reading the corresponding computer program instructions from non-volatile storage into memory and running them. The social software based data acquisition apparatus provided by this embodiment comprises:

a selection unit 401, configured to select at least one registered user in target social software;

an adding unit 402, configured to add the user identifier corresponding to each selected registered user to a crawling queue;

a crawling unit 403, configured to crawl, one by one and according to the user identifiers in the crawling queue, the webpage data and the friend list of the user corresponding to each user identifier;

the adding unit 402 being further configured to add each user identifier in the crawled friend lists to the crawling queue and to trigger the crawling unit to perform the corresponding operation, and to stop triggering the crawling unit when a preset condition is met.

In a preferred embodiment of the present invention, as shown in Fig. 5, the data acquisition apparatus may further comprise:

a sending unit 501, configured to store the crawled webpage data of the user corresponding to each user identifier in a database;

and/or,

the sending unit 501 being configured to store each user identifier that has been added to the crawling queue in the database.

Further, the adding unit 402 is specifically configured to compare each user identifier in the crawled friend lists, one by one, with the user identifiers already added to the database, and to add to the crawling queue only those user identifiers not yet stored in the database.

The apparatus further comprises: a marking unit 502, configured to mark the degree of separation of each user identifier added to the crawling queue, wherein the degree of separation of the user identifiers first added to the crawling queue is 1, and the degree of separation marked for each user identifier in the crawled friend list of a target user is 1 greater than the degree of separation of that target user's identifier.

Further, the crawling unit 403 is specifically configured to divide the user identifiers in the crawling queue into multiple Map tasks and assign the Map tasks to at least two servers, the at least two servers crawling in parallel the Map tasks assigned to them, and, after processing ends, to perform a Reduce merge on the data processed by each server.

In summary, the embodiments of the present invention can achieve at least the following beneficial effects:

1. In the embodiments of the present invention, the webpage data of users is crawled by exploiting the large number of registered users in social software and the friend relationships between them. Because social software has a large number of registered users, the crawled webpage data covers correspondingly more users, which can improve the accuracy of the analysis result.

2. In the embodiments of the present invention, to avoid crawling the webpage data of the same user repeatedly, each user identifier in the crawled friend lists may be compared, one by one, with the user identifiers already added to the database, and only the user identifiers not yet stored in the database are added to the crawling queue. This reduces the amount of computation in the subsequent process.

3. In the embodiments of the present invention, because the volume of crawled webpage data is large, the user identifiers in the crawling queue may be divided into multiple Map tasks, and the Map tasks assigned to at least two servers. The at least two servers crawl in parallel the Map tasks assigned to them, and after processing ends, a Reduce merge is performed on the data processed by each server. This improves the efficiency of webpage data crawling.

Because the information exchange between the units of the above device, their execution processes, and similar details are based on the same concept as the method embodiments of the present invention, reference may be made to the description of the method embodiments for specifics, which are not repeated here.

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise" and "include", and any variants thereof, are intended to cover non-exclusive inclusion, so that a process, method, article, or device that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.

A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

Finally, it should be noted that the foregoing describes only preferred embodiments of the present invention, which serve only to illustrate the technical solutions of the present invention and are not intended to limit its protection scope. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the protection scope of the present invention.

Claims

1. A social software based data acquisition method, characterized by comprising:

2. The method according to claim 1, characterized by further comprising:

storing the crawled webpage data of the user corresponding to each user identifier in a database;

and/or,

adding each user identifier that has been added to the crawling queue to the database.

3. The method according to claim 2, characterized in that adding each user identifier in the crawled friend lists to the crawling queue comprises:

4. The method according to claim 1, characterized in that,

5. The method according to any one of claims 1 to 4, characterized in that crawling the webpage data and the friend list of the user corresponding to each user identifier comprises:

6. A social software based data acquisition apparatus, characterized by comprising:

7. The data acquisition apparatus according to claim 6, characterized by further comprising:

and/or,

8. The data acquisition apparatus according to claim 7, characterized in that the adding unit is specifically configured to compare each user identifier in the crawled friend lists, one by one, with the user identifiers already added to the database, and to add to the crawling queue only those user identifiers not yet stored in the database.

9. The data acquisition apparatus according to claim 6, characterized in that,

10. The data acquisition apparatus according to any one of claims 6 to 9, characterized in that the crawling unit is specifically configured to divide the user identifiers in the crawling queue into multiple Map tasks and assign the Map tasks to at least two servers, the at least two servers crawling in parallel the Map tasks assigned to them, and, after processing ends, to perform a Reduce merge on the data processed by each server.