CN104915438A

CN104915438A - Method for acquiring PCU association data in specific topic microblogs

Info

Publication number: CN104915438A
Application number: CN201510358782.0A
Authority: CN
Inventors: 刘均; 陈浩; 米建红; 吕彦章; 占梦婷
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2015-06-25
Filing date: 2015-06-25
Publication date: 2015-09-16
Anticipated expiration: 2035-06-25
Also published as: CN104915438B

Abstract

The invention discloses a method for acquiring PCU association data from microblogs, and aims to overcome the technical defect that the prior art cannot acquire associated microblog posts, comments and posters. The method comprises the following steps: (1) gaining a data access permission: automatically filling in identity authentication information by analyzing the HTML (Hypertext Markup Language) tags of the login page to gain the data access permission; (2) downloading PCU association data pages: automatically and sequentially downloading pages containing PCU association data under the guidance of the logical relations of the PCU data, according to the HTML structures and tag semantics of microblog pages; and (3) structured parsing and construction of the PCU association data: fusing post relations, user-friend relations and user-post membership relations to construct a heterogeneous network, namely a PCU association data network. With this method, the PCU association data in Sina Weibo can be acquired automatically, a structured association data network is constructed, and a good data set is provided for subsequent social network mining.

Description

A method for acquiring PCU association data from microblogs on a specific topic

Technical field

The invention belongs to the field of computer-based social network data acquisition, and specifically relates to a method for automatically acquiring PCU association data on a specific topic from microblogs.

Background art

The Internet and Web 2.0 have driven the rapid development of social networks. Social networks have very large user bases and generate data quickly, and the steadily accumulating data and its complex relational structure make information increasingly difficult to acquire and understand.

Sina Weibo, the most influential social networking site in China, contains a large amount of potentially valuable information. An important part of studying this information is analyzing the posts, post comments and posters of a specific topic in Sina Weibo. These data are scattered across different pages, so people cannot quickly and accurately find or understand the useful information among a large number of pages.

A large amount of valuable information is contained in the inherent relational structure of these data. An automated extraction method is therefore needed to obtain well-structured data from them and to fuse it into Sina Weibo PCU association data suited to social network data mining.

A novelty search by the applicant found no patent on acquiring PCU association data from Sina Weibo; one related granted patent was retrieved:

1. Topic crawler system based on social annotation [Patent No.: ZL200910062020.0]. In patent 1, the system makes full use of the social annotations of web pages to judge page relevance and to guide the crawling direction of the crawler, providing high-quality web content for a topic search engine. The method described in that patent mainly addresses the crawling direction of a web crawler and relies on a knowledge base formed from social annotations. However, it cannot establish association relations among the acquired content and cannot adapt to the complex associations inherent in social networks, and therefore cannot organize social network data effectively.

Summary of the invention

The present invention proposes a new data acquisition strategy. Its object is to provide a method for automatically acquiring the PCU association data of a specific topic in microblogs, in which the logical association relations between elements of the unstructured data guide the acquisition path, and a structured data set describing the elements and their association relations is constructed automatically, thereby providing a data basis for further data analysis and knowledge mining. The method features a high degree of association between elements, strongly structured element relations and high acquisition efficiency.

To achieve the above object, the present invention adopts the following technical solution:

A method for acquiring PCU association data on a specific topic from microblogs, comprising the following processes:

(1) Data access permission acquisition: by analyzing the HTML of the login page, identify the input tags for the username and password, automatically fill in and submit the login authentication request, and thereby simulate a user login.

(2) PCU association data page download: analyze the HTML structure of the microblog page to obtain, for each post on the page, the association entry points of three classes of entity: post content P, post comment C and poster U. Then, through the association entry points of these three classes of entity, obtain the HTML pages of the post content, the post comments and the poster's friend relations.

(3) Structured parsing and construction of PCU association data: perform hierarchical, class-by-class parsing of the HTML pages of the three classes of associated entity, and, according to the membership and co-occurrence relations between the associated entities, construct the heterogeneous network G=(P, C, U, f, g, h).

Process (1), data access permission acquisition, proceeds as follows:

1st step: start the IE9 browser through selenium, automatically open the Sina Weibo login page http://www.weibo.com/login.php, and locate the HTML tags <input name="username"> and <input name="password"> used to enter the account and password;

2nd step: using the tags located in the 1st step, have selenium automatically fill in a username and password that have already been registered;

3rd step: determine whether a verification code must be entered according to whether the login page contains the HTML tag <table name="verifycode">;

4th step: if a verification code input tag is present, obtain the verification code with a character extraction technique; where this technique fails, fall back to manual recognition and manual input;

5th step: locate the login tag <a class="W_btn_g"> through selenium and trigger it automatically to complete the data access permission acquisition.
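The five login steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the patent's implementation: `login` and `solve_captcha` are hypothetical names, `driver` is assumed to be any Selenium WebDriver instance (the patent uses IE9), and the locator strings "name" and "css selector" are the values of selenium's `By.NAME` and `By.CSS_SELECTOR`. The tag names come from the text; how the verification code is read and typed in is left entirely to the caller-supplied callback, since the text does not specify it.

```python
LOGIN_URL = "http://www.weibo.com/login.php"


def login(driver, username, password, solve_captcha=None):
    """Run the five login steps; return True if a verification code was handled."""
    # Step 1: open the login page and locate the account/password inputs.
    driver.get(LOGIN_URL)
    user_box = driver.find_element("name", "username")
    pass_box = driver.find_element("name", "password")

    # Step 2: fill in an already registered username and password.
    user_box.send_keys(username)
    pass_box.send_keys(password)

    # Step 3: a verification code is needed only if the verifycode table exists.
    captcha_needed = len(
        driver.find_elements("css selector", 'table[name="verifycode"]')
    ) > 0

    # Step 4: delegate to a hypothetical callback (OCR first, manual fallback).
    if captcha_needed:
        if solve_captcha is None:
            raise RuntimeError("verification code present but no solver given")
        solve_captcha(driver)

    # Step 5: trigger the login tag <a class="W_btn_g">.
    driver.find_element("css selector", "a.W_btn_g").click()
    return captcha_needed
```

Passing the driver in, rather than creating it inside the function, keeps the sketch independent of the browser choice and lets the flow be exercised with a stub driver.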

The steps of process (2), PCU association data page download, are:

1st step: open the microblog search page http://s.weibo.com/, select the "comprehensive" search query interface, and initialize the post link queue pool Url_p, user link queue pool Url_u, comment link queue pool Url_c, friend relation list link queue pool Url_f, post page store Pages_p, user page store Pages_u, comment page store Pages_c and friend relation page store Pages_f to empty;

2nd step: automatically enter the specific topic keyword into the search box <input type="text"> and automatically trigger the "search" submit button <input type="text" class="searchInp_form">. Obtain the link url_p of the keyword search page from selenium's current_url variable and store it in Url_p;

3rd step: take each url_p out of Url_p in turn, load the corresponding page with selenium's get(url) function, extract the page source S_p from selenium's page_source variable, and store it in the page store Pages_p;

4th step: check whether the user URL queue pool Url_u, comment URL queue pool Url_c and friend relation link queue pool Url_f are empty. If they are empty, go to the 1st step of process (3). If they are not empty, take out the posts' user links url_u and comment links url_c in turn, start three threads to download the user pages S_u, comment pages S_c and friend relation pages S_f respectively, store them in the user page store Pages_u, comment page store Pages_c and friend relation page store Pages_f, and proceed to the 5th step of process (3).
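The queue pools, page stores and three download threads described above might be organized as in the following sketch. It is an illustration under stated assumptions: `make_pools`, `download_posts`, `download_related` and the `fetch` callable are hypothetical names, and `fetch(url)` stands in for selenium's `get(url)` followed by reading `page_source`. Each thread drains only its own queue and appends only to its own page store, so no locking is needed in this simplified form.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

KINDS = ("p", "u", "c", "f")  # post, user, comment, friend relation


def make_pools():
    """Step 1: create empty URL queue pools (Url_*) and page stores (Pages_*)."""
    url_pools = {kind: deque() for kind in KINDS}
    page_pools = {kind: [] for kind in KINDS}
    return url_pools, page_pools


def download_posts(url_pools, page_pools, fetch):
    """Step 3: take each url_p from Url_p and store its source in Pages_p."""
    while url_pools["p"]:
        page_pools["p"].append(fetch(url_pools["p"].popleft()))


def download_related(url_pools, page_pools, fetch):
    """Step 4: return False if Url_u, Url_c and Url_f are all empty;
    otherwise drain them with three threads and return True."""
    related = ("u", "c", "f")
    if not any(url_pools[kind] for kind in related):
        return False  # caller goes back to parsing (process (3), 1st step)

    def drain(kind):
        while url_pools[kind]:
            page_pools[kind].append(fetch(url_pools[kind].popleft()))

    with ThreadPoolExecutor(max_workers=3) as pool:
        list(pool.map(drain, related))  # list() surfaces worker exceptions
    return True
```

The boolean return value mirrors the branch in the 4th step: the caller alternates between this download round and the parsing steps of process (3) until no new links are produced.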

The structured parsing of PCU association data in process (3) proceeds as follows:

1st step: take each S_p out of Pages_p in turn, use Beautifulsoup to locate the tag <div class="WB_cardwrap S_bg2 clearfix">, and parse out all posts on the post page. For each post, parse the necessary fields, including post id (post_id), poster user name (poster), poster user id (user_id), poster avatar, poster home page link url_u, post content, post time, comment count and forward count;

2nd step: put the poster home page link url_u parsed in the previous step, whose pattern is http://weibo.com/u/<user_id>, into the user link queue pool Url_u;

3rd step: obtain the comment link url_c by combining the post id with the post comment response address captured with firebug. Its pattern is http://weibo.com/aj/v6/comment/big?ajwvr=6&id=<post_id>&page=<num>. Put it into the comment link queue pool Url_c;

4th step: from the poster's user_id, obtain the friend relation links url_f, whose patterns are http://weibo.com/p/100505<user_id>/follow?page=<num> and http://weibo.com/p/100505<user_id>/follow?relate=fans&page=<num>;

5th step: check whether Pages_u, Pages_c and Pages_f are empty. If they are empty, go to the 4th step of process (2). If they are not empty, take each S_u out of Pages_u in turn and use Beautifulsoup to parse the user information fields, including user id, user name, following count, follower count, signature and personal information; take each S_c out of Pages_c in turn and use Beautifulsoup to parse the post comment fields, including commenter name, commenter id, comment content and comment time, which completes the acquisition of one group of PCU data; take each S_f out of Pages_f in turn and use Beautifulsoup to parse the fields of the poster's friends, including nickname, id, home page link, avatar, following count, follower count, microblog count and place of residence;
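The link patterns given in the 2nd to 4th steps can be assembled with small helper functions, sketched below. The function names are hypothetical; the URL patterns are taken from the text as written and reflect the Sina Weibo site at the time of filing, so they are not guaranteed to resolve today.

```python
from urllib.parse import urlencode


def user_home_url(user_id):
    """url_u: the poster's home page link."""
    return f"http://weibo.com/u/{user_id}"


def comment_url(post_id, page=1):
    """url_c: one page of the comment response address for a post."""
    query = urlencode({"ajwvr": 6, "id": post_id, "page": page})
    return f"http://weibo.com/aj/v6/comment/big?{query}"


def friend_urls(user_id, page=1):
    """url_f: (following list link, fans list link) for one poster."""
    base = f"http://weibo.com/p/100505{user_id}/follow"
    following = f"{base}?{urlencode({'page': page})}"
    fans = f"{base}?{urlencode({'relate': 'fans', 'page': page})}"
    return following, fans
```

For example, `comment_url("123", page=2)` yields `http://weibo.com/aj/v6/comment/big?ajwvr=6&id=123&page=2`; iterating `page` upward walks through the paginated comment list.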

The construction of PCU association data in process (3) proceeds as follows:

1st step: construct an unweighted directed heterogeneous network G=(P, C, U, f, g, h) and initialize G to empty;

2nd step: from the membership relation between posts and comments, determine whether an item is an original post or a comment, and determine whether two posts are in a forwarding relation; fusing these yields the post relation network f;

3rd step: from each user's following and followed-by information, obtain the user friend relation network g;

4th step: construct the user-post membership network h from the membership relations between users and posts and between users and post comments;

5th step: fuse the three classes of entity and the three classes of association relation to obtain the heterogeneous network G=(P, C, U, f, g, h).
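One possible in-memory representation of G=(P, C, U, f, g, h) and of the five construction steps is sketched below. The record layouts are assumptions made for illustration: posts are dicts with `post_id`, `user_id` and an optional `repost_of`, comments are dicts with `comment_id`, `post_id` and `user_id`, and follow relations are `(follower_id, followee_id)` pairs. Edges are stored as tuples in sets, which matches an unweighted directed network; adding commenters and friends to U is also a choice of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class PCUNetwork:
    P: set = field(default_factory=set)  # post ids
    C: set = field(default_factory=set)  # comment ids
    U: set = field(default_factory=set)  # user ids
    f: set = field(default_factory=set)  # post relations: (kind, source, target)
    g: set = field(default_factory=set)  # friend relations: (follower, followee)
    h: set = field(default_factory=set)  # membership: (user, post or comment)


def build_network(posts, comments, follows):
    G = PCUNetwork()  # step 1: start from an empty network

    for post in posts:
        G.P.add(post["post_id"])
        G.U.add(post["user_id"])
        # step 2: forwarding relation between two posts
        if post.get("repost_of"):
            G.f.add(("repost", post["post_id"], post["repost_of"]))
        # step 4: the poster owns the post
        G.h.add((post["user_id"], post["post_id"]))

    for comment in comments:
        G.C.add(comment["comment_id"])
        G.U.add(comment["user_id"])
        # step 2: the comment belongs to its post
        G.f.add(("comment", comment["comment_id"], comment["post_id"]))
        # step 4: the commenter owns the comment
        G.h.add((comment["user_id"], comment["comment_id"]))

    # step 3: friend relations from following / followed-by information
    for follower, followee in follows:
        G.U.update((follower, followee))
        G.g.add((follower, followee))

    return G  # step 5: entities and relations fused in one structure
```

Using sets makes repeated insertions harmless, which is convenient when the same user or post is encountered on several downloaded pages.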

The present invention acquires and organizes associated elements in microblogs quickly and accurately. It largely solves the problem that logically associated microblog data are hard to analyze because they are scattered across different pages, and it provides well-structured data for social network data mining. Specifically, compared with the prior art, the present invention has the following advantages:

(1) High degree of association in the acquired data: the acquired data and its organization conform to the inherent association relations between the data. Because the acquisition path is guided by the logical associations of the data, the result carries semantic information and reflects the intrinsic association relations between the data as faithfully as possible.

(2) Efficient data acquisition and organization: the acquisition path is guided by the relations between the data, which avoids unproductive overhead and speeds up the progression from acquisition to organization. This improves the efficiency of the method and strengthens its applicability to a large-scale online social networking system such as Sina Weibo.

Brief description of the drawings

Fig. 1 is a flow chart of the automatic acquisition of PCU association data from Sina Weibo according to the present invention;

Fig. 2 is a flow chart of data access permission acquisition according to the present invention;

Fig. 3 is a flow chart of the download, parsing and construction of PCU association data according to the present invention;

Fig. 4 is a schematic example of PCU association data.

Detailed description

The present invention is further described below with reference to the accompanying drawings and an example.

The process of constructing PCU association data from the elements and logical relations scattered across the pages of Sina Weibo is shown in Fig. 1 and can be divided into the following 3 processes:

(1) Data access permission acquisition, comprising 5 steps.

1st step: start the IE browser through selenium, automatically open the Sina Weibo login page http://www.weibo.com/login.php, and locate the HTML tags <input name="username"> and <input name="password"> used to enter the account and password;

2nd step: using the tags from the 1st step, have selenium automatically fill in a registered username and password, for example account "robbersun@sohu.com" and password "897fgCKdf";

4th step: if a verification code tag is present, obtain the verification code with a third-party character extraction technique; where this technique fails, fall back to manual recognition and manual input;

The flow of these steps is shown in Fig. 2.

(2) PCU association data page download, comprising 4 steps.

1st step: open the microblog search page http://s.weibo.com/, select the "comprehensive" search query interface, and initialize the post URL queue pool Url_p, user URL queue pool Url_u, comment URL queue pool Url_c, friend relation list URL queue pool Url_f, post page store Pages_p, user page store Pages_u, comment page store Pages_c and friend relation page store Pages_f to empty;

2nd step: automatically enter the specific topic keyword, for example "the poorest president in the world", into the search box <input type="text">, automatically trigger the "search" submit button <input type="text" class="searchInp_form">, obtain the link url_p of the resulting posts from selenium's current_url variable, and store it in Url_p;

4th step: check whether the user URL queue pool Url_u, comment URL queue pool Url_c and friend relation link queue pool Url_f are empty. If they are empty, go to the 1st step of process (3). If they are not empty, take out the posts' user links url_u and comment links url_c in turn, start three threads to download the user pages S_u, comment pages S_c and friend relation pages S_f respectively, store them in the user page store Pages_u, comment page store Pages_c and friend relation page store Pages_f, and proceed to the 5th step of the parsing part of process (3).

The flow of these steps is shown in Fig. 3.

(3) Structured parsing and construction of PCU association data, comprising the parsing of the downloaded pages and the construction of the PCU association relations; the former comprises 5 steps.

1st step: take each S_p out of Pages_p in turn, use Beautifulsoup to locate the tag <div class="WB_cardwrap S_bg2 clearfix">, and parse out all posts on the post page. For each post, parse the required fields, including "post id (post_id), user name (poster), user id (user_id), user avatar, user home page link url_u, post content, post time, comment count, forward count";

2nd step: put the user home page link url_u parsed in the previous step, whose pattern is http://weibo.com/u/<user_id>, into the user link queue pool Url_u;

3rd step: obtain the comment link url_c by combining the post id with the post comment response address captured with firebug. Its pattern is http://weibo.com/aj/v6/comment/big?ajwvr=6&id=<post_id>&page=<num>. Put it into the comment link queue pool Url_c;

5th step: check whether Pages_u, Pages_c and Pages_f are empty. If they are empty, go to the 4th step of process (2). If they are not empty, take each user page source S_u out of Pages_u in turn and use Beautifulsoup to parse the user information fields, including "user id, user name, following count, follower count, signature"; take each comment page source S_c out of Pages_c in turn and use Beautifulsoup to parse the post comment fields, including "commenter user name, commenter user id, comment content, comment time"; then take each friend relation list page source S_f out of Pages_f in turn and use Beautifulsoup to parse the fields of the poster's friends, including "nickname, id, home page link, avatar, following count, follower count, microblog count, place of residence";

The construction of PCU association data comprises 5 steps, as shown in Fig. 4.

5th step: fuse the three classes of entity and the three classes of network to obtain the heterogeneous network G=(P, C, U, f, g, h).

Claims

1. A method for acquiring PCU association data on a specific topic from microblogs, characterized in that: guided by the logical association relations between elements in Sina Weibo, the PCU association data formed by post content, post comments and posters is acquired automatically and in sequence, and is expressed as a heterogeneous network:

G=(P, C, U, f, g, h)

where P, C and U represent posts, post comments and posters respectively, f represents the interaction relation between posts and post comments, g represents the friend relation between users, and h represents the membership relation between users and posts and post comments;

The method comprises data access permission acquisition, PCU association data page download, and parsing and construction of PCU association data. The specific process is as follows:

Step1: data access permission acquisition:

On the Sina Weibo login page http://weibo.com/login.php, use the web automated testing tool selenium to automatically locate the tags <div class="inp username"> and <div class="inp password"> and fill in the username and password, trigger the submit button <div class="info_list login_btn">, complete authentication, and obtain permission to access the data;

Step2: PCU association data page download:

1) Obtain the post link url_p of the specific topic through the microblog query interface and store it in the post URL queue pool Url_p; take each post link url_p out in turn, download the post page S_p from it, and store it in the post page store Pages_p;

2) From the user URL queue pool Url_u, comment URL queue pool Url_c and friend relation list URL queue pool Url_f, take out in turn the posts' user links url_u, comment links url_c and friend relation links url_f; start three threads to download the user pages S_u, comment pages S_c and friend pages S_f respectively, and store them in the user page store Pages_u, comment page store Pages_c and friend relation store Pages_f;

Step3: parsing and construction of PCU association data:

1) Take each post page S_p out of the post page store Pages_p, each user page S_u out of the user page store Pages_u, each comment page S_c out of the comment page store Pages_c, and each friend page S_f out of the friend relation store Pages_f in turn; use the document parsing tool Beautifulsoup to locate and parse S_p, S_u, S_c and S_f hierarchically, and write the required tag values into the database. The url_u, url_c and url_f obtained by parsing S_p are put into the user URL queue pool Url_u, comment URL queue pool Url_c and friend relation list URL queue pool Url_f respectively;

2) Obtain the post relation network f from the interaction relation between posts and comments, obtain the user friend relation network g from users' following and fans information, and obtain the user-post membership network h from the membership relations between users and posts and post comments;

Finally, obtain the heterogeneous network G=(P, C, U, f, g, h).

2. The method for acquiring PCU association data on a specific topic from microblogs as claimed in claim 1, characterized in that the specific steps of step1 are:

1st step: start the IE9 browser through selenium, automatically open the Sina Weibo login page http://www.weibo.com/login.php, and locate the HTML tags <input name="username"> and <input name="password"> used to enter the account and password;

2nd step: using the tags located in the 1st step, have selenium automatically fill in the registered username and password;

3. The method for acquiring PCU association data on a specific topic from microblogs as claimed in claim 1, characterized in that the specific steps of step2 are:

1st step: open the microblog search page http://s.weibo.com/, select the "comprehensive" search query interface, and initialize Url_p, Url_u, Url_c, Url_f, Pages_p, Pages_u, Pages_c and Pages_f to empty;

2nd step: automatically enter the specific topic keyword into the search box <input type="text">, automatically trigger the "search" submit button <input type="text" class="searchInp_form">, obtain the link url_p of the keyword search page from selenium's current_url variable, and store it in Url_p;

4th step: check whether the user URL queue pool Url_u, comment URL queue pool Url_c and friend relation link queue pool Url_f are empty. If they are empty, go to the 1st step of the specific steps of step 1) of step3. If they are not empty, take out in turn the posts' user links url_u, comment links url_c and friend relation list links url_f, start three threads to download the user pages S_u, comment pages S_c and friend relation list pages S_f respectively, store them in the user page store Pages_u, comment page store Pages_c and friend relation page store Pages_f, and proceed to the 5th step of the specific steps of step 1) of step3.

4. The method for acquiring PCU association data on a specific topic from microblogs as claimed in claim 1, characterized in that the specific steps of step 1) of step3 are:

1st step: take each S_p out of Pages_p in turn, use Beautifulsoup to locate the tag <div class="WB_cardwrap S_bg2 clearfix">, and parse out all posts on the post page. For each post, parse the required fields, including post id (post_id), poster user name (poster), user id (user_id), user avatar, user home page link, post content, post time, comment count and forward count;

2nd step: put the poster home page link url_u parsed in the previous step, whose pattern is http://weibo.com/u/<user_id>, into the user link queue pool Url_u;

3rd step: obtain the comment link url_c by combining the post id with the post comment response address captured with firebug. Its pattern is http://weibo.com/aj/v6/comment/big?ajwvr=6&id=<post_id>&page=<num>. Put it into the comment link queue pool Url_c;

4th step: from the poster's user_id, obtain the friend relation links url_f, whose patterns are http://weibo.com/p/100505<user_id>/follow?page=<num> and http://weibo.com/p/100505<user_id>/follow?relate=fans&page=<num>;

5th step: check whether Pages_u, Pages_c and Pages_f are empty. If they are empty, go to the 4th step of the specific steps of step2. If they are not empty, take each S_u out of Pages_u in turn and use Beautifulsoup to parse the user information fields, including user id, user name, following count, follower count, signature and personal information; take each S_c out of Pages_c in turn and use Beautifulsoup to parse the post comment fields, including commenter name, commenter id, comment content and comment time; take each S_f out of Pages_f in turn and use Beautifulsoup to parse the fields of the poster's friends, including nickname, friend id, home page link, avatar, following count, follower count, microblog count and place of residence.

5. The method for acquiring PCU association data on a specific topic from microblogs as claimed in claim 1, characterized in that the specific steps of step 2) of step3 are: