CN104915438A - Method for acquiring PCU association data in specific topic microblogs - Google Patents

Method for acquiring PCU association data in specific topic microblogs

Info

Publication number
CN104915438A
Authority
CN
China
Prior art keywords
url
user
pages
page
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510358782.0A
Other languages
Chinese (zh)
Other versions
CN104915438B (en)
Inventor
刘均
陈浩
米建红
吕彦章
占梦婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2015-06-25
Filing date
2015-06-25
Publication date
2015-09-16
Application filed by Xian Jiaotong University
Priority to CN201510358782.0A
Publication of CN104915438A
Application granted
Publication of CN104915438B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Abstract

The invention discloses a method for acquiring PCU association data from microblogs, and aims to overcome the technical defect of incapability of acquiring associated microblog posts, comments and posters in the prior art. The method comprises the following steps: (1) gaining of a data access permission: automatically filling identity authentication information by analyzing login page HTML (Hypertext Markup Language) tags to gain the data access permission; (2) downloading of PCU association data pages: automatically and sequentially downloading pages containing PCU association data under the guidance of the logical relation of the PCU data according to the HTML structures and tag semantics of microblog pages; and (3) structured parsing and construction of the PCU association data: fusing post relations, user-friend relations and user-post sub-relations to construct a heterogeneous network, namely, a PCU association data network. Through adoption of the method, the PCU association data in Sina microblogs can be acquired automatically; the structured association data network is constructed; and a good data set is provided for subsequent social network mining.

Description

A method for acquiring PCU association data in specific-topic microblogs
Technical field
The invention belongs to the field of computer social network data acquisition, and specifically relates to a method for automatically acquiring PCU association data for a specific topic in microblogs.
Background art
The Internet and Web 2.0 have driven the rapid development of social networks. Social networks have large user bases and generate data quickly, and the steadily accumulating data and its complex relational structure make information increasingly difficult to acquire and understand.
Sina Weibo (the Sina microblog), the most influential social networking site in China, contains a large amount of potentially valuable information. An important part of studying this information is analyzing the posts, post comments, and posting users of a specific topic on Sina Weibo. These data are scattered across different pages, so people cannot quickly and accurately find or understand the useful information among a large number of pages.
A large amount of valuable information lies in the inherent relational structure of these data. An automated extraction method is therefore needed to obtain well-structured data from them and to fuse it into Sina Weibo PCU association data suited to social network data mining.
A novelty search by the applicant found no patent on acquiring PCU association data from Sina Weibo. The search retrieved one granted patent related to this patent:
1. A topic crawler system based on social annotation [Patent No. ZL200910062020.0]. In patent 1, the system makes full use of the social annotations of web pages to judge page relevance and guide the crawl direction of the crawler, providing high-quality web content to a topic search engine. The method described in that patent mainly addresses the crawl direction of a web crawler and relies on a knowledge base formed from social annotations. However, it cannot capture the association relations among the acquired content and cannot adapt to the inherently complex associations of social networks, so it cannot organize social network data effectively.
Summary of the invention
The invention proposes a new data acquisition strategy. Its object is to provide a method for automatically acquiring the PCU association data of a specific topic in microblogs: the logical association relations between elements in unstructured data guide the acquisition path, and a structured data set is built automatically to describe the elements and their association relations. This provides a data basis for further data analysis and knowledge mining, and features a high degree of association between elements, strongly structured element relations, and high acquisition efficiency.
To achieve the above object, the invention adopts the following technical solution:
A method for acquiring PCU association data for a specific topic from microblogs, comprising the following processes:
(1) Data access permission acquisition: by analyzing the HTML of the fetched login page, identify the input tags for the username and password, automatically fill in and submit the login authentication request, and thereby complete a simulated user login.
(2) PCU association data page download: analyze the HTML structure of the microblog page to obtain, for each post on the page, the association entry points of three classes of entities: post content P, post comment C, and poster U. Then, through these entry points, synchronously acquire the HTML pages of the post content, the post comments, and the poster's friend relations.
(3) Structured parsing and construction of the PCU association data: parse the HTML pages of the three classes of associated entities hierarchically by class, and build the heterogeneous network G=(P, C, U, f, g, h) according to the membership and co-occurrence relations between the associated entities.
The data access permission acquisition of process (1) proceeds as follows:
Step 1: start the IE9 browser through selenium, automatically open the Sina Weibo login page http://www.weibo.com/login.php, and locate the HTML tags <input name="username"> and <input name="password"> for entering the account and password;
Step 2: using the tags located in step 1, have selenium automatically fill in a successfully registered username and password;
Step 3: determine whether a verification code must be entered according to whether the login page contains the HTML tag <table name="verifycode">;
Step 4: if a verification code input tag exists, obtain the verification code with a character extraction technique; if this technique fails, fall back to manual recognition and manual input;
Step 5: locate the login tag <a class="W_btn_g"> through selenium and automatically trigger it to complete the data access permission acquisition.
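The tag checks in steps 1 and 3 can be illustrated without a browser. The following is a minimal sketch, not the patented implementation: it uses Python's standard `html.parser` module instead of selenium, and the `SAMPLE` markup, the class `LoginFormScanner`, and the function `scan_login_page` are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser


class LoginFormScanner(HTMLParser):
    """Record which login-related tags appear in a page."""

    def __init__(self):
        super().__init__()
        self.input_names = set()    # name attributes of all <input> tags
        self.needs_captcha = False  # True once <table name="verifycode"> is seen

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get("name")
        if tag == "input" and name:
            self.input_names.add(name)
        elif tag == "table" and name == "verifycode":
            self.needs_captcha = True


def scan_login_page(html):
    """Return (has_credential_fields, needs_captcha) for a login page."""
    scanner = LoginFormScanner()
    scanner.feed(html)
    has_fields = {"username", "password"} <= scanner.input_names
    return has_fields, scanner.needs_captcha


# Hypothetical page fragment for illustration; not the real login markup.
SAMPLE = """
<form>
  <input name="username"><input name="password" type="password">
  <table name="verifycode"><tr><td><img src="pin.png"></td></tr></table>
  <a class="W_btn_g">Log in</a>
</form>
"""
print(scan_login_page(SAMPLE))
```

The second value of the returned pair corresponds to the decision in step 3: only when it is true does the verification code handling of step 4 apply.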
The steps of the PCU association data page download of process (2) are:
Step 1: open the microblog search page http://s.weibo.com/, select the "comprehensive" search interface, and initialize as empty the post link queue pool Url_p, the user link queue pool Url_u, the comment link queue pool Url_c, the friend relation list link queue pool Url_f, the post page store Pages_p, the user page store Pages_u, the comment page store Pages_c, and the friend relation page store Pages_f;
Step 2: automatically enter the specific topic keyword into the search box <input type="text"> and automatically trigger the "search" submit button <input type="text" class="searchInp_form">. Obtain the link url_p of the keyword search page through selenium's current_url attribute and store it in Url_p;
Step 3: take each url_p out of Url_p in turn, fetch the corresponding page with selenium's get(url) function, extract the page source S_p with selenium's page_source attribute, and store it in the page store Pages_p;
Step 4: check whether the user URL queue pool Url_u, the comment URL queue pool Url_c, and the friend relation link queue pool Url_f are empty. If they are empty, go to step 1 of process (3). If not, take out the post's user link url_u, comment link url_c, and friend relation link url_f in turn, start three threads to download the user page S_u, the comment page S_c, and the friend relation page S_f respectively, store them in the user page store Pages_u, the comment page store Pages_c, and the friend relation page store Pages_f, and proceed to step 5 of process (3).
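Step 4 can be sketched with three queues and three worker threads. This is a minimal sketch under stated assumptions: `fetch_page` is a hypothetical stand-in for selenium's `get(url)` followed by reading `page_source`, and `make_pool`, `drain`, and `download_associated_pages` are names introduced here, not taken from the patent.

```python
import queue
import threading


def fetch_page(url):
    """Hypothetical stand-in for selenium's get(url) plus page_source."""
    return f"<html><!-- source of {url} --></html>"


def make_pool(urls):
    """Build a FIFO queue pool from an iterable of URLs."""
    pool = queue.Queue()
    for url in urls:
        pool.put(url)
    return pool


def drain(url_pool, page_store):
    """Download every URL currently in one pool into its own page store."""
    while True:
        try:
            url = url_pool.get_nowait()
        except queue.Empty:
            return
        page_store.append(fetch_page(url))


def download_associated_pages(url_u, url_c, url_f):
    """Run one thread per pool: user pages, comment pages, friend list pages."""
    pages_u, pages_c, pages_f = [], [], []
    workers = [
        threading.Thread(target=drain, args=(pool, store))
        for pool, store in ((url_u, pages_u), (url_c, pages_c), (url_f, pages_f))
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return pages_u, pages_c, pages_f
```

Each page store is written by exactly one thread, so the sketch needs no lock. A shared store would need one.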
The structured parsing of the PCU association data in process (3) follows these steps:
Step 1: take each S_p out of Pages_p in turn, use Beautifulsoup to locate the tag <div class="WB_cardwrapS_bg2 clearfix">, parse out all posts on the post page, and for each post parse the necessary fields: post id (post_id), poster username (poster), poster user id (user_id), poster avatar, poster home page link url_u, post content, post time, comment count, and repost count;
Step 2: put the poster home page link url_u parsed in the previous step, whose pattern is http://weibo.com/u/<user_id>, into the user link queue pool Url_u;
Step 3: obtain the comment link url_c by combining the post id with the post comment response address captured with firebug; its pattern is http://weibo.com/aj/v6/comment/big?ajwvr=6&id=<post_id>&page=<num>. Put it into the comment link queue pool Url_c;
Step 4: from the poster's user_id, obtain the friend relation links url_f, whose patterns are http://weibo.com/p/100505<user_id>/follow?page=<num> and http://weibo.com/p/100505<user_id>/follow?relate=fans&page=<num>;
Step 5: check whether Pages_u, Pages_c, and Pages_f are empty. If they are empty, go to step 4 of process (2). If not, take each S_u out of Pages_u in turn and use Beautifulsoup to parse the user information fields: user id, username, following count, follower count, personal signature, and personal profile. Take each S_c out of Pages_c in turn and use Beautifulsoup to parse the post comment fields: commenter name, commenter id, comment content, and comment time; this completes the acquisition of one group of PCU data. Take each S_f out of Pages_f in turn and use Beautifulsoup to parse the fields of the poster's friends: nickname, id, home page link, avatar, following count, follower count, microblog count, and place of residence.
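The link patterns in steps 2 to 4 amount to simple string templates. The sketch below builds them with the standard library; the function names are hypothetical, and the URL patterns are copied from the text as filed, so they may no longer match the live site.

```python
from urllib.parse import urlencode


def user_home_url(user_id):
    """Step 2: poster home page link url_u."""
    return f"http://weibo.com/u/{user_id}"


def comment_url(post_id, page=1):
    """Step 3: comment link url_c for one page of a post's comments."""
    query = urlencode({"ajwvr": 6, "id": post_id, "page": page})
    return f"http://weibo.com/aj/v6/comment/big?{query}"


def friend_urls(user_id, page=1):
    """Step 4: friend relation links url_f (following list, follower list)."""
    base = f"http://weibo.com/p/100505{user_id}/follow"
    following = f"{base}?{urlencode({'page': page})}"
    followers = f"{base}?{urlencode({'relate': 'fans', 'page': page})}"
    return following, followers


print(comment_url("123", page=2))
```

Incrementing `page` walks through paginated comment and friend lists, which is presumably what the `<num>` placeholder in the patterns stands for.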
The construction of the PCU association data in process (3) follows these steps:
Step 1: build a directed, unweighted heterogeneous network G=(P, C, U, f, g, h) and initialize G as empty;
Step 2: from the membership relation between posts and comments, determine whether an item is an original post or a comment, and determine whether two posts are in a repost relation; fusing these yields the post relation network f;
Step 3: obtain the user friend relation network g from the users' following and follower information;
Step 4: build the user-post membership network h from the membership relations between users and posts and between users and post comments;
Step 5: fuse the three classes of entities and the three classes of association relations to obtain the heterogeneous network G=(P, C, U, f, g, h).
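One possible in-memory form of G is three node sets plus three sets of directed, unweighted edges. The sketch below is an assumption-laden illustration: the class `PCUNetwork`, its method names, the edge directions, and the sample ids are all hypothetical choices, since the text does not fix a storage format.

```python
from dataclasses import dataclass, field


@dataclass
class PCUNetwork:
    """G = (P, C, U, f, g, h) stored as node sets and directed edge sets."""
    P: set = field(default_factory=set)  # post ids
    C: set = field(default_factory=set)  # comment ids
    U: set = field(default_factory=set)  # user ids
    f: set = field(default_factory=set)  # (comment or repost, original post)
    g: set = field(default_factory=set)  # (follower, followed user)
    h: set = field(default_factory=set)  # (user, post or comment they wrote)

    def add_post(self, post_id, author_id, repost_of=None):
        self.P.add(post_id)
        self.U.add(author_id)
        self.h.add((author_id, post_id))
        if repost_of is not None:  # repost relation between two posts
            self.P.add(repost_of)
            self.f.add((post_id, repost_of))

    def add_comment(self, comment_id, post_id, author_id):
        self.C.add(comment_id)
        self.P.add(post_id)
        self.U.add(author_id)
        self.f.add((comment_id, post_id))  # the comment belongs to the post
        self.h.add((author_id, comment_id))

    def add_follow(self, follower_id, followed_id):
        self.U.update((follower_id, followed_id))
        self.g.add((follower_id, followed_id))


net = PCUNetwork()                        # step 1: G starts empty
net.add_post("p1", "u1")                  # u1 writes an original post
net.add_comment("c1", "p1", "u2")         # u2 comments on it
net.add_post("p2", "u2", repost_of="p1")  # u2 reposts it
net.add_follow("u2", "u1")                # u2 follows u1
```

Storing edges as tuples in sets keeps the network unweighted and makes repeated insertions of the same relation harmless, which suits a crawler that may see the same page more than once.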
The invention acquires and organizes associated elements in microblogs quickly and accurately. It largely solves the problem that logically associated data in microblogs is hard to analyze because it is scattered across different pages, and it provides well-structured data for social network data mining. Specifically, the advantages of the invention over the prior art are:
(1) The acquired data is highly associated: the acquired data and its organization conform to the inherent association relations between the data. The acquisition path is guided by the logical associations of the data, which carry semantic information, so the result reflects the intrinsic association relations between the data as faithfully as possible.
(2) Data acquisition and organization are efficient: guiding the acquisition path by the relations between data avoids unnecessary overhead and speeds up the progression from acquisition to organization. This improves the efficiency of the method and strengthens its applicability to a large-scale online social networking system such as Sina Weibo.
Description of the drawings
Fig. 1 is a flowchart of the invention automatically acquiring PCU association data from Sina Weibo;
Fig. 2 is a flowchart of the data access permission acquisition of the invention;
Fig. 3 is a flowchart of the download, parsing, and construction of the PCU association data of the invention;
Fig. 4 is a schematic diagram of an example of PCU association data.
Detailed description
The invention is further described below with reference to the drawings and an example.
The process of building PCU association data from the elements and logical relations scattered across the pages of Sina Weibo is shown in Fig. 1 and can be divided into the following 3 processes:
(1) Data access permission acquisition, comprising 5 steps.
Step 1: start the IE browser through selenium, automatically open the Sina Weibo login page http://www.weibo.com/login.php, and locate the HTML tags <input name="username"> and <input name="password"> for entering the account and password;
Step 2: using the tags from step 1, have selenium automatically fill in a successfully registered username and password, for example the account "robbersun@sohu.com" and the password "897fgCKdf";
Step 3: determine whether a verification code must be entered according to whether the login page contains the HTML tag <table name="verifycode">;
Step 4: if a verification code tag exists, obtain the verification code with a third-party character extraction technique; if this technique fails, fall back to manual recognition and manual input;
Step 5: locate the login tag <a class="W_btn_g"> through selenium and automatically trigger it to complete the data access permission acquisition.
The flow of these steps is shown in Fig. 2.
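The five steps can be sketched as one function over a WebDriver-like object. This is a hedged sketch, not the filed implementation: it assumes the `driver` exposes selenium's `get`, `find_element(by, value)`, and `find_elements(by, value)` methods, and passes the locator strategies as the plain strings `"name"`, `"css selector"`, and `"tag name"`, which are the values of selenium's `By.NAME`, `By.CSS_SELECTOR`, and `By.TAG_NAME` constants. The callables `recognize` and `ask_human`, and the assumption that the code is typed into an `<input>` inside the verification table, are hypothetical.

```python
LOGIN_URL = "http://www.weibo.com/login.php"


def solve_captcha(captcha_element, recognize, ask_human):
    """Try automatic character extraction first, then fall back to a person."""
    code = recognize(captcha_element)
    return code if code else ask_human(captcha_element)


def log_in(driver, username, password, recognize, ask_human):
    """Run the five login steps; return True if a verification code was needed."""
    # Steps 1-2: open the login page, locate both inputs, fill them in.
    driver.get(LOGIN_URL)
    driver.find_element("name", "username").send_keys(username)
    driver.find_element("name", "password").send_keys(password)

    # Step 3: a verification code is required only if this table is present.
    captcha = driver.find_elements("css selector", 'table[name="verifycode"]')
    if captcha:
        # Step 4 (assumption): the code goes into an <input> inside the table.
        code = solve_captcha(captcha[0], recognize, ask_human)
        captcha[0].find_element("tag name", "input").send_keys(code)

    # Step 5: trigger the login link.
    driver.find_element("css selector", "a.W_btn_g").click()
    return bool(captcha)
```

With selenium installed, a real browser driver object could be passed as `driver`; passing a small fake object instead makes the control flow testable without a browser.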
(2) PCU association data page download, comprising 4 steps.
Step 1: open the microblog search page http://s.weibo.com/, select the "comprehensive" search interface, and initialize as empty the post URL queue pool Url_p, the user URL queue pool Url_u, the comment URL queue pool Url_c, the friend relation list URL queue pool Url_f, the post page store Pages_p, the user page store Pages_u, the comment page store Pages_c, and the friend relation page store Pages_f;
Step 2: automatically enter the specific topic keyword, for example "the poorest president in the world", into the search box <input type="text">, automatically trigger the "search" submit button <input type="text" class="searchInp_form">, obtain the link url_p of the resulting posts through selenium's current_url attribute, and store it in Url_p;
Step 3: take each url_p out of Url_p in turn, fetch the corresponding page with selenium's get(url) function, extract the page source S_p with selenium's page_source attribute, and store it in the page store Pages_p;
Step 4: check whether the user URL queue pool Url_u, the comment URL queue pool Url_c, and the friend relation link queue pool Url_f are empty. If they are empty, go to step 1 of process (3). If not, take out the post's user link url_u, comment link url_c, and friend relation link url_f in turn, start three threads to download the user page S_u, the comment page S_c, and the friend relation page S_f respectively, store them in the user page store Pages_u, the comment page store Pages_c, and the friend relation page store Pages_f, and proceed to step 5 of process (3).
The flow of these steps is shown in Fig. 3.
(3) The structured parsing and construction of the PCU association data comprises parsing the downloaded pages and constructing the PCU association relations; the former comprises 5 steps.
Step 1: take each S_p out of Pages_p in turn, use Beautifulsoup to locate the tag <div class="WB_cardwrapS_bg2 clearfix">, parse out all posts on the post page, and for each post parse the required fields: "post id (post_id), username (poster), user id (user_id), user avatar, user home page link url_u, post content, post time, comment count, repost count";
Step 2: put the user home page link url_u parsed in the previous step, whose pattern is http://weibo.com/u/<user_id>, into the user link queue pool Url_u;
Step 3: obtain the comment link url_c by combining the post id with the post comment response address captured with firebug; its pattern is http://weibo.com/aj/v6/comment/big?ajwvr=6&id=<post_id>&page=<num>. Put it into the comment link queue pool Url_c;
Step 4: from the poster's user_id, obtain the friend relation links url_f, whose patterns are http://weibo.com/p/100505<user_id>/follow?page=<num> and http://weibo.com/p/100505<user_id>/follow?relate=fans&page=<num>;
Step 5: check whether Pages_u, Pages_c, and Pages_f are empty. If they are empty, go to step 4 of process (2). If not, take each user page source S_u out of Pages_u in turn and use Beautifulsoup to parse the user information fields: "user id, username, following count, follower count, personal signature". Take each comment page source S_c out of Pages_c in turn and use Beautifulsoup to parse the post comment fields: "commenter username, commenter user id, comment content, comment time". Then take each friend relation list source S_f out of Pages_f in turn and use Beautifulsoup to parse the fields of the poster's friends: "nickname, id, home page link, avatar, following count, follower count, microblog count, place of residence".
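The per-post field extraction of step 1 can be sketched as follows. The text uses Beautifulsoup; to stay dependency-free, this sketch swaps in Python's standard `html.parser`. The markup in `SAMPLE`, including the class names `post-card`, `poster`, and `content` and the `data-*` attributes, is entirely hypothetical and does not reflect the real page structure; `PostParser` and `parse_posts` are names introduced here.

```python
from html.parser import HTMLParser


class PostParser(HTMLParser):
    """Collect one dict of fields per post card found in a page."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._field = None  # text field currently being captured, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "post-card" in classes:
            self.posts.append({
                "post_id": attrs.get("data-post-id"),
                "user_id": attrs.get("data-user-id"),
                "home_url": None, "poster": "", "content": "",
            })
        elif self.posts and tag == "a" and "poster" in classes:
            self.posts[-1]["home_url"] = attrs.get("href")
            self._field = "poster"
        elif self.posts and tag == "p" and "content" in classes:
            self._field = "content"

    def handle_endtag(self, tag):
        if tag in ("a", "p"):
            self._field = None

    def handle_data(self, data):
        if self._field:
            self.posts[-1][self._field] += data.strip()


def parse_posts(page_source):
    parser = PostParser()
    parser.feed(page_source)
    return parser.posts


# Hypothetical post page fragment for illustration only.
SAMPLE = """
<div class="post-card clearfix" data-post-id="p1" data-user-id="42">
  <a class="poster" href="http://weibo.com/u/42">alice</a>
  <p class="content">first post</p>
</div>
<div class="post-card clearfix" data-post-id="p2" data-user-id="7">
  <a class="poster" href="http://weibo.com/u/7">bob</a>
  <p class="content">second post</p>
</div>
"""
for post in parse_posts(SAMPLE):
    print(post["post_id"], post["poster"], post["home_url"])
```

The `post_id` and `user_id` values recovered this way are the inputs from which the comment links and friend relation links of steps 3 and 4 are derived.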
The construction of the PCU association data comprises 5 steps, as shown in Fig. 4.
Step 1: build a directed, unweighted heterogeneous network G=(P, C, U, f, g, h) and initialize G as empty;
Step 2: from the membership relation between posts and comments, determine whether an item is an original post or a comment, and determine whether two posts are in a repost relation; fusing these yields the post relation network f;
Step 3: obtain the user friend relation network g from the users' following and follower information;
Step 4: build the user-post membership network h from the membership relations between users and posts and between users and post comments;
Step 5: fuse the three classes of entities and the three classes of networks to obtain the heterogeneous network G=(P, C, U, f, g, h).

Claims (5)

1. A method for acquiring PCU association data for a specific topic in microblogs, characterized in that: guided by the logical association relations between elements in Sina Weibo, the PCU association data formed by post content, post comments, and posting users is acquired automatically and in order, and is expressed as a heterogeneous network:
G=(P,C,U,f,g,h)
where P, C, and U denote posts, post comments, and posting users respectively; f denotes the interaction relations between posts and post comments; g denotes the friend relations between users; and h denotes the membership relations between users and posts and between users and post comments;
The method comprises data access permission acquisition, PCU association data page download, and parsing and construction of the PCU association data. The specific process is as follows:
Step1: data access permission acquisition:
On the Sina Weibo login page http://weibo.com/login.php, use the web automated testing tool selenium to automatically locate the tags <div class="inp username"> and <div class="inp password"> and fill in the username and password, trigger the submit button <div class="info_listlogin_btn">, complete identity authentication, and obtain permission to access the data;
Step2: PCU association data page download:
1) obtain the post link url_p of the specific topic through the microblog query interface, store it in the post URL queue pool Url_p, take each post link url_p out in turn, download the post page S_p from it, and store the page in the post page store Pages_p;
2) from the user URL queue pool Url_u, the comment URL queue pool Url_c, and the friend relation list URL queue pool Url_f, take out in turn the post's user link url_u, comment link url_c, and friend relation link url_f, start three threads to download the user page S_u, the comment page S_c, and the friend page S_f respectively, and store them in the user page store Pages_u, the comment page store Pages_c, and the friend relation store Pages_f;
Step3: parsing and construction of the PCU association data:
1) take each post page S_p out of the post page store Pages_p, each user page S_u out of the user page store Pages_u, each comment page S_c out of the comment page store Pages_c, and each friend page S_f out of the friend relation store Pages_f in turn; use the document parsing tool Beautifulsoup to perform hierarchical locating and parsing on S_p, S_u, S_c, and S_f, and write the required tag values into a database, where the url_u, url_c, and url_f obtained by parsing S_p are put into the user URL queue pool Url_u, the comment URL queue pool Url_c, and the friend relation list URL queue pool Url_f respectively;
2) obtain the post relation network f from the interaction relations between posts and comments, obtain the user friend relation network g from the users' following and follower information, and obtain the user-post membership network h from the membership relations between users and posts and between users and post comments;
finally obtaining the heterogeneous network G=(P, C, U, f, g, h).
2. The method for acquiring PCU association data for a specific topic in microblogs according to claim 1, characterized in that the specific steps of said Step1 are:
Step 1: start the IE9 browser through selenium, automatically open the Sina Weibo login page http://www.weibo.com/login.php, and locate the HTML tags <input name="username"> and <input name="password"> for entering the account and password;
Step 2: using the tag location result of step 1, have selenium automatically fill in a registered username and password;
Step 3: determine whether a verification code must be entered according to whether the login page contains the HTML tag <table name="verifycode">;
Step 4: if a verification code input tag exists, obtain the verification code with a character extraction technique; if this technique fails, fall back to manual recognition and manual input;
Step 5: locate the login tag <a class="W_btn_g"> through selenium and automatically trigger it to complete the data access permission acquisition.
3. The method for acquiring PCU association data for a specific topic in microblogs according to claim 1, characterized in that the specific steps of said Step2 are:
Step 1: open the microblog search page http://s.weibo.com/, select the "comprehensive" search interface, and initialize Url_p, Url_u, Url_c, Url_f, Pages_p, Pages_u, Pages_c, and Pages_f as empty;
Step 2: automatically enter the specific topic keyword into the search box <input type="text">, automatically trigger the "search" submit button <input type="text" class="searchInp_form">, obtain the link url_p of the keyword search page through selenium's current_url attribute, and store it in Url_p;
Step 3: take each url_p out of Url_p in turn, fetch the corresponding page with selenium's get(url) function, extract the page source S_p with selenium's page_source attribute, and store it in the page store Pages_p;
Step 4: check whether the user URL queue pool Url_u, the comment URL queue pool Url_c, and the friend relation link queue pool Url_f are empty. If they are empty, go to step 1 of the specific steps of Step3 step 1). If not, take out in turn the post's user link url_u, comment link url_c, and friend relation list link url_f, start three threads to download the user page S_u, the comment page S_c, and the friend relation list page S_f respectively, store them in the user page store Pages_u, the comment page store Pages_c, and the friend relation page store Pages_f, and proceed to step 5 of the specific steps of Step3 step 1).
4. The method for acquiring PCU association data for a specific topic in microblogs according to claim 1, characterized in that the specific steps of said Step3 step 1) are:
Step 1: take each S_p out of Pages_p in turn, use Beautifulsoup to locate the tag <div class="WB_cardwrapS_bg2 clearfix">, parse out all posts on the post page, and for each post parse the required fields: post id (post_id), poster username (poster), user id (user_id), user avatar, user home page link, post content, post time, comment count, and repost count;
Step 2: put the poster home page link url_u parsed in the previous step, whose pattern is http://weibo.com/u/<user_id>, into the user link queue pool Url_u;
Step 3: obtain the comment link url_c by combining the post id with the post comment response address captured with firebug; its pattern is http://weibo.com/aj/v6/comment/big?ajwvr=6&id=<post_id>&page=<num>. Put it into the comment link queue pool Url_c;
Step 4: from the poster's user_id, obtain the friend relation links url_f, whose patterns are http://weibo.com/p/100505<user_id>/follow?page=<num> and http://weibo.com/p/100505<user_id>/follow?relate=fans&page=<num>;
Step 5: check whether Pages_u, Pages_c, and Pages_f are empty. If they are empty, go to step 4 of the specific steps of Step2. If not, take each S_u out of Pages_u in turn and use Beautifulsoup to parse the user information fields: user id, username, following count, follower count, personal signature, and personal profile. Take each S_c out of Pages_c in turn and use Beautifulsoup to parse the post comment fields: commenter name, commenter id, comment content, and comment time. Take each S_f out of Pages_f in turn and use Beautifulsoup to parse the fields of the poster's friends: nickname, friend id, home page link, avatar, following count, follower count, microblog count, and place of residence.
5. The method for acquiring PCU association data for a specific topic in microblogs according to claim 1, characterized in that the specific steps of said Step3 step 2) are:
Step 1: build a directed, unweighted heterogeneous network G=(P, C, U, f, g, h) and initialize G as empty;
Step 2: from the membership relation between posts and comments, determine whether an item is an original post or a comment, and determine whether two posts are in a repost relation; fusing these yields the post relation network f;
Step 3: obtain the user friend relation network g from the users' following and follower information;
Step 4: build the user-post membership network h from the membership relations between users and posts and between users and post comments;
Step 5: fuse the three classes of entities and the three classes of networks to obtain the heterogeneous network G=(P, C, U, f, g, h).
CN201510358782.0A 2015-06-25 2015-06-25 A method of obtaining PCU associated data in specific topics microblogging Expired - Fee Related CN104915438B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510358782.0A CN104915438B (en) 2015-06-25 2015-06-25 A method of obtaining PCU associated data in specific topics microblogging

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510358782.0A CN104915438B (en) 2015-06-25 2015-06-25 A method of obtaining PCU associated data in specific topics microblogging

Publications (2)

Publication Number Publication Date
CN104915438A (en) 2015-09-16
CN104915438B (en) 2019-02-05

Family

ID=54084501

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510358782.0A Expired - Fee Related CN104915438B (en) 2015-06-25 2015-06-25 A method of obtaining PCU associated data in specific topics microblogging

Country Status (1)

Country Link
CN (1) CN104915438B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120209850A1 (en) * 2011-02-15 2012-08-16 Microsoft Corporation Aggregated view of content with presentation according to content type
CN103810283A (en) * 2014-02-20 2014-05-21 东莞中国科学院云计算产业技术创新与育成中心 Microblog data acquisition method based on user correlation
CN104199947A (en) * 2014-09-11 2014-12-10 浪潮集团有限公司 Important person speech supervision and incidence relation excavating method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHENGFENG LIN ET AL: "analysis and identification of spamming behavious in sina weibo", 《PROCEEDINGS OF THE 7TH WORKSHOP ON SOCIAL NETWORK MINING AND ANALYSIS》 *
袁毅等: "微博客用户信息交流过程中形成的不同社会网络及其关系实证研究", 《图书情报工作》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372239A (en) * 2016-09-14 2017-02-01 电子科技大学 Social network event correlation analysis method based on heterogeneous network
WO2018058611A1 (en) * 2016-09-30 2018-04-05 深圳市华傲数据技术有限公司 Method and apparatus for querying data node path
CN108170842A (en) * 2018-01-16 2018-06-15 重庆邮电大学 Hot microblog topic source tracing method based on tripartite graph model
CN108170842B (en) * 2018-01-16 2021-12-14 重庆邮电大学 Microblog hot topic tracing method based on three-part graph model
CN109165333A (en) * 2018-07-12 2019-01-08 电子科技大学 A kind of high speed Theme Crawler of Content method based on web data
CN109284312A (en) * 2018-08-27 2019-01-29 山东威尔数据股份有限公司 A kind of heterogeneous database change real-time informing method

Also Published As

Publication number Publication date
CN104915438B (en) 2019-02-05

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190205

Termination date: 20210625
