CN109241380A - Method for acquiring microblog data by combining a web crawler with the Sina API


Info

Publication number
CN109241380A
Authority
CN
China
Prior art keywords
user
seed
microblogging
list
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810970733.6A
Other languages
Chinese (zh)
Inventor
张仰森
黄改娟
段瑞雪
张良
曾健荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Science and Technology University
Original Assignee
Beijing Information Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date


Abstract

The invention discloses a method for acquiring microblog data that combines a web crawler with the Sina API. Seed users, together with their followers and the users they follow, are obtained from the Weibo celebrity list through the Sina API and added to a seed list. The seed list is converted into seed URLs, and the seed user list is checked: if it is empty, the procedure ends; otherwise the seed list is traversed and a web crawler collects each seed user's microblog posts, microblog comments, and personal information, and the users who commented are added to the seed list. Compared with the prior art, combining the Sina API with a web crawler built for the Sina Weibo platform yields microblog data in a fairly regular format while still allowing large-scale crawling; the crawled data is more consistently formatted and contains less noise, which provides an important data basis for detecting social security events on microblogs.

Description

Method for acquiring microblog data by combining a web crawler with the Sina API
Technical field
The present invention relates to the technical field of microblog data acquisition, and in particular to a method for acquiring microblog data that combines a web crawler with the Sina API.
Background
Data acquisition on microblogs is extremely important, because it provides the data basis for detecting social security events on microblogs. At present there are two main ways to acquire microblog data: through the Sina API, and through a web crawler built for the Sina Weibo platform. The Sina API approach returns data in a fairly regular format, but the number of calls is limited, so it cannot support large-scale collection, and some information cannot be obtained at all. The web crawler approach can collect data at scale, but parsing the pages is relatively complex, the crawled data is not uniformly formatted, and it contains more noise.
Summary of the invention
The object of the invention is to remedy the deficiencies of the prior art by providing a method for acquiring microblog data that combines a web crawler with the Sina API.
To achieve this object, the invention is implemented according to the following technical solution:
A method for acquiring microblog data that combines a web crawler with the Sina API, comprising the following steps:
Step1: Based on the Sina API, obtain seed users, together with their followers and the users they follow, from the Weibo celebrity list, and add them to the seed list;
Step2: Convert the seed list into seed URLs and check whether the seed user list is empty; if it is empty, go to Step4, otherwise go to Step3;
Step3: Traverse the seed list and use a web crawler to collect each seed user's microblog posts, microblog comments, and personal information, and add the users who commented to the seed list;
Step4: End.
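The control flow of Step1 to Step4 can be sketched roughly as follows. This is only an illustrative sketch, not the patented implementation: the function name `collect`, the `crawl_user` callback, the `max_users` cap, and the `seen` set that skips already-known users are all assumptions added so that the loop terminates. The text itself does not say how repeated users are handled or that Step3 returns to the check in Step2; both are assumed here.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical signature: given a user id, return that user's record and the
# ids of users discovered on the crawled pages (commenters, followers, ...).
CrawlFn = Callable[[str], Tuple[dict, List[str]]]


def collect(initial_seeds: Iterable[str], crawl_user: CrawlFn,
            max_users: int = 1000) -> Dict[str, dict]:
    """Run Step2 to Step4 over a seed list produced by Step1."""
    seen = set()
    queue = deque()
    for uid in initial_seeds:           # Step1 output, e.g. from API calls
        if uid not in seen:
            seen.add(uid)
            queue.append(uid)

    results: Dict[str, dict] = {}
    # Step2: an empty seed list (or the assumed cap) leads to Step4.
    while queue and len(results) < max_users:
        uid = queue.popleft()
        record, discovered = crawl_user(uid)   # Step3: crawl one seed user
        results[uid] = record
        for other in discovered:               # Step3: grow the seed list
            if other not in seen:              # assumption: skip known users
                seen.add(other)
                queue.append(other)
    return results                             # Step4: end


# Usage with a stand-in crawler over a tiny in-memory graph (no network).
graph = {"a": ["b", "c"], "b": ["a"], "c": ["d"], "d": []}
collected = collect(["a"], lambda uid: ({"uid": uid}, graph[uid]))
print(sorted(collected))  # ['a', 'b', 'c', 'd']
```

Passing the crawler in as a callback keeps the loop independent of how pages are actually fetched and parsed, so the API-based seeding and the crawler can be swapped or tested separately.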
Specifically, Step3 includes:
Obtain a URL to be crawled from the seed list, then parse the URL and collect information. Specifically: obtain the user information URL, enter the corresponding page, and crawl the user's followers, the users the user follows, and the user's other related information; obtain the user's microblog URL, enter the corresponding page, and crawl the users who reposted, liked, or commented on each microblog, the comment text, and the microblog's other related information; build the corresponding microblog data resource library from the crawled followers and followed users, other user information, reposting, liking, and commenting users, comment text, and other microblog information; at the same time, add the crawled followers and followed users and the crawled reposting, liking, and commenting users to the seed list.
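The URL parsing and dispatch inside Step3 might look something like the sketch below. Everything concrete in it is hypothetical: the `/user/<uid>/info` and `/user/<uid>/posts` path layout, the `example.invalid` host, the field names (`followers`, `following`, `reposters`, `likers`, `commenters`), and the idea that `page` is a dictionary of fields already extracted from the fetched page. Real Sina Weibo pages are laid out differently, and fetching and HTML extraction are deliberately left out.

```python
from urllib.parse import urlparse


def classify(url):
    """Return (kind, uid) for the assumed paths /user/<uid>/info|posts."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) == 3 and parts[0] == "user" and parts[2] in ("info", "posts"):
        return parts[2], parts[1]
    raise ValueError(f"unrecognised seed URL: {url}")


def process(url, page, repository, seed_list):
    """Store one crawled page in the resource library and extend the seeds."""
    kind, uid = classify(url)
    entry = repository.setdefault(uid, {"profile": {}, "posts": []})
    if kind == "info":
        # User information page: profile plus followers and followed users.
        entry["profile"] = page.get("profile", {})
        new_users = page.get("followers", []) + page.get("following", [])
    else:
        # Microblog page: posts plus reposting, liking and commenting users.
        posts = page.get("posts", [])
        entry["posts"].extend(posts)
        new_users = []
        for post in posts:
            for field in ("reposters", "likers", "commenters"):
                new_users.extend(post.get(field, []))
    for other in new_users:
        if other not in seed_list:      # assumption: no duplicate seeds
            seed_list.append(other)


# Usage with hand-written page data instead of a real fetch.
repository, seeds = {}, ["1"]
process("https://example.invalid/user/1/info",
        {"profile": {"name": "x"}, "followers": ["2"], "following": ["3", "2"]},
        repository, seeds)
process("https://example.invalid/user/1/posts",
        {"posts": [{"text": "hi", "reposters": ["4"], "likers": ["2"],
                    "commenters": ["5"]}]},
        repository, seeds)
print(seeds)  # ['1', '2', '3', '4', '5']
```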
Compared with the prior art, the invention combines the Sina API with a web crawler built for the Sina Weibo platform, so it can both obtain microblog data in a fairly regular format and crawl data at scale; the crawled data is more consistently formatted and contains less noise, which in turn provides an important data basis for detecting social security events on microblogs.
Brief description of the drawings
Fig. 1 is a flow chart of the invention.
Detailed description of the embodiments
The invention is further described below with reference to a specific embodiment. The illustrative embodiment and its description serve to explain the invention and are not intended to limit it.
As shown in Fig. 1, the method of this embodiment for acquiring microblog data by combining a web crawler with the Sina API comprises the following steps:
Step1: Based on the Sina API, obtain seed users, together with their followers and the users they follow, from the Weibo celebrity list, and add them to the seed list;
Step2: Convert the seed list into seed URLs and check whether the seed user list is empty; if it is empty, go to Step4, otherwise go to Step3;
Step3: Traverse the seed list and use a web crawler to collect each seed user's microblog posts, microblog comments, and personal information, and add the users who commented to the seed list. The specific steps are: obtain a URL to be crawled from the seed list, then parse the URL and collect information, which includes: obtaining the user information URL, entering the corresponding page, and crawling from the microblog data resource library the user's followers, the users the user follows, and the user's other related information; obtaining the user's microblog URL, entering the corresponding page, and crawling from the microblog data resource library the users who reposted, liked, or commented on each microblog, the comment text, and the microblog's other related information; at the same time, adding the crawled followers and followed users and the crawled reposting, liking, and commenting users to the seed list;
Step4: End.
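The text does not say how Step2 converts the seed list into seed URLs. One plausible reading, sketched below under stated assumptions, is that each seed user id is expanded into one user information URL and one microblog URL. The base address `https://example.invalid` and the path templates are placeholders, not real Sina Weibo addresses, and the duplicate filtering is likewise an assumption.

```python
from urllib.parse import quote

# Placeholder host and path templates; the real URL scheme is not given
# in the text and would have to be substituted here.
BASE = "https://example.invalid"
TEMPLATES = ("/user/{uid}/info", "/user/{uid}/posts")


def to_seed_urls(seed_list):
    """Expand each distinct seed user id into its assumed seed URLs."""
    urls = []
    seen = set()
    for uid in seed_list:
        if uid in seen:                 # assumption: each user only once
            continue
        seen.add(uid)
        safe_uid = quote(str(uid), safe="")   # percent-encode unsafe chars
        for template in TEMPLATES:
            urls.append(BASE + template.format(uid=safe_uid))
    return urls


# Usage: an empty result corresponds to the "seed list is empty" branch.
for url in to_seed_urls(["1001", "1001", "a b"]):
    print(url)
```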
Once microblog data has been acquired by the method of this embodiment, the collected microblog text can be processed to remove abnormal data and noise, bring the data into a standard format, and build the corresponding microblog resource library, which in turn provides an important data basis for detecting social security events on microblogs.
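A minimal sketch of such a cleaning pass is shown below. The text does not define what counts as abnormal data or noise, so the rules here are assumptions chosen for illustration: records missing an id, a user id, or text are treated as abnormal; embedded links, redundant whitespace, and repeated post ids are treated as noise; and the record fields `id`, `uid`, and `text` are hypothetical names.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")   # embedded links, treated as noise
WS_RE = re.compile(r"\s+")             # runs of whitespace


def clean_text(text):
    """Normalise width variants, drop links and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)   # e.g. full-width to ASCII
    text = URL_RE.sub(" ", text)
    return WS_RE.sub(" ", text).strip()


def normalise(records):
    """Return records in one uniform shape, skipping abnormal ones."""
    cleaned = []
    seen_ids = set()
    for rec in records:
        post_id, uid, text = rec.get("id"), rec.get("uid"), rec.get("text")
        if post_id is None or uid is None or not isinstance(text, str):
            continue                    # abnormal: required field missing
        text = clean_text(text)
        if not text or post_id in seen_ids:
            continue                    # noise: empty after cleaning, or repeat
        seen_ids.add(post_id)
        cleaned.append({"id": str(post_id), "uid": str(uid), "text": text})
    return cleaned


# Usage on a few hand-written records.
raw = [
    {"id": 1, "uid": "u", "text": "  hello   http://t.cn/abc world "},
    {"id": 1, "uid": "u", "text": "dup"},
    {"id": 2, "uid": "u", "text": None},
    {"id": 3, "uid": "u", "text": "http://x.y"},
]
print(normalise(raw))  # [{'id': '1', 'uid': 'u', 'text': 'hello world'}]
```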
The technical solution of the invention is not limited to the specific embodiment above; any technical variation made according to the technical solution of the invention falls within the scope of protection of the invention.

Claims (2)

1. A method for acquiring microblog data that combines a web crawler with the Sina API, characterized by comprising the following steps:
Step1: Based on the Sina API, obtain seed users, together with their followers and the users they follow, from the Weibo celebrity list, and add them to the seed list;
Step2: Convert the seed list into seed URLs and check whether the seed user list is empty; if it is empty, go to Step4, otherwise go to Step3;
Step3: Traverse the seed list and use a web crawler to collect each seed user's microblog posts, microblog comments, and personal information, and add the users who commented to the seed list;
Step4: End.
2. The method for acquiring microblog data that combines a web crawler with the Sina API according to claim 1, characterized in that Step3 includes:
Obtain a URL to be crawled from the seed list, then parse the URL and collect information. Specifically: obtain the user information URL, enter the corresponding page, and crawl the user's followers, the users the user follows, and the user's other related information; obtain the user's microblog URL, enter the corresponding page, and crawl the users who reposted, liked, or commented on each microblog, the comment text, and the microblog's other related information; build the corresponding microblog data resource library from the crawled followers and followed users, other user information, reposting, liking, and commenting users, comment text, and other microblog information; at the same time, add the crawled followers and followed users and the crawled reposting, liking, and commenting users to the seed list.
CN201810970733.6A 2018-08-24 2018-08-24 Method for acquiring microblog data by combining a web crawler with the Sina API Pending CN109241380A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810970733.6A CN109241380A (en) 2018-08-24 2018-08-24 Method for acquiring microblog data by combining a web crawler with the Sina API

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810970733.6A CN109241380A (en) 2018-08-24 2018-08-24 Method for acquiring microblog data by combining a web crawler with the Sina API

Publications (1)

Publication Number Publication Date
CN109241380A true CN109241380A (en) 2019-01-18

Family

ID=65067849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810970733.6A Pending CN109241380A (en) 2018-08-24 2018-08-24 Method for acquiring microblog data by combining a web crawler with the Sina API

Country Status (1)

Country Link
CN (1) CN109241380A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104063390A (en) * 2013-03-20 2014-09-24 腾讯科技(深圳)有限公司 Microblog data processing method and system
CN104281607A (en) * 2013-07-08 2015-01-14 上海锐英软件技术有限公司 Microblog hot topic analyzing method
CN103810283A (en) * 2014-02-20 2014-05-21 东莞中国科学院云计算产业技术创新与育成中心 Microblog data acquisition method based on user correlation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
廉捷等: "《新浪微博数据挖掘方案》", 《清华大学学报(自然科学版)》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918153A (en) * 2019-03-18 2019-06-21 霍芳 Page open method and apparatus, page content retrieval method and device
CN109918153B (en) * 2019-03-18 2022-05-27 北京信息科技大学 Page opening method and device and page content retrieval method and device
CN110263237A (en) * 2019-05-31 2019-09-20 精硕科技(北京)股份有限公司 The acquisition methods and device of public sentiment data
CN111131268A (en) * 2019-12-27 2020-05-08 南京邮电大学 User data acquisition and storage system and method based on microblog platform
CN112632361A (en) * 2020-12-29 2021-04-09 中科院计算技术研究所大数据研究院 Iterative data acquisition method
CN112632361B (en) * 2020-12-29 2021-10-29 中科院计算技术研究所大数据研究院 Iterative data acquisition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190118
