CN109241380A - Method for acquiring microblog data by combining a web crawler with the Sina API - Google Patents
- Publication number
- CN109241380A (application CN201810970733.6A)
- Authority
- CN
- China
- Prior art keywords
- user
- seed
- microblogging
- list
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a method for acquiring microblog data that combines a web crawler with the Sina API. Seed users, together with their follower (fan) users and the users they follow, are obtained from the microblog celebrity list through the Sina API and added to a seed list. The seed list is converted into seed URLs, and the seed user list is checked: if it is empty, the method ends; otherwise the seed list is traversed and a web crawler is used to crawl each seed user's microblog information, microblog comment information, and personal user information, and the users who commented on the microblogs are added to the seed list. Compared with the prior art, combining the Sina API with a web crawler targeting the Sina Weibo platform yields microblog data in a relatively standardized format while still allowing large-scale crawling; the crawled data is more regular in format and contains less noise, which provides an important data basis for detecting social security events on microblogs.
Description
Technical field
The invention relates to the technical field of microblog data acquisition, and in particular to a method for acquiring microblog data that combines a web crawler with the Sina API.
Background technique
Data acquisition is a critical step in microblog analysis, because it provides the data basis for detecting social security events on microblogs. At present, microblog data is acquired mainly in two ways: through the Sina API, or through a web crawler targeting the Sina Weibo platform. The API-based approach returns data in a relatively standardized format, but the number of calls is limited, so it cannot support large-scale collection, and some information cannot be obtained through it. The crawler-based approach can collect data at large scale, but parsing the pages is more complicated, the crawled data is not in a standardized format, and it contains more noise.
Summary of the invention
The purpose of the invention is to overcome the deficiencies of the prior art by providing a method for acquiring microblog data that combines a web crawler with the Sina API.
To achieve this purpose, the invention is implemented according to the following technical scheme:
A method for acquiring microblog data by combining a web crawler with the Sina API, comprising the following steps:
Step1: based on the Sina API, obtain seed users from the microblog celebrity list, together with their follower (fan) users and the users they follow, and add them to the seed list;
Step2: convert the seed list into seed URLs and check whether the seed user list is empty; if it is empty, go to Step4, otherwise go to Step3;
Step3: traverse the seed list and, using a web crawler, crawl each seed user's microblog information, microblog comment information, and personal user information, and add the users who commented on the microblogs to the seed list;
Step4: end.
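The four steps can be read as a queue-driven loop. The following is a minimal sketch under stated assumptions, not the patented implementation: `collect`, `to_url`, `crawl`, and `max_users` are hypothetical names, the API call and the crawler are passed in as plain callables, and the return to the emptiness check after each crawl, the duplicate filter, and the size cap are assumptions that the text does not spell out.

```python
from collections import deque
from typing import Callable, Iterable


def collect(
    api_seed_users: Iterable[str],
    to_url: Callable[[str], str],
    crawl: Callable[[str], tuple[dict, list[str]]],
    max_users: int = 1000,
) -> dict[str, dict]:
    """Run Step1 to Step4 as a loop over a growing seed list.

    api_seed_users: user IDs already fetched through the API (Step1).
    to_url:         hypothetical helper that turns a user ID into a seed URL.
    crawl:          hypothetical crawler; returns (record, commenting users).
    max_users:      assumed safety cap, not part of the described method.
    """
    # Step1: the seed list starts from the API results (duplicates dropped).
    seeds = deque(dict.fromkeys(api_seed_users))
    seen = set(seeds)
    library: dict[str, dict] = {}

    # Step2: continue only while the seed list is not empty.
    while seeds and len(library) < max_users:
        user = seeds.popleft()
        # Step3: crawl the seed user's page via its seed URL.
        record, commenters = crawl(to_url(user))
        library[user] = record
        # Step3 (continued): commenting users become new seeds.
        for commenter in commenters:
            if commenter not in seen:
                seen.add(commenter)
                seeds.append(commenter)

    # Step4: end.
    return library
```

Passing the API and crawler in as callables keeps the sketch independent of any real endpoint; without the cap or the duplicate filter the loop might never reach an empty seed list on a large social graph.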
Specifically, Step3 includes:
obtaining the URLs to be crawled from the seed list, then parsing each URL and collecting information, which specifically includes: obtaining the user-information URL, entering the corresponding page, and crawling the user's followers, the users the user follows, and other user-related information; obtaining the user-microblog URL, entering the corresponding page, and crawling the users who reposted, liked, or commented on each microblog, the microblog comment text, and other microblog-related information; building the corresponding microblog data resource library from the crawled followers and followed users, other user-related information, reposting, liking, and commenting users, comment text, and other microblog-related information; and, at the same time, adding the crawled followers and followed users and the crawled reposting, liking, and commenting users to the seed list.
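The two page types in Step3 feed both the resource library and the seed list. The sketch below illustrates that bookkeeping under assumptions: `ResourceLibrary`, `process_page`, the `kind` labels, and the dictionary keys (`followers`, `following`, `profile`, `posts`, `reposters`, `likers`, `commenters`) are all hypothetical, and the page is assumed to have been fetched and parsed into a dictionary already.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceLibrary:
    """In-memory stand-in for the microblog data resource library."""
    users: dict = field(default_factory=dict)
    posts: dict = field(default_factory=dict)


def process_page(kind: str, user_id: str, page: dict,
                 library: ResourceLibrary, seed_list: list) -> None:
    """Store one parsed page and extend the seed list with discovered users."""
    discovered = []
    if kind == "user_info":
        # User-information page: followers, followed users, other user info.
        followers = page.get("followers", [])
        following = page.get("following", [])
        library.users[user_id] = {
            "followers": followers,
            "following": following,
            "profile": page.get("profile", {}),
        }
        discovered += followers + following
    elif kind == "user_posts":
        # User-microblog page: each post with its interacting users and comments.
        for post in page.get("posts", []):
            library.posts[post["id"]] = {"author": user_id, **post}
            for key in ("reposters", "likers", "commenters"):
                discovered += post.get(key, [])
    else:
        raise ValueError(f"unknown page kind: {kind}")

    # Newly discovered users join the seed list (duplicate check is an assumption).
    for uid in discovered:
        if uid not in seed_list:
            seed_list.append(uid)
```

For example, processing a `user_info` page whose followers are `["u2"]` appends `u2` to the seed list and records the follower relation under `library.users`.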
Compared with the prior art, the invention combines the Sina API with a web crawler targeting the Sina Weibo platform. It therefore obtains microblog data in a relatively standardized format while still allowing large-scale crawling; the crawled data is more regular in format and contains less noise, which provides an important data basis for detecting social security events on microblogs.
Brief description of the drawings
Fig. 1 is a flow chart of the invention.
Specific embodiment
The invention is further described below with reference to a specific embodiment. The illustrative embodiment and its description are intended to explain the invention and do not limit it.
As shown in Fig. 1, the method of this embodiment for acquiring microblog data by combining a web crawler with the Sina API comprises the following steps:
Step1: based on the Sina API, obtain seed users from the microblog celebrity list, together with their follower (fan) users and the users they follow, and add them to the seed list;
Step2: convert the seed list into seed URLs and check whether the seed user list is empty; if it is empty, go to Step4, otherwise go to Step3;
Step3: traverse the seed list and, using a web crawler, crawl each seed user's microblog information, microblog comment information, and personal user information, and add the users who commented on the microblogs to the seed list. The specific steps are: obtain the URLs to be crawled from the seed list, then parse each URL and collect information, which specifically includes: obtaining the user-information URL, entering the corresponding page, and crawling, from the microblog data resource library, the user's followers, the users the user follows, and other user-related information; obtaining the user-microblog URL, entering the corresponding page, and crawling, from the microblog data resource library, the users who reposted, liked, or commented on each microblog, the microblog comment text, and other microblog-related information; and, at the same time, adding the crawled followers and followed users and the crawled reposting, liking, and commenting users to the seed list.
Step4: end.
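Step2 converts the seed list into seed URLs, and Step3 distinguishes a user-information URL from a user-microblog URL. One way this conversion might look is sketched below. The URL templates point at a reserved placeholder domain and are purely hypothetical, as are the names `seeds_to_urls`, `USER_INFO_TEMPLATE`, and `USER_POSTS_TEMPLATE`; the text does not give the real page addresses.

```python
from urllib.parse import quote

# Hypothetical URL templates; the real page addresses are not given in the text.
USER_INFO_TEMPLATE = "https://weibo.example.invalid/{uid}/info"
USER_POSTS_TEMPLATE = "https://weibo.example.invalid/{uid}/posts"


def seeds_to_urls(seed_list):
    """Return (kind, user_id, url) triples for every unique seed user, in order."""
    urls = []
    # dict.fromkeys drops repeated IDs while keeping first-seen order.
    for uid in dict.fromkeys(str(s) for s in seed_list):
        safe = quote(uid, safe="")  # percent-encode characters unsafe in a path
        urls.append(("user_info", uid, USER_INFO_TEMPLATE.format(uid=safe)))
        urls.append(("user_posts", uid, USER_POSTS_TEMPLATE.format(uid=safe)))
    return urls
```

An empty seed list yields an empty URL list, which corresponds to the branch that goes straight to Step4.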
After microblog data has been acquired by the method of this embodiment, the collected microblog text data can be processed to remove abnormal data and noise data, standardize the data format, and build the corresponding microblog resource library, which in turn provides an important data basis for detecting social security events on microblogs.
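The text does not define what counts as abnormal or noisy data. As an illustration only, the sketch below assumes that records without an ID or without text are abnormal, that leftover HTML markup and exact duplicate texts are noise, and that the standardized format is a fixed set of string fields. The names `clean_text` and `normalize` and the field names are hypothetical.

```python
import html
import re

_TAG = re.compile(r"<[^>]+>")   # leftover HTML tags from crawled pages
_WS = re.compile(r"\s+")        # runs of whitespace


def clean_text(raw: str) -> str:
    """Strip tags, decode HTML entities, and collapse whitespace."""
    text = html.unescape(_TAG.sub(" ", raw))
    return _WS.sub(" ", text).strip()


def normalize(records):
    """Return cleaned records in one fixed schema, dropping bad or repeated ones."""
    seen_texts = set()
    cleaned = []
    for record in records:
        text = clean_text(str(record.get("text", "")))
        # Assumed rule: a record with no ID or no text is abnormal.
        if not text or not record.get("id"):
            continue
        # Assumed rule: an exact repeat of an earlier text is noise.
        if text in seen_texts:
            continue
        seen_texts.add(text)
        cleaned.append({
            "id": str(record["id"]),
            "user": str(record.get("user", "")),
            "text": text,
        })
    return cleaned
```

Real cleaning rules would depend on the downstream event-detection task; the point here is only that filtering and schema normalization happen before the resource library is built.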
The technical solution of the invention is not limited to the specific embodiment described above; any technical variation made according to the technical solution of the invention falls within the scope of protection of the invention.
Claims (2)
1. A method for acquiring microblog data by combining a web crawler with the Sina API, characterized by comprising the following steps:
Step1: based on the Sina API, obtaining seed users from the microblog celebrity list, together with their follower (fan) users and the users they follow, and adding them to a seed list;
Step2: converting the seed list into seed URLs and checking whether the seed user list is empty; if it is empty, going to Step4, otherwise going to Step3;
Step3: traversing the seed list and, using a web crawler, crawling each seed user's microblog information, microblog comment information, and personal user information, and adding the users who commented on the microblogs to the seed list;
Step4: ending.
2. The method for acquiring microblog data by combining a web crawler with the Sina API according to claim 1, characterized in that Step3 includes:
obtaining the URLs to be crawled from the seed list, then parsing each URL and collecting information, which specifically includes: obtaining the user-information URL, entering the corresponding page, and crawling the user's followers, the users the user follows, and other user-related information; obtaining the user-microblog URL, entering the corresponding page, and crawling the users who reposted, liked, or commented on each microblog, the microblog comment text, and other microblog-related information; building the corresponding microblog data resource library from the crawled followers and followed users, other user-related information, reposting, liking, and commenting users, comment text, and other microblog-related information; and, at the same time, adding the crawled followers and followed users and the crawled reposting, liking, and commenting users to the seed list.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810970733.6A CN109241380A (en) | 2018-08-24 | 2018-08-24 | Method for acquiring microblog data by combining a web crawler with the Sina API |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109241380A (en) | 2019-01-18 |
Family
ID=65067849
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810970733.6A Pending CN109241380A (en) | Method for acquiring microblog data by combining a web crawler with the Sina API | 2018-08-24 | 2018-08-24 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109241380A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103810283A (en) * | 2014-02-20 | 2014-05-21 | 东莞中国科学院云计算产业技术创新与育成中心 | Microblog data acquisition method based on user correlation |
CN104063390A (en) * | 2013-03-20 | 2014-09-24 | 腾讯科技(深圳)有限公司 | Microblog data processing method and system |
CN104281607A (en) * | 2013-07-08 | 2015-01-14 | 上海锐英软件技术有限公司 | Microblog hot topic analyzing method |
Non-Patent Citations (1)
Title |
---|
LIAN Jie et al.: "Sina Weibo data mining scheme", Journal of Tsinghua University (Science and Technology) * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109918153A (en) * | 2019-03-18 | 2019-06-21 | 霍芳 | Page open method and apparatus, page content retrieval method and device |
CN109918153B (en) * | 2019-03-18 | 2022-05-27 | 北京信息科技大学 | Page opening method and device and page content retrieval method and device |
CN110263237A (en) * | 2019-05-31 | 2019-09-20 | 精硕科技(北京)股份有限公司 | The acquisition methods and device of public sentiment data |
CN111131268A (en) * | 2019-12-27 | 2020-05-08 | 南京邮电大学 | User data acquisition and storage system and method based on microblog platform |
CN112632361A (en) * | 2020-12-29 | 2021-04-09 | 中科院计算技术研究所大数据研究院 | Iterative data acquisition method |
CN112632361B (en) * | 2020-12-29 | 2021-10-29 | 中科院计算技术研究所大数据研究院 | Iterative data acquisition method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190118 |