CN104317880A - Method special for microblog data acquisition mode - Google Patents
- Publication number
- CN104317880A
- Authority
- CN
- China
- Prior art keywords
- data
- microblog
- url
- page
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a method dedicated to microblog data acquisition. The method first simulates a user logging in to a microblog account and records the login information, then captures data from the microblog, wherein: 1) the microblog page of the user to be captured serves as the entry point, the microblog information on this initial page is crawled, and valid structured data is formed; 2) the user's UID and the URL of the current page are combined into a new URL, which is stored in a save queue so that the UID of the currently crawled page content can be obtained easily; 3) the crawled data is stored in a local database. The invention designs and implements the acquisition of microblog data that is updated in real time, and can effectively address both microblog login restrictions and real-time updating of microblog data. Microblog data is large in volume, comes in many storage formats, and is difficult to store, which makes it hard to acquire; how to acquire microblog data efficiently has therefore become an important problem.
Description
Technical field
The present invention relates to the field of communication technology, and specifically to a method dedicated to microblog data acquisition. It involves data capture, simulated login, data acquisition, databases, and unstructured data.
Background technology
Microblog data is large in volume, comes in many storage formats, and is difficult to store, which makes it very hard to collect; how to collect microblog data efficiently has therefore become an important problem. Traditional internet data acquisition cannot update the collected data in real time: each login collects only a small amount of data, which does not meet demand and leaves the data incomplete.
Summary of the invention
The object of the present invention is to provide a method dedicated to microblog data acquisition.
The object of the invention is achieved as follows. A simulated-login method is used to collect microblog data continuously and to update the collected data in real time, which can greatly increase the amount of data acquired and provides a basis for later data analysis and processing. Unstructured data reflects the diversity that characterizes big data, and clickstream data is one part of this rich data; with powerful web analytics tools, the most fine-grained raw data can be obtained. Unstructured data includes text, video, documents, audio, and even geographic location information, and a key question is how to make better use of it, for example through text mining of the unstructured data in the clickstream. To address the above problems of traditional real-time data acquisition, the invention improves the traditional microblog data acquisition approach. The process is divided into two steps: first, a user's login to a microblog account is simulated and the login information is recorded; then the data in the microblog is captured, wherein:
1) The microblog page of the user to be captured serves as the entry point; the microblog information on this start page is crawled and formed into valid structured data;
2) The user's UID and the URL of the current page are combined into a new URL, which is placed in a save queue so that the UID of the currently crawled page content can be obtained easily;
3) The crawled data is stored in a local database.
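The three capture steps can be sketched roughly as follows. This is only an illustrative sketch, not the patent's implementation: the `FAKE_PAGES` table, the `fetch_page` stand-in, the `weibo.example` address, the `?uid=` URL form, and the `posts` table layout are all assumptions made for the example, and no real network request or login is performed.

```python
import sqlite3
from collections import deque

# Hypothetical stand-in for pages fetched with a recorded login session.
FAKE_PAGES = {
    "https://weibo.example/u/1001": {
        "uid": "1001",
        "posts": ["first post", "second post"],
    },
}


def fetch_page(url):
    """Return the page's UID and post texts, or None if the page is unknown."""
    return FAKE_PAGES.get(url)


def crawl(entry_url, conn):
    """Crawl the entry page, queue a UID-tagged URL, and store posts locally."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (uid TEXT, url TEXT, text TEXT)"
    )
    save_queue = deque()
    page = fetch_page(entry_url)
    if page is None:
        return save_queue
    # Step 1: turn the crawled page into structured records.
    records = [(page["uid"], entry_url, text) for text in page["posts"]]
    # Step 2: combine the UID with the current page URL and queue the result.
    save_queue.append(f"{entry_url}?uid={page['uid']}")
    # Step 3: store the crawled data in a local database.
    conn.executemany("INSERT INTO posts VALUES (?, ?, ?)", records)
    conn.commit()
    return save_queue
```

Calling `crawl("https://weibo.example/u/1001", sqlite3.connect(":memory:"))` would leave two rows in the `posts` table and one composed URL in the returned queue.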
Composing the new URL is divided into four steps:
1) A new URL is manually initialized and set to empty;
2) During page data collection, the URL of the page being collected is read and placed in the initialized URL;
3) For each logged-in user, the microblog provides a UID and some of the user's personal information on the microblog; this content is placed in the newly initialized URL;
4) The new URL module is formed by string concatenation; the user's information structure comprises structured data composed of the user's UID and the user's personal information.
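The four composition steps might look like the sketch below. The source only says that the pieces are joined by string concatenation; encoding the UID and personal information as query parameters, as well as the function name `compose_url` and the `nick` field, are assumptions chosen for illustration.

```python
from urllib.parse import urlencode


def compose_url(page_url, uid, profile):
    """Build a new URL from the page URL plus the user's UID and profile."""
    new_url = ""                          # step 1: initialize an empty URL
    new_url += page_url                   # step 2: add the collected page URL
    user_info = {"uid": uid, **profile}   # step 3: UID plus personal information
    # Step 4: join everything by string concatenation (query-string form
    # is an assumption; the source does not specify the layout).
    separator = "&" if "?" in new_url else "?"
    new_url += separator + urlencode(user_info)
    return new_url, user_info
```

For example, `compose_url("https://weibo.example/u/1001", "1001", {"nick": "alice"})` returns the URL `https://weibo.example/u/1001?uid=1001&nick=alice` together with the structured user record.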
The beneficial effects of the invention are as follows. Microblog data is large in volume, comes in many storage formats, and is difficult to store, which makes it very hard to collect; how to collect microblog data efficiently has therefore become an important problem. The invention designs and implements the collection of microblog data that is updated in real time, and can effectively address both microblog login restrictions and real-time updating of microblog data.
Description of the drawings
Fig. 1 is a diagram of the working principle.
Embodiment
The method of the present invention is described in detail below with reference to the accompanying drawing.
Because login must work for multiple users, two approaches can be taken: in one, the person logging in is determined manually; in the other, virtual logins are performed in batches using simulated virtual addresses. After the simulated login, collection of page data can begin. The collected page information is matched one-to-one with the initialized URLs, yielding structured data that is convenient to analyze and apply, which is stored in the local database. The steps are as follows: after the simulated login, the microblog page of the user to be captured serves as the entry point, the microblog information on the start page is crawled, and valid structured data is formed; the user's UID and the URL of the current page are combined into a new URL, which is placed in a save queue so that the UID of the currently crawled page content can be obtained easily.
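Recording login information for one manually chosen account or for a batch of accounts could be organized as in the following sketch. The `LoginRecord` structure, the `simulate_login` stub, and the `session-for-...` token format are hypothetical; the stub contacts no real service and merely stands in for whatever login request an actual implementation would send.

```python
import time
from dataclasses import dataclass, field


@dataclass
class LoginRecord:
    """Login information kept after a simulated login (fields are assumed)."""
    account: str
    cookie: str
    logged_in_at: float = field(default_factory=time.time)


def simulate_login(account, password):
    """Hypothetical stub: returns a placeholder session instead of logging in."""
    return LoginRecord(account=account, cookie=f"session-for-{account}")


def batch_login(credentials):
    """Simulate logins for several accounts and keep one record per account."""
    return {account: simulate_login(account, pw) for account, pw in credentials}
```

A single manually chosen login is just `simulate_login(account, password)`, while `batch_login([("alice", "pw1"), ("bob", "pw2")])` covers the batch case and returns a record for each account.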
Embodiment
The traditional microblog data acquisition approach is improved through the following steps: first, a user's login to a microblog account is simulated and the login information is recorded; then the data in the microblog is captured, wherein:
1) The microblog page of the user to be captured serves as the entry point; the microblog information on this start page is crawled and formed into valid structured data;
2) The user's UID and the URL of the current page are combined into a new URL, which is placed in a save queue so that the UID of the currently crawled page content can be obtained easily;
3) The crawled data is stored in a local database.
Composing the new URL is divided into four steps:
1) A new URL is manually initialized and set to empty;
2) During page data collection, the URL of the page being collected is read and placed in the initialized URL;
3) For each logged-in user, the microblog provides a UID and some of the user's personal information on the microblog; this content is placed in the newly initialized URL;
4) The new URL module is formed by string concatenation; the user's information structure comprises structured data composed of the user's UID and the user's personal information.
Apart from the technical features described in the specification, everything else is technology known to those skilled in the art.
Claims (2)
1. A method dedicated to microblog data acquisition, characterized in that a simulated-login method is used to collect microblog data continuously and to update the collected data in real time, which can greatly increase the amount of data acquired and provides a basis for later data analysis and processing; unstructured data reflects the diversity that characterizes big data, and clickstream data is one part of this rich data; with powerful web analytics tools, the most fine-grained raw data is obtained; unstructured data includes text, video, documents, audio, and even geographic location information, and the question is how to make better use of it, for example through text mining of the unstructured data in the clickstream; to address the above problems of traditional real-time data acquisition, the traditional microblog data acquisition approach is improved, the process being divided into two steps: first, a user's login to a microblog account is simulated and the login information is recorded; then the data in the microblog is captured, wherein:
1) The microblog page of the user to be captured serves as the entry point; the microblog information on this start page is crawled and formed into valid structured data;
2) The user's UID and the URL of the current page are combined into a new URL, which is placed in a save queue so that the UID of the currently crawled page content can be obtained easily;
3) The crawled data is stored in a local database.
2. The method according to claim 1, characterized in that composing the new URL is divided into four steps:
1) A new URL is manually initialized and set to empty;
2) During page data collection, the URL of the page being collected is read and placed in the initialized URL;
3) For each logged-in user, the microblog provides a UID and some of the user's personal information on the microblog; this content is placed in the newly initialized URL;
4) The new URL module is formed by string concatenation; the user's information structure comprises structured data composed of the user's UID and the user's personal information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410564110.0A CN104317880A (en) | 2014-10-22 | 2014-10-22 | Method special for microblog data acquisition mode |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104317880A true CN104317880A (en) | 2015-01-28 |
Family
ID=52373112
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410564110.0A Pending CN104317880A (en) | 2014-10-22 | 2014-10-22 | Method special for microblog data acquisition mode |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104317880A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106341470A (en) * | 2016-08-31 | 2017-01-18 | 北京量科邦信息技术有限公司 | Method for keeping conversation and grasping continuously-updated data of conversation |
CN108345662A (en) * | 2018-02-01 | 2018-07-31 | 福建师范大学 | A kind of microblog data weighted statistical method of registering considering user distribution area differentiation |
CN109388735A (en) * | 2018-09-13 | 2019-02-26 | 广州丰石科技有限公司 | A method of crawling wechat public platform information |
CN109815382A (en) * | 2018-12-29 | 2019-05-28 | 中国科学院计算技术研究所 | The perception and acquisition methods and system of large scale network data |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020103823A1 (en) * | 2001-02-01 | 2002-08-01 | International Business Machines Corporation | Method and system for extending the performance of a web crawler |
CN103618649A (en) * | 2013-12-03 | 2014-03-05 | 北京人民在线网络有限公司 | Website data acquisition method and device |
CN103984719A (en) * | 2014-05-12 | 2014-08-13 | 浪潮电子信息产业股份有限公司 | Method for acquiring by using crawler to simulate login |
-
2014
- 2014-10-22 CN CN201410564110.0A patent/CN104317880A/en active Pending
Non-Patent Citations (1)
Title |
---|
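Sun Qingyun et al. (孙青云等): "A microblog data acquisition scheme based on simulated login" (一种基于模拟登录的微博数据采集方案), Computer Technology and Development (《计算机技术与发展》) * |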
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20150128 |