CN104317880A - Method special for microblog data acquisition mode - Google Patents

Info

Publication number
CN104317880A
Authority
CN
China
Prior art keywords
data
microblog
url
page
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410564110.0A
Other languages
Chinese (zh)
Inventor
焦毓葳
徐宏伟
王传超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Group Co Ltd
Original Assignee
Inspur Software Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Software Group Co Ltd
Priority to CN201410564110.0A
Publication of CN104317880A
Current legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566: URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method dedicated to microblog data acquisition. The method first simulates a user logging in to a microblog account and records the login information, then captures data from the microblog, wherein: 1) the microblog page of the target user is used as the entry point, the microblog information on that initial page is crawled, and valid structured data is formed; 2) the user's UID and the URL of the current page are combined into a new URL, which is stored in a queue so that the UID associated with the currently crawled page content can be readily obtained; 3) the crawled data is stored in a local database. The method designs and implements the acquisition of microblog data that is updated in real time, and can effectively address both microblog login restrictions and the real-time updating of microblog data. Microblog data is difficult to acquire because of its large volume, its many storage formats, and the difficulty of storing it, so acquiring it efficiently has become an important problem.

Description

A method dedicated to microblog data acquisition
Technical field
The present invention relates to the field of communication technology, and specifically to a method dedicated to microblog data acquisition. The patent involves data capture, simulated login, data acquisition, databases, and unstructured data.
Background art
Microblog data is difficult to acquire because of its large volume, its many storage formats, and the difficulty of storing it, so acquiring it efficiently has become an important problem. Traditional internet data acquisition cannot update the collected data in real time: each login collects only a small amount of data, which does not meet the demand and leaves the data incomplete.
Summary of the invention
The object of the invention is to provide a method dedicated to microblog data acquisition.
The object of the invention is achieved as follows. Using simulated login, microblog data is collected continuously and the collected data is updated in real time, which can greatly increase the amount of data acquired and provides a basis for later data analysis and processing. Unstructured data reflects the variety that characterises big data, and clickstream data is one part of that rich data: with capable web analytics tools, the finest-grained raw data can be obtained. Unstructured data includes text, video, documents, audio, and even geographical location information, and text mining can be applied to the unstructured data in the clickstream; the open question is how to make better use of this data. To address the above problems of traditional real-time data acquisition, the traditional microblog data acquisition mode is improved. The process is divided into two stages: first, a user login to a microblog account is simulated and the login information is recorded; then the data in the microblog is captured, wherein:
1) the microblog page of the target user is used as the entry point, the microblog information on the start page is crawled, and valid structured data is formed;
2) the user's UID and the URL of the current page are combined into a new URL, which is stored in a queue so that the UID associated with the currently crawled page content can be conveniently obtained;
3) the crawled data is stored in a local database.
Composing the new URL is divided into four steps:
1) a new URL is manually initialised and set to empty;
2) during page data collection, the URL of the page being collected is read and placed into the initialised URL;
3) for each logged-in user, the microblog provides a UID and some of the user's personal information; this content is placed into the newly initialised URL;
4) the new URL is formed by string concatenation; the user's information structure comprises structured data composed of the user's UID and the user's personal information.
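One way to read the four composition steps is sketched below. The text does not say how the UID and personal information are laid out inside the URL, so the query-string layout, the `UserInfo` class, and the `compose_url` helper are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from urllib.parse import quote


@dataclass
class UserInfo:
    """Structured data made up of the user's UID and personal information."""
    uid: str
    profile: dict = field(default_factory=dict)


def compose_url(page_url: str, user: UserInfo) -> str:
    # Step 1: initialise a new, empty URL.
    new_url = ""
    # Step 2: place the URL of the page being collected into it.
    new_url += page_url
    # Step 3: gather the UID and personal information for this user.
    parts = ["uid=" + quote(user.uid)]
    for key, value in sorted(user.profile.items()):
        parts.append(quote(key) + "=" + quote(str(value)))
    # Step 4: join everything by plain string concatenation.
    separator = "&" if "?" in new_url else "?"
    return new_url + separator + "&".join(parts)


user = UserInfo(uid="1001", profile={"nick": "alice", "city": "Jinan"})
composed = compose_url("https://weibo.example/u/1001", user)
```

Sorting the profile keys is a design choice of this sketch: it makes the composed URL deterministic, so the same user and page always map to the same queue entry.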
The beneficial effects of the invention are as follows. Microblog data is difficult to acquire because of its large volume, its many storage formats, and the difficulty of storing it, so acquiring it efficiently has become an important problem. The invention designs and implements the acquisition of microblog data that is updated in real time, and can effectively address both microblog login restrictions and the real-time updating of microblog data.
Brief description of the drawings
Fig. 1 is a diagram of the working principle.
Detailed description
The method of the invention is described in detail below with reference to the accompanying drawing.
Because login must support multiple users, two approaches can be taken: in one, the account that logs in is specified manually; in the other, logins are performed virtually in batches, with simulated virtual addresses. After the simulated login, collection of page data can begin. The collected page information is matched one-to-one with the initialised URLs, yielding structured data that is convenient for analysis, which is stored in the local database. The steps are as follows: after the simulated login, the microblog page of the target user is used as the entry point, the microblog information on the start page is crawled, and valid structured data is formed; the user's UID and the URL of the current page are combined into a new URL, which is stored in a queue so that the UID associated with the currently crawled page content can be conveniently obtained.
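The batch login mode described above might be prepared along these lines. This is only a sketch: the login URL and the `username`/`password` form fields are hypothetical, the requests are built but never sent, and a real microblog site would likely require further steps that the text does not describe.

```python
import http.cookiejar
import urllib.parse
import urllib.request

LOGIN_URL = "https://weibo.example/login"  # hypothetical endpoint


def make_session():
    """Create an opener whose cookie jar records the login state of one account."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(username, password):
    """Build (but do not send) a form POST that imitates a browser login."""
    form = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(
        LOGIN_URL,
        data=form.encode("utf-8"),  # supplying data makes this a POST request
        headers={"User-Agent": "Mozilla/5.0"},
    )


def prepare_batch(accounts):
    """One independent session per account, so cookies are never shared."""
    sessions = {}
    for username, password in accounts:
        opener, jar = make_session()
        sessions[username] = (opener, jar, build_login_request(username, password))
    return sessions


sessions = prepare_batch([("alice", "secret"), ("bob", "hunter2")])
# Sending would be: opener.open(request). It is omitted so the sketch runs offline.
```

After a successful `opener.open(request)`, the cookie jar would hold the recorded login information, and the same opener could be reused for the page collection that follows.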
Embodiment
The traditional microblog data acquisition mode is improved by the following steps: first, a user login to a microblog account is simulated and the login information is recorded; then the data in the microblog is captured, wherein:
1) the microblog page of the target user is used as the entry point, the microblog information on the start page is crawled, and valid structured data is formed;
2) the user's UID and the URL of the current page are combined into a new URL, which is stored in a queue so that the UID associated with the currently crawled page content can be conveniently obtained;
3) the crawled data is stored in a local database.
Composing the new URL is divided into four steps:
1) a new URL is manually initialised and set to empty;
2) during page data collection, the URL of the page being collected is read and placed into the initialised URL;
3) for each logged-in user, the microblog provides a UID and some of the user's personal information; this content is placed into the newly initialised URL;
4) the new URL is formed by string concatenation; the user's information structure comprises structured data composed of the user's UID and the user's personal information.
Everything other than the technical features described in this specification is known technology to those skilled in the art.

Claims (2)

1. A method dedicated to microblog data acquisition, characterised in that simulated login is used to collect microblog data continuously and to update the collected data in real time, which can greatly increase the amount of data acquired and provides a basis for later data analysis and processing; unstructured data reflects the variety that characterises big data, and clickstream data is one part of that rich data, so that with capable web analytics tools the finest-grained raw data can be obtained; unstructured data includes text, video, documents, audio, and even geographical location information, and text mining can be applied to the unstructured data in the clickstream so that it is put to better use; to address the above problems of traditional real-time data acquisition, the traditional microblog data acquisition mode is improved, the process being divided into two stages: first, a user login to a microblog account is simulated and the login information is recorded; then the data in the microblog is captured, wherein:
1) the microblog page of the target user is used as the entry point, the microblog information on the start page is crawled, and valid structured data is formed;
2) the user's UID and the URL of the current page are combined into a new URL, which is stored in a queue so that the UID associated with the currently crawled page content can be conveniently obtained;
3) the crawled data is stored in a local database.
2. The method according to claim 1, characterised in that composing the new URL is divided into four steps:
1) a new URL is manually initialised and set to empty;
2) during page data collection, the URL of the page being collected is read and placed into the initialised URL;
3) for each logged-in user, the microblog provides a UID and some of the user's personal information; this content is placed into the newly initialised URL;
4) the new URL is formed by string concatenation; the user's information structure comprises structured data composed of the user's UID and the user's personal information.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410564110.0A CN104317880A (en) 2014-10-22 2014-10-22 Method special for microblog data acquisition mode

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410564110.0A CN104317880A (en) 2014-10-22 2014-10-22 Method special for microblog data acquisition mode

Publications (1)

Publication Number Publication Date
CN104317880A (en) 2015-01-28

Family

ID=52373112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410564110.0A Pending CN104317880A (en) 2014-10-22 2014-10-22 Method special for microblog data acquisition mode

Country Status (1)

Country Link
CN (1) CN104317880A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106341470A (en) * 2016-08-31 2017-01-18 北京量科邦信息技术有限公司 Method for keeping conversation and grasping continuously-updated data of conversation
CN108345662A (en) * 2018-02-01 2018-07-31 福建师范大学 A kind of microblog data weighted statistical method of registering considering user distribution area differentiation
CN109388735A (en) * 2018-09-13 2019-02-26 广州丰石科技有限公司 A method of crawling wechat public platform information
CN109815382A (en) * 2018-12-29 2019-05-28 中国科学院计算技术研究所 The perception and acquisition methods and system of large scale network data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020103823A1 (en) * 2001-02-01 2002-08-01 International Business Machines Corporation Method and system for extending the performance of a web crawler
CN103618649A (en) * 2013-12-03 2014-03-05 北京人民在线网络有限公司 Website data acquisition method and device
CN103984719A (en) * 2014-05-12 2014-08-13 浪潮电子信息产业股份有限公司 Method for acquiring by using crawler to simulate login


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孙青云等: "一种基于模拟登录的微博数据采集方案", 《计算机技术与发展》 *


Similar Documents

Publication Publication Date Title
Young et al. Building library community through social media
CN105426502A (en) Social network based person information search and relational network drawing method
CN104615627B (en) A kind of event public feelings information extracting method and system based on microblog
CN104618806A (en) Method, device and system for acquiring comment information of video
CN106528693A (en) Individualized learning-oriented educational resource recommendation method and system
CN106339439A (en) Big data analysis method
CN104317880A (en) Method special for microblog data acquisition mode
CN105469335A (en) PC front end based learning system
CN103365851A (en) Method and system for sharing users' surfing behavior on basis of virtual organization
CN106230809B (en) A kind of mobile Internet public sentiment monitoring method and system based on URL
CN104462482A (en) Content providing method and system for medium display
CN105183925A (en) Content association recommending method and content association recommending device
CN104991904A (en) Page data acquisition method of dynamic webpage
CN108536700A (en) A kind of method that nothing buries a collector journal
CN104765823A (en) Method and device for collecting website data
CN104408105A (en) Friend recommendation method applicable for intelligent TV (Television) users
Brito et al. Experiences integrating heterogeneous government open data sources to deliver services and promote transparency in brazil
CN111161846A (en) Knowledge base construction and analysis method, device and equipment based on electronic psychological sand table
CN105989167B (en) Collecting method and device based on news client
WO2014114071A1 (en) Method for generating emotional story recording background
CN102880674A (en) Method for automatically collecting topic video based on video website
CN106354770A (en) Data analysis system
CN107247772A (en) A kind of picture and text search engine based on internet
Lu et al. Research on forensic model of online social network
CN103279527B (en) A kind of user interest network address method for digging and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150128