CN104133834A - Designated area microblog data collecting and processing method - Google Patents

Designated area microblog data collecting and processing method

Info

Publication number
CN104133834A
Authority
CN
China
Prior art keywords
seed points
geo
microblogging
circular areas
border circular
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410254030.5A
Other languages
Chinese (zh)
Other versions
CN104133834B (en)
Inventor
任福继
刘宁
全昌勤
华磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology
Priority to CN201410254030.5A
Publication of CN104133834A
Application granted
Publication of CN104133834B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries

Abstract

The invention discloses a designated area microblog data collecting and processing method. According to the method, firstly, GEO geographic information seed point selection is carried out; then, microblog data is obtained; and finally, the microblog data is processed. The designated area microblog data collecting and processing method has the advantages that a parallel multi-user calling mode is adopted for increasing the data collecting flow rate; and multi-information-point coverage is adopted for searching and collecting the microblog data, and the requirements of designated area microblog data collection and processing can be met.

Description

Method for collecting and processing microblog data for a specified region
Technical field
The present invention relates to the field of microblog data processing methods, and specifically to a method for collecting and processing microblog data for a specified region.
Background art
With the rise of microblogging, short texts carrying a large number of personal viewpoints and emotional tendencies have accumulated rapidly, and microblog text analysis has become a popular research direction.
In microblog data collection, large-scale collection strategies usually rely on web crawlers. Crawling is fast and efficient, but the captured data is noisy: although collection time falls, the preprocessing time needed to obtain accurate data increases several-fold. Crawlers are also unstable and often risk being blocked by Sina. Small-scale collection generally calls the Sina Weibo third-party API; data collected this way has little noise and clear regional attribution, but it contains a large number of advertisements, which in turn raises the proportion of useless data.
Neither the crawler method nor conventional calls to the Sina third-party API can obtain microblog data for a specified region in large quantities; in particular, neither is suitable for processing microblog data for a specified region.
Summary of the invention
The object of the present invention is to provide a method for collecting and processing microblog data for a specified region, so as to solve the problem that the prior-art crawler method and third-party API calls cannot obtain microblog data for a specified region in large quantities.
To achieve the above object, the technical solution adopted by the present invention is:
A method for collecting and processing microblog data for a specified region, characterized in that it comprises the following steps:
(1) GEO geographic information seed point selection:
Let the target number of seed points be N. Cut out the given city region with a rectangle to determine the city's edges. Draw a diagonal of the rectangle, then draw lines parallel to it at a spacing of 10 km (at map scale) to divide the rectangle. Along each dividing line, draw circular areas with a radius of 5 km (at map scale) one after another to cover the rectangle, with no overlap between these circular areas. Where less than 5 km remains on a dividing line, cover that stretch with a circular area sized to fit the actual situation. Cover the junctions between circular areas with circular areas of radius R km (at map scale), R ≤ 5, such that the overlapping area does not exceed 3%. The centre of each circular area covering the specified region is a candidate GEO geographic information seed point, and the total number of candidates is denoted N'. The final number of seed points is determined by formula (1):
$$f = \begin{cases} N', & N' < N \\ N, & N \le N' \end{cases} \qquad (1)$$
In formula (1), f denotes the number of seed points. When the number of candidate GEO geographic information seed points N' is less than the target number N, the candidate seed points are taken as the final seed points. When N' is greater than N, the positions and radii of the candidate circular areas are adjusted so that the number of circular areas covering the rectangle is an integer no greater than N; the adjusted circular areas are then taken as the final seed point areas;
From the final seed points obtained above, i.e. the f seed points, locate each point on the map and export its longitude and latitude to obtain the GEO geographic information of the seed points;
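A minimal Python sketch of step (1), under stated assumptions: the rectangle is modelled in a flat local frame measured in kilometres from its south-west corner, circle centres are placed every 10 km along lines parallel to the diagonal, and kilometre offsets are converted to coordinates with an equirectangular approximation. The function names and the conversion constants are illustrative and not part of the patent, and the smaller junction circles of radius R and the manual adjustment applied when N' exceeds N are not modelled.

```python
import math


def candidate_centres(width_km, height_km, radius_km=5.0):
    """Centres of non-overlapping circles on lines parallel to the rectangle diagonal."""
    spacing = 2.0 * radius_km                    # 10 km between lines and between centres
    diag = math.hypot(width_km, height_km)
    dx, dy = width_km / diag, height_km / diag   # unit vector along the diagonal
    nx, ny = -dy, dx                             # unit normal to the diagonal
    steps = int(diag // spacing) + 1
    centres = []
    for i in range(-steps, steps + 1):           # index of the parallel line
        for j in range(-steps, steps + 1):       # position along that line
            x = i * spacing * nx + j * spacing * dx
            y = i * spacing * ny + j * spacing * dy
            if 0.0 <= x <= width_km and 0.0 <= y <= height_km:
                centres.append((x, y))
    return centres


def final_seed_count(n_candidates, n_target):
    """Formula (1): f = N' if N' < N, otherwise N."""
    return n_candidates if n_candidates < n_target else n_target


def to_lat_lon(x_km, y_km, origin_lat, origin_lon):
    """Equirectangular approximation (an assumption), adequate at city scale."""
    lat = origin_lat + y_km / 110.574
    lon = origin_lon + x_km / (111.320 * math.cos(math.radians(origin_lat)))
    return lat, lon
```

For a hypothetical 40 km by 30 km rectangle, `final_seed_count(len(candidate_centres(40.0, 30.0)), N)` gives f, and `to_lat_lon` turns each centre into the longitude and latitude that step (2) consumes.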
(2) Microblog data acquisition:
Using the seed point GEO geographic information obtained in step (1), call the microblog third-party API to obtain the microblog data within the specified region. The microblog data comprise the microblog creation time, the microblog content and the geographic information field. The acquired microblog data are saved locally as a UTF-8 encoded TXT file, denoted D_GEO;
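The acquisition loop for a single seed point might look like the following sketch. The patent does not name the API endpoint or its parameters, so the actual network call is represented by a caller-supplied `fetch_page` function; that function, the `Microblog` record and the paging scheme are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Microblog:
    created_at: str   # microblog creation time
    text: str         # microblog content
    lat: float        # geographic information field
    lon: float


# Hypothetical signature: (lat, lon, radius_m, page) -> list of Microblog records.
FetchPage = Callable[[float, float, int, int], List[Microblog]]


def collect_for_seed(fetch_page: FetchPage, lat: float, lon: float,
                     radius_m: int = 5000, max_pages: int = 10) -> List[Microblog]:
    """Page through the results around one seed point until a page comes back empty."""
    collected: List[Microblog] = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(lat, lon, radius_m, page)
        if not batch:
            break
        collected.extend(batch)
    return collected


def fake_fetch_page(lat, lon, radius_m, page):
    """Stand-in for the real API call: returns two pages of one record each."""
    if page > 2:
        return []
    return [Microblog(f"2014-06-09 10:0{page}:00", f"post {page}", lat, lon)]
```

Calling `collect_for_seed(fake_fetch_page, 31.82, 117.23)` exercises the loop without network access; a real `fetch_page` would wrap whichever third-party API call is available.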
(3) Microblog data processing:
From the microblog text D_GEO obtained in step (2), extract each microblog creation time paired with its corresponding geographic information and save the pairs locally as a UTF-8 encoded text file, denoted D_t×geo; extract the microblog content from D_GEO and save it locally as a UTF-8 encoded text file, denoted D_cont.
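The storage and processing steps can be sketched as follows. The patent fixes only the encoding (UTF-8) and the three outputs; the tab-separated line layout, the dictionary keys and the function names used here are assumptions made for illustration.

```python
def save_d_geo(posts, path):
    """Write D_GEO: one line per microblog as created_at, lat, lon, text (tab-separated)."""
    with open(path, "w", encoding="utf-8") as out:
        for p in posts:
            text = " ".join(p["text"].split())   # collapse tabs and newlines inside the content
            out.write(f'{p["created_at"]}\t{p["lat"]}\t{p["lon"]}\t{text}\n')


def split_d_geo(d_geo_path, d_txgeo_path, d_cont_path):
    """Derive D_t×geo (time paired with location) and D_cont (content only) from D_GEO."""
    with open(d_geo_path, encoding="utf-8") as src, \
         open(d_txgeo_path, "w", encoding="utf-8") as txgeo, \
         open(d_cont_path, "w", encoding="utf-8") as cont:
        for line in src:
            created_at, lat, lon, text = line.rstrip("\n").split("\t", 3)
            txgeo.write(f"{created_at}\t{lat}\t{lon}\n")
            cont.write(text + "\n")
```

Keeping one microblog per line means that line k of D_t×geo and line k of D_cont refer to the same post, so the two files can be rejoined later if needed.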
The present invention improves on the use of the Sina third-party API: calls are made in parallel by multiple users to increase the data collection throughput, and coverage by multiple information points is used to collect microblog data, compensating for the limited precision of the data returned by the microblog interface. The method can therefore meet the requirements of collecting and processing microblog data for a specified region.
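One way to picture the parallel multi-user calling mode is a thread pool with one worker per user account, with seed points dealt out to the accounts in round-robin order. This is a sketch under assumptions: `fetch(token, lat, lon)` is a hypothetical caller-supplied function wrapping the API call for one account, and per-account rate limits are not modelled.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle


def collect_parallel(seeds, tokens, fetch):
    """Query every seed point, spreading the calls over several user tokens.

    seeds  : list of (lat, lon) seed points from step (1)
    tokens : list of per-user access tokens (hypothetical credentials)
    fetch  : callable (token, lat, lon) -> list of microblog records
    """
    if not tokens:
        raise ValueError("at least one user token is required")
    # Pair each seed point with a token, cycling through the tokens.
    jobs = list(zip(cycle(tokens), seeds))
    with ThreadPoolExecutor(max_workers=len(tokens)) as pool:
        batches = pool.map(lambda job: fetch(job[0], *job[1]), jobs)
        # pool.map preserves job order, so results follow the seed order.
        return [post for batch in batches for post in batch]
```

With two tokens and three seed points, the first and third seeds are queried with the first token and the second seed with the second token.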
Embodiment
In the method for collecting and processing microblog data for a specified region, the region is an area in which microblog users publish microblogs, and its boundary follows administrative boundaries; the region's microblogs are all microblogs posted by microblog users located within the specified region. The method comprises the following steps:
(1) GEO geographic information seed point selection:
Let the target number of seed points be N. Cut out the given city region with a rectangle to determine the city's edges. Draw a diagonal of the rectangle, then draw lines parallel to it at a spacing of 10 km (at map scale) to divide the rectangle. Along each dividing line, draw circular areas with a radius of 5 km (at map scale) one after another to cover the rectangle, with no overlap between these circular areas. Where less than 5 km remains on a dividing line, cover that stretch with a circular area sized to fit the actual situation. Cover the junctions between circular areas with circular areas of radius R km (at map scale), R ≤ 5, such that the overlapping area does not exceed 3%. The centre of each circular area covering the specified region is a candidate GEO geographic information seed point, and the total number of candidates is denoted N'. The final number of seed points is determined by formula (1):
$$f = \begin{cases} N', & N' < N \\ N, & N \le N' \end{cases} \qquad (1)$$
In formula (1), f denotes the number of seed points. When the number of candidate GEO geographic information seed points N' is less than the target number N, the candidate seed points are taken as the final seed points. When N' is greater than N, the positions and radii of the candidate circular areas are adjusted so that the number of circular areas covering the rectangle is an integer no greater than N; the adjusted circular areas are then taken as the final seed point areas;
From the final seed points obtained above, i.e. the f seed points, locate each point on the map and export its longitude and latitude to obtain the GEO geographic information of the seed points;
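The 3% overlap requirement in step (1) can be checked numerically with the standard formula for the intersection area of two circles. The text does not say what the 3% is measured against, so the sketch below assumes it is a fraction of the smaller circle's area; the function names are illustrative.

```python
import math


def lens_area(r1, r2, d):
    """Intersection area of two circles with radii r1, r2 and centre distance d."""
    if d >= r1 + r2:                      # disjoint or tangent circles
        return 0.0
    if d <= abs(r1 - r2):                 # one circle lies inside the other
        return math.pi * min(r1, r2) ** 2
    a1 = r1 * r1 * math.acos((d * d + r1 * r1 - r2 * r2) / (2 * d * r1))
    a2 = r2 * r2 * math.acos((d * d + r2 * r2 - r1 * r1) / (2 * d * r2))
    tri = 0.5 * math.sqrt((-d + r1 + r2) * (d + r1 - r2) * (d - r1 + r2) * (d + r1 + r2))
    return a1 + a2 - tri


def overlap_fraction(r1, r2, d):
    """Overlap as a fraction of the smaller circle's area (an assumed reading of the 3% rule)."""
    return lens_area(r1, r2, d) / (math.pi * min(r1, r2) ** 2)


def within_overlap_limit(r1, r2, d, limit=0.03):
    """True when the two circles overlap by no more than the given fraction."""
    return overlap_fraction(r1, r2, d) <= limit
```

Two 5 km circles whose centres are 10 km apart touch without overlapping, so the check passes. A junction circle of radius R is tested against each neighbouring 5 km circle in turn, and R is reduced until every pair satisfies the limit.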
(2) Microblog data acquisition:
Using the seed point GEO geographic information obtained in step (1), call the microblog third-party API to obtain the microblog data within the specified region. The microblog data comprise the microblog creation time, the microblog content and the geographic information field. The acquired microblog data are saved locally as a UTF-8 encoded TXT file, denoted D_GEO;
(3) Microblog data processing:
From the microblog text D_GEO obtained in step (2), extract each microblog creation time paired with its corresponding geographic information and save the pairs locally as a UTF-8 encoded text file, denoted D_t×geo; extract the microblog content from D_GEO and save it locally as a UTF-8 encoded text file, denoted D_cont.

Claims (1)

1. A method for collecting and processing microblog data for a specified region, characterized in that it comprises the following steps:
(1) GEO geographic information seed point selection:
Let the target number of seed points be N. Cut out the given city region with a rectangle to determine the city's edges. Draw a diagonal of the rectangle, then draw lines parallel to it at a spacing of 10 km (at map scale) to divide the rectangle. Along each dividing line, draw circular areas with a radius of 5 km (at map scale) one after another to cover the rectangle, with no overlap between these circular areas. Where less than 5 km remains on a dividing line, cover that stretch with a circular area sized to fit the actual situation. Cover the junctions between circular areas with circular areas of radius R km (at map scale), R ≤ 5, such that the overlapping area does not exceed 3%. The centre of each circular area covering the specified region is a candidate GEO geographic information seed point, and the total number of candidates is denoted N'. The final number of seed points is determined by formula (1):
$$f = \begin{cases} N', & N' < N \\ N, & N \le N' \end{cases} \qquad (1)$$
In formula (1), f denotes the number of seed points. When the number of candidate GEO geographic information seed points N' is less than the target number N, the candidate seed points are taken as the final seed points. When N' is greater than N, the positions and radii of the candidate circular areas are adjusted so that the number of circular areas covering the rectangle is an integer no greater than N; the adjusted circular areas are then taken as the final seed point areas;
From the final seed points obtained above, i.e. the f seed points, locate each point on the map and export its longitude and latitude to obtain the GEO geographic information of the seed points;
(2) Microblog data acquisition:
Using the seed point GEO geographic information obtained in step (1), call the microblog third-party API to obtain the microblog data within the specified region. The microblog data comprise the microblog creation time, the microblog content and the geographic information field. The acquired microblog data are saved locally as a UTF-8 encoded TXT file, denoted D_GEO;
(3) Microblog data processing:
From the microblog text D_GEO obtained in step (2), extract each microblog creation time paired with its corresponding geographic information and save the pairs locally as a UTF-8 encoded text file, denoted D_t×geo; extract the microblog content from D_GEO and save it locally as a UTF-8 encoded text file, denoted D_cont.
CN201410254030.5A 2014-06-09 2014-06-09 Specify the collection of region microblog data and processing method Active CN104133834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410254030.5A CN104133834B (en) 2014-06-09 2014-06-09 Specify the collection of region microblog data and processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410254030.5A CN104133834B (en) 2014-06-09 2014-06-09 Specify the collection of region microblog data and processing method

Publications (2)

Publication Number Publication Date
CN104133834A true CN104133834A (en) 2014-11-05
CN104133834B CN104133834B (en) 2018-05-04

Family

ID=51806512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410254030.5A Active CN104133834B (en) 2014-06-09 2014-06-09 Specify the collection of region microblog data and processing method

Country Status (1)

Country Link
CN (1) CN104133834B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106933898A (en) * 2015-12-31 2017-07-07 北京国双科技有限公司 The treating method and apparatus of info web
CN113190648A (en) * 2021-04-16 2021-07-30 湖州师范学院 Context semantic based emotion analysis method for microblog short text

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102291435A (en) * 2011-07-15 2011-12-21 武汉大学 Mobile information searching and knowledge discovery system based on geographic spatiotemporal data
CN102622443A (en) * 2012-03-13 2012-08-01 北京邮电大学 Customized screening system and method for microblog
CN103092950A (en) * 2013-01-15 2013-05-08 重庆邮电大学 Online public opinion geographical location real time monitoring system and method
US20130238658A1 (en) * 2012-03-07 2013-09-12 Snap Trends, Inc. Methods and Systems of Aggregating Information of Social Networks Based on Changing Geographical Locations of a Computing Device Via a Network
CN103546447A (en) * 2012-07-17 2014-01-29 腾讯科技(深圳)有限公司 Information display method, information display system, client side and server

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102291435A (en) * 2011-07-15 2011-12-21 武汉大学 Mobile information searching and knowledge discovery system based on geographic spatiotemporal data
US20130238658A1 (en) * 2012-03-07 2013-09-12 Snap Trends, Inc. Methods and Systems of Aggregating Information of Social Networks Based on Changing Geographical Locations of a Computing Device Via a Network
CN102622443A (en) * 2012-03-13 2012-08-01 北京邮电大学 Customized screening system and method for microblog
CN103546447A (en) * 2012-07-17 2014-01-29 腾讯科技(深圳)有限公司 Information display method, information display system, client side and server
CN103092950A (en) * 2013-01-15 2013-05-08 重庆邮电大学 Online public opinion geographical location real time monitoring system and method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106933898A (en) * 2015-12-31 2017-07-07 北京国双科技有限公司 The treating method and apparatus of info web
CN106933898B (en) * 2015-12-31 2020-08-11 北京国双科技有限公司 Webpage information processing method and device
CN113190648A (en) * 2021-04-16 2021-07-30 湖州师范学院 Context semantic based emotion analysis method for microblog short text

Also Published As

Publication number Publication date
CN104133834B (en) 2018-05-04


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant