CN104133834A - Designated area microblog data collecting and processing method - Google Patents
Designated area microblog data collecting and processing method
- Publication number
- CN104133834A (application CN201410254030.5A)
- Authority
- CN
- China
- Prior art keywords
- seed points
- geo
- microblogging
- circular areas
- border circular
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9537—Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
Abstract
The invention discloses a designated area microblog data collecting and processing method. According to the method, firstly, GEO geographic information seed point selection is carried out; then, microblog data is obtained; and finally, the microblog data is processed. The designated area microblog data collecting and processing method has the advantages that a parallel multi-user calling mode is adopted for increasing the data collecting flow rate; and multi-information-point coverage is adopted for searching and collecting the microblog data, and the requirements of designated area microblog data collection and processing can be met.
Description
Technical field
The present invention relates to the field of microblog data processing methods, and specifically to a method for collecting and processing microblog data for a designated area.
Background art
With the rise of microblogging, short texts containing large numbers of fine-grained opinions and emotional tendencies have accumulated rapidly, and microblog text analysis has become a popular research direction.
For large volumes of microblog data, collection strategies usually rely on web crawlers. Crawling is fast and efficient, but the captured data are noisy: although collection time is reduced, the preprocessing time needed to obtain accurate data is multiplied. Crawlers are also unstable and constantly risk being blocked by Sina. Small volumes of microblog data are generally collected by calling the Sina Weibo third-party API. Data collected this way have little noise and clear regional attribution, but they contain many advertisements, which again raises the proportion of useless content.
Neither crawling nor conventional Sina third-party API calls can obtain large amounts of microblog data for a designated area; in particular, neither is suitable for processing microblog data for a designated area.
Summary of the invention
The object of the present invention is to provide a method for collecting and processing microblog data for a designated area, in order to solve the problem that prior-art crawling and third-party API calls cannot obtain large amounts of microblog data for a designated area.
To achieve the above object, the technical solution adopted by the present invention is:
A method for collecting and processing microblog data for a designated area, characterized by comprising the following steps:
(1) Selection of GEO geographic-information seed points:
Let the target number of seed points be N. Bound the given city with a rectangle to determine the city's edges. Draw a diagonal of the rectangle, then draw lines parallel to it at a spacing of 10 km (map-scale length) to divide the rectangle. Along each parallel line, cover the rectangle successively with circular areas of radius 5 km (map-scale length), with no overlap between circles. Segments of a dividing line shorter than 5 km are covered with suitably sized circular areas according to the actual situation. The gaps where circular areas meet are covered with circular areas of map-scale radius R km, R ≤ 5, such that the overlapping area does not exceed 3%. The center of each circular area covering the designated area is a candidate GEO geographic-information seed point; the total number of candidate seed points is denoted N', and the final number of seed points is determined according to formula (1):
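A piecewise reading of formula (1) that is consistent with the definition of f given next, offered as an assumption and not as the patent's own notation:

```latex
f =
\begin{cases}
N', & N' < N \\[4pt]
n, \quad n \in \mathbb{Z},\ 1 \le n \le N, & N' > N \ \text{(after adjusting circle positions and radii)}
\end{cases}
```

The case N' = N is not addressed in the text.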
In formula (1), f denotes the number of seed points. When the number of candidate seed points N' is less than the target number N, the candidate seed points are taken as the final seed points. When N' is greater than N, the positions and radii of the candidate seed points' circular areas are adjusted so that the number of circular areas covering the rectangle is an integer no greater than N; the adjusted circular areas are then selected as the final seed-point regions;
For the final seed points obtained above, i.e. the seed points represented by f, locate them on the map and export their longitude and latitude to obtain the GEO geographic information of the seed points;
(2) Microblog data acquisition:
Using the seed-point GEO geographic information obtained in step (1), call the microblog third-party API to obtain the microblog data within the designated area. The microblog data comprise the creation time, the message content, and the geographic information field. The data obtained are saved locally as a UTF-8 encoded TXT file, denoted D_GEO;
(3) Microblog data processing:
From the microblog text D_GEO obtained in step (2), extract each microblog's creation time paired with its corresponding geographic information and save the pairs locally as a UTF-8 text file, denoted D_t×geo; extract the message content from D_GEO and save it locally as a UTF-8 text file, denoted D_cont.
The present invention improves on the use of the Sina third-party API by adopting parallel multi-user calls to increase data collection throughput. It collects microblog data by covering the area with multiple information points, which compensates for the limited accuracy of data obtained through the microblog interface, and can thus meet the requirements of collecting and processing microblog data for a designated area.
Embodiment
In the method for collecting and processing microblog data for a designated area, the area is one in which microblog users post, and its boundary follows administrative boundaries. The area's microblog data are all microblogs sent by users located within the designated area. The method comprises the following steps:
(1) Selection of GEO geographic-information seed points:
Let the target number of seed points be N. Bound the given city with a rectangle to determine the city's edges. Draw a diagonal of the rectangle, then draw lines parallel to it at a spacing of 10 km (map-scale length) to divide the rectangle. Along each parallel line, cover the rectangle successively with circular areas of radius 5 km (map-scale length), with no overlap between circles. Segments of a dividing line shorter than 5 km are covered with suitably sized circular areas according to the actual situation. The gaps where circular areas meet are covered with circular areas of map-scale radius R km, R ≤ 5, such that the overlapping area does not exceed 3%. The center of each circular area covering the designated area is a candidate GEO geographic-information seed point; the total number of candidate seed points is denoted N', and the final number of seed points is determined according to formula (1):
In formula (1), f denotes the number of seed points. When the number of candidate seed points N' is less than the target number N, the candidate seed points are taken as the final seed points. When N' is greater than N, the positions and radii of the candidate seed points' circular areas are adjusted so that the number of circular areas covering the rectangle is an integer no greater than N; the adjusted circular areas are then selected as the final seed-point regions;
For the final seed points obtained above, i.e. the seed points represented by f, locate them on the map and export their longitude and latitude to obtain the GEO geographic information of the seed points;
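Step (1) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented procedure itself: it places non-overlapping circles on lines parallel to the rectangle's diagonal, omits the smaller gap-filling circles (R ≤ 5), interprets "adjusting position and radius" as enlarging the radius in fixed steps until at most N centers remain, and converts kilometres to degrees with a flat-earth approximation. The function names, the step size, and the constant 111.32 km per degree are all introduced here for illustration.

```python
import math

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude (assumption)
EPS = 1e-9              # tolerance for floating-point comparisons at the rectangle edge


def candidate_centers(width_km, height_km, radius_km=5.0):
    """Circle centers (x_km, y_km) on lines parallel to the rectangle's diagonal.

    Lines are spaced 2 * radius_km apart and centers are spaced 2 * radius_km
    along each line, so neighboring circles touch without overlapping.
    """
    spacing = 2.0 * radius_km
    theta = math.atan2(height_km, width_km)    # direction of the diagonal
    ux, uy = math.cos(theta), math.sin(theta)  # unit vector along the diagonal
    vx, vy = -uy, ux                           # unit vector across the diagonal
    steps = math.ceil(math.hypot(width_km, height_km) / spacing) + 1
    centers = []
    for i in range(-steps, steps + 1):         # which parallel line
        for j in range(-steps, steps + 1):     # position along that line
            x = j * spacing * ux + i * spacing * vx
            y = j * spacing * uy + i * spacing * vy
            if -EPS <= x <= width_km + EPS and -EPS <= y <= height_km + EPS:
                centers.append((x, y))
    return centers


def select_seed_points(width_km, height_km, n_target, radius_km=5.0, step_km=0.5):
    """Keep the candidates if N' <= N; otherwise enlarge the radius (assumed
    adjustment rule) until at most n_target centers remain."""
    if n_target < 1:
        raise ValueError("n_target must be at least 1")
    centers = candidate_centers(width_km, height_km, radius_km)
    while len(centers) > n_target:
        radius_km += step_km
        centers = candidate_centers(width_km, height_km, radius_km)
    return centers, radius_km


def to_lat_lon(centers, south_lat, west_lon):
    """Convert local km offsets from the south-west corner to (lat, lon) pairs."""
    km_per_lon_degree = KM_PER_DEGREE * math.cos(math.radians(south_lat))
    return [(south_lat + y / KM_PER_DEGREE, west_lon + x / km_per_lon_degree)
            for x, y in centers]


# Hypothetical 40 km x 30 km bounding rectangle and corner coordinates.
centers, radius = select_seed_points(40.0, 30.0, n_target=20)
seeds = to_lat_lon(centers, south_lat=31.70, west_lon=117.10)
```

Enlarging every circle uniformly is only one way to reduce the count; the text leaves the adjustment rule open, so positions could equally be shifted by hand on the map.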
(2) Microblog data acquisition:
Using the seed-point GEO geographic information obtained in step (1), call the microblog third-party API to obtain the microblog data within the designated area. The microblog data comprise the creation time, the message content, and the geographic information field. The data obtained are saved locally as a UTF-8 encoded TXT file, denoted D_GEO;
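The parallel multi-user collection in step (2) might look like the sketch below. The real API call is deliberately not shown: `fetch` is a caller-supplied function, and `fake_fetch`, the token strings, the field names `created_at`, `text` and `geo`, and the one-JSON-object-per-line layout of the TXT file are all assumptions made for illustration.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle


def collect_to_txt(seed_points, tokens, fetch, out_path):
    """Fetch posts for every seed point in parallel and save them as UTF-8 text.

    fetch(seed_point, token) is supplied by the caller and must return a list
    of dicts; it stands in for the real API call, which is not shown here.
    """
    if not tokens:
        raise ValueError("at least one user token is required")
    # Round-robin: spread the seed points evenly across the user accounts.
    jobs = list(zip(seed_points, cycle(tokens)))
    with ThreadPoolExecutor(max_workers=len(tokens)) as pool:
        batches = list(pool.map(lambda job: fetch(*job), jobs))
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for batch in batches:
            for post in batch:
                # One JSON object per line; ensure_ascii=False keeps Chinese text readable.
                out.write(json.dumps(post, ensure_ascii=False) + "\n")
                written += 1
    return written


def fake_fetch(seed_point, token):
    """Stand-in for the API call: returns one made-up post per seed point."""
    lat, lon = seed_point
    return [{"created_at": "2014-06-09 12:00:00",
             "text": "示例微博",
             "geo": {"lat": lat, "lon": lon}}]
```

Using one worker thread per user account keeps each account's requests sequential, which is a simple way to stay within per-account rate limits while still collecting in parallel.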
(3) Microblog data processing:
From the microblog text D_GEO obtained in step (2), extract each microblog's creation time paired with its corresponding geographic information and save the pairs locally as a UTF-8 text file, denoted D_t×geo; extract the message content from D_GEO and save it locally as a UTF-8 text file, denoted D_cont.
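Step (3) amounts to splitting D_GEO into two derived files. The sketch below assumes D_GEO holds one JSON object per line with the fields `created_at`, `text` and `geo`; that layout, the tab separator, and the function name are illustrative choices rather than details given in the text.

```python
import json


def split_d_geo(d_geo_path, d_tgeo_path, d_cont_path):
    """Write (creation time, geo) pairs and message contents to two UTF-8 files.

    Returns the number of posts processed. Blank lines in the input are skipped.
    """
    processed = 0
    with open(d_geo_path, encoding="utf-8") as src, \
            open(d_tgeo_path, "w", encoding="utf-8") as tgeo, \
            open(d_cont_path, "w", encoding="utf-8") as cont:
        for line in src:
            line = line.strip()
            if not line:
                continue
            post = json.loads(line)
            geo = json.dumps(post["geo"], ensure_ascii=False)
            # D_t×geo: creation time and geographic information, tab-separated.
            tgeo.write(f"{post['created_at']}\t{geo}\n")
            # D_cont: message content, flattened to a single line per post.
            cont.write(post["text"].replace("\n", " ") + "\n")
            processed += 1
    return processed
```

Because both output files are written in the same pass and the same order, line k of D_cont corresponds to line k of D_t×geo, which keeps the content and its time/location pair easy to rejoin later.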
Claims (1)
1. A method for collecting and processing microblog data for a designated area, characterized by comprising the following steps:
(1) Selection of GEO geographic-information seed points:
Let the target number of seed points be N. Bound the given city with a rectangle to determine the city's edges. Draw a diagonal of the rectangle, then draw lines parallel to it at a spacing of 10 km (map-scale length) to divide the rectangle. Along each parallel line, cover the rectangle successively with circular areas of radius 5 km (map-scale length), with no overlap between circles. Segments of a dividing line shorter than 5 km are covered with suitably sized circular areas according to the actual situation. The gaps where circular areas meet are covered with circular areas of map-scale radius R km, R ≤ 5, such that the overlapping area does not exceed 3%. The center of each circular area covering the designated area is a candidate GEO geographic-information seed point; the total number of candidate seed points is denoted N', and the final number of seed points is determined according to formula (1):
In formula (1), f denotes the number of seed points. When the number of candidate seed points N' is less than the target number N, the candidate seed points are taken as the final seed points. When N' is greater than N, the positions and radii of the candidate seed points' circular areas are adjusted so that the number of circular areas covering the rectangle is an integer no greater than N; the adjusted circular areas are then selected as the final seed-point regions;
For the final seed points obtained above, i.e. the seed points represented by f, locate them on the map and export their longitude and latitude to obtain the GEO geographic information of the seed points;
(2) Microblog data acquisition:
Using the seed-point GEO geographic information obtained in step (1), call the microblog third-party API to obtain the microblog data within the designated area. The microblog data comprise the creation time, the message content, and the geographic information field. The data obtained are saved locally as a UTF-8 encoded TXT file, denoted D_GEO;
(3) Microblog data processing:
From the microblog text D_GEO obtained in step (2), extract each microblog's creation time paired with its corresponding geographic information and save the pairs locally as a UTF-8 text file, denoted D_t×geo; extract the message content from D_GEO and save it locally as a UTF-8 text file, denoted D_cont.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410254030.5A CN104133834B (en) | 2014-06-09 | 2014-06-09 | Specify the collection of region microblog data and processing method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410254030.5A CN104133834B (en) | 2014-06-09 | 2014-06-09 | Specify the collection of region microblog data and processing method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104133834A (en) | 2014-11-05 |
CN104133834B (en) | 2018-05-04 |
Family
ID=51806512
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201410254030.5A | Active | CN104133834B (en) | 2014-06-09 | 2014-06-09 | Specify the collection of region microblog data and processing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104133834B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102291435A (en) * | 2011-07-15 | 2011-12-21 | 武汉大学 | Mobile information searching and knowledge discovery system based on geographic spatiotemporal data |
US20130238658A1 (en) * | 2012-03-07 | 2013-09-12 | Snap Trends, Inc. | Methods and Systems of Aggregating Information of Social Networks Based on Changing Geographical Locations of a Computing Device Via a Network |
CN102622443A (en) * | 2012-03-13 | 2012-08-01 | 北京邮电大学 | Customized screening system and method for microblog |
CN103546447A (en) * | 2012-07-17 | 2014-01-29 | 腾讯科技(深圳)有限公司 | Information display method, information display system, client side and server |
CN103092950A (en) * | 2013-01-15 | 2013-05-08 | 重庆邮电大学 | Online public opinion geographical location real time monitoring system and method |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106933898A (en) * | 2015-12-31 | 2017-07-07 | 北京国双科技有限公司 | The treating method and apparatus of info web |
CN106933898B (en) * | 2015-12-31 | 2020-08-11 | 北京国双科技有限公司 | Webpage information processing method and device |
CN113190648A (en) * | 2021-04-16 | 2021-07-30 | 湖州师范学院 | Context semantic based emotion analysis method for microblog short text |
Also Published As
Publication number | Publication date |
---|---|
CN104133834B (en) | 2018-05-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||