NZ793494A - System and method for detecting geo-locations in social media - Google Patents
System and method for detecting geo-locations in social media
- Publication number
- NZ793494A
- Authority
- NZ
- New Zealand
- Prior art keywords
- location
- social media
- posting
- locations
- mention
- Prior art date
Abstract
A method of determining locations for social media postings, the method comprising: retrieving, by communicating with at least one application programming interface (API) of a social media system over one or more first communication networks, at least one social media posting; determining at least one location mention in text of the at least one social media posting, and at least one textual user location of the at least one social media posting; determining a plurality of locations for the at least one location mention and the at least one textual user location, wherein each of the plurality of locations includes a set of geo-coordinates; comparing terms in an account name or account description of the at least one social media posting to a taxonomy list, validating the at least one textual user location when at least one of the terms matches the taxonomy list, and discarding the at least one textual user location when none of the terms match the taxonomy list; selecting one of the plurality of locations as a primary location; storing, in at least one database on a non-transitory machine-readable storage medium, at least one posting object for the at least one social media posting including the primary location; and outputting, by communicating with a user system over one or more second communication networks, the at least one social media posting with the determined primary location. 
Description
SYSTEM AND METHOD FOR DETECTING GEO-LOCATIONS IN SOCIAL MEDIA
CROSS REFERENCE TO RELATED APPLICATIONS
This application claims priority to U.S. Provisional Patent Application No. 62/419,609, filed on November 9, 2016, and U.S. Patent Application No. 15/787,416, filed on October 18, 2017, each of which is hereby incorporated by reference herein in its entirety. This application is also related to U.S. Patent Application No. 15/143,730, filed on May 2, 2016, which is also hereby incorporated by reference herein in its entirety. This application is a divisional of New Zealand Patent Application No. 752653, the originally filed specification of which is incorporated herein by reference in its entirety.
BACKGROUND INFORMATION
Geo-location detection from text is a difficult task. Detecting geo-locations from social data is further complicated by the prominence of hashtags and platform-specific lingo, and by the lack of punctuation, capitalization, and proper grammar. Some of the main challenges in identifying locations accurately in social media postings include the following:
1) Lack of proper standards or heuristics: There are no definitive strategies for identifying locations in text, since they can be expressed in a variety of ways.
2) Ambiguous words: Ambiguous words, for instance names of locations that can also be names of people, are prominent.
3) Lack of standard grammar: Many social media users use informal and somewhat nonstandard language in their messages, and many social media platforms have their own lingo. This means that models that have been trained on standard English cannot perform well on social data.
4) Prominence of hashtags: Hashtags are used across many social platforms to indicate metadata related to a message, e.g., its topic. Over years of usage on social media, hashtags have taken on a life of their own, preceding or succeeding a message with witty or creative tokens. On many occasions users mix more than one word to make a composite hashtag or express the location of an event via a trailing hashtag. In these instances, automated parsers are unable to break down the hashtags properly.
5) Consistency of self-identified user locations: Users can often choose to identify their location in their profile. For many social media platforms, this location does not need to be validated and can be expressed as free text. This has led to the inevitable prominence of creative but non-viable locations.
6) Granularity of information: Some disaster-response teams, police and fire departments set up official social media accounts to report emergencies in real time. The locations they identify in their messages are often specific to their location. For instance, "Injury wreck being reported on Hwy 183 NB at Loyola Ln. Back-ups toward MLK" provides a granular description of the address of an accident, which might be difficult to parse. Moreover, the address might be difficult to locate, since a similar address or intersection might exist in many different cities.
7) Identifying the correct geo-coordinates: Even if words that refer to locations are accurately identified, sometimes they can be mapped to various geo-coordinates. For instance, there are several cities named "Orlando" in the United States (e.g., in Florida, Oklahoma, West Virginia, New York, Virginia, Kentucky, North Carolina, and Arkansas).
8) Identifying the primary location of an event: Consider the message "Rebel Groups Supported By Turkey & US Reportedly Clash W/ US-Backed Kurdish Group In Syria" which mentions three countries. It can be important to understand which location is where the event took place (i.e., Syria).
9) Robustness & sustainability requirements: Even though machine learning models might yield good precision/recall numbers, they are often too slow to be applicable in real time. In addition, since many of these models are trained on static training data, they require periodic updates and adjustment.
Therefore, a system is needed that addresses all of the above challenges and provides a model validated against other geo-location services.
BRIEF DESCRIPTION OF THE DRAWINGS
So that the features of the present invention can be understood, a number of drawings are described below. However, the appended drawings illustrate only particular embodiments of the invention and are therefore not to be considered limiting of its scope, for the invention may encompass other equally effective embodiments.
FIG. 1 is a schematic diagram depicting an embodiment of a system for detecting geo-locations in postings of social media systems such as microblogs according to an embodiment of the disclosure.
FIG. 2 is a flowchart depicting an embodiment of a method of detecting geo-locations in postings of social media systems such as microblogs according to an embodiment of the disclosure.
FIG. 3 is a schematic diagram depicting an embodiment of an exemplary system architecture for detecting geo-locations in postings of social media systems such as microblogs according to an embodiment of the disclosure.
FIGS. 4(a)-4(d) show exemplary social media postings having geo-locations that can be detected according to an embodiment of the disclosure.
FIG. 5 is a flowchart depicting an embodiment of a pre-processing method according to an embodiment of the disclosure.
FIG. 6 is a flowchart depicting an embodiment of a location identification method according to an embodiment of the disclosure.
FIG. 7 is a flowchart depicting an embodiment of a method of identifying locations from the text of a social media posting according to an embodiment of the disclosure.
FIG. 8 is a flowchart depicting an embodiment of a method implementing a taxonomy-based approach according to an embodiment of the disclosure.
FIG. 9 is a flowchart depicting an embodiment of a method implementing a heuristic-based approach according to an embodiment of the disclosure.
FIG. 10 is a flowchart depicting an embodiment of a method implementing a knowledge-based (KB) approach according to an embodiment of the disclosure.
FIG. 11 is a flowchart depicting an embodiment of a method of determining location geo-coordinates according to an embodiment of the disclosure.
FIGS. 12(a)-12(b) show an exemplary output from an exemplary location library according to an embodiment of the disclosure.
FIG. 13 is a flowchart depicting an embodiment of a method of qualifying locations according to an embodiment of the disclosure.
FIG. 14 is a flowchart depicting an embodiment of a method of qualifying locations from a location library according to an embodiment of the disclosure.
FIG. 15 is a flowchart depicting an embodiment of a method of qualifying locations using community heuristics according to an embodiment of the disclosure.
FIG. 16 is a flowchart depicting an embodiment of a method of determining primary location geo-coordinates according to an embodiment of the disclosure.
DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS
Disclosed are embodiments of systems and methods for detecting geo-locations in postings of social media systems such as microblogs. Referring now to the figures, FIG. 1 shows a schematic diagram depicting an embodiment of a system 100 for detecting geo-locations in social media postings. The system 100 includes a social media system 104, a geo-location system 108, an application 112, and a user system 116.
The social media system 104 provides a platform for its users to post postings and/or content to a network of other users using accounts of the system 100. The social media system 104 includes a social media server system 120 having a communication interface 124. The social media server system 120 provides functionality of the social media system 120 for users and as discussed herein, with the communication interface 124 providing communications over one or more communication networks 128 between the social media system 120 and other systems. In embodiments, the social media system 120 can take various different forms. In one example, the social media system 120 can be Twitter, in which users use their accounts to, among other things, post short postings and/or content, called tweets, on the system. In other embodiments, the social media system 120 can be another system, such as one or more of Facebook, Instagram, Snapchat, Tumblr, Pinterest, Flickr, or Reddit, etc.
The geo-location system 108 includes a location mention identification module 132, a location determining module 136, and a qualifying module 140. The geo-location system 108 has a communication interface 144 that interfaces with the social media system 120 to retrieve social media postings and send them to location mention identification module 132 to identify and/or detect any location mentions in the postings or set of postings. Location mention identification module 132 can also detect locations specified by users of the social media system 120.
The location mention identification module 132 has a database 148 for storing various locations, location ranks/scores, and geo-coordinates. The location determining module 136 finds the latitude and longitude geo-coordinate information associated with the detected locations using a location/geo-coordinate library. In some embodiments, the location library is a third-party library.
The qualifying module 140 uses various methods to qualify and/or disambiguate locations, find the correct geo-coordinates for each location, and rank the locations based on a level of relevance to an event. The qualifying module 140 outputs location information to a communication interface 152. In the embodiment shown, each of location mention identification module 132, location determining module 136, and qualifying module 140 can communicate with each other. In the embodiment shown, communication interface 152 of geo-location module 108 outputs location information as metadata in a posting object of the social media posting over one or more communication networks 156 to application 112 for display on user system 116. In the embodiment shown, application 112 may include an application programming interface (API) 160. In the embodiment shown, application 112 communicates with user system 116 via one or more communication networks 164. Alternatively, the geo-location system may output directly to the user system 116.
The user system 116 may be any computing platform, such as one or more of a computer, a desktop computer, a laptop computer, a tablet, a smart phone, or other stationary or mobile devices, etc., that a user uses to communicate with other systems via one or more communication networks 164.
In some embodiments, the system 100 for detecting geo-locations in various social media postings may include only any subset of, or an alternative connection of, the features depicted in or discussed herein in regard to FIG. 1.
FIG. 2 shows a flowchart depicting an embodiment of a method 200 of detecting geo-locations in postings of various social media systems such as microblogs according to an embodiment of the disclosure. The method may be performed by, or may involve, components of the system 100 of FIG. 1, such as the geo-location system 108. The method begins at step 204.
At step 208, one or more social media postings are retrieved from social media server system 120. In the embodiments shown, geo-location module 108 communicates with social media server system 120 via communication interfaces 124, 144 over network 128 to retrieve one or more social media postings of one or more social media accounts.
At step 212, the retrieved social media postings are pre-processed. In some embodiments, the pre-processing is performed by geo-location system 108. In other embodiments, the pre-processing can be performed by a separate pre-processing module and the pre-processed social media postings then input into the geo-location system 108. The pre-processing involves one or more functions performed by the system to clean and prepare postings before identifying potential locations of the postings, as discussed herein. For example, the pre-processing may include any combination of the features of the systems and methods of FIGS. 3-5, or any combination of any subset and/or alternative ordering of the features of such system or methods.
At step 216, the system determines location mentions for the received and pre-processed social media postings. In the embodiments shown, location mention identification module 132 determines location mentions by identifying location mentions from the text of the social media posting and/or from a user location of a social media account, as discussed herein. For example, determining location mentions may include any combination of the features of the systems and methods of FIGS. 6-10, or any combination of any subset and/or alternative ordering of the features of such system or methods.
At step 220, the system determines locations based on the determined location mentions. In the embodiments shown, location determining module 136 receives the determined location mentions from location mention identification module 132 and determines possible geo-coordinates (i.e., latitude and longitude coordinates) for the geographic locations corresponding to the determined location mentions, as discussed herein. For example, determining locations based on the location mentions may include any combination of the features of the systems and methods of FIGS. 11-12, or any combination of any subset and/or alternative ordering of the features of such system or methods.
At step 224, the system determines a primary location of each location mention based on the determined locations. In the embodiments shown, qualifying module 140 receives the determined geographic locations from location determining module 136 and determines which location is a primary location corresponding to each location mention in the social media posting, as discussed herein. For example, determining locations based on the location mentions may include any combination of the features of the systems and methods of FIGS. 13-16, or any combination of any subset and/or alternative ordering of the features of such system or methods.
At step 228, the system stores the geo-coordinates for the determined primary location in a social media posting object. In the embodiments shown, for each location mention of a social media posting, the system adds the geo-coordinates for the primary location corresponding to that location mention. In some embodiments, the geo-coordinates can be added as one or more metadata fields to the posting object. In some embodiments, the geo-coordinates for each location can be communicated to location mention identification module 132 to be stored in database 148 on a non-transitory machine-readable storage medium for future location determination processes for future retrieved social media postings.
At step 232, the social media posting objects containing the geo-coordinates for their location mentions can be output to any downstream application seamlessly and in real time. In the embodiments shown, the posting objects can be output from geo-location module 108 to an API 160 of application 112 via network 156 and then output for display on a user system 116 via network 164. Alternatively, the posting objects can be output directly to user system 116. The method ends at step 236.
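The flow of method 200 (steps 212 through 232) can be sketched in Python as follows. This is only an illustrative outline under stated assumptions, not the disclosed implementation: `PostingObject`, `geo_tag`, and the four callables passed into it are hypothetical names, and the toy gazetteer with rounded coordinates merely stands in for a location library.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Coord = Tuple[float, float]  # (latitude, longitude)


@dataclass
class PostingObject:
    """Hypothetical posting object; metadata maps a location mention to its primary coordinates."""
    text: str
    user_location: str = ""
    metadata: Dict[str, Coord] = field(default_factory=dict)


def geo_tag(
    posting: PostingObject,
    preprocess: Callable[[str], str],
    find_mentions: Callable[[str, str], List[str]],
    lookup_candidates: Callable[[str], List[Coord]],
    choose_primary: Callable[[str, List[Coord]], Coord],
) -> PostingObject:
    cleaned = preprocess(posting.text)                              # step 212
    for mention in find_mentions(cleaned, posting.user_location):   # step 216
        candidates = lookup_candidates(mention)                     # step 220
        if candidates:
            primary = choose_primary(mention, candidates)           # step 224
            posting.metadata[mention] = primary                     # step 228
    return posting                                                  # step 232


# Toy stand-in for a location library (rounded, illustrative values only).
GAZETTEER: Dict[str, List[Coord]] = {
    "Orlando": [(28.54, -81.38), (36.15, -97.38)],
}


def demo() -> PostingObject:
    return geo_tag(
        PostingObject(text="Fire reported in Orlando"),
        preprocess=str.strip,
        find_mentions=lambda text, user_loc: [w for w in text.split() if w in GAZETTEER],
        lookup_candidates=lambda mention: GAZETTEER.get(mention, []),
        choose_primary=lambda mention, cands: cands[0],  # placeholder for the qualifying module
    )
```

Passing each stage in as a callable mirrors the separation between the location mention identification, location determining, and qualifying modules, so that any one stage can be swapped without touching the others.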
In some embodiments, the method 200 for detecting geo-locations in various social media postings may include only any subset of, or an alternative connection of, the features depicted in or discussed herein in regard to FIG. 2.
FIG. 3 shows a schematic diagram depicting an exemplary embodiment of the geo-location system 108 for detecting geo-locations in postings of various social media systems such as microblogs in further detail according to an embodiment of the disclosure.
In the embodiment shown, the system 108 retrieves one or more social media postings from a social data stream 304. In some embodiments, social media stream 304 can be outputted from social media server system 120. In the embodiment shown, the system can retrieve social media postings in a single-posting mode 308 and/or a multi-posting mode 312. In single-posting mode 308, the system determines geo-location information based on single postings from user accounts, as discussed herein. In multi-posting mode 312, the system determines geo-location information based on multiple postings received from multiple user accounts. In each mode, each social media posting can include a posting 316 in text and/or a user location 320. In the embodiment shown, the text postings 316 of the social media postings are input into location mention identification module 132.
In the embodiment shown, location mentions from the posting text 316 of the social media postings and the user locations 320 of the social media postings are input into location determining module 136. In the embodiment shown, possible location geo-coordinates for each location mention are input into qualifying module 140. In the embodiment shown, primary location geo-coordinates for each location mention in the posting text 316 and/or user location 320 are output from qualifying module 140 and added to the posting object for each social media posting received from social data stream 304.
In some embodiments, the system 300 for detecting geo-locations in various social media postings may include only any subset of, or an alternative connection of, the features depicted in or discussed herein in regard to FIG. 3.
FIGS. 4(a)-4(d) show exemplary social media postings 400 having geo-locations that can be detected according to an embodiment of the disclosure. In the embodiments shown, various examples of social media postings that can be received from social media server system 120 as a part of social data stream 304 are depicted. In each social media posting 400 shown, there is a geo-location 404, 408, 412, 416 mentioned in the posting text. When each social media posting 400 is analyzed by the disclosed system, the posting object can be updated with primary location geo-coordinates for each location mention.
In some embodiments, the exemplary social media postings 400 having geo-locations that can be detected may include only any subset of, or an alternative connection of, the features depicted in or discussed herein in regard to FIGS. 4(a)-4(d).
FIG. 5 shows a flowchart depicting an embodiment of a pre-processing method 500 according to an embodiment of the disclosure. In some embodiments, pre-processing method 500 is performed as step 212 of method 200. Prior to processing the social media posting(s) to identify potential locations, the system may perform a few pre-processing steps to clean and prepare the posting(s). Method 500 begins at step 504.
In step 508, the system removes truncations from the posting text. In social media platforms such as Twitter, truncated postings are common in automated post-sharing applications. For example, when a posting exceeds Twitter’s 140-character limit, the trailing part of the posting is automatically removed by third-party applications. This can cause potential issues for the system. For instance, consider a truncated tweet that reads "Let us celebrate New York…" It is unclear whether the trailing word refers to a location such as "New York City," or if it was meant to say "New Yorker magazine." Therefore, in step 508, the system ignores and/or removes all truncation symbols and truncated words and phrases from the posting text.
However, in some embodiments, removing truncations is not as simple as merely identifying postings having the truncation symbol (i.e., an ellipsis represented by "...") at the end of the posting. On many occasions, automated applications may append additional hashtags, mentions, or URLs to the end of the posting (e.g. "Let us celebrate New York… via meApp"). This often comes at the expense of the length of the original posting. Moreover, not all postings that end with an ellipsis are truncated; sometimes users use the symbol simply as a mode of expression. To address these difficulties, the system, in step 508, can use two main clues to determine if a tweet is truncated: 1) if the posting length is close to the character limit, and 2) if the posting ends with either of two main truncation symbols (i.e., "..." and the Unicode character for horizontal ellipsis) potentially followed by a standard expression of truncation (e.g., "via @handle," or "via #hashtag"). In embodiments, if a posting matches the above criteria, the last word or token before the truncation symbol is removed as well as the remaining tail-end of the posting.
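The two truncation clues of step 508 can be illustrated with a short Python sketch. It is a minimal example under stated assumptions rather than the disclosed implementation: the function name `remove_truncation` is hypothetical, and the 10-character margin used to decide that a posting is "close to" the 140-character limit is an arbitrary illustrative choice.

```python
import re

CHAR_LIMIT = 140        # Twitter limit cited in the text
NEAR_LIMIT_MARGIN = 10  # assumption: "close to the limit" means within 10 characters

# Last token before the ellipsis ("..." or U+2026), the ellipsis itself, and an
# optional "via @handle" / "via #hashtag" tail, all anchored at the end of the posting.
_TRUNCATED_TAIL = re.compile(r"\s*\S*(?:\.\.\.|\u2026)(?:\s+via\s+[@#]\w+)?\s*$")


def remove_truncation(text: str, limit: int = CHAR_LIMIT, margin: int = NEAR_LIMIT_MARGIN) -> str:
    """Drop the truncated last token and everything after it, if both clues hold."""
    if len(text) < limit - margin:
        return text  # clue 1 fails: a short posting ending in an ellipsis is left alone
    return _TRUNCATED_TAIL.sub("", text)  # clue 2: no match means no change


# A 140-character posting whose tail was cut by a sharing application.
padded = "A" * 100 + " Let us celebrate New York\u2026 via @SomeApp"
cleaned = remove_truncation(padded)  # the tail "York… via @SomeApp" is removed
```

Requiring both clues keeps short postings that merely end in an expressive ellipsis, such as "Hmm...", untouched.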
In step 512, the system splits hashtags that appear in the posting text of the social media posting. In social media postings, hashtags can play an important role in identifying locations, especially when no other clue is available. Many reliable official accounts (e.g., from disaster response teams, weather channels, traffic monitors, etc.) use hashtags to convey location information (e.g., "#BuelahHillFire"). In many instances, these official accounts are careful to use different letter-casing in their postings to denote locations within hashtags. In order to use hashtag information, the system inspects each posting to determine whether it is written in ALL-CAPS (i.e., written using all capital letters). If ALL-CAPS is not used, the system splits each hashtag based on the location of its uppercase letters. For instance, the hashtag "#BuelahHillFire" can be broken into the text "#Buelah Hill Fire." If multiple uppercase letters appear next to each other, the system reattaches any dangling letters back together. For instance, the hashtag "#LAFlood" will be broken into "#L A Flood," and then the dangling letters "L" and "A" reattached to read "#LA Flood." In some embodiments, the system keeps the hashtag symbol in order to distinguish between hashtag-based locations and other locations determined from the posting text. This helps delimit the beginning of hashtag-based locations from the rest of the posting text. For instance, consider the posting "Hurricane Matthew moving towards Florida #HaitiDisaster." Once the hashtag is broken down by the described processes, the posting will read "Hurricane Matthew moving towards Florida #Haiti Disaster." If the hashtag symbol is removed, the system might mistakenly identify "Florida Haiti" as a single location due to consistent letter-casing. Therefore, retaining the hashtag symbol ensures that the system determines that "Florida" and "Haiti" are two different locations.
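The hashtag-splitting rule of step 512 can be sketched in Python as shown below. The function names are hypothetical and the regular expressions are one possible way to express the rule; the sketch assumes ASCII uppercase letters mark word boundaries, as in the examples given in the text.

```python
import re

_HASHTAG = re.compile(r"#(\w+)")


def split_hashtag_body(body: str) -> str:
    """Split a hashtag body on uppercase letters, then rejoin dangling single capitals."""
    # "LAFlood" -> ["L", "A", "Flood"]; "BuelahHillFire" -> ["Buelah", "Hill", "Fire"]
    pieces = re.findall(r"[A-Z][^A-Z]*|[^A-Z]+", body)
    merged = []
    previous_dangling = False
    for piece in pieces:
        dangling = len(piece) == 1 and piece.isupper()
        if dangling and previous_dangling:
            merged[-1] += piece  # reattach: "L" + "A" -> "LA"
        else:
            merged.append(piece)
        previous_dangling = dangling
    return "#" + " ".join(merged)  # the hashtag symbol is deliberately kept


def split_hashtags(text: str) -> str:
    """Apply the split to every hashtag unless the whole posting is written in ALL-CAPS."""
    if text.isupper():
        return text  # letter-casing carries no signal in an ALL-CAPS posting
    return _HASHTAG.sub(lambda match: split_hashtag_body(match.group(1)), text)


example = split_hashtags("Hurricane Matthew moving towards Florida #HaitiDisaster")
```

Keeping the leading "#" in the output preserves the boundary between "Florida" and "#Haiti", which is the reason the text gives for retaining the symbol.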
In step 516, the system removes special characters that appear in the posting text of the social media posting. Special characters may include non-alphanumeric characters found in the posting text. In embodiments, the system removes all special characters except a selected set of special characters retained as useful for identifying locations. For example, the system may remove all special characters (e.g., brackets, asterisks, percentage signs, and backslashes) except the following:
Hyphen: Some locations include hyphens (e.g., "Al-Hasakah").
Apostrophe or single quote: These symbols are occasionally used to denote possessive forms (e.g., "Austin’s PD reports a three-alarm fire downtown"). As discussed herein, the system can use the names of public agencies (e.g., police departments, fire stations, etc.) to find locations expressed in possessive form.
Hashtag: As previously discussed, this symbol is retained to distinguish hashtag-based locations.
Forward slash: Sometimes this symbol is used to connect multiple locations (e.g., "Hurricane warning for Kings/Queens counties"). Retaining this symbol can help to identify these cases so that "county" can be permuted to both "Kings" and "Queens."
Comma: This symbol is commonly used to associate two locations (e.g., "Orlando, Florida").
Period, exclamation point, question mark, colon, semi-colon: These symbols are often used to identify the end of sentences or end of phrases, which is essential in identifying locations correctly. Consider the posting "Big celebration in New York City. Tonight at 8pm." If periods were removed from the sentence, the system might mistakenly surmise that "New York City Tonight" is the name of a location.
In step 520, once the posting(s) are pre-processed, the method 500 ends.
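The character filter of step 516 can be sketched in Python as follows. The names are hypothetical, and two details are assumptions made for illustration: removed characters are replaced with a space so that neighboring words do not fuse, and the underscore is treated as a special character.

```python
import re

# Characters the text says are retained: hyphen, apostrophe / single quote,
# hashtag, forward slash, comma, and sentence or phrase terminators.
RETAINED = "-'\u2019#/,.!?:;"

# Anything that is not a letter, digit, whitespace, or retained symbol is dropped.
_SPECIAL = re.compile(r"[^\w\s" + re.escape(RETAINED) + r"]|_")


def remove_special_characters(text: str) -> str:
    """Replace non-retained special characters with spaces and collapse the whitespace."""
    without_specials = _SPECIAL.sub(" ", text)
    return re.sub(r"\s+", " ", without_specials).strip()


example = remove_special_characters("Big celebration in New York City. Tonight at 8pm (free)*")
```

Because the period survives, "New York City." and "Tonight" remain in separate sentences, which is the property the text relies on to avoid reading "New York City Tonight" as one location.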
In some embodiments, the pre-processing method 500 may e only any subset of, or an alternative connection of, the features depicted in or discussed herein in regard to shows a flowchart depicting an embodiment of a location identification method 600 according to an embodiment of the disclosure. In some embodiments, location identification method 600 is performed as step 216 of method 200 by location mention identification module 132. In the ment shown, method 600 begins at step 604. As previously discussed, the system can work in two modes: single and multi-posting mode. In both cases, in step 608, the system identifies locations mentioned in the text of each posting it retrieves. In step 612, if one or more locations are identified from the posting text, method 620 ends. In some instances, there is not enough information in a posting (or set of postings) to identify any ons. In step 612, if no location is fied from the posting text, method 600 proceeds to step 616 to identify locations from user es. Different location fication thresholds can be used to determine whether sufficient locations have been fied in step 612. For example, if the system finds fewer than two location mentions in the message, it also extracts user locations and adds them to the set of potential locations in step 616. In some embodiments , user locations can be identified even if locations are fied from text. For example, step 612 can be removed from method 600 and locations can be identified from both text and from user locations. id="p-64" id="p-64" id="p-64" id="p-64" id="p-64" id="p-64" id="p-64" id="p-64"
[0064] In some embodiments, the location identification method 600 may include only any subset of, or an alternative connection of, the features depicted in or discussed herein in regard to the corresponding figure.
Returning to the earlier depicted embodiment, the location mention identification module 132 includes a taxonomy-based classifier module 324, a heuristic-based classifier module 328, and a knowledge-base (KB) based classifier module 332. In the embodiment shown, location mention identification module 132 also includes a location taxonomy list and/or table 336 and an alias KB database 340, although one or more of these elements may be located outside location mention identification module 132. Taxonomy-based classifier module 324 is configured to perform a taxonomy-based approach to detect location mentions from the text of a social media posting, heuristic-based classifier module 328 is configured to perform a heuristic-based approach, and KB-based classifier module 332 is configured to perform a KB-based approach. In the embodiment shown, the taxonomy-based classifier module 324 can communicate with location taxonomy list and/or table 336 and KB-based classifier module 332 can communicate with alias KB database 340. In the embodiment shown, taxonomy-based classifier module 324, heuristic-based classifier module 328, and KB-based classifier module 332 each receive one or more social media postings, such as from social posting stream 304 via the communication interface 144 (omitted from this figure for clarity of illustration). Receipt of the social media postings and/or implementation of the location detection processes can be in parallel and/or in series.
[0066] Another figure shows a flowchart depicting an embodiment of a method 700 of identifying locations from the text of a social media posting according to an embodiment of the disclosure. In the embodiment shown, method 700 is performed by location mention identification module 132 and is performed serially. In embodiments, the location identification method 700 may be performed as step 216 of method 200 and/or as step 608 of method 600. In the embodiment shown, method 700 begins at step 704. At step 708, location mentions are detected from one or more social media postings using a taxonomy-based approach. At step 712, location mentions are detected from one or more social media postings using a heuristic-based approach. At step 716, location mentions are detected from one or more social media postings using a KB-based approach. Method 700 ends at step 720.
In some embodiments, the method 700 for identifying locations from the text of a social media posting may include only any subset of, or an alternative connection of, the features depicted in or discussed herein in regard to the corresponding figure.
Another figure shows a flowchart depicting an embodiment of a method 800 implementing a taxonomy-based approach according to an embodiment of the disclosure. In embodiments, method 800 is implemented at step 708 of method 700 by taxonomy-based classifier module 324. Method 800 begins at step 804 and uses a standard taxonomy to detect common locations from posting text. Names of large and/or well-known geo-locations such as countries, continents, states, and provinces can be curated and/or collected to generate a taxonomy list 336 of location names. In some embodiments, taxonomy list 336 can include thousands of names. In step 808, the location names making up taxonomy list 336 are retrieved by taxonomy-based classifier module 324 and are compared against the text of each received social media posting to see if any location names from taxonomy list 336 occur anywhere in the posting text.
In step 812, the system determines whether there is an exact match between a location mention in the text and a location included in taxonomy list 336. If there is no exact match, method 800 proceeds to step 816. If there is an exact match, method 800 proceeds to step 820. In step 816, in cases when an exact match is not found, the system determines whether there is a proximate match. In this step, the system parses the text for words that may be related to a particular location but may not exactly match a location from taxonomy list 336. For instance, nationalities found in the text can be mapped to their corresponding countries from taxonomy list 336 (e.g., "Canadian" can be matched to "Canada"). Possessive forms can also be mapped to their corresponding locations from taxonomy list 336 (e.g., "Louisiana’s fire departments" can be matched to "Louisiana"). In step 820, once a word from the posting text is matched (either exactly or proximately) to a term in taxonomy list 336, it is removed and replaced by a location mask (e.g., "") in a processed version of the social media posting. By removing location mentions when they are detected, the system can avoid processing the same location mention multiple times and preserve system resources.
For example, the posting shown in figure panel (c) contains the term "Brussels." If "Brussels" is included as a location in taxonomy list 336, the term "Brussels" is removed and replaced by a location mask to avoid unnecessary processing. Method 800 ends at step 824. In some embodiments, even though processing overhead is increased, method 800 can determine proximate matches even when exact matches are also determined.
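The exact and proximate matching of steps 812 to 820 can be sketched as below. This is an illustrative simplification under stated assumptions: the three-entry `TAXONOMY` set stands in for taxonomy list 336, the `NATIONALITIES` map and the `<LOC>` mask token are hypothetical (the mask string used in the disclosure is not recoverable from the text), and only single-word location names are handled.

```python
import re
from typing import List, Tuple

TAXONOMY = {"Canada", "Louisiana", "Brussels"}  # tiny stand-in for taxonomy list 336
NATIONALITIES = {"Canadian": "Canada"}          # hypothetical proximate-form map
MASK = "<LOC>"                                  # hypothetical mask token


def taxonomy_match(text: str) -> Tuple[List[str], str]:
    """Return (matched locations, masked text) for single-word taxonomy names."""
    found: List[str] = []

    def mask(match):
        token = match.group(0)
        if token in TAXONOMY:                      # step 812: exact match
            location = token
        elif token in NATIONALITIES:               # step 816: nationality
            location = NATIONALITIES[token]
        elif token[-2:] in ("'s", "’s") and token[:-2] in TAXONOMY:
            location = token[:-2]                  # step 816: possessive form
        else:
            return token                           # not a location: keep the word
        found.append(location)
        return MASK                                # step 820: replace with the mask

    masked = re.sub(r"[A-Za-z]+(?:['’]s)?", mask, text)
    return found, masked
```

Masking in place means a later classifier that receives the masked text does not see, and so cannot re-process, a mention that was already matched.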
In some embodiments, the method 800 for implementing a taxonomy-based approach may include only any subset of, or an alternative connection of, the features depicted in or discussed herein in regard to the corresponding figure.
Another figure shows a flowchart depicting an embodiment of a method 900 implementing a heuristic-based approach according to an embodiment of the disclosure. In the embodiment shown, method 900 is implemented at step 712 of method 700 by heuristic-based classifier module 328. Method 900 begins at step 904. In addition to the standard taxonomy, the system uses a set of standard heuristics to identify words or phrases from the social media posting that are likely to refer to locations. At step 908, the system generates a rule list correlating particular words or phrases found in certain positions in the text with possible locations. At step 912, the system checks the posting for text that matches rules from the rule list. Table 1 shown below lists some of these words/phrases as "clues" along with a few examples of possible correlated locations.
Table 1. List of clues used to identify location mentions heuristically.
Clue | Position w.r.t. the location | Examples | Examples of exceptions
Cardinal directions | Prefix | North Aleppo; Southwestern Aleppo; North of Aleppo; Southeast of Aleppo | North of the city; Northwest Bank
Landmark identifiers | Prefix/Suffix | City of Aleppo; Aleppo City; Gulf of Mexico; Suburb of New Jersey | City Bank; State Department; Islamic State
Distance indicators | Prefix | 3 miles from NYC; Five kilometers of NJ | Five miles of wheat fields
Urban landmark indicators | Suffix | St. John’s high school; JFK airport; Manila City Hall | My niece’s high school
Natural landmark indicators | Prefix/Suffix | Red River; Mnt Rushmore; Green Lake | Deep river; River basin
Agencies | Suffix | NYC police; LA fire department | Higher police presence
City area | Prefix | Downtown Los Angeles | Uptown girl
Movement indicators | Prefix | Hurricane moving towards NJ; Typhoon tearing through NYC; Festival coming to LA | Coming to my ’s house; Moving towards an agreement
Border indicators | Prefix/Suffix | Border between NY and NJ; Intersection of Main and; NY/NJ border | Doctors Without Borders; Borders on insanity
Prepositions | Prefix | In NYC; At the JFK | In conversations; At home
For example, the system may detect the term "North" in the posting text and determine that this term correlates to a rule from the rule list, namely that the term "North" is a cardinal direction and is used as a prefix. The middle-right column of Table 1 lists examples where the rule correctly detects a location mention in the text. However, the rightmost column of Table 1 lists examples where the rules match words in the text but the word that precedes or succeeds them is not a location. In order to recognize these cases, the system implements step 916 to determine a location based on capitalization. If the posting text seems to have proper capitalization (e.g., if it is not written in ALL-CAPS or Title Case (i.e., using capital letters only to start principal words)), then the system relies on capitalization to determine whether the words adjacent to the words matching a rule refer to a location. Words not beginning with capital letters are less likely to be locations. As shown in Table 1, when the term "North" is determined to be near the capitalized word "Aleppo," "Aleppo" is determined to be a location mention. However, when the term "North" is determined to be near the non-capitalized word "city," "city" is not determined to be a particular location mention.
As can be seen, step 916 can still match words that are not locations (e.g., "Northwest Bank" satisfies the capitalization rule but is not a location mention). In order to mitigate these false matches, the system implements step 920 to remove and/or ignore certain blacklisted terms. In some embodiments, each rule has a list of blacklisted terms associated with it. For example, for the Landmark Identifiers rule shown in Table 1, a blacklisted term is "The Islamic State" because it is a commonly-used term that does not correspond to a location mention for the general landmark identifier "state." Method 900 ends at step 924.
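One rule from Table 1, the cardinal-direction prefix, can be sketched together with the capitalization check (step 916) and the blacklist (step 920). The regular expression, the function name, and the blacklist entry are assumptions made for illustration; the check that the posting as a whole is properly capitalized is omitted for brevity.

```python
import re
from typing import List

# Cardinal direction such as "North", "Southeast", "Southwestern".
CARDINAL = r"(?:North|South|East|West)(?:east|west)?(?:ern)?"
# Prefix rule: direction, optional "of" and "the", then the candidate word.
PREFIX_RULE = re.compile(rf"\b{CARDINAL}\s+(?:of\s+)?(?:the\s+)?(\w+)")
# Hypothetical blacklisted phrase for this rule (step 920).
BLACKLIST = {"Northwest Bank"}


def cardinal_direction_locations(text: str) -> List[str]:
    """Return words that follow a cardinal direction and look like locations."""
    hits: List[str] = []
    for match in PREFIX_RULE.finditer(text):
        word = match.group(1)
        if not word[0].isupper():          # step 916: lowercase words are skipped
            continue
        if match.group(0) in BLACKLIST:    # step 920: known false match
            continue
        hits.append(word)
    return hits
```

With the Table 1 examples, "North Aleppo" and "Southeast of Aleppo" yield "Aleppo", while "North of the city" fails the capitalization check and "Northwest Bank" is removed by the blacklist.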
In some embodiments, the method 900 for implementing a heuristic-based approach may include only any subset of, or an alternative connection of, the features depicted in or discussed herein in regard to the corresponding figure.
Another figure shows a flowchart depicting an embodiment of method 1000 implementing a knowledge-based (KB) approach according to an embodiment of the disclosure. In embodiments, method 1000 is implemented at step 716 of method 700 by KB-based classifier module 332. A typical problem in identifying location mentions is the prominence of terms such as "NYC," "LA," "PDX," "Big Apple," and other common aliases for locations. Some of these aliases are known and can be supplied manually using a lexicon. However, sometimes local news outlets popularize a term that does not catch on in global media. For instance, "J&K" is commonly used to refer to "Jammu and Kashmir" in Indian media, but not elsewhere. Additionally, some events generate news hashtags that are tied to a particular location (e.g., "#LAFlood"). Additionally, some official weather accounts use a standard format to indicate a location. For example, "#nywx" refers to weather forecasts for New York. These types of aliases are not easy to detect or curate using conventional lexicons.
Therefore, it may be important to dynamically detect these terms and associate them with their corresponding location, especially if they occur frequently. The system uses a dynamically self-adjusting Knowledge-Base (KB) to achieve this functionality. In some embodiments, the dynamically self-adjusting Knowledge-Base (KB) is stored in database 340. In the embodiment shown, method 1000 begins at step 1004. At step 1008, the system builds the KB by using co-occurrence information to determine associations between locations and their aliases. For instance, if "New York City" frequently co-occurs with "NYC" in social media postings, the system updates the KB to align the two terms.
Using the KB, the system can remove incorrect alignments. At step 1012, the system determines whether two terms in the KB satisfy a minimum threshold for co-occurrence. For instance, the terms "New York City" and "NYC" have to occur together more than 1,000 times (i.e., the minimum threshold) to be considered. If the system determines that the terms do not meet the minimum threshold, the terms are removed from the KB.
At step 1016, the system determines a rank reciprocity for the terms based on the co-occurrence information. In some embodiments, the system first ranks each term based on the frequency of its co-occurrence with another term. For instance, if "NYC" is the second most common term that co-occurs with "New York City," then its rank with regard to "New York City" will be "2." After all ranks are calculated, the system checks to see if "NYC" and "New York City" reciprocate each other’s rank. For instance, if "NYC" is the most frequent term that co-occurs with "New York City" (i.e., ranked "1"), the system determines whether "New York City" is also the most frequent term (i.e., ranked "1") that co-occurs with "NYC." If the ranks match, the two terms are determined to satisfy rank reciprocity and are recognized as alias pairs in the KB. In embodiments, the KB may be automatically updated periodically (e.g., every week) to expand its collection of alias alignments.
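The KB construction of steps 1008 to 1016 can be sketched as follows. This is a simplified illustration under assumptions: each posting is represented as a set of candidate terms, reciprocity is checked only at rank 1 (the case given in the example), and the function and parameter names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_alias_pairs(
    postings: Iterable[Set[str]], min_count: int = 1000
) -> Set[Tuple[str, str]]:
    """Return alias pairs that pass the co-occurrence and reciprocity checks."""
    # Step 1008: count how often each pair of terms shares a posting.
    cooc: Dict[str, Counter] = defaultdict(Counter)
    for terms in postings:
        for a in terms:
            for b in terms:
                if a != b:
                    cooc[a][b] += 1

    pairs: Set[Tuple[str, str]] = set()
    for a, counts in cooc.items():
        b, n = counts.most_common(1)[0]  # b is ranked "1" with regard to a
        # Step 1012: the pair must co-occur more than min_count times.
        if n <= min_count:
            continue
        # Step 1016: rank reciprocity, a must also be ranked "1" for b.
        if cooc[b].most_common(1)[0][0] == a:
            pairs.add(tuple(sorted((a, b))))
    return pairs
```

Re-running the function on a fresh window of postings, for example weekly, would correspond to the periodic update mentioned above.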
At step 1020, the system uses the KB to detect aliases that occur in the posting and determine a location corresponding to each alias. During the processing of a posting to determine locations, if an alias appears as a potential location, both the alias and its corresponding match can be added as a location. For instance, if "NYC" appears in the posting (such as in the example shown in figure panel (d)), both "NYC" and "New York City" are marked as potential location mentions. Method 1000 ends at step 1024.
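Step 1020 amounts to a lookup that keeps the alias and adds its aligned location. A minimal sketch, assuming the KB is exposed as a plain dictionary (the `ALIAS_KB` name and its two entries, taken from the examples in the text, are illustrative):

```python
from typing import Dict, List

# Illustrative alias alignments drawn from the examples in the text.
ALIAS_KB: Dict[str, str] = {"NYC": "New York City", "J&K": "Jammu and Kashmir"}


def expand_aliases(candidates: List[str], kb: Dict[str, str] = ALIAS_KB) -> List[str]:
    """Keep every candidate and add the aligned location for known aliases."""
    expanded: List[str] = []
    for term in candidates:
        expanded.append(term)
        if term in kb:
            expanded.append(kb[term])  # both alias and match become candidates
    return expanded
```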
[0081] In some embodiments, the method 1000 for implementing a KB-based approach may include only any subset of, or an alternative connection of, the features depicted in or discussed herein in regard to the corresponding figure.
Returning to the earlier depicted embodiment, the location determining module 136 includes a geo-coordinator 344, a geo-coordination library and/or service 348, and a user-type validator 352. In other embodiments, the user-type validator 352 may be located outside location determining module 136. Geo-coordinator 344 is configured to receive the results of all three location detection approaches (e.g., taxonomy-based, heuristic-based, and KB-based) discussed above as a list of potential locations. In the embodiment shown, geo-coordinator 344 is also configured to receive one or more validated user locations from user-type validator 352. In the embodiment shown, user-type validator 352 can receive a posting 316 and a user location 320 and determine if the posting account represents a particular type of account such as a fire or police department, local news outlet, or other public service account. In the embodiment shown, geo-coordinator 344 can communicate with a geo-coordination library and/or service 348 to determine geo-coordinates for each of the potential location mentions received from location mention identification module 132. Geo-coordination library and/or service 348 can be a third-party library comprising a database correlating particular locations with geo-coordinates.
Another figure shows a flowchart depicting an embodiment of method 1100 of determining location geo-coordinates according to an embodiment of the disclosure. In embodiments, method 1100 is implemented at step 220 of method 200 by location determining module 136. In the embodiment shown, method 1100 begins at step 1104. At step 1108, the system looks up each of the potential locations received from location mention identification module 132 in geo-coordination library and/or service 348. In some embodiments, the library 348 can be a third-party location library, such as, e.g., Nominatim, although other third-party location libraries can be used. This particular library receives data from the OpenStreetMap (OSM) project, which periodically provides a dump of all geo-locations around the globe that can be uploaded to a database. The location library provides a mechanism to easily access and query the database and can provide a REST service and a GUI for easy navigation. At step 1112, the system receives geo-location results from the location library for each potential location. When a word is looked up in the location library, the library returns a set of geo-location results that the word can correspond to. At step 1116, the geo-location results can each have a score, such as an importance score, representing the strength of an association between the potential location and a particular geo-location. In embodiments, the score may be received from the library 348. In other embodiments, the score may be separately calculated. At step 1120, the method ends.
[0084] In some embodiments, the method 1100 for determining location geo-coordinates may include only any subset of, or an alternative connection of, the features depicted in or discussed herein in regard to the corresponding figure.
FIGS. 12(a)-12(b) show exemplary outputs 1200 from an exemplary location library according to an embodiment of the disclosure. In FIG. 12(a), the query "orlando" 1204 was looked up in a location library and a listing of results 1208 was returned. In the embodiment shown, the listing of results 1208 can be supplemented and/or expanded by a selection of button 1212. In FIG. 12(b), each result 1208 includes data 1216 including a latitude/longitude geo-coordinate pair, a polygon, some metadata indicating the location’s larger context (e.g., province, country), and an importance score that indicates how common the association between the query 1204 and the result 1208 is. In other words, the importance score represents the degree of correlation between the queried location and each geo-location result. For instance, the importance score for Orlando, Florida is about 0.71, while the importance score for Orlando, Arkansas is 0.37. The system can use this information to find the most likely geo-coordinates for each potential location.
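A lookup against a Nominatim-style REST service and the handling of its results might look like the sketch below. The public endpoint URL and the `display_name`, `lat`, `lon`, and `importance` fields follow Nominatim's JSON search output; the `GeoResult` class, the function names, and the sample payload are assumptions for illustration. The two importance values come from the Orlando example above, while the coordinates in the sample are rounded placeholders rather than library output. No network request is made here.

```python
import json
from dataclasses import dataclass
from typing import List
from urllib.parse import urlencode


@dataclass
class GeoResult:
    name: str
    lat: float
    lon: float
    importance: float


def search_url(query: str, base: str = "https://nominatim.openstreetmap.org/search") -> str:
    """Build the search URL for one potential location (step 1108)."""
    params = urlencode({"q": query, "format": "jsonv2"})
    return f"{base}?{params}"


def parse_results(payload: str) -> List[GeoResult]:
    """Parse a JSON result list and order it by importance (steps 1112-1116)."""
    results = [
        GeoResult(r["display_name"], float(r["lat"]), float(r["lon"]),
                  float(r.get("importance", 0.0)))
        for r in json.loads(payload)
    ]
    return sorted(results, key=lambda r: r.importance, reverse=True)


# Illustrative payload: importance scores from the text, placeholder coordinates.
SAMPLE_RESPONSE = json.dumps([
    {"display_name": "Orlando, Arkansas", "lat": "33.9", "lon": "-92.2", "importance": 0.37},
    {"display_name": "Orlando, Florida", "lat": "28.54", "lon": "-81.38", "importance": 0.71},
])
```

Sorting by importance puts the Florida result first, which matches the intuition that the better-known Orlando is the more likely referent.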
In some embodiments, the exemplary output 1200 from an exemplary location library may include only any subset of, or an alternative connection of, the features depicted in or discussed herein in regard to FIGS. 12(a)-12(b).
Returning to the earlier depicted embodiment, the qualifying module 140 includes a location disambiguation engine 356, a user-location validator 360, and a location ranking engine 364. In the embodiment shown, qualifying module 140 is configured to receive the results from the location library discussed above as a list or set of geo-coordinates and importance scores of potential locations. In the embodiment shown, qualifying module 140 is configured to qualify and validate the list of potential locations and determine a primary location and corresponding geo-coordinates.
Another figure shows a flowchart depicting an embodiment of a method 1300 of qualifying locations according to an embodiment of the disclosure. In embodiments, method 1300 is implemented at step 224 of method 200 by qualifying module 140. In the embodiment shown, method 1300 begins at step 1304. At step 1308, the system qualifies location results from the location library using the importance scores. At step 1312, the system qualifies user locations using community heuristics based on types of user accounts and user locations. In some embodiments, these steps may include processes that identify and remove words that are not locations. For example, these are words that have been identified as locations by mistake or self-identified user locations that are not viable, such as "Planet Earth." At step 1316, the system finds and geo-coordinates a primary location determined from the list of remaining qualified potential locations. For example, once certain locations have been removed, for the remaining locations, the system can determine the "best" or most likely geo-coordinates from the list provided by the location library. At step 1320, the method ends.
In some embodiments, the method 1300 for qualifying locations may include only any subset of, or an alternative connection of, the features depicted in or discussed herein in regard to the corresponding figure.
Another figure shows a flowchart depicting an embodiment of a method 1400 of qualifying locations from a location library according to an embodiment of the disclosure. In embodiments, method 1400 is implemented at step 1308 of method 1300 by location disambiguation engine 356. Method 1400 begins at step 1404. At step 1408, the system analyzes the results from the location library, such as by comparing the importance score of each result to an importance score threshold. At step 1412, the system determines whether an importance score of a particular result is lower than the threshold. If the importance score is lower than the threshold, the system discards the location result at step 1416. The system performs this analysis for each location result returned by the location library. If the importance score of the top result returned by the location library is below the threshold, it likely means that the location is too obscure to be reliable. For instance, "The Milky Way" might match the name of a bar in New Jersey, but the system will not recognize (i.e., will discard) this location if its importance score is below the threshold.
If the importance score of a result is determined to be greater than or equal to the score threshold, the process moves to step 1420 and analyzes the importance score variance of the results in the list. At step 1424, the system determines whether the variance in importance scores across the entire listing of results is lower than a variance threshold. If the variance is lower than the variance threshold, the system discards the location result at step 1416. If the variance of the importance scores of the results is below a variance threshold, it likely means that the system does not have enough confidence to come up with a definitive geo-location for a potential location. For instance, "The Milky Way" might be the name of both a bar in New Jersey and a cafe in Washington, D.C. However, since both landmarks are relatively obscure, the location library will assign similarly low importance scores to these results and they will be subsequently removed from the set of potential locations. If the variance is higher than the variance threshold, the system keeps the location result at step 1428. At step 1432, method 1400 ends.
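The two filters of method 1400 can be sketched as below. The threshold values and the function name are hypothetical, population variance is one possible reading of "variance", and the sketch assumes that a single surviving result is treated as unambiguous, a case the text does not address.

```python
from statistics import pvariance
from typing import List


def qualify_results(
    scores: List[float],
    score_threshold: float = 0.4,      # hypothetical importance threshold
    variance_threshold: float = 0.01,  # hypothetical variance threshold
) -> List[float]:
    """Return the importance scores of the results that survive both filters."""
    # Importance check: discard results that are too obscure.
    kept = [s for s in scores if s >= score_threshold]
    if not kept:
        return []
    # Variance check over the entire listing: near-identical scores mean the
    # library cannot single out one geo-location, so every result is dropped.
    if len(scores) > 1 and pvariance(scores) < variance_threshold:
        return []
    return kept
```

With the Orlando scores 0.71 and 0.37 the variance is about 0.029, so the Florida result is kept; two similarly scored hits such as 0.45 and 0.44 are both discarded.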
[0092] In some embodiments, the method 1400 for qualifying locations from a location library may include only any subset of, or an alternative connection of, the features depicted in or discussed herein in regard to the corresponding figure.
Another figure shows a flowchart depicting an embodiment of a method 1500 of qualifying user locations using community heuristics according to an embodiment of the disclosure. In embodiments, method 1500 is implemented at step 1312 of method 1300 by user location validator 360. User locations are not always related to the location of a particular event. This may especially be true for targeted attacks such as terror attacks, which may often happen in crowded locations prone to transit, such as airports and tourist attractions. For example, referring to figure panel (b), a social media posting including a witness account from a Nice terrorist attack (location mention 408) may be posted by a user from Monaco.
Therefore, the system may qualify user locations to determine a reliable geo-location.
[0094] In the embodiment shown, method 1500 begins at step 1504. At step 1508, the system determines whether it is retrieving social media posts in a single-posting mode or a multi-posting mode, for example as a function of a setting or input from a user or other application.
If a single-posting mode is determined, method 1500 proceeds to step 1512. In embodiments, step 1512 may be performed by user-type validator 352. In single-posting mode, the system may only accept the user location if the account is of a reliable type. Often, reliable accounts are official accounts from local news agencies, disaster response teams, police, or fire departments. In order to determine whether a user account is reliable, the system can use a standard set of words (e.g., "fire," "police," etc.) and match them against an account’s description and name. In order to identify the standard set of words, the system can curate a list of social media accounts belonging to local news agencies, local government agencies, and fire and police departments. The system can also collect the descriptions of these social media accounts, tokenize them, remove stopwords, and determine the most common terms found in the account descriptions. In some embodiments, an exemplary standard set of words and/or taxonomy list can include words such as "city," "state," "county," "department," "dept.," "police," "emergency," "emergencies," "fire," "911," "news," "weather," and "channel." The list of accounts can also include accounts having handles that begin or end with "PD" or "FD" (in uppercase) and accounts having descriptions that include the word "official." Additionally, the account list can include all accounts that have a URL ending in ".gov." At step 1512, the system compares the name of the posting account to the determined set of words. At step 1516, it is determined whether the posting account name matches an entry in the determined set of words. If the account name matches an entry in the determined set of words, the potential user location(s) is kept at step 1520. If the account name does not match an entry in the determined set of words, the potential user location(s) is discarded (e.g., ignored) at step 1524.
For example, referring to figure panel (a), a social media posting may contain the term "police" but may be posted by a user in a different location than the described event (e.g., Miami instead of Orlando). The system can compare the account name of the social media posting with the determined set of words and, if the account name does not match an entry in the determined set of words, the potential user location is ignored.
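The single-posting check of steps 1512 to 1524 can be sketched as a keyword match against the account profile. The word set below reuses legible entries from the exemplary list; the function name and the choice to match both the account name and its description are assumptions.

```python
import re
from typing import Optional

# Words taken from the exemplary standard set described in the text.
RELIABLE_WORDS = {
    "city", "state", "county", "department", "dept", "police", "emergency",
    "emergencies", "fire", "911", "news", "weather", "channel",
}


def qualify_user_location(
    account_name: str, description: str, user_location: str
) -> Optional[str]:
    """Keep the user location only if the account looks like a reliable type."""
    # Tokenize the profile text into lowercase alphanumeric words.
    tokens = set(re.findall(r"[a-z0-9]+", f"{account_name} {description}".lower()))
    # Step 1516: any overlap with the word set keeps the location (step 1520);
    # otherwise the location is discarded (step 1524).
    return user_location if tokens & RELIABLE_WORDS else None
```

Note that the match is against the account profile, not the posting text: a personal account that merely mentions "police" in a posting is still rejected.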
In the embodiment shown, if a multi-posting mode is determined, method 1500 proceeds to step 1532. In multi-posting mode, the system may only accept the user location if at least a certain predetermined percentage of users (e.g., 75%) have the same user location.
If the system determines that the certain percentage of users matches the posting user’s location, the location is kept at step 1520. If the system determines that the certain percentage of users does not match the posting user’s location, the location is discarded (e.g., ignored) at step 1524. For example, if three postings are fed into the system, at least two users will have to have matching locations for their posts to be kept. Matching locations can be identified as "nearby" places (i.e., locations within a predetermined distance or radius of the user location). For instance, one user might identify a user location as "North London" and another user might identify a user location as "South London." The location library can return a list of results/hits for each location. If, among the lists of hits, there is at least one pair of hits (one per user) within a predetermined distance or radius (e.g., 20 miles) of each other, the users can be considered "nearby." Method 1500 ends at step 1536.
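The multi-posting check can be sketched with a great-circle distance and a share threshold. The haversine formula is a standard choice that the text does not prescribe; the function names, the 3958.8-mile Earth radius, and the convention that a user counts as nearby to themselves are assumptions.

```python
from math import asin, cos, radians, sin, sqrt
from typing import List, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_miles(a: Coord, b: Coord) -> float:
    """Great-circle distance between two coordinates, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3958.8 * asin(sqrt(h))


def users_nearby(hits_a: List[Coord], hits_b: List[Coord], radius_miles: float = 20.0) -> bool:
    """True if any hit of one user lies within the radius of any hit of the other."""
    return any(haversine_miles(p, q) <= radius_miles for p in hits_a for q in hits_b)


def keep_user_location(
    user_hits: List[List[Coord]], index: int,
    min_share: float = 0.75, radius_miles: float = 20.0,
) -> bool:
    """Keep user `index`'s location if enough users (that user included) are nearby."""
    matching = sum(
        users_nearby(user_hits[index], other, radius_miles) for other in user_hits
    )
    return matching / len(user_hits) >= min_share
```

For three postings, two nearby users give a share of 2/3, so the three-posting example in the text corresponds to calling the function with `min_share=2/3` rather than the 75% default.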
In some embodiments, the method 1500 for qualifying user locations using community heuristics may include only any subset of, or an alternative connection of, the features depicted in or discussed herein in regard to the corresponding figure.
Another figure shows a flowchart depicting an embodiment of a method 1600 for determining primary location geo-coordinates according to an embodiment of the disclosure.
In the embodiment shown, method 1600 is implemented at step 1316 of method 1300 by location ranking engine 364. In method 1600, the system ranks all locations remaining after the completion of the various processes discussed above. In the embodiment shown, the ranking is based on a confidence level of the relevance of the location to an event that is the subject matter of the social media posting. At step 1608, the system ranks each remaining location based on the set of qualified user locations discussed above. This ranking represents the highest level of confidence that a location corresponds to the event. At step 1612, the system ranks each remaining location based on the location taxonomy list 336 discussed above. This ranking represents a lower level of confidence that a location corresponds to the event compared to a location ranked using a qualified user location. At step 1616, the system ranks each remaining location based on the heuristics used by heuristic-based classifier module 328 and the KB-based classifier module 332 discussed above. This ranking represents the lowest level of confidence that a location corresponds to the event compared to either a location ranked using a qualified user location or a location ranked using location taxonomy list 336.
In the ranking method 1600, each source (e.g., taxonomy, rule-based, and KB) is assigned a separate confidence weight as discussed above in steps 1608, 1612, and 1616. For instance, taxonomy-based locations can have a higher confidence score than heuristics-based or KB locations. Each location can also be given a ranking score based on the following factors: 1) its position within the posting (e.g., leftmost location, rightmost location, second-left location, etc.); and 2) its inclusion within other locations inside the posting (e.g., "Flooding in Paris, France" will have both "Paris" and "France" tagged as locations, but since Paris is located in France, "Paris" will have a higher ranking score). The final ranking score can be a linear interpolation of the source confidence weights and the ranking scores. The ranking performance can be assessed within the end-to-end evaluation of the system, and the weights can be tuned such that they give the system maximum performance. In some embodiments, each of steps 1608, 1612, and 1616 is performed on each remaining location. In other embodiments, step 1612 is only performed on a remaining location that could not be ranked at step 1608 and step 1616 is only performed on a remaining location that could not be ranked at either step 1608 or 1612.
At step 1620, the system identifies the geo-coordinates that most closely match each of the locations as primary location geo-coordinates. Having ranked the locations based on the above criteria, the system can use a pairwise minimum-distance process to determine the results/hits that generate the shortest distances. This step can be illustrated with reference to the following scenarios. In scenario A, the following social posting is received: "Two-alarm fire at 30 Main St." For this posting, a qualified user location is "Fort Lee, NJ." In scenario B, the following social posting is received: "Hurricane alert for Paris, Texas." For this posting, no qualified user location could be determined and, therefore, no user location is used.
In scenario A, the system determines a location mention of "30 Main St" and a qualified user location of "Fort Lee, NJ" after implementation of the methods disclosed herein. The system receives results from a location library including three hits: "30 Main St., Brooklyn, NY," "30 Main St., Flushing, NY," and "30 Main St., Fort Lee, NJ." Because the posting includes a qualified user location, the system implements step 1608 and determines a rank of the locations using the qualified user location "Fort Lee, NJ." Based on this ranking, the location result "30 Main St., Fort Lee, NJ" is kept and the other two locations are discarded. In this scenario, steps 1612 and 1616 are not performed because a location ranked using a qualified user location represents the highest confidence level, so further processing is unnecessary. The system then implements step 1620 to find a pair of geo-coordinates that corresponds to "30 Main St., Fort Lee, NJ," and designate those geo-coordinates as corresponding to the primary location of the event described in the posting.
In scenario B, the system determines a location mention of "Paris, Texas" after implementation of the methods disclosed herein. The system receives results from a location library including three hits: "Paris, France," "Paris, Texas," and "Paris, Illinois." However, because the system fails to qualify the user location, it relies solely on the locations mentioned in the posting text to determine the primary location. The system skips step 1608 because there is no qualified user location and implements step 1612 to determine a rank of the locations based on the taxonomy list. Since "Paris" and "Texas" are determined to be the names of a city and state, respectively, from the taxonomy list, "Paris, Texas" is ranked as the most likely location in the list of hits. The system then implements step 1620 to find a pair of geo-coordinates that corresponds to "Paris, Texas" and designate those geo-coordinates as corresponding to the primary location of the event described in the posting.
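The ranking score of method 1600 and the pairwise minimum-distance step of scenario A can be sketched together. Everything numeric here is an assumption: the source weights only preserve the ordering described in steps 1608 to 1616, `alpha` is a hypothetical interpolation factor, and the coordinates are rough values chosen for illustration rather than library output.

```python
from itertools import product
from math import asin, cos, radians, sin, sqrt
from typing import Dict, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees

# Hypothetical confidence weights: user > taxonomy > heuristic / KB.
SOURCE_WEIGHT = {"user": 1.0, "taxonomy": 0.8, "heuristic": 0.5, "kb": 0.5}


def final_score(source: str, ranking_score: float, alpha: float = 0.5) -> float:
    """Linear interpolation of source confidence and in-posting ranking score."""
    return alpha * SOURCE_WEIGHT[source] + (1.0 - alpha) * ranking_score


def haversine_miles(a: Coord, b: Coord) -> float:
    """Great-circle distance between two coordinates, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3958.8 * asin(sqrt(h))


def closest_pair(mention_hits: Dict[str, Coord], anchor_hits: Dict[str, Coord]) -> Tuple[str, str]:
    """Return the (mention hit, anchor hit) names separated by the shortest distance."""
    return min(
        product(mention_hits, anchor_hits),
        key=lambda pair: haversine_miles(mention_hits[pair[0]], anchor_hits[pair[1]]),
    )


# Scenario A with rough, illustrative coordinates.
MENTION_HITS = {
    "30 Main St., Brooklyn, NY": (40.70, -73.99),
    "30 Main St., Flushing, NY": (40.76, -73.83),
    "30 Main St., Fort Lee, NJ": (40.851, -73.971),
}
USER_HITS = {"Fort Lee, NJ": (40.85, -73.97)}
```

In scenario A the pair with the shortest distance is the Fort Lee address and the Fort Lee user location, so its coordinates become the primary location geo-coordinates.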
In the embodiments shown, method 1600 ends at step 1624. When the primary location geo-coordinates have been determined, the system can enrich the incoming social posting stream 304 by adding the geo-coordinates as metadata fields in the posting object(s). The stream can be consumed by any downstream application seamlessly and in real time.
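One way to picture this enrichment is a generator that wraps the incoming stream and attaches the geo-coordinates to each posting object as it passes through. The sketch below is an assumption-laden illustration: the function name `enrich_stream`, the dictionary representation of a posting object, the `primary_location` field name, and the `locate` callback are hypothetical and do not come from the disclosure.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional, Tuple

Coords = Tuple[float, float]  # (latitude, longitude)


def enrich_stream(
    postings: Iterable[Dict],
    locate: Callable[[Dict], Optional[Coords]],
) -> Iterator[Dict]:
    """Yield copies of posting objects with geo-coordinate metadata added.

    `locate` is a hypothetical callback that returns the primary location
    geo-coordinates for a posting, or None when none could be determined.
    """
    for posting in postings:
        enriched = dict(posting)  # leave the incoming object untouched
        coords = locate(posting)
        if coords is not None:
            enriched["primary_location"] = {"lat": coords[0], "lon": coords[1]}
        yield enriched


# Usage with a stand-in locator and placeholder coordinates.
def fake_locate(posting):
    return (40.85, -73.97) if "30 Main St" in posting["text"] else None


incoming = [
    {"id": 1, "text": "Two-alarm fire at 30 Main St."},
    {"id": 2, "text": "Good morning!"},
]
for enriched in enrich_stream(incoming, fake_locate):
    print(enriched)
```

Because the generator yields each posting as soon as it is processed, a downstream consumer can read the enriched stream without waiting for the whole batch.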
The downstream applications may include, e.g., news detection, disaster detection, and user profiling applications. For example, in some embodiments, the modified posting objects can be output to another application by communicating with a user system over one or more communication networks. In some embodiments, the modified posting objects can be output directly to a user over one or more communication networks. In some embodiments, the application can run in two modes: single-message and multi-message. Each mode can be used in streaming or pull fashion. For example, clusters of geo-tagged postings in a multi-message mode can be consumed via API calls or via a streaming service that enriches a UI or another application. A single geo-tagged posting in single-message mode can be similarly consumed.
In some embodiments, the method 1600 for determining primary location geo-coordinates may include only any subset of, or an alternative connection of, the features depicted in or discussed herein in regard to .
It may be appreciated that the functions described above may be performed by multiple types of software applications, such as web applications or mobile device applications. If implemented in firmware and/or software, the functions described above may be stored as one or more instructions or code on a non-transitory computer-readable medium.
Examples include non-transitory computer-readable media encoded with a data structure and non-transitory computer-readable media encoded with a computer program. Non-transitory computer-readable media include physical computer storage media. A physical storage medium may be any available medium that can be accessed by a computer. By way of example, and not limitation, such non-transitory computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other physical medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Disk and disc include compact discs (CD), laser discs, optical discs, digital versatile discs (DVD), floppy disks and Blu-ray discs. Generally, disks reproduce data magnetically, and discs reproduce data optically. Combinations of the above are also included within the scope of non-transitory computer-readable media. Moreover, the functions described above may be achieved through dedicated devices rather than software, such as a hardware circuit comprising custom VLSI circuits or gate arrays, off-the-shelf semiconductors such as logic chips, transistors, or other discrete components, all of which are non-transitory. Additional examples include programmable hardware devices such as field programmable gate arrays, programmable array logic, programmable logic devices or the like, all of which are non-transitory. Still further examples include application specific integrated circuits (ASIC) or very large scale integrated (VLSI) circuits. In fact, persons of ordinary skill in the art may utilize any number of suitable structures capable of executing logical operations according to the described embodiments.
For example, embodiments of the social media system 104, geo-location system 108, application 112, and user system 116, and/or any individual one, subset, or all of the components thereof, may be implemented as hardware, software, or a mixture of hardware and software. For example, each of the social media system 104, geo-location system 108, application 112, and user system 116, and/or any individual one, subset, or all of the components thereof, may be implemented using a processor and a non-transitory machine-readable storage medium, where the non-transitory machine-readable storage medium includes program instructions that when executed by the processor perform embodiments of the functions of such components discussed herein. In embodiments, each of the social media system 104, geo-location system 108, application 112, and user system 116, and/or any individual one, subset, or all of the components thereof, may be implemented using one or more computer systems, such as, e.g., a desktop computer, laptop computer, mobile computing device, network device, server, Internet server, cloud server, etc.
The above specification and examples provide a complete description of the structure and use of illustrative embodiments. Although certain embodiments have been described above with a certain degree of particularity, or with reference to one or more individual embodiments, those skilled in the art could make numerous alterations to the disclosed embodiments without departing from the scope of this invention. As such, the various illustrative embodiments of the disclosed methods, devices, and systems are not intended to be limited to the particular forms disclosed. Rather, they include all modifications and alternatives falling within the scope of the claims, and embodiments other than those shown may include some or all of the features of the depicted embodiment. For example, components may be combined as a unitary structure and/or connections may be substituted. Further, where appropriate, aspects of any of the examples described above may be combined with aspects of any of the other examples described to form further examples having comparable or different properties and addressing the same or different problems. Similarly, it will be understood that the benefits and advantages described above may relate to one embodiment or may relate to several embodiments.
Additional embodiments of the social media system 104, geo-location system 108, application 112, and user system 116, and associated methods, as discussed herein, are possible. For example, any feature of any of the embodiments of these systems and methods described herein may be used in any other embodiment of these systems and methods. Also, embodiments of these systems and methods may include only any subset of the components or features of these systems and methods discussed herein.
Claims (21)
1. A method of determining locations for social media postings, the method comprising: retrieving, by communicating with at least one application programming interface (API) of a social media system over one or more first communication networks, at least one social media posting; determining at least one location mention in text of the at least one social media posting, and at least one textual user location of the at least one social media posting; determining a plurality of locations for the at least one location mention and the at least one textual user location, wherein each of the plurality of locations includes a set of geo-coordinates; comparing terms in an account name or account description of the at least one social media posting to a taxonomy list, validating the at least one textual user location when at least one of the terms matches the taxonomy list, and discarding the at least one textual user location when none of the terms match the taxonomy list; selecting one of the plurality of locations as a primary location; storing, in at least one database on a non-transitory machine-readable storage medium, at least one posting object for the at least one social media posting including the primary location; and outputting, by communicating with a user system over one or more second communication networks, the at least one social media posting with the determined primary location.
2. The method of claim 1, wherein the determining at least one location mention includes implementing a taxonomy-based identification including: comparing terms of the at least one social media posting with a list of location names; identifying the at least one location mention when a term of the at least one social media posting matches or proximately matches a name from the list of location names; and removing the identified term from further processing of the social media posting for location mentions.
3. The method of claim 2, further comprising at least one of: mapping a nationality term in the at least one social media posting to a corresponding country in the list of location names, or mapping a possessive term in the at least one social media posting to a corresponding location in the list of location names.
4. The method of claim 1, wherein the determining at least one location mention includes implementing a heuristic-based identification including: identifying an indicator term in the at least one social media posting; and identifying a potential location mention based on a rule for the indicator term, wherein the identifying the potential location mention includes identifying at least one of: a term of the at least one social media posting preceding the indicator term, or a term of the at least one social media posting succeeding the indicator term based.
5. The method of claim 4, wherein identifying the indicator term identifies the indicator term based on an indicator term type, the indicator term type including at least one of: a cardinal direction, a landmark identifier, a distance indicator, an urban landmark indicator, a natural landmark indicator, an agency, a city area, a movement indicator, a border indicator, or a pronoun; and wherein the rule for the indicator term is based on the indicator term type of the identified indicator term.
6. The method of claim 4, further comprising: evaluating the identified potential location mention based on a capitalization of the identified potential location mention; and comparing the identified potential location mention to a list of blacklisted terms.
7. The method of claim 1, wherein the identifying at least one location includes implementing a knowledge-base-based identification including: expanding the at least one location mention to include at least one alias based on an association between the at least one location mention and the at least one alias in a knowledge-base (KB), the KB including associations between location mentions and aliases, an association between a particular location mention and a corresponding alias in the KB being based on a frequency that the particular location mention and the corresponding alias co-occur in social media postings.
8. The method of claim 1, further comprising pre-processing the at least one social media posting, wherein the pre-processing includes at least one of: removing tions from the at least one social media posting; splitting hashtags of the at least one social media posting; or removing special characters from the at least one social media posting.
9. The method of claim 1, wherein determining the at least one location from the at least one location mention includes: looking up the at least one location mention in a location library; receiving one or more geo-location results corresponding to the at least one location mention; and determining a score for each of the one or more geo-location results.
10. The method of claim 9, wherein the score represents a degree of correlation between the at least one location mention and each of the one or more geo-location results.
11. The method of claim 9, wherein determining the primary location includes qualifying one or more geo-location results from the location library, the qualifying including: comparing the score for each of the one or more geo-location results to a score threshold; and discarding the one or more geo-location results when the score is less than the score threshold.
12. The method of claim 11, wherein qualifying the one or more geo-location results from the location library includes: analyzing a variance of the scores of the one or more geo-location results when at least some of the scores are greater than the score threshold; keeping the one or more geo-location results when the variance is greater than a variance threshold; and discarding the one or more geo-location results when the variance is less than the variance threshold.
13. The method of claim 1, wherein determining the primary location includes qualifying the at least one user location using community heuristics.
14. The method of claim 13, wherein when the at least one social media posting is received in a single posting mode, qualifying the at least one user location using community heuristics includes: comparing a term in an account name of the at least one social media posting to a taxonomy list; keeping the at least one user location as the at least one location when the term of the account name matches the taxonomy list; and discarding the at least one user location when the term of the account name does not match the taxonomy list.
15. The method of claim 13, wherein when the at least one social media posting is received in a multi-posting mode, qualifying the at least one user location using community heuristics includes: determining whether the at least one user location matches user locations of at least a predetermined percentage of users posting social media postings of the multi-posting mode; keeping the at least one user location when it matches the user locations of the at least predetermined percentage of users; and discarding the at least one user location when it does not match the user location of the at least predetermined percentage of users.
16. The method of claim 1, wherein determining the primary location from the one or more locations includes at least one of: ranking the one or more locations based on at least one qualified user location; ranking the one or more locations based on results of a taxonomy-based identification; ranking the one or more locations based on results of a heuristic-based identification; or ranking the one or more locations based on results of a knowledge-base-based identification.
17. The method of claim 1, wherein determining the primary location includes determining a shortest distance between a user location and the one or more locations.
18. The method of claim 1, wherein the method further comprises: expanding the at least one location mention to include at least one alias.
19. The method of claim 1, wherein the method further comprises: expanding the at least one location mention to include at least one alias based on an association between the at least one location mention and the at least one alias in a knowledge-base (KB) including associations between location mentions and aliases.
20. A non-transitory machine-readable storage medium including programming instructions, which when executed by at least one processor perform a method of determining locations for social media postings, the method comprising: retrieving, by communicating with at least one application programming interface (API) of a social media system over one or more first communication networks, at least one social media posting; determining at least one location mention in text of the at least one social media posting, and at least one textual user location of the at least one social media posting; determining a plurality of locations for the at least one location mention and the at least one textual user location, wherein each of the plurality of locations includes a set of geo-coordinates; comparing terms in an account name or account description of the at least one social media posting to a taxonomy list, validating the at least one textual user location when at least one of the terms matches the taxonomy list, and discarding the at least one textual user location when none of the terms match the taxonomy list; selecting one of the plurality of locations as a primary location; storing, in at least one database on a non-transitory machine-readable storage medium, at least one posting object for the at least one social media posting including the primary location; and outputting, by communicating with a user system over one or more second communication networks, the at least one social media posting with the determined primary location.
21. A system for determining locations for social media postings, the system comprising: at least one processor; and a non-transitory machine-readable storage medium including programming instructions, which, when executed by at least one processor, perform a method of determining locations for social media postings, the method comprising: retrieving, by communicating with at least one application programming interface (API) of a social media system over one or more first communication networks, at least one social media posting; determining at least one location mention in text of the at least one social media posting, and at least one textual user location of the at least one social media posting; determining a plurality of locations for the at least one location mention and the at least one textual user location, wherein each of the plurality of locations includes a set of geo-coordinates; comparing terms in an account name or account description of the at least one social media posting to a taxonomy list, validating the at least one textual user location when at least one of the terms matches the taxonomy list, and discarding the at least one textual user location when none of the terms match the taxonomy list; selecting one of the plurality of locations as a primary location; storing, in at least one database on a non-transitory machine-readable storage medium, at least one posting object for the at least one social media posting including the primary location; and outputting, by communicating with a user system over one or more second communication networks, the at least one social media posting with the determined primary location.
[Figure: block diagram with labels including User System 116, Application 112, Geo-Location System 108, Location Mention ID Module 132, Location Determining Module 136, Qualifying Module 140, and Database 148.]
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US62/419,609 | 2016-11-09 | ||
US15/787,416 | 2017-10-18 |
Publications (1)
Publication Number | Publication Date |
---|---|
NZ793494A (en) | 2022-10-28 |