CA2868948A1

CA2868948A1 - System and method for identifying experts on social media

Info

Publication number: CA2868948A1
Application number: CA2868948A
Authority: CA
Inventors: Nilesh Bansal; Nick Koudas
Original assignee: University of Toronto
Current assignee: University of Toronto
Priority date: 2014-10-24
Filing date: 2014-10-24
Publication date: 2016-04-24

Abstract

A system and method for identifying experts on social media and more specifically to systems and methods for identifying experts, topics and followers in social media networks that may be used to engage or track a wide and relevant audience for message targeting.

Description

2 TECHNICAL FIELD

3 [0001] The following relates generally to a system and method for identifying experts on

4 social media and more specifically to systems and methods for identifying experts, topics and followers in social media networks that may be used to engage or track a wide and relevant 6 audience for message targeting.

8 [0002] Social media has transformed the way we interact online as individuals and 9 consumers. At the same time, it is transforming the way businesses aim to interact with their customers and fans online. Before social media became mainstream, online marketers and 11 advertisers resorted to the collection of behavioral online information regarding individuals to 12 target their messages. Individuals were primarily targeted based on the topical focus of the sites 13 they visited. For example, sports news sites might display advertising related to the perceived 14 interests of sports fans. The general interests of sports fans would be derived based on third party market research (e.g., males aged 25-35 with interest in sports are also interested in 16 certain types of movies or specific male grooming products).
17 [0003] In the early stages of the social web, bloggers on particular topics with wide 18 followings were identified to endorse or sponsor specific products. At the same time, bloggers 19 started serving advertisements on their blog real estate.
[0004] Social media is transforming the way marketers and advertisers spend their budgets.
21 Novel ways to market online are gaining traction both from an academic as well as a practical 22 point of view. In particular, influencer-based targeting in social media has emerged as a very 23 popular way to market on social platforms (such as Twitter and Facebook). Individuals are 24 identified as online experts in particular topics; they are either incentivized to participate in sponsored advertising by spreading the messages to their followers or the platforms 26 automatically insert sponsored messages in their activity streams (as in the case of 27 Twitter/Facebook advertising). Further, they may be targeted with relevant content such that 28 they organically share it with their followers. The goal is to increase brand awareness, by 29 increasing the number of impressions (e.g., how many followers see a particular message) and click-throughs to particular campaign (how many click on the link embedded in the message) 31 with the ultimate goal to track conversions (how many end up purchasing a product).

1 [0005] As an example, with more than 250 million users, Twitter has emerged as a 2 prominent marketing and advertising vehicle in addition to being a prominent social 3 communications platform.

[0006] In one aspect a system for identifying one or more experts of a topic on a social 6 network is provided, the system comprising a server in communication over a network with a 7 social network, the server comprising: (a) a user interface unit configured to obtain a topical 8 query representing the topic; (b) an obtaining unit configured to obtain social network data from 9 the social network, the social network data comprising one or more topical lists and a social graph representing user relationships in the social network, each topical list identifying one or 11 more users; (c) a tokenizing unit configured to: (i) tokenize titles of the topical lists and lexically 12 group the tokens into token groupings; and (ii) tokenize the topical query to determine at least 13 one token grouping to which the topical query corresponds; and (d) a processing unit configured 14 to: (i) generate, for each user, a topic signature vector comprising topic signature vector elements corresponding to the token groupings for which the user is identified in the 16 corresponding topical lists; (ii) generate for each topic signature vector element an occurrence 17 count representing the number of times each of the token groupings is identified for the user; (iii) 18 rank the users by their occurrence counts for the at least one token groupings corresponding to 19 the topical query; and (iv) return a selected set of the ranked users as experts in the topic.
[0007] In another aspect, a computer network implemented method for identifying one or 21 more experts of a topic on a social network is provided, the method comprising: (a) obtaining a 22 topical query representing the topic; (b) obtaining social network data from the social network, 23 the social network data comprising one or more topical lists and a social graph representing 24 user relationships in the social network, each topical list identifying one or more users; (c) tokenizing titles of the topical lists and lexically grouping the tokens into token groupings; (d) 26 tokenizing the topical query to determine at least one token grouping to which the topical query 27 corresponds; (e) generating, by a processing unit comprising one or more processors, for each 28 user, a topic signature vector comprising topic signature vector elements corresponding to the 29 token groupings for which the user is identified in the corresponding topical lists; (f) generating, by the processing unit, for each topic signature vector element an occurrence count 31 representing the number of times each of the token groupings is identified for the user; (g) 32 ranking the users by their occurrence counts for the at least one token groupings corresponding 33 to the topical query; and (h) returning a selected set of the ranked users as experts in the topic.

2 [0008] The features of the invention will become more apparent in the following detailed 3 description in which reference is made to the appended drawings wherein:
4 [0009] Fig. 1 is a block diagram of a system for identifying experts on social media;
[0010] Fig. 2 is a flowchart illustrating the process of creating a topic signature;
6 [0011] Fig. 3 illustrates a user interface for accessing the system;
7 [0012] Fig. 4 illustrates a user interface for accessing the system;
8 [0013] Fig. 5 illustrates a user interface for accessing the system;
9 [0014] Fig. 6 is a sample Twitter user and topic graph;
[0015] Fig. 7 is a graph illustrating the potential effect of random sampling on aspects of the 11 system;
12 [0016] Fig. 8 is a graph illustrating the potential effect of random sampling on aspects of the 13 system;
14 [0017] Fig. 9 is a graph illustrating the potential effect of random sampling on aspects of the system;
16 [0018] Fig. 10 is a graph illustrating the potential effect of random sampling on aspect of the 17 system; and 18 [0019] Fig. 11 is a graph illustrating the potential effect of random sampling on the system.

[0020] It will be appreciated that for simplicity and clarity of illustration, where considered 21 appropriate, reference numerals may be repeated among the figures to indicate corresponding 22 or analogous elements. In addition, numerous specific details are set forth in order to provide a 23 thorough understanding of the embodiments described herein. However, it will be understood by 24 those of ordinary skill in the art that the embodiments described herein may be practiced without these specific details. In other instances, well-known methods, procedures and components 26 have not been described in detail so as not to obscure the embodiments described herein. Also, 27 the description is not to be considered as limiting the scope of the embodiments described 28 herein.

1 [0021] It will also be appreciated that any module, unit, component, server, computer, 2 terminal or device exemplified herein that executes instructions may include or otherwise have 3 access to computer readable media such as storage media, computer storage media, or data 4 storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Computer storage media may include volatile and non-volatile, removable 6 and non-removable media implemented in any method or technology for storage of information, 7 such as computer readable instructions, data structures, program modules, or other data.
8 Examples of computer storage media include RAM, ROM, EEPROM, flash memory or other 9 memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage mother magnetic storage devices, or any 11 other medium which can be used to store the desired information and which can be accessed 12 by an application, module, or both. Any such computer storage media may be part of the device 13 or accessible or connectable thereto. Any application or module herein described may be 14 implemented using computer readable/executable instructions that may be stored or otherwise held by such computer readable media and executed by the one or more processors.
16 [0022] Advertising and marketing on Twitter involves two crucial steps. First, being able to 17 identify who are the "experts" on any topic on the platform and second, being able to identify 18 sets of users with active "interest" on a particular topic. In the context of Twitter, an expert in a 19 particular topic is represented as an account (user) that primarily produces and shares content related to that topic and has a wide following that actively engages with the produced content 21 (sharing, re-tweeting, etc.). A user may demonstrate interest in a particular topic if, for example 22 the user follows a number of experts in the topic and engages with the content they produce.
23 [0023] A need exists to be able to identify experts on any given topic and analytical 24 functions on the set of experts' accounts on a specific topic, such as what other topics they are experts in, what conversations they participate in, and what types of content they share online.
26 A further need exists to identify other users (e.g., followers) that are likely to be interested in a 27 given topic.
28 [0024] The following relates generally to systems and methods for identifying experts on 29 social media. The system is configured to collect data on user interaction, communication and profile information to identify experts, topics and followers in social networks that may be used 31 to engage or track a wide and relevant audience for marketing purposes.
In another aspect, 32 such information may be provided via a user interface to enable message targeting decisions.

1 [0025] Social networks like Google+, Facebook, Twitter, and Pinterest, have emerged as 2 vehicles for marketing and branding. Marketers seeking to engage with, and advertise to, 3 consumers may wish to identify a network of experts in and followers of given topics to whom to 4 market specific content, as interests in a topic may correlate to sales of a given product or service. Without a loss of generality, Twitter will be used herein as an example of a social 6 platform from which content may be collected to provide data regarding experts, topics and 7 followers. The techniques described may be applied equally well to any other similar social 8 platform. The terms "follower", "marketer" and "social networks" are used herein illustratively 9 and in a non-limiting manner. These terms could be substituted for appropriate parties as applicable to alternative implementations.
11 [0026] In another aspect, a system and method is provided for characterizing the expertise 12 of particular social network users among a set of topics, including the generation of a topic 13 signature for each user of a social network. A topic signature comprises a list of all topics of 14 expertise of the user. Additionally, a system and method is provided to produce an aggregate signature. An aggregate signature comprises a list of topics in which a set of users has 16 expertise. Both topic and aggregate signatures can be interpreted as a ranking of most relevant, 17 for the purposes of reaching the largest audience, to least relevant topics.
18 [0027] Many social networks provide advertising platforms with tools for marketers. For 19 example, in use, a marketer would utilize the Twitter advertising platform in one of the following three ways: Firstly, the advertiser provides a set of Twitter user handles, and Twitter targets 21 advertisements to the followers of these accounts. Being able to identify sets of experts in any 22 topic readily aids advertisers to identify the most relevant accounts to provide advertisements to 23 while instigating a Twitter advertising campaign.
24 [0028] Secondly, the advertiser bids on a list of topics on Twitter. Twitter, using their own proprietary algorithms, identifies which users are interested in the topic and subsequently 26 targets those users with messages "promoted" by the advertisers (inserting them in their tweet 27 stream). By analyzing related topics for a topic of interest, advertisers can identify possibly 28 cheaper topics to bid on. For example, if the price for 'social marketing' is too high, 'seo as a 29 related topic, which may have a relatively lower bid price may be used instead. The effectiveness of the campaign may be the same, due to the substantial overlap between the 31 two.
32 [0029] Thirdly, the advertisers bid on search keywords (to target searches input to the 33 Twitter search feature). Information on Twitter is temporal by nature and events evolve with

5 1 time, thus the keywords used in searches evolve over time. When a keyword is used during a 2 search query on Twitter for which an advertisement exists, the platform will display promoted 3 messages (as advertising) along with the search results. Advertisers may wish to identify 4 keywords related to a queried keyword at a given time.
[0030] However, the applicants have now determined that advertisers may also be

6 interested in specific users that would be highly relevant as followers of a given user. These

7 new followers should be highly interested in the topics for which the given user has expertise,

8 since the followers desire to follow the given user's messages.
Furthermore, the applicants have

9 determined that advertisers may be well served by not just understanding who the experts are in relation to a given topic, but what other topics those users are interested in; for example, by 11 identifying all experts in 'cloud computing' with interests in 'photography' or experts in 'food and 12 dining' with interest in 'movies'. Such sets of experts can be targets of novel engagement 13 campaigns that attract attention by combining their area of expertise and their interests.
14 [0031] Various interactions between users and social networks such as Twitter result in data generation. Many social media networks record their interactions with their users. The system is 16 configured to obtain interaction data via various sources and/or connectors. The social networks 17 collect and record such data in logs stored on social network nodes or a network-accessable 18 server. A social network may provide access to data collected on user interactions. For 19 example, Twitter provides a Gardenhose streaming API which may be used to access messages and user profile information. Thus, Twitter activity may be stored in files that may be 21 automatically created and maintained by a given server or set of servers. In another aspect, 22 Twitter may be accessed directly via network connection, such as the internet, and data may be 23 crawled, scraped and indexed. Crawling and scraping may be performed using various 24 techniques by employing varying levels of automation. The obtained data may be stored in a database for ready use by the system.
26 [0032] Referring now to Fig. 1, an exemplary system for identifying experts in a social 27 network is shown. The system comprises network 106 connected elements including a server 28 100 linked to a database 101, and a social network 105. The network is in most 29 implementations the internet. The server comprises an obtaining unit 109 for obtaining social network data, a tokenizing unit 108 for tokenizing social media messages, a processing unit 102 31 for processing the social media data to generate each user's topic signature and an aggregate 32 signature for each topics, respectively, and an indexing unit 107 for interaction with the 33 database to locally store the social network data. In further embodiments, the system may 1 comprise a user interface unit 103. In yet an additional embodiment, a graphing unit 104 may be 2 provided for generating or obtaining a social graph enabling a query for users that a given user 3 is following. A representative embodiment contemplates the use of a processor which, it will be 4 understood, could be implemented by a plurality of processors which could be distributed.
[0033] Referring now to Fig. 3, an exemplary user interface is shown. The user interface 6 enables search for various social network experts. As previously described with reference to 7 Fig. 1, the user interface enables interaction by a marketer (or similar person, organization or 8 other end-user) with the system. A marketer may select from a plurality of commands in order to 9 view social network expert data.
[0034] As shown in Fig. 3, the system may present a user with a plurality of options for 11 identifying expert accounts. A search box 301 accepts topic queries with full boolean syntax.
12 Alternatively, several advanced queries may be performed. The marketer may input a query to 13 identify a user that is an expert in one topic 302 and displays an interest in another topic 303.
14 The marketer may also find users interested in a given topic 304, and configure the search by selecting a checkbox to identify users that woud provide the widest reach 305.
The option "max 16 reach" 305, if selected, instructs the system to identify experts of topic A with interest in topic B, 17 to collectively maximize the follower reach. This effectively means that the set of accounts 18 reported of cardinality K collectively reach the maximum number of unique accounts (thus 19 maximizing user impressions) among all possible subsets of expert accounts of size K, as will be discussed in more detail below. The user interface is further operable to accept the input of a 21 particular user account 306 and a topic 307 to find followers of that user that have interest in a 22 given topic. Any of the foregoing searches could also be limited to particular time intervals to 23 pinpoint experts at relevant times.
24 [0035] To accomplish the foregoing, as mentioned above, the server comprises an obtaining unit 109 for obtaining social network data, a tokenizing unit 108 for tokenizing social 26 media messages, a processing unit 102 for processing the social media data to generate each 27 user's topic signature and an aggregate signature for each topics, respectively, and an indexing 28 unit 107 for interaction with the database to locally store the social network data. In the case of 29 the exemplary system that utilizes public Twitter data, obtaining unit 109 fetches 3 pieces of information from public Twitter data feed: the actual textual tweet contents from the public 31 Gardenhose streaming API, the Twitter follower graph (who follows who), and the Twitter lists, 32 as more fully set forth below.

1 [0036] The tokenizing unit 108 tokenizes the Twitter lists to produce topics and associates 2 the topics to the users of the lists. The processing unit 102 uses the output association (of user 3 to topics) from tokenizing unit 108 to instruct indexing unit 107 to stores it as a fast accessible 4 index "IXD" in the database. The user interface unit 103 can use index "IXD" to process the topic query "q" to return the list of experts "E" as associated to the topics by tokenizing unit 108.
6 [0037] "IXD" supports this by storing (1) the inverted index from topics to a collection of 7 users who belong to a Twitter list associated with the topic, and (2) total number of topic lists a 8 given user belongs to. Using "IXD", one can compute, given a specific topic and a specified 9 user, the number of times the user is listed in a Twitter list associated with the specified topic (represented as "frequency count" or "occurrence count"). The inverted index is used to 11 compute the set of all experts "E" associated with "q" by finding set intersection of index 12 associated with each topic "t" present in the query "q". The experts in "E" can be ranked for 13 display using user interface unit 103 by using the frequency count as described above.
14 [0038] To return a list of related topics to "q", user interface unit 103 will consult "IXD" to lookup all other topics for all users from "E", call this set of all topics "A". From "A", a list of top 16 ranked topics is presented to the user using some scoring function (frequency count, or tf.idf) 17 [0039] Referring now to Fig. 4, upon issuing a topic query, a plurality of expert accounts are 18 retrieved and displayed 401 along with corresponding profile information. This can be 19 accomplished by use of topic signature generation, which is described more fully below.
Analytics on these accounts are also provided, which as shown include topics associated with 21 the query topic in the topic signatures of these users 402, keywords used frequently in the 22 recent messages of these users 403, keyword pairs resulting from a frequency analysis of the 23 keyword associations between keywords in messages 404, as well as hashtags 405 which 24 represent frequent hashtags in the recent messages of the accounts identified . For each such analysis function a visual word cloud 406 may be provided to study the words identified and 26 their associated frequency via suitable font sizing. Fig. 4 illustrates how one can infer what 27 topics they are also experts in (402), what type of conversations they participate in and share 28 online (403, 404 and 405).
29 [0040] Referring now to Fig. 5, all domains from which content is frequently shared (in the form of URL links) via the messages of these users are shown 501.
Additionally, a user may 31 select the option to 'Analyze the interests of their followers' 502 and the system may 32 automatically conduct the same analysis but this time taking into account collective followers of 33 these accounts. This provides a mechanism for identifying the interests and analyzing the 1 'audience' of any expert group on Twitter. Preferably in this case the same type of output are 2 returned to keep the same user experience. The means to provide the response to each of the 3 foregoing types of queries will now be described.
4 [0041] Referring again to Fig. 1, the server 100 is in real time communication with the social network 105. The obtaining unit 109 obtains messages and corresponding metadata from the 6 social network. Generally, message data and metadata can be obtained from APIs provided by 7 social network, such as public APIs from Twitter, for example. For example, the obtaining unit 8 109 may utilize a streaming API (not shown) provided by the social network 105 to receive 9 messages and associated metadata. The metadata may include the following:
author name, author userid, set of followers and friends of the author, and lists the author has created. The 11 database 101 may be integrated with the server, located in proximity of the server, or remotely 12 from the server and accessible by network connection. Upon obtaining the message data and 13 metadata, the obtaining unit 109 stores the data and metadata on the database. A suitable 14 storage approach utilizes a compressed row format with each message being assigned a message identifier. As the data is stored, the indexing unit 107 is configured to create a table for 16 each day for storage in the database. Each row in the table is a unique account identifier and a 17 list of all message identifiers the account produced that day.
18 [0042] Relaxed transactional semantics may be run to increase throughput across multiple 19 threads reading and writing the table. The tables for a selected time period may be stored on solid state drives (SSDs) for increased performance. The collection of tables keeping the 21 association between account identifier and message identifiers may be stored in the database.
22 The indexing unit may retrieve for any day, the identifiers of all messages produced that day for 23 any set of accounts. The indexing unit may then provide the collection of all message identifiers 24 to the database to retrieve the actual messages.
[0043] The obtaining unit is configured to collect account relationships, such as which users 26 follow others directly. Certain social networks are configured to permit users to create lists 27 containing a descriptive name (supplied by the creator) and a set of accounts associated to the 28 list (supplied by the creator). For example, a list on "machine learning" may contain all accounts 29 that are experts or very related to the topic of machine learning. The obtaining unit is configured to store the above mentioned data in the database 101.
31 [0044] The obtaining unit is further configured to receive information about which accounts 32 follow others along with a set of metadata appended by the social network, associated with the 33 accounts, for storage in the database. In an embodiment, this data may be represented by a 1 graph that may be stored in a MYSQL instance. It will be appreciated that another relational 2 database may be used as an alternative to MYSQL. The indexing unit is further configured to 3 index this data. In embodiments, an Apache Lucene index may be used. It will be appreciated 4 that another text search engine library may be used as an alternative to Apache Lucene. This data provides an expertise vector, or a set of all lists a given account is associated with. That 6 information is then directed to Lucene to populate the index of topics as associated with the 7 account. The index supports full Lucene query syntax including phrase queries and Boolean 8 logic. At the same time, the social graph provides related information about user interests. For 9 example, if a user follows someone with expertise in cooking, one may infer that the user has interest in cooking. Given all accounts followed by a given account, the union of both expertise 11 vectors may produce an interest vector.
12 [0045] The server is configured to un-shorten multiple URLs. Since typically URLs in 13 messages are shortened (using popular URL shortening sites like bit.ly or t.co) conducting 14 analysis on the shared domains to provide insight into the source of the content is challenging as each URL has to be un-shortened (possibly multiple times). Thus the server efficiently un-16 shortens multiple URLs. Utilizing asynchronous 10 this process may be conducted for tens of 17 thousands of URLs in parallel on a single thread, typically in a short time frame.
18 [0046] Given the receipt of some or a combination of the data mentioned above, the 19 processing unit is able to generate useful information and analysis such as a topic signature for a given user, an aggregate signature for a group of users and techniques to automatically 21 identify changes in the aggregate signatures over time for a given query.
22 [0047] Referring now to Fig. 2, a flowchart showing the process for generating a topic 23 signature is shown. At block 200, the obtaining unit obtains all or a subset of all lists on the 24 social network. As lists are obtained, at block 202 the tokenizing unit is configured to tokenize the titles of the lists and, at block 204, remove stop words and other frequently appearing 26 information as well as idioms that carry no information (e.g., if, friends, etc.) from the list names.
27 At block 206, a dictionary of similar words is then used to group together tokens that have the 28 same meaning. A suitable technique utilizes WordNet, a large lexical database of English.
29 Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. It will be appreciated that alternative lexical databases may 31 be used by the system for a similar purpose. At block 208, for each account, a vector of all 32 tokens in the titles of all the lists of which the user is a member is assembled. Each token is 33 accompanied by a number that expresses the occurrence count, namely, the processing unit 1 calculates the number of times each token (grouping) was identified in a list title that of which 2 the account is a member.
3 [0048] The vector may be referred to as the topic signature, assuming a total ordering on all 4 topics (tokens) and assigning a value of zero to the occurrence count for a topic if the account is not associated with that topic at all. Assuming each token to be a unique dimension in a multi-6 dimensional space, the occurrence counts are normalized to produce the unit topic signature 7 vector in Li space. Thus, this vector represents the weight of the account being associated with 8 a topic. The Li space is used and hence, the length of the expertise vector is normalized to 1 9 using the manhattan norm. The union of all these vectors will result in a multi-dimensional space with each unique token corresponding to a dimension.
11 [0049] In an exemplary scenario, consider a user @john that is member of three Lists 12 {toronto-dentist, dentists, music-toronto}. The set of tokens with occurrence counts for this user 13 is {dentist(2), toronto(2), music(1) After normalization, the unit topic signature vector becomes 14 topics(john) = aentist +151nusic + ."toronto. The vector above is of unit length in Li space, with non-zero values across three dimensions and zero across all others.
16 [0050] Considering two more users in the same scenario: @henry belonging to lists 17 {dentists, squash-london, music}, and @susan who is a member of lists {squash, music-london, 18 squash-london}. After considering all the 3 users, a 5-dimensional topic space is produced:
7 dentist music london squash toronto\

John ¨5 0 0 ¨5 henry ¨4 ¨4 ¨4 ¨4 0 Vusan 0 0 19 The above matrix is a compact form notation of individually writing the vectors as topics(john) = dentist +15ln'usic + Olondon + OS'quash +iforonto, and soon.
21 [0051] The process of computing topic signatures for each user is linear in the number of 22 users and the length of their topic signatures. In this example, the technique of extracting topic 23 signatures is applied on messages from 240 million users and 15 million lists. Apache Lucene 24 may be used for implementing the tokenizing unit and indexing unit to tokenize and index the lists.

1 [0052] The generated topic signature provide for fast response to a request for an expert list 2 in response to a topic query. The index allows query with full boolean syntax, and is used to 3 quickly return all users having certain topic associations. For example, in response to a query, 4 the query is tokenized and processed lexically in similar manner to lists are processed as described in reference to Fig. 2. Users may be ranked by the occurrence count in their 6 respective pre-normalized topic signature for the token grouping(s) corresponding to the querry, 7 and a particular number of the highly ranked users may be returned as experts. It will be 8 appreciated that obtaining from the database a list of the users related to the query can be 9 made relatively quickly due to the intelligent approach to indexing taking at the time of storage.
[0053] To describe how the processing unit generates the aggregate signature, the 11 following setting and notation will be used: Let the set of all users be denoted by Um, which has 12 cardinality M. Let u c Um denote a unique user, with unit normalized topic signature vector 13 topics (u). The vectors are derived from a multi-dimensional space S' with N dimensions. The 14 matrix representing signature vectors for all users will then have M x N
entries. We denote this matrix as MsE.
16 [0054] If si is a specific dimension in SN , then the signature vector may be represented as follows where wi represents length of the vector across dimension Si, topics =
w1.1 2' + w 2 +
18 === + WN.N. The graphing unit generates a social graph, where the social graph spanning all 19 users denoting follower relationships is represented by (Um ,E), where Um is the set of nodes and E is the set of edges. Each user represents a node. Edges are follower relationships, i.e., if 21 a user u follows another user v, then the directed edge from u to v will be part of the set of 22 edges E. Formally, u, v E Um , follows(u,v) = true (u, v) c E.Let q represent a keyword query 23 such as, "hurricane sandy" or "pepsi"; the query may permit more complex queries that include 24 boolean operators such as, "elections AND ("barack obama" OR "mitt romney")" as well. Let R
represent the set of message results after evaluating the query against the content of all raw 26 messages obtained by the obtaining unit from the social network and stored in the database. If 27 the search has a time restriction t denoting that only results within the time interval t are of 28 interest, the set of results is R. Each entry r E Rt is a message such that matches(q,r) = true 29 meaning that the query evaluates to true on message r and post -time (r) E t, namely that the message was posted within the time interval of interest t. Let At denote the set of all unique authors (users) in Rt, i.e., set of authors of all messages r e R. u E At 3r E Rt:author(r) =
32 u.

1 [0055] As MsE can potentially consists of a large number of entries, it is desirable to produce 2 a concise summary of MsE as aggregate signature. This can be done by first computing the 3 relevant rank of each user in the set Um using the social graph and network ranking algorithms 4 such as PageRank. Then, conditional probability can be utilized to aggregate the topics. This process is mathematically the following : Pr(s) = Pr(s I uj )* Pr(u1) for expertise s, and user u, 6 where Pr(s) denotes the probability of s, over the appropriate sample space.
7 [0056] The aggregate signature may be used for two main purposes, namely, a) to obtain a 8 concise view of the topics associated with the messages of interest (denoted by the query q 9 above). This is done by aggregating the expertise vector of all authors who have authored one of the messages of interest, and b) to rank those topics based on their potential for 11 dissemination within the social network. This is done by summing over the users and their 12 topics/expertise using conditional probability. The utility of such an aggregate signature may be 13 to gain insight on what topics a marketer may associate with q in order to increase the reach 14 and dissemination of messages related to q in the Twitter network, via sharing through re-posting messages from those accounts sending messages about q.
16 [0057] The processing unit may alternatively generate the aggregate signature of all 17 followers of a particular account (instead of author group for a query).
For example, a marketer 18 may be interested in understanding who is talking about the brand Pepsi on Twitter and the 19 aggregate of topics associated with that group. This information may be used to create better advertising content for this group, e.g., if many users are associated with travel, then a good 21 strategy could be to create marketing messages incorporating travel as a theme. That way one 22 aims to capture the attention of the group in multiple ways and increases engagement and 23 content sharing.
24 [0058] Thus, in order to maximize the spread of marketing content to a relevant group it is desirable to direct resources toward users who can spread the content (e.g., by re-tweeting or 26 resharing the message). Hence, when constructing the aggregate signature not every member 27 of At is considered with an equal weight. The aggregate signature aggrsig(q,t) (or 28 correspondingly aggrsig(At)) is computed by taking in account the ability of a user to spread a 29 message.
[0059] Given a query q, time interval t, and its associated set of authors At from result set Rt 31 the processing unit generates the aggregate signature aggrsig(q,t) taking into account the 32 potential reach of each author in At. Having retrieved all messages Rt with respect to the query 33 q of time interval t, the processing unit scans through all items in Rt to resolve the set of unique 1 users from Rt as A. The processing unit is operable to generate the topic signatures for each 2 user in At.
3 [0060] One strategy to produce aggregate signatures is to sum up the topic signatures 4 retrieved and normalize them to unit length. However, this method fails to capture the relative importance of each user in disseminating a message to their followers with respect to the query 6 q. Under this scheme all users are assumed equally important as far as the dissemination 7 potential is concerned, which may not actually reflect reality. For example, the set At may 8 contain several users with association in the topic of music but each with very few followers, and 9 few users with association in the topic of travel but each with many followers.
[0061] Thus, the processing unit generates an aggregate score which may be referred to as 11 "AGGR" herein. AGGR represents the relative ranking of u E A. By looking at the subgraph 12 induced by At on the original follower graph (Um ,E) only, a user u1 may have a substantial 13 number of followers in (Um ,E) but have very few followers who also belong to A. The number 14 of followers in At may be more important than the total number of followers across the entire social graph, as the aim is to find users who can disseminate the message to potentially largest 16 group of relevant users.
17 [0062] To capture these intuitions the processing unit models this scenario as a Hidden 18 Markov Model (HMM), with each user in 11 E At represented as a node in the hidden layer, and 19 each topic in their aggregate signatures' represented as a node in the output layer. For users u, V E At, if user u follows V, a directed edge is added in the Markov chain from u to v.
21 Transition from one node to another takes place with equal probability.
That is, if there are eu 22 edges out of node u, one of the edges is selected for transition with equal probability Since 23 the Markov chain may have disconnected components, with a small pre-specified probability a a 24 random jump takes place, and with probability 1 ¨ a one of the outgoing edges is selected.
[0063] Traversing the Markov chain, while at node it, having eu outgoing edges, the 26 probability of transition is computed as follows. If eu is zero, the next node after transition is 27 randomly picked from set A. Let 'At' be the cardinality of the set A. If eu is non zero, then the 28 next node will be:
a pickrandomv E At withprob lAtl-en 29 next (u) = I 1¨a pickrandomvlfollows(u, v) withprob ¨
eu 1 This completes the construction of the Markov chain and emission probability for the topics is 2 assigned at each node. The symbols being emitted from the HMM are dimensions of the topic 3 signature. For example, if the topic signature of a user u is topic(u)= w + w As=
1. -1 - 2.-2 =
4 -1 friusic + -1 quash, then one of music or squash is emitted with equal probability when at the node in the HMM associated with this particular user u. Since the topic signatures are of unit 6 length in LI. space, further normalizations are not needed to compute symbol emission 7 probabilities. For a topic signature, topic(u) = w1.1 + w2. 2 + = = = +
wN. N. The symbol si will be 8 emitted with probability wi. Since, Wo 14/1 === WN = 1, the sum of all probabilites will be 1.
9 [0064] Continuing from the example used for the creation of the topic signature above and assuming each of the three users follow each other, the HMM as displayed in Fig. 6 is 11 constructed. The hidden layer is constructed with three nodes, representing the three users.
12 Transition edges are added, each with probability 2-2, such that Pri!=i(uilui) = 0.5. As a result, 13 the steady state distribution is observed to be Pr(john) = Pr(henry) =
Pr(susan) = 1/3 for the 14 hidden layer. Marginalizing out user probability from Pr(topicsignature,user), the processing unit can compute the aggregate signature for the entire graph. For example, Pr(dentist) =

Pr(dentistljohn)Pr(john) + Pr(dentistlhenry)Pr(henry) = X X = -12-60, and, 17 Pr(toronto) = Pr(torontoljohn)Pr(john) = x = . Similarly, Pr(music) =
Pr(london) =
18 Pr(squash) = E60. As a final check, performed by the processing unit,Pr(dentist) + Pr(music) +
19 Pr(london) + Pr(squash) + Pr(toronto) = 1. The resulting aggregate signature, therefore 13 -= 13 ^ 2 isTO dentist + ¨13Ihusic + ¨ london +13 .C=-quash + ¨15toronto.

21 [0065] The HMM has now been defined by the processing unit with a set of nodes, 22 transition probabilities, and emission probabilities for symbols. The steady state probabilities for 23 this HMM will allow the processing unit to compute the aggregate signature across the set of all 24 users A. At steady state, assuming that the probability that a symbol si is seen is proh(si), then the aggregate signature will be, aggrsig(At) = prob(s1) 1+ prob(s2) 2 + = = =
+ prob(sN). N.
26 which is of unit length in L.
27 [0066] At steady state, assuming the probability that a symbol si is seen is prob(si), then 28 the aggregate signature will be, aggrsig(At) = prob(si)S'i + prob(s2) 2 + = == + prob(sN). N.
29 which is of unit length in L. To compute the aggregate signature given the steady state distribution from the Twitter follower graph, the processing unit uses the definition of conditional 31 probability. Observe that for topic s:

1 Pr(s) = Et, Pr(s,u) = Pr(slu)Pr(u) (1) 2 and since Pr(slu) (the topic probability of a user u) and Pr(u) (the steady state probability of 3 user it) are independent and known from the Markov chain and preprocessing, the processing 4 unit proceeds to solve the HMM first by the hidden user layers, then the emission (topic) layer.
[0067] Marketers invest significant effort to change brand perception and association. A
6 change in the audience of a brand could be organic over time, or it may be influenced by an 7 event. For example, numerous marketing efforts attempt to reinvent or reposition brands in new 8 target segments and change the way brands are perceived online or offline. An effort to make a 9 brand more fashionable or trendy may be successful if the people taking about the brand online associate themselves with fashion and/or fashion trends. Thus, such changes, if one is able to 11 identify them, may point to the success or failure of marketing efforts online. Identifying such 12 changes in the conversation around events may further identify parties relevant to a political or 13 academic subject and how insights evolve over time. In an exemplary scenario, the query 14 "hurricane sandy" is considered. The processing unit conducts the search for one day time intervals for a 92 day long period from 1 Oct 2012 to 31 Dec 2012.
16 [0068] The processing unit may proceed to generate results as to how aggrsig(q, 17 evolves over time for a long time range T consisting of D smaller time intervals, T =
18 ftl,t2,...tD}. The processing unit generates the aggregate signature for a given query q for 19 each of the time intervals as: ASM(q,T) =
taggrsig(q,t1),aggrsig(q,t2),...aggrsig(q,tD)}. The resulting matrix ASM(q,T) has N rows 21 and D columns. The rows will each correspond to a topic dimension from SN, and columns will 22 each correspond to a time interval from T. This matrix is referred to as the aggregate signature 23 matrix (ASM(q,T)) over time T for the query q.
24 [0069] Given an aggregate signature matrix ASM(q,T) = faggrsig(q,t1), ...aggrsig(q,tD)}, a pre specified k < D, and a function score that measures similarity of aggregate signatures, 26 namely, score(aggrsig(q,ti) aggrsig(q, ti)) E IV define a disjoint, continuous k partitioning 27 of [1,2, ...,D] as Pk: = ([61,e1],[b2,e2],...,[bk,ekll with b1 -= 1, ek = D, e1 E Z, ei bi and 28 vi <j, e1 = b1 ¨ 1 by solving for argminpk Ei score(aggrsig(q,tbi),...,aggrsig(q,te,)).
29 [0070] In embodiments, the processing unit may iterate over a few values of k and trace the value of the overall function score for each value of k. Points at which large discontinuities arise 31 are typically good candidates for k.

1 [0071] The processing unit selects k groups of continuous days across the 92 day period for 2 a pre-specified k. Each of these k date ranges will represent a distinct aspect of the event. For 3 example, if k was 3, resulting date ranges could be expected to represent the pre-hurricane 4 period, the period during the hurricane specifically as it passed over New York city, and the period post-hurricane.
6 [0072] Using the notations defined above, given the aggregate signature matrix ASM(q,T) 7 and specified k < D, the processing unit partitions T into k continuous and disjoint intervals. The 8 aim is to group similar time periods together, and this is formalized by defining a scoring 9 function capturing similarity that is minimized. Once the scoring function has been chosen, the problem reduces to that of identifying the optimal partitioning.
11 [0073] The processing unit generates two scoring functions. The first function minimizes the 12 total error represented as the sum of root mean square distance between the average 13 aggregate signature of a collection of signatures and the aggregate signatures in the collection.
14 The first measure, given a collection of aggregate signatures ASM = taggrsig(q,t1),aggrsig(q,t2),...aggrsig(q,tD)J, access the distance using the root 16 mean square error : score = EiII aggrsig(q, ti) (E, ag grsig (al 112. The RMSE score 17 increases as the distance between aggregate signatures increases, i.e., when the topics across 18 [N., t2, , tD) are different, and decreases when the topics are the same. Therefore, with this 19 score function intervals of time are singled out where the aggregate signatures are very similar to each other.
21 [0074] The second discretizes ASM(q,T) into an indicator matrix of 0 and 1, and measures 22 similarity as the hamming distance across neighbouring aggregate signatures. A second error 23 measure is proposed that involves the discretization of aggregate signatures. The value in each 24 dimension of aggrsig(q,t,), is between 0 and 1. The aggregate signature can be discretized by assigning each dimension the value of 0 or 1. There are many ways to discretize the signature;
26 a statistically sound way is to assess the mean of all the values and assign a value as 1 if it is 27 above some standard deviation of the mean and zero otherwise. Denote the discretized 28 aggrsig(q,ti) as aggrs7g(q,t1) and similarly, the discretized ASM(q,T) matrix as ASM(q,T).
29 With t1, t2, ...,tD ¨ 1 = T1 and t2, ...,tD = T2, rewritten in the compact form: score =11 ASM(q,T2) ¨ ASM(q,Ti) 'IF using the Frobenius norm. II A 11,= Ei lAiiI2 where A is a 31 matrix.

1 [0075] In further embodiments, given a function score that computes a distance between 2 aggregate signatures of ASM(q,T), the following recurrence may be generated by the 3 processing unit to measure the similarity of aggregate signatures. Bi,k is defined to be the best k 4 partition score of the first] columns of ASM(q,T) using the given score function:
Bj,k = score(exp(q,ti)=== exp(q,tj)). (2) 6 [0076] The best K partition of ASM(q,T) for all K < D may be computed using Equation 2.
7 Notice that it would take 0 (DD)score steps to solve the best K partition of ASM(q,T) for all K in 8 a brute force way. Since there are D ¨ 1K 1 ways to produce k disjoint continuous intervals for 9 [1,2, D] , the processing unit generates 0(Di) = 0(DD).
[0077] Looking at the recurrence Equation 2 the processing unit may pre compute 11 score(ASM(q,Ti)=== ASM(q,Tj)) as it is independent from the recurrence.
12 argLI1ci Bi-1,k I SCOi will take 0(D) steps. When solving for the best K partition 13 of ASM(q,T) for all K < D using dynamic programming, the runtime is dramatically reduced to 14 0(D3). The space requirement can also be optimized by noting that in Equation 2, B j,k depends only on the values from the previous iteration. Therefore, after an iteration is complete, the 16 processing unit may discard the optimal interval partitioning and the optimal scoring from the 17 last iteration, bringing the space requirement excluding the precomputed scores, down to 0(D).
18 [0078] Continuing with the example above, for each day, the aggregate signature vector is 19 computed based on everyone who is talking about the hurricane on that day i.e. who have posted a message containing the words hurricane and sandy). As time progresses, a natural 21 evolution in the matrix ASM(q,T) for this query is expected. Hurricane Sandy first affected 22 Caribbean and Bermuda on Oct 22nd, and Twitter users actively participating in the discussion 23 topically associated with these regions. As days progressed, more American and subsequently 24 global audience started discussing the hurricane. As the hurricane traveled from the southeast of the US (Florida, Virginia, Carolinas), to the mid-Atlantic region (Washington DC, Maryland, 26 New Jersey), and finally reached New York City, the group of users talking about the hurricane 27 changed. In November, post-hurricane, the discussion shifted further to rebuilding efforts and 28 those discussing were associated with politics. Intuitively it is evident that this 92 day time 29 period can be partitioned into a discrete time periods which capture the evolution of this story namely tracing the geographical path of the storm (by observing the topics associated with 31 those taking about it) and then capturing the political discussion centered in re-building efforts.

1 [0079] In embodiments, the processing unit may be configured to perform random sampling 2 of data to speed up this computation without sacrificing quality. This effectively offers a good 3 tradeoff between accuracy and speed.
4 [0080] The system may be configured to process a subset of search results to reduce processing time for a query. Run time may be improved by using random sampling on the set 6 A. Instead of constructing the HMM with all users in At, only a fraction, such as, for 7 example,f 1.0 may be randomly selected. Referring now to Fig. 11, processing time is shown 8 as the fraction f varies.
9 [0081] Referring now to Fig. 7, particular exemplary queries indicate that the use of 30% or more of the search results provides better than 90% (using cosine similarity) accuracy as 11 compared to all results.
12 [0082] As the fraction f is reduced, the number of topics with non-zero weights may also 13 decrease, as depicted in Fig. 8. Particular exemplary queries indicate that the use of 30% of the 14 search results provide that the resulting AS has only half the number of topics with non-zero weights compared to AS', however the topics not present are only those with small weight in 16 AS'. Thus random sampling may speed up the AGGR processing considerably while producing 17 similar results to AS'. Reduction in the number of topics with non-zero weights in AS further 18 helps in reducing the run time and memory usage. While the specific results may not be 19 representative of all queries, they indicate that the use of subsets of results can be made without substantial loss of accuracy, in cases.
21 [0083] Referring now to Fig. 9, the running time of k as a function of the number of days of 22 messages is considered. Particular exemplary queries indicate that given messages for one 23 year, the run time, may be less than an hour. Further exemplary queries show that overall 24 memory consumption scales linearly with the amount of data, as can be seen in Fig. 10.
[0084] Other applications may become apparent.
26 [0085] Although the invention has been described with reference to certain specific 27 embodiments, various modifications thereof will be apparent to those skilled in the art without 28 departing from the spirit and scope of the invention as outlined in the claims appended 29 hereto.The entire disclosures of all references recited above are incorporated herein by reference.

Claims

1. A system for identifying one or more experts of a topic on a social network, the system comprising a server in communication over a network with a social network, the server comprising:
(a) a user interface unit configured to obtain a topical query representing the topic;
(b) an obtaining unit configured to obtain social network data from the social network, the social network data comprising one or more topical lists and a social graph representing user relationships in the social network, each topical list identifying one or more users;
(c) a tokenizing unit configured to:
(i) tokenize titles of the topical lists and lexically group the tokens into token groupings; and (ii) tokenize the topical query to determine at least one token grouping to which the topical query corresponds; and (d) a processing unit configured to:
(i) generate, for each user, a topic signature vector comprising topic signature vector elements corresponding to the token groupings for which the user is identified in the corresponding topical lists;
(ii) generate for each topic signature vector element an occurrence count representing the number of times each of the token groupings is identified for the user;
(iii) rank the users by their occurrence counts for the at least one token groupings corresponding to the topical query; and (iv) return a selected set of the ranked users as experts in the topic.

2. The system of claim 1, wherein the system is further configured to identify one or more related topics of interest to users interested in the topical query, wherein the processing unit is further configured to: determine, for each expert of the topical query, other topics for which the expert is identified; and generate a ranked list of the determined other topics using a scoring function.

3. The system of claim 1, wherein the system is further configured to identify one or more related topics of interest to users interested in the topical query, wherein:
(a) the obtaining unit is further configured to obtain social network messages in the social network data;
(b) the tokenizing unit is configured to:
(i) tokenize the social network messages and lexically group the tokens into the token groupings; and (c) the processing unit is configured to:
(i) determine a subset of the social network messages containing the topical query;
(ii) generate an aggregate signature comprising summing the topic signature vectors of the experts identified for the topical query;
(iii) determine other topics having a high occurrence count in the aggregate signature; and (iv) return a selected set of the other topics as secondary topics.

4. The system of claim 3, wherein generating the aggregate signature further comprises determining the number of followers of the experts.

5. The system of claim 3, wherein generating the aggregate signature further comprises analyzing a social graph to determine reach.

6. The system of claim 2, wherein the processing unit is further configured to determine conversations the identified experts participate in and share.

7. The system of claim 1, wherein the processing unit is further configured to determine, for a given user, other users having similar interests.

8. The system of claim 3, wherein the processing unit is further configured to determine changes in conversations around events by determining changes in aggregate signatures over time.

9. The system of claim 8, wherein the changes in conversations enables an identification of users relevant to the conversations and insights into how conversations evolve over time.

10. The system of claim 8, wherein the processing unit is further configured to identify times at which the conversations change significantly.

11. A computer network implemented method for identifying one or more experts of a topic on a social network, the method comprising:
(a) obtaining a topical query representing the topic;
(b) obtaining social network data from the social network, the social network data comprising one or more topical lists and a social graph representing user relationships in the social network, each topical list identifying one or more users;
(c) tokenizing titles of the topical lists and lexically grouping the tokens into token groupings;
(d) tokenizing the topical query to determine at least one token grouping to which the topical query corresponds;
(e) generating, by a processing unit comprising one or more processors, for each user, a topic signature vector comprising topic signature vector elements corresponding to the token groupings for which the user is identified in the corresponding topical lists;
(f) generating, by the processing unit, for each topic signature vector element an occurrence count representing the number of times each of the token groupings is identified for the user;

(g) ranking the users by their occurrence counts for the at least one token groupings corresponding to the topical query; and (h) returning a selected set of the ranked users as experts in the topic.

12. The method of claim 11, further comprising identifying one or more related topics of interest to users interested in the topical query, by determining, for each expert of the topical query, other topics for which the expert is identified; and generate a ranked list of the determined other topics using a scoring function.

13. The method of claim 11, further comprising identifying one or more related topics of interest to users interested in the topical query, by:
(a) obtaining social network messages in the social network data;
(b) tokenizing the social network messages and lexically group the tokens into the token groupings;
(c) determining a subset of the social network messages containing the topical query;
(d) generating an aggregate signature comprising summing the topic signature vectors of the experts identified for the topical query;
(e) determining other topics having a high occurrence count in the aggregate signature; and (f) returning a selected set of the other topics as secondary topics.

14. The method of claim 13, wherein generating the aggregate signature further comprises determining the number of followers of the experts.

15. The method of claim 13, wherein generating the aggregate signature further comprises analyzing a social graph to determine reach.

16. The method of claim 12, further comprising determining conversations the identified experts participate in and share.

17. The method of claim 11, further comprising determining, for a given user, other users having similar interests.

18. The method of claim 13, further comprising determining changes in conversations around events by determining changes in aggregate signatures over time.

19. The method of claim 18, wherein the changes in conversations enables an identification of users relevant to the conversations and insights into how conversations evolve over time.

20. The method of claim 18, further comprising identifying times at which the conversations change significantly.