WO2015065290A1 - Brand monitoring via microblogs - Google Patents


Info

Publication number
WO2015065290A1
Authority
WO
WIPO (PCT)
Prior art keywords
topic
organization
data
information
determined
Prior art date
Application number
PCT/SG2014/000508
Other languages
English (en)
Inventor
Tat-Seng CHUA
Hadi AMIRIEBRAHIMABADI
Yan Chen
Anqi CUI
Original Assignee
National University Of Singapore
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University Of Singapore
Publication of WO2015065290A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Definitions

  • Embodiments relate generally to information determination devices and information determination methods.
  • an information determination device may be provided.
  • the information determination device may include: an account crawler configured to determine data from at least one pre-determined user account; and an information determiner configured to determine information related to a pre-determined organization based on the data.
  • an information determination method may be provided.
  • the information determination method may include: determining data from at least one pre-determined user account; and determining information related to a pre-determined organization based on the data.
  • FIG. 1A and FIG. 1B show information determination devices in accordance with various embodiments
  • FIG. 1C shows a flow diagram illustrating an information determination method according to various embodiments
  • FIG. 2 shows an illustration of websites
  • FIG. 3 shows an illustration of a power law correlation
  • FIG. 4 shows an illustration of an architecture according to various embodiments
  • FIG. 5 shows an illustration of learning evolving and emerging topics
  • FIG. 6 shows an illustration of a distribution of relevant tweets
  • FIG. 7, FIG. 8, and FIG. 9 show illustrations of an effect of learning parameters
  • FIG. 10 shows an illustration of an effect of the temporal continuity constraint.
  • the information determination device as described in this description may include a memory which is for example used in the processing carried out in the information determination device.
  • a memory used in the embodiments may be a volatile memory, for example a DRAM (Dynamic Random Access Memory) or a nonvolatile memory, for example a PROM (Programmable Read Only Memory), an EPROM (Erasable PROM), EEPROM (Electrically Erasable PROM), or a flash memory, e.g., a floating gate memory, a charge trapping memory, an MRAM (Magnetoresistive Random Access Memory) or a PCRAM (Phase Change Random Access Memory).
  • a “circuit” may be understood as any kind of a logic implementing entity, which may be special purpose circuitry or a processor executing software stored in a memory, firmware, or any combination thereof.
  • a “circuit” may be a hard-wired logic circuit or a programmable logic circuit such as a programmable processor, e.g. a microprocessor (e.g. a Complex Instruction Set Computer (CISC) processor or a Reduced Instruction Set Computer (RISC) processor).
  • a “circuit” may also be a processor executing software, e.g. any kind of computer program, e.g. a computer program using a virtual machine code such as e.g. Java. Any other kind of implementation of the respective functions which will be described in more detail below may also be understood as a “circuit” in accordance with an alternative embodiment.
  • devices and methods may be provided for online discovery of events and topics for organizations from social media.
  • a unified framework may be provided to address two issues that have not been tackled to date: (a) crawling more representative distribution of relevant contents and (b) discriminating relevant from irrelevant content for the organization.
  • the current organization or brand monitoring systems use a fixed set of known keywords to crawl micro-posts from social media. This popular strategy results in (a) many missing relevant micro-posts, and (b) many irrelevant micro-posts.
  • the first issue is due to the dynamic nature of the social media contents, while the latter issue is due to the polysemy problem in which the acronyms of organizations are often shared by many entities. For example, NUS is shared between National University of Singapore, National Union of Students and Nu-SkinTM company.
  • a unified framework may be provided to address the above issues.
  • This framework may utilize multiple aspects of organizations including fixed keywords, known accounts and automatically identified key-users to crawl more relevant data about organizations from social media. Moreover, it may effectively employ content and user information to address the polysemy problem for organizations. Given the automatically identified relevant micro-posts for an organization, an adaptation of online sparse coding algorithms to efficiently learn the topics through time may be provided. Comprehensive experiments show promising results for three different organizations using streaming data obtained from Twitter.
  • devices and methods for discovering topics related to a given organization by automatic identification of the relevant micro-posts (e.g. tweets) and users through time in the context of microblogs may be provided.
  • devices and methods for social media analytics may be provided.
  • devices and methods for mining the sense of organizations in social media may be provided.
  • Devices and methods according to various embodiments may be referred to as "OrgSense".
  • Live tweet streams have been previously used for topic mining and event detection in general contexts. Also, models of burst and hot topic detection have been developed, from automation to temporal patterns.
  • Hashtags are keywords attached to the # symbol to categorize tweets based on their context
  • although keyword-based approaches work well on mining tweets about specific topics, they are restricted to a set of keywords that are maintained manually. Fixed keywords fail to discover a large fraction of relevant information simply due to missing newly-introduced terms within topics and micro-posts without known keywords. Furthermore, fixed keywords may represent several different entities and result in many irrelevant micro-posts.
  • Mining evolving and emerging topics in the social media content has become a hot research topic recently. An approach may be used to identify emergent keywords and to utilize them to find emerging topics. A term may be defined as emergent if it frequently occurs in the current time but not in the previous times.
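The definition of an emergent term given above can be illustrated with a minimal Python sketch. The function name `emergent_terms` and the thresholds `min_current` and `max_previous` are hypothetical choices made for illustration only; the text does not specify how "frequently" is measured.

```python
from collections import Counter


def emergent_terms(current_posts, previous_posts, min_current=2, max_previous=0):
    """Return terms that are frequent in the current window but not in earlier ones.

    min_current and max_previous are illustrative thresholds (assumptions).
    """
    current = Counter(w for p in current_posts for w in p.lower().split())
    previous = Counter(w for p in previous_posts for w in p.lower().split())
    return sorted(
        term
        for term, count in current.items()
        if count >= min_current and previous[term] <= max_previous
    )


previous = ["exam results released", "library opens late"]
current = ["campus flood near library", "flood closes campus road", "library flood"]
print(emergent_terms(current, previous))
```

In this toy input, "library" is frequent in the current window but was already seen before, so it is not reported as emergent.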
  • Temporal-LDA (a temporal variant of Latent Dirichlet Allocation)
  • the evolution of topics may be tracked through time. It may be shown that a sparse coding algorithm with the non-negativity constraint is effective for topic modeling in the social media context.
  • a continuity constraint may be introduced. Transient crowds, i.e. short-lived collections of people who directly communicate with each other through social messages such as replies and mentions on Twitter, may be mined.
  • users may be part of the same community as long as they share an interest in the same topic (such communities may be referred to as interest communities).
  • Commonly used algorithms may not be effective to mine such interest communities as there may not be any direct conversation between the users in these communities.
  • FIG. 1A shows an information determination device 100 according to various embodiments.
  • the information determination device 100 may include an account crawler 102 configured to determine data from at least one pre-determined user account.
  • the information determination device 100 may further include an information determiner 104 configured to determine information related to a pre-determined organization based on the data.
  • the account crawler 102 and the information determiner 104 may be connected via a connection 106 (or a plurality of separate connections), for example an electrical or optical connection, for example any kind of cable or bus.
  • a crawler or a Web crawler, as referred to herein, may be software that downloads data automatically from a network, for example from the Internet.
  • a crawler may systematically visit web pages (for example with a given configuration and strategy designed by programmers) and download the data of the web pages. The crawler may perform these repetitive tasks at a much higher rate than would be possible manually.
  • the streaming API (application programming interface) of Twitter may be used to crawl tweets.
  • a topic miner may be software that implements a method to discover human-understandable topics from the texts.
  • an information determination device may be provided which determines information related to a predetermined organization based on data which are determined from one or more predetermined user accounts.
  • the account crawler 102 may include or may be or may be included in a known account crawler configured to determine the data from at least one account for the organization.
  • the account crawler 102 may include or may be or may be included in a key-user crawler configured to determine the data from at least one account of a key user.
  • FIG. 1B shows an information determination device 108 according to various embodiments.
  • the information determination device 108 may, similar to the information determination device 100 of FIG. 1A, include an account crawler 102 configured to determine data from at least one pre-determined user account.
  • the information determination device 108 may, similar to the information determination device 100 of FIG. 1A, further include an information determiner 104 configured to determine information related to a pre-determined organization based on the data.
  • the information determination device 108 may further include a keyword crawler 110, as will be described in more detail below.
  • the information determination device 108 may further include a user friend list crawler 112, as will be described in more detail below.
  • the information determination device 108 may further include a classifier 114, as will be described in more detail below.
  • the information determination device 108 may further include a topic miner 116, as will be described in more detail below.
  • the information determination device 108 may further include an optimization problem solver 118, as will be described in more detail below.
  • the information determination device 108 may further include a trivial topic purging circuit 120, as will be described in more detail below.
  • the account crawler 102, the information determiner 104, the keyword crawler 110, the user friend list crawler 112, the classifier 114, the topic miner 116, the optimization problem solver 118, and the trivial topic purging circuit 120 may be connected via a connection 122 (or a plurality of separate connections), for example an electrical or optical connection, for example any kind of cable or bus.
  • the keyword crawler 110 may be configured to determine further data based on at least one pre-determined keyword.
  • the information determiner 104 may further be configured to determine the information further based on the further data.
  • the keyword crawler 110 may include or may be or may be included in a fixed keyword crawler configured to determine the further data based on at least one fixed keyword.
  • the keyword crawler 110 may include or may be or may be included in a dynamic keyword crawler configured to determine the further data based on at least one dynamic keyword. At least one dynamic keyword may be changed based on processing of the information determination device 108.
  • the user friend list crawler 112 may be configured to determine a user graph of users in a social relationship with at least one of the organization or each other.
  • the account crawler 102 may further be configured to determine the at least one pre-determined user account based on the user graph.
  • the classifier 114 may be configured to classify the data into data relevant to the pre-determined organization and data irrelevant to the pre-determined organization.
  • the information determiner 104 may further be configured to determine the information based on the data relevant to the pre-determined organization.
  • the topic miner 116 may be configured to detect whether the determined information related to the pre-determined organization is related to an evolving topic or to an emerging topic about the pre-determined organization.
  • the dynamic keyword crawler may further be configured to determine the at least one dynamic keyword based on the information of the evolving topic and based on the information of the emerging topic.
  • the account crawler 102 may further be configured to determine the at least one pre-determined user account based on the data relevant to the pre-determined organization.
  • the classifier 114 may be configured to classify the data based on learning.
  • the classifier 114 may be configured to classify the data based on a support vector machine.
  • the topic miner 116 may be configured to detect whether the determined information related to the pre-determined organization is related to the evolving topic or to the emerging topic based on learning.
  • the topic miner 116 may be configured to detect whether the determined information related to the pre-determined organization is related to an evolving topic or to an emerging topic based on solving an optimization problem.
  • the optimization problem may include a temporal continuity constraint.
  • the optimization problem may include a sparse matching constraint.
  • the optimization problem solver 118 may be configured to solve the optimization problem based on a least angle regression.
  • the trivial topic purging circuit 120 may be configured to remove old topics from evolving topics and emerging topics.
  • FIG. 1C shows a flow diagram 124 illustrating an information determination method.
  • data may be determined from at least one pre-determined user account.
  • information related to a pre-determined organization may be determined based on the data.
  • the method may further include determining the data from at least one account for the organization.
  • the method may further include determining the data from at least one account of a key user.
  • the method may further include: determining further data based on at least one pre-determined keyword; and determining the information further based on the further data.
  • the method may further include determining the further data based on at least one fixed keyword.
  • the method may further include determining the further data based on at least one dynamic keyword, wherein the at least one dynamic keyword is changed based on processing of the information determination method.
  • the method may further include determining a user graph of users in a social relationship with at least one of the organization or each other.
  • the method may further include determining the at least one pre-determined user account based on the user graph.
  • the method may further include classifying the data into data relevant to the pre-determined organization and data irrelevant to the pre-determined organization.
  • the method may further include determining the information based on the data relevant to the pre-determined organization.
  • the method may further include detecting whether the determined information related to the pre-determined organization is related to an evolving topic or to an emerging topic.
  • the method may further include: detecting whether the determined information related to the pre-determined organization is related to an evolving topic or to an emerging topic; and determining the at least one dynamic keyword based on the information of the evolving topic and based on the information of the emerging topic.
  • the method may further include determining the at least one pre-determined user account based on the data relevant to the pre-determined organization.
  • the method may further include classifying the data based on learning.
  • the method may further include classifying the data based on a support vector machine.
  • the method may further include detecting whether the determined information related to the pre-determined organization is related to the evolving topic or to the emerging topic based on learning.
  • the method may further include detecting whether the determined information related to the pre-determined organization is related to an evolving topic or to an emerging topic based on solving an optimization problem.
  • the optimization problem may include a temporal continuity constraint.
  • the optimization problem may include a sparse matching constraint.
  • the method may further include solving the optimization problem based on a least angle regression.
  • the method may further include removing old topics from evolving topics and emerging topics.
  • FIG. 2 shows an illustration 200 of websites, for example the Optus Online Department on Twitter (see the Optus account on Twitter at https://twitter.com/Optus). Optus is the second largest telecommunications company in Australia.
  • FIG. 2 shows the verified Twitter account of the Optus Telecommunication Company. The biography of this account and its activity level indicate that user-centric businesses are spending substantial resources to hear the voice of their customers. In fact, it may be invaluable for such organizations to keep track of their live feedback to discover actionable insights from social media and provide better (personalized) services to their users. According to various embodiments, sophisticated methods and devices may be provided to discover such topics about a given organization from social media contents.
  • a first key challenge may be effective data harvesting:
  • the first challenge may be about effective crawling of a live and representative distribution of data about organizations.
  • Most current crawling methodologies rely on a fixed list of keywords (a few previously-known keywords) such as the name of the organization to crawl data.
  • Such methodologies cannot cover all the relevant micro-posts and consequently topics about the organization.
  • the user community of the target organization may be automatically identified and monitored.
  • the rationale of this approach may be based on the power law correlation between the number of users and the number of relevant tweets for organizations.
  • FIG. 3 shows an illustration 300 of the power law correlation between the number of users and the number of relevant tweets for three organizations, namely NUS (in plot 302), DBS (in plot 304), and StarHub (in plot 306).
  • the statistics are obtained from 1-year tweets posted for NUS, and 6-month tweets posted for DBS and StarHub organizations.
  • FIG. 3 shows that a small number of users of an organization often produce the major portion of relevant content about the organization.
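The observation about FIG. 3 can be checked on any set of relevant tweets by measuring how much of the content comes from the most prolific authors. The following is a minimal sketch on invented data; `top_user_share` is a hypothetical helper name and is not part of the described framework.

```python
from collections import Counter


def top_user_share(authors, k):
    """Fraction of posts written by the k most prolific authors."""
    if not authors:
        return 0.0
    counts = Counter(authors)
    top = sum(count for _, count in counts.most_common(k))
    return top / len(authors)


# One author id per relevant tweet (toy data, not from the patent's experiments).
authors = ["u1"] * 6 + ["u2"] * 2 + ["u3", "u4"]
print(top_user_share(authors, k=1))
print(top_user_share(authors, k=2))
```

A share close to 1 for a small k would be consistent with the heavy concentration described above, which is the stated rationale for monitoring a small set of users.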
  • data may be crawled based on multiple aspects of organizations: (a) known accounts, (b) key-users, and (c) fixed keywords of the organization.
  • the known accounts may be a few manually identified official accounts created on social media portals that broadcast news and announcements about the organization, while key-users may be a dynamic list of active and influential users of the target organization that should be automatically identified (as will be described in more detail below).
  • the above sources collectively elicit more relevant data for organizations as compared to the fixed keywords used by the current crawling methodologies.
  • the second key challenge may be micro-post disambiguation:
  • the second challenge is about discriminating relevant from irrelevant micro-posts with respect to the target organization as data streams in. This is a challenging task because of the polysemy problem in which the acronyms of organizations are often shared by many entities in social media. Current systems simply return many irrelevant micro-posts as they do not disambiguate micro-posts for organizations that share the same acronym. It is to be noted that users often use the acronym forms instead of the complete names of the organizations in the social media context mainly due to the length limit imposed by social media portals.
  • the context of the target organization defined by the current relevant content (keywords and micro-posts) and the user community of the organizations may be utilized.
  • a highly accurate classifier may be provided to predict the relevance of each incoming micro-post to the target organization based on its context information.
  • the third key challenge may be topic discovery and monitoring:
  • the third challenge is about online clustering of relevant streaming data into a coherent set of topics. This is challenging because, with streaming data, new topics can be introduced and old ones can vanish at any point of time.
  • the stream of relevant micro-posts may be clustered into emerging and evolving topics.
  • the emerging topics may be the new topics that emerge and potentially become major in a short period of time, while the evolving ones may be those that have been detected previously and are smoothly evolving through time.
  • a novel online sparse coding approach with temporal continuity and sparse matching constraints may be provided.
  • the approach according to various embodiments may be linear with respect to the number of input micro-posts.
  • a simple purging mechanism may be provided to detect the inactive topics to further improve the performance of topic modeling.
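The text does not state how inactive topics are detected. As a minimal sketch, one could assume that a topic is inactive once no micro-post has been matched to it for a given number of time steps; `purge_inactive`, `last_active` and `max_idle` are hypothetical names introduced only for this illustration.

```python
def purge_inactive(last_active, now, max_idle):
    """Keep only topics matched by a micro-post within the last max_idle time steps.

    ASSUMPTION: idleness as the purge criterion; the source only says "inactive".
    """
    return {topic: t for topic, t in last_active.items() if now - t <= max_idle}


# Maps a topic label to the last time step at which a micro-post matched it.
last_active = {"exam results": 3, "campus flood": 9, "new cafeteria": 10}
print(purge_inactive(last_active, now=10, max_idle=2))
```

Dropping idle topics keeps the set of tracked topics small, which is in line with the stated goal of improving topic modeling performance.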
  • NUS (National University of Singapore)
  • DBS (Development Bank of Singapore)
  • StarHub (StarHub company)
  • the first two organizations are ambiguous (in which NUS is shared between National University of Singapore, National Union of Students, and NU Skin company, and DBS is shared between several organizations like Development Bank of Singapore and Dublin Business School etc.), while the third organization (StarHub) is not ambiguous.
  • a framework may be provided which effectively addresses the data harvesting problem for organizations. It may utilize multiple aspects of organizations to obtain more relevant data about them from social media.
  • a framework may be provided which effectively resolves the polysemy issue in social media for organizations.
  • a framework may be provided which provides a novel adaptation of online sparse coding algorithms to mine the emerging and evolving topics for organizations.
  • FIG. 4 shows an illustration 400 of an architecture according to various embodiments for mining the sense of organizations from social media.
  • a fixed keyword crawler 402, a known account crawler 404, and an org key-user crawler 406 may be provided, as will be described in more detail below.
  • the framework may utilize the several crawlers to obtain potentially relevant data about the organization from social media.
  • the resultant data is given to a classifier 410 to make a real-time judgment about their relevance to the target organization.
  • the classifier may make use of the context of the organization (both content-level information 408 and user-level information 412) provided by the keyword miner 414 and user miner 418 (using a user graph 426 and a friend list crawler 428) components respectively.
  • the relevant data may then be stored in the relevant tweet repository 416.
  • the topic miner component 424 may extract the current emerging topics 422 and evolving topics 420 about the organization using the resultant relevant data.
  • Brand monitoring systems may make use of a few manually selected fixed keywords to crawl data for organizations. Examples of fixed known keywords for a given organization are the name of the target organization or its products, the acronym of the organization etc.
  • the fixed keyword crawler 402 may crawl the micro-posts that contain the fixed keywords.
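The matching step of a fixed keyword crawler can be sketched as a simple whole-word filter. This is an illustration under assumptions only: `matches_fixed_keywords` is a hypothetical helper, it handles single-word keywords, and it operates on posts already in memory rather than on a live stream.

```python
import re


def matches_fixed_keywords(post, keywords):
    """True if the post contains any fixed keyword as a whole word (case-insensitive)."""
    tokens = set(re.findall(r"[a-z0-9]+", post.lower()))
    return any(keyword.lower() in tokens for keyword in keywords)


posts = [
    "NUS releases exam timetable",
    "NUS conference: National Union of Students votes on fees",
    "Great lunch today",
]
hits = [p for p in posts if matches_fixed_keywords(p, ["NUS"])]
print(hits)
```

Both the first and the second post match the acronym although only one of them concerns the university, which is the polysemy problem that the classifier described in this document is meant to resolve.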
  • the known account crawler 404 will be further described. Similar to fixed keywords, a few known accounts for the target organization (such as the Optus account in FIG. 2) may be manually identified. These may be official accounts of the target organization that act as informers and usually post relevant micro-posts about their organization. These accounts may be given to the known account crawler 404 to be observed.
  • the org key-user crawler 406 will be further described.
  • the org (organization) key-user crawler 406 may be provided with a dynamic list of key-users to be observed.
  • a definition for key-users according to various embodiments will be provided below.
  • the user friend crawler 428 (in other words: friend list crawler 428) will be further described.
  • the user friend list crawler 428 may be used to construct the user graph 426 of the target organization by crawling the social relationships between users who have posted relevant data about the organization. This user graph 426 may evolve over time as new users are identified.
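The construction of the user graph can be sketched as follows, assuming the friend lists have already been crawled into a dictionary. The names `build_user_graph`, `friend_lists` and `relevant_authors` are hypothetical and the directed adjacency-list representation is an assumption.

```python
def build_user_graph(friend_lists, relevant_authors):
    """Directed graph user -> friends, restricted to authors of relevant posts."""
    nodes = set(relevant_authors)
    return {
        user: sorted(f for f in friend_lists.get(user, []) if f in nodes and f != user)
        for user in sorted(nodes)
    }


friend_lists = {"alice": ["bob", "carol", "zed"], "bob": ["alice"], "carol": []}
graph = build_user_graph(friend_lists, ["alice", "bob", "carol"])
print(graph)
```

Re-running the function whenever a new author of relevant content is identified gives a graph that evolves over time, as described above.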
  • the keyword miner 414 may utilize an active learning approach to extract temporally-relevant keywords for the organization from the recently seen relevant data. These keywords may be considered as dynamic keywords at each point of time and used by the classification component to determine the content-based relevance of the incoming micro-posts.
  • the user miner 418 may identify the user community and the key-users of the organization so that such users may be monitored in order to obtain more relevant data about the organization.
  • the user miner 418 may utilize the user graph 426 and user activity information to rank the users and find key-users of the organization (as will be described in more detail below).
  • the input data obtained by different crawlers may be a mix of relevant and irrelevant data.
  • key-users may also send micro-posts about other subjects like their various life activities.
  • the classification component 410 (in other words: the classifier 410) may utilize the context information to label the input data as relevant or irrelevant to the target organization.
  • the topic miner 424 will be described.
  • the topic miner component 424 may utilize the relevant tweets to detect and keep track of topics related to the target organization.
  • an adaptation of online sparse coding algorithms may be provided to learn the topics in an efficient way.
  • mining keywords and organization users will be described and the approach according to various embodiments for mining organization context defined by its content and user community will be described.
  • Dynamic keywords may be those keywords that represent the current discussions about the target organization at each point of time. To identify such keywords, suppose we have two sets of foreground ($S^t_{for}$) and background ($S^t_{bak}$) tweets at each point of time t. Let $S^t_{for}$ include the recently-seen relevant tweets posted in a short time window of length T, i.e. [t-T, t], while $S^t_{bak}$ includes the irrelevant tweets identified in the same time window, [t-T, t]. In addition, let $W^t$ be the vocabulary set obtained from $S^t_{for}$. We define the dynamic keywords as a subset of $W^t$ words that best represent the current relevant discussions about the organization. Our aim is thus to extract such keywords from $W^t$.
  • Equation (1) assigns higher weights to the terms that frequently occur in $S^t_{for}$, but rarely occur in $S^t_{bak}$.
  • Equation (1) only takes into account the words $w_i$ with $f_i > b_i$ and assigns zero weight to those with $f_i \le b_i$.
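Equation (1) itself does not appear in this text, so the following Python sketch only mirrors the two properties stated above: terms that are frequent in the foreground and rare in the background receive higher weights, and terms that are not more frequent in the foreground receive zero. Reading f and b as relative term frequencies in the foreground and background sets, and using the difference f - b as the weight, are assumptions; `dynamic_keyword_weights` is a hypothetical name.

```python
from collections import Counter


def dynamic_keyword_weights(foreground, background):
    """Weight each foreground term; zero unless it is more frequent than in the background.

    ASSUMPTION: the weight f - b is a stand-in for Equation (1), which is not shown.
    """
    fg = Counter(w for post in foreground for w in post.lower().split())
    bg = Counter(w for post in background for w in post.lower().split())
    n_fg = sum(fg.values()) or 1
    n_bg = sum(bg.values()) or 1
    weights = {}
    for term, count in fg.items():
        f = count / n_fg      # relative frequency in the foreground tweets
        b = bg[term] / n_bg   # relative frequency in the background tweets
        weights[term] = f - b if f > b else 0.0
    return weights


foreground = ["nus exam timetable", "nus exam results"]
background = ["nus skin cream", "nus skin offer"]
weights = dynamic_keyword_weights(foreground, background)
top = sorted(weights, key=weights.get, reverse=True)[:1]
print(top)
```

In this toy input the acronym itself gets zero weight because it is equally frequent in relevant and irrelevant tweets, while "exam" is ranked first.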
  • the framework may rank the more active and influential users of the organizations in the higher orders, while, in case of ambiguous organizations, discard the users of the other organizations.
  • an active user of an organization may be defined as one who sends many relevant micro-posts about the organization, and an influential user as one who has many followers within the organization and initiates major discussions about the organization.
  • the combination of these measures can be used to rank the users of the target organization with high accuracy.
  • We compute the score for each user $u_i \in U^t$ based on the following equation at time t:
  • the above equation may rank the user based on the aforementioned three criteria.
  • the top K users may be considered as the key-users of the organization at time t. These users may be passed to the org key-user crawler 406 to be monitored.
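The scoring equation is not reproduced in this text. The sketch below therefore assumes an equally weighted mean of three max-normalised criteria (relevant posts sent, followers within the organization, discussions initiated) and then selects the top K users; `top_key_users` and the layout of `stats` are hypothetical.

```python
def top_key_users(stats, k):
    """Rank users by the mean of three max-normalised criteria and return the top k.

    stats maps user -> (relevant_posts, in_org_followers, discussions_started)
    and is expected to be non-empty.
    ASSUMPTION: equal weighting; the source's scoring equation is not shown.
    """
    maxima = [max(values[i] for values in stats.values()) or 1 for i in range(3)]
    score = {
        user: sum(values[i] / maxima[i] for i in range(3)) / 3
        for user, values in stats.items()
    }
    # Sort by descending score, breaking ties by user name for determinism.
    return sorted(score, key=lambda user: (-score[user], user))[:k]


stats = {"alice": (40, 120, 5), "bob": (10, 300, 1), "carol": (2, 15, 0)}
print(top_key_users(stats, k=2))
```

The returned list corresponds to the users that would be handed to the org key-user crawler 406 at time t.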
  • a high quality classifier may be provided to discriminate relevant content for organizations by (a) learning their content relevance and (b) their user information respectively.
  • the framework may assign a relevance score to each input data based on its content similarity with the current discussions about the organization. For this purpose, we utilize the dynamic keywords (mined as described further below) because such keywords are good indicators of the current discussions about the organization.
  • $W^t = \{w_1, \ldots, w_m\}$ of arbitrary size m contains the dynamic keywords at time t.
  • $W^t$ may be used as the classification features and $S^t_{for} \cup S^t_{bak}$ as training data to discriminate the input streaming data into relevant and irrelevant sets.
  • the dynamic keywords may provide a fast way to prune the huge amount of irrelevant input data as they stream in.
  • each test tweet may be assigned a relevance score which represents the content-based relevance score of the tweet.
  • a final judgment may be made about the relevance of an input tweet to the target organization.
  • the user information for the data obtained from the fixed keyword crawler may be utilized. This may be because the data crawled from the other two crawlers (known account crawler 404 and org key-user crawlers 406) may come from the users who already have high relevance scores to the target organization and therefore it may be desired to ensure the relevance of their content.
  • the final score of the tweet may be determined by the linear combination of its content and user score as follows:
  • CS_j ∈ [-1, 1] may indicate the content-based relevance score of s_j, and US_j ∈ [-1, 1] may indicate the relevance score of u_j as the author of s_j (see Equation (4)).
  • the parameter α may control the contribution of each of the above scores in labeling the tweet. This parameter may be learnt using development data.
  • Any incoming tweet with Lj>0 may be considered as relevant, and the rest as irrelevant.
  • the relevant tweet may be added to the relevant tweet repository which will be then utilized in the next iterations.
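The labeling rule described above can be sketched as follows, assuming the linear combination takes the form L = α·(content score) + (1 - α)·(user score), which matches the later remark that the user score carries weight 1 - α. The function and argument names are illustrative only.

```python
def final_score(content_score, user_score, alpha):
    """Linear combination of the content-based and user-based relevance scores.

    Both scores are expected in [-1, 1]; alpha in [0, 1] weights the content score
    and (1 - alpha) weights the user score.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * content_score + (1.0 - alpha) * user_score


def is_relevant(content_score, user_score, alpha):
    """A tweet is labeled relevant when its final score is strictly positive."""
    return final_score(content_score, user_score, alpha) > 0.0
```

With a small α, a tweet whose content score is slightly negative can still be labeled relevant when its author has a high user score, which is the behaviour reported for ambiguous organizations.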
  • Table 1 illustrates an online classification algorithm according to various embodiments. The effect of the length of the time interval t on the classification performance may be analyzed.
  • Table 1 shows an illustration of Algorithm 1 and Classification at time t.
  • each s_i ∈ R^m is a term vector of length m weighted by the standard Term Frequency (TF) and Inverted Document Frequency (IDF) as follows: where C is the normalization factor, TF(i,j) indicates the frequency of the term and IDF(j) indicates the inverted document frequency of w_j.
  • TF Term Frequency
  • IDF Inverted Document Frequency
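A minimal sketch of TF-IDF term vectors is given here for illustration. It assumes IDF(j) = log(N / df_j), where df_j is the number of tweets containing w_j, and takes the normalization factor C to be the Euclidean norm of the raw vector; both choices are common conventions assumed for this example rather than the patent's exact weighting. Whitespace tokenization is also a simplification.

```python
import math
from collections import Counter


def tfidf_vectors(tweets, vocabulary):
    """Return one length-m TF-IDF vector per tweet, with m = len(vocabulary)."""
    n = len(tweets)
    tokenised = [tweet.lower().split() for tweet in tweets]
    # Document frequency: number of tweets in which each term occurs.
    df = {w: sum(1 for tokens in tokenised if w in tokens) for w in vocabulary}
    idf = {w: math.log(n / df[w]) if df[w] else 0.0 for w in vocabulary}
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        raw = [counts[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw))  # normalization factor C (assumed)
        vectors.append([v / norm if norm else 0.0 for v in raw])
    return vectors
```

A term that occurs in every tweet receives zero weight under this IDF, so only terms that distinguish tweets from one another contribute to the vectors.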
  • FIG. 5 shows an illustration 500 of learning evolving and emerging topics at time t; wherein the circles represent the topic learning (TL) process.
  • the error difference between the topic of the tweet s_i and the topics that have been learned up to time t-1, i.e. D^{t-1}, may be computed.
  • the error difference may be called residual error.
  • a purging method may be performed which removes a topic from the topic set D^{t-1} if it is not matched with any tweet s_i for 24 hours.
  • topic modeling may be performed over the evolving tweets S^{ev} to create D^{ev}, the evolving topics at time t.
  • topic modeling may be performed over the emerging tweets S^{em} to create D^{em}, the emerging topics at time t.
  • the results of the above two topic modelers, i.e. D^{ev} and D^{em}
  • the emerging tweets may be clustered into groups to form D^{em} (this component may be optional or may be removed as (514, 516) may do this, and as such, this component may be there merely for quality purposes).
  • the first constraint may be to prevent dramatic changes in the evolving topics in two consecutive time stamps, whereas the second constraint may be due to the limited length of the tweets. This may be because tweets are limited to 140 characters; this space may be too short to be used for writing about several topics.
  • the evolving topic matrix D^{ev} may be learned by minimizing the following optimization problem
  • the emerging topics may be totally new and there may be no prior information about the number of emerging topics. Therefore, the X-Means approach may be utilized to find an initial set of clusters from S^{em}.
  • X-Means may be an extension of the standard k-means that utilizes the Bayesian Information Criterion (BIC) model to estimate the best number of clusters within a given range.
  • BIC Bayesian Information Criterion
  • the resultant clusters that have a sufficient number of tweets (for example, only the clusters that have more than 20 tweets may be taken into account) may be considered as the emerging topics, and their centroid vectors may be used to create an initial emerging topic matrix D^{init} ∈ R^{m×k'}, where k' is the number of such centroids.
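The cluster filtering step can be sketched as below. The sketch assumes that cluster labels have already been produced by some clustering routine (X-Means in the description; the clustering itself is not implemented here) and only shows how sufficiently large clusters might be turned into initial topic centroids. The function name and the list-of-lists layout are illustrative assumptions.

```python
def initial_emerging_topics(vectors, labels, min_size=20):
    """Return centroid vectors of clusters that contain more than min_size tweets.

    vectors: list of term vectors (one per emerging tweet)
    labels:  cluster label for each vector, from an external clustering step
    Each returned centroid would form one column of the initial topic matrix.
    """
    clusters = {}
    for vector, label in zip(vectors, labels):
        clusters.setdefault(label, []).append(vector)
    centroids = []
    for label in sorted(clusters):
        members = clusters[label]
        if len(members) > min_size:
            dim = len(members[0])
            centroids.append(
                [sum(v[d] for v in members) / len(members) for d in range(dim)]
            )
    return centroids
```

The number of returned centroids plays the role of k', the number of initial emerging topics.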
  • the same approach may be followed as for the evolving topics to find the optimum value for D^{em} as follows:
  • FIG. 5 depicts the overall procedure of learning topics at each point of time. It is to be noted that the above two processes (learning D^{ev} and D^{em}) can be performed in parallel to speed up the overall learning process. The purging process in FIG. 5 will be explained further below.
  • In the following, decomposition of streaming data will be described.
  • Given the input matrix S^t and the topic matrix D^{t-1}, it may be desired to decompose S^t into S^{ev} and S^{em} matrices. For this, we find the best representation of each s_i ∈ S^t in terms of D^{t-1} as follows:
  • the resultant vector x_i ∈ R^k may indicate the already known topics that best represent the input vector s_i.
  • the representation error of s_i on D^{t-1} (what we call residual error) may be computed as follows:
  • the matrix S^t may be decomposed into the two matrices as follows:
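The decomposition equations are not reproduced above, so the following is a deliberately simplified stand-in rather than the described sparse coding step: each tweet is matched against its single best known topic with a non-negative weight, and the squared reconstruction error is used as the residual error. The threshold, the function names and the one-topic approximation are all assumptions made for illustration.

```python
def residual_error(tweet, topics):
    """Squared error of the best single-topic, non-negative approximation of a tweet."""
    best = sum(v * v for v in tweet)  # error when no topic is used at all
    for topic in topics:
        energy = sum(v * v for v in topic)
        if energy == 0.0:
            continue
        # Least-squares weight for this topic, clipped to be non-negative.
        weight = max(0.0, sum(a * b for a, b in zip(tweet, topic)) / energy)
        error = sum((a - weight * b) ** 2 for a, b in zip(tweet, topic))
        best = min(best, error)
    return best


def decompose(tweets, topics, threshold):
    """Split tweets into (evolving, emerging) by comparing residual error to a threshold."""
    evolving, emerging = [], []
    for tweet in tweets:
        if residual_error(tweet, topics) <= threshold:
            evolving.append(tweet)   # well explained by already known topics
        else:
            emerging.append(tweet)   # poorly explained, candidate for a new topic
    return evolving, emerging
```

Tweets with a small residual feed the evolving topic learner, while the remainder are passed on as candidates for emerging topics.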
  • the last time at which each topic is selected as the dominant topic for an input tweet may be recorded. This time may be used as a measure to purge the topics.
  • the matching score between each d_j and each s_i may be determined by the (i,j)th entry of the weight matrix X, i.e. x_ij; see Equations (8) and (9).
  • all the topics that have not been selected as a dominant topic in the past 24 hours may be considered as non-active and may be removed from D^t.
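The purging rule can be illustrated with a small sketch that keeps, for each topic, the last time it was chosen as a dominant topic and drops topics idle for more than 24 hours. The dictionary layout and the function name are assumptions made for this example.

```python
from datetime import datetime, timedelta


def purge_inactive_topics(last_dominant, now, max_idle=timedelta(hours=24)):
    """Keep only topics selected as dominant within the last max_idle period.

    last_dominant maps a topic identifier to the datetime at which the topic
    was most recently the dominant topic of some input tweet.
    """
    return {
        topic: seen
        for topic, seen in last_dominant.items()
        if now - seen <= max_idle
    }
```

Calling this once per time step keeps the topic set from growing without bound as old discussions die out.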
  • the problem may be equivalent to an l1-regularized least squares problem and can be efficiently solved by a least angle regression (LARS) method or an alternating direction method.
  • LARS least angle regression
  • X is fixed
  • the problem may be a least square problem with quadratic constraints.
  • an advanced version of the projected gradient approach may be provided. It may be an effective online approach that processes each input data (or a small subset of data) only once. This may be particularly important in the context of social media where the input data can potentially be large at each time.
  • Equation (8) may be converted to the following problem:
  • the projected gradient approach may solve Equation (13) by iteratively obtaining the projected gradients using the following updating rule:
  • D_{i+1} = P[ D_i - \alpha_i \nabla_D \mathcal{L}(D)|_{D_i, X} ]   (15)
    where D_i may indicate D at iteration i, the parameter \alpha_i may be the step size, \nabla_D \mathcal{L}(D)|_{D_i, X} may be the gradient of \mathcal{L}(D) with respect to D (see Equation (16)), evaluated on D_i and X, and P may be a projection function defined for the non-negativity constraint (Equation (17)):
  • \nabla_D \mathcal{L}(D) = -2 S X^T + 2 D X X^T + 2\mu (D - D^{t-1})   (16)
    P[z] = z if z ≥ 0, and 0 otherwise   (17)
  • the disadvantage of the above approach may be that it may be slow and may need the parameter \alpha_i to be carefully chosen to obtain good results.
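One first-order projected gradient step can be sketched as follows. The sketch assumes the objective ||S - DX||_F^2 + mu·||D - D_prev||_F^2 with D constrained to be non-negative, whose gradient with respect to D is -2(S - DX)X^T + 2·mu·(D - D_prev); this objective, the helper names and the fixed step size are assumptions for illustration, and matrices are plain lists of rows.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*a)]


def gradient(D, X, S, D_prev, mu):
    """Gradient of ||S - DX||_F^2 + mu * ||D - D_prev||_F^2 with respect to D."""
    DX = matmul(D, X)
    residual = [[s - r for s, r in zip(s_row, r_row)] for s_row, r_row in zip(S, DX)]
    RXt = matmul(residual, transpose(X))
    return [
        [-2.0 * g + 2.0 * mu * (d - p) for g, d, p in zip(g_row, d_row, p_row)]
        for g_row, d_row, p_row in zip(RXt, D, D_prev)
    ]


def projected_gradient_step(D, X, S, D_prev, mu, step):
    """Take one gradient step on D, then project onto the non-negative orthant."""
    G = gradient(D, X, S, D_prev, mu)
    return [
        [max(0.0, d - step * g) for d, g in zip(d_row, g_row)]
        for d_row, g_row in zip(D, G)
    ]
```

Repeating this step until D stops changing gives the basic first-order method; its sensitivity to the step size is what motivates using second order information instead.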
  • the second order information the Hessian matrix
  • the Hessian matrix may be utilized to obtain the final updating rule as follows:
  • Table 2: Algorithm 2, computing D^t and X^t at time t; see TL in FIG. 5.
  • N_correct^+ is the number of micro-posts that were correctly assigned the relevant label
  • N_total^+ is the total number of relevant micro-posts (the same definition applies for the irrelevant class).
  • F1^+ and F1^- are the classification performances for the relevant and irrelevant classes respectively, and therefore Avg-F1 indicates the average classification performance in terms of F1-score.
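As an illustration of the metric, the sketch below computes F1 for the relevant and irrelevant classes and averages them. It assumes the usual definitions of precision (correct over assigned) and recall (correct over total) per class; labels are booleans with True meaning relevant, and the function names are illustrative.

```python
def f1_score(n_correct, n_assigned, n_total):
    """F1 from counts: correctly labeled, labeled as the class, truly in the class."""
    if n_correct == 0:
        return 0.0
    precision = n_correct / n_assigned
    recall = n_correct / n_total
    return 2.0 * precision * recall / (precision + recall)


def avg_f1(gold, predicted):
    """Average of the F1 scores of the relevant (True) and irrelevant (False) classes."""
    scores = []
    for label in (True, False):
        n_correct = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
        n_assigned = sum(1 for p in predicted if p == label)
        n_total = sum(1 for g in gold if g == label)
        scores.append(f1_score(n_correct, n_assigned, n_total))
    return sum(scores) / 2.0
```

Averaging over both classes prevents a classifier that labels everything as irrelevant from scoring well when relevant tweets are rare.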
  • Two evaluation metrics may be considered to assess the performance of the topic miner component, namely topic detection accuracy, and miss-rate at first detection.
  • the first measure evaluates the topic detection performance in terms of precision and recall, while the second measure evaluates the amount of information (number of tweets) that has been missed before the first automatic detection of each topic.
  • the second measure is important as we need a small miss-rate for earlier prediction of emerging topics.
  • for the miss-rate at first detection, the tweets of I_j posted before the origin time of d_j (that is, the best match of I_j) may be considered as the missed tweets, and their percentage determines the value of the miss rate (MR) for I_j.
  • MR miss rate
  • the miss rate for Ij is determined with respect to dj and may be defined as follows:
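A small sketch of the miss rate computation follows, assuming each ground-truth topic is given as a list of tweet timestamps and that the origin time of its best-matching detected topic is known. Timestamps are plain numbers (for example epoch seconds), and the function name is illustrative.

```python
def miss_rate(tweet_times, detection_time):
    """Fraction of a topic's tweets posted before the topic was first detected.

    tweet_times:    posting times of the tweets belonging to the ground-truth topic
    detection_time: origin time of the best-matching detected topic
    """
    if not tweet_times:
        raise ValueError("a ground-truth topic needs at least one tweet")
    missed = sum(1 for t in tweet_times if t < detection_time)
    return missed / len(tweet_times)
```

Multiplying the result by 100 gives the percentage form; a value near zero means the topic was detected almost as soon as it appeared.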
  • a good topic miner should have a high topic detection performance, F1, and a small miss rate, MR.
  • the crawlers utilized the streaming API of Twitter to crawl data.
  • Around 10 fixed keywords were manually identified for each of the ambiguous organizations NUS and DBS (including their acronyms) and only one fixed keyword, the term "starhub" itself, for StarHub.
  • Table 3 shows the number of tweets obtained from each of the crawlers and the crawling period for the three organizations.
  • Table 3 Data statistics and crawling period for the three organizations NUS, DBS, and StarHub.
  • FIG. 6 shows an illustration 600 of the distribution of the relevant tweets in the resultant ground-truth for the three organizations (NUS, DBS, StarHub, in subplots 602, 604, and 606).
  • “Fixed-Known” indicates the number of relevant tweets obtained by the fixed keyword or known account crawlers for the organization, while “overall” indicates the total number of relevant tweets obtained by all the three crawlers.
  • Such tweets can greatly improve the performance of online topic miner algorithms by providing more content information about the topics.
  • Table 5 shows the classification performance in terms of Avg-F1 with different types of features and input data.
  • Table 5: Classification performance in terms of Avg-F1 with different types of features and input data.
  • the value of α is smaller than 0.5 for both NUS and DBS. This was expected because the parameter α only affects tweets with fixed keywords (see Algorithm 1) and for such tweets the weight of the user score, i.e. 1-α, is expected to be high. Also, the classification performance is invariant to the parameter α in the case of StarHub: as mentioned above, the parameter only affects tweets with fixed keywords, and such tweets are considered as relevant for non-ambiguous organizations by default (see subplot 606 in FIG. 6).
  • FIG. 7 shows an illustration 700 of the effect of the learning parameters t and α on the classification performance for NUS.
  • FIG. 8 shows an illustration 800 of the effect of the learning parameters t and α on the classification performance for DBS.
  • FIG. 9 shows an illustration 900 of the effect of the learning parameters t and α on the classification performance for StarHub.
  • FIG. 7, FIG. 8, and FIG. 9 show the effect of the learning parameters t and α on our model, Dynamic-kw + User, evaluated over the entire ground truth dataset.
  • FIG. 7, FIG. 8, and FIG. 9 show that greater time intervals (t) increase the classification performance for NUS but cause a great reduction in the classification performance for DBS and StarHub.
  • the life time of the topics happening about the organization may affect the classification performance. If the topics are long, increasing the time interval t may not reduce the performance as the old topics are still active, whereas for short topics, increasing t reduces the performance as the old discussions are not active anymore and thus the dynamic keywords extracted from such topics are not useful features to classify the current input data.
  • FIG. 7, FIG. 8, and FIG. 9 show that, for NUS and DBS as ambiguous organizations, smaller values of α (i.e. giving less weight to the content relevance score and higher weight to the user score for the tweets with fixed keywords) lead to better performance. This result indicates the important role of user scores in classifying tweets with fixed keywords.
  • the classification performance for non-ambiguous organizations like StarHub is invariant to the parameter α, but will be affected by the learning time interval.
  • the online topic modeling method may be applied over the entire dataset for each organization, with the evaluation restricted to the topic dataset.
  • Table 6 shows the evaluation results for topic detection in terms of F1 performance (Equation (22)).
  • the Overall column shows the performance when we perform the evaluation over all the relevant input data for the topic modeling purpose, while the Known column shows the corresponding performance when we only use the relevant tweets obtained from fixed keyword or known account crawlers.
  • the optimization framework outperforms the baseline for DBS and StarHub while its performance for NUS is comparable with the baseline.
  • the average improvement over the baseline is 7.98%, i.e. from 49.43% to 57.41%, when we utilize the overall input data for topic modeling.
  • Table 7 shows the evaluation results for the miss-rate at first detection metric (Equation (24)).
  • the lower values of miss-rate indicate that the topic modeling algorithm is able to identify the emerging topics earlier.
  • the average miss-rate is lower when we use the overall data instead of only tweets obtained by the fixed keyword or known account crawlers. This suggests that we can detect emerging topics earlier if we make use of more (relevant) tweets. It may be concluded that the key-user crawler according to various embodiments is an effective resource for early prediction of emerging topics about organizations. The results show that our approach outperforms the baseline by a 4.83% reduction in the average miss-rate (from 27.11% to 22.28%).
  • a framework may be provided to automatically identify relevant micro-posts, topics and users of the organization from social media.
  • Previous brand monitoring systems are not designed to address the lack of representative data or polysemy issues for entities like organizations.
  • Various embodiments provide an effective framework to elicit a representative amount of data about organizations and to deal with the polysemy issue for organizations in social media.
  • the framework according to various embodiments may provide methods to address the above issues by
  • the system may provide live feedback to organizations by automatic discovery of the relevant content about them from social media, identifying their user community in social media, and listening to their key-users. This information may be invaluable for user-centric organizations as they utilize such information to obtain actionable insights from social media.
  • the system according to various embodiments may be fed by the social media portals like Twitter and Facebook.
  • an information determination device may be provided.
  • the information determination device may include: an account crawler configured to determine data from at least one pre-determined user account; and an information determiner configured to determine information related to a pre-determined organization based on the data.
  • an information determination device may be provided which determines information related to a predetermined organization based on data which are determined from one or more predetermined user accounts.
  • the account crawler may include or may be a known account crawler configured to determine the data from at least one account for the organization.
  • the account crawler may include or may be a key-user crawler configured to determine the data from at least one account of a key user.
  • the information determination device may further include a keyword crawler configured to determine further data based on at least one pre-determined keyword, and the information determiner may further be configured to determine the information further based on the further data.
  • the keyword crawler may include or may be a fixed keyword crawler configured to determine the further data based on at least one fixed keyword.
  • the keyword crawler may include or may be a dynamic keyword crawler configured to determine the further data based on at least one dynamic keyword, and at least one dynamic keyword may be changed based on processing of the information determination device.
  • the information determination device may further include a user friend list crawler configured to determine a user graph of users in a social relationship with at least one of the organization or each other.
  • the account crawler may further be configured to determine the at least one pre-determined user account based on the user graph.
  • the information determination device may further include: a classifier configured to classify the data into data relevant to the pre-determined organization and data irrelevant to the pre-determined organization.
  • the information determiner may further be configured to determine the information based on the data relevant to the pre-determined organization.
  • the information determination device may further include a topic miner configured to detect whether the determined information related to the pre-determined organization is related to an evolving topic or to an emerging topic about the pre-determined organization.
  • the information determination device may further include: a topic miner configured to detect whether the determined information related to the pre-determined organization is related to an evolving topic or to an emerging topic about the pre-determined organization, and the dynamic keyword crawler may further be configured to determine the at least one dynamic keyword based on the information of the evolving topic and based on the information of the emerging topic.
  • the account crawler may further be configured to determine the at least one pre-determined user account based on the data relevant to the pre-determined organization.
  • the classifier may be configured to classify the data based on learning.
  • the classifier may be configured to classify the data based on a support vector machine.
  • the topic miner may be configured to detect whether the determined information related to the pre-determined organization is related to the evolving topic or to the emerging topic based on learning.
  • the topic miner may be configured to detect whether the determined information related to the pre-determined organization is related to an evolving topic or to an emerging topic based on solving an optimization problem.
  • the optimization problem may include a temporal continuity constraint.
  • the optimization problem may include a sparse matching constraint.
  • the information determination device may further include an optimization problem solver configured to solve the optimization problem based on a least angle regression.
  • the information determination device may further include a trivial topic purging circuit configured to remove old topics from evolving topics and emerging topics.
  • an information determination method may be provided.
  • the information determination method may include: determining data from at least one pre-determined user account; and determining information related to a pre-determined organization based on the data.
  • the information determination method may further include determining the data from at least one account for the organization.
  • the information determination method may further include determining the data from at least one account of a key user.
  • the information determination method may further include determining further data based on at least one pre-determined keyword; and determining the information further based on the further data.
  • the information determination method may further include determining the further data based on at least one fixed keyword.
  • the information determination method may further include determining the further data based on at least one dynamic keyword, wherein the at least one dynamic keyword is changed based on processing of the information determination method.
  • the information determination method may further include determining a user graph of users in a social relationship with at least one of the organization or each other.
  • the information determination method may further include determining the at least one pre-determined user account based on the user graph.
  • the information determination method may further include classifying the data into data relevant to the pre-determined organization and data irrelevant to the pre-determined organization.
  • the information determination method may further include determining the information based on the data relevant to the pre-determined organization.
  • the information determination method may further include detecting whether the determined information related to the pre-determined organization is related to an evolving topic or to an emerging topic.
  • the information determination method may further include detecting whether the determined information related to the predetermined organization is related to an evolving topic or to an emerging topic; and determining the at least one dynamic keyword based on the information of the evolving topic and based on the information of the emerging topic.
  • the information determination method may further include determining the at least one pre-determined user account based on the data relevant to the pre-determined organization.
  • the information determination method may further include classifying the data based on learning.
  • the information determination method may further include classifying the data based on a support vector machine.
  • the information determination method may further include detecting whether the determined information related to the predetermined organization is related to the evolving topic or to the emerging topic based on learning.
  • the information determination method may further include detecting whether the determined information related to the predetermined organization is related to an evolving topic or to an emerging topic based on solving an optimization problem.
  • the optimization problem may include a temporal continuity constraint.
  • the optimization problem may include a sparse matching constraint.
  • the information determination method may further include solving the optimization problem based on a least angle regression.
  • the information determination method may further include removing old topics from evolving topics and emerging topics.

Abstract

The invention relates to monitoring social media and microblogs, using web crawlers, for relevant information about company brands and related information. A unified framework of fixed and dynamic keywords, known accounts, key users and friend lists is used to identify relevant microblogs and tweets about a given organization of interest. This information about the organization is classified by relevance using learning algorithms, and emerging and evolving topics are identified.
PCT/SG2014/000508 2013-10-30 2014-10-30 Surveillance de marques par les microblogs WO2015065290A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG201308079-1 2013-10-30
SG201308079 2013-10-30

Publications (1)

Publication Number Publication Date
WO2015065290A1 true WO2015065290A1 (fr) 2015-05-07

Family

ID=53004726

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2014/000508 WO2015065290A1 (fr) 2013-10-30 2014-10-30 Surveillance de marques par les microblogs

Country Status (1)

Country Link
WO (1) WO2015065290A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9948554B2 (en) 2014-12-11 2018-04-17 At&T Intellectual Property I, L.P. Multilayered distributed router architecture
US10243849B2 (en) 2013-04-26 2019-03-26 At&T Intellectual Property I, L.P. Distributed methodology for peer-to-peer transmission of stateful packet flows
US10257089B2 (en) 2014-10-30 2019-04-09 At&T Intellectual Property I, L.P. Distributed customer premises equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007101263A2 (fr) * 2006-02-28 2007-09-07 Buzzlogic, Inc. Systeme et procede d'analyse sociale permettant d'analyser des conversations sur des contenus multimedia a caractere social
WO2008045792A2 (fr) * 2006-10-06 2008-04-17 Technorati, Inc. Procédés et appareil pour de la publicité conversationnelle
US20100063948A1 (en) * 2008-09-10 2010-03-11 Digital Infuzion, Inc. Machine learning methods and systems for identifying patterns in data
US20130103490A1 (en) * 2006-01-20 2013-04-25 International Business Machines Corporation System and method for marketing mix optimization for brand equity management

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10243849B2 (en) 2013-04-26 2019-03-26 At&T Intellectual Property I, L.P. Distributed methodology for peer-to-peer transmission of stateful packet flows
US10887228B2 (en) 2013-04-26 2021-01-05 At&T Intellectual Property I, L.P. Distributed methodology for peer-to-peer transmission of stateful packet flows
US10257089B2 (en) 2014-10-30 2019-04-09 At&T Intellectual Property I, L.P. Distributed customer premises equipment
US10652148B2 (en) 2014-10-30 2020-05-12 At&T Intellectual Property I, L. P. Distributed customer premises equipment
US11388093B2 (en) 2014-10-30 2022-07-12 Ciena Corporation Distributed customer premises equipment
US9948554B2 (en) 2014-12-11 2018-04-17 At&T Intellectual Property I, L.P. Multilayered distributed router architecture
US10484275B2 (en) 2014-12-11 2019-11-19 At&T Intellectual Property I, L. P. Multilayered distributed router architecture

Similar Documents

Publication Publication Date Title
US20230334254A1 (en) Fact checking
Ma et al. Label embedding for zero-shot fine-grained named entity typing
WO2017137859A1 (fr) Systèmes et procédés de génération d'attributs de langage sur une représentation de mots multicouche
US20170270096A1 (en) Method and system for generating large coded data set of text from textual documents using high resolution labeling
JP2021508866A (ja) 対象領域およびクライアント固有のアプリケーション・プログラム・インタフェース推奨の促進
CN113434858B (zh) 基于反汇编代码结构和语义特征的恶意软件家族分类方法
Bhattacharjee et al. Identifying malicious social media contents using multi-view context-aware active learning
WO2012158572A2 (fr) Exploitation d'enregistrements de clics d'interrogation pour la détection de domaine dans la compréhension d'une langue parlée
Ra et al. DeepAnti-PhishNet: Applying deep neural networks for phishing email detection
Zhang et al. User classification with multiple textual perspectives
Studiawan et al. Automatic log parser to support forensic analysis
Mamun et al. Classification of textual sentiment using ensemble technique
JP2021508391A (ja) 対象領域およびクライアント固有のアプリケーション・プログラム・インタフェース推奨の促進
Ramraj et al. Topic categorization of tamil news articles using pretrained word2vec embeddings with convolutional neural network
WO2015065290A1 (fr) Surveillance de marques par les microblogs
Hossain et al. Automatic Bengali document categorization based on word embedding and statistical learning approaches
Lydiri et al. A performant deep learning model for sentiment analysis of climate change
MacDermott et al. Using deep learning to detect social media ‘trolls’
Mahmud et al. Deep learning based sentiment analysis from Bangla text using glove word embedding along with convolutional neural network
Kanagavalli et al. Social networks fake account and fake news identification with reliable deep learning
US20140037154A1 (en) Automatically determining a name of a person appearing in an image
Trivedi et al. A study of ensemble based evolutionary classifiers for detecting unsolicited emails
Ou et al. Refining BERT embeddings for document hashing via mutual information maximization
Washha et al. Information quality in social networks: A collaborative method for detecting spam tweets in trending topics
Datta et al. A supervised machine learning approach to fake news identification

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14857422

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14857422

Country of ref document: EP

Kind code of ref document: A1