WO2014048479A1

WO2014048479A1 - A system and method for the automatic creation or augmentation of an electronically rendered publication document

Info

Publication number: WO2014048479A1
Application number: PCT/EP2012/069140
Authority: WO
Inventors: Walid Magdy
Original assignee: Qatar Foundation; Hoarton, Lloyd
Priority date: 2012-09-27
Filing date: 2012-09-27
Publication date: 2014-04-03

Abstract

A method, system, tool or computer-readable medium creates or augments an electronically rendered publication document, the method comprising: identifying a topic of the publication document; analysing the content of multiple microblogs to identify classification features, the classification features comprising evidence that a microblog from the multiple microblogs is in a particular microblog classification; identifying microblogs relevant to the topic of the publication document; collating into supplemental microblog content at least those microblogs or the content from those microblogs which fall into a particular microblog classification and which are relevant to the topic of the publication document; and linking the publication document with the supplemental microblog content. Preferably the supplemental microblog content is rendered alongside or within the publication document.

Description

Title: A system and method for the automatic creation or augmentation of an electronically rendered publication document

Background

The present invention relates to a system, tool and method for automatic enrichment of the content of publication material and more particularly to a system, tool and method for the automatic creation or augmentation of an electronically rendered publication document or digital medium.

Electronically rendered publication material or digital media comprises news articles, current affairs articles and periodicals or articles published in other arenas such as scientific journals or website forums, particularly where the publication is rendered electronically on a user interface. The publication material is, in simple forms: a website page, a word-processing document, a spreadsheet, a book, an e-book, or the entirety of a website.

News websites have become very popular over the last decade and in many countries they have started to take over from the more classic means of delivering news, namely newspapers (i.e. printed publications). A newspaper has content which is fixed at the time of publication (hard copy printing), whereas a news website can be constantly updated and can therefore be more current. It is common for news websites to allow users to provide their opinion, comments or analysis of news articles appearing on the news website.

Such user generated content (UGC) is usually shown in the form of a comments section appended to the news article. A further layer of interaction is also usually provided, allowing users to rate the comments of other users and reply to those comments. These comments and the layers of interaction they provide enrich the content of the news article, allowing readers to discover related events and news and learn more about the opinions and comments of others. Conventional systems attempt to maximise the extent of interaction between the news provider and the users and potentially between the users themselves. Conventional systems do so by incorporating or appending users' comments from microblog websites. Comments about a particular news article can be linked directly from a respective social website (microblog) such as Facebook or Twitter. This communication is two-way, allowing user articles to be linked directly to the social websites as well as from the social websites. Interaction with the news article and other users can therefore take place on the news website and/or on the social website. Some conference websites and consumer websites incorporate a small window on the main page of the site showing microblog feeds from social websites. The microblogs shown sometimes incorporate a hash-tag (#topic) for the particular conference or, in the case of a social website, are comments posted on the social website page of the conference. These microblogs are comments which are fed directly by the users to the conference website via the conference's social pages on the respective social website.

The Twitter microblog uses "hashtags" - "The # symbol, called a hashtag, is used to mark keywords or topics in a Tweet. It was created organically by Twitter users as a way to organise messages." - source: www.twitter.com. In other words, a user creates a hashtag by prefixing a term with a # symbol to identify the prefixed term as the intended topic of that microblog. The hashtag can be seen as a "Subject:" line or topic identifier so that other users can search for that particular hashtag to identify further microblogs referencing the same hashtag. More than one hashtag can be present in a single microblog. Some sites allow searching by hashtags, in which case, the hashtags are used as keywords: http://truthy.indiana.edu/. This website provides a tool for analysing a population of microblogs for hashtags and plots a link graph between a hashtag forming a user query and other hashtags that co-occur within each microblog. The website also allows a user to search for a hashtag and then displays recent tweets that contain the given hashtag, as well as an indication of the distribution of how many times the searched hashtag is mentioned over time. Searching social content and social networking site content in general and microblogs (aka tweets) in particular has been basic and limited, especially for time-sensitive topics. The currently implemented microblog search on sites such as Twitter is based on simple word matching and retrieves the most recent microblogs that match a given query.
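The hashtag mechanics described above (a term prefixed with #, several hashtags per microblog, and a link graph of hashtags that co-occur) can be illustrated with a minimal Python sketch. The regular expression, the function names and the sample posts are assumptions made for illustration; they are not taken from Twitter or from the website mentioned above.

```python
import re
from collections import Counter
from itertools import combinations

# Assumed pattern: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")

def hashtags(text):
    """Return the lowercased hashtags of one microblog, without duplicates."""
    seen = []
    for tag in HASHTAG.findall(text):
        tag = tag.lower()
        if tag not in seen:
            seen.append(tag)
    return seen

def cooccurrence(microblogs):
    """Count how often each unordered pair of hashtags shares a microblog."""
    pairs = Counter()
    for text in microblogs:
        for a, b in combinations(sorted(hashtags(text)), 2):
            pairs[(a, b)] += 1
    return pairs

# Hypothetical sample posts.
posts = [
    "Flooding downtown #storm #weather",
    "Stay safe everyone #Storm",
    "Roads closed again #weather #storm #traffic",
]
counts = cooccurrence(posts)
```

Each key of `counts` is an edge of the co-occurrence graph and each value is the weight of that edge.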

Key features of online news are the freshness of publishing news at the time of action and allowing users to interact with the news by commenting and expressing their feelings and response toward the news. These comments have always been of interest to readers of online articles, since they give an insight into public opinion and can open discussion about the news. Social media has become a large social hub for people to express their thoughts and feelings toward everything, including news. People usually report news in microblogs, which may comprise a short message of 140 characters at most, and sometimes they include their comment about the reported news within this short length of text. The nature of comments on news on microblogs differs considerably from that of comments on an online news website. Users of microblogs are constrained to a short length of text in which to write their comment while also reporting the news so that others understand what they are talking about. This makes the nature of these comments very interesting, since they are very short in length and focused. In addition, the comments on microblogs on a given piece of news are very diverse, since the comments are derived from a huge population of contributors, each having their own character or persona. The diversity of comments can provide a good indication of the public opinion towards a particular publication document/news article.

Conventional approaches have a number of shortcomings and associated technical problems:

Polarised comments: the number of comments on a particular news article can be limited. Sometimes even the small number of comments present are biased by the character of the news website (for example a politically aligned or funded news website) or by the persona or characteristics of the visitors to a particular website. The few comments present can be skewed or polarised in a particular direction. Because the comments are direct comments being fed to the news article, there can be little diversity in the comments.

Comments on an online news website are typically representative of the opinion of the typical visitors to the particular website and cannot therefore sample and provide a true reflection of public opinion as well as an open social media microblog.

Comment weighting: The news websites make no distinction between the comments from one user vis-a-vis the comments from another user. There is no intelligence or verification process involved in deciding whether a comment should be displayed or not. There is no "value" or indication of merit attached to a particular microblog in dependence on the nature of the user. Further, there is little or no control over the relevance of comments posted directly to a news website.

User identity: In cases where a news article is linked to a social media, some news websites show the most "liked" (or most popular) microblogs at the top of the comment ranking but this approach does not take the identity of the user into account.

Timing: Comments can only appear on a respective news publication after the news has been published and is available for comment when the article appears on the respective website. Where there are comments already existing on social websites and microblogs on a specific topic, any article published after those comments will not take note of those pre-existing comments as the users will not have had an opportunity to link or direct their comments to the news article.

Naïve strategies: The presence of microblog feeds on some news websites relies on naïve and basic keyword matching, such as matching hashtags in microblogs.

Comment relevance: A further shortcoming of the current technology is that microblog comments appended to news articles directly by bloggers may be spam or adverts, and little or no attention is given to examining the relevance of these comments to the content of the publication document.

REFERENCES:

1. L. Barbosa, J. Feng. (2010). Robust Sentiment Detection on Twitter from Biased and Noisy Data. COLING 2010.

2. L. Chen, W. Wang, M. Nagarajan, S. Wang, A. P. Sheth. (2012). Extracting Diverse Sentiment Expressions with Target-Dependent Polarity from Twitter. ICWSM 2012.

3. N. Diakopoulos, M. Naaman. (2011). Towards Quality Discourse in Online News Comments. CSCW 2011.

4. E. Gilbert, K. Karahalios. (2010). Understanding Deja Reviewers. CSCW 2010.

5. R. Gonzalez-Ibanez, S. Muresan, N. Wacholder. (2011). Identifying Sarcasm in Twitter: A Closer Look. ACL 2011.

6. P. Kilner, C. Hoadley. (2005). Anonymity Options and Professional Participation in an Online Community of Practice. CSCL 2005.

7. R. Kohavi. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. IJCAI 1995.

8. L. Jiang, M. Yu, M. Zhou, X. Liu, T. Zhao. (2011). Target-dependent Twitter Sentiment Classification. ACL 2011.

9. T. Joachims. (2001). A Statistical Learning Model of Text Classification for Support Vector Machines. SIGIR 2001.

10. C. Lin, Y. He, R. Everson. (2011). Sentence Subjectivity Detection with Weakly-Supervised Learning. IJCNLP 2011.

11. Z. Luo, M. Osborne, T. Wang. (2012). Opinion Retrieval in Twitter. ICWSM 2012.

12. N. Naveed, T. Gottron, J. Kunegis, A. Alhadi. (2011). Searching microblogs: coping with sparsity and document quality. CIKM 2011.

13. I. Ounis, C. Macdonald, J. Lin, I. Soboroff. (2011). Overview of the TREC-2011 Microblog Track. TREC 2011.

14. O. Phelan, K. McCarthy, M. Bennett, B. Smyth. (2011). Terms of a feather: content-based news recommendation and discovery using twitter. ECIR 2011.

15. E. Riloff, T. Wiebe. (2003). Learning extraction patterns for subjective expressions. EMNLP 2003.

16. B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu, M. Demirbas. (2010). Short Text Classification in Twitter to Improve Information Filtering. SIGIR 2010.

17. T. Strohman, D. Metzler, H. Turtle, W. B. Croft. (2004). Indri: A language model-based search engine for complex queries. ICIA 2004.

18. I. Subasic, B. Berendt. (2011). Peddling or Creating? Investigating the Role of Twitter in News Reporting. ECIR 2011.

19. J. Teevan, D. Ramage, M. Morris. (2011). #Twittersearch: A comparison of microblog search and web search. WSDM 2011.

20. H. Wang, D. Can, A. Kazemzadeh, F. Bar, S. Narayanan. (2012). A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle. ACL 2012.

21. T. Wilson, J. Wiebe, P. Hoffmann. (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. HLT 2005.

22. W. X. Zhao, J. Jiang, Ji. Weng, J. He, E-P. Lim, Ho. Yan, X. Li. (2011). Comparing twitter and traditional media using topic models. ECIR 2011.

Microblog Text Classification

Much research work has focused on classifying tweets into several classes for different uses. Most of this work focused on applying sentiment analysis to opinionated tweets, by classifying them into positive/negative sentiment [1,2,10]. Some research focused on a higher level classification of microblogs, classifying tweets as subjective or objective [1,6,11,22]. Subjective tweets are those that represent an opinion on something or someone, while objective tweets are those that carry facts such as news or information. Other work applied additional classification to tweets [5,16]. In [16], tweets were classified into four categories: news, events, opinion and deals. An approach for automatically detecting sarcastic tweets was presented in [5]. Different approaches to tweet classification were introduced in the prior work. The most common approach uses support vector machines (SVM), where a set of features is extracted from tweets and used to train a classification model [1,5,8,16]. This approach focuses on extracting features about the text structure and possibly tweet meta-information, such as features about the users who posted the tweet. Other approaches focused on the terms and their co-occurrence with different classes, which led to the use of Latent Dirichlet allocation (LDA) for the classification task [10,20]. Recent research that studied the retrieval of opinionated tweets on a given topic modelled the task as a ranking task [11]. A machine learning algorithm was applied in conjunction with a retrieval model to rank tweets according to both relevance and subjectivity. Mean average precision (MAP) was used in [11] for the evaluation of the task. However, the majority of the prior work measured the classification performance using the standard accuracy of classifying tweets correctly. This is expected, since the number of samples in the test sets used in the experiments of most of the prior work was comparable for all the classes.
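The two evaluation measures mentioned above can be contrasted with a short Python sketch. It uses the textbook definitions of accuracy and mean average precision and makes the simplifying assumption that every relevant item appears somewhere in the ranked list; the sample judgements are invented and do not come from any of the cited studies.

```python
def accuracy(predicted, actual):
    """Fraction of items whose predicted class equals the true class."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def average_precision(ranked_relevance):
    """Average precision of one ranked list of booleans (True = relevant).

    Assumption: all relevant items are present in the list, so the
    number of hits equals the total number of relevant items.
    """
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0

def mean_average_precision(runs):
    """Mean of the average precision over several queries."""
    return sum(average_precision(r) for r in runs) / len(runs)

# Invented judgements for two queries.
runs = [[True, False, True], [False, True]]
map_score = mean_average_precision(runs)
acc = accuracy(["subj", "obj", "obj", "subj"], ["subj", "obj", "subj", "subj"])
```

Accuracy treats every item equally, whereas average precision rewards placing relevant items near the top of a ranking, which is why it suits the ranking formulation of [11].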

Microblog and News

Current online news media are moving toward providing more interaction between the news and users. Many news websites enable users to comment on news and share it on social media. On many news websites, the comments on a news article are linked directly to one of the social websites, such as Facebook or Twitter, so the comment is posted to and seen by the user's social network. With the growth of interactivity between users and news, especially through comments on news articles, it has become important to manage this interaction and use it for improving the online news content.

Previous work has found solid evidence for the journalistic value of comments, including adding perspectives, insights, and personal experiences that can enrich a news story [3,4,6]. The work by [6] further studied the motivations for users to comment on news articles. The study showed that users typically comment on online news to: ask and answer questions, add information, share personal experience, express sentiment, and criticize or support the news.

Additional research work focused on coupling news to Twitter for creating useful services and technology for users. For example, [18,22] used tweets as a news source and compared them to other online news media to detect features for automatic news detection from Twitter. Also in [14], tweets were used to recommend news to users based on their preferences.

Microblog Retrieval

Interest in microblog retrieval has significantly increased in recent years. Several studies investigated the nature of microblog search compared to other search tasks [12,19]. Naveed et al. [12] illustrated the challenges of microblog retrieval, where documents are very short and typically focus on a single topic. Teevan et al. [19] highlighted the differences between web queries and microblog queries, where microblog queries usually represent users' interest in finding updates about a given event or person, as opposed to finding relevant pages on a given topic in web search.

Due to this increased interest in microblog search, TREC introduced a new track focused on microblog retrieval in 2011 [13]. The aim was to find the best methods for achieving high precision retrieval for microblog search. A collection of 14 million tweets from Twitter and a test set of 50 topics were provided for investigation [13]. Although the track led to a variety of effective retrieval approaches, the issue of modeling the search scenario remains important, as the TREC track setup models search like a standard ad-hoc retrieval task, which may be suboptimal [13].

It is an object of the present invention to seek to ameliorate the above problems and to provide a system and method for automatically enriching the content of publication material.

There is, therefore, a desire to overcome one or more of the problems associated with the prior art and provide, for example, a system and method for topic-based analysis of information derived from microblogs and for enhancing a digital medium with microblog comments relevant to the digital medium.

The technical solution is to present a system, tool and method for automatic enrichment of the content of publication material: a system and method for the automatic creation or augmentation of an electronically rendered publication document with content derived from multiple microblogs.

Embodiments of the present invention seek to ameliorate one or more problems associated with the prior art.

Summary of the invention

One aspect of the invention provides a method of augmenting an electronically rendered publication document containing content data, the method comprising: identifying a topic of the publication document; analysing the content of multiple microblogs to identify classification features, the classification features comprising evidence that a microblog from the multiple microblogs is in a particular microblog classification; identifying microblogs relevant to the topic of the publication document; collating into supplemental microblog content at least those microblogs or the content from those microblogs which fall into a particular microblog classification and which are relevant to the topic of the publication document; and linking the publication document with the supplemental microblog content.

Another aspect of the invention provides a method of augmenting an electronically rendered publication document containing content data, the method comprising: analysing the content of the publication document and extracting one or more publication topics from the content data of the publication document; analysing the content of multiple microblogs and identifying a microblog topic or topics from the content of the multiple microblogs; matching a microblog topic to a publication topic; collating into supplemental microblog content at least those microblogs or the content from those microblogs with a microblog topic that matches the publication topic; and linking the publication document relating to the matched publication topic with the matched supplemental microblog content. Preferably, the method further comprises presenting the supplemental microblog content with the publication document. Conveniently, the method further comprises sorting the supplemental microblog content for presentation with the publication document.

Preferably, the sorting is by popularity of the supplemental microblog content and/or by temporal data relating to the supplemental microblog content.

A further aspect of the invention provides a method of creating an electronically rendered publication document containing content data, the method comprising: analysing the content of multiple microblogs and identifying a microblog topic or topics from the content of the multiple microblogs; establishing a microblog topic as a publication topic; collating into supplemental microblog content at least those microblogs or the content from those microblogs with the microblog topic; providing the supplemental microblog content as content data for a new publication document, thereby creating the publication document relating to the microblog topic.

Another aspect of the invention provides a method of creating an electronically rendered publication document containing content data, the method comprising: analysing the content of multiple microblogs to identify classification features, the classification features comprising evidence that a microblog from the multiple microblogs is in a particular microblog classification; identifying microblogs relevant to a particular topic; collating into supplemental microblog content at least those microblogs or the content from those microblogs which fall into a particular microblog classification and which are relevant to the particular topic; providing the supplemental microblog content as content data for a new publication document, thereby creating the publication document relating to the particular topic.

One aspect of the invention provides a system for creating or augmenting an electronically rendered publication document containing content data, the system including: a computing device having a processor and a memory; and a storage device, wherein the computing device is configured to perform a method embodying the invention.

A further aspect of the invention provides a computer-readable medium storing instructions which when executed to run on a processor cause the processor to perform the steps according to the method embodying the present invention.

Another aspect of the invention provides a publication document creation or augmentation tool operable to generate automatically supplemental content derived from multiple microblogs for the creation or the augmentation of an electronically rendered publication document containing content data, the tool comprising: a content analysis engine to analyse the publication document and extract one or more publication topics from the content data of the publication document; a content analysis engine to analyse the content of multiple microblogs and identify a microblog topic or topics from the content of the multiple microblogs; a comparator to match a microblog topic identified by the content analysis engine to a publication topic identified by the content analysis engine; a collator to operate on the matched topics and compile supplemental microblog content from at least those microblogs or the content from those microblogs with a microblog topic that matches the publication topic.

A further aspect of the invention provides an electronically rendered publication document containing content data, the document comprising: rendered publication document content data; and at least one channel of rendered microblog content data of a first classification, the microblog content data being derived from multiple microblogs having a topic matching a publication topic of the rendered publication document content data.

Brief description of the drawings

Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:

Figure 1 is a schematic overview of a system embodying the present invention;

Figure 2 is a flow chart representing an overview of the method according to an embodiment of the present invention;

Figure 3 is an example of a preferred sample output according to an embodiment of the present invention; and

Figure 4 is a cycle chart representing a flow of data according to an embodiment of the present invention.

Detailed Description

Embodiments of this invention relate to information extraction and social computing. More particularly, the invention relates to extracting posts from microblogs and social websites in general that can be classified as comments on a given piece of news, for example, then appending these posts as comments on the related news article, on a news website.

Embodiments of the invention can be expressed as a system and method for automatically extracting posts from a microblog, such as a social website, that can be considered as a comment on a given news article which describes a piece of news or an incident. The abstract flowchart of one example of the invention is presented in Figure 1. Embodiments of the present invention provide a system, tool and method for automatically enriching the content of publication material, preferably with content material derived from a microblog feed or from a repository of multiple microblogs. Referring to the figures and, in particular, figures 1 and 2, examples of the invention are described in relation to news articles appearing on a news website, but the invention is applicable to other publication media such as any information provided on a website. The publication media "source" is not limited to text and can be still image, video, audio, text and any combination or combinations thereof.

Microblog websites and microblogs are a popular tool for users to post news, information or queries online for public (or private group) dissemination, review and reply. Twitter is a popular microblog site on which over 300,000,000 microblogs are exchanged each day. This is a very significantly sized pool of information and comment. A Twitter microblog or Tweet comprises a message of up to 140 characters. Microblogs can be broadly categorised into three categories: news; social media; and comment on topics. Examples of the present invention are concerned with content classified in the last category, user comments, and more particularly user comments relevant to a topic of a particular publication document. The system, method and tool embodying the present invention augment a publication document with microblog user comments as opposed to any other category of microblog content. Microblogs are not limited to the social website environment and can also comprise other forms of user generated content (UGC) such as: a post on a social networking site; a comment on a news article; a comment or a post on a forum; and/or a comment on a social networking site.

Referring to figures 1 and 2, in general terms, the solution embodied by examples of the invention analyses a publication document 1 containing content data 2 and a body 3 of microblogs 4. The publication document 1 or media is an electronically rendered document containing content data 2 representing a still image, video, audio, text and any combination or combinations thereof.

The microblog source is a repository 3 of multiple posts or microblogs 4 or can be a live feed 3 of microblogs 4. The multiple microblogs 4 can be normalised into microblog data of a predetermined format and stored in an index of microblogs. Preferably, the predetermined format includes at least a selection of: a microblog text content 4'; microblog metadata such as a microblog identifier; and a microblog time-stamp.
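The normalisation step described above can be sketched in Python as follows. The record fields mirror the three items named in the text (text content, identifier and time-stamp); the raw input keys `id`, `body` and `posted_at`, and the class and function names, are hypothetical and stand in for whatever a particular microblog source provides.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Microblog:
    """Predetermined format: identifier, text content and time-stamp."""
    identifier: str
    text: str
    timestamp: datetime

def normalise(raw):
    """Map one raw post (hypothetical keys) onto the common format."""
    return Microblog(
        identifier=str(raw["id"]),
        text=" ".join(raw["body"].split()),  # collapse stray whitespace
        timestamp=datetime.fromtimestamp(raw["posted_at"], tz=timezone.utc),
    )

# Invented raw posts; 'posted_at' is assumed to be seconds since the epoch.
raw_posts = [
    {"id": 17, "body": "  Ash cloud   closes airport ", "posted_at": 1348704000},
    {"id": 18, "body": "Why was nobody warned?", "posted_at": 1348707600},
]

# A very simple index of microblogs, keyed by identifier.
index = {m.identifier: m for m in map(normalise, raw_posts)}
```

Keeping every source in one record type means that later steps such as topic queries and classification do not need to know where a microblog came from.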

A topic analysis S1 is performed on the publication document 1 and its content data 2. The topic analysis S1 delivers the most likely topic(s) of the publication document - this is the publication topic(s) PT. In a simple case, the topic is determined as all or part of the title of the publication document 1. In another simple case, all microblog posts (or social posts in general) that link to the publication document can be retrieved using the link URL.
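The two simple cases just described can be sketched in Python. The stop-word list, the function names and the example URL are assumptions made for illustration, and the link test is a plain substring match rather than any particular link-resolution scheme.

```python
import re

# Assumed, deliberately tiny stop-word list.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "for", "is"}

def title_topic(title):
    """Simple case 1: use the content words of the title as the topic PT."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return [w for w in words if w not in STOPWORDS]

def posts_linking_to(url, posts):
    """Simple case 2: keep posts whose text contains the article URL."""
    return [p for p in posts if url in p]

# Hypothetical article and posts.
topic = title_topic("Volcano eruption closes airports in the north")
article_url = "https://news.example/volcano-eruption"
posts = [
    "Unbelievable https://news.example/volcano-eruption",
    "Nice weather today",
    "Read this: https://news.example/volcano-eruption #volcano",
]
linked = posts_linking_to(article_url, posts)
```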

An analysis S2 is also carried out on the source 3 of multiple microblogs 4. This analysis S2 is a topic query which can be performed on demand and on the fly on receipt of a new publication topic (PT) or can be performed on microblogs 4 which have already been indexed into a library of microblogs classified by topics. A topic query S2 made of the microblog repository 3 returns microblogs 4 or microblog content matching the publication topic (PT).
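One way to picture a topic query S2 against microblogs that have already been indexed is an inverted index from terms to microblog identifiers. The Python sketch below is an assumption about how such a query could work, not the retrieval model of the invention; the term-overlap threshold `min_overlap` and all names are hypothetical.

```python
import re
from collections import defaultdict

def tokens(text):
    """Lowercased set of alphanumeric terms in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_index(microblogs):
    """Inverted index: term -> identifiers of microblogs containing it."""
    index = defaultdict(set)
    for ident, text in microblogs.items():
        for term in tokens(text):
            index[term].add(ident)
    return index

def topic_query(index, topic_terms, min_overlap=2):
    """Identifiers sharing at least min_overlap topic terms, best first."""
    scores = defaultdict(int)
    for term in set(topic_terms):
        for ident in index.get(term, ()):
            scores[ident] += 1
    matched = [i for i, s in scores.items() if s >= min_overlap]
    return sorted(matched, key=lambda i: (-scores[i], i))

# Invented repository of microblogs keyed by identifier.
repository = {
    "m1": "Volcano eruption is terrifying",
    "m2": "Great match last night",
    "m3": "Airports shut after the volcano eruption, really?",
}
index = build_index(repository)
result = topic_query(index, ["volcano", "eruption", "airports"])
```

Building the index once corresponds to the pre-indexed library mentioned above, while calling `build_index` and `topic_query` together on a fresh batch corresponds to an on-demand query.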

Each microblog 4 has content 4' and metadata. The content 4' desired to augment the publication document 1 is microblog user comments as opposed to any other category of microblog content 4'. Preferably, an analysis engine extracts microblog user comment content 4' for use as the supplemental microblog content. The microblogs are classified into various classifications in the repository 3 or elsewhere. The microblogs 4 are classified in accordance with the content and/or the metadata into their one or more classifications. For example, the microblogs 4 can be classified as being sourced from a particular user (microblogger), sent from a particular geographic location, street, city, region or country (potentially as designated by geo-location and/or IP address information in the microblog metadata), having a positive or negative sentiment, being sent during a particular time period (e.g. during a volcanic eruption or in the aftermath), most recent or most popular posts, and/or comments from experts. The classification can later be used as a filter criterion and allow a user to render microblogs of one or more desired classifications.
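Using classifications as a filter criterion can be sketched in Python as a set of labels per microblog and a subset test. The labelling rules, the repost threshold of 100, the `EXPERTS` set and the metadata keys are all hypothetical; the text does not prescribe how a classification is decided.

```python
# Hypothetical set of accounts treated as experts.
EXPERTS = {"volcano_lab"}

def classify(post):
    """Assign classification labels from metadata (illustrative rules only)."""
    labels = set()
    if post.get("country"):
        labels.add("country:" + post["country"])
    if post.get("author") in EXPERTS:
        labels.add("expert")
    if post.get("reposts", 0) >= 100:  # assumed popularity threshold
        labels.add("popular")
    return labels

def filter_by(posts, wanted):
    """Keep the posts that carry every wanted classification."""
    return [p for p in posts if wanted <= classify(p)]

# Invented microblog metadata.
posts = [
    {"author": "volcano_lab", "country": "IS", "reposts": 250},
    {"author": "traveller42", "country": "IS", "reposts": 3},
    {"author": "fan_account", "country": "QA", "reposts": 120},
]
from_iceland = filter_by(posts, {"country:IS"})
popular_experts = filter_by(posts, {"expert", "popular"})
```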

Further, the classification criteria can be based on sentiment or opinion: "support/disagree", "happy/unhappy", "happy/sad/angry/confused". Embodiments of the invention can apply the classifications to both the generated comments 4' and also the original comments posted by users to the publication document 1 independently of the invention. The respective classifications can then be correlated to provide visual representations plotting the distribution of opinions of users in a comparable form between original comments and microblog comments 4'.
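To put the two sets of comments into a comparable form, the label counts of each set can be converted to proportions before plotting. The Python sketch below assumes that a sentiment label has already been assigned to each comment; the labels and sample data are invented.

```python
from collections import Counter

def distribution(labels):
    """Share of each classification label within one set of comments."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: counts[label] / total for label in counts}

# Invented labels for original site comments and for microblog comments.
site_comments = ["support", "support", "support", "disagree"]
microblog_comments = ["support", "disagree", "disagree", "disagree", "disagree"]

site_dist = distribution(site_comments)
microblog_dist = distribution(microblog_comments)

# Side-by-side rows (label, site share, microblog share) ready for plotting.
rows = [
    (label, site_dist.get(label, 0.0), microblog_dist.get(label, 0.0))
    for label in sorted(set(site_dist) | set(microblog_dist))
]
```

Normalising to proportions matters because the two sets usually differ greatly in size.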

The microblogs 4 identified as containing comments 4' relevant to the publication topic (PT) are collated and presented S3 for linking to the publication document 1, the news article in this example. In this example, the microblog comments 4' are appended to a) the news article 1 and/or b) the news article content data 2 and are then rendered on the same display as a "footer" to the main article 1 as shown in Figure 1.

The microblog comments 4' comprise supplemental content which is identified and collated by the system embodying the present invention. The supplemental content is microblog comment 4' which has been arrived at by the topic analyses (S1, S2) conducted on the publication document 1 and the microblogs 4, which two elements (1, 4) share the same topic which the microblog comments 4' are discussing.

The supplemental content 4' which has been automatically generated can be linked to the original publication document 1 so that it is accessible to a user interacting with the publication document 1, thereby augmenting the original publication document 1. The supplemental content 4' is rendered separately at a linked resource and/or is rendered alongside or embedded within the original publication document 1 so that it is immediately accessible to a user interacting with the publication document 1. Preferably, the supplemental microblog content comprising microblog comments 4' is rendered alongside or embedded within the publication document.

Another example of implementation is as follows:

given a news article 1 , find microblog posts 4 relevant to the topic of the news article = relevant posts 4;

extract from the relevant posts 4, posts 4 that can be classified as comments 4', not just news stating = relevant comments;

present the relevant comments 4' from the microblog 4 as a further type of comment within or after the news article 1 ;

user sorts the relevant comments according to user preference (the classifications): many different preferences, such as most recent or most popular, comments from experts, comments from a particular city, region or country.
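The four steps above can be sketched as a small pipeline. This is only an illustrative sketch: the `Post` type, the `is_relevant` and `is_comment` predicates and the default sort key are hypothetical placeholders, standing in for the topic analysis and the comment classifier described elsewhere in this specification.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Post:
    text: str
    reposts: int = 0        # number of times the post was recirculated
    timestamp: float = 0.0  # posting time, e.g. seconds since the epoch


def augment_article(
    article: str,
    posts: Iterable[Post],
    is_relevant: Callable[[str, Post], bool],
    is_comment: Callable[[str, Post], bool],
    sort_key: Callable[[Post], float] = lambda p: p.reposts,
) -> List[Post]:
    """Return the comment posts to render with the article, best first."""
    # Step 1: keep only posts relevant to the topic of the article.
    relevant = [p for p in posts if is_relevant(article, p)]
    # Step 2: keep only relevant posts classified as comments.
    comments = [p for p in relevant if is_comment(article, p)]
    # Steps 3 and 4: order the comments by the user's preference
    # (most popular by default) before they are presented.
    return sorted(comments, key=sort_key, reverse=True)
```

Passing a different `sort_key`, for example `lambda p: p.timestamp`, would give the "most recent" ordering instead of the "most popular" one.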

This part of the specification deals with the task of extracting users' microblogs that carry comments on specific news. These are termed "comment-microblogs". The extracted or identified comment-microblogs are then embedded within or appended to online news articles to improve a reader's experience when reading the comment-augmented articles.

The main differences between comments on news extracted from microblogs (comment-microblogs) and the "standard" comments on an online news website can be listed as follows:

1. Nature and Length: comments on news are very limited in length, but still give an impression about the news. The comment can be just a word, an abbreviation, an emoticon, or a short question.

2. Size and diversity: the number of microblogs commenting on a given news article can be much more than the number of comments on the website reporting the news itself. Moreover, this large number of comments is expected to model more personas than one news website.

3. Automatic-importance of a comment: one useful feature in microblogs is the count of the number of times a microblog has been reposted ("retweets" in Twitter terminology). When a celebrity or popular person writes a microblog commenting on given news, typically users recirculate this microblog. These comment-microblogs, when identified, can be highly ranked when displayed to the news article readers, since it will be interesting for readers to know the news and the reaction to that news, especially from popular people or experts.

4. Freshness: Sometimes users start to report news and comment on the news on a microblog even before an official news article is published. The identification of comments enables the publishing of news articles along with freshly extracted comments about that news obtained from the microblogs.

For identifying microblogs that carry comments on news, a machine learning approach is applied. This classifies relevant microblogs on a given news article into two classes: news-reporting microblogs and comment-microblogs. The news-reporting microblogs are those which report the news in similar, different, or incremental forms to that in the online news article. These microblogs usually represent the majority of microblogs relevant to a news article. Only a small portion of the microblogs relevant to news can be described as comment-microblogs. A full description and discussion about the nature of these microblogs is provided below in the section entitled: "Comments on news on microblogs".

The data of the TREC 2011 microblog track was used to create a new test set of microblogs to evaluate the identification task. A set of news articles (publication documents) and relevant comment-microblogs was created from the TREC data. Experimental results showed the effectiveness of this approach to identify comment-microblogs with a high precision and acceptable recall. Moreover, the effectiveness of the system was also evaluated on live news articles and showed superior performance over comments on news extraction from, for example, the Twitter microblog.

Embodiments of the present invention contribute as follows:

Automatic detection of users' comments on news from social media (comment microblogs) and embedding those within online news articles for a better reader experience. Embodiments of the invention couple online news publications such as news articles to microblog comments to enrich the user experience when reading online news by providing users' comment from social media.

Classification of microblogs into a new class of "comment-microblog", which is more general and different in nature than "subjective" text and microblogs investigated in the literature and above references, for example. Embodiments of the invention apply a new classification to microblogs. The classification identifies if the microblogs contain user comment on a given piece of news or not. This classification has common features to classifying microblogs (tweets) into subjective tweets; however, there are fundamental differences as described below.

Application of machine-learning for comment-microblog detection, while investigating a large set of features and analyzing the effect of each.

The data from the TREC microblog track was used to create a test set of microblogs and articles for the evaluation of embodiments of the invention. Briefly, we use the provided topics to get relevant articles, and validate the relevance assessments. Finally, we annotate the relevant set of microblogs into comment/non-comment-microblogs for use with embodiments of the invention.

Comments on News on Microblogs

In this section, we describe the nature of comment-microblogs on news from microblogs such as Twitter. We show the main differences between comment-microblogs and what has been investigated in literature about "subjective" or "opinionated" short text. A list of examples is presented to illustrate the nature of these comment-microblogs. We try to describe comment-microblogs on news in a similar way to that presented in [6], which analyzed the nature of users' comments on online news websites.

Definition

A comment-microblog on a given piece of news on the Twitter microblog is a tweet containing information about a user's response toward the news. This can be in the form of sentiment, opinion, question, rumour, or call for action. This does not include restating, rephrasing, or adding details to the news. The comment can range from a full sentence describing user reaction toward the news, or a simple emoticon or hashtag (#tag).

Among the microblogs relevant to a topic, we treat the microblogs which just restate the news or facts in the news as non-comment-microblogs. Typically these state the headline or a fact in the news and often supply a link to the news article.

Comment-microblogs vs. Subjective-microblogs

Classifying microblogs into subjective and objective tweets was investigated in many research studies. The common definition of the subjective microblogs is those that carry opinion and sentiment rather than facts. This is why most of this research applied polarity sentiment classification after identifying subjective microblogs. Although many comment-microblogs can be classified as subjective microblogs, still many others cannot be considered subjective given the current definition.

The main differences between comment-microblogs and subjective ones can be listed as follows:

1. Comment-microblogs on news should be relevant to the news itself. This is why comment-microblogs are typically composed of two parts: an objective part that carries the news itself, and a response part that carries that user's comment on the news. The objective part that carries the news can be in the form of the news headline, a paraphrase of the news, or a link to the news.

2. A comment-microblog does not have to be a sentiment or opinion. It does not have to be subjective, since it is a general response by the user to the news, which can be a call for action, a wish or a hope of the user, an opening to a discussion, or even a correction of the news itself.

The identification of such microblogs requires the selection of a set of adequate and robust features in order to capture the wide range of comments. As our proposed task eventually intends to send them to a news article, the emphasis is on precision, as too many incorrectly identified comment-microblogs may damage the user experience on a news website.

Later we discuss the possible natures of comment-microblogs on news, while showing some examples for illustration.

The Nature of Comments on News on microblogs, e.g. Twitter

People comment on news websites because of various motivations as presented in [6]. This also applies to comments on news on social media, such as Twitter. This is why a comment can be expressed in numerous ways. Here, we discuss some of these comments on news on Twitter and present possible types of them. Our objective is to better illustrate the definition of comment-microblogs and to demonstrate the possible challenges faced to identify these comments automatically.

Table 1 (Examples of comments on news from Twitter) shows some examples of microblogs that contain comments on given news. Referring to the examples in Table 1, there are some short tweets expressing or exclaiming sentiment without any other input, see example 1 in the table. There are longer tweets expressing opinion, see example 2; funny or sarcastic tweets, see example 3; tweets demanding action are also common, see example 4; and tweets indicating personal involvement in the topic of the news are common, see example 9.

Finding comments among relevant microblogs is sometimes challenging. For example, the hashtag used in example 3 is #MyBigFatGypsyWedding, which is not a proper English word but rather a concatenation of words used to express a funny commentary (or the name of a television programme). On the contrary, another tweet about BBC World Service which has a hashtag #bbcworldservice would not be considered a comment, as it just concatenates named entities from the news. Similarly, considering the news headline of example 8 in Table 1, based only on its words "Al Gore Snowmaggedon", it is difficult to say that comments on "global warming" would be relevant to it. However, the comment itself talks about the quality of the news article rather than the news itself. This demonstrates another type of comment in microblogs.

There are some microblogs difficult even for a human to specify whether they are genuine comment or a restating of news. For example, the microblogs which are re-conveying someone else's message or pointing to a blog post, such as example 7. Sometimes a blog is an expression of comment, while sometimes it is not.

Examples 5 and 6 in Table 1 also demonstrate how the task of finding comments differs from earlier work on subjectivity classification of microblogs. In example 5, the microblog provokes a discussion with an open question and thus does not bear a sentiment; while in example 6, the microblog is sentiment-neutral as it expresses a 'hope'.

These examples illustrate the variety of expression of microblog comments. This makes comment identification a challenging yet interesting task.

From the examples shown in Table 1, some of the microblogs do not indicate a specific sentiment, and may not be considered as subjective expressions by previous work. Nevertheless, these are comments according to our definition. The aim of sentiment classification and the aim of comment identification are different. Sentiment classification aims to predict the sentiment of users about something. The technical result here is to augment publication material with user comment so as to improve engagement with news readers and to enrich their reading experience by bringing to them comments from a rich source like microblogs such as Twitter.

Approach for Comment-microblog Detection

As we showed, comment-microblogs have a wider definition than subjective microblogs. Hence, relying on a term's occurrence can be suboptimal or at least may require a large amount of training data. The approach of embodiments of the invention is based on extracting multiple features from microblogs and possibly news articles to train a single Support Vector Machine (SVM) model that is capable of classifying such comment-microblogs with their various forms. Many of the features extracted are inspired by the literature, and supplemented by additional features to capture more evidence for our classification task.

The features extracted from the microblogs can be categorized into four groups:

1. Microblog-specific features (TS): These are the set of features that are specific to the way of writing the text of a microblog such as a "tweet". These features can have a distribution in comment-microblogs that is different from that in general ones. This is a set of seven binary features, which includes:

a. the presence of a hash-tag (#tag),

b. the presence of a user mention (@some_user),

c. the microblog starts with a user mention,

d. the presence of a link,

e. the presence of words after a link,

f. the presence of a retweet (RT) in the middle of text,

g. incomplete text indicated by the presence of

These features may evidence the presence of a comment in the microblog. For example, the presence of "RT" in the middle of a tweet is an indication that a user added some text at the beginning of an original tweet, which may be a comment. Similarly, when there is text after a link in the microblog, this can be an indication of a comment on the news in the link. This "minor" evidence derived from microblog features can also be combined with other forms of comment evidence and lead to the identification of a comment- microblog.
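A minimal sketch of how features a to f of this group might be computed is given here. The regular expressions and the feature names are our own illustrative assumptions, not a definitive implementation; feature g is left out because its indicator is not legible in the text above.

```python
import re
from typing import Dict

# Illustrative patterns (assumptions, not taken from the specification).
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")


def ts_features(text: str) -> Dict[str, bool]:
    """Binary microblog-specific (TS) features a to f for one microblog."""
    text = text.strip()
    links = list(URL.finditer(text))
    no_links = URL.sub(" ", text)  # avoid matching '#' or '@' inside URLs
    after_last_link = text[links[-1].end():] if links else ""
    return {
        "has_hashtag": bool(HASHTAG.search(no_links)),
        "has_mention": bool(MENTION.search(no_links)),
        "starts_with_mention": bool(MENTION.match(text)),
        "has_link": bool(links),
        # Any word character after the last link counts as trailing text.
        "words_after_link": bool(re.search(r"\w", after_last_link)),
        # "RT" anywhere except at the very start of the microblog.
        "rt_in_middle": any(m.start() > 0 for m in RETWEET.finditer(text)),
    }
```

For a tweet such as "So true! RT @bbc: Snow hits UK", the `rt_in_middle` feature fires because the user has added text in front of the reposted message.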

2. Language-independent features (LI): This set of features consists of non-lexical features that give an indication of the presence of an opinion or a comment in general. It consists of nine binary features that include:

a. the presence of question marks (?),

b. the presence of exclamation marks (!),

c. the presence of underscores (_),

d. the presence of repeated marks (e.g. "??" and "!!!!"),

e. the presence of characters that reveal sentiment, such as emoticons (e.g. ":)", ":(", ":D" ... etc.),

f. the presence of words written in uppercase letters, which can indicate emphasis on some meaning by users,

g. the first letter in the microblog is in uppercase,

h. at least two words are in uppercase,

i. the presence of characters repeated more than twice (e.g. "cooool").

These features and combinations thereof provide evidence of the presence of comments.

3. Lexical features (LX): These features represent the presence of sets of terms of a special nature that are more likely to exist in comments. Seven binary features test the presence of:

a. a singular first-person pronoun in the tweet: I, me, my, and mine,

b. a question word, such as what, why, how, etc.,

c., d., e. sentiment words [15;21]: three lists of positive, negative, and neutral English words (File: subjclueslen1-HLTEMNLP05.tff (http://www.cs.pitt.edu/mpqa/)) were used to produce three different features,

f. social chat abbreviations: a large list of 1,356 social abbreviations, such as BRB (be right back), CU (see you), and FYI (for your information), were obtained for that purpose (http://www.webopedia.com/quick_ref/textmessageabbreviations.asp). This was the longest list of social abbreviations we found. However, we noticed some confusing abbreviations in the list, which can have a destructive effect on the effectiveness of the feature, such as: 182 (I hate you), ARE (acronym rich environment), and SO (significant other). Hence, we applied a pruning step to the list by filtering out all abbreviations formed of digits only, and all those that appear in the top 10% frequent words in English vocabulary (we use the English Aspell dictionary as our vocabulary). The final number of abbreviations remaining after the pruning step was 1,298,

g. emotive words: a list of 100 terms was obtained for this feature (http://www.dailywritingtips.com/100-mostly-small-but-expressive-interjections/). The list included terms that express a certain emotion, such as: boo, ew, ha-ha, uh, and yay.
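The language-independent features and the pruning of the chat-abbreviation list might be sketched as follows. The patterns, the function names and the rule that an "uppercase word" has at least two letters are illustrative assumptions; the `frequent_words` set stands in for the top 10% most frequent dictionary words, which are not reproduced here.

```python
import re
from typing import Dict, Iterable, Set

# Small illustrative patterns (assumptions); links are not stripped here.
EMOTICON = re.compile(r"(?<!\S)[:;=][-']?[)(DPp](?!\w)")
REPEATED_MARK = re.compile(r"[?!]{2,}")
REPEATED_CHAR = re.compile(r"([A-Za-z])\1{2,}")  # same letter 3+ times in a row


def li_features(text: str) -> Dict[str, bool]:
    """Nine binary language-independent (LI) features for one microblog."""
    words = re.findall(r"[A-Za-z]+", text)
    # Assumption: single letters such as "I" do not count as uppercase words.
    upper_words = [w for w in words if len(w) > 1 and w.isupper()]
    first = text.lstrip()[:1]
    return {
        "question_mark": "?" in text,
        "exclamation_mark": "!" in text,
        "underscore": "_" in text,
        "repeated_marks": bool(REPEATED_MARK.search(text)),
        "emoticon": bool(EMOTICON.search(text)),
        "uppercase_word": len(upper_words) >= 1,
        "starts_uppercase": first.isupper(),
        "two_uppercase_words": len(upper_words) >= 2,
        "repeated_chars": bool(REPEATED_CHAR.search(text)),
    }


def prune_abbreviations(abbreviations: Iterable[str],
                        frequent_words: Set[str]) -> Set[str]:
    """Drop digit-only abbreviations and those that are frequent words."""
    kept = set()
    for abbr in abbreviations:
        if abbr.isdigit():                  # e.g. "182"
            continue
        if abbr.lower() in frequent_words:  # e.g. "ARE", "SO"
            continue
        kept.add(abbr.upper())
    return kept


def has_chat_abbreviation(text: str, abbreviations: Set[str]) -> bool:
    """Lexical feature f: does any token match a pruned abbreviation?"""
    tokens = re.findall(r"[A-Za-z0-9]+", text.upper())
    return any(token in abbreviations for token in tokens)
```

With the three confusing entries named above, `prune_abbreviations(["BRB", "182", "ARE", "SO"], {"are", "so"})` keeps only `"BRB"`.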

4. Topic-dependent features (TD). This list of features captures the information about how a relevant microblog is similar to the news article, and how this correlates with whether the user is restating the news or not. Four float features are in this set, which are the dot product and the weighted dot product between the tweet text vector and the article title and body. Weighted dot product is similar to the dot product, but each term is weighted by its TFIDF value. The IDF of terms was computed from the TREC tweets collection. These four sets of features were extracted from a training set of annotated tweets and used to build and train an SVM [9] to create a model for classification.
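The four topic-dependent features can be illustrated with a short sketch. The tokenizer, the IDF formula log(N/df) and the reading of "weighted" as multiplying each term count by its IDF on both sides are assumptions made for illustration; the specification does not fix these details, and the function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, Iterable, List, Optional


def tokenize(text: str) -> List[str]:
    """Lowercase alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_table(documents: Iterable[str]) -> Dict[str, float]:
    """IDF per term, computed here as log(N / document frequency)."""
    docs = [set(tokenize(d)) for d in documents]
    df = Counter(term for doc in docs for term in doc)
    return {term: math.log(len(docs) / count) for term, count in df.items()}


def dot(a: Counter, b: Counter,
        idf: Optional[Dict[str, float]] = None) -> float:
    """Dot product of two term-count vectors, optionally TF-IDF weighted."""
    total = 0.0
    for term in a.keys() & b.keys():
        weight = 1.0 if idf is None else idf.get(term, 0.0)
        total += (a[term] * weight) * (b[term] * weight)
    return total


def td_features(tweet: str, title: str, body: str,
                idf: Dict[str, float]) -> Dict[str, float]:
    """Four float features: plain and weighted similarity to title and body."""
    tv, hv, bv = (Counter(tokenize(x)) for x in (tweet, title, body))
    return {
        "dot_title": dot(tv, hv),
        "dot_body": dot(tv, bv),
        "weighted_dot_title": dot(tv, hv, idf),
        "weighted_dot_body": dot(tv, bv, idf),
    }
```

A tweet that merely restates the headline would be expected to score high on the title features, whereas a comment with little overlapping vocabulary would score low.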

Experimental Setup

The nature of the task requires a test dataset that is composed of a set of news articles and a set of relevant microblogs to each of the news in these articles. These relevant microblogs are annotated as either being a comment or not. In this section we describe the procedure we used to obtain a test dataset with these specs. Furthermore, we discuss the evaluation methodology used for evaluating the effectiveness of our comment-microblogs identification approach.

Data Collection

The TREC-2011 microblog track data was used as a collection of microblogs [13]. The collection originally consisted of around 14 million tweets crawled in the period between the 25th of January 2011 and the 8th of February 2011. The tweets collection contained tweets in multiple languages. A set of 50 topics was provided to the participants in the track as the test set. All topics were in English and the track organizers considered only the English tweets for the relevance assessments, while considering any tweet in a different language to be irrelevant [13]. The topics were expressed as short queries that were typically a few words long. Each topic was associated with a query-time, which is the time of querying this topic on Twitter, and the task was to find relevant English tweets that were posted only before the time of the topic. Figure 5 shows an example of the microblog track topics. Relevance judgments constitute the manual assessments of which tweets are relevant to which topics. The evaluation metric used for evaluation was precision at 30 (P@30), which was picked based on the assumption that users usually check no more than 30 tweets per query. The number of relevant microblogs per topic ranged from zero (for topic 50) up to 200 relevant microblogs. The topic that had zero relevant microblogs was later excluded from the evaluation, leaving only 49 topics that have relevant microblogs in the collection. The relevance assessments of the 49 topics contain judgments for more than 40k tweets, out of which 3k are relevant.

Since only the English tweets in the collection were considered, we used a language detection tool (http://code.google.com/p/language-detection/) for identifying the language of microblogs. This led to a collection of roughly 5 million English tweets.

Preparing News Articles

The test set provides a set of news articles instead of the topics of the TREC 2011 microblog track data. Therefore, we used the 49 topics in the test set to search Google for relevant articles in the period of time of the collection. We searched the web with the 49 topics using a time-restricted search, where the results were restricted to be published in the period starting from the 25th of Jan 2011 (the earliest tweet date in the collection) till the topic query-time. This assures retrieving web results in the period of the tweets collection and earlier than the query-time. A relevant article was manually selected for each of the 49 topics. We noticed sometimes that some of the topics did not have any kind of news article in this period; therefore, we selected the most relevant webpage to the topic instead. Figure 6 shows an example of the selected articles, which is for topic "3" presented in Figure 5.

The selected news articles and webpages for each of the topics were found to be sometimes more focused than the topic. For example, topic "2" was 2022 Fifa Soccer. The selected article in the period of the tweet collection was about Qatar's 2022 FIFA World Cup Stadiums are Eco Marvels, which is a subtopic of the wider TREC microblog topic. This occurred for some of the topics in the test set, which led us to reconsider the relevance assessments by validating that the relevant microblogs to the TREC microblog topics remain relevant to the selected article.

Data Annotation for Relevance and Comments

The test set created of the news articles that correspond to the microblog topics required two additional annotations to obtain the comment-microblogs identification test set; namely, the tweets relevance to the selected articles and whether the relevant microblogs contain comments or are just stating news.

Microblog relevance annotation

As mentioned earlier, there may be a shift in the focus of the news article relative to the original TREC topic. This led us to apply relevance validation to the whole set of microblogs relevant to the topics. The relevance validation process reduced the number of relevant microblogs by nearly 600 tweets that were relevant to the TREC topic but not relevant to the more focused news article. Moreover, we applied additional retrieval runs for searching the tweets collection using the articles to enrich the test set with additional relevant microblogs that may not have been captured in the relevance assessments prepared by the TREC track. This step was important for enriching the test set since all the tweets assessed by the track organizers were those retrieved by different participants using the original topics [13].

The English tweets collection was indexed using the Indri toolkit [17]. Queries were prepared from the 49 articles using the article headline and sub-headlines, if they exist. Four runs were performed to search the collection:

- HL: queries are the news article headline

- HLS: queries are formulated from the article headline and sub-headline

- HLFB: similar to HL + pseudo relevance feedback (PRF) was performed using 50 tweets and 10 terms for the feedback process

- HLSFB: similar to HLS + PRF performed

The top 30 retrieved tweets of each topic were considered for assessment, similarly to the track. Retrieval results were merged for the four runs and compared to the existing relevance assessments of the track to judge the ones that were not assessed by the track. An average of 18 tweets per topic did not receive a relevance assessment by the track and required our manual assessment. Our assessment led to the addition of 347 relevant microblogs to the existing relevance assessments by the track.
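The merging of the four runs and the selection of tweets still needing a manual judgment can be sketched for a single topic as follows. The function name, the data layout (a ranked list of tweet identifiers per run) and the identifiers in the usage note are hypothetical; the retrieval itself is not shown.

```python
from typing import Dict, List, Set


def unjudged_pool(runs: Dict[str, List[str]], judged: Set[str],
                  depth: int = 30) -> List[str]:
    """Merge the top `depth` tweets of each run and keep the unjudged ones."""
    pool: List[str] = []
    seen: Set[str] = set()
    for ranking in runs.values():
        for tweet_id in ranking[:depth]:
            if tweet_id not in seen:  # a tweet retrieved by several runs counts once
                seen.add(tweet_id)
                pool.append(tweet_id)
    # Tweets already assessed by the track keep their existing judgment.
    return [tweet_id for tweet_id in pool if tweet_id not in judged]
```

For instance, with `runs = {"HL": ["t1", "t2", "t3"], "HLS": ["t2", "t4"]}` and `judged = {"t1", "t4"}`, only `t2` and `t3` would be passed to the assessors.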

The final total number of relevant microblogs for the 49 news articles was 2900.

Comment-microblogs Annotation

The set of microblogs relevant to the news articles was then manually annotated as comments or not. The 49 topics were divided among three annotators to manually tag the relevant microblogs that represent a user comment on the news article. The annotators were supplied with clear guidelines for tagging the comments. After the annotation process, the annotators sat together to discuss any doubtful cases of microblogs whose tagging any of them was not confident about. The final annotation led to the tagging of 607 of the relevant microblogs as comments, which represents only 20% of the relevant microblogs. It was found that eight of the news articles did not have any tweets that contain comments. These eight articles were useful to test the situation when no comments should be identified for a given article. Figure 7 plots the total number of relevant microblogs for each news article sorted in ascending order. The microblogs which were tagged as comments are plotted in gray; these are the comment-microblogs. The black plots the number of non-comment-microblogs.

Evaluation Methodology

Unlike prior work in subjective and sentiment classification for microblogs that relied on using classification accuracy for evaluating their approaches, embodiments of the invention measure the performance using precision and recall as benchmarks. Accuracy would be an adequate measure if the two classes had a comparable number of samples. However, since comment-microblogs represent only a small portion of the dataset (only 20%), accuracy would be 80% if we classified all tweets to be non-comment. Therefore, we believe that calculating the precision and recall for detecting the comment-microblogs would be a more meaningful measure to evaluate the system performance.
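The argument against accuracy can be checked with a few lines of arithmetic, using the counts reported for the test set (2900 relevant microblogs, of which 607 are comments). The function name is our own; it simply scores a trivial classifier that labels every microblog as a non-comment.

```python
from typing import Optional, Tuple


def all_negative_baseline(n_total: int,
                          n_comments: int) -> Tuple[float, Optional[float]]:
    """Accuracy and comment recall when nothing is labelled as a comment."""
    tp = 0                     # no comment is ever identified
    fn = n_comments            # every true comment is missed
    tn = n_total - n_comments  # every non-comment is labelled correctly
    accuracy = (tp + tn) / n_total
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return accuracy, recall


accuracy, recall = all_negative_baseline(2900, 607)
print(f"accuracy={accuracy:.3f}, recall={recall}")
```

The baseline reaches an accuracy of roughly 79% while recalling none of the comments, which is why accuracy alone would be misleading here.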

For the evaluation, we calculate both precision and recall using two methods:

1. Overall precision and recall, where the values are calculated over identifying the comment-microblogs as a whole without considering which tweets relate to which of the topics. These measures will indicate the performance of the system on the microblog level.

2. Precision and recall on the topic level, where precision and recall are calculated for each topic separately, and then the average of scores is calculated. These measures will indicate the performance of the system on the news article level regardless of how many microblogs are relevant to it.

Precision (P) and recall (R) are calculated as follows:

$$P = \frac{TP}{TP + FP}$$

$$R = \frac{TP}{TP + FN}$$

where: TP are the correctly identified comments, FP are the falsely identified comments (non-comments classified as comments), and FN are the missed comments (comments that were not identified).

For calculation method 2, there were some situations where no comments were identified for a given article. This means that the TP and FP values are zero, so that P = 0/0, which is undefined. Similarly, for the articles which did not contain any relevant comment-microblogs, TP and FN are zero and the recall value is also undefined. To overcome such cases, we calculated precision and recall per topic t as follows:

$$P(t) = \begin{cases} \dfrac{TP(t)}{TP(t) + FP(t)} & \text{if } TP(t) + FP(t) > 0 \\ 1 & \text{if } TP(t) = FP(t) = 0 \end{cases}$$

$$R(t) = \begin{cases} \dfrac{TP(t)}{TP(t) + FN(t)} & \text{if } TP(t) + FN(t) > 0 \\ 1 & \text{if } TP(t) = FN(t) = 0 \end{cases}$$

The previous equations assure the avoidance of any undefined values for precision and recall. They also give fair precision and recall values when no comments are identified for a given topic. When no comments are identified either correctly or incorrectly for a given topic, this gives a precision of one, since no false comments are identified to annoy the user who checks the comments on the news. However, at the same time, if this article has annotated comments that were not identified, the recall will be equal to zero. Similarly, the recall for articles in the test set that do not have any annotated comments will be equal to one if no microblogs were identified as comments.
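The two calculation methods, including the convention for empty cases, can be sketched as follows. The function names and the `(TP, FP, FN)` tuple layout are our own choices for illustration.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (TP, FP, FN) for one topic


def precision(tp: int, fp: int) -> float:
    """Precision, defined as 1 when nothing was identified as a comment."""
    return tp / (tp + fp) if (tp + fp) > 0 else 1.0


def recall(tp: int, fn: int) -> float:
    """Recall, defined as 1 when the topic has no annotated comments."""
    return tp / (tp + fn) if (tp + fn) > 0 else 1.0


def overall_scores(counts: Dict[str, Counts]) -> Tuple[float, float]:
    """Method 1: pool the counts of all topics, then compute P and R."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return precision(tp, fp), recall(tp, fn)


def topic_level_scores(counts: Dict[str, Counts]) -> Tuple[float, float]:
    """Method 2: compute P and R per topic, then average over topics."""
    n = len(counts)
    avg_p = sum(precision(tp, fp) for tp, fp, _ in counts.values()) / n
    avg_r = sum(recall(tp, fn) for tp, _, fn in counts.values()) / n
    return avg_p, avg_r
```

A topic with counts `(0, 0, 5)`, where five annotated comments exist but none was identified, contributes a precision of one and a recall of zero to the topic-level averages.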

Experimental Runs

Due to the limited number of training examples, we applied cross-validation for training and testing the comment-microblogs classification approach [7]. We apply an extensive leave-one-out cross-validation (LOOCV) on the topic level. For all experiments, we use the relevant microblogs of 48 news articles for training a model, and then use the model for classifying the relevant microblogs of the remaining article. For evaluating the features set, we tested the classification using different combinations of the four groups of features as follows:

Run1: TS (microblog-specific features)

Run2: LI (language-independent features)

Run3: LX (lexical features)

Run4: TS+LI

Run5: TS+LX

Run6: LI+LX

Run7: TS+LI+LX

Run8: TS+LI+LX+TD (all features including topic-dependent features)
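The topic-level leave-one-out procedure and the eight feature combinations can be sketched as follows. This is a structural sketch only: `train_fn` is a hypothetical placeholder for training the SVM (it receives labelled feature vectors and returns a prediction function), and the layout of the per-topic data is an assumption.

```python
from itertools import chain
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

Features = Dict[str, List[float]]  # feature values keyed by group name
Example = Tuple[Features, bool]    # (features, is_comment)
Model = Callable[[List[float]], bool]

RUNS: Dict[str, Tuple[str, ...]] = {
    "Run1": ("TS",), "Run2": ("LI",), "Run3": ("LX",),
    "Run4": ("TS", "LI"), "Run5": ("TS", "LX"), "Run6": ("LI", "LX"),
    "Run7": ("TS", "LI", "LX"), "Run8": ("TS", "LI", "LX", "TD"),
}


def select(features: Features, groups: Sequence[str]) -> List[float]:
    """Concatenate the feature values of the chosen groups into one vector."""
    return list(chain.from_iterable(features[g] for g in groups))


def leave_one_topic_out(
    data: Dict[str, List[Example]],
) -> Iterator[Tuple[str, List[Example], List[Example]]]:
    """Yield (held-out topic, training examples, test examples) per topic."""
    for held_out in data:
        train = [ex for topic, exs in data.items() if topic != held_out
                 for ex in exs]
        yield held_out, train, data[held_out]


def cross_validate(
    data: Dict[str, List[Example]],
    groups: Sequence[str],
    train_fn: Callable[[List[Tuple[List[float], bool]]], Model],
) -> Dict[str, List[bool]]:
    """Train on all topics but one and predict the held-out topic's tweets."""
    predictions: Dict[str, List[bool]] = {}
    for held_out, train, test in leave_one_topic_out(data):
        model = train_fn([(select(f, groups), y) for f, y in train])
        predictions[held_out] = [model(select(f, groups)) for f, _ in test]
    return predictions
```

Calling `cross_validate(data, RUNS["Run7"], train_fn)` for each entry of `RUNS` would produce the per-topic predictions from which precision and recall are then computed.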

Our aim behind all these runs was to understand the effect of each of the feature sets on the system performance for both precision and recall. Certainly, the ideal objective is to maximize both precision and recall. However, such a task of bringing comments on news from Twitter to news readers can be seen as a more precision-oriented task, where missing comments from Twitter about the news will not leave as negative an impression on readers as presenting a news-stating tweet to the readers while claiming that it is a comment. Therefore, our objective is to achieve a high precision with the highest possible recall.

System Performance

Classification Results in the Test Set

Table 2 reports the results of classification of our approach using different combinations of the four groups of features. As shown in Table 2, the tweet-specific (TS) features alone have a poor classification performance, failing to achieve a high precision and achieving a low recall. The language-independent (LI) features set led to an acceptable recall, but a low precision. Adding TS features to LI did not lead to any improvement over the LI features. The lexical (LX) features showed effectiveness in precision, but a low recall, and again, adding the TS features to them did not lead to an improvement. The LI and LX features together led to a relatively high recall (57.3%), but a low precision (65.3%), which is a bad result for a precision-oriented task. Using all topic-independent features (TS+LI+LX) achieved a high precision of 80.9% and the highest recall of 57.8%. Adding the topic-dependent (TD) features to the set of features led to a significant improvement in precision to reach 88.6%, but the recall dropped to 50.3%, which was found to be significantly lower than that achieved by the features without the TD ones. The change for both precision and recall was found to be significant using a 2-tailed t-test with confidence level over 95%.

Based on the results in Table 2, the best configuration for the classification system can be selected between the last two, namely using all the feature sets with or without the topic-dependent features, according to the desired output. Although the achieved recall for the best runs was 50-60% of the existing comment-microblogs, this result is seen as positive and suitable for a live practical running system, since the task is seen to be challenging and the nature of the task is precision-oriented. Achieving such high precision and acceptable recall for this task, where only 20% of the data set is the target, is seen to be a good result. However, a detailed further investigation is still required.

Discussion

Here we try to understand the factors which cause a large portion of the comment-microblogs not to be recalled. The factors are identified below. Accommodating these factors increases the proportion of recalled comment-microblogs, allowing the classifier to capture or recall more comment-microblogs without affecting the achieved high precision. Also, we check the microblogs falsely classified as comments, to see why some non-comment-microblogs are classified as comments.

Analyzing the false negatives for the classifier, we noticed the following reasons for not being able to detect them:

1. Unavailability of useful features to do meaningful classification: a. The comment-microblog is so short that it is difficult to extract useful features from it. E.g. "the rite wasn't even all that". b. In a longer tweet, the manner of expression does not offer many features to be extracted. E.g. "It's not all bad news for jobs. Unemployment rate fell from 9.4% to 9%. Avg. hourly earnings up 0.4%. Nov. & Dec job growth revised higher." Only two features were present for this tweet: the sentiment word "bad" and the beginning with a capital letter "It".

2. Though the classifier uses the presence of words from numerous lexicons, it does not have enough coverage of cuss-like words, e.g. "bloody", "suck" ... etc.

3. Though smileys, elongated words and the presence of an "emotion" word may individually indicate that a tweet is a comment, they do not occur frequently enough in the corpus to give the classifier enough evidence of their strength as discriminators. Perhaps with a larger amount of training data, this problem could be solved.

4. The classifier detects sentiment words even if expressed in a hashtag. However, there are more complicated hashtags that require additional parsing, such as "#WeAreHereForFreedom" and "#sleepswithfishes". These compound hashtags require additional processing to detect if they carry any sentiment.
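One possible form of such additional hashtag processing is sketched here purely as an illustration; it is not part of the classifier evaluated above. The sketch splits a hashtag on capital letters and, for all-lowercase runs, tries a dictionary-based segmentation that prefers the fewest words. The `vocabulary` argument and all function names are hypothetical.

```python
import re
from typing import List, Optional, Set


def split_camel_case(tag: str) -> List[str]:
    """Split '#WeAreHere' style hashtags on capital letters and digits."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", tag.lstrip("#"))


def segment(word: str, vocabulary: Set[str]) -> Optional[List[str]]:
    """Split `word` into vocabulary words (fewest words wins), else None."""
    best: List[Optional[List[str]]] = [None] * (len(word) + 1)
    best[0] = []
    for end in range(1, len(word) + 1):
        for start in range(end):
            prefix = best[start]
            if prefix is not None and word[start:end] in vocabulary:
                candidate = prefix + [word[start:end]]
                current = best[end]
                if current is None or len(candidate) < len(current):
                    best[end] = candidate
    return best[len(word)]


def hashtag_words(tag: str, vocabulary: Set[str]) -> List[str]:
    """Lowercased words of a compound hashtag, for lookup in the lexicons."""
    words: List[str] = []
    for part in split_camel_case(tag):
        pieces = segment(part.lower(), vocabulary)
        # Keep the part unchanged when no segmentation is found.
        words.extend(pieces if pieces else [part.lower()])
    return words
```

The resulting words could then be checked against the sentiment and emotive word lists in the same way as ordinary tokens.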

On the other hand, analyzing the false positives, which are tweets incorrectly identified as comments, we noticed the following factors behind misclassification:

1. A tweet that contains a question whose answer is in the article. E.g. "Why is it so cold, if global warming is such a big deal? http://bit.ly/hbFv8X", where the link refers to an article on global warming. Here the question does not represent any response from the user to the news, and hence it is not considered a comment.

2. A tweet that contains the headline of an article relevant to the news, where the headline is in the form of a comment. E.g. "Maddow's Excuse for Reporting Spoof Story as Fact: It's Beck's Fault! http://bit.ly/euNxOv". This kind of problem is automatically resolved by the topic-dependent features when the tweet and the news article share the same headline. However, when a different article rephrases the news in a comment-like way, it becomes difficult for our classifier to notice this.

The presence of more training examples and additional analysis of the data will improve the classifier for the task of "comments on news" detection from social media. The present classifier system is an effective tool for news websites to significantly improve their visitors' experience by providing them with public comments on the news from microblogs such as Twitter, for example.

System Validation on Live News Articles

After testing the performance of the comment classification system on the test set, it is important for us to validate its performance on live news articles. For this experiment, we selected 15 articles on popular news published at the end of July and the beginning of August 2012. We took the articles from different popular news websites, such as CNN, NY-Times, The Economist, Reuters, and Al-Jazeera. To retrieve microblogs relevant to the news articles, we used the article URL to search Twitter for tweets linking to the article. This guarantees that the retrieved tweets are all relevant to the news article without the need for any relevance assessment. We used the tweet4j package (http://code.google.com/p/tweet4j/) to retrieve the tweets that link to the news articles. The package is limited to retrieving at most 100 tweets. Therefore the maximum number of retrieved tweets for a given article was 100.

Applying the retrieval process, we retrieved a total of 1,384 tweets for the 15 articles, since some of the articles did not have 100 tweets linking to them on Twitter.

We then applied our classification system to the retrieved tweets using all the feature sets. The classification model was trained using the 49 articles and their relevant microblogs in our test set. Out of the 1,384 tweets, only 99 tweets were classified as comments. Only two articles out of the 15 did not have any identified comments. To evaluate the performance of the classification system, we calculated the precision by checking how many of the identified comment-microblogs are correct. Calculating the recall would require much effort, since we would have to annotate the full 1,384 tweets; but since our task is precision-oriented, calculating precision was sufficient for us to measure the system performance.

Surprisingly, we found that among the 99 tweets classified as comments, only four were wrong. All the wrong tweets were identified for the same article. Another annotator validated the result; he identified only these four tweets as doubtful cases and confirmed the remaining 95 tweets to be comments.
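The precision implied by these counts can be worked out directly. The short sketch below uses only the figures stated above (99 tweets classified as comments, four of them wrong); the function name `precision` is illustrative. Recall is not computed, since that would require annotating all retrieved tweets.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of items classified as comments that really are comments."""
    predicted = true_positives + false_positives
    if predicted == 0:
        raise ValueError("no items were classified as comments")
    return true_positives / predicted


# 99 tweets were classified as comments; 4 were judged wrong, 95 correct.
live_precision = precision(true_positives=95, false_positives=4)
print(f"{live_precision:.3f}")  # prints 0.960
```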

Table 3 shows the headlines of all 15 news articles, the number of retrieved tweets for each, the number of identified comments for each, and sample comments from those identified. Article number 10 is the one that had four misclassified comment-microblogs. The four tweets were identical, since they were all retweets of each other. This tweet is shown in Table 3, article 10, sample 2. As shown, the tweet added the sentence "It's gold! Stanning and Glover end the wait for Team GB" to the link, which the annotators considered to be a rephrasing of the news in the article without any indication of a comment.

The sample tweets shown in Table 3 demonstrate the effectiveness of the classification system we developed, since, as is clear in the table, many different types of comments have been identified. Some interesting identified comments are the samples of article 3, which do not carry any sentiment or opinion but rather give advice and ask a question. Another is the sample comment of article 5; the comment here is just repeated question marks, which show objection to or extreme disapproval of the news. Reviewing many of the sample comments in Table 3 illustrates the main differences between the comment-microblog classification task and the subjective classification of tweets studied in prior work.

Another remarkable feature that has come about from embodiments of the invention, identifying comments about news from microblogs and augmenting news articles, is illustrated for example in sample 1 of article 3: "What do #India and #Pakistan have in common? The need for effective regulation of their respective electrical grid. http://t.co/8IOjbQwE". We noticed that this comment-microblog was repeated seven times as retweets in the identified comments. This gave us a hint that it might be an important comment. On checking the user who posted this comment, it turned out to be "Philip J. Crowley", who describes himself on Twitter as "Fellow at The George Washington University Institute for Public Diplomacy and Global Communication and Commentator for the BBC and Daily Beast", and who has more than 46,000 followers. The automatic detection of this comment, and the detection of its importance from the number of retweets, can be a very interesting feature for readers of the article on the news website, who can find this comment, by a famous person who is a specialist, ranked at the top of the identified comments from Twitter on the news article. Such a feature does not exist with the current way of commenting on the news by commenting on a news website directly: experts and celebrities typically prefer to express their thoughts on their own social network.

In fact, the samples of the live articles presented in this section demonstrate all the unique and significant features of the system embodying the present invention over the state-of-the-art in microblog classification and comments on news analysis.

The previous part of the specification dealt with the task of extracting users' microblogs that carry comments on specific news. Embodiments of the invention provide for detecting users' comments on news on microblogs such as Twitter and can present those comments as an augmentation of the news articles to enrich a reader's experience.

The construction and implementation of the SVM classifier embodying the present invention provides a unique mechanism for the classification of microblogs into comment-microblogs and non-comment-microblogs.

The set of features which provide indicators of the presence of comments is discussed in detail, their effectiveness is tested and the results are provided. We built a test set for evaluating the task and described the evaluation methodology to be used. The results showed the high performance of the classifier in detecting comment-microblogs precisely. We provided an analysis of the reasons why some of the comment-microblogs are not captured by the classifier. We demonstrated the superior performance of our system on live news data and showed remarkable features that make our task novel and our approach effective.

Embodiments of the invention can be provided as a standalone tool or as an enterprise solution for use by press agencies, news delivery entities and publishers of electronically rendered documentation and/or digital media. The system, method and/or tool can also be implemented by other means such as applets, apps and bespoke desktop solutions.

Referring to Figure 1, the outcome of using embodiments of the invention is an enhancement of digital media such as a publication document with microblog user comments which have been identified as comments and as relevant to the topic of the publication document. The comments 4' are pulled from a repository of microblogs (or a microblog feed) to the publication document website rather than being pushed to the site directly by users, although this push route can be taken in parallel with examples of the invention. The enhancement delivers comments 4' to a news article that are more relevant than pushed comments and less biased or skewed. Further, there is no time delay, as a fresh news article can be posted and immediately populated with microblog comments 4' from the repository or microblog feed 3. It can be seen that embodiments of the present invention with access to a repository of microblog comments allow a fresh article to be published already incorporating the comments of users from the social websites where the users have been blogging on the topic of the news article directly. There is no need to wait for readers of the news article to comment on the article itself. The embodiments of the invention "pull" relevant comments 4' from the microblog repository 3 or feed 3.

The comments 4' in the repository 3 are numerous (greater in number than if comments were solicited from a single site) and offer comment diversity, since the comments are not influenced by the persona or character of the news site. Embodiments of the invention have the microblog sphere as the pool of comments on a particular topic. Thus, the pool comprises a great many, if not all, of the comments in the microblog sphere on the news topic. Access to this extensive pool of comments leads to an increase in the diversity of the comments and enriches the experience of the reader of the news (and comments), who gets a balanced picture of the different opinions and comments on a topic.

Since the number of comments from microblogs is expected to be large, one or more of the following can be applied: normalising similar comments (near duplicates) into one and presenting the number of similar comments; and/or clustering and summarising comments of similar sentiment and presenting them with the number of comments in each cluster.

Embodiments of the invention can be tailored (by the classification system selected by a user or as default settings) to deliberately include as comments 4' top comments from experts or celebrities on a particular topic. The classification "expert" or "celebrity" can be selected by a user interacting with the news article, and the tool embodying the present invention can access the repository of microblogs in the selected classification. These "classified" comments 4' are then linked or appended to the news article to show the "expert" opinion on a piece of news or incident from the "expert's" own microblog. Embodiments of the present invention link the publication document to such classified comments 4', allowing these comments 4' to be included as comments to the news article on a news website. This significantly enriches the content of the article: readers can see the news, the comments of people in general, and the comments of the experts and celebrities or of the most popular microblogs (top shared, reposted, or liked). The classification system is used to generate a number of "channels" of comment 4' streams from the microblog repository or feed 3.
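The normalisation of near-duplicate comments mentioned above can be sketched as follows. This is only one simple way to do it, given here as an assumption rather than as the method of the specification: retweet prefixes and links are stripped, the text is lowercased, and comments with the same normalised text are merged and counted. The function names and the sample user name are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def normalise(text: str) -> str:
    """Reduce a comment to a key that is shared by its near duplicates."""
    text = re.sub(r"^(RT\s+@\w+:?\s*)+", "", text.strip(), flags=re.IGNORECASE)
    text = re.sub(r"https?://\S+", "", text)       # shortened links differ per post
    text = re.sub(r"[^\w#@]+", " ", text.lower())  # drop punctuation and extra spaces
    return text.strip()


def collapse_near_duplicates(comments: List[str]) -> List[Tuple[str, int]]:
    """Merge near duplicates; return (first-seen comment, count), largest count first."""
    groups: Dict[str, List] = {}
    for comment in comments:
        key = normalise(comment)
        if key in groups:
            groups[key][1] += 1
        else:
            groups[key] = [comment, 1]
    return sorted(((rep, n) for rep, n in groups.values()), key=lambda item: -item[1])


comments = [
    "What do #India and #Pakistan have in common? http://t.co/a",
    "RT @someone: What do #India and #Pakistan have in common? http://t.co/b",
    "Funny! LOL! http://t.co/c",
]
merged = collapse_near_duplicates(comments)
```

Here the first two comments collapse into one entry with a count of 2, which can be presented to the reader alongside the comment.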

A preferred embodiment of the present invention is shown in figure 2. This is an implementation of the high level embodiment shown in figure 1. The figure 2 system works as follows: When a news article 1 is posted on a news website, the topic (PT) of the article 1 is extracted, which in its simplest form can be the title of the article. Posts 4 from a microblog website are checked for any posts 4 that match the topic (PT) of the article 1. Initially, the posts 4 in a specified time period (an hour for example) are checked, and relevant posts are identified and retrieved. Subsequently, the microblog is monitored for any more recent relevant posts and updated accordingly. Relevant posts (microblogs 4) are classified into two categories:

1) comments 4'; and 2) non-comments, which can be just stating the news.

Only posts classified as comments 4' on the news are used. Classification into 1) comments; and 2) non-comments is preferably processed using rules (a rule-based method) or using automatically built models (a machine learning method).
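A rule-based variant of the comment/non-comment classification could be sketched as below. This is a toy illustration under stated assumptions, not the SVM classifier evaluated earlier in this specification: the word list `OPINION_WORDS`, the patterns and the `min_evidence` threshold are all hypothetical, and only loosely follow the kinds of evidence discussed above (opinion words, smileys, elongated words, repeated punctuation).

```python
import re

# Hypothetical opinion lexicon, for illustration only.
OPINION_WORDS = {"funny", "awesome", "bad", "lol", "honestly"}

SMILEY = re.compile(r"(?<!\w)[:;=]-?[)(DPp]+")  # e.g. :) ;-) :)))
ELONGATED = re.compile(r"([a-zA-Z])\1{2,}")     # e.g. "sooo"
REPEATED_PUNCT = re.compile(r"[!?]{2,}")        # e.g. "!!!" or "??"
URL = re.compile(r"https?://\S+")


def is_comment(post: str, min_evidence: int = 1) -> bool:
    """Return True when enough rules fire to treat the post as a comment."""
    text = URL.sub("", post)  # links carry no opinion evidence
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    evidence = sum([
        bool(tokens & OPINION_WORDS),
        bool(SMILEY.search(text)),
        bool(ELONGATED.search(text)),
        bool(REPEATED_PUNCT.search(text)),
    ])
    return evidence >= min_evidence
```

For instance, `is_comment("Funny! LOL! http://t.co/lvKOz3p8")` returns True because of the opinion words, while a bare restatement of a headline followed by a link triggers no rule and is treated as a non-comment. Raising `min_evidence` trades recall for precision.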

The popularity of "comment" content 4' is then calculated. Popularity can be a function of the number of repostings, likes, replies or comments, or even the popularity of the user posting the comment, measured as the number of friends/followers. The popularity function can be estimated manually or statistically.
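One possible manually estimated popularity function is a weighted sum of the signals listed above. The sketch below is an assumption made for illustration: the `Comment` fields, the weights in `WEIGHTS` and the logarithmic damping of the follower count are hypothetical choices, not values given in this specification.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Comment:
    text: str
    reposts: int = 0
    likes: int = 0
    replies: int = 0
    followers: int = 0  # popularity of the posting user


# Hypothetical, manually chosen weights; they could instead be fitted statistically.
WEIGHTS: Dict[str, float] = {"reposts": 2.0, "likes": 1.0, "replies": 1.5, "followers": 0.5}


def popularity(c: Comment, w: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of engagement counts plus a damped follower count."""
    return (
        w["reposts"] * c.reposts
        + w["likes"] * c.likes
        + w["replies"] * c.replies
        + w["followers"] * math.log10(1 + c.followers)
    )


def rank_by_popularity(comments: List[Comment]) -> List[Comment]:
    """Most popular comments first, ready to be shown in the fixed 'popular' area."""
    return sorted(comments, key=popularity, reverse=True)
```

The logarithm keeps a user with tens of thousands of followers from swamping the engagement counts entirely, while still lifting that user's comment above otherwise similar ones.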

The designated "popular" comments 4' are posted on the page of the news article separately in a fixed place clearly distinguished from the news article and other comments, if present. Conventional "comment" posts are posted with the news article, and sorted by time of post.

The "popular" comments 4' are updated depending either on the number or age of the posts. Some "old" comments 4' can gain in popularity over time on the micro-blog website.

An example of a final layout of the news article 1 website page is shown in figure 3. The news article 1 has two sets of comments 4' derived from the microblog repository 3, representing two separate classification "channels" reflecting posted comments 4'. Preferably, the reader has selected the classification channels, although the channels can also be preset or default settings used. Embodiments of the invention are not language-specific, so examples of the invention can deliver comments 4' relevant to a given news article 1 from microblogs posted in different languages. For example, the publication document 1 is in one language and the comments 4' are in different languages.

In one embodiment of the present invention, the publication document 1 is a live television broadcast with both video and audio content data 2. The audio content data 2 is sampled and analysed to identify a publication topic (PT) or the topic may be manually entered by the broadcaster. The topic (PT) is transmitted to a microblog analyser and the microblog repository or feed 3 is analysed to identify relevant microblogs 4 matching the topic (PT). The relevant microblogs 4 are then categorised as comments or non-comments and the relevant comments 4' are returned to the broadcaster and shown live on a constantly updating ticker banner, for example, under the television news broadcast whilst the live event progresses.

The above description concentrates on augmenting an existing publication document 1 with relevant comments 4' derived from a microblog source 3. The system embodying the present invention can also operate to create or generate automatically fresh publication documents 1 (which may subsequently be augmented according to embodiments of the present invention).

Referring now to figure 4, this is a variant of the present invention: the microblog source 3 is monitored and analysed (S2) to identify trends and/or potential popular topics. The system settles on a potential topic for the creation of a new, fresh publication document 1. The system takes the topic as the publication topic (PT) for the as-yet-uncreated publication document 1, matches the publication topic against relevant microblogs 4, derives relevant comments 4' or non-comments and populates a new publication document 1 with the relevant comments 4' or non-comments. The result is a basic (potentially title-only, being the publication topic) publication document 1. From this genesis of a new publication document 1, the above described system embodying the invention can then be applied to the fresh publication document 1 and the comments 4' continually updated. This system could thus serve as a flag to news organisations that a particular event is attracting a lot of comments 4' and is therefore deserving of further investigation and perhaps manual creation of a respective publication document 1.
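The creation of a title-only publication document from a trending topic can be sketched as follows. This is a simplified illustration under stated assumptions: trends are approximated by counting hashtags, the threshold `min_posts` and the dictionary layout of the created document are hypothetical, and the comment/non-comment classification step is left out for brevity.

```python
import re
from collections import Counter
from typing import Dict, List, Optional, Set


def hashtags(post: str) -> Set[str]:
    """Lowercased hashtags that occur in a post."""
    return {tag.lower() for tag in re.findall(r"#\w+", post)}


def propose_topic(posts: List[str], min_posts: int = 3) -> Optional[str]:
    """Pick the hashtag used in the most posts, if it is used in enough of them."""
    counts = Counter()
    for post in posts:
        counts.update(hashtags(post))  # each post counts once per hashtag
    if not counts:
        return None
    tag, n = counts.most_common(1)[0]
    return tag if n >= min_posts else None


def create_document(posts: List[str], min_posts: int = 3) -> Optional[Dict]:
    """Create a title-only document populated with the posts on the trending topic."""
    topic = propose_topic(posts, min_posts)
    if topic is None:
        return None
    related = [post for post in posts if topic in hashtags(post)]
    return {"title": topic, "comments": related}


posts = [
    "Power out again #India",
    "#india blackout is huge",
    "No trains today #India #blackout",
    "Unrelated #cats",
]
document = create_document(posts)
```

With these sample posts the topic "#india" is proposed and the first three posts populate the new document; the document can then be augmented and updated in the same way as an existing article.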

When used in this specification and claims, the terms "comprises" and "comprising" and variations thereof mean that the specified features, steps or integers are included. The terms are not to be interpreted to exclude the presence of other features, steps or components.

The features disclosed in the foregoing description, or the following claims, or the accompanying drawings, expressed in their specific forms or in terms of a means for performing the disclosed function, or a method or process for attaining the disclosed result, as appropriate, may, separately, or in any combination of such features, be utilised for realising the invention in diverse forms thereof.

Table 1. Examples of comments on news from Twitter

News Headline | Tweet
Drug war comes to Mexico's 2nd city | LOL! @HuffPostWorld Pot-firing catapult found at Mexican border http://huff.to/gCa8sF
Phone-hacking in Britain | Funny Guardian thinks Assange is a hero for Wikileaks & NOTW journalists evil for alledged phone hacks. Just pure snobbery & self interest!
BBC World Service cuts outlined to staff | If the BBC mortgaged the #Strictly wardrobe to that (minted) wedding dressmaker in #MyBigFatGypsyWedding, they could save the World Service.
U.S. Murder Case Threatens Pakistan Ties | Whether in Egypt or Pakistan the US must demand that our citizens are treated fairly, RELEASE OUR DIPLOMAT! RELEASE OUR JOURNALISTS!
Haiti's former president Jean-Bertrand Aristide vows to return | If #Aristide returned to #Haiti, would it change anything? Would it create democracy?
Rachel Maddow at MSNBC makes an idiot of herself again | wish i could afford HBO so i could watch bill maher, and rachel maddow, they tell it like it is.
U.S. Unemployment Falls, But New Jobs Lag | First Thoughts: What has changed (and what hasn't): It's bad enough that the progressive income tax, a concept... http://twurl.nl/xzone
Al Gore Explains 'Snowmageddon' | Awesome business article related to global warming - crazy stuff - and I'm not a tree hugger! read.bi/gV3WVv via @businessinsider
TSA shuts door on private airport screening program | Got my first TSA pat down at the Thunder Bay airport. He was friendly

Table 2. Classification results for comments measured by precision and recall for 8 runs of different combinations of features

Features | Average per topic: Precision | Average per topic: Recall | Overall: Precision | Overall: Recall
TS | 0.691 | 0.323 | 0.582 | 0.190
LI | 0.672 | 0.470 | 0.670 | 0.407
LX | 0.789 | 0.350 | 0.738 | 0.297
TS+LI | 0.648 | 0.472 | 0.646 | 0.412
TS+LX | 0.736 | 0.384 | 0.720 | 0.363
LI+LX | 0.653 | 0.573 | 0.663 | 0.558
TS+LI+LX | 0.809 | 0.578 | 0.795 | 0.528
All Feats | 0.884 | 0.503 | 0.843 | 0.450

Table 3. The headlines of the 15 news articles, the number of retrieved tweets (Ret) for each article from Twitter, the number of tweets classified as comments (Idnt) by our system, and sample(s) of the identified comments for each news article.

# | News Headline | Ret | Idnt | Samples of the identified comments
1 | Mr. Bean Gets Carried Away During Olympics Appearance | 98 | 6 | • What I like about Olympics. :))) http://t.co/SnLB6XYb • Funny! LOL! http://t.co/lvKOz3p8
2 | Reports: iPhone 5 to be unveiled Sept. 12 | 54 | 9 | • @ClaudeBotes87 - on my bday?? Its a sign! http://t.co/1DiOYUn2 • Give me that!!! Reports: iPhone 5 to be unveiled 9/12. http://t.co/fWdtAgOD #cnn #apple
3 | 2nd Day of Power Failures Cripples Wide Swath of India | 99 | 20 | • What do #India and #Pakistan have in common? The need for effective regulation of their respective electrical grid. http://t.co/8IOjbQwE • 2nd Day of Power Failures Cripples Wide Swath of India http://t.co/Pd1s14e1 If Obama policies continue, it can & will happen in the US??
4 | Fight continues for control of Syria's Aleppo | 100 | 1 | • Fight continues for control of Syria's Aleppo - http://t.co/LsJ1p3gh Call in IZZY for counter air support!!!
5 | India: 600 million without power in 'biggest ever blackout' | 60 | 3 | • @SpyEyesAnalysis: India: 600 million without power in 'biggest ever blackout' - Telegraph http://t.co/OuoRPo2L"? ????? ???? ??
6 | The LIBOR scandal The rotten heart of finance | 100 | 0 |
7 | Russia's "Pussy Riot" on trial for cathedral protest | 31 | 0 |
8 | Obama announces new Iran sanctions | 99 | 4 | • Obama announces new Iran sanctions http://t.co/n1Y5d4TY <-how many more are possible? I hear about new Iran sanctions every other week. • @AJEnglish: Obama announces new Iran sanctions http://t.co/nZiHjqgh - who's the real evil?
9 | Mexican official: CIA 'manages' drug trade | 99 | 1 | • (( the question i would ask,, if i cared, and didn't think it was good policy,, (((would be who appointed))) the... http://t.co/w0bOCnFT
10 | London 2012: rowers Glover and Stanning win Team GB's first gold medal | 55 | 5(1) | • Gold Medal commemorative stamps http://t.co/d2iBsU7T available from Post Offices from tomorrow. Go Team GB! http://t.co/YtDiUDWi • It's gold! Stanning and Glover end the wait for Team GB http://t.co/xrdi20jm #teamfollowback
11 | Peter Jackson's The Hobbit to be extended to three films | 100 | 24 | • Oh lord, now the Hobbit is going to be a trilogy. Honestly, that man doesn't know how to edit. http://t.co/cTDONkGT • hmm ... Artistic vision or studio money grabbing? 3 films for the price of 3, instead of 2, or just be concise and do 1 http://t.co/FvZiRB2a
12 | Gore Vidal dies; imperious gadfly and prolific, graceful writer was 86 | 68 | 2 | • Rest in peace, Gore. "Style is knowing who you are, what you want to say, and not giving a damn." Amen to that. http://t.co/gOOIqvFd
13 | US author Gore Vidal dies aged 86 | 100 | 7 | • Not goodbye, rather, a thank you and goodnight. As your influence remains, now strengthened by sacrifice. Rest in Peace http://t.co/MlaZkZPe • RT @PeppScotland: Gore Vidal Passes. As we quote him in ...Rain? we'll be paying tribute at our opening performance today. http://t.co/U1pLbZkh
14 | Romney's Overseas Trip Produces Hits and Misses | 70 | 4 | • BEAUTIFUL YES!!!! #POPE JOHN PAUL II => #Romney: 'Look to Poland' on Freedom. Economy http://t.co/Ar84vlvu" • Mr. @MittRomney have you been in these countries? Or you just talking what is clear for people in decades http://t.co/wAZbAH09 via @WSJ
15 | In Sliding Internet Stocks, Some Hear Echo of 2000 | 99 | 13 | • As Social Sites' Shares Fall, Some Hear Echo of 2000 http://t.co/ojiBpZ7t - RT @om: quite certain i haven't read a more awkward piece • Echo of 2000? Hopefully not, but I do agree with IDC that "Facebook is not and will not be a second Google" http://t.co/hltOzhhE

Claims

1. A method of augmenting an electronically rendered publication document containing content data, the method comprising:

identifying a topic of the publication document;

analysing the content of multiple microblogs to identify classification features, the classification features comprising evidence that a microblog from the multiple microblogs is in a particular microblog classification;

identifying microblogs relevant to the topic of the publication document;

collating into supplemental microblog content at least those microblogs or the content from those microblogs which fall into a particular microblog classification and which are relevant to the topic of the publication topic;

linking the publication document with the supplemental microblog content.

2. The method of claim 1, wherein analysing the content of multiple microblogs comprises classifying a microblog from the multiple microblogs into a particular microblog classification.

3. The method of claim 1 or 2, wherein the microblog classifications include a classification of comment-microblogs which are microblogs determined to contain a comment on a particular topic.

4. The method of claim 3, wherein the microblog classifications include a classification of non-comment-microblogs which are microblogs determined not to contain a comment on a particular topic.

5. The method of claim 3 or 4, wherein the microblog classifications are further classified into:

positive or negative sentiment in respect of a particular topic;

subjective or objective content, subjective content representing an opinion and objective content representing facts such as news or information; and

news, events, opinion and deals.

6. The method of any preceding claim, wherein the classification features used in the microblog analysis comprise one of or a combination of at least:

microblog-specific features (TS) that are specific to the manner of writing the text of a microblog;

Language-independent features (LI) that comprise non-lexical features that give an indication to the presence of an opinion or a comment in general;

Lexical features (LX) that represent the presence of a set of terms of a special nature that are more likely to exist in comments;

Topic-dependent features (TD) that provide a measure of the relevance of the microblog to the publication document.

7. A method of augmenting an electronically rendered publication document containing content data, the method comprising:

analysing the content of the publication document and extracting one or more publication topics from the content data of the publication document;

analysing the content of multiple microblogs and identifying a microblog topic or topics from the content of the multiple microblogs;

matching a microblog topic to a publication topic;

collating into supplemental microblog content at least those microblogs or the content from those microblogs with a microblog topic that matches the publication topic;

linking the publication document relating to the matched publication topic with the matched supplemental microblog content.

8. The method of any preceding claim comprising: presenting the supplemental microblog content with the publication document.

9. The method of claim 8 further comprising sorting the supplemental microblog content for presentation with the publication document.

10. The method of claim 9, wherein the sorting is by popularity of the supplemental microblog content and/or by temporal data relating to the supplemental microblog content.

11. A method of creating an electronically rendered publication document containing content data, the method comprising:

analysing the content of multiple microblogs and identifying a microblog topic or topics from the content of the multiple microblogs;

establishing a microblog topic as a publication topic;

collating into supplemental microblog content at least those microblogs or the content from those microblogs with the microblog topic;

providing the supplemental microblog content as content data for a new publication document, thereby creating the publication document relating to the microblog topic.

12. The method of claim 11, further comprising augmenting the electronically rendered publication document by:

analysing the content data of the created publication document and extracting one or more publication topics from the content data of the publication document;

matching a microblog topic to a publication topic;

collating into supplemental microblog content at least those microblogs or the content from those microblogs with a microblog topic that matches the publication topic;

linking the publication document related to the matched publication topic with the matched supplemental microblog content.

13. The method of any preceding claim, wherein the publication document content data is image data, video data, audio data, text data, URL data or a combination or combinations thereof.

14. The method of any preceding claim, further comprising rendering the linked supplemental microblog content with the publication document content data.

15. The method of any preceding claim, further comprising rendering the linked supplemental microblog content separately to the publication document content data.

16. The method of any preceding claim, wherein the microblog has content and metadata and the microblog is classified in accordance with the content and/or the metadata into one or more classifications.

17. The method of claim 16, wherein the matched supplemental microblog content is a classified subset selectable by a user interacting with the publication document.

18. The method of claim 17, further comprising rendering the classified subset of the matched supplemental microblog content.

19. The method of any preceding claim, wherein one category of microblog content comprises microblog user comments and the method further comprises extracting microblog user comment content for use as the supplemental microblog content.

20. The method of any preceding claim, wherein identifying a microblog topic or topics from the content of the multiple microblogs comprises at least one of the following:

identifying microblogs containing reference to all or part of a publication topic;

identifying microblogs containing links to microblogs containing reference to all or part of a publication topic;

identifying microblogs containing reference to all or part of a publication topic in their metadata;

identifying microblogs containing reference to all or part of a publication topic in a URL or in the resource identified by the URL;

expanding a URL in a microblog and identifying microblogs containing reference to all or part of a publication topic in the expanded URL or in the resource identified by the URL.

21. The method of any preceding claim, wherein the multiple microblogs are selected from the group comprising:

a post on a social networking site;

a comment on a news article;

a comment or a post on a forum;

a comment on a social networking site; and