WO2015135600A1 - Method and computer product for automatically generating a sorted list from user generated input and/or metadata derived from social media platforms - Google Patents


Info

Publication number
WO2015135600A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
social media
author
metadata
content
Application number
PCT/EP2014/066082
Other languages
French (fr)
Inventor
Claudia WYRWOLL
Original Assignee
Wyrwoll Claudia
Application filed by Wyrwoll Claudia filed Critical Wyrwoll Claudia
Publication of WO2015135600A1 publication Critical patent/WO2015135600A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking

Definitions

  • the invention relates to a method according to claim 1 and a computer product according to claim 11.
  • User-generated content is content publicly available on online platforms contributed by their users.
  • the increasing amount of user-generated content available on social media platforms requires new methods to find, evaluate, and compare it. This constitutes the technical problem to be solved by the embodiment described below.
  • User-generated content such as blog postings, forum discussions, shared videos, and so on, contain information that can be used for its evaluation independent of specific search interests.
  • the embodiments refer to an analysis of user-generated content from different social media platforms (e.g., blogs, forums, and social networks).
  • the proposed approach thus provides a unified way of ranking and comparing user-generated content across different platforms.
  • the independent method patent claim 1 comprises six steps 101, 102, 103, 104, 105, 106 which are part of the flowsheet shown in Figures 1 and 2.
  • the second step 102 is optional.
  • the references in Figure 1 and 2 refer to the parts in this description.
  • the invention also relates to a computer program product with the features given in claim 11.
  • the computer program product can be realized as a program stored on some media and/or implemented in hardware, i.e., a custom-made chip.
  • the central characteristic for user-generated content is the possibility for users to publish content to others.
  • To publish means making information publicly available.
  • General public means that no receiver is specified by the contributing user. The content is available for everyone. This means that the audience is potentially unlimited.
  • Limited public means that no receiver is explicitly specified by the contributing user, but the audience is limited.
  • the limitation can be caused by platforms that require registration prior to reading. This is the case, for example, if a platform presents its content only to registered users; although anyone might be admitted to register, the audience is limited to the registered users.
  • Limited public can be subdivided into known-limited public and unknown-limited public.
  • Known-limited public comprises the cases when no receiver is specified but the audience is limited to known people.
  • An example is content shared with a group, such as friends in social networks. This example illustrates that known-limited public is similar to private. Unknown-limited public describes the case when the audience is limited but not exclusively to known people. An example is when content is shared with a closed community such as friends of friends in social networks.
  • private communication is not user-generated content. This means, telephone calls, written letters, faxes, emails, SMS, instant messages, and so on, do not fall under the notion of user-generated content as it is used in this thesis.
  • private communication can be part of a social media platform. Social networks, for instance, usually allow publishing content to the general public or to a limited public, as well as sending private messages.
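The levels of public introduced above could, for illustration, be encoded as a small classification. The following Python sketch is not part of the patent; the names are illustrative:

```python
from enum import Enum

class Audience(Enum):
    """Levels of public for a contribution (cf. the Reach-Intimacy-Model)."""
    GENERAL_PUBLIC = "general public"           # no receiver specified, unlimited audience
    KNOWN_LIMITED = "known-limited public"      # audience limited to known people (e.g., friends)
    UNKNOWN_LIMITED = "unknown-limited public"  # limited, but not only known people (friends of friends)
    PRIVATE = "private"                         # explicit receiver; not user-generated content

def is_user_generated(audience):
    """Private communication does not fall under user-generated content."""
    return audience is not Audience.PRIVATE
```

In this sketch, a social network message shared with friends would be `KNOWN_LIMITED` and still user-generated, while a direct message would be `PRIVATE` and excluded.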
  • Figure 3 shows the Reach-Intimacy-Model.
  • the Reach-Intimacy-Model illustrates private communication and the levels of public for user-generated content in relation to reach and intimacy. Private communication is not regarded as user-generated content.
  • This notion of public, adapted for user-generated content, replaces the concept of sender and receiver with contributor and audience.
  • the contributor is the user who published a message.
  • the contributor does not necessarily have to be the creator of the content. Whether or not the contributor is the creator of the content, in social media he is usually displayed as the author (in some cases, platforms display citations) and will therefore be referred to as author.
  • User-generated content is content published on an online platform by users.
  • the term social media comprises platforms that contain user-generated content. Users do not need programming skills to publish content on a social media platform.
  • Whether content contributed by a company on a social media platform is considered user-generated content depends on the notion of user.
  • User can refer to the user of a social media platform. In this case, the content contributed by a company on a social media platform would be regarded as user-generated content.
  • User can also refer to private individual as opposed to professional or business person. In this case, the content contributed by a company on a social media platform would not be considered user-generated content.
  • user refers to the user of a social media platform.
  • the smallest unit is a Web page with a URL as identifier.
  • a URL usually contains several social media entries from different authors.
  • In social media, the smallest unit is the user-generated content unit.
  • a user-generated content unit is one single contribution by one author at a given time. Collaboratively created content usually has more than one author. This case is not covered in this thesis.
  • a user-generated content unit consists of core data and metadata.
  • the given piece of information— the content— is the core data.
  • Metadata is information about a given piece of information (Baeza-Yates & Ribeiro-Neto, 2003). Examples of metadata about user-generated content are the date of publication, the status of the author in the community, and the number of views.
  • Opinions that are expressed by one click are also referred to as one-click-opinions. Examples are Facebook's likes, Google's +1, Youtube's thumbs up, and so on. Ratings of user-generated content units by other users are peer-ratings.
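A user-generated content unit with its core data and metadata could be modeled as follows. This is a minimal sketch; the field names and example values are assumptions for illustration, not taken from the patent:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ContentUnit:
    """One single contribution by one author at a given time."""
    core_data: str      # the content itself (here: text)
    author: str         # the displayed contributor
    published: date     # date of publication
    metadata: dict = field(default_factory=dict)  # e.g., views, one-click-opinions

# hypothetical example values
unit = ContentUnit(
    core_data="Great hotel, friendly staff.",
    author="exampleAuthor",
    published=date(2012, 7, 15),
    metadata={"views": 412, "one_click_opinions": 6},
)
```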
  • Social media platforms include, but are not limited to:
  • the following provides a short characterization for each category.
  • the sets of characteristics of categories are not necessarily disjoint.
  • Some platforms allocated to a certain category can have aspects of other categories. For example, social networks allow sharing pictures and videos, as is typical for media sharing platforms.
  • Blogs are a special form of Web site that keeps publication simple. Entries are displayed in reverse chronological order, presenting the most recent entry at the top of the page. Blogs can be differentiated by the number of users authorized to publish blog entries into single- and multi-authored blogs. (Note that even in multi-authored blogs, content units are usually not collaboratively contributed.) Until the mid-2000s, blogs were usually written by one author. In recent years, multi-author blogs have become popular as well (Safko, 2010). Bloggers tend to evolve their blogs around a special interest (Macdonald, Santos, Ounis & Soboroff, 2010). Blogs can also be distinguished by the degree of professionalism of the content produced into professional- and private-content blogs. Blogs range from personal diaries to professional journalists' and corporate blogs.
  • a forum is an online discussion site where people can hold conversations in the form of posted messages. Conversations are organized in threads and are stored permanently. A thread belongs to a topic. It consists of a root-posting and replies. In open forums, content is public and can be read by everyone. In closed forums, it is necessary to become a member of the forum community to read the postings. To actively take part in a discussion, it is usually necessary to become a member. Forums have a hierarchic structure. The first level displays a list of topics that are covered by the forum. Threads covering the same topic are collected in a sub-forum. Threads are on a lower level. They start with a root-posting containing a question, topic or statement.
  • a thread consists of one or more postings, which are organized by the time of submission, usually in chronological order.
  • Some forums offer their members the possibility to connect with each other, as is common in social networks. But unlike in social networks, users can read discussions and contributions whether or not they are connected. Members can discuss topics with any other member; being connected is not a precondition. Hence, the average forum user has fewer connections than the average social network user. Forums have existed since the early days of the Internet, and they form rich repositories of collaborative knowledge.
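The hierarchic forum structure described above (topic, thread, root-posting, replies ordered by time of submission) can be sketched as a few data classes. The class and field names are illustrative assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Posting:
    author: str
    submitted: datetime
    text: str

@dataclass
class Thread:
    root: Posting                            # question, topic, or statement
    replies: list = field(default_factory=list)

    def postings_in_order(self):
        """All postings of the thread, ordered by time of submission."""
        return sorted([self.root] + self.replies, key=lambda p: p.submitted)

@dataclass
class Topic:
    """A sub-forum collecting threads on the same topic."""
    name: str
    threads: list = field(default_factory=list)
```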
  • Location sharing and annotation platforms apply location based services that enable groups of friends to share their current location and annotations.
  • Location based services allow people to see where they are geographically with the help of GPS-equipped mobile phones. Examples are applications that help the user navigate from A to B, locate others on a map displayed on the phone, or receive information about traffic jams ahead (Rogers, Sharp & Preece, 2011).
  • Location sharing and annotation platforms usually allow users to plot their location and share it with other users. Users can also create places, upload pictures, videos and leave written messages for others.
  • Microblogs allow users to share information by broadcasting short, real-time messages.
  • a well-known microblog provider is Twitter (http://www.twitter.com).
  • a microblog entry consists of a short text or a link to images or videos.
  • the author does not specify a recipient. Every message is public by default and the recipients choose whose messages they read, whom they follow. Users who follow someone are called her followers. The followings are the users whom she follows.
  • a content unit published on Twitter is called a tweet.
  • a tweet cited by someone else is called a retweet.
  • Category assignments or tags are marked with a number sign (#).
  • Question and answer platforms are platforms where users can pose questions and everyone can answer them. Answers can be rated by other users.
  • Rating and review platforms allow users to rate and comment on products or services. There are rating and review platforms that are completely user-generated and there are commercial platforms that integrate user-generated content. Ratings are opinions that can be contributed by just one click on a given scale.
  • the scale can be binary (e.g., thumbs up), or it can have more levels (e.g., x out of n stars).
  • Reviews are written texts about products, services or experiences. Usually, platforms allow both ratings and reviews.
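Because the scales differ across platforms (binary thumbs up/down, x out of n stars), comparing ratings requires mapping them onto a common scale. A simple normalization onto [0, 1] might look like this; it is an illustrative sketch, not the patent's ranking method:

```python
def normalize_rating(value, scale_min, scale_max):
    """Map a rating from its native scale onto the interval [0, 1]."""
    if scale_max <= scale_min:
        raise ValueError("invalid scale")
    return (value - scale_min) / (scale_max - scale_min)

# binary scale (thumbs up = 1, thumbs down = 0)
thumbs_up = normalize_rating(1, 0, 1)    # 1.0
# 4 out of 5 stars
four_stars = normalize_rating(4, 0, 5)   # 0.8
```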
  • Social networks are platforms that allow individuals to create a profile and articulate a list of other users with whom they share a connection. Users can view and traverse their connections (Boyd & Ellison, 2008).
  • the documents the retrieval system chooses from a larger set of documents in the retrieval process are the retrieved documents.
  • User-generated content units consist of core data and metadata.
  • Core data is the content itself.
  • the content can be text, audio files, pictures, videos or a combination of these.
  • Metadata is information about information.
  • metadata is further information about the content, such as the date of publication or the author of the content.
  • the following section starts with a summary of the analysis of metadata for each category.
  • the second part of this section introduces the structural types of metadata that can be observed throughout the categories.
  • the third part introduces a semantic pattern for the metadata of user-generated content. The pattern applies to user-generated content from all categories and is independent of the platform. This is the basis for modeling a ranking that works category-independently. A full list of features provided by the social media platforms analyzed can be found in the Appendix.
  • whether a blog is single-authored or multi-authored, a single posting always has one author, a publishing date, and a source.
  • the source is the blog it was published in.
  • track-backs refer to references from other Web sites to a specific posting.
  • the number of track-backs indicates how many others regarded a posting as interesting or useful. Track-backs can only be applied for user-generated content units that have a distinct URL.
  • Figure 4 shows an example for a blog post with a plugin from another platform.
  • the screen-shot shows a blog post with an integrated Flattr button at the bottom (Peter, 2013).
  • the post has been Flattred six times, which means that the author of the post receives small amounts of money from six of his readers, who enjoyed the text.
  • the text of the blog post has been shortened and masked to guide the reader's attention.
  • Figure 5 shows an example for a blog post with several plugins from other platforms.
  • the screen-shot shows a blog post with plugins (from left to right) from Twitter, Facebook's like, Reddit, Email, Google+, share on Facebook, send to StumbleUpon, Fark It!, share on LinkedIn, and more sharing options in the box at the bottom (Beadon, 2013).
  • the text of the blog post has been shortened and masked to guide the reader's attention.
  • a click on a Twitter button means that the post is shared on Twitter.
  • a Facebook button means that the post is shared on Facebook.
  • a Facebook like button means that the reader expresses his approval of the text without sharing it.
  • a Flattr button (Flattr is a microdonation provider, http://flattr.com/) means that the reader of the post donates a small amount of money to the author.
  • Figure 4 shows an example of a Flattred blog post.
  • Figure 5 shows an example of a blog that includes various plugins. They allow readers to like a content unit or to share it on social networks. This is also a way to crowd-source relevance.
  • Each click on a plugin button means that a user consumed the post, at least partly, and rewarded the post with an interaction.
  • the possible interactions can have different semantics that express a higher or lower level of involvement, but they are all human selected recommendations. Human recommendations are a useful source of information for others when they try to select content for their own consumption.
  • the text of a blog post can be analyzed with regard to text length, frequency and use of specific words, number of references within the text, and so on.
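Such content information derived from the text itself can be sketched in a few lines. The feature set below is an illustrative choice, not the patent's definitive list:

```python
import re

def text_features(text):
    """Simple content information derived from the text of a post."""
    words = re.findall(r"\w+", text)
    return {
        "length": len(text),                              # text length in characters
        "words": len(words),                              # number of words
        "question_marks": text.count("?"),                # number of question marks
        "links": len(re.findall(r"https?://\S+", text)),  # number of links in the text
    }

features = text_features("Why Flattr? See http://flattr.com/ for details.")
```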
  • Forums tend to evolve around specific fields of interest. Forums are particularly valuable for users who seek like-minded people, specialists in a field of interest, or information about specific topics. Specialists share their knowledge in their field of competence. They also write about their experiences with products and brands. In technologically oriented forums, for example, computers, monitors, and gadgets are discussed. In telecommunication forums, the best service providers and mobile devices are disputed. In sports oriented forums, users share recommendations about training and their experiences with the latest sports gear. Hence, forums are also a popular source of product information prior to buying (Elsas & Glance, 2010).
  • Figure 6 shows the hierarchic structure of forums showing the postings on posting-level, screen-shot from http://forums.bit-tech.net, accessed: August 15, 2012.
  • Figure 7 shows the hierarchic structure of forums on thread-level, screen-shot from http://forums.bit-tech.net, accessed: August 15, 2012.
  • Figure 8 shows the hierarchic structure of forums on topic-level, which gives an overview of the forum's topics, screen-shot from http://forums.bit-tech.net, accessed: August 15, 2012.
  • a large amount of postings within a topic (or sub-forum) can indicate that a topic is popular. Another reason for a large amount of postings within a topic or sub-forum can also be the way a topic or sub-forum is composed. If for example one sub-forum subsumes all cultural topics, whereas political topics are subdivided into several sub-forums, numbers are difficult to compare.
  • the inner structure of topics of interest is not mandatory. Sub-forums and topics are organized manually and consequently differ in their organization. The difference in topic organization can also be interpreted as a bias, intended by a human mind. The resulting bias in the calculation might therefore be still reasonable in the semantics of the forum and therefore still helpful for the user.
  • Elsas & Glance (2010) worked on an approach to identify forums with rich product discussion. Their approach is based on a previously known list of products and brands people could search for. To identify relevant discussion within a forum, they also worked with the number of postings within a topic. They solved the problem of different aggregation levels by ignoring information from higher levels. They assign each thread to the parent forum containing the thread, assuming that each message belongs only to the immediate parent forum. Higher-level forums are ignored. On the one hand, this solves the problem of comparability of topic sizes. On the other hand, it neglects potentially useful information on higher levels. Yet, information from different levels can be mapped to a single posting.
  • Location sharing and annotation refers to collaboratively produced metadata for virtual representations of physical places.
  • Foursquare https://de.foursquare.com
  • Loopt http://www.loopt.com
  • Facebook Places https://www.facebook.com/about/location
  • Google Latitude http://www.google.com/latitude
  • Foursquare has more than 15 million users (https://foursquare.com/about) and Facebook Places is used by more than 30 million users. Users of Foursquare shared their locations over 100 million times by July 2010 (Cramer, Rost & Holmquist, 2011, p. 57).
  • Figure 9 shows a representation of the New York Marriott Marquis Hotel on the location sharing and annotation platform Foursquare.
  • Foursquare (2013) gives general information about the venue such as address and contact information. Users uploaded 1,879 photos of the hotel. They can be accessed at the top of the page. Below the short text about the hotel, the number of total visitors and the number of check-ins are displayed. Visitors left tips for other users, which are displayed at the bottom of the page.
  • Figure 10 shows a user's profile on the location sharing and annotation platform Foursquare.
  • the user Daer (2013) presents information about herself. At the top, her name is displayed along with a short text describing herself. This user has performed 6,014 check-ins and given 174 tips, one of which concerns the New York Marriott Marquis Hotel in Figure 9.
  • the user's level ("Superuser Level 1") and the number of her mayorships are displayed. At the bottom right, the number of friends is provided.
  • This view of the user's profile is general public (see also Figure 3). It can be accessed without being a member of Foursquare.
  • the venues are the central reference points in location sharing and annotation.
  • a venue has a name and a geographical location. Furthermore, the total number of people who have visited a location so far is provided along with the total number of check-ins. Assuming that users rather share places they like with other people, a large number of visitors suggests a high popularity of a location. A high ratio of check-ins per visitor suggests loyal customers. Annotations of locations always have an author and a publishing date. Users can rate the annotations of other users.
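The reasoning about visitors and check-ins can be made concrete as a small computation. The venue numbers below are hypothetical:

```python
def checkins_per_visitor(total_checkins, total_visitors):
    """A high ratio of check-ins per visitor suggests loyal, returning customers."""
    if total_visitors == 0:
        return 0.0
    return total_checkins / total_visitors

# hypothetical venue: 24,000 check-ins by 8,000 distinct visitors
ratio = checkins_per_visitor(24_000, 8_000)  # 3.0 check-ins per visitor on average
```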
  • the location sharing and annotation-user's profile shows how many total check-ins she published so far and how long she has been active on the platform.
  • Figure 10 shows an example of a user's profile on the location sharing and annotation platform Foursquare.
  • Media sharing platforms are platforms where registered users can upload content and share it with friends or provide it to the public. Existing platforms are specialized in specific media such as pictures, videos or audio content.
  • Youtube is the most successful video sharing platform of our time (http://www.youtube.com). Flickr is an example of a platform where users share pictures (http://www.flickr.com).
  • Connections can be unidirectional as typical for microblogs or mutual.
  • Youtube even supports both types of relationships, namely friends and subscribers.
  • Flickr supports friends as well as groups users can join.
  • the number of connections is lower than in social networks. This might be due to the fact that on social media sharing platforms users do not have to be connected to see each other's content. Users can comment on the content and contribute one-click-opinions.
  • a user-generated content unit from a social media sharing site usually has a contributor, a publishing date, a number of one-click-opinions, a number of views and a number of comments.
  • Figure 11 and Figure 12 show examples of content units from two different social media sharing platforms.
  • User profiles typically give information about the user's nickname, the date when the user became member of the community, and the number of content units contributed.
  • Figure 11 shows a user-generated content unit on the media sharing platform Youtube.
  • the screen-shot shows an example for a user-generated content unit on Youtube.
  • the platform displays the video in the main area. It shows the title of the video 120715 - PSY - Gangnam style (Comeback stage)... and the contributing user CapsuleHD20 (2012). Furthermore, the number of views and the number of user ratings are displayed.
  • the bottom of the page shows when the video was contributed. In this example, the contributor is not the artist of the video.
  • the artist, PSY, is explicitly named at the right side of the bottom of the screen-shot.
  • Figure 12 shows a user-generated content unit on the media sharing platform Flickr.
  • Twitter is a typical representative. A tweet has always one distinct author and a publishing date. Favorites are Twitter's one-click-opinions. Tweets can be further distributed by other users. For Twitter this is called retweet. The number of favorites and the number of retweets are displayed with the content unit. They help users to estimate the importance of a tweet.
  • Figure 13a shows a user-generated content unit published on Twitter.
  • Figure 13b shows a user's profile on Twitter.
  • the user's profile information shows how many content units a user contributed, how many followers he has, and how many others he follows (i.e., following).
  • the number of followers indicates an author's reach. The more followers an author has, the more people consider his contributions worth reading. If we think about the author and his followers as a directed graph, more conclusions about the value of an author's followers can be drawn. A user who follows fewer authors could be considered to be more selective about content and to choose more carefully whom he follows. Furthermore, it could be assumed that this user is more likely to really read the postings by those authors. The conclusion can now be drawn that a large number of followers who follow many authors is not as valuable as having the same amount of followers who follow a small number of authors.
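The conclusion drawn above about the directed follower graph can be sketched as a simple weighting, in which each follower counts more the fewer authors he follows. The weighting 1/followings is an illustrative choice for this sketch, not necessarily the formula used by the patent:

```python
def follower_value(followings_per_follower):
    """Sum over all followers of 1 / (number of authors that follower follows).

    A follower who follows few authors is assumed to choose more carefully
    and to actually read the postings, so his follow weighs more than that
    of a follower who follows many authors.
    """
    return sum(1.0 / n for n in followings_per_follower if n > 0)

# 100 selective followers (10 followings each) are worth more than
# 100 unselective followers (1,000 followings each)
selective = follower_value([10] * 100)       # 10.0
unselective = follower_value([1000] * 100)   # 0.1
```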
  • Figure 13 shows an example of a user-generated content unit and a user's profile from the microblogging platform Twitter.
  • Microblogs differ from traditional blogs in being much shorter and smaller in file size. For example, a tweet is limited to 140 characters. It contains text sometimes accompanied by a short-link (a short-link is a URL that is shortened in length and still directs to the required page). Therefore, the length of tweets varies only in this small range.
  • a particularity of microblogs is that there are syntax agreements, which can be used within the text. The syntax is not technically imposed by Twitter. It has emerged as conventions from the users' needs. Category tags could be derived from the text itself as well by parsing for the number sign (#). Retweets can easily be parsed, since they are marked within the text message by the letters RT.
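Parsing these syntax conventions is straightforward; a minimal sketch (the function name is illustrative):

```python
import re

def parse_tweet(text):
    """Extract the user-emerged syntax conventions from a tweet's text."""
    return {
        "tags": re.findall(r"#(\w+)", text),          # category tags marked with #
        "is_retweet": bool(re.match(r"RT\b", text)),  # retweets are marked by the letters RT
    }

parsed = parse_tweet("RT @someuser: new sports gear review #sports #gear")
```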
  • On question and answer platforms, users can pose questions and other users can answer them.
  • Question and answer platforms usually allow peer-ratings. Answers can be rated by other users. The goal of the rating procedure is to find the best, ideally the correct answer from all answers given.
  • Questions as well as answers have a publishing date and an author.
  • the author's profile usually shows the author's nickname, the date of joining, the number of contributions, the number of questions posed, the number of questions answered, and the number of (usually peer-rated) best answers.
  • Rating and review platforms are specialized for ratings and reviews of products, services or experiences. Ratings and reviews can also be part of commercial platforms. Amazon is an example of a commercial platform that allows ratings and reviews by users. Content of this type is also referred to as consumer-generated product reviews (Archak, Ghose & Ipeirotis, 2007) or online consumer reviews (Yu, Zha, Wang & Chua, 2011).
  • Figure 14a shows a user-generated content unit published on Ciao. Next to the author's nickname and her profile picture is the overall rating the author gave the product (RosesAreRed1207, 2012). Figure 14b shows a user's profile on Ciao.
  • Figure 14 shows an example of a user-generated content unit and a user's profile from the rating and review platform Ciao.
  • users can also comment on reviews. In this case, the number of comments a review received is displayed.
  • Social networks are characterized by their users and the connections between them. Connections can be either mutual, as the friends connection in Facebook, or unidirectional, as in Google's social network Google+. Google+ users can add anyone to their circles (i.e., the user's networks); the other user does not confirm the connection. For each publication, users can specify to which circle they would like to publish it.
  • a user-generated content unit is always published with author and date of publication.
  • a user-generated content unit is published in a social network as part of so-called feeds.
  • a user-generated content unit appears in the feeds of all users connected with the contributor. Therefore, the number of connections indicates a user's reach.
  • originally, those postings consisted of short text messages. The possibilities to post photos, links and videos were added gradually.
  • people can rate entries by leaving one-click-opinions, comment on them, and share content with their connections. This leads to a number of further measures, such as the number of one-click-opinions, number of comments, and number of shares.
  • Figure 15 shows two examples of user-generated content units from social networks.
  • Figure 15 shows user-generated content units from the social networks Google+ and Facebook.
  • the data can be analyzed for patterns.
  • the collected measures are analyzed for similarities and differences across the categories.
  • Figure 16 shows user-generated content and the allocation of information.
  • the boxes labeled 0. to 2. represent content. It can be text, pictures or other media.
  • the ellipses represent metadata.
  • the contribution (1.) has a publishing date, a source, and an author.
  • the contribution can have comments allocated to it (2.).
  • a comment also has an author and usually a publishing date.
  • the contribution can have an object it refers to (0.).
  • the elements that are to be ranked need to be determined. This also decides on which level user-generated content is compared. For user-generated content that stands for itself (that is, it does not refer to other elements and has no other elements referring to it), this is straightforward. But for a product review, for example, it needs to be specified whether the products, the review, or the comments on the review should be ranked.
  • Figure 16 illustrates this structure. It shows the central user-generated content unit as contribution in the middle (labeled 1.).
  • a contribution has an author, a source and a publishing date.
  • a contribution can have comments (labeled 2.).
  • a comment can have an author, too. But in the context of this description, it is interpreted as additional information about a contribution. It is not an object of the ranking itself.
  • a contribution can refer to other objects (labeled 0. in Figure 16). In the context of this work, this is also not an object of the ranking.
  • Content information is information derived from the content. For example, if a social media content unit contains text, the text itself can be parsed to derive further information. Simple examples are text length, number of words and sentences or number of question marks or the number of links the text contains. Less simple examples are the numbers of previously defined strings or positive and negative adjectives. For videos, this can be the file size or the duration of the video.
  • Primary information is information that is displayed directly on the level of the user-generated content unit; it refers directly to one content unit.
  • the publishing date and the author of a content unit are examples of information that is always displayed with the content unit (i.e., the user does not have to navigate). Publishing date and author refer to one content unit and can be distinctly allocated.
  • Secondary information is information that refers to information about a user-generated content unit. Secondary information is often found on a different level than the content unit itself. It can refer to more than one content unit. The number of content units an author has contributed is an example of secondary information. It reveals information about the author and cannot be directly allocated to a content unit. But, if we know who the author of a content unit is and we know something about the author, we can draw conclusions about the content units the author publishes.
  • Secondary information allocated to a user-generated content unit can be differentiated according to its origin.
  • Information derived from the thread of a user-generated content unit from a forum is thread inherited information (e.g., number of views of a thread).
  • Source-related information is secondary information that is derived from the source of a social media content unit (e.g., number of community members).
  • Author-related information is secondary information that is derived from an author's profile (e.g., number of best answers an author contributed).
  • There is also source-related information that can be assessed for all categories.
  • An example is the number of back-links to a source.
  • the number of members of a community in a social network is an example.
  • Figure 17 shows types of information for user-generated content units.
  • User-generated content units consist of content and further information.
  • Author, date, and source are primary information.
  • Secondary information is further information about primary information. Specific information is available for some platforms but not for others.
  • Content information is information derived from the content.
  • Figure 17 shows an overview of all types of information related to a user-generated content unit.
  • a datum without meaning is just a number. When data have meaning, they become information that is useful for humans. The focus of the following section is the interpretation of the metadata available for user-generated content.
  • Expertism is another example of how the reputation of an author is used to draw conclusions about the content he or she produces.
  • An example of trusted experts is journalists. An article written by a journalist is expected to be of better quality than an article by a layman. The same applies to scientists and their publications. An academic degree increases the trustworthiness of a contribution to a discourse, induced by assumed expertism. In science, authors are more likely to be cited if they have already been cited more often than comparable work. A highly rewarded scientist who has published many well respected works is expected to publish more work that deserves respect. Quoting this well respected author will probably be more convincing than quoting someone unknown.
  • the source of a publication allows conclusions to be drawn about the publication itself. Properties of a source can be used to predict the probability that a piece of content published within that platform has these properties as well. This concept is well established and known to users from traditional media.
  • Social media platforms vary widely in their size.
  • the size of a platform can be measured by number of contributions and number of members or visitors. The number of visitors correlates with the potential reach of the platform's content.
  • the number of incoming links is an indicator of the probability that the random surfer visits the platform and an indicator of its relative importance.
  • the random surfer is a notion used by Page, Brin, Motwani & Winograd (1998). The random surfer surfs the Web by randomly following its hyperlink structure.
  • User-generated content can consist of texts, pictures, video, and audio files, or a mixture of them. Often, nontext multi-media content is also accompanied by descriptive text. Text can be analyzed for text features that are known to correlate with characteristics of the text. Studies show that citations, references and other kinds of source material contribute to a text's credibility (Fogg et al., 2000, 2001). Links within the text of a user-generated content unit can help to ascertain the source of information and indicate utility as they point to additional sources of information that can help the user (Moturu, 2010; Elgersma & de Rijke, 2008). Quotation marks are an indicator for citations as well, but they are also commonly used to indicate irony. A solution for the disambiguation of quotation marks used for irony and quotation marks used for citations could be that, in the case of irony, typically only one or two words are set in quotation marks, whereas a quotation is usually longer.
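The disambiguation heuristic sketched above (short quoted spans tend to mark irony, longer ones citations) could look as follows. The two-word threshold follows the observation in the text; the function name and example strings are illustrative assumptions:

```python
import re

def classify_quotations(text, irony_max_words=2):
    """Heuristic sketch: quoted spans of at most `irony_max_words` words
    are treated as irony, longer spans as citations."""
    spans = re.findall(r'"([^"]+)"', text)
    irony = [s for s in spans if len(s.split()) <= irony_max_words]
    citations = [s for s in spans if len(s.split()) > irony_max_words]
    return irony, citations

irony, citations = classify_quotations(
    'He called it "genius". The report states "quality flaws are frequent in articles".'
)
```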
  • Anderka, Stein & Lipka propose an approach for automatic quality flaw detection for Wikipedia articles. They propose to interpret the detection of quality flaws as a one-class classification problem, identifying articles that contain a particular quality flaw among a set of articles.
  • An example of a frequent quality flaw is an article that does not cite any references or sources (Anderka, Stein & Lipka, 2011b).
  • the texts that contain a particular quality flaw are given as positive examples to decide for unseen texts whether they contain that particular flaw.
  • For each flaw, an expert is asked whether a given document suffers from it. Based on the manually tagged document set, a one-class classifier is trained, tested, and evaluated for each flaw.
  • Text length is a feature often used as quality indicator (e.g., Moturu, 2010; Dalip et al., 2009; Hu, Lim, Sun, Lauw & Vuong, 2007).
  • Text length is a feature that can be easily extracted, whereas a classification approach requires more processing resources.
  • Every one-click-opinion— every thumbs up, every like, every +1, and so on— is an assessment of a content unit.
  • Each interaction— be it a one-click-opinion of any kind or a comment— with a content unit means that a user felt the content unit was worth spending time on.
  • User interactions with a contribution are a valuable source of information for other users who search for interesting content.
  • Mishne & Glance (2006) demonstrate that the number of comments is a strong indicator for the popularity of a blog post. Compared to a comment, the amount of time a user invests in a one-click-opinion is relatively small. But a one-click-opinion is an explicit expression of opinion. The more people consider a content unit worth an interaction, the higher is the probability that the content unit might be interesting for other users as well.
  • the collection of all user interactions with a content unit is a crowd sourced assessment of the content unit that can serve as recommendation for other users.
  • the social media document view comprises the structure all user-generated content units have in common
  • the modeling approach presented here is based on the analysis of the metadata of the evaluated platforms.
  • the platforms have been chosen as representatives that cover the typical range for the analyzed categories to allow transferability of results. However, when statements are made about how platforms or platform categories function, there may be, or may come to be, platforms for which the statements do not apply.
  • the descriptions of platforms and patterns are empirical observations of the platforms as they were at the time of the development of this work.
  • the presented approach has been designed with robustness towards variation and change in mind; nevertheless, it makes no claim to universal validity.
  • the term document is used to refer to a unit of information, typically consisting of text. It can also contain other media.
  • a document can be a whole book, an article, or be part of a larger text such as a paragraph, a section or a chapter.
  • The term social media document has been chosen to parallel traditional information retrieval, where the ranked units are also referred to as documents.
  • a social media document always has a certain structure that is independent from the category it belongs to.
  • the user-generated content unit always contains content. This can be text, pictures, audio or video files or a combination of them. Measures derived from the content itself are in the following referred to as intrinsic information.
  • An example of intrinsic information is the number of references (i.e., links) a text contains.
  • Metadata about the content on the level of primary information is extrinsic information.
  • Examples of extrinsic information are the number of replies and the number of likes.
  • author-related information can be assessed through the author's identifier (e.g., the author's nickname concatenated with the source name).
  • Source-related information is information about the source and can be assessed through the source's identifier (e.g., its URL).
  • Figure 18 shows the social media document view for user-generated content. It comprises the structure all user-generated content units have in common. Measures that can be derived from the content itself are intrinsic information. Metadata about the content is extrinsic information. Information that is not directly about the content unit, but about its author is author-related information. Information about the platform, where the content unit is published on, is source- related information.
  • Figure 18 illustrates the structure. Within the elements of the structure the measures may differ depending on the category of the platform.
  • Figure 19 illustrates the modeling concept using the example of forum postings.
  • Web pages are seen as parts of the network World Wide Web.
  • a network can be described as a graph consisting of nodes and edges.
  • the nodes are Web pages and the edges are the links.
  • Link-based ranking approaches such as the PageRank are based on this view, but do not go beyond the granularity of the Web page.
  • User-generated content units are parts of web pages and are not considered in this view.
  • the traditional site-centered view does not apply to user-generated content. I propose a different view— the social media document view— that is adequate for the required granularity of user-generated content.
  • the social media document view takes into account that one Web page can contain different user-generated content units published by several authors at various times with varying quality. It shifts the focus from the Web page to the content unit. Furthermore, the social media document view accounts for the user's role. In the traditional site-centered view the user does not occur. It was not necessary because the former role of the user was one of passive consumption.
  • the World Wide Web of today is significantly co-authored by users. It consists of users acting either as authors publishing content or as readers consuming the content. The user's passive role has shifted to the active role of a contributor. With the author-related information as inherent part of the modeling, the social media document view accounts for that development.
  • Section 3.1 proposes a vector notation for the central characteristics of a user-generated content unit based on the social media document view introduced in section 2.4.
  • Section 3.2 proposes a ranking that is applicable to all types of user-generated content.
  • Section 3.3 illustrates the application of the proposed ranking approach by means of user-generated content units from different social media categories.
  • Encapsulation and information hiding are concepts known from object-oriented software construction (Meyer, 1997). Those are the two fundamental aspects of abstraction in software engineering. Abstraction is the process of identifying the essential aspects of an entity while ignoring unimportant details. Encapsulation describes a construction that facilitates the bundling of data. The concept of information hiding means that external aspects of an object are separated from its internal details. These concepts simplify the construction and maintenance of software with the help of modularization.
  • An object is a black box that can be constructed and modified independently (Connolly & Begg, 2005, p. 814).
  • A user-generated document can be represented as a vector of its properties.
  • the vector notation has been chosen because it is a compact notation suitable for further numerical processing.
  • Let D = {d_1, ..., d_m} be a set of social media documents and
  • P(d_i) = (p_1(d_i), ..., p_n(d_i)) be the vector of properties for d_i.
  • Let R(d_i) = (r_1(d_i), ..., r_a(d_i)), where a ≤ n and a is constant.
  • R(d_i) holds a fixed number of properties r_j of the social media document d_i.
  • the r_j(d_i) are derived from one or more properties p_j(d_i).
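A minimal sketch of the vector of properties as key-value pairs; the metadata keys and values below are illustrative assumptions, not prescribed by this description:

```python
# Each social media document d_i is represented by its metadata as
# key-value pairs; the keys below are illustrative assumptions.
document = {
    "author_contributions": 4734,    # author-related measure
    "source_members": 5000,          # source-related measure
    "text_length": 1250,             # intrinsic measure
    "number_of_replies": 12,         # extrinsic measure
    "publishing_date": "2012-08-15", # always available
}

# The vector notation P(d_i) is recovered by fixing an order on the keys.
keys = sorted(document)
P = [document[k] for k in keys]
```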
  • the social media cross-category comparison shows that there are a few measures that occur in every category. Namely, those are publishing date, author, and source. All other features are specific to a certain category or platform (e.g., number of retweets). For some features and categories, semantic analogues can be found.
  • followers in a microblogging service like Twitter are very similar to being in someone's circles, a relation found in some social networks.
  • the number of properties depends on the number of measures that can be retrieved for a social media document.
  • the number of measures that can be retrieved depends on the measures a platform offers and in some cases (e.g., author-related information) on the information provided by the user of the platform.
  • To identify a fixed number of abstract concepts that have semantic correspondents throughout all categories, the measures collected in Section 2.1 need to be examined for similarities. This has been done in Section 2.2. The resulting social media document view holds the common ground for user-generated content of all types.
  • the identified elements of the social media document view each serve as the semantic layer of an abstract concept that unites several per se incomparable measures into one aggregated measure, which then allows comparison.
  • the identified elements of the social media document view are:
  • the publishing date belongs to the extrinsic information of a document and is also available for all user-generated content units.
  • p_1(d_i), ..., p_{h-1}(d_i) are all author-related measures of d_i,
  • p_h(d_i), ..., p_{k-1}(d_i) are all source-related measures of d_i,
  • p_k(d_i), ..., p_{m-1}(d_i) are all intrinsic measures of d_i,
  • p_m(d_i), ..., p_{n-1}(d_i) are all extrinsic measures of d_i, and
  • p_n(d_i) is the publishing date of d_i.
  • R(d_i) is the social media document vector, with r_1, ..., r_5 being abstract concepts of the user-generated content unit d_i.
  • the five abstract concepts r_j(d_i) are derived from one or more concrete document properties p_j(d_i) and are referred to as follows:
  • r_author(d_i) is derived from p_1(d_i), ..., p_{h-1}(d_i),
  • r_source(d_i) is derived from p_h(d_i), ..., p_{k-1}(d_i),
  • r_intrinsic(d_i) is derived from p_k(d_i), ..., p_{m-1}(d_i),
  • r_extrinsic(d_i) is derived from p_m(d_i), ..., p_{n-1}(d_i), and r_recency(d_i) is derived from the time of search or processing and the publishing date p_n(d_i).
  • Figure 20 shows how the proposed social media document view relates to the derived social media document vector that holds a fixed number of properties of the social media document di.
  • the social media document vector comprises five scores.
  • the author score is derived from author-related measures
  • the source-score is based on source-related measures
  • intrinsic measures constitute the intrinsic document score
  • extrinsic measures are the basis for the extrinsic score
  • recency is derived from the publishing date.
  • Each score belongs to a document and can be mapped to one single value on which the ranking is based.
  • the levels of abstraction are modeled in a way that they are comprehensible for the user. This way it is possible to provide the user with an interface that allows him to factorize different dimensions according to his needs.
  • the framework proposed here also allows properties to be included or excluded, and the way they contribute to the ranking to be adjusted, if desired.
  • the author score r_author(d_i) adapts known concepts described in Subsubsection 2.3.1 for social media. It is composed of all metadata about the author of a user-generated content unit that allow conclusions to be drawn about the author. The basic assumptions are: the more an author is cited, the longer he or she has been a member of the community, the more contributions he or she has published, the higher his or her connectivity in the community (e.g., number of friends), the more positive peer ratings the author has received (e.g., number of likes), and the more information the author voluntarily reveals about himself or herself, the higher is the author's reputation.
  • the source score r_source(d_i) adapts known concepts described in Subsubsection 2.3.2 for social media. It is derived from metadata about the source where d_i is published. The conclusions that can be drawn from a social media source are deduced from the information that can be assessed about that source.
  • the size of a community can be measured by the number of members. The size of a community indicates the potential reach of a user-generated content unit. For example, a message being shared through Twitter— a platform that has 200 million active users (Wickre, 2013)— has a higher potential reach than a message published in a small forum that has 5,000 members.
  • the number of incoming links is an indicator of the popularity of a source. This measure is also used as a central element of the PageRank (Page et al., 1998). Hence, the size of the source and the number of references to it will be rewarded.
  • the intrinsic score r_intrinsic(d_i) allows content-derived features to be included in the ranking.
  • Several conclusions about the content can be drawn from consuming the content itself.
  • the properties that can be gained by computer analysis are limited compared to human capabilities.
  • User-generated content can consist of texts, pictures, video and audio files or a mixture of them. Often, also nontext multi-media content is accompanied by descriptive text.
  • Content-derived features for audio and video content are left to further research. If the content is or contains text, indicators can be gained by text analysis. Text mining is a lively field of research that offers many approaches at different levels of sophistication.
  • the framework proposed here is open to several solutions. With respect to the desired language-independence, the features proposed here are language-independent. If the approach presented here is applied to content written in a single language only, a more sophisticated language-dependent approach could also be applied. The following are proposals for features that can be easily derived from texts.
  • the proposed features are meant to be a starting point that can be extended and further developed to more sophisticated levels.
  • the modularity of the proposed framework also makes it possible to neglect single scores, such as the intrinsic score of social media documents, completely. From studies presented in Subsubsection 2.3.3, it can be concluded that references contribute to text credibility. Consequently, references will be rewarded. Furthermore, text length has been shown to be an efficient indicator for text quality. The number of sentences, the number of words per sentence, and the number of questions can also be used as features.
  • the extrinsic score r_extrinsic(d_i) captures the assessments of user-generated content units described in Subsection 2.3.4.
  • a social media document is part of a lively, dynamic system driven by interactions between users and content, enabled through social media platforms as recommendations, shares, likes, and so on. Every interaction with a user-generated content unit, made by users who have already consumed it, contains information about its potential relevance. That allows predicting the relative probability that a piece of content will be interesting for others.
  • the concept of the extrinsic score is the advancement of the traditional word-of-mouth concept combined with the wisdom of the crowds concept.
  • the collection of all user interactions with a content unit—whether something is read, commented, recommended, shared or liked— is a crowd sourced assessment of the content unit that can serve as recommendation for other users.
  • the extrinsic score is derived from the additional information available on document level. Most of these features are produced by implicit or explicit peer evaluation. Explicit peer evaluations are one-click ratings such as likes on Facebook, +1 on Google+, or thumbs up on Youtube. Shares on Facebook and Twitter are also a kind of explicit peer evaluation. Comments are also a kind of peer evaluation. In the content of a comment, users explicitly express their opinion; implicitly, with each comment, users show that the content unit was worth the time spent writing the comment.
  • the number of comments is also a feature indicating relevance.
  • In forums, it is the number of hits and the size of a thread, measured by the number of replies to a root posting, that indicate the relevance of a topic and a root posting.
  • the number of comments per entry shows how much attention a social media document received.
  • the number of links to a blog post, also referred to as track-backs, reveals how often a blog post has been shared and thus recommended.
  • blogs often integrate other social media platforms as plug-ins. Consequently, there can be Facebook likes, +1s, and other peer ratings for blog posts as well.
  • Recency is a well-known concept for the user. It is the time that has passed between the publication of a piece of content and its consumption by a user. The less time has passed, the more recent is the user-generated content unit. Even though many user-generated content units are time-dependent and more relevant when new, this cannot be generalized. Consequently, I propose to reward newer content units and to allow the user to adapt recency according to his needs.

3.2 Cross-Platform Compatibility Creates Comparability
  • the first step maps the nonnumerical information to numerical values. There are several possible solutions for how this can be achieved. Finding the best one is a trade-off between information quality and processing efficiency. This question should be evaluated separately and is not part of this thesis. In the following, one possible solution is introduced.
  • Step 2 Normalization For each key the value is normalized with respect to the known maximum. To do that, a comparison map CM is created that holds all keys and their maxima. Each new document is then compared with CM. If a new document holds a property that is not part of CM, CM is updated and the new property is added along with its value as first maximum. If the key is already part of the map, its value is looked up and compared with the new value. If the new value is higher than the prevalent value in CM, the value is reassigned to the key. Next, all the values of the document are normalized with respect to CM. The normalization is based on the achieved maximum of each property. This means that each property is measured in terms of its performance with respect to the global range of this property.
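Step 2 can be sketched as follows, assuming numeric metadata values; the function names and example documents are illustrative:

```python
def update_cm(cm, document):
    """Step 2 (first part): keep the comparison map CM up to date with
    the maximum value observed so far for each key."""
    for key, value in document.items():
        if key not in cm or value > cm[key]:
            cm[key] = value

def normalize(document, cm):
    """Step 2 (second part): normalize each property with respect to
    its known maximum in CM."""
    return {key: (value / cm[key] if cm[key] else 0.0)
            for key, value in document.items()}

cm = {}
documents = [
    {"number_of_replies": 12, "hits": 300},
    {"number_of_replies": 3, "hits": 1200},
]
for d in documents:
    update_cm(cm, d)
normalized = [normalize(d, cm) for d in documents]
```

Each property is thus measured in terms of its performance relative to the global range of that property.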
  • Step 3 Aggregation For each document a new array is created that holds the five doc scores. The aggregation is based on the average performance of the properties.
  • Step 4 Reduction to one score
  • the five scores can now be mapped to one score.
  • a simple approach is to calculate the length of the vector weighing all five scores equally.
  • Alternatively, the arithmetic mean could be used, which requires fewer resources for its calculation.
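Steps 3 and 4 can be sketched as follows. The grouping of normalized properties into the five concepts is an illustrative assumption (it depends on the platform), and both reduction variants are shown:

```python
import math

# Which normalized property feeds which of the five scores is an
# illustrative assumption; in practice it depends on the platform.
GROUPS = {
    "author":    ["author_contributions"],
    "source":    ["source_members"],
    "intrinsic": ["text_length"],
    "extrinsic": ["number_of_replies", "hits"],
    "recency":   ["recency"],
}

def aggregate(normalized):
    """Step 3: average the normalized properties of each group into one
    of the five document scores."""
    scores = {}
    for concept, keys in GROUPS.items():
        values = [normalized[k] for k in keys if k in normalized]
        scores[concept] = sum(values) / len(values) if values else 0.0
    return scores

def reduce_mean(scores):
    """Step 4, variant A: arithmetic mean of the five scores."""
    return sum(scores.values()) / len(scores)

def reduce_norm(scores):
    """Step 4, variant B: euclidean norm (vector length), weighing all
    five scores equally."""
    return math.sqrt(sum(v * v for v in scores.values()))

scores = aggregate({"author_contributions": 0.5, "source_members": 1.0,
                    "text_length": 0.2, "number_of_replies": 0.25,
                    "hits": 0.75, "recency": 1.0})
```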
  • the proposed approach can be applied to large sets of user-generated content as well as to small subsets.
  • the values for the map CM that are used to normalize measures are proposed to be gathered from the set of content units for which the scores are calculated.
  • the values in map CM used to normalize measures can also be gathered from a different, larger set than the set for which the scores are calculated. For example, if resources are limited, the set of user-generated content units that are ranked can be limited but still be measured in relation to values gained from a larger set of content units.
  • the map CM could even contain manually researched maximum values.
  • the following example calculation shall illustrate the method described in the previous section for user-generated content units from different social media categories.
  • the content units are randomly selected from three different forums and two different media sharing sites.
  • the categories were chosen to demonstrate the range of content units and their metadata.
  • the examples chosen differ in type and number of metadata. Furthermore, they differ in the type of content they contain.
  • User-generated content units from forums contain text
  • user-generated content units from media sharing sites contain photos or videos.
  • Table 1 shows the metadata for user-generated content units from three different forums and two different media sharing platforms.
  • Example number 1 is from forum.runnersworld.de (http://forum.runnersworld.de/forum/trainings doctrine-fuer-marathon/
  • example number 2 is from www.laufforum.de (http://www.laufforum.de/immer-noch-kein-speed-87509.html)
  • example number 6 is also from www.flickr.com (http://www.flickr.com/photos/36755776@N07/7780251000/)
  • Table 1 shows metadata for eight user-generated content units from three different forums and two different media sharing platforms, accessed: August 15, 2012.
  • the first column of Table 1 shows the type of metadata referenced by their labels.
  • the subsequent columns show their values in the different examples.
  • the metadata for examples 1-4 differ from the metadata for examples 5-8.
  • the user-generated content units from Flickr and Youtube have likes, whereas the user-generated content units from the forums do not.
  • Intrinsic measures that give indications about the quality of the content, derived from the content have been introduced for texts only. Intrinsic measures for multi-media content such as photos and videos are left to further research.
  • Metadata are differently labeled, but have the same meaning. Some metadata do not have the same but similar meanings. Depending on the desired precision, it is suitable to map those to one metadatum. In the example at hand, some metadata have been semantically mapped. Originally, there are number of replies for a forum's content unit and number of comments for a media sharing platform's content unit. It is a matter of choice whether to interpret them as expressing the same information or to interpret them as expressing different information. If differently labeled metadata are interpreted to have the same meaning, they can be mapped to one label. If differently labeled metadata are interpreted to have different meanings, different labels should be kept.
  • number of replies of a forum's content unit and number of comments of a media sharing content unit have been both mapped to number of replies.
  • hits from forums and views from media sharing platforms have been both mapped to hits. In some cases, it can make sense to differentiate between those two. Hits usually indicate that the user-generated content unit has been clicked on, whereas views indicate that the user-generated content unit has been consumed for at least a short time.
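The semantic mapping of labels described above can be sketched as a simple lookup table; the table entries reflect the interpretation chosen in this example, and all names are illustrative:

```python
# Replies/comments and hits/views are each treated as expressing the
# same information, per the mapping chosen in this example.
LABEL_MAP = {
    "number_of_comments": "number_of_replies",  # media sharing -> forum label
    "views": "hits",                            # media sharing -> forum label
}

def unify_labels(metadata):
    """Map semantically equivalent metadata labels to one label."""
    return {LABEL_MAP.get(key, key): value for key, value in metadata.items()}

unified = unify_labels({"number_of_comments": 7, "views": 420, "likes": 3})
```

If hits and views are interpreted to have different meanings, the corresponding entry is simply removed from the table and both labels are kept.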
  • the maximum value for publishing date has been calculated as difference in days between the newest publishing date and a reference date.
  • the retrieval date served as reference date.
  • Table 2 shows the results of the calculation (maximum values of the metadata of examples 1-8.). When new documents are added, the map of maxima has to be updated regularly. When it changes, the scores for the content units have to be updated as well.
  • Table 3 shows the metadata of examples 1-8 normalized with respect to the maximum values.
  • the calculation of recency applied here has an exemplary character. Other methods to derive a recency factor from a publishing date can be applied as well. The choice of the method depends on which distribution for recency is desired and the resources that are available for calculation. Table 3 shows the results. All numbers are rounded to three digits after the decimal point.
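One possible recency calculation, in the spirit of the exemplary calculation above, is a linear decay between a publishing date and a reference date. The decay window is an illustrative assumption; other distributions can be substituted:

```python
from datetime import date

def recency_score(publishing_date, reference_date, max_age_days):
    """One possible recency calculation: linear decay from 1.0 (published
    on the reference date) to 0.0 (max_age_days old or older). The window
    max_age_days is an illustrative assumption."""
    age = (reference_date - publishing_date).days
    return max(0.0, 1.0 - age / max_age_days)

# E.g., a content unit published two weeks before a retrieval date,
# with a four-week decay window.
score = recency_score(date(2012, 8, 1), date(2012, 8, 15), max_age_days=28)
```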
  • Table 4 shows the five relevance scores for examples 1-8.
  • the euclidean norm is one option.
  • the arithmetic mean is used. Table 5 shows the results.
  • the euclidean norm requires the calculation of the square root of the sum of the squares of all five dimensions, whereas the arithmetic mean only requires the sum divided by the number of summands. For large data sets, the arithmetic mean might be preferable because it requires less complex calculations. The determination of the optimal solution depends on the application and is left to further research.
  • Table 5 shows the result scores and ranks for examples 1-8.
  • Figure 21 shows screen-shots of the three user-generated content units with the highest scores.
  • the user-generated content unit with the highest score of the eight examples is a video on Youtube by MumfordandSons.
  • MumfordandSons is the band that plays in the video of example 7. It is a relatively popular band, and it is the band's own Youtube account. This is rewarded with the author-related measure realname.
  • MumfordandSons has the highest author score.
  • the second ranked is also a video on Youtube. It is a video of the song Hey Jude by The Beatles. It has been watched 7,542,870 times. This is reflected in the highest extrinsic score.
  • the user-generated content unit with the third highest score is a thread in a runner's forum.
  • the second highest author score is received by a thread in a runner's forum where Chri.S shares his experiences with his training and gives advice to other runners. Chri.S is an experienced runner and an active user of the forum. In six years of membership he wrote 4,734 contributions in the runner's forum. His high author score reflects this.
  • the example case demonstrates how the proposed approach can be applied to user-generated content units from different platforms of different social media categories. It shows the five scoring dimensions and illustrates how they can be evaluated separately or as one aggregated score.
  • the query-independent ranking method presented in the previous sections can be used in a variety of applications. First, it can be applied in a discovery engine for user-generated content. Second, in combination with a query-dependent ranking, it can be applied in a search engine specialized for user-generated content. Third, the proposed ranking approach can be applied to rank any set of user-generated content units. For example, it could be used to rank a set of user-generated content units that are part of a social media monitoring tool. Furthermore, the presented approach can be applied to all types of documents for which metadata is available that allows an author score, a source score, an intrinsic score, an extrinsic score, and a recency score to be determined.
  • the ranking can be based on the remaining subset of scores. If, for example, there is no information about the source of a set of documents, the ranking can be based on author-related information, intrinsic information, extrinsic information, and recency.
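The aggregation over whichever subset of scores is available can be sketched as follows. This is an illustrative Python sketch only: the function name, the equal default weighting, and the renormalization over the available dimensions are assumptions made for illustration, not part of the claimed method.

```python
def overall_score(scores, weights=None):
    """Aggregate the available dimension scores into one overall score.

    `scores` maps a dimension name to a value in [0, 1]; dimensions for
    which no information is available are simply omitted.  The weights of
    the missing dimensions are redistributed over the remaining ones.
    """
    dims = ["author", "source", "intrinsic", "extrinsic", "recency"]
    if weights is None:
        weights = {d: 1.0 for d in dims}   # assumed: equal weighting
    present = [d for d in dims if d in scores]
    total = sum(weights[d] for d in present)
    return sum(weights[d] * scores[d] for d in present) / total

# No source information available: rank on the remaining four dimensions.
print(round(overall_score({"author": 0.8, "intrinsic": 0.5,
                           "extrinsic": 0.9, "recency": 0.2}), 6))  # 0.6
```

With the source score missing, the remaining four scores are averaged, so the example yields (0.8 + 0.5 + 0.9 + 0.2) / 4 = 0.6.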
  • the following section presents a concept of a discovery engine for user-generated content in Section 4.1 and a search engine for user-generated content in Section 4.2, as examples for applications of the proposed query-independent ranking approach.
  • FIG. 22 schematically illustrates the main steps of how information is obtained from social media platforms for availability in a user interface.
  • the data extraction process collects user-generated content units and their metadata from social media platforms and stores them in a database. Then, the content units can be processed (e.g., query-independent ranking). The results are stored and can be delivered for display in a user interface.
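The extract, store, process, and deliver steps described above can be sketched as follows. This is a minimal Python illustration in which the platform connectors, the in-memory list standing in for the database, and the precomputed score field are all hypothetical placeholders.

```python
# Minimal sketch of the pipeline: collect content units and their metadata
# from (hypothetical) platform connectors, store them, process them with a
# ranking step, and deliver the results for display in a user interface.
database = []

def extract(platform_connectors):
    """Collect content units from each connector and store them."""
    for fetch in platform_connectors:
        for unit in fetch():          # each unit: dict of content + metadata
            database.append(unit)

def process():
    """Placeholder processing step: sort units by a precomputed score."""
    return sorted(database, key=lambda u: u["score"], reverse=True)

def deliver(results, n=10):
    """Hand the top-n ranked units to the user interface."""
    return results[:n]

extract([lambda: [{"id": 1, "score": 0.4}, {"id": 2, "score": 0.9}]])
print([u["id"] for u in deliver(process())])  # [2, 1]
```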
  • the proposed ranking approach makes it possible to rank user-generated content units from different platforms independently of a search query. It can be applied in a discovery engine that provides users with the content units that have the highest score of all evaluated content units.
  • FIG. 23 shows a wireframe of the proposed concept.
  • the main area in the center shows user-generated content units from different platforms.
  • the user-generated content units are displayed in the order of their rank based on the five scores.
  • the weight of the five scores can be adjusted by the user according to his interests and information needs.
  • Figure 24 specifies in detail the presentation of content units.
  • the content of a content unit is displayed in the central area.
  • the source URL and its source score are displayed along with the publishing date. Below this, the author and the author score are given. Directly above the content, the title of the contribution is provided. Next to the title the result score is visualized by stars.
  • the result score is also provided in numerical form.
  • the extrinsic score is displayed as popularity. Below the extrinsic score, the extrinsic measures that led to the extrinsic score are provided. The triangle next to Popularity indicates that measures can be displayed or hidden as desired. The more user-generated content units the discovery engine evaluates, the more valuable the results are. Ideally, the discovery engine covers all public user-generated content units.
  • the discovery engine for user-generated content enables exploratory search and is suited for users with feature-related information needs, who do not search for a specific topic or search term and do not have an exact idea of what they want to find.
  • a user who seeks the best-scoring content units could explore the content units displayed by the discovery engine in the order of their overall score. Users could use the discovery engine to look for inspiration, for something new, or for what is going on in the world. Users can visit the discovery engine to discover what people share, comment on, and like worldwide.
  • the user could be provided with a possibility to manipulate the weight of the scores according to his interests.
  • the author score is labeled Author
  • the source score is labeled Source
  • the intrinsic score is labeled Content
  • the extrinsic score is labeled Popularity
  • the recency score is labeled Recency.
  • the labels have been chosen to indicate the meaning of the scores. For example, if a user is interested in the highest scoring sources, he could set the source score to 100 percent, thereby setting all other scores to zero, and browse through the results. If he is interested in the most active and most recommended authors, he could instead set the author score to 100 percent and browse through the results.
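The effect of setting one weight to 100 percent can be illustrated as follows. This is a minimal sketch in which the unit structure, the score values, and the function name are assumptions made for illustration.

```python
def rank(units, weights):
    """Order content units by a weighted combination of their scores."""
    def combined(u):
        return sum(weights.get(dim, 0.0) * val
                   for dim, val in u["scores"].items())
    return sorted(units, key=combined, reverse=True)

# Hypothetical example units with author and source scores only:
units = [
    {"id": "blog post", "scores": {"author": 0.9, "source": 0.2}},
    {"id": "video",     "scores": {"author": 0.3, "source": 0.8}},
]

# Source score set to 100 percent, all other weights zero:
print([u["id"] for u in rank(units, {"source": 1.0})])  # ['video', 'blog post']
# Author score set to 100 percent instead:
print([u["id"] for u in rank(units, {"author": 1.0})])  # ['blog post', 'video']
```

Setting a single weight to 1.0 and all others to 0.0 reduces the ranking to that one dimension, which is the behavior the slider interface exposes to the user.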
  • the features used in the proposed query-independent ranking approach are all language-independent. It is therefore possible to apply it to content units of different languages. Provided that content units are separable by language (for example, content could be tagged during the data extraction process), it would also be possible to filter content units by language and to compare results from different languages.
  • the author score, source score, intrinsic score, extrinsic score, and recency score are additional information that helps the user orient himself. This enhances transparency, because all available metadata can be displayed together.
  • the user no longer has to search for additional metadata on other pages (as is usually the case for author-related and source-related information). Furthermore, he does not need to rely on a feeling gained from experience to interpret the metadata.
  • Metadata is already normalized in relation to other user-generated content units' values and displayed with the content.
  • Figure 24 shows how ranked content units could be displayed in the discovery engine's interface.
  • Figure 25 shows the visualization of ranked user-generated content units using the example of a concrete video on Youtube. This application provides the user with a source of information that is independent of what is shared in his own, limited social environment.
  • Search engines usually welcome their users with a blank page featuring the search field (e.g., Google (http://www.google.com), Bing (http://www.bing.com), search.com (http://www.search.com), WolframAlpha (http://www.wolframalpha.com)). They do not show content before the user has entered a keyword. These search engines allow nonexploratory search for content-related search needs only.
  • the proposed query-independent ranking approach can be used in combination with a query-dependent ranking approach for a search engine for user- generated content that supports exploratory as well as nonexploratory search.
  • a search engine for user-generated content is an example application for the proposed query- independent ranking approach.
  • Figure 26 shows a concept of the social media search engine. It allows users to express content-related information needs in addition to feature-related information needs.
  • the main area of the screen shows user-generated content units, ordered by rank.
  • the user can type in search queries to filter the user-generated content units with regard to a specific topic, for example.
  • the weight of the five scores can be adjusted according to the user's interests. Additionally, the user can filter content by date, by source and by country.
  • the social media search engine illustrated in Figure 26 is an extension of the concept presented in the previous section. It is extended by the possibility to search content units by entering a search query in the search field.
  • the main part of the interface is the result area, which displays the user-generated content units with the highest scores.
  • the scores serve as additional information for the user. They help him orient himself and classify the search results.
  • filter concepts are added. If the information is available in the database, content units can be filtered by specific dates, sources, or countries.
  • the left hand side of Figure 26 indicates further possibilities to integrate filter concepts. Alternatively, the user could be offered the possibility to filter not by source but by social media type.
  • the search query can be used as filter to the ranked lists of user-generated content units.
  • the set of user-generated content units is divided into two subsets: one subset containing the content units in which the entered search query occurs and one subset containing the content units in which the entered search query does not occur.
  • An improved approach calculates a second query-dependent rank that reflects the degree of correspondence between the text and the search query.
  • Information retrieval offers a variety of approaches for query-dependent ranking of text documents.
  • the query-dependent rank is combined with the query-independent rank.
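One possible way to combine the two ranks is a weighted sum. The following sketch is only illustrative: the simple term-frequency surrogate for the query-dependent rank and the mixing parameter alpha are assumptions, not the specific information-retrieval method of this description.

```python
def query_dependent_score(text, query):
    """Very simple term-frequency surrogate for a query-dependent rank."""
    terms = query.lower().split()
    words = text.lower().split()
    return sum(words.count(t) for t in terms) / max(len(words), 1)

def combined_rank(units, query, alpha=0.5):
    """Combine the query-independent score with a query-dependent score.

    alpha weights the query-independent part; units in which the query
    does not occur receive a query-dependent score of zero.
    """
    def score(u):
        return (alpha * u["qi_score"]
                + (1 - alpha) * query_dependent_score(u["text"], query))
    return sorted(units, key=score, reverse=True)

# Hypothetical example units with precomputed query-independent scores:
units = [
    {"id": 1, "text": "marathon training advice", "qi_score": 0.2},
    {"id": 2, "text": "cat videos compilation",   "qi_score": 0.3},
]
print([u["id"] for u in combined_rank(units, "marathon")])  # [1, 2]
```

Here the unit matching the query ranks first despite its lower query-independent score, illustrating how the query-dependent part reorders the query-independent list.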
  • This description provides a framework to compare entities by different types and numbers of measures.
  • this description provides a query-independent ranking method to compare user-generated content units from different social media platforms with each other. It solves the problem of comparability of metadata of different quantity and types, as is the case for user-generated content. This is done by providing a model that can be applied to user-generated content with different metadata. The approach maps different metadata to aspects that are common for all types of user-generated content.
  • a score for each aspect is calculated.
  • a user-generated content unit is represented by a vector of its scores.
  • the scores are derived from measures that can be obtained from metadata.
  • the modeled aspects are: author-related information, source-related information, intrinsic information, extrinsic information, and recency.
  • the related information is normalized with respect to the maximum known value for the information. It thus relates the information for a given user-generated content unit to other user-generated content units. For a single content unit it helps the user to estimate the magnitude of the numbers provided in its metadata.
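The normalization with respect to the maximum known value can be sketched as follows. The maximum view count of 10,000,000 is an assumed value chosen for illustration, not a value from the description.

```python
def normalize(value, max_known):
    """Relate a raw metadata measure to the maximum known value, yielding
    a number in [0, 1]."""
    return value / max_known if max_known else 0.0

# The 7,542,870 views of the Hey Jude example related to an assumed
# maximum view count over all evaluated content units:
views = 7_542_870
max_views = 10_000_000
print(normalize(views, max_views))  # 0.754287
```

This relates the raw number to the other evaluated content units, so the user can estimate its magnitude without prior experience with the platform's typical values.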
  • the five scores for each document can be mapped to a single score. For a given set of user-generated content units this makes it possible to compare them and establish an order among them. This description also provides a suggestion for an interface and a visualization of the ranked user-generated content units.
  • the proposed approach is language-independent. Therefore, it can be applied to user-generated content independent of the language of their content.
  • the proposed query-independent ranking can be combined with query-dependent ranking to build a search engine for user-generated content that accounts for the specific characteristics of user-generated content.
  • a measure that indicates quality for media files could for example be the file size, assuming that a larger file size indicates a higher resolution. But for online content the file size is often reduced and optimized for fast transfer. Therefore, the file size is not necessarily suitable as quality indicator.
  • Twitter-author A has 10 followers which have an average of 10 followers
  • Twitter- author B has 50 followers which have an average of 10 followers
  • Twitter-author C also has 50 followers, which have an average of 100 followers. We could compare the authors based on their primary connections by the number of their followers. Then, the author scores of Twitter-author B and Twitter-author C would be equal. But let us assume there is a 50 percent probability that the people who follow Twitter-authors B and C share their tweets. Then, a tweet by author B would reach his 50 followers plus half of the followers of his followers, hence 300 people. A tweet by author C would also reach his 50 followers plus half of the followers of his followers, hence 2,550 people.
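The reach calculation of this example can be reproduced as follows; a minimal sketch assuming, as above, that each follower shares a tweet with probability 0.5.

```python
def expected_reach(followers, avg_followers_of_followers, share_probability):
    """Followers reached directly plus the expected secondary audience."""
    return followers + share_probability * followers * avg_followers_of_followers

print(expected_reach(50, 10, 0.5))   # 300.0   (Twitter-author B)
print(expected_reach(50, 100, 0.5))  # 2550.0  (Twitter-author C)
```

The sketch confirms the figures in the example: 50 + 0.5 * 50 * 10 = 300 for author B and 50 + 0.5 * 50 * 100 = 2,550 for author C.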
  • the object of the ranking proposed in this description is the social media document (cf., Section 2.2) that can be allocated to level 1 in Figure 16 and to level 1 in Figure 27.
  • the ranked objects are the annotations (cf., level 1 in Figure 27) of locations (cf., level 0 in Figure 27).
  • for rating and review platforms the ranked objects are the reviews, which refer to a product or service.
  • the proposed ranking approach ranks the annotations and reviews (cf., level 1 in Figure 27).
  • the object of these contributions— location, product, or service— is allocated on level 0 in Figure 27.
  • the object of a contribution on level 0 is not a social media document as described in the social media document view in Section 2.4.
  • the annotations refer to locations. Locations do not have a distinct author and publishing date.
  • ratings and reviews refer to a product or service, which do not have a publishing date or author. Nevertheless, the average rating of a product is an indicator of the quality of the product, service, or experience reviewed. Consequently, for platforms of this type, level 0 is also a relevant level of consideration. Further work could investigate possibilities to aggregate the ranking of user-generated content units to a higher reference point, such as products, services, or venues. For a single source this might be straightforward, but the possibilities of integration into a cross-category compatible framework, as proposed in this work, need to be investigated.
  • Section 4 provides a suggestion of how to display and visualize ranked user-generated content units (cf., Figure 24 and Figure 25).
  • Future research could investigate whether all measures and scores should be provided to the user at once or whether it is more useful to display a subset that, for example, could consist of the scores only.
  • it could be evaluated whether it contributes more to usability to display numerical values or to transfer the values into a visualization.
  • the visualization of the values gives rise to the question of the level of detail at which the visualization should represent the numerical values. Further work might explore how this additional information should be visualized in a way that is most comprehensible and contributes most to usability.
  • Tables 6 to 13 present a collection of metadata available for each of the evaluated categories. All measures are sorted according to the social media document view introduced in Subsection 2.4. Measures are listed independent of their significance for the ranking of user-generated content units.

Abstract

Method for automatically generating a sorted list of user generated input derived from social media platforms by a) in a first step (101) automatically extracting data related to a plurality of user generated input datasets and / or metadata from the social media platform, b) optionally in a second step (102) automatically extracting for each user generated input dataset the intrinsic data from the content, in particular video, text, audio, and / or picture data, c) in a third step (103) normalizing the intrinsic data and / or the metadata according to respective measures, such as the number of views and / or likes; d) in a fourth step (104) aggregating the normalized measures into at least one numerical score (rauthor, rsource, rintrinsic, rextrinsic, rrecency) for each user generated input dataset and / or metadata; e) in a fifth step (105) reducing the at least one numerical score (rauthor, rsource, rintrinsic, rextrinsic, rrecency) of the aggregated normalized measures to one overall score; f) in a sixth step (106) ranking the user generated input datasets according to their respective overall scores.

Description

Method and computer product for automatically generating a sorted list from user generated input and / or metadata derived from social media platforms
The invention relates to a method according to claim 1 and a computer product according to claim 11.
User-generated content is content publicly available on online platforms contributed by their users. The increasing amount of user-generated content available on social media platforms requires new methods to find, evaluate, and compare it. This comprises the technical problem to be solved by the embodiment described below.
User-generated content, such as blog postings, forum discussions, shared videos, and so on, contains information that can be used for its evaluation independent of specific search interests.
Existing query-independent ranking approaches for Web pages are not applicable to user- generated content because they do not take the latter's specific characteristics into account.
In the following, different embodiments of a query-independent ranking approach specifically for user-generated content are presented that make it possible to rank user-generated content from different platforms.
The embodiments refer to an analysis of user-generated content from different social media platforms (e.g., blogs, forums, and social networks).
This analysis introduces the social media document view, which models the characteristic properties that all user-generated content units have in common.
Based on this, a query-independent ranking approach is proposed that evaluates content units using several scores, which can be aggregated to one score.
It comprises for instance an author score, a source score, and a score for the popularity of a user-generated content unit. The proposed approach thus provides a unified way of ranking and comparing user-generated content across different platforms.
Different embodiments are described in the following.
Social media platforms are ubiquitous today and provide an enormous amount of data related to their users. Accessing this data (e.g., for automated market research) is difficult in particular because the data is distributed over different platforms.
One of the key concepts is the use of metadata contained in the social media platforms, which is described in Section 2.
The independent method patent claim 1 comprises six steps 101, 102, 103, 104, 105, 106 which are part of the flowsheet shown in Figures 1 and 2. The second step 102 is optional.
The flowsheet in Figures 1 and 2 itself shows additional, optional method steps which are in particular claimed in the dependent subclaims.
The references in Figures 1 and 2 refer to the parts in this description. The invention also relates to a computer program product with the features given in claim 11. The computer program product can be realized as a program stored on some medium and / or implemented in hardware, i.e. a custom-made chip.
1 Key Terms
This section gives an overview of the key terms used in the remainder of this description.
In the context of this description, the central characteristic for user-generated content is the possibility for users to publish content to others.
To apply the concept of publishing in the context of social media, it is rendered more precisely. To publish means making information publicly available. To adapt the concept to social media, several levels of public are distinguished. General public means that no receiver is specified by the contributing user. The content is available for everyone. This means that the audience is potentially unlimited. Limited public means that no receiver is explicitly specified by the contributing user, but the audience is limited. The limitation can be caused by platforms that require registration prior to reading. This is the case, for example, if a platform presents its content only to registered users; although anyone might be admitted to register, the audience is limited to the registered users. Limited public can be subdivided into known-limited public and unknown-limited public. Known-limited public comprises the cases when no receiver is specified but the audience is limited to known people. An example is content shared with a group, such as friends in social networks. This example illustrates that known-limited public is similar to private. Unknown-limited public describes the case when the audience is limited but not exclusively to known people. An example is when content is shared with a closed community such as friends of friends in social networks.
If the audience is limited to specified receivers, it is called private. In the context of this description, private communication is not user-generated content. This means that telephone calls, written letters, faxes, emails, SMS, instant messages, and so on, do not fall under the notion of user-generated content as it is used in this description. However, private communication can be part of a social media platform. Social networks, for instance, usually allow publishing content to the general public or to a limited public, as well as sending private messages.
In social media, the user who contributes a piece of content does not need to define his audience, but he can limit the audience. Reach is the number of people who receive a message. The less the audience is limited by a contributor, the more potential reach the message has. The degree of intimacy usually increases with limitation of the audience. Figure 3 shows the private and public levels of communication. Their characteristics are illustrated in relation to reach and intimacy.
Figure 3 shows the Reach-Intimacy-Model. The Reach-Intimacy-Model illustrates private communication and the levels of public for user-generated content in relation to reach and intimacy. Private communication is not regarded as user-generated content. This notion of public, adapted for user-generated content, replaces the concept of sender and receiver by contributor and audience. The contributor is the user who published a message. The contributor does not necessarily have to be the creator of the content. Whether or not the contributor is the creator of the content, in social media he is usually displayed as the author (in some cases, platforms display citations) and will therefore be referred to as the author.
User-generated content is content published on an online platform by users. The term social media comprises platforms that contain user-generated content. Users do not need programming skills to publish content on a social media platform.
Whether content contributed by a company on a social media platform is considered user- generated content, depends on the notion of user. User can refer to the user of a social media platform. In this case, the content contributed by a company on a social media platform would be regarded as user-generated content. User can also refer to private individual as opposed to professional or business person. In this case, the content contributed by a company on a social media platform would not be considered user-generated content. In the context of this description, user refers to the user of a social media platform.
For search engines the smallest unit is a Web page with a URL as identifier. For user-generated content this view does not sufficiently apply. One Web page, i.e. one URL, usually contains several social media entries from different authors. For social media the smallest unit is the user-generated content unit. A user-generated content unit is one single contribution by one author at a given time. Collaboratively created content usually has more than one author. This case is not covered in this description.
A user-generated content unit consists of core data and metadata. The given piece of information, the content, is the core data. Metadata is information about a given piece of information (Baeza-Yates & Ribeiro-Neto, 2003). Examples of metadata about user-generated content are the date of publication, the status of the author in the community, and the number of views.
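The separation of core data and metadata can be sketched as a simple data structure. The field names and the example values below are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ContentUnit:
    """A user-generated content unit: core data plus metadata."""
    content: str                       # core data (here: text)
    metadata: dict = field(default_factory=dict)

# Hypothetical example unit with a few typical metadata fields:
unit = ContentUnit(
    content="Hey Jude, don't make it bad ...",
    metadata={"published": "2008-06-12", "author_status": "verified",
              "views": 7_542_870},
)
print(unit.metadata["views"])  # 7542870
```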
Opinions that are expressed by one click are also referred to as one-click-opinions. Examples are Facebook's likes, Google's +1, Youtube's thumbs up, and so on. Ratings of user-generated content units by other users are peer-ratings.
Social media platforms include, but are not limited to:
1. Blogs
2. Forums
3. Location sharing and annotation platforms
4. Media sharing platforms
5. Microblogs
6. Question and answer platforms
7. Rating and review platforms
8. Social networks
The following provides a short characterization of each category. The sets of characteristics of the categories are not necessarily disjoint. Some platforms allocated to a certain category can have aspects of other categories. For example, social networks allow sharing pictures and videos, as is typical for media sharing platforms.
Blogs are a special form of Web sites keeping publication deceptively simple. Entries are displayed in reverse chronological order, presenting the most recent entry at the top of the page. Blogs can be differentiated by the number of users authorized to publish blog entries into single- and multi-authored blogs. (Note that also in multi-authored blogs content units are usually not collaboratively contributed.) Until the mid-2000s blogs were usually written by one author. In recent years multi-author blogs have become popular as well (Safko, 2010). Bloggers tend to evolve their blogs around a special interest (Macdonald, Santos, Ounis & Soboroff, 2010). Blogs can also be distinguished by the degree of professionalism of the content produced into professional- and private-content blogs. Blogs range from personal diaries to professional journalists' and corporate blogs.
A forum is an online discussion site where people can hold conversations in the form of posted messages. Conversations are organized in threads and are stored permanently. A thread belongs to a topic. It consists of a root-posting and replies. In open forums content is public and can be read by everyone. In closed forums it is necessary to become a member of the forum community to read the postings. To actively take part in a discussion it is usually necessary to become a member. Forums have a hierarchic structure. The first level displays a list of topics that are covered by the forum. Threads covering the same topic are collected in a sub-forum. Threads are on a lower level. They start with a root-posting containing a question, topic, or statement. A thread consists of one or more postings, which are organized by the time of submission, usually in chronological order. Some forums offer their members the possibility to connect with each other, as is common in social networks. But unlike in social networks, users can read discussions and contributions whether or not they are connected. Members can discuss topics with any other member; being connected is not a precondition. Hence, the average forum user has fewer connections than the average social network user. Forums have existed since the early days of the Internet and they form rich repositories of collaborative knowledge.
Location sharing and annotation platforms apply location based services that enable groups of friends to share their current location and annotations. Location based services allow people to see where they are geographically with the help of GPS equipped mobile phones. Examples are applications that help the user to navigate from A to B, locating others on a map displayed on the phone or receiving information about traffic jam ahead (Rogers, Sharp & Preece, 2011). Location sharing and annotation platforms usually allow users to plot their location and share it with other users. Users can also create places, upload pictures, videos and leave written messages for others.
Media sharing refers to platforms where registered users can upload their content and share it with friends or provide it to the public. Other users can rate and comment on the content. Microblogs allow users to share information by broadcasting short, real-time messages. A well-known microblog provider is Twitter (http://www.twitter.com). A microblog entry consists of a short text or a link to images or videos. Using microblogs, the author does not specify a recipient. Every message is public by default and the recipients choose whose messages they read, i.e., whom they follow. Users who follow someone are called her followers; the followings are the users whom she follows. A content unit published on Twitter is called a tweet. A tweet cited by someone else is called a retweet. Category assignments or tags are marked with a number sign (#).
Question and answer platforms are platforms where users can pose questions and everyone can answer them. Answers can be rated by other users.
Rating and review platforms allow users to rate and comment on products or services. There are rating and review platforms that are completely user-generated, and there are commercial platforms that integrate user-generated content. Ratings are opinions that can be contributed by just one click on a given scale. The scale can be binary (e.g., thumbs up), or it can have more levels (e.g., x out of n stars). Reviews are written texts about products, services, or experiences. Usually, platforms allow both ratings and reviews.
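Ratings on different scales can only be compared after mapping them to a common range. The following is a minimal sketch; the function name and the simple linear mapping are assumptions made for illustration.

```python
def normalize_rating(value, scale_max, scale_min=0):
    """Map a rating on an arbitrary scale to [0, 1].

    Works for binary scales (thumbs up = 1 out of 1) as well as
    x-out-of-n-star scales and scales that do not start at zero.
    """
    return (value - scale_min) / (scale_max - scale_min)

print(normalize_rating(1, 1))       # 1.0    thumbs up on a binary scale
print(normalize_rating(4, 5))       # 0.8    four out of five stars
print(normalize_rating(7, 10, 1))   # ~0.667 a 7 on a 1-to-10 scale
```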
Social networks are platforms that allow individuals to create a profile and articulate a list of other users with whom they share a connection. Users can view and traverse their connections (Boyd & Ellison, 2008).
Further terms used in the following:
• The information the user desires is referred to as information need.
• The technique with which a retrieval system chooses documents and presents them to the user is referred to as information retrieval method.
• The documents the retrieval system chooses from a larger set of documents in the retrieval process are the retrieved documents.
2 Metadata in User- Generated Content
User-generated content units consist of core data and metadata. Core data is the content itself. The content can be text, audio files, pictures, videos or a combination of these. Metadata is information about information. In the context of user-generated content metadata is further information about the content, such as the date of publication or the author of the content.
Users are willing to contribute content as well as metadata on online platforms that allow them to do so (Mika, 2007). Different platforms allow their users to contribute different kinds of metadata. Consequently, user-generated content from different platforms has heterogeneous metadata. The analysis of metadata for all social media categories is the basis for further insights about the structural nature of user-generated content. To do so it is necessary to collect all features provided by different social media platforms. User-generated content units from different platforms are analyzed. For each category up to six representatives were examined. The number of representatives chosen per category depends on how diverse a category's representatives are. If all platforms of a category provide the same information, one example already suffices to show what kind of information is provided by platforms belonging to that category. If there are differences between platforms belonging to the same category, more examples are presented to illustrate the differences.
The following section starts with a summary of the analysis of metadata for each category. The second subsection introduces the structural types of metadata that can be observed throughout the categories. The third subsection introduces a semantic pattern for the metadata of user-generated content. The pattern applies to user-generated content from all categories and is independent of the platform. This is the basis for the modeling of a ranking that works category-independently. A full list of features provided by the social media platforms analyzed can be found in the Appendix.
2.1 Analysis of Metadata for User- Generated Content
This section summarizes the analysis of metadata for user-generated content for each category. It is the basis for further analysis and modeling presented in the subsequent sections.
2. 1 . 1 Blogs
User-generated content units from blogs are blog posts. The traditional blog is written by one single person and blog posts are displayed in reverse chronological order showing the most recent entry on top of the page. Today, blogs occur in many different varieties. They range from personal diaries to professional journalists' and multi-authored blogs.
Whether a blog is single-authored or multi-authored, a single posting always has one author, a publishing date and a source. The source is the blog it was published in. Other than those three, there are no further measures that are standard in blogs. But there is a multitude of useful measures that occur in some blogs. Often found in blogs is a feature that allows readers to comment on blog posts. Every comment shows that the blog entry has been worth a user's time to write the comment. The number of comments on a blog entry indicates that the entry has been noticed. Some blogs also count and display the number of views per entry.
Similar to back-links, which are references from other Web sites to the blog, track-backs refer to references from other Web sites to a specific posting. The number of track-backs indicates how many others regarded a posting as interesting or useful. Track-backs can only be applied to user-generated content units that have a distinct URL.
Figure 4 shows an example of a blog post with a plugin from another platform. The screen-shot shows a blog post with an integrated Flattr button at the bottom (Peter, 2013). The post has been Flattred six times, which means that the author of the post receives small amounts of money from six of his readers, who enjoyed the text. The text of the blog post has been shortened and masked to guide the reader's attention.
The reverse chronological order in which blog posts are displayed could imply that for blogs the most recent entries are also the most relevant. But this cannot be generally assumed. There are also blogs about topics that are not dependent on time. Blogs tend to evolve around special topics their authors are interested in and have gathered some expertise in (Macdonald et al., 2010). There are blogs about current events, new products or recent experiences, for which the most recent entry indeed tends to be also the most interesting. Blog entries that discuss general ideas or give advice on problems are less dependent on time. Also, blogs about personal interests, such as photography or literature, are collection-like in character. In collections the most recent entry can be as interesting as any other entry. Therefore, blogs are often alternatively structured with categories and tags. An example is Wordpress, an open source blogging tool. Wordpress provides tags and categories to group related posts (WordPress, 2013).
For a single-authored blog, the rating of the blog is equal to the rating of the author. For a multi-authored blog this is not the case. Deciding automatically whether a blog is single- or multi-authored is not trivial, and the answer can also change over time. Since a posting has a single author and a single source, the information about the author and about the source can be allocated separately to the posting. Thus, postings from single- and multi-authored blogs can be treated alike.
Figure 5 shows an example of a blog post with several plugins from other platforms. The screen-shot shows a blog post with plugins (from left to right) from Twitter, Facebook's like, Reddit, Email, Google+, share on Facebook, send to StumbleUpon, Fark It!, Share on LinkedIn, and more sharing options in the box at the bottom (Beadon, 2013). The text of the blog post has been shortened and masked to guide the reader's attention.
There are approaches that rank blogs. They can be classified as either link-based or feature-driven. They rank whole blogs, not single postings. Therefore, they cannot be directly integrated into the ranking of a content unit. The number of back-links, the number of times a blog is referenced, is an example of a simple ranking feature for a whole blog. Link-based blog ranking approaches estimate the relevance of a blog by the number of other blogs that link to it. The relevance of a source can be applied to predict the probability for a piece of content published within that platform to be relevant as well.
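The back-link heuristic described above can be sketched in a few lines. This is a minimal illustration with hypothetical blog names; a real system would first crawl the link graph between blogs:

```python
# Link-based source ranking sketch: score each blog by the number of
# other blogs linking to it (its back-links). All data is hypothetical.

def backlink_counts(links):
    """links: iterable of (source_blog, target_blog) pairs."""
    counts = {}
    for _, target in links:
        counts[target] = counts.get(target, 0) + 1
    return counts

# Hypothetical link graph: blog-a and blog-b both link to blog-c.
links = [
    ("blog-a", "blog-c"),
    ("blog-b", "blog-c"),
    ("blog-a", "blog-b"),
]

# Sort sources by descending back-link count; blog-c ranks first.
ranked = sorted(backlink_counts(links).items(), key=lambda kv: -kv[1])
print(ranked)
```

The resulting source score could then serve as a prior for the relevance of any single posting published in that source, as the text suggests.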
Some blogs integrate other social media applications as plugins. These plugins can be clicked and have a certain meaning. A click on a Twitter button means that the post is shared on Twitter. A Facebook button means that the post is shared on Facebook. A Facebook like button means that the user expresses his approval of the text without sharing it. A Flattr button (Flattr is a microdonation provider, http://flattr.com/) means that the reader of the post donates a small amount of money to the author. Figure 4 shows an example of a Flattred blog post. Figure 5 shows an example of a blog that includes various plugins. They allow readers to like a content unit or to share it on social networks. This is also a way to crowd-source relevance. The following are examples of other social media applications that are currently often used in blogs:
• Flattr (http://flattr.com/)
• Twitter
• Facebook (e.g., share & like)
• Google+ (http://www.google.com/intl/en/+1/button)
• StumbleUpon (http://www.stumbleupon.com/)
• InShare (http://de.linkedin.com/)
• Reddit (http://www.reddit.com/)
Each click on a plugin button means that a user consumed the post, at least partly, and rewarded the post with an interaction. The possible interactions can have different semantics that express a higher or lower level of involvement, but they are all human-selected recommendations. Human recommendations are a useful source of information for others when they try to select content for their own consumption.
As with texts in general, the text of a blog post can be analyzed with regard to text length, frequency and use of specific words, number of references within the text, and so on.
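Such content features can be derived with simple parsing. The following is an illustrative sketch; the feature names and the naive sentence-counting heuristic are our own, not taken from the original work:

```python
import re

def text_features(text):
    """Derive simple content features from a post's text (a sketch)."""
    words = re.findall(r"\b\w+\b", text)
    return {
        "length_chars": len(text),                         # text length
        "num_words": len(words),                           # word count
        # naive: counts runs of sentence punctuation, including dots in URLs
        "num_sentences": len(re.findall(r"[.!?]+", text)),
        "num_questions": text.count("?"),                  # question marks
        "num_links": len(re.findall(r"https?://\S+", text)),  # references
    }

features = text_features("Great post! See http://example.com for details.")
```

A ranking could then weight such features, e.g. rewarding longer texts or texts containing references, in line with the feature-driven approaches mentioned above.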
2.1.2 Forums
A peculiarity of forums is their discussion-like character. This is what makes them so valuable for topics that need debate. But that also often leads to off-topic discussions that make it difficult to find the desired content units.
Forums tend to evolve around specific fields of interest. Forums are particularly valuable for users who seek like-minded people, specialists in a field of interest or information about specific topics. Specialists share their knowledge in their field of competence. They also write about their experiences with products and brands. In technology-oriented forums, for example, computers, monitors, and gadgets are discussed. In telecommunication forums, the best service providers and mobile devices are disputed. In sports-oriented forums, users share recommendations about training and their experiences with the latest sports gear. Hence, forums are also a popular source of product information prior to buying (Elsas & Glance, 2010).
There is a large number of forums with various focus topics online. There are some sources online that provide overviews and statistics for forums, one of which is Big Boards (http://www.big-boards.com). Most forums are based on either phpBB or vBulletin. PhpBB is open source software (http://www.phpbb.com), whereas vBulletin has to be licensed (http://www.vbulletin.com). These forum solutions only support a fixed order of topics. Organizing threads or postings by any concept of ranking is not common. Consequently, the user has to orient himself by reading through the topics and threads. Most forums also offer a text search that delivers matching results without a particular ranking.
It is especially hard to get an overview of all the information available for a posting published in a forum. This is due to the hierarchical structure of forums. Measures that are helpful for estimating the relevance of a posting are distributed among the different levels of a forum's hierarchical structure. Information such as when a posting was published and by whom can be found on posting-level as publishing date and author, as Figure 6 shows. The number of views and how many answers there are to an initial posting can be found on thread-level as number of replies, as Figure 7 shows. The number of threads and postings can be found on topic-level, as Figure 8 shows. Information about how many users are active in a forum community and how many subsequent remarks there are to an initial posting can be found one level above thread-level.
Figure 6 shows the hierarchic structure of forums showing the postings on posting-level, screen-shot from http://forums.bit-tech.net, accessed: August 15, 2012.
Figure 7 shows the hierarchic structure of forums on thread-level, screen-shot from http://forums.bit-tech.net, accessed: August 15, 2012.
Figure 8 shows the hierarchic structure of forums on topic-level, which gives an overview of the forum's topics, screen-shot from http://forums.bit-tech.net, accessed: August 15, 2012.
A large number of postings within a topic (or sub-forum) can indicate that a topic is popular. Another reason for a large number of postings within a topic or sub-forum can be the way a topic or sub-forum is composed. If, for example, one sub-forum subsumes all cultural topics, whereas political topics are subdivided into several sub-forums, numbers are difficult to compare. There is no mandatory inner structure for topics of interest. Sub-forums and topics are organized manually and consequently differ in their organization. The difference in topic organization can also be interpreted as a bias intended by a human mind. The resulting bias in the calculation might therefore still be reasonable in the semantics of the forum and therefore still helpful for the user. If, for example, the above-mentioned forum subsumes all cultural topics, whereas political topics are subdivided into several sub-forums, this could likely be a forum that was founded as a political forum. Consequently, it should indeed have more content within the political sub-forums than in the culture section.
Elsas & Glance (2010) worked on an approach to identify forums with rich product discussion. Their approach is based on a previously known list of products and brands people could search for. To identify relevant discussion within a forum they also worked with the number of postings within a topic. They solved the problem of different aggregation levels by ignoring information from higher levels. They assign each thread to the parent forum containing the thread, assuming that each message belongs only to the immediate parent forum. Higher-level forums are ignored. On the one hand this solves the problem of comparability of topic sizes. On the other hand it neglects potentially useful information on higher levels. Yet, information from different levels can be mapped to a single posting. This means that the same posting can be rated higher if it is posted in a higher-rated thread, topic or forum, or is written by a higher-rated author. Authors in forums are usually assigned a level of expertise based on the author's statistics. Forums differ in the number of levels that are assigned (e.g., newbie to expert signifying level 1 to level n). Users who are new to a forum community are mostly users who search for expertise. They initially become a member to post their questions. In contrast, users who have been members for a longer time are more likely to be interested in the topic of the forum in general. The number of postings published by an author can be interpreted as an indicator of expertise, presuming that a user who posts more has more expertise to share. The duration of his membership, in combination with his activity derived from the date of his last posting, gives a further indication of his potential to publish useful content.
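The author-related heuristic at the end of the paragraph (more postings, longer membership, recent activity) could be combined as in the following sketch. The multiplicative weighting is an illustrative assumption, not a formula from the cited work:

```python
from datetime import date

def author_expertise(num_posts, member_since, last_post, today=None):
    """Heuristic expertise score for a forum author (a sketch):
    more posts, a longer membership and recent activity all raise
    the score. The weighting here is an illustrative assumption."""
    today = today or date.today()
    membership_days = (today - member_since).days
    days_since_last = (today - last_post).days
    activity = 1.0 / (1.0 + days_since_last)  # decays with inactivity
    return num_posts * (1 + membership_days / 365.0) * activity

# Hypothetical author: 200 posts, member since 2010, last active recently.
score = author_expertise(
    num_posts=200,
    member_since=date(2010, 1, 1),
    last_post=date(2012, 8, 1),
    today=date(2012, 8, 15),
)
```

The score could then be allocated as author-related secondary information to each of the author's postings, in the sense described in Section 2.2.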
2.1.3 Location Sharing and Annotation Platforms
Location sharing and annotation refers to collaboratively produced metadata for virtual representations of physical places. Foursquare (https://de.foursquare.com), Loopt (http://www.loopt.com), Facebook Places (https://www.facebook.com/about/location), and Google Latitude (http://www.google.com/latitude) are examples of location sharing and annotation platforms. These applications are growing rapidly. Foursquare has more than 15 million users (https://foursquare.com/about) and Facebook Places is used by more than 30 million users. Users of Foursquare shared their locations over 100 million times by July 2010 (Cramer, Rost & Holmquist, 2011, p. 57).
Figure 9 shows a representation of the New York Marriott Marquis Hotel on the location sharing and annotation platform Foursquare. Foursquare (2013) gives general information about the venue such as address and contact information. Users uploaded 1879 photos of the hotel. They can be accessed at the top of the page. Below the short text about the hotel, the number of total visitors and the number of check-ins are displayed. Visitors left tips for other users, which are displayed at the bottom of the page.
Figure 10 shows a user's profile on the location sharing and annotation platform Foursquare. In the profile the user, Daer (2013), presents information about herself. At the top her name is displayed along with a short text describing herself. This user has performed 6,014 check-ins and given 174 tips, one of which is for the New York Marriott Marquis Hotel in Figure 9. On the right-hand side the number of badges, the user's level ("Superuser Level 1") and the number of her mayorships are displayed. At the bottom right the number of friends is provided. This view of the user's profile is specified as visible to the general public (see also Figure 3). It can be accessed without being a member of Foursquare.
Location-sharing services have a longer history in research than they have been available to consumers. Research activities with a focus on locating and tracking people go back to the early 1990s (Pier, 1991; Harper, Lamming & Newman, 1992). The first locator technology was the ActiveBadge, originated at Olivetti Cambridge Research Lab. It was intended for office application. Badges were to be worn in the workplace to track the locations of employees. Even at this early point of research, sociological and ethical questions were raised and considered important. The questions posed by Pier (1991, p. 285), "Will 'Big Brother' monitor your every move?" and "Can we architect systems that provide desirable services without actually revealing any individual's location and trail unless given permission by that individual?", illustrate farsightedness as well as the core problem with location tracking systems. Most ensuing systems in research have focused on location tracking, while providing the user with different levels of control over what is shared with whom, usually a limited audience (Iachello, Smith, Consolvo, Chen & Abowd, 2005; Reilly, Dearman, Ha, Smith & Inkpen, 2006; Barkhuus, Brown, Bell, Sherwood, Hall & Chalmers, 2008; Scellato, Noulas, Lambiotte & Mascolo, 2011).
The question "Can we architect systems that provide desirable services without actually revealing any individual's location and trail unless given permission by that individual?" (Pier, 1991, p. 285) already contains the answer to the addressed problem: permission by the individual. The important issue regarding privacy is that the location sharing is performed manually and not via tracking. Current solutions do not automatically track and share people's locations. If desired, the individual user can give permission and reveal his location to others. Each publication is enabled and authorized by the user himself. In in-depth interviews, Cramer et al. (2011) observed a shift from privacy-concern-driven behavior and data deluge to performative considerations in location sharing.
Check-ins are manually entered to pair user location with semantically enriched venues (e.g., restaurants, grocery stores, bars), which are visible to other users. Figure 9 shows an example of the representation of a hotel on the location sharing and annotation platform Foursquare.
The venues are the central reference points in location sharing and annotation. A venue has a name and a geographical location. Furthermore, the total number of people who have visited a location so far is provided along with the total number of check-ins. Assuming that users rather share places they like with other people, a large number of visitors suggests a high popularity of a location. A high ratio of check-ins per visitor suggests loyal customers. Annotations of locations always have an author and a publishing date. Users can rate the annotations of other users.
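The check-ins-per-visitor ratio mentioned above is straightforward to compute. A minimal sketch with hypothetical numbers; the function name is ours:

```python
def venue_stats(total_visitors, total_checkins):
    """Check-ins per visitor as a loyalty indicator (sketch).
    The interpretation follows the text; a high ratio suggests
    returning, loyal customers."""
    loyalty = total_checkins / total_visitors if total_visitors else 0.0
    return {
        "visitors": total_visitors,
        "checkins": total_checkins,
        "checkins_per_visitor": loyalty,
    }

# Hypothetical venue: 500 distinct visitors, 2000 check-ins in total.
stats = venue_stats(total_visitors=500, total_checkins=2000)
```

Here 2000 check-ins over 500 visitors yields 4.0 check-ins per visitor, which in the sense of the text would indicate a venue with loyal, returning guests.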
Additionally, the location sharing and annotation-user's profile shows how many total check-ins she published so far and how long she has been active on the platform. Figure 10 shows an example of a user's profile on the location sharing and annotation platform Foursquare.
2.1.4 Media Sharing Platforms
Media sharing platforms are platforms where registered users can upload content and share it with friends or provide it to the public. Existing platforms are specialized in specific media such as pictures, videos or audio content. Youtube is the most successful platform of our times for video sharing (http://www.youtube.com). Flickr is an example of a platform where users share pictures (http://www.flickr.com).
Users can connect with other users. Connections can be unidirectional, as is typical for microblogs, or mutual. Youtube even supports both types of relationships, namely friends and subscribers. Flickr supports friends as well as groups users can join. Usually, the number of connections is lower than in social networks. This might be due to the fact that on media sharing platforms users do not have to be connected to see each other's content. Users can comment on the content and contribute one-click-opinions.
A user-generated content unit from a social media sharing site usually has a contributor, a publishing date, a number of one-click-opinions, a number of views and a number of comments. Figure 11 and Figure 12 show examples of content units from two different social media sharing platforms.
User profiles typically give information about the user's nickname, the date when the user became member of the community, and the number of content units contributed.
Figure 11 shows a user-generated content unit on the media sharing platform Youtube. The screen-shot shows an example of a user-generated content unit on Youtube. The platform displays the video in the main area. It shows the title of the video 120715 - PSY - Gangnam style (Comeback stage)... and the contributing user CapsuleHD20 (2012). Furthermore, the number of views and the number of user ratings are displayed. The bottom of the figure shows when the video was contributed; in this example, the contributor is not the artist of the video. The artist, PSY, is explicitly named at the bottom right of the screen-shot.
Figure 12 shows a user-generated content unit on the media sharing platform Flickr.
2.1.5 Microblogs
In microblogs the author does not specify a recipient. The reader chooses which authors he would like to read postings from. Every message is public by default and the recipients choose whose messages they read. The follow relation is not mutual.
Twitter is a typical representative. A tweet always has one distinct author and a publishing date. Favorites are Twitter's one-click-opinions. Tweets can be further distributed by other users; on Twitter this is called a retweet. The number of favorites and the number of retweets are displayed with the content unit. They help users to estimate the importance of a tweet.
Figure 13a shows a user-generated content unit published on Twitter. Figure 13b shows a user's profile on Twitter.
The user's profile information shows how many content units a user contributed, how many followers he has and how many others he follows (i.e., following). The number of followers indicates an author's reach. The more followers an author has, the more people consider his contributions worth reading. If we think about the author and his followers as a directed graph, more conclusions about the value of an author's followers can be drawn. A user who follows fewer authors could be considered to be more selective about content and to choose more carefully whom he follows. Furthermore, it could be assumed that this user is more likely to really read the postings by those authors. The conclusion can be drawn that a large number of followers who themselves follow many authors is not as valuable as the same number of followers who follow few authors. Figure 13 shows an example of a user-generated content unit and a user's profile from the microblogging platform Twitter.
Microblogs differ from traditional blogs in being much shorter and smaller in file size. For example, a tweet is limited to 140 characters. It contains text, sometimes accompanied by a short-link (a short-link is a URL that is shortened in length and still directs to the required page). Therefore, the length of tweets varies only within this small range. A particularity of microblogs is that there are syntax agreements that can be used within the text. The syntax is not technically imposed by Twitter; it has emerged as conventions from the users' needs. Category tags can be derived from the text itself by parsing for the number sign (#). Retweets can easily be parsed, since they are marked within the text message by the letters RT.
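The syntax conventions described above (category tags marked with #, retweets marked with RT, mentions marked with @) can be parsed with simple regular expressions. This is a sketch; the exact patterns are our simplification of the informal conventions:

```python
import re

def parse_tweet(text):
    """Parse the informal Twitter syntax conventions described above:
    hashtags marked with '#', mentions marked with '@', and retweets
    marked with a leading 'RT'. A simplified sketch."""
    return {
        "hashtags": re.findall(r"#(\w+)", text),    # category tags
        "mentions": re.findall(r"@(\w+)", text),    # referenced users
        "is_retweet": bool(re.match(r"\s*RT\b", text)),
    }

# Hypothetical tweet text for illustration.
tweet = parse_tweet("RT @alice: loving the new #gadget at #CES")
```

The extracted hashtags could serve as category metadata for a tweet, and the retweet flag distinguishes original contributions from redistributed ones.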
2.1.6 Question and Answer Platforms
In question and answer platforms users can pose questions and other users can answer them. Question and answer platforms usually allow peer-ratings. Answers can be rated by other users. The goal of the rating procedure is to find the best, ideally the correct answer from all answers given.
Questions as well as answers have a publishing date and an author. The author's profile usually shows the author's nickname, the date of joining, the number of contributions, the number of questions posed, the number of questions answered, and the number of, usually peer-rated, best answers.
According to the works of Agichtein, Castillo, Donato, Gionis & Mishne (2008), the most significant indicators for the quality of questions as well as answers are peer ratings. Other significant features for quality classification of questions and answers are features derived from text analysis, such as punctuation density in the question's subject, number of words per sentence, the number of capitalization errors in the question, answer length (rewarding longer texts), unique number of words in an answer, and the word overlap between the question and the answer.
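Some of the text-derived features named above, such as answer length, the unique number of words in an answer, and the word overlap between question and answer, can be sketched naively as follows. The exact definitions used by Agichtein et al. may differ; this is an illustrative approximation:

```python
def answer_features(question, answer):
    """Naive sketch of three quality features for Q&A content:
    answer length (rewarding longer texts), unique words in the
    answer, and question-answer word overlap (Jaccard-style)."""
    q_words = set(question.lower().split())
    a_words = answer.lower().split()
    a_unique = set(a_words)
    # Overlap: shared words relative to all words in question and answer.
    overlap = len(q_words & a_unique) / len(q_words | a_unique)
    return {
        "answer_length": len(a_words),
        "unique_words": len(a_unique),
        "qa_overlap": overlap,
    }

# Hypothetical question-answer pair for illustration.
f = answer_features(
    "how do i reset my router",
    "unplug the router wait ten seconds then plug it back in",
)
```

A higher overlap suggests the answer actually addresses the question; combined with peer ratings, such features feed a quality classifier as in the cited work.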
2.1.7 Rating and Review Platforms
Rating and review platforms are specialized in ratings and reviews of products, services or experiences. Ratings and reviews can also be part of commercial platforms. Amazon is an example of a commercial platform that allows ratings and reviews by users. Content of this type is also referred to as consumer-generated product reviews (Archak, Ghose & Ipeirotis, 2007) or online consumer reviews (Yu, Zha, Wang & Chua, 2011).
Figure 14a shows a user-generated content unit published on Ciao. Next to the author's nickname and her profile picture is the overall rating the author gave the product (RosesAreRed1207, 2012). Figure 14b shows a user's profile on Ciao.
For ratings and reviews, the author as well as the date of publishing are displayed to the reader. Reviews are written texts of flexible length. Sometimes pictures can be contributed as well. Ratings are one-click-opinions on different scales. Some platforms offer several one-click-opinions for predefined criteria. Sometimes users who are willing to contribute a review have to provide their ratings for all of the predefined criteria offered. On the one hand, this can have the advantage that ratings are more differentiated. On the other hand, the predefined criteria need to fit the object of the rating, which is not always the case (for example, Ciao.de makes reviewers rate the smell of electronic products, such as an electric toothbrush (Ciao, 2013). Smell as a rating criterion might be suitable for toothpaste, but is irritating for an electric toothbrush).
Furthermore, most platforms allow peer-ratings of the reviews. Platforms differ in the scales they offer. Some offer a binary scale, "This review was helpful to me." and "This review was not helpful to me." (e.g., Amazon). Others offer to rate the degree of helpfulness on several levels (e.g., Ciao). Figure 14 shows an example of a user-generated content unit and a user's profile from the rating and review platform Ciao.
To give readers information about the author of a review, the number of helpfuls is displayed along with the review itself. A high number of positive peer-ratings indicates a high quality review.
Sometimes, users can also comment on reviews. In this case, the number of comments a review received is displayed.
Most platforms require a registration to contribute reviews. Authors can register with a pseudonym or their real name. Some systems offer to confirm the correctness of the real name (e.g., Amazon). Authors' profiles are publicly accessible providing statistics about the user's reviews to other users. The statistics differ from platform to platform. All platforms display the number of reviews the user has contributed. If peer-ratings are offered, the average peer-rating the author received for his reviews is commonly displayed, too. Some systems offer detailed statistics about peer-ratings and contributions (e.g., Ciao shows number of readings received, number of comments received, number of comments written, etc.).
2.1.8 Social Networks
Social networks are characterized by their users and the connections between them. Connections can be either mutual, such as the friends connection in Facebook, or unidirectional, as in Google's social network Google+. Google+ users can add anyone to their circles (i.e., the user's networks); the other user does not confirm the connection. For each publication, users can specify to which circle they would like to publish it.
A user-generated content unit is always published with author and date of publication. In a social network, a user-generated content unit is published as part of so-called feeds. Usually, a user-generated content unit appears in the feeds of all users connected with the contributor. Therefore, the number of connections indicates a user's reach. Originally, those postings consisted of short text messages. The possibilities to post photos, links and videos were added gradually. Usually, people can rate entries by leaving one-click-opinions, comment on them and share content with their connections. This leads to a number of further measures, such as the number of one-click-opinions, number of comments, and number of shares. Figure 15 shows two examples of user-generated content units from social networks.
Figure 15 shows user-generated content units from the social networks Google+ and Facebook.
For the purpose of this description it is important to note that content published in social networks might be published only to a limited public. If content is not published to the general public, it should not be analyzed, unless the reader is included in the limited public.
2.2 Cross-Category Comparison of Metadata
After appraisal of all the measures that occur in different social media categories, the data can be analyzed for patterns. To develop a modeling concept that provides comparability of information, the collected measures are analyzed for similarities and differences across the categories.
Figure 16 shows user-generated content and the allocation of information. The boxes labeled 0. to 2. represent content. It can be text, pictures or other media. The ellipses represent metadata. In the center of consideration is the contribution (1.). The contribution has a publishing date, a source, and an author. The contribution can have comments allocated to it (2.). A comment also has an author and usually a publishing date. The contribution can have an object it refers to (0.). For rating and review platforms, that can be a product, for example.
First of all, the elements that are to be ranked need to be determined. This also determines on which level user-generated content is compared. For user-generated content that stands for itself, that is, it neither refers to other elements nor has other elements referring to it, this is straightforward. But for a product review, for example, it needs to be specified whether the product, the review, or the comment on the review should be ranked.
It is necessary to distinguish the object of a user-generated content unit (e.g., a product on a rating and review platform or a location on a location sharing and annotation platform), the central user-generated content unit, and the user-generated content units that refer to the content unit as comments. Figure 16 illustrates this structure. It shows the central user-generated content unit as the contribution in the middle (labeled 1.). A contribution has an author, a source and a publishing date. A contribution can have comments (labeled 2.). A comment can have an author, too, but in the context of this description it is interpreted as additional information about a contribution. It is not itself an object of the ranking. A contribution can refer to other objects (labeled 0. in Figure 16). In the context of this work, these are likewise not objects of the ranking.
For each social media category analyzed in Section 2.1, there is a level of granularity for which there is a publishing date, a source where it has been published and an author by whom it was published. In the context of this work, this is the object of the ranking and the level of comparison. It corresponds to the contribution in Figure 16.
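The structure of Figure 16 can be captured in a small data model. This is a sketch; the class and field names are ours, chosen for illustration:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Comment:
    """A comment (labeled 2. in Figure 16): additional information
    about a contribution, not itself an object of the ranking."""
    author: str
    published: str

@dataclass
class Contribution:
    """The contribution (labeled 1. in Figure 16): the unit that is
    ranked. It always has an author, a source and a publishing date."""
    author: str
    source: str
    published: str
    referred_object: Optional[str] = None     # labeled 0., e.g. a product
    comments: List[Comment] = field(default_factory=list)

# Hypothetical example: a forum posting with one comment.
post = Contribution(
    author="alice",
    source="example-forum",
    published="2012-08-15",
    comments=[Comment(author="bob", published="2012-08-16")],
)
```

The model makes explicit that author, source and date are always present on the contribution, while the referred object and the comments are optional attachments.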
Three basic types of information can be distinguished: content information, primary information, and secondary information. Content information is information derived from the content. For example, if a social media content unit contains text, the text itself can be parsed to derive further information. Simple examples are text length, the number of words and sentences, the number of question marks, or the number of links the text contains. Less simple examples are the numbers of previously defined strings or of positive and negative adjectives. For videos, this can be the file size or the duration of the video.
Primary information is information that is displayed directly on the level of the user-generated content unit; it refers directly to one content unit. The publishing date and the author of a content unit are examples of information that is always displayed with the content unit (i.e., the user does not have to navigate). Publishing date and author refer to one content unit and can be distinctly allocated.
Secondary information is further information about the primary information of a user-generated content unit. It is often found on a different level than the content unit itself and can refer to more than one content unit. The number of content units an author has contributed is an example of secondary information. It reveals information about the author and cannot be directly allocated to a single content unit. But if we know who the author of a content unit is and we know something about the author, we can draw conclusions about the content units the author publishes.
Secondary information allocated to a user-generated content unit can be differentiated according to its origin. Information derived from the thread of a user-generated content unit from a forum is thread-inherited information (e.g., the number of views of a thread). Source-related information is secondary information that is derived from the source of a social media content unit (e.g., the number of community members). Author-related information is secondary information that is derived from an author's profile (e.g., the number of best answers an author contributed).
There is source-related information that can be assessed for all categories. An example is the number of back-links to a source. There is also source-related information that is specific for certain categories. The number of members of a community in a social network is an example.
Figure 17 shows the types of information for user-generated content units. User-generated content units consist of content and further information. Author, date, and source are primary information. Secondary information is further information about primary information. Specific information is available on some platforms but not on others. Content information is information derived from the content.
Similarly, there are different kinds of author-related information that can be assessed through the author's profile, depending on the platform. Microblogs for example offer the number of followers and the total number of tweets, whereas forums show the number of contributions without number of followers since that concept does not exist in forums. Nevertheless, both examples reveal further information about the author.
Specific information occurs only on certain platforms. Specific information is not available for all types of user-generated content. Various forms of peer-ratings are examples. Figure 17 shows an overview of all types of information related to a user-generated content unit.
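The distinction between primary, secondary, and specific information sketched above can be illustrated with a small data structure. The following Python sketch is illustrative only; the field names and example values are assumptions, not part of the original model.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ContentUnit:
    # Primary information: available for every user-generated content unit.
    author: str
    date: str
    source: str
    content: str
    # Secondary information: about primary information, not the unit itself.
    author_contribution_count: Optional[int] = None   # author-related
    source_member_count: Optional[int] = None         # source-related
    thread_view_count: Optional[int] = None           # thread-inherited
    # Specific information: only available on certain platforms.
    specific: dict = field(default_factory=dict)

unit = ContentUnit(author="alice", date="2014-03-01", source="example-forum",
                   content="How do I fix this?", thread_view_count=120,
                   specific={"best_answer_votes": 3})
```

Fields that a platform does not provide simply stay `None`, which mirrors the observation that secondary and specific information varies across platforms.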
2.3 From Data to Information: Semantics of Metadata
A datum without a meaning is just a number. When data have meaning, they become information that is useful for humans. The focus of the following section is the interpretation of the metadata available for user-generated content.
2.3.1 Why It Matters Who the Author Is
Transferring information about an author to his assertions is a well-known concept. If someone likes the novel Great Expectations, he might want to read David Copperfield as well, because it is also a novel by Charles Dickens. Of course, it may happen that one novel of an author is excellent and another is not. But although we cannot draw conclusions that are reliable in every case, this approximation has proved efficient for the majority of cases.
Expertism is another example of how the reputation of an author is used to draw conclusions about the content he or she produces. In traditional media, journalists are an example of trusted experts. An article written by a journalist is expected to be of better quality than an article by a layman. The same applies to scientists and their publications. An academic degree increases the trustworthiness of a contribution to a discourse through assumed expertism. In science, authors are more likely to be cited if they have already been cited more often than comparable work. A highly regarded scientist who has published many well-respected works is expected to publish more work that deserves respect. Quoting this well-respected author will probably be more convincing than quoting someone unknown. If someone claims that the consumption of onions relieves migraine symptoms, more people will try it if that someone is a neurology professor than if that someone is an organic farmer. These conclusions are not necessarily correct for every single publication. But nevertheless, even though it is not precise, it is an efficient approximation.
With the advent of social media, readers are confronted with more authors than they could ever learn enough about to judge their content by their reputation. Some users might have the feeling that this mass of authors contributing content in social media virtually equals complete anonymity. But there is information about authors in social media, too. An author who has written several contributions has proved his loyalty over the years of his membership, whereas for an author who has just joined a community it is unknown whether he only joined to fake positive reviews about his own products. An author who has many followers probably has a higher reach than an author who has just a few. An author who received many best-answer peer-ratings is probably more reliable than an author who has just contributed his first answer.
Authors' profiles vary significantly in the amount of personal information provided. There are authors who provide profile pictures, real names, contact information, and even a link to their personal Web site, whereas other authors provide a nickname only. Studies suggest that Web site credibility is enhanced by providing personal and contact information (Fogg, Marshall, Laraki, Osipovich, Varma, Fang, Paul, Rangnekar, Shon, Swani & Treinen, 2001; Fogg, Marshall, Osipovich, Varma, Laraki, Fang, Paul, Rangnekar, Shon, Swani et al., 2000; Fogg, Soohoo, Danielson, Marable, Stanford & Tauber, 2003).
2.3.2 Why It Matters Where Something Is Published
The source of a publication allows conclusions to be drawn about the publication itself. Properties of a source can be used to predict the probability that a piece of content published on that platform has these properties as well. This concept is well established; users know it from traditional media.
Newspapers, book publishers, and television channels are examples. When a reader decides to buy a newspaper, there are expectations about its contents, for example regarding political orientation and sophistication. The quality and credibility of an article can be roughly classified by the kind of newspaper it is published in. If an article is published in The New York Times, for example, readers probably expect it to be more credible and of higher quality than an article from a tabloid.
Further conclusions can be drawn from a newspaper's reach. A local newspaper covers different stories than an international newspaper. And if a story is published in a newspaper with a large print run, the news it contains will probably reach more people.
Social media platforms vary widely in their size. The size of a platform can be measured by number of contributions and number of members or visitors. The number of visitors correlates with the potential reach of the platform's content. A message being shared through Twitter— a platform that has 200 million active users (Wickre, 2013)— has a higher potential reach than a message published in a small forum that has 5,000 members. The number of incoming links is an indicator of the probability that the random surfer visits the platform and an indicator of its relative importance. The random surfer is a notion used by Page, Brin, Motwani & Winograd (1998). The random surfer surfs the Web by randomly following its hyperlink structure. When he arrives at a page with several outlinks, he randomly chooses one, navigates to the new page and continues this process indefinitely (Langville & Meyer, 2006). The probability is higher that users read an article accidentally if it is published on a platform that has many incoming links than if it is published on an unnoticed personal blog that has no incoming links. The mechanism can be compared to the print run of newspapers. An article published in a small local newspaper is less likely to be spread than an article published in a widely read international newspaper.
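The random surfer described by Page et al. can be sketched with a short simulation on a toy link graph. The graph, the page names, and the damping factor of 0.85 are illustrative assumptions; the sketch only shows that a well-linked platform is visited far more often than a platform with no incoming links.

```python
import random

# Toy Web graph: three pages link to "platform", none link to "personal_blog".
graph = {
    "a": ["platform"],
    "b": ["platform"],
    "c": ["platform", "a"],
    "platform": ["a", "b"],
    "personal_blog": ["platform"],
}

def random_surfer_visits(graph, steps=50_000, damping=0.85, seed=0):
    """Estimate visit frequencies by simulating the random surfer."""
    rng = random.Random(seed)
    pages = list(graph)
    page = rng.choice(pages)
    visits = {p: 0 for p in pages}
    for _ in range(steps):
        visits[page] += 1
        # With probability `damping`, follow a random outlink;
        # otherwise (or at a dead end) jump to a random page.
        if graph[page] and rng.random() < damping:
            page = rng.choice(graph[page])
        else:
            page = rng.choice(pages)
    return visits

freq = random_surfer_visits(graph)
```

In this toy run, `freq["platform"]` is much larger than `freq["personal_blog"]`, mirroring the print-run analogy in the text.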
2.3.3 What the Content Reveals
User-generated content can consist of texts, pictures, video, and audio files, or a mixture of them. Often, nontext multi-media content is also accompanied by descriptive text. Text can be analyzed for text features that are known to correlate with characteristics of the text. Studies show that citations, references and other kinds of source material contribute to a text's credibility (Fogg et al., 2000, 2001). Links within the text of a user-generated content unit can help to ascertain the source of information and indicate utility as they point to additional sources of information that can help the user (Moturu, 2010; Elgersma & de Rijke, 2008). Quotation marks are an indicator for citations as well, but they are also commonly used to indicate irony. A solution for the disambiguation of quotation marks used for irony and quotation marks used for citations could be that, in the case of irony, typically only one or two words are set in quotation marks, whereas a quotation is usually longer.
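The disambiguation heuristic suggested above (very short quoted spans tend to mark irony, longer ones citations) could be sketched as follows. The two-word threshold and the function name are illustrative assumptions.

```python
import re

def classify_quoted_spans(text, irony_max_words=2):
    """Heuristic from the text above: quoted spans of one or two words are
    treated as likely irony, longer spans as likely citations."""
    spans = re.findall(r'"([^"]+)"', text)
    return [("irony" if len(s.split()) <= irony_max_words else "citation", s)
            for s in spans]

labels = classify_quoted_spans(
    'Their "brilliant" plan failed. As Dickens wrote: '
    '"It was the best of times, it was the worst of times".')
```

On this example the single-word span is labeled irony and the long quotation a citation; a real system would of course need more context than span length alone.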
There are also machine learning algorithms that can be applied to detect quality flaws. Anderka, Stein & Lipka propose an approach for automatic quality flaw detection for Wikipedia articles. They propose to interpret the detection of quality flaws as a one-class classification problem: identifying articles that contain a particular quality flaw among a set of articles. An example of a frequent quality flaw is not citing any references or sources in an article (Anderka, Stein & Lipka, 2011b). The texts that contain a particular quality flaw are given as positive examples to decide for unseen texts whether they contain that particular flaw. For each known flaw, an expert is asked whether a given document suffers from it. Based on the manually tagged document set, a one-class classifier is trained, tested, and evaluated for each flaw.
The automatic determination of text quality is the object of many research projects (e.g., Chen, Liu, Chang & Lee, 2010; Kakkonen, Myller, Timonen & Sutinen, 2005). Typically, these approaches analyze texts for predefined vocabulary and text structure. Predefined vocabulary is not applicable to the task at hand. First of all, it is typical for user-generated content that neologisms and abbreviations are used. Furthermore, vocabulary is language-dependent and the work at hand aims at a language-independent approach. For the automated quality assessment of Wikipedia articles, Dalip, Goncalves, Cristo & Calado (2009) show that text length as well as structure and style are the most important quality features. Text length is a feature often used as a quality indicator (e.g., Moturu, 2010; Dalip et al., 2009; Hu, Lim, Sun, Lauw & Vuong, 2007). A concern is that text structure is not applicable to all types of user-generated content, because there are no common conventions for text structure that apply to all platforms under consideration.
Which metrics are used to determine the quality of content is a choice that also needs to be based on the desired accuracy and the available resources. Text length is a feature that can be easily extracted, whereas a classification approach requires more processing resources.
2.3.4 What Other People Tell Us
"Under the right circumstances, groups are remarkably intelligent, and are often smarter than the smartest people in them." (Surowiecki, 2005, p. XIII). The wisdom of the crowds phenomenon refers to the observation that the aggregated solutions from a group of individuals are sometimes better than the majority of the individual solutions. Traditionally, it has been applied to point estimation of continuous-valued physical quantities, for example (Surowiecki, 2005). The use of the wisdom of the crowds is an emerging field of research. It has also been widely applied for discrete labeling (Raykar, Yu, Zhao, Valadez, Florin, Bogoni & Moy, 2010), the prediction of urban traffic routes (Yu, Low, Oran & Jaillet, 2012), and Web page clipping (Zhang, Tang, Luo, Chen, Jiao, Wang & Liu, 2012). Wisdom of the crowds can be applied to assist individuals in their decision making process by gathering data about the decisions taken by a group of individuals.
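A minimal numeric illustration of the point-estimation case: the true value and the individual guesses below are invented for illustration; the point is only that the aggregated (mean) estimate can be closer to the truth than any individual guess.

```python
# Hypothetical estimation task (e.g., guessing a physical quantity).
true_value = 100.0
guesses = [72, 118, 95, 130, 88, 104, 91, 140, 60, 107]

# The crowd's aggregated solution is the mean of the individual guesses.
crowd_estimate = sum(guesses) / len(guesses)
crowd_error = abs(crowd_estimate - true_value)

# How many individuals were more accurate than the aggregate?
better_individuals = sum(1 for g in guesses if abs(g - true_value) < crowd_error)
```

Here the individual errors partly cancel out in the aggregate, which is the statistical mechanism behind the phenomenon.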
For social media, every one-click-opinion— every thumbs up, every like, every +1, and so on— is an assessment of a content unit. Each interaction with a content unit— be it a one-click-opinion of any kind or a comment— means that a user felt that the content unit was worth spending time on. User interactions with a contribution are a valuable source of information for other users who search for interesting content. Mishne & Glance (2006) demonstrate that the number of comments is a strong indicator for the popularity of a blog post. Compared to a comment, the amount of time a user invests is relatively small when he contributes a one-click-opinion. But a one-click-opinion is an explicit expression of opinion. The more people consider a content unit worth an interaction, the higher is the probability that the content unit might be interesting for other users as well. The collection of all user interactions with a content unit is a crowd-sourced assessment of the content unit that can serve as a recommendation for other users.
2.4 The Social Media Document View
The social media document view comprises the structure all user-generated content units have in common (The modeling approach presented here is based on the analysis of the metadata of the evaluated platforms. The platforms have been chosen as representatives that cover the typical range for the analyzed categories to allow transferability of results. However, when statements are made about how platforms or platform categories function, it is not impossible that there are or there will be platforms for which the statements do not apply. The descriptions of platforms and patterns are empirical observations of the platforms as they were at the time of the development of this work. The presented approach has been designed aiming at robustness towards variation and changes, but nevertheless, it makes no claim to be universally valid.) Traditionally, the term document is used to refer to a unit of information, typically consisting of text. It can also contain other media. A document can be a whole book, an article, or be part of a larger text such as a paragraph, a section or a chapter. Here, the term social media document has been chosen paralleling traditional information retrieval, where the ranked units are also referred to as documents.
Based on the patterns revealed by the cross-category analysis, further conclusions can be drawn for a social media document. I propose a social media specific document view for user-generated content units.
A social media document always has a certain structure that is independent of the category it belongs to. The user-generated content unit always contains content. This can be text, pictures, audio or video files, or a combination of them. Measures derived from the content itself are in the following referred to as intrinsic information. An example of intrinsic information is the number of references (i.e., links) a text contains.
Metadata about the content on the level of primary information is extrinsic information. Examples for extrinsic information are the number of replies and the number of likes.
Secondary information that is not specific to a category but is equally provided throughout all social media categories leads to two further elements of the social media document: author-related information and source-related information (cf. Section 2.2). Author-related information can be assessed through the author's identifier (e.g., the author's nickname concatenated with the source name). Source-related information is information about the source and can be assessed through the source's identifier (e.g., its URL).
Figure 18 shows the social media document view for user-generated content. It comprises the structure all user-generated content units have in common. Measures that can be derived from the content itself are intrinsic information. Metadata about the content is extrinsic information. Information that is not directly about the content unit, but about its author is author-related information. Information about the platform, where the content unit is published on, is source- related information.
This is the structure all user-generated content units have in common. Figure 18 illustrates the structure. Within the elements of the structure the measures may differ depending on the category of the platform. Figure 19 illustrates the modeling concept using the example of forum postings.
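The common structure of a social media document (content plus intrinsic, extrinsic, author-related, and source-related information) could be represented as follows. Field names and example values are illustrative assumptions, not taken from the original.

```python
from dataclasses import dataclass, field

@dataclass
class SocialMediaDocument:
    content: str                                        # the content unit itself
    intrinsic: dict = field(default_factory=dict)       # derived from the content
    extrinsic: dict = field(default_factory=dict)       # metadata about the content
    author_related: dict = field(default_factory=dict)  # via the author's identifier
    source_related: dict = field(default_factory=dict)  # via the source's identifier

doc = SocialMediaDocument(
    content="Great recipe, thanks!",
    intrinsic={"link_count": 0},
    extrinsic={"likes": 12, "replies": 3, "publishing_date": "2014-03-01"},
    author_related={"id": "bob@example-forum", "contributions": 57},
    source_related={"id": "http://example-forum.invalid", "members": 5000},
)
```

The four dictionaries correspond to the four elements of the document view; which keys they contain may differ per platform category, while the structure stays the same.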
Traditionally, Web pages are seen as parts of the network World Wide Web. A network can be described as a graph consisting of nodes and edges. In this case, the nodes are Web pages and the edges are the links. Link-based ranking approaches such as the PageRank are based on this view, but do not go beyond the granularity of the Web page. User-generated content units are parts of Web pages and are not considered in this view. Hence, the traditional site-centered view does not apply to user-generated content. I propose a different view, the social media document view, that is adequate for the required granularity of user-generated content.
The social media document view takes into account that one Web page can contain different user-generated content units published by several authors at various times with varying quality. It shifts the focus from the Web page to the content unit. Furthermore, the social media document view accounts for the user's role. In the traditional site-centered view, the user does not appear. This was not necessary because the former role of the user was one of passive consumption. Nowadays, the user can actively participate. The World Wide Web of today is significantly co-authored by users. It consists of users acting either as authors publishing content or as readers consuming the content. The user's passive role has shifted to the active role of a contributor. With the author-related information as an inherent part of the modeling, the social media document view accounts for that development.

3 A Cross-Platform Ranking Approach for User-Generated Content
The development of a query-independent ranking approach for user-generated content is the main goal of this thesis. The following section presents a modeling and a ranking solution based on the insights presented in the previous sections. Section 3.1 proposes a vector notation for the central characteristics of a user-generated content unit based on the social media document view introduced in Section 2.4. The calculation of a ranking that is applicable to all types of user-generated content is proposed in Section 3.2. The example calculation in Section 3.3 illustrates the application of the proposed ranking approach by means of user-generated content units from different social media categories.
3.1 Five Scores Reflecting the Social Media Document View
Looking for a solution that both allows including more information than what is common to all user-generated content units (i.e., publishing date, author, and source) and maintains comparability, approaches from other disciplines were evaluated. Encapsulation and information hiding are concepts known from object-oriented software construction (Meyer, 1997). Those are the two fundamental aspects of abstraction in software engineering. Abstraction is the process of identifying the essential aspects of an entity while ignoring unimportant details. Encapsulation describes a construction that facilitates the bundling of data. The concept of information hiding means that external aspects of an object are separated from its internal details. These concepts simplify the construction and maintenance of software with the help of modularization. An object is a black box that can be constructed and modified independently (Connolly & Begg, 2005, p. 814).
These generic concepts shall serve as inspiration for solving the problem at hand. The measures from different social media platforms differ in their specification and in their co-domains. Nevertheless, they hold important information that allows one to be more precise about the document they belong to.
A user-generated document can be represented as a vector of its properties. The vector notation has been chosen because it is a compact notation suitable for further numerical processing. Let D = {d1, ..., dm} be a set of social media documents and P(di) = (p1(di), ..., pn(di)) be the vector of properties for di.
To compare different measures from different categories, one possible solution could be to find corresponding measures throughout the categories. One major drawback of this approach is that many measures do not have correspondents throughout all other categories. Thus, this solution cannot include all information available for all platforms. Furthermore, it requires thorough understanding of the platforms to map corresponding measures. Moreover, this has to be done manually. Consequently, this approach does not allow new platforms to be added flexibly and quickly. Another solution is to develop a number of abstract concepts that unite several per se incomparable measures into one aggregated measure, which then allows comparison. Abstraction levels should have semantic correspondents and be category-independent. Different features from different categories are combined in a way that they express the same fact for each category. After normalization of the co-domains, different aspects become comparable throughout all social media categories. The abstraction levels are individually composed modules, but the modeled aspects are comparable between social media categories. The result is a vector
R(di) = (r1(di), ..., ra(di)),

where a < n and a is constant. R(di) holds a fixed number of properties rj of the social media document di. The rj(di) are derived from one or more properties pj(di).
The social media cross-category comparison shows that there are a few measures that occur in every category. Namely, those are publishing date, author, and source. All other features are specific to a certain category or platform (e.g., number of retweets). For some features and categories, semantic analogues can be found. Followers in a microblogging service like Twitter are very similar to being in someone's circles in some social networks.
The number of properties depends on the number of measures that can be retrieved for a social media document. The number of measures that can be retrieved depends on the measures a platform offers and in some cases (e.g., author-related information) on the information provided by the user of the platform. Hence, the dimension n of the property vector P(di) = (p1(di), ..., pn(di)) differs.
To identify a fixed number of abstract concepts that have semantic correspondents throughout all categories, the measures collected in Section 2.1 need to be examined for similarities. This has been done in Section 2.2. The resulting social media document view holds the common ground for user-generated content of all types.
The identified elements of the social media document view each serve as the semantic layer of an abstract concept that unites several per se incomparable measures into one aggregated measure, which then allows comparison.
The identified elements of the social media document view are:
author-related information,
source-related information,
intrinsic information and
extrinsic information.
The publishing date belongs to the extrinsic information of a document and is also available for all user-generated content units.
Let D = {d1, ..., dm} be a set of user-generated content units and P(di) = (p1(di), ..., pn(di)) be the properties for di.
The properties P(di) = (p1(di), ..., pn(di)) hold the following:
p1(di), ..., ph-1(di) are all author-related measures of di,
ph(di), ..., pk-1(di) are all source-related measures of di,
pk(di), ..., pm-1(di) are all intrinsic measures of di,
pm(di), ..., pn-1(di) are all extrinsic measures of di, and
pn(di) is the publishing date of di.
Now we can derive a vector R(di) = (r1(di), ..., r5(di)) that holds a fixed number of properties of the social media document di with the following dimensions:

R(di) = (rauthor(di), rsource(di), rintrinsic(di), rextrinsic(di), rrecency(di))

R(di) is the social media document vector, r1, ..., r5 being abstract concepts of the user-generated content unit di.
The five abstract concepts rj(di) are derived from one or more concrete document properties pj(di) and are referred to as follows:
rauthor(di) is derived from p1(di), ..., ph-1(di),
rsource(di) is derived from ph(di), ..., pk-1(di),
rintrinsic(di) is derived from pk(di), ..., pm-1(di),
rextrinsic(di) is derived from pm(di), ..., pn-1(di),
and rrecency(di) is derived from the time of search or processing time and the publishing date pn(di).
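A hedged sketch of how the five scores might be derived from a labeled property map. The grouping of measures, the plain summation, and the linear recency decay are illustrative assumptions of this sketch, not the thesis's actual formulas.

```python
from datetime import date

# Illustrative groupings of measure labels into the four abstract concepts.
AUTHOR = {"followers", "contributions", "member_since_days"}
SOURCE = {"members", "incoming_links"}
INTRINSIC = {"link_count", "text_length"}
EXTRINSIC = {"likes", "comments", "shares"}

def document_vector(props, processing_date, max_age_days=365):
    """Derive R(d_i) = (r_author, r_source, r_intrinsic, r_extrinsic, r_recency)
    from a flat property map; aggregation here is a plain sum."""
    def score(keys):
        return float(sum(v for k, v in props.items() if k in keys))
    published = date.fromisoformat(props["publishing_date"])
    age = (processing_date - published).days
    recency = max(0.0, 1.0 - age / max_age_days)  # newer content scores higher
    return (score(AUTHOR), score(SOURCE), score(INTRINSIC),
            score(EXTRINSIC), recency)

R = document_vector(
    {"followers": 200, "likes": 12, "comments": 3, "text_length": 80,
     "members": 5000, "publishing_date": "2014-07-01"},
    processing_date=date(2014, 7, 11))
```

The fixed dimension of the result is what makes documents from different platforms comparable, regardless of how many raw properties each platform offers; in practice the per-group values would still need normalization before ranking.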
Figure 20 shows how the proposed social media document view relates to the derived social media document vector that holds a fixed number of properties of the social media document di. The social media document vector comprises five scores. The author score is derived from author-related measures, the source score is based on source-related measures, intrinsic measures constitute the intrinsic document score, extrinsic measures are the basis for the extrinsic score, and recency is derived from the publishing date. Each score belongs to a document and can be mapped to one single value on which the ranking is based. At the same time, it is possible to keep the scores separate in order to be able to easily adjust their weights. With regard to the code of conduct no. 3, the levels of abstraction are modeled in a way that is comprehensible for the user. This way it is possible to provide the user with an interface that allows him to weight the different dimensions according to his needs. The framework proposed here also allows one to include or exclude properties and adjust the way they contribute to the ranking, if desired.
The author score rauthor(di) adapts known concepts described in Subsubsection 2.3.1 for social media. It is composed of all metadata about the author of a user-generated content unit that allow conclusions to be drawn about the author. The basic assumptions are: the more an author is cited, the longer he or she has been a member of the community, the more contributions he or she has published, the higher his or her connectivity in the community (e.g., number of friends), the more positive peer ratings an author has received (e.g., number of likes), and the more information the author voluntarily reveals about himself, the higher is the author's reputation.
The source score rsource (di) adapts known concepts described in Subsubsection 2.3.2 for social media. It is derived from metadata about the source where <¾ is published. The conclusions that can be drawn from a social media source are deduced from the information that can be assessed about a source. The size of a community can be measured by the number of members. The size of a community indicates the potential reach of a user-generated content unit. For example, a message being shared through Twitter— a platform that has 200 million active users (Wickre, 2013)— has a higher potential reach than a message published in a small forum that has 5,000 members. The number of incoming links is an indicator of the popularity of a source. This measure is also used as central element of the PageRank (Page et al., 1998). Hence, the size of the source and number of references to it will be rewarded.
The intrinsic score rintrinsic(di) allows content-derived features to be included in the ranking. Several conclusions about the content can be drawn from consuming the content itself. The properties that can be gained by computer analysis are limited compared to human capabilities. User-generated content can consist of texts, pictures, video and audio files, or a mixture of them. Often, nontext multi-media content is also accompanied by descriptive text. Content-derived features for audio and video content are left to further research. If the content is or contains text, indicators can be gained by text analysis. Text mining is a lively field of research that offers many approaches of different levels of sophistication. Generally, the framework proposed here is open to several solutions. With respect to the desired language-independence, the features proposed here are language-independent. If the approach presented here is applied to content written in a single language only, a more sophisticated language-dependent approach could also be applied. The following are proposals for features that can be easily derived from texts.
The proposed features are meant to be a starting point that can be extended and further developed to more sophisticated levels. The modularity of the proposed framework also allows single scores, such as the intrinsic score of social media documents, to be neglected completely. From studies presented in Subsubsection 2.3.3, it can be concluded that references contribute to text credibility. Consequently, references will be rewarded. Furthermore, text length has been shown to be an efficient indicator of text quality. The number of sentences, the number of words per sentence and the number of questions can also be used as features.
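The text features proposed above (references, text length, sentence and question counts) can be extracted with a few lines of code. This is a minimal sketch; the regular expressions and the decision to strip URLs before sentence splitting are assumptions of this illustration, not prescribed by the thesis.

```python
import re

def intrinsic_features(text):
    """Language-independent text features: references (links), text length,
    sentence count, words per sentence, and question count."""
    references = re.findall(r"https?://\S+", text)
    # Remove URLs first so their dots are not counted as sentence boundaries.
    prose = re.sub(r"https?://\S+", "", text)
    sentences = [s for s in re.split(r"[.!?]+", prose) if s.strip()]
    words = prose.split()
    return {
        "references": len(references),
        "text_length": len(words),
        "sentence_count": len(sentences),
        "words_per_sentence": len(words) / max(1, len(sentences)),
        "question_count": text.count("?"),
    }

feats = intrinsic_features(
    "Does this work? See http://example.invalid for details. It worked for me.")
```

All of these features are computed without any language-specific vocabulary, matching the language-independence requirement stated above.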
The extrinsic score rextrinsic(di) captures the assessments of user-generated content units described in Subsection 2.3.4. A social media document is part of a lively dynamic system, driven by interactions between users and content, enabled through social media platforms as recommendations, shares, likes, and so on. Every interaction with a user-generated content unit contains information about its potential relevance, provided by users who have already consumed the piece of content. That allows prediction of the relative probability that a piece of content will be interesting for others. The concept of the extrinsic score is the advancement of the traditional word-of-mouth concept combined with the wisdom of the crowds concept. The collection of all user interactions with a content unit, whether something is read, commented, recommended, shared or liked, is a crowd-sourced assessment of the content unit that can serve as a recommendation for other users. The extrinsic score is derived from the additional information available on document level. Most of these features are produced by implicit or explicit peer evaluation. Explicit peer evaluations are one-click ratings such as likes on Facebook, +1 on Google+, or thumbs up on YouTube. Shares on Facebook and Twitter are also a kind of explicit peer evaluation. Comments are also a kind of peer evaluation. In the content of the comment, users explicitly express their opinion; but implicitly, with each comment, users show that the content unit was worth spending the time to write the comment. So, whether an explicitly expressed opinion is positive or negative, implicitly every comment shows that the content that is commented on is somehow important. Consequently, the number of comments is also a feature indicating relevance. In forums, it is the number of hits and the size of a thread, by number of replies to a root posting, that indicate the relevance of a topic and a root posting.
In blogs, the number of comments per entry shows how much attention a social media document received. The number of links to a blog post, also referred to as track-backs, reveals how often a blog post has been shared and thus recommended. Nowadays, blogs often integrate other social media platforms as plug-ins. Consequently, there can be Facebook likes, +1s, and other peer ratings for blog posts as well. Those are indicators of relevance too and are integrated into the extrinsic score. In microblogs such as Twitter, the same mechanisms can be found. If a tweet is recommended, it is counted as a retweet, and if it is liked, it is counted as a favorite.
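The platform-specific interaction counts discussed above (comments, track-backs, retweets, hits, and so on) all feed the same extrinsic notion. A sketch of one possible aggregation follows; the weights are placeholders chosen for illustration, not values from the thesis.

```python
# Platform-specific interaction labels mapped to illustrative weights.
INTERACTION_WEIGHTS = {
    "likes": 1.0, "plus_ones": 1.0, "thumbs_up": 1.0, "favorites": 1.0,  # one-click opinions
    "shares": 2.0, "retweets": 2.0, "trackbacks": 2.0,                   # recommendations
    "comments": 3.0, "replies": 3.0,                                     # higher-effort feedback
    "hits": 0.1, "views": 0.1,                                           # passive consumption
}

def extrinsic_score(interactions):
    """Aggregate heterogeneous interaction counts into one extrinsic value;
    unknown labels contribute nothing."""
    return sum(INTERACTION_WEIGHTS.get(k, 0.0) * v
               for k, v in interactions.items())

blog_post = extrinsic_score({"comments": 4, "trackbacks": 2, "likes": 10})
tweet = extrinsic_score({"retweets": 30, "favorites": 5})
```

Weighting comments above one-click opinions reflects the observation in the text that a comment costs the user more time than a single click; the exact weights would have to be tuned or exposed to the user.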
Recency is a well-known concept for the user. It is the time that has passed between the publication of a piece of content and its consumption by a user. The less time has passed, the more recent is the user-generated content unit. Even though many user-generated content units are time-dependent and more relevant if they are new, this cannot be generalized. Consequently, I propose to reward newer content units and to allow the user to adapt recency according to his needs.

3.2 Cross-Platform Compatibility Creates Comparability
To start the calculation of the relevance score, we begin with a set of documents. It is assumed that the documents have been crawled and stored in a database. All dates have already been converted into the same format (e.g., dd-mm-yyyy). Text quality measures have been extracted from the text. Semantically corresponding properties have the same key reference, also referred to as label. For example, publishing date is always referenced as publishing date, and not sometimes as date of publication and other times as date. This can be done either on a basic level or it can contain a semantic mapping. For example, hits can be mapped to views, if desired.
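The label normalization described here can be implemented as a simple key mapping. The mapping entries below are illustrative examples only; a real system would maintain one such table per supported platform.

```python
# Semantically corresponding labels are mapped to one canonical key.
KEY_MAP = {
    "date of publication": "publishing_date",
    "date": "publishing_date",
    "hits": "views",            # optional semantic mapping, as described above
    "best answers": "best_answer_count",
}

def normalize_keys(record):
    """Rename known labels to their canonical key; unknown labels pass through."""
    return {KEY_MAP.get(k, k): v for k, v in record.items()}

rec = normalize_keys({"date": "11-03-2014", "hits": 100, "author": "alice"})
```

After normalization, downstream processing can rely on a single label per concept regardless of which platform the record was crawled from.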
P(di) is a map that holds the n properties pj, j = 1...n of document di as tuples of key and value (keyj(di), valuej(di)), j = 1...n:
(key1(di), value1(di)), ..., (keyh-1(di), valueh-1(di)): author-related information,
(keyh(di), valueh(di)), ..., (keyk-1(di), valuek-1(di)): source-related information,
(keyk(di), valuek(di)), ..., (keym-1(di), valuem-1(di)): intrinsic information,
(keym(di), valuem(di)), ..., (keyn-1(di), valuen-1(di)): extrinsic information, and
(keyn(di), valuen(di)): publishing date.
Each field holds a record with the type of information as key and the value (e.g., key=hometown, value=Vancouver).
Step 1: Preprocessing Each property map originally holds information of different types. These include numeric values such as hits=100 as well as strings such as hometown=Vancouver. The first step maps the nonnumerical information to numerical values. There are several possible solutions for how this can be achieved. Finding the best way is a trade-off between information quality and processing efficiency. This question should be evaluated separately, and is not part of this thesis. In the following, one possible solution is introduced.
Studies suggest that the more information an author reveals about himself, the more trustworthy he is regarded (Fogg & Tseng, 1999; Fogg et al., 2000, 2001, 2003; Tseng & Fogg, 1999). Hence, I propose to approximate the user's assessment by mapping nonnumerical author-related properties such as age, hometown, hobbies, job, and so on, to one numerical value: the amount of further information given. This can be achieved by summing the additional information given as a weighted or nonweighted sum, without taking its value into account. On the one hand, a disadvantage of this approach is that it does not take into account whether the further information given has any reference to the context of the content the author produced. For example, if a social media document is posted within a forum with a medical focus topic, it might make a difference to the reader whether the author claims he is a physician or an undertaker. The approach does not make this distinction. On the other hand, the advantage of the approach is that it is flexible and easily transferable to the variety of author profiles of the different social media platforms. Also, topic identification as well as the mapping of the correspondences between the identified topic and the author information given is not a trivial task. Furthermore, the proposed approach needs less processing time. Source-related properties do not contain nonnumerical properties. Intrinsic and extrinsic properties are available as numerical values as well. Recency is derived from the publishing date.
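The mapping of nonnumerical author-related properties to a single count can be sketched as follows. The field names and weights are illustrative assumptions only; which profile fields exist depends on the social media platform.

```python
# Sketch: collapse nonnumerical author-related properties into one numerical
# value, the amount of further information given. Field names are assumed
# examples, not prescribed by the method.
AUTHOR_FIELDS = ["age", "hometown", "hobbies", "job", "real_name"]

def further_information_given(author_profile, weights=None):
    """Count (optionally weighted) how many author fields are filled in,
    without taking the field values themselves into account."""
    total = 0.0
    for field in AUTHOR_FIELDS:
        if author_profile.get(field) not in (None, ""):
            total += (weights or {}).get(field, 1.0)
    return total

profile = {"age": 34, "hometown": "Vancouver", "job": "physician"}
print(further_information_given(profile))  # 3 fields filled in -> 3.0
```

A weighted variant (e.g., counting job twice) would simply pass a weights map; the value of the field, such as physician versus undertaker, is deliberately ignored, as described above.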
Step 2: Normalization For each key the value is normalized with respect to the known maximum. To do that, a comparison map CM is created that holds all keys and their maxima. Each new document is then compared with CM. If a new document holds a property that is not part of CM, CM is updated and the new property is added along with its value as first maximum. If the key is already part of the map, its value is looked up and compared with the new value. If the new value is higher than the prevalent value in CM, the value is reassigned to the key. Next, all the values of the document are normalized with respect to CM. The normalization is based on the achieved maximum of each property. This means that each property is measured in terms of its performance with respect to the global range of this property. This relates the values of the properties of a single user-generated content unit to the values of the properties of other content units. Alternatively, values could be normalized by the average value instead of the maximum value. The average has to be calculated whereas the maximum can be gained by just comparing a new value to the current maximum. Hence, using the maximum is more efficient.
initialize empty map CM;
for all documents d_i, i = 1:m do
    for all properties p_j, j = 1:n of d_i do
        if CM contains key_j(d_i) then
            if value(CM.key_j) < value_j(d_i) then
                set value(CM.key_j) = value_j(d_i)
            end
        else
            add p_j to CM
        end
    end
end
for all documents d_i, i = 1:m do
    normalize all values of d_i with respect to CM
end
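The pseudocode above can be rendered directly in Python. A document is assumed here to be a plain map of key to numerical value (the result of Step 1); this is a minimal sketch, not a definitive implementation.

```python
# Step 2 sketch: build the comparison map CM of per-key maxima, then
# normalize each document's values with respect to CM.

def build_comparison_map(documents):
    """Collect the maximum value observed for each key across all documents."""
    cm = {}
    for doc in documents:
        for key, value in doc.items():
            if key not in cm or cm[key] < value:
                cm[key] = value  # new key, or new maximum for a known key
    return cm

def normalize(documents, cm):
    """Divide each value by the known maximum for its key."""
    return [
        {key: (value / cm[key] if cm[key] else 0.0) for key, value in doc.items()}
        for doc in documents
    ]

docs = [{"hits": 100, "replies": 4}, {"hits": 50, "replies": 8}]
cm = build_comparison_map(docs)  # {'hits': 100, 'replies': 8}
print(normalize(docs, cm))
```

As noted above, maintaining CM only requires comparing each new value to the current maximum, which is why normalizing by the maximum is more efficient than normalizing by the average.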
Alternatively, the normalization can be performed with respect to a fixed value, a median, or an upper bound instead of CM (the observed maxima). Step 3: Aggregation For each document a new array is created that holds the five scores. The aggregation is based on the average performance of the properties.
With value'_j(d_i) the normalized values from Step 2 and the index boundaries h, k, m, n from the property map P(d_i), the five scores are the averages over the respective property groups:

author score(d_i) = (1/(h-1)) Σ_{j=1}^{h-1} value'_j(d_i),
source score(d_i) = (1/(k-h)) Σ_{j=h}^{k-1} value'_j(d_i),
intrinsic score(d_i) = (1/(m-k)) Σ_{j=k}^{m-1} value'_j(d_i),
extrinsic score(d_i) = (1/(n-m)) Σ_{j=m}^{n-1} value'_j(d_i),
recency(d_i) = value'_n(d_i).
Step 4: Reduction to one score For each document the five scores can now be mapped to one score. A simple approach is to calculate the length of the vector (the euclidean norm), weighing all five scores equally. Alternatively, the arithmetic mean could be used, which requires fewer resources for its calculation.
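Both reductions can be sketched in a few lines; the five score values below are illustrative only.

```python
from math import sqrt

# Step 4 sketch: two ways to reduce the five scores (author, source,
# intrinsic, extrinsic, recency) to a single result score.

def arithmetic_mean(scores):
    """Sum divided by the number of summands."""
    return sum(scores) / len(scores)

def euclidean_norm(scores):
    """Length of the score vector: sqrt of the sum of squares."""
    return sqrt(sum(s * s for s in scores))

scores = [0.8, 0.5, 0.6, 0.9, 0.4]
print(arithmetic_mean(scores))  # 0.64
print(euclidean_norm(scores))
```

The arithmetic mean needs one sum and one division, whereas the euclidean norm additionally squares each dimension and takes a square root, which is why the mean may be preferable for large data sets, as discussed in the example case below.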
Separability of the Dataset to be Ranked and the Normalization The proposed approach can be applied to large sets of user-generated content as well as to small subsets. The values for the map CM that are used to normalize measures are proposed to be gathered from the set of content units for which the scores are calculated. Alternatively, the values in map CM used to normalize measures can also be gathered from a different, larger set than the set for which the scores are calculated. For example, if resources are limited, the set of user-generated content units that are ranked can be limited but still be measured in relation to values gained from a larger set of content units. The map CM could even contain manually researched maximum values.
3.3 An Example Case for Forums and Media Sharing Platforms
The following example calculation shall illustrate the method described in the previous section for user-generated content units from different social media categories. The content units are randomly selected from three different forums, and two different media sharing sites. The categories were chosen to demonstrate the range of content units and their metadata. The examples chosen differ in type and number of metadata. Furthermore, they differ in the type of content they contain. User-generated content units from forums contain text, whereas user- generated content units from media sharing sites contain photos or videos.
Table 1 shows the metadata for user-generated content units from three different forums and two different media sharing platforms. Example number 1 is from forum.runnersworld.de (http://forum.runnersworld.de/forum/trainingsplanung-fuer-marathon/33367-mit-greif-durens-trainingsjahr.html), example number 2 is from www.laufforum.de (http://www.laufforum.de/immer-noch-kein-speed-87509.html), example number 3 is from www.apfeltalk.de (http://www.apfeltalk.de/forum/macbook-pro-15-a-t410209.html), example number 4 is also from www.apfeltalk.de (http://www.apfeltalk.de/forum/showthread.php?t=109921), example number 5 is from www.flickr.com (http://www.flickr.com/photos/28252015@N00/3199464411/), example number 6 is also from www.flickr.com (http://www.flickr.com/photos/36755776@N07/7780251000/), example number 7 is from www.youtube.com (http://www.youtube.com/watch?v=5HHnDEnsdno), and example number 8 is also from www.youtube.com (http://www.youtube.com/watch?v=GEKgYKpEJ3o).
Table 1 shows metadata for eight user-generated content units from three different forums and two different media sharing platforms, accessed: August 15, 2012. The first column of Table 1 shows the type of metadata referenced by their labels. The subsequent columns show their values in the different examples. The metadata for examples 1-4 differ from the metadata for examples 5-8. For example, the user-generated content units from Flickr and Youtube have likes, whereas the user-generated content units from the forums do not. Intrinsic measures, which give indications about the quality of the content and are derived from the content, have been introduced for texts only. Intrinsic measures for multimedia content such as photos and videos are left to further research.
Information about back-links has been gathered from Google. There are also alternative sources for back-links (e.g., Alexa (http://www.alexa.com/faqs/?p=91)).
Some metadata are labeled differently but have the same meaning. Others do not have the same but similar meanings. Depending on the desired precision, it is suitable to map those to one metadatum. In the example at hand, some metadata have been semantically mapped. Originally, there is a number of replies for a forum's content unit and a number of comments for a media sharing platform's content unit. It is a matter of choice whether to interpret them as expressing the same information or as expressing different information. If differently labeled metadata are interpreted to have the same meaning, they can be mapped to one label. If differently labeled metadata are interpreted to have different meanings, different labels should be kept. Here, the number of replies of a forum's content unit and the number of comments of a media sharing content unit have both been mapped to number of replies. Furthermore, hits from forums and views from media sharing platforms have both been mapped to hits. In some cases, it can make sense to differentiate between those two. Hits usually indicate that the user-generated content unit has been clicked on, whereas views indicate that the user-generated content unit has been consumed for at least a short time.
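The semantic mapping described above amounts to a simple label translation table. The table below reflects the choices made in this example; a stricter interpretation would simply leave entries out of the table.

```python
# Sketch of the semantic mapping: differently labeled metadata that are
# interpreted to have the same meaning are mapped to one label. The mapping
# table mirrors the choices in the example case and is adaptable.
LABEL_MAP = {
    "number of comments": "number of replies",  # media sharing -> forum label
    "views": "hits",                            # media sharing -> forum label
}

def unify_labels(metadata):
    """Replace platform-specific labels with their unified counterparts."""
    return {LABEL_MAP.get(key, key): value for key, value in metadata.items()}

print(unify_labels({"views": 500, "likes": 12}))
# {'hits': 500, 'likes': 12}
```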
Then, the maxima of all metadata are gathered as described in Section 3.2. The comparison map holds all occurring types of metadata and the current maximum values, as Table 2 shows. Dates such as member since and publishing date have been mapped to a numerical value. In the example given, the maximum value for member since has been calculated as the difference in days between the oldest membership and a reference date. Alternatively, it would also be feasible to set a binary value, differentiating only between new and not-new memberships.
The maximum value for publishing date has been calculated as difference in days between the newest publishing date and a reference date. The retrieval date served as reference date.
Table 2 shows the results of the calculation (maximum values of the metadata of examples 1-8). When new documents are added, the map of maxima has to be updated regularly. When it changes, the scores for the content units have to be updated as well.
Table 3 shows the metadata of examples 1-8 normalized with respect to the maximum values.
Then, the values of the metadata are normalized with respect to the maxima. The recency of a user-generated content unit, r(d_i), is calculated as the smallest difference Δd_j between a reference date and the publishing dates of the user-generated content units d_j, j = 1…8, divided by the difference Δd_i between the publishing date of d_i and the reference date: r(d_i) = (min_j Δd_j)/Δd_i. The calculation of recency applied here has an exemplary character. Other methods to derive a recency factor from a publishing date can be applied as well. The choice of the method depends on which distribution for recency is desired and the resources that are available for calculation. Table 3 shows the results. All numbers are rounded to three digits after the decimal point.
Table 4 shows the five relevance scores for examples 1-8.
Then the author score, source score, intrinsic score, extrinsic score, and recency are calculated. For the missing intrinsic measures for videos and photos, the average intrinsic score of examples 1-4 has been assumed and set for examples 5-8. Table 4 shows the results.
Finally, the resulting scores are calculated. There are several options to map the five scores to one. The euclidean norm is one option. In the example given, the arithmetic mean is used. Table 5 shows the results. The euclidean norm requires the calculation of the square root of the sum of the squares of all five dimensions, whereas the arithmetic mean only requires the calculation of the sum divided by the number of summands. For large data sets, the arithmetic mean might be preferable because it requires less complex calculations. The determination of the optimal solution depends on the application and is left to further research.
Table 5 shows the result scores and ranks for examples 1-8.
Figure 21 shows screenshots of the three user-generated content units with the highest scores. The user-generated content unit with the highest score of the eight examples is a video on Youtube by MumfordandSons. MumfordandSons is the band that plays in the video of example 7. It is a relatively popular band and it is the band's own Youtube account. This is rewarded with the author-related measure real name. MumfordandSons has the highest author score. The second-ranked content unit is also a video on Youtube. It is a video of the song Hey Jude by The Beatles. It has been watched 7,542,870 times. This is reflected in the highest extrinsic score. The user-generated content unit with the third highest score is a thread in a runner's forum, in which Chri.S shares his experiences with his training and gives advice to other runners; it receives the second highest author score. Chri.S is an experienced runner and an active user of the forum. In six years of membership he wrote 4,734 contributions in the runner's forum. His high author score reflects this. The example case demonstrates how the proposed approach can be applied to user-generated content units from different platforms of different social media categories. It shows the five scoring dimensions and illustrates how they can be evaluated separately or as one aggregated score.
4 Applications
The query-independent ranking method presented in the previous sections can be used in a variety of applications. First of all, it can be applied in a discovery engine for user-generated content. Secondly, in combination with a query-dependent ranking, it can be applied in a search engine specialized for user-generated content. Thirdly, the proposed ranking approach can be applied to rank any set of user-generated content units. For example, it could be used to rank a set of user-generated content units that are part of a social media monitoring tool. Furthermore, the presented approach can be applied to all types of documents for which metadata is available that allows determining an author score, a source score, an intrinsic score, an extrinsic score, and a recency score. If information that is required for one of the scores is missing, the ranking can be based on the remaining subset of scores. If, for example, there is no information about the source of a set of documents, the ranking can be based on author-related information, intrinsic information, extrinsic information, and recency.
The following section presents a concept of a discovery engine for user-generated content in Section 4.1 and a search engine for user-generated content in Section 4.2, as examples for applications of the proposed query-independent ranking approach.
Prior to further processing, content usually needs to be obtained from social media platforms and stored in a database. Figure 22 schematically illustrates the main steps of how information is obtained from social media platforms for availability in a user interface. The data extraction process collects user-generated content units and their metadata from social media platforms and stores them in a database. Then, the content units can be processed (e.g., query-independent ranking). The results are stored and can be delivered for display in a user interface.

4.1 A Discovery Engine for User-Generated Content
The proposed ranking approach allows to rank user-generated content units from different platforms independent of a search query. It can be applied in a discovery engine that provides users with the content units that have the highest score of all evaluated content units.
A discovery engine for user-generated content is an example application for the proposed query-independent ranking approach. Figure 23 shows a wireframe of the proposed concept. The main area in the center shows user-generated content units from different platforms. The user-generated content units are displayed in the order of their rank based on the five scores. The weight of the five scores can be adjusted by the user according to his interests and information needs. Figure 24 specifies in detail the presentation of content units. The content of a content unit is displayed in the central area. At the top, the source URL and its source score are displayed along with the publishing date. Below this, the author and the author score are given. Directly above the content, the title of the contribution is provided. Next to the title, the result score is visualized by stars. The result score is also provided in numerical form. At the bottom, the extrinsic score is displayed as popularity. Below the extrinsic score, the extrinsic measures that led to the extrinsic score are provided. The triangle next to Popularity indicates that measures can be displayed or hidden as desired. The more user-generated content units the discovery engine evaluates, the more valuable the results are. Ideally, the discovery engine covers all public user-generated content units.
The discovery engine for user-generated content enables exploratory search and is suited for users with feature-related information needs, who do not search for a specific topic or search term and do not have an exact idea of what they want to find. A user who seeks the best scoring content units could explore the content units displayed by the discovery engine in the order of their overall score. Users could use the discovery engine to look for inspiration, for something new, or for what is going on in the world. Users can visit the discovery engine to discover what people share, comment on, and like worldwide.
Due to the platform comparability, users get results from various platforms. Consequently, they no longer need to know which platforms are most relevant, and they do not need to know beforehand which authors are most active and most recommended.
Additionally, the user could be provided with a possibility to manipulate the weight of the scores according to his interests. In the example shown, the author score is labeled Author, the source score is labeled Source, the intrinsic score is labeled Content, the extrinsic score is labeled Popularity, and the recency score is labeled Recency. The labels have been chosen to indicate the meaning of the scores. For example, if a user is interested in the highest scoring sources, he could set the source score to 100 percent, thus setting all other scores to zero, and browse through the results. If he is interested in the most active and most recommended authors, he could set the author score to 100 percent, thus setting all other scores to zero, and browse through the results.
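The user-adjustable weighting described above can be sketched as a weighted sum over the five scores; the score names mirror the interface labels, and the example values are illustrative.

```python
# Sketch: combine the five scores as a weighted sum. Setting one weight to
# 100 percent implicitly zeroes out the others, as described above.

def weighted_score(scores, weights):
    """Weighted average of the given scores; weights not listed count as 0."""
    total = sum(weights.values())
    return sum(scores[k] * w for k, w in weights.items()) / total if total else 0.0

scores = {"Author": 0.7, "Source": 0.4, "Content": 0.6,
          "Popularity": 0.9, "Recency": 0.3}

# A user interested only in the highest scoring sources:
print(weighted_score(scores, {"Source": 1.0}))  # 0.4
```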
The features used in the proposed query-independent ranking approach are all language-independent. It is therefore possible to apply it to content units of different languages. Provided that content units are separable by language (for example, content could be tagged during the data extraction process), it would also be possible to filter content units by language and to compare results from different languages.
The author score, source score, intrinsic score, extrinsic score, and recency are additional information that supports the user in orienting himself. It enhances transparency, because all available metadata can be displayed together. The user no longer has to search for additional metadata on other pages (as is usually the case for author-related information and source-related information). Furthermore, he does not need to rely on a feeling gained from experience to interpret the metadata. Metadata is already normalized in relation to other user-generated content units' values and displayed with the content. Figure 24 shows how ranked content units could be displayed in the discovery engine's interface. Figure 25 shows the visualization of ranked user-generated content units using the example of a concrete video on Youtube. This application provides the user with a source of information that is independent of what is shared in his own, limited social environment.
4.2 A Search Engine for User-Generated Content
Today, for many users, search engines are their main entrance to the World Wide Web (Hearst, 2009). (The results of a Web site ranking indicate that this might change in favor of social media applications. The first and the second of the Web sites with the most traffic are https://www.google.com and https://www.facebook.com by turns (Alexa, 2013). The calculation of traffic is based on a combination of average daily visitors and pageviews over the past month.)
Search engines usually welcome their users with a blank page featuring the search field (e.g., Google (http://www.google.com), Bing (http://www.bing.com), search.com (http://www.search.com), WolframAlpha (http://www.wolframalpha.com)). They do not show content before the user has entered a keyword. These search engines allow nonexploratory search for content-related search needs only. The proposed query-independent ranking approach can be used in combination with a query-dependent ranking approach for a search engine for user-generated content that supports exploratory as well as nonexploratory search.
A search engine for user-generated content is an example application for the proposed query-independent ranking approach. Figure 26 shows a concept of the social media search engine. It allows users to express content-related information needs in addition to feature-related information needs. The main area of the screen shows user-generated content units, ordered by rank. The user can type in search queries to filter the user-generated content units with regard to a specific topic, for example. The weight of the five scores can be adjusted according to the user's interests. Additionally, the user can filter content by date, by source, and by country.
The social media search engine illustrated in Figure 26 is an extension of the concept presented in the previous section. It is extended by the possibility to search content units by entering a search query in the search field. The main part of the interface is the result area, which displays the user-generated content units with the highest scores. The scores serve as additional information for the user. They help him to orientate himself and to classify the search results.
Furthermore, additional filter concepts are added. If the information is available in the database, content units can be filtered by specific dates, sources, or countries. The left hand side of Figure 26 indicates further possibilities to integrate filter concepts. Alternatively, the user could be offered the possibility to filter not by source but by social media type.
There are several possibilities to combine query-independent and query-dependent rankings (Craswell, Robertson, Zaragoza & Taylor, 2005). The search query can be used as a filter on the ranked list of user-generated content units. The set of user-generated content units is divided into two subsets: one subset containing the content units in which the entered search query occurs and one subset containing the content units in which the entered search query does not occur.
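The filter-based combination can be sketched as a simple split that preserves the query-independent order within each subset; the document structure is an illustrative assumption.

```python
# Sketch: split a ranked list of content units into those containing the
# query and those not containing it, keeping the query-independent order
# within each subset.

def split_by_query(ranked_units, query):
    q = query.lower()
    matching = [u for u in ranked_units if q in u["text"].lower()]
    rest = [u for u in ranked_units if q not in u["text"].lower()]
    return matching, rest

units = [{"text": "Marathon training plan"}, {"text": "MacBook Pro advice"}]
hits, rest = split_by_query(units, "marathon")
print(len(hits), len(rest))  # 1 1
```

Displaying the matching subset first (or exclusively) yields a query-filtered view while the overall ordering still reflects the query-independent scores.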
An improved approach calculates a second query-dependent rank that reflects the degree of correspondence between the text and the search query. Information retrieval offers a variety of approaches for query-dependent ranking of text documents. The query-dependent rank is combined with the query-independent rank.
Not all user-generated content units consist of text only. This can be a problem for text-based ranking approaches. Videos and pictures from media sharing platforms, for example, usually contain only a few words. The text can be found in the title or the description of the medium. Some platforms allow users to tag pictures or videos with descriptive words (e.g., Flickr). These can be used for the query-dependent ranking of user-generated content units of this type.
5 Conclusion
This final section provides a summary of the most important results of this description in Section 5.1 before outlining potentials for further work and research in Section 5.2.
5.1 Summary
This thesis started out with the problem that user-generated content units provide various information in their metadata that could help to evaluate them, but is left unused due to a lack of comparability. Considering the massive amount of user-generated content that is already available and that is continuously produced, users need assistance in their task to find and evaluate user-generated content units to tap the full potential of social media.
This thesis provides a framework to compare entities by different types and number of measures.
This is achieved through the identification of appropriate levels to which the measures can be aggregated and that are common for all entities.
Specifically, this thesis provides a query-independent ranking method to compare user-generated content units from different social media platforms with each other. It solves the problem of comparability of metadata of different quantity and types as it is the case for user-generated content. This is done by providing a model that can be applied to user-generated content with different metadata. The approach maps different metadata to aspects that are common for all types of user-generated content.
For each user-generated content unit a score for each aspect is calculated. A user-generated content unit is represented by a vector of its scores. The scores are derived from measures that can be obtained from metadata.
The modeled aspects are: author-related information, source-related information, intrinsic information, extrinsic information, and recency. For each aspect the related information is normalized with respect to the maximum known value for the information. It thus relates the information for a given user-generated content unit to other user-generated content units. For a single content unit it helps the user to estimate the magnitude of the numbers provided in its metadata.
The five scores for each document can be mapped to a single score. For a given set of user- generated content units this allows to compare them and establish an order among them. This description also provides a suggestion for an interface and a visualization of the ranked user- generated content units.
The proposed approach is language-independent. Therefore, it can be applied to user-generated content independent of the language of their content.
The proposed query-independent ranking can be combined with query-dependent ranking to build a search engine for user-generated content that accounts for the specific characteristics of user-generated content.
The application of the approach proposed in this thesis helps users to access the value of social media and to unlock more of its potential.
5.2 Extensions and Further Research
This section suggests further research and work from technical details closely related to the approach presented in this thesis to more comprehensive propositions, the latter also addressing challenges beyond the scope of this work.
Further research might explore solutions to derive intrinsic measures from pictures, audio files, and videos. A measure that indicates quality for media files could for example be the file size, assuming that a larger file size indicates a higher resolution. But for online content the file size is often reduced and optimized for fast transfer. Therefore, the file size is not necessarily suitable as quality indicator.
Further work could extend the author score. Let the connections an author has be his primary connections and the connections of his primary connections be his secondary connections. The author score could be extended from taking only first-degree connections into account to including connections of further degrees. Given that information is available not only about how many connections an author has, but also about who he is connected with, the use of this information might be a reasonable extension for the author score. A solution to integrate this information is to assign a weight to every primary connection depending on the primary connection's number of primary connections. This way, the number of secondary connections of an author can be included in his author score. This could be continued with a decreasing impact for further degrees of connections. A simplified example for Twitter shall illustrate this mechanism. Twitter-author A has 10 followers who have an average of 10 followers, Twitter-author B has 50 followers who have an average of 10 followers, and Twitter-author C also has 50 followers who have an average of 100 followers. We could compare them based on their primary connections by the number of their followers. Then, the author scores of Twitter-author B and Twitter-author C would be equal. But let us assume we have a 50 percent probability that the people who follow Twitter-authors B and C share their tweets. Then, a tweet by author B would reach his 50 followers plus half of the followers of his followers, hence 300 people. A tweet by author C would also reach his 50 followers plus half of the followers of his followers, hence 2,550 people. Consequently, it can be reasoned that, even though author B has the same number of primary connections as author C, author C's tweet will probably reach more people than a tweet from B.
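The arithmetic of the Twitter example can be sketched as a small expected-reach calculation; the 50 percent share probability and the follower counts are the assumptions stated above.

```python
# Sketch of the Twitter example: expected reach of a tweet given the
# author's primary connections (followers), the average number of followers
# of those followers, and an assumed probability that followers share.

def expected_reach(followers, avg_followers_of_followers, share_probability=0.5):
    return followers + share_probability * followers * avg_followers_of_followers

print(expected_reach(50, 10))   # author B: 50 + 0.5 * 50 * 10  = 300.0
print(expected_reach(50, 100))  # author C: 50 + 0.5 * 50 * 100 = 2550.0
```

Weighting each primary connection by its own connection count, as proposed above, generalizes this reach estimate into an author score component; further degrees would enter with decreasing impact.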
The implementation of this approach requires more complex data retrieval and storage than the approach presented in this work. Whether the increased complexity is worth the gain in precision is left to further research.
Further work might also investigate possibilities to extend the proposed approach to collaboratively created content. The proposed model assumes that a user-generated content unit has one author and one publishing date. Collaboratively created content, such as wiki articles, can be created by many authors. Wiki articles are also subject to continuous changes that are preserved in their editing history. A considerable amount of work has investigated the possibilities to analyze wiki content and editing history. It would be interesting to explore possibilities to make results from wiki analysis comparable to other types of content as well.
The object of the ranking proposed in this description is the social media document (cf., Section 2.2) that can be allocated to level 1 in Figure 16 and to level 1 in Figure 27. For specific platforms, there is an additional level above these contributions. This is the case for location sharing and annotation platforms as well as rating and review platforms. For location sharing and annotation platforms the ranked objects are the annotations (cf., level 1 in Figure 27) of locations (cf., level 0 in Figure 27). For rating and review platforms the ranked objects are the reviews, which refer to a product or service. The proposed ranking approach ranks the annotations and reviews (cf., level 1 in Figure 27). The object of these contributions, a location, product, or service, is allocated on level 0 in Figure 27. In contrast to the contributions on level 1, the object of a contribution on level 0 is not a social media document as described in the social media document view in Section 2.4. In the case of location sharing and annotation platforms, the annotations refer to locations. Locations do not have a distinct author and publishing date. In the case of rating and review platforms, ratings and reviews refer to a product or service, which does not have a publishing date or author. Nevertheless, the average rating of a product is an indicator for the quality of the product, service, or experience reviewed. Consequently, for platforms of this type, level 0 is also a relevant level of consideration. Further work could investigate possibilities to aggregate the ranking of user-generated content units to a higher reference point, such as products, services, or venues. For a single source this might be straightforward, but the possibilities of integration into a cross-category compatible framework, as proposed in this work, need to be investigated.
Furthermore, future research could evaluate alternative designs to refine the concepts proposed in Section 4 with regard to usability. The description at hand makes it possible to display information to the user that helps him to evaluate user-generated content. On the one hand, this additional information can be a useful guidance. On the other hand, too much information displayed at once could lead to counterproductive complexity of the interface. Section 4 provides a suggestion for how to display and visualize ranked user-generated content units (cf., Figure 24 and Figure 25). Future research could investigate whether all measures and scores should be provided to the user at once or whether it is more useful to display a subset that, for example, could consist of the scores only. Furthermore, it could be evaluated whether it contributes more to usability to display numerical values or to transfer the values into a visualization. The visualization of the values gives rise to the question of which level of detail the visualization should represent the numerical values at. Further work might explore how this additional information should be visualized in a way that is most comprehensible and contributes most to usability.
Beyond the scope of this thesis, there are adjacent areas that can be further developed to efficiently apply the proposed query-independent ranking approach to large sets of user-generated content. This concerns, for example, the data extraction process from the social media platforms and the database design. The amount of user-generated content produced per time unit is enormous, and a real-time application with the proposed features is a challenge for which strategies for efficient data processing still need to be developed.
Tables 6 to 13 present a collection of metadata available for each of the evaluated categories. All measures are sorted according to the social media document view introduced in Subsection 2.4. Measures are listed independently of their significance for the ranking of user-generated content units.
References
Agichtein, E., Castillo, C., Donato, D., Gionis, A. & Mishne, G. (2008). Finding High-Quality Content in Social Media. In: Proceedings of the International Conference on Web Search and Web Data Mining. WSDM '08, ACM, New York, pp. 183-194.

Alexa (2013). The Top 500 Sites on the Web. URL: http://www.alexa.com/topsites, accessed: September 5, 2013.

Anderka, M., Stein, B. & Lipka, N. (2011a). Detection of Text Quality Flaws as a One-Class Classification Problem. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management. CIKM '11, ACM, New York, pp. 2313-2316.

Anderka, M., Stein, B. & Lipka, N. (2011b). Towards Automatic Quality Assurance in Wikipedia. In: Proceedings of the 20th International Conference Companion on World Wide Web, Hyderabad. WWW '11, ACM, New York, pp. 5-6.

Archak, N., Ghose, A. & Ipeirotis, P.G. (2007). Show me the Money! Deriving the Pricing Power of Product Features by Mining Consumer Reviews. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD '07, ACM, New York, pp. 56-65.

Baeza-Yates, R. & Ribeiro-Neto, B. (2003). Modern Information Retrieval. ACM Press Books, New York.

Barkhuus, L., Brown, B., Bell, M., Sherwood, S., Hall, M. & Chalmers, M. (2008). From Awareness to Repartee: Sharing Location Within Social Groups. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. CHI '08, ACM, Florence, pp. 497-506.

Beadon, L. (2013). Funniest/Most Insightful Comments Of The Week At Techdirt. URL: http://www.techdirt.com/articles/20130922/09493824615/funniestmost-insightful-comments-week-techdirt.shtml, accessed: September 23, 2013.

Boyd, D. & Ellison, N. (2008). Social Network Sites: Definition, History, and Scholarship. Journal of Computer-Mediated Communication, vol. 13, 1, pp. 210-230.

CapsuleHD20 (2012). PSY - Gangnam Style (Comeback Stage) - Inkigayo. URL: http://www.youtube.com/watch?v=60MQ3AG1c8o, accessed: September 25, 2013.

Chen, Y.Y., Liu, C.L., Chang, T.H. & Lee, C.H. (2010). An Unsupervised Automated Essay Scoring System. IEEE Intelligent Systems, vol. 25, pp. 61-67.

Ciao (2013). Erfahrungsberichte. URL: http://www.ciao.de/Braun_Oral_B_Professional_Care_7000_Black__11152413, accessed: September 26, 2013.

Connolly, T. & Begg, C. (2005). Database Systems: a Practical Approach to Design, Implementation, and Management. Addison-Wesley Longman, Essex.

Cramer, H., Rost, M. & Holmquist, L.E. (2011). Performing a Check-in: Emerging Practices, Norms and 'Conflicts' in Location-Sharing Using Foursquare. In: Proceedings of the 13th International Conference on Human Computer Interaction with Mobile Devices and Services. MobileHCI '11, ACM, New York, pp. 57-66.

Craswell, N., Robertson, S., Zaragoza, H. & Taylor, M. (2005). Relevance Weighting for Query Independent Evidence. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil. SIGIR '05, ACM, New York, pp. 416-423.

Daer, A. (2013). User Profile. URL: https://foursquare.com/alicedaer, accessed: September 24, 2013.

Dalip, D.H., Goncalves, M.A., Cristo, M. & Calado, P. (2009). Automatic Quality Assessment of Content Created Collaboratively by Web Communities: A Case Study of Wikipedia. In: Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries, Austin. JCDL '09, ACM, New York, pp. 295-304.

Elgersma, E. & de Rijke, M. (2008). Personal vs Non-Personal Blogs: Initial Classification Experiments. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Singapore. SIGIR '08, ACM, New York, pp. 723-724.

Elsas, J.L. & Glance, N. (2010). Shopping for Top Forums: Discovering Online Discussion for Product Research. In: Proceedings of the First Workshop on Social Media Analytics, Washington. SOMA '10, ACM, New York, pp. 23-30.

Fogg, B., Marshall, J., Laraki, O., Osipovich, A., Varma, C., Fang, N., Paul, J., Rangnekar, A., Shon, J., Swani, P. & Treinen, M. (2001). What Makes A Web Site Credible? A Report on a Large Quantitative Study. In: Proceedings of ACM CHI 2001 Conference on Human Factors in Computing Systems, Seattle. ACM, New York, vol. 1, pp. 61-68.

Fogg, B., Marshall, J., Osipovich, A., Varma, C., Laraki, O., Fang, N., Paul, J., Rangnekar, A., Shon, J., Swani, P. et al. (2000). Elements that Affect Web Credibility: Early Results from a Self-Report Study. In: CHI '00 Extended Abstracts on Human Factors in Computing Systems, The Hague. ACM, New York, pp. 287-288.

Fogg, B., Soohoo, C., Danielson, D., Marable, L., Stanford, J. & Tauber, E. (2003). How do Users Evaluate the Credibility of Web Sites? In: Proceedings of the 2003 Conference on Designing for User Experiences, San Francisco. ACM, New York, pp. 1-15.

Fogg, B.J. & Tseng, H. (1999). The Elements of Computer Credibility. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: the CHI is the Limit. CHI '99, ACM, New York, pp. 80-87.

Foursquare (2013). New York Marriott Marquis. URL: https://foursquare.com/v/new-york-marriott-marquis/439c437bf964a520f02b1fe3, accessed: September 24, 2013.

Harper, R.H.R., Lamming, M.G. & Newman, W.M. (1992). Locating Systems at Work: Implications for the Development of Active Badge Applications. Interacting with Computers, vol. 4, 3, pp. 343-363.

Hearst, M.A. (2009). Search User Interfaces. Cambridge University Press, New York.

Hu, M., Lim, E.P., Sun, A., Lauw, H.W. & Vuong, B.Q. (2007). Measuring Article Quality in Wikipedia: Models and Evaluation. In: Proceedings of the 16th ACM Conference on Information and Knowledge Management, Lisbon. CIKM '07, ACM, New York, pp. 243-252.

Iachello, G., Smith, I., Consolvo, S., Chen, M. & Abowd, G. (2005). Developing Privacy Guidelines for Social Location Disclosure Applications and Services. In: Proceedings of the 2005 Symposium on Usable Privacy and Security, Pittsburgh. ACM, New York, pp. 65-76.

Kakkonen, T., Myller, N., Timonen, J. & Sutinen, E. (2005). Automatic Essay Grading with Probabilistic Latent Semantic Analysis. In: Proceedings of the Second Workshop on Building Educational Applications Using NLP, Ann Arbor. EdAppsNLP 05, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 29-36.

Langville, A.N. & Meyer, C.D. (2006). Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, Princeton, New Jersey.

Macdonald, C., Santos, R.L., Ounis, I. & Soboroff, I. (2010). Blog Track Research at TREC. SIGIR Forum, vol. 44, pp. 58-75.

Meyer, B. (1997). Object-Oriented Software Construction. Prentice Hall, 2nd ed.

Mika, P. (2007). Social Networks and the Semantic Web. Springer Science+Business Media, New York.

Mishne, G. & Glance, N. (2006). Leave a Reply: An Analysis of Weblog Comments. In: Proceedings of the 3rd Annual Workshop on the Weblogging Ecosystem, Edinburgh, pp. 22-26.

Moturu, S. (2010). Quantifying the Trustworthiness of Social Media Content: Content Analysis for the Social Web. LAP Lambert Academic Publishing, Saarbrücken, Germany.

Page, L., Brin, S., Motwani, R. & Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Library.

Peter (2013). Gottfrid i Medierna. URL: http://blog.brokep.com/, accessed: September 23, 2013.

Pier, K. (1991). Locator Technology in Distributed Systems: the Active Badge. In: Proceedings of the Conference on Organizational Computing Systems, Atlanta. COCS '91, ACM, New York, pp. 285-287.

Raykar, V.C., Yu, S., Zhao, L.H., Valadez, G.H., Florin, C., Bogoni, L. & Moy, L. (2010). Learning from Crowds. Journal of Machine Learning Research, vol. 11, pp. 1297-1322.

Reilly, D., Dearman, D., Ha, V., Smith, I. & Inkpen, K. (2006). "Need to Know": Examining Information Need in Location Discourse. Pervasive Computing, pp. 33-49.

Rogers, Y., Sharp, H. & Preece, J. (2011). Interaction Design: Beyond Human-Computer Interaction. Wiley, Sussex, 3rd ed.

RosesAreRed1207 (2012). My last iphone. URL: http://www.ciao.co.uk/Apple_iPhone_5_16GB__Review_6069358, accessed: September 26, 2013.

Safko, L. (2010). The Social Media Bible: Tactics, Tools, and Strategies for Business Success. Wiley, Hoboken, NJ, USA, 2nd ed.

Scellato, S., Noulas, A., Lambiotte, R. & Mascolo, C. (2011). Socio-Spatial Properties of Online Location-Based Social Networks. In: Proceedings of ICWSM, Barcelona, vol. 11, pp. 329-336.

Surowiecki, J. (2005). The Wisdom of Crowds. Abacus, London.

Tseng, S. & Fogg, B.J. (1999). Credibility and Computing Technology. Communications of the ACM, vol. 42, pp. 39-44.

Wickre, K. (2013). Celebrating #Twitter7. URL: https://blog.twitter.com/2013/celebrating-twitter7, accessed: August 16, 2013.

WordPress (2013). Categories. URL: http://en.support.wordpress.com/posts/categories, accessed: September 23, 2013.

Yu, J., Low, K.H., Oran, A. & Jaillet, P. (2012). Hierarchical Bayesian Nonparametric Approach to Modeling and Learning the Wisdom of Crowds of Urban Traffic Route Planning Agents. In: Web Intelligence and Intelligent Agent Technology (WI-IAT), 2012 IEEE/WIC/ACM International Conferences. IEEE Computer Society, Washington, DC, USA, vol. 2 of WI-IAT '12, pp. 478-485.

Yu, J., Zha, Z.J., Wang, M. & Chua, T.S. (2011). Aspect Ranking: Identifying Important Product Aspects from Online Consumer Reviews. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland. Association for Computational Linguistics, Stroudsburg, PA, USA, vol. 1 of HLT '11, pp. 1496-1505.

Zhang, L., Tang, L., Luo, P., Chen, E., Jiao, L., Wang, M. & Liu, G. (2012). Harnessing the Wisdom of the Crowds for Accurate Web Page Clipping. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Beijing. KDD '12, ACM, New York, pp. 570-578.

Claims

1. Method for automatically generating a sorted list of user generated input derived from social media platforms by
a) in a first step (101) automatically extracting data related to a plurality of user generated input datasets and / or metadata from the social media platform,
b) optionally in a second step (102) automatically extracting for each user generated input dataset the intrinsic data from the content, in particular video, text, audio, and / or picture data,
c) in a third step (103) normalizing the intrinsic data and / or the metadata according to respective measures, such as e.g. the number of views and / or likes;
d) in a fourth step (104) aggregating the normalized measures into at least one numerical score (r_author, r_source, r_intrinsic, r_extrinsic, r_recency) for each user generated input dataset and / or metadata;
e) in a fifth step (105) reducing the at least one numerical score (r_author, r_source, r_intrinsic, r_extrinsic, r_recency) of the aggregated normalized measures to one overall score;
f) in a sixth step (106) ranking the user generated input datasets according to their respective overall scores.
2. Method according to claim 1, wherein the normalized at least one numerical score (r_author, r_source, r_intrinsic, r_extrinsic, r_recency) calculated in the fourth step (104) is weighted with predetermined weights.
3. Method according to claim 1 or 2, wherein in the fourth step (104) and / or the fifth step (105) the at least one numerical score and / or the overall score are calculated by applying heuristic methods, a Euclidean norm, an arithmetic average, a geometric average, a quadratic mean, a generalized mean, a weighted mean, a truncated mean, a midrange, or a harmonic average.
4. Method according to at least one of the preceding claims, wherein the at least one numerical score of author-related data (r_author) comprises metadata about the authors, in particular the connectivity of the authors in the social media platform, the duration of the membership in the social media platform and / or ratings by peers.
5. Method according to at least one of the preceding claims, wherein the at least one numerical score of source-related data (r_source) comprises metadata related to the social media platform itself, such as e.g. the number of members and / or the number of incoming links.
6. Method according to at least one of the preceding claims, wherein the at least one numerical score of intrinsic-related data (r_intrinsic) comprises data related to the content itself, such as e.g. the media-type and / or data related to text analysis.
7. Method according to at least one of the preceding claims, wherein the at least one numerical score of extrinsic-related data (r_extrinsic) comprises data related to the interactions with third parties, such as e.g. the number of recommendations, shares with others and / or approval remarks.
8. Method according to at least one of the preceding claims, wherein the at least one numerical score of time-related data (r_recency) comprises data about the time between the posting of the user generated input and the accession of the data by the user.
9. Method according to at least one of the preceding claims, wherein the calculation of at least one numerical score and / or the normalization is automatically updated when a new user generated input dataset yielding a new maximum value for a numerical score is found.
10. Method according to at least one of the preceding claims, wherein the normalization is performed with respect to a fixed value, a median, a sliding average, an upper bound or CM.
11. Computer product for automatically generating a sorted list from user generated input and / or metadata derived from social media platforms,
with an extracting unit for extracting data related to a plurality of user generated input datasets and / or metadata from the social media platform,
optionally with an intrinsic data calculation unit for normalizing the intrinsic data and / or the metadata according to respective measures, such as e.g. the number of views and / or likes;
with an aggregating unit for aggregating the normalized measures into at least one numerical score (r_author, r_source, r_intrinsic, r_extrinsic, r_recency) for each user generated input dataset and / or metadata;
with a normalizing unit for reducing and / or averaging the at least one numerical score (r_author, r_source, r_intrinsic, r_extrinsic, r_recency) to one overall score; and
with a ranking unit for ranking the user generated input datasets according to their respective overall scores.
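For illustration only, the sequence of steps (101) to (106) could be sketched as follows. The measure names, the min-max normalization, and the weighted arithmetic averaging are assumptions chosen from the alternatives listed in claims 3 and 10, not a definitive implementation, and each score group is simplified to a single raw measure, whereas the claimed method would aggregate several metadata measures per group:

```python
def normalize(values):
    """Step (103): min-max normalize raw measure values to [0, 1].
    The current maximum serves as upper bound; per claim 9 the
    normalization would be recomputed whenever a new dataset
    yields a new maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_datasets(datasets, weights=None):
    """Steps (104)-(106): aggregate normalized measures into per-group
    scores (r_author, r_source, ...), reduce them to one overall score
    with a weighted arithmetic average, and sort by that score."""
    groups = ["author", "source", "intrinsic", "extrinsic", "recency"]
    weights = weights or {g: 1.0 for g in groups}
    total_w = sum(weights.values())
    # Step (103): normalize each measure across all datasets.
    normalized = {g: normalize([d["measures"][g] for d in datasets])
                  for g in groups}
    # Steps (104)/(105): per-group scores reduced to one overall score.
    scored = []
    for i, d in enumerate(datasets):
        overall = sum(weights[g] * normalized[g][i] for g in groups) / total_w
        scored.append((overall, d["id"]))
    # Step (106): sort by overall score, best first.
    return sorted(scored, reverse=True)


# Hypothetical posts with one raw measure per score group.
posts = [
    {"id": "post-1", "measures": {"author": 10, "source": 5, "intrinsic": 3,
                                  "extrinsic": 40, "recency": 1}},
    {"id": "post-2", "measures": {"author": 2, "source": 5, "intrinsic": 9,
                                  "extrinsic": 10, "recency": 8}},
    {"id": "post-3", "measures": {"author": 7, "source": 5, "intrinsic": 6,
                                  "extrinsic": 90, "recency": 4}},
]
print(rank_datasets(posts))  # post-3 ranks first, post-1 last
```

Swapping the arithmetic average for a geometric or harmonic mean, or changing the predetermined weights of claim 2, only alters the reduction in step (105); the surrounding pipeline is unchanged.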
PCT/EP2014/066082 2014-03-10 2014-07-25 Method and computer product for automatically generating a sorted list from user generated input and / or metadata derived form social media platforms WO2015135600A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP14075013.4 2014-03-10
EP14075013 2014-03-10

Publications (1)

Publication Number Publication Date
WO2015135600A1 true WO2015135600A1 (en) 2015-09-17

Family

ID=50389186

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2014/066082 WO2015135600A1 (en) 2014-03-10 2014-07-25 Method and computer product for automatically generating a sorted list from user generated input and / or metadata derived form social media platforms

Country Status (1)

Country Link
WO (1) WO2015135600A1 (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100257028A1 (en) * 2009-04-02 2010-10-07 Talk3, Inc. Methods and systems for extracting and managing latent social networks for use in commercial activities
WO2011130028A2 (en) * 2010-04-16 2011-10-20 Microsoft Corporation Social home page
US20110314009A1 (en) * 2009-03-10 2011-12-22 Tencent Technology (Shenzhen) Company Limited Method and Device for Extracting Characteristic Relation Circle From Network
US20130173612A1 (en) * 2011-02-15 2013-07-04 Dell Products L.P. Social Net Advocacy for Providing Categorical Analysis of User Generated Content


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020061694A1 (en) * 2018-09-25 2020-04-02 Valideck International System, devices, and methods for acquiring and verifying online information
US11093985B2 (en) 2018-09-25 2021-08-17 Valideck International System, devices, and methods for acquiring and verifying online information
US11150789B2 (en) 2019-08-30 2021-10-19 Social Native, Inc. Method, systems, and media to arrange a plurality of digital images within an image display section of a graphical user inteface (GUI)
CN112989218A (en) * 2021-03-12 2021-06-18 西华大学 Identity linking method based on multilevel attribute embedding and constraint canonical correlation analysis


Legal Events

121 Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 14757853; Country of ref document: EP; Kind code of ref document: A1.

NENP: Non-entry into the national phase. Ref country code: DE.

122 Ep: PCT application non-entry in European phase. Ref document number: 14757853; Country of ref document: EP; Kind code of ref document: A1.