WO2016186634A1 - Maximizing information value of web content

Info

Publication number: WO2016186634A1
Application number: PCT/US2015/031263
Authority: WO
Inventors: Bernardo Huberman; Sitaram Asur; Sandra SERVIA-RODRIGUEZ
Original assignee: Hewlett Packard Enterprise Development Lp
Priority date: 2015-05-15
Filing date: 2015-05-15
Publication date: 2016-11-24

Abstract

Examples relate to maximizing information value of web content. In some examples, sharing statistics for web content items are collected from a data computing device, where the sharing statistics include time-dependence data and temporal comparisons. A characteristic 2-vector is generated for each of the web content items based on the sharing statistics, where the characteristic 2-vector includes novelty values and popularity values. Next, the characteristic 2-vector of each web content item is normalized, and a Markov process is applied to each web content item to determine a corresponding transition probability based on a normalized, characteristic 2-vector associated with the web content item. At this stage, the web content items are continually ordered based on the transition probabilities, where a subset of the web content items are displayed according to the order of the web content items.

Description

MAXIMIZING INFORMATION VALUE OF WEB CONTENT

BACKGROUND

[0001] The popularity of the Web and social media services has resulted in a constant flood of information that can make it difficult for users to identify and consume the most relevant and useful pieces of content. Given the limited amount of attention that users can afford, providers of content can decide what items to prioritize in order to gain the attention of users and become popular. Examples of techniques for prioritizing web content include ranking (e.g., relevance algorithms used by search engines) and recommendations (i.e., user voting to specify if content is useful or not).

BRIEF DESCRIPTION OF THE DRAWINGS

[0002] The following detailed description references the drawings, wherein:

[0003] FIG. 1 is a block diagram of an example computing device for maximizing information value of web content;

[0004] FIG. 2 is a block diagram of an example computing device in communication with data computing devices for maximizing information value of web content;

[0005] FIG. 3 is a flowchart of an example method for execution by a computing device for maximizing information value of web content; and

[0006] FIG. 4 is a diagram of an example index rankings map that is ordered to maximize information value.

DETAILED DESCRIPTION

[0007] As detailed above, web content can be prioritized using ranking algorithms and/or user-based recommendations. However, ranking and recommendations are limited when prioritizing content in web media because the former uses a keyword query and the latter uses the preferences of the subjects (i.e., users). In online newspapers, magazines, and blogs, editors can decide the choice of content and the presentation order. Further, the emergence of news media aggregators has introduced citizen journalism-based ordering. That is, instead of having professional editors determine the important news, people can vote for news that they find interesting, and the votes received by an article play an important role in its ranking with respect to other news on the front page or in the different ordered lists of news.

[0008] Social media services feature a large number of subscribers and serve as aggregators of content such as news, promotional campaigns, media and status updates from users. Given the diversity and magnitude of content that is available, it is important, from the service provider's point of view, to ensure easy access to relevant information to users in order to retain and increase user engagement with the platform. For example, a timeline on a social media site may display content in decreasing order of publication. However, novelty is not the only feature that makes social media content valuable to users, and other features such as popularity can also contribute to give value to the content.

[0009] In this disclosure, a technique for selecting the arrangement of tweets that improves the information value for users is described. Considering the number of, for example, shares as an indicator of the popularity of a web content item and the time since the item was posted as an indicator of its novelty, examples herein use the solution proposed by Huberman and Wu (i.e., a dynamical model characterized by a single novelty factor) to obtain the arrangement of web content that improves the information value for the users. By mapping the problem to that of improved allocation of effort for a number of competing projects, Huberman and Wu formulate the problem as a special case of the bandit problem, which they solve by applying the adaptive greedy algorithm proposed by Bertsimas and Nino-Mora.

[0010] Examples disclosed herein improve information value of web content by using characteristics vectors that are based on sharing statistics to generate index ranking maps for ordering items in the web content. For example, the novelty and popularity of web content items can be used as objective measures of the items’ relevance and utility. In this example, the Huberman-Wu algorithm can be used to automatically select the items that should receive the most attention in the next time interval.

[0011] In some examples, sharing statistics for web content items are collected from a data computing device, where the sharing statistics include time-dependence data and temporal comparisons. A characteristic 2-vector is generated for each of the web content items based on the sharing statistics, where the characteristic 2-vector includes novelty values and popularity values. Next, the characteristic 2-vector of each web content item is normalized, and a Markov process is applied to each web content item to determine a corresponding transition probability based on a normalized, characteristic 2-vector associated with the web content item. At this stage, the web content items are continually ordered based on the transition probabilities, where a subset of the web content items are displayed according to the order of the web content items.

[0012] Referring now to the drawings, FIG. 1 is a block diagram of an example computing device 100 for maximizing information value of web content. Computing device 100 may be any computing device (e.g., server, desktop, notebook, tablet, etc.) with access to web content provided by, for example, data servers such as data computing device 200 of FIG. 2. In FIG. 1, computing device 100 includes a processor 110, an interface 115, and a machine-readable storage medium 120.

[0013] Processor 110 may be one or more central processing units (CPUs), microprocessors, and/or other hardware devices suitable for retrieval and execution of instructions stored in machine-readable storage medium 120. Processor 110 may fetch, decode, and execute instructions 122, 124, 126, 128, 130 to improve information value of web content, as described below. As an alternative or in addition to retrieving and executing instructions, processor 110 may include one or more electronic circuits comprising a number of electronic components for performing the functionality of one or more of instructions 122, 124, 126, 128, 130.

[0014] Interface 115 may include a number of electronic components for communicating with other computing devices. For example, interface 115 may be an Ethernet interface, a Universal Serial Bus (USB) interface, an IEEE 1394 (Firewire) interface, an external Serial Advanced Technology Attachment (eSATA) interface, or any other physical connection interface suitable for communication with the other computing devices. Alternatively, interface 115 may be a wireless interface, such as a wireless local area network (WLAN) interface or a near-field communication (NFC) interface. In operation, as detailed below, interface 115 may be used to send and receive data, such as web content, to and from a corresponding interface of another computing device.

[0015] Machine-readable storage medium 120 may be any electronic, magnetic, optical, or other physical storage device that stores executable instructions. Thus, machine-readable storage medium 120 may be, for example, Random Access Memory (RAM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a storage drive, an optical disc, and the like. As described in detail below, machine-readable storage medium 120 may be encoded with executable instructions for improving information value of web content. The machine-readable storage medium 120 may be non-transitory.

[0016] Sharing statistics collecting instructions 122 collect sharing statistics for web content from a data source (e.g., data computing device). Sharing statistics describe attributes of web content such as posts to a social media page, content provided by an online journal, etc. The attributes can include, for example, a timestamp for when the content was created, whether and how often the content was reshared, user voting, how often the content was viewed, etc. Sharing statistics can be collected from multiple sources and for any number of users at each of those sources.

[0017] Characteristic vector generating instructions 124 create a characteristic vector(s) for each item of web content based on the sharing statistics. For example, each characteristic vector can be a constant 2-vector that has a range of characteristic values for two attributes (e.g., novelty and popularity, etc.). In this example, the range of values for novelty can describe possible values for the number of times an item is reshared immediately after the item is initially posted, and the range of values for popularity can describe possible values for the number of times an item is reshared over a longer interval of time (in comparison to the interval used for novelty).

[0018] Characteristic normalizing instructions 126 normalize the characteristic vectors so that each of the possible values applies to an equal sized subset of the web content. For example, the majority of web content is reshared less than 100 times while a very small percentage of web content is reshared over 1000 times. In this example, the characteristic vector for popularity is normalized so that the range of possible values is equally distributed when applied to web content. Further, the normalized characteristic vector can also be modified to reflect an average number of reshares per time interval (e.g., minute, hour, day, etc.). Each possible value in a characteristic vector is attributed with a reward that is used to calculate the utility of an item of web content as described below.

[0019] Markov process applying instructions 128 apply a Markov process to the normalized characteristics vectors to determine a state for each item of web content. Specifically, the Markov process can be applied to the normalized characteristic vectors of the web content to dynamically determine transition probabilities of the states for the items of web content. A Markov process is a stochastic process that satisfies the memory-less property (i.e., Markov property), which states that future states are dependent on the present state and not previous events. The states of an item of web content include the range of possible combinations of characteristic values that are possible for the item. For example, the range of an item’s state can be from low utility to high utility, where a high utility indicates that the item is highly popular and novel.

[0020] Web content ordering instructions 130 order the items of web content based on the transition probabilities. Because the transition probabilities are dynamically determined, the ordering of the web content can be updated in real-time as the characteristics of the items change. In some cases where user interface real estate is limited, the items of web content displayed can be restricted to, for example, the top 10 items.

[0021] FIG. 2 is a block diagram of an example computing device 250 in communication via a network 245 with data computing devices 200A-200N. As illustrated in FIG. 2 and described below, computing device 250 may communicate with data computing devices 200A-200N to maximize information value of web content.

[0022] As illustrated, each data computing device 200A-200N may include a corresponding web content module 210A-210N, while computing device 250 may include a number of modules 252-268. Each of the modules may include a series of instructions encoded on a machine-readable storage medium and executable by a processor of the respective device 200, 250. In addition or as an alternative, each module may include one or more hardware devices including electronic circuitry for implementing the functionality described below.

[0023] As with computing device 250 of FIG. 2, each data computing device 200A-200N may be a server, a notebook, desktop, tablet, workstation, mobile device, or any other device suitable for executing the functionality described below. As detailed below, each data computing device 200A-200N may include a web content module 210A, 210N for providing web content and associated sharing statistics. For example, data computing device 200A can be a web server that is configured to provide web content in a social media network. In this example, the web content module 210A is configured to provide computing device 250 with access to the web content and associated sharing statistics, which describe various attributes of the web content.

[0024] As with computing device 100 of FIG. 1, computing device 250 may be any computing device with access to data computing devices 200A-200N over a network 245 that is suitable for executing the functionality described below. As detailed below, computing device 250 may include a series of modules 252-268 for improving information value of web content.

[0025] Interface module 252 may manage communications with the data computing devices 200A-200N. Specifically, the interface module 252 may initiate connections with the data computing devices 200A-200N and then send web content data to, or receive it from, the data computing devices 200A-200N.

[0026] Statistics module 256 may collect and process web content and associated sharing statistics from data computing devices 200A-200N. Collecting module 258 of statistics module 256 uses interface module 252 to collect the web content and sharing statistics from data computing devices 200A-200N. The data can be collected in real-time and/or based on a schedule. Each data computing device 200A-200N can provide web content of a different source such as a social media service, an online news journal, etc. Further, the collected data can be filtered based on various parameters. For example, the data collected can be associated with news media sources.

[0027] Characteristics module 260 of statistics module 256 processes the web content and sharing statistics to determine characteristics of the web content. For example, a range of popularity and novelty values can be determined for the web content of each source.

[0028] Characteristics module 260 can process and aggregate sharing statistics that describe the resharing (e.g., share, retweet, forward, etc.) of web content. Specifically, the sharing statistics may specify the number of times that each item of web content is reshared, which can vary greatly depending on the author, type of content, etc. In this case, the resharing statistics can be used to determine the average number of times content is reshared for a time interval (e.g., every minute, hourly, daily, etc.).

[0029] Further, trends can be identified in the resharing statistics as time-dependence data. For example, when observing a web content source that prioritizes novelty (e.g., TWITTER®), an item of web content receives more engagement shortly after the item is posted (e.g., in the second and third minutes): after an initial discovery period with little engagement (e.g., the first minute), the number of reshares greatly increases and then fits a power law distribution. At some stage, the number of reshares of an item decreases significantly because its novelty is lower. For less popular content, the increase in engagement after the initial discovery period and the following decline in engagement can occur over a larger time scale. TWITTER® is a registered trademark of Twitter, Inc., which is headquartered in San Francisco, California.

[0030] Temporal comparisons with other platforms can also be observed. In this case, when making a comparison with a web content source that accounts for novelty and popularity such as community-managed content sources (e.g., REDDIT®), the quantity of reshares does not decrease as dramatically with time for web content that is very popular (i.e., highly upvoted). In this case, the more times popular items are displayed, the more prone the items are to obtain users' attention. REDDIT® is a registered trademark of Reddit Inc., which is headquartered in San Francisco, California.

[0031] Characteristics module 260 can also process sharing statistics to determine the conditional variance of the number of reshares received after t minutes from the publication of an original item of web content. Conditional variance describes the variance between the number of reshares of an item received after t minutes from publication and other items that received the same number of reshares after t - 1 minutes from publication.
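As an illustration of the conditional variance described above, the following is a minimal sketch rather than the disclosed implementation. It assumes a hypothetical input layout in which two parallel lists hold each item's cumulative reshare count after t - 1 and t minutes; the function name `conditional_variance` is likewise an assumption.

```python
from collections import defaultdict
from statistics import pvariance


def conditional_variance(counts_prev, counts_now):
    """Variance of reshare counts at minute t, grouped by the count at minute t - 1.

    counts_prev[k] and counts_now[k] are the cumulative reshares of item k
    after t - 1 and t minutes (hypothetical input layout).
    """
    groups = defaultdict(list)
    for prev, now in zip(counts_prev, counts_now):
        groups[prev].append(now)
    # Population variance within each group of items that shared the same
    # reshare count one minute earlier.
    return {prev: pvariance(values) for prev, values in groups.items()}


# Five items: three had 0 reshares at t - 1, two had 5 reshares.
prev_counts = [0, 0, 0, 5, 5]
now_counts = [0, 2, 4, 5, 9]
cv = conditional_variance(prev_counts, now_counts)
```

A group with a single item has a variance of zero, so sparse groups may need to be pooled before the result is interpreted.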

[0032] Analysis module 262 may analyze data collected by statistics module 256 to improve information value of web content. Normalizing module 264 of analysis module 262 creates characteristics vectors for items of web content based on the sharing statistics. Characteristics vectors can be made for multiple attributes (e.g., novelty, popularity, etc.), where each characteristic vector describes the range of possible values for an item of web content for a particular attribute. Normalizing module 264 can also normalize the characteristics vectors so that web content is evenly distributed throughout the characteristics vectors. Each of the normalized values can be attributed with a reward that can be used to determine the utility of items of web content. For example, a characteristic vector for novelty can have higher rewards for values that indicate an item is more novel.

[0033] Markov module 266 of analysis module 262 applies a Markov process to web content to determine transitional probabilities. It is assumed that the state of each item changes according to the Markov process independent of the state of other items, with transition probabilities P^1_{ij} if the item is in a top list (e.g., top 10 items of web content) and P^0_{ij} if the item is not in the top list. In order to empirically calculate the transition probabilities, the web content posted during a set interval of observation is considered. For example, assuming that all the items are on the top list (i.e., all of them are displayed), P^1_{ij} is defined as:

P^1_{ij} = |I_j(t + 1)| / |I_i(t)|

where I_i(t) is the set of items in state i at time t and I_j(t + 1) is the set of items in state j at t + 1 that transited to this state from state i. At this stage, P^0_{ij} = P^1_{ij} / 10 is fixed for i ≠ j, which accounts for the fact that displaying an item on the top list accelerates its transition speed by ten times.
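The empirical estimate described above (the fraction of items in state i at time t that are found in state j at t + 1) can be sketched as follows. This is an illustrative sketch under assumptions: states are encoded as integers 0..n-1, the function names are hypothetical, and the off-list matrix is built by dividing off-diagonal entries by ten and leaving the remaining probability mass on the diagonal, which is one way to read the tenfold speed-up and is not confirmed by the text.

```python
from collections import Counter


def empirical_transitions(states_t, states_t1, n_states):
    """Estimate on-list transition probabilities from paired observations.

    states_t[k] and states_t1[k] are the states of item k at t and t + 1.
    Entry [i][j] is (# items moving i -> j) / (# items in state i at t).
    """
    in_state = Counter(states_t)
    moves = Counter(zip(states_t, states_t1))
    p1 = [[0.0] * n_states for _ in range(n_states)]
    for (i, j), count in moves.items():
        p1[i][j] = count / in_state[i]
    return p1


def off_list_transitions(p1, slowdown=10.0):
    """Assumed off-list matrix: moves to other states are `slowdown` times rarer."""
    n = len(p1)
    p0 = [[p1[i][j] / slowdown for j in range(n)] for i in range(n)]
    for i in range(n):
        p0[i][i] = 0.0
        p0[i][i] = 1.0 - sum(p0[i])  # remaining mass stays in state i
    return p0


states_t = [0, 0, 0, 0, 1, 1]
states_t1 = [0, 1, 1, 2, 1, 2]
p1 = empirical_transitions(states_t, states_t1, 3)
p0 = off_list_transitions(p1)
```

States that are never observed at time t produce an all-zero row in `p1`; a production estimate would need a rule for such rows.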

[0034] Ordering module 268 of analysis module 262 orders items of web content based on the transitional probabilities. In an example with two characteristic vectors with 10 states each, the G index (i.e., ordering index) rankings of the 101 states (100 states from the combination of the 2 vectors and an additional state 0 that is the unknown state) are calculated using, for example, the Bertsimas-Nino-Mora (BNM) adaptive greedy algorithm.

[0035] Before using the BNM algorithm, a set of constants [equation omitted] is calculated. Assuming that E is finite, for any subset S of E, the S-active policy u_S is defined to be the strategy that recommends items whose state is in S. Considering an item that starts from an initial state X(0), under the action implied by strategy u_S, its total occupancy time in S is given by

[equation omitted]

where

[equation omitted]

It is provided that

[equation omitted]

The variables [equation omitted] can be solved from the set of linear equations above. A matrix of constants is defined by means of [equation omitted] as follows:

[equation omitted]

The constants [equation omitted] are then used in the BNM algorithm as shown below:

[algorithm omitted]

The items of web content can then be described based on the G index (e.g., ordering, top 10 items, etc.).
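For orientation, the total occupancy time of an item under an S-active policy is commonly written in discounted form. The block below is a sketch under assumptions that the surviving text does not state: a discount factor β in (0, 1), the symbol T_i^S for the occupancy time starting from state i, and P^1_{ij} and P^0_{ij} for the transition probabilities of an item that is, respectively, on or off the top list. With those assumptions the occupancy times satisfy a finite system of linear equations, one per state in E, which is why they can be solved directly.

```latex
% Assumed notation: \beta (discount), T_i^S (occupancy time), P^1, P^0 (on/off-list transitions)
T_i^S \;=\; \mathbb{E}_{u_S}\!\left[\,\sum_{t=0}^{\infty} \beta^{t}\,
      \mathbf{1}\{X(t)\in S\} \;\middle|\; X(0)=i \right],
\qquad
T_i^S \;=\;
\begin{cases}
  1 + \beta \sum_{j\in E} P^{1}_{ij}\, T_j^S, & i \in S,\\[4pt]
  \beta \sum_{j\in E} P^{0}_{ij}\, T_j^S,     & i \notin S.
\end{cases}
```

The first case adds one unit of occupancy because the item is currently in S and is therefore recommended; the second adds none because the item is outside S and evolves with the off-list probabilities.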

[0036] Example G index rankings are shown in FIG. 4. After the web content is ordered, the items can be displayed according to the ordering to improve the utility of the information displayed. For example, the top 10 items can be displayed and updated as the ordering of the web content is dynamically determined based on real-time sharing statistics.

[0037] FIG. 3 is a flowchart of an example method 300 for execution by a computing device 100 for maximizing information value of web content. Although execution of method 300 is described below with reference to computing device 100 of FIG. 1, other suitable devices for execution of method 300 may be used, such as computing device 250 of FIG. 2. Method 300 may be implemented in the form of executable instructions stored on a machine-readable storage medium, such as storage medium 120, and/or in the form of electronic circuitry.

[0038] Method 300 may start in block 305 and continue to block 310, where computing device 100 collects sharing statistics for web content from a data computing device. Sharing statistics can be collected from multiple data devices and for any number of users at each of those devices. In block 315, a characteristic vector is created for each item of web content based on the sharing statistics. For example, a novelty vector and a popularity vector can be created based on the sharing statistics.

[0039] For a social media source, a certain set of properties for each post (e.g., age, number of reshares, favorites, etc.) can be tracked. The properties that define the state of each web content item at each instant t are its novelty (i.e., time since publication) and popularity (i.e., number of reshares of the item). In order to have a finite set of states E, the possible values of novelty and number of retweets are discretized, resulting in 10 different values for novelty and 10 different values for popularity. At this stage, each state can be represented as a 2-vector (n, p) with n, p ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. In addition to these 100 states, the unknown state (i.e., state 0) is also considered. Each item initially starts in the unknown state and also ends in the unknown state (i.e., the unknown state serves as both the source and the sink).
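The discretization into 100 known states plus the unknown state can be sketched as below. The limit values in `LIM_N` and `LIM_P`, the `max_age` cutoff after which an item returns to the unknown state, and the function names are all hypothetical choices made for illustration; the description does not supply them here.

```python
from bisect import bisect_right

# Hypothetical lower bounds of the 10 novelty and 10 popularity intervals.
LIM_N = [0, 1, 2, 4, 8, 15, 30, 60, 120, 240]    # minutes since posting
LIM_P = [0, 1, 2, 4, 8, 16, 32, 100, 300, 1000]  # reshares received so far
UNKNOWN = 0  # state 0: both the source and the sink


def interval_index(value, limits):
    """Return the 1-based index i with limits[i-1] <= value < limits[i]."""
    return bisect_right(limits, value)


def item_state(age_minutes, reshares, max_age=1440):
    """Map an item to its state (n, p), or to UNKNOWN if unseen or too old.

    max_age is an assumed cutoff, not a value taken from the description.
    """
    if age_minutes is None or age_minutes < 0 or age_minutes >= max_age:
        return UNKNOWN
    return (interval_index(age_minutes, LIM_N), interval_index(reshares, LIM_P))


state = item_state(3, 150)
```

Using the lower bound of each interval keeps the lookup to a single binary search per attribute, and values beyond the last limit fall into interval 10.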

[0040] In order to set the reward and the values of the properties that define each state, the novelty and popularity of the web content items posted during an observation period are considered. Limits for the characteristics vectors can be set based on the observed data. For example, if dealing with a social media source that favors novelty, the limits between the different novelty intervals can be as follows:

[limit values omitted]

In this example, the state of novelty n = i contains the items that were posted between lim_n[i] and lim_n[i + 1] - 1 minutes before the current time of observation.

[0041] With respect to the popularity of the items, it is observed that the number of reshares per item is distributed according to a power law distribution, where the majority of the items receive fewer than 100 reshares whereas a very small percentage of items are reshared more than 1000 times. In order to set the popularity of the states, the items are sorted according to the number of times they are reshared and split into equal-sized subsets. In this example, the limits between the different intervals that define the state are:

[limit values omitted]

So, the state of popularity p = j contains the items that have been retweeted between lim_p[j] and lim_p[j + 1] - 1 times before the current time of observation.
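The equal-sized split that produces the popularity limits can be sketched as follows. This is an illustrative sketch only: the function name `equal_size_limits` is hypothetical, and the lower edge of each subset is used as its limit, which is one reasonable reading of the description.

```python
def equal_size_limits(reshare_counts, n_bins=10):
    """Lower limit of each bin so that the bins hold roughly equal numbers of items."""
    if not reshare_counts:
        raise ValueError("at least one observed item is required")
    ordered = sorted(reshare_counts)
    size = len(ordered)
    # The b-th limit is the smallest reshare count in the b-th equal-sized slice.
    return [ordered[(b * size) // n_bins] for b in range(n_bins)]


limits = equal_size_limits(list(range(100)))
```

Because reshare counts are heavily skewed, many items can share the same small count, so neighbouring limits may coincide on real data and might need to be merged or nudged apart.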

[0042] In block 320, the characteristic vectors are normalized so that each of the possible values applies to an equal sized subset of the web content. Further, the normalized characteristic vector can also be modified to reflect an average number of reshares per minute. Each possible value in a characteristic vector is attributed with a reward that is used to calculate the utility of an item of web content.

[0043] In this example, the reward of each state can be set to

r(n, p) = r_n · r_p

where r_n and r_p are the normalized average numbers of reshares per interval. In other words, r_n is the average number of reshares received between lim_n[i] and lim_n[i + 1] - 1 minutes after publication in the case of novelty, and r_p is the average number of total reshares received by those items that have received between lim_p[i] and lim_p[i + 1] - 1 reshares in the case of popularity, which results in

[reward values omitted]

The reward when p = 1 is not zero but, in order to conserve the reward of the novelty in r(n, 1) for all n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, the average number of reshares in this set is considered to be 1.

[0044] In block 325, a Markov process is applied to the normalized characteristics vectors to determine a state for each item of web content. Specifically, the Markov process can be applied to the normalized characteristic vectors of the web content to dynamically determine transition probabilities of the states for the items of web content. In block 330, the items of web content are ordered based on the transition probabilities. Because the transition probabilities are dynamically determined, the ordering of the web content can be updated in real-time as the characteristics of the items change. In some cases where user interface real estate is limited, the items of web content displayed can be restricted to, for example, the top 10 items. Further, the refresh rate of items of web content can also be dependent on their ordering (i.e., higher priority items can be refreshed with new values more frequently).
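The final ordering step can be sketched as a sort on the index ranking of each item's state. In this sketch, which is an illustration and not the disclosed implementation, the ranking values for states (6, 2), (6, 3), (5, 2), and (6, 5) follow the description of FIG. 4, rank 1 is assumed to be displayed first, and the function name `top_items` and the item identifiers are hypothetical.

```python
def top_items(item_states, g_rank, k=10):
    """Order items by the rank of their current state and keep the first k.

    item_states maps an item id to its state; g_rank maps a state to its
    index ranking, where rank 1 is assumed to have the highest priority.
    States without a ranking are placed last.
    """
    ranked = sorted(
        item_states,
        key=lambda item: g_rank.get(item_states[item], float("inf")),
    )
    return ranked[:k]


# Partial ranking: only the states mentioned in the description of FIG. 4.
g_rank = {(6, 2): 1, (6, 3): 2, (5, 2): 4, (6, 5): 5}
item_states = {"a": (6, 5), "b": (6, 2), "c": (5, 2), "d": (9, 9)}
shown = top_items(item_states, g_rank, k=3)
```

Recomputing `item_states` as new sharing statistics arrive and calling the function again yields the continually updated top list.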

[0045] Method 300 may then continue to block 335, where method 300 may stop.

[0046] FIG. 4 is a diagram of an example index rankings map 400 that is ordered to maximize information value. The index rankings map 400 shows rankings for items of web content, which are ordered according to a G index, the value of which is indicated on a node associated with each item. As shown in the known states 408 for the web content, state (6; 2) has the largest G index, state (6; 3) the second-largest, and so on. Index rankings map 400 has a popularity axis 404 and a novelty axis 406. The absolute values of the indices are not as important as their relative orders, and items should be displayed according to the relative order of the indices of their states. For example, state (6; 2), which has a G index of 1, is not the most novel but is the most popular. On the other hand, state (5; 2), which has a G index of 4, is less popular but more novel than state (6; 5), which has a G index of 5. Also, because the algorithm gives high index values to potentially valuable states, the unknown state 402, which gives no reward, should have a higher display priority than other states with a positive reward. Further, the influence of popularity on the output is higher than the influence of novelty.

[0047] The foregoing disclosure describes a number of examples for improving information value of web content by a computing device. In this manner, the examples disclosed herein enable improving information value by using characteristics vectors that are based on sharing statistics to dynamically order items of the web content.

Claims

CLAIMS

We claim:

1. A computing device for maximizing information value of web content, the computing device comprising:

a processor to:

collect sharing statistics for a plurality of web content items from a data computing device, wherein the sharing statistics comprise time-dependence data and temporal comparisons;

generate a characteristic 2-vector for each of the plurality of web content items based on the sharing statistics, wherein the characteristic 2-vector comprises a plurality of novelty values and a plurality of popularity values;

normalize the characteristic 2-vector of each web content item of the plurality of web content items;

apply a Markov process to each web content item of the plurality of web content items to determine a corresponding transition probability of a plurality of transition probabilities based on a normalized, characteristic 2-vector associated with the web content item; and

continually order the plurality of web content items based on the plurality of transition probabilities, wherein a subset of the plurality of web content items are displayed according to the order of the plurality of web content items.

2. The computing device of claim 1, wherein the ordering of the plurality of web content items is performed using a Bertsimas-Nino-Mora adaptive greedy algorithm.

3. The computing device of claim 1, wherein a refresh rate for updating each transitional probability of the plurality of transitional probabilities is determined by the order of a corresponding web content item of the plurality of web content items.

4. The computing device of claim 1, wherein the processor is further to select a state from the corresponding normalized, characteristic 2-vector, wherein the state is associated with a utility reward that is used to determine the corresponding transition probability.

5. The computing device of claim 1, wherein the data computing device provides a social media service, and wherein the temporal comparisons are obtained by comparing the social media service to a community-managed content source.

6. The computing device of claim 1, wherein the plurality of popularity values are determined using a plurality of reshares that satisfy a power law distribution.

7. A method for maximizing information value of web content, the method comprising:

collecting sharing statistics for a plurality of web content items from a data computing device, wherein the sharing statistics comprise time-dependence data and temporal comparisons;

generating a characteristic 2-vector for each of the plurality of web content items based on the sharing statistics, wherein the characteristic 2-vector comprises a plurality of novelty values and a plurality of popularity values;

normalizing the characteristic 2-vector of each web content item of the plurality of web content items;

applying a Markov process to each web content item of the plurality of web content items to determine a corresponding transition probability of a plurality of transition probabilities based on a normalized, characteristic 2-vector associated with the web content item; and

continually using an adaptive greedy algorithm to order the plurality of web content items based on the plurality of transition probabilities, wherein a subset of the plurality of web content items are displayed according to the order of the plurality of web content items.

8. The method of claim 7, wherein a refresh rate for updating each transitional probability of the plurality of transitional probabilities is determined by the order of a corresponding web content item of the plurality of web content items.

9. The method of claim 7, further comprising selecting a state from the corresponding normalized, characteristic 2-vector, wherein the state is associated with a utility reward that is used to determine the corresponding transition probability.

10. The method of claim 7, wherein the data computing device provides a social media service, and wherein the temporal comparisons are obtained by comparing the social media service to a community-managed content source.

11. The method of claim 7, wherein the plurality of popularity values are determined using a plurality of reshares that satisfy a power law distribution.

12. A non-transitory, machine-readable storage medium encoded with instructions executable by a processor for maximizing information value of web content, the machine-readable storage medium comprising instructions to:

normalize the characteristic 2-vector of each web content item of the plurality of web content items;

select a state of a plurality of states from the corresponding normalized, characteristic 2-vector for each web content item of the plurality of web content items, wherein the state is associated with a utility reward of a plurality of utility rewards;

apply a Markov process to each web content item of the plurality of web content items to determine a corresponding transition probability of a plurality of transition probabilities based on the utility reward associated with the web content item; and

13. The storage medium of claim 12, wherein the ordering of the plurality of web content items is performed using a Bertsimas-Nino-Mora adaptive greedy algorithm.

14. The storage medium of claim 12, wherein a refresh rate for updating each transitional probability of the plurality of transitional probabilities is determined by the order of a corresponding web content item of the plurality of web content items.

15. The storage medium of claim 12, wherein the plurality of popularity values are determined using a plurality of reshares that satisfy a power law distribution.