WO2022147060A1

WO2022147060A1 - Providing topics entities in a messaging platform

Info

Publication number: WO2022147060A1
Application number: PCT/US2021/065421
Authority: WO
Inventors: Arash AGHVELI; Aziz Michael BATIHK; Brian WICHERS; Gui Ming TANG (Jim); Hafeezul Rahman MOHAMMAD; Joshua LANDE; Mike Wu; Lu Gao; Masoud VALAFAR; Matt Miller; Michael Barry; Mira RADEVA; Murph FINNICUM; Prachi PODDAR; Prakash Rajagopal; Tejas DHARAMSI; Venu Satuluri; Xinqian LI; Yang Tang; Yao Wu
Original assignee: Twitter, Inc.
Priority date: 2020-12-31
Filing date: 2021-12-29
Publication date: 2022-07-07

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for providing topics information to users of a messaging platform. One of the methods includes storing information about multiple topics on a messaging platform, wherein each of the plurality of topics is a predefined topic on the platform that represents a subject of platform content, each of the topics is an entity distinct from any account of the platform, and each of the topics is an entity that the platform enables users of the platform to follow. This method further includes identifying a set of candidate topics that are likely interesting to a user of the platform; generating for display a content presentation interface presenting the set of candidate topics to the user; and receiving from the user a selection of one or more topics to follow among the set of candidate topics.

Description

PROVIDING TOPICS ENTITIES IN A MESSAGING PLATFORM

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to US Application No. 17/405,153, filed on August 18, 2021, which claims the benefit of US Provisional Application No. 63/132,689, filed December 31, 2020, the entire contents of each of which are incorporated herein by reference.

TECHNICAL FIELD

This specification relates generally to messaging platforms and, more particularly, to systems and methods for identifying and selecting topics information, and for selecting content items related to subjects of content represented by topics.

BACKGROUND AND SUMMARY

A messaging platform operates to provide a social media service, for example, services enabling its users to post, view and engage with content on the platform. Content on the platform includes user-authored content, for example, messages broadcast or posted by an account of the platform, or platform-generated content, for example, news, events, or notifications. The format of the content can be or include one or more of text, graphics, video, audio, multimedia content, or content links. A user can initiate or participate in one or more message threads by posting messages, by commenting or responding to posted messages, by forwarding posted messages, or by expressing a sentiment, e.g., a like, about posted messages.

The term “user”, as used in this specification, may refer to a human user or an account used by a human user or both, as determined by the context. A user may not be an account holder or may not be logged in to an account of the platform. In this case, the platform enables such a user to utilize certain functionalities of the platform by associating the user with a temporary account or identifier. In this specification, an account of a user on the platform may be referred to as a “user,” “producer”, “content producer”, “consumer” or “content consumer,” depending on context and role being played.

To help connect users with content they are actually interested in, the platform implements topics as a new type of followable entity representing a subject of content. The platform enables users to select and follow topics they are interested in, e.g., from a list of topics through various content presentation interfaces of the platform, for example, users’ home pages or timelines, or explore pages. A user’s home page is the user’s main page, for example, presenting top content selected for the user from various applications on the platform. A timeline is a user interface presenting a stream of content items, e.g., messages, displayed in some order, e.g., the order in which they are posted or generated, with the most recent on top. As one of a user’s timelines, a home timeline is a user interface that the user sees by default and that presents a stream of content items selected for the user and generally updated in real-time, for example, content from accounts the user has chosen to follow on the platform.

Topics followed by users indicate the users’ interests. The platform uses the user-topic following relationships to improve the content discovery and consumption experience of users. For example, once a user follows a topic, more content related to it will start appearing in the user’s home timeline.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings facilitate an understanding of non-limiting, example embodiments of the disclosed technology. In the drawings:

FIGURE 1 illustrates an example messaging system according to an implementation;

FIGURE 2 illustrates an example directed acyclic graph according to an implementation;

FIGURE 3 illustrates example related-to relationships according to an implementation;

FIGURE 4 illustrates example following relationships between users according to an implementation;

FIGURE 5 illustrates example known-for relationships between users according to an implementation;

FIGURE 6 illustrates example interested-in relationships between users according to an implementation;

FIGURE 7 illustrates an example entity graph according to an implementation;

FIGURE 8A illustrates an example user interface for selecting topics to follow according to an implementation;

FIGURE 8B illustrates an example user interface for topic selection according to an implementation;

FIGURE 9 illustrates example computation of a topic matrix according to an implementation;

FIGURE 10A illustrates an example model architecture according to an implementation;

FIGURE 10B illustrates an example model architecture according to an implementation;

FIGURE 11 illustrates an example user profile page according to an implementation;

FIGURE 12 illustrates example computation of a producer matrix according to an implementation;

FIGURE 13 illustrates an example workflow for identifying authoritative producers on topics according to an implementation;

FIGURE 14 illustrates an example user interface of an interactive tool according to an implementation;

FIGURE 15 illustrates an example system flow for recommending content based on topics according to an implementation;

FIGURE 16 illustrates an example workflow for a content recommender according to an implementation;

FIGURE 17 illustrates an example workflow for a SimClusters algorithm according to an implementation;

FIGURE 18 illustrates an example user-user graph and an example bipartite graph corresponding to the user-user graph according to an implementation;

FIGURE 19 illustrates generation of an example producer-producer similarity graph from an example bipartite graph according to an implementation;

FIGURE 20 illustrates an example known-for matrix according to an implementation; and

FIGURE 21 illustrates computation of an example user interested-in matrix according to an implementation.

DETAILED DESCRIPTION

Reference will now be made in greater detail to embodiments, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout. In this regard, the example embodiments may have different forms and should not be construed as being limited to the descriptions set forth herein. Accordingly, the example embodiments are merely described below, by referring to the figures, to explain certain aspects. In the accompanying drawings, portions irrelevant to a description of the embodiments are omitted for clarity. Expressions such as “at least one of,” when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list.

FIGURE 1 illustrates an example system 100, which includes a messaging platform 102 and many client devices 104. The messaging platform, which may be referred to simply as the “platform,” is implemented on one or more computers, located in one or more locations interconnected with a data communication network, and programmed appropriately in accordance with this specification. In some implementations, the platform has multiple components including a content repository 112, a user repository 114, a topic repository 130, and a content selection component 126. Each component is implemented on one or more computers in one or more locations. The client devices are computers, generally personal devices, in particular, mobile devices, running client software 128, e.g., apps, applications, or scripts running in a web browser, that communicates with, and enables users to interact with, the platform.

The content repository stores content and information about content. The information about a content item includes an identifier of the broadcasting or posting user account of the content item, a list of users for receiving the content item, or a number of users that received the content item. The information about the content item may further include sentiments, e.g., a like or a dislike, or a degree or magnitude associated with the sentiments.

The topic repository stores information about topics. As described below in Section I, information about topics includes attributes and keywords of topics.

The user repository stores information about users and connections between users. Specifically, the user repository relates an identifier of a user to the user's preferences or history on the platform. For example, the user preferences or history includes language preferences, user accounts followed by the user account, or subjects that a user account is interested in.

The platform stores information for multiple different accounts, e.g., accounts for individuals, businesses, or organizations, as well as pseudonym accounts and novelty accounts. One or more users of each account use the platform to send content, e.g., messages, to other accounts inside or outside of the platform. The platform may enable users to communicate in “real-time”, i.e., to converse with other users by exchanging content with a minimal delay and to conduct essentially a conversation with one or more other users during concurrent sessions on the platform.

The content selection component operates to identify content to present to users based on, e.g., the accounts and topics the users are following and the users’ interests.

I. Technical Features of Topics

1. Information about a Topic

A topic represents a subject that content may relate to. The platform stores data defining a set of topics and information about the topics. Information about a topic includes attributes of the topic, for example, one or more of an identifier, name, language, country, creation time, or description of the topic. Attributes of a topic may also include information identifying users that are likely to publish content related to the topic or one or more keywords that define what the topic is about or relevant to.

A topic may have an attribute that provides a general indication of which geographic audiences will likely be interested in content related to the topic.

The language of a topic does not necessarily limit the language of keywords associated with the topic, e.g., even if the language of a topic is “English-US,” information about the topic can still include Japanese keywords. Keywords associated with a topic may be used by the platform to determine whether a content item is related to a topic by matching the keywords against at least part of the content item or content metadata.
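By way of a non-limiting illustration, the keyword matching described above might be sketched as follows. The `Topic` record, its field names, and the case-insensitive substring rule are assumptions made for this sketch only; the specification does not prescribe a particular matching technique.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topic:
    # Hypothetical record holding a few of the attributes named in the text.
    topic_id: str
    name: str
    language: str
    keywords: frozenset = frozenset()


def is_related(topic: Topic, text: str, metadata: str = "") -> bool:
    """Return True if any topic keyword occurs in the content text or metadata.

    Assumption: plain case-insensitive substring matching. This also works for
    languages written without spaces, but can over-match short keywords.
    """
    haystack = f"{text} {metadata}".casefold()
    return any(keyword.casefold() in haystack for keyword in topic.keywords)


# The topic language is "English-US", yet a Japanese keyword is still allowed.
baseball = Topic(
    "t-baseball", "Baseball", "English-US",
    frozenset({"baseball", "home run", "野球"}),
)
print(is_related(baseball, "What a home run tonight!"))
print(is_related(baseball, "今日の野球の試合"))
print(is_related(baseball, "Quarterly earnings call"))
```

A production matcher would likely tokenize per language rather than rely on substrings; the sketch only shows that keyword language and topic language are independent.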

In some implementations, the platform stores a set of topics together with information about the topics as respective global objects in the topic repository. Alternatively, the platform can store the information in other forms of organization.

The platform uses information about topics to identify content related to a given topic, identify topics and messages that are likely interesting to a user, identify content producers known for a given topic, or identify topics to recommend to a user.

2. Relationships between Topics

The relationships between topics can be represented by the platform as a directed acyclic graph (DAG), in which each node represents a topic and each edge represents a relationship between two topics. The DAG is referred to as a “topic relatedness graph” in this specification. The relationship between two topics is a parent-child relationship, which represents a relationship between a topic and one of its sub-topics. A topic may have multiple parents and multiple children.

FIGURE 2 illustrates an example directed acyclic graph 200 that includes a set of nodes representing topics. A node in the directed acyclic graph that does not have any parents is referred to as a “top-level topic node” corresponding to a “top-level topic.” Top-level topic nodes generally represent broad topics, for example, Arts & Culture, Business & Finance, Entertainment, Fashion, Food, Sports, or News.

A topic may have one or more sub-topics. For example, the topic “Sports” may have sub-topics “Baseball”, “Basketball”, and “Sports News”. Sub-topics are topics narrower than their parent topic in that any item that is related to the sub-topic will be related to the topic, but not necessarily vice versa. A sub-topic may have further sub-topics.

A topic may be a sub-topic of multiple topics, which may or may not have parent-child relationships. For example, the topic “Sports News” is a sub-topic of both topics “Sports” and “News”, but topics “Sports” and “News” are not related to each other as parents, directly or indirectly, or as children, directly or indirectly.

The relatedness between a topic and its related topic may be scored based on the topic relatedness graph. In some implementations, the platform computes the relatedness between a topic and its related topic by the number of steps to traverse from the topic to its related topic through edges in the topic relatedness graph. The traversal of the graph may or may not follow the directions of edges in the graph, i.e., either from a topic to one of its child topics or from the topic to one of its parent topics. Optionally, edges in the topic relatedness graph are weighted, and the weight of each edge represents the strength of relatedness between nodes connected by the edge. Thus, the platform may compute the relatedness between two topics by further taking into account the weights of the edges to traverse from one of the topics to the other.
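By way of a non-limiting illustration, a step-count relatedness over the topic relatedness graph might be sketched as follows. The function names are hypothetical, edges are traversed in either direction, and converting an edge weight (strength) into a traversal cost of `1 / weight` is an assumption of this sketch, not something the specification states. With every weight set to 1.0 the result is simply the number of steps.

```python
import heapq
from collections import defaultdict


def build_undirected(edges):
    """edges: iterable of (parent, child, weight), weight = relatedness strength > 0."""
    graph = defaultdict(list)
    for parent, child, weight in edges:
        cost = 1.0 / weight  # assumption: a stronger edge is a "shorter" step
        graph[parent].append((child, cost))
        graph[child].append((parent, cost))
    return graph


def distance(graph, source, target):
    """Dijkstra shortest-path cost; smaller means more closely related."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, step in graph.get(node, ()):
            new_cost = cost + step
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return float("inf")  # no path: the topics are unrelated in the graph


# Parent-child edges taken from the Sports / News example in the text.
edges = [
    ("Sports", "Baseball", 1.0),
    ("Sports", "Basketball", 1.0),
    ("Sports", "Sports News", 1.0),
    ("News", "Sports News", 1.0),
]
graph = build_undirected(edges)
print(distance(graph, "Sports", "Sports News"))
print(distance(graph, "Baseball", "News"))
```

Here “Baseball” reaches “News” only by going up to “Sports”, down to “Sports News”, and up again, which is why a traversal that ignores edge direction is needed for this pair.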

3. Relationships among Topics, Content and Users

The platform uses one or more entity graphs, in which each node represents an entity and each edge represents a relationship between the two entities connected by the edge, i.e., how the entities are related to each other. An entity in such a graph may be a topic, a user, or a content item, e.g., a message, an event, e.g., a broadcast media event or a live event, a news item, a notification, or an advertisement.

As topics, represented as nodes, are added to graphs as new entities, the graphs will generally represent new relationships among topics, content and users, e.g., following relationships between users and topics, known-for relationships between users and topics, and related-to relationships between content and topics, as described below. The platform determines content that is likely interesting to a user by leveraging these additional relationships.

(a) Content Related to Topics

FIGURE 3 illustrates example related-to relationships between topics 310 and content items 320, e.g., messages. In the figure, an edge connecting a message and a topic indicates that the message is related to the topic.

(b) Topics Explicitly Followed By a User

The platform enables users to explicitly follow certain topics, for example, to keep up with information related to the topics they are interested in. As described below in Section II, the platform enables users to select topics to follow, e.g., from their home pages, their home timelines, other users’ profile pages, or topics’ landing pages. After a user explicitly follows a topic, the platform will present content related to the followed topic to the user through various content presentation interfaces across the platform.

FIGURE 4 illustrates example following relationships between users, e.g., consumers 410, topics 420 and content items 430. The users (User1, User2, User3 and User4) in the figure are referred to as “content consumers” or “consumers”. One user may follow one or more topics, and a topic may be followed by one or more consumers. If a user follows a topic, the platform treats content items related to the followed topic as likely being of interest to the user.

In some implementations, the platform enables users who are following a topic to specify what kinds of topic-related content they prefer to receive. For example, users can select to follow one or more sub-topics of the followed topic. The platform can optionally provide a pull-down menu adjacent to a message on a topic, which enables users to indicate, while they are viewing the message, that they are or are not interested in it. The platform can use the user’s indication as a signal when scoring content similar to the message, as a result of which similar messages may be ranked lower or not be shown at all on the user’s home page or timeline.

The platform enables users to un-follow topics that they have explicitly followed. The platform may also enable users to mute certain topics entirely, which means no content related to the topic will be shown to the user while the topic is muted, or to instruct the platform to show less content from certain topics.

In some implementations, the platform enables users to rate topics they have followed relative to each other, and ranks content related to their followed topics based on the users’ ratings of the topics, so that more content related to a higher-rated topic is presented than content related to a lower-rated one.

(c) Authoritative Producers Known for a Topic

One of the reasons why users follow topics is to get content from experts on the topics or influencers related to the topics. The platform identifies a group of accounts known as experts or influencers of a given topic, and identifies content posted by these accounts as likely being related to the topic. These accounts are also referred to as “authoritative producers” on the topic. The platform or human curators identify authoritative producers automatically or by hand, as described below in Section III.1.(d).

FIGURE 5 illustrates example known-for relationships between users, e.g., producers 520, content items 510 and topics 530. The users (User1, User2, User3 and User4) in the figure are referred to as “content producers” or “producers,” and an edge in the figure indicates that a producer is identified as one of the authoritative producers who are known for a topic. A producer may be an authoritative producer on multiple topics.

The platform treats content from authoritative producers on a topic as likely being related to the topic. In FIGURE 5, since User1 is identified as an authoritative producer on topics Topic1 and Topic2, and User1 posted messages MSG1 and MSG2, these messages are identified as likely being related to topics Topic1 and Topic2.

Generally, not all content posted by an expert will be related to the expert’s topic of expertise. Thus, the platform applies textual matching rules to filter out off-topic content from authoritative producers on a topic. For example, the platform filters messages from authoritative producers according to a rule defining that a message is likely not about a topic if the message includes none of a predetermined set of keywords associated with the topic and is not a reply to or comment on a message or event that is related to the topic.
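By way of a non-limiting illustration, the example filtering rule above might be sketched as follows. The `Message` record, the function names, and the use of case-insensitive substring matching are assumptions of this sketch. A message is kept if it contains a topic keyword or replies to an item already known to be related to the topic; otherwise it is treated as likely off-topic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    # Hypothetical message record used only for this sketch.
    msg_id: str
    author: str
    text: str
    reply_to: Optional[str] = None  # id of the message or event replied to


def likely_on_topic(msg, keywords, related_ids):
    """Negation of the rule in the text: off-topic only if there is no keyword
    AND the message is not a reply to something related to the topic."""
    text = msg.text.casefold()
    has_keyword = any(keyword.casefold() in text for keyword in keywords)
    replies_to_related = msg.reply_to in related_ids
    return has_keyword or replies_to_related


def filter_authoritative(messages, producers, keywords, related_ids):
    """Keep messages from authoritative producers that pass the rule."""
    return [
        msg for msg in messages
        if msg.author in producers and likely_on_topic(msg, keywords, related_ids)
    ]


messages = [
    Message("m1", "User1", "What a home run!"),
    Message("m2", "User1", "Lunch was great today"),
    Message("m3", "User1", "Agreed!", reply_to="e9"),
    Message("m4", "User9", "I love baseball"),  # not an authoritative producer
]
kept = filter_authoritative(
    messages,
    producers={"User1"},
    keywords={"baseball", "home run"},
    related_ids={"e9"},  # e.g. a live event already known to be about the topic
)
print([msg.msg_id for msg in kept])
```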

In some implementations, the platform optionally conveys to users who are identified to be authoritative producers on a topic that they are known for the topic and enables such users to opt out of this status, e.g., using their profile configuration dashboards.

(d) Topics a User May be Interested in or Implicitly Follows

Even if a user does not explicitly follow any topics, the platform may still determine topics the user is interested in based on the user’s historical activities on the platform, e.g., the user’s interaction with content and other users, or the content produced by the user. By way of example, the platform determines that a user is interested in a topic based on any combination of different criteria, e.g., whether the user has searched for certain terms aligning to the topic, followed certain authoritative producers on the topic, or liked, re-posted or forwarded messages that are related to the topic or that were posted by one or more authoritative producers on the topic. Various ways of determining topics a user is likely interested in are described below from Section II.3.(b) to Section II.3.(d).

FIGURE 6 illustrates example interested-in relationships between users, e.g., consumers 610 and topics 620. Users (User1, User2, User3 and User4) in the figure are referred to as “consumers.” An edge in the figure indicates that a consumer is interested in or implicitly follows a topic. The platform uses the determined topics, in which a user might be interested, in generally the same way it uses explicitly followed topics to provide content recommendations or personalization.

FIGURE 7 illustrates an example entity graph 700. The entity graph includes nodes representing entities, i.e., topics 710, consumers of content 720, producers of content 730 and content items 740, e.g., messages, and edges each representing relationships between entities, e.g., consumers follow topics, consumers follow producers, consumers engage with messages, producers are authoritative producers of content for topics.

II. Topic Discovery for New and Experienced Users

The platform identifies topics that are likely interesting to users and presents such topics to them on various content presentation interfaces of the platform for selection. After following topics they are interested in, users will start to see more content related to their followed topics on the platform. This is especially effective for new users or other users that the platform does not have enough information about to determine their likely interests. In contrast, for experienced users, e.g., users who regularly log in and engage with content, the platform will generally have enough information to determine their likely interests.

The platform uses different methods to identify candidate topics, i.e., topics the user is likely interested in, depending on whether the platform has enough information about the users’ historical activities.

For both new and experienced users, without using historical information about users, the platform identifies topics that are likely interesting to such users by methods described below, e.g., in Section II.1, Section II.2, Section II.3.(a), Sections II.5.(a)-(b) and Section II.6.

For experienced users, the platform may use historical information to identify topics that are likely interesting to these users using additional methods, e.g., described below in Sections II.3.(b) - (d), Section II.4 and Section II.5.(c).

1. Topic Recommendations based on Popularity or Generic Attributes

In some implementations, when onboarding new users, the platform recommends topics to these new users algorithmically. An example algorithm sorts all the topics in each of multiple categories by popularity, and then the categories are also sorted by the sum of the popularity of the topics within them. The most popular categories are presented in popularity order. Under each category, the most popular topics are presented for selection by the user. In particular, the operations are: (i) calculate the popularity score of each topic; (ii) map each topic to one or more categories; and (iii) for each category, pick the top k, e.g., 10, 20, or 30, topics. Optionally, in some implementations, curators may from time to time exclude some topics from consideration as topics to recommend to new users.
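By way of a non-limiting illustration, operations (i) to (iii) above, together with the category ordering and the optional curator exclusions, might be sketched as follows. The function name, the input shapes, and the alphabetical tie-break are assumptions of this sketch; the popularity scores are taken as already computed.

```python
from collections import defaultdict


def recommend_by_popularity(popularity, categories_of, k=10, excluded=frozenset()):
    """Return [(category, [top-k topics])], most popular category first.

    popularity:    {topic: popularity score}             (operation i)
    categories_of: {topic: [category, ...]}              (operation ii)
    excluded:      topics removed by curators (assumed to be dropped entirely)
    """
    by_category = defaultdict(list)
    for topic, score in popularity.items():
        if topic in excluded:
            continue
        for category in categories_of.get(topic, ()):
            by_category[category].append((score, topic))

    # Categories are ordered by the summed popularity of the topics within them.
    ranked = sorted(
        by_category.items(),
        key=lambda item: sum(score for score, _ in item[1]),
        reverse=True,
    )
    # Within each category keep the k most popular topics (operation iii).
    return [
        (category,
         [topic for _, topic in sorted(topics, key=lambda st: (-st[0], st[1]))[:k]])
        for category, topics in ranked
    ]


# Hypothetical scores; a topic may belong to more than one category.
popularity = {"NBA": 90, "MLB": 70, "Sports News": 50, "Pop": 80, "Jazz": 20}
categories_of = {
    "NBA": ["Sports"], "MLB": ["Sports"], "Sports News": ["Sports", "News"],
    "Pop": ["Music"], "Jazz": ["Music"],
}
result = recommend_by_popularity(popularity, categories_of, k=2)
print(result)
```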

In some implementations, the platform recommends topics to users simply based on one or more generic attributes of topics, e.g., country, language, or trendiness. In these implementations, the topic repository stores a predetermined number of topics associated with certain generic attributes. For example, the topic repository associates each country with a list of the most-popular topics among users in that country. The platform may determine the popularity of topics for each country in accordance with users’ engagements with content related to these topics, e.g., how many users in that country have viewed, liked, re-posted, forwarded or commented on such content. According to the attributes of the stored popular topics and the locale information of users, the platform may recommend to users a predetermined number of the most-popular topics for their country. In some implementations, the platform maintains an association of a predetermined number of the most-popular topics with other kinds of geographical regions, e.g., states, or cities, or entities with which a user may be associated, e.g., schools or employers, and uses that information to recommend topics.
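By way of a non-limiting illustration, a per-country list of most-popular topics might be derived from engagement records as sketched below. The record shape `(user, country, topic)`, the function name, and the choice to measure popularity as the number of distinct engaging users are assumptions of this sketch; “Anime” is a hypothetical topic used only as sample data.

```python
from collections import defaultdict


def top_topics_by_country(engagements, n=3):
    """engagements: iterable of (user, country, topic) engagement records.

    Popularity of a topic in a country is taken here as the number of distinct
    users in that country who engaged with content related to the topic.
    """
    engaged_users = defaultdict(set)
    for user, country, topic in engagements:
        engaged_users[(country, topic)].add(user)

    per_country = defaultdict(list)
    for (country, topic), users in engaged_users.items():
        per_country[country].append((len(users), topic))

    # Most engaged-with topics first; ties broken alphabetically.
    return {
        country: [topic for _, topic in sorted(items, key=lambda it: (-it[0], it[1]))[:n]]
        for country, items in per_country.items()
    }


events = [
    ("u1", "US", "Baseball"), ("u2", "US", "Baseball"), ("u1", "US", "News"),
    ("u3", "JP", "Baseball"), ("u3", "JP", "Baseball"),  # same user counted once
    ("u4", "JP", "Anime"), ("u5", "JP", "Anime"),
]
popular = top_topics_by_country(events, n=2)
print(popular["US"], popular["JP"])
```

The same grouping key could be swapped for a state, city, school, or employer to cover the other associations mentioned above.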

2. Topic Recommendations based on Search Terms

In some implementations, the platform determines a likelihood that a user is interested in certain topics based on search terms entered by the user, e.g., in a recent time window. When a user enters a search term to search for content on the platform, the platform provides a topics-to-follow prompt recommending that the user follow one or more topics closely aligned to the search term, if there are any. For example, when a user is searching for a politician, the platform may provide a topics-to-follow prompt enabling the user to follow the topic “Politics”. One or more search terms may align to a single topic, or one single search term may align to multiple topics. A machine learning system may be trained to define the correspondence relationships between search terms and topics using, as training data, historical data of topics and search terms entered by users following the topics, or human curators may manually define the correspondence relationships in accordance with their individual knowledge or research about certain domains.

3. Topic Recommendations from User Home Pages

As described above, the platform enables users to explicitly select topics to follow from their own home pages. The following paragraphs describe methods by which recommendations of topics are made for this use case.

(a) Following Topics from Broad to Narrow

FIGURE 8A illustrates an example user interface for selecting topics to follow. As shown in the figure, the user interface includes a list of icons representing certain topics defined on the platform. Users can click on one or more of the icons to follow the corresponding topics. To fit topics in a small space on a user interface, the platform optionally provides a carousel user interface that users can swipe through to see more topics.

When starting out with topics, users are more likely to follow a broad topic before following narrow topics. On the other hand, narrow topics are more likely to be related to users’ particular interests than broad topics. Thus, if users follow narrow topics, they will likely have a better topic experience, e.g., an increased content engagement rate, and a decreased topic unfollow rate. Accordingly, the platform enables users to select topics to follow from broad topics to sub-topics of the broad topics. Upon determining that users have selected topics to follow, the platform further enables the users to follow topics that are narrower than their selected topics.

FIGURE 8B illustrates an example user interface for topic selection. The user interface enables users to select topics to follow starting from top-level topics to sub-topics. In the figure, once a user has followed a topic, the platform will prompt the user to follow sub-topics of the followed topic.

When the platform has little information about a user’s interests, it may first provide a user interface presenting and enabling the user to select one of the top-level topics in a topic relatedness graph. After a user has followed a topic, the platform presents and enables the user to select one or more of the sub-topics of the topic. If the user further selects to follow a sub-topic, then the platform may present another user interface presenting and enabling the user to select one or more deeper sub-topics of the sub-topic. Accordingly, the platform enables the user to specify a narrowed topic by sequentially selecting a series of topics from broad to narrow.
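By way of a non-limiting illustration, the broad-to-narrow selection flow might be sketched as follows. The `children` mapping, the function names, and the sub-topic “MLB” are hypothetical and used only for this sketch; every topic that has sub-topics, and every top-level topic, is assumed to appear as a key in the mapping.

```python
def top_level_topics(children):
    """Topics with no parent: keys that never appear as anyone's sub-topic."""
    has_parent = {child for kids in children.values() for child in kids}
    return sorted(topic for topic in children if topic not in has_parent)


def next_prompt(children, just_followed, already_followed):
    """Sub-topics to offer after the user follows `just_followed`."""
    return [
        child for child in children.get(just_followed, [])
        if child not in already_followed
    ]


# Parent -> sub-topics, mirroring the Sports / News example in Section I.2.
children = {
    "Sports": ["Baseball", "Basketball", "Sports News"],
    "News": ["Sports News"],
    "Baseball": ["MLB"],
}

followed = set()
print(top_level_topics(children))          # first screen: broad topics
followed.add("Sports")
print(next_prompt(children, "Sports", followed))    # second screen
followed.add("Baseball")
print(next_prompt(children, "Baseball", followed))  # third screen
print(next_prompt(children, "MLB", followed))       # nothing narrower to offer
```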

(b) Identifying Candidate Topics for a User based on a Similarity-based Clusters Algorithm

For experienced users, the platform can identify topics to recommend using following or engagement relationships between users on the platform or other historical data involved with topics. In some implementations, the platform uses an algorithm called Similarity-based Clusters (SimClusters) to identify candidate topics to recommend to users.

The SimClusters algorithm is a clustering algorithm that generates clusters of users. Users in the same cluster have similar properties, e.g., similar followers. The clusters may also be referred to as “communities” or “SimClusters”. Using the SimClusters algorithm, the platform identifies topics related to communities in which a user is likely interested. The communities are used as an embedding space. Further details about the SimClusters algorithm will be described in Section IV.

As described below in this section, the platform identifies candidate topics for a user using the SimClusters algorithm in three steps: 1) computing a user interested-in matrix specifying communities in which users are likely interested by embedding the users into a space of the communities; 2) computing a topic matrix specifying communities with which topics are associated by embedding topics into the space of the communities; and 3) computing relevance scores indicating the relatedness between users and topics from the computed user interested-in matrix and topic matrix in the same space.

The space of communities has dimensions corresponding to the communities specified for a subset of users, e.g., the top 5, 10, or 20 million most-followed users, on the platform. In particular, in the space of communities, each dimension corresponds to one of the specified communities, and the number of dimensions of the space is the same as the number of the communities. Thus, an item in the space can be located by coordinates in the dimensions of the space. The platform embeds an item by transforming the item into a representation, e.g., a vector, in the space of communities. Therefore, by constructing the user interested-in matrix for the users on the platform, the platform has embedded users into representations, e.g., vectors, in the space of the communities.

The platform first computes a user interested-in matrix, for example, as described below in Section IV.1. In determining topics to recommend on a given user’s home page, the platform also computes a topic matrix by embedding topics into the same space of the communities. The platform implements topic embeddings offline or online (also referred to as “in real-time”). The platform may implement a topic embedding offline based on semantic annotations of content items. As described in Section III.1.(b), semantic annotations of content items may include topical annotations, each of which indicates one or more topics a content item is related to.

FIGURE 9 illustrates an example computation of an example topic matrix R. As illustrated in the figure, the platform computes the topic matrix R by taking the cosine similarity between consumers who are interested in a community, e.g., represented by a user interested-in matrix U, and the number of aggregated favorites each consumer has taken on a content item that has a topical annotation, e.g., represented by a matrix T. In some implementations, the platform weights each user’s interested-in vector in the user interested-in matrix U by a decayed value based on how long ago the user liked the content item; more particularly, the more recently a user liked a content item, the larger the weight of the corresponding interested-in vector.

Alternatively, the platform may implement a topic embedding in real-time based on a vector of users who have engaged with a content item related to a topic. In the real-time implementation, a topic embedding corresponds to a dot product between the vector of users who have engaged with a topic, for example, with a time decay, and the normalized interested-in vector for each engager. For the topic embedding, in some implementations, the platform weights each user’s interested-in vector by a decayed value based on how long ago the user engaged with the topic. In particular, new engagers may contribute more than previous engagers, so that their corresponding interested-in vectors may have larger weights.

Given the computed user interested-in matrix and topic matrix in the same space, the platform computes relevance scores indicating relatedness between each user and topics by computing a dot product or cosine similarity of the user interested-in matrix and the topic matrix. For each user, based on the resultant relevance scores between the user and topics, the platform sorts the topics to recommend to the user in descending order to form the ranked list of topics the user is likely interested in.
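By way of illustration only, the scoring and ranking step may be sketched as follows. The function name `rank_topics`, the toy matrices, and the use of NumPy are assumptions made for this example and are not part of the described platform.

```python
import numpy as np

def rank_topics(U, R, top_k=3, use_cosine=True):
    """Rank topics for every user from community-space embeddings.

    U: (num_users, num_communities) user interested-in matrix.
    R: (num_topics, num_communities) topic matrix.
    Returns (top_k topic indices per user, full score matrix).
    """
    U = np.asarray(U, dtype=float)
    R = np.asarray(R, dtype=float)
    if use_cosine:
        # Normalizing rows turns the dot product into cosine similarity.
        U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
        R = R / np.maximum(np.linalg.norm(R, axis=1, keepdims=True), 1e-12)
    scores = U @ R.T                                    # (num_users, num_topics)
    order = np.argsort(-scores, axis=1, kind="stable")  # descending by score
    return order[:, :top_k], scores

# Hypothetical data: 2 users, 3 topics, 3 communities.
U = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.2, 0.8]])
R = np.array([[1.0, 0.0, 0.0],   # topic 0
              [0.0, 0.0, 1.0],   # topic 1
              [0.5, 0.5, 0.0]])  # topic 2
ranked, scores = rank_topics(U, R, top_k=2)
```

Setting `use_cosine=False` in this sketch yields the plain dot product, which favors topics and users with larger embedding norms.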

Users interested in popular clusters can be mapped to topics that are not closely related to their interests. For example, if a user is interested in the “News” cluster, as content associated with the “News” cluster is diverse, the user could get mapped to topics that they might not be very interested in. Therefore, when identifying topics to recommend on user home pages, the platform removes certain very popular clusters, for example, clusters that have more than 5 million users interested in them.

(c) Identifying Candidate Topics for a User based on a Machine Learning Model

In some implementations, the platform uses a machine learning model to identify candidate topics for experienced users. The model uses user-topic follow data as a “ground-truth” training data set, by treating a user who follows a topic as likely being a user who is strongly interested in the topic. The platform may collect training data from other data sources. The data sources for the model may need to be easily accessible and not have an existing ranking, to avoid selection biases.

In some implementations, the training process uses a “leave-one-out” strategy, according to which the training system randomly picks a followed topic from the user as the label and trains with all other available topics as well as other user features to predict the label.

For example, u denotes a user, and T_u denotes the list of topics that the user u follows. The training system randomly picks a topic t_t from T_u, and builds a model f which takes user features X_u and topic features T_u \ {t_t}, which means excluding the topic t_t from the list of followed topics T_u. The model ranks t_t higher than any other topic t_j, where t_j is a topic that is not in the list of followed topics T_u, which may be represented as

$$f(X_u,\ T_u \setminus \{t_t\},\ t_t) > f(X_u,\ T_u \setminus \{t_t\},\ t_j), \quad t_j \notin T_u.$$

The training system may learn the model through a pointwise loss function, e.g., by treating the record with the topic t_t as positive and treating those records with the other topics t_j as negative, or a pairwise loss function, e.g., by constructing pairs. The training system may not treat all the unfollowed topics as negative, as doing so would generate too many negatives and these unfollowed topics are not really negative. For example, a user may not follow a topic due to reasons other than being uninterested in the topic. Thus, even if a user does not follow a topic, the user may still be interested in the topic. To solve this problem, the training system may perform random negative sampling, either uniformly or based on the popularity of the topic.
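By way of illustration only, the construction of pointwise “leave-one-out” training records with uniform random negative sampling may be sketched as follows. The function name `make_training_records`, the record layout, and the example topic identifiers are hypothetical and are chosen only for this example.

```python
import random

def make_training_records(followed, all_topics, num_negatives=2, rng=None):
    """Build pointwise leave-one-out records for one user.

    followed: list of distinct topic ids the user follows (T_u).
    all_topics: list of every topic id on the platform.
    Returns a list of (context_topics, candidate_topic, label) tuples.
    """
    rng = rng or random.Random()
    if len(followed) < 2:
        return []  # no context would remain after holding one topic out
    held_out = rng.choice(followed)                    # t_t, the positive label
    context = [t for t in followed if t != held_out]   # T_u without t_t
    followed_set = set(followed)
    unfollowed = [t for t in all_topics if t not in followed_set]
    # Sample a few negatives uniformly instead of using every unfollowed topic.
    negatives = rng.sample(unfollowed, min(num_negatives, len(unfollowed)))
    records = [(context, held_out, 1)]
    records += [(context, t_j, 0) for t_j in negatives]
    return records

records = make_training_records(
    followed=["nba", "cooking", "jazz"],
    all_topics=["nba", "cooking", "jazz", "chess", "hiking", "opera"],
    num_negatives=2,
    rng=random.Random(0),
)
```

Popularity-based sampling could be substituted in this sketch by replacing `rng.sample` with a weighted draw over the unfollowed topics.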

FIGURES 10A and 10B illustrate two example model architectures. The main difference between the model architectures in FIGURE 10A and FIGURE 10B is whether to treat the held-out topic t_t as input or output. If the training system takes the topic t_t as the input in FIGURE 10A, the model will use binary labels. If the training system takes the topic t_t as the output in FIGURE 10B, the training system needs to construct a softmax layer on the output, for example, with all the unfollowed topics in a normalizer of the softmax layer.

In some implementations, the training system randomly generates a predetermined number N of training records based on the above described “leave-one-out” strategy. For example, the training system sets the number N differently depending on whether a pointwise or pairwise loss function is used. In one implementation, the training system generates the same size of training data for each user, to remove the user-level popularity bias.

The training system trains the model based on various features and labels. For each training record, the training system fetches all the user and topic features that are needed by the model. For a user, the model needs as input a list of features comprehensive enough to represent the user’s interests. These features for the user may include demographic features, features generated using the SimClusters algorithm, e.g., a user interested-in matrix, or information derived from a user-content interaction graph. For a topic, the model may also use features generated using the SimClusters algorithm, e.g., a topic matrix, or information about the topic as input features.

Using the “leave-one-out” strategy, each training record can have a different number of topics as the input, so the training system needs to use a length-invariant pooling method, e.g., average pooling, max pooling, or an attention layer.
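By way of illustration only, length-invariant pooling of a variable number of topic vectors may be sketched as follows. The function name `pool_topic_embeddings` is hypothetical, and the “attention” branch is a simplified, parameter-free stand-in for a learned attention layer, assumed here only to keep the example self-contained.

```python
import numpy as np

def pool_topic_embeddings(topic_vectors, mode="mean"):
    """Collapse a variable-length list of topic vectors into one fixed-size vector."""
    X = np.asarray(topic_vectors, dtype=float)   # shape (num_topics, dim)
    if X.ndim != 2 or X.shape[0] == 0:
        raise ValueError("expected at least one topic vector")
    if mode == "mean":
        return X.mean(axis=0)
    if mode == "max":
        return X.max(axis=0)
    if mode == "attention":
        # Simplified attention: score each topic against the mean vector,
        # then take a softmax-weighted average of the topic vectors.
        query = X.mean(axis=0)
        logits = X @ query
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()
        return weights @ X
    raise ValueError(f"unknown mode: {mode}")

# Records with two and three input topics produce vectors of the same size.
two = pool_topic_embeddings([[1.0, 0.0], [0.0, 1.0]])
three = pool_topic_embeddings([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], mode="max")
attended = pool_topic_embeddings([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], mode="attention")
```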

At prediction time, the model uses the whole set of followed topics and user features as input, and predicts the most promising topics among those the user is not following. For example, to use the model, for each user, the platform considers a list of candidate topics, e.g., all other topics the user has not followed, predicts the scores for these topics, and takes the top K topics that are likely interesting to the user.

(d) Other Methods for Identifying Candidate Topics for a User

In some implementations, the platform recommends to an experienced user topics similar to the topics the user follows or has shown interest in. By way of an example, the platform infers that the user has shown interest in topics if the user has searched for these topics, liked a certain number of content items related to these topics within a short period of time, e.g., 12 hours or two days or a week, or selected interests similar to these topics during onboarding or at another time.

The platform may identify candidate topics similar to an already followed topic based on information about topics or the above described topic relatedness graph. Similar topics may simply be sub-topics of a topic the user has already followed or shown interest in. For example, from the topic relatedness graph, the platform fetches a node corresponding to a topic the user follows, finds ancestors of the node, and then uses the ancestors or their children or grandchildren as similar topics. Optionally, the platform determines the similarity between two topics based on the distance between these two topics in the topic relatedness graph, i.e., the number of steps to traverse from one topic to the other.
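By way of illustration only, the distance between two topics in a topic relatedness graph may be computed with a breadth-first search, as sketched below. The adjacency-dictionary representation, the function name `topic_distance`, and the example topics are assumptions made for this example.

```python
from collections import deque

def topic_distance(graph, source, target):
    """Return the number of steps between two topics, or None if unreachable.

    graph: dict mapping a topic to the topics it is directly related to.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

# Hypothetical graph; each edge is listed in both directions.
graph = {
    "Sports": ["Basketball", "Soccer"],
    "Basketball": ["Sports", "NBA"],
    "Soccer": ["Sports"],
    "NBA": ["Basketball"],
}
steps = topic_distance(graph, "NBA", "Soccer")
```

In this sketch a smaller number of steps indicates a more similar pair of topics.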

In some implementations, the platform learns rules for identifying candidate topics to recommend based on historical activities of other users, e.g., by training a machine learning model. For example, the platform learns that users who follow topic X are likely to follow topic Y. Accordingly, the platform will identify candidate topic Y for users who follow topic X.

In some implementations, the platform also ranks the determined candidate topics and only recommends a predetermined number of top-ranked similar topics to the user, to avoid recommending too many similar topics to the user. The platform computes similarity scores for (topic, topic) pairs periodically, e.g., weekly, or in real-time. In some implementations, the platform takes the maximum similarity score between a candidate topic and all of the followed topics as the similarity score of the candidate topic. For example, if a user has already followed topics T1 and T2, and topic T3 is similar to both topics T1 and T2 with different similarity scores S1 and S2, the platform determines the maximum score for T3, for example, (T3, max(S1, S2)), while ranking. Alternatively, the platform uses the sum of the similarity scores between the candidate topic and all of the followed topics as the similarity score of the candidate topic.
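By way of illustration only, the maximum and sum aggregations may be compared as follows. The function name `score_candidates`, the dictionary layout of the similarity scores, and the score values are hypothetical.

```python
def score_candidates(followed, similarity, aggregate=max, top_n=2):
    """Score each unfollowed topic against the followed topics and rank.

    similarity: dict mapping (followed_topic, candidate_topic) -> score.
    aggregate: max or sum, applied over the scores against followed topics.
    """
    per_candidate = {}
    for (src, cand), s in similarity.items():
        if src in followed and cand not in followed:
            per_candidate.setdefault(cand, []).append(s)
    scored = {cand: aggregate(vals) for cand, vals in per_candidate.items()}
    # Sort by descending score, breaking ties by topic id.
    ranked = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]

similarity = {
    ("T1", "T3"): 0.6, ("T2", "T3"): 0.9,
    ("T1", "T4"): 0.8, ("T2", "T4"): 0.8,
}
by_max = score_candidates({"T1", "T2"}, similarity, aggregate=max)
by_sum = score_candidates({"T1", "T2"}, similarity, aggregate=sum)
```

With these example values, the maximum favors T3, which is very similar to one followed topic, while the sum favors T4, which is moderately similar to both.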

Before presenting candidate topics on user home pages, the platform may filter out candidate topics based on various criteria. For example, if the platform has paused a topic or determined that a topic is unhealthy, the platform will at least temporarily stop recommending the topic to any users. In some implementations, the platform further filters out topics previously presented for selection to, followed by, or opted out of by the user, by recording and storing in a repository information about which topics have been presented to, followed by, or opted out of by users. As a result, the platform does not present certain topics that have already been recommended to users on their home pages.

The platform determines similar topics using one or more of the above described methods and then presents a topics-to-follow prompt with the determined similar topics on the user home page to help the user discover more topics and to collect feedback about the user’s opted-in interests.

4. Topic Recommendations from Users’ Home Timelines

The platform also recommends topics to a user from the user’s home timeline. For example, the platform presents topics-to-follow prompts on users’ home timelines to enable the users to follow one or more suggested topics. Due to the limited space on home timelines, the platform may first identify candidate topics for a given user by one or more of the above described methods in Section II.3, and then filter, score and rank the identified candidate topics to present a limited number of topics on the user’s home timeline.

In some implementations, the platform further scores each candidate topic using a unified scorer, which calculates a score indicating how likely the user is to be interested in the candidate topic. The platform therefore can select some of the candidate topics for the user based on their scores. As the platform will present only a limited number of topic recommendations, it is desirable to show topics that are not too similar to each other. Certain randomness may be added to the recommendations for exploration. For example, the platform may use a machine learning model called a Determinantal Point Process (DPP) to determine diversified topics. The DPP is a probabilistic model of repulsion that can be used to diversify sets of recommended items. A DPP model can efficiently score an entire list of candidate topics rather than scoring each topic individually, allowing the platform to better take into account topic correlations. DPP is described, for example, in Wilhelm, “Practical Diversified Recommendations on YouTube with Determinantal Point Processes”, Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM ’18), Pages 2165-2173.
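By way of illustration only, one simple way to select a diversified subset with a DPP kernel is a greedy search that repeatedly adds the topic that most increases the determinant of the kernel submatrix. This greedy procedure, the function name `greedy_dpp`, and the example quality and similarity values are assumptions made for this example; the sketch is not the method of the cited reference.

```python
import numpy as np

def greedy_dpp(quality, similarity, k):
    """Greedily pick up to k items that approximately maximize det(L_S).

    quality: (n,) positive relevance scores, e.g., from a unified scorer.
    similarity: (n, n) symmetric positive semi-definite matrix, ones on the diagonal.
    """
    q = np.asarray(quality, dtype=float)
    S = np.asarray(similarity, dtype=float)
    L = q[:, None] * S * q[None, :]      # kernel L_ij = q_i * S_ij * q_j
    selected = []
    for _ in range(min(k, len(q))):
        best, best_logdet = None, -np.inf
        for i in range(len(q)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:
            break  # every remaining item is redundant with the selection
        selected.append(best)
    return selected

# Topics 0 and 1 are both highly scored but nearly identical to each other.
quality = [1.0, 0.95, 0.6]
similarity = [[1.0, 0.98, 0.1],
              [0.98, 1.0, 0.1],
              [0.1, 0.1, 1.0]]
picked = greedy_dpp(quality, similarity, k=2)
```

In this example the second pick is topic 2 rather than the higher-scored topic 1, because topic 1 is nearly a duplicate of topic 0 and therefore adds little to the determinant.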

To avoid an annoying repetition of topic recommendations on a user’s home timeline, the platform records and stores, in a repository, information about which topics have been presented to users and prevents repetition of topics within a predetermined time window.

In some implementations, the platform uses a top-level fatigue module to avoid recommending the same topics repeatedly from various content presentation interfaces, e.g., user home pages, other users’ profile pages or home timelines. The platform records and stores in a common repository which topics have already been presented to users. The common repository is shared by different applications on the platform for recommending topics. In other implementations, each application of the platform writes the impressions to its own impression logs. The platform aggregates the different impression logs from multiple applications into a unified data log about which topics have been presented to users, so that the platform can query the different impression logs uniformly.

In some implementations, the fatigue module determines how often a topic recommendation module is served to users in accordance with fatigue rules. Example fatigue rules include that, for new users or users who do not log in regularly, the platform serves the topic recommendation module once every twenty-four hours, while, for active users who regularly log in and engage with other users or content, the platform serves the module once every seven days.

In other implementations, the platform serves more topic recommendations to users who are actively following topics. For example, the platform serves the topic recommendation module once every day to active users who have followed a topic within the last week, while, if an active user has not followed a topic within the past week, the platform serves the module once every seven days.

In some implementations, in applying impression-based fatigue, the platform uses the same fatigue durations, e.g., twenty-four hours or seven days, but serves the topic recommendation module again to a user if it has not been viewed by the user. As a safeguard to avoid serving the module too frequently, the platform may also implement a back-off mechanism, e.g., not to recommend topics that have been viewed by the same user during the past day or not to recommend topics that have been served to the same user within the past four hours.
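By way of illustration only, the example fatigue rules and the impression-based back-off may be sketched as follows. The function names `fatigue_interval` and `should_serve` are hypothetical, and combining rules drawn from different example implementations into one policy is an assumption made for this example.

```python
from datetime import datetime, timedelta

def fatigue_interval(is_active_user, followed_topic_last_week):
    """Return the minimum gap between servings of the recommendation module."""
    if not is_active_user:
        return timedelta(hours=24)   # new users or users who log in irregularly
    if followed_topic_last_week:
        return timedelta(days=1)     # active users who recently followed a topic
    return timedelta(days=7)         # other active users

def should_serve(now, last_served, last_viewed, interval,
                 back_off=timedelta(hours=4)):
    """Decide whether to serve the module again under impression-based fatigue."""
    if last_served is None:
        return True                  # never served before
    if now - last_served < back_off:
        return False                 # safeguard against serving too frequently
    if last_viewed is None or last_viewed < last_served:
        return True                  # the last serving was never viewed
    return now - last_served >= interval

now = datetime(2021, 12, 29, 12, 0)
week = fatigue_interval(is_active_user=True, followed_topic_last_week=False)
unviewed_recent = should_serve(now, now - timedelta(hours=2), None, week)
unviewed_older = should_serve(now, now - timedelta(hours=5), None, week)
```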

5. Topic Recommendations from Other Users’ Profile Pages

When a user lands on another user’s profile page, e.g., a celebrity’s profile page, the platform may recommend topics to the user based on the attributes of the other user, or the interests of the followers of the other user. The other user is referred to as a “profile page owner”.

FIGURE 11 illustrates an example user profile page. The user profile page includes a topics-to-follow section, which presents a list of topics that are suggestions for users who land on the profile page to follow. In the figure, when a user lands on the profile page of “User 1”, the platform presents certain topics followed by the followers of User 1 in the topics-to-follow section. Even if a user landing on a user profile page is not logged in, the platform may still display a topics-to-follow section on the user profile page, as the platform can determine topics to present in the topics-to-follow section without using information about the viewing user, or the platform has information about the user from cookies on a device the user uses when logged in.

In some implementations, the platform selects topics to recommend on user profile pages from candidate topics identified using one or more of the following methods:

(a) Identifying Candidate Topics based on Attributes of the Profile Page Owner

With this method, the platform recommends topics that the profile page owner is known for, or topics similar to the topics that the profile page owner is known for, when the owner is an authoritative producer on the topics. This method uses the fact that the user has visited the profile page as a signal that the user is likely to be interested in the known-for topics or similar topics. The platform will thus tend to recommend topics of which the profile page owner is an authoritative producer, without regard to topics the owner is interested in.

(b) Identifying Candidate Topics based on Topics Followed by the Followers of the Profile Page Owner

With this method, the platform recommends topics according to the topics followed by the followers of the profile page owner. These recommendations are made on the assumption that users who are interested in the profile page owner and land on that user’s profile page will be interested in the topics followed by the followers of the profile page owner. Note that these topics will not, in general, be the same topics as followed by the profile page owner. One way of determining top topics followed by the followers of a profile page owner is simply counting the number of users who follow both the profile page owner and each topic. In some implementations, this method has a minimum score threshold to filter out potentially irrelevant (producer, topic) associations. Using this score threshold ensures that the topics recommended are highly relevant to a profile. In other implementations, no such score threshold is applied, and topics with lower scores are recommended and generally found to be still relevant to the viewers of the profile. This increases the number of topics in the topics-to-follow section on the profile page. In either implementation, any unused space in the topics-to-follow section on the profile page is optionally backfilled with other topics relevant to the viewing user.

However, this method may rank generally popular topics at the top for many profile page owners. Therefore, it can be advantageous to normalize the popularity effects, for example, by performing an Expectation-Maximization (EM) method to separate the popular topics out. EM algorithms are described, for example, in Roche, EM algorithm and variants: an informal tutorial (2012), https://arxiv.org/abs/1105.1476v2; Borman, The Expectation Maximization Algorithm - A short tutorial (2004), http://www.seanborman.com/publications/-EM_algorithm.pdf; Moon, The Expectation-Maximization Algorithm, IEEE Signal Processing Magazine (Volume 13, Issue 6, Nov. 1996) pp. 47-60.

In performing an EM method, the platform takes two multinomial distributions as input: a background model and a domain model, to determine top topics followed by the followers of a profile page owner. The background model represents a general topic distribution for all profile page owners, and the domain model represents a topic distribution specific to the domain, i.e., the profile page owner. The background model is a global model shared by all profile page owners, e.g., representing the probability of following a specific topic for any user, while the domain model is for each individual profile page owner, e.g., representing the probability of following a specific topic for a user who also follows the profile page owner. The background model is considered as being known and the domain model is the one to be estimated. The platform may iteratively run the EM algorithm to find the best estimate for the domain model, and then serve the topics with the highest weights in the domain model as top topic recommendations for the domain, i.e., the profile page owner.
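By way of illustration only, estimating a domain model against a known background model may be sketched with a two-component mixture, in which each observed follow is assumed to be drawn from the background model with a fixed probability `lam` and otherwise from the domain model. The mixture formulation, the fixed `lam`, the function name `estimate_domain_model`, and the example counts are assumptions made for this example.

```python
import numpy as np

def estimate_domain_model(counts, background, lam=0.5, iters=200):
    """Estimate the domain topic distribution with EM.

    counts: (n,) number of the owner's followers who follow each topic.
    background: (n,) known background probability of following each topic.
    lam: assumed fixed mixing weight of the background model.
    """
    counts = np.asarray(counts, dtype=float)
    p_b = np.asarray(background, dtype=float)
    p_d = counts / counts.sum()            # initialize from raw follow counts
    for _ in range(iters):
        # E-step: probability that a follow of each topic came from the domain model.
        mix = lam * p_b + (1.0 - lam) * p_d
        z = np.divide((1.0 - lam) * p_d, mix, out=np.zeros_like(mix), where=mix > 0)
        # M-step: re-estimate the domain model from the domain-attributed counts.
        weighted = counts * z
        p_d = weighted / weighted.sum()
    return p_d

topics = ["News", "Music", "K-pop"]
counts = [50, 30, 20]            # raw counts rank "News" first
background = [0.7, 0.2, 0.1]     # "News" is popular among all users
domain = estimate_domain_model(counts, background)
top_topic = topics[int(np.argmax(domain))]
```

In this example the generally popular “News” topic is discounted, so a topic more specific to the profile page owner receives the highest domain weight.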

(c) Identifying Candidate Topics Using the SimClusters Algorithm

With this method, the platform recommends topics to experienced users who land on other users’ profile pages by embedding both a subset of users, e.g., the top 5, 10, or 20 million most-followed producers, and topics into representations in a space of communities using the SimClusters algorithm.

As described below, the platform determines topics to recommend on user profile pages, using an example method including three steps: 1) computing a producer matrix V by embedding producers into a space of communities; 2) computing a topic matrix R by embedding topics into the space of communities; and 3) computing relevance scores between producers and topics from the computed producer matrix V and the topic matrix R in the same space.

The platform first represents producers, e.g., a predetermined number of the most-followed producers, as producer vectors in the space of communities, or more particularly, maps producers to the communities for which they are authoritative producers. The platform computes the producer embedding based on 1) users who follow the producer or possibly engage with the producer and 2) which communities these users are interested in. For example, a producer embedding is an aggregation or averaging of all of the users who follow the producer and which communities these users are interested in. That is, topics to be recommended on a user profile page are topics associated with the communities that the profile page owner is known for, but not necessarily interested in. For instance, if a user is a producer on a topic X but interested in a topic Y, then users who come to the user’s profile page are assumed to be more interested in X than Y.

The embedding of producers is similar to the computation of a known-for matrix for these producers as described below in Section IV.(b), but it outputs a matrix that is denser than the known-for matrix. The known-for matrix is a maximally sparse matrix, in which each producer can only be known for a single community. Although this maximally sparse matrix is useful from a computational perspective, it may not sufficiently capture the real relationships between users and communities, given that each user posts content items about many different topics and may be known for different communities. A producer embedding can be used to capture richer relationships between producers and communities. For example, the above described known-for matrix specifies that an account for a politician is only known for the “Politics” community, while an embedding vector of the politician’s account may specify that the account is known for multiple communities, e.g., “Politics”, “Business” and “Finance”.

FIGURE 12 illustrates an example computation of a producer matrix V, in which the platform calculates cosine similarity between a matrix A representing a user-user graph and the user interested-in matrix U. As described in Section IV.1, the user-user graph may be a directed graph representing following or engagement relationships between consumers and producers on the platform.

The platform further computes the topic matrix R, representing which communities each topic is associated with, by embedding topics into the space of communities, as described above in Section II.3.(b).

After that, the platform computes the relevance scores between producers and topics by computing the cosine similarity or dot product of the computed producer matrix V and topic matrix R. The platform may use cosine similarity, instead of the dot product, to avoid the above described popularity effects.
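By way of illustration only, computing a producer matrix from a follow graph and then scoring (producer, topic) pairs may be sketched as follows. The helper name `cosine_rows`, the matrix orientations, and the toy data are assumptions made for this example.

```python
import numpy as np

def cosine_rows(X, Y):
    """Cosine similarity between every row of X and every row of Y."""
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    Yn = Y / np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), 1e-12)
    return Xn @ Yn.T

# A[i, p] = 1 if consumer i follows producer p (hypothetical data).
A = np.array([[1, 0],
              [1, 0],
              [0, 1]], dtype=float)
# U[i, c] = consumer i's interest in community c.
U = np.array([[1.0, 0.0],
              [0.8, 0.2],
              [0.0, 1.0]])
# R[t, c] = association of topic t with community c.
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])

V = cosine_rows(A.T, U.T)      # producer matrix: producers x communities
scores = cosine_rows(V, R)     # relevance scores: producers x topics
best_topic = scores.argmax(axis=1)
```

Because cosine similarity normalizes each vector, a producer with many followers does not receive larger scores merely because of follower count.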

In some implementations, the platform computes the relevance scores of (producer, topic) pairs in multiple languages that the topic-related content is available in, as users in different countries may be interested in the same topic. Thus, the topics along with relevance scores based on the viewer’s language can be returned. Therefore, in generating the topic matrix, the platform embeds each (topic, language) pair, instead of each topic, into the communities that the topic is associated with in that language. Alternatively, the platform embeds each (topic, country) pair or each (topic, language, country) pair into the communities that this topic is associated with in that country, or both in that language and in that country. After that, the platform further computes the relevance scores of (producer, (topic, language)) pairs, (producer, (topic, country)) pairs or (producer, (topic, language, country)) pairs by computing the cosine similarity or dot product of the producer matrix and the topic matrix. Thus, the computed relevance scores provide the set of topics, per language, per country, or per combination of language and country, that a user is a producer on.

Given the computed relevance scores, the platform determines the list of topics to recommend on user profile pages according to the language of the viewers, i.e., visitors to the pages. For example, if the viewer’s language is English, then the topics and messages shown on the user profile pages will be in English. Even if the profile page owner is a Korean singer, the platform computes the set of topics the owner is known for based on the viewer’s language, by embedding (topic, language) pairs as described above. This will ensure that two viewers with the same language get the same set of results. If the viewer’s language is Korean, the ranking of topics for the Korean singer might be different, as the topics might have a different embedding in the Korean language.

Based on the computed relevance scores of (producer, topic) pairs, the platform sorts the topics to recommend on a user profile page in descending order to form the ranked list of candidate topics.

In some implementations, before placing the topics on user profile pages, the platform further filters out candidate topics based on various criteria. The platform may filter out candidate topics that a viewer landing on the page has already followed, marked as not interested, or opted out of, or candidate topics that the profile page owner has opted out of. In contrast, the platform may not filter out candidate topics that a profile page owner has followed or marked as not interested, as what matters is whether the viewer, not the profile page owner, has followed or may be interested in the candidate topics. Further, the platform may not recommend topics that are paused or unhealthy. Paused topics are topics that the platform or an administrator on the platform has paused, e.g., due to being out of date or lacking a sufficient amount of related content items within a predetermined time frame, or while some issue with content identified as related is being investigated. Unhealthy topics are topics that the platform has determined attract, as related content, too many content items that are flagged as being toxic, not safe for work, abusive, or otherwise unhealthy.

6. Topic Recommendations from Topic Landing Pages

Users can use topic landing pages to look for content related to their interests. A topic landing page is a home page for a topic, and includes various kinds of content related to the topic. When users land on a landing page of a given topic, the platform generally recommends other topics similar to or related to the topic, for example, sub-topics of the topic, to these users.

In some implementations, the platform computes the top topics similar to the topic using the attributes or keywords of topics. The platform runs this computation periodically, e.g., weekly, as a similarity score for a (topic, topic) pair may be considered static for short periods.

In other implementations, the platform determines topics related to the given topic according to the above described topic relatedness graph. For example, from a topic relatedness graph, the platform finds a node corresponding to the topic of a topic landing page, and then finds other nodes in the graph related to the located node. The relatedness between a topic and one of its related topics may be scored based on the distance between the topic and the related topic, i.e., the number of steps to traverse from the topic to the related topic.

III. Identifying Content based on Topics

By virtue of the established new relationships in connection with topics and information about topics, the platform can present content that is likely to be interesting to a user because it is related to a topic. In some implementations, the platform determines content to be presented to a user at a given time from different content presentation interfaces based on information about topics, structure of topics, or relationships among topics, content and users, using one or more of the methods described below.

1. Methods for Determining Candidate Content Related to Topics

(a) Identifying Content Related to a Topic Using the SimClusters Algorithm

In some implementations, the platform determines whether content items are related to a topic using the SimClusters algorithm, i.e., by embedding both content items and topics into the same high dimensional space, and then computing a relevance score indicating the relatedness between each content item and each topic by calculating the relatedness between vectors in the space corresponding to the content item and topic respectively.

The platform assigns a subset of its users, e.g., the top 5, 10, or 20 million most-followed users, to communities (or clusters) based on a similarity graph for the subset of users, and then computes a user interested-in matrix which specifies the communities in which each user on the platform is interested. In this way, each user on the platform is represented as a user vector in a space of the communities. After that, for each candidate content item, the platform further computes a content-item embedding vector which contains the top communities that the content item is trending in.

The platform may compute the embedding vector of the content item by summing up user interested-in vectors of users who have shown interest in the content item, e.g., users who have engaged with the item. Each user interested-in vector can contribute to the sum differently, based on the requirements of different use cases. For example, the platform computes a weight between a content item t and a community c by the following formula:

$$W_{t,c} = \sum_{u} U_{u,c} \, E_{t,u}$$

wherein U_{u,c} represents the user u’s interested-in weight for the community c, and E_{t,u} represents a weight of the user u’s engagement with the content item t, and it is 0 if the user u did not engage with the content item t. For example, for a message embedding, each user’s interested-in vector is weighted by the decayed value based on how long ago the user engaged with the message. New engagers may contribute more than previous engagers, so that their corresponding interested-in vectors may have larger weights.

According to the above formula, W_{t,c} is a dot product of two vectors: U_{:,c} and E_{t,:}, wherein U_{:,c} represents the c-th column of the matrix U, and E_{t,:} represents the t-th row of the matrix E. It is actually computing the intersection of two sets of users: 1) the content item t’s engagers and 2) users who are interested in the community c. The platform may use different weights to weight users, for example, using the interested-in weights, or using the decayed value of engagement time.
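By way of illustration only, a content-item embedding computed as a time-decayed sum of the engagers’ interested-in vectors may be sketched as follows. The exponential half-life decay, the function name `content_item_embedding`, and the example data are assumptions made for this example.

```python
import numpy as np

def content_item_embedding(U, engagements, now, half_life_hours=24.0):
    """Sum the interested-in vectors of a content item's engagers with time decay.

    U: (num_users, num_communities) user interested-in matrix.
    engagements: list of (user_index, engagement_time_in_hours) for one item.
    now: current time in hours on the same clock as the engagement times.
    """
    U = np.asarray(U, dtype=float)
    w = np.zeros(U.shape[1])
    for user, t_hours in engagements:
        age = max(now - t_hours, 0.0)
        decay = 0.5 ** (age / half_life_hours)   # recent engagers weigh more
        w += decay * U[user]
    return w   # users who did not engage contribute nothing

U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
# User 0 engaged 24 hours ago; user 1 engaged just now; user 2 did not engage.
embedding = content_item_embedding(U, [(0, 0.0), (1, 24.0)], now=24.0)
```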

Alternatively, while computing a content-item embedding, the platform may weight each user’s vector based on an interaction score between the user and a content item. An interaction score is based on the historical interaction between users and content items, e.g., an aggregation of (user, content-item) engagements on the platform, which may be represented in a user-content interaction graph. For example, for each user engagement, e.g., like, re-post, reply, or click, between a user and a message, the platform aggregates the edge between the user and the message.

In this way, the platform constructs representations, e.g., content-item vectors, in the space of the communities for various kinds of content items on the platform. That is, the platform constructs a content-item matrix including the content-item vectors. The platform may compute the embedding vectors of one kind of content items, e.g., messages, in real-time when the lifecycle of these content items is too short to be computed in a batch setting. The platform may compute the embedding vectors of another kind of content items by batch processing if the lifecycle of the other kind of content items is relatively long.

In determining content related to a given topic, the platform computes a topic matrix by assigning topics to communities, as described above in Section II.3.(b). Given the calculated content-item matrix and topic matrix, the platform computes relevance scores indicating the relatedness between content items and each topic by calculating a dot product or cosine similarity of the content-item matrix and topic matrix. Based on the relevance scores between content items and topics, the platform sorts the content items related to a topic in descending order to form the ranked list of content items for the topic. The platform presents a predetermined number of top-ranked content items on the landing page of the topic. In some implementations, the platform filters the ranked list as described above to limit fatigue on the viewing user, to avoid presenting too many similar topics, and so on.
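The relevance scoring described above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the function name `rank_items_for_topic`, the toy matrices, and the choice of cosine similarity over a plain dot product are all assumptions made for the example.

```python
import numpy as np

def rank_items_for_topic(item_matrix, topic_matrix, topic_idx, top_n):
    """Return indices of the top_n content items for one topic, best first.

    item_matrix:  (num_items, num_communities) content-item vectors
    topic_matrix: (num_topics, num_communities) topic vectors
    """
    # L2-normalize rows so that the matrix product yields cosine similarities.
    item_norm = item_matrix / np.maximum(
        np.linalg.norm(item_matrix, axis=1, keepdims=True), 1e-12)
    topic_norm = topic_matrix / np.maximum(
        np.linalg.norm(topic_matrix, axis=1, keepdims=True), 1e-12)
    scores = item_norm @ topic_norm.T  # shape (num_items, num_topics)
    # Sort the items for the requested topic in descending order of score.
    order = np.argsort(-scores[:, topic_idx], kind="stable")
    return order[:top_n].tolist()

# Toy data: three content items and one topic in a 3-community space.
items = np.array([[1.0, 0.0, 0.0],
                  [0.5, 0.5, 0.0],
                  [0.0, 0.0, 1.0]])
topics = np.array([[1.0, 0.0, 0.0]])
top_items = rank_items_for_topic(items, topics, topic_idx=0, top_n=2)
```

Any further filtering of the ranked list, e.g., to limit fatigue, would be applied to the returned indices.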

(b) Determining Whether Content Is about a Topic using Topical Annotations

As described above in Section I.1, the platform stores topics together with information about the topics, which includes not only attributes of the topics, but also keywords defining what the topic is about or relevant to.

Similarly, in some implementations, the platform also classifies content items semantically, by annotating each content item with keywords or hashtags that define what the content item is about or relevant to. For example, the platform annotates a content item with locale information that provides a general indicator of which geographic audience will be interested in the item. The platform may use the locale information of a content item to present the content item to appropriate users.

In some implementations, the platform annotates a content item with a topical annotation that indicates the content item is related to a topic, e.g., with an identifier of a topic. The “topical annotation” may also be referred to as a “topical keyword” of the content.

Human curators can annotate a content item with one or more topical annotations based on their individual research or knowledge of topics. Human curators may be data science experts or domain experts. Alternatively, the platform may annotate a content item according to one or more predefined rules. An example rule defines that, if a content item contains the text or keywords the platform pre-defined for a given topic, the platform will label the content item to be related to the topic.
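A keyword-matching rule of the kind described above might look like the following sketch. The topic identifiers, the keyword lists, and the function name `annotate` are hypothetical and chosen only for illustration; a production rule engine would likely be more elaborate.

```python
import re

# Hypothetical keyword lists, keyed by hypothetical topic identifiers.
TOPIC_KEYWORDS = {
    "topic:basketball": {"basketball", "nba", "slam dunk"},
    "topic:cooking": {"recipe", "baking"},
}

def annotate(text, topic_keywords=TOPIC_KEYWORDS):
    """Return the sorted topic identifiers whose keywords appear in the text."""
    # Lowercase and tokenize so that matching ignores case and punctuation.
    normalized = " ".join(re.findall(r"[a-z0-9#]+", text.lower()))
    padded = f" {normalized} "
    # Padding with spaces makes each keyword match on whole-word boundaries.
    return sorted(
        topic for topic, keywords in topic_keywords.items()
        if any(f" {keyword} " in padded for keyword in keywords)
    )

annotations = annotate("What a slam dunk in tonight's NBA game!")
```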

Using the topical annotations of candidate content determined for a topic, the platform can determine whether the candidate content is actually about the topic. While this annotation-based method is an effective way to categorize the content for some topics, it may not always be effective for the platform to find the best content about other topics. For example, for the “New York City” topic, many of the most related content items do not contain any obvious text identifying the topic.

(c) Identifying Content likely Interesting to a Given User by Unified Scoring

In some implementations, the platform selects candidate content items for a given user according to unified scores of content items indicating how likely the user is to engage with the content items. The platform may score content items using a machine learning model, which takes into consideration one or more of the relatedness between the viewing user and authors of content items, the relatedness between topics and content items, or the relatedness between users and topics.

A training system trains the model based on various features and labels. Each training record includes the content-item features, user features, and topic features that are needed by the model. For a user, the model uses a list of features that represent the user’s interest as input. These user features may include demographic features, information derived from a user-content interaction graph, or SimClusters features, e.g., a user interested-in vector for the user. For a content item, the model uses features that are aggregated over time about the content item, e.g., features about the viewers and author of the content item. For a topic, the model may use SimClusters features, e.g., a topic vector indicating which clusters (or communities) the topic is classified to, and other information about the topic, e.g., annotation keywords, attributes, and popularity, as input topic features.

At prediction time, the model uses the above described user features, content-item features and topic features as input, and predicts a score for each content item which estimates how much the user will like the content item. The platform ranks and selects candidate content items for the user based on the predicted scores of these items, subject to filtering.

(d) Identifying Content for a Topic by Identifying Authoritative Producers on the Topic

Some users who follow a topic intend to stay informed about the topic broadly, rather than participating in niche community discussions. Followers of a topic may want content produced by experts or influencers of the topic. A consistent set of producers who dependably post content about the topic are likely to post the best content for the topic. It appears often to be true that the majority of interesting topical content comes from a minority of content producers, e.g., at the center of every interest community is a small group of influencers who create, say, 95% of the good content.

In some implementations, therefore, the platform identifies user accounts that are “authoritative producers” on the topic.

FIGURE 13 illustrates an example workflow 1300 for identifying authoritative producers on topics. As shown in the figure and described below, the platform may identify authoritative producers on topics by:

1) computing pairwise similarity between the top N most-followed user accounts in accordance with pointwise mutual information (PMI) of their incoming follows, i.e., users who follow them, to generate a PMI matrix (1310);

2) factorizing the computed PMI matrix to derive user interest vectors (1320);

3) projecting the derived user interest vectors into two dimensions using a UMAP (Uniform Manifold Approximation and Projection) algorithm to derive user clusters (1330); and

4) displaying the derived user clusters and receiving input by which human curators identify key topics, and select users known for each topic (1340).

The platform groups each set of related users into the same community by computing the PMI between each pair of the top N most-followed users, i.e., the PMI of a single user following both users of the pair (1310). The top N most-followed users may also be referred to as the top N users. PMI is an information-theoretic association measure between a pair of discrete outcomes x and y, for example, defined as:

$$\mathrm{PMI}(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}$$

PMI is described, for example, in Levy, “Neural word embedding as implicit matrix factorization”, Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14) - Volume 2, December 2014, pages 2177-2185.

Prior to computing the PMIs, the platform filters a user-user graph to derive the top N most-followed users. The platform sums up the incoming follows from each user to identify the top N, e.g., 100,000, 150,000, or 200,000, most-followed users, so that the user-user graph can be cropped to include only the top N users. The platform further randomly subsamples the outgoing follows of each of the top N users to no more than a threshold number of users, e.g., 150, 200, or 300. This is done to avoid any individual user having too large of an effect on the embedding of users as well as making the PMI calculation tractable.
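The cropping and subsampling step can be sketched as below. This is an illustrative reading of the paragraph above, under stated assumptions: the graph is given as a list of unique `(follower, followee)` pairs, the cropped graph keeps only edges whose two endpoints are both among the top N accounts, and the function name `crop_and_subsample` is hypothetical.

```python
import random
from collections import Counter, defaultdict

def crop_and_subsample(edges, top_n, max_out, seed=0):
    """Keep the top_n most-followed accounts and cap each account's out-follows.

    edges: iterable of unique (follower, followee) pairs.
    Returns a dict mapping each follower to a sorted list of sampled followees.
    """
    edges = list(edges)
    # Count incoming follows to find the most-followed accounts.
    in_degree = Counter(followee for _, followee in edges)
    top_users = {user for user, _ in in_degree.most_common(top_n)}
    # Crop the graph so that it includes only the top accounts.
    outgoing = defaultdict(list)
    for follower, followee in edges:
        if follower in top_users and followee in top_users:
            outgoing[follower].append(followee)
    # Randomly subsample each account's outgoing follows to at most max_out.
    rng = random.Random(seed)
    return {
        follower: sorted(rng.sample(followees, min(max_out, len(followees))))
        for follower, followees in outgoing.items()
    }

edges = [("a", "b"), ("c", "b"), ("d", "b"), ("b", "a"), ("c", "a"),
         ("a", "c"), ("b", "c"), ("d", "c"), ("a", "d")]
sampled = crop_and_subsample(edges, top_n=3, max_out=1)
```

Capping the out-follows bounds the number of account pairs each follower contributes to, which is what keeps the later pairwise computation tractable.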

The platform then uses the randomly subsampled graph of the top N users to compute a PMI matrix. Given two users A and B, the platform computes the PMI as below:

where P(A), P(B), and P(A&B) represent the probabilities that a randomly selected user follows A, follows B, and follows both A and B, respectively, and SE[P(A&B)] represents the standard error of the point estimate of P(A&B). The estimate of P(A&B) is often very noisy, so subtracting a couple of standard errors, e.g., the subtraction of log(5) in the above equation, may clear out some of the noise. The computed PMI matrix is an N x N real-valued matrix.

Optionally, the platform sets all negative values in the matrix to zero to produce a Positive PMI (PPMI) matrix. The intuition behind ignoring negative values may be that humans can easily think of positive similarity but find it harder to perceive negative similarity. This suggests that the perceived similarity of two users is more influenced by the positive associations than by the negative associations. Setting negative values to zero also introduces a fair amount of sparsity.

The platform factorizes the generated PMI matrix, e.g., the PPMI matrix, by a Singular Value Decomposition (SVD) to k dimensions, to produce an N x k matrix U of singular vectors and a k x k diagonal matrix S of singular values (1320). Suitable values of k are 500, 800, 1000, or 1500.

Given the generated matrix U and diagonal matrix S, the platform scales the singular vectors in the matrix U by the square root of the singular values of the diagonal matrix S by:

$$U_{\mathrm{scaled}} = U \cdot \sqrt{S}$$

to generate a U_scaled matrix. The U_scaled matrix is an N x k matrix and $U_{\mathrm{scaled}}^{T}$ is its transpose (a k x N matrix), so the product $U_{\mathrm{scaled}} \cdot U_{\mathrm{scaled}}^{T}$ is an N x N matrix, representing the root mean square error (RMSE) reconstruction of the N x N PPMI matrix. The generated N x k matrix is the result of embedding the user-user graph, and it is a matrix including embedding vectors, i.e., user interest vectors.

In summary, the platform implements the embedding of the user-user graph by: 1) filtering users in the graph to identify the top N most-followed users, 2) randomly subsampling the outgoing follows of each of the top N most-followed users to no more than a threshold number, 3) computing a PMI matrix, e.g., a PPMI matrix, of incoming follows for each pair of top users, 4) factorizing the computed PMI matrix to k dimensions, to produce an N x k matrix U and a k x k diagonal matrix S, and 5) scaling the singular vectors in the matrix U by the square root of the singular values of the diagonal matrix S.
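Steps 3) to 5) of the summary above can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions: the input is a hypothetical binary matrix `follows` whose rows are sampled followers and whose columns are the top accounts, the sketch uses the plain PMI and omits the standard-error adjustment, and a dense SVD is used although a sparse, truncated solver would be more realistic at N in the hundreds of thousands.

```python
import numpy as np

def embed_accounts(follows, k):
    """Return (ppmi, u_scaled) for a binary follower-by-account matrix.

    follows: (num_followers, N) array with 1.0 where a follower follows an account.
    k:       number of embedding dimensions to keep.
    """
    num_followers = follows.shape[0]
    p_single = follows.mean(axis=0)                  # P(A) for each account
    p_joint = (follows.T @ follows) / num_followers  # P(A&B) for each pair
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_joint / np.outer(p_single, p_single))
    # PPMI: negative, infinite, and undefined entries are set to zero.
    ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)
    # Factorize and keep the k leading singular vectors and values.
    u, s, _ = np.linalg.svd(ppmi)
    # Scale each singular vector by the square root of its singular value.
    u_scaled = u[:, :k] * np.sqrt(s[:k])
    return ppmi, u_scaled

follows = np.array([[1, 1, 0, 0],
                    [1, 1, 0, 0],
                    [1, 0, 0, 0],
                    [0, 0, 1, 1],
                    [0, 0, 1, 1],
                    [0, 1, 1, 0]], dtype=float)
ppmi, embedding = embed_accounts(follows, k=2)
```

Each row of `embedding` is then the interest vector of one of the N accounts.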

From the above generated user interest vectors, the platform groups a set of related users into a cluster or community (1330), for example, by performing a 2D projection to project these user interest vectors into two dimensions. For example, the platform implements the 2D projection by performing a UMAP on the user interest vectors. These related users represent the authoritative producers associated with a cluster or community, which may correspond to an interest.

Optionally, prior to performing the 2D projection, the platform may normalize the user interest vectors in the U_scaled matrix for each user to have a unit length. The absolute values of the PMIs are naturally smaller for large user accounts, e.g., users having a large number of followers. Thus, without normalization, the user interest vectors naturally have larger magnitudes for small accounts, such that, in the Euclidean space, all of the large accounts will be in a small blob around the origin. In contrast, when the user interest vectors are normalized, the 2D projection of these user interest vectors will be more differentiated. In addition, despite length normalizing of the user interest vectors, using a Euclidean distance metric for the projection may work slightly better than using a cosine distance.

In some implementations, human curators can select authoritative producers on certain topics by using an interactive tool that visualizes the above derived user clusters, along with the ability to see which user accounts are popular for each cluster.

FIGURE 14 illustrates an example user interface of such an interactive tool. The tool allows human curators to quickly and easily browse, select, visualize and import sets of producers, to define followable topics, and to associate producers with topics. The user interface includes a user account section 1410 displaying a group of dots each representing a user account, and a producer section 1420 displaying a set of icons each representing one of the top producers. In particular, the interactive tool interactively displays (1340) to a human curator a view of clusters of interests, and receives input (1340) by which the curator identifies one or more topics that a cluster is about, identifies clusters of top producers, labels sets of top producers on a particular topic, or exports one or more of these signals to a rule engine that determines whether content is about a topic. The identified producers on each topic may also be exported to a topic annotation system that annotates topics with the identified producers.

Optionally, the interactive tool may also support selecting a particular set of users and then expanding and redoing the UMAP method on the fly. Thus, the tool enables a curator to drill down into a local region, thereby making it easier to view small structures of the interest space.

Optionally, the feature of drilling down into a local region may be implemented other than with the above described 2D projection. For example, for each of the selected users, the platform finds a certain number, e.g., 40, 50 or 60, of its nearest neighbors by cosine similarity between users using the original embedding of the user-user graph before being normalized, and unions the selected user and all of its nearest users into the set on which the 2D projection will be performed. If the total number of users to be projected is less than the dimensionality of the embedding, the platform may perform a Singular Value Decomposition (SVD) of the embedding of the users, but maintain only as many dimensions as there are users. For example, if there are m, e.g., 400, selected users and the platform factorizes the generated PMI matrix to k, e.g., 1000, dimensions, the platform takes their m x k embedding matrix, e.g., a 400 x 1000 matrix, and factorizes it to get an m x m matrix, e.g., a 400 x 400 matrix. In this way, the platform performs a dimensionality reduction on the original embedding of the user-user graph if the number of users is too small. Since these users are selected because they are near each other on the interest space, their embeddings will be very similar. Thus, the platform highlights their differences by re-factorizing.
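The re-factorization for a small selection can be sketched as follows. The text above does not say how the m x m matrix is formed from the decomposition, so this sketch assumes one reasonable choice: the left singular vectors scaled by the singular values, which preserves all pairwise inner products between the selected users. The function name `refactorize` is hypothetical.

```python
import numpy as np

def refactorize(selected_embedding):
    """Reduce an m x k embedding of m selected users to m x m when m < k."""
    m, k = selected_embedding.shape
    if m >= k:
        # Enough users: keep the original embedding unchanged.
        return selected_embedding
    # Thin SVD: u is m x m, s has m singular values.
    u, s, _ = np.linalg.svd(selected_embedding, full_matrices=False)
    # Coordinates of each user in the m-dimensional singular basis.
    return u * s

rng = np.random.default_rng(0)
selected = rng.normal(size=(4, 10))  # 4 selected users, 10-dimensional embedding
reduced = refactorize(selected)
```

Because the Gram matrix is unchanged, distances between the selected users are the same before and after the reduction, so a subsequent 2D projection sees the same geometry in fewer dimensions.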

Thus, the platform improves content recommendations by serving more content from authoritative producers known for topics. The platform optionally applies textual matching rules to filter out off-topic noise, as content generated by authoritative producers of a topic may not always be related to the topic. By including only content produced by the authoritative producers on a topic as candidate content that is likely related to the topic, the platform only needs a much more limited set of textual matching rules to further filter the candidate content.

(e) Identifying Personalized Content for a Given User for a Topic

Using one or more of the methods described above, the platform can present content to a user based on topics followed by the user. It is likely that the platform will always generate a very similar set of candidate content for all users who follow a given topic. However, certain users may be interested in only a subset of the content generally determined to be related to the topic, especially for a broad topic.

In some implementations, the platform uses certain implicit signals to better personalize the content each user sees according to the topics the user follows. For users who have provided more implicit signals about what they are interested in, e.g., experienced users, the platform can deliver them more of the content they are interested in seeing for topics they follow.

In some implementations, the platform looks for content that is similar to the content with which the user recently engaged, e.g., liked, forwarded or re-posted. For example, for users who are following a topic and have recently engaged a lot with content about a particular sub-topic, the platform provides more content related to this particular sub-topic that the users are apparently more interested in. In another implementation, the platform looks for content similar to a user’s long-term interests determined based on the long-term historical activities of users they follow, e.g., based on a user interested-in matrix generated using the above described SimClusters algorithm.

The above described method can dramatically increase the users’ sense of the quality of the content users see about a topic.

The platform may use the above method after performing one or more of the other methods described in Section III.1 to further filter out candidate content related to a topic and present content that is likely interesting to users. Alternatively, the platform may perform the above method in parallel with any of the other methods.

(f) Identifying Recent Content for a Topic at a Given Time

In some implementations, to provide fresh, recent messages, e.g., no more than six hours old, the platform uses a streaming pipeline method, e.g., a streaming top K indexed content items method, to retrieve recent content items for a given topic. With this method, the platform indexes every topic-item pair into hourly partitions within a recent time period through a real-time distributed stream data processing engine, e.g., a Storm (https://storm.apache.org/) engine, which powers real-time stream data management tasks. The method allows for arbitrary time-range queries at read time at low latency. Thus, the platform can provide fresh and recent content for topics at a given time.

Using the streaming top K indexed content items method, the platform builds a topic per locale content index and achieves good coverage of every topic in every locale where there exists such annotated content. Instead of relying on SimClusters to get mappings between each topic and a list of content items, the platform builds the mappings directly by parsing content items’ topical annotations, and uses these annotations to construct a key-value indexed store. This method ensures that every content item associated with the topic is guaranteed to pass the topic-annotation filtering described in Section III.1.(b). Therefore, this method boosts the coverage greatly compared to determining top content items based on the SimClusters algorithm, which only indexes a subset of content items annotated with the topic. Once the platform obtains an array of content items for each topic, it ranks them to identify the best content items per topic.

The streaming top K indexed content items method includes a write-path stage and a read-path stage. At the write-path stage, the platform uses the real-time distributed stream data processing engine to read in content items. At this stage, for a content item, the platform parses its topical annotations, e.g., a list of topic identifiers, derives its creation time by the hour, fetches its locale information, e.g., language and country, and updates a top content item list per topic based on a ranking rule.

The platform at this stage compares every content item created for a given topic, and applies filtering and ranking to preserve good quality content items, e.g., on the order of 10,000 or 20,000 messages per topic, in a caching system. To do the filtering and ranking at the write-path stage, the platform sorts the top content items based on the number of users who have liked each of the content items. In some implementations, the platform sorts content items in accordance with the above-described user interaction scores between users and content items. The platform may optionally use the logarithms of the user interaction scores to prevent overly active users from dominating the results. Other filtering or ranking algorithms may be applied at the read-path stage. The method therefore gives great flexibility to apply various ranking algorithms.

The platform can store each content item for a topic in a distributed memory object caching system, e.g., MemCache, in which each content item may be identified by a key: (TopicId, Locale), which specifies the identifier of the topic and the locale information of the content item, e.g., the language of the content item or country of the author. In some implementations, the platform structures the key of a content item with a language or a timestamp or both. In particular, the key may be (TopicId, Locale, Language), (TopicId, Locale, Timestamp) or (TopicId, Locale, Language, Timestamp). The platform uses an hourly timestamp as a partition key to evenly distribute the content items by the hour. Other time periods, longer or shorter, depending on the rate at which content items arrive, may also be used. The time partition also allows the platform to scale up without blowing up any particular key. The platform may use language as part of the key structure to serve similar purposes, e.g., distributing the content by language, or distinguishing the same topic referenced in different contexts, e.g., content related to basketball in Japanese vs. in English.

In an example implementation, there are about 3000 topics and about 200 locales that are served. Using the streaming top K indexed content items method, the platform stores the top content items within the past 24 hours, which gives 24 timestamp partitions. In this way, it gives the platform a maximum of 14.4 million (3000 x 200 x 24) keys. For each key and in each partition, the platform stores the top 1000 content items along with their scores. Therefore, in this case the platform requires a maximum storage capacity for 14.4 billion content items, although the platform may actually store fewer content items than the estimated maximum number due to skew in the distribution of topics and locales.
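The write path described above can be sketched with an in-memory stand-in for the caching system. The class name `TopKStore`, its methods, and the use of a per-key min-heap are assumptions for illustration; the sketch does not model a distributed cache or a stream processing engine.

```python
import heapq
from collections import defaultdict
from datetime import datetime, timezone

class TopKStore:
    """In-memory stand-in for a cache keyed by (TopicId, Locale, hourly Timestamp)."""

    def __init__(self, k):
        self.k = k
        self._heaps = defaultdict(list)  # key -> min-heap of (score, item_id)

    def write(self, item_id, topic_ids, locale, created_at, score):
        # Truncate the creation time to the hour to form the partition key.
        hour = created_at.replace(minute=0, second=0, microsecond=0)
        for topic_id in topic_ids:
            heap = self._heaps[(topic_id, locale, hour)]
            entry = (score, item_id)
            if len(heap) < self.k:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                # Evict the current lowest-scored item in favor of the new one.
                heapq.heapreplace(heap, entry)

    def read(self, key):
        """Return the stored (score, item_id) pairs for a key, best first."""
        return sorted(self._heaps.get(key, []), reverse=True)

store = TopKStore(k=2)
ts = datetime(2021, 12, 29, 10, 15, tzinfo=timezone.utc)
store.write("a", [7], "en-US", ts, score=5.0)
store.write("b", [7], "en-US", ts, score=9.0)
store.write("c", [7], "en-US", ts, score=1.0)
hour_key = (7, "en-US", datetime(2021, 12, 29, 10, 0, tzinfo=timezone.utc))
top = store.read(hour_key)
```

With k set to 1000 and the example figures above, the number of keys is bounded by 3000 x 200 x 24 and the number of stored items by that figure times 1000.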

At the read-path stage, to retrieve content from the memory caching system, the platform takes, for example, a TopicId and a user’s locale information as inputs, optionally coupled with a time range. This allows the platform to reconstruct one or a range of (TopicId, Locale, Timestamp) keys. Assuming the platform stores records up to 24 hours, the platform can make up to 24 fanout calls to the memory caching system for up to 24,000 candidate content items. After that, the platform applies additional ranking or filtering on top of the candidate content items to identify the top content items.
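The key reconstruction and fanout at the read path can be sketched as follows. The cache is modeled as a plain dictionary, and the function names `hourly_keys` and `read_candidates` are hypothetical; the merge keeps the highest score seen for each content item.

```python
from datetime import datetime, timedelta, timezone

def hourly_keys(topic_id, locale, end, hours):
    """Reconstruct the (TopicId, Locale, Timestamp) keys for the last `hours` hours."""
    end_hour = end.replace(minute=0, second=0, microsecond=0)
    return [(topic_id, locale, end_hour - timedelta(hours=h)) for h in range(hours)]

def read_candidates(cache, topic_id, locale, end, hours=24):
    """Fan out one lookup per hourly key and merge the results, best first."""
    candidates = {}
    for key in hourly_keys(topic_id, locale, end, hours):
        for score, item_id in cache.get(key, []):
            candidates[item_id] = max(score, candidates.get(item_id, score))
    return sorted(candidates.items(), key=lambda pair: pair[1], reverse=True)

def at_hour(hour):
    return datetime(2021, 12, 29, hour, 0, tzinfo=timezone.utc)

cache = {
    (7, "en", at_hour(12)): [(3.0, "x")],
    (7, "en", at_hour(11)): [(8.0, "y")],
    (7, "en", at_hour(9)): [(99.0, "z")],  # outside a 2-hour window
}
now = datetime(2021, 12, 29, 12, 30, tzinfo=timezone.utc)
recent = read_candidates(cache, 7, "en", now, hours=2)
```

With 24 hourly partitions of up to 1000 items each, the merged candidate list has at most 24,000 entries before any additional ranking or filtering.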

At the read-path stage, the platform further applies one or more ranking methods to rank candidate content items. In one method, the platform scores the candidate content items by computing the cosine similarity between topic embeddings and content-item embeddings, both of which are computed using the SimClusters algorithm.

In another method, the platform scores content items by computing the cosine similarity between “users who liked the content items” and “users who like the topic”. The main difference between this method and the above SimClusters-based method for ranking is that this method calculates the similarity between the raw user vectors, while the above SimClusters-based method for ranking first maps the raw user vectors to the space of SimClusters, e.g., by summing up user interested-in vectors, and then calculates cosine similarity in the mapped space. Based on the ranking results, the platform retrieves top K content items per cluster, and stores the resultant top K content items in a separate memory cache, e.g., MemCache, identified by (Topic, Locale). The use of caching allows the platform to apply heavy ranking algorithms without being penalized by the latency, as well as shielding the underlying content cache from heavy search traffic coming from client devices.

Using the streaming top K indexed content items method, the platform is able to find recent, fresh content items about a topic and serve them to the users in a timely manner.

(g) Filtering Out Content that Is Unhealthy

Unhealthy content includes, for example, abusive content that is abusive to certain users, spam that is unsolicited content to certain users, trolling content that is inflammatory or offensive to certain users, or pornographic content.

The platform can filter out candidate content items to be presented to a user based on topics in one or more of the following ways.

The platform can filter out candidate items from suspended user accounts or from user accounts muted or blocked by the user, or content items previously blocked by the user. Although a user using the blocked account cannot see content posted by the user who blocked her/him, a user of the muted account can still see content posted by the user who muted her/him. However, the user will not see content posted by either the muted account or the blocked account.

The platform can filter out candidate content items based on annotations on the content items. For example, the platform enables users to report abusive content, spam, trolling content or pornographic content, and the platform annotates the identified content with corresponding keywords, e.g., abusive, spam, trolling or porn. The platform may also enable human curators to screen content and annotate content items with keywords representing various kinds of unhealthy content. The platform can filter out content items based on single curator annotations or based on some threshold number of user annotations.

If the platform cannot determine in a timely manner, e.g., in real-time, whether certain content items are unhealthy, the platform temporarily filters out and stores those content items and then processes the items later, e.g., using a batch processing job, to determine whether those content items should or should not be filtered out.
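The filtering rules of this subsection can be combined into a single predicate, sketched below. The `Item` fields, the label set, the function name `keep_item`, and the default threshold of 3 user reports are hypothetical values chosen for the example; deferred batch processing of undecided items is not modeled.

```python
from dataclasses import dataclass, field

# Labels treated as unhealthy in this sketch.
UNHEALTHY = {"abusive", "spam", "trolling", "porn"}

@dataclass
class Item:
    item_id: str
    author: str
    curator_labels: set = field(default_factory=set)
    user_reports: dict = field(default_factory=dict)  # label -> number of reports

def keep_item(item, suspended, muted_or_blocked, blocked_items, report_threshold=3):
    """Return True if the item may be presented to the viewing user."""
    # Drop items from suspended accounts or accounts the viewer muted or blocked.
    if item.author in suspended or item.author in muted_or_blocked:
        return False
    # Drop items the viewer previously blocked.
    if item.item_id in blocked_items:
        return False
    # A single curator annotation is enough to filter the item out.
    if item.curator_labels & UNHEALTHY:
        return False
    # User reports filter the item out only once they reach the threshold.
    return all(count < report_threshold
               for label, count in item.user_reports.items()
               if label in UNHEALTHY)

ok = Item("t1", "alice", user_reports={"spam": 1})
reported = Item("t2", "bob", user_reports={"spam": 3})
flagged = Item("t3", "carol", curator_labels={"abusive"})
muted = Item("t4", "dave")
results = [keep_item(i, suspended=set(), muted_or_blocked={"dave"}, blocked_items=set())
           for i in (ok, reported, flagged, muted)]
```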

2. Content Recommending Using One or More of the Above Described Methods

In some implementations, the platform combines one or more of the above described methods to recommend content based on topics.

FIGURE 15 illustrates an example system flow for recommending content based on topics. As shown in the figure, a topic embedding job 1510 generates topic embeddings using the SimClusters algorithm as described above in Section II.3.(b), and then writes the generated topic embeddings into an off-line topic embedding data store 1540. The topic embedding job computes topic embeddings based on a user-content interaction graph stored in an offline data store 1520 and user interested-in SimClusters stored in another offline data store 1530. In some implementations, the topic embedding job uses the aggregated, 7-day decayed favorite count between each (user, item) pair in the user-content interaction graph as the weights of the user embeddings. The platform may run the topic embedding job 1510 once during a predetermined time frame, e.g., a week, or more frequently, e.g., once a day, to reflect the changes in user engagement with topics through the user-content interaction graph.

A content recommender 1550 recommends content by reading both 1) the topic embeddings generated using the SimClusters algorithm and stored in an on-line topic embedding repository 1560; and 2) the top K content items per SimCluster generated by the above described streaming top K indexed content items method (in Section III.1.(f)) and stored in a caching system 1570. The topic embeddings stored in the on-line topic embedding repository 1560 may be imported from the offline topic embedding data store 1540.

FIGURE 16 illustrates an example workflow 1600 for the content recommender:

The content recommender retrieves (1610) a topic embedding from the on-line repository 1560 storing a predetermined number, e.g., 150, 200, or 250, of top clusters for each topic.

The content recommender retrieves (1620) a predetermined number, e.g., 40, 50, or 60, of top content items from each cluster in the retrieved topic embedding from the top K content items per SimCluster in the caching system 1570.

The content recommender de-duplicates the retrieved content items and takes a predetermined number, e.g., 150, 200, or 250, of top content items based on the dot product between the topic embedding and the content-item embedding (1630). This step is computationally cheap.

The content recommender computes the cosine similarity between a predetermined number, e.g., 150, 200, or 250, of the remaining content items and the topic embedding and filters out the content items with a cosine similarity score less than a threshold, for example, 0.3 (1640). This step is computationally expensive.

The content recommender applies (1650) a high precision filter to ensure the resultant content items are topically related, for example, using the annotations of the content items as described above in Section III.1.(b). In this way, remaining content items after the filtering will be returned as content items to be recommended for the topic.
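The five steps of workflow 1600 can be sketched end to end as below. The data structures (a dense topic vector, a dictionary of top items per cluster, a dictionary of item vectors, and a dictionary of annotations) and the function names are assumptions for illustration; the default parameter values are taken from the example numbers in the text.

```python
import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def recommend(topic_vec, items_per_cluster, item_vecs, annotations, topic_id,
              top_clusters=200, per_cluster=50, keep=200, min_cosine=0.3):
    """Return content item ids recommended for a topic, best first."""
    # 1610: take the clusters with the largest non-zero weights for the topic.
    order = np.argsort(-topic_vec)[:top_clusters]
    clusters = [int(c) for c in order if topic_vec[c] > 0]
    # 1620: fetch the top items of each cluster, de-duplicating as we go.
    seen, candidates = set(), []
    for cluster in clusters:
        for item in items_per_cluster.get(cluster, [])[:per_cluster]:
            if item not in seen:
                seen.add(item)
                candidates.append(item)
    # 1630: cheap pre-ranking by dot product with the topic embedding.
    candidates.sort(key=lambda i: float(topic_vec @ item_vecs[i]), reverse=True)
    candidates = candidates[:keep]
    # 1640: drop items whose cosine similarity is below the threshold.
    candidates = [i for i in candidates
                  if cosine(topic_vec, item_vecs[i]) >= min_cosine]
    # 1650: high precision filter using the items' topical annotations.
    return [i for i in candidates if topic_id in annotations.get(i, ())]

topic_vec = np.array([0.9, 0.1, 0.0])
items_per_cluster = {0: ["t1", "t2"], 1: ["t2", "t3"], 2: ["t4"]}
item_vecs = {"t1": np.array([1.0, 0.0, 0.0]), "t2": np.array([0.5, 0.5, 0.0]),
             "t3": np.array([0.0, 0.1, 1.0]), "t4": np.array([0.0, 0.0, 1.0])}
annotations = {"t1": {"sports"}, "t2": {"music"}, "t3": {"sports"}}
recommended = recommend(topic_vec, items_per_cluster, item_vecs, annotations, "sports")
```

In this toy run, t4 is never retrieved because its cluster has zero weight for the topic, t3 fails the cosine threshold, and t2 fails the annotation filter.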

3. Use Cases of Content Consumption Based on Topics

(a) Determining Content to be presented on a User’s Home Timeline

To identify content to place on a user’s home timeline, the platform uses one or more of the following methods to identify candidate content:

(1) Identifying content whose representation has the highest dot-product with the viewing user’s interest representation, using the SimClusters algorithm, as described in Section III.1.(a);

(2) Identifying recent content for a topic for the viewing user at a given time using the above streaming pipeline method, as described in Section III.1.(f);

(3) Identifying content related to a topic by identifying authoritative producers on topics followed by the viewing user, as described in Section III.1.(d);

(4) Determining whether candidate content, e.g., content identified by one or more of the above methods (1)-(3), is actually about a topic using textual matching rules, as described in Section III.1.(b);

(5) Filtering out candidate content by determining whether the author account of the content is muted or blocked by the user, or whether the content itself has already had a negative engagement by the user, as described in Section III.1.(g);

(6) Selecting content that is likely interesting to the viewing user using a machine learning model to score all candidate content items based on how likely the viewing user will engage with the candidate content items, as described in Section III.1.(c); or

(7) Filtering out content by determining whether the candidate content is healthy, as described in Section III.1.(g).

In some implementations, the platform uses one or more of the above methods in parallel to identify candidate content items independently and combines the identified content items to present on a user’s home timeline. The platform may use one or more of the above methods sequentially, so that the platform can use one of the above methods to further filter candidate content items identified by one or more of the other methods. In one scenario, the platform first identifies candidate content using both the SimClusters algorithm and the above described streaming pipeline method independently, and then determines whether the identified candidate content is actually about a topic using textual matching rules.

(b) Determining Content to be placed on a Topic Landing Page

In some implementations, the platform runs two or more of the above described methods of determining candidate content for a given topic in parallel. These methods include: 1) identifying content related to a topic using the SimClusters algorithm as described in Section III.1.(a); 2) identifying content related to a topic using textual matching rules, as described in Section III.1.(b); 3) identifying content from authoritative producers on the topic, as described in Section III.1.(d); or 4) using the streaming top K indexed content items method, as described in Section III.1.(f), to retrieve recent content items for the topic.

Alternatively, the platform may sequentially run at least two of these methods of determining candidate content for a given topic. In one scenario, the platform first uses the SimClusters algorithm to identify candidate content items whose vectors have high cosine similarity with the vector of the topic in the same space, and then applies textual matching rules to further filter out candidate content items determined for the topic.
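By way of illustration only, the sequential scenario above can be sketched as follows. The sketch assumes that topics and content items are held as sparse vectors (dictionaries keyed by community), and it stands in a simple case-insensitive keyword match for the textual matching rules, which are defined elsewhere in this specification. The names `cosine`, `candidates_for_topic`, and `min_cosine` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {community: weight}."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def candidates_for_topic(topic_vec, keywords, items, min_cosine=0.5):
    """items: list of (text, sparse_vector) pairs. Returns the texts that pass both stages."""
    selected = []
    for text, vec in items:
        # Stage 1: keep items whose vector is close to the topic vector.
        if cosine(vec, topic_vec) < min_cosine:
            continue
        # Stage 2 (assumed rule): keep items mentioning at least one topic keyword.
        if any(k.lower() in text.lower() for k in keywords):
            selected.append(text)
    return selected
```

Running the cheaper vector comparison first means the textual rules are only evaluated on items that are already close to the topic in the space of communities.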

(c) Consumption of Trending Content based on Topics

The platform identifies trending content. A trend is a hashtag, word, phrase, or subject that is mentioned at a greater rate than others in user-generated content. In some implementations, the platform annotates trending content not only with locale information, e.g., information indicating the location of an event or news, but also with a topical annotation to provide better contextual information. For example, for each trend, the platform identifies content items that are contributing to the trend and their related topics, tracks how many of the content items are related to each of the related topics, and then selects a broad topic that represents most or all of the topics related to the content items that are contributing to the trend, if any. The platform annotates the trend with the broad topic. In this way, the platform classifies trending content, even content not otherwise identified as related to the topics, with topic categories. Therefore, the platform determines trending content items to be presented to a user based on their topical annotations, e.g., identifiers of topics to which these items are related.
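By way of illustration only, the selection of a broad topic for a trend can be sketched as follows. The sketch assumes a hypothetical lookup table `BROAD_TOPIC` that maps each topic to its broad parent topic, and a hypothetical `min_share` parameter standing in for "most or all"; neither is specified in this document.

```python
from collections import Counter

# Hypothetical mapping from a topic to its broad (parent) topic.
BROAD_TOPIC = {"NBA": "Sports", "Premier League": "Sports", "Jazz": "Music"}

def annotate_trend(item_topics, min_share=0.5):
    """item_topics: one list of related topics per content item contributing to the trend.

    Returns the broad topic covering at least `min_share` of the items, or None.
    """
    counts = Counter()
    for topics in item_topics:
        # Count each broad topic at most once per content item.
        counts.update({BROAD_TOPIC.get(t, t) for t in topics})
    if not counts:
        return None
    topic, n = counts.most_common(1)[0]
    return topic if n / len(item_topics) >= min_share else None
```

A trend whose contributing items share no dominant broad topic is left without a topical annotation, matching the "if any" qualification above.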

IV. Item Embedding Using the SimClusters Algorithm

This section describes the SimClusters algorithm and how it is used to embed various kinds of items on the platform into a space of communities.

FIGURE 17 illustrates an example workflow 1700 for the SimClusters algorithm. The SimClusters algorithm includes a community discovery stage 1710 and an item representation stage 1720.

In the community discovery stage, the platform uses a user-user graph to classify users into a set of communities, and to calculate association weights quantifying the strength of the users’ association with these communities. That is, the algorithm constructs, from the user-user graph, user vectors in a space of the communities.

In the item representation stage, the platform constructs representations, e.g., sparse, interpretable vectors, for various kinds of items on the platform in the same space of the communities. The items may include users, topics, content items, or annotations of content items, and the content items may be messages, notifications, events, trends or news. The items can be the targets for different recommendation or personalization problems.

1. Community Discovery

The community discovery stage is about discovering communities from a user-user graph. A user-user graph may be a directed graph representing following or engagement relationships between users on the platform. The engagement relationships between users may include a user having engaged with, e.g., liked, re-posted or forwarded, content posted by another user. In the user-user graph, each node represents a user, and each edge represents a following or engagement relationship between two users and is directed to indicate which user follows or engages with which. Follower relationships, rather than followee relationships, are used in the user-user graph, as users are more able to control who they follow than who follows them.

The user-user graph is reformulated as a bipartite graph, in which multiple nodes are divided into two disjoint and independent sets and every edge connects two nodes in the two sets respectively. In this specification, the two sets in the bipartite graph are referred to as a left-partition L and a right-partition R. The left-partition L includes left nodes corresponding to content consumers (also referred to as “consumers”) and the right-partition R includes right nodes corresponding to content producers (also referred to as “producers”). For example, the producers are users who are followed by other users and the consumers are users who are following. An account can appear as a node in both partitions.
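By way of illustration only, the reformulation of a directed user-user graph as a bipartite graph can be sketched as follows. The sketch assumes the user-user graph is given as (follower, followee) pairs; the function name `to_bipartite` is hypothetical.

```python
def to_bipartite(follow_edges):
    """follow_edges: iterable of (follower, followee) pairs from the user-user graph.

    Returns (left, right, edges), where `left` lists the consumers, `right`
    lists the producers, and every edge runs from a left node to a right node.
    """
    edges = set(follow_edges)  # drop duplicate pairs
    left = sorted({follower for follower, _ in edges})    # consumers (partition L)
    right = sorted({followee for _, followee in edges})   # producers (partition R)
    return left, right, edges
```

An account that both follows and is followed appears once in each partition, consistent with the statement above that an account can appear as a node in both partitions.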

FIGURE 18 illustrates an example user-user graph representing following or engagement relationships between four users and a bipartite graph corresponding to the user-user graph.

The SimClusters algorithm identifies a set of communities from the bipartite graph, and assigns each of the left nodes and each of the right nodes to the identified communities with community association weights to indicate the strength of their association with each of the communities. The left and right nodes are represented as sparse, non-negative vectors, each element of which corresponds to a community, thereby embedding the nodes in a space of communities.

Since the majority of edges in a typical user-user graph for a messaging platform are directed towards a minority of users, e.g., the top 20 million most-followed users, the number of the right nodes may be much smaller than the number of the left nodes. In one scenario, the right nodes represent the top ~10⁷ most followed users on the platform, and the left nodes represent all of the users on the platform, e.g., ~10⁹ users. Thus, the algorithm first discovers communities based on the minority of users represented by the right nodes, and then assigns all of the users represented by the left nodes to these discovered communities as well.

As described below, the community discovery stage includes three steps: step 1 - calculating the similarity between the right nodes in the bipartite graph and generating a weighted, undirected graph, e.g., a similarity graph, representing the similarity between the right nodes; step 2 - discovering communities from the generated weighted, undirected graph of the right nodes; and step 3 - assigning the left nodes in the bipartite graph to certain communities discovered in step 2.

(a) Generating a Similarity Graph of the Right Nodes

The platform constructs (1711), from the bipartite graph, a much smaller uni-partite undirected graph G including the right nodes in the bipartite graph, i.e., a similarity graph of the right nodes. In the similarity graph, each node represents a producer, and the weight of an edge represents the similarity between the two producers represented by the two nodes connected by the edge. A weight of an edge in the similarity graph may be referred to as a “similarity weight”. The similarity graph may be constructed by connecting each of the producers to the most similar other producers. For example, producers are more or less similar as their followers are more or less similar.

FIGURE 19 illustrates generation of a producer-producer similarity graph from an example bipartite graph. The bipartite graph may be represented as an m x n matrix A of elements (u, v), where consumers in the left partition L of the graph are represented as u, producers in the right partition R of the graph are represented as v, the number of consumers in the left partition L is m, and the number of producers in the right partition R is n. The platform constructs a producer-producer similarity graph from the matrix A.

In some implementations, the platform computes the similarity weight between two producers (i, j) by calculating a measure of similarity between their respective groups of followers. For example, the platform can compute the cosine similarity of their followers, e.g., consumers, in the left partition L of the bipartite graph. To elaborate, if X_i and X_j represent the binary incidence vectors of producer i’s and producer j’s followers respectively, their cosine similarity is defined as:

$$\cos(X_i, X_j) = \frac{X_i \cdot X_j}{\lVert X_i \rVert \, \lVert X_j \rVert}$$

With this definition, two users would have nonzero similarity, or an edge in the producer-producer similarity graph, simply by sharing one common neighbor in the bipartite graph.

The size of the resulting producer-producer similarity graph may be unmanageably large. Thus, the platform may keep only the most important edges after the discovery of communities. For example, to avoid generating an extremely dense producer-producer similarity graph, the platform discards edges with similarity weights lower than a predetermined threshold, or keeps at most a predetermined number of edges with the largest similarity weights for each producer.
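By way of illustration only, the construction and pruning of a producer-producer similarity graph can be sketched as follows. For binary incidence vectors, the dot product equals the number of shared followers and each norm equals the square root of the follower count, so the cosine similarity reduces to set arithmetic. The sketch compares every pair of producers, which is only practical for small examples; the names `similarity_graph`, `min_weight`, and `max_edges` are hypothetical.

```python
import math
from itertools import combinations

def similarity_graph(followers, min_weight=0.0, max_edges=None):
    """followers: dict mapping each producer to the set of its followers.

    Returns {(i, j): weight} for producer pairs with nonzero cosine similarity.
    """
    edges = {}
    for i, j in combinations(sorted(followers), 2):
        shared = len(followers[i] & followers[j])
        if shared == 0:
            continue  # no common follower, hence no edge
        w = shared / math.sqrt(len(followers[i]) * len(followers[j]))
        if w >= min_weight:  # discard edges below the threshold
            edges[(i, j)] = w
    if max_edges is not None:
        # Keep an edge if it ranks among the `max_edges` heaviest edges of
        # either endpoint (an assumed rule; a node may therefore retain more).
        keep = set()
        for p in followers:
            incident = sorted((e for e in edges if p in e), key=lambda e: -edges[e])
            keep.update(incident[:max_edges])
        edges = {e: edges[e] for e in keep}
    return edges
```

Either pruning option can be applied alone: the threshold bounds the minimum edge weight, while the per-producer cap limits how many edges each producer contributes.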

(b) Discovering Communities for the Right Nodes

A clustering algorithm is then run (1712) on the generated producer-producer similarity graph to group producers into communities or clusters. The clustering algorithm discovers communities of densely connected nodes from the producer-producer similarity graph, and classifies producers in the similarity graph into the discovered communities. Each community may be characterized by the top users of the community, for example, accounts that many consumers in that community follow.

The output of this step is an n x k known-for matrix of the form V_{n x k}, in which the i-th row specifies the communities to which the node i in the similarity graph is assigned, where k represents the number of communities and n represents the number of the nodes in the similarity graph that correspond to the right nodes in the bipartite graph. The known-for matrix V indicates communities which each of the producers corresponding to the right nodes in the bipartite graph is known for.

In order to accurately preserve the structure of the producer-producer similarity graph, it may be important for the communities to have fewer nodes, e.g., hundreds of nodes, rather than thousands or tens of thousands of nodes. The platform may discover communities from the similarity graph using a neighbor-based sampling algorithm, e.g., an algorithm called Neighborhood-aware Metropolis Hastings (Neighborhood-aware MH), which is accurate, fast, and scales to graphs with billions of edges. The Neighborhood-aware MH algorithm extends a Metropolis Hastings sampling algorithm described, for example, in Tsourakakis, “Provably Fast Inference of Latent Features from Networks: with Applications to Learning Social Circles and Multilabel Classification,” WWW '15: Proceedings of the 24th International Conference on World Wide Web, May 2015, pages 1111-1121. An implementation of the Neighborhood-aware MH algorithm is open-sourced at https://github.com/twitter/sbf.

The platform runs the Neighborhood-aware MH algorithm on the producer-producer similarity graph to identify an associated community for each of the n producers in the graph. This algorithm takes in a parameter k specifying the number of communities to be detected and returns community association weights for each of the n producers. The platform then uses the community association weights to construct the known-for matrix V_{n x k}, in which each producer is associated with at most one community.

FIGURE 20 illustrates an example known-for matrix V. According to the known-for matrix V in the figure, a producer v7 is known for a community k2, and a producer v2 is known for a community k1.
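By way of illustration only, the construction of a known-for matrix from community association weights can be sketched as follows. The sketch does not implement the Neighborhood-aware MH algorithm; it takes the weights as given and assumes, as one possible rule, that each producer is assigned to the single community with its largest positive weight. The function name `known_for` is hypothetical.

```python
def known_for(weights):
    """weights: dict mapping each producer to a list of k community association weights.

    Returns a dict of rows of the known-for matrix V, each with at most one
    non-zero entry.
    """
    matrix = {}
    for producer, row in weights.items():
        out = [0.0] * len(row)
        if row:
            best = max(range(len(row)), key=row.__getitem__)  # index of the largest weight
            if row[best] > 0:
                out[best] = row[best]  # producer is known for this community only
        matrix[producer] = out
    return matrix
```

A producer whose weights are all zero keeps an all-zero row, i.e., it is not known for any community.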

(c) Assigning the Left Nodes to the Discovered Communities

The platform assigns (1713) consumers corresponding to the left nodes to the discovered communities. The output of this step is an m x k user interested-in matrix of the form U_{m x k}, in which the i-th row specifies the communities to which the left-node i in the bipartite graph is assigned, where m represents the number of consumers corresponding to the left nodes and k represents the number of communities. The user interested-in matrix U indicates communities which each of the consumers corresponding to the left nodes in the bipartite graph is interested in.

In generating the user interested-in matrix, the platform assigns a left-node in the bipartite graph to communities by looking at the communities that its neighbors have been assigned to. In the bipartite graph, the neighbors of the left-node are the right nodes in the bipartite graph that are connected with the left-node. The platform has already assigned the neighbors to communities in the above described step 2.

FIGURE 21 illustrates example computation of a user interested-in matrix U. As illustrated in the figure, the interested-in matrix U is computed by multiplying the matrix representation of a user-user graph A by a known-for matrix V. The user interested-in matrix U indicates that consumer U1 is interested in community K1 only, whereas consumer U3 is interested in all three communities, i.e., K1, K2, and K3.

A user can be interested in many communities. However, the platform saves only some of the top m communities, e.g., top 50 communities. In some implementations, the platform applies noise removal to the user interested-in matrix. For example, the platform discards an element in the matrix with a value lower than a certain threshold, or keeps at most a certain number of elements with the highest weights for each user.

In some implementations, the platform sets a user interested-in matrix U by an equation U = truncate(A • V), where A represents a user-user graph, V represents a known-for matrix, and the truncate function keeps only up to a certain number of non-zero elements per row to save on storage. This equation is motivated by the fact that in the special case when the known-for matrix V is an orthonormal matrix, i.e., V^T V = I, then U = A • V is the solution to A = U • V^T, where V^T is the transpose of the known-for matrix V.
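By way of illustration only, the equation U = truncate(A • V) can be sketched as follows using plain nested lists. The sketch assumes that truncation keeps the largest entries of each row; the names `interested_in` and `max_nonzero` are hypothetical.

```python
def interested_in(A, V, max_nonzero):
    """A: m x n adjacency matrix (consumers by producers); V: n x k known-for matrix.

    Returns the m x k matrix truncate(A * V), keeping at most `max_nonzero`
    of the largest entries in each row and zeroing the rest.
    """
    k = len(V[0]) if V else 0
    U = []
    for a_row in A:
        # Row of A * V: sum the known-for rows of the producers this consumer follows.
        row = [sum(a * V[p][c] for p, a in enumerate(a_row)) for c in range(k)]
        # Indices of the largest entries in this row.
        top = set(sorted(range(k), key=lambda c: -row[c])[:max_nonzero])
        U.append([row[c] if c in top else 0.0 for c in range(k)])
    return U
```

For example, with A = [[1, 1, 1]] and V = [[3, 0], [0, 1], [0, 1]], the untruncated product is [[3, 2]]; keeping one entry per row leaves only the weight for the first community.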

The platform runs the community discovery stage in a batch-distributed setting or in a real-time setting. In some implementations, the platform makes the output user interested-in matrix of the community discovery stage available in both offline data warehouses as well as low-latency online stores, for example, indexed by user identifiers.

2. Item Representation

In this stage, the platform computes representations for different items in the space of the communities. The different items include, for example, topics, messages, events, trends, notifications, hashtags, URLs, inquiries or any other items on the platform, and can be the targets for different recommendation or personalization problems.

The second stage of the SimClusters algorithm may be implemented by several jobs running in parallel, each of which calculates the representations for a specific recommendation target, using a user-item bipartite graph formed from historical or on-going user engagements with the target items on the platform. Each job in the second stage operates in either a batch-distributed setting or a streaming-distributed setting, depending on the shelf-life of the recommendation target and the churn in the corresponding user-item bipartite graph.

Each of the items may be represented as a vector in the space of the communities, in which an element (i, j) corresponding to the i-th community for an item j indicates how interested the i-th community is in item j. The end result is that different items on the platform are represented as sparse, interpretable vectors in the same space.

A content item’s representation may be computed by aggregating the representations of all the users who engaged with the content item, e.g., the representation for a content item j is

$$W(j) = \mathrm{aggregate}(\{U(u), \forall u \in N(j)\})$$

where W(j) is a vector for the content item j, U(u) is the vector of a user u from a user interested-in matrix U, and N(j) denotes all the users who engaged with the content item j. The aggregate function can be chosen based on different applications and can be learned from a specific supervised task. The aggregation function used may be an “exponentially time-decayed average” that exponentially decays the contribution of a user who interacted with the item based on how long ago that user engaged with the item.
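By way of illustration only, an exponentially time-decayed average can be sketched as follows. The sketch assumes user vectors are stored as sparse dictionaries and that the decay is parameterized by a half-life; the names `embed_item`, `now`, and `half_life` are hypothetical, and the exact decay form used by the platform is not specified in this document.

```python
def embed_item(engagements, U, now, half_life):
    """engagements: list of (user, timestamp) pairs for one content item.
    U: dict mapping each user to a sparse vector {community: weight}.

    Returns the item vector W(j) as a decay-weighted average of user vectors.
    """
    total, norm = {}, 0.0
    for user, ts in engagements:
        # A contribution halves for every `half_life` time units of age.
        decay = 0.5 ** ((now - ts) / half_life)
        norm += decay
        # Users without a stored vector contribute a zero vector.
        for community, w in U.get(user, {}).items():
            total[community] = total.get(community, 0.0) + decay * w
    return {c: v / norm for c, v in total.items()} if norm else {}
```

With a half-life of 10 time units, an engagement that happened 10 units ago counts half as much as one that happened just now, so recent engagers dominate the item's representation.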

The platform implements the item embedding using batch jobs or real-time jobs depending on how real-time sensitive the item is. For example, for the embedding of a message, the platform needs to process the list of users who engage with the message and to sum up the vectors in real-time. For a long-lived item, e.g., a topic, the platform may use a batch job since the item already has a lot of user engagements in the past and real-time engagements may just contribute to a small portion of the historical engagements. However, the platform may still use a real-time job to embed a long-lived item if real-time computation is needed.

3. Technical Advantages Brought by the SimClusters Algorithm

The SimClusters algorithm provides a way to embed various kinds of content items into the same space, for example, a space of the communities. By virtue of the sparse, non-negative properties of embedding vectors, the embedding vectors are relatively easy to generate, store and index.

The platform uses the SimClusters algorithm to obtain unified feature representations of various items so that the platform can easily apply machine learning models to learn the high-order interactions between any items.

The algorithm isolates the hard-to-parallelize step of community discovery into the second step of discovering communities from the similarity graph of the right nodes, where the algorithm operates on a smaller graph that may fit into the memory of a single machine. In contrast, the other two steps of community discovery operate on much bigger input, and hence run in batch-distributed settings. Experiments show that the three-step approach does not lead to reduced accuracy compared to directly learning the communities on an input bipartite graph.

The SimClusters algorithm avoids matrix factorization methods that typically require solving massive numerical optimization problems. The algorithm instead relies on a combination of similarity search and community discovery, both of which are easier to scale.

The SimClusters algorithm also uses a new method for community discovery, i.e., the Neighborhood-aware MH algorithm, which is 10 to 100 times faster and 3 to 4 times more accurate than off-the-shelf baselines, and scales easily to graphs with ~10⁹ nodes and ~10¹¹ edges.

V. Technical Management of Topics

1. Human Curators Define Topics

The platform provides a Topic Editor enabling human curators to review and define topics followable by users. Human curators may create topics related to a domain based on their individual knowledge and research about the domain.

Human curators can create new topics in view of recent popular search terms on the platform, i.e., search terms that have been entered by users more than a predetermined number of times on the platform within a predetermined period of time. Human curators may also create new topics in view of events or activities that are currently popular or will be popular soon.

The Topic Editor may further enable curators to modify or remove topics. For example, curators may remove certain topics about events or activities that are no longer popular, e.g., topics for which users post less and less new content or topics in which fewer and fewer users are interested.

2. Human Curators Define Information about a Topic

Human curators can also define information about a topic, including attributes of the topic and one or more keywords defining what the topic is about or relevant to.

The creation of keywords by human curators may also be supported by an Editor Tool to scale data tracking as far as the platform is able to maintain a high precision. The Editor Tool enables curators to suggest additional keywords to track or verify the created keywords for accuracy. For example, the Editor Tool presents the suggested additional keywords in the form of a word cloud. The Editor Tool may also enable human curators to scan the suggested keywords to remove created keywords or add missing keywords, thereby enabling the curators to remove bias and maximize content recall. In some implementations, the platform further audits the generated keywords regularly, thereby achieving a high precision consistently.

3. Human Curators Define Authoritative Producers on a Topic

Human curators can identify authoritative producers who are known for a given topic and likely to post the best content for the topic. As described above in Section III.1.(d), the platform provides an interactive tool enabling a human curator to interactively view clusters of interests, identify one or more topics that a cluster is about, identify clusters of top producers, or label a set of authoritative producers on a particular topic (1340). Information about the identified authoritative producers on a given topic may further be imported into the above described Topic Editor to annotate the topic with these identified producers.

As shown in FIGURE 14, the platform may include an interactive tool that enables curators to browse a space of communities, referred to as an “interest space,” and identify groups of producers as authoritative producers on respective followable topics. That is, content items from the groups of authoritative producers are related to the respective followable topics. Since the definition of followable topics starts in the interest space, the tool creates topics that align with interests that actually exist and are active on the platform.

In addition to the embodiments of the claims specified below and the embodiments described above, the following numbered embodiments are also innovative.

Embodiment 1 is a method comprising storing information about a plurality of topics on a messaging platform, wherein each of the plurality of topics is a predefined topic on the platform, each of the plurality of topics represents a subject of content on the platform, and each of the plurality of topics is an entity distinct from any account of the platform, and each of the plurality of topics is an entity of the platform that the platform enables users of the platform to follow; identifying a set of candidate topics that are likely interesting to a user among the plurality of topics, the user being one of a plurality of users of the platform; generating for display a content presentation interface presenting the set of candidate topics to the user; and receiving from the user a selection of one or more topics to follow among the set of candidate topics.

Embodiment 2 is the method of embodiment 1, wherein the content presentation interface is a profile page of the user, a timeline of the user, a landing page of a topic, or a profile page of another user.

Embodiment 3 is the method of any one of embodiments 1 or 2, wherein identifying the set of candidate topics comprises identifying a sub-topic of a topic expressly followed by the user as one of the set of candidate topics.

Embodiment 4 is the method of any one of embodiments 1 to 3, wherein identifying the set of candidate topics comprises identifying a topic similar to a topic expressly followed by the user as one of the set of candidate topics.

Embodiment 5 is the method of any one of embodiments 1 to 4, comprising: computing similarity scores between topics expressly followed by the user and other topics; and identifying a topic from the other topics as similar to a topic expressly followed by the user according to the similarity scores.

Embodiment 6 is the method of any one of embodiments 1 to 5, wherein identifying the set of candidate topics comprises receiving a search phrase from the user; and identifying a topic associated with the received search phrase as one of the set of candidate topics.

Embodiment 7 is the method of any one of embodiments 1 to 6, wherein identifying the set of candidate topics comprises identifying, as one of the set of candidate topics, a topic sharing a geographical region with the user, a topic sharing a language with the user, or a trending topic.

Embodiment 8 is the method of any one of embodiments 1 to 7, wherein identifying the set of candidate topics comprises detecting a plurality of communities made up of users that are each within a predetermined subset of the plurality of users; and embedding the plurality of users into an embedding space defined by the communities, wherein each dimension of the embedding space corresponds to a distinct one of the plurality of communities, to generate a plurality of user vectors each specifying coordinates of the respective user in the dimensions of the embedding space, and each user vector has at least one non-zero value for a vector element when the user corresponding to the user vector follows or has engaged with at least one of the subset of the plurality of users within the community corresponding to the vector element. The method further comprises embedding the plurality of topics into the embedding space to generate a plurality of topic vectors each specifying coordinates of the respective topic in the dimensions of the embedding space, wherein each topic vector has at least one non-zero value for a vector element when the topic corresponding to the vector element is associated with the community corresponding to the vector element; computing relevance scores indicating measures of relatedness between the plurality of users and the plurality of topics using the plurality of user vectors and the plurality of topic vectors; and identifying as topics that are part of the set of candidate topics one or more topics that have a relevance above a threshold relevance for the user according to the relevance scores. 
Embodiment 9 is the method of embodiment 8, wherein detecting communities comprises generating a bipartite graph from a user-user graph representing the following or engagement relationships between users of the plurality of users, the user-user graph having nodes representing individual users of the plurality of users; generating a similarity graph from the bipartite graph, the similarity graph having nodes representing users in the subset of the plurality of users and edges with weights representing measures of similarity between users in the subset of the plurality of users; and classifying users in the subset of the plurality of users into the plurality of communities.

Embodiment 10 is the method of any one of embodiments 8 to 9, wherein the subset of the plurality of users is a predetermined number of most-followed users within the plurality of users.

Embodiment 11 is the method of any one of embodiments 8 to 10, wherein identifying the set of candidate topics comprises: retrieving user-topic following information about one or more topics followed by the user as input to a machine learning model; retrieving information about the user as one or more user features input to the model; retrieving information about the plurality of topics as one or more topic features input to the model; training the model using the user-topic following information, the one or more user features, and the one or more topic features; and identifying at least one of the set of candidate topics using the trained model.

Embodiment 12 is the method of embodiment 11, wherein the one or more user features include demographic information, the plurality of user vectors in the embedding space, or information derived from a user-content interaction graph representing interaction between the plurality of users and a plurality of content items on the platform; and the one or more topic features include the plurality of topic vectors in the embedding space.

Embodiment 13 is the method of any one of embodiments 1 to 12, wherein the content presentation interface is a landing page of a topic.

Embodiment 14 is the method of embodiment 13, wherein identifying the set of candidate topics comprises identifying a sub-topic of the topic of the landing page, as one of the set of candidate topics.

Embodiment 15 is the method of any one of embodiments 13 or 14, wherein identifying the set of candidate topics comprises identifying a topic similar to the topic of the landing page, as one of the set of candidate topics.

Embodiment 16 is the method of embodiment 15, wherein identifying a topic similar to the topic of the landing page comprises computing similarity scores between the landing topic and other topics; and identifying the at least one topic from the other topics using the similarity scores.

Embodiment 17 is the method of any one of embodiments 1 to 12, wherein the content presentation interface is a profile page of another user.

Embodiment 18 is the method of embodiment 17, wherein identifying the set of candidate topics comprises identifying a topic, of which the other user is an authoritative producer, as one of the set of candidate topics, wherein the authoritative producer has been identified as likely to produce content related to the subject of content represented by the topic.

Embodiment 19 is the method of any one of embodiments 17 or 18, wherein identifying the set of candidate topics comprises selecting a topic followed by one or more followers of the other user, as one of the set of candidate topics.

Embodiment 20 is the method of embodiment 19, wherein the topic followed by the one or more followers is selected by calculating a count of how many users follow both the other user and the topic and filtering out topics for which the count of how many users follow both the other user and the respective topic is larger than a threshold amount, whereby very popular topics are filtered out.

Embodiment 21 is the method of any one of embodiments 17 to 20, wherein identifying the set of candidate topics comprises: detecting a plurality of communities made up of users that are each within a predetermined subset of the plurality of users; and embedding the plurality of users into an embedding space defined by the communities, wherein each dimension of the embedding space corresponds to a distinct one of the plurality of communities, to generate a plurality of user vectors each specifying coordinates of the respective user in the dimensions of the embedding space, each user vector has at least one non-zero value for a vector element when the user corresponding to the user vector follows or has engaged with at least one of the subset of the plurality of users within the community corresponding to the vector element, and the vector element indicates that the user corresponding to the user vector is likely interested in the community corresponding to the vector element. The method further comprises embedding the subset of the plurality of users into the embedding space to generate a plurality of producer vectors each specifying coordinates of the respective producer in the dimensions of the embedding space, and each producer vector has at least one non-zero value for a vector element when at least one user who is likely interested in the community corresponding to the vector element follows or has engaged with the producer corresponding to the producer vector; embedding the plurality of topics into the embedding space to generate a plurality of topic vectors each specifying coordinates of the respective topic in the dimensions of the embedding space, wherein each topic vector has at least one non-zero value for a vector element when the topic corresponding to the vector element is associated with the community corresponding to the vector element; computing relevance scores indicating measures of relatedness between the subset of the plurality of users and the plurality of topics using the plurality of producer vectors and the plurality of topic vectors; and identifying as topics that are part of the set of candidate topics one or more topics that have a relevance above a threshold relevance for the other user according to the relevance scores.

Embodiment 22 is a method comprising identifying a followed topic, a followed topic being a topic expressly followed by a user, wherein the user is one of a plurality of users of a messaging platform, the followed topic is one of a plurality of topics predefined as topics on the platform, each of the plurality of topics represents a subject of content on the platform, and each of the plurality of topics is an entity distinct from any account of the platform, and each of the plurality of topics is an entity of the platform that the platform enables users of the platform to follow. The method further comprises selecting a plurality of related messages, the related messages being messages related to a subject of content represented by the followed topic; and including one or more of the plurality of related messages in a stream of content sent to the user.

Embodiment 23 is the method of embodiment 22, wherein selecting the plurality of related messages comprises identifying an authoritative producer for the followed topic from the plurality of users on the platform, wherein the authoritative producer has been identified as likely to produce content related to the subject of content represented by the followed topic; and selecting one or more messages produced by the authoritative producer as messages of the plurality of related messages.

Embodiment 24 is the method of embodiment 23, wherein identifying the authoritative producer comprises: computing pairwise similarity between a predetermined number of most-followed users in accordance with pointwise mutual information (PMI) of each pair of the predetermined number of most-followed users to generate a PMI matrix, wherein the PMI represents an association measure between the respective pair of the predetermined number of most-followed users; factorizing the PMI matrix to derive a plurality of user interest vectors for the predetermined number of most-followed users; projecting the plurality of user interest vectors into two dimensions to derive a plurality of user clusters; and generating for display a user interface presenting the plurality of user clusters and receiving input identifying the authoritative producer for the followed topic.
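By way of a non-limiting illustration, the PMI computation recited in embodiment 24 can be sketched as follows. The sketch assumes that each most-followed user is represented by the set of accounts following that user, and that probabilities are estimated as simple follower-count ratios; the function name `pmi_matrix` and the sparse dictionary representation are hypothetical choices made for the example, not details taken from this specification. The factorization and two-dimensional projection steps are omitted because they would typically be delegated to a numerical library.

```python
from __future__ import annotations

import math
from itertools import combinations


def pmi_matrix(followers_of: dict[str, set[str]], total_users: int) -> dict[tuple[str, str], float]:
    """Return a sparse, symmetric PMI matrix keyed by (producer_a, producer_b).

    PMI(a, b) = log( P(a, b) / (P(a) * P(b)) ), where P(a) is the fraction of
    users following a and P(a, b) is the fraction following both a and b.
    Pairs with no shared followers are left out of the result (assumption:
    they are treated as having no measurable association).
    """
    pmi: dict[tuple[str, str], float] = {}
    for a, b in combinations(sorted(followers_of), 2):
        shared = len(followers_of[a] & followers_of[b])
        if shared == 0:
            continue
        p_ab = shared / total_users
        p_a = len(followers_of[a]) / total_users
        p_b = len(followers_of[b]) / total_users
        pmi[(a, b)] = math.log(p_ab / (p_a * p_b))
    return pmi


# Hypothetical data: two producers share all their followers, a third shares none.
example = pmi_matrix(
    {"chef": {"u1", "u2"}, "baker": {"u1", "u2"}, "pilot": {"u3", "u4"}},
    total_users=4,
)
```

A positive PMI value indicates that two producers are co-followed more often than independence would predict, which is the association measure the embodiment relies on before factorization.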

Embodiment 25 is the method of any one of embodiments 22 to 24, wherein selecting the plurality of related messages comprises: detecting a plurality of communities made up of users that are each within a predetermined subset of the plurality of users; and embedding the plurality of users into an embedding space defined by the communities, wherein each dimension of the embedding space corresponds to a distinct one of the plurality of communities, to generate a plurality of user vectors each specifying coordinates of the respective user in the dimensions of the embedding space, and each user vector has at least one non-zero value for a vector element when the user corresponding to the user vector follows or has engaged with at least one of the subset of the plurality of users within the community corresponding to the vector element. The method further comprises embedding a group of messages into the embedding space to generate a plurality of message vectors each specifying coordinates of the respective message in the dimensions of the embedding space, wherein each message vector has at least one non-zero value for a vector element when at least one user associated with the community corresponding to the vector element has engaged with the message corresponding to the message vector; embedding the plurality of topics into the embedding space to generate a plurality of topic vectors each specifying coordinates of the respective topic in the dimensions of the embedding space, wherein each topic vector has at least one non-zero value for a vector element when the topic corresponding to the vector element is associated with the community corresponding to the vector element; computing relevance scores indicating measures of relatedness between the group of messages and the followed topic using the plurality of message vectors and the plurality of topic vectors; and identifying, as messages that are part of the plurality of related messages, one or more messages that have a relevance above a threshold relevance to the followed topic according to the relevance scores.
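As one possible illustration of the embedding steps of embodiment 25, the sketch below builds sparse user vectors and message vectors in a community-indexed space. It assumes, purely for the example, that each producer in the predetermined subset belongs to exactly one community, that a user's weight for a community is the count of followed producers in that community, and that a message vector is the sum of the vectors of the users who engaged with the message. The names `user_vector` and `message_vector` are hypothetical.

```python
from collections import defaultdict


def user_vector(followed_producers, producer_community):
    """Sparse user vector: community index -> weight.

    An element is non-zero only when the user follows at least one producer
    that belongs to the corresponding community.
    """
    vec = defaultdict(float)
    for producer in followed_producers:
        community = producer_community.get(producer)
        if community is not None:  # producers outside the subset are ignored
            vec[community] += 1.0
    return dict(vec)


def message_vector(engaging_users, user_vectors):
    """Sparse message vector: sum of the vectors of users who engaged.

    An element is non-zero only when at least one engaging user is associated
    with the corresponding community.
    """
    vec = defaultdict(float)
    for user in engaging_users:
        for community, weight in user_vectors.get(user, {}).items():
            vec[community] += weight
    return dict(vec)


# Hypothetical data: producers p1 and p2 are in community 0, p3 in community 1.
communities = {"p1": 0, "p2": 0, "p3": 1}
users = {
    "alice": user_vector(["p1", "p2"], communities),
    "bob": user_vector(["p3", "unknown"], communities),
}
msg = message_vector(["alice", "bob"], users)
```

Keeping the vectors sparse reflects the property recited above that most vector elements are zero; only communities with at least one following or engagement relationship appear as keys.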

Embodiment 26 is the method of embodiment 25, wherein selecting the plurality of related messages comprises: retrieving information about the user as one or more user features input to the model, wherein the one or more user features include demographic information; retrieving information about the plurality of topics as one or more topic features input to the model; retrieving information about the group of messages as one or more message features input to the model; training the model using the one or more user features, the one or more topic features and the one or more message features; predicting a score for each message of the group of messages using the trained model, wherein the score indicates a likelihood the user is interested in the respective message; and selecting at least one of the plurality of related messages using the predicted scores.

Embodiment 27 is the method of embodiment 26, wherein the one or more topic features include the plurality of topic vectors in the embedding space, and the one or more message features include the plurality of message vectors in the embedding space.

Embodiment 28 is the method of any one of embodiments 22 to 27, wherein including one or more of the plurality of related messages comprises excluding from messages to be included in the stream of content, any messages not having a topical annotation that identifies the followed topic.

Embodiment 29 is the method of embodiment 28, wherein the topical annotation is added for the message upon determining that the message contains one or more keywords predefined for the followed topic.

Embodiment 30 is the method of any one of embodiments 22 to 29, wherein including one or more of the plurality of related messages comprises excluding from messages to be included in the stream of content, any messages not containing one or more keywords predefined for the followed topic as part of the stored information about the followed topic.
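The keyword-based annotation and exclusion described in embodiments 28 to 30 might be sketched as follows. The example assumes that keywords are stored as lowercase single-word tokens and that a message matches a topic when any of its tokens equals a keyword; the tokenization rule and the names `annotate` and `keep_for_topic` are hypothetical simplifications.

```python
import re


def annotate(text, topic_keywords):
    """Return the set of topics whose predefined keywords appear in the text."""
    tokens = set(re.findall(r"[a-z0-9']+", text.lower()))
    return {topic for topic, keywords in topic_keywords.items() if tokens & keywords}


def keep_for_topic(messages, followed_topic, topic_keywords):
    """Exclude messages whose annotation does not identify the followed topic."""
    return [
        message_id
        for message_id, text in messages
        if followed_topic in annotate(text, topic_keywords)
    ]


# Hypothetical keyword lists and messages.
keywords = {"baking": {"sourdough", "yeast"}, "travel": {"flight"}}
kept = keep_for_topic(
    [("m1", "Fresh Sourdough tips"), ("m2", "My flight was delayed")],
    "baking",
    keywords,
)
```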

Embodiment 31 is the method of any one of embodiments 22 to 30, wherein including one or more of the plurality of related messages in a stream of content sent to the user comprises excluding from messages included in the stream of content any messages that have been blocked by the user, any messages authored by a producer that has been muted or blocked by the user, and any messages that have been identified as abusive, spam, trolling or not safe for work.
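The exclusions of embodiment 31 can be illustrated with a simple filter. The sketch assumes each message carries an identifier, an author, and a set of safety labels assigned elsewhere on the platform; the `Message` class, the label strings, and the function `filter_stream` are hypothetical.

```python
from dataclasses import dataclass, field

# Assumed label names for content identified as abusive, spam, trolling, or not safe for work.
UNSAFE_LABELS = frozenset({"abusive", "spam", "trolling", "nsfw"})


@dataclass
class Message:
    id: str
    author: str
    labels: set = field(default_factory=set)


def filter_stream(messages, blocked_message_ids, muted_or_blocked_authors):
    """Keep only messages that pass all three exclusion rules."""
    return [
        m
        for m in messages
        if m.id not in blocked_message_ids
        and m.author not in muted_or_blocked_authors
        and not (m.labels & UNSAFE_LABELS)
    ]


stream = filter_stream(
    [
        Message("m1", "ann"),
        Message("m2", "bob"),            # author muted by the user
        Message("m3", "cat", {"spam"}),  # flagged by the platform
        Message("m4", "dan"),            # message blocked by the user
    ],
    blocked_message_ids={"m4"},
    muted_or_blocked_authors={"bob"},
)
```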

Embodiment 32 is the method of any one of embodiments 22 to 31, wherein selecting the plurality of related messages comprises selecting, as one of the plurality of related messages, a message related to a sub-topic of the followed topic if the platform has recorded historical engagements between the user and content items related to the sub-topic.

Embodiment 33 is the method of any one of embodiments 22 to 32, wherein selecting the plurality of related messages comprises: determining to select messages that were posted recently within a time window; accessing a store of messages posted within a time frame made up of a sequence of periods of time, wherein each message posted within the time frame is stored as being in one of the periods of time, wherein the time window ends at a current time and a most recent period of time covers the current time; for each predetermined period of time, identifying topics related to each of the messages in the period of time, generating a plurality of mappings between the messages and their related topics, and caching the plurality of mappings for the period of time; identifying the periods of time falling within the time window; and selecting a message associated with the followed topic by a cached mapping for any of the identified periods of time, as one of the plurality of related messages.

Embodiment 34 is the method of embodiment 33, wherein generating the plurality of mappings comprises generating the plurality of mappings using a respective topical annotation of each message of the group of messages, wherein the topical annotation identifies one or more topics the respective message is related to.

Embodiment 35 is the method of embodiment 33 or 34, wherein each period of time is one hour, and the time frame is 24 predetermined periods of time.
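One way the per-period caching of embodiments 33 to 35 could be realized is sketched below, using one-hour periods and a 24-period time frame as in embodiment 35. Timestamps are assumed to be seconds since an arbitrary epoch, and the class `TopicMessageCache` with its methods is a hypothetical structure, not one named in this specification. Selection operates at whole-period granularity, so a message posted shortly before the start of the window may still be returned when it shares a period with the window start.

```python
from collections import defaultdict

PERIOD_SECONDS = 3600   # each period of time is one hour
PERIODS_IN_FRAME = 24   # the time frame spans 24 periods


class TopicMessageCache:
    def __init__(self):
        # period index -> topic -> list of message ids (the cached mappings)
        self._buckets = {}

    def add(self, message_id, posted_at, topics):
        """Store the message-to-topic mappings in the period covering posted_at."""
        period = int(posted_at // PERIOD_SECONDS)
        bucket = self._buckets.setdefault(period, defaultdict(list))
        for topic in topics:
            bucket[topic].append(message_id)

    def evict(self, now):
        """Drop periods that have fallen out of the time frame."""
        oldest = int(now // PERIOD_SECONDS) - PERIODS_IN_FRAME + 1
        for period in [p for p in self._buckets if p < oldest]:
            del self._buckets[period]

    def recent(self, topic, now, window_seconds):
        """Return messages mapped to the topic in periods overlapping the window."""
        last = int(now // PERIOD_SECONDS)
        first = int((now - window_seconds) // PERIOD_SECONDS)
        selected = []
        for period in range(first, last + 1):
            selected.extend(self._buckets.get(period, {}).get(topic, []))
        return selected


cache = TopicMessageCache()
cache.add("m1", 36000, ["cooking"])  # falls in period 10
cache.add("m2", 25200, ["cooking"])  # falls in period 7
cache.add("m3", 36050, ["travel"])   # falls in period 10
two_hours = cache.recent("cooking", now=36100, window_seconds=2 * 3600)
four_hours = cache.recent("cooking", now=36100, window_seconds=4 * 3600)
```

Because mappings are cached per period, serving a request only requires reading the buckets that overlap the time window rather than recomputing topic associations for every stored message.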

Embodiment 36 is a method performed by a social media platform. The method comprises obtaining and storing digitally information for accounts, messages, and topics of the platform, wherein the stored information for the accounts includes identifiers unique to respective users of the accounts, the stored information for the messages includes their respective content and authoring accounts, and each of the topics represents a subject of content and is an entity that the platform enables the accounts of the platform to follow, and the stored information for the topics includes (i) data associating known-for accounts with respective topics, a known-for account for a topic being an account determined by the platform as being likely to produce messages interesting to users interested in the topic, (ii) data associating related messages with respective topics, a related message for a topic being a message determined by the platform as likely being related to the topic, (iii) data associating following accounts with respective topics, a following account for a topic being an account that is expressly following the topic, and (iv) data associating interested-in accounts with respective topics, an interested-in account for a topic being an account that is determined by the platform as likely being interested in the topic but not expressly following the topic; receiving a request from a first account of the platform for a timeline, the first account being a following account or an interested-in account for a particular topic; identifying candidate messages with respect to the particular topic from among the related messages for the particular topic and the messages produced by known-for accounts for the particular topic; identifying one or more selected messages to be included in the timeline among the identified candidate messages; generating the timeline that includes the identified one or more selected messages; and sending over a digital network the generated timeline in response to the request.
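The candidate-identification step of embodiment 36 draws from two sources: messages stored as related to the topic, and messages authored by known-for accounts for the topic. A minimal sketch follows, assuming the stored associations are available as in-memory dictionaries; the parameter names and the function `candidate_messages` are hypothetical.

```python
def candidate_messages(topic, related_messages, known_for_accounts, messages_by_author):
    """Combine related messages and known-for authors' messages, without duplicates.

    related_messages:   topic -> list of message ids related to the topic
    known_for_accounts: topic -> list of known-for account ids
    messages_by_author: account id -> list of message ids the account authored
    """
    candidates = []
    seen = set()

    def add(message_id):
        if message_id not in seen:
            seen.add(message_id)
            candidates.append(message_id)

    for message_id in related_messages.get(topic, []):
        add(message_id)
    for account in known_for_accounts.get(topic, []):
        for message_id in messages_by_author.get(account, []):
            add(message_id)
    return candidates


# Hypothetical stored associations.
result = candidate_messages(
    "cooking",
    related_messages={"cooking": ["m1", "m2"]},
    known_for_accounts={"cooking": ["chef"]},
    messages_by_author={"chef": ["m2", "m3"]},
)
```

Subsequent selection, for example by annotation, keyword, or safety filtering, would then operate on the returned candidates.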

Embodiment 37 is the method of embodiment 36, further comprising, based on engagements between accounts and accounts or engagements between accounts and messages, updating the data associating the known-for accounts with respective topics, and the data associating the related messages with respective topics.

Embodiment 38 is the method of embodiment 37, wherein updating the data associating the related messages with respective topics comprises identifying, in a subset of the accounts of the platform, a plurality of communities each associated with one or more accounts in the subset of accounts; embedding accounts into an embedding space defined by the communities, wherein each dimension of the embedding space corresponds to a distinct community, to generate a plurality of account vectors each specifying coordinates of a corresponding account in the embedding space, wherein each account vector has a non-zero value for a vector element when the account corresponding to the account vector follows or has engaged with one of the subset of the accounts within the community corresponding to the vector element and a zero value otherwise; embedding messages into the embedding space to generate message vectors, wherein each message vector has a non-zero value for a vector element when an account associated with the community corresponding to the vector element has engaged with the corresponding message and a zero value otherwise; embedding the plurality of topics into the embedding space to generate a plurality of topic vectors each specifying coordinates of the respective topic in the dimensions of the embedding space, wherein each topic vector has at least one non-zero value for a vector element when the topic corresponding to the vector element is associated with the community corresponding to the vector element; embedding the plurality of topics into the embedding space to generate a plurality of topic vectors, wherein each topic vector has a non-zero value for a vector element when the topic corresponding to the vector is associated with the community corresponding to the vector element and a zero value otherwise; computing relevance scores indicating measures of relatedness between messages and a particular topic using the respective similarities between the plurality of message vectors and the topic vector of the particular topic; and identifying one or more messages that have a relevance above a threshold relevance to the particular topic according to the relevance scores, and updating the data associating the related messages with the particular topic.

Embodiment 39 is the method of embodiment 38, wherein the relevance scores are computed by computing cosine similarities or dot products of the plurality of message vectors and the topic vector of the particular topic.
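For illustration only, the cosine-similarity variant of the relevance score in embodiment 39 and the thresholding of embodiment 38 can be sketched as follows. Vectors are assumed to be sparse dictionaries mapping a community index to a weight; the function names and the example threshold of 0.8 are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors (community index -> weight)."""
    dot = sum(weight * v[c] for c, weight in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as unrelated
    return dot / (norm_u * norm_v)


def related_messages(message_vectors, topic_vector, threshold):
    """Return ids of messages whose relevance to the topic exceeds the threshold."""
    return [
        message_id
        for message_id, vector in message_vectors.items()
        if cosine(vector, topic_vector) > threshold
    ]


# Hypothetical vectors: the topic is associated with communities 0 and 1.
topic = {0: 1.0, 1: 1.0}
messages = {
    "m1": {0: 2.0, 1: 2.0},  # same direction as the topic, similarity 1.0
    "m2": {2: 1.0},          # no shared community, similarity 0.0
    "m3": {0: 1.0},          # partial overlap, similarity about 0.707
}
selected = related_messages(messages, topic, threshold=0.8)
```

Cosine similarity ignores vector magnitude, so a heavily engaged message is not favored merely for having larger weights; a dot product, the other option recited, would retain that magnitude.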

Embodiment 40 is the method of embodiment 38, wherein the subset of accounts comprises a plurality of most-followed accounts that are followed by more than a predetermined number of accounts of the platform, and identifying a plurality of communities each associated with one or more accounts in the subset of accounts comprises identifying the plurality of communities in accordance with following relationships between the accounts of the platform and the subset of accounts.

Embodiment 41 is the method of embodiment 36, wherein identifying one or more selected messages to be included in the timeline comprises selecting, from the identified candidate messages, one or more messages each including a topical annotation indicating the particular topic.

Embodiment 42 is the method of embodiment 36, wherein identifying one or more selected messages to be included in the timeline comprises selecting, from the identified candidate messages, one or more messages each including one or more keywords predefined for the particular topic.

Embodiment 43 is the method of embodiment 36, further comprising, for each message of a plurality of messages, storing data associating with the message a respective topical annotation comprising a reference to a particular topic related to the message.

Embodiment 44 is the method of embodiment 36, wherein identifying one or more selected messages to be included in the timeline comprises identifying only messages that have not been blocked by the first account, only messages authored by an account that has not been muted or blocked by the first account, and only messages that have not been identified by the platform as abusive, spam, trolling, or not safe for work.

Embodiment 45 is the method of embodiment 37, wherein updating the data associating the known-for accounts with respective topics comprises computing pairwise similarity between a predetermined number of most-followed accounts in accordance with pointwise mutual information (PMI) of each pair of the predetermined number of most-followed accounts to generate a PMI matrix, wherein the most-followed accounts are followed by more than a predetermined number of accounts of the platform; factoring the PMI matrix to derive a plurality of user interest vectors for the predetermined number of most-followed accounts, wherein each of the plurality of user interest vectors specifies coordinates of a corresponding most-followed account in an embedding space defined by a plurality of communities, each dimension of the embedding space corresponds to a distinct community, each user interest vector has a non-zero value for a vector element when the account corresponding to the user interest vector follows or has engaged with one of the accounts within the community corresponding to the vector element and a zero value otherwise; and projecting the plurality of user interest vectors into two dimensions to derive a plurality of user clusters, wherein each of the user clusters corresponds to an interest and is associated with a group of most-followed accounts from which one or more known-for accounts for a topic related to the particular interest are selected.

Embodiment 46 is the method of embodiment 45, further comprising generating a user interface for displaying the plurality of user clusters, and receiving from a user input selecting, from a group of most-followed accounts associated with a displayed user cluster corresponding to an interest, one or more known-for accounts for a topic related to the particular interest.

Embodiment 47 is the method of embodiment 45, wherein the plurality of user interest vectors are projected into two dimensions using a Uniform Manifold Approximation and Projection algorithm.

Embodiment 48 is the method of embodiment 36, further comprising displaying a set of candidate topics on a content presentation interface to a respective user of an account; and receiving from the respective user a selection of one or more topics to follow among the set of candidate topics, wherein the content presentation interface is a profile page of the account, a timeline of the account, a landing page of a topic, or a profile page of another account.

Embodiment 49 is a system comprising: one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform the method of any one of embodiments 1 to 48.

Embodiment 50 is a computer storage medium encoded with a computer program, the program comprising instructions that are operable, when executed by data processing apparatus, to cause the data processing apparatus to perform the method of any one of embodiments 1 to 48.

In this specification, for purposes of explanation, numerous specific details are set forth to provide a thorough understanding of the example embodiments. It will be evident, however, to a person skilled in the art, that the example embodiments may be practiced without these specific details.

Throughout this specification, when a feature is described as “including” an element, unless otherwise described, another element may be further included. Also, terms such as “portion,” “module,” etc. may be used herein to indicate a unit for processing at least one function or operation, in which the unit and the block may be embodied as hardware including, for example, and without limitation, hardware processing circuitry (e.g., a CPU, ASIC, etc.), or software, or may be embodied by a combination of hardware and software.

The subject matter and the actions and operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. The subject matter and the actions and operations described in this specification can be implemented as or in one or more computer programs, e.g., one or more modules of computer program instructions, encoded on a computer program carrier, for execution by, or to control the operation of, data processing apparatus. The carrier can be a tangible non-transitory computer storage medium. Alternatively or in addition, the carrier can be an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be or be part of a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. A computer storage medium is not a propagated signal.

The term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. Data processing apparatus can include special-purpose logic circuitry, e.g., an FPGA (field programmable gate array), an ASIC (application specific integrated circuit), or a GPU (graphics processing unit). The apparatus can also include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

A computer program can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program, e.g., as an app, or as a module, component, engine, subroutine, or other unit suitable for executing in a computing environment, which environment may include one or more computers interconnected by a data communication network in one or more locations.

A computer program may, but need not, correspond to a file in a file system. A computer program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code.

The processes and logic flows described in this specification can be performed by one or more computers executing one or more computer programs to perform operations by operating on input data and generating output. The processes and logic flows can also be performed by special-purpose logic circuitry, e.g., an FPGA, an ASIC, or a GPU, or by a combination of special-purpose logic circuitry and one or more programmed computers.

Computers suitable for the execution of a computer program can be based on general or special-purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry.

Generally, a computer will also include, or be operatively coupled to, one or more mass storage devices, and be configured to receive data from or transfer data to the mass storage devices. The mass storage devices can be, for example, magnetic, magneto optical, or optical disks, or solid state drives. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.

To provide for interaction with a user, the subject matter described in this specification can be implemented on one or more computers having, or configured to communicate with, a display device, e.g., an LCD (liquid crystal display) monitor, or a virtual-reality (VR) or augmented-reality (AR) display, for displaying information to the user, and an input device by which the user can provide input to the computer, e.g., a keyboard and a pointing device, e.g., a mouse, a trackball or touchpad. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback and responses provided to the user can be any form of sensory feedback, e.g., visual, auditory, speech or tactile; and input from the user can be received in any form, including acoustic, speech, or tactile input, including touch motion or gestures, or kinetic motion or gestures or orientation motion or gestures. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user, for example, by sending web pages to a web browser on a user’s device in response to requests received from the web browser, or by interacting with an app running on a user device, e.g., a smartphone or electronic tablet. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.

This specification uses the term “configured to” in connection with systems, apparatus, and computer program components. That a system of one or more computers is configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. That one or more computer programs is configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions. That special-purpose logic circuitry is configured to perform particular operations or actions means that the circuitry has electronic logic that performs the operations or actions.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of what is being claimed, which is defined by the claims themselves, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claim may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this by itself should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.


Claims

WHAT IS CLAIMED IS:

1. A method performed by a social media platform, the method comprising: obtaining and storing digitally information for accounts, messages, and topics of the platform, wherein: the stored information for the accounts includes identifiers unique to respective users of the accounts, the stored information for the messages includes their respective content and authoring accounts, and each of the topics represents a subject of content and is an entity that the platform enables the accounts of the platform to follow, and the stored information for the topics includes

(i) data associating known-for accounts with respective topics, a known-for account for a topic being an account determined by the platform as being likely to produce messages interesting to users interested in the topic,

(ii) data associating related messages with respective topics, a related message for a topic being a message determined by the platform as likely being related to the topic,

(iii) data associating following accounts with respective topics, a following account for a topic being an account that is expressly following the topic, and

(iv) data associating interested-in accounts with respective topics, an interested-in account for a topic being an account that is determined by the platform as likely being interested in the topic but not expressly following the topic;

receiving a request from a first account of the platform for a timeline, the first account being a following account or an interested-in account for a particular topic;

identifying candidate messages with respect to the particular topic from among the related messages for the particular topic and the messages produced by known-for accounts for the particular topic;

identifying one or more selected messages to be included in the timeline among the identified candidate messages;

generating the timeline that includes the identified one or more selected messages; and

sending over a digital network the generated timeline in response to the request.


2. The method of claim 1, further comprising, based on engagements between accounts and accounts or engagements between accounts and messages, updating the data associating the known-for accounts with respective topics, and the data associating the related messages with respective topics.

3. The method of claim 2, wherein updating the data associating the related messages with respective topics comprises: identifying, in a subset of the accounts of the platform, a plurality of communities each associated with one or more accounts in the subset of the accounts; embedding the accounts into an embedding space defined by the communities, wherein each dimension of the embedding space corresponds to a distinct community, to generate a plurality of account vectors each specifying coordinates of a corresponding account in the embedding space, wherein each account vector has a non-zero value for a vector element when the account corresponding to the account vector follows or has engaged with one of the subset of the accounts within the community corresponding to the vector element and a zero value otherwise; embedding the messages into the embedding space to generate message vectors, wherein each message vector has a non-zero value for a vector element when an account associated with the community corresponding to the vector element has engaged with a message corresponding to the message vector and a zero value otherwise; embedding the topics into the embedding space to generate a plurality of topic vectors each specifying coordinates of the respective topic in the dimensions of the embedding space, wherein each topic vector has at least one non-zero value for a vector element when the topic corresponding to the vector element is associated with the community corresponding to the vector element; embedding the topics into the embedding space to generate a plurality of topic vectors, wherein each topic vector has a non-zero value for a vector element when the topic corresponding to the vector is associated with the community corresponding to the vector element and a zero value otherwise; computing relevance scores indicating measures of relatedness between messages and a particular topic using respective similarities between the message vectors and the topic vector of the particular topic; and identifying one or more messages that have a relevance above a threshold relevance to the particular topic according to the relevance scores, and updating the data associating the related messages with the particular topic.

4. The method of claim 3, wherein the relevance scores are computed by computing a cosine similarity or a dot product of each of the message vectors and the topic vector of the particular topic.
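The relevance scoring recited in claims 3 and 4 can be illustrated with a minimal sketch. It assumes that message and topic vectors are already available as plain lists of floats with one element per community. The function names, the sample vectors, and the threshold value of 0.6 are hypothetical and are not taken from the specification.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_messages(message_vectors, topic_vector, threshold):
    """Return {message_id: score} for messages scoring above the threshold."""
    scores = {
        message_id: cosine_similarity(vector, topic_vector)
        for message_id, vector in message_vectors.items()
    }
    return {mid: score for mid, score in scores.items() if score > threshold}


# Hypothetical embedding space with three communities.
topic_vector = [1.0, 0.0, 1.0]
message_vectors = {
    "m1": [2.0, 0.0, 2.0],  # engaged with by the same communities as the topic
    "m2": [0.0, 3.0, 0.0],  # engaged with only by an unrelated community
    "m3": [1.0, 1.0, 0.0],  # partial overlap with the topic's communities
}

related = related_messages(message_vectors, topic_vector, threshold=0.6)
```

In this example only "m1" exceeds the assumed threshold. Claim 4 also permits a dot product in place of cosine similarity, in which case the scores are not normalized by vector length.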

5. The method of claim 3 or claim 4, wherein the subset of the accounts comprises a plurality of most-followed accounts that are followed by more than a predetermined number of accounts of the platform, and identifying a plurality of communities each associated with one or more accounts in the subset of the accounts comprises identifying the plurality of communities in accordance with following relationships between the accounts of the platform and the subset of the accounts.

6. The method of any one of the claims 1 to 5, wherein identifying one or more selected messages to be included in the timeline comprises selecting, from the identified candidate messages, one or more messages each including a topical annotation indicating the particular topic.

7. The method of any one of the claims 1 to 6, wherein identifying one or more selected messages to be included in the timeline comprises selecting, from the identified candidate messages, one or more messages each including one or more keywords predefined for the particular topic.

8. The method of any one of the claims 1 to 7, further comprising, for each message of a plurality of messages, storing data associating with the message a respective topical annotation comprising a reference to a particular topic related to the message.

9. The method of any one of the claims 1 to 8, wherein identifying one or more selected messages to be included in the timeline comprises identifying only messages that have not been blocked by the first account, only messages authored by an account that has not been muted or blocked by the first account, and only messages that have not been identified by the platform as abusive, spam, trolling, or not safe for work.
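The eligibility conditions of claim 9 amount to a conjunction of three checks on each candidate message. The following is a minimal sketch of that filter; the `Message` and `ViewerState` types, their field names, and the label strings are hypothetical stand-ins, not data structures described in the specification.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    message_id: str
    author_id: str
    # Labels the platform has attached to the message (hypothetical names).
    platform_labels: set = field(default_factory=set)


@dataclass
class ViewerState:
    # State of the first account, i.e. the account viewing the timeline.
    blocked_message_ids: set = field(default_factory=set)
    muted_or_blocked_author_ids: set = field(default_factory=set)


# Assumed label vocabulary mirroring the categories named in claim 9.
DISALLOWED_LABELS = {"abusive", "spam", "trolling", "not_safe_for_work"}


def is_eligible(message, viewer):
    """True only if the message passes all three conditions of claim 9."""
    if message.message_id in viewer.blocked_message_ids:
        return False
    if message.author_id in viewer.muted_or_blocked_author_ids:
        return False
    if message.platform_labels & DISALLOWED_LABELS:
        return False
    return True


def filter_candidates(candidates, viewer):
    """Keep the candidate messages that are eligible for the timeline."""
    return [m for m in candidates if is_eligible(m, viewer)]


viewer = ViewerState(
    blocked_message_ids={"m2"},
    muted_or_blocked_author_ids={"author_c"},
)
candidates = [
    Message("m1", "author_a"),
    Message("m2", "author_b"),            # blocked by the first account
    Message("m3", "author_c"),            # author muted or blocked
    Message("m4", "author_d", {"spam"}),  # flagged by the platform
]
selected_ids = [m.message_id for m in filter_candidates(candidates, viewer)]
```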

10. The method of any one of the claims 2 to 9, wherein updating the data associating the known-for accounts with respective topics comprises: computing pairwise similarity between a predetermined number of most-followed accounts in accordance with pointwise mutual information (PMI) of each pair of the predetermined number of most-followed accounts to generate a PMI matrix, wherein the most-followed accounts are followed by more than a predetermined number of accounts of the platform; factoring the PMI matrix to derive a plurality of user interest vectors for the predetermined number of most-followed accounts, wherein each of the plurality of user interest vectors specifies coordinates of a corresponding most-followed account in an embedding space defined by a plurality of communities, each dimension of the embedding space corresponds to a distinct community, each user interest vector has a non-zero value for a vector element when the account corresponding to the user interest vector follows or has engaged with one of the accounts within the community corresponding to the vector element and a zero value otherwise; and projecting the plurality of user interest vectors into two dimensions to derive a plurality of user clusters, wherein each of the user clusters corresponds to a particular interest and is associated with a group of most-followed accounts from which one or more known-for accounts for a topic related to the particular interest are selected.
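The first step of claim 10, building a PMI matrix over most-followed accounts, can be sketched as follows. The sketch assumes that each most-followed account is represented by the set of accounts that follow it and that probabilities are estimated as simple fractions of the total number of accounts. The function name, the sample data, and the choice to record 0.0 for pairs with no common follower (and on the diagonal) are assumptions made for illustration. The factoring and projection steps are not shown.

```python
import math


def pmi_matrix(follower_sets, total_accounts):
    """Pairwise PMI between accounts, estimated from follower co-occurrence.

    PMI(a, b) = log( P(a, b) / (P(a) * P(b)) ), where P(a) is the fraction of
    all accounts following a and P(a, b) the fraction following both.
    Assumption: pairs with no shared follower, and the diagonal, are set to 0.0.
    """
    n = len(follower_sets)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(follower_sets[i] & follower_sets[j])
            if shared == 0:
                continue
            p_i = len(follower_sets[i]) / total_accounts
            p_j = len(follower_sets[j]) / total_accounts
            p_ij = shared / total_accounts
            value = math.log(p_ij / (p_i * p_j))
            matrix[i][j] = value
            matrix[j][i] = value  # PMI is symmetric
    return matrix


# Hypothetical platform with 8 accounts and three most-followed accounts.
follower_sets = [
    {1, 2, 3, 4},  # followers of account A
    {1, 2, 3, 5},  # followers of account B (heavy overlap with A)
    {6, 7},        # followers of account C (no overlap with A or B)
]
pmi = pmi_matrix(follower_sets, total_accounts=8)
```

Here accounts A and B share three followers, so their PMI is positive, while account C shares no follower with either and its entries stay at the assumed default of 0.0.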

11. The method of claim 10, further comprising generating a user interface for displaying the plurality of user clusters, and receiving, from a user, input selecting, from a group of most-followed accounts associated with a displayed user cluster corresponding to a particular interest, one or more known-for accounts for a topic related to the particular interest.

12. The method of claim 10 or claim 11, wherein the plurality of user interest vectors are projected into two dimensions using a Uniform Manifold Approximation and Projection algorithm.

13. The method of any one of the claims 1 to 12, further comprising: displaying a set of candidate topics on a content presentation interface to a respective user of an account; and receiving from the respective user a selection of one or more topics to follow among the set of candidate topics, wherein the content presentation interface is a profile page of the account, a timeline of the account, a landing page of a topic, or a profile page of another account.

14. A system comprising: one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform the method of any one of the claims 1 to 13.

15. A computer storage medium encoded with a computer program, the program comprising instructions that are operable, when executed by a data processing apparatus, to cause the data processing apparatus to perform the method of any one of the claims 1 to 13.
