
CN111602147B - Machine learning model based on non-local neural network

Info

Publication number: CN111602147B
Application number: CN201880086764.3A
Authority: CN
Inventors: 何恺明; 罗斯·格尔希克; 王晓龙
Original assignee: Meta Platforms Inc
Current assignee: Meta Platforms Inc
Priority date: 2017-11-17
Filing date: 2018-11-16
Publication date: 2023-07-07
Anticipated expiration: 2038-11-16
Also published as: EP3710998A1; WO2019099805A1; US20190156210A1; CN111602147A; EP3710998A4; US11562243B2

Abstract

In one embodiment, a method includes: training a baseline machine learning model based on a neural network comprising a plurality of phases, wherein each phase comprises a plurality of neural blocks; accessing a plurality of training samples each including a plurality of content objects; determining one or more non-local operations, wherein each non-local operation is based on one or more pairwise functions and one or more unary functions; generating one or more non-local blocks based on the plurality of training samples and the one or more non-local operations; determining a phase from the plurality of phases of the neural network; and training a non-local machine learning model by inserting each of the one or more non-local blocks between at least two of the plurality of neural blocks in the determined phase of the neural network.

Description

Machine learning model based on non-local neural network

Technical Field

The present disclosure relates generally to image and video analysis using machine learning within a network environment, and in particular to hardware and software for a smart assistant system.

Background

The assistant system may provide information or services on behalf of the user based on a combination of user input, location awareness, and the ability to access information from various online sources (e.g., weather conditions, traffic congestion, news, stock prices, user schedules, retail prices, etc.). The user input may include text (e.g., online chat), especially text in an instant messaging application or other application, voice, images, or a combination thereof. The assistant system can perform concierge-type services (e.g., booking dinner, purchasing event tickets, scheduling travel) or provide information based on user input. The assistant system can also perform administrative or data processing tasks based on online information and events without user initiation or interaction. Examples of tasks that may be performed by the assistant system include calendar management (e.g., alerting a dinner companion that the user is running late due to traffic conditions, updating both parties' calendars, and changing the restaurant reservation). The assistant system can be implemented through a combination of computing devices, application programming interfaces (APIs), and the proliferation of applications on user devices.

A social networking system, which may include a social networking website, may enable its users (e.g., individuals or organizations) to interact with it and with each other through it. The social networking system may utilize input from the user to create and store a user profile associated with the user in the social networking system. The user profile may include demographic information of the user, communication channel information, and information about personal interests. The social networking system may also, with input from the user, create and store a record of the user's relationships with other users of the social networking system, as well as provide services (e.g., profile/news feed posts, photo sharing, event organization, messaging, games, or advertisements) to facilitate social interactions between or among the users.

The social networking system may send content or messages related to its services to the user's mobile device or other computing device over one or more networks. The user may also install a software application on the user's mobile device or other computing device for accessing the user's user profile and other data within the social networking system. The social networking system may generate a personalized set of content objects for display to the user, such as a news feed of aggregated stories of other users connected to the user.

Summary of particular embodiments

In particular embodiments, the assistant system may assist the user in obtaining information or services. The assistant system can enable the user to interact with it through multimodal user input (e.g., voice, text, images, video) in stateful and multi-turn conversations to obtain assistance. The assistant system can create and store a user profile that includes personal information and contextual information associated with the user. In particular embodiments, the assistant system can use natural language understanding to analyze user input. The analysis may be based on the user profile to obtain a more personalized and context-aware understanding. The assistant system can resolve entities associated with the user input based on the analysis. In particular embodiments, the assistant system may interact with different agents to obtain information or services associated with the resolved entities. The assistant system can generate a response for the user regarding the information or services by using natural language generation. Through interaction with the user, the assistant system can use dialog management techniques to manage and advance the conversation flow with the user. In particular embodiments, the assistant system may also help the user effectively and efficiently digest the obtained information by summarizing it. The assistant system may also help users better participate in the online social network by providing tools that help users interact with the online social network (e.g., create posts, comments, messages). The assistant system may additionally assist the user in managing different tasks, such as keeping track of events. In particular embodiments, the assistant system may proactively perform tasks related to user interests and preferences based on the user profile without user input. In particular embodiments, the assistant system may check privacy settings to ensure that accessing the user's profile or other user information and performing different tasks are permitted under the user's privacy settings.

In particular embodiments, the assistant system can use one or more non-local machine learning models to analyze content objects including one or more of speech, text, images, video, or a combination thereof. The non-local machine learning model may be based on a deep neural network. In deep neural networks, convolutional operations and recurrent operations may be building blocks that each process one local neighborhood at a time. In particular embodiments, the non-local machine learning model may use non-local operations as a generic family of building blocks for capturing long-range dependencies. The non-local machine learning model may use a non-local operation to compute the response at one location as a weighted sum of the features at all locations. These building blocks may be inserted into many computer vision architectures. In particular embodiments, a non-local machine learning model may be applied to the task of video classification. By way of example and not by way of limitation, the non-local machine learning model may compete with or outperform current competition winners on the Kinetics and Charades datasets (two public datasets), even without any bells and whistles. In particular embodiments, non-local machine learning models may also be applied to static image recognition. By way of example and not by way of limitation, the non-local machine learning model may improve object detection/segmentation and pose estimation on the COCO (a public dataset) suite of tasks. Although this disclosure describes a particular machine learning model based on a particular building block in a particular manner, this disclosure contemplates any suitable machine learning model based on any suitable building block in any suitable manner.

In particular embodiments, the assistant system can train a baseline machine learning model based on a neural network that includes multiple phases. Each phase may include a plurality of neural blocks. The assistant system can then access a plurality of training samples that each include a plurality of content objects. In particular embodiments, the assistant system can determine one or more non-local operations. Each non-local operation may be based on one or more pairwise functions and one or more unary functions. In particular embodiments, the assistant system can generate one or more non-local blocks based on the plurality of training samples and the one or more non-local operations. The assistant system can then determine a phase from among the multiple phases of the neural network. In particular embodiments, the assistant system can also train the non-local machine learning model by inserting each of the one or more non-local blocks between at least two of the plurality of neural blocks in the determined phase of the neural network.

Certain embodiments disclosed herein may provide one or more technical advantages. Technical advantages of these embodiments may include directly capturing long-range dependencies in a deep neural network by computing interactions between any two locations using non-local operations, regardless of the distance between them. Another technical advantage of these embodiments may include achieving the best results with non-local operations even when they are used in only a few layers of the deep neural network. Another technical advantage of these embodiments may include maintaining a variable input size through non-local operations and easily combining the non-local operations with other operations (e.g., convolution). Some embodiments disclosed herein may provide none, some, or all of the above technical advantages. One or more other technical advantages may be readily apparent to one skilled in the art in view of the figures, descriptions, and claims of the present disclosure.

The embodiments disclosed herein are merely examples and the scope of the present disclosure is not limited to them. Particular embodiments may include all, some, or none of the components, elements, features, functions, operations, or steps of the embodiments disclosed herein. Embodiments according to the invention are specifically disclosed in the appended claims directed to methods, storage media, systems, assistant systems and computer program products, wherein any feature mentioned in one claim category (e.g., methods) may also be claimed in another claim category (e.g., systems). The dependencies or references back in the appended claims are chosen for formal reasons only. However, any subject matter resulting from a deliberate reference back to any preceding claims (in particular multiple dependencies) may also be claimed, such that any combination of claims and their features is disclosed and may be claimed, irrespective of the dependencies selected in the appended claims. The subject matter which may be claimed includes not only the combinations of features set forth in the appended claims, but also any other combination of features in the claims, wherein each feature mentioned in the claims may be combined with any other feature or combination of features in the claims. Furthermore, any of the embodiments and features described or depicted herein may be claimed in a separate claim and/or in any combination with any of the embodiments or features described or depicted herein or with any of the features of the appended claims.

In an embodiment, a method, in particular for use in an assistant system that assists a user in obtaining information or services by enabling the user to interact with the assistant system in a conversation using user input including voice, text, images, or video, or any combination thereof, the assistant system being implemented in particular by a combination of computing devices, application programming interfaces (APIs), and the proliferation of applications on user devices, may include, by one or more computing systems:

training a baseline machine learning model based on a neural network comprising a plurality of phases, wherein each phase comprises a plurality of neural blocks;

accessing a plurality of training samples each including a plurality of content objects;

determining one or more non-local operations, wherein each non-local operation is based on one or more pairwise functions and one or more unary functions;

generating one or more non-local blocks based on the plurality of training samples and the one or more non-local operations;

determining a phase from a plurality of phases of the neural network; and

training a non-local machine learning model by inserting each of the one or more non-local blocks between at least two of the plurality of neural blocks in the determined phase of the neural network.
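The structural part of the last step, inserting a block between two existing neural blocks of a chosen phase, can be sketched as follows. This is a minimal illustration only: the network is modeled as a list of phases, each a list of callables, and the names `insert_block`, `run`, `double`, `add_one`, and `center` are hypothetical stand-ins rather than anything defined in this disclosure. Training itself is not shown.

```python
def insert_block(phases, phase_index, position, block):
    """Return a copy of `phases` with `block` inserted between two existing blocks."""
    phase = phases[phase_index]
    if not 0 < position < len(phase):
        raise ValueError("position must fall between two existing blocks")
    new_phase = phase[:position] + [block] + phase[position:]
    return phases[:phase_index] + [new_phase] + phases[phase_index + 1:]


def run(phases, x):
    """Apply every block of every phase in order."""
    for phase in phases:
        for block in phase:
            x = block(x)
    return x


# Hypothetical stand-ins for neural blocks (local, element-wise operations).
def double(x):
    return [2.0 * v for v in x]


def add_one(x):
    return [v + 1.0 for v in x]


# Hypothetical stand-in for a non-local block: every output depends on all positions.
def center(x):
    mean = sum(x) / len(x)
    return [v - mean for v in x]


baseline = [[double, add_one], [double, add_one, double]]
non_local_net = insert_block(baseline, phase_index=1, position=1, block=center)
result = run(non_local_net, [1.0, 2.0, 3.0])
```

The baseline list is left unmodified, so the baseline and the modified network can be compared side by side.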

The neural network may include one or more of a convolutional neural network or a recurrent neural network.

Each of the plurality of content objects may include one or more of text, an audio clip, an image, or a video.

The neural network may be based on one or more of a two-dimensional architecture or a three-dimensional architecture.

In an embodiment, a method may include:

generating a plurality of feature representations of the plurality of content objects, respectively, based on the baseline machine learning model.

Generating each of the one or more non-local blocks may include:

applying each of the one or more non-local operations to the feature representation of one of the plurality of content objects.

In an embodiment, a method may include:

determining, for each of the plurality of content objects, an output location and a plurality of locations associated with the output location.

The output location may be in one or more of space, time, or spacetime.

Each of the one or more non-local operations may be based on the function

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$$

wherein:

$x_i$ may indicate a feature representation at the output location;

$x_j$ may indicate a feature representation at one of the plurality of locations;

$y_i$ may indicate an output response at the output location;

$f(x_i, x_j)$ may indicate a pairwise function;

$g(x_j)$ may indicate a unary function; and

$C(x)$ may indicate a normalization factor.
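The operation defined by these symbols, an output response formed as a normalized, pairwise-weighted sum of unary-transformed features over all locations, can be sketched in plain Python. This is a minimal sketch under stated assumptions: the normalization factor is taken to be the sum of the pairwise weights, and the pairwise function `gaussian` and unary function `identity` are illustrative choices, not the only ones the disclosure contemplates.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def non_local(x, f, g):
    """Compute y_i = (1 / C(x)) * sum_j f(x_i, x_j) * g(x_j) for every location i.

    Assumption: C(x) = sum_j f(x_i, x_j).
    """
    out = []
    for x_i in x:
        weights = [f(x_i, x_j) for x_j in x]   # pairwise weights to every location j
        c = sum(weights)                       # assumed normalization factor
        values = [g(x_j) for x_j in x]         # unary transform of every location j
        dim = len(values[0])
        out.append([sum(w * v[d] for w, v in zip(weights, values)) / c
                    for d in range(dim)])
    return out


def gaussian(x_i, x_j):
    # Illustrative pairwise function: exp(x_i . x_j)
    return math.exp(dot(x_i, x_j))


def identity(x_j):
    # Illustrative unary function
    return list(x_j)


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
responses = non_local(features, gaussian, identity)
```

Because every location j contributes to every output i, the number of locations may vary from input to input without changing the code.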

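The pairwise function may be based on one or more of the following:

a Gaussian function

$$f(x_i, x_j) = e^{x_i^{T} x_j};$$

an embedded Gaussian function

$$f(x_i, x_j) = e^{\theta(x_i)^{T} \phi(x_j)},$$

wherein $\theta$ is an embedding of $x_i$ and $\phi$ is an embedding of $x_j$;

a dot product function

$$f(x_i, x_j) = \theta(x_i)^{T} \phi(x_j);$$

or

a concatenation function

$$f(x_i, x_j) = \mathrm{ReLU}\left(w_f^{T}\left[\theta(x_i), \phi(x_j)\right]\right),$$

wherein ReLU indicates the rectified linear unit function, and wherein $w_f$ is a weight vector that projects the concatenation of $\theta(x_i)$ and $\phi(x_j)$ to a scalar.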
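The four pairwise functions named above can be sketched as plain Python functions. This is an illustrative sketch that assumes the embeddings θ and φ are linear maps given as weight matrices (`w_theta`, `w_phi`), which is one possible choice. All helper names and toy values are hypothetical.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def linear(w, x):
    """Apply a weight matrix `w` (a list of rows) to a vector `x`."""
    return [dot(row, x) for row in w]


def gaussian(x_i, x_j):
    return math.exp(dot(x_i, x_j))


def embedded_gaussian(x_i, x_j, w_theta, w_phi):
    return math.exp(dot(linear(w_theta, x_i), linear(w_phi, x_j)))


def dot_product(x_i, x_j, w_theta, w_phi):
    return dot(linear(w_theta, x_i), linear(w_phi, x_j))


def concatenation(x_i, x_j, w_theta, w_phi, w_f):
    # Concatenate the two embeddings, project to a scalar with w_f, then apply ReLU.
    concat = linear(w_theta, x_i) + linear(w_phi, x_j)
    return max(0.0, dot(w_f, concat))


# Toy values (hypothetical): identity embeddings and an all-ones projection vector.
x_i, x_j = [1.0, 2.0], [0.5, -1.0]
eye = [[1.0, 0.0], [0.0, 1.0]]
w_f = [1.0, 1.0, 1.0, 1.0]

g_val = gaussian(x_i, x_j)
eg_val = embedded_gaussian(x_i, x_j, eye, eye)
dp_val = dot_product(x_i, x_j, eye, eye)
cat_val = concatenation(x_i, x_j, eye, eye, w_f)
```

With identity embeddings the embedded Gaussian reduces to the plain Gaussian, which makes the relationship between the two forms easy to check.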

In an embodiment, a method may include:

applying, for each of the plurality of content objects, sub-sampling to the feature representation of the content object to generate a sub-sampled content object, the sub-sampled content object being associated with a sub-sampled feature representation.

Sub-sampling may include pooling, which may include one or more of maximum pooling or average pooling.
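As a small illustration of pooling as a sub-sampling step, the sketch below pools a one-dimensional sequence of scalar features over non-overlapping windows. The function name `pool`, the window size, and the sample values are hypothetical. Real feature maps would typically have more dimensions.

```python
def pool(features, window, mode="max"):
    """Sub-sample a 1-D sequence of scalar features using non-overlapping windows."""
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    pooled = []
    for start in range(0, len(features), window):
        chunk = features[start:start + window]
        pooled.append(max(chunk) if mode == "max" else sum(chunk) / len(chunk))
    return pooled


signal = [1.0, 3.0, 2.0, 8.0, 5.0, 4.0]
max_pooled = pool(signal, 2, "max")  # [3.0, 8.0, 5.0]
avg_pooled = pool(signal, 2, "avg")  # [2.0, 5.0, 4.5]
```

Either mode halves the number of locations here, which reduces the number of pairwise terms a later non-local operation must evaluate.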

Generating each of the one or more non-local blocks may include:

applying each of the one or more non-local operations to the feature representation of one of the plurality of content objects and to the sub-sampled feature representation of the sub-sampled content object corresponding to that content object.

In an embodiment, a method may include:

determining an output location for each of the plurality of content objects; and

determining, for each of the plurality of sub-sampled content objects corresponding to the content objects, a plurality of locations associated with the output location.

Each of the one or more non-local operations may be based on the function

$$y_i = \frac{1}{C(\hat{x})} \sum_{\forall j} f(x_i, \hat{x}_j)\, g(\hat{x}_j)$$

wherein:

$x_i$ may indicate a feature representation at the output location;

$\hat{x}_j$ may indicate a sub-sampled feature representation at one of the plurality of locations;

$y_i$ may indicate an output response at the output location;

$f(x_i, \hat{x}_j)$ may indicate a pairwise function;

$g(\hat{x}_j)$ may indicate a unary function; and

$C(\hat{x})$ may indicate a normalization factor.
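The sub-sampled variant keeps one output response per original location but sums over the pooled locations only. The sketch below illustrates this under the same assumptions as before: the normalization factor is taken to be the sum of the pairwise weights, a Gaussian pairwise function and an identity unary function are used as examples, and element-wise max pooling over groups of feature vectors stands in for the sub-sampling step. All names are hypothetical.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def max_pool(x, window):
    """Element-wise max over non-overlapping groups of `window` feature vectors."""
    pooled = []
    for start in range(0, len(x), window):
        group = x[start:start + window]
        pooled.append([max(column) for column in zip(*group)])
    return pooled


def subsampled_non_local(x, x_hat, f, g):
    """y_i = (1 / C(x_hat)) * sum_j f(x_i, x_hat_j) * g(x_hat_j).

    Assumption: C(x_hat) = sum_j f(x_i, x_hat_j).
    """
    out = []
    for x_i in x:
        weights = [f(x_i, xh_j) for xh_j in x_hat]
        c = sum(weights)
        values = [g(xh_j) for xh_j in x_hat]
        dim = len(values[0])
        out.append([sum(w * v[d] for w, v in zip(weights, values)) / c
                    for d in range(dim)])
    return out


def gaussian(x_i, x_j):
    return math.exp(dot(x_i, x_j))


def identity(x_j):
    return list(x_j)


x = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [0.0, 3.0]]
x_hat = max_pool(x, 2)  # two pooled locations instead of four
y = subsampled_non_local(x, x_hat, gaussian, identity)
```

The output still has four responses, one per original location, while each response is computed from only two pooled locations.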

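The pairwise function may be based on one or more of the following:

a Gaussian function

$$f(x_i, \hat{x}_j) = e^{x_i^{T} \hat{x}_j};$$

an embedded Gaussian function

$$f(x_i, \hat{x}_j) = e^{\theta(x_i)^{T} \phi(\hat{x}_j)},$$

wherein $\theta$ is an embedding of $x_i$ and $\phi$ is an embedding of $\hat{x}_j$;

a dot product function

$$f(x_i, \hat{x}_j) = \theta(x_i)^{T} \phi(\hat{x}_j);$$

or

a concatenation function

$$f(x_i, \hat{x}_j) = \mathrm{ReLU}\left(w_f^{T}\left[\theta(x_i), \phi(\hat{x}_j)\right]\right),$$

wherein ReLU indicates the rectified linear unit function, and wherein $w_f$ is a weight vector that projects the concatenation of $\theta(x_i)$ and $\phi(\hat{x}_j)$ to a scalar.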

In an embodiment, a method may include:

receiving a query content object; and

determining a category of the query content object based on the non-local machine learning model.

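In an embodiment, one or more computer-readable non-transitory storage media may embody software that, when executed, is operable to:

train a baseline machine learning model based on a neural network comprising a plurality of phases, wherein each phase comprises a plurality of neural blocks;

access a plurality of training samples each including a plurality of content objects;

determine one or more non-local operations, wherein each non-local operation is based on one or more pairwise functions and one or more unary functions;

generate one or more non-local blocks based on the plurality of training samples and the one or more non-local operations;

determine a phase from the plurality of phases of the neural network; and

train a non-local machine learning model by inserting each of the one or more non-local blocks between at least two of the plurality of neural blocks in the determined phase of the neural network.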

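In an embodiment, a system may include: one or more processors and non-transitory memory coupled to the processors, the memory including instructions executable by the processors, the processors operable when executing the instructions to:

train a baseline machine learning model based on a neural network comprising a plurality of phases, wherein each phase comprises a plurality of neural blocks;

access a plurality of training samples each including a plurality of content objects;

determine one or more non-local operations, wherein each non-local operation is based on one or more pairwise functions and one or more unary functions;

generate one or more non-local blocks based on the plurality of training samples and the one or more non-local operations;

determine a phase from the plurality of phases of the neural network; and

train a non-local machine learning model by inserting each of the one or more non-local blocks between at least two of the plurality of neural blocks in the determined phase of the neural network.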

In an embodiment, one or more computer-readable non-transitory storage media may embody software that, when executed, is operable to perform a method according to the present invention or any of the above-mentioned embodiments.

In an embodiment, a system may include: one or more processors; and at least one memory coupled to the processor and including instructions executable by the processor, the processor being operable when executing the instructions to perform a method according to the invention or any of the above-mentioned embodiments.

In an embodiment, a computer program product, preferably comprising a computer readable non-transitory storage medium, is operable when executed on a data processing system to perform a method according to the invention or any of the above mentioned embodiments.

In an embodiment, an assistant system for assisting a user in obtaining information or services by enabling the user to interact with the assistant system in a conversation using user input including voice, text, images, or video, or any combination thereof, the assistant system being implemented in particular by a combination of computing devices, application programming interfaces (APIs), and the proliferation of applications on user devices, may comprise: one or more processors; and a non-transitory memory coupled to the processors, the memory comprising instructions executable by the processors, the processors being operable when executing the instructions to perform a method according to the invention or any of the above-mentioned embodiments.

In embodiments, the assistant system may assist the user by performing at least one or more of the following features or steps:

- creating and storing a user profile comprising personal information and contextual information associated with the user;

- analyzing user input using natural language understanding, wherein the analysis may be based on the user profile to obtain a more personalized and context-aware understanding;

- resolving entities associated with the user input based on the analysis;

- interacting with different agents to obtain information or services associated with the resolved entities;

- generating a response for the user regarding the information or services by using natural language generation;

- through interaction with the user, using dialog management techniques to manage and advance the conversation flow with the user;

- helping the user effectively and efficiently digest the obtained information by summarizing the information;

- helping the user better participate in an online social network by providing tools that help the user interact with the online social network (e.g., creating posts, comments, messages);

- assisting the user in managing different tasks, such as keeping track of events;

- proactively performing, without user input and at a time relevant to the user, pre-authorized tasks related to user interests and preferences based on the user profile;

- checking privacy settings whenever necessary to ensure that accessing the user profile and performing different tasks comply with the user's privacy settings.

In embodiments, the assistant system may include at least one or more of the following components:

- a messaging platform for receiving text-modality user input from a client system associated with the user, and/or for receiving image- or video-modality user input and processing it within the messaging platform using optical character recognition techniques to convert the user input into text;

- an audio speech recognition (ASR) module for receiving audio-modality user input (e.g., the user may speak to it or send video including speech) from a client system associated with the user, and converting the audio-modality user input into text;

- an assistant xbot for receiving the output of the messaging platform or the ASR module.

In an embodiment, a system may include:

at least one client system (130), in particular an electronic device,

at least one assistant system (140) according to the invention or any embodiment herein,

the client system and the assistant system are connected to each other in particular via a network (110),

wherein the client system includes an assistant application (136) for allowing a user of the client system (130) to interact with the assistant system (140),

wherein the assistant application (136) communicates user input to the assistant system (140) and based on the user input, the assistant system (140) generates a response and sends the generated response to the assistant application (136) and the assistant application (136) presents the response to a user of the client system (130),

wherein, in particular, the user input is audio, verbal, or visual, and the response may be text, audio, verbal, or visual.

In an embodiment, a system may include a social networking system (160), wherein a client system specifically includes a social networking application (134) for accessing the social networking system (160).

Brief Description of Drawings

FIG. 1 illustrates an example network environment associated with an assistant system.

FIG. 2 illustrates an example architecture of an assistant system.

FIG. 3 illustrates an example flow chart of an assistant system responding to a user request.

FIG. 4 illustrates example spatiotemporal non-local operations of a non-local machine learning model for video classification.

FIG. 5 illustrates an example spatio-temporal non-local block.

FIG. 6 illustrates an example flow chart of embedded Gaussian instantiation.

FIG. 7 illustrates example spatio-temporal non-local blocks based on embedded Gaussian instantiations.

FIG. 8 illustrates an example flow chart for dot product instantiation.

FIG. 9 illustrates an example flow chart of concatenation instantiation.

FIG. 10 illustrates example visualizations of several examples of non-local block behavior computed by a non-local machine learning model.

FIG. 11 illustrates an example plot of a training process for multiple non-local machine learning models.

FIG. 12 illustrates an example method for training a non-local machine learning model.

FIG. 13 illustrates an example social graph.

FIG. 14 illustrates an example view of an embedding space.

FIG. 15 illustrates an example artificial neural network.

FIG. 16 illustrates an example computer system.

Description of example embodiments

FIG. 1 illustrates an example network environment 100 associated with an assistant system. Network environment 100 includes a client system 130, an assistant system 140, a social-networking system 160, and a third-party system 170 connected to each other through a network 110. Although FIG. 1 illustrates a particular arrangement of client system 130, assistant system 140, social-networking system 160, third-party system 170, and network 110, the present disclosure contemplates any suitable arrangement of client system 130, assistant system 140, social-networking system 160, third-party system 170, and network 110. By way of example and not by way of limitation, two or more of client system 130, social-networking system 160, assistant system 140, and third-party system 170 may be connected to each other directly, bypassing network 110. As another example, two or more of client system 130, assistant system 140, social-networking system 160, and third-party system 170 may be physically or logically co-located with each other in whole or in part. Further, although FIG. 1 illustrates a particular number of client systems 130, assistant systems 140, social-networking systems 160, third-party systems 170, and networks 110, this disclosure contemplates any suitable number of client systems 130, assistant systems 140, social-networking systems 160, third-party systems 170, and networks 110. By way of example, and not by way of limitation, network environment 100 may include a plurality of client systems 130, assistant systems 140, social-networking systems 160, third-party systems 170, and networks 110.

The present disclosure contemplates any suitable network 110. By way of example and not by way of limitation, one or more portions of network 110 may include an ad hoc network, an intranet, an extranet, a Virtual Private Network (VPN), a Local Area Network (LAN), a Wireless LAN (WLAN), a Wide Area Network (WAN), a Wireless WAN (WWAN), a Metropolitan Area Network (MAN), a portion of the internet, a portion of the Public Switched Telephone Network (PSTN), a cellular telephone network, or a combination of two or more of these. The network 110 may include one or more networks 110.

Link 150 may connect client system 130, assistant system 140, social-networking system 160, and third-party system 170 to communication network 110 or to each other. The present disclosure contemplates any suitable links 150. In particular embodiments, one or more links 150 include one or more wired links such as, for example, Digital Subscriber Line (DSL) or Data Over Cable Service Interface Specification (DOCSIS), wireless links such as, for example, Wi-Fi or Worldwide Interoperability for Microwave Access (WiMAX), or optical links such as, for example, Synchronous Optical Network (SONET) or Synchronous Digital Hierarchy (SDH). In particular embodiments, one or more links 150 each include an ad hoc network, an intranet, an extranet, a VPN, a LAN, a WLAN, a WAN, a WWAN, a MAN, a portion of the Internet, a portion of the PSTN, a cellular-technology-based network, a satellite-communication-technology-based network, another link 150, or a combination of two or more such links 150. The links 150 need not be identical throughout the network environment 100. The one or more first links 150 may differ from the one or more second links 150 in one or more respects.

In particular embodiments, client system 130 may be an electronic device that includes hardware, software, or embedded logic components, or a combination of two or more such components, and that is capable of performing the appropriate functions implemented or supported by client system 130. By way of example, and not by way of limitation, client system 130 may comprise a computer system, such as a desktop, notebook or laptop computer, netbook, tablet computer, e-book reader, GPS device, camera, personal digital assistant (PDA), handheld electronic device, cellular telephone, smart phone, smart speaker, other suitable electronic device, or any suitable combination thereof. In particular embodiments, client system 130 may be a smart assistant device. More information about smart assistant devices can be found in U.S. patent application Ser. No. 15/949011, U.S. patent application Ser. No. 62/655751, U.S. patent application Ser. No. 29/631910, U.S. patent application Ser. No. 29/631747, U.S. patent application Ser. No. 29/631913, and U.S. patent application Ser. No. 29/631914. The present disclosure contemplates any suitable client systems 130. Client system 130 may enable network users at client system 130 to access network 110. The client system 130 may enable its user to communicate with other users at other client systems 130.

In particular embodiments, client system 130 may include a web browser 132, such as MICROSOFT INTERNET EXPLORER, GOOGLE CHROME, or MOZILLA FIREFOX, and may have one or more add-ons, plug-ins, or other extensions, such as TOOLBAR or YAHOO TOOLBAR. A user at client system 130 may enter a Uniform Resource Locator (URL) or other address that directs web browser 132 to a particular server (e.g., server 162 or a server associated with third-party system 170), and web browser 132 may generate a Hypertext Transfer Protocol (HTTP) request and communicate it to the server. The server may accept the HTTP request and communicate one or more Hypertext Markup Language (HTML) files to the client system 130 in response to the HTTP request. Client system 130 may render a web interface (e.g., a web page) for presentation to a user based on the HTML files from the server. The present disclosure contemplates any suitable source files. By way of example, and not by way of limitation, web interfaces may be rendered from HTML files, Extensible Hypertext Markup Language (XHTML) files, or Extensible Markup Language (XML) files, according to particular needs. Such interfaces may also execute scripts such as, for example and without limitation, scripts written in JAVASCRIPT, JAVA, MICROSOFT SILVERLIGHT, combinations of markup language and scripts (e.g., AJAX (Asynchronous JAVASCRIPT and XML)), and the like. Herein, references to a web interface include one or more corresponding source files (which a browser may use to render the web interface) and vice versa, where appropriate.

In particular embodiments, client system 130 may include a social networking application 134 installed on client system 130. A user at client system 130 may use social networking application 134 to access an online social network. A user at client system 130 may use social networking application 134 to communicate with the user's social connections (e.g., friends, followers, followed accounts, contacts, etc.). A user at the client system 130 may also interact with a plurality of content objects (e.g., posts, news articles, ephemeral content, etc.) on the online social network using the social networking application 134. By way of example and not by way of limitation, a user may browse trending topics and breaking news using social networking application 134.

In particular embodiments, client system 130 may include an assistant application 136. A user of client system 130 may use assistant application 136 to interact with assistant system 140. In particular embodiments, assistant application 136 may comprise a stand-alone application. In particular embodiments, assistant application 136 may be integrated into social network application 134 or another suitable application (e.g., a messaging application). In particular embodiments, assistant application 136 may also be integrated into client system 130, an assistant hardware device, or any other suitable hardware device. In particular embodiments, assistant application 136 may be accessed via web browser 132. In particular embodiments, the user may provide input via different modalities. By way of example, and not by way of limitation, modalities may include audio, text, images, video, and so forth. The assistant application 136 can communicate user input to the assistant system 140. Based on the user input, the assistant system 140 can generate a response. The assistant system 140 can send the generated response to the assistant application 136. The assistant application 136 may then present the response to the user of the client system 130. The presented response may be based on different modalities, such as audio, text, images, and video. By way of example and not by way of limitation, a user may verbally query the assistant application 136 for traffic information (i.e., via an audio modality). The assistant application 136 may then communicate the request to the assistant system 140. The assistant system 140 can generate and send results back to the assistant application 136 accordingly. The assistant application 136 may also present the results to the user in text.

In particular embodiments, assistant system 140 can assist a user in retrieving information from different sources. The assistant system 140 can also assist the user in requesting services from different service providers. In particular embodiments, the assistant system 140 can receive a user request for information or services via the assistant application 136 in the client system 130. The assistant system 140 can use natural language understanding to analyze the user request based on the user's profile and other relevant information. The results of the analysis may include different entities associated with the online social network. The assistant system 140 can then retrieve information or request services associated with these entities. In particular embodiments, assistant system 140 may interact with social-networking system 160 and/or third-party system 170 when retrieving information or requesting services for a user. In particular embodiments, assistant system 140 can use natural language generation techniques to generate personalized communication content for a user. The personalized communication content may include, for example, the retrieved information or the status of the requested service. In particular embodiments, assistant system 140 may enable a user to interact with it in stateful and multi-turn sessions using dialog management techniques. The functionality of the assistant system 140 is described in more detail in the discussion of Fig. 2 below.

In particular embodiments, social-networking system 160 may be a network-addressable computing system that may host an online social network. Social-networking system 160 may generate, store, receive, and send social-networking data (such as, for example, user profile data, concept profile data, social-graph information, or other suitable data related to an online social network). Social-networking system 160 may be accessed by other components of network environment 100 directly or via network 110. By way of example and not by way of limitation, client system 130 may access social-networking system 160 directly or via network 110 using web browser 132 or a native application associated with social-networking system 160 (e.g., a mobile social-networking application, a messaging application, another suitable application, or any combination thereof). In particular embodiments, social-networking system 160 may include one or more servers 162. Each server 162 may be a unitary server or a distributed server spanning multiple computers or multiple data centers. The servers 162 may be of various types such as, for example and without limitation, a web server, a news server, a mail server, a message server, an advertisement server, a file server, an application server, an exchange server, a database server, a proxy server, another server suitable for performing the functions or processes described herein, or any combination thereof. In particular embodiments, each server 162 may include hardware, software, or embedded logic components, or a combination of two or more such components, for performing the appropriate functions implemented or supported by server 162. In particular embodiments, social-networking system 160 may include one or more data stores 164. The data stores 164 may be used to store various types of information. In particular embodiments, the information stored in data stores 164 may be organized according to particular data structures. In particular embodiments, each data store 164 may be a relational database, a columnar database, a correlation database, or another suitable database. Although this disclosure describes or illustrates particular types of databases, this disclosure contemplates any suitable type of database. Particular embodiments may provide interfaces that enable client system 130, social-networking system 160, or third-party system 170 to manage, retrieve, modify, add, or delete information stored in data store 164.

In particular embodiments, social-networking system 160 may store one or more social graphs in one or more data stores 164. In particular embodiments, the social graph may include a plurality of nodes, which may include a plurality of user nodes (each corresponding to a particular user) or a plurality of concept nodes (each corresponding to a particular concept), and a plurality of edges connecting the nodes. Social-networking system 160 may provide users of the online social network with the ability to communicate and interact with other users. In particular embodiments, users may join the online social network via social-networking system 160 and then add connections (e.g., relationships) to a number of other users of social-networking system 160 with whom they want to be connected. Herein, the term "friend" may refer to any other user of social-networking system 160 with whom the user has formed a connection, association, or relationship via social-networking system 160.
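The social graph described above (user nodes, concept nodes, and edges connecting them) can be illustrated with a minimal sketch. The class names `Node` and `SocialGraph` and the methods `add_node`, `connect`, and `friends` are hypothetical and chosen only for illustration; the disclosure does not prescribe any particular data structure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, FrozenSet


@dataclass(frozen=True)
class Node:
    node_id: str
    kind: str  # "user" or "concept"


@dataclass
class SocialGraph:
    # node_id -> Node
    nodes: Dict[str, Node] = field(default_factory=dict)
    # each edge is an unordered pair of two distinct node ids
    edges: Set[FrozenSet[str]] = field(default_factory=set)

    def add_node(self, node_id: str, kind: str) -> None:
        if kind not in ("user", "concept"):
            raise ValueError("kind must be 'user' or 'concept'")
        self.nodes[node_id] = Node(node_id, kind)

    def connect(self, a: str, b: str) -> None:
        # Assumption of this sketch: edges are undirected and self-loops are rejected.
        if a == b:
            raise ValueError("an edge must connect two distinct nodes")
        if a not in self.nodes or b not in self.nodes:
            raise KeyError("both endpoints must already be nodes")
        self.edges.add(frozenset((a, b)))

    def friends(self, user_id: str) -> List[str]:
        """Return the ids of user nodes directly connected to user_id."""
        result = set()
        for edge in self.edges:
            if user_id in edge:
                (other,) = edge - {user_id}
                if self.nodes[other].kind == "user":
                    result.add(other)
        return sorted(result)
```

In this sketch an edge between a user node and a concept node (for example, a user who liked a brand) is stored the same way as a user-to-user connection; only `friends` filters by node kind.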

In particular embodiments, social-networking system 160 may provide users with the ability to take actions on various types of items or objects supported by social-networking system 160. By way of example and not by way of limitation, items and objects may include groups or social networks to which a user of social-networking system 160 may belong, events or calendar entries that may be of interest to the user, computer-based applications that may be used by the user, transactions that allow the user to purchase or sell merchandise via a service, interactions with advertisements that the user may perform, or other suitable items or objects. The user may interact with anything that can be represented in social-networking system 160 or by an external system of third-party system 170, third-party system 170 being separate from social-networking system 160 and coupled to social-networking system 160 via network 110.

In particular embodiments, social-networking system 160 may be capable of linking various entities. By way of example, and not by way of limitation, social-networking system 160 may enable users to interact with each other and receive content from third-party systems 170 or other entities, or allow users to interact with these entities through an Application Programming Interface (API) or other communication channel.

In particular embodiments, third party system 170 may include one or more types of servers, one or more data stores, one or more interfaces (including but not limited to APIs), one or more web services, one or more content sources, one or more networks, or any other suitable components (e.g., a server may communicate with these components). Third party system 170 may be operated by an entity different from the entity operating social-networking system 160. However, in particular embodiments, social-networking system 160 and third-party system 170 may operate in conjunction with each other to provide social-networking services to users of social-networking system 160 or third-party system 170. In this sense, social-networking system 160 may provide a platform or backbone that other systems (e.g., third-party systems 170) may use to provide social-networking services and functionality to users throughout the Internet.

In particular embodiments, third party system 170 may include a third party content object provider. The third party content object provider may include one or more sources of content objects that may be delivered to the client system 130. By way of example and not by way of limitation, a content object may include information about a user's interests or activities, such as, for example, movie show times, movie reviews, restaurant menus, product information and reviews, or other suitable information. As another example and not by way of limitation, the content object may include an incentive content object (e.g., a coupon, a discount coupon, a gift certificate, or other suitable incentive object).

In particular embodiments, social-networking system 160 also includes user-generated content objects that may enhance user interactions with social-networking system 160. User-generated content may include any content that a user may add, upload, send, or "post" to social-networking system 160. By way of example and not by way of limitation, a user may communicate a post from client system 130 to social-networking system 160. The post may include data such as status updates or other text data, location information, photos, videos, links, music, or other similar data or media. Content may also be added to social-networking system 160 by a third party through a "communication channel" (e.g., a news feed or stream).

In particular embodiments, social-networking system 160 may include various servers, subsystems, programs, modules, logs, and data stores. In particular embodiments, social-networking system 160 may include one or more of the following: web servers, action loggers, API request servers, relevance and ranking engines, content object classifiers, notification controllers, action logs, third-party content object exposure logs, inference modules, authorization/privacy servers, search modules, advertisement-targeting modules, user interface modules, user profile stores, connection stores, third-party content object stores, or location stores. Social-networking system 160 may also include suitable components, such as network interfaces, security mechanisms, load balancers, failover servers, management and network operations consoles, other suitable components, or any suitable combination thereof. In particular embodiments, social-networking system 160 may include one or more user-profile stores for storing user profiles. A user profile may include, for example, biographical information, demographic information, behavioral information, social information, or other types of descriptive information (e.g., work experience, educational history, hobbies or preferences, interests, affinities, or locations). The interest information may include interests associated with one or more categories. The categories may be general or specific. By way of example and not by way of limitation, if a user "likes" an article about a brand of shoes, the category may be the brand, or the general category of "shoes" or "clothing." The connection store may be used to store connection information about users. The connection information may indicate users that have similar or common work experience, group memberships, hobbies, or educational history, or that are in any way related or share common attributes. The connection information may also include user-defined connections between different users and content (both internal and external). The web server may be used to link social-networking system 160 to one or more client systems 130 or one or more third-party systems 170 via network 110. The web server may include a mail server or other messaging functionality for receiving and routing messages between social-networking system 160 and one or more client systems 130. The API request server may allow third-party system 170 to access information from social-networking system 160 by calling one or more APIs. The action logger may be used to receive communications from the web server about a user's actions on or off social-networking system 160. In conjunction with the action log, a third-party content object log of user exposures to third-party content objects may be maintained. The notification controller may provide information about content objects to client system 130. The information may be pushed to client system 130 as a notification, or the information may be pulled from client system 130 in response to a request received from client system 130. The authorization server may be used to enforce one or more privacy settings of users of social-networking system 160. A user's privacy settings determine how particular information associated with the user may be shared. The authorization server may allow users to opt in to or opt out of having their actions logged by social-networking system 160 or shared with other systems (e.g., third-party system 170), for example by setting appropriate privacy settings. The third-party content object store may be used to store content objects received from third parties (e.g., third-party system 170). The location store may be used to store location information received from client systems 130 associated with users. The advertisement-targeting module may combine social information, the current time, location information, or other suitable information to provide relevant advertisements to the user in the form of notifications.

Fig. 2 illustrates an example architecture of the assistant system 140. In particular embodiments, the assistant system 140 may assist the user in obtaining information or services. The assistant system 140 can enable a user to interact with it using multimodal user input (e.g., voice, text, images, video) in stateful and multi-turn sessions to obtain assistance. The assistant system 140 can create and store a user profile that includes personal information and contextual information associated with the user. In particular embodiments, assistant system 140 can use natural language understanding to analyze user input. The analysis may be based on the user profile to obtain a more personalized and context-aware understanding. The assistant system 140 can resolve the entities associated with the user input based on the analysis. In particular embodiments, assistant system 140 can interact with different agents to obtain information or services associated with the resolved entities. The assistant system 140 can generate a response for the user regarding the information or services by using natural language generation. Through interaction with the user, the assistant system 140 can use dialog management techniques to manage and advance the flow of the session with the user. In particular embodiments, the assistant system 140 may also help the user effectively and efficiently digest the obtained information by aggregating the information. The assistant system 140 may also help users engage more with the online social network by providing tools that help users interact with the online social network (e.g., create posts, comments, messages). The assistant system 140 can additionally assist the user in managing different tasks, such as keeping track of events. In particular embodiments, assistant system 140 can proactively perform pre-authorized tasks related to user interests and preferences, based on the user profile and at a time relevant to the user, without user input. In particular embodiments, assistant system 140 can check privacy settings to ensure that accessing a user's profile or other user information and performing different tasks are permitted under the user's privacy settings. More information about assisting users subject to privacy settings can be found in U.S. patent application No. 62/675090, filed on May 22, 2018, which is incorporated by reference.

In particular embodiments, the assistant system 140 can receive user input from the assistant application 136 in the client system 130 associated with the user. In particular embodiments, the user input may be user-generated input that is sent to the assistant system 140 in a single turn. If the user input is based on a text modality, the assistant system 140 can receive it at the messaging platform 205. If the user input is based on an audio modality (e.g., the user may speak to the assistant application 136 or send a video including speech to the assistant application 136), the assistant system 140 can process it using an audio speech recognition (ASR) module 210 to convert the user input into text. If the user input is based on an image or video modality, the assistant system 140 can process it using optical character recognition techniques within the messaging platform 205 to convert the user input into text. The output of the messaging platform 205 or the ASR module 210 may be received at the assistant xbot 215. More information about processing user input based on different modalities can be found in U.S. patent application No. 16/053600, filed on August 2, 2018, which is incorporated by reference.
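The modality-dependent handling described above can be sketched as a simple dispatch that always yields text for the downstream chat bot. This is only an illustration under stated assumptions: the function names `recognize_speech`, `recognize_characters`, and `to_text` are hypothetical, and the two recognizers are placeholders that merely decode bytes rather than performing real speech or character recognition.

```python
def recognize_speech(audio: bytes) -> str:
    # Placeholder standing in for the ASR module 210; a real system
    # would run speech recognition here.
    return audio.decode("utf-8")


def recognize_characters(image_or_video: bytes) -> str:
    # Placeholder standing in for optical character recognition
    # within the messaging platform 205.
    return image_or_video.decode("utf-8")


def to_text(modality: str, payload) -> str:
    """Convert user input of a given modality into text."""
    if modality == "text":
        # Text input is received directly at the messaging platform.
        return payload
    if modality == "audio":
        return recognize_speech(payload)
    if modality in ("image", "video"):
        return recognize_characters(payload)
    raise ValueError(f"unsupported modality: {modality}")
```

Whatever the branch taken, the result is a plain string, which mirrors the point that the output of either the messaging platform or the ASR module is what the assistant xbot receives.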

In particular embodiments, assistant xbot 215 may be a type of chat bot. The assistant xbot 215 may include a programmable service channel, which may be software code, logic, or a routine that functions as a personal assistant to the user. The assistant xbot 215 may serve as the user's portal to the assistant system 140. Thus, assistant xbot 215 may be considered a type of conversational agent. In particular embodiments, assistant xbot 215 may send the textual user input to a natural language understanding (NLU) module 220 to interpret the user input. In particular embodiments, NLU module 220 may obtain information from user context engine 225 and semantic information aggregator 230 to accurately understand the user input. The user context engine 225 may store a user profile of the user. The user profile may include user profile data including demographic information, social information, and contextual information associated with the user. The user profile data may also include the user's interests and preferences on multiple topics, aggregated through conversations on the news feed, search logs, messaging platform 205, and the like. The use of the user profile may be protected by the privacy checking module 245 to ensure that the user's information is used only for his/her benefit and is not shared with anyone else. More information about user profiles can be found in U.S. patent application No. 15/967239, filed on April 30, 2018, which is incorporated by reference. Semantic information aggregator 230 may provide NLU module 220 with ontology data associated with a plurality of predefined domains, intents, and slots. In particular embodiments, a domain may represent a social context of an interaction, e.g., education. An intent may be an element in a predefined taxonomy of semantic intents that may indicate the purpose of a user's interaction with the assistant system 140. In particular embodiments, if the user input includes text/speech input, the intent may be the output of NLU module 220. The NLU module 220 may classify the text/speech input as a member of the predefined taxonomy, e.g., for the input "play Beethoven's fifth symphony," the NLU module 220 may classify the input as having the intent [intent:play_music]. In particular embodiments, a domain may conceptually be a namespace for a set of intents, e.g., music. A slot may be a named substring of the user input that represents a basic semantic entity. For example, the slot for "pizza" may be [slot:dish]. In particular embodiments, the set of valid or expected named slots may depend on the classified intent. By way of example, and not by way of limitation, for [intent:play_music], a slot may be [slot:song_name]. Semantic information aggregator 230 may also extract information from social graphs, knowledge graphs, and concept graphs and retrieve user profiles from user context engine 225. Semantic information aggregator 230 may also process information from these different sources by determining what information to aggregate, annotating the n-grams of the user input, ranking the n-grams with confidence scores based on the aggregated information, and formulating the ranked n-grams into features that can be used by NLU module 220 to understand the user input. More information about aggregating semantic information can be found in U.S. patent application No. 15/967342, filed on April 30, 2018, which is incorporated by reference. Based on the output of the user context engine 225 and the semantic information aggregator 230, the NLU module 220 may identify a domain, an intent, and one or more slots from the user input in a personalized and context-aware manner. By way of example, and not by way of limitation, the user input may include "show me how to get to the Starbucks". NLU module 220 may identify the particular Starbucks that the user wants to go to based on the user's personal information and associated contextual information. In particular embodiments, NLU module 220 may include a lexicon of the language, a parser, and grammar rules to partition sentences into an internal representation. NLU module 220 may also include one or more programs that perform naive semantic or stochastic semantic analysis and use pragmatics to understand the user input. In particular embodiments, the parser may be based on a deep learning architecture that includes a plurality of long short-term memory (LSTM) networks. By way of example and not by way of limitation, the parser may be based on a recurrent neural network grammar (RNNG) model, which is a type of recurrent and recursive LSTM algorithm. More information about natural language understanding can be found in U.S. patent application Ser. No. 16/01062, filed on June 18, 2018, U.S. patent application Ser. No. 16/025317, filed on July 2, 2018, and U.S. patent application Ser. No. 16/038120, filed on July 17, 2018, each of which is incorporated by reference.
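The relationship between domains, intents, and slots can be made concrete with a small sketch in which a domain acts as a namespace for intents and each intent determines the set of expected slot names. The `ONTOLOGY` table, the `NLUResult` class, and the `is_valid` function are hypothetical names introduced for illustration, and the slot name `artist_name` is an assumed example not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical ontology: domain -> intent -> expected slot names.
ONTOLOGY = {
    "music": {"play_music": {"song_name", "artist_name"}},
    "food": {"order_food": {"dish"}},
}


@dataclass
class NLUResult:
    domain: str
    intent: str
    slots: Dict[str, str] = field(default_factory=dict)  # slot name -> substring


def is_valid(result: NLUResult) -> bool:
    """Check that the intent belongs to the domain and that every slot
    is one of the slots expected for that intent."""
    expected = ONTOLOGY.get(result.domain, {}).get(result.intent)
    if expected is None:
        return False
    return set(result.slots) <= expected
```

Under these assumptions, an input such as "play Beethoven's fifth symphony" could be represented as `NLUResult("music", "play_music", {"song_name": "fifth symphony"})`, while a `dish` slot attached to `play_music` would be rejected.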

In particular embodiments, the identified domain, intent, and one or more slots from NLU module 220 may be sent to dialog engine 235. In particular embodiments, dialog engine 235 may manage the dialog state and the flow of the session between the user and assistant xbot 215. The dialog engine 235 may additionally store previous sessions between the user and the assistant xbot 215. In particular embodiments, dialog engine 235 may communicate with entity resolution module 240 to resolve entities associated with the one or more slots, which helps dialog engine 235 advance the flow of the session between the user and assistant xbot 215. In particular embodiments, entity resolution module 240 may access social graphs, knowledge graphs, and concept graphs when resolving an entity. Entities may include, for example, unique users or concepts, each of which may have a unique identifier (ID). By way of example, and not by way of limitation, a knowledge graph may include a plurality of entities. Each entity may include a single record associated with one or more attribute values. A particular record may be associated with a unique entity identifier. Each record may have different values for an attribute of the entity. Each attribute value may be associated with a confidence probability. The confidence probability of an attribute value represents the probability that the value is accurate for the given attribute. Each attribute value may also be associated with a semantic weight. The semantic weight of an attribute value may represent how well the value semantically fits the given attribute in view of all available information. For example, the knowledge graph may include an entity for the movie "The Martian" (2015) that includes information that has been extracted from multiple content sources (e.g., Facebook, Wikipedia, movie review sources, media databases, and entertainment content sources) and then deduplicated, resolved, and fused to generate a single unique record for the knowledge graph. The entity may be associated with a space attribute value that indicates the genre of the movie "The Martian" (2015). More information about knowledge graphs can be found in U.S. patent application Ser. No. 16/048049, filed on July 27, 2018, and U.S. patent application Ser. No. 16/048101, filed on July 27, 2018, each of which is incorporated by reference. Entity resolution module 240 may additionally request a user profile of the user associated with the user input from user context engine 225. In particular embodiments, entity resolution module 240 may communicate with privacy check module 245 to ensure that the resolution of an entity does not violate privacy policies. In particular embodiments, privacy check module 245 may use an authorization/privacy server to enforce privacy policies. By way of example and not by way of limitation, the entity to be resolved may be another user who has specified in his/her privacy settings that his/her identity should not be searchable on the online social network, so entity resolution module 240 may not return an identifier of that user in response to the request. Based on information obtained from the social graph, knowledge graph, concept graph, and user profile, and subject to applicable privacy policies, the entity resolution module 240 may thus accurately resolve entities associated with the user input in a personalized and context-aware manner. In particular embodiments, each resolved entity may be associated with one or more identifiers hosted by social-networking system 160. By way of example, and not by way of limitation, an identifier may include a unique user identifier (ID). In particular embodiments, each resolved entity may also be associated with a confidence score. More information about resolving entities can be found in U.S. patent application Ser. No. 16/048049, filed on July 27, 2018, and U.S. patent application Ser. No. 16/048072, filed on July 27, 2018, each of which is incorporated by reference.
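The knowledge-graph record and the privacy-constrained resolution described above can be sketched as follows. Everything here is an assumption made for illustration: the class names `AttributeValue` and `Entity`, the `searchable` flag standing in for a privacy setting, and in particular the scoring rule (confidence probability multiplied by semantic weight), which the disclosure does not specify.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class AttributeValue:
    value: str
    confidence: float       # probability that the value is accurate
    semantic_weight: float  # how well the value fits the attribute


@dataclass
class Entity:
    entity_id: str                         # unique entity identifier
    attributes: Dict[str, AttributeValue]  # attribute name -> value
    searchable: bool = True                # hypothetical privacy flag


def resolve(slot_value: str, attribute: str,
            entities: List[Entity]) -> Optional[Tuple[str, float]]:
    """Return (entity_id, score) of the best match, or None.

    Entities whose privacy flag forbids search are never returned.
    """
    best: Optional[Tuple[str, float]] = None
    for entity in entities:
        if not entity.searchable:
            continue  # respect the privacy setting
        av = entity.attributes.get(attribute)
        if av is None or av.value.lower() != slot_value.lower():
            continue
        # Assumed scoring rule, used only for this sketch.
        score = av.confidence * av.semantic_weight
        if best is None or score > best[1]:
            best = (entity.entity_id, score)
    return best
```

The returned score plays the role of the confidence score associated with a resolved entity; a production system would likely use a learned ranking rather than a simple product.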

In particular embodiments, dialog engine 235 may communicate with different agents based on the identified intent and domain and the resolved entities. In particular embodiments, an agent may be an implementation that acts as a broker among multiple content providers for one domain. A content provider may be an entity responsible for performing an action associated with an intent or completing a task associated with the intent. By way of example and not by way of limitation, multiple device-specific implementations (e.g., real-time calls to client system 130 or messaging applications on client system 130) may be handled internally by a single agent. Alternatively, these device-specific implementations may be handled by multiple agents associated with multiple domains. In particular embodiments, the agents may include first party agents 250 and third party agents 255. In particular embodiments, first party agents 250 may include internal agents that are accessible and controllable by assistant system 140 (e.g., agents associated with services provided by the online social network, such as Messenger or Instagram). In particular embodiments, third party agents 255 may include external agents that the assistant system 140 does not control (e.g., a music streaming agent (Spotify) or a ticket sales agent (Ticketmaster)). First party agent 250 may be associated with a first party provider 260 that provides content objects and/or services hosted by social-networking system 160. The third party agent 255 may be associated with a third party provider 265 that provides content objects and/or services hosted by the third party system 170.

In particular embodiments, the communication from the dialog engine 235 to the first party agent 250 may include requesting a particular content object and/or service provided by the first party provider 260. Thus, the first party agent 250 may retrieve the requested content object from the first party provider 260 and/or perform tasks that instruct the first party provider 260 to perform the requested service. In particular embodiments, the communication from the dialog engine 235 to the third party agent 255 may include requesting a particular content object and/or service provided by the third party provider 265. Thus, the third party agent 255 may retrieve the requested content object from the third party provider 265 and/or perform tasks that instruct the third party provider 265 to perform the requested service. The third party agent 255 may access the privacy check module 245 to ensure that there is no privacy violation prior to interacting with the third party provider 265. By way of example and not by way of limitation, a user associated with user input may specify in his/her privacy settings that his/her profile information is not visible to any third party content provider. Thus, when retrieving a content object associated with a user input from the third party provider 265, the third party agent 255 may complete the retrieval without revealing to the third party provider 265 which user is requesting the content object.
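The privacy check performed before contacting a third party provider can be illustrated with a minimal sketch in which the request carries the user's identity only when the user's privacy settings allow it. The function name `build_third_party_request` and the setting key `share_profile_with_third_parties` are hypothetical and are not defined by the disclosure.

```python
from typing import Dict


def build_third_party_request(user_id: str, query: str,
                              privacy_settings: Dict[str, Dict[str, bool]]) -> Dict[str, str]:
    """Build a content request for a third party provider.

    The user identifier is included only if the user has allowed
    profile sharing with third parties; the default is not to share.
    """
    request = {"query": query}
    user_settings = privacy_settings.get(user_id, {})
    if user_settings.get("share_profile_with_third_parties", False):
        request["user_id"] = user_id
    return request
```

Defaulting to not sharing when no setting is present is a design choice of this sketch: it keeps the retrieval possible while ensuring the provider cannot tell which user requested the content object unless sharing was explicitly enabled.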

In particular embodiments, each of the first party agents 250 or the third party agents 255 may be designated for a particular domain. By way of example and not by way of limitation, a domain may include weather, transportation, music, and the like. In particular embodiments, the assistant system 140 may use multiple agents cooperatively to respond to user input. By way of example, and not by way of limitation, the user input may include "direct me to my next meeting". The assistant system 140 can use a calendar agent to retrieve the location of the next meeting. The assistant system 140 can then use a navigation agent to direct the user to the next meeting.
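The cooperative use of a calendar agent and a navigation agent can be sketched as a two-step chain in which the output of the first agent becomes the input of the second. The classes `Meeting`, `CalendarAgent`, and `NavigationAgent` and the function `direct_to_next_meeting` are hypothetical names for illustration, and time is represented as minutes since midnight purely to keep the example short.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Meeting:
    start_minute: int  # minutes since midnight (assumption of this sketch)
    location: str


class CalendarAgent:
    def __init__(self, meetings: List[Meeting]):
        self._meetings = sorted(meetings, key=lambda m: m.start_minute)

    def next_meeting(self, now_minute: int) -> Optional[Meeting]:
        """Return the earliest meeting that has not started yet."""
        for meeting in self._meetings:
            if meeting.start_minute >= now_minute:
                return meeting
        return None


class NavigationAgent:
    def directions_to(self, location: str) -> str:
        # Placeholder: a real agent would compute a route.
        return f"Directions to {location}"


def direct_to_next_meeting(calendar: CalendarAgent,
                           navigation: NavigationAgent,
                           now_minute: int) -> str:
    # Step 1: the calendar agent supplies the location.
    meeting = calendar.next_meeting(now_minute)
    if meeting is None:
        return "No upcoming meeting found"
    # Step 2: the navigation agent consumes that location.
    return navigation.directions_to(meeting.location)
```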

In particular embodiments, each of the first party agents 250 or the third party agents 255 may retrieve a user profile from the user context engine 225 to perform tasks in a personalized and context-aware manner. By way of example, and not by way of limitation, the user input may include "book me a ride to the airport". A transportation agent may perform the task of booking the ride. The transportation agent may retrieve the user profile of the user from the user context engine 225 before booking the ride. For example, the user profile may indicate that the user prefers taxis, so the transportation agent may book a taxi for the user. As another example, the contextual information associated with the user profile may indicate that the user is pressed for time, so the transportation agent may book the user a ride from a ride-sharing service (e.g., Uber, Lyft), because getting a car from a ride-sharing service may be faster than getting one from a taxi company. In particular embodiments, each of the first party agents 250 or the third party agents 255 may take other factors into account when performing tasks. By way of example and not by way of limitation, other factors may include price, rating, efficiency, partnerships with the online social network, and so forth.

In particular embodiments, dialog engine 235 may communicate with a session understanding composer (CU composer) 270. Dialog engine 235 may send the status of the requested content object and/or the requested service to CU composer 270. In particular embodiments, dialog engine 235 may send the status of the requested content object and/or the requested service as a < k, c, u, d > tuple (tuple), where k indicates a knowledge source, c indicates a communication target, u indicates a user model, and d indicates a speech (discoure) model. In particular embodiments, CU composer 270 may include a Natural Language Generator (NLG) 271 and a User Interface (UI) payload generator 272. The natural language generator 271 may generate the communication content based on the output of the dialog engine 235. In particular embodiments, NLG271 may include a content determination component, a sentence planner, and a surface implementation (surface realization) component. The content determination component can determine the communication content based on the knowledge source, the communication target, and the user's desire. By way of example, and not by way of limitation, the determination may be based on descriptive logic. Description logic may include, for example, three basic concepts (notes) that are individuals (representing objects in a domain), concepts (describing a collection of individuals), and roles (representing binary relationships between individuals or concepts). The description logic may be characterized by a set of constructors that allow the natural language generator 271 to construct complex concepts/roles from atomic concepts/roles. In particular embodiments, the content determination component may perform the following tasks to determine the communication content. The first task may include a translation task in which input to the natural language generator 271 may be translated into concepts. 
The second task may include a selection task, in which relevant concepts may be selected from the concepts generated by the translation task based on the user model. The third task may include a verification task in which the consistency of the selected concepts may be verified. The fourth task may include an instantiation task in which the verified concepts may be instantiated as an executable file that may be processed by the natural language generator 271. The sentence planner can determine the organization of the communication content to make it understandable. The surface realization component can determine the particular words to use, the order of sentences, and the style of the communication content. The UI payload generator 272 may determine a preferred modality of the communication content to be presented to the user. In particular embodiments, CU composer 270 may communicate with privacy check module 245 to ensure that the generation of the communication content complies with privacy policies. In particular embodiments, CU composer 270 may retrieve the user profile from user context engine 225 when generating the communication content and determining the modality of the communication content. Thus, the communication content may be more natural, personalized, and context-aware for the user. By way of example, and not by way of limitation, a user profile may indicate that a user likes short phrases in conversation, and thus the generated communication content may be based on short phrases. As another example and not by way of limitation, the contextual information associated with the user profile may indicate that the user is using a device that outputs only audio signals, and thus the UI payload generator 272 may determine the modality of the communication content as audio. More information about natural language generation can be found in U.S. patent application Ser. No. 15/967279, filed on April 30, 2018, and U.S. patent application Ser. No. 15/966455, filed on April 30, 2018, each of which is incorporated by reference.

In particular embodiments, CU composer 270 may send the generated communication content to assistant xbot 215. In particular embodiments, assistant xbot 215 may send the communication content to messaging platform 205. Messaging platform 205 may then send the communication content to client system 130 via assistant application 136. In an alternative embodiment, the assistant xbot 215 may send the communication content to a text-to-speech (TTS) module 275. TTS module 275 may convert the communication content to an audio clip. TTS module 275 may then send the audio clip to client system 130 via assistant application 136.

In a particular embodiment, the assistant xbot 215 may interact with the proactive inference layer 280 without receiving user input. The proactive inference layer 280 may infer user interests and preferences based on the user profile retrieved from the user context engine 225. In particular embodiments, proactive inference layer 280 may also communicate with proactive agent 285 about the inference. The proactive agent 285 may perform proactive tasks based on the inference. By way of example, and not by way of limitation, a proactive task may include sending a content object or providing a service to a user. In particular embodiments, each proactive task may be associated with an agenda item. Agenda items may include recurring items, such as daily summaries. Agenda items may also include one-time items. In particular embodiments, the proactive agent 285 may retrieve the user profile from the user context engine 225 when performing the proactive task. Thus, the proactive agent 285 may perform proactive tasks in a personalized and context-aware manner. By way of example and not by way of limitation, the proactive inference layer may infer that the user likes the band Maroon 5, and the proactive agent 285 may generate a recommendation of Maroon 5's new songs/albums for the user.

In particular embodiments, the proactive agent 285 may generate candidate entities associated with the proactive task based on the user profile. The generation may be based on a direct backend query that uses deterministic filters to retrieve the candidate entities from a structured data store. Alternatively, the generation may be based on a machine learning model that is trained based on user profiles, entity attributes, and correlations between users and entities. By way of example, and not by way of limitation, the machine learning model may be based on a Support Vector Machine (SVM). As another example and not by way of limitation, the machine learning model may be based on a regression model. As another example and not by way of limitation, the machine learning model may be based on a Deep Convolutional Neural Network (DCNN). In particular embodiments, proactive agent 285 may also rank the generated candidate entities based on the user profile and content associated with the candidate entities. The ranking may be based on similarity between the user interests and the candidate entities. By way of example, and not by way of limitation, the assistant system 140 can generate feature vectors representing user interests and feature vectors representing candidate entities. The assistant system 140 can then calculate a similarity score (e.g., based on cosine similarity) between the feature vector representing the user's interests and the feature vector representing each candidate entity. Alternatively, the ranking may be based on a ranking model that is trained based on user feedback data.
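The similarity-based ranking described above can be illustrated with a minimal sketch. The three-dimensional feature vectors, the entity names, and the function names below are hypothetical and chosen only for illustration; the disclosure does not specify how the feature vectors are produced.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as unrelated
    return dot / (norm_a * norm_b)

def rank_candidates(user_vector, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(user_vector, vec))
              for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical features, e.g. affinity for (music, sports, travel).
user_interest = [0.9, 0.1, 0.3]
candidate_entities = {
    "new_album": [1.0, 0.0, 0.2],
    "match_tickets": [0.0, 1.0, 0.1],
    "city_guide": [0.2, 0.1, 1.0],
}
ranking = rank_candidates(user_interest, candidate_entities)
```

In this toy setting the music-related entity ranks first because its vector points in nearly the same direction as the user-interest vector. A ranking model trained on user feedback, as the alternative above describes, would replace `cosine_similarity` with a learned scoring function.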

In particular embodiments, the proactive task may include recommending candidate entities to the user. The proactive agent 285 may schedule the recommendation, thereby associating a recommendation time with the recommended candidate entity. Recommended candidate entities may also be associated with priorities and expiration times. In particular embodiments, the recommended candidate entity may be sent to the proactive scheduler. The proactive scheduler may determine the actual time to send the recommended candidate entity to the user based on the priority associated with the task and other relevant factors (e.g., clicks and impressions of the recommended candidate entity). In particular embodiments, the proactive scheduler may then send recommended candidate entities with the determined actual times to the asynchronous tier. The asynchronous tier may temporarily store the recommended candidate entity as a job. In particular embodiments, the asynchronous tier may send the job to the dialog engine 235 for execution at the determined actual time. In alternative embodiments, the asynchronous tier may execute the job by sending it to other surfaces (e.g., other notification services associated with social-networking system 160). In particular embodiments, dialog engine 235 may identify dialog intents, states, and histories associated with a user. Based on the dialog intent, the dialog engine 235 may select some candidate entities from the recommended candidate entities to send to the client system 130. In particular embodiments, the dialog state and history may indicate whether the user is engaged in an ongoing session with assistant xbot 215. If the user is engaged in an ongoing session and the priority of the recommended task is low, the dialog engine 235 may communicate with the proactive scheduler to reschedule the time to send the selected candidate entity to the client system 130. 
If the user is engaged in an ongoing session and the priority of the recommended task is high, the dialog engine 235 may initiate a new dialog session with the user in which the selected candidate entity may be presented. Thus, interruption of an ongoing session can be prevented. Upon determining that sending the selected candidate entity does not interrupt the user, dialog engine 235 may send the selected candidate entity to CU composer 270 to generate personalized and context-aware communication content including the selected candidate entity in accordance with the user's privacy settings. In particular embodiments, CU composer 270 may send the communication content to assistant xbot 215, which may then send it to client system 130 via messaging platform 205 or TTS module 275. More information about proactively assisting a user can be found in U.S. patent application Ser. No. 15/967193, filed on April 30, 2018, and U.S. patent application Ser. No. 16/036827, filed on July 16, 2018, each of which is incorporated by reference.
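The delivery decision described above (reschedule a low-priority recommendation during an ongoing session, open a new dialog session for a high-priority one) might be sketched as follows. The `Recommendation` fields, the numeric priority threshold, and the rule that expired recommendations are dropped are assumptions made for illustration and are not specified by the disclosure.

```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    entity: str
    priority: int       # assumed convention: larger means more urgent
    expires_at: float   # seconds since epoch

HIGH_PRIORITY = 5  # hypothetical threshold separating low from high priority

def decide_delivery(rec, now, in_ongoing_session):
    """Return one of 'drop', 'send', 'new_session', or 'reschedule'."""
    if now >= rec.expires_at:
        return "drop"         # assumption: expired recommendations are not sent
    if not in_ongoing_session:
        return "send"         # no ongoing session to interrupt
    if rec.priority >= HIGH_PRIORITY:
        return "new_session"  # present it in a separate dialog session
    return "reschedule"       # return it to the scheduler for a later time
```

Keeping the decision in one pure function makes each branch easy to check against the behavior described for the dialog engine 235.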

In particular embodiments, assistant xbot 215 may communicate with proactive agent 285 in response to user input. By way of example, and not by way of limitation, the user may ask the assistant xbot 215 to set a reminder. The assistant xbot 215 may request that the proactive agent 285 set such a reminder, and the proactive agent 285 may proactively perform the task of reminding the user at a later time.

In particular embodiments, assistant system 140 can include a summarizer 290. The summarizer 290 may provide the user with customized news feed summaries. In particular embodiments, the summarizer 290 may include a plurality of meta agents. The plurality of meta agents may use the first party agent 250, the third party agent 255, or the proactive agent 285 to generate news feed summaries. In particular embodiments, the summarizer 290 may retrieve user interests and preferences from the proactive inference layer 280. The summarizer 290 may then retrieve the entities associated with the user interests and preferences from the entity resolution module 240. The summarizer 290 may also retrieve the user profile from the user context engine 225. Based on information from the proactive inference layer 280, the entity resolution module 240, and the user context engine 225, the summarizer 290 may generate personalized and context-aware summaries for the user. In particular embodiments, the summarizer 290 may send the summaries to the CU composer 270. CU composer 270 may process the summaries and send the results of the processing to assistant xbot 215. Assistant xbot 215 may then send the processed summaries to client system 130 via messaging platform 205 or TTS module 275. More information about summarization can be found in U.S. patent application Ser. No. 15/967290, filed on April 30, 2018, which is incorporated by reference.

FIG. 3 shows an example flow chart of the assistant system 140 responding to a user request. In particular embodiments, assistant xbot 215 may access request manager 305 upon receiving a user request. The request manager 305 may include a context extractor 306 and a conversational understanding object generator (CU object generator) 307. The context extractor 306 may extract context information associated with the user request. The context extractor 306 may also update the context information based on the assistant application 136 executing on the client system 130. By way of example, and not by way of limitation, the updating of the context information may include displaying the content item on the client system 130. As another example and not by way of limitation, the updating of the context information may include setting an alert on the client system 130. As another example and not by way of limitation, the updating of the context information may include playing a song on the client system 130. The CU object generator 307 may generate a particular content object related to the user request. The content object may include dialog session data and features associated with the user request that may be shared with all of the modules of the assistant system 140. In particular embodiments, request manager 305 may store the context information and the generated content objects in data store 310, which is a particular data store implemented in assistant system 140.

In particular embodiments, request manager 305 may send the generated content object to NLU module 220. NLU module 220 may perform a number of steps to process the content object. At step 221, NLU module 220 may generate a whitelist for the content object. In particular embodiments, the whitelist may include interpretation data that matches the user's request. At step 222, NLU module 220 may perform featurization based on the whitelist. At step 223, NLU module 220 may perform domain classification/selection on the user request based on the features generated by the featurization to classify the user request into a predefined domain. The domain classification/selection result may be further processed based on two related processes. At step 224a, NLU module 220 may process the domain classification/selection result using an intent classifier. The intent classifier may determine a user intent associated with the user request. In particular embodiments, each domain may have an intent classifier to determine the most likely intent in a given domain. As an example and not by way of limitation, the intent classifier may be based on a machine learning model that may take the domain classification/selection result as input and calculate the probability that the input is associated with a particular predefined intent. At step 224b, the NLU module 220 may process the domain classification/selection result using a meta-intent classifier. The meta-intent classifier may determine a category that describes the user's intent. In particular embodiments, intents common to multiple domains may be processed by the meta-intent classifier. As an example and not by way of limitation, the meta-intent classifier may be based on a machine learning model that may take the domain classification/selection result as input and calculate the probability that the input is associated with a particular predefined meta-intent. 
At step 225a, NLU module 220 may annotate one or more slots associated with the user request using a slot tagger. In particular embodiments, the slot tagger may annotate one or more slots for the n-grams of the user request. At step 225b, NLU module 220 may annotate one or more slots with classification results from the meta-intent classifier using a meta-slot tagger. In particular embodiments, the meta-slot tagger may tag generic slots, such as a reference to an item (e.g., "the first"), a type of slot, a value of the slot, and so forth. By way of example and not by way of limitation, the user request may include "change 500 dollars in my account to Japanese yen". The intent classifier may take the user request as input and formulate it as a vector. The intent classifier may then calculate the probabilities that the user request is associated with different predefined intents based on a vector comparison between the vector representing the user request and the vectors representing the different predefined intents. In a similar manner, the slot tagger may take the user request as input and formulate each word as a vector. The slot tagger may then calculate the probabilities that each word is associated with different predefined slots based on a vector comparison between the vector representing the word and the vectors representing the different predefined slots. The user's intent may be classified as "change money". The slots of the user request may include "500", "dollars", "account", and "Japanese yen". The user's meta-intent may be classified as "financial service". The meta-slot may include "finance".
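The vector comparison used by the intent classifier can be illustrated with a toy sketch. It assumes bag-of-words vectors over a small hypothetical vocabulary, dot products as the comparison, and a softmax to turn scores into probabilities; none of these choices is stated in the disclosure, and a deployed classifier would use learned representations.

```python
import math

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in vocabulary]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def intent_probabilities(request, intent_examples, vocabulary):
    """Compare the request vector with one prototype vector per intent."""
    request_vec = bag_of_words(request, vocabulary)
    names = list(intent_examples)
    scores = []
    for name in names:
        intent_vec = bag_of_words(intent_examples[name], vocabulary)
        scores.append(sum(r * i for r, i in zip(request_vec, intent_vec)))
    return dict(zip(names, softmax(scores)))

# Hypothetical vocabulary and intent prototypes.
vocabulary = ["change", "dollars", "yen", "account", "play", "song"]
intent_examples = {
    "change_money": "change dollars yen account",
    "play_music": "play song",
}
probs = intent_probabilities(
    "change 500 dollars in my account to japanese yen",
    intent_examples, vocabulary)
```

Here the request shares four vocabulary words with the `change_money` prototype and none with `play_music`, so `change_money` receives the higher probability. The same pattern, applied per word rather than per request, corresponds to the slot tagger described above.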

In particular embodiments, NLU module 220 may improve domain classification/selection of content objects by extracting semantic information from semantic information aggregator 230. In particular embodiments, semantic information aggregator 230 may aggregate semantic information in the following manner. The semantic information aggregator 230 may first retrieve information from the user context engine 225. In particular embodiments, user context engine 225 may include an offline aggregator 226 and an online inference service 227. The offline aggregator 226 may process a plurality of data associated with the user collected from a prior time window. By way of example and not by way of limitation, the data may include news feed posts/comments, interactions with news feed posts/comments, Instagram posts/comments, search history, and the like, collected from a window of the previous 90 days. The processing results may be stored in the user context engine 225 as part of the user profile. The online inference service 227 may analyze session data associated with the user received by the assistant system 140 at the current time. The analysis results may also be stored in the user context engine 225 as part of the user profile. In particular embodiments, both offline aggregator 226 and online inference service 227 may extract personalized features from the plurality of data. The extracted personalized features may be used by other modules of the assistant system 140 to better understand user input. In particular embodiments, semantic information aggregator 230 may then process the information retrieved from user context engine 225, i.e., the user profile, in the following steps. At step 231, the semantic information aggregator 230 may process information retrieved from the user context engine 225 based on Natural Language Processing (NLP). 
In particular embodiments, semantic information aggregator 230 may tokenize text by text normalization, extract syntactic features from the text, and extract semantic features from the text based on NLP. The semantic information aggregator 230 may additionally extract features from context information accessed from a dialog history between the user and the assistant system 140. The semantic information aggregator 230 may also perform global word embedding, domain-specific embedding, and/or dynamic embedding based on the context information. At step 232, the processing results may be annotated with entities by an entity tagger. Based on the annotations, the semantic information aggregator 230 may generate a dictionary for the retrieved information at step 233. In particular embodiments, the dictionary may include global dictionary features that may be dynamically updated offline. At step 234, the semantic information aggregator 230 may rank the entities tagged by the entity tagger. In particular embodiments, semantic information aggregator 230 may communicate with different graphs 330, including a social graph, a knowledge graph, and a concept graph, to extract ontology data related to the information retrieved from user context engine 225. In particular embodiments, semantic information aggregator 230 may aggregate user profiles, ranked entities, and information from graphs 330. Semantic information aggregator 230 may then send the aggregated information to NLU module 220 to facilitate domain classification/selection.

In particular embodiments, the output of NLU module 220 may be sent to co-reference module 315 to interpret references to the content object associated with the user request. In particular embodiments, co-reference module 315 may be used to identify the item that the user request refers to. The co-reference module 315 may include a reference creation 316 and a reference resolution 317. In particular embodiments, reference creation 316 may create references for entities determined by NLU module 220. Reference resolution 317 may accurately resolve these references. By way of example and not by way of limitation, the user request may include "find me the nearest Walmart and direct me there". The co-reference module 315 may interpret "there" as "the nearest Walmart". In particular embodiments, co-reference module 315 may access user context engine 225 and dialog engine 235 as necessary to interpret references with increased accuracy.

In particular embodiments, the identified domains, intents, meta-intents, slots, and meta-slots, as well as the resolved references, may be sent to entity resolution module 240 to resolve related entities. Entity resolution module 240 may perform generic and domain-specific entity resolution. In particular embodiments, entity resolution module 240 may include a domain entity resolution 241 and a generic entity resolution 242. Domain entity resolution 241 may resolve entities by categorizing slots and meta-slots into different domains. In particular embodiments, entities may be resolved based on ontology data extracted from graphs 330. The ontology data may include structural relationships between different slots/meta-slots and domains. The ontology data may also include information on how slots/meta-slots may be grouped, related within a hierarchy whose higher level comprises the domains, and subdivided according to similarities and differences. Generic entity resolution 242 may resolve entities by categorizing slots and meta-slots into different generic topics. In particular embodiments, the resolution may also be based on the ontology data extracted from graphs 330. The ontology data may include structural relationships between different slots/meta-slots and generic topics. The ontology data may also include information on how slots/meta-slots may be grouped, related within a hierarchy whose higher level comprises the topics, and subdivided according to similarities and differences. By way of example and not by way of limitation, in response to an input inquiring about the advantages of a Tesla car, generic entity resolution 242 may resolve the Tesla car as a vehicle, and domain entity resolution 241 may resolve the Tesla car as an electric car.

In particular embodiments, the output of entity resolution module 240 may be sent to dialog engine 235 to advance the conversation flow with the user. The dialog engine 235 may include a dialog intent parser 236 and a dialog state updater/sequencer 237. In particular embodiments, dialog intent parser 236 may resolve the user intent associated with the current dialog session based on a dialog history between the user and the assistant system 140. Dialog intent parser 236 may map the intents determined by NLU module 220 to different dialog intents. Dialog intent parser 236 may also rank dialog intents based on signals from NLU module 220, entity resolution module 240, and the dialog history between the user and assistant system 140. In particular embodiments, dialog state updater/sequencer 237 may update/rank the dialog state of the current dialog session. As an example and not by way of limitation, if a dialog session ends, the dialog state updater/sequencer 237 may update the dialog state to "completed". As another example and not by way of limitation, the dialog state updater/sequencer 237 may rank dialog states based on priorities associated with the dialog states.

In particular embodiments, the dialog engine 235 may communicate with the task completion module 335 regarding dialog intents and associated content objects. In particular embodiments, task completion module 335 may rank different dialog hypotheses for different dialog intents. Task completion module 335 may include an action selection component 336. In particular embodiments, dialog engine 235 may additionally check against dialog policies 320 regarding dialog states. In particular embodiments, dialog policy 320 may include a data structure describing an action execution plan of agent 340. The agent 340 may select among registered content providers to complete the action. The data structure may be constructed by the dialog engine 235 based on the intent and one or more slots associated with the intent. Dialog policy 320 may also include a plurality of goals that are related to each other through logical operators. In particular embodiments, a goal may be the outcome of a portion of the dialog policy, and it may be constructed by dialog engine 235. A goal may be represented by an identifier (e.g., a string) with one or more named arguments that parameterize the goal. By way of example and not by way of limitation, a goal and its associated goal argument may be expressed as {confirm_artist, args: {artist: "Madonna"}}. In particular embodiments, the dialog policy may be based on a tree-structured representation, in which goals are mapped to the leaves of the tree. In particular embodiments, dialog engine 235 may execute dialog policy 320 to determine the next action to be performed. Dialog policies 320 may include generic policies 321 and domain-specific policies 322, both of which may guide how the next system action is selected based on dialog states. In particular embodiments, task completion module 335 may communicate with dialog policy 320 to obtain guidance on the next system action. 
In particular embodiments, action selection component 336 may thus select an action based on dialog intent, associated content objects, and guidance from dialog policy 320.
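The tree-structured dialog policy described above, with parameterized goals at the leaves and logical operators relating them, might be represented as in the following sketch. The class names, the second and third goal identifiers, and the `is_satisfied` evaluation rule are hypothetical; the disclosure states only that goals have an identifier with named arguments and are related through logical operators.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Goal:
    identifier: str
    args: tuple = ()  # (name, value) pairs that parameterize the goal

@dataclass
class PolicyNode:
    operator: str                                  # "AND" or "OR"
    children: list = field(default_factory=list)   # Goal leaves or PolicyNodes

def is_satisfied(node, achieved):
    """True if the goal tree rooted at `node` holds for the achieved goals."""
    if isinstance(node, Goal):
        return node in achieved
    results = [is_satisfied(child, achieved) for child in node.children]
    return all(results) if node.operator == "AND" else any(results)

# Illustrative goals; only the artist-confirmation example comes from the text.
confirm = Goal("confirm_artist", (("artist", "Madonna"),))
pick_song = Goal("select_song")
pick_album = Goal("select_album")
policy = PolicyNode("AND", [confirm, PolicyNode("OR", [pick_song, pick_album])])
```

Making `Goal` frozen keeps it hashable, so achieved goals can be stored in a set and compared by identifier and arguments together.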

In particular embodiments, the output of task completion module 335 may be sent to CU composer 270. In alternative embodiments, the selected action may require participation by one or more agents 340. Thus, the task completion module 335 may notify the agent 340 of the selected action. At the same time, the dialog engine 235 may receive instructions to update dialog states. By way of example, and not by way of limitation, the update may include waiting for a response by the agent. In particular embodiments, CU composer 270 may generate communication content for the user based on the output of task completion module 335 using NLG 271. In particular embodiments, NLG 271 may use different language models and/or language templates to generate natural language output. The generation of natural language output may be application specific. The generation of natural language output may also be personalized for each user. CU composer 270 may also determine the modality of the generated communication content using UI payload generator 272. Since the generated communication content may be considered a response to the user request, CU composer 270 may additionally use response sequencer 273 to order the generated communication content. By way of example, and not by way of limitation, the ordering may indicate a priority of the response.

In particular embodiments, the output of CU composer 270 may be sent to response manager 325. Response manager 325 may perform different tasks including storing/updating dialog states 326 retrieved from data store 310 and generating responses 327. In particular embodiments, the output of CU composer 270 may include one or more of natural language strings, speech, or actions with parameters. Thus, response manager 325 may determine what tasks to perform based on the output of CU composer 270. In particular embodiments, the generated response and communication content may be sent to assistant xbot 215. In an alternative embodiment, if the determined modality of the communication content is audio, the output of CU composer 270 may additionally be sent to TTS module 275. The speech generated by TTS module 275 and the response generated by response manager 325 may then be sent to assistant xbot 215.

In particular embodiments, assistant system 140 can use one or more non-local machine learning models to analyze content objects including one or more of speech, text, images, video, or a combination thereof. The non-local machine learning model may be based on a deep neural network. In deep neural networks, convolutional operations and recurrent operations may be building blocks that process one local neighborhood at a time. In particular embodiments, the non-local machine learning model may use non-local operations as a generic family of building blocks for capturing long-range dependencies. The non-local machine learning model may use a non-local operation to compute the response at one location as a weighted sum of the features at all locations. These building blocks may be inserted into many computer vision architectures. In particular embodiments, a non-local machine learning model may be applied to the task of video classification. By way of example and not by way of limitation, the non-local machine learning model may compete with or outperform current competition winners on the Kinetics and Charades datasets (i.e., two public datasets), even without any additional enhancements. In particular embodiments, non-local machine learning models may also be applied to static image recognition. By way of example, and not by way of limitation, a non-local machine learning model may improve object detection/segmentation and pose estimation on the COCO (i.e., a public dataset) suite of tasks. Although this disclosure describes a particular machine learning model based on a particular building block in a particular manner, this disclosure contemplates any suitable machine learning model based on any suitable building block in any suitable manner.

In particular embodiments, assistant system 140 can train a baseline machine learning model based on a neural network that includes a plurality of stages. Each stage may include a plurality of neural blocks. The assistant system 140 can then access a plurality of training samples that each include a plurality of content objects. In particular embodiments, assistant system 140 can determine one or more non-local operations. Each non-local operation may be based on one or more pairwise functions and one or more unary functions. In particular embodiments, assistant system 140 can generate one or more non-local blocks based on the plurality of training samples and the one or more non-local operations. The assistant system 140 can then determine a stage from among the plurality of stages of the neural network. In particular embodiments, assistant system 140 can also train the non-local machine learning model by inserting each of the one or more non-local blocks between at least two of the plurality of neural blocks in the determined stage of the neural network.
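The insertion step can be pictured with a small structural sketch in which a network is a list of stages and each stage is a list of blocks. The block names, the factory argument, and the chosen positions are hypothetical; the sketch shows only the bookkeeping of placing non-local blocks between existing neural blocks of one stage, not training.

```python
def insert_non_local_blocks(stages, stage_index, make_block, after_positions):
    """Return a copy of `stages` with new blocks added to one stage.

    stages          : list of stages, each a list of block objects
    stage_index     : index of the stage that receives the non-local blocks
    make_block      : zero-argument factory returning a new non-local block
    after_positions : indices of existing blocks to insert after
    """
    new_stages = [list(stage) for stage in stages]  # leave the baseline intact
    target = new_stages[stage_index]
    for pos in after_positions:
        if not 0 <= pos < len(target) - 1:
            raise ValueError("a non-local block must sit between two blocks")
    # Insert from the back so that earlier indices are not shifted.
    for pos in sorted(set(after_positions), reverse=True):
        target.insert(pos + 1, make_block())
    return new_stages

# Hypothetical baseline with two stages of placeholder block names.
baseline = [["res2_a", "res2_b", "res2_c"],
            ["res3_a", "res3_b", "res3_c", "res3_d"]]
model = insert_non_local_blocks(baseline, 1, lambda: "non_local", [0, 2])
```

The position check enforces the requirement that every inserted block has a neural block on each side within the determined stage.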

Capturing long-range dependencies may be critical in deep neural networks. For sequential data (e.g., speech, language), recurrent operations may be the dominant solution for long-range dependency modeling. For image data, long-range dependencies can be modeled by the large receptive fields formed by deep stacks of convolutional operations.

Both convolutional and recurrent operations may process a local neighborhood in space or time. Thus, long-range dependencies can only be captured when these operations are applied repeatedly, propagating signals progressively through the data. Repeating local operations may have several limitations. First, it is computationally inefficient. Second, it can cause optimization difficulties that need to be carefully addressed. Finally, these challenges make it difficult to implement multi-hop dependency modeling, where messages need to be passed back and forth between distant locations.
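The cost of reaching distant positions with local operations alone can be made concrete with standard receptive-field arithmetic. The sketch below assumes stacked one-dimensional convolutions with stride 1, no dilation, and no pooling, which is a simplification not stated in the text; under that assumption each layer with kernel size k widens the receptive field by k - 1 positions.

```python
import math

def receptive_field(num_layers, kernel_size):
    """Receptive field, in positions, of stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)

def layers_to_span(distance, kernel_size):
    """Fewest stacked layers whose receptive field covers distance + 1 positions."""
    return math.ceil(distance / (kernel_size - 1))

# With 3-wide kernels, relating two positions 100 steps apart takes 50 layers.
needed = layers_to_span(100, 3)
```

A non-local operation, by contrast, relates any two positions in a single step regardless of the distance between them.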

FIG. 4 illustrates an example spatiotemporal non-local operation of a non-local machine learning model for video classification. In particular embodiments, non-local operations may be efficient, simple, and generic components for capturing long-range dependencies in deep neural networks. In particular embodiments, the neural network may include one or more of a convolutional neural network or a recurrent neural network. The neural network may be based on one or more of a two-dimensional architecture or a three-dimensional architecture. In particular embodiments, the non-local operation may be a generalization of the classical non-local means operation in computer vision. The non-local operation may calculate the response at a given location as a weighted sum of the features at all locations in the input feature map. As shown in FIG. 4, the response at a position x_i is computed as a weighted average of the features of all positions x_j (only the highest-weighted ones are shown in FIG. 4). In this example, computed by the non-local machine learning model, note how it relates the ball in the first frame to the ball in the last two frames. Further examples are shown in FIG. 10. The set of locations may be in space, time, or spacetime. Thus, non-local operations may be applicable to image, sequence, and video problems.
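A minimal sketch of such an operation on a flattened feature map follows. It assumes a pairwise function f given by the exponential of a dot product, a unary function g given by the identity, and normalization by the sum of the pairwise weights; these are one possible instantiation chosen for illustration, and the function names are hypothetical.

```python
import math

def non_local_operation(features, pairwise, unary):
    """y_i = sum_j f(x_i, x_j) * g(x_j) / sum_j f(x_i, x_j) over all positions j."""
    transformed = [unary(x_j) for x_j in features]
    dim = len(transformed[0])
    outputs = []
    for x_i in features:
        weights = [pairwise(x_i, x_j) for x_j in features]
        norm = sum(weights)  # normalization factor for position i
        outputs.append([
            sum(w * g_j[d] for w, g_j in zip(weights, transformed)) / norm
            for d in range(dim)
        ])
    return outputs

def gaussian(x_i, x_j):
    """Assumed pairwise function: exponential of the dot product."""
    return math.exp(sum(a * b for a, b in zip(x_i, x_j)))

def identity(x_j):
    """Assumed unary function: pass the feature through unchanged."""
    return list(x_j)

# Three positions; the first and last carry the same feature.
feature_map = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
responses = non_local_operation(feature_map, gaussian, identity)
```

In this toy input, position 0 weights the identical feature at the non-adjacent position 2 as heavily as itself, which is the kind of long-range relation FIG. 4 depicts for the ball across frames. The double loop also makes the all-pairs cost visible: every position is compared with every other position.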

There may be several technical advantages to using non-local operations: (a) in contrast to the progressive behavior of recurrent and convolutional operations, non-local operations can capture long-range dependencies directly by computing interactions between any two positions, regardless of their positional distance; (b) non-local operations can be efficient and can achieve their best results even with only a few layers (e.g., 5 layers); (c) finally, non-local operations may maintain variable input sizes and may be easily combined with other operations (e.g., convolutions).

In particular embodiments, the effectiveness of non-local operations may be shown in the application of video classification. In video, long-range interactions can occur between pixels that are distant in space as well as in time. A single non-local block, which is the basic unit of a non-local machine learning model, can capture these spatiotemporal dependencies directly in a feed-forward manner. The architecture of the non-local machine learning model may be referred to as a non-local neural network. In particular embodiments, with a few non-local blocks, a non-local neural network may be more accurate for video classification than two-dimensional (2D) and three-dimensional (3D) convolutional networks, including their inflated variants. Furthermore, non-local neural networks may be more computationally economical than their 3D convolutional counterparts. In the embodiments disclosed herein, comprehensive ablation studies are presented on the Kinetics and Charades datasets. In particular embodiments, using only RGB and without any additional techniques (e.g., optical flow, multi-scale testing), the non-local machine learning model obtains results on both datasets that are on par with or better than those of recent competition winners.

In the embodiments disclosed herein, experiments on object detection, segmentation, and pose estimation on the COCO dataset are also presented to demonstrate the generality of non-local operations. By way of example and not by way of limitation, a non-local block may improve the accuracy of all three tasks over a strong Mask R-CNN baseline (i.e., conventional work) at a small additional computational cost. Together with the evidence on video, these image experiments suggest that non-local operations are generally useful and may become a basic building block in designing deep neural networks.

Non-local means is a classical filtering algorithm that computes a weighted mean of all pixels in an image. It allows distant pixels to contribute to the filtered response at a position based on patch appearance similarity. The idea of non-local filtering was later developed into BM3D (block-matching 3D), which performs filtering on a group of similar, but non-local, patches. BM3D is a solid image denoising baseline even when compared with deep neural networks. Non-local matching is also essential to successful texture synthesis, super-resolution, and inpainting algorithms.

Long-range dependencies can be modeled by graphical models such as conditional random fields (CRFs). In the context of deep neural networks, a CRF may be used to post-process the semantic segmentation predictions of a network. The iterative mean-field inference of a CRF can be turned into a recurrent network and trained jointly. In contrast, embodiments disclosed herein may be a simpler feed-forward block for computing non-local filtering.

Recently, there has been a trend toward using feed-forward (i.e., non-recurrent) networks to model sequences in speech and language. In these approaches, long-term dependencies are captured by the large receptive fields contributed by very deep one-dimensional (1D) convolutions. These feed-forward models are amenable to parallelized implementations and can be more efficient than the widely used recurrent models.

Self-attention. A non-local machine learning model may be related to the recent self-attention method for machine translation (i.e., conventional work). A self-attention module computes the response at a position in a sequence (e.g., a sentence) by attending to all positions and taking their weighted average in an embedding space. Self-attention can be viewed as a form of non-local means, and in this sense, embodiments disclosed herein can bridge self-attention for machine translation to the more general class of non-local filtering operations that are applicable to image and video problems in computer vision.

Relation networks have recently been proposed for relational reasoning tasks. A relation network computes a function of the feature embeddings at all pairs of positions in its input. The non-local machine learning model may also process all pairs. However, unlike a relation network, which aggregates all pairs into a single output vector by summation, a non-local machine learning model can produce an output of the same size as its input, which may itself vary in size. This may allow a non-local operation to be followed by a convolutional or recurrent layer, or to be used more than once, making it easy and flexible to integrate into the standard network architectures used in computer vision.

One natural solution to video classification is to combine the success of convolutional neural networks (CNNs) for images and recurrent neural networks (RNNs) for sequences. In contrast, feed-forward models may be implemented with 3D convolutions (C3D) in spacetime, and the 3D filters may be formed by "inflating" pre-trained 2D filters. In addition to end-to-end modeling on raw video inputs, optical flow and trajectories have been found to be helpful. Both flow and trajectories are off-the-shelf modules that may find long-range, non-local dependencies.

In the following, a general definition of non-local operation is given, and then several specific instantiations thereof are provided.

In particular embodiments, a generic non-local operation in deep neural networks may be defined as:

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) \qquad (1)$$

In particular embodiments, each of the one or more non-local operations associated with the non-local machine learning model may be based on the function above. As used herein, $i$ may indicate the index of an output position whose response is to be computed, and $j$ may indicate the index that enumerates all possible positions. In particular embodiments, the output position may be in one or more of space, time, or spacetime. $x$ may indicate an input signal and $y$ may indicate an output signal of the same size as $x$. In particular embodiments, the pairwise function $f$ may compute a scalar between $i$ and all $j$. The scalar may represent a relationship, such as affinity. In particular embodiments, the unary function $g$ may compute a representation of the input signal at position $j$. The response may also be normalized by a normalization factor $C(x)$. In particular embodiments, the input signal may include a content object. By way of example and not by way of limitation, the content object may include one or more of text, an audio clip, an image, or a video. In particular embodiments, the input signal may include a feature representation of the content object. Thus, $x_i$ may indicate the feature representation at the output position; $x_j$ may indicate the feature representation at one of the plurality of positions; and $y_i$ may indicate the output response at the output position.
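The definition in equation (1) can be sketched directly in NumPy. This is a minimal, unoptimized illustration rather than the implementation of the embodiments: the helper names (`non_local`, `dot`, `linear`, `by_count`), the random weights `W_g`, and the choice of a dot-product $f$ with $C(x) = N$ are assumptions made only for the example.

```python
import numpy as np

def non_local(x, f, g, normalizer):
    """Generic non-local operation following equation (1).

    x: array of shape (N, C), one feature vector per position.
    f: pairwise function mapping (x_i, x_j) to a scalar.
    g: unary function mapping x_j to a vector.
    normalizer: maps the (N, N) matrix of f values to factors C(x), shape (N, 1).
    """
    n = x.shape[0]
    # Pairwise scalars f(x_i, x_j) for every pair of positions.
    affinity = np.array([[f(x[i], x[j]) for j in range(n)] for i in range(n)])
    # Unary representations g(x_j), stacked into shape (N, C_out).
    values = np.stack([g(x[j]) for j in range(n)])
    # y_i = (1 / C(x)) * sum_j f(x_i, x_j) g(x_j)
    return (affinity @ values) / normalizer(affinity)

rng = np.random.default_rng(0)
W_g = rng.standard_normal((4, 4))  # hypothetical weights, random here

def dot(a, b):          # illustrative pairwise function
    return float(a @ b)

def linear(v):          # illustrative unary function g(x_j) = W_g x_j
    return W_g @ v

def by_count(aff):      # illustrative normalization C(x) = N
    return np.full((aff.shape[0], 1), aff.shape[0])

for n in (5, 9):  # the same weights handle inputs of different sizes
    x = rng.standard_normal((n, 4))
    y = non_local(x, dot, linear, by_count)
    print(x.shape, "->", y.shape)
```

Because the weights depend only on the channel dimension, the same function accepts any number of positions and returns an output with one row per input position.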

The non-local behavior in equation (1) may be due to the fact that all positions ($\forall j$) are considered in the operation. By comparison, a convolution operation may sum up the weighted inputs in a local neighborhood (e.g., $i-1 \le j \le i+1$ in the 1D case with kernel size 3), and a recurrent operation at time $i$ may typically be based only on the current and the latest time steps (e.g., $j = i$ or $i-1$).

A non-local operation may also differ from a fully connected (fc) layer in a convolutional neural network. Equation (1) may compute responses based on the relationships between different positions, whereas fc uses learned weights. In other words, unlike in the non-local layer, the relationship between $x_j$ and $x_i$ in fc may not be a function of the input data. Furthermore, the formulation in equation (1) may support inputs of variable size and maintain the corresponding size in the output. In contrast, an fc layer may require a fixed-size input/output and lose positional correspondence (e.g., the correspondence from $x_i$ to $y_i$ at position $i$).

In particular embodiments, non-local operations may be flexible building blocks that are easy to use together with convolutional/recurrent layers. Such a building block may be referred to as a non-local block. It can be added to the earlier part of a deep neural network, unlike fc layers, which are often used at the end. This may allow a richer hierarchy to be built that combines non-local and local information. In particular embodiments, generating each of the one or more non-local blocks may include applying each of the one or more non-local operations to a feature representation of one of the plurality of content objects. In particular embodiments, applying the non-local operations may further include determining an output position and a plurality of positions associated with the output position for each of the plurality of content objects.

In particular embodiments, the pairwise function $f$ and the unary function $g$ may each take a plurality of different forms. In particular embodiments, the non-local machine learning model may be insensitive to these choices, indicating that the generic non-local behavior may be the main reason for the improvements observed on different tasks (e.g., video analysis).

In particular embodiments, for simplicity, the unary function $g$ may take the form of a linear embedding: $g(x_j) = W_g x_j$, where $W_g$ may indicate a weight matrix to be learned. In particular embodiments, the linear embedding may be implemented as, for example, a 1×1 convolution in space or a 1×1×1 convolution in spacetime.
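To illustrate why a linear embedding can be realized as a convolution with a kernel of size one, the sketch below checks numerically that mixing channels at every spacetime position is the same as applying $W_g$ to each position's feature vector. The tensor sizes and the random weights are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W, C_in, C_out = 2, 3, 3, 8, 4      # illustrative sizes
x = rng.standard_normal((T, H, W, C_in))
W_g = rng.standard_normal((C_out, C_in))  # stand-in for the learned weight matrix

# A kernel of size one in every spatial/temporal dimension only mixes channels,
# so it applies the same matrix independently at each position.
g_conv = np.einsum("thwc,oc->thwo", x, W_g)

# Equivalent view: flatten the positions, then g(x_j) = W_g x_j for each row.
g_flat = x.reshape(-1, C_in) @ W_g.T

print(np.allclose(g_conv.reshape(-1, C_out), g_flat))
```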

In particular embodiments, the pairwise function $f$ may be based on different functions. By way of example and not by way of limitation, the different functions may include one or more of a Gaussian function, an embedded Gaussian function, a dot product, a concatenation, any suitable function, or any combination thereof.

In particular embodiments, one option for the pairwise function $f$ may include a Gaussian function, which may be formulated as:

$$f(x_i, x_j) = e^{x_i^T x_j} \qquad (2)$$

In particular embodiments, $x_i^T x_j$ is the dot-product similarity. In alternative embodiments, Euclidean distance may also be applicable. However, the dot product may be easier to implement on modern deep learning platforms. In particular embodiments, the normalization factor may be set to $C(x) = \sum_{\forall j} f(x_i, x_j)$.

In particular embodiments, another option for the pairwise function $f$ may include an extension of the Gaussian function that computes similarity in an embedding space. In particular embodiments, it can be formulated as:

$$f(x_i, x_j) = e^{\theta(x_i)^T \phi(x_j)} \qquad (3)$$

In particular embodiments, $\theta(x_i) = W_\theta x_i$ and $\phi(x_j) = W_\phi x_j$ may be two embeddings. In particular embodiments, the normalization factor may be set to $C(x) = \sum_{\forall j} f(x_i, x_j)$.

It may be noted that the recently proposed self-attention model for machine translation (i.e., conventional work) may be a special case of the non-local operation in its embedded Gaussian version. This can be seen from the fact that, for a given $i$, $\frac{1}{C(x)} f(x_i, x_j)$ becomes the softmax computation along the dimension $j$. Thus $y = \mathrm{softmax}(x^T W_\theta^T W_\phi x)\, g(x)$, which is the form of self-attention in the self-attention model. Thus, the non-local machine learning model can provide insight by relating this recent self-attention model to the classical non-local means method of computer vision, and can extend the sequential self-attention network of the self-attention model to generic space/spacetime non-local networks for image/video recognition in computer vision. Despite the relation to the self-attention model, the attentional behavior (due to softmax) may not be essential in applications of image and video analysis. To illustrate this, two alternative versions of the non-local operation are described next.
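The relation between the embedded Gaussian version and softmax attention can be checked numerically. The sketch below is an illustration under stated assumptions: positions are stored as rows of `x` (so the product is written as `theta @ phi.T`), the weights are random stand-ins for learned parameters, and `softmax` is a local helper.

```python
import numpy as np

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, C, D = 6, 8, 4                       # positions, channels, embedding size
x = rng.standard_normal((N, C))
W_theta, W_phi, W_g = (rng.standard_normal((D, C)) * 0.1 for _ in range(3))

theta, phi, g = x @ W_theta.T, x @ W_phi.T, x @ W_g.T

# Embedded Gaussian f with C(x) equal to the sum of f over j.
f = np.exp(theta @ phi.T)
y_nonlocal = (f / f.sum(axis=1, keepdims=True)) @ g

# Softmax along j applied to the same scores, as in self-attention.
y_attention = softmax(theta @ phi.T, axis=1) @ g

print(np.allclose(y_nonlocal, y_attention))
```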

In particular embodiments, another option for the pair-wise function f may include a dot product similarity, which is formulated as:

$$f(x_i, x_j) = \theta(x_i)^T \phi(x_j) \qquad (4)$$

In particular embodiments, the dot-product similarity may be based on the embedded version. In particular embodiments, the normalization factor may be set to $C(x) = N$, where $N$ may indicate the number of positions in $x$, rather than the sum of $f$, because this may simplify gradient computation. A normalization like this may allow the input to have a variable size. The main difference between the dot-product and embedded Gaussian versions may be the presence of softmax, which plays the role of an activation function.

Concatenation is used by the pairwise function in relation networks for visual reasoning. In particular embodiments, another option for the pairwise function $f$ may be based on a concatenation form, formulated as:

$$f(x_i, x_j) = \mathrm{ReLU}(w_f^T [\theta(x_i), \phi(x_j)]) \qquad (5)$$

As used herein, $[\cdot, \cdot]$ may denote concatenation, and $w_f$ may indicate a weight vector that projects the concatenated vector to a scalar. In particular embodiments, the normalization factor may be set to $C(x) = N$. As used herein, ReLU may indicate the rectified linear unit function.

The above several variants may demonstrate the flexibility of general non-local operation. In certain embodiments, alternate versions of the pair-wise function f are possible and may improve the performance of the non-local machine learning model.
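The four variants can be placed side by side in a short NumPy sketch. The function name `pairwise`, the `kind` strings, and the random inputs are hypothetical and chosen only for illustration; the function returns the already normalized weights $f / C(x)$ for every pair of positions.

```python
import numpy as np

def pairwise(theta, phi, kind, w_f=None):
    """Return f(x_i, x_j) / C(x) as an (N_i, N_j) matrix.

    theta, phi: embeddings of shape (N_i, D) and (N_j, D). For the plain
    Gaussian version, pass the raw features x for both arguments.
    """
    n = phi.shape[0]  # number of positions j
    if kind in ("gaussian", "embedded_gaussian"):
        f = np.exp(theta @ phi.T)
        return f / f.sum(axis=1, keepdims=True)      # C(x) = sum over j of f
    if kind == "dot_product":
        return (theta @ phi.T) / n                   # C(x) = N
    if kind == "concatenation":
        d = theta.shape[1]
        # w_f^T [theta_i, phi_j] splits into a theta term plus a phi term.
        scores = (theta @ w_f[:d])[:, None] + (phi @ w_f[d:])[None, :]
        return np.maximum(scores, 0.0) / n           # ReLU, then C(x) = N
    raise ValueError(f"unknown kind: {kind}")

rng = np.random.default_rng(0)
theta = rng.standard_normal((5, 3))
phi = rng.standard_normal((5, 3))
w_f = rng.standard_normal(6)

for kind in ("embedded_gaussian", "dot_product", "concatenation"):
    weights = pairwise(theta, phi, kind, w_f)
    print(kind, weights.shape)
```

Only the Gaussian variants produce rows that sum to one; the dot-product and concatenation variants are simply scaled by the number of positions.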

FIG. 5 illustrates an example spatiotemporal non-local block 500. In particular embodiments, non-local operations as characterized in equation (1) may be packed into non-local block 500. Non-local block 500 may be incorporated into many existing architectures. In particular embodiments, non-local block 500 may be defined as:

$$z_i = W_z y_i + x_i \qquad (6)$$

where $y_i$ may indicate the response given in equation (1), and "$+x_i$" may denote a residual connection. The residual connection may allow a new non-local block 500 to be inserted into any pre-trained model without breaking its initial behavior (e.g., if $W_z$ is initialized to zero). In FIG. 5, the non-local block 500 takes an input 501, represented by $x$, and processes it with an embedding 502 represented by $\theta$, another embedding 503 represented by $\phi$, and a unary function 505 represented by $g$. The results of the embeddings 502 and 503 are further processed by a pairwise function 504 represented by $f$. The results of the pairwise function 504 and the unary function 505 are processed by a matrix multiplication 506, represented by $\otimes$. The result of the matrix multiplication 506 is processed by a 1×1×1 convolution 507. The input 501 and the result of the 1×1×1 convolution 507 are processed by an element-wise sum 508, represented by $\oplus$, which produces an output 509 represented by $z$.
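The block of equation (6) and the effect of the residual connection can be sketched as follows. This is a simplified illustration that assumes the embedded Gaussian pairwise function, flattens all positions into rows, and omits any normalization layers; `non_local_block` and the random weights are hypothetical names and values.

```python
import numpy as np

def non_local_block(x, W_theta, W_phi, W_g, W_z):
    """Embedded Gaussian non-local block, z = W_z y + x.

    x: (N, C) flattened positions; W_theta, W_phi, W_g: (C_inner, C);
    W_z: (C, C_inner).
    """
    theta, phi, g = x @ W_theta.T, x @ W_phi.T, x @ W_g.T
    scores = theta @ phi.T
    scores -= scores.max(axis=1, keepdims=True)    # stabilize the exponentials
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over all positions j
    y = weights @ g                                # response of equation (1)
    return y @ W_z.T + x                           # residual connection

rng = np.random.default_rng(0)
N, C, C_inner = 10, 8, 4
x = rng.standard_normal((N, C))
W_theta, W_phi, W_g = (rng.standard_normal((C_inner, C)) for _ in range(3))

# With W_z initialized to zero the block is an identity mapping,
# so inserting it leaves a pre-trained network's behavior unchanged.
z_init = non_local_block(x, W_theta, W_phi, W_g, np.zeros((C, C_inner)))
print(np.array_equal(z_init, x))
```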

In particular embodiments, the pairwise function 504 may be implemented with different instantiations, including one or more of Gaussian, embedded Gaussian, dot product, or concatenation. FIG. 6 illustrates an example flow chart of the embedded Gaussian instantiation. As shown in FIG. 6, the embedded-Gaussian-based pairwise function 504 processes the embeddings 502 and 503 with the matrix multiplication 506, followed by the application of a softmax operation 601. FIG. 7 illustrates an example spatiotemporal non-local block 500 based on the embedded Gaussian instantiation. In FIG. 7, the feature maps are shown as the shapes of their tensors, for example, T×H×W×1024 for 1024 channels (appropriate reshaping is performed where noted). A bottleneck of 512 channels is used. $\otimes$ represents the matrix multiplication 506, and $\oplus$ represents the element-wise sum 508. The softmax operation 601 is performed on each row. The Gaussian version may be obtained by removing $\theta$ and $\phi$. The pairwise computations in equations (2), (3), or (4) can be done simply by the matrix multiplication 506 shown in FIG. 7.

FIG. 8 illustrates an example flow chart of the dot-product instantiation. As shown in FIG. 8, the dot-product-based pairwise function 504 processes the embeddings 502 and 503 with the matrix multiplication 506, followed by the application of a scaling-by-1/N operation 801. FIG. 9 illustrates an example flow chart of the concatenation instantiation. As shown in FIG. 9, the concatenation-based pairwise function 504 processes the embeddings 502 and 503 with a concatenation (Concat) operation 901. The pairwise function 504 then processes the weight vector 902, represented by $w_f$, and the result of the concatenation operation 901 with the matrix multiplication 506. The result of the matrix multiplication 506 is further processed by a rectified linear unit (ReLU) 903, followed by the application of the scaling-by-1/N operation 801.

In particular embodiments, the pairwise computation of the non-local block 500 may be lightweight when it is used in high-level, sub-sampled feature maps. By way of example and not by way of limitation, typical values in FIG. 7 may be T = 4 and H = W = 14 or 7. In particular embodiments, the pairwise computation performed by the matrix multiplication 506 may be comparable to that of a typical convolutional layer in a standard network. In particular embodiments, the following implementations may further be adopted to make it more efficient.

In particular embodiments, the number of channels represented by $W_g$, $W_\theta$, and $W_\phi$ in the non-local block may be set to half the number of channels in $x$. This may reduce the computation of the block by about half. In particular embodiments, the weight matrix $W_z$ in equation (6) may compute a position-wise embedding on $y_i$, matching the number of channels to that of $x$, as shown in FIG. 7.
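A rough count of multiply-adds makes the effect of halving the channels concrete. The cost model below is an assumption introduced for illustration (it counts only the three embeddings, the pairwise product, the weighted sum, and $W_z$, and ignores softmax and normalization layers); the sizes T = 4, H = W = 14, and 1024 channels are the typical values mentioned in the text.

```python
def block_macs(n, c, c_inner):
    """Approximate multiply-adds of one non-local block (softmax and BN ignored)."""
    embeddings = 3 * n * c * c_inner   # theta, phi and g
    pairwise = n * n * c_inner         # f for every pair of positions
    aggregate = n * n * c_inner        # weighted sum of g over all j
    output = n * c_inner * c           # W_z, back to c channels
    return embeddings + pairwise + aggregate + output

T, H, W, C = 4, 14, 14, 1024
N = T * H * W                          # number of spacetime positions

full = block_macs(N, C, C)             # inner channels equal to input channels
half = block_macs(N, C, C // 2)        # inner channels halved

print(f"positions N = {N}, pairs = {N * N}")
print(f"half / full = {half / full:.2f}")
```

Under this model every term is linear in the inner channel count, which is why halving the channels halves the estimated cost.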

In particular embodiments, a sub-sampling technique may be used to further reduce computation. In particular embodiments, the non-local machine learning model may generate a sub-sampled content object for each of the plurality of content objects by applying sub-sampling to the feature representation of the content object. The sub-sampled content object may be associated with a sub-sampled feature representation. In particular embodiments, generating each of the one or more non-local blocks 500 may include applying each of the one or more non-local operations to the feature representation of one of the plurality of content objects and to the sub-sampled feature representation of the sub-sampled content object corresponding to that content object. In particular embodiments, the non-local machine learning model may also determine an output position for each of the plurality of content objects and determine a plurality of positions associated with the output position for each of the plurality of sub-sampled content objects corresponding to the content objects. Accordingly, equation (1) may be modified as:

$$y_i = \frac{1}{C(\hat{x})} \sum_{\forall j} f(x_i, \hat{x}_j)\, g(\hat{x}_j)$$

where $\hat{x}$ may indicate a sub-sampled version of $x$. In other words, each of the one or more non-local operations may be based on the function above. As used herein, $x_i$ may indicate the feature representation at the output position; $\hat{x}_j$ may indicate the sub-sampled feature representation at one of the plurality of positions; $y_i$ may indicate the output response at the output position; $f(x_i, \hat{x}_j)$ may indicate the pairwise function; $g(\hat{x}_j)$ may indicate the unary function; and $C(\hat{x})$ may indicate the normalization factor. In particular embodiments, sub-sampling may include pooling. By way of example and not by way of limitation, pooling may include one or more of max pooling or average pooling. In particular embodiments, sub-sampling may be performed in the spatial domain, which may reduce the amount of pairwise computation by 1/4. In particular embodiments, sub-sampling may not alter the non-local behavior, but may only make the computation sparser. In particular embodiments, sub-sampling may be done by adding a max-pooling layer after $\phi$ and $g$ in FIG. 7. In particular embodiments, these efficient modifications may be used for all non-local blocks 500.
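The sub-sampling step can be sketched by pooling the outputs of $\phi$ and $g$ while leaving $\theta$ untouched. The 2×2 spatial window, the helper `max_pool_hw`, and the random arrays standing in for the embeddings are assumptions for illustration; the point is that the output keeps one response per original position while the pairwise matrix shrinks.

```python
import numpy as np

def max_pool_hw(a):
    """2x2 spatial max pooling on a (T, H, W, C) array with even H and W."""
    t, h, w, c = a.shape
    return a.reshape(t, h // 2, 2, w // 2, 2, c).max(axis=(2, 4))

rng = np.random.default_rng(0)
T, H, W, D = 4, 14, 14, 8
# Random stand-ins for the outputs of theta, phi and g.
theta = rng.standard_normal((T, H, W, D)) * 0.1   # embedding of x_i, not pooled
phi = rng.standard_normal((T, H, W, D)) * 0.1
g = rng.standard_normal((T, H, W, D))

phi_hat = max_pool_hw(phi).reshape(-1, D)         # sub-sampled positions j
g_hat = max_pool_hw(g).reshape(-1, D)

scores = theta.reshape(-1, D) @ phi_hat.T         # (N, N / 4) instead of (N, N)
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)     # normalize over sub-sampled j
y = (weights @ g_hat).reshape(T, H, W, D)         # one response per original i

print(scores.shape, y.shape)
```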

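The pairwise function 504 corresponding to the sub-sampled content object may be based on a Gaussian function $f(x_i, \hat{x}_j) = e^{x_i^T \hat{x}_j}$; an embedded Gaussian function $f(x_i, \hat{x}_j) = e^{\theta(x_i)^T \phi(\hat{x}_j)}$ (where $\theta$ is the embedding 502 of $x_i$ and $\phi$ is the embedding 503 of $\hat{x}_j$); a dot-product function $f(x_i, \hat{x}_j) = \theta(x_i)^T \phi(\hat{x}_j)$; or a concatenation function $f(x_i, \hat{x}_j) = \mathrm{ReLU}(w_f^T [\theta(x_i), \phi(\hat{x}_j)])$, where ReLU indicates the rectified linear unit 903 and $w_f$ is the weight vector 902 that projects the concatenation of $\theta(x_i)$ and $\phi(\hat{x}_j)$ to a scalar.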

To understand the behavior of non-local networks, comprehensive ablation experiments were performed on video classification tasks. Accordingly, the plurality of content objects may include a plurality of videos. This section first describes the baseline network architectures (i.e., the baseline machine learning model) for this task, and then extends them to 3D convolutional neural networks and to the non-local neural networks of the embodiments disclosed herein. By way of example and not by way of limitation, a 3D convolutional neural network may include a 3D ConvNet, which is conventional work. In particular embodiments, training the non-local machine learning model may include generating a plurality of feature representations for the plurality of content objects, respectively, based on the baseline machine learning model.

2D ConvNet baseline (C2D). In particular embodiments, a simple 2D baseline architecture is constructed to isolate the temporal effects of non-local neural networks relative to 3D ConvNets. In the constructed 2D baseline architecture, the temporal dimension is handled trivially (i.e., only by pooling). Table 1 shows the C2D baseline constructed on a ResNet-50 (i.e., a conventional convolutional neural network) backbone. In Table 1, the dimensions of 3D output maps and filter kernels are given as T×H×W (2D kernels as H×W), followed by the number of channels. Residual blocks are shown in brackets. In particular embodiments, the input video clip may include 32 frames, each with 224×224 pixels (i.e., the input is 32×224×224). All convolutions in Table 1 are in essence 2D kernels (implemented as 1×k×k kernels) that process the input frame by frame. The model can be initialized directly from ResNet weights pre-trained on ImageNet (i.e., a public dataset). In particular embodiments, a ResNet-101 (i.e., another conventional convolutional neural network) counterpart may be constructed in the same way. The only operations involving the temporal domain may be the pooling layers. In other words, this baseline may simply aggregate temporal information.

Table 1. Baseline ResNet-50 C2D model for video

Inflated 3D ConvNet (I3D). The C2D model in Table 1 can be turned into a 3D convolutional counterpart by "inflating" the kernels, as done in conventional work. By way of example and not by way of limitation, a 2D k×k kernel may be inflated into a 3D t×k×k kernel that spans t frames. This kernel can be initialized from a 2D model (e.g., pre-trained on ImageNet): each of the t planes in the t×k×k kernel may be initialized with the pre-trained k×k weights, rescaled by 1/t. If a video consists of a single static frame repeated in time, this initialization may produce the same results as the 2D pre-trained model run on the static frame.
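The inflation rule and the static-video property can be verified with a few lines of NumPy. This is a sketch on a single k×k patch rather than a full convolution; the function name `inflate` and the random values are assumptions for illustration.

```python
import numpy as np

def inflate(kernel_2d, t):
    """Inflate a (k, k) kernel to (t, k, k): copy it t times and rescale by 1/t."""
    return np.repeat(kernel_2d[None], t, axis=0) / t

rng = np.random.default_rng(0)
k, t = 3, 5
kernel_2d = rng.standard_normal((k, k))          # stand-in for pre-trained weights
patch = rng.standard_normal((k, k))              # one k x k image patch
static_clip = np.repeat(patch[None], t, axis=0)  # the same frame repeated t times

response_2d = np.sum(kernel_2d * patch)                    # 2D filter response
response_3d = np.sum(inflate(kernel_2d, t) * static_clip)  # inflated 3D response

print(np.isclose(response_2d, response_3d))
```

Each of the t planes contributes 1/t of the 2D response, so their sum reproduces it on a static clip.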

In particular embodiments, there may be two cases of inflation. One case may include inflating the 3×3 kernel in a residual block to 3×3×3, which may be denoted as I3D$_{3\times3\times3}$. The other case may include inflating the first 1×1 kernel in a residual block to 3×1×1, which may be denoted as I3D$_{3\times1\times1}$. Since 3D convolutions may be computationally intensive, only one kernel may be inflated for every two residual blocks. In particular embodiments, inflating more layers may show diminishing returns. By way of example and not by way of limitation, conv$_1$ can be inflated to 5×7×7. Conventional work has shown that I3D models are more accurate than their CNN (convolutional neural network) + LSTM (long short-term memory) counterparts.

In particular embodiments, non-local blocks 500 may be inserted into a C2D model or an I3D model to transform them into a non-local neural network. In particular embodiments, a different number of non-local blocks 500 may be added. By way of example and not by way of limitation, embodiments disclosed herein contemplate adding 1, 5, or 10 non-local blocks 500. Implementation details will be described in the context of the next section.

In particular embodiments, the non-local machine learning model can be pre-trained on ImageNet. In particular embodiments, 32-frame input clips may be used to further fine-tune the non-local machine learning model. These clips can be formed by randomly cropping out 64 consecutive frames from the original full-length video and then dropping every other frame. In particular embodiments, the spatial size may be 224×224 pixels, randomly cropped from a scaled video whose shorter side is randomly sampled in [256, 320] pixels. In particular embodiments, the non-local machine learning model may be trained on an 8-GPU machine, and each GPU may have 8 clips in a mini-batch (for a total mini-batch size of 64 clips). In particular embodiments, the non-local machine learning model may be trained for 400k iterations in total, starting with a learning rate of 0.01 and reducing it by a factor of 10 every 150k iterations. In particular embodiments, a momentum of 0.9 and a weight decay of 0.0001 may be used. In particular embodiments, dropout may be adopted after the global pooling layer, with a dropout ratio of 0.5. In particular embodiments, the non-local machine learning model may be fine-tuned with BatchNorm (BN) (i.e., conventional work) enabled. This is in contrast to the common practice of fine-tuning ResNets, where BN is frozen. In particular embodiments, enabling BN in the application of image/video analysis may reduce overfitting. In particular embodiments, the weight layers introduced in non-local block 500 may be initialized based on conventional work. In particular embodiments, a BN layer may be added right after the last 1×1×1 layer that represents $W_z$. In particular embodiments, BN layers may not be added to other layers in non-local block 500. In particular embodiments, the scale parameter of this BN layer may be initialized to zero. This may ensure that the initial state of the entire non-local block 500 is an identity mapping, so it may be inserted into any pre-trained network while maintaining its initial behavior.
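The clip sampling and the step learning-rate schedule described above can be sketched in plain Python. The function names and default arguments are hypothetical; the numbers (64-frame span, every other frame kept, base rate 0.01, a tenfold reduction every 150k iterations) are taken from the paragraph above.

```python
import random

def sample_clip_indices(num_frames, rng, span=64, stride=2):
    """Pick `span` consecutive frames at a random start, then keep every other one."""
    start = rng.randrange(num_frames - span + 1)
    return list(range(start, start + span, stride))

def learning_rate(iteration, base_lr=0.01, step=150_000, factor=0.1):
    """Step schedule: multiply the rate by `factor` after every `step` iterations."""
    return base_lr * factor ** (iteration // step)

rng = random.Random(0)
indices = sample_clip_indices(num_frames=300, rng=rng)
print(len(indices), indices[:4])

for it in (0, 150_000, 300_000, 399_999):
    print(it, learning_rate(it))
```

With a 400k-iteration budget, this schedule lowers the rate twice, at 150k and 300k iterations.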

In particular embodiments, spatially fully convolutional inference may be performed on videos whose shorter side is rescaled to 256. For the temporal domain, in the embodiments disclosed herein, 10 clips can be sampled evenly from a full-length video and the softmax scores can be computed on them individually. The final prediction may include the averaged softmax scores of all clips. In particular embodiments, the inference may include receiving a query content object and determining a category of the query content object based on the non-local machine learning model.
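The temporal part of this inference procedure can be sketched as follows. The helper names, the 64-frame clip span, and the random logits standing in for model outputs are assumptions for illustration; only the even sampling of 10 clips and the averaging of per-clip softmax scores come from the text.

```python
import numpy as np

def uniform_clip_starts(num_frames, num_clips=10, clip_span=64):
    """Evenly spaced start frames so that every clip fits inside the video."""
    return np.linspace(0, num_frames - clip_span, num_clips).round().astype(int)

def predict_video(clip_logits):
    """Average per-clip softmax scores; clip_logits has shape (num_clips, num_classes)."""
    z = clip_logits - clip_logits.max(axis=1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=1, keepdims=True)   # softmax computed per clip
    mean_probs = probs.mean(axis=0)             # average over all clips
    return int(mean_probs.argmax()), mean_probs

starts = uniform_clip_starts(num_frames=300)
rng = np.random.default_rng(0)
clip_logits = rng.standard_normal((len(starts), 400))  # stand-in for model outputs
label, mean_probs = predict_video(clip_logits)
print(starts.tolist(), label)
```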

In the embodiments disclosed herein, a comprehensive study is conducted on a challenging Kinetics dataset (public dataset). Results for the Charades dataset (common dataset) are also reported to show the generality of the non-local machine learning model disclosed herein.

Kinetics contains approximately 246k training videos and 20k validation videos. It is a classification task involving 400 human action categories. In particular embodiments, all non-local machine learning models are trained on the training set and tested on the validation set.

FIG. 10 illustrates example visualizations of the behavior of non-local blocks 500 computed by a non-local machine learning model. Similarly, FIG. 4 also visualizes several examples of the behavior of non-local blocks 500 computed by the non-local machine learning model. In FIG. 10, the examples of the behavior of a non-local block 500 in res$_3$ are computed by a 5-block non-local machine learning model trained on Kinetics. These examples are from held-out validation videos. The starting point of an arrow represents one $x_i$, and the ending points represent $x_j$. The arrows for each $x_i$ are visualized. The 4 frames are from a 32-frame input, shown with a stride of 8 frames. These visualizations show how the model finds related clues to support its prediction. In particular embodiments, the non-local machine learning model may learn to find meaningful relational clues regardless of the distance in space and time.

FIG. 11 illustrates example curves of the training procedure for multiple non-local machine learning models. In particular, the example curves include the training curves of the ResNet-50 C2D baseline 1101 versus the non-local C2D with 5 blocks 1103, and the validation curves of the ResNet-50 C2D baseline 1102 versus the non-local C2D with 5 blocks 1104. As used herein, "non-local C2D" refers to a non-local machine learning model based on the C2D architecture. The top-1 training errors (1101 and 1103) and validation errors (1102 and 1104) are shown in FIG. 11. The validation error is computed in the same way as the training error (so it is 1-clip testing with the same random jittering as in training). More details regarding FIG. 11 are disclosed below. In particular embodiments, the non-local C2D model is consistently better than the C2D baseline throughout the training procedure, in both training and validation error.

Table 2. Ablations on Kinetics action classification, with top-1 and top-5 classification accuracy (%)

(a) Instantiations: one non-local block of each type is added to the C2D baseline. All entries are with ResNet-50.

(b) Stages: one non-local block is added to different stages. All entries are with ResNet-50.

(c) Deeper non-local models: comparison between 1, 5, and 10 non-local blocks added to the C2D baseline. The results for ResNet-50 (top) and ResNet-101 (bottom) are shown.

(d) Space vs. time vs. spacetime: comparison between non-local operations applied along the space, time, and spacetime dimensions, respectively. 5 non-local blocks are used.

(e) Non-local vs. 3D convolution: a 5-block non-local C2D vs. inflated 3D ConvNet (I3D). All entries are with ResNet-101. The numbers of parameters and FLOPs are relative to the C2D baseline (43.2M and 34.2B).

(f) Non-local 3D ConvNet: 5 non-local blocks are added on top of the best I3D models. These results indicate that non-local operations are complementary to 3D convolutions.

(g) Longer clips: the models in Table 2f are fine-tuned and tested on 128-frame clips. The gains from non-local operations are consistent.

Table 2 shows the ablation results, which are analyzed as follows.

Table 2a compares different types of a single non-local block 500 added to the C2D baseline (right before the last residual block of res$_4$). In particular embodiments, even adding one non-local block 500 may lead to an improvement of about 1% over the baseline. In particular embodiments, the embedded Gaussian, dot-product, and concatenation versions perform similarly, up to some random variation (72.7 to 72.9). As discussed previously, non-local operations with Gaussian kernels become similar to the conventional self-attention module. However, the experiments disclosed herein show that the attentional (softmax) behavior of this module may not be the key to the improvement in video classification applications. Instead, it is more likely that the non-local behavior is what matters, and that it is insensitive to the instantiation. In the remainder of this disclosure regarding experiments, the embedded Gaussian version is used by default. In particular embodiments, the embedded Gaussian version may be easier to visualize because its softmax scores are in the range [0, 1].

At which stage should a non-local block be added? In particular embodiments, the non-local machine learning model may determine a stage from among a plurality of stages of the neural network in which to insert one or more non-local blocks 500. Table 2b compares a single non-local block 500 added to different stages of the ResNet. The non-local block 500 is added just before the last residual block of a stage. The improvement from a non-local block 500 on res₂, res₃, or res₄ is similar, and the improvement on res₅ is slightly smaller. One possible explanation is that res₅ has a small spatial size (7×7), which may be insufficient to provide precise spatial information. More evidence that the non-local block 500 exploits spatial information is examined in Table 2d.

The results of more non-local blocks 500 are shown in Table 2c. The embodiments disclosed herein add 1 block (to res₄), 5 blocks (3 to res₄ and 2 to res₃, to every other residual block), and 10 blocks (to every residual block in res₃ and res₄) in ResNet-50. The embodiments disclosed herein similarly add these non-local blocks to the corresponding residual blocks in ResNet-101. Table 2c shows that more non-local blocks 500 generally lead to better results. In particular embodiments, multiple non-local blocks 500 may perform long-range multi-hop communication. Messages can be passed back and forth between locations that are far apart in space, which can be difficult to achieve with a local model. Notably, the improvement from non-local blocks 500 may not be due solely to the depth they add to the baseline model. To see this, note that in Table 2c the non-local 5-block ResNet-50 model has an accuracy of 73.8, which is higher than the 73.1 of the deeper ResNet-101 baseline. However, the 5-block ResNet-50 model has only about 70% of the parameters and about 80% of the FLOPs of the ResNet-101 baseline, and is also shallower. This comparison shows that the improvement resulting from non-local blocks 500 is complementary to going deeper in the standard way. In a particular embodiment, standard residual blocks are added to the baseline model instead of non-local blocks 500, and the accuracy does not increase. Again, this shows that the improvement from non-local blocks 500 may not be due to the added depth alone.
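The 1-, 5-, and 10-block configurations for ResNet-50 can be written down as a small placement table. The sketch below assumes the standard ResNet-50 layout of 3, 4, 6, and 3 residual blocks in res₂ through res₅, and uses the convention that index k means "insert a non-local block after residual block k (0-based)". Which of the alternating residual blocks receive a non-local block in the 5-block case is an assumption here (the odd indices), and `nl_placements` is a hypothetical helper name.

```python
# Residual blocks per stage in a standard ResNet-50.
RESNET50_BLOCKS = {"res2": 3, "res3": 4, "res4": 6, "res5": 3}

def nl_placements(num_nl_blocks):
    """Return {stage: [indices k]}; a non-local block follows residual block k."""
    n3, n4 = RESNET50_BLOCKS["res3"], RESNET50_BLOCKS["res4"]
    if num_nl_blocks == 1:
        # Just before the last residual block of res4, i.e. after block n4 - 2.
        return {"res4": [n4 - 2]}
    if num_nl_blocks == 5:
        # Every other residual block: 2 in res3 and 3 in res4 (odd indices assumed).
        return {"res3": list(range(1, n3, 2)), "res4": list(range(1, n4, 2))}
    if num_nl_blocks == 10:
        # Every residual block in res3 and res4.
        return {"res3": list(range(n3)), "res4": list(range(n4))}
    raise ValueError("only the 1-, 5-, and 10-block configurations are sketched")

for n in (1, 5, 10):
    print(n, nl_placements(n))
```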

In certain embodiments, the non-local machine learning model may naturally handle spatiotemporal signals. This can be a desirable property because related objects in a video can appear far apart in space and at long time intervals, and their dependencies can be captured by the non-local machine learning model. In Table 2d, the effect of non-local blocks 500 applied along space, time, or space-time is studied. By way of example and not by way of limitation, in the space-only version, the non-local dependency only occurs within the same frame, i.e., in equation (1), the sum runs only over indices j in the same frame as index i. The time-only version may be set up similarly. Table 2d shows that both the space-only and time-only versions improve over the C2D baseline, but are inferior to the spatiotemporal version.
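The three variants differ only in which indices j contribute to the sum for a given index i. A minimal NumPy sketch of that restriction follows, under stated assumptions: the pairwise function is exp(x_i · x_j) with softmax normalization, g is the identity, and the time-only version is taken to sum over the same spatial location in all frames. The function name `restricted_non_local` is hypothetical.

```python
import numpy as np

def restricted_non_local(x, mode="spacetime"):
    """x: (T, H, W, C) features. Returns (y, weights); y has the shape of x."""
    T, H, W, C = x.shape
    flat = x.reshape(T * H * W, C)                # position n = t*H*W + h*W + w
    scores = flat @ flat.T                        # (N, N) pairwise dot products
    frame = np.repeat(np.arange(T), H * W)        # frame index of each position
    pixel = np.tile(np.arange(H * W), T)          # spatial index of each position
    if mode == "space":
        allowed = frame[:, None] == frame[None, :]    # j in the same frame as i
    elif mode == "time":
        allowed = pixel[:, None] == pixel[None, :]    # j at the same location as i
    elif mode == "spacetime":
        allowed = np.ones_like(scores, dtype=bool)    # all positions in the clip
    else:
        raise ValueError(f"unknown mode: {mode}")
    scores = np.where(allowed, scores, -np.inf)   # excluded j get zero weight
    scores -= scores.max(axis=1, keepdims=True)   # each row contains i itself
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights @ flat).reshape(T, H, W, C), weights

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 2, 2, 3))             # 2 frames of 2x2 positions
y_space, w_space = restricted_non_local(x, "space")
y_time, w_time = restricted_non_local(x, "time")
y_st, w_st = restricted_non_local(x, "spacetime")
```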

Table 2e compares the non-local C2D version of the non-local machine learning model with the inflated 3D ConvNet (I3D) of conventional work. In particular embodiments, non-local operations and 3D convolution may be considered two ways of extending C2D to the time dimension. Table 2e also compares the numbers of parameters and FLOPs relative to the baseline. The non-local C2D (NL C2D) model is more accurate than its I3D counterpart (e.g., 75.1 versus 74.4) while using fewer FLOPs (1.2× versus 1.5×). This comparison shows that, when used alone, the non-local machine learning model may be more efficient than 3D convolution.

Despite the above comparison, non-local operations and 3D convolution can model different aspects of the problem; for example, 3D convolution can capture local dependencies. Table 2f shows the results of inserting 5 non-local blocks 500 into the I3D_{3×1×1} models. These non-local I3D (NL I3D) models improve over their I3D counterparts (+1.6 points of accuracy), indicating that non-local operations and 3D convolution are complementary.

In certain embodiments, the generality of the non-local machine learning model on longer input videos is studied. In particular embodiments, the non-local machine learning model may take input clips of 128 consecutive frames without subsampling. Thus, the sequences in all layers of the network are 4 times longer than in the 32-frame counterparts. To fit this model into memory, the mini-batch size is reduced to 2 clips per GPU. Because of the small mini-batch size, all BN layers are frozen in this case. In particular embodiments, the model may be initialized from the corresponding model trained with 32-frame inputs. In a particular embodiment, the model may be fine-tuned on 128-frame inputs using the same number of iterations as in the 32-frame case (although the mini-batch size is now smaller), starting from a learning rate of 0.0025. Other implementation details are the same as before. Table 2g shows the results on 128-frame clips. All models have better results on longer inputs than their 32-frame counterparts in Table 2f. In particular embodiments, the NL I3D model may retain its gain over the I3D counterpart, indicating that the non-local machine learning model may work well on longer sequences.

Table 3 shows comparisons with the results of the conventional I3D model (the two results in the top rows) and the results of the Kinetics 2017 competition winner (the four results in the middle rows). In Table 3, marked numbers indicate results on the test set; unmarked numbers indicate results on the validation set. The best results of the Kinetics 2017 competition winner exploit audio signals (the last three results in the middle rows) and are therefore not vision-only solutions. In particular embodiments, these are comparisons of systems that may differ in many respects. Nevertheless, the non-local machine learning model (NL I3D) surpasses all existing RGB or RGB + flow based methods by a good margin. Without using optical flow and without any additional tricks, the non-local machine learning model is on par with the heavily engineered results of the 2017 competition winner.

Table 3. Comparison with state-of-the-art results on Kinetics.

Charades is a common video dataset with 8k training videos, 1.8k validation videos, and 2k test videos. It is a multi-label classification task with 157 action categories. In particular embodiments, a per-category sigmoid output is used to handle the multi-label property. The non-local machine learning model is initialized from models pre-trained on Kinetics (128 frames). The mini-batch size is set to 1 clip per GPU. In a particular embodiment, the non-local machine learning model is trained for 200k iterations, starting from a learning rate of 0.00125 and reducing it by a factor of 10 every 75k iterations. The locations of the 224×224 cropping windows are determined using a jittering strategy similar to that used for Kinetics, but the video is rescaled such that the cropping window outputs 288×288 pixels, on which the non-local neural network is fine-tuned. Testing is performed at a single scale of 320 pixels. Table 4 shows a comparison with previous results on Charades from conventional work. The results of the non-local machine learning model are based on ResNet-101. NL I3D uses 5 non-local blocks 500. The I3D results from conventional work are those of the 2017 Charades competition winner, which was also fine-tuned from models pre-trained on Kinetics. The I3D baseline disclosed herein is higher than previous results. As a controlled comparison, the non-local neural network improves over the I3D baseline by 2.3% on the test set.
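The training schedule and the per-category sigmoid output described above can be sketched as follows. The step schedule uses the stated values (0.00125 initially, divided by 10 every 75k iterations). The loss, a mean binary cross-entropy over the 157 independent sigmoid outputs, is an assumption, since the text only states that a sigmoid output per category is used. Both function names are hypothetical.

```python
import numpy as np

def learning_rate(iteration, base_lr=0.00125, step=75_000, factor=10.0):
    """Step schedule: divide the learning rate by `factor` every `step` iterations."""
    return base_lr / factor ** (iteration // step)

def multilabel_loss(logits, targets):
    """Mean binary cross-entropy over independent per-category sigmoid outputs."""
    probs = 1.0 / (1.0 + np.exp(-logits))        # one sigmoid per category
    eps = 1e-12                                  # guards against log(0)
    per_category = targets * np.log(probs + eps) + (1 - targets) * np.log(1 - probs + eps)
    return float(-np.mean(per_category))

# Learning rate at a few points of the 200k-iteration schedule.
schedule = {it: learning_rate(it) for it in (0, 74_999, 75_000, 150_000, 199_999)}

# A clip may carry several of the 157 action labels at once.
targets = np.zeros(157)
targets[[3, 42]] = 1.0
loss = multilabel_loss(np.zeros(157), targets)   # all logits 0 -> all probabilities 0.5
```

Because each category has its own sigmoid, the outputs are not forced to sum to one, which is what allows several actions to be predicted for the same clip.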

Table 4. Classification accuracy (%) on the Charades dataset, on the train/val split and the trainval/test split.

In particular embodiments, non-local machine learning models for static image recognition are also studied. Experiments were performed on Mask R-CNN baselines (i.e., conventional work) for COCO (i.e., a common dataset) object detection/segmentation and human pose estimation (keypoint detection). The models were trained on COCO train2017 (i.e., trainval35k in 2014) and tested on val2017 (i.e., minival in 2014).

The non-local machine learning model modifies the Mask R-CNN backbone by adding one non-local block 500 (just before the last residual block of res₄). All models are fine-tuned from ImageNet pre-training. The evaluation is performed on a standard baseline of ResNet-50/101 and a high baseline of ResNeXt-152 (X152), the latter being conventional work. Unlike conventional work, which uses stage-wise training for region proposal networks (RPNs), the embodiments disclosed herein use an improved implementation with end-to-end joint training, which results in higher baselines than conventional work.

Table 5 shows the box and mask average precision (AP) on COCO. In Table 5, the backbone is ResNet-50/101 or ResNeXt-152, both with Feature Pyramid Networks (FPNs) (i.e., conventional neural network architectures). It can be seen that a single non-local block 500 improves all R50/101 and X152 baselines on all metrics related to detection and segmentation. AP^{box} increases by about 1 point in all cases (e.g., +1.3 points in R101). The non-local block 500 is complementary to increasing the capacity of the model, even when the model is upgraded from R50/101 to X152. This comparison shows that existing models do not adequately capture non-local dependencies despite their increased depth/capacity. Furthermore, the cost of this benefit is very low. A single non-local block 500 adds only <5% computation to the baseline model. In certain embodiments, more non-local blocks 500 are added to the backbone network, but diminishing returns are observed.

Table 5. Adding 1 non-local block (NL) to Mask R-CNN for COCO object detection and instance segmentation.

Non-local blocks 500 in Mask R-CNN for keypoint detection are evaluated next. In conventional work, Mask R-CNN uses a stack of 8 convolutional layers to predict keypoints as one-hot masks. These layers are local operations and may overlook dependencies between keypoints that span long distances. The embodiments disclosed herein insert 4 non-local blocks 500 into the keypoint head (after every 2 convolutional layers). Table 6 shows the results on COCO. On the strong baseline of R101, adding 4 non-local blocks 500 to the keypoint head results in an increase of about 1 point in keypoint AP. If an additional non-local block 500 is added to the backbone network, as is done for object detection, a total increase of 1.4 points in keypoint AP over the baseline can be observed. In particular, the stricter criterion of AP₇₅ is improved by 2.4 points, indicating stronger localization performance.

Table 6. Adding non-local blocks (NL) to Mask R-CNN for COCO keypoint detection. The backbone network is ResNet-101 with FPN.

Embodiments disclosed herein propose a new class of neural networks that capture long-range dependencies through non-local operations. In particular embodiments, non-local block 500 may be combined with any existing architecture. The importance of non-local modeling is shown for the tasks of video classification, object detection and segmentation, and pose estimation. Simply adding non-local blocks 500 provides a reliable improvement over the baseline on all of these tasks. In particular embodiments, non-local layers may become an essential component of future network architectures.

FIG. 12 illustrates an example method 1200 for training a non-local machine learning model. The method may begin at step 1210, where the assistant system 140 may train a baseline machine learning model based on a neural network including a plurality of phases, where each phase includes a plurality of neural blocks. At step 1220, the assistant system 140 can access a plurality of training samples that respectively include a plurality of content objects. At step 1230, the assistant system 140 can determine one or more non-local operations, wherein each non-local operation is based on one or more pairwise functions 504 and one or more unary functions 505. At step 1240, the assistant system 140 can generate one or more non-local blocks 500 based on the plurality of training samples and the one or more non-local operations. At step 1250, the assistant system 140 can determine a phase from among a plurality of phases of the neural network. In step 1260, the assistant system 140 can train the non-local machine learning model by inserting each of the one or more non-local blocks 500 between at least two of the plurality of neural blocks in the determined stage of the neural network. Particular embodiments may repeat one or more steps of the method of fig. 12, where appropriate. Although this disclosure describes and illustrates particular steps of the method of fig. 12 as occurring in a particular order, this disclosure contemplates any suitable steps of the method of fig. 12 occurring in any suitable order. Furthermore, although this disclosure describes and illustrates an example method for training a non-local machine learning model to include particular steps of the method of fig. 12, this disclosure contemplates any suitable method for training a non-local machine learning model to include any suitable steps, which may or may not include all, some, or none of the steps of the method of fig. 12, where appropriate. 
Furthermore, although this disclosure describes and illustrates particular components, devices, or systems performing particular steps of the method of fig. 12, this disclosure contemplates any suitable combination of any suitable components, devices, or systems performing any suitable steps of the method of fig. 12.

FIG. 13 illustrates an example social graph 1300. In particular embodiments, social-networking system 160 may store one or more social graphs 1300 in one or more data stores. In particular embodiments, social graph 1300 may include multiple nodes, which may include multiple user nodes 1302 or multiple concept nodes 1304, and multiple edges 1306 that connect the nodes. Each node may be associated with a unique entity (i.e., user or concept), each of which may have a unique Identifier (ID), such as a unique number or user name. For purposes of teaching, the example social graph 1300 shown in FIG. 13 is shown in a two-dimensional visual mapping representation (two-dimensional visual map representation). In particular embodiments, social-networking system 160, client system 130, assistant system 140, or third-party system 170 may access social graph 1300 and related social-graph information for appropriate applications. Nodes and edges of social graph 1300 may be stored as data objects in, for example, a data store (e.g., a social graph database). Such data stores may include one or more searchable or queriable indexes of nodes or edges of the social graph 1300.

In particular embodiments, user node 1302 may correspond to a user of social-networking system 160 or assistant system 140. By way of example and not by way of limitation, a user may be a person (human user), entity (e.g., a business, company, or third party application), or group (e.g., of persons or entities) interacting or communicating with social-networking system 160 or assistant system 140 or interacting or communicating through social-networking system 160 or assistant system 140. In particular embodiments, when a user registers an account with social-networking system 160, social-networking system 160 may create user node 1302 corresponding to the user and store user node 1302 in one or more data stores. The users and user nodes 1302 described herein may refer to registered users and user nodes 1302 associated with registered users, where appropriate. Additionally or alternatively, the users and user nodes 1302 described herein may refer to users that are not registered with social-networking system 160, where appropriate. In particular embodiments, user node 1302 may be associated with information provided by a user or collected by various systems, including social-networking system 160. By way of example and not by way of limitation, a user may provide his or her name, profile picture, contact information, date of birth, gender, marital status, family status, profession, educational background, preferences, interests, or other demographic information. In particular embodiments, user node 1302 may be associated with one or more data objects that correspond to information associated with a user. In particular embodiments, user node 1302 may correspond to one or more web interfaces.

In particular embodiments, concept node 1304 may correspond to a concept. By way of example and not by way of limitation, a concept may correspond to a venue (such as, for example, a movie theater, restaurant, landmark, or city); a website (such as, for example, a website associated with social-networking system 160 or a third-party website associated with a web-application server); an entity (such as, for example, a person, business, group, sports team, or celebrity); a resource (such as, for example, an audio file, a video file, a digital photograph, a text file, a structured document, or an application), which may be located within social-networking system 160 or on an external server (e.g., a web application server); real or intellectual property (such as, for example, sculpture, painting, movie, game, song, idea, photograph, or written work); playing; activity; ideas or theories; another suitable concept; or two or more such concepts. Concept node 1304 may be associated with information of concepts provided by users or information collected by various systems, including social-networking system 160 and assistant system 140. By way of example, and not by way of limitation, information of a concept may include a name or title; one or more images (e.g., images of the cover of a book); location (e.g., address or geographic location); a website (which may be associated with a URL); contact information (e.g., telephone number or email address); other suitable conceptual information; or any suitable combination of such information. In particular embodiments, concept node 1304 may be associated with one or more data objects that correspond to information associated with concept node 1304. In particular embodiments, concept node 1304 may correspond to one or more web interfaces.

In particular embodiments, nodes in social graph 1300 may represent or be represented by a web interface (which may be referred to as a "profile interface"). The profile interface may be hosted by social-networking system 160 or assistant system 1130 or accessible to social-networking system 160 or assistant system 1130. The profile interface may also be hosted on a third party website associated with the third party system 170. By way of example, and not by way of limitation, a profile interface corresponding to a particular external web interface may be a particular external web interface, and a profile interface may correspond to a particular concept node 1304. The profile interface may be viewable by all or a selected subset of the other users. By way of example, and not by way of limitation, user node 1302 may have a corresponding user profile interface in which a corresponding user may add content, make a statement, or otherwise express himself or herself. As another example and not by way of limitation, concept node 1304 may have a corresponding concept profile interface in which one or more users may add content, make claims, or express themselves, particularly with respect to concepts corresponding to concept node 1304.

In particular embodiments, concept node 1304 may represent a third-party web interface or resource hosted by third-party system 170. The third party web interface or resource may include content representing an action or activity, selectable icons or other interactable objects (which may be implemented with JavaScript, AJAX or PHP code, for example), and other elements. By way of example and not by way of limitation, the third-party web interface may include selectable icons such as "praise," "check-in," "eat," "recommend," or other suitable actions or activities. A user viewing the third-party web interface may perform an action by selecting one of the icons (e.g., a "check-in") causing client system 130 to send a message to social-networking system 160 indicating the user's action. In response to the message, social-networking system 160 may create an edge (e.g., a check-in type edge) between user node 1302 corresponding to the user and concept node 1304 corresponding to the third-party web interface or resource, and store edge 1306 in one or more data stores.

In particular embodiments, a pair of nodes in social graph 1300 may be connected to each other by one or more edges 1306. An edge 1306 connecting a pair of nodes may represent a relationship between the pair of nodes. In particular embodiments, edge 1306 may include or represent one or more data objects or attributes that correspond to the relationship between a pair of nodes. As an example and not by way of limitation, a first user may indicate that a second user is a "friend" of the first user. In response to the indication, social-networking system 160 may send a "friend request" to the second user. If the second user confirms the "friend request," social-networking system 160 may create an edge 1306 in social graph 1300 that connects user node 1302 of the first user to user node 1302 of the second user, and store edge 1306 as social-graph information in one or more data stores 1613. In the example of FIG. 13, social graph 1300 includes an edge 1306 indicating a friendship between the user nodes 1302 of user "A" and user "B" and an edge indicating a friendship between the user nodes 1302 of user "C" and user "B". Although this disclosure describes or illustrates particular edges 1306 with particular attributes connecting particular user nodes 1302, this disclosure contemplates any suitable edges 1306 with any suitable attributes connecting user nodes 1302. By way of example and not by way of limitation, edge 1306 may represent a friendship, a family relationship, a business or employment relationship, a fan relationship (including, for example, liking, etc.), a follower relationship, a visitor relationship (including, for example, visiting, viewing, checking in, sharing, etc.), a subscriber relationship, a superior/subordinate relationship, a reciprocal relationship, a non-reciprocal relationship, another suitable type of relationship, or two or more such relationships. 
Further, while the present disclosure generally describes nodes as being connected, the present disclosure also describes users or concepts as being connected. References herein to connected users or concepts may refer to nodes corresponding to those users or concepts that are connected by one or more edges 1306 in the social graph 1300, where appropriate.

In particular embodiments, an edge 1306 between user node 1302 and concept node 1304 may represent a particular action or activity performed by a user associated with user node 1302 towards a concept associated with concept node 1304. By way of example and not by way of limitation, as shown in FIG. 13, a user may "like," "attend," "play," "listen to," "cook," "work at," or "watch" a concept, each of which may correspond to an edge type or subtype. The concept profile interface corresponding to concept node 1304 may include, for example, a selectable "check-in" icon (such as, for example, a clickable "check-in" icon) or a selectable "add to favorites" icon. Similarly, after the user clicks one of these icons, social-networking system 160 may create a "favorites" edge or a "check-in" edge in response to the corresponding user action. As another example and not by way of limitation, a user (user "C") may use a particular application (SPOTIFY, which is an online music application) to listen to a particular song ("Imagine"). In this case, social-networking system 160 may create a "listen" edge 1306 and a "use" edge (as shown in FIG. 13) between user node 1302 corresponding to the user and the concept nodes 1304 corresponding to the song and the application to indicate that the user has listened to the song and used the application. In addition, social-networking system 160 may create a "play" edge 1306 (as shown in FIG. 13) between the concept nodes 1304 corresponding to the song and the application to indicate that the particular song was played by the particular application. In this case, the "play" edge 1306 corresponds to an action performed by an external application (SPOTIFY) on an external audio file (the song "Imagine"). 
Although this disclosure describes particular edges 1306 with particular attributes connecting user nodes 1302 and concept nodes 1304, this disclosure contemplates any suitable edges 1306 with any suitable attributes connecting user nodes 1302 and concept nodes 1304. Further, while this disclosure describes edges between user node 1302 and concept node 1304 representing a single relationship, this disclosure contemplates edges between user node 1302 and concept node 1304 representing one or more relationships. By way of example, and not by way of limitation, an edge 1306 may indicate both that the user likes and that the user has used a particular concept. Alternatively, a separate edge 1306 may represent each type of relationship (or multiples of a single relationship) between user node 1302 and concept node 1304 (as shown in FIG. 13, between user node 1302 of user "E" and concept node 1304 of "SPOTIFY").

In particular embodiments, social-networking system 160 may create an edge 1306 between user node 1302 and concept node 1304 in social graph 1300. By way of example and not by way of limitation, a user viewing a concept profile interface (such as, for example, by using a web browser or a dedicated application hosted by the user's client system 130) may indicate that he or she likes the concept represented by the concept node 1304 by clicking or selecting a "like" icon, which may cause the user's client system 130 to send a message to the social networking system 160 indicating that the user likes the concept associated with the concept profile interface. In response to the message, social-networking system 160 may create an edge 1306 between user node 1302 and concept node 1304 associated with the user, as shown by "endorsed" edge 1306 between the user and concept node 1304. In particular embodiments, social-networking system 160 may store edges 1306 in one or more data stores. In particular embodiments, edge 1306 may be automatically formed by social-networking system 160 in response to a particular user action. By way of example and not by way of limitation, if a first user uploads a picture, views a movie, or listens to a song, an edge 1306 may be formed between a user node 1302 corresponding to the first user and concept nodes 1304 corresponding to those concepts. Although this disclosure describes forming a particular edge 1306 in a particular manner, this disclosure contemplates forming any suitable edge 1306 in any suitable manner.
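The node and edge structure described for social graph 1300 can be illustrated with a small in-memory data structure. This is only a sketch under stated assumptions: the class names `Node` and `SocialGraph`, the identifiers such as `user_C` and `music_app`, and the storage of edges as (source, type, target) triples are all hypothetical, and the disclosure does not prescribe any particular storage format.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    node_id: str       # unique identifier of the entity
    kind: str          # "user" or "concept"

@dataclass
class SocialGraph:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: set = field(default_factory=set)     # (source_id, edge_type, target_id)

    def add_node(self, node_id, kind):
        """Create the node if it does not exist yet and return it."""
        return self.nodes.setdefault(node_id, Node(node_id, kind))

    def add_edge(self, source_id, edge_type, target_id):
        """Create a typed edge, e.g. in response to a user action."""
        if source_id not in self.nodes or target_id not in self.nodes:
            raise KeyError("both endpoints must exist before an edge is created")
        self.edges.add((source_id, edge_type, target_id))

    def targets(self, source_id, edge_type):
        """Node ids reachable from source_id over edges of the given type."""
        return sorted(t for s, e, t in self.edges if s == source_id and e == edge_type)

graph = SocialGraph()
graph.add_node("user_C", "user")
graph.add_node("song_imagine", "concept")
graph.add_node("music_app", "concept")
graph.add_edge("user_C", "listen", "song_imagine")   # user listened to the song
graph.add_edge("user_C", "use", "music_app")         # user used the application
graph.add_edge("music_app", "play", "song_imagine")  # application played the song
graph.add_edge("user_C", "like", "song_imagine")     # user selected a "like" icon
```

Storing edges in a set makes repeated identical actions idempotent; a real system would more likely keep timestamps or counts per edge.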

Fig. 14 shows an example view of vector space 1400. In particular embodiments, an object or n-gram may be represented in a d-dimensional vector space, where d represents any suitable dimension. Although vector space 1400 is shown as a three-dimensional space, this is for illustration purposes only, as vector space 1400 may have any suitable dimensions. In particular embodiments, an n-gram may be represented in vector space 1400 as a vector, referred to as a term embedding. Each vector may include coordinates corresponding to a particular point in vector space 1400 (i.e., the end point of the vector). By way of example, and not by way of limitation, as shown in Fig. 14, vectors 1410, 1420, and 1430 may be represented as points in vector space 1400. An n-gram may be mapped to a corresponding vector representation. By way of example, and not by way of limitation, by applying a function $\vec{\pi}$ defined by a dictionary, n-grams $t_1$ and $t_2$ may be mapped to vectors $\vec{v_1}$ and $\vec{v_2}$ in vector space 1400, respectively, such that $\vec{v_1} = \vec{\pi}(t_1)$ and $\vec{v_2} = \vec{\pi}(t_2)$. As another example and not by way of limitation, a dictionary trained to map text to vector representations may be utilized, or such a dictionary may itself be generated through training. As another example and not by way of limitation, a model (e.g., Word2vec) may be used to map n-grams to vector representations in vector space 1400. In particular embodiments, a machine learning model (e.g., a neural network) may be used to map an n-gram to a vector representation in vector space 1400. The machine learning model may have been trained using a sequence of training data (e.g., a corpus of multiple objects each including n-grams).

In particular embodiments, an object may be represented in vector space 1400 as a vector, referred to as a feature vector or object embedding. By way of example, and not by way of limitation, by applying a function $\vec{\pi}$, objects $e_1$ and $e_2$ may be mapped to vectors $\vec{v_1}$ and $\vec{v_2}$ in vector space 1400, respectively, such that $\vec{v_1} = \vec{\pi}(e_1)$ and $\vec{v_2} = \vec{\pi}(e_2)$. In particular embodiments, an object may be mapped to a vector based on one or more characteristics, attributes, or features of the object, relationships of the object to other objects, or any other suitable information associated with the object. By way of example and not by way of limitation, the function $\vec{\pi}$ may map objects to vectors by feature extraction, which may begin with an initial set of measured data and construct derived values (e.g., features). By way of example and not by way of limitation, objects including video or images may be mapped to vectors by using algorithms to detect or isolate various desired portions or shapes of the objects. Features used to calculate vectors may be based on edge detection, corner detection, blob detection, ridge detection, scale-invariant feature transform, edge direction, changing intensity, autocorrelation, motion detection, optical flow, thresholding, blob extraction, template matching, information obtained by a Hough transform (e.g., line, circle, ellipse, arbitrary shape), or any other suitable information. As another example and not by way of limitation, objects comprising audio data may be mapped to vectors based on features such as spectral slope, pitch coefficients, audio spectral centroid, audio spectral envelope, mel-frequency cepstrum, or any other suitable information. In a particular embodiment, the function $\vec{\pi}$ may map the object to a vector using a transformed reduced feature set (e.g., feature selection). In particular embodiments, the function $\vec{\pi}$ may map an object $e$ to a vector $\vec{\pi}(e)$. Although this disclosure describes representing an n-gram or object in vector space in a particular manner, this disclosure contemplates representing an n-gram or object in vector space in any suitable manner.

In particular embodiments, social-networking system 160 may calculate a similarity measure for vectors in vector space 1400. The similarity measure may be cosine similarity, Minkowski distance, Mahalanobis distance, Jaccard similarity coefficient, or any suitable similarity measure. By way of example and not by way of limitation, the similarity measure of $\vec{v_1}$ and $\vec{v_2}$ may be the cosine similarity $\frac{\vec{v_1} \cdot \vec{v_2}}{\|\vec{v_1}\| \, \|\vec{v_2}\|}$. As another example and not by way of limitation, the similarity measure of $\vec{v_1}$ and $\vec{v_2}$ may be the Euclidean distance $\|\vec{v_1} - \vec{v_2}\|$. The similarity measure of two vectors may represent how similar the two objects or n-grams corresponding to the two vectors are to each other, as measured by the distance between the two vectors in vector space 1400. By way of example and not by way of limitation, vectors 1410 and 1420 may correspond to objects that are more similar to each other than the objects corresponding to vectors 1410 and 1430, based on the distance between the respective vectors. Although this disclosure describes computing similarity measures between vectors in a particular manner, this disclosure contemplates computing similarity measures between vectors in any suitable manner.
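The two similarity measures named above can be sketched as follows. This is a minimal illustration and not the implementation of the disclosure: the function names and the two-dimensional coordinates standing in for vectors 1410, 1420, and 1430 are hypothetical.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


def euclidean_distance(v1, v2):
    """Length of the difference between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


# Hypothetical coordinates; the disclosure does not give numeric values.
v_1410 = [1.0, 2.0]
v_1420 = [1.5, 2.5]
v_1430 = [6.0, 0.5]

# A smaller distance is read as "more similar".
closer = euclidean_distance(v_1410, v_1420) < euclidean_distance(v_1410, v_1430)
```

With these assumed coordinates, vector 1410 lies closer to vector 1420 than to vector 1430, which matches the ordering described in the example above.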

More information about vector space, embedding, feature vectors, and similarity metrics can be found in U.S. patent application Ser. No. 14/949436, filed 11/23/2015, U.S. patent application Ser. No. 15/286315, filed 10/2016, and U.S. patent application Ser. No. 15/365789, filed 11/30/2016, each of which is incorporated by reference.

Fig. 15 illustrates an example artificial neural network ("ANN") 1500. In particular embodiments, an ANN may refer to a computational model that includes one or more nodes. Example ANN 1500 may include input layer 1510, hidden layers 1520, 1530, 1560, and output layer 1550. Each layer of ANN 1500 may include one or more nodes, such as node 1505 or node 1515. In particular embodiments, each node of the ANN may be connected to another node of the ANN. By way of example, and not by way of limitation, each node of the input layer 1510 may be connected to one or more nodes of the hidden layer 1520. In particular embodiments, one or more nodes may be bias nodes (e.g., nodes in a layer that are not connected to, and do not receive input from, any node in a previous layer). In particular embodiments, each node in each layer may be connected to one or more nodes of a previous layer or a subsequent layer. Although Fig. 15 depicts a particular ANN having a particular number of layers, a particular number of nodes, and particular connections between nodes, the present disclosure contemplates any suitable ANN having any suitable number of layers, any suitable number of nodes, and any suitable connections between nodes. By way of example and not by way of limitation, although Fig. 15 depicts a connection between each node of the input layer 1510 and each node of the hidden layer 1520, one or more nodes of the input layer 1510 may not be connected to one or more nodes of the hidden layer 1520.

In particular embodiments, the ANN may be a feed-forward ANN (e.g., an ANN without cycles or loops, where communication between nodes flows in one direction, beginning at the input layer and proceeding to successive layers). By way of example, and not by way of limitation, the input of each node of hidden layer 1520 may include the output of one or more nodes of input layer 1510. As another example and not by way of limitation, the input of each node of output layer 1550 may include the output of one or more nodes of hidden layer 1560. In particular embodiments, the ANN may be a deep neural network (e.g., a neural network including at least two hidden layers). In particular embodiments, the ANN may be a deep residual network. The deep residual network may be a feed-forward ANN that includes hidden layers organized into residual blocks. The input of each residual block following the first residual block may be a function of the output of the previous residual block and the input of the previous residual block. By way of example, and not by way of limitation, the input to residual block N may be F(x) + x, where F(x) may be the output of residual block N-1 and x may be the input to residual block N-1. Although this disclosure describes a particular ANN, this disclosure contemplates any suitable ANN.
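The residual relationship F(x) + x can be sketched as below. This is an illustration under stated assumptions, not the network of the disclosure: F is assumed to be a single square linear layer followed by a rectifier, and the names `linear`, `relu`, and `next_block_input` are hypothetical.

```python
def relu(values):
    """Element-wise rectifier max(0, v)."""
    return [max(0.0, v) for v in values]


def linear(weights, x):
    """Multiply a weight matrix (list of rows) by the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def next_block_input(x, weights):
    """Return F(x) + x, the input passed on to the following residual block.

    Assumption: F is one linear layer followed by a rectifier, and the weight
    matrix is square so that F(x) and x have the same length.
    """
    fx = relu(linear(weights, x))
    return [f + xi for f, xi in zip(fx, x)]


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, -2.0]

with_identity = next_block_input(x, identity)  # F(x) = [1.0, 0.0]
with_zeros = next_block_input(x, zeros)        # F(x) = [0.0, 0.0]
```

When F contributes nothing (all-zero weights in this sketch), the block passes x through unchanged, which is the property that the skip connection provides.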

In particular embodiments, an activation function may correspond to each node of the ANN. The activation function of a node may define the output of the node for a given input. In particular embodiments, the input of a node may include a set of inputs. By way of example, and not by way of limitation, the activation function may be an identity function, a binary step function, a logistic function, or any other suitable function. As another example and not by way of limitation, the activation function of node k may be the sigmoid function $F_k(s_k) = \frac{1}{1 + e^{-s_k}}$, the hyperbolic tangent function $F_k(s_k) = \frac{e^{s_k} - e^{-s_k}}{e^{s_k} + e^{-s_k}}$, the rectifier $F_k(s_k) = \max(0, s_k)$, or any other suitable function $F_k(s_k)$, where $s_k$ may be the effective input of node k. In particular embodiments, the input to the activation function of a node may be weighted. Each node may generate an output using a corresponding activation function based on the weighted input. In particular embodiments, each connection between nodes may be associated with a weight. By way of example, and not by way of limitation, the connection 1525 between node 1505 and node 1515 may have a weighting coefficient of 0.4, which may indicate that the output of node 1505 multiplied by 0.4 is used as an input to node 1515. As another example and not by way of limitation, the output $y_k$ of node k may be $y_k = F_k(s_k)$, where $F_k$ may be the activation function corresponding to node k, $s_k = \sum_j (w_{jk} x_j)$ may be the effective input of node k, $x_j$ may be the output of a node j connected to node k, and $w_{jk}$ may be the weighting coefficient between node j and node k. In particular embodiments, the input of the nodes of the input layer may be based on a vector representing an object. Although this disclosure describes particular inputs and outputs of a node, this disclosure contemplates any suitable inputs and outputs of a node. Further, while the present disclosure may describe particular connections and weights between nodes, the present disclosure contemplates any suitable connections and weights between nodes.
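The node computation described above, in which the output of node k is its activation function applied to the weighted sum of its inputs, can be sketched as follows. The function names are hypothetical, and the output value 2.0 assigned to node 1505 is an assumed number chosen only to exercise the 0.4 weighting coefficient from the example.

```python
import math


def sigmoid(s):
    """Sigmoid activation 1 / (1 + e^(-s))."""
    return 1.0 / (1.0 + math.exp(-s))


def rectifier(s):
    """Rectifier activation max(0, s)."""
    return max(0.0, s)


def node_output(inputs, weights, activation):
    """Apply the activation to the effective input, the sum over j of w_jk * x_j."""
    effective_input = sum(w * x for w, x in zip(weights, inputs))
    return activation(effective_input)


# Assumed value: node 1505 outputs 2.0; connection 1525 has weight 0.4.
output_1505 = 2.0
output_1515 = node_output([output_1505], [0.4], rectifier)
```

Passing `math.tanh` as the `activation` argument gives the hyperbolic tangent variant without any further code.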

In particular embodiments, the ANN may be trained using training data. By way of example, and not by way of limitation, training data may include inputs and expected outputs of ANN 1500. As another example and not by way of limitation, the training data may include vectors, each vector representing a training object, and an expected label for each training object. In particular embodiments, training the ANN may include modifying weights associated with the connections between nodes of the ANN by optimizing an objective function. By way of example and not by way of limitation, a training method (e.g., the conjugate gradient method, the gradient descent method, the stochastic gradient descent method) may be used to backpropagate the sum-of-squares error, measured as a distance between each vector representing a training object (e.g., using a cost function that minimizes the sum-of-squares error). In particular embodiments, the ANN may be trained using a dropout technique. By way of example and not by way of limitation, one or more nodes may be temporarily ignored in training (e.g., they receive no input and generate no output). For each training object, one or more nodes of the ANN may have a certain probability of being ignored. Nodes that are ignored for a particular training object may be different from nodes that are ignored for other training objects (e.g., nodes may be temporarily ignored on an object-by-object basis). Although this disclosure describes training an ANN in a particular manner, this disclosure contemplates training an ANN in any suitable manner.
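A minimal sketch of the training ideas above, assuming a single linear node trained by stochastic gradient descent on the squared error, with optional dropout applied to its input nodes. The function names, learning rate, and training samples are hypothetical, and the dropout here simply zeroes ignored inputs without the rescaling that practical implementations often add.

```python
import random


def dropout_mask(num_nodes, drop_probability, rng):
    """Return True for nodes that are kept and False for nodes that are ignored."""
    return [rng.random() >= drop_probability for _ in range(num_nodes)]


def sgd_step(weights, inputs, expected, learning_rate, keep_mask):
    """One gradient step on (predicted - expected)^2 for a single linear node."""
    x = [xi if keep else 0.0 for xi, keep in zip(inputs, keep_mask)]
    predicted = sum(w * xi for w, xi in zip(weights, x))
    error = predicted - expected
    # The derivative of the squared error with respect to w_j is 2 * error * x_j.
    return [w - learning_rate * 2.0 * error * xi for w, xi in zip(weights, x)]


def train(samples, num_inputs, epochs=200, learning_rate=0.05,
          drop_probability=0.0, seed=0):
    """Visit every sample once per epoch, drawing a fresh mask per training object."""
    rng = random.Random(seed)
    weights = [0.0] * num_inputs
    for _ in range(epochs):
        for inputs, expected in samples:
            mask = dropout_mask(num_inputs, drop_probability, rng)
            weights = sgd_step(weights, inputs, expected, learning_rate, mask)
    return weights


# Assumed training data generated by y = 2 * x1 - x2.
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
learned = train(samples, num_inputs=2)
```

Because a new mask is drawn for every training object, the set of ignored nodes differs from object to object, as the paragraph above describes.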

In particular embodiments, one or more objects (e.g., content or other types of objects) of a computing system may be associated with one or more privacy settings. One or more objects may be stored on or otherwise associated with any suitable computing system or application, such as, for example, social-networking system 160, client system 130, assistant system 140, third-party system 170, a social-networking application, an assistant application, a messaging application, a photo-sharing application, or any other suitable computing system or application. Although the examples discussed herein are in the context of an online social network, these privacy settings may be applied to any other suitable computing system. The privacy settings (or "access settings") of the object may be stored in any suitable manner, such as, for example, in association with the object, indexed on an authorization server, in another suitable manner, or any suitable combination thereof. Privacy settings regarding an object may specify how the object (or particular information associated with the object) may be accessed, stored, or otherwise used (e.g., viewed, shared, modified, copied, executed, rendered, or identified) in an online social network. An object may be described as "visible" with respect to a particular user or other entity when the privacy setting of the object allows the user or other entity to access the object. By way of example and not by way of limitation, a user of an online social network may specify privacy settings for a user profile page that identify a set of users that may access work experience information on the user profile page, thus excluding other users from accessing the information.

In particular embodiments, the privacy settings of an object may specify a "blacklist" (blacklist) of users or other entities that should not be allowed to access certain information associated with the object. In particular embodiments, the blacklist may include third party entities. A blacklist may specify one or more users or entities to which the object is not visible. By way of example and not by way of limitation, a user may specify a set of users that may not access an album associated with the user, thus excluding those users from accessing the album (while also potentially allowing certain users not within the specified set of users to access the album). In particular embodiments, privacy settings may be associated with particular social graph elements. The privacy settings of a social graph element (e.g., node or edge) may specify how the social graph element, information associated with the social graph element, or objects associated with the social graph element may be accessed using an online social network. By way of example and not by way of limitation, a particular concept node 1304 corresponding to a particular photo may have privacy settings that specify that the photo is only accessible by users marked in the photo and friends of users marked in the photo. In particular embodiments, privacy settings may allow users to opt-in or opt-out of having their content, information, or actions stored/recorded by social-networking system 160 or assistant system 140 or shared with other systems (e.g., third-party system 170). Although this disclosure describes using particular privacy settings in a particular manner, this disclosure contemplates using any suitable privacy settings in any suitable manner.

In particular embodiments, the privacy settings may be based on one or more nodes or edges of social graph 1300. The privacy settings may be specified for one or more edges 1306 or edge types of the social graph 1300, or for one or more nodes 1302, 1304 or node types of the social graph 1300. The privacy settings applied to a particular edge 1306 that connects two nodes may control whether the relationship between the two entities corresponding to the two nodes is visible to other users of the online social network. Similarly, privacy settings applied to a particular node may control whether a user or concept corresponding to that node is visible to other users of the online social network. By way of example, and not by way of limitation, a first user may share an object with social-networking system 160. The object may be associated with a concept node 1304 that is connected to a user node 1302 of the first user by an edge 1306. The first user may specify privacy settings that apply to a particular edge 1306 connected to the concept node 1304 of the object, or may specify privacy settings that apply to all edges 1306 connected to the concept node 1304. As another example and not by way of limitation, a first user may share a set of objects (e.g., a set of images) of a particular object type. The first user may specify particular privacy settings for all objects of that particular object type associated with the first user (e.g., specify that all images posted by the first user are visible only to friends of the first user and/or users marked in the images).

In particular embodiments, social-networking system 160 may present a "privacy wizard" (e.g., within a web page, a module, one or more dialog boxes, or any other suitable interface) to the first user to help the first user specify one or more privacy settings. The privacy wizard may display instructions, appropriate privacy related information, current privacy settings, one or more input fields for accepting one or more inputs from the first user (which specify changes or confirmation of privacy settings), or any suitable combination thereof. In particular embodiments, social-networking system 160 may provide a "dashboard" function to the first user that may display the first user's current privacy settings. The dashboard function may be displayed to the first user at any suitable time (e.g., after input from the first user invoking the dashboard function, after a particular event or trigger action occurs). The dashboard functionality may allow the first user to modify one or more current privacy settings of the first user at any time in any suitable manner (e.g., redirect the first user to the privacy wizard).

The privacy settings associated with the object may specify any suitable granularity at which access is allowed or denied. As an example and not by way of limitation, access may be specified for particular users (e.g., only me, my roommates, my boss), users within a particular degree of separation (e.g., friends, friends of friends), user groups (e.g., the gaming club, my family), user networks (e.g., employees of a particular employer, students or alumni of a particular university), all users ("public"), no users ("private"), users of the third party system 170, particular applications (e.g., third party applications, external websites), other suitable entities, or any suitable combination thereof. Although this disclosure describes a particular granularity of allowing access or denying access, this disclosure contemplates any suitable granularity of allowing access or denying access.

In particular embodiments, one or more servers 162 may be authorization/privacy servers for enforcing privacy settings. In response to a request from a user (or other entity) for a particular object stored in data store 164, social-networking system 160 may send a request for the object to data store 164. The request may identify the user associated with the request, and the object may be sent to the user (or the user's client system 130) only if the authorization server determines that the user is authorized to access the object based on the privacy settings associated with the object. If the requesting user is not authorized to access the object, the authorization server may prevent the requested object from being retrieved from the data store 164 or may prevent the requested object from being sent to the user. In a search-query context, an object may be provided as a search result only if the querying user is authorized to access the object, for example, if the privacy settings of the object allow it to be revealed to, discovered by, or otherwise visible to the querying user. In particular embodiments, the object may represent content that is visible to the user through the user's news feed. By way of example and not by way of limitation, one or more objects may be visible on a user's "Trending" page. In particular embodiments, the object may correspond to a particular user. The object may be content associated with a particular user, or may be an account of a particular user or information stored on social-networking system 160 or other computing system. By way of example and not by way of limitation, a first user may view one or more second users of the online social network through the "People You May Know" function of the online social network or by viewing the first user's friends list. By way of example and not by way of limitation, a first user may specify that they do not wish to see objects associated with a particular second user in their news feed or friends list. An object may be excluded from the search results if its privacy settings do not allow it to be revealed to, discovered by, or visible to the user. Although this disclosure describes enforcing privacy settings in a particular manner, this disclosure contemplates enforcing privacy settings in any suitable manner.
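The enforcement flow above, in which an object is returned or listed as a search result only when the requesting user is authorized, can be sketched as below. All names (`PrivacySetting`, `is_authorized`, `fetch_object`, `filter_search_results`) and the sample data are hypothetical, and the in-memory dictionaries merely stand in for data store 164 and the stored privacy settings.

```python
from dataclasses import dataclass, field


@dataclass
class PrivacySetting:
    """Assumed minimal shape of a per-object privacy setting."""
    allowed_users: set = field(default_factory=set)
    blocked_users: set = field(default_factory=set)
    public: bool = False


def is_authorized(user_id, setting):
    """A blocked user is always denied; otherwise public or explicitly allowed."""
    if user_id in setting.blocked_users:
        return False
    return setting.public or user_id in setting.allowed_users


def fetch_object(user_id, object_id, data_store, settings):
    """Return the object only if the user is authorized; otherwise return None."""
    setting = settings.get(object_id)
    if setting is None or not is_authorized(user_id, setting):
        return None  # the object is never retrieved for an unauthorized user
    return data_store.get(object_id)


def filter_search_results(user_id, candidate_ids, settings):
    """Keep only the candidates that the querying user is authorized to see."""
    return [oid for oid in candidate_ids
            if oid in settings and is_authorized(user_id, settings[oid])]


data_store = {"photo": "photo-bytes", "post": "hello"}
settings = {
    "photo": PrivacySetting(allowed_users={"alice"}),
    "post": PrivacySetting(public=True, blocked_users={"bob"}),
}
```

In this sketch an object without any recorded setting is treated as inaccessible, which is a conservative design choice of the example rather than something the disclosure states.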

In particular embodiments, different objects of the same type associated with a user may have different privacy settings. Different types of objects associated with a user may have different types of privacy settings. As an example and not by way of limitation, a first user may specify that a status update of the first user is public, but that any images shared by the first user are only visible to friends of the first user on the online social network. As another example and not by way of limitation, a user may specify different privacy settings for different types of entities (e.g., individual users, friends of friends, attendees, user groups, or corporate entities). As another example and not by way of limitation, a first user may designate a group of users that may view video published by the first user while preventing the video from being visible to an employer of the first user. In particular embodiments, different privacy settings may be provided for different groups or demographics of users. As an example and not by way of limitation, a first user may specify that other users at the same university as the first user may view the first user's photos, but that other users who are members of the first user's family may not view those same photos.

In particular embodiments, social-networking system 160 may provide one or more default privacy settings for each object of a particular object type. A privacy setting for an object that is set to a default may be changed by a user associated with the object. As an example and not by way of limitation, all images posted by a first user may have a default privacy setting of being visible only to friends of the first user, and, for a particular image, the first user may change the privacy setting of that image to be visible to friends and friends of friends.

In particular embodiments, the privacy settings may allow the first user to specify (e.g., by opting out, by not opting in) whether social-networking system 160 or assistant system 140 may receive, collect, record, or store particular objects or information associated with the user for any purpose. In particular embodiments, the privacy settings may allow the first user to specify whether a particular application or process may access, store, or use a particular object or information associated with the user. The privacy settings may allow the first user to opt in or opt out of having objects or information accessed, stored, or used by a particular application or process. Social-networking system 160 or assistant system 140 may access such information to provide a particular function or service to the first user, but social-networking system 160 or assistant system 140 may not access the information for any other purpose. Before accessing, storing, or using such objects or information, social-networking system 160 or assistant system 140 may prompt the user to provide privacy settings that specify which applications or processes, if any, may access, store, or use the objects or information before any such action is allowed. By way of example and not by way of limitation, a first user may transmit a message to a second user via an application (e.g., a messaging app) associated with the online social network, and may specify privacy settings under which social-networking system 160 or assistant system 140 should not store such a message.

In particular embodiments, a user may specify whether social-networking system 160 or assistant system 140 may access, store, or use a particular type of object or information associated with the first user. By way of example and not by way of limitation, a first user may specify that an image sent by the first user through social-networking system 160 or assistant system 140 may not be stored by social-networking system 160 or assistant system 140. As another example and not by way of limitation, a first user may specify that messages sent from the first user to a particular second user may not be stored by social-networking system 160 or assistant system 140. As yet another example and not by way of limitation, a first user may specify that all objects sent via a particular application may be saved by social-networking system 160 or assistant system 140.

In particular embodiments, the privacy settings may allow the first user to specify whether particular objects or information associated with the first user may be accessed from a particular client system 130 or third party system 170. The privacy settings may allow the first user to opt-in or opt-out of accessing objects or information from a particular device (e.g., a phonebook on the user's smart phone), from a particular application (e.g., a messaging app), or from a particular system (e.g., an email server). Social-networking system 160 or assistant system 140 may provide default privacy settings for each device, system, or application and/or may prompt the first user to specify particular privacy settings for each context. By way of example and not by way of limitation, a first user may utilize location services features of social-networking system 160 or assistant system 140 to provide recommendations of restaurants or elsewhere in the vicinity of the user. The default privacy settings of the first user may specify that social-networking system 160 or assistant system 140 may provide location-based services using location information provided from client device 130 of the first user, but social-networking system 160 or assistant system 140 may not store or provide location information of the first user to any third-party system 170. The first user may then update the privacy settings to allow the third party image sharing application to use the location information to geotag the photo.

In particular embodiments, the privacy settings may allow a user to specify one or more geographic locations from which objects may be accessed. Access or denial of access to an object may depend on the geographic location of the user attempting to access the object. By way of example and not by way of limitation, users may share an object and specify that only users of the same city may access or view the object. As another example and not by way of limitation, a first user may share an object and specify that the object is only visible to a second user when the first user is in a particular location. If the first user leaves a particular location, the object may no longer be visible to the second user. As another example and not by way of limitation, a first user may specify that an object is visible only to a second user that is within a threshold distance from the first user. If the first user subsequently changes locations, the original second user that has access to the object may lose access, while the new second user group may gain access when they come within a threshold distance of the first user.
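The threshold-distance example above can be sketched as a great-circle distance check. The disclosure does not say how distance is computed; the haversine formula, the mean Earth radius of 6371 km, and the names `haversine_km` and `visible_within_threshold` are assumptions made for this illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def visible_within_threshold(owner_location, viewer_location, threshold_km):
    """True if the viewer is within threshold_km of the owner of the object."""
    return haversine_km(*owner_location, *viewer_location) <= threshold_km


owner = (0.0, 0.0)
viewer = (0.0, 1.0)  # one degree of longitude east along the equator
```

Re-evaluating `visible_within_threshold` whenever either user's location changes yields the behavior described above: a second user may lose access after the first user moves away, while other users gain access once they come within the threshold distance.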

In particular embodiments, social-networking system 160 or assistant system 140 may have functionality that may use a user's personal or biometric information as input for user authentication or experience personalization purposes. Users may choose to take advantage of these functions to enhance their experience on an online social network. By way of example, and not by way of limitation, a user may provide personal or biometric information to social-networking system 160 or assistant system 140. The user's privacy settings may specify that such information is available only for a particular process (e.g., authentication), and also that such information cannot be shared with any third-party system 170 or used for other processes or applications associated with social-networking system 160 or assistant system 140. As another example and not by way of limitation, social-networking system 160 may provide the user with functionality to provide voiceprint recordings to the online social network. By way of example and not by way of limitation, if a user wishes to utilize this functionality of the online social network, the user may provide a voice recording of his or her own voice to provide status updates on the online social network. The recording of the voice input may be compared to the user's voiceprint to determine what words the user has spoken. The user's privacy settings may specify that such voice recordings may be used only for voice input purposes (e.g., authenticating the user, sending voice messages, improving voice recognition in order to use voice-operated features of the online social network), and also that such voice recordings may not be shared with any third-party system 170 or used by other processes or applications associated with social-networking system 160. As another example and not by way of limitation, social-networking system 160 may provide the user with functionality to provide a reference image (e.g., facial profile, retinal scan) to the online social network. The online social network may compare the reference image with later received image inputs (e.g., to authenticate the user, to mark the user in a photograph). The user's privacy settings may specify that such images are available only for limited purposes (e.g., authentication, marking the user in a photograph), and also specify that such images cannot be shared with any third-party system 170 or used by other processes or applications associated with social-networking system 160.

FIG. 16 illustrates an example computer system 1600. In particular embodiments, one or more computer systems 1600 perform one or more steps of one or more methods described or illustrated herein. In particular embodiments, one or more computer systems 1600 provide the functionality described or illustrated herein. In particular embodiments, software running on one or more computer systems 1600 performs one or more steps of one or more methods described or illustrated herein, or provides functionality described or illustrated herein. Particular embodiments include one or more portions of one or more computer systems 1600. Herein, references to a computer system may include a computing device, and vice versa, where appropriate. Further, references to computer systems may include one or more computer systems, where appropriate.

The present disclosure contemplates any suitable number of computer systems 1600. The present disclosure contemplates computer system 1600 taking any suitable physical form. By way of example, and not by way of limitation, computer system 1600 may be an embedded computer system, a system-on-chip (SOC), a single-board computer system (SBC) (such as, for example, a computer-on-module (COM) or a system-on-module (SOM)), a desktop computer system, a laptop or notebook computer system, an interactive kiosk, a mainframe, a mesh of computer systems, a mobile telephone, a personal digital assistant (PDA), a server, a tablet computer system, or a combination of two or more of these. Where appropriate, computer system 1600 may include one or more computer systems 1600; may be unitary or distributed; may span multiple locations; may span multiple machines; may span multiple data centers; or may reside in a cloud, which may include one or more cloud components in one or more networks. Where appropriate, one or more computer systems 1600 may perform one or more steps of one or more methods described or illustrated herein without substantial spatial or temporal limitation. By way of example, and not by way of limitation, one or more computer systems 1600 may perform one or more steps of one or more methods described or illustrated herein in real-time or in batch mode. Where appropriate, one or more computer systems 1600 may perform one or more steps of one or more methods described or illustrated herein at different times or at different locations.

In a particular embodiment, the computer system 1600 includes a processor 1602, a memory 1604, a storage device 1606, an input/output (I/O) interface 1608, a communication interface 1610, and a bus 1612. Although this disclosure describes and illustrates a particular computer system having a particular number of particular components in a particular arrangement, this disclosure contemplates any suitable computer system having any suitable number of any suitable components in any suitable arrangement.

In a particular embodiment, the processor 1602 includes hardware for executing instructions (e.g., those comprising a computer program). By way of example, and not limitation, to execute instructions, processor 1602 may retrieve (or fetch) the instructions from an internal register, an internal cache, memory 1604, or storage 1606; decode and execute them; and then write one or more results to an internal register, internal cache, memory 1604, or storage 1606. In particular embodiments, processor 1602 may include one or more internal caches for data, instructions, or addresses. The present disclosure contemplates processor 1602 including any suitable number of any suitable internal caches, where appropriate. By way of example, and not by way of limitation, the processor 1602 may include one or more instruction caches, one or more data caches, and one or more translation lookaside buffers (TLBs). Instructions in the instruction cache may be copies of instructions in the memory 1604 or the storage 1606, and the instruction cache may speed up retrieval of those instructions by the processor 1602. The data in the data cache may be: copies of data in the memory 1604 or storage 1606 for instructions executing at the processor 1602 to operate on; results of previous instructions executed at processor 1602, for access by subsequent instructions executed at processor 1602 or for writing to memory 1604 or storage 1606; or other suitable data. The data cache may speed up read or write operations by the processor 1602. The TLBs may speed up virtual address translation for the processor 1602. In particular embodiments, processor 1602 may include one or more internal registers for data, instructions, or addresses. The present disclosure contemplates processor 1602 including any suitable number of any suitable internal registers, where appropriate. Where appropriate, the processor 1602 may include one or more arithmetic logic units (ALUs); may be a multi-core processor; or may include one or more processors 1602. Although this disclosure describes and illustrates a particular processor, this disclosure contemplates any suitable processor.

In a particular embodiment, the memory 1604 includes a main memory for storing instructions for execution by the processor 1602 or data for operation by the processor 1602. By way of example, and not limitation, computer system 1600 may load instructions from storage 1606 or another source (such as, for example, another computer system 1600) to memory 1604. The processor 1602 may then load the instructions from the memory 1604 into an internal register or internal cache. To execute instructions, the processor 1602 may retrieve instructions from an internal register or internal cache and decode them. During or after execution of the instructions, the processor 1602 may write one or more results (which may be intermediate results or final results) to an internal register or internal cache. The processor 1602 may then write one or more of these results to the memory 1604. In particular embodiments, processor 1602 executes instructions in only one or more internal registers or internal caches or in memory 1604 (rather than storage 1606 or elsewhere), and operates on data in only one or more internal registers or internal caches or in memory 1604 (rather than storage 1606 or elsewhere). One or more memory buses (which may each include an address bus and a data bus) may couple processor 1602 to memory 1604. The bus 1612 may include one or more memory buses, as described below. In particular embodiments, one or more Memory Management Units (MMUs) reside between processor 1602 and memory 1604 and facilitate access to memory 1604 as requested by processor 1602. In a particular embodiment, the memory 1604 includes Random Access Memory (RAM). The RAM may be volatile memory, where appropriate. The RAM may be Dynamic RAM (DRAM) or Static RAM (SRAM), where appropriate. Further, the RAM may be single-port RAM or multi-port RAM, where appropriate. The present disclosure contemplates any suitable RAM. The memory 1604 may include one or more memories 1604, where appropriate. 
Although this disclosure describes and illustrates particular memory, this disclosure contemplates any suitable memory.

In a particular embodiment, the storage 1606 includes mass storage for data or instructions. By way of example, and not limitation, the storage 1606 may include a hard disk drive (HDD), a floppy disk drive, flash memory, an optical disk, a magneto-optical disk, a magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. Storage 1606 may include removable or non-removable (or fixed) media, where appropriate. Storage 1606 may be internal or external to computer system 1600, where appropriate. In a particular embodiment, the storage 1606 is non-volatile, solid-state memory. In a particular embodiment, the storage 1606 includes read-only memory (ROM). The ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically alterable ROM (EAROM), or flash memory, or a combination of two or more of these, where appropriate. The present disclosure contemplates mass storage 1606 taking any suitable physical form. Storage 1606 may include one or more storage control units that facilitate communication between processor 1602 and storage 1606, where appropriate. Storage 1606 may include one or more storages 1606, where appropriate. Although this disclosure describes and illustrates particular storage, this disclosure contemplates any suitable storage.

In particular embodiments, I/O interface 1608 comprises hardware, software, or both that provide one or more interfaces for communication between computer system 1600 and one or more I/O devices. Computer system 1600 may include one or more of these I/O devices, where appropriate. One or more of these I/O devices may enable communications between a person and computer system 1600. By way of example, and not by way of limitation, the I/O device may include a keyboard, a keypad, a microphone, a monitor, a mouse, a printer, a scanner, a speaker, a still camera, a stylus, a tablet computer, a touch screen, a trackball, a video camera, another suitable I/O device, or a combination of two or more of these. The I/O device may include one or more sensors. The present disclosure contemplates any suitable I/O devices and any suitable I/O interfaces 1608 for them. The I/O interface 1608 may include one or more devices or software drivers that enable the processor 1602 to drive one or more of the I/O devices, where appropriate. I/O interface 1608 may include one or more I/O interfaces 1608, where appropriate. Although this disclosure describes and illustrates particular I/O interfaces, this disclosure contemplates any suitable I/O interfaces.

In particular embodiments, communication interface 1610 includes hardware, software, or both that provide one or more interfaces for communication (such as, for example, packet-based communication) between computer system 1600 and one or more other computer systems 1600 or one or more networks. By way of example, and not by way of limitation, communication interface 1610 may include a network interface controller (NIC) or network adapter for communicating with an Ethernet or other wire-based network, or a wireless NIC (WNIC) or wireless adapter for communicating with a wireless network (e.g., a WI-FI network). The present disclosure contemplates any suitable network and any suitable communication interface 1610 for it. By way of example, and not limitation, computer system 1600 may communicate with an ad hoc network, a personal area network (PAN), a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), or one or more portions of the Internet, or a combination of two or more of these. One or more portions of one or more of these networks may be wired or wireless. As an example, computer system 1600 may communicate with a wireless PAN (WPAN) (such as, for example, a BLUETOOTH WPAN), a WI-FI network, a WI-MAX network, a cellular telephone network (such as, for example, a Global System for Mobile Communications (GSM) network), or other suitable wireless network, or a combination of two or more of these. Computer system 1600 may include any suitable communication interface 1610 for any of these networks, where appropriate. Communication interface 1610 may include one or more communication interfaces 1610, where appropriate. Although this disclosure describes and illustrates a particular communication interface, this disclosure contemplates any suitable communication interface.

In a particular embodiment, the bus 1612 includes hardware, software, or both that couple the components of the computer system 1600 to one another. By way of example, and not limitation, bus 1612 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Extended Industry Standard Architecture (EISA) bus, a front-side bus (FSB), a HYPERTRANSPORT (HT) interconnect, an Industry Standard Architecture (ISA) bus, an INFINIBAND interconnect, a low-pin-count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a serial advanced technology attachment (SATA) bus, a Video Electronics Standards Association local bus (VLB), or any other suitable bus, or a combination of two or more of these. The bus 1612 may include one or more buses 1612, where appropriate. Although this disclosure describes and illustrates a particular bus, this disclosure contemplates any suitable bus or interconnect.

Herein, one or more computer-readable non-transitory storage media may include one or more semiconductor-based or other integrated circuits (ICs) (such as, for example, field-programmable gate arrays (FPGAs) or application-specific ICs (ASICs)), a hard disk drive (HDD), a hybrid hard drive (HHD), an optical disk drive (ODD), a magneto-optical disk drive, a floppy disk drive (FDD), a magnetic tape, a solid-state drive (SSD), a RAM drive, a SECURE DIGITAL card or drive, any other suitable computer-readable non-transitory storage media, or any suitable combination of two or more of these, where appropriate. The computer-readable non-transitory storage medium may be volatile, non-volatile, or a combination of volatile and non-volatile, where appropriate.

Herein, unless expressly indicated otherwise or indicated otherwise by context, "or" is inclusive rather than exclusive. Thus, herein, "A or B" means "A, B, or both," unless expressly indicated otherwise or indicated otherwise by context. Furthermore, unless expressly indicated otherwise or indicated otherwise by context, "and" is both joint and several. Thus, herein, "A and B" means "A and B, jointly or severally," unless expressly indicated otherwise or indicated otherwise by context.

The scope of the present disclosure includes all changes, substitutions, variations, alterations, and modifications to the example embodiments described or illustrated herein that a person having ordinary skill in the art would comprehend. The scope of the present disclosure is not limited to the example embodiments described or illustrated herein. Furthermore, although the present disclosure describes and illustrates respective embodiments herein as including particular components, elements, features, functions, operations, or steps, any of these embodiments may include any combination or permutation of any of the components, elements, features, functions, operations, or steps described or illustrated anywhere herein that a person having ordinary skill in the art would comprehend. Furthermore, reference in the appended claims to an apparatus or system, or a component of an apparatus or system, being adapted to, arranged to, capable of, configured to, enabled to, operable to, or operative to perform a particular function encompasses that apparatus, system, or component whether or not it or that particular function is activated, turned on, or unlocked, so long as that apparatus, system, or component is so adapted, arranged, capable, configured, enabled, operable, or operative. Furthermore, although the present disclosure describes or illustrates particular embodiments as providing particular advantages, particular embodiments may provide some, all, or none of these advantages.

Claims

1. A method for use in an assistant system for assisting a user in obtaining information or services by enabling the user to interact with the assistant system in a session using user input including sound, text, images, or video, or any combination thereof, the assistant system being implemented by a combination of a computing device, an application programming interface (API), and a proliferation of applications on a user device, the method comprising:

determining a phase from the plurality of phases of the neural network; and

training a non-local machine learning model by inserting each of the one or more non-local blocks between at least two of a plurality of neural blocks in the determined phase of the neural network,

wherein the assistant system uses one or more non-local machine learning models to analyze content objects including one or more of speech, text, images, video, or a combination thereof.
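By way of illustration, and not limitation, the inserting step of claim 1 may be sketched as follows. The sketch assumes that a network is represented as a list of phases, each phase being a list of block labels; the function name `insert_non_local_blocks`, the labels, and the position convention are hypothetical and are not taken from this disclosure.

```python
from typing import List, Sequence


def insert_non_local_blocks(
    phases: Sequence[Sequence[str]],
    phase_index: int,
    non_local_blocks: Sequence[str],
    positions: Sequence[int],
) -> List[List[str]]:
    """Return a copy of `phases` with each non-local block inserted between
    two existing neural blocks of the phase at `phase_index`.

    A position p places the block between original blocks p-1 and p.
    """
    original = phases[phase_index]
    for p in positions:
        # A block must land strictly inside the phase, never at either end.
        if not 1 <= p <= len(original) - 1:
            raise ValueError("position must fall between two existing blocks")
    new_phases = [list(phase) for phase in phases]
    # Insert from the highest position down so that lower indices stay valid.
    for p, block in sorted(zip(positions, non_local_blocks), reverse=True):
        new_phases[phase_index].insert(p, block)
    return new_phases


# Hypothetical four-phase backbone; the block labels are placeholders.
phases = [
    ["res2_a", "res2_b"],
    ["res3_a", "res3_b", "res3_c"],
    ["res4_a", "res4_b", "res4_c", "res4_d"],
    ["res5_a", "res5_b"],
]
model = insert_non_local_blocks(phases, 2, ["nl_1", "nl_2"], [1, 3])
print(model[2])
```

In this sketch the determined phase is the third one (index 2), and each non-local block ends up with a neural block on either side; the input list is left unmodified.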

2. The method of claim 1, wherein the neural network comprises one or more of a convolutional neural network or a recurrent neural network.

3. The method of claim 1 or 2, wherein each of the plurality of content objects comprises one or more of text, an audio clip, an image, or a video.

4. The method of claim 1 or 2, wherein the neural network is based on one or more of a two-dimensional architecture or a three-dimensional architecture.

5. The method of claim 1 or 2, further comprising:

generating a plurality of feature representations for the plurality of content objects, respectively, based on the baseline machine learning model.

6. The method of claim 5, wherein generating each of the one or more non-local blocks comprises:

7. The method of claim 5, further comprising:

8. The method of claim 7, wherein the output location is in one or more of space, time, or spacetime.

9. The method of claim 7, wherein each of the one or more non-local operations is based on a function

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j),$$

and wherein:

$x_i$ indicates the feature representation at the output location;

$x_j$ indicates the feature representation at one of the plurality of locations;

$y_i$ indicates an output response at the output location;

$f(x_i, x_j)$ indicates the pair-wise function;

$g(x_j)$ indicates the unary function; and

$C(x)$ indicates a normalization factor.
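By way of illustration, and not limitation, the non-local operation recited in claim 9 may be sketched in plain Python as follows. The sketch assumes that the normalization factor is the sum of the pair-wise responses over all locations, that the pair-wise function is a Gaussian of the dot product, and that the unary function is the identity; these choices, and the names `non_local_operation`, `gaussian`, and `identity`, are assumptions made for the example only.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def non_local_operation(
    x: Sequence[Vector],
    f: Callable[[Vector, Vector], float],
    g: Callable[[Vector], List[float]],
) -> List[List[float]]:
    """Compute y_i = (1 / C(x)) * sum_j f(x_i, x_j) * g(x_j) for every i.

    Assumption: C(x) = sum_j f(x_i, x_j), i.e. the weights are normalized
    to sum to one at each output location.
    """
    values = [g(x_j) for x_j in x]
    dim = len(values[0])
    outputs = []
    for x_i in x:
        weights = [f(x_i, x_j) for x_j in x]  # one weight per location j
        norm = sum(weights)                   # assumed normalization factor
        y_i = [
            sum(w * v[d] for w, v in zip(weights, values)) / norm
            for d in range(dim)
        ]
        outputs.append(y_i)
    return outputs


def gaussian(x_i: Vector, x_j: Vector) -> float:
    """Assumed pair-wise function: exp(x_i . x_j)."""
    return math.exp(sum(a * b for a, b in zip(x_i, x_j)))


def identity(x_j: Vector) -> List[float]:
    """Assumed unary function: returns the feature unchanged."""
    return list(x_j)


features = [[0.0, 0.0], [1.0, 0.0]]
responses = non_local_operation(features, gaussian, identity)
```

Every output location aggregates over all locations j, which is what distinguishes the operation from a convolution that only sees a local neighborhood.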

10. The method of claim 9, wherein the pair-wise function is based on one or more of:

a Gaussian function $f(x_i, x_j) = e^{x_i^T x_j}$;

an embedded Gaussian function $f(x_i, x_j) = e^{\theta(x_i)^T \phi(x_j)}$,

wherein $\theta$ is an embedding of $x_i$ and $\phi$ is an embedding of $x_j$;

a dot-product function $f(x_i, x_j) = \theta(x_i)^T \phi(x_j)$; or

a concatenation function $f(x_i, x_j) = \mathrm{ReLU}(w_f^T [\theta(x_i), \phi(x_j)])$,

wherein ReLU indicates a rectified linear unit function, and wherein $w_f$ is a weight vector that projects the concatenation of $\theta(x_i)$ and $\phi(x_j)$ to a scalar.
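By way of illustration, and not limitation, the four pair-wise functions recited in claim 10 may be sketched as follows. The sketch assumes that the embeddings θ and φ are linear maps given by the hypothetical matrices `W_theta` and `W_phi`; the helper names `dot` and `matvec` are likewise introduced only for this example.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def matvec(W: Matrix, x: Vector) -> List[float]:
    """Apply a linear embedding (assumed form of theta and phi)."""
    return [dot(row, x) for row in W]


def gaussian(x_i: Vector, x_j: Vector) -> float:
    # f = exp(x_i^T x_j)
    return math.exp(dot(x_i, x_j))


def embedded_gaussian(x_i: Vector, x_j: Vector, W_theta: Matrix, W_phi: Matrix) -> float:
    # f = exp(theta(x_i)^T phi(x_j))
    return math.exp(dot(matvec(W_theta, x_i), matvec(W_phi, x_j)))


def dot_product(x_i: Vector, x_j: Vector, W_theta: Matrix, W_phi: Matrix) -> float:
    # f = theta(x_i)^T phi(x_j)
    return dot(matvec(W_theta, x_i), matvec(W_phi, x_j))


def concatenation(
    x_i: Vector, x_j: Vector, W_theta: Matrix, W_phi: Matrix, w_f: Vector
) -> float:
    # f = ReLU(w_f^T [theta(x_i), phi(x_j)])
    concat = matvec(W_theta, x_i) + matvec(W_phi, x_j)
    return max(0.0, dot(w_f, concat))


eye = [[1.0, 0.0], [0.0, 1.0]]  # identity embeddings, chosen for readability
a, b = [1.0, 2.0], [3.0, -1.0]
```

With identity embeddings the Gaussian and embedded Gaussian variants coincide, which makes the relationship between the first two options easy to check by hand.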

11. The method of claim 5, further comprising:

generating a sub-sampled content object for each of the plurality of content objects by applying sub-sampling to the feature representation of the content object, wherein the sub-sampled content object is associated with a sub-sampled feature representation.

12. The method of claim 11, wherein the sub-sampling comprises pooling, the pooling comprising one or more of maximum pooling or average pooling.
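By way of illustration, and not limitation, the pooling-based sub-sampling of claims 11 and 12 may be sketched as follows. The sketch assumes a one-dimensional sequence of feature vectors and non-overlapping windows; the function name `pool` and its parameters are hypothetical.

```python
from typing import List, Sequence


def pool(features: Sequence[Sequence[float]], window: int, mode: str = "max") -> List[List[float]]:
    """Sub-sample a sequence of feature vectors using non-overlapping windows.

    `mode` selects maximum pooling ("max") or average pooling ("average").
    A trailing window shorter than `window` is pooled over what remains.
    """
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    pooled = []
    for start in range(0, len(features), window):
        group = features[start:start + window]
        columns = list(zip(*group))  # one tuple per feature dimension
        if mode == "max":
            pooled.append([max(col) for col in columns])
        else:
            pooled.append([sum(col) / len(col) for col in columns])
    return pooled


features = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0], [7.0, 8.0]]
max_pooled = pool(features, window=2, mode="max")
avg_pooled = pool(features, window=2, mode="average")
```

Here four locations are reduced to two, so any later operation that iterates over the sub-sampled representation visits half as many locations.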

13. The method of claim 11, wherein generating each of the one or more non-local blocks comprises:

14. The method of claim 11, further comprising:

determining an output location for each of the plurality of content objects; and

15. The method of claim 14, wherein each of the one or more non-local operations is based on a function

$$y_i = \frac{1}{C(\hat{x})} \sum_{\forall j} f(x_i, \hat{x}_j)\, g(\hat{x}_j),$$

and wherein:

$x_i$ indicates the feature representation at the output location;

$\hat{x}_j$ indicates the sub-sampled feature representation at one of the plurality of locations;

$y_i$ indicates an output response at the output location;

$f(x_i, \hat{x}_j)$ indicates the pair-wise function;

$g(\hat{x}_j)$ indicates the unary function; and

$C(\hat{x})$ indicates the normalization factor.
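By way of illustration, and not limitation, the sub-sampled variant of the non-local operation in claim 15 may be sketched as follows. The sketch assumes maximum pooling with non-overlapping windows, a Gaussian pair-wise function, an identity unary function, and a normalization factor equal to the sum of the pair-wise responses; all function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def max_pool(x: Sequence[Vector], window: int) -> List[List[float]]:
    """Assumed sub-sampling: non-overlapping maximum pooling."""
    return [
        [max(col) for col in zip(*x[start:start + window])]
        for start in range(0, len(x), window)
    ]


def subsampled_non_local(x: Sequence[Vector], window: int) -> List[List[float]]:
    """One response per original location i, summing over pooled locations j."""
    x_hat = max_pool(x, window)
    outputs = []
    for x_i in x:
        weights = [math.exp(dot(x_i, xh_j)) for xh_j in x_hat]  # assumed Gaussian f
        norm = sum(weights)                                     # assumed C(x_hat)
        outputs.append([
            sum(w * xh_j[d] for w, xh_j in zip(weights, x_hat)) / norm
            for d in range(len(x_i))
        ])
    return outputs


x = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
y = subsampled_non_local(x, window=2)
```

The output keeps one response per original location, while the number of pair-wise evaluations drops from 4 × 4 to 4 × 2 in this example.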

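16. The method of claim 15, wherein the pair-wise function is based on one or more of:

a Gaussian function $f(x_i, \hat{x}_j) = e^{x_i^T \hat{x}_j}$;

an embedded Gaussian function $f(x_i, \hat{x}_j) = e^{\theta(x_i)^T \phi(\hat{x}_j)}$,

wherein $\theta$ is an embedding of $x_i$ and $\phi$ is an embedding of $\hat{x}_j$;

a dot-product function $f(x_i, \hat{x}_j) = \theta(x_i)^T \phi(\hat{x}_j)$; or

a concatenation function $f(x_i, \hat{x}_j) = \mathrm{ReLU}(w_f^T [\theta(x_i), \phi(\hat{x}_j)])$,

wherein ReLU indicates a rectified linear unit function, and wherein $w_f$ is a weight vector that projects the concatenation of $\theta(x_i)$ and $\phi(\hat{x}_j)$ to a scalar.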

17. The method of claim 1 or 2, further comprising:

receiving a query content object; and

determining a category of the query content object based on the non-local machine learning model.

18. A computer readable non-transitory storage medium embodying software that is operable when executed to perform a method according to any one of claims 1 to 17.

19. An assistant system for assisting a user in obtaining information or services by enabling the user to interact with the assistant system in a session using user input including sound, text, images, or video, or any combination thereof, the assistant system being implemented by a combination of a computing device, an application programming interface (API), and a proliferation of applications on a user device, the assistant system comprising: one or more processors; and a non-transitory memory coupled to the processors, the memory comprising instructions executable by the processors, the processors being operable when executing the instructions to perform the method of any one of claims 1 to 17.

20. The assistant system of claim 19, for assisting a user by performing at least one or more of the following features or steps:

- analyzing the user input using natural language understanding, wherein the analysis can be based on a user profile to obtain a more personalized and context-aware understanding;

- resolving entities associated with the user input based on the analysis;

- managing and advancing the conversation flow with the user using dialog management techniques through interactions with the user;

- assisting the user in engaging more effectively with an online social network by providing tools that help the user interact with the online social network;

- proactively performing pre-authorized tasks related to user interests and preferences, based on the user profile, at a time relevant to the user and without user input.

21. An assistant system according to claim 19 or 20, comprising at least one or more of the following:

a messaging platform for receiving text-mode based user input from a client system associated with a user and/or receiving image or video-mode based user input and processing the user input within the messaging platform using optical character recognition techniques to convert the user input to text,

an audio speech recognition, ASR, module for receiving user input based on an audio modality from the client system associated with a user and converting the user input based on an audio modality into text,

22. A system for a network environment, comprising:

at least one client system, in particular an electronic device,

at least one assistant system according to any one of claims 19 to 21,

the client system and the assistant system are connected to each other through a network,

wherein the client system includes an assistant application for allowing a user of the client system to interact with the assistant system,

wherein the assistant application communicates user input to the assistant system and based on the user input, the assistant system generates a response and sends the generated response to the assistant application, and the assistant application presents the response to a user of the client system,

wherein the user input is audio or spoken or visual and the response can be text or can also be audio or spoken or visual.

23. The system of claim 22, further comprising a social networking system,

wherein the client system includes a social networking application for accessing the social networking system.