WO2023080805A1 - Distributed embedding table with synchronous local buffers - Google Patents

Distributed embedding table with synchronous local buffers

Info

Publication number
WO2023080805A1
Authority
WO
WIPO (PCT)
Prior art keywords
embedding table
values
embedding
numerical feature
computer
Prior art date
Application number
PCT/RU2021/000483
Other languages
French (fr)
Inventor
Dmitry Sergeevich KOLMAKOV
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/RU2021/000483 priority Critical patent/WO2023080805A1/en
Publication of WO2023080805A1 publication Critical patent/WO2023080805A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/544Buffers; Shared memory; Pipes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • G06F16/2365Ensuring data consistency and integrity
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/278Data partitioning, e.g. horizontal or vertical partitioning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/12Replacement control
    • G06F12/121Replacement control using replacement algorithms
    • G06F12/122Replacement control using replacement algorithms of the least frequently used [LFU] type, e.g. with individual count value

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A system and a method for training and inferencing using distributed embedding tables is disclosed. The disclosure enables trading memory overhead for communication speed-up during both training and inference. The disclosure comprises usage of replicated local buffers which store embeddings for the most frequent values of the non-numerical features, for example categorical features. The disclosure provides exemplary methods for filling the local buffers, maintaining their consistency, and an exemplary method of updating the replicated local buffers in case the distribution of categorical features changes over time.

Description

DISTRIBUTED EMBEDDING TABLE WITH SYNCHRONOUS LOCAL BUFFERS
BACKGROUND
Some embodiments described in the present disclosure relate to training of a neural network and, more specifically, but not exclusively, to faster access to embeddings of frequently occurring values.
Embeddings are relatively low-dimensional values representing discrete data, and are widely used in the artificial intelligence area. The embedding concept is a ubiquitous method of processing non-numerical features, such as names of countries, words, days of a week, website categories, and the like. Since categories such as web pages, cities, and the like, may have many value options, and their embeddings correspondingly have more dimensions, their embedding tables may require much memory. Thus, the storage, access and update of embedding values stored in the distributed table may become a bottleneck.
Usage of a cache is a known method of speeding up access to remote or slower memory; however, since embeddings are trainable parameters which are continuously updated during the training procedure, maintaining their coherency incurs communication overhead and causes scalability problems.
SUMMARY
It is an object of the present disclosure to describe a distributed training system which is able to decrease network load, speed up the training process, and support inference as well, by enabling faster access to vectorized representations, or embeddings, of frequently occurring values of non-numerical features.
The foregoing and other objects are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
According to an aspect of some embodiments of the present invention there is provided a system configured for distributed training of a neural network, comprising a plurality of computing nodes, each computing node comprising: at least one processing circuitry configured to: receive a plurality of data records, each comprising at least one non-numerical feature; generate a first embedding table, comprising vectorized representations, each associated with a value of the at least one non-numerical feature; generate a second embedding table comprising vectorized representations, each associated with a frequently occurring value of the at least one non-numerical feature; apply a first communication pattern to collect values of the at least one non-numerical feature for at least one batch, from at least one of the plurality of computing nodes; generate a plurality of gradients by performing at least one training iteration on the neural network using the at least one batch from the plurality of data records; update the first embedding table by the plurality of gradients using a second communication pattern with the plurality of computing nodes; update the second embedding table by the plurality of gradients; and equalize the second embedding table with the second embedding table of the plurality of computing nodes using a third communication pattern.
According to an aspect of some embodiments of the present invention there is provided a computer-implemented method for training a neural network, using a plurality of computing nodes, each computing node comprising at least one processing circuitry, and the method comprising: receiving a plurality of data records, each comprising at least one non-numerical feature; generating a first embedding table, comprising vectorized representations, each associated with a value of the at least one non-numerical feature; generating a second embedding table comprising vectorized representations, each associated with a frequently occurring value of the at least one non-numerical feature; applying a first communication pattern to collect values of the at least one non-numerical feature for at least one batch, from at least one of the plurality of computing nodes; generating a plurality of gradients by performing at least one training iteration on the neural network using the at least one batch from the plurality of data records; updating the first embedding table by the plurality of gradients using a second communication pattern with the plurality of computing nodes; updating the second embedding table by the plurality of gradients; and equalizing the second embedding table with the second embedding table of the plurality of computing nodes using a third communication pattern.
According to an aspect of some embodiments of the present invention there is provided a computer-implemented method for inferencing from a neural network, using a plurality of computing nodes, each computing node comprising at least one processing circuitry, and the method comprising: obtaining a first embedding table for at least one non-numerical feature, and a second embedding table comprising vectorized representations, each associated with a frequently occurring value of at least one non-numerical feature, and being shared by the plurality of computing nodes; and inferencing from at least one data record using the neural network.
Optionally, the at least one processing circuitry, and an additional processing circuitry of an additional computing node from the plurality of computing nodes, are configured to: synchronously add at least one value of the at least one non-numerical feature, frequently occurring in the at least one batch, and a vectorized representation associated therewith, to the second embedding table; and apply the third communication pattern on the at least one non-numerical feature values.
Optionally, the at least one value comprises the most frequent value of the at least one non-numerical feature in the at least one batch used by each computing node.
Optionally, wherein the at least one processing circuitry and an additional processing circuitry of an additional computing node from the plurality of computing nodes are configured to: synchronously choose and remove at least one additional value from the second embedding table; and synchronously add at least one value of the at least one non-numerical feature, frequently occurring in the at least one batch, and a vectorized representation associated therewith, to the second embedding table.
Optionally, wherein the first communication pattern collects values from at least one remote embedding table.
Optionally, wherein the second communication pattern is an all-to-all pattern.
Optionally, wherein the third communication pattern is an all-reduce synchronization pattern.
Optionally, further comprising at least one processing circuitry configured as a parameter server storing at least one remote embedding table.
Optionally, wherein the second embedding table is stored on a memory enabling faster access compared to the memory the first embedding table is stored on.
Optionally, wherein the second embedding table in each of the plurality of computing nodes stores the same values.
Other systems, methods, features, and advantages of the present disclosure will be or become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description, be within the scope of the present disclosure, and be protected by the accompanying claims.
Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the embodiments pertain. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING(S)
Some embodiments are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments may be practiced.
In the drawings:
FIG. 1A is a schematic block diagram of an exemplary system for training a machine learning model, according to some embodiments of the present disclosure;
FIG. 1B is a schematic block diagram of an exemplary system for distributed training of a machine learning model, according to some embodiments of the present disclosure;
FIG. 2A is a flowchart schematically representing an optional flow of operations for a distributed step of training of a machine learning model, according to some embodiments of the present disclosure;
FIG. 2B is a flowchart schematically representing an optional flow of operations for distributed inference using a machine learning model, according to some embodiments of the present disclosure;
FIG. 3A is a schematic illustration of an exemplary dataset segment comprising non-numerical data, according to an exemplary dataset;
FIG. 3B is a schematic graph of an exemplary non-numerical value frequency distribution, according to an exemplary dataset;
FIG. 4A is a schematic illustration of an exemplary distributed training system, according to some embodiments of prior art;
FIG. 4B is an additional schematic illustration of an exemplary distributed training system, according to some embodiments of prior art;
FIG. 5A is a schematic graph representing an exemplary communication overhead in response to a batch size according to some embodiments of prior art;
FIG. 5B is another schematic illustration of an exemplary distributed training system, according to some embodiments of prior art;
FIG. 6 is a flowchart schematically representing an optional flow of operations for training iteration with local buffers placed on each computing node, according to some embodiments of the present disclosure;
FIG. 7 is a schematic block diagram of an exemplary training system as implemented on each computing node, according to some embodiments of the present disclosure;
FIG. 8 is a schematic illustration of an exemplary training batch preparation, according to some embodiments of the present disclosure;
FIG. 9 is a schematic illustration of exemplary synchronous local buffer operations, according to some embodiments of the present disclosure;
FIG. 10 is a schematic illustration of an exemplary training iteration, according to some embodiments of the present disclosure;
FIG. 11 is a schematic illustration of an exemplary SLB creation and filling process, according to some embodiments of the present disclosure;
FIG. 12 is a schematic illustration of an exemplary SLB updating process, according to some embodiments of the present disclosure;
FIG. 13 is a schematic graph representing simulated memory overhead and communication speed-up in response to a SLB size according to some embodiments of the present disclosure;
FIG. 14 is a schematic graph representing a simulated communication speed-up in response to a batch size according to some embodiments of the present disclosure; and
FIG. 15 is a schematic graph representing a comparison between the communication speed-up of the disclosure and an ideal solution wherein the distribution of categorical values is known in advance, in response to a memory overhead, according to some embodiments of the present disclosure.
DETAILED DESCRIPTION
Some embodiments described in the present disclosure relate to training of a neural network and, more specifically, but not exclusively, to faster access to embeddings of frequently occurring values.
The disclosure comprises a distributed embedding table which enables trading memory overhead for communication speed-up during both training and inference. The disclosure may be straightforward to implement and scales well with the size of the computational cluster. The disclosure comprises usage of replicated local buffers which store embeddings for the most frequent values of the non-numerical features, for example categorical features. The disclosure provides exemplary methods for filling the local buffers, maintaining their consistency or coherency, and an exemplary method of updating the replicated local buffers in case the distribution of categorical features changes over time.
The disclosure enables usage of large, distributed embedding tables, either by storing them on the same devices executing the training and/or the inference, or on one or more additional devices functioning as a parameter server.
A distributed embedding table may be stored on a group of computing devices functioning as the parameter server, or the computing devices which are the worker nodes of the cluster executing the training. The table parts may be referred to as shards, and when stored by a worker node, also referred to as first embedding tables. The replicated local buffers may be referred to as second embedding tables.
Some embodiments of the disclosure propose a set of communication patterns which saves communication overhead while keeping the embedding values in the second embedding tables updated, as they may change during training.
It should be noted that the term synchronously refers to states where the computing device is at a stage equivalent to the state of at least one other computing device with which the communication is performed.
Before explaining at least one embodiment in detail, it is to be understood that embodiments are not necessarily limited in their application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. Implementations described herein are capable of other embodiments or of being practiced or carried out in various ways.
Embodiments may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the embodiments. The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of embodiments may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of embodiments.
Aspects of embodiments are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Referring now to the drawings, FIG. 1A is a schematic illustration of an exemplary system for training of a neural network, according to some embodiments of the present invention. An exemplary training or inference system 100 may function as a computing node for processes such as 200 and/or 250 for training a neural network or a similarly complex machine learning model from data records, and/or using the system for inference respectively. Further details about these exemplary processes follow as FIG. 2A and FIG. 2B are described.
The training of a neural network system 110 may include a network interface 113, which comprises an input interface 112, and an output interface 115. The training or inference system may also comprise one or more processors 111 for executing processes such as 200 and/or 250, and storage 116 for storing code (program code storage 114) and/or memory 118 for data, such as network parameters, and records for training and/or inference. The training or inference system may be physically located on a site, implemented on a mobile device, implemented as a distributed system, implemented virtually on a cloud service, on machines also used for other functions, and/or by a combination of these options. Alternatively, the system, or parts thereof, may be implemented on dedicated hardware, FPGA and/or the like. Further alternatively, the system, or parts thereof, may be implemented on a server, a computer farm, the cloud, and/or the like. For example, the storage 116 may comprise a local cache on the device, and some of the less frequently used data and code parts may be stored remotely.
The input interface 112, and the output interface 115 may comprise one or more wired and/or wireless network interfaces for connecting to one or more networks, for example, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a cellular network, the internet and/or the like. The input interface 112, and the output interface 115 may further include one or more wired and/or wireless interconnection interfaces, for example, a universal serial bus (USB) interface, a serial port, and/or the like. Furthermore, the output interface 115 may include one or more wireless interfaces for delivering various indications to other systems or users, and the input interface 112, may include one or more wireless interfaces for receiving information from one or more devices. Additionally, the input interface 112 may include specific means for communication with one or more sensor devices 122 such as a touch screen, a microphone for receiving instructions, configurations and/or the like. And similarly, the output interface 115 may include specific means for communication with one or more display devices 125 such as a loudspeaker, display and/or the like.
Both parts of the processing, storage and delivery of data records, and inference result processing, may be executed using one or more optional Neighbor Systems 124.
The one or more processors 111, homogenous or heterogeneous, may include one or more processing nodes arranged for parallel processing, as clusters and/or as one or more multi-core processors. Furthermore, the processor may comprise units optimized for deep learning such as Graphic Processing Units (GPU). The storage 116 may include one or more non-transitory persistent storage devices, for example, a hard drive, a Flash array and/or the like. The storage 116 may also include one or more volatile devices, for example, a random access memory (RAM) component, enhanced bandwidth memory such as video RAM (VRAM), and/or the like. The storage 116 may further include one or more network storage resources, for example, a storage server, a network attached storage (NAS), a network drive, and/or the like accessible via one or more networks through the input interface 112, and the output interface 115.
The one or more processors 111 may execute one or more software modules such as, for example, a process, a script, an application, an agent, a utility, a tool, an operating system (OS) and/or the like each comprising a plurality of program instructions stored in a non- transitory medium within the program code 114, which may reside on the storage medium 116.
Referring now to FIG. 1B, which is a schematic block diagram of an exemplary system for distributed training of a machine learning model, according to some embodiments of the present disclosure.
An exemplary distributed training or inference system 150 may function as a computing node for processes such as 200 and/or 250 for distributed training of a neural network or a similarly complex machine learning model from data records, and/or using the system for inference respectively. Further details about these exemplary processes follow as FIG. 2A and FIG. 2B are described.
The network shown in 150 may be used for providing a plurality of users with a platform comprising a plurality of computing nodes, such as a LAN, a WAN, a cloud service, a network for distributed training of neural networks and similarly complex machine learning models, an inference system, a compute server, and/or the like. The network may allow communication with physical or virtual machines, or parts thereof, for example graphic processing units (GPU), functioning as computing nodes, as shown in 151, 155 and 158. The network may interface the outside network, e.g. the internet, and collect data continuously. Some embodiments may prepare additional training data and perform periodic retraining and/or online training.
The network computing nodes may be configured to function peer to peer; however, optionally, additional computing nodes may be configured as parameter servers, as shown in 165. A parameter server may be based on similar computing nodes, however it may be a system configured for broad and fast memory access, with less processing capability. Optionally, more than one computing node may function as a parameter server, or the parameter server may be a plurality of devices configured to function as a single parameter server. For example, an auxiliary parameter server shown in 160 may store some of the parameters, the training data, and/or the like.
Reference is also made to FIG. 2A which is a flowchart schematically representing an optional flow of operations for a distributed step of training of a machine learning model, according to some embodiments of the present disclosure. The exemplary process 200 may be executed for training a system for executing one or more distributed inference tasks, for example for analytics, web page traffic prediction, recommendation systems, and/or the like. The process 200 may be executed by the one or more processors 111.
The exemplary process 200 starts, as shown in 201, with receiving a plurality of data records, each comprising at least one non-numerical feature. The plurality of data records may be received through the input interface 112, from an external server, for example over the internet, a parameter server, a neighbor system such as 124, and/or the like. Alternatively, data records may be stored in memory 118. A batch used for training may comprise a plurality of data records, each comprising at least one non-numerical feature. The non-numerical features may be categorical features such as month names, places of birth, vehicle types, words, and/or the like, however other non-numerical features such as voice samples may also be embedded to a lower dimensional vectorized representation.
The exemplary process 200 continues, as shown in 202, with generating a first embedding table, comprising vectorized representations, each associated with a value of the at least one non-numerical feature.
The first embedding table may be stored locally, or when a parameter server is present, it may be stored thereon, and communication therewith should be considered. The vectorized representations, which may also be referred to as embeddings, are numerical representations of the non-numerical features, which enable many neural networks and other machine learning models to process them more effectively. The vectorized representations may be updated during training, by a method similar to the method of updating the model internal parameters, for example gradient descent or variants thereof. For example, the embedding of the value ‘Sunday’ for the day-of-week feature may change from the (3, 220, 9) to the (4, 221, 8) vectorized representation. Some machine learning models may be trained using fixed, pretrained embeddings, such as Word2Vec, however the benefit of the present disclosure relates to embeddings which may be updated for matching the context, and wherein there are many possible values for the non-numerical features.
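As an illustration of the lookup and training-time update of a single embedding described above, the following minimal Python sketch reproduces the ‘Sunday’ example; the table contents, learning rate and gradient values are illustrative assumptions and not part of the disclosure.

```python
import numpy as np

# Illustrative embedding table: one trainable vector per non-numerical value.
table = {
    "Sunday": np.array([3.0, 220.0, 9.0]),
    "Monday": np.array([1.0, 17.0, 42.0]),
}

def lookup(value):
    """Return the vectorized representation (embedding) of a non-numerical value."""
    return table[value]

def apply_gradient(value, grad, lr=0.1):
    """Plain gradient-descent step on a single embedding vector."""
    table[value] -= lr * grad

# A gradient computed during backpropagation moves the embedding, e.g. from
# (3, 220, 9) towards (4, 221, 8) as in the example above.
apply_gradient("Sunday", np.array([-10.0, -10.0, 10.0]))
print(lookup("Sunday"))  # [  4. 221.   8.]
```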
The exemplary process 200 continues, as shown in 203, with generating a second embedding table comprising vectorized representations, each associated with a frequently occurring value of the at least one non-numerical feature.
The second embedding table may share the format of the first embedding table, or have a functionally equivalent format, for providing a vectorized representation for a value of a non-numerical feature. The second embedding table in each of the plurality of computing nodes may store the same values. Alternatively, when each node has different distribution or occurrence patterns of values of the non-numerical features, local adaptations may be made, however such adaptations may not benefit from the optimization for allreduce communication patterns, offered by many platforms.
The second embedding table may be stored on a memory enabling faster access compared to the memory the first embedding table is stored on. For example, when the computing node is a GPU module, the first embedding table may be stored on a local disk, while the second embedding table may be stored on video random access memory (VRAM) present on the GPU module. In another example, the first embedding table may be stored on a different computing device such as a neighbor system 124, and the second embedding table may be stored in the local memory 118. In some implementations, the second embedding tables of each of the plurality of computing nodes, or the worker nodes in the cluster, are synchronized following every training step, and therefore store the same values.
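A minimal sketch of this memory placement, assuming a PyTorch-like setup with a CUDA device; the table sizes are purely illustrative.

```python
import torch

embedding_dim = 16
num_values_in_shard = 10_000_000   # values owned by this node's first embedding table
num_frequent_values = 10_000       # frequently occurring values replicated on every node

# First embedding table (shard): kept in slower, larger host memory.
first_table = torch.nn.Embedding(num_values_in_shard, embedding_dim)

# Second embedding table (SLB): replicated on every node and kept on faster
# memory, e.g. GPU VRAM, when a CUDA device is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
second_table = torch.nn.Embedding(num_frequent_values, embedding_dim).to(device)
```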
The exemplary process 200 continues, as shown in 204, with applying a first communication pattern to collect values of the non-numerical features for the batch, from the plurality of computing nodes.
The values of the non-numerical features for the batch may be collected using the input interface 112. The first communication pattern may collect values from at least one remote embedding table, stored on other computing devices from the plurality of computing devices, forming the cluster on which the training process 200 is executed. Alternatively the values of the non-numerical features may be collected from a parameter server.
The exemplary process 200 continues, as shown in 205, with generating gradients by performing at least one training iteration on the neural network using the batch.
This step may be performed using methods known to the person skilled in the art, for example by stochastic gradient descent. It may be performed by the processor 111, which may benefit from accelerators such as GPU, or other integrated circuits optimized for vector, matrix or tensor calculations, and similar operations used in machine learning.
The exemplary process 200 continues, as shown in 206, with updating the first embedding table by the gradients using a second communication pattern with the plurality of computing nodes.
When a local distributed part of the embedding table is present, the associated updating, for example to the memory 118, is straightforward. The second communication pattern may be an all-to-all pattern, used for updating the other distributed part of the embedding table, or an all-to-one pattern, when the embedding table is stored remotely on a parameter server.
The exemplary process 200 continues, as shown in 207, with updating the vectorized representations in the second embedding table by the plurality of gradients. Furthermore, some values may be added to the second embedding table, or removed therefrom, optionally in a manner synchronized with the other computing nodes in the cluster. The values added to the second embedding table may be chosen due to high occurrence frequency in the batch, and the values removed may be chosen arbitrarily, by location, or by low occurrence frequency in the batch.
The plurality of gradients may be used according to the hyperparameters to update the embeddings of the values stored on the second embedding table.
And subsequently, as shown in 208, the process 200 may continue by using the machine learning based model, executed by one or more processors 111 and the interfaces 112 and 115, for equalizing the second embedding table with the second embedding table of the plurality of computing nodes using a third communication pattern.
Following each node updating its second embedding table, the third communication pattern, which may be an all-reduce synchronization pattern, may be performed. When a parameter server is present, the pattern may be an all-to-one pattern followed by a one-to-all pattern, or for example a sequential allreduce synchronization pattern. The third communication pattern may be a reductive pattern, and may comprise operations such as averaging.
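A minimal sketch of the equalizing step, assuming the nodes use torch.distributed with an already initialized process group; the averaging reduction shown here is one possible realization of the third communication pattern, not a prescribed implementation.

```python
import torch
import torch.distributed as dist

def equalize_slb(slb_weights: torch.Tensor) -> None:
    """Average the replicated second embedding table (SLB) across all computing
    nodes, so that every replica stores the same values after the training step."""
    dist.all_reduce(slb_weights, op=dist.ReduceOp.SUM)
    slb_weights /= dist.get_world_size()
```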
Reference is also made to FIG. 2B which is a flowchart schematically representing an optional flow of operations for distributed inference using a machine learning model, according to some embodiments of the present disclosure.
The exemplary process 250 may be executed for one or more automatic and/or semi-automatic inference tasks, for example analytics, recommendations, sentiment analysis and/or the like. The process 250 may be executed by the one or more processors 111. The process 250 may be used for inferencing using a neural network trained by a process such as 200, however the values stored in the second embedding table, which may be adapted for storing on a memory enabling faster access compared to the memory the first embedding table is stored on, may be determined separately.
The process 250 may start, as shown in 251, by obtaining a first embedding table for at least one non-numerical feature, and a second embedding table comprising vectorized representations, each associated with a frequently occurring value of at least one non-numerical feature, and being shared by the plurality of computing nodes.
The embedding tables may be received through the input interface 112, and comprise values for the same non-numerical features. Some values, as well as their associated embeddings, may be featured both in the first embedding table and in the second embedding table.
The values in the second embedding table are expected to occur more frequently in the data than values not stored thereon, for maximal benefit from the acceleration and communication overhead saving.
And subsequently, as shown in 252, the process 250 may continue by inferencing from at least one data record using a neural network.
One or more data records may be received through the input interface 112. When the processor 111 encounters a non-numerical feature, the processor may query the second embedding table for the vectorized representation, or the embedding. When the value of the non-numerical feature is not found in the second embedding table, the fetching is executed as part of the first communication pattern, which collects values from at least one remote embedding table, or a first embedding table of a different computing node from the plurality of computing nodes.
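A minimal sketch of this lookup order at inference time; the helper names and data structures are assumptions for illustration only.

```python
def get_embedding(value_id, slb_index, slb, fetch_remote):
    """Resolve an embedding at inference time.

    slb_index    -- maps a non-numerical value identifier to its row in the SLB
    slb          -- the replicated second embedding table (fast local memory)
    fetch_remote -- callable implementing the first communication pattern,
                    fetching the value from a remote shard or a parameter server
    """
    row = slb_index.get(value_id)
    if row is not None:
        return slb[row]            # fast path: frequently occurring value
    return fetch_remote(value_id)  # slow path: remote first embedding table
```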
Followingly, the inference may be made by feeding the numerical values of the data record and the embeddings on the neural networks. The inference may be sent through the output interface 115, to produce an indication, to a different system, and/or the like.
Some implementations of the disclosure may monitor occurrence frequencies of values in the data records, and optionally update the second embedding table accordingly. An exemplary method of updating is shown in FIG. 12.
Reference is now made to FIG. 3A which is a schematic illustration of an exemplary dataset segment comprising non-numerical data, according to an exemplary dataset.
Non-numerical features, and particularly categorical features, provide valuable data for many machine learning tasks. For example, embedding tables are a resource consuming part of solutions for the Click-Through rate problem.
In recommender systems, Click-Through Rate (CTR) prediction is a crucial task, which is estimating the probability of a user clicking on a recommended item under a specific context. The user and the context are described by a set of features. Some features are numerical: integers, rational numbers, and the like, for example time, age, or number of children.
Other, non-numerical features may include categorical features such as a region, gender, word pairs from recent search requests, and the like.
The information may be anonymized, and in some examples, only hash values are given for the categorical data. An example of such data is shown in FIG. 3A, wherein each row represents an event where some user has or has not clicked on an advertisement. The very first value l_i is a binary label, where 1 means the advertisement was clicked and 0 means it was not. Integer features are represented as i_ij, where i is the number of the event and j is the number of the feature, and c_ik is a categorical feature, where k is the category number. The goal of the recommendation system is to predict l_i based on the features i_ij and c_ik, i.e. whether the user clicks or not.
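For illustration, one possible in-memory representation of such an event is sketched below; the field values shown are invented placeholders.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ClickEvent:
    label: int               # l_i: 1 if the advertisement was clicked, 0 otherwise
    int_features: List[int]  # i_ij: numerical features of event i
    cat_features: List[str]  # c_ik: hashed categorical features of event i

# One row of the dataset segment of FIG. 3A (illustrative values).
event = ClickEvent(label=1,
                   int_features=[5, 110, 16],
                   cat_features=["68fd1e64", "80e26c9b", "fb936136"])
```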
The number of unique values within categorical features may be very big. Some examples in the present disclosure are based on the Criteo dataset. The Criteo dataset comprises a 7 days log of user activity, and the number of unique values is 45840617. Therefore, storing embeddings as a vector consisting of 16 float values may require almost 3 GBytes of memory. Embedding tables utilized in production settings may be of several TBytes in size, and thus render storing the table on a single computing device impractical.
Reference is also made to FIG. 3B which is a schematic graph of an exemplary non-numerical value frequency distribution, according to an exemplary dataset.
Another important property of the categorical features is the power law distribution of categorical values. A power law distribution, as compared to exponential distributions and the like, is characterized by an inclination of larger values to grow faster than smaller values; for example, an already popular website may be expected to attract more traffic than a large number of less popular websites having similar traffic combined. Therefore, within a single category a small subset of values may occur in most of the data records in the dataset, as well as during inference. FIG. 3B shows a distribution of the first 1000 most frequent values for category 28. Together they occur in 86% of all data records in the dataset.
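The coverage of the most frequent values can be measured directly from a column of categorical values, as in the following sketch; the 86% figure quoted above is a property of the referenced dataset, not of this code.

```python
from collections import Counter

def top_n_coverage(values, n=1000):
    """Fraction of records covered by the n most frequent values of a single
    categorical feature; under a power law a small n covers most records."""
    counts = Counter(values)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / sum(counts.values())

# E.g. for category 28 of the referenced dataset, top_n_coverage(column, 1000)
# would return approximately 0.86.
```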
Reference is also made to FIG. 4A which is a schematic illustration of an exemplary distributed training system, according to some embodiments of prior art.
This example shows 4 computing devices as nodes and a central parameter server. Training of the neural network, together with the embedding table, may be performed by a cluster of machines, or a plurality of computing devices, and the embedding table may be stored on a single parameter server, which may be remote, for example cloud based.
In that case, the parameter server transmits the values requested by the computing devices in a one-to-all pattern, and following the training step, an all-to-one pattern, which may comprise averaging allreduce communication, may be performed. Followingly, the parameter server may update the embedding table.
Reference is also made to FIG. 4B which is an additional schematic illustration of an exemplary distributed training system, according to some embodiments of prior art.
Training of the neural network, together with the embedding table, may be performed by a cluster of machines, or a plurality of computing devices, and the embedding table may be stored in a distributed manner on the plurality of computing devices, which also function as the worker nodes of the cluster.
Training iterations of neural networks may consist of two phases, forward and backward. During the forward phase an output of the neural network may be computed, and during the backward phase gradients may be computed. After the training step, i.e. after both forward and backward phases are finished, gradients for the neural network are averaged across all nodes using allreduce communication. This is a ubiquitous approach when a neural network is trained in a data parallel way; however, to maintain consistency and coherency of the distributed embedding table, two additional steps are required: First, embeddings from the remote table shards should be received to prepare a local batch at each computing device. Second, gradients calculated for the embedding values should be exchanged between the computing devices to be applied to the stored embedding values.
Reference is also made to FIG. 5A which is a schematic graph representing an exemplary communication overhead in response to a batch size according to some embodiments of prior art;
As mentioned regarding FIG. 4B, embeddings from the remote table shards should be received to prepare a local batch at each computing device, and gradients calculated for the embedding values should be exchanged between the computing devices to be applied to the stored embedding values.
These two steps add a significant amount of network load, which may be more noticeable for smaller batch sizes.
Reference is also made to FIG. 5B which is another schematic illustration of an exemplary distributed training system, according to some embodiments of prior art.
The power law property may lead to usage of a least frequently used (LFU) cache. Before executing the training step, local batches may be formed by receiving embeddings stored on other nodes, using all-to-all communications, or all-to-one communication when a parameter server is present. Frequently occurring values and their associated embeddings may be stored in the associated LFU cache.
After a training step is performed, each node may have a plurality of gradients generated by the training step. The plurality of gradients may be sent back through the network, using all-to-all communications, or all-to-one communication when a parameter server is present. Followingly, all the embedding table parts may be updated and synchronized.
However, since embeddings are trainable parameters which are continuously updated during the training procedure, values stored in the cache still need to be synchronized with the values in the remote table and have to be flushed between training steps.
Referring now to FIG. 6 which is a flowchart schematically representing an optional flow of operations for a training iteration with local buffers placed on each computing node, according to some embodiments of the present disclosure.
The figure shows an example having 4 computing nodes for simplicity, however other numbers of nodes, such as 2, 15, and 1000, may be present. This figure shows peer nodes, however a master, and/or a parameter server, may be present, and adapting the embodiment thereto, as well as other variants, is apparent to the person skilled in the art and within the scope of the claims.
Before executing the training step, local batches may be formed by the following: firstly, reading the embeddings found in the local buffer, or in the associated distributed part, directly without any external communication; secondly, applying the first communication pattern, and thereby receiving the other embeddings, using all-to-all communications, or all-to-one communication when a parameter server is present.
After a training step is performed, each node may have a plurality of gradients, generated by the training step, for the embeddings it utilized. The plurality of gradients for embeddings received from remote computing nodes may be sent back through the network, by applying a second communication pattern. The second communication pattern may have averaging characteristics, and may use all-to-all communications, or all-to-one communication when a parameter server is present. The plurality of gradients for embeddings read from the local buffer, as well as from the associated distributed part, may be applied locally. Following this step local buffers may become inconsistent, and an additional step may be required to synchronize them. The additional step may be performed using a third communication pattern, which may be an allreduce collective communication, similarly to the communication pattern used for averaging gradients for the neural network during the training step.
Following this step the local buffers may be synchronized across the plurality of computing nodes forming the training cluster.
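The iteration described above may be summarized by the following Python sketch; every callable passed in is a hypothetical placeholder for functionality described elsewhere in the disclosure, not a prescribed interface.

```python
def training_iteration(batch_ids, local_lookup, collect_remote, forward_backward,
                       exchange_gradients, apply_local, allreduce_slb):
    """One training step with a synchronous local buffer, following FIG. 6.

    local_lookup(ids)       -> (embeddings, remote_ids) found locally / still missing
    collect_remote(ids)     -> first communication pattern (all-to-all or all-to-one)
    forward_backward(emb)   -> per-embedding gradients from the training step
    exchange_gradients(g)   -> second communication pattern for remote embeddings
    apply_local(g)          -> apply gradients to the local buffer and local shard
    allreduce_slb()         -> third communication pattern equalizing the buffers
    """
    # 1. Read embeddings held locally, without any external communication.
    embeddings, remote_ids = local_lookup(batch_ids)
    # 2. Collect the remaining embeddings from the other nodes.
    embeddings.update(collect_remote(remote_ids))
    # 3. Forward and backward passes produce gradients for the used embeddings.
    gradients = forward_backward(embeddings)
    # 4. Send gradients of remotely owned embeddings back; apply the rest locally.
    remote = set(remote_ids)
    exchange_gradients({i: g for i, g in gradients.items() if i in remote})
    apply_local({i: g for i, g in gradients.items() if i not in remote})
    # 5. Synchronize the local buffers across all computing nodes.
    allreduce_slb()
```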
Referring now to FIG. 7 which is a schematic block diagram of an exemplary training system as implemented on each computing node, according to some embodiments of the present disclosure.
The computing device system may implement the following main parts of the worker node: the computing device, an external processor, and an external memory. The main parts are marked by (1), (5) and (4) respectively. Note that some of the items shown in FIG. 7 may be implemented in ways familiar to the person skilled in the art, and the items most characterizing the disclosure are emphasized by brighter edge marking, for example the Allreduce SLB gradients step marked (10).
The computing device part marked (1) performs the main computations during the training step, but it has limited memory. Therefore, the external memory marked (4) is used to store the embedding table shard, which may also be referred to as the first embedding table, as well as the training data shard and service information related to the Synchronous Local Buffer (SLB) index. When a parameter server is used to store the embedding table, similarly to what is shown in FIG. 4A, it may be accessed through the network interface marked (6), similar to 113 of FIG. 1A. The network interface is also used to communicate with other computing devices or nodes. The external processor marked (5), similar to 111 shown in FIG. 1A, may prepare the batch data using the training data and the distributed embedding table shards placed in the external memory. Note that steps known to the person skilled in the art were omitted from the figure description.
The Synchronous Local Buffer (SLB), which may also be referred to as the second embedding table, may consist of two parts. The first part may be an index placed in the external memory (4), containing identifiers of the embeddings included in the buffer, directly accessible from the non-numerical feature value. The second part, which may be referred to as the SLB itself, may be placed in the internal memory marked (2), and contain the actual vectorized representations, or embedding values.
The batch data, which may be stored in the internal memory marked (2), may consist of embedding values received directly from the SLB, or from the external processor marked (5), which receives them from the external embedding table shard, and from other nodes through the network interface marked (6). Note that steps known to the person skilled in the art were omitted from the figure description.
Referring now to FIG. 8 which is a schematic illustration of an exemplary training batch preparation, according to some embodiments of the present disclosure.
Training data may be read from the external memory, as marked (1), which may contain integer features, labels and identifiers of categorical features for each data record, which may also be referred to as an event. Identifiers of categorical features may be used to prepare data for the SLB, as marked (2). The embeddings not found in the SLB may be read from the local embedding table shard, as marked (3), and/or be received from other computing nodes through the network interface, as marked (4). It may also be required to send embeddings from the local embedding table shard to other computing nodes which require them. This step may also be referred to as the first communication pattern. This means that the step marked (4) on FIG. 8 is also a synchronization point for the cluster of computing nodes, during which all of the computing nodes communicate.
Followingly, within the computing device, as marked (1) on FIG. 7, values from the SLB are read from the internal memory of the computing device at the step marked (5). Values which need to be added to the SLB are also written there during this stage. Followingly, all the embedding values may be placed in the right order defined by the initial list of identifiers corresponding to each particular data record, read at the step marked (1). Note that steps known to the person skilled in the art were omitted from the figure description.
Referring now to FIG. 9 which is a schematic illustration of exemplary synchronous local buffer operations, according to some embodiments of the present disclosure.
FIG. 9 shows in detail the steps marked (2) and (5) in FIG. 8. During the prepare data for SLB step, marked (2), service data may be prepared for the SLB. The SLB index may be used to determine which embeddings are already in the SLB, and accordingly assigned to be read therefrom at the Read/Write values from/to SLB step marked (5). After the step marked (2) the most frequent identifiers within the local list of identifiers, characterizing the batch, may be determined.
Followingly, the top N most frequent identifiers among all batches currently processed by the computing nodes may be found. This operation may require communication between nodes, and the resulting top N identifiers may be replicated among all computing nodes. These identifiers may be used followingly to fill the SLB.
At the Read/Write values from/to SLB step marked (5), embedding values from the SLB may be read for identifiers already added to the SLB, and the embedding values marked to be added may be written to the SLB. Note that steps known to the person skilled in the art were omitted from the figure description.
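One possible way to determine the replicated top N identifiers, assuming torch.distributed is available for the inter-node exchange; the all-gather of per-node counts is an illustrative choice, the disclosure does not prescribe a specific collective.

```python
from collections import Counter
import torch.distributed as dist

def global_top_n_ids(local_cat_ids, n):
    """Return the N identifiers occurring most frequently across the batches
    currently processed by all computing nodes; every node obtains the same list."""
    local_counts = Counter(local_cat_ids)
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, dict(local_counts))  # exchange local counts
    merged = Counter()
    for counts in gathered:
        merged.update(counts)
    return [ident for ident, _ in merged.most_common(n)]
```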
Referring now to FIG. 10 which is a schematic illustration of an exemplary training iteration, according to some embodiments of the present disclosure.
The training iteration may be done as known to the person skilled in the art, and the calculated neural network gradients may be averaged using an allreduce communication pattern. However, an additional allreduce SLB gradients step, marked (10), where the whole SLB participates in another allreduce collective communication, may be performed, so that the embeddings in the SLB on all computing nodes become synchronized. This step may also be referred to as the third communication pattern. Note that steps known to the person skilled in the art were omitted from the figure description.
Referring now to FIG. 11 which is a schematic illustration of an exemplary SLB creation and filling process, according to some embodiments of the present disclosure.
Denote three parameters: S_initial - the initial size of the SLB, ΔS - the size step, and S_max - the maximum size of the SLB.
Initially the SLB is empty and may be of size S_0 = S_initial. During the very first iteration the S_0 most frequent non-numerical, for example categorical, values are read from the first embedding tables (the local one and those of the other computing nodes and devices) and written to the SLB, also referred to as the second embedding table. After these values are updated with the calculated gradients, the SLB may be averaged using the allreduce collective communication. At the end of the very first iteration the size of the SLB may be increased to S_1 = S_0 + ΔS. At the next iterations all operations are performed in the same way until the size of the SLB reaches the maximum S_final = S_max. Following that point, the SLB may be maintained constant, which means that the steps marked (2) and (5) as shown in FIG. 8 and the step marked (10) as shown in FIG. 10 may be omitted.
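The growth schedule may be sketched as follows, assuming the size simply increases by ΔS per iteration and saturates at S_max; the parameter names are illustrative.

```python
# Illustrative growth schedule for the SLB size: start at s_initial, add
# delta_s after each iteration, stop at s_max, after which steps (2), (5)
# and (10) may be skipped because the SLB content is fixed.
def slb_size_at(iteration, s_initial, delta_s, s_max):
    return min(s_initial + iteration * delta_s, s_max)

# usage: S_initial=8, delta_s=4, S_max=20
print([slb_size_at(i, 8, 4, 20) for i in range(6)])  # [8, 12, 16, 20, 20, 20]
```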
The goal of the SLB is to estimate the most frequent values of the embedding table. The estimation accuracy may be characterized by the deviation of the estimated distribution function from the true one, which for an empirical estimate over N samples shrinks approximately as

|F̂(S_max) − F(S_max)| ∝ 1/√N

where F(S_max) is the true distribution function, F̂(S_max) is the estimated distribution function, S_max is the SLB size, and N is the sample size. The step-by-step filling of the SLB may improve accuracy by enlarging the effective sample size used to perform the estimation:
N ≈ B · (S_max − S_initial) / ΔS

where S_initial is the initial SLB size, ΔS is the SLB size step, and B is the batch size.
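As a worked example, assuming the effective sample size scales with the number of filling iterations multiplied by the batch size (an assumption consistent with the expression above), the following sketch shows how the step-by-step filling enlarges the sample used for the estimate; the helper name effective_sample_size is hypothetical.

```python
# Filling the SLB over several iterations lets the frequency estimate draw on
# B identifiers per iteration instead of a single batch.
import math

def effective_sample_size(s_initial, delta_s, s_max, batch_size):
    filling_iterations = 1 + math.ceil((s_max - s_initial) / delta_s)
    return filling_iterations * batch_size

n_single = 4096                                        # one batch only
n_eff = effective_sample_size(1024, 512, 4096, 4096)   # 7 * 4096 = 28672
print(n_eff, math.sqrt(n_single / n_eff))              # error shrinks roughly as 1/sqrt(N)
```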
Referring now to FIG. 12, which is a schematic illustration of an exemplary SLB updating process, according to some embodiments of the present disclosure.
In some implementations it may be beneficial to update the content of the SLB, i.e. which values of the non-numerical properties are stored therein, for example in case the distribution changes over time.
The updating may be done synchronously, with an additional processing circuitry of an additional computing node from the plurality of computing nodes, or with the other worker nodes of the cluster. It may also be performed by a parameter server, in which case associated communication pattern adaptations may be made.
Freeing space in the SLB may require choosing and removing at least one additional value from the second embedding table; however, this may be performed gradually, wherein during each batch some values, chosen by storage address or by a different criterion, may be removed.
Subsequently, values of the at least one non-numerical feature which occur frequently in the at least one batch and are not present in the SLB, and the vectorized representations associated therewith, may be added to the SLB, also referred to as the second embedding table.
The updating comprises adding at least one index value of the at least one non-numerical feature, frequently occurring either in the inference data or in the data in at least one batch used for further training, for example during online training, together with the associated vectorized representation, to the second embedding table.
Subsequently, the third communication pattern may be applied on the at least one non-numerical feature values.
The values added may comprise the most frequent value of the at least one non-numerical feature in the inference data or the training batches used by each computing node.
FIG. 12 shows an exemplary updating process. At the first step, ΔS values, both index and embedding values, are dropped from the beginning of the SLB. Subsequently, during the next training iteration this free space is filled with new values. The SLB may be updated so that there are no duplicates in the SLB, therefore only values which are not already in the SLB are considered for updating. When the first step is over, the next ΔS values may be dropped and filled during the following training iteration. The procedure may continue until all values in the SLB are updated. It should be noted that different implementations may update the SLB in a different order, or drop values in accordance with their occurrence frequency; however, the latter may be complex to implement.
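The rolling update may be sketched as follows, assuming ΔS entries are dropped from the head of the SLB and the freed slots are refilled with frequent identifiers that are not already cached; the helper names drop_head and refill are illustrative only.

```python
# A minimal sketch of the rolling update in FIG. 12: drop delta_s entries from
# the head of the SLB, then refill the freed slots during the next iteration
# with frequent identifiers that are not already cached.
def drop_head(slb_ids, delta_s):
    """Mark the oldest delta_s SLB slots as free and return the surviving ids."""
    return slb_ids[delta_s:]

def refill(slb_ids, candidate_ids, delta_s):
    """Append up to delta_s new frequent identifiers, skipping duplicates."""
    fresh = [i for i in candidate_ids if i not in set(slb_ids)]
    return slb_ids + fresh[:delta_s]

# usage: one update step over an SLB holding identifiers [10, 11, 12, 13]
slb = drop_head([10, 11, 12, 13], delta_s=2)     # -> [12, 13]
slb = refill(slb, [13, 20, 21, 22], delta_s=2)   # -> [12, 13, 20, 21]
print(slb)
```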
Referring now to FIG. 13, which is a schematic graph representing simulated memory overhead and communication speed-up in response to the SLB size, according to some embodiments of the present disclosure.
Figure 13 shows, as a ratio, a comparison of the simulated network communication speed-up versus the estimated memory overhead required to place the SLB. A high memory overhead is undesirable since the SLB is placed directly in the internal, expensive memory of the computing device, for example the VRAM of GPU modules. Therefore the preferred combinations are in the center of the graph shown. The disclosed method enables a significant communication speed-up for a relatively small memory overhead. For example, a 12% memory overhead may provide a 2.8X communication speed-up.
Referring now to FIG. 14, which is a schematic graph representing a simulated communication speed-up in response to the batch size, according to some embodiments of the present disclosure.
The training batch size may influence the effectiveness of the proposed solution. FIG. 14 shows the dependency of the communication speed-up on the batch size. The smaller the batch, the bigger the speed-up that may be provided by the disclosed SLB.
Referring now to FIG. 15, which is a schematic graph representing a comparison between the communication speed-up of the disclosure and that of an ideal solution, wherein the distribution of categorical values is known in advance, in response to the memory overhead, according to some embodiments of the present disclosure.
FIG. 15 shows a comparison between the disclosed SLB solution and an ideal solution wherein the distribution of the categorical values of the non-numerical features is known in advance. As expected, the disclosed SLB shows some performance gap from the ideal solution, since it has a non-zero distribution estimation error.
The descriptions of the various embodiments have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. It is expected that during the life of a patent maturing from this application many relevant machine learning models, training methods and communication patterns will be developed and the scope of the terms training, machine learning model, neural network, communication patterns and the like, are intended to include all such new technologies a priori.
As used herein the term “about” refers to ± 10 %.
The terms "comprises", "comprising", "includes", "including", “having” and then- conjugates mean "including but not limited to". This term encompasses the terms "consisting of' and "consisting essentially of'.
The phrase "consisting essentially of means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
As used herein, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
The word “exemplary” is used herein to mean “serving as an example, instance or illustration”. Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.
The word “optionally” is used herein to mean “is provided in some embodiments and not provided in other embodiments”. Any particular embodiment may include a plurality of “optional” features unless such features conflict.
Throughout this application, various embodiments may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of embodiments. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases “ranging/ranges between” a first indicated number and a second indicated number and “ranging/ranges from” a first indicated number “to” a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.
It is appreciated that certain features of embodiments, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of embodiments, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.
Although embodiments have been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims.
It is the intent of the applicant(s) that all publications, patents and patent applications referred to in this specification are to be incorporated in their entirety by reference into the specification, as if each individual publication, patent or patent application was specifically and individually noted when referenced that it is to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting. In addition, any priority document(s) of this application is/are hereby incorporated herein by reference in its/their entirety.

Claims

WHAT IS CLAIMED IS:
1. A system configured for distributed training of a neural network, comprising a plurality of computing nodes, each computing node comprising: at least one processing circuitry configured to: receive a plurality of data records, each comprising at least one non-numerical feature; generate a first embedding table, comprising vectorized representations, each associated with a value of the at least one non-numerical feature; generate a second embedding table comprising vectorized representations, each associated with frequently occurring values of the at least one non-numerical feature; apply a first communication pattern to collect values of the at least one non-numerical feature for at least one batch, from at least one of the plurality of computing nodes; generate a plurality of gradients by performing at least one training iteration on the neural network using the at least one batch from the plurality of data records; update the first embedding table by the plurality of gradients using a second communication pattern with the plurality of computing nodes; update the second embedding table by the plurality of gradients; and equalize the second embedding table with the second embedding table of the plurality of computing nodes using a third communication pattern.
2. The system of claim 1 wherein the at least one processing circuitry, and an additional processing circuitry of an additional computing node from the plurality of computing nodes are configured to: synchronously add at least one value of the at least one non-numerical feature, frequently occurring in the at least one batch, and vectorized representations associated therewith, to the second embedding table; and apply the third communication pattern on the at least one non-numerical feature values.
3. The system of claim 2 wherein the at least one value comprises the most frequent value of the at least one non-numerical feature in the at least one batch used by each computing node.
4. The system of claim 3 wherein the at least one processing circuitry and an additional processing circuitry of an additional computing node from the plurality of computing nodes are configured to: synchronously choose and remove at least one additional value from the second embedding table; and synchronously add at least one value of the at least one non-numerical feature, frequently occurring in the at least one batch, and vectorized representations associated therewith, to the second embedding table.
5. The system of claim 1, wherein the first communication pattern collects values from at least one remote embedding table.
6. The system of claim 1, wherein the second communication pattern is an all-to-all pattern.
7. The system of claim 1 wherein the third communication pattern is an all-reduce synchronization pattern.
8. The system of claim 1, further comprising at least one processing circuitry configured as a parameter server storing at least one remote embedding table.
9. The system of claim 1, wherein the second embedding table is stored on a memory enabling faster access compared to the memory the first embedding table is stored on.
10. The system of claim 1, wherein the second embedding table in each of the plurality of computing nodes stores the same values.
11. A computer-implemented method for inferencing from a neural network, using a plurality of computing nodes, each computing node comprising at least one processing circuitry, and the method comprising: obtaining a first embedding table for at least one non-numerical feature, and a second embedding table comprising vectorized representations, each associated with frequently occurring values of at least one non-numerical feature, and being shared by the plurality of computing nodes; inferencing from at least one data record using a neural network.
12. The computer-implemented method of claim 11, further comprising: updating the second embedding table by adding at least one value of at least one frequently occurring non-numerical feature, and vectorized representations associated therewith; and applying a synchronizing communication pattern on at least one value of the at least one non-numerical feature.
13. The computer-implemented method of claim 12 wherein the at least one value comprises the most frequent value of the at least one non-numerical feature.
14. A computer-implemented method for training a neural network, using a plurality of computing nodes, each computing node comprising at least one processing circuitry, and the method comprising: receiving a plurality of data records, each comprising at least one non-numerical feature; generating a first embedding table, comprising vectorized representations, each associated with a value of the at least one non-numerical feature; generating a second embedding table comprising vectorized representations, each associated with frequently occurring values of the at least one non-numerical feature; applying a first communication pattern to collect values of the at least one non-numerical feature for at least one batch, from at least one of the plurality of computing nodes; generating a plurality of gradients by performing at least one training iteration on the neural network using the at least one batch from the plurality of data records; updating the first embedding table by the plurality of gradients using a second communication pattern with the plurality of computing nodes; updating the second embedding table by the plurality of gradients; and equalizing the second embedding table with the second embedding table of the plurality of computing nodes using a third communication pattern.
15. The computer-implemented method of claim 14, further comprising: adding synchronously at least one value of the at least one non-numerical feature, frequently occurring in the at least one batch, and vectorized representations associated therewith, to the second embedding table; and applying the third communication pattern on the at least one non-numerical feature values.
16. The computer-implemented method of claim 15, wherein the at least one value comprises the most frequent value of the at least one non-numerical feature in the at least one batch used by each computing node.
17. The computer-implemented method of claim 16, further comprising: choosing and removing synchronously at least one additional value from the second embedding table; and adding synchronously at least one value of the at least one non-numerical feature, frequently occurring in the at least one batch, and vectorized representations associated therewith, to the second embedding table.
18. The computer-implemented method of claim 14, wherein the first communication pattern collects values from at least one remote embedding table.
19. The computer-implemented method of claim 14, wherein the second communication pattern is an averaging all-to-all pattern.
20. The computer-implemented method of claim 14, wherein the third communication pattern is an all-reduce synchronization pattern.
21. The computer-implemented method of claim 14, further comprising at least one processing circuitry configured as a parameter server storing at least one remote embedding table.
22. The computer-implemented method of claim 14, wherein the second embedding table is stored on a memory enabling faster access compared to the memory the first embedding table is stored on.
23. The computer-implemented method of claim 14, wherein the second embedding table in each of the plurality of computing nodes stores the same values.
PCT/RU2021/000483 2021-11-03 2021-11-03 Distributed embedding table with synchronous local buffers WO2023080805A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/RU2021/000483 WO2023080805A1 (en) 2021-11-03 2021-11-03 Distributed embedding table with synchronous local buffers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RU2021/000483 WO2023080805A1 (en) 2021-11-03 2021-11-03 Distributed embedding table with synchronous local buffers

Publications (1)

Publication Number Publication Date
WO2023080805A1 true WO2023080805A1 (en) 2023-05-11

Family

ID=78820678

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2021/000483 WO2023080805A1 (en) 2021-11-03 2021-11-03 Distributed embedding table with synchronous local buffers

Country Status (1)

Country Link
WO (1) WO2023080805A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117076720A (en) * 2023-10-18 2023-11-17 北京燧原智能科技有限公司 Embedded table access method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MUDIGERE DHEEVATSA: "Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models", 15 September 2021 (2021-09-15), arXiv.org, pages 1 - 20, XP055947541, Retrieved from the Internet <URL:https://arxiv.org/pdf/2104.05158v5.pdf> [retrieved on 20220729] *
MUHAMMAD ADNAN ET AL: "Accelerating Recommendation System Training by Leveraging Popular Choices", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 28 September 2021 (2021-09-28), XP091044676 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117076720A (en) * 2023-10-18 2023-11-17 北京燧原智能科技有限公司 Embedded table access method and device, electronic equipment and storage medium
CN117076720B (en) * 2023-10-18 2024-02-02 北京燧原智能科技有限公司 Embedded table access method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
JP6989628B2 (en) Optimizing user interface data caching for future actions
CN110580197B (en) Distributed computing architecture for large model deep learning
CN112559007B (en) Parameter updating method and device of multitask model and electronic equipment
CN105378726B (en) Utilize the partitions of database of update step
US10078594B2 (en) Cache management for map-reduce applications
US10630789B2 (en) Facilitating consistent A/B testing assignment
US20150186427A1 (en) Method and system of analyzing dynamic graphs
CN110058936B (en) Method, apparatus and computer program product for determining an amount of resources of a dedicated processing resource
WO2019084560A1 (en) Neural architecture search
US10565018B2 (en) Time frame bounded execution of computational algorithms
CN111143039B (en) Scheduling method and device of virtual machine and computer storage medium
WO2022246833A1 (en) System, method, and medium for elastic allocation of resources for deep learning jobs
Sun et al. Gradientflow: Optimizing network performance for large-scale distributed dnn training
US10691692B2 (en) Computer-implemented method of executing a query in a network of data centres
US11824731B2 (en) Allocation of processing resources to processing nodes
WO2023080805A1 (en) Distributed embedding table with synchronous local buffers
US20220413906A1 (en) Method, device, and program product for managing multiple computing tasks based on batch
JP6305645B2 (en) Dynamic N-dimensional cube for hosted analysis
US10565202B2 (en) Data write/import performance in a database through distributed memory
CN108933813A (en) Prevent reader hungry during sequence retains data flow consumption
US20130212100A1 (en) Estimating rate of change of documents
US10685290B2 (en) Parameter management through RDMA atomic operations
CN110325984B (en) System and method for hierarchical community detection in graphics
US20220272136A1 (en) Context based content positioning in content delivery networks
US11238044B2 (en) Candidate data record prioritization for match processing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21819248

Country of ref document: EP

Kind code of ref document: A1