WO2022089607A1 - Parameter server node recovery method and recovery system - Google Patents


Info

Publication number
WO2022089607A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2021/127609
Other languages
French (fr)
Chinese (zh)
Inventor
陈宬
刘一鸣
杨俊�
王冀
王艺霖
石光川
卢冕
Original Assignee
第四范式(北京)技术有限公司
Application filed by 第四范式(北京)技术有限公司 filed Critical 第四范式(北京)技术有限公司
Publication of WO2022089607A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 Error detection or correction of the data by redundancy in operation
    • G06F 11/1402 Saving, restoring, recovering or retrying
    • G06F 11/1415 Saving, restoring, recovering or retrying at system level
    • G06F 11/1438 Restarting or rejuvenating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 Error detection or correction of the data by redundancy in operation
    • G06F 11/1402 Saving, restoring, recovering or retrying
    • G06F 11/1446 Point-in-time backing up or restoration of persistent data
    • G06F 11/1458 Management of the backup or restore process
    • G06F 11/1469 Backup restoration techniques

Definitions

  • the present disclosure relates to the field of computer technology, and more particularly, to a recovery method and recovery system of a parameter server node.
  • Parameter Server is a programming framework used to facilitate the writing of distributed parallel programs, with the emphasis on distributed storage and collaboration support for large-scale parameters.
  • the online parameter server is mainly used to store the trained super-large-scale parameters and provide high-concurrency and high-availability model parameter query services for online services.
  • the traditional parameter server based on DRAM memory has two problems in the deployment process. The first is the increase of the overall hardware cost caused by the huge memory consumption. The second is that when any node in the parameter server cluster fails and goes offline, the recovery process is time-consuming.
  • Exemplary embodiments of the present disclosure may at least partially solve the above-mentioned problems.
  • a system comprising at least one computing device and at least one storage device storing instructions, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to execute a method for recovering a parameter server node, wherein the parameter server node is a node in a parameter server cluster and is a parameter server node based on non-volatile memory, and the model parameters of each model stored in the non-volatile memory of the parameter server node are logically divided into at least one level of storage sub-modules. The recovery method includes: after the parameter server node restarts, obtaining the stored first-layer data from the non-volatile memory, wherein the first-layer data includes the parameter server node header information and is used to query the node information of the parameter server node and the first-level storage sub-module information; based on the first-layer data, obtaining each first-level storage sub-module header information in the stored second-layer data, wherein the second-layer data includes M pieces of first-level storage sub-module header information and M pieces of first-level storage sub-module hash map information and is used to query the model parameters stored in each first-level storage sub-module, where M is the number of first-level storage sub-modules stored on the parameter server node; and, based on the header information of each first-level storage sub-module, traversing each first-level storage sub-module hash map information in the second-layer data to restore the model parameters stored on the parameter server node.
  • a method for recovering a parameter server node, wherein the parameter server node is a node in a parameter server cluster and is a non-volatile-memory-based parameter server node, and the model parameters of each model stored in the non-volatile memory of the parameter server node are logically divided into at least one level of storage sub-modules. The recovery method includes: after the parameter server node restarts, obtaining the stored first-layer data from the non-volatile memory of the parameter server node, wherein the first-layer data includes the parameter server node header information and is used to query the node information of the parameter server node and the first-level storage sub-module information; based on the first-layer data, obtaining each first-level storage sub-module header information in the stored second-layer data from the non-volatile memory of the parameter server node, wherein the second-layer data includes M pieces of first-level storage sub-module header information and M pieces of first-level storage sub-module hash map information, where M is the number of first-level storage sub-modules stored on the parameter server node; and, based on the header information of each first-level storage sub-module, traversing each first-level storage sub-module hash map information in the second-layer data to restore the model parameters stored on the parameter server node.
  • a system for recovering a parameter server node, wherein the parameter server node is a node in a parameter server cluster and is a non-volatile-memory-based parameter server node, and the model parameters of each model stored in the non-volatile memory of the parameter server node are logically divided into at least one level of storage sub-modules. The recovery system includes: a first acquiring device, configured to: after the parameter server node restarts, acquire the stored first-layer data from the non-volatile memory of the parameter server node, wherein the first-layer data includes the parameter server node header information and is used to query the node information of the parameter server node and the first-level storage sub-module information; a second acquiring device, configured to: based on the first-layer data, acquire each first-level storage sub-module header information in the stored second-layer data from the non-volatile memory of the parameter server node, wherein the second-layer data is used to query the model parameters stored in each first-level storage sub-module of the parameter server node, and M is the number of first-level storage sub-modules stored on the parameter server node; and a recovery device, configured to: based on the header information of each first-level storage sub-module, traverse each first-level storage sub-module hash map information in the second-layer data to restore the model parameters stored on the parameter server node.
  • a computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one computing device, cause the at least one computing device to perform the recovery method of a parameter server node according to the present disclosure.
  • an electronic device comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the recovery method of the parameter server node according to the present disclosure.
  • the model parameters are stored on the parameter server based on non-volatile memory instead of the parameter server based on DRAM, which greatly reduces the hardware cost.
  • the model parameter logical storage structure and the data physical storage structure are designed: the model parameters are logically stored hierarchically, and the data is stored in two layers on the non-volatile memory, meeting the high-concurrency and high-availability requirements of the parameter server.
  • the rapid recovery process after restart is designed: according to the two-layer data storage structure, all parameters stored on the parameter server node can be easily and quickly queried and restored, achieving millisecond-level recovery.
  • FIG. 1 is a schematic diagram illustrating a conventional parameter server cluster architecture.
  • FIG. 2 is a schematic diagram illustrating a node failure of a conventional parameter server cluster.
  • FIG. 3 is a flowchart illustrating a storage method of model parameters according to an exemplary embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram illustrating a logical storage structure of model parameters according to an exemplary embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram illustrating a data physical storage structure of a parameter server node according to an exemplary embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram illustrating a data physical storage structure of a parameter server node according to another exemplary embodiment of the present disclosure.
  • FIG. 7 is a schematic diagram illustrating a data physical storage structure of a parameter server node according to another exemplary embodiment of the present disclosure.
  • FIG. 8 is a flowchart illustrating a recovery method of a parameter server node according to an exemplary embodiment of the present disclosure.
  • FIG. 9 is a flowchart illustrating a recovery method of a parameter server node according to an exemplary embodiment of the present disclosure.
  • FIG. 10 is a block diagram illustrating a storage system of model parameters according to an exemplary embodiment of the present disclosure.
  • FIG. 11 is a block diagram illustrating a recovery system of a parameter server node according to an exemplary embodiment of the present disclosure.
  • This type of AI application has two characteristics.
  • the first feature is the super-high dimensionality of the model.
  • Each dimension of the model corresponds to a parameter. A model that can handle massive data can reach hundreds of millions of dimensions or even a billion dimensions, which means hundreds of millions or even billions of parameters must be stored for each model. An online parameter server is therefore needed to store these parameters.
  • the second feature is real-time responsiveness. Taking anti-fraud as an example, only a few milliseconds are available from the moment a user's card-swiping behavior occurs to the moment the model must give its final prediction result.
  • FIG. 1 is a schematic diagram showing a conventional parameter server cluster architecture.
  • the storage sub-module Storage 3 of Model 1 can be prepared as three copies of the same content stored on three server nodes (PS Node 1, PS Node 2, PS Node 3). This ensures that online services can still be provided after one or two of those servers fail and go offline.
  • Another advantage of multiple replicas is that multiple online replicas can share the access pressure, thereby increasing the access bandwidth of the overall parameter server.
  • DRAM-based parameter servers have two pain points during deployment.
  • the first is the increase in overall hardware cost caused by huge memory consumption.
  • the capacity of a single DRAM module is limited, generally 16 GB or 32 GB, which forces the parameter server to store model parameters across a cluster of multiple machines to meet memory requirements.
  • the aforementioned multi-copy mechanism increases the memory overhead and cost.
  • the second pain point is shared by all DRAM-based memory systems: because DRAM loses its contents on restart, node recovery is time-consuming.
  • DRAM-based parameter server node recovery requires three steps: (1) read all parameters from slow HDFS disk storage; (2) transfer these parameters over the network to the restarted parameter server node; (3) rebuild all parameter structures in DRAM and insert the parameters into a HashMap (hash map) data structure in DRAM.
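As a rough illustration of why this recovery path is slow, the three steps above can be modeled as follows. The function and record names are hypothetical, not from the disclosure:

```python
def rebuild_in_dram(records):
    """Model of step 3: re-insert every (name, value) pair into an
    in-memory hash map. Steps 1 and 2 (reading the snapshot from HDFS
    and shipping it over the network) are modeled as the already
    materialized `records` iterable; in practice they dominate the
    recovery time, and all three steps repeat on every restart."""
    params = {}  # the DRAM-resident HashMap being rebuilt
    for name, value in records:
        params[name] = value
    return params

restored = rebuild_in_dram([("Para0", 0.13), ("Para1", -2.4)])
```

With billions of parameters, every item in `records` must cross disk and network before this insertion loop can even begin, which is exactly the cost the NVM-based design below avoids.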
  • FIG. 2 is a schematic diagram showing a node failure of a conventional parameter server cluster.
  • the parameter server uses multiple copies to share the access pressure of the system. This also means that while a node (PS Node 2) is offline and recovering, the overall throughput of the system decreases, and the longer the recovery takes, the longer the throughput remains degraded.
  • the present disclosure proposes a novel model parameter storage method and a novel parameter server node recovery method.
  • the present disclosure proposes that non-volatile memory (NVM) can be used instead of DRAM, realizing a method for storing model parameters on a non-volatile-memory-based parameter server node. This brings the possibility of rapid recovery of parameter server nodes: for a non-volatile-memory-based parameter server node, the parameters are not lost after a restart and remain in the non-volatile memory, so the three-step recovery process described above can be omitted.
  • NVM generally refers to memory that can store data in the event of a power failure.
  • New-generation NVMs include STT-RAM, PCM, ReRAM, 3D XPoint, and the like, as distinguished from first-generation NVMs such as Flash memory.
  • the present disclosure proposes a storage structure for model parameters in a non-volatile memory in a non-volatile memory-based parameter server node.
  • the model parameters corresponding to each model stored on each parameter server node may be logically divided into at least one level of storage sub-modules.
  • each upper-level storage sub-module includes at least one adjacent lower-level storage sub-module.
  • Hierarchical storage of model parameters facilitates the management and querying of model parameters, meets the requirements of highly concurrent access, and makes traversal and querying easier when the parameter server node is restored.
  • the present disclosure also proposes a new data organization form. Specifically, in the non-volatile memory of the non-volatile-memory-based parameter server node, the data is divided into two layers of storage. The first layer of data records the ID information of the parameter server node itself and the ID information of all the first-level storage sub-modules on that node.
  • the second-level data can use a hash map (HashMap) for each first-level storage sub-module to store specific model parameters.
  • the second-level data also includes the first-level storage sub-module header information for query use during fast recovery.
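The two-layer organization described above can be modeled as a small sketch. The class and field names below mirror the disclosure's terms (Node ID, Storage ID List, Version, Shard HashMap), but the classes themselves are illustrative assumptions, not the patent's actual data structures:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class FirstLayerData:
    node_id: int                # Node ID: unique within the cluster
    storage_id_list: List[str]  # Storage ID List: all first-level sub-modules on this node

@dataclass
class StorageHeader:
    storage_id: str  # first-level storage sub-module ID
    version: int     # Version of the model parameters stored in this sub-module
    shard_hashmap: Dict[int, Dict[str, float]]  # Shard ID -> Para HashMap (name -> value)

# First layer: the node's root information.
layer1 = FirstLayerData(node_id=0, storage_id_list=["Model 1 Storage 0"])

# Second layer: one header plus hash-map information per first-level sub-module.
layer2 = {
    "Model 1 Storage 0": StorageHeader(
        storage_id="Model 1 Storage 0",
        version=1,
        shard_hashmap={0: {"Para0": 0.5}},
    )
}
```

Starting from `layer1.storage_id_list`, every parameter in `layer2` is reachable, which is what makes the fast-recovery traversal possible.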
  • the present disclosure also proposes a new recovery process.
  • When the parameter server node restarts, the storage sub-modules at all levels can be quickly traversed according to the first-layer data and second-layer data stored in the non-volatile memory, so that all model parameters stored on the parameter server node can be quickly restored, realizing millisecond-level recovery of the non-volatile-memory-based parameter server node.
  • the method and system for storing model parameters and the method and system for restoring parameter server nodes according to the present disclosure will be described in detail below with reference to FIGS. 3 to 11 .
  • FIG. 3 is a flowchart illustrating a storage method of model parameters according to an exemplary embodiment of the present disclosure.
  • model parameters of at least one model may be acquired.
  • the model can be an AI model with high dimensions, for example, a recommendation model, a credit card anti-fraud model, etc.
  • Massive historical data can be put into the offline training system to train the AI model, and the trained AI model can be deployed to the online inference system for use.
  • Each dimension of an AI model corresponds to a parameter.
  • An AI model that can process massive data can reach hundreds of millions of dimensions or even a billion dimensions, so each model may store hundreds of millions or even billions of parameters.
  • the model parameters here can refer to parameters before training or during training, and can also refer to parameters after training is completed.
  • model parameters of the at least one model may be stored in a parameter server cluster, wherein the parameter server cluster includes a plurality of non-volatile memory-based parameter server nodes.
  • the parameter server cluster may be an offline parameter server cluster for training, or an online parameter server cluster for inference.
  • the model parameters of each model in the at least one model may be distributedly stored in the non-volatile memory of the plurality of non-volatile-memory-based parameter server nodes; for each non-volatile-memory-based parameter server node, the model parameters corresponding to each model stored on the parameter server node are logically divided into at least one level of storage sub-modules for storage.
  • the model parameters of each model in the at least one model may be logically divided into a plurality of storage modules (for example, the following first-level storage sub-modules), and the divided storage modules are distributed and stored into the non-volatile memory of the plurality of non-volatile memory-based parameter server nodes.
  • each storage module may be further divided into at least one level of storage sub-modules (for example, the second-level storage sub-modules, third-level storage sub-modules, etc. described below).
  • FIG. 4 is a schematic diagram illustrating a logical storage structure of model parameters according to an exemplary embodiment of the present disclosure.
  • FIG. 4 exemplarily shows only the model parameter storage structure on one parameter server node (i.e., PS Node X) in the parameter server cluster; other parameter server nodes can store model parameters in the same or a similar manner.
  • Each model can be logically divided into multiple storage sub-modules (eg, Storage), and these storage sub-modules can be hashed to multiple parameter server nodes.
  • Multiple storage sub-modules stored on one parameter server node may belong to different models. For example, Storage 1 and Storage 3 in FIG. 4 belong to Model 1, and Storage 24 belongs to Model 2.
  • Each Storage sub-module can be further divided into multiple next-level storage sub-modules (for example, Shards). Each Shard is responsible for storing specific model parameters: key-value pairs are stored in Shards, and each Shard stores only a part of the model's parameters to share the load.
  • FIG. 4 only shows the model parameters logically divided into two levels of storage sub-modules (Storage and Shard), but the present disclosure is not limited thereto; the model parameters may be logically divided into storage sub-modules of any number of levels.
  • For example, the model parameters can be logically divided into only one level of storage sub-modules (e.g., Storage), with specific model parameters stored under each Storage; or into three levels, that is, after Storage and Shard, each Shard can be further divided into multiple third-level storage sub-modules, with specific model parameters stored in each third-level storage sub-module, and so on.
  • the first-tier data and the second-tier data may be stored in the non-volatile memory of each non-volatile memory-based parameter server node.
  • FIG. 5 is a schematic diagram illustrating a data physical storage structure of a parameter server node according to an exemplary embodiment of the present disclosure.
  • On the non-volatile memory of a parameter server node, there are two layers of data storage (the first layer and the second layer shown in FIG. 5): the first-layer data stores the information of the current node, and the M Storages in the second-layer data respectively store the specific information of the M first-level storage sub-modules. The first-layer data and second-layer data according to an exemplary embodiment of the present disclosure will be described in detail below.
  • parameter server node header information may be stored for querying the node information and first-level storage sub-module information of the parameter server node.
  • the node information can be the ID information (Node ID) of the parameter server node.
  • the parameter server cluster assigns a unique ID to each parameter server node. Therefore, by obtaining the ID information of a parameter server node, it can be determined which parameter server node in the cluster it is.
  • the first-level storage submodule information may be the ID information list (Storage ID List) of all first-level storage submodules (Storage) stored on the parameter server node.
  • the IDs of the storage sub-modules (e.g., Storage ID, Shard ID) may repeat across models; for example, S1R1 (Storage 1, Shard 1) exists for Model 1, and there will also be S1R1, S1R2, S2R1, S2R2, and so on for Model 2.
  • each first-level storage sub-module ID included in the first-level storage sub-module ID list may be composed of the model ID and the first-level storage sub-module ID within the corresponding model; for example, the first-level storage sub-module ID list may include Model 1 Storage 0, Model 1 Storage 3, and Model 2 Storage 1.
  • the first-level storage sub-module ID list can be stored in the first-layer data in the form of a persistent list, which ensures that the list is durably kept in non-volatile memory as model parameters are inserted.
  • the first-layer data (i.e., the parameter server node header information) can be stored at a fixed location in the non-volatile memory of the parameter server node, as the root of the entire non-volatile-memory-based parameter server node's storage, to facilitate querying during fast recovery.
  • M pieces of first-level storage sub-module header information and M pieces of first-level storage sub-module hash map (HashMap) information may be stored.
  • the second-level data can be used to query the model parameters stored in each first-level storage sub-module of the parameter server node. For example, for each first-level storage submodule, a corresponding non-volatile memory storage pool may be newly created in the second-level data storage for separate storage.
  • the first-level storage sub-module header information may be allocated three spaces, which are respectively used to store the first-level storage sub-module ID (Storage ID), the version information (Version) of the model parameters stored in the sub-module, and the pointer (Shard HashMap Pointer) of the first-level storage sub-module to the next-level hash map.
  • the header information of the first-level storage sub-module points to the hash map of the next level.
  • the pointer may be a Shard HashMap Pointer, but the present disclosure is not limited thereto.
  • the pointer to the next-level hash map in the header information of the first-level storage sub-module may instead be a pointer to the model parameter hash map (for example, Para HashMap Pointer).
  • the first-level storage sub-module hash mapping information may include the hash map of each level of storage sub-modules from the second-level to the Nth-level storage sub-module under the first-level storage sub-module.
  • In the hash map of each level of storage sub-module, the key is the ID of that level's storage sub-module and the value is the pointer to the next-level hash map; in the model parameter hash map, the key is the parameter name and the value is the parameter value. The hash map of each level of storage sub-module and the model parameter hash map are linked by the corresponding pointer to the next-level hash map held by that level's (or the parameter's upper-level) storage sub-module.
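The pointer-linked chain described above can be sketched minimally as follows. This is an assumed structure for illustration, not code from the disclosure; plain dicts stand in for the NVM-resident hash maps and pointers:

```python
def lookup(first_level_header, sub_module_ids, para_name):
    """Follow the next-level pointers from a first-level storage
    sub-module header down to the model parameter hash map, then
    return the value stored under the parameter name."""
    hashmap = first_level_header["next_level"]  # pointer held in the header
    for sub_id in sub_module_ids:               # levels 2 .. N, key = sub-module ID
        hashmap = hashmap[sub_id]               # value = pointer to next-level map
    return hashmap[para_name]                   # leaf Para HashMap: name -> value

# Three logical levels, as in FIG. 7: Shard HashMap -> Slice HashMap -> Para HashMap
header = {"next_level": {0: {0: {"Para0": 42.0}}}}
value = lookup(header, [0, 0], "Para0")
```

The same `lookup` works for any depth N, because each level's value is simply a reference to the next level's hash map.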
  • the model parameters are divided into two levels of storage sub-modules (Storage and Shard) for storage. Therefore, the hash map information of the first-level storage sub-module may include the second-level storage sub-module hash map (Shard HashMap) and the model parameter hash map (Para HashMap) under each second-level storage sub-module.
  • In the second-level storage sub-module hash map, the key is the second-level storage sub-module ID (Shard ID) and the value is the pointer to the model parameter hash map (Para HashMap Pointer); in the model parameter hash map, the key is the parameter name (Para ID) and the value is the parameter value (Value).
  • each level is linked by the previous level's pointer to the next-level hash map. That is, the pointer to the next-level hash map in the header information of the first-level storage sub-module points to the second-level storage sub-module hash map, and the pointer in the second-level storage sub-module hash map points to the model parameter hash map.
  • FIG. 5 only exemplarily shows the model parameter hash map under one second-level storage sub-module (Shard 0), but other second-level storage sub-modules also have model parameter hash maps.
  • the present disclosure is not limited to the illustration shown in FIG. 5 .
  • FIG. 6 is a schematic diagram illustrating a data physical storage structure of a parameter server node according to another exemplary embodiment of the present disclosure.
  • when the hash map information of each first-level storage sub-module includes only the model parameter hash map (Para HashMap), the pointer to the next-level hash map in the header information of the first-level storage sub-module is the pointer to the model parameter hash map (Para HashMap Pointer).
  • the first-level storage submodule header information and the model parameter hash map are linked by a pointer to the model parameter hash map.
  • FIG. 7 is a schematic diagram illustrating a data physical storage structure of a parameter server node according to another exemplary embodiment of the present disclosure.
  • the hash map information of each first-level storage sub-module includes the hash map (Shard HashMap) of each second-level storage sub-module, the hash map (Slice HashMap) of each third-level storage sub-module under each second-level storage sub-module, and the model parameter hash map (Para HashMap) under each third-level storage sub-module.
  • In this case, the pointer to the next-level hash map in the header information of the first-level storage sub-module is a pointer to the second-level storage sub-module hash map (Shard HashMap Pointer), the pointer to the next-level hash map in the second-level storage sub-module hash map is a pointer to the third-level storage sub-module hash map (Slice HashMap Pointer), and the pointer to the next-level hash map in the third-level storage sub-module hash map is a pointer to the model parameter hash map (Para HashMap Pointer).
  • The levels are linked by each level's pointer to the next-level hash map: the pointer in the header information of the first-level storage sub-module points to the second-level storage sub-module hash map, the pointer in the second-level storage sub-module hash map points to the third-level storage sub-module hash map, and the pointer in the third-level storage sub-module hash map points to the model parameter hash map.
  • FIG. 7 only exemplarily shows the third-level storage sub-module hash map under one second-level storage sub-module (Shard 0) and the model parameter hash map under one third-level storage sub-module (Slice 0); other second-level storage sub-modules likewise have third-level storage sub-module hash maps, and other third-level storage sub-modules likewise have model parameter hash maps.
  • model parameters can be divided into storage sub-modules of any level.
  • when the model parameters are divided into storage sub-modules of any number of levels, the first-layer data and second-layer data can be stored and constructed according to the logic of the above examples.
  • a parameter of a specified model in the at least one model can be queried from a non-volatile memory-based parameter server cluster, and a corresponding model estimation service can be provided based on the queried parameter of the specified model .
  • Specifically, for a parameter to be queried, the ID of the parameter server node storing the parameter and the storage sub-module IDs at each level can be determined; the parameter server node is found according to the determined node ID, and the parameter value corresponding to the parameter name is then found in the second-layer data on that node according to the storage sub-module IDs at each level.
  • the process of parameter query is that the user provides a parameter name (for example, Para0 in the last layer), and the parameter server node returns the value corresponding to Para0.
  • In the corresponding Storage in the second-layer data, the Shard hash map is found according to the pointer to the Shard hash map in that Storage's first-level storage sub-module header information; in the Shard hash map, the pointer to the model parameter hash map is found according to the Shard ID; the model parameter hash map is found according to that pointer; and in the model parameter hash map, the parameter value is found according to the parameter name Para0.
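The query path above can be sketched end to end. The routing function (a stable hash of the parameter name taken modulo the node, Storage, and Shard counts) is an assumption for illustration; the disclosure does not specify the hash scheme, and all names here are hypothetical:

```python
def route(para_name, n_nodes, n_storages, n_shards):
    """Determine which node / Storage / Shard holds the parameter."""
    h = sum(para_name.encode())  # stand-in for a stable hash of the name
    return h % n_nodes, h % n_storages, h % n_shards

def query(cluster, para_name, n_storages=4, n_shards=2):
    node_id, storage_id, shard_id = route(para_name, len(cluster),
                                          n_storages, n_shards)
    node = cluster[node_id]                            # found via Node ID
    storage = node["layer2"][storage_id]               # Storage header info
    para_hashmap = storage["shard_hashmap"][shard_id]  # follow the pointers
    return para_hashmap[para_name]                     # value for the name

# A one-node toy cluster holding Para0 in Storage 0 / Shard 0.
cluster = [{"layer2": {0: {"shard_hashmap": {0: {"Para0": 3.14}}}}}]
value = query(cluster, "Para0")
```

Because every hop is a hash-map lookup, the whole path is a handful of pointer dereferences regardless of how many parameters the node stores.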
  • writing parameters only occurs when a new model comes online.
  • the process of writing parameters is similar to the process of querying parameters, with two differences.
  • The first difference is that, when a new value of Para0 is written, it is inserted into the HashMap under Shard 1 of the new Storage 0 (Storage 0 new) rather than overwriting the old parameter in place. After the write completes, the parameter query service is switched to the new parameters, and a backend thread slowly reclaims the space occupied by the old model's parameters.
  • the second difference is that, when the new Para0 is written to the HashMap under Storage 0 new, CPU instructions (such as clflushopt+mfence) can be used to ensure that the parameter data written each time reaches the NVM rather than remaining in the volatile CPU cache.
  • the CPU instruction such as clflushopt+mfence
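  • A hedged sketch of this write path, assuming a simple in-memory model: new parameters go into a brand-new storage pool, the query service switches only after the full upload, and old pools are reclaimed in the background. The clflushopt+mfence persistence step is represented by a placeholder method, since real code would issue CPU cache-line flushes (for example via an NVM library) rather than Python:

```python
class ParamStore:
    """Illustrative model of the write-then-switch update described above."""
    def __init__(self):
        self.active_pool = {}   # pool currently serving queries
        self.old_pools = []     # pools awaiting background reclamation

    def _persist(self, pool, name):
        # Placeholder: real code flushes the written cache lines to NVM
        # (e.g. clflushopt followed by mfence) so the write survives a crash.
        pass

    def upload_new_model(self, new_params):
        new_pool = {}                      # "Storage 0 new": a brand-new pool
        for name, value in new_params.items():
            new_pool[name] = value
            self._persist(new_pool, name)  # flush each write to NVM
        # Only after ALL parameters are uploaded is the query service switched.
        self.old_pools.append(self.active_pool)
        self.active_pool = new_pool        # queries now see the new version

    def reclaim_old(self):
        # Stand-in for the backend thread that slowly frees old models' space.
        self.old_pools.clear()

store = ParamStore()
store.upload_new_model({"Para0": 1.0})
store.upload_new_model({"Para0": 2.0})   # new model version replaces the old
```

Writing into a fresh pool instead of updating in place means a crash mid-upload leaves the old, still-consistent version serving queries.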
  • FIG. 8 is a flowchart illustrating a recovery method of a parameter server node according to an exemplary embodiment of the present disclosure.
  • the recovery method of the parameter server node according to the exemplary embodiment of the present disclosure is suitable for the situation that the parameter server node described above with reference to FIG. 3 needs to be restarted due to a failure or the like.
  • in step 801, the stored first-layer data may be obtained from the non-volatile memory of the parameter server node.
  • the first-layer data may include parameter server node header information, which is used to query the node information and the first-level storage sub-module information of the parameter server node.
  • the parameter server node header information may include a parameter server node ID and a list of first-level storage sub-module IDs. Therefore, which parameter server node in the parameter server cluster the restarted node is can be determined according to the parameter server node ID in the header information, and all first-level storage sub-modules included in the restarted parameter server node can be determined according to the first-level storage sub-module ID list in the header information.
  • the first-layer data may be stored at a fixed location in the non-volatile memory of the parameter server node, so that the first-layer data can be read from that fixed location. In this way, after the parameter server node is restarted, the storage information of the parameter server node can be quickly queried from the node itself.
  • the first-level storage submodule IDs included in the first-level storage submodule ID list may be composed of a model ID and a first-level storage submodule ID in a corresponding model.
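  • The first-layer data described above can be sketched as follows (field and function names are assumptions for illustration; the real header is a structure at a fixed NVM offset, not Python objects):

```python
# Illustrative layout of the first-layer data: the parameter server node
# header, holding the node ID and the persistent list of composite
# first-level storage sub-module IDs (model ID + storage ID).

from dataclasses import dataclass, field
from typing import List

@dataclass
class NodeHeader:
    node_id: int                                          # which cluster node
    storage_ids: List[str] = field(default_factory=list)  # persistent ID list

def make_storage_id(model_id: int, storage_index: int) -> str:
    """A first-level storage sub-module ID combines the model ID with the
    storage sub-module ID within that model, so IDs never collide across
    models."""
    return f"Model{model_id}/Storage{storage_index}"

header = NodeHeader(node_id=7)
header.storage_ids.append(make_storage_id(1, 0))  # Model 1, Storage 0
header.storage_ids.append(make_storage_id(2, 1))  # Model 2, Storage 1
```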
  • in step 802, the header information of each first-level storage sub-module in the stored second-layer data may be acquired from the non-volatile memory of the parameter server node based on the first-layer data.
  • the second-layer data can include M pieces of first-level storage sub-module header information and M pieces of first-level storage sub-module hash map information, and is used to query the model parameters stored in each first-level storage sub-module of the parameter server node, where M is the number of first-level storage sub-modules stored on the parameter server node.
  • the first-level storage sub-module header information may include the first-level storage sub-module ID, the version information of the model parameters stored by the first-level storage sub-module, and the first-level storage sub-module's pointer to the next-level hash map. Therefore, after the first-layer data is acquired, all first-level storage sub-module IDs of the parameter server node can be determined based on the first-level storage sub-module ID list, and then, for each of these IDs, the header information of the first-level storage sub-module with that ID can be acquired. In this way, the header information of all (M) first-level storage sub-modules stored on the parameter server node can be obtained.
  • all (M) first-level storage sub-modules also have respective first-level storage sub-module hash map information.
  • Each first-level storage sub-module may be associated with its first-level storage sub-module hash map information based on a pointer to the next-level hash map in its first-level storage sub-module header information.
  • the hash map information of each first-level storage sub-module may include the hash maps of each level of storage sub-module from the second level to the Nth level under the first-level storage sub-module, and the model parameter hash maps under each Nth-level storage sub-module, where N is the number of levels of storage sub-modules into which the model parameters are divided.
  • the key of the hash map of each level of storage sub-module is the ID of the level of storage sub-module, and the value is a pointer to the next level of hash map.
  • the key of the model parameter hash map is the parameter name and the value is the parameter value.
  • the hash map of each level of storage sub-module and the model parameter hash map are linked through the pointer to the next-level hash map held by the storage sub-module one level above, for example, as shown in FIG. 5 to FIG. 7.
  • in step 803, based on the header information of each first-level storage sub-module, the hash map information of each first-level storage sub-module in the second-layer data can be traversed to restore the model parameters stored on the parameter server node.
  • the following operations may be performed for each first-level storage sub-module: search for the next-level hash map through the pointer of the upper-level storage sub-module to the next-level hash map, until the model parameter hash map is reached; then restore the model parameters based on the model parameter hash map.
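  • The pointer-chasing restoration loop can be sketched as a recursive walk over nested maps (dicts stand in for the persisted hash maps, and the leaf test is a stand-in for however the implementation recognizes a Para HashMap):

```python
def restore_storage(hashmap, is_leaf):
    """Walk pointer-linked hash maps until the model parameter map is reached.

    `hashmap` is the map reached via the previous level's next-level pointer;
    `is_leaf(m)` decides whether `m` is already a Para HashMap. Returns all
    recovered (parameter name, value) pairs under this sub-tree.
    """
    if is_leaf(hashmap):
        return dict(hashmap)              # restore parameters from Para HashMap
    recovered = {}
    for next_level in hashmap.values():   # values act as next-level pointers
        recovered.update(restore_storage(next_level, is_leaf))
    return recovered

# Two-level example: Storage -> Shard HashMap -> Para HashMap.
# Inner keys are Shard IDs (ints); leaf keys are parameter names (strings).
shard_hashmap = {0: {"Para0": 1.5}, 1: {"Para1": -2.0}}
params = restore_storage(
    shard_hashmap,
    is_leaf=lambda m: all(isinstance(k, str) for k in m),
)
```

The same walk handles any number of levels, since each non-leaf map only ever exposes pointers to the level below it.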
  • as an example, in the case where the model parameters are divided into two levels of storage sub-modules, the hash map information of each first-level storage sub-module includes the hash map of each second-level storage sub-module under the first-level storage sub-module (Shard HashMap) and the model parameter hash map under each second-level storage sub-module. The pointer to the next-level hash map in the header information of the first-level storage sub-module is the pointer to the hash map of the second-level storage sub-modules (Shard HashMap Pointer); the key of the hash map of each second-level storage sub-module is the second-level storage sub-module ID (Shard ID), and the value is the pointer (Para HashMap Pointer) to the model parameter hash map under that second-level storage sub-module. When performing recovery, for each first-level storage sub-module, the Shard HashMap can be found through the Shard HashMap Pointer in the header information of the first-level storage sub-module. At this point, it can be checked whether the data of the Shard HashMap is complete, and all of its data can be restored. Then, according to the Para HashMap Pointers of all second-level storage sub-modules recorded in the Shard HashMap, the model parameter hash maps under all second-level storage sub-modules can be found, thereby restoring the model parameters under all second-level storage sub-modules.
  • as another example, in the case where the model parameters are divided into one level of storage sub-modules, the hash map information of each first-level storage sub-module includes the model parameter hash map (Para HashMap) under the first-level storage sub-module, and the pointer to the next-level hash map in the header information of the first-level storage sub-module is the pointer to the model parameter hash map (Para HashMap Pointer). When performing recovery, for each first-level storage sub-module, the model parameter hash map (Para HashMap) can be found through the Para HashMap Pointer in its header information, thereby restoring the model parameters under each first-level storage sub-module.
  • as another example, in the case where the model parameters are divided into three levels of storage sub-modules, the hash map information of each first-level storage sub-module includes the hash map of each second-level storage sub-module under the first-level storage sub-module (Shard HashMap), the hash map of each third-level storage sub-module under each second-level storage sub-module (Slice HashMap), and the model parameter hash map (Para HashMap) under each third-level storage sub-module. The pointer to the next-level hash map in the header information of the first-level storage sub-module is the pointer to the second-level storage sub-module hash map (Shard HashMap Pointer); the pointer to the next-level hash map in the second-level storage sub-module hash map is the pointer to the third-level storage sub-module hash map (Slice HashMap Pointer); and the pointer to the next-level hash map in the third-level storage sub-module hash map is the pointer to the model parameter hash map (Para HashMap Pointer). When performing recovery, the Shard HashMap can be found through the Shard HashMap Pointer in the header information of the first-level storage sub-module. At this point, it can be checked whether the data of the Shard HashMap is complete, and all of its data can be restored.
  • the model parameters can be divided into any number of levels of storage sub-modules, and in any such case the recovery process can be performed by analogy, according to the logic of the above examples, to restore the model parameters.
  • it is possible that the model parameters stored in the parameter server node are out of date. Therefore, when the model parameter version information in the header information of the first-level storage sub-modules in the second-layer data is obtained, whether the model parameters stored in the parameter server node are the latest version can be determined based on that version information, and whether to perform recovery is decided according to the determination result.
  • FIG. 9 is a flowchart illustrating a recovery method of a parameter server node according to an exemplary embodiment of the present disclosure.
  • steps 901 and 902 in FIG. 9 perform the same operations as 801 and 802 in FIG. 8 , and thus will not be repeated here.
  • the version information in the header information of each first-level storage sub-module can be compared with the latest version information of that first-level storage sub-module stored on the parameter server cluster metadata node.
  • the parameter server cluster metadata node is a server that stores global metadata information of the parameter server cluster, such as ZooKeeper.
  • the latest version numbers corresponding to all the first-level storage submodules of each model can be recorded in the metadata node.
  • when the model parameters of a first-level storage sub-module are updated, the version of that first-level storage sub-module is incremented by 1 on the metadata node.
  • in step 903, in the case where the model parameter version information in the header information of the first-level storage sub-module is consistent with the model parameter version information stored on the metadata node of the parameter server cluster, the recovery is performed, that is, step 803 described with reference to FIG. 8, which will not be repeated here.
  • in step 904, in the case where the model parameter version information in the header information of the first-level storage sub-module is inconsistent with the model parameter version information stored on the metadata node of the parameter server cluster, the recovery is not performed; instead, the latest version of the model parameters is pulled from the back-end storage system (for example, HDFS) and inserted into the parameter server node.
  • the backend storage system can store the latest version of the model parameters.
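  • The version check of steps 903/904 can be sketched as follows (the `pull_latest` callback is a hypothetical stand-in for reading the latest parameters from the back-end storage system such as HDFS):

```python
def decide_recovery(local_version, metadata_version):
    """Return True if the local NVM parameters are current and restorable.

    Mirrors steps 903/904: equal versions -> restore from NVM; a newer
    version on the metadata node means the local copy is stale and must
    instead be re-pulled from the back-end storage system.
    """
    return local_version == metadata_version

def recover_submodule(local_version, metadata_version, nvm_params, pull_latest):
    if decide_recovery(local_version, metadata_version):
        return nvm_params   # step 903: restore parameters already in NVM
    return pull_latest()    # step 904: fetch the latest version instead

# Here the metadata node (version 5) is ahead of the restarted node (4),
# so the stale local copy is discarded and the back-end copy is used.
params = recover_submodule(
    local_version=4, metadata_version=5,
    nvm_params={"Para0": 0.1},
    pull_latest=lambda: {"Para0": 0.9},
)
```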
  • FIG. 10 is a block diagram illustrating a storage system of model parameters according to an exemplary embodiment of the present disclosure.
  • a storage system 1000 for model parameters includes an acquisition device 1001 and a storage device 1002 .
  • the obtaining means 1001 can obtain model parameters of at least one model.
  • the model can be a high-dimensional AI model, for example, a recommendation model, a credit card anti-fraud model, etc.
  • Massive historical data can be put into the offline training system to train the AI model, and the trained AI model can be deployed to the online inference system for use.
  • Each dimension of an AI model corresponds to a parameter.
  • An AI model that can process massive data can reach hundreds of millions of dimensions or even a billion dimensions, so each model may store hundreds of millions or even billions of parameters.
  • the model parameters here can refer to parameters before training or during training, and can also refer to parameters after training is completed.
  • the storage device 1002 can store the model parameters of the at least one model in a parameter server cluster, wherein the parameter server cluster includes a plurality of non-volatile memory-based parameter server nodes.
  • the parameter server cluster may be an offline parameter server cluster for training, or an online parameter server cluster for inference.
  • the storage device 1002 may store the model parameters of each model in the at least one model in a distributed manner across the plurality of non-volatile memory-based parameter server nodes.
  • the model parameters corresponding to each model stored on the parameter server node are logically divided into at least one level of storage sub-modules for storage.
  • the storage device 1002 may logically divide the model parameters of each model in the at least one model into a plurality of storage modules (for example, the first-level storage sub-modules described below), and distribute the divided storage modules stored in the non-volatile memory of the plurality of non-volatile memory-based parameter server nodes.
  • each storage module may be further divided into at least one level of storage sub-modules (for example, the second-level storage sub-modules, third-level storage sub-modules, etc. described below).
  • the storage device 1002 may store the first-tier data and the second-tier data in the non-volatile memory of each non-volatile memory-based parameter server node.
  • in the non-volatile memory of the parameter server node, there are two layers of data storage (for example, the first layer and the second layer shown in FIG. 5).
  • there may be one storage area in the first-layer data for storing the information of the current node, and M storage areas in the second-layer data for storing the specific information of the M first-level storage sub-modules respectively.
  • the first layer data and the second layer data according to an exemplary embodiment of the present disclosure will be described in detail below.
  • in the first-layer data, parameter server node header information may be stored for querying the node information and first-level storage sub-module information of the parameter server node.
  • the node information can be the ID information of the parameter server node.
  • the parameter server cluster will assign an ID to each parameter server, and the ID is unique. Therefore, by obtaining the ID information of the parameter server node, it can be determined which parameter server node in the parameter server cluster it is.
  • the first-level storage submodule information may be a list of ID information of all first-level storage submodules stored on the parameter server node.
  • for brevity, the IDs of the storage sub-modules (e.g., Storage ID, Shard ID) can be abbreviated, so that S1R1 denotes Storage 1, Shard 1. If the parameter server node also stores parameters of Model 2, there will also be S1R1, S1R2, S2R1, S2R2, ... of Model 2, so the same IDs may appear for different models.
  • therefore, the first-level storage sub-module ID included in the first-level storage sub-module ID list may be composed of the model ID and the first-level storage sub-module ID in the corresponding model; for example, the first-level storage sub-module ID list can include Model 1 Storage 0, Model 1 Storage 3, Model 2 Storage 1.
  • the first-level storage sub-module ID list can be stored in the first-layer data in the form of a persistent list, which ensures that the list is durably stored in the non-volatile memory.
  • the first-layer data (i.e., the parameter server node header information) serves as the root of the storage of the entire non-volatile memory-based parameter server node, and the storage device 1002 may store the first-layer data at a fixed location in the non-volatile memory of the parameter server node.
  • in the second-layer data, M pieces of first-level storage sub-module header information and M pieces of first-level storage sub-module hash map (HashMap) information may be stored.
  • the second-layer data can be used to query the model parameters stored in each first-level storage sub-module of the parameter server node. For example, for each first-level storage sub-module, the storage device 1002 may create a corresponding new non-volatile memory storage pool in the second-layer data for separate storage.
  • the first-level storage sub-module header information may be allocated three spaces for storing the first-level storage sub-module ID (Storage ID), The version information (Version) of the model parameters stored in the first-level storage submodule, and the pointer (Shard HashMap Pointer) of the first-level storage submodule to the next-level hash map.
  • the next level of the first-level storage submodule (Storage) is the second-level storage submodule (Shard). Therefore, the pointer to the next-level hash map in the header information of the first-level storage submodule can be Shard HashMap Pointer , but the present disclosure is not limited thereto.
  • the pointer to the next-level hash map in the header information of the first-level storage sub-module may instead be a pointer to the model parameter hash map (for example, Para HashMap Pointer).
  • the first-level storage sub-module hash map information may include the hash maps of each level of storage sub-module from the second level to the Nth level under the first-level storage sub-module, and the model parameter hash maps under each Nth-level storage sub-module, where N is the number of levels of storage sub-modules into which the model parameters are divided.
  • the key of the hash map of each level of storage sub-module is the ID of this level of storage sub-module, and the value is the pointer to the next-level hash map; the key of the model parameter hash map is the parameter name, and the value is the parameter value.
  • the hash map of each level of storage sub-module and the model parameter hash map are linked through the pointer to the next-level hash map held by the storage sub-module one level above.
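  • Under these conventions, a three-level layout (Storage, Shard, Slice, Para) can be assembled as nested maps whose values act as the next-level hash map pointers (an illustrative sketch, not the on-NVM encoding):

```python
# Illustrative second-layer entry for one first-level storage sub-module.
# Each inner dict's values play the role of next-level hash map pointers.

para_hashmap = {"Para0": 0.25}     # key: parameter name, value: parameter value
slice_hashmap = {0: para_hashmap}  # key: Slice ID -> Para HashMap pointer
shard_hashmap = {1: slice_hashmap} # key: Shard ID -> Slice HashMap pointer

storage_header = {
    "storage_id": "Model1/Storage0",
    "version": 2,                  # model parameter version information
    "next_level": shard_hashmap,   # Shard HashMap Pointer
}

# Following the chain reproduces the linking rule: each level's value is the
# pointer to the next level's hash map, until the parameter value is reached.
value = storage_header["next_level"][1][0]["Para0"]
```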
  • the storage system 1000 may further include a service device (not shown), and the service device may query a parameter of a specified model of the at least one model from the non-volatile memory-based parameter server cluster, And provide the corresponding model estimation service based on the queried parameters of the specified model.
  • the service device can determine, according to the parameter name of a parameter of the specified model, the parameter server node ID where the parameter is stored and the ID of each level of storage sub-module; it can then find the parameter server node according to the determined parameter server node ID, and find the parameter value corresponding to the parameter name based on the second-layer data on the found parameter server node, according to the IDs of each level of storage sub-module.
  • writing parameters only occurs when a new model comes online.
  • the process of writing parameters is similar to the process of querying parameters, with two differences.
  • the storage device 1002 can insert all new model parameters into a brand new storage pool. After all the parameters of the new model are uploaded to the parameter server cluster, the parameter query service will be switched to the new parameters. At the same time, the backend thread will slowly reclaim the space of the parameters of the old model.
  • the second difference is that when a new Para0 is written into the HashMap under Storage 0 new, the storage device 1002 can use CPU instructions (such as clflushopt+mfence) to ensure that each write of parameter data actually reaches the NVM rather than remaining in the volatile CPU cache.
  • FIG. 11 is a block diagram illustrating a recovery system of a parameter server node according to an exemplary embodiment of the present disclosure.
  • the recovery system of a parameter server node according to an exemplary embodiment of the present disclosure is adapted to the situation that the parameter server node described above with reference to FIG. 3 needs to be restarted due to a failure or the like.
  • a recovery system 1100 of a parameter server node may include a first acquisition device 1101 , a second acquisition device 1102 , and a recovery device 1103 .
  • the first obtaining means 1101 may obtain the stored first-layer data from the non-volatile memory of the parameter server node.
  • the first-layer data may include parameter server node header information, which is used to query the node information and the first-level storage sub-module information of the parameter server node.
  • the parameter server node header information may include a parameter server node ID and a list of first-level storage submodule IDs.
  • the first obtaining device 1101 can determine which parameter server node in the parameter server cluster the restarted parameter server node is according to the parameter server node ID in the parameter server node header information, and can determine all first-level storage sub-modules included in the restarted parameter server node according to the first-level storage sub-module ID list in the header information.
  • the first-layer data may be stored at a fixed location in the non-volatile memory of the parameter server node, so that the first obtaining means 1101 can read the first-layer data from that fixed location. In this way, after the parameter server node is restarted, the storage information of the parameter server node can be quickly queried from the node itself.
  • the first-level storage submodule IDs included in the first-level storage submodule ID list may be composed of a model ID and a first-level storage submodule ID in a corresponding model.
  • the second obtaining means 1102 may obtain, based on the first-layer data, the header information of each first-level storage sub-module in the stored second-layer data from the non-volatile memory of the parameter server node.
  • the second-level data can include M first-level storage sub-module header information and M first-level storage sub-module hash mapping information, and the second-level data is used to query each first-level storage of the parameter server node. Model parameters stored in the sub-module, where M is the number of first-level storage sub-modules stored on the parameter server node.
  • the first-level storage sub-module header information may include the first-level storage sub-module ID, version information of model parameters stored by the first-level storage sub-module, and the first-level storage sub-module The module's pointer to the next level hash map.
  • the second acquiring means 1102 may determine all first-level storage sub-module IDs of the parameter server node based on the first-level storage sub-module ID list, and then, for each of these IDs, acquire the header information of the first-level storage sub-module with that ID. In this way, the second acquiring means 1102 can obtain the header information of all (M) first-level storage sub-modules stored on the parameter server node.
  • all (M) first-level storage sub-modules also have respective first-level storage sub-module hash map information.
  • Each first-level storage sub-module may be associated with its first-level storage sub-module hash map information based on a pointer to the next-level hash map in its first-level storage sub-module header information.
  • the hash map information of each first-level storage sub-module may include the hash maps of each level of storage sub-module from the second level to the Nth level under the first-level storage sub-module, and the model parameter hash maps under each Nth-level storage sub-module, where N is the number of levels of storage sub-modules into which the model parameters are divided.
  • the key of the hash map of each level of storage sub-module is the ID of the level of storage sub-module, and the value is a pointer to the next level of hash map.
  • the key of the model parameter hash map is the parameter name and the value is the parameter value.
  • the hash map of each level of storage sub-module and the model parameter hash map are linked through the pointer to the next-level hash map held by the storage sub-module one level above, for example, as shown in FIG. 5 to FIG. 7.
  • the restoration device 1103 may traverse the hash map information of each first-level storage sub-module in the second-level data based on the header information of each first-level storage sub-module, so as to restore the model parameters stored on the parameter server node.
  • the restoring device 1103 may perform the following operation for each first-level storage submodule: search for the next-level hash through the corresponding pointer to the next-level hash map of the previous-level storage submodule Map until the model parameter hash map is searched; based on the model parameter hash map, restore the model parameters.
  • as an example, in the case where the model parameters are divided into two levels of storage sub-modules, the hash map information of each first-level storage sub-module includes the hash map of each second-level storage sub-module under the first-level storage sub-module (Shard HashMap) and the model parameter hash map under each second-level storage sub-module. The pointer to the next-level hash map in the header information of the first-level storage sub-module is the pointer to the hash map of the second-level storage sub-modules (Shard HashMap Pointer); the key of the hash map of each second-level storage sub-module is the second-level storage sub-module ID (Shard ID), and the value is the pointer (Para HashMap Pointer) to the model parameter hash map under that second-level storage sub-module. When performing recovery, for each first-level storage sub-module, the restoration device 1103 can find the hash map of the second-level storage sub-modules (Shard HashMap) through the Shard HashMap Pointer in the header information of the first-level storage sub-module.
  • the restoration device 1103 may check whether the hash map data of the second-level storage submodule is complete, and restore all the data of the hash map of the second-level storage submodule.
  • the restoration device 1103 may search for the model parameter hash map under all the second-level storage sub-modules according to the pointers to the model parameter hash map of all the second-level storage sub-modules in the hash map of the second-level storage sub-module , thereby restoring the model parameters under all second-level storage submodules.
  • as another example, in the case where the model parameters are divided into one level of storage sub-modules, the hash map information of each first-level storage sub-module includes the model parameter hash map (Para HashMap) under the first-level storage sub-module, and the pointer to the next-level hash map in the header information of the first-level storage sub-module is the pointer to the model parameter hash map (Para HashMap Pointer). When performing recovery, the restoration device 1103 can find the model parameter hash map (Para HashMap) through the Para HashMap Pointer in the header information of each first-level storage sub-module, thereby restoring the model parameters under each first-level storage sub-module.
  • as another example, in the case where the model parameters are divided into three levels of storage sub-modules, the hash map information of each first-level storage sub-module includes the hash map of each second-level storage sub-module under the first-level storage sub-module (Shard HashMap), the hash map of each third-level storage sub-module under each second-level storage sub-module (Slice HashMap), and the model parameter hash map (Para HashMap) under each third-level storage sub-module. The pointer to the next-level hash map in the header information of the first-level storage sub-module is the pointer to the second-level storage sub-module hash map (Shard HashMap Pointer); the pointer to the next-level hash map in the second-level storage sub-module hash map is the pointer to the third-level storage sub-module hash map (Slice HashMap Pointer); and the pointer to the next-level hash map in the third-level storage sub-module hash map is the pointer to the model parameter hash map (Para HashMap Pointer). When performing recovery, the restoration device 1103 can find the Shard HashMap through the Shard HashMap Pointer in the header information of the first-level storage sub-module.
  • the restoration device 1103 may check whether the hash map data of the second-level storage submodule is complete, and restore all the data of the hash map of the second-level storage submodule.
  • the restoration device 1103 may search, according to the pointers to the third-level storage sub-module hash maps (Slice HashMap Pointers) in the hash map of the second-level storage sub-modules, the hash map of each third-level storage sub-module under all second-level storage sub-modules. For this, the restoration device 1103 can check whether the hash map data of all the third-level storage sub-modules is complete, and restore all the data of the hash maps of all the third-level storage sub-modules.
  • the restoration device 1103 may search the model parameter hash map according to the pointer to the model parameter hash map in the hash map of each third-level storage sub-module, thereby restoring the model parameters under all third-level storage sub-modules.
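  • A hedged sketch of this three-level recovery pass: walk the Shard, Slice, and Para maps in turn, checking completeness at each level before restoring. The `is_complete` check is a placeholder for whatever integrity validation the implementation actually performs on the persisted maps:

```python
def is_complete(hashmap):
    # Placeholder integrity check; a real implementation would validate the
    # persisted hash map on NVM (e.g. via checksums or allocation metadata).
    return hashmap is not None

def recover_three_level(shard_hashmap):
    """Restore all parameters under one first-level storage sub-module."""
    assert is_complete(shard_hashmap)                # check the Shard HashMap
    recovered = {}
    for slice_hashmap in shard_hashmap.values():     # Slice HashMap pointers
        assert is_complete(slice_hashmap)            # check each Slice HashMap
        for para_hashmap in slice_hashmap.values():  # Para HashMap pointers
            recovered.update(para_hashmap)           # restore the parameters
    return recovered

params = recover_three_level({
    0: {0: {"Para0": 1.0}, 1: {"Para1": 2.0}},  # Shard 0 -> Slices 0 and 1
    1: {0: {"Para2": 3.0}},                     # Shard 1 -> Slice 0
})
```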
  • the model parameters can be divided into any number of levels of storage sub-modules, and in any such case the recovery process can be performed by analogy, according to the logic of the above examples, to restore the model parameters.
  • it is possible that the model parameters stored in the parameter server node are out of date. Therefore, when the second acquiring means 1102 acquires the model parameter version information in the header information of the first-level storage sub-modules in the second-layer data, the restoring means 1103 can determine, based on that version information, whether the model parameters stored in the parameter server node are the latest version, and decide whether to perform recovery according to the determination result.
  • the restoring means 1103 may compare the version information in the first-level storage sub-module header information with the latest version information of the first-level storage sub-module stored on the parameter server cluster metadata node.
  • the parameter server cluster metadata node is a server that stores global metadata information of the parameter server cluster, such as ZooKeeper. In the parameter server cluster, the latest version numbers corresponding to all the first-level storage sub-modules of each model can be recorded in the metadata node.
  • when the model parameters of a first-level storage sub-module are updated, the version of that first-level storage sub-module is incremented by 1 on the metadata node. Every time a parameter server node restarts, the latest version of the current first-level storage sub-module can be checked.
  • the version number stored on the metadata node is newer than the version number on the restarted parameter server node, it means that the model parameters on the restarted parameter server node are out of date and there is no need to restore them.
  • if the version number stored on the metadata node is the same as the version number on the restarted parameter server node, the model parameters on the restarted parameter server node are the latest version and can be restored.
  • in that case, the restoring means 1103 can execute the restoration.
  • otherwise, the recovery device 1103 does not perform recovery; instead, the data is restored from the back-end storage system (for example, the latest version of the model parameters is pulled from HDFS) and inserted into the parameter server node.
  • the back-end storage system stores the latest version of the model parameters.
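The version-comparison logic above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name and dictionary representation of the metadata-node version table are invented for the example.

```python
def plan_recovery(local_versions, metadata_versions):
    """Decide, per first-level storage sub-module, whether the copy kept in
    non-volatile memory can be restored in place or must be re-pulled from
    back-end storage (e.g. HDFS).

    local_versions / metadata_versions: dicts mapping sub-module id -> version
    number (metadata_versions is what the cluster metadata node records).
    Returns (restore_in_place, repull) lists of sub-module ids."""
    restore, repull = [], []
    for sub_id, latest in metadata_versions.items():
        local = local_versions.get(sub_id)
        if local == latest:
            restore.append(sub_id)   # local copy is the latest version: restore it
        else:
            repull.append(sub_id)    # outdated or missing: pull latest from back end
    return restore, repull
```

For instance, a sub-module whose local version matches the metadata node is restored directly from non-volatile memory, while one that was updated on another replica during the downtime is re-fetched.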
  • a parameter server cluster includes a plurality of non-volatile memory-based parameter server nodes and is used for distributed storage of the model parameters of at least one model.
  • each non-volatile memory-based parameter server node included in the parameter server cluster may have the model parameter logical storage structure described above with reference to FIG. 4 and the data physical storage structure described above with reference to FIGS. 5-7, and may be recovered according to the recovery method described in FIG. 8 or FIG. 9 when it is restarted.
  • the nonvolatile memory may include at least one of STT-RAM, PCM, ReRAM, and 3DxPoint.
  • PMEM (a 3D XPoint product) can be used to implement a PMEM-based parameter server node.
  • the model parameters are stored on a parameter server based on non-volatile memory instead of a parameter server based on DRAM, which greatly reduces the hardware cost.
  • for the non-volatile memory-based parameter server, a model parameter logical storage structure and a data physical storage structure are designed: the model parameters are stored hierarchically in logic, and the data is stored in two layers on the non-volatile memory, meeting the high-concurrency and high-availability requirements of the parameter server.
  • a rapid recovery process after restart is designed. Based on the two-layer storage data structure, all parameters stored on a parameter server node can be easily and quickly queried and restored, achieving millisecond-level recovery.
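The two-layer recovery flow described above can be illustrated with a small sketch. All structures here are invented in-memory stand-ins for the on-NVM layout (the patent does not disclose a concrete binary format): the first-layer data is reduced to a node header carrying the sub-module count M, and the second-layer data to a list of (header, hash map) pairs.

```python
def recover_node(first_layer, second_layer):
    """first_layer: node header dict containing the first-level sub-module
    count 'M'. second_layer: list of (header, hash_map) pairs, one per
    first-level storage sub-module. Returns the recovered
    {(submodule_id, key): parameter} mapping."""
    recovered = {}
    m = first_layer["M"]                 # number of first-level sub-modules on this node
    for header, hash_map in second_layer[:m]:
        # The header identifies the sub-module; traversing its hash map
        # yields every model parameter stored in that sub-module.
        for key, value in hash_map.items():
            recovered[(header["submodule_id"], key)] = value
    return recovered
```

Because the headers and hash maps already reside in non-volatile memory, this traversal is the entire recovery step; no data is fetched from back-end storage.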
  • Each device in the model parameter storage system shown in FIG. 10 and the parameter server node recovery system shown in FIG. 11 may be configured as software, hardware, firmware, or any combination of the foregoing to perform specific functions.
  • each device may correspond to a dedicated integrated circuit, may also correspond to pure software code, or may correspond to a module combining software and hardware.
  • one or more functions implemented by each apparatus may also be performed uniformly by components in a physical entity device (eg, a processor, a client or a server, etc.).
  • the method of storing model parameters described with reference to FIG. 3 and the method of restoring parameter server nodes described with reference to FIGS. 8 and 9 may be implemented by programs (or instructions) recorded on a computer-readable storage medium.
  • a computer-readable storage medium storing instructions may be provided that, when executed by at least one computing device, cause the at least one computing device to execute the method of storing model parameters or the method of restoring a parameter server node according to the present disclosure.
  • the computer program in the above-mentioned computer-readable storage medium can run in an environment deployed on computer equipment such as a client, a host, an agent device, or a server. It should be noted that the computer program can also be used to perform additional steps beyond those described above, or to perform more specific processing when executing the above steps; the contents of these additional steps and further processing have been mentioned in the description of the related methods with reference to FIGS. 3, 8 and 9, and are not repeated here.
  • each device in the model parameter storage system and the parameter server node recovery system can rely entirely on the running of a computer program to realize the corresponding function; that is, each device corresponds to a step in the functional architecture of the computer program, so that the entire system is invoked through a dedicated software package (e.g., a lib library) to implement the corresponding functions.
  • each device in FIG. 10 and FIG. 11 can also be implemented by hardware, software, firmware, middleware, microcode or any combination thereof.
  • program codes or code segments for performing the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that a processor can perform the corresponding operations by reading and executing the corresponding codes or code segments.
  • exemplary embodiments of the present disclosure may also be implemented as a computing device including a storage component and a processor, the storage component storing a set of computer-executable instructions that, when executed by the processor, performs the method of storing model parameters or the method of restoring a parameter server node according to an exemplary embodiment of the present disclosure.
  • the computing device may be deployed in a server or a client, or may be deployed on a node device in a distributed network environment.
  • the computing device may be a PC computer, a tablet device, a personal digital assistant, a smartphone, a web application, or other device capable of executing the set of instructions described above.
  • the computing device does not have to be a single computing device, but can also be any set of devices or circuits capable of individually or jointly executing the above-mentioned instructions (or instruction sets).
  • the computing device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (eg, via wireless transmission).
  • a processor may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor.
  • processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
  • some operations of the method for storing model parameters or the method for restoring a parameter server node according to the exemplary embodiments of the present disclosure may be implemented by software, some by hardware, and others by a combination of software and hardware.
  • the processor may execute instructions or code stored in one of the storage components, which may also store data. Instructions and data may also be sent and received over a network via a network interface device, which may employ any known transport protocol.
  • the memory component may be integrated with the processor, eg, RAM or flash memory arranged within an integrated circuit microprocessor or the like. Additionally, the storage components may include separate devices, such as external disk drives, storage arrays, or any other storage device that may be used by a database system. The storage component and the processor may be operatively coupled, or may communicate with each other, eg, through I/O ports, network connections, etc., to enable the processor to read files stored in the storage component.
  • the computing device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the computing device may be connected to each other via a bus and/or network.
  • the method for storing model parameters or the method for restoring parameter server nodes may be described in terms of various interconnected or coupled functional blocks or functional diagrams. However, these functional blocks or functional diagrams may equally be integrated into a single logic device or operate according to imprecise boundaries.
  • the method for storing model parameters described with reference to FIG. 3 or the method for restoring parameter server nodes described with reference to FIG. 8 or 9 may be implemented by a system including at least one computing device and at least one storage device storing instructions.
  • the at least one computing device is a computing device for performing the method for storing model parameters or the method for restoring a parameter server node according to an exemplary embodiment of the present disclosure, and the storage device stores a set of computer-executable instructions that, when executed by the at least one computing device, performs the method for storing model parameters or the method for restoring a parameter server node according to the present disclosure.
  • an electronic device comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the parameter server node recovery method according to the present disclosure.
  • the model parameters are stored on a parameter server based on non-volatile memory instead of a parameter server based on DRAM, which greatly reduces the hardware cost.
  • for the non-volatile memory-based parameter server, a model parameter logical storage structure and a data physical storage structure are designed: the model parameters are stored hierarchically in logic, and the data is stored in two layers on the non-volatile memory, meeting the high-concurrency and high-availability requirements of the parameter server.
  • a rapid recovery process after restart is designed. Based on the two-layer storage data structure, all parameters stored on a parameter server node can be easily and quickly queried and restored, achieving millisecond-level recovery.


Abstract

Provided are a parameter server node recovery method and recovery system. The recovery method comprises: after a parameter server node restarts, acquiring first-layer data from a non-volatile memory of the parameter server node, wherein the first-layer data comprises parameter server node header information and is used to query node information of the parameter server node and first-level storage sub-module information; on the basis of the first-layer data, acquiring each piece of first-level storage sub-module header information in second-layer data from the non-volatile memory of the parameter server node, wherein the second-layer data comprises M pieces of first-level storage sub-module header information and M pieces of first-level storage sub-module hash map information, and is used to query the model parameters stored in each first-level storage sub-module of the parameter server node; and, on the basis of each piece of first-level storage sub-module header information, traversing each piece of first-level storage sub-module hash map information in the second-layer data to recover the model parameters on the parameter server node.

Description

Parameter server node recovery method and recovery system
This application claims priority to Chinese patent application No. 202011187099.2, filed on October 29, 2020 and entitled "Recovery Method and Recovery System for Parameter Server Nodes", the disclosure of which is incorporated herein by reference.
Technical field
The present disclosure relates to the field of computer technology, and more particularly, to a recovery method and a recovery system for a parameter server node.
Background
Machine learning models have been deployed online in industry. Because online services require very high real-time performance, model parameters need to be stored in high-speed DRAM (Dynamic Random Access Memory). However, the total number of parameters of an industrial-grade machine learning model is huge and exceeds the storage capacity of a single machine, so parameter server clusters based on DRAM are often deployed to provide highly concurrent parameter query support for online prediction services. A parameter server is a programming framework that facilitates the writing of distributed parallel programs, with an emphasis on support for distributed storage of, and collaboration on, large-scale parameters. An online parameter server is mainly used to store trained super-large-scale parameters and to provide high-concurrency, high-availability model parameter query services for online services. However, the traditional DRAM-based parameter server has two problems in deployment: first, the huge memory consumption increases the overall hardware cost; second, when any node in the parameter server cluster fails and goes offline, the recovery process is very time-consuming.
SUMMARY OF THE INVENTION
Exemplary embodiments of the present disclosure may at least partially solve the above-mentioned problems.
According to a first aspect of the present disclosure, there is provided a system comprising at least one computing device and at least one storage device storing instructions which, when executed by the at least one computing device, cause the at least one computing device to execute a method of restoring a parameter server node, wherein the parameter server node is a node in a parameter server cluster and is based on non-volatile memory, and the model parameters of each model stored in the non-volatile memory of the parameter server node are logically divided into at least one level of storage sub-modules. The recovery method includes: after the parameter server node restarts, obtaining the stored first-layer data from the non-volatile memory of the parameter server node, wherein the first-layer data includes parameter server node header information and is used to query the node information of the parameter server node and first-level storage sub-module information; based on the first-layer data, obtaining each piece of first-level storage sub-module header information in the stored second-layer data from the non-volatile memory of the parameter server node, wherein the second-layer data includes M pieces of first-level storage sub-module header information and M pieces of first-level storage sub-module hash map information and is used to query the model parameters stored in each first-level storage sub-module of the parameter server node, M being the number of first-level storage sub-modules stored on the parameter server node; and, based on each piece of first-level storage sub-module header information, traversing each piece of first-level storage sub-module hash map information in the second-layer data to restore the model parameters stored on the parameter server node.
According to a second aspect of the present disclosure, there is provided a method of restoring a parameter server node, wherein the parameter server node is a node in a parameter server cluster and is based on non-volatile memory, and the model parameters of each model stored in the non-volatile memory of the parameter server node are logically divided into at least one level of storage sub-modules. The recovery method includes: after the parameter server node restarts, obtaining the stored first-layer data from the non-volatile memory of the parameter server node, wherein the first-layer data includes parameter server node header information and is used to query the node information of the parameter server node and first-level storage sub-module information; based on the first-layer data, obtaining each piece of first-level storage sub-module header information in the stored second-layer data from the non-volatile memory of the parameter server node, wherein the second-layer data includes M pieces of first-level storage sub-module header information and M pieces of first-level storage sub-module hash map information and is used to query the model parameters stored in each first-level storage sub-module of the parameter server node, M being the number of first-level storage sub-modules stored on the parameter server node; and, based on each piece of first-level storage sub-module header information, traversing each piece of first-level storage sub-module hash map information in the second-layer data to restore the model parameters stored on the parameter server node.
According to a third aspect of the present disclosure, there is provided a recovery system for a parameter server node, wherein the parameter server node is a node in a parameter server cluster and is based on non-volatile memory, and the model parameters of each model stored in the non-volatile memory of the parameter server node are logically divided into at least one level of storage sub-modules. The recovery system includes: a first acquiring device configured to, after the parameter server node restarts, obtain the stored first-layer data from the non-volatile memory of the parameter server node, wherein the first-layer data includes parameter server node header information and is used to query the node information of the parameter server node and first-level storage sub-module information; a second acquiring device configured to, based on the first-layer data, obtain each piece of first-level storage sub-module header information in the stored second-layer data from the non-volatile memory of the parameter server node, wherein the second-layer data includes M pieces of first-level storage sub-module header information and M pieces of first-level storage sub-module hash map information and is used to query the model parameters stored in each first-level storage sub-module of the parameter server node, M being the number of first-level storage sub-modules stored on the parameter server node; and a recovery device configured to, based on each piece of first-level storage sub-module header information, traverse each piece of first-level storage sub-module hash map information in the second-layer data to restore the model parameters stored on the parameter server node.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium storing instructions which, when executed by at least one computing device, cause the at least one computing device to perform the parameter server node recovery method according to the present disclosure.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the parameter server node recovery method of the present disclosure.
According to the parameter server node recovery method and system of the present disclosure, the model parameters are stored on a parameter server based on non-volatile memory instead of a parameter server based on DRAM, which greatly reduces the hardware cost. In addition, for the non-volatile memory-based parameter server, a model parameter logical storage structure and a data physical storage structure are designed: the model parameters are stored hierarchically in logic, and the data is stored in two layers on the non-volatile memory, meeting the high-concurrency and high-availability requirements of the parameter server. A rapid recovery process after restart is also designed: based on the two-layer storage data structure, all parameters stored on a parameter server node can be easily and quickly queried and restored, achieving millisecond-level recovery.
Description of drawings

These and/or other aspects and advantages of the present disclosure will become apparent and more readily understood from the following description of embodiments, taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a schematic diagram illustrating a conventional parameter server cluster architecture.

FIG. 2 is a schematic diagram illustrating a node failure in a conventional parameter server cluster.

FIG. 3 is a flowchart illustrating a method of storing model parameters according to an exemplary embodiment of the present disclosure.

FIG. 4 is a schematic diagram illustrating a logical storage structure of model parameters according to an exemplary embodiment of the present disclosure.

FIG. 5 is a schematic diagram illustrating a data physical storage structure of a parameter server node according to an exemplary embodiment of the present disclosure.

FIG. 6 is a schematic diagram illustrating a data physical storage structure of a parameter server node according to another exemplary embodiment of the present disclosure.

FIG. 7 is a schematic diagram illustrating a data physical storage structure of a parameter server node according to another exemplary embodiment of the present disclosure.

FIG. 8 is a flowchart illustrating a recovery method of a parameter server node according to an exemplary embodiment of the present disclosure.

FIG. 9 is a flowchart illustrating a recovery method of a parameter server node according to an exemplary embodiment of the present disclosure.

FIG. 10 is a block diagram illustrating a storage system of model parameters according to an exemplary embodiment of the present disclosure.

FIG. 11 is a block diagram illustrating a recovery system of a parameter server node according to an exemplary embodiment of the present disclosure.
Detailed description
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of embodiments of the present disclosure as defined by the claims and their equivalents. Various specific details are included to aid understanding, but these are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
It should be noted here that, in the present disclosure, "at least one of several items" covers three parallel cases: "any one of the several items", "a combination of any multiple of the several items", and "all of the several items". Likewise, "and/or" denotes at least one of the two or more items it joins. For example, "including at least one of A and B" and "including A and/or B" cover the following three parallel cases: (1) including A; (2) including B; (3) including A and B. As another example, "performing at least one of step one and step two" and "performing step one and/or step two" cover: (1) performing step one; (2) performing step two; (3) performing step one and step two. That is to say, "A and/or B" can also be expressed as "at least one of A and B", and "performing step one and/or step two" can also be expressed as "performing at least one of step one and step two".
The era of big data and AI brings unprecedented challenges and opportunities to many applications, such as recommendation systems and credit card anti-fraud. Such AI applications have two characteristics. The first is the extremely high dimensionality of the models: each dimension of a model corresponds to one parameter, and a model capable of handling massive data can reach hundreds of millions or even billions of dimensions. This means that hundreds of millions or even billions of parameters need to be stored for each model, so an online parameter server is needed to store these parameters. The second characteristic is real-time performance. Taking anti-fraud as an example, the interval from the generation of the user's card-swiping behavior to the model's final prediction often needs to be within a few milliseconds. This high real-time requirement dictates that the massive model parameters be stored in high-speed DRAM. In addition, since a single machine in the parameter server cluster may be a single point of failure while online services must run 24/7 without interruption, multiple redundant copies of the same data are needed to guarantee data safety. FIG. 1 is a schematic diagram illustrating a conventional parameter server cluster architecture. As shown in FIG. 1, three replicas of the storage sub-module Storage 3 of Model 1 can be prepared and stored on three server nodes (PS Node 1, PS Node 2, PS Node 3) respectively, which ensures that the online service can continue even after one or two of the servers fail and go offline. Another benefit of multiple replicas is that they share the access load, thereby increasing the access bandwidth of the parameter server as a whole.
However, traditional DRAM-based parameter servers have two pain points during deployment. The first is the increase in overall hardware cost caused by huge memory consumption. The capacity of a single DRAM module is limited, typically 16 GB or 32 GB, so a parameter server needs a cluster of multiple machines to hold a model's parameters, and the multi-replica mechanism mentioned above further increases the memory overhead and cost. The second pain point is shared by all DRAM-based memory systems: when any node in the parameter server cluster fails and goes offline, the node must be restarted and the parameter data must be extracted from HDFS (Hadoop Distributed File System) and re-stored into the new node's DRAM. This recovery process is very time-consuming. Specifically, in a traditional DRAM-based parameter server, all parameters are stored in DRAM and backed up in a slow back-end storage system (such as HDFS). When a DRAM-based parameter server crashes and restarts, all data in DRAM is lost, so node recovery requires: (1) reading all parameters from the slow-storage HDFS disks; (2) transferring these parameters over the network to the restarted parameter server node; and (3) rebuilding the data structures in DRAM and inserting all parameters into an in-DRAM hash map (HashMap) data structure. Each of these steps is very time-consuming, especially when the parameter dimensionality reaches hundreds of millions or even trillions. A long recovery time is a distinct disadvantage for a parameter server cluster. FIG. 2 is a schematic diagram illustrating a node failure in a conventional parameter server cluster. As shown in FIG. 2, the parameter server uses multiple replicas to share the system's access load, which means that while a node (PS Node 2) is offline and recovering, the overall throughput of the system drops, and the longer the recovery takes, the longer the overall throughput remains degraded.
To solve the above problems of existing parameter server clusters, the present disclosure proposes a novel model parameter storage method and a novel parameter server node recovery method. Specifically, the present disclosure proposes replacing DRAM with non-volatile memory (NVM) to implement a method of storing model parameters on NVM-based parameter server nodes; the non-volatility makes fast recovery of a parameter server node possible. This is because, on an NVM-based parameter server node, no parameters are lost across a restart; they remain in the non-volatile memory, so all three recovery steps described above can be skipped. Moreover, when data is written to the non-volatile memory, CPU instructions are used to guarantee that the parameters are durably persisted in NVM. During recovery, therefore, only a simple data-structure check and/or version check on the non-volatile memory is needed before the node can come back online and serve requests. Here, NVM refers generally to memory that retains data across a power loss. New-generation NVM (e.g., STT-RAM, PCM, ReRAM, 3D XPoint) can be used. Compared with first-generation NVM (e.g., Flash memory), its advantages are: (1) greatly improved performance, much closer to the DRAM used as computer main memory; and (2) freedom from the physical constraint that every access must be a multiple of a fixed number of bytes (Flash is typically addressed in 4 KB, i.e., 4*1024-byte, units); new-generation NVM is byte-addressable. For example, persistent memory (Persistent Memory, PMEM) is a 3D XPoint product that, compared with traditional DRAM, offers larger capacity at lower cost, matching the high memory demands of parameter server clusters. One implementation is therefore to store model parameters distributed across a parameter server cluster composed of PMEM-based parameter server nodes.
To realize the above method, the present disclosure proposes a storage structure for model parameters in the non-volatile memory of an NVM-based parameter server node. Specifically, the model parameters corresponding to each model stored on each parameter server node may be logically divided into at least one level of storage sub-modules. When the model parameters are logically divided into two or more levels of storage sub-modules, each upper-level storage sub-module contains at least one storage sub-module of the adjacent lower level. Storing model parameters hierarchically facilitates their management and querying, supports requirements such as highly concurrent access, and makes it easier to traverse and query the model parameters when a parameter server node is recovered. In addition, the present disclosure proposes a new form of data organization. Specifically, in the non-volatile memory of an NVM-based parameter server node, the data is divided into two layers of storage. The first-layer data may record the ID information of the individual parameter server node and the ID information of all first-level storage sub-modules on that node. The second-layer data may use a hash map (HashMap) per first-level storage sub-module to store the concrete model parameters; the second-layer data also contains first-level storage sub-module header information used for lookups during fast recovery. Furthermore, the present disclosure proposes a new recovery procedure: when a parameter server node restarts, it can use the first-layer and second-layer data stored in the non-volatile memory to rapidly traverse the storage sub-modules at every level stored on the node, thereby quickly restoring all model parameters stored on the node and achieving millisecond-level recovery of an NVM-based parameter server node. The model parameter storage method and system and the parameter server node recovery method and system according to the present disclosure are described in detail below with reference to FIGS. 3 to 11.
FIG. 3 is a flowchart illustrating a model parameter storage method according to an exemplary embodiment of the present disclosure.
Referring to FIG. 3, in step 301, model parameters of at least one model may be acquired. The model may be a high-dimensional AI model, for example a recommendation model or a credit-card anti-fraud model. Massive historical data can be fed into an offline training system to train the AI model, and the trained AI model can then be deployed to an online inference system for use. Each dimension of an AI model corresponds to one parameter, and an AI model capable of processing massive data can reach hundreds of millions or even billions of dimensions, so each model may store hundreds of millions or even billions of parameters. The model parameters here may be parameters before or during training, or parameters after training has completed.
In step 302, the model parameters of the at least one model may be stored in a parameter server cluster, where the parameter server cluster includes a plurality of parameter server nodes based on non-volatile memory. Here, the parameter server cluster may be an offline parameter server cluster used for training or an online parameter server cluster used for inference.
According to an exemplary embodiment of the present disclosure, the model parameters of each of the at least one model may be stored in a distributed manner in the non-volatile memory of the plurality of NVM-based parameter server nodes; on each NVM-based parameter server node, the model parameters corresponding to each model stored there are logically divided into at least one level of storage sub-modules for storage. For example, the model parameters of each model may be logically divided into multiple storage modules (e.g., the first-level storage sub-modules described below), and the resulting modules distributed across the non-volatile memory of the plurality of nodes. In addition, within the non-volatile memory of each node, each storage module may be further divided into at least one more level of storage sub-modules (e.g., the second-level and third-level storage sub-modules described below).
FIG. 4 is a schematic diagram illustrating a logical storage structure of model parameters according to an exemplary embodiment of the present disclosure. FIG. 4 shows, by way of example, only the model parameter storage structure on one parameter server node (PS Node X) of the cluster; other parameter server nodes may store model parameters in the same or a similar manner. Each model can be logically divided into multiple storage sub-modules (e.g., Storage), and these storage sub-modules can be assigned to multiple parameter server nodes by hashing. The storage sub-modules held by one node may belong to different models; for example, in FIG. 4, Storage 1 and Storage 3 belong to Model 1 while Storage 24 belongs to Model 2. A Storage sub-module can in turn be divided into multiple next-level storage sub-modules (e.g., Shard), each of which stores concrete model parameters as "parameter name" + "parameter value" key-value pairs; each Shard holds only a portion of the model's parameters, spreading the load. FIG. 4 shows model parameters logically divided into two levels of storage sub-modules (Storage and Shard), but the present disclosure is not limited to this. According to the present disclosure, the model parameters may be logically divided into any number of levels of storage sub-modules. For example, they may be divided into only one level (e.g., Storage), with the concrete parameters stored directly under each Storage; or into three levels, where, after Storage and Shard, each Shard is further divided into multiple third-level storage sub-modules, each of which stores concrete model parameters; and so on.
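As a minimal sketch of the hash-based assignment just described (the storage counts and the modulo placement rule are assumptions for illustration, not prescribed by the disclosure), storages of several models can be spread over nodes like this:

```python
# Hypothetical sketch: assigning each model's Storage sub-modules to
# cluster nodes by hashing the storage ID.
NUM_NODES = 3  # assumed cluster size

def node_for(storage_id: int) -> int:
    """Map a storage to a node by simple modulo hashing."""
    return storage_id % NUM_NODES

# Assumed: Model 1 has storages 0..4, Model 2 has storages 0..2.
placement = {}
for model_id, num_storages in [(1, 5), (2, 3)]:
    for sid in range(num_storages):
        placement.setdefault(node_for(sid), []).append((model_id, sid))

# One node ends up holding storages of different models, matching the
# pattern in FIG. 4 where PS Node X holds Model 1 and Model 2 storages.
assert placement[0] == [(1, 0), (1, 3), (2, 0)]
```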
Referring back to FIG. 3, according to an exemplary embodiment of the present disclosure, first-layer data and second-layer data may be stored in the non-volatile memory of each NVM-based parameter server node.
For example, FIG. 5 is a schematic diagram illustrating the physical data storage structure of a parameter server node according to an exemplary embodiment of the present disclosure. Referring to FIG. 5, the non-volatile memory of a parameter server node is divided into two layers of data storage (the first layer and the second layer shown in FIG. 5). For a given parameter server node, the first-layer data may contain one store holding the information of the current node, and the second-layer data may contain M stores holding the concrete information of the M first-level storage sub-modules, respectively. The first-layer data and second-layer data according to an exemplary embodiment of the present disclosure are described in detail below.
According to an exemplary embodiment of the present disclosure, the first-layer data may store parameter server node header information, used to query the node information and the first-level storage sub-module information of that parameter server node.
The node information may be the ID information (Node ID) of the parameter server node. The parameter server cluster internally assigns each parameter server a unique ID, so obtaining a node's ID identifies which parameter server node in the cluster it is.
In addition, the first-level storage sub-module information may be a list (Storage ID List) of the IDs of all first-level storage sub-modules (Storage) held on the parameter server node. According to an exemplary embodiment of the present disclosure, storage sub-module IDs (e.g., Storage ID, Shard ID) are numbered per model; for example, Model 1 has its own S1R1 (Storage 1, Shard 1), S1R2, S2R1, S2R2, ..., and Model 2 likewise has its own S1R1, S1R2, S2R1, S2R2, .... Therefore, to distinguish storage sub-modules across models, each first-level storage sub-module ID in the list may be composed of a model ID plus the first-level storage sub-module ID within that model; for example, the list may include Model 1 Storage 0, Model 1 Storage 3, Model 2 Storage 1, .... In addition, the first-level storage sub-module ID list may be stored in the first-layer data as a persistent list, which guarantees that all model parameters can be inserted into the non-volatile memory.
According to an exemplary embodiment of the present disclosure, the first-layer data (i.e., the parameter server node header information) serves as the root of all storage on the NVM-based parameter server node and may be stored at a fixed location in the node's non-volatile memory, so that it can be queried conveniently when the parameter server node performs fast recovery.
According to an exemplary embodiment of the present disclosure, the second-layer data may store M pieces of first-level storage sub-module header information and M pieces of first-level storage sub-module hash map (HashMap) information. The second-layer data can be used to query the model parameters stored in each first-level storage sub-module of the parameter server node. For example, for each first-level storage sub-module, a dedicated non-volatile memory storage pool may be newly created in the second-layer data for separate storage.
According to an exemplary embodiment of the present disclosure, the first-level storage sub-module header (Storage Head) may be allocated three fields, holding respectively the first-level storage sub-module ID (Storage ID), the version information (Version) of the model parameters stored in that sub-module, and the sub-module's pointer to the next-level hash map (Shard HashMap Pointer). As shown in FIG. 5, the level below the first-level storage sub-module (Storage) is the second-level storage sub-module (Shard), so the next-level hash map pointer in the first-level storage sub-module header is a Shard HashMap Pointer; however, the present disclosure is not limited to this. For example, when the level below the first-level storage sub-module (Storage) is the model parameters themselves, the next-level hash map pointer in the header may instead be a pointer to the model parameter hash map (e.g., a Para HashMap Pointer).
According to an exemplary embodiment of the present disclosure, the first-level storage sub-module hash map information may include, under that first-level storage sub-module, the hash map of the storage sub-modules at each level from the second level through the Nth level, as well as the model parameter hash maps under each Nth-level storage sub-module, where N is the number of levels into which the model parameters are divided. The key of each level's storage sub-module hash map is the ID of a storage sub-module at that level, and its value is a pointer to the next-level hash map; the key of a model parameter hash map is a parameter name, and its value is the parameter value. The hash map of each level and the model parameter hash maps are linked through the corresponding next-level hash map pointers held by the storage sub-module one level above them.
As shown in FIG. 5, the model parameters are divided into two levels of storage sub-modules (Storage and Shard) for storage. The first-level storage sub-module hash map information therefore includes the hash map of the second-level storage sub-modules (Shard HashMap) and the model parameter hash map (Para HashMap) under each second-level storage sub-module. In the Shard HashMap, the key is a second-level storage sub-module ID (Shard ID) and the value is a pointer to the corresponding model parameter hash map (Para HashMap Pointer). In the Para HashMap, the key is a parameter name (Para ID) and the value is the parameter value (Value). The first-level storage sub-module header, the second-level storage sub-module hash map, and the model parameter hash maps are linked level by level through each level's next-level hash map pointer: the pointer in the first-level storage sub-module header points to the Shard HashMap, and a pointer in the Shard HashMap points to a Para HashMap. In addition, FIG. 5 shows, by way of example, only the model parameter hash map under one second-level storage sub-module (Shard 0), but the other second-level storage sub-modules likewise have model parameter hash maps. Of course, the present disclosure is not limited to the illustration of FIG. 5.
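As a minimal sketch (class and field names are hypothetical, not from the disclosure), the pointer-linked two-layer layout of FIG. 5 can be modeled with plain Python objects, where object references stand in for NVM pointers:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class StorageHead:
    """First-level storage sub-module header: ID, version, next-level pointer."""
    storage_id: str                                   # e.g. "Model 1 Storage 0"
    version: int
    # Shard HashMap: Shard ID -> Para HashMap (parameter name -> value)
    shard_hashmap: Dict[int, Dict[str, float]] = field(default_factory=dict)

@dataclass
class NodeHeader:
    """First-layer data: the fixed-location root of the node's storage."""
    node_id: int
    storage_ids: List[str]                            # persistent Storage ID List

# Build a node holding one storage with one shard and one parameter.
head = StorageHead("Model 1 Storage 0", version=1)
head.shard_hashmap[1] = {"Para0": 0.5}                # Shard 1's Para HashMap
root = NodeHeader(node_id=0, storage_ids=[head.storage_id])

# A lookup follows the pointer chain: root -> storage head -> shard -> value.
assert head.shard_hashmap[1]["Para0"] == 0.5
```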
For example, FIG. 6 is a schematic diagram illustrating the physical data storage structure of a parameter server node according to another exemplary embodiment of the present disclosure. As shown in FIG. 6, when the model parameters are divided into a single level of storage sub-modules (Storage), each first-level storage sub-module's hash map information consists of the model parameter hash map (Para HashMap) under that sub-module, and the next-level hash map pointer in the first-level storage sub-module header is a pointer to the model parameter hash map (Para HashMap Pointer). The first-level storage sub-module header and the model parameter hash map are linked by this pointer.
As another example, FIG. 7 is a schematic diagram illustrating the physical data storage structure of a parameter server node according to yet another exemplary embodiment of the present disclosure. As shown in FIG. 7, when the model parameters are divided into three levels of storage sub-modules (Storage, Shard, Slice), each first-level storage sub-module's hash map information includes the hash map of the second-level storage sub-modules under it (Shard HashMap), the hash map of the third-level storage sub-modules under each second-level storage sub-module (Slice HashMap), and the model parameter hash map (Para HashMap) under each third-level storage sub-module. The next-level hash map pointer in the first-level storage sub-module header is a pointer to the Shard HashMap (Shard HashMap Pointer); the next-level pointer in the Shard HashMap is a pointer to a Slice HashMap (Slice HashMap Pointer); and the next-level pointer in the Slice HashMap is a pointer to a Para HashMap (Para HashMap Pointer). Thus, between the first-level storage sub-module header, the Shard HashMap, the Slice HashMap, and the Para HashMap, each level is reached through the next-level hash map pointer of the level above it. In addition, FIG. 7 shows, by way of example, only the Slice HashMap under one second-level storage sub-module (Shard 0) and the Para HashMap under one third-level storage sub-module (Slice 0), but the other second-level storage sub-modules likewise have Slice HashMaps, and the other third-level storage sub-modules likewise have Para HashMaps.
Of course, the present disclosure is not limited to the above examples: the model parameters may be divided into storage sub-modules of any number of levels, and in each such case the first-layer data and second-layer data are stored and constructed analogously, following the logic of the examples above.
According to an exemplary embodiment of the present disclosure, the parameters of a specified model among the at least one model can be queried from the NVM-based parameter server cluster, and a corresponding model estimation service can be provided based on the queried parameters. From the parameter name of a parameter of the specified model, the ID of the parameter server node storing it and the IDs of its storage sub-modules at each level can be determined; the parameter server node is then located by the determined node ID, and, using the storage sub-module IDs at each level together with the second-layer data on that node, the parameter value corresponding to the parameter name is found.
For example, when querying a parameter under model X, the query process is that the user supplies a parameter name (say, Para0 at the last level) and the parameter server node returns the value corresponding to Para0. Specifically, the Storage ID and Shard ID of the parameter are first obtained by hashing its name (Para0). For example, suppose Para0 = '12345', model X has 5 Storages in total, and each Storage has 8 Shards; then the Storage ID for Para0 is 0 (12345 mod 5) and the Shard ID is 1 (12345 mod 8). Once the Storage ID is computed, the node ID holding that Storage is obtained as the remainder of the Storage ID divided by the total number of parameter server nodes in the cluster (for example, with 3 servers in total and Para0's Storage ID being 0, Storage 0 resides on Node 0, since 0 mod 3 = 0). Thus, from the user-supplied parameter name Para0, three computations yield the Node ID, Storage ID, and Shard ID of the parameter's location. The corresponding parameter server node is then found by Node ID, and the corresponding store in that node's second-layer data is found by Storage ID. The Shard HashMap is found via the Shard HashMap pointer in that store's first-level storage sub-module header; in the Shard HashMap, the Shard ID yields the pointer to the model parameter hash map; following that pointer reaches the Para HashMap; and in the Para HashMap, the parameter name Para0 yields the corresponding parameter value.
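The three modulo computations above can be reproduced directly. A sketch using the disclosure's worked numbers (5 Storages, 8 Shards per Storage, 3 nodes); treating the name as an integer key follows the worked example, while a real system would first hash an arbitrary string name:

```python
NUM_STORAGES = 5   # storages of model X
NUM_SHARDS = 8     # shards per storage
NUM_NODES = 3      # parameter server nodes in the cluster

def locate(param_name: str):
    """Return (node_id, storage_id, shard_id) for a parameter name."""
    key = int(param_name)            # the worked example uses a numeric name
    storage_id = key % NUM_STORAGES  # which Storage holds the parameter
    shard_id = key % NUM_SHARDS      # which Shard within that Storage
    node_id = storage_id % NUM_NODES # which node holds that Storage
    return node_id, storage_id, shard_id

# Para0 = '12345' -> Storage 0, Shard 1, on Node 0, as in the text.
assert locate("12345") == (0, 0, 1)
```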
According to an exemplary embodiment of the present disclosure, parameter writes occur only when a new model comes online. The write flow is similar to the query flow, with two differences. First, to guarantee that the old model parameters remain accessible while the new model parameters are being uploaded, all new parameters are inserted into a brand-new storage pool. For example, in the second-layer storage of FIG. 5, an empty Model 1 Storage 0 new is created; when the new value of Para0 is written, Para0 is inserted into the HashMap under Shard 1 of the new Storage 0 new. Only after all parameters of the new model have been uploaded to the parameter server cluster is the parameter query service switched over to the new parameters; meanwhile, a back-end thread gradually reclaims the space of the old model's parameters. Second, when the new Para0 is written into the HashMap under Storage 0 new, CPU instructions (such as clflushopt + mfence) can be used to guarantee that each written parameter reaches the NVM rather than lingering in the volatile CPU caches.
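A minimal sketch of this copy-on-new-model write path (names are hypothetical; the cache-line flush is represented by a placeholder function, since the real clflushopt + mfence sequence would be issued via compiler intrinsics or a persistence library, not from Python):

```python
def persist(obj):
    # Placeholder for the clflushopt + mfence sequence that forces the
    # written cache lines out to NVM; a no-op in this in-memory sketch.
    pass

# Old pool currently serving queries: Storage -> Shard -> Para HashMap.
node = {"Model 1 Storage 0": {1: {"Para0": 0.5}}}
active = "Model 1 Storage 0"

# 1. Upload the new model's parameters into a brand-new storage pool,
#    leaving the old pool untouched and still serving reads.
new_pool = "Model 1 Storage 0 new"
node[new_pool] = {1: {}}
node[new_pool][1]["Para0"] = 0.75
persist(node[new_pool])

# 2. Only after the upload completes is the query service switched over;
#    the old pool's space is then reclaimed in the background.
active = new_pool
assert node[active][1]["Para0"] == 0.75
old = node.pop("Model 1 Storage 0")   # stand-in for background reclamation
```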
FIG. 8 is a flowchart illustrating a parameter server node recovery method according to an exemplary embodiment of the present disclosure. The recovery method according to an exemplary embodiment of the present disclosure applies when a parameter server node as described above with reference to FIG. 3 must be restarted, for example because of a failure.
Referring to FIG. 8, in step 801, after the parameter server node restarts, the stored first-layer data may be read from the node's non-volatile memory. As described above, the first-layer data is the parameter server node header information, used to query the node information and the first-level storage sub-module information of the node; the header may include the parameter server node ID and the first-level storage sub-module ID list. Therefore, the node ID in the header identifies which parameter server node of the cluster has restarted, and the first-level storage sub-module ID list in the header identifies all first-level storage sub-modules held on the restarted node.
According to an exemplary embodiment of the present disclosure, the first-layer data may be stored at a fixed location in the non-volatile memory of the parameter server node, so that it can be read from that fixed location. This allows the storage information of the parameter server node to be queried quickly after the node restarts.
According to an exemplary embodiment of the present disclosure, each first-level storage submodule ID included in the first-level storage submodule ID list may be composed of a model ID and the ID of the first-level storage submodule within the corresponding model.
In step 802, based on the first-layer data, the header information of each first-level storage submodule in the stored second-layer data may be read from the non-volatile memory of the parameter server node. As described above, the second-layer data may include M pieces of first-level storage submodule header information and M pieces of first-level storage submodule hash map information; the second-layer data is used to look up the model parameters stored in each first-level storage submodule of the parameter server node, where M is the number of first-level storage submodules stored on the node.
According to an exemplary embodiment of the present disclosure, the header information of a first-level storage submodule may include the submodule's ID, the version information of the model parameters it stores, and its pointer to the next-level hash map. Therefore, after the first-layer data is obtained, all first-level storage submodule IDs of the parameter server node can be determined from the first-level storage submodule ID list, and for each such ID the header information of the first-level storage submodule with the corresponding ID can be read. In this way, the respective header information of all (M) first-level storage submodules stored on the parameter server node can be obtained.
According to an exemplary embodiment of the present disclosure, all (M) first-level storage submodules also have respective first-level storage submodule hash map information. Each first-level storage submodule is associated with its hash map information through the pointer to the next-level hash map in its header information. The hash map information of a first-level storage submodule may include the hash map of each level of storage submodule from the second level to the Nth level under that first-level storage submodule, as well as the model parameter hash map under each Nth-level storage submodule under it, where N is the number of levels into which the model parameters are divided. The key of the hash map at each level of storage submodule is the ID of the storage submodule at that level, and the value is a pointer to the next-level hash map. The key of the model parameter hash map is the parameter name, and the value is the parameter value. The hash map at each level and the model parameter hash map are linked through the corresponding pointer to the next-level hash map held by that level's storage submodule or by the storage submodule one level above the model parameters, for example as shown in FIG. 5 to FIG. 7.
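For illustration only, the layered layout described above can be mocked up in Python, with nested dicts standing in for the NVM-resident hash maps and pointers; the field names (node_id, shard_map, and so on) are hypothetical and not part of the disclosure:

```python
# With N = 2, a first-level storage submodule's hash map information is a
# Shard HashMap whose values play the role of Para HashMap pointers.
node = {
    # First layer: parameter server node header at a fixed location.
    "node_id": 0,
    "storage_ids": ["Model 1 Storage 0"],   # persistent list of submodule IDs
    # Second layer: one entry per first-level storage submodule.
    "storages": {
        "Model 1 Storage 0": {
            "head": {"storage_id": "Model 1 Storage 0", "version": 3},
            # Shard HashMap: key = Shard ID, value = "pointer" to Para HashMap.
            "shard_map": {
                1: {"Para0": [0.1, 0.2]},   # Para HashMap: name -> value
                2: {"Para7": [0.5]},
            },
        }
    },
}

# Lookup path: node header -> storage head -> Shard HashMap -> Para HashMap.
storage = node["storages"]["Model 1 Storage 0"]
print(storage["shard_map"][1]["Para0"])
```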
In step 803, based on the header information of each first-level storage submodule, the hash map information of each first-level storage submodule in the second-layer data may be traversed to recover the model parameters stored on the parameter server node.
According to an exemplary embodiment of the present disclosure, the following operations may be performed for each first-level storage submodule: searching the next-level hash map through the corresponding pointer of the upper-level storage submodule until the model parameter hash map is reached, and then recovering the model parameters based on the model parameter hash map.
For example, as shown in FIG. 5, when the model parameters are divided into two levels of storage submodules, the hash map information of each first-level storage submodule includes the hash map (Shard HashMap) of each second-level storage submodule under that first-level storage submodule, as well as the model parameter hash map (Para HashMap) under each of those second-level storage submodules. Here, the pointer to the next-level hash map in the first-level storage submodule header information is a pointer to the hash map of the second-level storage submodules (Shard HashMap Pointer); the key of the hash map of each second-level storage submodule is that submodule's ID (Shard ID), and the value is a pointer to the model parameter hash map under that second-level storage submodule (Para HashMap Pointer). When performing recovery, for each first-level storage submodule, the hash map of the second-level storage submodules (Shard HashMap) can be found through the Shard HashMap Pointer in the first-level storage submodule header information. At this point, whether the hash map data of the second-level storage submodules is complete can be checked, and all data of that hash map can be recovered. Subsequently, according to the Para HashMap Pointers of all second-level storage submodules in that hash map, the model parameter hash maps under all second-level storage submodules can be found, thereby recovering the model parameters under all second-level storage submodules.
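The two-level recovery walk just described can be sketched as follows; the dict-based data structures and names (recover_two_level, shard_map_ptr) are illustrative stand-ins for the NVM hash maps and pointers, not the disclosure's implementation:

```python
# Follow the Shard HashMap Pointer in the storage head, check the shard map is
# present, then gather every Para HashMap reachable from it.

def recover_two_level(storage):
    shard_map = storage["head"]["shard_map_ptr"]   # Shard HashMap Pointer
    assert shard_map is not None, "shard hash map missing or corrupt"
    recovered = {}
    for shard_id, para_map in shard_map.items():
        # para_map plays the role of the Para HashMap Pointer's target
        for name, value in para_map.items():
            recovered[name] = value
    return recovered

shards = {1: {"Para0": [0.1]}, 2: {"Para3": [0.2]}}
storage = {"head": {"storage_id": "Model 1 Storage 0", "version": 3,
                    "shard_map_ptr": shards}}
print(sorted(recover_two_level(storage)))
```

A real implementation would additionally verify the integrity of each hash map before trusting its pointers, as the recovery flow above requires.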
As another example, as shown in FIG. 6, when the model parameters are divided into a single level of storage submodules, the hash map information of each first-level storage submodule includes the model parameter hash map (Para HashMap) under that first-level storage submodule, and the pointer to the next-level hash map in the first-level storage submodule header information is a pointer to the model parameter hash map (Para HashMap Pointer). When performing recovery, for each first-level storage submodule, the model parameter hash map (Para HashMap) can be found through the Para HashMap Pointer in the first-level storage submodule header information, thereby recovering the model parameters under each first-level storage submodule.
As yet another example, as shown in FIG. 7, when the model parameters are divided into three levels of storage submodules, the hash map information of each first-level storage submodule includes the hash map (Shard HashMap) of each second-level storage submodule under that first-level storage submodule, the hash map (Slice HashMap) of each third-level storage submodule under each second-level storage submodule, and the model parameter hash map (Para HashMap) under each third-level storage submodule. The pointer to the next-level hash map in the first-level storage submodule header information is a pointer to the second-level storage submodule hash map (Shard HashMap Pointer); the pointer to the next-level hash map in the second-level storage submodule hash map is a pointer to the third-level storage submodule hash map (Slice HashMap Pointer); and the pointer to the next-level hash map in the third-level storage submodule hash map is a pointer to the model parameter hash map (Para HashMap Pointer). When performing recovery, for each first-level storage submodule, the second-level storage submodule hash map (Shard HashMap) can be found through the pointer in the first-level storage submodule header information; whether its data is complete can be checked, and all of its data can be recovered. Subsequently, according to the Slice HashMap Pointers of all second-level storage submodules in that hash map, the hash maps of all third-level storage submodules under all second-level storage submodules can be found; whether the hash map data of all third-level storage submodules is complete can be checked, and all of that data can be recovered. Finally, according to the pointer to the model parameter hash map in the hash map of each third-level storage submodule, the model parameter hash maps can be found, thereby recovering the model parameters under all third-level storage submodules.
Of course, the present disclosure is not limited to the above examples; the model parameters may be divided into storage submodules of any number of levels, in which case the recovery process proceeds by analogy, following the logic of the above examples, to recover the model parameters.
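The by-analogy generalization above can be sketched as a recursion: at each level a hash map either points to further next-level hash maps or, at the bottom, is the Para HashMap itself. The function and parameter names here are hypothetical, and nested dicts again stand in for the pointer-linked hash maps:

```python
# `depth` counts the submodule levels below the first-level storage submodule:
# depth=1 matches the two-level case of FIG. 5, depth=2 the three-level case
# of FIG. 7, and so on for any number of levels.

def recover(hash_map, depth):
    if depth == 0:
        # Reached a model parameter hash map: parameter name -> value.
        return dict(hash_map)
    recovered = {}
    for _sub_id, next_level_ptr in hash_map.items():
        recovered.update(recover(next_level_ptr, depth - 1))
    return recovered

# Three-level case: Shard HashMap -> Slice HashMap -> Para HashMap.
shard_map = {1: {10: {"Para0": [0.1]}, 11: {"Para1": [0.2]}},
             2: {20: {"Para2": [0.3]}}}
print(sorted(recover(shard_map, depth=2)))
```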
In addition, according to another exemplary embodiment of the present disclosure, when the parameter server node is severely damaged and repair takes a long time, the model parameters stored on the node may already be out of date by the time it restarts. Therefore, when the model parameter version information in the first-level storage submodule header information in the second-layer data is obtained, whether the model parameters stored on the parameter server node are the latest version can be determined based on that version information, and whether to perform recovery can then be decided according to the result.
For example, FIG. 9 is a flowchart illustrating a recovery method of a parameter server node according to an exemplary embodiment of the present disclosure.
Referring to FIG. 9, steps 901 and 902 in FIG. 9 perform the same operations as steps 801 and 802 in FIG. 8, and are therefore not repeated here.
After the header information of each first-level storage submodule in the second-layer data is obtained in step 902, the version information in the first-level storage submodule header information can be compared with the latest version information of that first-level storage submodule stored on the metadata node of the parameter server cluster. Here, the metadata node of the parameter server cluster is a server that stores the global metadata of the parameter server cluster, such as ZooKeeper. In the parameter server cluster, the latest version number corresponding to every first-level storage submodule of each model can be recorded on the metadata node. When the model parameters of a first-level storage submodule are updated, the version of that submodule on the metadata node is incremented by 1. Each time a parameter server node restarts, it can check the latest version of the current first-level storage submodule. When the version number stored on the metadata node is newer than the version number on the restarted parameter server node, the model parameters on the restarted node are out of date and there is no need to recover them. When the version number stored on the metadata node is the same as the version number on the restarted node, the model parameters on the restarted node are the current latest version and recovery can be performed.
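The version check described above can be sketched as follows; the metadata-node lookup is mocked as a plain dict, and the function name should_recover is a hypothetical label rather than an API of ZooKeeper or of the disclosure:

```python
# Compare the version in a recovered storage head with the latest version
# recorded on the metadata node; recover only when they match, otherwise the
# parameters are stale and should be re-pulled from the backend store.

def should_recover(storage_head, metadata_versions):
    latest = metadata_versions[storage_head["storage_id"]]
    return storage_head["version"] == latest

metadata = {"Model 1 Storage 0": 4}   # latest versions per first-level submodule
print(should_recover({"storage_id": "Model 1 Storage 0", "version": 4}, metadata))
print(should_recover({"storage_id": "Model 1 Storage 0", "version": 3}, metadata))
```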
Therefore, in step 903, when the model parameter version information in the first-level storage submodule header information is consistent with the model parameter version information stored on the metadata node of the parameter server cluster, recovery is performed, i.e., as in step 803 described with reference to FIG. 8, which is not repeated here.
In step 904, when the model parameter version information in the first-level storage submodule header information is inconsistent with the model parameter version information stored on the metadata node of the parameter server cluster, recovery is not performed; instead, the latest version of the model parameters is pulled from the backend storage system (for example, HDFS) and inserted into the parameter server node. Here, the backend storage system stores the latest version of the model parameters.
FIG. 10 is a block diagram illustrating a storage system of model parameters according to an exemplary embodiment of the present disclosure.
Referring to FIG. 10, a storage system 1000 of model parameters according to an exemplary embodiment of the present disclosure includes an acquisition device 1001 and a storage device 1002.
The acquisition device 1001 can acquire model parameters of at least one model. The model may be a high-dimensional AI model, for example, a recommendation model or a credit card anti-fraud model. Massive historical data can be fed into an offline training system to train the AI model, and the trained AI model can be deployed to an online inference system for use. Each dimension of an AI model corresponds to one parameter; an AI model capable of processing massive data may reach hundreds of millions or even billions of dimensions, so each model may store hundreds of millions or even billions of parameters. The model parameters here may refer to parameters before or during training, or to parameters after training is completed.
The storage device 1002 can store the model parameters of the at least one model in a parameter server cluster, where the parameter server cluster includes a plurality of parameter server nodes based on non-volatile memory. Here, the parameter server cluster may be an offline parameter server cluster used for training, or an online parameter server cluster used for inference.
According to an exemplary embodiment of the present disclosure, the storage device 1002 may store the model parameters of each of the at least one model in a distributed manner in the non-volatile memory of the plurality of non-volatile-memory-based parameter server nodes, and, for each such node, logically divide the model parameters corresponding to each model stored on that node into at least one level of storage submodules for storage. For example, the storage device 1002 may logically divide the model parameters of each of the at least one model into a plurality of storage modules (for example, the first-level storage submodules described below) and distribute the divided storage modules across the non-volatile memory of the plurality of nodes. In addition, in the non-volatile memory of each non-volatile-memory-based parameter server node, each storage module may be further divided into at least one level of storage submodules (for example, the second-level storage submodules, third-level storage submodules, and so on described below).
According to an exemplary embodiment of the present disclosure, the storage device 1002 may store first-layer data and second-layer data in the non-volatile memory of each non-volatile-memory-based parameter server node. The non-volatile memory of a parameter server node thus holds two layers of data storage, namely the first layer and the second layer shown in FIG. 5. For a parameter server node, the first-layer data may contain one store holding the information of the current node, and the second-layer data may contain M stores respectively holding the specific information of the M first-level storage submodules. The first-layer data and second-layer data according to an exemplary embodiment of the present disclosure are described in detail below.
According to an exemplary embodiment of the present disclosure, the first-layer data may store parameter server node header information, which is used to look up the node information and first-level storage submodule information of the parameter server node.
The node information may be the ID information of the parameter server node. The parameter server cluster internally assigns each parameter server an ID, and the ID is unique. Therefore, obtaining the ID information of a parameter server node determines which node in the parameter server cluster it is.
In addition, the first-level storage submodule information may be a list of the IDs of all first-level storage submodules stored on the parameter server node. According to an exemplary embodiment of the present disclosure, storage submodule IDs (e.g., Storage ID, Shard ID) are numbered per model; for example, model 1 has its own S1R1 (Storage 1, Shard 1), S1R2, S2R1, S2R2, ..., and model 2 likewise has its own S1R1, S1R2, S2R1, S2R2, .... Therefore, to distinguish the storage submodules of different models, each first-level storage submodule ID in the first-level storage submodule ID list may be composed of a model ID and the first-level storage submodule ID within the corresponding model; for example, the list may include Model 1 Storage 0, Model 1 Storage 3, Model 2 Storage 1, and so on. In addition, the first-level storage submodule ID list may be stored in the first-layer data in the form of a persistent list, which guarantees that the model parameters can all be inserted into non-volatile memory.
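As a small illustration, composing a first-level storage submodule ID from a model ID plus the submodule's ID within that model might look like the following; the string formatting is an assumption chosen only to match entries such as "Model 1 Storage 0" above:

```python
# Compose globally unambiguous first-level storage submodule IDs so that
# submodules of different models never collide in the node's ID list.

def storage_key(model_id, storage_id):
    return f"Model {model_id} Storage {storage_id}"

id_list = [storage_key(1, 0), storage_key(1, 3), storage_key(2, 1)]
print(id_list)
```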
According to an exemplary embodiment of the present disclosure, the first-layer data (i.e., the parameter server node header information) serves as the root of the storage of the entire non-volatile-memory-based parameter server node, and the storage device 1002 may store the first-layer data at a fixed location in the non-volatile memory of the parameter server node to facilitate lookup when the node performs fast recovery.
According to an exemplary embodiment of the present disclosure, the second-layer data may store M pieces of first-level storage submodule header information and M pieces of first-level storage submodule hash map (HashMap) information. The second-layer data can be used to look up the model parameters stored in each first-level storage submodule of the parameter server node. For example, for each first-level storage submodule, the storage device 1002 may newly create a corresponding non-volatile memory storage pool in the second-layer data storage for separate storage.
According to an exemplary embodiment of the present disclosure, as shown in FIG. 5, the first-level storage submodule header information (Storage Head) may be allocated three spaces, respectively used to store the first-level storage submodule ID (Storage ID), the version information (Version) of the model parameters stored by that submodule, and the submodule's pointer to the next-level hash map (Shard HashMap Pointer). The level below a first-level storage submodule (Storage) is a second-level storage submodule (Shard), so the pointer to the next-level hash map in the header information may be a Shard HashMap Pointer, but the present disclosure is not limited thereto. For example, when the level below the first-level storage submodule (Storage) is the model parameters themselves, the pointer to the next-level hash map in the header information may be a pointer to the model parameter hash map (e.g., a Para HashMap Pointer).
According to an exemplary embodiment of the present disclosure, the first-level storage submodule hash map information may include the hash map of each level of storage submodule from the second level to the Nth level under that first-level storage submodule, as well as the model parameter hash map under each Nth-level storage submodule under it, where N is the number of levels into which the model parameters are divided. The key of the hash map at each level of storage submodule is the ID of the storage submodule at that level, and the value is a pointer to the next-level hash map; the key of the model parameter hash map is the parameter name, and the value is the parameter value. The hash map at each level and the model parameter hash map are linked through the corresponding pointer to the next-level hash map held by that level's storage submodule or by the storage submodule one level above the model parameters.
According to an exemplary embodiment of the present disclosure, the storage system 1000 may further include a service device (not shown), which can query the parameters of a specified model of the at least one model from the non-volatile-memory-based parameter server cluster and provide a corresponding model estimation service based on the queried parameters of the specified model. For example, the service device can determine, from the parameter name of a parameter of the specified model, the ID of the parameter server node storing the parameter and the IDs of the storage submodules at each level, find the parameter server node according to the determined node ID, and, based on the second-layer data on the found node and the storage submodule IDs at each level, find the parameter value corresponding to the parameter name.
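The lookup path just described can be sketched as follows. The modulo-hash routing scheme here is an assumption made for illustration (the disclosure does not mandate how a parameter name maps to node and shard IDs), and the dict-based cluster is a stand-in for real nodes:

```python
# Hash the parameter name to pick the parameter server node and the shard
# within a storage submodule, then resolve the value through the node's
# second-layer data (Shard HashMap -> Para HashMap).

NUM_NODES, NUM_SHARDS = 4, 8

def route(param_name):
    h = hash(param_name)
    return h % NUM_NODES, h % NUM_SHARDS          # (node ID, shard ID)

def lookup(cluster, model_storage_id, param_name):
    node_id, shard_id = route(param_name)
    node = cluster[node_id]                        # find the node by ID
    storage = node["storages"][model_storage_id]   # second-layer data
    return storage["shard_map"][shard_id][param_name]   # Para HashMap

node_id, shard_id = route("Para0")
cluster = {node_id: {"storages": {"Model 1 Storage 0": {
    "shard_map": {shard_id: {"Para0": [0.9]}}}}}}
print(lookup(cluster, "Model 1 Storage 0", "Para0"))
```

Any deterministic partitioning scheme would do here; the essential point is that the same name-to-ID mapping is used for both writes and queries.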
According to an exemplary embodiment of the present disclosure, writing parameters occurs only when a new model comes online. The flow of writing parameters is similar to the flow of querying parameters, with only two differences. First, to ensure that the old model parameters can still be accessed while the new model parameters are being uploaded, the storage device 1002 can insert all new model parameters into a brand-new storage pool. Only after all parameters of the new model have been uploaded to the parameter server cluster is the parameter query service switched to the new parameters; at the same time, a backend thread begins to gradually reclaim the space of the old model parameters. Second, when the new Para0 is written into the HashMap under Storage 0 new, the storage device 1002 can use CPU instructions (such as clflushopt+mfence) to ensure that each written parameter datum is written into the NVM rather than remaining in the volatile CPU cache.
FIG. 11 is a block diagram illustrating a recovery system of a parameter server node according to an exemplary embodiment of the present disclosure. The recovery system according to an exemplary embodiment of the present disclosure applies when the parameter server node described above with reference to FIG. 3 needs to be restarted due to a failure or the like.
Referring to FIG. 11, a recovery system 1100 of a parameter server node according to an exemplary embodiment of the present disclosure may include a first acquisition device 1101, a second acquisition device 1102, and a recovery device 1103.
After the parameter server node restarts, the first acquisition device 1101 can read the stored first-layer data from the non-volatile memory of the parameter server node. As described above, the first-layer data may include parameter server node header information, which is used to look up the node information and first-level storage submodule information of the parameter server node. The parameter server node header information may include a parameter server node ID and a first-level storage submodule ID list. Therefore, the first acquisition device 1101 can determine from the parameter server node ID in the node header which node in the parameter server cluster the restarted node is, and can determine from the first-level storage submodule ID list in the node header all first-level storage submodules included in the restarted node.
According to an exemplary embodiment of the present disclosure, the first-layer data may be stored at a fixed location in the non-volatile memory of the parameter server node, so the first acquisition device 1101 can read the first-layer data from that fixed location. This allows the storage information of the parameter server node to be queried quickly after the node restarts.
According to an exemplary embodiment of the present disclosure, each first-level storage sub-module ID included in the first-level storage sub-module ID list may be composed of a model ID and the ID of the first-level storage sub-module within the corresponding model.
The second acquisition device 1102 may acquire, based on the first-layer data, the header information of each first-level storage sub-module in the stored second-layer data from the non-volatile memory of the parameter server node. As described above, the second-layer data may include M pieces of first-level storage sub-module header information and M pieces of first-level storage sub-module hash map information, and the second-layer data is used to query the model parameters stored in each first-level storage sub-module of the parameter server node, where M is the number of first-level storage sub-modules stored on the parameter server node.
According to an exemplary embodiment of the present disclosure, the first-level storage sub-module header information may include the ID of the first-level storage sub-module, the version information of the model parameters stored by the first-level storage sub-module, and the sub-module's pointer to the next-level hash map. Therefore, after acquiring the first-layer data, the second acquisition device 1102 may determine all first-level storage sub-module IDs of the parameter server node based on the first-level storage sub-module ID list, and then, for each of those IDs, acquire the header information of the first-level storage sub-module with the corresponding ID. In this way, the second acquisition device 1102 can obtain the respective header information of all (M) first-level storage sub-modules stored on the parameter server node.
According to an exemplary embodiment of the present disclosure, all (M) first-level storage sub-modules also have respective first-level storage sub-module hash map information. Each first-level storage sub-module may be associated with its hash map information through the pointer to the next-level hash map in its header information. The hash map information of each first-level storage sub-module may include the hash map of every storage sub-module at each level from the second level to the N-th level under that first-level storage sub-module, as well as the model parameter hash map under each N-th-level storage sub-module, where N is the number of levels into which the model parameters are divided. The key of the hash map of each level of storage sub-module is the ID of that sub-module, and the value is a pointer to the next-level hash map. The key of the model parameter hash map is the parameter name, and the value is the parameter value. The hash map of each level and the model parameter hash map are linked by the pointer to the next-level hash map held by that storage sub-module or by the storage sub-module one level above the model parameters, for example, as shown in FIGS. 5 to 7.
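The pointer-linked hierarchy described above can be modeled with nested maps, where holding a reference to the next-level map plays the role of the "pointer to the next-level hash map". All concrete names and values below are illustrative assumptions; only the shape (ID-keyed maps chained down to a name-to-value parameter map) comes from the disclosure.

```python
# Illustrative two-level instance of the pointer-linked hash-map hierarchy.
# Leaf level: Para HashMap — key = parameter name, value = parameter value.
para_map_a = {"w0": 0.1, "w1": 0.2}
para_map_b = {"w2": 0.3}

# Shard HashMap — key = second-level sub-module ID, value = "pointer"
# (here, a Python reference) to the next-level hash map.
shard_map = {0: para_map_a, 1: para_map_b}

# First-level storage sub-module header: ID, version info, and the
# pointer to the next-level hash map (fields assumed for illustration).
first_level_header = {
    "submodule_id": 42,
    "version": 5,
    "next_level": shard_map,
}

# Following the chain of pointers reaches a stored parameter:
assert first_level_header["next_level"][0]["w1"] == 0.2
```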
The recovery device 1103 may traverse, based on the header information of each first-level storage sub-module, the hash map information of each first-level storage sub-module in the second-layer data, so as to restore the model parameters stored on the parameter server node.
According to an exemplary embodiment of the present disclosure, the recovery device 1103 may perform the following operations for each first-level storage sub-module: searching for the next-level hash map through the pointer to the next-level hash map held by the storage sub-module at the level above, until the model parameter hash map is found; and restoring the model parameters based on the model parameter hash map.
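The traversal above can be sketched as a short recursive descent. The representation is assumed (nested Python dicts standing in for the pointer-linked hash maps, with string keys marking a leaf-level parameter map); the disclosure itself only specifies following next-level pointers until the model parameter hash map is reached.

```python
# Hedged sketch of the recovery traversal. Non-leaf maps take sub-module IDs
# to the next-level map; a leaf map takes parameter names (strings) to values.
def restore(submodule_map, restored):
    if all(isinstance(k, str) for k in submodule_map):
        # Reached a model parameter hash map (Para HashMap): recover its entries.
        restored.update(submodule_map)
        return
    for next_level in submodule_map.values():
        # Follow the pointer to the next-level hash map and keep descending.
        restore(next_level, restored)

# Two-level example: shard 0 and shard 1 each hold a parameter map.
hierarchy = {0: {"w0": 1.0, "w1": 2.0}, 1: {"w2": 3.0}}
out = {}
restore(hierarchy, out)
assert out == {"w0": 1.0, "w1": 2.0, "w2": 3.0}
```

The same descent works unchanged for one, two, or three levels of storage sub-modules, which matches the "by analogy" generalization described below for arbitrary N.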
For example, as shown in FIG. 5, in the case where the model parameters are divided into two levels of storage sub-modules, the hash map information of each first-level storage sub-module includes the hash map (Shard HashMap) of each second-level storage sub-module under the first-level storage sub-module and the model parameter hash map (Para HashMap) under each of those second-level storage sub-modules. Here, the pointer to the next-level hash map in the first-level storage sub-module header information is a pointer to the hash map of the second-level storage sub-modules (Shard HashMap Point), the key of the hash map of each second-level storage sub-module is the second-level storage sub-module ID (Shard ID), and the value is a pointer to the model parameter hash map under that second-level storage sub-module (Para HashMap Point). When the recovery device 1103 performs recovery, for each first-level storage sub-module, it may search for the hash map of the second-level storage sub-modules (Shard HashMap) through the pointer (Shard HashMap Point) in the first-level storage sub-module header information. The recovery device 1103 may then check whether the data of this hash map is complete and restore all of its data. Subsequently, the recovery device 1103 may search for the model parameter hash maps under all second-level storage sub-modules according to the pointers to the model parameter hash maps (Para HashMap Point) held in that hash map, thereby restoring the model parameters under all second-level storage sub-modules.
As another example, as shown in FIG. 6, in the case where the model parameters are divided into a single level of storage sub-modules, the hash map information of each first-level storage sub-module includes the model parameter hash map (Para HashMap) under that first-level storage sub-module, and the pointer to the next-level hash map in the first-level storage sub-module header information is a pointer to the model parameter hash map (Para HashMap Point). When the recovery device 1103 performs recovery, for each first-level storage sub-module, it may search for the model parameter hash map (Para HashMap) through the pointer (Para HashMap Point) in the first-level storage sub-module header information, thereby restoring the model parameters under each first-level storage sub-module.
As yet another example, as shown in FIG. 7, in the case where the model parameters are divided into three levels of storage sub-modules, the hash map information of each first-level storage sub-module includes the hash map (Shard HashMap) of each second-level storage sub-module under the first-level storage sub-module, the hash map (Slice HashMap) of each third-level storage sub-module under each second-level storage sub-module, and the model parameter hash map (Para HashMap) under each third-level storage sub-module. The pointer to the next-level hash map in the first-level storage sub-module header information is a pointer to the second-level hash map (Shard HashMap Point), the pointer to the next-level hash map in the second-level hash map is a pointer to the third-level hash map (Slice HashMap Point), and the pointer to the next-level hash map in the third-level hash map is a pointer to the model parameter hash map (Para HashMap Point). When the recovery device 1103 performs recovery, for each first-level storage sub-module, it may search for the hash map of the second-level storage sub-modules (Shard HashMap) through the pointer (Shard HashMap Point) in the first-level storage sub-module header information, check whether the data of this hash map is complete, and restore all of its data. Subsequently, the recovery device 1103 may search for the hash maps of the third-level storage sub-modules under all second-level storage sub-modules according to the pointers (Slice HashMap Point) held in the second-level hash map, check whether the hash map data of all third-level storage sub-modules is complete, and restore all of that data. Finally, the recovery device 1103 may search for the model parameter hash maps according to the pointer to the model parameter hash map in the hash map of each third-level storage sub-module, thereby restoring the model parameters under all third-level storage sub-modules.
Of course, the present disclosure is not limited to the above examples; the model parameters may be divided into storage sub-modules of any number of levels, and in that case the recovery flow may restore the model parameters by analogy with the logic of the above examples.
In addition, according to another exemplary embodiment of the present disclosure, when the parameter server node is severely damaged and the repair takes a long time, the model parameters stored in the parameter server node may be out of date by the time the node is restarted. Therefore, when the second acquisition device 1102 acquires the model parameter version information in the first-level storage sub-module header information of the second-layer data, the recovery device 1103 may determine, based on the version information, whether the model parameters stored in the parameter server node are the latest version, and then decide whether to perform recovery according to the result.
For example, after the second acquisition device 1102 acquires the header information of each first-level storage sub-module in the second-layer data, the recovery device 1103 may compare the version information in that header information with the latest version information of the first-level storage sub-module stored on the metadata node of the parameter server cluster. Here, the metadata node of the parameter server cluster is a server, such as ZooKeeper, that stores global metadata information of the parameter server cluster. In the parameter server cluster, the latest version number corresponding to every first-level storage sub-module of each model may be recorded on the metadata node. When the model parameters of a first-level storage sub-module are updated, the version of that sub-module on the metadata node is incremented by 1. Each time a parameter server node is restarted, the latest version of the current first-level storage sub-module can be checked. When the version number stored on the metadata node is newer than that on the restarted parameter server node, the model parameters on the restarted node are out of date and there is no need to restore them. When the two version numbers are the same, the model parameters on the restarted node are the current latest version and recovery can be performed.
In the case where the model parameter version information in the first-level storage sub-module header information is consistent with the model parameter version information stored on the metadata node of the parameter server cluster, the recovery device 1103 may perform recovery.
In the case where the model parameter version information in the first-level storage sub-module header information is inconsistent with the model parameter version information stored on the metadata node of the parameter server cluster, the recovery device 1103 does not perform recovery, but instead pulls the latest version of the model parameters from a back-end storage system (for example, HDFS) and inserts them into the parameter server node. Here, the back-end storage system may store the latest version of the model parameters.
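The version decision described in the preceding paragraphs reduces to a single comparison, sketched below. The function name and return labels are assumptions for illustration; the two outcomes (restore from non-volatile memory versus re-pull from the back-end storage system) are the ones stated in the disclosure.

```python
# Hedged sketch of the post-restart version check against the cluster
# metadata node (e.g. ZooKeeper). Names are hypothetical.
def decide_recovery(local_version: int, metadata_version: int) -> str:
    if local_version == metadata_version:
        # Parameters on the restarted node are current: fast in-place recovery.
        return "restore_from_nvm"
    # Stale parameters: skip recovery and pull the latest version
    # from the back-end storage system (for example, HDFS).
    return "pull_from_backend"

assert decide_recovery(5, 5) == "restore_from_nvm"
assert decide_recovery(4, 5) == "pull_from_backend"
```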
According to an exemplary embodiment of the present disclosure, a parameter server cluster is provided. The parameter server cluster includes a plurality of parameter server nodes based on non-volatile memory and is used to store the model parameters of at least one model in a distributed manner. Each non-volatile-memory-based parameter server node included in the parameter server cluster may have the logical storage structure for model parameters described above with reference to FIG. 4, and may have the physical data storage structure described above with reference to FIGS. 5-7 or a similar structure. In addition, each non-volatile-memory-based parameter server node included in the parameter server cluster may be recovered upon restart according to the recovery method described with reference to FIG. 8 or FIG. 9.
According to an exemplary embodiment of the present disclosure, the non-volatile memory may include at least one of STT-RAM, PCM, ReRAM, and 3D XPoint. Among these, PMEM (a 3D XPoint product) may be used to implement a PMEM-based parameter server node.
According to the model parameter storage method and system, the parameter server node recovery method and system, and the parameter server cluster of the present disclosure, the model parameters are stored on a parameter server based on non-volatile memory instead of a DRAM-based parameter server, which greatly reduces the hardware cost. In addition, a logical storage structure for model parameters and a physical storage structure for data are designed for the non-volatile-memory-based parameter server: the model parameters are stored hierarchically in logical terms, and the data is stored in two layers on the non-volatile memory, satisfying the high-concurrency and high-availability requirements of the parameter server. Furthermore, a fast recovery flow after restart is designed; based on the two-layer data structure, all parameters stored on the parameter server node can be queried and restored easily and quickly, achieving millisecond-level recovery.
The model parameter storage method and system, the parameter server node recovery method and system, and the parameter server cluster according to exemplary embodiments of the present disclosure have been described above with reference to FIGS. 3 to 11.
Each device in the model parameter storage system shown in FIG. 10 and in the parameter server node recovery system shown in FIG. 11 may be configured as software, hardware, firmware, or any combination thereof that performs a specific function. For example, each device may correspond to a dedicated integrated circuit, to pure software code, or to a module combining software and hardware. In addition, one or more functions implemented by each device may also be performed uniformly by components in a physical entity device (for example, a processor, a client, or a server).
In addition, the model parameter storage method described with reference to FIG. 3 and the parameter server node recovery methods described with reference to FIGS. 8 and 9 may be implemented by a program (or instructions) recorded on a computer-readable storage medium. For example, according to an exemplary embodiment of the present disclosure, a computer-readable storage medium storing instructions may be provided, wherein the instructions, when executed by at least one computing device, cause the at least one computing device to perform the model parameter storage method or the parameter server node recovery method according to the present disclosure.
The computer program in the above computer-readable storage medium may run in an environment deployed on computer equipment such as a client, a host, an agent device, or a server. It should be noted that the computer program may also be used to perform additional steps beyond those described above, or to perform more specific processing when performing the above steps; the content of these additional steps and further processing has already been mentioned in the description of the related methods with reference to FIGS. 3, 8, and 9, and is therefore not repeated here.
It should be noted that each device in the model parameter storage system and the parameter server node recovery system according to exemplary embodiments of the present disclosure may rely entirely on the running of a computer program to implement its corresponding function; that is, the functional architecture of the computer program corresponds to the respective steps in each device, so that the entire system is invoked through a dedicated software package (for example, a lib library) to implement the corresponding functions.
On the other hand, each device in FIGS. 10 and 11 may also be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof. When implemented in software, firmware, middleware, or microcode, the program code or code segments for performing the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that a processor can perform the corresponding operations by reading and running the corresponding program code or code segments.
For example, an exemplary embodiment of the present disclosure may also be implemented as a computing device that includes a storage component and a processor, the storage component storing a set of computer-executable instructions that, when executed by the processor, performs the model parameter storage method or the parameter server node recovery method according to an exemplary embodiment of the present disclosure.
Specifically, the computing device may be deployed in a server or a client, or on a node device in a distributed network environment. Furthermore, the computing device may be a PC, a tablet device, a personal digital assistant, a smartphone, a web application, or any other device capable of executing the above set of instructions.
Here, the computing device need not be a single computing device; it may also be any aggregate of devices or circuits capable of executing the above instructions (or instruction set) individually or jointly. The computing device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (for example, via wireless transmission).
In the computing device, the processor may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor. By way of example and not limitation, the processor may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
Some of the operations described in the model parameter storage method or the parameter server node recovery method according to exemplary embodiments of the present disclosure may be implemented in software, some may be implemented in hardware, and some may also be implemented by a combination of software and hardware.
The processor may run instructions or code stored in one of the storage components, which may also store data. Instructions and data may also be sent and received over a network via a network interface device, which may employ any known transport protocol.
The storage component may be integrated with the processor, for example, with RAM or flash memory arranged within an integrated circuit microprocessor or the like. Furthermore, the storage component may comprise a separate device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The storage component and the processor may be operatively coupled, or may communicate with each other, for example, through I/O ports or network connections, so that the processor can read files stored in the storage component.
In addition, the computing device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, or touch input device). All components of the computing device may be connected to one another via a bus and/or a network.
The model parameter storage method or the parameter server node recovery method according to exemplary embodiments of the present disclosure may be described as various interconnected or coupled functional blocks or functional diagrams. However, these functional blocks or functional diagrams may equally be integrated into a single logical device or operate along non-precise boundaries.
Therefore, the model parameter storage method described with reference to FIG. 3 or the parameter server node recovery method described with reference to FIG. 8 or 9 may be implemented by a system comprising at least one computing device and at least one storage device storing instructions.
According to an exemplary embodiment of the present disclosure, the at least one computing device is a computing device for performing the model parameter storage method or the parameter server node recovery method according to an exemplary embodiment of the present disclosure, and the storage device stores a set of computer-executable instructions that, when executed by the at least one computing device, performs the model parameter storage method or the parameter server node recovery method according to the present disclosure.
According to an exemplary embodiment of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the model parameter storage method described with reference to FIG. 4 and/or the parameter server node recovery method described with reference to FIG. 8 or 9.
Various exemplary embodiments of the present disclosure have been described above. It should be understood that the above description is merely exemplary and not exhaustive, and that the present disclosure is not limited to the disclosed exemplary embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the present disclosure. Therefore, the protection scope of the present disclosure should be determined by the scope of the claims.
Industrial Applicability
According to the parameter server node recovery method and system of the present disclosure, the model parameters are stored on a parameter server based on non-volatile memory instead of a DRAM-based parameter server, which greatly reduces the hardware cost. In addition, a logical storage structure for model parameters and a physical storage structure for data are designed for the non-volatile-memory-based parameter server: the model parameters are stored hierarchically in logical terms, and the data is stored in two layers on the non-volatile memory, satisfying the high-concurrency and high-availability requirements of the parameter server. Furthermore, a fast recovery flow after restart is designed; based on the two-layer data structure, all parameters stored on the parameter server node can be queried and restored easily and quickly, achieving millisecond-level recovery.

Claims (25)

  1. A system comprising at least one computing device and at least one storage device storing instructions, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to perform a parameter server node recovery method, wherein the parameter server node is a node in a parameter server cluster, the parameter server node is a parameter server node based on non-volatile memory, and the model parameters of each model stored in the non-volatile memory of the parameter server node are logically divided into at least one level of storage sub-modules, the recovery method comprising:
    after the parameter server node is restarted, acquiring stored first-layer data from the non-volatile memory of the parameter server node, wherein the first-layer data comprises parameter server node header information, and the first-layer data is used to query node information and first-level storage sub-module information of the parameter server node;
    基于所述第一层数据,从所述参数服务器节点的非易失性内存中获取所存储的第二层数据中的每个第一级存储子模块头信息,其中,所述第二层数据包括M个第一级存储子模块头信息和M个第一级存储子模块哈希映射信息,所述第二层数据用于查询该参数服务器节点的各个第一级存储子模块中存储的模型参数,其中,M为该参数服务器节点上存储的第一级存储子模块的数量;Based on the first-level data, obtain each first-level storage submodule header information in the stored second-level data from the non-volatile memory of the parameter server node, wherein the second-level data It includes M first-level storage submodule header information and M first-level storage submodule hash mapping information, and the second-level data is used to query the model stored in each first-level storage submodule of the parameter server node parameter, where M is the number of first-level storage submodules stored on the parameter server node;
    基于每个第一级存储子模块头信息,遍历所述第二层数据中的每个第一级存储子模块哈希映射信息,以恢复所述参数服务器节点上存储的模型参数。Based on the header information of each first-level storage sub-module, traverse the hash map information of each first-level storage sub-module in the second-level data to restore the model parameters stored on the parameter server node.
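The restart-time flow of claim 1 — read the node header, locate each first-level sub-module's header, then follow its hash maps down to the parameters — can be sketched as follows. This is a minimal illustration only: plain Python dicts and dataclasses stand in for the NVM-resident layout, and all names (`NodeHeader`, `SubmoduleHeader`, `recover_node`) are assumptions, not the patent's implementation.

```python
from dataclasses import dataclass

@dataclass
class SubmoduleHeader:
    submodule_id: str
    version: int
    next_level: dict  # stands in for the pointer to the next-level hash map

@dataclass
class NodeHeader:  # first-layer data, kept at a fixed NVM location
    node_id: str
    submodule_ids: list

def recover_node(node_header: NodeHeader, second_layer: dict) -> dict:
    """Recover all parameters by walking the two-layer structure."""
    recovered = {}
    for sid in node_header.submodule_ids:       # from the node header (layer 1)
        header = second_layer[sid]              # sub-module header (layer 2)
        stack = [header.next_level]             # follow the next-level pointer
        while stack:
            hmap = stack.pop()
            for key, value in hmap.items():
                if isinstance(value, dict):     # an intermediate-level hash map
                    stack.append(value)
                else:                           # leaf: parameter name -> value
                    recovered[key] = value
    return recovered

# One model with a single first-level sub-module whose hash map holds
# the parameters directly (the one-level case of claim 8).
node = NodeHeader("ps-0", ["model1/shard0"])
layer2 = {"model1/shard0": SubmoduleHeader("model1/shard0", 3,
                                           {"w1": 0.5, "w2": -0.25})}
params = recover_node(node, layer2)
print(params)  # {'w1': 0.5, 'w2': -0.25}
```

Because the traversal only follows pointers already present in the two layers, no scan of the raw memory region is needed, which is what makes the recovery fast.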
  2. The recovery method of claim 1, wherein the first-layer data is stored at a fixed location in the non-volatile memory of the parameter server node;
    wherein acquiring the stored first-layer data from the non-volatile memory of the parameter server node comprises:
    reading the first-layer data from the fixed location in the non-volatile memory of the parameter server node.
  3. The recovery method of claim 1, wherein the parameter server node header information comprises a parameter server node ID and a list of first-level storage sub-module IDs;
    and the first-level storage sub-module header information comprises the ID of the first-level storage sub-module, version information of the model parameters stored in the first-level storage sub-module, and the first-level storage sub-module's pointer to the next-level hash map.
  4. The recovery method of claim 3, wherein each first-level storage sub-module ID included in the list of first-level storage sub-module IDs consists of a model ID and the ID of the first-level storage sub-module within the corresponding model.
  5. The recovery method of claim 3, wherein acquiring, based on the first-layer data, each piece of first-level storage sub-module header information in the stored second-layer data from the non-volatile memory of the parameter server node comprises:
    determining all first-level storage sub-module IDs of the parameter server node based on the list of first-level storage sub-module IDs;
    for each of the first-level storage sub-module IDs of the parameter server node, acquiring the first-level storage sub-module header information having that ID.
  6. The recovery method of claim 3, wherein the first-level storage sub-module hash map information comprises a hash map for each storage sub-module from the second level to the Nth level under the first-level storage sub-module, and a model parameter hash map under each Nth-level storage sub-module under the first-level storage sub-module, where N is the number of levels of storage sub-modules into which the model parameters are divided;
    wherein the key of the hash map of a storage sub-module at each level is the ID of that storage sub-module, and the value is a pointer to the next-level hash map;
    the key of a model parameter hash map is a parameter name, and the value is the parameter value;
    and the hash maps of the storage sub-modules at each level and the model parameter hash maps are linked via the corresponding pointer to the next-level hash map held by that storage sub-module or by the storage sub-module one level above the model parameters.
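The pointer-linked hash maps of claim 6 — each level keyed by sub-module ID with the value pointing at the next-level map, and the leaf map keyed by parameter name — can be sketched as below, with nested dicts standing in for the next-level pointers. The `lookup` helper and the shard names are illustrative assumptions, not the patent's implementation.

```python
def lookup(first_level_ptr: dict, path: list, param_name: str):
    """Follow next-level pointers along `path`, then read the parameter."""
    hmap = first_level_ptr
    for submodule_id in path:       # second-level ... Nth-level sub-module IDs
        hmap = hmap[submodule_id]   # value acts as a pointer to the next map
    return hmap[param_name]         # leaf map: parameter name -> value

# N = 2: the first-level sub-module's map is keyed by second-level IDs,
# and each value "points" at that second-level sub-module's parameter map.
first_level = {"shard0": {"embed.weight": [0.1, 0.2]},
               "shard1": {"fc.bias": [0.0]}}
print(lookup(first_level, ["shard0"], "embed.weight"))  # [0.1, 0.2]
```

In the patent's layout the values at intermediate levels are NVM pointers rather than nested objects, but the key/value discipline per level is the same.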
  7. The recovery method of claim 6, wherein traversing, based on each piece of first-level storage sub-module header information, each piece of first-level storage sub-module hash map information in the second-layer data to recover the model parameters stored on the parameter server node comprises:
    performing the following for each first-level storage sub-module:
    searching for the next-level hash map via the corresponding pointer to the next-level hash map held by the storage sub-module one level above, until a model parameter hash map is found;
    recovering the model parameters based on the model parameter hash map.
  8. The recovery method of claim 7, wherein, when the model parameters are divided into one level of storage sub-modules: each piece of first-level storage sub-module hash map information comprises the model parameter hash map under that first-level storage sub-module, and the pointer to the next-level hash map in the first-level storage sub-module header information is a pointer to the model parameter hash map;
    wherein searching for the next-level hash map via the corresponding pointer to the next-level hash map held by the storage sub-module one level above, until a model parameter hash map is found, comprises:
    searching for the model parameter hash map via the pointer to the model parameter hash map in the first-level storage sub-module header information.
  9. The recovery method of claim 7, wherein, when the model parameters are divided into two levels of storage sub-modules: each piece of first-level storage sub-module hash map information comprises the hash map of each second-level storage sub-module under the first-level storage sub-module and the model parameter hash map of each second-level storage sub-module under the first-level storage sub-module, the pointer to the next-level hash map in the first-level storage sub-module header information is a pointer to the hash map of the second-level storage sub-modules, and in the hash map of the second-level storage sub-modules each key is a second-level storage sub-module ID and the corresponding value is a pointer to the model parameter hash map under that second-level storage sub-module;
    wherein searching for the next-level hash map via the corresponding pointer of the storage sub-module one level above, until a model parameter hash map is found, comprises:
    searching for the hash map of the second-level storage sub-modules via the pointer to the second-level storage sub-module hash map in the first-level storage sub-module header information;
    searching for the model parameter hash map via the pointer to the model parameter hash map in the hash map of the second-level storage sub-modules.
  10. The recovery method of claim 3, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to further perform the following steps:
    when the model parameter version information in the first-level storage sub-module header information is consistent with the model parameter version information stored on the metadata node of the parameter server cluster, performing the recovering step;
    when the model parameter version information in the first-level storage sub-module header information is inconsistent with the model parameter version information stored on the metadata node of the parameter server cluster, not performing the recovering step, but instead pulling the latest version of the model parameters from a back-end storage system and inserting them into the parameter server node;
    wherein the metadata node of the parameter server cluster stores the latest version number of the model parameters, and the back-end storage system stores the latest version of the model parameters.
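The version check of claim 10 amounts to a two-way branch: recover from the non-volatile memory only when the sub-module's stored version matches the latest version number on the cluster's metadata node, otherwise pull the latest parameters from the back-end storage system. A minimal sketch, in which `load_submodule`, `recover_from_nvm`, and `backend_fetch` are hypothetical names for the roles the claim describes:

```python
def load_submodule(header_version: int, metadata_version: int,
                   recover_from_nvm, backend_fetch):
    """Choose between NVM recovery and a back-end pull based on versions."""
    if header_version == metadata_version:
        return recover_from_nvm()   # versions match: fast local recovery
    return backend_fetch()          # stale: pull latest and re-insert

params = load_submodule(
    header_version=3, metadata_version=4,            # NVM copy is stale
    recover_from_nvm=lambda: {"w": 0.5},
    backend_fetch=lambda: {"w": 0.75})
print(params)  # {'w': 0.75}
```

This keeps the common case (no version drift while the node was down) on the millisecond-level local path, falling back to the slower back-end pull only when needed.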
  11. The recovery method of claim 1, wherein the non-volatile memory comprises at least one of STT-RAM, PCM, ReRAM, and 3D XPoint.
  12. A recovery method for a parameter server node, wherein the parameter server node is a node in a parameter server cluster and is a parameter server node based on non-volatile memory, and the model parameters of each model stored in the non-volatile memory of the parameter server node are logically divided into at least one level of storage sub-modules, the recovery method comprising:
    after the parameter server node restarts, acquiring the stored first-layer data from the non-volatile memory of the parameter server node, wherein the first-layer data comprises parameter server node header information and is used to query node information of the parameter server node and first-level storage sub-module information;
    based on the first-layer data, acquiring each piece of first-level storage sub-module header information in the stored second-layer data from the non-volatile memory of the parameter server node, wherein the second-layer data comprises M pieces of first-level storage sub-module header information and M pieces of first-level storage sub-module hash map information and is used to query the model parameters stored in each first-level storage sub-module of the parameter server node, M being the number of first-level storage sub-modules stored on the parameter server node;
    based on each piece of first-level storage sub-module header information, traversing each piece of first-level storage sub-module hash map information in the second-layer data to recover the model parameters stored on the parameter server node.
  13. The recovery method of claim 12, wherein the first-layer data is stored at a fixed location in the non-volatile memory of the parameter server node;
    wherein acquiring the stored first-layer data from the non-volatile memory of the parameter server node comprises:
    reading the first-layer data from the fixed location in the non-volatile memory of the parameter server node.
  14. The recovery method of claim 12, wherein the parameter server node header information comprises a parameter server node ID and a list of first-level storage sub-module IDs;
    and the first-level storage sub-module header information comprises the ID of the first-level storage sub-module, version information of the model parameters stored in the first-level storage sub-module, and the first-level storage sub-module's pointer to the next-level hash map.
  15. The recovery method of claim 14, wherein each first-level storage sub-module ID included in the list of first-level storage sub-module IDs consists of a model ID and the ID of the first-level storage sub-module within the corresponding model.
  16. The recovery method of claim 14, wherein acquiring, based on the first-layer data, each piece of first-level storage sub-module header information in the stored second-layer data from the non-volatile memory of the parameter server node comprises:
    determining all first-level storage sub-module IDs of the parameter server node based on the list of first-level storage sub-module IDs;
    for each of the first-level storage sub-module IDs of the parameter server node, acquiring the first-level storage sub-module header information having that ID.
  17. The recovery method of claim 14, wherein the first-level storage sub-module hash map information comprises a hash map for each storage sub-module from the second level to the Nth level under the first-level storage sub-module, and a model parameter hash map under each Nth-level storage sub-module under the first-level storage sub-module, where N is the number of levels of storage sub-modules into which the model parameters are divided;
    wherein the key of the hash map of a storage sub-module at each level is the ID of that storage sub-module, and the value is a pointer to the next-level hash map;
    the key of a model parameter hash map is a parameter name, and the value is the parameter value;
    and the hash maps of the storage sub-modules at each level and the model parameter hash maps are linked via the corresponding pointer to the next-level hash map held by that storage sub-module or by the storage sub-module one level above the model parameters.
  18. The recovery method of claim 17, wherein traversing, based on each piece of first-level storage sub-module header information, each piece of first-level storage sub-module hash map information in the second-layer data to recover the model parameters stored on the parameter server node comprises:
    performing the following for each first-level storage sub-module:
    searching for the next-level hash map via the corresponding pointer to the next-level hash map held by the storage sub-module one level above, until a model parameter hash map is found;
    recovering the model parameters based on the model parameter hash map.
  19. The recovery method of claim 18, wherein, when the model parameters are divided into one level of storage sub-modules: each piece of first-level storage sub-module hash map information comprises the model parameter hash map under that first-level storage sub-module, and the pointer to the next-level hash map in the first-level storage sub-module header information is a pointer to the model parameter hash map;
    wherein searching for the next-level hash map via the corresponding pointer to the next-level hash map held by the storage sub-module one level above, until a model parameter hash map is found, comprises:
    searching for the model parameter hash map via the pointer to the model parameter hash map in the first-level storage sub-module header information.
  20. The recovery method of claim 18, wherein, when the model parameters are divided into two levels of storage sub-modules: each piece of first-level storage sub-module hash map information comprises the hash map of each second-level storage sub-module under the first-level storage sub-module and the model parameter hash map of each second-level storage sub-module under the first-level storage sub-module, the pointer to the next-level hash map in the first-level storage sub-module header information is a pointer to the hash map of the second-level storage sub-modules, and in the hash map of the second-level storage sub-modules each key is a second-level storage sub-module ID and the corresponding value is a pointer to the model parameter hash map under that second-level storage sub-module;
    wherein searching for the next-level hash map via the corresponding pointer of the storage sub-module one level above, until a model parameter hash map is found, comprises:
    searching for the hash map of the second-level storage sub-modules via the pointer to the second-level storage sub-module hash map in the first-level storage sub-module header information;
    searching for the model parameter hash map via the pointer to the model parameter hash map in the hash map of the second-level storage sub-modules.
  21. The recovery method of claim 14, further comprising:
    when the model parameter version information in the first-level storage sub-module header information is consistent with the model parameter version information stored on the metadata node of the parameter server cluster, performing the recovering step;
    when the model parameter version information in the first-level storage sub-module header information is inconsistent with the model parameter version information stored on the metadata node of the parameter server cluster, not performing the recovering step, but instead pulling the latest version of the model parameters from a back-end storage system and inserting them into the parameter server node;
    wherein the metadata node of the parameter server cluster stores the latest version number of the model parameters, and the back-end storage system stores the latest version of the model parameters.
  22. The recovery method of claim 12, wherein the non-volatile memory comprises at least one of STT-RAM, PCM, ReRAM, and 3D XPoint.
  23. A recovery system for a parameter server node, wherein the parameter server node is a node in a parameter server cluster and is a parameter server node based on non-volatile memory, and the model parameters of each model stored in the non-volatile memory of the parameter server node are logically divided into at least one level of storage sub-modules, the recovery system comprising:
    a first acquisition device configured to: after the parameter server node restarts, acquire the stored first-layer data from the non-volatile memory of the parameter server node, wherein the first-layer data comprises parameter server node header information and is used to query node information of the parameter server node and first-level storage sub-module information;
    a second acquisition device configured to: based on the first-layer data, acquire each piece of first-level storage sub-module header information in the stored second-layer data from the non-volatile memory of the parameter server node, wherein the second-layer data comprises M pieces of first-level storage sub-module header information and M pieces of first-level storage sub-module hash map information and is used to query the model parameters stored in each first-level storage sub-module of the parameter server node, M being the number of first-level storage sub-modules stored on the parameter server node;
    a recovery device configured to: based on each piece of first-level storage sub-module header information, traverse each piece of first-level storage sub-module hash map information in the second-layer data to recover the model parameters stored on the parameter server node.
  24. A computer-readable storage medium storing instructions which, when executed by at least one computing device, cause the at least one computing device to perform the recovery method for a parameter server node of any one of claims 12 to 22.
  25. An electronic device, comprising:
    a processor;
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to execute the instructions to implement the recovery method for a parameter server node of any one of claims 12 to 22.
PCT/CN2021/127609 2020-10-29 2021-10-29 Parameter server node recovery method and recovery system WO2022089607A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011187099.2A CN112181732A (en) 2020-10-29 2020-10-29 Recovery method and recovery system of parameter server node
CN202011187099.2 2020-10-29

Publications (1)

Publication Number Publication Date
WO2022089607A1 true WO2022089607A1 (en) 2022-05-05

Family

ID=73916216

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/127609 WO2022089607A1 (en) 2020-10-29 2021-10-29 Parameter server node recovery method and recovery system

Country Status (2)

Country Link
CN (1) CN112181732A (en)
WO (1) WO2022089607A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112181732A (en) * 2020-10-29 2021-01-05 第四范式(北京)技术有限公司 Recovery method and recovery system of parameter server node
CN112306682B (en) * 2020-10-29 2022-08-16 第四范式(北京)技术有限公司 Storage method and system of model parameters and parameter server cluster

Citations (8)

Publication number Priority date Publication date Assignee Title
CN103905540A (en) * 2014-03-25 2014-07-02 浪潮电子信息产业股份有限公司 Object storage data distribution mechanism based on two-stage hashing
US20150127982A1 (en) * 2011-09-30 2015-05-07 Accenture Global Services Limited Distributed computing backup and recovery system
CN107862064A (en) * 2017-11-16 2018-03-30 北京航空航天大学 A high-performance, scalable lightweight file system based on NVM
CN107888657A (en) * 2017-10-11 2018-04-06 上海交通大学 Low latency distributed memory system
CN110169040A (en) * 2018-07-10 2019-08-23 深圳花儿数据技术有限公司 Distributed data storage method and system based on multilayer consistency Hash
CN111641716A (en) * 2020-06-01 2020-09-08 第四范式(北京)技术有限公司 Self-healing method of parameter server, parameter server and parameter service system
CN112181732A (en) * 2020-10-29 2021-01-05 第四范式(北京)技术有限公司 Recovery method and recovery system of parameter server node
CN112306682A (en) * 2020-10-29 2021-02-02 第四范式(北京)技术有限公司 Storage method and system of model parameters and parameter server cluster

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
CN102779097B (en) * 2011-05-13 2015-01-21 上海振华重工(集团)股份有限公司 Memory access method of flow data
CN103456360B (en) * 2013-09-13 2016-08-17 昆腾微电子股份有限公司 The management method of nonvolatile memory and device
CN105608224A (en) * 2016-01-13 2016-05-25 广西师范大学 Orthogonal multilateral hash mapping indexing method for improving massive data query performance
JP6443572B1 (en) * 2018-02-02 2018-12-26 富士通株式会社 Storage control device, storage control method, and storage control program
CN110807125B (en) * 2019-08-03 2020-12-22 北京达佳互联信息技术有限公司 Recommendation system, data access method and device, server and storage medium
CN110968530B (en) * 2019-11-19 2021-12-03 华中科技大学 Key value storage system based on nonvolatile memory and memory access method


Also Published As

Publication number Publication date
CN112181732A (en) 2021-01-05


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21885333; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21885333; Country of ref document: EP; Kind code of ref document: A1)