WO2023050704A1 - Data caching method, system and device in AI cluster, and computer medium - Google Patents

Data caching method, system and device in AI cluster, and computer medium

Info

Publication number
WO2023050704A1
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
node
target
cluster node
shortest path
Prior art date
Application number
PCT/CN2022/078186
Other languages
English (en)
French (fr)
Inventor
姬贵阳
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司 filed Critical 苏州浪潮智能科技有限公司
Priority to US18/280,221 priority Critical patent/US20240152458A1/en
Publication of WO2023050704A1 publication Critical patent/WO2023050704A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0813Multiuser, multiprocessor or multiprocessing cache systems with a network or matrix configuration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0815Cache consistency protocols
    • G06F12/0817Cache consistency protocols using directory methods
    • G06F12/0824Distributed directories, e.g. linked lists of caches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24552Database cache management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0875Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with dedicated cache, e.g. instruction or stack
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0893Caches characterised by their organisation or structure
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/12Shortest path evaluation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/50Network services
    • H04L67/56Provisioning of proxy services
    • H04L67/568Storing data temporarily at an intermediate stage, e.g. caching

Definitions

  • the present application relates to the technical field of AI clusters, and more specifically, to a data caching method, system, device, and computer medium in an AI cluster.
  • With the booming development of artificial intelligence (AI) related industries, researchers at enterprises and universities have ever higher requirements for computing power, and building an AI cluster platform effectively meets the computing power requirements of enterprises and research universities.
  • One of the basic functions of an artificial intelligence platform is file operation, including the local downloading and caching of data sets, the reading of files during training, and a series of other file operations. These all depend on the storage resources of the cluster; an AI cluster has very high storage requirements and frequent IO operations, which makes storage resources the bottleneck of data caching in the AI cluster and affects the data caching performance of the AI cluster.
  • the purpose of this application is to provide a data caching method in an AI cluster, which can, to a certain extent, solve the technical problem of how to improve the data caching performance of an AI cluster.
  • the present application also provides a data caching system, device, and computer-readable storage medium in an AI cluster.
  • a data caching method in an AI cluster, comprising: determining a target data set to be cached; obtaining a weight value of the target data set on each cluster node in the AI cluster; determining a target cluster node for caching the target data set; obtaining target shortest paths from the remaining cluster nodes in the AI cluster to the target cluster node, and a predecessor node of the target cluster node in each target shortest path, where the remaining cluster nodes include the nodes in the AI cluster other than the target cluster node; and determining, based on the weight values, the target shortest paths and the predecessor nodes, a cache path for caching the target data set to the target cluster node, so as to cache the target data set to the target cluster node according to the cache path.
  • the obtaining the weight value of the target data set on each cluster node in the AI cluster includes:
  • if the cluster node is a management node, then determine the total number of cluster nodes in the AI cluster, and determine the total number of data sets on the shared storage nodes in the AI cluster;
  • a product value of the total number of cluster nodes and the total number of data sets is determined as the weight value of the management node.
  • after parsing the type of the cluster node, the method further includes:
  • if the cluster node is a non-management node, determining whether the target data set exists on the cluster node; if the target data set does not exist on the cluster node, determining that the weight value of the cluster node is infinite.
  • the method further includes:
  • if the target data set is stored on the cluster node, then determine the number of first-type tasks in which the cluster node pulls the target data set, determine the number of second-type tasks in which the target data set is pulled from the cluster node, and determine the sum of the number of first-type tasks, the number of second-type tasks, and 1 as the weight value of the cluster node.
  • the obtaining the target shortest path from other cluster nodes in the AI cluster to the target cluster node, and the predecessor node of the target cluster node in the shortest path includes:
  • determining a first node set, where the first node set is used to store first-type cluster nodes whose target shortest paths to the target cluster node are known;
  • determining a second node set, where the second node set is used to store the second-type cluster nodes in the AI cluster other than those in the first node set;
  • a data caching system in an AI cluster including:
  • a first determining module configured to determine a target data set to be cached
  • a first acquisition module configured to acquire the weight value of the target data set on each cluster node in the AI cluster
  • a second determining module configured to determine a target cluster node for caching the target data set
  • the second obtaining module is used to obtain the shortest path from other cluster nodes in the AI cluster to the target cluster node, and the predecessor node of the target cluster node in the shortest path;
  • a third determining module, configured to determine a cache path for caching the target data set to the target cluster node based on the weight value, the shortest path, and the predecessor node, so as to cache the target data set to the target cluster node according to the cache path.
  • the first acquisition module includes:
  • a first parsing unit configured to parse the type of the cluster node for each of the cluster nodes in the AI cluster
  • the first processing unit is configured to determine the total number of cluster nodes in the AI cluster if the cluster node is a management node, and determine the total number of data sets on the shared storage nodes in the AI cluster;
  • a product value of the total number of cluster nodes and the total number of data sets is determined as the weight value of the management node.
  • the first acquisition module also includes:
  • the second processing unit is configured to determine whether the target data set exists on the cluster node if the cluster node is a non-management node; if the target data set does not exist on the cluster node, determine The weight value of the cluster node is infinite.
  • the second acquisition module includes:
  • the first determination unit is configured to determine a first node set, and the first node set is used to store a first type of cluster node whose target shortest path with the target cluster node is known;
  • the second determination unit is configured to determine a second node set, and the second node set is used to store the second type of cluster nodes in the AI cluster except the first node set;
  • a third determining unit configured to determine the first shortest path between each second-type cluster node and the target cluster node
  • the first setting unit is used to use the second type of cluster node corresponding to the first shortest path with the smallest value as the cluster node to be determined;
  • the fourth determining unit is configured to: for each second-type cluster node, determine the second shortest path between the second-type cluster node and the cluster node to be determined, and determine the sum of the first shortest path corresponding to the cluster node to be determined and the second shortest path; if the first shortest path corresponding to the second-type cluster node is less than the sum, update the target shortest path of the second-type cluster node to the first shortest path corresponding to the second-type cluster node; if the first shortest path corresponding to the second-type cluster node is greater than the sum, update the target shortest path of the second-type cluster node to the sum, where the predecessor node of the target cluster node in the shortest path corresponding to the second-type cluster node is the cluster node to be determined;
  • a first updating unit configured to update the cluster node to be determined as a cluster node of the first type
  • the first judging unit is used to judge whether the first node set contains all the cluster nodes, if not, return to the step of using the second type of cluster node corresponding to the first shortest path with the smallest value as the cluster node to be determined, if so, then end.
  • a data caching device in an AI cluster, including:
  • a memory configured to store a computer program; and
  • a processor configured to implement the steps of any one of the above-mentioned methods for caching data in an AI cluster when executing the computer program.
  • a computer-readable storage medium wherein a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the steps of any one of the above-mentioned data caching methods in an AI cluster are implemented.
  • the application provides a data caching method in an AI cluster, which includes: determining the target data set to be cached; obtaining the weight value of the target data set on each cluster node in the AI cluster; determining the target cluster node for caching the target data set; obtaining the target shortest path from the rest of the cluster nodes in the AI cluster to the target cluster node, and the predecessor node of the target cluster node in each target shortest path, where the remaining cluster nodes include the nodes in the AI cluster except the target cluster node; and determining a cache path for caching the target data set to the target cluster node based on the weight value, the target shortest path and the predecessor node, so as to cache the target data set to the target cluster node according to the cache path.
  • because the weight value can reflect the storage capacity consumed by the target data set on each cluster node, the target shortest path can reflect the storage capacity required to cache the target data set in the AI cluster, and the predecessor node can indicate the caching direction of the target data set in the AI cluster, determining the cache path based on the weight value, the target shortest path and the predecessor node allows the cache path to match the storage capacity of the AI cluster; subsequently caching the target data set based on the cache path is then equivalent to caching the data set based on the storage performance of the AI cluster, which can improve the data caching performance of the AI cluster.
  • the data caching system, equipment, and computer-readable storage medium in an AI cluster provided by the present application also solve corresponding technical problems.
  • FIG. 1 is a flowchart of a data caching method in an AI cluster provided by an embodiment of the present application
  • Fig. 2 is the determination flowchart of target shortest path and predecessor node in the present application
  • FIG. 3 is a schematic structural diagram of a data caching system in an AI cluster provided by an embodiment of the present application
  • FIG. 4 is a schematic structural diagram of a data cache device in an AI cluster provided by an embodiment of the present application.
  • FIG. 5 is another schematic structural diagram of a data cache device in an AI cluster provided by an embodiment of the present application.
  • FIG. 1 is a flowchart of a data caching method in an AI cluster provided by an embodiment of the present application.
  • Step S101 Determine the target data set to be cached.
  • in practical applications, because there are multiple data sets in the AI cluster and a user may cache only one or several of them, the target data set to be cached can be determined first; the type, content, size, etc. of the data set can all be determined according to actual needs, and this application does not make specific limitations here.
  • Step S102 Obtain the weight value of the target data set on each cluster node in the AI cluster.
  • after the target data set to be cached is determined, the weight value of the target data set on each cluster node in the AI cluster can be obtained; it is not difficult to understand that the higher the weight value, the more storage resources the target data set occupies on the cluster node, so this application can use the weight value to reflect the storage resources on the cluster nodes consumed by the target data set.
  • Step S103 Determine the target cluster node for caching the target data set.
  • the user may need to cache the target data set on a cluster node, so it is necessary to determine the target cluster node for caching the target data set.
  • the corresponding target cluster node can be determined according to the cache command sent by the user. This application does not make specific limitations here.
  • Step S104 Obtain the target shortest path from other cluster nodes in the AI cluster to the target cluster node, and the predecessor nodes of the target cluster node in the target shortest path, and the remaining cluster nodes include nodes in the AI cluster except the target cluster node.
  • each cluster node in the AI cluster can transmit data sets to the target cluster node, but considering that the distribution of the target data set in the AI cluster is not uniform, for example the target data set does not exist on some cluster nodes, and that the shortest path from each cluster node to the target cluster node is different, after determining the target cluster node it is necessary to obtain the target shortest path from the remaining cluster nodes in the AI cluster to the target cluster node, and the predecessor node of the target cluster node in the target shortest path, where the remaining cluster nodes include nodes in the AI cluster other than the target cluster node, so that the cache path of the target cluster node can be determined with the help of the target shortest path and the predecessor node.
  • the predecessor node refers to the cluster node before the target cluster node on the shortest path from another cluster node to the target cluster node; for example, if the other cluster node is a, the target cluster node is v, and the shortest path from a to v has a length of 4, specifically a-b-c-d-v, then the predecessor node of v can be c, and so on.
  • Step S105 Determine a cache path for caching the target data set to the target cluster node based on the weight value, the target shortest path, and the predecessor node, so as to cache the target data set to the target cluster node according to the cache path.
  • after the weight value, the target shortest path and the predecessor node are obtained, the cache path for caching the target data set to the target cluster node can be determined based on the weight value, the target shortest path and the predecessor node, so that the target data set is cached to the target cluster node according to the cache path.
  • for example, the other cluster node with the smallest weight value can be used as the transmission node for transmitting the target data set, and the target data set can be transmitted to the target cluster node according to the target shortest path and the predecessor node of that transmission node; other ways of determining the cache path are also possible, and this application does not make specific limitations here.
  • the application provides a data caching method in an AI cluster, which includes: determining the target data set to be cached; obtaining the weight value of the target data set on each cluster node in the AI cluster; determining the target cluster node for caching the target data set; obtaining the target shortest path from the rest of the cluster nodes in the AI cluster to the target cluster node, and the predecessor node of the target cluster node in each target shortest path, where the remaining cluster nodes include the nodes in the AI cluster except the target cluster node; and determining a cache path for caching the target data set to the target cluster node based on the weight value, the target shortest path and the predecessor node, so as to cache the target data set to the target cluster node according to the cache path.
  • because the weight value can reflect the storage capacity consumed by the target data set on each cluster node, the target shortest path can reflect the storage capacity required to cache the target data set in the AI cluster, and the predecessor node can indicate the caching direction of the target data set in the AI cluster, determining the cache path based on the weight value, the target shortest path and the predecessor node allows the cache path to match the storage capacity of the AI cluster; subsequently caching the target data set based on the cache path is then equivalent to caching the data set based on the storage performance of the AI cluster, which can improve the data caching performance of the AI cluster.
  • in the data caching method in an AI cluster provided by the embodiment of the present application, in the process of obtaining the weight value of the target data set on each cluster node in the AI cluster, the type of each cluster node in the AI cluster can be parsed.
  • if the cluster node is a management node, determine the total number of cluster nodes in the AI cluster and the total number of data sets on the shared storage node in the AI cluster, and determine the product of the total number of cluster nodes and the total number of data sets as the weight value of the management node; if the cluster node is a non-management node, judge whether the target data set exists on the cluster node; if the target data set does not exist on the cluster node, determine that the weight value of the cluster node is infinite; if the target data set exists on the cluster node, determine the number of first-type tasks in which the cluster node pulls the target data set, determine the number of second-type tasks in which the target data set is pulled from the cluster node, and determine the sum of the number of first-type tasks, the number of second-type tasks and 1 as the weight value of the cluster node. It should be noted that a management node refers to a node with management functions in the AI cluster, and a shared storage node refers to a node whose data can be shared by all cluster nodes in the AI cluster.
  • FIG. 2 is a flow chart of determining the target shortest path and the predecessor node in this application.
  • in the data caching method in an AI cluster provided by an embodiment of the present application, the process of obtaining the target shortest path from other cluster nodes in the AI cluster to the target cluster node, and the predecessor node of the target cluster node in the shortest path, may include the following steps:
  • Step S201 Determine a first node set, and the first node set is used to store cluster nodes of the first type whose target shortest path with the target cluster node is known.
  • in practical applications, the target shortest path between a first-type cluster node and the target cluster node may already be known; in this case, the target shortest path of the first-type cluster node no longer needs attention, and the first node set can be used to manage the first-type cluster nodes. It should be noted that if the target shortest path of a first-type cluster node is known, its corresponding predecessor node is also known.
  • Step S202 Determine a second node set, the second node set is used to store the second type of cluster nodes in the AI cluster except the first node set.
  • in practical applications, to facilitate the management of the second-type cluster nodes whose target shortest paths are unknown, the second-type cluster nodes may be processed collectively by means of the second node set. Assuming that the first node set is S and the total set of cluster nodes in the AI cluster is V, the second node set may be V-S.
  • Step S203 Determine the first shortest path between each second-type cluster node and the target cluster node.
  • in practical applications, because the target shortest path between a second-type cluster node and the target cluster node can be split into the sum of the shortest path from the predecessor node to the target cluster node and the shortest path from the second-type cluster node to the predecessor node, the target shortest path between the second-type cluster node and the target cluster node and the corresponding predecessor node can be determined by means of the first shortest path between the second-type cluster node and the target cluster node. Assuming that the target cluster node is v and a second-type cluster node is i, the first shortest path can be expressed as dist[i]=G[i][v].
  • Step S204 Use the cluster node of the second type corresponding to the first shortest path with the smallest value as the cluster node to be determined.
  • Step S205 For each second-type cluster node, determine the second shortest path between the second-type cluster node and the cluster node to be determined, and determine the sum of the first shortest path corresponding to the cluster node to be determined and the second shortest path; if the first shortest path corresponding to the second-type cluster node is less than the sum, update the target shortest path of the second-type cluster node to the first shortest path corresponding to the second-type cluster node; if the first shortest path corresponding to the second-type cluster node is greater than the sum, update the target shortest path of the second-type cluster node to the sum, where the predecessor node of the target cluster node in the shortest path corresponding to the second-type cluster node is the cluster node to be determined.
  • in practical applications, to facilitate determining the target shortest path and the predecessor node, the second-type cluster node corresponding to the first shortest path with the smallest value can first be used as the cluster node to be determined, that is, verified as a predecessor node; then, for each second-type cluster node, the second shortest path between the second-type cluster node and the cluster node to be determined is determined, and the sum of the first shortest path corresponding to the cluster node to be determined and the second shortest path is determined; if the first shortest path corresponding to the second-type cluster node is less than the sum, the target shortest path of the second-type cluster node is updated to the first shortest path corresponding to the second-type cluster node; if the first shortest path corresponding to the second-type cluster node is greater than the sum, the target shortest path of the second-type cluster node is updated to the sum, and the predecessor node of the target cluster node in the shortest path corresponding to the second-type cluster node is the cluster node to be determined. For ease of understanding, assuming the cluster node to be determined is k, that is, dist[k]=min{dist[i]}, the target shortest path is then dist[i]=min{dist[i], dist[k]+G[i][k]}.
  • Step S206 Update the cluster node to be determined to be the cluster node of the first type.
  • Step S207 Determine whether the first node set includes all cluster nodes, if not, return to step S204; if yes, end.
  • in practical applications, after verifying whether the cluster node to be determined is a predecessor node, the cluster node to be determined can be updated to a first-type cluster node, and it is judged whether the first node set contains all the cluster nodes; if not, return to step S204; if yes, the process can end directly, and at this time the target shortest path between each second-type cluster node and the target cluster node and the corresponding predecessor node can be obtained.
  • FIG. 3 is a schematic structural diagram of a data caching system in an AI cluster provided by an embodiment of the present application.
  • the first determination module 101 is configured to determine the target data set to be cached
  • the first obtaining module 102 is used to obtain the weight value of the target data set on each cluster node in the AI cluster;
  • the second determining module 103 is configured to determine the target cluster node for caching the target data set
  • the second obtaining module 104 is used to obtain the shortest path from other cluster nodes in the AI cluster to the target cluster node, and the predecessor node of the target cluster node in the shortest path;
  • the third determining module 105 is configured to determine a cache path for caching the target data set to the target cluster node based on the weight value, the shortest path, and the predecessor node, so as to cache the target data set to the target cluster node according to the cache path.
  • the first acquisition module includes:
  • the first parsing unit is configured to parse the type of the cluster node for each cluster node in the AI cluster
  • the first processing unit is used to determine, if the cluster node is a management node, the total number of cluster nodes in the AI cluster and the total number of data sets on the shared storage nodes in the AI cluster, and to determine the product of the total number of cluster nodes and the total number of data sets as the weight value of the management node;
  • the first acquisition module may further include:
  • the second processing unit is configured to determine whether the target data set exists on the cluster node if the cluster node is a non-management node; if the target data set does not exist on the cluster node, determine that the weight value of the cluster node is infinite.
  • the second processing unit may also be used to determine the number of first-type tasks for the cluster node to pull the target data set if the target data set is stored on the cluster node, Determine the number of tasks of the second type from which the cluster nodes pull the target data set, and determine the sum of the number of tasks of the first type, the number of tasks of the second type, and 1 as the weight value of the cluster node.
  • the second acquisition module may include:
  • the first determination unit is configured to determine a first node set, and the first node set is used to store a first type of cluster node whose target shortest path with the target cluster node is known;
  • the second determination unit is configured to determine a second node set, and the second node set is used to store the second type of cluster nodes in the AI cluster except the first node set;
  • a third determining unit configured to determine the first shortest path between each second-type cluster node and the target cluster node
  • the first setting unit is used to use the second type of cluster node corresponding to the first shortest path with the smallest value as the cluster node to be determined;
  • the fourth determining unit is configured to: for each second-type cluster node, determine the second shortest path between the second-type cluster node and the cluster node to be determined, and determine the sum of the first shortest path corresponding to the cluster node to be determined and the second shortest path; if the first shortest path corresponding to the second-type cluster node is less than the sum, update the target shortest path of the second-type cluster node to the first shortest path corresponding to the second-type cluster node; if the first shortest path corresponding to the second-type cluster node is greater than the sum, update the target shortest path of the second-type cluster node to the sum, where the predecessor node of the target cluster node in the shortest path corresponding to the second-type cluster node is the cluster node to be determined;
  • a first updating unit configured to update the cluster node to be determined as a cluster node of the first type
  • the first judging unit is used to judge whether the first node set contains all the cluster nodes, if not, return to the step of using the second type of cluster node corresponding to the first shortest path with the smallest value as the cluster node to be determined, if so, then end.
  • the present application also provides a data caching device in an AI cluster and a computer-readable storage medium, both of which have corresponding effects of the data caching method in an AI cluster provided in the embodiment of the present application. Please refer to FIG. 4 .
  • FIG. 4 is a schematic structural diagram of a data cache device in an AI cluster provided by an embodiment of the present application.
  • a data caching device in an AI cluster provided in an embodiment of the present application includes a memory 201 and a processor 202; a computer program is stored in the memory 201, and the processor 202 implements the following steps when executing the computer program:
  • determining the target data set to be cached; obtaining the weight value of the target data set on each cluster node in the AI cluster; determining the target cluster node for caching the target data set; obtaining the target shortest path from the remaining cluster nodes in the AI cluster to the target cluster node, and the predecessor node of the target cluster node in the target shortest path, where the remaining cluster nodes include nodes in the AI cluster other than the target cluster node;
  • a cache path for caching the target data set to the target cluster node is determined based on the weight value, the target shortest path, and the predecessor node, so as to cache the target data set to the target cluster node according to the cache path.
  • a data caching device in an AI cluster provided by an embodiment of the present application includes a memory 201 and a processor 202.
  • a computer program is stored in the memory 201.
  • when the processor 202 executes the computer program, the following steps are implemented: for each cluster node in the AI cluster, parse the type of the cluster node; if the cluster node is a management node, determine the total number of cluster nodes in the AI cluster, and determine the total number of data sets on the shared storage nodes in the AI cluster; and determine the product of the total number of cluster nodes and the total number of data sets as the weight value of the management node.
  • a data caching device in an AI cluster provided by an embodiment of the present application includes a memory 201 and a processor 202.
  • a computer program is stored in the memory 201.
  • when the processor 202 executes the computer program, the following steps are implemented: after parsing the type of the cluster node, if the cluster node is a non-management node, judge whether the target data set exists on the cluster node; if the target data set does not exist on the cluster node, determine that the weight value of the cluster node is infinite.
  • a data caching device in an AI cluster includes a memory 201 and a processor 202.
  • a computer program is stored in the memory 201.
  • when the processor 202 executes the computer program, the following steps are implemented: after judging whether the target data set exists on the cluster node, if the target data set exists on the cluster node, determine the number of first-type tasks in which the cluster node pulls the target data set, determine the number of second-type tasks in which the target data set is pulled from the cluster node, and determine the sum of the number of first-type tasks, the number of second-type tasks and 1 as the weight value of the cluster node.
  • a data caching device in an AI cluster includes a memory 201 and a processor 202.
  • a computer program is stored in the memory 201.
  • when the processor 202 executes the computer program, the following steps are implemented: determine a first node set, the first node set being used to store first-type cluster nodes whose target shortest paths to the target cluster node are known; determine a second node set, the second node set being used to store the second-type cluster nodes in the AI cluster other than those in the first node set; determine the first shortest path from each second-type cluster node to the target cluster node; take the second-type cluster node corresponding to the first shortest path with the smallest value as the cluster node to be determined; for each second-type cluster node, determine the second shortest path between the second-type cluster node and the cluster node to be determined, and determine the sum of the first shortest path corresponding to the cluster node to be determined and the second shortest path; if the first shortest path corresponding to the second-type cluster node is less than the sum, update the target shortest path of the second-type cluster node to the first shortest path corresponding to the second-type cluster node; if the first shortest path corresponding to the second-type cluster node is greater than the sum, update the target shortest path of the second-type cluster node to the sum, with the predecessor node of the target cluster node in the shortest path corresponding to the second-type cluster node being the cluster node to be determined; update the cluster node to be determined to a first-type cluster node; and judge whether the first node set contains all the cluster nodes; if not, return to the step of taking the second-type cluster node corresponding to the first shortest path with the smallest value as the cluster node to be determined; if so, end.
  • referring to FIG. 5, another data caching device in the AI cluster may also include: an input port 203 connected to the processor 202, used to transmit externally input commands to the processor 202;
  • a display unit 204 connected to the processor 202, used to display the processing results of the processor 202 to the outside world; and
  • a communication module 205 connected to the processor 202, used to realize the communication between the data caching device in the AI cluster and the outside world.
  • the display unit 204 can be a display panel, a laser scanning display, etc.; the communication methods adopted by the communication module 205 include but are not limited to mobile high-definition link technology (HML), universal serial bus (USB), high-definition multimedia interface (HDMI), and wireless connections, for example wireless fidelity technology (WiFi), Bluetooth communication technology, low-power Bluetooth communication technology, and communication technology based on IEEE 802.11s.
  • a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the following steps are implemented:
  • determining the target data set to be cached; obtaining the weight value of the target data set on each cluster node in the AI cluster; determining the target cluster node for caching the target data set; obtaining the target shortest path from the remaining cluster nodes in the AI cluster to the target cluster node, and the predecessor node of the target cluster node in the target shortest path, where the remaining cluster nodes include nodes in the AI cluster other than the target cluster node;
  • a cache path for caching the target data set to the target cluster node is determined based on the weight value, the target shortest path, and the predecessor node, so as to cache the target data set to the target cluster node according to the cache path.
  • a computer-readable storage medium provided in an embodiment of the present application stores a computer program, and when the computer program is executed by a processor, the following steps are implemented: for each cluster node in the AI cluster, parse the type of the cluster node; if the cluster node is a management node, determine the total number of cluster nodes in the AI cluster, and determine the total number of data sets on the shared storage nodes in the AI cluster; and determine the product of the total number of cluster nodes and the total number of data sets as the weight value of the management node.
  • a computer-readable storage medium provided by an embodiment of the present application stores a computer program, and when the computer program is executed by a processor, the following steps are implemented: after parsing the type of the cluster node, if the cluster node is a non-management node, judge whether the target data set exists on the cluster node; if the target data set does not exist on the cluster node, determine that the weight value of the cluster node is infinite.
  • a computer program is stored in the computer-readable storage medium.
  • the following steps are implemented: after judging whether the target data set exists on the cluster node, if the target data set exists on the cluster node, then determine the number of first-type tasks in which the cluster node pulls the target data set, determine the number of second-type tasks in which the target data set is pulled from the cluster node, and determine the sum of the number of first-type tasks, the number of second-type tasks and 1 as the weight value of the cluster node.
  • a computer-readable storage medium provided by an embodiment of the present application.
  • a computer program is stored in the computer-readable storage medium. When the computer program is executed by a processor, the following steps are implemented: determine a first node set, the first node set being used to store first-type cluster nodes whose target shortest paths to the target cluster node are known; determine a second node set, the second node set being used to store the second-type cluster nodes in the AI cluster other than those in the first node set; determine the first shortest path from each second-type cluster node to the target cluster node; take the second-type cluster node corresponding to the first shortest path with the smallest value as the cluster node to be determined; for each second-type cluster node, determine the second shortest path between the second-type cluster node and the cluster node to be determined, and determine the sum of the first shortest path corresponding to the cluster node to be determined and the second shortest path; if the first shortest path corresponding to the second-type cluster node is less than the sum, update the target shortest path of the second-type cluster node to the first shortest path corresponding to the second-type cluster node; if the first shortest path corresponding to the second-type cluster node is greater than the sum, update the target shortest path of the second-type cluster node to the sum, with the predecessor node of the target cluster node in the shortest path corresponding to the second-type cluster node being the cluster node to be determined; update the cluster node to be determined to a first-type cluster node; and judge whether the first node set contains all the cluster nodes; if not, return to the step of taking the second-type cluster node corresponding to the first shortest path with the smallest value as the cluster node to be determined; if so, end.
  • the computer-readable storage medium involved in this application includes random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM , or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application discloses a data caching method, system and device in an AI cluster, and a computer medium. A target data set to be cached is determined; the weight value of the target data set on each cluster node in the AI cluster is obtained; a target cluster node for caching the target data set is determined; the target shortest paths from the remaining cluster nodes in the AI cluster to the target cluster node, and the predecessor node of the target cluster node in each target shortest path, are obtained, the remaining cluster nodes including the nodes in the AI cluster other than the target cluster node; and a cache path for caching the target data set to the target cluster node is determined on the basis of the weight values, the target shortest paths and the predecessor nodes, so that the target data set is cached to the target cluster node according to the cache path. The present application can make the cache path match the storage capability of the AI cluster; caching the target data set on the basis of the cache path is then equivalent to caching the data set on the basis of the storage performance of the AI cluster, which can improve the data caching performance of the AI cluster.

Description

Data caching method, system and device in AI cluster, and computer medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 30, 2021, with application number 202111162807.1 and entitled "Data caching method, system and device in AI cluster, and computer medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of AI clusters, and more specifically, to a data caching method, system and device in an AI cluster, and a computer medium.
Background
With the booming development of artificial intelligence (AI) related industries, researchers at more and more enterprises and universities have ever higher requirements for computing power, and the construction of AI cluster platforms effectively meets the computing power requirements of enterprises and research universities. One of the basic functions of an artificial intelligence platform is file operation, including the local downloading and caching of data sets, the reading of files during training, and a series of other file operations. These all depend on the storage resources of the cluster; moreover, an AI cluster has very high storage requirements and frequent IO operations, which makes storage resources the bottleneck of data caching in the AI cluster and affects the data caching performance of the AI cluster.
In summary, how to improve the data caching performance of an AI cluster is a problem that urgently needs to be solved by those skilled in the art.
Summary of the Invention
The purpose of the present application is to provide a data caching method in an AI cluster, which can, to a certain extent, solve the technical problem of how to improve the data caching performance of an AI cluster. The present application also provides a data caching system and device in an AI cluster, and a computer-readable storage medium.
To achieve the above purpose, the present application provides the following technical solutions:
A data caching method in an AI cluster includes:
determining a target data set to be cached;
obtaining a weight value of the target data set on each cluster node in the AI cluster;
determining a target cluster node for caching the target data set;
obtaining target shortest paths from the remaining cluster nodes in the AI cluster to the target cluster node, and a predecessor node of the target cluster node in each target shortest path, the remaining cluster nodes including the nodes in the AI cluster other than the target cluster node; and
determining, on the basis of the weight values, the target shortest paths and the predecessor nodes, a cache path for caching the target data set to the target cluster node, so as to cache the target data set to the target cluster node according to the cache path.
Optionally, the obtaining the weight value of the target data set on each cluster node in the AI cluster includes:
for each cluster node in the AI cluster, parsing the type of the cluster node;
if the cluster node is a management node, determining the total number of cluster nodes in the AI cluster, and determining the total number of data sets on a shared storage node in the AI cluster; and
determining the product of the total number of cluster nodes and the total number of data sets as the weight value of the management node.
Optionally, after the parsing the type of the cluster node, the method further includes:
if the cluster node is a non-management node, judging whether the target data set is stored on the cluster node; and
if the target data set is not stored on the cluster node, determining that the weight value of the cluster node is infinite.
Optionally, after the judging whether the target data set is stored on the cluster node, the method further includes:
if the target data set is stored on the cluster node, determining the number of first-type tasks in which the cluster node pulls the target data set, determining the number of second-type tasks in which the target data set is pulled from the cluster node, and determining the sum of the number of first-type tasks, the number of second-type tasks and 1 as the weight value of the cluster node.
Optionally, the obtaining the target shortest paths from the remaining cluster nodes in the AI cluster to the target cluster node, and the predecessor node of the target cluster node in each shortest path, includes:
determining a first node set, the first node set being used to store first-type cluster nodes whose target shortest paths to the target cluster node are known;
determining a second node set, the second node set being used to store the second-type cluster nodes in the AI cluster other than those in the first node set;
determining a first shortest path from each second-type cluster node to the target cluster node;
taking the second-type cluster node corresponding to the first shortest path with the smallest value as a cluster node to be determined;
for each second-type cluster node, determining a second shortest path from the second-type cluster node to the cluster node to be determined, and determining the sum of the first shortest path corresponding to the cluster node to be determined and the second shortest path; if the first shortest path corresponding to the second-type cluster node is smaller than the sum, updating the target shortest path of the second-type cluster node to the first shortest path corresponding to the second-type cluster node; if the first shortest path corresponding to the second-type cluster node is greater than the sum, updating the target shortest path of the second-type cluster node to the sum, the predecessor node of the target cluster node in the shortest path corresponding to the second-type cluster node being the cluster node to be determined;
updating the cluster node to be determined to a first-type cluster node; and
judging whether the first node set contains all the cluster nodes; if not, returning to the step of taking the second-type cluster node corresponding to the first shortest path with the smallest value as the cluster node to be determined; and if so, ending.
A data caching system in an AI cluster includes:
a first determining module, configured to determine a target data set to be cached;
a first obtaining module, configured to obtain a weight value of the target data set on each cluster node in the AI cluster;
a second determining module, configured to determine a target cluster node for caching the target data set;
a second obtaining module, configured to obtain shortest paths from the remaining cluster nodes in the AI cluster to the target cluster node, and a predecessor node of the target cluster node in each shortest path; and
a third determining module, configured to determine, on the basis of the weight values, the shortest paths and the predecessor nodes, a cache path for caching the target data set to the target cluster node, so as to cache the target data set to the target cluster node according to the cache path.
Optionally, the first obtaining module includes:
a first parsing unit, configured to parse, for each cluster node in the AI cluster, the type of the cluster node; and
a first processing unit, configured to: if the cluster node is a management node, determine the total number of cluster nodes in the AI cluster, determine the total number of data sets on a shared storage node in the AI cluster, and
determine the product of the total number of cluster nodes and the total number of data sets as the weight value of the management node.
Optionally, the first obtaining module further includes:
a second processing unit, configured to: if the cluster node is a non-management node, judge whether the target data set is stored on the cluster node, and if the target data set is not stored on the cluster node, determine that the weight value of the cluster node is infinite.
Optionally, the second obtaining module includes:
a first determining unit, configured to determine a first node set, the first node set being used to store first-type cluster nodes whose target shortest paths to the target cluster node are known;
a second determining unit, configured to determine a second node set, the second node set being used to store the second-type cluster nodes in the AI cluster other than those in the first node set;
a third determining unit, configured to determine a first shortest path from each second-type cluster node to the target cluster node;
a first setting unit, configured to take the second-type cluster node corresponding to the first shortest path with the smallest value as a cluster node to be determined;
a fourth determining unit, configured to: for each second-type cluster node, determine a second shortest path from the second-type cluster node to the cluster node to be determined, and determine the sum of the first shortest path corresponding to the cluster node to be determined and the second shortest path; if the first shortest path corresponding to the second-type cluster node is smaller than the sum, update the target shortest path of the second-type cluster node to the first shortest path corresponding to the second-type cluster node; if the first shortest path corresponding to the second-type cluster node is greater than the sum, update the target shortest path of the second-type cluster node to the sum, the predecessor node of the target cluster node in the shortest path corresponding to the second-type cluster node being the cluster node to be determined;
a first updating unit, configured to update the cluster node to be determined to a first-type cluster node; and
a first judging unit, configured to judge whether the first node set contains all the cluster nodes, and if not, return to the step of taking the second-type cluster node corresponding to the first shortest path with the smallest value as the cluster node to be determined, and if so, end.
A data caching device in an AI cluster includes:
a memory, configured to store a computer program; and
a processor, configured to implement the steps of the data caching method in an AI cluster according to any one of the above when executing the computer program.
A computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps of the data caching method in an AI cluster according to any one of the above are implemented.
The present application provides a data caching method in an AI cluster: a target data set to be cached is determined; the weight value of the target data set on each cluster node in the AI cluster is obtained; a target cluster node for caching the target data set is determined; the target shortest paths from the remaining cluster nodes in the AI cluster to the target cluster node, and the predecessor node of the target cluster node in each target shortest path, are obtained, the remaining cluster nodes including the nodes in the AI cluster other than the target cluster node; and a cache path for caching the target data set to the target cluster node is determined on the basis of the weight values, the target shortest paths and the predecessor nodes, so that the target data set is cached to the target cluster node according to the cache path. In the present application, because the weight value reflects the storage capacity consumed by the target data set on each cluster node, the target shortest path reflects the storage capacity required to cache the target data set in the AI cluster, and the predecessor node indicates the caching direction of the target data set in the AI cluster, determining the cache path for caching the target data set to the target cluster node on the basis of the weight values, the target shortest paths and the predecessor nodes allows the cache path to match the storage capability of the AI cluster; subsequently caching the target data set on the basis of the cache path is then equivalent to caching the data set on the basis of the storage performance of the AI cluster, which can improve the data caching performance of the AI cluster. The data caching system and device in an AI cluster, and the computer-readable storage medium, provided by the present application also solve the corresponding technical problems.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the accompanying drawings described below are merely embodiments of the present application, and those of ordinary skill in the art can derive other drawings from the provided drawings without creative effort.
FIG. 1 is a flowchart of a data caching method in an AI cluster provided by an embodiment of the present application;
FIG. 2 is a flowchart of determining the target shortest path and the predecessor node in the present application;
FIG. 3 is a schematic structural diagram of a data caching system in an AI cluster provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a data caching device in an AI cluster provided by an embodiment of the present application;
FIG. 5 is another schematic structural diagram of a data caching device in an AI cluster provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Apparently, the described embodiments are merely some rather than all of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present application without creative effort fall within the protection scope of the present application.
Please refer to FIG. 1, which is a flowchart of a data caching method in an AI cluster provided by an embodiment of the present application.
A data caching method in an AI cluster provided by an embodiment of the present application may include the following steps.
Step S101: determine a target data set to be cached.
In practical applications, because there are multiple data sets in the AI cluster and a user may cache only one or several of them, the target data set to be cached may be determined first. The type, content, size and the like of the data set can all be determined according to actual needs, and the present application does not impose specific limitations here.
Step S102: obtain the weight value of the target data set on each cluster node in the AI cluster.
In practical applications, after the target data set to be cached is determined, the weight value of the target data set on each cluster node in the AI cluster can be obtained. It is not difficult to understand that the higher the weight value, the more storage resources the target data set occupies on the cluster node, so the present application can use the weight value to reflect the storage resources on the cluster node consumed by the target data set.
Step S103: determine the target cluster node for caching the target data set.
In practical applications, the user may need to cache the target data set on a certain cluster node, so the target cluster node for caching the target data set also needs to be determined. Specifically, the corresponding target cluster node can be determined according to a cache instruction sent by the user, and the present application does not impose specific limitations here.
Step S104: obtain the target shortest paths from the remaining cluster nodes in the AI cluster to the target cluster node, and the predecessor node of the target cluster node in each target shortest path, the remaining cluster nodes including the nodes in the AI cluster other than the target cluster node.
In practical applications, since the cluster nodes in the AI cluster are interconnected, in theory each cluster node can transmit data sets to the target cluster node. However, considering that the distribution of the target data set in the AI cluster is not uniform, for example the target data set does not exist on some cluster nodes, and that the shortest path from each cluster node to the target cluster node differs, after the target cluster node is determined it is also necessary to obtain the target shortest paths from the remaining cluster nodes in the AI cluster to the target cluster node, and the predecessor node of the target cluster node in each target shortest path, the remaining cluster nodes including the nodes in the AI cluster other than the target cluster node, so that the cache path of the target cluster node can subsequently be determined with the help of the target shortest paths and the predecessor nodes.
It should be noted that the predecessor node refers to a cluster node before the target cluster node on the shortest path from another cluster node to the target cluster node. For example, if the other cluster node is a, the target cluster node is v, and the shortest path from a to v has a length of 4 and is specifically a-b-c-d-v, then a predecessor node of v may be c, and so on.
Step S105: determine, on the basis of the weight values, the target shortest paths and the predecessor nodes, a cache path for caching the target data set to the target cluster node, so as to cache the target data set to the target cluster node according to the cache path.
In practical applications, after the weight values, the target shortest paths and the predecessor nodes are obtained, the cache path for caching the target data set to the target cluster node can be determined on the basis of the weight values, the target shortest paths and the predecessor nodes, so that the target data set is cached to the target cluster node according to the cache path. For example, the other cluster node with the smallest weight value may be taken as the transmission node for transmitting the target data set, and the target data set may be transmitted to the target cluster node according to the target shortest path and the predecessor node of that transmission node; of course, other ways of determining the cache path are also possible, and the present application does not impose specific limitations here.
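For illustration only, a minimal sketch of this selection step is given below in Python. It rests on assumptions that the application does not state: weights, dist and next_hop are hypothetical mappings holding, respectively, the per-node weight values, the lengths of the target shortest paths, and the node through which each shortest path leaves its source node (one possible reading of the predecessor bookkeeping), and the tie-breaking rule is an illustrative choice rather than something prescribed by the application.

    import math

    def choose_cache_path(weights, dist, next_hop, target):
        # weights[n]:  weight value of the target data set on cluster node n
        # dist[n]:     length of node n's target shortest path to `target`
        # next_hop[n]: node through which that shortest path leaves n
        candidates = [n for n in weights
                      if n != target
                      and weights[n] != math.inf
                      and dist.get(n, math.inf) != math.inf]
        if not candidates:
            raise ValueError("no reachable cluster node holds the target data set")

        # One policy suggested above: transmit from the node with the smallest
        # weight value, breaking ties by the shorter path (illustrative choice).
        source = min(candidates, key=lambda n: (weights[n], dist[n]))

        # Walk the recorded hops from the source until the target is reached.
        path, node = [source], source
        while node != target:
            node = next_hop[node]
            path.append(node)
        return source, path

In this reading, the cache path is simply the recorded shortest path of the chosen transmission node, walked hop by hop until the target cluster node is reached.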
The present application provides a data caching method in an AI cluster: a target data set to be cached is determined; the weight value of the target data set on each cluster node in the AI cluster is obtained; the target cluster node for caching the target data set is determined; the target shortest paths from the remaining cluster nodes in the AI cluster to the target cluster node, and the predecessor node of the target cluster node in each target shortest path, are obtained, the remaining cluster nodes including the nodes in the AI cluster other than the target cluster node; and a cache path for caching the target data set to the target cluster node is determined on the basis of the weight values, the target shortest paths and the predecessor nodes, so that the target data set is cached to the target cluster node according to the cache path. In the present application, because the weight value reflects the storage capacity consumed by the target data set on each cluster node, the target shortest path reflects the storage capacity required to cache the target data set in the AI cluster, and the predecessor node indicates the caching direction of the target data set in the AI cluster, determining the cache path for caching the target data set to the target cluster node on the basis of the weight values, the target shortest paths and the predecessor nodes allows the cache path to match the storage capability of the AI cluster; subsequently caching the target data set on the basis of the cache path is then equivalent to caching the data set on the basis of the storage performance of the AI cluster, which can improve the data caching performance of the AI cluster.
In the data caching method in an AI cluster provided by the embodiment of the present application, in the process of obtaining the weight value of the target data set on each cluster node in the AI cluster, the type of each cluster node in the AI cluster may be parsed. If the cluster node is a management node, the total number of cluster nodes in the AI cluster is determined, the total number of data sets on the shared storage node in the AI cluster is determined, and the product of the total number of cluster nodes and the total number of data sets is determined as the weight value of the management node. If the cluster node is a non-management node, whether the target data set is stored on the cluster node is judged; if the target data set is not stored on the cluster node, the weight value of the cluster node is determined to be infinite; if the target data set is stored on the cluster node, the number of first-type tasks in which the cluster node pulls the target data set and the number of second-type tasks in which the target data set is pulled from the cluster node are determined, and the sum of the number of first-type tasks, the number of second-type tasks and 1 is determined as the weight value of the cluster node. It should be noted that a management node refers to a node with management functions in the AI cluster, and a shared storage node refers to a node whose data can be shared by all cluster nodes in the AI cluster.
For ease of understanding, assume that the number of nodes in the AI cluster is 10 and the number of data sets on the shared storage node is 20; the weight value of the management node is then 10*20=200. Assume that the target data set exists on cluster node a, the number of tasks in which cluster node a pulls the target data set is 2, and the number of tasks in which the target data set is pulled from cluster node a is 3; the weight value of cluster node a may then be 1+2+3=6. If the target data set does not exist on cluster node b, the weight value of cluster node b is infinite.
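As an illustration only, the weighting rule described above can be sketched as follows in Python; the function name node_weight and its parameters are hypothetical and are not part of the application.

    import math

    def node_weight(is_management_node, holds_dataset,
                    total_nodes, total_datasets,
                    pull_tasks, pulled_tasks):
        # total_nodes:    total number of cluster nodes in the AI cluster
        # total_datasets: number of data sets on the shared storage node
        # pull_tasks:     tasks in which this node pulls the target data set
        # pulled_tasks:   tasks in which the target data set is pulled from this node
        if is_management_node:
            return total_nodes * total_datasets   # management node
        if not holds_dataset:
            return math.inf                       # data set absent: infinite weight
        return pull_tasks + pulled_tasks + 1      # data set present

    # Worked check against the example above:
    assert node_weight(True, False, 10, 20, 0, 0) == 200        # management node: 10*20
    assert node_weight(False, True, 10, 20, 2, 3) == 6          # cluster node a: 2+3+1
    assert node_weight(False, False, 10, 20, 0, 0) == math.inf  # cluster node b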
Please refer to FIG. 2, which is a flowchart of determining the target shortest path and the predecessor node in the present application.
In the data caching method in an AI cluster provided by an embodiment of the present application, the process of obtaining the target shortest paths from the remaining cluster nodes in the AI cluster to the target cluster node, and the predecessor node of the target cluster node in each shortest path, may include the following steps.
Step S201: determine a first node set, the first node set being used to store first-type cluster nodes whose target shortest paths to the target cluster node are known.
In practical applications, there may be first-type cluster nodes whose target shortest paths to the target cluster node are already known. In that case the target shortest paths of the first-type cluster nodes no longer need attention, and the first node set can be used to manage the first-type cluster nodes. It should be noted that if the target shortest path of a first-type cluster node is known, its corresponding predecessor node is also known.
Step S202: determine a second node set, the second node set being used to store the second-type cluster nodes in the AI cluster other than those in the first node set.
In practical applications, to facilitate the management of the second-type cluster nodes whose target shortest paths are unknown, the second node set can be used to process the second-type cluster nodes collectively. Assuming that the first node set is S and the total set of cluster nodes in the AI cluster is V, the second node set may be V-S.
Step S203: determine the first shortest path from each second-type cluster node to the target cluster node.
In practical applications, because the target shortest path from a second-type cluster node to the target cluster node can be split into the sum of the shortest path from the predecessor node to the target cluster node and the shortest path from the second-type cluster node to the predecessor node, the target shortest path from the second-type cluster node to the target cluster node and the corresponding predecessor node can be determined with the help of the first shortest path from the second-type cluster node to the target cluster node. Assuming that the target cluster node is v and a second-type cluster node is i, the first shortest path can be expressed as dist[i]=G[i][v].
Step S204: take the second-type cluster node corresponding to the first shortest path with the smallest value as the cluster node to be determined.
Step S205: for each second-type cluster node, determine the second shortest path from the second-type cluster node to the cluster node to be determined, and determine the sum of the first shortest path corresponding to the cluster node to be determined and the second shortest path; if the first shortest path corresponding to the second-type cluster node is smaller than the sum, update the target shortest path of the second-type cluster node to the first shortest path corresponding to the second-type cluster node; if the first shortest path corresponding to the second-type cluster node is greater than the sum, update the target shortest path of the second-type cluster node to the sum, the predecessor node of the target cluster node in the shortest path corresponding to the second-type cluster node being the cluster node to be determined.
In practical applications, to facilitate determining the target shortest path and the predecessor node, the second-type cluster node corresponding to the first shortest path with the smallest value may first be taken as the cluster node to be determined, that is, verified as a predecessor node. Then, for each second-type cluster node, the second shortest path from the second-type cluster node to the cluster node to be determined is determined, and the sum of the first shortest path corresponding to the cluster node to be determined and the second shortest path is determined; if the first shortest path corresponding to the second-type cluster node is smaller than the sum, the target shortest path of the second-type cluster node is updated to the first shortest path corresponding to the second-type cluster node; if the first shortest path corresponding to the second-type cluster node is greater than the sum, the target shortest path of the second-type cluster node is updated to the sum, and the predecessor node of the target cluster node in the shortest path corresponding to the second-type cluster node is the cluster node to be determined. For ease of understanding, assume the cluster node to be determined is k, that is, dist[k]=min{dist[i]}; the target shortest path is then dist[i]=min{dist[i], dist[k]+G[i][k]}.
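Written in conventional notation, with S denoting the first node set, V the total set of cluster nodes, and G[i][k] the path length between nodes i and k, the update performed in steps S204 and S205 can be restated as

\[
k=\arg\min_{i\in V\setminus S}\operatorname{dist}[i],\qquad
\operatorname{dist}[i]\leftarrow\min\bigl(\operatorname{dist}[i],\ \operatorname{dist}[k]+G[i][k]\bigr)\quad\text{for each } i\in V\setminus S,
\]

with the predecessor record of node i set to k whenever the second term is the smaller one.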
Step S206: update the cluster node to be determined to a first-type cluster node.
Step S207: judge whether the first node set contains all the cluster nodes; if not, return to step S204; if so, end.
In practical applications, after the verification of whether the cluster node to be determined is a predecessor node is completed, the cluster node to be determined can be updated to a first-type cluster node, and it is judged whether the first node set contains all the cluster nodes; if not, the process returns to step S204; if so, the process can end directly, at which point the target shortest path from each second-type cluster node to the target cluster node and the corresponding predecessor node are obtained.
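Steps S201 to S207 amount to a Dijkstra-style single-target shortest-path computation. The following Python sketch is for illustration only and rests on assumptions the application does not state: the node graph is given as a hypothetical dict-of-dicts graph in which graph[i][j] is the path length between directly connected nodes i and j, and the predecessor bookkeeping is realised as a next_hop map recording the node through which each shortest path is routed.

    import math

    def shortest_paths_to_target(graph, target):
        # graph[i][j]: path length between directly connected nodes i and j
        # (entry absent if i and j are not directly connected).
        nodes = set(graph)
        done = {target}                      # first node set S: paths already known
        pending = nodes - done               # second node set V - S
        dist = {target: 0}
        next_hop = {}
        for i in pending:                    # Step S203: first shortest paths
            dist[i] = graph[i].get(target, math.inf)
            if dist[i] != math.inf:
                next_hop[i] = target

        while pending:                       # Step S207: repeat until S holds every node
            k = min(pending, key=lambda i: dist[i])   # Step S204
            for i in pending - {k}:          # Step S205: relax paths through k
                alt = dist[k] + graph[i].get(k, math.inf)
                if alt < dist[i]:
                    dist[i] = alt
                    next_hop[i] = k          # i's path now leaves through k
            done.add(k)                      # Step S206
            pending.remove(k)
        return dist, next_hop

Together with the weight values, the returned dist and next_hop maps are what the cache-path sketch given after step S105 consumes.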
Please refer to FIG. 3, which is a schematic structural diagram of a data caching system in an AI cluster provided by an embodiment of the present application.
A data caching system in an AI cluster provided by an embodiment of the present application may include:
a first determining module 101, configured to determine a target data set to be cached;
a first obtaining module 102, configured to obtain the weight value of the target data set on each cluster node in the AI cluster;
a second determining module 103, configured to determine the target cluster node for caching the target data set;
a second obtaining module 104, configured to obtain the shortest paths from the remaining cluster nodes in the AI cluster to the target cluster node, and the predecessor node of the target cluster node in each shortest path; and
a third determining module 105, configured to determine, on the basis of the weight values, the shortest paths and the predecessor nodes, a cache path for caching the target data set to the target cluster node, so as to cache the target data set to the target cluster node according to the cache path.
In a data caching system in an AI cluster provided by an embodiment of the present application, the first obtaining module includes:
a first parsing unit, configured to parse, for each cluster node in the AI cluster, the type of the cluster node; and
a first processing unit, configured to: if the cluster node is a management node, determine the total number of cluster nodes in the AI cluster, determine the total number of data sets on the shared storage node in the AI cluster, and
determine the product of the total number of cluster nodes and the total number of data sets as the weight value of the management node.
In a data caching system in an AI cluster provided by an embodiment of the present application, the first obtaining module may further include:
a second processing unit, configured to: if the cluster node is a non-management node, judge whether the target data set is stored on the cluster node, and if the target data set is not stored on the cluster node, determine that the weight value of the cluster node is infinite.
In a data caching system in an AI cluster provided by an embodiment of the present application, the second processing unit may be further configured to: if the target data set is stored on the cluster node, determine the number of first-type tasks in which the cluster node pulls the target data set, determine the number of second-type tasks in which the target data set is pulled from the cluster node, and determine the sum of the number of first-type tasks, the number of second-type tasks and 1 as the weight value of the cluster node.
In a data caching system in an AI cluster provided by an embodiment of the present application, the second obtaining module may include:
a first determining unit, configured to determine a first node set, the first node set being used to store first-type cluster nodes whose target shortest paths to the target cluster node are known;
a second determining unit, configured to determine a second node set, the second node set being used to store the second-type cluster nodes in the AI cluster other than those in the first node set;
a third determining unit, configured to determine the first shortest path from each second-type cluster node to the target cluster node;
a first setting unit, configured to take the second-type cluster node corresponding to the first shortest path with the smallest value as the cluster node to be determined;
a fourth determining unit, configured to: for each second-type cluster node, determine the second shortest path from the second-type cluster node to the cluster node to be determined, and determine the sum of the first shortest path corresponding to the cluster node to be determined and the second shortest path; if the first shortest path corresponding to the second-type cluster node is smaller than the sum, update the target shortest path of the second-type cluster node to the first shortest path corresponding to the second-type cluster node; if the first shortest path corresponding to the second-type cluster node is greater than the sum, update the target shortest path of the second-type cluster node to the sum, the predecessor node of the target cluster node in the shortest path corresponding to the second-type cluster node being the cluster node to be determined;
a first updating unit, configured to update the cluster node to be determined to a first-type cluster node; and
a first judging unit, configured to judge whether the first node set contains all the cluster nodes, and if not, return to the step of taking the second-type cluster node corresponding to the first shortest path with the smallest value as the cluster node to be determined, and if so, end.
The present application also provides a data caching device in an AI cluster and a computer-readable storage medium, both of which have the corresponding effects of the data caching method in an AI cluster provided by the embodiments of the present application. Please refer to FIG. 4, which is a schematic structural diagram of a data caching device in an AI cluster provided by an embodiment of the present application.
A data caching device in an AI cluster provided by an embodiment of the present application includes a memory 201 and a processor 202; a computer program is stored in the memory 201, and the processor 202 implements the following steps when executing the computer program:
determining a target data set to be cached;
obtaining the weight value of the target data set on each cluster node in the AI cluster;
determining the target cluster node for caching the target data set;
obtaining the target shortest paths from the remaining cluster nodes in the AI cluster to the target cluster node, and the predecessor node of the target cluster node in each target shortest path, the remaining cluster nodes including the nodes in the AI cluster other than the target cluster node; and
determining, on the basis of the weight values, the target shortest paths and the predecessor nodes, a cache path for caching the target data set to the target cluster node, so as to cache the target data set to the target cluster node according to the cache path.
A data caching device in an AI cluster provided by an embodiment of the present application includes a memory 201 and a processor 202; a computer program is stored in the memory 201, and the processor 202 implements the following steps when executing the computer program: for each cluster node in the AI cluster, parse the type of the cluster node; if the cluster node is a management node, determine the total number of cluster nodes in the AI cluster, and determine the total number of data sets on the shared storage node in the AI cluster; and determine the product of the total number of cluster nodes and the total number of data sets as the weight value of the management node.
A data caching device in an AI cluster provided by an embodiment of the present application includes a memory 201 and a processor 202; a computer program is stored in the memory 201, and the processor 202 implements the following steps when executing the computer program: after parsing the type of the cluster node, if the cluster node is a non-management node, judge whether the target data set is stored on the cluster node; and if the target data set is not stored on the cluster node, determine that the weight value of the cluster node is infinite.
A data caching device in an AI cluster provided by an embodiment of the present application includes a memory 201 and a processor 202; a computer program is stored in the memory 201, and the processor 202 implements the following steps when executing the computer program: after judging whether the target data set is stored on the cluster node, if the target data set is stored on the cluster node, determine the number of first-type tasks in which the cluster node pulls the target data set, determine the number of second-type tasks in which the target data set is pulled from the cluster node, and determine the sum of the number of first-type tasks, the number of second-type tasks and 1 as the weight value of the cluster node.
A data caching device in an AI cluster provided by an embodiment of the present application includes a memory 201 and a processor 202; a computer program is stored in the memory 201, and the processor 202 implements the following steps when executing the computer program: determine a first node set, the first node set being used to store first-type cluster nodes whose target shortest paths to the target cluster node are known; determine a second node set, the second node set being used to store the second-type cluster nodes in the AI cluster other than those in the first node set; determine the first shortest path from each second-type cluster node to the target cluster node; take the second-type cluster node corresponding to the first shortest path with the smallest value as the cluster node to be determined; for each second-type cluster node, determine the second shortest path from the second-type cluster node to the cluster node to be determined, and determine the sum of the first shortest path corresponding to the cluster node to be determined and the second shortest path; if the first shortest path corresponding to the second-type cluster node is smaller than the sum, update the target shortest path of the second-type cluster node to the first shortest path corresponding to the second-type cluster node; if the first shortest path corresponding to the second-type cluster node is greater than the sum, update the target shortest path of the second-type cluster node to the sum, with the predecessor node of the target cluster node in the shortest path corresponding to the second-type cluster node being the cluster node to be determined; update the cluster node to be determined to a first-type cluster node; and judge whether the first node set contains all the cluster nodes; if not, return to the step of taking the second-type cluster node corresponding to the first shortest path with the smallest value as the cluster node to be determined; if so, end.
Please refer to FIG. 5. Another data caching device in an AI cluster provided by an embodiment of the present application may further include: an input port 203 connected to the processor 202, configured to transmit externally input commands to the processor 202; a display unit 204 connected to the processor 202, configured to display the processing results of the processor 202 to the outside; and a communication module 205 connected to the processor 202, configured to implement communication between the data caching device in the AI cluster and the outside. The display unit 204 may be a display panel, a laser scanning display or the like; the communication methods adopted by the communication module 205 include, but are not limited to, mobile high-definition link technology (HML), universal serial bus (USB), high-definition multimedia interface (HDMI), and wireless connections, for example wireless fidelity technology (WiFi), Bluetooth communication technology, low-power Bluetooth communication technology, and communication technology based on IEEE 802.11s.
A computer-readable storage medium provided by an embodiment of the present application stores a computer program, and when the computer program is executed by a processor, the following steps are implemented:
determining a target data set to be cached;
obtaining the weight value of the target data set on each cluster node in the AI cluster;
determining the target cluster node for caching the target data set;
obtaining the target shortest paths from the remaining cluster nodes in the AI cluster to the target cluster node, and the predecessor node of the target cluster node in each target shortest path, the remaining cluster nodes including the nodes in the AI cluster other than the target cluster node; and
determining, on the basis of the weight values, the target shortest paths and the predecessor nodes, a cache path for caching the target data set to the target cluster node, so as to cache the target data set to the target cluster node according to the cache path.
In a computer-readable storage medium provided by an embodiment of the present application, a computer program is stored, and when the computer program is executed by a processor, the following steps are implemented: for each cluster node in the AI cluster, parse the type of the cluster node; if the cluster node is a management node, determine the total number of cluster nodes in the AI cluster, and determine the total number of data sets on the shared storage node in the AI cluster; and determine the product of the total number of cluster nodes and the total number of data sets as the weight value of the management node.
In a computer-readable storage medium provided by an embodiment of the present application, a computer program is stored, and when the computer program is executed by a processor, the following steps are implemented: after parsing the type of the cluster node, if the cluster node is a non-management node, judge whether the target data set is stored on the cluster node; and if the target data set is not stored on the cluster node, determine that the weight value of the cluster node is infinite.
In a computer-readable storage medium provided by an embodiment of the present application, a computer program is stored, and when the computer program is executed by a processor, the following steps are implemented: after judging whether the target data set is stored on the cluster node, if the target data set is stored on the cluster node, determine the number of first-type tasks in which the cluster node pulls the target data set, determine the number of second-type tasks in which the target data set is pulled from the cluster node, and determine the sum of the number of first-type tasks, the number of second-type tasks and 1 as the weight value of the cluster node.
In a computer-readable storage medium provided by an embodiment of the present application, a computer program is stored, and when the computer program is executed by a processor, the following steps are implemented: determine a first node set, the first node set being used to store first-type cluster nodes whose target shortest paths to the target cluster node are known; determine a second node set, the second node set being used to store the second-type cluster nodes in the AI cluster other than those in the first node set; determine the first shortest path from each second-type cluster node to the target cluster node; take the second-type cluster node corresponding to the first shortest path with the smallest value as the cluster node to be determined; for each second-type cluster node, determine the second shortest path from the second-type cluster node to the cluster node to be determined, and determine the sum of the first shortest path corresponding to the cluster node to be determined and the second shortest path; if the first shortest path corresponding to the second-type cluster node is smaller than the sum, update the target shortest path of the second-type cluster node to the first shortest path corresponding to the second-type cluster node; if the first shortest path corresponding to the second-type cluster node is greater than the sum, update the target shortest path of the second-type cluster node to the sum, with the predecessor node of the target cluster node in the shortest path corresponding to the second-type cluster node being the cluster node to be determined; update the cluster node to be determined to a first-type cluster node; and judge whether the first node set contains all the cluster nodes; if not, return to the step of taking the second-type cluster node corresponding to the first shortest path with the smallest value as the cluster node to be determined; if so, end.
The computer-readable storage medium involved in the present application includes a random access memory (RAM), an internal memory, a read-only memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
For descriptions of the relevant parts of the data caching system and device in an AI cluster and the computer-readable storage medium provided by the embodiments of the present application, please refer to the detailed descriptions of the corresponding parts of the data caching method in an AI cluster provided by the embodiments of the present application, which are not repeated here. In addition, the parts of the above technical solutions provided by the embodiments of the present application whose implementation principles are consistent with those of the corresponding technical solutions in the prior art are not described in detail, so as to avoid redundancy.
It should also be noted that, in this document, relational terms such as first and second are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements but also other elements not explicitly listed, or also includes elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device including the element.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application will not be limited to the embodiments shown herein, but shall conform to the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

  1. A data caching method in an AI cluster, characterized by comprising:
    determining a target data set to be cached;
    obtaining a weight value of the target data set on each cluster node in the AI cluster;
    determining a target cluster node for caching the target data set;
    obtaining target shortest paths from the remaining cluster nodes in the AI cluster to the target cluster node, and a predecessor node of the target cluster node in each target shortest path, the remaining cluster nodes comprising the nodes in the AI cluster other than the target cluster node; and
    determining, on the basis of the weight values, the target shortest paths and the predecessor nodes, a cache path for caching the target data set to the target cluster node, so as to cache the target data set to the target cluster node according to the cache path.
  2. The method according to claim 1, characterized in that the obtaining the weight value of the target data set on each cluster node in the AI cluster comprises:
    for each cluster node in the AI cluster, parsing the type of the cluster node;
    if the cluster node is a management node, determining the total number of cluster nodes in the AI cluster, and determining the total number of data sets on a shared storage node in the AI cluster; and
    determining the product of the total number of cluster nodes and the total number of data sets as the weight value of the management node.
  3. The method according to claim 2, characterized in that, after the parsing the type of the cluster node, the method further comprises:
    if the cluster node is a non-management node, judging whether the target data set is stored on the cluster node; and
    if the target data set is not stored on the cluster node, determining that the weight value of the cluster node is infinite.
  4. The method according to claim 3, characterized in that, after the judging whether the target data set is stored on the cluster node, the method further comprises:
    if the target data set is stored on the cluster node, determining the number of first-type tasks in which the cluster node pulls the target data set, determining the number of second-type tasks in which the target data set is pulled from the cluster node, and determining the sum of the number of first-type tasks, the number of second-type tasks and 1 as the weight value of the cluster node.
  5. The method according to any one of claims 1 to 4, characterized in that the obtaining the target shortest paths from the remaining cluster nodes in the AI cluster to the target cluster node, and the predecessor node of the target cluster node in each shortest path, comprises:
    determining a first node set, the first node set being used to store first-type cluster nodes whose target shortest paths to the target cluster node are known;
    determining a second node set, the second node set being used to store the second-type cluster nodes in the AI cluster other than those in the first node set;
    determining a first shortest path from each second-type cluster node to the target cluster node;
    taking the second-type cluster node corresponding to the first shortest path with the smallest value as a cluster node to be determined;
    for each second-type cluster node, determining a second shortest path from the second-type cluster node to the cluster node to be determined, and determining the sum of the first shortest path corresponding to the cluster node to be determined and the second shortest path; if the first shortest path corresponding to the second-type cluster node is smaller than the sum, updating the target shortest path of the second-type cluster node to the first shortest path corresponding to the second-type cluster node; if the first shortest path corresponding to the second-type cluster node is greater than the sum, updating the target shortest path of the second-type cluster node to the sum, the predecessor node of the target cluster node in the shortest path corresponding to the second-type cluster node being the cluster node to be determined;
    updating the cluster node to be determined to the first-type cluster node; and
    judging whether the first node set contains all the cluster nodes; if not, returning to the step of taking the second-type cluster node corresponding to the first shortest path with the smallest value as the cluster node to be determined; and if so, ending.
  6. A data caching system in an AI cluster, characterized by comprising:
    a first determining module, configured to determine a target data set to be cached;
    a first obtaining module, configured to obtain a weight value of the target data set on each cluster node in the AI cluster;
    a second determining module, configured to determine a target cluster node for caching the target data set;
    a second obtaining module, configured to obtain shortest paths from the remaining cluster nodes in the AI cluster to the target cluster node, and a predecessor node of the target cluster node in each shortest path; and
    a third determining module, configured to determine, on the basis of the weight values, the shortest paths and the predecessor nodes, a cache path for caching the target data set to the target cluster node, so as to cache the target data set to the target cluster node according to the cache path.
  7. The system according to claim 6, characterized in that the first obtaining module comprises:
    a first parsing unit, configured to parse, for each cluster node in the AI cluster, the type of the cluster node; and
    a first processing unit, configured to: if the cluster node is a management node, determine the total number of cluster nodes in the AI cluster, determine the total number of data sets on a shared storage node in the AI cluster, and
    determine the product of the total number of cluster nodes and the total number of data sets as the weight value of the management node.
  8. The system according to claim 7, characterized in that the first obtaining module further comprises:
    a second processing unit, configured to: if the cluster node is a non-management node, judge whether the target data set is stored on the cluster node, and if the target data set is not stored on the cluster node, determine that the weight value of the cluster node is infinite.
  9. The system according to any one of claims 6 to 8, characterized in that the second obtaining module comprises:
    a first determining unit, configured to determine a first node set, the first node set being used to store first-type cluster nodes whose target shortest paths to the target cluster node are known;
    a second determining unit, configured to determine a second node set, the second node set being used to store the second-type cluster nodes in the AI cluster other than those in the first node set;
    a third determining unit, configured to determine a first shortest path from each second-type cluster node to the target cluster node;
    a first setting unit, configured to take the second-type cluster node corresponding to the first shortest path with the smallest value as a cluster node to be determined;
    a fourth determining unit, configured to: for each second-type cluster node, determine a second shortest path from the second-type cluster node to the cluster node to be determined, and determine the sum of the first shortest path corresponding to the cluster node to be determined and the second shortest path; if the first shortest path corresponding to the second-type cluster node is smaller than the sum, update the target shortest path of the second-type cluster node to the first shortest path corresponding to the second-type cluster node; if the first shortest path corresponding to the second-type cluster node is greater than the sum, update the target shortest path of the second-type cluster node to the sum, the predecessor node of the target cluster node in the shortest path corresponding to the second-type cluster node being the cluster node to be determined;
    a first updating unit, configured to update the cluster node to be determined to the first-type cluster node; and
    a first judging unit, configured to judge whether the first node set contains all the cluster nodes, and if not, return to the step of taking the second-type cluster node corresponding to the first shortest path with the smallest value as the cluster node to be determined, and if so, end.
  10. A data caching device in an AI cluster, characterized by comprising:
    a memory, configured to store a computer program; and
    a processor, configured to implement the steps of the data caching method in an AI cluster according to any one of claims 1 to 5 when executing the computer program.
  11. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the data caching method in an AI cluster according to any one of claims 1 to 5 are implemented.
PCT/CN2022/078186 2021-09-30 2022-02-28 一种ai集群中数据缓存方法、系统、设备及计算机介质 WO2023050704A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/280,221 US20240152458A1 (en) 2021-09-30 2022-02-28 Data caching method, system and device in ai cluster, and computer medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111162807.1 2021-09-30
CN202111162807.1A CN113590666B (zh) 2021-09-30 2021-09-30 一种ai集群中数据缓存方法、系统、设备及计算机介质

Publications (1)

Publication Number Publication Date
WO2023050704A1 true WO2023050704A1 (zh) 2023-04-06

Family

ID=78242736

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/078186 WO2023050704A1 (zh) 2021-09-30 2022-02-28 一种ai集群中数据缓存方法、系统、设备及计算机介质

Country Status (3)

Country Link
US (1) US20240152458A1 (zh)
CN (1) CN113590666B (zh)
WO (1) WO2023050704A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590666B (zh) * 2021-09-30 2022-02-18 苏州浪潮智能科技有限公司 一种ai集群中数据缓存方法、系统、设备及计算机介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180262566A1 (en) * 2016-01-29 2018-09-13 Huawei Technologies Co., Ltd. Caching Method and System Based on Cache Cluster
CN110971432A (zh) * 2018-09-29 2020-04-07 华为技术有限公司 一种数据传输方法以及相关装置
CN112702399A (zh) * 2020-12-14 2021-04-23 中山大学 网络社团协作缓存方法、装置、计算机设备和存储介质
CN113094183A (zh) * 2021-06-09 2021-07-09 苏州浪潮智能科技有限公司 Ai训练平台的训练任务创建方法、装置、系统及介质
CN113590666A (zh) * 2021-09-30 2021-11-02 苏州浪潮智能科技有限公司 一种ai集群中数据缓存方法、系统、设备及计算机介质

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103218233B (zh) * 2013-05-09 2015-11-18 福州大学 Hadoop异构集群中的数据分配策略
CN105743980A (zh) * 2016-02-03 2016-07-06 上海理工大学 一种自组织的云资源共享分布式对等网络模型构造方法
CN111367950B (zh) * 2020-02-28 2023-08-08 上海欣巴自动化科技股份有限公司 一种基于Kubernetes的分布式AGV调度系统及调度方法
CN112632092B (zh) * 2020-12-18 2024-10-08 北京浪潮数据技术有限公司 一种集群管理方法、装置、设备及存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180262566A1 (en) * 2016-01-29 2018-09-13 Huawei Technologies Co., Ltd. Caching Method and System Based on Cache Cluster
CN110971432A (zh) * 2018-09-29 2020-04-07 华为技术有限公司 一种数据传输方法以及相关装置
CN112702399A (zh) * 2020-12-14 2021-04-23 中山大学 网络社团协作缓存方法、装置、计算机设备和存储介质
CN113094183A (zh) * 2021-06-09 2021-07-09 苏州浪潮智能科技有限公司 Ai训练平台的训练任务创建方法、装置、系统及介质
CN113590666A (zh) * 2021-09-30 2021-11-02 苏州浪潮智能科技有限公司 一种ai集群中数据缓存方法、系统、设备及计算机介质

Also Published As

Publication number Publication date
US20240152458A1 (en) 2024-05-09
CN113590666B (zh) 2022-02-18
CN113590666A (zh) 2021-11-02

Similar Documents

Publication Publication Date Title
US10928970B2 (en) User-interface for developing applications that apply machine learning
US20190199797A1 (en) System and method for scheduling computer tasks
US20210240705A1 (en) Dynamic asynchronous traversals for distributed graph queries
US20190034833A1 (en) Model Training Method and Apparatus
CN105677812A (zh) 一种数据查询方法及数据查询装置
CN113821332B (zh) 自动机器学习系统效能调优方法、装置、设备及介质
US10860579B2 (en) Query planning and execution with reusable memory stack
US11200231B2 (en) Remote query optimization in multi data sources
WO2021254240A1 (zh) 数据处理方法及装置
WO2021259041A1 (zh) Ai计算图的排序方法、装置、设备及存储介质
US11190620B2 (en) Methods and electronic devices for data transmission and reception
US9594839B2 (en) Methods and systems for load balancing databases in a cloud environment
US11709831B2 (en) Cost-based query optimization for array fields in database systems
CN109241100B (zh) 一种查询方法、装置、设备及存储介质
WO2023050704A1 (zh) 一种ai集群中数据缓存方法、系统、设备及计算机介质
US20240004853A1 (en) Virtual data source manager of data virtualization-based architecture
US11960616B2 (en) Virtual data sources of data virtualization-based architecture
WO2021114464A1 (zh) 一种数据重删方法、系统、设备及计算机可读存储介质
CN117971906B (zh) 一种多卡协同数据库查询方法、装置、设备及存储介质
US20140089331A1 (en) Integrated analytics on multiple systems
US20150074351A1 (en) Write-behind caching in distributed file systems
US11263026B2 (en) Software plugins of data virtualization-based architecture
US20130110862A1 (en) Maintaining a buffer state in a database query engine
CN112131242A (zh) 一种基于redis的数据快速查询方法及装置
US20190057130A1 (en) Method, system, and apparatus for performing flow-based processing using stored procedure

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22874107

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 18280221

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22874107

Country of ref document: EP

Kind code of ref document: A1