CN106411605A - Node network self-organizing method, apparatus, server and system - Google Patents


Publication number
CN106411605A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610963725.XA
Other languages
Chinese (zh)
Other versions
CN106411605B (en)
Inventor
李远策
欧阳文
贾润莹
陈永强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201610963725.XA
Publication of CN106411605A
Application granted
Publication of CN106411605B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/12 Discovery or management of network topologies

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a node network self-organization method, apparatus, server and system. The method includes: receiving a task start instruction sent by a master control node; selecting a port number for the task; returning the host name of this slave node and the selected port number to the master control node, so that the master control node generates a node network map from the tasks started on the slave nodes and the host names and port numbers they return; and receiving the node network map sent by the master control node. This technical solution moves port number allocation from the master control node to the individual slave nodes, which avoids conflicts between a port number allocated by the master control node and a port number already in use on a slave node. The generated node network map can also be used to manage the node network and to establish connections between slave nodes, which meets the usage requirements while improving the success rate of node network construction.

Description

A node network self-organization method, apparatus, server and system

Technical field

The invention relates to the technical field of computer networks, and in particular to a node network self-organization method, apparatus, server and system.

Background

A distributed cluster contains multiple nodes, and these nodes can execute the same task in parallel. In many cases, parallel execution also requires the nodes to communicate with one another, so organizing the node network is a key problem. In the prior art, when a task is set up, the master control node usually assigns each slave node that executes the task the port number to be used by the process executing it. Each slave node can then learn the port numbers used by the processes executing the task on the other slave nodes and establish connections with them. However, the assigned port numbers correspond one-to-one to the slave nodes. If the master control node does not know which port numbers are already in use on a node and assigns one of them, the start of the task is affected. For example, if task A running on node 1 uses port 8080, and the master control node then also specifies port 8080 when assigning node 1 the port for task B, a port conflict results.
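
The conflict described above can be reproduced on a single machine. The following is a minimal sketch, assuming plain TCP sockets on the loopback interface; the function name `demonstrate_port_conflict` is illustrative and not part of the invention.

```python
import socket


def demonstrate_port_conflict() -> bool:
    """Return True if binding a second socket to an occupied port fails."""
    # Stand-in for task A: it already listens on some port of this node.
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    first.bind(("127.0.0.1", 0))  # port 0 lets the OS pick any free port
    first.listen(1)
    occupied_port = first.getsockname()[1]

    # Stand-in for task B: the master assigned it the same port number.
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        second.bind(("127.0.0.1", occupied_port))
        return False  # no conflict was detected
    except OSError:
        return True   # address already in use: task B cannot start
    finally:
        second.close()
        first.close()
```

Calling `demonstrate_port_conflict()` is expected to report a conflict, because the second bind targets a port that the first socket still holds.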

Summary of the invention

In view of the above problems, the present invention is proposed to provide a node network self-organization method, apparatus, server and system that overcome the above problems or at least partially solve them.

According to one aspect of the present invention, a node network self-organization method is provided, including:

receiving a task start instruction sent by a master control node;

selecting a port number for the task;

returning the host name of this slave node and the selected port number to the master control node, so that the master control node generates a node network map from the tasks started on the slave nodes and the host names and port numbers they return;

receiving the node network map sent by the master control node.

Optionally, selecting the port number for the task includes:

randomly selecting a port number from the port numbers currently unoccupied on this slave node.

Optionally, receiving the node network map sent by the master control node includes:

periodically sending a request for the node network map to the master control node, and receiving the node network map that the master control node returns in response to the request;

and/or,

receiving the node network map actively delivered by the master control node.

Optionally, the method further includes:

establishing, according to the node network map sent by the master control node, a connection with one or more other slave nodes in the node network map.

Optionally, the task start instruction is a start instruction for a deep learning subtask; the deep learning subtask includes a parameter server subtask and/or a worker subtask.

According to another aspect of the present invention, a node network self-organization method is provided, including:

sending a task start instruction to one or more slave nodes according to input task information;

receiving the host name and port number returned by each slave node;

generating a node network map from the tasks started on the slave nodes and the returned host names and port numbers;

sending the node network map to one or more slave nodes.

Optionally, sending the node network map to one or more slave nodes includes:

when a request for the node network map is received from a slave node, sending the node network map to that slave node;

and/or,

sending the node network map to all slave nodes connected to this master control node.

Optionally, the task information is task information of a deep learning task; the task information includes: the number of nodes for executing the deep learning task, the deep learning subtask types, and the number of subtasks of each type.

Optionally, sending a task start instruction to one or more slave nodes according to the input task information includes:

selecting, from all slave nodes connected to this master control node, as many slave nodes as the number of nodes for executing the deep learning task;

determining the task to be started on each slave node according to the deep learning subtask types and the number of subtasks of each type;

sending, to each selected slave node, the task start instruction corresponding to the task to be started on that slave node.

According to yet another aspect of the present invention, a node network self-organization apparatus is provided, wherein the apparatus is deployed on a slave node of a distributed cluster and includes:

a communication unit, adapted to receive a task start instruction sent by a node network self-organization server;

a port selection unit, adapted to select a port number for the task;

the communication unit is further adapted to return the host name of the slave node where this apparatus is located and the selected port number to the node network self-organization server, so that the node network self-organization server generates a node network map from the tasks started on the slave nodes and the host names and port numbers returned by the node network self-organization apparatuses; and is adapted to receive the node network map sent by the node network self-organization server.

Optionally, the port selection unit is adapted to randomly select a port number from the port numbers currently unoccupied on the slave node where this apparatus is located.

Optionally, the communication unit is adapted to periodically send a request for the node network map to the node network self-organization server and receive the node network map that the server returns in response to the request; and/or to receive the node network map actively delivered by the node network self-organization server.

Optionally, the communication unit is further adapted to establish, according to the node network map sent by the node network self-organization server, a connection with the node network self-organization apparatus on one or more other slave nodes in the node network map.

Optionally, the task start instruction is a start instruction for a deep learning subtask; the deep learning subtask includes a parameter server subtask and/or a worker subtask.

According to a further aspect of the present invention, a node network self-organization server is provided, wherein the server is deployed on the master control node of a distributed cluster and includes:

a communication unit, adapted to send a task start instruction to the node network self-organization apparatus on one or more slave nodes according to input task information, and to receive the host name and port number returned by each node network self-organization apparatus;

a node network map generation unit, adapted to generate a node network map from the tasks started on the slave nodes and the host names and port numbers returned by the node network self-organization apparatuses;

the communication unit is further adapted to send the node network map to the node network self-organization apparatus on one or more slave nodes.

Optionally, the communication unit is adapted to send the node network map to the node network self-organization apparatus on a slave node when a request for the node network map is received from that apparatus; and/or to send the node network map to the node network self-organization apparatuses on all slave nodes connected to the master control node where this server is located.

Optionally, the task information is task information of a deep learning task; the task information includes: the number of nodes for executing the deep learning task, the deep learning subtask types, and the number of subtasks of each type.

Optionally, the server further includes:

a scheduling unit, adapted to select, from all slave nodes connected to the master control node where this server is located, as many slave nodes as the number of nodes for executing the deep learning task, and to determine the task to be started on each slave node according to the deep learning subtask types and the number of subtasks of each type;

the communication unit is adapted to send, to the node network self-organization apparatus on each selected slave node, the task start instruction corresponding to the task to be started on that slave node.

According to a further aspect of the present invention, a node network self-organization system is provided, wherein the system includes one or more node network self-organization apparatuses as described in any of the above and a node network self-organization server as described in any of the above.

As can be seen from the above, in the technical solution of the present invention, after a slave node receives the task start instruction sent by the master control node, it selects the port number for the task itself and returns its host name and the selected port number to the master control node, so that the master control node generates a node network map from the tasks started on the slave nodes and the returned host names and port numbers. This technical solution moves port number allocation from the master control node to the individual slave nodes, which avoids conflicts between a port number allocated by the master control node and a port number already in use on a slave node. The generated node network map can also be used to manage the node network and to establish connections between slave nodes, which meets the usage requirements while improving the success rate of node network construction.

The above description is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented according to the contents of the description, and so that the above and other objects, features and advantages of the present invention are more readily apparent, specific embodiments of the present invention are set out below.

Brief description of the drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered as limiting the invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:

Fig. 1 shows a schematic flow chart of a node network self-organization method according to an embodiment of the present invention;

Fig. 2 shows a schematic flow chart of another node network self-organization method according to an embodiment of the present invention;

Fig. 3 shows a schematic structural diagram of a node network self-organization apparatus according to an embodiment of the present invention;

Fig. 4 shows a schematic structural diagram of a node network self-organization server according to an embodiment of the present invention;

Fig. 5 shows a schematic structural diagram of a node network self-organization system according to an embodiment of the present invention.

Detailed description

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.

Fig. 1 shows a schematic flow chart of a node network self-organization method according to an embodiment of the present invention. As shown in Fig. 1, the method includes:

Step S110: receive a task start instruction sent by the master control node.

Step S120: select a port number for the task.

Step S130: return the host name of this slave node and the selected port number to the master control node, so that the master control node generates a node network map from the tasks started on the slave nodes and the returned host names and port numbers.

Step S140: receive the node network map sent by the master control node.

It can be seen that, in the method shown in Fig. 1, after a slave node receives the task start instruction sent by the master control node, it selects the port number for the task itself and returns its host name and the selected port number to the master control node, so that the master control node generates a node network map from the tasks started on the slave nodes and the returned host names and port numbers. This technical solution moves port number allocation from the master control node to the individual slave nodes, which avoids conflicts between a port number allocated by the master control node and a port number already in use on a slave node. The generated node network map can also be used to manage the node network and to establish connections between slave nodes, which meets the usage requirements while improving the success rate of node network construction.
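
Steps S110 to S130 on the slave side can be sketched as follows. This is only an illustration under stated assumptions: the JSON message format and its field names `task`, `host` and `port` are hypothetical, and the port is chosen by binding to port 0 so that the operating system picks an unused one, which is a simplification made for brevity. Step S140, receiving the map, is not shown.

```python
import json
import socket
from typing import Tuple


def handle_start_instruction(raw_instruction: str) -> Tuple[str, socket.socket]:
    """Process a start instruction and build the reply for the master node."""
    # S110: decode the start instruction (hypothetical JSON format).
    instruction = json.loads(raw_instruction)

    # S120: the slave picks the port locally; binding to port 0 makes the
    # OS choose a port that is not in use on this node.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("", 0))
    listener.listen()
    port = listener.getsockname()[1]

    # S130: report the task, this node's host name and the chosen port.
    reply = json.dumps({
        "task": instruction["task"],
        "host": socket.gethostname(),
        "port": port,
    })
    return reply, listener
```

The listening socket is returned together with the reply so that the started task can keep using the port it reported.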

In one embodiment of the present invention, in the method shown in Fig. 1, selecting the port number for the task includes: randomly selecting a port number from the port numbers currently unoccupied on this slave node.

Port numbers on a slave node typically range up to 65535. Selecting randomly among the unoccupied port numbers is efficient and does not cause port conflicts.
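
A minimal sketch of random selection among unoccupied ports, assuming that a port counts as unoccupied when a TCP socket can be bound to it. The function name, the lower bound of 1024 and the retry limit are illustrative assumptions. The socket stays bound when it is returned, so that no other process can take the port between selection and use.

```python
import random
import socket
from typing import Tuple


def reserve_random_port(low: int = 1024, high: int = 65535,
                        attempts: int = 100) -> Tuple[socket.socket, int]:
    """Bind a TCP socket to a randomly chosen unoccupied port and return both."""
    for _ in range(attempts):
        candidate = random.randint(low, high)  # inclusive on both ends
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("", candidate))  # raises OSError if the port is in use
        except OSError:
            sock.close()
            continue  # occupied: try another random candidate
        return sock, candidate
    raise RuntimeError("no unoccupied port found")
```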

In one embodiment of the present invention, in the method shown in Fig. 1, receiving the node network map sent by the master control node includes: periodically sending a request for the node network map to the master control node and receiving the node network map that the master control node returns in response to the request; and/or receiving the node network map actively delivered by the master control node.

For example, a request for the node network map is sent to the master control node every 10 minutes, or the master control node delivers the updated node network map to the relevant slave nodes after the node network map changes.
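
The two delivery modes can be combined in one small helper on the slave side. The sketch below is an assumption-laden illustration: `MapPoller`, `fetch_map` and `on_push` are hypothetical names, and how the request actually reaches the master control node is left to the `fetch_map` callable.

```python
import threading
from typing import Callable, Dict, List

NodeNetworkMap = Dict[str, List[str]]


class MapPoller:
    """Keeps the latest node network map, via periodic pull or master push."""

    def __init__(self, fetch_map: Callable[[], NodeNetworkMap],
                 interval_seconds: float = 600.0) -> None:
        self._fetch_map = fetch_map          # sends the request to the master
        self._interval = interval_seconds    # 600 s matches the 10-minute example
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.latest: NodeNetworkMap = {}

    def _run(self) -> None:
        while not self._stop.is_set():
            self.latest = self._fetch_map()  # pull mode
            self._stop.wait(self._interval)  # sleep, but wake early on stop()

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def on_push(self, new_map: NodeNetworkMap) -> None:
        self.latest = new_map                # push mode: master sent an update
```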

An important reason for each slave node to obtain the node network map is that, in many cases, a process running on one slave node needs to communicate with processes on other slave nodes. Therefore, in one embodiment of the present invention, the method shown in Fig. 1 further includes: establishing, according to the node network map sent by the master control node, a connection with one or more other slave nodes in the node network map.
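
As a rough illustration of this step, the sketch below derives the peer addresses from a node network map and opens TCP connections to them. It assumes the map is held as a dictionary from subtask type to `host:port` strings; the function names are hypothetical.

```python
import socket
from typing import Dict, List, Tuple


def parse_address(entry: str) -> Tuple[str, int]:
    """Split a 'host:port' entry into a (host, port) pair."""
    host, _, port = entry.rpartition(":")
    return host, int(port)


def peers_for(node_map: Dict[str, List[str]], own_entry: str) -> List[Tuple[str, int]]:
    """List every address in the map except this node's own entry."""
    return [parse_address(entry)
            for entries in node_map.values()
            for entry in entries
            if entry != own_entry]


def connect_to_peers(node_map: Dict[str, List[str]], own_entry: str,
                     timeout: float = 5.0) -> List[socket.socket]:
    """Open one TCP connection to each peer listed in the map."""
    return [socket.create_connection(address, timeout=timeout)
            for address in peers_for(node_map, own_entry)]
```

A real deployment would likely connect only to the peers a subtask actually needs, for example workers connecting to parameter servers, rather than to every entry.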

In one embodiment of the present invention, in the method shown in Fig. 1, the task start instruction is a start instruction for a deep learning subtask; the deep learning subtask includes a parameter server subtask and/or a worker subtask. In this example, the parameter server needs to receive the computed parameters submitted by the worker subtasks.

Fig. 2 shows a schematic flow chart of another node network self-organization method according to an embodiment of the present invention. As shown in Fig. 2, the method includes:

Step S210: send a task start instruction to one or more slave nodes according to input task information.

Step S220: receive the host name and port number returned by each slave node.

Step S230: generate a node network map from the tasks started on the slave nodes and the returned host names and port numbers.

Step S240: send the node network map to one or more slave nodes.
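
Step S230 can be sketched as a simple grouping of the slave reports by task type. The report fields `task`, `host` and `port` and the function name are assumptions made for illustration; the source does not prescribe a data structure for the node network map.

```python
from collections import defaultdict
from typing import Dict, List, Union

Report = Dict[str, Union[str, int]]


def build_node_network_map(reports: List[Report]) -> Dict[str, List[str]]:
    """Group 'host:port' entries by the task started on each slave node."""
    node_map: Dict[str, List[str]] = defaultdict(list)
    for report in reports:
        node_map[str(report["task"])].append(f'{report["host"]}:{report["port"]}')
    return dict(node_map)


# Example reports as they might arrive in step S220.
example_reports = [
    {"task": "PS", "host": "node1", "port": 8080},
    {"task": "PS", "host": "node2", "port": 8080},
    {"task": "worker", "host": "node3", "port": 9090},
    {"task": "worker", "host": "node4", "port": 9090},
]
example_map = build_node_network_map(example_reports)
```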

In one embodiment of the present invention, in the method shown in Fig. 2, sending the node network map to one or more slave nodes includes: when a request for the node network map is received from a slave node, sending the node network map to that slave node; and/or sending the node network map to all slave nodes connected to this master control node.

This embodiment provides two ways of distributing the node network map, which can be used in combination. They do not limit the distribution method: when the node network map changes, the updated map may also be sent only to the nodes affected by the update. For example, if task A originally ran on node 1 and node 2 and two execution nodes, node 13 and node 14, are added, the updated node network map only needs to be sent to nodes 1, 2, 13 and 14. The node network map can also be split into one sub-map per task, so that each sub-map only needs to be distributed to the nodes that appear in it.
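
A small sketch of the per-task sub-map idea, assuming the master control node keeps one sub-map per task as a dictionary from subtask type to `host:port` entries. The names and port numbers are hypothetical. After task A, originally on nodes 1 and 2, gains nodes 13 and 14, only those four hosts need the update.

```python
from typing import Dict, List, Set

# task name -> subtask type -> list of "host:port" entries
TaskMaps = Dict[str, Dict[str, List[str]]]


def hosts_in_task(task_maps: TaskMaps, task: str) -> Set[str]:
    """Collect the host names that appear in one task's sub-map."""
    return {entry.rpartition(":")[0]
            for entries in task_maps[task].values()
            for entry in entries}


def recipients_after_update(task_maps: TaskMaps, task: str, role: str,
                            new_entries: List[str]) -> Set[str]:
    """Add new entries to a task's sub-map and return who must be notified."""
    task_maps[task].setdefault(role, []).extend(new_entries)
    return hosts_in_task(task_maps, task)
```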

In one embodiment of the present invention, in the method shown in Fig. 2, the task information is task information of a deep learning task; the task information includes: the number of nodes for executing the deep learning task, the deep learning subtask types, and the number of subtasks of each type.

A deep learning task is submitted as a computing task in the form of a graph. When executed, such a task is further divided into multiple jobs, each containing one or more subtasks. The subtask types include one or more of the following: parameter server subtask, worker subtask.

For example, TensorFlow is an open-source deep learning library. Tensor means an N-dimensional array, Flow means computation based on a data flow graph, and TensorFlow describes the process of tensors flowing from one end of the graph to the other. The deep learning library can be integrated with the Spark big data computing framework, that is, a TensorFlow task is submitted as a Spark task, which is the deep learning task referred to above. The deep learning task information may also include one or more of the following: the computation graph for the deep learning; the deep learning library interfaces that need to be called to execute the deep learning task; the address of the data for the deep learning task; the address where the execution result data is saved.

In one embodiment of the present invention, in the above method, sending a task start instruction to one or more slave nodes according to the input task information includes: selecting, from all slave nodes connected to this master control node, as many slave nodes as the number of nodes for executing the deep learning task; determining the task to be started on each slave node according to the deep learning subtask types and the number of subtasks of each type; and sending, to each selected slave node, the task start instruction corresponding to the task to be started on that slave node.

Take a deep learning task as an example. If, according to its task information, the deep learning library must be called to start 2 parameter server subtasks and 2 worker subtasks, and these four subtasks are to run on four separate slave nodes, then the subtask to be executed on each slave node is determined first, and an instruction to start the corresponding subtask is then sent to each slave node.
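
The assignment in this example can be sketched as pairing the selected nodes with the expanded list of subtasks. The one-subtask-per-node policy follows the example above; the function name and the in-order assignment are assumptions, since the source does not specify how subtasks are mapped to nodes.

```python
from typing import Dict, List


def assign_subtasks(nodes: List[str], subtask_counts: Dict[str, int]) -> Dict[str, str]:
    """Assign one subtask to each selected node, in the order given."""
    # Expand {"parameter server": 2, "worker": 2} into a flat list of subtasks.
    subtasks = [kind for kind, count in subtask_counts.items() for _ in range(count)]
    if len(subtasks) != len(nodes):
        raise ValueError("number of subtasks must match number of selected nodes")
    return dict(zip(nodes, subtasks))


example_assignment = assign_subtasks(
    ["node1", "node2", "node3", "node4"],
    {"parameter server": 2, "worker": 2},
)
```

Each entry of the result then determines which start instruction is sent to which slave node.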

As mentioned above, the deep learning library can be integrated with the Spark big data computing framework, so the distributed cluster can be a Spark cluster. A Spark cluster can also use Yarn for task scheduling, job management and resource management. Yarn can provide users with a front-end page for submitting tasks, so in one embodiment of the present invention the submitted deep learning task may be entered through the front-end page. After the task starts, the user can also use the front-end page provided by Yarn to view the processing status of the task in real time and to perform operations such as killing the task. Because the Spark cluster can use Yarn for task scheduling, job management and resource management, the step in the above embodiment of selecting, from all slave nodes connected to this master control node, as many slave nodes as are needed to execute the deep learning task can also be carried out by sending a request to Yarn to obtain nodes that are currently relatively idle. That is, the number of nodes for executing the deep learning task is sent to the node scheduler of the distributed cluster, and the information of multiple nodes returned by the node scheduler is received.

An example of a node network map generated for a deep learning task is shown below:

{PS:[node1:8080,node2:8080]worker:[node3:9090,node4:9090]}

This means that a parameter server subtask has been started on port 8080 of node 1 and another on port 8080 of node 2, and that a worker subtask has been started on port 9090 of node 3 and another on port 9090 of node 4. Next, the node network map needs to be actively delivered to these slave nodes, or delivered to them in response to the node network map requests sent by the slave nodes. For example, the worker subtask started on port 9090 of node 3 can establish connections with the parameter server subtask started on port 8080 of node 1 and with the parameter server subtask started on port 8080 of node 2.
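
The textual form shown in this example is not a standard format such as JSON, so a slave node would need some way to read it. The following sketch parses exactly that textual layout with a regular expression; the parser and its function names are assumptions for illustration, and the source does not say how the map is serialized in practice.

```python
import re
from typing import Dict, List

# Matches "name:[entry,entry,...]" groups such as "PS:[node1:8080,node2:8080]".
_GROUP = re.compile(r"(\w+):\[([^\]]*)\]")


def parse_node_network_map(text: str) -> Dict[str, List[str]]:
    """Parse the example map notation into subtask type -> 'host:port' list."""
    return {role: [entry for entry in entries.split(",") if entry]
            for role, entries in _GROUP.findall(text)}


def parameter_servers(node_map: Dict[str, List[str]]) -> List[str]:
    """Return the entries a worker subtask would connect to."""
    return node_map.get("PS", [])


example_text = "{PS:[node1:8080,node2:8080]worker:[node3:9090,node4:9090]}"
example_map = parse_node_network_map(example_text)
```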

这些都可以在深度学习任务提交后,由Spark启动一个Driver进程,同时启动一个Scheduler调度进程,由该进程实现节点网络图的构建、管理和分发。After the deep learning task is submitted, Spark starts a Driver process and a Scheduler scheduling process at the same time, which realizes the construction, management and distribution of the node network graph.

具体地,从分布式集群的文件系统中获取用于该深度学习任务的数据包括:根据用于深度学习任务的数据地址,将分布式集群的文件系统中用于该深度学习任务的数据构建为弹性分布式数据集RDD对象;将获取的用于该深度学习任务的数据推送到相应的子任务上进行执行包括:将RDD对象分别推送到各节点,由各节点将RDD对象推送到在该节点中启动的子任务上。Specifically, obtaining the data used for the deep learning task from the file system of the distributed cluster includes: according to the data address used for the deep learning task, constructing the data used for the deep learning task in the file system of the distributed cluster as Elastic distributed data set RDD object; pushing the obtained data for the deep learning task to the corresponding subtask for execution includes: pushing the RDD object to each node, and each node pushes the RDD object to the node on subtasks started in .

Taking a Spark distributed cluster as an example, its data is stored on HDFS (Hadoop Distributed File System). To operate on the data, it is constructed as an RDD (resilient distributed dataset) object. RDD objects can be reused: if the data used by the deep learning task has already been constructed as an RDD object, this step is unnecessary. When the data is used, it is pushed through a pipe to the slave nodes where the tasks are located, and each node pushes the RDD object to the subtask started on that slave node. In the example above, where the deep learning task contains two worker subtasks, one part of the RDD object is pushed to node 3 and the other part to node 4, so that the deep learning task is processed in a distributed manner.
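The idea of sending one part of the data to node 3 and the other part to node 4 can be sketched in plain Python. This is not Spark's API: `split_records` and the round-robin rule are assumptions chosen only to make the partitioning concrete, and the worker labels reuse the ports from the earlier example.

```python
def split_records(records, workers):
    """Distribute records over the worker endpoints in round-robin order.

    Hypothetical helper: a real Spark job would rely on RDD partitions
    rather than on an explicit loop like this one.
    """
    shards = {w: [] for w in workers}
    for i, record in enumerate(records):
        shards[workers[i % len(workers)]].append(record)
    return shards


# Five training records shared between the two worker subtasks.
shards = split_records(list(range(5)), ["node3:9090", "node4:9090"])
```

Each worker then trains on its own shard and exchanges parameters with the parameter server subtasks listed in the node network graph.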

Fig. 3 shows a schematic structural diagram of a node network self-organizing apparatus according to an embodiment of the present invention; the apparatus can be deployed on a slave node of a distributed cluster. As shown in Fig. 3, the node network self-organizing apparatus 300 includes:

A communication unit 310, adapted to receive a task start instruction sent by the node network self-organizing server.

A port selection unit 320, adapted to select a port number for the task.

The communication unit 310 is further adapted to return the host name of the slave node where the apparatus is located and the selected port number to the node network self-organizing server, so that the node network self-organizing server generates a node network graph according to the tasks started by each slave node and the host names and port numbers returned by each node network self-organizing apparatus; and is adapted to receive the node network graph sent by the node network self-organizing server.

It can be seen that, in the apparatus shown in Fig. 3, through the cooperation of its units, the slave node, after receiving the task start instruction sent by the master control node, itself selects the port number for the task and returns its host name and the selected port number to the master control node, so that the master control node generates a node network graph according to the tasks started by each slave node and the returned host names and port numbers. This technical solution moves port number allocation from the master control node to the slave nodes, which avoids conflicts between a port number allocated by the master control node and port numbers already in use on a slave node. The generated node network graph can be used both for managing the node network and for establishing connections between slave nodes, which meets the usage requirements and improves the success rate of node network construction.

In an embodiment of the present invention, in the apparatus shown in Fig. 3, the port selection unit 320 is adapted to randomly select a port number from the port numbers currently unoccupied on the slave node where the apparatus is located.
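A minimal Python sketch of this random selection, under stated assumptions: the port range 20000 to 60000, the retry limit, and the names `pick_free_port` and `build_report` are all hypothetical, and a port is treated as unoccupied if a TCP socket can be bound to it at the moment of the check.

```python
import random
import socket


def pick_free_port(low=20000, high=60000, attempts=50):
    """Randomly probe ports in [low, high] and return the first free one.

    A port counts as free here if binding a TCP socket to it succeeds.
    Another process could still take the port between this check and
    the subtask's own bind, so a real implementation should retry.
    """
    for _ in range(attempts):
        port = random.randint(low, high)
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("", port))
            except OSError:
                continue  # port is occupied, try another one
            return port
    raise RuntimeError("no free port found in the probed range")


def build_report():
    """Build the (host name, port) reply for the master control node."""
    return {"host": socket.gethostname(), "port": pick_free_port()}
```

An alternative under the same assumptions is to bind to port 0 and let the operating system choose a free port, which avoids the probing loop but gives up control over the range.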

In an embodiment of the present invention, in the apparatus shown in Fig. 3, the communication unit 310 is adapted to periodically send a request for the node network graph to the node network self-organizing server and receive the node network graph returned by the server in response to the request; and/or to receive the node network graph proactively delivered by the node network self-organizing server.
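The periodic request can be sketched as a simple polling loop. Everything named here is an assumption for illustration: `fetch` stands in for whatever call retrieves the graph from the server, `on_update` for the local handler, and the five-second interval is arbitrary.

```python
import time


def poll_graph(fetch, on_update, interval_s=5.0, rounds=None, sleep=time.sleep):
    """Periodically fetch the node network graph and report changes.

    fetch:     callable returning the current graph (hypothetical transport)
    on_update: callable invoked only when the graph differs from the last one
    rounds:    number of polls to perform, or None to poll indefinitely
    """
    last = None
    done = 0
    while rounds is None or done < rounds:
        graph = fetch()
        if graph != last:
            on_update(graph)
            last = graph
        done += 1
        if rounds is None or done < rounds:
            sleep(interval_s)


# Simulated server replies: the graph grows on the third poll.
replies = iter([["node1:8080"], ["node1:8080"], ["node1:8080", "node2:8080"]])
seen = []
poll_graph(lambda: next(replies), seen.append, rounds=3, sleep=lambda s: None)
```

Polling lets a slave node that started late, or missed a pushed update, still converge on the current graph; the push path described in the same embodiment covers the case where low latency matters more.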

In an embodiment of the present invention, in the apparatus shown in Fig. 3, the communication unit 310 is further adapted to establish, according to the node network graph sent by the node network self-organizing server, connections with the node network self-organizing apparatuses 300 on one or more other slave nodes in the node network graph.

In an embodiment of the present invention, in the apparatus shown in Fig. 3, the task start instruction is a start instruction for a deep learning subtask; the deep learning subtasks include parameter server subtasks and/or worker subtasks.

Fig. 4 shows a schematic structural diagram of a node network self-organizing server according to an embodiment of the present invention; the server can be deployed on the master control node of a distributed cluster. As shown in Fig. 4, the node network self-organizing server 400 includes:

A communication unit 410, adapted to send, according to input task information, a task start instruction to the node network self-organizing apparatuses on one or more slave nodes, and to receive the host name and port number returned by each node network self-organizing apparatus;

A node network graph generation unit 420, adapted to generate a node network graph according to the tasks started by each slave node and the host names and port numbers returned by each node network self-organizing apparatus;

The communication unit 410 is further adapted to send the node network graph to the node network self-organizing apparatuses on one or more slave nodes.
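The server-side flow of these units can be sketched as one function that sends start instructions, collects the replies, and assembles the graph. The names `build_graph`, `send_start`, and `fake_send_start` are hypothetical, and the network transport is replaced by an in-process stub so that the sketch stays self-contained.

```python
def build_graph(assignments, send_start):
    """Assemble the node network graph on the master control node.

    assignments: mapping of slave node -> task to start there
    send_start:  callable(node, task) -> (host name, port) as reported
                 by the slave node; stands in for the real transport
    """
    graph = []
    for node, task in assignments.items():
        # The port is chosen by the slave node, not by the server.
        host, port = send_start(node, task)
        graph.append({"host": host, "port": port, "task": task})
    return graph


def fake_send_start(node, task):
    """Stub slave node: replies with a fixed port per task type."""
    return node, 8080 if task == "parameter server" else 9090


graph = build_graph(
    {"node1": "parameter server", "node3": "worker"},
    fake_send_start,
)
```

The point of the design is visible in the loop: the server never proposes a port, so it cannot collide with a port already in use on a slave node.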

In an embodiment of the present invention, in the server shown in Fig. 4, the communication unit 410 is adapted to send the node network graph to the node network self-organizing apparatus on a slave node upon receiving from that apparatus a request for the node network graph; and/or to send the node network graph to the node network self-organizing apparatuses on all slave nodes connected to the master control node where the server is located.

In an embodiment of the present invention, in the server shown in Fig. 4, the task information is the task information of a deep learning task and includes: the number of nodes used to execute the deep learning task, the deep learning subtask types, and the number of subtasks of each type.

In an embodiment of the present invention, the server shown in Fig. 4 further includes a scheduling unit 430, adapted to select, from all slave nodes connected to the master control node where the server is located, as many slave nodes as the number of nodes used to execute the deep learning task, and to determine, according to the deep learning subtask types and the number of subtasks of each type, the task to be started on each slave node. The communication unit 410 is adapted to send, to the node network self-organizing apparatus on each selected slave node, the task start instruction corresponding to the task to be started on that slave node.
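The scheduling step can be sketched as follows. The function name `plan_tasks`, the choice of the first available slave nodes, and the restriction to exactly one subtask per node are assumptions for this illustration; the text leaves the selection policy open.

```python
def plan_tasks(slaves, node_count, subtask_counts):
    """Choose slave nodes and decide which subtask each one starts.

    slaves:         all slave nodes connected to the master control node
    node_count:     number of nodes requested for the deep learning task
    subtask_counts: mapping of subtask type -> number of subtasks
    """
    if node_count > len(slaves):
        raise ValueError("not enough slave nodes")
    if sum(subtask_counts.values()) != node_count:
        raise ValueError("this sketch assumes one subtask per node")
    chosen = slaves[:node_count]  # simplest possible selection policy
    tasks = [t for t, n in subtask_counts.items() for _ in range(n)]
    return dict(zip(chosen, tasks))


plan = plan_tasks(
    ["node1", "node2", "node3", "node4", "node5"],
    4,
    {"parameter server": 2, "worker": 2},
)
```

With the inputs above the plan matches the running example: parameter server subtasks on nodes 1 and 2, worker subtasks on nodes 3 and 4, and node 5 left unused.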

It should be noted that the specific implementations of the above apparatus and server embodiments are similar to those of the corresponding method embodiments described earlier and are not repeated here. One small difference is that each slave node may host not only a node network self-organizing apparatus but also a task execution apparatus, and the master control node may host not only a node network self-organizing server but also a task control server. Of course, these servers may be functionally integrated and implemented as a single server, and the apparatuses on the same slave node may likewise be functionally integrated and implemented as a single apparatus.

Fig. 5 shows a schematic structural diagram of a node network self-organizing system according to an embodiment of the present invention. As shown in Fig. 5, the node network self-organizing system 500 includes one or more node network self-organizing apparatuses 300 as in any of the above embodiments and a node network self-organizing server 400 as in any of the above embodiments.

In summary, in the technical solution of the present invention, the slave node, after receiving the task start instruction sent by the master control node, itself selects the port number for the task and returns its host name and the selected port number to the master control node, so that the master control node generates a node network graph according to the tasks started by each slave node and the returned host names and port numbers. This technical solution moves port number allocation from the master control node to the slave nodes, which avoids conflicts between a port number allocated by the master control node and port numbers already in use on a slave node. The generated node network graph can be used both for managing the node network and for establishing connections between slave nodes, which meets the usage requirements and improves the success rate of node network construction.

It should be noted that:

The algorithms and displays provided herein are not inherently related to any particular computer, virtual apparatus, or other device. Various general-purpose apparatuses may also be used with the teachings herein. From the description above, the structure required to construct such an apparatus is apparent. Furthermore, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein can be implemented in a variety of programming languages, and that the description of a specific language above is given to disclose the best mode of the invention.

Numerous specific details are set forth in the description provided herein. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.

Similarly, it should be understood that, in order to streamline this disclosure and aid the understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of the embodiments may be combined into a single module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.

Furthermore, those skilled in the art will understand that, although some embodiments described herein include certain features included in other embodiments and not others, combinations of features from different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the node network self-organizing apparatus, server, and system according to embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program or a computer program product) for performing part or all of the methods described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such a signal may be downloaded from an Internet site, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several apparatuses, several of these apparatuses may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.

An embodiment of the present invention discloses A1, a node network self-organizing method, wherein the method includes:

receiving a task start instruction sent by a master control node;

selecting a port number for the task;

returning the host name of the present slave node and the selected port number to the master control node, so that the master control node generates a node network graph according to the tasks started by each slave node and the returned host names and port numbers;

receiving the node network graph sent by the master control node.

A2. The method of A1, wherein selecting a port number for the task includes:

randomly selecting a port number from the port numbers currently unoccupied on the present slave node.

A3. The method of A1, wherein receiving the node network graph sent by the master control node includes:

periodically sending a request for the node network graph to the master control node, and receiving the node network graph returned by the master control node in response to the request;

and/or,

receiving the node network graph proactively delivered by the master control node.

A4. The method of A1, wherein the method further includes:

establishing, according to the node network graph sent by the master control node, connections with one or more other slave nodes in the node network graph.

A5. The method of A1, wherein,

the task start instruction is a start instruction for a deep learning subtask; the deep learning subtasks include parameter server subtasks and/or worker subtasks.

An embodiment of the present invention further discloses B6, a node network self-organizing method, wherein the method includes:

sending a task start instruction to one or more slave nodes according to input task information;

receiving the host name and port number returned by each slave node;

generating a node network graph according to the tasks started by each slave node and the returned host names and port numbers;

sending the node network graph to one or more slave nodes.

B7. The method of B6, wherein sending the node network graph to one or more slave nodes includes:

upon receiving a request for the node network graph sent by a slave node, sending the node network graph to that slave node;

and/or,

sending the node network graph to all slave nodes connected to the present master control node.

B8. The method of B6, wherein the task information is the task information of a deep learning task and includes: the number of nodes used to execute the deep learning task, the deep learning subtask types, and the number of subtasks of each type.

B9. The method of B8, wherein sending a task start instruction to one or more slave nodes according to input task information includes:

selecting, from all slave nodes connected to the present master control node, as many slave nodes as the number of nodes used to execute the deep learning task;

determining, according to the deep learning subtask types and the number of subtasks of each type, the task to be started on each slave node;

sending, to each selected slave node, the task start instruction corresponding to the task to be started on that slave node.

An embodiment of the present invention further discloses C10, a node network self-organizing apparatus, wherein the apparatus is deployed on a slave node of a distributed cluster and includes:

a communication unit, adapted to receive a task start instruction sent by a node network self-organizing server;

a port selection unit, adapted to select a port number for the task;

the communication unit being further adapted to return the host name of the slave node where the apparatus is located and the selected port number to the node network self-organizing server, so that the node network self-organizing server generates a node network graph according to the tasks started by each slave node and the host names and port numbers returned by each node network self-organizing apparatus; and adapted to receive the node network graph sent by the node network self-organizing server.

C11. The apparatus of C10, wherein,

the port selection unit is adapted to randomly select a port number from the port numbers currently unoccupied on the slave node where the apparatus is located.

C12. The apparatus of C10, wherein,

the communication unit is adapted to periodically send a request for the node network graph to the node network self-organizing server and receive the node network graph returned by the node network self-organizing server in response to the request; and/or to receive the node network graph proactively delivered by the node network self-organizing server.

C13. The apparatus of C10, wherein,

the communication unit is further adapted to establish, according to the node network graph sent by the node network self-organizing server, connections with the node network self-organizing apparatuses on one or more other slave nodes in the node network graph.

C14. The apparatus of C10, wherein,

the task start instruction is a start instruction for a deep learning subtask; the deep learning subtasks include parameter server subtasks and/or worker subtasks.

An embodiment of the present invention further discloses D15, a node network self-organizing server, wherein the server is deployed on the master control node of a distributed cluster and includes:

a communication unit, adapted to send, according to input task information, a task start instruction to the node network self-organizing apparatuses on one or more slave nodes, and to receive the host name and port number returned by each node network self-organizing apparatus;

a node network graph generation unit, adapted to generate a node network graph according to the tasks started by each slave node and the host names and port numbers returned by each node network self-organizing apparatus;

the communication unit being further adapted to send the node network graph to the node network self-organizing apparatuses on one or more slave nodes.

D16. The server of D15, wherein,

the communication unit is adapted to send the node network graph to the node network self-organizing apparatus on a slave node upon receiving from that apparatus a request for the node network graph; and/or to send the node network graph to the node network self-organizing apparatuses on all slave nodes connected to the master control node where the server is located.

D17. The server of D15, wherein the task information is the task information of a deep learning task and includes: the number of nodes used to execute the deep learning task, the deep learning subtask types, and the number of subtasks of each type.

D18. The server of D17, wherein the server further includes:

a scheduling unit, adapted to select, from all slave nodes connected to the master control node where the server is located, as many slave nodes as the number of nodes used to execute the deep learning task, and to determine, according to the deep learning subtask types and the number of subtasks of each type, the task to be started on each slave node;

the communication unit being adapted to send, to the node network self-organizing apparatus on each selected slave node, the task start instruction corresponding to the task to be started on that slave node.

An embodiment of the present invention further discloses E19, a node network self-organizing system, wherein the system includes one or more node network self-organizing apparatuses according to any one of C10-C14 and a node network self-organizing server according to any one of D15-D18.

Claims (10)

1. A node network self-organizing method, wherein the method includes:
receiving a task start instruction sent by a master control node;
selecting a port number for the task;
returning the host name of the present slave node and the selected port number to the master control node, so that the master control node generates a node network graph according to the tasks started by each slave node and the returned host names and port numbers;
receiving the node network graph sent by the master control node.
2. The method of claim 1, wherein selecting a port number for the task includes:
randomly selecting a port number from the port numbers currently unoccupied on the present slave node.
3. The method of claim 1, wherein receiving the node network graph sent by the master control node includes:
periodically sending a request for the node network graph to the master control node, and receiving the node network graph returned by the master control node in response to the request;
and/or,
receiving the node network graph proactively delivered by the master control node.
4. A node network self-organizing method, wherein the method includes:
sending a task start instruction to one or more slave nodes according to input task information;
receiving the host name and port number returned by each slave node;
generating a node network graph according to the tasks started by each slave node and the returned host names and port numbers;
sending the node network graph to one or more slave nodes.
5. The method of claim 4, wherein sending the node network graph to one or more slave nodes includes:
upon receiving a request for the node network graph sent by a slave node, sending the node network graph to that slave node;
and/or,
sending the node network graph to all slave nodes connected to the present master control node.
6. A node network self-organizing apparatus, wherein the apparatus is deployed on a slave node of a distributed cluster and includes:
a communication unit, adapted to receive a task start instruction sent by a node network self-organizing server;
a port selection unit, adapted to select a port number for the task;
the communication unit being further adapted to return the host name of the slave node where the apparatus is located and the selected port number to the node network self-organizing server, so that the node network self-organizing server generates a node network graph according to the tasks started by each slave node and the host names and port numbers returned by each node network self-organizing apparatus; and adapted to receive the node network graph sent by the node network self-organizing server.
7. The apparatus of claim 6, wherein,
the port selection unit is adapted to randomly select a port number from the port numbers currently unoccupied on the slave node where the apparatus is located.
8. A node network self-organizing server, wherein the server is deployed on the master control node of a distributed cluster and includes:
a communication unit, adapted to send, according to input task information, a task start instruction to the node network self-organizing apparatuses on one or more slave nodes, and to receive the host name and port number returned by each node network self-organizing apparatus;
a node network graph generation unit, adapted to generate a node network graph according to the tasks started by each slave node and the host names and port numbers returned by each node network self-organizing apparatus;
the communication unit being further adapted to send the node network graph to the node network self-organizing apparatuses on one or more slave nodes.
9. The server of claim 8, wherein,
the communication unit is adapted to send the node network graph to the node network self-organizing apparatus on a slave node upon receiving from that apparatus a request for the node network graph; and/or to send the node network graph to the node network self-organizing apparatuses on all slave nodes connected to the master control node where the server is located.
10. A node network self-organizing system, wherein the system includes one or more node network self-organizing apparatuses according to any one of claims 6-7 and a node network self-organizing server according to any one of claims 8-9.
CN201610963725.XA 2016-10-28 2016-10-28 Node network self-organizing method, apparatus, server and system Active CN106411605B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610963725.XA CN106411605B (en) 2016-10-28 2016-10-28 Node network self-organizing method, apparatus, server and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610963725.XA CN106411605B (en) 2016-10-28 2016-10-28 Node network self-organizing method, apparatus, server and system

Publications (2)

Publication Number Publication Date
CN106411605A (en) 2017-02-15
CN106411605B (en) 2019-05-31

Family

ID=58014284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610963725.XA Active CN106411605B (en) 2016-10-28 2016-10-28 Node network self-organizing method, apparatus, server and system

Country Status (1)

Country Link
CN (1) CN106411605B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614232A (en) * 2018-12-07 2019-04-12 网易(杭州)网络有限公司 Task processing method, device, storage medium and electronic device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060015643A1 (en) * 2004-01-23 2006-01-19 Fredrik Orava Method of sending information through a tree and ring topology of a network system
CN101888331A (en) * 2009-05-13 2010-11-17 阿瓦亚公司 Method and apparatus for providing fast rerouting of unicast packets
CN104158747A (en) * 2013-05-14 2014-11-19 中兴通讯股份有限公司 Network topology discovery method and system
CN105391580A (en) * 2015-11-27 2016-03-09 上海卫星工程研究所 Network model description method applicable to SpW/SpF
CN105721318A (en) * 2016-02-29 2016-06-29 华为技术有限公司 Method and device for discovering network topology in software defined network SDN (Software Defined Network)

Also Published As

Publication number Publication date
CN106411605B (en) 2019-05-31

Similar Documents

Publication Publication Date Title
WO2022062650A1 (en) Computing device sharing method and apparatus based on kubernetes, and device and storage medium
JP7092736B2 (en) Dynamic routing using container orchestration services
CN106529682A (en) Method and apparatus for processing deep learning task in big-data cluster
CN109196474B (en) Distributed operation control in a computing system
CN108139935B (en) Extension of resource constraints for service-defined containers
CN109936604B (en) Resource scheduling method, device and system
EP2710470B1 (en) Extensible centralized dynamic resource distribution in a clustered data grid
US20170031622A1 (en) Methods for allocating storage cluster hardware resources and devices thereof
CN112181585B (en) Resource allocation method and device of virtual machine
JP5402226B2 (en) Management apparatus, information processing system, control program for information processing system, and control method for information processing system
CN110489126B (en) Compiling task execution method and device, storage medium and electronic device
WO2017206667A1 (en) Method and device for distributively deploying hadoop cluster
CN111092921B (en) Data acquisition method, device and storage medium
CN113382077B (en) Micro-service scheduling method, micro-service scheduling device, computer equipment and storage medium
CN108293041A (en) Resource allocation method, apparatus and system
CN107291536B (en) Application task flow scheduling method in cloud computing environment
CN109992373B (en) Resource scheduling method, information management method and device and task deployment system
CN109117252B (en) Method and system for task processing based on container and container cluster management system
CN107515786A (en) Resource allocation method, master device, slave device and distributed computing system
CN111880936A (en) Resource scheduling method and device, container cluster, computer equipment and storage medium
WO2012068867A1 (en) Virtual machine management system and using method thereof
CN110166507B (en) Multi-resource scheduling method and device
WO2020108337A1 (en) Cpu resource scheduling method and electronic equipment
WO2018079162A1 (en) Information processing system
CN108234242A (en) Stream-based stress testing method and apparatus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220725

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi Software (Beijing) Co.,Ltd.