WO2022052659A1 - Method and device for submitting training tasks via a rate-limited queue - Google Patents

Method and device for submitting training tasks via a rate-limited queue

Info

Publication number
WO2022052659A1
WO2022052659A1 · PCT/CN2021/109619 · CN2021109619W
Authority
WO
WIPO (PCT)
Prior art keywords
queue
training
token bucket
task
rate
Prior art date
Application number
PCT/CN2021/109619
Other languages
English (en)
French (fr)
Inventor
王文潇
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司 filed Critical 苏州浪潮智能科技有限公司
Priority to US18/012,930 priority Critical patent/US20230196134A1/en
Publication of WO2022052659A1 publication Critical patent/WO2022052659A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/215Flow control; Congestion control using token-bucket
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/546Message passing systems or structures, e.g. queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/50Network services
    • H04L67/60Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/54Indexing scheme relating to G06F9/54
    • G06F2209/548Queue

Definitions

  • the present invention relates to the field of computer technology, and more particularly, to a method and device for submitting training tasks to a speed-limited queue.
  • AI (Artificial Intelligence)
  • When facing a small number of users, the platform may have enough carrying capacity to handle user requests, but once the number of users reaches a certain order of magnitude, high-concurrency request problems often arise, impacting platform services and causing system instability or even crashing the server hosting the service.
  • the AI resource management platform needs to introduce a rate-limiting mechanism to throttle requests from users, which can both preserve the user experience and ensure the stability of the platform's own services.
  • the purpose of the embodiments of the present invention is to provide an adaptive dynamic rate-limited queue technique, which can adaptively adjust the length of the rate-limited queue that handles user requests in the system according to the active times of different users and the number of requests per time period, so as to control the concurrency of the deep learning training platform and ensure the smooth operation of the service system.
  • one aspect of the present invention provides a method for submitting a training task to a rate-limited queue, the method comprising:
  • the training tasks are submitted in sequence according to the chronological order in which the training tasks entered the token bucket rate-limiting queue.
  • submitting the training tasks sequentially according to the chronological order in which the training tasks entered the token bucket rate-limited queue according to the carrying capacity further includes:
  • the training tasks are submitted in turn according to the chronological order of the training tasks entering the token bucket rate-limiting queue.
  • the method further includes:
  • the request success information of the training task is sent according to the signal.
  • the method further includes:
  • a request to cancel the training task is sent and a request to delete the training task is sent.
  • adjusting the carrying capacity of the token-bucket rate-limited queue according to the token-bucket rate-limited queue parameter further includes:
  • the method further includes:
  • Another aspect of the embodiments of the present invention further provides a device for submitting training tasks in a rate-limited queue, the device comprising:
  • the prediction module is configured to monitor the load state information, and predict the token bucket speed limit queue parameters through the trained neural network prediction model according to the load state information;
  • the adjustment module is configured to adjust the carrying capacity of the token bucket rate limit queue according to the token bucket rate limit queue parameter
  • the remaining space judgment module is configured to configure the task parameters of the training task, and judge whether there is sufficient remaining space in the token bucket speed limit queue to place the training task according to the task parameters and carrying capacity;
  • the entering queue module is configured to send the training task to the token bucket rate limiting queue in response to judging that there is sufficient remaining space in the token bucket rate limiting queue to place the training task;
  • the submission module is configured to submit training tasks in turn according to the chronological order of the training tasks entering the token bucket rate-limiting queue according to the carrying capacity.
  • the submission module is further configured to:
  • the training tasks are submitted in sequence according to the time sequence of the training tasks entering the token bucket rate-limiting queue.
  • the apparatus further includes:
  • the submission notification module is configured to parse the training tasks submitted from the token bucket rate-limited queue, send the training tasks to the underlying service, and send signals; and send the request success information of the training tasks according to the signals.
  • the apparatus further includes:
  • a model update module configured to configure a preset time period, and collect sample information according to the preset time period; update the sample set of the neural network prediction model according to the sample information, and retrain and update the neural network prediction model according to the updated sample set.
  • the present invention has at least the following beneficial technical effects: the platform's ability to handle high-concurrency scenarios is increased, the rejection rate of user requests is reduced while affecting system performance as little as possible, the user experience is enhanced, platform performance is protected, and the development of related deep learning platforms is guided.
  • FIG. 1 shows a schematic block diagram of an embodiment of a method for submitting a training task to a rate-limited queue according to the present invention
  • FIG. 2 shows a schematic diagram of a module structure of an embodiment of a method for submitting a training task to a rate-limited queue according to the present invention
  • FIG. 3 shows a schematic flowchart of the adjustment process of the token bucket rate-limiting queue according to an embodiment of the method for submitting a training task to a rate-limiting queue of the present invention
  • FIG. 4 shows a schematic block diagram of an embodiment of an apparatus for submitting a training task to a rate-limited queue according to the present invention.
  • FIG. 1 shows a schematic block diagram of an embodiment of a method for submitting a training task to a rate-limited queue according to the present invention.
  • the method at least includes the following steps:
  • the present invention runs a neural network algorithm on top of the token bucket rate-limited queue to dynamically adjust the token insertion rate and the queue length of the token bucket queue in real time.
  • the present invention collects load state information (including the current number of online users, the average system load per time period, and the time period information) and computes the token bucket rate-limited queue parameters (including the queue length and the token insertion rate) through a neural network model.
  • the present invention records the load state information, adds it to the sample set, and then updates the parameters of the neural network model.
  • FIG. 2 shows a schematic diagram of a module structure of an embodiment of a method for submitting a training task to a rate-limited queue according to the present invention. As shown in FIG. 2, it includes a task configuration module, a rate-limiting module, an adaptive module and a running module, where:
  • Task configuration module: the deep learning training platform provides a task configuration module responsible for configuring task parameters, such as the number of iterations, the training framework, the batch size, and the number of CPUs/GPUs used;
  • Rate-limiting module: the deep learning training platform provides a rate-limiting module, which throttles the submission of training tasks through the token bucket rate-limited queue. After each submission, a task must first enter the rate-limited queue, and only after obtaining a token can the training task actually be delivered to the bottom layer of the system. If the queue is full, the rejection policy is executed: the request is discarded and the user is notified by email. In addition, the throttling effect of the token bucket rate-limited queue can be tuned by adjusting the queue size and the token insertion rate;
  • the deep learning training platform sets an adaptive module, which can automatically adjust the queue size of the token bucket rate-limiting queue and the token insertion rate based on the current system status and time period.
  • This module can be divided into two sub-modules: prediction module and training module.
  • the training module updates the data provided by the system in real time into the training sample set, computes the network parameters through neural network training, abstracts the network parameters into a model, and pushes it to the prediction module; the prediction module uses the network model parameters to predict a result from the current system state (such as the system load and the current number of online users) and the time period, namely the token bucket rate-limited queue parameters, and uses this result to adjust the carrying capacity of the token bucket rate-limited queue (that is, the queue size and the token insertion rate);
  • Running module: the deep learning training platform provides a running module, which parses the task parameters of a training task that has obtained a token, builds a training object, and delivers the object to the system service to start the training of the deep training task.
  • the task submission process includes:
  • step S300 according to the user training requirements, configure the task parameters of the training task:
  • the user enters the task parameters of his or her deep learning task, such as the number of iterations, the training framework, the batch size, and the number of CPUs (Central Processing Units)/GPUs (Graphics Processing Units) used;
  • step S400 the deep learning platform starts a speed limit module to receive and process the training task from step S300:
  • FIG. 3 shows a schematic flowchart of the adjustment process of the token bucket rate-limiting queue according to an embodiment of the method for submitting a training task to a rate-limited queue of the present invention, and the adaptive adjustment process is shown in FIG. 3:
  • step S100 the token bucket rate-limiting queue parameters are predicted according to the load state information in the system information.
  • this information is abstracted into data and fed into the trained neural network prediction model, and the output information is obtained through prediction: the queue length and the token insertion rate.
  • Step 1.1 The deep learning training platform monitors the load status information in the system information, and obtains relevant parameters of the load status information: the current number of online users, system load and time period information.
  • Step 1.2 Abstract this information into data and input it into the trained neural network prediction model.
  • Step 1.3 Obtain the output data through the neural network prediction model: the queue length and the token insertion rate.
  • step S200 the computed queue length and token insertion rate parameters are written into the rate-limited queue of the platform's rate-limiting module, and the carrying capacity of the rate-limited queue is adjusted.
  • submitting the training tasks sequentially according to the chronological order in which the training tasks entered the token bucket rate-limited queue according to the carrying capacity further includes:
  • the training tasks are submitted in sequence according to the time sequence of the training tasks entering the token bucket rate-limiting queue.
  • an attempt is made to pop the earliest training task that entered the queue from the queue.
  • the pop condition is whether a token can be obtained from the token bucket. If there is a token in the token bucket, the deep learning platform starts a running module to parse the training task popped from the queue, deliver the task to the underlying service, and at the same time send a signal to the system information receiving system. If there is no token in the token bucket, the pop operation is cancelled because a token cannot be obtained, and the training task remains in the token bucket rate-limited queue.
  • the method further includes:
  • the request success information of the training task is sent according to the signal.
  • the deep learning platform starts a running module to parse the training tasks popped from the queue, deliver the tasks to the underlying service, and send signals to the system information receiving system at the same time.
  • the method further includes:
  • a request to cancel the training task is sent and a request to delete the training task is sent.
  • the notification information is assembled according to the task parameters of the training task, the user is notified by email that "this request is cancelled due to excessive system load", the memory is released, the training task request is deleted, and the request ends.
  • adjusting the carrying capacity of the token-bucket rate-limited queue according to the token-bucket rate-limited queue parameter further includes:
  • a certain time interval is set and updated at regular intervals to dynamically adjust the carrying capacity of the token bucket rate-limiting queue.
  • the method further includes:
  • Configure a preset time period and collect sample information according to the preset time period; update the sample set of the neural network prediction model according to the sample information, and retrain and update the neural network prediction model according to the updated sample set.
  • a fixed time is set; for example, in some embodiments, at 1:00 a.m., the training model sample set is updated from the information collected over the previous day, the neural network model is retrained, and the new model parameters are saved for predicting the queue parameters of the next day:
  • the system continuously performs sampling operations in different time periods, and updates these samples to the training sample set of the neural network model.
  • the system automatically trains the neural network model through a new sample set to obtain the latest neural network prediction model.
  • FIG. 4 shows a schematic block diagram of an embodiment of an apparatus for submitting training tasks to a rate-limited queue according to the present invention.
  • the apparatus 101 includes:
  • the prediction module 11, the prediction module 11 is configured to monitor the load state information, and predict the token bucket speed limit queue parameters through the trained neural network prediction model according to the load state information;
  • the adjustment module 12 is configured to adjust the carrying capacity of the token bucket rate-limited queue according to the token-bucket rate-limited queue parameter;
  • the remaining space judgment module 13 is configured to configure the task parameters of the training task, and according to the task parameters and carrying capacity, judge whether there is sufficient remaining space in the token bucket rate limit queue to place the training task;
  • the entering queue module 14 is configured to send the training task to the token bucket rate limiting queue in response to judging that there is sufficient remaining space in the token bucket rate limiting queue to place the training task;
  • the submission module 15 is configured to submit the training tasks sequentially according to the chronological sequence of the training tasks entering the token bucket rate-limiting queue according to the carrying capacity.
  • the submitting module 15 is further configured to:
  • the training tasks are submitted in sequence according to the time sequence of the training tasks entering the token bucket rate-limiting queue.
  • the apparatus 101 further includes:
  • the submission notification module is configured to parse the training tasks submitted from the token bucket rate-limiting queue, send the training tasks to the underlying service, and send signals; and send the request success information of the training tasks according to the signals.
  • the apparatus 101 further includes:
  • a model update module configured to configure a preset time period, and collect sample information according to the preset time period; update the sample set of the neural network prediction model according to the sample information, and retrain and update the neural network prediction model according to the updated sample set.
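The token bucket rate-limited queue described above can be sketched in a few lines. This is a minimal illustration under stated assumptions — the class name, the one-token-per-submission policy, and the choice to cap the bucket at the queue length are all invented for the example, not taken from the patent:

```python
import time
from collections import deque

class TokenBucketQueue:
    """Toy token bucket rate-limited queue: tasks wait in FIFO order and one
    token is consumed per submission (illustrative names and policy)."""

    def __init__(self, queue_length, token_rate):
        self.queue_length = queue_length     # carrying capacity: max queued tasks
        self.token_rate = token_rate         # tokens inserted per second
        self.tokens = 0.0
        self.last_refill = time.monotonic()
        self.queue = deque()

    def adjust(self, queue_length, token_rate):
        # The adaptive module updates the carrying capacity with these two values.
        self.queue_length = queue_length
        self.token_rate = token_rate

    def _refill(self):
        now = time.monotonic()
        # Cap the bucket at the queue length (an assumption made for brevity).
        self.tokens = min(self.queue_length,
                          self.tokens + (now - self.last_refill) * self.token_rate)
        self.last_refill = now

    def try_enqueue(self, task):
        # Reject the request when there is no remaining space in the queue.
        if len(self.queue) >= self.queue_length:
            return False
        self.queue.append(task)
        return True

    def try_submit(self):
        # Pop the earliest task only if a token can be taken from the bucket.
        self._refill()
        if not self.queue or self.tokens < 1.0:
            return None
        self.tokens -= 1.0
        return self.queue.popleft()
```

With `queue_length=2`, a third `try_enqueue` is refused, and `try_submit` releases tasks strictly in arrival order as tokens accumulate.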

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

Disclosed is a method for submitting training tasks via a rate-limited queue, comprising: monitoring load state information, and predicting token bucket rate-limited queue parameters from the load state information by means of a trained neural network prediction model; adjusting the carrying capacity of the token bucket rate-limited queue according to the token bucket rate-limited queue parameters; configuring task parameters of a training task, and determining, according to the task parameters and the carrying capacity, whether the token bucket rate-limited queue has sufficient remaining space to hold the training task; in response to determining that the token bucket rate-limited queue has sufficient remaining space to hold the training task, sending the training task into the token bucket rate-limited queue; and submitting the training tasks in sequence, according to the carrying capacity, in the chronological order in which the training tasks entered the token bucket rate-limited queue. A corresponding device is also disclosed. The present invention can adaptively adjust the carrying capacity of the rate-limited queue that handles task requests in a system, ensuring smooth operation of the system.

Description

Method and device for submitting training tasks via a rate-limited queue
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on September 10, 2020, with application number 202010949625.8 and entitled "Method and device for submitting training tasks via a rate-limited queue", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of computer technology, and in particular to a method and device for submitting training tasks via a rate-limited queue.
Background
At present, as neural network models continue to improve, their accuracy keeps rising and their applicability keeps expanding, and the term AI (Artificial Intelligence) has returned to public attention. At the same time, AI has injected new vitality into a number of industries, and the development of these industries has produced a large number of deep learning algorithm engineers. In the traditional deep learning training mode, many engineers share a few servers, which inevitably causes problems such as resource contention and greatly reduces the efficiency of algorithm engineers. Building an AI resource management platform is therefore a good solution: algorithm engineers can define the resource specifications of deep training tasks on the platform and, after configuring the training information, submit a training task to the platform to run with one click.
When facing a small number of users, the platform may have enough carrying capacity to handle user requests, but once the number of users reaches a certain order of magnitude, high-concurrency request problems often arise, impacting platform services and causing system instability or even crashing the server hosting the service. To cope with the serious problems caused by such high concurrency, the AI resource management platform needs to introduce a rate-limiting mechanism to throttle requests from users, which both preserves the user experience and ensures the stability of the platform's own services.
Summary
In view of this, an object of the embodiments of the present invention is to provide an adaptive dynamic rate-limited queue technique that can adaptively adjust the length of the rate-limited queue handling user requests in the system according to the active times of different users and the number of requests per time period, so as to control the concurrency of the deep learning training platform and ensure the smooth operation of the service system.
Based on the above object, one aspect of the present invention provides a method for submitting training tasks via a rate-limited queue, the method comprising:
monitoring load state information, and predicting token bucket rate-limited queue parameters from the load state information by means of a trained neural network prediction model;
adjusting the carrying capacity of the token bucket rate-limited queue according to the token bucket rate-limited queue parameters;
configuring task parameters of a training task, and determining, according to the task parameters and the carrying capacity, whether the token bucket rate-limited queue has sufficient remaining space to hold the training task;
in response to determining that the token bucket rate-limited queue has sufficient remaining space to hold the training task, sending the training task into the token bucket rate-limited queue;
submitting the training tasks in sequence, according to the carrying capacity, in the chronological order in which the training tasks entered the token bucket rate-limited queue.
In some embodiments of the method for submitting training tasks via a rate-limited queue of the present invention, submitting the training tasks in sequence, according to the carrying capacity, in the chronological order in which the training tasks entered the token bucket rate-limited queue further comprises:
determining whether a token can be obtained;
in response to obtaining a token, submitting the training tasks in sequence in the chronological order in which the training tasks entered the token bucket rate-limited queue.
In some embodiments of the method for submitting training tasks via a rate-limited queue of the present invention, the method further comprises:
parsing the training task submitted from the token bucket rate-limited queue, sending the training task to the underlying service, and sending a signal;
sending request-success information for the training task according to the signal.
In some embodiments of the method for submitting training tasks via a rate-limited queue of the present invention, the method further comprises:
in response to determining, according to the task parameters and the carrying capacity, that the token bucket rate-limited queue does not have sufficient remaining space to hold the training task, sending request-cancellation information for the training task and deleting the training task request.
In some embodiments of the method for submitting training tasks via a rate-limited queue of the present invention, adjusting the carrying capacity of the token bucket rate-limited queue according to the token bucket rate-limited queue parameters further comprises:
configuring a time interval, and adjusting the carrying capacity of the token bucket rate-limited queue according to the time interval.
In some embodiments of the method for submitting training tasks via a rate-limited queue of the present invention, the method further comprises:
configuring a preset time period, and collecting sample information according to the preset time period;
updating the sample set of the neural network prediction model according to the sample information, and retraining and updating the neural network prediction model according to the updated sample set.
In another aspect, the embodiments of the present invention further provide a device for submitting training tasks via a rate-limited queue, the device comprising:
a prediction module configured to monitor load state information and predict token bucket rate-limited queue parameters from the load state information by means of a trained neural network prediction model;
an adjustment module configured to adjust the carrying capacity of the token bucket rate-limited queue according to the token bucket rate-limited queue parameters;
a remaining space judgment module configured to configure task parameters of a training task and determine, according to the task parameters and the carrying capacity, whether the token bucket rate-limited queue has sufficient remaining space to hold the training task;
a queue entry module configured to send the training task into the token bucket rate-limited queue in response to determining that the token bucket rate-limited queue has sufficient remaining space to hold the training task;
a submission module configured to submit the training tasks in sequence, according to the carrying capacity, in the chronological order in which the training tasks entered the token bucket rate-limited queue.
In some embodiments of the device for submitting training tasks via a rate-limited queue of the present invention, the submission module is further configured to:
determine whether a token can be obtained;
in response to obtaining a token, submit the training tasks in sequence in the chronological order in which the training tasks entered the token bucket rate-limited queue.
In some embodiments of the device for submitting training tasks via a rate-limited queue of the present invention, the device further comprises:
a submission notification module configured to parse the training task submitted from the token bucket rate-limited queue, send the training task to the underlying service, and send a signal; and send request-success information for the training task according to the signal.
In some embodiments of the device for submitting training tasks via a rate-limited queue of the present invention, the device further comprises:
a model update module configured to configure a preset time period and collect sample information according to the preset time period; update the sample set of the neural network prediction model according to the sample information, and retrain and update the neural network prediction model according to the updated sample set.
The present invention has at least the following beneficial technical effects: it increases the platform's ability to handle high-concurrency scenarios, reduces the rejection rate of user requests while affecting system performance as little as possible, improves the user experience, protects platform performance, and also guides the development of related deep learning platforms.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other embodiments from these drawings without creative effort.
FIG. 1 shows a schematic block diagram of an embodiment of the method for submitting training tasks via a rate-limited queue according to the present invention;
FIG. 2 shows a schematic diagram of the module structure of an embodiment of the method for submitting training tasks via a rate-limited queue according to the present invention;
FIG. 3 shows a schematic flowchart of the adjustment process of the token bucket rate-limited queue in an embodiment of the method for submitting training tasks via a rate-limited queue according to the present invention;
FIG. 4 shows a schematic block diagram of an embodiment of the device for submitting training tasks via a rate-limited queue according to the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that all uses of "first" and "second" in the embodiments of the present invention are intended to distinguish two non-identical entities or parameters that share the same name; "first" and "second" are thus used only for convenience of expression and should not be construed as limiting the embodiments of the present invention, and subsequent embodiments will not explain this point again.
Based on the above object, a first aspect of the embodiments of the present invention provides an embodiment of a method for submitting training tasks via a rate-limited queue. FIG. 1 shows a schematic block diagram of an embodiment of the method for submitting training tasks via a rate-limited queue according to the present invention. In the embodiment shown in FIG. 1, the method comprises at least the following steps:
S100, monitoring load state information, and predicting token bucket rate-limited queue parameters from the load state information by means of a trained neural network prediction model;
S200, adjusting the carrying capacity of the token bucket rate-limited queue according to the token bucket rate-limited queue parameters;
S300, configuring task parameters of a training task, and determining, according to the task parameters and the carrying capacity, whether the token bucket rate-limited queue has sufficient remaining space to hold the training task;
S400, in response to determining that the token bucket rate-limited queue has sufficient remaining space to hold the training task, sending the training task into the token bucket rate-limited queue;
S500, submitting the training tasks in sequence, according to the carrying capacity, in the chronological order in which the training tasks entered the token bucket rate-limited queue.
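The five steps above can be strung together into a single sketch. Everything below is illustrative — the stub predictor, its thresholds, and the policy of spending one token per submission are assumptions made for the example, not the patent's trained model:

```python
from collections import deque

def predict_params(online_users, avg_load, hour):
    # Stand-in for S100's trained neural network prediction model; the real
    # mapping is learned from samples, so these thresholds are invented.
    queue_length = max(1, min(100, online_users // 2 + (5 if 9 <= hour <= 18 else 20)))
    token_rate = 2.0 if avg_load < 0.7 else 0.5
    return queue_length, token_rate

def run_submission_cycle(tasks, online_users, avg_load, hour, elapsed_seconds):
    # S100/S200: predict the parameters and adopt them as the carrying capacity.
    queue_length, token_rate = predict_params(online_users, avg_load, hour)
    queue, rejected, submitted = deque(), [], []
    for task in tasks:
        # S300/S400: admit a task only while the queue has remaining space.
        if len(queue) < queue_length:
            queue.append(task)
        else:
            rejected.append(task)
    # S500: submit in order of entry, one token consumed per submission.
    tokens = int(token_rate * elapsed_seconds)
    while queue and tokens > 0:
        submitted.append(queue.popleft())
        tokens -= 1
    return submitted, list(queue), rejected
```

Under light load the cycle admits everything and drains the queue as tokens allow; under heavy load the smaller predicted queue length causes late arrivals to be rejected.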
In some embodiments of the present invention, for a deep learning training task platform, the invention runs a neural network algorithm on top of a token bucket rate-limited queue to dynamically adjust, in real time, the token insertion rate and the queue length of the token bucket queue. The invention collects load state information (including the current number of online users, the average system load per time period, and the time period information) and computes the token bucket rate-limited queue parameters (including the queue length and the token insertion rate) through a neural network model. The invention records the load state information, adds it to the sample set, and then updates the parameters of the neural network model.
In some embodiments of the present invention, FIG. 2 shows a schematic diagram of the module structure of an embodiment of the method for submitting training tasks via a rate-limited queue according to the present invention. As shown in FIG. 2, it includes a task configuration module, a rate-limiting module, an adaptive module and a running module, wherein:
Task configuration module: the deep learning training platform provides a task configuration module responsible for configuring task parameters, such as the number of iterations, the training framework, the batch size, and the number of CPUs/GPUs used;
Rate-limiting module: the deep learning training platform provides a rate-limiting module, which throttles the submission of training tasks through the token bucket rate-limited queue. After each submission, a task must first enter the rate-limited queue, and only after obtaining a token can the training task actually be delivered to the bottom layer of the system. If the queue is full, the rejection policy is executed: the request is discarded and the user is notified by email. In addition, the throttling effect of the token bucket rate-limited queue can be tuned by adjusting the queue size and the token insertion rate;
Adaptive module: the deep learning training platform provides an adaptive module that can automatically adjust the queue size and token insertion rate of the token bucket rate-limited queue based on the current system state and time period. This module can be divided into two sub-modules: a prediction module and a training module. The training module updates the data provided by the system in real time into the training sample set, computes the network parameters through neural network training, abstracts the network parameters into a model, and pushes it to the prediction module; the prediction module uses the network model parameters to predict a result from the current system state (such as the system load and the current number of online users) and the time period, namely the token bucket rate-limited queue parameters, and uses this result to adjust the carrying capacity of the token bucket rate-limited queue (that is, the queue size and the token insertion rate);
Running module: the deep learning training platform provides a running module, which parses the task parameters of a training task that has obtained a token, builds a training object, and delivers the object to the system service to start the training of the deep training task.
In some embodiments of the present invention, the specific implementation process is as follows:
The task submission process includes:
According to step S300, the task parameters of the training task are configured according to the user's training requirements:
the user enters the task parameters of his or her deep learning task, such as the number of iterations, the training framework, the batch size, and the number of CPUs (Central Processing Units)/GPUs (Graphics Processing Units) used;
these task parameters are assembled into an abstract data structure, and the training task together with the abstract data structure is sent to the rate-limiting module.
According to step S400, the deep learning platform starts a rate-limiting module to receive and process the training task from step S300:
it is determined whether the rate-limited queue still has remaining space for the task. If there is space, the training task is placed into the token bucket rate-limited queue; if the queue has no remaining space available, the user is notified that the request failed, and this request ends.
The process further includes the steps of: sending a signal upon receiving the notification information, and executing the request rejection operation.
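The admission test and rejection policy of step S400 reduce to a few lines. This is a sketch with invented names: the `send_email` callback and the task dictionary are assumptions, though the notification text follows the description's wording:

```python
def handle_submission(queue, capacity, task, send_email):
    """Admit the task if the rate-limited queue has room; otherwise apply the
    rejection policy: drop the request and notify the user by email.
    `queue` is a plain list standing in for the token bucket rate-limited queue."""
    if len(queue) < capacity:
        queue.append(task)
        return True
    # Rejection policy: assemble the notification from the task parameters,
    # inform the user, and let the request end.
    send_email(task.get("user", "unknown"),
               "This request is cancelled due to excessive system load")
    return False
```

In practice the capacity here would be the queue length most recently produced by the prediction model, not a constant.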
Further, for the adaptive adjustment process of the token bucket rate-limited queue, FIG. 3 shows a schematic flowchart of the adjustment process of the token bucket rate-limited queue in an embodiment of the method for submitting training tasks via a rate-limited queue according to the present invention; the adaptive adjustment process is shown in FIG. 3:
According to step S100, the token bucket rate-limited queue parameters are predicted from the load state information in the system information.
This information is abstracted into data and fed into the trained neural network prediction model, and the output information is obtained through prediction: the queue length and the token insertion rate.
Step 1.1: the deep learning training platform monitors the load state information in the system information and obtains the relevant parameters of the load state information: the current number of online users, the system load, and the time period information.
Step 1.2: this information is abstracted into data and fed into the trained neural network prediction model.
Step 1.3: the output data is obtained through the neural network prediction model: the queue length and the token insertion rate.
According to step S200, the computed queue length and token insertion rate parameters are written into the rate-limited queue of the platform's rate-limiting module, and the carrying capacity of the rate-limited queue is adjusted.
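Steps 1.1 to 1.3 amount to one forward pass from load features to the two queue parameters. The tiny fixed-weight network below is a toy stand-in — the patent does not specify the architecture, weights, or feature scaling, so every number here is invented for illustration:

```python
import math

def predict_queue_params(online_users, avg_load, hour):
    # Step 1.2: abstract the monitored load state into a feature vector.
    x = [online_users / 100.0, avg_load, hour / 24.0]
    # Step 1.3: a one-hidden-layer forward pass with made-up weights.
    weights = ([0.5, -1.0, 0.2], [1.0, 0.3, -0.4])
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]
    # Map the activations to the two outputs: queue length and token rate.
    queue_length = max(1, round(10 + 50 * hidden[0]))
    token_rate = max(0.1, 5 + 10 * hidden[1])
    return queue_length, token_rate
```

A real deployment would learn the weights from the collected (load state, observed queue parameter) samples rather than hard-coding them.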
In some embodiments of the method for submitting training tasks via a rate-limited queue of the present invention, submitting the training tasks in sequence, according to the carrying capacity, in the chronological order in which the training tasks entered the token bucket rate-limited queue further comprises:
determining whether a token can be obtained;
in response to obtaining a token, submitting the training tasks in sequence in the chronological order in which the training tasks entered the token bucket rate-limited queue.
In some embodiments of the present invention, an attempt is made to pop from the queue the training task that entered the queue earliest. The pop condition is whether a token can be obtained from the token bucket. If there is a token in the token bucket, the deep learning platform starts a running module to parse the training task popped from the queue, deliver the task to the underlying service, and at the same time send a signal to the system information receiving system. If there is no token in the token bucket, the pop operation is cancelled because a token cannot be obtained, and the training task remains in the token bucket rate-limited queue.
In some embodiments of the method for submitting training tasks via a rate-limited queue of the present invention, the method further comprises:
parsing the training task submitted from the token bucket rate-limited queue, sending the training task to the underlying service, and sending a signal;
sending request-success information for the training task according to the signal.
In some embodiments of the present invention, the deep learning platform starts a running module to parse the training task popped from the queue, deliver the task to the underlying service, and at the same time send a signal to the system information receiving system.
In some embodiments of the method for submitting training tasks via a rate-limited queue of the present invention, the method further comprises:
in response to determining, according to the task parameters and the carrying capacity, that the token bucket rate-limited queue does not have sufficient remaining space to hold the training task, sending request-cancellation information for the training task and deleting the training task request.
In some embodiments of the present invention, if the signal comes from the step of determining whether the rate-limited queue still has remaining space for the task, notification information is assembled according to the task parameters of the training task, the user is notified by email that "this request is cancelled due to excessive system load", the memory is released, the training task request is deleted, and this request ends.
In some embodiments of the method for submitting training tasks via a rate-limited queue of the present invention, adjusting the carrying capacity of the token bucket rate-limited queue according to the token bucket rate-limited queue parameters further comprises:
configuring a time interval, and adjusting the carrying capacity of the token bucket rate-limited queue according to the time interval.
In some embodiments of the present invention, a certain time interval is configured and an update is performed at regular intervals to dynamically adjust the carrying capacity of the token bucket rate-limited queue, reducing the number of rejected user requests while affecting system performance as little as possible.
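The interval-based update described above can be expressed as a small guard function. A sketch with invented names — the caller supplies the clock values, the predictor, and the callback that applies the new parameters to the queue:

```python
def maybe_adjust(now, last_update, interval, predict, apply_params):
    # Before the interval has elapsed, keep the current carrying capacity.
    if now - last_update < interval:
        return last_update
    # Otherwise refresh the queue parameters from a fresh prediction.
    queue_length, token_rate = predict()
    apply_params(queue_length, token_rate)
    return now
```

Calling this on every scheduler tick gives the "update once per interval" behavior without a dedicated timer thread.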
In some embodiments of the method for submitting training tasks via a rate-limited queue of the present invention, the method further comprises:
configuring a preset time period, and collecting sample information according to the preset time period; updating the sample set of the neural network prediction model according to the sample information, and retraining and updating the neural network prediction model according to the updated sample set.
In some embodiments of the present invention, a fixed time is configured; for example, in some embodiments, at 1:00 a.m. the training model sample set is updated from the information collected over the previous day, the neural network model is retrained, and the new model parameters are saved for predicting the queue parameters of the next day:
through the information collected over the most recent day, the system continuously takes samples in different time periods and updates these samples into the training sample set of the neural network model.
At a specific time each day, the system automatically trains the neural network model on the new sample set to obtain the latest neural network prediction model.
The neural network model used in the previous steps is then replaced with the new neural network prediction model.
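The daily sampling-and-retraining cycle can be sketched as a growing sample set plus a model rebuild. Per-hour averaging is used here as a toy stand-in for retraining the neural network; the class and method names are invented for the example:

```python
from collections import defaultdict

class SampleStore:
    """Samples (hour, observed queue parameters) accumulate during the day;
    at a fixed time the 'model' is rebuilt from the enlarged sample set and
    swapped in for the old one (illustrative stand-in for NN retraining)."""

    def __init__(self):
        self.samples = []   # growing training sample set
        self.model = {}     # hour -> (queue_length, token_rate)

    def record(self, hour, queue_length, token_rate):
        self.samples.append((hour, queue_length, token_rate))

    def retrain(self):
        # Rebuild the prediction model from the updated sample set and
        # replace the previous model with the new one.
        buckets = defaultdict(list)
        for hour, ql, tr in self.samples:
            buckets[hour].append((ql, tr))
        self.model = {
            hour: (round(sum(q for q, _ in obs) / len(obs)),
                   sum(r for _, r in obs) / len(obs))
            for hour, obs in buckets.items()
        }

    def predict(self, hour, default=(10, 1.0)):
        # Fall back to a default for hours with no samples yet.
        return self.model.get(hour, default)
```

A scheduler would call `retrain()` once per day (the description's 1:00 a.m. example) and keep calling `predict()` against whichever model is current.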
In another aspect, the embodiments of the present invention provide an embodiment of a device for submitting training tasks via a rate-limited queue. FIG. 4 shows a schematic block diagram of an embodiment of the device for submitting training tasks via a rate-limited queue according to the present invention. As shown in FIG. 4, the device 101 comprises:
a prediction module 11 configured to monitor load state information and predict token bucket rate-limited queue parameters from the load state information by means of a trained neural network prediction model;
an adjustment module 12 configured to adjust the carrying capacity of the token bucket rate-limited queue according to the token bucket rate-limited queue parameters;
a remaining space judgment module 13 configured to configure task parameters of a training task and determine, according to the task parameters and the carrying capacity, whether the token bucket rate-limited queue has sufficient remaining space to hold the training task;
a queue entry module 14 configured to send the training task into the token bucket rate-limited queue in response to determining that the token bucket rate-limited queue has sufficient remaining space to hold the training task;
a submission module 15 configured to submit the training tasks in sequence, according to the carrying capacity, in the chronological order in which the training tasks entered the token bucket rate-limited queue.
In some embodiments of the device for submitting training tasks via a rate-limited queue of the present invention, the submission module 15 is further configured to:
determine whether a token can be obtained;
in response to obtaining a token, submit the training tasks in sequence in the chronological order in which the training tasks entered the token bucket rate-limited queue.
In some embodiments of the device for submitting training tasks via a rate-limited queue of the present invention, the device 101 further comprises:
a submission notification module configured to parse the training task submitted from the token bucket rate-limited queue, send the training task to the underlying service, and send a signal; and send request-success information for the training task according to the signal.
In some embodiments of the device for submitting training tasks via a rate-limited queue of the present invention, the device 101 further comprises:
a model update module configured to configure a preset time period and collect sample information according to the preset time period; update the sample set of the neural network prediction model according to the sample information, and retrain and update the neural network prediction model according to the updated sample set.
Likewise, those skilled in the art should understand that all the embodiments, features and advantages set forth above for the method for submitting training tasks via a rate-limited queue according to the present invention apply equally to the device according to the present invention. For the sake of brevity of the present disclosure, they are not repeated here.
It should be particularly noted that those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program of the method for submitting training tasks via a rate-limited queue can be stored in a computer-readable storage medium, and when executed, the program may include the processes of the embodiments of the above methods. The storage medium of the program may be a magnetic disk, an optical disc, a read-only memory (ROM) or a random access memory (RAM), etc. The embodiments of the above computer program can achieve the same or similar effects as any of the corresponding method embodiments described above.
Those skilled in the art will also appreciate that the various illustrative logical blocks, modules, circuits and algorithm steps described in connection with the disclosure herein can be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends on the specific application and the design constraints imposed on the overall system. Those skilled in the art may implement the functionality in various ways for each specific application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosure of the embodiments of the present invention.
It should be understood that, as used herein, the singular form "a" is intended to include the plural forms as well, unless the context clearly supports an exception. It should also be understood that "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
The serial numbers of the embodiments disclosed above are for description only and do not represent the relative merits of the embodiments.
Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of the disclosure of the embodiments of the present invention (including the claims) is limited to these examples; within the spirit of the embodiments of the present invention, the technical features in the above embodiments or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the present invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, improvement, etc. made within the spirit and principles of the embodiments of the present invention shall be included within the protection scope of the embodiments of the present invention.

Claims (10)

  1. A method for submitting training tasks via a rate-limited queue, characterized in that the method comprises:
    monitoring load state information, and predicting token bucket rate-limited queue parameters from the load state information by means of a trained neural network prediction model;
    adjusting the carrying capacity of a token bucket rate-limited queue according to the token bucket rate-limited queue parameters;
    configuring task parameters of a training task, and determining, according to the task parameters and the carrying capacity, whether the token bucket rate-limited queue has sufficient remaining space to hold the training task;
    in response to determining that the token bucket rate-limited queue has sufficient remaining space to hold the training task, sending the training task into the token bucket rate-limited queue;
    submitting the training tasks in sequence, according to the carrying capacity, in the chronological order in which the training tasks entered the token bucket rate-limited queue.
  2. The method for submitting training tasks via a rate-limited queue according to claim 1, characterized in that submitting the training tasks in sequence, according to the carrying capacity, in the chronological order in which the training tasks entered the token bucket rate-limited queue further comprises:
    determining whether a token can be obtained;
    in response to obtaining the token, submitting the training tasks in sequence in the chronological order in which the training tasks entered the token bucket rate-limited queue.
  3. The method for submitting training tasks via a rate-limited queue according to claim 1, characterized in that the method further comprises:
    parsing the training task submitted from the token bucket rate-limited queue, sending the training task to an underlying service, and sending a signal;
    sending request-success information for the training task according to the signal.
  4. The method for submitting training tasks via a rate-limited queue according to claim 1, characterized in that the method further comprises:
    in response to determining, according to the task parameters and the carrying capacity, that the token bucket rate-limited queue does not have sufficient remaining space to hold the training task, sending request-cancellation information for the training task and deleting the training task request.
  5. The method for submitting training tasks via a rate-limited queue according to claim 1, characterized in that adjusting the carrying capacity of the token bucket rate-limited queue according to the token bucket rate-limited queue parameters further comprises:
    configuring a time interval, and adjusting the carrying capacity of the token bucket rate-limited queue according to the time interval.
  6. The method for submitting training tasks via a rate-limited queue according to claim 1, characterized in that the method further comprises:
    configuring a preset time period, and collecting sample information according to the preset time period;
    updating the sample set of the neural network prediction model according to the sample information, and retraining and updating the neural network prediction model according to the updated sample set.
  7. A device for submitting training tasks via a rate-limited queue, characterized in that the device comprises:
    a prediction module configured to monitor load state information and predict token bucket rate-limited queue parameters from the load state information by means of a trained neural network prediction model;
    an adjustment module configured to adjust the carrying capacity of a token bucket rate-limited queue according to the token bucket rate-limited queue parameters;
    a remaining space judgment module configured to configure task parameters of a training task and determine, according to the task parameters and the carrying capacity, whether the token bucket rate-limited queue has sufficient remaining space to hold the training task;
    a queue entry module configured to send the training task into the token bucket rate-limited queue in response to determining that the token bucket rate-limited queue has sufficient remaining space to hold the training task;
    a submission module configured to submit the training tasks in sequence, according to the carrying capacity, in the chronological order in which the training tasks entered the token bucket rate-limited queue.
  8. The device for submitting training tasks via a rate-limited queue according to claim 7, characterized in that the submission module is further configured to:
    determine whether a token can be obtained;
    in response to obtaining the token, submit the training tasks in sequence in the chronological order in which the training tasks entered the token bucket rate-limited queue.
  9. The device for submitting training tasks via a rate-limited queue according to claim 7, characterized in that the device further comprises:
    a submission notification module configured to parse the training task submitted from the token bucket rate-limited queue, send the training task to an underlying service, and send a signal; and send request-success information for the training task according to the signal.
  10. The device for submitting training tasks via a rate-limited queue according to claim 7, characterized in that the device further comprises:
    a model update module configured to configure a preset time period and collect sample information according to the preset time period; update the sample set of the neural network prediction model according to the sample information, and retrain and update the neural network prediction model according to the updated sample set.
PCT/CN2021/109619 2020-09-10 2021-07-30 Method and device for submitting training tasks via a rate-limited queue WO2022052659A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/012,930 US20230196134A1 (en) 2020-09-10 2021-07-30 Method and device for submitting training task by rate limiting queue

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010949625.8A CN112202684B (zh) 2020-09-10 2020-09-10 Method and device for submitting training tasks via a rate-limited queue
CN202010949625.8 2020-09-10

Publications (1)

Publication Number Publication Date
WO2022052659A1 true WO2022052659A1 (zh) 2022-03-17

Family

ID=74015628

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/109619 WO2022052659A1 (zh) 2020-09-10 2021-07-30 Method and device for submitting training tasks via a rate-limited queue

Country Status (3)

Country Link
US (1) US20230196134A1 (zh)
CN (1) CN112202684B (zh)
WO (1) WO2022052659A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116582497A (zh) * 2023-04-21 2023-08-11 中国测绘科学研究院 Efficient adaptive traffic shaping method for GIS services on a single server

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112202684B (zh) * 2020-09-10 2022-05-13 苏州浪潮智能科技有限公司 一种限速队列提交训练任务的方法和装置
CN114185612B (zh) * 2021-11-30 2023-08-08 北京百度网讯科技有限公司 一种数据更新的方法、装置、设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101729386A (zh) * 2008-11-03 2010-06-09 华为技术有限公司 Flow control method and device based on token scheduling
CN101841486A (zh) * 2010-06-01 2010-09-22 杭州华三通信技术有限公司 Message transmission method and device
US20150036503A1 (en) * 2013-08-05 2015-02-05 International Business Machines Corporation Rate Control By Token Buckets
CN106789720A (zh) * 2016-12-16 2017-05-31 无锡路通视信网络股份有限公司 Dynamic token bucket generation method based on system hardware utilization
CN112202684A (zh) * 2020-09-10 2021-01-08 苏州浪潮智能科技有限公司 Method and device for submitting training tasks via a rate-limited queue


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116582497A (zh) * 2023-04-21 2023-08-11 中国测绘科学研究院 Efficient adaptive traffic shaping method for GIS services on a single server
CN116582497B (zh) * 2023-04-21 2024-01-23 中国测绘科学研究院 Efficient adaptive traffic shaping method for GIS services on a single server

Also Published As

Publication number Publication date
US20230196134A1 (en) 2023-06-22
CN112202684B (zh) 2022-05-13
CN112202684A (zh) 2021-01-08

Similar Documents

Publication Publication Date Title
WO2022052659A1 (zh) Method and device for submitting training tasks via a rate-limited queue
CN111835827B (zh) IoT edge computing task offloading method and system
CN105281981B (zh) Data traffic monitoring method and device for network services
CN103379002B (zh) Adaptive monitoring of telecommunication networks
US7441028B1 (en) Method of defining a required information system capacity as a function of a user's quality of service objectives
US7734768B2 (en) System and method for adaptively collecting performance and event information
CN113032120B (zh) Collaborative scheduling method for industrial-field big-data tasks based on edge computing
EP4203437A1 (en) Data set and node cache-based scheduling method and device
WO2019062405A1 (zh) Application processing method and apparatus, storage medium and electronic device
CN107592345A (zh) Transaction rate-limiting device and method, and transaction system
CN111311286B (zh) Intelligent customer-service data processing method and apparatus, computing device and storage medium
US11750711B1 (en) Systems and methods for adaptively rate limiting client service requests at a blockchain service provider platform
CN112162863A (zh) Edge offloading decision method, terminal and readable storage medium
JP7313467B2 (ja) Server load prediction and advanced performance measurement
CN111901134B (zh) Method and device for predicting network quality based on a recurrent neural network (RNN) model
CN104579738B (zh) Computer-implemented method, computer system and computer program product for managing traffic in a network
CN111324644B (zh) Method and device for monitoring database connection storms in a large-scale microservice architecture
US20210168211A1 (en) Dynamically configuring a web server timeout
CN105095049B (zh) Method and apparatus for monitoring application operation
CN113543160A (zh) 5G slice resource configuration method and apparatus, computing device and computer storage medium
CN115827232A (zh) Method, apparatus, system and device for determining a configuration for a service model
Boulougaris et al. A QoS-aware, Proactive Tasks Offloading Model for Pervasive Applications
CN114936089A (zh) Resource scheduling method, system, device and storage medium
CN114138493A (zh) Energy-consumption-aware edge computing power resource scheduling method
CN114423049A (zh) Perception prediction method and apparatus, electronic device and storage medium

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21865715

Country of ref document: EP

Kind code of ref document: A1