WO2019095873A1 - Task parallel processing method, apparatus, system, storage medium, and computer device - Google Patents

Task parallel processing method, apparatus, system, storage medium, and computer device

Info

Publication number
WO2019095873A1
WO2019095873A1 (PCT/CN2018/108298, CN2018108298W)
Authority
WO
WIPO (PCT)
Prior art keywords
task
processor
data
executed
model
Prior art date
Application number
PCT/CN2018/108298
Other languages
English (en)
French (fr)
Inventor
吴林阳
孟小甫
赵永威
郭崎
陈峋宇
王康羽
Original Assignee
上海寒武纪信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201711157341.XA external-priority patent/CN109814986B/zh
Priority claimed from CN201711484410.8A external-priority patent/CN109992307B/zh
Priority claimed from CN201810084077.XA external-priority patent/CN110097180B/zh
Priority claimed from CN201810083577.1A external-priority patent/CN110097179B/zh
Application filed by 上海寒武纪信息科技有限公司 filed Critical 上海寒武纪信息科技有限公司
Priority to EP18878728.7A priority Critical patent/EP3614260A4/en
Priority to EP19210491.7A priority patent/EP3651020A1/en
Priority to KR1020197037907A priority patent/KR102569086B1/ko
Priority to JP2019568198A priority patent/JP7074777B2/ja
Publication of WO2019095873A1 publication Critical patent/WO2019095873A1/zh
Priority to US16/575,344 priority patent/US11221877B2/en
Priority to US16/702,502 priority patent/US11113103B2/en
Priority to US16/702,491 priority patent/US11360811B2/en
Priority to US16/705,190 priority patent/US11113104B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/485Task life-cycle, e.g. stopping, restarting, resuming execution
    • G06F9/4856Task life-cycle, e.g. stopping, restarting, resuming execution resumption being on a different machine, e.g. task migration, virtual machine migration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/547Remote procedure calls [RPC]; Web services
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/10Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition

Definitions

  • the present application relates to the field of computer technology, and in particular, to a task parallel processing method, apparatus, system, storage medium, and computer device.
  • CUDA (Compute Unified Device Architecture): the computing platform of the graphics hardware manufacturer NVIDIA.
  • Cudnn (CUDA Deep Neural Network library): NVIDIA's deep neural network acceleration library.
  • Cublas (CUDA Basic Linear Algebra Subprograms): NVIDIA's accelerated matrix operation library.
  • When the program instructions of a convolutional neural network are implemented by programming against the CUDA, Cudnn, Cublas and other accelerator API interfaces (such as the matrix operation acceleration library), no interdependence is expressed between the instructions of the convolutional neural network, and the programmed instructions can only be executed sequentially.
  • a neural network is in fact a queue of functions and has a graph structure.
  • in the program instructions of a convolutional neural network there are therefore task branches.
  • tensorflow: Google's second-generation artificial intelligence learning system, researched and developed on the basis of DistBelief.
  • Caffe (Convolutional Architecture for Fast Feature Embedding): a convolutional neural network framework.
  • applying the above framework programs to achieve task parallelism not only requires additional software installation, but also suffers from program interface incompatibility, which is inconvenient to use.
  • the present application proposes a task parallel processing method, including:
  • the parallel execution tasks in each of the work queues are controlled to start running.
  • the step of constructing the task directed acyclic graph DAG includes:
  • the program is split according to the operation node and/or the data node in the program, and the task to be executed is obtained.
  • the step of splitting the program according to the operation node in the program to acquire the task to be executed includes:
  • the model of the operation request with the model is split and/or the input data of the model is split to obtain a task to be executed.
  • the splitting the model of the operation request with the model, and obtaining the task to be performed includes:
  • the correspondence between the input data and the output data of the tasks to be executed is set using each of the weights.
  • the splitting the model of the operation request with the model, and obtaining the task to be performed includes:
  • the model of the operation with the model is split in the window direction and/or the channel direction of the model according to a preset rule, and the task to be performed is obtained.
  • the step of splitting the input data of the operation request with the model and obtaining the task to be performed includes:
  • the input data of the operation with the model is split in the window direction of the data according to a preset rule, and the task to be executed is obtained.
  • the step of splitting the program according to the operation node in the program to acquire the task to be executed includes:
  • the program includes an operation request without a model
  • the input data and/or output data of the operation request without the model is split to obtain a task to be executed.
  • the step of splitting the input data and/or the output data of the operation request without the model to obtain the task to be performed includes:
  • the input data and/or the output data are split in the window direction of the data according to a preset rule to obtain a task to be executed.
  • the step of constructing the task directed acyclic graph DAG according to the dependencies between the tasks to be performed includes:
  • a task directed acyclic graph DAG is constructed according to the parallel node and the sequential node.
  • the step of distributing each of the tasks to be executed to the plurality of work queues of the processor according to the task directed acyclic graph DAG comprises:
  • the step of controlling the parallel execution of the tasks to be executed in each of the work queues according to the dependencies of the tasks to be executed in the acyclic graph DAG includes:
  • the present application proposes a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps mentioned in the above method.
  • the present application proposes a task parallel processing system including a memory, a multi-core processor, and a computer program stored on the memory and operable on the processor, the multi-core processor being capable of running a splitting algorithm, and the multi-core processor implementing the steps mentioned in the above method when executing the computer program.
  • the present application also proposes a task parallel processing system, including a memory, a first processor and a second processor, the first processor being capable of running a splitting algorithm, the second processor being a multi-core processor, and the first processor and the second processor implementing the steps mentioned in the above method when executing the computer program.
  • the present application also provides a task parallel processing apparatus, including: a DAG graph construction module, a task distribution module, and a scheduling control module.
  • the DAG graph construction module is configured to construct a task directed acyclic graph DAG according to a dependency relationship between tasks to be executed;
  • the task distribution module is configured to distribute each of the tasks to be executed to a plurality of work queues of the processor according to the task directed acyclic graph DAG;
  • the scheduling control module is configured to control, according to the dependencies of the tasks to be executed in the acyclic graph DAG, the parallel execution tasks in each of the work queues to start running.
  • Compared with the prior art, the task parallel processing method, storage medium, computer device, apparatus, and system provided by the present application have the following beneficial effects:
  • the task parallel processing method, storage medium, computer device, apparatus, and system proposed by the present application construct a task directed acyclic graph DAG according to the dependency relationship between tasks, and then perform task distribution and control according to the task directed acyclic graph DAG; by rescheduling tasks in the work queues, task parallelism on a multi-core processor is realized, thereby improving data processing efficiency.
  • the implementation of the task parallel processing method proposed in this embodiment does not depend on a framework program such as tensorflow or Caffe, so there is no need to consider interface compatibility issues when designing the program.
  • the present application further provides an instruction list scheduling method, including: acquiring a to-be-scheduled instruction set in a to-be-scheduled instruction list, and performing data dependency analysis on the to-be-scheduled instruction set to obtain data dependencies between the instructions in the to-be-scheduled instruction set;
  • the instructions of each order in the post-scheduling instruction list are determined from the selection nodes of the corresponding order.
  • the step of determining, according to the preset rule, the instructions in the order of the post-scheduled instruction list according to the selecting node in the corresponding order comprises:
  • the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
  • the method comprises:
  • the initial execution time is updated to the longest execution time corresponding to the currently accessed selection node.
  • the step of accessing the selection node and obtaining the longest execution time corresponding to the currently accessed selection node includes:
  • the ordered instruction corresponding to the current access node is determined as the instruction of the corresponding order in the scheduled instruction list
  • the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
  • the instruction sequence in the instruction list to be scheduled is used as the instruction sequence in the post-scheduling instruction table.
  • the step of accessing the selection node and obtaining the longest execution time corresponding to the currently accessed selection node includes:
  • the selection node is selected according to a random priority rule for access, and the longest execution time corresponding to the selected node currently selected for access is obtained.
  • the step of accessing the selection node and obtaining the longest execution time corresponding to the currently accessed selection node includes:
  • the selection node is selected for access according to the breadth-first rule, and the longest execution time corresponding to the selected node currently selected for access is obtained.
  • the step of accessing the selection node and obtaining the longest execution time corresponding to the currently accessed selection node includes:
  • the selection node is selected according to the depth-first rule for access, and the longest execution time corresponding to the selected node currently selected for access is obtained.
  • the step of accessing the selection node and obtaining the longest execution time corresponding to the currently accessed selection node includes:
  • the selection node that is not less than the preset order is selected according to the depth-first rule to obtain the longest execution time corresponding to the selected node currently selected for access.
  • the step of accessing the selection node and obtaining the longest execution time corresponding to the currently accessed selection node includes:
  • the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
  • the step of determining the instructions of each order in the post-scheduled instruction list in the selection node according to the corresponding order according to the preset rule comprises:
  • All the selected nodes corresponding to the current order are evaluated according to the preset priority of the instruction, the evaluation results of the selected nodes of the current order are obtained, and the instruction corresponding to the current order is determined according to the evaluation result.
  • the method includes setting a priority of each instruction according to a specific content and/or type of the currently selected node.
  • the step of determining the instructions of each order in the post-scheduled instruction list in the selection node according to the corresponding order according to the preset rule comprises:
  • the instruction corresponding to the current order is determined according to the length of the shortest execution time corresponding to all the selected nodes in the current order.
  • An instruction scheduling device includes: an acquisition module, a data dependency analysis module, and an evaluation module,
  • the obtaining module is configured to obtain a to-be-scheduled instruction set in the to-be-scheduled instruction list, and obtain, according to a data dependency relationship between the instructions, all the selected nodes corresponding to each instruction selection in the instruction scheduling process;
  • the data dependency analysis module is configured to perform data dependency analysis on the instruction set to be processed, and obtain a data dependency relationship between the instructions;
  • the evaluation module is configured to determine, according to a preset rule, an instruction of each order in the scheduled instruction list according to a selection node in a corresponding order.
  • a computer device comprising a memory, a processor, and a computer program stored on the memory and operative on the processor, the processor performing the steps recited in the method described above.
  • a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps recited in the method described above.
  • the instruction list scheduling method, device, computer device and storage medium provided by the present application have the following beneficial effects:
  • all the selected nodes corresponding to each instruction selection in the scheduling process are obtained, and then the instructions of each order in the scheduled instruction list are determined according to the evaluation results of the selected nodes corresponding to the respective orders.
  • the method can ensure that each time an instruction is selected, the selected instruction is the optimal result for the current state; in the rearranged instruction list obtained from these optimal results, the arrangement of the instructions is more compact, which shortens the execution time of the instruction sequence in the original instruction list.
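  • By way of illustration only, the data dependency analysis described above can be sketched in Python as follows. The instruction records and their read/write register sets are hypothetical examples, not taken from this disclosure; the sketch simply derives read-after-write, write-after-write, and write-after-read dependencies between instructions:

        from collections import namedtuple

        # Hypothetical instruction record: a name, the registers it reads, the registers it writes.
        Instr = namedtuple("Instr", ["name", "reads", "writes"])

        def data_dependencies(instrs):
            """Return edges (i, j) meaning instruction j must run after instruction i."""
            deps = set()
            for j, b in enumerate(instrs):
                for i, a in enumerate(instrs[:j]):
                    raw = a.writes & b.reads    # read-after-write
                    waw = a.writes & b.writes   # write-after-write
                    war = a.reads & b.writes    # write-after-read
                    if raw or waw or war:
                        deps.add((i, j))
            return deps

        instrs = [
            Instr("load  r1", set(),        {"r1"}),
            Instr("load  r2", set(),        {"r2"}),
            Instr("add   r3", {"r1", "r2"}, {"r3"}),
            Instr("store r3", {"r3"},       set()),
        ]
        print(sorted(data_dependencies(instrs)))   # [(0, 2), (1, 2), (2, 3)]

  • instructions that share no dependency edge (the two loads above) are exactly the ones whose relative order a scheduler may change; the selection nodes described above enumerate such admissible choices at each position.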
  • the present application also provides a computer device, including a first processor, a second processor, and a memory, wherein the memory stores offline models and input data corresponding to a plurality of original networks, as well as a runtime system capable of running on the first processor; the runtime system includes:
  • a data processing device configured to acquire an offline model and input data corresponding to a current original network from the memory, where the offline model corresponding to the current original network includes model parameters and instructions corresponding to each computing node in the original network, and interface data of each computing node in the original network;
  • a device management device configured to control the second processor to be turned on or off
  • a task execution device configured to control the second processor to run an offline model of the current original network and input data.
  • the data processing apparatus includes an offline model loading module and an input data loading module;
  • the offline model loading module is configured to obtain an offline model corresponding to each current original network from the memory, and parse the offline model corresponding to the current original network;
  • the input data loading module is configured to obtain input data corresponding to the current original network from the memory.
  • the data processing apparatus further includes an input data pre-processing module, where the input data pre-processing module is configured to pre-process the input data corresponding to the current original network acquired by the input data loading module, so as to enable the second processor to run on the input data corresponding to the current original network, and to store the output data obtained by the second processor to the memory.
  • the computer device further includes application software capable of running on the runtime system
  • the data processing device is capable of providing an offline model API and an input data API
  • the device management device is capable of providing a second processor driver API
  • the task execution device is capable of providing a second processor running API
  • the application software is capable of invoking the offline model API and input data API, the second processor driver API, and the second processor running API.
  • the number of the second processors is multiple, or the second processor includes multiple processing modules;
  • the task execution apparatus is further capable of providing a task assignment API
  • the application software is further capable of invoking the task assignment API to control a plurality of the second processors or a plurality of processing modules controlling the second processor.
  • the present application also provides a data processing method for the computer device, the method comprising the following steps:
  • the control data processing device obtains, from the memory, an offline model and input data corresponding to the current original network, where the offline model corresponding to the current original network includes model parameters and instructions corresponding to the respective computing nodes in the current original network, and interface data of each computing node in the current original network;
  • the method further includes the following steps:
  • before the step of acquiring the offline model and the input data corresponding to the current original network from the memory, the method further includes the following steps:
  • the present application also provides a data processing method for the computer device, the method comprising the following steps:
  • the second processor driver API is called to control the second processor to shut down.
  • the present application also provides a computer readable storage medium having stored thereon a computer program that, when executed by one or more processors, implements the steps of any of the methods described above.
  • the computer device, the data processing method and the storage medium can directly obtain the offline model and the input data corresponding to the current original network from the memory, so that the second processor of the computer device can run the current original network according to the offline model and the input data of the original network to obtain the output data of the current original network. Since the offline model corresponding to each original network only includes the model parameters and instructions corresponding to each computing node in the original network and the interface data of each computing node in the original network, the data size of the offline model of the original network is much smaller than the data size of the original network, so that by running the (lightweight) offline model corresponding to the current original network on the computer device, the processing of heavyweight neural network data by the computer device can be realized. At the same time, by directly running the offline model corresponding to the current original network on the computer device, the processing speed and efficiency of the computer device can be improved without performing processing operations such as compiling each computing node in the current original network.
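  • For illustration, the API structure described above (offline model API, input data API, second processor driver API, second processor running API) might be exercised by application software roughly as in the following Python sketch. All class and function names here are hypothetical placeholders introduced for this example; they are not the names used by this disclosure:

        # Hypothetical runtime-system facade; the names are illustrative only.
        class Runtime:
            def open_device(self):                  # second processor driver API: turn on
                print("second processor powered on")
            def close_device(self):                 # second processor driver API: turn off
                print("second processor powered off")
            def load_offline_model(self, path):     # offline model API: load and parse
                return {"params": ..., "instructions": ..., "interface": ..., "path": path}
            def load_input(self, path):             # input data API
                return {"data": ..., "path": path}
            def run(self, model, data):             # second processor running API
                return {"output": ...}

        rt = Runtime()
        rt.open_device()
        model = rt.load_offline_model("net.offline")   # model parameters, instructions, interface data
        data = rt.load_input("input.bin")
        output = rt.run(model, data)                   # the second processor executes the offline model
        rt.close_device()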
  • the present application further provides a computer device, including a first processor, a second processor, a first memory, and a second memory, wherein the first memory stores offline models and input data corresponding to a plurality of original networks, as well as a runtime system capable of running on the first processor, and the second memory stores an operating system capable of running on the first processor or the second processor;
  • the runtime system is a secure runtime system established based on a trusted operating environment, the first memory being a secure storage medium; and when the runtime system is running on the first processor, the runtime The system can obtain the offline model and the input data corresponding to the current original network from the first memory, and control the second processor to run the offline model corresponding to the current original network;
  • the offline model corresponding to the current original network includes model parameters, instructions corresponding to each computing node in the original network, and interface data of each computing node in the original network.
  • the runtime system comprises:
  • the data processing device capable of providing an offline model API and an input data API, configured to acquire an offline model and input data corresponding to the current original network from the first memory;
  • the device management device capable of providing a second processor driving API, configured to control the second processor to be turned on or off;
  • a task execution device capable of providing a second processor running API for controlling the second processor to run an offline model of the current original network and input data.
  • the data processing apparatus includes an offline model loading module and an input data loading module;
  • the offline model loading module is configured to provide an offline model API, configured to obtain an offline model corresponding to each current original network from the first memory, and parse the offline model corresponding to the current original network;
  • the input data loading module is capable of providing an input data API for obtaining input data corresponding to the current original network from the first memory.
  • the data processing apparatus further includes an input data pre-processing module, the input data pre-processing module capable of providing a data pre-processing API for pre-processing the input data of the current original network, so that the second processor is capable of running on the input data of the current original network, and for storing the output data obtained by the second processor to the first memory.
  • the number of the second processors is multiple, or the second processor includes multiple processing modules;
  • the task execution apparatus is further capable of providing a task assignment API for controlling a plurality of the second processors or controlling a plurality of processing modules of the second processor.
  • the computer device further includes secure application software capable of running on the runtime system, and the application software is capable of invoking the offline model API and input data API, the second processor driver API, and the second processor running API.
  • the first memory and the second memory are physically disposed independently of each other;
  • or the first memory and the second memory are integrated, and the first memory and the second memory are logically disposed independently of each other.
  • the present application also provides a data processing method for the computer device, the method comprising the following steps:
  • the second processor that controls the computer device runs the current original network according to the offline model and the input data corresponding to the current original network, and obtains output data of the current original network;
  • the output data of the current original network is stored into the first memory.
  • the application also provides a data processing method for the computer device, the method comprising the following steps:
  • the second processor driver API is called to control the second processor to shut down.
  • the method further includes the following steps:
  • the method further includes the following steps:
  • the present application also provides a computer readable storage medium having stored thereon a computer program that, when executed by one or more processors, implements the steps of the method described in any of the above.
  • the computer device, the data processing method, and the storage medium can directly obtain the offline model and the input data corresponding to the current original network from the first memory, so that the second processor of the computer device can run the current original network according to the obtained offline model and input data of the original network. Since the offline model of the current original network only stores necessary network structure information such as the model parameters and instructions corresponding to each computing node in the current original network and the interface data of each computing node in the current original network, the data size of the offline model of the current original network is much smaller than the data size of the current original network. Thus, by running the offline model of the current original network, the secure runtime system established based on a trusted execution environment such as a TEE can extend its application range to the processing of heavyweight data such as neural networks.
  • the processing speed and efficiency of the computer device can be improved without performing processing operations such as compiling each computing node in the original network.
  • FIG. 1 is a schematic structural diagram of a task parallel processing system proposed in an embodiment
  • FIG. 2 is a schematic structural diagram of a task parallel processing system proposed in an embodiment
  • FIG. 3 is a flow chart showing the steps of a task parallel processing method proposed in an embodiment
  • FIG. 4 is a schematic diagram of splitting input data and output data of an operation request without a model proposed in an embodiment
  • FIG. 5 is a schematic diagram of input and output of a convolution operation (conv) of a neural network model proposed in an embodiment
  • FIG. 6 is a schematic diagram of splitting a conv model proposed in an embodiment
  • FIG. 7 is a flow chart showing the steps of a task parallel processing method proposed in an embodiment
  • Figure 8 is a task directed acyclic graph DAG constructed in an embodiment
  • FIG. 9 is a schematic diagram of a result of task assignment performed in an embodiment
  • FIG. 10 is a flow chart showing the steps of a task parallel processing method proposed in an embodiment
  • Figure 11 is a task directed acyclic graph DAG constructed in an embodiment
  • FIG. 12 is a schematic diagram of a result of task assignment performed in an embodiment
  • FIG. 13 is a schematic structural diagram of a task parallel processing apparatus according to an embodiment
  • FIG. 14 is a schematic structural diagram of a computer system in an embodiment
  • FIG. 15 is a flow chart showing the steps of an instruction list scheduling method in an embodiment;
  • FIG. 16 is a data dependency diagram of instructions to be scheduled obtained in an embodiment;
  • FIG. 17 is an association diagram of selection nodes obtained in an embodiment;
  • FIG. 18 is a schematic structural diagram of an instruction list scheduling apparatus according to an embodiment
  • FIG. 20 is a structural block diagram of a computer device in an embodiment;
  • FIG. 21 is a structural block diagram of an embodiment of the first processor of FIG. 20;
  • FIG. 22 is a block diagram showing the structure of an embodiment of the runtime system of FIG. 20;
  • FIG. 23 is a structural block diagram of another embodiment of the runtime system of FIG. 20;
  • FIG. 24 is a flowchart of a data processing method of an embodiment of the computer device of FIG. 20;
  • FIG. 25 is a flowchart of a data processing method of another embodiment of the computer device of FIG. 20;
  • FIG. 26 is a flowchart of an offline model generation method according to an embodiment;
  • FIG. 27 is a flowchart of a method for generating an offline model according to another embodiment
  • FIG. 28 is a network structure diagram of a neural network according to an embodiment
  • FIG. 29 is a schematic diagram of an offline model generation process of the neural network in FIG. 28;
  • Figure 30 is a block diagram showing the structure of a computer device in another embodiment
  • FIG. 31 is a flowchart of a data processing method of an embodiment of the computer device of FIG. 30;
  • FIG. 32 is a flow chart of a data processing method of another embodiment of the computer device of FIG. 30.
  • FIG. 1 is a schematic structural diagram of a task parallel processing system 600 (hereinafter referred to as a first task parallel processing system for convenience of distinction) according to an embodiment of the present application.
  • the task parallel processing system includes a processor 620 and a memory 610.
  • the memory 610 stores instructions executable by the processor 620.
  • the processor 620 includes a plurality of processor cores, and each processor core can communicate through the internal bus and execute different tasks.
  • the processor core of processor 620 can run a split algorithm.
  • FIG. 2 is a schematic structural diagram of another task parallel processing system 700 (hereinafter referred to as a second task parallel processing system for convenience of distinction) according to an embodiment of the present application.
  • the task parallel processing system includes a first processor 710, a second processor 720, and a memory 730. Instructions executable by the first processor 710 and/or the second processor 720 are stored on the memory 730.
  • the processor core of the first processor 710 is required to have the ability to run a split algorithm; the second processor 720 may not have the ability to run a split algorithm.
  • the respective processor cores of the first processor 710 and the second processor 720 communicate via the internal bus to perform different tasks.
  • the first processor 710 and the second processor 720 communicate via a bus to work together.
  • the first processor 710 may be a multi-core processor or a single-core processor.
  • the second processor 720 can be a multi-core processor.
  • FIG. 3 is a flow chart of steps of a task parallel processing method proposed by the present application.
  • the method can be applied to the task parallel processing system shown in FIG. 1 or FIG. 2, and the following steps may be stored in the memory of the task parallel processing system in the form of instructions.
  • the task parallel processing method may include:
  • Step S301 Construct a task directed acyclic graph DAG according to the dependency relationship between the tasks to be executed.
  • the directed acyclic graph DAG in this embodiment is used to indicate the driving dependencies between tasks to be executed.
  • DAG: Directed Acyclic Graph.
  • DAG is a kind of directed graph. It is often used to represent the driving dependencies between events and to manage the scheduling between tasks. Based on these characteristics of DAG, DAG can be used to describe the logical relationship between acquired tasks to be executed.
  • the tasks to be executed may be obtained by the processor core of the processor 620 in the first task parallel processing system 600 executing a preset splitting algorithm to split the program to be executed.
  • the tasks to be executed may also be obtained by the processor core of the first processor 710 in the second task parallel processing system 700 executing a preset splitting algorithm to split the program to be executed.
  • This implementation step S301 can be performed by the processor core of the processor 620 in the first task parallel processing system 600, or by the processor core of the first processor in the second task parallel processing system 700.
  • Step S302 Distribute each of the tasks to be executed to a plurality of work queues of the processor according to the task directed acyclic graph DAG.
  • the processor core of the processor in the first task parallel processing system 600, or the processor core in the second task parallel processing system 700 may include one or more work queues.
  • a work queue is a mechanism for deferring task execution; it runs the pending tasks to be executed in order. The running of each task in a work queue is controlled by a kernel thread, so the control thread of the work queue can be adjusted by the interrupt control mechanism of the processor system to achieve task rescheduling or even sleep.
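  • As an analogy only, the behaviour of such a work queue (tasks queued and run in order by one control thread that sleeps while the queue is empty) can be sketched in Python as follows; this is not the kernel-level implementation referred to above:

        import queue
        import threading

        class WorkQueue:
            """Runs submitted tasks in order on one dedicated control thread."""
            def __init__(self):
                self.tasks = queue.Queue()
                self.thread = threading.Thread(target=self._loop, daemon=True)
                self.thread.start()

            def _loop(self):
                while True:
                    task = self.tasks.get()   # blocks (sleeps) until a task is available
                    task()                    # run the task to be executed
                    self.tasks.task_done()

            def submit(self, task):
                self.tasks.put(task)

        wq = WorkQueue()
        wq.submit(lambda: print("task A done"))
        wq.submit(lambda: print("task B done"))
        wq.tasks.join()                       # wait until both submitted tasks have run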
  • the tasks to be executed downstream of parallel nodes in the task directed acyclic graph DAG are generally tasks that can be executed in parallel. Therefore, the tasks to be executed can be distributed according to the constructed task directed acyclic graph DAG.
  • implementation step S302 may be performed by any processor core in the first task parallel processing system 600, or may be performed by any processor core in the second task parallel processing system 700.
  • Step S303 According to the dependency relationship of each task to be executed in the acyclic graph DAG, the parallel execution tasks in each of the work queues are controlled to start running.
  • each work queue runs independently; when a task to be executed in one work queue depends on the output result of tasks to be executed in other work queues, an execution error occurs if the tasks to be executed are not scheduled. Therefore, in order to ensure that the program outputs the correct result, each task to be executed in each work queue is scheduled according to the dependency relationship of each task in the task directed acyclic graph DAG, and the running of each task to be executed is controlled.
  • this implementation step may be performed by any of the processor cores in the first task parallel processing system 600, or may be performed by any of the processor cores in the second task parallel processing system 700.
  • the task parallel processing method proposed in this embodiment constructs a task directed acyclic graph DAG according to the dependency relationship between the tasks to be executed, and then performs task distribution and control according to the task directed acyclic graph DAG.
  • the rescheduling of the work queue realizes the parallelism of the tasks of the multi-core processor, improving the data processing efficiency.
  • the implementation of the task parallel processing method proposed in this embodiment does not depend on a framework program such as tensorflow or Caffe, so there is no need to consider interface compatibility issues when designing the program.
  • the step of constructing the task directed acyclic graph DAG according to the dependencies between the tasks to be executed includes:
  • the program is split according to the operation node and/or the data node in the program, and the task to be executed is obtained.
  • the execution program contains multiple operation requests (such as conv, pool, active, add, etc.), and there are operation nodes between the operation requests. Therefore, the tasks to be executed can be obtained by splitting the program according to the operation nodes.
  • some operation requests can only be executed sequentially. In this case, the execution program can be considered at the data level (code level) and split according to the data nodes in the program, to increase the possibility of task parallelism.
  • this implementation step requires the processor core of the processor 620 in the first task parallel processing system 600, or the processor core of the first processor 710 in the second task parallel processing system 700, to execute a preset splitting algorithm that splits the program to be executed according to the operation nodes and/or the data nodes in the program to obtain the tasks to be executed.
  • when the execution program is split, it may be split only according to the operation nodes, or split according to the data nodes directly at the data level, or the two may be combined.
  • the split mode is selected according to actual needs, which is not limited in this application.
  • when the processor core of the processor 620 in the first task parallel processing system 600, or the processor core of the first processor 710 in the second task parallel processing system 700, splits the program according to the operation nodes in the program, there are two situations: 1) the program includes an operation request without a model; 2) the program includes an operation request with a model.
  • Case 1: When the program includes an operation request without a model (such as pool, batchnorm, Lrn, active, add, etc.), the step of splitting the program according to the operation nodes in the program to obtain the tasks to be executed includes:
  • the input data and/or the output data of the operation request without the model are split to obtain a task to be executed.
  • the input data and/or the output data of the operation request without the model may be split in the window direction (height-width direction, hw direction) of the data according to a preset rule to obtain the tasks to be executed.
  • FIG. 4 is a schematic diagram of splitting the input data and output data of an operation request without a model in the window direction of the data.
  • the default rule for this split is to divide the input data and output data equally on the plane where the window is located.
  • equally dividing the input data and the output data in the window direction of the data to obtain the tasks to be executed is only one specific form of splitting the input data and the output data in the window direction proposed in this embodiment.
  • the data may also be split in the window direction in a non-uniform manner, or in a different equal-division manner; as long as the input data and the output data are split according to certain rules, the purpose of this step can be achieved, and how to split is not limited in this application.
  • the present application proposes to split the input data and the output data in the window direction of the data in order to obtain a plurality of tasks to be executed, and the purpose of this step can be achieved as long as the input data and the output data are split. Therefore, when splitting an operation request without a model to obtain the tasks to be executed, only the input data may be split, only the output data may be split, or both the input data and the output data may be split.
  • all of the above situations can achieve the purpose of this step; specifically, how to split can be flexibly selected according to the specific operation and actual needs.
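  • A minimal sketch of this kind of hw-direction split, assuming an element-wise add without a model and an equal division along the height axis (NumPy is used only for illustration):

        import numpy as np

        def split_add_hw(x, y, parts=2):
            """Split an element-wise add into `parts` sub-tasks along the height (h) axis."""
            x_chunks = np.array_split(x, parts, axis=0)   # split the input data in the hw direction
            y_chunks = np.array_split(y, parts, axis=0)
            sub_tasks = [lambda a=a, b=b: a + b for a, b in zip(x_chunks, y_chunks)]
            outputs = [t() for t in sub_tasks]            # each sub-task could run on a different core
            return np.concatenate(outputs, axis=0)        # reassemble the output data

        x = np.random.rand(4, 6)
        y = np.random.rand(4, 6)
        assert np.allclose(split_add_hw(x, y), x + y)     # the split does not change the result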
  • Case 2: When the program includes an operation request with a model (such as conv, mlp, etc.), the step of splitting the program according to the operation nodes in the program to obtain the tasks to be executed includes:
  • the weights corresponding to the tasks to be executed that are obtained by splitting the model are set in advance, and the weights are used to set the correspondence between the input data and the output data of the tasks to be executed.
  • the model of the operation request with the model may be split in the window direction (height-width direction, hw direction) of the model according to a preset rule to obtain the tasks to be executed; the model may also be split in the channel direction (C direction) of the model to obtain the tasks to be executed; or the two may be combined.
  • the input data of the operation with the model can also be split on the hw plane to obtain the task to be executed.
  • Fig. 5 is a schematic diagram showing the input and output of a convolution operation (conv) of a neural network model.
  • Figure 6 shows a schematic diagram of splitting the conv model in the channel direction.
  • the mlp (Multi-Layer Perceptron) task is divided into three subtasks in the C direction of the model.
  • the input data X is split into x1, x2, x3, and the corresponding output data is y1, y2, y3.
  • the output data Y can be obtained by arithmetic processing y1, y2, y3.
  • the method of splitting the input data of an operation with a model on the hw plane is similar to splitting the input data of an operation without a model on the hw plane, and will not be described in detail here.
  • when splitting an operation request with a model, the split may be performed only in the C direction of the model, only on the hw plane of the model, or in both the C direction and the hw plane of the model. Although multiple splitting methods can increase the possibility of task parallelism and theoretically reduce the running time of the program, the difficulty of implementation increases accordingly; moreover, in practical applications the actual running time of the split tasks to be executed is slightly larger than the theoretical running time. Therefore, how to split an operation request with a model needs to be selected according to the actual scenario, which is not limited in this application.
  • the tasks to be executed obtained by the methods in the above two cases have high parallelism, and the parallel nodes of the task directed acyclic graph DAG are more abundant, which makes the execution of the program more efficient.
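  • The C-direction split of an operation with a model, and the arithmetic combination of y1, y2, y3 into Y described above, can be sketched as follows. The sketch assumes a fully connected (mlp) layer whose weight matrix is cut into three slices along the input-channel direction; it is an illustrative reconstruction, not the disclosed implementation:

        import numpy as np

        c_in, c_out, parts = 6, 4, 3
        W = np.random.rand(c_out, c_in)         # model weights of the mlp operation
        X = np.random.rand(c_in)                # input data

        W_slices = np.split(W, parts, axis=1)   # split the model in the C direction
        X_slices = np.split(X, parts)           # split the input data X into x1, x2, x3

        # Each (W_i, x_i) pair is one task to be executed; the pre-set weight slice fixes
        # the correspondence between its input slice and its partial output.
        partial = [W_i @ x_i for W_i, x_i in zip(W_slices, X_slices)]   # y1, y2, y3

        Y = np.sum(partial, axis=0)             # arithmetic processing of y1, y2, y3
        assert np.allclose(Y, W @ X)            # equals the unsplit result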
  • the processor core of the first task parallel processing system 600 or the second task parallel processing system 700 constructs a task directed acyclic graph DAG according to the obtained dependency relationship between the tasks to be executed.
  • a task directed acyclic graph DAG is constructed according to the parallel node and the sequential node.
  • when there is no dependency between two tasks to be executed, the two tasks are generally parallel tasks; when there is a dependency between two tasks to be executed, the two tasks are generally serial tasks. Therefore, the parallel nodes and the sequential nodes in the task directed acyclic graph DAG can be determined according to the dependencies between the tasks to be executed, and the tasks are filled into the corresponding positions of the task directed acyclic graph DAG according to the determined node types, completing the construction of the task directed acyclic graph DAG.
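  • A minimal sketch of such a construction, with a hypothetical dependency table whose structure matches the example of FIG. 11 described later (E depends on C and D, C and D depend on B, B depends on A):

        # Each task maps to the set of tasks whose results it needs.
        deps = {
            "A": set(),
            "B": {"A"},
            "C": {"B"},
            "D": {"B"},
            "E": {"C", "D"},
        }

        # Edges of the task directed acyclic graph DAG: (upstream task, downstream task).
        edges = {(u, v) for v, ups in deps.items() for u in ups}

        # Tasks with no dependency path between them (here C and D) are parallel nodes;
        # tasks joined by an edge are sequential nodes.
        print(sorted(edges))   # [('A', 'B'), ('B', 'C'), ('B', 'D'), ('C', 'E'), ('D', 'E')]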
  • the task parallel processing system includes at least one processor that can run the splitting algorithm, and is used for splitting the program to obtain a task to be executed.
  • the step in which the processor core of the first task parallel processing system 600 or the second task parallel processing system 700 distributes each of the tasks to be executed to the plurality of work queues of the processor according to the task directed acyclic graph DAG includes:
  • Step S2021 Perform topological sorting on the task directed acyclic graph DAG, and obtain task topological sorting sequences.
  • Step S2022 Sort the obtained topological sorting sequences according to the preset execution time of each task to be executed, to obtain the longest topological sorting sequence.
  • Step S2023 Distribute each of the tasks to be executed to the work queue according to the longest topology sorting sequence and the dependencies between the tasks to be executed.
  • when the processor core performs task distribution, a task may be distributed to the work queue of a processor core that runs the splitting algorithm, for example, to the work queue of a processor core of the processor 620 in the first task parallel processing system 600; a task may also be distributed to the work queue of a processor core that does not have the ability to run the splitting algorithm, such as the work queue of a processor core of the second processor 720 in the second task parallel processing system 700.
  • as long as the processor core can execute the tasks distributed to it, it can be guaranteed that the program to be executed can be executed in parallel; whether or not the processor core executing the tasks can run the splitting algorithm does not affect the execution of the program. Therefore, this application does not limit this.
  • performing task distribution according to the longest path, that is, the longest topological sorting sequence, optimizes the execution time of the program: theoretically, the time to execute the tasks in the longest topological sorting sequence is the program execution time, so that the program to be executed is executed in the shortest time.
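  • Steps S2021 to S2023 can be sketched as follows, reusing the hypothetical dependency table above and assuming the preset execution times A=3, B=2, C=2, D=4, E=5. Tasks on the longest topological sorting sequence go to one work queue first; the remaining tasks go to another queue:

        deps = {"A": set(), "B": {"A"}, "C": {"B"}, "D": {"B"}, "E": {"C", "D"}}
        exec_time = {"A": 3, "B": 2, "C": 2, "D": 4, "E": 5}   # preset execution times (assumed)

        # Step S2021: topological sorting of the task DAG.
        order, remaining = [], {t: set(d) for t, d in deps.items()}
        while remaining:
            ready = [t for t, d in remaining.items() if not d]
            order.extend(ready)
            for t in ready:
                del remaining[t]
            for d in remaining.values():
                d.difference_update(ready)

        # Step S2022: accumulated finish time of each task; the maximum defines the
        # longest topological sorting sequence.
        finish = {}
        for t in order:
            finish[t] = exec_time[t] + max((finish[u] for u in deps[t]), default=0)

        # Backtrack the longest sequence from the latest-finishing task.
        path, t = [], max(finish, key=finish.get)
        while t is not None:
            path.append(t)
            t = max(deps[t], key=finish.get, default=None)
        path.reverse()                                 # ['A', 'B', 'D', 'E'] with the times above

        # Step S2023: distribute tasks on the longest sequence to work queue 0, the rest to queue 1.
        queues = {0: [t for t in order if t in path],
                  1: [t for t in order if t not in path]}
        print(finish["E"], path, queues)               # 14 is the theoretical program execution time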
  • the step in which the processor core of the first task parallel processing system 600 or the second task parallel processing system 700 controls, according to the dependencies of the tasks to be executed in the task directed acyclic graph DAG, the parallel execution tasks in each work queue to start running includes:
  • Step S3031 Set a reference count for each of the tasks to be executed according to the task directed acyclic graph DAG.
  • Step S3032 If a task on which a task to be executed depends has been executed, modify the reference count of that task to be executed.
  • Step S3033 When the reference count of a task to be executed reaches a preset value, control the task whose reference count has reached the preset value in each of the work queues to start running.
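  • Steps S3031 to S3033 can be sketched as follows, again with the hypothetical dependency table used above and with the preset value taken to be 0 (as in the worked examples below):

        deps = {"A": set(), "B": {"A"}, "C": {"B"}, "D": {"B"}, "E": {"C", "D"}}

        # Step S3031: the reference count of a task is the number of tasks it depends on.
        ref_count = {t: len(d) for t, d in deps.items()}
        downstream = {t: [v for v, d in deps.items() if t in d] for t in deps}

        finished, runnable = [], [t for t, c in ref_count.items() if c == 0]
        while runnable:
            t = runnable.pop()
            finished.append(t)             # here the task (in some work queue) would actually run
            for v in downstream[t]:
                ref_count[v] -= 1          # Step S3032: a task that v depends on has been executed
                if ref_count[v] == 0:      # Step S3033: the reference count reached the preset value
                    runnable.append(v)

        print(finished)                    # a valid execution order, e.g. ['A', 'B', 'D', 'C', 'E']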
  • Figure 7 is a flow chart showing the steps of a task parallel processing method. The method includes:
  • Step S701 Split the execution program according to the operation nodes in the execution program to obtain the tasks to be executed A3, B2, C2, D4, E5, and F1, and construct the task directed acyclic graph DAG shown in Figure 8 according to the dependency relationships among the tasks A3, B2, C2, D4, E5, and F1.
  • Step S702 According to the task directed acyclic graph DAG shown in FIG. 8, the tasks A3, B2, C2, D4, E5, and F1 are to be distributed to the work queue 1, the work queue 2, and the work queue 3. The distribution results are shown in Figure 9.
  • Step S703 Set reference counts for the tasks to be executed A3, B2, C2, D4, E5, and F1 according to the task directed acyclic graph DAG, and control the running of A3, B2, C2, D4, E5, and F1 according to the set reference counts.
  • a task to be executed in a work queue can start running only when its reference count reaches 0. Since the reference count of task A3 is 0, task A3 can be executed directly after being put into the work queue; task E5 needs the execution results of task B2 and task C2, so the reference count of task E5 is set to 2.
  • after one of task B2 and task C2 is executed, the reference count of task E5 is adjusted to 1; after the other is executed, the reference count of task E5 is adjusted to 0. When its reference count is 0, task E5 can start running. The running of task F1 is controlled in the same way, and finally the execution program finishes running.
  • Figure 10 is a flow chart showing the steps of a task parallel processing method. The method includes:
  • Step S6001 Obtain the data nodes in the execution program, split the program to be executed to obtain the tasks to be executed, and construct the task directed acyclic graph DAG shown in Figure 11 according to the dependency relationships between the tasks to be executed.
  • A, B, C, D, E are data nodes, conv, pool, active, add are operation nodes.
  • in the task directed acyclic graph DAG of this embodiment, obtaining the data E depends on the processing results of the data C and the data D;
  • obtaining the data C and the data D depends on the processing result of the data B, and obtaining the data B depends on the processing result of the data A.
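  • The execution program referred to in this example is not reproduced in this text; under the dependencies just described it might look roughly like the following sketch, in which the operator functions are stand-ins added only so that the fragment runs:

        import numpy as np

        # Stand-in operators; the real conv/pool/active/add are the operation nodes of the network.
        def conv(x):   return x + 1.0
        def pool(x):   return x / 2.0
        def active(x): return np.maximum(x, 0.0)
        def add(x, y): return x + y

        A = np.arange(8.0)   # data node A
        B = conv(A)          # data node B depends on A
        C = pool(B)          # data node C depends on B
        D = active(B)        # data node D depends on B
        E = add(C, D)        # data node E depends on C and D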
  • Step S6002 According to the task directed acyclic graph DAG described in FIG. 11, each task to be executed is distributed to the work queue 1' and the work queue 2'. The distribution results are shown in Figure 12.
  • Step S6003 Set a reference count for each task to be executed according to the task directed acyclic graph DAG, and control the running of each task to be executed according to the set reference count.
  • a task to be executed in a work queue starts running only when its reference count is 0; otherwise it does not run.
  • each time a task that a task depends on finishes running, the task's reference count is decremented by one, until it is reduced to zero and the task can be executed.
  • when the reference count of task E (add(C, D)) becomes 0, task E is executed; after task E is executed, the execution of the program is completed.
  • the present application proposes a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps of the method referred to in the above embodiments.
  • the present application proposes a task parallel processing device, which is shown in FIG. 13 and includes a DAG map construction module 410, a task distribution module 420, and a schedule control module 430.
  • the DAG map construction module 410 is configured to construct a task directed acyclic graph DAG according to the dependency relationship between the tasks to be executed
  • the task distribution module 420 is configured to distribute each of the tasks to be executed to a plurality of work queues of the processor according to the task directed acyclic graph DAG;
  • the scheduling control module 430 is configured to control, according to the dependencies of the tasks to be executed in the task directed acyclic graph DAG, the parallel execution tasks in each of the work queues to start running.
  • the DAG map construction module 410 is configured to split the program according to the operation node and/or the data node in the program to obtain the task to be executed.
• The DAG map construction module 410 is configured to, if the program includes an operation request with a model, split the model of the operation request and/or the input data of the model to obtain the tasks to be executed.
• The DAG map construction module 410 is configured to, if the program includes an operation request without a model, split the input data and/or output data of the operation request to obtain the tasks to be executed.
• The DAG map construction module 410 is configured to determine parallel nodes and sequential nodes according to the obtained dependencies between the tasks to be executed, and to construct the task directed acyclic graph DAG according to the parallel nodes and the sequential nodes.
• The task distribution module 420 is configured to perform topological sorting on the task directed acyclic graph DAG to obtain a task topological sorting sequence; sort the obtained topological sorting sequence according to the preset execution time of each task to be executed to obtain a longest topological sorting sequence; and distribute each task to be executed to the work queues according to the longest topological sorting sequence and the dependencies between the tasks to be executed.
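• One plausible reading of this distribution scheme, sketched below: topologically sort the task DAG, rank tasks by how late their dependency chain (weighted by preset execution time) finishes, and hand the ranked tasks out to the work queues. The chain-weight ranking and the round-robin queue assignment are assumptions, not the claimed policy.

```python
from collections import defaultdict, deque

def topological_order(deps):
    # deps: task -> list of tasks it depends on
    indegree = {t: len(d) for t, d in deps.items()}
    dependents = defaultdict(list)
    for t, ds in deps.items():
        for d in ds:
            dependents[d].append(t)
    queue = deque(t for t, n in indegree.items() if n == 0)
    order = []
    while queue:
        t = queue.popleft()
        order.append(t)
        for nxt in dependents[t]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return order

def longest_sequence(deps, exec_time):
    # Finishing time of the longest dependency chain ending at each task,
    # used here as a stand-in for the "longest topological sorting sequence".
    finish = {}
    for t in topological_order(deps):
        finish[t] = max((finish[d] for d in deps[t]), default=0) + exec_time[t]
    return sorted(finish, key=finish.get, reverse=True)

def distribute(deps, exec_time, num_queues=2):
    queues = [[] for _ in range(num_queues)]
    for i, task in enumerate(longest_sequence(deps, exec_time)):
        queues[i % num_queues].append(task)   # naive round-robin placement
    return queues

deps = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"], "E": ["C", "D"]}
times = {"A": 1, "B": 2, "C": 3, "D": 1, "E": 2}
print(distribute(deps, times))
```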
• The scheduling control module 430 is configured to set a reference count for each task to be executed according to the task directed acyclic graph DAG; when a task that a task to be executed depends on has been executed, modify the reference count of that task to be executed; and when the reference count of a task to be executed reaches a preset value, control the task to be executed in the corresponding work queue to start running.
  • the present application can be implemented by hardware, or by software plus a necessary general hardware platform.
  • the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a mobile hard disk, etc.), including several The instructions are used to cause a computer device (which may be a personal computer, server, or network device, etc.) to run the methods of various implementation scenarios of the present application.
  • the different instructions can be processed in parallel according to the corresponding instruction list, thereby improving the processing efficiency of the computer system.
• However, the order of the instructions in the instruction list corresponding to each processor core in the above computer system may not be reasonable; for example, the instructions in the instruction list may not be parallelized as much as possible, so the processing efficiency of the system is not improved, or the improvement is limited. Therefore, how to provide an instruction list scheduling method, apparatus, computer device and storage medium that adjust the instruction order in the instruction list, so that the arrangement of the instructions is more compact and the execution time of the instruction list is shortened, has become an urgent technical problem to be solved.
• The computer system 300 of one embodiment may be a multi-processor computing system including multiple processors, such as a multi-core processor computing system or a heterogeneous computing system.
• The computer system may specifically include an instruction list scheduling device 310, a plurality of first processors 320, and a memory 330.
• The plurality of first processors 320 may be simultaneously connected to the instruction list scheduling device 310, and the instruction list scheduling device 310 may reschedule the instruction lists of the plurality of first processors 320.
  • the instruction list scheduling device 310 may also include a second processor.
• The second processor may include an acquisition module, a data dependency analysis module, an evaluation module, an operation module, a control module, and the like, wherein the acquisition module may be a hardware module such as an I/O (Input/Output) interface.
  • the arithmetic module and the control module are hardware modules.
  • the plurality of first processors 320 can process different instructions in parallel according to the instruction list to improve the processing efficiency of the computer system.
  • the instruction list may include one or more instructions, and each instruction includes a set of reference operations on the resource, and the resource referenced by the instruction may be obtained by reading or running the instruction. That is, when the first processor or the like executes the instruction, the resource referenced by the instruction can be called to implement a specific operation.
  • the instruction may be a load instruction, a calculation instruction, a store instruction, or the like.
• The instruction may also be an N-layer calculation of a neural network, where N > 0, and N may or may not be an integer.
  • each instruction in the instruction list is arranged in an execution order, and the resource referenced by each instruction in the instruction list may be a virtual memory object or a physical memory object.
  • the virtual memory object can be a virtual storage space in a software logic of a memory block, a register, or other storage device capable of storing data.
• The instruction scheduling process in this embodiment is a process of reordering the instructions in the instruction list under the premise of preserving the semantics of the original instruction list, which can make the arrangement of the instructions in the instruction list more compact, thereby shortening the execution time of the instruction list and improving the processing efficiency of the system.
• The instruction list includes N instructions, where N ≥ 1 and N is a positive integer, and the N instructions are marked as the first instruction, the second instruction, ..., the Nth instruction according to the execution timing.
  • the scheduling process of the instruction list is a process of reordering the above N instructions.
  • the instruction list scheduling apparatus 310 may first obtain the data dependency of each instruction in the instruction list to be scheduled.
  • the form of the data dependency may include RAW (Read After Write) / WAR (Write After Read) / WAW (Write After Write).
  • the data dependency relationship may be described by a Data Dependence Graph (DDG).
• Specifically, the second processor of the instruction list scheduling apparatus 310 may obtain the instruction list to be scheduled through the acquisition module, and perform data dependency analysis on the instructions in the instruction list to be scheduled through the data dependency analysis module, to obtain the data dependency relationship between the instructions. For example, resource scan tracking may be performed on each instruction in the instruction list to be scheduled, and the data dependencies between the instructions are then analyzed. The data dependency between instructions in this embodiment refers to whether the execution of the current instruction needs to depend on the execution results of other instructions; for example, if an instruction A reads the data written by an instruction B, then instruction A depends on the execution result of instruction B. Afterwards, the acquisition module can obtain, according to the data dependency relationship between the instructions, all the selection nodes for each instruction selection in the instruction scheduling process.
  • the instruction list scheduling apparatus may determine, by the evaluation module, the instructions of each order in the scheduled instruction list from all the selected nodes of the corresponding order according to the preset rule.
  • the second processor may use the evaluation module to evaluate the selection node corresponding to the current order, obtain an evaluation result of each selection node in the current order, and determine an instruction corresponding to the current order according to the evaluation result.
  • Each selection node records the ordered instruction and the instruction set to be scheduled corresponding to the selected node.
  • the evaluation module evaluates the selected node corresponding to the current order according to the priority of each instruction.
• The second processor may further set a priority for each instruction according to the specific content and/or type of the currently selected node.
  • the instruction list scheduling apparatus 310 may adjust the first processor corresponding to the instruction in the instruction list to be scheduled.
• The first processor corresponding to the instruction to be scheduled may be determined according to the type of the instruction, or according to the specific content of the instruction to be scheduled.
  • FIG. 15 is a flowchart of steps of an instruction list scheduling method according to an embodiment of the present application.
  • the instruction list scheduling method can be applied to the computer system shown in FIG. 14.
  • the above computer system can include a memory 330 and a plurality of first processors 320.
  • the instruction list scheduling method is used to implement rescheduling of instructions in the instruction list corresponding to the plurality of first processors in the computer system to improve processing efficiency of the computer.
  • the above method may include the following steps:
• Step S100 Acquire the to-be-scheduled instruction set in the to-be-scheduled instruction list, and perform data dependency analysis on the to-be-scheduled instruction set to obtain the data dependency relationship between the instructions in the to-be-scheduled instruction set.
  • the second processor may obtain a to-be-scheduled instruction set of the to-be-scheduled instruction list through the acquiring module, and obtain a data dependency relationship of the foregoing instruction by using the data dependency analysis module.
  • the to-be-scheduled instruction set in this embodiment is composed of multiple to-be-scheduled instructions in the to-be-scheduled instruction list.
  • the to-be-scheduled instruction set does not include a non-semantic instruction (such as a synchronization instruction, etc.) in the to-be-scheduled instruction list.
  • the step of acquiring the to-be-scheduled instruction set of the to-be-scheduled instruction list includes: obtaining a to-be-scheduled instruction list, deleting a non-semantic instruction in the to-be-scheduled instruction list, and obtaining a to-be-scheduled instruction set.
• For example, the instruction set to be scheduled acquired by the acquisition module includes six instructions {L1, L2, C1, C2, S1, S2}.
  • L1, C1, and S1 need to be executed sequentially
  • L2, C2, and S2 need to be executed sequentially
  • the rest of the instructions have no data dependency
  • L1, L2, S1, and S2 are I/O instructions
  • C1 and C2 are calculation instructions.
• The data dependency analysis module performs data dependency analysis on the to-be-scheduled instructions, obtains the data dependency relationship between the instructions in the instruction set to be scheduled, and uses the DDG (Data Dependence Graph) shown in FIG. 16 to describe this data dependency relationship.
  • the resource referenced by each to-be-scheduled instruction in the to-be-scheduled instruction list may be a virtual memory object or a physical memory object.
  • the virtual memory object can be a virtual storage space in a software logic of a memory block, a register, or other storage device capable of storing data.
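• The RAW/WAR/WAW analysis on this six-instruction example can be sketched as a read/write-set scan. The per-instruction read and write sets below are invented solely to reproduce the dependency chains L1→C1→S1 and L2→C2→S2 of FIG. 16; they are not given in the text.

```python
# Hypothetical read/write sets: r0/r1 hold loaded data, r2/r3 hold results.
instrs = {
    "L1": {"reads": set(),  "writes": {"r0"}},
    "L2": {"reads": set(),  "writes": {"r1"}},
    "C1": {"reads": {"r0"}, "writes": {"r2"}},
    "C2": {"reads": {"r1"}, "writes": {"r3"}},
    "S1": {"reads": {"r2"}, "writes": set()},
    "S2": {"reads": {"r3"}, "writes": set()},
}
order = ["L1", "L2", "C1", "C2", "S1", "S2"]   # original execution order

def build_ddg(order, instrs):
    edges = []                                  # (earlier, later, kind)
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            wa, ra = instrs[a]["writes"], instrs[a]["reads"]
            wb, rb = instrs[b]["writes"], instrs[b]["reads"]
            if wa & rb: edges.append((a, b, "RAW"))
            if ra & wb: edges.append((a, b, "WAR"))
            if wa & wb: edges.append((a, b, "WAW"))
    return edges

print(build_ddg(order, instrs))   # e.g. [('L1', 'C1', 'RAW'), ('L2', 'C2', 'RAW'), ...]
```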
  • Step S200 According to the data dependency relationship between the instructions, all the selected nodes that perform instruction selection in the instruction scheduling process are obtained.
  • Each selection node records the ordered instruction and the set of instructions to be scheduled corresponding to the selected node.
• Specifically, the process of obtaining all the selection nodes may be as follows: the second processor obtains, through the acquisition module, all the first selection nodes for the first instruction selection; specifically, it obtains the sorted instruction and the to-be-scheduled instruction set corresponding to each first selection node. It should be noted that the instructions in these to-be-scheduled instruction sets have data dependencies. Then, the second processor obtains, through the acquisition module, all the second selection nodes associated with each first selection node according to the data dependency relationship of that first selection node, where the second selection nodes correspond to the second instruction selection. By analogy, the third selection nodes, ..., the Nth selection nodes are obtained, where N ≥ 3 and N is a positive integer. The first selection nodes, ..., the Nth selection nodes acquired in the above steps together constitute all the selection nodes for each instruction selection.
• For example, the acquired instruction set in the to-be-scheduled instruction list includes six instructions {L1, L2, C1, C2, S1, S2}, and the data dependency relationship between the six instructions is shown in FIG. 16. It can be clearly seen from FIG. 16 that, among the six instructions, L1 and L2 can be executed without depending on other instructions. Therefore, the first instruction selection must be made from L1 and L2; that is, the first selection nodes obtained correspond to the two cases of selecting instruction L1 or instruction L2. When L1 is selected at the first instruction selection, L1 is the sorted instruction, and the corresponding first selection node records the sorted instruction L1 and the to-be-scheduled instruction set {L2, C1, C2, S1, S2} with instruction L1 removed. When L2 is selected at the first instruction selection, the corresponding first selection node records the sorted instruction L2 and the to-be-scheduled instruction set {L1, C1, C2, S1, S2} with instruction L2 removed.
  • the above process can be cycled to obtain the second selection node when the second instruction is selected, ..., the sixth selection node when the sixth instruction is selected.
• Each subsequent instruction selection is made from the to-be-scheduled instruction set recorded by the previous selection node. For example, for the to-be-scheduled instruction set described above, if the instruction selected at the first instruction selection is L1 (corresponding to one of the first selection nodes), the to-be-scheduled instruction set of that first selection node is {L2, C1, C2, S1, S2}, in which instructions L2 and C1 do not depend on the execution of other instructions; at this time, the second instruction selection is made from L2 and C1 (two second selection nodes). If the instruction selected at the first instruction selection is L2 (corresponding to the other first selection node), the to-be-scheduled instruction set of that first selection node is {L1, C1, C2, S1, S2}, in which instructions L1 and C2 do not depend on the execution of other instructions; at this time, the second instruction selection is made from L1 and C2 (also two second selection nodes). It can be seen that there is an association between all the selection nodes obtained in this embodiment.
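• A sketch of how the selection nodes branch out: each node records the sorted instructions and the remaining to-be-scheduled set, and its children are obtained by selecting any remaining instruction whose dependencies are all already sorted. The edge-list form of the DDG follows the previous sketch and is an assumption.

```python
def ready_instructions(remaining, ordered, ddg):
    # An instruction is selectable once every instruction it depends on is sorted.
    deps = {i: {a for (a, b, _) in ddg if b == i} for i in remaining}
    return [i for i in remaining if deps[i] <= set(ordered)]

def expand(node, ddg):
    # node = (sorted instructions, remaining to-be-scheduled instructions)
    ordered, remaining = node
    children = []
    for ins in ready_instructions(remaining, ordered, ddg):
        children.append((ordered + [ins], [x for x in remaining if x != ins]))
    return children

ddg = [("L1", "C1", "RAW"), ("C1", "S1", "RAW"),
       ("L2", "C2", "RAW"), ("C2", "S2", "RAW")]
root = ([], ["L1", "L2", "C1", "C2", "S1", "S2"])
for child in expand(root, ddg):        # the two first selection nodes: L1 or L2
    print(child)
```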
  • Step S300 Determine, according to a preset rule, an instruction of each order in the list of instructions after scheduling according to a selection node of the corresponding order.
  • the second processor may evaluate, by using the evaluation module, the selected node corresponding to the current order, obtain the evaluation result of each selected node in the current order, and determine an instruction corresponding to the current order according to the evaluation result.
• For example, if the current order is the second instruction, the four second selection nodes in FIG. 17 are evaluated according to a preset rule, and the second instruction in the scheduled instruction list is obtained according to the evaluation result.
• Optionally, the evaluation module evaluates the selection nodes corresponding to the current order according to the preset priority of each instruction (for example, L2 has the highest priority, C1 the second, and so on), and obtains the evaluation result.
  • the second processor sets the priority of each instruction according to the specific content and/or type of the currently selected node.
• Optionally, the evaluation module may determine the instruction corresponding to the current order according to the shortest execution time corresponding to all the selection nodes in the current order. For example, in FIG. 17, the shortest execution time of the instruction sequence corresponding to the first selection node of instruction L1 is t1, and the shortest execution time of the instruction sequence corresponding to the first selection node of instruction L2 is t2; if t1 > t2, then L2 is determined as the first instruction in the scheduled instruction list. Similarly, the second instruction, ..., the sixth instruction of the scheduled instruction list are determined.
• The instruction list scheduling method in this embodiment determines all the selection nodes for each instruction selection in the instruction scheduling process by analyzing the data dependency relationships of the instructions to be scheduled, and then determines the instruction of each order in the scheduled instruction list according to the evaluation result of the selection nodes corresponding to that order. The method can ensure that the instruction selected at each instruction selection is the optimal result for the current state; in the scheduled instruction list obtained from these optimal results, the arrangement of the instructions is more compact, and the execution time of the instruction sequence in the original instruction list is shortened.
• In one embodiment, the step in which the evaluation module determines, according to a preset rule, the instruction of each order in the scheduled instruction list from the selection nodes of the corresponding order includes:
  • Step S210 The evaluation module accesses the selection node, and acquires the longest execution time corresponding to the currently accessed selection node.
  • the selection node accessed by the evaluation module may be a first selection node, a second selection node, ..., an Nth selection node.
• Step S220 If the longest execution time corresponding to the currently accessed selection node is less than the initial execution time T0, the sorted instruction of the currently accessed selection node is determined as the instruction of the corresponding order in the scheduled instruction list.
  • the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
• The longest execution time corresponding to the currently accessed selection node in this step refers to the execution time when the arrangement of the instruction sequence corresponding to the currently accessed selection node is the most unreasonable. Because the longest execution time corresponding to the currently accessed selection node is less than the initial execution time, the execution time of the instruction sequence obtained by the instruction list scheduling method in this embodiment is not greater than that of the instruction sequence in the instruction list to be scheduled.
• In this way, the instructions in the instruction list are not determined only by the local choice of the current order, and the adverse influence of the determined current-order instruction on subsequent instruction selection can be avoided. The method is particularly suitable for scheduling an instruction list containing computationally intensive instructions, optionally an instruction list containing neural network operation instructions. For example, the instruction list contains N instructions, which include a weight loading instruction A and a neural network convolutional layer calculation instruction B; if the conventional method is used, instruction A and instruction B may not be executed in parallel and the highest processing efficiency of the system cannot be achieved, whereas the instruction list scheduling scheme of this embodiment can make instruction A and instruction B parallel in the scheduled instruction list.
• In one embodiment, the method may further include: if the longest execution time corresponding to the currently accessed selection node is less than the initial execution time, updating the initial execution time to the longest execution time corresponding to the currently accessed selection node. For example, in the above embodiment, when T1 < T0, L1 and L2 are respectively used as the first instruction and the second instruction in the scheduled instruction list, and the initial execution time is updated to T1.
• Determining the sorted instruction corresponding to the currently accessed selection node as the instruction of the corresponding order in the scheduled instruction list guarantees that the execution time of the instruction sequence in the scheduled instruction list is shorter. The above scheme of updating the initial execution time further optimizes the ordering of the instructions and improves the processing efficiency of the system.
  • the step of the evaluation module accessing the selection node and obtaining the longest execution time corresponding to the currently accessed selection node includes:
  • the selected node is accessed within a preset access time period to obtain the longest execution time corresponding to each selected node in the preset access time period.
  • This embodiment needs to determine the instructions of each order of the post-schedule instruction list in combination with the method proposed in the above embodiment.
• The instruction list scheduling method proposed by the present application is intended to further shorten the execution time of the instruction list by rearranging the instructions in the instruction list. Based on this, the purpose of the present application is achieved as long as the new instruction list obtained by the proposed instruction list scheduling method shortens the execution time. Therefore, when the proposed instruction list scheduling method is actually used to reorder instructions, an access time period is generally set according to actual needs to control the scheduling time.
• In one embodiment, if the longest execution time corresponding to the currently accessed selection node is not less than the initial execution time, the instruction sequence in the instruction list to be scheduled is used as the instruction sequence in the scheduled instruction list. This is an optimization of the instruction list scheduling method proposed in the foregoing embodiment, and it guarantees that the instruction sequence obtained in the scheduled instruction list is the optimal result obtained within the preset time period.
  • the step of accessing the selection node and obtaining the longest execution time corresponding to the currently accessed selection node is as follows:
  • Step S230 The evaluation module acquires the shortest execution time corresponding to the currently accessed selected node.
• Step S240 If the shortest execution time corresponding to the currently accessed selection node is greater than the initial execution time T0, the access to the selection nodes associated with the currently accessed selection node is terminated.
• For example, the shortest execution time of the second selection node corresponding to instruction L2 is T2, where T2 corresponds to the case in which the unsorted instructions C1, C2, S1, and S2 corresponding to that selection node are perfectly parallel and the ordering is most reasonable. If T2 > T0, there is no need to access the third selection nodes associated with this second selection node, nor the fourth selection nodes, ..., the sixth selection nodes associated with them.
  • the technical solution of the embodiment can eliminate invalid access to the selected node and improve the scheduling efficiency of the instruction list.
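• The two pruning rules of steps S220 and S240 can be combined into one branch-and-bound style sketch: accept a node when its worst-case (longest) time beats the current initial time and update that time, and stop expanding a node whose best-case (shortest) time already exceeds it. The cost estimators are passed in as placeholders, because the text does not specify the execution-time model.

```python
def search(node, best_time, expand, longest_time, shortest_time):
    """Explore selection nodes with the two bounds described in the text.

    node               : (sorted instructions, remaining instructions)
    best_time          : current initial execution time T0
    expand(node)       : returns the child selection nodes
    longest_time(node) : worst-case execution time of sequences under node
    shortest_time(node): best-case execution time of sequences under node
    """
    best_node = None
    if shortest_time(node) > best_time:
        return best_time, best_node          # even the best case is worse: prune
    if longest_time(node) < best_time:
        best_time = longest_time(node)       # accept this node and update T0
        best_node = node
    for child in expand(node):
        t, n = search(child, best_time, expand, longest_time, shortest_time)
        if t < best_time:
            best_time, best_node = t, n
    return best_time, best_node
```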
• In one embodiment, the step of the evaluation module accessing the selection nodes and obtaining the longest execution time corresponding to the selection node currently selected for access includes: the evaluation module selects the selection node to access according to a random priority rule (for example, Monte Carlo Tree Search, MCTS) and obtains the longest execution time corresponding to the selection node currently selected for access.
• In one embodiment, the step of the evaluation module accessing the selection nodes and obtaining the longest execution time corresponding to the currently accessed selection node includes: the evaluation module selects the selection node to access according to a breadth-first (BFS, Breadth First Search) rule and obtains the longest execution time corresponding to the selection node currently selected for access.
  • the breadth priority in the embodiment refers to preferentially selecting a selection node in the same order as the currently accessed selection node for access. For example, if the second selection node is currently accessed, the next selected selection node preferentially selects other second selection nodes.
• In one embodiment, the step of the evaluation module accessing the selection nodes and obtaining the longest execution time corresponding to the currently accessed selection node includes: the evaluation module selects the selection node to access according to a depth-first (DFS, Depth First Search) rule and obtains the longest execution time corresponding to the selection node currently selected for access.
  • the depth priority in this embodiment refers to preferentially selecting a selection node in the next order associated with the currently accessed selection node for access. For example, if the second selection node is currently accessed, the next visited selection node preferentially selects the third selection node associated with the second selection node.
• In other embodiments, the evaluation module may also select the selection nodes to access using the random priority rule combined with the depth-first rule, or using the breadth-first rule combined with the depth-first rule. Specifically, the selection nodes whose order is smaller than a preset order are selected for access according to the breadth-first or random priority rule to obtain the longest execution time corresponding to the selection node currently selected for access; the selection nodes whose order is not smaller than the preset order are selected for access according to the depth-first rule to obtain the longest execution time corresponding to the selection node currently selected for access.
• The preset order is determined according to empirical values, or according to preliminary experiment results.
• In general, the evaluation module of the instruction list scheduling apparatus does not have enough time to traverse all the selection nodes. If the selection nodes are accessed only according to the depth-first or breadth-first principle, the range of the selection nodes finally accessed may be one-sided (for example, only the selection nodes associated with a certain selection node are accessed, or only the selection nodes of the earlier orders are accessed); if the selection nodes are accessed only according to the random priority rule, the randomness of the selection nodes finally accessed is too strong. Therefore, the preferred scheme is to select the selection nodes for access using the random priority rule combined with the depth-first rule, or using the breadth-first rule combined with the depth-first rule.
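• A sketch of this preferred hybrid traversal: selection nodes whose order is below a preset order are visited breadth-first (or sampled at random), and deeper nodes are explored depth-first. The node representation follows the earlier sketches, and the default threshold is an arbitrary assumption.

```python
import random
from collections import deque

def hybrid_visit(root, expand, visit, preset_order=2, shallow="breadth"):
    """Visit selection nodes: nodes whose order (depth) is below preset_order
    are taken breadth-first or at random; deeper nodes are explored depth-first."""
    shallow_frontier = deque([(root, 0)])
    while shallow_frontier:
        if shallow == "random":
            idx = random.randrange(len(shallow_frontier))
            node, depth = shallow_frontier[idx]
            del shallow_frontier[idx]
        else:                                   # breadth-first
            node, depth = shallow_frontier.popleft()
        visit(node)                             # e.g. compute its longest execution time
        children = [(c, depth + 1) for c in expand(node)]
        if depth + 1 < preset_order:
            shallow_frontier.extend(children)
        else:
            stack = list(children)              # switch to depth-first exploration
            while stack:
                n, d = stack.pop()
                visit(n)
                stack.extend((c, d + 1) for c in expand(n))
```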
• FIG. 18 is a schematic structural diagram of an instruction list scheduling apparatus proposed in one embodiment. The apparatus includes an acquisition module 510, a data dependency analysis module 520, and an evaluation module 530, wherein the acquisition module 510 is configured to acquire the to-be-scheduled instruction set in the to-be-scheduled instruction list and, according to the data dependency relationship between the instructions, obtain all the selection nodes for each instruction selection in the instruction scheduling process.
• The data dependency analysis module 520 is configured to perform data dependency analysis on the to-be-scheduled instruction set, and obtain the data dependency relationship between the instructions in the instruction set to be scheduled.
  • the evaluation module 530 is configured to determine, according to a preset rule, instructions in each order in the scheduled instruction list according to the selected nodes in the corresponding order.
• Specifically, the evaluation module 530 accesses the selection nodes and obtains the longest execution time corresponding to the currently accessed selection node; if the longest execution time corresponding to the currently accessed selection node is less than the initial execution time, the sorted instruction of the currently accessed selection node is determined as the instruction of the corresponding order in the scheduled instruction list; wherein the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
• Optionally, the instruction list scheduling apparatus further includes an update module, configured to, when the longest execution time corresponding to the currently accessed selection node is less than the initial execution time, update the initial execution time to the longest execution time corresponding to the currently accessed selection node.
  • the evaluation module 530 is configured to access the selected node within a preset access time period, and obtain the longest execution time corresponding to the currently accessed selected node; if the currently accessed selected node corresponds to the longest execution If the time is less than the initial execution time, the ordered instruction corresponding to the current access node is determined as the instruction of the corresponding order in the scheduled instruction list; wherein, the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
• Optionally, the evaluation module 530 is configured to use the instruction sequence in the instruction list to be scheduled as the instruction sequence in the scheduled instruction list when the longest execution time corresponding to the currently accessed selection node is not less than the initial execution time.
  • the evaluation module 530 is configured to select the selected node for access according to a random priority rule, and obtain a maximum execution time corresponding to the selected node currently selected for access.
  • the evaluation module 530 is configured to select the selected node for access according to the breadth-first rule, and obtain the longest execution time corresponding to the selected node currently selected for access.
  • the evaluation module 530 is configured to select the selected node for access according to a depth-first rule, and obtain a maximum execution time corresponding to the selected node currently selected for access.
  • the evaluation module 530 is configured to select the selected node that is smaller than the preset order according to the breadth or random priority rule to obtain the longest execution time corresponding to the selected node currently selected for access; according to the depth The priority rule selects the selected node that is not less than the preset order to access, and obtains the longest execution time corresponding to the selected node currently selected for access.
• Optionally, the evaluation module 530 is configured to obtain the shortest execution time corresponding to the currently accessed selection node; if the shortest execution time corresponding to the currently accessed selection node is greater than the initial execution time, terminate the access to the selection nodes associated with the currently accessed selection node; wherein the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
  • the evaluation module 530 is configured to evaluate all the selected nodes corresponding to the current order according to the preset priority of the instruction, obtain the evaluation results of the selection nodes of the current order, and determine the current order according to the evaluation result. Corresponding instructions.
  • the evaluation module 530 is configured to set the priority of each instruction according to the specific content and/or type of the currently selected node.
  • the evaluation module 530 is configured to determine an instruction corresponding to the current order according to the length of the shortest execution time corresponding to all the selected nodes in the current order.
  • Each of the above-described instruction list scheduling devices may be implemented in whole or in part by software, hardware, and combinations thereof.
  • Each of the above modules may be embedded in or independent of the processor in the computer device, or may be stored in a memory in the computer device in a software form, so that the processor invokes the operations corresponding to the above modules.
• In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in FIG. 19.
  • the computer device includes a processor, memory, network interface, display screen, and input device connected by a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium, an internal memory.
  • the non-volatile storage medium stores an operating system and a computer program.
  • the internal memory provides an environment for operation of an operating system and computer programs in a non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal via a network connection.
• The computer program is executed by the processor to implement the instruction list scheduling method mentioned in the above embodiments.
  • the display screen of the computer device may be a liquid crystal display or an electronic ink display screen
  • the input device of the computer device may be a touch layer covered on the display screen, or may be a button, a trackball or a touchpad provided on the computer device casing. It can also be an external keyboard, trackpad or mouse.
  • FIG. 19 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation of the computer device to which the solution of the present application is applied.
• The specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
• In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor. When the processor executes the computer program, the following steps are implemented: acquiring the to-be-scheduled instruction set in the to-be-scheduled instruction list, and performing data dependency analysis on the to-be-scheduled instruction set to obtain the data dependency relationship between the instructions; obtaining, according to the data dependency relationship between the instructions, all the selection nodes for each instruction selection in the instruction scheduling process; and determining, according to a preset rule, the instruction of each order in the scheduled instruction list from the selection nodes of the corresponding order.
  • the following steps are further performed: accessing the selection node, and obtaining a maximum execution time corresponding to the currently accessed selection node; if the currently accessed selection node corresponds to a maximum execution time is less than The initial execution time determines the ordered instruction of the current access node as the instruction of the corresponding order in the scheduled instruction list; wherein, the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
  • the processor executes the computer program, the following steps are further implemented: if the longest execution time corresponding to the currently accessed selection node is less than the initial execution time, the initial execution time update is the longest execution corresponding to the currently accessed selection node. time.
• In one embodiment, when the processor executes the computer program, the following steps are further implemented: if the longest execution time corresponding to the currently accessed selection node is less than the initial execution time, an instruction sequence is randomly generated based on the sorted instructions corresponding to the currently accessed selection node, and the instruction sequence of the instruction list to be scheduled is updated with the randomly generated instruction sequence.
  • the following steps are further performed: accessing the selected node within a preset access time period, and obtaining a longest execution time corresponding to the currently accessed selected node; if the currently accessed selected node corresponds to The maximum execution time is less than the initial execution time, and the ordered instruction corresponding to the current access node is determined as the instruction of the corresponding order in the scheduled instruction list; wherein the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
  • the processor when executing the computer program, further implements the step of selecting the selected node for access in accordance with the breadth-first rule and obtaining the longest execution time corresponding to the selected node currently selected for access.
  • the processor when executing the computer program, further implements the step of selecting the selected node for access according to a random-first rule and obtaining the longest execution time corresponding to the selected node currently selected for access.
• In one embodiment, the processor, when executing the computer program, further implements the step of selecting the selection node for access in accordance with the depth-first rule and obtaining the longest execution time corresponding to the selection node currently selected for access.
  • the following steps are further performed: selecting the selected node that is smaller than the preset order according to the breadth or random priority rule to obtain the longest execution time corresponding to the selected node currently selected for access. Selecting the selected node that is not less than the preset order according to the depth-first rule to obtain the longest execution time corresponding to the selected node currently selected for access.
• In one embodiment, when the processor executes the computer program, the following steps are further implemented: obtaining the shortest execution time corresponding to the currently accessed selection node; and if the shortest execution time corresponding to the currently accessed selection node is greater than the initial execution time, terminating the access to the selection nodes associated with the currently accessed selection node; wherein the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
  • the processor executes the computer program, the following steps are further implemented: evaluating all the selected nodes corresponding to the current order according to the preset priority of the instruction, obtaining the evaluation results of the selected nodes of the current order, and according to the evaluation result Determine the instruction corresponding to the current order.
  • the processor also implements the step of setting the priority of each instruction based on the specific content and/or type of the currently selected node when executing the computer program.
  • the processor when executing the computer program, further implements the step of determining an instruction corresponding to the current order based on the length of the shortest execution time corresponding to all of the selected nodes in the current order.
• In one embodiment, a computer readable storage medium is provided, having stored thereon a computer program that, when executed by a processor, implements the following steps: acquiring the to-be-scheduled instruction set in the to-be-scheduled instruction list, and performing data dependency analysis on the to-be-scheduled instructions to obtain the data dependency relationship between the instructions; obtaining, according to the data dependency relationship between the instructions, all the selection nodes for each instruction selection in the instruction scheduling process; and determining, according to a preset rule, the instruction of each order in the scheduled instruction list from the selection nodes of the corresponding order.
  • the following steps are further performed: accessing the selection node, and obtaining the longest execution time corresponding to the currently accessed selection node; and the longest execution time corresponding to the currently accessed selection node If the initial execution time is less than the initial execution time, the ordered instruction of the current access node is determined as the instruction of the corresponding order in the scheduled instruction list; wherein the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
  • the following steps are further implemented: if the longest execution time corresponding to the currently accessed selection node is less than the initial execution time, the initial execution time is updated to be the longest corresponding to the currently accessed selection node. execution time.
  • the following steps are further performed: accessing the selected node within a preset access time period, and obtaining a longest execution time corresponding to the currently accessed selected node; if the currently accessed selected node corresponds to The longest execution time is less than the initial execution time, and the ordered instruction corresponding to the current access node is determined as the instruction of the corresponding order in the scheduled instruction list; wherein, the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
  • the following steps are further implemented: if the longest execution time corresponding to the currently accessed selection node is not less than the initial execution time, the instruction sequence in the instruction list to be scheduled is used as the post-scheduling instruction list. The sequence of instructions in .
  • the computer program when executed by the processor, further implements the step of selecting the selection node for access according to a random-first rule and obtaining the longest execution time corresponding to the selected node currently selected for access.
  • the computer program when executed by the processor, further implements the step of selecting the selected node for access in accordance with a depth-first rule and obtaining the longest execution time corresponding to the selected node currently selected for access.
  • the computer program when executed by the processor, further implements the step of selecting the selected node for access in accordance with the breadth-first rule and obtaining the longest execution time corresponding to the selected node currently selected for access.
  • the following steps are further performed: selecting the selected node that is smaller than the preset order to access according to the breadth or random priority rule, and obtaining the longest execution corresponding to the selected node currently selected for access Time; selecting the selected node that is not less than the preset order according to the depth-first rule to obtain the longest execution time corresponding to the selected node currently selected for access.
• When executed by the processor, the computer program further implements the following steps: obtaining the shortest execution time corresponding to the currently accessed selection node; and if the shortest execution time corresponding to the currently accessed selection node is greater than the initial execution time, terminating the access to the selection nodes associated with the currently accessed selection node; wherein the initial execution time is the execution time of the instruction sequence in the instruction list to be scheduled.
  • the following steps are further performed: evaluating all the selected nodes corresponding to the current order according to the preset priority of the instruction, obtaining the evaluation results of the selected nodes of the current order, and according to the evaluation The result determines the instruction corresponding to the current order.
  • the computer program when executed by the processor, also implements the step of setting the priority of each instruction based on the specific content and/or type of the currently selected node.
  • the computer program when executed by the processor, further implements the step of determining an instruction corresponding to the current order based on the length of the shortest execution time corresponding to all of the selected nodes in the current order.
• In the conventional technology, each computing node in a neural network model needs to be compiled and parsed separately, and then each computing node is executed in a certain order according to the structural form of the neural network model. The neural network model and the network structure may be artificial neural network model data that has been trained or has not been trained. This processing method for the neural network affects the processing speed of the processor, and the processing efficiency is low.
  • the embodiment of the present application further provides a method for generating an offline model, where the offline model generation method can be run on a cloud server or a neural network dedicated processor, and the obtained offline model of the original network is stored in the memory 130.
• The cloud server or neural network dedicated processor is a processor capable of processing heavyweight data such as neural networks, and may not be included in the above computer device.
  • the foregoing method includes the following steps:
• Specifically, the model data set and the model structure parameters of the original network may be obtained through the acquisition module of the cloud server or the neural network dedicated processor, and the network structure diagram of the original network can be obtained through the model data set and the model structure parameters of the original network.
  • the model data set includes data such as model parameters corresponding to each computing node in the original network, and W1 to W6 in the neural network shown in FIG. 28 are used to represent model parameters of the computing node.
• The model structure parameters include the connection relationships of the plurality of computing nodes in the original network and the calculation attribute of each computing node, wherein the connection relationship between computing nodes is used to indicate whether there is data transmission between the computing nodes; for example, when there is data flow between multiple computing nodes, it can be said that there is a connection relationship between those computing nodes.
  • the connection relationship of the computing nodes may include an input relationship, an output relationship, and the like.
• For example, the output of the computing node F1 serves as an input of the computing nodes F4 and F5, so there is a connection relationship between the computing node F1 and the computing node F4, and a connection relationship between the computing node F1 and the computing node F5. Since there is no data transfer between the computing node F1 and the computing node F2, it can be said that there is no connection relationship between the computing node F1 and the computing node F2.
  • the calculation attribute of each computing node may include a calculation type and a calculation parameter of the corresponding calculation node, wherein the calculation type of the calculation node refers to what calculation is used by the calculation node, for example, the calculation type of the calculation node may include addition, subtraction, and Convolution operations and the like, correspondingly, the compute node may be a compute node for performing an add operation, a compute node for implementing a subtraction operation, a compute node for implementing a convolution operation, and the like.
  • the calculation parameter of the calculation node may be a necessary parameter required to complete the calculation type corresponding to the calculation node.
• For example, if the calculation type of a computing node is an addition operation, the calculation parameter of the computing node may be the addend in the addition operation; the addend may be obtained as input data through the acquisition module, or may be the output data of the previous computing node of this computing node, and so on.
  • the original network may be an artificial neural network established for a general purpose processor such as a CPU, GPU or DSP based on a deep learning system such as TensorFlow, MXNet, Caffe, and PyTorch.
  • the original network may also be an artificial neural network established for an intelligent processor such as an IPU.
  • the model data set (caffemodel) and model structure parameters (prototxt) of the Caffe network can be obtained.
  • the model data set (caffemodel) includes data such as model parameters of the Caffe network
  • the model structure parameter (prototxt) includes calculation attributes of each computing node of the Caffe network and a connection relationship between a plurality of computing nodes.
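• To make the two inputs concrete, the sketch below shows one way the model data set (per-node model parameters) and the model structure parameters (node connections and calculation attributes) might be held in memory. The field names and the tiny example nodes are invented for illustration; in practice these would be parsed from framework files such as a caffemodel/prototxt pair.

```python
from dataclasses import dataclass, field

@dataclass
class ComputeNode:
    name: str
    op_type: str                                     # calculation type, e.g. "conv", "add"
    op_params: dict = field(default_factory=dict)    # calculation parameters
    inputs: list = field(default_factory=list)       # connection relationship (predecessors)

# Model structure parameters: connection relationships plus calculation attributes.
structure = [
    ComputeNode("F1", "conv", {"kernel": 3, "stride": 1}, inputs=["X1"]),
    ComputeNode("F4", "add", inputs=["F1", "F2"]),
]

# Model data set: model parameters (e.g. weights) keyed by computing node.
model_data = {"F1": {"W1": [[0.1, 0.2], [0.3, 0.4]]}, "F4": {}}
```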
  • the computing module of the cloud server or the neural network dedicated processor may run the original network according to the model data set of the original network and the model structure parameters, and obtain instructions corresponding to the respective computing nodes in the original network.
• The acquisition module of the cloud server or the neural network dedicated processor can also obtain the input data of the original network, and the operation module of the cloud server or the neural network dedicated processor can run the original network based on the input data of the original network, the network model data set, and the model structure parameters, and obtain the instructions corresponding to each computing node in the original network.
• The process of running the original network to obtain the instructions of the respective computing nodes is essentially a compilation process, which can be implemented by the cloud server, the neural network dedicated processor, or a virtual device; that is, the cloud server, the neural network dedicated processor, or the virtual device runs the original network according to the model data set of the original network and the model structure parameters.
  • the virtual device refers to a virtual processor running space in the memory space of the memory.
• Running the original network in this embodiment means that the cloud server or the neural network dedicated processor runs a machine learning algorithm (such as a neural network algorithm) using the artificial neural network model data, and implements the target application of the algorithm (such as an artificial intelligence application like speech recognition) by performing a forward operation.
• Further, the control module of the cloud server or the neural network dedicated processor may generate the offline model corresponding to the original network according to the model parameters and instructions corresponding to the respective computing nodes of the original network. For example, the control module may store the model parameters and instructions corresponding to the respective computing nodes of the original network into the non-volatile second memory, so as to implement generation and storage of the offline model.
  • the model parameters and instructions of the computing node are stored in one-to-one correspondence.
• In this way, when the original network is run again, the offline model corresponding to the original network can be directly obtained from the non-volatile memory and the original network can be run according to it, without compiling each computing node of the original network online to obtain instructions, which improves the running speed and efficiency of the system.
• Directly running the offline model corresponding to the original network means that the offline model is used to run the machine learning algorithm (such as a neural network algorithm) corresponding to the original network, and the target application of the algorithm (such as an artificial intelligence application like speech recognition) is implemented by performing a forward operation.
  • step S102 may include:
• Specifically, the operation module of the cloud server or the neural network dedicated processor can obtain the execution order of each computing node in the original network according to the model structure parameters of the original network; further, it can obtain the execution order of each computing node in the original network according to the connection relationships of the computing nodes in the original network.
  • the input data of the calculation node F4 is the output data of the calculation node F1 and the output data of the calculation node F2
  • the input data of the calculation node F6 is the output data of the calculation node F4 and the output data of the calculation node F5.
• Therefore, the execution order of the computing nodes in the neural network shown in FIG. 28 may be F1-F2-F3-F4-F5-F6 or F1-F3-F2-F5-F4-F6, and the like.
  • the computing nodes F1, F2, and F3 can be executed in parallel, and the computing nodes F4 and F5 can also be executed in parallel, and the execution order is not specifically limited herein.
• Further, the operation module of the cloud server or the neural network dedicated processor may run the original network according to the execution order of the computing nodes in the original network to obtain the instructions corresponding to each computing node in the original network; that is, the cloud server or the neural network dedicated processor can compile the data of the model data set of the original network to obtain the instruction corresponding to each computing node. From the instruction corresponding to each computing node, it can be known which computing function that computing node is used to implement, that is, calculation attributes such as the calculation type and the calculation parameters of the computing node can be obtained.
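• A sketch of this compile pass: walk the computing nodes in an execution order derived from their connection relationships and emit one instruction record per node. The record format (op, inputs, params) is an assumption; a dedicated processor would emit hardware-specific instructions instead.

```python
from collections import deque

def execution_order(connections):
    # connections: node name -> list of predecessor compute-node names
    done, order = set(), []
    queue = deque(n for n, ins in connections.items() if not ins)
    while queue:
        n = queue.popleft()
        order.append(n); done.add(n)
        for m, ins in connections.items():
            if m not in done and m not in queue and all(i in done for i in ins):
                queue.append(m)
    return order

def compile_network(connections, attrs, params):
    # Emit one instruction record per computing node, in execution order.
    return {name: {"op": attrs[name]["type"],
                   "inputs": connections[name],
                   "params": params.get(name, {})}
            for name in execution_order(connections)}

connections = {"F1": [], "F2": [], "F4": ["F1", "F2"]}
attrs = {"F1": {"type": "conv"}, "F2": {"type": "conv"}, "F4": {"type": "add"}}
print(compile_network(connections, attrs, {"F1": {"W1": [0.1]}}))
```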
  • step S103 further includes:
  • the computing module of the cloud server or the neural network dedicated processor can obtain the memory allocation manner of the original network according to the model data set of the original network and the model structure parameters. Further, the cloud server or the neural network dedicated processor may obtain the execution order of each computing node in the original network according to the model structure parameters of the original network, and determine the memory allocation mode of the current network according to the execution order of each computing node in the original network. For example, related data of each computing node during operation is saved to a stack in the execution order of each computing node.
  • the memory allocation mode refers to determining a storage location of data (including input data, output data, model parameters, intermediate result data, and the like) related to each computing node in the original network in a memory space (such as the first memory).
  • a data table may be used to store a mapping relationship between data (input data, output data, model parameters, intermediate result data, and the like) associated with each computing node and a memory space.
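  • As an illustration only (the patent does not prescribe a concrete table format), such a data table could map each piece of node-related data to an offset in the first memory following the execution order of the nodes; the names and sizes below are assumptions.

```python
# A minimal sketch of recording the memory allocation mode: each piece of
# node-related data is mapped to an offset in the first memory, following the
# execution order of the computing nodes.
def allocate(memory_layout_order, sizes):
    """memory_layout_order: data names in node-execution order; sizes: bytes per item."""
    table, offset = {}, 0
    for name in memory_layout_order:
        table[name] = (offset, sizes[name])   # (start address, length)
        offset += sizes[name]
    return table

sizes = {"X1": 4, "X2": 4, "W1": 16, "F1_out": 8, "W4": 16, "F4_out": 8, "Y": 4}
print(allocate(["X1", "X2", "W1", "F1_out", "W4", "F4_out", "Y"], sizes))
```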
  • S107 Store, according to a memory allocation manner of the original network, related data in the running process of the original network to the first memory, where the related data in the original network running process includes model parameters and instructions corresponding to the respective computing nodes of the original network.
  • X1 and X2 represent input data of the neural network
  • Y represents output data of the neural network
  • a cloud server or a neural network dedicated processor can convert the output data of the neural network into control commands for controlling a robot or for different digital interfaces.
  • W1 to W6 are used to represent model parameters corresponding to the calculation nodes F1, F2, and F3, and the output data of the calculation nodes F1 to F5 can be used as an intermediate calculation result.
  • the cloud server or the neural network dedicated processor can store related data in the original network running process to the first memory, such as an internal memory or a cache, according to the determined memory allocation manner; the specific storage manner can be seen in FIG. 29.
  • the second memory may be a non-volatile memory such as an external memory.
  • the offline model corresponding to the original network is stored in the storage space in the right half of FIG. 29.
  • a cloud server or a neural network dedicated processor can obtain a model data set, model structure parameters, and input data of the original network, so that a network structure diagram of the original network can be obtained according to the model data set and model structure parameters of the original network. As shown in Figure 9.
  • the cloud server or the neural network dedicated processor can obtain the connection relationship of each computing node of the original network according to the model structure parameters of the original network, obtain the execution order of each computing node in the original network according to the connection relationship of each computing node, and obtain the memory allocation mode of the original network during the running process, so that the storage location of the relevant data of the original network during operation can be obtained.
  • the relevant data of the original network during operation can be stored in a stack in the order in which the respective compute nodes are executed.
  • the cloud server or the neural network dedicated processor may store the model parameters and instructions corresponding to the respective computing nodes of the original network in the non-volatile second memory to generate an offline model; the storage mode of the offline model can be seen in the storage space in the right half of FIG. 29.
  • the offline model only includes data such as model parameters and instructions necessary for running the original network, and does not need to store input data, output data or intermediate calculation results generated during the operation of the original network, thereby reducing the consumption of storage space in the second memory.
  • an artificial neural network is a kind of heavyweight data, which is composed of a large number of nodes (or neurons) connected to each other.
  • the traditional computer device directly reads the neural network, and sequentially executes the computing nodes of the neural network in a certain manner according to the structural form of the neural network, and obtains the calculation result of the neural network. That is, the traditional computing device directly performs data processing on the heavyweight neural network, which will affect the data processing speed and efficiency of the computer device.
  • moreover, some computer devices cannot run such heavyweight artificial neural network data at all, which limits the application range of the neural network.
  • an embodiment of the present application provides a computer device, which may include a hardware system and a software system, where the hardware system may include a first processor 110, a second processor 120, and a memory 130.
  • the first processor 110 is configured to provide a computing and control capability, which may include a first obtaining module 111, a first computing module 113, a first control module 112, and the like.
  • the first obtaining module 111 may be a hardware module such as an IO (Input/Output) interface, and the first operation module 113 and the first control module 112 are both hardware modules.
  • for example, the first operation module 113 and the first control module 112 may be digital circuits or analog circuits or the like.
  • the second processor 120 can also be used to provide computing and control capabilities, and may include a second acquisition module, a second operation module, a second control module, and the like; the second acquisition module may be a hardware module such as an IO (Input/Output) interface, and the second operation module and the second control module are both hardware modules.
  • the connection relationship and the configuration of the respective structures of the second processor 120 may be the same as the connection relationship and the configuration of the respective structures in the first processor. For details, refer to the description above, and details are not described herein again.
  • the first processor or the second processor may be a general-purpose processor such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit) or a DSP (Digital Signal Processor), or may be a neural network dedicated processor such as an IPU (Intelligence Processing Unit).
  • the memory 130 is configured to store a plurality of offline models and input data corresponding to the original network and a software system of the computer device.
  • the software system of the computer device can include software such as an operating system, a computer program, application software, and a runtime system 131 that can run on the first processor 110 or the second processor 120.
  • the memory 130 can also be used to store output data of each original network (ie, calculation results of respective original networks).
  • the memory 130 may include a first storage module for storing an offline model, a second storage module for storing input data, a third storage module for storing output data, and a fourth storage module for storing the runtime system. Alternatively, the number of memories 130 may be two or more.
  • the number of the memory 130 may be two, which are respectively labeled as a first memory and a second memory, wherein the first memory is used to store an offline model and input corresponding to the original network. Data, the second memory is used to store the runtime system.
  • the memory 130 may be a non-volatile memory such as a read only memory (ROM), a programmable ROM (PROM), an electrically programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), or a flash memory.
  • runtime refers to the state in which a program is running (or being executed); it indicates that a certain program is in the running state during a certain period of time.
  • a runtime system is a process-level virtual machine that is used to represent the operating environment of a program.
  • the runtime system may be a software system established by using computer software, and the software system may be in a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), and a DSP (Digital Signal Processing). , digital signal processing) or IPU (Intelligence Processing Unit, intelligent processor) and other processors run to achieve specific data processing functions.
  • the runtime system in the embodiment of the present application is different from the operating system of the computer device, and the software system of the computer device can include the above-mentioned runtime system and operating system.
  • the runtime system 131 in the embodiment of the present application can be run on the first processor 110.
  • the runtime system 131 can include a data processing device 1310, a device management device 1314, and a task execution device 1315, where both the data processing device 1310 and the device management device 1314 can be connected to the task execution device 1315.
  • the runtime system 131 can control the second processor 120 to run heavyweight data such as a neural network; that is, the runtime system 131 can control the second processor 120 to perform calculation according to the offline model of the neural network and the input data, so as to obtain the output data of the neural network.
  • the data processing device 1310 is configured to obtain, from the memory 130, an offline model corresponding to the current original network and the input data thereof, and the offline model of the current original network is correspondingly set with the input data of the current network.
  • the offline model corresponding to the current original network includes necessary network structure information such as the model parameters and instructions corresponding to each computing node in the current original network, and the interface data of each computing node in the current original network. Since the offline model of the current original network does not include the intermediate calculation results, input data and output data of each computing node in the current original network, the data size of the offline model of the current original network is much smaller than the data magnitude of the current original network; that is, the offline model of the current original network can be regarded as lightweight data.
  • the instruction corresponding to each computing node may be used to indicate which computing function is used by the computing node, and may specifically include computing attributes of respective computing nodes in the original network.
  • the node interface data of the current original network is used to represent the connection relationship of each computing node of the current original network.
  • the node interface data of the current original network may include an input data source and an output data source of each computing node. For example, as shown in FIG. 28, X1 and X2 are input data corresponding to the current original network, Y is output data corresponding to the current original network, and W1 to W6 are respectively model parameters corresponding to the computing nodes F1 to F3 in the current original network.
  • the node interface data of the current original network may include information such as that the computing nodes F1, F2 and F3 are the starting computing nodes whose inputs are the preset input data, and that the output data of the computing node F1 serves as the input data of the computing node F4 and the computing node F5. In this way, when the original network is run again, only the offline model and the input data of the current original network need to be obtained, and the running process of the current original network can be implemented by running the offline model corresponding to the current original network.
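  • Purely as an illustration (the patent does not prescribe a concrete file layout), the contents listed above could be organized roughly as follows; the field names and example values are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class NodeRecord:
    """One compute node of the offline model: parameters, compiled instruction,
    and interface data (where its inputs come from and where its output goes)."""
    params: bytes                 # e.g. serialized weights such as W1
    instruction: bytes            # compiled instruction for this node
    input_sources: List[str]      # e.g. ["X1"] or ["F1", "F2"]
    output_targets: List[str]     # e.g. ["F4", "F5"] or ["Y"]

@dataclass
class OfflineModel:
    nodes: Dict[str, NodeRecord] = field(default_factory=dict)

# Sketch of the FIG. 28 example: only model parameters, instructions and
# interface data are stored -- no intermediate results, inputs or outputs.
model = OfflineModel()
model.nodes["F1"] = NodeRecord(params=b"W1", instruction=b"<compiled F1>",
                               input_sources=["X1"], output_targets=["F4", "F5"])
```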
  • the device management device 1314 functions as a driving device of the second processor 120, which can be used to control the second processor 120 to be turned on or off. Wherein, when the second processor 120 is turned off, the second processor 120 does not perform any task, and when the second processor 120 is started, the second processor 120 can perform tasks such as calculation or control.
  • the second processor 120 may be a neural network accelerator for executing an offline model of the current original network.
  • the task execution device 1315 is configured to control the second processor 120 to run the offline model and input data of the current original network acquired by the data processing device 1310 to obtain output data of the current original network (ie, a calculation result of the neural network).
  • running the offline model corresponding to the original network means using the offline model to run the machine learning algorithm (such as a neural network algorithm) corresponding to the original network, and implementing the target application of the algorithm (such as an intelligent application like voice recognition) by performing the forward operation.
  • the runtime system 131 described above may run on the first processor 110, so that the runtime system 131 controls the second processor 120 to run the neural network and other data; that is, when it is required to run heavyweight data such as a neural network on the computer device 100, the offline model corresponding to the current original network and the input data may first be acquired from the memory 130 by the data processing device 1310. After the loading of the offline model corresponding to the current original network and of the input data is completed, the device management device 1314 may control the second processor 120 to start. Thereafter, the task execution device 1315 can control the second processor 120 to run the offline model and input data of the current original network, so as to implement the running process of the current original network and obtain the calculation result of the current original network.
  • since the offline model of the current original network only stores necessary network data such as the model parameters and instructions corresponding to each computing node in the current original network and the interface data of each computing node in the current original network, the data size of the offline model of the current original network is much smaller than the data magnitude of the current original network, so that by running the offline model of the current original network the computer device can implement the processing of heavyweight data such as a neural network, which expands the application range of neural networks. At the same time, by directly running the offline model corresponding to the original network on the computer device, the processing speed and efficiency of the computer device can be improved without performing processing operations such as compiling each computing node in the original network.
  • the data processing device 1310 includes an offline model loading module 1311 and an input data loading module 1312.
  • the offline model loading module 1311 is configured to obtain an offline model corresponding to the current original network from the memory 130, and parse the obtained offline model of the current original network to obtain model parameters corresponding to each computing node in the current original network, The instruction and the interface data of each computing node in the current original network.
  • the process of parsing the offline model of the current original network by the offline model loading module 1311 may further include a process of performing data preprocessing (such as data format conversion, normalization, etc.) on the offline model corresponding to the current original network.
  • the input data loading module 1312 is configured to retrieve input data from the memory 130, which may be input data corresponding to the starting computing node of the original network. As shown in Figure 28, X1 and X2 serve as input data for the starting compute node of the original network. Further, the input data can be obtained by application software and stored in the memory 130.
  • the application software can be run on the first processor or the second processor. For example, the user can set the input data of the current original network through the interaction interface of the application software, and the runtime system can store the acquired input data of the current original network. In the memory 130.
  • the offline model loading module 1311 can also be used to obtain the loading progress of the offline model in real time
  • the input data loading module 1312 can also be used to obtain the loading progress of the input data in real time.
  • after the offline model loading module 1311 completes the loading of the offline model corresponding to the current original network (for example, when the data loading ratio of the offline model reaches 100%) and the input data loading module 1312 completes the loading of the input data corresponding to the current original network (for example, when the loading ratio of the input data reaches 100%), the offline model loading module 1311 and the input data loading module 1312 may send a data loading completion signal to the device management device 1314, so that the device management device 1314 can control the second processor 120 to start according to the data loading completion signal it receives. After the second processor 120 is started, the device management device 1314 may send a startup completion signal to the task execution device 1315, and the task execution device 1315 may control the second processor 120 to run the offline model of the current original network according to the startup completion signal it receives.
  • optionally, the second processor can be controlled to start in advance to further increase the data processing speed and efficiency of the computer device. Since the data magnitude of the offline model is generally greater than the data magnitude of the input data, the required loading time of the offline model may be greater than the loading time of the input data. Therefore, when the data loading ratio completed by the offline model loading module 1311 is greater than or equal to a first preset ratio (for example, 80%), a loading completion signal may be sent to the device management device 1314 to start the second processor 120 in advance. Alternatively, the offline model loading module 1311 and the input data loading module 1312 can each send a data loading completion signal to the device management device 1314, so that the device management device 1314 controls the second processor 120 to start according to the data loading completion signals it receives.
  • the data processing device 1310 may further include an input data pre-processing module 1313 for pre-processing the input data (such as data format conversion, normalization, etc.) so that the second processor 120 can run the input data. At this time, the input data loading module 1312 may send an input data loading completion signal to the input data pre-processing module 1313, and the input data pre-processing module 1313 may perform data pre-processing operations such as normalization and format conversion on the input data corresponding to the current original network according to the input data loading completion signal.
  • the device management device 1314 can control the second processor 120 to start according to the offline model loading completion signal transmitted by the offline model loading module 1311 and the pre-processing completion signal transmitted by the input data pre-processing module 1313.
  • the input data pre-processing module 1313 is further configured to store the output data obtained by the second processor 120 to the memory 130. Specifically, after the second processor 120 completes the offline model of the current original network and the execution process of the input data, The second processor 120 can transmit the output data (ie, the calculation result) of the current original network to the input data pre-processing module 1313, and the input data pre-processing module 1313 can perform pre-processing such as data format conversion on the output data of the current original network. The output data of the current original network can then be stored into the memory 130.
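  • Purely to illustrate how the three devices described above cooperate (the class, attribute and method names below are assumptions, not the patent's API), the overall flow could be sketched as follows.

```python
class RuntimeSystemSketch:
    """Hypothetical sketch of the runtime flow: load -> preprocess -> start -> run -> store."""

    def __init__(self, memory, data_processing, device_management, task_execution):
        self.memory = memory
        self.data_processing = data_processing      # offline model / input data loading, preprocessing
        self.device_management = device_management  # starts / stops the second processor
        self.task_execution = task_execution        # runs the offline model on the second processor

    def run_network(self, network_id):
        offline_model = self.data_processing.load_offline_model(self.memory, network_id)
        raw_input = self.data_processing.load_input_data(self.memory, network_id)
        inputs = self.data_processing.preprocess(raw_input)       # format conversion, normalization
        self.device_management.start_second_processor()           # in the text, triggered by load-completion signals
        outputs = self.task_execution.run(offline_model, inputs)  # forward operation on the second processor
        self.device_management.stop_second_processor()
        self.data_processing.store_output(self.memory, network_id, outputs)
        return outputs
```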
  • the software system of the computer device 100 further includes application software and an operating system (such as an Android operating system, a Microsoft operating system, a Linux operating system, etc.), and the application software can run on the operating system or the above-mentioned runtime system.
  • the operating system and the runtime system described above provide an executable environment for various applications.
  • the operating system and application software may also be stored in memory 130, which may be run on first processor 110 or second processor 120.
  • each device of the runtime system 131 can provide a secure API (Application Programming Interface) that can be invoked by the application software, so that the application software can obtain the offline model and input data of the current original network through the runtime system 131 and control the second processor 120 to run the offline model of the current original network, thereby obtaining the output data of the current original network.
  • the data processing device 1310 can provide an offline model API and an input data API.
  • the offline model loading module 1311 can provide an offline model API
  • the input data loading module 1312 can provide an input data API.
  • the application software can invoke the offline model API of the data processing device 1310, so that the offline model loading module 1311 can obtain the offline model corresponding to the current original network from the memory 130.
  • the application software may invoke the input data API of the data processing device 1310, so that the input data loading module 1312 can obtain the input data corresponding to the current original network from the memory 130.
  • the input data of the current original network can be obtained by application software. For example, the user can manually set the input data corresponding to the current original network through the interactive display interface of the application software.
  • the application software can also invoke the offline model API and the input data API at the same time, so that the offline model and the input data of the current original network can be loaded simultaneously; the above is for illustration only and does not limit the specific order of execution.
  • the input data pre-processing module 1313 of the data processing device 1310 is also capable of providing a data pre-processing API. After the loading of the input data of the current original network is completed, the application software may invoke the data pre-processing API, so that the input data pre-processing module 1313 can pre-process the input data of the current original network, enabling the second processor to run the input data of the current original network as described above.
  • the device management device 1314 can provide a second processor driver API
  • the task execution device 1315 can provide a second processor runtime API.
  • the application software may start the second processor 120 by calling the second processor driver API provided by the device management device 1314.
  • the application software may invoke the second processor running API provided by the task execution device 1315 to control the second processor 120 to execute the offline model and input data corresponding to the current original network, so as to obtain the output data of the current original network. After completing the execution process of the offline model corresponding to the current original network, the application software may close the second processor 120 by calling the second processor driver API.
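  • To make the calling sequence above concrete, the following is a minimal sketch from the application software's point of view; the function names are placeholders for the APIs named in the text, not an actual SDK.

```python
def run_current_network(runtime, network_id):
    # Hypothetical wrappers around the APIs named above (offline model API,
    # input data API, data pre-processing API, driver API, running API).
    offline_model = runtime.offline_model_api(network_id)     # load the offline model from memory
    input_data = runtime.input_data_api(network_id)           # load the input data from memory
    input_data = runtime.data_preprocess_api(input_data)      # format conversion / normalization
    runtime.second_processor_driver_api(action="start")       # start the second processor
    output = runtime.second_processor_run_api(offline_model, input_data)
    runtime.second_processor_driver_api(action="stop")        # close the second processor
    runtime.data_preprocess_api(output, store=True)           # convert and store the output data
    return output
```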
  • the application software may also invoke the data pre-processing API, so that the input data pre-processing module 1313 can perform data pre-processing on the output data of the current original network and store the output data of the current original network in the memory 130.
  • the number of the second processors 120 may be multiple, the task execution device 1315 may also provide a task allocation API, and the task execution device 1315 may be configured to control the plurality of second processors 120 so as to implement task assignment and scheduling among the plurality of second processors 120.
  • the application software may select a target second processor that executes the current task from the plurality of second processors 120 by calling a task assignment API provided by the task execution device 1315. After the offline model of the current original network and the loading of the input data are completed, the application software may start the target second processor by calling a second processor driver API corresponding to the target second processor.
  • the application software may invoke the second processor running API corresponding to the target second processor provided by the task executing device 1315 to control the target second processor to execute the offline model corresponding to the current original network. And input data.
  • the target second processor may be shut down by calling a second processor driver API corresponding to the target second processor.
  • the second processor 120 may be a multi-core processor, that is, the second processor 120 may include multiple processing modules.
  • the task execution device 1315 can be configured to control a plurality of processing modules of the plurality of second processors 120 to implement task allocation and scheduling between the plurality of processing modules of the plurality of second processors 120.
  • the application software may select a target processing module that executes the current task from among the plurality of processing modules in the second processor 120 by calling the task assignment API provided by the task execution device 1315. After the offline model of the current original network and the loading of the input data are completed, the application software may start the target processing module by calling a second processor driver API corresponding to the target processing module.
  • the application software may invoke the second processor running API corresponding to the target processing module to control the target processing module to execute the offline model and the input data corresponding to the current original network.
  • the target processing module may be closed by calling a second processor driver API corresponding to the target processing module.
  • the runtime system 131 can be a secure runtime system built on a trusted operating environment.
  • the runtime system 131 can be a runtime system built on a TEE (Trusted Execution Environment).
  • TEE can construct a runtime system that is isolated from non-secure software systems such as the operating system, thereby implementing software isolation and ensuring the security of the offline model of the original network and of its input data and output data.
  • the above application software may be a secure application such as TA, and the secure application software such as the TA may run on a runtime system based on TEE.
  • the storage space of the memory 130 can be divided into a secure storage space and a non-secure storage space.
  • the storage space for storing the offline model and the input data of the current original network is a secure storage space
  • the storage space for storing the software system such as the operating system and the application software is a non-secure storage space
  • the runtime system can be stored in the secure storage space.
  • the memory 130 can also be a secure memory.
  • the above runtime system, TA and secure storage space constitute a complete TEE operating environment.
  • the number of the memory 130 may be two or more, and one of the memories 130 may serve as a secure storage space for storing an offline model and input data of the current original network.
  • One of the memories 130 can be used as a non-secure storage space for storing software systems such as an operating system and application software. Further, the operating system, application software, and the like can also be stored in a secure storage space.
  • the secure storage space in the embodiment of the present application refers to a trusted storage space, which may be an encrypted storage space; specifically, a symmetric encryption algorithm, an asymmetric encryption algorithm, or a random encryption algorithm (such as using a random password generator to obtain a password) may be adopted.
  • the secure storage space may also be a storage space encrypted by a fingerprint or the like.
  • the above secure runtime system 131 and application software can also be obtained by an encryption algorithm.
  • the secure storage space may be a secure storage space obtained by a trusted metric method, and the secure runtime system 131 and application software may also be obtained by a trusted metric method.
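  • As one possible illustration only (the patent does not mandate a specific algorithm or library), a symmetric scheme for keeping the offline model encrypted in the secure storage space could look like the sketch below, using the Fernet recipe of the third-party Python cryptography package; key management details are omitted and the storage layout is an assumption.

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice the key would live in the trusted environment
cipher = Fernet(key)

def store_offline_model(storage, name, offline_model_bytes):
    """Encrypt the serialized offline model before writing it to the secure storage space."""
    storage[name] = cipher.encrypt(offline_model_bytes)

def load_offline_model(storage, name):
    """Decrypt the offline model when the runtime system loads it."""
    return cipher.decrypt(storage[name])

store = {}
store_offline_model(store, "current_original_network", b"<serialized offline model>")
assert load_offline_model(store, "current_original_network") == b"<serialized offline model>"
```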
  • the first processor 110 can also be a security chip, such as a TPM (Trusted Platform Module), a TCM (Trusted Cryptography Module), or a TPCM (Trusted Platform Control Module). Module) and so on.
  • the second processor 120 may also be a security chip such as TPM, TCM or TPCM.
  • the computer device of the embodiment of the present application may further include only a processor and a memory, where the processor is a multi-core processor.
  • the processor can include a plurality of processing modules.
  • the processor includes a first processing module and a second processing module, wherein the runtime system can run on the first processing module.
  • the runtime system may include a data processing device, a device management device, and a task execution device, where the data processing device is configured to acquire, from the memory, the offline model and input data corresponding to the current original network; the offline model corresponding to the current original network includes the model parameters and instructions corresponding to each computing node in the original network and the interface data of each computing node in the original network.
  • the device management device is configured to control the second processing module to be started or shut down
  • the task execution device is configured to control the second processing module to run the offline model of the current original network and input data.
  • the embodiment of the present application further provides a data processing method, which is used in the computer device shown in FIG. 20 to implement the processing of heavyweight data such as a neural network by using an offline model and to improve the data processing speed and efficiency of the computer device. The method includes the following steps:
  • the control data processing device acquires an offline model and input data corresponding to the current original network from the memory, and the offline model corresponding to the current original network includes model parameters and instructions corresponding to each computing node in the original network.
  • the offline model corresponding to the current original network and the input data can be read from the memory by the data processing device 1310 of the runtime system 131.
  • the offline model corresponding to the current original network may be obtained from the memory 130 by the offline model loading module 1311 of the data processing device 1310.
  • the input data is retrieved from the memory 130 by the input data loading module 1312, which may be the input data corresponding to the starting computing node of the original network.
  • the device management device controls the second processor of the computer device to start. Specifically, the second processor can be controlled to be turned on or off by the device management device 1314 of the runtime system 131. That is, after the offline model loading module 1311 completes the loading of the offline model corresponding to the current original network and the input data loading module 1312 completes the loading of the input data corresponding to the current original network, the offline model loading module 1311 and the input data loading module 1312 may send a data loading completion signal to the device management device 1314, so that the device management device 1314 can control the second processor 120 to start according to the data loading completion signal it receives.
  • the task execution device controls the second processor of the computer device to run the current original network according to the offline model and the input data corresponding to the current original network, so as to obtain the output data of the current original network.
  • the second processor 120 can be controlled by the task execution device 1315 of the runtime system 131 to run an offline model of the current original network.
  • running the offline model corresponding to the original network means that the offline learning model is used to run the machine learning algorithm (such as the neural network algorithm) corresponding to the original network, and the target application of the algorithm is implemented by performing the forward operation (such as voice recognition and the like). Smart application).
  • the data processing device stores the output data of the current original network into the memory.
  • the output data of the current original network may be stored into the memory 130 by the data processing device 1310.
  • the data processing device 1310 can perform a pre-processing operation such as data format conversion on the output data of the current original network and then store it in the memory 130; specifically, this can be done by the input data pre-processing module 1313 of the data processing device 1310.
  • step S110 may further include the following steps:
  • the offline model of the current original network may be parsed by the offline model loading module 1311 to obtain model parameters and instructions corresponding to each computing node in the current original network, and interfaces of each computing node in the current original network. data. Further, the offline model loading module 1311 may perform preprocessing operations such as data format conversion, normalization, and the like on the parsed data.
  • S112 Perform pre-processing on the input data of the current original network obtained, such as performing data format conversion, normalization, and the like on the input data.
  • the input data may be pre-processed (such as data format conversion, normalization, etc.) by the input data pre-processing module 1313 to enable the second processor 120 to run the input data.
  • the above method may further include the following steps:
  • the loading progress of the offline model corresponding to the current original network is obtained in real time; specifically, the offline model loading module 1311 can obtain the loading progress of the offline model corresponding to the current network in real time, and the loading progress of the offline model can be represented by using a data ratio or a remaining duration. .
  • if the loading progress of the offline model corresponding to the current original network reaches a first preset ratio, the step of controlling the second processor of the computer device to start is performed.
  • the first preset ratio may be 80% to 100%.
  • at this time, the offline model loading module 1311 may send a data loading completion signal to the device management device 1314, so that the device management device 1314 can control the second processor 120 to start according to the data loading completion signal it receives. That is, when the data loading ratio completed by the offline model loading module 1311 is greater than or equal to the first preset ratio (for example, 80%), the loading completion signal may be sent to the device management device 1314 to start the second processor 120 in advance.
  • the required loading time of the offline model may be greater than the loading time of the input data. Therefore, whether to activate the second processor 120 may be determined based only on the loading progress of the offline model. Further, the input data loading module 1312 can also obtain the loading progress of the input data in real time.
  • the offline model loading module 1311 and the input data loading module 1312 may send a data loading completion signal to the device management device 1314, so that the device management device 1314 may The second processor 120 is controlled to start according to the data loading completion signal it receives.
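  • The early-start policy described above can be summarized in a few lines; the 80% threshold comes from the example in the text, while the object and method names are placeholders.

```python
FIRST_PRESET_RATIO = 0.8  # example value from the text (the ratio may be anywhere from 80% to 100%)

def maybe_start_second_processor(offline_model_ratio, device_management):
    """Send the load-completion signal early once the offline model loading progress
    reaches the first preset ratio, so the second processor starts while the
    (smaller) input data is still being loaded."""
    if offline_model_ratio >= FIRST_PRESET_RATIO:
        device_management.start_second_processor()
        return True
    return False
```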
  • the embodiment of the present application further provides a data processing method, which is used in the computer device shown in FIG. 20 to implement the processing of heavyweight data such as a neural network by using an offline model and to improve the data processing efficiency and speed of the computer device. The method includes the following steps:
  • the offline model API is invoked, and the offline model corresponding to the current original network is obtained.
  • specifically, the application software may invoke the offline model API provided by the offline model loading module 1311, so that the offline model loading module 1311 can read, from the memory 130, the offline model corresponding to the current original network, which includes the model parameters and instructions corresponding to each computing node in the current original network and the interface data of each computing node in the current original network; the generation process of the offline model can be referred to the description above.
  • the application software may call the input data API provided by the input data loading module 1312 and obtain the input data of the current original network from the memory 130 through the input data loading module 1312. Further, the application software may also invoke the data pre-processing API provided by the input data pre-processing module 1313 and perform pre-processing operations such as data format conversion and normalization on the input data obtained by the input data loading module 1312 through the input data pre-processing module 1313, so that the second processor 120 can run the input data of the current original network described above.
  • S220 Call the second processor driver API to control the second processor in the computer device to start.
  • the application software can invoke the second processor driver API provided by the device management module 1314, and the second processor 120 is controlled to be started by the device management module 1314.
  • S230 Call the second processor to run the API, and control the second processor to obtain the output data of the current original network according to the offline model and the input data corresponding to the current original network.
  • the application software can invoke the second processor running API provided by the task executing device 1315, and the task executing device 1315 controls the second processor 120 to obtain the output data of the current original network according to the offline model and the input data corresponding to the current original network. .
  • S240 Call a second processor driver API to control the second processor to be turned off.
  • the application software can invoke the second processor driver API provided by the device management module 1314, and the second processor 120 is controlled to be turned off by the device management module 1314.
  • Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
  • the computer device, the data processing method and the storage medium described above can obtain the offline model and the input data corresponding to the current original network directly from the memory, so that the second processor of the computer device can run the current original network according to the offline model and the input data corresponding to the current original network, thereby obtaining the output data of the current original network. Since the offline model corresponding to each original network only includes the model parameters and instructions corresponding to each computing node in the original network and the interface data of each computing node in the original network, the data size of the offline model of the original network is much smaller than the data magnitude of the original network, so that the computer device can process heavyweight neural network data by running the offline model corresponding to the current original network. At the same time, by directly running the offline model corresponding to the current original network on the computer device, the processing speed and efficiency of the computer device can be improved without performing processing operations such as compiling each computing node in the current original network.
  • the computer device 200 may include a first processor 210, a second processor 220, a first memory 230, and a second memory 240, wherein the first memory 230 stores offline models and input data corresponding to a plurality of original networks as well as a runtime system capable of running on the first processor 210, and the second memory 240 stores an operating system capable of running on the first processor or the second processor.
  • the first memory 230 and the second memory 240 described above may be two memories that are physically independent of each other.
  • the first memory 230 and the second memory 240 may be integrated as a whole, and the first memory 230 and the second memory 240 are two storage spaces that are logically independent of each other.
  • the number of the first processors 210 may be two or more.
  • the number of the first processors 210 is two, one of the first processors 210 is used to run the above-described secure runtime system 231, and the other first processor 210 is used to run the operating system.
  • the foregoing first processor 210 may be a multi-core processor, which may include two or more processing modules, one of which may be used to run the above-described runtime system 231 and another of which may be used to run the above operating system. In this way, the computer device can be divided into a secure operating environment and a non-secure operating environment by hardware isolation.
  • the first processor 210 may be implemented by using a security chip such as TCM, TPM or TPCM.
  • the above-mentioned runtime system is a secure runtime system established based on a trusted operating environment.
  • the runtime system 231 may be a runtime system established based on a TEE (Trusted Execution Environment).
  • the TEE can construct a runtime system that is isolated from non-secure software systems such as the operating system, thereby implementing software isolation and ensuring the security of the offline model of the original network and of its input data and output data.
  • the secure runtime system 231 can be obtained by an encryption algorithm or by a trusted metric.
  • the first memory 230 is a secure storage medium.
  • the runtime system 231 When the runtime system 231 is running on the first processor 210, the runtime system 231 can obtain the offline model and input data corresponding to the current original network from the first memory 230, and control the second processor 220 to run the current original network corresponding. Offline model.
  • the security in the embodiment of the present application refers to Trusted, which can be implemented by using a preset encryption algorithm.
  • a symmetric encryption algorithm, an asymmetric encryption algorithm, or a random encryption algorithm (such as using a random password generator to obtain a password) may be used.
  • it is also possible to encrypt by fingerprint or the like.
  • security can also be achieved through trusted metrics.
  • the runtime system 231 can provide a security API (Application Programming Interface) that can be invoked by the application software.
  • the API mainly includes key management, cryptographic algorithms, and secure storage.
  • the above-described runtime system 231 may include a data processing device, a device management device, and a task execution device, the structure of which is similar to that of the above-described runtime system 131, as shown in Figs. 22 and 23.
  • the data processing device can provide an offline model API and an input data API, and is used to obtain an offline model and input data corresponding to the current original network from the first memory 230.
  • the offline model corresponding to the current original network includes the model parameters and instructions corresponding to each computing node in the original network, and the interface data of each computing node.
  • the device management device can provide a second processor driver API for controlling the second processor 220 to be turned on or off.
  • the task execution device can provide a second processor execution API for controlling the second processor 220 to run an offline model of the current original network and input data.
  • the data processing apparatus includes an offline model loading module and an input data loading module.
  • the offline model loading module is configured to provide an offline model API for obtaining an offline model corresponding to each current original network from the first memory 230, and parsing an offline model corresponding to the current original network.
  • the input data loading module is capable of providing an input data API for obtaining input data corresponding to the current original network from the first memory 230.
  • the data processing apparatus further includes an input data pre-processing module capable of providing a data pre-processing API, which is used to pre-process the input data acquired by the input data loading module so that the second processor 220 can run the input data of the current original network, and to store the output data obtained by the second processor 220 into the first memory 230.
  • the number of the second processors 220 is multiple, or the second processor 220 includes multiple processing modules; the task execution device can also provide a task allocation API for controlling the plurality of second processors 220, or controlling A plurality of processing modules of the second processor 220.
  • the computer device further includes a secure application software (TA, Trusted Application) that can be run on the runtime system 231, and the application software can invoke the offline model API and the input data API, the second processor driver API, and the second The processor runs the API.
  • the secure application software can be implemented by an encryption algorithm or by a trusted metric.
  • the embodiment of the present application further provides a data processing method, which is used in the computer device shown in FIG. 30, and the method includes the following steps:
  • S310 Obtain an offline model and input data corresponding to the current original network from the first memory, where the offline model corresponding to the current original network includes the model parameters and instructions corresponding to each computing node in the current original network and the interface data of each computing node.
  • the secure runtime system 231 can obtain the offline model and input data corresponding to the current original network from the secure first memory 230.
  • the offline model corresponding to the current original network and the input data may be read from the first memory 230 by the data processing device of the runtime system 231.
  • the offline model corresponding to the current original network may be acquired from the first memory 230 by the offline model loading module of the data processing device.
  • the input data is obtained from the first memory 230 by the input data loading module, and the input data may be input data corresponding to the starting computing node of the original network.
  • the second processor of the computer device is controlled to start.
  • the secure runtime system 231 described above can control the second processor 220 of the computer device to boot.
  • the device management device of the runtime system 231 can control the second processor to be turned on or off.
  • the offline model loading module may send a data loading completion signal to the device management device, so that the device management device may control the second processor according to the received data loading completion signal thereof. 220 starts.
  • the second processor of the computer device is controlled to run the current original network according to the offline model and the input data corresponding to the current original network, so as to obtain the output data of the current original network.
  • the runtime system 231 can control the second processor 220 of the computer device to run the offline model and its corresponding input data to obtain output data of the current original network.
  • the second processor 220 can be controlled by the task execution device of the runtime system 231 to run an offline model of the current original network.
  • running the offline model corresponding to the original network means that the offline learning model is used to run the machine learning algorithm (such as the neural network algorithm) corresponding to the original network, and the target application of the algorithm is implemented by performing the forward operation (such as voice recognition and the like). Smart application).
  • the runtime system 231 can store the output data of the current original network into the secure first memory 230.
  • the output data of the current original network may be stored into the first memory 230 by the data processing device of the runtime system 231.
  • the data processing device can perform a pre-processing operation such as data format conversion on the output data of the current original network and then store it in the first memory 230; specifically, this can be done by the input data pre-processing module of the data processing device.
  • the embodiment of the present application further provides a data processing method, which is used in the computer device shown in FIG. 32.
  • the offline model corresponding to the current original network includes model parameters, instructions corresponding to each computing node in the current original network, and interface data of each computing node in the current original network.
  • S420 Calling the input data API to obtain input data of the current original network; specifically, the secure application software may invoke the input data API, and obtain the input data of the current original network from the first memory 230 by using the input data loading module.
  • S440 Call the second processor to run the API, and control the second processor to obtain the output data of the current original network according to the offline model and the input data corresponding to the current original network.
  • the secure application software can invoke the second processor to run the API to control, by the task execution device, the second processor 220 to obtain the output data of the current original network according to the offline model and the input data corresponding to the current original network.
  • S450 Call a second processor driver API to control the second processor to be turned off.
  • the secure application software can invoke the second processor driver API to control the second processor 220 to be turned off by the device management module.
  • the above method further includes the following steps:
  • the data pre-processing API is called to store the output data of the current original network into the first memory.
  • the secure application software can invoke the data pre-processing API provided by the runtime system 231 to perform data format conversion, normalization, and the like on the output data through the input data pre-processing module of the data processing device, and the current The output data of the original network is stored in the first memory 230.
  • the method further includes the following steps:
  • the data preprocessing API is invoked to preprocess the acquired input data of the current original network, so that the second processor can run the input data.
  • the secure application software may also invoke the data pre-processing API provided by the input data pre-processing module to perform pre-processing operations such as data format conversion and normalization on the input data through the input data pre-processing module, so that the second processor 220 can run the input data of the current original network described above.
  • the embodiment of the present application may further include an offline model generation process, where the offline model generation process may run on a cloud server or a neural network dedicated processor, and the obtained offline model of the original network is stored in the first memory 230.
  • the cloud server or neural network dedicated processor is a processor capable of executing heavyweight data such as a neural network, which may not be included in the above computer device.
  • Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
  • the data size of the offline model of the current original network is much smaller than the data magnitude of the current original network, so that by running the offline model of the current original network, a secure runtime system established based on a trusted execution environment such as TEE can realize the processing of heavyweight data such as a neural network, which expands the application range of neural networks. At the same time, the processing speed and efficiency of the computer device can be improved without performing processing operations such as compiling each computing node in the original network.

Abstract

本发明提出的一种任务并行处理方法、装置、系统、存储介质及计算机设备,能够根据任务有向无环图DAG进行需执行任务分发与控制,实现多核处理器的任务并行,提高了数据处理效率。

Description

任务并行处理方法、装置、系统、存储介质及计算机设备
相关申请
本申请要求2017年11月20日申请的,申请号为201711157341.X,名称为“任务并行处理方法、存储介质、计算机设备、装置和系统”的中国专利申请的优先权;2017年12月29日申请的,申请号201711484410.8,名称为“指令列表调度方法、装置、计算机设备及存储介质”的中国专利申请的优先权;2018年1月29日申请的,申请号为201810084077.X,名称为“计算机设备、数据处理方法及存储介质”的中国专利申请的优先权;以及2018年1月29日申请的,申请号为201810083577.1,名称为“计算机设备、数据处理方法及存储介质”的中国专利申请的优先权;为在此将其全文引入作为参考。
技术领域
本申请涉及计算机技术领域,特别是涉及一种任务并行处理方法、装置、系统、存储介质及计算机设备。
背景技术
传统技术中,尽管可以通过CUDA(Compute Unified Device Architecture,显卡厂商 NVIDIA推出的运算平台)、Cudnn(CUDA Deep Neural Network library,NVIDIA推出的深度神经网络加速库)、Cublas(CUDA Basic Linear Algebra Subprograms,NVIDIA推出的矩阵运算加速库)等加速器API接口进行编程,实现卷积神经网络的程序指令。但是,通过CUDA、Cudnn、Cublas等加速器API接口编程,实现的卷积神经网络的各指令间无相互依赖关系,只可以顺序执行编程指令。
神经网络实际是一串队列函数，是一种图结构。在实现卷积神经网络的程序指令时，会存在任务分支。目前可以应用tensorflow（谷歌基于DistBelief进行研发的第二代人工智能学习系统）或者Caffe（Convolutional Architecture for Fast Feature Embedding，卷积神经网络框架）等框架应用程序实现卷积神经网络的程序的任务并行，但是，应用上述框架程序实现任务并行，不仅需额外安装软件，而且存在程序接口不兼容的问题，使用不便。
发明内容
基于此,有必要针对由于需借助tensorflow或者Caffe等框架应用程序实现任务并行,造成的使用不便的问题,提供一种任务并行处理方法、存储介质、计算机设备、装置和系统。
本申请提出了一种任务并行处理方法,包括:
根据需执行任务之间的依赖关系,构建任务有向无环图DAG;
根据所述任务有向无环图DAG,将各所述需执行任务分发至处理器的多个工作队列;
根据所述任务有向无环图DAG中各所述需执行任务的依赖关系,调控各所述工作队列中并行的需执行任务开始运行。
在其中一个实施例中,所述根据需执行任务之间的依赖关系,构建任务有向无环图DAG的步骤之前包括:
根据程序中的操作节点和/或数据节点对程序进行拆分,获取所述需执行任务。
在其中一个实施例中,所述根据程序中的操作节点对程序进行拆分,获取所述需执行任务的步骤包括:
若所述程序包括带模型的操作请求,则对所述带模型的操作请求的模型进行拆分和/或对所述模型的输入数据进行拆分,获取需执行任务。
在其中一个实施例中,所述对所述带模型的操作请求的模型进行拆分,获取需执行任务的步骤包括:
设置拆分模型得到的各所述需执行任务对应的权值;
使用各所述权值,设置所述需执行任务的输入数据与输出数据的对应关系。
在其中一个实施例中,所述对所述带模型的操作请求的模型进行拆分,获取需执行任务的步骤包括:
按照预设规则在模型的窗口方向和/或通道方向上拆分所述带模型的操作的模型,得到需执行任务。
在其中一个实施例中,所述对所述带模型的操作请求的输入数据进行拆分,获取需执行任务的步骤包括:
按照预设规则在数据的窗口方向拆分所述带模型的操作的输入数据,得到需执行任务。
在其中一个实施例中,所述根据程序中的操作节点对程序进行拆分,获取所述需执行任务的步骤包括:
若所述程序包括不带模型的操作请求,则对所述不带模型的操作请求的输入数据和/或输出数据进行拆分,获取需执行任务。
在其中一个实施例中，所述对所述不带模型的操作请求的输入数据和/或输出数据进行拆分，获取需执行任务的步骤包括：
按照预设规则在数据的窗口方向拆分所述输入数据和/或输出数据,得到需执行任务。
在其中一个实施例中,所述根据需执行任务之间的依赖关系,构建任务有向无环图DAG的步骤包括:
根据获取的各所述需执行任务之间的依赖关系,确定所述任务有向无环图DAG中的并行结点与顺序结点;
根据所述并行结点与顺序结点构建任务有向无环图DAG。
在其中一个实施例中,所述根据所述任务有向无环图DAG将各所述需执行任务分发至所述处理器的多个工作队列的步骤包括:
对所述任务有向无环图DAG进行拓扑排序,获取任务拓扑排序序列;
根据各所述需执行任务的预设执行时间,对得到的所述拓扑排序序列进行排序,得到最长拓扑排序序列;
根据所述最长拓扑排序序列以及各所述需执行任务之间的依赖关系,分发各所述需执行任务至所述工作队列。
在其中一个实施例中,所述根据所述任务有向无环图DAG中各所述需执行任务的依赖关系,调控各所述工作队列中并行的需执行任务开始运行的步骤包括:
根据所述任务有向无环图DAG为各所述需执行任务设置引用计数;
若被依赖的需执行任务已执行,则修改需依赖的需执行任务的引用计数;
当所述需执行任务的引用计数达到预设值,控制各所述工作队列中引用计数达到预设值的需执行任务开始运行。
本申请提出了一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现上述方法所提及的步骤。
本申请提出了一种任务并行处理系统,包括存储器、多核处理器,及存储在存储器上并可在处理器上运行的计算机程序,所述多核处理器能够运行拆分算法,所述多核处理器执行所述计算机程序时实现上述方法所提及的步骤。
本申请还提出了一种任务并行处理系统,包括存储器、第一处理器和第二处理器,所述第一处理器能够运行拆分算法,第二处理器为多核处理器,所述第一处理器和第二处理器执行所述计算机程序时实现上述方法所提及的步骤。
相应的,本申请还提出了一种任务并行处理装置,包括:DAG图构建模块、任务分发模块和调度控制模块,
所述DAG图构建模块,用于根据需执行任务之间的依赖关系,构建任务有向无环图DAG;
所述任务分发模块,用于根据所述任务有向无环图DAG,将各所述需执行任务分发至处理器的多个工作队列;
所述调度控制模块,用于根据所述任务有向无环图DAG中各所述需执行任务的依赖关系,调控各所述工作队列中并行的需执行任务开始运行。
与现有技术相比,本申请提供的一种任务并行处理方法、存储介质、计算机设备、装置和系统具有如下有益效果:
本申请提出的一种任务并行处理方法、存储介质、计算机设备、装置和系统,通过根据需执行任务之间的依赖关系,构建任务有向无环图DAG,再根据任务有向无环图DAG进行需执行任务分发与控制,依赖于工作队列的可重新调度性实现多核处理器的任务并行,提高了数据处理效率。本实施例提出的任务并行处理方法的实施不依赖tensorflow或者Caffe等框架程序,因此在设计程序时无需考虑接口兼容等问题。
本申请还提供了一种指令列表调度方法,包括:获取待调度指令列表中的待调度指令集,并对所述待调度指令集进行数据依赖分析,得到所述待调度指令集中各指令之间的数据依赖关系;
根据各指令之间的所述数据依赖关系,得到指令调度过程中每次进行指令选择的所有选择节点;
按照预设规则,根据对应次序的所述选择节点确定调度后指令列表中各次序的指令。
在其中一个实施例中,所述按照预设规则,根据对应次序的所述选择节点确定调度后指令列表中各次序的指令的步骤包括:
访问所述选择节点,并获取当前访问的选择节点对应的最长执行时间;
若当前访问的所述选择节点对应的最长执行时间小于初始执行时间,则将当前访问的选择节点的已排序指令确定为调度后的指令列表中对应次序的指令;
其中,初始执行时间为待调度指令列表中指令序列的执行时间。
在其中一个实施例中,所述方法包括:
若当前访问的选择节点对应的最长执行时间小于初始执行时间,则初始执行时间更新为当前访问的选择节点对应的最长执行时间。
在其中一个实施例中,所述访问所述选择节点,并获取当前访问的选择节点对应的最长执行时间的步骤包括:
在预设访问时间段内访问选择节点,并获取当前访问的选择节点对应的最长执行时间;
若当前访问的选择节点对应的最长执行时间小于初始执行时间,则将当前访问节点对应的已排序指令确定为调度后的指令列表中对应次序的指令;
其中,初始执行时间为待调度指令列表中指令序列的执行时间。
在其中一个实施例中,若当前访问的选择节点对应的最长执行时间不小于初始执行时间,则将待调度指令表中指令序列作为调度后指令表中的指令序列。
在其中一个实施例中,访问所述选择节点,并获取当前访问的选择节点对应的最长执行时间的步骤包括:
按照随机优先的规则选择所述选择节点进行访问,并获取当前选择访问的选择节点对应的最长执行时间。
在其中一个实施例中,访问所述选择节点,并获取当前访问的选择节点对应的最长执行时间的步骤包括:
按照广度优先的规则选择所述选择节点进行访问,并获取当前选择访问的选择节点对应的最长执行时间。
在其中一个实施例中,访问所述选择节点,并获取当前访问的选择节点对应的最长执行时间的步骤包括:
按照深度优先的规则选择所述选择节点进行访问,并获取当前选择访问的选择节点对应的最长执行时间。
在其中一个实施例中,访问所述选择节点,并获取当前访问的选择节点对应的最长执行时间的步骤包括:
按照广度或随机优先的规则选择小于预设次序的所述选择节点进行访问,得到当前选择访问的选择节点对应的最长执行时间;
按照深度优先的规则选择不小于预设次序的所述选择节点进行访问,得到当前选择访问的选择节点对应的最长执行时间。
在其中一个实施例中,所述访问所述选择节点,并获取当前访问的选择节点对应的最长执行时间的步骤包括:
获取当前访问的选择节点对应的最短执行时间;
若当前访问的选择节点对应的最短执行时间大于初始执行时间,则终止访问与当前访问的选择节点关联的选择节点;
其中,初始执行时间为待调度指令列表中指令序列的执行时间。
在其中一个实施例中,按照预设规则,根据对应次序的选择节点中确定调度后指令列表中各次序的指令的步骤包括:
按照指令的预设优先级评估当前次序对应的所有选择节点,得到当前次序的各选择节点的评估结果,并根据所述评估结果确定当前次序对应的指令。
在其中一个实施例中,所述方法包括:根据当前选择节点的具体内容和/或类型设定各指令的优先级。
在其中一个实施例中,按照预设规则,根据对应次序的选择节点中确定调度后指令列表中各次序的指令的步骤包括:
根据当前次序所有的选择节点对应的最短执行时间的长短,确定当前次序对应的指令。
一种指令调度装置,包括:获取模块、数据依赖分析模块、评估模块,
所述获取模块,用于获取待调度指令列表中的待调度指令集,以及根据各指令之间的数据依赖关系,得到指令调度过程中每次指令选择对应的所有选择节点;
所述数据依赖分析模块,用于对待调度指令集进行数据依赖分析,得到各指令之间的数据依赖关系;
所述评估模块,用于按照预设规则,根据对应次序的选择节点中确定调度后指令列表中各次序的指令。
一种计算机设备,包括存储器、处理器,及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行上述的方法所提及的步骤。
一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机程序,当所述计算机程序被处理器执行时,实现执行上述的方法所提及的步骤。
与传统技术相比,本申请提供的一种指令列表调度方法、装置、计算机设备及存储介质具有如下有益效果:
通过分析待调度指令的数据依赖关系,得到调度过程中每次指令选择对应的所有选择节点,再根据对各次序对应的选择节点的评估结果确定调度后的指令列表中各次序的指令。该方法可以保证每次选择指令时,选择的指令为当前状态的最优结果,使用这些最优结果得到的调度后的指令列表,各个指令之间的排列更加紧凑,便于缩短原指令列表中指令序列的执行时间。
本申请还提供了一种计算机设备,包括第一处理器、第二处理器和存储器,其中,所述存储器内存储有多个原始网络对应的离线模型及输入数据和能够在所述第一处理器上运行的运行时系统;所述运行时系统包括:
数据处理装置,所述数据处理装置用于从所述存储器中获取当前原始网络对应的离线模型及输入数据,所述当前原始网络对应的离线模型中包含原始网络中各个计算节点对应的模型参数、指令以及所述原始网络中的各个计算节点的接口数据;
设备管理装置,所述设备管理装置用于控制所述第二处理器启动或关闭;
任务执行装置,所述任务执行装置用于控制所述第二处理器运行所述当前原始网络的离线模型及输入数据。
在其中一个实施例中,所述数据处理装置包括离线模型加载模块和输入数据加载模块;
所述离线模型加载模块用于从所述存储器中获取各个所述当前原始网络对应的离线模型,并对所述当前原始网络对应的离线模型进行解析;
所述输入数据加载模块用于从所述存储器中获取所述当前原始网络对应的输入数据。
在其中一个实施例中,所述数据处理装置还包括输入数据预处理模块,所述输入数据预处理模块用于对所述输入数据加载模块获取的所述当前原始网络对应的输入数据进行预处理,使所述第二处理器能够运行所述当前原始网络对应的输入数据,并用于将所述第二处理器获得的输出数据存储至所述存储器。
在其中一个实施例中,所述计算机设备还包括能够在所述运行时系统上运行的应用软件;
所述数据处理装置能够提供离线模型API及输入数据API;
所述设备管理装置能够提供第二处理器驱动API;
所述任务执行装置能够提供第二处理器运行API;
所述应用软件能够调用所述离线模型API及输入数据API、所述第二处理器驱动API,以及所述第二处理器运行API。
在其中一个实施例中,所述第二处理器的数量为多个,或所述第二处理器包括多个处理模块;
所述任务执行装置还能够提供任务分配API,所述应用软件还能够调用所述任务分配API,以控制多个所述第二处理器或控制所述第二处理器的多个处理模块。
本申请还提供了一种数据处理方法,用于所述的计算机设备中,所述方法包括如下步骤:
控制数据处理装置从存储器中获取当前原始网络对应的离线模型及输入数据,其中,所述当前原始网络对应的离线模型中包含所述当前原始网络中各个计算节点对应的模型参数、指令以及所述当前原始网络中的各个计算节点的接口数据;
通过设备管理装置控制所述计算机设备的第二处理器启动;
通过任务执行装置控制所述计算机设备的第二处理器根据所述当前原始网络对应的离线模型及输入数据,运行所述当前原始网络,获得所述当前原始网络的输出数据;
控制所述数据处理装置将所述当前原始网络的输出数据存储至所述存储器中。
在其中一个实施例中,所述方法还包括如下步骤:
实时获取所述当前原始网络对应的离线模型的加载进度;
若所述当前原始网络对应的离线模型的加载进度大于或等于第一预设比例,则执行所述的控制所述计算机设备的第二处理器启动的步骤。
在其中一个实施例中,从所述存储器中获取当前原始网络对应的离线模型及输入数据的步骤之前,所述方法还包括如下步骤:
对所述当前原始网络对应的离线模型进行解析及预处理;
对所述当前原始网络对应的输入数据进行预处理。
同时,本申请还提供了一种数据处理方法,用于所述的计算机设备,所述方法包括如下步骤:
调用离线模型API,获取当前原始网络对应的离线模型,所述当前原始网络对应的离线模型中包含所述当前原始网络中各个计算节点对应的模型参数、指令以及所述当前原始网络中的各个计算节点的接口数据;
调用输入数据API,获取所述当前原始网络的输入数据;
调用第二处理器驱动API,控制所述计算机设备中的第二处理器启动;
调用第二处理器运行API,控制所述第二处理器根据所述当前原始网络对应的离线模型及输入数据,获得所述当前原始网络的输出数据;
调用第二处理器驱动API,控制第二处理器关闭。
此外,本申请还提供了一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被一个或多个处理器执行时,实现上述任一项所述的方法的步骤。
上述的计算机设备、数据处理方法及存储介质,通过数据处理装置可以直接从存储器中获取当前原始网络对应的离线模型及输入数据,从而该计算机设备的第二处理器可以根据其获取的原始网络的离线模型及输入数据运行该当前原始网络,获得当前原始网络的输出数据。由于每个原始网络对应的离线模型中仅包含原始网络中各个计算节点对应的模型参数、指令以及原始网络中各个计算节点的接口数据,因而,原始网络的离线模型的数据量级远远小于该原始网络的数据量级,从而通过在计算机设备上运行该当前原始网络对应的离线模型(轻量级),可以实现计算机设备对重量级的神经网络数据的处理过程。同时,通过在该计算机设备上直接运行该当前原始网络对应的离线模型,无需对当前原始网络中的各个计算节点进行编译等处理操作,可以提高该计算机设备的处理速度及效率。
本申请还提供了一种计算机设备,包括第一处理器、第二处理器、第一存储器和第二存储器,其中,所述第一存储器内存储有多个原始网络对应的离线模型及输入数据和能够在所述第一处理器上运行的运行时系统,所述第二存储器内存储有能够在所述第一处理器或所述第二处理器上运行的操作系统;
所述运行时系统为基于可信运行环境建立的安全的运行时系统,所述第一存储器为安全存储介质;当所述运行时系统在所述第一处理器上运行时,所述运行时系统能够从所述第一存储器内获取当前原始网络对应的离线模型及输入数据,并控制所述第二处理器运行所述当前原始网络对应的离线模型;
其中,所述当前原始网络对应的离线模型中包含原始网络中各个计算节点对应的模型参数、指令以及所述原始网络中的各个计算节点的接口数据。
在其中一个实施例中,所述运行时系统包括:
数据处理装置,所述数据处理装置能够提供离线模型API及输入数据API,用于从所述第一存储器中获取当前原始网络对应的离线模型及输入数据;
设备管理装置,所述设备管理装置能够提供第二处理器驱动API,用于控制所述第二处理器启动或关闭;
任务执行装置,所述任务执行装置能够提供第二处理器运行API,用于控制所述第二处理器运行所述当前原始网络的离线模型及输入数据。
在其中一个实施例中,所述数据处理装置包括离线模型加载模块和输入数据加载模块;
所述离线模型加载模块能够提供离线模型API,用于从所述第一存储器中获取各个所述当前原始网络对应的离线模型,并对所述当前原始网络对应的离线模型进行解析;
所述输入数据加载模块能够提供输入数据API,用于从所述第一存储器中获取所述当前原始网络对应的输入数据。
在其中一个实施例中,所述数据处理装置还包括输入数据预处理模块,所述输入数据预处理模块能够提供数据预处理API,用于对所述当前原始网络的输入数据进行预处理,使所述第二处理器能够运行所述当前原始网络的输入数据,并用于将所述第二处理器获得的输出数据存储至所述第一存储器。
在其中一个实施例中,所述第二处理器的数量为多个,或所述第二处理器包括多个处理模块;
所述任务执行装置还能够提供任务分配API,用于控制多个所述第二处理器,或控制所述第二处理器的多个处理模块。
在其中一个实施例中,所述计算机设备还包括能够在所述运行时系统上运行的安全的应用软件,且所述应用软件能够调用所述离线模型API及输入数据API、所述第二处理器驱动API,以及所述第二处理器运行API。
在其中一个实施例中,所述第一存储器和所述第二存储器在物理上相互独立设置;
或者,所述第一存储器和所述第二存储器集成为一体,且所述第一存储器和所述第二存储器在逻辑上相互独立设置。
本申请还提供了一种数据处理方法,用于所述的计算机设备中,所述方法包括如下步骤:
从第一存储器中获取当前原始网络对应的离线模型及输入数据,所述当前原始网络对应的离线模型中包含所述当前原始网络中各个计算节点对应的模型参数、指令以及所述当前原始网络中的各个计算节点的接口数据;
控制所述计算机设备的第二处理器启动;
控制所述计算机设备的第二处理器根据所述当前原始网络对应的离线模型及输入数据,运行所述当前原始网络,获得所述当前原始网络的输出数据;
将所述当前原始网络的输出数据存储至所述第一存储器中。
本申请还提供了一种数据处理方法,用于所述的计算机设备,所述方法包括如下步骤:
调用离线模型API,从第一存储器中获取当前原始网络对应的离线模型,所述当前原始网络对应的离线模型中包含所述当前原始网络中各个计算节点对应的模型参数、指令以及所述当前原始网络中的各个计算节点的接口数据;
调用输入数据API,获取所述当前原始网络的输入数据;
调用第二处理器驱动API,控制所述计算机设备中的第二处理器启动;
调用第二处理器运行API,控制所述第二处理器根据所述当前原始网络对应的离线模型及输入数据,获得所述当前原始网络的输出数据;
调用第二处理器驱动API,控制第二处理器关闭。
在其中一个实施例中,所述方法还包括如下步骤:
调用数据预处理API,将所述当前原始网络的输出数据存储至所述第一存储器中。
在其中一个实施例中,在所述的调用输入数据API,获取所述当前原始网络的输入数据的步骤之后,所述方法还包括如下步骤:
调用数据预处理API,对获取的所述当前原始网络的输入数据进行预处理,使所述第二处理器能够运行所述输入数据。
此外,本申请还提供了一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被一个或多个处理器执行时,实现上述任一项中所述的方法的步骤。
上述的计算机设备、数据处理方法及存储介质,通过运行时系统的数据处理装置可以直接从第一存储器中获取当前原始网络对应的离线模型及输入数据,从而计算机设备的第二处理器根据其获取的原始网络的离线模型及输入数据运行 该当前原始网络。由于当前原始网络的离线模型中仅仅存储了当前原始网络中各个计算节点对应的模型参数、指令以及当前原始网络中的各个计算节点的接口数据等必要的网络结构信息。因而该当前原始网络的离线模型的数据量级远远小于该当前原始网络的数据量级,从而通过运行当前原始网络的离线模型,能够实现在基于TEE等可信执行环境建立的安全的运行时系统对神经网络等重量级数据的处理过程,拓展了神经网络的应用范围。同时,通过在该计算机设备上直接运行该原始网络对应的离线模型,无需对原始网络中的各个计算节点进行编译等处理操作,可以提高该计算机设备的处理速度及效率。
附图说明
图1为一个实施例中提出的一种任务并行处理系统的结构示意图;
图2为一个实施例中提出的一种任务并行处理系统的结构示意图;
图3为一个实施例中提出的一种任务并行处理方法的步骤流程图;
图4为一个实施例中提出的对不带模型的操作请求的输入数据和输出数据进行拆分的示意图;
图5为一个实施例中提出的神经网络模型的卷积操作(conv)的输入输出示意图;
图6为一个实施例中提出的对conv模型进行拆分的示意图;
图7为一个实施例中提出的一种任务并行处理方法的步骤流程图;
图8为一个实施例中构建的任务有向无环图DAG;
图9为一个实施例中的需执行任务分发结果示意图;
图10为一个实施例中提出的一种任务并行处理方法的步骤流程图;
图11为一个实施例中构建的任务有向无环图DAG;
图12为一个实施例中的需执行任务分发结果示意图;
图13为一个实施例中提出的一种任务并行处理装置的结构示意图;
图14为一个实施例中提出计算机系统的结构示意图;
图15为一个实施例中一种指令列表调度方法的步骤流程图;
图16为一个实施例中得到的待调度指令的数据依赖关系图;
图17为一个实施例中得到的选择节点的关联图;
图18为一个实施例中提出的指令列表调度装置的结构示意图;
图19为一个实施例中提出的一种计算机设备的内部结构图;
图20为一实施例中计算机设备的结构框图;
图21为图20中第一处理器一实施例的结构框图;
图22为图20中运行时系统一实施例的结构框图;
图23为图20中运行时系统另一实施例的结构框图;
图24为图20中计算机设备一实施例的数据处理方法的流程图;
图25为图20中计算机设备另一实施例的数据处理方法的流程图;
图26为一实施例的离线模型生成方法的流程图;
图27为另一实施例的离线模型生成方法的流程图;
图28为一实施例的神经网络的网络结构图;
图29为图28中神经网络的离线模型生成过程示意图;
图30为另一实施例中计算机设备的结构框图;
图31为图30中计算机设备一实施例的数据处理方法的流程图;
图32为图30中计算机设备另一实施例的数据处理方法的流程图。
具体实施方式
为了使本申请的申请目的、技术方案及技术效果更加清楚明白,以下结合附图对本申请的具体实施例进行描述。应当理解,此处所描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。需要说明的是,在不冲突的情况下,本申请中的实施例及实施例中的特征可以相互组合。应当清楚是,本实施例中的“第一”、“第二”等,仅用于区分所描述的对象,不具有任何顺序或技术含义。
图1示出的为本申请实施例提出一种任务并行处理系统600(为了便于区分,下文称第一任务并行处理系统)的结构示意图。所述处理器系统包括:处理器620和存储器610,存储器610上存储有处理器620可执行的指令;处理器620包括多个处理器核,各处理器核可以通过内总线进行通信,执行不同的任务。处理器620的处理器核可以运行拆分算法。
图2示出的为本申请实施例提出另一种任务并行处理系统700(为了便于区分,下文称第二任务并行处理系统)的结构示意图,该任务并行处理系统包括第一处理器710、第二处理器720和存储器730。存储器730上存储有第一处理器710和/或第二处理器720可执行的指令。第一处理器710的处理器核需具备运行拆分算法的能力;第二处理器720可以不具备运行拆分算法的能力。第一处理器710与第二处理器720各自的处理器核通过内总线进行通信,执行不同任务。第一处理器710与第二处理器720通过总线通信,协同工作。作为一种可选的实施方式,第一处理器710可以为多核处理器,也可以为单核处理器。第二处理器720可以为多核处理器。
如图3为本申请提出的一种任务并行处理方法的步骤流程图。该方法能够应用于图1或图2所示的任务并行处理系统,下述步骤可以以指令的形式存储于上述任务并行处理系统的存储器上,该任务并行处理方法可以包括:
步骤S301:根据需执行任务之间的依赖关系,构建任务有向无环图DAG。
本实施例中的有向无环图DAG是为了表示需执行任务之间的驱动依赖关系。DAG(Directed Acyclic Graph,有向无环图)是有向图的一种,常被用来表示事件之间的驱动依赖关系,管理任务之间的调度。基于DAG的这些特性,因此,可以使用DAG来描述获取的需执行任务之间的逻辑关系。
需执行任务之间的依赖关系是指:某些需执行任务的执行需要依赖于其他执行任务的执行结果。例如:读取A指令,需要依赖于写入A指令这一操作。
作为一种可选的实施方式,需执行任务可由第一任务并行处理系统600中处理器620的处理器核运行预设的拆分算法,拆分需执行的程序得到。
作为一种可选的实施方式,需执行任务可由第二任务并行处理系统700中第一处理器710的处理器核运行预设的拆分算法,拆分需执行的程序得到。
本实施步骤S301可以由第一任务并行处理系统600中处理器620的处理器核执行,或者第二任务并行处理系统700中第一处理器的处理器核执行。
步骤S302:根据所述任务有向无环图DAG,将各所述需执行任务分发至所述处理器的多个工作队列。
第一任务并行处理系统600中处理器的处理器核,或者第二任务并行处理系统700中处理器核都可以包括一个或多个工作队列。
工作队列（work queue）是将任务推后执行的一种机制，可以按序运行放入的需执行任务。工作队列中的各需执行任务的运行由一个内核线程控制，因此可以通过处理器系统中的中断控制机制调整工作队列的控制线程实现任务重新调度甚至睡眠。
在将需执行任务分发至工作队列时,尽可能的将可并行的任务分发至不同的工作队列以减少程序的运行时间。任务有向无环图DAG中并行结点关联的下游需执行任务一般为可并行的需执行任务,因此,可以根据构建的任务有向无环图DAG,进行需执行任务的分发。
需要说明的是,本实施步骤S302可以由第一任务并行处理系统600中的任一处理器核执行,也可以由第二任务并行处理系统700中的任一处理器核执行。
步骤S303:根据所述任务有向无环图DAG中各所述需执行任务的依赖关系,调控各所述工作队列中并行的需执行任务开始运行。
由于各工作队列独立运行，当某工作队列中存在需依赖于其他工作队列中的需执行任务的输出结果时，若不对需执行任务进行调度会出现执行错误。因此，为了保证程序输出正确结果，需根据任务有向无环图DAG中各所述需执行任务的依赖关系对各工作队列中的各需执行任务进行调度，控制各需执行任务的运行。
需要说明的是,本实施步骤可以由第一任务并行处理系统600中的任一处理器核执行,也可以由第二任务并行处理系统700中的任一处理器核执行。本实施例提出的一种任务并行处理方法,通过根据需执行任务之间的依赖关系,构建任务有向无环图DAG,再根据任务有向无环图DAG进行需执行任务分发与控制,依赖于工作队列的可重新调度性实现多核处理器的任务并行,提高了数据处理效率。本实施例提出的任务并行处理方法的实施不依赖tensorflow或者Caffe等框架程序,因此在设计程序时无需考虑接口兼容等问题。
在其中一个实施例中,根据需执行任务之间的依赖关系,构建任务有向无环图DAG的步骤之前包括:
根据程序中的操作节点和/或数据节点对程序进行拆分,获取所述需执行任务。执行程序中包含多个操作请求(如:conv,pool,active,add等),各操作请求之间存在操作节点。因此,可以根据操作节点拆分程序获取需执行任务。
在某些执行程序中，可能包含的操作请求均需顺序执行。在这种情形下，可以考虑在执行程序的数据层面（代码层面），根据程序中的数据节点进行拆分，增加任务的并行可能性。
本实施步骤需由第一任务并行处理系统600中处理器620的处理器核，或者第二任务并行处理系统700中第一处理器710的处理器核运行预设的拆分算法，根据程序中的操作节点和/或数据节点对需执行程序进行拆分得到需执行任务。
需要说明的是,在对执行程序进行拆分时,可以仅根据操作节点对执行程序进行拆分,也可以直接在数据层面根据数据节点进行拆分,还可以将二者结合。尽管将执行程序拆分的越细致,任务并行的可能性也越高,但是这也会增加任务并行时的调控难度。因此,在选择对执行程序的拆分时需根据实际需求选择拆分方式,本申请对此不作限定。
在其中一个实施例中，第一任务并行处理系统600中处理器620的处理器核，或者第二任务并行处理系统700中第一处理器710的处理器核根据程序中的操作节点对程序进行拆分时，包括两种情形：1）程序中包括带模型的操作请求；2）程序中不包括带模型的操作请求。
情形一:当所述程序中包括不带模型的操作请求(如pool,batchnorm,Lrn,active,add等)时,根据程序中的操作节点对程序进行拆分,获取所述需执行任务的步骤包括:
对所述不带模型的操作请求的输入数据和/或输出数据进行拆分,获取需执行任务。
当不带模型的操作请求的输入数据和/或输出数据进行拆分时,可以按照预设规则在数据的窗口方向(height width方向,hw方向)拆分所述输入数据和/或输出数据,得到需执行任务。
如图4示出的为在数据的窗口方向上,对不带模型的操作请求的输入数据和输出数据进行拆分的示意图。此次拆分的预设规则为,在窗口所在的平面上均分输入数据和输出数据。
设定输出数据Y=输入数据X,其中,X=x1+x2+x3+x4;Y=y1+y2+y3+y4。
需要说明的是,在数据的窗口方向上均分输入数据和输出数据得到需执行的任务,仅是本实施例提出的一种在数据的窗口方向上拆分输入数据和输出数据的具体形式,实际情形中,还可以以非均分的形式在数据的窗口方向上拆分数据,或者以不同的均分方式在数据的窗口方向上拆分数据,只要可以按照一定的规则将输入数据和输出数据拆分开,即可实现本步骤的目的,具体如何拆分,本申请不做限定。
还需要说明的是,本申请提出在数据的窗口方向上拆分输入数据和输出数据旨在获取多个需执行任务,只要输入数据和输出数据进行拆分即可达到本步骤的目的。因此,对不带模型的操作请求进行拆分得到需执行任务时,可以仅对输入数据进行拆分,也可以仅对输出数据进行拆分,还可以既拆分输入数据又拆分输出数据,上述情形均可以达到本步骤的实施目的,具体如何拆分可根据具体操作以及实际需求灵活选择。
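为便于理解上述在数据的窗口方向（hw方向）上拆分输入数据与输出数据的方式，下面给出一个示意性的Python片段（拆分函数、数据规模与均分方式均为假设，并非本申请的正式实现），以不带模型的add操作为例，将输入数据在hw平面上均分为四个可并行的需执行任务：
```python
import numpy as np

def split_hw(data, rows=2, cols=2):
    """在数据的窗口方向(hw平面)上按预设规则均分数据，返回子块列表。"""
    h, w = data.shape
    blocks = []
    for i in range(rows):
        for j in range(cols):
            blocks.append(data[i * h // rows:(i + 1) * h // rows,
                               j * w // cols:(j + 1) * w // cols])
    return blocks

# 以不带模型的add操作为例：每对子块构成一个可并行的需执行任务
x = np.arange(16, dtype=np.float32).reshape(4, 4)
y = np.arange(16, dtype=np.float32).reshape(4, 4)
sub_tasks = list(zip(split_hw(x), split_hw(y)))      # 4个需执行任务
partial_results = [a + b for a, b in sub_tasks]       # 各子任务可分发至不同工作队列
result = np.block([[partial_results[0], partial_results[1]],
                   [partial_results[2], partial_results[3]]])
assert np.allclose(result, x + y)                     # 拼接后与整体计算结果一致
```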
情形二:当所述程序包括带模型的操作请求(如conv,mlp等)时,根据程序中的操作节点对程序进行拆分,获取所述需执行任务的步骤包括:
对所述带模型的操作请求的模型进行拆分和/或对所述模型的输入数据进行拆分,获取需执行任务。
当对所述带模型的操作请求的模型进行拆分时,需预先设置拆分模型得到的各所述需执行任务对应的权值;再使用各所述权值,设置所述需执行任务的输入数据与输出数据的对应关系。
对所述带模型的操作请求的模型进行拆分时,可以按照预设规则在模型的窗口方向(height width方向,hw方向)上拆分所述带模型的操作的模型,得到需执行任务;也可以在模型的通道方向(channel方向,C方向)上拆分所述带模型的操作的模型,得到需执行任务;还可以将二者进行结合。
此外,也可以在hw平面上拆分带模型的操作的输入数据,得到需执行任务。
图5示出的神经网络模型的卷积操作(conv)的输入输出示意图。图6示出的为在通道方向上,对conv模型进行拆分的示意图。
设定conv模型依照:输出数据Y=输入数据X,进行输入输出。则将mlp(Multi-Layer Perceptron,多层感知器)任务在模型的C方向上分成3个子任务。输入数据X拆分成x1,x2,x3,对应的输出数据为y1,y2,y3。
由于神经网络自身特殊结构,拆分后的输入数据除进行处理外还需乘以相应的权值Si,以获取对应的输出数据为y1,y2,y3,其中,i为X拆分数。即:y1=x1*S1+x2*S2+x3*S3;y2=x1*S4+x2*S5+x3*S6;y3=x1*S7+x2*S8+x3*S9。最后通过运算处理y1、y2、y3即可得到输出数据Y。
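上述按通道方向（C方向）拆分模型并为各需执行任务设置权值的过程，可以用分块矩阵运算直观说明。下面的Python片段是基于本实施例描述的一个最小示意（矩阵规模与拆分数均为假设，并非本申请的正式实现），演示将权值在C方向上拆成3×3个子块后，由各子任务的部分结果组合得到与整体计算一致的输出：
```python
import numpy as np

C = 6                                          # 假设通道数为6，在C方向上拆成3份
S = np.random.rand(C, C).astype(np.float32)    # 整体权值
X = np.random.rand(C).astype(np.float32)       # 输入数据

# 将输入X拆成x1,x2,x3，将权值S拆成3x3个子块(对应文中的S1~S9)
xs = np.split(X, 3)
S_blocks = [np.hsplit(row, 3) for row in np.vsplit(S, 3)]

# 每个yi由三个子任务的部分结果相加得到(对应文中 y1 = x1*S1 + x2*S2 + x3*S3 的形式)
ys = []
for i in range(3):
    yi = sum(xs[j] @ S_blocks[j][i] for j in range(3))
    ys.append(yi)
Y = np.concatenate(ys)

assert np.allclose(Y, X @ S)                   # 与未拆分的整体计算结果一致
```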
在hw平面上拆分带模型的操作的输入数据方式与不带模型的操作请求在hw平面上拆分输入数据类似,在此不做详述。
需要说明的是,对带模型的操作请求进行拆分时,既可以仅在模型C方向上拆分,也可以仅在模型hw平面上拆分,还可以同时模型的C方向上和模型hw平面上拆分。虽然多种拆分方式可以增加任务的并行可能性,在理论上减少程序的运行时间,但是其实现难度也会相应加大,此外,实际应用中,运行拆分后的需执行任务,实际运行时间也会稍大于理论运行时间,因此,如何拆分带模型的操作请求还需根据实际场景进行选择,本申请对此不作限定。
使用上述两个情形提供的对获取需执行任务的方法得到的需执行任务的并行可能性高,构建任务有向无环图DAG中并行结点更加丰富,进而使得需执行程序的运行更加高效。
在其中一个实施例中,第一任务并行处理系统600或第二任务并行处理系统700的处理器核,按照获取的所述需执行任务之间的依赖关系,构建任务有向无环图DAG,包括:
按照获取的各需执行任务之间的依赖关系,确定所述任务有向无环图DAG中的并行结点与顺序结点;
根据所述并行结点与顺序结点构建任务有向无环图DAG。
获取的需执行任务之间可能存在依赖关系,也可能无依赖关系。当两需执行任务之间无依赖关系时,两需执行任务一般为可并行任务;当两需执行任务之间存在依赖关系时,两需执行任务一般为串行任务。因此可以根据各需执行任务之间的依赖关系确定任务有向无环图DAG中的并行结点与顺序结点,根据确定的不同类型的节点将各任务填充至任务有向无环图DAG的相应位置,完成任务有向无环图DAG的构建。
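下面给出一个根据依赖关系构建任务有向无环图DAG并确定并行结点与顺序结点的最小示意性Python片段（任务名与数据结构均为假设，并非本申请的正式实现）：
```python
from collections import defaultdict

# 以"被依赖任务 -> 依赖它的任务"描述依赖关系：B依赖A，C、D依赖B，E依赖C和D
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E")]

dag = defaultdict(list)          # 邻接表形式的任务有向无环图DAG
indegree = defaultdict(int)      # 每个需执行任务的入度(所依赖任务数)
for u, v in edges:
    dag[u].append(v)
    indegree[v] += 1

tasks = {t for e in edges for t in e}
# 入度为0的任务为起始结点；同一任务的多个下游任务互为并行结点(如C与D)
start_nodes = [t for t in tasks if indegree[t] == 0]
parallel_groups = {u: vs for u, vs in dag.items() if len(vs) > 1}
print(start_nodes)        # ['A']
print(parallel_groups)    # {'B': ['C', 'D']}
```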
需要说明的是,当需要对需执行的程序进行拆分得到需执行任务时,需要保证任务并行处理系统中至少包含一个可以运行拆分算法的处理器,用于拆分程序获取需执行任务。
在其中一个实施例中,第一任务并行处理系统600或第二任务并行处理系统700的处理器核,根据所述任务有向无环图DAG将各所述需执行任务分发至所述处理器的多个工作队列,包括:
步骤S2021:对任务有向无环图DAG进行拓扑排序,获取任务拓扑排序序列。
步骤S2022:根据各所述需执行任务的预设执行时间,对得到的所述拓扑排序序列进行排序,得到最长拓扑排序序列。
步骤S2023:根据所述最长拓扑排序序列以及各所述需执行任务之间的依赖关系,分发各所述需执行任务至所述工作队列。
本实施例中，处理器核进行任务分发时，可以将任务分发至具有运行拆分算法的处理器核的工作队列，例如，将任务分发至第一任务并行处理系统600中处理器620的处理器核的工作队列；也可以将任务分发至不具有运行拆分算法能力的处理器核的工作队列，例如第二任务并行处理系统700中第二处理器720的处理器核的工作队列。只要保证处理器核能够执行被分发的任务，就可以保证以并行的方式运行需执行的程序；运行需执行任务的处理器核是否具有运行拆分算法的能力，此时不会影响程序的执行，因此，本申请对此不做限定。
本实施例根据任务拓扑排序序列的最长路径进行需执行任务分发,可以优化程序的执行时间,即理论上执行最长拓扑排序序列中任务的时间即为程序执行时间,这样可以保证需执行程序以最短的时间执行完毕。
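对于上述“拓扑排序、得到最长拓扑排序序列、再进行任务分发”的过程，下面的Python片段给出一个最小示意（任务依赖关系、预设执行时间与工作队列数量均为假设值，并非本申请的正式实现）：
```python
from collections import defaultdict

edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E")]
exec_time = {"A": 3, "B": 2, "C": 2, "D": 4, "E": 5}   # 各需执行任务的预设执行时间

succ, pred = defaultdict(list), defaultdict(list)
for u, v in edges:
    succ[u].append(v)
    pred[v].append(u)

# 按依赖关系做拓扑排序，同时计算每个任务的最早完成时间，其中最大者对应最长拓扑排序序列
finish, order = {}, []
ready = [t for t in exec_time if not pred[t]]
while ready:
    t = ready.pop()
    finish[t] = exec_time[t] + max((finish[p] for p in pred[t]), default=0)
    order.append(t)
    ready += [v for v in succ[t]
              if v not in ready and all(p in finish for p in pred[v])]

# 最长拓扑排序序列(关键路径)上的任务分发至同一工作队列，其余任务分发至其他队列
longest = max(finish, key=finish.get)
critical = []
while True:
    critical.append(longest)
    prev = [p for p in pred[longest] if finish[p] == finish[longest] - exec_time[longest]]
    if not prev:
        break
    longest = prev[0]
critical.reverse()
queues = {0: critical, 1: [t for t in order if t not in critical]}
print(queues)   # 例如 {0: ['A', 'B', 'D', 'E'], 1: ['C']}
```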
在其中一个实施例中,第一任务并行处理系统600或第二任务并行处理系统700的处理器核,根据所述任务有向无环图DAG中各所述需执行任务的依赖关系,调控各所述工作队列中并行的需执行任务的运行,包括:
步骤S3031:根据所述任务有向无环图DAG为各所述需执行任务设置引用计数。
步骤S3032:若被依赖的需执行任务已执行,则修改依赖的需执行任务的引用计数;
步骤S3033:当所述需执行任务的引用计数达到预设值,控制各所述工作队列中引用计数达到预设值的需执行任务运行。
图7示出的为一种任务并行处理方法的步骤流程图。该方法包括:
步骤S701:根据需执行程序中的操作节点对执行程序进行拆分,获取需执行任务A3,B2,C2,D4,E5,F1,并根据需执行任务A3,B2,C2,D4,E5,F1之间的依赖关系构建任务如图8所示的任务有向无环图DAG。
步骤S702:根据图8所示的任务有向无环图DAG,将需执行任务A3,B2,C2,D4,E5,F1分发至所述工作队列1,工作队列2,工作队列3。分发结果如图9所示。
步骤S703:根据任务有向无环图DAG为需执行任务A3,B2,C2,D4,E5设置引用计数,并根据设置的引用计数控制A3,B2,C2,D4,E5,F1的运行。
本实施例中设定当引用计数为0时,工作队列中需执行任务开始运行。如需执行任务A3的引用计数为0,需执行任务A3被放入工作队列可以直接执行;需执行任务E5需依赖需执行任务B2和需执行任务C2的执行结果,因此将需执行任务E5的引用计数设置为2。当需执行任务B2执行完毕,将需执行任务E5的引用计数调整为1,当需执行任务C2执行完毕,再将需执行任务E5的引用计数调整为0,引用计数为0时,引用计数E5可以开始执行,同理控制需执行任务F1的运行,最终运行完成需执行程序。
图10示出了一种任务并行处理方法的步骤流程图。该方法包括:
步骤S6001:获取下述需执行程序中的数据节点,对需执行程序进行拆分,获取需执行任务,并根据需执行任务之间的依赖关系构建任务如图11所示的任务有向无环图DAG。
B=conv(A);
C=pool(B);
D=active(B);
E=add(C,D)。
其中,A,B,C,D,E为数据节点,conv,pool,active,add为操作节点。
本实施例的中的任务有向无环图DAG中数据E的获得依赖对数据C和数据D的处理结果,数据C和数据D的获得依赖对数据B的处理结果,而数据B的获得依赖对数据A的处理结果。
步骤S6002:根据图11所述的任务有向无环图DAG,将各需执行任务分发至工作队列1’和工作队列2’。分发结果如图12所示。
步骤S6003:根据任务有向无环图DAG为需执行任务设置引用计数,并根据设置的引用计数控制各需执行任务的运行。
本实施例设定引用计数的值为0时,工作队列中的需执行任务开始运行,否则不运行。当被引用的任务被执行后,任务的引用计数会减1,直至减为0,该任务才可被执行。初始设定,需执行任务B=conv(A)的引用计数为0;需执行任务C=pool(B)的引用计数为1;需执行任务D=active(B)的引用计数为1;需执行任务E=add(C,D)的引用计数为2。当需执行任务B=conv(A)运行完毕,需执行任务C=pool(B)和需执行任务D=active(B)的引用计数均减小1,变为0,此时需执行任务C=pool(B)和需执行任务D=active(B)开始运行。同理,当运行任务C=pool(B)和运行任务D=active(B)运行完毕后,运行任务E=add(C,D)的引用计数变为0,此时需执行任务E开始运行,需执行任务E运行完毕即需执行程序运行完毕。
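结合上述B=conv(A)、C=pool(B)、D=active(B)、E=add(C,D)的例子，下面的Python片段示意引用计数的设置与调控过程（为便于演示，用简单的循环模拟各工作队列的运行，任务在工作队列中的分发结果为假设，并非本申请的正式实现）：
```python
# 每个需执行任务的引用计数 = 它所依赖的需执行任务个数
ref_count = {"B": 0, "C": 1, "D": 1, "E": 2}
dependents = {"B": ["C", "D"], "C": ["E"], "D": ["E"], "E": []}

work_queues = {1: ["B", "C", "E"], 2: ["D"]}       # 任务分发结果(示意)
finished, executed_order = set(), []

while len(finished) < len(ref_count):
    for qid, queue in work_queues.items():
        # 队首任务的引用计数减到0(预设值)时才开始运行
        if queue and ref_count[queue[0]] == 0:
            task = queue.pop(0)
            executed_order.append(task)
            finished.add(task)
            for d in dependents[task]:             # 被依赖任务执行完毕后修改引用计数
                ref_count[d] -= 1

print(executed_order)    # ['B', 'D', 'C', 'E']
```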
基于同样的申请思想,本申请提出了一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现上述实施例所提及方法的步骤。
基于同样的申请构思,本申请提出了一种任务并行处理装置,该装置结构如图13所示,包括:DAG图构建模块410、任务分发模块420和调度控制模块430。其中,DAG图构建模块410用于根据需执行任务之间的依赖关系,构建任务有向无环图DAG;任务分发模块420用于根据所述任务有向无环图DAG,将各所述需执行任务分发至处理器的多个工作队列;调度控制模块430用于根据所述任务有向无环图DAG中各所述需执行任务的依赖关系,调控各所述工作队列中并行的需执行任务开始运行。
在其中一个实施例中,DAG图构建模块410用于根据程序中的操作节点和/或数据节点对程序进行拆分,获取所述需执行任务。
在其中一个实施例中,DAG图构建模块410用于若所述程序中包括带模型的操作请求,则对所述带模型的操作请求的模型进行拆分和/或对所述模型的输入数据进行拆分,获取需执行任务。
在其中一个实施例中,DAG图构建模块410用于若所述程序包括不带模型的操作请求,则对所述不带模型的操作请求的输入数据和/或输出数据进行拆分,获取需执行任务。
在其中一个实施例中,所述DAG图构建模块410用于按照获取的需执行任务之间的依赖关系,确定所述任务有向无环图DAG中的并行结点与顺序结点;根据所述并行结点与顺序结点构建任务有向无环图DAG。
在其中一个实施例中,任务分发模块420用于对所述任务有向无环图DAG进行拓扑排序,获取任务拓扑排序序列;根据各所述需执行任务的预设执行时间,对得到的所述拓扑排序序列进行排序,得到最长拓扑排序序列;根据所述最长拓扑排序序列以及各所述需执行任务之间的依赖关系,分发各所述需执行任务至所述工作队列。
在其中一个实施例中,调度控制模块430用于根据所述任务有向无环图DAG为各所述需执行任务设置引用计数;若被依赖的需执行任务已执行,则修改需依赖的需执行任务的引用计数;当所述需执行任务的引用计数达到预设值,控制各所述工作队列中引用计数达到预设值的需执行任务开始运行。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到本申请可以通过硬件实现,也可以借助软件加必要的通用硬件平台的方式来实现。基于这样的理解,本申请的技术方案可以以软件产品的形式体现出来,该软件产品可以存储在一个非易失性存储介质(可以是CD-ROM,U盘,移动硬盘等)中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)运行本申请各个实施场景的方法。
在上述处理系统的第一任务并行处理系统中处理器,或者第二任务并行处理系统中第一处理器的处理器核可以根据对应的指令列表并行处理不同的指令,提高该计算机系统的处理效率。但是,上述计算机系统处理系统中各个处理器核对应的指令列表中的指令顺序可能并不合理,例如没有使指令列表中的指令尽可能的并行,这样可能无法提升处理系统的处理效率,或者提升效率的效果不佳。因此,如何提供一种指令列表调度方法、装置、计算机设备及存储介质,进行指令列表中指令顺序调整,使指令列表中各个指令之间的排列更加紧凑,缩短指令列表的执行时间便成为亟需解决的技术问题。
如图14所示,为其中一实施例的计算机系统300可以为多核处理器计算机系统(Multi-core processor Computing System)、异构计算机系统(Heterogeneous Computing System)等包含有多个处理器的多处理器计算机系统(Multi-processor Computing System)。可选地,该计算机系统具体可以包括指令列表调度装置310、多个第一处理器320以及存储器330,多个第一处理器320可以同时连接至指令列表调度装置310,指令列表调度装置310可以用于多个第一处理器320的指令列表重新调度。可选地,该指令列表调度装置310也可以包括第二处理器。可选地,该第二处理器可以包括获取模块、数据依赖分析模块、评估模块、运算模块及控制模块等等,其中,该获取模块可以是IO(Input输入/Output输出)接口等硬件模块,运算模块及控制模块均为硬件模块。
多个第一处理器320可以根据指令列表并行处理不同的指令,以提高该计算机系统的处理效率。可选的,指令列表中可以包含一条或多条指令,每一条指令包含了一组对资源的引用操作,通过对指令的读取或运行,可以获知该指令的引用的资源。即当第一处理器等执行该指令时,可以调用该指令引用的资源,以实现特定的操作。例如,该指令可以是 加载指令(Load)、计算指令(computing)或存储指令(store)等等,当然该指令也可以是神经网络的N层计算,N>0,N可以为整数,也可以为非整数。
进一步地,该指令列表中的各个指令按照执行顺序排列,该指令列表中各个指令引用的资源可以是虚拟内存对象,也可以是物理内存对象。该虚拟内存对象可以是内存区块、寄存器或其他能够存储数据的存储装置在软件逻辑上的虚拟存储空间。本实施例中的指令调度过程即是在保证原指令列表语义不变的前提下,对指令列表中的指令重新排序的过程,这可以使得该指令列表中各个指令之间的排列更加紧凑,以便于缩短指令列表的执行时间,提高系统的处理效率。
例如,指令列表中包含N条指令,其中,N≥1,N为正整数,且N条指令按照执行时序标记为第一条指令、第二条指令,……,第N条指令。对该指令列表的调度过程即为重新对上述N条指令进行排序的过程。
具体地,在对指令列表进行调度时,指令列表调度装置310可以首先获得待调度的指令列表中各指令的数据依赖关系。可选的,该数据依赖关系的形式可以包括RAW(Read After Write,读后写)/WAR(Write After Read,写后读)/WAW(Write After Write,写后写)。可选的,该数据依赖关系可以用数据依赖图DDG(Data Dependence Graph,数据依赖图)描述。进一步的,指令列表调度装置310的第二处理器可以通过其获取模块获取待调度的指令列表,通过其数据依赖分析模块对待调度的指令列表中的指令进行数据依赖分析,得到上述指令之间的数据依赖关系。具体的,数据依赖分析模块可以对待调度的指令列表中的各指令进行资源扫描追踪,进而分析各指令之间的数据依赖关系。本实施例中指令之间的数据依赖是指当前指令的执行是否需要依赖其他指令的执行结果。简单举例来讲,若存在指令A“读取写入的指令B所写入的数据”,那么该指令A依赖于指令B的执行结果。之后,获取模块可以根据得到各指令之间的数据依赖关系,获取指令调度过程中每次进行指令选择的所有选择节点。
之后,指令列表调度装置可以通过评估模块按照预设规则,从对应次序的所有选择节点中确定调度后指令列表中各次序的指令。可选的,第二处理器可以通过其评估模块评估当前次序对应的选择节点,得到当前次序的各选择节点的评估结果,根据评估结果确定当前次序对应的指令。每一选择节点记载该选择节点对应的已排序指令和待调度指令集。可选的,评估模块按照各指令的优先级评估当前次序对应的选择节点。可选的,第二处理器还可以根据当前选择节点的具体内容和/或类型设定指令的优先级。
可选的,指令列表调度装置310在进行指令调度时,可以调整待调度的指令列表中指令对应的第一处理器。例如,该待调度指令对应的第一处理器可以根据指令的类型确定,或该待调度指令的具体内容确定对应的第一处理器。
图15为本申请一实施例的指令列表调度方法的步骤流程图,该指令列表调度方法可以应用于图14所示的计算机系统中。上述计算机系统可以包含存储器330及多个第一处理器320。该指令列表调度方法用以实现上述计算机系统中多个第一处理器对应的指令列表中指令的重新调度,以提高计算机的处理效率。具体地,上述方法可以包括如下步骤:
步骤S100:获取待调度指令列表中的待调度指令集,并对待调度指令集进行数据依赖分析,得到所述待调度指令集中各指令之间的数据依赖关系。
具体的,第二处理器可以通过其获取模块获取待调度指令列表的待调度指令集,通过数据依赖分析模块得到上述指令的数据依赖关系。本实施例中的待调度指令集由待调度指令列表中的多条待调度指令组成。可选的,待调度指令集中不包含待调度指令列表中的无语义指令(例如同步指令等)。进一步的,获取模块获取待调度指令列表的待调度指令集的步骤包括:获取待调度指令列表,删除待调度指令列表中无语义指令,得到待调度指令集。
例如，获取模块获取的待调度指令集中包含六条指令{L1、L2、C1、C2、S1、S2}。其中，L1、C1、S1需顺序执行，L2、C2、S2需顺序执行，其余指令无数据依赖关系。L1、L2、S1、S2为I/O指令，C1、C2为计算指令。数据依赖分析模块对上述待调度指令进行数据依赖分析，得到待调度指令集中各指令之间的数据依赖关系，使用如图16所示的DDG（Data Dependence Graph，数据依赖图）描述上述数据依赖关系。
上述待调度指令列表中各个待调度指令引用的资源可以是虚拟内存对象,也可以为物理内存对象。该虚拟内存对象可以是内存区块、寄存器或其他能够存储数据的存储装置在软件逻辑上的虚拟存储空间。
步骤S200:根据各指令之间的所述数据依赖关系,得到指令调度过程中每次进行指令选择的所有选择节点。
每个选择节点记载与该选择节点对应的已排序指令和待调度指令集。可选的，得到所有选择节点的过程可以为：第二处理器通过其获取模块首先获取第一次指令选择时的所有第一选择节点，具体的为，获取各第一选择节点对应的已排序指令和待调度指令集。应当清楚的是这些待调度指令集中各指令存在数据依赖关系。之后第二处理器通过其获取模块根据每个第一选择节点的数据依赖关系，获取每个第一选择节点关联的所有第二选择节点，第二选择节点与第二次指令选择对应。循环上述步骤，得到第三选择节点，……，第N选择节点，N≥3，N为正整数。上述步骤中获取的第一选择节点，……，第N选择节点的总和组成每次进行指令选择的所有选择节点。
例如，获取的待调度指令列表中的待调度指令集共包含六条指令：{L1、L2、C1、C2、S1、S2}，用图16表示这六条指令之间的数据依赖关系。由图16可清楚地得知，上述待调度指令集中的指令L1、L2可不依赖其他指令的执行，因此，在进行第一次指令选择时，需从L1、L2中选择，即获取的第一选择节点对应选择指令L1或L2的两种情形。当第一次指令选择时选择L1时，L1为已排序指令，此时第一选择节点记载已排序指令L1，以及删除指令L1的待调度指令集{L2、C1、C2、S1、S2}。同理，得到当第一次指令选择时选择L2时，得到的另一个第一选择节点，该第一选择节点记载已排序指令L2，以及删除指令L2的待调度指令集{L1、C1、C2、S1、S2}。循环上述过程可以得到第二次指令选择时的第二选择节点，……，第六次指令选择时的第六选择节点。
本实施步骤每次进行指令选择时，都需要依照前一次指令选择得到的待调度指令集，例如图16对应的待调度指令集，当第一次指令选择时选择的指令为L1时（对应其中一个第一选择节点），得到的待调度指令集{L2、C1、C2、S1、S2}，该第一选择节点的待调度指令集中指令L2，C1可不依赖其他指令的执行，此时，在进行第二次指令选择时，需从L2，C1中选择（对应存在两个第二选择节点）；当第一次指令选择时选择的指令为L2时（对应另一个第一选择节点），得到的待调度指令集{L1、C1、C2、S1、S2}，该第一选择节点的待调度指令集中指令L1，C2可不依赖其他指令的执行，此时，在进行第二次指令选择时，需从L1，C2中选择（也对应存在两个第二选择节点）。由此可知，本实施例得到的所有的选择节点之间存在关联，这种各选择节点关联可以用图17来表示。
步骤S300:按照预设规则,根据对应次序的选择节点确定调度后指令列表中各次序的指令。可选的,第二处理器可以通过其评估模块对当前次序对应的选择节点进行评估,得当前次序的各选择节点的评估结果,根据评估结果确定当前次序对应的指令。例如,当前次序为第二指令,此时对应图17中第二选择节点,按照预设规则评估图17中的四个第二选择节点,根据评估结果得到调度的指令列表中第二指令。可选的,评估模块按照各指令的预设优先级评估当前次序对应的选择节点(例如L2的优先级最高,C1次之……),得到评估结果。可选的,第二处理器根据当前选择节点的具体内容和/或类型设定各指令的优先级。
可选的，评估模块可以根据当前次序的所有选择节点对应的最短执行时间的长短，确定当前次序对应的指令。例如，图17中指令L1对应的第一选择节点，其对应的指令序列的最短执行时间为t1，指令L2对应的第一选择节点，对应的指令序列的最短执行时间为t2，t1>t2，则将L2确定为调度后的指令列表中的第一指令。同理确定调度后的指令列表的第二指令，……，第六指令。
本实施例提出的指令列表调度方法,通过分析待调度指令的数据依赖关系,得到指令调度过程中每次进行指令选择的所有选择节点,再根据对各次序对应的选择节点的评估结果确定调度后的指令列表中各次序的指令。该方法可以保证每次选择指令时,选择的指令为当前状态的最优结果,使用这些最优结果得到的调度后的指令列表,各个指令之间的排列更加紧凑,便于缩短原指令列表中指令序列的执行时间。
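下面的Python片段给出上述“按次序评估选择节点并确定调度后指令”这一思路的简化示意（此处将评估简化为逐次贪心选取，指令集合、执行时间以及I/O单元与计算单元的双单元模型均为假设，并非本申请的正式实现）：
```python
# 待调度指令集及其数据依赖关系：L1->C1->S1，L2->C2->S2
deps = {"L1": [], "C1": ["L1"], "S1": ["C1"],
        "L2": [], "C2": ["L2"], "S2": ["C2"]}
exec_time = {"L1": 2, "L2": 2, "S1": 2, "S2": 2, "C1": 3, "C2": 3}
calc_instructions = {"C1", "C2"}                 # 其余为I/O指令

def seq_time(seq):
    """模拟I/O单元与计算单元并行工作时，按给定顺序发射指令序列的完成时间。"""
    unit_free = {"io": 0, "calc": 0}
    done = {}
    for ins in seq:
        unit = "calc" if ins in calc_instructions else "io"
        start = max(unit_free[unit], max((done[d] for d in deps[ins]), default=0))
        done[ins] = start + exec_time[ins]
        unit_free[unit] = done[ins]
    return max(done.values(), default=0)

def longest_time(scheduled, remaining):
    """选择节点对应的最长执行时间：已排序指令的完成时间加上剩余指令完全串行的时间。"""
    return seq_time(scheduled) + sum(exec_time[i] for i in remaining)

scheduled, remaining = [], list(deps)
while remaining:
    # 当前次序的所有选择节点：数据依赖已全部满足的待调度指令
    candidates = [i for i in remaining if all(d in scheduled for d in deps[i])]
    # 评估各选择节点，选取最长执行时间最小(评估结果最优)者作为当前次序的指令
    best = min(candidates,
               key=lambda i: longest_time(scheduled + [i],
                                          [r for r in remaining if r != i]))
    scheduled.append(best)
    remaining.remove(best)

print(scheduled)                 # 得到一种I/O指令与计算指令尽量并行的紧凑排列
print(seq_time(scheduled))       # 调度后指令序列的(模拟)执行时间，短于完全串行的14
```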
作为一种可选的实施方式,评估模块按照预设规则,根据对应次序的选择节点中确定调度后指令列表中各次序的指令的步骤包括:
步骤S210:评估模块访问所述选择节点,并获取当前访问的选择节点对应的最长执行时间。评估模块访问的选择节点可以是第一选择节点、第二选择节点,……,第N选择节点。
步骤S220：若当前访问的选择节点对应的最长执行时间小于初始执行时间T0，则将当前访问节点的已排序指令确定为调度后的指令列表中对应的指令。其中，初始执行时间为待调度指令列表中指令序列的执行时间。
本实施步骤中当前访问的选择节点对应的最长执行时间是指，当前访问节点对应的指令序列的排列最不合理时的执行时间。例如，图17中左侧第一个第二选择节点对应的最长执行时间为T1=t1+t2+t3+t4+t5，其中，t1为已排序指令L1-L2的执行时间，t2为指令C1的执行时间；t3为指令S1的执行时间，t4为指令C2的执行时间，t5为指令S2的执行时间，这是该选择节点对应的未排序指令C1、C2、S1、S2完全没有并行，排序最不合理时情形。若T1<T0，则分别将L1、L2作为调度后的指令列表中的第一指令和第二指令。
由于仅在当前访问的选择节点对应的最长执行时间小于初始执行时间时才更新调度结果，因此本实施例中提出的指令列表调度方法得到的指令序列的执行时间不会大于待调度指令列表中指令序列的执行时间。
由于本实施例的评估模块按照预设规则访问选择节点，不再仅根据当前次序的选择节点调度指令列表中的指令，可以避免确定的当前次序的指令对后续指令选择的影响。尤其适于调度包含计算量大的指令的指令列表，可选的，包含神经网络运算指令的指令列表。例如，指令列表中包含N条指令，该N条指令中包含一个权值加载指令A，和一个神经网络卷积层计算指令B，若使用传统方法，可能无法使该指令A和指令B并行，使系统达到最高处理效率，本实施例的指令列表调度方案可以实现在调度后的指令列表中指令A和指令B并行。
在其中一个实施例中，上述方法还可以包括：若当前访问的选择节点对应的最长执行时间小于初始执行时间，则初始执行时间更新为当前访问的选择节点对应的最长执行时间。例如，上述实施例中，当T1<T0时，分别将L1、L2作为调度后的指令列表中的第一指令和第二指令，同时将T1更新为初始执行时间。
应当清楚的是,在当前访问的选择节点对应的最长执行时间小于初始执行时间时,将当前访问节点对应的已排序指令确定为调度后的指令列表中对应次序的指令,已经可以保证得到的调度后指令列表中的指令序列的执行时间更短。上述更新初始执行时间的方案是为了进一步优化指令的排序,提高系统的处理效率。
作为一种可选的实施方式,评估模块访问所述选择节点,并获取当前访问的选择节点对应的最长执行时间的步骤包括:
在预设访问时间段内访问选择节点,得到预设访问时间段内每个选择节点对应的最长执行时间。本实施例需结合上述实施例提出的方法确定调度后指令列表的各次序的指令。
由于指令列表中一般存在多条待调度指令,根据这些待调度指令得到的选择节点的数量庞大,在实际操作时,难以有充足的时间遍历所有的选择节点,本申请提出的指令列表调度方法,旨在通过重新排列指令列表中的指令,进一步缩短指令列表的执行时间。基于此,只要通过本申请提出的指令列表调度方法得到的新的指令列表缩短了执行时间即实现本申请的目的。因此,在实际运用本申请提出的指令列表调度方法进行指令重新排序时,一般会根据实际需求,设定访问时间段,控制指令的调度时间。
作为一种可选的实施方式,若当前访问的选择节点对应的最长执行时间不小于初始执行时间,则将待调度指令表中指令序列作为调度后指令表中的指令序列。
本实施例在当前访问的选择节点对应的最长执行时间不小于初始执行时间,将待调度指令表中指令序列作为调度后指令表中的指令序列是对上述实施例提出的指令列表调度方法的优化。可以保证得到的调度后指令列表中的指令序列是,在预设时间段内得到的最优结果。
作为一种可选的实施方式,访问所述选择节点,并获取当前访问的选择节点对应的最长执行时间的步骤:
步骤S230:评估模块获取当前访问的选择节点对应的最短执行时间。
步骤S240：若当前访问的选择节点对应的最短执行时间大于初始执行时间T0，则终止访问与当前访问的选择节点关联的选择节点。例如，指令L2对应的第二选择节点的最短执行时间为T2，T2对应该选择节点的未排序指令C1、C2、S1、S2完美并行、排序最合理时的情形。若T2>T0，则终止访问与该第二选择节点关联的第三选择节点，以及与这些第三选择节点关联的第四选择节点，……，第六选择节点。
由于评估模块每访问一个选择节点均会消耗时间,本实施例的技术方案可以排除对选择节点的无效访问,提高指令列表的调度效率。
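上述利用最短执行时间进行剪枝的访问过程可以用如下Python片段示意（选择节点树及各节点的最短执行时间均为假设数值，仅用于说明“最短执行时间大于初始执行时间则终止访问其关联的选择节点”这一控制流程，并非本申请的正式实现）：
```python
# 用"节点 -> {最短执行时间, 子节点}"表示选择节点之间关联的一个假设示例
tree = {
    "root": {"min": 0,  "children": ["n1", "n2"]},
    "n1":   {"min": 9,  "children": ["n3"]},      # 最短执行时间9 < 初始执行时间12
    "n2":   {"min": 13, "children": ["n4"]},      # 13 > 12，其关联的选择节点不再访问
    "n3":   {"min": 10, "children": []},
    "n4":   {"min": 11, "children": []},
}
initial_time = 12          # 待调度指令列表中指令序列的执行时间
visited = []

def visit(node):
    visited.append(node)
    if tree[node]["min"] > initial_time:
        return                       # 剪枝：终止访问与当前选择节点关联的选择节点
    for child in tree[node]["children"]:
        visit(child)

visit("root")
print(visited)    # ['root', 'n1', 'n3', 'n2']，n4未被访问
```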
作为一种可选的实施方式，评估模块访问所述选择节点，并获取当前选择访问的选择节点对应的最长执行时间的步骤包括：评估模块按照随机优先（例如蒙特卡洛树搜索，MCTS，Monte Carlo Tree Search）的规则选择所述选择节点进行访问，并获取当前选择访问的选择节点对应的最长执行时间。
作为一种可选的实施方式,评估模块访问所述选择节点,并获取当前访问的选择节点对应的最长执行时间的步骤包括:评估模块按照广度优先(BFS,Breadth First Search)的规则选择所述选择节点进行访问,并获取当前选择访问的选择节点对应的最长执行时间。具体的,本实施例中的广度优先是指优先选择与当前访问的选择节点同一次序的选择节点进行访问。例如当前访问的是第二选择节点,则下一个访问的选择节点优先选择其他的第二选择节点。
作为一种可选的实施方式，评估模块访问所述选择节点，并获取当前访问的选择节点对应的最长执行时间的步骤包括：评估模块按照深度优先（DFS，Depth First Search）的规则选择所述选择节点进行访问，并获取当前选择访问的选择节点对应的最长执行时间。具体的，本实施例中的深度优先是指优先选择当前访问的选择节点关联的下一次序的选择节点进行访问。例如当前访问的是第二选择节点，则下一个访问的选择节点优先选择与该第二选择节点关联的第三选择节点。
可选的,评估模块还可以采用随机优选结合深度优先的规则选择所述选择节点进行访问,或者采用广度优先结合深度优先的规则选择所述选择节点进行访问。具体的,按照广度或随机优先的规则选择小于预设次序的所述选择节点进行访问,得到当前选择访问的选择节点对应的最长执行时间;按照深度优先的规则选择不小于预设次序的所述选择节点进行访问,得到当前选择访问的选择节点对应的最长执行时间。可选的,上述对应次序的预设值根据经验值确定,或者根据预实验结果确定。
在设置访问时间段进行指令列表调度时,指令列表调度装置的评估模块没有足够的时间遍历所有的选择节点,此时,若单一采用深度优先或者广度优先的原则选择所述选择节点进行访问时,最终访问的选择节点的涉及范围可能比较片面(例如仅访问某一选择节点关联的选择节点,或者仅访问了前几次序的选择节点),而仅采用随机优选的原则选择所述选择节点进行访问时最终访问的选择节点的随机性又太强,因此优选采用上述随机优选结合深度优先的规则选择所述选择节点进行访问,或者采用广度优先结合深度优先的规则选择所述选择节点进行访问的方案。
应该理解的是，虽然上述流程图中的各个步骤按照箭头的指示显示，但是这些步骤并不是必然按照箭头指示的顺序执行。除非本文中有明确的说明，这些步骤的执行并没有严格的顺序限制，这些步骤可以以其它的顺序执行。而且，上述流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段，这些子步骤或者阶段并不必然是在同一时刻执行完成，而是可以在不同的时刻执行，这些子步骤或者阶段的执行顺序也不必然是依次进行，而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
图18示出的为在其中一个实施例中提出的指令列表调度装置结构示意图,该装置包括获取模块510、数据依赖分析模块520及评估模块530,其中,所述获取模块510用于获取待调度指令列表中的待调度指令集,以及根据各指令之间的数据依赖关系,得到指令调度过程中每次进行指令选择的所有选择节点。所述数据依赖分析模块520用于对待调度指令集进行数据依赖分析,得到所述待调度指令集中各指令之间的数据依赖关系。所述评估模块530用于按照预设规则,根据对应次序的选择节点中确定调度后指令列表中各次序的指令。
在其中一个实施例中,所述评估模块530访问所述选择节点,并获取当前访问的选择节点对应的最长执行时间;若当前访问的所述选择节点对应的最长执行时间小于初始执行时间,则将当前访问的选择节点的已排序指令确定为调度后的指令列表中对应次序的指令;其中,初始执行时间为待调度指令列表中指令序列的执行时间。
在其中一个实施例中,所述指令调度装置还包括更新模块,所述更新模块,用于当前访问的选择节点对应的最长执行时间小于初始执行时间,则初始执行时间更新为当前访问的选择节点对应的最长执行时间。
在其中一个实施例中,所述评估模块530用于在预设访问时间段内访问选择节点,并获取当前访问的选择节点对应的最长执行时间;若当前访问的选择节点对应的最长执行时间小于初始执行时间,则将当前访问节点对应的已排序指令确定为调度后的指令列表中对应次序的指令;其中,初始执行时间为待调度指令列表中指令序列的执行时间。
在其中一个实施例中,所述评估模块530用于当前访问的选择节点对应的最长执行时间不小于初始执行时间时,则将待调度指令表中指令序列作为调度后指令表中的指令序列。
在其中一个实施例中,所述评估模块530用于按照随机优先的规则选择所述选择节点进行访问,并获取当前选择访问的选择节点对应的最长执行时间。
在其中一个实施例中,所述评估模块530用于按照广度优先的规则选择所述选择节点进行访问,并获取当前选择访问的选择节点对应的最长执行时间。
在其中一个实施例中,所述评估模块530用于按照深度优先的规则选择所述选择节点进行访问,并获取当前选择访问的选择节点对应的最长执行时间。
在其中一个实施例中,所述评估模块530用于按照广度或随机优先的规则选择小于预设次序的所述选择节点进行访问,得到当前选择访问的选择节点对应的最长执行时间;按照深度优先的规则选择不小于预设次序的所述选择节点进行访问,得到当前选择访问的选择节点对应的最长执行时间。
在其中一个实施例中,所述评估模块530用于获取当前访问的选择节点对应的最短执行时间;若当前访问的选择节点对应的最短执行时间大于初始执行时间,则终止访问与当前访问的选择节点关联选择节点;其中,初始执行时间为待调度指令列表中指令序列的执行时间。
在其中一个实施例中,所述评估模块530用于按照指令的预设优先级评估当前次序对应的所有选择节点,得到当前次序的各选择节点的评估结果,并根据所述评估结果确定当前次序对应的指令。
在其中一个实施例中,所述评估模块530用于根据当前选择节点的具体内容和/或类型设定各指令的优先级。
在其中一个实施例中,所述评估模块530用于根据当前次序所有的选择节点对应的最短执行时间的长短,确定当前次序对应的指令。
关于指令列表调度装置的具体限定可以参见上文中对于指令列表调度方法的限定,在此不再赘述。上述指令列表调度装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。
在一个实施例中，提供了一种计算机设备，该计算机设备可以是终端，其内部结构图可以如图19所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口、显示屏和输入装置。其中，该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统和计算机程序。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机程序被处理器执行时以实现上述实施例提及的指令列表调度方法。该计算机设备的显示屏可以是液晶显示屏或者电子墨水显示屏，该计算机设备的输入装置可以是显示屏上覆盖的触摸层，也可以是计算机设备外壳上设置的按键、轨迹球或触控板，还可以是外接的键盘、触控板或鼠标等。
本领域技术人员可以理解,图19中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
在一个实施例中,提供了一种计算机设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,处理器执行计算机程序时实现以下步骤:获取待调度指令列表中的待调度指令集,并对待调度指令集进行数据依赖分析,得到各指令之间的数据依赖关系;根据各指令之间的所述数据依赖关系,得到指令调度过程中每次进行指令选择的所有选择节点;按照预设规则,根据对应次序的选择节点确定调度后指令列表中各次序的指令。
在一个实施例中,处理器执行计算机程序时还实现以下步骤:访问所述选择节点,并获取当前访问的选择节点对应的最长执行时间;若当前访问的选择节点对应的最长执行时间小于初始执行时间,则将当前访问节点的已排序指令确定为调度后的指令列表中对应次序的指令;其中,初始执行时间为待调度指令列表中指令序列的执行时间。
在一个实施例中,处理器执行计算机程序时还实现以下步骤:若当前访问的选择节点对应的最长执行时间小于初始执行时间,则初始执行时间更新为当前访问的选择节点对应的最长执行时间。
在一个实施例中,处理器执行计算机程序时还实现以下步骤:若当前访问的选择节点对应的最长执行时间小于初始执行时间,则基于当前访问节点对应的已排序指令随机生成指令序列,并使用所述随机生成的指令序列将所述待调度指令列表的指令序列更新。
在一个实施例中,处理器执行计算机程序时还实现以下步骤:在预设访问时间段内访问选择节点,并获取当前访问 的选择节点对应的最长执行时间;若当前访问的选择节点对应的最长执行时间小于初始执行时间,则将当前访问节点对应的已排序指令确定为调度后的指令列表中对应次序的指令;其中,初始执行时间为待调度指令列表中指令序列的执行时间。
在一个实施例中,处理器执行计算机程序时还实现以下步骤:按照广度优先的规则选择所述选择节点进行访问,并获取当前选择访问的选择节点对应的最长执行时间。
在一个实施例中,处理器执行计算机程序时还实现以下步骤:按照随机优先的规则选择所述选择节点进行访问,并获取当前选择访问的选择节点对应的最长执行时间。
在一个实施例中,处理器执行计算机程序时还实现以下步骤:按照广度优先的规则选择所述选择节点进行访问,并获取当前选择访问的选择节点对应的最长执行时间。
在一个实施例中,处理器执行计算机程序时还实现以下步骤:按照广度或随机优先的规则选择小于预设次序的所述选择节点进行访问,得到当前选择访问的选择节点对应的最长执行时间;按照深度优先的规则选择不小于预设次序的所述选择节点进行访问,得到当前选择访问的选择节点对应的最长执行时间。
在一个实施例中,处理器执行计算机程序时还实现以下步骤:获取当前访问的选择节点对应的最短执行时间;若当前访问的选择节点对应的最短执行时间大于初始执行时间,则终止访问与当前访问的选择节点关联的选择节点;其中,初始执行时间为待调度指令列表中指令序列的执行时间。
在一个实施例中,处理器执行计算机程序时还实现以下步骤:按照指令的预设优先级评估当前次序对应的所有选择节点,得到当前次序的各选择节点的评估结果,并根据所述评估结果确定当前次序对应的指令。
在一个实施例中,处理器执行计算机程序时还实现以下步骤:根据当前选择节点的具体内容和/或类型设定各指令的优先级。
在一个实施例中,处理器执行计算机程序时还实现以下步骤:根据当前次序所有的选择节点对应的最短执行时间的长短,确定当前次序对应的指令。
在一个实施例中,提供了一种计算机可读存储介质,其上存储有计算机程序,计算机程序被处理器执行时实现以下步骤:获取待调度指令列表中的待调度指令集,并对待调度指令集进行数据依赖分析,得到各指令之间的数据依赖关系;根据各指令之间的所述数据依赖关系,得到指令调度过程中每次进行指令选择的所有选择节点;按照预设规则,根据对应次序的选择节点确定调度后指令列表中各次序的指令。
在一个实施例中,计算机程序被处理器执行时还实现以下步骤:访问所述选择节点,并获取当前访问的选择节点对应的最长执行时间;若当前访问的选择节点对应的最长执行时间小于初始执行时间,则将当前访问节点的已排序指令确定为调度后的指令列表中对应次序的指令;其中,初始执行时间为待调度指令列表中指令序列的执行时间。
在一个实施例中,计算机程序被处理器执行时还实现以下步骤:若当前访问的选择节点对应的最长执行时间小于初始执行时间,则初始执行时间更新为当前访问的选择节点对应的最长执行时间。
在一个实施例中,计算机程序被处理器执行时还实现以下步骤:在预设访问时间段内访问选择节点,并获取当前访问的选择节点对应的最长执行时间;若当前访问的选择节点对应的最长执行时间小于初始执行时间,则将当前访问节点对应的已排序指令确定为调度后的指令列表中对应次序的指令;其中,初始执行时间为待调度指令列表中指令序列的执行时间。
在一个实施例中,计算机程序被处理器执行时还实现以下步骤:若当前访问的选择节点对应的最长执行时间不小于初始执行时间,则将待调度指令表中指令序列作为调度后指令表中的指令序列。
在一个实施例中,计算机程序被处理器执行时还实现以下步骤:按照随机优先的规则选择所述选择节点进行访问,并获取当前选择访问的选择节点对应的最长执行时间。
在一个实施例中,计算机程序被处理器执行时还实现以下步骤:按照深度优先的规则选择所述选择节点进行访问,并获取当前选择访问的选择节点对应的最长执行时间。
在一个实施例中,计算机程序被处理器执行时还实现以下步骤:按照广度优先的规则选择所述选择节点进行访问, 并获取当前选择访问的选择节点对应的最长执行时间。
在一个实施例中,计算机程序被处理器执行时还实现以下步骤:按照广度或随机优先的规则选择小于预设次序的所述选择节点进行访问,得到当前选择访问的选择节点对应的最长执行时间;按照深度优先的规则选择不小于预设次序的所述选择节点进行访问,得到当前选择访问的选择节点对应的最长执行时间。
在一个实施例中,计算机程序被处理器执行时还实现以下步骤:获取当前访问的选择节点对应的最短执行时间;若当前访问的选择节点对应的最短执行时间大于初始执行时间,则终止访问与当前访问的选择节点关联的选择节点;其中,初始执行时间为待调度指令列表中指令序列的执行时间。
在一个实施例中,计算机程序被处理器执行时还实现以下步骤:按照指令的预设优先级评估当前次序对应的所有选择节点,得到当前次序的各选择节点的评估结果,并根据所述评估结果确定当前次序对应的指令。
在一个实施例中,计算机程序被处理器执行时还实现以下步骤:根据当前选择节点的具体内容和/或类型设定各指令的优先级。
在一个实施例中,计算机程序被处理器执行时还实现以下步骤:根据当前次序所有的选择节点对应的最短执行时间的长短,确定当前次序对应的指令。
一般地,处理器在运行神经网络模型时,如运行Caffe网络模型时,每次均需要对该神经网络模型中的各个计算节点分别进行编译、解析,之后,按照该神经网络模型的结构形式按照一定的形式执行各个计算节点。其中,神经网络模型以及网络结构可以是已训练好或未训练好的人工神经网络模型数据。上述对神经网络的处理方法会影响处理器的处理速度,处理效率较低。
本申请实施例中还提供了一种离线模型的生成方法,该离线模型的生成方法可以在云端服务器或神经网络专用处理器上运行,并将其获得的原始网络的离线模型存储至存储器130中。该云端服务器或神经网络专用处理器为能够执行神经网络等重量级数据的处理器,其可以不包含于上述的计算机设备中。具体地,如图26所示,在上述步骤S110之前,上述方法包括如下步骤:
S010、获取原始网络的模型数据集及模型结构参数，具体地，可以通过云端服务器或神经网络专用处理器的获取模块获取原始网络的模型数据集及模型结构参数，通过该原始网络的模型数据集及模型结构参数可以获得该原始网络的网络结构图。其中，模型数据集包括原始网络中各个计算节点对应的模型参数等数据，图28所示的神经网络中的W1~W6即用于表示计算节点的模型参数。模型结构参数包括原始网络中多个计算节点的连接关系及各个计算节点的计算属性，其中，计算节点之间的连接关系用于表示计算节点之间是否有数据传递，例如，当多个计算节点之间具有数据流的传递时，则可以说明多个计算节点之间具有连接关系。进一步地，计算节点的连接关系可以包括输入关系和输出关系等等。如图28所示，计算节点F1输出作为计算节点F4和F5的输入，则可以说明计算节点F1和计算节点F4之间具有连接关系，计算节点F1和计算节点F5之间具有连接关系。再如，计算节点F1和计算节点F2之间没有数据传递，则可以说明计算节点F1和计算节点F2之间不存在连接关系。
各个计算节点的计算属性可以包括相应计算节点的计算类型及计算参数,其中计算节点的计算类型是指该计算节点用于完成何种计算,如计算节点的计算类型可以包括加法运算、减法运算及卷积运算等等,相应的,该计算节点可以是用于实现加法运算的计算节点、用于实现减法运算的计算节点或用于实现卷积运算的计算节点等等。计算节点的计算参数可以是完成该计算节点对应的计算类型所需的必要参数。例如,计算节点的计算类型可以是用于实现加法运算的计算节点,相应的,该计算节点的计算参数可以为加法运算中的加数,该加法运算中的被加数可以作为输入数据通过获取模块获取,或者,该加法运算中的被加数可以是该计算节点的上一计算节点的输出数据等等。
可选地,该原始网络可以为基于TensorFlow、MXNet、Caffe和PyTorch等深度学习系统,针对CPU、GPU或DSP等通用处理器建立的人工神经网络。该原始网络还可以是针对IPU等智能处理器建立的人工神经网络。例如,当该原始网络为基于Caffe建立的神经网络时,则可以获取该Caffe网络的模型数据集(caffemodel)及模型结构参数(prototxt)。其中,模型数据集(caffemodel)中包含该Caffe网络的模型参数等数据,模型结构参数(prototxt)中包含该Caffe网络的各个计算节点的计算属性以及多个计算节点之间的连接关系等。
S101、根据原始网络的模型数据集和模型结构参数运行原始网络,获得原始网络中各个计算节点对应的指令。具体地,云端服务器或神经网络专用处理器的运算模块可以根据原始网络的模型数据集和模型结构参数运行该原始网络,并获得原始网络中各个计算节点对应的指令。进一步地,云端服务器或神经网络专用处理器的获取模块还可以获取该原始网络的输入数据,云端服务器或神经网络专用处理器的运算模块可以根据原始网络的输入数据、网络模型数据集和模型结构参数运行原始网络,获得该原始网络中各个计算节点对应的指令。更进一步地,上述运行该原始网络获得各个计算节点的指令的过程实质上是编译的过程,该编译过程可以通过云端服务器或神经网络专用处理器或虚拟设备实现。即云端服务器或神经网络专用处理器或虚拟设备根据原始网络的模型数据集和模型结构参数运行原始网络。其中,虚拟设备指的是在存储器的内存空间中虚拟出一段处理器运行空间。
应当清楚的是,本实施例中的运行原始网络是指,云端服务器或神经网络专用处理器使用人工神经网络模型数据运行某种机器学习算法(如神经网络算法),通过执行前向运算实现算法的目标应用(如语音识别等人工智能应用)。
S103、根据原始网络的各个计算节点对应的模型参数及指令,生成原始网络对应的离线模型,并将所述原始网络对应的离线模型存储至非易失性存储器中。具体地,云端服务器或神经网络专用处理器的控制模块可以根据原始网络的各个计算节点对应的模型参数和指令,生成该原始网络对应的离线模型,例如,该云端服务器或神经网络专用处理器控制模块可以将原始网络的各个计算节点对应的模型参数和指令存储至非易失性的第二存储器中,以实现离线模型的生成及存储。其中,针对原始网络的每个计算节点,该计算节点的模型参数及指令一一对应进行存储。这样,当再次运行该原始网络时,可以直接从非易失性存储器中获取该原始网络对应的离线模型,并根据与其对应的离线模型运行原始网络,无需在线对该原始网络的各个计算节点进行编译获得指令,提高了系统的运行速度及效率。
应当清楚的是,本实施例中,直接运行该原始网络对应的离线模型是指,使用离线模型运行该原始网络对应的机器学习算法(如神经网络算法),通过执行前向运算实现算法的目标应用(如语音识别等人工智能应用)。
可选地，如图27所示，上述步骤S101可以包括：
S104、根据原始网络的模型结构参数,获得原始网络中各个计算节点的执行顺序。具体地,云端服务器或神经网络专用处理器的运算模块可以根据原始网络的模型结构参数,获得原始网络中各个计算节点的执行顺序,进一步地,云端服务器或神经网络专用处理器的运算模块可以根据原始网络中各个计算节点的连接关系,获得原始网络中各个计算节点的执行顺序。例如,如图28所示,计算节点F4的输入数据为计算节点F1的输出数据以及计算节点F2的输出数据,计算节点F6的输入数据为计算节点F4的输出数据以及计算节点F5的输出数据。因此,图28所示的神经网络中各个计算节点的执行顺序可以为F1-F2-F3-F4-F5-F6或F1-F3-F2-F5-F4-F6等等。当然,计算节点F1、F2和F3可以并行执行,计算节点F4和F5也可以并行执行,此处仅举例说明,并不具体限定其执行顺序。
S105、按照原始网络中各个计算节点的执行顺序运行原始网络,分别获得原始网络中各个计算节点对应的指令。具体地,云端服务器或神经网络专用处理器的运算模块可以根据原始网络中各个计算节点的执行顺序运行该原始网络,以获得原始网络中各个计算节点对应的指令,即云端服务器或神经网络专用处理器可以将原始网络的模型数据集等数据进行编译获得各个计算节点对应的指令,通过各个计算节点对应的指令可以获知该计算节点用于实现何种计算功能,即可以获得该计算节点的计算类型及计算参数等计算属性。
进一步地,如图27所示,上述步骤S103还包括:
S106、根据原始网络的模型数据集和模型结构参数,获得原始网络的内存分配方式。具体地,云端服务器或神经网络专用处理器的运算模块可以根据原始网络的模型数据集和模型结构参数,获得原始网络的内存分配方式。进一步地,云端服务器或神经网络专用处理器可以根据原始网络的模型结构参数,获得原始网络中各个计算节点的执行顺序,并根据原始网络中各个计算节点的执行顺序确定当前网络的内存分配方式。例如,按各个计算节点的执行顺序将各个计算节点在运行过程中的相关数据保存至一个栈内。其中,内存分配方式是指确定原始网络中各个计算节点相关的数据(包括输入数据、输出数据、模型参数及中间结果数据等等)在内存空间(如第一存储器)上的存储位置。例如,可以采用数据表存储各个计算节点相关的数据(输入数据、输出数据、模型参数及中间结果数据等等)和内存空间的映射关系。
S107、根据原始网络的内存分配方式,将原始网络运行过程中的相关数据存储至第一存储器中,其中,原始网络运 行过程中的相关数据包括原始网络的各个计算节点对应的模型参数、指令、输入数据、中间计算结果及输出数据等等。例如,如图28所示,X1和X2表示该神经网络的输入数据,Y表示该神经网络的输出数据,云端服务器或神经网络专用处理器可以将该神经网络的输出数据转换为控制机器人或不同数字接口的控制命令。W1~W6用于表示计算节点F1、F2和F3对应的模型参数,计算节点F1~F5的输出数据可以作为中间计算结果。云端服务器或神经网络专用处理器可以根据已确定的内存分配方式,将原始网络运行过程中的相关数据存储至第一存储器,如内存储器或缓存等易失性存储器,其具体的存储方式可参见图29中左半部分存储空间。
S108、从第一存储器中获取原始网络的各个计算节点对应的模型参数及指令,并将原始网络的各个计算节点对应的模型参数及指令存储于第二存储器中,生成离线模型。其中,第二存储器可以为外部存储器等非易失性存储器。该离线模型的生成过程具体可参见图29所示,图29中右半部分的存储空间内存储的即为原始网络的对应的离线模型。
如图28和图29所示,下面结合附图说明上述的离线模型生成过程:
首先，云端服务器或神经网络专用处理器可以获得该原始网络的模型数据集、模型结构参数以及输入数据，从而根据该原始网络的模型数据集和模型结构参数可以获得该原始网络的网络结构图，如图28所示。
其次,云端服务器或神经网络专用处理器可以根据原始网络的模型结构参数,获得原始网络各个计算节点的连接关系,并根据各个计算节点的连接关系获得原始网络中各个计算节点的执行顺序,以及原始网络在运行过程中的内存分配方式,从而可以获得原始网络在运行过程中相关数据的存储位置。如图29中左半部分存储空间所示,原始网络在运行过程中的相关数据可以按照各个计算节点执行顺序存储在一个栈中。
最后,云端服务器或神经网络专用处理器可以将原始网络的各个计算节点对应的模型参数及指令存储于非易失性的第二存储器中,生成离线模型,该离线模型的存储方式可参见图29中右半部分存储空间所示。并且,该离线模型仅仅包含运行该原始网络所必需的模型参数及指令等数据,而不需对原始网络运行过程中的输入数据、输出数据或中间计算结果等进行存储,从而可以减小第二存储器中的存储空间的消耗。
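下面用一个简化的Python片段示意上述离线模型的生成与存储过程（其中的网络结构、“指令”形式与文件格式均为假设，仅用于说明按计算节点一一对应地保存模型参数与指令、而不保存输入数据与中间计算结果的思路，并非本申请的正式实现）：
```python
import pickle

# 假设的原始网络：每个计算节点包含模型参数，编译后得到各节点对应的指令
original_network = {
    "F1": {"weights": [0.1, 0.2], "op": "conv"},
    "F2": {"weights": [0.3],      "op": "conv"},
    "F4": {"weights": [0.5, 0.6], "op": "add"},
}
node_interfaces = {"F1": {"inputs": ["X1"], "outputs": ["F4"]},
                   "F2": {"inputs": ["X2"], "outputs": ["F4"]},
                   "F4": {"inputs": ["F1", "F2"], "outputs": ["Y"]}}

def compile_node(name, node):
    """示意性的"编译"：为计算节点生成一条可被第二处理器执行的指令(此处仅用字符串表示)。"""
    return f"{node['op'].upper()} {name}"

# 离线模型中仅保存各计算节点的模型参数、指令以及接口数据，不保存输入/输出/中间结果
offline_model = {
    name: {"params": node["weights"],
           "instruction": compile_node(name, node),
           "interface": node_interfaces[name]}
    for name, node in original_network.items()
}

with open("offline_model.bin", "wb") as f:       # 存储至非易失性存储器(此处以文件示意)
    pickle.dump(offline_model, f)
```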
传统技术中,人工神经网络作为一种重量级的数据,其是由大量的节点(或称为神经元)之间相互连接构成。传统的计算机设备直接读取神经网络,并按照该神经网络的结构形式按照一定的方式依次执行该神经网络的各个计算节点,获得该神经网络的计算结果。即传统的计算设备直接对重量级的神经网络进行数据处理,这将影响计算机设备的数据处理速度及效率。并且,基于人工神经网络数据的特点,在某些只能处理轻量级数据的运行环境中,该人工神经网络数据将无法运行,这将限制神经网络的应用范围。
如图20所示,本申请一实施例提供了一种计算机设备,该计算机设备100可以包括硬件系统和软件系统,其中,硬件系统可以包括第一处理器110、第二处理器120和存储器130。如图21所示,该第一处理器110用于提供计算和控制能力,其可以包括第一获取模块111、第一运算模块113及第一控制模块112等等,该第一获取模块111可以是IO(Input输入/Output输出)接口等硬件模块,第一运算模块113及第一控制模块112均为硬件模块。例如,第一运算模块113及第一控制模块112可以为数字电路或模拟电路等等。上述硬件电路的物理实现包括但不限于物理器件,物理器件包括但不限于晶体管及忆阻器等等。该第二处理器120也可以用于提供计算和控制能力,其可以包括第二获取模块、第二运算模块及第二控制模块等等,该第二获取模块可以是IO(Input输入/Output输出)接口等硬件模块,第二运算模块及第二控制模块均为硬件模块。第二处理器120的各个结构的连接关系及构成可以与第一处理器中各个结构的连接关系及构成相同,具体可参见上文中的描述,此处不再赘述。可选地,第一处理器或第二处理器可以为CPU(Central Processing Unit,中央处理器)、GPU(Graphics Processing Unit,图形处理器)、DSP(Digital Signal Processing,数字信号处理)等通用处理器或IPU(Intelligence Processing Unit,智能处理器)等神经网络专用处理器。
如图20所示,存储器130用于存储有多个原始网络对应的离线模型及输入数据以及该计算机设备的软件系统。该计算机设备的软件系统可以包括操作系统、计算机程序、应用软件及运行时系统131等能够在第一处理器110或第二处理器120上运行的软件。进一步地,该存储器130还可以用于存储各个原始网络的输出数据(即各个原始网络的计算结果)。更进一步地,该存储器130可以包括用于存储离线模型的第一存储模块、用于存储输入数据的第二存储模块、用于存储输出数据的第三存储模块以及用于存储运行时系统的第四存储模块。或者,存储器130的数量可以为两个以上, 例如,存储器130的数量可以为两个,分别标记为第一存储器和第二存储器,其中,第一存储器用于存储原始网络对应的离线模型和输入数据,第二存储器用于存储运行时系统。可选地,该存储器130可以是非易失性存储器,如只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。
应当清楚的是，运行时是指一个程序在运行（或被执行）的状态，运行时表明了在某个时间段内，哪个程序正在运行。运行时系统是指进程级别的虚拟机，其用于表示一种程序的运行环境。具体地，运行时系统可以是通过计算机软件建立的一种软件系统，该软件系统可以在CPU（Central Processing Unit，中央处理器）、GPU（Graphics Processing Unit，图形处理器）、DSP（Digital Signal Processing，数字信号处理）或IPU（Intelligence Processing Unit，智能处理器）等处理器上运行，以实现特定的数据处理功能。本申请实施例中的运行时系统不同于计算机设备的操作系统，该计算机设备的软件系统可以同时包含上述的运行时系统和操作系统。
如图22所示,本申请实施例中的运行时系统131能够在第一处理器110上运行,该运行时系统131可以包括数据处理装置1310、设备管理装置1314及任务执行装置1315,数据处理装置1310及设备管理装置1314均可以连接至任务执行装置1315。具体地,当第一处理器110运行该运行时系统131时,运行时系统131能够控制第二处理器120运行神经网络等重量级数据,即运行时系统131能够控制第二处理器120根据神经网络的离线模型及输入数据进行计算,获得神经网络的输出数据。其中,数据处理装置1310用于从存储器130中获取当前原始网络对应的离线模型及其输入数据,当前原始网络的离线模型和当前网络的输入数据相对应设置。可选地,当前原始网络对应的离线模型中包含当前原始网络中各个计算节点对应的模型参数、指令以及当前原始网络中的各个计算节点的接口数据等必要的网络结构信息。由于当前原始网络的离线模型中未包含当前原始网络中各个计算节点的中间计算结果、输入数据及输出数据等相关数据,因此,该当前原始网络的离线模型的数据量级远远小于当前原始网络的数据量级,即可以认为该当前原始网络的离线模型为轻量级数据。
具体地,每个计算节点对应的指令可以用于表明该计算节点用于执行何种计算功能,其具体可以包括该原始网络中各个计算节点的计算属性。该当前原始网络的节点接口数据用于表示当前原始网络的各个计算节点的连接关系。具体地,当前原始网络的节点接口数据可以包括各个计算节点的输入数据来源和输出数据来源。例如,如图28所示,X1和X2为当前原始网络对应的输入数据,Y为当前原始网络对应的输出数据,W1~W6分别为当前原始网络中计算节点F1~F3对应的模型参数。当前原始网络的节点接口数据可以包括计算节点F1、F2和F3为起始计算节点,其输入分别为预设的输入数据,计算节点F1的输出数据作为计算节点F4和计算节点F5的输入数据等等。这样,在再次运行该原始网络时,只需获得该当前原始网络的离线模型和输入数据,即可通过运行该当前原始网络对应的离线模型实现该当前原始网络的运行过程。
设备管理装置1314作为第二处理器120的驱动装置,其可以用于控制第二处理器120启动或关闭。其中,当第二处理器120关闭时,第二处理器120不执行任何的任务,当第二处理器120启动时,第二处理器120可以执行计算或控制等任务。本申请实施例中,第二处理器120可以是神经网络加速器,其用于执行当前原始网络的离线模型。任务执行装置1315用于控制第二处理器120运行数据处理装置1310获取的当前原始网络的离线模型及输入数据,以获得当前原始网络的输出数据(即神经网络的计算结果)。应当清楚的是,运行原始网络对应的离线模型是指,使用离线模型运行该原始网络对应的机器学习算法(如神经网络算法),通过执行前向运算实现算法的目标应用(如语音识别等人工智能应用)。
具体地,当需要在该计算机设备100上运行神经网络等重量级数据时,可以在第一处理器110上运行上述的运行时系统131,以通过该运行时系统131控制第二处理器120运行该神经网络等数据。即,当需要在该计算机设备100上运行神经网络等重量级数据时,可以首先通过数据处理装置1310从存储器130中获取当前原始网络对应的离线模型及输入数据。当完成当前原始网络对应的离线模型及输入数据的加载后,设备管理装置1314可以控制第二处理器120启动。之后,任务执行装置1315可以控制第二处理器120运行当前原始网络的离线模型及输入数据,以实现该当前原始网络的运行过程,获得该当前原始网络的计算结果。
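为直观起见，下面给出一个与上述流程对应的Python伪框架示意（各装置与API的命名均为参照本实施例描述所作的假设，文件读写仅用于模拟存储器130，离线模型文件假设已按前文示例生成，并非本申请的正式实现）：
```python
import pickle

class RuntimeSystem:
    """示意性的运行时系统：数据处理装置 + 设备管理装置 + 任务执行装置。"""

    def __init__(self, memory_path):
        self.memory_path = memory_path           # 以目录示意存储器130
        self.device_on = False

    def load_offline_model(self, model_file):    # 数据处理装置：离线模型加载模块
        with open(f"{self.memory_path}/{model_file}", "rb") as f:
            return pickle.load(f)

    def start_device(self):                      # 设备管理装置：控制第二处理器启动
        self.device_on = True

    def run_offline_model(self, model, inputs):  # 任务执行装置：控制第二处理器运行
        assert self.device_on, "第二处理器尚未启动"
        # 以遍历各计算节点的指令来示意第二处理器根据离线模型与输入数据的前向运算
        return {name: (node["instruction"], inputs) for name, node in model.items()}

    def store_output(self, output, output_file): # 数据处理装置：输出数据存回存储器
        with open(f"{self.memory_path}/{output_file}", "wb") as f:
            pickle.dump(output, f)

# 典型调用顺序：加载离线模型与输入数据 -> 启动第二处理器 -> 运行 -> 存储输出
rt = RuntimeSystem(memory_path=".")
model = rt.load_offline_model("offline_model.bin")
inputs = {"X1": [1.0], "X2": [2.0]}
rt.start_device()
outputs = rt.run_offline_model(model, inputs)
rt.store_output(outputs, "output.bin")
```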
本申请实施例中,由于当前原始网络的离线模型中仅仅存储了当前原始网络中各个计算节点对应的模型参数、指令 以及当前原始网络中的各个计算节点的接口数据等必要的网络结构信息。因而该当前原始网络的离线模型的数据量级远远小于该当前原始网络的数据量级,从而通过运行当前原始网络的离线模型,使得计算机设备可以实现对神经网络等重量级数据的处理过程,拓展了神经网络的应用范围。同时,通过在该计算机设备上直接运行该原始网络对应的离线模型,无需对原始网络中的各个计算节点进行编译等处理操作,可以提高该计算机设备的处理速度及效率。
可选地,如图22所示,数据处理装置1310包括离线模型加载模块1311和输入数据加载模块1312。其中,离线模型加载模块1311用于从存储器130中获取当前原始网络对应的离线模型,并对其获取的当前原始网络的离线模型进行解析,以获得当前原始网络中各个计算节点对应的模型参数、指令以及当前原始网络中各个计算节点的接口数据。进一步地,离线模型加载模块1311对当前原始网络的离线模型进行解析的过程,还可以包括对当前原始网络对应的离线模型进行数据预处理(如数据格式转换、归一等预处理)的过程,以便第二处理器120能够执行该当前原始网络的离线模型。
输入数据加载模块1312用于从存储器130中获取输入数据,该输入数据可以是原始网络的起始计算节点对应的输入数据。如图28所示,X1和X2作为原始网络的起始计算节点的输入数据。进一步地,该输入数据可以通过应用软件获得,并存储于存储器130中。该应用软件可以在第一处理器或第二处理器上运行,例如,用户可以通过应用软件的交互界面设置当前原始网络的输入数据,运行时系统可以将该获取的当前原始网络的输入数据存储于存储器130中。
本申请实施例中,离线模型加载模块1311还可以用于实时获取离线模型的加载进度,输入数据加载模块1312还可以用于实时获取输入数据的加载进度。例如,当离线模型加载模块1311完成当前原始网络对应的离线模型的加载(例如,离线模型的数据加载比例为100%),且输入数据加载模块1312完成当前原始网络对应的输入数据的加载之后(例如,输入数据的加载比例为100%),离线模型加载模块1311和输入数据加载模块1312可以向设备管理装置1314发送数据加载完成信号,从而设备管理装置1314可以根据其接收到的数据加载完成信号控制第二处理器120启动。当第二处理器120启动后,设备管理装置1314可以向任务执行装置1315发送启动完成信号,任务执行装置1315可以根据其接收到的启动完成信号,控制第二处理器120运行当前原始网络的离线模型。
在其他实施例中，可以提前控制第二处理器启动，以便进一步提高计算机设备的数据处理速度及效率。并且，由于离线模型的数据量级大于输入数据的数据量级，离线模型所需的加载时间可能大于输入数据的加载时间，因此，若离线模型加载模块1311已完成的数据加载比例大于或等于第一预设比例（如80%）时，即可向设备管理装置1314发送加载完成信号，以提前启动第二处理器120。进一步地，若离线模型加载模块1311已完成的数据加载比例大于或等于第一预设比例（如80%），且输入数据加载模块1312已完成的数据加载比例大于或等于第二预设比例（如80%），则离线模型加载模块1311和输入数据加载模块1312可以向设备管理装置1314发送数据加载完成信号，从而设备管理装置1314可以根据其接收到的数据加载完成信号控制第二处理器120启动。
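上述根据加载进度提前启动第二处理器的判断逻辑，可以用如下Python片段示意（第一预设比例、第二预设比例与启动回调均为假设，并非本申请的正式实现）：
```python
FIRST_PRESET_RATIO = 0.8     # 第一预设比例(假设为80%)
SECOND_PRESET_RATIO = 0.8    # 第二预设比例(假设为80%)

def maybe_start_device(offline_model_progress, input_data_progress, start_device):
    """当离线模型与输入数据的加载进度分别达到预设比例时，提前启动第二处理器。"""
    if (offline_model_progress >= FIRST_PRESET_RATIO
            and input_data_progress >= SECOND_PRESET_RATIO):
        start_device()
        return True
    return False

started = maybe_start_device(0.85, 0.9, start_device=lambda: print("第二处理器启动"))
```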
可选地，如图23所示，数据处理装置1310还可以包括输入数据预处理模块1313，输入数据预处理模块1313用于对输入数据进行预处理（如数据格式转换、归一化等预处理），以使第二处理器120能够运行输入数据。此时，输入数据加载模块1312完成输入数据的加载后，输入数据加载模块1312可以向输入数据预处理模块1313发送输入数据加载完成信号，输入数据预处理模块1313可以根据其接收到的输入数据加载完成信号，对当前原始网络对应的输入数据进行归一化、格式转换等数据预处理操作。设备管理装置1314可以根据其接收到的离线模型加载模块1311传送的离线模型加载完成信号，以及输入数据预处理模块1313传送的预处理完成信号，控制第二处理器120启动。
同时,输入数据预处理模块1313还用于将第二处理器120获得的输出数据存储至存储器130,具体地,当第二处理器120完成当前原始网络的离线模型及输入数据的执行过程之后,该第二处理器120可以将当前原始网络的输出数据(即计算结果)传送至输入数据预处理模块1313,输入数据预处理模块1313可以对当前原始网络的输出数据进行数据格式转换等预处理,之后可以将该当前原始网络的输出数据存储至存储器130中。
在一个实施例中,该计算机设备100的软件系统还包括应用软件和操作系统(如安卓操作系统、微软操作系统、Linux操作系统等),应用软件能够在操作系统或上述的运行时系统上运行,操作系统及上述的运行时系统为各种应用软件提供了可执行环境。具体地,操作系统和应用软件也可以存储于存储器130中,该操作系统可以在第一处理器110或第二处理器120上运行。
该运行时系统131的各个装置可以提供应用软件能够调用的安全API(Application Programming Interface,应用软件接口),从而使得应用软件能够通过运行时系统131获取当前原始网络的离线模型及输入数据,并控制第二处理器120运行上述当前原始网络的离线模型,获得当前原始网络的输出数据。具体地,数据处理装置1310能够提供离线模型API及输入数据API,进一步地,离线模型加载模块1311能够提供离线模型API,输入数据加载模块1312能够提供输入数据API。当需要运行神经网络等重量级数据时,应用软件可以调用该数据处理装置1310的离线模型API,从而使得离线模型加载模块1311可以从存储器130中获取该当前原始网络对应的离线模型。当完成当前原始网络对应的离线模型的加载后,应用软件可以调用数据处理装置1310的输入数据API,从而可以使得输入数据加载模块1312可以从存储器130中获取当前原始网络对应的输入数据。进一步地,该当前原始网络的输入数据可以通过应用软件获得。例如,用户可以通过应用软件的交互显示界面手动设置当前原始网络对应的输入数据。当然,在其他实施例中,应用软件还可以同时调用上述的离线模型API和输入数据API,从而可以同时对当前原始网络的离线模型和输入数据进行加载,此处仅用于举例说明,并不用于对其具体地执行顺序进行限定。
进一步地,数据处理装置1310的输入数据预处理模块1313还能够提供数据预处理API。当完成当前原始网络的输入数据加载后,应用软件可以调用数据预处理API,从而使得数据预处理模块1313能够对当前原始网络的输入数据进行预处理,使得第二处理器能够运行上述的当前原始网络的输入数据。
设备管理装置1314能够提供第二处理器驱动API，任务执行装置1315能够提供第二处理器运行API。当完成当前原始网络的离线模型及输入数据的加载之后，应用软件可以通过调用该设备管理装置1314提供的第二处理器驱动API，启动第二处理器120。当第二处理器120启动后，应用软件可以调用任务执行装置1315提供的第二处理器运行API，以控制第二处理器120执行当前原始网络对应的离线模型及输入数据，获得当前原始网络的输出数据。当完成当前原始网络对应的离线模型的执行过程之后，应用软件可以通过调用该第二处理器驱动API，关闭第二处理器120。
更进一步地,在完成当前原始网络的离线模型的执行过程之后,应用软件还可以调用数据预处理API,使得输入数据预处理模块1313能够对当前原始网络的输出数据进行数据预处理,并将当前原始网络的输出数据存储至存储器130中。
再进一步地,第二处理器120的数量可以为多个,任务执行装置1315还可以能够提供任务分配API,任务执行装置1315可以用于控制多个第二处理器120,以实现多个第二处理器120之间的任务分配及调度。具体地,应用软件可以通过调用任务执行装置1315提供的任务分配API,从多个第二处理器120中选定执行当前任务的目标第二处理器。当完成当前原始网络的离线模型及输入数据的加载之后,应用软件可以通过调用该目标第二处理器对应的第二处理器驱动API,启动该目标第二处理器。当目标第二处理器启动后,应用软件可以调用任务执行装置1315提供的该目标第二处理器对应的第二处理器运行API,以控制该目标第二处理器执行当前原始网络对应的离线模型及输入数据。当完成当前原始网络对应的离线模型的执行过程之后,可以通过调用该目标第二处理器对应的第二处理器驱动API,关闭该目标第二处理器。
可选地,在其他实施例中,该第二处理器120可以为多核处理器,即该第二处理器120可以包括多个处理模块。任务执行装置1315可以用于控制多个第二处理器120的多个处理模块,以实现多个第二处理器120的多个处理模块之间的任务分配及调度。具体地,应用软件可以通过调用任务执行装置1315提供的任务分配API,从第二处理器120中的多个处理模块中选定执行当前任务的目标处理模块。当完成当前原始网络的离线模型及输入数据的加载之后,应用软件可以通过调用该目标处理模块对应的第二处理器驱动API,启动该目标处理模块。当目标处理模块启动后,应用软件可以调用该目标处理模块对应的第二处理器运行API,以控制该目标处理模块执行当前原始网络对应的离线模型及输入数据。当完成当前原始网络对应的离线模型的执行过程之后,可以通过调用目标处理模块对应的第二处理器驱动API,关闭该目标处理模块。
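上述通过任务分配API在多个第二处理器（或多个处理模块）之间选定执行当前任务的目标的过程，可以用如下Python片段示意（处理器名称与“最小负载优先”的分配策略均为假设，仅用于说明任务分配与调度的思路，并非本申请的正式实现）：
```python
class TaskDispatcher:
    """示意性的任务分配API：从多个第二处理器(或处理模块)中选定执行当前任务的目标。"""

    def __init__(self, devices):
        self.devices = devices                    # 例如 ["IPU0", "IPU1"]
        self.load = {d: 0 for d in devices}       # 各处理器当前负载(已分配任务数)
        self.assignment = {}                      # 记录各任务被分配到的目标处理器

    def allocate(self, task_name):
        target = min(self.devices, key=lambda d: self.load[d])   # 选负载最小者为目标
        self.load[target] += 1
        self.assignment[task_name] = target
        return target

dispatcher = TaskDispatcher(["IPU0", "IPU1"])
print(dispatcher.allocate("offline_model_A"))   # IPU0
print(dispatcher.allocate("offline_model_B"))   # IPU1
```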
作为进一步地改进,运行时系统131可以是基于可信运行环境建立的安全的运行时系统。例如,运行时系统131可以是基于TEE(Trusted Execution Environment,可信执行环境)建立的运行时系统。具体地,TEE可以构建一个隔离于操作系统等非安全软件系统的运行时系统,从而实现软件隔离,保障原始网络的离线模型及输入数据和输出数据的安全 性。上述的应用软件可以是TA等安全的应用,该TA等安全的应用软件可以运行于基于TEE构建的运行时系统。
存储器130的存储空间可以分为安全存储空间和非安全存储空间。具体地,用于存储当前原始网络的离线模型及输入数据的存储空间为安全的存储空间,用于存储操作系统及应用软件等软件系统的存储空间为非安全存储空间,运行时系统可以存储于存储器的安全存储空间或非安全存储空间。当然,该存储器130也可以为安全存储器。从而,上述的运行时系统、TA以及安全存储空间构成一个完整的TEE运行环境。
在其他实施例中,存储器130的数量可以为两个以上,其中一个存储器130可以作为安全存储空间,用于存储当前原始网络的离线模型及输入数据。其中一个存储器130可以作为非安全存储空间,用于存储操作系统及应用软件等软件系统。又进一步地,操作系统及应用软件等也可以存储于安全的存储空间中。
应当清楚的是,本申请实施例中的安全存储空间是指可信的(Trusted)存储空间,该安全存储空间可以是加密的存储空间,具体可以采用对称加密算法、非对称加密算法或随机加密算法(如采用随机密码生成器获得密码)。当然,安全的存储区间还可以是通过指纹等进行加密的存储空间。上述安全的运行时系统131以及应用软件也可以通过加密算法获得。或者,安全存储空间可以是通过可信度量方法获得的安全存储空间,上述安全的运行时系统131以及应用软件也可以通过可信度量方法获得。
当然,该第一处理器110还可以是安全芯片,如TPM(Trusted Platform Module,可信平台模块)、TCM(Trusted Cryptography Module,可信密码模块)或TPCM(Trusted Platform Control Module,可信平台控制模块)等。进一步地,第二处理器120也可以是TPM、TCM或TPCM等安全芯片。
可选地,本申请实施例的计算机设备还可以仅包括处理器和存储器,其中,该处理器是多核处理器。具体地,该处理器可以包括多个处理模块。例如,该处理器包括第一处理模块和第二处理模块,其中,运行时系统可以在第一处理模块上运行。进一步地,上述运行时系统可以包括数据处理装置、设备管理装置和任务执行装置等结构,其中,数据处理装置用于从存储器中获取当前原始网络对应的离线模型及输入数据,当前原始网络对应的离线模型中包含原始网络中各个计算节点对应的模型参数、指令以及原始网络中的各个计算节点的接口数据。设备管理装置用于控制第二处理模块启动或关闭,任务执行装置用于控制第二处理模块运行当前原始网络的离线模型及输入数据。更进一步地,该运行时系统的其他结构与上述实施例中的运行时系统的架构相同,具体可参见上文的描述,此处不再赘述。
如图24所示,本申请实施例还提供了一种数据处理方法,用于图20所示的计算机设备中,通过离线模型实现对神经网络等重量级数据的处理过程,提高了计算机设备的数据处理速度及效率。具体地,上述方法包括如下步骤:
S110、控制数据处理装置从存储器中获取当前原始网络对应的离线模型及输入数据,当前原始网络对应的离线模型中包含原始网络中各个计算节点对应的模型参数及指令。具体地,当第一处理器110运行该运行时系统131时,可以通过运行时系统131的数据处理装置1310从存储器中读取当前原始网络对应的离线模型及输入数据。进一步地,可以通过数据处理装置1310的离线模型加载模块1311从存储器130中获取当前原始网络对应的离线模型。通过输入数据加载模块1312从存储器130中获取输入数据,该输入数据可以是原始网络的起始计算节点对应的输入数据。
S120、通过设备管理装置控制计算机设备的第二处理器启动。具体地,可以通过运行时系统131的设备管理装置1314控制第二处理器启动或关闭。即,当离线模型加载模块1311完成当前原始网络对应的离线模型的加载,且输入数据加载模块1312完成当前原始网络对应的输入数据的加载之后,离线模型加载模块1311和输入数据加载模块1312可以向设备管理装置1314发送数据加载完成信号,从而设备管理装置1314可以根据其接收到的数据加载完成信号控制第二处理器120启动。
S130、通过任务执行装置控制计算机设备的第二处理器根据当前原始网络对应的离线模型及输入数据,运行当前原始网络,获得当前原始网络的输出数据。具体地,可以通过运行时系统131的任务执行装置1315控制第二处理器120运行当前原始网络的离线模型。应当清楚的是,运行原始网络对应的离线模型是指,使用离线模型运行该原始网络对应的机器学习算法(如神经网络算法),通过执行前向运算实现算法的目标应用(如语音识别等人工智能应用)。
S140、通过数据处理装置将当前原始网络的输出数据存储至存储器中。具体地,可以通过数据处理装置1310将当前原始网络的输出数据存储至存储器130中。进一步地,该数据处理装置1310能够对当前原始网络的输出数据进行数 据格式转换等预处理操作,之后,再将其存储至存储器130中。可选地,数据处理装置1310的输入数据处理模块1313能够对当前原始网络的输出数据进行数据格式转换等预处理操作,之后,再将其存储至存储器130中。
可选地,在完成当前原始网络对应的离线模型及输入数据的加载之后,还可以对获取的离线模型及输入数据进行预处理,以便第二处理器能够执行获取的离线模型及输入数据。具体地,上述步骤S110还可以包括如下步骤:
S111、对获取的当前原始网络对应的离线模型进行解析,以获得当前原始网络中各个计算节点对应的模型参数、指令及当前原始网络中各个计算节点的接口数据。进一步地,具体地,可以通过离线模型加载模块1311对获取的当前原始网络的离线模型进行解析,以获得当前原始网络中各个计算节点对应的模型参数、指令以及当前原始网络中各个计算节点的接口数据。更进一步地,还可以通过离线模型加载模块1311对解析后的数据进行数据格式转换、归一化等预处理操作。
S112、对获取的当前原始网络的输入数据进行预处理,如对输入数据进行数据格式转换、归一化等预处理操作。具体地,可以通过输入数据预处理模块1313对输入数据进行预处理(如数据格式转换、归一化等预处理),以使第二处理器120能够运行输入数据。
进一步地,上述方法还可以包括如下步骤:
实时获取当前原始网络对应的离线模型的加载进度;具体地,离线模型加载模块1311可以实时获取当前网络对应的离线模型的加载进度,该离线模型的加载进度可以采用数据比例或剩余时长等进行表示。
若当前原始网络对应的离线模型的加载进度大于或等于第一预设比例,则执行所述的控制计算机设备的第二处理器启动的步骤。具体地,该第一预设比例可以为80%~100%。例如,当离线模型加载模块1311完成当前原始网络对应的离线模型的加载(例如,离线模型的数据加载比例为100%),离线模型加载模块1311可以向设备管理装置1314发送数据加载完成信号,从而设备管理装置1314可以根据其接收到的数据加载完成信号控制第二处理器120启动。或者,若离线模型加载模块1311已完成的数据加载比例大于或等于第一预设比例(如80%)时,即可向设备管理装置1314发送加载完成信号,以提前启动第二处理器120。
由于离线模型的数据量级大于输入数据的数据量级,离线模型的所需的加载时间可能大于输入数据的加载时间,因此,可以仅仅依据离线模型的加载进度判断是否启动第二处理器120。进一步地,输入数据加载模块1312还可以实时获得输入数据的加载进度,若离线模型加载模块1311已完成的数据加载比例大于或等于第一预设比例(如80%),且输入数据加载模块1312已完成的数据加载比例大于或等于第二预设比例(如80%),则离线模型加载模块1311和输入数据加载模块1312可以向设备管理装置1314发送数据加载完成信号,从而设备管理装置1314可以根据其接收到的数据加载完成信号控制第二处理器120启动。
此外,如图25所示,本申请实施例还提供了一种数据处理方法,用于图20所示的计算机设备中,通过离线模型实现对神经网络等重量级数据的处理过程,提高了计算机设备的数据处理效率及速度。具体地,上述方法包括如下步骤:
S210、调用离线模型API,获取当前原始网络对应的离线模型,具体地,应用软件可以调用离线模型加载模块1311提供的离线模型API,从而使得离线模型加载模块1311能够从存储器130中读取当前原始网络对应的离线模型。其中,当前原始网络对应的离线模型中包含当前原始网络中各个计算节点对应的模型参数、指令以及当前原始网络中的各个计算节点的接口数据;其中,离线模型的生成过程可参见上文中的描述。
S220、调用输入数据API，获取当前原始网络的输入数据。具体地，应用软件可以调用输入数据加载模块1312提供的输入数据API，通过输入数据加载模块1312从存储器130中获取当前原始网络的输入数据。进一步地，应用软件还可以调用输入数据预处理模块1313提供的数据预处理API，通过输入数据预处理模块1313对输入数据加载模块1312获取的输入数据进行数据格式转换、归一化等预处理操作，以使第二处理器120能够运行上述的当前原始网络的输入数据。
S230、调用第二处理器驱动API，控制计算机设备中的第二处理器启动。具体地，应用软件能够调用设备管理模块1314提供的第二处理器驱动API，通过设备管理模块1314控制第二处理器120启动。
S240、调用第二处理器运行API，控制第二处理器根据当前原始网络对应的离线模型及输入数据，获得当前原始网络的输出数据。具体地，应用软件能够调用任务执行装置1315提供的第二处理器运行API，通过任务执行装置1315控制第二处理器120根据当前原始网络对应的离线模型及输入数据，获得当前原始网络的输出数据。
S250、调用第二处理器驱动API，控制第二处理器关闭。具体地，应用软件能够调用设备管理模块1314提供的第二处理器驱动API，通过设备管理模块1314控制第二处理器120关闭。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程，是可以通过计算机程序来指令相关的硬件来完成，该计算机程序可存储于一非易失性计算机可读取存储介质中，该计算机程序在执行时，可包括如上述各方法的实施例的流程。
此外,本申请实施例还提供了一种计算机可读存储介质,其上存储有计算机程序,计算机程序被一个或多个处理器执行时,实现上述的方法的步骤。该计算机存储介质可以包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。
上述的计算机设备、数据处理方法及存储介质,通过数据处理装置可以直接从存储器中获取当前原始网络对应的离线模型及输入数据,从而该计算机设备的第二处理器可以根据其获取的原始网络的离线模型及输入数据运行该当前原始网络,获得当前原始网络的输出数据。由于每个原始网络对应的离线模型中仅包含原始网络中各个计算节点对应的模型参数、指令以及原始网络中各个计算节点的接口数据,因而,原始网络的离线模型的数据量级远远小于该原始网络的数据量级,从而通过在计算机设备上运行该当前原始网络对应的离线模型,可以实现计算机设备对重量级的神经网络数据的处理过程。同时,通过在该计算机设备上直接运行该当前原始网络对应的离线模型,无需对当前原始网络中的各个计算节点进行编译等处理操作,可以提高该计算机设备的处理速度及效率。
本申请的其他实施例中，如图30所示，计算机设备200可以包括第一处理器210、第二处理器220、第一存储器230和第二存储器240，其中，第一存储器230内存储有多个原始网络对应的离线模型及输入数据和能够在第一处理器210上运行的运行时系统，第二存储器240内存储有能够在第一处理器或第二处理器上运行的操作系统。具体地，上述第一存储器230和第二存储器240可以是物理上相互独立的两个存储器。或者，第一存储器230和第二存储器240可以集成为一个整体，第一存储器230和第二存储器240为在逻辑上相互独立的两个存储空间。
进一步地,第一处理器210的数量可以为两个以上。例如,第一处理器210的数量为两个,其中一个第一处理器210用于运行上述安全的运行时系统231,另一个第一处理器210用于运行操作系统。或者,上述的第一处理器210可以是多核处理器,其可以包括两个以上的处理模块,其中一个处理模块可以用于运行上述的运行时系统231,其中一个处理模块用于运行上述的操作系统。这样,可以通过硬件上的隔离将计算机设备划分为安全运行环境和非安全运行环境。更进一步地,上述第一处理器210可以使用TCM、TPM或TPCM等安全芯片实现。
上述的运行时系统为基于可信运行环境建立的安全的运行时系统,例如,运行时系统231可以是基于TEE(Trusted Execution Environment,可信执行环境)建立的运行时系统。具体地,TEE可以构建一个隔离于操作系统等非安全软件系统的运行时系统,从而实现软件隔离,保障原始网络的离线模型及输入数据和输出数据的安全性。进一步地,该安全的运行时系统231可以通过加密算法获得,也可以通过可信度量获得。第一存储器230为安全存储介质。当运行时系统231在第一处理器210上运行时,运行时系统231能够从第一存储器230内获取当前原始网络对应的离线模型及输入数据,并控制第二处理器220运行当前原始网络对应的离线模型。
应当清楚的是,本申请实施例中的安全是指可信(Trusted),其可以采用预设的加密算法实现,例如,可以采用对称加密算法、非对称加密算法或随机加密算法(如采用随机密码生成器获得密码)。当然,还可以是通过指纹等进行加密。或者,安全也可以通过可信度量方法实现。
可选地,该运行时系统231可以提供应用软件能够调用的安全API(Application Programming Interface,应用软件接口),API主要包含了密钥管理、密码算法及安全存储等。上述运行时系统231可以包括数据处理装置、设备管理装置和 任务执行装置,其结构与上述的运行时系统131的结构类似,可参见图22和图23所示。其中,数据处理装置能够提供离线模型API及输入数据API,用于从第一存储器230中获取当前原始网络对应的离线模型及输入数据,当前原始网络对应的离线模型中包含原始网络中各个计算节点对应的模型参数、指令以及原始网络中的各个计算节点的接口数据。设备管理装置能够提供第二处理器驱动API,用于控制第二处理器220启动或关闭。任务执行装置能够提供第二处理器运行API,用于控制第二处理器220运行当前原始网络的离线模型及输入数据。
进一步地,数据处理装置包括离线模型加载模块和输入数据加载模块。离线模型加载模块能够提供离线模型API,用于从第一存储器230中获取各个当前原始网络对应的离线模型,并对当前原始网络对应的离线模型进行解析。输入数据加载模块能够提供输入数据API,用于从第一存储器230中获取当前原始网络对应的输入数据。
更进一步地,数据处理装置还包括输入数据预处理模块,输入数据预处理模块能够提供数据预处理API,用于对输入数据加载模块获取的输入数据进行预处理,使第二处理器220能够运行当前原始网络的输入数据,并用于将第二处理器220获得的输出数据存储至第一存储器230。
可选地,第二处理器220的数量为多个,或第二处理器220包括多个处理模块;任务执行装置还能够提供任务分配API,用于控制多个第二处理器220,或控制第二处理器220的多个处理模块。
进一步地,计算机设备还包括能够在运行时系统231上运行的安全的应用软件(TA,Trusted Application),且应用软件能够调用离线模型API及输入数据API、第二处理器驱动API,以及第二处理器运行API。该安全的应用软件可以通过加密算法实现,也可以通过可信度量的方式实现。
应当清楚的是,本申请实施例中的数据处理装置、设备管理装置以及任务执行装置的工作原理,与上述实施例中的各个装置的工作原理基本一致,具体可参见前文中的描述。
如图31所示,本申请实施例还提供了一种数据处理方法,用于如图30所示的计算机设备中,方法包括如下步骤:
S310、从第一存储器中获取当前原始网络对应的离线模型及输入数据,其中,当前原始网络对应的离线模型中包含当前原始网络中各个计算节点对应的模型参数、指令以及当前原始网络中的各个计算节点的接口数据。具体地,当第一处理器运行上述安全的运行时系统231时,安全的运行时系统231可以从安全的第一存储器230中获取当前原始网络对应的离线模型及输入数据。可选地,当第一处理器210运行该运行时系统231时,可以通过运行时系统231的数据处理装置从第一存储器230中读取当前原始网络对应的离线模型及输入数据。进一步地,可以通过数据处理装置的离线模型加载模块从第一存储器230中获取当前原始网络对应的离线模型。通过输入数据加载模块从第一存储器230中获取输入数据,该输入数据可以是原始网络的起始计算节点对应的输入数据。
S320、控制计算机设备的第二处理器启动。具体地,上述安全的运行时系统231可以控制计算机设备的第二处理器220启动。可选地,运行时系统231的设备管理装置可以控制第二处理器启动或关闭。当离线模型加载模块完成当前原始网络对应的离线模型的加载,离线模型加载模块可以向设备管理装置发送数据加载完成信号,从而设备管理装置可以根据其接收到的数据加载完成信号控制第二处理器220启动。
S330、控制计算机设备的第二处理器根据当前原始网络对应的离线模型及输入数据,运行当前原始网络,获得当前原始网络的输出数据。具体地,上述运行时系统231可以控制计算机设备的第二处理器220运行离线模型及其对应的输入数据,以获得当前原始网络的输出数据。可选地,可以通过运行时系统231的任务执行装置控制第二处理器220运行当前原始网络的离线模型。
应当清楚的是,运行原始网络对应的离线模型是指,使用离线模型运行该原始网络对应的机器学习算法(如神经网络算法),通过执行前向运算实现算法的目标应用(如语音识别等人工智能应用)。
S340、将当前原始网络的输出数据存储至第一存储器中。即运行时系统231能够将当前原始网络的输出数据存储至安全的第一存储器230中。可选地,可以通过运行时系统231的数据处理装置将当前原始网络的输出数据存储至第一存储器230中。进一步地,该数据处理装置能够对当前原始网络的输出数据进行数据格式转换等预处理操作,之后,再将其存储至第一存储器230中。更进一步地,数据处理装置的输入数据处理模块能够对当前原始网络的输出数据进行数据格式转换等预处理操作,之后,再将其存储至第一存储器230中。
如图32所示,本申请实施例还提供了一种数据处理方法,用于如图30所示的计算机设备中,上述方法可以包括如下步骤:
S410、调用离线模型API,从第一存储器中获取当前原始网络对应的离线模型。具体地,安全的应用软件(TA)可以调用离线模型API,从而使得离线模型加载模块能够从第一存储器230中读取当前原始网络对应的离线模型。其中,当前原始网络对应的离线模型中包含当前原始网络中各个计算节点对应的模型参数、指令以及当前原始网络中的各个计算节点的接口数据。
S420、调用输入数据API,获取当前原始网络的输入数据;具体地,安全的应用软件可以调用输入数据API,通过输入数据加载模块从第一存储器230中获取当前原始网络的输入数据。
S430、调用第二处理器驱动API,控制计算机设备中的第二处理器启动;具体地,安全的应用软件能够调用第二处理器驱动API,以通过设备管理模块控制第二处理器220启动。
S440、调用第二处理器运行API,控制第二处理器根据当前原始网络对应的离线模型及输入数据,获得当前原始网络的输出数据。具体地,安全的应用软件能够调用第二处理器运行API,以通过任务执行装置控制第二处理器220根据当前原始网络对应的离线模型及输入数据,获得当前原始网络的输出数据。
S450、调用第二处理器驱动API,控制第二处理器关闭。具体地,安全的应用软件能够调用第二处理器驱动API,以通过设备管理模块控制第二处理器220关闭。
进一步地,上述方法还包括如下步骤:
调用数据预处理API,将当前原始网络的输出数据存储至第一存储器中。具体地,安全的应用软件能够调用运行时系统231提供的数据预处理API,以通过数据处理装置的输入数据预处理模块对输出数据进行数据格式转换、归一化等预处理操作,并将当前原始网络的输出数据存储至第一存储器230中。
更进一步地,在调用输入数据API,获取当前原始网络的输入数据的步骤之后,上述方法还包括如下步骤:
调用数据预处理API,对获取的当前原始网络的输入数据进行预处理,使第二处理器能够运行输入数据。具体地,安全的应用软件还可以调用输入数据预处理模块提供的数据预处理API,以通过输入数据预处理模块对输入数据进行数据格式转换、归一化等预处理操作,以使第二处理器220能够运行上述的当前原始网络的输入数据。
可选地,本申请实施例中还可以包含离线模型的生成过程,该离线模型的生成过程可以在云端服务器或神经网络专用处理器上运行,并将其获得的原始网络的离线模型存储至第一存储器230中。该云端服务器或神经网络专用处理器为能够执行神经网络等重量级数据的处理器,其可以不包含于上述的计算机设备中。离线模型的生成过程具体可参见前文中的描述,此处不再赘述。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程，是可以通过计算机程序来指令相关的硬件来完成，该计算机程序可存储于一非易失性计算机可读取存储介质中，该计算机程序在执行时，可包括如上述各方法的实施例的流程。
此外,本申请实施例还提供了一种计算机可读存储介质,其上存储有计算机程序,计算机程序被一个或多个处理器执行时,实现上述方法的步骤。该计算机存储介质可以包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。
本申请实施例中,由于当前原始网络的离线模型中仅仅存储了当前原始网络中各个计算节点对应的模型参数、指令以及当前原始网络中的各个计算节点的接口数据等必要的网络结构信息。因而该当前原始网络的离线模型的数据量级远远小于该当前原始网络的数据量级,从而通过运行当前原始网络的离线模型,能够实现在基于TEE等可信执行环境建立的安全的运行时系统对神经网络等重量级数据的处理过程,拓展了神经网络的应用范围。同时,通过在该计算机设备 上直接运行该原始网络对应的离线模型,无需对原始网络中的各个计算节点进行编译等处理操作,可以提高该计算机设备的处理速度及效率。
以上所述实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对本申请专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请专利的保护范围应以所附权利要求为准。

Claims (15)

  1. 一种任务并行处理方法,其特征在于,包括:
    根据需执行任务之间的依赖关系,构建任务有向无环图DAG;
    根据所述任务有向无环图DAG,将各所述需执行任务分发至处理器的多个工作队列;
    根据所述任务有向无环图DAG中各所述需执行任务的依赖关系,调控各所述工作队列中并行的需执行任务开始运行。
  2. 根据权利要求1所述的方法,其特征在于,所述根据需执行任务之间的依赖关系,构建任务有向无环图DAG的步骤之前包括:
    根据程序中的操作节点和/或数据节点对程序进行拆分,获取所述需执行任务。
  3. 根据权利要求2所述的方法,其特征在于,所述根据程序中的操作节点对程序进行拆分,获取所述需执行任务的步骤包括:
    若所述程序包括带模型的操作请求,则对所述带模型的操作请求的模型进行拆分和/或对所述模型的输入数据进行拆分,获取需执行任务。
  4. 根据权利要求3所述的方法,其特征在于,所述对所述带模型的操作请求的模型进行拆分,获取需执行任务的步骤包括:
    设置拆分模型得到的各所述需执行任务对应的权值;
    使用各所述权值,设置所述需执行任务的输入数据与输出数据的对应关系。
  5. 根据权利要求3所述的方法,其特征在于,所述对所述带模型的操作请求的模型进行拆分,获取需执行任务的步骤包括:
    按照预设规则在模型的窗口方向和/或通道方向上拆分所述带模型的操作的模型,得到需执行任务。
  6. 根据权利要求3所述的方法,其特征在于,所述对所述带模型的操作请求的输入数据进行拆分,获取需执行任务的步骤包括:
    按照预设规则在数据的窗口方向拆分所述带模型的操作的输入数据,得到需执行任务。
  7. 根据权利要求2所述的方法,其特征在于,所述根据程序中的操作节点对程序进行拆分,获取所述需执行任务的步骤包括:
    若所述程序包括不带模型的操作请求,则对所述不带模型的操作请求的输入数据和/或输出数据进行拆分,获取需执行任务。
  8. 根据权利要求7所述的方法，其特征在于，所述对所述不带模型的操作请求的输入数据和/或输出数据进行拆分，获取需执行任务的步骤包括：
    按照预设规则在数据的窗口方向拆分所述输入数据和/或输出数据,得到需执行任务。
  9. 根据权利要求1所述的方法,其特征在于,所述根据需执行任务之间的依赖关系,构建任务有向无环图DAG的步骤包括:
    根据获取的各所述需执行任务之间的依赖关系,确定所述任务有向无环图DAG中的并行结点与顺序结点;
    根据所述并行结点与顺序结点构建任务有向无环图DAG。
  10. 根据权利要求1-9任一项所述的方法,其特征在于,所述根据所述任务有向无环图DAG将各所述需执行任务分发至所述处理器的多个工作队列的步骤包括:
    对所述任务有向无环图DAG进行拓扑排序,获取任务拓扑排序序列;
    根据各所述需执行任务的预设执行时间,对得到的所述拓扑排序序列进行排序,得到最长拓扑排序序列;
    根据所述最长拓扑排序序列以及各所述需执行任务之间的依赖关系,分发各所述需执行任务至所述工作队列。
  11. 根据权利要求1-9任一项所述的方法，其特征在于，所述根据所述任务有向无环图DAG中各所述需执行任务的依赖关系，调控各所述工作队列中并行的需执行任务开始运行的步骤包括：
    根据所述任务有向无环图DAG为各所述需执行任务设置引用计数;
    若被依赖的需执行任务已执行,则修改需依赖的需执行任务的引用计数;
    当所述需执行任务的引用计数达到预设值,控制各所述工作队列中引用计数达到预设值的需执行任务开始运行。
  12. 一种计算机可读存储介质,其上存储有计算机程序,其特征在于,该程序被处理器执行时实现权利要求1-11中任意一项所述方法的步骤。
  13. 一种任务并行处理系统,其特征在于,包括存储器、多核处理器,及存储在存储器上并可在处理器上运行的计算机程序,所述多核处理器能够运行拆分算法,其特征在于,所述多核处理器执行所述计算机程序时实现权利要求1-11中任一项所述方法的步骤。
  14. 一种任务并行处理系统,其特征在于,包括存储器、第一处理器和第二处理器,所述第一处理器能够运行拆分算法,第二处理器为多核处理器,其特征在于,所述第一处理器和第二处理器执行所述计算机程序时实现权利要求1-11中任一项所述方法的步骤。
  15. 一种任务并行处理装置,其特征在于,包括:DAG图构建模块、任务分发模块和调度控制模块,
    所述DAG图构建模块,用于根据需执行任务之间的依赖关系,构建任务有向无环图DAG;
    所述任务分发模块,用于根据所述任务有向无环图DAG,将各所述需执行任务分发至处理器的多个工作队列;
    所述调度控制模块,用于根据所述任务有向无环图DAG中各所述需执行任务的依赖关系,调控各所述工作队列中并行的需执行任务开始运行。
PCT/CN2018/108298 2017-11-20 2018-09-28 任务并行处理方法、装置、系统、存储介质及计算机设备 WO2019095873A1 (zh)

Priority Applications (8)

Application Number Priority Date Filing Date Title
EP18878728.7A EP3614260A4 (en) 2017-11-20 2018-09-28 METHOD, DEVICE AND SYSTEM FOR PARALLEL PROCESSING OF TASKS, STORAGE MEDIUM AND COMPUTER DEVICE
EP19210491.7A EP3651020A1 (en) 2017-11-20 2018-09-28 Computer equipment, data processing method, and storage medium
KR1020197037907A KR102569086B1 (ko) 2017-11-20 2018-09-28 태스크 병렬 처리 방법, 장치, 시스템, 기억 매체 및 컴퓨터 기기
JP2019568198A JP7074777B2 (ja) 2017-11-20 2018-09-28 タスク並列処理方法、装置、システム、記憶媒体およびコンピュータ機器
US16/575,344 US11221877B2 (en) 2017-11-20 2019-09-18 Task parallel processing method, apparatus and system, storage medium and computer device
US16/702,491 US11360811B2 (en) 2017-11-20 2019-12-03 Task parallel processing method, apparatus and system, storage medium and computer device
US16/702,502 US11113103B2 (en) 2017-11-20 2019-12-03 Task parallel processing method, apparatus and system, storage medium and computer device
US16/705,190 US11113104B2 (en) 2017-11-20 2019-12-05 Task parallel processing method, apparatus and system, storage medium and computer device

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
CN201711157341.X 2017-11-20
CN201711157341.XA CN109814986B (zh) 2017-11-20 2017-11-20 任务并行处理方法、存储介质、计算机设备、装置和系统
CN201711484410.8 2017-12-29
CN201711484410.8A CN109992307B (zh) 2017-12-29 2017-12-29 指令列表调度方法、装置、计算机设备及存储介质
CN201810083577.1A CN110097179B (zh) 2018-01-29 2018-01-29 计算机设备、数据处理方法及存储介质
CN201810083577.1 2018-01-29
CN201810084077.XA CN110097180B (zh) 2018-01-29 2018-01-29 计算机设备、数据处理方法及存储介质
CN201810084077.X 2018-01-29

Related Child Applications (4)

Application Number Title Priority Date Filing Date
US16/575,344 Continuation US11221877B2 (en) 2017-11-20 2019-09-18 Task parallel processing method, apparatus and system, storage medium and computer device
US16/702,491 Continuation US11360811B2 (en) 2017-11-20 2019-12-03 Task parallel processing method, apparatus and system, storage medium and computer device
US16/702,502 Continuation US11113103B2 (en) 2017-11-20 2019-12-03 Task parallel processing method, apparatus and system, storage medium and computer device
US16/705,190 Continuation US11113104B2 (en) 2017-11-20 2019-12-05 Task parallel processing method, apparatus and system, storage medium and computer device

Publications (1)

Publication Number Publication Date
WO2019095873A1 true WO2019095873A1 (zh) 2019-05-23

Family

ID=66540014

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/108298 2017-11-20 2018-09-28 Task parallel processing method, apparatus and system, storage medium and computer device WO2019095873A1 (zh)

Country Status (5)

Country Link
US (4) US11221877B2 (zh)
EP (2) EP3614260A4 (zh)
JP (1) JP7074777B2 (zh)
KR (1) KR102569086B1 (zh)
WO (1) WO2019095873A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111782426A (zh) * 2020-07-10 2020-10-16 上海淇毓信息科技有限公司 Method, apparatus and electronic device for processing client tasks
WO2020263587A1 (en) * 2019-06-26 2020-12-30 Amazon Technologies, Inc. Neural network operation reordering for parallel execution
CN114499958A (zh) * 2021-12-24 2022-05-13 东软睿驰汽车技术(沈阳)有限公司 Control method and apparatus, vehicle, and storage medium
EP4088259A4 (en) * 2020-01-07 2023-05-24 Argo AI, LLC METHOD AND SYSTEM FOR CONSTRUCTING STATIC ORIENTED ACYCLIC GRAPHS

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10853079B2 (en) * 2018-09-26 2020-12-01 Side Effects Software Inc. Dependency-based streamlined processing
US11175898B2 (en) * 2019-05-31 2021-11-16 Apple Inc. Compiling code for a machine learning model for execution on a specialized processor
KR102147912B1 (ko) * 2019-08-13 2020-08-25 삼성전자주식회사 Processor chip and control methods therefor
CN112465129B (zh) * 2019-09-09 2024-01-09 上海登临科技有限公司 On-chip heterogeneous artificial intelligence processor
CN112463709A (zh) * 2019-09-09 2021-03-09 上海登临科技有限公司 Configurable heterogeneous artificial intelligence processor
US20210232902A1 (en) * 2020-01-23 2021-07-29 Spero Devices, Inc. Data Flow Architecture for Processing with Memory Computation Modules
US20220036158A1 (en) * 2020-07-29 2022-02-03 Apple Inc. Task skew management for neural processor circuit
US11561826B1 (en) * 2020-11-12 2023-01-24 Xilinx, Inc. Scheduling processing of machine learning tasks on heterogeneous compute circuits
US11847490B2 (en) * 2021-02-18 2023-12-19 Dell Products L.P. Intelligent workload scheduling using a ranking of sequences of tasks of a workload
CN113283742A (zh) * 2021-05-21 2021-08-20 建信金融科技有限责任公司 Task allocation method and apparatus
US11537374B1 (en) * 2021-06-03 2022-12-27 Oracle International Corporation System and method for hot method call graph analysis
CN115686766A (zh) * 2021-07-28 2023-02-03 深圳富联富桂精密工业有限公司 Automated task scheduling method, electronic device and storage medium
CN113703775B (zh) * 2021-08-31 2023-11-28 上海阵量智能科技有限公司 Compilation method, apparatus, device and storage medium
CN114168275B (zh) * 2021-10-28 2022-10-18 厦门国际银行股份有限公司 Task scheduling method and system, terminal device and storage medium
CN114169427B (zh) * 2021-12-06 2022-10-04 北京百度网讯科技有限公司 End-to-end adaptation-based distributed training method, apparatus and device
US20230205592A1 (en) * 2021-12-23 2023-06-29 Intel Corporation Asymmetric tuning
CN115114028B (zh) * 2022-07-05 2023-04-28 南方电网科学研究院有限责任公司 Task allocation method and apparatus for secondary control in power simulation
KR102715702B1 (ko) * 2023-03-30 2024-10-11 주식회사 딥이티 Apparatus and method for computation and memory optimization of an artificial intelligence model
CN116339958B (zh) * 2023-05-30 2023-09-08 支付宝(杭州)信息技术有限公司 Task execution method, apparatus and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020144101A1 (en) * 2001-03-30 2002-10-03 Hong Wang Caching DAG traces
CN102012844A (zh) * 2010-11-29 2011-04-13 上海大学 Thread scheduling method for CMP systems
CN103077006A (zh) * 2012-12-27 2013-05-01 浙江工业大学 Multithreading-based parallel execution method for long transactions

Family Cites Families (60)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6243696B1 (en) * 1992-11-24 2001-06-05 Pavilion Technologies, Inc. Automated method for building a model
US5768594A (en) * 1995-07-14 1998-06-16 Lucent Technologies Inc. Methods and means for scheduling parallel processors
US5937037A (en) * 1998-01-28 1999-08-10 Broadpoint Communications, Inc. Communications system for delivering promotional messages
US7903806B1 (en) * 2000-01-05 2011-03-08 Canoga Perkins Corp. Expert call analyzer and next generation telephony network configuration system
US7117045B2 (en) * 2001-09-08 2006-10-03 Colorado State University Research Foundation Combined proportional plus integral (PI) and neural network (nN) controller
US7958507B2 (en) * 2005-06-16 2011-06-07 Hewlett-Packard Development Company, L.P. Job scheduling system and method
US8010954B2 (en) * 2007-02-14 2011-08-30 The Mathworks, Inc. Parallel programming interface to dynamically allocate program portions
US8255889B2 (en) * 2007-02-14 2012-08-28 The Mathworks, Inc. Method of using parallel processing constructs and dynamically allocating program portions
US8239844B2 (en) * 2007-02-14 2012-08-07 The Mathworks, Inc. Method of using parallel processing constructs and dynamically allocating program portions
JP5545288B2 (ja) * 2009-02-18 2014-07-09 日本電気株式会社 Task allocation device, task allocation method, and task allocation program
US8250576B2 (en) * 2009-09-30 2012-08-21 Microsoft Corporation Structured task hierarchy for a parallel runtime
US9262228B2 (en) * 2010-09-23 2016-02-16 Microsoft Technology Licensing, Llc Distributed workflow in loosely coupled computing
US9760348B2 (en) * 2010-11-29 2017-09-12 Microsoft Technology Licensing, Llc Verification of a dataflow representation of a program through static type-checking
US8792939B2 (en) * 2011-01-03 2014-07-29 Michelle Fisher Non-wireless bidirectional communication between a mobile device and associated secure element using an audio port
US9135065B1 (en) * 2011-08-31 2015-09-15 The Mathworks, Inc. Parallel processing of multidimensional arrays
US8966457B2 (en) * 2011-11-15 2015-02-24 Global Supercomputing Corporation Method and system for converting a single-threaded software program into an application-specific supercomputer
US9448867B2 (en) * 2011-12-31 2016-09-20 Intel Corporation Processor that detects when system management mode attempts to reach program code outside of protected space
US9122523B2 (en) * 2012-05-03 2015-09-01 Nec Laboratories America, Inc. Automatic pipelining framework for heterogeneous parallel computing systems
US9275355B2 (en) * 2012-09-24 2016-03-01 International Business Machines Corporation Business process model analyzer and runtime selector
US9332083B2 (en) * 2012-11-21 2016-05-03 International Business Machines Corporation High performance, distributed, shared, data grid for distributed Java virtual machine runtime artifacts
US20140282572A1 (en) * 2013-03-14 2014-09-18 Samsung Electronics Co., Ltd. Task scheduling with precedence relationships in multicore systems
US9146747B2 (en) * 2013-08-08 2015-09-29 Linear Algebra Technologies Limited Apparatus, systems, and methods for providing configurable computational imaging pipeline
US10089142B2 (en) * 2013-08-21 2018-10-02 Hasso-Plattner-Institut Fur Softwaresystemtechnik Gmbh Dynamic task prioritization for in-memory databases
US9304749B2 (en) * 2013-09-12 2016-04-05 Marvell World Trade Ltd. Method and system for instruction scheduling
US9747547B2 (en) * 2013-10-22 2017-08-29 In2H2 Hardware enhancements to radial basis function with restricted coulomb energy learning and/or k-Nearest Neighbor based neural network classifiers
US9576072B2 (en) * 2014-02-13 2017-02-21 Sap Se Database calculation using parallel-computation in a directed acyclic graph
US20150242741A1 (en) * 2014-02-21 2015-08-27 Qualcomm Incorporated In situ neural network co-processing
US9652286B2 (en) * 2014-03-21 2017-05-16 Oracle International Corporation Runtime handling of task dependencies using dependence graphs
CN104239137B (zh) 2014-08-21 2017-12-08 东软集团股份有限公司 Multi-model parallel scheduling method and apparatus based on optimal paths of DAG nodes
US9799088B2 (en) * 2014-08-21 2017-10-24 Qualcomm Incorporated Render target command reordering in graphics processing
CA2959627C (en) * 2014-09-02 2020-06-16 Ab Initio Technology Llc Executing graph-based program specifications
US9442760B2 (en) * 2014-10-03 2016-09-13 Microsoft Technology Licensing, Llc Job scheduling using expected server performance information
US10163420B2 (en) * 2014-10-10 2018-12-25 DimensionalMechanics, Inc. System, apparatus and methods for adaptive data transport and optimization of application execution
US10061577B2 (en) * 2014-10-14 2018-08-28 Electric Cloud, Inc. System and method for optimizing job scheduling within program builds
US11163092B2 (en) * 2014-12-18 2021-11-02 Exxonmobil Upstream Research Company Scalable scheduling of parallel iterative seismic jobs
CN106156810B (zh) 2015-04-26 2019-12-03 阿里巴巴集团控股有限公司 General machine learning algorithm model training method, system and computing node
US20160335119A1 (en) * 2015-05-12 2016-11-17 minds.ai inc Batch-based neural network system
US9690555B2 (en) * 2015-06-29 2017-06-27 International Business Machines Corporation Optimization of application workflow in mobile embedded devices
US10102391B2 (en) * 2015-08-07 2018-10-16 Qualcomm Incorporated Hardware enforced content protection for graphics processing units
EP3353718B1 (en) * 2015-10-28 2023-07-19 Google LLC Modifying computational graphs
US10268461B2 (en) * 2015-11-23 2019-04-23 International Business Machines Corporation Global data flow optimization for machine learning programs
US10331495B2 (en) * 2016-02-05 2019-06-25 Sas Institute Inc. Generation of directed acyclic graphs from task routines
US11144587B2 (en) * 2016-03-08 2021-10-12 Shutterstock, Inc. User drawing based image search
US10795725B2 (en) * 2016-03-24 2020-10-06 Fuji Xerox Co., Ltd. Image processing device, image processing method, and non-transitory computer readable medium for image processing
US20180018610A1 (en) * 2016-07-14 2018-01-18 Lendinghome Corp. Systems and methods for optimizing parallel task completion
AU2017321776A1 (en) * 2016-08-31 2019-03-07 Apple Inc. Systems and methods of swimming analysis
US10152349B1 (en) * 2016-09-27 2018-12-11 Juniper Networks, Inc. Kernel scheduling based on precedence constraints and/or artificial intelligence techniques
US11157814B2 (en) * 2016-11-15 2021-10-26 Google Llc Efficient convolutional neural networks and techniques to reduce associated computational costs
US10089567B2 (en) * 2016-12-15 2018-10-02 At&T Intellectual Property I, L.P. Method and apparatus for providing a communications service using a low powered radio tag
US10503775B1 (en) * 2016-12-28 2019-12-10 Shutterstock, Inc. Composition aware image querying
US11748625B2 (en) * 2016-12-30 2023-09-05 Intel Corporation Distributed convolution for neural networks
CN107103113B (zh) 2017-03-23 2019-01-11 中国科学院计算技术研究所 Automated design method, apparatus and optimization method for neural network processors
US10719760B2 (en) * 2017-04-09 2020-07-21 Intel Corporation Neural network scheduling mechanism
US10795836B2 (en) * 2017-04-17 2020-10-06 Microsoft Technology Licensing, Llc Data processing performance enhancement for neural networks using a virtualized data iterator
US10643297B2 (en) * 2017-05-05 2020-05-05 Intel Corporation Dynamic precision management for integer deep learning primitives
CN107341127B (zh) 2017-07-05 2020-04-14 西安电子科技大学 Convolutional neural network acceleration method based on the OpenCL standard
US10817310B2 (en) * 2017-09-01 2020-10-27 Ab Initio Technology Llc Executing graph-based program specifications
US10586052B1 (en) * 2017-10-04 2020-03-10 EMC IP Holding Company LLC Input/output (I/O) inspection methods and systems to detect and defend against cybersecurity threats
US11227214B2 (en) * 2017-11-14 2022-01-18 Advanced Micro Devices, Inc. Memory bandwidth reduction techniques for low power convolutional neural network inference applications
US10452843B2 (en) * 2018-01-11 2019-10-22 ArecaBay, Inc. Self-adaptive application programming interface level security monitoring

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020144101A1 (en) * 2001-03-30 2002-10-03 Hong Wang Caching DAG traces
CN102012844A (zh) * 2010-11-29 2011-04-13 上海大学 Thread scheduling method for CMP systems
CN103077006A (zh) * 2012-12-27 2013-05-01 浙江工业大学 Multithreading-based parallel execution method for long transactions

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020263587A1 (en) * 2019-06-26 2020-12-30 Amazon Technologies, Inc. Neural network operation reordering for parallel execution
US11016775B2 (en) 2019-06-26 2021-05-25 Amazon Technologies, Inc. Neural network operation reordering for parallel execution
US11567778B2 (en) 2019-06-26 2023-01-31 Amazon Technologies, Inc. Neural network operation reordering for parallel execution
EP4088259A4 (en) * 2020-01-07 2023-05-24 Argo AI, LLC METHOD AND SYSTEM FOR CONSTRUCTING STATIC ORIENTED ACYCLIC GRAPHS
US12013898B2 (en) 2020-01-07 2024-06-18 Ford Global Technologies, Llc Method and system for constructing static directed acyclic graphs
CN111782426A (zh) * 2020-07-10 2020-10-16 上海淇毓信息科技有限公司 Method, apparatus and electronic device for processing client tasks
CN111782426B (zh) * 2020-07-10 2023-09-22 上海淇毓信息科技有限公司 Method, apparatus and electronic device for processing client tasks
CN114499958A (zh) * 2021-12-24 2022-05-13 东软睿驰汽车技术(沈阳)有限公司 Control method and apparatus, vehicle, and storage medium
CN114499958B (zh) * 2021-12-24 2024-02-09 东软睿驰汽车技术(沈阳)有限公司 Control method and apparatus, vehicle, and storage medium

Also Published As

Publication number Publication date
EP3614260A1 (en) 2020-02-26
KR20200087078A (ko) 2020-07-20
JP7074777B2 (ja) 2022-05-24
US11221877B2 (en) 2022-01-11
EP3651020A1 (en) 2020-05-13
US20200104162A1 (en) 2020-04-02
US20200012521A1 (en) 2020-01-09
US20200125406A1 (en) 2020-04-23
EP3614260A4 (en) 2020-10-21
US11360811B2 (en) 2022-06-14
US11113103B2 (en) 2021-09-07
US11113104B2 (en) 2021-09-07
JP2020522824A (ja) 2020-07-30
KR102569086B1 (ko) 2023-08-22
US20200104722A1 (en) 2020-04-02

Similar Documents

Publication Publication Date Title
WO2019095873A1 (zh) Task parallel processing method, apparatus and system, storage medium and computer device
Huang et al. Taskflow: A lightweight parallel and heterogeneous task graph computing system
CN109814986B (zh) Task parallel processing method, storage medium, computer device, apparatus and system
US11609792B2 (en) Maximizing resource utilization of neural network computing system
Warneke et al. Exploiting dynamic resource allocation for efficient parallel data processing in the cloud
JP5705338B2 (ja) System and method for processing machine learning algorithms in a MapReduce environment
CN112711478B (zh) Neural network-based task processing method, apparatus, server and storage medium
CN111522640A (zh) Parallel execution method and device for computational graphs
US11694075B2 (en) Partitioning control dependency edge in computation graph
KR20210148586A (ko) Scheduler, operating method of the scheduler, and accelerator system including the same
CN109213587B (zh) Multi-stream parallel DAG task mapping strategy on GPU platforms
Forsberg et al. HePREM: A predictable execution model for GPU-based heterogeneous SoCs
Knorr et al. Declarative data flow in a graph-based distributed memory runtime system
Faverge et al. Programming heterogeneous architectures using hierarchical tasks
Müller et al. He.. ro db: A concept for parallel data processing on heterogeneous hardware
Huang et al. Concurrent CPU-GPU Task Programming using Modern C++
Sun et al. Edge Generation Scheduling for DAG Tasks using Deep Reinforcement Learning
Yi et al. Harnessing parallelism in multicore clusters with the all-pairs and wavefront abstractions
TW202333052A (zh) Compute-intensive kernel generator, micro-kernel code cache, fused kernel generator and cycle-free dependency graph partitioning for deep learning workloads
Ghose et al. A framework for OpenCL task scheduling on heterogeneous multicores
Kumar Scheduling of dense linear algebra kernels on heterogeneous resources
Bosch et al. Asynchronous runtime with distributed manager for task-based programming models
CN110415162B (zh) Adaptive graph partitioning method for heterogeneous fusion processors in big data
Lucas On the use of hierarchical task for heterogeneous architectures
Walter et al. Real-time Scheduling of I/O Transfers for Massively Parallel Processor Arrays

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18878728

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2018878728

Country of ref document: EP

Effective date: 20191119

ENP Entry into the national phase

Ref document number: 2019568198

Country of ref document: JP

Kind code of ref document: A