WO2020073742A1 - Neural network-based task processing method and related device - Google Patents

Neural network-based task processing method and related device

Info

Publication number
WO2020073742A1
Authority
WO
WIPO (PCT)
Prior art keywords: module, network, thread, data, input data
Application number
PCT/CN2019/102139
Other languages
English (en)
French (fr)
Inventor
熊祎
易松松
Original Assignee
广州市百果园信息技术有限公司
Application filed by 广州市百果园信息技术有限公司 (Guangzhou Baiguoyuan Information Technology Co., Ltd.)
Priority to US 17/284,201 (published as US20210357759A1)
Priority to SG 11202103656SA
Priority to RU 2021112964 (granted as RU2771008C1)
Publication of WO2020073742A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/045 - Combinations of networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation using electronic means
    • G06N 3/08 - Learning methods
    • G06N 3/082 - Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/48 - Program initiating; program switching, e.g. by interrupt
    • G06F 9/4806 - Task transfer initiation or dispatching
    • G06F 9/4843 - Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/485 - Task life-cycle, e.g. stopping, restarting, resuming execution
    • G06F 9/4881 - Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Definitions

  • This application relates to the field of computer network technology, for example, to a neural-network-based task processing method and related device.
  • In the related art, multi-core processor capability is often used to accelerate the computation of each layer in a neural network; that is, the computing task of each network layer is distributed to multiple processor cores so that the cores jointly complete the computation of the same layer.
  • However, the time spent distributing the computing tasks to multiple cores and collecting the calculation results from them may exceed the time of the computation itself; for example, the overhead caused by scheduling across multiple cores may be higher than the cost of the computation itself, degrading the acceleration ratio.
  • Embodiments of the present application provide a neural-network-based task processing method and related device, to improve the acceleration ratio and avoid the low computational efficiency of neural-network-related applications on multi-core processors in the related art.
  • In a first aspect, an embodiment of the present application provides a neural-network-based task processing method, including: acquiring input data, where the input data is used to trigger thread tasks and is source input data or cache exchange data; according to at least two triggered thread tasks, scheduling the corresponding at least two module threads in parallel to process the input data and produce processing result data, where the at least two module threads respectively correspond to at least two network modules divided according to the network layers of the neural network; and outputting the processing result data to a cache as cache exchange data for a module thread other than the at least two module threads, or outputting the processing result data as the processing result of the source input data.
  • In a second aspect, an embodiment of the present application further provides a neural-network-based task processing device, including: an input data acquisition module configured to acquire input data, where the input data is used to trigger thread tasks and is source input data or cache exchange data; a module thread scheduling module configured to schedule the corresponding at least two module threads in parallel according to at least two triggered thread tasks, to process the input data and produce processing result data, where the at least two module threads respectively correspond to at least two network modules divided according to the network layers of the neural network; and a processing result data output module configured to output the processing result data to a cache as cache exchange data for a module thread other than the at least two module threads, or to output the processing result data as the processing result of the source input data.
  • In a third aspect, an embodiment of the present application further provides a device, including a processor and a memory; at least one instruction is stored in the memory, and the instruction is executed by the processor so that the device performs the neural-network-based task processing method of the first aspect.
  • FIG. 1 is a schematic flowchart of the steps of an embodiment of a neural-network-based task processing method of the present application;
  • FIG. 2 is a schematic diagram of two module threads executing the tasks of two network modules in parallel in an embodiment of the present application;
  • FIG. 3 is a schematic diagram of data exchange between two module threads in an embodiment of the present application;
  • FIG. 4 is a schematic diagram of data exchange among a start module thread, an intermediate module thread, and an end module thread in an embodiment of the present application;
  • FIG. 5 is a schematic structural block diagram of an embodiment of a neural-network-based task processing device in an embodiment of the present application;
  • FIG. 6 is a schematic structural block diagram of a device in an embodiment of the present application.
  • Computation in a neural network executes sequentially: data input to the neural network is processed by the network's layers in turn to obtain the final output.
  • Related technologies often use multi-core processor capability to accelerate the task processing of neural-network-related applications. For example, for applications that need not execute synchronously, such as face recognition in the cloud, many similar unrelated tasks are usually distributed to different processor cores by batch processing; when the per-task times are close, the theoretical degree of parallelism is obtained. This solution, however, does not support applications that require synchronization; that is, applications requiring real-time echo cannot use it.
  • For applications requiring synchronization, each layer is instead accelerated at the granularity of a single layer, for example using the multi-core-optimized layer algorithms contained in the acceleration package nnPack or the open-source matrix calculation library OpenBlas.
  • The only parallelism then lies within a single layer of the neural network, so this solution is effective when a single layer is expensive but cannot effectively use multiple cores for acceleration when per-layer time is small.
  • For mobile or real-time neural networks, which structurally have many network layers and few channels per layer, per-layer time is generally below 0.5 milliseconds.
  • an embodiment of the present application proposes a new task processing method based on a neural network.
  • In the embodiments of the present application, at least two module threads corresponding to the network modules of the neural network can be scheduled in parallel to process input data, greatly masking the time consumed by neural-network computation and increasing throughput; that is, multi-core processor performance is fully used to accelerate neural-network-related applications, enabling the applications to echo in real time.
  • the network module is pre-divided according to the network layer in the neural network, and one network module may include at least one network layer, which is not limited in the embodiment of the present application.
  • the network layer in the neural network may be divided in advance to obtain at least two network modules.
  • Each network module may include at least one network layer.
  • At the start of operation, a thread pool containing at least two threads may be created based on the network modules obtained after the neural network is divided, and each thread in the thread pool may correspond to the execution of a unique network module.
  • A thread in the thread pool corresponding to a network module may be referred to as a module thread; the tasks required by each network module of the neural network are executed separately by scheduling the module threads in parallel, improving the acceleration ratio and thereby the running efficiency of neural-network-related applications on multi-core processors. A sketch of this arrangement follows below.
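  • The arrangement can be sketched as follows. This is an illustrative sketch only: the ModuleThread class, the queue-based hand-off, and the stand-in modules are assumptions, not the patent's prescribed implementation (the patent's own exchange mechanism is the ping-pong cache described later).

```python
import queue
import threading

class ModuleThread(threading.Thread):
    """A thread bound to exactly one network module of the divided network."""

    def __init__(self, module, in_queue, out_queue):
        super().__init__(daemon=True)
        self.module = module        # callable: input data -> processing result data
        self.in_queue = in_queue    # source input data or exchanged cache data
        self.out_queue = out_queue  # feeds the next module thread or the consumer

    def run(self):
        while True:
            data = self.in_queue.get()   # blocks until input data triggers the task
            if data is None:             # sentinel: shut the pipeline down
                self.out_queue.put(None)
                return
            self.out_queue.put(self.module(data))

def build_pipeline(modules):
    """Create one module thread per network module, chained by queues."""
    queues = [queue.Queue(maxsize=1) for _ in range(len(modules) + 1)]
    for i, module in enumerate(modules):
        ModuleThread(module, queues[i], queues[i + 1]).start()
    return queues[0], queues[-1]  # source-input end, processing-result end

# Two stand-in modules (e.g. CNN1 and CNN2) process consecutive frames in parallel.
if __name__ == "__main__":
    src, dst = build_pipeline([lambda x: x + 1, lambda x: x * 2])
    for frame in range(3):
        src.put(frame)
    src.put(None)
    while (result := dst.get()) is not None:
        print(result)  # 2, 4, 6
```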
  • Referring to FIG. 1, a schematic flowchart of the steps of an embodiment of a neural-network-based task processing method of the present application is shown, including steps 110 to 130.
  • In step 110, input data is obtained, where the input data is used to trigger a thread task, and the input data is source input data or cache exchange data.
  • When task processing is required, input data can be obtained for the task that currently needs to be executed, so that a corresponding thread task is triggered for processing according to the currently acquired input data.
  • The currently acquired input data is used to trigger thread tasks and includes data related to task processing, such as source input data and cache exchange data, which is not limited in the embodiments of the present application.
  • source input data in the embodiments of the present application may refer to source data required for task processing, such as image frame data required for image recognition tasks.
  • the cache exchange data may refer to cache data exchanged between different threads, and the cache data may be stored in the cache space.
  • In step 120, according to at least two triggered thread tasks, the corresponding at least two module threads are scheduled in parallel to process the input data and produce processing result data, where the at least two module threads respectively correspond to at least two network modules divided according to the network layers of the neural network.
  • After the input data is acquired, at least two module threads may be scheduled in parallel from the thread pool according to the at least two thread tasks triggered by the input data, so that the scheduled module threads process the input data and produce processing result data.
  • the processing result data may refer to the result data obtained after the module thread performs task processing based on the corresponding network module.
  • For example, after a first module thread and a second module thread are scheduled in parallel and process the input data in parallel, the generated processing result data may include first processing result data and second processing result data produced by the two module threads. The first processing result data may refer to the result data produced after the first module thread performs task processing based on its corresponding first network module; the second processing result data may refer to the result data produced after the second module thread performs task processing based on its corresponding second network module.
  • The first module thread can be bound to the first network module to correspond to the execution of the first network module; similarly, the second module thread can be bound to the second network module to correspond to the execution of the second network module.
  • the generated processing result data may include the result data obtained by processing the scheduled at least three module threads, which is not limited in the embodiment of the present application.
  • the generated processing result data may include the first processing result data generated by the first module thread, the second processing result data generated by the second module thread, and the third processing result data generated by the third module thread.
  • In step 130, the processing result data is output to the cache as cache exchange data for a module thread other than the at least two module threads, or the processing result data is output as the processing result of the source input data.
  • After a module thread produces processing result data, the data may be output according to the network module corresponding to that module thread.
  • Illustratively, when the network module corresponding to the module thread is the start network module or an intermediate network module of the neural network, the processing result data produced by the module thread may be output to the cache as cache exchange data for a module thread other than the at least two module threads; that is, the processing result data serves as the input data of the next module thread, so that the next module thread can perform the next step of task processing based on it, such as executing the task of the next network module in the neural network. When the network module corresponding to the module thread is the end network module, the processing result data is determined to be the processing result of the source input data and is output.
  • processing result of the source input data in the embodiment of the present application can be used to characterize the output result after the neural network processes the source input data.
  • The start network module in the embodiments of the present application may refer to the first network module obtained after the neural network is divided; it is configured to receive the source input data transmitted to the neural network and to execute the tasks required by the network layers contained in the start network module.
  • The start network module includes the input network layer of the neural network used to receive the source input data (i.e., the input layer of the neural network), and may further include at least one other network layer of the neural network, which is not limited in the embodiments of the present application.
  • The end network module in the embodiments of the present application may refer to the last network module obtained after the neural network is divided; it is configured to output the processing result of the source input data and to execute the tasks required by the network layers contained in the end network module.
  • The end network module includes the network layer of the neural network used to output the processing result (i.e., the output layer of the neural network), and may further include at least one other network layer of the neural network, which is likewise not limited.
  • Correspondingly, the intermediate network module in the embodiments of the present application may include at least one intermediate network layer of the neural network, where an intermediate network layer refers to a network layer other than the first and last network layers of the neural network.
  • For example, where the neural network includes five network layers, the intermediate network module may include the second, third, and fourth network layers connected in sequence; or the third and fourth network layers; or only the third network layer.
  • Illustratively, where the start network module includes the first network layer and the end network module includes the fifth network layer, the intermediate network module may include the second, third, and fourth network layers; where the start network module includes the first and second network layers connected in sequence and the end network module includes the fifth network layer, the intermediate network module may include the third and fourth network layers; where the start network module includes the first and second network layers connected in sequence and the end network module includes the fourth and fifth network layers, the intermediate network module may include only the third network layer, and so on.
  • In summary, after acquiring input data used to trigger thread tasks, the embodiments of the present application can schedule the corresponding at least two module threads in parallel according to the at least two triggered thread tasks to process the input data, the scheduled module threads respectively corresponding to at least two network modules divided according to the network layers of the neural network. The tasks of different network modules of the neural network are thus distributed to different module threads for parallel execution, greatly masking the time consumed by neural-network computation and improving the running efficiency of neural-network-related applications on multi-core processors; that is, the low computational efficiency of such applications on multi-core processors in the related art is avoided, enabling real-time applications to make full use of multi-core computing capability for acceleration, with a wide range of applications.
  • The task processing method provided in the embodiments of the present application can be used as an engineering optimization method; in the form of a library or source code, it can serve as a dependent component of any neural-network-related application and be applied to various applications deployed on the basis of neural networks.
  • This allows devices equipped with multiple processor cores to take full advantage of multi-core performance, accelerate neural-network-related applications, and meet real-time requirements.
  • For example, the relevant multi-thread logic can be written according to the method provided in the embodiments of the present application and then integrated into the application in library or source-code form, so that devices such as mobile phones can fully exploit multi-core performance to accelerate when processing the application's computing tasks, achieving full utilization of hardware resources.
  • The embodiments of the present application may divide the neural network in advance so that the network layers it contains are divided into N network modules, and the N module threads corresponding to the N network modules respectively process the tasks to be performed by the network layers, executing the tasks of the individual network modules of the neural network in parallel.
  • N is an integer greater than 1 and characterizes the number of network modules obtained after the neural network is divided. Therefore, in an embodiment of the present application, before acquiring the input data, the method may further include: dividing the network layers in the neural network to obtain at least two network modules.
  • Each network module obtained after the division may include at least one network layer of the neural network, for example at least one convolutional layer of a convolutional neural network, which is not limited in the embodiments of the present application.
  • Embodiments of the present application may select a network layer with one input and one output in the neural network as the connection point between network modules.
  • The above division of the network layers in the neural network to obtain at least two network modules includes: separately determining the number of channels between every two adjacent network layers in the neural network; when the number of channels between adjacent network layers is one, dividing the preceding network layer of the adjacent pair into the input layer of the input network module and the succeeding network layer into the output layer of the output network module; and generating at least two network modules based on the output layer of the output network module and the input layer of the input network module.
  • In actual processing, the topology of the neural network can be determined, and the network layers it contains can be identified, such as the input layer, convolutional layers, and sampling layers contained in a convolutional neural network (CNN); the number of channels between the network layers, for example between every two adjacent network layers, can also be determined, and the network layers can then be divided into at least two network modules based on the number of channels between every two adjacent layers.
  • Whether the number of channels between two adjacent network layers is one may be used to decide whether the two adjacent layers are divided into different network modules. When the number of channels between adjacent network layers is one, the preceding layer outputs data to the succeeding layer through a single channel; the preceding layer of the adjacent pair can then be divided into the input layer of the input network module and the succeeding layer into the output layer of the output network module, and at least two network modules can be determined on that basis. The data output by the preceding layer serves as the input data of the succeeding layer; that is, the succeeding layer of the adjacent pair obtains, through one channel, the input data output by the preceding layer. A sketch of this division rule follows below.
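  • The division rule can be sketched as follows, assuming the network is represented as a simple layer sequence annotated with inter-layer channel counts (this list-based representation is an illustrative assumption):

```python
def split_into_modules(layers, channel_counts):
    """Split a layer sequence into network modules.

    layers[i] and layers[i + 1] are adjacent; channel_counts[i] is the number
    of channels between them.  A cut is made only where exactly one channel
    connects the two layers, so each module boundary is a single-tensor
    hand-off between one module's last layer and the next module's first
    layer.  (In practice only enough boundaries to balance module times
    would be used; see the balancing sketch below.)
    """
    modules, current = [], [layers[0]]
    for i in range(len(layers) - 1):
        if channel_counts[i] == 1:
            modules.append(current)  # close the module at this boundary
            current = []
        current.append(layers[i + 1])
    modules.append(current)
    return modules

# A 5-layer network whose only single-channel links are after layers 2 and 4:
print(split_into_modules(["L1", "L2", "L3", "L4", "L5"], [2, 1, 2, 1]))
# [['L1', 'L2'], ['L3', 'L4'], ['L5']]
```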
  • In an embodiment, the difference in processing time between the network modules obtained after the neural network is divided is less than a set threshold.
  • That is, the embodiments of the present application may divide the neural network based on the time required for the task processing of each network module, so that the difference in processing time between the divided network modules is below the set threshold, achieving a better trade-off between resource overhead and parallel performance.
  • Where the neural-network application itself has pre- and post-processing parts, each part may, depending on how long it takes, participate in the division as a module or as part of a module; that is, based on the processing time of the parts before and after the network itself, the preceding part may join the division of network modules as a module or part of a module, and likewise the following part.
  • In this way the time consumed by the parts before and after the network itself can be masked, and the overall per-stage time is equivalent to the processing time of the most time-consuming of the divided network modules. One possible balancing heuristic is sketched below.
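  • One possible way to realize the balance constraint, assuming per-layer times have been measured and treating pre-/post-processing as extra entries at the ends of the list (a greedy heuristic sketch, not the patent's prescribed algorithm):

```python
def group_by_time(layer_times, n_modules):
    """Greedily group consecutive layers into n_modules so that each module's
    total time approaches sum(layer_times) / n_modules.  Pre- and
    post-processing participate simply by being prepended/appended to
    layer_times."""
    target = sum(layer_times) / n_modules
    groups, current, acc = [], [], 0.0
    for i, t in enumerate(layer_times):
        current.append(i)
        acc += t
        # Close the group once the target is reached, keeping enough layers
        # for the remaining modules.
        if (acc >= target
                and len(groups) < n_modules - 1
                and len(layer_times) - i - 1 >= n_modules - len(groups) - 1):
            groups.append(current)
            current, acc = [], 0.0
    groups.append(current)
    return groups

# Pre-processing (0.3 ms), five layers, post-processing (0.2 ms), two modules:
print(group_by_time([0.3, 0.4, 0.5, 0.2, 0.3, 0.4, 0.2], 2))
# [[0, 1, 2], [3, 4, 5, 6]]  -- 1.2 ms vs 1.1 ms
```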
  • the CNN can be split into N network modules in an offline state or an online state, and the time consumption between the network modules can be approximately equal.
  • the tasks of each network module can be assigned to different module threads for execution.
  • For a single input, the N module threads corresponding to the N network modules obtained from the neural network still process it in order, and the total time is theoretically unchanged; but viewed from the CNN application as a whole, the number of image frames processed per unit time increases by a factor of N. For example, when N is 2, pre-processing can be assigned to the module thread corresponding to the first network module CNN1, and post-processing to the module thread corresponding to the second network module CNN2.
  • The time Δt1 required to process an image frame can then be shortened to half the time Δt2 required by the original serial execution; this increases the number of image frames the CNN application processes per unit time, shortens the refresh interval of the image frames, and raises the frame rate displayed by the CNN application, which in turn improves the user experience. A toy simulation of this case follows below.
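  • A toy simulation of the N = 2 case, where time.sleep stands in for real computation and the 50 ms stage cost is an assumed illustrative figure:

```python
import queue
import threading
import time

def stage(cost_s):
    """A stand-in network module that just burns cost_s seconds per frame."""
    def run(x):
        time.sleep(cost_s)
        return x
    return run

def seconds_per_frame(frames=10, pipelined=False):
    cnn1, cnn2 = stage(0.05), stage(0.05)  # two halves with equal (assumed) cost
    start = time.time()
    if not pipelined:
        for f in range(frames):
            cnn2(cnn1(f))                  # serial: ~0.1 s per frame
    else:
        q = queue.Queue(maxsize=1)
        def consumer():
            for _ in range(frames):
                cnn2(q.get())
        t = threading.Thread(target=consumer)
        t.start()
        for f in range(frames):
            q.put(cnn1(f))                 # CNN1 of frame k overlaps CNN2 of frame k-1
        t.join()
    return (time.time() - start) / frames

print("serial:   ", round(seconds_per_frame(pipelined=False), 3))  # ~0.100
print("pipelined:", round(seconds_per_frame(pipelined=True), 3))   # ~0.055
```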
  • Adjacent image frame data can be included in parallel processing; that is, adjacent image frames can be processed in parallel by the at least two scheduled network modules, making full use of multiple cores to accelerate while preserving the execution order of the image frames.
  • In addition, the module threads can be bound one-to-one to processor cores, so that the tasks of the divided network modules are executed in parallel by multiple different processor cores; the CNN application is thereby accelerated on the multi-core processor, multiple cores are used effectively, and hardware resources are fully utilized. A core-binding sketch follows below.
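  • A sketch of such core binding. This is Linux-only: os.sched_setaffinity is assumed available, and the helper name is illustrative; on other platforms a library such as psutil would be needed instead.

```python
import os
import threading

def start_bound_module_thread(module_task, core_id):
    """Start a module thread pinned to a single processor core.

    On Linux, os.sched_setaffinity(0, ...) restricts the calling thread, so
    calling it first inside the thread body pins this module thread to
    core_id before the network module's task loop begins.
    """
    def run():
        os.sched_setaffinity(0, {core_id})  # pin this thread to one core
        module_task()                       # the module's processing loop
    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread
```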
  • For a device with idle graphics processing unit (GPU) resources, the task processing method provided in the embodiments of the present application may likewise use the idle GPU resources to accelerate the computing tasks required by the CNN application, greatly masking the time consumed by CNN computation and meeting real-time requirements.
  • In an embodiment, the at least two module threads may include at least a start module thread and an end module thread.
  • The start module thread may refer to the thread corresponding to the start network module obtained after dividing the neural network, and may be used to perform the task of the start network module.
  • The end module thread may refer to the thread corresponding to the end network module obtained after dividing the neural network, and may be used to perform the task of the end network module.
  • Scheduling the corresponding at least two module threads in parallel to process the input data includes: scheduling the start module thread according to its triggered thread task to process the input data; and scheduling the end module thread according to its triggered thread task to process the input data.
  • Scheduling the start module thread according to the triggered thread task to process the input data includes: calling the start module thread to pre-process the source input data, perform task processing based on the start network module of the neural network corresponding to the start module thread, and output the processing result data to the cache as cache exchange data.
  • Scheduling the end module thread to process the input data includes: calling the end module thread to obtain the corresponding cache exchange data from the cache as input data, perform task processing based on the end network module of the neural network corresponding to the end module thread, post-process the processing result data, and output it as the processing result of the source input data.
  • In the two-module case, the first network module CNN1 can be called the start network module and the second network module CNN2 the end network module; the thread bound to the start network module can be called the start module thread, and the thread bound to the end network module the end module thread.
  • After image frame data is acquired, the start module thread corresponding to the first network module CNN1 can be scheduled according to the thread task triggered by the image frame data, to pre-process the input image frame data and perform task processing based on the start network module of the neural network (i.e., the first network module CNN1), producing first processing result data, and to output the first processing result data to the cache.
  • At the synchronization point, the first processing result data is passed as cache exchange data to the end module thread corresponding to the second network module CNN2; that is, the cache exchange data serves as the input data of the end module thread and triggers the end module thread to perform task processing based on the second network module CNN2.
  • New image frame data can be written to the input buffer corresponding to the start module thread during the synchronization phase, while the first processing result data output by the start module thread is swapped, as exchange buffer data, into the input buffer corresponding to the end module thread through a ping-pong buffer exchange, serving as the input of the end module thread.
  • The first processing result data output by the start module thread to its output buffer can be exchanged into the input buffer corresponding to the end module thread through a preset synchronization barrier, so that the end module thread can obtain the first processing result data produced by the start network module from its own input buffer for the next step of task processing. A sketch of such an exchange follows below.
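  • A sketch of such a ping-pong exchange built around a barrier, where threading.Barrier stands in for the preset synchronization barrier and the two-slot buffer layout is an assumption:

```python
import threading

class PingPongBuffer:
    """Two buffers swapped at a synchronization barrier: the start module
    thread writes one buffer while the end module thread reads the other."""

    def __init__(self, n_threads=2):
        self.buffers = [None, None]
        self.flip = 0
        self.barrier = threading.Barrier(n_threads, action=self._swap)

    def _swap(self):
        # Runs exactly once per barrier crossing: the producing thread's
        # output buffer becomes the consuming thread's input buffer.
        self.flip ^= 1

    def write(self, data):     # called by the producing module thread
        self.buffers[self.flip] = data
        self.barrier.wait()    # synchronization point

    def read(self):            # called by the consuming module thread
        self.barrier.wait()
        return self.buffers[self.flip ^ 1]  # the slot written before the swap
```

After each barrier crossing the two threads touch different slots, so the producer can immediately write the next frame's result while the consumer reads the previous one, without locks around the data itself.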
  • the neural network in the embodiment of the present application may be a convolutional neural network; the obtained source input data may be image frame data.
  • the above pre-processing may be image pre-processing, and the post-processing may be image post-processing. After the image post-processing, other processing may be performed, such as image rendering processing, etc., which is not limited in the embodiments of the present application.
  • the aforementioned module thread may further include at least one intermediate module thread.
  • the network module obtained after dividing the neural network may further include at least one intermediate network module. Therefore, the at least two module threads scheduled in parallel may further include at least one intermediate module thread corresponding to the intermediate network module, and the intermediate module thread may be used to perform the task of the intermediate network module.
  • the scheduling of at least two corresponding module threads in parallel according to the triggered at least two thread tasks to process the input data includes scheduling at least one intermediate module thread to process the input data according to the triggered thread task.
  • Scheduling at least one intermediate module thread according to the triggered thread task to process input data includes: calling the at least one intermediate module thread to obtain the corresponding cache exchange data from the cache as input data, perform task processing based on the intermediate network module of the neural network, and output the processing result data to the cache as cache exchange data.
  • The embodiments of the present application may trigger an intermediate thread task according to the cache exchange data output by the start module thread during its task processing, and, according to the triggered intermediate thread task, call the intermediate module thread corresponding to the intermediate network module of the neural network to perform task processing on the cache exchange data output by the start module thread, producing intermediate processing result data; the intermediate processing result data is output to the cache as cache exchange data for data exchange with the next intermediate module thread or with the end module thread.
  • In the three-module case, the first network module CNN1 can be called the start network module, the second network module CNN2 the intermediate network module, and the third network module CNN3 the end network module; correspondingly, the module thread corresponding to the first network module CNN1 can be called the start module thread, the module thread corresponding to the second network module CNN2 the intermediate module thread, and the module thread corresponding to the third network module CNN3 the end module thread.
  • The correspondence between network modules and module threads can be one-to-one, so data exchange between module threads can be implemented by each thread monitoring the unique consumer thread related to it.
  • The start module thread performs image pre-processing on the input image frame data and performs task processing based on the first network module CNN1 of the neural network, producing first processing result data, which it outputs to buffer A.
  • The intermediate module thread can obtain the cache exchange data written into the target cache space corresponding to its thread task, perform task processing based on the second network module CNN2 of the neural network, produce intermediate processing result data, and output the intermediate processing result data to buffer B.
  • The end module thread can obtain the cache exchange data written into the target cache space corresponding to its thread task (i.e., cache C in FIG. 4) and perform task processing based on the third network module CNN3 of the neural network.
  • The processing result data obtained after this task processing can undergo image post-processing and rendering to obtain the processing result of the source input data, which is then output according to the processing result, for example displayed on the application interface of a convolutional-neural-network-related application.
  • The order of data exchange can run from back to front. As shown in FIG. 4, data can first be exchanged between cache B and cache C, and then cache A can be exchanged with the input cache of the second (intermediate) module thread; that is, the order of data exchange can be opposite to the execution order of the module threads. A sketch of this back-to-front swap follows below.
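  • A sketch of the back-to-front swap, where the 'in'/'out' buffer dictionaries are an illustrative simplification of the cache layout in FIG. 4:

```python
def exchange_back_to_front(stages):
    """Hand each module thread's output buffer to its successor's input slot.

    stages[i] holds the 'in' and 'out' buffers of module thread i.  Swapping
    from the tail of the pipeline toward the head means a thread's fresh
    output is passed on before its own input slot is overwritten: the
    exchange order is the reverse of the threads' execution order.
    """
    for i in range(len(stages) - 1, 0, -1):
        stages[i]["in"], stages[i - 1]["out"] = stages[i - 1]["out"], stages[i]["in"]

# Three stages: start (writes A), intermediate (writes B), end (reads C).
stages = [{"in": "src", "out": "A"},
          {"in": None, "out": "B"},
          {"in": "C", "out": None}]
exchange_back_to_front(stages)
print(stages)  # B has moved to the end stage's input, A to the intermediate's
```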
  • In an embodiment, the time consumption of the second network module CNN2 may be approximately equal to that of the first network module CNN1 plus image pre-processing, and may also be approximately equal to that of the third network module CNN3 plus post-processing, so that the three triggered module threads achieve the highest efficiency.
  • Obtaining input data includes: when data written into the target cache space corresponding to a thread task is detected, determining that the trigger condition for generating the thread task is met, and taking the data written into the target cache space as the input data; the module threads are bound one-to-one to the thread tasks, and each thread task is bound to a target cache space.
  • That is, the embodiments of the present application may determine whether the trigger condition of a thread task is met by whether data has been written into the target cache space corresponding to that thread task.
  • When the data to be read is written into the target cache space corresponding to the thread task, the trigger condition for generating the thread task can be determined to be met, and the data written into the target cache space can be used as input data to trigger the corresponding thread task.
  • The corresponding module thread can then be scheduled in parallel according to the triggered thread task to complete the task processing of the input data. A sketch of this trigger mechanism follows below.
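  • A condition-variable sketch of the trigger mechanism, where the TargetCache class and its method names are illustrative assumptions:

```python
import threading

class TargetCache:
    """A cache space bound to one thread task: writing data into it is the
    trigger condition for that task."""

    def __init__(self):
        self._cond = threading.Condition()
        self._data = None

    def write(self, data):
        with self._cond:
            self._data = data
            self._cond.notify()        # wake the bound module thread

    def wait_for_input(self):
        with self._cond:
            while self._data is None:  # no data written yet: keep listening
                self._cond.wait()
            data, self._data = self._data, None
            return data                # the written data becomes the input data
```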
  • In summary, the embodiments of the present application divide the complete neural network into multiple network modules with similar time consumption, each corresponding one-to-one with a module thread, so that the tasks of different network modules can be distributed to different module threads for execution; each module thread can be allocated to a different processor core, so that different processor cores execute the tasks of different network modules, making full use of hardware resources, such as the graphics processor resources of the device or the multi-core performance of mobile devices such as phones.
  • Real-time applications can thus make full use of multi-core computing power to accelerate, increase throughput, and greatly mask the time consumed by network computation, for example improving the running efficiency of convolutional-neural-network-related applications on multi-core processors and accelerating CNN-related applications to meet real-time requirements.
  • In addition, the pre- and post-processing parts required by the neural-network application can join the network-layer division as third-party modules, reducing data processing, display, and rendering overhead. Moreover, the module threads are bound to the thread tasks, so data synchronization between different module threads can be completed through a simple ping-pong cache, reducing the data synchronization overhead among multiple threads, avoiding the inefficiency caused by excessive synchronization overhead, and supporting applications that require synchronization without changing the task execution order corresponding to the image frame data.
  • The neural-network-based task processing device includes an input data acquisition module 510, a module thread scheduling module 520, and a processing result data output module 530.
  • the input data acquiring module 510 is configured to acquire input data, wherein the input data is used to trigger a thread task, and the input data is source input data or cache exchange data.
  • The module thread scheduling module 520 is configured to schedule the corresponding at least two module threads in parallel according to at least two triggered thread tasks, to process the input data and produce processing result data; the at least two module threads respectively correspond to at least two network modules divided according to the network layers of the neural network.
  • The processing result data output module 530 is configured to output the processing result data to the cache as cache exchange data for a module thread other than the at least two module threads, or to output the processing result data as the processing result of the source input data.
  • the task processing device based on the neural network further includes: a network layer division module.
  • the network layer dividing module is configured to divide the network layer in the neural network to obtain at least two network modules.
  • the network layer division module may include a channel number determination submodule, a network layer division submodule, and a network module generation module.
  • the channel number determination submodule is configured to determine the number of channels between every two adjacent network layers in the neural network, respectively.
  • The network layer division submodule is configured to, when the number of channels between adjacent network layers is one, divide the preceding network layer of the adjacent pair into the input layer of the input network module and the succeeding network layer into the output layer of the output network module.
  • the network module generating module is configured to generate at least two network modules based on the output layer of the output network module and the input layer of the input network module.
  • the difference in processing time between network modules is less than a set threshold.
  • the at least two module threads may include at least a start module thread and an end module thread.
  • the above-mentioned module thread scheduling module 520 is configured to schedule the starting module thread to process the input data according to the triggered thread task; and can schedule the end module thread to process the input data according to the triggered thread task.
  • Scheduling the start module thread according to the triggered thread task to process the input data may include: calling the start module thread to pre-process the source input data, perform task processing based on the start network module of the neural network corresponding to the start module thread, and output the processing result data to the cache as cache exchange data.
  • Scheduling the end module thread to process the input data includes: calling the end module thread to obtain the corresponding cache exchange data from the cache as input data, perform task processing based on the end network module of the neural network corresponding to the end module thread, post-process the processing result data, and output it as the processing result of the source input data.
  • the module thread scheduling module 520 includes a start module thread scheduling sub-module and an end module thread scheduling sub-module.
  • The start module thread scheduling submodule is configured to call the start module thread to pre-process the source input data, perform task processing based on the start network module of the neural network corresponding to the start module thread, and output the processing result data to the cache as cache exchange data.
  • The end module thread scheduling submodule is configured to call the end module thread to obtain the corresponding cache exchange data from the cache as input data, perform task processing based on the end network module of the neural network corresponding to the end module thread, post-process the processing result data, and output it as the processing result of the source input data.
  • The module thread scheduling module 520 in the embodiments of the present application may also include other submodules; for example, it may also schedule at least one intermediate module thread according to the triggered thread task, process the input data, and produce the processing result data corresponding to the intermediate module thread.
  • The processing result data corresponding to the intermediate module thread can be output to the cache as cache exchange data, to serve as the input data of the next module thread associated with the intermediate module thread, so that the next module thread can obtain the cache exchange data for task processing.
  • the module thread further includes at least one intermediate module thread.
  • In an embodiment, the module thread scheduling module 520 is configured to schedule at least one intermediate module thread according to the triggered thread task to process the input data, which includes: calling the at least one intermediate module thread to obtain the corresponding cache exchange data from the cache as input data, perform task processing based on the intermediate network module of the neural network, and output the processing result data to the cache as cache exchange data.
  • the module thread scheduling module 520 may also include an intermediate module thread scheduling submodule.
  • The intermediate module thread scheduling submodule is configured to call the intermediate module thread to obtain the corresponding cache exchange data from the cache as input data, perform task processing based on the intermediate network module of the neural network, and output the processing result data to the cache as cache exchange data.
  • In an embodiment, the source input data may be image frame data, the pre-processing may be image pre-processing, the post-processing may be image rendering processing, and the neural network may be a convolutional neural network.
  • the input data acquisition module 510 includes a monitoring submodule and a determination submodule.
  • The monitoring submodule is configured to monitor the target cache space corresponding to a thread task, for example, to monitor whether data has been written into the target cache space from which the thread task reads.
  • The determination submodule is configured to, when the monitoring submodule detects data written into the target cache space corresponding to the thread task, determine that the trigger condition for generating the thread task is met and take the data written into the target cache space as the input data.
  • the module thread is bound to the thread task in one-to-one correspondence, and the thread task is bound to the target cache space.
  • thread tasks may be bound to processor cores in a one-to-one correspondence.
  • The neural-network-based task processing device provided above can execute the neural-network-based task processing method provided in any embodiment of the present application.
  • the above neural network-based task processing device may be integrated in the device.
  • The device can be composed of at least two physical entities or of one physical entity; for example, the device can be a personal computer (PC), a computer, a mobile phone, a tablet device, a personal digital assistant, a server, a messaging device, a game console, and the like.
  • An embodiment of the present application further provides a device, including: a processor and a memory. At least one instruction is stored in the memory, and the instruction is executed by the processor, so that the device executes the neural network-based task processing method described in the foregoing method embodiment.
  • the device includes a processor 60, a memory 61, a display screen 62 with a touch function, an input device 63, an output device 64, and a communication device 65.
  • the number of processors 60 in the device may be at least one, and one processor 60 is taken as an example in FIG. 6.
  • the number of the memory 61 in the device may be at least one, and one memory 61 is taken as an example in FIG. 6.
  • the processor 60, the memory 61, the display screen 62, the input device 63, the output device 64, and the communication device 65 of the device may be connected by a bus or other means. In FIG. 6, the connection by a bus is used as an example.
  • The memory 61 is configured to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the task processing method described in any embodiment of the present application (for example, the input data acquisition module 510, module thread scheduling module 520, and processing result data output module 530 in the above task processing device).
  • The memory 61 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required for at least one function, and the data storage area may store data created according to the use of the device, and the like.
  • the memory 61 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage devices.
  • the memory 61 also includes memories remotely provided with respect to the processor 60, and these remote memories may be connected to the device through a network. Examples of the above network include but are not limited to the Internet, intranet, local area network, mobile communication network, and combinations thereof.
  • the display screen 62 is a display screen 62 with a touch function, which may be a capacitive screen, an electromagnetic screen or an infrared screen.
  • the display screen 62 is configured to display data according to the instructions of the processor 60, and is also configured to receive a touch operation acting on the display screen 62 and send a corresponding signal to the processor 60 or other devices.
  • the display screen 62 is an infrared screen, it further includes an infrared touch frame, the infrared touch frame is disposed around the display screen 62, the infrared touch frame is also configured to receive an infrared signal, and the infrared signal Send to processor 60 or other device.
  • the communication device 65 is configured to establish a communication connection with other devices, which may be at least one of a wired communication device and a wireless communication device.
  • The input device 63 is configured to receive input numeric or character information and to generate key signal input related to user settings and function control of the device, and may also include a camera for acquiring images and a sound pickup device for acquiring audio data.
  • the output device 64 may include audio equipment such as a speaker. It should be noted that the composition of the input device 63 and the output device 64 can be set according to actual conditions.
  • the processor 60 executes various functional applications and data processing of the device by running software programs, instructions, and modules stored in the memory 61, that is, implementing the task processing method based on the neural network.
  • When the processor 60 executes the at least one program stored in the memory 61, the following operations are implemented: acquiring input data, where the input data is used to trigger thread tasks and is source input data or cache exchange data; according to at least two triggered thread tasks, scheduling the corresponding at least two module threads in parallel to process the input data and produce processing result data, where the at least two module threads respectively correspond to at least two network modules divided according to the network layers of the neural network; and outputting the processing result data to the cache as cache exchange data for a module thread other than the at least two module threads, or outputting the processing result data as the processing result of the source input data.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • When the instructions in the storage medium are executed by a processor of a device, the device is enabled to execute the neural-network-based task processing method described in the foregoing method embodiments.
  • The neural-network-based task processing method includes: acquiring input data, where the input data is used to trigger thread tasks and is source input data or cache exchange data; according to at least two triggered thread tasks, scheduling the corresponding at least two module threads in parallel to process the input data and produce processing result data, where the at least two module threads respectively correspond to at least two network modules divided according to the network layers of the neural network; and outputting the processing result data to the cache as cache exchange data for a module thread other than the at least two module threads, or outputting the processing result data as the processing result of the source input data.
  • the present application can be implemented by software and necessary general hardware, and of course, can also be implemented by hardware.
  • The technical solutions of the present application can essentially be embodied in the form of a software product; the computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk, or optical disc, and includes several instructions to make a computer device (which may be a robot, a personal computer, a server, or a network device, etc.) execute the neural-network-based task processing method described in any embodiment of the present application.
  • Each unit and module included is only divided according to functional logic but is not limited to the above division, as long as the corresponding function can be achieved; in addition, the names of the functional units are only for the purpose of distinguishing them from each other and are not used to limit the scope of protection of this application.
  • each part of the present application may be implemented by hardware, software, firmware, or a combination thereof.
  • multiple steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution device.
  • For example, if implemented in hardware, any one or a combination of the following technologies known in the art may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application disclose a neural-network-based task processing method and related device, relating to the field of computer network technology. The method includes: acquiring input data, where the input data is used to trigger thread tasks and is source input data or cache exchange data; according to at least two triggered thread tasks, scheduling the corresponding at least two module threads in parallel to process the input data and produce processing result data, where the at least two module threads respectively correspond to at least two network modules divided according to the network layers of the neural network; and outputting the processing result data to a cache as cache exchange data for a module thread other than the at least two module threads, or outputting the processing result data as the processing result of the source input data.

Description

Neural network-based task processing method and related device
This application claims priority to Chinese patent application No. 201811180174.5, filed with the China National Intellectual Property Administration on October 10, 2018, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
This application relates to the field of computer network technology, for example, to a neural-network-based task processing method and related device.
BACKGROUND
With the rapid development of artificial intelligence technology, machine learning methods represented by deep neural networks have achieved practical application in fields such as computer vision and speech recognition, and have become a research focus.
When actually deploying a neural-network-based application, not only the computational cost of the network itself but also the latency and throughput of the application as a whole must be considered. At present, practical applications, especially real-time applications deployed on mobile devices, often use multi-core processor capability to accelerate the computation of each layer of the neural network; that is, the computing task of each network layer is distributed to multiple processor cores so that the cores jointly complete the computation of the same layer. When the computing task of each network layer is distributed to multiple processor cores, however, the time spent distributing the tasks to the cores and collecting the results from them may exceed the time of the computation itself. For example, when a single layer takes less than about 0.5 milliseconds, the extra overhead brought by scheduling across cores may be higher than the cost of the computation itself, degrading the acceleration ratio.
SUMMARY
Embodiments of the present application provide a neural-network-based task processing method and related device, to improve the acceleration ratio and avoid the low computational efficiency of neural-network-related applications on multi-core processors in the related art.
In a first aspect, an embodiment of the present application provides a neural-network-based task processing method, including: acquiring input data, where the input data is used to trigger thread tasks and is source input data or cache exchange data; according to at least two triggered thread tasks, scheduling the corresponding at least two module threads in parallel to process the input data and produce processing result data, where the at least two module threads respectively correspond to at least two network modules divided according to the network layers of the neural network; and outputting the processing result data to a cache as cache exchange data for a module thread other than the at least two module threads, or outputting the processing result data as the processing result of the source input data.
In a second aspect, an embodiment of the present application further provides a neural-network-based task processing device, including: an input data acquisition module configured to acquire input data, where the input data is used to trigger thread tasks and is source input data or cache exchange data; a module thread scheduling module configured to schedule the corresponding at least two module threads in parallel according to at least two triggered thread tasks, to process the input data and produce processing result data, where the at least two module threads respectively correspond to at least two network modules divided according to the network layers of the neural network; and a processing result data output module configured to output the processing result data to a cache as cache exchange data for a module thread other than the at least two module threads, or to output the processing result data as the processing result of the source input data.
In a third aspect, an embodiment of the present application further provides a device, including a processor and a memory; at least one instruction is stored in the memory, and the instruction is executed by the processor so that the device performs the neural-network-based task processing method of the first aspect.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium; when the instructions in the storage medium are executed by a processor of a device, the device is enabled to perform the neural-network-based task processing method of the first aspect.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic flowchart of the steps of an embodiment of a neural-network-based task processing method of the present application;
FIG. 2 is a schematic diagram of two module threads executing the tasks of two network modules in parallel in an embodiment of the present application;
FIG. 3 is a schematic diagram of data exchange between two module threads in an embodiment of the present application;
FIG. 4 is a schematic diagram of data exchange among a start module thread, an intermediate module thread, and an end module thread in an embodiment of the present application;
FIG. 5 is a schematic structural block diagram of an embodiment of a neural-network-based task processing device in an embodiment of the present application;
FIG. 6 is a schematic structural block diagram of a device in an embodiment of the present application.
Detailed Description
In the course of implementing the present application, the applicant found that the computation tasks in a neural network execute sequentially: data fed into the neural network passes through the network's layers one after another to yield the final output. The related art often uses multi-core processing power to accelerate task processing for neural-network-related applications. For example, for applications that do not require synchronous execution, such as face recognition in the cloud, batch processing is typically used: many similar, unrelated tasks are distributed to different processor cores, achieving the theoretically optimal parallelism when every task takes roughly the same time; but this scheme cannot support applications that require synchronization, i.e., applications that need real-time display cannot use it. For applications that do require synchronization, i.e., in scenarios where synchronous execution is needed, another scheme is typically adopted: with a single layer's channels as the unit, a multi-core processor accelerates the computation of each layer, for example by using the multi-core-optimized layer algorithms contained in nnPack (an acceleration package for neural network computation) or OpenBlas (an open-source matrix computation library). The only parallelism lies inside a single layer of the network, so this scheme is effective when a single layer is expensive, but cannot make effective use of multiple cores when a single layer is cheap. For a mobile or real-time-grade neural network, the structure has many layers but few channels per layer, and a single layer usually takes less than 0.5 ms; the extra overhead of multi-core scheduling is then generally comparable to the cost of the single-layer computation itself, and multi-core processing may even be slower than single-core processing. The related art thus poses a dilemma: in scenarios requiring synchronous execution, the per-layer scheme is effective only when a single layer is expensive; in scenarios not requiring synchronous execution, every task must take a similar amount of time, and the approach is unsuitable for applications that need real-time display.
To avoid the above situations, embodiments of the present application propose a new task processing method based on a neural network. Embodiments of the present application can process input data by scheduling, in parallel, at least two module threads corresponding to the network modules of the neural network, thereby largely hiding the time cost of the neural network computation and increasing throughput — that is, fully exploiting multi-core processor performance to accelerate neural-network-related applications so that they can display results in real time. The network modules are obtained in advance by partitioning the network layers of the neural network, and one network module may include at least one network layer; embodiments of the present application place no restriction on this.
In actual processing, to accelerate the task processing of a neural network application on a multi-core processor, the network layers of the neural network may be partitioned in advance into at least two network modules, each of which may include at least one network layer. At the start of execution, a thread pool containing at least two threads may be created based on the network model obtained after partitioning the neural network, and each thread in the pool may correspond to the execution of exactly one network module. In embodiments of the present application, the threads in this pool that correspond to network modules may be called module threads; by scheduling module threads in parallel to execute the tasks of each network module of the neural network separately, the speedup ratio is improved, and the running efficiency of neural-network-related applications on multi-core processors is thereby raised.
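As an illustration of this thread-pool arrangement, the following C++ sketch creates one worker thread per network module. It is a minimal sketch only: the `NetworkModule` type, the loop body, and the module count of three are assumptions for illustration, not details taken from the patent.

```cpp
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for one partition of the network's layers.
struct NetworkModule { int id; };

// Each worker executes only the tasks of the single module bound to it.
void runModuleLoop(NetworkModule& module) {
    // wait for input, execute this module's layers, publish the output ...
}

int main() {
    std::vector<NetworkModule> modules{{0}, {1}, {2}};  // N = 3 partitions
    std::vector<std::thread> pool;
    pool.reserve(modules.size());
    for (auto& m : modules)
        pool.emplace_back(runModuleLoop, std::ref(m));  // one thread per module
    for (auto& t : pool)
        t.join();
    return 0;
}
```

Because each thread owns exactly one module, the pool never needs work stealing or rebalancing: the mapping from module to thread is fixed for the lifetime of the application.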
Referring to FIG. 1, a schematic flowchart of the steps of an embodiment of a task processing method based on a neural network of the present application is shown; the method includes steps 110 to 130.
In step 110, input data is acquired, where the input data is used to trigger thread tasks, and the input data is source input data or cache-exchange data.
In an embodiment of the present application, when task processing is required, input data may be acquired for the task currently to be executed, so that the corresponding thread task is triggered for processing according to the currently acquired input data. The currently acquired input data can be used to trigger a thread task and includes data related to task processing, for example source input data, cache-exchange data, and the like; embodiments of the present application place no restriction on this.
It should be noted that the source input data in embodiments of the present application may refer to the source data required for task processing, for example the image frame data required by an image recognition task. Cache-exchange data may refer to cached data exchanged between different threads, and the cached data may be stored in a cache space.
In step 120, according to at least two triggered thread tasks, the corresponding at least two module threads are scheduled in parallel to process the input data and produce processing result data, where the at least two module threads respectively correspond to at least two network modules obtained by partitioning the network layers of the neural network.
In an embodiment of the present application, after the input data is acquired, at least two module threads may be scheduled in parallel from the thread pool according to the at least two thread tasks triggered by the input data, so that the input data is processed by the at least two scheduled module threads and processing result data is produced. The processing result data may refer to the result data obtained after a module thread performs task processing based on its corresponding network module.
For example, after two module threads are scheduled in parallel — say a first module thread and a second module thread — and the input data is processed in parallel by these two scheduled threads, the processing result data produced may include the first processing result data and the second processing result data produced by the two threads, where the first processing result data may refer to the result data produced after the first module thread performs task processing based on its corresponding first network module, and the second processing result data may refer to the result data produced after the second module thread performs task processing based on its corresponding second network module. It should be noted that the first module thread may be bound to the first network module so as to correspond to the execution of that network module; likewise, the second module thread may be bound to the second network module so as to correspond to the execution of that network module.
Of course, when at least three module threads are scheduled in parallel, the processing result data produced may include the result data obtained after processing by the at least three scheduled module threads; embodiments of the present application place no restriction on this. For example, after a first module thread, a second module thread, and a third module thread are scheduled in parallel, and each of them processes the acquired source input data or cache-exchange data, the processing result data produced may include the first processing result data produced by the first module thread, the second processing result data produced by the second module thread, and the third processing result data produced by the third module thread, and so on.
In step 130, the processing result data is output to a cache as cache-exchange data for module threads other than the at least two module threads, or the processing result data is output as the processing result of the source input data.
In an embodiment of the present application, after a module thread produces processing result data, the processing result data may be output based on the network module to which the module thread corresponds. For example, when the network module corresponding to a module thread is the head network module or a middle network module of the neural network, the processing result data produced by that thread may be output to the cache as cache-exchange data for interaction with module threads other than the at least two module threads — that is, the processing result data produced by this module thread becomes the input data of the next module thread, so that the next module thread can perform the next step of task processing based on it, such as executing the task of the next network module of the neural network. When the network module corresponding to a module thread is the tail network module of the neural network, the processing result data produced by that thread may be determined to be the processing result of the source input data and output, so that the neural-network-related application can use this processing result for its business processing and meet business requirements.
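A minimal sketch of this routing decision in step 130, assuming a position enum and caller-supplied callbacks (none of these names come from the patent):

```cpp
#include <functional>
#include <utility>
#include <vector>

enum class ModulePosition { Head, Middle, Tail };

// Head and middle modules publish their result to the cache as exchange
// data for the next module thread; the tail module emits it as the
// processing result of the source input data.
void publishResult(ModulePosition pos,
                   std::vector<float> result,
                   const std::function<void(std::vector<float>)>& toCache,
                   const std::function<void(std::vector<float>)>& emitFinal) {
    if (pos == ModulePosition::Tail)
        emitFinal(std::move(result));
    else
        toCache(std::move(result));
}
```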
It should be noted that the processing result of the source input data in embodiments of the present application may be used to represent the result output by the neural network after processing the source input data.
The head network module in embodiments of the present application may refer to the first network module obtained after partitioning the neural network. It is configured to receive the source input data passed to the neural network, and further configured to execute the tasks that the network layers contained in the head network module need to execute. The head network module includes the input network layer of the neural network that receives the source input data (i.e., the input layer of the neural network), and may further include at least one other network layer of the neural network; embodiments of the present application place no restriction on this.
In addition, the tail network module in embodiments of the present application may refer to the last network module obtained after partitioning the neural network. It is configured to output the processing result of the source input data, and further configured to execute the tasks that the network layers contained in the tail network module need to execute. The tail network module includes the network layer of the neural network that outputs the processing result (i.e., the output layer of the neural network), and may further include at least one other network layer of the neural network; embodiments of the present application place no restriction on this either.
Correspondingly, a middle network module in embodiments of the present application may include at least one middle network layer of the neural network, where a middle network layer may refer to any network layer of the neural network other than the first and last network layers. For example, when the neural network includes five network layers, a middle network module may include the sequentially connected second, third, and fourth network layers; or the third and fourth network layers; or only the third network layer. For instance, when the head network module includes the first network layer and the tail network module includes the fifth network layer, the middle network module may include the second, third, and fourth network layers; when the head network module includes the sequentially connected first and second network layers and the tail network module includes the fifth network layer, the middle network module may include the third and fourth network layers; and when the head network module includes the sequentially connected first and second network layers and the tail network module includes the fifth and fourth network layers, the middle network module may include only the third network layer, and so on.
In summary, after acquiring input data used to trigger thread tasks, embodiments of the present application may schedule, according to at least two triggered thread tasks, the corresponding at least two module threads in parallel to process the input data, where the at least two module threads scheduled in parallel respectively correspond to at least two network modules obtained by partitioning the network layers of the neural network. That is, the tasks of different network modules of the neural network are distributed to different module threads for parallel execution, which largely hides the time cost of the neural network computation and improves the running efficiency of neural-network-related applications on multi-core processors — avoiding the low computational efficiency of such applications on multi-core processors in the related art — so that real-time applications can fully exploit multi-core computing power for acceleration, giving the method a wide range of application.
In an embodiment, the task processing method provided by embodiments of the present application can serve as an engineering optimization method and can be used, in the form of a library or source code, as a dependency of any neural-network-related application, applied to all kinds of applications deployed on the basis of neural networks, so that a device equipped with multiple processor cores can fully exploit multi-core performance to accelerate the neural-network-related application and meet real-time requirements. For example, the relevant multi-threading logic can be written according to the method provided by embodiments of the present application and then integrated into an application in the form of a library or source code, so that a device such as a mobile phone can fully exploit multi-core performance for acceleration when processing that application's computation tasks, achieving full utilization of hardware resources.
In an embodiment, the neural network may be partitioned in advance so that the network layers it contains are divided among N network modules; the tasks to be executed by the network layers of the neural network can then be processed separately by the N module threads corresponding to the N network modules, realizing parallel execution of the tasks of every network module of the neural network. N may be an integer greater than 1 and may represent the number of network modules obtained after partitioning the neural network. Therefore, in an embodiment of the present application, before the acquiring of input data, the method may further include: partitioning the network layers of the neural network to obtain at least two network modules. It should be noted that each network module obtained after partitioning may include at least one network layer of the neural network, for example at least one convolutional layer of a convolutional neural network; embodiments of the present application place no restriction on this.
In an embodiment, to reduce the complexity of data interaction between different network modules, embodiments of the present application may select network layers of the neural network that have a single input and output as the connection points between network modules. For example, the above partitioning of the network layers of the neural network to obtain at least two network modules includes: determining the number of channels between every two adjacent network layers of the neural network; in a case where the number of channels between the adjacent network layers is one, dividing the former network layer of the adjacent network layers into the input layer of an input network module, and dividing the latter network layer of the adjacent network layers into the output layer of an output network module; and generating at least two network modules based on the output layer of the output network module and the input layer of the input network module.
In an embodiment, when partitioning the neural network, the topology graph of the neural network may be determined so as to identify the network layers it contains — for example, the input layer, convolutional layers, and sampling layers contained in a convolutional neural network (CNN) — and the number of channels between the network layers of the neural network may be determined, for example the number of channels between every two adjacent network layers; the network layers of the neural network can then be partitioned into at least two network modules based on the number of channels between adjacent network layers.
In an embodiment, whether two adjacent network layers should be divided into different network modules can be determined by judging whether the number of channels between the two adjacent network layers is one. For example, in a case where the number of channels between two adjacent network layers of the neural network is one, it can be determined that the former of the adjacent network layers outputs its data to the latter network layer through a single channel; the former network layer can then be divided into the input layer of an input network module and the latter network layer into the output layer of an output network module, after which at least two network modules can be determined based on the resulting output layer of the output network module and input layer of the input network module. The data output by the former network layer serves as the input data of the latter network layer; in other words, the latter of the adjacent network layers obtains, through a single channel, the input data output by the former network layer.
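The partition rule just described can be sketched as follows, assuming the channel count at every junction has already been read off the topology graph (the function and variable names are illustrative, not from the patent):

```cpp
#include <vector>

// channels[i] is the number of channels between layer i and layer i + 1.
// Wherever that count is one, the current module closes after layer i
// and layer i + 1 opens the next module.
std::vector<std::vector<int>> partitionByChannels(
        int numLayers, const std::vector<int>& channels) {
    std::vector<std::vector<int>> modules(1);
    for (int layer = 0; layer < numLayers; ++layer) {
        modules.back().push_back(layer);
        if (layer + 1 < numLayers && channels[layer] == 1)
            modules.emplace_back();  // single-channel junction: new module
    }
    return modules;
}
```

Layers joined by more than one channel thus always stay inside the same module, which keeps each cache exchange between modules down to a single tensor.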
In an embodiment, the difference in processing time between the network modules obtained after partitioning the neural network is smaller than a set threshold. For example, embodiments of the present application may partition the neural network based on the time each network module needs for task processing, so that the difference in processing time between the resulting network modules is smaller than the set threshold, achieving a good trade-off between resource overhead and parallel performance. When the neural network itself has pre- and post-processing parts, these may, depending on how long they take, participate in the partitioning as a module or as part of a module: based on the processing time of the network's front and back parts, the front part may participate in the partitioning of network modules as a module or as part of a module, and likewise the back part may participate in the partitioning as a module or as part of a module. During actual execution, the time consumed by the network's front and back parts can be hidden, and the overall time finally observed can equal the processing time of whichever of the resulting network modules takes longest.
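The patent states the balance goal but not a specific algorithm; one simple possibility is a greedy cut over measured per-layer timings, as in this hypothetical sketch:

```cpp
#include <numeric>
#include <vector>

// Greedily place module boundaries so that each module's summed layer
// time approaches total / n; returns the index of the last layer of
// every module except the final one.
std::vector<int> balancedCuts(const std::vector<double>& layerMs, int n) {
    const double total = std::accumulate(layerMs.begin(), layerMs.end(), 0.0);
    const double target = total / n;
    std::vector<int> cuts;
    double acc = 0.0;
    for (int i = 0; i < static_cast<int>(layerMs.size()) &&
                    static_cast<int>(cuts.size()) < n - 1; ++i) {
        acc += layerMs[i];
        if (acc >= target) {
            cuts.push_back(i);
            acc = 0.0;
        }
    }
    return cuts;
}
```

In practice such cuts would also have to respect the single-channel junction rule above, so a real partitioner would only consider cut points that satisfy both constraints.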
In an embodiment, the CNN may be split, in an offline state or an online state, into N network modules whose time costs are roughly equal. The task of each network module can be assigned to a different module thread. Each frame of image data that a CNN application needs to process is still handled in order by the N module threads corresponding to the N network modules obtained after partitioning, so the total time per frame is theoretically unchanged; viewed from the CNN application as a whole, however, the number of frames of image data processed per unit time increases by a factor of N. For example, when N is 2, the front part of the processing can be assigned to the module thread corresponding to the first network module CNN1, and the back part to the module thread corresponding to the second network module CNN2, as shown in FIG. 2; the time Δt1 required to process an image frame can thus be cut to half of the time Δt2 required by the original serial execution method, which increases the number of image frames the CNN application processes per unit time, shortens the refresh interval of image frames, raises the display frame rate of the CNN application, and thereby improves user experience.
It can be seen that, based on the neural-network-based processing method provided by embodiments of the present application, adjacent image frame data can be brought into parallel processing: adjacent image frames can be processed in parallel by the at least two scheduled network modules, making full use of multiple cores for acceleration while preserving the execution order of the image frames.
In an embodiment, module threads may be bound one-to-one to processor cores, so that the tasks of the partitioned network modules are executed in parallel by multiple different processor cores, accelerating the CNN application on a multi-core processor — that is, effectively using multiple cores for acceleration and making full use of hardware resources. For example, on devices equipped with a graphics processing unit (GPU), while the CNN application performs operations such as rendering and display on the device, the task processing method provided by embodiments of the present application can use idle GPU resources to accelerate the computation tasks the CNN application needs to execute, largely hiding the CNN computation time and meeting real-time requirements.
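On Linux, the one-to-one binding between module threads and cores could be realized with CPU affinity, as in this platform-specific sketch; pthread_setaffinity_np is a GNU extension and only one possible mechanism, not something prescribed by the patent:

```cpp
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Pin the given thread to a single core; returns 0 on success.
int pinToCore(pthread_t thread, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(thread, sizeof(cpu_set_t), &set);
}
```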
In an embodiment, the at least two module threads may include at least a head module thread and a tail module thread; the head module thread may refer to the thread corresponding to the head network module obtained after partitioning the neural network and may be used to execute the tasks of the head network module, and the tail module thread may refer to the thread corresponding to the tail network module obtained after partitioning the neural network and may be used to execute the tasks of the tail network module. The above scheduling, according to at least two triggered thread tasks, of the corresponding at least two module threads in parallel to process the input data includes: scheduling, according to a triggered thread task, the head module thread to process the input data; and scheduling, according to a triggered thread task, the tail module thread to process the input data.
In an embodiment, scheduling the head module thread to process the input data according to a triggered thread task includes: calling the head module thread to pre-process the source input data, perform task processing based on the head network module of the neural network corresponding to the head module thread, and output the processing result data to the cache as cache-exchange data. Scheduling the tail module thread to process the input data according to a triggered thread task includes: calling the tail module thread to obtain the corresponding cache-exchange data from the cache as input data, perform task processing based on the tail network module of the neural network corresponding to the tail module thread, and post-process the processing result data and output it, with the processing result data serving as the processing result of the source input data.
In an embodiment, after the network layers of a convolutional neural network are divided into two network modules, the first network module CNN1 obtained may be called the head network module and the second network module CNN2 the tail network module; the thread bound to the head network module may be called the head module thread, and the thread bound to the tail network module may be called the tail module thread. As shown in FIG. 3, after input image frame data is acquired, the head module thread corresponding to the first network module CNN1 can be scheduled, according to the thread task triggered by the image frame data, to pre-process the input image frame data and perform task processing based on the head network module of the neural network (i.e., the first network module CNN1), producing first processing result data; the first processing result data is output to the cache, so that in the synchronization phase it is passed as cache-exchange data to the tail module thread corresponding to the second network module CNN2 — that is, the cache-exchange data becomes the tail module thread's input data, triggering the tail module thread to perform task processing based on the second network module CNN2.
It should be noted that new image frame data can be written into the input cache corresponding to the head module thread during the synchronization phase, and that, also through ping-pong cache swapping, the first processing result data output by the head module thread is exchanged, as exchange cache data, into the input cache corresponding to the tail module thread, serving as the tail module thread's input. As shown in FIG. 3, a pre-set synchronization barrier can be used to swap the first processing result data that the head module thread output to its output cache into the input cache corresponding to the tail module thread, as the tail module thread's input, so that the tail module thread can obtain from its input cache the first processing result data produced by the head network module and proceed with the next step of task processing.
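A minimal sketch of this ping-pong exchange under a synchronization barrier, using C++20's std::barrier whose completion function swaps the two caches once both threads arrive; the buffer payload, frame count, and names are assumptions, not the patent's implementation:

```cpp
#include <barrier>   // C++20
#include <thread>
#include <utility>
#include <vector>

std::vector<float> headOut;  // output cache of the head module thread (CNN1)
std::vector<float> tailIn;   // input cache of the tail module thread (CNN2)

// The completion function runs exactly once per synchronization phase,
// after both threads arrive: it swaps CNN1's output into CNN2's input.
std::barrier syncPoint(2, []() noexcept { std::swap(headOut, tailIn); });

void headThread() {
    for (int frame = 0; frame < 100; ++frame) {
        // ... pre-process the new frame, run CNN1, write into headOut ...
        syncPoint.arrive_and_wait();  // synchronization phase
    }
}

void tailThread() {
    for (int frame = 0; frame < 100; ++frame) {
        syncPoint.arrive_and_wait();  // wait for the swapped-in data
        // ... run CNN2 on tailIn (the CNN1 output just swapped in) ...
    }
}

int main() {
    std::thread t1(headThread), t2(tailThread);
    t1.join();
    t2.join();
    return 0;
}
```

Because each cache has exactly one producer and one consumer and the swap happens inside the barrier's completion step, no further locking is needed for the exchange itself.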
In an embodiment, the neural network in embodiments of the present application may be a convolutional neural network, and the acquired source input data may be image frame data. Correspondingly, the pre-processing above may be image pre-processing and the post-processing may be image post-processing. After image post-processing, other processing such as image rendering may also be performed; embodiments of the present application place no restriction on this.
In an embodiment, the module threads may further include at least one middle module thread. For example, the network modules obtained after partitioning the neural network may further include at least one middle network module; the at least two module threads scheduled in parallel may therefore further include at least one middle module thread corresponding to a middle network module, and the middle module thread may be used to execute the tasks of the middle network module. The above scheduling, according to at least two triggered thread tasks, of the corresponding at least two module threads in parallel to process the input data includes: scheduling, according to a triggered thread task, at least one middle module thread to process the input data.
In an embodiment, scheduling at least one middle module thread to process the input data according to a triggered thread task includes: calling the at least one middle module thread to obtain the corresponding cache-exchange data from the cache as input data, perform task processing based on the middle network module of the neural network, and output the processing result data to the cache as cache-exchange data. For example, in embodiments of the present application, during task processing, a middle thread task can be triggered according to the cache-exchange data output by the head module thread, and according to the triggered middle thread task, the middle module thread corresponding to the middle network module of the neural network can be called to perform task processing on the cache-exchange data output by the head module thread, producing middle processing result data; the middle processing result data is output to the cache as cache-exchange data for exchange with the next middle module thread or the tail module thread.
Since the network modules obtained after partitioning the neural network can correspond one-to-one with module threads, as shown in FIG. 4, after the network layers of a convolutional neural network are divided into three network modules, the first network module CNN1 obtained may be called the head network module, the second network module CNN2 the middle network module, and the third network module CNN3 the tail network module; correspondingly, the module thread corresponding to the first network module CNN1 may be called the head module thread, the module thread corresponding to the second network module CNN2 the middle module thread, and the module thread corresponding to the third network module CNN3 the tail module thread. In addition, since the connection between network modules and module threads is one-to-one, the data exchange between module threads can be carried out simply by monitoring the unique consumer thread associated with each of them.
For example, as shown in FIG. 4, the head module thread performs image pre-processing on the input image frame data, performs task processing based on the first network module CNN1 of the neural network to produce first processing result data, and may output the first processing result data to cache A. The middle module thread can obtain the cache-exchange data written into the target cache space corresponding to the thread task it needs to execute, perform task processing based on the second network module CNN2 of the neural network to produce middle processing result data, and may output the middle processing result data to cache B. The tail module thread can obtain the cache-exchange data written into the target cache space corresponding to the thread task it needs to execute (i.e., cache C in FIG. 4), perform task processing based on the third network module CNN3 of the neural network, then apply image post-processing and rendering to the resulting processing result data to obtain the processing result of the source input data, and produce output according to that processing result, for example displaying it in the application interface of the convolutional-neural-network-related application, and so on.
During data exchange, the exchanges may proceed in order from back to front. As shown in FIG. 4, cache B and cache C can be swapped first, and then cache A can be swapped with the cache C that has been exchanged to the second thread; that is, the order of data exchange can be the reverse of the execution order of the module threads. In an embodiment, the time cost of the second network module CNN2 may be roughly equal to that of the first network module CNN1 plus the image pre-processing, or roughly equal to that of the third network module CNN3 plus the post-processing, so that the three triggered module threads reach maximum efficiency.
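The reverse-order exchange could look like the following sketch, with the cache letters following FIG. 4 (the payload type is an assumption):

```cpp
#include <utility>
#include <vector>

// Synchronization-phase exchange for three stages, performed back to front:
// B <-> C hands CNN2's result to CNN3, then A <-> B hands CNN1's result to CNN2.
void syncSwap(std::vector<float>& cacheA,
              std::vector<float>& cacheB,
              std::vector<float>& cacheC) {
    std::swap(cacheB, cacheC);
    std::swap(cacheA, cacheB);
}
```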
In an embodiment, acquiring input data includes: in a case where it is detected that data has been written into the target cache space read by a thread task, determining that the trigger condition of the thread task has been produced, and using the data written into the target cache space as input data, where module threads are bound one-to-one to thread tasks and each thread task is bound to a target cache space. For example, in embodiments of the present application, whether the trigger condition of a thread task is produced can be determined by whether data is written into the target cache space read by that thread task; in a case where a write into the target cache space read by the thread task is detected, the trigger condition of the thread task can be determined to have been produced, and the data written into the target cache space can be used as input data, so as to trigger the corresponding thread task according to that input data. After the input data is obtained from the target cache space, the corresponding module threads can be scheduled in parallel according to the triggered thread tasks to complete the task processing of the input data.
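One plausible realization of this trigger condition is a target cache guarded by a condition variable, where a write wakes the single consumer thread bound to that cache; the class and member names are illustrative, not taken from the patent:

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <vector>

// A target cache space bound to one thread task: writing data into it is
// the trigger condition for the module thread bound to that task.
class TargetCache {
public:
    void write(std::vector<float> data) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            data_ = std::move(data);
        }
        cv_.notify_one();  // wake the unique consumer bound to this cache
    }

    std::vector<float> waitAndTake() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return data_.has_value(); });
        std::vector<float> out = std::move(*data_);
        data_.reset();
        return out;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::optional<std::vector<float>> data_;
};
```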
In summary, embodiments of the present application divide the complete designed network into multiple network modules with similar time costs, each of which can correspond one-to-one with a module thread, so that the tasks of different network modules can be distributed to different module threads for execution, and each module thread can be assigned to a different processor core. Different processor cores can then execute the tasks of different network modules, making full use of hardware resources — for example the device's graphics processor resources or the multi-core performance of mobile devices such as phones — so that even real-time applications can fully exploit multi-core computing power for acceleration, increasing throughput and largely hiding the time cost of the network computation, for example improving the running efficiency of convolutional-neural-network-related applications on multi-core processors and accelerating CNN-related applications to meet real-time requirements.
In addition, embodiments of the present application can treat the pre- and post-processing parts required by the neural network application as third-party modules and include them in the network layer partitioning, reducing the overhead of data processing, display, rendering, and the like. Moreover, since module threads are bound to thread tasks in embodiments of the present application, data synchronization between different module threads can be completed with simple ping-pong caching, reducing the data synchronization overhead among multiple threads and avoiding the inefficiency of multi-threading caused by excessive synchronization overhead, while supporting applications that require synchronization without changing the execution order of the tasks corresponding to the image frame data.
It should be noted that, for simplicity of description, the method embodiments are expressed as a series of action combinations; however, those skilled in the art should understand that the embodiments of the present application are not limited by the described order of actions, because according to the embodiments of the present application, some steps may be performed in other orders or simultaneously.
Referring to FIG. 5, a structural block diagram of an embodiment of a task processing apparatus based on a neural network in an embodiment of the present application is shown; the task processing apparatus based on a neural network includes an input data acquisition module 510, a module thread scheduling module 520, and a processing result data output module 530.
The input data acquisition module 510 is configured to acquire input data, where the input data is used to trigger thread tasks, and the input data is source input data or cache-exchange data.
The module thread scheduling module 520 is configured to schedule, according to at least two triggered thread tasks, the corresponding at least two module threads in parallel to process the input data and produce processing result data, where the at least two module threads respectively correspond to at least two network modules obtained by partitioning the network layers of the neural network.
The processing result data output module 530 is configured to output the processing result data to a cache as cache-exchange data for module threads other than the at least two module threads, or to output the processing result data as the processing result of the source input data.
In an embodiment, the above task processing apparatus based on a neural network further includes a network layer partitioning module, configured to partition the network layers of the neural network to obtain at least two network modules.
In an embodiment, the network layer partitioning module may include a channel number determination submodule, a network layer partitioning submodule, and a network module generation module.
The channel number determination submodule is configured to determine the number of channels between every two adjacent network layers of the neural network.
The network layer partitioning submodule is configured to, in a case where the number of channels between the adjacent network layers is one, divide the former network layer of the adjacent network layers into the input layer of an input network module, and divide the latter network layer of the adjacent network layers into the output layer of an output network module.
The network module generation module is configured to generate at least two network modules based on the output layer of the output network module and the input layer of the input network module.
In an embodiment, the difference in processing time between network modules is smaller than a set threshold.
In an embodiment, the above at least two module threads may include at least a head module thread and a tail module thread. The module thread scheduling module 520 is configured to schedule, according to a triggered thread task, the head module thread to process the input data, and to schedule, according to a triggered thread task, the tail module thread to process the input data. Scheduling the head module thread to process the input data according to a triggered thread task may include: calling the head module thread to pre-process the source input data, perform task processing based on the head network module of the neural network corresponding to the head module thread, and output the processing result data to the cache as cache-exchange data. Scheduling the tail module thread to process the input data according to a triggered thread task includes: calling the tail module thread to obtain the corresponding cache-exchange data from the cache as input data, perform task processing based on the tail network module of the neural network corresponding to the tail module thread, and post-process the processing result data and output it, with the processing result data serving as the processing result of the source input data.
In an embodiment, the module thread scheduling module 520 includes a head module thread scheduling submodule and a tail module thread scheduling submodule.
The head module thread scheduling submodule is configured to call the head module thread to pre-process the source input data, perform task processing based on the head network module of the neural network corresponding to the head module thread, and output the processing result data to the cache as cache-exchange data.
The tail module thread scheduling submodule is configured to call the tail module thread to obtain the corresponding cache-exchange data from the cache as input data, perform task processing based on the tail network module of the neural network corresponding to the tail module thread, and post-process the processing result data and output it, with the processing result data serving as the processing result of the source input data.
Of course, the module thread scheduling module 520 in embodiments of the present application may further include other submodules; for example, it may further include at least one middle module thread, so as to schedule, according to a triggered thread task, the middle module thread to process the input data and produce the processing result data corresponding to the middle module thread, and so on. In an embodiment, the processing result data corresponding to the middle module thread may be output to the cache as cache-exchange data, serving as the input data of the next module thread associated with that middle module thread, so that the next module thread can obtain the cache-exchange data for task processing.
In an embodiment, the module threads further include at least one middle module thread. The module thread scheduling module 520 is configured to schedule, according to a triggered thread task, at least one middle module thread to process the input data, which includes: calling the at least one middle module thread to obtain the corresponding cache-exchange data from the cache as input data, perform task processing based on the middle network module of the neural network, and output the processing result data to the cache as cache-exchange data.
The module thread scheduling module 520 may further include a middle module thread scheduling submodule, configured to call the middle module thread to obtain the corresponding cache-exchange data from the cache as input data, perform task processing based on the middle network module of the neural network, and output the processing result data to the cache as cache-exchange data.
In an embodiment, the above source input data may be image frame data, the above pre-processing may be image pre-processing, the above post-processing may be image rendering processing, and the above neural network may be a convolutional neural network.
In an embodiment, the input data acquisition module 510 includes a monitoring submodule and a determination submodule.
The monitoring submodule is configured to monitor the target cache space read by a thread task, for example monitoring whether data is written into the target cache space read by the thread task.
The determination submodule is configured to, in a case where the monitoring submodule detects that data has been written into the target cache space read by a thread task, determine that the trigger condition of the thread task has been produced and use the data written into the target cache space as input data, where module threads are bound one-to-one to thread tasks and each thread task is bound to a target cache space.
In an embodiment, thread tasks may be bound one-to-one to processor cores.
It should be noted that the task processing apparatus based on a neural network provided above can perform the task processing method based on a neural network provided by any embodiment of the present application.
In an embodiment, the above task processing apparatus based on a neural network may be integrated into a device. The device may consist of at least two physical entities or of one physical entity; for example, the device may be a personal computer (PC), a computer, a mobile phone, a tablet device, a personal digital assistant, a server, a messaging device, a game console, or the like.
An embodiment of the present application further provides a device, including a processor and a memory. The memory stores at least one instruction, and the instruction is executed by the processor, causing the device to perform the task processing method based on a neural network described in the above method embodiments.
Referring to FIG. 6, a structural schematic diagram of a device in an example of the present application is shown. As shown in FIG. 6, the device includes a processor 60, a memory 61, a display screen 62 with a touch function, an input apparatus 63, an output apparatus 64, and a communication apparatus 65. The number of processors 60 in the device may be at least one, and one processor 60 is taken as an example in FIG. 6; the number of memories 61 in the device may be at least one, and one memory 61 is taken as an example in FIG. 6. The processor 60, memory 61, display screen 62, input apparatus 63, output apparatus 64, and communication apparatus 65 of the device may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 6.
As a computer-readable storage medium, the memory 61 is configured to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the task processing method described in any embodiment of the present application (for example, the input data acquisition module 510, module thread scheduling module 520, and processing result data output module 530 of the above task processing apparatus). The memory 61 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system and the application programs required by at least one function, and the data storage area may store data created according to the use of the device, and the like. In addition, the memory 61 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 61 further includes memories remotely arranged relative to the processor 60, and these remote memories may be connected to the device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The display screen 62 is a display screen 62 with a touch function, which may be a capacitive screen, an electromagnetic screen, or an infrared screen. Generally speaking, the display screen 62 is configured to display data according to instructions from the processor 60, and is further configured to receive touch operations acting on the display screen 62 and send the corresponding signals to the processor 60 or other apparatuses. For example, in a case where the display screen 62 is an infrared screen, it further includes an infrared touch frame arranged around the display screen 62, which is further configured to receive infrared signals and send them to the processor 60 or other devices.
The communication apparatus 65 is configured to establish communication connections with other devices, and may be at least one of a wired communication apparatus and a wireless communication apparatus.
The input apparatus 63 is configured to receive input digital or character information and to produce key signal input related to the user settings and function control of the device, and further includes a camera for acquiring images and a sound pickup device for acquiring audio data. The output apparatus 64 may include audio devices such as a speaker. It should be noted that the composition of the input apparatus 63 and the output apparatus 64 can be set according to actual circumstances.
The processor 60 executes the device's various functional applications and data processing by running the software programs, instructions, and modules stored in the memory 61, thereby implementing the above task processing method based on a neural network.
In an embodiment, when executing at least one program stored in the memory 61, the processor 60 implements the following operations: acquiring input data, where the input data is used to trigger thread tasks, and the input data is source input data or cache-exchange data; scheduling, according to at least two triggered thread tasks, the corresponding at least two module threads in parallel to process the input data and produce processing result data, where the at least two module threads respectively correspond to at least two network modules obtained by partitioning the network layers of the neural network; and outputting the processing result data to a cache as cache-exchange data for module threads other than the at least two module threads, or outputting the processing result data as the processing result of the source input data.
An embodiment of the present application further provides a computer-readable storage medium. When the instructions in the storage medium are executed by a processor of a device, the device is enabled to perform the task processing method based on a neural network described in the above method embodiments. For example, the task processing method based on a neural network includes: acquiring input data, where the input data is used to trigger thread tasks, and the input data is source input data or cache-exchange data; scheduling, according to at least two triggered thread tasks, the corresponding at least two module threads in parallel to process the input data and produce processing result data, where the at least two module threads respectively correspond to at least two network modules obtained by partitioning the network layers of the neural network; and outputting the processing result data to a cache as cache-exchange data for module threads other than the at least two module threads, or outputting the processing result data as the processing result of the source input data.
It should be noted that, as the apparatus, device, and storage medium embodiments are substantially similar to the method embodiments, their descriptions are relatively brief; for relevant parts, refer to the descriptions in the method embodiments.
Through the above description of the embodiments, those skilled in the art can clearly understand that the present application can be implemented by means of software plus the necessary general-purpose hardware, and of course can also be implemented by hardware. Based on this understanding, the technical solutions of the present application, in essence or in the part contributing to the related art, can be embodied in the form of a software product; the computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk, or optical disc, and includes several instructions for causing a computer device (which may be a robot, a personal computer, a server, a network device, or the like) to perform the task processing method based on a neural network described in any embodiment of the present application.
It is worth noting that, in the above task processing apparatus based on a neural network, the units and modules included are divided only according to functional logic, but the division is not limited to the above as long as the corresponding functions can be realized; in addition, the names of the functional units are only for ease of mutual distinction and are not intended to limit the protection scope of the present application.
It should be understood that parts of the present application can be implemented in hardware, software, firmware, or a combination thereof. In the above implementations, multiple steps or methods can be implemented with software or firmware stored in a memory and executed by a suitable instruction execution apparatus. For example, if implemented in hardware, as in another implementation, any one or a combination of the following technologies known in the art can be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
In the description of this specification, descriptions referring to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" mean that the specific features, structures, materials, or characteristics described in connection with the embodiment or example are included in at least one embodiment or example of the present application. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in at least one embodiment or example.

Claims (12)

  1. A task processing method based on a neural network, comprising:
    acquiring input data, wherein the input data is used to trigger thread tasks, and the input data is source input data or cache-exchange data;
    scheduling, according to at least two triggered thread tasks, the corresponding at least two module threads in parallel to process the input data and produce processing result data, wherein the at least two module threads respectively correspond to at least two network modules obtained by partitioning network layers of the neural network; and
    outputting the processing result data to a cache as cache-exchange data for module threads other than the at least two module threads, or outputting the processing result data as a processing result of the source input data.
  2. The method according to claim 1, before the acquiring of input data, further comprising:
    partitioning the network layers of the neural network to obtain at least two network modules.
  3. The method according to claim 2, wherein partitioning the network layers of the neural network to obtain at least two network modules comprises:
    determining a number of channels between every two adjacent network layers of the neural network;
    in a case where the number of channels between the adjacent network layers is one, dividing the former network layer of the adjacent network layers into an input layer of an input network module, and dividing the latter network layer of the adjacent network layers into an output layer of an output network module; and
    generating at least two network modules based on the output layer of the output network module and the input layer of the input network module.
  4. The method according to claim 1, wherein a difference in processing time between network modules is smaller than a set threshold.
  5. The method according to claim 1, wherein the at least two module threads comprise at least a head module thread and a tail module thread;
    scheduling, according to a triggered thread task, the head module thread to process input data comprises: calling the head module thread to pre-process the source input data, perform task processing based on a head network module of the neural network corresponding to the head module thread, and output the processing result data to the cache as cache-exchange data; and
    scheduling, according to a triggered thread task, the tail module thread to process input data comprises: calling the tail module thread to obtain the corresponding cache-exchange data from the cache as input data, perform task processing based on a tail network module of the neural network corresponding to the tail module thread, and post-process the processing result data and output it, with the processing result data serving as the processing result of the source input data.
  6. The method according to claim 5, wherein the module threads further comprise at least one middle module thread, and scheduling, according to a triggered thread task, the at least one middle module thread to process input data comprises:
    calling the at least one middle module thread to obtain the corresponding cache-exchange data from the cache as input data, perform task processing based on a middle network module of the neural network corresponding to the middle module thread, and output the processing result data to the cache as cache-exchange data.
  7. The method according to claim 5 or 6, wherein the source input data is image frame data, the pre-processing is image pre-processing, the post-processing is image post-processing, and the neural network is a convolutional neural network.
  8. The method according to claim 1, wherein the acquiring of input data comprises:
    in a case where it is detected that data has been written into a target cache space read by the thread task, determining that a trigger condition of the thread task is produced, and using the data written into the target cache space as input data, wherein the module threads are bound one-to-one to the thread tasks, and the thread task is bound to the target cache space.
  9. The method according to claim 1, wherein the module threads are bound one-to-one to processor cores.
  10. A task processing apparatus based on a neural network, comprising:
    an input data acquisition module, configured to acquire input data, wherein the input data is used to trigger thread tasks, and the input data is source input data or cache-exchange data;
    a module thread scheduling module, configured to schedule, according to at least two triggered thread tasks, the corresponding at least two module threads in parallel to process the input data and produce processing result data, wherein the at least two module threads respectively correspond to at least two network modules obtained by partitioning network layers of the neural network; and
    a processing result data output module, configured to output the processing result data to a cache as cache-exchange data for module threads other than the at least two module threads, or to output the processing result data as a processing result of the source input data.
  11. A device, comprising: a processor and a memory;
    wherein the memory stores at least one instruction, and the instruction is executed by the processor, causing the device to perform the task processing method based on a neural network according to any one of claims 1 to 9.
  12. A computer-readable storage medium, wherein, when instructions in the storage medium are executed by a processor of a device, the device is enabled to perform the task processing method based on a neural network according to any one of claims 1 to 9.
PCT/CN2019/102139 2018-10-10 2019-08-23 Task processing method based on neural network, and related device WO2020073742A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US17/284,201 US20210357759A1 (en) 2018-10-10 2019-08-23 Task processing method and device based on neural network
SG11202103656SA SG11202103656SA (en) 2018-10-10 2019-08-23 Task processing method based on neural network, and related device
RU2021112964A RU2771008C1 (ru) 2018-10-10 2019-08-23 Способ и устройство для обработки задач на основе нейронной сети

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811180174.5 2018-10-10
CN201811180174.5A CN109409513B (zh) 2018-10-10 2018-10-10 Task processing method based on neural network, and related device

Publications (1)

Publication Number Publication Date
WO2020073742A1 true WO2020073742A1 (zh) 2020-04-16

Family

ID=65467469

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/102139 WO2020073742A1 (zh) 2018-10-10 2019-08-23 Task processing method based on neural network, and related device

Country Status (5)

Country Link
US (1) US20210357759A1 (zh)
CN (1) CN109409513B (zh)
RU (1) RU2771008C1 (zh)
SG (1) SG11202103656SA (zh)
WO (1) WO2020073742A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220207783A1 (en) * 2020-12-30 2022-06-30 Advanced Micro Devices, Inc. Real-time low latency computer vision/machine learning compute accelerator with smart convolutional neural network scheduler
CN117193992A (zh) * 2023-11-08 2023-12-08 浙江大华技术股份有限公司 Model training method, task scheduling method, apparatus and computer storage medium

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11334329B2 (en) * 2018-06-08 2022-05-17 Shanghai Cambricon Information Technology Co., Ltd. General machine learning model, and model file generation and parsing method
CN109409513B (zh) * 2018-10-10 2021-03-12 广州市百果园信息技术有限公司 Task processing method based on neural network, and related device
CN111723900B (zh) * 2019-03-18 2023-10-20 北京灵汐科技有限公司 Neural network mapping method based on many-core processor, and computing device
CN111723919A (zh) * 2019-03-21 2020-09-29 中科寒武纪科技股份有限公司 Data processing method and apparatus, and related product
CN111723916A (zh) * 2019-03-21 2020-09-29 中科寒武纪科技股份有限公司 Data processing method and apparatus, and related product
CN110187965B (zh) * 2019-05-08 2021-02-12 深圳大学 Operation optimization and data processing method for neural network, device, and storage medium
CN112418389A (zh) * 2019-08-23 2021-02-26 北京希姆计算科技有限公司 Data processing method and apparatus, electronic device, and computer-readable storage medium
CN111091182A (zh) * 2019-12-16 2020-05-01 北京澎思科技有限公司 Data processing method, electronic device, and storage medium
CN111985634B (zh) * 2020-08-21 2024-06-14 北京灵汐科技有限公司 Neural network operation method and apparatus, computer device, and storage medium
CN112099850A (zh) * 2020-09-10 2020-12-18 济南浪潮高新科技投资发展有限公司 Multi-core Hourglass network acceleration method
CN113905273B (zh) * 2021-09-29 2024-05-17 上海阵量智能科技有限公司 Task execution method and device
CN114518917B (zh) * 2022-04-20 2022-08-09 浙江大华技术股份有限公司 Algorithm module scheduling method, algorithm module scheduling apparatus, and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657111A (zh) * 2013-11-20 2015-05-27 方正信息产业控股有限公司 Parallel computing method and device
CN106875013A (zh) * 2015-12-11 2017-06-20 百度(美国)有限责任公司 System and method for multi-core optimized recurrent neural networks
CN108196882A (zh) * 2017-12-29 2018-06-22 普强信息技术(北京)有限公司 Acceleration method and device for neural network computation
WO2018125462A1 (en) * 2016-12-30 2018-07-05 Intel Corporation Intelligent packet aggregation
CN109409513A (zh) * 2018-10-10 2019-03-01 广州市百果园信息技术有限公司 Task processing method based on neural network, and related device

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2179739C2 * 2000-04-10 2002-02-20 Омский государственный технический университет Information processing device
CN101819651A * 2010-04-16 2010-09-01 浙江大学 Method for parallel execution of a particle swarm algorithm on multiple machines
CN104035751B * 2014-06-20 2016-10-12 深圳市腾讯计算机系统有限公司 Data parallel processing method and apparatus based on multiple graphics processors
CN104899561A * 2015-05-27 2015-09-09 华南理工大学 Parallelized human behavior recognition method
CN105869117B * 2016-03-28 2021-04-02 上海交通大学 GPU acceleration method for deep-learning super-resolution technology
US10796220B2 (en) * 2016-05-24 2020-10-06 Marvell Asia Pte, Ltd. Systems and methods for vectorized FFT for multi-dimensional convolution operations
CN106650925A * 2016-11-29 2017-05-10 郑州云海信息技术有限公司 Deep learning framework Caffe system and algorithm based on an MIC cluster
CN106682729A * 2016-12-13 2017-05-17 西北师范大学 MapReduce training method for BP neural network based on locally convergent weight matrix evolution
CN106909971A * 2017-02-10 2017-06-30 华南理工大学 BP neural network parallelization method for multi-core computing environments
US10878314B2 (en) * 2017-03-09 2020-12-29 Alphaics Corporation System and method for training artificial intelligence systems using a SIMA based processor
CN107730905A * 2017-06-13 2018-02-23 银江股份有限公司 Multi-task fake-plate vehicle visual detection system and method based on a deep convolutional neural network
CN107491811A * 2017-09-01 2017-12-19 中国科学院计算技术研究所 Method and system for accelerating a neural network processor, and neural network processor

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657111A (zh) * 2013-11-20 2015-05-27 方正信息产业控股有限公司 Parallel computing method and device
CN106875013A (zh) * 2015-12-11 2017-06-20 百度(美国)有限责任公司 System and method for multi-core optimized recurrent neural networks
WO2018125462A1 (en) * 2016-12-30 2018-07-05 Intel Corporation Intelligent packet aggregation
CN108196882A (zh) * 2017-12-29 2018-06-22 普强信息技术(北京)有限公司 Acceleration method and device for neural network computation
CN109409513A (zh) * 2018-10-10 2019-03-01 广州市百果园信息技术有限公司 Task processing method based on neural network, and related device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220207783A1 (en) * 2020-12-30 2022-06-30 Advanced Micro Devices, Inc. Real-time low latency computer vision/machine learning compute accelerator with smart convolutional neural network scheduler
US11816871B2 (en) * 2020-12-30 2023-11-14 Advanced Micro Devices, Inc. Real-time low latency computer vision/machine learning compute accelerator with smart convolutional neural network scheduler
CN117193992A (zh) * 2023-11-08 2023-12-08 浙江大华技术股份有限公司 Model training method, task scheduling method, apparatus and computer storage medium
CN117193992B (zh) * 2023-11-08 2024-02-02 浙江大华技术股份有限公司 Model training method, task scheduling method, apparatus and computer storage medium

Also Published As

Publication number Publication date
CN109409513A (zh) 2019-03-01
SG11202103656SA (en) 2021-05-28
US20210357759A1 (en) 2021-11-18
CN109409513B (zh) 2021-03-12
RU2771008C1 (ru) 2022-04-25

Similar Documents

Publication Publication Date Title
WO2020073742A1 (zh) 2020-04-16 Task processing method based on neural network, and related device
CN106358003B (zh) Video analysis acceleration method based on thread-level pipelining
CN108734288B (zh) Operation method and apparatus
TWI585680B (zh) Method, apparatus and system for thread queue arrangement (II)
US11544525B2 (en) Systems and methods for artificial intelligence with a flexible hardware processing framework
US20220035544A1 (en) Memory allocation method and device, and electronic apparatus
US20170097854A1 (en) Task placement for related tasks in a cluster based multi-core system
WO2022048097A1 (zh) Real-time single-frame rendering method based on multiple graphics cards
CN110968423A (zh) Method and device for distributing workloads to accelerators using machine learning
CN107450971A (zh) Task processing method and apparatus
JP2009075888A5 (zh)
CN108304925B (zh) Pooling computation apparatus and method
CN113592066B (zh) Hardware acceleration method, apparatus, device and storage medium
Elliott et al. Supporting real-time computer vision workloads using OpenVX on multicore+ GPU platforms
US20130207983A1 (en) Central processing unit, gpu simulation method thereof, and computing system including the same
US9274831B2 (en) Information processing apparatus, information processing method, and storage medium
KR20170125881A (ko) Providing asynchronous display shader functionality in a shared shader core
US20210232921A1 (en) Methods and systems for managing processing of neural network across heterogeneous processors
CN109903359A (zh) Particle display method and apparatus, mobile terminal and storage medium
CN115409681A (zh) Rendering method and related apparatus
CN114239669A (zh) Data reuse method based on operator fusion on heterogeneous many-core architectures
WO2021047118A1 (zh) Picture processing method, apparatus and system
CN112732634B (zh) ARM-FPGA collaborative local dynamic reconfiguration processing method for edge computing
WO2023280208A1 (zh) Data processing method, execution workstation, electronic device and storage medium
CN110018782B (zh) Data read/write method and related apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19870119

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19870119

Country of ref document: EP

Kind code of ref document: A1