Configuration method and device of deep learning network model and storage medium (CN111914985A)

Info

Publication number: CN111914985A (granted as CN111914985B)
Application number: CN201910388839.XA
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: neural network, layer, deep learning, data, GPU
Inventors: 屠震元, 叶挺群
Applicant and current assignee: Hangzhou Hikvision Digital Technology Co Ltd
Legal status: Active (granted)

Classifications

    • G06N 3/045 — Computing arrangements based on biological models; neural networks; architecture: combinations of networks
    • G06F 9/5027 — Arrangements for program control; multiprogramming arrangements; allocation of resources (e.g. of the CPU) to service a request, the resource being a machine (e.g. CPUs, servers, terminals)
    • Y02D 10/00 — Climate change mitigation technologies in ICT; energy efficient computing (e.g. low power processors, power management or thermal management)

Abstract

The application discloses a configuration method and device of a deep learning network model and a computer storage medium, belonging to the field of deep learning. The configuration method provided by the embodiment of the application first acquires the hardware information of the current device, then determines the input configuration information of the multilayer neural network included in the deep learning network model according to that hardware information, and further allocates operation resources to the deep learning network model according to the input configuration information of the multilayer neural network and the hardware information of the current device. Therefore, even if the hardware condition of the device changes, the deep learning network model can still be configured according to the hardware information of the current device, that is, the configuration method provided by the application has universality.

Description

Configuration method and device of deep learning network model and storage medium
Technical Field
The present application relates to the field of deep learning technologies, and in particular, to a method and an apparatus for configuring a deep learning network model, and a storage medium.
Background
At present, when a deep learning network model is run on a device, a configuration method matched with the hardware condition of the current device is generally determined according to the hardware information of that device, and the deep learning network model is then configured according to the determined configuration method, so that the deep learning network model can operate on the basis of the configuration information obtained after configuration.
However, different devices often have different hardware conditions. Therefore, when the deep learning network model is run on another device, the configuration method determined according to the hardware information of the original device is often no longer suitable because the hardware conditions have changed. This means the configuration method needs to be re-determined for the hardware information of each device whenever a device is replaced. Therefore, the current method for configuring the deep learning network model lacks universality and has poor portability.
Disclosure of Invention
The embodiment of the application provides a method and a device for configuring a deep learning network model and a computer storage medium, which can be used to solve the problem that the method for configuring a deep learning network model in the related art is not universal and has poor portability. The technical scheme is as follows:
in one aspect, a method for configuring a deep learning network model is provided, where the method includes:
acquiring hardware information of current equipment;
determining input configuration information of a multilayer neural network included in a deep learning network model according to the hardware information of the current device, wherein the input configuration information comprises a data conversion identifier and an operation parameter, and the data conversion identifier is used for indicating whether to perform data format conversion on feature data input to the neural network;
and allocating operation resources for the deep learning network model according to the hardware information of the current equipment and the input configuration information of the multilayer neural network included by the deep learning network model.
Optionally, the determining, according to the hardware information of the current device, input configuration information of a multi-layer neural network included in the deep learning network model includes:
determining the channel alignment number of the current equipment according to the hardware information of the current equipment;
acquiring the number of input data channels of each layer of neural network in the multilayer neural network included in the deep learning network model;
determining a data conversion identifier of each layer of neural network according to the channel alignment number of the current equipment and the input data channel number of each layer of neural network;
and determining the input configuration information of each layer of neural network according to the data conversion identification of each layer of neural network and the weight data of each layer of neural network.
Optionally, the determining a data conversion identifier of each layer of neural network according to the number of channel alignments of the current device and the number of input data channels of each layer of neural network includes:
if the channel alignment number of the current device is the same as the input data channel number of a target layer neural network, determining that a data conversion identifier of the target layer neural network is a first identifier, wherein the target layer neural network refers to any one layer of neural networks in a plurality of layers of neural networks included in the deep learning network model, and the first identifier is used for indicating that data format conversion is not performed on feature data input into the target layer neural network;
the determining the input configuration information of each layer of neural network of the deep learning network model according to the data conversion identification and the weight data of each layer of neural network comprises the following steps:
and taking the data conversion identifier and the weight data of the target layer neural network as the input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information refer to the weight data.
Optionally, the determining a data conversion identifier of each layer of neural network according to the number of channel alignments of the current device and the number of input data channels of each layer of neural network includes:
if the channel alignment number of the current device is different from the input data channel number of a target layer neural network, determining that a data conversion identifier of the target layer neural network is a second identifier, wherein the target layer neural network is any one layer of neural networks in the multilayer neural networks included in the deep learning network model, and the second identifier is used for indicating that data format conversion is performed on feature data input into the target layer neural network;
the determining the input configuration information of each layer of neural network of the deep learning network model according to the data conversion identification and the weight data of each layer of neural network comprises the following steps:
performing data format conversion on the weight data of the target layer neural network according to the channel alignment number;
and taking the converted weight data and the data conversion identifier of the target layer neural network as input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information refer to the converted weight data.
Optionally, the method further comprises:
and if the target layer neural network is not any one of a Reshape layer, a full connection layer, a batch normalization BN layer and a scale layer, or if the target layer neural network is a connection layer and the next layer neural network of the target layer neural network is not a Reshape layer, executing a step of determining the data conversion identifier of each layer of neural network according to the channel alignment number of the current equipment and the input data channel number of each layer of neural network.
Optionally, the method further comprises:
if the target layer neural network is any one of a Reshape layer, a full connection layer, a BN layer and a scale layer, or if the target layer neural network is a connection layer and a next layer neural network of the target layer neural network is a Reshape layer, determining that a data conversion identifier of the target layer neural network is a first identifier;
and taking the data conversion identifier and the weight data of the target layer neural network as the input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information refer to the weight data.
Optionally, the allocating, according to the hardware information of the current device and the input configuration information of the multilayer neural network included in the deep learning network model, an operation resource to the deep learning network model includes:
determining the type identifier of the operation hardware of the current equipment according to the hardware information of the current equipment, wherein the operation hardware refers to hardware for realizing data operation of each layer of neural network in the deep learning network model;
and allocating operation resources for the deep learning network model according to the type identification of the operation hardware of the current equipment and the input configuration information of the multilayer neural network included by the deep learning network model.
Optionally, the allocating, according to the type identifier of the arithmetic hardware of the current device and the input configuration information of the multilayer neural network included in the deep learning network model, an arithmetic resource to the deep learning network model includes:
when the type identifier of the operation hardware of the current device indicates that the operation hardware is a graphics processing unit (GPU), creating a plurality of central processing unit (CPU) threads for the deep learning network model according to the input configuration information of the multilayer neural network included in the deep learning network model, wherein each CPU thread in the plurality of CPU threads comprises at least three GPU tasks, and the at least three GPU tasks are used for realizing the data operation of the deep learning network model based on the input configuration information of the multilayer neural network included in the deep learning network model;
and allocating a corresponding stream queue for each CPU thread, wherein the stream queue corresponding to each CPU thread comprises the at least three GPU tasks in the corresponding CPU thread, and the stream queues corresponding to different CPU threads are different.
Optionally, the GPU comprises a plurality of GPU thread blocks and a plurality of streaming multiprocessors (SMs), each thread block of the plurality of GPU thread blocks comprising a plurality of GPU threads;
after allocating the corresponding flow queue for each CPU thread, the method further includes:
distributing a corresponding GPU thread block for each GPU task according to the number of threads required for executing each GPU task and the number of a plurality of GPU threads included in each GPU thread block;
searching the plurality of SMs for an idle SM that is not currently executing a GPU thread block;
and determining, from the found idle SMs, the SM used for executing the GPU thread block corresponding to each GPU task.
In another aspect, an apparatus for configuring a deep learning network model is provided, the apparatus including:
the acquisition module is used for acquiring the hardware information of the current equipment;
the determining module is used for determining input configuration information of a multilayer neural network included in the deep learning network model according to the hardware information of the current device, wherein the input configuration information comprises a data conversion identifier and an operation parameter, and the data conversion identifier is used for indicating whether to perform data format conversion on feature data input to the neural network;
and the allocation module is used for allocating operation resources to the deep learning network model according to the hardware information of the current equipment and the input configuration information of the multilayer neural network included by the deep learning network model.
Optionally, the determining module includes:
the first determining submodule is used for determining the channel alignment number of the current equipment according to the hardware information of the current equipment;
the acquisition submodule is used for acquiring the number of input data channels of each layer of neural network in the multilayer neural network included in the deep learning network model;
the second determining submodule is used for determining the data conversion identifier of each layer of neural network according to the channel alignment number of the current equipment and the input data channel number of each layer of neural network;
and the third determining submodule is used for determining the input configuration information of each layer of neural network according to the data conversion identification of each layer of neural network and the weight data of each layer of neural network.
Optionally, the second determining submodule is specifically configured to:
if the channel alignment number of the current device is the same as the input data channel number of a target layer neural network, determining that a data conversion identifier of the target layer neural network is a first identifier, wherein the target layer neural network refers to any one layer of neural networks in a plurality of layers of neural networks included in the deep learning network model, and the first identifier is used for indicating that data format conversion is not performed on feature data input into the target layer neural network;
the determining the input configuration information of each layer of neural network of the deep learning network model according to the data conversion identification and the weight data of each layer of neural network comprises the following steps:
and taking the data conversion identifier and the weight data of the target layer neural network as the input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information refer to the weight data.
Optionally, the second determining submodule is specifically configured to:
if the channel alignment number of the current device is different from the input data channel number of a target layer neural network, determining that a data conversion identifier of the target layer neural network is a second identifier, wherein the target layer neural network is any one layer of neural networks in the multilayer neural networks included in the deep learning network model, and the second identifier is used for indicating that data format conversion is performed on feature data input into the target layer neural network;
the determining the input configuration information of each layer of neural network of the deep learning network model according to the data conversion identification and the weight data of each layer of neural network comprises the following steps:
performing data format conversion on the weight data of the target layer neural network according to the channel alignment number;
and taking the converted weight data and the data conversion identifier of the target layer neural network as input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information refer to the converted weight data.
Optionally, the apparatus is further configured to:
and if the target layer neural network is not any one of a Reshape layer, a full connection layer, a batch normalization BN layer and a scale layer, or if the target layer neural network is a connection layer and the next layer neural network of the target layer neural network is not a Reshape layer, executing a step of determining the data conversion identifier of each layer of neural network according to the channel alignment number of the current equipment and the input data channel number of each layer of neural network.
Optionally, the apparatus is further configured to:
if the target layer neural network is any one of a Reshape layer, a full connection layer, a BN layer and a scale layer, or if the target layer neural network is a connection layer and a next layer neural network of the target layer neural network is a Reshape layer, determining that a data conversion identifier of the target layer neural network is a first identifier;
and taking the data conversion identifier and the weight data of the target layer neural network as the input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information refer to the weight data.
Optionally, the allocation module comprises:
a fourth determining submodule, configured to determine, according to the hardware information of the current device, a type identifier of the operation hardware of the current device, where the operation hardware refers to the hardware used for implementing the data operation of each layer of the neural network in the deep learning network model;
and the allocation submodule is used for allocating operation resources for the deep learning network model according to the type identification of the operation hardware of the current equipment and the input configuration information of the multilayer neural network included by the deep learning network model.
Optionally, the allocation submodule is specifically configured to:
when the type identifier of the operation hardware of the current device indicates that the operation hardware is a graphics processing unit (GPU), creating a plurality of central processing unit (CPU) threads for the deep learning network model according to the input configuration information of the multilayer neural network included in the deep learning network model, wherein each CPU thread in the plurality of CPU threads comprises at least three GPU tasks, and the at least three GPU tasks are used for realizing the data operation of the deep learning network model based on the input configuration information of the multilayer neural network included in the deep learning network model;
and allocating a corresponding stream queue for each CPU thread, wherein the stream queue corresponding to each CPU thread comprises the at least three GPU tasks in the corresponding CPU thread, and the stream queues corresponding to different CPU threads are different.
Optionally, the GPU comprises a plurality of GPU thread blocks and a plurality of streaming multiprocessors (SMs), each thread block of the plurality of GPU thread blocks comprising a plurality of GPU threads;
the allocation submodule is further configured to:
distributing a corresponding GPU thread block for each GPU task according to the number of threads required for executing each GPU task and the number of a plurality of GPU threads included in each GPU thread block;
searching the plurality of SMs for an idle SM that is not currently executing a GPU thread block;
and determining, from the found idle SMs, the SM used for executing the GPU thread block corresponding to each GPU task.
In another aspect, an apparatus for configuring a deep learning network model is provided, the apparatus including a processor, a communication interface, a memory, and a communication bus;
the processor, the communication interface and the memory complete mutual communication through the communication bus;
the memory is used for storing computer programs;
the processor is used for executing the program stored on the memory to realize the steps of the configuration method of the deep learning network model.
In another aspect, a computer-readable storage medium is provided, in which a computer program is stored, which, when being executed by a processor, implements the steps of the method for configuring a deep learning network model provided in the foregoing.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
the configuration method provided by the embodiment of the application can firstly acquire the hardware information of the current device, then determines the input configuration information of the multilayer neural network included in the deep learning network model according to the hardware information of the current device, and further allocates the operation resources for the deep learning network according to the input configuration information of the multilayer neural network included in the deep learning network model and the hardware information of the current device. Therefore, even if the hardware condition of the device changes, the deep learning network model can be configured according to the hardware information of the current device by the configuration method provided by the application, namely, the configuration method provided by the application has universality.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained by those skilled in the art based on these drawings without creative effort.
FIG. 1 is a diagram of a software system architecture provided by an embodiment of the present application;
FIG. 2 is a flowchart of a method for configuring a deep learning network model according to an embodiment of the present application;
FIG. 3 is a diagram comparing the effects of resource allocation performed by the resource allocation method of the present application and by the resource allocation method of the related art, according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a configuration apparatus of a deep learning network model provided in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a determining module provided in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an allocation module provided in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an intelligent device for configuring a deep learning network model according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Before explaining the embodiments of the present application in detail, an application scenario related to the embodiments of the present application will be described.
At present, processing images, video data and the like through a deep learning network model has been widely applied in many fields. In an application, the deep learning network model may run on devices with different hardware conditions. For example, the deep learning network model may be run on a device using a CPU as computing hardware, may be run on a device using a GPU as computing hardware, or may be run on a device having other computing hardware. When the deep learning network model is run on different devices, different configuration methods are generally required to configure the deep learning network model, so that the deep learning network model can perform operations on the basis of configuration information obtained after configuration.
However, since there are many kinds of devices, the configuration methods for devices with different hardware conditions are various and not universal. At present, the same configuration method is generally adopted for devices with the same type of operation hardware; however, the specific parameters of the operation hardware differ, and their influence on the operation effect and delay of the deep learning network model also differs, so adopting the same configuration method may fail to utilize the operation hardware resources to the maximum extent, resulting in a waste of hardware resources. Based on this, the embodiment of the application provides a configuration method of a deep learning network model. The configuration method can be used universally on various devices with different hardware conditions to configure the deep learning network model, and it can configure the deep learning network model according to the hardware information of the current device so as to utilize hardware resources to the maximum extent.
Fig. 1 is an architecture diagram of a software system 100 provided in an embodiment of the present application, where the software system may run on a terminal, a server, and the like, and a configuration method of a deep learning network model provided in the embodiment of the present application may be implemented by each module in the software system. As shown in fig. 1, the software system includes a decoding module 101, a hardware matching module 102, a data format conversion module 103, a device scheduling and management module 104, and a configuration information generation module 105.
The decoding module 101 is configured to receive input image data and decode the received image data. The image data may be an input image or a video.
The hardware matching module 102 is configured to obtain hardware information of the current device. Moreover, in this embodiment of the application, the hardware matching module 102 may be further configured to determine, according to the acquired hardware information, whether the current device is supported.
The data format conversion module 103 is configured to convert the format of the data decoded by the decoding module 101 according to the hardware information of the current device acquired by the hardware matching module 102, and to determine the weight data of each layer of neural network in the multi-layer neural network of the deep learning network model according to the converted data. Meanwhile, the data format conversion module 103 may also set the data conversion identifier for each layer of neural network.
The device scheduling and managing module 104 is configured to allocate operation resources to the deep learning network model according to the hardware information of the current device and the input configuration information of the multi-layer neural network included in the deep learning network model.
The configuration information generating module 105 is configured to generate configuration information of the deep learning network model according to the weight data of each layer of neural network determined by the data format converting module 103, the data conversion identifier, and the operation resource allocated to the deep learning network model.
Next, a method for configuring a deep learning network model provided in the embodiment of the present application is described.
Fig. 2 is a flowchart of a configuration method of a deep learning network model according to an embodiment of the present application. The method can be applied to devices such as terminals, servers and the like, and comprises the following steps as shown in fig. 2:
step 201: and acquiring the hardware information of the current equipment.
In the embodiment of the application, when a deep learning network model runs on a device, the device can acquire hardware information of the device before forward reasoning is performed through the deep learning network model.
It should be noted that the device may obtain its own hardware information through the software stack. In the embodiment of the present application, the hardware information of the device itself includes information of arithmetic hardware for implementing an operation of each layer of the neural network of the deep learning network model. Illustratively, the hardware information may include a hardware identification of the computing hardware in the device, a type identification of the computing hardware, a model parameter of the computing hardware, and so on.
For example, when the computing hardware in the device is a GPU, the hardware information may include information such as an identifier of the GPU, a type identifier indicating that the computing hardware is the GPU, and a model parameter of the GPU.
In addition, the computational hardware may be different for different devices. The type identifier of the computing hardware may include an identifier for indicating that the computing hardware is a GPU, a CPU, a Hisi platform, or the like, which is not limited in this embodiment of the present application.
Optionally, before acquiring its own hardware information, the device may further receive input image data and decode the image data. The image data may be a picture or a video. The decoding may be performed by decoding software installed in the device or by hardware in the device capable of decoding image data.
It should be further noted that, in the embodiment of the present application, a preset type identifier set of the computing hardware may be stored in the device. In this case, after acquiring the hardware information, the device may compare the type identifier of the arithmetic hardware in the acquired hardware information with the type identifiers included in the type identifier set. If the type identifier set includes the type identifier of the computing hardware of the current device, it may be determined that the current device is a device supported by the configuration method provided in the embodiment of the present application, and at this time, the device may execute step 202. Otherwise, it may be determined that the current device is a device that is not supported by the configuration method provided in the embodiment of the present application, and at this time, the device may end the operation.
Step 202: and determining input configuration information of the multilayer neural network included by the deep learning network model according to the hardware information of the current equipment.
After the device acquires the hardware information of the device, the channel alignment number of the current device may be determined according to the acquired hardware information. Then, the device can obtain the number of input data channels of each layer of neural network in the multilayer neural network included in the deep learning network model; determining a data conversion identifier of each layer of neural network according to the channel alignment number of the current equipment and the input data channel number of each layer of neural network; and determining the input configuration information of each layer of neural network according to the data conversion identification of each layer of neural network and the weight data of each layer of neural network.
It should be noted that different types of operation hardware have different hardware acceleration instruction sets. In the forward reasoning process of the deep learning network model, a hardware acceleration instruction in the hardware acceleration instruction set can be called only when the number of data channels of each layer of neural network is aligned according to the channel alignment number. Based on this, in order to better utilize the hardware acceleration performance of the current device, the device may determine a channel alignment number according to the obtained device information, and further set a data conversion identifier for each layer of the neural network according to the channel alignment number. Therefore, during subsequent forward reasoning, each layer of neural network of the deep learning network model can ensure, according to the data conversion identifier, that the channel number of its input data is aligned with the channel alignment number, so that the deep learning network model can call instructions in the hardware acceleration instruction set to improve operation efficiency.
The channel alignment numbers of different operation hardware differ. The device may determine the channel alignment number of the operation hardware according to the type identifier of the operation hardware in the acquired hardware information. For example, when the operation hardware is a CPU, the channel alignment number may be 16; when the operation hardware is a GPU, the channel alignment number may be 4; and when the operation hardware is a Hisi platform, the channel alignment number may be 4.
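For illustration, the mapping above can be written as a small lookup. The following C++ sketch is hypothetical; the enum and function names are not part of the patent and merely encode the examples just given:

    // Hypothetical sketch: derive the channel alignment number from the type
    // identifier of the operation hardware, following the examples above.
    enum class HardwareType { CPU, GPU, Hisi };

    int channelAlignment(HardwareType type) {
        switch (type) {
            case HardwareType::CPU:  return 16;
            case HardwareType::GPU:  return 4;
            case HardwareType::Hisi: return 4;
        }
        return 1;  // unknown hardware: assume no alignment requirement
    }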
After obtaining the channel alignment number, the device may obtain the number of input data channels of each layer of neural network included in the deep learning network model, and compare the number of input data channels of each layer of neural network with the channel alignment number.
For example, the device may compare the number of input data channels of the target layer neural network with the channel alignment number. If the channel alignment number of the current device is the same as the number of input data channels of the target layer neural network, the device determines that the data conversion identifier of the target layer neural network is the first identifier, and uses the data conversion identifier and the weight data of the target layer neural network as the input configuration information of the target layer neural network; in this case, the operation parameters included in the input configuration information are the weight data. The target layer neural network refers to any one layer of the multilayer neural network included in the deep learning network model, and the first identifier is used for indicating that data format conversion is not performed on the feature data input into the target layer neural network. Conversely, if the channel alignment number of the current device is different from the number of input data channels of the target layer neural network, the device determines that the data conversion identifier of the target layer neural network is the second identifier, performs data format conversion on the weight data, and uses the converted weight data and the data conversion identifier of the target layer neural network as the input configuration information of the target layer neural network; in this case, the operation parameters included in the input configuration information refer to the converted weight data. The second identifier is used for indicating that data format conversion is performed on the feature data input into the target layer neural network.
It should be noted that, as described in the foregoing step 201, before acquiring the hardware information of the current device, the device may receive and decode the image data. Based on this, in this step, the device may first obtain the number of input data channels of the first-layer neural network of the deep learning network model, compare the number of input data channels of the first-layer neural network with the number of channel alignments, and if the two numbers are the same, the device may directly input the decoded image data and the weight data of the first-layer neural network into the first-layer neural network, and set the data conversion identifier of the first-layer neural network as the first identifier. The first identifier is used for indicating that the characteristic data input into the first-layer neural network is not subjected to data format conversion. At this time, the first identifier and the weight data of the first layer neural network are input configuration information of the first layer neural network.
If the number of input data channels of the first layer of neural network is different from the number of channel alignments, the device may perform data format conversion on the decoded image data and the weight data of the first layer of neural network according to the number of channel alignments, so that the number of channels of the converted image data is aligned according to the number of channel alignments. Then, the converted image data and the weight data are input to the first layer neural network, and a data conversion flag of the first layer neural network is set as a second flag. The second identifier is used for indicating data format conversion of the characteristic data input into the first-layer neural network. At this time, the second identifier and the converted weight data of the first layer neural network are input configuration information of the first layer neural network.
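As a minimal sketch of the two operations just described — choosing the data conversion identifier and padding the channel dimension — the following C++ fragment assumes that "aligned" means the channel count is a multiple of the channel alignment number and that data is stored in NCHW order; all names are illustrative:

    #include <algorithm>
    #include <vector>

    // First/second data conversion identifiers described above.
    enum class ConvertFlag { NoConversion, Conversion };

    ConvertFlag conversionFlag(int inputChannels, int alignment) {
        return (inputChannels % alignment == 0) ? ConvertFlag::NoConversion
                                                : ConvertFlag::Conversion;
    }

    // Zero-pad the channel dimension of an NCHW tensor up to the aligned count.
    std::vector<float> padChannels(const std::vector<float>& src,
                                   int n, int c, int h, int w, int alignment) {
        int cAligned = (c + alignment - 1) / alignment * alignment;
        std::vector<float> dst(static_cast<size_t>(n) * cAligned * h * w, 0.0f);
        size_t plane = static_cast<size_t>(h) * w;
        for (int in = 0; in < n; ++in)
            for (int ic = 0; ic < c; ++ic)
                std::copy(src.begin() + (in * c + ic) * plane,
                          src.begin() + (in * c + ic + 1) * plane,
                          dst.begin() + (in * cAligned + ic) * plane);
        return dst;
    }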
The data input to the first-layer neural network is processed by the first-layer neural network, which then outputs feature data. At this time, the device may compare whether the number of input data channels of the second-layer neural network is the same as the channel alignment number. If they are the same, the device can input the feature data and the weight data of the second-layer neural network into the second-layer neural network, and set the data conversion identifier of the second-layer neural network as the first identifier. The first identifier and the weight data of the second-layer neural network are then used as the input configuration information of the second-layer neural network.
Optionally, if the number of input data channels of the second layer neural network is different from the number of channel alignments, the device may perform data format conversion on the feature data output by the first layer neural network and the weight data of the second layer neural network according to the number of channel alignments. And inputting the converted feature data and the weight data into a second-layer neural network, and setting a data conversion identifier of the second-layer neural network as a second identifier. In this case, the second identification and the converted weight data of the second layer neural network are used as input configuration information of the second layer neural network.
For each layer of neural network in the following deep learning network model, the device can refer to the processing mode of the second layer of neural network for processing, so as to obtain the input configuration information of each layer of neural network in the deep learning network model.
Notably, in deep learning network models, some layers may not need their number of input data channels aligned with the channel alignment number. For example, when the deep learning network model includes a Reshape layer: since the Reshape layer only modifies dimensions, aligning its number of input data channels with the channel alignment number would require adding a pad operation, and padding requires creating additional memory and moving data, which consumes performance unnecessarily. In addition, for a connection layer whose next layer is a Reshape layer, since the Reshape layer does not support memory alignment, in this embodiment of the application the operation of aligning the number of input data channels with the channel alignment number is not performed on the connection layer to which the Reshape layer is connected; instead, the data conversion identifier of that connection layer is directly set as the first identifier.
For another example, when the deep learning network model includes a BN layer or a scale layer, leaving the BN layer and the scale layer unaligned saves memory and avoids the time spent repeatedly applying for and releasing memory. Based on this, if a certain layer in the deep learning network model is a BN layer or a scale layer, the device may skip the above operation of aligning the number of input data channels with the channel alignment number and directly set the data conversion identifier of the BN layer or scale layer as the first identifier.
In addition, for the fully-connected layer included in the deep learning network model, because the operation of the fully-connected layer is a matrix multiplication and the room for improving matrix multiplication performance through alignment in the channel direction is limited, the data conversion identifier of the fully-connected layer can be directly set as the first identifier without performing the operation of aligning its number of input data channels with the channel alignment number.
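The exceptions described above can be summarized in a single predicate. The following sketch is illustrative only; the layer-type strings are assumptions, and "Concat" stands in for the connection layer mentioned above:

    #include <string>

    // Layers whose inputs are not padded to the channel alignment number; their
    // data conversion identifier is set directly to the first identifier.
    bool skipsChannelAlignment(const std::string& layerType,
                               const std::string& nextLayerType) {
        if (layerType == "Reshape" || layerType == "FullyConnected" ||
            layerType == "BN" || layerType == "Scale")
            return true;
        // A connection layer whose next layer is a Reshape layer is also skipped.
        return layerType == "Concat" && nextLayerType == "Reshape";
    }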
Step 203: and allocating operation resources for the deep learning network model according to the hardware information of the current equipment and the input configuration information of the multilayer neural network included by the deep learning network model.
After determining input configuration information of a multilayer neural network included in the deep learning network model, the device may determine a type identifier of operational hardware of the current device according to hardware information of the current device, where the operational hardware is hardware used for implementing data operation of each layer of the neural network in the deep learning network model; and allocating operation resources for the deep learning network model according to the type identification of the operation hardware of the current equipment and the input configuration information of the deep learning network model.
As described in step 201, the obtained hardware information may include a type identifier of the computing hardware, and based on this, the device may directly obtain the type identifier of the computing hardware from the hardware information.
After obtaining the type identifier of the operational hardware, the device may allocate the operational resources to the deep learning network model according to the hardware identified by the type identifier of the operational hardware.
When the type identifier of the operation hardware of the current device indicates that the operation hardware is a GPU, a plurality of CPU threads are created for the deep learning network model according to the input configuration information of the deep learning network model, wherein each CPU thread in the plurality of CPU threads comprises at least three GPU tasks, and the at least three GPU tasks are used for realizing the data operation of the deep learning network model based on the input configuration information of the deep learning network model; and a corresponding stream queue is allocated for each CPU thread, wherein the stream queue corresponding to each CPU thread comprises the at least three GPU tasks in the corresponding CPU thread, and the stream queues corresponding to different CPU threads are different.
It should be noted that, when performing the operation of each layer of neural network, the apparatus may create a plurality of CPU threads for the operation of the layer of neural network. Each CPU thread may include at least three GPU tasks. The at least three GPU tasks comprise copying data in a CPU memory into a video memory of the GPU, processing the data copied into the video memory by the GPU, and copying the processed data back into the CPU memory by the GPU. The operation of the layer of neural network can be realized through the GPU tasks. Based on this, when resource allocation and scheduling are performed in the embodiments of the present application, a corresponding stream queue may be allocated for each CPU thread, and at least three GPU tasks included in each CPU thread are placed in the corresponding stream queue, so that the GPU tasks belonging to different CPU threads are in different stream queues. Because the GPU tasks belonging to different CPU threads are positioned in different stream queues, the GPU tasks belonging to different CPU threads can be executed simultaneously, and the blocking among the GPU tasks of different CPU threads is avoided.
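A minimal CUDA sketch of this scheme is given below; the kernel, buffer, and function names are illustrative and not part of the patent, and pinned host memory would be needed for the asynchronous copies to truly overlap. Each CPU thread creates its own stream and enqueues its three GPU tasks on it:

    #include <cuda_runtime.h>

    __global__ void layerKernel(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];  // placeholder for the layer's real operation
    }

    // Executed by each CPU thread: its three GPU tasks go into its own stream
    // queue, so GPU tasks of different CPU threads do not block one another.
    void runLayerInThread(const float* hostIn, float* hostOut, int n) {
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        float *devIn = nullptr, *devOut = nullptr;
        cudaMalloc(&devIn, n * sizeof(float));
        cudaMalloc(&devOut, n * sizeof(float));
        // GPU task 1: copy the data from CPU memory into the GPU video memory.
        cudaMemcpyAsync(devIn, hostIn, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        // GPU task 2: process the copied data on the GPU.
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        layerKernel<<<blocks, threadsPerBlock, 0, stream>>>(devIn, devOut, n);
        // GPU task 3: copy the processed data back into CPU memory.
        cudaMemcpyAsync(hostOut, devOut, n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);
        cudaFree(devIn);
        cudaFree(devOut);
        cudaStreamDestroy(stream);
    }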
Fig. 3 is a diagram comparing the effects of resource allocation performed by the resource allocation method of the present application and by the resource allocation method in the related art. In the related art, after the tasks in the CPU threads are loaded into the GPU, since the GPU tasks of every CPU thread are in one queue, the GPU tasks need to be executed in order. As shown in fig. 3, when the GPU tasks of CPU thread A and CPU thread B are placed in one stream queue, first the first GPU task A1 in CPU thread A is executed, that is, the data corresponding to thread A in the CPU memory is copied to the video memory of the GPU; then the second GPU task A2 is executed, that is, the GPU processes the data; and then the third GPU task A3 is executed, copying the data processed by GPU task A2 back to the CPU. Only after these three GPU tasks have been executed can the next GPU task B1, belonging to CPU thread B, be executed; it can thus be seen that GPU tasks of different threads block one another. In the embodiment of the present application, after the tasks in the CPU threads are loaded into the GPU, the GPU tasks of different threads are located in different stream queues, so that while the GPU task of one thread is executed, the GPU task of another thread can also be executed. As shown in fig. 3, while the GPU tasks of CPU thread 2 are executed, the GPU tasks in CPU thread 1 may also be executed in parallel.
It should be noted that the GPU performs GPU tasks through GPU thread blocks. The GPU may comprise a plurality of GPU thread blocks and a plurality of streaming multiprocessors (SMs), each of the plurality of GPU thread blocks comprising a plurality of GPU threads. On this basis, after the corresponding stream queue is allocated to each CPU thread, a corresponding GPU thread block may also be allocated to each GPU task according to the number of threads required to execute each GPU task and the number of GPU threads included in each GPU thread block.
The device can determine the number of GPU thread blocks required by each GPU task according to the number of threads required by the GPU task and the number of GPU threads included in each GPU thread block, and then allocate the required number of GPU thread blocks to each GPU task.
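This computation is a simple rounding-up division; a sketch (the function name is illustrative):

    // Number of GPU thread blocks a task needs, given the number of threads it
    // requires and the number of GPU threads per thread block (rounded up).
    int blocksForTask(int threadsRequired, int threadsPerBlock) {
        return (threadsRequired + threadsPerBlock - 1) / threadsPerBlock;
    }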
After allocating the corresponding GPU thread block to each GPU task, the device may search the plurality of SMs for idle SMs that are not currently executing a GPU thread block, and determine, from the found idle SMs, the SMs used for executing the GPU thread blocks corresponding to each GPU task.
It should be noted that, when the SMs executes the GPU thread blocks, one GPU thread block may monopolize one SM, that is, one SM may execute one GPU thread block at the same time. Based on this, when a current GPU task needs to be executed by N thread blocks, the device may determine N idle SMs, which are not currently executing GPU thread blocks, from among the plurality of SMs, and execute N thread blocks corresponding to the GPU task by the determined N SMs.
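On real GPUs the hardware scheduler assigns thread blocks to SMs automatically, so the following is only a conceptual sketch of the selection described above, with illustrative names:

    #include <vector>

    // Pick up to n idle SMs (ones not currently executing a GPU thread block).
    std::vector<int> pickIdleSMs(const std::vector<bool>& smBusy, int n) {
        std::vector<int> chosen;
        for (int sm = 0; sm < static_cast<int>(smBusy.size()); ++sm) {
            if (static_cast<int>(chosen.size()) == n) break;
            if (!smBusy[sm]) chosen.push_back(sm);
        }
        return chosen;  // may hold fewer than n entries if idle SMs run out
    }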
It should be noted that, since the number of GPU thread blocks corresponding to a GPU task determines the number of SMs required for executing the GPU task, and the number of GPU threads included in one GPU thread block determines the number of GPU thread blocks required for the GPU task, task balancing across the multiple SMs within the GPU can be ensured in the embodiments of the present application by changing the number of threads within a GPU thread block.
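Under the assumption (stated above) that each thread block occupies one SM, the balancing idea can be sketched as choosing a threads-per-block value whose resulting block count divides evenly over the SMs; the candidate block sizes below are illustrative:

    #include <initializer_list>

    // Pick a threads-per-block value so the block count spreads evenly over SMs.
    int balancedThreadsPerBlock(int threadsRequired, int numSMs) {
        for (int tpb : {512, 256, 128, 64, 32}) {
            int blocks = (threadsRequired + tpb - 1) / tpb;
            if (blocks % numSMs == 0) return tpb;
        }
        return 256;  // fallback when no candidate divides evenly
    }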
Optionally, in this embodiment of the present application, before allocating an operation resource to the deep learning network model according to the type identifier of the operation hardware of the current device and the input configuration information of the deep learning network model, the device may further determine whether the type identifier of the operation hardware of the current device is a preset identifier. If the identifier is the preset identifier, the step may not be executed in the embodiment of the present application. The preset identifier refers to a type identifier of a device that does not support resource allocation.
Optionally, in this embodiment of the present application, before allocating an operation resource to the deep learning network model according to the type identifier of the operation hardware of the current device and the input configuration information of the deep learning network model, the device may further determine whether a multilayer neural network included in the deep learning network model includes a neural network of a preset type, and if so, this step may not be executed in this embodiment of the present application. The preset type of neural network refers to a neural network that does not support resource allocation according to the embodiment of the present application.
In the embodiment of the application, the hardware information of the current device may first be acquired; then the input configuration information of the multilayer neural network included in the deep learning network model is determined according to the hardware information of the current device; and operation resources are further allocated to the deep learning network model according to the input configuration information of the multilayer neural network included in the deep learning network model and the hardware information of the current device. Therefore, even if the hardware condition of the device changes, the deep learning network model can still be configured according to the hardware information of the current device by the configuration method provided by the application, that is, the configuration method provided by the application has universality.
In addition, in the embodiment of the application, the data conversion identifier is set for each layer of neural network through the channel alignment number and the data channel number of each layer of neural network, so that in the subsequent forward reasoning process, the format of the input data of each layer of neural network can be ensured to be aligned with the channel alignment number according to the data conversion identifier of each layer of neural network, and thus, each layer of neural network can utilize the hardware acceleration instruction set of the current equipment to carry out reasoning acceleration during data operation, that is, the hardware performance of the current equipment is utilized to the maximum extent. In addition, the embodiment of the application can allocate different flow queues to different CPU threads, so that GPU tasks of different CPU threads can be executed in parallel, and the operation performance of the deep learning network model is effectively improved. In addition, in the embodiment of the present application, after the GPU thread blocks are allocated to the GPU tasks, the GPU thread blocks may be executed through the current idle SM, that is, if one GPU task does not occupy all SMs, the remaining SMs may also be used to execute other GPU thread blocks, thereby ensuring task balance of the SMs.
Next, a configuration apparatus of a deep learning network model provided in an embodiment of the present application is described.
Referring to fig. 4, an embodiment of the present application provides an apparatus 400 for configuring a deep learning network model, where the apparatus 400 includes:
an obtaining module 401, configured to obtain hardware information of a current device;
a determining module 402, configured to determine, according to hardware information of a current device, input configuration information of a multilayer neural network included in a deep learning network model, where the input configuration information includes a data conversion identifier and an operation parameter, and the data conversion identifier is used to indicate whether to perform data format conversion on feature data input to the neural network;
and an allocating module 403, configured to allocate an operation resource to the deep learning network model according to the hardware information of the current device and the input configuration information of the multilayer neural network included in the deep learning network model.
Optionally, referring to fig. 5, the determining module 402 includes:
the first determining submodule 4021 is configured to determine a channel alignment number of the current device according to hardware information of the current device;
the obtaining submodule 4022 is configured to obtain the number of input data channels of each layer of the multilayer neural networks included in the deep learning network model;
the second determining submodule 4023 is configured to determine a data conversion identifier of each layer of neural network according to the number of channel alignments of the current device and the number of input data channels of each layer of neural network;
the third determining sub-module 4024 is configured to determine the input configuration information of each layer of neural network according to the data conversion identifier of each layer of neural network and the weight data of each layer of neural network.
Optionally, the second determining sub-module 4023 is specifically configured to:
if the channel alignment number of the current device is the same as the input data channel number of the target layer neural network, determining that the data conversion identifier of the target layer neural network is a first identifier, wherein the target layer neural network is any one layer of the multilayer neural network included in the deep learning network model, and the first identifier is used for indicating that the data format conversion is not performed on the characteristic data input into the target layer neural network;
determining input configuration information of each layer of neural network of the deep learning network model according to the data conversion identification and the weight data of each layer of neural network, wherein the input configuration information comprises the following steps:
and taking the data conversion identifier and the weight data of the target layer neural network as the input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information refer to the weight data.
Optionally, the second determining sub-module 4023 is specifically configured to:
if the channel alignment number of the current device is different from the input data channel number of the target layer neural network, determining that the data conversion identifier of the target layer neural network is a second identifier, wherein the target layer neural network is any one layer of neural network in the multilayer neural network included in the deep learning network model, and the second identifier is used for indicating that the data format conversion is carried out on the characteristic data input into the target layer neural network;
determining input configuration information of each layer of neural network of the deep learning network model according to the data conversion identification and the weight data of each layer of neural network, wherein the input configuration information comprises the following steps:
performing data format conversion on the weight data of the target layer neural network according to the channel alignment number;
and taking the converted weight data and the data conversion identifier of the target layer neural network as input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information refer to the converted weight data.
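As an illustration of these two branches, here is a hedged C++ sketch that sets the data conversion identifier by comparing the channel alignment number with the layer's input channel count and, when they differ, zero-pads the weight data along the input-channel dimension; the [outC][inC][kernel] weight layout and all names are assumptions, not the application's implementation.

```cpp
#include <cstddef>
#include <vector>

// One layer's input configuration: conversion identifier plus operation
// parameters (the possibly format-converted weight data).
struct LayerConfig {
    bool convert;                // false = first identifier, true = second identifier
    std::vector<float> weights;
};

LayerConfig makeLayerConfig(int alignC, int inC, int outC, int kernel,
                            const std::vector<float>& w) {
    LayerConfig cfg;
    cfg.convert = (inC != alignC);               // counts differ: conversion needed
    if (!cfg.convert) { cfg.weights = w; return cfg; }

    // Round the input-channel dimension up to the next multiple of alignC and
    // zero-fill the new channels, so the padded weights compute the same result.
    int paddedC = (inC + alignC - 1) / alignC * alignC;
    cfg.weights.assign((size_t)outC * paddedC * kernel, 0.0f);
    for (int o = 0; o < outC; ++o)
        for (int c = 0; c < inC; ++c)
            for (int k = 0; k < kernel; ++k)
                cfg.weights[((size_t)o * paddedC + c) * kernel + k] =
                    w[((size_t)o * inC + c) * kernel + k];
    return cfg;
}
```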
Optionally, the apparatus 400 is further configured to:
and if the target layer neural network is not any one of a Reshape layer, a full connection layer, a batch normalization (BN) layer and a scale layer, or if the target layer neural network is a connection layer and the next layer neural network of the target layer neural network is not a Reshape layer, executing the step of determining the data conversion identifier of each layer of neural network according to the channel alignment number of the current device and the number of input data channels of each layer of neural network.
Optionally, the apparatus 400 is further configured to:
if the target layer neural network is any one of a Reshape layer, a full connection layer, a BN layer and a scale layer, or if the target layer neural network is a connection layer and a next layer neural network of the target layer neural network is a Reshape layer, determining that the data conversion identifier of the target layer neural network is a first identifier;
and taking the data conversion identifier and the weight data of the target layer neural network as the input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information refer to the weight data.
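A hedged sketch of this layer-type rule follows; the LayerType enumeration is an assumption introduced only for illustration.

```cpp
// Layers exempt from alignment-based conversion always keep the first
// identifier (no data format conversion of their input feature data).
enum class LayerType { Conv, Reshape, FullyConnected, BatchNorm, Scale, Connection, Other };

bool exemptFromConversion(LayerType current, LayerType next) {
    switch (current) {
        case LayerType::Reshape:
        case LayerType::FullyConnected:
        case LayerType::BatchNorm:
        case LayerType::Scale:
            return true;                        // always the first identifier
        case LayerType::Connection:
            return next == LayerType::Reshape;  // exempt only when feeding a Reshape layer
        default:
            return false;                       // decided by the channel alignment rule
    }
}
```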
Optionally, referring to fig. 6, the assignment module 403 includes:
a fourth determining submodule 4031, configured to determine, according to hardware information of the current device, a type identifier of an operation hardware of the current device, where the operation hardware refers to hardware used for implementing data operation of each layer of neural network in the deep learning network model;
and the allocating submodule 4032 is configured to allocate an operation resource to the deep learning network model according to the type identifier of the operation hardware of the current device and the input configuration information of the multilayer neural network included in the deep learning network model.
Optionally, the assignment sub-module 4032 is specifically configured to:
when the type identifier of the operation hardware of the current device indicates that the operation hardware is a GPU (Graphics Processing Unit), a plurality of CPU (Central Processing Unit) threads are created for the deep learning network model according to the input configuration information of the multilayer neural network included in the deep learning network model, where each CPU thread of the plurality of CPU threads includes at least three GPU tasks, and the at least three GPU tasks are used to implement data operation of the deep learning network model based on the input configuration information of the multilayer neural network included in the deep learning network model;
and a corresponding stream queue is allocated to each CPU thread, where the stream queue corresponding to each CPU thread includes the at least three GPU tasks of the corresponding CPU thread, and the stream queues corresponding to different CPU threads are different.
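A minimal CUDA C++ sketch of this scheme, under the assumption that each worker thread stands for one CPU thread and the dummy kernel stands in for its at least three GPU tasks; the per-thread cudaStream_t is the standard CUDA notion behind the stream queue described here.

```cuda
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void dummyTask() {}  // stand-in for one GPU task of the model

// Each CPU thread creates its own CUDA stream, so GPU tasks enqueued by
// different CPU threads can overlap instead of serializing on one queue.
void worker() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);            // this CPU thread's own stream queue
    for (int t = 0; t < 3; ++t)           // "at least three GPU tasks"
        dummyTask<<<1, 32, 0, stream>>>();
    cudaStreamSynchronize(stream);        // waits on this thread's queue only
    cudaStreamDestroy(stream);
}

int main() {
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    return 0;
}
```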
Optionally, the GPU comprises a plurality of GPU thread blocks and a plurality of streaming multiprocessors (SMs), each thread block of the plurality of GPU thread blocks comprising a plurality of GPU threads;
the assignment sub-module 4032 is further specifically configured to:
allocating a corresponding GPU thread block to each GPU task according to the number of threads required to execute each GPU task and the number of GPU threads included in each GPU thread block;
searching the plurality of SMs for idle SMs that are not currently executing a GPU thread block;
and determining, from the found idle SMs, the SM that is to execute the GPU thread block corresponding to each GPU task.
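Note that CUDA exposes no direct SM-assignment API to software: the hardware scheduler dispatches thread blocks to whichever SMs are idle, which is exactly what yields the SM task balance described above. The hedged sketch below therefore shows only the software-visible part: sizing the grid from the task's required thread count and querying how many such blocks one SM can host; the block size and kernel are assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void gpuTaskKernel() {}  // stand-in for one GPU task

void launchTask(int threadsNeeded, cudaStream_t stream) {
    const int threadsPerBlock = 256;  // assumed block size
    // Blocks needed = ceil(threads required by the task / threads per block).
    int blocks = (threadsNeeded + threadsPerBlock - 1) / threadsPerBlock;

    // How many such blocks a single SM can execute concurrently.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, gpuTaskKernel,
                                                  threadsPerBlock, 0);
    printf("grid = %d blocks, up to %d blocks per SM\n", blocks, blocksPerSM);

    gpuTaskKernel<<<blocks, threadsPerBlock, 0, stream>>>();
}
```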
In summary, in the embodiment of the present application, hardware information of the current device can be obtained, input configuration information of the multilayer neural network included in the deep learning network model can then be determined according to the hardware information of the current device, and operation resources can then be allocated to the deep learning network model according to the input configuration information of the multilayer neural network included in the deep learning network model and the hardware information of the current device. Therefore, even if the hardware condition of the device changes, the configuration method provided by the application can still configure the deep learning network model according to the hardware information of the current device; that is, the configuration method provided by the application is universally applicable.
It should be noted that: in the configuration apparatus for a deep learning network model provided in the foregoing embodiment, when the deep learning network model is configured, only the division of the functional modules is illustrated, and in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the configuration device of the deep learning network model provided in the above embodiment and the configuration method embodiment of the deep learning network model belong to the same concept, and the specific implementation process thereof is detailed in the method embodiment and is not described herein again.
Fig. 7 shows a block diagram of a smart device 700 provided in an exemplary embodiment of the present application. The smart device 700 may be: a smartphone, a tablet, a laptop, or a desktop computer. The smart device 700 may also be referred to by other names such as user device, portable terminal, laptop terminal, desktop terminal, and the like.
In general, the smart device 700 includes: a processor 701 and a memory 702.
The processor 701 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 701 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 701 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 701 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 701 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 702 may include one or more computer-readable storage media, which may be non-transitory. Memory 702 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in memory 702 is used to store at least one instruction for execution by processor 701 to implement a method of configuring a deep learning network model provided by method embodiments of the present application.
In some embodiments, the smart device 700 may further optionally include: a peripheral interface 703 and at least one peripheral. The processor 701, the memory 702, and the peripheral interface 703 may be connected by buses or signal lines. Various peripheral devices may be connected to peripheral interface 703 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 704, touch screen display 705, camera 706, audio circuitry 707, positioning components 708, and power source 709.
The peripheral interface 703 may be used to connect at least one I/O (Input/Output) related peripheral to the processor 701 and the memory 702. In some embodiments, the processor 701, the memory 702, and the peripheral interface 703 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 701, the memory 702 and the peripheral interface 703 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 704 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 704 communicates with communication networks and other communication devices via electromagnetic signals, converting an electrical signal into an electromagnetic signal for transmission, or converting a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 704 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 704 may communicate with other devices via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the World Wide Web, metropolitan area networks, intranets, mobile communication networks of each generation (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 704 may also include NFC (Near Field Communication) related circuits, which is not limited in this application.
The display screen 705 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 705 is a touch display screen, the display screen 705 also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 701 as a control signal for processing. At this point, the display screen 705 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 705, disposed on the front panel of the smart device 700; in other embodiments, there may be at least two display screens 705, respectively disposed on different surfaces of the smart device 700 or in a folding design; in still other embodiments, the display screen 705 may be a flexible display disposed on a curved surface or a folded surface of the smart device 700. The display screen 705 may even be arranged in a non-rectangular irregular pattern, i.e., an irregularly shaped screen. The display screen 705 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or the like.
The camera assembly 706 is used to capture images or video. Optionally, the camera assembly 706 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the device, and the rear camera is disposed on the back of the device. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, a VR (Virtual Reality) shooting function or other fused shooting functions. In some embodiments, the camera assembly 706 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.
The audio circuitry 707 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 701 for processing or inputting the electric signals to the radio frequency circuit 704 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different positions of the smart device 700. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 701 or the radio frequency circuit 704 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuitry 707 may also include a headphone jack.
The positioning component 708 is used to locate the current geographic location of the smart device 700 to implement navigation or LBS (Location Based Service). The positioning component 708 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 709 is used to supply power to various components in the smart device 700. The power source 709 may be alternating current, direct current, disposable batteries, or rechargeable batteries. When the power source 709 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the smart device 700 also includes one or more sensors 710. The one or more sensors 710 include, but are not limited to: acceleration sensor 711, gyro sensor 712, pressure sensor 713, fingerprint sensor 714, optical sensor 715, and proximity sensor 716.
The acceleration sensor 711 may detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the smart device 700. For example, the acceleration sensor 711 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 701 may control the touch screen 705 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 711. The acceleration sensor 711 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 712 may detect a body direction and a rotation angle of the smart device 700, and the gyro sensor 712 may cooperate with the acceleration sensor 711 to acquire a 3D motion of the smart device 700 by the user. From the data collected by the gyro sensor 712, the processor 701 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensors 713 may be disposed on a side bezel of smart device 700 and/or underneath touch display 705. When the pressure sensor 713 is disposed on a side frame of the smart device 700, a holding signal of the smart device 700 from a user can be detected, and the processor 701 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 713. When the pressure sensor 713 is disposed at a lower layer of the touch display 705, the processor 701 controls the operability control on the UI interface according to the pressure operation of the user on the touch display 705. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 714 is used for collecting a fingerprint of a user, and the processor 701 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 714, or the fingerprint sensor 714 identifies the identity of the user according to the collected fingerprint. When the user identity is identified as a trusted identity, the processor 701 authorizes the user to perform relevant sensitive operations, including unlocking a screen, viewing encrypted information, downloading software, paying, changing settings, and the like. The fingerprint sensor 714 may be disposed on the front, back, or side of the smart device 700. When a physical button or vendor Logo is provided on the smart device 700, the fingerprint sensor 714 may be integrated with the physical button or vendor Logo.
The optical sensor 715 is used to collect the ambient light intensity. In one embodiment, the processor 701 may control the display brightness of the touch display 705 based on the ambient light intensity collected by the optical sensor 715. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 705 is increased; when the ambient light intensity is low, the display brightness of the touch display 705 is turned down. In another embodiment, processor 701 may also dynamically adjust the shooting parameters of camera assembly 706 based on the ambient light intensity collected by optical sensor 715.
A proximity sensor 716, also known as a distance sensor, is typically disposed on the front panel of the smart device 700. The proximity sensor 716 is used to capture the distance between the user and the front of the smart device 700. In one embodiment, when the proximity sensor 716 detects that the distance between the user and the front surface of the smart device 700 gradually decreases, the processor 701 controls the touch display screen 705 to switch from the bright screen state to the dark screen state; when the proximity sensor 716 detects that the distance between the user and the front surface of the smart device 700 gradually increases, the processor 701 controls the touch display screen 705 to switch from the dark screen state to the bright screen state.
Those skilled in the art will appreciate that the architecture shown in FIG. 7 does not constitute a limitation on the smart device 700, and may include more or fewer components than shown, or combine certain components, or employ a different arrangement of components.
In an exemplary embodiment of the present application, there is also provided a computer-readable storage medium, such as a memory, including instructions executable by a processor in the apparatus to perform the configuration method of the deep learning network model in the above embodiments. For example, the computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (18)

1. A method for configuring a deep learning network model, the method comprising:
acquiring hardware information of current equipment;
determining input configuration information of a multilayer neural network included in a deep learning network model according to the hardware information of the current device, wherein the input configuration information comprises a data conversion identifier and an operation parameter, and the data conversion identifier is used for indicating whether to perform data format conversion on characteristic data input to the neural network;
and allocating operation resources for the deep learning network model according to the hardware information of the current equipment and the input configuration information of the multilayer neural network included by the deep learning network model.
2. The method of claim 1, wherein the determining input configuration information of a multi-layer neural network included in the deep learning network model according to the hardware information of the current device comprises:
determining the channel alignment number of the current equipment according to the hardware information of the current equipment;
acquiring the number of input data channels of each layer of neural network in the multilayer neural network included in the deep learning network model;
determining a data conversion identifier of each layer of neural network according to the channel alignment number of the current equipment and the input data channel number of each layer of neural network;
and determining the input configuration information of each layer of neural network according to the data conversion identification of each layer of neural network and the weight data of each layer of neural network.
3. The method of claim 2, wherein determining the data conversion identifier of each layer of neural network according to the channel alignment number of the current device and the input data channel number of each layer of neural network comprises:
if the channel alignment number of the current device is the same as the input data channel number of a target layer neural network, determining that a data conversion identifier of the target layer neural network is a first identifier, wherein the target layer neural network refers to any one layer of neural networks in a plurality of layers of neural networks included in the deep learning network model, and the first identifier is used for indicating that data format conversion is not performed on feature data input into the target layer neural network;
the determining the input configuration information of each layer of neural network of the deep learning network model according to the data conversion identification and the weight data of each layer of neural network comprises the following steps:
and taking the data conversion identifier and the weight data of the target layer neural network as the input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information refer to the weight data.
4. The method of claim 2, wherein determining the data conversion identifier of each layer of neural network according to the channel alignment number of the current device and the input data channel number of each layer of neural network comprises:
if the channel alignment number of the current device is different from the input data channel number of a target layer neural network, determining that a data conversion identifier of the target layer neural network is a second identifier, wherein the target layer neural network is any one layer of neural networks in the multilayer neural networks included in the deep learning network model, and the second identifier is used for indicating that data format conversion is performed on feature data input into the target layer neural network;
the determining the input configuration information of each layer of neural network of the deep learning network model according to the data conversion identification and the weight data of each layer of neural network comprises the following steps:
performing data format conversion on the weight data of the target layer neural network according to the channel alignment number;
and taking the converted weight data and the data conversion identifier of the target layer neural network as input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information refer to the converted weight data.
5. The method according to any one of claims 2-4, further comprising:
and if the target layer neural network is not any one of a Reshape layer, a full connection layer, a batch normalization BN layer and a scale layer, or if the target layer neural network is a connection layer and the next layer neural network of the target layer neural network is not a Reshape layer, executing a step of determining the data conversion identifier of each layer of neural network according to the channel alignment number of the current equipment and the input data channel number of each layer of neural network.
6. The method of claim 5, further comprising:
if the target layer neural network is any one of a Reshape layer, a full connection layer, a BN layer and a scale layer, or if the target layer neural network is a connection layer and a next layer neural network of the target layer neural network is a Reshape layer, determining that a data conversion identifier of the target layer neural network is a first identifier;
and taking the data conversion identifier and the weight data of the target layer neural network as the input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information refer to the weight data.
7. The method according to any one of claims 1 to 6, wherein the allocating operation resources to the deep learning network model according to the hardware information of the current device and the input configuration information of the multi-layer neural network included in the deep learning network model comprises:
determining the type identifier of the operation hardware of the current equipment according to the hardware information of the current equipment, wherein the operation hardware refers to hardware for realizing data operation of each layer of neural network in the deep learning network model;
and allocating operation resources for the deep learning network model according to the type identification of the operation hardware of the current equipment and the input configuration information of the multilayer neural network included by the deep learning network model.
8. The method according to claim 7, wherein the allocating operation resources for the deep learning network model according to the type identifier of the operation hardware of the current device and the input configuration information of the multi-layer neural network included in the deep learning network model comprises:
when the type identifier of the operation hardware of the current device indicates that the operation hardware is a GPU (Graphics Processing Unit), creating a plurality of CPU (Central Processing Unit) threads for the deep learning network model according to the input configuration information of the multilayer neural network included in the deep learning network model, wherein each CPU thread of the plurality of CPU threads includes at least three GPU tasks, and the at least three GPU tasks are used to implement data operation of the deep learning network model based on the input configuration information of the multilayer neural network included in the deep learning network model;
and allocating a corresponding stream queue to each CPU thread, wherein the stream queue corresponding to each CPU thread includes the at least three GPU tasks of the corresponding CPU thread, and the stream queues corresponding to different CPU threads are different.
9. The method of claim 8, wherein the GPU comprises a plurality of GPU thread blocks and a plurality of streaming multiprocessors (SMs), wherein each thread block of the plurality of GPU thread blocks comprises a plurality of GPU threads;
after allocating the corresponding flow queue for each CPU thread, the method further includes:
allocating a corresponding GPU thread block to each GPU task according to the number of threads required to execute each GPU task and the number of GPU threads included in each GPU thread block;
searching the plurality of SMs for idle SMs that are not currently executing a GPU thread block;
and determining, from the found idle SMs, the SM that is to execute the GPU thread block corresponding to each GPU task.
10. An apparatus for configuring a deep learning network model, the apparatus comprising:
the acquisition module is used for acquiring the hardware information of the current equipment;
the determining module is used for determining input configuration information of a multilayer neural network included in the deep learning network model according to the hardware information of the current device, wherein the input configuration information comprises a data conversion identifier and an operation parameter, and the data conversion identifier is used for indicating whether to perform data format conversion on characteristic data input to the neural network;
and the allocation module is used for allocating operation resources to the deep learning network model according to the hardware information of the current equipment and the input configuration information of the multilayer neural network included by the deep learning network model.
11. The apparatus of claim 10, wherein the determining module comprises:
the first determining submodule is used for determining the channel alignment number of the current equipment according to the hardware information of the current equipment;
the acquisition submodule is used for acquiring the number of input data channels of each layer of neural network in the multilayer neural network included in the deep learning network model;
the second determining submodule is used for determining the data conversion identifier of each layer of neural network according to the channel alignment number of the current equipment and the input data channel number of each layer of neural network;
and the third determining submodule is used for determining the input configuration information of each layer of neural network according to the data conversion identification of each layer of neural network and the weight data of each layer of neural network.
12. The apparatus of claim 11, wherein the second determination submodule is specifically configured to:
if the channel alignment number of the current device is the same as the input data channel number of a target layer neural network, determining that a data conversion identifier of the target layer neural network is a first identifier, wherein the target layer neural network refers to any one layer of neural networks in a plurality of layers of neural networks included in the deep learning network model, and the first identifier is used for indicating that data format conversion is not performed on feature data input into the target layer neural network;
the determining the input configuration information of each layer of neural network of the deep learning network model according to the data conversion identification and the weight data of each layer of neural network comprises the following steps:
and taking the data conversion identifier and the weight data of the target layer neural network as the input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information refer to the weight data.
13. The apparatus of claim 11, wherein the second determination submodule is specifically configured to:
if the channel alignment number of the current device is different from the input data channel number of a target layer neural network, determining that a data conversion identifier of the target layer neural network is a second identifier, wherein the target layer neural network is any one layer of neural networks in the multilayer neural networks included in the deep learning network model, and the second identifier is used for indicating that data format conversion is performed on feature data input into the target layer neural network;
the determining the input configuration information of each layer of neural network of the deep learning network model according to the data conversion identification and the weight data of each layer of neural network comprises the following steps:
performing data format conversion on the weight data of the target layer neural network according to the channel alignment number;
and taking the converted weight data and the data conversion identifier of the target layer neural network as input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information refer to the converted weight data.
14. The apparatus of any of claims 11-13, wherein the apparatus is further configured to:
and if the target layer neural network is not any one of a Reshape layer, a full connection layer, a batch normalization BN layer and a scale layer, or if the target layer neural network is a connection layer and the next layer neural network of the target layer neural network is not a Reshape layer, executing a step of determining the data conversion identifier of each layer of neural network according to the channel alignment number of the current equipment and the input data channel number of each layer of neural network.
15. The apparatus of claim 14, wherein the apparatus is further configured to:
if the target layer neural network is any one of a Reshape layer, a full connection layer, a BN layer and a scale layer, or if the target layer neural network is a connection layer and a next layer neural network of the target layer neural network is a Reshape layer, determining that a data conversion identifier of the target layer neural network is a first identifier;
and taking the data conversion identifier and the weight data of the target layer neural network as the input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information refer to the weight data.
16. The apparatus of any of claims 10-15, wherein the assignment module comprises:
a fourth determining submodule, configured to determine, according to the hardware information of the current device, a type identifier of operation hardware of the current device, where the operation hardware is hardware used to implement data operation of each layer of neural network in the deep learning network model;
and the allocation submodule is used for allocating operation resources for the deep learning network model according to the type identification of the operation hardware of the current equipment and the input configuration information of the multilayer neural network included by the deep learning network model.
17. The apparatus of claim 16, wherein the assignment submodule is specifically configured to:
when the type identifier of the operation hardware of the current device indicates that the operation hardware is a GPU (Graphics Processing Unit), creating a plurality of CPU (Central Processing Unit) threads for the deep learning network model according to the input configuration information of the multilayer neural network included in the deep learning network model, wherein each CPU thread of the plurality of CPU threads includes at least three GPU tasks, and the at least three GPU tasks are used to implement data operation of the deep learning network model based on the input configuration information of the multilayer neural network included in the deep learning network model;
and allocating a corresponding stream queue to each CPU thread, wherein the stream queue corresponding to each CPU thread includes the at least three GPU tasks of the corresponding CPU thread, and the stream queues corresponding to different CPU threads are different.
18. The apparatus according to claim 17, wherein the GPU comprises a plurality of GPU thread blocks and a plurality of streaming multiprocessors (SMs), each thread block of the plurality of GPU thread blocks comprising a plurality of GPU threads;
the allocation submodule is further configured to:
allocating a corresponding GPU thread block to each GPU task according to the number of threads required to execute each GPU task and the number of GPU threads included in each GPU thread block;
searching the plurality of SMs for idle SMs that are not currently executing a GPU thread block;
and determining, from the found idle SMs, the SM that is to execute the GPU thread block corresponding to each GPU task.
CN201910388839.XA 2019-05-10 2019-05-10 Configuration method, device and storage medium of deep learning network model Active CN111914985B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910388839.XA CN111914985B (en) 2019-05-10 2019-05-10 Configuration method, device and storage medium of deep learning network model

Publications (2)

Publication Number Publication Date
CN111914985A true CN111914985A (en) 2020-11-10
CN111914985B CN111914985B (en) 2023-07-04

Family

ID=73242371


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112799895A (en) * 2021-01-27 2021-05-14 北京嘀嘀无限科技发展有限公司 Hardware evaluation method, apparatus, electronic device, storage medium, and program product
CN113570030A (en) * 2021-01-18 2021-10-29 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105869117A (en) * 2016-03-28 2016-08-17 上海交通大学 Method for accelerating GPU directed at deep learning super-resolution technology
US20170221176A1 (en) * 2016-01-29 2017-08-03 Fotonation Limited Convolutional neural network
CN107844827A (en) * 2017-11-28 2018-03-27 北京地平线信息技术有限公司 The method and apparatus for performing the computing of the convolutional layer in convolutional neural networks
CN108875914A (en) * 2018-06-01 2018-11-23 北京地平线信息技术有限公司 The method and apparatus that Neural Network Data is pre-processed and is post-processed


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Tianqi Chen et al.: "TVM: End-to-End Optimization Stack for Deep Learning", arxiv.org/pdf/1802.04799v1.pdf *
Ke Xuan et al.: "Least-squares reverse time migration algorithm based on multi-thread and multi-GPU parallel acceleration", Geophysical Prospecting for Petroleum *
Intel Software College Textbook Writing Group: "Processor Architecture", 31 January 2011, Shanghai: Shanghai Jiao Tong University Press *


Also Published As

Publication number Publication date
CN111914985B (en) 2023-07-04

Similar Documents

Publication Publication Date Title
CN108304265B (en) Memory management method, device and storage medium
CN110134521B (en) Resource allocation method, device, resource manager and storage medium
WO2020001564A1 (en) Method, apparatus, and system for processing tasks
CN108762881B (en) Interface drawing method and device, terminal and storage medium
CN110673944B (en) Method and device for executing task
CN109861966B (en) Method, device, terminal and storage medium for processing state event
CN113127181A (en) Memory management method, device and storage medium
CN111400002B (en) Application process and processor core binding method and terminal
CN110704324A (en) Application debugging method and device and storage medium
CN110119305B (en) Task execution method and device, computer equipment and storage medium
CN109783176B (en) Page switching method and device
CN111914985B (en) Configuration method, device and storage medium of deep learning network model
CN111813322A (en) Method, device and equipment for creating storage pool and storage medium
CN110086814B (en) Data acquisition method and device and storage medium
CN112612539A (en) Data model unloading method and device, electronic equipment and storage medium
CN112181915B (en) Method, device, terminal and storage medium for executing service
CN112988254B (en) Method, device and equipment for managing hardware equipment
CN112260845B (en) Method and device for accelerating data transmission
CN112612540A (en) Data model configuration method and device, electronic equipment and storage medium
CN113448692A (en) Distributed graph computing method, device, equipment and storage medium
CN111708669A (en) System operation analysis method, device, equipment and storage medium
CN112990421A (en) Method, device and storage medium for optimizing operation process of deep learning network
CN111222124B (en) Method, device, equipment and storage medium for using authority distribution
CN109981310B (en) Resource management method, device and storage medium
CN115037702B (en) Message distribution and data transmission methods and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant