CN111914985B - Configuration method, device and storage medium of deep learning network model - Google Patents

Configuration method, device and storage medium of deep learning network model

Info

Publication number
CN111914985B
CN111914985B (application CN201910388839.XA)
Authority
CN
China
Prior art keywords
neural network
layer
data
deep learning
gpu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910388839.XA
Other languages
Chinese (zh)
Other versions
CN111914985A (en)
Inventor
屠震元
叶挺群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201910388839.XA
Publication of CN111914985A
Application granted
Publication of CN111914985B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 - Allocation of resources to service a request
    • G06F 9/5027 - Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application discloses a configuration method and device of a deep learning network model and a computer storage medium, and belongs to the field of deep learning. The configuration method provided by the embodiment of the application first acquires the hardware information of the current device, then determines the input configuration information of the multi-layer neural network included in the deep learning network model according to that hardware information, and further allocates operation resources to the deep learning network model according to the input configuration information of the multi-layer neural network and the hardware information of the current device. In this way, even if the hardware conditions of the device change, the configuration method provided by the application can still configure the deep learning network model according to the hardware information of the current device; that is, the configuration method provided by the application has universality.

Description

Configuration method, device and storage medium of deep learning network model
Technical Field
The present invention relates to the field of deep learning technologies, and in particular, to a method and apparatus for configuring a deep learning network model, and a storage medium.
Background
Currently, when a deep learning network model is run on a device, a configuration method matched to the hardware conditions of that device generally has to be determined according to the device's hardware information, and the deep learning network model is then configured according to the determined method so that it can operate on the basis of the resulting configuration information.
However, because different devices often have different hardware conditions, a configuration method determined according to the hardware information of the current device is often no longer applicable when the deep learning network model is run on other devices. As a result, the configuration method has to be re-determined for each device's hardware information every time the device is replaced. The current method for configuring the deep learning network model therefore lacks universality and is poorly portable.
Disclosure of Invention
The embodiments of the present application provide a method, an apparatus, and a computer storage medium for configuring a deep learning network model, which can solve the lack of universality and poor portability of the configuration methods in the related art. The technical solution is as follows:
in one aspect, a method for configuring a deep learning network model is provided, the method comprising:
acquiring hardware information of current equipment;
determining input configuration information of a multi-layer neural network included in a deep learning network model according to the hardware information of the current equipment, wherein the input configuration information comprises a data conversion identifier and operation parameters, and the data conversion identifier is used for indicating whether to perform data format conversion on characteristic data input to the neural network;
And distributing operation resources for the deep learning network model according to the hardware information of the current equipment and the input configuration information of the multi-layer neural network included in the deep learning network model.
Optionally, the determining, according to the hardware information of the current device, input configuration information of the multi-layer neural network included in the deep learning network model includes:
determining the channel alignment number of the current equipment according to the hardware information of the current equipment;
acquiring the number of input data channels of each layer of neural network in the multi-layer neural network included in the deep learning network model;
determining the data conversion identification of each layer of neural network according to the channel alignment number of the current equipment and the input data channel number of each layer of neural network;
and determining the input configuration information of each layer of neural network according to the data conversion identification of each layer of neural network and the weight data of each layer of neural network.
Optionally, the determining the data conversion identifier of each layer of neural network according to the channel alignment number of the current device and the input data channel number of each layer of neural network includes:
if the channel alignment number of the current device is the same as the input data channel number of a target layer neural network, determining a data conversion identifier of the target layer neural network as a first identifier, wherein the target layer neural network refers to any layer of neural networks in a multi-layer neural network included in the deep learning network model, and the first identifier is used for indicating that the characteristic data input into the target layer neural network is not subjected to data format conversion;
The determining the input configuration information of each layer of neural network of the deep learning network model according to the data conversion identification and the weight data of each layer of neural network comprises the following steps:
and using the data conversion identifier and the weight data of the target layer neural network as the input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information are the weight data.
Optionally, the determining the data conversion identifier of each layer of neural network according to the channel alignment number of the current device and the input data channel number of each layer of neural network includes:
if the channel alignment number of the current device is different from the input data channel number of the target layer neural network, determining that the data conversion identifier of the target layer neural network is a second identifier, wherein the target layer neural network is any layer of neural network in the multi-layer neural network included in the deep learning network model, and the second identifier is used for indicating to perform data format conversion on the characteristic data input into the target layer neural network;
the determining the input configuration information of each layer of neural network of the deep learning network model according to the data conversion identification and the weight data of each layer of neural network comprises the following steps:
Performing data format conversion on the weight data of the target layer neural network according to the channel alignment number;
and taking the converted weight data and the data conversion identification of the target layer neural network as input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information are the converted weight data.
Optionally, the method further comprises:
and if the target layer neural network is not any one of a Reshape layer, a full connection layer, a batch normalization BN layer and a scale layer, or if the target layer neural network is a connection layer and the next layer neural network of the target layer neural network is not the Reshape layer, executing the step of determining the data conversion identification of each layer neural network according to the channel alignment number of the current equipment and the input data channel number of each layer neural network.
Optionally, the method further comprises:
if the target layer neural network is any one of a Reshape layer, a full connection layer, a BN layer and a scale layer, or if the target layer neural network is a connection layer and the next layer neural network of the target layer neural network is a Reshape layer, determining that a data conversion identifier of the target layer neural network is a first identifier;
And using the data conversion identifier and the weight data of the target layer neural network as the input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information are the weight data.
Optionally, the allocating operation resources for the deep learning network model according to the hardware information of the current device and the input configuration information of the multi-layer neural network included in the deep learning network model includes:
determining a type identifier of operation hardware of the current equipment according to the hardware information of the current equipment, wherein the operation hardware is used for realizing data operation of each layer of neural network in the deep learning network model;
and distributing operation resources for the deep learning network model according to the type identification of the operation hardware of the current equipment and the input configuration information of the multi-layer neural network included in the deep learning network model.
Optionally, the allocating operation resources for the deep learning network model according to the type identifier of the operation hardware of the current device and the input configuration information of the multi-layer neural network included in the deep learning network model includes:
when the type identifier of the operation hardware of the current device indicates that the identified operation hardware is a graphics processing unit (GPU), creating a plurality of CPU threads for the deep learning network model according to the input configuration information of the multi-layer neural network included in the deep learning network model, wherein each CPU thread in the plurality of CPU threads comprises at least three GPU tasks, and the at least three GPU tasks are used for realizing data operation of the deep learning network model based on the input configuration information of the multi-layer neural network included in the deep learning network model;
and distributing a corresponding flow queue for each CPU thread, wherein the flow queue corresponding to each CPU thread comprises at least three GPU tasks in the corresponding CPU thread, and the flow queues corresponding to different CPU threads are different.
Optionally, the GPU includes a plurality of GPU thread blocks and a plurality of streaming multiprocessors (SMs), each of the plurality of GPU thread blocks including a plurality of GPU threads;
after the corresponding flow queue is allocated to each CPU thread, the method further comprises the following steps:
distributing corresponding GPU thread blocks for each GPU task according to the number of threads required for executing each GPU task and the number of a plurality of GPU threads included in each GPU thread block;
Searching idle SMs which do not execute GPU thread blocks currently from the SMs;
and determining the SM for executing the GPU thread blocks corresponding to each GPU task from the searched idle SM.
In another aspect, a configuration apparatus of a deep learning network model is provided, the apparatus including:
the acquisition module is used for acquiring the hardware information of the current equipment;
the determining module is used for determining input configuration information of the multi-layer neural network included in the deep learning network model according to the hardware information of the current equipment, wherein the input configuration information comprises a data conversion identifier and operation parameters, and the data conversion identifier is used for indicating whether to perform data format conversion on characteristic data input to the neural network;
and the allocation module is used for allocating operation resources for the deep learning network model according to the hardware information of the current equipment and the input configuration information of the multi-layer neural network included in the deep learning network model.
Optionally, the determining module includes:
a first determining submodule, configured to determine a channel alignment number of the current device according to hardware information of the current device;
the acquisition sub-module is used for acquiring the number of input data channels of each layer of neural network in the multi-layer neural networks included in the deep learning network model;
The second determining submodule is used for determining the data conversion identification of each layer of neural network according to the channel alignment number of the current equipment and the input data channel number of each layer of neural network;
and the third determining submodule is used for determining the input configuration information of each layer of neural network according to the data conversion identification of each layer of neural network and the weight data of each layer of neural network.
Optionally, the second determining submodule is specifically configured to:
if the channel alignment number of the current device is the same as the input data channel number of a target layer neural network, determining a data conversion identifier of the target layer neural network as a first identifier, wherein the target layer neural network refers to any layer of neural networks in a multi-layer neural network included in the deep learning network model, and the first identifier is used for indicating that the characteristic data input into the target layer neural network is not subjected to data format conversion;
the determining the input configuration information of each layer of neural network of the deep learning network model according to the data conversion identification and the weight data of each layer of neural network comprises the following steps:
and using the data conversion identifier and the weight data of the target layer neural network as the input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information are the weight data.
Optionally, the second determining submodule is specifically configured to:
if the channel alignment number of the current device is different from the input data channel number of the target layer neural network, determining that the data conversion identifier of the target layer neural network is a second identifier, wherein the target layer neural network is any layer of neural network in the multi-layer neural network included in the deep learning network model, and the second identifier is used for indicating to perform data format conversion on the characteristic data input into the target layer neural network;
the determining the input configuration information of each layer of neural network of the deep learning network model according to the data conversion identification and the weight data of each layer of neural network comprises the following steps:
performing data format conversion on the weight data of the target layer neural network according to the channel alignment number;
and taking the converted weight data and the data conversion identification of the target layer neural network as input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information are the converted weight data.
Optionally, the device is further configured to:
and if the target layer neural network is not any one of a Reshape layer, a full connection layer, a batch normalization BN layer and a scale layer, or if the target layer neural network is a connection layer and the next layer neural network of the target layer neural network is not the Reshape layer, executing the step of determining the data conversion identification of each layer neural network according to the channel alignment number of the current equipment and the input data channel number of each layer neural network.
Optionally, the device is further configured to:
if the target layer neural network is any one of a Reshape layer, a full connection layer, a BN layer and a scale layer, or if the target layer neural network is a connection layer and the next layer neural network of the target layer neural network is a Reshape layer, determining that a data conversion identifier of the target layer neural network is a first identifier;
and using the data conversion identifier and the weight data of the target layer neural network as the input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information are the weight data.
Optionally, the allocation module includes:
a fourth determining submodule, configured to determine, according to the hardware information of the current device, a type identifier of the operation hardware of the current device, where the operation hardware is the hardware used to implement the data operation of each layer of neural network in the deep learning network model;
and the allocation submodule is used for allocating operation resources for the deep learning network model according to the type identification of the operation hardware of the current equipment and the input configuration information of the multi-layer neural network included in the deep learning network model.
Optionally, the allocation submodule is specifically configured to:
when the type identifier of the operation hardware of the current device indicates that the identified operation hardware is a graphics processing unit (GPU), creating a plurality of CPU threads for the deep learning network model according to the input configuration information of the multi-layer neural network included in the deep learning network model, wherein each CPU thread in the plurality of CPU threads comprises at least three GPU tasks, and the at least three GPU tasks are used for realizing data operation of the deep learning network model based on the input configuration information of the multi-layer neural network included in the deep learning network model;
and distributing a corresponding flow queue for each CPU thread, wherein the flow queue corresponding to each CPU thread comprises at least three GPU tasks in the corresponding CPU thread, and the flow queues corresponding to different CPU threads are different.
Optionally, the GPU includes a plurality of GPU thread blocks and a plurality of streaming multiprocessors (SMs), each of the plurality of GPU thread blocks including a plurality of GPU threads;
the allocation submodule is specifically further configured to:
distributing corresponding GPU thread blocks for each GPU task according to the number of threads required for executing each GPU task and the number of a plurality of GPU threads included in each GPU thread block;
Searching idle SMs which do not execute GPU thread blocks currently from the SMs;
and determining the SM for executing the GPU thread blocks corresponding to each GPU task from the searched idle SM.
In another aspect, a configuration apparatus for a deep learning network model is provided, the apparatus comprising a processor, a communication interface, a memory, and a communication bus;
the processor, the communication interface and the memory complete communication with each other through the communication bus;
the memory is used for storing a computer program;
the processor is used for executing the program stored in the memory to realize the steps of the configuration method of the deep learning network model.
In another aspect, a computer readable storage medium is provided, in which a computer program is stored, which when executed by a processor, implements the steps of the method for configuring a deep learning network model provided above.
The technical solutions provided by the embodiments of the present application have at least the following beneficial effects:
The configuration method provided by the embodiment of the application first acquires the hardware information of the current device, then determines the input configuration information of the multi-layer neural network included in the deep learning network model according to that hardware information, and further allocates operation resources to the deep learning network model according to the input configuration information of the multi-layer neural network and the hardware information of the current device. In this way, even if the hardware conditions of the device change, the configuration method provided by the application can still configure the deep learning network model according to the hardware information of the current device; that is, the configuration method provided by the application has universality.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram of a software system architecture provided in an embodiment of the present application;
FIG. 2 is a flowchart of a method for configuring a deep learning network model according to an embodiment of the present application;
FIG. 3 is a comparison of the effects of resource allocation performed by the resource allocation method of the present application and by a related-art resource allocation method, according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a configuration apparatus of a deep learning network model according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a determining module according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an allocation module according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a smart device for configuring a deep learning network model according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Before explaining the embodiments of the present application in detail, application scenarios related to the embodiments of the present application are described.
Currently, processing of images, video data, and the like by a deep learning network model has been widely used in various fields. In an application, the deep learning network model may run on devices with different hardware conditions. For example, the deep learning network model may run on a device with CPU as the computing hardware, on a device with GPU as the computing hardware, or on a device with other computing hardware. When the deep learning network model is run on different devices, it is generally necessary to configure the deep learning network model by using different configuration methods, so that the deep learning network model can operate on the basis of configuration information obtained after configuration.
However, because devices vary widely, the configuration methods currently used for devices with different hardware conditions vary just as widely, so no single method is universal. In addition, the same configuration method is generally applied to all devices whose operation hardware is of the same type; yet because the specific parameters of the operation hardware differ, the effect on the operation performance and latency of the deep learning network model also differs, so a single shared configuration method may fail to exploit the hardware resources fully and thus waste them. For this reason, the embodiment of the present application provides a configuration method for a deep learning network model that can be applied generally to devices with different hardware conditions, and that configures the deep learning network model according to the hardware information of the current device so as to make maximal use of hardware resources.
Fig. 1 is an architecture diagram of a software system 100 provided in an embodiment of the present application, where the software system may be run on a terminal, a server, or other devices, and the configuration method of the deep learning network model provided in the embodiment of the present application may be implemented by each module in the software system. As shown in fig. 1, the software system includes a decoding module 101, a hardware matching module 102, a data format conversion module 103, a device scheduling and management module 104, and a configuration information generation module 105.
The decoding module 101 is configured to receive input image data and decode the received image data. The image data may be an input image or a video.
The hardware matching module 102 is configured to acquire the hardware information of the current device. In addition, in the embodiment of the present application, the hardware matching module 102 may be further configured to determine, according to the acquired hardware information, whether the current device is supported.
The data format conversion module 103 is configured to convert the format of the data decoded by the decoding module 101 according to the hardware information of the current device acquired by the hardware matching module 102, and to determine the weight data of each layer of the multi-layer neural network of the deep learning network model from the converted data. The data format conversion module 103 may also set a data conversion identifier for each layer of the neural network.
The device scheduling and management module 104 is configured to allocate operation resources for the deep learning network model according to the hardware information of the current device and the input configuration information of the multi-layer neural network included in the deep learning network model.
The configuration information generating module 105 is configured to generate the configuration information of the deep learning network model according to the weight data and the data conversion identifier of each layer of the neural network determined by the data format conversion module 103, together with the operation resources allocated to the deep learning network model.
Next, a configuration method of the deep learning network model provided in the embodiment of the present application is described.
Fig. 2 is a flowchart of a method for configuring a deep learning network model according to an embodiment of the present application. The method can be applied to equipment such as a terminal, a server and the like, and as shown in fig. 2, the method comprises the following steps:
step 201: and acquiring the hardware information of the current equipment.
In the embodiment of the present application, when a device runs the deep learning network model, the device may acquire its own hardware information before performing forward inference through the model.
It should be noted that, the device may obtain its own hardware information through a software stack. In the embodiment of the application, the hardware information of the device acquired by the device includes information of operation hardware for realizing operation of each layer of neural network of the deep learning network model. By way of example, the hardware information may include a hardware identification of the computing hardware in the device, a type identification of the computing hardware, model parameters of the computing hardware, and so forth.
For example, when the computing hardware in the device is a GPU, the hardware information may include information such as an identifier of the GPU, a type identifier for indicating that the computing hardware is the GPU, and a model parameter of the GPU.
In addition, the computing hardware may be different for different devices. The type identifier of the computing hardware may include an identifier for indicating that the computing hardware is a GPU, a CPU, a Hisi platform, or the like, which is not limited in the embodiment of the present application.
Optionally, before acquiring its own hardware information, the device may also receive input image data and decode it. The image data may be a picture or a video. Decoding software installed on the device, or dedicated decoding hardware, may be used to decode the image data.
It should also be noted that, in the embodiment of the present application, the device may store a preset type identifier set of the computing hardware. In this case, after acquiring the hardware information, the device may compare the type identifier of the operation hardware in the acquired hardware information with the type identifiers included in the type identifier set. If the type identifier set includes the type identifier of the computing hardware of the current device, it may be determined that the current device is a device supported by the configuration method provided by the embodiment of the present application, and at this time, the device may execute step 202. Otherwise, it may be determined that the current device is a device that is not supported by the configuration method provided by the embodiment of the present application, and at this time, the device may end the operation.
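For illustration only, the following C++ sketch shows one way the supported-device check described above could look. The patent publishes no code, so the structure HardwareInfo, the string type identifiers, and the set name kSupportedTypes are all hypothetical.

```cpp
// Hypothetical sketch: the patent publishes no code. HardwareInfo and
// kSupportedTypes are invented names for the concepts described above.
#include <set>
#include <string>

struct HardwareInfo {
    std::string hardware_id;   // hardware identifier of the operation hardware
    std::string type_id;       // type identifier, e.g. "GPU", "CPU", "HISI"
    std::string model_params;  // model parameters of the operation hardware
};

// Preset set of type identifiers that the configuration method supports.
const std::set<std::string> kSupportedTypes = {"CPU", "GPU", "HISI"};

// The device is supported when the type identifier of its operation
// hardware appears in the preset type identifier set.
bool IsDeviceSupported(const HardwareInfo& info) {
    return kSupportedTypes.count(info.type_id) > 0;
}
```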
Step 202: and determining input configuration information of the multi-layer neural network included in the deep learning network model according to the hardware information of the current equipment.
After the device acquires its own hardware information, it may determine the channel alignment number of the current device from that information. The device may then obtain the number of input data channels of each layer of the multi-layer neural network included in the deep learning network model; determine the data conversion identifier of each layer according to the channel alignment number of the current device and the layer's number of input data channels; and determine the input configuration information of each layer according to its data conversion identifier and its weight data.
It should be noted that different types of operation hardware have different hardware acceleration instruction sets. During forward inference of the deep learning network model, the instructions in such a set can be invoked only if the number of data channels of each layer of the neural network is aligned to the channel alignment number. Therefore, to make better use of the current device's hardware acceleration capability, the device may determine the channel alignment number from the acquired device information and then set a data conversion identifier for each layer of the neural network accordingly. During subsequent forward inference, each layer of the deep learning network model can then ensure, according to its data conversion identifier, that the channel count of its input data is aligned to the channel alignment number, so that the model can invoke instructions from the hardware acceleration instruction set to improve operation efficiency.
The channel alignment numbers of different operation hardware differ. The device may determine the channel alignment number of its operation hardware according to the type identifier in the acquired hardware information. For example, when the operation hardware is a CPU, the channel alignment number may be 16; when it is a GPU, the channel alignment number may be 4; and when it is a Hisi platform, the channel alignment number may be 4.
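The mapping from the type identifier of the operation hardware to the channel alignment number could then be a simple lookup, as in the hedged sketch below, which uses only the example values given above (CPU: 16, GPU: 4, Hisi: 4); actual values are hardware-specific.

```cpp
#include <stdexcept>
#include <string>

// Illustrative lookup using the example values given in the text
// (CPU: 16, GPU: 4, Hisi platform: 4); real values depend on the hardware.
int ChannelAlignmentNumber(const std::string& type_id) {
    if (type_id == "CPU")  return 16;
    if (type_id == "GPU")  return 4;
    if (type_id == "HISI") return 4;
    throw std::runtime_error("unsupported operation hardware: " + type_id);
}
```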
After obtaining the channel alignment number, the device may obtain the number of input data channels for each layer of neural network included in the deep learning network model, and compare the number of input data channels for each layer of neural network with the channel alignment number.
For example, the device may compare the number of data channels of a target layer neural network with the channel alignment number. If the channel alignment number of the current device is the same as the number of input data channels of the target layer neural network, the device determines that the data conversion identifier of the target layer neural network is the first identifier, and uses that identifier together with the weight data of the target layer neural network as the input configuration information of the target layer neural network; in that case the operation parameters included in the input configuration information are the weight data. Here the target layer neural network refers to any layer of the multi-layer neural network included in the deep learning network model, and the first identifier indicates that no data format conversion is performed on the feature data input to the target layer neural network. Conversely, if the channel alignment number of the current device differs from the number of input data channels of the target layer neural network, the data conversion identifier of the target layer neural network is determined to be the second identifier. In this case, the converted weight data and the data conversion identifier of the target layer neural network are used as the input configuration information of the target layer neural network, and the operation parameters included in the input configuration information are the converted weight data. The second identifier indicates that data format conversion is performed on the feature data input to the target layer neural network.
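A minimal sketch of this decision follows, under the assumption that the input configuration information can be modeled as a flag plus a weight buffer; the field names are invented, since the patent only states that it holds a data conversion identifier and operation parameters.

```cpp
#include <utility>
#include <vector>

// Invented layout: the patent only says the input configuration information
// holds a data conversion identifier plus operation parameters (weights).
struct LayerInputConfig {
    bool needs_conversion;       // false = first identifier, true = second
    std::vector<float> weights;  // (possibly converted) weight data
};

LayerInputConfig MakeInputConfig(int channel_alignment, int input_channels,
                                 std::vector<float> weights) {
    // Second identifier when the layer's input channel count differs from
    // the device's channel alignment number; the weight conversion itself
    // (e.g. channel padding) is sketched further below.
    bool convert = (input_channels != channel_alignment);
    return LayerInputConfig{convert, std::move(weights)};
}
```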
It should be noted that, as described in step 201 above, the device may receive and decode image data before acquiring the hardware information of the current device. On this basis, in this step the device may first obtain the number of input data channels of the first layer neural network of the deep learning network model and compare it with the channel alignment number. If the two are the same, the device directly inputs the decoded image data and the weight data of the first layer neural network into the first layer, and sets the data conversion identifier of the first layer neural network to the first identifier, which indicates that no data format conversion is performed on the feature data input to the first layer. The first identifier and the weight data of the first layer neural network then constitute the input configuration information of the first layer neural network.
If the number of input data channels of the first layer neural network differs from the channel alignment number, the device may perform data format conversion on the decoded image data and on the weight data of the first layer neural network according to the channel alignment number, so that the channel count of the converted image data is aligned to the channel alignment number. The converted image data and weight data are then input into the first layer neural network, and the data conversion identifier of the first layer is set to the second identifier, which indicates that data format conversion is performed on the feature data input to the first layer. The second identifier and the converted weight data of the first layer neural network then constitute the input configuration information of the first layer neural network.
The data input to the first layer neural network is processed by that layer, which outputs feature data. The device may then compare whether the number of input data channels of the second layer neural network is the same as the channel alignment number. If they are the same, the device may input the feature data and the weight data of the second layer into the second layer neural network, set the data conversion identifier of the second layer to the first identifier, and use the first identifier and the weight data of the second layer as the input configuration information of the second layer neural network.
Optionally, if the number of input data channels of the second layer neural network differs from the channel alignment number, the device may perform data format conversion on the feature data output by the first layer and on the weight data of the second layer according to the channel alignment number, input the converted feature data and weight data into the second layer neural network, and set the data conversion identifier of the second layer to the second identifier. In this case, the second identifier and the converted weight data of the second layer are used as the input configuration information of the second layer neural network.
For each subsequent layer of the deep learning network model, the device may proceed by analogy with the processing of the second layer, thereby obtaining the input configuration information of every layer of the deep learning network model.
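One plausible reading of the data format conversion applied to feature or weight data is zero-padding the channel dimension up to the channel alignment number; the patent does not spell out the exact scheme, so the sketch below is an assumption.

```cpp
#include <vector>

// Assumed scheme: zero-pad the channel dimension of one CHW tensor up to
// the next multiple of the channel alignment number. Layout and padding
// policy are guesses; the patent does not specify them.
std::vector<float> PadChannels(const std::vector<float>& data,
                               int channels, int height, int width,
                               int alignment) {
    int aligned = ((channels + alignment - 1) / alignment) * alignment;
    size_t plane = static_cast<size_t>(height) * width;
    std::vector<float> out(aligned * plane, 0.0f);  // padded channels stay zero
    for (int c = 0; c < channels; ++c)
        for (size_t i = 0; i < plane; ++i)
            out[c * plane + i] = data[c * plane + i];
    return out;
}
```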
Notably, some layers in a deep learning network model may not need their input channel count aligned to the channel alignment number. For example, when the model contains a Reshape layer, that layer only modifies dimensions; aligning its input channel count would require adding a pad operation, and padding means allocating extra memory and moving data around, which costs unnecessary performance. Therefore, if a layer of the neural network is a Reshape layer, the alignment operation described above may be skipped and the layer's data conversion identifier set directly to the first identifier. Likewise, because the Reshape layer does not support memory alignment, a connection layer whose next layer is a Reshape layer also skips the alignment operation in the embodiment of the present application, and its data conversion identifier is set directly to the first identifier.
For another example, when the deep learning network model includes a BN layer or a scale layer, skipping memory alignment for these layers saves video memory and also avoids the time spent repeatedly allocating and releasing memory. Therefore, if a layer of the deep learning network model is a BN layer or a scale layer, the device may set its data conversion identifier directly to the first identifier without performing the alignment operation described above.
In addition, for a fully connected layer included in the deep learning network model, the operation is a matrix multiplication, and aligning the channel dimension offers only limited room for improving matrix multiplication performance. The alignment operation may therefore be skipped for the fully connected layer as well, and its data conversion identifier set directly to the first identifier.
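Taken together, the exceptions in the three preceding paragraphs amount to a per-layer predicate. The following sketch is one hedged formulation, with the layer-type strings (including "Concat" standing in for the connection layer) chosen purely for illustration.

```cpp
#include <string>

// Layer-type strings are illustrative; "Concat" stands in for the
// connection layer discussed above.
bool AlignmentApplies(const std::string& layer_type,
                      const std::string& next_layer_type) {
    if (layer_type == "Reshape" || layer_type == "FullyConnected" ||
        layer_type == "BatchNorm" || layer_type == "Scale")
        return false;  // these layers keep the first identifier directly
    if (layer_type == "Concat" && next_layer_type == "Reshape")
        return false;  // Reshape does not support memory alignment
    return true;       // otherwise compare channel count with alignment number
}
```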
Step 203: and distributing operation resources for the deep learning network model according to the hardware information of the current equipment and the input configuration information of the multi-layer neural network included in the deep learning network model.
After determining the input configuration information of the multi-layer neural network included in the deep learning network model, the device can determine the type identifier of the operation hardware of the current device according to the hardware information of the current device, wherein the operation hardware is used for realizing the data operation of each layer of neural network in the deep learning network model; and allocating operation resources for the deep learning network model according to the type identification of the operation hardware of the current equipment and the input configuration information of the deep learning network model.
As described in the foregoing step 201, the obtained hardware information may include the type identifier of the computing hardware, on the basis of which the device may directly obtain the type identifier of the computing hardware from the hardware information.
After obtaining the type identification of the computing hardware, the device may allocate computing resources for the deep learning network model according to the hardware identified by the type identification of the computing hardware.
Specifically, when the type identifier of the operation hardware of the current device indicates that the operation hardware is a GPU, the device creates a plurality of CPU threads for the deep learning network model according to the input configuration information of the model, where each of the CPU threads includes at least three GPU tasks used to implement the data operation of the deep learning network model based on its input configuration information. The device then allocates a corresponding flow queue to each CPU thread, where the flow queue corresponding to each CPU thread contains the at least three GPU tasks of that thread, and different CPU threads correspond to different flow queues.
It should be noted that, when performing the operation of each layer of the neural network, the device may create multiple CPU threads for that layer's operation. Each CPU thread may include at least three GPU tasks: copying data from CPU memory into the GPU's video memory, having the GPU process the copied data, and copying the processed data back into CPU memory. The layer's operation can be realized through these GPU tasks. On this basis, when allocating and scheduling resources, the embodiment of the present application may allocate a corresponding flow queue to each CPU thread and place the at least three GPU tasks of each CPU thread into its corresponding flow queue, so that GPU tasks belonging to different CPU threads sit in different flow queues. Because they are in different flow queues, GPU tasks belonging to different CPU threads can be executed simultaneously, avoiding blocking between the GPU tasks of different CPU threads.
Fig. 3 shows a comparison of the effects of resource allocation performed by the resource allocation method of the present application and by the resource allocation method in the related art. In the related art, after the tasks in the CPU threads are loaded onto the GPU, the GPU tasks of all CPU threads sit in one queue and therefore have to be executed sequentially. As shown in fig. 3, when the GPU tasks of CPU thread A and CPU thread B are put into one flow queue, the first GPU task A1 of CPU thread A is executed first, that is, the data corresponding to thread A in CPU memory is copied into the GPU's video memory; then the second GPU task A2 is executed, in which the GPU processes the data; and then the third GPU task A3 is executed, copying the data processed by GPU task A2 back to the CPU. Only after these three GPU tasks have been executed can the next GPU task B1, belonging to CPU thread B, be executed; the GPU tasks of different threads thus block one another. In the embodiment of the present application, by contrast, after the tasks in the CPU threads are loaded onto the GPU, the GPU tasks of different threads sit in different flow queues, so the GPU tasks of one thread can be executed while those of another thread are executing. As shown in fig. 3, while the GPU tasks of CPU thread B are being executed, the GPU tasks in CPU thread A can also be executed in parallel.
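The per-thread flow queue corresponds naturally to a CUDA stream. The sketch below is not the patent's implementation, only a minimal CUDA illustration of the three GPU tasks (copy in, compute, copy back) issued on one CPU thread's own stream; LayerKernel is a stand-in for the real layer computation, and for true copy/compute overlap the host buffers would need to be pinned (cudaMallocHost).

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for one layer's real data operation.
__global__ void LayerKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Issues the three GPU tasks of one CPU thread on that thread's own
// stream (the "flow queue"), so tasks of different CPU threads can overlap.
void RunLayerOnOwnStream(const float* host_in, float* host_out, int n) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);                 // one flow queue per CPU thread

    float* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));

    // GPU task 1: copy data from CPU memory into the GPU's video memory.
    cudaMemcpyAsync(dev, host_in, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    // GPU task 2: the GPU processes the copied data.
    LayerKernel<<<(n + 255) / 256, 256, 0, stream>>>(dev, n);
    // GPU task 3: copy the processed data back into CPU memory.
    cudaMemcpyAsync(host_out, dev, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);
    cudaFree(dev);
    cudaStreamDestroy(stream);
}
```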
It should be noted that the GPU executes GPU tasks through GPU thread blocks. The GPU may include a plurality of GPU thread blocks and a plurality of streaming multiprocessors (SMs), each GPU thread block including a plurality of GPU threads. On this basis, after a corresponding flow queue has been allocated to each CPU thread, a corresponding GPU thread block may be allocated to each GPU task according to the number of threads required to execute the task and the number of GPU threads included in each thread block.
The device can determine the number of the GPU thread blocks required by each GPU task according to the number of threads required by the GPU task and the number of the plurality of GPU threads included in each GPU thread block, and further allocate the required number of GPU thread blocks for each GPU task from the plurality of GPU thread blocks.
After the corresponding GPU thread blocks are allocated to each GPU task, the device may search for an idle SM that does not currently execute the GPU thread blocks from the plurality of SMs according to the GPU thread blocks corresponding to each GPU task, and determine an SM for executing the GPU thread blocks corresponding to each GPU task from the searched idle SMs.
It should be noted that, when the SMs execute GPU thread blocks, one GPU thread block may monopolize one SM, that is, one SM may execute one GPU thread block at the same time. Based on this, when a current GPU task needs to be executed by N thread blocks, the device may determine N idle SMs of the GPU thread blocks that are not currently executed from the plurality of SMs, and execute N thread blocks corresponding to the GPU task by the determined N SMs.
It should be noted that the number of GPU thread blocks corresponding to a GPU task determines the number of SMs required to execute it, and the number of GPU threads in one thread block in turn determines how many thread blocks the task needs. Task balancing across the multiple SMs within the GPU can therefore be ensured by adjusting the number of threads within a GPU thread block.
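The arithmetic behind this balancing is straightforward; the sketch below (illustrative only) shows how the thread-block count, and hence the number of SMs a task can occupy, follows from the threads required and the threads per block.

```cpp
// Thread blocks needed for a GPU task = ceil(threads required / threads per
// block); since one SM executes one thread block at a time, this is also an
// upper bound on how many SMs the task can occupy at once.
int NumThreadBlocks(int threads_required, int threads_per_block) {
    return (threads_required + threads_per_block - 1) / threads_per_block;
}
// Example: 10240 required threads at 256 threads per block -> 40 blocks,
// so up to 40 SMs can be busy; raising the block size to 512 halves this
// to 20 blocks, leaving more idle SMs for other GPU tasks.
```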
Optionally, in the embodiment of the present application, before allocating an operation resource to the deep learning network model according to the type identifier of the operation hardware of the current device and the input configuration information of the deep learning network model, the device may further determine whether the type identifier of the operation hardware of the current device is a preset identifier. If the identifier is the preset identifier, the embodiment of the application may not execute the step. The preset identifier refers to a type identifier of a device which does not support resource allocation.
Optionally, in this embodiment of the present application, before allocating an operation resource to the deep learning network model according to the type identifier of the operation hardware of the current device and the input configuration information of the deep learning network model, the device may further determine whether a multi-layer neural network included in the deep learning network model includes a neural network of a preset type, and if so, the embodiment of the present application may not execute this step. The preset type of neural network refers to a neural network which does not support resource allocation through the embodiment of the application.
In the embodiment of the present application, the hardware information of the current device can first be acquired, the input configuration information of the multi-layer neural network included in the deep learning network model can then be determined according to that hardware information, and operation resources can further be allocated to the deep learning network model according to the input configuration information of the multi-layer neural network and the hardware information of the current device. In this way, even if the hardware conditions of the device change, the configuration method provided by the application can still configure the deep learning network model according to the hardware information of the current device; that is, the configuration method provided by the application has universality.
In addition, in the embodiment of the present application, a data conversion identifier is set for each layer of the neural network based on the channel alignment number and the layer's number of data channels. During subsequent forward inference, this ensures that the format of each layer's input data is aligned to the channel alignment number according to its data conversion identifier, so that each layer can use the hardware acceleration instruction set of the current device to accelerate inference; that is, the hardware performance of the current device is used to the greatest extent. Furthermore, the embodiment of the present application can allocate different flow queues to different CPU threads, so that the GPU tasks of different CPU threads can be executed in parallel, effectively improving the operation performance of the deep learning network model. Finally, after GPU thread blocks are allocated to GPU tasks, the thread blocks can be executed by currently idle SMs; that is, if one GPU task does not occupy all SMs, the remaining SMs can execute other GPU thread blocks, ensuring task balance across the SMs.
Next, a configuration device of a deep learning network model provided in an embodiment of the present application will be described.
Referring to fig. 4, an embodiment of the present application provides a configuration apparatus 400 of a deep learning network model, where the apparatus 400 includes:
an obtaining module 401, configured to obtain hardware information of a current device;
a determining module 402, configured to determine, according to hardware information of a current device, input configuration information of a multi-layer neural network included in the deep learning network model, where the input configuration information includes a data conversion identifier and an operation parameter, where the data conversion identifier is used to indicate whether to perform data format conversion on feature data input to the neural network;
the allocation module 403 is configured to allocate operation resources for the deep learning network model according to the hardware information of the current device and the input configuration information of the multi-layer neural network included in the deep learning network model.
Optionally, referring to fig. 5, the determining module 402 includes:
a first determining submodule 4021, configured to determine a channel alignment number of the current device according to hardware information of the current device;
an obtaining submodule 4022, configured to obtain the number of input data channels of each layer of the multi-layer neural network included in the deep learning network model;
a second determining submodule 4023, configured to determine the data conversion identifier of each layer of the neural network according to the channel alignment number of the current device and the number of input data channels of that layer;
the third determining submodule 4024 is configured to determine input configuration information of each layer of neural network according to the data conversion identifier of each layer of neural network and the weight data of each layer of neural network.
Optionally, the second determination submodule 4023 is specifically configured to:
if the channel alignment number of the current device is the same as the number of input data channels of a target layer neural network, determining the data conversion identifier of the target layer neural network as a first identifier, where the target layer neural network is any layer of the multi-layer neural network included in the deep learning network model, and the first identifier is used to indicate that no data format conversion is performed on the feature data input to the target layer neural network;
according to the data conversion identification and the weight data of each layer of neural network, determining the input configuration information of each layer of neural network of the deep learning network model comprises the following steps:
and taking the data conversion identifier and the weight data of the target layer neural network as the input configuration information of the target layer neural network, where the operation parameters included in the input configuration information are the weight data.
Optionally, the second determination submodule 4023 is specifically configured to:
if the channel alignment number of the current device is different from the number of input data channels of the target layer neural network, determining the data conversion identifier of the target layer neural network as a second identifier, where the target layer neural network is any layer of the multi-layer neural network included in the deep learning network model, and the second identifier is used to indicate that data format conversion is to be performed on the feature data input to the target layer neural network;
according to the data conversion identification and the weight data of each layer of neural network, determining the input configuration information of each layer of neural network of the deep learning network model comprises the following steps:
performing data format conversion on the weight data of the target layer neural network according to the channel alignment number;
and taking the converted weight data and the data conversion identifier of the target layer neural network as the input configuration information of the target layer neural network, where the operation parameters included in the input configuration information are the converted weight data.
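A minimal sketch of the identifier decision described by the two cases above; the names ConvertFlag and dataConversionId are invented for illustration, and a real implementation might test divisibility rather than the strict equality stated in the text:

// First identifier: the channel count already matches, no conversion needed.
// Second identifier: the channel count differs, so feature data and weight
// data are converted to the aligned format before the layer runs.
enum class ConvertFlag { kFirstIdentifier, kSecondIdentifier };

ConvertFlag dataConversionId(int channelAlignment, int inputChannels) {
    return (inputChannels == channelAlignment) ? ConvertFlag::kFirstIdentifier
                                               : ConvertFlag::kSecondIdentifier;
}

When the flag is the second identifier, the weight data can be converted with the same zero-padding shown earlier, and the padded weights become the operation parameters in the layer's input configuration information.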
Optionally, the apparatus 400 is further configured to:
if the target layer neural network is not any one of a Reshape layer, a full connection layer, a batch normalization (BN) layer, and a scale layer, or if the target layer neural network is a connection layer and the next layer of the target layer neural network is not a Reshape layer, the step of determining the data conversion identifier of each layer of the neural network according to the channel alignment number of the current device and the number of input data channels of each layer is executed.
Optionally, the apparatus 400 is further configured to:
if the target layer neural network is any one of a Reshape layer, a full connection layer, a BN layer and a scale layer, or if the target layer neural network is a connection layer and the next layer neural network of the target layer neural network is the Reshape layer, determining that the data conversion identifier of the target layer neural network is a first identifier;
and taking the data conversion identifier and the weight data of the target layer neural network as the input configuration information of the target layer neural network, where the operation parameters included in the input configuration information are the weight data.
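The layer-type exemptions in the two optional paragraphs above can be pictured with the following sketch; LayerType and skipsConversion are hypothetical names, and only the rule itself comes from the text:

enum class LayerType { kReshape, kFullyConnected, kBatchNorm, kScale,
                       kConnection, kConvolution, kOther };

// Layers whose results depend on the exact channel count keep the first
// identifier (no format conversion); so does a connection layer whose next
// layer is a Reshape layer. All other layers go through the alignment check.
bool skipsConversion(LayerType layer, LayerType nextLayer) {
    switch (layer) {
        case LayerType::kReshape:
        case LayerType::kFullyConnected:
        case LayerType::kBatchNorm:
        case LayerType::kScale:
            return true;
        case LayerType::kConnection:
            return nextLayer == LayerType::kReshape;
        default:
            return false;
    }
}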
Optionally, referring to fig. 6, the allocation module 403 includes:
a fourth determining submodule 4031, configured to determine, according to hardware information of the current device, a type identifier of operation hardware of the current device, where the operation hardware is hardware for implementing data operation of each layer of neural network in the deep learning network model;
the allocation submodule 4032 is configured to allocate an operation resource for the deep learning network model according to the type identifier of the operation hardware of the current device and the input configuration information of the multi-layer neural network included in the deep learning network model.
Optionally, the allocation submodule 4032 is specifically configured to:
when the type identifier of the operation hardware of the current device indicates that the operation hardware is a graphics processor (GPU), creating a plurality of CPU threads for the deep learning network model according to the input configuration information of the multi-layer neural network included in the deep learning network model, where each CPU thread in the plurality of CPU threads includes at least three GPU tasks, and the at least three GPU tasks are used to realize the data operation of the deep learning network model based on the input configuration information of the multi-layer neural network included in the deep learning network model;
and allocating a corresponding stream queue to each CPU thread, where the stream queue corresponding to each CPU thread includes the at least three GPU tasks in the corresponding CPU thread, and the stream queues corresponding to different CPU threads are different.
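In CUDA terms, this stream-queue allocation can be sketched as one CUDA stream per CPU thread, as below. The three placeholder kernels stand in for a thread's "at least three GPU tasks" and are assumptions, not the patent's kernels:

#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Placeholder kernels standing in for a CPU thread's GPU tasks
// (e.g., pre-processing, inference, post-processing).
__global__ void preprocessKernel() {}
__global__ void inferKernel() {}
__global__ void postprocessKernel() {}

void cpuWorker() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);  // a dedicated stream queue for this thread
    // Tasks in one stream execute in order; tasks in different streams
    // (i.e., from different CPU threads) may execute in parallel on the GPU.
    preprocessKernel<<<1, 64, 0, stream>>>();
    inferKernel<<<1, 64, 0, stream>>>();
    postprocessKernel<<<1, 64, 0, stream>>>();
    cudaStreamSynchronize(stream);  // wait for this thread's GPU tasks
    cudaStreamDestroy(stream);
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) workers.emplace_back(cpuWorker);
    for (auto& t : workers) t.join();
    return 0;
}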
Optionally, the GPU includes a plurality of GPU thread blocks and a plurality of streaming multiprocessors (SMs), each of the plurality of GPU thread blocks including a plurality of GPU threads;
the allocation submodule 4032 is also specifically configured to:
allocating a corresponding GPU thread block to each GPU task according to the number of threads required for executing each GPU task and the number of GPU threads included in each GPU thread block;
searching the plurality of SMs for idle SMs that are not currently executing GPU thread blocks;
and determining, from the found idle SMs, the SM for executing the GPU thread block corresponding to each GPU task.
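One way to picture this thread-block allocation, as a hedged sketch: the block count is derived from the threads a task needs, and the GPU's hardware scheduler then places the blocks on whichever SMs are idle (user code does not pick SMs directly). The kernel and launchTask below are invented for illustration:

#include <cuda_runtime.h>

__global__ void taskKernel(int totalThreads) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < totalThreads) {
        // per-GPU-thread share of this GPU task's data operation
    }
}

// Derive the number of thread blocks from the threads the task needs and
// the GPU threads per block; a task that does not fill every SM leaves the
// remaining SMs free to execute other tasks' thread blocks.
void launchTask(int threadsNeeded, cudaStream_t stream) {
    const int threadsPerBlock = 256;
    const int blocks = (threadsNeeded + threadsPerBlock - 1) / threadsPerBlock;
    taskKernel<<<blocks, threadsPerBlock, 0, stream>>>(threadsNeeded);
}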
In summary, in the embodiment of the present application, the hardware information of the current device may be obtained, the input configuration information of the multi-layer neural network included in the deep learning network model may then be determined according to the hardware information of the current device, and operation resources may be allocated to the deep learning network model according to that input configuration information and the hardware information of the current device. In this way, even if the hardware condition of the device changes, the deep learning network model can be configured according to the hardware information of the current device, that is, the configuration method provided by the present application has universality.
It should be noted that the configuration device of the deep learning network model provided in the above embodiment is illustrated only by the division of the above functional modules when configuring the deep learning network model; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the configuration device of the deep learning network model provided in the above embodiment belongs to the same concept as the embodiment of the configuration method of the deep learning network model; for the detailed implementation process of the device, reference is made to the method embodiment, and details are not described herein again.
Fig. 7 shows a block diagram of a smart device 700 according to an exemplary embodiment of the present application. The smart device 700 may be a smart phone, a tablet computer, a notebook computer, or a desktop computer. The smart device 700 may also be referred to by other names, such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, the smart device 700 includes: a processor 701 and a memory 702.
The processor 701 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 701 may be implemented in at least one of the hardware forms DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 701 may also include a main processor and a coprocessor. The main processor, also referred to as a CPU (Central Processing Unit), is a processor for processing data in an awake state; the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 701 may integrate a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 701 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory 702 may include one or more computer-readable storage media, which may be non-transitory. The memory 702 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 702 is used to store at least one instruction, which is executed by the processor 701 to implement the configuration method of the deep learning network model provided by the method embodiments of the present application.
In some embodiments, the smart device 700 may further optionally include: a peripheral interface 703 and at least one peripheral. The processor 701, the memory 702, and the peripheral interface 703 may be connected by a bus or signal lines. The individual peripheral devices may be connected to the peripheral device interface 703 via buses, signal lines or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 704, touch display 705, camera 706, audio circuitry 707, positioning component 708, and power supply 709.
The peripheral interface 703 may be used to connect at least one I/O (Input/Output)-related peripheral device to the processor 701 and the memory 702. In some embodiments, the processor 701, the memory 702, and the peripheral interface 703 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 701, the memory 702, and the peripheral interface 703 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 704 is configured to receive and transmit RF (Radio Frequency) signals, also referred to as electromagnetic signals. The radio frequency circuit 704 communicates with a communication network and other communication devices via electromagnetic signals, converting an electrical signal into an electromagnetic signal for transmission, or converting a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 704 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 704 may communicate with other devices via at least one wireless communication protocol, including but not limited to: the world wide web, metropolitan area networks, intranets, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 704 may also include NFC (Near Field Communication) related circuitry, which is not limited in this application.
The display screen 705 is used to display a UI (User Interface), which may include graphics, text, icons, video, and any combination thereof. When the display screen 705 is a touch display screen, it also has the ability to collect touch signals at or above its surface. The touch signal may be input to the processor 701 as a control signal for processing. At this time, the display screen 705 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 705, provided on the front panel of the smart device 700; in other embodiments, there may be at least two display screens 705, disposed on different surfaces of the smart device 700 or in a folded design; in still other embodiments, the display screen 705 may be a flexible display screen disposed on a curved surface or a folded surface of the smart device 700. The display screen 705 may even be arranged in a non-rectangular irregular pattern, that is, an irregularly shaped screen. The display screen 705 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or other materials.
The camera assembly 706 is used to capture images or video. Optionally, the camera assembly 706 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the device, and the rear camera is disposed on the rear surface of the device. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so as to implement background blurring by fusing the main camera and the depth-of-field camera, panoramic and VR (Virtual Reality) shooting by fusing the main camera and the wide-angle camera, or other fusion shooting functions. In some embodiments, the camera assembly 706 may also include a flash, which may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
The audio circuit 707 may include a microphone and a speaker. The microphone is used to collect sound waves of users and the environment, convert the sound waves into electrical signals, and input them to the processor 701 for processing or to the radio frequency circuit 704 for voice communication. For stereo acquisition or noise reduction, there may be multiple microphones, disposed at different locations of the smart device 700; the microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 701 or the radio frequency circuit 704 into sound waves. The speaker may be a conventional thin-film speaker or a piezoelectric ceramic speaker. A piezoelectric ceramic speaker can convert an electrical signal not only into sound waves audible to humans but also into sound waves inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 707 may also include a headphone jack.
The positioning component 708 is used to locate the current geographic location of the smart device 700 to enable navigation or LBS (Location Based Service). The positioning component 708 may be based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the Galileo system of the European Union.
The power supply 709 is used to supply power to the various components in the smart device 700. The power supply 709 may use alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 709 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery, charged through a wired line, or a wireless rechargeable battery, charged through a wireless coil. The rechargeable battery may also support fast-charge technology.
In some embodiments, the smart device 700 also includes one or more sensors 710. The one or more sensors 710 include, but are not limited to: acceleration sensor 711, gyroscope sensor 712, pressure sensor 713, fingerprint sensor 714, optical sensor 715, and proximity sensor 716.
The acceleration sensor 711 can detect the magnitudes of accelerations on three coordinate axes of the coordinate system established with the smart device 700. For example, the acceleration sensor 711 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 701 may control the touch display screen 705 to display a user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 711. The acceleration sensor 711 may also be used for the acquisition of motion data of a game or a user.
The gyro sensor 712 may detect a body direction and a rotation angle of the smart device 700, and the gyro sensor 712 may collect a 3D motion of the user on the smart device 700 in cooperation with the acceleration sensor 711. The processor 701 may implement the following functions based on the data collected by the gyro sensor 712: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
The pressure sensor 713 may be disposed at a side frame of the smart device 700 and/or at an underlying layer of the touch display screen 705. When the pressure sensor 713 is disposed on a side frame of the smart device 700, a grip signal of the smart device 700 by a user may be detected, and the processor 701 performs left-right hand recognition or shortcut operation according to the grip signal collected by the pressure sensor 713. When the pressure sensor 713 is disposed at the lower layer of the touch display screen 705, the processor 701 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 705. The operability controls include at least one of button controls, scroll bar controls, icon controls, and menu controls.
The fingerprint sensor 714 is used to collect the user's fingerprint, and the processor 701 identifies the user according to the fingerprint collected by the fingerprint sensor 714, or the fingerprint sensor 714 identifies the user according to the collected fingerprint. Upon recognizing the user's identity as a trusted identity, the processor 701 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 714 may be provided on the front, back, or side of the smart device 700. When a physical key or vendor Logo is provided on the smart device 700, the fingerprint sensor 714 may be integrated with the physical key or vendor Logo.
The optical sensor 715 is used to collect the ambient light intensity. In one embodiment, the processor 701 may control the display brightness of the touch display 705 based on the ambient light intensity collected by the optical sensor 715. Specifically, when the intensity of the ambient light is high, the display brightness of the touch display screen 705 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 705 is turned down. In another embodiment, the processor 701 may also dynamically adjust the shooting parameters of the camera assembly 706 based on the ambient light intensity collected by the optical sensor 715.
A proximity sensor 716, also referred to as a distance sensor, is typically provided on the front panel of the smart device 700. The proximity sensor 716 is used to collect the distance between the user and the front of the smart device 700. In one embodiment, when the proximity sensor 716 detects that the distance between the user and the front face of the smart device 700 gradually decreases, the processor 701 controls the touch display 705 to switch from the bright screen state to the off screen state; when the proximity sensor 716 detects that the distance between the user and the front surface of the smart device 700 gradually increases, the processor 701 controls the touch display screen 705 to switch from the off-screen state to the on-screen state.
Those skilled in the art will appreciate that the structure shown in fig. 7 is not limiting of the smart device 700 and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
In an exemplary embodiment of the present application, a computer-readable storage medium, such as a memory comprising instructions, is also provided; the instructions are executable by a processor in the above device to perform the configuration method of the deep learning network model in the above embodiments. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description is merely a preferred embodiment of the present application and is not intended to limit the present application; any modification, equivalent replacement, improvement, or the like made within the spirit and principle of the present application shall fall within the protection scope of the present application.

Claims (12)

1. A method for configuring a deep learning network model, the method comprising:
acquiring hardware information of current equipment;
determining the channel alignment number of the current equipment according to the hardware information of the current equipment; acquiring the number of input data channels of each layer of neural network in the multi-layer neural network included in the deep learning network model; determining the data conversion identification of each layer of neural network according to the channel alignment number of the current equipment and the input data channel number of each layer of neural network; determining input configuration information of each layer of neural network in the multi-layer neural network included in the deep learning network model according to the data conversion identifier of each layer of neural network and the weight data of each layer of neural network, wherein the input configuration information comprises the data conversion identifier and operation parameters, and the data conversion identifier is used for indicating whether to perform data format conversion on characteristic data input to the neural network;
determining a type identifier of operation hardware of the current equipment according to the hardware information of the current equipment, wherein the operation hardware is used for realizing data operation of each layer of neural network in the deep learning network model;
when the type identifier of the operation hardware of the current device indicates that the operation hardware is a graphics processor (GPU), creating a plurality of CPU threads for the deep learning network model according to the input configuration information of the multi-layer neural network included in the deep learning network model, wherein each CPU thread in the plurality of CPU threads comprises at least three GPU tasks, and the at least three GPU tasks are used for realizing data operation of the deep learning network model based on the input configuration information of the multi-layer neural network included in the deep learning network model;
and allocating a corresponding stream queue to each CPU thread, wherein the stream queue corresponding to each CPU thread comprises the at least three GPU tasks in the corresponding CPU thread, and the stream queues corresponding to different CPU threads are different.
2. The method of claim 1, wherein determining the data conversion identifier of each layer of the neural network according to the channel alignment number of the current device and the input data channel number of each layer of the neural network comprises:
if the channel alignment number of the current device is the same as the input data channel number of a target layer neural network, determining a data conversion identifier of the target layer neural network as a first identifier, wherein the target layer neural network refers to any layer of neural networks in a multi-layer neural network included in the deep learning network model, and the first identifier is used for indicating that the characteristic data input into the target layer neural network is not subjected to data format conversion;
The determining the input configuration information of each layer of neural network of the deep learning network model according to the data conversion identification and the weight data of each layer of neural network comprises the following steps:
and taking the data conversion identifier and the weight data of the target layer neural network as the input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information are the weight data.
3. The method of claim 1, wherein determining the data conversion identifier of each layer of the neural network according to the channel alignment number of the current device and the input data channel number of each layer of the neural network comprises:
if the channel alignment number of the current device is different from the input data channel number of the target layer neural network, determining that the data conversion identifier of the target layer neural network is a second identifier, wherein the target layer neural network is any layer of neural network in the multi-layer neural network included in the deep learning network model, and the second identifier is used for indicating to perform data format conversion on the characteristic data input into the target layer neural network;
the determining the input configuration information of each layer of neural network of the deep learning network model according to the data conversion identification and the weight data of each layer of neural network comprises the following steps:
Performing data format conversion on the weight data of the target layer neural network according to the channel alignment number;
and taking the converted weight data and the data conversion identification of the target layer neural network as input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information are the converted weight data.
4. A method according to any one of claims 2-3, wherein the method further comprises:
and if the target layer neural network is not any one of a Reshape layer, a full connection layer, a batch normalization BN layer and a scale layer, or if the target layer neural network is a connection layer and the next layer neural network of the target layer neural network is not the Reshape layer, executing the step of determining the data conversion identification of each layer neural network according to the channel alignment number of the current equipment and the input data channel number of each layer neural network.
5. The method according to claim 4, wherein the method further comprises:
if the target layer neural network is any one of a Reshape layer, a full connection layer, a BN layer and a scale layer, or if the target layer neural network is a connection layer and the next layer neural network of the target layer neural network is a Reshape layer, determining that a data conversion identifier of the target layer neural network is a first identifier;
and taking the data conversion identifier and the weight data of the target layer neural network as the input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information are the weight data.
6. The method of claim 1, wherein the GPU comprises a plurality of GPU thread blocks and a plurality of streaming multiprocessors (SMs), each of the plurality of GPU thread blocks comprising a plurality of GPU threads;
after the corresponding stream queue is allocated to each CPU thread, the method further comprises:
allocating a corresponding GPU thread block to each GPU task according to the number of threads required for executing each GPU task and the number of GPU threads included in each GPU thread block;
searching the plurality of SMs for idle SMs that are not currently executing GPU thread blocks;
and determining, from the found idle SMs, the SM for executing the GPU thread block corresponding to each GPU task.
7. A configuration apparatus for a deep learning network model, the apparatus comprising:
the acquisition module is used for acquiring the hardware information of the current equipment;
the determining module is used for determining input configuration information of the multi-layer neural network included in the deep learning network model according to the hardware information of the current equipment, wherein the input configuration information comprises a data conversion identifier and operation parameters, and the data conversion identifier is used for indicating whether to perform data format conversion on characteristic data input to the neural network;
The allocation module is used for allocating operation resources for the deep learning network model according to the hardware information of the current equipment and the input configuration information of the multi-layer neural network included in the deep learning network model;
the determining module includes:
a first determining submodule, configured to determine a channel alignment number of the current device according to hardware information of the current device;
the acquisition sub-module is used for acquiring the number of input data channels of each layer of neural network in the multi-layer neural networks included in the deep learning network model;
the second determining submodule is used for determining the data conversion identification of each layer of neural network according to the channel alignment number of the current equipment and the input data channel number of each layer of neural network;
the third determining submodule is used for determining input configuration information of each layer of neural network according to the data conversion identifier of each layer of neural network and the weight data of each layer of neural network;
the distribution module comprises:
a fourth determining submodule, configured to determine, according to hardware information of the current device, a type identifier of operation hardware of the current device, where the operation hardware is hardware for implementing data operation of each layer of neural network in the deep learning network model;
The allocation submodule is configured to: when the type identifier of the operation hardware of the current device indicates that the operation hardware is a graphics processor (GPU), create a plurality of CPU threads for the deep learning network model according to the input configuration information of the multi-layer neural network included in the deep learning network model, wherein each CPU thread in the plurality of CPU threads comprises at least three GPU tasks, and the at least three GPU tasks are used for realizing data operation of the deep learning network model based on the input configuration information of the multi-layer neural network included in the deep learning network model;
and allocate a corresponding stream queue to each CPU thread, wherein the stream queue corresponding to each CPU thread comprises the at least three GPU tasks in the corresponding CPU thread, and the stream queues corresponding to different CPU threads are different.
8. The apparatus of claim 7, wherein the second determination submodule is specifically configured to:
if the channel alignment number of the current device is the same as the input data channel number of a target layer neural network, determining a data conversion identifier of the target layer neural network as a first identifier, wherein the target layer neural network refers to any layer of neural networks in a multi-layer neural network included in the deep learning network model, and the first identifier is used for indicating that the characteristic data input into the target layer neural network is not subjected to data format conversion;
The determining the input configuration information of each layer of neural network of the deep learning network model according to the data conversion identification and the weight data of each layer of neural network comprises the following steps:
and taking the data conversion identifier and the weight data of the target layer neural network as the input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information are the weight data.
9. The apparatus of claim 7, wherein the second determination submodule is specifically configured to:
if the channel alignment number of the current device is different from the input data channel number of the target layer neural network, determining that the data conversion identifier of the target layer neural network is a second identifier, wherein the target layer neural network is any layer of neural network in the multi-layer neural network included in the deep learning network model, and the second identifier is used for indicating to perform data format conversion on the characteristic data input into the target layer neural network;
the determining the input configuration information of each layer of neural network of the deep learning network model according to the data conversion identification and the weight data of each layer of neural network comprises the following steps:
Performing data format conversion on the weight data of the target layer neural network according to the channel alignment number;
and taking the converted weight data and the data conversion identification of the target layer neural network as input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information are the converted weight data.
10. The apparatus according to any one of claims 8-9, wherein the apparatus is further configured to:
and if the target layer neural network is not any one of a Reshape layer, a full connection layer, a batch normalization BN layer and a scale layer, or if the target layer neural network is a connection layer and the next layer neural network of the target layer neural network is not the Reshape layer, executing the step of determining the data conversion identification of each layer neural network according to the channel alignment number of the current equipment and the input data channel number of each layer neural network.
11. The apparatus of claim 10, wherein the apparatus is further configured to:
if the target layer neural network is any one of a Reshape layer, a full connection layer, a BN layer and a scale layer, or if the target layer neural network is a connection layer and the next layer neural network of the target layer neural network is a Reshape layer, determining that a data conversion identifier of the target layer neural network is a first identifier;
and taking the data conversion identifier and the weight data of the target layer neural network as the input configuration information of the target layer neural network, wherein the operation parameters included in the input configuration information are the weight data.
12. The apparatus of claim 7, wherein the GPU comprises a plurality of GPU thread blocks and a plurality of streaming multiprocessors (SMs), each of the plurality of GPU thread blocks comprising a plurality of GPU threads;
the allocation submodule is further specifically configured to:
allocate a corresponding GPU thread block to each GPU task according to the number of threads required for executing each GPU task and the number of GPU threads included in each GPU thread block;
search the plurality of SMs for idle SMs that are not currently executing GPU thread blocks;
and determine, from the found idle SMs, the SM for executing the GPU thread block corresponding to each GPU task.
CN201910388839.XA 2019-05-10 2019-05-10 Configuration method, device and storage medium of deep learning network model Active CN111914985B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910388839.XA CN111914985B (en) 2019-05-10 2019-05-10 Configuration method, device and storage medium of deep learning network model

Publications (2)

Publication Number Publication Date
CN111914985A CN111914985A (en) 2020-11-10
CN111914985B (en) 2023-07-04

Family

ID=73242371

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114764372A (en) * 2021-01-15 2022-07-19 阿里巴巴集团控股有限公司 Data processing method and device, electronic equipment and storage medium
CN113570030B (en) * 2021-01-18 2024-05-10 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN112799895B (en) * 2021-01-27 2024-07-26 北京嘀嘀无限科技发展有限公司 Hardware evaluation method, device, electronic equipment, storage medium and program product

Citations (3)

Publication number Priority date Publication date Assignee Title
CN105869117A (en) * 2016-03-28 2016-08-17 上海交通大学 Method for accelerating GPU directed at deep learning super-resolution technology
CN107844827A (en) * 2017-11-28 2018-03-27 北京地平线信息技术有限公司 The method and apparatus for performing the computing of the convolutional layer in convolutional neural networks
CN108875914A (en) * 2018-06-01 2018-11-23 北京地平线信息技术有限公司 The method and apparatus that Neural Network Data is pre-processed and is post-processed

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US10497089B2 (en) * 2016-01-29 2019-12-03 Fotonation Limited Convolutional neural network

Non-Patent Citations (3)

Title
Tianqi Chen et al., "TVM: End-to-End Optimization Stack for Deep Learning", arxiv.org/pdf/1802.04799v1.pdf, 2018-02-12, pp. 1-19 *
Ke Xuan et al., "Least-squares reverse time migration algorithm based on multi-thread and multi-GPU parallel acceleration", Geophysical Prospecting for Petroleum, vol. 58, no. 1, 2019-01-31, full text *
Textbook writing group of the Intel Software College, "Space-time diagram of a pipelined computer", in "Processor Architecture", Shanghai: Shanghai Jiao Tong University Press, 2011 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant