CN113592062A - Neural network configuration method and device, computer equipment and storage medium


Info

Publication number
CN113592062A
Authority
CN
China
Prior art keywords
tested
convolution
convolution algorithm
hardware
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110737010.3A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed (the inventor requested non-publication of their name)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DeepRoute AI Ltd
Original Assignee
DeepRoute AI Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DeepRoute AI Ltd
Priority to CN202110737010.3A
Publication of CN113592062A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3419 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application relates to a neural network configuration method and apparatus, a computer device, and a storage medium. The method comprises the following steps: generating a valid configuration list for the neural network; determining a convolutional layer to be configured in the neural network; traversing the combinations of hardware enhancement items and convolution algorithm identifiers in the valid configuration list to obtain the running delays of the convolutional layer to be configured, and determining the target hardware enhancement item and target convolution algorithm identifier of that layer according to the running delays; and determining the next convolutional layer to be configured in the neural network and returning to the traversing step, until the target hardware enhancement item and target convolution algorithm identifier of every convolutional layer of the neural network have been determined. By adopting the method, the efficiency with which the neural network executes inference tasks can be improved.

Description

Neural network configuration method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of deep learning technologies, and in particular, to a neural network configuration method, apparatus, computer device, and storage medium.
Background
Convolutional neural networks have achieved great success in deep learning in recent years. However, inference tasks such as image classification remain very computationally intensive, requiring both purpose-built hardware and libraries of inference routines that execute efficiently on low-power hardware. The GPGPU (General-Purpose Graphics Processing Unit) is one of the most widely used hardware platforms. In addition, developing efficient subroutine libraries for emerging neural network architectures is an important task: such libraries provide another layer of optimization over existing ones, enabling an inference engine to fully exploit the capabilities and efficiency of the underlying hardware rather than going through the interfaces of a traditional library.
Deep learning function libraries such as cuDNN (the CUDA Deep Neural Network library, where CUDA stands for Compute Unified Device Architecture) have been adopted by most deep learning frameworks, such as TensorFlow and PyTorch. These frameworks also provide wrapper functions that can accurately identify the optimal choice within the library; a user enables such a wrapper function by configuring environment variables, which improves the running speed of the neural network.
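For illustration, the switches that such frameworks expose look like the following (a minimal sketch; torch.backends.cudnn.benchmark and the TF_CUDNN_USE_AUTOTUNE environment variable are publicly documented PyTorch and TensorFlow knobs, named here only as examples and not as part of this application):

    import os
    import torch

    # PyTorch: let cuDNN benchmark the available convolution algorithms for
    # each new input shape and cache the fastest choice.
    torch.backends.cudnn.benchmark = True

    # TensorFlow: cuDNN autotuning is controlled through an environment
    # variable, set before the TensorFlow process starts.
    os.environ["TF_CUDNN_USE_AUTOTUNE"] = "1"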
However, because inference-engine developers usually call libraries such as cuDNN directly during development rather than through a deep learning framework, they cannot use the wrapper functions described above to identify the best choice within the library, and the resulting suboptimal configuration may degrade the running efficiency of the neural network.
Existing neural network configuration approaches therefore suffer from the problem of impairing the running efficiency of the neural network.
Disclosure of Invention
In view of the above, it is necessary to provide a neural network configuration method, apparatus, computer device and storage medium for solving the above technical problems.
In a first aspect, a neural network configuration method is provided, including:
generating a valid configuration list for the neural network, the valid configuration list comprising hardware enhancement items and convolution algorithm identifiers;
determining a convolutional layer to be configured in the neural network;
traversing the combinations of the hardware enhancement items and the convolution algorithm identifiers in the valid configuration list to obtain running delays of the convolutional layer to be configured, and determining a target hardware enhancement item and a target convolution algorithm identifier for the convolutional layer to be configured according to the running delays;
and determining the next convolutional layer to be configured in the neural network, and returning to the step of traversing the combinations of the hardware enhancement items and the convolution algorithm identifiers in the valid configuration list to obtain the running delays, until the respective target hardware enhancement item and target convolution algorithm identifier of each convolutional layer of the neural network have been determined.
In one embodiment, traversing the combinations of the hardware enhancement items and the convolution algorithm identifiers in the valid configuration list to obtain the running delays of the convolutional layer to be configured, and determining the target hardware enhancement item and target convolution algorithm identifier of the convolutional layer to be configured according to the running delays, includes:
selecting a convolution algorithm identifier to be tested from the valid configuration list;
for the convolution algorithm identifier to be tested, selecting a hardware enhancement item to be tested from the valid configuration list;
obtaining the running delay of the convolutional layer to be configured when it executes a benchmark test task using the convolution algorithm identifier to be tested and the hardware enhancement item to be tested;
obtaining a current optimized running delay; when the running delay is less than the current optimized running delay, updating the value of the current optimized running delay to the value of the running delay, updating a preset current preferred enhancement item to the hardware enhancement item to be tested, and updating a preset current preferred algorithm identifier to the convolution algorithm identifier to be tested;
for the convolution algorithm identifier to be tested, selecting the next hardware enhancement item to be tested from the valid configuration list until the hardware enhancement items of the valid configuration list have been traversed, and then selecting the next convolution algorithm identifier to be tested from the valid configuration list until the convolution algorithm identifiers of the valid configuration list have been traversed;
and determining the current preferred enhancement item as the target hardware enhancement item, and the current preferred algorithm identifier as the target convolution algorithm identifier.
In one embodiment, obtaining the running delay of the convolutional layer to be configured when it executes the benchmark test task using the convolution algorithm identifier to be tested and the hardware enhancement item to be tested includes:
passing the convolution algorithm identifier to be tested and the hardware enhancement item to be tested into a preset benchmark test function;
and running the benchmark test function to obtain the running delay it returns when the benchmark test task is executed with the convolution algorithm corresponding to the convolution algorithm identifier to be tested and the hardware enhancement mode corresponding to the hardware enhancement item to be tested.
In one embodiment, the method further comprises:
when the running delay is not less than the current optimized running delay, selecting, for the convolution algorithm identifier to be tested, the next hardware enhancement item to be tested from the valid configuration list.
In one embodiment, the convolution algorithm identifier identifies at least one of a direct convolution algorithm, an implicit generalized matrix multiplication algorithm, a fast Fourier transform algorithm, and a Winograd algorithm; the hardware enhancement items include core tensor options.
In a second aspect, a neural network configuration apparatus is provided, including:
a generation module configured to generate a valid configuration list for the neural network, the valid configuration list comprising hardware enhancement items and convolution algorithm identifiers;
a convolutional layer determination module configured to determine a convolutional layer to be configured in the neural network;
a combination traversal module configured to traverse the combinations of the hardware enhancement items and the convolution algorithm identifiers in the valid configuration list to obtain running delays of the convolutional layer to be configured, and to determine a target hardware enhancement item and a target convolution algorithm identifier for the convolutional layer to be configured according to the running delays;
and a return module configured to determine the next convolutional layer to be configured in the neural network and to return to the step of traversing the combinations in the valid configuration list, until the respective target hardware enhancement item and target convolution algorithm identifier of each convolutional layer of the neural network have been determined.
In one embodiment, the combination traversal module is further configured to:
select a convolution algorithm identifier to be tested from the valid configuration list;
for the convolution algorithm identifier to be tested, select a hardware enhancement item to be tested from the valid configuration list;
obtain the running delay of the convolutional layer to be configured when it executes a benchmark test task using the convolution algorithm identifier to be tested and the hardware enhancement item to be tested;
obtain a current optimized running delay and, when the running delay is less than the current optimized running delay, update the value of the current optimized running delay to the value of the running delay, update a preset current preferred enhancement item to the hardware enhancement item to be tested, and update a preset current preferred algorithm identifier to the convolution algorithm identifier to be tested;
for the convolution algorithm identifier to be tested, select the next hardware enhancement item to be tested from the valid configuration list until the hardware enhancement items of the valid configuration list have been traversed, and then select the next convolution algorithm identifier to be tested from the valid configuration list until the convolution algorithm identifiers of the valid configuration list have been traversed;
and determine the current preferred enhancement item as the target hardware enhancement item, and the current preferred algorithm identifier as the target convolution algorithm identifier.
In one embodiment, the combination traversal module is further configured to:
pass the convolution algorithm identifier to be tested and the hardware enhancement item to be tested into a preset benchmark test function;
and run the benchmark test function to obtain the running delay it returns when the benchmark test task is executed with the convolution algorithm corresponding to the convolution algorithm identifier to be tested and the hardware enhancement mode corresponding to the hardware enhancement item to be tested.
In a third aspect, a computer device is provided, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the neural network configuration method of the first aspect.
In a fourth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the neural network configuration method of the first aspect described above.
According to the neural network configuration method, apparatus, computer device, and computer-readable storage medium, a valid configuration list is first generated for the neural network and a convolutional layer to be configured is determined; the combinations of hardware enhancement items and convolution algorithm identifiers in the valid configuration list are then traversed to obtain the running delay of each combination, so that the target hardware enhancement item and target convolution algorithm identifier of the convolutional layer to be configured can be determined from the running delays of the different combinations; finally, the remaining convolutional layers are configured in turn until the respective target hardware enhancement item and target convolution algorithm identifier of every convolutional layer have been determined. Each convolutional layer in the neural network is thus independently and optimally configured, improving the efficiency with which the neural network executes inference tasks.
Drawings
FIG. 1 is a schematic flow diagram of a neural network configuration method of an embodiment;
FIG. 2 is a diagram of a convolution operation based on a four-dimensional tensor according to an embodiment;
FIG. 3 is a schematic illustration of a core tensor operation of an embodiment;
FIG. 4 is a flowchart illustrating the steps of traversing hardware enhancements and convolution algorithm identification, according to one embodiment;
FIG. 5 is a block diagram of a neural network configuration device of an embodiment;
FIG. 6 is an internal block diagram of a computer device of an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in FIG. 1, a neural network configuration method is provided. The neural network configuration method provided by this embodiment may be applied to a server or a server cluster formed by servers, and may also be applied to various terminals, for example, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
It should be noted that many function libraries have been developed so that different processors can execute deep learning tasks efficiently. However, developing an excellent function library for convolutional neural networks is very challenging, mainly for the following reasons:
1) hardware differs across generations, and even different series of processors from the same manufacturer may have different computational capabilities;
2) a function library usually has to provide sets of hard-coded alternative implementations;
3) different choices of algorithm or numerical precision may lead to significant differences in running delay or accuracy. The cuDNN function library, which has been adopted by deep learning frameworks such as TensorFlow and PyTorch, is subject to the drawbacks above; for this reason, wrapper functions are provided to more accurately identify the best choice in the library, and a user can apply them by configuring environment variables.
However, since inference-engine developers usually use the cuDNN library directly during development rather than through a deep learning framework, they cannot directly use the wrapper functions described above to identify the best choice in the library. In addition, as the cuDNN library has evolved, it offers increasingly flexible options for performing specific operations, but a user must spend considerable time and effort exploring the optimal configuration options, which increases the cost of configuring the neural network. For example, enabling a Tensor Core configuration option on a particular platform can generally help reduce latency, but the option may behave differently on different hardware generations and may even increase execution latency, requiring the user to tune the configuration at their own discretion.
In order to solve the above problems, the present disclosure provides a neural network configuration method that fully automatically configures each convolutional layer of a neural network, so that the neural network can run efficiently. The method may comprise the following steps:
Step S102: generating a valid configuration list for the neural network; the valid configuration list includes hardware enhancement items and convolution algorithm identifiers.
Here, the valid configuration list may contain the optional parameters for configuring the neural network, including the convolution algorithms and the hardware-enhancement-related parameters used by the convolutional layers when performing inference operations. Optionally, the convolution algorithm identifier may identify at least one of a direct convolution algorithm, an implicit generalized matrix multiplication algorithm, a fast Fourier transform algorithm, and a Winograd algorithm; the hardware enhancement items may include core tensor options.
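Concretely, a valid configuration list can be pictured as two small sets whose Cartesian product is the candidate space searched later (an illustrative sketch only; the identifier strings below are loosely modeled on cuDNN's enumerations and are not prescribed by this application):

    # Illustrative contents of a valid configuration list: the candidates are
    # all combinations of a convolution algorithm identifier and a core
    # tensor option (a hardware enhancement item).
    ALGORITHMS = [
        "DIRECT",         # direct convolution
        "IMPLICIT_GEMM",  # implicit generalized matrix multiplication
        "FFT",            # fast Fourier transform
        "WINOGRAD",       # Winograd
    ]
    MATH_TYPES = [
        "DEFAULT_MATH",                     # do not request tensor cores
        "TENSOR_OP_MATH",                   # use tensor cores, no down-conversion
        "TENSOR_OP_MATH_ALLOW_CONVERSION",  # down-convert data types if needed
    ]
    VALID_CONFIGS = [(algo, math_type) for algo in ALGORITHMS
                     for math_type in MATH_TYPES]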
It should be added that an inference engine goes through multiple stages to run a neural network. First, it reads the network graph data from an input file. After the graph data has been read, it is parsed, and node objects and their connections are created. Before the inference task is performed, each node must be configured with appropriate parameters. Finally, the engines of all layers of the neural network are executed to complete the corresponding inference operations. The inference operations include the convolution operations that the convolutional layers perform with some convolution algorithm. In general, in a neural network containing convolutional layers, the Input Tensors, Filters, and Output Tensors are four-dimensional. FIG. 2 is a diagram illustrating a convolution operation based on four-dimensional tensors, according to an embodiment. As shown in the figure, an NCHW input tensor is convolved with a KCRS weight tensor to produce an NKPQ output tensor; the input, intermediate, and output data in the computation are all four-dimensional tensors. Here, N denotes the number of images in each mini-batch, C the number of input feature maps, H the height of the input image, W the width of the input image, K the number of output feature maps, R the height of the filter kernel, and S the width of the filter kernel.
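As a concrete check of this dimension bookkeeping, the output height P and output width Q follow the standard convolution formula (a sketch; the padding and stride parameters are standard convolution parameters shown with illustrative defaults, not values prescribed by this application):

    def conv_output_shape(N, C, H, W, K, R, S, pad=0, stride=1):
        """NCHW input convolved with KCRS filters yields an NKPQ output.
        C (input feature maps) must match the filters' channel dimension."""
        P = (H + 2 * pad - R) // stride + 1  # output height
        Q = (W + 2 * pad - S) // stride + 1  # output width
        return (N, K, P, Q)

    # A batch of 32 images of shape 3 x 224 x 224 and 64 filters of size
    # 3 x 3, with padding 1 and stride 1, give a (32, 64, 224, 224) output.
    print(conv_output_shape(32, 3, 224, 224, 64, 3, 3, pad=1, stride=1))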
Because of the importance of convolution operations in the field of neural networks, a large number of specialized convolution algorithms have been developed. Common convolution algorithms include Direct Convolution, Implicit Generalized Matrix Multiplication (implicit GEMM), the Fast Fourier Transform (FFT), and Winograd. The direct convolution algorithm performs the convolution computation directly; the implicit generalized matrix multiplication algorithm converts the convolution problem into a matrix multiplication problem; the fast Fourier transform algorithm converts the convolution problem into a fast Fourier transform problem; and the Winograd algorithm performs convolution by reducing the number of multiplication/addition operations. These convolution algorithms are usually provided in libraries supplied by hardware vendors, but because of each algorithm's own limitations and numerical-accuracy considerations, it is often difficult to accurately predict how efficient a given convolution algorithm will be on a complex convolution task.
It should be noted that the convolution algorithms above are given only as examples; this embodiment does not limit the specific convolution algorithm. In other words, the neural network configuration method of this embodiment can be applied to any convolution algorithm.
On the other hand, advanced optimization techniques such as hardware enhancements have been introduced to further improve the running efficiency of neural networks. These techniques expose hardware enhancement items for the user to select; if an item is switched on, hardware instructions may be invoked to perform complex mathematical operations on a set or subset of data elements, optimizing running efficiency by improving hardware performance. Among these, the Tensor Core is one of the more common hardware enhancement items. FIG. 3 is a schematic diagram of a core tensor operation of an embodiment. As shown, the core tensor operation, a multiply-accumulate over 4 x 4 matrices, is implemented in a single hardware instruction. Here, D denotes the result of the core tensor operation, FP16 a 16-bit floating-point number, and FP32 a 32-bit floating-point number.
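The operation in FIG. 3 can be emulated numerically to see the mixed-precision pattern (a NumPy sketch for illustration only; a real tensor core executes this as one hardware instruction rather than explicit casts):

    import numpy as np

    # A and B are 4 x 4 FP16 operands; C is an FP32 accumulator.
    A = np.random.rand(4, 4).astype(np.float16)
    B = np.random.rand(4, 4).astype(np.float16)
    C = np.zeros((4, 4), dtype=np.float32)

    # D = A * B + C: FP16 products accumulated in FP32, which is the
    # numerical pattern of the core tensor operation.
    D = A.astype(np.float32) @ B.astype(np.float32) + C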
A common library typically provides several core tensor options for the user to choose from: for example, a configuration option that enables the tensor cores on a particular set of GPUs; a configuration option that allows core tensor operations to be used without actively down-converting the data type of a tensor in order to exploit the tensor cores; or a configuration option that allows core tensor operations to be used while actively down-converting the data type of a tensor in order to exploit the tensor cores.
Enabling a hardware enhancement item can, under certain conditions, raise hardware utilization through hardware instructions and thus enhance hardware performance. In some cases, however, it may increase the running delay or even produce incorrect results.
It can be seen that although many convolution algorithms and hardware enhancement items exist to improve the running efficiency of a neural network, not every convolution algorithm and hardware enhancement item yields an optimal configuration.
Step S104: determining the convolutional layer to be configured in the neural network.
The convolutional layer to be configured may be a convolutional layer serving as a configuration target in the neural network.
Specifically, in order to optimally configure each individual convolutional layer in the neural network, one convolutional layer may be first determined as the convolutional layer to be configured, and after the configuration of one convolutional layer is completed, the configuration of the next convolutional layer may be performed.
Step S106: traversing the combinations of the hardware enhancement items and the convolution algorithm identifiers in the valid configuration list to obtain the running delays of the convolutional layer to be configured, and determining the target hardware enhancement item and target convolution algorithm identifier of the convolutional layer to be configured according to the running delays.
Specifically, a hardware enhancement item and a convolution algorithm identifier may be selected from the valid configuration list to form one combination.
The combination of the hardware enhancement item and the convolution algorithm identifier is then used as the test parameters of a benchmark test function, and the convolutional layer to be configured is tested to obtain the running delay returned by the benchmark test function.
Next, the next hardware enhancement item and/or the next convolution algorithm identifier is selected from the valid configuration list to form a new combination, the new combination is used as the test parameters of the benchmark test function, and the convolutional layer to be configured is tested again to obtain the running delay returned by the benchmark test function.
This continues until all combinations of hardware enhancement items and convolution algorithm identifiers in the valid configuration list have been traversed. From the running delays of the different combinations, the shortest running delay can be determined; the hardware enhancement item and convolution algorithm identifier of that combination are the target hardware enhancement item and target convolution algorithm identifier.
The hardware enhancement mode corresponding to the target hardware enhancement item and the convolution algorithm corresponding to the target convolution algorithm identifier minimize the running delay of the convolutional layer to be configured, thereby optimizing the configuration of the neural network.
For example, if the target hardware enhancement item is the configuration option "allow core tensor operations to be used, but do not actively down-convert the data type of a tensor to exploit the tensor cores" and the target convolution algorithm identifier identifies the implicit generalized matrix multiplication algorithm, then, compared with all other hardware enhancement modes and convolution algorithms, the running delay of the convolutional layer to be configured is smallest when that enhancement item is enabled and convolution is performed with that algorithm.
This embodiment describes the configuration optimization of convolutional layers as an example, but in practice the neural network configuration method of this embodiment may also be used to optimize the configuration of deconvolution layers. Since the implementation is similar to that described above for convolutional layers, a detailed description is omitted.
Step S108: determining the next convolutional layer to be configured in the neural network, and returning to the step of traversing the combinations of the hardware enhancement items and the convolution algorithm identifiers in the valid configuration list to obtain the running delays, until the respective target hardware enhancement item and target convolution algorithm identifier of each convolutional layer of the neural network have been determined.
Specifically, after one convolutional layer of the neural network has been configured, the next convolutional layer is configured, until every convolutional layer in the network has been configured, as sketched below. Each convolutional layer can then perform its convolution operations with the algorithm best adapted to it, reducing running delay and improving the efficiency of inference operations.
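A compact view of this per-layer outer loop (illustrative only; network.conv_layers() and search_best_config() are hypothetical names standing for the layer iteration and for the per-layer search of step S106):

    # Each convolutional layer is configured independently: the best
    # combination found for one layer does not constrain the next layer.
    for layer in network.conv_layers():
        layer.best_config = search_best_config(layer, VALID_CONFIGS)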
In the neural network configuration method, a valid configuration list is first generated for the neural network and a convolutional layer to be configured is determined; the combinations of hardware enhancement items and convolution algorithm identifiers in the valid configuration list are then traversed to obtain the running delays of the different combinations, from which the target hardware enhancement item and target convolution algorithm identifier of the convolutional layer to be configured are determined; finally, the next convolutional layer to be configured is processed in the same way until the respective target hardware enhancement item and target convolution algorithm identifier of every convolutional layer have been determined. Each convolutional layer in the neural network is thus independently and optimally configured, improving the efficiency with which the neural network executes inference tasks.
Further, because the combinations of hardware enhancement items and convolution algorithm identifiers in the valid configuration list are traversed exhaustively, the problem of running efficiency being degraded by a suboptimal configuration is avoided. At the same time, the configuration process is fully automated and requires no user involvement, reducing development cost.
Further, even after an upgrade to a new library version or a new hardware generation, the neural network configuration method can be used for configuration optimization without changing any code; this generality further reduces development cost.
Further, the neural network configuration method can be executed during the configuration stage or the warm-up stage of the inference engine, so it does not affect the execution of inference tasks. The execution efficiency of inference is thus preserved while the running efficiency of the neural network is improved.
In one embodiment, as shown in FIG. 4, step S106 may specifically include the following steps:
Step S1061: selecting a convolution algorithm identifier to be tested from the valid configuration list;
Step S1062: for the convolution algorithm identifier to be tested, selecting a hardware enhancement item to be tested from the valid configuration list;
Step S1063: obtaining the running delay of the convolutional layer to be configured when it executes a benchmark test task using the convolution algorithm identifier to be tested and the hardware enhancement item to be tested;
Step S1064: obtaining a current optimized running delay; when the running delay is less than the current optimized running delay, updating the value of the current optimized running delay to the value of the running delay, updating a preset current preferred enhancement item to the hardware enhancement item to be tested, and updating a preset current preferred algorithm identifier to the convolution algorithm identifier to be tested;
Step S1065: returning to step S1062 until the hardware enhancement items of the valid configuration list have been traversed, and then returning to step S1061 until the convolution algorithm identifiers of the valid configuration list have been traversed;
Step S1066: determining the current preferred enhancement item as the target hardware enhancement item, and the current preferred algorithm identifier as the target convolution algorithm identifier.
Specifically, a convolution algorithm identifier may first be selected from the valid configuration list as the convolution algorithm identifier to be tested. Then, for that identifier, one of the hardware enhancement items in the valid configuration list is selected as the hardware enhancement item to be tested. This yields one combination of a hardware enhancement item and a convolution algorithm identifier.
The convolutional layer to be configured can then be made to execute a benchmark test task configured with the convolution algorithm identifier to be tested and the hardware enhancement item to be tested, yielding a running delay.
A current optimized running delay may be preset and initialized to the maximum value. After the benchmark test task has been executed and a running delay obtained, the obtained running delay is compared with the current optimized running delay.
When the obtained running delay is less than the current optimized running delay, the value of the current optimized running delay is updated to the value of the obtained running delay.
For example, if the current optimized running delay is 10 ms (milliseconds) and the benchmark test task yields a running delay of 8 ms, the obtained delay is smaller, so the current optimized running delay is updated to 8 ms, forming a new current optimized running delay.
In addition, a current preferred enhancement item and a current preferred algorithm identifier may be preset and initialized to null. When the obtained running delay is less than the current optimized running delay, the preset current preferred enhancement item is updated to the hardware enhancement item to be tested, and the preset current preferred algorithm identifier is updated to the convolution algorithm identifier to be tested.
For example, if the current optimized running delay is 10 ms and executing the benchmark test task with the core tensor option "allow core tensor operations to be used, but do not actively down-convert the data type of a tensor to exploit the tensor cores" together with the implicit generalized matrix multiplication algorithm yields a running delay of 8 ms, then, the obtained delay being smaller, the current preferred enhancement item is updated to the identifier of that core tensor option, and the current preferred algorithm identifier is updated to the identifier of the implicit generalized matrix multiplication algorithm.
The process then returns to step S1062: for the same convolution algorithm identifier, the next hardware enhancement item is selected as the new hardware enhancement item to be tested, forming a new combination, and steps S1063 and S1064 are repeated.
Note that the current optimized running delay has already been updated to the delay measured in the previous benchmark run. If the running delay obtained with the new hardware enhancement item to be tested is smaller than the current optimized running delay, the current optimized running delay is updated again with the newly measured value and the current preferred enhancement item is updated accordingly; otherwise, no update is performed.
After all hardware enhancement items in the valid configuration list have been traversed, the process returns to step S1061: the next convolution algorithm identifier is selected as the new convolution algorithm identifier to be tested, and steps S1062 to S1064 are repeated until all convolution algorithm identifiers in the valid configuration list have been traversed.
After all convolution algorithm identifiers in the valid configuration list have been traversed, the current preferred algorithm identifier can be determined as the target convolution algorithm identifier of the convolutional layer to be configured, and the current preferred enhancement item as its target hardware enhancement item.
In this embodiment, therefore, the value of the current optimized running delay is updated iteratively, continually approaching the shortest running delay; after all hardware enhancement items and convolution algorithm identifiers have been traversed, the final current optimized running delay is the shortest running delay the convolutional layer to be configured can achieve under any configuration of hardware enhancement item and convolution algorithm identifier. Correspondingly, the final current preferred algorithm identifier and current preferred enhancement item are the target convolution algorithm identifier and target hardware enhancement item that give the convolutional layer its optimal configuration.
With this neural network configuration method, the target hardware enhancement item and target convolution algorithm identifier that optimally configure the convolutional layer are obtained iteratively; no complex computation is needed anywhere in the configuration process, which improves the configuration efficiency of the neural network.
In one embodiment, obtaining the running delay of the convolutional layer to be configured when it executes the benchmark test task using the convolution algorithm identifier to be tested and the hardware enhancement item to be tested includes:
passing the convolution algorithm identifier to be tested and the hardware enhancement item to be tested into a preset benchmark test function; and running the benchmark test function to obtain the running delay it returns when the benchmark test task is executed with the convolution algorithm corresponding to the convolution algorithm identifier to be tested and the hardware enhancement mode corresponding to the hardware enhancement item to be tested.
Specifically, execution of the benchmark test task may be realized by a benchmark function. The convolution algorithm identifier to be tested and the hardware enhancement item to be tested are passed into the benchmark test function as its parameters, and the function is run; the time value it returns is the running delay.
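A benchmark function of this kind can be sketched as follows (assumptions: run_convolution is a hypothetical helper that executes the layer once under the given configuration, and timing uses wall-clock time; a real GPU implementation would synchronize the device around the timed region):

    import time

    def benchmark_conv2d(layer, algo, math_type, warmup=3, repeats=10):
        """Return the average running delay, in milliseconds, of one layer
        executed with the given convolution algorithm and core tensor option."""
        for _ in range(warmup):                 # discard cold-start runs
            run_convolution(layer, algo, math_type)
        start = time.perf_counter()
        for _ in range(repeats):
            run_convolution(layer, algo, math_type)
        elapsed = time.perf_counter() - start
        return elapsed / repeats * 1000.0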
It should be noted that the work done by the neural network configuration method of this embodiment is proportional to the number of convolution algorithms multiplied by the number of hardware enhancement items; this number of combinations is usually between 10 and 20, and executing the benchmark test task through the benchmark test function takes on the order of a few milliseconds per iteration. For example, 20 combinations at roughly 5 ms each amount to about 100 ms per layer, so the optimal configuration of every layer of the neural network can be completed in a short time before the network executes its inference tasks.
With this neural network configuration method, the running delay is obtained through the benchmark test function, a simple and fast mechanism that improves the configuration efficiency of the neural network.
In one embodiment, the method further comprises:
and when the operation delay is not less than the current optimized operation delay, aiming at the convolution algorithm identifier to be tested, selecting the next hardware enhancement item to be tested in the effective configuration list.
Specifically, when the running delay is greater than or equal to the current optimized running delay, it indicates that the combination of the newly selected hardware enhancement item and the convolution algorithm identifier is not a more optimal configuration, so that the next iteration is directly started without updating the current optimized running delay, the current optimized enhancement item and the current optimized algorithm identifier, so as to improve the configuration efficiency of the neural network.
To facilitate a thorough understanding of the embodiments of the present application by those skilled in the art, the following description will be made in conjunction with specific exemplary code.
[The exemplary code appears in the original publication as two images, BDA0003140317880000131 and BDA0003140317880000141, which are not reproduced here.]
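Based on the description that follows, the exemplary code can be reconstructed approximately as below (a sketch: the names best_config, FLOAT_MAX, Algorithms, Math_Types, and benchmark_conv2d come from the description, while the function wrapper and SimpleNamespace are added here only for runnability):

    import types

    FLOAT_MAX = float("inf")  # stands in for the maximum float value

    def configure_conv_layer(Algorithms, Math_Types):
        best_config = types.SimpleNamespace()  # initialized convolutional layer configuration
        best_config.time = FLOAT_MAX           # running delay, initialized to the maximum

        for algo in Algorithms:                # traverse the convolution algorithm identifiers
            for math_type in Math_Types:       # traverse the core tensor options
                time = benchmark_conv2d(algo, math_type)
                if time < best_config.time:    # a faster combination was found
                    best_config.algo = algo
                    best_config.math_type = math_type
                    best_config.time = time

        return best_config                     # optimized configuration of the convolutional layer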
In the exemplary code, best_config (the initialized convolutional layer configuration) is created, and best_config.time (the running delay recorded in the configuration) is initialized to the maximum value FLOAT_MAX.
Then, each algo (convolution algorithm identifier) in Algorithms (the set of convolution algorithm identifiers) in the valid configuration list is traversed by the loop statement for algo in Algorithms.
For each algo, each math_type (core tensor option) in Math_Types (the set of core tensor options) in the valid configuration list is traversed by the loop statement for math_type in Math_Types. In each iteration, the selected algo and math_type are passed to benchmark_conv2d() (the benchmark test function) to obtain the returned time value.
When the condition time < best_config.time is satisfied, best_config.algo, best_config.math_type, and best_config.time are updated, and the loop continues, until all combinations of algo and math_type have been traversed.
Finally, return best_config returns the optimized configuration of the convolutional layer; the algo and math_type in best_config correspond to the convolution algorithm identifier and core tensor option that make the running delay (time) shortest.
It should be understood that although the steps in the flowcharts of FIG. 1 and FIG. 4 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, there is no strict restriction on their order, and they may be performed in other orders. Moreover, at least some of the steps in FIG. 1 and FIG. 4 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and whose order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided a neural network configuration apparatus including:
a generation module 502 configured to generate a valid configuration list for the neural network, the valid configuration list comprising hardware enhancement items and convolution algorithm identifiers;
a convolutional layer determination module 504 configured to determine a convolutional layer to be configured in the neural network;
a combination traversal module 506 configured to traverse the combinations of the hardware enhancement items and the convolution algorithm identifiers in the valid configuration list to obtain running delays of the convolutional layer to be configured, and to determine a target hardware enhancement item and a target convolution algorithm identifier for the convolutional layer to be configured according to the running delays;
and a return module 508 configured to determine the next convolutional layer to be configured in the neural network and to return to the step of traversing the combinations in the valid configuration list, until the respective target hardware enhancement item and target convolution algorithm identifier of each convolutional layer of the neural network have been determined.
In one embodiment, the combination traversal module 506 is further configured to:
select a convolution algorithm identifier to be tested from the valid configuration list;
for the convolution algorithm identifier to be tested, select a hardware enhancement item to be tested from the valid configuration list;
obtain the running delay of the convolutional layer to be configured when it executes a benchmark test task using the convolution algorithm identifier to be tested and the hardware enhancement item to be tested;
obtain a current optimized running delay and, when the running delay is less than the current optimized running delay, update the value of the current optimized running delay to the value of the running delay, update a preset current preferred enhancement item to the hardware enhancement item to be tested, and update a preset current preferred algorithm identifier to the convolution algorithm identifier to be tested;
for the convolution algorithm identifier to be tested, select the next hardware enhancement item to be tested from the valid configuration list until the hardware enhancement items of the valid configuration list have been traversed, and then select the next convolution algorithm identifier to be tested from the valid configuration list until the convolution algorithm identifiers of the valid configuration list have been traversed;
and determine the current preferred enhancement item as the target hardware enhancement item, and the current preferred algorithm identifier as the target convolution algorithm identifier.
In one embodiment, the combination traversal module 506 is further configured to:
pass the convolution algorithm identifier to be tested and the hardware enhancement item to be tested into a preset benchmark test function;
and run the benchmark test function to obtain the running delay it returns when the benchmark test task is executed with the convolution algorithm corresponding to the convolution algorithm identifier to be tested and the hardware enhancement mode corresponding to the hardware enhancement item to be tested.
In one embodiment, the return module 508 is further configured to:
when the running delay is not less than the current optimized running delay, select, for the convolution algorithm identifier to be tested, the next hardware enhancement item to be tested from the valid configuration list.
In one embodiment, the convolution algorithm identifier identifies at least one of a direct convolution algorithm, an implicit generalized matrix multiplication algorithm, a fast Fourier transform algorithm, and a Winograd algorithm; the hardware enhancement items include core tensor options.
For specific limitations of the neural network configuration device, reference may be made to the above limitations of the neural network configuration method, which are not described herein again. The modules in the neural network configuration device can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
The neural network configuration device provided by the above can be used for executing the neural network configuration method provided by any of the above embodiments, and has corresponding functions and beneficial effects.
In one embodiment, a computer device is provided, which may be a terminal whose internal structure may be as shown in FIG. 6. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used to communicate with external terminals over a network connection. The computer program, when executed by the processor, implements a neural network configuration method. The display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input device may be a touch layer covering the display screen, a key, trackball, or touchpad on the housing of the computer device, or an external keyboard, touchpad, or mouse.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
generating a valid configuration list for the neural network, the valid configuration list comprising hardware enhancement items and convolution algorithm identifiers;
determining a convolutional layer to be configured in the neural network;
traversing the combinations of the hardware enhancement items and the convolution algorithm identifiers in the valid configuration list to obtain running delays of the convolutional layer to be configured, and determining a target hardware enhancement item and a target convolution algorithm identifier for the convolutional layer to be configured according to the running delays;
and determining the next convolutional layer to be configured in the neural network, and returning to the step of traversing the combinations of the hardware enhancement items and the convolution algorithm identifiers in the valid configuration list to obtain the running delays, until the respective target hardware enhancement item and target convolution algorithm identifier of each convolutional layer of the neural network have been determined.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
generating a valid configuration list for the neural network, the valid configuration list comprising hardware enhancement items and convolution algorithm identifiers;
determining a convolutional layer to be configured in the neural network;
traversing the combinations of the hardware enhancement items and the convolution algorithm identifiers in the valid configuration list to obtain running delays of the convolutional layer to be configured, and determining a target hardware enhancement item and a target convolution algorithm identifier for the convolutional layer to be configured according to the running delays;
and determining the next convolutional layer to be configured in the neural network, and returning to the step of traversing the combinations of the hardware enhancement items and the convolution algorithm identifiers in the valid configuration list to obtain the running delays, until the respective target hardware enhancement item and target convolution algorithm identifier of each convolutional layer of the neural network have been determined.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not every possible combination is described, but any combination of these technical features that contains no contradiction should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description, while specific and detailed, should not be construed as limiting the scope of the invention. A person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these fall within its scope of protection. The protection scope of this patent is therefore defined by the appended claims.

Claims (10)

1. A neural network configuration method, comprising:
generating a valid configuration list of the neural network, the valid configuration list comprising hardware enhancement items and convolution algorithm identifiers;
determining a convolutional layer to be configured in the neural network;
traversing combinations of the hardware enhancement items and the convolution algorithm identifiers in the valid configuration list to obtain running delays of the convolutional layer to be configured, and determining a target hardware enhancement item and a target convolution algorithm identifier for the convolutional layer according to the running delays; and
determining the next convolutional layer to be configured in the neural network and returning to the step of traversing the combinations in the valid configuration list, until a target hardware enhancement item and a target convolution algorithm identifier have been determined for every convolutional layer of the neural network.
2. The method of claim 1, wherein traversing the combinations of the hardware enhancement items and the convolution algorithm identifiers in the valid configuration list to obtain the running delays of the convolutional layer to be configured, and determining the target hardware enhancement item and the target convolution algorithm identifier according to the running delays, comprises:
selecting a convolution algorithm identifier to be tested from the valid configuration list;
for the convolution algorithm identifier to be tested, selecting a hardware enhancement item to be tested from the valid configuration list;
obtaining the running delay of the convolutional layer to be configured when a benchmark task is executed with the convolution algorithm identifier to be tested and the hardware enhancement item to be tested;
obtaining a current optimized running delay, and, when the measured running delay is smaller than the current optimized running delay, updating the current optimized running delay to the measured running delay, updating a preset current preferred enhancement item to the hardware enhancement item to be tested, and updating a preset current preferred algorithm identifier to the convolution algorithm identifier to be tested;
for the convolution algorithm identifier to be tested, selecting the next hardware enhancement item to be tested from the valid configuration list until the hardware enhancement items of the valid configuration list have been traversed, and then selecting the next convolution algorithm identifier to be tested until the convolution algorithm identifiers of the valid configuration list have been traversed; and
determining the current preferred enhancement item as the target hardware enhancement item, and the current preferred algorithm identifier as the target convolution algorithm identifier.
3. The method of claim 2, wherein obtaining the running delay of the convolutional layer to be configured when the benchmark task is executed with the convolution algorithm identifier to be tested and the hardware enhancement item to be tested comprises:
passing the convolution algorithm identifier to be tested and the hardware enhancement item to be tested into a preset benchmark function; and
running the benchmark function to obtain the running delay it returns when executing the benchmark task with the convolution algorithm corresponding to the identifier to be tested and the hardware enhancement mode corresponding to the item to be tested.
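A plausible shape for the preset benchmark function referenced above, shown as a minimal sketch: the signature and the `run_layer` callable are assumptions, and warm-up plus averaging are added for measurement stability rather than taken from the patent.

```python
import time

def benchmark(run_layer, conv_algo_id, hw_enhancement, warmup=3, repeats=10):
    """Execute the benchmark task under the given (algorithm identifier,
    enhancement item) pair and return the average running delay in seconds."""
    for _ in range(warmup):                  # discard cold-start effects
        run_layer(conv_algo_id, hw_enhancement)
    start = time.perf_counter()
    for _ in range(repeats):
        run_layer(conv_algo_id, hw_enhancement)
    return (time.perf_counter() - start) / repeats
```

On an asynchronous backend such as a GPU, the device would additionally need to be synchronized before each clock read, since kernel launches return before the computation finishes.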
4. The method of claim 2, further comprising:
when the measured running delay is not smaller than the current optimized running delay, selecting, for the convolution algorithm identifier to be tested, the next hardware enhancement item to be tested from the valid configuration list.
5. The method of claim 1, wherein the convolution algorithm identifier labels at least one of a direct convolution algorithm, an implicit general matrix multiplication (GEMM) algorithm, a fast Fourier transform algorithm, and a Winograd algorithm; and the hardware enhancement items comprise a tensor core option.
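The four families in claim 5 match the standard convolution implementations that, for example, cuDNN exposes as forward-algorithm enums, with its math-type switch playing the role of the tensor core enhancement. The mapping below is one plausible concretization and an assumption, not something the patent specifies.

```python
from enum import Enum

class ConvAlgoId(Enum):
    # cuDNN forward-convolution algorithms covering the four claimed families
    DIRECT = "CUDNN_CONVOLUTION_FWD_ALGO_DIRECT"
    IMPLICIT_GEMM = "CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM"
    FFT = "CUDNN_CONVOLUTION_FWD_ALGO_FFT"
    WINOGRAD = "CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD"

class HwEnhancement(Enum):
    # cuDNN math types: tensor cores enabled vs. default FP32 math
    TENSOR_CORE = "CUDNN_TENSOR_OP_MATH"
    DEFAULT = "CUDNN_DEFAULT_MATH"

# The valid configuration list is then the cross product of the two sets.
VALID_CONFIGS = [(enh, algo) for enh in HwEnhancement for algo in ConvAlgoId]
```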
6. A neural network configuration apparatus, comprising:
a generation module configured to generate a valid configuration list of the neural network, the valid configuration list comprising hardware enhancement items and convolution algorithm identifiers;
a convolutional-layer determination module configured to determine a convolutional layer to be configured in the neural network;
a combination traversal module configured to traverse combinations of the hardware enhancement items and the convolution algorithm identifiers in the valid configuration list to obtain running delays of the convolutional layer to be configured, and to determine a target hardware enhancement item and a target convolution algorithm identifier for the convolutional layer according to the running delays; and
a return module configured to determine the next convolutional layer to be configured in the neural network and to return to the step of traversing the combinations in the valid configuration list, until a target hardware enhancement item and a target convolution algorithm identifier have been determined for every convolutional layer of the neural network.
7. The apparatus of claim 6, wherein the combination traversal module is further configured to:
select a convolution algorithm identifier to be tested from the valid configuration list;
for the convolution algorithm identifier to be tested, select a hardware enhancement item to be tested from the valid configuration list;
obtain the running delay of the convolutional layer to be configured when a benchmark task is executed with the convolution algorithm identifier to be tested and the hardware enhancement item to be tested;
obtain a current optimized running delay, and, when the measured running delay is smaller than the current optimized running delay, update the current optimized running delay to the measured running delay, update a preset current preferred enhancement item to the hardware enhancement item to be tested, and update a preset current preferred algorithm identifier to the convolution algorithm identifier to be tested;
for the convolution algorithm identifier to be tested, select the next hardware enhancement item to be tested from the valid configuration list until the hardware enhancement items of the valid configuration list have been traversed, and then select the next convolution algorithm identifier to be tested until the convolution algorithm identifiers of the valid configuration list have been traversed; and
determine the current preferred enhancement item as the target hardware enhancement item, and the current preferred algorithm identifier as the target convolution algorithm identifier.
8. The apparatus of claim 7, wherein the combination traversal module is further configured to:
pass the convolution algorithm identifier to be tested and the hardware enhancement item to be tested into a preset benchmark function; and
run the benchmark function to obtain the running delay it returns when executing the benchmark task with the convolution algorithm corresponding to the identifier to be tested and the hardware enhancement mode corresponding to the item to be tested.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 5.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 5.
CN202110737010.3A (priority date 2021-06-30, filing date 2021-06-30) — Neural network configuration method and device, computer equipment and storage medium — Pending — CN113592062A (en)

Priority Applications (1)

Application Number: CN202110737010.3A — Priority Date: 2021-06-30 — Filing Date: 2021-06-30 — Title: Neural network configuration method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number: CN202110737010.3A — Priority Date: 2021-06-30 — Filing Date: 2021-06-30 — Title: Neural network configuration method and device, computer equipment and storage medium

Publications (1)

Publication Number: CN113592062A (en) — Publication Date: 2021-11-02

Family

ID=78245390

Family Applications (1)

Application Number: CN202110737010.3A (Pending, CN113592062A (en)) — Title: Neural network configuration method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113592062A (en)

Similar Documents

Publication Publication Date Title
CN109086031B (en) Business decision method and device based on rule engine
CN112101562B (en) Implementation method and system of machine learning modeling process
Seshadri et al. An evaluation of edge tpu accelerators for convolutional neural networks
US20050222827A1 (en) Accelerated solution of constraint satisfaction problems by partitioning of the variable space
CN113095474A (en) Resource usage prediction for deep learning models
US11610117B2 (en) System and method for adapting a neural network model on a hardware platform
CN111126668A (en) Spark operation time prediction method and device based on graph convolution network
CN111522640A (en) Parallel execution method and equipment of computational graph
CN115756478A (en) Method for automatically fusing operators of calculation graph and related product
CN112947932B (en) Method and device for optimizing vectorization in compiling process and electronic equipment
CN111158901B (en) Optimization method, optimization device, computer equipment and storage medium for calculation graph
CN112990461B (en) Method, device, computer equipment and storage medium for constructing neural network model
KR102255470B1 (en) Method and apparatus for artificial neural network
CN115469931B (en) Instruction optimization method, device, system, equipment and medium of loop program
CN113592062A (en) Neural network configuration method and device, computer equipment and storage medium
CN114021733A (en) Model training optimization method and device, computer equipment and storage medium
CN113592063A (en) Neural network configuration method and device, computer equipment and storage medium
CN113743448A (en) Model training data acquisition method, model training method and device
CN114661301B (en) Graphics processing unit compiling method, device, compiling acceleration library and storage medium
Zykov et al. Application of information processes applicative modelling to virtual machines auto configuration
Garanina et al. Model Checking Meets Auto-Tuning of High-Performance Programs
CN110673877A (en) Parallel computing method based on manual vectorization
CN115081628B (en) Method and device for determining adaptation degree of deep learning model
Chen et al. A scheduling algorithm for heterogeneous computing systems by edge cover queue
US20230064481A1 (en) Reconfigurable execution of machine learning networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination