CN111415004A - Method and apparatus for outputting information - Google Patents
- Publication number: CN111415004A (application CN202010184800.9A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
Embodiments of the present disclosure disclose a method and apparatus for outputting information. One embodiment of the method comprises: acquiring an input feature map of at least one input channel and a convolution kernel; performing non-zero index extraction on the sparse weight parameter matrix serving as the convolution kernel to obtain a non-zero element index list; for each input channel, performing a traversal multiply-add calculation between each non-zero element in the non-zero element index list and the input feature map of that channel to obtain a slice (partial plane) of the output feature map corresponding to that non-zero element; and for each output channel, accumulating the slices corresponding to all non-zero elements in the non-zero element index list to obtain and output the output feature map of that channel. This implementation exploits the sparsity of convolutional neural networks to accelerate computation, greatly reduces the amount of calculation, improves the execution speed of visual detection tasks, and is suitable for general-purpose hardware architectures.
Description
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular to a method and apparatus for outputting information.
Background
Convolution refers to an operator in a deep neural network for visual detection tasks; it completes feature extraction by sliding different convolution kernels over an input image and running multiply-add operations. Convolutional neural networks have become the most widely used model in deep learning. As deep learning is applied more and more on mobile devices, including phones, automobiles, Internet-of-Things hardware, and other computationally constrained devices, and since most of the computation of a convolutional neural network lies in its convolutions, efficient implementation of convolution computation becomes especially necessary.

Han et al. have shown that the weight parameters of convolutional neural networks generally exhibit sparsity of 20% to 80%; that is, after the model is optimized by pruning, its sparsity reaches roughly 20% to 80% without affecting inference accuracy. Sparsity is the ratio of the number of zero-valued elements among the model's weight parameters to the total number of weight parameters. These zero values add computational work and consume significant compute power, yet contribute nothing to the result. To compute convolutional neural networks efficiently on mobile devices, the convolution operator of a sparse model must be re-implemented, and an efficient calculation method that avoids zero-value computation must be found.
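For concreteness, the sparsity ratio just defined can be sketched as follows (a minimal illustration; the function name and kernel values are ours, not from the patent):

```python
def sparsity(weights):
    """Ratio of zero-valued elements to the total number of weight parameters."""
    flat = [w for row in weights for w in row]
    return sum(1 for w in flat if w == 0.0) / len(flat)

# A 3x3 convolution kernel with 7 of its 9 weights pruned to zero.
kernel = [
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.2],
    [0.0, 0.0, 0.0],
]
```

Here `sparsity(kernel)` is 7/9, i.e. roughly 78%, within the 20% to 80% range cited above.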
Parameter sparsification by weight pruning usually damages the kernel's regular structure: convolution can no longer be computed by sliding a regular convolution kernel, so most convolution accelerations (im2col and Winograd) become ineffective, and most current sparse-convolution acceleration approaches target specific hardware.

Existing sparse acceleration schemes for specific hardware can obtain speedups by exploiting hardware features, but they lack generality, their application scenarios are too limited, and their deployment costs are high. Currently, most mobile deep learning is still deployed on general-purpose CPU devices, so a general sparse-convolution acceleration method that is easy to integrate and quick to deploy is still needed.
Disclosure of Invention
Embodiments of the present disclosure propose methods and apparatuses for outputting information.
In a first aspect, an embodiment of the present disclosure provides a method for outputting information, including: acquiring an input feature map of at least one input channel and a convolution kernel; performing non-zero index extraction on the sparse weight parameter matrix serving as the convolution kernel to obtain a non-zero element index list; for each input channel, performing a traversal multiply-add calculation between each non-zero element in the non-zero element index list and the input feature map of that channel to obtain a slice of the output feature map corresponding to that non-zero element; and for each output channel, accumulating the slices of the output feature map corresponding to all non-zero elements in the non-zero element index list to obtain and output the output feature map of that channel.

In some embodiments, the method further comprises: accelerating, through instruction reordering and data prefetching, the inner loop of the sparse convolution calculation and the traversal over the output feature map.

In some embodiments, performing non-zero index extraction on the sparse weight parameter matrix serving as the convolution kernel to obtain a non-zero element index list includes: traversing the sparse weight parameter matrix of the convolution kernel and storing the weight parameters greater than a preset threshold into the non-zero element index list.

In some embodiments, before performing the non-zero index extraction, the method further comprises: defining lists in_ptr, w_ptr, out_ptr, and out_cnt for storing indexes, used respectively to store the starting data address of the input feature map, the non-zero weight parameter address, the starting data address of the output feature map, and the number of slices corresponding to the starting data address of the output feature map.

In some embodiments, after performing the non-zero index extraction, the method further comprises: storing the addresses of the non-zero elements into the w_ptr list; calculating, from the corresponding input channel, the channel starting address of the input feature map and the starting offset of the sliding traversal, obtaining the starting address of each traversal, and storing it into in_ptr; and judging whether the starting address of the output feature map for the current calculation has already been stored into out_ptr: if not, storing it into out_ptr; otherwise, increasing the corresponding slice count by 1 and updating it in the out_cnt list.

In some embodiments, performing, for each non-zero element in the non-zero element index list, a traversal multiply-add calculation between the non-zero element and the input feature map of the input channel includes: traversing the out_ptr list and reading the count value at the corresponding subscript of the out_cnt list; looping over the count value and reading the input index list in_ptr and the weight index list w_ptr according to the loop variable; and traversing each pixel in the input feature map of the input channel, computing a slice by multiplying and adding the non-zero element with the pixel at the corresponding position of the input feature map in turn.
In a second aspect, an embodiment of the present disclosure provides an apparatus for outputting information, including: an acquisition unit configured to acquire an input feature map of at least one input channel and a convolution kernel; an index extraction unit configured to perform non-zero index extraction on the sparse weight parameter matrix serving as the convolution kernel to obtain a non-zero element index list; a calculation unit configured to perform, for each input channel, a traversal multiply-add calculation between each non-zero element in the non-zero element index list and the input feature map of that channel, obtaining a slice of the output feature map corresponding to that non-zero element; and an output unit configured to accumulate, for each output channel, the slices of the output feature map corresponding to all non-zero elements in the non-zero element index list to obtain and output the output feature map of that channel.

In some embodiments, the apparatus further comprises an acceleration unit configured to: accelerate, through instruction reordering and data prefetching, the inner loop of the sparse convolution calculation and the traversal over the output feature map.

In some embodiments, the index extraction unit is further configured to: traverse the sparse weight parameter matrix of the convolution kernel and store the weight parameters greater than a preset threshold into the non-zero element index list.

In some embodiments, the apparatus further comprises a definition unit configured to: before the non-zero index extraction, define lists in_ptr, w_ptr, out_ptr, and out_cnt for storing indexes, used respectively to store the starting data address of the input feature map, the non-zero weight parameter address, the starting data address of the output feature map, and the number of slices corresponding to the starting data address of the output feature map.

In some embodiments, the apparatus further comprises a storage unit configured to: after the non-zero index extraction, store the addresses of the non-zero elements into the w_ptr list; calculate, from the corresponding input channel, the channel starting address of the input feature map and the starting offset of the sliding traversal, obtain the starting address of each traversal, and store it into in_ptr; and judge whether the starting address of the output feature map for the current calculation has already been stored into out_ptr: if not, store it into out_ptr; otherwise, increase the corresponding slice count by 1 and update it in the out_cnt list.

In some embodiments, the calculation unit is further configured to: traverse the out_ptr list and read the count value at the corresponding subscript of the out_cnt list; loop over the count value and read the input index list in_ptr and the weight index list w_ptr according to the loop variable; and traverse each pixel in the input feature map of the input channel, computing a slice by multiplying and adding the non-zero element with the pixel at the corresponding position of the input feature map in turn.
In a third aspect, an embodiment of the present disclosure provides an electronic device for outputting information, including: one or more processors; a storage device having one or more programs stored thereon which, when executed by one or more processors, cause the one or more processors to implement a method as in any one of the first aspects.
In a fourth aspect, embodiments of the disclosure provide a computer readable medium having a computer program stored thereon, wherein the program when executed by a processor implements a method as in any one of the first aspect.
The method and apparatus for outputting information can exploit the sparsity of convolutional neural networks to accelerate computation and greatly reduce the amount of calculation. This implementation of the sparse convolution operator suits general-purpose hardware architectures such as CPUs and DSPs. It depends on no specific hardware or software, is easy to integrate into a deep learning inference framework, effectively reduces the amount of computation, and improves the execution speed of visual detection tasks.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram for one embodiment of a method for outputting information, according to the present disclosure;
FIGS. 3a-3c are schematic diagrams of the convolution process of a method for outputting information according to the present disclosure;
FIG. 4 is a flow diagram of yet another embodiment of a method for outputting information in accordance with the present disclosure;
FIG. 5 is a schematic block diagram illustrating one embodiment of an apparatus for outputting information according to the present disclosure;
FIG. 6 is a schematic block diagram of a computer system suitable for use with an electronic device implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the disclosed method for outputting information or apparatus for outputting information may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as an image recognition application, a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and supporting image browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, desktop computers, and the like.
The server 105 may be a server that provides various services, such as a background recognition server that provides recognition functions for images displayed on the terminal devices 101, 102, 103. The background recognition server may analyze and otherwise process the received data such as the image recognition request, and feed back a processing result (for example, the category of people and objects in the image) to the terminal device.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the method for outputting information provided by the embodiment of the present disclosure is generally performed by the server 105, and accordingly, the apparatus for outputting information is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for outputting information in accordance with the present disclosure is shown. The method for outputting information comprises the following steps:
Step 201: acquire an input feature map of at least one input channel and a convolution kernel.

In the present embodiment, the execution subject of the method for outputting information (e.g., the server shown in fig. 1) may receive an original image to be recognized from a terminal through a wired or wireless connection. The server stores the original convolutional neural network model, which the present disclosure may optimize. The original image can serve as an input feature map; if the convolutional neural network has multiple layers, the feature map output by one convolutional layer serves as the input feature map of the next. The convolution kernel consists of a trained weight parameter matrix.

Step 202: perform non-zero index extraction on the sparse weight parameter matrix serving as the convolution kernel to obtain a non-zero element index list.

In this embodiment, the essence of sparse acceleration is to keep zero-valued elements out of the calculation; the sparse weight parameter matrix of the convolution kernel can therefore be traversed and the indices of the elements whose weight is not 0 extracted directly to generate the non-zero element index list.
In some optional implementations of this embodiment, for further pruning, the sparse weight parameter matrix of the convolution kernel is traversed and only the weight parameters greater than a preset threshold are stored in the non-zero element index list.
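A hedged sketch of this extraction step (the function and parameter names are ours; the default threshold of 0 reproduces plain non-zero extraction, and a positive threshold implements the further-pruning variant):

```python
def extract_nonzero_indices(kernel, threshold=0.0):
    """Traverse the kernel matrix and return (row, col, value) tuples for
    weights whose magnitude exceeds the threshold."""
    return [
        (i, j, v)
        for i, row in enumerate(kernel)
        for j, v in enumerate(row)
        if abs(v) > threshold
    ]

kernel = [
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.2],
    [0.0, 0.0, 0.0],
]
index_list = extract_nonzero_indices(kernel)
```

With this kernel, `index_list` holds only the two non-zero positions, and raising the threshold to 0.3 prunes the 0.2 weight as well.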
In some optional implementations of this embodiment, lists in_ptr, w_ptr, out_ptr, and out_cnt for storing indexes are defined, used respectively to store the starting data address of the input feature map, the non-zero weight parameter address, the starting data address of the output feature map, and the number of slices corresponding to the starting data address of the output feature map.
As shown in fig. 3a, the address of each non-zero element is stored in the w_ptr list.
A channel starting address base_input_addr of the input feature map and a sliding-traversal starting offset are calculated from the corresponding input channel; the starting address of each traversal, base_input_addr + offset, is obtained and stored into in_ptr. Here offset is the position offset, within the input feature map, of the first non-zero value of the convolution kernel's weight parameters.
The offset is calculated as follows. Assuming the convolution kernel shape is r×s (r rows and s columns) and the width of the input feature map is w:

for (int i = 0; i < r; i++)
    for (int j = 0; j < s; j++)
        offset = i * w + j;
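The loop above can be checked with a small sketch (assuming a 3×3 kernel and an input feature map of width 5; the function name is ours):

```python
def kernel_offsets(r, s, w):
    """offset = i*w + j: the linear offset, in a row-major input feature map
    of width w, of the input pixel aligned with kernel position (i, j)."""
    return {(i, j): i * w + j for i in range(r) for j in range(s)}

offsets = kernel_offsets(3, 3, 5)
```

For example, the kernel element at row 1, column 2 lands 1*5 + 2 = 7 pixels past the traversal's starting address.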
Whether the starting address of the output feature map for the current calculation has already been stored into out_ptr is then judged; if not, it is stored into out_ptr; otherwise, the corresponding slice count is increased by 1 and updated in the out_cnt list.
Step 203: for each input channel, perform a traversal multiply-add calculation between each non-zero element in the non-zero element index list and the input feature map of that channel, obtaining a slice of the output feature map corresponding to the non-zero element.

In this embodiment, a convolution operator is usually implemented as several nested loops, and the traversal order of the loops can be defined as a data flow. Different data flows yield different memory access performance and hence different computational performance; here, the data flow of the convolution operator is optimized specifically for the computational characteristics of sparse convolution.
The premise of data-flow optimization is that changing the loop order of the convolution must not affect the correctness of the result. Taking a single-channel input feature map and a single element of the convolution kernel and performing a traversal multiply-add yields one slice of the output feature map, called a partial plane. This slice is not a complete feature map; only when the partial planes corresponding to all elements of the convolution kernel are accumulated is the complete feature map obtained. The specific process is shown in fig. 3b.
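The partial plane for one non-zero element can be sketched as follows (a stride of 1 and no padding are assumed; the names are ours, not the patent's):

```python
def partial_plane(inp, weight, di, dj, out_h, out_w):
    """Slice of the output feature map contributed by the single kernel
    element `weight` at kernel position (di, dj): the weight times the
    correspondingly shifted window of the input feature map."""
    return [[weight * inp[oi + di][oj + dj] for oj in range(out_w)]
            for oi in range(out_h)]

inp = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
# With a 3x3 kernel on this 4x4 input the output is 2x2; take the
# kernel element at position (1, 1) with value 2.0.
plane = partial_plane(inp, 2.0, 1, 1, 2, 2)
```

Each output pixel of the slice is just the weight times one shifted input pixel, which is why zero-valued weights can be skipped without touching the input at all.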
Starting from the initial address, each pixel of the input feature map is traversed and multiplied in turn by one element of the convolution kernel; the resulting output feature map is one slice.
For a convolution kernel of size r×s whose sparse weight tensor has sparsity `sparsity`, the average number of partial planes is r*s*(1-sparsity), which removes a large amount of computation compared with non-sparse convolution.
the specific data flow for the sparse convolution is:
(1) traverse the out_ptr list and read the count value cnt at the corresponding subscript of the out_cnt list;

(2) loop cnt times, reading the input index list in_ptr and the weight index list w_ptr according to the loop variable;

(3) traverse the w×h input feature map and compute the partial planes in turn.
Step 204: for each output channel, accumulate the slices of the output feature map corresponding to all non-zero elements in the non-zero element index list to obtain and output the output feature map of that channel.
In this embodiment, the partial planes are accumulated to obtain the output feature map, as shown in fig. 3c. Slice accumulation is direct addition of corresponding pixels: each slice is w×h, and the pixels at corresponding positions are summed.
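Putting steps 202-204 together, accumulating partial planes over the non-zero elements reproduces the dense convolution result. A single-channel sketch (stride 1, no padding; this is our illustration of the scheme, not the patent's literal pointer-list implementation):

```python
def sparse_conv2d(inp, kernel):
    """Sum of partial planes over the non-zero kernel elements only;
    zero-valued weights never enter the multiply-add."""
    r, s = len(kernel), len(kernel[0])
    out_h, out_w = len(inp) - r + 1, len(inp[0]) - s + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(r):
        for j in range(s):
            v = kernel[i][j]
            if v == 0.0:
                continue  # skip zero weights entirely
            # Accumulate this element's partial plane into the output.
            for oi in range(out_h):
                for oj in range(out_w):
                    out[oi][oj] += v * inp[oi + i][oj + j]
    return out

inp = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
kernel = [[1.0, 0.0],
          [0.0, 2.0]]  # 50% sparse: only 2 of 4 weights contribute
out = sparse_conv2d(inp, kernel)
```

Only two of the four kernel weights generate partial planes here, matching the r*s*(1-sparsity) count given above.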
With further reference to fig. 4, a flow 400 of yet another embodiment of a method for outputting information is shown. The process 400 of the method for outputting information includes the steps of:
Step 401: acquire an input feature map of at least one input channel and a convolution kernel.

Step 402: perform non-zero index extraction on the sparse weight parameter matrix serving as the convolution kernel to obtain a non-zero element index list.
Steps 401-402 are substantially the same as steps 201-202 and are therefore not described again.
In this embodiment, lists in_ptr, w_ptr, out_ptr, and out_cnt for storing indexes are defined, used respectively to store the starting data address of the input feature map, the non-zero weight parameter address, the starting data address of the output feature map, and the number of slices corresponding to the starting data address of the output feature map. For the specific process, refer to step 202.
In this embodiment, data prefetching can significantly improve memory access performance and keep the instruction pipeline from stalling. The inner loop of the sparse convolution calculation and the traversal over the feature map can be accelerated through instruction reordering and data prefetching.
In this embodiment, each output index corresponds to an output index count. For example, if the count is 4, step 406 needs to be performed 4 times.
Step 406: take one weight index and one input index at a time, and perform a traversal multiply-add between the non-zero element pointed to by the weight index and the pixels starting at the input feature map address pointed to by the input index, obtaining one slice of the output feature map.
In this embodiment, to further increase computation speed, acceleration can be achieved through instruction reordering and data prefetching. For example, if the count of the output index is at least 4, then 4 input indexes and 4 weight indexes are taken, and the non-zero elements pointed to by the 4 weight indexes are each traversal-multiply-added with the input feature map regions pointed to by the 4 input indexes, yielding several slices of the output feature map. Traversal here means iterating from each input index, because each index represents a starting address.
The specific implementation process is as follows:
(1) take 4 input indexes and 4 non-zero weight parameters at a time for calculation;
(2) Loading an input index 1, prefetching an input index 2, and calculating a partial plane of the output index 1;
(3) prefetching an input index 3, and calculating a partial plane of an output index 2;
(4) prefetching an input index 4, and calculating a partial plane of an output index 3;
"fetch" refers to the traversal of an index from an index array.
"load" is the loading of data from memory into a register based on an index (which is in fact a memory address).
"prefetch" relates to the instruction pipeline. When a multiply-add program is compiled into assembly instructions, many memory-access and compute instructions enter the pipeline, and data must be loaded from memory into registers before a multiply or add can execute. If each computation issues its memory access on demand, efficiency suffers because memory access is slow. Prefetching loads the data required by the next compute instruction into registers while the current compute instruction executes; by the time it finishes, the next data is already in registers, saving memory-access time and achieving acceleration.
The weight parameters are loaded into registers only once and need not be re-read while traversing the feature map, so they require no prefetching; the input indexes represent starting addresses that must be traversed, so prefetching is needed to accelerate the traversal. Input index 1 corresponds to the first non-zero weight parameter.
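The 4-way schedule above concerns registers and the hardware prefetcher, which Python cannot express; the following sketch mirrors only its loop structure (process four partial planes per iteration, then a tail loop for the remainder), as our assumption about how the unrolling is organized:

```python
def accumulate_planes_by_four(planes):
    """Accumulate equally sized partial planes four at a time, with a
    scalar tail loop for the leftover planes."""
    out_h, out_w = len(planes[0]), len(planes[0][0])
    out = [[0.0] * out_w for _ in range(out_h)]
    k = 0
    while k + 4 <= len(planes):
        # Body of the unrolled loop: four planes per pass.
        p0, p1, p2, p3 = planes[k], planes[k + 1], planes[k + 2], planes[k + 3]
        for oi in range(out_h):
            for oj in range(out_w):
                out[oi][oj] += (p0[oi][oj] + p1[oi][oj]
                                + p2[oi][oj] + p3[oi][oj])
        k += 4
    for p in planes[k:]:  # fewer than 4 planes remain
        for oi in range(out_h):
            for oj in range(out_w):
                out[oi][oj] += p[oi][oj]
    return out

# Five 1x2 partial planes: four go through the unrolled body, one through the tail.
planes = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]], [[7.0, 8.0]], [[9.0, 10.0]]]
```

In the real implementation each pass of the unrolled body would also overlap its loads with prefetches for the next pass; here the unrolling only illustrates the grouping.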
If the count of the output index is less than 4, a single input index and weight index are taken for traversal calculation: the non-zero element pointed to by the weight index is traversal-multiply-added with the input feature map region pointed to by the input index, yielding one slice of the output feature map.
The detailed process is substantially the same as step 203, and therefore, the detailed description is omitted.
Step 407 is substantially the same as step 204, and therefore is not described in detail.
If output indexes remain unread, steps 404 to 407 continue to be performed.
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for outputting information in the present embodiment embodies steps accelerated by instruction out-of-order and data prefetching. Therefore, the scheme described in the embodiment can further improve the execution speed of the visual detection task.
With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for outputting information, which corresponds to the method embodiment shown in fig. 2 and is particularly applicable to various electronic devices.
As shown in fig. 5, the apparatus 500 for outputting information of the present embodiment includes: an acquisition unit 501, an index extraction unit 502, a calculation unit 503, and an output unit 504. The acquisition unit 501 is configured to acquire an input feature map and a convolution kernel of at least one input channel; the index extraction unit 502 is configured to perform non-zero index extraction on the sparse weight parameter matrix serving as the convolution kernel to obtain a non-zero element index list; the calculation unit 503 is configured to, for each input channel, perform a traversal multiply-add calculation on each non-zero element in the non-zero element index list and the input feature map of that input channel, to obtain a tangent plane of the output feature map corresponding to the non-zero element of the input channel; and the output unit 504 is configured to, for each output channel, accumulate the tangent planes of the output feature maps corresponding to all non-zero elements in the non-zero element index list to obtain and output the output feature map of the output channel.
In the present embodiment, the specific processing of the acquisition unit 501, the index extraction unit 502, the calculation unit 503, and the output unit 504 of the apparatus 500 for outputting information may refer to step 201, step 202, step 203, and step 204, respectively, in the corresponding embodiment of fig. 2.
In some optional implementations of this embodiment, the apparatus 500 further comprises an acceleration unit (not shown in the drawings) configured to: in the inner loop of the sparse convolution calculation and the traversal calculation process of the output feature map, perform acceleration through out-of-order instruction execution and data prefetching.
In some optional implementations of this embodiment, the index extraction unit 502 is further configured to: traverse the sparse weight parameter matrix of the convolution kernel, and store the weight parameters greater than a preset threshold value into the non-zero element index list.
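The threshold-based index extraction performed by this unit might look as follows in C. This is a sketch under assumptions not fixed by the text: a row-major flat kernel layout, and comparison on the weight's magnitude via `fabsf` (pruned kernels typically contain negative weights, although the text says only "greater than a preset threshold").

```c
#include <math.h>
#include <stddef.h>

/* Scan a flattened sparse convolution kernel of `len` weights and record
 * the flat indices of those whose magnitude exceeds `thresh` into `idx`
 * (which must hold at least `len` entries).  Returns the number of
 * non-zero entries found. */
size_t extract_nonzero(const float *w, size_t len, float thresh,
                       size_t *idx) {
    size_t cnt = 0;
    for (size_t i = 0; i < len; ++i)
        if (fabsf(w[i]) > thresh)   /* magnitude test is an assumption */
            idx[cnt++] = i;
    return cnt;
}
```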
In some optional implementations of this embodiment, the apparatus 500 further comprises a defining unit (not shown in the drawings) configured to: before non-zero index extraction, define lists in_ptr, w_ptr, out_ptr and out_cnt for storing indexes, which are respectively used for storing the starting data addresses of the input feature map, the non-zero weight parameter addresses, the starting data addresses of the output feature map, and the numbers of tangent planes corresponding to the starting data addresses of the output feature map.
In some optional implementations of this embodiment, the apparatus 500 further comprises a storage unit (not shown in the drawings) configured to: after non-zero index extraction, store the addresses of the non-zero elements into the w_ptr list; calculate, according to the corresponding input channel, the channel starting address of the input feature map and the starting offset of the sliding traversal, obtain the starting address of each traversal, and store it into the in_ptr list; and judge whether the output feature map starting address corresponding to the current calculation has already been stored into out_ptr; if not, store it into out_ptr, otherwise increment the corresponding tangent-plane count by 1 and update it in the out_cnt list.
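The out_ptr/out_cnt bookkeeping described above, i.e. append a new output starting address or bump the tangent-plane count of one already recorded, can be sketched as follows. The linear search and the caller-managed list length are illustrative assumptions, not details given by the text.

```c
#include <stddef.h>

/* Record an output-plane start address.  If `addr` is already present in
 * out_ptr, its tangent-plane count in out_cnt is incremented; otherwise
 * it is appended with a count of 1.  `n` is the current list length; the
 * (possibly grown) length is returned.  Arrays must have spare capacity. */
size_t record_out(float **out_ptr, size_t *out_cnt, size_t n, float *addr) {
    for (size_t i = 0; i < n; ++i) {
        if (out_ptr[i] == addr) {   /* already stored: bump the count */
            out_cnt[i]++;
            return n;
        }
    }
    out_ptr[n] = addr;              /* first occurrence: append */
    out_cnt[n] = 1;
    return n + 1;
}
```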
In some optional implementations of this embodiment, the calculation unit 503 is further configured to: traverse the out_ptr list and read the count value at the corresponding subscript in the out_cnt list; traverse over the count value, reading the input index list in_ptr and the weight index list w_ptr according to the loop variable of the traversal; and traverse each pixel in the input feature map of the input channel, sequentially computing a tangent plane by multiplying and adding the non-zero element with the pixel at the corresponding position of the input feature map.
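The traversal multiply-add itself, pairing each non-zero weight with its input starting address and accumulating one tangent plane at a time into the output plane, might be sketched as below. Unit stride and an output plane of `plane` contiguous elements are assumptions for illustration; the real addressing depends on kernel size, stride, and padding.

```c
#include <stddef.h>

/* Inner loops of the sparse convolution for one output plane: walk its
 * `cnt` tangent planes, each pairing one non-zero weight (w_ptr) with one
 * input start address (in_ptr), and accumulate elementwise multiply-adds
 * into the output plane of `plane` elements. */
void accumulate_planes(float *out, size_t plane,
                       const float *const *in_ptr, const float *w_ptr,
                       size_t cnt) {
    for (size_t k = 0; k < cnt; ++k)        /* one tangent plane per pair */
        for (size_t p = 0; p < plane; ++p)  /* traverse the feature map   */
            out[p] += w_ptr[k] * in_ptr[k][p];
}
```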
Referring now to fig. 6, a schematic diagram of an electronic device (e.g., the server or terminal device of fig. 1) 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like. The terminal device/server shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic device 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
In general, the following may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 608 including, for example, magnetic tape, hard disk, etc.; and communication devices 609.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of embodiments of the present disclosure. It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire an input feature map and a convolution kernel of at least one input channel; perform non-zero index extraction on the sparse weight parameter matrix serving as the convolution kernel to obtain a non-zero element index list; for each input channel, perform a traversal multiply-add calculation on each non-zero element in the non-zero element index list and the input feature map of the input channel to obtain a tangent plane of the output feature map corresponding to the non-zero element of the input channel; and for each output channel, accumulate the tangent planes of the output feature maps corresponding to all the non-zero elements in the non-zero element index list to obtain and output the output feature map of the output channel.
Computer program code for carrying out operations of embodiments of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor including an acquisition unit, an index extraction unit, a calculation unit, and an output unit. The names of these units do not, in some cases, limit the units themselves; for example, the acquisition unit may also be described as a "unit that acquires an input feature map and a convolution kernel of at least one input channel".
The foregoing description covers only the preferred embodiments of the present disclosure and is illustrative of the principles of the technology employed. Those skilled in the art will appreciate that the scope of the invention in the present disclosure is not limited to technical solutions formed by the specific combination of the above features, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example, technical solutions in which the above features are interchanged with features of similar function disclosed in (but not limited to) the present disclosure.
Claims (14)
1. A method for outputting information, comprising:
acquiring an input feature map and a convolution kernel of at least one input channel;
performing non-zero index extraction on the sparse weight parameter matrix serving as the convolution kernel to obtain a non-zero element index list;
for each input channel, performing traversal multiplication and addition calculation on each non-zero element in the non-zero element index list and the input feature map of the input channel to obtain a tangent plane of an output feature map corresponding to the non-zero element of the input channel;
and for each output channel, accumulating the tangent planes of the output feature maps corresponding to all the non-zero elements in the non-zero element index list to obtain and output the output feature map of the output channel.
2. The method of claim 1, wherein the method further comprises:
in the inner loop of the sparse convolution calculation and the traversal calculation process of the output feature map, performing acceleration through out-of-order instruction execution and data prefetching.
3. The method of claim 1, wherein the performing non-zero index extraction on the sparse weight parameter matrix as the convolution kernel to obtain a non-zero element index list comprises:
traversing the sparse weight parameter matrix of the convolution kernel, and storing the weight parameters greater than a preset threshold value into the non-zero element index list.
4. The method of claim 1, wherein prior to performing non-zero index extraction, the method further comprises:
defining lists in_ptr, w_ptr, out_ptr and out_cnt for storing indexes, which are respectively used for storing the starting data addresses of the input feature map, the non-zero weight parameter addresses, the starting data addresses of the output feature map, and the numbers of tangent planes corresponding to the starting data addresses of the output feature map.
5. The method of claim 4, wherein after performing non-zero index extraction, the method further comprises:
storing the address of the non-zero element into a w _ ptr list;
calculating a channel starting address of the input feature map and a starting offset of the sliding traversal according to the corresponding input channel, obtaining the starting address of each traversal, and storing it into the in_ptr list;
and judging whether the output feature map starting address corresponding to the current calculation has been stored into out_ptr; if not, storing it into out_ptr, otherwise incrementing the corresponding tangent-plane count by 1 and updating it in the out_cnt list.
6. The method of claim 5, wherein the performing a traversal multiply-add calculation on each non-zero element in the non-zero element index list and the input feature map of the input channel comprises:
traversing the out_ptr list, and reading the count value at the corresponding subscript in the out_cnt list;
traversing over the count value, and reading the input index list in_ptr and the weight index list w_ptr according to the loop variable of the traversal;
and traversing each pixel in the input feature map of the input channel, sequentially computing a tangent plane by multiplying and adding the non-zero element with the pixel at the corresponding position of the input feature map.
7. An apparatus for outputting information, comprising:
an acquisition unit configured to acquire an input feature map and a convolution kernel of at least one input channel;
the index extraction unit is configured to extract a non-zero index of the sparse weight parameter matrix serving as the convolution kernel to obtain a non-zero element index list;
the calculation unit is configured to perform traversal multiplication and addition calculation on each non-zero element in the non-zero element index list and an input feature map of the input channel for each input channel, so as to obtain a tangent plane of an output feature map corresponding to the non-zero element of the input channel;
and the output unit is configured to, for each output channel, accumulate the tangent planes of the output feature maps corresponding to all the non-zero elements in the non-zero element index list to obtain and output the output feature map of the output channel.
8. The apparatus of claim 7, wherein the apparatus further comprises an acceleration unit configured to:
in the inner loop of the sparse convolution calculation and the traversal calculation process of the output feature map, performing acceleration through out-of-order instruction execution and data prefetching.
9. The apparatus of claim 7, wherein the index extraction unit is further configured to:
traversing the sparse weight parameter matrix of the convolution kernel, and storing the weight parameters greater than a preset threshold value into the non-zero element index list.
10. The apparatus of claim 7, wherein the apparatus further comprises a definition unit configured to:
before non-zero index extraction, defining lists in_ptr, w_ptr, out_ptr and out_cnt for storing indexes, which are respectively used for storing the starting data addresses of the input feature map, the non-zero weight parameter addresses, the starting data addresses of the output feature map, and the numbers of tangent planes corresponding to the starting data addresses of the output feature map.
11. The apparatus of claim 10, wherein the apparatus further comprises a storage unit configured to:
after non-zero index extraction, storing the address of a non-zero element into a w _ ptr list;
calculating a channel starting address of the input feature map and a starting offset of the sliding traversal according to the corresponding input channel, obtaining the starting address of each traversal, and storing it into the in_ptr list;
and judging whether the output feature map starting address corresponding to the current calculation has been stored into out_ptr; if not, storing it into out_ptr, otherwise incrementing the corresponding tangent-plane count by 1 and updating it in the out_cnt list.
12. The apparatus of claim 11, wherein the computing unit is further configured to:
traversing the out_ptr list, and reading the count value at the corresponding subscript in the out_cnt list;
traversing over the count value, and reading the input index list in_ptr and the weight index list w_ptr according to the loop variable of the traversal;
and traversing each pixel in the input feature map of the input channel, sequentially computing a tangent plane by multiplying and adding the non-zero element with the pixel at the corresponding position of the input feature map.
13. An electronic device for outputting information, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
14. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010184800.9A CN111415004B (en) | 2020-03-17 | 2020-03-17 | Method and device for outputting information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010184800.9A CN111415004B (en) | 2020-03-17 | 2020-03-17 | Method and device for outputting information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111415004A true CN111415004A (en) | 2020-07-14 |
CN111415004B CN111415004B (en) | 2023-11-03 |
Family
ID=71492977
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010184800.9A Active CN111415004B (en) | 2020-03-17 | 2020-03-17 | Method and device for outputting information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111415004B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI742802B (en) * | 2020-08-18 | 2021-10-11 | 創鑫智慧股份有限公司 | Matrix calculation device and operation method thereof |
WO2023004670A1 (en) * | 2021-07-29 | 2023-02-02 | Qualcomm Incorporated | Channel-guided nested loop transformation and scalar replacement |
Citations (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW201346755A (en) * | 2011-12-20 | 2013-11-16 | Intel Corp | System and method for out-of-order prefetch instructions in an in-order pipeline |
CN107239824A (en) * | 2016-12-05 | 2017-10-10 | 北京深鉴智能科技有限公司 | Apparatus and method for realizing sparse convolution neutral net accelerator |
CN107451652A (en) * | 2016-05-31 | 2017-12-08 | 三星电子株式会社 | The efficient sparse parallel convolution scheme based on Winograd |
US20180046916A1 (en) * | 2016-08-11 | 2018-02-15 | Nvidia Corporation | Sparse convolutional neural network accelerator |
CN107909148A (en) * | 2017-12-12 | 2018-04-13 | 北京地平线信息技术有限公司 | For performing the device of the convolution algorithm in convolutional neural networks |
CN107944555A (en) * | 2017-12-07 | 2018-04-20 | 广州华多网络科技有限公司 | Method, storage device and the terminal that neutral net is compressed and accelerated |
WO2018073975A1 (en) * | 2016-10-21 | 2018-04-26 | Nec Corporation | Improved sparse convolution neural network |
CN108510066A (en) * | 2018-04-08 | 2018-09-07 | 清华大学 | A kind of processor applied to convolutional neural networks |
CN109344698A (en) * | 2018-08-17 | 2019-02-15 | 西安电子科技大学 | EO-1 hyperion band selection method based on separable convolution sum hard threshold function |
CN109359726A (en) * | 2018-11-27 | 2019-02-19 | 华中科技大学 | A kind of convolutional neural networks optimization method based on winograd algorithm |
US20190108436A1 (en) * | 2017-10-06 | 2019-04-11 | Deepcube Ltd | System and method for compact and efficient sparse neural networks |
CN109840585A (en) * | 2018-01-10 | 2019-06-04 | 中国科学院计算技术研究所 | A kind of operation method and system towards sparse two-dimensional convolution |
CN109857744A (en) * | 2019-02-13 | 2019-06-07 | 上海燧原智能科技有限公司 | Sparse tensor computation method, apparatus, equipment and storage medium |
CN109993297A (en) * | 2019-04-02 | 2019-07-09 | 南京吉相传感成像技术研究院有限公司 | A kind of the sparse convolution neural network accelerator and its accelerated method of load balancing |
CN109993683A (en) * | 2017-12-29 | 2019-07-09 | 英特尔公司 | Machine learning sparse calculation mechanism, the algorithm calculations micro-architecture and sparsity for training mechanism of any neural network |
CN110062233A (en) * | 2019-04-25 | 2019-07-26 | 西安交通大学 | The compression method and system of the sparse weight matrix of the full articulamentum of convolutional neural networks |
CN110070178A (en) * | 2019-04-25 | 2019-07-30 | 北京交通大学 | A kind of convolutional neural networks computing device and method |
CN110796238A (en) * | 2019-10-29 | 2020-02-14 | 上海安路信息科技有限公司 | Convolutional neural network weight compression method and system |
2020
- 2020-03-17 CN CN202010184800.9A patent/CN111415004B/en active Active
Patent Citations (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW201346755A (en) * | 2011-12-20 | 2013-11-16 | Intel Corp | System and method for out-of-order prefetch instructions in an in-order pipeline |
CN107451652A (en) * | 2016-05-31 | 2017-12-08 | 三星电子株式会社 | The efficient sparse parallel convolution scheme based on Winograd |
US20180046916A1 (en) * | 2016-08-11 | 2018-02-15 | Nvidia Corporation | Sparse convolutional neural network accelerator |
WO2018073975A1 (en) * | 2016-10-21 | 2018-04-26 | Nec Corporation | Improved sparse convolution neural network |
CN107239824A (en) * | 2016-12-05 | 2017-10-10 | 北京深鉴智能科技有限公司 | Apparatus and method for realizing sparse convolution neutral net accelerator |
US20190108436A1 (en) * | 2017-10-06 | 2019-04-11 | Deepcube Ltd | System and method for compact and efficient sparse neural networks |
CN107944555A (en) * | 2017-12-07 | 2018-04-20 | 广州华多网络科技有限公司 | Method, storage device and the terminal that neutral net is compressed and accelerated |
CN107909148A (en) * | 2017-12-12 | 2018-04-13 | 北京地平线信息技术有限公司 | For performing the device of the convolution algorithm in convolutional neural networks |
CN109993683A (en) * | 2017-12-29 | 2019-07-09 | 英特尔公司 | Machine learning sparse calculation mechanism, the algorithm calculations micro-architecture and sparsity for training mechanism of any neural network |
CN109840585A (en) * | 2018-01-10 | 2019-06-04 | 中国科学院计算技术研究所 | A kind of operation method and system towards sparse two-dimensional convolution |
CN108510066A (en) * | 2018-04-08 | 2018-09-07 | 清华大学 | A kind of processor applied to convolutional neural networks |
CN109344698A (en) * | 2018-08-17 | 2019-02-15 | 西安电子科技大学 | EO-1 hyperion band selection method based on separable convolution sum hard threshold function |
CN109359726A (en) * | 2018-11-27 | 2019-02-19 | 华中科技大学 | A kind of convolutional neural networks optimization method based on winograd algorithm |
CN109857744A (en) * | 2019-02-13 | 2019-06-07 | 上海燧原智能科技有限公司 | Sparse tensor computation method, apparatus, equipment and storage medium |
CN109993297A (en) * | 2019-04-02 | 2019-07-09 | 南京吉相传感成像技术研究院有限公司 | A kind of the sparse convolution neural network accelerator and its accelerated method of load balancing |
CN110062233A (en) * | 2019-04-25 | 2019-07-26 | 西安交通大学 | The compression method and system of the sparse weight matrix of the full articulamentum of convolutional neural networks |
CN110070178A (en) * | 2019-04-25 | 2019-07-30 | 北京交通大学 | A kind of convolutional neural networks computing device and method |
CN110796238A (en) * | 2019-10-29 | 2020-02-14 | 上海安路信息科技有限公司 | Convolutional neural network weight compression method and system |
Non-Patent Citations (6)
Title |
---|
ANGSHUMAN PARASHAR等: "SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks", 《ACM SIGARCH COMPUTER ARCHITECTURE NEWS》, vol. 45, no. 2, pages 27 - 40, XP033268524, DOI: 10.1145/3079856.3080254 * |
INTERESTING233333: "卷积神经网络(三)", vol. 1, pages 162 - 2, Retrieved from the Internet <URL:《https://blog.csdn.net/lipengfei0427/article/details/100180374》> * |
XUHAO CHEN等: "Escoin: Efficient Sparse Convolutional Neural Network Inference on GPUs", pages 1 - 9 * |
付世航: "深度卷积算法优化与硬件加速", no. 2019, pages 135 - 271 * |
周国飞;: "一种支持稀疏卷积的深度神经网络加速器的设计", no. 04, pages 109 - 111 * |
李林鹏: "压缩卷积神经网络的FPGA加速研究与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》, no. 2020, pages 135 - 767 * |
Also Published As
Publication number | Publication date |
---|---|
CN111415004B (en) | 2023-11-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112364860B (en) | Training method and device of character recognition model and electronic equipment | |
CN110413812B (en) | Neural network model training method and device, electronic equipment and storage medium | |
CN112650790B (en) | Target point cloud plane determining method and device, electronic equipment and storage medium | |
CN111368973B (en) | Method and apparatus for training a super network | |
CN110362750B (en) | Target user determination method, device, electronic equipment and computer readable medium | |
CN110826567A (en) | Optical character recognition method, device, equipment and storage medium | |
CN112650841A (en) | Information processing method and device and electronic equipment | |
US20200389182A1 (en) | Data conversion method and apparatus | |
CN110633423A (en) | Target account identification method, device, equipment and storage medium | |
CN111415004B (en) | Method and device for outputting information | |
CN110633434A (en) | Page caching method and device, electronic equipment and storage medium | |
CN113255812B (en) | Video frame detection method and device and electronic equipment | |
CN111783731B (en) | Method and device for extracting video features | |
CN111782933A (en) | Method and device for recommending book list | |
CN110378282A (en) | Image processing method and device | |
CN113240108B (en) | Model training method and device and electronic equipment | |
CN113220922B (en) | Image searching method and device and electronic equipment | |
CN115761248B (en) | Image processing method, device, equipment and storage medium | |
CN111950572A (en) | Method, apparatus, electronic device and computer-readable storage medium for training classifier | |
CN113177174B (en) | Feature construction method, content display method and related device | |
CN113283115B (en) | Image model generation method and device and electronic equipment | |
CN114724639B (en) | Preprocessing acceleration method, device, equipment and storage medium | |
CN110826497B (en) | Vehicle weight removing method and device based on minimum distance method and storage medium | |
CN113033770A (en) | Neural network model testing method and device and electronic equipment | |
CN116403201A (en) | Text recognition method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | ||
TA01 | Transfer of patent application right |
Effective date of registration: 20211011 Address after: 100176 101, floor 1, building 1, yard 7, Ruihe West 2nd Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing Applicant after: Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd. Address before: 2 / F, baidu building, No. 10, Shangdi 10th Street, Haidian District, Beijing 100085 Applicant before: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd. |
|
GR01 | Patent grant | ||
GR01 | Patent grant |