CN115346099A

CN115346099A - Image convolution method, chip, equipment and medium based on accelerator chip

Info

Publication number: CN115346099A
Application number: CN202210955730.1A
Authority: CN
Inventors: 刘相伟; 杨明珠; 王松
Original assignee: Beijing Xiaoyan Exploration Technology Co ltd
Current assignee: Beijing Xiaoyan Exploration Technology Co ltd
Priority date: 2022-08-10
Filing date: 2022-08-10
Publication date: 2022-11-15

Abstract

The present disclosure provides an accelerator chip based image convolution method, chip, device and medium, the method comprising: allocating a first weight and image data to each computing unit at a time for storage; alternately executing preset convolution calculation operation and data circulation operation until the convolution calculation operation of the first number of times is completed; the convolution calculation operation includes: performing convolution calculation on the currently stored first weight and the image data by using a calculation unit to obtain a convolution result of the image data currently stored by the calculation unit; the data circulation operation comprises the following steps: according to a preset flow direction, transferring the first weight or image data currently stored by each computing unit to an adjacent computing unit for storage; after the convolution calculation operations of the first number of times are completed, for each image data, a target convolution result of the image data is generated based on the first number of convolution results of the image data. The image convolution can improve the calculation efficiency and the data throughput rate of the accelerator chip at the same time.

Description

Image convolution method, chip, equipment and medium based on accelerator chip

Technical Field

The present disclosure relates to the field of chip technologies, and in particular, to an accelerator chip-based image convolution method, chip, device, and medium.

Background

An accelerator chip is a chip that can be used to perform Machine Learning (ML) operations, for example, the accelerator chip can be used to perform convolution calculations on an image. In order to improve the computing speed and the data throughput rate of the chip, some data used in the computing process is usually multiplexed, but the existing data multiplexing mode has a limited effect on improving the computing speed and the data throughput rate of the chip.

Disclosure of Invention

The present disclosure provides an accelerator chip based image convolution method, chip, device and medium.

According to a first aspect of the present disclosure, there is provided an accelerator chip based image convolution method, a pulse array of the accelerator chip having a first number of cyclically connected computational units, the method comprising:

allocating a first weight and image data to each computing unit at a time for storage;

alternately executing preset convolution calculation operation and data circulation operation until the convolution calculation operation of the first number of times is completed;

the convolution calculation operation includes: performing convolution calculation on the currently stored first weight and the image data by using a calculation unit to obtain a convolution result of the image data currently stored by the calculation unit;

the data circulation operation comprises the following steps: according to a preset flow direction, transferring the first weight or image data currently stored by each computing unit to an adjacent computing unit for storage;

after the convolution calculation operations of the first number of times are completed, for each image data, a target convolution result of the image data is generated based on the first number of convolution results of the image data.

In the disclosed embodiment, assigning a first weight and an image data to each computing unit at a time for storage comprises:

reading a first number of first weights and a first number of image data at a time from a preset storage device;

and distributing each first weight to a corresponding computing unit for storage, and distributing each image data to a corresponding computing unit for storage.

In the embodiment of the present disclosure, the first number of first weights is obtained by dividing the original weights into the first number of parts according to the depth direction, where the original weights are three-dimensional arrays, and the first weights are two-dimensional arrays.

In the embodiment of the present disclosure, the streaming of the first weight or the image data currently stored by each computing unit to an adjacent computing unit for storage includes:

retaining each first weight in the calculation unit to which it was first assigned;

and transferring the image data stream currently stored by each computing unit to an adjacent computing unit for storage according to a preset streaming direction.

In an embodiment of the present disclosure, generating, for each image data, a target convolution result for the image data based on a first number of convolution results for the image data includes:

respectively extracting convolution results of the image data from each computing unit aiming at each image data to obtain a first number of convolution results of the image data;

and summing the first number of convolution results of the image data to obtain a target convolution result of the image data.

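In the embodiment of the present disclosure, the streaming of the first weight or the image data currently stored by each computing unit to an adjacent computing unit for storage includes:

retaining each image data in the calculation unit to which it was first assigned;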

and transferring the first weight stream currently stored by each computing unit to an adjacent computing unit for storage according to a preset stream transfer direction.

In an embodiment of the present disclosure, for each image data, generating a target convolution result for the image data based on a first number of convolution results for the image data includes:

extracting, for each image data, a first number of convolution results of the image data from a computing unit to which the image data is first distributed;

summing the first number of convolution results of the image data to obtain a target convolution result of the image data;

and outputting the target convolution results of all the image data at one time.

According to a second aspect of the present disclosure, there is provided an accelerator chip, a pulse array of which has a first number of cyclically connected computational units, based on which the method provided by the first aspect of the present disclosure can be performed.

According to a third aspect of the present disclosure, there is provided an electronic apparatus comprising:

an accelerator chip provided by a second aspect of the present disclosure; and

a memory communicatively coupled to the accelerator chip; wherein,

the memory stores instructions executable by the accelerator chip to enable the accelerator chip to perform the method provided by the first aspect of the disclosure.

A fourth aspect of the present disclosure provides a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method provided by the first aspect of the present disclosure.

It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

The technical scheme provided by the disclosure has the following beneficial effects:

According to the accelerator chip-based image convolution method, one first weight and one image data can be distributed to each computing unit of the accelerator chip for storage through a single data input step, so that a plurality of first weights and a plurality of image data can be deployed in the chip rapidly. By circulating the first weights or the image data among the plurality of circularly connected computing units, each image data undergoes one convolution calculation with each first weight, yielding a plurality of convolution results for each image data; the convolution results of the plurality of image data are then output through a single data output step. This image convolution process multiplexes a plurality of image data and a plurality of weights simultaneously. All the computing units perform convolution calculation synchronously, so the computation parallelism is high; all the computing units also transfer data to their adjacent computing units synchronously, which avoids the stage-by-stage delay and waiting of data transmission. The convolution results of a plurality of image data can therefore be obtained at the same time, which improves both the calculation efficiency and the data throughput rate of the accelerator chip. Moreover, obtaining the convolution results of a plurality of image data requires only one data input step and one data output step, which significantly reduces the number of data accesses of the chip and thus its power consumption.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a schematic diagram illustrating an exemplary structure of an accelerator chip according to an embodiment of the present disclosure;

FIG. 2 is a schematic flowchart illustrating an image convolution method based on an accelerator chip according to an embodiment of the present disclosure;

FIG. 3 shows a schematic diagram of the storage of first weights and image data in a computational unit of an accelerator chip provided by an embodiment of the disclosure;

FIG. 4 is a flow chart illustrating a data flow operation according to an embodiment of the disclosure;

FIG. 5 is a flow diagram illustrating another operation of performing data flow operations according to an embodiment of the disclosure;

FIG. 6 shows a schematic block diagram of an electronic device for implementing the method provided by the embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

An accelerator chip is a chip that can be used to perform Machine Learning (ML) operations, for example, the accelerator chip can be used to perform convolution calculations on an image. In order to increase the computation speed and data throughput of the chip, some data used in the computation process is usually multiplexed, and specifically, the data multiplexing methods commonly used in the prior art include Input feature data multiplexing (Input feature map reuse), filter reuse (Filter reuse), and convolution reuse (Convolutional reuse).

Input feature data multiplexing refers to repeatedly using the same input feature data for calculation: one input feature data is input into the chip, the convolution results of that input feature data with a plurality of convolution weights are calculated in sequence, and the next input feature data is input for convolution calculation only after the plurality of convolution results have been obtained. With input feature data multiplexing, one complete data input and output process yields the convolution result of only one input feature data, so the data throughput rate is low.

Filter multiplexing means that a batch of input feature data (namely, a plurality of input feature data) is input into the chip, the convolution result of each input feature data with the same convolution weight is calculated, and the next batch of input feature data is input for convolution calculation after one convolution result of each input feature data has been obtained. In the filter multiplexing scenario, the convolution weights used are usually higher-dimensional arrays (e.g., arrays of three or more dimensions), which makes the convolution calculation more complex and can seriously reduce the calculation speed of the chip.

Convolution multiplexing is similar to input feature data multiplexing: one input feature data is input into the chip, convolution calculation is performed using the same convolution weight and the sub-data at different positions of the input feature data, and the next input feature data is input for convolution calculation after the convolution results of the plurality of sub-data have been obtained. With convolution multiplexing, one complete data input and output process likewise yields the convolution result of only one input feature data, so the data throughput rate is low. As can be seen, existing data multiplexing can multiplex data of only one dimension, for example only the input feature data or only the convolution weights; it therefore has a limited effect on improving the computation speed and the data throughput rate of a chip, and cannot improve both at the same time.

The embodiment of the disclosure provides an accelerator chip-based image convolution method, chip, device and medium, which aim to solve at least one of the above technical problems in the prior art.

The accelerator chip provided by the embodiment of the present disclosure is specifically a neural network accelerator chip, and may be used for performing Machine Learning (ML) operations. The main component with which the accelerator chip performs the various machine learning operations is an N×M pulse array, where N is the number of rows of the pulse array and M is the number of columns. In the disclosed embodiment, the pulse array has a first number of computing units, generally referred to as processing elements (PEs). The first number of computing units are circularly connected, so that data can enter each computing unit in sequence according to a preset circulation direction. It should be noted that, in general, the first number is an integer not less than 4. FIG. 1 shows an exemplary structural diagram of an accelerator chip according to an embodiment of the present disclosure. In FIG. 1, the pulse array of the accelerator chip has 16 computing units, PE1 to PE16, which are connected in sequence, and PE16 is further connected to PE1 so as to form a loop structure.
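The ring connection described above can be illustrated with a small index calculation. This is only a sketch: the helper names `next_pe` and `ring_path` and the 0-based internal indexing are assumptions made for illustration, and the disclosure does not prescribe any software representation of the array.

```python
def next_pe(pe_index, num_pes):
    """Return the 0-based index of the PE that follows pe_index on the ring."""
    return (pe_index + 1) % num_pes


def ring_path(start, num_pes):
    """List the 1-based PE labels visited by data that starts at index `start`."""
    path, current = [], start
    for _ in range(num_pes):
        path.append(f"PE{current + 1}")
        current = next_pe(current, num_pes)
    return path


# With 16 PEs, data leaving PE16 (index 15) arrives at PE1 (index 0).
print(ring_path(14, 16)[:3])
```

Under these assumptions, data that starts in PE15 visits PE16 and then wraps around to PE1, which is the loop structure formed by connecting PE16 back to PE1.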

Fig. 2 shows a schematic flowchart of an image convolution method based on an accelerator chip according to an embodiment of the present disclosure, and as shown in fig. 2, the method mainly includes the following steps:

S210: each calculation unit is assigned a first weight and image data at a time for storage.

S220: alternately executing a preset convolution calculation operation and a data circulation operation until the convolution calculation operation has been completed a first number of times.

In the embodiment of the present disclosure, after convolution calculation is performed by the calculation unit, a convolution result of the corresponding image data is obtained. Here, the convolution calculation operation includes: performing convolution calculation on the currently stored first weight and image data by using the calculation unit to obtain a convolution result of the image data currently stored by the calculation unit.

In an embodiment of the present disclosure, the data circulation operation includes: transferring, according to a preset flow direction, the first weight or the image data currently stored by each computing unit to an adjacent computing unit for storage. It should be noted that, in step S220, either the first weight or the image data must be selected for circulation; that is, in step S220 all the computing units circulate only the first weight, or all the computing units circulate only the image data.

S230: after the first number of convolution calculation operations is completed, for each image data, a target convolution result for the image data is generated based on the first number of convolution results for the image data.

It will be appreciated that after the first number of convolution operations is completed, each image data is respectively convolved with the first number of first weights, and thus there are a first number of convolution results for each image data. The convolution results of the image data are stored in the calculation unit, and in S230, a first number of convolution results for each image data may be extracted from the calculation unit, and a target convolution result of the image data may be generated based on the first number of convolution results of the image data.

S240: outputting the target convolution results of all the image data at one time.
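Steps S210 to S240 can be sketched as a small software simulation. The code below is a minimal illustration under stated assumptions, not the chip's implementation: the names `conv2d`, `add_maps` and `ring_convolve` are hypothetical, the per-PE convolution is modelled as a valid-mode sliding-window product on nested lists, the PEs are evaluated one after another although the hardware works in parallel, and partial results are summed as soon as they are produced, which folds S230 into the loop.

```python
def conv2d(image, kernel):
    """Valid-mode 2D sliding-window product (stand-in for one PE's convolution)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def add_maps(a, b):
    """Element-wise sum of two equally sized 2D result maps."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def ring_convolve(images, weights, circulate="image"):
    """Simulate S210-S240 for n PEs and return one target result per image."""
    n = len(images)
    if len(weights) != n:
        raise ValueError("need exactly one first weight per image")
    if circulate not in ("image", "weight"):
        raise ValueError("circulate must be 'image' or 'weight'")
    # S210: PE k initially stores image k and weight k (tracked by index).
    image_in_pe = list(range(n))
    weight_in_pe = list(range(n))
    targets = [None] * n
    for step in range(n):  # S220: n convolution operations in total
        for pe in range(n):
            i, j = image_in_pe[pe], weight_in_pe[pe]
            partial = conv2d(images[i], weights[j])
            targets[i] = partial if targets[i] is None else add_maps(targets[i], partial)
        if step == n - 1:
            break  # no circulation is needed after the last convolution
        # Circulation: PE k hands its data to PE (k + 1) % n.
        if circulate == "image":
            image_in_pe = image_in_pe[-1:] + image_in_pe[:-1]
        else:
            weight_in_pe = weight_in_pe[-1:] + weight_in_pe[:-1]
    return targets  # S240: all target results are returned together


images = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
weights = [[[2]], [[3]]]
print(ring_convolve(images, weights, "image"))
```

In this sketch both circulation choices give the same target results, because every image meets every first weight exactly once in either case.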

In the disclosed embodiment, the first weights and the image data may be stored in a storage device. When a first weight and an image data are assigned to each computing unit at a time for storage, a first number of first weights and a first number of image data may be read from the storage device at a time, each first weight may be assigned to a corresponding computing unit for storage, and each image data may be assigned to a corresponding computing unit for storage. The embodiment of the disclosure can thus read a plurality of first weights and a plurality of image data at a time. During the convolution calculation, data does not need to be read from outside the chip multiple times; the convolution calculation of the plurality of image data is completed through the flow of data within the chip, which significantly reduces the number of accesses to the storage device outside the chip and helps reduce the power consumption of the chip.

Specifically, as shown in fig. 2, the storage device may include a data memory, a first buffer, and a second buffer. The data memory may store a plurality of image data and weights for performing convolution calculation. Before step S210 is performed, the first number of image data to be convolved this time may be stored in the first buffer, and the first number of first weights corresponding to the first number of image data may be stored in the second buffer. When each calculation unit is assigned a first weight and image data at a time for storage, the first number of image data may be read at a time from the first buffer, and the first number of first weights may be read at a time from the second buffer. Optionally, the storage device may further include a third buffer, which may be used to store the final target convolution result of each image data; in step S240, the accelerator chip may output the target convolution results of all the image data to the third buffer at one time.

Fig. 3 shows a schematic diagram of storing the first weights and the image data in the computing units of the accelerator chip according to the embodiment of the disclosure, as shown in fig. 3, the pulse array of the accelerator chip has 16 computing units, which are PE1 to PE16, respectively, and obtains 16 image data LA1 to LA16 and 16 first weights W1 to W16. In assigning the image data and the first weight, the calculation unit PE1 may be assigned the image data LA1 and the first weight W1 for storage, the calculation unit PE2 may be assigned the image data LA2 and the first weight W2 for storage, and so on, and the calculation unit PE16 may be assigned the image data LA16 and the first weight W16 for storage.

In the embodiment of the present disclosure, the first number of first weights is obtained by dividing the original weights into the first number of parts along the depth direction, where the original weights are three-dimensional arrays and the first weights are two-dimensional arrays. It should be noted that, in the art, the convolution result of image data is usually calculated from the original weights and the image data; since the original weights are three-dimensional arrays, convolution calculation performed directly on the original weights is highly complex. The embodiment of the disclosure divides the three-dimensional original weights along the depth direction into a first number of two-dimensional first weights, so that the convolution calculation is performed on two-dimensional first weights. This greatly simplifies the calculation process, which helps increase the calculation speed of the chip and reduce its power consumption.
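The depth-wise split relies on the fact that a sum over depth, rows and columns can be regrouped as a sum of per-depth two-dimensional results. The sketch below checks this on a tiny example. It assumes, purely for illustration, that the input supplies one two-dimensional channel per depth slice of the original weight; the function names are hypothetical.

```python
def conv2d(image, kernel):
    """Valid-mode 2D sliding-window product on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def conv_with_3d_weight(channels, weight3d):
    """Accumulate over depth, rows and columns directly with the 3D weight."""
    depth = len(weight3d)
    kh, kw = len(weight3d[0]), len(weight3d[0][0])
    out_h = len(channels[0]) - kh + 1
    out_w = len(channels[0][0]) - kw + 1
    return [[sum(channels[d][r + i][c + j] * weight3d[d][i][j]
                 for d in range(depth) for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def conv_by_depth_slices(channels, weight3d):
    """Split the 3D weight along depth into 2D slices and sum the 2D results."""
    total = None
    for channel, slice2d in zip(channels, weight3d):
        partial = conv2d(channel, slice2d)
        if total is None:
            total = partial
        else:
            total = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(total, partial)]
    return total


channels = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
weight3d = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
print(conv_with_3d_weight(channels, weight3d), conv_by_depth_slices(channels, weight3d))
```

Both functions return the same result map, so each two-dimensional slice can be handled by a separate computing unit and the partial results summed afterwards.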

As described above, in step S220, either the first weight or the image data needs to be selected for circulation. In the case that all the computing units circulate only the image data in step S220, when the first weight or the image data currently stored by each computing unit is transferred to an adjacent computing unit for storage, the embodiment of the present disclosure may retain each first weight in the computing unit to which it was first assigned, and transfer the image data currently stored by each computing unit to an adjacent computing unit for storage according to the preset transfer direction. It is understood that the computing unit to which a first weight is first assigned refers to the computing unit to which that first weight is assigned in S210. In the process of alternately executing the preset convolution calculation operation and the data circulation operation, the first weight in each computing unit is fixed, and the image data in each computing unit is replaced once every time the data circulation operation is executed. In this way, each computing unit performs convolution calculation on different image data with its fixed first weight in different convolution calculation operations, and thus, by performing the convolution calculation operation a first number of times, one computing unit obtains one convolution result for each of the first number of image data.

In the embodiment of the disclosure, the data circulation direction in the computing units can be predetermined, and it is specified that each computing unit only needs to circulate the image data to an adjacent computing unit, so that data transmission across the computing units is not needed, the routing relationship is simple, clear and constant, and the routing algorithm is greatly simplified; in addition, only image data need to be transferred between adjacent computing units, the transmission distance is short, the data type is single, excessive power consumption of a chip is not needed, and computing resources are saved.

FIG. 4 shows a schematic flowchart of executing a data circulation operation according to an embodiment of the present disclosure. In FIG. 4, the accelerator chip includes 4 computing units, PE1 to PE4. The computing unit PE1 fixedly stores the first weight W1, PE2 fixedly stores the first weight W2, PE3 fixedly stores the first weight W3, and PE4 fixedly stores the first weight W4; P1 to P4 represent 4 stages of convolution calculation operations executed in sequence. In stage P1, the computing unit PE1 stores image data LA1, PE2 stores image data LA2, PE3 stores image data LA3, and PE4 stores image data LA4. Taking the computing unit PE1 as an example: in stage P1, PE1 computes a convolution result corresponding to the image data LA1 based on LA1 and the first weight W1, then transfers LA1 to the computing unit PE2, while the computing unit PE4 transfers the image data LA4 to PE1; in stage P2, PE1 computes a convolution result corresponding to the image data LA4 based on LA4 and the first weight W1, then transfers LA4 to PE2, and so on until each computing unit has completed 4 convolution calculation operations.
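The four-stage schedule described for FIG. 4 can be traced with a few lines of code. This is an illustrative sketch only: the function name `trace_image_circulation` is hypothetical, and the circulation direction is taken to be PE1 to PE2 to PE3 to PE4 and back to PE1, as in the example above.

```python
def trace_image_circulation(num_pes=4):
    """Return, for each stage, the (image, weight) labels held by PE1..PEn."""
    images = [f"LA{k + 1}" for k in range(num_pes)]
    weights = [f"W{k + 1}" for k in range(num_pes)]  # weights never move
    stages = []
    for _ in range(num_pes):
        stages.append(list(zip(images, weights)))
        # Every PE passes its image to the next PE; the last PE feeds PE1.
        images = images[-1:] + images[:-1]
    return stages


for number, stage in enumerate(trace_image_circulation(), start=1):
    print(f"P{number}:", stage)
```

Under these assumptions, PE1 pairs its fixed weight W1 with LA1, LA4, LA3 and LA2 over stages P1 to P4, so after four stages it has seen every image exactly once.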

In the embodiment of the present disclosure, in the case that all the computing units circulate only the image data in step S220, one computing unit obtains one convolution result for each of the first number of image data by performing the convolution calculation operation a first number of times. Therefore, when the target convolution result of the image data is generated based on the first number of convolution results of the image data, the convolution results of the image data may be extracted from each of the computing units to obtain the first number of convolution results of the image data, and these may be summed to obtain the target convolution result of the image data.

As described above, in step S220, either the first weight or the image data needs to be selected for circulation. In the case that all the computing units circulate only the first weight in step S220, when the first weight or the image data currently stored by each computing unit is transferred to an adjacent computing unit for storage, the embodiment of the present disclosure may retain each image data in the computing unit to which it was first allocated, and transfer the first weight currently stored by each computing unit to an adjacent computing unit for storage according to the preset transfer direction. It is understood that the computing unit to which an image data is first allocated refers to the computing unit to which that image data is allocated in S210. In the process of alternately executing the preset convolution calculation operation and the data circulation operation, the image data in each computing unit is fixed, and the first weight in each computing unit is replaced once every time the data circulation operation is executed. In this way, each computing unit performs convolution calculation on its fixed image data with different first weights in different convolution calculation operations, and therefore, by performing the convolution calculation operation a first number of times, one computing unit obtains a first number of convolution results of the same image data.

In the embodiment of the disclosure, the data circulation direction among the computing units can be predetermined, and each computing unit only needs to transfer its first weight to an adjacent computing unit, so that no data needs to be transmitted across computing units; the routing relationship is simple, clear and constant, and the routing algorithm is greatly simplified. In addition, only the first weights need to be transferred between adjacent computing units, the transmission distance is short, and the data type is single, so the chip does not need to consume excessive power and computing resources are saved. Furthermore, the data size of a first weight is usually smaller than that of the image data, so circulating the first weights further saves power consumption of the chip.
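The saving from circulating the smaller item can be estimated with simple arithmetic. The sketch below is a rough illustration: the function name `elements_moved` is hypothetical, the 32×32 image and 3×3 weight sizes are made-up example values, and it assumes that no transfer takes place after the last convolution operation.

```python
def elements_moved(num_pes, elements_per_item):
    """Elements passed between neighbouring PEs over one full round.

    Every PE forwards one item after each convolution except the last,
    giving num_pes * (num_pes - 1) neighbour-to-neighbour transfers.
    """
    return num_pes * (num_pes - 1) * elements_per_item


num_pes = 16
image_elements = 32 * 32   # hypothetical image size
weight_elements = 3 * 3    # hypothetical two-dimensional first weight size

print("circulating images :", elements_moved(num_pes, image_elements))
print("circulating weights:", elements_moved(num_pes, weight_elements))
```

With these example sizes, circulating the images moves 245760 elements while circulating the weights moves 2160, which illustrates why the smaller first weights are the cheaper item to pass around the ring.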

FIG. 5 is a schematic flowchart of another way of executing a data circulation operation according to an embodiment of the disclosure. In FIG. 5, the accelerator chip includes 4 computing units, PE1 to PE4. The computing unit PE1 fixedly stores image data LA1, PE2 fixedly stores image data LA2, PE3 fixedly stores image data LA3, and PE4 fixedly stores image data LA4; P1 to P4 represent 4 stages of convolution calculation operations executed in sequence. In stage P1, the computing unit PE1 stores the first weight W1, PE2 stores the first weight W2, PE3 stores the first weight W3, and PE4 stores the first weight W4. Taking the computing unit PE1 as an example: in stage P1, PE1 computes a convolution result corresponding to the image data LA1 based on LA1 and the first weight W1, then transfers W1 to the computing unit PE2, while the computing unit PE4 transfers the first weight W4 to PE1; in stage P2, PE1 computes a convolution result corresponding to the image data LA1 based on LA1 and the first weight W4, then transfers W4 to PE2, and so on until each computing unit has completed 4 convolution calculation operations.

In the embodiment of the present disclosure, in the case where all the computing units circulate only the first weight in step S220, one computing unit obtains a first number of convolution results of the same image data by performing the convolution calculation operation a first number of times. Therefore, when the target convolution result of the image data is generated based on the first number of convolution results of the image data, the first number of convolution results of each image data may be extracted from the computing unit to which that image data was first allocated, and summed to obtain the target convolution result of the image data.
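The two circulation choices differ in where the partial results of a given image end up. The sketch below records, for each image, the computing unit that produced each of its partial results. The function name `partial_result_locations` and the 0-based indices are assumptions made for illustration.

```python
def partial_result_locations(num_pes, circulate):
    """Map each image index to the PEs (in stage order) holding its partial results."""
    if circulate not in ("image", "weight"):
        raise ValueError("circulate must be 'image' or 'weight'")
    image_in_pe = list(range(num_pes))  # PE k starts with image k (S210)
    locations = {image: [] for image in range(num_pes)}
    for _ in range(num_pes):
        for pe, image in enumerate(image_in_pe):
            locations[image].append(pe)  # the result stays where it was computed
        if circulate == "image":
            image_in_pe = image_in_pe[-1:] + image_in_pe[:-1]
        # When the weights circulate, the images stay put, so nothing changes here.
    return locations


print(partial_result_locations(4, "image")[0])
print(partial_result_locations(4, "weight")[0])
```

When the images circulate, the partial results of image 0 are spread over every PE and must be gathered from each of them; when the weights circulate, they all sit in the PE that first received the image.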

Optionally, each time the calculation unit obtains a convolution result, the newly obtained convolution result may be summed with the previously obtained convolution results, so that a plurality of convolution results do not need to be stored separately. Here, the newly obtained convolution result can be regarded as the loading of a membrane potential with an initial value in the current convolution calculation. The summation process of the computing unit can be implemented by the following formula:

In the above formula, y(i) is the summation result, w_i represents the first weight used in the convolution calculation, S is the number of computing units of the pulse array of the accelerator chip (i.e., the above-mentioned first number), and w_{i,C_in} represents the depth corresponding to w_i in the original weight.
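The running-sum idea can be sketched in software as an accumulator that is updated after every convolution. This is not a reconstruction of the formula referred to above; it only illustrates keeping one accumulated result per computing unit instead of a list of partial results. The class name `AccumulatingPE` is hypothetical, and the sketch assumes the case in which a computing unit keeps the same image data while the first weights circulate.

```python
class AccumulatingPE:
    """A PE model that keeps a single running sum of its convolution results."""

    def __init__(self):
        self.accumulator = None  # no result has been produced yet

    def accumulate(self, partial):
        """Add a newly obtained 2D result map to the running sum."""
        if self.accumulator is None:
            # Copy the first result so later updates do not alias the input.
            self.accumulator = [row[:] for row in partial]
        else:
            self.accumulator = [[a + b for a, b in zip(acc_row, new_row)]
                                for acc_row, new_row in zip(self.accumulator, partial)]
        return self.accumulator


pe = AccumulatingPE()
for partial in ([[1, 2]], [[3, 4]], [[10, 20]]):
    pe.accumulate(partial)
print(pe.accumulator)
```

Only one result map is held at any time, regardless of how many convolution operations the computing unit performs.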

The embodiment of the disclosure provides an accelerator chip, wherein a pulse array of the accelerator chip is provided with a first number of circularly connected computing units, and the accelerator chip-based image convolution method can be executed. The accelerator chip performs the image convolution method with reference to the foregoing contents, which are not described herein again.

The embodiment of the disclosure also provides an electronic device, which includes the above accelerator chip and a memory communicatively connected to the accelerator chip. The memory stores instructions executable by the accelerator chip to enable the accelerator chip to perform the accelerator chip-based image convolution method described above. For how the accelerator chip executes the image convolution method, refer to the foregoing contents, which are not described herein again.

Fig. 6 shows a schematic block diagram of an electronic device for implementing the method provided by the embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 6, the electronic device 600 includes an accelerator chip 601, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 can also be stored. The accelerator chip 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.

Various components in the electronic device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the electronic device 600 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

The accelerator chip 601 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of accelerator chip 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The accelerator chip 601 performs the various methods and processes described above, such as an accelerator chip-based image convolution method. For example, in some embodiments, the accelerator chip-based image convolution method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM 603 and executed by the accelerator chip 601, one or more steps of the accelerator chip-based image convolution method described above may be performed. Alternatively, in other embodiments, the accelerator chip 601 may be configured by any other suitable means (e.g., by means of firmware) to perform the accelerator chip-based image convolution method.

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. An image convolution method based on an accelerator chip, a pulse array of the accelerator chip having a first number of cyclically connected computing units, the method comprising:

distributing a first weight and image data to each computing unit at one time for storage;

alternately executing preset convolution calculation operation and data circulation operation until the convolution calculation operation for the first number of times is completed;

the convolution calculation operation includes: performing, by the computing unit, convolution calculation on the first weight and the image data currently stored by the computing unit to obtain a convolution result of the image data currently stored by the computing unit;

the data circulation operation comprises: according to a preset circulation direction, circulating the first weight or the image data currently stored by each computing unit to an adjacent computing unit for storage;

after the first number of the convolution calculation operations are completed, for each of the image data, generating a target convolution result for the image data based on the first number of convolution results for the image data.

2. The method of claim 1, wherein said distributing a first weight and image data to each computing unit at one time for storage comprises:

reading the first number of first weights and the first number of image data at a time from a preset storage device;

3. The method of claim 1, wherein the first number of the first weights is obtained by dividing an original weight into the first number of parts according to a depth direction, wherein the original weight is a three-dimensional array, and the first weight is a two-dimensional array.

4. The method according to claim 1, wherein said circulating the first weight or the image data currently stored by each computing unit to an adjacent computing unit for storage comprises:

retaining each of the first weights in the computing unit to which it was first assigned;

and according to a preset circulation direction, circulating the image data currently stored by each computing unit to an adjacent computing unit for storage.

5. The method of claim 4, wherein the generating, for each of the image data, a target convolution result for the image data based on the first number of convolution results for the image data comprises:

for each image data, extracting convolution results of the image data from each computing unit respectively to obtain the first number of convolution results of the image data;

6. The method of claim 1, wherein said circulating the first weight or the image data currently stored by each computing unit to an adjacent computing unit for storage comprises:

retaining each of the image data in the computing unit to which it was first assigned;

and according to a preset circulation direction, circulating the first weight currently stored by each computing unit to an adjacent computing unit for storage.

7. The method of claim 6, wherein said generating, for each of said image data, a target convolution result for said image data based on said first number of convolution results for said image data comprises:

extracting, for each of the image data, the first number of convolution results of the image data from the computing unit to which the image data was first allocated;

summing the first number of convolution results of the image data to obtain a target convolution result of the image data.

8. An accelerator chip, a pulse array of the accelerator chip having the first number of cyclically connected computing units, the accelerator chip being capable of performing the method of any of claims 1-7.

9. An electronic device, comprising:

the accelerator chip of claim 8; and

a memory communicatively coupled to the accelerator chip; wherein,

the memory stores instructions executable by the accelerator chip to enable the accelerator chip to perform the method of any of claims 1-7.

10. A non-transitory computer readable storage medium storing computer instructions, wherein,

the computer instructions are for causing a computer to perform the method of any one of claims 1-7.