CN116484909A - Vector engine processing method and device for artificial intelligence chip
- Publication number: CN116484909A (application CN202310323014.6A)
- Authority: CN (China)
- Prior art keywords: instruction, engine, vector, chip, channel
- Legal status: Pending
Classifications
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/0475—Generative networks
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application relates to a vector engine processing method and device for an artificial intelligence chip. The method is applied to a vector engine in the chip and comprises: receiving a channel configuration instruction, and configuring the number of instruction channels of the vector engine to a first number according to the channel configuration instruction; configuring the number of parallel instructions in each instruction channel to a second number; and performing vector operation processing on target data based on the first number of instruction channels and the second number of parallel instructions in each instruction channel, wherein the target data is the operation data corresponding to the artificial neural network currently processed by the chip. By adopting the method, the internal computing capability of the artificial intelligence chip can be improved to meet the computing requirements of the artificial neural network.
Description
Technical Field
The application relates to the technical field of chips, and in particular to a vector engine processing method and device for an artificial intelligence chip.
Background
Currently, artificial intelligence chips are applied to computational processing in the field of artificial neural networks. As the field of artificial neural networks continues to develop, the computing demands of the related operations grow explosively, and current artificial intelligence chips cannot meet such a large computing demand. Therefore, the problem of improving the internal computing capability of the artificial intelligence chip to meet the computing requirements of the artificial neural network needs to be solved.
Disclosure of Invention
Based on this, it is necessary to provide a vector engine processing method and device for an artificial intelligence chip that can improve the internal computing capability of the artificial intelligence chip to meet the computing requirements of an artificial neural network.
In a first aspect, the present application provides a data processing method. The method is applied to a vector engine in a chip and comprises:
receiving a channel configuration instruction, and configuring the number of instruction channels of the vector engine to be a first number according to the channel configuration instruction;
configuring the number of parallel instructions in each instruction channel to be a second number;
and executing vector operation processing on target data based on the first number of instruction channels and the second number of parallel instructions in each instruction channel, wherein the target data is operation data corresponding to the artificial neural network currently processed by the chip.
In one embodiment, the receiving channel configuration instructions includes: and receiving the channel configuration instruction sent by the target engine in the chip, wherein the target engine comprises a scalar engine or an intelligent engine, and the channel configuration instruction is determined by the target engine according to the current operation calculation requirement of the artificial neural network.
In one embodiment, the configuring the number of parallel instructions in each of the instruction channels to be the second number includes: determining the second number based on overhead requirements of the vector engine; and if the second number is inconsistent with the initial parallel instruction number which is pre-configured, adjusting the number of parallel instructions in each instruction channel from the initial parallel instruction number to the second number.
In one embodiment, the method further comprises: if a configuration change instruction sent by the scalar engine in the chip is received, performing configuration change processing based on the configuration change instruction, wherein the configuration change processing comprises: according to the first number and the second number, configuring the number of instruction channels of the vector engine to a third number, and configuring the number of parallel instructions in each instruction channel to a fourth number; wherein the third number is smaller than the first number, and the vector computing power of the vector engine is equal before and after performing the configuration change processing.
In one embodiment, the method further comprises: acquiring multiple decision operations of the same type, merging the multiple decision operations of the same type to obtain a target decision operation, and performing decision processing on the target decision operation, wherein a decision operation is an operation that performs a logical decision; and feeding back the decision result of the decision processing to a scalar engine in the chip.
In one embodiment, the method further comprises: if a decision exit condition is satisfied, stopping the execution of acquiring the multiple decision operations of the same type, merging the multiple decision operations of the same type to obtain the target decision operation, and performing the decision processing on the target decision operation; wherein the decision exit condition includes at least one of: the computing power of the vector engine is overloaded; the vector engine has a calculation error.
In a second aspect, the present application also provides a data processing apparatus. The apparatus is applied to a vector engine in a chip and comprises:
the receiving module is used for receiving a channel configuration instruction, and configuring the number of instruction channels of the vector engine to be a first number according to the channel configuration instruction;
the configuration module is used for configuring the number of parallel instructions in each instruction channel to be a second number;
the operation module is used for executing vector operation processing on target data based on the first number of instruction channels and the second number of parallel instructions in each instruction channel, wherein the target data is operation data corresponding to an artificial neural network currently processed by the chip.
In one embodiment, the receiving module is specifically configured to: and receiving the channel configuration instruction sent by the target engine in the chip, wherein the target engine comprises a scalar engine or an intelligent engine, and the channel configuration instruction is determined by the target engine according to the current operation calculation requirement of the artificial neural network.
In one embodiment, the configuration module is specifically configured to: determining the second number based on overhead requirements of the vector engine; and if the second number is inconsistent with the initial parallel instruction number which is pre-configured, adjusting the number of parallel instructions in each instruction channel from the initial parallel instruction number to the second number.
In one embodiment, the apparatus further comprises:
the change module is configured to, if a configuration change instruction sent by the scalar engine in the chip is received, perform configuration change processing based on the configuration change instruction, where the configuration change processing comprises: according to the first number and the second number, configuring the number of instruction channels of the vector engine to a third number, and configuring the number of parallel instructions in each instruction channel to a fourth number; wherein the third number is smaller than the first number, and the vector computing power of the vector engine is equal before and after performing the configuration change processing.
In one embodiment, the apparatus further comprises:
the decision module is used for acquiring multiple decision operations of the same type, merging the multiple decision operations of the same type to obtain a target decision operation, and performing decision processing on the target decision operation, where a decision operation is an operation that performs a logical decision; and feeding back the decision result of the decision processing to a scalar engine in the chip.
In one embodiment, the apparatus further comprises:
a stopping module, configured to stop, if a decision exit condition is satisfied, the execution of acquiring the multiple decision operations of the same type, merging the multiple decision operations of the same type to obtain the target decision operation, and performing the decision processing on the target decision operation; wherein the decision exit condition includes at least one of: the computing power of the vector engine is overloaded; the vector engine has a calculation error.
In a third aspect, the present application also provides a chip, including a scalar engine, a vector engine, and an intelligent engine; the vector engine is adapted to implement the steps of the method of any of the first aspects above.
In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of any of the first aspects described above.
In a fifth aspect, the present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of any of the first aspects above.
According to the vector engine processing method and device for the artificial intelligence chip, by receiving a channel configuration instruction, the vector engine in the chip can configure the number of its instruction channels to a first number according to the channel configuration instruction; further, the vector engine can autonomously configure the number of parallel instructions in each instruction channel to a second number, so that vector operation processing can be performed on the target data based on the first number of instruction channels and the second number of parallel instructions in each instruction channel. The target data is the operation data corresponding to the artificial neural network currently processed by the chip; that is, the vector engine can perform parallel vector calculation on the operation data of the artificial neural network based on the configured instruction channels and the parallel instructions within them, so the chip's internal computing capability for the artificial neural network is effectively improved. In addition, because the number of instruction channels is configured through the received channel configuration instruction, that is, by other modules in the chip, the number of instruction channels of the vector engine can be guaranteed to meet the corresponding calculation requirement when different artificial neural network operations are performed; and because the vector engine can autonomously configure the number of parallel instructions in each instruction channel to the second number, it can adjust its computing capability in time, ensuring the flexibility of parallel computation.
Drawings
FIG. 1 is a flow diagram of a data processing method in one embodiment;
FIG. 2 is a diagram of vector engine data processing in one embodiment;
FIG. 3 is a flow diagram of configuring the number of parallel instructions in one embodiment;
FIG. 4 is a flow diagram of a decision process in one embodiment;
FIG. 5 is a block diagram of a data processing apparatus in one embodiment;
FIG. 6 is a block diagram of another data processing apparatus in one embodiment;
FIG. 7 is an internal structural diagram of a chip in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Before the technical solutions of the embodiments of the present application are described in detail, the technical background and technical evolution on which the embodiments are based are first described.
In the half century since it was formally proposed, artificial intelligence has steadily accumulated, achieved revolutionary development, and attracted the attention of researchers in many fields. Driven by greatly improved computing power and data volume, artificial intelligence has made major breakthroughs in machine learning, especially in the deep learning field dominated by neural networks, and deep learning algorithms have become practically synonymous with artificial intelligence algorithms. Therefore, an artificial intelligence chip usually refers to a chip for deep learning algorithms, which uses computing-architecture innovation to optimize deep learning algorithms in hardware, improving artificial intelligence applications in terms of computing power, power consumption, cost, and other aspects.
In pursuit of improved artificial intelligence chip performance, several typical designs are currently adopted, as follows.
The deep learning processor Eyeriss emphasizes an energy-efficiency-first rule. It is a processor based on the dataflow architecture concept and designs an autonomous row-stationary (RS) computation rule for its PE (Processing Element) computing units. Eyeriss V2, the second-generation version developed on this basis, provides sparsity support and a more flexible network structure. In 2017, related researchers proposed multiple data-reuse modes, including convolutional reuse, image reuse, and convolution-kernel reuse, to improve chip energy efficiency. Subsequently, the DNPU, UNPU, LNPU, and GANPU series of AI chips were proposed. The DNPU mainly relies on a heterogeneous architecture, a hybrid load-division method, dynamic adaptive fixed-point calculation, and a quantization-table-based multiplier to achieve chip configurability and low power consumption. The UNPU is a DNN accelerator with fully variable weight bit precision, which can vary the optimal bit precision according to different accuracy/performance requirements. The LNPU uses direct feedback alignment for fast online learning and has a built-in direct error-propagation mechanism based on a pseudo-random number generator. The GANPU, a generative adversarial network processing unit, proposes an adaptive spatio-temporal workload multiplexing approach to maintain high utilization when accelerating multiple DNNs in a single GAN model, exploits a dual-sparsity architecture to skip redundant computation caused by zeros in input and output features, and uses an exponent-only ReLU speculation algorithm with a lightweight processing-element architecture.
The APU (Acceleration Processing Unit) is a heterogeneous computing structure that integrates a traditional CPU and a graphics processor (GPU) on one chip, so that tasks are flexibly distributed between the CPU and the GPU according to their computational nature, and artificial-intelligence-related operations are assigned to the GPU for processing, improving the efficiency of data-parallel operations. The DPU (Deep-Learning Processing Unit) of Xilinx implements a configurable computing engine based on an FPGA for accelerating deep learning algorithms such as convolutional neural networks; the TPU (Tensor Processing Unit) of Google is also a dedicated chip for accelerating neural network operations and can obtain a 15-30x performance improvement and a 30-80x energy-efficiency improvement compared with contemporaneous CPUs and GPUs.
In addition, researchers have designed the multimode AI chip Thinker, which balances the resource conflict between the computation and bandwidth of CNNs and RNNs; the evolvable AI chip Evolver, which supports on-chip training and reinforcement learning; ReDCIM, oriented to general cloud AI scenarios; tranCIM, a compute-in-memory Transformer acceleration chip based on self-attention neural networks; and another Transformer acceleration chip based on approximate computing and progressive sparsity.
The AI chip STICKER-T provides a block-circulant algorithm and a unified frequency-domain acceleration implementation path. The DianNao series of neural network accelerators supports accelerated processing of large-scale CNNs and other deep neural networks, and was the world's earliest neural network accelerator oriented to special-purpose computation in the artificial intelligence field. The AI chip named BPU (Brain Processing Unit) is manufactured in a TSMC 40 nm process, and its application fields include data-intensive scenarios such as intelligent driving, intelligent life, and smart cities.
A central processing unit (CPU), as the computing and control core of a computer system, is the final execution unit for information processing and program running. The von Neumann architecture is the basis of modern computers. Under this architecture, programs and data are stored together; instructions and data must be accessed from the same storage space and transmitted over the same bus, and cannot be executed in an overlapping manner. According to the von Neumann system, the operation of the CPU is divided into the following five stages: instruction fetch, instruction decode, instruction execute, memory access, and result write-back.
The CPU is one of the main devices of an electronic computer and a core component in the computer. Its function is mainly to interpret computer instructions and process the data in computer software. The CPU is the core component in the computer responsible for reading, decoding, and executing instructions. It mainly consists of two parts, a controller and an arithmetic unit, and also includes cache memory and the buses that realize the data and control connections between them. The three main core components of an electronic computer are the CPU, the internal memory, and the input/output devices. The central processing unit mainly handles instructions, performs operations, controls timing, and processes data. In the computer architecture, the CPU is the core hardware unit that controls and allocates all hardware resources of the computer (such as memory and input/output units) and performs general-purpose operations. The CPU is the computing and control core of the computer; the operation of all software layers in the computer system is ultimately mapped, through the instruction set, onto the operation of the CPU.
Graphics processor (graphics processing unit, GPU), also known as display core, vision processor, display chip, is a microprocessor that is dedicated to image and graphics related operations on personal computers, workstations, gaming machines, and some mobile devices (e.g., tablet computers, smartphones, etc.).
The GPU reduces the graphics card's dependence on the CPU and performs part of the work originally done by the CPU. In particular, the core technologies adopted by the GPU for 3D graphics processing include hardware T&L (geometric transformation and lighting), cubic environment texture mapping and vertex blending, texture compression and bump mapping, a dual-texture four-pixel 256-bit rendering engine, and so on, where hardware T&L technology can be said to be the hallmark of the GPU.
The neural network processor, also referred to as a neural network accelerator or computing card, i.e., a deep learning processor, refers to a module dedicated to handling the large number of computing tasks in intelligent applications (other, non-computing tasks remain the responsibility of the CPU). Much of the data processing in neural networks involves matrix multiplication and addition. A large number of GPUs working in parallel provides an inexpensive approach, but has the disadvantage of higher power consumption. FPGAs with built-in DSP modules and local memory are more energy efficient, but they are typically more expensive. Deep learning refers to multi-layer neural networks and the methods for training them. Colloquially, a neural network processor learns, judges, and decides by simulating the mechanism of the human brain through deep neural networks.
A multi-core processor integrates two or more complete computing engines (cores) in a single processor; the processor can support multiple processors on the system bus, with the bus controller providing all bus control and command signals. Multi-core technology arose because engineers recognized that merely increasing the speed of a single-core chip generates excessive heat without bringing a corresponding performance improvement, as had been the case with previous processor products. They recognized that, at the rates of previous products, the heat generated by the processor would be too high; even without the heat problem, the cost-performance ratio would be unacceptable, since a slightly faster processor would be far more expensive. The advantages of multi-core technology in application are two-fold: it brings users more powerful computing performance and, more importantly, it satisfies users' requirements for simultaneous multi-task processing and multi-task computing environments.
The heterogeneous mode realizes "collaborative computing and mutual acceleration" between computing units that use different instruction sets and architectures, thereby breaking through the development bottleneck of a single processor architecture and effectively addressing problems such as energy consumption and scalability. Among general processor chips (CPU, DSP, GPU, FPGA, ASIC, etc.), the CPU and GPU require software support, while the FPGA and ASIC are software-hardware integrated architectures in which the software is solidified into hardware. In terms of energy efficiency ratio: ASIC > FPGA > GPU > CPU. The root cause of this result is that, for compute-intensive algorithms, the higher the efficiency of data movement and computation, the higher the energy efficiency ratio. The ASIC and FPGA are closer to the underlying I/O, so their computation efficiency and data movement are high; but the FPGA has redundant transistors and wiring and a low operating frequency, so it does not reach the energy efficiency of the ASIC. The GPU and CPU are general-purpose processors; both must go through instruction fetch, instruction decode, and instruction execution, which shields the handling of the underlying I/O and decouples software from hardware, but data movement and computation cannot reach high efficiency, so neither has as high an energy efficiency ratio as the ASIC or FPGA. The difference in energy efficiency between the GPU and the CPU is mainly because most of the transistors in the CPU are used for cache and control logic; for compute-intensive algorithms of low computational complexity, these redundant transistors cannot play a role, so the CPU's energy efficiency ratio is lower than the GPU's.
Over their long development, these processor chips have each formed distinctive usage characteristics and market ecosystems. The CPU and GPU fields have a large amount of open-source software and application software; any new technology is usually first implemented as an algorithm on the CPU, so CPU programming resources are rich and easy to obtain, development cost is low, and the development cycle is short. FPGA implementations use low-level hardware description languages such as Verilog/VHDL, so a developer needs a deeper understanding of the FPGA's chip characteristics, but the high parallelism of the FPGA can often bring order-of-magnitude improvements in service performance. At the same time, the FPGA is dynamically reconfigurable: after being deployed in a data center, it can be configured with different logic according to the service form to realize different hardware acceleration functions. For example, an FPGA board on a current server is deployed with picture compression logic serving QQ services; if real-time advertisement prediction needs to be expanded to obtain more FPGA computing resources, the FPGA board can be turned into "new" hardware serving real-time advertisement prediction through a simple FPGA reconfiguration process, which makes it very suitable for batch deployment. An ASIC chip can obtain optimal performance, i.e., high area utilization, high speed, and low power consumption; but the risk of developing an ASIC is extremely large, a sufficiently large market is needed to justify its cost, and the time period from development to market is long, so it is not well suited to fields such as deep learning CNNs where algorithms iterate rapidly.
Based on this background on artificial intelligence chips and artificial neural networks, the applicant has found, through long-term research and the collection and verification of experimental data, that in existing artificial intelligence chips used to process artificial neural network operations, the computing power of the chip can hardly keep up with the explosively growing computing demand of the artificial neural network, and the gap between the two keeps widening, so the computing capability of the chip needs to be further improved. Therefore, the problem of improving the internal computing capability of the artificial intelligence chip to meet the computing requirements of the artificial neural network needs to be solved.
The technical solutions related to the embodiments of the present application are described below in conjunction with the scenarios applied by the embodiments of the present application.
It should be noted that, in the data processing method provided in the embodiments of the present application, the execution body may be a data processing apparatus, and the data processing apparatus may be implemented, by software, hardware, or a combination of software and hardware, as part or all of a vector engine module in a chip. The chip may be an artificial intelligence chip, for example a CPU, DSP, GPU, FPGA, or ASIC, and may include a vector engine, a scalar engine, and an intelligent engine, which may be hardware components or components combining software and hardware. The chip can be applied to various intelligent devices, such as personal computers, notebook computers, smartphones, tablet computers, and Internet-of-Things devices; the embodiments of the present application do not specifically limit the device in which the chip is applied. In the following method embodiments, the execution body is the vector engine in the chip.
In one embodiment, as shown in fig. 1, there is provided a data processing method applied to a vector engine in a chip, including the steps of:
step 101, receiving a channel configuration instruction, and configuring the number of instruction channels of the vector engine to be a first number according to the channel configuration instruction.
The chip may be an artificial intelligence chip, which may include a vector engine. The processing capability of the artificial intelligence chip can be increased by increasing the vector processing capability within the chip. Specifically, a vector processing module can be designed in the chip to batch-process data handled in parallel within the chip; this vector processing module is the vector engine. In the embodiment of the application, instruction channels and parallel-instruction parameters are introduced for the vector engine, and the number of instruction channels and the number of parallel instructions within each channel can be flexibly configured according to the calculation requirement, so that the chip achieves flexible, high-performance parallel calculation based on the vector engine and performs the calculations related to the artificial neural network more efficiently.
For ease of understanding, a vector engine designed in an embodiment of the present application will be described first.
Artificial intelligence calculation consists mainly of large amounts of linear algebra, such as tensor processing, with relatively simple control flow, so it is well suited to performance improvement through a parallel-computing acceleration module. The vector engine in the embodiment of the application implements SIMD (Single Instruction Multiple Data), is suited to dense, large-volume data-driven workloads, and improves parallel computing capability for the artificial neural network.
In SIMD, all parallel execution units are synchronized and execute the same instruction issued by the same program counter, but each may have its own address register so that different data are fetched for the operation. SIMD can operate on a large amount of data with a single instruction fetch and decode, so the time cost of fetching and decoding is amortized, instruction bandwidth and instruction space are saved, and the performance-to-power ratio is improved. As shown in fig. 2, which is a schematic diagram of vector engine data processing provided in the embodiment of the present application, 201 denotes the parallel instructions in a single instruction channel, 202 denotes the operation data corresponding to the artificial neural network, and 203 denotes the vector calculation result after parallel processing. It can be seen that the vector engine achieves efficient parallel processing of the operation data through the parallel instructions in the configured instruction channels.
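As a rough illustration of this SIMD behavior, the following C++ sketch applies one "instruction" to a block of operation data in a single pass; the type and function names (VectorInstruction, executeSimd) and the ReLU-style example operation are illustrative assumptions, not elements defined by the application.

```cpp
// Minimal sketch of SIMD-style processing: one decoded instruction is applied
// to many data elements at once, so fetch/decode cost is amortized.
// All names here are illustrative, not part of the patent.
#include <cstddef>
#include <functional>
#include <iostream>
#include <vector>

// One "instruction" is modeled as an elementwise operation.
using VectorInstruction = std::function<float(float)>;

// Apply a single instruction to a whole block of operation data
// (one fetch/decode, many data elements), in the spirit of FIG. 2.
std::vector<float> executeSimd(const VectorInstruction& instr,
                               const std::vector<float>& operationData) {
    std::vector<float> result(operationData.size());
    for (std::size_t i = 0; i < operationData.size(); ++i) {
        result[i] = instr(operationData[i]);   // same instruction, different data
    }
    return result;
}

int main() {
    // Hypothetical neural-network operation data (202 in FIG. 2).
    std::vector<float> data{-1.0f, 0.5f, 2.0f, -3.0f};
    // A ReLU-like instruction as an example of a vectorized operation.
    VectorInstruction relu = [](float x) { return x > 0.0f ? x : 0.0f; };
    for (float v : executeSimd(relu, data)) {
        std::cout << v << ' ';                 // parallel-style result (203 in FIG. 2)
    }
    std::cout << '\n';
    return 0;
}
```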
In this embodiment of the present application, for the vector engine, the configuration of the number of instruction channels of the vector engine is external, that is, the vector engine may receive, through a top port, a channel configuration instruction sent by another module in the chip, and configure the number of instruction channels of the vector engine to be the first number according to the channel configuration instruction. Optionally, the channel configuration instruction is determined by other modules based on matching the computational requirements of the artificial neural network currently processed by the chip.
Alternatively, the total number of instruction channels in the vector engine may be a preset number, and the vector engine supports 64-bit floating point operations. The preset number may be, for example, 16. Accordingly, the first number may be less than or equal to the predetermined number, for example, between 1 and 16.
Step 102, the number of parallel instructions in each instruction channel is configured as a second number.
To reduce design cost, the allocation of the parallel instructions executable within a single instruction channel is left to the vector engine itself, so the vector engine can flexibly adjust and configure the number of parallel instructions in each instruction channel to a second number based on its current overhead requirement.
Step 103, performing vector operation processing on the target data based on the first number of instruction channels and the second number of parallel instructions in each instruction channel. The target data is operation data corresponding to the artificial neural network currently processed by the chip.
For the vector engine, after the first number and the second number are configured, the vector operation can be started.
Optionally, during operation of the vector engine, the channel configuration instructions may continue to be received to update the number of instruction channels. And it may also update the number of parallel instructions within each instruction channel based on its own overhead requirements.
Optionally, the vector engine is reset each time before starting operation, clearing all internal register states and incomplete data calculations, so that the current target data is calculated accurately.
Optionally, in the whole chip operation process, when a new artificial neural network mapping is performed, the number of instruction channels and the number of parallel instructions of the vector engine need to be readjusted for reconfiguration and recalculation.
In the data processing method, the vector engine in the chip can configure the number of instruction channels of the vector engine to be a first number according to the channel configuration instruction by receiving the channel configuration instruction; further, the vector engine may autonomously configure the number of parallel instructions in each instruction channel to a second number, so that vector operation processing may be performed on the target data based on the first number of instruction channels and the second number of parallel instructions in each instruction channel. The target data is operation data corresponding to the artificial neural network currently processed by the chip. That is, the vector engine can perform parallel vector calculation on the operation data of the artificial neural network based on the configured instruction channels and the parallel instructions in the instruction channels, so that the calculation processing capacity of the artificial neural network in the artificial intelligent chip is effectively improved. In addition, the number of the instruction channels is configured through the received channel configuration instructions, namely, the instruction channels are configured through other modules in the chip, so that the number of the instruction channels of the vector engine can be ensured to meet corresponding calculation requirements when different artificial neural network operations are performed; the vector engine can autonomously configure the number of parallel instructions in each instruction channel to be the second number, so that the vector engine can timely adjust the computing capacity of the vector engine, and the flexibility of parallel computation is ensured.
The following describes the process by which the vector engine receives the channel configuration instruction.
In one embodiment, receiving a channel configuration instruction includes:
and receiving a channel configuration instruction sent by a target engine in the chip, wherein the target engine comprises a scalar engine or an intelligent engine, and the channel configuration instruction is determined by the target engine according to the current operation calculation requirement of the artificial neural network.
Specifically, the other modules in the chip may be a scalar engine and an intelligent engine included in the chip, where the scalar engine is mainly used to integrally control the operation of each module in the chip, and the intelligent engine is mainly used to perform specific computation on the data currently processed by the chip, for example, the data after the parallel processing performed by the vector engine is transmitted to the intelligent engine to perform specific subsequent computation.
In the embodiment of the present application, in the process of mapping the currently processed artificial neural network to the chip for performing the operation processing, the scalar engine may determine the operation calculation requirement of the current artificial neural network, for example, the current operation is to perform convolution processing or pooling processing in a certain type of artificial neural network, and so on, and then the scalar engine may determine, based on the operation calculation requirement, how many instruction channels configured by the current vector engine can meet the current operation requirement. Thus, the vector engine may receive a channel configuration instruction sent by the scalar engine to configure the number of instruction channels to the first number.
Optionally, a channel configuration table may be preset in the chip, containing correspondences between multiple groups of operation calculation requirements and instruction channel counts. After identifying the current target operation calculation requirement, the scalar engine may query the channel configuration table to determine a target instruction channel count, generate a channel configuration instruction based on it, and send the instruction to the vector engine, which takes the target instruction channel count as the first number.
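A minimal sketch of this table lookup is given below, assuming a hypothetical channel configuration table keyed by the operation requirement and a preset total of 16 instruction channels; all names and table entries are illustrative, not taken from the application.

```cpp
// Sketch of the channel-configuration-table lookup described above; the table
// contents, names, and the cap of 16 channels are illustrative assumptions.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

constexpr std::uint32_t kTotalInstructionChannels = 16;  // preset total (example value)

struct ChannelConfigInstruction {
    std::uint32_t channelCount;  // becomes the "first number" in the vector engine
};

// Scalar engine side: map an operation requirement to a channel count.
ChannelConfigInstruction buildChannelConfig(const std::string& opRequirement) {
    static const std::unordered_map<std::string, std::uint32_t> channelConfigTable{
        {"convolution", 8},   // example entries only
        {"pooling", 4},
        {"elementwise", 2},
    };
    auto it = channelConfigTable.find(opRequirement);
    std::uint32_t channels = (it != channelConfigTable.end()) ? it->second : 1;
    return ChannelConfigInstruction{std::min(channels, kTotalInstructionChannels)};
}

// Vector engine side: apply the received instruction as the first number.
struct VectorEngine {
    std::uint32_t instructionChannels = 1;
    void onChannelConfig(const ChannelConfigInstruction& cfg) {
        instructionChannels = std::min(cfg.channelCount, kTotalInstructionChannels);
    }
};

int main() {
    VectorEngine engine;
    engine.onChannelConfig(buildChannelConfig("convolution"));
    std::cout << "instruction channels: " << engine.instructionChannels << '\n';  // 8
    return 0;
}
```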
In addition, in the embodiment of the application, during the specific data calculation process, the intelligent engine can also generate a channel configuration instruction in real time based on the current data calculation requirement and send it to the vector engine, so that the vector engine configures the number of instruction channels to the first number. In this way, the amount of data transmitted to the intelligent engine for calculation after the vector engine's parallel processing becomes controllable, ensuring that the vector engine and the intelligent engine process the target data normally and smoothly at the same time.
Alternatively, both the scalar engine and the intelligent engine may be connected to the vector engine through a top port for sending channel configuration instructions.
In the embodiment of the application, the channel count of the vector engine is exposed externally through the top port and is triggered directly by the scalar engine or the intelligent engine. Therefore, when different artificial neural network operations are performed, the scalar engine or the intelligent engine can ensure that the vector engine is given a channel count suitable for the current operation, so that the parallel computing capability of the vector engine matches the current artificial neural network, the flexibility of parallel computing is guaranteed, and the problem of insufficient computing power of current artificial intelligence chips is effectively alleviated.
The process of configuring the number of parallel instructions within each instruction channel of the vector engine will be described below.
Referring to fig. 3, a flowchart illustrating a configuration of the number of parallel instructions according to an embodiment of the present application is shown. Configuring the number of parallel instructions within each instruction channel to a second number, comprising:
in step 301, a second number is determined based on overhead requirements of the vector engine.
In step 302, if the second number is inconsistent with the pre-configured initial parallel instruction number, the number of parallel instructions in each instruction channel is adjusted from the initial parallel instruction number to the second number.
Optionally, the number of parallel instructions in each channel can be initially configured by the vector engine itself, or can be preconfigured by the scalar engine, and in the vector calculation process of the vector engine, the number of parallel instructions in each channel can be self-adaptively adjusted based on the current overhead requirement of the vector engine itself.
The overhead requirement of the vector engine can be determined from factors such as its current computing load and power consumption, and the second number determined under different conditions, such as lower or higher current computing load or lower current power consumption, may differ, so that situations such as computing-power overload or running errors of the vector engine are avoided.
After the second number is determined, if it differs from the initial number of parallel instructions configured in advance, the number of parallel instructions needs to be adjusted to the second number in time during the vector calculation to ensure normal vector operation.
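The following sketch models steps 301 and 302 under the assumption that the overhead requirement can be summarized by a current compute load and a remaining power budget; the thresholds and names are illustrative only.

```cpp
// Sketch of the adjustment in steps 301-302: derive a second number from the
// engine's current overhead and apply it only if it differs from the initial
// configuration. Thresholds and field names are illustrative assumptions.
#include <cstdint>
#include <iostream>

struct OverheadState {
    double computeLoad;   // 0.0 .. 1.0, fraction of compute capacity in use
    double powerBudget;   // 0.0 .. 1.0, fraction of power budget remaining
};

// Step 301: determine the second number from the overhead requirement.
std::uint32_t determineParallelInstructions(const OverheadState& s,
                                            std::uint32_t maxPerChannel) {
    if (s.computeLoad > 0.9 || s.powerBudget < 0.1) return 1;   // near overload
    if (s.computeLoad > 0.5) return maxPerChannel / 2;          // moderate load
    return maxPerChannel;                                       // light load
}

// Step 302: adjust only when the new value differs from the initial one.
std::uint32_t adjustParallelInstructions(std::uint32_t initialCount,
                                         const OverheadState& s,
                                         std::uint32_t maxPerChannel) {
    std::uint32_t secondNumber = determineParallelInstructions(s, maxPerChannel);
    return (secondNumber == initialCount) ? initialCount : secondNumber;
}

int main() {
    OverheadState state{0.7, 0.5};
    std::uint32_t perChannel = adjustParallelInstructions(/*initialCount=*/8, state,
                                                          /*maxPerChannel=*/8);
    std::cout << "parallel instructions per channel: " << perChannel << '\n';  // 4
    return 0;
}
```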
In the embodiment of the application, letting the vector engine itself allocate the parallel instructions executable within a single channel simplifies design cost; since the vector engine accounts for the overhead of the instructions itself, additional adjustments such as the logic decision function can be accommodated well. The flexibility of parallel computation is thus ensured, computing capacity is increased by maximally exploiting parallelism, the problem of insufficient computing power of current artificial intelligence chips is effectively alleviated, and normal operation of the vector engine is guaranteed.
In the process of mapping the current artificial neural network to the chip for data operation processing, the chip may also process data of other types of artificial neural networks or other operation tasks, and at this time, the vector engine is required to have the capability of processing newly added data. In view of this, in the embodiment of the present application, a channel instruction interchange mechanism of a vector engine is designed to implement simultaneous computation of multiple data, which is specifically as follows.
In one embodiment, the data processing method further comprises: if a configuration change instruction sent by a scalar engine in a chip is received, carrying out configuration change processing based on the configuration change instruction, wherein the configuration change processing comprises: according to the first quantity and the second quantity, the quantity of the instruction channels of the vector engine is configured to be a third quantity, and the quantity of parallel instructions in each instruction channel is configured to be a fourth quantity; wherein the third number is smaller than the first number, and the vector computing power of the vector engine is equal before and after the configuration change process is performed.
Wherein, since the scalar engine is responsible for global data scheduling and module control, it can be determined by the scalar engine whether to trigger a change in vector engine configuration.
Optionally, when another operation task is added, the scalar engine determines the number of candidate channels required by that task according to its operation calculation requirement, generates a configuration change instruction based on the number of candidate channels, and sends it to the vector engine. Based on the configuration change instruction, the vector engine changes the number of instruction channels used for calculating the target data of the current artificial neural network and adaptively adjusts the number of parallel instructions within each channel.
For example, suppose the first number is 4 and the second number is 4; that is, before the other task is processed, the vector engine performs the operation processing on the target data in parallel with 4 instructions on each of 4 channels. If the number of candidate channels determined from the configuration change instruction is 2, then the third number is 2 and the fourth number is 8; that is, the vector engine switches to performing the operation processing on the target data in parallel with 8 instructions on each of 2 channels, while the two freed channels can be used to process the data of the other task in parallel. Optionally, the number of parallel instructions in those two channels can again be configured under the trigger of the scalar engine.
Since 4 instructions on each of 4 channels execute in parallel before the change, and 8 instructions on each of 2 channels execute in parallel after the change, the vector computing power of the vector engine is equal before and after the configuration change processing. The vector computing capability of the vector engine therefore remains unchanged, while several channels are freed for other tasks, which improves the flexibility of the chip's overall data processing. In addition, the configuration change process is triggered by the scalar engine, enabling the scalar engine to schedule global data more reasonably.
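The invariant behind this interchange, channels times parallel instructions staying constant, can be sketched as follows; the EngineConfig structure and applyConfigChange function are hypothetical names used only for illustration.

```cpp
// Sketch of the channel-instruction interchange: the product of channels and
// parallel instructions (the vector computing power) stays constant while
// channels are freed for other tasks. Names are illustrative assumptions.
#include <cassert>
#include <cstdint>
#include <iostream>

struct EngineConfig {
    std::uint32_t channels = 0;             // first / third number
    std::uint32_t parallelInstructions = 0; // second / fourth number
    std::uint32_t computePower() const { return channels * parallelInstructions; }
};

// Apply a configuration change instruction that frees `freedChannels` channels
// for another task, redistributing instructions so compute power is unchanged.
EngineConfig applyConfigChange(const EngineConfig& before, std::uint32_t freedChannels) {
    EngineConfig after;
    after.channels = before.channels - freedChannels;                    // third number
    after.parallelInstructions = before.computePower() / after.channels; // fourth number
    assert(after.computePower() == before.computePower());               // power preserved
    return after;
}

int main() {
    EngineConfig before{4, 4};                          // 4-channel, 4-instruction
    EngineConfig after = applyConfigChange(before, 2);  // 2 channels freed for other tasks
    std::cout << after.channels << " channels x "
              << after.parallelInstructions << " instructions\n";        // 2 x 8
    return 0;
}
```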
The above embodiments describe data processing with the vector engine in the normal computing state. In addition, to maximally utilize the decision capability of the vector engine itself and realize fast logic decisions, in the embodiment of the present application the vector engine may also enter a flexible vector decision state, in which it can merge multiple decision operations into a custom vector decision so as to realize fast logic decisions.
Referring to fig. 4, a schematic diagram of a decision processing flow provided in an embodiment of the present application is shown. The data processing method further comprises the following steps:
step 401, obtaining a plurality of decision operations of the same type, and merging the plurality of decision operations of the same type to obtain a target decision operation. Wherein the decision operation includes an operation of performing a logical decision.
The vector engine needs no extra operation while it is in the normal computing state. When it needs to decide multiple decision operations of the same type, it can enter the flexible vector decision state. A decision operation is an operation that performs a logical decision, for example an if decision, an else decision, or a while decision.
Step 402, performing decision processing on the target decision operation.
In this embodiment of the present application, a logic decision module is provided in each instruction channel of the vector engine, where the logic decision module is used to perform logic decision operation, and the logic decision module may be implemented by a hardware component or a component combining hardware and software.
After entering the flexible vector decision state, the logic decision module can acquire the multiple decision operations and, using the decision capability of the vector engine, merge them into a custom target decision operation; decision processing can then be carried out directly on the target decision operation using the vector decision mechanism. In this way, multiple decision operations that would otherwise have to be calculated one by one can be handled by performing decision processing only on the target decision operation, which improves decision efficiency.
Step 403, feeding back the decision result of the decision processing to the scalar engine in the chip.
The vector engine in the embodiment of the application is also provided with a decision feedback mechanism. After the vector engine performs decision processing, the decision result can be fed back to the scalar engine through the top port so as to ensure that the scalar engine does not perform repeated decision.
Optionally, the vector engine is further provided with an overhead feedback mechanism: the vector engine can synchronously inform the scalar engine of its current overhead configuration, so that the scalar engine can adjust its scheduling reasonably and adjust the data processing capability of the vector engine in time.
In the embodiment of the application, introducing the logic decision function into the vector engine allows it to take on part of the logic decisions in vector calculation, and merging decision operations of the same type and performing decision processing on the merged target decision operation improves the vector engine's efficiency in logic decisions. Data results do not need to be returned over the bus, so decisions and calculations are made quickly and the overall parallel computing speed is improved; in addition, feeding the decision result back to the scalar engine ensures that the scalar engine does not make repeated decisions.
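As a simplified illustration of merging same-type decision operations into one target decision that is evaluated in a single vectorized pass, consider the following sketch; the threshold-comparison form of the decisions and all identifiers are assumptions made for the example.

```cpp
// Sketch of merging several decision operations of the same type (here, "if"
// comparisons against a threshold) into one target decision that is evaluated
// in a single pass over the data. All names are illustrative assumptions.
#include <iostream>
#include <vector>

struct ThresholdDecision {   // one "if x > threshold" style decision
    float threshold;
};

// Merge decisions of the same type and evaluate them together over the data.
std::vector<bool> evaluateMergedDecision(const std::vector<ThresholdDecision>& decisions,
                                         const std::vector<float>& data) {
    std::vector<bool> results;
    results.reserve(decisions.size() * data.size());
    for (float x : data) {                       // one pass over the data
        for (const auto& d : decisions) {        // all merged decisions evaluated here
            results.push_back(x > d.threshold);
        }
    }
    return results;
}

int main() {
    std::vector<ThresholdDecision> sameTypeDecisions{{0.0f}, {1.0f}, {2.0f}};
    std::vector<float> data{0.5f, 3.0f};
    auto decisionResult = evaluateMergedDecision(sameTypeDecisions, data);
    // The decision result would then be fed back to the scalar engine.
    for (bool b : decisionResult) std::cout << b << ' ';
    std::cout << '\n';
    return 0;
}
```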
In addition, the vector engine is provided with a computing-power overload termination mechanism, which is used to keep its own vector operation controllable; under special conditions, the vector engine can exit the logic decision function and notify the scalar engine. Specifically, this is described as follows.
In one embodiment, the data processing method further comprises: if a decision exit condition is met, stopping the execution of obtaining a plurality of decision operations of the same type, merging the plurality of decision operations of the same type to obtain a target decision operation, and performing decision processing on the target decision operation; wherein the decision exit condition includes at least one of: the computing power of the vector engine is overloaded; the vector engine has a calculation error.
During the operation of the vector engine, if a calculation error occurs in the calculation process, or if the computing power of the vector engine is overloaded, for example the amount of calculation currently to be processed exceeds the load of the vector engine, the vector engine automatically stops executing the logic decision path for the target decision operation and stops the decision processing, thereby ensuring that its own vector operation remains controllable.
Optionally, in the event that a decision exit condition is met, the vector engine may also inform the scalar engine that it has currently exited the decision processing for the target decision operation.
In addition, after the calculation and decision are completed, the vector engine can exit the flexible vector decision mechanism, be set to the normal computing state, and continue to perform operation processing on the target data.
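Purely as an illustrative sketch, the exit behaviour described above could be modelled as follows; the state names, the overload test, and the `notify_scalar_engine` callback are assumptions, not details given in the patent.

```python
NORMAL_COMPUTING = "normal_computing"
FLEXIBLE_DECISION = "flexible_vector_decision"

def check_decision_exit(state, pending_workload, capacity, calc_error, notify_scalar_engine):
    """Exit the flexible vector decision state on computing-power overload or calculation error."""
    if state != FLEXIBLE_DECISION:
        return state
    overloaded = pending_workload > capacity   # assumed overload criterion
    if overloaded or calc_error:
        # Stop the logic decision path and tell the scalar engine that we exited.
        notify_scalar_engine("decision processing for the target decision operation exited")
        return NORMAL_COMPUTING
    return state

new_state = check_decision_exit(FLEXIBLE_DECISION, pending_workload=120, capacity=100,
                                calc_error=False, notify_scalar_engine=print)
print(new_state)  # normal_computing
```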
In one embodiment, for ease of understanding, the overall configuration-to-operation process of the vector engine on the chip is described below.
S1, a reset is performed before the vector engine starts calculation, clearing all internal register states and any incomplete data calculation.
S2, the scalar engine or the intelligent engine triggers the configuration of the number of channels of the vector engine.
The number of channels of the vector engine is exposed externally through the top port. Before the vector engine starts to calculate, it can first receive a channel configuration instruction sent by the scalar engine or the intelligent engine; the channel configuration instruction is used to configure the required number of channels so as to match the requirements of the relevant calculation flow.
S3, the scalar engine triggers the pre-configuration of the number of parallel instructions in each channel of the vector engine.
S4, after the number of channels of the vector engine and the number of parallel instructions in each channel are configured, the vector engine is in the normal computing state, starts vector operation, and performs parallelized vector calculation processing on the operation data of the target artificial neural network.
S5, in the process of executing calculation, the vector engine automatically redistributes the number of instructions among the channels based on its current overhead demand.
S6, in the process of executing calculation by the vector engine, the scalar engine triggers a channel instruction interchange mechanism based on the current operation processing requirement of the chip.
Triggering the channel instruction interchange mechanism means, for example, switching from executing 4 instructions in parallel on 4 channels to executing 8 instructions in parallel on 2 channels. In this way, the vector computing capacity of the vector engine is unchanged, but the freed channels can be allocated by the scalar engine to other tasks, so that the scalar engine can schedule global data more reasonably.
S7, in the overall chip calculation process, when a new artificial neural network mapping is carried out, the number of channels of the vector engine and the number of parallel instructions in each channel need to be readjusted for reconfiguration and recalculation.
S8, when the vector engine detects that multiple decision operations of the same type need to be decided, it enters the flexible vector decision state.
At this time, the vector engine uses its own decision capability to combine the multiple decision operations into a custom vector decision, and uses the vector decision mechanism to perform the decision operation and obtain a decision result.
The vector engine stops executing the logic decision path when its own computing power is overloaded, which ensures that its own vector operation remains controllable. When an exception or the like occurs in the calculation, the flexible vector decision state is exited and the scalar engine is notified. After exiting the flexible vector decision mechanism, the vector engine is set back to the normal computing state.
S9, the vector engine feeds the decision result back to the scalar engine. This ensures that the scalar engine does not repeat the decision; the scalar engine is also informed of the overhead configuration, so that it can conveniently adjust and schedule the vector engine.
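To make the S1 to S9 flow above concrete, the following is a hedged Python sketch of a software model of the configuration-to-operation sequence. All class and method names are invented for illustration; the only property taken from the text is that the channel instruction interchange keeps the total vector computing capacity (channels multiplied by parallel instructions) unchanged.

```python
class VectorEngineModel:
    """Toy software model of the configure-then-operate flow (S1 to S6); not the hardware itself."""

    def __init__(self):
        self.reset()

    def reset(self):                                 # S1: clear register state and pending work
        self.channels = 0
        self.parallel_instructions = 0
        self.state = "idle"

    def configure_channels(self, first_number):      # S2: channel configuration instruction
        self.channels = first_number

    def configure_parallel_instructions(self, second_number):  # S3: per-channel instruction count
        self.parallel_instructions = second_number
        self.state = "normal_computing"              # S4: ready for parallelized vector operation

    def interchange(self, new_channels):             # S6: channel instruction interchange
        total = self.channels * self.parallel_instructions
        assert total % new_channels == 0, "total capacity must remain expressible"
        # Capacity stays constant: e.g. 4 channels x 4 instructions becomes 2 channels x 8.
        self.parallel_instructions = total // new_channels
        self.channels = new_channels

engine = VectorEngineModel()
engine.configure_channels(4)
engine.configure_parallel_instructions(4)
engine.interchange(2)
print(engine.channels, engine.parallel_instructions)  # 2 8
```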
The chip architecture scheme above, which improves the parallel computing capability of the artificial neural network based on the vector engine, can make maximum use of parallel computing to improve computing power and effectively alleviates the problem of insufficient computing power of current artificial intelligence chips. The channel and instruction mechanism can be adjusted adaptively, and the number of channels to be scheduled and the number of parallel instructions in the channels can be adjusted according to the calculation configuration requirements, which ensures the flexibility of parallel calculation. The introduction of the logic decision function of the vector engine can take over part of the logic decisions in vector calculation; since the data result does not need to be returned through a bus, decisions and calculations are performed quickly, and the overall parallel calculation speed is improved.
It should be understood that, although the steps in the flowcharts related to the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; the execution order of these sub-steps or stages is not necessarily sequential, and they may be performed in turns or alternately with at least some of the other steps, sub-steps, or stages.
Based on the same inventive concept, the embodiment of the application also provides a data processing device for realizing the above related data processing method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation of one or more embodiments of the data processing device provided below may refer to the limitation of the data processing method hereinabove, and will not be repeated herein.
In one embodiment, as shown in fig. 5, there is provided a data processing apparatus applied to a vector engine in a chip, the data processing apparatus 500 including: a receiving module 501, a configuring module 502 and an operation module 503, wherein:
a receiving module 501, configured to receive a channel configuration instruction, and configure the number of instruction channels of the vector engine to be a first number according to the channel configuration instruction;
a configuration module 502, configured to configure the number of parallel instructions in each instruction channel to a second number;
the operation module 503 is configured to perform vector operation processing on target data based on the first number of instruction channels and the second number of parallel instructions in each instruction channel, where the target data is the operation data corresponding to the artificial neural network currently processed by the chip.
In one embodiment, the receiving module 501 is specifically configured to: and receiving a channel configuration instruction sent by a target engine in the chip, wherein the target engine comprises a scalar engine or an intelligent engine, and the channel configuration instruction is determined by the target engine according to the current operation calculation requirement of the artificial neural network.
In one embodiment, the configuration module 502 is specifically configured to: determine the second number based on the overhead requirements of the vector engine; and, if the second number is inconsistent with the pre-configured initial parallel instruction number, adjust the number of parallel instructions in each instruction channel from the initial parallel instruction number to the second number.
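The fragment below is only a sketch of how the behaviour of configuration module 502 could be expressed in software; the mapping from the overhead requirement to the instruction count is an assumption, since the patent does not specify how the second number is derived.

```python
def choose_parallel_instructions(initial_count, overhead_requirement, max_per_channel=16):
    """Pick the per-channel parallel instruction count (the 'second number') and adjust
    only if it differs from the pre-configured initial count."""
    # Assumed mapping: a larger overhead requirement asks for more parallel instructions, capped.
    second_number = min(max(1, overhead_requirement), max_per_channel)
    if second_number != initial_count:
        return second_number          # adjust from the initial count to the second number
    return initial_count              # counts already match, no adjustment needed

print(choose_parallel_instructions(initial_count=4, overhead_requirement=8))   # 8
print(choose_parallel_instructions(initial_count=4, overhead_requirement=4))   # 4
```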
In one embodiment, as shown in fig. 6, another data processing apparatus is provided, the data processing apparatus 600 further comprising:
a modification module 504, configured to perform configuration change processing based on a configuration change instruction if the configuration change instruction sent by the scalar engine in the chip is received, where the configuration change processing includes: configuring, according to the first number and the second number, the number of instruction channels of the vector engine to a third number, and configuring the number of parallel instructions in each instruction channel to a fourth number; wherein the third number is smaller than the first number, and the vector computing capability of the vector engine is equal before and after the configuration change processing is performed.
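As an illustration of the constraints stated above (the third number smaller than the first, and the vector computing capability unchanged), a configuration change request could be validated as in the sketch below; the function name and the capacity model (channels multiplied by parallel instructions) are assumptions.

```python
def validate_configuration_change(first, second, third, fourth):
    """Check the constraints described for the configuration change processing."""
    if third >= first:
        raise ValueError("the third number must be smaller than the first number")
    if first * second != third * fourth:
        raise ValueError("vector computing capability must be equal before and after the change")
    return third, fourth

print(validate_configuration_change(4, 4, 2, 8))  # (2, 8)
```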
In one embodiment, the data processing apparatus 600 further comprises:
the decision module 505 is configured to obtain a plurality of decision operations of the same type, combine the plurality of decision operations of the same type to obtain a target decision operation, and perform decision processing on the target decision operation; the judgment operation comprises the operation of logic judgment; and feeding back the decision result of the decision processing to a scalar engine in the chip.
In one embodiment, the data processing apparatus 600 further comprises:
A stopping module 506, configured to, if a decision exit condition is satisfied, stop the execution of obtaining a plurality of decision operations of the same type, merging the plurality of decision operations of the same type to obtain a target decision operation, and performing decision processing on the target decision operation; wherein the decision exit condition includes at least one of: the computing power of the vector engine is overloaded; the vector engine has a calculation error.
Each of the modules in the above data processing apparatus may be implemented in whole or in part by software, by hardware, or by a combination thereof. The above modules may be embedded in hardware form in, or be independent of, a processor in a computer device, or may be stored in software form in a memory of the computer device, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a chip is provided, including a scalar engine, a vector engine, and an intelligent engine, the block diagram of which may be as shown in FIG. 7. The vector engine is used to implement the steps of any of the method embodiments described above.
It will be appreciated by those skilled in the art that the structure shown in fig. 7 is merely a block diagram of a portion of the structure related to the solution of the present application and does not constitute a limitation of the chip to which the present application is applied; a particular chip may include more or fewer components than shown, combine some of the components, or have a different arrangement of components.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
receiving a channel configuration instruction, and configuring the number of instruction channels of the vector engine to be a first number according to the channel configuration instruction; configuring the number of parallel instructions in each instruction channel to be a second number; and executing vector operation processing on target data based on the first number of instruction channels and the second number of parallel instructions in each instruction channel, wherein the target data is operation data corresponding to the artificial neural network currently processed by the chip.
In one embodiment, the computer program when executed by the processor further performs the steps of:
and receiving the channel configuration instruction sent by the target engine in the chip, wherein the target engine comprises a scalar engine or an intelligent engine, and the channel configuration instruction is determined by the target engine according to the current operation calculation requirement of the artificial neural network.
In one embodiment, the computer program when executed by the processor further performs the steps of:
determining the second number based on overhead requirements of the vector engine; and if the second number is inconsistent with the initial parallel instruction number which is pre-configured, adjusting the number of parallel instructions in each instruction channel from the initial parallel instruction number to the second number.
In one embodiment, the computer program when executed by the processor further performs the steps of:
if a configuration change instruction sent by the scalar engine in the chip is received, performing configuration change processing based on the configuration change instruction, wherein the configuration change processing includes: configuring, according to the first number and the second number, the number of instruction channels of the vector engine to a third number, and configuring the number of parallel instructions in each instruction channel to a fourth number; wherein the third number is smaller than the first number, and the vector computing capability of the vector engine is equal before and after the configuration change processing is performed.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring a plurality of decision operations of the same type, merging the plurality of decision operations of the same type to obtain a target decision operation, and performing decision processing on the target decision operation; the decision operation includes an operation of performing a logical decision; and feeding back the decision result of the decision processing to a scalar engine in the chip.
In one embodiment, the computer program when executed by the processor further performs the steps of:
if a decision exit condition is met, stopping the execution of obtaining a plurality of decision operations of the same type, merging the plurality of decision operations of the same type to obtain a target decision operation, and performing decision processing on the target decision operation; wherein the decision exit condition includes at least one of: the computing power of the vector engine is overloaded; the vector engine has a calculation error.
In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, performs the steps of:
receiving a channel configuration instruction, and configuring the number of instruction channels of the vector engine to be a first number according to the channel configuration instruction; configuring the number of parallel instructions in each instruction channel to be a second number; and executing vector operation processing on target data based on the first number of instruction channels and the second number of parallel instructions in each instruction channel, wherein the target data is operation data corresponding to the artificial neural network currently processed by the chip.
In one embodiment, the computer program when executed by the processor further performs the steps of:
and receiving the channel configuration instruction sent by the target engine in the chip, wherein the target engine comprises a scalar engine or an intelligent engine, and the channel configuration instruction is determined by the target engine according to the current operation calculation requirement of the artificial neural network.
In one embodiment, the computer program when executed by the processor further performs the steps of:
determining the second number based on overhead requirements of the vector engine; and if the second number is inconsistent with the initial parallel instruction number which is pre-configured, adjusting the number of parallel instructions in each instruction channel from the initial parallel instruction number to the second number.
In one embodiment, the computer program when executed by the processor further performs the steps of:
if a configuration change instruction sent by the scalar engine in the chip is received, performing configuration change processing based on the configuration change instruction, wherein the configuration change processing includes: configuring, according to the first number and the second number, the number of instruction channels of the vector engine to a third number, and configuring the number of parallel instructions in each instruction channel to a fourth number; wherein the third number is smaller than the first number, and the vector computing capability of the vector engine is equal before and after the configuration change processing is performed.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring a plurality of decision operations of the same type, merging the plurality of decision operations of the same type to obtain a target decision operation, and performing decision processing on the target decision operation; the decision operation includes an operation of performing a logical decision; and feeding back the decision result of the decision processing to a scalar engine in the chip.
In one embodiment, the computer program when executed by the processor further performs the steps of:
if a decision exit condition is met, stopping the execution of obtaining a plurality of decision operations of the same type, merging the plurality of decision operations of the same type to obtain a target decision operation, and performing decision processing on the target decision operation; wherein the decision exit condition includes at least one of: the computing power of the vector engine is overloaded; the vector engine has a calculation error.
Those skilled in the art will appreciate that implementing all or part of the methods described above may be accomplished by means of a computer program stored on a non-transitory computer-readable storage medium which, when executed, may include the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric memory (Ferroelectric Random Access Memory, FRAM), phase change memory (Phase Change Memory, PCM), graphene memory, and the like. The volatile memory may include random access memory (Random Access Memory, RAM), external cache memory, and the like. By way of illustration and not limitation, the RAM may take a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum-computing-based data processing logic units, and the like, without being limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in the combinations of these technical features, they should be considered to be within the scope of this specification.
The above examples represent only a few embodiments of the present application, which are described in detail but are not therefore to be construed as limiting the scope of the present application. It should be noted that those of ordinary skill in the art may make various modifications and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.
Claims (10)
1. A data processing method, applied to a vector engine in a chip, the method comprising:
receiving a channel configuration instruction, and configuring the number of instruction channels of the vector engine to be a first number according to the channel configuration instruction;
configuring the number of parallel instructions in each instruction channel to be a second number;
and executing vector operation processing on target data based on the first number of instruction channels and the second number of parallel instructions in each instruction channel, wherein the target data is operation data corresponding to an artificial neural network currently processed by the chip.
2. The method of claim 1, wherein receiving the channel configuration instruction comprises:
and receiving the channel configuration instruction sent by the target engine in the chip, wherein the target engine comprises a scalar engine or an intelligent engine, and the channel configuration instruction is determined by the target engine according to the current operation calculation requirement of the artificial neural network.
3. The method of claim 1, wherein said configuring the number of parallel instructions within each of said instruction channels to a second number comprises:
determining the second number based on overhead requirements of the vector engine;
and if the second number is inconsistent with the initial parallel instruction number which is pre-configured, adjusting the number of parallel instructions in each instruction channel from the initial parallel instruction number to the second number.
4. The method according to claim 1, wherein the method further comprises:
if a configuration change instruction sent by a scalar engine in the chip is received, carrying out configuration change processing based on the configuration change instruction, wherein the configuration change processing comprises: configuring, according to the first number and the second number, the number of instruction channels of the vector engine to a third number, and configuring the number of parallel instructions in each instruction channel to a fourth number;
Wherein the third number is smaller than the first number, and vector computing capabilities of the vector engine are equal before and after performing the configuration change process.
5. The method according to claim 1, wherein the method further comprises:
acquiring a plurality of decision operations of the same type, merging the plurality of decision operations of the same type to obtain a target decision operation, and performing decision processing on the target decision operation; the decision operation comprises an operation of performing a logical decision;
and feeding back the decision result of the decision processing to a scalar engine in the chip.
6. The method of claim 5, wherein the method further comprises:
if a decision exit condition is met, stopping the execution of acquiring a plurality of decision operations of the same type, merging the plurality of decision operations of the same type to obtain a target decision operation, and performing decision processing on the target decision operation;
wherein the decision exit condition includes at least one of:
the computing power of the vector engine is overloaded;
the vector engine has a calculation error.
7. A data processing apparatus for use with a vector engine in a chip, the apparatus comprising:
The receiving module is used for receiving a channel configuration instruction, and configuring the number of instruction channels of the vector engine to be a first number according to the channel configuration instruction;
the configuration module is used for configuring the number of parallel instructions in each instruction channel to be a second number;
the operation module is used for executing vector operation processing on target data based on the first number of instruction channels and the second number of parallel instructions in each instruction channel, wherein the target data is operation data corresponding to an artificial neural network currently processed by the chip.
8. A chip, comprising a scalar engine, a vector engine and an intelligent engine; the vector engine being adapted to implement the steps of the method of any of claims 1 to 6.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310323014.6A CN116484909A (en) | 2023-03-29 | 2023-03-29 | Vector engine processing method and device for artificial intelligent chip |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116484909A true CN116484909A (en) | 2023-07-25 |
Family
ID=87218599
Family Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---|
CN202310323014.6A Pending CN116484909A (en) | 2023-03-29 | 2023-03-29 | Vector engine processing method and device for artificial intelligent chip |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116484909A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118092856A (en) * | 2024-04-22 | 2024-05-28 | 北京壁仞科技开发有限公司 | Method, computing device, medium and program product for implementing random rounding |
CN118585249A (en) * | 2024-08-01 | 2024-09-03 | 苏州亿铸智能科技有限公司 | Attention operation processing method and device |
CN118585249B (en) * | 2024-08-01 | 2024-10-08 | 苏州亿铸智能科技有限公司 | Attention operation processing method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11442786B2 (en) | Computation method and product thereof | |
Flamand et al. | GAP-8: A RISC-V SoC for AI at the Edge of the IoT | |
Moini et al. | A resource-limited hardware accelerator for convolutional neural networks in embedded vision applications | |
CN116484909A (en) | Vector engine processing method and device for artificial intelligent chip | |
EP3005139B1 (en) | Incorporating a spatial array into one or more programmable processor cores | |
Kästner et al. | Hardware/software codesign for convolutional neural networks exploiting dynamic partial reconfiguration on PYNQ | |
Wu et al. | A flexible and efficient FPGA accelerator for various large-scale and lightweight CNNs | |
CN110674936A (en) | Neural network processing method and device, computer equipment and storage medium | |
US10713059B2 (en) | Heterogeneous graphics processing unit for scheduling thread groups for execution on variable width SIMD units | |
EP3388940B1 (en) | Parallel computing architecture for use with a non-greedy scheduling algorithm | |
Yu et al. | A data-center FPGA acceleration platform for convolutional neural networks | |
US20180060034A1 (en) | Communication between dataflow processing units and memories | |
Huang et al. | IECA: An in-execution configuration CNN accelerator with 30.55 GOPS/mm² area efficiency | |
CN118035618B (en) | Data processor, data processing method, electronic device, and storage medium | |
CN117112165A (en) | Virtual reality application task processing method and device and virtual reality system | |
CN116402091A (en) | Hybrid engine intelligent computing method and device for artificial intelligent chip | |
CN114595813B (en) | Heterogeneous acceleration processor and data computing method | |
CN116468078A (en) | Intelligent engine processing method and device for artificial intelligent chip | |
CN113407238A (en) | Many-core architecture with heterogeneous processors and data processing method thereof | |
Vink et al. | Caffe barista: Brewing caffe with fpgas in the training loop | |
CN115904681A (en) | Task scheduling method and device and related products | |
CN116400926A (en) | Scalar engine processing method and device oriented to artificial intelligent chip | |
WO2020051918A1 (en) | Neuronal circuit, chip, system and method therefor, and storage medium | |
Wan et al. | ADS-CNN: Adaptive Dataflow Scheduling for lightweight CNN accelerator on FPGAs | |
CN113469328B (en) | Device, board, method and readable storage medium for executing revolution passing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||