CN115408309A - Cache management method of AI (Artificial Intelligence) processor and AI processor applying cache management method

Info

Publication number
CN115408309A
Authority
CN
China
Prior art keywords
data
storage block
sequence control
processor
read
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211038987.7A
Other languages
Chinese (zh)
Inventor
曹玉龙
景博
孙康睿
李加敏
韩平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced Institute of Information Technology AIIT of Peking University
Hangzhou Weiming Information Technology Co Ltd
Original Assignee
Advanced Institute of Information Technology AIIT of Peking University
Hangzhou Weiming Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced Institute of Information Technology AIIT of Peking University, Hangzhou Weiming Information Technology Co Ltd filed Critical Advanced Institute of Information Technology AIIT of Peking University
Priority to CN202211038987.7A
Publication of CN115408309A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present disclosure relates to the fields of artificial intelligence and chip technology, and more particularly, to a cache management method for an AI processor and an AI processor using the same. The AI processor is configured with a unified cache module that comprises a plurality of write sequence control modules, a plurality of storage blocks, and a plurality of read sequence control modules, wherein the storage blocks are static random access memories. The cache management method comprises the following steps: receiving data to be stored and calling one or more write sequence control modules; the write sequence control module acquires a physical address corresponding to a storage block allocated by the unified cache module and writes the data to be stored into the allocated storage block according to the physical address; and when the data written into the allocated storage block needs to be read, reading it through one or more read sequence control modules. The method and the device make full use of storage resources, balance their usage, and thereby avoid wasting them.

Description

Cache management method of AI (Artificial Intelligence) processor and AI processor applying cache management method
Technical Field
The present disclosure relates to the field of artificial intelligence and chip technologies, and more particularly, to a cache management method for an AI processor and an AI processor using the same.
Background
With the continuing development of artificial intelligence algorithms and applications, and the gradual maturing of the industrial ecosystem for dedicated artificial intelligence chips, fully customized artificial intelligence ASICs are increasingly showing their advantages. An ASIC (Application Specific Integrated Circuit) is an integrated circuit designed and manufactured to the requirements of a specific user and a specific electronic system. Designing ASICs with Complex Programmable Logic Devices (CPLDs) and Field Programmable Gate Arrays (FPGAs) is currently one of the most popular approaches: it offers in-field programmability by the user, supports boundary scan technology, and has advantages in performance-to-power ratio, reliability, and integration level.
However, the data distribution characteristics of convolutional neural networks pose new challenges for the prior art: across the network as a whole, feature maps show a decreasing data volume while convolution kernel weights show an increasing data volume, so a distributed memory cannot make full use of its storage resources, which leads to wasted or unevenly used storage resources.
Disclosure of Invention
To address the above technical problem, the present invention provides an AI processor configured with a unified cache module, where the unified cache module is provided with a plurality of write sequence control modules, a plurality of storage blocks, and a plurality of read sequence control modules, and data to be stored is written into the corresponding storage blocks through the write sequence control modules.
A first aspect of the present invention provides a cache management method for an AI processor, where the AI processor is configured with a unified cache module, and the unified cache module includes a plurality of write sequence control modules, a plurality of storage blocks, and a plurality of read sequence control modules, the storage blocks being static random access memories. The cache management method comprises the following steps: receiving data to be stored and calling one or more write sequence control modules; the write sequence control module acquires a physical address corresponding to a storage block allocated by the unified cache module, and writes the data to be stored into the allocated storage block according to the physical address; and when the data written into the storage block needs to be read, reading it through one or more read sequence control modules.
In some embodiments of the present invention, the AI processor further comprises a control module, and the unified cache module further comprises a user number manager. The step in which the write sequence control module obtains a physical address corresponding to a storage block allocated by the unified cache module comprises: the write sequence control module applies to the user number manager and obtains a virtual user number, an allocated storage block number, and the physical address corresponding to the allocated storage block; and the write sequence control module sends the obtained virtual user number to the control module.
In some embodiments of the present invention, the unified cache module further includes a storage block manager, and writing the data to be stored into the allocated storage block according to the physical address includes: taking the block identified by the allocated storage block number as a first storage block, the first storage block being any one of the plurality of storage blocks; writing the data to be stored into the first storage block according to the physical address of the first storage block; if the first storage block becomes full, applying to the storage block manager and obtaining a new idle storage block; and writing the remaining data, beyond what was written into the first storage block, into the new idle storage block.
In some embodiments of the invention, the AI processor further comprises a computing core comprising a plurality of computing core modules, and a plurality of interfaces are provided between the plurality of computing core modules and the plurality of read sequence control modules, so that data read by the plurality of read sequence control modules is transmitted to the plurality of computing core modules through the plurality of interfaces for computation.
In some embodiments of the present invention, reading the data written into the storage block comprises: the control module sends the virtual user number corresponding to the data to be read to the read sequence control module; the read sequence control module acquires a physical address of a corresponding storage block according to the virtual user number, a preset logical address and an address mapping table; and the read sequence control module reads the data to be read according to the acquired physical address.
In some embodiments of the present invention, after the read sequence control module reads the data to be read according to the obtained physical address, the method further includes: if the data to be read exceeds one storage block, acquiring the physical address of the next storage block when the data in the first storage block is read to a preset byte; and finishing the reading of the residual data in the data to be read according to the acquired physical address of the next storage block.
In some embodiments of the present invention, the unified cache module further comprises a balance manager, and the balance manager is used for balance management of the storage block resources in the unified cache module.
In some embodiments of the present invention, if the number of idle storage blocks in the computing process of the plurality of computing core modules is less than the preset number, the balance manager is started to move the data that has been stored in the unified cache module to an external memory; and if the number of the idle storage blocks exceeds the preset number, retrieving the data from the external memory to the unified cache module, wherein the external memory is a synchronous dynamic random access memory.
In some embodiments of the invention, the unified cache module further comprises a first crossbar and a second crossbar; the first cross switch is arranged between the writing sequence control module and the storage block and used for cross processing of data from the writing sequence control module to the storage block; the second cross switch is arranged between the plurality of read sequence control modules and the storage block and used for cross processing of data from the storage block to the read sequence control modules.
A second aspect of the present invention provides an AI processor to which the cache management method of the AI processor in the embodiments of the present invention is applied.
A third aspect of the present invention provides a computer device comprising a memory and an AI processor, the memory having stored therein computer readable instructions which, when executed by the AI processor, cause the AI processor to perform the steps of:
receiving data to be stored, and calling one or more writing sequence control modules;
the write sequence control module acquires a physical address corresponding to a storage block allocated by the unified cache module, and writes the data to be stored into the allocated storage block according to the physical address;
and if the data written into the storage block is read, reading the data through one or more read sequence control modules.
The technical scheme provided in the embodiment of the application has at least the following technical effects or advantages:
according to the method, the AI processor is provided with the uniform cache module, the uniform cache module is provided with the write sequence control modules, the storage blocks and the read sequence control modules, the write sequence control modules receive data to be stored firstly and call one or more write sequence control modules, the write sequence control modules acquire physical addresses corresponding to the storage blocks distributed by the uniform cache module and write the data to be stored into the distributed storage blocks according to the physical addresses, if the data written into the storage blocks are read, the data are read through one or more read sequence control modules, through the cache management method, storage resources can be fully utilized, balance of the storage resources is achieved, waste of the storage resources is avoided, particularly in application of a neural network, reasonable resource distribution can be achieved no matter whether the cache convolution kernel weight or the cache characteristic diagram, and further the cache management efficiency is improved powerfully. Moreover, the AI processor is provided with the computing core and the unified cache module, so that storage resources can be fully utilized, a higher data hit rate is provided for the computing core, the computing efficiency is improved, data movement is reduced, and further the power consumption is reduced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
Various additional advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic diagram illustrating steps of a cache management method of an AI processor in an exemplary embodiment of the present application;
FIG. 2 shows a schematic diagram of an AI processor of the prior art;
FIG. 3 is a diagram illustrating a unified cache module in an exemplary embodiment of the present application;
fig. 4 shows a schematic structural diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
Hereinafter, embodiments of the present application will be described with reference to the accompanying drawings. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present application. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present application. It will be apparent to one skilled in the art that the present application may be practiced without one or more of these details. In other instances, well-known features have not been described in order to avoid obscuring the present application.
It should be noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the application. As used herein, the singular is intended to include the plural unless the context clearly dictates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Exemplary embodiments according to the present application will now be described in more detail with reference to the accompanying drawings. These exemplary embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. The figures are not drawn to scale, wherein certain details may be exaggerated and omitted for clarity. The shapes of various regions, layers, and relative sizes and positional relationships therebetween shown in the drawings are merely exemplary, and deviations may occur in practice due to manufacturing tolerances or technical limitations, and a person skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions, as actually required.
Several examples are given below in conjunction with the description of figures 1-4 to describe exemplary embodiments according to the present application. It should be noted that the following application scenarios are merely illustrated for facilitating understanding of the spirit and principles of the present application, and the embodiments of the present application are not limited in any way in this respect. Rather, embodiments of the present application may be applied to any scenario where applicable.
Central processing units (CPUs) began to emerge and be used in the computer industry in the early 1960s. Today, although CPUs vary greatly in both design and implementation, the basic operating principle of CPUs based on the von Neumann architecture has not changed significantly: every time the CPU executes an instruction it must read data from memory and operate on that data according to the instruction, so the CPU is responsible not only for data computation but also for memory access, instruction decoding, branch jumps, and similar tasks. In the field of artificial intelligence deep learning, however, program instructions are relatively few while the demand for big-data computation is enormous, requiring massive amounts of data to be processed. When a CPU executes an AI algorithm it spends a large amount of time reading and parsing data or instructions, and under a fixed power budget the instruction execution speed cannot be raised indefinitely by endlessly increasing CPU frequency and memory bandwidth. In this situation the traditional CPU architecture shows significant shortcomings, and the computational bottleneck in the field of artificial intelligence chips is difficult to resolve.
At present, the artificial intelligence computing demands represented by deep learning are mainly accelerated with existing general-purpose chips suited to parallel computing, such as GPUs (graphics processing units) and FPGAs (field programmable gate arrays). However, because such general-purpose chips were not designed specifically for deep learning, they naturally have limitations in performance, power consumption, and so on once deep learning algorithms are involved. These problems become increasingly prominent as artificial intelligence applications scale up.
As image processors, GPUs were designed primarily to handle the massively parallel computation found in image processing, so their parallel-computing advantage cannot be fully exploited when they are applied to deep learning algorithms. Deep learning involves two computational phases, training and inference: GPUs are very efficient for training deep learning algorithms, but their parallelism cannot be fully exploited when inference is performed on a single input. Furthermore, GPUs adopt a SIMT computing model with a relatively fixed hardware structure, which cannot be flexibly configured.
With the development of artificial intelligence technology, fully customized artificial intelligence ASICs are increasingly showing their advantages. An ASIC (Application Specific Integrated Circuit) is an integrated circuit designed and manufactured to the requirements of a specific user and a specific electronic system. Designing ASICs with Complex Programmable Logic Devices (CPLDs) and Field Programmable Gate Arrays (FPGAs) is currently one of the most popular approaches: it offers in-field programmability by the user, supports boundary scan technology, and has advantages in performance-to-power ratio, reliability, and integration level.
However, the data distribution characteristics of convolutional neural networks pose new challenges for the prior art: across the network as a whole, feature maps show a decreasing data volume while convolution kernel weights show an increasing data volume, so a distributed memory cannot make full use of its storage resources, which leads to wasted or unevenly used storage resources.
Therefore, in an exemplary embodiment of the present application, a cache management method for an AI processor is provided, where the AI processor is configured with a unified cache module, and the unified cache module includes a plurality of write sequence control modules, a plurality of storage blocks, and a plurality of read sequence control modules, where the plurality of storage blocks are static random access memories; as shown in fig. 1, the cache management method includes:
s1, receiving data to be stored, and calling one or more writing sequence control modules;
s2, the write sequence control module acquires a physical address corresponding to a storage block allocated by the unified cache module, and writes the data to be stored into the allocated storage block according to the physical address;
and S3, if the data written into the storage block is read, reading through one or more read sequence control modules.
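To make the three steps concrete, the following is a minimal Python sketch of the S1-S3 flow under simplified assumptions (a tiny bank depth and a dict-backed pool of SRAM banks); all class and function names are illustrative and are not taken from the patent.

```python
# Minimal sketch of steps S1-S3. Names (UnifiedCache, write_stream, read_stream)
# and sizes are assumptions for illustration only.

BANK_DEPTH = 4  # words per storage block (tiny, for demonstration)

class UnifiedCache:
    def __init__(self, num_banks=8):
        self.banks = {i: [None] * BANK_DEPTH for i in range(num_banks)}
        self.free = list(range(num_banks))

    def allocate_bank(self):
        return self.free.pop(0)           # the bank number stands in for a physical address

def write_stream(cache, words):
    """S1/S2: a write sequence controller obtains allocated banks and fills them."""
    chain = []
    bank = cache.allocate_bank()
    chain.append(bank)
    for i, w in enumerate(words):
        if i and i % BANK_DEPTH == 0:     # current bank full: apply for a new one
            bank = cache.allocate_bank()
            chain.append(bank)
        cache.banks[bank][i % BANK_DEPTH] = w
    return chain                          # the virtual user's chain of bank numbers

def read_stream(cache, chain, length):
    """S3: a read sequence controller walks the chain to return the stored data."""
    return [cache.banks[chain[i // BANK_DEPTH]][i % BANK_DEPTH] for i in range(length)]

cache = UnifiedCache()
chain = write_stream(cache, list(range(10)))
assert read_stream(cache, chain, 10) == list(range(10))
```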
Referring to fig. 2, a prior-art AI processor includes a control module, a computation core, and a cache module. When a neural network algorithm is run, convolution kernel weights and the feature maps produced by convolution must be stored promptly; data is transferred between the computation core and the cache module, the computation core performs the corresponding computation, and the computation results, namely partial-sum (accumulated) data, are stored back in the cache module. The AI processor also interacts with external memory, i.e., data is migrated back and forth between them.
In a specific implementation, a computation core and a control module are configured in the AI processor and, referring to fig. 3, a user number manager is configured in the unified cache module. The step in which the write sequence control module obtains the physical address corresponding to a storage block allocated by the unified cache module comprises: the write sequence control module applies to the user number manager and obtains a virtual user number, an allocated storage block number, and the physical address corresponding to the allocated storage block; and the write sequence control module sends the obtained virtual user number to the control module. The user number manager manages the virtual user numbers, including allocating and recycling them and storing each chain head (a physical address). Each virtual user has a chain, and each chain node is a specific storage block number.
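The following is a minimal Python sketch of such a user number manager, under the assumption that it only needs to hand out and recycle virtual user numbers and remember each user's chain head; the names and capacities are illustrative, not the patent's.

```python
# Sketch of a user number manager: allocation, recycling, and chain-head storage.

class UserNumberManager:
    def __init__(self, max_users=16):
        self.free_ids = list(range(max_users))   # pool of virtual user numbers
        self.chain_head = {}                     # user_id -> first storage block number

    def allocate(self, first_block):
        """Called when a write sequence controller applies for a new virtual user."""
        if not self.free_ids:
            return None                          # caller must stall the data stream
        user_id = self.free_ids.pop(0)
        self.chain_head[user_id] = first_block
        return user_id

    def head_of(self, user_id):
        """Called on the read path to find where a user's chain starts."""
        return self.chain_head[user_id]

    def release(self, user_id):
        """Recycle the virtual user number once its data is no longer needed."""
        del self.chain_head[user_id]
        self.free_ids.append(user_id)

mgr = UserNumberManager()
uid = mgr.allocate(first_block=3)
assert mgr.head_of(uid) == 3
mgr.release(uid)
```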
In a specific implementation, the unified cache module further includes a storage block manager, which manages the storage blocks, including allocation, recycling, chaining, and look-up. Writing the data to be stored into the allocated storage block according to the physical address includes: taking the block identified by the allocated storage block number as a first storage block, the first storage block being any one of the plurality of storage blocks; writing the data to be stored into the first storage block according to the physical address of the first storage block; if the first storage block becomes full, applying to the storage block manager and obtaining a new idle storage block; and writing the remaining data, beyond what was written into the first storage block, into the new idle storage block. To prevent the data stream from stalling, a small FIFO may be placed in the write sequence control module to buffer data while the virtual user number is being applied for, generally for a few cycles; if no virtual user number can be obtained, the user is notified to stop sending data. If the write sequence control module has filled one storage block, it applies to the storage block manager for a new one; the storage block manager checks whether an idle storage block exists and, if so, sends its block number to the write sequence control module and hangs the new block number on the chain as a new node, and the write sequence control module continues writing data to the new block (an actual static random access memory) until all the data, such as a set of convolution kernel weights, has been written.
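As a rough illustration of the storage block manager and the chained write path described above, the sketch below uses a small deque as the stand-in FIFO and a tiny bank depth; all names, sizes, and interfaces are assumptions made for the example.

```python
from collections import deque

# Sketch of the storage block manager (allocate / recycle / chain / look-up)
# and a write sequence controller that fills chained blocks.

BANK_DEPTH = 4

class StorageBlockManager:
    def __init__(self, num_banks=8):
        self.free = deque(range(num_banks))
        self.chains = {}                      # user_id -> ordered list of bank numbers

    def allocate(self, user_id):
        if not self.free:
            return None                       # no idle block: caller must stall or rebalance
        bank = self.free.popleft()
        self.chains.setdefault(user_id, []).append(bank)   # hang the block on the chain
        return bank

    def chain(self, user_id):
        return self.chains[user_id]

class WriteSequenceController:
    def __init__(self, mgr, user_id, banks):
        self.mgr, self.user_id, self.banks = mgr, user_id, banks
        self.fifo = deque()                   # small elastic buffer for incoming words
        self.bank = mgr.allocate(user_id)
        self.offset = 0

    def push(self, word):
        self.fifo.append(word)
        while self.fifo:
            if self.offset == BANK_DEPTH:     # current block full: apply for a new one
                self.bank = self.mgr.allocate(self.user_id)
                self.offset = 0
            self.banks[self.bank][self.offset] = self.fifo.popleft()
            self.offset += 1

banks = {i: [None] * BANK_DEPTH for i in range(8)}
mgr = StorageBlockManager()
wsc = WriteSequenceController(mgr, user_id=0, banks=banks)
for w in range(6):                            # e.g. a convolution-kernel weight stream
    wsc.push(w)
print(mgr.chain(0))                           # two chained blocks: [0, 1]
```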
In some embodiments of the present application, the AI processor further includes a computing core, the computing core includes a plurality of computing core modules, and a plurality of interfaces are disposed between the plurality of computing core modules and the plurality of read sequence control modules, so that data read by the plurality of read sequence control modules is transmitted to the plurality of computing core modules through the plurality of interfaces for computation. For example, a convolution kernel weight access interface in the computation core module is connected to the read sequence control module 0, a feature map access interface is connected to the read sequence control module 1, and a partial and accumulated read interface is connected to the read sequence control module 2. Similarly, a plurality of interfaces are arranged between the plurality of computing core modules and the plurality of writing sequence control modules, so that data needing to be written by the computing core modules are written into the storage block through the interfaces and the reading sequence control modules. The AI processor is provided with the computing core and the unified cache module, so that storage resources can be fully utilized, a higher data hit rate is provided for the computing core, the computing efficiency is enhanced, data movement is reduced, and power consumption is reduced. It is understood that the convertible implementation can be further extended to the field of multi-core AI processors to provide a unified shared cache for multiple AI cores.
In a specific implementation, reading the data written into a storage block includes: the control module sends the virtual user number corresponding to the data to be read to the read sequence control module; the read sequence control module queries the user number manager with that virtual user number; the physical address of the corresponding storage block is obtained according to the virtual user number, a preset logical address, and an address mapping table; and the read sequence control module reads the data to be read according to the obtained physical address. In practice, the user issues an operation instruction that contains the logical address of the data, and the physical address can be found by combining it with the address mapping table held by the read sequence control module. More specifically, the mapping entries of the address mapping table are chained in the order in which data was written into the unified cache module, and the most recently written entry is called the chain head. The chain head is requested from the user number manager, the storage block manager looks up the corresponding physical address according to the logical address and the chain head, and that physical address is sent to the read sequence control module, so the read sequence control module obtains the physical address of the storage block and can read the data to be read from the corresponding storage block. If the data to be read spans more than one storage block, the physical address of the next storage block is obtained when the data in the first storage block has been read up to a preset byte, and the remaining data is then read according to the physical address of that next storage block. Taking the convolution algorithm as an example, the convolution kernel weight access interface of the computing core module is connected to read sequence control module 0, the feature map access interface to read sequence control module 1, and the partial-sum accumulation read interface to read sequence control module 2. Read sequence control module 0 uses user ID 0 to ask the user number manager which actual storage block holds that user's chain head, and reads data from that block. During the read, the actual physical address is looked up from the logical address supplied by the user. Assuming each storage block has a depth of 1024, when the read address exceeds 1023 the read sequence control module sends a request to the storage block manager for the next storage block number; the storage block manager returns the correct block number according to the chain node information, and the read sequence control module forms the actual physical address from that block number combined with the low-order address (0 to 1023). The computing core module starts the convolution computation after receiving the corresponding convolution kernel weights and feature map data. Because convolution requires accumulation over multiple channels, partial-sum data may be generated, which must be buffered first and accumulated later.
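The bank-depth-of-1024 addressing described above can be illustrated with a short sketch: the high bits of the logical address select a node in the user's block chain and the low 10 bits form the offset within the selected SRAM bank. The function and variable names are illustrative only.

```python
# Sketch of logical-to-physical translation on the read path, assuming each
# SRAM bank holds 1024 words as in the example above.

BANK_DEPTH = 1024

def physical_address(chain, logical_addr):
    """chain: ordered bank numbers for one virtual user (chain head first)."""
    node = logical_addr // BANK_DEPTH      # which chained block
    offset = logical_addr % BANK_DEPTH     # low-order address 0..1023
    bank = chain[node]                     # storage block manager's chain look-up
    return bank, offset

# Example: a user whose data spans three chained banks (numbers 5, 2, 7).
chain = [5, 2, 7]
assert physical_address(chain, 0)    == (5, 0)
assert physical_address(chain, 1023) == (5, 1023)
assert physical_address(chain, 1024) == (2, 0)     # crossing into the next block
assert physical_address(chain, 2500) == (7, 452)
```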
At this point, write sequence control module 1 applies for a new virtual user number; assuming the number obtained is 2, the control module sends this virtual user number, which corresponds to the partial-sum data, to read sequence control module 2. When new accumulation data is generated later, read sequence control module 2 uses virtual user number 2 to ask the user number manager which storage block holds that user's chain head, reads the existing partial sums from that block, accumulates them with the newly generated data, and writes the final result back into the corresponding storage block through write sequence control module 1. Because the data bit width of partial-sum data is often several times that of feature map data, as an alternative implementation several storage blocks (SRAM banks) can be bundled together: when writing, the write sequence control module applies for several storage blocks in succession and organizes them into one group for a unified write operation, and when reading, the read sequence control module likewise queries several storage blocks in succession and treats them as one group for a unified read operation.
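A minimal sketch of the bank-bundling idea follows, assuming a wide partial-sum word is simply striped across a group of banks allocated together; the group size and names are illustrative assumptions.

```python
# Sketch of bank bundling for wide partial-sum words: a group of banks is used
# as one logical storage block, and each wide word is split across the group
# at the same offset.

BANK_DEPTH = 8
GROUP = 4                                   # e.g. partial sums 4x wider than feature data

def write_wide(banks, group, offset, wide_word):
    """wide_word is a tuple of GROUP narrow words, one per bundled bank."""
    for bank, part in zip(group, wide_word):
        banks[bank][offset] = part

def read_wide(banks, group, offset):
    return tuple(banks[bank][offset] for bank in group)

banks = {i: [None] * BANK_DEPTH for i in range(8)}
group = [0, 1, 2, 3]                        # blocks applied for together and used as one
write_wide(banks, group, offset=5, wide_word=(10, 11, 12, 13))
assert read_wide(banks, group, offset=5) == (10, 11, 12, 13)
```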
In some embodiments of the present application, the unified cache module further includes a balancing manager, and the balancing manager is used for balancing management of storage block resources in the unified cache module. In some embodiments of the present application, if the number of idle storage blocks in the computing process of the plurality of computing core modules is less than the preset number, the balance manager is started to move the data stored in the unified cache module to an external memory; and if the number of the idle storage blocks exceeds the preset number, retrieving the data from the external memory to the unified cache module, wherein the external memory is a synchronous dynamic random access memory.
In a specific embodiment, if during the computation of a computing core module the number of idle storage blocks is found to have fallen below a certain watermark (say 4; the watermark may be configured by the controller), the balance manager starts automatically, selects one of the currently allocated storage blocks for replacement, and moves the data in the selected block out to an external memory such as a synchronous dynamic random access memory (SDRAM). The SRAM bank to be replaced can be chosen by any suitable algorithm; since convolution generally proceeds from front to back, a simple policy used here as an example is to select the chain tail of the virtual user with the most chain nodes. The balance manager reads the data out and transfers it to the SDRAM, the storage block manager marks the storage block as idle, and an idle flag is placed on the chain node to indicate that the block's data now resides in the external SDRAM. If new data is written for that virtual user, it can continue to be chained onto the original chain. If, when reading data, the computing core module finds that a given SRAM bank is marked as idle, the read sequence control module sends a request to the balance manager to retrieve the block's data. After receiving the request, the balance manager applies to the storage block manager for an SRAM bank number, locates the block's data in the SDRAM according to the virtual user number (user id) and the high-order bits of the logical address, and fetches the data from that location into the new SRAM bank. As an alternative embodiment, a threshold on the number of idle storage blocks is set by the control module, and the retrieval process is initiated when the number of idle SRAM banks rises above that threshold. The SRAM bank to retrieve can also be chosen by an algorithm, for example on a first-in first-out basis, retrieving first the data block that was replaced first, or retrieving the data block with the smaller address first.
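The spill and refill behaviour of the balance manager can be sketched as follows, assuming a dict stands in for the external SDRAM, the victim is the chain tail of the virtual user with the longest chain (as in the simple policy above), and the watermark values are illustrative.

```python
from collections import deque

# Sketch of the balance manager: spill to "SDRAM" when free banks run low,
# refill when enough banks are free again. All values and names are assumptions.

BANK_DEPTH = 4
LOW_WATERMARK = 2      # spill when fewer free banks than this
HIGH_WATERMARK = 4     # refill when at least this many banks are free

class BalanceManager:
    def __init__(self, banks, free, chains):
        self.banks, self.free, self.chains = banks, free, chains
        self.sdram = {}                       # (user_id, node_index) -> spilled data
        self.spilled = deque()                # FIFO of spilled (user_id, node_index)

    def maybe_spill(self):
        if len(self.free) >= LOW_WATERMARK:
            return
        user = max(self.chains, key=lambda u: len(self.chains[u]))   # longest chain
        node = len(self.chains[user]) - 1                            # its chain tail
        bank = self.chains[user][node]
        self.sdram[(user, node)] = list(self.banks[bank])            # move data out
        self.chains[user][node] = None                               # mark: data in SDRAM
        self.free.append(bank)
        self.spilled.append((user, node))

    def maybe_refill(self):
        if len(self.free) < HIGH_WATERMARK or not self.spilled:
            return
        user, node = self.spilled.popleft()                          # first-out comes back first
        bank = self.free.pop(0)
        self.banks[bank] = self.sdram.pop((user, node))
        self.chains[user][node] = bank

banks = {i: list(range(i, i + BANK_DEPTH)) for i in range(8)}
free, chains = [5], {0: [0, 1, 2], 1: [3, 4]}
bm = BalanceManager(banks, free, chains)
bm.maybe_spill()                              # only 1 free bank: spill user 0's chain tail
assert chains[0] == [0, 1, None] and 2 in free
free += [6, 7]                                # later, other banks become idle
bm.maybe_refill()                             # enough free banks: bring the block back
assert chains[0][2] is not None
```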
In some embodiments of the present application, as shown in fig. 3, the unified cache module further comprises a first crossbar (crossbar 0 in fig. 3) and a second crossbar (crossbar 1 in fig. 3). The first crossbar is arranged between the write sequence control modules and the storage blocks and switches data from the write sequence control modules to the storage blocks; the second crossbar is arranged between the read sequence control modules and the storage blocks and switches data from the storage blocks to the read sequence control modules.
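As a rough functional picture of what each crossbar does (not of its hardware structure), the sketch below arbitrates a set of requester-to-bank requests so that at most one requester reaches each bank per cycle; the names and the fixed-priority arbitration policy are assumptions.

```python
# Sketch of crossbar arbitration between sequence controllers and SRAM banks.

def crossbar_arbitrate(requests):
    """requests: list of (requester_id, bank). Returns granted and deferred lists."""
    granted, busy, deferred = [], set(), []
    for req_id, bank in requests:             # simple fixed-priority arbitration
        if bank in busy:
            deferred.append((req_id, bank))   # bank already taken this cycle
        else:
            busy.add(bank)
            granted.append((req_id, bank))
    return granted, deferred

granted, deferred = crossbar_arbitrate([(0, 3), (1, 3), (2, 7)])
assert granted == [(0, 3), (2, 7)] and deferred == [(1, 3)]
```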
The present embodiment provides an AI processor configured to execute the cache management method provided in any of the above embodiments. The AI processor is configured with a unified cache module, and the unified cache module is configured with a plurality of write sequence control modules, a plurality of storage blocks, and a plurality of read sequence control modules. When the AI processor performs cache management, in particular when a neural network algorithm is run, it first receives the data to be stored, which includes at least convolution kernel weights and feature maps, and calls one or more write sequence control modules; the write sequence control modules acquire the physical addresses corresponding to the storage blocks allocated by the unified cache module and write the data to be stored into the allocated storage blocks according to those addresses; and when the data written into the storage blocks is to be read, it is read through one or more read sequence control modules. With this AI processor and its cache management method, storage resources can be fully utilized and their use balanced, so that waste of storage resources is avoided; in neural network applications in particular, resources can be allocated reasonably whether convolution kernel weights or feature maps are being cached, which greatly improves the efficiency of cache management.
It is further emphasized that the system provided in the embodiments of the present application may acquire and process the relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. Artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
Reference is now made to fig. 4, which is a schematic diagram illustrating a computer device provided in some embodiments of the present application. As shown in fig. 4, the computer device 2 includes: an AI processor 200, a memory 201, a bus 202 and a communication interface 203, the AI processor 200, the communication interface 203 and the memory 201 being connected by the bus 202; the memory 201 stores therein a computer program that is executable on the AI processor 200, and the AI processor 200 executes the cache management method of the AI processor according to any of the foregoing embodiments when executing the computer program.
The memory 201 may include a random access memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. The communication connection between this system's network element and at least one other network element is realized through at least one communication interface 203 (wired or wireless), and the Internet, a wide area network, a local area network, a metropolitan area network, or the like may be used.
Bus 202 can be an ISA bus, a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, etc. The memory 201 is configured to store a program, and the processor 200 executes the program after receiving an execution instruction; the cache management method of the AI processor disclosed in any embodiment of the present application may be applied to, or implemented by, the processor 200.
The processor 200 may be an integrated circuit chip with signal processing capability. During implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 200 or by instructions in the form of software. The processor 200 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or executed by it. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, register, or other storage medium well known in the art. The storage medium is located in the memory 201, and the processor 200 reads the information in the memory 201 and completes the steps of the method in combination with its hardware.
The present application further provides a computer-readable storage medium corresponding to the AI processor cache management method provided in the foregoing embodiments, and a computer program is stored thereon, and when being executed by a processor, the computer program will execute the AI processor cache management method provided in any of the foregoing embodiments.
In addition, examples of the computer-readable storage medium may also include, but are not limited to, a phase change memory (PRAM), a Static Random Access Memory (SRAM), a Dynamic Random Access Memory (DRAM), other types of Random Access Memories (RAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, or other optical and magnetic storage media, which are not described in detail herein.
The computer-readable storage medium provided by the foregoing embodiments of the present application shares the same inventive concept as the cache management method of the AI processor provided by the embodiments of the present application, and has the same beneficial effects as the methods adopted, run, or implemented by the application programs stored on it.
Embodiments of the present application further provide a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the steps of the cache management method for an AI processor provided in any of the foregoing embodiments.
It should be noted that: the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing an arrangement of this type will be apparent from the description above. In addition, this application is not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present application as described herein, and any descriptions of specific languages are provided above to disclose the best modes of the present application. In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components in the embodiments may be combined into one module or unit or component, and furthermore, may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification, and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except for at least some of such features and/or processes or elements being mutually exclusive. Each feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
The various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the creation apparatus of a virtual machine according to embodiments of the present application. The present application may also be embodied as an apparatus or device program for carrying out a portion or all of the methods described herein. A program implementing the present application may be stored on a computer-readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website, or provided on a carrier signal, or provided in any other form.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. The cache management method of the AI processor is characterized in that the AI processor is provided with a unified cache module, the unified cache module comprises a plurality of write sequence control modules, a plurality of storage blocks and a plurality of read sequence control modules, wherein the plurality of storage blocks are static random access memories; the cache management method comprises the following steps:
receiving data to be stored and calling one or more writing sequence control modules;
the write sequence control module acquires a physical address corresponding to a storage block allocated by the unified cache module, and writes the data to be stored into the allocated storage block according to the physical address;
and if the data written into the storage block is read, reading the data through one or more read sequence control modules.
2. The AI processor's cache management method of claim 1, wherein the AI processor further comprises a control module, the unified cache module further comprising a user number manager; the write sequence control module obtains a physical address corresponding to a storage block allocated by the uniform cache module, and the method comprises the following steps:
the writing sequence control module applies to the user number manager and obtains a virtual user number, an allocated storage block number and a physical address corresponding to the allocated storage block;
and the writing sequence control module sends the obtained virtual user number to the control module.
3. The AI processor cache management method according to claim 2, wherein the unified cache module further includes a memory block manager, and the writing the data to be stored into the allocated memory block according to the physical address includes:
taking the block identified by the allocated storage block number as a first storage block, wherein the first storage block is any one of the plurality of storage blocks;
writing the data to be stored into a first storage block according to the physical address of the first storage block;
if the first storage block is full, applying to the storage block manager and obtaining a new idle storage block;
and writing the data except the data written into the first storage block in the data to be stored into the new idle storage block.
4. The AI processor cache management method according to claim 2, wherein the AI processor further includes a computing core, the computing core includes a plurality of computing core modules, and a plurality of interfaces are provided between the plurality of computing core modules and the plurality of read sequence control modules, so that data read by the plurality of read sequence control modules is transmitted to the plurality of computing core modules through the plurality of interfaces for computation.
5. The AI processor cache management method according to claim 4, wherein the reading of the data written in the storage block includes:
the control module sends the virtual user number corresponding to the data to be read to the read sequence control module;
the read sequence control module acquires a physical address of a corresponding storage block according to the virtual user number, a preset logical address and an address mapping table;
and the reading sequence control module reads the data to be read according to the acquired physical address.
6. The AI processor cache management method of claim 5, further comprising, after the read sequence control module reads data to be read according to the retrieved physical address:
if the data to be read exceeds one storage block, acquiring the physical address of the next storage block when the data in the first storage block is read to a preset byte;
and finishing the reading of the residual data in the data to be read according to the acquired physical address of the next storage block.
7. The AI processor cache management method according to any of claims 4-6, wherein the unified cache module further includes a balancing manager for balanced management of storage block resources in the unified cache module; if the number of the idle storage blocks is less than the preset number in the calculation process of the plurality of calculation core modules, starting a balance manager to move the data stored in the unified cache module to an external memory; and if the number of the idle storage blocks exceeds the preset number, retrieving the data from the external memory to the unified cache module, wherein the external memory is a synchronous dynamic random access memory.
8. The AI processor's cache management method of claim 1, wherein the unified cache module further includes a first crossbar and a second crossbar; the first cross switch is arranged between the writing sequence control module and the storage block and used for cross processing of data from the writing sequence control module to the storage block; the second cross switch is arranged between the plurality of read sequence control modules and the storage block and used for cross processing of data from the storage block to the read sequence control modules.
9. AI processor, characterized in that it applies the method according to any one of claims 1 to 8.
10. A computer device comprising a memory and an AI processor, wherein the memory has stored therein computer-readable instructions which, when executed by the AI processor, cause the AI processor to carry out the method of any of claims 1-8.
CN202211038987.7A 2022-08-29 2022-08-29 Cache management method of AI (Artificial Intelligence) processor and AI processor applying cache management method Pending CN115408309A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211038987.7A CN115408309A (en) 2022-08-29 2022-08-29 Cache management method of AI (Artificial Intelligence) processor and AI processor applying cache management method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211038987.7A CN115408309A (en) 2022-08-29 2022-08-29 Cache management method of AI (Artificial Intelligence) processor and AI processor applying cache management method

Publications (1)

Publication Number Publication Date
CN115408309A 2022-11-29

Family

ID=84160837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211038987.7A Pending CN115408309A (en) 2022-08-29 2022-08-29 Cache management method of AI (Artificial Intelligence) processor and AI processor applying cache management method

Country Status (1)

Country Link
CN (1) CN115408309A (en)

Similar Documents

Title
Mittal et al. A survey of techniques for optimizing deep learning on GPUs
CN109522254B (en) Arithmetic device and method
KR102123633B1 (en) Matrix computing device and method
US11093225B2 (en) High parallelism computing system and instruction scheduling method thereof
WO2020073211A1 (en) Operation accelerator, processing method, and related device
KR102572757B1 (en) Modifying machine learning models to improve locality
JP7451614B2 (en) On-chip computational network
CN111160545A (en) Artificial neural network processing system and data processing method thereof
US9317456B2 (en) Method and system for performing event-matching with a graphical processing unit
JP7008983B2 (en) Methods and equipment for accessing tensor data
WO2020028183A1 (en) A storage-based graph for enabling computation graph optimization
US20210334234A1 (en) Distributed graphics processor unit architecture
CN106846235A (en) Convolution optimization method and system that a kind of utilization NVIDIA Kepler GPU assembly instructions accelerate
CN111028360B (en) Data reading and writing method and system in 3D image processing, storage medium and terminal
JP2022530873A (en) Machine learning model update for machine learning accelerators
EP3662376B1 (en) Reconfigurable cache architecture and methods for cache coherency
CN105393210A (en) Memory unit for emulated shared memory architectures
CN113065643A (en) Apparatus and method for performing multi-task convolutional neural network prediction
US11429299B2 (en) System and method for managing conversion of low-locality data into high-locality data
US8914779B2 (en) Data placement for execution of an executable
Nakano et al. The random address shift to reduce the memory access congestion on the discrete memory machine
CN115408309A (en) Cache management method of AI (Artificial Intelligence) processor and AI processor applying cache management method
CN112433773B (en) Configuration information recording method and device for reconfigurable processor
CN111340224B (en) Accelerated design method of CNN (computer network) suitable for low-resource embedded chip
US10620958B1 (en) Crossbar between clients and a cache

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination