CN113553292B - Vector processor and related data access method - Google Patents


Info

Publication number
CN113553292B
CN113553292B (application CN202110722536.4A)
Authority
CN
China
Prior art keywords
memory access
memory
request
vector
address
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110722536.4A
Other languages
Chinese (zh)
Other versions
CN113553292A (en)
Inventor
崔鲁平
Current Assignee
Ruisixinke Shenzhen Technology Co ltd
Original Assignee
Ruisixinke Shenzhen Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Ruisixinke Shenzhen Technology Co ltd filed Critical Ruisixinke Shenzhen Technology Co ltd
Priority to CN202110722536.4A priority Critical patent/CN113553292B/en
Publication of CN113553292A publication Critical patent/CN113553292A/en
Application granted granted Critical
Publication of CN113553292B publication Critical patent/CN113553292B/en
Status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00: Digital computers in general; Data processing equipment in general
    • G06F15/76: Architectures of general purpose stored program computers
    • G06F15/78: Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807: System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00: Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14: Handling requests for interconnection or transfer
    • G06F13/16: Handling requests for interconnection or transfer for access to memory bus
    • G06F13/1668: Details of memory controller
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00: Digital computers in general; Data processing equipment in general
    • G06F15/76: Architectures of general purpose stored program computers
    • G06F15/78: Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7839: Architectures of general purpose stored program computers comprising a single central processing unit with memory

Abstract

The embodiment of the invention discloses a vector processor and a related data access method. The vector processor comprises a vector memory access unit coupled with a memory, the memory comprising a plurality of memory blocks. The vector memory access unit is configured to: receive a memory access instruction, wherein the memory access instruction comprises N memory access requests and N is an integer greater than 0; obtain N memory access request addresses from the N memory access requests respectively; determine, among the plurality of blocks, M blocks to whose address ranges the N memory access request addresses belong, wherein M is an integer greater than 0 and less than or equal to N; and generate M bus requests corresponding to the M blocks and send the M bus requests to the memory. Adopting the embodiment of the invention can improve the data access performance of the vector processor.

Description

Vector processor and related data access method
Technical Field
The invention relates to the technical field of computers, in particular to a vector processor and a related data access method.
Background
With the growing scale and complexity of data in every field, the demands on processor computing power and processing performance keep increasing. A Vector Processor System (VPS) is a parallel processing system oriented to vector parallel computing and built mainly on a pipeline structure. It adopts parallel processing techniques such as look-ahead control, overlapped operation, arithmetic pipelines, and interleaved parallel memories, which play an important role in raising operation speed.
A vector processor is a central processing unit whose instruction set operates directly on one-dimensional arrays (vectors); for example, a vector processor can simultaneously execute the multiple operation requests contained in one vector instruction. In a vector processor, multiple processing units can share the same set of control components for instruction fetching, decoding, address calculation, memory access, and so on, so the parallelism of an application can be fully exploited at low hardware overhead. Providing efficient and flexible data supply for a vector processor is therefore key to fully exploiting its computing power. A vector memory access unit is the unit in a vector processor that reads data from, or writes data to, memory. It obtains the address to be accessed through the address calculation unit and then moves data between the memory and the vector registers at high bandwidth. In practice, however, it has been found that vector processor memory access operations typically take a long time, which reduces vector processor performance. How to improve the data access performance of a vector processor is therefore an urgent problem to be solved.
Disclosure of Invention
The embodiment of the invention provides a vector processor and a related data access method, which are used for improving the data access performance of the vector processor.
In a first aspect, an embodiment of the present invention provides a vector processor, comprising a vector memory access unit coupled with a memory, the memory comprising a plurality of memory blocks. The vector memory access unit is configured to: receive a memory access instruction, wherein the memory access instruction comprises N memory access requests and N is an integer greater than 0; obtain N memory access request addresses from the N memory access requests respectively; determine, among the plurality of blocks, M blocks to whose address ranges the N memory access request addresses belong, wherein M is an integer greater than 0 and less than or equal to N; and generate M bus requests corresponding to the M blocks and send the M bus requests to the memory.
In the prior art, when the vector memory access unit in a vector processor receives a memory access instruction, each memory access instruction may comprise a plurality of memory access requests. The vector memory access unit obtains a plurality of memory access request addresses from those requests, each address corresponding to one memory block (Block) in the memory, and then sends a separate bus request to the memory for each memory access request address, so that the vector processor can read data from or write data to the memory. For example, suppose the memory access instruction includes 2 memory access requests. After receiving the instruction, the vector memory access unit obtains 2 memory access request addresses, say address 0 and address 5. If the address range of Block0 in the memory is address 0 to address 5, then both address 0 and address 5 correspond to Block0; the vector memory access unit nevertheless sends one bus request to access Block0 based on address 0 and then another bus request to access Block0 based on address 5. In other words, the prior art has the problem that different memory access requests repeatedly access the same Block, so the efficiency of the vector processor in reading/writing data is low.
In the embodiment of the invention, after the vector memory access unit in the vector processor receives a memory access instruction, it obtains a plurality of memory access request addresses from the memory access requests in the instruction, determines the Block corresponding to each address, and fuses different memory access requests whose addresses belong to the same Block (that is, for one memory access instruction, one Block finally corresponds to one bus request). It then generates the bus request for each Block and sends it to the memory, reducing the number of repeated accesses to a single Block. For example, suppose the memory access instruction includes 2 memory access requests. After receiving the instruction, the vector memory access unit obtains 2 memory access request addresses, say address 0 and address 5. If the address range of Block0 in the memory is address 0 to address 5, both addresses correspond to Block0; the memory access requests for address 0 and address 5 are therefore fused into a single bus request for Block0, and only that one bus request is sent to the memory. The embodiment of the invention thus improves the vector memory access unit so that, when it reads data from or writes data to the memory, it avoids sending redundant bus requests that repeatedly access the same memory block. This reduces the number of data accesses to a certain extent and improves the memory access performance and user experience of the vector processor.
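The fusion described above can be sketched in software as follows. This is an illustrative model only, not the patented hardware: the block size of 6 addresses is a made-up value chosen to match the text's Block0 = address 0 to address 5 example, and the function name is hypothetical.

```python
# Illustrative sketch of request fusion: map each memory access request
# address to its block, then coalesce requests hitting the same block so
# that each block receives at most one bus request per vector instruction.

BLOCK_SIZE = 6  # hypothetical: Block0 covers addresses 0..5, as in the example above

def fuse_requests(request_addresses):
    """Return the distinct block indices, i.e. one bus request per block."""
    blocks = []
    for addr in request_addresses:
        block = addr // BLOCK_SIZE
        if block not in blocks:  # fuse requests that hit an already-seen block
            blocks.append(block)
    return blocks

# The text's example: addresses 0 and 5 both fall in Block0, so only one
# bus request (M = 1) is generated for the N = 2 memory access requests.
print(fuse_requests([0, 5]))  # [0]
```

Under this sketch, M (the number of bus requests) never exceeds N (the number of memory access requests), matching the claim's constraint 0 < M <= N.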
In one possible implementation manner, the vector access unit includes an address operation unit and a request fusion unit; the vector memory access unit is specifically configured to: receiving the access instruction through the address operation unit; the memory access instruction comprises the N memory access requests; respectively obtaining N memory access request addresses according to the N memory access requests, and sending the N memory access request addresses to the request fusion unit; receiving the N memory access request addresses sent by the address operation unit through the request fusion unit, and determining the M blocks corresponding to the address ranges to which the N memory access request addresses belong in the plurality of blocks; generating the M bus requests corresponding to the M blocks, and transmitting the M bus requests to the memory.
In the embodiment of the invention, a vector memory access unit in a vector processor comprises an address operation unit and a request fusion unit, wherein after the address operation unit receives a memory access instruction, a plurality of memory access request addresses are obtained according to a memory access request in the memory access instruction, and the obtained memory access request addresses are sent to the request fusion unit. Further, the request fusion unit receives a plurality of access request addresses sent by the address operation unit and determines the blocks corresponding to the access request addresses, and then the request fusion unit can fuse different access requests of which the access request addresses belong to the same Block (that is, for one access instruction, one Block finally corresponds to one bus request), generate bus requests for each Block and send the bus requests to the memory, so that the number of times of repeated accesses for a single Block is reduced. Therefore, when the vector memory access unit reads data from the memory or writes data to the memory, excessive bus requests can be prevented from being sent to repeatedly access the same memory block, and the memory access times of the data are further reduced to a certain extent, so that the memory access performance and the user experience of the vector processor are improved.
In one possible implementation, the address operation unit includes L address calculators; each address calculator obtains one memory access request address from one memory access request per clock cycle; L is an integer greater than 1.
In the embodiment of the invention, because the address operation unit in the vector memory access unit can include a plurality of address calculators, these calculators can compute the request addresses of different memory access requests in parallel within the same clock cycle. This improves the efficiency of address calculation and thereby the memory access performance and user experience of the vector processor.
In a possible implementation manner, the vector memory access unit is specifically configured to: dividing the N memory access requests into S request sets through the address operation unit; s is an integer greater than 1; each request set comprises less than or equal to L memory access requests; s access request address sets are obtained according to the S request sets respectively; each memory access request address set comprises the request addresses of the memory access requests in the corresponding request set.
In the embodiment of the invention, too many address calculators would increase the hardware area overhead, so a reasonable number L of address calculators can be configured when designing the address operation unit. When the number of memory access requests is larger than the number of address calculators, the requests can be grouped into a plurality of request sets, each containing no more than L requests, so that the address operation unit can compute the address set of one request set per clock cycle. This allows the request addresses to be computed in parallel while keeping the hardware area overhead low.
In one possible implementation, S is N/L rounded up to the nearest integer.
In the embodiment of the invention, when the memory access requests are grouped, they can be divided into S request sets according to the number of address calculators, with S equal to N/L rounded up. Each request set then contains as many requests as possible, so the N memory access request addresses can be computed in the fewest clock cycles, improving the efficiency of address calculation and the memory access performance and user experience of the vector processor.
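The grouping rule above can be checked with a short sketch. The function name and example values (N = 10 requests, L = 4 calculators) are illustrative, not from the patent:

```python
import math

def group_requests(requests, L):
    """Split N requests into S = ceil(N/L) sets of at most L requests each,
    so that L parallel address calculators finish in S clock cycles."""
    S = math.ceil(len(requests) / L)
    return [requests[i * L:(i + 1) * L] for i in range(S)]

# e.g. N = 10 requests with L = 4 address calculators -> S = 3 cycles
sets = group_requests(list(range(10)), 4)
print(len(sets))               # 3
print([len(s) for s in sets])  # [4, 4, 2]
```

Filling each set to L requests is what makes S minimal: any grouping with fewer requests per set would need more than ceil(N/L) cycles.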
In one possible implementation, the vector memory access unit further includes a data register; the vector memory access unit is specifically configured to: compute, through the address operation unit, the ith memory access request address set corresponding to the ith request set among the S request sets, and send the ith memory access request address set to the data register, i = 0, 1, 2, …, S; and receive, through the data register, the ith memory access request address set sent by the address operation unit, and store it.
In the embodiment of the invention, a plurality of request sets are obtained after the memory access requests are grouped. Because the address operation unit can only compute the address set of one request set per clock cycle, a data register can be added in the vector memory access unit to temporarily store the request addresses computed in previous clock cycles, which avoids losing memory access request addresses.
In a possible implementation manner, the vector memory access unit is specifically configured to: when S memory access request address sets are stored in the data register, the S memory access request address sets are sent to the request fusion unit through the data register; receiving the S access request address sets sent by the data register through the request fusion unit; the S memory access request address sets comprise the N memory access request addresses.
In the embodiment of the invention, after the address operation unit finishes computing the address sets of the S request sets, the data register holds the S memory access request address sets (which together comprise the addresses of the N memory access requests). The data register can then send the recorded addresses to the request fusion unit, so that the request fusion unit can generate bus requests based on the N memory access request addresses. In this way, when the vector memory access unit reads data from or writes data to the memory, redundant bus requests that repeatedly access the same memory block are avoided, which reduces the number of data accesses to a certain extent and improves the memory access performance and user experience of the vector processor.
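The buffering behaviour described in this implementation, i.e. the data register accumulating one address set per cycle and forwarding to the request fusion unit only once all S sets are present, can be modelled as below. The class, its fields, and the example addresses are all hypothetical illustrations, not the patent's circuit:

```python
# Sketch: the data register accumulates the per-cycle address sets and
# signals readiness to forward to the request fusion unit once all S
# expected sets have been stored.

class DataRegister:
    def __init__(self, expected_sets):
        self.expected = expected_sets  # S, known from the grouping step
        self.sets = []

    def store(self, address_set):
        """Store one address set; return True once all S sets are buffered."""
        self.sets.append(address_set)
        return len(self.sets) == self.expected

reg = DataRegister(expected_sets=3)
cycles = [[0, 4, 8, 12], [16, 20, 24, 28], [32, 36]]  # hypothetical addresses
for s in cycles:
    ready = reg.store(s)
print(ready)                          # True only after the 3rd cycle
print(sum(len(s) for s in reg.sets))  # all N = 10 request addresses buffered
```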
In one possible implementation, the vector processor further includes a vector register file; the N memory access requests are N data read requests; the vector memory access unit is further configured to: receive the data stored in the M blocks, fed back by the memory based on the M bus requests respectively; and write the data corresponding to the N data read requests into the vector register file based on the data stored in the M blocks.
In the embodiment of the invention, because the vector processor cannot read data in the memory directly, the data fed back by the memory can be stored in the vector registers, which makes it convenient to read that data quickly through the vector registers. The vector processor provided by the embodiment of the invention can therefore read data in the memory more efficiently, improving the memory access performance and user experience of the vector processor.
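An end-to-end read can be sketched as follows: the M fused bus requests fetch whole blocks, and each of the N read requests then picks its element out of the fetched block data to fill one lane of the destination vector register. Block size, memory contents, and the function name are made-up illustrative values:

```python
# Sketch of a vector load with request fusion: fetch each distinct block
# once, then scatter the block data into the destination vector register.

BLOCK_SIZE = 6
memory = list(range(100, 148))  # hypothetical memory contents: address a holds 100 + a

def vector_load(request_addresses):
    # M fused bus requests: one per distinct block
    blocks = sorted({a // BLOCK_SIZE for a in request_addresses})
    fetched = {b: memory[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE] for b in blocks}
    # one vector-register lane per read request, filled from the fetched blocks
    return [fetched[a // BLOCK_SIZE][a % BLOCK_SIZE] for a in request_addresses]

# 3 read requests, but addresses 0 and 5 share Block0, so only 2 bus requests
print(vector_load([0, 5, 7]))  # [100, 105, 107]
```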
In a second aspect, an embodiment of the present invention provides a data access method, which is applied to a vector processor; the vector processor comprises a vector access unit, the vector access unit is coupled with a memory, and the memory comprises a plurality of memory blocks; the method comprises the following steps: receiving a memory access instruction through the vector memory access unit; the memory access instruction comprises N memory access requests; n is an integer greater than 0; respectively obtaining N memory access request addresses according to the N memory access requests; determining M blocks corresponding to address ranges of the N memory access request addresses in the plurality of blocks; m is an integer greater than 0 and less than or equal to N; generating M bus requests corresponding to the M blocks, and transmitting the M bus requests to the memory.
In one possible implementation manner, the vector access unit includes an address operation unit and a request fusion unit; the receiving of the memory access instruction by the vector memory access unit, obtaining N memory access request addresses according to the N memory access requests, determining M blocks corresponding to address ranges to which the N memory access request addresses belong in the blocks, generating M bus requests corresponding to the M blocks, and sending the M bus requests to the memory, includes: receiving the access instruction through the address operation unit in the vector access unit; the memory access instruction comprises the N memory access requests; respectively obtaining N memory access request addresses according to the N memory access requests, and sending the N memory access request addresses to the request fusion unit; receiving the N access request addresses sent by the address operation unit through the request fusion unit in the vector access unit, and determining the M blocks corresponding to the address ranges to which the N access request addresses belong in the plurality of blocks; generating the M bus requests corresponding to the M blocks, and transmitting the M bus requests to the memory.
In one possible implementation, the address operation unit includes L address calculators; each address calculator obtains one memory access request address from one memory access request per clock cycle; L is an integer greater than 1.
In a possible implementation manner, the obtaining N memory access request addresses according to the N memory access requests respectively includes: dividing the N memory access requests into S request sets through the address operation unit in the vector memory access unit; s is an integer greater than 1; each request set comprises less than or equal to L memory access requests; s access request address sets are obtained according to the S request sets respectively; each memory access request address set comprises the request addresses of the memory access requests in the corresponding request set.
In one possible implementation, S is N/L rounded up to the nearest integer.
In one possible implementation, the vector memory access unit further includes a data register; the obtaining of the N memory access request addresses according to the N memory access requests includes: computing, through the address operation unit in the vector memory access unit, the ith memory access request address set corresponding to the ith request set among the S request sets, and sending the ith memory access request address set to the data register, i = 0, 1, 2, …, S; and receiving, through the data register in the vector memory access unit, the ith memory access request address set sent by the address operation unit, and storing it.
In a possible implementation manner, the sending the N memory access request addresses to the request fusion unit includes: when S memory access request address sets are stored in the data register, the S memory access request address sets are sent to the request fusion unit through the data register in the vector memory access unit; receiving the S access request address sets sent by the data register through the request fusion unit in the vector access unit; the S memory access request address sets comprise the N memory access request addresses.
In one possible implementation, the vector processor further includes a vector register file; the N memory access requests are N data reading requests; the method further comprises the following steps: respectively receiving data stored in the M blocks fed back by the memory based on the M bus requests through the vector access unit; and respectively writing the data corresponding to the N read data requests into the vector register based on the data stored in the M blocks.
In a third aspect, the present application provides a semiconductor chip, which may include the vector processor provided in any one of the implementations of the first aspect.
In a fourth aspect, the present application provides a semiconductor chip, which may include: the vector processor provided in any one of the implementations of the first aspect, an internal memory coupled to the vector processor, and an external memory.
In a fifth aspect, the present application provides a system on a chip SoC chip, including the vector processor provided in any one of the implementations of the first aspect, an internal memory coupled to the vector processor, and an external memory. The SoC chip may be formed of a chip, or may include a chip and other discrete devices.
In a sixth aspect, the present application provides a chip system, where the chip system includes the vector processor provided in any one of the implementations of the first aspect. In one possible design, the chip system further includes a memory for storing program instructions and data necessary or relevant to the operation of the vector processor. The chip system may be formed of a chip, or may include a chip and other discrete devices.
In a seventh aspect, the present application provides a processing apparatus having the function of implementing any one of the data access methods in the second aspect. The function can be realized by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the function described above.
In an eighth aspect, the present application provides a terminal, where the terminal includes a processor, and the processor is the vector processor provided in any implementation manner of the first aspect. The terminal may also include a memory, coupled to the processor, that stores program instructions and data necessary for the terminal. The terminal may also include a communication interface for the terminal to communicate with other devices or communication networks.
In a ninth aspect, the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, implements the flow of the data access method in any one of the second aspects.
In a tenth aspect, an embodiment of the present invention provides a computer program, where the computer program includes instructions, and when the computer program is executed by a processor, the processor may execute the flow of the data access method in any one of the second aspects.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments or the background art of the present invention, the drawings required to be used in the embodiments or the background art of the present invention will be described below.
Fig. 1A is a schematic diagram of a vector processor according to an embodiment of the present invention.
Fig. 1B is a schematic structural diagram of a memory according to an embodiment of the invention.
Fig. 1C is a schematic diagram of a vector memory access pipeline according to an embodiment of the present invention.
Fig. 2A is a schematic diagram of a data access mode of a vector processor according to an embodiment of the present invention.
Fig. 2B is a schematic diagram of a vector memory access unit according to an embodiment of the present invention.
Fig. 2C is a schematic diagram illustrating a comparison of bus requests according to an embodiment of the present invention.
Fig. 3A is a schematic diagram of a memory access request packet according to an embodiment of the present invention.
FIG. 3B is a diagram of another vector memory access unit according to an embodiment of the present invention.
Fig. 3C is a schematic diagram of another bus request generation method according to an embodiment of the present invention.
Fig. 3D is a schematic diagram of a vector register according to an embodiment of the present invention.
Fig. 4 is a schematic flow chart of a data access method according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described below with reference to the drawings.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
As used in this specification, the terms "component," "module," "system," and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device can be a component. One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between 2 or more computers. In addition, these components can execute from various computer readable media having various data structures stored thereon. The components may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from two components interacting with another component in a local system, distributed system, and/or across a network such as the internet with other systems by way of the signal).
First, some terms in the present application are explained so as to be easily understood by those skilled in the art.
(1) Instruction pipelining divides the execution of an instruction into multiple small steps, each completed by a dedicated circuit, in order to improve the efficiency with which the processor executes instructions. For example, an instruction may go through 3 stages: fetch, decode, and execute, each taking one machine cycle. Without pipelining, executing the instruction needs 3 machine cycles; with instruction pipelining, once an instruction finishes fetch and enters decode, the next instruction can already be fetched, which improves instruction execution efficiency.
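The cycle counts in the 3-stage example above can be checked with a small back-of-envelope calculation; the classic pipelined formula stages + (k - 1) assumes one instruction issued per cycle after the pipeline fills, with no stalls:

```python
# Cycle counts for k instructions on a 3-stage (fetch, decode, execute)
# machine, one machine cycle per stage, assuming no pipeline stalls.

def cycles_unpipelined(k, stages=3):
    return k * stages  # each instruction runs all stages before the next starts

def cycles_pipelined(k, stages=3):
    return stages + (k - 1)  # first instruction fills the pipe, then one per cycle

print(cycles_unpipelined(1))  # 3, matching the single-instruction example above
print(cycles_unpipelined(4))  # 12
print(cycles_pipelined(4))    # 6
```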
(2) The Execution Unit (EU) is responsible for executing instructions, and has the functions of both a controller and an arithmetic Unit.
(3) A Register File (RF) is an array composed of multiple registers in a CPU. It can be implemented with flip-flops or, if a larger storage capacity is required, with fast Static Random Access Memory (SRAM). Such an SRAM has dedicated read ports and write ports and can access different registers concurrently over multiple paths.
(4) An Integrated Circuit (IC) is a type of microelectronic device or component. The transistors, resistors, capacitors, inductors, and other elements and wiring required by a circuit are interconnected by a certain process, fabricated on one or several small semiconductor wafers or dielectric substrates, and then packaged in a shell to form a microstructure with the required circuit function. In other words, an IC chip is formed by placing an integrated circuit composed of a large number of microelectronic devices (transistors, resistors, capacitors, etc.) on a substrate.
In order to facilitate understanding of the embodiments of the present invention in light of the above-mentioned technical problems, a vector processor architecture on which the embodiments of the present invention are based is described below. Referring to fig. 1A, fig. 1A is a schematic diagram of the architecture of a vector processor according to an embodiment of the present invention. The vector processor 01 may be located in any electronic device, such as a computer, a mobile phone, a tablet, a personal digital assistant, an intelligent wearable device, an intelligent vehicle-mounted device, or an intelligent home appliance. The vector processor 01 may specifically be a chip or a chipset, or a circuit board on which the chip or the chipset is mounted; the chip or chipset, or the circuit board carrying it, may operate under the necessary software drivers. Specifically:
the vector processor 01 may include at least one processor core 10, and the processor core 10 may include an instruction cache unit 101 (Cache), an instruction fetch unit 102 (Fetch), an instruction dispatch unit 103 (Dispatch), and a scalar unit 104 and a vector unit 105 connected to the instruction dispatch unit 103. The instruction cache unit 101 is a temporary memory inside the vector processor 01; it has a small capacity but exchanges data at high speed, and stores instruction packets. The instruction fetch unit 102 (Fetch) fetches instructions from the instruction cache unit 101 and sends them to the instruction dispatch unit 103. The instruction dispatch unit 103 schedules and dispatches the instructions to be executed. The scalar unit 104 and the vector unit 105 each serve as an execution unit (EU, also referred to as a functional unit) of the vector processor 01 to perform various types of instruction tasks. Specifically, the scalar unit 104 is mainly responsible for the serially executed parts of an application and for handling branches, interrupts, and system configuration, while the vector unit 105 mainly accelerates the parallel tasks in an application. The scalar unit 104 and the vector unit 105 may share a unified (or use separate) instruction fetch unit 102 and instruction dispatch unit 103 to fetch instructions from the instruction cache unit 101 and dispatch them to the execution units.
In addition, the vector unit 105 may include a plurality of parallel Processing Elements (PEs), which read data from the Vector Register File (VRF) 1051 for computation and write results back to the vector register file 1051. Data is exchanged between the vector register file 1051 and the memory 02 at high bandwidth through the vector access unit 1052. Next, the structure of the above-mentioned memory 02 is described with reference to fig. 1B, which is a schematic structural diagram of a memory according to an embodiment of the present invention. The memory 02 is a non-volatile memory whose stored content is not lost after power-off; it may be used for long-term storage of the instructions and data involved in the operation of the vector processor 01, such as boot programs, the operating system, applications, and data. It should be noted that, because the vector processor 01 cannot directly read data from or write data into the memory 02, when a read-data (i.e., load) instruction is executed, the vector unit 105 actually sends a bus request to the memory 02 through the vector access unit 1052, so that the contents to be read are first loaded from the memory 02 into the vector register file 1051, from which the vector unit 105 then reads them; when a write-data (i.e., store) instruction is executed, the vector unit 105 first writes the data to be stored into the vector register file 1051, and then sends a bus request to the memory 02 through the vector access unit 1052 to store the data from the vector register file 1051 into the memory 02.
The memory 02 may include one or more of flash memory (e.g., NAND flash, NOR flash), Universal Flash Storage (UFS), embedded multimedia card (eMMC), UFS multi-chip package (uMCP), embedded multimedia card multi-chip package (eMCP), Solid State Drive (SSD), and the like. The storage medium array 20 of the memory 02 may comprise a plurality of memory blocks for storing data, where each block may comprise M pages, each page being usable to store data.
In one possible implementation, the vector processor 01 may include a plurality of processor cores therein. The processor cores may be homogeneous or heterogeneous, that is, the structures between the processor core 10 and other processor cores may be the same or different, which is not specifically limited in this embodiment of the present invention. Alternatively, the processor core 10 may serve as a master processing core, other processor cores except the processor core 10 may serve as slave processing cores, and the master processing core and the plurality of slave processing cores may be located in one or more chips (ICs). It is understood that the master processing core and the other slave processing cores may be coupled for communication via a bus or other means, and are not particularly limited herein.
It is understood that the structure of the vector processor in fig. 1A is only some exemplary implementations provided by the embodiments of the present invention, and the structure of the processor in the embodiments of the present invention includes, but is not limited to, the above implementations.
Based on the vector processor architecture provided by the present application, the embodiment of the present invention further provides a vector access pipeline structure suitable for this architecture; although different vector access pipelines differ slightly, their operation processes are similar. As shown in fig. 1C, fig. 1C is a schematic diagram of a vector access pipeline according to an embodiment of the present invention, where the vector access pipeline 03 may include a vector write data pipeline 31 and a vector read data pipeline 32. The vector write data pipeline 31 may be divided into an instruction decode stack 301, an address calculation stack 302, a request arbitration stack 303, and an access memory stack 304; the vector read data pipeline 32 may be divided into an instruction decode stack 301, an address calculation stack 302, a request arbitration stack 303, an access memory stack 304, a data return stack 305, and a write VR stack 306. Specifically:
the first stacks in both the vector write data pipeline 31 and the vector read data pipeline 32 are instruction decode stacks 301 where the vector access instruction is parsed into a base address, the number of the offset register, the source (or destination) register number, and the access instruction is sent to the base address and the offset register. If the instruction is a read data instruction, the stack will send a request to read the source operand VR; if it is a write data instruction, the stack passes the number of the destination operand VR to the next stack. The contents of the read base register and offset register are added together in the address calculation stack 302, resulting in an address to access the memory, and the data read from the VR or the VR number is passed to the next stack. The request arbitration stack 303 mainly determines whether the right to access the memory can be obtained, and transfers the access request address, the data read from the VR, or the number of the VR to the next stack. After accessing the memory stack 304, the memory is primarily read or written based on the memory access request address and the VR number is sent to the next stack where the vector write data pipeline 31 ends. The vector read data pipeline 32 ends by waiting for data to return from memory at the data return stack 305 and finally writing the memory-returned data to the VR at the write VR stack 306. It should be noted that the vector access pipeline structure may be different according to the structure of each processor core, and therefore, the vector access pipeline structure referred to in this application refers to the pipeline structure of the processor core 10, and does not specifically limit the pipeline structures of other processor cores.
In the above vector memory access pipeline 03 structure, each memory access instruction in the vector processor 01 undergoes the above operation steps, but different operation steps of a plurality of memory access instructions can be executed simultaneously, so that the instruction flow speed can be increased as a whole, and the program execution time can be shortened. It is understood that the above vector processor architecture and vector access pipeline structure are only some exemplary implementations provided by the embodiments of the present invention, and the vector processor architecture and vector access pipeline structure in the embodiments of the present invention include, but are not limited to, the above implementations.
Based on the architecture of the vector processor provided in fig. 1A, in the following embodiments of the invention the vector processor 01 may include a vector access unit 1052, where the vector access unit 1052 is coupled with a memory 02 and the memory 02 includes a plurality of memory blocks. The functions implemented may include the following:
the vector memory access unit 1052 is configured to receive a memory access instruction; the memory access instruction comprises N memory access requests; n is an integer greater than 0; vector memory access unit 1052 obtains N memory access request addresses according to the N memory access requests; the vector memory access unit 1052 determines M blocks corresponding to the address range to which the N memory access request addresses belong in the plurality of blocks; m is an integer greater than 0 and less than or equal to N; vector access unit 1052 generates M bus requests corresponding to the M blocks and sends the M bus requests to the memory. Specifically, the memory access instruction may include a read data instruction or a write data instruction, where each memory access instruction may include a plurality of memory access requests; the access request address is an address in the memory 02 to be accessed by the vector processor 01; the memory block in the memory 02 may include a plurality of pages for storing data, wherein each page corresponds to an address, and thus a plurality of addresses may correspond to one memory block. It should be noted that the data access mode of the vector processor 01 can be generally divided into three modes, namely, continuous data access, access between vector elements at a certain fixed interval, and complete address dispersion between elements. For example, referring to fig. 2A, fig. 2A is a schematic diagram of a data access mode of a vector processor according to an embodiment of the present invention, which illustrates two cases that vector elements are accessed at a certain fixed interval and addresses between the elements are completely discrete, and it is assumed that a memory 02 includes 3 memory blocks, which are Block0, Block1, Block2, Block0 corresponds to addresses 0 to 5, Block1 corresponds to addresses 6 to 11, and Block2 corresponds to addresses 12 to 17. 
For fixed-stride access, the memory access instruction may, for example, start from address 0 and read data at a fixed interval of one element, so the instruction may include requests for read address 0, read address 2, read address 4, read address 6, read address 8, read address 10, read address 12, read address 14, and read address 16. For fully scattered addresses, the addresses of the elements to be accessed are discrete, so the memory access instruction may include, for example, requests for read address 1, read address 2, read address 4, read address 6, read address 10, and read address 12.
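The mapping of the two access patterns above onto the blocks of fig. 2A can be sketched as follows (a minimal illustration using the figure's layout of six addresses per block):

```python
BLOCK_SIZE = 6  # Block0: addresses 0-5, Block1: 6-11, Block2: 12-17

def block_of(addr):
    """Memory block that an access request address falls in."""
    return addr // BLOCK_SIZE

# Fixed-stride access: start at address 0, skip every other element.
stride_addrs = list(range(0, 17, 2))   # 0, 2, 4, ..., 16
# Fully scattered (gather) access: arbitrary per-element addresses.
gather_addrs = [1, 2, 4, 6, 10, 12]

print([block_of(a) for a in stride_addrs])
# [0, 0, 0, 1, 1, 1, 2, 2, 2]
print(sorted({block_of(a) for a in gather_addrs}))
# [0, 1, 2]
```

Either way, several request addresses land in the same block, which is what makes the request fusion described below profitable.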
In one possible implementation manner, the vector access unit includes an address operation unit and a request fusion unit; the vector memory access unit is specifically configured to: receiving the access instruction through the address operation unit; the memory access instruction comprises the N memory access requests; respectively obtaining N memory access request addresses according to the N memory access requests through the address operation unit, and sending the N memory access request addresses to the request fusion unit; receiving the N memory access request addresses sent by the address operation unit through the request fusion unit, and determining the M blocks corresponding to the address ranges to which the N memory access request addresses belong in the plurality of blocks; generating, by the request fusion unit, the M bus requests corresponding to the M blocks, and sending the M bus requests to the memory. Specifically, referring to fig. 2B, fig. 2B is a schematic diagram of a vector memory access unit according to an embodiment of the present invention, in which the vector memory access unit 1052 includes an address operation unit 10521 and a request fusion unit 10522, where after the address operation unit 10521 receives a memory access instruction, a plurality of memory access request addresses are obtained according to a memory access request in the memory access instruction, and the obtained memory access request addresses are sent to the request fusion unit 10522. 
Further, the request fusion unit 10522 receives a plurality of access request addresses sent by the address operation unit 10521 and determines blocks corresponding to the access request addresses, and further the request fusion unit 10522 may fuse different access requests of which the access request addresses belong to the same Block (that is, for an access instruction, one Block finally corresponds to one bus request), generate bus requests for each Block, and send the bus requests to the memory, thereby reducing the number of times of repeated accesses for a single Block. For example, as shown in fig. 2C, fig. 2C is a schematic diagram for comparing bus requests according to an embodiment of the present invention, and it is assumed that the memory access instruction includes 3 memory access requests, and the memory 02 includes at least one memory Block0, where Block0 corresponds to addresses 0 to 5. In the figure, after the address operation unit receives the access instruction, 3 access request addresses, namely address 0, address 2 and address 4, can be obtained according to 3 access requests in the access instruction, as shown in (a) in figure 2C, a vector access unit in the prior art generates 3 bus requests one by one according to the access request addresses, and sequentially sends the 3 bus requests to the memory 02; as shown in fig. 2C (b), in the embodiment of the present invention, by improving the vector access unit, after the address operation unit 10521 calculates the addresses of 3 access requests, the 3 addresses are sent to the request fusion unit 10522, and further the request fusion unit 10522 can determine that all the 3 addresses correspond to the same Block (Block0), so that the access requests corresponding to the three addresses are fused to generate a bus request for accessing the Block0, and the bus request is sent to the memory 02. 
Compared with the prior art, the number of bus requests sent to the memory 02 is reduced, so that the vector processor provided by the embodiment of the invention can avoid sending too many bus requests to repeatedly access the same memory block when reading data from the memory or writing data to the memory, and further reduce the memory access times of the data to a certain extent, thereby improving the memory access performance and the user experience of the vector processor.
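The fusion behavior of fig. 2C can be sketched in a few lines (an illustrative software model of the request fusion unit, not the hardware implementation):

```python
from collections import defaultdict

BLOCK_SIZE = 6  # Block0 covers addresses 0-5, as in Fig. 2C

def fuse_requests(request_addrs):
    """Merge access requests whose addresses fall in the same memory
    block, so that each block receives exactly one bus request."""
    by_block = defaultdict(list)
    for addr in request_addrs:
        by_block[addr // BLOCK_SIZE].append(addr)
    # One bus request per block, carrying all fused request addresses.
    return dict(by_block)

# The three requests of Fig. 2C all hit Block0: one fused bus request
# replaces the three that a per-request design would issue.
bus_requests = fuse_requests([0, 2, 4])
print(len(bus_requests))  # 1
print(bus_requests)       # {0: [0, 2, 4]}
```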
In one possible implementation, the address operation unit includes L address operators, and in each clock cycle each address operator obtains one memory access request address from one memory access request; L is an integer greater than 1. Specifically, because the address operation unit in the vector access unit may comprise a plurality of address operators, those operators can compute the request addresses of different memory access requests in parallel within the same clock cycle, which improves the efficiency of address calculation and thereby the memory access performance and user experience of the vector processor.
In a possible implementation manner, the vector memory access unit is specifically configured to: dividing the N memory access requests into S request sets through the address operation unit; s is an integer greater than 1; each request set comprises less than or equal to L memory access requests; obtaining S access request address sets respectively according to the S request sets through the address operation unit; each memory access request address set comprises the request addresses of the memory access requests in the corresponding request set. Specifically, since too many address operators increase the area overhead of hardware, L address operators can be reasonably configured when designing the address operation unit. When the number of the access requests is larger than the number of the address arithmetic units, the access requests can be grouped to obtain a plurality of request sets, wherein the number of the access requests in each request set is smaller than or equal to the number of the address arithmetic units, so that the address arithmetic unit can calculate the access request address set corresponding to the access requests in one request set in one clock period. For example, as shown in fig. 3A, fig. 3A is a memory access request grouping diagram provided by an embodiment of the present invention, in which N memory access requests are divided into S request sets, and then the memory access request address set corresponding to each request set may be sequentially calculated by L address operators in the address operation unit 10521. Therefore, the vector processor provided by the embodiment of the invention can reduce the area overhead of a chip and simultaneously realize the parallel computation of the access and storage request address.
In one possible implementation, S is N/L rounded up, i.e., S = ⌈N/L⌉. Specifically, when memory access requests are grouped, they can be divided into S request sets according to the number of address operators; with S = ⌈N/L⌉, each request set includes as many memory access requests as possible, so the N memory access request addresses can be computed in the fewest clock cycles, improving the efficiency of address calculation and thereby the memory access performance and user experience of the vector processor. For example, suppose the memory access instruction includes 9 memory access requests and the address operation unit includes 3 address operators; the 9 requests can then be divided into 3 request sets, and the address operation unit can complete the address calculation in three clock cycles.
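The grouping rule can be sketched as follows (illustrative only; requests are represented by their indices):

```python
import math

def group_requests(requests, num_operators):
    """Split N requests into ceil(N/L) sets of at most L requests each,
    so the L address operators can process one set per clock cycle."""
    L = num_operators
    return [requests[i:i + L] for i in range(0, len(requests), L)]

reqs = list(range(9))           # the 9 access requests of the example
sets = group_requests(reqs, 3)  # 3 address operators
print(len(sets))                # 3 sets -> 3 clock cycles
print(sets)                     # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
assert len(sets) == math.ceil(len(reqs) / 3)
```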
In one possible implementation, the vector access unit further includes a data register, and the vector access unit is specifically configured to: calculate, through the address operation unit, the i-th memory access request address set corresponding to the i-th request set among the S request sets, and send the i-th memory access request address set to the data register, where i = 1, 2, …, S; and receive, through the data register, the i-th memory access request address set sent by the address operation unit, and store it. Specifically, a plurality of request sets are obtained after the memory access requests are grouped; because the address operation unit can compute only one request set's address set per clock cycle, a data register can be added to the vector access unit to temporarily store the addresses computed in previous clock cycles, preventing the memory access request addresses from being lost. For example, as shown in fig. 3B, fig. 3B is a schematic diagram of another vector access unit according to an embodiment of the present invention, in which the vector access unit 1052 includes an address operation unit 10521, a request fusion unit 10522, and a data register 10523; after the address operation unit 10521 computes the address set of a request set, the result is sent to the data register 10523 for temporary storage, thereby avoiding loss of the memory access request addresses.
In one possible implementation, the vector access unit is specifically configured to: when S memory access request address sets are stored in the data register, send the S memory access request address sets to the request fusion unit through the data register; and receive, through the request fusion unit, the S memory access request address sets sent by the data register, where the S memory access request address sets together comprise the N memory access request addresses. Specifically, after the address operation unit has computed the address sets of all S request sets, the data register 10523 holds S memory access request address sets (containing the request addresses of all N memory access requests); the data register 10523 then sends the recorded addresses to the request fusion unit 10522, and the request fusion unit 10522 subsequently generates bus requests based on the N memory access request addresses. For example, assume the access instruction includes 4 memory access requests, the address operation unit 10521 includes 2 address operators, and the memory 02 includes at least 2 memory blocks, where Block0 corresponds to addresses 0-5 and Block1 corresponds to addresses 6-11. Referring to fig. 3C, which is another schematic diagram of generating bus requests according to an embodiment of the present invention: after the memory access requests are divided into 2 request sets, the address operation unit 10521 first computes the addresses of the first request set, namely address 0 and address 2, which are recorded in the data register 10523; in the next clock cycle it computes the addresses of the second request set, namely address 4 and address 10, and the result is again recorded in the data register 10523. Once the addresses of all 4 memory access requests are recorded in the data register 10523, they are sent to the request fusion unit 10522, which determines that address 0, address 2, and address 4 correspond to Block0 and that address 10 corresponds to Block1; two bus requests are then generated and sent to the memory in sequence. Compared with the prior art, the number of bus requests sent to the memory 02 is reduced, so the vector processor provided by the embodiment of the invention can avoid sending too many bus requests that repeatedly access the same memory block when reading data from or writing data to the memory, thereby reducing the number of memory accesses to a certain extent and improving the memory access performance and user experience of the vector processor.
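The two-cycle accumulation and subsequent fusion of the fig. 3C example can be sketched end to end (an illustrative software model; the address computation is stubbed out with a lookup, since the real operators add base and offset registers):

```python
from collections import defaultdict

BLOCK_SIZE = 6  # Block0 covers addresses 0-5, Block1 covers 6-11 (Fig. 3C)

def compute_and_fuse(request_sets, addr_of):
    """One request set is resolved per clock cycle by the address
    operators; results accumulate in the data register until all are
    present, after which the fusion unit groups them by block."""
    data_register = []
    for request_set in request_sets:
        # one clock cycle: the address operators resolve one set
        data_register.extend(addr_of(r) for r in request_set)
    # all N addresses present: fuse into one bus request per block
    by_block = defaultdict(list)
    for addr in data_register:
        by_block[addr // BLOCK_SIZE].append(addr)
    return dict(by_block)

# Fig. 3C: cycle 1 yields addresses 0 and 2, cycle 2 yields 4 and 10.
addr_of = {0: 0, 1: 2, 2: 4, 3: 10}.__getitem__
bus = compute_and_fuse([[0, 1], [2, 3]], addr_of)
print(bus)  # {0: [0, 2, 4], 1: [10]} -- two bus requests, not four
```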
In one possible implementation, the vector processor further includes a vector register file; the N memory access requests are N data reading requests; the vector memory access unit is further configured to: receiving data stored in the M blocks fed back by the memory based on the M bus requests respectively; and respectively writing the data corresponding to the N read data requests into the vector register based on the data stored in the M blocks. In particular, since the vector processor cannot directly read the data in the memory, the data fed back by the memory can be stored in the vector register, which facilitates the rapid reading of the data in the memory through the vector register. For example, as shown in fig. 3D, fig. 3D is a vector register diagram according to an embodiment of the present invention, in which a vector register file 1051 may include a plurality of vector registers, and after the vector access unit 1052 receives data fed back from the memory 02, the data may be sequentially stored in the vector registers based on an access request. The vector processor provided by the embodiment of the invention can read the data in the memory more efficiently, thereby improving the memory access performance and the user experience of the vector processor.
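The write-back step above can be sketched as scattering the per-block returned data into vector register lanes (an illustrative model; the dict-of-dicts shape of `block_data` is an assumption for the sketch, not the hardware's bus format):

```python
BLOCK_SIZE = 6  # Block0: addresses 0-5, Block1: 6-11, as in the figures

def write_back(block_data, request_order):
    """Scatter the data returned per fused bus request back into vector
    register lanes, one element per original read request.

    block_data   : {block: {addr: value}} data fed back by the memory
    request_order: request addresses in original (lane) order
    """
    return [block_data[addr // BLOCK_SIZE][addr] for addr in request_order]

# Two fused bus requests returned data covering four read requests.
data = {0: {0: 'a', 2: 'b', 4: 'c'}, 1: {10: 'd'}}
lanes = write_back(data, [0, 2, 4, 10])
print(lanes)  # ['a', 'b', 'c', 'd']
```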
In the embodiment of the invention, after a vector access unit in a vector processor receives an access instruction, a plurality of access request addresses are obtained according to access requests in the access instruction, then blocks corresponding to the access request addresses are determined, and different access requests of which the access request addresses belong to the same Block can be fused (namely for one access instruction, one Block finally corresponds to one bus request), so that bus requests for each Block are generated and sent to a memory, and the repeated access times for a single Block are reduced. Therefore, the embodiment of the invention improves the vector memory access unit in the vector processor, and can avoid sending excessive bus requests to repeatedly access the same memory block when the vector memory access unit reads data from the memory or writes data to the memory, thereby reducing the memory access times of the data to a certain extent, and improving the memory access performance and the user experience of the vector processor.
Having described the vector processor of embodiments of the present invention in detail above, the following provides methods related to embodiments of the present invention.
Referring to fig. 4, fig. 4 is a flowchart illustrating a data access method according to an embodiment of the present invention. The method is applicable to any of the vector processors in fig. 1A and to devices including such a vector processor, and may include the following steps S401 to S404. The vector processor comprises a vector access unit, the vector access unit is coupled with a memory, and the memory comprises a plurality of memory blocks.
step S401: and receiving a memory access instruction through the vector memory access unit.
Specifically, the memory access instruction comprises N memory access requests; n is an integer greater than 0.
Step S402: and respectively obtaining N memory access request addresses through the vector memory access unit according to the N memory access requests.
Step S403: and determining M blocks corresponding to the address ranges of the N access request addresses in the plurality of blocks through the vector access unit.
Specifically, M is an integer greater than 0 and less than or equal to N.
Step S404: and generating M bus requests corresponding to the M blocks through the vector access unit, and sending the M bus requests to the memory.
In one possible implementation manner, the vector access unit includes an address operation unit and a request fusion unit; the receiving of the memory access instruction by the vector memory access unit, obtaining N memory access request addresses according to the N memory access requests, determining M blocks corresponding to address ranges to which the N memory access request addresses belong in the blocks, generating M bus requests corresponding to the M blocks, and sending the M bus requests to the memory, includes: receiving the access instruction through the address operation unit in the vector access unit; the memory access instruction comprises the N memory access requests; respectively obtaining N memory access request addresses according to the N memory access requests, and sending the N memory access request addresses to the request fusion unit; receiving the N access request addresses sent by the address operation unit through the request fusion unit in the vector access unit, and determining the M blocks corresponding to the address ranges to which the N access request addresses belong in the plurality of blocks; generating the M bus requests corresponding to the M blocks, and transmitting the M bus requests to the memory.
In one possible implementation, the address arithmetic unit includes L address operators; each address arithmetic unit obtains a memory access request address based on a memory access request in each clock cycle; l is an integer greater than 1.
In a possible implementation manner, the obtaining N memory access request addresses according to the N memory access requests respectively includes: dividing the N memory access requests into S request sets through the address operation unit in the vector memory access unit; s is an integer greater than 1; each request set comprises less than or equal to L memory access requests; s access request address sets are obtained according to the S request sets respectively; each memory access request address set comprises the request addresses of the memory access requests in the corresponding request set.
In one possible implementation, S is N/L rounded up, i.e., S = ⌈N/L⌉.
In one possible implementation, the vector access unit further includes a data register, and the obtaining of the N memory access request addresses according to the N memory access requests includes: calculating, through the address operation unit in the vector access unit, the i-th memory access request address set corresponding to the i-th request set among the S request sets, and sending the i-th memory access request address set to the data register, where i = 1, 2, …, S; and receiving, through the data register in the vector access unit, the i-th memory access request address set sent by the address operation unit, and storing it.
In a possible implementation manner, the sending the N memory access request addresses to the request fusion unit includes: when S memory access request address sets are stored in the data register, the S memory access request address sets are sent to the request fusion unit through the data register in the vector memory access unit; receiving the S access request address sets sent by the data register through the request fusion unit in the vector access unit; the S memory access request address sets comprise the N memory access request addresses.
In one possible implementation, the vector processor further includes a vector register file; the N memory access requests are N data reading requests; the method further comprises the following steps: respectively receiving data stored in the M blocks fed back by the memory based on the M bus requests through the vector access unit; and respectively writing the data corresponding to the N read data requests into the vector register based on the data stored in the M blocks.
It should be noted that, implementation of each step in the data access and storage method described in the embodiment of the present invention is completed in the vector processor in the above embodiment, and details are not described here.
The present application provides a semiconductor chip, which may include the vector processor provided in any one of the implementations of the vector processor embodiments described above.
The present application further provides a semiconductor chip, which may include the vector processor provided in any one of the foregoing vector processor embodiments, an internal memory coupled to the vector processor, and an external memory.
The present application provides a system-on-chip (SoC), which includes the vector processor provided in any one of the foregoing vector processor embodiments, an internal memory coupled to the vector processor, and an external memory. The SoC may consist of a single chip, or may include a chip together with other discrete devices.
The present application provides a chip system, which includes the vector processor provided in any one of the foregoing vector processor embodiments. In one possible design, the chip system further includes a memory for storing program instructions and data necessary for the operation of the vector processor. The chip system may consist of a single chip, or may include a chip together with other discrete devices.
The present application provides a processing apparatus having the function of implementing any data access method in the foregoing data access method embodiments. The function may be implemented by hardware, or by hardware executing corresponding software; the hardware or software includes one or more modules corresponding to the function.
The present application provides a terminal including a processor, where the processor is the vector processor provided in any one of the foregoing vector processor embodiments. The terminal may further include a memory coupled to the processor, which stores program instructions and data necessary for the terminal, and a communication interface for the terminal to communicate with other devices or communication networks.
The present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the data access method flow described in any one of the foregoing data access method embodiments.
An embodiment of the present invention provides a computer program including instructions which, when executed by a processor, cause the processor to perform the data access method flow described in any one of the foregoing data access method embodiments.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
It should be noted that, for simplicity of description, the foregoing method embodiments are presented as a series of actions; however, those skilled in the art will recognize that the present application is not limited by the order of actions described, as some steps may be performed in other orders or concurrently. Those skilled in the art should further appreciate that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily required by this application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of the units is only a division of logical functions, and other divisions are possible in actual implementation; multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical or take other forms.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, in whole or in part, may be embodied in the form of a software product stored in a storage medium and including several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, and specifically a processor therein) to execute all or some of the steps of the methods of the embodiments of the present application. The storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a read-only memory (ROM), or a random access memory (RAM).
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be replaced by equivalents, without such modifications or substitutions departing from the spirit and scope of the corresponding technical solutions of the embodiments of the present application.

Claims (15)

1. A vector processor, comprising a vector memory access unit coupled to a memory, the memory comprising a plurality of memory blocks, wherein the vector memory access unit is configured to:
receiving a memory access instruction; the memory access instruction comprises N memory access requests; n is an integer greater than 0;
respectively obtaining N memory access request addresses according to the N memory access requests;
determining M blocks corresponding to address ranges to which the N memory access request addresses belong in the plurality of memory blocks; m is an integer greater than 0 and less than or equal to N;
generating M bus requests corresponding to the M blocks, and sending the M bus requests to the memory;
the vector access unit comprises an address operation unit and a request fusion unit; the vector memory access unit is specifically configured to:
receiving the access instruction through the address operation unit; the memory access instruction comprises the N memory access requests; respectively obtaining N memory access request addresses according to the N memory access requests, and sending the N memory access request addresses to the request fusion unit;
receiving the N memory access request addresses sent by the address operation unit through the request fusion unit, and determining the M blocks corresponding to the address ranges to which the N memory access request addresses belong in the plurality of memory blocks; generating the M bus requests corresponding to the M blocks, and transmitting the M bus requests to the memory.
2. The vector processor of claim 1, wherein the address operation unit comprises L address operators, each address operator obtaining one memory access request address based on one memory access request in each clock cycle, and L is an integer greater than 1.
3. The vector processor of claim 2, wherein the vector memory access unit is specifically configured to:
dividing the N memory access requests into S request sets through the address operation unit; s is an integer greater than 1; each request set comprises less than or equal to L memory access requests; s access request address sets are obtained according to the S request sets respectively; each memory access request address set comprises the request addresses of the memory access requests in the corresponding request set.
4. The vector processor of claim 3, wherein S is N/L rounded up to the nearest integer.
5. The vector processor of claim 4, wherein the vector memory access unit further comprises a data register; the vector memory access unit is specifically configured to:
calculating, through the address operation unit, an ith memory access request address set corresponding to an ith request set in the S request sets, and sending the ith memory access request address set to the data register, where i = 0, 1, 2, …, S;
and receiving the ith access request address set sent by the address operation unit through the data register, and storing the ith access request address set.
6. The vector processor of claim 5, wherein the vector memory access unit is specifically configured to:
when S memory access request address sets are stored in the data register, the S memory access request address sets are sent to the request fusion unit through the data register;
receiving the S access request address sets sent by the data register through the request fusion unit; the S memory access request address sets comprise the N memory access request addresses.
7. The vector processor of any one of claims 1-6, further comprising a vector register file; the N memory access requests are N data reading requests; the vector memory access unit is further configured to:
receiving data stored in the M blocks fed back by the memory based on the M bus requests respectively;
and writing, based on the data stored in the M blocks, the data corresponding to the N read data requests into the vector register file respectively.
8. A data access and storage method is applied to a vector processor; the vector processor comprises a vector access unit, the vector access unit is coupled with a memory, and the memory comprises a plurality of memory blocks; the method comprises the following steps:
receiving a memory access instruction through the vector memory access unit; the memory access instruction comprises N memory access requests; n is an integer greater than 0; respectively obtaining N memory access request addresses according to the N memory access requests; determining M blocks corresponding to address ranges to which the N memory access request addresses belong in the plurality of memory blocks; m is an integer greater than 0 and less than or equal to N; generating M bus requests corresponding to the M blocks, and sending the M bus requests to the memory;
receiving the access instruction through an address operation unit in the vector access unit; the memory access instruction comprises the N memory access requests; respectively obtaining N memory access request addresses according to the N memory access requests, and sending the N memory access request addresses to a request fusion unit in the vector memory access unit;
receiving the N memory access request addresses sent by the address operation unit through the request fusion unit, and determining the M blocks corresponding to the address ranges to which the N memory access request addresses belong in the plurality of memory blocks; generating the M bus requests corresponding to the M blocks, and transmitting the M bus requests to the memory.
9. The method of claim 8, wherein the address operation unit includes L address operators, each address operator obtaining one memory access request address based on one memory access request in each clock cycle, and L is an integer greater than 1.
10. The method as claimed in claim 9, wherein said obtaining N memory access request addresses according to said N memory access requests respectively comprises:
dividing the N memory access requests into S request sets through the address operation unit in the vector memory access unit; s is an integer greater than 1; each request set comprises less than or equal to L memory access requests; s access request address sets are obtained according to the S request sets respectively; each memory access request address set comprises the request addresses of the memory access requests in the corresponding request set.
11. The method of claim 10, wherein S is N/L rounded up to the nearest integer.
12. The method of claim 11, wherein the vector memory access unit further comprises a data register; the obtaining of the N memory access request addresses according to the N memory access requests includes:
calculating, through the address operation unit in the vector memory access unit, an ith memory access request address set corresponding to an ith request set in the S request sets, and sending the ith memory access request address set to the data register, where i = 0, 1, 2, …, S;
and receiving the ith access request address set sent by the address operation unit through the data register in the vector access unit, and storing the ith access request address set.
13. The method as claimed in claim 12, wherein said sending said N memory access request addresses to said request fusion unit comprises:
when S memory access request address sets are stored in the data register, the S memory access request address sets are sent to the request fusion unit through the data register in the vector memory access unit;
receiving the S access request address sets sent by the data register through the request fusion unit in the vector access unit; the S memory access request address sets comprise the N memory access request addresses.
14. The method of any of claims 8-13, wherein the vector processor further comprises a vector register file; the N memory access requests are N data reading requests; the method further comprises the following steps:
receiving, through the vector memory access unit, the data stored in the M blocks and fed back by the memory based on the M bus requests respectively; and writing, based on the data stored in the M blocks, the data corresponding to the N read data requests into the vector register file respectively.
15. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method of any one of claims 8 to 14.
CN202110722536.4A 2021-06-28 2021-06-28 Vector processor and related data access method Active CN113553292B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110722536.4A CN113553292B (en) 2021-06-28 2021-06-28 Vector processor and related data access method


Publications (2)

Publication Number Publication Date
CN113553292A CN113553292A (en) 2021-10-26
CN113553292B true CN113553292B (en) 2022-04-19

Family

ID=78102424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110722536.4A Active CN113553292B (en) 2021-06-28 2021-06-28 Vector processor and related data access method

Country Status (1)

Country Link
CN (1) CN113553292B (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8086806B2 (en) * 2008-03-24 2011-12-27 Nvidia Corporation Systems and methods for coalescing memory accesses of parallel threads
US20160077751A1 (en) * 2013-04-30 2016-03-17 Hewlett-Packard Development Company, L.P. Coalescing memory access requests
US9946666B2 (en) * 2013-08-06 2018-04-17 Nvidia Corporation Coalescing texture access and load/store operations
CN104346285B (en) * 2013-08-06 2018-05-11 华为技术有限公司 Internal storage access processing method, apparatus and system
US10089115B2 (en) * 2016-07-07 2018-10-02 Intel Corporation Apparatus to optimize GPU thread shared local memory access
CN109977116B (en) * 2019-03-14 2023-04-21 超越科技股份有限公司 FPGA-DDR-based hash connection operator acceleration method and system
US11507513B2 (en) * 2019-05-24 2022-11-22 Texas Instruments Incorporated Methods and apparatus to facilitate an atomic operation and/or a histogram operation in cache pipeline
CN110187835B (en) * 2019-05-24 2023-02-03 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for managing access requests

Also Published As

Publication number Publication date
CN113553292A (en) 2021-10-26

Similar Documents

Publication Publication Date Title
US11221762B2 (en) Common platform for one-level memory architecture and two-level memory architecture
US8549231B2 (en) Performing high granularity prefetch from remote memory into a cache on a device without change in address
EP2003548B1 (en) Resource management in multi-processor system
US20150106574A1 (en) Performing Processing Operations for Memory Circuits using a Hierarchical Arrangement of Processing Circuits
JP2010532905A (en) Thread-optimized multiprocessor architecture
CN111258935B (en) Data transmission device and method
CN103777923A (en) DMA vector buffer
CN111190842B (en) Direct memory access, processor, electronic device, and data transfer method
JP6998991B2 (en) Information processing methods and equipment
CN113900974B (en) Storage device, data storage method and related equipment
CN115033188B (en) Storage hardware acceleration module system based on ZNS solid state disk
US6986028B2 (en) Repeat block with zero cycle overhead nesting
US9697163B2 (en) Data path configuration component, signal processing device and method therefor
CN113553292B (en) Vector processor and related data access method
CN111258769B (en) Data transmission device and method
US7774513B2 (en) DMA circuit and computer system
US20120066415A1 (en) Methods and systems for direct memory access (dma) in-flight status
TW202119215A (en) A system operative to share code and a method for code sharing
JP6206524B2 (en) Data transfer device, data transfer method, and program
US20180336147A1 (en) Application processor including command controller and integrated circuit including the same
US11550572B2 (en) Splitting vector instructions into microinstructions for parallel execution based on index comparisons of completed microinstructions
US20240103860A1 (en) Predicates for Processing-in-Memory
WO2023045478A1 (en) Graph task scheduling method, execution-end device, storage medium, and program product
US20240037037A1 (en) Software Assisted Hardware Offloading Cache Using FPGA
US11334355B2 (en) Main processor prefetching operands for coprocessor operations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant