CN111158757B - Parallel access device and method and chip - Google Patents


Info

Publication number
CN111158757B
CN111158757B
Authority
CN
China
Prior art keywords
lane
address
access
step length
target
Prior art date
Legal status
Active
Application number
CN201911406669.XA
Other languages
Chinese (zh)
Other versions
CN111158757A (en)
Inventor
杨龚轶凡
郑瀚寻
闯小明
周远航
Current Assignee
Zhonghao Xinying (Hangzhou) Technology Co.,Ltd.
Original Assignee
Zhonghao Xinying Hangzhou Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhonghao Xinying Hangzhou Technology Co ltd
Priority to CN201911406669.XA
Publication of CN111158757A
Application granted
Publication of CN111158757B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G06F9/3887Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]

Abstract

Embodiments of the invention disclose a parallel access method, a parallel access device, and a chip, which can be used to perform parallel store or load operations on data in the field of integrated circuit technology. An address generator generates target addresses for a plurality of lanes, and the lanes access the corresponding storage locations in RAM according to the target addresses and perform data access operations in parallel. When the address generator generates a target address for a lane, a lane step generation unit produces the lane step, which is K times the step size; under the control of the same SIMD control instruction the generated lane steps are all different, so the target addresses produced by the address generation unit cannot form an access conflict. The invention can therefore reduce the power consumption of the related hardware while guaranteeing conflict-free parallel lane access to the memory, and at the same time shortens the overall time consumed by parallel data access operations.

Description

Parallel access device and method and chip
Technical Field
The present invention relates to the field of integrated circuit technologies, and in particular, to a parallel access apparatus and method, and a chip.
Background
With the development of science and technology and the progress of society, integrated circuit design has become widely applied: more and more electronic devices enter people's daily lives, which not only brings convenience but also further drives technological innovation and research. In the field of integrated circuit design, data access is one of the most important technologies. A processor accesses memory through load and store instructions: a load instruction loads the data at the corresponding memory address into the corresponding register when the processor needs data from memory, and a store instruction stores the data in the corresponding register to the corresponding memory address when the processor needs to save data.
In application scenarios such as multimedia, big data, and artificial intelligence, data-parallel algorithms are often used; for example, a neural network requires parallel operations on a plurality of matrices, and such operations must be performed on large data sets simultaneously. Operating on a large number of data sets at once requires accessing them in parallel. These scenarios mostly adopt SIMD (Single Instruction Multiple Data) technology, which uses one controller to control a plurality of processing units, thereby achieving spatial parallelism: one control instruction processes multiple data items simultaneously. The functional units of a SIMD extension that perform load, store, and compute operations all support a number of parallel subunits, so that one SIMD instruction can operate on several elements at once; these parallel subunits are called lanes. In parallel processing it frequently happens that two or more lanes point to exactly the same location in memory, i.e. there is an address access conflict. The prior art generally resolves access conflicts by adding a conflict detection device, which arbitrates the access requests of the conflicting lanes. Parallel data access in this manner is not only inefficient, but also incurs high power consumption in the related hardware and long data access times.
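The conflict problem described above can be illustrated with a small simulation (a hypothetical sketch; the four-lane, four-bank configuration and the helper name `lane_banks` are illustrative, not from the patent). With a power-of-two number of banks, an even stride makes distinct lanes select the same bank, while an odd stride does not:

```python
# Illustrative sketch: why strided SIMD lane accesses can conflict when the
# number of memory banks is a power of two.
M = 4  # hypothetical: four lanes and four memory banks

def lane_banks(base: int, stride: int) -> list:
    """Bank selected by each of the M lanes (low bits of the address)."""
    return [(base + k * stride) % M for k in range(M)]

# Even stride: lanes 0 and 2 both select bank 0, an access conflict that
# prior-art designs resolve with a conflict detection device.
print(lane_banks(base=0, stride=2))   # [0, 2, 0, 2]

# Odd stride: every lane selects a different bank, so no conflict.
print(lane_banks(base=0, stride=3))   # [0, 3, 2, 1]
```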
Disclosure of Invention
In view of the above, the present invention provides a parallel access apparatus and method and a chip, so as to solve the problems of inefficient data access, high power consumption of the related hardware, and long data access times that arise when access conflicts in an integrated circuit are resolved by means of conflict detection.
In a first aspect, an embodiment of the present invention provides a parallel access apparatus comprising a memory and M lanes, where the memory is divided into a plurality of storage groups, the number of storage groups is not less than the number M of lanes, and M is an integer not less than 2. The apparatus further comprises an immediate heap and an address generator; the immediate heap is connected to the address generator, the address generator is connected to each lane, and each lane is connected to each storage group;
the immediate heap is used for providing address generation information and step size, and the step size is an odd number;
the address generator is used for receiving the SIMD control instruction, lane information and address generation information and generating a target address for a lane; the lane information includes a step size; the address generator includes a lane step length generation unit and an address generation unit, wherein:
the lane step length generating unit is used for generating lane step lengths according to the SIMD control instructions and the lane information, wherein the lane step lengths are K times of the step lengths, K is an integer, the value range of K is [ N, M + N-1], and N is an integer not less than 0;
the address generation unit is used for summing the address generation information and the lane step according to the control instruction, and for outputting the resulting sum as a target address to the corresponding lane according to the control instruction;
the M lanes are used for accessing the corresponding storage groups according to respective target addresses and performing access operation in parallel.
The parallel access apparatus provided by the embodiment of the invention uses the lane step generation unit to generate the lane step of each lane from a step size set to an odd number, and then uses the lane step to generate the lane's target address. This guarantees that the addresses generated for the lanes never conflict, so that multiple lanes can access the memory in parallel accurately and in order according to their respective target addresses, avoiding the access conflicts that parallel access would otherwise cause. The prior art generally resolves multi-lane parallel access conflicts with a conflict detection method or device. Compared with the prior art, no conflict detection needs to be performed on the lanes' target addresses, i.e. no conflict detection device needs to be provided in the hardware, which improves the execution efficiency of multi-lane parallel access, reduces the power consumption of the related hardware, and shortens the overall time consumed by parallel data access.
Preferably, the lane step generation unit includes an arithmetic operation device for generating the lane step. Controlled by the control instruction, the arithmetic operation device processes the lane information to obtain a value that is K times the step size in the lane information; this value is used to compute the lane's target address, so that the lane target addresses produced by the address generator are all different and no access conflict occurs when the lanes access memory.
More preferably, the arithmetic operation device includes a plurality of adders connected in cascade. The lane step is computed by progressively accumulating the step size from the lane information through the adders; this structure is simple, improves the operation rate, and reduces hardware power consumption.
More preferably, the arithmetic operation device includes an adder and a shifter, wherein the lane step generation unit generates and outputs class A lane steps using the shifter and class B lane steps using the adder. This further simplifies the hardware structure of the lane step generation unit and provides greater flexibility in selecting how lane steps are computed, further improving the efficiency of lane step generation and reducing hardware power consumption.
Preferably, the address generation information includes a base address and an offset, and the address generation unit includes two adders for summing the base address, the offset, and a lane step, the resulting sum being the target address. The base address serves as the common base address of all lanes, and since each lane's step is different, the lanes can access different storage groups in parallel without access conflicts. Setting an offset provides greater flexibility in address generation while still guaranteeing that no access conflicts occur. Using two adders to sum the base address, offset, and lane step minimizes the hardware cost.
In a second aspect, an embodiment of the present invention provides a parallel access method. Providing a memory and M lanes, the method comprising the steps of:
step 110: dividing the memory into a plurality of storage groups, wherein the number of the storage groups is not less than M;
step 120: acquiring a SIMD control instruction, sequentially generating at least two target addresses according to the SIMD control instruction, and sequentially sending the at least two target addresses to corresponding lanes among the M lanes, wherein one target address can be sent to only one lane; the process of generating a single target address includes:
acquiring lane information according to the SIMD control instruction, wherein the lane information comprises a step length which is an odd number; generating a lane step length according to the lane information, wherein the lane step length is K times of the step length, K is an integer in an interval [ N, M + N-1], and N is an integer not less than 0;
acquiring address generation information according to the SIMD control instruction, summing the address generation information and lane step length, and directly sending the obtained sum value as a target address to a corresponding lane according to the SIMD control instruction; the lane step length generated in the process of generating a single target address according to the SIMD control instruction is different each time;
step 130: after the target addresses generated according to the SIMD control instruction have all been sent to the corresponding lanes, those lanes start to run simultaneously and access the corresponding storage groups according to their respective received target addresses, performing the access operations in parallel.
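The three steps above can be sketched end to end in a few lines (a hypothetical Python model; the memory layout, data values, and helper names are illustrative assumptions, not from the patent):

```python
# End-to-end sketch of steps 110-130: divide RAM into M storage groups,
# generate one conflict-free target address per lane from an odd step size,
# then let all lanes store in parallel.
M, GROUP_SIZE = 4, 16
banks = [[None] * GROUP_SIZE for _ in range(M)]         # step 110

def gen_target(base: int, offset: int, step: int, k: int) -> int:
    return base + offset + k * step                     # step 120: lane step = k * step

addrs = [gen_target(base=0, offset=0, step=3, k=k) for k in range(M)]

for lane, addr in enumerate(addrs):                     # step 130: parallel access
    group, real = addr % M, addr // M                   # low bits are the group number
    banks[group][real] = f"data-from-lane-{lane}"

assert len({a % M for a in addrs}) == M                 # no two lanes share a group
```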
The parallel access method provided by the embodiment of the invention generates the lane step of each lane from a step size set to an odd number, and then uses the lane step to generate the lane's target address. This guarantees that the addresses generated for the lanes never conflict, so that multiple lanes can access the memory in parallel accurately and in order according to their respective target addresses, avoiding lane access conflicts caused by parallel access. The prior art generally resolves multi-lane parallel access conflicts with a conflict detection method or device. Compared with the prior art, no conflict detection needs to be performed on the lanes' target addresses, i.e. no conflict detection device needs to be provided in the hardware, which improves the execution efficiency of multi-lane parallel access, reduces the power consumption of the related hardware, and shortens the overall time consumed by parallel data access.
Preferably, the address generation information includes a base address and an offset, and the step 120 includes summing the base address, the offset and a lane step size, and using the sum as the target address. The base address is used as the common base address of each lane, and the lane step length of each lane is set to be different, so that each lane can access different storage groups in parallel, and the condition of access conflict is avoided. By setting the offset, better flexibility in address generation is provided in situations where it is guaranteed that there are no access conflicts.
Preferably, the aforementioned method further provides a lane step generation unit including an arithmetic operation device that is controlled by the SIMD control instruction to generate the lane steps. Under the control of the instruction, the arithmetic operation device processes the lane information to obtain a value that is K times the step size in the lane information, where K is an integer not less than 0; this value is used to compute the lane's target address, which guarantees that the lane target addresses produced by the address generator are all different and that no access conflict occurs when the lanes access memory.
In particular, in the aforementioned parallel access method, each storage group has its own group number, the number of lanes is M, and the low log2(M) bits of the target address form the group number. The number of bits used to represent the group number in the target address is determined by the number of lanes, and the group number is recorded directly in the generated target address, so no separate step or device for recording group numbers is needed, which simplifies the related hardware.
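The encoding of the group number into the low bits of the target address can be sketched as follows (a hypothetical helper, assuming M is a power of two as in the later four-lane examples):

```python
import math

def split_target_address(target: int, M: int):
    """Split a target address into (real address, group number): the low
    log2(M) bits are the group number, and the remaining high bits are the
    real address that selects a bank inside the group (M a power of two)."""
    bits = int(math.log2(M))
    return target >> bits, target & (M - 1)

# With M = 4 lanes, the low 2 bits of 0b101101 give group 0b01 = 1 and the
# real address 0b1011 = 11.
assert split_target_address(0b101101, M=4) == (0b1011, 0b01)
```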
In a third aspect, an embodiment of the present invention provides a chip. The chip includes a computer-readable storage medium for storing a computer program; the chip also comprises a processor, wherein the processor comprises the parallel access device disclosed by the foregoing; the processor is configured to implement the steps of the aforementioned parallel access method when executing the computer program stored in the readable storage medium.
The invention can be further combined to provide more implementation modes on the basis of the implementation modes provided by the aspects.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flow chart of a parallel access method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a parallel access device 200 according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an address generator 300 according to an embodiment of the present invention;
fig. 4(a) is a schematic structural diagram of a lane step size generating unit 410 according to an embodiment of the present invention;
fig. 4(B) is a schematic structural diagram of a lane step size generating unit 420 according to an embodiment of the present invention;
fig. 4(C) is a schematic structural diagram of a lane step size generating unit 430 according to an embodiment of the present invention;
FIG. 5 is a block diagram of an address generator 500 according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a lane 600 according to an embodiment of the present invention;
FIG. 7 is a sample of data that needs to be stored according to an embodiment of the present invention;
FIG. 8 is a RAM memory provided by an embodiment of the present invention for storing the data of FIG. 7;
fig. 9 is a schematic structural diagram of a four-lane parallel access device 900 according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of a lane step generation unit 1000 that computes steps for four lanes according to an embodiment of the present invention;
FIG. 11 is a state diagram of the memory 800 resulting from the storage of the first element of each matrix of FIG. 7;
FIG. 12 is a memory state diagram of the memory 800 after all elements of each matrix in FIG. 7 have been stored;
fig. 13 is a schematic structural diagram of a chip 1300 according to an embodiment of the present invention;
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that when an element is referred to as being "connected to" another element, or "coupled" to one or more other elements, it can be directly connected to the other element or be indirectly connected to the other element.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The following specifically describes embodiments of the present invention.
Fig. 1 is a schematic flow chart of a parallel access method according to an embodiment of the present invention. The method can realize the simultaneous data access operation of a plurality of lanes under the control of the SIMD instruction. As shown in fig. 1, the method comprises the steps of:
step 110: the RAM memory is divided into a plurality of storage groups. The memory mentioned in the embodiment of the invention is RAM, i.e. random access memory. The RAM may be divided into a number of storage groups, each comprising a number of memory banks for storing data. The embodiment of the invention divides the memory into a plurality of storage groups and assigns each a group number, where the number of storage groups is not less than the number of lanes: if the number of lanes is M, with M not less than 2, then the number of storage groups is greater than or equal to M, and preferably equal to the number of lanes.
Step 120: a plurality of target addresses are sequentially generated according to the SIMD control instruction and sent to the corresponding lanes. When a lane needs to perform a data access operation, it must first acquire a target address; the lane then accesses the memory bank in the storage group to which the target address points to perform the operation. At least two target addresses are generated in sequence according to the SIMD control instruction and sent directly to corresponding lanes among the M lanes, where one target address can be sent to only one lane. The specific number of target addresses, determined by the SIMD control instruction, may be any integer in [2, M]. The process of generating a single target address includes:
obtaining lane information according to the SIMD control instruction, wherein the lane information comprises a step length which is an odd number; generating a lane step length according to lane information by a lane step length generating unit, wherein the lane step length is K times of the step length, K is an integer, the value range of K is [ N, M + N-1], and N is an integer not less than 0;
fetch address generationThe address generation information comprises a base address and an offset; summing the address generation information and the lane step length according to the SIMD control instruction, namely summing a base address, an offset and the lane step length, and directly sending the obtained sum value serving as a target address to a lane; the target address is composed of a real address and a group number, wherein the low log in the target address2M bits are the group number, the rest bits are the real address, and the real address points to a memory bank. The lane step size generated in each process of generating a single target address according to the SIMD control instruction is different.
Step 130: the plurality of lanes that have acquired target addresses start to operate simultaneously and access the corresponding storage groups in parallel to perform the access operation. After the lanes acquire their respective target addresses, they start to operate simultaneously, each accessing the storage group corresponding to the group number contained in its target address, then accessing the corresponding memory bank according to the real address, and performing the data access operations in parallel.
The parallel access method provided by the embodiment of the invention generates the lane step of each lane from a step size set to an odd number, and then uses the lane step to generate the lane's address. This guarantees that the addresses generated for the lanes never conflict, so that multiple lanes can access the memory in parallel accurately and in order according to their respective target addresses, and no access conflict occurs during parallel access. The prior art generally resolves multi-lane parallel access conflicts with a conflict detection method or device. Compared with the prior art, no conflict detection needs to be performed on the lanes' target addresses, i.e. no conflict detection device needs to be provided in the related hardware, which improves the execution efficiency of multi-lane parallel access, reduces the power consumption of the related hardware, and shortens the overall time consumed by parallel data access.
Fig. 2 is a schematic structural diagram of a parallel access apparatus 200 according to an embodiment of the present invention. As shown in fig. 2, the parallel access apparatus 200 includes an immediate heap 210, an address generator 220, M lanes 230 with M not less than 2, and a RAM memory 240; the RAM memory 240 is divided into Q storage groups 241, each storage group 241 comprising a plurality of memory banks for storing data, where Q is greater than or equal to M, and preferably Q equals M. The immediate heap 210 is connected to the address generator 220; the address generator 220 is connected to each lane 230; each memory bank has a read/write port, and each lane 230 is connected to the read/write ports (not shown) of each storage group 241 in the RAM memory 240.
The immediate heap 210 is used to provide address generation information and a step size, where the step size is an odd number. The immediate heap 210 sends the address generator 220 the corresponding address generation information and step size according to the request signal when receiving an external request signal, where the request signal may be sent by the address generator 220 or sent by another external control device.
The address generator 220 is configured to receive control instructions and lane information as well as address generation information from the immediate heap 210. The control instruction comprises a SIMD instruction decoded by a decoder; the lane information includes the step size from the immediate heap 210 and also includes the value 0, which is provided by an external device; the address generation information includes a base address and also an offset.
Fig. 3 is a schematic structural diagram of an address generator 300 according to an embodiment of the present invention. As shown in fig. 3, the address generator 300 includes a lane step generation unit 301 and an address generation unit 302. The lane step length generating unit 301 is configured to receive lane information, and calculate a lane step length according to a control instruction, where the lane step length is K times of the step length in the lane information, K is an integer, a value range of K is [ N, M + N-1], N is an integer not less than 0, and the generated lane step lengths are different under the control of the same SIMD control instruction. The lane step length generating unit 301 generates a lane step length and then sends the lane step length to the address generating unit 302, and the address generating unit 302 receives the lane step length and the address generating information and generates a target address for the lane according to the lane step length and the address generating information.
In a preferred embodiment, the lane step generation unit generates the lane steps using a shifter and an adder. Fig. 4(A) is a schematic structural diagram of a lane step generation unit 410 according to an embodiment of the present invention. As shown in fig. 4(A), the lane step generation unit 410 includes transmission lines 411 and 412, a shifter 413, and an adder 414. Suppose M lanes are used to perform the data access operation, numbered lane 0, lane 1, lane 2, ..., lane M-1. The lane step generation unit receives lane information comprising the value 0 from an external device and the step size from the immediate heap. The transmission line 411 is used to take 0 as the lane step of lane 0 according to the SIMD control instruction and transmit it directly to the address generation unit; the transmission line 412 is used to transmit the step size received from the immediate heap directly to the address generation unit as the lane step of lane 1; both the 0 received from the external device and the step size received from the immediate heap are lane information. The shifter 413 is used to shift the step size received from the immediate heap according to the SIMD control instruction and to output the shift result directly to the address generation unit as the lane step of a class A lane, where a class A lane is a lane whose serial number is 2^P, P being an integer not less than 1; the lane step output directly by the shifter 413 is also referred to as a class A lane step.
The adder 414 is used to sum the shift result output by the shifter 413 and the step size received from the immediate heap according to the SIMD control instruction, and to output the resulting sum directly to the address generation unit as the lane step of a class B lane, where a class B lane is any lane other than lane 0, lane 1, and the class A lanes; the lane step output directly by the adder 414 is also referred to as a class B lane step. In addition, an adder 414 may receive the shift results of two shifters 413, compute their sum, and output it as a class B lane step; it may receive the output of another adder 414 and the step size received from the immediate heap, compute their sum, and output it as a class B lane step; or it may receive the output of another adder 414 and the output of a shifter 413, compute their sum, and output it as a class B lane step.
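The shifter-and-adder scheme of Fig. 4(A) can be emulated in software (a hypothetical sketch; the hardware wires fixed units, while this model computes recursively, and the function name `lane_step` is illustrative): lanes 0 and 1 take their steps from the transmission lines, a class A lane numbered 2^P takes step << P from the shifter, and a class B lane sums a shifted step with a smaller lane step, just as the adder 414 combines shifter and adder outputs.

```python
# Software model of the Fig. 4(A) lane step generation unit.
def lane_step(k: int, step: int) -> int:
    if k in (0, 1):                          # transmission lines 411 and 412
        return k * step                      # 0 for lane 0, step for lane 1
    if (k & (k - 1)) == 0:                   # class A lane: k == 2**P
        return step << (k.bit_length() - 1)  # shifter 413: step * 2**P
    p = 1 << (k.bit_length() - 1)            # largest power of two below k
    # Class B lane: adder 414 sums a shifter output and a smaller lane step.
    return (step << (p.bit_length() - 1)) + lane_step(k - p, step)

# Every lane step equals K * step for K in [0, M-1], so no two coincide.
assert [lane_step(k, 5) for k in range(8)] == [0, 5, 10, 15, 20, 25, 30, 35]
```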
In another preferred embodiment, the lane step generation unit generates the lane steps using a plurality of adders. Fig. 4(B) is a schematic structural diagram of a lane step generation unit 420 according to an embodiment of the present invention. As shown in fig. 4(B), the lane step generation unit 420 includes transmission lines 421 and 422 and a plurality of adders 423 connected in cascade. Suppose M lanes are used to perform the data access operation, numbered lane 0, lane 1, lane 2, lane 3, ..., lane M-1. The lane step generation unit receives lane information comprising the value 0 from an external device and the step size from the immediate heap. The transmission line 421 is used to take the 0 in the lane information as the lane step of lane 0 according to the SIMD control instruction and transmit it directly to the address generation unit; the transmission line 422 is used to transmit the step size received from the immediate heap directly to the address generation unit as the lane step of lane 1. The adders 423 are connected in cascade: an adder 423 may receive two copies of the step size from the immediate heap, compute their sum, and output it directly to the address generation unit as the lane step of lane 2; it may output its own result to another adder 423; or it may sum the result received from another adder 423 with the step size from the immediate heap and output the resulting sum directly to the address generation unit as a lane step.
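Likewise, the cascaded-adder scheme of Fig. 4(B) reduces to a running accumulation of the step size (a hypothetical sketch; `cascade_lane_steps` is an illustrative name, not from the patent):

```python
# Software model of the Fig. 4(B) unit: lanes 0 and 1 come from the
# transmission lines; each further adder in the cascade adds one more step.
def cascade_lane_steps(step: int, M: int) -> list:
    steps = [0, step]                  # transmission lines 421 and 422
    acc = step
    for _ in range(2, M):
        acc += step                    # one cascaded adder 423 per extra lane
        steps.append(acc)
    return steps[:M]

assert cascade_lane_steps(step=7, M=4) == [0, 7, 14, 21]
```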
In another preferred embodiment, the lane step length generating unit generates the lane step lengths using a plurality of multipliers. Fig. 4(C) is a schematic structural diagram of a lane step length generating unit 430 according to an embodiment of the present invention. As shown in fig. 4(C), the lane step length generating unit 430 includes M multipliers 431, namely multiplier 0, multiplier 1, ..., multiplier M-1, where 0, 1, ..., M-1 are the multiplier numbers. Each multiplier has a fixed value written into it, equal to its own number. Suppose that M lanes are used to perform the access operation, numbered lane 0, lane 1, lane 2, lane 3, ..., lane M-1. The lane step length generating unit receives lane information including the step length from the immediate stack. Multiplier 0 receives the step length from the immediate stack, multiplies it by the value fixedly written into it to obtain the lane step length of lane 0, and outputs that lane step length directly to the address generation unit. The other multipliers work in the same way as multiplier 0: each receives the step length from the immediate stack, multiplies it by its fixed value, computes the lane step length of the lane with the same number as the multiplier, and outputs it directly to the address generation unit. In each of the three preferred embodiments described above, the lane step length generated for each lane is K times the step length in the lane information, with K ranging over [0, M-1]. In some other preferred embodiments, the range of K can be set according to specific conditions; the selectable interval is [N, M+N-1], where N is an integer not less than 1.
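The three embodiments above (shifter plus adder, cascaded adders, and fixed-constant multipliers) all compute the same quantity: a lane step length of K times the step length for lane K. A minimal Python sketch, with hypothetical function names not taken from the patent, models each variant:

```python
# Illustrative models (hypothetical names, not from the patent) of the
# three lane-step-length generation embodiments. Each returns the lane
# step lengths for lanes 0..M-1, i.e. K * step for K in [0, M-1].

def lane_steps_shifter_adder(step, m):
    """Fig. 4(A)-style: transmission lines for lanes 0 and 1, a left
    shift for power-of-two lanes (class A), shifter + adder sums for
    the remaining lanes (class B)."""
    steps = [0, step]                          # lanes 0 and 1: pass-through
    for k in range(2, m):
        shift = k.bit_length() - 1             # largest power of two <= k
        if k & (k - 1) == 0:                   # k is a power of two: one shift
            steps.append(step << shift)
        else:                                  # shift result + smaller sum
            steps.append((step << shift) + steps[k - (1 << shift)])
    return steps

def lane_steps_cascaded_adders(step, m):
    """Fig. 4(B)-style: each cascaded adder adds one more step length."""
    steps = [0, step]
    for _ in range(2, m):
        steps.append(steps[-1] + step)
    return steps

def lane_steps_multipliers(step, m):
    """Fig. 4(C)-style: multiplier K holds the fixed constant K."""
    return [k * step for k in range(m)]
```

All three variants agree on the output; the hardware choice trades shifter and adder area against multiplier area, not the result.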
In fig. 3, the address generation unit 302 is configured to receive the lane step length sent by the lane step length generating unit and the address generation information sent by the immediate stack 210, and to sum the lane step length and the address generation information; the resulting sum is the target address of one lane. Fig. 5 is a schematic structural diagram of a preferred address generation unit 500 according to an embodiment of the present invention. As shown in fig. 5, the address generation unit 500 includes a first adder 501 and a second adder 502. The first adder 501 receives the address generation information, computes the sum of the base address and the offset in the address generation information, and sends the result to the second adder 502; the second adder 502 receives the lane step length and the result sent by the first adder 501, sums the two, and outputs the resulting sum as the target address of the lane. In another embodiment, the roles are swapped: the second adder 502 receives the address generation information, computes the sum of the base address and the offset, and sends the result to the first adder 501; the first adder 501 then sums the lane step length with the result sent by the second adder 502 and outputs the resulting sum as the target address of the lane.
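The two-adder datapath described above can be modeled in a few lines. This is an illustrative sketch with assumed names, not the patent's implementation; it also shows why the order of the two additions in the swapped embodiment does not change the result:

```python
# Sketch (assumed names) of the two-adder address generation unit:
# one adder sums base and offset, the other adds the lane step length.

def target_address(base, offset, lane_step):
    partial = base + offset            # first adder 501: base + offset
    return partial + lane_step         # second adder 502: + lane step length

def target_address_swapped(base, offset, lane_step):
    partial = base + offset            # second adder computes base + offset
    return lane_step + partial         # first adder adds the lane step length
```

Because addition is associative and commutative, both orderings yield the same target address.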
In fig. 2, each lane 230 receives its target address; the lanes then start running at the same time, locate their storage groups according to the group numbers in their respective target addresses, and access the corresponding storage banks according to the physical addresses to perform the access operation. Fig. 6 is a schematic structural diagram of a lane 600 according to an embodiment of the present invention. As shown in fig. 6, the lane 600 includes control judgment logic 601, a register file 602, and an arithmetic logic unit (ALU) 603. The control judgment logic 601 receives the target address and identifies the group number and physical address in it, locating the storage group to be accessed by the lane according to the group number and the storage bank within that group according to the physical address. The register file 602 includes a plurality of registers, among them a target register for storing a source operand, where the source operand is the data the lane loads from memory for computation by the ALU 603; the register file 602 also includes a result register for storing the results of operations performed by the ALU 603. When executing a load instruction, the lane loads the data at the target address in memory into the target register; when executing a store instruction, it stores the data in the result register to the target address in memory.
The parallel access device provided by the embodiment of the invention uses the lane step length generating unit to generate each lane's lane step length from a step length that is set to an odd number, and then uses the lane step length to generate the lane's target address. This guarantees that the addresses generated for the lanes never conflict, so that the access operations of multiple lanes accessing the memory in parallel can proceed accurately and in order according to their respective target addresses, avoiding the access conflicts that parallel lane access would otherwise cause. In the prior art, the problem of multi-lane parallel access conflicts is generally solved by providing a conflict detection method or device. Compared with the prior art, no conflict detection needs to be performed on the lanes' target addresses, i.e., no conflict detection device needs to be provided in the related hardware, which improves the execution efficiency of multi-lane parallel access operations, reduces the power consumption of the related hardware, and shortens the overall time consumed by parallel data access.
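The no-conflict guarantee can be made concrete. When the lane count M is a power of two and the group number is the low log2(M) bits of the address, an odd step length is coprime to M, so the residues K*step mod M for K = 0..M-1 are pairwise distinct and every lane lands in a different storage group. A small check, illustrative and not taken from the patent text:

```python
# Why an odd step length avoids group conflicts: with M lanes, M a
# power of two, the group number is the target address modulo M, and
# an odd step is coprime to M, so K * step mod M for K = 0..M-1
# visits every group exactly once.

def group_numbers(step, m, base=0, offset=0):
    # group number = low log2(m) bits of each lane's target address
    return [(base + offset + k * step) % m for k in range(m)]

print(group_numbers(3, 4))   # [0, 3, 2, 1]: odd step, all groups distinct
print(group_numbers(2, 4))   # [0, 2, 0, 2]: even step, lanes collide
```

An even step length shares a factor of two with M, so some lanes would map to the same group and a read-write port conflict would result.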
For a better understanding of the invention, a simple example of the working process of the aforementioned parallel access method and parallel access device is now given. Fig. 7 shows a sample of data to be stored according to an embodiment of the present invention. As shown in fig. 7, the data sample includes 4 matrices, all of which have already been stored in the corresponding target registers in the register files of the lanes. Fig. 8 shows a RAM memory for storing the data of fig. 7 according to an embodiment of the present invention. As shown in fig. 8, the RAM memory is divided into four storage groups; since the number of lanes M equals 4, the low 2 bits (i.e., log2(4) = 2 bits) of the physical address in the RAM memory form the group number. Each storage group has its own read-write port and contains four storage banks, each bank corresponding to one physical address: the banks with physical addresses 0000, 0100, 1000, and 1100 form the storage group with group number 00; the banks with physical addresses 0001, 0101, 1001, and 1101 form the group with group number 01; the banks with physical addresses 0010, 0110, 1010, and 1110 form the group with group number 10; and the banks with physical addresses 0011, 0111, 1011, and 1111 form the group with group number 11.
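Under this layout, decoding a 4-bit physical address into a group number is a mask of the low bits. A brief sketch with an assumed helper name:

```python
import math

# Sketch of the address split of fig. 8: with M = 4 storage groups,
# the low log2(4) = 2 bits of the 4-bit physical address are the
# group number; banks that differ only in their high bits share a group.

def group_number(address, m=4):
    group_bits = int(math.log2(m))             # 2 bits when M = 4
    return address & ((1 << group_bits) - 1)   # mask off the low bits

# Banks 0000, 0100, 1000, 1100 all belong to group 00:
print([group_number(a) for a in (0b0000, 0b0100, 0b1000, 0b1100)])  # [0, 0, 0, 0]
```

The same mask generalizes to any power-of-two M; only the number of masked bits changes.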
To store the four matrices shown in fig. 7 in parallel in the memory shown in fig. 8, four lanes are required to access the memory simultaneously. Fig. 9 is a schematic structural diagram of a four-lane parallel access device 900 according to an embodiment of the present invention. As shown in fig. 9, the parallel access device 900 includes an immediate stack 910, an address generator 920, four lanes 930, and a RAM memory 940 with four storage groups 941. After receiving the SIMD store control instruction decoded by the decoder, the address generator 920 obtains the relevant information according to the instruction, sequentially generates four target addresses, and sends each generated target address to the corresponding lane; upon receiving their target addresses, the lanes start simultaneously and access the RAM memory 940 in parallel to perform the store operation. The RAM memory 940 is identical to the memory 800 shown in fig. 8. One SIMD store control instruction performs the store operation for one element of each matrix. The specific process is as follows:
the address generator 920 sends a request for address generation information and lane information to the immediate stack 910 according to the received SIMD control instruction. Upon receiving the request, the immediate stack 910 sends the corresponding address generation information and lane information to the address generator 920; the address generation information includes a base address and an offset, and the lane information includes a step length and the value 0 (i.e., 0000). Here, for ease of understanding, the base address is set to 0000 and the offset is set to 0000; in some other embodiments, the base address and offset may take other values. The step length is set to 3, i.e., 0011. Note that the step lengths provided in this application may also be other odd numbers; step length 3 is merely an illustrative example chosen for ease of understanding. Fig. 10 is a schematic structural diagram of a lane step length generating unit 1000 for four lanes according to an embodiment of the present invention. The address generator 920 receives the address generation information and the lane information and generates the lane step lengths using the lane step length generating unit 1000 shown in fig. 10. As shown in fig. 10, the lane step length generating unit 1000 includes a transmission line 1001, a transmission line 1002, a shifter 1003, and an adder 1004. The target addresses are generated using the address generation unit 500 shown in fig. 5.
When generating the target address of lane 0, the lane step length generating unit 1000 receives lane information including 0000 from the external device and the step length 0011 from the immediate stack 910. According to the SIMD control instruction, the lane step length generating unit 1000 uses the transmission line 1001 to output the 0000 in the lane information directly to the address generation unit 500 as the lane step length of lane 0 (i.e., 0 times the step length 0011). After receiving the lane step length 0000 and the address generation information (i.e., the base address 0000 and the offset 0000), the address generation unit 500 adds the base address 0000 and the offset 0000 in the first adder 501 and sends the result 0000 to the second adder 502; the second adder 502 adds the lane step length 0000 to the result 0000 from the first adder 501, giving 0000 as the target address of lane 0, and the target address 0000 is sent directly to lane 0. After the target address of lane 0 has been generated, the address generator 920 continues by generating the target address for lane 1.
When generating the target address of lane 1, the lane step length generating unit 1000 receives lane information including 0000 from the external device and the step length 0011 from the immediate stack 910. According to the SIMD control instruction, the lane step length generating unit 1000 uses the transmission line 1002 to output the step length 0011 in the lane information directly to the address generation unit 500 as the lane step length of lane 1 (i.e., 1 times the step length 0011). After receiving the lane step length 0011 and the address generation information (i.e., the base address 0000 and the offset 0000), the address generation unit 500 adds the base address 0000 and the offset 0000 in the first adder 501 and sends the result 0000 to the second adder 502; the second adder 502 adds the lane step length 0011 to the result 0000 from the first adder 501, giving 0011 as the target address of lane 1, and the target address 0011 is sent directly to lane 1. After the target address of lane 1 has been generated, the address generator 920 continues by generating the target address for lane 2.
When generating the target address of lane 2, the lane step length generating unit 1000 receives lane information including 0000 from the external device and the step length 0011 from the immediate stack 910. According to the SIMD control instruction, the lane step length generating unit 1000 sends the step length 0011 to the shifter 1003, which shifts it left by one bit to obtain the shift result 0110; this shift result is output directly to the address generation unit 500 as the lane step length of lane 2 (i.e., 2 times the step length 0011). After receiving the lane step length 0110 and the address generation information (i.e., the base address 0000 and the offset 0000), the address generation unit 500 adds the base address 0000 and the offset 0000 in the first adder 501 and sends the result 0000 to the second adder 502; the second adder 502 adds the lane step length 0110 to the result 0000 from the first adder 501, giving 0110 as the target address of lane 2, and the target address 0110 is sent directly to lane 2. After the target address of lane 2 has been generated, the address generator 920 continues by generating the target address for lane 3.
When generating the target address of lane 3, the lane step length generating unit 1000 receives lane information including 0000 from the external device and the step length 0011 from the immediate stack 910. According to the SIMD control instruction, the lane step length generating unit 1000 sends the step length 0011 to the shifter 1003, which shifts it left by one bit to obtain the shift result 0110 and sends it to the adder 1004; at the same time, the lane step length generating unit 1000 sends the step length 0011 to the adder 1004. The adder 1004 sums the received shift result 0110 and the step length 0011 and outputs the resulting sum 1001 to the address generation unit 500 as the lane step length of lane 3 (i.e., 3 times the step length 0011). After receiving the lane step length 1001 and the address generation information (i.e., the base address 0000 and the offset 0000), the address generation unit 500 adds the base address 0000 and the offset 0000 in the first adder 501 and sends the result 0000 to the second adder 502; the second adder 502 adds the lane step length 1001 to the result 0000 from the first adder 501, giving 1001 as the target address of lane 3, and the target address 1001 is sent directly to lane 3.
After the target addresses of the four lanes have been generated, the SIMD control instruction controls the four lanes to start running simultaneously; in the memory 800 they access, in parallel, the storage groups corresponding to the group numbers in their target addresses, and within those groups access the storage banks corresponding to the physical addresses in the target addresses to store the matrix elements. Lane 0 stores the first element 1 of matrix a1 into bank 00 of storage group 00, i.e., the bank with physical address 0000 in fig. 8; lane 1 stores the first element 2 of matrix a2 into bank 00 of storage group 11, i.e., the bank with physical address 0011; lane 2 stores the first element 3 of matrix a3 into bank 01 of storage group 10, i.e., the bank with physical address 0110; lane 3 stores the first element 4 of matrix a4 into bank 10 of storage group 01, i.e., the bank with physical address 1001. The state of the memory 800 after this store completes is shown in fig. 11.
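The four target addresses of this example, and the fact that they fall into four distinct storage groups, can be checked with a short sketch (constants taken from the example above; the code itself is illustrative, not the hardware):

```python
# Recomputation of the worked example: step length 3 (0011), base
# address 0000, offset 0000, M = 4 lanes.

M, STEP, BASE, OFFSET = 4, 0b0011, 0b0000, 0b0000

lane_steps = [k * STEP for k in range(M)]          # 0, 3, 6, 9
targets = [BASE + OFFSET + s for s in lane_steps]  # target address per lane
groups = [t & 0b11 for t in targets]               # low 2 bits = group number

print([format(t, "04b") for t in targets])  # ['0000', '0011', '0110', '1001']
print(groups)                               # [0, 3, 2, 1]: all four groups hit
```

Each lane therefore addresses a different storage group, and the four single-ported groups can be written in the same cycle without conflict.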
The process of storing the remaining three elements of each matrix is similar, except that the SIMD store control instruction for the second element directs the address generator 920 to obtain the base address 0100 from the immediate stack 910; the instruction for the third element directs it to obtain the base address 1000; and the instruction for the fourth element directs it to obtain the base address 1100. The state of the memory 800 after all stores complete is shown in fig. 12. That is, completing the storage of the four elements of each matrix requires four SIMD control instructions, each of which controls the whole sequence of operations from address generation to the lanes accessing the memory to store the data.
The process of data reading (loading) is similar to the store process. In the store process, after the address generator generates a target address for each lane, the four lanes start simultaneously and store the matrix data held in their register files into the corresponding target addresses in the memory in parallel. In the read process, after the address generator generates a target address for each lane using the same step length, base address, and offset as in the store process, the four lanes start running simultaneously and load the matrix data stored at each lane's target address in the memory into that lane's register file in parallel. The detailed process is not repeated here. It should be understood that the above description uses four lanes only as an example of the operation of the parallel access method and parallel access device provided by the present invention; it does not mean that the invention can only realize four-lane parallel access. From the principle of the four-lane case, the parallel access method and device provided by the invention can easily be generalized to M lanes, where M is an integer not less than 2.
Fig. 13 is a schematic structural diagram of a chip 1300 according to an embodiment of the present invention. The chip 1300 shown in fig. 13 includes one or more processors 1301, a communication interface 1302, and a computer-readable storage medium 1303, and the processors 1301, the communication interface 1302, and the computer-readable storage medium 1303 may be connected by a bus, or may implement communication by other means such as wireless transmission. The embodiment of the present invention is exemplified by connection via a bus 1304. The computer-readable storage medium 1303 is used for storing instructions, and the processor 1301 includes the parallel access apparatus disclosed in the above embodiments, and is used for executing the instructions stored in the computer-readable storage medium 1303. In another embodiment, the computer-readable storage medium 1303 is used for storing a program code, and the processor 1301 may call the program code stored in the computer-readable storage medium 1303 to implement the related functions of the parallel access apparatus, which may be specifically referred to the related descriptions in the foregoing embodiments, and will not be described herein again.
It should be understood that, in the embodiments of the present invention, the Processor 1301 may be a Central Processing Unit (CPU), and the Processor may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The communication interface 1302 may be a wired interface (e.g., an Ethernet interface) or a wireless interface (e.g., a cellular network interface or a wireless local area network interface) for communicating with other modules or devices. For example, the communication interface 1302 in the embodiment of the present application may specifically be configured to receive input data from a user, or to receive data from an external device, etc.
The computer-readable storage medium 1303 may include volatile memory, such as random access memory (RAM); it may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); it may also include a combination of the above kinds of memory. The memory may be configured to store a set of program code, so that the processor can invoke the program code stored in the computer-readable storage medium to implement the aforementioned parallel access method or the related functions of the parallel access apparatus.
It should be noted that fig. 13 is only one possible implementation of the embodiment of the present invention; in practical applications, the chip may include more or fewer components, which is not limited here. For content not shown or described in this embodiment, refer to the relevant explanation in the foregoing method embodiment, which is not repeated here.
An embodiment of the present invention further provides a computer-readable storage medium storing instructions that, when run on a processor, implement the flow of the foregoing parallel access method. The storage medium includes a ROM/RAM, a magnetic disk, an optical disk, and the like.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of the two; to illustrate clearly the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the terminal device and the unit described above may refer to corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or in whole or in part, can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The embodiment of the invention discloses a parallel access method, a parallel access device, a processor, a chip, and a computer-readable storage medium, which can be used to realize data storage or reading operations in the technical field of integrated circuits. Target addresses are generated by an address generator for a plurality of lanes to be executed in parallel, and the lanes access the corresponding locations in the RAM according to the target addresses to perform data access operations in parallel. When the address generator generates a target address for a lane, the lane step length generating unit generates a lane step length equal to K times the step length; under the control of the same SIMD instruction, the lane step length generated for each lane is different, so the target addresses that the address generation unit produces from the lane step length, the base address, and the offset do not conflict; that is, access conflicts are avoided when the address generator generates addresses for the lanes. Compared with the prior art, the present invention therefore no longer needs to provide an address conflict detection device, and the steps related to address conflict detection can be omitted. Thus, on the premise that the parallel lanes correctly access the storage groups in the memory, the power consumption of the related hardware can be reduced while the overall time consumed by parallel data access operations is shortened.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A parallel access apparatus comprising a memory and M lanes, the memory comprising a plurality of storage groups, characterized in that the number of the storage groups is not less than M; the apparatus further comprises an immediate stack and an address generator, wherein the immediate stack is connected with the address generator, the address generator is connected with each lane, and each lane is connected with each storage group;
the immediate stack is configured to provide address generation information and a step length, the step length being an odd number;
the address generator is configured to receive a SIMD control instruction, lane information, and the address generation information, and to generate a target address for the lane; the lane information includes the step length; the address generator includes a lane step length generating unit and an address generation unit, wherein:
the lane step length generating unit is used for generating a lane step length according to the SIMD control instruction and the lane information, wherein the lane step length is K times of the step length, K is an integer, the value range of K is [ N, M + N-1], and N is an integer not less than 0;
the address generation unit is used for summing the address generation information and the lane step length according to the SIMD control instruction, and outputting the obtained sum value serving as the target address to a corresponding lane;
and the M lanes are used for accessing the corresponding storage groups according to the respective target addresses and performing access operation in parallel.
2. The access apparatus according to claim 1, wherein the lane step length generating unit includes an arithmetic operation device for generating the lane step length.
3. The access apparatus according to claim 2, wherein the arithmetic operation device comprises a plurality of adders connected in cascade.
4. The access apparatus according to claim 2, wherein the arithmetic operation device includes an adder and a shifter, and the lane step length generating unit outputs a class A lane step length with the shifter or a class B lane step length with the adder.
5. The access apparatus according to claim 1, wherein the address generation information includes a base address and an offset, and the address generation unit includes at least two adders for adding the base address, the offset, and the lane step length, taking the resulting sum as the target address.
6. A parallel access method providing a memory and M lanes, the method comprising the steps of:
step 110, dividing the memory into a plurality of storage groups, wherein the number of the storage groups is not less than M;
step 120, acquiring a SIMD control instruction, sequentially generating at least two target addresses according to the SIMD control instruction, and sending the at least two target addresses to corresponding lanes in the M lanes, wherein one target address can only be sent to one lane; the process of generating a single said target address comprises:
obtaining lane information according to the SIMD control instruction, wherein the lane information comprises a step length which is an odd number; generating a lane step length according to the lane information, wherein the lane step length is K times of the step length, K is an integer in an interval [ N, M + N-1], and N is an integer not less than 0;
acquiring address generation information according to the SIMD control instruction, summing the address generation information and the lane step length, and directly sending the obtained sum value serving as the target address to the corresponding lane according to the SIMD control instruction; the lane step length generated in the process of generating a single target address according to the SIMD control instruction is different each time;
and step 130, after all the target addresses generated according to the SIMD control instruction are sent to corresponding lanes, the corresponding lanes simultaneously start to run, the corresponding storage groups are accessed according to the received target addresses, and access operation is carried out in parallel.
7. The access method according to claim 6, wherein the address generation information includes a base address and an offset, and step 120 includes summing the base address, the offset, and the lane step length, taking the sum as the target address.
8. The access method according to claim 6, further providing a lane step length generating unit including an arithmetic operation device, the SIMD control instruction controlling the arithmetic operation device to generate the lane step length.
9. The access method according to any of claims 6-8, wherein each of said storage groups has a respective group number, and the low log₂(M) bits of said target address are the group number.
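Claims 7 and 9 together suggest how a target address maps to its storage group: sum the base address, the offset and the lane step length, then take the low log₂(M) bits of the sum as the group number. A hedged sketch, again assuming M is a power of two (identifiers invented for illustration):

```python
# Illustrative sketch of target-address generation (claim 7) and
# group-number extraction from the low log2(M) bits (claim 9).

M = 8                             # number of storage groups (power of two)
GROUP_BITS = M.bit_length() - 1   # log2(M) = 3 low bits select the group

base, offset, step = 0x1000, 16, 3   # step length must be odd

# One target address per lane: base + offset + K * step, K in [0, M - 1]
targets = [base + offset + k * step for k in range(M)]
groups = [addr & ((1 << GROUP_BITS) - 1) for addr in targets]

# The M addresses select M distinct groups, so the lanes never collide.
assert sorted(groups) == list(range(M))
```

Because the group number is just a bit-mask of the address, no divider or lookup is needed in hardware; the low bits route each lane's request to its storage group directly.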
10. A chip, comprising:
a computer-readable storage medium for storing a computer program;
a processor comprising at least the access device of any one of claims 1-5; the processor is adapted to carry out the steps of the method according to any of claims 6-9 when executing the computer program.
CN201911406669.XA 2019-12-31 2019-12-31 Parallel access device and method and chip Active CN111158757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911406669.XA CN111158757B (en) 2019-12-31 2019-12-31 Parallel access device and method and chip


Publications (2)

Publication Number Publication Date
CN111158757A (en) 2020-05-15
CN111158757B (en) 2021-11-30

Family

ID=70559647

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911406669.XA Active CN111158757B (en) 2019-12-31 2019-12-31 Parallel access device and method and chip

Country Status (1)

Country Link
CN (1) CN111158757B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117234408A (en) * 2022-06-06 2023-12-15 中科寒武纪科技股份有限公司 Method and device for reading target data in data based on instruction
CN116719559A (en) * 2022-07-20 2023-09-08 广州众远智慧科技有限公司 Method and device for infrared scanning

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103246541A (en) * 2013-04-27 2013-08-14 中国人民解放军信息工程大学 Method for evaluating auto-parallelization and multistage parallelization cost
CN103777924A (en) * 2012-10-23 2014-05-07 亚德诺半导体技术公司 Processor architecture and method for simplifying programmable single instruction, multiple data within a register
CN104424158A (en) * 2013-08-19 2015-03-18 上海芯豪微电子有限公司 General unit-based high-performance processor system and method
CN104424129A (en) * 2013-08-19 2015-03-18 上海芯豪微电子有限公司 Cache system and method based on read buffer of instructions
CN104699624A (en) * 2015-03-26 2015-06-10 中国人民解放军国防科学技术大学 FFT (fast Fourier transform) parallel computing-oriented conflict-free storage access method
CN105005465A (en) * 2015-06-12 2015-10-28 北京理工大学 Processor based on bit or byte parallel acceleration
CN105446773A (en) * 2015-11-18 2016-03-30 上海兆芯集成电路有限公司 Speculative parallel execution system and method for executing high-speed cache line non-aligned loading instruction
CN105893319A (en) * 2014-12-12 2016-08-24 上海芯豪微电子有限公司 Multi-lane/multi-core system and method
CN107003846A (en) * 2014-12-23 2017-08-01 英特尔公司 The method and apparatus for loading and storing for vector index
CN109690956A (en) * 2016-09-22 2019-04-26 高通股份有限公司 Data storage at contiguous memory address
CN110096450A (en) * 2018-01-29 2019-08-06 北京思朗科技有限责任公司 Multi-granularity parallel storage system and memory

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8086806B2 (en) * 2008-03-24 2011-12-27 Nvidia Corporation Systems and methods for coalescing memory accesses of parallel threads
US8635431B2 (en) * 2010-12-08 2014-01-21 International Business Machines Corporation Vector gather buffer for multiple address vector loads


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Design and Research of an 8-bit Microprocessor and IIC Bus Interface Soft Core; Zhou Ganmin; China Masters' Theses Full-text Database (Electronic Journal); 2002-06-30; pp. I137-30 *
An Access-Pattern-Aware On-Chip Vector Memory System with Automatic Loading for SIMD Architectures; Tong Geng et al.; 2018 IEEE High Performance Extreme Computing Conference (HPEC); 2018-09-27; pp. 1-7 *
A Fast Background Extraction Method Based on Pentium SIMD Instructions; Zhou Xihan; Computer Engineering and Applications; 2004-09-30; pp. 81-83 *
Parallel Access Strategy for Large Data Objects Based on RAMCloud; Chu Zheng; Journal of Computer Applications; 2016-06-30; pp. 1526-1532, 1566 *

Also Published As

Publication number Publication date
CN111158757A (en) 2020-05-15

Similar Documents

Publication Publication Date Title
US7640284B1 (en) Bit reversal methods for a parallel processor
EP4025997A1 (en) Methods for performing processing-in-memory operations on serially allocated data, and related memory devices and systems
US20110153707A1 (en) Multiplying and adding matrices
EP2423821A2 (en) Processor, apparatus, and method for fetching instructions and configurations from a shared cache
US10831693B1 (en) Multicast master
KR20200108774A (en) Memory Device including instruction memory based on circular queue and Operation Method thereof
CN111183418B (en) Configurable hardware accelerator
CN111158757B (en) Parallel access device and method and chip
US11138106B1 (en) Target port with distributed transactions
US11487342B2 (en) Reducing power consumption in a neural network environment using data management
US11809953B1 (en) Dynamic code loading for multiple executions on a sequential processor
KR20220051006A (en) Method of performing PIM (PROCESSING-IN-MEMORY) operation, and related memory device and system
EP4022416A1 (en) Operating mode register
CN113900710B (en) Expansion memory assembly
US20200293452A1 (en) Memory device and method including circular instruction memory queue
CN114945984A (en) Extended memory communication
EP4022524A1 (en) Copy data in a memory system with artificial intelligence mode
US10942889B2 (en) Bit string accumulation in memory array periphery
CN116051345A (en) Image data processing method, device, computer equipment and readable storage medium
US11500802B1 (en) Data replication for accelerator
US11823771B2 (en) Streaming access memory device, system and method
US11354130B1 (en) Efficient race-condition detection
US10942890B2 (en) Bit string accumulation in memory array periphery
US10997277B1 (en) Multinomial distribution on an integrated circuit
EP3931707A1 (en) Storage device operation orchestration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210209

Address after: 311201 No. 602-11, complex building, 1099 Qingxi 2nd Road, Hezhuang street, Qiantang New District, Hangzhou City, Zhejiang Province

Applicant after: Zhonghao Xinying (Hangzhou) Technology Co.,Ltd.

Address before: 518057 5-15, block B, building 10, science and technology ecological park, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Applicant before: Shenzhen Xinying Technology Co.,Ltd.

GR01 Patent grant