CN117251212A - Data access device, method, processor and computer equipment - Google Patents

Info

Publication number
CN117251212A
Authority
CN
China
Prior art keywords
vector
memory
access
vector memory
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311220340.0A
Other languages
Chinese (zh)
Inventor
任子木
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202311220340.0A
Publication of CN117251212A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3802 Instruction prefetching
    • G06F9/3814 Implementation provisions of instruction buffers, e.g. prefetch buffer; banks
    • G06F9/3867 Concurrent instruction execution, e.g. pipeline, look ahead using instruction pipelines
    • G06F9/3869 Implementation aspects, e.g. pipeline latches; pipeline synchronisation and clocking

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Complex Calculations (AREA)

Abstract

The embodiments of the application disclose a data access device, a data access method, a processor, and computer equipment, belonging to the technical field of data access. The device comprises a reorganization unit and a vector memory access unit, and the reorganization unit comprises m first reorganizers. Each first reorganizer is configured to reorganize the n vector memory access addresses in an original vector memory access instruction to obtain a candidate vector memory access instruction. The reorganization unit is configured to determine a target vector memory access instruction from the m candidate vector memory access instructions based on the number of memory block (bank) conflicts corresponding to each candidate, and to transmit the target vector memory access instruction to the vector memory access unit. The vector memory access unit is configured to read, from a memory and based on the target vector memory access instruction, the vector data corresponding to each vector memory access address in that instruction. This reduces the number of memory block conflicts during data access and improves data access performance.

Description

Data access device, method, processor and computer equipment
Technical Field
The embodiment of the application relates to the technical field of data access, in particular to a data access device, a data access method, a processor and computer equipment.
Background
As the scale and complexity of data in various fields grow, so do the demands on the computing power and processing performance of processors. A vector processor system (VPS) is a vector-oriented, pipelined parallel processing system.
A vector processor is a central processing unit whose instruction set operates directly on one-dimensional arrays (vectors), and the vector memory access unit is the unit in the vector processor that reads data from a memory or writes data into the memory.
In the related art, a vector processor uses the vector memory access unit to directly process vector memory access instructions that contain a large number of scattered (hashed) vector memory access requests, so a large number of memory block (bank) conflicts occur while vector data is read, and memory access performance is low.
Disclosure of Invention
The embodiments of the application provide a data access device, a method, a processor, and computer equipment, which can reduce the number of memory block conflicts during data access and improve data access performance. The technical solutions are as follows:
In one aspect, an embodiment of the present application provides a data access device, where the device includes:
a reorganization unit and a vector memory access unit, where the reorganization unit comprises m first reorganizers;
the first reorganizer is configured to reorganize n vector memory access addresses in an original vector memory access instruction to obtain a candidate vector memory access instruction, where the address order of the n vector memory access addresses differs between different candidate vector memory access instructions;
the reorganization unit is configured to determine a target vector memory access instruction from the m candidate vector memory access instructions based on the number of memory block conflicts corresponding to each of the m candidate vector memory access instructions, where the number of memory block conflicts corresponding to the target vector memory access instruction is less than that of the other candidate vector memory access instructions;
the reorganization unit is further configured to transmit the target vector memory access instruction to the vector memory access unit;
the vector memory access unit is configured to read, from a memory and based on the target vector memory access instruction, the vector data corresponding to each vector memory access address in the target vector memory access instruction.
In another aspect, an embodiment of the present application provides a data access method, where the method is used in the data access device described in the above aspect, and the data access device includes a reorganization unit and a vector memory access unit, the reorganization unit comprising m first reorganizers;
the method comprises the following steps:
reorganizing, by the first reorganizer, the n vector memory access addresses in the original vector memory access instruction to obtain a candidate vector memory access instruction, where the address order of the n vector memory access addresses differs between different candidate vector memory access instructions;
determining, by the reorganization unit, a target vector memory access instruction from the m candidate vector memory access instructions based on the number of memory block conflicts corresponding to each of the m candidate vector memory access instructions, where the number of memory block conflicts corresponding to the target vector memory access instruction is less than that of the other candidate vector memory access instructions;
transmitting, by the reorganization unit, the target vector memory access instruction to the vector memory access unit;
reading, by the vector memory access unit, from a memory and based on the target vector memory access instruction, the vector data corresponding to each vector memory access address in the target vector memory access instruction.
In some embodiments, the reorganization unit further includes m conflict scorers and a comparator; the input of each conflict scorer is connected to the output of a first reorganizer, different first reorganizers correspond to different address reorganization types, and the input of the comparator is connected to the output of each conflict scorer;
performing, by the conflict scorer, memory block conflict detection on the candidate vector memory access instruction output by the first reorganizer, and counting the number of memory block conflicts based on the conflict detection result;
transmitting, by the conflict scorer, the number of memory block conflicts to the comparator;
comparing, by the comparator, the numbers of memory block conflicts corresponding to the candidate vector memory access instructions, and determining a target address reorganization type from the m address reorganization types.
In some embodiments, the conflict scorer divides the n vector memory access addresses in the candidate vector memory access instruction into multiple groups of candidate vector memory access requests based on the single-access upper limit of the memory, where the number of vector memory access addresses in each candidate vector memory access request equals the number of addresses corresponding to the single-access upper limit; determines the memory block corresponding to each vector memory access address in each group; when multiple vector memory access addresses in a candidate vector memory access request correspond to the same memory block, determines the number of memory block conflicts for that request based on the number of vector memory access addresses corresponding to the same memory block; and obtains the number of memory block conflicts for the candidate vector memory access instruction by totaling the numbers of memory block conflicts of all groups.
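The scoring steps described above can be modeled in a few lines of Python. This is only an illustrative sketch, not the claimed hardware: the function names, the interleaved address-to-bank mapping in `bank_of`, the 4-byte word size, the four banks, and the rule that every additional address landing on an already used bank counts as one conflict are all assumptions made for the example.

```python
from collections import Counter


def bank_of(address, num_banks=4, word_bytes=4):
    """Assumed mapping: consecutive words are interleaved across the banks."""
    return (address // word_bytes) % num_banks


def count_bank_conflicts(addresses, per_access_limit=4, num_banks=4, word_bytes=4):
    """Total bank conflicts of one candidate instruction.

    The addresses are cut into groups of `per_access_limit` (one group per
    memory access). Inside a group, a bank hit by c > 1 addresses is assumed
    to contribute c - 1 conflicts.
    """
    total = 0
    for start in range(0, len(addresses), per_access_limit):
        group = addresses[start:start + per_access_limit]
        hits = Counter(bank_of(a, num_banks, word_bytes) for a in group)
        total += sum(c - 1 for c in hits.values() if c > 1)
    return total


# Byte addresses that map to banks 0, 1, 0, 1, 2, 3, 2, 3 under bank_of().
scattered = [0, 4, 16, 20, 8, 12, 24, 28]
# The same addresses reordered so that each group touches four distinct banks.
reordered = [0, 4, 8, 12, 16, 20, 24, 28]
```

Under these assumptions, `count_bank_conflicts(scattered)` evaluates to 4 and `count_bank_conflicts(reordered)` to 0, which is the kind of score a comparator could use to rank candidates.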
In some embodiments, the reorganization unit further includes a first multiplexer; the input of the first multiplexer is connected to the output of each first reorganizer, and the select (gating) input of the first multiplexer is connected to the output of the comparator;
transmitting, by the comparator, the type number corresponding to the target address reorganization type to the first multiplexer;
outputting, by the first multiplexer, the target vector memory access instruction based on the type number input by the comparator.
In some embodiments, the first reorganizer exchanges the address sequence of the vector memory addresses in the original vector memory access instruction based on the address exchange manner indicated by the address reorganization type, so as to obtain the candidate vector memory access instruction.
In some embodiments, the apparatus further comprises a type buffer, an input of the type buffer being connected to an output of the comparator;
receiving, by the type buffer, the type number corresponding to the target address reorganization type input by the comparator;
storing, by the type buffer, the type numbers sequentially in the order in which they are input.
In some embodiments, the apparatus further includes an instruction splitting unit; an input of the instruction splitting unit is connected to an output of the reorganization unit, and an output of the instruction splitting unit is connected to an input of the vector memory access unit;
splitting, by the instruction splitting unit, the target vector memory access instruction input by the reorganization unit based on the single-access upper limit of the memory to obtain multiple groups of target vector memory access requests, where the number of vector memory access addresses in each target vector memory access request equals the number of addresses corresponding to the single-access upper limit;
transmitting, by the instruction splitting unit, the target vector memory access requests to the vector memory access unit in sequence.
In some embodiments, the instruction splitting unit further comprises a second multiplexer, a counter, and an adder; the input of the second multiplexer is connected to the output of the reorganization unit, the select (gating) input of the second multiplexer is connected to the output of the counter, and the output of the second multiplexer is connected to the input of the vector memory access unit; the input of the counter is connected to the output of the adder, and the output of the counter is connected to the input of the adder;
receiving, by the counter, an addition instruction input by the adder, and transmitting the count result obtained based on the addition instruction to the second multiplexer;
transmitting, by the second multiplexer, the target vector memory access request to the vector memory access unit based on the count result input by the counter.
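The splitting behavior can be sketched in software as follows. The sketch is illustrative only: `split_instruction` and `issue_requests` are hypothetical names, the counter and adder are modeled as an integer that is incremented once per issued request, and rejecting an address count that is not a multiple of the single-access upper limit is an assumption based on the statement that every request holds exactly that many addresses.

```python
def split_instruction(addresses, per_access_limit):
    """Cut the target instruction into requests of exactly `per_access_limit` addresses."""
    if per_access_limit <= 0 or len(addresses) % per_access_limit != 0:
        raise ValueError("address count must be a positive multiple of the limit")
    return [addresses[s:s + per_access_limit]
            for s in range(0, len(addresses), per_access_limit)]


def issue_requests(addresses, per_access_limit):
    """Yield the requests one by one, selected by a counter as a multiplexer would."""
    requests = split_instruction(addresses, per_access_limit)
    counter = 0
    while counter < len(requests):
        yield requests[counter]   # the counter value selects which request is output
        counter = counter + 1     # the adder feeds counter + 1 back into the counter


issued = list(issue_requests(list(range(8)), 4))
```

With eight addresses and a limit of four, `issued` holds two requests, `[0, 1, 2, 3]` followed by `[4, 5, 6, 7]`, in that order.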
In some embodiments, the apparatus further comprises a vector buffer and a data reduction unit;
storing, by the vector buffer, the vector data corresponding to each vector memory access address read from the memory, and transmitting the vector data corresponding to each vector memory access address to the data reduction unit;
performing, by the data reduction unit, data reduction (that is, restoring the original order) on the vector data corresponding to each vector memory access address to obtain a target vector data set, where the order of the vector memory access addresses corresponding to the vector data in the target vector data set is the same as the address order of the n vector memory access addresses in the original vector memory access instruction.
In some embodiments, the data reduction unit includes m second reorganizers and a third multiplexer, where the second reorganizers correspond to the same address reorganization types as the first reorganizers; the input of each second reorganizer is connected to the output of the vector buffer, and the output of each second reorganizer is connected to the input of the third multiplexer;
the device also comprises a type buffer, and the output of the type buffer is connected to the select (gating) input of the third multiplexer;
exchanging, by the second reorganizer, the positions of the vector data corresponding to the vector memory access addresses based on the address exchange manner indicated by its address reorganization type to obtain a candidate vector data set;
transmitting, by the second reorganizer, the candidate vector data set to the third multiplexer;
determining, by the third multiplexer, a target vector data set from the m candidate vector data sets based on the type number input by the type buffer, and outputting the target vector data set, where the type number corresponds to the target vector memory access instruction.
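To illustrate how the stored type number lets the read data be put back in the original order, the following Python sketch models the type buffer as a FIFO queue and each address reorganization type as a list of source positions. The table `SHUFFLE_TYPES`, the function names, and the `data@...` strings standing in for vector data are hypothetical and exist only for this example.

```python
from collections import deque

# Hypothetical reorganization types: entry k names the original position
# whose address is placed at position k of the candidate instruction.
SHUFFLE_TYPES = [
    [0, 1, 2, 3, 4, 5, 6, 7],   # type 0: order unchanged
    [0, 1, 4, 5, 2, 3, 6, 7],   # type 1: swap 3rd with 5th and 4th with 6th
]


def reorganize(items, type_number):
    """Reorder the addresses according to the selected type."""
    return [items[src] for src in SHUFFLE_TYPES[type_number]]


def restore(read_data, type_number):
    """Undo the reordering so the data follows the original address order."""
    restored = [None] * len(read_data)
    for k, src in enumerate(SHUFFLE_TYPES[type_number]):
        restored[src] = read_data[k]
    return restored


type_buffer = deque()                                  # FIFO of type numbers
original = [f"addr{i}" for i in range(8)]
type_buffer.append(1)                                  # recorded when type 1 is chosen
issued = reorganize(original, 1)
read_data = [f"data@{a}" for a in issued]              # stand-in for the memory read
result = restore(read_data, type_buffer.popleft())
```

Because type numbers leave the queue in the order they entered it, each batch of read data is matched with the type that produced it. Here `result` lists the data for `addr0` through `addr7` in the original order. For pure pairwise swaps the restoring permutation equals the forward one, which is consistent with the second reorganizers using the same types as the first reorganizers.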
In another aspect, an embodiment of the present application provides a processor, where the processor includes a data access device as described in the foregoing aspect.
In another aspect, an embodiment of the present application provides a computer device, where the computer device includes a processor as described in the foregoing aspect and a memory, where the processor is connected to the memory through a bus.
In another aspect, an embodiment of the present application provides a computer device, which comprises a processor, a memory, and the data access device described in the above aspect, where the processor is connected to the data access device, and the processor is connected to the memory through a bus.
In the embodiments of the application, during vector data access, the m first reorganizers in the reorganization unit first reorganize the n vector memory access addresses in the original vector memory access instruction to obtain m candidate vector memory access instructions. The reorganization unit then determines the target vector memory access instruction from the m candidates according to the number of memory block conflicts corresponding to each candidate, and transmits it to the vector memory access unit. The vector memory access unit reads, from the memory and based on the target vector memory access instruction, the vector data corresponding to each vector memory access address in that instruction. This reduces the number of memory block conflicts during vector data access and improves data access performance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a data access device according to an exemplary embodiment of the present application;
FIG. 2 illustrates an address reorganization schematic provided by an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a data access device according to another exemplary embodiment of the present application;
FIG. 4 illustrates an address reorganization schematic provided in another exemplary embodiment of the present application;
FIG. 5 illustrates a schematic diagram of an instruction splitting unit provided in an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram illustrating a data access device according to another exemplary embodiment of the present application;
FIG. 7 illustrates a flow chart of a data access method provided by an exemplary embodiment of the present application;
FIG. 8 shows a block diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
It should be understood that "a number" herein means one or more, and "a plurality" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may indicate that A exists alone, that A and B exist together, or that B exists alone. The character "/" generally indicates an "or" relationship between the objects before and after it.
In a vector processor, the performance of the vector memory access unit has a great influence on the overall performance of the processor. Besides load instructions (which load data) and store instructions (which write data), vector memory access instructions also include special instructions whose accessed addresses are generally hashed, that is, the vector memory access addresses are out of order and scattered. If vector memory access is performed directly in the address order given in the instruction, a large number of memory block conflicts may occur, for example when multiple vector memory access addresses in the same clock cycle correspond to the same memory block, and memory access performance drops sharply.
In the related art, to improve memory access performance, conflict detection is performed globally on the large number of input vector memory accesses, so that during data access the arbitration unit in the memory can allocate accesses based on the conflict detection result. However, as the parallelism of vector processors keeps increasing, the design complexity of the vector memory access unit also rises. Performing global conflict detection directly on all vector memory accesses consumes a large number of comparators and wiring resources, which multiplies the place-and-route pressure.
In the embodiments of the application, to reduce the number of memory block conflicts during data access, a reorganization unit is added to the data access device. The reorganization unit comprises multiple first reorganizers, which reorganize the vector memory access addresses in a vector memory access instruction and thereby adjust the address order. The vector memory access instruction with the fewest memory block conflicts is then determined, and the vector memory access unit performs vector data access based on that instruction, which improves memory access performance.
Referring to fig. 1, a schematic structural diagram of a data access device according to an exemplary embodiment of the present application is shown. The data access device 100 includes a reorganizing unit 110 and a vector access unit 120, where the reorganizing unit 110 includes m first reorganizers 1101.
The first reorganizer 1101 is configured to reorganize the n vector memory access addresses in the original vector memory access instruction to obtain a candidate vector memory access instruction, where the address order of the n vector memory access addresses differs between different candidate vector memory access instructions.
The original vector memory access instruction includes n vector memory access addresses. Several of these addresses may correspond to the same memory block, and the addresses corresponding to the same memory block may be scattered within the instruction. For example, if two vector memory access addresses that correspond to the same memory block fall in the same clock cycle, a memory block conflict (bank conflict) occurs in that cycle, and the two accesses must be performed one after the other, so two clock cycles are needed to complete the data access. Therefore, in some embodiments, to reduce the number of memory block conflicts in the original vector memory access instruction, the first reorganizer may reorganize the n vector memory access addresses in the original vector memory access instruction to obtain a candidate vector memory access instruction.
In an illustrative example, the original vector memory access instruction includes 8 vector memory access addresses pointing to bank0, bank1, bank0, bank1, bank2, bank3, bank2, and bank3, respectively. If vector memory access is performed on 4 vector memory access addresses per clock cycle, the vector memory access unit uses the first 4 addresses in the first clock cycle, and these point to bank0, bank1, bank0, and bank1. Two addresses therefore point to the same memory block in the same clock cycle, a memory block conflict occurs, and the access for the first 4 addresses cannot complete within a single clock cycle. Two clock cycles are needed instead, which costs one additional clock cycle.
Therefore, to reduce memory block conflicts, in the embodiments of the present application the first reorganizer reorganizes the n vector memory access addresses in the original vector memory access instruction to obtain the candidate vector memory access instruction. For example, the original vector memory access instruction includes 8 vector memory access addresses pointing to bank0, bank1, bank0, bank1, bank2, bank3, bank2, and bank3. The first reorganizer may exchange the positions of the 3rd and 5th vector memory access addresses and the positions of the 4th and 6th vector memory access addresses, so that the 8 addresses in the candidate vector memory access instruction point to bank0, bank1, bank2, bank3, bank0, bank1, bank2, and bank3, in that order. If vector memory access is performed on 4 addresses per clock cycle, the vector memory access unit uses the first 4 addresses in the first clock cycle, and these point to bank0, bank1, bank2, and bank3. The 4 addresses point to 4 different memory blocks, no conflict occurs, and the access based on these 4 addresses completes within a single clock cycle.
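The bank0 to bank3 example can be checked with a small Python model. It is a sketch under stated assumptions: banks are written as integers 0 to 3, each bank is assumed to serve one request per clock cycle (so a group costs as many cycles as its busiest bank receives requests), and the helper names `swap_positions` and `cycles_needed` are hypothetical.

```python
from collections import Counter


def swap_positions(seq, pairs):
    """Return a copy of seq with the given 1-based position pairs exchanged."""
    out = list(seq)
    for i, j in pairs:
        out[i - 1], out[j - 1] = out[j - 1], out[i - 1]
    return out


def cycles_needed(banks, per_cycle=4):
    """Clock cycles to serve all groups, assuming one request per bank per cycle."""
    total = 0
    for start in range(0, len(banks), per_cycle):
        group = banks[start:start + per_cycle]
        total += max(Counter(group).values())
    return total


original = [0, 1, 0, 1, 2, 3, 2, 3]                       # bank of each address
candidate = swap_positions(original, [(3, 5), (4, 6)])    # 3rd<->5th, 4th<->6th
```

After the two swaps `candidate` is `[0, 1, 2, 3, 0, 1, 2, 3]`. Under the one-request-per-bank assumption, `cycles_needed(original)` is 4 (two cycles per group) and `cycles_needed(candidate)` is 2 (one cycle per group).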
In one possible design, the reorganization unit includes m first reorganizers, and different first reorganizers correspond to different address reorganization types (shuffle types), so that m candidate vector memory access instructions can be obtained through the m first reorganizers, with the address order of the n vector memory access addresses differing between candidates.
Optionally, the reorganization types corresponding to the first reorganizers may be types that the designer has found in advance, through simulation statistics, to yield a relatively small average number of memory block conflicts.
Optionally, the reorganization may be expressed as exchanging the positions of some of the vector memory access addresses in the original vector memory access instruction. For example, the first reorganizer includes 8 input ports and 8 output ports in one-to-one correspondence, and the vector memory access unit can perform vector memory access on 4 vector memory access addresses per clock cycle. When the original vector memory access instruction is not reorganized, the first input port corresponds to the first output port, the second input port corresponds to the second output port, and so on; each address enters and leaves on ports with the same number, so the address order is unchanged. When the original vector memory access instruction is reorganized, the first input port corresponds to the fourth output port, the fourth input port corresponds to the first output port, the second input port corresponds to the fifth output port, and the fifth input port corresponds to the second output port. That is, the first and fourth vector memory access addresses are interchanged, and the second and fifth vector memory access addresses are interchanged, which yields a candidate vector memory access instruction.
In other words, different correspondences between input ports and output ports represent different address reorganization types, and the number of address exchanges may differ between types. A type may include only one exchange, for example cross-wiring only the first and fourth input and output ports. A type may also include two exchanges, for example cross-wiring the first and fourth ports and also cross-wiring the second and fifth ports. A designer can therefore determine, through repeated simulation statistics, m reorganization types with a relatively small average number of memory block conflicts for the current address access pattern, and apply them to the m first reorganizers.
Illustratively, as shown in fig. 2, the first reorganizer may exchange the third vector memory address with the location of the fifth vector memory address, and exchange the fourth vector memory address with the location of the sixth vector memory address.
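The port correspondence that defines a reorganization type can be modeled as a fixed wiring table. The Python sketch below is illustrative only; `make_port_map` and `reorganize` are hypothetical names, and ports are numbered from 1 in the arguments to match the description while the table itself is 0-based.

```python
def make_port_map(num_ports, swaps):
    """Build the wiring table: port_map[input_port] = output_port (0-based).

    `swaps` lists 1-based port pairs whose connections are crossed; all other
    inputs stay wired to the output with the same number.
    """
    port_map = list(range(num_ports))
    for a, b in swaps:
        port_map[a - 1], port_map[b - 1] = port_map[b - 1], port_map[a - 1]
    return port_map


def reorganize(addresses, port_map):
    """Route every address from its input port to the wired output port."""
    outputs = [None] * len(addresses)
    for in_port, out_port in enumerate(port_map):
        outputs[out_port] = addresses[in_port]
    return outputs


identity_type = make_port_map(8, [])                 # no reorganization
cross_type = make_port_map(8, [(1, 4), (2, 5)])      # 1st<->4th and 2nd<->5th
routed = reorganize(["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8"], cross_type)
```

Here `routed` is `["a4", "a5", "a3", "a1", "a2", "a6", "a7", "a8"]`: the first and fourth addresses and the second and fifth addresses have traded places, while the others pass straight through. Because the table is fixed per type, each reorganizer in this model amounts to wiring rather than logic, which fits the description of crossed connections.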
The reorganizing unit 110 is configured to determine a target vector memory access instruction from the m candidate vector memory access instructions based on the number of memory block conflicts corresponding to the m candidate vector memory access instructions, where the number of memory block conflicts corresponding to the target vector memory access instruction is less than the number of memory block conflicts corresponding to other candidate vector memory access instructions.
In some embodiments, after n vector memory addresses in the original vector memory instructions are reorganized by m first reorganizers to obtain m candidate vector memory instructions, in order to complete vector data memory under the condition that the minimum number of memory block conflicts is generated, the reorganizing unit may be configured to determine, according to the number of memory block conflicts corresponding to the m candidate vector memory instructions, a target vector memory instruction from the m candidate vector memory instructions.
In some embodiments, the reorganizing unit may be configured to determine a number of memory block conflicts corresponding to each candidate vector memory access instruction, so as to determine, as the target vector memory access instruction, a candidate vector memory access instruction having a minimum number of corresponding memory block conflicts.
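The selection step can be sketched as follows. This is a simplified software model, not the claimed circuit: addresses are represented directly by their bank indices, `SHUFFLE_TYPES` is a hypothetical set of three reorganization types, each extra address on an already used bank is assumed to count as one conflict, and ties are assumed to go to the lowest type number.

```python
from collections import Counter

# Hypothetical types: entry k names the original position placed at position k.
SHUFFLE_TYPES = [
    [0, 1, 2, 3, 4, 5, 6, 7],   # type 0: unchanged
    [0, 1, 4, 5, 2, 3, 6, 7],   # type 1: swap 3rd<->5th and 4th<->6th
    [3, 4, 2, 0, 1, 5, 6, 7],   # type 2: swap 1st<->4th and 2nd<->5th
]


def score(banks, per_access_limit=4):
    """Conflict count of one candidate (assumed: c addresses on a bank = c - 1)."""
    return sum(c - 1
               for s in range(0, len(banks), per_access_limit)
               for c in Counter(banks[s:s + per_access_limit]).values())


def select_target(original_banks, shuffle_types, per_access_limit=4):
    """Produce all candidates, score them, and pick the one with the fewest conflicts."""
    candidates = [[original_banks[src] for src in perm] for perm in shuffle_types]
    scores = [score(c, per_access_limit) for c in candidates]
    type_number = min(range(len(scores)), key=scores.__getitem__)
    return type_number, candidates[type_number], scores


type_number, target, scores = select_target([0, 1, 0, 1, 2, 3, 2, 3], SHUFFLE_TYPES)
```

For this input the scores are `[4, 0, 2]`, so type 1 is selected and `target` is `[0, 1, 2, 3, 0, 1, 2, 3]`. The returned `type_number` plays the role of the value that selects the winning candidate.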
The reorganizing unit 110 is further configured to transmit the target vector access instruction to the vector access unit 120.
In some embodiments, after determining the target vector memory access instruction, the reorganization unit may be further configured to transmit the target vector memory access instruction to the vector memory access unit for further vector data reading.
The vector access unit 120 is configured to read, based on the target vector access instruction, vector data corresponding to each vector access address in the target vector access instruction from the memory.
In some embodiments, after receiving the target vector memory access instruction, the vector memory access unit may sequentially read vector data corresponding to each vector memory address from the memory according to an address sequence of n vector memory addresses in the target vector memory access instruction.
Optionally, the memory may be an internal memory integrated in the same processor as the data access device, or an external memory located outside the processor in which the data access device is located.
In summary, in the embodiments of the present application, during vector data access, the m first reorganizers in the reorganizing unit reorganize the n vector memory access addresses in the original vector memory access instruction to obtain m candidate vector memory access instructions. The reorganizing unit then determines the target vector memory access instruction from the m candidate vector memory access instructions according to the number of memory block conflicts corresponding to each candidate, and transmits the target vector memory access instruction to the vector memory access unit. Based on the target vector memory access instruction, the vector memory access unit reads the vector data corresponding to each vector memory access address from the memory, which reduces the number of memory block conflicts during vector data access and improves data access performance.
To ensure that the number of memory block conflicts corresponding to the determined target vector memory access instruction is smaller than that of the other candidate vector memory access instructions, memory block conflict detection may be performed on each candidate vector memory access instruction separately, so that the target vector memory access instruction is determined based on the conflict detection results.
In one possible design, the reorganizing unit includes m first reorganizers corresponding to different address reorganization types, and m conflict scorers, where the conflict scorers are in one-to-one correspondence with the first reorganizers, and the input end of each conflict scorer is connected with the output end of the corresponding first reorganizer.
Illustratively, as shown in fig. 3, the reorganizing unit 310 of the data access device includes, in addition to the first reorganizer 1, the first reorganizer 2, and the first reorganizer 3, a conflict scorer 1, a conflict scorer 2, and a conflict scorer 3 corresponding to the first reorganizers, where the output end of the first reorganizer 1 is connected to the input end of the conflict scorer 1, the output end of the first reorganizer 2 is connected to the input end of the conflict scorer 2, and the output end of the first reorganizer 3 is connected to the input end of the conflict scorer 3.
The first reorganizer is configured to exchange an address sequence of a vector memory address in an original vector memory instruction based on an address exchange manner indicated by an address reorganization type, so as to obtain a candidate vector memory instruction.
Optionally, the address sequence of the vector memory access addresses may be exchanged in hardware by swapping the input/output connection lines inside the first reorganizer: the vector memory access addresses enter the first reorganizer in a first address sequence and leave it in a second address sequence determined by the swapped connection lines.
For example, when the vector memory access instruction includes 8 vector memory access addresses, the first reorganizer has 8 input ports that each receive one vector memory access address and 8 output ports that each output one vector memory access address; exchanging the input/output connection lines makes the output address sequence differ from the input address sequence.
Schematically, as shown in fig. 4, the vector memory access instruction includes 8 vector memory access addresses pointing to bank0, bank1, bank0, bank1, bank2, bank3, bank2, and bank3 respectively; through the exchange of the input/output connection lines inside the first reorganizer corresponding to address reorganization type 1, an address sequence pointing to bank0, bank1, bank2, bank3, bank0, bank1, bank2, and bank3 is obtained.
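The fixed exchange of connection lines described above can be modelled in software as a permutation table. The following is only a minimal sketch: the function name `reorganize`, the table `WIRING_TYPE_1`, and the input order are illustrative assumptions and are not taken from fig. 4. Each address is represented simply by the index of the bank it points to.

```python
from typing import List, Sequence


def reorganize(addresses: Sequence[int], wiring: Sequence[int]) -> List[int]:
    """Model a reorganizer: output port i is wired to input port wiring[i]."""
    if sorted(wiring) != list(range(len(addresses))):
        raise ValueError("wiring must be a permutation of the input ports")
    return [addresses[src] for src in wiring]


# Hypothetical wiring (0-based ports): ports 2 and 4 are exchanged,
# and ports 3 and 5 are exchanged; all other ports pass straight through.
WIRING_TYPE_1 = [0, 1, 4, 5, 2, 3, 6, 7]

# Assumed input order, written as bank indices.
original_banks = [0, 1, 0, 1, 2, 3, 2, 3]

print(reorganize(original_banks, WIRING_TYPE_1))
# -> [0, 1, 2, 3, 0, 1, 2, 3]
```

Because the table is fixed, the same reordering is applied to every instruction, which corresponds to a purely combinational wiring in hardware.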
The conflict scorer is used for carrying out memory block conflict detection on the candidate vector memory access instruction output by the first reorganizer, and counting the number of memory block conflicts based on a conflict detection result.
In some embodiments, after the candidate vector access instruction is obtained through the first reorganizer, in order to accurately obtain the number of memory block conflicts corresponding to the candidate vector access instruction, the candidate vector access instruction may be transmitted to the conflict scorer through the first reorganizer, so that the conflict scorer performs memory block conflict detection on the candidate vector access instruction, and further based on a conflict detection result, the number of memory block conflicts corresponding to the candidate vector access instruction is counted.
Optionally, the conflict scorer may divide the candidate vector memory access instruction according to the single memory access upper limit of the memory, that is, determine which vector memory access addresses are accessed in the same clock cycle. The conflict scorer then checks whether any of the vector memory access addresses assigned to a single clock cycle correspond to the same memory block. If no two vector memory access addresses in a clock cycle correspond to the same memory block, there is no memory block conflict; if two vector memory access addresses correspond to the same memory block, the number of memory block conflicts is 1; if three vector memory access addresses correspond to the same memory block, the number of memory block conflicts is 2; and so on.
Optionally, in the vector data memory access process, the vector memory access unit generally reads vector data from the memory based on the number of addresses corresponding to the single memory access upper limit according to the single memory access upper limit of the memory, so that the conflict scorer can detect the conflict situation of the memory block in the single data memory access process, thereby counting the conflict situation of the memory block corresponding to the candidate vector memory access instruction.
In some embodiments, the conflict scorer divides n vector memory addresses in the candidate vector memory instruction according to a single memory access upper limit of the memory to obtain a plurality of sets of candidate vector memory requests, where the number of addresses of the vector memory addresses in the candidate vector memory requests is equal to the number of addresses corresponding to the single memory access upper limit.
Further, the conflict scorer can determine the memory block corresponding to each vector memory address in each set of candidate vector memory access requests, and judge whether at least two vector memory addresses in each set of candidate vector memory access requests correspond to the same memory block.
In one possible implementation, when multiple vector memory access addresses in a candidate vector memory access request correspond to the same memory block, the conflict scorer may determine the number of memory block conflicts for that request from the number of vector memory access addresses that correspond to the same memory block. For example, if the candidate vector memory access request includes 4 vector memory access addresses pointing to bank0, bank1, bank0 and bank1 respectively, two addresses point to bank0 and two point to bank1, so the request incurs a memory block conflict of one clock cycle and its number of memory block conflicts is 1.
And then the conflict scorer can count the number of memory block conflicts corresponding to the candidate vector access instruction according to the number of memory block conflicts corresponding to each group of candidate vector access requests, and optionally, the number of memory block conflicts corresponding to the candidate vector access instruction is equal to the sum of the number of memory block conflicts corresponding to each group of candidate vector access requests.
Illustratively, the single memory access limit of the memory is 4, i.e., the memory can only receive requests for 4 vector memory accesses per clock cycle.
As shown in fig. 4, the first candidate vector memory access instruction corresponding to address reorganization type 1 includes 8 vector memory access addresses pointing to bank0, bank1, bank2, bank3, bank0, bank1, bank2, and bank3 respectively. It can be divided into a first candidate vector memory access request, including 4 vector memory access addresses pointing to bank0, bank1, bank2 and bank3 respectively, and a second candidate vector memory access request, including 4 vector memory access addresses pointing to bank0, bank1, bank2 and bank3 respectively. Neither request contains a memory block conflict, so the number of memory block conflicts corresponding to the first candidate vector memory access instruction is 0.
As shown in fig. 4, the second candidate vector memory access instruction corresponding to address reorganization type 2 includes 8 vector memory access addresses pointing to bank0, bank1, bank0, bank1, bank2, bank3, bank2, and bank3 respectively. It can be divided into a first candidate vector memory access request, including 4 vector memory access addresses pointing to bank0, bank1, bank0 and bank1 respectively, and a second candidate vector memory access request, including 4 vector memory access addresses pointing to bank2, bank3, bank2 and bank3 respectively. The first candidate vector memory access request contains one memory block conflict and the second candidate vector memory access request also contains one, so the number of memory block conflicts corresponding to the second candidate vector memory access instruction is 2.
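The scoring described above can be sketched as follows, using the two candidate sequences of the example (the second taken as bank0, bank1, bank0, bank1, bank2, bank3, bank2, bank3, as implied by its division into requests). This is an illustrative software model, not the circuit itself. The function names are hypothetical, and the per-request rule used here, namely the largest number of addresses hitting one memory block minus 1, is an interpretation chosen because it reproduces the counts given in the text.

```python
from collections import Counter
from typing import List, Sequence


def split_requests(banks: Sequence[int], limit: int) -> List[List[int]]:
    """Divide the address sequence into requests of `limit` addresses each."""
    return [list(banks[i:i + limit]) for i in range(0, len(banks), limit)]


def request_conflicts(request: Sequence[int]) -> int:
    """Conflicts in one request: busiest bank's address count minus 1 (assumed rule)."""
    if not request:
        return 0
    return max(Counter(request).values()) - 1


def instruction_conflicts(banks: Sequence[int], limit: int) -> int:
    """Conflicts of an instruction: sum of the conflicts of its requests."""
    return sum(request_conflicts(r) for r in split_requests(banks, limit))


candidate_1 = [0, 1, 2, 3, 0, 1, 2, 3]  # address reorganization type 1
candidate_2 = [0, 1, 0, 1, 2, 3, 2, 3]  # address reorganization type 2

print(instruction_conflicts(candidate_1, limit=4))  # -> 0
print(instruction_conflicts(candidate_2, limit=4))  # -> 2
```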
In one possible design, a comparator is also included in the reorganization unit, and the input of the comparator is connected to the output of each conflict scorer. Illustratively, as shown in fig. 3, the reorganizing unit 310 includes a comparator 311, where an input terminal of the comparator 311 is connected to an output terminal of the conflict scorer 1, an output terminal of the conflict scorer 2, and an output terminal of the conflict scorer 3, respectively.
In some embodiments, the conflict scorer is further configured to communicate the number of memory block conflicts to the comparator.
Optionally, after determining the number of memory block conflicts corresponding to the candidate vector access instruction, each conflict scorer may transmit the number of memory block conflicts to the comparator.
The comparator is used for comparing the conflict quantity of the memory blocks corresponding to each candidate vector memory access instruction, and determining the target address reorganization type from m address reorganization types.
In some embodiments, after receiving the number of memory block conflicts corresponding to the m conflict scorers, the comparator may determine the target address reorganization type from the m address reorganization types.
In some embodiments, the m first reorganizers and their conflict scorers are assigned type numbers 1 to m, so each result that the comparator receives from a conflict scorer includes both a type number and a number of memory block conflicts. After comparing the m numbers of memory block conflicts, the comparator determines the type number with the smallest number of memory block conflicts. The type number identifies the first reorganizer, its conflict scorer, and the address reorganization type corresponding to that first reorganizer.
In one possible design, the reorganization unit further comprises a first multiplexer, an input of the first multiplexer being connected to an output of each first reorganizer, and a gate of the first multiplexer being connected to an output of the comparator. Illustratively, as shown in fig. 3, the reorganizing unit 310 further includes a first multiplexer 312, where an input terminal of the first multiplexer 312 is connected to an output terminal of the first reorganizing unit 1, to an output terminal of the first reorganizing unit 2, to an output terminal of the first reorganizing unit 3, and a gate terminal of the first multiplexer 312 is connected to an output terminal of the comparator 311.
The comparator is used for transmitting a type number corresponding to the target address reorganization type to the first multiplexer; the first multiplexer is used for outputting a target vector memory access instruction based on the type number input by the comparator.
In some embodiments, after determining the target address reorganization type with the least number of corresponding memory block conflicts, the comparator may transmit the type number corresponding to the target address reorganization type to the first multiplexer, so that the first multiplexer may select, according to the type number, the candidate vector memory access instruction output by the first reorganizer corresponding to the type number as the target vector memory access instruction, and output the target vector memory access instruction.
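The cooperation of the comparator and the first multiplexer can be sketched as a minimum search followed by a keyed selection. The sketch below is illustrative only: the function names, the score for type 3, and the candidate for type 3 are hypothetical, and breaking ties in favour of the lowest type number is an assumption, since the text does not state how equal scores are handled.

```python
from typing import Dict, List


def comparator(scores: Dict[int, int]) -> int:
    """Return the type number with the fewest memory block conflicts.

    Ties are broken in favour of the lowest type number (assumption).
    """
    return min(scores, key=lambda type_number: (scores[type_number], type_number))


def first_multiplexer(candidates: Dict[int, List[int]], select: int) -> List[int]:
    """Output the candidate instruction whose type number equals `select`."""
    return candidates[select]


# Candidate instructions keyed by type number, written as bank indices.
candidates = {
    1: [0, 1, 2, 3, 0, 1, 2, 3],
    2: [0, 1, 0, 1, 2, 3, 2, 3],
    3: [0, 0, 1, 2, 3, 1, 2, 3],  # hypothetical third candidate
}
# Scores as they would arrive from the conflict scorers.
scores = {1: 0, 2: 2, 3: 1}

target_type = comparator(scores)
target_instruction = first_multiplexer(candidates, target_type)
print(target_type, target_instruction)
# -> 1 [0, 1, 2, 3, 0, 1, 2, 3]
```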
In one possible design, in order to read the vector data corresponding to each vector memory access address from the memory within the single memory access upper limit of the memory, the data access device may further include an instruction splitting unit, where an input end of the instruction splitting unit is connected to an output end of the reorganizing unit, and an output end of the instruction splitting unit is connected to an input end of the vector memory access unit.
Illustratively, as shown in FIG. 3, an input of the instruction splitting unit 320 is connected to an output of the first multiplexer 312 in the reorganizing unit 310, and an output of the instruction splitting unit 320 is connected to an input of the vector accessing unit 330.
Optionally, the instruction splitting unit is configured to split the target vector memory access instruction input by the reorganizing unit based on the single memory access upper limit of the memory, so as to obtain multiple groups of target vector memory access requests, where the number of vector memory access addresses in each target vector memory access request is equal to the number of addresses corresponding to the single memory access upper limit.
In some embodiments, the single memory access limit of the memory is 4, that is, the memory can only receive requests containing 4 vector memory addresses at a time, so that in the case that 16 vector memory addresses are included in the target vector memory instruction, the instruction splitting unit needs to divide the 16 vector memory addresses into 4 sets of target vector memory requests, each set of target vector memory requests containing 4 vector memory addresses.
In an illustrative example, the target vector memory instruction includes 16 vector memory addresses pointing to bank0, bank1, bank2, bank3, bank4, bank5, bank6, bank7, bank0, bank1, bank2, bank3, bank4, bank5, bank6, and bank7, and the single memory upper limit of the memory is 4 vector memory addresses, so the instruction splitting unit needs to split the 16 vector memory addresses into 4 groups, each containing 4 vector memory addresses, thereby obtaining 4 target vector memory requests, wherein the first target vector memory request includes 4 vector memory addresses pointing to bank0, bank1, bank2, and bank3; the second target vector access request comprises 4 vector access addresses which respectively point to bank4, bank5, bank6 and bank7; the third target vector access request comprises 4 vector access addresses which respectively point to bank0, bank1, bank2 and bank3; the fourth target vector access request includes 4 vector access addresses pointing to bank4, bank5, bank6 and bank7 respectively.
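The splitting in this example can be sketched as a simple chunking step. The function name `split_instruction` is hypothetical, and rejecting an instruction whose length is not a multiple of the limit is an assumption made here, because the text only describes requests that contain exactly as many addresses as the single memory access upper limit.

```python
from typing import List, Sequence


def split_instruction(banks: Sequence[int], limit: int) -> List[List[int]]:
    """Split a target instruction into requests of exactly `limit` addresses."""
    if limit <= 0 or len(banks) % limit != 0:
        # Assumption: partial requests are not modelled in this sketch.
        raise ValueError("address count must be a positive multiple of the limit")
    return [list(banks[i:i + limit]) for i in range(0, len(banks), limit)]


# The 16 addresses of the example, written as bank indices.
target_instruction = [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7]

for number, request in enumerate(split_instruction(target_instruction, 4), start=1):
    print(number, request)
# -> 1 [0, 1, 2, 3]
# -> 2 [4, 5, 6, 7]
# -> 3 [0, 1, 2, 3]
# -> 4 [4, 5, 6, 7]
```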
Optionally, the instruction splitting unit is further configured to sequentially transmit the target vector access request to the vector access unit.
In some embodiments, after obtaining multiple sets of target vector memory access requests, the instruction splitting unit may sequentially transmit each set of target vector memory access requests to the vector memory access unit, which reads vector data from the memory based on each set of target vector memory access requests.
In order to enable the sequential transmission of each set of target vector memory access requests to the vector memory access unit, in one possible design, the instruction splitting unit may further comprise a second multiplexer, a counter and an adder, wherein an input of the second multiplexer is connected to an output of the reorganizing unit, a gate of the second multiplexer is connected to an output of the counter, an output of the second multiplexer is connected to an input of the vector memory access unit, an input of the counter is connected to an output of the adder, and an output of the counter is connected to an input of the adder.
Illustratively, as shown in fig. 5, the input end of the second multiplexer 510 is connected to the output end of the reorganizing unit, the gate end of the second multiplexer 510 is connected to the output end of the counter 520, the output end of the second multiplexer 510 is connected to the input end of the vector memory access unit, the input end of the counter 520 is connected to the output end of the adder 530, and the output end of the counter 520 is connected to the input end of the adder 530.
The counter is used for receiving an addition instruction input by the adder and transmitting a counting result obtained based on the addition instruction to the second multiplexer; the second multiplexer is used for transmitting the target vector access request to the vector access unit based on the counting result input by the counter.
In some embodiments, the initial count result of the counter may be set to 0. When the instruction splitting unit starts to transmit target vector memory access requests to the vector memory access unit, the adder performs an addition and passes the result to the counter, so that the counter holds a count result of 1 and transmits it to the second multiplexer, which selects the first group of target vector memory access requests as its output based on the count result. After the second multiplexer outputs the first group of target vector memory access requests, the adder performs another addition, so that the counter holds a count result of 2 and the second multiplexer selects the second group of target vector memory access requests as its output, and so on, until the instruction splitting unit has transmitted all target vector memory access requests to the vector memory access unit.
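The counter, adder and second multiplexer loop can be sketched as a generator that issues one request per step. This is a behavioural model under the assumption that the adder increments the count by 1 each time and that a count result of k selects the k-th group; the function name `issue_requests` is hypothetical.

```python
from typing import Iterator, List, Sequence


def issue_requests(requests: Sequence[List[int]]) -> Iterator[List[int]]:
    """Yield the request groups one by one, as selected by the counter value."""
    count = 0  # initial count result of the counter
    while count < len(requests):
        count = count + 1          # adder output is written back into the counter
        yield requests[count - 1]  # second multiplexer: count k selects group k


groups = [[0, 1, 2, 3], [4, 5, 6, 7], [0, 1, 2, 3], [4, 5, 6, 7]]
for group in issue_requests(groups):
    print(group)
```

The groups are emitted in their original order, one per iteration, and the loop ends once the count reaches the number of groups.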
In the above embodiment, a conflict scorer is provided in the reorganizing unit for each first reorganizer, and each conflict scorer is connected to the comparator. The comparator determines the target address reorganization type with the fewest memory block conflicts and transmits the comparison result to the first multiplexer, which selects the target vector memory access instruction as its output, thereby improving the efficiency of determining the target address reorganization type.
In addition, the instruction splitting unit transmits the target vector memory access requests to the vector memory access unit in sequence according to the single memory access upper limit of the memory, and the vector memory access unit reads the corresponding vector data from the memory according to each target vector memory access request in turn, which improves the access efficiency of vector data.
The vector memory access unit reads vector data from the memory based on each target vector memory access request, and the target vector memory access requests are obtained by reorganizing and splitting the vector memory access addresses in the original vector memory access instruction. To ensure that the address sequence corresponding to the vector data finally output by the data access device is the same as the address sequence of the vector memory access addresses in the original vector memory access instruction, in one possible design, the data access device may further include a vector buffer and a data restoring unit.
Illustratively, as shown in fig. 6, the data access device includes a reorganizing unit 610, an instruction splitting unit 620, a vector access unit 630, a vector buffer 640, and a data restoring unit 650.
The vector buffer is used for storing vector data corresponding to each vector memory address read from the memory and transmitting the vector data corresponding to each vector memory address to the data reduction unit.
In some embodiments, in the process that the vector access unit reads vector data from the memory based on each target vector access request, the vector buffer stores vector data corresponding to each vector access address read from the memory, and after receiving vector data corresponding to the complete vector access instruction, the vector buffer transmits the vector data to the data restoring unit.
In some embodiments, since the vector memory access unit reads data one group of target vector memory access requests at a time, the vector data written to the vector buffer each time also corresponds to one group of target vector memory access requests. The vector buffer therefore transmits to the data restoring unit the vector data of all groups of target vector memory access requests belonging to the target vector memory access instruction.
In an illustrative example, the original vector memory instruction includes 16 vector memory addresses, each set of target vector memory requests includes 4 vector memory addresses, and each set of target vector memory requests has no memory block conflict, so that the vector buffer needs to wait for 4 clock cycles to obtain vector data corresponding to the complete vector memory instruction.
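The waiting time in this example can be estimated with a small calculation. The sketch below rests on an assumption that is not stated explicitly in the text: a request in which the busiest memory block receives k addresses occupies k clock cycles, so a conflict-free request takes one cycle and each memory block conflict adds one more. The function name `access_cycles` is hypothetical.

```python
from collections import Counter
from typing import Sequence


def access_cycles(banks: Sequence[int], limit: int) -> int:
    """Estimate clock cycles needed to read all addresses (assumed cost model)."""
    cycles = 0
    for start in range(0, len(banks), limit):
        request = banks[start:start + limit]
        # One cycle per address on the busiest bank of this request.
        cycles += max(Counter(request).values())
    return cycles


# 16 addresses, 4 per request, no conflicts: 4 cycles, as in the example.
print(access_cycles([0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7], 4))  # -> 4

# 8 addresses with one conflict in each of the two requests.
print(access_cycles([0, 1, 0, 1, 2, 3, 2, 3], 4))  # -> 4

# The same 8 addresses after a conflict-free reorganization.
print(access_cycles([0, 1, 2, 3, 0, 1, 2, 3], 4))  # -> 2
```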
The data restoring unit is used for carrying out data restoration on the vector data corresponding to each vector memory address to obtain a target vector data set, and the address sequence of the vector memory addresses corresponding to each vector data in the target vector data set is the same as the address sequence of n vector memory addresses in the original vector memory instruction.
In some embodiments, to make the address sequence of the vector data finally output by the data access device the same as the address sequence of the vector memory access addresses in the original vector memory access instruction, the data restoring unit may restore the received vector data of each group of target vector memory access requests to obtain the target vector data set, in which the address sequence of the vector memory access addresses corresponding to the vector data is the same as the address sequence of the n vector memory access addresses in the original vector memory access instruction.
In one possible design, the data restoring unit includes m second reorganizers and a third multiplexer, where each second reorganizer corresponds to the same address reorganization type as the corresponding first reorganizer. The input end of each second reorganizer is connected with the output end of the vector buffer, and the output end of each second reorganizer is connected with an input end of the third multiplexer.
Illustratively, as shown in fig. 6, the data restoring unit 650 includes a second reorganizer 1, a second reorganizer 2, and a second reorganizer 3, where the input end of the second reorganizer 1 is connected to the output end of the vector buffer 640 and the output end of the second reorganizer 1 is connected to an input end of the third multiplexer 651; the input end of the second reorganizer 2 is connected to the output end of the vector buffer 640 and the output end of the second reorganizer 2 is connected to an input end of the third multiplexer 651; and the input end of the second reorganizer 3 is connected to the output end of the vector buffer 640 and the output end of the second reorganizer 3 is connected to an input end of the third multiplexer 651.
The second recombiner is used for carrying out position exchange on the vector data corresponding to each vector memory address based on the address exchange mode indicated by the address recombination type, so as to obtain a candidate vector data set.
In some embodiments, the second recombiners in the data reduction unit correspond to the first recombiners in the recombination unit one by one and correspond to the same address recombination type, so that after the vector data corresponding to each vector memory address transmitted by the vector buffer is received, the second recombiners can perform position exchange on the vector data corresponding to each vector memory address according to an address exchange mode indicated by the address recombination type, and a candidate vector data set is obtained.
Optionally, the address sequence of the vector memory address corresponding to each vector data in the candidate vector data set output by the second reorganizer corresponding to the target reorganization type is the same as the address sequence of each vector memory address in the original vector memory instruction.
In an illustrative example, the reorganization type corresponding to the first reorganizer 1 exchanges the connection lines of the first and fourth input/output ports; the reorganization type corresponding to the first reorganizer 2 exchanges the connection lines of the first and fourth input/output ports and of the second and fifth input/output ports; and the reorganization type corresponding to the first reorganizer 3 exchanges the connection lines of the third and fifth input/output ports and of the fourth and sixth input/output ports. To restore the vector data corresponding to each vector memory access address, the second reorganizer 1 therefore likewise exchanges the connection lines of the first and fourth input/output ports; the second reorganizer 2 exchanges the connection lines of the first and fourth input/output ports and of the second and fifth input/output ports; and the second reorganizer 3 exchanges the connection lines of the third and fifth input/output ports and of the fourth and sixth input/output ports.
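The reason the second reorganizers can reuse the same exchanges is that a set of disjoint port swaps is its own inverse. The following sketch checks this for the three reorganization types of the example; the function names and the 8-port width are illustrative assumptions. A general `inverse_wiring` helper is included for wirings that are not self-inverse, which the text does not discuss.

```python
from typing import Dict, List, Sequence, Tuple


def swap_wiring(ports: int, swaps: Sequence[Tuple[int, int]]) -> List[int]:
    """Build a wiring table from 1-based port pairs whose lines are exchanged."""
    wiring = list(range(ports))
    for a, b in swaps:
        wiring[a - 1], wiring[b - 1] = wiring[b - 1], wiring[a - 1]
    return wiring


def route(values: Sequence, wiring: Sequence[int]) -> List:
    """Output port i carries the value present on input port wiring[i]."""
    return [values[src] for src in wiring]


def inverse_wiring(wiring: Sequence[int]) -> List[int]:
    """Wiring that undoes `wiring` (needed only if it is not self-inverse)."""
    inverse = [0] * len(wiring)
    for dst, src in enumerate(wiring):
        inverse[src] = dst
    return inverse


# Port exchanges of reorganization types 1 to 3 from the example (1-based).
SWAPS: Dict[int, List[Tuple[int, int]]] = {
    1: [(1, 4)],
    2: [(1, 4), (2, 5)],
    3: [(3, 5), (4, 6)],
}

data = ["d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7"]
for type_number, swaps in SWAPS.items():
    wiring = swap_wiring(8, swaps)
    # Applying the same exchange twice restores the original order.
    print(type_number, route(route(data, wiring), wiring) == data)
```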
Optionally, the second reorganizer is further configured to transmit the candidate vector data set to the third multiplexer.
In some embodiments, after obtaining its candidate vector data set, each second reorganizer transmits the candidate vector data set to the third multiplexer, which outputs the target vector data set.
In order to improve the efficiency of determining the target vector data set, in one possible design, the data access device further comprises a type buffer, wherein the input end of the type buffer is connected with the output end of the comparator, and the output end of the type buffer is connected with the gating end of the third multiplexer.
Illustratively, as shown in fig. 6, an input end of the type buffer 660 is connected to an output end of the comparator 611, and an output end of the type buffer 660 is connected to the gating end of the third multiplexer 651.
Optionally, the type buffer is configured to receive a type number corresponding to the target address reorganization type input by the comparator in the reorganization unit.
In some embodiments, considering that there may be multiple original vector memory access instructions to perform data memory access through the data memory access device, that is, there are multiple type numbers of the target address reorganization type corresponding to the original vector memory access instructions, in order to ensure stability and order of data input and output, the type buffer is further configured to store each type number in sequence according to an input order corresponding to each type number.
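Storing the type numbers in input order amounts to first-in, first-out behaviour. A minimal sketch, assuming a software queue stands in for the type buffer (the class and method names are hypothetical):

```python
from collections import deque


class TypeBuffer:
    """First-in, first-out store for target address reorganization type numbers."""

    def __init__(self) -> None:
        self._queue = deque()

    def push(self, type_number: int) -> None:
        """Store the type number produced by the comparator for one instruction."""
        self._queue.append(type_number)

    def pop(self) -> int:
        """Return the oldest stored type number for the third multiplexer."""
        return self._queue.popleft()


type_buffer = TypeBuffer()
for type_number in (3, 1, 2):  # three instructions, in issue order
    type_buffer.push(type_number)

print([type_buffer.pop() for _ in range(3)])
# -> [3, 1, 2]
```

Keeping the order ensures that each batch of returned vector data is restored with the type number of the instruction that produced it.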
Optionally, the third multiplexer is configured to determine a target vector data set from the m candidate vector data sets based on the type number input by the type buffer, and output the target vector data set, where the type number corresponds to the target vector access instruction.
In some embodiments, the address reorganization type corresponding to each second reorganizer in the data reduction unit is the same as the address reorganization type corresponding to each first reorganizer in the reorganization unit, and the first reorganizers and the second reorganizers corresponding to the same address reorganization type have the same number, so that in the case of receiving the type number corresponding to the target vector access instruction input by the type buffer, the third multiplexer can determine the target vector data set from m candidate vector data sets according to the type number, and output the target vector data set.
In some embodiments, the third multiplexer receives the type number input by the type buffer, determines the second reorganizer corresponding to the type number, and outputs the candidate vector data set input by that second reorganizer as the target vector data set.
In an illustrative example, the reorganization type corresponding to the target vector memory access instruction is that of the first reorganizer 3, that is, the connection lines of the third and fifth input/output ports are exchanged and the connection lines of the fourth and sixth input/output ports are exchanged. For example, the original vector memory access instruction includes 8 vector memory access addresses pointing to bank0, bank1, bank2 and bank3, and the target vector memory access instruction is obtained from it by the first reorganizer 3. The comparator in the reorganizing unit then inputs type number 3 into the type buffer, the type buffer inputs type number 3 to the third multiplexer, and the third multiplexer outputs the candidate vector data set produced by the second reorganizer 3 as the target vector data set.
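The restoration path for type number 3 can be sketched end to end. The wiring tables below encode the port exchanges described earlier for reorganization types 1 to 3 (0-based); the data labels and the function names are illustrative assumptions.

```python
from typing import Dict, List, Sequence

# Wiring tables: output port i is connected to input port WIRINGS[t][i].
WIRINGS: Dict[int, List[int]] = {
    1: [3, 1, 2, 0, 4, 5, 6, 7],  # ports 1 and 4 exchanged
    2: [3, 4, 2, 0, 1, 5, 6, 7],  # ports 1 and 4, and 2 and 5 exchanged
    3: [0, 1, 4, 5, 2, 3, 6, 7],  # ports 3 and 5, and 4 and 6 exchanged
}


def route(values: Sequence[str], wiring: Sequence[int]) -> List[str]:
    """Model a reorganizer acting on addresses or on vector data."""
    return [values[src] for src in wiring]


def third_multiplexer(candidate_sets: Dict[int, List[str]], type_number: int) -> List[str]:
    """Output the candidate vector data set selected by the type number."""
    return candidate_sets[type_number]


# Vector data labelled by position in the original instruction.
original_order = ["d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7"]

# Data arrives in the vector buffer in the order of the target instruction.
buffered = route(original_order, WIRINGS[3])

# Every second reorganizer produces a candidate set; only one is correct.
candidate_sets = {t: route(buffered, w) for t, w in WIRINGS.items()}

restored = third_multiplexer(candidate_sets, type_number=3)
print(restored == original_order)  # -> True
```

Only the candidate set from the second reorganizer 3 matches the original order, which is why the type number must travel alongside the data through the type buffer.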
In the above embodiment, by designing the vector buffer and the data restoring unit in the data access device, after the vector data corresponding to each vector access address is read from the memory, the vector data corresponding to each group of target vector access requests can be stored in the vector buffer, so that the loss of the vector data is avoided.
In addition, the data reduction unit is provided with second reorganizers in one-to-one correspondence with the first reorganizers, and each second reorganizer performs data reduction on the vector data corresponding to the vector memory addresses, yielding a plurality of candidate vector data sets. Based on the type number, the third multiplexer can then output the candidate vector data set corresponding to the target reorganization type as the target vector data set. The address order of the vector memory addresses corresponding to the vector data in the target vector data set is therefore the same as the address order of the vector memory addresses in the original vector memory instruction, which improves the accuracy and validity of data reading.
Referring to fig. 7, a flowchart of a data access method according to an exemplary embodiment of the present application is shown, where the method is used in the data access apparatus provided in the foregoing embodiments, and the method includes:
In step 701, n vector memory addresses in the original vector memory instruction are recombined by the first recombiner to obtain candidate vector memory instructions, wherein the address sequences of the n vector memory addresses in different candidate vector memory instructions are different.
In step 702, the reorganizing unit determines a target vector memory access instruction from the m candidate vector memory access instructions based on the number of memory block conflicts corresponding to the m candidate vector memory access instructions, where the number of memory block conflicts corresponding to the target vector memory access instruction is less than the number of memory block conflicts corresponding to other candidate vector memory access instructions.
In step 703, the reorganization unit transmits the target vector access instruction to the vector access unit.
In step 704, the vector access unit reads, based on the target vector memory instruction, the vector data corresponding to each vector memory address in the target vector memory instruction from the memory.
In some embodiments, the reorganization unit further includes m conflict scorers, an input end of each conflict scorer is connected to an output end of a corresponding first reorganizer, and different first reorganizers correspond to different address reorganization types; the reorganization unit also includes a comparator, whose input end is connected with the output ends of the conflict scorers;
performing memory block conflict detection on the candidate vector memory access instruction output by the first reorganizer through a conflict scorer, and counting the number of memory block conflicts based on a conflict detection result;
transmitting the memory block conflict quantity to a comparator through a conflict scorer;
and comparing the number of memory block conflicts corresponding to each candidate vector memory access instruction through a comparator, and determining the target address reorganization type from m address reorganization types.
In some embodiments, dividing n vector memory addresses in the candidate vector memory instruction by a conflict scorer based on a single memory upper limit of the memory to obtain a plurality of groups of candidate vector memory requests, wherein the number of addresses of the vector memory addresses in the candidate vector memory requests is equal to the number of addresses corresponding to the single memory upper limit; determining memory blocks corresponding to each vector memory address in each group of candidate vector memory access requests; under the condition that a plurality of vector memory addresses correspond to the same memory block in the candidate vector memory access request, determining the number of memory block conflicts corresponding to the candidate vector memory access request based on the number of addresses of the vector memory addresses corresponding to the same memory block; based on the number of memory block conflicts corresponding to each group of candidate vector memory access requests, counting the number of memory block conflicts corresponding to the candidate vector memory access instructions.
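The scoring steps above can be illustrated with a minimal Python sketch. It is not the circuit of the application: the bank mapping (`bank_of`), the constants `NUM_BANKS`, `WORD_BYTES` and `ACCESS_LIMIT`, and the rule that k addresses hitting one memory block count as k - 1 conflicts are all assumptions chosen for illustration, since the text only states that the count is based on the number of addresses mapped to the same memory block.

```python
from collections import Counter

NUM_BANKS = 4      # assumed number of memory blocks (banks)
WORD_BYTES = 4     # assumed width of one bank word, in bytes
ACCESS_LIMIT = 4   # assumed single access upper limit (addresses per request)


def bank_of(addr):
    """Assumed low-order interleaving: consecutive words fall in consecutive banks."""
    return (addr // WORD_BYTES) % NUM_BANKS


def split_requests(addrs, limit=ACCESS_LIMIT):
    """Divide the n addresses into groups of `limit` addresses each."""
    return [addrs[i:i + limit] for i in range(0, len(addrs), limit)]


def request_conflicts(request):
    """Assumed scoring rule: k addresses on the same bank cost k - 1 conflicts."""
    counts = Counter(bank_of(a) for a in request)
    return sum(k - 1 for k in counts.values() if k > 1)


def instruction_conflicts(addrs, limit=ACCESS_LIMIT):
    """Accumulate the per-request conflict counts over the whole instruction."""
    return sum(request_conflicts(r) for r in split_requests(addrs, limit))


# Banks 0,0,1,1 | 2,2,3,3: two conflicts in each request.
print(instruction_conflicts([0, 16, 4, 20, 8, 24, 12, 28]))   # 4
# Banks 0,1,2,3 | 0,1,2,3: no conflicts.
print(instruction_conflicts([0, 4, 8, 12, 16, 20, 24, 28]))   # 0
```

Under these assumptions the same eight addresses score 4 or 0 depending only on their order, which is the effect the reorganization unit exploits.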
In some embodiments, the reorganization unit further includes a first multiplexer, an input terminal of the first multiplexer is connected to an output terminal of each first reorganizer, and a gate terminal of the first multiplexer is connected to an output terminal of the comparator;
transmitting a type number corresponding to the target address reorganization type to a first multiplexer through a comparator;
and outputting a target vector memory access instruction based on the type number input by the comparator through the first multiplexer.
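As a rough software analogy of the comparator and the first multiplexer, the following Python sketch picks the type number with the smallest conflict count and uses it to gate one candidate through. The function names are hypothetical, and resolving ties in favor of the lowest type number is an assumption; the text does not say how ties are handled.

```python
def comparator(conflict_counts):
    """Return the type number (index) with the smallest conflict count.

    Assumption: on a tie, the lowest type number wins.
    """
    return min(range(len(conflict_counts)), key=lambda i: conflict_counts[i])


def first_multiplexer(candidates, type_number):
    """Gate through the candidate instruction selected by the type number."""
    return candidates[type_number]


# Hypothetical candidates from m = 3 first reorganizers, with their scores.
candidates = [
    [0, 16, 4, 20, 8, 24, 12, 28],
    [0, 4, 16, 20, 8, 12, 24, 28],
    [0, 4, 8, 12, 16, 20, 24, 28],
]
scores = [4, 2, 0]

type_number = comparator(scores)
target = first_multiplexer(candidates, type_number)
print(type_number, target)
```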
In some embodiments, the address order of the vector memory addresses in the original vector memory instruction is swapped by the first reorganizer based on the address swap mode indicated by the address reorganization type, to obtain candidate vector memory instructions.
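A first reorganizer of this kind can be modeled as a fixed list of position swaps. The Python sketch below is illustrative only: `make_reorganizer` is a hypothetical helper, and the swap list used in the example (third with fifth port, fourth with sixth port, counted from one) follows the reorganizer 3 example given earlier in the description.

```python
def make_reorganizer(swaps):
    """Build a reorganizer from a fixed list of (i, j) position swaps (0-based)."""
    def reorganize(addrs):
        out = list(addrs)              # leave the original instruction untouched
        for i, j in swaps:
            out[i], out[j] = out[j], out[i]
        return out
    return reorganize


# Ports 3<->5 and 4<->6 (1-based) become indices 2<->4 and 3<->5 (0-based).
reorganizer_3 = make_reorganizer([(2, 4), (3, 5)])
print(reorganizer_3(["a0", "a1", "a2", "a3", "a4", "a5", "a6", "a7"]))
# ['a0', 'a1', 'a4', 'a5', 'a2', 'a3', 'a6', 'a7']
```

Because the swaps are hard-wired, each reorganizer corresponds to exactly one address reorganization type, and m reorganizers can produce m candidates in parallel.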
In some embodiments, the apparatus further comprises a type buffer, an input of the type buffer being connected to an output of the comparator;
receiving a type number corresponding to the target address reorganization type input by the comparator through a type buffer;
and sequentially storing each type number by the type buffer according to the input sequence of the type number.
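The type buffer behaves like a first-in, first-out queue of type numbers. The Python sketch below is a minimal model under that reading; the class name `TypeBuffer` and its methods are hypothetical, and reading the numbers back in the same order in which they were stored is an assumption consistent with the text, which only specifies the storage order.

```python
from collections import deque


class TypeBuffer:
    """FIFO of type numbers, one entry per issued target instruction."""

    def __init__(self):
        self._fifo = deque()

    def push(self, type_number):
        """Store a type number received from the comparator."""
        self._fifo.append(type_number)

    def pop(self):
        """Assumed read side: release the oldest stored type number."""
        return self._fifo.popleft()

    def __len__(self):
        return len(self._fifo)


buf = TypeBuffer()
for t in (3, 0, 2):        # type numbers of three consecutive instructions
    buf.push(t)
print(buf.pop(), buf.pop(), len(buf))   # 3 0 1
```

Keeping the order lets the type number that later reaches the data reduction unit match the instruction whose data is being restored.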
In some embodiments, the apparatus further comprises an instruction splitting unit, an input end of the instruction splitting unit being connected to an output end of the reorganizing unit, an output end of the instruction splitting unit being connected to an input end of the vector access unit;
splitting, through the instruction splitting unit and based on a single access upper limit of the memory, the target vector access instruction input by the reorganization unit to obtain multiple groups of target vector access requests, wherein the number of vector access addresses in each target vector access request is equal to the number of addresses corresponding to the single access upper limit;
and sequentially transmitting the target vector access requests to the vector access unit through the instruction splitting unit.
In some embodiments, the instruction splitting unit further comprises a second multiplexer, a counter, and an adder; the input end of the second multiplexer is connected with the output end of the reorganization unit, the gating end of the second multiplexer is connected with the output end of the counter, and the output end of the second multiplexer is connected with the input end of the vector access unit; the input end of the counter is connected with the output end of the adder, and the output end of the counter is connected with the input end of the adder;
receiving an addition instruction input by an adder through a counter, and transmitting a counting result obtained based on the addition instruction to a second multiplexer;
and transmitting a target vector access request to the vector access unit through a second multiplexer based on the counting result input by the counter.
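The counter, adder and second multiplexer can be pictured as a small state machine that releases one request per step. The Python sketch below is a behavioral model only: the class `InstructionSplitter`, its `load` and `step` methods, and the wrap-around of the counter after the last group are assumptions made for illustration.

```python
class InstructionSplitter:
    """Behavioral model: the counter value gates the second multiplexer."""

    def __init__(self, access_limit):
        self.access_limit = access_limit   # addresses per request
        self.groups = []
        self.counter = 0

    def load(self, target_addrs):
        """Latch a target instruction and cut it into fixed-size groups."""
        if len(target_addrs) % self.access_limit != 0:
            raise ValueError("address count must be a multiple of the access limit")
        self.groups = [target_addrs[i:i + self.access_limit]
                       for i in range(0, len(target_addrs), self.access_limit)]
        self.counter = 0

    def step(self):
        """One step: the multiplexer selects a group, then the adder increments the counter."""
        request = self.groups[self.counter]                      # multiplexer output
        self.counter = (self.counter + 1) % len(self.groups)     # assumed wrap-around
        return request


splitter = InstructionSplitter(access_limit=4)
splitter.load([0, 4, 8, 12, 16, 20, 24, 28])
print(splitter.step())   # [0, 4, 8, 12]
print(splitter.step())   # [16, 20, 24, 28]
```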
In some embodiments, the apparatus further comprises a vector buffer and a data reduction unit;
the vector data corresponding to each vector memory access address read from the memory is stored through the vector buffer, and the vector data corresponding to each vector memory access address is transmitted to the data reduction unit;
and carrying out data reduction on the vector data corresponding to each vector memory address through a data reduction unit to obtain a target vector data set, wherein the address sequence of the vector memory addresses corresponding to each vector data in the target vector data set is the same as the address sequence of n vector memory addresses in the original vector memory instruction.
In some embodiments, the data reduction unit includes m second reorganizers and a third multiplexer, where the second reorganizers correspond to the same address reorganization types as the first reorganizers; the input end of each second reorganizer is connected with the output end of the vector buffer, and the output end of each second reorganizer is connected with the input end of the third multiplexer;
the device also comprises a type buffer, wherein the output end of the type buffer is connected with the gating end of the third multiplexer;
performing position exchange on vector data corresponding to each vector memory access address by a second reorganizer based on an address exchange mode indicated by an address reorganization type to obtain candidate vector data groups;
transmitting the candidate vector data set to a third multiplexer through the second reorganizer;
and determining a target vector data set from the m candidate vector data sets based on the type number input by the type buffer through the third multiplexer, and outputting the target vector data set, wherein the type number corresponds to the target vector memory access instruction.
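The restoration path can be sketched in Python as m swap networks followed by a selector. This is an illustrative model, not the circuit itself: the table `SWAPS` is hypothetical except for type 3, which follows the earlier example, and the sketch assumes that the swaps of each type touch pairwise-disjoint positions, so that applying the same swaps a second time undoes the reordering.

```python
# Hypothetical swap table: one entry per address reorganization type (m = 4).
SWAPS = [
    [],                   # type 0: identity
    [(0, 4), (1, 5)],     # type 1: illustrative only
    [(2, 6), (3, 7)],     # type 2: illustrative only
    [(2, 4), (3, 5)],     # type 3: third<->fifth and fourth<->sixth positions
]


def apply_swaps(items, swaps):
    """Exchange positions according to a list of disjoint (i, j) swaps."""
    out = list(items)
    for i, j in swaps:
        out[i], out[j] = out[j], out[i]
    return out


def data_reduction(read_data, type_number):
    """m second reorganizers produce candidates; the third multiplexer picks one."""
    candidates = [apply_swaps(read_data, s) for s in SWAPS]
    return candidates[type_number]


original = ["d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7"]
as_read = apply_swaps(original, SWAPS[3])    # order in which the data was fetched
print(data_reduction(as_read, 3) == original)   # True
```

Only the candidate selected by the stored type number matches the original address order; the other candidates are computed and discarded, which mirrors the parallel structure on the address side.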
In summary, in the embodiments of the present application, during vector data access, the m first reorganizers in the reorganization unit reorganize the n vector access addresses in the original vector access instruction to obtain m candidate vector access instructions. The reorganization unit then determines the target vector access instruction from the m candidate vector access instructions according to the number of memory block conflicts corresponding to each candidate, and transmits the target vector access instruction to the vector access unit. Based on the target vector access instruction, the vector access unit reads the vector data corresponding to each vector access address from the memory, which reduces the number of memory block conflicts during vector data access and improves data access performance.
In some embodiments, the data access device in the embodiments of the present application may be integrated in the processor, or may be separately provided outside the processor.
Referring to FIG. 8, a block diagram of a computer device 800 provided in one exemplary embodiment of the present application is shown. The computer device 800 may be a portable mobile terminal, for example a smartphone, a tablet computer, a Moving Picture Experts Group Audio Layer III (MP3) player, or a Moving Picture Experts Group Audio Layer IV (MP4) player. The computer device 800 may also be referred to by other names, such as user device, portable terminal, workstation, or server.
In general, the computer device 800 includes: a processor 801 and a memory 802.
Processor 801 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 801 may be implemented in hardware in at least one of digital signal processing (Digital Signal Processing, DSP), field programmable gate array (Field-Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA). The processor 801 may also include a main processor and a coprocessor; the main processor, also referred to as a central processing unit (Central Processing Unit, CPU), processes data in an awake state, and the coprocessor is a low-power processor that processes data in a standby state. In some embodiments, the processor 801 may be integrated with a graphics processing unit (Graphics Processing Unit, GPU), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 801 may also include an artificial intelligence (Artificial Intelligence, AI) processor for processing computing operations related to machine learning.
In some embodiments, the processor 801 may be integrated with the data access device provided in the above embodiments, or the processor 801 may be further connected to a separately provided data access device. When the processor 801 has a memory access requirement, the data memory access device can perform data memory access.
Memory 802 may include one or more computer-readable storage media, which may be tangible and non-transitory. Memory 802 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices.
In some embodiments, computer device 800 may also optionally include a peripheral interface 803 and at least one peripheral device.
Those skilled in the art will appreciate that the architecture shown in fig. 8 is not limiting and that more or fewer components than shown may be included or that certain components may be combined or that a different arrangement of components may be employed.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description covers only preferred embodiments of the present application and is not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within its scope of protection.

Claims (20)

1. A data access device, the device comprising: a reorganization unit and a vector access unit; the reorganization unit comprises m first reorganizers;
the first reorganizer is configured to reorganize n vector memory addresses in an original vector memory instruction to obtain candidate vector memory instructions, where address sequences of the n vector memory addresses in different candidate vector memory instructions are different;
the reorganization unit is configured to determine a target vector memory access instruction from m candidate vector memory access instructions based on the number of memory block conflicts corresponding to m candidate vector memory access instructions, where the number of memory block conflicts corresponding to the target vector memory access instruction is less than the number of memory block conflicts corresponding to other candidate vector memory access instructions;
the reorganization unit is further configured to transmit the target vector access instruction to the vector access unit;
the vector access unit is used for reading vector data corresponding to each vector memory address in the target vector memory instruction from a memory based on the target vector memory instruction.
2. The apparatus of claim 1, wherein the reorganizing unit further comprises m conflict scorers, wherein an input end of the conflict scorer is connected to an output end of the first reorganizer, and different first reorganizers correspond to different address reorganizing types;
the recombination unit also comprises a comparator, and the input end of the comparator is connected with the output end of each conflict scorer;
the conflict scorer is used for carrying out memory block conflict detection on the candidate vector memory access instruction output by the first reorganizer and counting the number of memory block conflicts based on a conflict detection result;
the conflict scorer is further configured to transmit the number of memory block conflicts to the comparator;
the comparator is used for comparing the conflict quantity of the memory blocks corresponding to each candidate vector memory access instruction, and determining the target address reorganization type from m address reorganization types.
3. The apparatus of claim 2, wherein the conflict scorer is further configured to:
dividing n vector memory addresses in the candidate vector memory access instruction based on a single memory access upper limit of the memory to obtain a plurality of groups of candidate vector memory access requests, wherein the number of the vector memory addresses in the candidate vector memory access requests is equal to the number of addresses corresponding to the single memory access upper limit;
Determining memory blocks corresponding to each vector memory address in each group of candidate vector memory access requests;
under the condition that a plurality of vector memory access addresses correspond to the same memory block in the candidate vector memory access request, determining the number of memory block conflicts corresponding to the candidate vector memory access request based on the number of addresses of the vector memory access addresses corresponding to the same memory block;
based on the number of memory block conflicts corresponding to each group of candidate vector access requests, counting the number of memory block conflicts corresponding to the candidate vector access instructions.
4. The apparatus of claim 2, wherein the reorganizing unit further comprises a first multiplexer, an input of the first multiplexer being connected to an output of each first reorganizer, a gate of the first multiplexer being connected to an output of the comparator;
the comparator is used for transmitting a type number corresponding to the target address reorganization type to the first multiplexer;
the first multiplexer is configured to output the target vector memory access instruction based on the type number input by the comparator.
5. The apparatus of claim 2, wherein the first reorganizer is configured to:
And exchanging the address sequence of the vector memory access addresses in the original vector memory access instruction based on the address exchange mode indicated by the address reorganization type to obtain the candidate vector memory access instruction.
6. The apparatus of claim 2, further comprising a type buffer, an input of the type buffer being coupled to an output of the comparator;
the type buffer is used for receiving the type number corresponding to the target address reorganization type input by the comparator;
the type buffer is further configured to store each type number in sequence according to the input sequence of the type numbers.
7. The apparatus of claim 1, further comprising an instruction splitting unit, an input of the instruction splitting unit being coupled to an output of the reorganizing unit, an output of the instruction splitting unit being coupled to an input of the vector memory access unit;
the instruction splitting unit is configured to split the target vector memory access instruction input by the reorganizing unit based on a single memory access upper limit of the memory to obtain multiple groups of target vector memory access requests, where the number of addresses of the vector memory access addresses in the target vector memory access requests is equal to the number of addresses corresponding to the single memory access upper limit;
The instruction splitting unit is further configured to sequentially transmit the target vector access request to the vector access unit.
8. The apparatus of claim 7, wherein the instruction splitting unit further comprises a second multiplexer, a counter, and an adder;
the input end of the second multiplexer is connected with the output end of the reorganization unit, the gating end of the second multiplexer is connected with the output end of the counter, and the output end of the second multiplexer is connected with the input end of the vector access unit;
the input end of the counter is connected with the output end of the adder, and the output end of the counter is connected with the input end of the adder;
the counter is used for receiving an addition instruction input by the adder and transmitting a counting result obtained based on the addition instruction to the second multiplexer;
the second multiplexer is configured to transmit the target vector access request to the vector access unit based on the count result input by the counter.
9. The apparatus of claim 1, further comprising a vector buffer and a data reduction unit;
The vector buffer is used for storing vector data corresponding to each vector memory access address read from the memory and transmitting the vector data corresponding to each vector memory access address to the data reduction unit;
the data restoring unit is configured to restore the vector data corresponding to each vector memory address to obtain a target vector data set, where the address sequence of the vector memory address corresponding to each vector data in the target vector data set is the same as the address sequence of the n vector memory addresses in the original vector memory instruction.
10. The apparatus of claim 9, wherein the data reduction unit includes m second recombiners and a third multiplexer, the second recombiners corresponding to the same address reorganization type as the first recombiners;
the input end of the second reorganizer is connected with the output end of the vector buffer, and the output end of the second reorganizer is connected with the input end of the third multiplexer;
the device also comprises a type buffer, wherein the output end of the type buffer is connected with the gating end of the third multiplexer;
the second recombiner is configured to perform location exchange on vector data corresponding to each vector memory address based on an address exchange manner indicated by the address reorganization type, so as to obtain a candidate vector data set;
the second recombiner is further configured to transmit the candidate vector data set to the third multiplexer;
and the third multiplexer is used for determining a target vector data set from m candidate vector data sets based on the type number input by the type buffer, and outputting the target vector data set, wherein the type number corresponds to the target vector memory access instruction.
11. A data access method, characterized in that the method is used for a data access device according to any one of claims 1 to 10, the data access device comprising: the reorganization unit and the vector access unit; the reorganization unit comprises m first reorganizers;
the method comprises the following steps:
the n vector memory addresses in the original vector memory instruction are recombined through the first recombinator to obtain candidate vector memory instructions, wherein the address sequences of the n vector memory addresses in different candidate vector memory instructions are different;
determining a target vector memory access instruction from m candidate vector memory access instructions based on the number of memory block conflicts corresponding to m candidate vector memory access instructions through the reorganization unit, wherein the number of memory block conflicts corresponding to the target vector memory access instruction is smaller than the number of memory block conflicts corresponding to other candidate vector memory access instructions;
Transmitting the target vector access instruction to the vector access unit through the reorganization unit;
and reading vector data corresponding to each vector memory address in the target vector memory instruction from a memory based on the target vector memory instruction through the vector access unit.
12. The method according to claim 11, wherein the determining, by the reorganizing unit, the target vector memory access instruction from the m candidate vector memory access instructions based on the number of memory block conflicts corresponding to the m candidate vector memory access instructions includes:
performing memory block conflict detection on the candidate vector memory access instruction output by the first reorganizer through a conflict scorer, and counting the number of memory block conflicts based on a conflict detection result;
transmitting the memory block conflict quantity to a comparator through the conflict scorer;
and comparing the number of memory block conflicts corresponding to each candidate vector memory access instruction through the comparator, and determining the target address reorganization type from m address reorganization types.
13. The method of claim 12, wherein the performing, by the conflict scorer, memory block conflict detection on the candidate vector memory access instruction output by the first reorganizer, and counting the number of memory block conflicts based on the result of conflict detection, includes:
Dividing n vector memory addresses in the candidate vector memory access instruction based on a single memory access upper limit of the memory to obtain a plurality of groups of candidate vector memory access requests, wherein the number of the vector memory addresses in the candidate vector memory access requests is equal to the number of addresses corresponding to the single memory access upper limit;
determining memory blocks corresponding to each vector memory address in each group of candidate vector memory access requests;
under the condition that a plurality of vector memory access addresses correspond to the same memory block in the candidate vector memory access request, determining the number of memory block conflicts corresponding to the candidate vector memory access request based on the number of addresses of the vector memory access addresses corresponding to the same memory block;
based on the number of memory block conflicts corresponding to each group of candidate vector access requests, counting the number of memory block conflicts corresponding to the candidate vector access instructions.
14. The method according to claim 12, wherein the method further comprises:
transmitting a type number corresponding to the target address reorganization type to a first multiplexer through the comparator;
outputting, by the first multiplexer, the target vector memory access instruction based on the type number input by the comparator.
15. The method according to claim 12, wherein the method further comprises:
receiving a type number corresponding to the target address reorganization type input by the comparator through a type buffer;
and sequentially storing each type number according to the input sequence of the type number through the type buffer.
16. The method of claim 11, wherein the method further comprises:
splitting, through an instruction splitting unit and based on a single memory access upper limit of the memory, the target vector memory access instruction input by the reorganization unit to obtain multiple groups of target vector memory access requests, wherein the number of vector memory access addresses in each target vector memory access request is equal to the number of addresses corresponding to the single memory access upper limit;
and sequentially transmitting the target vector access requests to the vector access unit through the instruction splitting unit.
17. The method of claim 11, wherein the method further comprises:
the vector data corresponding to each vector memory address read from the memory is stored through a vector buffer, and the vector data corresponding to each vector memory address is transmitted to a data reduction unit;
And carrying out data reduction on the vector data corresponding to each vector memory address through the data reduction unit to obtain a target vector data set, wherein the address sequence of the vector memory addresses corresponding to each vector data in the target vector data set is the same as the address sequence of the n vector memory addresses in the original vector memory instruction.
18. A processor comprising a data access device as claimed in any one of claims 1 to 10.
19. A computer device comprising the processor of claim 18 and a memory, the processor being coupled to the memory by a bus.
20. A computer device comprising a processor, a memory and a data access means as claimed in any one of claims 1 to 10, wherein the processor is connected to the data access means and the processor is connected to the memory via a bus.
CN202311220340.0A 2023-09-20 2023-09-20 Data access device, method, processor and computer equipment Pending CN117251212A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311220340.0A CN117251212A (en) 2023-09-20 2023-09-20 Data access device, method, processor and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311220340.0A CN117251212A (en) 2023-09-20 2023-09-20 Data access device, method, processor and computer equipment

Publications (1)

Publication Number Publication Date
CN117251212A true CN117251212A (en) 2023-12-19

Family

ID=89125915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311220340.0A Pending CN117251212A (en) 2023-09-20 2023-09-20 Data access device, method, processor and computer equipment

Country Status (1)

Country Link
CN (1) CN117251212A (en)

Similar Documents

Publication Publication Date Title
CN107657581B (en) Convolutional neural network CNN hardware accelerator and acceleration method
US7669036B2 (en) Direct path monitoring by primary processor to each status register in pipeline chained secondary processors for task allocation via downstream communication
US10831693B1 (en) Multicast master
US11138106B1 (en) Target port with distributed transactions
CN114942831A (en) Processor, chip, electronic device and data processing method
CN111860806A (en) Fractal calculation device and method, integrated circuit and board card
EP3846036A1 (en) Matrix storage method, matrix access method, apparatus and electronic device
CN112416433A (en) Data processing device, data processing method and related product
CN110515872B (en) Direct memory access method, device, special computing chip and heterogeneous computing system
CN113918233A (en) AI chip control method, electronic equipment and AI chip
CN111158757B (en) Parallel access device and method and chip
CN112068965A (en) Data processing method and device, electronic equipment and readable storage medium
CN117251212A (en) Data access device, method, processor and computer equipment
US20120124343A1 (en) Apparatus and method for modifying instruction operand
CN111260043A (en) Data selector, data processing method, chip and electronic equipment
US11354130B1 (en) Efficient race-condition detection
CN109542837B (en) Operation method, device and related product
CN109558565B (en) Operation method, device and related product
CN109543835B (en) Operation method, device and related product
CN111260042B (en) Data selector, data processing method, chip and electronic equipment
CN114661634A (en) Data caching device and method, integrated circuit chip, computing device and board card
CN112395008A (en) Operation method, operation device, computer equipment and storage medium
CN112395009A (en) Operation method, operation device, computer equipment and storage medium
CN109543836B (en) Operation method, device and related product
CN111258632B (en) Data selection device, data processing method, chip and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication