CN116302099A - Method, processor, device, medium for loading data into vector registers - Google Patents

Publication number: CN116302099A
Authority: CN (China)
Prior art keywords: data, load, vector register, load operations, register
Legal status: Pending
Application number: CN202211664244.0A
Language: Chinese (zh)
Inventors: 林志翔, 崔泽汉
Assignee (original and current): Haiguang Information Technology Co Ltd

Classifications

    • G06F9/30036: Instructions to perform operations on packed data, e.g. vector, tile or matrix operations (G Physics; G06F Electric digital data processing; G06F9/30 Arrangements for executing machine instructions)
    • G06F9/3887: Concurrent instruction execution using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

Provided are a method, a processor, an electronic device, and a non-transitory storage medium for loading data into a vector register using a Gather instruction, the method comprising: loading data in memory into a first vector register through a first set of load operations of a plurality of load operations of the Gather instruction; loading data in memory into a second vector register through a second set of load operations of the plurality of load operations of the Gather instruction, wherein the second vector register is different from the first vector register, and wherein a last load operation of the first set and a first load operation of the second set are adjacent among the plurality of load operations of the Gather instruction; and merging the data in the first vector register and the data in the second vector register into one vector register.

Description

Method, processor, device, medium for loading data into vector registers
Technical Field
The present application relates to the field of integrated circuits, and more particularly to a method, a processor, an electronic device, and a non-transitory storage medium for loading data into vector registers using a Gather instruction.
Background
An instruction set is the set of instructions a CPU uses for computation and for controlling the computer system. Instruction sets are generally divided into reduced instruction sets (Reduced Instruction Set Computer, RISC) and complex instruction sets (Complex Instruction Set Computer, CISC).
Modern processors include vector processing units, which perform data-parallel computation and are an important component of the processor. Single Instruction Multiple Data (SIMD) refers to a single-instruction, multiple-data-stream technique in which one instruction operates on multiple data lanes in parallel. The core of a processor's vector unit is the SIMD instruction set it supports (also known as the floating-point instruction set or vector instruction set) and its vector registers. A SIMD instruction controls multiple parallel processing elements with one controller and drives multiple data streams with a single instruction, which can improve the running speed of a program. SIMD instructions perform the same operation simultaneously on each element of a set of data (also referred to as a "data vector"), achieving spatial parallelism. SIMD helps a central processing unit (Central Processing Unit, CPU) achieve data-level parallelism (Data Level Parallelism, DLP).
Vector registers, also known as floating-point registers, can store multiple elements, in contrast to ordinary general-purpose registers, and are the core storage unit for running SIMD instruction sets. Unlike scalar registers, which can store only one datum, vector registers can store multiple data elements (integers or floating-point numbers) and may be 128/256/512 bits wide, depending on the particular implementation. A single SIMD instruction of the SIMD instruction set can operate on multiple elements stored in a vector register at the same time. For example, the AVX2 instruction set is used in modern processors; it provides load instructions for loading multiple values held consecutively in memory into a vector register at one time, and shuffle (permute) instructions for dynamically rearranging elements within a vector register.
In the field of digital signal processing (Digital Signal Processing, DSP) design, programs may use the Gather/Scatter instructions: the Gather instruction loads a set of data from different locations in memory into a register, and the Scatter instruction writes the data in a register to different locations in memory.
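The Gather/Scatter behavior just described can be sketched in plain Python; the `gather`/`scatter` helper names are illustrative stand-ins, not real instructions, and a Python list stands in for memory and the register:

```python
def gather(memory, indices):
    """Gather: collect memory[i] for each i into one contiguous "register"."""
    return [memory[i] for i in indices]

def scatter(memory, indices, values):
    """Scatter: write each register element back to its scattered location."""
    for i, v in zip(indices, values):
        memory[i] = v

memory = [10, 11, 12, 13, 14, 15, 16, 17]
reg = gather(memory, [0, 2, 4, 6])   # reg == [10, 12, 14, 16]

out = [0] * 8
scatter(out, [0, 2, 4, 6], reg)      # out == [10, 0, 12, 0, 14, 0, 16, 0]
```

The two helpers are exact inverses over the same index set, which mirrors how FIG. 1 later pairs the two instructions.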
Where strided memory access is supported, the vector data elements of a SIMD datum may come from non-contiguous memory addresses. The operands of the Gather instruction in AVX2 are a base address plus a vector register that holds the offset (displacement) of each element in the SIMD data relative to the base address, allowing the CPU to "gather" several discrete data items into one SIMD register. The Scatter instruction can "scatter" the data in a register to different locations in memory. The Gather and Scatter instructions are complex instructions that are split into multiple micro-operations to perform the individual load and write-back operations.
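The per-element address computation described above, base address plus each lane's offset taken from the index vector register, can be sketched as follows (function name and example values are illustrative):

```python
def gather_addresses(base, offsets):
    """One load address per SIMD lane: the base address plus that lane's
    offset (displacement) held in the index vector register."""
    return [base + off for off in offsets]

# Four discrete elements; the offsets play the role of the vector-register operand.
addrs = gather_addresses(0x1000, [0, 16, 32, 48])
# addrs == [0x1000, 0x1010, 0x1020, 0x1030]
```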
It is desirable to improve parallelism among micro-operations of the Gather instruction to shorten instruction execution time.
Disclosure of Invention
According to one aspect of the present application, there is provided a method of loading data into a vector register using a Gather instruction, comprising: loading data in the memory to a first vector register by a first set of load operations of a plurality of load operations of the Gather instruction; loading data in memory to a second vector register by a second set of load operations of the plurality of load operations of the Gather instruction, wherein the second vector register is different from the first vector register, wherein a last load operation of the first set of load operations and a first load operation of the second set of load operations are adjacent among the plurality of load operations of the Gather instruction; the data in the first vector register and the data in the second vector register are merged into one vector register.
According to another aspect of the present application, there is provided a processor for loading data into a vector register using a Gather instruction, comprising: a loader configured to load data in memory to a first vector register through a first set of load operations of a plurality of load operations of the Gather instruction; loading data in memory to a second vector register by a second set of load operations of the plurality of load operations of the Gather instruction, wherein the second vector register is different from the first vector register, wherein a last load operation of the first set of load operations and a first load operation of the second set of load operations are adjacent among the plurality of load operations of the Gather instruction; and a combiner configured to combine the data in the first vector register and the data in the second vector register into one vector register.
According to another aspect of the present application, there is provided a method of loading data into a register, comprising: loading data in the memory into a first register through a first set of load operations of the plurality of load operations; loading data in the memory to a second register by a second set of load operations of the plurality of load operations, wherein the second register is different from the first register, wherein a last load operation of the first set of load operations and a first load operation of the second set of load operations are adjacent in the plurality of load operations; the data in the first register and the data in the second register are merged into one register.
According to another aspect of the present application, there is provided a processor for loading data into a register, comprising: a loader configured to load data in the memory to the first register through a first set of load operations of the plurality of load operations; loading data in the memory to a second register by a second set of load operations of the plurality of load operations, wherein the second register is different from the first register, wherein a last load operation of the first set of load operations and a first load operation of the second set of load operations are adjacent in the plurality of load operations; and a combiner configured to combine the data in the first register and the data in the second register into one register.
According to another aspect of the present application, there is provided an electronic device including: a memory for storing instructions; and a processor for reading the instructions in the memory and performing the method according to various embodiments of the application.
According to another aspect of the application, there is provided a non-transitory storage medium having instructions stored thereon, wherein the instructions, when read by a processor, cause the processor to perform a method according to various embodiments of the application.
Drawings
To more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present disclosure; a person of ordinary skill in the art may obtain other drawings from these drawings without inventive effort.
FIG. 1 illustrates the operation of a conventional Gather instruction and Scatter instruction in registers.
Fig. 2 shows the execution of a conventional SIMD add instruction.
FIG. 3 shows a flow chart of a conventional Gather instruction as it executes.
FIG. 4 is a flow chart showing how a conventional Gather instruction loads data multiple times.
FIG. 5 illustrates a flow chart of a method of loading data to a vector register with a Gather instruction according to embodiments of the present application.
FIG. 6 illustrates a schematic diagram of one example of a method of loading data to a vector register using a Gather instruction, according to embodiments of the present application.
FIG. 7 illustrates a schematic diagram of another example of a method of loading data to a vector register using a Gather instruction according to embodiments of the present application.
FIG. 8 illustrates a flow chart of a method 800 of loading data into registers according to an embodiment of the present application.
FIG. 9 shows a block diagram of a processor that loads data to vector registers using a Gather instruction according to embodiments of the present application.
FIG. 10 illustrates a block diagram of a processor that loads data to vector registers using a Gather instruction according to embodiments of the present application.
Fig. 11 illustrates a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the present application.
Fig. 12 shows a schematic diagram of a non-transitory readable storage medium according to an embodiment of the disclosure.
Detailed Description
Reference will now be made in detail to the specific embodiments of the present application, examples of which are illustrated in the accompanying drawings. While the present application will be described in conjunction with the specific embodiments, it will be understood that it is not intended to limit the present application to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the application as defined by the appended claims. It should be noted that the method steps described herein may be implemented by any functional block or arrangement of functions, and any functional block or arrangement of functions may be implemented as a physical entity or a logical entity, or a combination of both.
FIG. 1 illustrates the operation of a conventional Gather instruction and Scatter instruction in registers.
As shown in FIG. 1, the Gather instruction may gather data from discrete addresses in memory, for example 0, 2, 4, 6, into consecutive addresses in a register, for example 0, 1, 2, 3. The Scatter instruction may scatter the data at addresses 0, 1, 2, 3 in a register, for example, writing it to addresses 0, 2, 4, 6 in memory.
When executing SIMD instructions, the CPU stores multiple sets of data in vector registers and operates on those sets simultaneously. Fig. 2 shows the execution of a conventional SIMD add instruction. As shown in fig. 2, x and y are vector registers, each containing 4 elements (A, B, C, D); a single SIMD add instruction adds the 4 pairs of elements (Ax+Ay, Bx+By, Cx+Cy, Dx+Dy).
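The lane-wise addition of fig. 2 can be sketched as a single operation applied to every element pair at once (Python lists stand in for the two vector registers):

```python
def simd_add(x, y):
    """One SIMD add: the same operation applied to every lane pair at once."""
    return [a + b for a, b in zip(x, y)]

x = [1, 2, 3, 4]      # Ax, Bx, Cx, Dx
y = [10, 20, 30, 40]  # Ay, By, Cy, Dy
# simd_add(x, y) == [11, 22, 33, 44], i.e. Ax+Ay, Bx+By, Cx+Cy, Dx+Dy
```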
In the earliest proposed Scale-Index-Base (SIB) addressing mode (the inputs used to compute the virtual address of a memory access), scalar registers (64-bit-wide registers storing fixed-point data, i.e., integers, also known as fixed-point registers) are used to store the index value. The later Advanced Vector Extensions 2 (AVX2) instruction set introduced a new Vector-SIB (VSIB) addressing mode (differing from SIB in that the index values reside in a vector register), which uses a vector register to store multiple index values. The VSIB addressing mode enables multiple virtual addresses to be calculated from multiple index values, and a VSIB-addressed instruction can use those multiple virtual addresses to access memory, whereas an original SIB instruction can give only one index value and compute only one virtual address.
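The SIB-versus-VSIB contrast can be sketched with one address formula applied to either a single scalar index or a whole index vector (the function name and numeric values are illustrative):

```python
def vsib_addresses(base, index_vector, scale):
    """VSIB addressing: one virtual address per index value in the vector
    register. SIB is the degenerate case of a single scalar index."""
    return [base + idx * scale for idx in index_vector]

sib_address = vsib_addresses(0x2000, [5], 8)[0]   # SIB: one index, one address
vsib = vsib_addresses(0x2000, [0, 2, 4, 6], 8)    # VSIB: four addresses at once
# sib_address == 0x2028; vsib == [0x2000, 0x2010, 0x2020, 0x2030]
```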
Currently, only the two Gather/Scatter instructions use the VSIB addressing mode: the Gather instruction reads the data at multiple VSIB-computed addresses and places the read results into a vector register in order, and the Scatter instruction writes the values in a vector register to the multiple VSIB-computed addresses in order.
FIG. 3 shows a flow chart of a conventional Gather instruction as it executes. As shown in FIG. 3, the Gather instruction obtains each index value from the vector register, calculates the address corresponding to each index, and loads the data at each address into the vector register.
However, among X86 vector instructions, instructions that modify only part of a vector register's data (e.g., only some of Ax, Bx, Cx, Dx in fig. 2) are used infrequently, and taking the complexity of physical circuit design into account, actual circuit designs provide no circuit operation that writes only part of a vector register. Because the Gather instruction stores the data of its multiple loads into different positions of the same vector register, each load must wait until the previous load has written its result back into the vector register, read the register's value, merge it with the currently loaded data, and then write the result back into the vector register. Thus every load in the Gather instruction has a dependency on its predecessor, which limits the execution speed of the Gather instruction.
FIG. 4 is a flow chart showing how a conventional Gather instruction loads data multiple times.
The Gather instruction stores the data of its multiple loads into different positions of the same vector register, so each load must wait until the previous load has written its result back into the vector register, then read the vector register's value, merge it with the currently loaded data, and write the result back into the vector register. Thus every two adjacent load operations are interdependent.
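The read-merge-write-back cycle above can be sketched as a serial simulation in which every step consumes the full register value produced by the previous step, making the loads one long dependency chain (helper name and values are illustrative):

```python
def serial_gather(memory, indices, width):
    """Each load reads the full register written by the previous load,
    inserts one newly loaded element, and writes the full register back,
    so the len(indices) loads form a single dependency chain."""
    reg = [0] * width
    for slot, idx in enumerate(indices):
        prev = list(reg)           # read the previous register value (the dependency)
        prev[slot] = memory[idx]   # merge in the newly loaded element
        reg = prev                 # write the whole register back
    return reg

memory = [100, 101, 102, 103, 104, 105, 106, 107]
result = serial_gather(memory, [1, 3, 5, 7], width=4)
# result == [101, 103, 105, 107]
```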
In the optimization scheme of the present application, the multiple load operations are grouped: load operations within the same group may depend on one another, load operations in different groups are independent, and finally the results stored in the temporarily used vector registers are merged and written back into the vector register specified at the instruction level. The grouping may be 2 load operations per group, 4 load operations per group, or some other granularity. For example, with 2 load operations per group, the even-numbered load depends on the result of the odd-numbered load, i.e., each pair of loads writes back into the same temporary vector register and the two are interdependent. After the different dependency chains have finished executing, a merge micro-operation (UOP, from which the CPU decides what operations to perform) is needed to write the results in the multiple temporary vector registers back into one temporary vector register.
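The grouping scheme above can be sketched as follows: each group fills its own temporary register as a short serial chain, and a final merge step combines the temporaries. The function and its parameters are illustrative, not the actual micro-op encoding:

```python
def grouped_gather(memory, indices, group_size):
    """Split the loads into independent groups. Each group fills its own
    temporary register (its own short dependency chain); a final merge
    UOP combines the temporaries into one result register."""
    n = len(indices)
    temps = []
    for g in range(0, n, group_size):
        temp = [None] * n                       # one temporary register per chain
        for slot in range(g, min(g + group_size, n)):
            temp[slot] = memory[indices[slot]]  # serial only within the group
        temps.append(temp)
    # Final merge UOP: each slot was filled by exactly one chain.
    return [next(t[slot] for t in temps if t[slot] is not None)
            for slot in range(n)]

memory = list(range(100, 132))
result = grouped_gather(memory, [1, 5, 9, 13, 17, 21, 25, 29], group_size=2)
# result == [101, 105, 109, 113, 117, 121, 125, 129], identical to the serial version,
# but the 4 chains could run in parallel.
```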
Because each split-off dependency chain requires a temporary vector register to store the data loaded back (using the same register would make successive load instructions write to the same register, so a later load would directly overwrite the value of an earlier one), splitting into n dependency chains uses n temporary vector registers (n being a positive integer greater than 1). Therefore, when the running program is short of vector registers, the optimization scheme for the Gather instruction needs to be adjusted: for example, a dependency chain originally formed by two adjacent load instructions can be changed into one formed by four load instructions. This sacrifices some execution parallelism of an individual Gather instruction, but by reducing vector-register usage it reduces the stalls of non-Gather instructions caused by a shortage of vector registers, thereby improving the performance of the program as a whole.
Splitting into more dependency chains uses more temporary registers and more micro-operations to merge the final results, and the performance gain from splitting dependency chains does not grow linearly with the number of chains. In an implementation, the number of temporary registers used, the cost in micro-operations of merging the data, and the performance gain from increased parallelism should therefore be weighed together to determine how many dependency chains to split into for parallel execution.
One scenario in which compilers use the Gather instruction heavily is regular access to arrays, i.e., the difference between the address of each Gather load and the address of the previous load is a fixed value, or the change in that difference follows a pattern. After a period of training, the prefetcher in the processor can then easily predict which addresses the Gather instruction will load next when it is encountered again. The prefetcher can issue prefetches in advance during the current load phase, fetching the data to be loaded next into the L1 Data Cache. When the Gather instruction performs its next load, the required data is already in the L1 Data Cache, greatly shortening the time from issuing each load to the data returning to the register. The bottleneck of the Gather instruction is therefore mainly the dependency relationships among its multiple load operations rather than load latency, which further supports the effectiveness of the optimization in this application.
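The fixed-stride prediction described above can be sketched as follows; this is a simplified model of prefetcher training (real prefetchers also confirm the stride over several observations), with illustrative names and byte offsets:

```python
def predict_next_addresses(history):
    """If the gather addresses advance by a fixed stride between executions,
    predict the next address set by adding that stride once more."""
    stride = history[-1][0] - history[-2][0]   # observed inter-execution delta
    return [addr + stride for addr in history[-1]]

# Regular array traversal: each execution of the Gather moves 32 bytes forward.
history = [[0, 8, 16, 24], [32, 40, 48, 56]]
prediction = predict_next_addresses(history)
# prediction == [64, 72, 80, 88]: these lines can be prefetched into the L1 Data Cache
# while the current loads are still in flight.
```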
Various embodiments of the present application are described in detail below.
FIG. 5 illustrates a flow chart of a method 500 of loading data into a vector register using a Gather instruction according to embodiments of the present application.
As shown in fig. 5, the method 500 of loading data into a vector register using a Gather instruction includes: step 510, loading data in memory into a first vector register through a first set of load operations of the plurality of load operations of the Gather instruction; step 520, loading data in memory into a second vector register through a second set of load operations of the plurality of load operations of the Gather instruction, wherein the second vector register is different from the first vector register, and wherein a last load operation of the first set and a first load operation of the second set are adjacent among the plurality of load operations of the Gather instruction; and step 530, merging the data in the first vector register and the data in the second vector register into one vector register.
In this way, for the dependency relationships among the micro-operations of the Gather instruction, a micro-operation scheme that splits the dependency chain is provided. A first set of (mutually dependent) load operations among the plurality of load operations of the Gather instruction loads into a first vector register, and a second set of (mutually dependent) load operations loads into a second vector register different from the first. The dependency between the first and second sets of load operations (at least between the last load operation of the first set and the first load operation of the second set) is thereby broken, enabling them to execute in parallel. This improves the parallelism of the micro-operations within the Gather instruction, greatly shortening the execution time of the Gather instruction and objectively improving the performance of the final processor.
FIG. 6 illustrates a schematic diagram of one example of a method of loading data to a vector register using a Gather instruction, according to embodiments of the present application.
As shown on the left side of fig. 6, the original Gather instruction stores the data of its multiple loads into different positions of the same vector register, so each load must wait until the previous load has written its result back into the vector register, then read the vector register's value, merge it with the currently loaded data, and write the result back into the vector register. Conventionally, therefore, every two adjacent load operations are interdependent.
As shown on the right side of fig. 6, the 8 load operations of the Gather instruction, which would otherwise be interdependent because they load into the same vector register, are divided into, for example, 4 groups: the first group is 2 consecutive/adjacent load operations, the second group is 2 consecutive/adjacent load operations, and likewise for the third and fourth groups. It can be seen that the load operations in the first group are adjacent to those in the second group, but not adjacent to those in the third or fourth group, and so on.
The present application not only divides the 8 load operations into 4 groups, but also loads the data in memory into 4 different vector registers, one per group. That is, data in memory is loaded into a first vector register by a first set of the load operations of the Gather instruction, into a second vector register different from the first by a second set of the load operations, and into one or more third vector registers different from the first and second by one or more third sets of the load operations.
Because the first vector register, the second vector register, and the one or more third vector registers are all different destination registers, the data loaded into the second vector register need not wait to merge with the data loaded into the first, the data loaded into the one or more third vector registers need not wait for the data loaded into the first or second, and so on. Since data loaded into one vector register never waits on data loaded into another, the load operations targeting different vector registers do not depend on each other; they are mutually independent and can execute in parallel, improving the parallelism of the load operations in the Gather instruction and speeding up its execution.
That is, dividing the load operations of the Gather instruction into a given number of groups requires the same number of vector registers; these are temporary registers, and the data loaded into them is finally merged into one vector register.
As shown on the right side of fig. 6, only 2 adjacent load operations are mutually dependent, i.e., those 2 adjacent load operations write back to the same vector register, load operations in different groups write back to different vector registers, and finally the results stored in the temporarily used vector registers are merged and written back into the vector register specified at the instruction level. In the example on the right of fig. 6, each group contains 2 adjacent load operations, an odd-numbered and an even-numbered load. For example, the even-numbered load may depend on the result of the odd-numbered load, i.e., each pair of loads writes back into the same temporary vector register and the two are interdependent. Loads of different groups write back into different temporary vector registers, are independent of each other, and can execute in parallel. After the different dependency chains have finished executing, merge micro-operations write the results in the multiple temporary vector registers back into one temporary vector register.
Of course, although the example above divides the 8 load operations into groups containing the same number of load operations, the groups may contain different numbers of load operations and still improve the parallelism of the load operations in the Gather instruction and speed up its execution. For example, the 8 load operations may be divided into 3 groups, the first comprising 2 load operations, the second comprising 4, and the third comprising 2; parallelism is still improved, except that the merge may need to wait for the 4 load operations of the second group to complete.
An existing approach surveys existing test programs and evaluates the performance of different splitting schemes to find the one with the best average performance across those programs, then statically fixes the splitting scheme applied whenever a Gather instruction is encountered, e.g., always splitting into groups of a fixed number of load operations.
However, it is also contemplated to implement multiple splitting schemes for the Gather instruction and have the processor dynamically detect the current state (e.g., the number of vector registers currently free) to decide whether the Gather instruction should choose an aggressive scheme that splits into more dependency chains or a conservative scheme that splits into fewer, reserving vector registers for other instructions.
Specifically, the smaller the number of load operations in each group, the more groups can execute independently of one another and the more vector registers are needed, and thus the higher the parallelism of the load operations in the Gather instruction. For example, if the 8 load operations are divided into groups of 2, 4 groups (dependency chains) are obtained and the parallelism is higher; if they are divided into groups of 4, only 2 groups (dependency chains) execute in parallel, and the parallelism is lower than with 4 groups. Thus, if it is determined that the execution parallelism of the Gather instruction should be increased, the number of load operations in each group is reduced. If the CPU can determine that the Gather instruction is a critical instruction on the instruction dependency chain, i.e. many instructions depend on it, then the execution parallelism (or execution time) of the Gather instruction is what matters most, because even if the vector register resources were given to other instructions, those instructions would likely be unable to execute since they depend on the Gather instruction. (This, of course, presupposes that the CPU can determine that the Gather instruction is on a critical path.)
In addition, the number of load operations in each group may also be based on the number of available vector registers. Each split-off dependency chain requires a temporary vector register to store the data loaded back (if the same register were shared, out-of-order execution of the load operations would make it impossible to know which load the data in the register came from), so splitting into n dependency chains uses n temporary vector registers (n being a positive integer greater than 1). Thus, if the number of available vector registers is greater than a predetermined threshold, the number of load operations in each group may be reduced, so that the number of dependency chains, and hence the number of temporary vector registers required, increases. However, when the running program is short of vector registers (e.g., the number of available vector registers is less than or equal to the predetermined threshold), the optimization scheme for the Gather instruction needs to be adjusted: the number of load operations in each group may be increased, so that the number of dependency chains, and hence the number of temporary vector registers required, decreases.
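The register-pressure heuristic of this paragraph can be sketched as a small decision function. The threshold and group sizes below are illustrative values only, not taken from the embodiment.

```python
def choose_group_size(num_loads, free_vector_regs, reg_threshold=8,
                      aggressive_size=2, conservative_size=4):
    """Pick the number of loads per group (dependency chain length).

    When free vector registers exceed the threshold, use small groups
    (more chains, more temporary registers, higher parallelism); when
    registers are scarce, use large groups to economize on temporaries.
    Returns (loads_per_group, number_of_chains).
    """
    if free_vector_regs > reg_threshold:
        size = aggressive_size    # plenty of registers: split aggressively
    else:
        size = conservative_size  # register pressure: fewer, longer chains
    return size, num_loads // size
```

For 8 loads, the aggressive choice yields 4 chains of 2 loads each, and the conservative choice yields 2 chains of 4 loads each, matching the two arrangements discussed above.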
For example, the dependency chains may be changed from one chain per two adjacent load operations to one chain per four load operations. FIG. 7 illustrates a schematic diagram of another example of a method of loading data into a vector register using a Gather instruction according to an embodiment of the present application. As shown in FIG. 7, every four loads constitute one dependency chain, i.e. there are 2 groups (2 dependency chains) in total. Because only 2 dependency chains can run in parallel, and the 4 load operations within each chain depend on one another, some execution parallelism of the Gather instruction is sacrificed; however, by reducing the use of vector registers, stalls of non-Gather instructions caused by a shortage of vector registers can be reduced, improving the performance of the program as a whole.
Thus, the number of load operations in each group may be chosen by jointly considering the required execution parallelism of the Gather instruction and the number of available vector registers. Splitting into more dependency chains uses more temporary registers and more micro-operations to merge the final results, and the performance gain from splitting is not a function that grows linearly with the number of chains. In practice, therefore, the number of temporary registers used, the cost in micro-operations of merging the data, and the performance gain from increased parallelism are weighed together to determine into how many parallel dependency chains to split.
In summary, in one embodiment, execution of each pair of adjacent load operations in the first set of load operations is interdependent, execution of each pair of adjacent load operations in the second set of load operations is interdependent, and execution of the first set of load operations is independent of execution of the second set of load operations.
In one embodiment, the load operations of the first set other than the last load operation are adjacent or non-adjacent, and the load operations of the second set other than the first load operation are adjacent or non-adjacent. That is, the first or second set may also contain two load operations that are not interdependent, as long as at least the last load operation of the first set and the first load operation of the second set are adjacent among the plurality of load operations of the Gather instruction; i.e., according to embodiments of the present application, two adjacent load operations that would otherwise be interdependent are changed into two independent load operations that can execute in parallel. Of course, in most cases the load operations of the first set other than the last, and of the second set other than the first, are also adjacent.
The first set of load operations includes a first number of load operations, the second set of load operations includes a second number of load operations, and the first number and the second number are the same or different and are both greater than one, wherein the smaller the first number and/or the second number, the greater the number of vector registers used and the greater the parallelism of execution of the Gather instruction.
Then, since data from memory is loaded into separate temporary vector registers in order to improve parallelism, the loaded data is merged into one vector register after the loads complete.
In one embodiment, merging the data in the first vector register and the data in the second vector register into one vector register comprises one of the following steps: merging the data in the first vector register and the data in the second vector register into the first vector register or the second vector register; or merging the data in the first vector register and the data in the second vector register into a vector register different from both the first vector register and the second vector register. That is, the data in the two vector registers may be merged into one of them, or into another, separate vector register. In either case, the data scattered across the registers is merged into one vector register, thereby completing the operation the Gather instruction is meant to perform, i.e. fetching data from memory and writing it into the same vector register.
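One way to picture the merge step is a lane-mask blend, similar in spirit to a vector blend operation. The sketch below is a hypothetical model; the mask-based lane selection is an assumption about how lanes could be attributed to each temporary register, not the embodiment's specified mechanism.

```python
def merge_two_registers(reg_a, reg_b, lanes_from_a):
    """Merge two temporary vector registers into one destination.

    Lane i of the result is taken from reg_a where lanes_from_a[i] is
    True, and from reg_b otherwise. The destination may be reg_a, reg_b,
    or a third register; here a new list plays that role.
    """
    return [a if take_a else b
            for a, b, take_a in zip(reg_a, reg_b, lanes_from_a)]
```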
One scenario in which a compiler uses Gather instructions heavily is regular access to an array, that is, the difference between the address loaded by the Gather and the address loaded the previous time is a fixed value, or the difference changes according to some rule. After a period of training, the prefetcher in the processor can therefore easily predict which address the Gather instruction will load the next time it is encountered. The prefetcher can issue the prefetch in advance, during the current load, fetching the data to be loaded next into the L1 data cache (caches are divided into L1, L2 and L3 caches in order of access speed from high to low). When the Gather instruction performs the next load, the needed data is already in the L1 data cache, and fetching it directly from the L1 data cache greatly shortens the time from issuing the load to the data returning to the register, compared with fetching from memory each time. In that case, the bottleneck of the Gather instruction is mainly the dependence among its multiple load operations rather than the latency of the load operations, which further demonstrates the effectiveness of the optimization scheme of the present application.
Thus, in one embodiment, the method 500 may further comprise: predicting which address in memory will be accessed by a next load operation using a prediction model that is trained based on the accesses of individual load operations to individual addresses in memory; prefetching the data at the predicted address from memory into an L1 data cache during the current load operation; and, in response to performing the next load operation, loading the prefetched data from the L1 data cache into the first, second, or one or more third vector registers.
Therefore, the access pattern of the memory addresses accessed by the load instructions is obtained through training; on each load, the address of the next load is estimated from this pattern, and data not yet in the L1 data cache is fetched into the L1 data cache in advance, reducing the latency caused by an L1 data cache miss on the next load.
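A minimal software sketch of such a trained predictor is a stride detector. This is a deliberate simplification (real prefetchers track many streams with confidence counters), and every name below is hypothetical.

```python
class StridePrefetcher:
    """Learn a constant stride from observed load addresses and predict
    the next address, so its data can be prefetched into the L1 cache."""

    def __init__(self):
        self.last_addr = None
        self.stride = None

    def observe(self, addr):
        """Train on the address of the current load."""
        if self.last_addr is not None:
            self.stride = addr - self.last_addr
        self.last_addr = addr

    def predict_next(self):
        """Predicted address of the next load, or None before training."""
        if self.stride is None:
            return None
        return self.last_addr + self.stride
```

For an array walked with a fixed step (the regular access pattern described above), the predictor locks onto the stride after two observations and thereafter predicts every subsequent load address.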
In addition, the scheme of grouping the load operations in the Gather instruction may be combined with the prefetcher. If the prefetcher prefetches well for a given Gather instruction and the hit rate on the next load address is high, the bottleneck of that Gather instruction can be considered to lie in its internal dependence, and the scheme provided by the embodiments of the present application is preferably applied. However, if the prefetcher prefetches poorly for an irregularly accessed Gather instruction, the bottleneck can be considered to lie in the latency of loading data back from memory rather than in the internal dependence, and the optimization scheme of the present application may be omitted, i.e. there is no need to spend additional temporary vector registers and micro-operations to improve the parallelism inside that Gather instruction.
Thus, in one embodiment, the method 500 further comprises: if the rate at which the address predicted by the prediction model hits the actual access address of the next load operation is greater than a predetermined threshold, performing the method of loading data into a vector register using the Gather instruction; if the hit rate is less than or equal to the predetermined threshold, not performing the method of loading data into a vector register using the Gather instruction.
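This gating policy amounts to a one-line predicate. The 0.9 threshold below is purely illustrative; the embodiment only requires some predetermined threshold.

```python
def should_split_dependency_chains(prefetch_hit_rate, threshold=0.9):
    """Apply the Gather dependency-chain split only when the prefetcher
    hits well, i.e. when the bottleneck is the internal dependence among
    the loads rather than memory latency."""
    return prefetch_hit_rate > threshold
```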
In summary, for the dependence among the micro-operations in the Gather instruction, the present application provides a micro-operation implementation scheme that splits the dependency chain, improving the parallelism of the micro-operations in the Gather instruction, greatly shortening the execution time of the Gather instruction, and ultimately improving the performance of the processor.
In addition, for other complex instructions with severe internal dependence, the idea provided by the scheme of the present application may be borrowed: the dependence can be broken by storing intermediate results in temporary registers (not necessarily vector registers).
FIG. 8 illustrates a flow chart of a method 800 of loading data into registers according to an embodiment of the present application.
As shown in fig. 8, a method 800 of loading data into a register includes: step 810, loading data in a memory to a first register through a first set of load operations of a plurality of load operations; step 820, loading data in the memory to a second register through a second set of load operations of the plurality of load operations, wherein the second register is different from the first register, wherein a last load operation of the first set of load operations and a first load operation of the second set of load operations are adjacent in the plurality of load operations; step 830, merging the data in the first register and the data in the second register into one register.
In this way, the dependence within severely dependent complex instructions can be broken by storing intermediate results in temporary registers (not necessarily vector registers) and finally merging them, improving the parallelism of instruction execution, shortening the execution time of the instructions, and ultimately improving the performance of the processor.
FIG. 9 shows a block diagram of a processor 900 that loads data to vector registers using a Gather instruction according to embodiments of the present application.
As shown in fig. 9, a processor 900 for loading data into vector registers using a Gather instruction includes: a loader 910 configured to load data in memory to a first vector register through a first set of load operations of a plurality of load operations of a Gather instruction; loading data in memory to a second vector register by a second set of load operations of the plurality of load operations of the Gather instruction, wherein the second vector register is different from the first vector register, wherein a last load operation of the first set of load operations and a first load operation of the second set of load operations are adjacent among the plurality of load operations of the Gather instruction; a combiner 920 configured to combine the data in the first vector register and the data in the second vector register into one vector register.
In one embodiment, execution of each adjacent load operation in the first set of load operations is interdependent, execution of each adjacent load operation in the second set of load operations is interdependent, and execution of the first set of load operations is independent of execution of the second set of load operations.
In one embodiment, the load operations of the first set other than the last load operation are adjacent or non-adjacent, the load operations of the second set other than the first load operation are adjacent or non-adjacent, and the first set of load operations includes a first number of load operations, the second set of load operations includes a second number of load operations, and the first number and the second number are the same or different and are both greater than one, wherein the smaller the first number and/or the second number, the greater the number of vector registers used and the greater the parallelism of execution of the Gather instruction.
In one embodiment, the combiner 920 is configured to perform one of the following steps: merging data in the first vector register and data in the second vector register into the first vector register or the second vector register; the data in the first vector register and the data in the second vector register are merged into one vector register different from the first vector register or the second vector register.
In one embodiment, the loader 910 is further configured to: load data in memory into one or more third vector registers by one or more third sets of load operations of the Gather instruction, wherein the one or more third vector registers are different from both the second vector register and the first vector register; and merge the data in the one or more third vector registers into the one vector register.
In one embodiment, processor 900 further includes a prefetcher (not shown) configured to: predict which address in memory will be accessed by a next load operation using a prediction model trained based on the accesses of individual load operations to individual addresses in memory; and prefetch the data at the predicted address from memory into an L1 data cache during the current load operation; wherein the loader is configured to load the prefetched data from the L1 data cache into the first, second, or one or more third vector registers in response to performing the next load operation.
In one embodiment, the prefetcher is configured such that: if the rate at which the address predicted by the prediction model hits the actual access address of the next load operation is greater than a predetermined threshold, the method of loading data into a vector register using the Gather instruction is performed; if the hit rate is less than or equal to the predetermined threshold, the method of loading data into a vector register using the Gather instruction is not performed.
In this way, for the dependence among the micro-operations in the Gather instruction, a micro-operation implementation scheme that splits the dependency chain is provided: a first set of (mutually dependent) load operations among the plurality of load operations of the Gather instruction loads into a first vector register, and a second set of (mutually dependent) load operations loads into a second vector register different from the first. The dependence between the first set and the second set of load operations is thereby broken, so that the two sets can execute in parallel, improving the parallelism of the micro-operations in the Gather instruction, greatly shortening the execution time of the Gather instruction, and ultimately improving the performance of the processor.
FIG. 10 shows a block diagram of a processor 1000 that loads data into registers according to an embodiment of the present application.
As shown in fig. 10, a processor 1000 for loading data into registers includes: a loader 1010 configured to load data in the memory into the first register through a first set of load operations of the plurality of load operations; loading data in the memory to a second register by a second set of load operations of the plurality of load operations, wherein the second register is different from the first register, wherein a last load operation of the first set of load operations and a first load operation of the second set of load operations are adjacent in the plurality of load operations; a combiner 1020 configured to combine the data in the first register and the data in the second register into one register.
In this way, the dependence within severely dependent complex instructions can be broken by storing intermediate results in temporary registers (not necessarily vector registers) and finally merging them, improving the parallelism of instruction execution, shortening the execution time of the instructions, and ultimately improving the performance of the processor.
Fig. 11 illustrates a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the present application.
The electronic device may include a processor (H1); a storage medium (H2) coupled to the processor (H1) and having stored therein processor-executable instructions for performing the steps of the methods of the embodiments of the present application when executed by the processor.
The processor (H1) may include, but is not limited to, for example, one or more processors or microprocessors or the like.
The storage medium (H2) may include, for example, but is not limited to, Random Access Memory (RAM), Read Only Memory (ROM), flash memory, EPROM memory, EEPROM memory, registers, and computer storage media (e.g., a hard disk, a floppy disk, a solid state disk, a removable disk, a CD-ROM, a DVD-ROM, a Blu-ray disc, etc.).
In addition, the electronic device may include a data bus (H3), an input/output (I/O) bus (H4), a display (H5), and an input/output device (H6) (e.g., keyboard, mouse, speaker, etc.), etc.
The processor (H1) may communicate with external devices (H5, H6, etc.) via a wired or wireless network (not shown) through an I/O bus (H4).
The storage medium (H2) may also store at least one processor executable instruction for performing the functions and/or steps of the methods in the embodiments described in the present technology when executed by the processor (H1).
In one embodiment, the at least one processor-executable instruction may also be compiled or otherwise formed into a software product in which one or more processor-executable instructions, when executed by a processor, perform the functions and/or steps of the methods described in the embodiments of the technology.
Fig. 12 shows a schematic diagram of a non-transitory readable storage medium according to an embodiment of the disclosure.
As shown in fig. 12, the readable storage medium 1220 has instructions stored thereon, such as readable instructions 1210. When the readable instructions 1210 are executed by a processor, the various methods described above may be performed. Readable storage media include, but are not limited to, volatile and/or nonvolatile memory. Volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. Nonvolatile memory may include, for example, Read Only Memory (ROM), hard disks, flash memory, and the like. For example, the readable storage medium 1220 may be connected to a computing device such as a computer, and the various methods described above may then be performed with the computing device running the readable instructions 1210 stored on the readable storage medium 1220.
In summary: 1. For the dependence among the micro-operations in the Gather instruction, a micro-operation implementation scheme that splits the dependency chain is provided, improving the parallelism of the micro-operations in the Gather instruction and thereby greatly shortening its execution time. 2. At the same time, considering that the additional resources required by the splitting scheme affect the execution efficiency of other, non-Gather instructions, the specific splitting scheme is chosen in view of the program as a whole.
Of course, the specific embodiments described above are merely examples and are not limiting; those skilled in the art may, according to the concepts of the present application, combine steps and means from the separately described embodiments to achieve the effects of the present application. Such combined embodiments are also included in the present application and are not described here one by one.
Note that the advantages, effects, and the like mentioned in the present disclosure are merely examples and are not to be construed as necessarily essential to the various embodiments of the present application. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the application is not intended to be limited to the details disclosed herein as such.
The block diagrams of the devices, apparatuses, and systems referred to in this disclosure are merely illustrative examples and are not intended to require or imply that the connections, arrangements, or configurations must be made in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, these devices, apparatuses, and systems may be connected, arranged, and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words meaning "including but not limited to" and are used interchangeably therewith. The term "or" as used herein refers to, and is used interchangeably with, the term "and/or," unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to."
The step flow diagrams in this disclosure and the above method descriptions are merely illustrative examples and are not intended to require or imply that the steps of the various embodiments must be performed in the order presented. The order of steps in the above embodiments may be performed in any order, as will be appreciated by those skilled in the art. Words such as "thereafter," "then," "next," and the like are not intended to limit the order of the steps; these words are simply used to guide the reader through the description of these methods. Furthermore, any reference to an element in the singular, for example, using the articles "a," "an," or "the," is not to be construed as limiting the element to the singular.
In addition, the steps and means in the various embodiments herein are not limited to practice in a certain embodiment, and indeed, some of the steps and some of the means associated with the various embodiments herein may be combined according to the concepts of the present application to contemplate new embodiments, which are also included in the scope of the present application.
The individual operations of the above-described method may be performed by any suitable means capable of performing the corresponding functions. The means may include various hardware and/or software components and/or modules including, but not limited to, circuitry for hardware, an Application Specific Integrated Circuit (ASIC), or a processor.
The various illustrative logical blocks, modules, and circuits described herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an ASIC, a Field Programmable Gate Array (FPGA) or other Programmable Logic Device (PLD), discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any commercially available processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in any form of tangible storage medium. Some examples of storage media that may be used include Random Access Memory (RAM), Read Only Memory (ROM), flash memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, and so forth. A storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. A software module may be a single instruction or many instructions, and may be distributed over several different code segments, among different programs, and across multiple storage media.
The methods disclosed herein include acts for implementing the described methods. The methods and/or acts may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of acts is specified, the order and/or use of specific acts may be modified without departing from the scope of the claims.
The functions described above may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored as instructions on a tangible, readable medium. A storage medium may be any available tangible medium that can be accessed by a computer. By way of example, and not limitation, such readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other tangible medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. As used herein, disk and disc include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.
Thus, the computer program product may perform the operations presented herein. For example, such a computer program product may be a readable tangible medium having instructions tangibly stored (and/or encoded) thereon, the instructions being executable by a processor to perform operations described herein. The computer program product may comprise packaged material.
The software or instructions may also be transmitted over a transmission medium. For example, software may be transmitted from a website, server, or other remote source using a transmission medium such as a coaxial cable, fiber optic cable, twisted pair, digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, or microwave.
Furthermore, modules and/or other suitable means for performing the methods and techniques described herein may be downloaded and/or otherwise obtained by the user terminal and/or base station as appropriate. For example, such a device may be coupled to a server to facilitate the transfer of means for performing the methods described herein. Alternatively, the various methods described herein may be provided via storage means (e.g., RAM, ROM, a physical storage medium such as a CD or floppy disk, etc.) so that the user terminal and/or base station can obtain the various methods when coupled to or providing storage means to the device. Further, any other suitable technique for providing the methods and techniques described herein to a device may be utilized.
Other examples and implementations are within the scope and spirit of the disclosure and the appended claims. For example, due to the nature of software, the functions described above may be implemented using software executed by a processor, hardware, firmware, hardwired or any combination of these. Features that implement the functions may also be physically located at various locations including being distributed such that portions of the functions are implemented at different physical locations. Also, as used herein, including in the claims, the use of "or" in the recitation of items beginning with "at least one" indicates a separate recitation, such that recitation of "at least one of A, B or C" means a or B or C, or AB or AC or BC, or ABC (i.e., a and B and C), for example. Furthermore, the term "exemplary" does not mean that the described example is preferred or better than other examples.
Various changes, substitutions, and alterations are possible to the techniques described herein without departing from the techniques of the teachings, as defined by the appended claims. Furthermore, the scope of the claims of the present disclosure is not limited to the particular aspects of the process, machine, manufacture, composition of matter, means, methods and acts described above. The processes, machines, manufacture, compositions of matter, means, methods, or acts, presently existing or later to be developed that perform substantially the same function or achieve substantially the same result as the corresponding aspects described herein may be utilized. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or acts.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims (18)

1. A method of loading data into a vector register using a Gather instruction, comprising:
loading data from a memory into a first vector register through a first set of load operations among a plurality of load operations of the Gather instruction;
loading data from the memory into a second vector register through a second set of load operations among the plurality of load operations of the Gather instruction, wherein the second vector register is different from the first vector register, and wherein the last load operation of the first set of load operations and the first load operation of the second set of load operations are adjacent among the plurality of load operations of the Gather instruction; and
merging the data in the first vector register and the data in the second vector register into one vector register.
2. The method of claim 1, wherein adjacent load operations within the first set of load operations are interdependent in execution, adjacent load operations within the second set of load operations are interdependent in execution, and execution of the first set of load operations is independent of execution of the second set of load operations.
3. The method of claim 1, wherein the load operations of the first set other than its last load operation are adjacent or non-adjacent, wherein the load operations of the second set other than its first load operation are adjacent or non-adjacent, wherein the first set of load operations comprises a first number of load operations and the second set of load operations comprises a second number of load operations, the first number and the second number being the same or different and both greater than one, and wherein the more vector registers are used, the smaller the first number and/or the second number can be, and the greater the execution parallelism of the Gather instruction.
4. The method of claim 1, wherein merging the data in the first vector register and the data in the second vector register into one vector register comprises one of:
merging the data in the first vector register and the data in the second vector register into the first vector register or the second vector register; or
merging the data in the first vector register and the data in the second vector register into a vector register different from both the first vector register and the second vector register.
5. The method of claim 1, further comprising:
loading data from the memory into one or more third vector registers through one or more third sets of load operations of the Gather instruction, wherein the one or more third vector registers are different from both the first vector register and the second vector register; and
merging the data in the one or more third vector registers into the one vector register.
6. The method of claim 1, further comprising:
predicting, using a prediction model trained on the accesses of individual load operations to individual addresses in the memory, which address in the memory will be accessed by a next load operation;
prefetching, during the current load operation, the data at the predicted address from the memory into an L1 data cache; and
in response to performing the next load operation, loading the prefetched data from the L1 data cache into the first vector register, the second vector register, or the one or more third vector registers.
7. The method of claim 6, further comprising:
performing the method of loading data into a vector register using the Gather instruction if the rate at which the address predicted by the prediction model hits the actual access address of the next load operation is greater than a preset threshold; and
not performing the method of loading data into a vector register using the Gather instruction if the rate at which the address predicted by the prediction model hits the actual access address of the next load operation is less than or equal to the preset threshold.
8. A processor for loading data into a vector register using a Gather instruction, comprising:
a loader configured to load data from a memory into a first vector register through a first set of load operations among a plurality of load operations of the Gather instruction, and to load data from the memory into a second vector register through a second set of load operations among the plurality of load operations of the Gather instruction, wherein the second vector register is different from the first vector register, and wherein the last load operation of the first set of load operations and the first load operation of the second set of load operations are adjacent among the plurality of load operations of the Gather instruction; and
a combiner configured to merge the data in the first vector register and the data in the second vector register into one vector register.
9. The processor of claim 8, wherein adjacent load operations within the first set of load operations are interdependent in execution, adjacent load operations within the second set of load operations are interdependent in execution, and execution of the first set of load operations is independent of execution of the second set of load operations.
10. The processor of claim 8, wherein the load operations of the first set other than its last load operation are adjacent or non-adjacent, wherein the load operations of the second set other than its first load operation are adjacent or non-adjacent, wherein the first set of load operations comprises a first number of load operations and the second set of load operations comprises a second number of load operations, the first number and the second number being the same or different and both greater than one, and wherein the more vector registers are used, the smaller the first number and/or the second number can be, and the greater the execution parallelism of the Gather instruction.
11. The processor of claim 8, wherein the combiner is configured to perform one of:
merging the data in the first vector register and the data in the second vector register into the first vector register or the second vector register; or
merging the data in the first vector register and the data in the second vector register into a vector register different from both the first vector register and the second vector register.
12. The processor of claim 8, wherein the loader is further configured to:
load data from the memory into one or more third vector registers through one or more third sets of load operations of the Gather instruction, wherein the one or more third vector registers are different from both the first vector register and the second vector register; and
merge the data in the one or more third vector registers into the one vector register.
13. The processor of claim 8, further comprising:
a prefetcher configured to:
predict, using a prediction model trained on the accesses of individual load operations to individual addresses in the memory, which address in the memory will be accessed by a next load operation; and
prefetch, during the current load operation, the data at the predicted address from the memory into an L1 data cache;
wherein the loader is configured to, in response to performing the next load operation, load the prefetched data from the L1 data cache into the first vector register, the second vector register, or the one or more third vector registers.
14. The processor of claim 13, wherein the prefetcher is configured to:
perform the method of loading data into a vector register using the Gather instruction if the rate at which the address predicted by the prediction model hits the actual access address of the next load operation is greater than a preset threshold; and
not perform the method of loading data into a vector register using the Gather instruction if the rate at which the address predicted by the prediction model hits the actual access address of the next load operation is less than or equal to the preset threshold.
15. A method of loading data into a register, comprising:
loading data from a memory into a first register through a first set of load operations among a plurality of load operations;
loading data from the memory into a second register through a second set of load operations among the plurality of load operations, wherein the second register is different from the first register, and wherein the last load operation of the first set of load operations and the first load operation of the second set of load operations are adjacent among the plurality of load operations; and
merging the data in the first register and the data in the second register into one register.
16. A processor for loading data into registers, comprising:
a loader configured to load data from a memory into a first register through a first set of load operations among a plurality of load operations, and to load data from the memory into a second register through a second set of load operations among the plurality of load operations, wherein the second register is different from the first register, and wherein the last load operation of the first set of load operations and the first load operation of the second set of load operations are adjacent among the plurality of load operations; and
a combiner configured to merge the data in the first register and the data in the second register into one register.
17. An electronic device, comprising:
a memory for storing instructions;
a processor configured to read the instructions in the memory and perform the method of any one of claims 1 to 7 and 15.
18. A non-transitory storage medium having instructions stored thereon,
wherein the instructions, when read by a processor, cause the processor to perform the method of any one of claims 1 to 7 and 15.
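As an illustration of the split-gather scheme of claim 1, the following minimal Python sketch (purely hypothetical; the patent concerns hardware micro-operations of a Gather instruction, not software) splits a gather's load operations into two independent groups that fill two separate "vector registers" and then merges them into one result:

```python
# Hypothetical software emulation of claim 1. In hardware, the two groups
# carry no dependence on each other and may therefore execute in parallel.

def split_gather(memory, indices):
    half = len(indices) // 2
    # First set of load operations fills the first vector register.
    first_reg = [memory[i] for i in indices[:half]]
    # Second set of load operations (beginning at the load adjacent to the
    # first set's last load) fills a different, second vector register.
    second_reg = [memory[i] for i in indices[half:]]
    # Merge the data of both vector registers into one vector register.
    return first_reg + second_reg

mem = {10: 1.0, 20: 2.0, 30: 3.0, 40: 4.0}
print(split_gather(mem, [30, 10, 40, 20]))  # [3.0, 1.0, 4.0, 2.0]
```

Claims 4 and 5 generalize the same merge step: the destination may be one of the source registers, a fresh register, or a merge across three or more partial registers.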
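The prediction model of claim 6 can be illustrated with a simple last-stride predictor; this particular model is an assumption for illustration only, as the claim does not fix the form of the trained model:

```python
# Hypothetical stride predictor: "trained" on each load's access to memory,
# it predicts the next load's address so that the data at that address can
# be prefetched into the L1 data cache during the current load operation.

class StridePredictor:
    def __init__(self):
        self.last_addr = None
        self.stride = 0

    def observe(self, addr):
        # Update the model with the access of the current load operation.
        if self.last_addr is not None:
            self.stride = addr - self.last_addr
        self.last_addr = addr

    def predict_next(self):
        # Predicted address of the next load operation (None until trained).
        return None if self.last_addr is None else self.last_addr + self.stride

p = StridePredictor()
for a in (0x100, 0x140, 0x180):
    p.observe(a)
print(hex(p.predict_next()))  # 0x1c0
```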
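The gating condition of claim 7 amounts to a hit-rate check against a preset threshold; a hedged sketch (the threshold value is an assumption, not specified by the claim):

```python
# Hypothetical gating from claim 7: enable the split-gather method only
# when the predictor's hit rate exceeds a preset threshold.

def use_split_gather(hits, total, threshold=0.8):
    # hits / total: rate at which predicted addresses matched the actual
    # access address of the next load operation.
    return total > 0 and hits / total > threshold

print(use_split_gather(9, 10))  # True  (0.9 > 0.8)
print(use_split_gather(8, 10))  # False (0.8 is not greater than 0.8)
```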
CN202211664244.0A 2022-12-23 2022-12-23 Method, processor, device, medium for loading data into vector registers Pending CN116302099A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211664244.0A CN116302099A (en) 2022-12-23 2022-12-23 Method, processor, device, medium for loading data into vector registers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211664244.0A CN116302099A (en) 2022-12-23 2022-12-23 Method, processor, device, medium for loading data into vector registers

Publications (1)

Publication Number Publication Date
CN116302099A true CN116302099A (en) 2023-06-23

Family

ID=86778585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211664244.0A Pending CN116302099A (en) 2022-12-23 2022-12-23 Method, processor, device, medium for loading data into vector registers

Country Status (1)

Country Link
CN (1) CN116302099A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116578343A (en) * 2023-07-10 2023-08-11 南京砺算科技有限公司 Instruction compiling method and device, graphic processing unit, storage medium and terminal equipment
CN116578343B (en) * 2023-07-10 2023-11-21 南京砺算科技有限公司 Instruction compiling method and device, graphic processing device, storage medium and terminal equipment

Similar Documents

Publication Publication Date Title
EP2783281B1 (en) A microprocessor accelerated code optimizer
US10191746B2 (en) Accelerated code optimizer for a multiengine microprocessor
EP2783282B1 (en) A microprocessor accelerated code optimizer and dependency reordering method
US20150186293A1 (en) High-performance cache system and method
US7478228B2 (en) Apparatus for generating return address predictions for implicit and explicit subroutine calls
US9298464B2 (en) Instruction merging optimization
US9513916B2 (en) Instruction merging optimization
US20050198054A1 (en) Speculative load of look up table entries based upon coarse index calculation in parallel with fine index calculation
CN104657285B (en) Data caching system and method
CN108319559B (en) Data processing apparatus and method for controlling vector memory access
CN116302099A (en) Method, processor, device, medium for loading data into vector registers
WO2021229232A1 (en) Variable position shift for matrix processing
CN104424132B (en) High performance instruction cache system and method
US20150193348A1 (en) High-performance data cache system and method
CN110073332B (en) Data processing apparatus and method
US20230214236A1 (en) Masking row or column positions for matrix processing
CN111190645B (en) Separated instruction cache structure
Azevedo et al. An instruction to accelerate software caches
CN117270972B (en) Instruction processing method, device, equipment and medium
JPH10207772A (en) Method for predicting cache miss
Ulu et al. A parallel GPU implementation of SWIFFTX
JP3755661B2 (en) Instruction cache control system and instruction cache control method in VLIW processor
CN113227970A (en) Instruction tightly coupled memory and instruction cache access prediction
US20080222393A1 (en) Method and arrangements for pipeline processing of instructions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination