CN115905048A

CN115905048A - Multi-strategy memory segmented copy optimization method, device and medium

Info

Publication number: CN115905048A
Application number: CN202211521391.2A
Authority: CN
Inventors: 黄亮明; 张静; 姜军; 蒋丽萍; 邓洁
Original assignee: Wuxi Advanced Technology Research Institute
Current assignee: Wuxi Advanced Technology Research Institute
Priority date: 2022-11-30
Filing date: 2022-11-30
Publication date: 2023-04-04

Abstract

The invention provides a multi-strategy memory segmented copy optimization method, device and medium that can improve copy performance. The method comprises the following steps: obtaining the data to be copied and judging the alignment attributes of the source and destination addresses according to the characteristics of the memory access instructions; if the destination address is not R-byte aligned, using scalar copy instructions to copy until the destination address reaches an R-byte boundary; then obtaining the length of the data to be copied, dividing it into intervals according to that length, and applying a different optimization technique to each interval according to its characteristics, thereby realizing segmented multi-strategy copy optimization.

Description

Multi-strategy memory segmented copy optimization method, device and medium

Technical Field

The invention relates to a multi-strategy memory segmented copy optimization method, a device and a medium, belonging to the technical field of byte copy.

Background

Memory copying is widely used in computer data processing. One typical existing approach is a memcpy optimization based on a highest-bit-width-instruction-first strategy: single-byte copy instructions are used first so that at least one of the source and destination addresses meets the alignment required by the widest instruction the system supports, and the other address is then copied with the instruction of the greatest byte length (no greater than the widest instruction of the system) that matches its current alignment. As a result, that address may end up being copied in a loop of fixed, narrow instructions such as single-byte or two-byte instructions. Another representative approach is a 16/64-byte segmented accelerated memory copy method, in which lengths below 16 bytes are copied one byte at a time, lengths between 16 and 64 bytes are copied 4 bytes at a time, and lengths above 64 bytes are accelerated by performing 16 consecutive 4-byte copies per iteration. These approaches focus on high-bit-width instructions or simple segmentation, and do not fully exploit the length range of the data to be copied or the structural features supported by the processor.

Related prior-art patents include:

an optimization method of a memcpy function, patent application No. 201310408259.5;

a method for accelerating memory copy speed, patent application No. CN201010133607.9.

However, when the length of the data to be copied varies over a wide range, existing memory copy methods cannot tailor the copy to different length segments:

1. because the length of the data to be copied is not fixed and current memory copy methods do not segment a wide length range finely enough, the same fixed strategy is applied to both large and small lengths;

2. when the length of the data to be copied is large, copying is inefficient because the bit width of the copy instruction in use is fixed; when the length is small, a single-byte copy loop is still used, so the loop and branch-test overhead is large;

3. current copy methods do not make full use of the memory hierarchy features supported by the processor.

Existing memory copy optimization methods are therefore simple and inflexible, and do not take full advantage of processor characteristics, so the performance gain is small.

Disclosure of Invention

The invention aims to overcome the defects in the prior art, and provides a multi-strategy memory segmented copy optimization method, device and medium, which can improve the copy performance.

In order to achieve the purpose, the invention adopts the following technical scheme:

In a first aspect, the present invention provides a multi-strategy memory segmented copy optimization method, including the following steps:

step 1: acquiring the data to be copied and judging the alignment attributes of the source and destination addresses according to the characteristics of the memory access instructions; if the destination address is not R-byte aligned, copying with scalar copy instructions until the destination address reaches an R-byte boundary and then proceeding to step 2; if the destination address is already R-byte aligned, proceeding directly to step 2;

step 2: acquiring the length of the data to be copied, dividing it into intervals according to that length, and applying a different optimization technique to each interval according to its characteristics, thereby realizing segmented multi-strategy copy optimization.
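Step 1 can be illustrated with a minimal C sketch. It assumes that R is a power of two and uses a plain byte loop in place of the "scalar copy instruction"; the helper name `copy_head_to_alignment` is hypothetical and does not appear in the patent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not named in the patent): copy single bytes until
 * dst reaches an R-byte boundary or len runs out. R is assumed to be a
 * power of two. Returns the number of bytes copied. */
static size_t copy_head_to_alignment(unsigned char *dst,
                                     const unsigned char *src,
                                     size_t len, size_t R)
{
    assert(R != 0 && (R & (R - 1)) == 0);        /* power-of-two check */
    size_t misalign = (size_t)((uintptr_t)dst & (R - 1));
    size_t head = misalign ? R - misalign : 0;   /* bytes up to the boundary */
    if (head > len)
        head = len;                              /* never copy past len */
    for (size_t i = 0; i < head; i++)
        dst[i] = src[i];                         /* scalar byte copy */
    return head;
}
```

A caller would advance both pointers by the returned count and hand the remaining `len - head` bytes to step 2.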

Further, the method is based on a processor with a multi-level storage structure comprising in-core (L1 and L2) Caches and an out-of-core (L3) Cache.

Further, dividing the data to be copied into intervals according to its length and applying a different optimization technique to each interval to realize segmented multi-strategy copy optimization includes the following steps:

step a: setting an efficiency threshold A according to the characteristics of the processor and the word length processed by the scalar memory access instruction (the value of A equals the size X, in bytes, of the scalar memory access processing unit); comparing the length of the data to be copied with the efficiency threshold A; if the length is less than A, jumping to step b; if the length is greater than or equal to A, jumping to step c;

step b: for the case where the length len of the data to be copied is less than A, using a jump table: an offset is calculated from len and combined with the base address of the jump table to obtain the exact jump address; after jumping to the corresponding position, the len-byte copy is completed using the processor's memory access instructions;

step c: setting an efficiency threshold B according to the characteristics of the processor and the word length processed by the vector memory access instruction (the value of B is an integer multiple of the size Y, in bytes, of the vector memory access processing unit); comparing the length of the data to be copied with the efficiency threshold B; if the length is less than B, jumping to step d and copying with scalar memory access instructions; if the length is greater than or equal to B, jumping to step e and copying with vector memory access instructions;

step d: for the case where the length len of the data to be copied is between A and B, processing the destination address so that it satisfies the natural alignment of the scalar memory access instruction, i.e. the low-bit-width alignment, and copying in a loop with the X-byte scalar memory access instruction as the processing unit;

step e: for the case where the length len of the data to be copied is greater than B, in order to realize single-instruction multiple-data (SIMD) processing, processing the destination address so that it satisfies the natural alignment of the vector memory access instruction, i.e. the high-bit-width alignment, and copying in an accelerated loop with the Y-byte vector memory access instruction as the processing unit.
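The branch selection of steps a and c amounts to two comparisons. The following C sketch shows one way to express it; the enum and function names are hypothetical, and the thresholds are passed in as parameters because the patent leaves their concrete values to the processor at hand.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the three branches described in steps b, d, e. */
enum copy_strategy {
    COPY_JUMP_TABLE,   /* step b: len < A       */
    COPY_SCALAR_LOOP,  /* step d: A <= len < B  */
    COPY_VECTOR_LOOP   /* step e: len >= B      */
};

/* A is the scalar efficiency threshold, B the vector efficiency threshold. */
static enum copy_strategy choose_strategy(size_t len, size_t A, size_t B)
{
    assert(A <= B);                 /* thresholds must be ordered */
    if (len < A)
        return COPY_JUMP_TABLE;     /* step a sends short copies to step b */
    if (len < B)
        return COPY_SCALAR_LOOP;    /* step c sends mid-size copies to step d */
    return COPY_VECTOR_LOOP;        /* step c sends the rest to step e */
}
```

For instance, with an assumed 8-byte scalar unit (A = 8) and a 32-byte vector unit with B = 4 × 32 = 128, a 100-byte copy would take the scalar loop and a 128-byte copy the vector loop.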

Further, in step d, copying in a loop with the X-byte scalar memory access instruction as the processing unit includes:

d1. when the length of the data to be copied is greater than M × X, copying in a loop with the X-byte scalar instruction as the processing unit, the loop body being unrolled M times (M = 4, 8, 16, etc.); when the loop condition is no longer met, jumping to step d2 or d3 according to the relationship between the remaining length and X;

d2. when the length of the data to be copied is between X and M × X, copying in a loop with the X-byte scalar instruction as the processing unit; when the loop condition is no longer met, jumping to step d3;

d3. when the length of the data to be copied is less than X, jumping to step b.
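Steps d1 to d3 can be sketched in portable C as follows. The values X = 8 and M = 4 are assumptions chosen for illustration, the function names are hypothetical, and a byte loop stands in for the jump table that step d3 hands off to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { X = 8, M = 4 };  /* assumed: 8-byte scalar unit, unrolled 4 times */

/* One X-byte scalar load/store pair. A fixed-size memcpy is the portable
 * way to express it; compilers typically emit one load and one store. */
static void copy_unit(unsigned char *d, const unsigned char *s)
{
    uint64_t w;
    memcpy(&w, s, X);
    memcpy(d, &w, X);
}

static void scalar_segment_copy(unsigned char *dst, const unsigned char *src,
                                size_t len)
{
    assert(dst != NULL && src != NULL);
    /* d1: unrolled loop, M units per iteration (this sketch also takes
     * the unrolled path when len is exactly M * X). */
    while (len >= (size_t)M * X) {
        copy_unit(dst,         src);
        copy_unit(dst + X,     src + X);
        copy_unit(dst + 2 * X, src + 2 * X);
        copy_unit(dst + 3 * X, src + 3 * X);
        dst += M * X;
        src += M * X;
        len -= M * X;
    }
    /* d2: one unit per iteration. */
    while (len >= X) {
        copy_unit(dst, src);
        dst += X;
        src += X;
        len -= X;
    }
    /* d3: fewer than X bytes remain. The patent jumps to step b here;
     * a byte loop is used as a stand-in. */
    while (len > 0) {
        *dst++ = *src++;
        len--;
    }
}
```

Unrolling trades code size for fewer loop tests: a 77-byte copy runs two unrolled iterations, one single-unit iteration, and five tail bytes.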

Further, in step e, copying in an accelerated loop with the Y-byte vector memory access instruction as the processing unit includes:

e1. when the data length is greater than N × Y, reading in a loop with the Y-byte vector instruction as the processing unit and storing in a loop with the Y-byte vector Cache-bypassing store instruction, the loop body being unrolled N times (N = 4, 8, 16, etc.); when the loop condition is no longer met, jumping to step e2 or e3 according to the relationship between the remaining data length and Y;

e2. when the data length is between Y and N × Y, reading in a loop with the Y-byte vector instruction as the processing unit and storing in a loop with the Y-byte vector Cache-bypassing store instruction; when the loop condition is no longer met, jumping to step e3;

e3. when the data length is less than Y, jumping to step d, which checks the data length and selects the appropriate branch.
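The patent does not name an instruction set. As one possible illustration, the sketch below maps steps e1 to e3 onto x86 SSE2, assuming Y = 16 and N = 4 and treating the non-temporal store `_mm_stream_si128` as the "Cache-bypassing store"; on targets without SSE2 it falls back to a plain 16-byte copy. Function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

enum { Y = 16, N = 4 };  /* assumed: 16-byte vector unit, unrolled 4 times */

/* Load Y bytes and store them. With SSE2 the store carries a non-temporal
 * hint (to limit cache pollution) and requires d to be 16-byte aligned. */
static void copy_vec(unsigned char *d, const unsigned char *s)
{
#if defined(__SSE2__)
    __m128i v = _mm_loadu_si128((const __m128i *)s);
    _mm_stream_si128((__m128i *)d, v);
#else
    memcpy(d, s, Y);     /* portable stand-in, no cache bypass */
#endif
}

static void vector_segment_copy(unsigned char *dst, const unsigned char *src,
                                size_t len)
{
    assert(dst != NULL && src != NULL);
    /* Bring dst to a Y-byte boundary first (high-bit-width alignment). */
    while (len > 0 && ((uintptr_t)dst & (Y - 1)) != 0) {
        *dst++ = *src++;
        len--;
    }
    /* e1: N vector units per iteration. */
    while (len >= (size_t)N * Y) {
        copy_vec(dst,         src);
        copy_vec(dst + Y,     src + Y);
        copy_vec(dst + 2 * Y, src + 2 * Y);
        copy_vec(dst + 3 * Y, src + 3 * Y);
        dst += N * Y;
        src += N * Y;
        len -= N * Y;
    }
    /* e2: one vector unit per iteration. */
    while (len >= Y) {
        copy_vec(dst, src);
        dst += Y;
        src += Y;
        len -= Y;
    }
#if defined(__SSE2__)
    _mm_sfence();        /* order the streaming stores before later stores */
#endif
    /* e3: fewer than Y bytes remain. The patent returns to step d here;
     * a byte loop is used as a stand-in. */
    while (len > 0) {
        *dst++ = *src++;
        len--;
    }
}
```

Only the destination is aligned, so the loads use the unaligned form; this matches the patent's emphasis on the destination address attribute.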

In a second aspect, the present invention provides a multi-strategy memory segmented copy optimization apparatus, including:

an alignment attribute module, used for acquiring the data to be copied, judging the alignment attributes of the source and destination addresses according to the characteristics of the memory access instructions, and, if the destination address is not R-byte aligned, copying with scalar copy instructions until the destination address reaches an R-byte boundary;

a segment optimization module, used for acquiring the length of the data to be copied, dividing it into intervals according to that length, and applying a different optimization technique to each interval according to its characteristics, thereby realizing segmented multi-strategy copy optimization.

In a third aspect, the present invention provides a multi-policy memory segment copy optimization apparatus, including a processor and a storage medium;

the storage medium is used for storing instructions;

the processor is configured to operate in accordance with the instructions to perform the steps of the method of the first aspect.

In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of the first aspect.

Compared with the prior art, the invention has the following beneficial effects:

1. the invention divides the data to be copied into intervals according to its length and replaces a fixed strategy with multiple copy strategies matched to the characteristics of each interval, thereby effectively improving copy efficiency;

2. on the basis of scalar memory access, vector memory access instructions are introduced to realize a combined low/high-bit-width, scalar/vector memory access mode; copy-related parameter information can be extracted according to the actual application scenario and matched to the best branch strategy;

3. for different branches the invention can apply multiple optimizations, such as jump tables, in-core and out-of-core hierarchical prefetching, loop unrolling, the combined low/high-bit-width scalar/vector memory access mode, and Cache-bypassing stores, thereby achieving exact copying for small lengths and for the boundary cases of large lengths, maximizing the timeliness of data prefetching and the benefit of the multi-level Cache, effectively reducing loop-test and branch overhead, lowering the branch misprediction rate, and improving memory copy performance.

Drawings

FIG. 1 is a flow chart of a multi-policy memory segment copy algorithm

FIG. 2 is a schematic diagram of a multi-policy memory segment copy optimization policy.

Detailed Description

The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.

Example one:

This embodiment provides a multi-strategy memory segmented copy optimization method. Fig. 1 is a flow chart of the multi-strategy memory segmented copy algorithm. The method first judges the alignment attributes of the source and destination addresses according to the characteristics of the memory access instructions; if the destination address is not R-byte aligned, scalar copy instructions are used to copy until the destination address reaches an R-byte boundary, because the address alignment attribute affects the performance of the memory access instructions. If the destination address is already R-byte aligned, the following steps are performed directly.

Second, the data is divided into intervals according to the length of the data to be copied, and a different optimization technique is applied to each interval according to its characteristics, thereby realizing segmented multi-strategy copy optimization.

Fig. 2 is a schematic diagram of the multi-strategy memory segmented copy optimization strategy, which is detailed in the following steps:

a. An efficiency threshold A is set according to the characteristics of the processor and the word length processed by the scalar memory access instruction. The length of the data to be copied is compared with the efficiency threshold A: if the length is less than A, jump to step b; if the length is greater than or equal to A, jump to step c. A is generally set to an integer multiple of the word length X processed by the scalar memory access instruction; here A equals X, so step b is taken when the amount of data to be copied is smaller than the word length processed by the scalar memory access instruction.

b. For the case where the length len of the data to be copied is less than A, a jump table is used: an offset is calculated from len and combined with the base address of the jump table to obtain the exact jump address; after jumping to the corresponding position, the len-byte copy is completed using the processor's memory access instructions, thereby realizing exact and efficient memory copying for small lengths.

c. An efficiency threshold B is set according to the characteristics of the processor and the word length processed by the vector memory access instruction. The length of the data to be copied is compared with the efficiency threshold B: if the length is less than B, jump to step d and copy with scalar memory access instructions; if the length is greater than or equal to B, jump to step e and copy with vector memory access instructions. B is generally set to an integer multiple of the word length Y processed by the vector memory access instruction; no particular multiple is prescribed, and it can be chosen as appropriate.

d. For the case where the length len of the data to be copied is between A and B, the destination address is processed so that it satisfies the natural alignment of the scalar memory access instruction, i.e. the low-bit-width alignment, and the data is copied in a loop with the X-byte scalar memory access instruction as the processing unit. Based on the segmented length of the data to be copied, the advantages of the storage structure supported by the processor are fully exploited with a multi-strategy optimization: (1) to improve data locality and reduce the Cache miss rate, in-core hierarchical (L1 + L2 Cache) prefetching is set according to the length of the data to be copied (single-level or hierarchical prefetching is chosen as appropriate for the length); (2) to reduce the number of loop iterations and the branch-test overhead, loop unrolling is used, with the unroll factor M determined from the Cache line size and the word length processed by the processor's memory access instruction:

d1. when the length of the data to be copied is greater than M × X, the data is copied in a loop with the X-byte scalar instruction as the processing unit, the loop body being unrolled M times; when the loop condition is no longer met, jump to step d2 or d3 according to the relationship between the remaining length and X;

d2. when the length of the data to be copied is between X and M × X, the data is copied in a loop with the X-byte scalar instruction as the processing unit; when the loop condition is no longer met, jump to step d3;

d3. when the length of the data to be copied is less than X, jump to step b.

M is generally set to 4, 8, 16, etc.; the specific unroll factor can be chosen by jointly considering the Cache line size and the scalar memory access processing unit.

e. For the case where the length len of the data to be copied is greater than B, in order to realize single-instruction multiple-data (SIMD) processing, the destination address is processed so that it satisfies the natural alignment of the vector memory access instruction, i.e. the high-bit-width alignment, and the data is copied in an accelerated loop with the Y-byte vector memory access instruction as the processing unit. Based on the segmented length of the data to be copied, the advantages of the storage structure supported by the processor are fully exploited with a multi-strategy optimization: (1) in-core and out-of-core hierarchical (L1 + L2 + L3 Cache) prefetching; (2) loop unrolling, N times, according to the length of the data to be copied and the word length processed by the vector memory access instruction; (3) to reduce the pollution of the Cache by write-back operations and improve the hit rate of read operations, Cache-bypassing store instructions are used:

e1. when the length of the data to be copied is greater than N × Y, the data is read in a loop with the Y-byte vector instruction as the processing unit and stored in a loop with the Y-byte vector Cache-bypassing store instruction, the loop body being unrolled N times; when the loop condition is no longer met, jump to step e2 or e3 according to the relationship between the remaining length and Y;

e2. when the length of the data to be copied is between Y and N × Y, the data is read in a loop with the Y-byte vector instruction as the processing unit and stored in a loop with the Y-byte vector Cache-bypassing store instruction; when the loop condition is no longer met, jump to step e3;

e3. when the length of the data to be copied is less than Y, jump to step d, which checks the length and selects the appropriate branch.

N is generally set to 4, 8, 16, etc.; the specific unroll factor can be chosen by jointly considering the Cache line size and the vector memory access processing unit.
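The way the branches hand their remainders to one another (e3 back to d, d3 back to b) can be sketched end to end in portable C. This is a simplified illustration under assumed values X = 8, Y = 32, A = X and B = 4 × Y; fixed-size `memcpy` calls stand in for the scalar and vector access instructions, and alignment handling, prefetching, unrolling and Cache-bypassing stores are left out. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { X = 8, Y = 32, A = X, B = 4 * Y };  /* assumed example values */

/* Step b stand-in: fewer than A bytes, copied byte by byte. */
static void copy_tail(unsigned char *d, const unsigned char *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
}

/* Step d: X-byte units; the remainder falls through to step b (d3). */
static void copy_scalar(unsigned char *d, const unsigned char *s, size_t n)
{
    for (; n >= X; d += X, s += X, n -= X)
        memcpy(d, s, X);
    copy_tail(d, s, n);
}

/* Step e: Y-byte units; the remainder is handed back to step d (e3). */
static void copy_vector(unsigned char *d, const unsigned char *s, size_t n)
{
    for (; n >= Y; d += Y, s += Y, n -= Y)
        memcpy(d, s, Y);
    copy_scalar(d, s, n);
}

/* Steps a and c: choose the branch from the length. */
static void *segmented_copy(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(dst != NULL && src != NULL);
    if (len < A)
        copy_tail(d, s, len);
    else if (len < B)
        copy_scalar(d, s, len);
    else
        copy_vector(d, s, len);
    return dst;
}
```

With these values a 500-byte copy runs fifteen 32-byte units, two 8-byte units and four tail bytes, so each branch only ever sees a remainder shorter than the unit of the branch above it.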

The key difference between the invention and other copy methods in the industry is that the data is divided into intervals according to its length, and multiple copy strategies matched to the characteristics of each interval replace a fixed strategy; on the basis of scalar memory access, vector memory access instructions are introduced to realize a combined low/high-bit-width, scalar/vector memory access mode; and the advantages of the storage structure supported by the processor are fully exploited by combining optimizations such as in-core and out-of-core hierarchical prefetching, loop unrolling, and Cache-bypassing stores, thereby achieving segmented multi-strategy memory copy optimization.

Multiple copy strategies based on dividing the length of the data to be copied: when the length of the data to be copied varies over a wide range, the range is divided into intervals, and exact copying is provided for small lengths and for the boundary cases of large lengths. When the length is small, a single-byte copy loop would spend a large share of its time on loop and branch-test overhead; a jump table is used instead, which both selects the exact branch directly and avoids repeated loop tests, thereby realizing efficient and exact memory copying. When the length is large, the data is copied in a loop.
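In C, a jump table of this kind is commonly written as a `switch` with fall-through cases. The sketch below assumes A = 8, so it handles lengths 0 to 7; the function name is hypothetical, and whether the compiler actually lowers the `switch` to a jump table (base address plus an offset derived from len) is up to the compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Copy exactly len bytes, len < 8 (assumed A = 8), without any loop.
 * Control jumps straight to the case for len and falls through the
 * remaining cases, copying one byte per case. */
static void copy_small(unsigned char *dst, const unsigned char *src,
                       size_t len)
{
    assert(len < 8);
    switch (len) {
    case 7: dst[6] = src[6]; /* fall through */
    case 6: dst[5] = src[5]; /* fall through */
    case 5: dst[4] = src[4]; /* fall through */
    case 4: dst[3] = src[3]; /* fall through */
    case 3: dst[2] = src[2]; /* fall through */
    case 2: dst[1] = src[1]; /* fall through */
    case 1: dst[0] = src[0]; /* fall through */
    default: break;          /* len == 0: nothing to copy */
    }
}
```

Compared with a byte loop, this performs a single indexed jump instead of one loop test per byte.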

Combined low/high-bit-width, scalar/vector memory access mode: when the length of the data to be copied is between efficiency thresholds A and B, the destination address is processed so that it satisfies the natural alignment of the scalar memory access instruction, i.e. the low-bit-width alignment, and scalar memory access instructions are used for copying; when the length of the data to be copied is greater than efficiency threshold B, the destination address is processed so that it satisfies the natural alignment of the vector memory access instruction, i.e. the high-bit-width alignment, and vector memory access instructions are introduced to accelerate copying.

Memory access optimization adapted to the characteristics of the multi-level storage structure: first, hierarchical prefetching is set according to the length of the data to be copied and the capacities of the in-core and out-of-core Caches, so as to maximize the timeliness of data prefetching and the use of Cache resources, which further improves on single-level prefetching; second, the loop unroll factor is set flexibly, which exploits CPU instruction-level parallelism to raise the likelihood that statements in the loop body execute concurrently, helps the instruction pipeline schedule efficiently, reduces branch-test and jump overhead, and lowers the branch misprediction rate; finally, Cache-bypassing store instructions are used where appropriate to reduce the pollution of the Cache by write-back operations, thereby improving the hit rate of read operations.
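Hierarchical prefetching can be approximated in C with the GCC/Clang builtin `__builtin_prefetch`, whose third argument is a temporal-locality hint from 0 to 3. The sketch below is an assumption-laden illustration rather than the patent's implementation: the 64-byte line size and the two prefetch distances are made-up values, how a locality hint maps onto L1/L2/L3 depends on the target, and the macros compile to no-ops on other compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__GNUC__) || defined(__clang__)
/* __builtin_prefetch(addr, rw, locality): rw = 0 means prefetch for read;
 * locality 3 asks to keep the line in all cache levels, lower values
 * signal less temporal locality. */
#define PREFETCH_NEAR(p) __builtin_prefetch((p), 0, 3)
#define PREFETCH_FAR(p)  __builtin_prefetch((p), 0, 1)
#else
#define PREFETCH_NEAR(p) ((void)(p))
#define PREFETCH_FAR(p)  ((void)(p))
#endif

/* Assumed values: 64-byte cache line, near/far prefetch distances. */
enum { LINE = 64, NEAR_DIST = 4 * LINE, FAR_DIST = 16 * LINE };

static void copy_with_prefetch(unsigned char *dst, const unsigned char *src,
                               size_t len)
{
    assert(dst != NULL && src != NULL);
    while (len >= LINE) {
        /* Prefetch only addresses that still lie inside the source. */
        if (len > NEAR_DIST)
            PREFETCH_NEAR(src + NEAR_DIST);  /* needed soon: inner levels */
        if (len > FAR_DIST)
            PREFETCH_FAR(src + FAR_DIST);    /* needed later: outer level */
        memcpy(dst, src, LINE);              /* one line per iteration */
        dst += LINE;
        src += LINE;
        len -= LINE;
    }
    memcpy(dst, src, len);                   /* remainder below one line */
}
```

Copying a whole cache line per iteration plays the role of the unrolled loop body here; the two distances let short copies use only the near prefetch while long copies also warm the outer cache level.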

With this method, copy-related parameter information can be extracted according to the actual application scenario and matched to the best branch strategy, and multiple optimizations, such as jump tables, in-core and out-of-core hierarchical prefetching, loop unrolling, the combined low/high-bit-width scalar/vector memory access mode, and Cache-bypassing stores, are applied to different branches. This achieves exact copying for small lengths and for the boundary cases of large lengths, maximizes the timeliness of data prefetching and the benefit of the multi-level Cache, effectively reduces loop-test and branch overhead, lowers the branch misprediction rate, and improves memory copy performance.

Example two:

this embodiment provides a multi-policy memory segment copy optimization device, including:

The apparatus of the present embodiment can be used to implement the method described in the first embodiment.

Example three:

the embodiment provides a multi-strategy memory segmented copy optimization device, which comprises a processor and a storage medium;

the storage medium is used for storing instructions;

the processor is configured to operate in accordance with the instructions to perform the steps of the method of embodiment one.

Example four:

the present embodiment provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program carries out the steps of the method of embodiment one.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims

1. A multi-strategy memory segmented copy optimization method is characterized by comprising the following steps:

step 2: acquiring the length of the data to be copied, dividing it into intervals according to that length, and applying a different optimization technique to each interval according to its characteristics, thereby realizing segmented multi-strategy copy optimization.

2. The multi-strategy memory segmented copy optimization method of claim 1, wherein dividing the data to be copied into intervals according to its length and applying a different optimization technique to each interval to realize segmented multi-strategy copy optimization comprises:

step a: comparing the length of the data to be copied with the efficiency threshold A; if the length is less than A, jumping to step b; if the length is greater than or equal to A, jumping to step c;

step c: comparing the length of the data to be copied with the efficiency threshold B; if the length is less than B, jumping to step d and copying with scalar memory access instructions; if the length is greater than or equal to B, jumping to step e and copying with vector memory access instructions;

step d: for the case where the length len of the data to be copied is between A and B, processing the destination address so that it satisfies the natural alignment of the scalar memory access instruction, i.e. the low-bit-width alignment, and copying in a loop with the X-byte scalar memory access instruction as the processing unit;

step e: for the case where the length len of the data to be copied is greater than B, processing the destination address so that it satisfies the natural alignment of the vector memory access instruction, i.e. the high-bit-width alignment, and copying in an accelerated loop with the Y-byte vector memory access instruction as the processing unit.

3. The multi-strategy memory segmented copy optimization method according to claim 2, wherein in step d, copying in a loop with the X-byte scalar memory access instruction as the processing unit comprises:

d1. when the length of the data to be copied is greater than M × X, copying in a loop with the X-byte scalar instruction as the processing unit, the loop body being unrolled M times; when the loop condition is no longer met, according to the relationship between the remaining length and X, jumping to step d2 when the length is between X and M × X, and jumping to step d3 when the length is less than X;

d2. when the length of the data to be copied is between X and M × X, copying in a loop with the X-byte scalar instruction as the processing unit; when the loop condition is no longer met, jumping to step d3;

4. The multi-strategy memory segmented copy optimization method according to claim 2, wherein in step e, copying in an accelerated loop with the Y-byte vector memory access instruction as the processing unit comprises:

e1. when the length of the data to be copied is greater than N × Y, reading in a loop with the Y-byte vector instruction as the processing unit and storing in a loop with the Y-byte vector Cache-bypassing store instruction, the loop body being unrolled N times; when the loop condition is no longer met, jumping to step e2 or e3 according to the relationship between the remaining data length and Y;

e2. when the length of the data to be copied is between Y and N × Y, reading in a loop with the Y-byte vector instruction as the processing unit and storing in a loop with the Y-byte vector Cache-bypassing store instruction; when the loop condition is no longer met, jumping to step e3;

e3. when the length of the data to be copied is less than Y, jumping to step d, which checks the data length and selects the appropriate branch.

5. An apparatus for optimizing multi-policy memory segment copy, the apparatus comprising:

6. The multi-strategy memory segmented copy optimization device is characterized by comprising a processor and a storage medium;

the storage medium is used for storing instructions;

the processor is configured to operate in accordance with the instructions to perform the steps of the method of any one of claims 1 to 4.

7. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.