CN112965663A - Method for multiplexing storage space of data block and related product - Google Patents


Info

Publication number
CN112965663A
CN112965663A (application CN202110247330.0A)
Authority
CN
China
Prior art keywords
data blocks
sequence
storage space
data block
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110247330.0A
Other languages
Chinese (zh)
Other versions
CN112965663B (en)
Inventor
Inventor not disclosed (不公告发明人)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd
Priority to CN202110247330.0A (granted as CN112965663B)
Publication of CN112965663A
Application granted
Publication of CN112965663B
Legal status: Active
Anticipated expiration: not listed

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0638 Organizing or formatting or addressing of data
    • G06F 3/064 Management of blocks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/0604 Improving or facilitating administration, e.g. storage management
    • G06F 3/0607 Improving or facilitating administration, e.g. storage management by facilitating the process of upgrading existing storage systems, e.g. for improving compatibility between host and storage device
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0629 Configuration or reconfiguration of storage systems
    • G06F 3/0631 Configuration or reconfiguration of storage systems by allocating resources to storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0638 Organizing or formatting or addressing of data
    • G06F 3/0644 Management of space entities, e.g. partitions, extents, pools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a method for multiplexing the storage space of a data block and a related product, which may be implemented in a computing device. The computing device may be included in a combined processing device, which may further include a universal interconnect interface and other processing devices. The computing device interacts with the other processing devices to jointly complete computing operations specified by a user. The combined processing device may further comprise a storage device connected to the computing device and the other processing devices, respectively, for storing data of the computing device and the other processing devices.

Description

Method for multiplexing storage space of data block and related product
Technical Field
The present disclosure relates to the field of computers, and more particularly, to multiplexing of storage space.
Background
Neural network technology has recently become popular, and some hardware vendors have begun to design neural network processors to speed up the computation of neural networks. For a neural network processor, on-chip memory space is limited. As the depth and width of the neural network increase, the amount of data required increases gradually, and the size of the memory space may become a bottleneck.
During the operation of a neural network, the storage space of many data blocks can be multiplexed. To this end, many frameworks implement memory management modules to improve memory reuse. However, many existing algorithms are prone to generating memory fragments and cannot use memory efficiently.
Disclosure of Invention
One objective of the present disclosure is to overcome the prior-art defects that memory fragments are easily generated during spatial multiplexing and that spatial multiplexing efficiency is low.
According to a first aspect of the present disclosure, there is provided a method for multiplexing a storage space of a data block, including: determining a conflict relationship between a plurality of data blocks; determining the data blocks with multiplexing relation according to the conflict relation; adjusting an allocation order of the storage spaces of the plurality of data blocks so that a following data block in the allocation order can multiplex the storage space of a preceding data block.
According to a second aspect of the present disclosure, there is provided an electronic device comprising: one or more processors; and a memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described above.
According to a third aspect of the present disclosure, there is provided a computer-readable storage medium comprising computer-executable instructions which, when executed by one or more processors, perform the method as described above.
The technical solution of the present disclosure can achieve a higher space reuse rate, reduce wasted storage space, and reach that reuse rate at a faster convergence speed, which is of particular significance for edge devices with small storage space.
Drawings
The above and other objects, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar or corresponding parts:
FIG. 1a shows a schematic diagram of a conventional neural network architecture;
FIG. 1b is a schematic diagram of a neural network implemented by a dual-core processor;
FIG. 1c illustrates an exemplary execution flow;
FIG. 1d illustrates an exemplary collision matrix;
fig. 2a and 2b show comparative schematic diagrams of different multiplexing methods according to an embodiment of the present disclosure;
FIG. 3 shows a flow diagram of a method of multiplexing storage space for a block of data according to one embodiment of the present disclosure;
FIG. 4 is a flow chart illustrating adjusting an allocation order of storage space of the plurality of data blocks according to an embodiment of the present disclosure;
FIG. 5a illustrates a schematic diagram of pre-allocating storage for a data block according to one embodiment of the present disclosure;
FIG. 5b is a diagram illustrating pre-allocation of storage for a data block according to another embodiment of the present disclosure;
FIGS. 6a-6c illustrate exemplary allocations of memory for two other sequences;
FIG. 7 illustrates a flowchart of a method for adjusting the order of data blocks in the sequence of data blocks to maximize the continuity of the data blocks in a multiplexing relationship, according to an embodiment of the present disclosure;
FIG. 8 shows an exemplary diagram of adjusting a sequence of data blocks;
FIG. 9 illustrates a flowchart of a method for further adjusting the order of data blocks in the sequence of data blocks to maximize the continuity of data blocks in a multiplexing relationship, according to another embodiment of the present disclosure;
fig. 10a shows the comparison results of the technical solution (genetic algorithm, GA) of the present disclosure with the first-fit (FF) algorithm and the best-fit merging (best fit with coalescing) algorithm under different operating environments;
FIG. 10b shows a comparison of the convergence speed of the genetic algorithm of the present disclosure with FF employing a random search algorithm on the ResNet50, MobileNetV1, MobileNetV2, InceptionV3, GoogleNet, and DenseNet network types;
FIG. 11 shows a schematic diagram of a combined treatment apparatus according to the present disclosure;
fig. 12 illustrates an exemplary board card.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
The foregoing detailed description of the embodiments of the present disclosure has been presented for purposes of illustration and description and is intended to be exemplary only and is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Meanwhile, a person skilled in the art should, according to the idea of the present disclosure, change or modify the embodiments and applications of the present disclosure. In view of the above, this description should not be taken as limiting the present disclosure.
First, to better describe the spatial multiplexing technique that follows, the conflict relationship between data blocks is illustrated below.
Deep Neural Networks (DNNs), such as the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN), have been successful in a number of areas. Taking the more common CNN as an example, it mainly consists of convolutional layers, pooling layers, activation layers, fully connected layers, etc. The multidimensional data operated on by each layer may be referred to as a tensor.
In a multi-core processor platform, in order to take advantage of the parallel nature of neural networks and to fully utilize hardware resources, the operations in CNN are typically divided into multiple sub-operations, and the network topology may change dynamically.
In the present disclosure, an example is given with reference to figs. 1a to 1d. Fig. 1a shows a conventional neural network, which includes a convolutional layer and a pooling layer, and three tensor data blocks D0, D1, and D2. When executing on a GPU platform, data block D0 may be assumed to be able to share memory space with data block D2. For a multi-core processor platform, however, the layers and tensors of the conventional neural network are divided, and the divided layers are as shown in fig. 1b.
In fig. 1b, in processing core A, the data block D0 is split into data blocks D00 and D01; data block D00 generates data block D10 after passing through the convolutional layer, and data block D01 generates data block D20 after passing through the pooling layer. In processing core B, data block D11 is generated after data block D01 passes through the convolutional layer, and data block D21 is generated after data block D11 passes through the pooling layer. In processing core A, the generated data blocks D20 and D21 are combined into data block D2.
It is to be understood that the data blocks with conflicts may be processed by different processing cores (e.g., the data blocks D20 and D21) or may be processed by the same processing core (e.g., the data blocks D01 and D00). Taking data blocks D01 and D00 as examples, in processing core A, the execution instruction Store D00 is executed first, and then the execution instruction Store D01 is executed, at which time the life cycle of data block D00 has not yet ended, because the instruction Load D00 is executed subsequently. Therefore, the life cycles of the data block D00 and the data block D01 overlap, and the two data blocks are conflicting data blocks, and the storage space cannot be multiplexed.
Furthermore, from the topology perspective, the data blocks D11 and D10 may seem able to multiplex one memory space, but in practice they cannot, because they are used by different processing cores and their life cycles overlap. A reuse policy derived from the network topology alone may therefore be erroneous. Likewise, data blocks D00 and D11 are used by different branches, and their relative life-cycle intervals cannot be derived directly from the topology. However, as the execution flow of fig. 1c shows, data blocks D00 and D11 are used by processing core A and processing core B, respectively, and the accesses to them are separated by a second synchronization operation (sync). Their life cycles therefore do not overlap, and they can multiplex the same storage space.
FIG. 1d illustrates an exemplary collision matrix, wherein the gray portions indicate that the life cycles of the data blocks do not overlap, i.e., there is no collision between the data blocks; and the slashed part indicates that the life cycles of the data blocks overlap, i.e. that there is a conflict between the data blocks. It should be understood that the collision matrix shown in fig. 1d is only an example for identifying the collision between the data blocks, the collision relationship between the data blocks may be different for different instructions, and the collision relationship between the data blocks does not necessarily need to be represented in the form of the collision matrix.
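The life-cycle test behind such a collision matrix can be sketched as follows; the interval values are hypothetical stand-ins for the positions that would be derived from the instruction stream (Store/Load/sync) of fig. 1c, not values stated in the text:

```python
# Sketch: deriving a conflict matrix from data-block life cycles.
# Life cycles are modeled as half-open intervals [start, end).

def overlaps(a, b):
    """Two life cycles conflict iff their intervals intersect."""
    return a[0] < b[1] and b[0] < a[1]

def conflict_matrix(lifetimes):
    ids = sorted(lifetimes)
    return {
        (i, j): overlaps(lifetimes[i], lifetimes[j])
        for i in ids for j in ids if i != j
    }

# Hypothetical life cycles (time steps):
lifetimes = {"D00": (0, 4), "D01": (1, 3), "D11": (5, 7)}
m = conflict_matrix(lifetimes)
print(m[("D00", "D01")])  # True  -> life cycles overlap, no multiplexing
print(m[("D00", "D11")])  # False -> D11 may reuse D00's storage space
```

Only the pairs marked False (the gray cells of a matrix like fig. 1d) are candidates for sharing storage space.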
Fig. 2a and 2b show a schematic diagram comparing different multiplexing methods according to an embodiment of the present disclosure.
The order in which storage space is allocated to the data blocks leads to different spatial multiplexing efficiencies. Assume there are three data blocks D0, D1, and D2, of sizes 5, 5, and 6, respectively, where the space of data blocks D0 and D2 is reusable.
As shown in fig. 2a, if the allocation order is D0, D1, D2, the total storage space occupied by the three data blocks is 16; whereas, as shown in fig. 2b, if the allocation order is D2, D1, D0, then D0 can multiplex the space of D2, and the total occupied storage space is 11.
As can be seen from the examples shown in fig. 2a and 2b, by changing the order in which the data blocks are allocated storage space, there is a better chance of multiplexing the storage space.
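The effect of figs. 2a and 2b can be reproduced with a small greedy placement sketch; the conflict set below encodes only that D0 and D2 may share space, and the allocator itself is an illustrative assumption, not the disclosure's method:

```python
# Sketch of why allocation order matters (sizes from FIGS. 2a/2b:
# D0 = 5, D1 = 5, D2 = 6; only D0 and D2 may share space).

def allocate(order, sizes, conflicts):
    """Greedy placement: each block takes the lowest address range that
    does not overlap any already-placed conflicting block."""
    placed = {}  # block id -> (start, end)
    for blk in order:
        size = sizes[blk]
        start = 0
        busy = sorted(placed[o] for o in placed
                      if frozenset((blk, o)) in conflicts)
        for s, e in busy:
            if start + size <= s:   # a gap before this busy range fits
                break
            start = max(start, e)
        placed[blk] = (start, start + size)
    return max(e for _, e in placed.values())  # total footprint

sizes = {"D0": 5, "D1": 5, "D2": 6}
conflicts = {frozenset(("D0", "D1")), frozenset(("D1", "D2"))}
print(allocate(["D0", "D1", "D2"], sizes, conflicts))  # 16
print(allocate(["D2", "D1", "D0"], sizes, conflicts))  # 11
```

Placing the larger reusable block D2 first lets D0 fall into its range, reproducing the footprints 16 and 11 from the figures.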
Assuming the number of data blocks is N, searching for the most suitable allocation order has a time complexity of N!. Such an exhaustive search is very inefficient and time-consuming, so a suitable search algorithm is required.
Fig. 3 shows a flow chart of a method of multiplexing storage space of a data block according to one embodiment of the present disclosure. As shown in fig. 3, the method may include: in operation S310, determining a conflict relationship between a plurality of data blocks; in operation S320, determining a data block having a multiplexing relationship according to the conflict relationship; and adjusting an allocation order of the storage spaces of the plurality of data blocks so that a following data block in the allocation order can multiplex the storage space of a preceding data block in operation S330.
For operation S310, as described above with respect to figs. 1a to 1d, the life cycles of the plurality of data blocks may be determined first; data blocks whose life cycles do not intersect have no conflict relationship, while data blocks whose life cycles intersect have a conflict relationship. Data blocks without a conflict relationship can multiplex storage space, while conflicting data blocks cannot. This has already been explained above and is not repeated here.
It should be understood that, in addition to checking whether the life cycles of the data blocks overlap as described above, the sizes of the non-conflicting data blocks are preferably compared when determining the data blocks having a multiplexing relationship, so that the data block whose storage space is to be multiplexed is not smaller than the data block that multiplexes it.
Taking figs. 2a and 2b as an example: although data blocks D2 and D0 have no conflict relationship and can share storage space, data block D2 has size 6 and data block D0 has size 5. If D2 is allocated its storage space of size 6 first, D0 can later reuse 5 units of it. If instead D0 is allocated its storage space of size 5 first, then when D2 needs to reuse that space, the size-6 block D2 must be split across two regions (of sizes 5 and 1), which easily produces more storage-space fragments.
After determining the conflict relationship between the data blocks and the sizes of the data blocks, the allocation order of the data blocks to allocate the storage space may be adjusted so that the size of the data block allocated with the storage space first is not smaller than the size of the data block allocated with the storage space later. As shown in fig. 2a and 2b, it is necessary to allocate the storage space for the data block D2 first, and then for the data block D0 without conflict relationship. It is to be understood that the order of assignment may be D2, D1 and D0, or may be D2, D0 and D1, or may be D1, D2 and D0.
According to one embodiment of the present disclosure, adjusting the allocation order of the storage spaces of the plurality of data blocks so that a following data block in the allocation order can multiplex the storage space of a preceding data block includes: adjusting the allocation order of the storage spaces of the plurality of data blocks so that the continuity of the storage spaces of at least a part of the plurality of data blocks meets a preset condition.
As can be seen from the description of figs. 2a and 2b above, there are multiple possible orders in which to allocate storage space: the three data blocks admit orders such as D2, D1, D0; D2, D0, D1; or D1, D2, D0. With more data blocks, there are more possible orders and the complexity becomes higher.
In the present disclosure, besides adjusting the order in which data blocks are allocated storage space, the continuity of the storage space under different allocation orders is preferably considered: the better the continuity, the higher the reuse of the storage space and the less the fragmentation. The continuity of the storage space can therefore be regarded as an important index in the technical solution of the present disclosure.
Fig. 4 is a flowchart illustrating an adjustment of the allocation order of the storage space of the plurality of data blocks according to an embodiment of the present disclosure.
As shown in fig. 4, adjusting the allocation order of the storage spaces of the plurality of data blocks so that the continuity of the storage spaces of at least a part of the plurality of data blocks meets the preset condition may include: pre-allocating a storage space for each of a plurality of data blocks in operation S410, wherein the size of the pre-allocated storage space is not smaller than the size of the corresponding data block; in operation S420, sorting the plurality of data blocks according to a start address of a pre-allocated storage space to form a data block sequence; and adjusting an arrangement order of the data blocks in the data block sequence to maximize a continuity of the data blocks having the multiplexing relationship in operation S430.
It is to be understood that for convenience of operation, a triplet (id, size, addr) may be formed for each data block, the id indicating the index of the data block, which may be any identifier capable of distinguishing the data block, the size indicating the size of the data block, and the addr indicating the address of the data block. Through the triple, the location of the data block, the size of the data block, the address of the data block, and the like can be determined so as to facilitate the operation on the data block. The sequence in the following may be to order the indexes of the data blocks without ordering the data blocks themselves.
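As a minimal illustration, the (id, size, addr) triplet described above might be modeled as follows; the NamedTuple form and the field types are implementation assumptions, not taken from the text:

```python
# Sketch of the (id, size, addr) triplet for a data block.
from typing import NamedTuple, Optional

class Block(NamedTuple):
    id: str              # any identifier that distinguishes the data block
    size: int            # size of the data block
    addr: Optional[int]  # start address of its (pre-)allocated space, if any

blocks = [Block("D1", 9, None), Block("D3", 2, None)]
# A "sequence" can then order indexes into this list rather than
# moving the blocks themselves:
order = sorted(range(len(blocks)), key=lambda i: blocks[i].size)
print([blocks[i].id for i in order])  # ['D3', 'D1']
```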
FIG. 5a illustrates a schematic diagram of pre-allocating storage for a data block according to one embodiment of the present disclosure.
First, assume that the data blocks are D1-D6 (id), whose sizes (size) are 9, 6, 2, 5, 3, and 4, respectively. In this case, storage space must be allocated in memory for these data blocks, and the size of the space allocated to each data block must be no smaller than the size of that data block.
Next, assuming that sequence 1 of the data blocks is {D1, D2, D3, D4, D5, D6}, each available memory space may be searched starting from the start address of the memory and pre-allocated to the data blocks. Assume the memory has several storage spaces starting from the start address: the sizes of the first six available spaces are S1 = 2, S2 = 7, S3 = 10, S4 = 4, S5 = 6, and S6 = 5; the seventh is S7 = 7 and the eighth is S8 = 11. The storage spaces S1-S8 are arranged by storage address and are discrete and unconnected. The pre-allocation results are then as follows:
first, a storage space capable of accommodating the data block D1(size 9) is searched from the storage spaces S1-S6, and it can be seen that the storage space S3 (size 10) may be pre-allocated to the data block D1.
Searching for a storage space capable of accommodating the data block D2(size 6) from the storage spaces S1-S6, it can be seen that the storage space S2 (7) may be pre-allocated to the data block D2.
Searching for a storage space capable of accommodating the data block D3(size 2) from the storage spaces S1-S6, it can be seen that the storage space S1 (size 2) may be pre-allocated to the data block D3.
Searching for a storage space capable of accommodating the data block D4(size 5) from the storage spaces S1-S6, it can be seen that the storage space S5 (size 6) may be pre-allocated to the data block D3.
Searching for a storage space capable of accommodating the data block D5(size 3) from the storage spaces S1-S6, it can be seen that the storage space S4 (size 4) may be pre-allocated to the data block D5.
Searching for a storage space capable of accommodating the data block D6(size 4) from the storage spaces S1-S6, it can be seen that the storage space S6 (5) may be pre-allocated to the data block D6.
After pre-allocating the storage space, the data blocks may be reordered according to the start address of each storage space, forming a reordered sequence 1′ = {D3, D2, D1, D5, D4, D6}.
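The fig. 5a walk-through can be sketched as a whole-space first-fit pass followed by a sort on start addresses; the concrete start addresses below are hypothetical, and only the sizes and the address order of S1-S6 follow the text:

```python
# Sketch of the FIG. 5a pre-allocation: each discrete free space is granted
# whole to the first data block that fits, then blocks are re-sorted by the
# start address of their pre-allocated space.

free = [("S1", 0, 2), ("S2", 5, 7), ("S3", 20, 10),
        ("S4", 35, 4), ("S5", 45, 6), ("S6", 60, 5)]  # (name, start, size)
seq1 = [("D1", 9), ("D2", 6), ("D3", 2), ("D4", 5), ("D5", 3), ("D6", 4)]

def preallocate(blocks, spaces):
    spaces = sorted(spaces, key=lambda s: s[1])  # scan from the lowest address
    taken, placed = set(), {}
    for blk, size in blocks:
        for name, start, cap in spaces:
            if name not in taken and cap >= size:  # first space that fits
                taken.add(name)
                placed[blk] = start
                break
    return sorted(placed, key=placed.get)          # re-sort by start address

print(preallocate(seq1, free))  # ['D3', 'D2', 'D1', 'D5', 'D4', 'D6']
```

The result matches the reordered sequence 1′ above: D3 lands in S1, D2 in S2, D1 in S3, D5 in S4, D4 in S5, and D6 in S6.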
The above description is merely an example, and FIG. 5b illustrates a schematic diagram of pre-allocating storage for a data block according to another embodiment of the present disclosure. In FIG. 5b, the storage spaces S1 and S2 are contiguous, in which case the manner in which data is allocated for the data blocks is different.
According to an embodiment of the present disclosure, pre-allocating storage space for sequence 1 = {D1, D2, D3, D4, D5, D6} may also be performed as follows:
first, a storage space capable of accommodating the data block D1(size ═ 9) is searched from the storage spaces S1 to S6, and it can be seen that since the storage spaces S1 and S2 are contiguous, the storage spaces S1 and S2 can be commonly allocated to the data block D1.
Searching for a storage space capable of accommodating data block D2(size 6) from storage spaces S1-S6, it can be seen that storage space S3 (size 10) can be pre-allocated to data block D2, and then available storage space of size 4 remains in storage space S3.
Searching for a storage space capable of accommodating the data block D3(size 2) from the storage spaces S1-S6, it can be seen that the storage space S3 (remaining 4) may be pre-allocated to the data block D3.
Searching for a storage space capable of accommodating the data block D4(size 5) from the storage spaces S1-S6, it can be seen that the storage space S5 (size 6) may be pre-allocated to the data block D3.
Searching for a storage space capable of accommodating the data block D5(size 3) from the storage spaces S1-S6, it can be seen that the storage space S4 (size 4) may be pre-allocated to the data block D5.
Searching for a storage space capable of accommodating the data block D6(size 4) from the storage spaces S1-S6, it can be seen that the storage space S6 (5) may be pre-allocated to the data block D6.
After pre-allocating the storage space, the data blocks may be reordered according to the start address of each storage space, forming a reordered sequence 1′ = {D1, D2, D3, D5, D4, D6}.
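The fig. 5b variant differs from the fig. 5a sketch only in that contiguous free spaces are coalesced (S1 + S2) and an oversized space is split so that its remainder stays available; again the start addresses below are hypothetical:

```python
# Sketch of the FIG. 5b pre-allocation: coalesce contiguous free spaces,
# then first-fit with splitting, then re-sort blocks by start address.

free = [(0, 2), (2, 7), (20, 10), (35, 4), (45, 6), (60, 5)]  # (start, size)
seq1 = [("D1", 9), ("D2", 6), ("D3", 2), ("D4", 5), ("D5", 3), ("D6", 4)]

def coalesce(spaces):
    spaces = sorted(spaces)
    merged = [list(spaces[0])]
    for start, size in spaces[1:]:
        if merged[-1][0] + merged[-1][1] == start:  # contiguous: merge
            merged[-1][1] += size
        else:
            merged.append([start, size])
    return merged

def preallocate(blocks, spaces):
    holes = coalesce(spaces)
    placed = {}
    for blk, size in blocks:
        for hole in holes:
            if hole[1] >= size:      # first hole that fits
                placed[blk] = hole[0]
                hole[0] += size      # split: the remainder stays available
                hole[1] -= size
                break
    return sorted(placed, key=placed.get)

print(preallocate(seq1, free))  # ['D1', 'D2', 'D3', 'D5', 'D4', 'D6']
```

Here D1 takes the coalesced S1 + S2 region, D2 takes part of S3, and D3 takes the size-4 remainder of S3, matching the reordered sequence above.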
It can be seen that, in the examples described above in connection with figs. 5a and 5b, the pre-allocation of storage space for each of the plurality of data blocks is performed by retrieving, starting from a predetermined address of the memory, the first storage space capable of accommodating each data block (a first-fit strategy).
Based on the above rules, it can be appreciated that if the initial sequence of data blocks is different, the sequence formed by the final pre-allocation of storage space followed by re-ordering will also be different. As described above in connection with fig. 2a and 2b, different orders of allocating storage space for data blocks may result in different storage space multiplexing efficiencies. Therefore, it is necessary to find a storage space allocation method that can maximize the spatial multiplexing efficiency.
Preferably, the arrangement order of the data blocks in the data block sequence may be adjusted to maximize the continuity of the data blocks having a multiplexing relationship. A sequence is considered optimal if it yields the best continuity of such data blocks; in other words, such a sequence represents the optimal allocation order and multiplexes the storage space to the maximum extent.
However, for a large number of data blocks, pre-allocating storage space for every possible data block sequence requires a huge amount of computation. It is therefore necessary to find a method that can adjust the data block sequences efficiently, so that adjusting the sequences offers a greater chance of finding a preferred data block sequence.
According to one embodiment of the disclosure, before pre-allocating storage space for each of a plurality of data blocks, a plurality of initialization data block populations are formed, each initialization data block population including a plurality of data blocks having different initial orders, so as to pre-allocate storage space for the plurality of data blocks in each initialization population according to the initial order.
In fact, sequence 1 introduced above is only one of many possible sequences; various other sequences exist.
Fig. 6a to 6c show exemplary allocation of memory spaces for two further sequences.
Figs. 6a to 6c exemplarily use two randomly selected sequences of D1-D6, for example sequence 2 = {D6, D3, D1, D5, D4, D2} and sequence 3 = {D2, D3, D5, D6, D4, D1}.
According to an embodiment of the present disclosure, sorting the plurality of data blocks according to a start address of a pre-allocated storage space to form a data block sequence may include: and sorting the plurality of data blocks in each initial population according to the starting addresses of the storage space pre-allocated for the plurality of data blocks in each initial population to form a plurality of data block sequences, wherein each data block sequence comprises the plurality of data blocks with different arrangement orders.
Assume that the memory has several storage spaces starting from the start address, where the sizes of the first six available storage spaces are S1 = 2, S2 = 7, S3 = 10, S4 = 4, S5 = 6 and S6 = 5, the size of the 7th storage space is S7 = 7, and the size of the 8th storage space is S8 = 11. The storage spaces S1-S8 are arranged according to their storage addresses and are discrete and unconnected. The pre-allocation results are then as follows.
For sequence 2, a search is made from storage spaces S1-S6 for storage space capable of accommodating data block D6(size 4), and it can be seen that storage space S2 (size 7) can be pre-allocated to data block D6.
Searching for a storage space capable of accommodating the data block D3(size 2) from the storage spaces S1-S6, it can be seen that the storage space S1 (size 2) may be pre-allocated to the data block D3.
Searching for a storage space capable of accommodating the data block D1(size 9) from the storage spaces S1-S6, it can be seen that the storage space S3 (size 10) may be pre-allocated to the data block D1.
Searching for a storage space capable of accommodating the data block D5(size 3) from the storage spaces S1-S6, it can be seen that the storage space S4 (size 4) may be pre-allocated to the data block D5.
Searching for a storage space capable of accommodating the data block D4(size 5) from the storage spaces S1-S6, it can be seen that the storage space S5 (size 6) may be pre-allocated to the data block D4.
Searching the storage spaces S1-S6 for a storage space capable of accommodating the data block D2 (size 6), it can be seen that none of the remaining storage spaces S1-S6 can accommodate the data block D2, so a new storage space can be searched for further on; for example, the storage space S7 (size 7) is pre-allocated to the data block D2, while the storage space S6 is temporarily left unoccupied.
After pre-allocating the storage spaces, the data blocks may be reordered according to the starting address of each storage space, thereby forming a reordered sequence 2 = {D3, D6, D1, D5, D4, D2}.
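The pre-allocate-then-reorder step can likewise be sketched; `reorder_by_address` below is a hypothetical helper that applies the first-fit rule to a given initial order and then sorts the blocks by the address (here, the list index) of the space each one received:

```python
def reorder_by_address(order, sizes, spaces):
    """Pre-allocate spaces first-fit in the given order, then sort the
    blocks by the starting address (list index) of their space."""
    free = list(enumerate(spaces))          # (address index, capacity)
    assigned = {}
    for block in order:
        for i, (addr, cap) in enumerate(free):
            if cap >= sizes[block]:
                assigned[block] = addr      # block gets this address
                free.pop(i)
                break
    return sorted(assigned, key=assigned.get)

sizes = {"D1": 9, "D2": 6, "D3": 2, "D4": 5, "D5": 3, "D6": 4}
spaces = [2, 7, 10, 4, 6, 5, 7, 11]         # S1..S8 in address order
reordered2 = reorder_by_address(["D6", "D3", "D1", "D5", "D4", "D2"],
                                sizes, spaces)
print(reordered2)
# ['D3', 'D6', 'D1', 'D5', 'D4', 'D2']
```

Running the same helper on sequence 3 = {D2, D3, D5, D6, D4, D1} reproduces the reordered sequence {D3, D2, D5, D6, D4, D1} of the example.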
For sequence 3, a search is made from storage spaces S1-S6 for storage space capable of accommodating data block D2(size 6), and it can be seen that storage space S2 (size 7) can be pre-allocated to data block D2.
Searching for a storage space capable of accommodating the data block D3(size 2) from the storage spaces S1-S6, it can be seen that the storage space S1 (size 2) may be pre-allocated to the data block D3.
Searching for a storage space capable of accommodating the data block D5(size 3) from the storage spaces S1-S6, it can be seen that the storage space S3 (size 10) may be pre-allocated to the data block D5.
Searching for a storage space capable of accommodating the data block D6(size 4) from the storage spaces S1-S6, it can be seen that the storage space S4 (size 4) may be pre-allocated to the data block D6.
Searching for a storage space capable of accommodating the data block D4(size 5) from the storage spaces S1-S6, it can be seen that the storage space S5 (size 6) may be pre-allocated to the data block D4.
Searching for a storage space capable of accommodating the data block D1 (size 9), it can be seen that none of the remaining storage spaces S1-S7 can accommodate the data block D1, so a new storage space can be searched for further on; for example, the storage space S8 (size 11) is pre-allocated to the data block D1, while the storage spaces S6 and S7 are temporarily left unoccupied.
After pre-allocating the storage spaces, the data blocks may be reordered according to the starting addresses of the storage spaces, thereby forming a reordered sequence 3 = {D3, D2, D5, D6, D4, D1}.
For sequence 3, other, possibly better, allocation patterns also exist, as shown in fig. 6c.
For sequence 3, a search is made from storage spaces S1-S6 for storage space capable of accommodating data block D2(size 6), and it can be seen that storage space S2 (size 7) can be pre-allocated to data block D2.
Searching for a storage space capable of accommodating the data block D3(size 2) from the storage spaces S1-S6, it can be seen that the storage space S1 (size 2) may be pre-allocated to the data block D3.
Unlike the allocation shown in fig. 6b, when searching for a storage space capable of accommodating the data block D5 (size 3) from the storage spaces S1-S6, the storage space S3 (size 10) can be pre-allocated to the data block D5, leaving remaining available storage space of size 7 in the storage space S3.
Next, a storage space capable of accommodating the data block D6 (size 4) is searched for from the storage spaces S1-S6, and it can be seen that the storage space S3 (remaining size 7) can be pre-allocated to the data block D6, after which the remaining storage space in the storage space S3 has size 3.
Searching for a storage space capable of accommodating the data block D4(size 5) from the storage spaces S1-S6, it can be seen that the storage space S5 (size 6) may be pre-allocated to the data block D4.
Searching for a storage space capable of accommodating the data block D1 (size 9), it can be seen that none of the remaining storage spaces S1-S7 can accommodate the data block D1, so a new storage space can be searched for further on; for example, the storage space S8 (size 11) is pre-allocated to the data block D1, while the storage spaces S4, S6 and S7 are temporarily left unoccupied.
After pre-allocating the storage spaces, the data blocks may be reordered according to the starting addresses of the storage spaces, thereby forming a reordered sequence 3' = {D3, D2, D5, D6, D4, D1}.
The above has exemplified the reordered sequences generated after allocating storage space for different data block sequences. Accordingly, in an embodiment of the present disclosure, some data block sequences may be randomly selected from the original data blocks as an initialization data block population; different reordered sequences may be formed from this population, and by further adjusting the reordered sequences, still more data block sequences may be formed, giving more opportunities to find a better data block sequence.
In the above description, for ease of understanding, different storage spaces are allocated to different data blocks. According to an embodiment of the present disclosure, pre-allocating the first storage space capable of accommodating each data block may include: allocating the same starting address to data blocks having a multiplexing relationship.
Assume that the data blocks D2 (size 6), D3 (size 2), D4 (size 5) and D6 (size 4) have no conflict relationship and can multiplex the same storage space. Then, for the reordered sequence 1 = {D3, D2, D1, D5, D4, D6}, the data blocks D4 and D6 may multiplex the storage space of the data block D2, i.e., the storage space S2 may also be multiplexed by the data blocks D4 and D6.
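The sharing rule can be sketched as a first-fit variant in which a block may receive the same starting address as earlier blocks, provided every current occupant of that space belongs to the same multiplexing group and the space is large enough. This is an illustrative reconstruction with invented names:

```python
def allocate_with_sharing(order, sizes, group, spaces):
    """First-fit, but a block in `group` may share the starting address of
    a space whose occupants are all also in `group` (no conflict)."""
    free = list(enumerate(spaces))   # (address index, capacity), address order
    occupants, capacity, result = {}, {}, {}
    for b in order:
        shared = None
        if b in group:
            # reuse the first already-allocated space whose occupants all
            # belong to the multiplexing group and which can hold the block
            shared = next((a for a, blks in occupants.items()
                           if capacity[a] >= sizes[b]
                           and all(o in group for o in blks)), None)
        if shared is not None:
            occupants[shared].append(b)
            result[b] = shared
            continue
        for i, (a, cap) in enumerate(free):
            if cap >= sizes[b]:                  # plain first fit otherwise
                occupants[a], capacity[a], result[b] = [b], cap, a
                free.pop(i)
                break
    return result

sizes = {"D1": 9, "D2": 6, "D3": 2, "D4": 5, "D5": 3, "D6": 4}
spaces = [2, 7, 10, 4, 6, 5]                     # S1..S6 in address order
group = {"D2", "D3", "D4", "D6"}
placement = allocate_with_sharing(["D3", "D2", "D1", "D5", "D4", "D6"],
                                  sizes, group, spaces)
print(placement)
# {'D3': 0, 'D2': 1, 'D1': 2, 'D5': 3, 'D4': 1, 'D6': 1}
```

For the reordered sequence 1, D4 and D6 end up at the same address index as D2 (index 1, i.e. S2), as described above, while D3's space (S1, size 2) is too small for the others to reuse.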
For the reordered sequence 2 = {D3, D6, D1, D5, D4, D2}, the data blocks D3 (size 2), D6 (size 4), D4 (size 5) and D2 (size 6) are allocated in order of increasing size, so a data block allocated later can hardly multiplex the storage space of a data block allocated earlier: the data blocks D2, D4 and D6 cannot multiplex the storage space of the data block D3, the data blocks D2 and D4 cannot multiplex the storage space of the data block D6, and the data block D2 cannot multiplex the storage space of the data block D4.
For the reordered sequence 3 = {D3, D2, D5, D6, D4, D1}, the data blocks D6 and D4 may multiplex the storage space of the data block D2, i.e., the storage space S2 may also be multiplexed by the data blocks D6 and D4.
It can be seen that for the reordered sequence 1 = {D3, D2, D1, D5, D4, D6}, the reordered sequence 2 = {D3, D6, D1, D5, D4, D2} and the reordered sequence 3 = {D3, D2, D5, D6, D4, D1}, with respect to the four data blocks D2 (size 6), D3 (size 2), D4 (size 5) and D6 (size 4) having a multiplexing relationship, storage space should theoretically be allocated first for the data block D2 and then for the data blocks D3, D4 and D6, because D2 is the largest of these data blocks and its storage space can therefore be multiplexed by D3, D4 and D6. In the reordered sequences 1 and 3, the data block D3 is allocated before the data block D2, so the data block D2 cannot multiplex the storage space of the data block D3; in the reordered sequence 2, the data blocks D3, D4 and D6 are all allocated before the data block D2, which obviously prevents the data blocks D2, D3, D4 and D6 from being well multiplexed.
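One way to quantify why the allocation order matters is to count, for a given ordering, the (earlier, later) pairs inside the multiplexing group that actually permit reuse — a later block can reuse an earlier block's space only if it is no larger. A small hypothetical sketch:

```python
def reuse_pairs(order, sizes, group):
    """(earlier, later) pairs within `group` where the later block fits in
    the earlier block's storage space (later size <= earlier size)."""
    members = [b for b in order if b in group]
    return [(early, late)
            for i, early in enumerate(members)
            for late in members[i + 1:]
            if sizes[late] <= sizes[early]]

sizes = {"D1": 9, "D2": 6, "D3": 2, "D4": 5, "D5": 3, "D6": 4}
group = {"D2", "D3", "D4", "D6"}
# Reordered sequence 2 allocates the group in increasing size: no reuse at all.
print(len(reuse_pairs(["D3", "D6", "D1", "D5", "D4", "D2"], sizes, group)))  # 0
# Allocating D2 (the largest) first permits reuse by every later group member.
print(len(reuse_pairs(["D2", "D4", "D6", "D3", "D5", "D1"], sizes, group)))  # 6
```

The counts 0 versus 6 make concrete the observation above: placing the largest block of the group first maximizes the reuse opportunities.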
Therefore, before the storage space is actually allocated for the data blocks, the allocation order of the data blocks needs to be further adjusted to maximize the continuity of the data blocks having the multiplexing relationship.
Fig. 7 is a flowchart illustrating a method for adjusting the order of data blocks in the data block sequence to maximize the continuity of the data blocks in the multiplexing relationship according to an embodiment of the present disclosure.
As shown in fig. 7, the order of the data blocks in the data block sequence may be adjusted as follows: in operation S710, selecting a first data block sequence containing a contiguous-space data block sequence as a first parent sequence, wherein the contiguous-space data block sequence comprises a plurality of data blocks having contiguous storage spaces; in operation S720, selecting a second data block sequence as a second parent sequence; in operation S730, combining the contiguous-space data block sequence of the first parent sequence, as a genetic factor, with the other data blocks of the second parent sequence to form an initial child sequence; in operation S740, adjusting the arrangement position of one or more data blocks in the initial child sequence, thereby forming a first variant child sequence; and in operation S750, comparing the initial child sequence with the first variant child sequence to select the child sequence with the greater continuity of the data blocks having a multiplexing relationship.
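Operations S710-S750 amount to one generation of a genetic algorithm. The sketch below is one plausible reading, scoring "continuity" as the longest run of consecutive blocks that all belong to the multiplexing group; this scoring choice and the helper names are assumptions, not taken from the patent:

```python
import random

def continuity(seq, group):
    """Longest run of consecutive blocks that all belong to `group`."""
    best = run = 0
    for b in seq:
        run = run + 1 if b in group else 0
        best = max(best, run)
    return best

def crossover(factor, parent2):
    """S730: keep the contiguous run `factor` intact and place parent2's
    remaining blocks around it at a random cut point."""
    rest = [b for b in parent2 if b not in factor]
    cut = random.randint(0, len(rest))
    return rest[:cut] + list(factor) + rest[cut:]

def mutate(seq):
    """S740: move one randomly chosen block to a random new position."""
    s = list(seq)
    b = s.pop(random.randrange(len(s)))
    s.insert(random.randrange(len(s) + 1), b)
    return s

group = {"D2", "D3", "D4", "D6"}
factor = ["D4", "D6", "D3"]                       # genetic factor of fig. 8
child = crossover(factor, ["D3", "D6", "D1", "D5", "D4", "D2"])  # S730
variant = mutate(child)                                          # S740
best = max([child, variant], key=lambda s: continuity(s, group)) # S750
```

By construction, the child always contains the genetic factor {D4, D6, D3} as an unbroken run, and the selection step keeps whichever of the two candidates scores higher.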
Fig. 8 shows an exemplary diagram of adjusting a sequence of data blocks. The flowchart shown in fig. 7 will be described in detail below with reference to fig. 8.
As shown in fig. 8, assume a reordered sequence 4 = {D1, D4, D6, D3, D5, D2}, where D2 (size 6), D3 (size 2), D4 (size 5) and D6 (size 4) have no conflict relationship and can multiplex the same storage space. It can be seen that the reordered sequence 4 contains a contiguous-space data block sequence 1 = {D4, D6, D3}, which obviously facilitates spatial multiplexing. This sequence can be inherited as a genetic factor in the hope of finding a better reordered sequence. In fig. 8, the contiguous-space data block sequence 1 is marked in gray.
The reordered sequence 4 can be combined or "crossed" as a first parent sequence with another second parent sequence to form an initial progeny sequence that retains the genetic elements of the parent sequence.
Assuming that the reordered sequence 2 = {D3, D6, D1, D5, D4, D2} is randomly selected as the second parent sequence, then, as shown in fig. 8, the crossover procedure between the reordered sequence 4 = {D1, D4, D6, D3, D5, D2} as the first parent sequence and the reordered sequence 2 = {D3, D6, D1, D5, D4, D2} as the second parent sequence is as follows.
First, the contiguous-space data block sequence 1 = {D4, D6, D3} determined above is extracted as a genetic factor to form a sequence A; then the other data blocks {D1, D5, D2} of the second parent sequence are extracted to form a sequence B; and, while keeping sequence A intact, sequence A and sequence B are combined or interleaved arbitrarily to form an initial child sequence C, e.g., initial child sequence C = {D5, D1, D4, D6, D3, D2}.
It is to be understood that since the initial child sequence C is generated by randomly combining the sequence a and the sequence B, the above initial child sequence C is only an example, and may be in other forms.
After the initial child sequence C = {D5, D1, D4, D6, D3, D2} is obtained, it can be compared with the first parent sequence to determine which of the two is superior.
If the initial child sequence C is more optimal than the first parent sequence, e.g., includes the largest number of consecutive data blocks in a multiplexing relationship, then the initial child sequence can be further inherited as a better child sequence.
If it is not superior to the first parent sequence, the child sequence can be further adjusted. For example, the arrangement position of one or more data blocks in the initial child sequence is randomly adjusted, e.g., the initial child sequence is adjusted from C = {D5, D1, D4, D6, D3, D2} to the first variant child sequence D = {D2, D5, D1, D4, D6, D3}.
The first variant child sequence D obtained above is superior to both the first parent sequence and the initial child sequence C: the continuity of the data blocks having a multiplexing relationship remains maximal, and since the allocation order of the data block D2 is moved before the data blocks D4, D6 and D3, the space allocated to the data block D2 can be multiplexed by D4, D6 and D3.
It should be understood that the above example merely shifts the position of the data block D2; in a particular implementation, any of the data blocks D1-D6 may be shifted in order to adjust the order of the entire data block sequence. For a large number of data blocks, the adjusted data block sequence may be better than the previous one.
As can be seen from fig. 8, although the first variant child sequence already satisfies the need for D2 (size 6), D3 (size 2), D4 (size 5) and D6 (size 4) to multiplex storage space, there is still room for further improvement.
Fig. 9 shows a flowchart of a method for further adjusting the order of the data blocks in the data block sequence to maximize the continuity of the data blocks in the multiplexing relationship according to another embodiment of the present disclosure.
As shown in fig. 9, the method includes operations S710-S750 of fig. 7, the detailed description of which is omitted here. The order of the data blocks in the data block sequence may be adjusted by further operations: in operation S760, taking the child sequence with the greatest continuity of the data blocks having a multiplexing relationship as the best child sequence; in operation S770, adjusting the arrangement position of one or more data blocks in the best child sequence, thereby forming a second variant child sequence; and in operation S780, comparing the best child sequence with the second variant child sequence to select, as the new best child sequence, the one with the greater continuity of the data blocks having a multiplexing relationship.
The method shown in fig. 9 is actually an iterative process, i.e., the more optimal genetic element is continuously passed down or inherited, and as the sequence of descendants is continuously adjusted, a more optimal or optimal sequence of data blocks is gradually found.
As further shown in fig. 8 and as described above, the first variant child sequence D is better than the first parent sequence and the initial child sequence C, because the continuity of the data blocks having a multiplexing relationship remains the greatest and, since the allocation order of the data block D2 is moved before the data blocks D4, D6 and D3, the data blocks D4, D6 and D3 can multiplex the space allocated to the data block D2. The first variant child sequence D can, however, still be further adjusted to find an even better data block sequence.
The first variant child sequence D = {D2, D5, D1, D4, D6, D3} may be used as the current best child sequence, and the arrangement positions of one or more data blocks in it may be adjusted; after each adjustment, the new child sequence is compared with the current best child sequence, until a qualified child sequence is found. For example, when the child sequence becomes E1 = {D5, D1, D2, D4, D6, D3}, E2 = {D2, D4, D6, D3, D5, D1}, E3 = {D5, D2, D4, D6, D3, D1} or E4 = {D1, D2, D4, D6, D3, D5}, it can be regarded as a final optimal child sequence. Child sequence E1 is shown in fig. 8 as an exemplary second variant child sequence, where the gray portion represents the contiguous data block sequence that can be multiplexed.
It can be understood that, for a contiguous storage space, the total storage space occupied by any of the child sequences E1-E4 is 3 (D5) + 9 (D1) + 6 (D2) = 18, which saves about 38% of storage space compared with the size-29 storage space (the total occupied by D1 to D6) of the conventional, non-multiplexing solution.
For discrete storage spaces, the data blocks D2, D4, D6 and D3 having the multiplexing relationship only need to occupy a storage space of size 6, which saves 11/17 ≈ 64% of storage space compared with the size-17 storage space required when they are not multiplexed.
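The two savings figures can be reproduced with simple arithmetic over the block sizes of the running example:

```python
sizes = {"D1": 9, "D2": 6, "D3": 2, "D4": 5, "D5": 3, "D6": 4}
group = ["D2", "D3", "D4", "D6"]          # blocks that can share one space

# Contiguous memory: D5 and D1 keep their own space, and the group shares
# one space the size of its largest member (D2, size 6).
total = sum(sizes.values())                                             # 29
multiplexed = sizes["D5"] + sizes["D1"] + max(sizes[b] for b in group)  # 18
print(multiplexed, round(1 - multiplexed / total, 2))   # 18 0.38

# Discrete memory: only the group itself is considered.
group_total = sum(sizes[b] for b in group)              # 17 without sharing
saved = group_total - max(sizes[b] for b in group)      # 11 saved by sharing
print(round(saved / group_total, 3))                    # 0.647
```

The printed ratios match the roughly 38% and 64% savings stated in the text.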
A threshold on the number of iterations may be set for the method shown in fig. 9; after the number of iterations reaches the threshold, the iteration may be stopped and the currently best data block sequence selected. Alternatively, the difference between the current best child sequence and the second variant child sequence may be compared, and if the difference is smaller than a preset condition, the iteration may be stopped.
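The iterative refinement of fig. 9 with an iteration-count stopping rule might be sketched as follows, using a single-move mutation and a longest-run continuity score as assumed stand-ins for operators the patent does not specify:

```python
import random

def continuity(seq, group):
    """Longest run of consecutive blocks that all belong to `group`."""
    best = run = 0
    for b in seq:
        run = run + 1 if b in group else 0
        best = max(best, run)
    return best

def mutate(seq):
    """S770: move one randomly chosen block to a random new position."""
    s = list(seq)
    b = s.pop(random.randrange(len(s)))
    s.insert(random.randrange(len(s) + 1), b)
    return s

def refine(seq, group, max_iters=500):
    """Repeat S770/S780: mutate the current best child sequence and keep
    the better of the pair, stopping after `max_iters` iterations."""
    best = list(seq)
    for _ in range(max_iters):
        cand = mutate(best)
        if continuity(cand, group) > continuity(best, group):
            best = cand
    return best

random.seed(0)                                    # reproducible sketch
group = {"D2", "D3", "D4", "D6"}
start = ["D3", "D6", "D1", "D5", "D4", "D2"]      # reordered sequence 2
result = refine(start, group)
```

The alternative stopping rule mentioned above — halting once an accepted improvement falls below a preset difference — would replace the fixed `max_iters` budget with a check on the score gap between successive best sequences.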
It should be understood that although the terms "sequence", "parent sequence", "child sequence", "variant child sequence" and the like differ, they all essentially denote a sequence of data blocks; the different names serve only to distinguish the sequences clearly in different application scenarios and do not constitute any limitation of the present disclosure.
In order to verify the technical effect of the technical solution of the present disclosure, comparison tests were performed on different platforms and network types. Fig. 10a shows the comparison results of the technical solution of the present disclosure (a genetic algorithm, GA), a first-fit (FF) algorithm and a best-fit with coalescing (BFC) algorithm under different operating environments. The comparison index is the memory reduction rate (MRR), calculated as follows: if a task requires a memory space of n bytes without memory space multiplexing and a memory space of s bytes with the spatial multiplexing technique, then MRR = (n - s)/n.
ResNet50, MobileNetV1, MobileNetV2, InceptionV3, GoogLeNet and DenseNet are used as the tested network types, and Cambricon-X is used as the test platform.
It can be seen that in the single-core case, the genetic algorithm of the present disclosure, the FF algorithm and the BFC algorithm achieve average MRRs of 91.1%, 89.8% and 89.9% respectively; all three achieve a good spatial multiplexing effect under the single-core condition, with the genetic algorithm of the present disclosure slightly better. In the 4-core case, the three algorithms achieve average MRRs of 91.9%, 90.6% and 56.6% respectively, and the genetic algorithm and the FF algorithm are clearly superior to the BFC algorithm. In the 16-core case, the three algorithms achieve average MRRs of 87.9%, 86.3% and 50.7% respectively, and again the genetic algorithm and the FF algorithm are clearly superior to the BFC algorithm. In all cases, the genetic algorithm of the present disclosure outperforms the FF algorithm.
Fig. 10b compares the convergence speed of the genetic algorithm of the present disclosure with that of FF using a random search algorithm under the ResNet50, MobileNetV1, MobileNetV2, InceptionV3, GoogLeNet and DenseNet network types, where the abscissa is the number of iterations and the ordinate is the MRR. It can be seen that the GA of the present disclosure converges faster for all network types except MobileNetV2, where the two exhibit nearly the same convergence speed.
Neural network architectures are being applied ever more deeply and widely, and the connections between operations are becoming denser, so that memory capacity becomes a bottleneck. Compared with conventional methods, the technical solution of the present disclosure achieves a higher space reuse rate, reduces the waste of storage space, and reaches the corresponding space reuse rate with a faster convergence speed, which is of particular significance for edge devices with small storage space.
The present disclosure also provides an electronic device, comprising: one or more processors; and a memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described above.
The present disclosure also provides a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.
The technical scheme disclosed by the invention can be applied to the field of artificial intelligence and is realized or realized in an artificial intelligence chip. The chip may exist alone or may be included in a computing device.
Fig. 11 illustrates a combined processing device 1100 that includes a computing device 1102 as described above, a universal interconnect interface 1104, and other processing devices 1106. The computing device according to the present disclosure interacts with other processing devices to collectively perform operations specified by a user. Fig. 11 is a schematic view of a combined processing apparatus.
The other processing devices include one or more types of general-purpose or special-purpose processors, such as central processing units (CPUs), graphics processing units (GPUs) and neural network processors; the number of processors they include is not limited. The other processing devices serve as the interface between the machine learning computation device and external data and control, performing data transfer and basic control such as starting and stopping the machine learning computation device; they may also cooperate with the machine learning computation device to complete computation tasks.
The universal interconnect interface is used for transferring data and control instructions between the computing device (including, for example, a machine learning computation device) and the other processing devices. The computing device obtains the required input data from the other processing devices and writes it into an on-chip storage device of the computing device; it can obtain control instructions from the other processing devices and write them into an on-chip control cache; and it can also read the data in a storage module of the computing device and transmit it to the other processing devices.
Optionally, the architecture may further include a storage device 1108, which is connected to the computing device and the other processing devices, respectively. The storage device is used for storing data in the computing device and the other processing devices, and is particularly suitable for storing all data which cannot be stored in the internal storage of the computing device or the other processing devices.
The combined processing device can serve as an SoC (system on chip) for devices such as mobile phones, robots, unmanned aerial vehicles and video monitoring equipment, effectively reducing the core area of the control part, increasing the processing speed and reducing the overall power consumption. In this case, the universal interconnect interface of the combined processing device is connected to certain components of the apparatus, such as a camera, a display, a mouse, a keyboard, a network card or a WiFi interface.
In some embodiments, the disclosure also discloses a chip packaging structure, which includes the chip.
In some embodiments, the disclosure also discloses a board card comprising the chip packaging structure. Referring to fig. 12, an exemplary card is provided that may include, in addition to the chip 1202, other kits including, but not limited to: a memory device 1204, an interface device 1206, and a control device 1208.
The memory device is connected with the chip in the chip packaging structure through a bus and used for storing data. The memory device may include a plurality of sets of memory cells 1210. Each group of the storage units is connected with the chip through a bus. It is understood that each group of the memory cells may be a DDR SDRAM (Double Data Rate SDRAM).
DDR can double the speed of SDRAM without increasing the clock frequency, because it allows data to be read on both the rising and falling edges of the clock pulse, making it twice as fast as standard SDRAM. In one embodiment, the storage device may include four groups of storage units, and each group may include a plurality of DDR4 chips. In one embodiment, the chip may internally include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits for ECC checking. In one embodiment, each group of storage units includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel; DDR can transfer data twice in one clock cycle. A controller for the DDR is provided in the chip to control the data transmission and data storage of each storage unit.
The interface device is electrically connected with a chip in the chip packaging structure. The interface device is used to realize data transmission between the chip and an external device 1212 (e.g., a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface. For example, the data to be processed is transmitted to the chip by the server through the standard PCIE interface, so as to implement data transfer. In another embodiment, the interface device may also be another interface, and the disclosure does not limit the concrete expression of the other interface, and the interface unit may implement the switching function. In addition, the calculation result of the chip is still transmitted back to an external device (e.g., a server) by the interface device.
The control device is electrically connected to the chip and is used for monitoring the state of the chip; specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a micro controller unit (MCU). The chip may include a plurality of processing chips, processing cores or processing circuits and can drive a plurality of loads, so it can be in different working states such as multi-load and light-load. The control device can regulate the working states of the plurality of processing chips, processing cores and/or processing circuits in the chip.
In some embodiments, the present disclosure also discloses an electronic device or apparatus, which includes the above board card.
Electronic devices or apparatuses include data processing apparatuses, robots, computers, printers, scanners, tablets, smart terminals, cell phones, automobile data recorders, navigators, sensors, cameras, servers, cloud servers, cameras, video cameras, projectors, watches, headsets, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
It is noted that while for simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, optical, acoustic, magnetic or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software program module.
The integrated unit, if implemented in the form of a software program module and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present disclosure may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing program code.
The foregoing detailed description of the embodiments of the present disclosure has been presented for purposes of illustration and description only; it is exemplary and is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Those skilled in the art may, based on the ideas of the present disclosure, make changes to the specific embodiments and to the scope of application. In summary, the contents of this specification should not be construed as limiting the present disclosure.

Claims (13)

1. A method of multiplexing storage space for a block of data, comprising:
determining a conflict relationship between a plurality of data blocks;
determining, according to the conflict relationship, data blocks for which a multiplexing relationship exists; and
adjusting an allocation order of the storage spaces of the plurality of data blocks so that a following data block in the allocation order can multiplex the storage space of a preceding data block.
2. The method of claim 1, wherein determining a conflict relationship between a plurality of data blocks comprises:
determining life cycles of the plurality of data blocks; and
determining that data blocks whose life cycles do not overlap have a conflict-free relationship.
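As a non-authoritative illustration of the life-cycle test in claim 2 (the interval representation of a life cycle is an assumption introduced here for illustration, not part of the claims), two data blocks conflict only when their life cycles overlap:

```python
def conflicts(life_a, life_b):
    """Return True if two data blocks conflict, i.e. their life cycles
    (inclusive (first_use_step, last_use_step) intervals) overlap.
    Blocks whose life cycles do not cross are conflict-free, so the
    later block may multiplex the earlier block's storage space."""
    a_start, a_end = life_a
    b_start, b_end = life_b
    return not (a_end < b_start or b_end < a_start)
```

Under this sketch, a block alive during steps 0-2 and a block alive during steps 3-5 are conflict-free, so the later block may reuse the earlier block's storage.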
3. The method of claim 1 or 2, wherein determining data blocks for which a multiplexing relationship exists according to the conflict relationship comprises:
determining the sizes of the data blocks having no conflict relationship, wherein the size of a data block whose storage space can be multiplexed is not smaller than the size of the data block to be multiplexed.
4. The method of claim 3, wherein adjusting the allocation order of the storage spaces of the plurality of data blocks so that a following data block in the allocation order can multiplex the storage space of a preceding data block comprises:
adjusting the allocation order of the storage spaces of the plurality of data blocks so that the continuity of the storage spaces of at least a part of the plurality of data blocks meets a preset condition.
5. The method of claim 4, wherein adjusting the allocation order of the storage spaces of the plurality of data blocks so that the continuity of the storage spaces of at least a portion of the plurality of data blocks meets a preset condition comprises:
pre-allocating a storage space for each of a plurality of data blocks, wherein the size of the pre-allocated storage space is not smaller than the size of the corresponding data block;
sorting the plurality of data blocks according to the start addresses of the pre-allocated storage spaces to form a data block sequence; and
adjusting the arrangement order of the data blocks in the data block sequence so as to maximize the continuity of the data blocks having a multiplexing relationship.
6. The method of claim 5, further comprising: before pre-allocating a storage space for each of the plurality of data blocks, forming a plurality of initialization data block populations, wherein each initialization data block population comprises the plurality of data blocks in a different initial order, so that a storage space is pre-allocated for the plurality of data blocks in each initialization population according to its initial order;
wherein sorting the plurality of data blocks according to the start addresses of the pre-allocated storage spaces to form a data block sequence comprises: sorting the plurality of data blocks in each initialization population according to the start addresses of the storage spaces pre-allocated for them, to form a plurality of data block sequences, wherein each data block sequence comprises the plurality of data blocks in a different arrangement order.
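Claims 5 and 6 describe pre-allocating a storage space per block and then sorting blocks by start address to form a data block sequence. The following is a minimal sketch of that idea; the greedy first-fit placement and the tuple layout are assumptions for illustration, not the patented allocation strategy itself:

```python
def pre_allocate(blocks):
    """blocks: list of (name, size, (start_step, end_step)) tuples.
    Place each block at the lowest offset that does not overlap,
    in both address range and life cycle, any already-placed block;
    then sort the placements by start address to form the sequence."""
    placed = []  # (name, offset, size, life)
    for name, size, life in blocks:
        offset = 0
        for _, o, s, l in sorted(placed, key=lambda p: p[1]):
            lives_cross = not (life[1] < l[0] or l[1] < life[0])
            addrs_cross = offset < o + s and o < offset + size
            if lives_cross and addrs_cross:
                offset = o + s  # bump past the conflicting block
        placed.append((name, offset, size, life))
    return sorted(placed, key=lambda p: p[1])  # the data block sequence
```

Blocks whose life cycles never cross can land at the same offset, which is exactly the multiplexing the claims aim to maximize.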
7. The method of claim 5 or 6, wherein pre-allocating a storage space for each of the plurality of data blocks comprises:
searching, starting from a predetermined address of the memory, for a storage space capable of accommodating the data block; and
pre-allocating, for each data block, a first storage space capable of accommodating the data block.
8. The method of claim 7, wherein pre-allocating a first storage space capable of accommodating each data block comprises: allocating the same start address to data blocks having a multiplexing relationship.
9. The method of claim 6, wherein adjusting the order of the data blocks in the sequence of data blocks to maximize the continuity of the data blocks having a multiplexing relationship comprises:
selecting a first sequence of data blocks comprising a sequence of contiguous spatial data blocks as a first parent sequence, wherein the sequence of contiguous spatial data blocks comprises a plurality of data blocks having contiguous storage space;
selecting a second sequence of data blocks as a second parent sequence;
taking the continuous spatial data block sequence in the first parent sequence as a genetic factor, and combining the genetic factor with the other data blocks in the second parent sequence to form an initial progeny sequence;
adjusting the arrangement position of one or more data blocks in the initial progeny sequence, thereby forming a first variant progeny sequence;
comparing the initial progeny sequence with the first variant progeny sequence to select the progeny sequence having the greatest continuity of data blocks in a multiplexing relationship.
10. The method of claim 9, further comprising performing the following iterative operations:
taking the progeny sequence having the greatest continuity of data blocks in a multiplexing relationship as the best progeny sequence;
adjusting the arrangement position of one or more data blocks in the best progeny sequence, thereby forming a second variant progeny sequence; and
comparing the best progeny sequence with the second variant progeny sequence to select the best progeny sequence with the greatest continuity of data blocks in a multiplexing relationship.
11. The method of claim 10, wherein the iterative operations are stopped in response to the number of iterations reaching a preset threshold, or in response to the difference between the best progeny sequence and the second variant progeny sequence being less than a preset condition.
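Claims 9 to 11 describe a genetic-style search that mutates a progeny sequence, keeps the better variant, and stops after a budget of iterations. The sketch below simplifies this to the mutate-compare-iterate core of claims 10 and 11 (omitting the crossover step of claim 9); the fitness function, which counts adjacent conflict-free pairs, and the swap mutation are illustrative assumptions:

```python
import random

def continuity(seq, conflict_free):
    """Fitness: number of adjacent block pairs that could share storage."""
    return sum((a, b) in conflict_free or (b, a) in conflict_free
               for a, b in zip(seq, seq[1:]))

def evolve(seq, conflict_free, iterations=200, seed=0):
    """Iterate: swap two blocks in the best progeny sequence to form a
    variant, keep whichever sequence has the greater continuity, and
    stop once the iteration budget (a preset threshold) is exhausted."""
    rng = random.Random(seed)
    best = list(seq)
    for _ in range(iterations):
        variant = list(best)
        i, j = rng.sample(range(len(variant)), 2)
        variant[i], variant[j] = variant[j], variant[i]
        if continuity(variant, conflict_free) > continuity(best, conflict_free):
            best = variant
    return best
```

Because a variant is only accepted when its continuity strictly improves, the returned sequence never has lower fitness than the input sequence, matching the select-the-best comparison in claim 10.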
12. An electronic device, comprising:
one or more processors; and
memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method of any of claims 1-11.
13. A computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the method of any one of claims 1-11.
CN202110247330.0A 2021-03-05 2021-03-05 Method for multiplexing storage space of data block and related product Active CN112965663B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110247330.0A CN112965663B (en) 2021-03-05 2021-03-05 Method for multiplexing storage space of data block and related product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110247330.0A CN112965663B (en) 2021-03-05 2021-03-05 Method for multiplexing storage space of data block and related product

Publications (2)

Publication Number Publication Date
CN112965663A true CN112965663A (en) 2021-06-15
CN112965663B CN112965663B (en) 2024-07-02

Family

ID=76276819

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110247330.0A Active CN112965663B (en) 2021-03-05 2021-03-05 Method for multiplexing storage space of data block and related product

Country Status (1)

Country Link
CN (1) CN112965663B (en)

Citations (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6185663B1 (en) * 1998-06-15 2001-02-06 Compaq Computer Corporation Computer method and apparatus for file system block allocation with multiple redo
US6360233B1 (en) * 1998-06-25 2002-03-19 U.S. Philips Corporation Dynamic memory space allocation
US20030088744A1 (en) * 2001-11-06 2003-05-08 Infineon Technologies Aktiengesellschaft Architecture with shared memory
JP2009122858A (en) * 2007-11-13 2009-06-04 Konami Digital Entertainment Co Ltd Storage processing apparatus, storage processing method, and program
CN102096638A (en) * 2010-11-25 2011-06-15 意法·爱立信半导体(北京)有限公司 Allocation method and device of static storage
US20130179597A1 (en) * 2012-01-06 2013-07-11 International Business Machines Corporation Compression block input/output reduction
US20140201491A1 (en) * 2013-01-15 2014-07-17 International Business Machines Corporation Efficient allocation and reclamation of thin-provisioned storage
CN108829610A (en) * 2018-04-02 2018-11-16 浙江大华技术股份有限公司 EMS memory management process and equipment during a kind of neural network forward calculation
US20190155328A1 (en) * 2017-10-20 2019-05-23 Graphcore Limited Synchronization in a multi-tile processing array
CN110162338A (en) * 2019-05-31 2019-08-23 北京中科寒武纪科技有限公司 Operation method, device and Related product
CN110597616A (en) * 2018-06-13 2019-12-20 华为技术有限公司 Memory allocation method and device for neural network
CN111124995A (en) * 2019-12-24 2020-05-08 上海寒武纪信息科技有限公司 Method and apparatus for processing a one-dimensional complex array by an artificial intelligence processor
CN111488221A (en) * 2020-06-29 2020-08-04 北京一流科技有限公司 Memory space pre-allocation system and method in static network
US20200249998A1 (en) * 2019-02-01 2020-08-06 Alibaba Group Holding Limited Scheduling computation graph heterogeneous computer system
CN111708641A (en) * 2020-07-14 2020-09-25 腾讯科技(深圳)有限公司 Memory management method, device and equipment and computer readable storage medium
CN111767999A (en) * 2019-04-02 2020-10-13 上海寒武纪信息科技有限公司 Data processing method and device and related products
CN111814971A (en) * 2020-06-30 2020-10-23 杭州国芯科技股份有限公司 Memory allocation method of neural network
CN111984400A (en) * 2020-07-17 2020-11-24 深圳云天励飞技术有限公司 Memory allocation method and device of neural network
CN112084037A (en) * 2020-09-23 2020-12-15 安徽寒武纪信息科技有限公司 Memory allocation method and device of neural network
US20200402198A1 (en) * 2019-06-24 2020-12-24 Intel Corporation Shared local memory read merge and multicast return
CN112183724A (en) * 2020-09-27 2021-01-05 安徽寒武纪信息科技有限公司 Method of providing neural network, computing device, and computer-readable storage medium
CN112199190A (en) * 2020-07-31 2021-01-08 厦门星宸科技有限公司 Memory allocation method and device, storage medium and electronic equipment
US20210011849A1 (en) * 2019-04-01 2021-01-14 Wave Computing, Inc. Processor cluster address generation


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ma Weiliang; Peng Xuan; Xiong Qian; Shi Xuanhua; Jin Hai: "A Survey of Memory Management Issues in Deep Learning", Big Data (大数据), no. 04 *

Also Published As

Publication number Publication date
CN112965663B (en) 2024-07-02

Similar Documents

Publication Publication Date Title
KR20200139829A (en) Network on-chip data processing method and device
US10114866B2 (en) Memory-constrained aggregation using intra-operator pipelining
CN111860807B (en) Fractal calculation device, fractal calculation method, integrated circuit and board card
KR102646619B1 (en) Method and system providing file system for an electronic device comprising a composite memory device
CN112686379B (en) Integrated circuit device, electronic apparatus, board and computing method
KR20200138411A (en) Network-on-chip data processing method and device
CN110059797A (en) A kind of computing device and Related product
CN111124995A (en) Method and apparatus for processing a one-dimensional complex array by an artificial intelligence processor
KR20200138413A (en) Network-on-chip data processing method and device
CN111125628A (en) Method and apparatus for processing two-dimensional data matrix by artificial intelligence processor
US20210255793A1 (en) System and method for managing conversion of low-locality data into high-locality data
CN112965663A (en) Method for multiplexing storage space of data block and related product
KR20200138414A (en) Network-on-chip data processing method and device
CN113688064A (en) Method and equipment for allocating storage address for data in memory
US20230153157A1 (en) Inter-node communication method and device based on multiple processing nodes
CN111209230B (en) Data processing device, method and related product
CN111382852B (en) Data processing device, method, chip and electronic equipment
CN115202859A (en) Memory expansion method and related equipment
KR20200139256A (en) Network-on-chip data processing method and device
CN112395008A (en) Operation method, operation device, computer equipment and storage medium
CN111210011B (en) Data processing device and related product
US11442643B2 (en) System and method for efficiently converting low-locality data into high-locality data
EP4407454A1 (en) Computing device and method for performing binary operation of multi-dimensional data, and related product
CN111767999A (en) Data processing method and device and related products
CN113688063A (en) Method and equipment for allocating storage address for data in memory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant