CN114266302A - Deep learning Embedding data efficient processing system and method for heterogeneous memory device - Google Patents

Deep learning Embedding data efficient processing system and method for heterogeneous memory device

Info

Publication number
CN114266302A
CN114266302A (application CN202111547323.9A)
Authority
CN
China
Prior art keywords: data, Embedding, sorting, packing, closed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111547323.9A
Other languages
Chinese (zh)
Inventor
何水兵
陈平
李旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202111547323.9A
Publication of CN114266302A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a system and method for efficiently processing deep learning Embedding data on heterogeneous memory devices. The system comprises three modules: an Embedding data placement module, which classifies the Embedding data and pre-summed data and places them on NVM (non-volatile memory) or DRAM (dynamic random access memory); an efficient index establishing module, which builds an index over the placed data; and an Embedding operation running module, which uses the established index to rapidly locate the Embedding data involved in a request and executes the normal Embedding operation. The method exploits the hot/cold access characteristics and the packing (co-occurrence) characteristics of deep learning Embedding data to arrange data across the heterogeneous memory devices, and establishes a lightweight index to serve task requests efficiently. The system thereby makes maximal use of both DRAM and NVM space and improves the processing efficiency of Embedding data.

Description

Deep learning Embedding data efficient processing system and method for heterogeneous memory device
Technical Field
The invention relates to the deep learning field of computer science, and in particular to an efficient processing system for deep learning Embedding data on heterogeneous memory devices.
Background
Advances in deep learning have greatly promoted the development of computer vision, natural language processing, medicine and other fields, and have attracted extensive attention in both academia and industry. In deep learning, Embedding is widely used to represent data owing to its powerful representation capability. At present, the volume of Embedding data is large and requires substantial storage space; from the standpoints of both cost and device capacity, existing memory cannot meet application requirements. With continuing innovation in memory technology, non-volatile memory (NVM) offers large capacity, low price, byte addressability and high read/write bandwidth (relative to disk), providing a new solution to the capacity problem. However, NVM has poor random-read performance, and access to large volumes of irregular, random Embedding data drastically reduces the operating efficiency of an NVM-based system (a DRAM + NVM heterogeneous memory system). Designing an efficient processing system for deep learning Embedding data on heterogeneous memory devices, so as to improve application efficiency, therefore becomes important.
Disclosure of Invention
In order to solve the performance degradation caused by access to large volumes of irregular, random Embedding data, the invention provides an efficient processing system for deep learning Embedding data on heterogeneous memory devices.
The invention comprises two key techniques:
1. Intelligently placing the Embedding data.
A deep learning data set is divided into a training set and a test set: the training set is known data used for model training, and the test set is unknown data used for model inference. Because the training set and the test set exhibit similar data access patterns, data analysis is performed on the training set and the effect is verified on the test set.
Because Embedding data exhibit a pronounced hot/cold distribution, the data are first sorted by access frequency and the top h% are selected as hot data (h% can be customized by the user).
According to the frequent items of deep learning Embedding data (several Embedding entries are often accessed together), the frequent item sets are pre-summed and the resulting pre-summed data are stored.
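As a minimal illustration of the classification step (the function and variable names below are assumptions, not from the patent), the frequency sort and h% split might look as follows:

```python
from collections import Counter

def split_hot_cold(access_log, h_percent=5):
    """Sort Embedding IDs by access frequency in descending order and
    take the top h% as hot data; the remainder is cold data.
    access_log is an iterable of Embedding IDs seen in the training set."""
    freq = Counter(access_log)
    ranked = [eid for eid, _ in freq.most_common()]   # descending frequency
    # The rank position doubles as the sorting ID of each Embedding.
    sorting_id = {eid: rank for rank, eid in enumerate(ranked)}
    n_hot = max(1, len(ranked) * h_percent // 100)
    return sorting_id, set(ranked[:n_hot]), set(ranked[n_hot:])
```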
2. Establishing an efficient index over the placed data.
After the different classes of Embedding data are placed, the system needs to establish a lightweight index (with little memory-space overhead) over the data so that fast lookup is possible when a request arrives. Specifically:
The sorting ID serves as the index of a single datum; the stored pre-summed data are indexed by the sorting-ID set of the corresponding Embeddings, i.e., the closed frequent item set; meanwhile, the packing relation of each Embedding within the closed frequent items is expressed following the idea of an adjacency matrix.
Specifically, the technical scheme adopted by the invention is as follows:
a deep learning Embedding data efficient processing method for heterogeneous memory devices comprises the following steps:
(1) Sort the Embedding data in the training set in descending order of access frequency, assign sorting IDs, and divide the sorted data into hot data and cold data. All hot data are stored in DRAM and all cold data in NVM. Meanwhile, closed frequent item mining is performed on the hot data to obtain closed frequent item sets; finally, the pre-summed data of closed frequent item sets whose packing count (the number of Embeddings packed together) is greater than 3 are stored in NVM, and those whose packing count is 3 or fewer are stored in DRAM.
(2) Express the packing relation of each Embedding within the closed frequent item sets following the idea of an adjacency matrix, each Embedding pointing to the Embeddings connected to it in the closed frequent item set.
(3) Process requests: sort the Embedding data in the request in ascending order of sorting ID and determine whether each Embedding datum in the request belongs to hot data or cold data, wherein:
For the hot data in the request, judge one by one, according to the packing relations established in step (2) and the sorting result, whether packing relations exist among the Embeddings. If consecutive Embeddings have a packing relation and the packing count is greater than 3, query the NVM with the sorting IDs of the consecutive Embeddings to obtain the corresponding pre-summed data; if the packing count is 3 or fewer, query the DRAM with those sorting IDs instead. If no packing relation exists, query the DRAM directly for the corresponding Embedding data, until the query completes.
For the cold data in the request, query the NVM directly according to the sorting IDs to obtain the corresponding Embedding data, until the query completes.
Further, in step (1), the front portion of the cold data is split off as warm data; all Embedding data among the warm data are packed with a superset according to sorting ID and stored in NVM, recording the largest packing granularity M of the current superset and the sorting ID I of the corresponding largest packed Embedding. When warm data in a request are processed, Embedding data whose sorting ID is smaller than I are queried as pre-summed data at granularity M, and data whose sorting ID is larger than I are queried at granularity M-1.
Further, the original data are divided into the first h% as hot data, the middle w% as warm data and the last c% as cold data (h, w and c can be customized); preferably, hot data account for 5%, warm data for 5%-20%, and the remainder is cold data.
Further, step (2) also includes sorting the closed frequent item sets and reassigning the hot-data sorting IDs according to them, specifically as follows:
Sort the closed frequent item sets in descending order of the number of Embedding data they contain; among sets containing the same number of Embedding data, sort in descending order of occurrence probability.
Reassign the hot-data sorting IDs according to the order in which the Embeddings appear in the sorted closed frequent item sets.
Further, in step (2), the packing relation of each Embedding within the closed frequent item sets is expressed following the idea of an adjacency matrix, each Embedding pointing to the Embeddings connected to it in the closed frequent item, specifically:
A bitmap represents the packing relations of each Embedding datum, a packing relation being recorded as 1 and its absence as 0; each Embedding's bitmap records only its packing relations with the Embedding data whose sorting IDs are larger than its own.
Further, in step (3), if the NVM or DRAM is queried with the sorting IDs of consecutive Embeddings and no corresponding pre-summed data are found, the largest sorting ID among them is removed, one at a time, and the query retried until pre-summed data are obtained.
The deep learning Embedding data efficient processing system for heterogeneous memory devices based on the above method comprises:
An Embedding data placement module, which sorts the Embedding data in the training set in descending order of access frequency, assigns sorting IDs, and divides the sorted data into hot data and cold data, storing all hot data in DRAM and all cold data in NVM; meanwhile, it performs closed frequent item mining on the hot data to obtain closed frequent item sets, storing the pre-summed data of sets whose packing count is greater than 3 in NVM and those whose packing count is 3 or fewer in DRAM.
An efficient index establishing module, which expresses the packing relation of each Embedding within the closed frequent item sets following the idea of an adjacency matrix, each Embedding pointing to the Embeddings connected to it.
An Embedding operation running module, which uses the established index and packing relations to rapidly locate the Embedding data involved in a request and executes the normal Embedding operation.
The beneficial effects of the invention are as follows: the invention provides an efficient processing system for deep learning Embedding data on heterogeneous memory devices, designing an intelligent data placement scheme for the heterogeneous memory devices and establishing a lightweight index to serve task requests efficiently.
Drawings
FIG. 1 is a block diagram of a system framework;
FIG. 2 is a schematic diagram of an Embedding operation;
FIG. 3 is a CDF diagram of Embedding data access frequency;
FIG. 4 is a schematic diagram of closed frequent item data (pre-summed data) acquisition;
FIG. 5 is a schematic diagram of Embedding data type division;
FIG. 6 is a schematic diagram comparing NVM single access to DRAM multiple access latency;
FIG. 7 is a schematic diagram of Embedding data placement;
FIG. 8 is a schematic diagram of sorting-ID reallocation for Embedding data;
FIG. 9 is an index diagram;
FIG. 10 is a schematic diagram of request processing.
Detailed Description
The invention is further illustrated below with reference to specific embodiments and the accompanying drawings:
The invention provides an efficient processing method for deep learning Embedding data on heterogeneous memory devices: data are placed across the heterogeneous memory devices by exploiting the hot/cold characteristics and the packing (co-occurrence) characteristics of deep learning Embedding data; a lightweight, efficient index structure is established; and an Embedding processing flow is designed to improve the operating efficiency on Embedding data.
Based on this method, the invention also provides an efficient processing system for deep learning Embedding data on heterogeneous memory devices. The system framework is shown in FIG. 1 and comprises three modules: the Embedding data placement module classifies the Embedding data and pre-summed data and places them on NVM or DRAM; the efficient index establishing module builds an index over the placed data; and the Embedding operation running module uses the established index to rapidly locate the Embedding data involved in a request and executes the normal Embedding operation. The Embedding operation is shown in FIG. 2: the system randomly reads several Embedding data from the storage device and sums them.
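For concreteness, the Embedding operation of FIG. 2 reduces to gathering several vectors and summing them element-wise; a minimal NumPy sketch (illustrative only, not the patent's implementation):

```python
import numpy as np

def embedding_op(table, ids):
    """Randomly read the requested Embedding vectors from the storage
    device (here a NumPy array stands in for DRAM/NVM) and sum them."""
    return np.sum(table[ids], axis=0)

# Toy usage: 10 Embedding vectors of dimension 4, request {1, 2, 6, 9}.
table = np.random.rand(10, 4).astype(np.float32)
result = embedding_op(table, [1, 2, 6, 9])
```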
The processing flow of the invention is as follows:
(1) Data placement:
As shown in FIG. 3, much Embedding data exhibits a hot/cold characteristic (a small portion of the data is accessed frequently), and these hot data frequently appear together (frequent items). It is therefore possible to perform hot/cold analysis on the training Embedding data, place the data appropriately across DRAM and NVM, and at the same time extract the frequent-item data and pre-sum them before storage, specifically as follows:
Sort the Embedding data in the training set in descending order of access frequency, assign sorting IDs, and divide the sorted data into hot data and cold data. All hot data are stored in DRAM and all cold data in NVM.
Meanwhile, closed frequent item mining is performed on the hot data; finally, the pre-summed data of closed frequent item sets whose packing count is greater than 3 are stored in NVM, and those whose packing count is 3 or fewer are stored in DRAM.
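A sketch of this placement rule (plain dicts stand in for the DRAM and NVM devices; the names are assumptions, and the threshold of 3 follows the latency observation of FIG. 6 below):

```python
def place_data(vectors, hot, cold, presums, dram, nvm, pack_threshold=3):
    """Store hot Embeddings in DRAM, cold Embeddings in NVM, and route each
    pre-summed closed frequent item set by its packing count: sets packing
    more than pack_threshold Embeddings go to NVM, the rest to DRAM.
    presums maps a frozenset of sorting IDs to its pre-summed vector."""
    for eid in hot:
        dram[eid] = vectors[eid]
    for eid in cold:
        nvm[eid] = vectors[eid]
    for item_set, presum_vec in presums.items():
        target = nvm if len(item_set) > pack_threshold else dram
        target[frozenset(item_set)] = presum_vec
```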
(2) Index establishment over the placed data: the packing relation of each Embedding within the closed frequent items is expressed following the idea of an adjacency matrix, each Embedding pointing to the Embeddings connected to it in the closed frequent item; an index of each closed frequent item is then established according to the Embedding data sorting IDs.
(3) When an inference task arrives, Embedding data are looked up using the index and the Embedding operation is performed.
As a preferred scheme, step (1) can be divided into the following sub-steps:
Step one: as shown in FIG. 4, the system first sorts the data in descending order of access frequency and divides hot data from cold data by a user-selected threshold.
Step two: cold data are removed from the requests.
Step three: the requests containing only hot data are fed into a closed frequent item mining algorithm (the MBEA or MMBEA algorithm).
Step four: several closed frequent item sets are obtained; they are first sorted in descending order of the number of data in each frequent item, and closed frequent items containing the same number of data are sorted in descending order of occurrence probability.
Further, part of the cold data may additionally be selected as warm data according to a user-set threshold, as shown in FIG. 5. To make full use of DRAM space, hot data are stored in DRAM while warm data and cold data are stored in NVM. The pre-summed data of the closed frequent item sets are placed based on the following observation:
As can be seen from FIG. 6, the latency of one NVM access is slightly greater than that of 3 DRAM accesses. To keep the overall performance of the Embedding operation with NVM at least equal to that of an all-DRAM device, only pre-summed data packing more than 3 Embeddings are placed in NVM, while pre-summed data packing 3 or fewer are placed in DRAM. The cold data are stored entirely on the slow NVM device. Since warm data may also benefit from pre-summing, they are fully packed with a superset (that is, the data are pre-summed at granularities from 2 up to n until the user-set storage threshold is reached, recording the largest packing granularity M of the current superset and the sorting ID I of the corresponding largest packed Embedding). In addition, the sorting-ID ranges of the hot, warm and cold data are recorded to facilitate subsequent lookup. The final data placement is shown in FIG. 7.
As a preferred scheme, the index establishment of step (2) proceeds as follows:
To facilitate index creation, the hot-data sorting IDs are reassigned to the Embedding data in their order of appearance in the sorted frequent items, as shown in FIG. 8.
An index is then established from the closed frequent items on the following basis: the packing relation of each Embedding within a closed frequent item is expressed following the idea of an adjacency matrix, each Embedding pointing to the Embeddings connected to it. Take the Embedding-1 data in FIG. 8 as an example: the data connected to Embedding-1 are Embedding-2, Embedding-3 and Embedding-8, so the index must record these connection relations. Furthermore, for a packing of multiple data, e.g. {1, 2, 6, 9}, 1 must be connected to 2, 2 to 6, and 6 to 9. Preferably, to further reduce the index space overhead, the relations are represented with a bit-granularity bitmap, as shown in FIG. 9, where [1] denotes the sorting ID of the corresponding Embedding-1; the connection relations between Embedding-1 and the Embedding data with IDs greater than 1 are then recorded in sequence, a bit value of 1 denoting connection and 0 denoting no connection.
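A minimal sketch of such a bitmap index (Python integers serve as the bitmaps; bit j of an Embedding's bitmap marking a connection to the Embedding j+1 positions after it is one way to realize FIG. 9):

```python
def build_bitmap_index(closed_sets, n_hot):
    """Bit j of bitmaps[i] is 1 iff Embedding i is connected, within some
    closed frequent item set, to Embedding i + 1 + j. Only IDs larger
    than i are recorded, halving the adjacency matrix as described."""
    bitmaps = [0] * n_hot
    for item_set in closed_sets:
        ids = sorted(item_set)
        for a, b in zip(ids, ids[1:]):      # e.g. {1,2,6,9}: 1-2, 2-6, 6-9
            bitmaps[a] |= 1 << (b - a - 1)
    return bitmaps

def connected(bitmaps, a, b):
    """True iff Embedding a points to Embedding b (requires a < b)."""
    return bool((bitmaps[a] >> (b - a - 1)) & 1)
```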
Finally, the packed data are pre-summed and stored in a hash table; at storage time the pre-summed data are indexed by the sorting-ID set of the corresponding Embeddings, i.e., the closed frequent item set. For example:
{1, 3, 5} is a closed frequent item set; E is summed in advance (E = Embedding-1 + Embedding-3 + Embedding-5), and E is then indexed under the hash H("135").
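A sketch of this hash store (a Python dict is itself a hash table; the concatenated-ID key mirrors the H("135") example, though a real implementation would need a delimiter once sorting IDs exceed one digit):

```python
presum_store = {}                 # hash table: key -> pre-summed vector

def store_presum(item_ids, vectors):
    """Pre-sum the Embeddings of one closed frequent item set and store
    the result under the concatenation of its ascending sorting IDs."""
    key = "".join(str(i) for i in sorted(item_ids))   # (1, 3, 5) -> "135"
    presum_store[key] = sum(vectors[i] for i in item_ids)
    return key
```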
As a preferred scheme, in step (3), when an inference task arrives, request processing proceeds as follows:
The Embeddings involved in the new request are mapped to their new sorting IDs, and the sorting IDs are sorted in ascending order.
As shown in FIG. 10, when a new request arrives, each ID in the request is known to belong to hot, warm or cold data. The lookup proceeds as follows:
For hot data, judge one by one, according to the established packing relations and the sorting result, whether packing relations exist among the Embeddings belonging to hot data. If consecutive Embeddings have a packing relation and the packing count is greater than 3, query the NVM with the sorting IDs of the consecutive Embeddings to obtain the corresponding pre-summed data; if the packing count is 3 or fewer, query the DRAM instead. If no packing relation exists, query the DRAM directly for the corresponding Embedding data, until the query completes.
Taking the request of FIG. 10 as an example: for 1, check whether it is connected to 2, then 2 to 6, and 6 to 9. The lookup finds that 9 is not connected to 10, so the previously connected data 1269 are hashed by their sorting IDs; since the packing count is greater than 3 (counts of 3 or fewer are looked up in DRAM), the lookup goes to NVM. The remaining hot data are looked up in the same manner. If, however, the specified data are not found on the device after hashing (a rare corner case), one Embedding is rolled back and the hash is looked up again; for example, if the hash value of 1234 is not found, roll back to the hash value of 123.
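The hot-data lookup with rollback might be sketched as follows (reusing the bitmap test and key scheme from the sketches above; for brevity the Embeddings dropped by a rollback are not re-queried here, which a full implementation would have to do):

```python
def lookup_hot(req_ids, bitmaps, dram, nvm, pack_threshold=3):
    """Walk the request's hot IDs (ascending), greedily growing a chain
    while consecutive IDs are connected in the bitmap index, then fetch
    the chain's pre-summed data: from NVM if more than pack_threshold
    Embeddings are packed, else from DRAM, rolling back the largest ID
    on a miss (the 1234 -> 123 case)."""
    def connected(a, b):
        return bool((bitmaps[a] >> (b - a - 1)) & 1)

    def fetch(chain):
        while len(chain) > 1:
            key = "".join(map(str, chain))            # H("1269")-style key
            device = nvm if len(chain) > pack_threshold else dram
            if key in device:
                return device[key]
            chain = chain[:-1]                        # roll back largest ID
        return dram[chain[0]]                         # plain hot-data read

    results, chain = [], [req_ids[0]]
    for prev, cur in zip(req_ids, req_ids[1:]):
        if connected(prev, cur):
            chain.append(cur)
        else:
            results.append(fetch(chain))
            chain = [cur]
    results.append(fetch(chain))
    return results
```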
For warm data, since in this scheme all warm data are packed with a superset, when the largest packing granularity M of the stored current superset is 2, i.e., pre-sums of 2 are stored, 1522, 2833 and 39 are looked up sequentially in pairs from the NVM.
For cold data, the NVM is queried directly, one datum at a time according to the sorting IDs, until the query completes.
The following is a specific example to further illustrate the beneficial effects of the present invention:
the specific experiment is as follows:
experimental configuration:
(1) Operating system: Ubuntu 18.04.3 LTS;
(2) CPU: 8-core Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz, equipped with 32GB DRAM;
(3) Storage devices: 512GB SK hynix SC311 SATA SSD; Western Digital WDC WD40EZRZ-75G HDD; Intel Optane NVM 256GB;
the final test results are:
comparison scheme: using a random scheme to store the Embedding data on the DRAM and the NVM without distinction; the scheme of the invention is as follows: dividing data according to cold and hot characteristics and using packaging storage; compared with a comparison scheme, the total access performance of the method disclosed by the invention on the common data set MovieLens is improved by 1.5 times.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments exhaustively here, and obvious variations or modifications derived therefrom remain within the protection scope of the invention.

Claims (7)

1. A deep learning Embedding data efficient processing method for heterogeneous memory devices, characterized by comprising the following steps:
(1) Sort the Embedding data in the training set in descending order of access frequency, assign sorting IDs, and divide the sorted data into hot data and cold data. All hot data are stored in DRAM and all cold data in NVM. Meanwhile, closed frequent item mining is performed on the hot data to obtain closed frequent item sets; finally, the pre-summed data of closed frequent item sets whose packing count is greater than 3 are stored in NVM, and those whose packing count is 3 or fewer are stored in DRAM.
(2) Express the packing relation of each Embedding within the closed frequent items following the idea of an adjacency matrix, each Embedding pointing to the Embeddings connected to it in the closed frequent item.
(3) Process requests: sort the Embedding data in the request in ascending order of sorting ID and determine whether each Embedding datum in the request belongs to hot data or cold data, wherein:
For the hot data in the request, judge one by one, according to the packing relations established in step (2) and the sorting result, whether packing relations exist among the Embeddings. If consecutive Embeddings have a packing relation and the packing count is greater than 3, query the NVM with the sorting IDs of the consecutive Embeddings to obtain the corresponding pre-summed data; if the packing count is 3 or fewer, query the DRAM instead. If no packing relation exists, query the DRAM directly for the corresponding Embedding data, until the query completes.
For the cold data in the request, query the NVM directly according to the sorting IDs to obtain the corresponding Embedding data, until the query completes.
2. The method according to claim 1, wherein in step (1) the front portion of the cold data is split off as warm data; all Embedding data among the warm data are packed with a superset according to sorting ID and stored in NVM, recording the largest packing granularity M of the current superset and the sorting ID I of the corresponding largest packed Embedding; when warm data in a request are processed, Embedding data whose sorting ID is smaller than I are queried as pre-summed data at granularity M, and data whose sorting ID is larger than I are queried at granularity M-1.
3. The method of claim 2, wherein hot data account for 5%, warm data for 5% to 20%, and the remainder is cold data.
4. The method according to claim 1, wherein step (2) further comprises sorting the closed frequent item sets and reassigning the hot-data sorting IDs according to them, as follows:
Sort the closed frequent item sets in descending order of the number of Embedding data they contain; among sets containing the same number of Embedding data, sort in descending order of occurrence probability.
Reassign the hot-data sorting IDs according to the order in which the Embeddings appear in the sorted closed frequent item sets.
5. The method according to claim 4, wherein in step (2) the packing relation of each Embedding within the closed frequent item sets is expressed following the idea of an adjacency matrix, each Embedding pointing to the Embeddings connected to it in the closed frequent item, specifically:
A bitmap represents the packing relations of each Embedding datum, a packing relation being recorded as 1 and its absence as 0; each Embedding's bitmap records only its packing relations with the Embedding data whose sorting IDs are larger than its own.
6. The method according to claim 1, wherein in step (3), if the NVM or DRAM is queried with the sorting IDs of consecutive Embeddings and no corresponding pre-summed data are found, the largest sorting ID among them is removed, one at a time, and the query retried until pre-summed data are obtained.
7. A deep learning Embedding data efficient processing system for heterogeneous memory devices based on the method of any one of claims 1 to 6, comprising:
An Embedding data placement module, which sorts the Embedding data in the training set in descending order of access frequency, assigns sorting IDs, and divides the sorted data into hot data and cold data, storing all hot data in DRAM and all cold data in NVM; meanwhile, it performs closed frequent item mining on the hot data to obtain closed frequent item sets, storing the pre-summed data of sets whose packing count is greater than 3 in NVM and those whose packing count is 3 or fewer in DRAM.
An efficient index establishing module, which expresses the packing relation of each Embedding within the closed frequent items following the idea of an adjacency matrix, each Embedding pointing to the Embeddings connected to it in the closed frequent items.
An Embedding operation running module, which uses the established index and packing relations to rapidly locate the Embedding data involved in a request and executes the normal Embedding operation.
CN202111547323.9A 2021-12-16 2021-12-16 Deep learning Embedding data efficient processing system and method for heterogeneous memory device Pending CN114266302A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111547323.9A CN114266302A (en) 2021-12-16 2021-12-16 Deep learning Embedding data efficient processing system and method for heterogeneous memory device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111547323.9A CN114266302A (en) 2021-12-16 2021-12-16 Deep learning Embedding data efficient processing system and method for heterogeneous memory device

Publications (1)

Publication Number Publication Date
CN114266302A true CN114266302A (en) 2022-04-01

Family

ID=80827644

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111547323.9A Pending CN114266302A (en) 2021-12-16 2021-12-16 Deep learning Embedding data efficient processing system and method for heterogeneous memory device

Country Status (1)

Country Link
CN (1) CN114266302A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116700995A (en) * 2023-08-03 2023-09-05 浪潮电子信息产业股份有限公司 Concurrent access method, device, equipment and storage medium for heterogeneous memory pool
CN116700995B (en) * 2023-08-03 2023-11-03 浪潮电子信息产业股份有限公司 Concurrent access method, device, equipment and storage medium for heterogeneous memory pool

Similar Documents

Publication Publication Date Title
CN104021161B (en) A kind of clustering storage method and device
WO2013152678A1 (en) Method and device for metadata query
CN102622434B (en) Data storage method, data searching method and device
CN106874348B (en) File storage and index method and device and file reading method
US20040109376A1 (en) Method for detecting logical address of flash memory
WO2020057272A1 (en) Index data storage and retrieval methods and apparatuses, and storage medium
CN107766529B (en) Mass data storage method for sewage treatment industry
CN103473276B (en) Ultra-large type date storage method, distributed data base system and its search method
KR101656750B1 (en) Method and apparatus for archiving and searching database with index information
CN112882663B (en) Random writing method, electronic equipment and storage medium
JP2019512125A (en) Database archiving method and apparatus, archived database search method and apparatus
US20180210907A1 (en) Data management system, data management method, and computer program product
US20110179013A1 (en) Search Log Online Analytic Processing
CN110795363A (en) Hot page prediction method and page scheduling method for storage medium
CN114266302A (en) Deep learning Embedding data efficient processing system and method for heterogeneous memory device
US9627065B2 (en) Memory equipped with information retrieval function, method for using same, device, and information processing method
CN103902693B (en) A kind of method of the memory database T tree index structures for reading optimization
CN111625600B (en) Data storage processing method, system, computer equipment and storage medium
CN110990340B (en) Big data multi-level storage architecture
CN112434085A (en) Roaring Bitmap-based user data statistical method
US7672925B2 (en) Accelerating queries using temporary enumeration representation
Nie et al. Efficient storage support for real-time near-duplicate video retrieval
US9305080B2 (en) Accelerating queries using delayed value projection of enumerated storage
CN110334073A (en) A kind of metadata forecasting method, device, terminal, server and storage medium
CN110297836B (en) User label storage method and retrieval method based on compressed bitmap mode

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination