CN117873394A

CN117873394A - Data compression method, device, electronic equipment and readable storage medium

Info

Publication number: CN117873394A
Application number: CN202410056860.0A
Authority: CN
Inventors: 陈新海; 刘杰; 颜君峻; 龚春叶; 杨博; 王庆林; 张庆阳; 李胜国; 甘新标; 陈旭光; 肖调杰
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2024-01-15
Filing date: 2024-01-15
Publication date: 2024-04-12

Abstract

The application discloses a data compression method, a data compression device, electronic equipment and a readable storage medium, which are applied to the technical field of storage. The method comprises the steps of obtaining electronic data to be compressed, wherein the electronic data are represented by a sparse matrix; inserting the value of each non-zero element of the electronic data to be compressed into a hash table based on the total number of non-zero elements contained in the electronic data to be compressed; storing the values of the non-repeated elements of the hash table into a storage data structure, and obtaining a hash key value table for identifying the storage corresponding relation between the hash table and the storage data structure; and establishing corresponding relations among all non-zero elements of the current compressed storage format, all elements of the hash table and all elements of the storage data structure corresponding to the electronic data to be compressed to obtain index information, so as to convert the compressed storage format of the electronic data to be compressed into a format of compressing repeated elements into single elements and index coordinates. The method and the device can solve the problem of low access efficiency of the related technology, and effectively improve the access efficiency of the electronic data expressed in the sparse matrix format.

Description

Data compression method, device, electronic equipment and readable storage medium

Technical Field

The present disclosure relates to the field of storage technologies, and in particular, to a data compression method, a data compression device, an electronic device, and a readable storage medium.

Background

With the rapid development of cloud computing and artificial intelligence technology, more and more electronic data is generated by electronic technical means such as computer applications, communication, and information technology, including but not limited to text, image, audio, and video data. Such electronic data needs to be stored in digitized form.

For electronic data represented by a sparse matrix, such as image data, the related art generally stores the data in a row-compressed storage format. In this storage mode, however, a large number of repeated elements must be accessed in memory: memory access efficiency is low, and frequent memory accesses increase access overhead and waste memory bandwidth.

In view of this, improving access efficiency is a technical problem that one skilled in the art needs to solve.

Disclosure of Invention

The application provides a data compression method, a data compression device, electronic equipment and a readable storage medium, which can effectively improve the memory access efficiency of electronic data represented in a sparse matrix format.

In order to solve the technical problems, the application provides the following technical scheme:

in one aspect, the present application provides a data compression method, including:

acquiring electronic data to be compressed, which is represented by a sparse matrix;

inserting the value of each non-zero element of the electronic data to be compressed into a pre-constructed hash table based on the total number of non-zero elements contained in the electronic data to be compressed;

storing the values of the non-repeated elements of the hash table into a pre-constructed storage data structure, and obtaining a hash key value pair table for marking the storage corresponding relation between the hash table and the storage data structure so as to store all the values of the non-repeated non-zero elements in the electronic data to be compressed by using the storage data structure;

establishing a corresponding relation among all non-zero elements of a current compression storage format corresponding to the electronic data to be compressed, all elements of the hash table and all elements in the storage data structure to obtain index information so as to convert the electronic data to be compressed from the current compression storage format to a target value compression storage format;

the length of the storage data structure is determined according to the number of elements of the hash table; the target value compression storage format is a format that compresses repeated elements into a single element plus index coordinates.

Illustratively, the inserting the value of each non-zero element of the electronic data to be compressed into a pre-constructed hash table based on the total number of non-zero elements contained in the electronic data to be compressed includes:

a hash table is built in advance, and an original value array corresponding to the current compression storage format of the electronic data to be compressed is obtained;

initializing the hash table and the original value array;

inserting the values of the non-zero elements in the original value array into the hash table to obtain a hash key value table, so that all the non-zero elements of the electronic data to be compressed are inserted into the hash table;

and determining the number of all non-repeated non-zero elements of the electronic data to be compressed according to the total number of elements of the hash key value table.

Illustratively, storing the value of the non-repeated element of the hash table in a pre-constructed storage data structure, and obtaining a hash key value pair table for identifying the storage correspondence between the hash table and the storage data structure, including:

pre-constructing a storage data structure; the storage data structure is used for storing all non-repeated non-zero element values in the electronic data to be compressed, and the storage data structure is expressed in an array format; the array length of the storage data structure is the same as the number of elements of the hash key value table;

initializing the storage data structure and a subscript of the storage data structure;

initializing a pre-constructed hash table position pointer;

for the values of the non-repeated elements of the hash key value table, putting the target key value pointed by the current hash table position pointer into the target position of the storage data structure, and storing the identification value of the target position into the target key value;

and when the key value of each non-repeated element in the hash key value table comprises the identification value of its corresponding position in the storage data structure, taking the current hash key value table as the hash key value pair table.

For example, the establishing a correspondence between each non-zero element of the current compressed storage format corresponding to the electronic data to be compressed, each element of the hash table, and each element in the storage data structure to obtain index information includes:

initializing the subscript value of a non-zero element of an original value array corresponding to the current compressed storage format;

pre-constructing index information, and representing the index information by adopting an index array; the number of elements contained in the index array is the total number of non-zero elements contained in the electronic data to be compressed;

for each non-zero element of the original value array, determining a storage position of the current non-zero element corresponding to the storage data structure from the hash key value table, and storing the storage position to a target index position of the index array;

and generating index information after the index array stores the storage positions of the storage data structures corresponding to the non-zero elements of the original value array.

Illustratively, after the electronic data to be compressed is converted from the current compressed storage format to the target compressed storage format, the method further includes:

receiving a target task to be executed; the target task to be executed comprises a subtask for performing sparse matrix vector multiplication operation by utilizing the electronic data to be compressed and target electronic data; the target electronic data is represented by a dense vector;

pre-constructing an intermediate result storage structure; the intermediate result storage structure is used for storing the calculation results of each row of elements of the electronic data to be compressed and the corresponding elements of the target electronic data;

reading each non-zero element from the electronic data to be compressed to calculate in the process of starting to execute the subtasks;

for each read non-zero element, according to the sequence value of the current non-zero element in the electronic data to be compressed, respectively obtaining a column position coordinate value and an index value corresponding to the current non-zero element by retrieving the electronic data to be compressed and the index information;

selecting corresponding first and second operation elements from the target electronic data and the stored data structure by using the column position coordinate value and the index value as indexes, respectively;

and storing the calculation results of the first operation element and the second operation element to the corresponding positions of the intermediate result storage structure.

Illustratively, the obtaining the column position coordinate value and the index value corresponding to the target non-zero element by retrieving the electronic data to be compressed and the index information respectively includes:

representing the position of the non-zero element of the electronic data to be compressed in advance by adopting a one-dimensional array format to obtain a column position array;

determining the coordinate value of the current non-zero element corresponding to the column position array based on the sequence value of the current non-zero element in the electronic data to be compressed, and taking the coordinate value as the column position coordinate value;

determining a current position identification value of the current non-zero element corresponding to the index information based on the sequence value of the current non-zero element in the electronic data to be compressed;

and determining the position of the current non-zero element corresponding to the stored data structure according to the current position identification value to serve as the index value.
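The multiplication subtask described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the function name, the array names csr_cols (column position array) and csrv_idx (index array), and the sample data are all hypothetical, and the intermediate result storage structure is modeled as one product per non-zero element that is then summed per row.

```python
def spmv_value_compressed(csr_rows_ptr, csr_cols, csrv_idx, csrv_vals, x):
    """Multiply a value-compressed sparse matrix by a dense vector x."""
    num_rows = len(csr_rows_ptr) - 1
    nnz = len(csr_cols)
    # Assumed shape of the intermediate result storage structure:
    # one slot per non-zero element.
    products = [0.0] * nnz
    for k in range(nnz):
        col = csr_cols[k]        # column position coordinate value
        idx = csrv_idx[k]        # index value into the deduplicated value array
        first = x[col]           # first operation element (dense vector)
        second = csrv_vals[idx]  # second operation element (stored value)
        products[k] = first * second
    # Reduce the per-element products row by row to obtain the result vector.
    y = []
    for row in range(num_rows):
        start, stop = csr_rows_ptr[row], csr_rows_ptr[row + 1]
        y.append(sum(products[start:stop], 0.0))
    return y


# Hypothetical 4x4 matrix with two distinct non-zero values (2.0 and 5.0).
csr_rows_ptr = [0, 4, 5, 5, 7]
csr_cols = [0, 1, 2, 3, 1, 0, 3]
csrv_idx = [0, 1, 0, 0, 1, 0, 1]
csrv_vals = [2.0, 5.0]
x = [1.0, 2.0, 3.0, 4.0]
y = spmv_value_compressed(csr_rows_ptr, csr_cols, csrv_idx, csrv_vals, x)
```

Compared with a plain CSR multiplication, the only change in the inner loop is the extra indirection through csrv_idx; the value array that is actually read holds only the distinct values.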

Another aspect of the present application provides a data compression apparatus, including:

the data acquisition module is used for acquiring the electronic data to be compressed, which is represented by the sparse matrix;

the table storage module is used for inserting the value of each non-zero element of the electronic data to be compressed into a pre-constructed hash table based on the total number of non-zero elements contained in the electronic data to be compressed;

the data storage module is used for storing the values of the non-repeated elements of the hash table into a pre-constructed storage data structure, and obtaining a hash key value table for marking the storage corresponding relation between the hash table and the storage data structure so as to store all the values of the non-repeated non-zero elements in the electronic data to be compressed by utilizing the storage data structure; the length of the storage data structure is determined according to the number of elements of the hash table;

the corresponding relation construction module is used for establishing corresponding relation among all non-zero elements of the current compression storage format corresponding to the electronic data to be compressed, all elements of the hash table and all elements in the storage data structure to obtain index information so as to convert the electronic data to be compressed from the current compression storage format to a target value compression storage format; the target value compression storage format is a format that compresses repeated elements into a single element plus index coordinates.

The application also provides an electronic device comprising a processor, the processor being configured to implement the steps of the data compression method described in any of the foregoing when executing a computer program stored in a memory.

Finally, the present application provides a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the data compression method described in any of the foregoing.

The technical scheme provided by the application has the following advantages. The hash table is used to remove the repeated elements of the electronic data to be compressed, and all non-repeated elements are stored in a new storage data structure. By traversing the hash table, the key of each entry holds the value of a non-repeated element, and the value of each entry is set to the subscript position of that element in the storage data structure. Finally, index information is obtained by constructing the correspondence among the current compressed storage format of the electronic data to be compressed, the elements of the hash table, and the elements of the storage data structure, so that the electronic data to be compressed is converted from the original compressed storage format into the new target value compression storage format. The whole process is simple, can be applied in practical scenarios, and is highly practical. Repeated elements in the value array corresponding to the electronic data to be compressed are removed during compression, so redundantly stored repeated element values are fully compressed: every repeated element is stored in the value array only once, and the required storage space is significantly reduced. This improves data locality, makes full use of the cache structure of the central processor, and reduces the memory bandwidth wasted on accessing repeated elements. Therefore, when the electronic data to be compressed is read during task execution, the data of the value array can be accessed from the cache more efficiently, which effectively improves memory access efficiency and the execution efficiency of related tasks.

In addition, the application also provides a corresponding implementation device, electronic equipment and a readable storage medium for the data compression method, so that the method is more practical, and the device, the electronic equipment and the readable storage medium have corresponding advantages.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

For a clearer description of the technical solutions of the present application or of the related art, the drawings that are required to be used in the description of the embodiments or of the related art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.

FIG. 1 is a schematic diagram of a sparse matrix vector multiplication framework of an exemplary application scenario provided herein;

FIG. 2 is a schematic diagram of a sparse matrix stored in COO format;

FIG. 3 is a schematic diagram of a sparse matrix stored in CSR format;

FIG. 4 is a schematic flow chart of a data compression method provided in the present application;

FIG. 5 is a schematic diagram of a hash key table of an exemplary application scenario provided herein;

FIG. 6 is a schematic diagram of a hash key value pair table of an exemplary application scenario provided herein;

FIG. 7 is a schematic diagram of a correspondence between a value array and an index array of an exemplary application scenario provided herein;

FIG. 8 is a schematic diagram of a target value compression storage format of an exemplary application scenario provided herein;

FIG. 9 is a schematic diagram of an extraction process of a first operation element of an exemplary application scenario provided in the present application;

FIG. 10 is a schematic diagram of an extraction process of a second operation element of an exemplary application scenario provided in the present application;

FIG. 11 is a schematic diagram of an intermediate result storage structure of an exemplary application scenario provided herein;

FIG. 12 is a schematic diagram of a subtask result storage structure and an intermediate result storage structure of an exemplary application scenario provided herein;

FIG. 13 is a schematic view of a subtask execution result of an exemplary application scenario provided herein;

FIG. 14 is a block diagram of one embodiment of a data compression device provided herein;

FIG. 15 is a block diagram of an embodiment of an electronic device provided in the present application.

Detailed Description

In order that those skilled in the art may better understand the present application, the application is described in further detail below with reference to the drawings and the detailed description. The terms "first," "second," "third," "fourth," and the like in the description, in the claims, and in the above-described figures are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations of the two, are intended to cover a non-exclusive inclusion. The term "exemplary" means "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

More and more electronic data, especially image data, are currently represented by sparse matrices. A sparse matrix is a special kind of matrix that contains a large number of zero elements; in contrast to a dense matrix, the non-zero elements of a sparse matrix generally account for less than 5% of all matrix elements. Sparse matrix vector multiplication is a common core function in scientific computing libraries; its main function is to multiply a sparse matrix by a dense vector, as shown in FIG. 1, where A is the sparse matrix and X and Y are dense vectors.

Because sparse matrices contain a large number of zero elements, storing such matrices in two-dimensional arrays is inefficient. The related-art storage methods for sparse matrices include COO (coordinate storage) and CSR (compressed sparse row storage), as shown in FIG. 2 and FIG. 3. The COO format stores only the non-zero elements, representing each non-zero element with a triplet of the form <row, column, value>. The entire matrix can thus be represented by three arrays: coo_rows (row array), coo_cols (column array) and coo_vals (value array). Building on COO, the CSR format compresses the row array into the number of elements in each row, where csr_rows_ptr is an array of "row pointers" that stores the start and stop positions of the elements in each row. Taking the matrix A in FIG. 1 as an example, csr_rows_ptr[0] = 0 and csr_rows_ptr[1] = 4, so the elements in the range [0, 4) are the non-zero elements of the first row. Although this storage format is more efficient than two-dimensional array storage, for a matrix with a large number of repeated values, a large number of repeated elements must still be accessed, and frequent memory accesses cause high access overhead and waste memory bandwidth. Because the compression capability of the CSR format is limited, redundantly stored repeated element values are not fully compressed, so memory access efficiency is low, and the execution efficiency of scientific computing tasks is seriously affected.
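The relationship between the COO and CSR arrays can be illustrated with a short Python sketch. It is a minimal example under stated assumptions: the 4x4 matrix below is hypothetical (the matrix A of FIG. 1 is not reproduced here), the function names and the column array name csr_cols are invented for illustration, and the COO triplets are assumed to be sorted by row.

```python
def dense_to_coo(matrix):
    """Collect <row, column, value> triplets for every non-zero entry, row by row."""
    coo_rows, coo_cols, coo_vals = [], [], []
    for r, row in enumerate(matrix):
        for c, v in enumerate(row):
            if v != 0:
                coo_rows.append(r)
                coo_cols.append(c)
                coo_vals.append(v)
    return coo_rows, coo_cols, coo_vals


def coo_to_csr(coo_rows, coo_cols, coo_vals, num_rows):
    """Compress the row array into row pointers (triplets must be sorted by row)."""
    csr_rows_ptr = [0] * (num_rows + 1)
    for r in coo_rows:
        csr_rows_ptr[r + 1] += 1                  # count the non-zeros of each row
    for r in range(num_rows):
        csr_rows_ptr[r + 1] += csr_rows_ptr[r]    # prefix sum: start/stop positions
    return csr_rows_ptr, list(coo_cols), list(coo_vals)


# Hypothetical matrix with only two distinct non-zero values (2.0 and 5.0).
A = [
    [2.0, 5.0, 2.0, 2.0],
    [0.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 5.0],
]
coo_rows, coo_cols, coo_vals = dense_to_coo(A)
csr_rows_ptr, csr_cols, csr_vals = coo_to_csr(coo_rows, coo_cols, coo_vals, len(A))
```

In this example the first row occupies positions [0, 4) of csr_vals, and the value array still stores 2.0 four times and 5.0 three times, which is the redundancy the text refers to.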

In view of this, the present application fully utilizes the characteristic of a large number of repeated elements in the sparse matrix, compresses the repeated elements in the matrix into a single element plus index coordinate, further compresses the storage space of such sparse matrix, and can improve the efficiency of the sparse matrix vector multiplication operation. Having described the technical aspects of the present application, various non-limiting embodiments of the present application are described in detail below. Numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to obscure the present application.

Referring first to fig. 4, fig. 4 is a flow chart of a data compression method provided in the present application, and the present application may include the following:

S401: acquiring the electronic data to be compressed, which is represented by the sparse matrix.

In this embodiment, the electronic data to be compressed may be any electronic data such as image data, voice data, text data, etc., and the electronic data to be compressed is represented in the form of a sparse matrix.

S402: based on the total number of non-zero elements contained in the electronic data to be compressed, the value of each non-zero element of the electronic data to be compressed is inserted into a pre-constructed hash table.

A hash table, also called a hash map, accesses a record by mapping its key to a position in the table, which speeds up lookup. In other words, the hash table both removes duplicate elements and supports fast lookup, and the keys of the hash table represent the values of the non-repeated non-zero elements. For example, let a table be M and let there be a function f(key); if, for any given key, substituting the key into the function yields the address in the table of the record containing that key, then the table M is called a hash table and the function f(key) is called a hash function. One skilled in the art may construct the hash table in advance according to any related technique, which is not limited in this application. Because a sparse matrix is a special matrix containing a large number of zero elements, once the electronic data to be compressed is represented by a sparse matrix, the number of non-zero elements contained in the sparse matrix is the total number of non-zero elements of the electronic data to be compressed. And because the hash table removes repeated elements, after S402 the elements stored in the hash table are the non-zero, non-repeated elements of the electronic data to be compressed.
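The role of the function f(key) and the removal of duplicates can be illustrated with a toy chained hash table in Python. This is only a sketch for intuition: the class name, the bucket count, and the use of Python's built-in hash are assumptions, and the application does not prescribe any particular hash table construction.

```python
class TinyHashTable:
    """Toy chained hash table: f(key) selects a bucket, duplicates are stored once."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]
        self.size = 0

    def f(self, key):
        # Hash function: maps a key to the address (bucket) holding its record.
        return hash(key) % len(self.buckets)

    def insert(self, key, value=None):
        bucket = self.buckets[self.f(key)]
        for entry in bucket:
            if entry[0] == key:
                return False      # key already present: the duplicate is discarded
        bucket.append([key, value])
        self.size += 1
        return True

    def lookup(self, key):
        # Only the bucket selected by f(key) is searched.
        for k, v in self.buckets[self.f(key)]:
            if k == key:
                return v
        raise KeyError(key)


table = TinyHashTable()
for v in [2.0, 5.0, 2.0, 2.0, 5.0]:   # hypothetical non-zero values
    table.insert(v)
```

After the loop the table holds one entry per distinct value, so its size equals the number of non-repeated non-zero elements.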

S403: storing the values of the non-repeated elements of the hash table into a pre-constructed storage data structure, and obtaining a hash key value pair table for identifying the storage correspondence between the hash table and the storage data structure, so as to store the values of all non-repeated non-zero elements in the electronic data to be compressed by using the storage data structure.

In this embodiment, the storage data structure may be pre-constructed, and the storage data structure may be represented in an array format, which is used to store the values of the non-repeated elements in the hash table, so the length of the storage data structure may be determined according to the number of elements in the hash table. All non-repeating non-zero element values in the electronic data to be compressed may be stored by the storage data structure. The hash key value pair table is generated after the hash table of the previous step records the stored corresponding relation, namely the hash table recorded with the stored corresponding relation is the hash key value pair table of the present step.

S404: establishing corresponding relations among all non-zero elements of the current compression storage format corresponding to the electronic data to be compressed, all elements of the hash table and all elements in the storage data structure to obtain index information so as to convert the electronic data to be compressed from the current compression storage format to the target value compression storage format.

In this embodiment, the current compressed storage format is a compressed storage format supported in the related art, such as the COO or CSR format. The current compressed storage format, the hash table, and the storage data structure all store elements of the electronic data to be compressed. Therefore, by constructing the correspondence among the non-zero elements of the current compressed storage format, the elements of the hash table, and the elements of the storage data structure, the electronic data to be compressed can be converted from the current compressed storage format to the target value compression storage format. The target value compression storage format is a new value compression storage format provided in the present application, in which repeated elements are compressed into a single element plus index coordinates. The index coordinates are the index information.
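As an illustration of the target value compression storage format, the following Python sketch converts a CSR value array into an array of non-repeated values plus an index array. It is a minimal sketch under stated assumptions: the function name and the index array name csrv_idx are hypothetical, the sample values are invented, and steps S402 to S404 are fused into a single pass for brevity rather than performed as separate traversals.

```python
def csr_to_value_compressed(csr_vals):
    """Return (csrv_vals, csrv_idx) such that csrv_vals[csrv_idx[k]] == csr_vals[k]."""
    value_hashmap = {}   # key: distinct non-zero value, value: its subscript in csrv_vals
    csrv_vals = []       # storage data structure: each distinct value stored once
    csrv_idx = []        # index information: one entry per non-zero element
    for v in csr_vals:
        if v not in value_hashmap:
            value_hashmap[v] = len(csrv_vals)
            csrv_vals.append(v)
        csrv_idx.append(value_hashmap[v])
    return csrv_vals, csrv_idx


# Hypothetical CSR value array with two distinct values.
csr_vals = [2.0, 5.0, 2.0, 2.0, 5.0, 2.0, 5.0]
csrv_vals, csrv_idx = csr_to_value_compressed(csr_vals)
restored = [csrv_vals[i] for i in csrv_idx]
```

The row pointer and column arrays of the original format are left unchanged; only the value array is replaced by the pair of arrays returned here.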

In the technical scheme provided by the application, the hash table is used to remove the repeated elements of the electronic data to be compressed, and all non-repeated elements are stored in a new storage data structure. By traversing the hash table, the key of each entry holds the value of a non-repeated element, and the value of each entry is set to the subscript position of that element in the storage data structure. Finally, index information is obtained by constructing the correspondence among the current compressed storage format of the electronic data to be compressed, the elements of the hash table, and the elements of the storage data structure, so that the electronic data to be compressed is converted from the original compressed storage format into the new target value compression storage format. The whole process is simple, can be applied in practical scenarios, and is highly practical. Repeated elements in the value array corresponding to the electronic data to be compressed are removed during compression, so redundantly stored repeated element values are fully compressed: every repeated element is stored in the value array only once, and the required storage space is significantly reduced. This improves data locality, makes full use of the cache structure of the central processor, and reduces the memory bandwidth wasted on accessing repeated elements. Therefore, when the electronic data to be compressed is read during task execution, the data of the value array can be accessed from the cache more efficiently, which effectively improves memory access efficiency and the execution efficiency of related tasks.

It should be noted that the steps in the present application have no strict execution order: as long as the logical order is respected, the steps may be executed simultaneously or in a preset order. FIG. 4 is only schematic and does not mean that this is the only possible execution order.

In the above embodiment, the implementation of step S402 is not limited, and an exemplary implementation of storing the electronic data to be compressed in the hash table in this embodiment may include the following:

initializing a hash table and an original value array;

inserting the values of the non-zero elements in the original value array into a hash table to obtain a hash key value table, so that all the non-zero elements of the electronic data to be compressed are inserted into the hash table;

The original hash table may be defined as value_hashmap. After all non-zero elements of the electronic data to be compressed have been inserted into the hash table, a hash key table is generated; that is, the hash key table is the hash table into which all non-zero elements have been inserted. Take CSR as the current compressed storage format, and let the sparse matrix representing the electronic data to be compressed be A, where the size of A is rows × columns = m × n and A contains nnz non-zero elements in total; correspondingly, the dimensions of the dense vectors X and Y are n × 1 and m × 1. The process of storing the electronic data to be compressed in the hash table may include:

Step 1: initializing value_hashmap;

Step 2: initializing i = 0, where i represents the subscript of the value array in the CSR format (which may be defined as csr_vals);

Step 3: inserting the value of the i-th non-zero element into the hash table, where a repeated element is automatically discarded on insertion;

Step 4: letting i = i + 1;

Step 5: determining whether i is equal to nnz; if not, returning to step 3; otherwise, jumping to step 6.

After the above steps are completed, the number of non-repeated non-zero elements (i.e., numVal) is obtained by counting the number of elements in the hash table. The stored contents of value_hashmap are shown in FIG. 5. For ease of distinction, the value_hashmap shown in FIG. 5 is defined as the hash key table; at this point only the keys of the hash table carry a specific meaning.
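Steps 1 to 5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: a Python dict stands in for value_hashmap, the function name and the sample values are hypothetical, and the values are left as None because only the keys are meaningful at this stage.

```python
def build_value_hashmap(csr_vals):
    """Steps 1-5: insert every non-zero value; repeated values collapse onto one key."""
    value_hashmap = {}                 # Step 1: initialise the hash table
    i = 0                              # Step 2: subscript into csr_vals
    nnz = len(csr_vals)
    while i != nnz:                    # Step 5: stop once i equals nnz
        # Step 3: insert the i-th value; an already present key is left untouched.
        value_hashmap.setdefault(csr_vals[i], None)
        i += 1                         # Step 4
    return value_hashmap


csr_vals = [2.0, 5.0, 2.0, 2.0, 5.0, 2.0, 5.0]   # hypothetical CSR value array
value_hashmap = build_value_hashmap(csr_vals)
numVal = len(value_hashmap)            # number of non-repeated non-zero elements
```

Unlike the step list, the loop checks the termination condition before the first insertion, so an empty value array is handled without special casing.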

In the above embodiment, the implementation of step S403 is not limited, and an exemplary generation manner of the hash key table in this embodiment may include the following:

pre-constructing a storage data structure; initializing the storage data structure and a subscript of the storage data structure; initializing a pre-constructed hash table position pointer; for each non-repeated element of the hash key table, putting the target key pointed to by the current hash table position pointer into the target position of the storage data structure, and storing the identification value of the target position as the value of the target key; when every non-repeated element in the hash key table holds the identification value of its corresponding position in the storage data structure, taking the current hash key table as the hash key value pair table.

In this embodiment, the storage data structure is used to store the values of all non-repeated non-zero elements in the electronic data to be compressed, and is represented as an array, csrv_vals. The number of elements of the hash key table is the same as the array length numVal of the storage data structure. Once the hash key table has been generated and each of its elements has recorded the identification value of its position in the storage data structure, the hash table at this time can, for convenience of description, be defined as the hash key value pair table. Based on the above embodiments, an exemplary implementation of the present embodiment may include:

step 6: initializing numVal=value_hashmap.size, wherein size represents the number of elements in the hash table;

step 7: initializing a new array csrv_vals, wherein the array length is numVal and is used for storing all non-repeated non-zero element values in the sparse matrix;

step 8: initializing j=0, wherein j represents the subscript of the csrv_vals array;

step 9: initializing it=value_hashmap.begin, where it points to the position of an element in the hash table and begin represents the first element of the hash table;

step 10: placing the key pointed to by the hash table pointer it (the value of a certain non-repeating element) in the j-th position of the value array (csrv_vals);

Step 11: storing the subscript position j as the value (it.val=j) corresponding to the key pointed to by it; each key-value pair of the hash table then represents the value of a non-repeating non-zero element and its subscript position in the value array (csrv_vals);

step 12: let j=j+1; let it=it+1;

step 13: determining whether it points to value_hashmap.end (where end marks the end of the hash table, reached once the last element has been processed); if not, returning to step 10, otherwise jumping to step 14 of the following embodiment.

When step 13 is completed, all non-repeated non-zero elements, that is, all keys in the value_hashmap, are stored in the csrv_vals array. Each key-value pair of the value_hashmap now represents the value of a non-repeated non-zero element and its subscript position in the value array (csrv_vals). Taking the matrix A in fig. 1 as an example, the state of the csrv_vals array and the value_hashmap hash table is shown in fig. 6. The value_hashmap at this point is the hash key value table; it indicates that there are two non-repeated non-zero elements (a and b) in the matrix A, where a is stored in position 0 of the csrv_vals array and b is stored in position 1 of the csrv_vals array.
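Steps 6 to 13 can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the application: a Python dict stands in for value_hashmap, the function name build_value_array is hypothetical, and the numeric values 2.5 and 4.0 are assumed stand-ins for the symbolic elements a and b.

```python
def build_value_array(value_hashmap):
    """Copy every distinct key into csrv_vals and record its slot (steps 6-13)."""
    num_val = len(value_hashmap)      # step 6: number of distinct non-zero values
    csrv_vals = [None] * num_val      # step 7: new value array of length numVal
    j = 0                             # step 8: subscript into csrv_vals
    for key in value_hashmap:         # steps 9, 12, 13: walk the table from begin to end
        csrv_vals[j] = key            # step 10: key goes into position j
        value_hashmap[key] = j        # step 11: the value of the key becomes its slot
        j += 1
    return csrv_vals


# Assumed example: only the keys are meaningful before the traversal.
a, b = 2.5, 4.0
value_hashmap = {a: None, b: None}
csrv_vals = build_value_array(value_hashmap)
```

After the call, csrv_vals holds each distinct value once and value_hashmap maps each value to its position in csrv_vals. Reassigning the value of an existing key while iterating is safe here because the size of the dict does not change.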

In the above embodiment, how to execute step S404 is not limited, and an exemplary generation manner of the index information given in this embodiment may include the following:

pre-constructing index information, and representing the index information by adopting an index array;

for each non-zero element of the original value array, determining a storage position of the current non-zero element in a storage data structure from a hash key value table, and storing the storage position to a target index position of the index array;

and generating index information after each non-zero element of the index array storage original value array corresponds to a storage position of the storage data structure.

In this embodiment, the index array may be expressed as csrv_vals_idx [ ], where the number of elements included in the index array is the total number of non-zero elements included in the electronic data to be compressed; the value array corresponding to the current compressed storage format may be expressed as csr_vals, and the generation process of the index information includes:

step 14: initializing k=0, wherein k represents the subscript of a value array (csr_vals) of a certain non-zero element in the current compressed storage format;

step 15: looking up, from the hash table, the index position in the value array (csrv_vals) of the value of the non-zero element (csr_vals [ k ]), and storing that index in the k-th position of the index array (csrv_vals_idx [ k ]);

Step 16: let k=k+1;

step 17: judging whether k is equal to nnz; if not, returning to step 15, otherwise ending.
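Steps 14 to 17 can be sketched as follows. The sketch assumes the hash key value table already maps each distinct value to its position in csrv_vals; the function name build_index_array and the sample data are hypothetical.

```python
def build_index_array(csr_vals, value_hashmap):
    """Translate every CSR value into its position in csrv_vals (steps 14-17)."""
    nnz = len(csr_vals)
    csrv_vals_idx = [0] * nnz
    k = 0                                              # step 14
    while k != nnz:                                    # step 17: stop once k equals nnz
        csrv_vals_idx[k] = value_hashmap[csr_vals[k]]  # step 15: hash lookup
        k += 1                                         # step 16
    return csrv_vals_idx


# Assumed example data: two distinct values repeated over five non-zero elements.
csr_vals = [2.5, 2.5, 4.0, 4.0, 2.5]
value_hashmap = {2.5: 0, 4.0: 1}
csrv_vals_idx = build_index_array(csr_vals, value_hashmap)
```

Each lookup is a hash table access, so the whole conversion takes time roughly proportional to nnz on average.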

After step 17 is completed, the sparse matrix in the original format has been converted into the new value compression storage format, i.e. the value array (csr_vals) in the CSR format is converted into a repetition-free value array (csrv_vals) and a corresponding index array (csrv_vals_idx) in the new value compression storage format. Since the value array typically holds double-precision floating-point numbers (64 bits each) and the index array is typically an unsigned integer array (32 bits or less per entry), the memory space of the sparse matrix in this format is reduced when values repeat sufficiently often. FIG. 7 illustrates the relative states of the value array and index array in the new value compression storage format after conversion is complete. Fig. 8 shows the target value compression storage format finally obtained through steps 1 to 17.
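The storage argument can be checked with simple arithmetic. The following sketch compares the bytes needed for the values alone in the two formats, assuming 8-byte values and 4-byte indices; the function name and the sample sizes are hypothetical, and the row pointer and column arrays are left out because they are the same in both formats.

```python
def value_storage_bytes(nnz, num_val, value_bytes=8, index_bytes=4):
    """Return (CSR bytes, value-compressed bytes) for the value data only."""
    csr = nnz * value_bytes                            # one stored value per non-zero
    csrv = num_val * value_bytes + nnz * index_bytes   # distinct values plus one index each
    return csr, csrv


# Assumed sizes: one million non-zeros drawn from 100 distinct values.
many_repeats = value_storage_bytes(1_000_000, 100)
# Assumed worst case: every non-zero value is different.
no_repeats = value_storage_bytes(1_000, 1_000)
```

With these assumed widths the compressed form needs about half the space when values repeat heavily, but more space than CSR when no value repeats, since every non-zero then carries both a value and an index.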

After the electronic data to be compressed is converted from the current compressed storage format to the target compressed storage format in the above embodiment, when a task that needs to use the electronic data to be compressed to participate in a subsequent sparse matrix vector multiplication operation is received, the execution process of the task may include:

receiving a target task to be executed;

pre-constructing an intermediate result storage structure;

respectively taking the column position coordinate value and the index value as indexes, and selecting corresponding first operation elements and second operation elements from the target electronic data and the storage data structure;

In this embodiment, the data compression and storage method described in any one of the above embodiments serves as a preprocessing operation for the sparse matrix vector multiplication of this embodiment. The target task to be executed includes a subtask that performs the sparse matrix vector multiplication using the electronic data to be compressed and the target electronic data, where the target electronic data is represented by a dense vector X. The subtask is calculated in row-major order: each row of the sparse matrix is multiplied and accumulated with the target electronic data in turn, and after each row is finished the result is written back into the subtask result storage structure Y, namely the dense vector Y shown in fig. 1. The present embodiment defines an intermediate result storage structure sum for storing the result of multiplying and accumulating the elements of each row of the electronic data to be compressed with the corresponding elements of the target electronic data.

The column position coordinate value of the above embodiments may be determined as follows: the positions of the non-zero elements of the electronic data to be compressed are represented in advance in a one-dimensional array format to obtain a column position array; the coordinate value in the column position array corresponding to the current non-zero element is determined based on the sequence value of the current non-zero element in the electronic data to be compressed, and is taken as the column position coordinate value. Based on the column position coordinate value, the corresponding element is selected from the dense vector corresponding to the target electronic data as the first operation element, that is, operand 1.

The index value may be determined as follows: the current position identification value in the index information corresponding to the current non-zero element is determined based on the sequence value of the current non-zero element in the electronic data to be compressed; the position of the current non-zero element in the storage data structure is then determined from the current position identification value and taken as the index value. Based on the index value, the corresponding element is selected from the storage data structure as the second operation element, that is, operand 2. An exemplary execution of the subtask of the present embodiment may include:

Step 1: initializing i=0, where i indicates that the i-th row is currently being calculated;

step 2: initializing sum=0, wherein sum represents an intermediate result obtained by multiplying and adding the row element and the corresponding element of the X vector;

step 3: initializing j=csrv_rows_ptr [ i ], wherein j represents the calculated j-th non-zero element, csrv_rows_ptr [ i ] is the starting position of the non-zero element of the row and is also the ending position of the non-zero element of the previous row;

step 4: taking out csrv_col [ j ] as the index of operand 1, and fetching operand 1 from the X vector through this index, wherein csrv_col [ j ] represents the column coordinate of the j-th non-zero element; step 4 thus extracts the multiplication operand 1 from X. Taking the matrix and vector of fig. 1 as an example, as shown in fig. 9: with j=2, csrv_col [2] =4 indicates that the next non-zero element is in column 4+1=5, which corresponds to the fifth element, 6, of the dense vector X, so operand 1 is 6 at this time.

Step 5: fetching the value of operand 2 from csrv_vals, wherein csrv_vals_idx [ j ] represents the index of the j-th non-zero element value in the csrv_vals array; step 5 thus extracts operand 2 from the sparse matrix. Taking the matrix and vector in fig. 1 as an example, as shown in fig. 10: with j=2, csrv_vals_idx [2] =1 indicates that the value of the next non-zero element is at position 1+1=2 of the value array (csrv_vals), so operand 2 is taken from csrv_vals [1] and is b;

Step 6: multiplying the two fetched operands and adding the product to the temporary variable sum. Taking the matrix and vector of fig. 1 as an example, as shown in fig. 11: with j=2, sum=6a before step 6, and sum=6a+6b after step 6 is completed;

step 7: let j=j+1;

step 8: judging whether j is equal to csrv_rows_ptr [ i+1], wherein csrv_rows_ptr [ i+1] represents the end position of the non-zero element of the row and is the start position of the non-zero element of the next row, if not, returning to the step 4, otherwise, jumping to the step 9;

step 9: and storing the value in sum to the position corresponding to the ith element in the vector Y, and completing the non-zero element calculation of the ith row in the sparse matrix. Taking the matrix and vector of fig. 1 as an example, as shown in fig. 12: taking i=0 as an example, when the non-zero elements of the 0 th row are all calculated, the sum is written back to the dense vector Y as a result, i.e. the calculation result of Y [0] is 6a+7b;

step 10: let i=i+1;

step 11: judging whether i is equal to m; if not, returning to step 2, otherwise the calculation of all non-zero elements of all rows is finished and the execution flow of the subtask ends. Taking the matrix and vector of fig. 1 as an example, the calculation result is shown in fig. 13.
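The subtask of steps 1 to 11 can be sketched as a double loop. This is a minimal sketch under stated assumptions rather than the implementation of the application: the function name spmv_csrv is hypothetical, and the small 2x3 matrix below is assumed sample data, not the matrix of fig. 1.

```python
def spmv_csrv(m, csrv_rows_ptr, csrv_col, csrv_vals, csrv_vals_idx, x):
    """Compute y = A * x for a matrix stored in the value-compressed format."""
    y = [0.0] * m
    for i in range(m):                                  # steps 1, 10, 11: one row at a time
        total = 0.0                                     # step 2: intermediate result sum
        # steps 3, 7, 8: j runs over the non-zero elements of row i
        for j in range(csrv_rows_ptr[i], csrv_rows_ptr[i + 1]):
            operand1 = x[csrv_col[j]]                   # step 4: element of X by column index
            operand2 = csrv_vals[csrv_vals_idx[j]]      # step 5: value via the index array
            total += operand1 * operand2                # step 6: multiply and accumulate
        y[i] = total                                    # step 9: write back to Y
    return y


# Assumed sample: A = [[2, 0, 3], [0, 2, 0]] with distinct values 2.0 and 3.0.
csrv_rows_ptr = [0, 2, 3]
csrv_col = [0, 2, 1]
csrv_vals = [2.0, 3.0]
csrv_vals_idx = [0, 1, 0]
y = spmv_csrv(2, csrv_rows_ptr, csrv_col, csrv_vals, csrv_vals_idx, [1.0, 2.0, 3.0])
```

Using a range for the inner loop tests the row bounds before the first access, so a row without non-zero elements simply yields 0 in Y. This is a design choice of the sketch; the step list above performs step 4 before the comparison in step 8.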

As can be seen from the above, the present embodiment traverses the value array in CSR format, removes duplicate elements by means of the hash table, and saves all non-duplicate elements in a newly initialized value array. The hash table is then traversed so that each "key" is a non-repeating element and each "value" is the corresponding index position in the new value array. The CSR format value array is traversed again, and the index position corresponding to each value is looked up in the hash table and stored in a newly initialized index array. When the subtask is executed, the index corresponding to an operand is found through the index array, and the operand is then fetched from the value array through that index. In this way every repeated element is stored only once in the value array, so the storage space of the array is reduced and data locality is improved. During calculation, the central processing unit is more likely to find the value array data in the cache, which effectively improves the calculation speed and the execution efficiency of the subtask.
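That the conversion loses no information can be illustrated with a round trip. The sketch below uses a condensed single-pass variant of the conversion, not the three-traversal procedure described above; the function names compress_values and expand_values and the sample data are hypothetical.

```python
def compress_values(csr_vals):
    """Single-pass variant: map each distinct value to the order of its first appearance."""
    value_hashmap = {}
    for v in csr_vals:
        # A new value receives the next free position; a repeated value keeps its own.
        value_hashmap.setdefault(v, len(value_hashmap))
    csrv_vals = list(value_hashmap)                       # distinct values in slot order
    csrv_vals_idx = [value_hashmap[v] for v in csr_vals]  # one index per non-zero element
    return csrv_vals, csrv_vals_idx


def expand_values(csrv_vals, csrv_vals_idx):
    """Rebuild the original CSR value array from the compressed pair."""
    return [csrv_vals[idx] for idx in csrv_vals_idx]


# Assumed sample data.
original = [2.5, 2.5, 4.0, 4.0, 2.5]
vals, idx = compress_values(original)
restored = expand_values(vals, idx)
```

Because the dict compares keys by exact equality, only bit-for-bit equal floating-point values share a slot; values that differ by rounding error are stored separately.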

The application also provides a corresponding device for the data compression method, which makes the method more practical. The device may be described from the perspective of functional modules and from the perspective of hardware. The data compression device provided in the present application is described below; it is used to implement the data compression method provided in the present application. In this embodiment, the data compression device may include or be divided into one or more program modules, where the one or more program modules are stored in a storage medium and executed by one or more processors to implement the data compression method disclosed in the first embodiment. Program modules in the present application refer to a series of computer program instruction segments capable of performing particular functions, which are better suited than the program itself for describing the execution of the data compression device in a storage medium. The following description specifically introduces the functions of each program module of the present embodiment; the data compression device described below and the data compression method described above may be referred to in correspondence with each other.

Based on the angle of the functional modules, referring to fig. 14, fig. 14 is a block diagram of a data compression device provided in the present application under an embodiment, where the device may include:

a data acquisition module 141, configured to acquire electronic data to be compressed, which is represented by a sparse matrix;

a table storage module 142, configured to insert a value of each non-zero element of the electronic data to be compressed into a hash table constructed in advance, based on a total number of non-zero elements included in the electronic data to be compressed;

the data storage module 143 is configured to store values of non-repeated elements of the hash table into a pre-constructed storage data structure, and obtain a hash key value pair table that identifies a storage correspondence between the hash table and the storage data structure, so as to store values of all non-repeated non-zero elements in the electronic data to be compressed by using the storage data structure; the length of the storage data structure is determined according to the number of elements of the hash table;

the correspondence construction module 144 is configured to establish correspondence among non-zero elements of a current compression storage format corresponding to the electronic data to be compressed, elements of the hash table, and elements in the storage data structure, so as to obtain index information, so that the electronic data to be compressed is converted from the current compression storage format to a target value compression storage format; the target value compression storage format is a format for compressing repeated elements into single element plus index coordinates.

Optionally, in some implementations of this embodiment, the table storage module 142 may be further configured to:

initializing a hash table and an original value array;

Optionally, in other implementations of this embodiment, the data storage module 143 may be further configured to:

initializing a storage data structure and a subscript of the storage data structure;

initializing a pre-constructed hash table position pointer;

for the value of each non-repeated element of the hash key value table, putting the target key value pointed by the pointer of the current hash table into the target position of the storage data structure, and storing the identification value of the target position into the target key value;

Once the value of every non-repeated element in the hash key value table holds the identification value of its position in the storage data structure, the resulting table is used as the final hash key value table.

Illustratively, in some implementations of the present embodiment, the correspondence building module 144 may be further configured to:

In other implementations of the present embodiment, the apparatus may further include a task execution module, for example, where the task execution module may be configured to:

receiving a target task to be executed; the target task to be executed comprises a subtask for performing sparse matrix vector multiplication operation by utilizing the electronic data to be compressed and the target electronic data; the target electronic data is represented by a dense vector;

As an exemplary implementation of the foregoing embodiment, the foregoing task execution module may further be configured to:

representing the position of non-zero elements of electronic data to be compressed in advance by adopting a one-dimensional array format to obtain a column position array;

As another exemplary implementation of the above embodiment, the task execution module may further be configured to:

and determining the position of the current non-zero element corresponding to the storage data structure according to the current position identification value to serve as an index value.

The functions of each functional module of the data compression device described in the present application may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the relevant description of the foregoing method embodiment, which is not repeated herein.

As can be seen from the above, the present embodiment can solve the problem of low access efficiency in the related art, and effectively improve the access efficiency of electronic data represented in a sparse matrix format.

The data compression device mentioned above is described from the viewpoint of the functional module, and further, the application also provides an electronic device, which is described from the viewpoint of hardware. Fig. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present application in an implementation manner. As shown in fig. 15, the electronic device includes a memory 150 for storing a computer program; a processor 151 for implementing the steps of the data compression method as mentioned in any of the embodiments above when executing a computer program.

Processor 151 may include one or more processing cores, such as a 4-core processor or an 8-core processor; processor 151 may also be a controller, microcontroller, microprocessor, or other data processing chip. The processor 151 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). Processor 151 may also include a main processor and a coprocessor; the main processor is a processor for processing data in an awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, processor 151 may incorporate a GPU (Graphics Processing Unit) for rendering and drawing the content to be displayed by the display screen. In some embodiments, the processor 151 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 150 may include one or more computer-readable storage media, which may be non-transitory. Memory 150 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, the memory 150 may be an internal storage unit of the electronic device, such as a hard disk of a server. In other embodiments, the memory 150 may be an external storage device of the electronic device, such as a plug-in hard disk provided on a server, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, etc. Further, the memory 150 may include both internal storage units and external storage devices of the electronic device. The memory 150 may be used to store not only application software installed in the electronic device but also various types of data, such as the code of a program that executes the data compression method, and may also be used to temporarily store data that has been output or is to be output. In this embodiment, the memory 150 is at least used to store a computer program 1501 which, when loaded and executed by the processor 151, is capable of implementing the relevant steps of the data compression method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 150 may further include an operating system 1502, data 1503, and the like, and the storage manner may be transient storage or permanent storage. Operating system 1502 may include, among others, Windows, Unix, and Linux. The data 1503 may include, but is not limited to, data corresponding to the data compression result.

In some embodiments, the electronic device may further include a display 152, an input/output interface 153, a communication interface 154 (also referred to as a network interface), a power supply 155, and a communication bus 156. The display 152 and the input/output interface 153, such as a keyboard, belong to a user interface, which may optionally also include standard wired interfaces, wireless interfaces, etc. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be referred to as a display screen or display unit, as appropriate, and is used for displaying information processed in the electronic device and for displaying a visual user interface. The communication interface 154 may optionally include a wired interface and/or a wireless interface, such as a Wi-Fi interface or a Bluetooth interface, and is typically used to establish a communication connection between the electronic device and other electronic devices. The communication bus 156 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 15, but this does not mean that there is only one bus or one type of bus.

Those skilled in the art will appreciate that the configuration shown in fig. 15 is not limiting of the electronic device and may include more or fewer components than shown, for example, a sensor 157 that performs various functions.

The functions of each functional module of the electronic device described in the present application may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the relevant description of the foregoing method embodiment, which is not repeated herein.

It will be appreciated that the data compression method of the above embodiments, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored on a computer readable storage medium. Based on such understanding, the technical solution of the present application, or the part contributing to the related art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which performs all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrically erasable programmable ROM, registers, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a magnetic memory, a removable disk, a CD-ROM, a magnetic disk, or an optical disk, etc.

Based on this, the present application also provides a readable storage medium storing a computer program which, when executed by a processor, performs the steps of the data compression method according to any one of the embodiments above.

In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to each other. Since the device and the electronic equipment disclosed in the embodiments correspond to the method disclosed in the embodiments, their description is relatively brief, and the relevant details may be found in the description of the method.

Those skilled in the art will further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The above describes in detail a data compression method, apparatus, electronic device and readable storage medium provided in the present application. Specific examples are set forth herein to illustrate the principles and embodiments of the present application, and the description of the examples above is only intended to assist in understanding the methods of the present application and their core ideas. It should be noted that, based on the embodiments in this application, all other embodiments that can be obtained without inventive labor by those skilled in the art are within the scope of protection of this application. The present application may be subject to numerous improvements and modifications without departing from the principles of the present application, and such improvements and modifications are intended to fall within the scope of the claims of the present application.

Claims

1. A method of data compression, comprising:

2. The method for compressing data according to claim 1, wherein said inserting the value of each non-zero element of said electronic data to be compressed into a pre-constructed hash table based on the total number of non-zero elements contained in said electronic data to be compressed comprises:

initializing the hash table and the original value array;

3. The method of claim 1, wherein storing the values of the non-repeating elements of the hash table in a pre-constructed storage data structure and obtaining a hash key value pair that identifies a storage correspondence of the hash table to the storage data structure comprises:

initializing a pre-constructed hash table position pointer;

4. The method for compressing data according to claim 1, wherein said establishing a correspondence between each non-zero element of the current compressed storage format, each element of the hash table, and each element in the stored data structure, for the electronic data to be compressed, includes:

5. The data compression method according to any one of claims 1 to 4, wherein after the electronic data to be compressed is converted from the current compressed storage format to the target compressed storage format, further comprising:

6. The data compression method according to claim 5, wherein the obtaining column position coordinate values and index values corresponding to the target non-zero element by retrieving the electronic data to be compressed and the index information, respectively, includes:

7. The data compression method according to claim 5, wherein the obtaining column position coordinate values and index values corresponding to the target non-zero element by retrieving the electronic data to be compressed and the index information, respectively, includes:

8. A data compression apparatus, comprising:

9. An electronic device comprising a processor and a memory, the processor being adapted to implement the steps of the data compression method according to any one of claims 1 to 7 when executing a computer program stored in the memory.

10. A readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, implements the steps of the data compression method according to any of claims 1 to 7.