CN113742056A - Data storage method, device and equipment and computer readable storage medium - Google Patents


Info

Publication number
CN113742056A
Authority
CN
China
Prior art keywords
data
memory
stored
target
length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011303864.2A
Other languages
Chinese (zh)
Inventor
郭沛松
李杰
陈晓宇
包勇军
朱小坤
刘健
韩小涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202011303864.2A
Publication of CN113742056A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/0223User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023Free address space management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/172Caching, prefetching or hoarding of files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5011Pool

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data storage method, apparatus, device, and computer-readable storage medium. The method comprises: acquiring data to be stored and a type identifier of the data to be stored; determining a target memory unit from a target memory pool based on the type identifier, the target memory pool being the memory pool corresponding to the type identifier; acquiring mapping data of the variable-length data in the data to be stored, as well as the fixed-length data in the data to be stored, and determining the fixed-length data and the mapping data as target data; and storing the target data into the target memory unit, the memory footprint of the target data being equal to the memory size of the target memory unit. Because target data with the same type identifier are stored in memory units of the same size, the memory addresses of the memory units in a memory pool are contiguous, which improves the spatial locality of the data and the cache hit rate, and also improves the memory utilization rate.

Description

Data storage method, device and equipment and computer readable storage medium
Technical Field
The present application relates to the field of computer storage technologies, and in particular, but not exclusively, to a data storage method, apparatus, device, and computer-readable storage medium.
Background
A graph is an abstract data structure commonly used in computer science, with general representation capability in both structure and semantics. Graph-structured data is ubiquitous in everyday life; for example, in an e-commerce scenario, users and commodities can be regarded as two types of vertices, and the relationships between them, such as browsing, adding to cart, and purchasing, can be regarded as different types of edges.
A graph computing system is a system that analyzes and computes graph-structured data. As the scale of graph data grows, the performance requirements on graph computing systems become increasingly demanding. Graph data storage is an important component of a graph computing system: the memory format used to store graph data strongly affects the processing scale and query efficiency of the system, and thus its computing performance.
In the related art, graph computing systems usually store graph data in a loose memory format. Because this storage mode relies on mechanisms such as byte alignment and standard-container memory allocation, memory addresses are discontinuous, memory is wasted, memory utilization is low, the locality of the storage space is poor, and the cache hit rate of random accesses is low.
Disclosure of Invention
In view of this, embodiments of the present application provide a data storage method, apparatus, device, and computer-readable storage medium.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a data storage method, which comprises the following steps:
acquiring data to be stored and a type identifier of the data to be stored;
determining a target memory unit from a target memory pool based on the type identifier, wherein the target memory pool is a memory pool corresponding to the type identifier;
acquiring mapping data of variable-length data in the data to be stored and fixed-length data in the data to be stored, and determining the fixed-length data and the mapping data as target data;
and storing the target data to the target memory unit, wherein the memory occupation amount of the target data is equal to the memory size of the target memory unit.
In some embodiments, the obtaining the data to be stored and the type identifier of the data to be stored includes:
acquiring a file to be stored, wherein the file to be stored comprises at least one source data, the source data comprises a type identifier, and the source data is at least graph data;
performing digital system conversion processing on the source data to obtain target digital system data to be stored;
and determining the type identifier of the source data as the type identifier corresponding to the data to be stored.
In some embodiments, the determining a target memory unit from a target memory pool based on the type identifier includes:
determining a target memory pool according to the type identifier of the data to be stored;
and determining a target memory unit from the target memory pool, wherein the target memory unit is a memory unit pre-allocated in the target memory pool.
In some embodiments, the determining a target memory pool according to the type identifier of the data to be stored includes:
determining a type identifier set corresponding to the file to be stored based on the type identifier of each source data in the file to be stored;
allocating a corresponding memory pool for each type identifier in the type identifier set in a memory space to obtain N memory pools; wherein, N is the number of the type identifiers in the type identifier set, and N is a positive integer;
and determining a memory pool corresponding to the type identifier of the data to be stored in the N memory pools as a target memory pool.
In some embodiments, said determining a target memory unit from said target memory pool comprises:
acquiring a storage format predefined for the type identifier;
pre-allocating at least one memory unit in the target memory pool based on the storage format;
and determining a free memory unit which is pre-allocated as a target memory unit.
In some embodiments, said pre-allocating at least one memory unit in said target memory pool based on said storage format comprises:
determining the byte length of a predefined fixed field, the byte length of a fixed-length attribute and the number M of variable-length attributes based on the storage format, wherein M is a natural number;
determining the memory occupation amount of target data corresponding to the data to be stored according to the byte length of the fixed field, the byte length of the fixed-length attribute and the byte lengths of the M address pointers;
pre-allocating at least one memory unit in the target memory pool based on the memory footprint.
In some embodiments, said pre-allocating at least one memory cell in said target memory pool based on said memory footprint comprises:
when no idle memory unit exists in the target memory pool, a memory block is pre-allocated in the target memory pool;
and dividing the memory block into at least one memory unit based on the memory occupation amount.
In some embodiments, the obtaining mapping data of variable-length data in the data to be stored and fixed-length data in the data to be stored includes:
determining fixed field data, fixed length attribute data and variable length attribute data of the data to be stored based on the storage format;
determining the fixed field data and the fixed length attribute data as fixed length data in the data to be stored;
determining M variable-length attribute data as M variable-length data in the data to be stored;
and mapping the M variable length data into M address pointers, and determining the M address pointers as mapping data, wherein the M address pointers respectively point to memory addresses of the M variable length attribute data.
The embodiment of the application provides a data storage device, which comprises:
the first acquisition module is used for acquiring data to be stored and the type identifier of the data to be stored;
a first determining module, configured to determine a target memory unit from a target memory pool based on the type identifier, where the target memory pool is a memory pool corresponding to the type identifier;
the second acquisition module is used for acquiring mapping data of variable-length data in the data to be stored and fixed-length data in the data to be stored;
the second determining module is used for determining the fixed-length data and the mapping data as target data;
and the storage module is used for storing the target data to the target memory unit, and the memory occupation amount of the target data is equal to the memory size of the target memory unit.
An embodiment of the present application provides a data storage device, including:
a processor; and
a memory for storing a computer program operable on the processor;
wherein the computer program, when executed by the processor, implements the steps of the data storage method described above.
Embodiments of the present application provide a computer-readable storage medium, which stores computer-executable instructions configured to perform the steps of the data storage method.
The embodiment of the application provides a data storage method, a data storage device, data storage equipment and a computer-readable storage medium, wherein the method comprises the following steps: acquiring data to be stored and a type identifier of the data to be stored; determining a target memory unit from a target memory pool based on the type identifier, wherein the target memory pool is a memory pool corresponding to the type identifier; acquiring mapping data of variable-length data in the data to be stored and fixed-length data in the data to be stored, and determining the fixed-length data and the mapping data as target data; and storing the target data to the target memory unit, wherein the memory occupation amount of the target data is equal to the memory size of the target memory unit. Therefore, the target data with the same type of identification is stored in the memory units with the same size, so that the memory addresses among the memory units in the memory pool are continuous, the spatial locality of the data is improved, and the cache hit rate is improved; and the memory waste can be reduced, thereby improving the memory utilization rate.
Drawings
In the drawings, which are not necessarily drawn to scale, like reference numerals may describe similar components in different views. The drawings illustrate generally, by way of example, but not by way of limitation, various embodiments discussed herein.
Fig. 1 is a schematic flowchart of an implementation of a data storage method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of another implementation of a data storage method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of another implementation of the data storage method according to the embodiment of the present application;
FIG. 4 is a schematic flowchart of an implementation of a graph data storage method according to an embodiment of the present disclosure;
FIG. 5 is a diagram of vertex binary encoding format;
FIG. 6 is a diagram of an edge binary encoding format;
FIG. 7 is a diagram illustrating different types of vertices being stored in corresponding memory pools;
FIG. 8 is a diagram illustrating the storage of different types of edges to corresponding memory pools;
fig. 9 is a schematic flow chart illustrating a working principle of a memory pool according to an embodiment of the present disclosure;
FIG. 10 is a schematic diagram illustrating a flowchart of loading graph data according to an embodiment of the present application;
FIG. 11 is a schematic diagram illustrating a structure of a data storage device according to an embodiment of the present disclosure;
fig. 12 is a schematic structural diagram of a data storage device according to an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the attached drawings, the described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first", "second", and "third" are used only to distinguish similar objects and do not denote a particular order; where permitted, the specific order or sequence may be interchanged so that the embodiments of the application described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
In the related art, memory is allocated and released on the heap using the memory management functions of the system. On one hand, this wastes memory and reduces the memory utilization rate; on the other hand, repeatedly requesting memory produces discontinuous addresses, poor spatial locality of the data, and a reduced cache hit rate.
Based on the above problems in the related art, embodiments of the present application provide a data storage method, which is applied to a data storage device. The method provided by the embodiment of the present application can be implemented by a computer program, and when the computer program is executed, each step in the data storage method provided by the embodiment of the present application is completed. In some embodiments, the computer program may be executed by a processor in a data storage device. Fig. 1 is a schematic flow chart of an implementation of a data storage method provided in an embodiment of the present application, and as shown in fig. 1, the data storage method includes the following steps:
step S101, obtaining data to be stored and type identification of the data to be stored.
In the embodiment of the present application, the data to be stored is data related to vertex data or edge data of a graph. A graph is a common abstract data structure in computer science, comprising vertices and edges. For example, in an e-commerce scenario, users and commodities can be regarded as two types of vertices, and the relationships between them, such as browsing, adding to cart, and purchasing, can be regarded as different types of edges. Different types of vertices or edges have different type identifiers. The type identifiers may be predefined by a designer; for example, the type identifier of the vertex "user" in the e-commerce scenario is defined as "vertex_type_layer", the type identifier of the vertex "commodity" as "vertex_type_goods", the type identifier of the edge "browse" as "edge_type_browse", the type identifier of the edge "add to cart" as "edge_type_add", and the type identifier of the edge "purchase" as "edge_type_buy".
Because data in memory is generally stored in binary form, while the source data in the file to be stored is generally natural-language data, the source data in the file to be stored is preprocessed to obtain the data to be stored. The preprocessing may be a binary conversion of the source data, yielding binary data to be stored. Converting the number system of the source data does not change its inherent data attributes, so the type identifier of the source data is determined as the type identifier of the data to be stored.
Step S102, determining a target memory unit from the target memory pool based on the type identifier.
And the target memory pool is a memory pool corresponding to the type identifier.
The core technical principle of the data storage method provided by the embodiment of the application is that graph data in a compact memory format is stored based on a memory pool technology, a target memory unit is a memory unit pre-allocated in a target memory pool, and the size of the memory unit is determined based on the type identifier.
After the graph data is preprocessed to obtain the data to be stored, a memory unit needs to be applied for the graph data in a memory space. In the embodiment of the application, the memory units are allocated based on the memory pools, a target memory pool for storing data to be stored corresponding to the type identifier is determined in a plurality of pre-applied memory pools according to the type identifier of the data to be stored, and then a target memory unit is determined in the plurality of pre-allocated memory units in the target memory pool. In the embodiment of the application, in order to avoid memory waste, when the memory units are pre-allocated, the memory units with proper sizes are pre-allocated based on the memory occupation amount of the data to be stored.
For example, for a certain memory unit, the memory format corresponds to the binary coding format, and mainly includes a fixed field portion, a fixed-length attribute field portion, and a variable-length attribute field portion. In binary encoding of graph data, the memory size required for storing data of a fixed field and data of a fixed-length attribute is fixed, and the memory size required for storing data of a variable-length attribute is not fixed. In the embodiment of the application, an extra memory space is applied for storing the data with the variable length attribute, a mapping relation is established between the address of the data with the variable length attribute and the set field in the memory unit, and the length of the set field is fixed, so that the size of the pre-allocated memory unit can be determined based on the length of the fixed field, the length of the fixed length attribute field and the length of the set field.
In the embodiment of the application, the size of the memory space is determined based on the type identifier of the data to be stored, so that extra memory space is not occupied, memory waste can be reduced, and the memory utilization rate is improved.
Step S103, acquiring mapping data of variable-length data in the data to be stored and fixed-length data in the data to be stored.
After the type identification of the data to be stored is determined, the storage format of the data to be stored can be determined, and based on the storage format, fixed field data, fixed length attribute data and variable length attribute data of the data to be stored are determined; and determining the fixed field data and the fixed length attribute data as fixed length data in the data to be stored. Determining variable length attribute data as variable length data in the data to be stored; and mapping the variable length data into an address pointer, and determining the address pointer as mapping data of the variable length data in the data to be stored.
And step S104, determining the fixed-length data and the mapping data as target data.
In practical implementation, the data to be stored can be divided into fixed-length data and variable-length data by combining with the type identifier of the data to be stored, the memory occupancy of the fixed-length data during storage is fixed, and the memory occupancy of the variable-length data during storage is variable. Applying for an extra memory space to store data with variable length, and introducing at least one address pointer into a memory unit to point the address pointer to the memory address of the variable length data. And determining the fixed-length data and the address pointer in the data to be stored as target data.
For example, the data of a user to be stored includes the user's id (vertex1_id), the user's weight (vertex1_weight), the user's age (vertex1_age), and the user's address (vertex1_address), where vertex1_id and vertex1_weight are fixed fields, vertex1_age is a fixed-length attribute, and vertex1_address is a variable-length attribute. The memory occupied by the fixed fields and the fixed-length attribute is fixed, while the memory occupied by the variable-length attribute is not. In the embodiment of the present application, a memory space is additionally applied for outside the memory pool to store the variable-length attribute data, and an address pointer vary_attr_1 is introduced into the memory unit to point to the address of vertex1_address in the additional memory space, so that the variable-length attribute is mapped to a fixed length. Thus, from the data to be stored with variable memory occupancy, namely vertex1_id, vertex1_weight, vertex1_age, and vertex1_address, the obtained target data with fixed memory occupancy are vertex1_id, vertex1_weight, vertex1_age, and vary_attr_1.
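To make this mapping concrete, the C++ sketch below lays out one such fixed-size memory unit for the "user" vertex example above. The concrete field types and the helper function are assumptions made only for illustration; they are not definitions taken from the present application.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Sketch of a fixed-size memory unit for the "user" vertex example (field types are assumed).
// The variable-length address attribute is replaced by an address pointer, so every unit
// of this vertex type occupies the same number of bytes.
struct UserVertexUnit {
    uint64_t vertex1_id;      // fixed field: vertex id
    float    vertex1_weight;  // fixed field: vertex weight
    int32_t  vertex1_age;     // fixed-length attribute
    char*    vary_attr_1;     // mapping data: address of vertex1_address outside the pool
};

// Copy the fixed-length data into the unit and map the variable-length address
// to extra memory space applied for outside the memory pool.
void storeUserVertex(UserVertexUnit& unit, uint64_t id, float weight,
                     int32_t age, const std::string& address) {
    unit.vertex1_id     = id;
    unit.vertex1_weight = weight;
    unit.vertex1_age    = age;
    unit.vary_attr_1    = new char[address.size() + 1];  // extra memory space for the variable-length data
    std::memcpy(unit.vary_attr_1, address.c_str(), address.size() + 1);
}
```

With such a layout, every unit of the "user" type has the same size, so units allocated from the same memory pool sit at contiguous addresses.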
Step S105, storing the target data in the target memory unit.
Here, the memory footprint of the target data is equal to the memory size of the target memory cell. Therefore, the target data with the same type identification is stored in the memory units with the same size, so that the memory addresses among the memory units in the memory pool are continuous, the spatial locality of the data is improved, and the cache hit rate is improved.
According to the data storage method provided by the embodiment of the application, data to be stored and the type identification of the data to be stored are obtained through data storage equipment; determining a target memory unit from a target memory pool based on the type identifier, wherein the target memory pool is a memory pool corresponding to the type identifier; acquiring mapping data of variable-length data in the data to be stored and fixed-length data in the data to be stored, and determining the fixed-length data and the mapping data as target data; and storing the target data to the target memory unit, wherein the memory occupation amount of the target data is equal to the memory size of the target memory unit. Therefore, the target data with the same type of identification is stored in the memory units with the same size, so that the memory addresses among the memory units in the memory pool are continuous, the spatial locality of the data is improved, and the cache hit rate is improved; and the memory waste can be reduced, thereby improving the memory utilization rate.
In an implementation manner, the step S101 "obtaining data to be stored and a type identifier of the data to be stored" in the embodiment shown in fig. 1 may be implemented as the following steps:
in step S1011, a file to be stored is acquired.
The file to be stored comprises at least one source data, the source data comprises a type identifier, the source data is at least graph data, and the source data with the same type identifier has the same storage format.
In some embodiments, the file to be stored includes at least a graph data file, i.e., the source data is at least graph data. The graph data includes vertex data and edge data. There may be one or more files to be stored. The type identifiers of the source data in a file to be stored may all be the same, partially the same, or all different. For example, when the file to be stored is a file storing graph data and all the source data in the file are vertices with the same type identifier, all the source data of the file are the vertex "user1", whose type identifier is "vertex_type_layer1". As another example, when the type identifiers of the source data are not all the same, the source data of the file to be stored include the vertex "user1", whose type identifier is "vertex_type_layer1", and the vertex "commodity", whose type identifier is "vertex_type_goods". The source data in a file to be stored may also include both vertices and edges: the vertex "user1" with type identifier "vertex_type_layer1", the vertex "commodity" with type identifier "vertex_type_goods", and the edge "purchase" with type identifier "edge_type_buy". The embodiment of the present application does not limit the type identifiers of the source data contained in the file to be stored.
Step S1012, performing number system conversion processing on the source data to obtain target number system data to be stored.
Because computers store data in a binary format — for example, the graph data loaded by the graph computing system is in a binary data format — the vertex data and edge data of the graph need to be binary coded.
For example, the binary data format of a vertex includes three parts: a fixed field, fixed-length attributes, and variable-length attributes. The fixed field mainly comprises vertex_type, vertex_id, and vertex_weight, where vertex_type represents the type of the vertex, vertex_id represents the vertex id, and vertex_weight represents the weight of the vertex. A fixed-length attribute is an attribute of a basic data type such as int or float. A variable-length attribute is an attribute whose data type is an array, a character string, or the like. Each variable-length attribute includes two fields, a variable_attr_len field and a variable_attr_val field, where variable_attr_len indicates the number or length of the variable-length attribute elements and variable_attr_val indicates the variable-length attribute value. Table 1 shows the vertex binary data format.
TABLE 1 binary data format for vertices
(Table image not reproduced: the table lists the fixed fields vertex_type, vertex_id and vertex_weight, the fixed-length attributes, and, for each variable-length attribute, the variable_attr_len and variable_attr_val fields.)
Step S1013, determining the type identifier of the source data as the type identifier corresponding to the data to be stored.
Converting the number system of the source data does not change its inherent type identifier, and based on that inherent type identifier, the type identifier of the source data is determined as the type identifier of the data to be stored. In this way, the data to be stored and its type identifier are obtained from the file to be stored.
On the basis of the embodiment shown in fig. 1, the embodiment of the present application further provides a data storage method. Fig. 2 is a schematic flow chart of another implementation of the data storage method according to the embodiment of the present application, and as shown in fig. 2, the data storage method includes the following steps:
step S201, acquiring data to be stored and a type identifier of the data to be stored.
In the embodiment of the present application, step S201, step S204 to step S206 correspond to step S101, step S103 to step S105, respectively, and the implementation manner and effect of step S201, step S204 to step S206 refer to the description of step S101, step S103 to step S105.
The following steps S202 to S203 are one implementation of the step S102.
Step S202, determining a target memory pool according to the type identification of the data to be stored.
Here, the target memory pool is a memory pool corresponding to the type identifier. The core technical principle of the data storage method provided by the embodiment of the application is to realize the storage of graph data in a compact memory format based on a memory pool technology.
Step S203, determining a target memory unit from the target memory pool.
Here, the target storage unit is a pre-allocated memory unit in the target memory pool, and the size of the memory unit is determined based on the type identifier.
After the graph data is preprocessed to obtain the data to be stored, a memory unit needs to be applied for the graph data in a memory space. In the embodiment of the application, the memory units are allocated based on the memory pools, a target memory pool for storing data to be stored corresponding to the type identifier is determined in a plurality of pre-applied memory pools according to the type identifier of the data to be stored, and then a target memory unit is determined in the plurality of pre-allocated memory units in the target memory pool. In the embodiment of the application, in order to avoid memory waste, when the memory units are pre-allocated, the memory units with proper sizes are pre-allocated based on the memory occupation amount of the data to be stored.
In the embodiment of the application, the size of the memory space is determined based on the type identifier of the data to be stored, so that extra memory space is not occupied, memory waste can be reduced, and the memory utilization rate is improved.
Step S204, obtaining the mapping data of the variable-length data in the data to be stored and the fixed-length data in the data to be stored.
Step S205, determining the fixed-length data and the mapping data as target data.
In step S206, the target data is stored in the target memory unit.
And the memory occupation amount of the target data is equal to the memory size of the target memory unit.
According to the data storage method provided by the embodiment of the application, data to be stored and the type identification of the data to be stored are obtained through data storage equipment; determining a target memory pool according to the type identifier of the data to be stored, wherein the target memory pool is a memory pool corresponding to the type identifier of the data to be stored; determining a target memory unit from the target memory pool, wherein the target memory unit is a memory unit pre-allocated in the target memory pool, and the size of the memory unit is determined based on the type identifier; acquiring mapping data of variable-length data in the data to be stored and fixed-length data in the data to be stored; determining the fixed-length data and the mapping data as target data; and storing the target data to the target memory unit, wherein the memory occupation amount of the target data is equal to the memory size of the target memory unit. Therefore, the target data with the same type of identification is stored in the memory units with the same size, so that the memory addresses among the memory units in the memory pool are continuous, the spatial locality of the data is improved, and the cache hit rate is improved; and the size of the memory space is determined based on the type identifier, so that the extra memory space is not occupied, the memory waste can be reduced, and the memory utilization rate is improved.
In some embodiments, the step S202 "determining the target memory pool according to the type identifier of the data to be stored" in the embodiment shown in fig. 2 may be implemented by:
step S2021, determining a type identifier set corresponding to the file to be stored based on the type identifier of each source data in the file to be stored.
The file to be stored includes a plurality of graph data items, whose type identifiers may all be the same, partially the same, or all different. For example, the graph data includes 3 vertex data items and 2 edge data items, where 2 of the vertex data items are "user" vertices with type identifier "vertex_type_layer" and 1 is a "commodity" vertex with type identifier "vertex_type_goods"; the 2 edge data items are "purchase" edges with type identifier "edge_type_buy".
All the type identifiers in the file to be stored form a type identifier set; in the above example, the type identifier set is {"vertex_type_layer", "vertex_type_goods", "edge_type_buy"}.
Step S2022, allocating a corresponding memory pool to each type identifier in the set of type identifiers in the memory space, to obtain N memory pools.
And N is the number of the type identifiers in the type identifier set, and N is a positive integer.
A memory pool is allocated in the memory space for each type identifier, yielding a corresponding number of memory pools, where each memory pool is used to store data with the corresponding type identifier. For example, according to the type identifier set above, 3 memory pools are allocated: the memory pool for the type identifier "vertex_type_layer" stores the "user" vertex data, the memory pool for "vertex_type_goods" stores the "commodity" vertex data, and the memory pool for "edge_type_buy" stores the "purchase" edge data.
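A minimal sketch of this per-type organization is given below, under the assumptions that type identifiers are plain strings and that a pool only needs to record its fixed unit size and its backing storage; the names used here are illustrative, not taken from the application.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch: one memory pool per type identifier.
struct SimplePool {
    std::size_t unitSize = 0;    // fixed memory footprint of one unit of this type
    std::vector<char> storage;   // backing memory; units carved from it are contiguous
};

// Build N pools, one for each type identifier in the type identifier set.
std::unordered_map<std::string, SimplePool> buildPools(
        const std::unordered_map<std::string, std::size_t>& unitSizeByTypeId) {
    std::unordered_map<std::string, SimplePool> pools;
    for (const auto& [typeId, unitSize] : unitSizeByTypeId) {
        pools[typeId].unitSize = unitSize;
    }
    return pools;
}
```

Selecting the target memory pool for a piece of data to be stored then reduces to a lookup such as pools.at("vertex_type_layer").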
Step S2023, determining a memory pool corresponding to the type identifier of the data to be stored in the N memory pools as a target memory pool.
For example, if the type identifier of the data to be stored is "vertex_type_layer", the memory pool with that type identifier is determined as the target memory pool.
In some embodiments, the step S203 "determining a target memory unit from the target memory pool" in the embodiment shown in fig. 2 can be implemented by:
step S2031, obtaining a storage format predefined for the type identifier.
The type identifiers may be predefined by a designer. When predefining a type identifier, the designer also sets the storage format for that type identifier, i.e., the basic data type, memory footprint, and so on of each field. For example, the type identifier of "user" is "vertex_type_layer", with 2 fixed-length attributes and 1 variable-length attribute, and the storage format is as shown in Table 2:
TABLE 2 Storage format of the "user" vertex type
(Table image not reproduced: fixed fields vertex_type (2 bytes), vertex_id (1 byte), vertex_weight (1 byte); fixed-length attributes fix_attr_1 (4 bytes) and fix_attr_2 (1 byte); one variable-length attribute.)
Step S2032, pre-allocating at least one memory unit in the target memory pool based on the storage format.
After the storage format is determined, the memory occupation amount of the fixed-length data and the variable-length data can be determined, so that the memory occupation amount of the storage format can be determined.
Step S2033, determine a pre-allocated free memory unit as a target memory unit.
Before storing the data to be stored, it is judged whether a free memory unit exists in the target memory pool. If so, the free memory unit is determined as the target memory unit; if not, the pre-allocation operation needs to be performed again to ensure that free memory units are available to store the data to be stored.
In some embodiments, the step S2032 of pre-allocating at least one memory unit in the target memory pool based on the storage format may be implemented as:
step S20321, based on the storage format, determining the byte length of the predefined fixed field, the byte length of the fixed-length attribute, and the number M of the variable-length attributes.
Here, M is a natural number.
After the storage format is determined, the memory occupancies of the fixed-length data and the variable-length data can be determined. For example, as described above, the memory occupancies of the fixed fields vertex_type, vertex_id, and vertex_weight are 2 bytes, 1 byte, and 1 byte, respectively, and the memory occupancies of the fixed-length attributes fix_attr_1 and fix_attr_2 are 4 bytes and 1 byte, respectively. The number of variable-length attributes is 1.
Step S20322, determining the memory occupation amount of the target data corresponding to the data to be stored according to the byte length of the fixed field, the byte length of the fixed-length attribute, and the byte lengths of the M address pointers.
Because the memory occupancy of variable-length data is variable when stored, in the embodiment of the present application an additional memory space is applied for to store the variable-length part of the data, and an address pointer pointing to the memory address of the variable-length data is introduced into the memory unit. Storing one address pointer occupies 4 bytes, and the number of address pointers is determined by the number of variable-length attributes, so the memory occupancy required to store the address pointers corresponding to the variable-length attributes can be determined. In the above example, the number of variable-length attributes is 1, so one address pointer field with a memory occupancy of 4 bytes is allocated in the memory unit.
Thus, the memory footprint of this storage format is 13 (= 2 + 1 + 1 + 4 + 1 + 4) bytes.
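The same calculation can be written down as a small compile-time function; the 4-byte pointer width is the value used in the example above and is treated here only as an assumption.

```cpp
#include <cstddef>

// Unit footprint = fixed-field bytes + fixed-length-attribute bytes + one pointer per
// variable-length attribute (pointer width taken from the example above).
constexpr std::size_t unitFootprint(std::size_t fixedFieldBytes,
                                    std::size_t fixedAttrBytes,
                                    std::size_t numVarAttrs,
                                    std::size_t pointerBytes = 4) {
    return fixedFieldBytes + fixedAttrBytes + numVarAttrs * pointerBytes;
}

// The "user" type above: 2 + 1 + 1 fixed-field bytes, 4 + 1 fixed-length-attribute bytes,
// and 1 variable-length attribute mapped to a 4-byte address pointer.
static_assert(unitFootprint(2 + 1 + 1, 4 + 1, 1) == 13, "matches the 13-byte example");
```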
Step S20323, pre-allocating at least one memory unit in the target memory pool based on the memory footprint.
Based on the determined memory footprint, a plurality of memory cells having a size equal to the determined memory footprint are pre-allocated in the target memory pool, e.g., a plurality of memory cells having a size of 13 bytes are pre-allocated.
In some embodiments, pre-allocating memory units in the target memory pool may be implemented as: when no idle memory unit exists in the target memory pool, a memory block is pre-allocated in the target memory pool; and dividing the memory block into at least one memory unit based on the memory occupation amount.
In some embodiments, the step S103 in the embodiment shown in fig. 1 or the step S205 "acquiring the mapping data of the variable-length data in the data to be stored and the fixed-length data in the data to be stored" in the embodiment shown in fig. 2 may be implemented as the following steps:
and step S1031, based on the storage format, determining the fixed field data, the fixed length attribute data and the variable length attribute data of the data to be stored.
And according to a preset storage format, dividing the fixed field, the fixed length attribute and the variable length attribute in the data to be stored to obtain the fixed field data, the fixed length attribute data and the variable length attribute data.
Step S1032, determine the fixed field data and the fixed length attribute data as fixed length data in the data to be stored.
And taking the fixed field data and the fixed length attribute data in the data to be stored as fixed length data and directly storing the fixed field data and the fixed length attribute data in the target memory unit.
Step S1033, determining M variable length attribute data as M variable length data in the data to be stored.
Step S1034, map the M variable length data into M address pointers.
In step S1035, M address pointers are determined as mapping data.
The M address pointers point to memory addresses of the M variable length attribute data respectively.
In the embodiment of the present application, additional memory space is applied for to store the M variable-length data items, and M address pointers are introduced into the target memory unit, each pointing to the memory address at which the corresponding variable-length data item is stored in the additionally applied memory space. The M variable-length data items are respectively mapped to the M address pointers, yielding fixed-length mapping data, which is stored in the target memory unit.
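A hedged sketch of steps S1031 to S1035 is given below, treating the memory unit as a raw byte buffer; the buffer-based interface is an assumption made for illustration rather than the application's own interface.

```cpp
#include <cstring>
#include <vector>

// Copy the fixed-length data into the target memory unit, then, for each of the M
// variable-length attributes, copy the value into extra memory space outside the pool
// and write only its address (the mapping data) into the unit.
void storeTargetData(char* targetUnit,                                 // pre-allocated unit in the target pool
                     const std::vector<char>& fixedLengthData,         // fixed fields + fixed-length attributes
                     const std::vector<std::vector<char>>& varLenData) // M variable-length attribute values
{
    std::memcpy(targetUnit, fixedLengthData.data(), fixedLengthData.size());
    char* cursor = targetUnit + fixedLengthData.size();
    for (const auto& attr : varLenData) {
        char* extra = new char[attr.size()];           // additionally applied memory space
        std::memcpy(extra, attr.data(), attr.size());
        std::memcpy(cursor, &extra, sizeof(extra));    // store the address pointer, a fixed-length field
        cursor += sizeof(extra);
    }
}
```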
On the basis of the embodiment shown in fig. 1, the embodiment of the present application further provides a data storage method. Fig. 3 is a schematic flow chart of another implementation of the data storage method according to the embodiment of the present application, and as shown in fig. 3, the data storage method includes the following steps:
step S301, acquiring a file to be stored.
The file to be stored comprises at least one source data, the source data comprises a type identifier, the source data is at least graph data, and the source data with the same type identifier has the same storage format.
Step S302, carrying out digital system conversion processing on the source data to obtain target digital system data to be stored.
Step S303, determining the type identifier of the source data as the type identifier of the corresponding data to be stored.
Step S304, based on the type identifier of each source data in the file to be stored, determining a type identifier set corresponding to the file to be stored.
Step S305, allocating a corresponding memory pool to each type identifier in the set of type identifiers in a memory space, to obtain N memory pools.
And N is the number of the type identifiers in the type identifier set, and N is a positive integer.
Step S306, determining a memory pool corresponding to the type identifier of the data to be stored in the N memory pools as a target memory pool.
Step S307, obtaining a predefined storage format for the type identifier.
Step S308, based on the storage format, determining the byte length of the predefined fixed field, the byte length of the fixed-length attribute and the number M of the variable-length attributes.
Here, M is a natural number.
Step S309, determining the memory occupation amount of the target data corresponding to the data to be stored according to the byte length of the fixed field, the byte length of the fixed-length attribute and the byte lengths of the M address pointers.
Step S310, pre-allocating at least one memory unit in the target memory pool based on the memory occupation amount.
When no idle memory unit exists in the target memory pool, a memory block is pre-allocated in the target memory pool; and dividing the memory block into at least one memory unit based on the memory occupation amount.
In step S311, a pre-allocated free memory unit is determined as a target memory unit.
Step S312, determining the fixed field data, the fixed length attribute data, and the variable length attribute data of the data to be stored based on the storage format.
Step 313, determining the fixed field data and the fixed length attribute data as fixed length data in the data to be stored.
In step S314, the M variable length attribute data are determined as M variable length data in the data to be stored.
Step S315, mapping the M variable length data into M address pointers.
In step S316, M address pointers are determined as mapping data.
Here, the M address pointers point to memory addresses of the M variable length attribute data, respectively.
Step S317, determining the fixed length data and the mapping data as target data.
Step S318, store the target data to the target memory unit.
Here, the memory footprint of the target data is equal to the memory size of the target memory cell.
According to the data storage method provided by the embodiment of the application, the target data with the same type of identification is stored in the memory units with the same size, so that the memory addresses among the memory units in the memory pool are continuous, the spatial locality of the data is improved, and the cache hit rate is improved; and the size of the memory space is determined based on the type identifier, so that the extra memory space is not occupied, the memory waste can be reduced, and the memory utilization rate is improved.
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
A graph is an abstract data structure commonly used in computer science, with general representation capability in both structure and semantics. Graph-structured data is ubiquitous in everyday life. For example, in an e-commerce scenario, users and commodities can be regarded as two types of vertices, and the relationships between them, such as browsing, adding to cart, and purchasing, can be regarded as different types of edges.
A graph computing system is a system that analyzes and computes graph-structured data. As the scale of graph data grows, the performance requirements on graph computing systems become increasingly demanding. Graph data storage is an important component of a graph computing system, and its memory format strongly affects the processing scale and query efficiency of the system. An efficient graph data memory format can improve graph computation performance at a fine granularity.
In current graph computing systems, graph data is typically stored in a loose-form memory format, and point and edge objects are constructed and stored based on a structured data structure. The loose memory format has two disadvantages:
1) Mechanisms such as byte alignment and standard-container memory allocation cause memory waste and a low memory utilization rate.
2) Memory addresses are discontinuous, the spatial locality of the data is poor, and random access suffers a low cache hit rate.
The core technical principle of the embodiment of the application is to realize high-performance graph data storage in a compact memory format based on a memory pool technology. Fig. 4 is a schematic flow chart of an implementation of the graph data storage method according to the embodiment of the present application, and as shown in fig. 4, the graph data storage method in the compact memory format according to the embodiment of the present application includes the following steps:
step S401, binary coding of the image data.
The graph data is preprocessed and stored in a binary format on the disk.
Step S402, calculating the memory size of each type of point and edge and pre-allocating the corresponding memory unit.
The graph computing system computes the memory size of each type of point and edge according to the type of the point and the edge defined by a user, and pre-allocates memory units of various types of points and edges in a memory pool.
In step S403, a memory cell of a specified type is selected to store corresponding binary data of a graph.
The graph computing system loads graph binary data, acquires corresponding memory units from the memory pool according to the point and edge types, and directly stores the corresponding binary data.
Based on the technology, on one hand, the memory waste can be reduced, and the memory utilization rate is improved; on the other hand, the storage addresses of the vertex data or the edge data of the same type are similar, so that the space locality is good, and the cache hit rate can be improved. The complete technical solution of the embodiments of the present application will be described in detail below.
1. Binary-coding-based data preprocessing techniques.
The graph data loaded by the graph computing system is in a binary format, so that the point and edge source data needs to be binary coded. The specific format is as follows:
1) Vertex binary format: the binary data format of a vertex comprises three parts: a fixed field, fixed-length attributes, and variable-length attributes.
Fig. 5 is a schematic diagram of the vertex binary encoding format. As shown in Fig. 5, the fixed field of a vertex mainly comprises vertex_type, vertex_id, and weight, where vertex_type represents the type of the vertex, vertex_id represents the vertex id, and weight represents the weight of the vertex. A fixed-length attribute of a vertex is an attribute of a basic data type such as char, int, or float. A variable-length attribute of a vertex is an attribute whose data type is an array, a character string, or the like; each variable-length attribute comprises a variable_attr_len field and a variable_attr_val field, where variable_attr_len represents the number or length of the variable-length attribute elements and variable_attr_val represents the variable-length attribute value.
2) Edge binary format: the binary data format of an edge comprises three parts: a fixed field, fixed-length attributes, and variable-length attributes.
Fig. 6 is a schematic diagram of the edge binary encoding format. As shown in Fig. 6, the fixed field of an edge mainly comprises edge_type, vertex1_id, vertex2_id, and weight, where edge_type represents the type of the edge, vertex1_id and vertex2_id represent the ids of the two vertices of the edge, and weight represents the weight of the edge. A fixed-length attribute of an edge is an attribute of a basic data type such as char, int, or float. A variable-length attribute of an edge is an attribute whose data type is an array, a character string, or the like; each variable-length attribute comprises a variable_attr_len field and a variable_attr_val field, where variable_attr_len represents the number or length of the variable-length attribute elements and variable_attr_val represents the variable-length attribute value.
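For illustration only, the sketch below serializes one edge into this layout. The concrete field widths (a 2-byte type, 8-byte vertex ids, 4-byte weight and length fields) are assumptions, since the application does not fix them for the edge format.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Append raw bytes to the output buffer.
void appendBytes(std::vector<char>& out, const void* src, std::size_t n) {
    const char* p = static_cast<const char*>(src);
    out.insert(out.end(), p, p + n);
}

// Encode one edge: fixed fields, then fixed-length attributes, then for each
// variable-length attribute a variable_attr_len field followed by variable_attr_val.
std::vector<char> encodeEdge(uint16_t edgeType, uint64_t vertex1Id, uint64_t vertex2Id,
                             float weight, const std::vector<int32_t>& fixedAttrs,
                             const std::vector<std::string>& varAttrs) {
    std::vector<char> buf;
    appendBytes(buf, &edgeType, sizeof(edgeType));     // fixed field: edge_type
    appendBytes(buf, &vertex1Id, sizeof(vertex1Id));   // fixed field: vertex1_id
    appendBytes(buf, &vertex2Id, sizeof(vertex2Id));   // fixed field: vertex2_id
    appendBytes(buf, &weight, sizeof(weight));         // fixed field: weight
    for (int32_t attr : fixedAttrs) {                  // fixed-length attributes
        appendBytes(buf, &attr, sizeof(attr));
    }
    for (const std::string& val : varAttrs) {          // variable-length attributes
        uint32_t len = static_cast<uint32_t>(val.size());
        appendBytes(buf, &len, sizeof(len));           // variable_attr_len
        appendBytes(buf, val.data(), val.size());      // variable_attr_val
    }
    return buf;
}
```

The vertex format is encoded the same way, with vertex_type, vertex_id, and weight as the fixed fields.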
2. Compact memory format based on memory pool technology
In the related art, the memory is distributed and released on the heap by using the memory management function of the system, so that on one hand, the memory waste is caused, and the memory utilization rate is reduced; on the other hand, the address is discontinuous due to repeated memory application, the locality of the data space is poor, and the cache hit rate is reduced.
The embodiment of the present application implements a compact memory format based on memory pool technology, which reduces memory waste, increases data locality, and achieves high-performance graph data storage. Fig. 7 is a schematic diagram of storing different types of vertices into corresponding memory pools, and Fig. 8 is a schematic diagram of storing different types of edges into corresponding memory pools. As shown in Fig. 7 and Fig. 8, different types of vertices or edges have different binary encoding formats, so an independent memory pool is created for each type. A memory pool of a given type consists of multiple memory blocks (chunks), and each chunk is subdivided into multiple memory units. The memory format of a memory unit corresponds to the binary encoding format of a vertex or an edge and mainly comprises three parts: the fixed field, the fixed-length attributes, and the variable-length attributes. In the binary encoding of graph data, the memory required by the fixed field and the fixed-length attributes is fixed, while the memory occupied by the variable-length attribute fields is not. To ensure that the memory units in the same memory pool have the same size, extra memory space is applied for to store the values of the variable-length attributes, and each variable-length attribute field inside a memory unit is fixed at 8 bytes and stores the address of the variable-length attribute data.
Fig. 9 is a schematic diagram of a working principle flow of a memory pool provided in an embodiment of the present application, and as shown in fig. 9, the working principle flow of the memory pool includes the following steps:
in step S901, the number of types of points and edges is obtained.
Wherein the number of types of points and edges is predefined by the user.
For example, two types of points, commodity and user, are predefined, and three types of edges, browsing, adding to cart, and purchasing, are predefined. The number of point types obtained is 2, and the number of edge types obtained is 3.
Step S902, a memory pool with a corresponding number is created.
5 (= 2 + 3) memory pools are created, of which 2 are used to store point-type graph data and 3 are used to store edge-type graph data.
Step S903, initializing the memory pool and applying for the memory block.
Initializing the memory pool, and applying for a memory chunk from the system.
In step S904, the memory size required for the type point or edge is calculated.
And calculating the required memory size according to the number of the fields of the type point or edge and the data type defined by the user.
In step S905, the memory block is subdivided into a plurality of memory units with corresponding sizes.
The memory block (chunk) is subdivided into a plurality of memory units, each of which is equal in size to the memory required by the corresponding type of point or edge.
In step S906, it is determined whether there is a free memory cell.
If the memory block has an idle memory unit, step S907 is entered to obtain an idle memory unit and allocate the idle memory unit to the applicant; if the memory block has no free memory unit, step S908 is performed to apply for a memory block from the system again.
In step S907, a free memory unit is acquired and allocated to the caller.
Here, the free memory unit is allocated to the caller, i.e., to the applicant, as a pre-allocated memory unit.
In step S908, a new memory block is applied.
After the new memory block is applied for, the process returns to step S905 to subdivide the new memory block. A minimal code sketch of this overall flow follows.
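The following is a minimal C++ sketch tying steps S901 to S908 together; the class and function names, the chunk capacity, and the use of malloc/free are assumptions made for illustration rather than the patent's required implementation.

    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    // One pool per vertex or edge type: each pool holds chunks obtained from the
    // system, and every chunk is subdivided into units of the same fixed size.
    class MemoryPool {
    public:
        explicit MemoryPool(std::size_t unit_size, std::size_t units_per_chunk = 1024)
            : unit_size_(unit_size), units_per_chunk_(units_per_chunk) {}

        // The pool owns raw chunks, so copying is disabled.
        MemoryPool(const MemoryPool&) = delete;
        MemoryPool& operator=(const MemoryPool&) = delete;

        ~MemoryPool() {
            for (void* chunk : chunks_) std::free(chunk);
        }

        // Steps S906-S908: hand out a free unit, applying for a new chunk if none is left.
        void* allocate() {
            if (free_units_.empty()) addChunk();   // S908: apply for a new memory block
            void* unit = free_units_.back();       // S907: take a free memory unit
            free_units_.pop_back();
            return unit;
        }

    private:
        // Steps S903 and S905: apply for a chunk from the system and subdivide it into units.
        void addChunk() {
            char* chunk = static_cast<char*>(std::malloc(unit_size_ * units_per_chunk_));
            chunks_.push_back(chunk);
            for (std::size_t i = 0; i < units_per_chunk_; ++i)
                free_units_.push_back(chunk + i * unit_size_);
        }

        std::size_t unit_size_;
        std::size_t units_per_chunk_;
        std::vector<void*> chunks_;
        std::vector<void*> free_units_;
    };

For the example above with two point types and three edge types (steps S901 and S902), five such pools would be created at initialization, each constructed with the unit size calculated in step S904 for its type.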
When the graph computing system is initialized, the required memory pools are created through the above steps, and memory units of fixed sizes are pre-allocated; loading of the binary graph data then begins. Fig. 10 is a schematic diagram of a loading flow of graph data according to an embodiment of the present application; as shown in fig. 10, the loading process of graph data includes the following steps:
step S1001, a file of a vertex or an edge to be loaded is acquired.
Step S1002, the binary file is read.
Step S1003, determine whether the end of the file is reached.
When the end of the file is reached, the process proceeds to step S1007 and the storage ends; when the end of the file is not reached, the process proceeds to step S1004 and the storage continues.
Step S1004, the type of the current vertex or edge is read.
Step S1005, applying for a memory cell from the corresponding memory pool according to the type.
In step S1006, binary data is loaded into the memory unit.
Step S1007, end.
The number of fields and the data types of the current type of vertex or edge are obtained according to the type, the binary data segment of that vertex or edge is obtained according to the number of fields and the data types, and the binary data is stored directly into the applied memory unit without being parsed.
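A minimal C++ sketch of the loading loop in steps S1001 to S1007 follows, building on the MemoryPool sketch above. The on-disk record format assumed here (a 4-byte type identifier followed by a payload whose size equals the fixed unit size of that type) and all names are assumptions for illustration; the essential point is only that each record is copied into a unit of the matching pool without being parsed.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <fstream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Steps S1001-S1007: read the binary file record by record, look at the type
    // of the current vertex or edge, apply for a unit from the pool of that type,
    // and copy the bytes into the unit directly, without parsing them.
    void loadGraphFile(const std::string& path,
                       std::unordered_map<int, MemoryPool>& pools,
                       const std::unordered_map<int, std::size_t>& unit_size_of_type) {
        std::ifstream in(path, std::ios::binary);                              // S1001, S1002
        std::int32_t type_id = 0;
        while (in.read(reinterpret_cast<char*>(&type_id), sizeof(type_id))) {  // S1003, S1004
            std::size_t size = unit_size_of_type.at(type_id);
            std::vector<char> record(size);
            in.read(record.data(), static_cast<std::streamsize>(size));
            void* unit = pools.at(type_id).allocate();                         // S1005
            std::memcpy(unit, record.data(), size);                            // S1006
        }                                                                      // loop ends at end of file: S1007
    }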
Memory units are allocated out of large memory blocks, and the memory units within the same memory block have contiguous addresses, which improves the spatial locality of the data. Meanwhile, the size of each memory unit is calculated strictly from the defined point and edge types, so no extra memory space is occupied and the memory utilization rate is improved.
The embodiment of the application mainly realizes a compact, high-performance in-memory format for graph data based on the memory pool technique. First, plaintext graph data is converted into a binary encoding format through data preprocessing. The memory size of each type of vertex and edge is then determined according to the defined vertex and edge formats. Memory units of the corresponding sizes are then pre-allocated based on the memory pool idea. When graph data is loaded, the type of the corresponding vertex or edge is obtained first, a free memory unit in the memory pool of the corresponding type is obtained, and the binary data is stored directly into that memory unit. The scheme can reduce memory waste, achieves a high memory utilization rate, improves the spatial locality of the data, and yields a high cache hit rate.
Based on the foregoing embodiments, the embodiments of the present application provide a data storage device, where the modules included in the device and the units included in the modules may be implemented by a processor in a computer device; of course, the implementation can also be realized through a specific logic circuit; in the implementation process, the processor may be a Central Processing Unit (CPU), a Microprocessor Unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.
Fig. 11 is a schematic structural diagram of a data storage device provided in an embodiment of the present application, and as shown in fig. 11, the data storage device 110 includes:
a first obtaining module 111, configured to obtain data to be stored and a type identifier of the data to be stored;
a first determining module 112, configured to determine a target memory pool according to the type identifier of the data to be stored, where the target memory pool is a memory pool corresponding to the type identifier of the data to be stored;
a second determining module 113, configured to determine a target memory unit from the target memory pool, where the target memory unit is a memory unit pre-allocated in the target memory pool, and the size of the memory unit is determined based on the data to be stored;
a second obtaining module 114, configured to obtain target data based on the data to be stored;
a storage module 115, configured to store the target data to the target memory unit, where a memory footprint of the target data is equal to a memory size of the target memory unit.
In some embodiments, the first obtaining module 111 is further configured to:
acquiring a file to be stored, wherein the file to be stored comprises at least one source data, the source data comprises a type identifier, and the source data with the same type identifier has the same storage format;
performing number-system conversion processing on the source data to obtain the data to be stored in a target number system;
and determining the type identifier of the source data as the type identifier corresponding to the data to be stored.
In some embodiments, the first determining module 112 is further configured to:
determining a type identifier set corresponding to the file to be stored based on the type identifier of each source data in the file to be stored;
allocating a corresponding memory pool for each type identifier in the type identifier set in a memory space to obtain N memory pools; wherein, N is the number of the type identifiers in the type identifier set, and N is a positive integer;
and determining a memory pool corresponding to the type identifier of the data to be stored in the N memory pools as a target memory pool.
In some embodiments, the second determining module 113 is further configured to:
acquiring a storage format predefined for the type identifier of the data to be stored;
pre-allocating at least one memory unit in the target memory pool based on the storage format;
and determining a free memory unit pre-allocated in the target memory pool as a target memory unit.
In some embodiments, the second determining module 113 is further configured to:
determining the byte length of a predefined fixed field, the byte length of a fixed-length attribute and the number M of variable-length attributes based on the storage format, wherein M is a natural number;
determining the memory occupation amount of target data corresponding to the data to be stored according to the byte length of the fixed field, the byte length of the fixed-length attribute and the byte lengths of the M address pointers;
pre-allocating at least one memory unit in the target memory pool based on the memory footprint.
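As a worked example of the calculation described above (a sketch under the same illustrative layout used earlier): a type with an 8-byte fixed field, 8 bytes of fixed-length attributes, and M = 1 variable-length attribute yields a memory footprint of 8 + 8 + 1 × 8 = 24 bytes, since each variable-length attribute contributes one 8-byte address pointer.

    #include <cstddef>

    // Sketch of the footprint computation: fixed field bytes, plus fixed-length
    // attribute bytes, plus one 8-byte address pointer per variable-length attribute.
    std::size_t unitFootprint(std::size_t fixed_field_bytes,
                              std::size_t fixed_attr_bytes,
                              std::size_t m_variable_attrs) {
        return fixed_field_bytes + fixed_attr_bytes + m_variable_attrs * 8;
    }

    // unitFootprint(8, 8, 1) == 24, matching the 24-byte UserVertexUnit sketched earlier.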
In some embodiments, the second determining module 113 is further configured to:
when no idle memory unit exists in the target memory pool, a memory block is pre-allocated in the target memory pool;
and dividing the memory block into at least one memory unit based on the memory occupation amount.
In some embodiments, the second obtaining module 114 is further configured to:
determining a fixed field, a fixed length attribute and a variable length attribute of the data to be stored based on the storage format;
initializing M address pointers, wherein the M address pointers respectively point to M memory addresses with variable length attributes;
and determining the fixed field, the fixed length attribute and the M address pointers of the data to be stored as target data.
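A minimal sketch of how target data could be assembled for the hypothetical UserVertexUnit layout introduced earlier; the helper name, the use of new[], and the nickname field are assumptions for illustration. The fixed field and fixed-length attributes are written into the unit as-is, while the variable-length value is copied to separately applied-for memory and replaced in the unit by its address pointer.

    #include <cstdint>
    #include <cstring>
    #include <string>

    // Build target data inside a pre-allocated unit: fixed-length parts are stored
    // directly; the variable-length attribute value lives in extra memory and the
    // unit keeps only its 8-byte address.
    void writeUserVertex(UserVertexUnit* unit, std::uint64_t id, std::uint32_t age,
                         std::uint32_t level, const std::string& nickname) {
        unit->vertex_id = id;                              // fixed field
        unit->age = age;                                   // fixed-length attributes
        unit->level = level;
        char* value = new char[nickname.size() + 1];       // extra space applied for the value
        std::memcpy(value, nickname.c_str(), nickname.size() + 1);
        unit->nickname = value;                            // address pointer stored in the unit
    }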
Here, it should be noted that: the above description of the data storage device embodiment is similar to the above description of the method, with the same advantageous effects as the method embodiment. For technical details not disclosed in the embodiments of the data storage device of the present application, a person skilled in the art shall refer to the description of the embodiments of the method of the present application for understanding.
It should be noted that, in the embodiments of the present application, if the above data storage method is implemented in the form of a software functional module and is sold or used as a standalone product, it may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, the embodiments of the present application are not limited to any specific combination of hardware and software.
Accordingly, embodiments of the present application provide a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the data storage method provided in the above embodiments.
Fig. 12 is a schematic diagram of the composition structure of a data storage device provided in an embodiment of the present application. From the exemplary structure of the data storage device 120 shown in fig. 12, other exemplary structures of the data storage device 120 can be foreseen, so the structure described herein should not be considered limiting; for example, some components described below may be omitted, or components not described below may be added to meet the special requirements of certain applications.
The data storage device 120 shown in fig. 12 includes: a processor 121, at least one communication bus 122, a user interface 123, at least one external communication interface 124 and memory 125. Wherein the communication bus 122 is configured to enable connective communication between these components. The user interface 123 may include a display screen, and the external communication interface 124 may include a standard wired interface and a wireless interface, among others. Wherein the processor 121 is configured to execute the program of the data storage method stored in the memory to implement the steps in the data storage method provided by the above-mentioned embodiments.
The above description of the data storage device and storage medium embodiments is similar to the description of the method embodiments above, with similar beneficial effects as the method embodiments. For technical details not disclosed in the embodiments of the data storage device and storage medium of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a device to perform all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.
The above description is only for the embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (11)

1. A method of data storage, the method comprising:
acquiring data to be stored and a type identifier of the data to be stored;
determining a target memory unit from a target memory pool based on the type identifier, wherein the target memory pool is a memory pool corresponding to the type identifier;
acquiring mapping data of variable-length data in the data to be stored and fixed-length data in the data to be stored, and determining the fixed-length data and the mapping data as target data;
and storing the target data to the target memory unit, wherein the memory occupation amount of the target data is equal to the memory size of the target memory unit.
2. The method according to claim 1, wherein the obtaining the data to be stored and the type identifier of the data to be stored comprises:
acquiring a file to be stored, wherein the file to be stored comprises at least one source data, the source data comprises a type identifier, and the source data is at least graph data;
performing number-system conversion processing on the source data to obtain the data to be stored in a target number system;
and determining the type identifier of the source data as the type identifier corresponding to the data to be stored.
3. The method of claim 2, wherein determining the target memory unit from the target memory pool based on the type identifier comprises:
determining a target memory pool according to the type identifier of the data to be stored;
and determining a target memory unit from the target memory pool, wherein the target memory unit is a memory unit pre-allocated in the target memory pool.
4. The method according to claim 3, wherein the determining a target memory pool according to the type identifier of the data to be stored comprises:
determining a type identifier set corresponding to the file to be stored based on the type identifier of each source data in the file to be stored;
allocating a corresponding memory pool for each type identifier in the type identifier set in a memory space to obtain N memory pools; wherein, N is the number of the type identifiers in the type identifier set, and N is a positive integer;
and determining a memory pool corresponding to the type identifier of the data to be stored in the N memory pools as a target memory pool.
5. The method of claim 3, wherein determining the target memory unit from the target memory pool comprises:
acquiring a storage format predefined for the type identifier;
pre-allocating at least one memory unit in the target memory pool based on the storage format;
and determining a free memory unit which is pre-allocated as a target memory unit.
6. The method of claim 5, wherein pre-allocating at least one memory unit in the target memory pool based on the storage format comprises:
determining the byte length of a predefined fixed field, the byte length of a fixed-length attribute and the number M of variable-length attributes based on the storage format, wherein M is a natural number;
determining the memory occupation amount of target data corresponding to the data to be stored according to the byte length of the fixed field, the byte length of the fixed-length attribute and the byte lengths of the M address pointers;
pre-allocating at least one memory unit in the target memory pool based on the memory footprint.
7. The method of claim 6, wherein pre-allocating at least one memory cell in the target memory pool based on the memory footprint comprises:
when no idle memory unit exists in the target memory pool, a memory block is pre-allocated in the target memory pool;
and dividing the memory block into at least one memory unit based on the memory occupation amount.
8. The method according to claim 6, wherein the obtaining mapping data of variable-length data in the data to be stored and fixed-length data in the data to be stored comprises:
determining fixed field data, fixed length attribute data and variable length attribute data of the data to be stored based on the storage format;
determining the fixed field data and the fixed length attribute data as fixed length data in the data to be stored;
determining M variable-length attribute data as M variable-length data in the data to be stored;
and mapping the M variable length data into M address pointers, and determining the M address pointers as mapping data, wherein the M address pointers respectively point to memory addresses of the M variable length attribute data.
9. A data storage device, characterized in that the device comprises:
the first acquisition module is used for acquiring data to be stored and the type identifier of the data to be stored;
a first determining module, configured to determine a target memory unit from a target memory pool based on the type identifier, where the target memory pool is a memory pool corresponding to the type identifier;
the second acquisition module is used for acquiring mapping data of variable-length data in the data to be stored and fixed-length data in the data to be stored;
the second determining module is used for determining the fixed-length data and the mapping data as target data;
and the storage module is used for storing the target data to the target memory unit, and the memory occupation amount of the target data is equal to the memory size of the target memory unit.
10. A data storage device, comprising:
a processor; and
a memory for storing a computer program operable on the processor;
wherein the computer program realizes the steps of the method of any one of claims 1 to 8 when executed by a processor.
11. A computer-readable storage medium having stored thereon computer-executable instructions configured to perform the steps of the method of any one of claims 1 to 8.
CN202011303864.2A 2020-11-19 2020-11-19 Data storage method, device and equipment and computer readable storage medium Pending CN113742056A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011303864.2A CN113742056A (en) 2020-11-19 2020-11-19 Data storage method, device and equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011303864.2A CN113742056A (en) 2020-11-19 2020-11-19 Data storage method, device and equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN113742056A true CN113742056A (en) 2021-12-03

Family

ID=78728123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011303864.2A Pending CN113742056A (en) 2020-11-19 2020-11-19 Data storage method, device and equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113742056A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5287490A (en) * 1991-03-07 1994-02-15 Digital Equipment Corporation Identifying plausible variable length machine code of selecting address in numerical sequence, decoding code strings, and following execution transfer paths
CN101175228A (en) * 2006-11-03 2008-05-07 中兴通讯股份有限公司 Implementing method for supporting variable length data structure in intelligent network
WO2011160721A1 (en) * 2010-06-23 2011-12-29 International Business Machines Corporation Resizing address spaces concurrent to accessing the address spaces
CN102708145A (en) * 2011-03-24 2012-10-03 日本电气株式会社 Database processing device, database processing method, and recording medium
CN103678160A (en) * 2012-08-30 2014-03-26 腾讯科技(深圳)有限公司 Data storage method and device
CN110008020A (en) * 2019-03-05 2019-07-12 平安科技(深圳)有限公司 EMS memory management process, device, electronic equipment and computer readable storage medium
CN111367671A (en) * 2020-03-03 2020-07-03 深信服科技股份有限公司 Memory allocation method, device, equipment and readable storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114356795A (en) * 2021-12-30 2022-04-15 中国民航信息网络股份有限公司 Memory management method and related device
WO2023185401A1 (en) * 2022-03-28 2023-10-05 华为技术有限公司 Data processing method, encoding and decoding accelerator and related device
CN115481298A (en) * 2022-11-14 2022-12-16 阿里巴巴(中国)有限公司 Graph data processing method and electronic equipment
CN115481298B (en) * 2022-11-14 2023-03-14 阿里巴巴(中国)有限公司 Graph data processing method and electronic equipment
CN116627359A (en) * 2023-07-24 2023-08-22 成都佰维存储科技有限公司 Memory management method and device, readable storage medium and electronic equipment
CN116627359B (en) * 2023-07-24 2023-11-14 成都佰维存储科技有限公司 Memory management method and device, readable storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN113742056A (en) Data storage method, device and equipment and computer readable storage medium
CN109603155B (en) Method and device for acquiring merged map, storage medium, processor and terminal
CN112292816B (en) Processing core data compression and storage system
CN103582883A (en) Improved encoding and decoding of variable-length data with group formats
US7619623B2 (en) Perfect multidimensional spatial hashing
US20050288836A1 (en) Geographic information data base engine
CN109446362A (en) Chart database structure, diagram data storage method, device based on external memory
CN112016312A (en) Data relation extraction method and device, electronic equipment and storage medium
CN106126486A (en) Temporal information coded method, encoded radio search method, coding/decoding method and device
CN109960612B (en) Method, device and server for determining data storage ratio
CN104111922A (en) Processing method and device of streaming document
CN110908707A (en) Resource packaging method, device, server and storage medium
Price et al. HDFITS: Porting the FITS data model to HDF5
Pibiri On weighted k-mer dictionaries
CN115964002B (en) Electric energy meter terminal archive management method, device, equipment and medium
JP2000207260A (en) Method and device for storing and fetching data in/from hand-held device
CN110032562B (en) Method and device for storing business records
CN116795324A (en) Mixed precision floating-point multiplication device and mixed precision floating-point number processing method
CN111162792A (en) Compression method and device for power load data
CN101201829B (en) Chinese character library system as well as character code display method thereof
CN115765754A (en) Data coding method and coded data comparison method
EP1378999B1 (en) Code compression process, system and computer program product therefor
CN111639260B (en) Content recommendation method, content recommendation device and storage medium
CN100397399C (en) Method and device for supporting multi-languages in FAT file system
CN115730225A (en) Clustering method and device for discrete sequences

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination