CN112612925A

CN112612925A - Data storage method, data reading method and electronic equipment

Info

Publication number: CN112612925A
Application number: CN202011590749.8A
Authority: CN
Inventors: 史承毅; 杨凯
Original assignee: Shanghai Youyang New Media Information Technology Co ltd
Current assignee: Du Xiaoman Technology Beijing Co Ltd
Priority date: 2020-12-29
Filing date: 2020-12-29
Publication date: 2021-04-06
Anticipated expiration: 2040-12-29
Also published as: CN112612925B

Abstract

The application provides a data storage method, a data reading method and an electronic device. A first character string to be stored is stored according to first index information, where the first index information represents the link relation, in a first data structure, of a first node corresponding to the first character string. A second character string to be stored and an identifier of the first node are stored according to second index information, where the second index information represents the link relation, in a second data structure, of a second node corresponding to the second character string. At least one of the first data structure and the second data structure is a data structure of a compression path, in which the storage space corresponding to at least one of its nodes stores at least two characters. The data to be stored is thereby stored in compressed form, which saves memory space, and because the data does not need to be compressed separately beforehand, the efficiency of data storage is improved.

Description

Data storage method, data reading method and electronic equipment

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a data storage method, a data reading method, and an electronic device.

Background

At present, key-value pair data is commonly stored in a tree map (TreeMap) or a hash table (Hashtable). A TreeMap stores data in a binary tree structure: each stored element is a key-value pair, and the keys are kept in sorted order so that every key-value pair occupies a position in the tree. A hash table maps the key of each key-value pair into the memory space through a hash function, resolves hash position conflicts with linked lists, and stores each key-value pair in a linked-list node.

However, the storage methods in the prior art all store the original key-value pair data into the memory without compression, which results in excessive memory consumption.

Disclosure of Invention

The application provides a data storage method, a data reading method and electronic equipment, which can realize real-time storage and reading of data and save storage resources.

In a first aspect, an embodiment of the present application provides a data storage method, including:

storing a first character string to be stored according to first index information; the first index information is used for representing the link relation of a first node in a first data structure, and the first node is a node corresponding to a first character string in the first data structure;

and storing a second character string to be stored and the identifier of the first node according to second index information, wherein the second character string is a keyword of the first character string, the second index information is used for representing the link relation of the second node in a second data structure, and the second node is a node corresponding to the second character string in the second data structure.

At least one of the first data structure and the second data structure is a data structure of a compression path, and among the plurality of nodes contained in the data structure of the compression path, the storage space corresponding to at least one node stores at least two characters.

In a second aspect, an embodiment of the present application provides a method for reading data, including:

reading the identifier of the first node based on second index information corresponding to a second character string, wherein the second character string is a keyword of the first character string to be read, the second index information is used for representing the link relation of the second node in a second data structure, the second node is a node corresponding to the second character string in the second data structure, and the first node is a node corresponding to the first character string in the first data structure;

reading a first character string based on first index information corresponding to a first node, wherein the first index information is used for representing the link relation of the first node in a first data structure;

at least one of the first data structure and the second data structure is a data structure of a compression path, and among the plurality of nodes contained in the data structure of the compression path, the stored data corresponding to at least one node contains at least two characters.

In a third aspect, an embodiment of the present application provides an electronic device, including:

the storage unit is used for storing the first character string to be stored according to the first index information; the first index information is used for representing the link relation of a first node in a first data structure, and the first node is a node corresponding to a first character string in the first data structure;

the storage unit is further configured to store a second character string to be stored and an identifier of the first node according to second index information, where the second character string is a keyword of the first character string, the second index information is used to represent a link relationship of the second node in a second data structure, and the second node is a node corresponding to the second character string in the second data structure.

In a fourth aspect, an embodiment of the present application provides an electronic device, including:

the reading unit is used for reading the identifier of the first node based on second index information corresponding to a second character string, wherein the second character string is a keyword of the first character string to be read, the second index information is used for representing the link relation of the second node in a second data structure, the second node is a node corresponding to the second character string in the second data structure, and the first node is a node corresponding to the first character string in the first data structure;

the reading unit is further configured to read the first character string based on first index information corresponding to the first node, where the first index information is used to represent a link relationship of the first node in the first data structure.

In a fifth aspect, an embodiment of the present application provides an electronic device, including: a memory and a processor;

the memory stores computer-executable instructions;

the processor executes the computer-executable instructions stored by the memory, causing the processor to perform the method of the first aspect or embodiments thereof.

In a sixth aspect, an embodiment of the present application provides an electronic device, including: a memory and a processor;

the memory stores computer-executable instructions;

the processor executes the computer-executable instructions stored by the memory, causing the processor to perform the method of the second aspect or embodiments thereof.

In a seventh aspect, an embodiment of the present application provides a storage medium, including: a readable storage medium and a computer program for implementing the method of the first aspect, the second aspect or implementations thereof.

According to the embodiment of the application, the first character string is stored according to the first index information based on the first data structure, the second character string is stored according to the second index information based on the second data structure, at least one data structure in the first data structure and the second data structure is a data structure of a compression path, compression storage of data to be stored is achieved, the memory space is saved, the data to be stored does not need to be compressed, and the efficiency of data storage is improved.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description show some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without inventive effort.

Fig. 1 is a storage structure of a tree map according to an embodiment of the present application;

Fig. 2 is a storage structure of a hash table according to an embodiment of the present application;

Fig. 3a is a logical diagram of a data structure according to an embodiment of the present application;

Fig. 3b is a schematic diagram of a hash table according to an embodiment of the present application;

Fig. 3c is a schematic diagram of a storage location map according to an embodiment of the present application;

Fig. 4 is a logical diagram of a data structure according to an embodiment of the present application;

Fig. 5 is a schematic flowchart of a data storage method according to an embodiment of the present application;

Fig. 6 is a schematic flowchart of a data reading method according to an embodiment of the present application;

Fig. 7 is a schematic flowchart of data export according to an embodiment of the present application;

Fig. 8 is a schematic diagram illustrating a mapping reduction process of a data structure according to an embodiment of the present application;

Fig. 9 is a schematic structural diagram of an electronic device 900 according to an embodiment of the present application;

Fig. 10 is a schematic structural diagram of an electronic device 1000 according to an embodiment of the present application;

Fig. 11 is a schematic hardware structure diagram of an electronic device 1100 according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Fig. 1 is a storage structure of a tree map according to an embodiment of the present application. As shown in fig. 1, each node of the tree map (TreeMap) stores one key-value pair (Key-Value). Because the key-value pairs are stored in the TreeMap structure without compression and the TreeMap structure itself occupies additional memory space, the total memory occupied after the key-value pair data is stored is expected to be 2 to 4 times the size of the key-value pair data itself. Moreover, the key-value pair data inserted each time cannot be deduplicated, which occupies still more memory.

Fig. 2 is a storage structure of a hash table according to an embodiment of the present application. As shown in fig. 2, a key is hashed into the memory space through a hash function, and hash position conflicts are generally resolved by separate chaining or by open addressing. Fig. 2 shows conflicts resolved by chaining with linked lists, where the storage space corresponding to each linked-list node stores one key-value pair. As with the tree map, key-value pair data stored in a hash table structure is stored without compression, so memory consumption is large.

In view of the above problems, the technical idea of the application is as follows: the Key (hereinafter also the second character string) and the Value (hereinafter also the first character string) of the key-value pair data are stored separately in two data structures, at least one of which is a data structure of a compression path. When the Key and/or the Value of each key-value pair is stored, only the characters other than the common prefix in the dictionary tree need to be stored, which saves memory space. In addition, the key-value pair data does not need to be compressed in advance or decompressed when read, which improves the efficiency of data storage and reading.

To solve the prior-art problem that key-value pair data consumes a large amount of memory, in the embodiments of the present application the keys and values of the key-value pair data are stored in two data structures, and at least one of the two data structures is a data structure of a compression path, such as a dictionary tree (Trie) with compressed paths.

First, the first data structure and the second data structure referred to in the embodiments of the present application are explained.

FIG. 3a is a logical diagram of a data structure according to an embodiment of the present application; fig. 3b is a schematic diagram of a hash table according to an embodiment of the present application; fig. 3c is a schematic diagram of a storage location map according to an embodiment of the present application.

The data structure shown in fig. 3a is a logical structure for storing the values of key-value pair data and is referred to below as the first data structure. As shown in fig. 3a, the first data structure includes a plurality of nodes, such as a root node and multiple levels of leaf nodes (here every node other than the root is called a leaf node, including nodes that have children of their own). For example, v1 is the root node of the first data structure and v2 to v7 are its leaf nodes; v1 is the parent node of v2 and v4, v2 is the parent node of v3 and v6, v6 is the parent node of v7, and v4 is the parent node of v5. It should be understood that the first data structure may include more or fewer nodes.

At least one character is stored in the storage space corresponding to each node, and in order to realize the storage of the compression path, two or more characters are stored in the storage space corresponding to at least one node in the first data structure.

For each leaf node in the first data structure, the storage space corresponding to the leaf node does not store the common prefix that its stored data shares with the stored data corresponding to its parent node.

The path between each leaf node and the parent node connected to it may be expressed as a two-dimensional vector (mismatch initial character, mismatch position). The mismatch initial character is the first character that fails to match when the stored data corresponding to the leaf node is compared character by character with the stored data corresponding to its parent node. The mismatch position is the position, within the data stored in the storage space corresponding to the parent node, of the last character of the common prefix. Optionally, the path data corresponding to each leaf node may be expressed as a three-dimensional vector <parent node identification, mismatch initial character, mismatch position>.

For example, the stored data corresponding to the v1 node is https://news.kk.com, and the stored data corresponding to the v2 node is https://new.kk.com/ch/anti. The data stored in the storage space corresponding to the v1 node is https://news.kk.com. The two strings share the common prefix https://new, so the mismatch initial character is "." and the mismatch position is 11 (counting from 1). The path between the v2 node and the v1 node is therefore (".", 11), and the data stored in the storage space corresponding to the v2 node is kk.com/ch/anti. Optionally, the path data corresponding to the v2 node may be expressed as <v1, ".", 11>.

As shown in fig. 3a, the path between the v2 node and the v3 node is (d, 7), the path between the v2 node and the v6 node is (f, 10), the path between the v6 node and the v7 node is (a, 0), the path between the v1 node and the v4 node is (s, 8), and the path between the v4 node and the v5 node is (/, 0). The storage space corresponding to the v3 node stores "/bj", that of the v6 node stores "inance", that of the v7 node stores "shion", that of the v4 node stores "ports.kk.com", and that of the v5 node stores "nba". It can be understood that the stored data corresponding to the v3 node is https://new.kk.com/d/bj, that of the v4 node is https://sports.kk.com, that of the v5 node is https://sports.kk.com/nba, that of the v6 node is https://new.kk.com/ch/finance, and that of the v7 node is https://new.kk.com/ch/fashion.
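The path vector described above can be computed mechanically. The following Python sketch is illustrative only and is not taken from the application: the function name `path_data` and its parameters are hypothetical, positions are counted from 1 (which is what makes the (".", 11) example work out), and the root string is assumed to be https://news.kk.com, the spelling used where the contents of the first data block are described. It also assumes that the child string differs from the parent string at some character.

```python
from os.path import commonprefix


def path_data(parent_full, parent_stored, child_full):
    """Return (mismatch initial character, mismatch position, child's stored characters).

    parent_full   -- the complete string represented by the parent node
    parent_stored -- the characters actually kept in the parent's storage space
    child_full    -- the complete string to be represented by the child node
    """
    shared = len(commonprefix([parent_full, child_full]))
    # Where the parent's stored characters begin inside its complete string.
    offset = len(parent_full) - len(parent_stored)
    # 1-based position of the last common-prefix character within the
    # parent's stored characters; 0 if the prefix ends before they begin.
    position = max(shared - offset, 0)
    mismatch_char = child_full[shared]
    # The mismatch character lives in the path, so the child keeps only what follows it.
    child_stored = child_full[shared + 1:]
    return mismatch_char, position, child_stored


print(path_data("https://news.kk.com", "https://news.kk.com",
                "https://new.kk.com/ch/anti"))
print(path_data("https://new.kk.com/ch/anti", "kk.com/ch/anti",
                "https://new.kk.com/d/bj"))
```

Under these assumptions the two calls reproduce the v1 to v2 path (".", 11) with stored characters kk.com/ch/anti, and the v2 to v3 path (d, 7) with stored characters /bj.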

Optionally, the character string stored in the storage space corresponding to each node ends with the character $. This character does not occur in the key-value pair data and is appended to the end of each character string, so that no character string in the set is a prefix of another.
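A small Python check illustrates why the terminator helps. The helper `has_prefix_pair` and the second URL are hypothetical and serve only to demonstrate the property.

```python
def has_prefix_pair(strings):
    """True if some string in the collection is a proper prefix of another."""
    return any(a != b and b.startswith(a) for a in strings for b in strings)


# Hypothetical pair: the first URL is a prefix of the second.
raw = ["https://news.kk.com", "https://news.kk.com/ch"]
# Appending "$", assumed never to occur in the data, breaks the prefix relation.
terminated = [s + "$" for s in raw]

print(has_prefix_pair(raw), has_prefix_pair(terminated))
```

With the terminator appended, two different strings always disagree at some character, so a mismatch initial character always exists when one is matched against the other.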

Based on the logical representation of the first data structure shown in fig. 3a, fig. 3b shows the physical structure of the first data structure. The identifier of each node in the first data structure, for example v1 to v7, is hashed to obtain its hash address in the hash table, and the path data corresponding to each node is stored at the corresponding hash address.
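As a rough illustration of this physical layout, the Python sketch below uses a dictionary (Python's built-in hash table) keyed by node identifier to hold the path data, and a second dictionary for the characters held by each node. The names `PATHS`, `STORED` and `restore` are hypothetical, only some nodes of fig. 3a are included, positions are counted from 1, and the root string is assumed to be https://news.kk.com. The `restore` function is one possible way to rebuild a complete string from the path data, not necessarily the procedure used in the application.

```python
# Hash table keyed by node identifier:
# node id -> (parent node id, mismatch initial character, mismatch position)
PATHS = {
    "v1": (None, None, None),   # root node: no parent, no path data
    "v2": ("v1", ".", 11),
    "v3": ("v2", "d", 7),
    "v4": ("v1", "s", 8),
    "v6": ("v2", "f", 10),
    "v7": ("v6", "a", 0),
}

# Characters held in the storage space of each node.
STORED = {
    "v1": "https://news.kk.com",
    "v2": "kk.com/ch/anti",
    "v3": "/bj",
    "v4": "ports.kk.com",
    "v6": "inance",
    "v7": "shion",
}


def restore(node_id):
    """Rebuild the complete string of a node by following parent links."""
    parent_id, char, position = PATHS[node_id]
    if parent_id is None:
        return STORED[node_id]
    parent_full = restore(parent_id)
    # Where the parent's stored characters begin inside its complete string.
    offset = len(parent_full) - len(STORED[parent_id])
    # Common prefix + mismatch character + the node's own stored characters.
    return parent_full[:offset + position] + char + STORED[node_id]


for node in ("v2", "v3", "v6", "v7"):
    print(node, restore(node))
```

The design point is that each node keeps only its own suffix, while the shared prefix is recovered from the parent chain on demand.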

In order to associate the node with the corresponding storage space of the memory, the embodiment of the present application uses the bit array and the pointer array as shown in fig. 3c to implement the corresponding relationship between the node and the storage space indicated by the pointer. The bit array and pointer array may be represented as LabelStore.

It should be understood that the bit array is divided into a plurality of data blocks (chunks), each with a plurality of data bits; illustratively, each data block is 1 byte and therefore has 8 data bits. Each data block corresponds to one pointer in the pointer array, and the number of data bits set to 1 in a data block equals the number of character strings stored in the storage space indicated by the pointer corresponding to that data block.

For example, v1 in fig. 3b maps to the first data bit of the first data block of the bit array, so that data bit is set to 1. Similarly, v3 maps to the fourth data bit of the first data block, v6 maps to the seventh data bit of the first data block, and so on. Correspondingly, the storage space indicated by the pointer corresponding to the first data block stores "https://news.kk.com", "/bj" and "inance". The values 20, 4 and 7 are the lengths of these entries: 20 is the number of characters of https://news.kk.com plus the character $, 4 is the number of characters of "/bj" plus the character $, and 7 is the number of characters of "inance" plus the character $.
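The bit array and pointer array can be sketched in Python as follows. This is a minimal sketch under several assumptions that the description does not fix: the class layout and method names are hypothetical, slot numbers are zero-based, the least significant bit of a byte is treated as the first data bit, each storage area is modelled as a single "$"-terminated Python string, and the mapping from a node to its slot is taken as given.

```python
CHUNK_BITS = 8
TERMINATOR = "$"


class LabelStore:
    def __init__(self, num_chunks):
        self.bits = bytearray(num_chunks)    # bit array, one byte per data block
        self.pointers = [""] * num_chunks    # pointer array, one storage area per data block

    def _rank(self, chunk, bit):
        """Number of bits set to 1 in the data block before position `bit`."""
        mask = (1 << bit) - 1
        return bin(self.bits[chunk] & mask).count("1")

    def put(self, slot, text):
        chunk, bit = divmod(slot, CHUNK_BITS)
        if (self.bits[chunk] >> bit) & 1:
            raise ValueError(f"slot {slot} is already occupied")
        labels = self.pointers[chunk].split(TERMINATOR)[:-1]
        # Keep the strings in the same order as their data bits.
        labels.insert(self._rank(chunk, bit), text)
        self.pointers[chunk] = "".join(label + TERMINATOR for label in labels)
        self.bits[chunk] |= 1 << bit

    def get(self, slot):
        chunk, bit = divmod(slot, CHUNK_BITS)
        if not (self.bits[chunk] >> bit) & 1:
            raise KeyError(slot)
        return self.pointers[chunk].split(TERMINATOR)[self._rank(chunk, bit)]


store = LabelStore(num_chunks=2)
store.put(0, "https://news.kk.com")   # v1 -> first data bit
store.put(6, "inance")                # v6 -> seventh data bit
store.put(3, "/bj")                   # v3 -> fourth data bit
print(store.pointers[0])
print(store.get(3), store.get(6))
```

Because the number of bits set before a given position gives the index of the string inside the storage area, no per-string pointer is needed; one pointer per data block is enough.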

Fig. 4 is a logical diagram of a data structure according to an embodiment of the present application. Similar to the first data structure, fig. 4 illustrates a logical structure for storing the keys of key-value pair data, referred to below as the second data structure. The second data structure includes a plurality of nodes, such as a root node and multiple levels of leaf nodes. Illustratively, k1 is the root node of the second data structure and k2 to k7 are its leaf nodes; k1 is the parent node of k2 to k6, and k6 is the parent node of k7. It should be understood that the nodes in the second data structure correspond one-to-one with the nodes in the first data structure: for example, the k1 node corresponds to the v1 node, the k2 node corresponds to the v2 node, and so on.

At least one character is stored in the storage space corresponding to each node, and to implement compressed-path storage, two or more characters are stored in the storage space corresponding to at least one node in the second data structure. In addition, the storage space corresponding to each node in the second data structure also stores the identifier of the node that holds the corresponding value, so that each key is stored together with a link to its value.

For each leaf node in the second data structure, the storage space corresponding to the leaf node does not store the common prefix that its stored data shares with the stored data corresponding to its parent node.

The path between each leaf node and the parent node connected to it may be expressed as a two-dimensional vector (mismatch initial character, mismatch position). The mismatch initial character is the first character that fails to match when the stored data corresponding to the leaf node is compared character by character with the stored data corresponding to its parent node. The mismatch position is the position, within the data stored in the storage space corresponding to the parent node, of the last character of the common prefix. Optionally, the path data corresponding to each leaf node may be expressed as a three-dimensional vector <parent node identification, mismatch initial character, mismatch position>.

For example, the stored data corresponding to the k1 node is news, and the stored data corresponding to the k2 node is antipneumonic. The data stored in the storage space corresponding to the k1 node is news and v1. The two keys have no common prefix, so the mismatch initial character is "a" and the mismatch position is 0. The path between the k2 node and the k1 node is (a, 0), and the data stored in the storage space corresponding to the k2 node is ntipneumonic and v2. Optionally, the path data corresponding to the k2 node may be expressed as <k1, a, 0>.

As shown in fig. 4, the path between the k1 node and the k3 node is (b, 0), the path between the k1 node and the k4 node is (s, 0), the path between the k1 node and the k5 node is (b, 1), the path between the k1 node and the k6 node is (f, 0), and the path between the k6 node and the k7 node is (a, 0). The storage space corresponding to the k3 node stores "eijing" and v3, that of the k4 node stores "ports" and v4, that of the k5 node stores "a" and v5, that of the k6 node stores "inance" and v6, and that of the k7 node stores "shion" and v7. It can be understood that the stored data corresponding to the k3 node is beijing, that of the k4 node is sports, that of the k5 node is nba, that of the k6 node is finance, and that of the k7 node is fashion.

Similar to the first data structure, based on the logical representation of the second data structure shown in fig. 4, the physical structure of the second data structure may be a hash table. The identifier of each node in the second data structure, for example k1 to k7, is hashed to obtain its hash address in the hash table, and the path data corresponding to each node is stored at the corresponding hash address.

In order to associate a node with the corresponding storage space in memory, the embodiments of the present application use a bit array and a pointer array to implement the correspondence between the node and the storage space indicated by the pointer.

It should be understood that the bit array is divided into a plurality of data blocks (chunks), each with a plurality of data bits; illustratively, each data block is 1 byte and therefore has 8 data bits. Each data block corresponds to one pointer in the pointer array, and the number of data bits set to 1 in a data block equals the number of character strings stored in the storage space indicated by the pointer corresponding to that data block. The specific implementation is similar to that of the first data structure and is not described again here.

It should be noted that when the first data structure is the data structure of the compression path shown in fig. 3a to 3c, the second data structure may be the data structure of the compression path shown in fig. 4 or any other type of data structure; likewise, when the second data structure is the data structure of the compression path shown in fig. 4, the first data structure may be the data structure of the compression path shown in fig. 3a to 3c or any other type of data structure. Generally, making the first data structure, which stores the values, a data structure of a compression path saves more memory than making only the second data structure one.

Based on the first data structure or the second data structure mentioned in any of the above embodiments, the embodiments of the present application provide the following possible implementation manners for how to implement real-time compressed storage and data reading.

The data storage method comprises the following steps:

fig. 5 is a schematic flowchart of a data storage method according to an embodiment of the present application. As shown in fig. 5, the data storage method includes the following implementation steps:

S501: storing the first character string to be stored according to the first index information.

The first index information is used to characterize the link relationship of the first node in the first data structure: for example, that the first node is the root node or, when the first node is a leaf node, the parent node connected to the first node and the path between the two. It should be understood that each character string to be stored has a corresponding node in the first data structure, and the first node is the node corresponding to the first character string in the first data structure.

In this step, the first character string is stored according to the first index information in one of the following ways: the entire first character string is stored in the storage space corresponding to the first node; or only the characters of the first character string that follow the common prefix are stored there; or only the characters that follow both the common prefix and the first mismatched character (or the first n mismatched characters) are stored there.

S502: storing the second character string to be stored and the identifier of the first node according to the second index information.

It should be understood that the second character string is the key of the first character string, and the second index information is used to characterize the link relationship of the second node in the second data structure: for example, that the second node is the root node or, when the second node is a leaf node, the parent node connected to the second node and the path between the two. The second node is the node corresponding to the second character string in the second data structure.

It should be appreciated that storing the second character string together with the identifier of the first node, which corresponds to the first character string, records the correspondence between the key and the value of the key-value pair data.

For example, the second character string and the identifier of the first node may be stored according to the second index information in one of the following ways: the entire second character string and the identifier of the first node are stored in the storage space corresponding to the second node; or the characters of the second character string that follow the common prefix are stored there together with the identifier of the first node; or the characters that follow both the common prefix and the first mismatched character (or the first n mismatched characters) are stored there together with the identifier of the first node.

It should be understood that, to implement compressed-path storage, generally the characters of the first character string that follow the common prefix are stored in the storage space corresponding to the first node, or the characters that follow both the common prefix and the first mismatched character (or the first n mismatched characters) are stored there. When the first character string is not stored with a compressed path, the second character string and the identifier of the first node are stored with a compressed path.

Optionally, the first character string and the second character string may each include at least one of letters, symbols, and Chinese characters.

For example, the embodiment of the present application needs to create the first index information based on the first character string. Optionally, the first index information may include the identifier of the parent node of the first node (the third node) and the path data between the first node and the third node. Optionally, the path data may include a mismatch first character and a mismatch position; for example, the first index information may be represented as a triple <identifier of the third node, mismatch first character, mismatch position>.

In an implementation manner, when the first character string needs to be stored, if the first data structure does not have a root node yet, the root node is created in the first data structure as the first node, and at this time, the identifier and the path data of the third node in the first index information are both empty and can be represented as < -, >.

In another implementation manner, when the first character string needs to be stored and the first data structure already has a root node, or at least one leaf node built on the root node, a third node is determined in the first data structure based on the first character string. It should be understood that the third node is the node in the first data structure whose corresponding stored data shares the longest common prefix with the first character string. The first node, and the path data between the first node and the third node, are then created based on the stored data corresponding to the third node.

Taking fig. 3a as an example, assume that the first character string is https://new.kk.com/d/bj. Its common prefix with the stored data https://new.kk.com/ch/anti corresponding to the v2 node in the first data structure is https://new.kk.com/, and this common prefix has the largest number of characters compared with the common prefixes of the first character string and the stored data corresponding to the other nodes in the first data structure. A child node v3 of the v2 node is therefore created, together with the path data between the v2 node and the v3 node. Optionally, this path data includes the mismatch initial character "d" and the mismatch position 7, in which case the first index information, whose first element is the identifier of the parent node, may be expressed as <v2, d, 7>.

Exemplarily, starting from the root node of the first data structure, the data stored in the storage space corresponding to the root node is matched against the first character string to obtain target path data. The target path data includes a target mismatch first character and a target mismatch position: the target mismatch first character is the first character that fails to match when the stored data corresponding to the root node is compared character by character with the first character string, and the target mismatch position is the position, within the character string stored in the storage space corresponding to the root node, of the last character of the common prefix of that stored data and the first character string. If the target path data differs from the path data of every leaf node connected to the root node, the root node is the third node. If the target path data is the same as the path data of one of the leaf nodes connected to the root node, that leaf node is taken as the new root node and the process is repeated until the target path data differs from the path data of every leaf node connected to the current node, at which point the third node is determined.

Still taking fig. 3a as an example, the first character string is https:// new.kk.com/d/bj. Starting from the root node v1 of the first data structure, the stored data https:// new.kk.com corresponding to the root node is matched with the first character string to obtain target path data <., 11>, which is the same as the path data between the v1 node and the v2 node. The data kk.com/ch/anti stored in the storage space corresponding to the v2 node is then matched with the first character string to obtain target path data < d, 7>, which is different from the path data of every leaf node connected to the v2 node, so the v2 node is the third node of the first character string.

Further, a child node of the third node is created as the first node, and the finally obtained target path data is used as the path data between the first node and the third node, thereby establishing the path between the first node and the third node.
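The search for the third node and the creation of the first node can be sketched as follows. This is a minimal illustration, not the implementation of the application: the `Node` class, the function names, and the choice of a dictionary keyed by (mismatch first character, mismatch position) are assumptions, and the node contents only approximate fig. 3a (the root is assumed to store the 11 characters `https://new`, written without the space that appears in the translated text).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    ident: str   # node identifier, e.g. "v2"
    data: str    # characters held in this node's storage space
    # child nodes keyed by path data: (mismatch first character, mismatch position)
    children: dict = field(default_factory=dict)


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b share."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def find_third_node(root: Node, s: str):
    """Walk down from the root; return (third node, target path data, remaining characters)."""
    node, rest = root, s
    while True:
        pos = common_prefix_len(node.data, rest)
        if pos == len(rest):
            # Assumption: strings fully covered by existing data are out of scope here.
            raise ValueError("string is already covered by an existing node")
        path = (rest[pos], pos)      # target mismatch first character and position
        rest = rest[pos + 1:]        # the mismatch first character lives in the path data
        child = node.children.get(path)
        if child is None:            # no child with the same path data: this is the third node
            return node, path, rest
        node = child                 # same path data: descend and repeat


def insert(root: Node, s: str, ident: str):
    """Create the first node under the third node; return (index triple, new node)."""
    parent, path, rest = find_third_node(root, s)
    node = Node(ident, rest)
    parent.children[path] = node
    return (parent.ident, path[0], path[1]), node


v2 = Node("v2", "kk.com/ch/anti")
v1 = Node("v1", "https://new", {(".", 11): v2})
index, v3 = insert(v1, "https://new.kk.com/d/bj", "v3")
print(index, repr(v3.data))  # ('v2', 'd', 7) '/bj'
```

Keying the children by the path data makes the "same path data" test a single dictionary lookup; a linear scan over the child list would behave identically.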

On the basis of the foregoing embodiment, the embodiment of the present application provides a possible implementation manner for storing the first character string to be stored according to the first index information: among the at least one mismatch character between the first character string and the storage data corresponding to the third node, the characters other than the first mismatch character are stored into the storage space corresponding to the first node. The at least one mismatch character is the run of consecutive characters from the first mismatch character to the last character of the first character string, obtained after the first character string is matched character by character with the storage data corresponding to the third node. It should be understood that the first mismatch character is already stored in the path data of the first index information and therefore does not need to be stored again, which saves memory space. Alternatively, more mismatch characters may be stored in the path data of the first index information, for example the first n mismatch characters, in which case the characters stored in the storage space corresponding to the first node are the characters other than the first n mismatch characters among the at least one mismatch character.

Still taking fig. 3a as an example, after the first character string https:// new.kk.com/d/bj is matched character by character with the storage data https:// new.kk.com/ch/anti corresponding to the third node v2, the mismatch characters d/bj are obtained, and the characters /bj, that is, d/bj without the first mismatch character "d", are stored in the storage space corresponding to the first node.
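The split between what goes into the path data and what goes into the first node's storage space can be illustrated with a short sketch. The function name `split_mismatch` and its parameters are hypothetical; the inputs are the characters still to be matched when the third node is reached and the characters held in the third node's storage space, so the returned position is relative to that storage space.

```python
def split_mismatch(remaining: str, parent_data: str, n: int = 1):
    """Return (mismatch position, characters kept in the path data, characters stored in the node).

    n is the number of leading mismatch characters kept in the path data
    (n = 1 keeps only the first mismatch character).
    """
    pos = 0
    while pos < len(remaining) and pos < len(parent_data) and remaining[pos] == parent_data[pos]:
        pos += 1
    mismatch = remaining[pos:]          # run of characters from the first mismatch to the end
    return pos, mismatch[:n], mismatch[n:]


print(split_mismatch("kk.com/d/bj", "kk.com/ch/anti"))       # (7, 'd', '/bj')
print(split_mismatch("kk.com/d/bj", "kk.com/ch/anti", n=2))  # (7, 'd/', 'bj')
```

A larger n moves more characters into the index and fewer into the node's storage space; which setting saves more memory depends on how the index entries are laid out, which the sketch does not model.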

In this process, the first character string is stored in the memory using the first data structure based on the compression path, which reduces memory consumption.

In a possible implementation manner, in order that each node in the hash trie (HashTrie) accurately corresponds to the storage space pointed to by a pointer, the embodiment of the present application uses a bit array to establish the correspondence between each node in the first data structure and a pointer. In addition, to save memory space, a separate pointer is not set for each node; instead, a plurality of nodes correspond to one pointer.

For example, in the process of storing the first character string, the storage space corresponding to the first node is determined based on the identifier of the first node; it should be understood that this storage space is a storage space in the memory. Specifically, the identifier of the first node is mapped to a corresponding data bit in a bit array, and that data bit is set to 1. The bit array includes a plurality of data blocks, each data block has a plurality of data bits, and each data block corresponds to one pointer. In the data block where the data bit corresponding to the first node is located, the number of data bits set to 1 before that data bit is n, n ≥ 0, and the memory space after the nth character string indicated by the pointer corresponding to the data block is used as the storage space corresponding to the first node. The characters other than the first mismatch character among the at least one mismatch character of the first character string are then stored into this storage space.
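The mapping from a node identifier to a storage slot can be sketched as below. This is only an illustration under stated assumptions: node identifiers are taken to be non-negative integers, each data block is modelled as a Python integer of 64 bits, and the per-block "pointer" is modelled as a Python list of strings. The class name borrows the LabelStore name used later in the description, but the layout is not taken from the application.

```python
class LabelStore:
    """Bit array split into blocks, with one string list (the "pointer") per block."""

    BLOCK_BITS = 64  # assumed block width

    def __init__(self, n_blocks: int):
        self.blocks = [0] * n_blocks        # each int holds BLOCK_BITS data bits
        self.pointers = [None] * n_blocks   # per-block list of stored strings

    def _locate(self, node_id: int):
        block, offset = divmod(node_id, self.BLOCK_BITS)
        # n = number of data bits set to 1 before this node's bit inside its block
        n = bin(self.blocks[block] & ((1 << offset) - 1)).count("1")
        return block, offset, n

    def put(self, node_id: int, chars: str) -> None:
        block, offset, n = self._locate(node_id)
        if self.pointers[block] is None:
            self.pointers[block] = []
        if (self.blocks[block] >> offset) & 1:
            self.pointers[block][n] = chars         # node already present: overwrite
        else:
            self.blocks[block] |= 1 << offset       # set the data bit to 1
            self.pointers[block].insert(n, chars)   # slot right after the nth string


store = LabelStore(2)
store.put(7, "shion")
store.put(3, "/bj")     # lower identifier, so it is placed before "shion"
store.put(70, "anti")   # falls into the second block
print(store.pointers)   # [['/bj', 'shion'], ['anti']]
```

Because only nodes whose bit is set occupy a slot, a block with few stored nodes keeps a short list, which is the memory saving the paragraph above describes.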

On the basis of any of the above embodiments, after the first node is created, the identifier of the first node and the second character string may be stored according to the second index information. To further save memory space, the second data structure is also a data structure of a compression path. Similarly, second index information needs to be created based on the second character string, and the second index information may include: the identifier of the parent node of the second node (the fourth node) and the path data between the second node and the fourth node. Optionally, the path data may include a mismatch first character and a mismatch position; for example, the second index information may be represented as a triple < identifier of the fourth node, mismatch first character, mismatch position >.

It should be understood that the process of creating the second index information is similar to the process of creating the first index information, and is not described herein again. The process of storing the second character string to be stored and the identifier of the first node according to the second index information is likewise similar to the process of storing the first character string by path compression; the only difference is that, after the mismatch characters to be stored in the storage space corresponding to the second node are determined, the identifier of the first node is stored together with them.

Second, the data reading method is described as follows:

Fig. 6 is a flowchart illustrating a data reading method according to an embodiment of the present application. As shown in fig. 6, the data reading method includes:

S601: Read the identifier of the first node based on the second index information corresponding to the second character string.

The second character string is a keyword of the first character string to be read, and the second index information is used to represent the link relationship of the second node in the second data structure: for example, that the second node is a root node, or, when the second node is a leaf node, the parent node connected to the second node and the path between the second node and that parent node. It should be understood that each stored character string has a corresponding node in the second data structure; the second node is the node corresponding to the second character string in the second data structure, and the first node is the node corresponding to the first character string in the first data structure.

It should be understood that when the first character string needs to be queried or obtained, it may be found through its keyword (the second character string). For example, after the second character string input by the user is received, the second node corresponding to the second character string is searched for in the second data structure: starting from the root node of the second data structure and proceeding from top to bottom, the stored data corresponding to each node is matched with the second character string until stored data that completely matches the second character string is found. The node corresponding to that stored data is the second node, the corresponding index information is the second index information, and the identifier of the first node is read from the storage space corresponding to the second node.

S602: Read the first character string based on the first index information corresponding to the first node.

The first index information is used to characterize the link relationship of the first node in the first data structure: for example, that the first node is a root node, or, when the first node is a leaf node, the parent node connected to the first node and the path between the first node and that parent node.

In this step, the corresponding first index information is determined based on the first node or the identifier of the first node. For example, according to the first index information, the first character string is obtained by reading the storage data in the storage space corresponding to the first node; or by reading the storage data in the storage space corresponding to the first node together with the storage data in the storage space corresponding to the parent node of the first node indicated in the first index information; or by reading the storage data in the storage space corresponding to the first node, the storage data in the storage space corresponding to the parent node of the first node indicated in the first index information, and the path data between the first node and its parent node in the first index information.

In the embodiment of the application, based on the second data structure, the identifier of the first node corresponding to the second character string is read according to the second index information, and then, based on the first data structure, the first character string is read according to the first index information corresponding to the identifier of the first node. At least one of the first data structure and the second data structure is a data structure of a compression path, so that real-time reading of data is achieved on the basis of data stored in compressed form.
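Step S601 can be sketched as a top-down walk over the second data structure. The `KeyNode` class, the `lookup` function and the sample keys ("news", "finance", "fashion") are hypothetical and chosen only for illustration; they are not the contents of fig. 4. Each node holds the characters in its storage space, the identifier of the first node stored with the key, and its children keyed by path data.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class KeyNode:
    data: str                   # characters in this node's storage space
    value_id: Optional[str]     # identifier of the first node stored with this key
    children: dict = field(default_factory=dict)  # (mismatch char, position) -> KeyNode


def lookup(root: KeyNode, key: str) -> Optional[str]:
    """Return the first-node identifier stored for key, or None if the key is absent."""
    node, rest = root, key
    while True:
        pos = 0
        while pos < len(node.data) and pos < len(rest) and node.data[pos] == rest[pos]:
            pos += 1
        if pos == len(rest):
            # The key is exhausted: it is a hit only if the node's data matched completely.
            return node.value_id if pos == len(node.data) else None
        child = node.children.get((rest[pos], pos))
        if child is None:
            return None
        node, rest = child, rest[pos + 1:]   # skip the mismatch character held in the path data


k7 = KeyNode("shion", "v7")
k6 = KeyNode("inance", "v6", {("a", 0): k7})
k1 = KeyNode("news", "v1", {("f", 0): k6})
print(lookup(k1, "fashion"))  # v7
print(lookup(k1, "fast"))     # None
```

The returned identifier is then used for step S602, which reads the first character string from the first data structure.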

In a specific implementation manner, the first index information includes: the identifier of the third node and the path data between the first node and the third node. It should be understood that the third node is the parent node of the first node; among the plurality of nodes of the first data structure, the stored data corresponding to the third node has the common prefix with the most characters with respect to the first character string.

Illustratively, reading the first character string based on the first index information corresponding to the first node includes: reading the characters stored in the storage space corresponding to the first node; reading the mismatch initial character and the mismatch position in the path data, where the mismatch initial character is the first mismatch character after the first character string is matched character by character with the storage data corresponding to the third node, the mismatch position is the character position of the last character of the common prefix in the character string stored in the storage space corresponding to the third node, and the common prefix is the run of consecutive characters successfully matched when the first character string is matched character by character with the storage data corresponding to the third node; reading the common prefix before the mismatch position from the storage data corresponding to the third node; and combining, in sequence, the common prefix, the mismatch initial character and the characters stored in the storage space corresponding to the first node to obtain the first character string.

In the process of reading the common prefix before the mismatch position from the storage data corresponding to the third node, there are two possible scenarios. In the first, the third node is the root node of the first data structure, and the common prefix is read directly from the storage space corresponding to the third node. In the second, the third node is a leaf node: part of the common prefix is read from the storage space corresponding to the third node, the common prefix of the third node and its parent node is read from the storage space corresponding to the parent node of the third node according to the index information corresponding to the third node, and these steps are repeated until the part of the common prefix stored in the storage space corresponding to the root node has been read; the parts read from each node are then combined to obtain the common prefix of the first character string and the storage data corresponding to the third node.

Taking fig. 3a and fig. 4 as an example, assume that the second character string is a hash. The second index information corresponding to the second character string is searched for in the second data structure shown in fig. 4. For example, starting from the root node k1, it is determined that the second character string has no common prefix with the stored data "news" corresponding to the k1 node, so the target path data may be represented as < f, 0>. Matching is then performed with the stored data finence corresponding to the k6 node, whose path data is < f, 0>, to obtain target path data < a, 0>, and then with the stored data fahash corresponding to the k7 node. Since the stored data corresponding to the k7 node matches the second character string, the second index information is the index information corresponding to the k7 node, and according to the second index information the identifier of the first node stored in the storage space corresponding to the k7 node is determined to be v7.
Further, v7 is looked up in the first data structure, and the first index information corresponding to v7 is obtained as < v6, a, 0>. According to the first index information, data is read from the storage space corresponding to v7 to obtain the characters "ship", and the mismatch first character "a" and the mismatch position 0 are obtained from the first index information. A mismatch position of 0 indicates that no part of the required common prefix is stored in the storage space corresponding to the v6 node. The mismatch initial character "f" is then read from the index information corresponding to the v6 node, and the first 10 characters "kk.com/ch/" are read from the storage space corresponding to the v2 node, which is the parent node of the v6 node. Further, the mismatch initial character "." is read from the index information corresponding to the v2 node, and the first 11 characters "https:// new" are read from the storage space corresponding to the v1 node, which is the parent node of the v2 node. Finally, the obtained "https:// new", ".", "kk.com/ch/", "f", "a" and "ship" are combined into the first character string "https:// new.kk.com/ch/ship".
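The bottom-up reconstruction described above can be sketched as follows. The table layout, the function names and the stored strings are assumptions made for illustration: each entry holds the index triple plus the characters in the node's storage space, and the sample values ("inance", "shion") are hypothetical stand-ins rather than the contents of fig. 3a. URLs are written without the space that appears in the translated text.

```python
# node id -> (parent id, mismatch first character, mismatch position, stored characters)
NODES = {
    "v1": (None, "", 0, "https://new"),
    "v2": ("v1", ".", 11, "kk.com/ch/anti"),
    "v6": ("v2", "f", 10, "inance"),
    "v7": ("v6", "a", 0, "shion"),
}


def prefix_before(node_id: str) -> str:
    """Everything that precedes the characters stored in node_id's own storage space."""
    parent_id, first_char, position, _ = NODES[node_id]
    if parent_id is None:       # root node: nothing precedes its stored data
        return ""
    parent_data = NODES[parent_id][3]
    # Prefix of the parent, then the shared part of the parent's stored data,
    # then the mismatch first character kept in the path data.
    return prefix_before(parent_id) + parent_data[:position] + first_char


def read_string(node_id: str) -> str:
    """Combine the common prefix, the mismatch character and the node's own characters."""
    return prefix_before(node_id) + NODES[node_id][3]


print(read_string("v7"))  # https://new.kk.com/ch/fashion
print(read_string("v2"))  # https://new.kk.com/ch/anti
```

The recursion visits one node per level, so the read cost grows with the depth of the first node rather than with the number of stored strings.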

Illustratively, in the process of reading the characters stored in the storage space corresponding to each node, the identifier of the node needs to be mapped to corresponding data bits in a bit array, where the bit array includes a plurality of data blocks, each data block has a plurality of data bits, and each data block corresponds to one pointer; in the data block where the data bit corresponding to the first node is located, the number of the data bits which are set to 1 before the data bit is n, n is larger than or equal to 0, and the (n + 1) th character string in the storage space indicated by the pointer corresponding to the data block is read.
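Reading through the bit array can be sketched as below. The block width of 8 bits, the integer node identifiers, the function name `read_chars` and the sample contents are all assumptions for illustration; the per-block pointer is modelled as a Python list, so the (n + 1)th character string is the element at index n.

```python
BLOCK_BITS = 8  # assumed block width, kept small so the bit patterns are easy to read


def read_chars(blocks, pointers, node_id):
    """Return the characters stored for node_id, or None if its data bit is not set."""
    block, offset = divmod(node_id, BLOCK_BITS)
    if not (blocks[block] >> offset) & 1:
        return None
    # n = number of data bits set to 1 before this node's bit inside its block
    n = bin(blocks[block] & ((1 << offset) - 1)).count("1")
    return pointers[block][n]   # the (n + 1)th string behind the block's pointer


blocks = [0b10001000, 0b00000100]           # bits 3 and 7 set in block 0, bit 2 in block 1
pointers = [["/bj", "shion"], ["anti"]]
print(read_chars(blocks, pointers, 7))      # shion
print(read_chars(blocks, pointers, 10))     # anti
print(read_chars(blocks, pointers, 4))      # None
```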

On the basis of any of the above embodiments, the application can perform serialization export on the first data structure, the second data structure and the data stored in the storage space corresponding to each node in each data structure to obtain a serialized file, which is stored in a binary form and is a backup of the memory data, thereby avoiding the problem of data loss caused by service restart.

The hash trie (HashTrie) structure is implemented using a compact vector, that is, an array of triples, so its export (dump) process is simple; the dump process of LabelStore is shown in FIG. 7. In order to store the meta information of the non-empty data blocks (chunks) to the file, a temporary array tmp is used to hold the non-empty chunks. First, the position of each non-empty chunk in the bit array and the data of that chunk are written into the temporary array, and then the length of the temporary array is written into the file. For each data block in the temporary array, the chunk data is first written into the file; then the character string set is extracted from the corresponding pointer array according to the position information of the data bits in the chunk and written into the file.

First, the length of the bit array is written into the serialized file, the counter i is set to 0, and an empty array is defined as the temporary array tmp. It is then judged whether i is smaller than the number of chunks in the bit array. If so, it is determined whether the number of 1s in the ith chunk of the bit array is 0: if it is not 0, (i, chunks[i]) is stored in tmp, where chunks[i] is the data of the ith chunk, and i is incremented by 1; if it is 0, i is simply incremented by 1. The judgment is repeated until i is equal to the number of chunks in the bit array, at which point the length of the tmp array is written into the serialized file.

Further, the counter j is set to 0, and it is judged whether j is smaller than the number of chunks in the bit array. If so, the number of 1s in the jth data block tmp[j] of the temporary array is calculated, and tmp[j] is written into the serialized file. It is then judged whether the character string pointer corresponding to the jth chunk is a null pointer. If it is, j is incremented by 1 and the judgment of whether j is smaller than the number of chunks in the bit array is repeated. If it is not, then as long as the number of 1s in tmp[j] is greater than 0, a character string is extracted from the pointer and written into the serialized file, and the number of 1s in tmp[j] is decremented by 1; when the number of 1s in tmp[j] reaches 0, j is incremented by 1 and the judgment of whether j is smaller than the number of chunks in the bit array is repeated. The export process ends when j is equal to the number of chunks in the bit array.
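The export flow can be sketched with Python's `struct` module. The binary layout below is an assumption, since the description does not fix one: little-endian 32-bit counts, a 32-bit chunk position followed by a 64-bit chunk bitmap, and each character string written as a 32-bit byte length followed by UTF-8 bytes. The second loop iterates over the entries of tmp, and the function name `dump_label_store` is hypothetical.

```python
import io
import struct


def dump_label_store(chunks, pointers, out) -> None:
    """Write the bit array and the strings behind each non-empty chunk to a binary stream."""
    out.write(struct.pack("<I", len(chunks)))                 # length of the bit array
    tmp = [(i, c) for i, c in enumerate(chunks) if c != 0]    # non-empty chunks only
    out.write(struct.pack("<I", len(tmp)))                    # length of the temporary array
    for pos, bits in tmp:
        out.write(struct.pack("<IQ", pos, bits))              # chunk position and chunk data
        strings = pointers[pos]
        if strings is None:                                   # null pointer: nothing to extract
            continue
        ones = bin(bits).count("1")
        for s in strings[:ones]:                              # one string per data bit set to 1
            raw = s.encode("utf-8")
            out.write(struct.pack("<I", len(raw)))
            out.write(raw)


buf = io.BytesIO()
dump_label_store([0, 0b101, 0], [None, ["/bj", "shion"], None], buf)
print(len(buf.getvalue()))  # 36
```

Skipping empty chunks keeps the file proportional to the number of stored strings rather than to the size of the bit array.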

Correspondingly, the present application can import the exported first data structure, second data structure and data stored in the storage space corresponding to each node in each data structure; that is, the serialized file is imported into the memory, so that the memory is restored to its state before the export.

In the importing process, the serialized file generated by the export is read. A HashTrie structure and a corresponding LabelStore structure are first built in the memory, and then the meta information (data volume and the like) and the data of the corresponding structures in the serialized file are loaded into the structures in the memory.
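The import of the LabelStore part can be sketched as the inverse of a dump. The file layout is an assumption rather than the format of the application: a 32-bit chunk count, a 32-bit count of non-empty chunks, and for each non-empty chunk a 32-bit position, a 64-bit bitmap and one length-prefixed UTF-8 string per set bit, all little-endian. The function name `load_label_store` is hypothetical, and the HashTrie part is not modelled.

```python
import io
import struct


def load_label_store(src):
    """Rebuild (chunks, pointers) in memory from a serialized stream."""
    (n_chunks,) = struct.unpack("<I", src.read(4))
    chunks = [0] * n_chunks            # empty structure built first ...
    pointers = [None] * n_chunks
    (n_used,) = struct.unpack("<I", src.read(4))
    for _ in range(n_used):            # ... then filled from the file
        pos, bits = struct.unpack("<IQ", src.read(12))
        chunks[pos] = bits
        strings = []
        for _ in range(bin(bits).count("1")):   # one string per data bit set to 1
            (length,) = struct.unpack("<I", src.read(4))
            strings.append(src.read(length).decode("utf-8"))
        pointers[pos] = strings
    return chunks, pointers


blob = (struct.pack("<II", 3, 1) + struct.pack("<IQ", 1, 0b101)
        + struct.pack("<I", 3) + b"/bj" + struct.pack("<I", 5) + b"shion")
print(load_label_store(io.BytesIO(blob)))  # ([0, 5, 0], [None, ['/bj', 'shion'], None])
```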

The embodiment of the application also includes a compaction (compact) process. In the compact process, the data files corresponding to the Key-Trie and the Value-Trie are generated offline through map-reduce. As shown in fig. 8, in the map stage, the hash value of each key is calculated, the keys are divided into fragments according to the hash value, and the keys and values belonging to the same fragment are output to one file. In the reduce stage, a Trie tree is built from the keys and values of each fragment, and the serialization of the Trie tree is used as the output file of the reduce, so that the Key-Trie and Value-Trie serialized files corresponding to each fragment are obtained through map-reduce. On the one hand, the compact process thoroughly cleans up node data marked for deletion; on the other hand, the full data can be quickly imported into the memory by reading the serialized files, which improves update efficiency.
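The shape of the compact process can be sketched in a single process as below. Several simplifications are assumptions and not part of the application: CRC32 stands in for the unspecified hash function, a value of `None` stands in for the deletion mark, and a sorted tab-separated byte string stands in for building and serializing the Trie tree of each fragment. All function names are hypothetical.

```python
import zlib
from collections import defaultdict


def shard_of(key: str, n_shards: int) -> int:
    """Fragment number of a key, derived from a hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % n_shards


def map_stage(pairs, n_shards: int):
    """Group (key, value) pairs by fragment."""
    shards = defaultdict(list)
    for key, value in pairs:
        shards[shard_of(key, n_shards)].append((key, value))
    return shards


def reduce_stage(shard_pairs) -> bytes:
    """Drop entries marked for deletion and emit one serialized output per fragment."""
    live = sorted((k, v) for k, v in shard_pairs if v is not None)
    # Placeholder for "build the Trie tree, then serialize it".
    return "".join(f"{k}\t{v}\n" for k, v in live).encode("utf-8")


def compact(pairs, n_shards: int):
    return {shard: reduce_stage(items) for shard, items in map_stage(pairs, n_shards).items()}


pairs = [("fashion", "u1"), ("finance", "u2"), ("news", None)]
for shard, blob in sorted(compact(pairs, 4).items()):
    print(shard, blob)
```

Because the fragment of a key depends only on its hash, each fragment can be rebuilt and imported independently of the others.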

Fig. 9 is a schematic structural diagram of an electronic device 900 according to an embodiment of the present application, and as shown in fig. 9, the electronic device 900 includes:

a storage unit 910, configured to store a first character string to be stored according to first index information; the first index information is used for representing the link relation of a first node in a first data structure, and the first node is a node corresponding to a first character string in the first data structure;

the storage unit 910 is further configured to store a second character string to be stored and an identifier of the first node according to second index information, where the second character string is a keyword of the first character string, the second index information is used to represent a link relationship of the second node in a second data structure, and the second node is a node corresponding to the second character string in the second data structure.

The electronic device 900 provided in this embodiment includes a storage unit 910, which stores a first character string according to first index information based on a first data structure, and stores a second character string according to second index information based on a second data structure, where at least one of the first data structure and the second data structure is a data structure of a compression path, so that compressed storage of data to be stored is realized, memory space is saved, data to be stored does not need to be compressed, and efficiency of data storage is improved.

In one possible design, the storage unit 910 is further configured to:

based on the first character string, first index information is created.

Optionally, the first index information includes: an identity of a third node and path data between the first node and the third node; among the plurality of nodes of the first data structure, the stored data corresponding to the third node has the common prefix with the most characters with respect to the first character string, and the common prefix is the run of consecutive characters successfully matched after the first character string and the stored data corresponding to the third node are matched character by character.

In one possible design, the storage unit 910 is specifically configured to:

determining a third node based on the first character string;

and based on the storage data corresponding to the third node, creating the first node and path data between the first node and the third node, wherein the path data comprises a mismatch first character and a mismatch position, the mismatch first character is a first mismatch character after the first character string and the storage data corresponding to the third node are matched character by character, and the mismatch position is a character position of a last character of the common prefix in a character string stored in a storage space corresponding to the third node.

In one possible design, the storage unit 910 is specifically configured to:

matching data stored in a storage space corresponding to a root node with a first character string from the root node of a first data structure to obtain target path data, wherein the target path data comprises a target mismatch initial character and a target mismatch position, the target mismatch initial character is a first mismatch character after the stored data corresponding to the root node is matched with the first character string character by character, and the target mismatch position is the character position, in the character string stored in the storage space corresponding to the root node, of the last character of the common prefix of the stored data corresponding to the root node and the first character string;

if the target path data is different from the path data of any leaf node connected to the root node, the root node is a third node;

and if the target path data is the same as the path data of any leaf node connected to the root node, taking the leaf node as the root node, repeating the process until the target path data is different from the path data of any leaf node connected to the root node, and determining a third node.

In one possible design, the storage unit 910 is specifically configured to:

and storing other characters except the first mismatch character in at least one mismatch character of the storage data corresponding to the first character string and the third node into a storage space corresponding to the first node, wherein the at least one mismatch character is a continuous character from the first mismatch character in the first character string to the last character in the first character string.

In one possible design, the storage unit 910 is further configured to:

and determining a storage space corresponding to the first node based on the identification of the first node.

In one possible design, the storage unit 910 is specifically configured to:

mapping the identifier of the first node to a corresponding data bit in the bit array, and setting the data bit to 1; the bit array comprises a plurality of data blocks, each data block comprises a plurality of data bits, and each data block corresponds to a pointer;

in the data block where the data bit corresponding to the first node is located, the number of the data bits which are set to be 1 before the data bit is n, n is larger than or equal to 0, and the memory space behind the nth character string indicated by the pointer corresponding to the data block is used as the storage space corresponding to the first node.

In one possible design, the storage unit 910 is specifically configured to:

based on the second character string, second index information is created.

The electronic device provided in this embodiment can be used to implement the data storage method in any of the above embodiments, and the implementation effect is similar to that of the method embodiment, and is not described here again.

Fig. 10 is a schematic structural diagram of an electronic device 1000 according to an embodiment of the present application, and as shown in fig. 10, the electronic device 1000 includes:

a reading unit 1010, configured to read an identifier of a first node based on second index information corresponding to a second character string, where the second character string is a keyword of the first character string to be read, the second index information is used to represent a link relationship of the second node in a second data structure, the second node is a node corresponding to the second character string in the second data structure, and the first node is a node corresponding to the first character string in the first data structure;

the reading unit 1010 is further configured to read the first character string based on first index information corresponding to the first node, where the first index information is used to represent a link relationship of the first node in the first data structure.

In this embodiment, the electronic device 1000 includes a reading unit 1010, which reads, based on the second data structure, the identifier of the first node corresponding to the second character string according to the second index information, and then reads, based on the first data structure, the first character string according to the first index information corresponding to the identifier of the first node, where at least one of the first data structure and the second data structure is a data structure of a compression path, and based on the stored data that is stored in a compression manner, the real-time reading of the data is achieved.

In one possible design, the reading unit 1010 is specifically configured to:

reading characters stored in a storage space corresponding to the first node;

reading a mismatch initial character and a mismatch position in the path data, wherein the mismatch initial character is a first mismatch character after character-by-character matching of the first character string and the storage data corresponding to the third node, and the mismatch position is a character position of the last character of the common prefix in the character string stored in the storage space corresponding to the third node;

reading a common prefix before the mismatch position from the storage data corresponding to the third node;

and combining the common prefix, the mismatch initial character and the character stored in the storage space corresponding to the first node in sequence to obtain a first character string.

In one possible design, the reading unit 1010 is specifically configured to:

mapping the identifier of the first node to corresponding data bits in a bit array, wherein the bit array comprises a plurality of data blocks, each data block comprises a plurality of data bits, and each data block corresponds to a pointer;

in the data block where the data bit corresponding to the first node is located, the number of the data bits which are set to 1 before the data bit is n, n is larger than or equal to 0, and the (n + 1) th character string in the storage space indicated by the pointer corresponding to the data block is read.

The electronic device provided in this embodiment can be used to implement the data reading method in any of the above embodiments, and the implementation effect is similar to that of the method embodiment, and is not described here again.

Fig. 11 is a schematic hardware structure diagram of an electronic device 1100 according to an embodiment of the present disclosure. As shown in fig. 11, in general, an electronic device 1100 includes: a processor 1110 and a memory 1120.

The processor 1110 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 1110 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1110 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1110 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 1110 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

The memory 1120 may include one or more computer-readable storage media, which may be non-transitory. The memory 1120 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1120 is used to store at least one instruction for execution by processor 1110 to implement the methods provided by the method embodiments herein.

Optionally, as shown in fig. 11, the electronic device 1100 may further include a transceiver 1130, and the processor 1110 may control the transceiver 1130 to communicate with other devices, and in particular, may transmit information or data to the other devices or receive information or data transmitted by the other devices.

The transceiver 1130 may include a transmitter and a receiver. The transceiver 1130 may further include antennas, and the number of antennas may be one or more.

Those skilled in the art will appreciate that the configuration shown in fig. 11 does not constitute a limitation of the electronic device 1100, and may include more or fewer components than those shown, or combine certain components, or employ a different arrangement of components.

Embodiments of the present application further provide a non-transitory computer-readable storage medium, where instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method provided by the foregoing embodiments.

The computer-readable storage medium in this embodiment may be any available medium that can be accessed by a computer or a data storage device such as a server, a data center, etc. that is integrated with one or more available media, and the available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., SSDs), etc.

Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

The embodiment of the present application also provides a computer program product containing instructions, which when run on a computer, causes the computer to execute the method provided by the above embodiment.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. A method for storing data, comprising:

storing a first character string to be stored according to first index information; the first index information is used for representing a link relation of a first node in a first data structure, and the first node is a node corresponding to the first character string in the first data structure;

and storing a second character string to be stored and the identifier of the first node according to second index information, wherein the second character string is a keyword of the first character string, the second index information is used for representing the link relation of a second node in a second data structure, and the second node is a node corresponding to the second character string in the second data structure;

wherein at least one data structure of a compression path exists in the first data structure and the second data structure, and, among a plurality of nodes included in the data structure of the compression path, at least two characters are stored in a storage space corresponding to at least one node.

2. The method of claim 1, further comprising:

and creating the first index information based on the first character string.

3. The method of claim 2, wherein the first index information comprises: an identifier of a third node and path data between the first node and the third node; among the plurality of nodes of the first data structure, the stored data corresponding to the third node shares the longest common prefix with the first character string, and the common prefix is the run of consecutive characters successfully matched when the first character string and the stored data corresponding to the third node are matched character by character.

4. The method of claim 3, wherein creating the first index information based on the first string comprises:

determining the third node based on the first string;

and creating the first node and path data between the first node and the third node based on the stored data corresponding to the third node, wherein the path data comprises a mismatch first character and a mismatch position, the mismatch first character is a first mismatch character after the first character string and the stored data corresponding to the third node are matched character by character, and the mismatch position is a character position of a last character of the common prefix in a character string stored in a storage space corresponding to the third node.

5. The method of claim 4, wherein determining the third node based on the first string comprises:

starting from the root node of the first data structure, matching the data stored in the storage space corresponding to the root node with the first character string to obtain target path data, wherein the target path data comprises a target mismatch first character and a target mismatch position, the target mismatch first character is the first mismatched character after the stored data corresponding to the root node and the first character string are matched character by character, and the target mismatch position is the character position, in the character string stored in the storage space corresponding to the root node, of the last character of the common prefix of the stored data corresponding to the root node and the first character string;

if the target path data is different from the path data of any leaf node connected to the root node, the root node is the third node;

if the target path data is the same as the path data of any leaf node connected to the root node, taking the leaf node as the root node, repeating the process until the target path data is different from the path data of any leaf node connected to the root node, and determining the third node.

6. The method according to any one of claims 3 to 5, wherein storing the first character string to be stored according to the first index information comprises:

storing, into the storage space corresponding to the first node, the characters other than the first mismatched character among at least one mismatched character between the first character string and the stored data corresponding to the third node, wherein the at least one mismatched character is the run of consecutive characters from the first mismatched character in the first character string to the last character in the first character string.

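7. The method of claim 6, further comprising:

determining the storage space corresponding to the first node based on the identifier of the first node.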

8. The method of claim 7, wherein the determining the storage space corresponding to the first node based on the identifier of the first node comprises:

mapping the identifier of the first node to a corresponding data bit in a bit array, and setting the data bit to 1; the bit array comprises a plurality of data blocks, each data block comprises a plurality of data bits, and each data block corresponds to a pointer;

in the data block where the data bit corresponding to the first node is located, the number of data bits set to 1 before that data bit is n, where n is greater than or equal to 0, and the memory space after the nth character string indicated by the pointer corresponding to the data block is used as the storage space corresponding to the first node.

9. The method according to any one of claims 1 to 5, further comprising:

and creating the second index information based on the second character string.

10. A method for reading data, comprising:

reading an identifier of a first node based on second index information corresponding to a second character string, wherein the second character string is a keyword of the first character string to be read, the second index information is used for representing a link relation of a second node in a second data structure, the second node is a node corresponding to the second character string in the second data structure, and the first node is a node corresponding to the first character string in the first data structure;

reading the first character string based on first index information corresponding to the first node, wherein the first index information is used for representing the link relation of the first node in a first data structure;

wherein at least one data structure of a compression path exists in the first data structure and the second data structure, and, among a plurality of nodes included in the data structure of the compression path, the stored data corresponding to at least one node includes at least two characters.

11. The method of claim 10, wherein the first index information comprises: an identifier of a third node and path data between the first node and the third node; among the plurality of nodes of the first data structure, the stored data corresponding to the third node shares the longest common prefix with the first character string, and the common prefix is the run of consecutive characters successfully matched when the first character string and the stored data corresponding to the third node are matched character by character.

12. The method according to claim 11, wherein reading the first character string based on the first index information corresponding to the first node comprises:

reading characters stored in a storage space corresponding to the first node;

reading a mismatch first character and a mismatch position in the path data, wherein the mismatch first character is the first mismatched character after the first character string and the stored data corresponding to the third node are matched character by character, and the mismatch position is the character position of the last character of the common prefix in the character string stored in the storage space corresponding to the third node;

reading the common prefix, which ends at the mismatch position, from the stored data corresponding to the third node;

and combining the common prefix, the mismatch first character and the characters stored in the storage space corresponding to the first node in sequence to obtain the first character string.

13. The method according to claim 12, wherein reading the character stored in the storage space corresponding to the first node comprises:

in the data block where the data bit corresponding to the first node is located, the number of data bits set to 1 before that data bit is n, where n is greater than or equal to 0, and the (n + 1)th character string in the storage space indicated by the pointer corresponding to the data block is read.

14. An electronic device, comprising: a memory and a processor;

the memory stores computer-executable instructions;

the processor executes the computer-executable instructions stored in the memory, causing the processor to perform the method of any one of claims 1 to 13.

15. A storage medium, comprising:

a readable storage medium and a computer program;

the computer program is for implementing the method of any one of claims 1 to 13.
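The following is an illustrative, non-limiting sketch of two mechanisms recited above: splitting a string against the stored data of the third node and recombining it (claims 4, 6 and 12), and locating a node's storage space by counting the data bits set to 1 before its bit within a data block (claims 8 and 13). It is written in Python under several assumptions that do not come from the application itself: the names `split_against`, `join_from` and `BlockedBitStore` are hypothetical, the mismatch position is taken to be the length of the common prefix (that is, the 1-based position of its last character), the node identifier is used directly as the bit index, a data block holds 8 bits, and an ordinary Python list stands in for the memory referenced by each block's pointer.

```python
from typing import List, Tuple


def split_against(stored: str, new: str) -> Tuple[int, str, str]:
    """Return (mismatch_position, mismatch_first_char, remainder).

    Assumed convention: mismatch_position is the length of the common
    prefix, i.e. the 1-based position of its last character in `stored`.
    """
    pos = 0
    while pos < len(stored) and pos < len(new) and stored[pos] == new[pos]:
        pos += 1
    if pos == len(new):
        # This sketch only covers strings that have a mismatched character.
        raise ValueError("new string has no mismatched character")
    return pos, new[pos], new[pos + 1:]


def join_from(stored: str, pos: int, first_char: str, remainder: str) -> str:
    """Rebuild a string: common prefix + mismatch first character + remainder."""
    return stored[:pos] + first_char + remainder


class BlockedBitStore:
    """Bit array divided into fixed-size blocks; each block owns a string list."""

    def __init__(self, num_ids: int, block_size: int = 8) -> None:
        self.block_size = block_size
        self.bits = [0] * num_ids
        num_blocks = (num_ids + block_size - 1) // block_size
        # Each list stands in for the memory a block's pointer refers to.
        self.blocks: List[List[str]] = [[] for _ in range(num_blocks)]

    def _rank(self, node_id: int) -> Tuple[int, int]:
        """Return (block index, count of 1 bits before node_id in its block)."""
        block = node_id // self.block_size
        start = block * self.block_size
        return block, sum(self.bits[start:node_id])

    def store(self, node_id: int, text: str) -> None:
        block, n = self._rank(node_id)
        if self.bits[node_id]:
            self.blocks[block][n] = text        # bit already set: overwrite
        else:
            self.bits[node_id] = 1
            self.blocks[block].insert(n, text)  # placed after the n-th string

    def read(self, node_id: int) -> str:
        if not self.bits[node_id]:
            raise KeyError(node_id)
        block, n = self._rank(node_id)
        return self.blocks[block][n]            # the (n + 1)-th string
```

In this sketch, only the characters after the mismatch first character are kept for the new node, so a string that shares a long prefix with existing data occupies little additional space; the per-block count of set bits replaces a per-node pointer with one pointer per data block.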