CN116073835A - Geographic position data compression method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116073835A
Authority
CN
China
Prior art keywords
character
node
nodes
characters
position data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310203540.9A
Other languages
Chinese (zh)
Other versions
CN116073835B (en)
Inventor
邹炎炎
陶周天
刘祖军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Smartsteps Data Technology Co ltd
Original Assignee
Smartsteps Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Smartsteps Data Technology Co ltd filed Critical Smartsteps Data Technology Co ltd
Priority to CN202310203540.9A priority Critical patent/CN116073835B/en
Publication of CN116073835A publication Critical patent/CN116073835A/en
Application granted granted Critical
Publication of CN116073835B publication Critical patent/CN116073835B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03M: CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00: Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H03M7/3091: Data deduplication

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to the technical field of big data, and provides a geographic position data compression method, a geographic position data compression device, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring a pre-constructed character tree, wherein the character tree is constructed according to all non-repeated characters in a plurality of geographic position data, the character tree comprises a root node, associated nodes except the root node and character nodes, each associated node is used for associating a father node and a child node of the associated node, each associated node except the child node of the root node is also associated with at least one character node, and each character node represents a non-repeated character; determining the code of each character in each geographic position data according to the character tree; and combining the codes of all characters in each geographic position data to obtain the compressed codes of each geographic position data. The invention can effectively compress the geographic position data.

Description

Geographic position data compression method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of big data technologies, and in particular, to a geographic location data compression method, a geographic location data compression device, an electronic device, and a storage medium.
Background
Geographic position data stores the resident information of users. Taking geohash-encoded geographic position data as an example, each person has about 4 geohashes per day; if the time span exceeds one year, the data volume of the resident information becomes very large and occupies a huge amount of storage space. How to reduce the storage space occupied by geographic position data is therefore a problem to be solved urgently by those skilled in the art.
Disclosure of Invention
The invention aims to provide a geographic position data compression method, a geographic position data compression device, electronic equipment and a storage medium, which can effectively compress geographic position data and greatly reduce the storage space occupied by the geographic position data.
Embodiments of the invention may be implemented as follows:
in a first aspect, the present invention provides a method for compressing geographic location data, the method comprising:
acquiring a pre-constructed character tree, wherein the character tree is constructed according to all non-repeated characters in a plurality of geographic position data, the character tree comprises a root node, a plurality of association nodes and at least one character node associated with each association node, each association node is used for associating a father node and a child node of the association node, and each character node represents one non-repeated character;
Determining the code of each character in each geographic position data according to the character tree;
and combining codes of all characters in each geographic position data to obtain the compressed codes of each geographic position data.
In an alternative embodiment, the character tree includes a first branch and a second branch, the root node includes a first child node and a second child node, the first child node and the second child node are respectively associated with the first branch and the second branch, the weight value of the first child node is preset to a first preset value, the weight value of the second child node is preset to a second preset value, the first branch and the second branch each include multiple layers, each layer includes an association node and at least one character node, the weight values of the association nodes of the layers in the same branch are the same, the weight values of the character nodes of the layers in the same branch in the same position are the same, and the step of determining the code of each character in the geographic position data according to the character tree includes:
for any target character in the geographic position data, taking a path with the least path nodes between the root node and the target character nodes as a target path, wherein the target character nodes are character nodes representing the target characters;
According to the path sequence of each node, determining a sequence formed by sequentially arranging the weight values of each node as the code of the target character;
and taking each character of each geographic position data as the target character to obtain the code of each character in each geographic position data.
In an alternative embodiment, the method further comprises:
acquiring a plurality of geographic position data, wherein each geographic position data comprises at least one character;
performing de-duplication on all characters included in all the geographic position data to obtain non-duplicate characters in the geographic position data;
calculating the weight of each character in the non-repeated characters according to the occurrence times of each character in the non-repeated characters in the plurality of geographic position data;
and constructing the character tree according to the weights of all the non-repeated characters.
In an alternative embodiment, the character tree includes a first branch and a second branch, and the step of constructing the character tree according to weights of all the non-repeated characters includes:
generating a first child node and a second child node of the root node, and associating the first child node and the second child node with the first branch and the second branch, respectively;
Generating associated nodes and character nodes of each layer of the first branch and associated nodes and character nodes of each layer of the second branch according to weights of all non-repeated characters based on the first child node and the second child node, and obtaining the character tree;
the weight value of the first child node is preset to a first preset value, the weight value of the second child node is preset to a second preset value, the weight values of the associated nodes of all layers in the same branch are set to be the same, and the weight values of the character nodes at the same position in the layers of the same branch are set to be the same.
In an alternative embodiment, the non-repeated characters form a character sequence in descending order of their weights, the first child node is used as a first father node, and the second child node is used as a second father node;
the step of generating the associated nodes and character nodes of each layer of the first branch and the associated nodes and character nodes of each layer of the second branch based on the first child node and the second child node according to the weights of all the non-repeated characters, and obtaining the character tree comprises the following steps:
acquiring the number of non-repeated characters in the character sequence;
determining, according to the number of characters, the target number of non-repeated characters to be inserted at one time;
if the character sequence is not empty, taking the target number of target non-repeated characters out of the character sequence in descending order of weight;
inserting the target number of the target non-repeated characters into the character tree based on the first father node and the second father node;
generating an associated node of the first father node and an associated node of the second father node, replacing the first father node with the associated node of the first father node, replacing the second father node with the associated node of the second father node, and repeating from the step of acquiring the number of non-repeated characters in the character sequence until the character sequence is empty, so as to obtain the character tree.
In an alternative embodiment, the step of determining the target number of non-repeated characters to be inserted once according to the number of characters includes:
and if the number of the characters is larger than or equal to the number of the reference nodes, the number of the reference nodes is taken as a target number, otherwise, the number of the characters is taken as the target number, and the number of the reference nodes is determined according to the maximum number of the character nodes included in any layer of the first branch and the second branch.
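The rule above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, and the relation "reference node number = maximum character nodes per layer × branch number" is taken from the worked figures in the detailed description (1 × 2 = 2 and 2 × 2 = 4).

```python
from math import ceil


def reference_node_count(max_chars_per_layer: int, branches: int = 2) -> int:
    """Reference node number: character nodes one full round can hold (assumed formula)."""
    return max_chars_per_layer * branches


def target_count(remaining_chars: int, reference: int) -> int:
    """Characters to insert in one round: the reference number, or whatever is left."""
    return reference if remaining_chars >= reference else remaining_chars


def layers_per_branch(total_chars: int, max_chars_per_layer: int, branches: int = 2) -> int:
    """Number of insertion rounds, i.e. layers generated in each branch."""
    return ceil(total_chars / reference_node_count(max_chars_per_layer, branches))


# 32 non-repeated characters: one character node per layer gives 16 layers,
# two character nodes per layer gives 8 layers.
print(layers_per_branch(32, 1), layers_per_branch(32, 2))  # 16 8
```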
In an alternative embodiment, the step of inserting the target number of the target non-repeated characters into the character tree based on the first father node and the second father node includes:
generating the target number of character nodes, these being character nodes of the first father node and character nodes of the second father node;
and assigning values to the character nodes of the second father node and the character nodes of the first father node in turn according to the sequence from big to small of the weights of the target non-repeated characters, wherein the weights of the target non-repeated characters represented by the character nodes of the second father node are larger than those of the target non-repeated characters represented by the character nodes of the first father node, so that the target non-repeated characters are inserted into the character tree.
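The round-by-round construction described above can be sketched in Python as follows. This is a minimal sketch, not the patented implementation: `build_code_table` and `branch_cfg` are hypothetical names, the tree is represented only by the code prefix that leads to the current father node of each branch, ties in weight are broken by character order, and the split of a round between the two father nodes (the larger half goes to the second father node) is an assumption. The weight values in `FIG3_CFG` follow the Fig. 3 example in the detailed description (child nodes 0 and 1, associated nodes 0, character nodes 1).

```python
from math import ceil

# (child node weight, associated node weight, character node weights by position)
FIG3_CFG = {"first": ("0", "0", ["1"]), "second": ("1", "0", ["1"])}


def build_code_table(weights, branch_cfg):
    """Return {character: code} by inserting characters in descending order of weight."""
    seq = sorted(weights, key=lambda ch: (-weights[ch], ch))
    per_layer = max(len(cfg[2]) for cfg in branch_cfg.values())
    reference = per_layer * len(branch_cfg)  # reference node number
    # Code prefix of the current father node in each branch (starts at the root's child).
    prefix = {name: cfg[0] for name, cfg in branch_cfg.items()}
    codes = {}
    while seq:
        target = min(len(seq), reference)  # target number for this round
        batch, seq = seq[:target], seq[target:]
        n_second = ceil(target / 2)  # assumed split between the two father nodes
        for name, chars in (("second", batch[:n_second]), ("first", batch[n_second:])):
            _, assoc_weight, char_weights = branch_cfg[name]
            for ch, char_weight in zip(chars, char_weights):
                codes[ch] = prefix[name] + char_weight
            # Replace the father node with its associated node for the next layer.
            prefix[name] += assoc_weight
    return codes


# Hypothetical weights for four characters.
print(build_code_table({"w": 5, "t": 4, "s": 2, "4": 1}, FIG3_CFG))
# {'w': '11', 't': '01', 's': '101', '4': '001'}
```

With these assumed weights the sketch reproduces the codes quoted for w, t, s and 4 in the detailed description.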
In a second aspect, the present invention provides a geographical position data compression apparatus, the apparatus comprising:
the acquisition module is used for acquiring a pre-constructed character tree, the character tree is constructed according to all non-repeated characters in a plurality of geographic position data, the character tree comprises a root node, a plurality of association nodes and at least one character node associated with each association node, each association node is used for associating a father node and a child node of the association node, and each character node represents one non-repeated character;
The determining module is used for determining the code of each character in each geographic position data according to the character tree;
and the coding module is used for combining codes of all characters in each geographic position data to obtain the compressed codes of each geographic position data.
In a third aspect, the present invention provides an electronic device, including a processor and a memory, where the memory is configured to store a program, and where the processor is configured to implement the geographic location data compression method according to the first aspect in the foregoing embodiment when executing the program.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of geographic location data compression as described in the first aspect of the previous embodiments.
Compared with the prior art, the invention determines the code of each character in each geographic position data by using a pre-constructed character tree, and then combines the codes of all characters in each geographic position data to obtain the compressed code of each geographic position data. The character tree comprises a root node, a plurality of association nodes and at least one character node associated with each association node; each association node is used for associating a father node and a child node of the association node, each association node except the child node of the root node is also associated with at least one character node, and each character node represents one non-repeated character. Because the code of each character in each geographic position data can be determined through the character tree, effective compression of the geographic position data is realized.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart illustrating a geographic location data compression method according to an embodiment of the present invention.
Fig. 2 is an exemplary diagram of a node structure of a character tree according to an embodiment of the present invention.
Fig. 3 is an exemplary diagram of a character tree according to an embodiment of the present invention.
Fig. 4 is an exemplary diagram of another character tree according to an embodiment of the present invention.
Fig. 5 is a diagram illustrating construction examples of character trees with two different structures according to an embodiment of the present invention.
Fig. 6 is a diagram illustrating a process of inserting target non-repeated characters according to an embodiment of the present invention.
Fig. 7 is a diagram illustrating an example of setting weight values for the character tree in fig. 6 according to an embodiment of the present invention.
Fig. 8 is a block diagram of a geographic location data compression device according to an embodiment of the present invention.
Fig. 9 is a block schematic diagram of an electronic device according to an embodiment of the present invention.
Reference numerals: 10 - electronic device; 11 - processor; 12 - memory; 13 - bus; 100 - geographic location data compression device; 110 - acquisition module; 120 - determining module; 130 - encoding module; 140 - construction module.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
In the description of the present invention, it should be noted that, if the terms "upper", "lower", "inner", "outer", and the like indicate an azimuth or a positional relationship based on the azimuth or the positional relationship shown in the drawings, or the azimuth or the positional relationship in which the inventive product is conventionally put in use, it is merely for convenience of describing the present invention and simplifying the description, and it is not indicated or implied that the apparatus or element referred to must have a specific azimuth, be configured and operated in a specific azimuth, and thus it should not be construed as limiting the present invention.
Furthermore, the terms "first," "second," and the like, if any, are used merely for distinguishing between descriptions and not for indicating or implying a relative importance.
It should be noted that the features of the embodiments of the present invention may be combined with each other without conflict.
In the prior art, taking geohash-encoded geographic position data as an example, there are about 4 geohashes per person per day. Suppose the geographic position data has the following format: userId, [geohash_1, geohash_2, geohash_3, geohash_4], where userId is a user identifier and geohash_1 to geohash_4 are the 4 geohashes. With geohash7 encoding, one geohash has 7 characters and occupies 56 bytes. The data volume per day is calculated as: number of people × 4 × 56 bytes. As the number of users and the number of days increase, the data volume keeps growing and occupies more and more storage space.
In order to reduce the storage space occupied by the geographic position data, the embodiment of the invention provides a geographic position data compression method, a geographic position data compression device, electronic equipment and a storage medium, which can effectively compress the geographic position data, greatly reduce the storage space occupied by the geographic position data and are described in detail below.
Referring to fig. 1, fig. 1 is a flowchart illustrating a geographic location data compression method according to an embodiment of the invention, the method includes the following steps:
step S101, a pre-constructed character tree is obtained, the character tree is constructed according to all non-repeated characters in a plurality of geographic position data, the character tree comprises a root node, a plurality of association nodes and at least one character node associated with each association node, each association node is used for associating a father node and a child node of the association node, each association node except the child node of the root node is also associated with at least one character node, and each character node represents a non-repeated character.
In this embodiment, the plurality of geographic location data is to be compressed, and one geographic location data includes a plurality of characters, and the characters may be letters, numbers, or other symbols and combinations thereof, for example, the geographic location data is: wx4unu1, which comprises 7 characters. Characters represented by character nodes in the character tree are non-repeated characters in a plurality of geographic position data, for example, 3 geographic position data are respectively: wx4unu1, wtw37qt, wtw37rt, where the non-repeated characters are: w, x, 4, n, u, 1, t, 3, 7, q, r.
In this embodiment, the character tree is a tree structure comprising a root node, a plurality of association nodes, and character nodes. An association node establishes the association between the parent node of the association node and the child node of the association node; a character node represents a non-repeated character, with one character node representing one non-repeated character; and an association node and the character nodes associated with it have the same parent node. Referring to fig. 2, fig. 2 is an exemplary diagram of a node structure of a character tree provided by the embodiment of the present invention. In fig. 2, child node 1 of the root node associates the root node with associated node 1, child node 2 of the root node associates the root node with associated node 2, associated node 1 associates child node 1 with associated node 3, the character node associated with associated node 1 is character node 1, and the character node associated with associated node 2 is character node 2.
It should be noted that fig. 2 is merely an exemplary diagram of a character tree, and in fact, the character tree may include more associated nodes and character nodes.
Step S102, determining the codes of each character in each geographic position data according to the character tree.
In this embodiment, each associated node and each character node in the character tree is preset with a weight value. In one implementation, the code of the character represented by a character node is determined from the weight values of the nodes on the path from the root node to that character node. Referring to fig. 3, fig. 3 is an exemplary diagram of a character tree. In fig. 3, the weight value of the left child node of the root node is 0 and the weight value of the right child node of the root node is 1; in the first branch, the weight value of each associated node is 0 and the weight value of each character node is 1; in the second branch, the weight value of each associated node is likewise 0 and the weight value of each character node is 1. For the character t, arranging the weight values of the nodes on the path from the root node to the character node of t gives the code of t: 01. Likewise, the character 4 is encoded as 001.
Step S103, the codes of all characters in each geographic position data are combined, and the codes after compression of each geographic position data are obtained.
In this embodiment, the codes of the characters may be combined in sequence according to the positions of the characters in each geographic position data to obtain the compressed code of each geographic position data. For example, if the geographic position data is wts44, and w is encoded as 11, t as 01, s as 101 and 4 as 001, the compressed code of the geographic position data is: 1101101001001.
according to the method provided by the embodiment, the codes of each character in each geographic position data can be determined through the character tree, and finally, the codes of all the characters in each geographic position data are combined to obtain the codes after compression of each geographic position data, so that the effective compression of the geographic position data is realized.
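Steps S102 and S103 can be illustrated with a short Python sketch that uses the per-character codes quoted in the example above. The names `CODE_TABLE` and `compress` are hypothetical, and the code table is written out by hand rather than read from a character tree.

```python
# Per-character codes quoted in the example above.
CODE_TABLE = {"w": "11", "t": "01", "s": "101", "4": "001"}


def compress(geo: str, table: dict) -> str:
    """Concatenate the code of each character in the order the characters appear."""
    return "".join(table[ch] for ch in geo)


print(compress("wts44", CODE_TABLE))  # 1101101001001
```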
In this embodiment, in order to quickly determine the most reasonable and efficient encoding of a character in each geographic location data from a character tree, this embodiment provides an implementation of determining the encoding of any target character in each geographic location data:
firstly, taking a path with the least path nodes between a root node and a target character node as a target path, wherein the target character node is a character node representing a target character;
In this embodiment, the character tree includes a first branch and a second branch, and the root node includes a first child node and a second child node, which are associated with the first branch and the second branch respectively. The weight value of the first child node is preset to a first preset value, and the weight value of the second child node is preset to a second preset value. The first branch and the second branch each include multiple layers, and each layer includes an associated node and at least one character node; within the same branch, the associated nodes of all layers have the same weight value, and the character nodes at the same position in different layers have the same weight value. In the character tree of fig. 3, the first preset value is 0, the second preset value is 1, each associated node is associated with one character node, the weight values of the associated nodes of the first branch and the second branch are both 0, and the weight values of the character nodes are both 1. Besides the case in fig. 3, where each associated node is associated with one character node, an associated node may be associated with a plurality of character nodes. Referring to fig. 4, fig. 4 is an exemplary diagram of another character tree provided in the embodiment of the present invention. In fig. 4, the first preset value is 0, the second preset value is 1, and each associated node is associated with two character nodes. For the first branch, the weight values of the associated nodes of all layers are the same, namely 1, and the weight values of the character nodes of any layer, from right to left, are: 00, 01. For the second branch, the weight values of the associated nodes of all layers are the same, namely 0, and the weight values of the character nodes of any layer, from right to left, are: 11, 10.
Secondly, determining a sequence formed by sequentially arranging weight values of all nodes as a code of a target character according to the path sequence of all nodes;
in this embodiment, for the geographic position data wx4unu1, if w is the target character, the coding of w is as follows according to the character tree of fig. 4: 111.
and finally, taking each character of each geographic position data as a target character to obtain the code of each character in each geographic position data.
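The path-based coding above can be checked with a small Python sketch. It is an illustration only: `path_code` is a hypothetical helper, the path is assumed to run from the root's child node through the associated nodes of the earlier layers to the character node, and the layer in which each character sits is inferred from the codes quoted in the text rather than read from the figures.

```python
def path_code(child_weight: str, assoc_weight: str, char_weight: str, layer: int) -> str:
    """Concatenate node weights along root -> child node -> associated nodes -> character node.

    A character node in layer k is reached after passing k - 1 associated nodes.
    """
    if layer < 1:
        raise ValueError("layer numbering starts at 1")
    return child_weight + assoc_weight * (layer - 1) + char_weight


# Fig. 3 weights: child nodes 0 / 1, associated nodes 0, character nodes 1.
print(path_code("0", "0", "1", 1))  # 01  (character t)
print(path_code("0", "0", "1", 2))  # 001 (character 4)
# Fig. 4 weights, second branch: child node 1, character node weight 11 in the first layer.
print(path_code("1", "0", "11", 1))  # 111 (character w)
```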
In this embodiment, different geographic position data to be compressed contain non-repeated characters with different numbers of occurrences. To obtain a better compression effect, the code of a character that occurs more often is made as short as possible, and the code of a character that occurs less often may be made somewhat longer. Therefore, this embodiment further provides an implementation manner for constructing a character tree:
firstly, acquiring a plurality of geographic position data, wherein each geographic position data comprises at least one character;
secondly, performing de-duplication on all characters included in all the geographic position data to obtain non-duplicated characters in the geographic position data;
Thirdly, calculating the weight of each character in the non-repeated characters according to the occurrence times of each character in the non-repeated characters in the plurality of geographic position data;
in the present embodiment, one way of calculating the weight is: counting the number of occurrences of all non-repeated characters to obtain the total number of occurrences; normalizing the number of occurrences of each non-repeated character by the total number of occurrences to obtain a normalized value for each non-repeated character; and taking the normalized value of each non-repeated character as the weight of that character.
And finally, constructing a character tree according to the weights of all the non-repeated characters.
In this embodiment, the larger the weight is, the closer the corresponding character node in the character tree is to the root node, and the shorter the encoding length is.
According to the method provided by the embodiment, the weight of each character is calculated from the number of times the character is used in the geographic position data, so that the finally constructed character tree reflects the frequency of use of the characters in the geographic position data. The larger the weight of a non-repeated character, the closer its character node is to the root node and the shorter its code, so that the finally obtained code of the geographic position data is shorter and a better compression effect is achieved.
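The de-duplication and weight calculation can be sketched in Python as follows, using the three geohashes from the earlier example (wx4unu1, wtw37qt, wtw37rt). The function name `character_weights` is hypothetical, and breaking ties by character order when sorting is an assumption, since the text does not say how equal weights are ordered.

```python
from collections import Counter


def character_weights(geo_data):
    """Map each non-repeated character to its normalized occurrence count."""
    counts = Counter(ch for item in geo_data for ch in item)  # keys are the de-duplicated characters
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


weights = character_weights(["wx4unu1", "wtw37qt", "wtw37rt"])
# Character sequence in descending order of weight (ties broken by character, an assumption).
ordered = sorted(weights, key=lambda ch: (-weights[ch], ch))
print(len(weights), ordered[:2])  # 11 ['w', 't']
```

Here w occurs 5 times and t occurs 4 times out of 21 characters, so they receive the largest weights and would be placed closest to the root node.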
In an alternative embodiment, one way to construct the character tree based on the weights of all non-repeating characters is:
firstly, generating a first child node and a second child node of a root node, and respectively associating the first child node and the second child node with a first branch and a second branch;
in this embodiment, the root node has two child nodes: a first child node and a second child node.
Secondly, based on the first sub-node and the second sub-node, generating associated nodes and character nodes of each layer of the first branch and associated nodes and character nodes of each layer of the second branch according to weights of all non-repeated characters to obtain a character tree;
in this embodiment, the first branch and the second branch may each include multiple layers. Except for the last layer (i.e., the layer farthest from the root node), all layers have the same structure, that is, the same associated node and character nodes; the structure of the last layer may be the same as or different from that of the other layers, depending on the number of non-repeated characters and the number of character nodes of each layer of each branch. Referring to fig. 5, fig. 5 is a diagram illustrating an example of the construction of two character trees with different structures according to an embodiment of the present invention. In fig. 5, construction mode one: the number of non-repeated characters is 10 and the number of character nodes of each layer of each branch is 1, so the character tree comprises 5 layers, and the number of associated nodes and character nodes of each layer is the same; construction mode two: the number of non-repeated characters is 11, so the first branch of the character tree comprises 3 layers, of which the last layer comprises 1 character node and the other layers comprise 2 character nodes each, and the second branch comprises 3 layers, each comprising 2 character nodes.
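The two layouts described for fig. 5 can be reproduced with a small Python sketch. It is a sketch under assumptions: `layer_layout` is a hypothetical helper, and the rule that the larger half of each round goes to the second branch is chosen so that the result matches the layer counts stated above; the text does not spell this rule out.

```python
from math import ceil


def layer_layout(total_chars: int, max_chars_per_layer: int):
    """Return (first_branch, second_branch): character nodes per layer in each branch."""
    first, second = [], []
    reference = 2 * max_chars_per_layer  # two branches
    remaining = total_chars
    while remaining > 0:
        batch = min(remaining, reference)
        to_second = ceil(batch / 2)  # assumed split: larger half to the second branch
        second.append(to_second)
        if batch - to_second > 0:
            first.append(batch - to_second)
        remaining -= batch
    return first, second


print(layer_layout(10, 1))  # ([1, 1, 1, 1, 1], [1, 1, 1, 1, 1])
print(layer_layout(11, 2))  # ([2, 2, 1], [2, 2, 2])
```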
To generate the character tree more efficiently, the non-repeated characters are arranged into a character sequence in descending order of weight, the child node of the root node associated with the first branch is taken as a first parent node, and the child node of the root node associated with the second branch is taken as a second parent node. The character tree may then be obtained as follows:
(1) Acquiring the number of non-repeated characters in the character sequence;
(2) Determining, according to the number of characters, the target number of non-repeated characters to be inserted at a time;
in an alternative embodiment, the target number of non-repeated characters to be inserted at a time is determined from the number of characters as follows:
if the number of characters is greater than or equal to the reference node number, the reference node number is taken as the target number; otherwise, the number of characters is taken as the target number. The reference node number is determined according to the maximum number of character nodes included in any layer of either of the first branch and the second branch.
In this embodiment, as one way, reference node number = maximum number of character nodes included in any layer of the first branch and the second branch × branch number. The more layers the character tree has, the less efficient it is to determine the code of a target character from the character tree. For example, suppose the total number of non-repeated characters is 32 and the branch number is 2. If the maximum number of character nodes of each layer in each branch is 1, the reference node number is 1×2=2 and each branch has 16 layers, which may greatly reduce processing efficiency; if the maximum number of character nodes of each layer in each branch is 2, the reference node number is 2×2=4 and each branch has 8 layers, so the number of layers is greatly reduced.
(3) If the character sequence is not empty, taking the target number of target non-repeated characters out of the character sequence in descending order of weight;
(4) Inserting the target number of target non-repeated characters into the character tree based on the first parent node and the second parent node;
in an alternative embodiment, one way of insertion is: generating character nodes of the first parent node and character nodes of the second parent node; then, in descending order of the weights of the target non-repeated characters, assigning values first to the character nodes of the second parent node and then to the character nodes of the first parent node, so that the target non-repeated characters represented by the character nodes of the second parent node have larger weights than those represented by the character nodes of the first parent node, thereby inserting the target non-repeated characters into the character tree.
As one implementation, if the first branch is the left branch and the second branch is the right branch, the target non-repeated characters are inserted into the character tree in the order of right branch first and then left branch; and if each layer of each branch has a plurality of character nodes, the character nodes within the right branch or the left branch are likewise filled right first and then left.
(5) Generating an associated node of the first parent node and an associated node of the second parent node, replacing the first parent node with its associated node and the second parent node with its associated node, and repeating steps (1)-(5) until the character sequence is empty, thereby obtaining the character tree.
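The sizing rule in steps (1) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the tree is assumed to have two branches, and the layer count is taken to be the number of insertion rounds, which is the depth of the deeper branch.

```python
import math

def reference_node_count(max_char_nodes_per_layer: int, branch_count: int = 2) -> int:
    # Reference node number = character nodes per layer of a branch x branch number.
    return max_char_nodes_per_layer * branch_count

def target_count(remaining_chars: int, reference_nodes: int) -> int:
    # Step (2): insert a full layer if enough characters remain, otherwise the rest.
    return reference_nodes if remaining_chars >= reference_nodes else remaining_chars

def max_layers(total_chars: int, reference_nodes: int) -> int:
    # Each round fills one layer of both branches, so rounds = layers of the deeper branch.
    return math.ceil(total_chars / reference_nodes)

# The 32-character example from the description, with 1 and 2 character nodes per layer.
for per_layer in (1, 2):
    ref = reference_node_count(per_layer)
    print(per_layer, ref, max_layers(32, ref))
```

With 32 non-repeated characters the loop reports 16 layers for one character node per layer and 8 layers for two, matching the figures given in the description.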
To describe the character tree generation process more clearly, please refer to fig. 6, which is an example diagram of the process of inserting target non-repeated characters provided in an embodiment of the present invention. In fig. 6, the character sequence is {w, t, s, 4, 7, c, 5, 9, 6}, the number of branches is 2, the maximum number of character nodes of each layer in each branch is 2, and the reference node number is 2×2=4. The first time, w, t, s, 4 are taken from the character sequence and inserted into character nodes in turn, and associated nodes are then generated; the second time, 7, c, 5, 9 are taken from the character sequence and inserted into character nodes in turn, and associated nodes are then generated; the third time, 6 is taken from the character sequence and inserted into a character node of the right branch, giving the character tree.
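The insertion loop of steps (1)-(5) can be sketched as follows for the fig. 6 sequence. This is a simplified reading, not the patented implementation: the function name and the dictionary layout are hypothetical, each branch is modelled as a plain list of layers rather than as linked associated nodes, and a layer that would receive no character is simply omitted.

```python
def build_layers(sequence, max_char_nodes_per_layer=2):
    """Split a weight-ordered character sequence into per-branch layers."""
    reference_nodes = max_char_nodes_per_layer * 2  # two branches assumed
    remaining = list(sequence)
    layers = {"first": [], "second": []}
    while remaining:
        count = len(remaining)                                          # step (1)
        target = reference_nodes if count >= reference_nodes else count  # step (2)
        batch, remaining = remaining[:target], remaining[target:]       # step (3)
        # Step (4): the heaviest characters go to the second (right) branch first.
        second = batch[:max_char_nodes_per_layer]
        first = batch[max_char_nodes_per_layer:]
        layers["second"].append(second)
        if first:
            layers["first"].append(first)
        # Step (5): the next iteration corresponds to descending one associated node.
    return layers

fig6 = ["w", "t", "s", "4", "7", "c", "5", "9", "6"]
print(build_layers(fig6))
```

For the fig. 6 sequence the second branch receives w, t, then 7, c, then 6, and the first branch receives s, 4, then 5, 9, which agrees with the three rounds described above.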
Finally, the weight value of the first child node is set to a first preset value and the weight value of the second child node is set to a second preset value; the weight values of the associated nodes of all layers in the same branch are set to be the same, and the weight values of character nodes at the same position in all layers of the same branch are set to be the same.
Referring to fig. 7, fig. 7 is an example diagram of weight setting for the final character tree generated in fig. 6 according to an embodiment of the present invention. The codes of the non-repeated characters under this weight setting are shown in Table 1.
Table 1
Non-repeated character    Encoding
w                         111
t                         110
s                         000
4                         010
7                         1011
c                         1010
5                         0100
9                         0101
6                         10011
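Combining per-character codes into a compressed code amounts to a table lookup followed by concatenation. The sketch below copies the codes of Table 1 into a dictionary; the function name `compress` and the input string "ws4" are made up for illustration, and only the nine characters of the fig. 6 example are covered.

```python
# Codes copied from Table 1 (fig. 6 / fig. 7 example only).
TABLE_1 = {
    "w": "111", "t": "110", "s": "000", "4": "010",
    "7": "1011", "c": "1010", "5": "0100", "9": "0101",
    "6": "10011",
}

def compress(geohash: str, code_table: dict) -> str:
    # Look up each character's code and join the codes in order.
    # A character missing from the table raises KeyError.
    return "".join(code_table[ch] for ch in geohash)

print(compress("ws4", TABLE_1))  # 111 + 000 + 010
```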
With the approach of this embodiment, geographic position data can be compressed effectively. For example, before compression, wx4unu1 is converted into binary as 8-bit character codes: "01110111 01111000 00110100 01110101 01101110 01110101 00110001", with a length of 56 bits; with the encoding of this embodiment, after compression: 111 011111101 001 100000011 01111100100000011 1000010, with a length of 48 bits, which is 8 bits shorter than the existing approach. The storage saved grows as the time span and the number of people increase. For example: final storage = geohash length (bits) × 4 × 365 × 300 million, where 4 means that a person has 4 geohashes per day, 365 is the number of days in a year, 300 million is the number of people, and the geohash length is the encoded length. Dividing by 8 converts bits to bytes, and dividing by 1024 four times converts bytes to TB. The compression ratio is calculated as follows:
Before compression: 56/8 × 4 × 365 × 300000000/1024/1024/1024/1024 = 2.788 TB
After compression: 48/8 × 4 × 365 × 300000000/1024/1024/1024/1024 = 2.390 TB
Compression ratio: 2.390/2.788 ≈ 0.857
Therefore, the compression mode provided by the embodiment can obtain a better compression ratio.
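The storage estimate above can be reproduced with a short calculation. The function name and parameter names are hypothetical; the constants (4 geohashes per person per day, 365 days, 300,000,000 people, 1024-based TB) are the ones used in the worked figures.

```python
def yearly_storage_tb(bits_per_geohash, per_day=4, days=365, people=300_000_000):
    # bits -> bytes, scaled by records per year, then bytes -> TB (1024^4).
    total_bytes = bits_per_geohash / 8 * per_day * days * people
    return total_bytes / 1024 ** 4

before = yearly_storage_tb(56)   # 7 characters at 8 bits each
after = yearly_storage_tb(48)    # length of the compressed code
ratio = after / before           # compressed size relative to original size
print(f"{before:.3f} TB, {after:.3f} TB, ratio {ratio:.3f}")
```

Because every other factor cancels, the ratio depends only on the two code lengths, 48/56.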
When the compressed data is restored, the compressed data is restored to the corresponding character through the encoding table generated by the character tree in the embodiment, and then translated to the geohash value.
In order to perform the respective steps of the above-described embodiments and their various possible implementations, an implementation of a geographic location data compression device is presented below. Referring to fig. 8, fig. 8 is a block diagram of a geographic position data compression device 100 according to an embodiment of the invention. It should be noted that the basic principle and the technical effects of the geographic position data compression device 100 provided in this embodiment are the same as those of the foregoing embodiments; for brevity, for anything not mentioned in this embodiment, refer to the corresponding content of the foregoing embodiments.
The geographic location data compression device 100 includes an acquisition module 110, a determination module 120, an encoding module 130, and a construction module 140.
The acquisition module 110 is configured to obtain a pre-constructed character tree, where the character tree is constructed according to all non-repeated characters in the plurality of geographic location data, and the character tree includes a root node, a plurality of association nodes, and at least one character node associated with each association node, where each association node is configured to associate a parent node and a child node of the association node, and each character node represents a non-repeated character;
a determining module 120, configured to determine, according to the character tree, a code of each character in each geographic location data;
the encoding module 130 is configured to combine the codes of all the characters in each geographic location data to obtain a compressed code of each geographic location data.
In an alternative embodiment, the character tree includes a first branch and a second branch, and the root node includes a first child node and a second child node, which are associated with the first branch and the second branch, respectively. The weight value of the first child node is preset to a first preset value, and the weight value of the second child node is preset to a second preset value. The first branch and the second branch each include multiple layers, each layer includes an association node and at least one character node, the weight values of the association nodes in each layer of the same branch are the same, and the weight values of the character nodes in each layer of the same branch are the same. The determining module 120 is specifically configured to: for any target character in each geographic position data, take the path with the fewest nodes between the root node and the target character node as a target path, where the target character node is the character node representing the target character; determine, as the code of the target character, the sequence formed by arranging the weight values of the nodes in the order in which they appear on the target path; and take each character of each geographic position data as a target character to obtain the code of each character in each geographic position data.
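Reading a code off the tree, as the determining module does, can be sketched with parent links. Everything here is hypothetical: the `Node` class, the characters "a" and "b", and the weight values are placeholders and do not reproduce the weights of fig. 7. Since a tree has exactly one path from the root to any node, walking the parent links yields the path with the fewest nodes.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    weight: str                      # weight value; empty string for the root
    parent: Optional["Node"] = None
    char: Optional[str] = None       # set only on character nodes

def code_for(node: Node) -> str:
    # Collect weights from the character node up to the root, then reverse
    # so the code reads in root-to-character path order.
    parts = []
    current = node
    while current is not None:
        parts.append(current.weight)
        current = current.parent
    return "".join(reversed(parts))

# A tiny second branch with placeholder weights.
root = Node("")
second_child = Node("1", root)            # child of the root for the second branch
assoc = Node("0", second_child)           # associated node leading to the next layer
char_a = Node("1", second_child, "a")     # character node in the first layer
char_b = Node("1", assoc, "b")            # character node one layer deeper

print(code_for(char_a), code_for(char_b))
```

A character one layer deeper picks up the associated node's weight as an extra symbol, which is why characters inserted later receive longer codes.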
In an alternative embodiment, the construction module 140 is configured to: acquiring a plurality of geographic position data, wherein each geographic position data comprises at least one character; performing de-duplication on all characters included in all the geographic position data to obtain non-duplicate characters in the plurality of geographic position data; calculating the weight of each character in the non-repeated characters according to the occurrence times of each character in the non-repeated characters in the plurality of geographic position data; and constructing a character tree according to the weights of all the non-repeated characters.
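The de-duplication and weighting performed by the construction module can be sketched with a counter. The assumptions are that a character's weight is simply its occurrence count and that ties are broken alphabetically; the description fixes neither detail, and the function name and sample inputs are made up.

```python
from collections import Counter

def character_sequence(geohashes):
    """Return (characters in descending weight order, weight per character)."""
    # Counting over all strings de-duplicates the characters and records
    # how often each one occurs.
    weights = Counter(ch for g in geohashes for ch in g)
    # Sort by descending weight; alphabetical tie-break is an assumption.
    ordered = sorted(weights, key=lambda ch: (-weights[ch], ch))
    return ordered, weights

ordered, weights = character_sequence(["wx4", "wx", "w"])
print(ordered, dict(weights))
```

The resulting ordered list plays the role of the character sequence consumed by the insertion steps.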
In an alternative embodiment, the character tree includes a first branch and a second branch, and the construction module 140 is specifically configured to: generating a first child node and a second child node of the root node, and associating the first child node and the second child node with the first branch and the second branch respectively; based on the first child node and the second child node, generating associated nodes and character nodes of each layer of the first branch and associated nodes and character nodes of each layer of the second branch according to weights of all non-repeated characters to obtain a character tree; the weight value of the first sub-node is set to be a first preset value, the weight value of the second sub-node is preset to be a second preset value, the weight values of the associated nodes of all layers in the same branch are set to be the same, and the weight values of the character nodes with the same positions of all layers in the same branch are set to be the same.
In an alternative embodiment, the non-repeated characters form a character sequence in descending order of their weights, the first child node is taken as a first parent node, and the second child node is taken as a second parent node;
when generating, based on the first child node and the second child node, the associated nodes and character nodes of each layer of the first branch and the associated nodes and character nodes of each layer of the second branch according to the weights of all non-repeated characters to obtain the character tree, the construction module 140 is specifically configured to: acquire the number of non-repeated characters in the character sequence; determine, according to the number of characters, the target number of non-repeated characters to be inserted at a time; if the character sequence is not empty, take the target number of target non-repeated characters out of the character sequence in descending order of weight; insert the target number of target non-repeated characters into the character tree based on the first parent node and the second parent node; generate an associated node of the first parent node and an associated node of the second parent node, replace the first parent node with its associated node and the second parent node with its associated node, and repeat from the step of acquiring the number of non-repeated characters in the character sequence until the character sequence is empty, thereby obtaining the character tree.
In an alternative embodiment, when determining, according to the number of characters, the target number of non-repeated characters to be inserted at a time, the construction module 140 is specifically configured to: if the number of characters is greater than or equal to the reference node number, take the reference node number as the target number; otherwise, take the number of characters as the target number, where the reference node number is determined according to the maximum number of character nodes included in any layer of the first branch and the second branch.
In an alternative embodiment, when inserting the target number of target non-repeated characters into the character tree based on the first parent node and the second parent node, the construction module 140 is specifically configured to: generate character nodes of the first parent node and character nodes of the second parent node; and, in descending order of the weights of the target non-repeated characters, assign values first to the character nodes of the second parent node and then to the character nodes of the first parent node, so that the target non-repeated characters represented by the character nodes of the second parent node have larger weights than those represented by the character nodes of the first parent node, thereby inserting the target non-repeated characters into the character tree.
Referring to fig. 9, fig. 9 is a schematic block diagram of the electronic device 10 according to the embodiment of the present invention, and the electronic device 10 includes a processor 11, a memory 12, and a bus 13. The processor 11 and the memory 12 are connected by a bus 13.
The processor 11 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 11 or by instructions in the form of software. The processor 11 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The memory 12 is used for storing a program, for example the geographic location data compression device 100 in fig. 8. The geographic location data compression device 100 includes at least one software functional module that may be stored in the memory 12 in the form of software or firmware, and the processor 11 executes the program after receiving an execution instruction to implement the geographic location data compression method of the embodiments of the present invention.
The memory 12 may include high-speed random access memory (Random Access Memory, RAM) and may also include non-volatile memory. Alternatively, the memory 12 may be a storage device built into the processor 11, or may be a storage device independent of the processor 11.
The bus 13 may be an ISA bus, a PCI bus, an EISA bus, or the like. In fig. 9 the bus is represented by a single double-headed arrow, but this does not mean that there is only one bus or one type of bus.
An embodiment of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a geographical location data compression method as in the previous embodiments.
In summary, the embodiments of the present invention provide a geographic location data compression method, a geographic location data compression device, an electronic device, and a storage medium, where the method includes: acquiring a pre-constructed character tree, wherein the character tree is constructed according to all non-repeated characters in a plurality of geographic position data, the character tree comprises a root node, a plurality of association nodes and at least one character node associated with each association node, each association node is used for associating a father node and a child node of the association node, and each character node represents a non-repeated character; determining the code of each character in each geographic position data according to the character tree; and combining the codes of all characters in each geographic position data to obtain the compressed codes of each geographic position data. Compared with the prior art, the embodiment of the invention can determine the codes of each character in each geographic position data through the character tree, finally combines the codes of all the characters in each geographic position data to obtain the codes after compressing each geographic position data, and realizes the effective compression of the geographic position data.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (10)

1. A method of geographic location data compression, the method comprising:
acquiring a pre-constructed character tree, wherein the character tree is constructed according to all non-repeated characters in a plurality of geographic position data, the character tree comprises a root node, a plurality of association nodes and at least one character node associated with each association node, each association node is used for associating a father node and a child node of the association node, and each character node represents one non-repeated character;
determining the code of each character in each geographic position data according to the character tree;
and combining codes of all characters in each geographic position data to obtain the compressed codes of each geographic position data.
2. The method of compressing geographical position data of claim 1, wherein the character tree includes a first branch and a second branch, the root node includes a first child node and a second child node, the first child node and the second child node are respectively associated with the first branch and the second branch, the weight value of the first child node is preset to a first preset value, the weight value of the second child node is preset to a second preset value, the first branch and the second branch each include multiple layers, each layer includes an associated node and at least one character node, the weight values of the associated nodes of the layers of the same branch are the same, the weight values of the character nodes of the layers of the same branch are the same, and the step of determining the code of each character in the geographical position data according to the character tree includes:
For any target character in the geographic position data, taking a path with the least path nodes between the root node and the target character nodes as a target path, wherein the target character nodes are character nodes representing the target characters;
according to the path sequence of each node, determining a sequence formed by sequentially arranging the weight values of each node as the code of the target character;
and taking each character of each geographic position data as the target character to obtain the code of each character in each geographic position data.
3. A method of compressing geographical location data as recited in claim 1, wherein the method further comprises:
acquiring a plurality of geographic position data, wherein each geographic position data comprises at least one character;
performing de-duplication on all characters included in all the geographic position data to obtain non-duplicate characters in the geographic position data;
calculating the weight of each character in the non-repeated characters according to the occurrence times of each character in the non-repeated characters in the plurality of geographic position data;
and constructing the character tree according to the weights of all the non-repeated characters.
4. A geographical location data compression method as recited in claim 3, wherein the character tree comprises a first branch and a second branch, and wherein constructing the character tree based on weights of all the non-repeating characters comprises:
generating a first child node and a second child node of the root node, and associating the first child node and the second child node with the first branch and the second branch, respectively;
generating associated nodes and character nodes of each layer of the first branch and associated nodes and character nodes of each layer of the second branch according to weights of all non-repeated characters based on the first child nodes and the second child nodes, so as to obtain the character tree;
the weight value of the first child node is set to a first preset value, the weight value of the second child node is set to a second preset value, the weight values of the associated nodes of all layers in the same branch are set to be the same, and the weight values of the character nodes with the same positions of all layers in the same branch are set to be the same.
5. The geographical location data compression method of claim 4, wherein the non-repeated characters are grouped into character sequences in order of their weights from large to small, and wherein the first child node is used as a first parent node and the second child node is used as a second parent node;
The step of generating the associated nodes and character nodes of each layer of the first branch and the associated nodes and character nodes of each layer of the second branch based on the first child node and the second child node according to the weights of all the non-repeated characters, and obtaining the character tree comprises the following steps:
acquiring the number of non-repeated characters in the character sequence;
determining, according to the number of characters, the target number of non-repeated characters to be inserted at a time;
if the character sequence is not empty, taking the target non-repeated characters out of the character sequence in descending order of weight;
inserting the target number of the target non-repeated characters into the character tree based on the first parent node and the second parent node;
generating an associated node of the first parent node and an associated node of the second parent node, replacing the first parent node with the associated node of the first parent node, replacing the second parent node with the associated node of the second parent node, and repeating from the step of acquiring the number of non-repeated characters in the character sequence until the character sequence is empty, so as to obtain the character tree.
6. The geographical location data compression method of claim 5, wherein the determining a target number of non-repeated characters to be inserted once based on the number of characters comprises:
if the number of characters is greater than or equal to the reference node number, taking the reference node number as the target number; otherwise, taking the number of characters as the target number, wherein the reference node number is determined according to the maximum number of character nodes included in any layer of the first branch and the second branch.
7. The method of geographic location data compression as claimed in claim 5, wherein the step of inserting the target number of the target non-repeated characters into the character tree based on the first parent node and the second parent node comprises:
generating character nodes of the first parent node and generating character nodes of the second parent node;
and assigning values to the character nodes of the second parent node and the character nodes of the first parent node in turn, in descending order of the weights of the target non-repeated characters, wherein the weights of the target non-repeated characters represented by the character nodes of the second parent node are larger than those of the target non-repeated characters represented by the character nodes of the first parent node, so that the target non-repeated characters are inserted into the character tree.
8. A geographical location data compression device, the device comprising:
the acquisition module is used for acquiring a pre-constructed character tree, the character tree is constructed according to all non-repeated characters in a plurality of geographic position data, the character tree comprises a root node, a plurality of association nodes and at least one character node associated with each association node, each association node is used for associating a father node and a child node of the association node, and each character node represents one non-repeated character;
the determining module is used for determining the code of each character in each geographic position data according to the character tree;
and the coding module is used for combining codes of all characters in each geographic position data to obtain the compressed codes of each geographic position data.
9. An electronic device comprising a processor and a memory, the memory for storing a program, the processor for implementing the geographic location data compression method of any of claims 1-7 when the program is executed.
10. A computer readable storage medium, having stored thereon a computer program which, when executed by a processor, implements a geographical location data compression method as claimed in any one of claims 1-7.
CN202310203540.9A 2023-03-06 2023-03-06 Geographic position data compression method and device, electronic equipment and storage medium Active CN116073835B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310203540.9A CN116073835B (en) 2023-03-06 2023-03-06 Geographic position data compression method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116073835A true CN116073835A (en) 2023-05-05
CN116073835B CN116073835B (en) 2023-08-25

Family

ID=86175063

Country Status (1)

Country Link
CN (1) CN116073835B (en)

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant