CN114490546A - Track data compression method and device, electronic equipment and storage medium - Google Patents

Track data compression method and device, electronic equipment and storage medium

Info

Publication number
CN114490546A
Authority
CN
China
Prior art keywords
data
compressed
sub
type
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210102571.0A
Other languages
Chinese (zh)
Inventor
刘钧文
俞自生
李瑞远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong City Beijing Digital Technology Co Ltd
Original Assignee
Jingdong City Beijing Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong City Beijing Digital Technology Co Ltd
Priority to CN202210102571.0A
Publication of CN114490546A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1744 Redundancy elimination performed by the file system using compression, e.g. sparse files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Abstract

Embodiments of the invention relate to a track data compression method and device, electronic equipment and a storage medium. Track data to be compressed is divided into different sub data sets according to data content; a data type corresponding to each sub data set is determined, and, according to the data type, a data common part shared by all data in the sub data set and a data characteristic part corresponding to each piece of data are determined, the data common part and the data characteristic parts together forming the compressed sub data set; the compressed sub data sets are then integrated to obtain compressed track data. The embodiments exploit the lossless compression potential of each data type, improve compression performance, and support efficient, fine-grained query operations on the track data.

Description

Track data compression method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and an apparatus for compressing track data, an electronic device, and a storage medium.
Background
With the advent of the big data era, compressing data relieves the storage pressure on storage devices. In the prior art, track data is generally compressed losslessly as a whole.
In the process of implementing the invention, the inventors found at least the following technical problems in the prior art: the compression performance is poor, and specific information cannot be queried efficiently.
Disclosure of Invention
The invention provides a track data compression method and device, electronic equipment and a storage medium, and aims to solve the technical problems that track data compression performance is poor and that specific information cannot be queried efficiently.
In a first aspect, the present invention provides a method for compressing track data, including: dividing track data to be compressed into different subdata sets according to data contents; determining a data type corresponding to the sub data set, and determining a data common part corresponding to all data in the sub data set and a data characteristic part corresponding to each data according to the data type, wherein the data common part and the data characteristic part corresponding to each data form the compressed sub data set; and integrating the compressed subdata sets to obtain compressed track data.
As an alternative embodiment, the data content comprises at least one of: spatial data, temporal data, and other data; the data types include at least one of: floating point, long integer, string, and integer.
As an optional embodiment, the determining the data type corresponding to the sub data set, and determining a data common part corresponding to all data in the sub data set and a data characteristic part corresponding to each data according to the data type, where the data common part and the data characteristic part corresponding to each data form a compressed sub data set, includes: if the data type corresponding to the longitude and latitude data is determined to be a floating point type, determining that the common part of the data is the common value of all the longitude and latitude data, and determining that the characteristic part of the data is the exclusive OR result of each longitude and latitude data and the common value; and folding the leading zero position of the XOR result, wherein the common value and the XOR result after folding form the compressed subdata set.
As an optional embodiment, the determining the data type corresponding to the sub data set, and determining a data common part corresponding to all data in the sub data set and a data characteristic part corresponding to each data according to the data type, where the data common part and the data characteristic part corresponding to each data form a compressed sub data set, includes: if the data type corresponding to the timestamp sequence is determined to be a long integer, determining that the common part of the data is an initial timestamp of the timestamp sequence, the characteristic part of the data is a first difference value, and the initial timestamp and the first difference value form the compressed subdata set; wherein the first difference is a change value of a subsequent timestamp relative to a previous timestamp.
As an optional embodiment, the determining the data type corresponding to the sub data set, and determining a data common part corresponding to all data in the sub data set and a data characteristic part corresponding to each data according to the data type, where the data common part and the data characteristic part corresponding to each data form a compressed sub data set, includes: if the data type corresponding to the timestamp sequence is determined to be a long integer, determining that the common part of the data is an initial timestamp of the timestamp sequence, the characteristic part of the data is a second difference value, and the initial timestamp and the second difference value form the compressed subdata set; the second difference is a change value of a subsequent first difference relative to a previous first difference, and the first difference is a change value of a subsequent timestamp relative to a previous timestamp.
As an optional embodiment, the method further comprises: if the data type corresponding to the subdata set is determined to be the character string type, converting the character string type data in the subdata set into corresponding coded data according to a preset mapping relation table, wherein the mapping relation between the character string type data and the coded data is stored in the preset mapping relation table, and the data type of the coded data is integer; and compressing the encoded data according to an integer data compression algorithm to obtain a compressed subdata set.
As an optional embodiment, the method further comprises: and if the data type corresponding to the sub data set is determined to be the integer, compressing the data in the sub data set according to an integer data compression algorithm to serve as the compressed sub data set.
As an alternative embodiment, the integer data compression algorithm comprises: storing binary data corresponding to the integer data according to the first part and the second part; the first part is the number of leading zero bits in the binary data, and the second part is the effective information part in the binary data.
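The two-part integer layout described above can be illustrated with a minimal Python sketch. The function names `encode_int` and `decode_int`, the fixed 32-bit width, and the restriction to non-negative values are assumptions made for illustration only; the source does not specify them.

```python
from typing import Tuple


def encode_int(value: int, width: int = 32) -> Tuple[int, str]:
    """Split a non-negative integer into (leading-zero count, significant bits).

    Assumption: values are non-negative and fit in `width` bits.
    """
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the assumed width")
    significant = value.bit_length()             # number of bits from the highest 1 bit down
    leading_zeros = width - significant          # first part: count of leading zero bits
    bits = format(value, "b") if value else ""   # second part: the effective information bits
    return leading_zeros, bits


def decode_int(leading_zeros: int, bits: str) -> int:
    """Rebuild the integer; the leading-zero count tells a reader how many bits to consume."""
    return int(bits, 2) if bits else 0


if __name__ == "__main__":
    parts = encode_int(5)
    print(parts, decode_int(*parts))
```

With a 32-bit width, a small code such as 5 needs only three significant bits plus the leading-zero count instead of the full 32 bits.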
In a second aspect, the present invention provides a track data compression device, comprising: a dividing module, configured to divide the track data to be compressed into different sub data sets according to data content; a determining module, configured to determine a data type corresponding to the sub data set, and determine, according to the data type, a data common part corresponding to all data in the sub data set and a data characteristic part corresponding to each data, where the data common part and the data characteristic part corresponding to each data form a compressed sub data set; and an integration module, configured to integrate the compressed sub data sets to obtain compressed track data.
In a third aspect, the present invention provides an electronic device, including a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete mutual communication through the communication bus; a memory for storing a computer program; a processor configured to implement the steps of the method for compressing trajectory data according to any one of the first aspect when executing a program stored in the memory.
In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method of compressing trajectory data according to any one of the first aspect.
According to the track data compression method, the track data compression device, the electronic equipment and the storage medium, track data to be compressed are divided into different subdata sets according to data contents; determining a data type corresponding to the sub data set, and determining a data common part corresponding to all data in the sub data set and a data characteristic part corresponding to each data according to the data type, wherein the data common part and the data characteristic part corresponding to each data form the compressed sub data set; integrating each compressed subdata set to obtain compressed track data; in the embodiment of the invention, the track data to be compressed is firstly divided according to the data content, so that the data in each subdata set has great similarity, then the data type of each subdata set can be determined, and the data is divided into the public part and the characteristic part according to the data type of the subdata set for storage, thereby realizing the maximum lossless compression potential of different data types, improving the compression performance, and realizing the efficient and fine-grained query operation on the track data.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
FIG. 1 is a schematic diagram of a lossless track data compression method provided in the prior art;
FIG. 2 is a schematic flowchart of a track data compression method according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of another track data compression method according to an embodiment of the present invention;
FIG. 4 is a model architecture diagram of a hardware-based floating point compression algorithm according to an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating compression of string data according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a track data compression method according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a track data compression device according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
With the advent of the big data era, cities generate a large amount of track data, including the tracks of vehicles, people and various devices. Not only is the number of tracks huge, but each track usually contains tens of thousands or even millions of records of longitude, latitude, time and other information. Existing track data compression strategies fall into two major categories: lossy compression and lossless compression.
Lossy compression omits points that are close in space and time and do not affect the general trend of the track, provided that the error between the compressed track and the original track stays within an acceptable range. It reduces space occupation by simplifying the data, that is, it trades the accuracy of part of the data for a smaller storage footprint. However, with the development of the smart city industry in recent years, the uses of track data have become more diverse, and a piece of track data often carries much information besides coordinate points. Because lossy compression loses information in multiple dimensions, it is only suitable for tasks that focus on track shapes and trends and is not suitable for multi-purpose track data mining. In particular, when track data must be processed, analyzed and visualized in real time, the track data needs efficient lossless compression and storage, together with efficient query operations on the track.
Lossless compression compresses the track object and its incidental information as a whole. It occupies more space and has a relatively low compression ratio, but it retains 100% of the original information and supports queries for specific data points. Common lossless compression algorithms include Zip, Gzip, Zstd, Kryo, etc. This strategy converts the whole track data into a byte array and then stores the byte array in the database, as shown in FIG. 1, which is a schematic diagram of a lossless track data compression method provided in the prior art. This whole-track lossless compression has the following problems: on one hand, it cannot exploit the maximum lossless compression potential of the different data contents; on the other hand, compressing the data as a whole hinders efficient queries for specific information.
In view of the above technical problems, the technical idea of the present invention is as follows: the track data to be compressed is firstly divided according to data content, so that the data in each sub data set has great similarity, then the data type of each sub data set can be determined, and the data is divided into a public part and a characteristic part according to the data type of the sub data set for storage.
FIG. 2 is a schematic flowchart of a track data compression method according to an embodiment of the present invention. As shown in FIG. 2, the track data compression method includes:
Step S101, dividing the track data to be compressed into different sub data sets according to data content.
Optionally, the data content includes at least one of: spatial data, temporal data, and other data. In this step, the track data to be compressed may be divided into sub data sets corresponding to the spatial data, sub data sets corresponding to the temporal data, and sub data sets corresponding to other data contents. The spatial data may be latitude and longitude data, the time data may be a time stamp sequence, and the other data may be attribute information of the trajectory data.
It should be noted that the data of the sub data sets divided according to the data content generally have great similarity, for example, the longitude and latitude data are a series of data with small variation amplitude and the same data type, and the timestamp sequence is a series of data with small variation amplitude and the same data type, which is also convenient for the subsequent efficient and fine-grained query of the specific information.
Step S102, determining a data type corresponding to the sub data set, and determining a data common part corresponding to all data in the sub data set and a data characteristic part corresponding to each data according to the data type, wherein the data common part and the data characteristic part corresponding to each data form the compressed sub data set.
Optionally, the data type includes at least one of: floating point, long integer, string. In this step, the data type corresponding to each sub data set is determined first. For example, if the spatial data is longitude and latitude data, its data type is usually floating point; if the time data is a timestamp sequence, its data type is usually long integer. Then, according to the data type, the data common part corresponding to all data in the sub data set and the data characteristic part corresponding to each data are determined. For example, for floating point longitude and latitude data, the data common part of the sub data set is a common value of all the longitude and latitude data, the data characteristic part is the exclusive OR (XOR) result of each longitude and latitude value with the common value, and the common value together with the XOR result of each data forms the compressed sub data set. For a long integer timestamp sequence, the data common part of the sub data set is the starting timestamp, the data characteristic part is the difference of each timestamp relative to the previous timestamp, and the starting timestamp together with the difference corresponding to each data forms the compressed sub data set.
Step S103, integrating the compressed sub data sets to obtain compressed track data.
Specifically, after the common part of the data and the data characteristic part corresponding to each data are used as the compressed sub data sets, the compressed sub data sets are integrated, so that the compressed track data is obtained.
According to the track data compression method provided by the embodiment of the invention, the track data to be compressed is divided into different sub data sets according to data contents; determining a data type corresponding to the sub data set, determining a data common part corresponding to all data in the sub data set and a data characteristic part corresponding to each data according to the data type, wherein the data common part and the data characteristic part corresponding to each data form a compressed sub data set, and integrating each compressed sub data set to obtain compressed track data; in the embodiment of the invention, the track data to be compressed is firstly divided according to the data content, so that the data in each subdata set has great similarity, then the data type of each subdata set can be determined, and the data is divided into the public part and the characteristic part according to the data type of the subdata set for storage, thereby realizing the maximum lossless compression potential of different data types, improving the compression performance, and realizing the efficient and fine-grained query operation on the track data.
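Steps S101 to S103 can be sketched end to end in Python as follows. This is only an illustrative sketch: the function names, the dictionary-based output layout, and the choice of the first value of a floating point column as the common value are assumptions and are not taken from the source.

```python
import struct
from typing import Any, Dict, List


def float_to_bits(value: float) -> int:
    """Return the IEEE 754 double bit pattern of value as a 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]


def split_by_content(points: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Step S101: collect each field of the track points into its own sub data set."""
    subsets: Dict[str, List[Any]] = {}
    for point in points:
        for name, value in point.items():
            subsets.setdefault(name, []).append(value)
    return subsets


def compress_subset(values: List[Any]) -> Dict[str, Any]:
    """Step S102: pick the common part and the characteristic parts by data type."""
    first = values[0]
    if isinstance(first, float):
        # Assumption: the first value is used as the common value.
        base = float_to_bits(first)
        features = [float_to_bits(v) ^ base for v in values]
        return {"type": "float", "common": first, "features": features}
    if isinstance(first, int):
        # Long integer: starting value plus differences between neighbours.
        deltas = [b - a for a, b in zip(values, values[1:])]
        return {"type": "long", "common": first, "features": deltas}
    if isinstance(first, str):
        # String: map each distinct string to an integer code starting at 1.
        table: Dict[str, int] = {}
        codes = [table.setdefault(v, len(table) + 1) for v in values]
        return {"type": "string", "table": table, "codes": codes}
    raise TypeError("unsupported data type in this sketch")


def compress_track(points: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Step S103: integrate the compressed sub data sets into one structure."""
    return {name: compress_subset(vals) for name, vals in split_by_content(points).items()}


if __name__ == "__main__":
    track = [
        {"lon": 116.30, "lat": 39.90, "ts": 1561889600000, "city": "Beijing"},
        {"lon": 116.31, "lat": 39.91, "ts": 1561889600010, "city": "Beijing"},
    ]
    print(compress_track(track)["ts"])
```

Because each sub data set is compressed on its own, a query that only needs timestamps can decode the `ts` entry without touching the coordinates or attributes.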
On the basis of the foregoing embodiment, FIG. 3 is a schematic flowchart of another track data compression method provided in an embodiment of the present invention, where the data content includes at least one of the following: spatial data, temporal data, and other data; the data types include at least one of: floating point, long integer, string, and integer. As shown in FIG. 3, the track data compression method includes:
Step S201, dividing the track data to be compressed into different sub data sets according to the data content.
If the spatial data is longitude and latitude data, executing step S202 and step S203; if the time data is the time stamp sequence, step S204 is executed.
Step S202, if the data type corresponding to the longitude and latitude data is determined to be a floating point type, determining that the common part of the data is the common value of all the longitude and latitude data, and the data characteristic part is the exclusive OR result of each longitude and latitude data and the common value.
Step S203, folding the leading zero bits of the XOR result, wherein the common value and the folded XOR result form the compressed sub data set.
Step S204, if the data type corresponding to the time stamp sequence is determined to be a long integer, determining that the common part of the data is the initial time stamp of the time stamp sequence, the characteristic part of the data is a first difference value, and the initial time stamp and the first difference value form the compressed subdata set.
Wherein the first difference is a change value of a subsequent timestamp relative to a previous timestamp.
Step S205, integrating the compressed sub data sets to obtain compressed track data.
The implementation manners of step S201 and step S205 in this embodiment are similar to the implementation manners of step S101 and step S103 in the foregoing embodiment, and are not described again here.
The difference from the above embodiment is that the present embodiment further defines an optimal compression method corresponding to the sub data sets of each data type. In this embodiment, if it is determined that the data type corresponding to the latitude and longitude data is a floating point type, determining that the common part of the data is a common value of all the latitude and longitude data, and the data characteristic part is an exclusive or result of each latitude and longitude data and the common value; folding the leading zero position of the XOR result, wherein the common value and the XOR result after folding form the compressed subdata set; if the data type corresponding to the timestamp sequence is determined to be a long integer, determining that the common part of the data is an initial timestamp of the timestamp sequence, the characteristic part of the data is a first difference value, and the initial timestamp and the first difference value form the compressed subdata set; wherein the first difference is a change value of a subsequent timestamp relative to a previous timestamp.
Specifically, the indispensable part of track data is the spatial point data, which is generally described by longitude and latitude. Longitude and latitude values are usually decimals, which computers represent as floating point data, typically stored as defined by the IEEE 754 standard. Every binary digit must be stored; for example, a double-precision floating point value (Double) occupies 8 bytes, that is, 64 binary digits. In a track scenario, however, longitude ranges from -180° to 180° and latitude from -90° to 90°, while the movement range of a track is very small compared with the entire range, so the change mapped onto the floating point data is also very small. For example, for track points within the Sixth Ring Road of Beijing, longitude ranges from 116.279512 to 116.499705 and latitude from 39.834976 to 39.989051. The variation of all the longitude and latitude values is very small, so storing every data bit under the general floating point representation wastes a great deal of space.
Therefore, in this embodiment, a hardware-based floating point compression algorithm is used to compress the longitude and latitude data, and the compression algorithm can fully utilize some characteristics of a Central Processing Unit (CPU) in terms of Processing floating point data, extract a common part of the floating point data, and only store a characteristic part of each floating point data, thereby greatly reducing the space occupation of the floating point data.
FIG. 4 is a model architecture diagram of the hardware-based floating point compression algorithm according to an embodiment of the present invention. As shown in FIG. 4, a plurality of longitude and latitude values are input to the model; a common value is determined from them as the predicted value and stored as the base floating point data. Each longitude and latitude value is then XORed with the predicted value, so that many of its leading bits become zero. The CPU can obtain the number of leading zero bits directly and fold those bits away, so that only the common value and the folded XOR results need to be stored. This greatly reduces the data volume and realizes compression and efficient storage of massive longitude and latitude data.
For example, if the input longitude data includes 117.11, 117.12 and 117.13, the common value is determined to be 117.1, stored as the base floating point data and used as the predicted value. The first longitude value 117.11 is XORed with the predicted value to obtain change value 1, written schematically as 000.01; the second longitude value 117.12 is XORed with the predicted value to obtain change value 2, written schematically as 000.02; change value 3 and any further change values are obtained in the same way. (The decimal notation is schematic; the XOR itself operates on the binary representation.) Each change value is then processed by a hash function so that the change values are distributed more uniformly, yielding a hash table. Finally, the base floating point data and the data in the hash table are processed by a prediction function, that is, the leading zero bits of the change values are folded, giving difference values such as 0.01, 0.02 and 0.03, which are output as the compressed data.
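The XOR-and-fold idea can be sketched on real IEEE 754 bit patterns as follows. This is a simplified illustration under stated assumptions: the first value is taken as the common (predicted) value, the hash table stage described above is omitted, and Python's `int.bit_length` stands in for the CPU's count-leading-zeros instruction. The function names are hypothetical.

```python
import struct
from typing import List, Tuple


def to_bits(value: float) -> int:
    """IEEE 754 double bit pattern as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]


def from_bits(bits: int) -> float:
    """Inverse of to_bits."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


def xor_compress(values: List[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return (common value, [(leading-zero count, XOR residual), ...]).

    Assumption: the first value acts as the common value for the whole set.
    """
    base = values[0]
    base_bits = to_bits(base)
    folded = []
    for v in values:
        residual = to_bits(v) ^ base_bits            # shared high bits cancel to zero
        leading_zeros = 64 - residual.bit_length()   # stand-in for a CPU clz instruction
        # Only the low (64 - leading_zeros) bits of the residual carry information.
        folded.append((leading_zeros, residual))
    return base, folded


def xor_decompress(base: float, folded: List[Tuple[int, int]]) -> List[float]:
    """XOR each residual with the common value to restore the original doubles."""
    base_bits = to_bits(base)
    return [from_bits(residual ^ base_bits) for _, residual in folded]


if __name__ == "__main__":
    base, folded = xor_compress([117.11, 117.12, 117.13])
    for zeros, residual in folded:
        print(zeros, 64 - zeros)   # folded leading zeros, bits actually kept
```

Values that share a sign and exponent, as nearby longitudes do, always cancel at least the top 12 bits, and the round trip is exact because XOR is its own inverse.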
Compressing the longitude and latitude data with the hardware-based floating point compression algorithm greatly improves both the compression capability and the compression speed. For example, for a data volume of 1 GB, the compression ratio reaches about 1.7 at best, and the compression speed is the fastest among algorithms of the same type (such as bzip2, gzip and p7zip): for the same data volume, the average compression time is more than 50% shorter than that of those algorithms.
Secondly, each track point in the track data has corresponding time data. The time data is crucial for describing the track, but it also suffers from a large amount of redundant storage. Time data is typically stored as a timestamp, that is, the number of milliseconds elapsed since 00:00:00 on January 1, 1970, which is a long integer. However, the timestamps within a track generally vary over a small range, and storing each one in full in the conventional way redundantly stores the same elapsed-time information and occupies a large amount of space.
In this embodiment, the original millisecond values are not all stored redundantly. Instead, the minimum value in the timestamp sequence, that is, the starting timestamp, is stored separately as the data common part, and each later entry stores only its change relative to the previous timestamp, that is, the first difference (delta). The time data of each track point therefore needs only the change value rather than the full time information, which greatly reduces the amount of time data actually stored.
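A minimal Python sketch of this delta scheme is shown below. The function names are hypothetical, and the sketch assumes the timestamps are already in chronological order, so that the starting timestamp is also the minimum.

```python
from typing import List, Tuple


def delta_encode(timestamps: List[int]) -> Tuple[int, List[int]]:
    """Return (starting timestamp, first differences between neighbours)."""
    start = timestamps[0]
    deltas = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return start, deltas


def delta_decode(start: int, deltas: List[int]) -> List[int]:
    """Rebuild the full timestamps by accumulating the differences."""
    result = [start]
    for delta in deltas:
        result.append(result[-1] + delta)
    return result


if __name__ == "__main__":
    ts = [1561889600000, 1561889600010, 1561889600010, 1561889600011]
    start, deltas = delta_encode(ts)
    print(start, deltas)
    print(delta_decode(start, deltas) == ts)
```

The deltas are small integers, so they need far fewer bits than the 13-digit millisecond values they replace.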
As an alternative embodiment, the time data is a time stamp sequence, and the step S102 includes: if the data type corresponding to the timestamp sequence is determined to be a long integer, determining that the common part of the data is an initial timestamp of the timestamp sequence, the characteristic part of the data is a second difference value, and the initial timestamp and the second difference value form the compressed subdata set; the second difference is a change value of a subsequent first difference relative to a previous first difference, and the first difference is a change value of a subsequent timestamp relative to a previous timestamp.
Specifically, for the time stamp sequence, the data characteristic portion may also be a second difference delta-of-delta, that is, the data characteristic portion corresponding to each data is a change value of a subsequent delta relative to a previous delta.
Table 1 lists a series of timestamps. Under the conventional timestamp storage method, each value needs 64 bits of storage, so the eight timestamps in the table need 512 bits in total. In this embodiment, if the data characteristic part is the first difference (delta), 155 bits are required; if the data characteristic part is the second difference (delta-of-delta), only 103 bits are required, which is the sum of the compressed bit counts listed in Table 1. For the massive track data of real scenarios, the timestamp compression method of this embodiment therefore saves storage space even more markedly.
As Table 1 also shows, the delta-of-delta approach saves more space than the delta approach, because the stored delta-of-delta values may contain runs of consecutive zeros, which compress further.
TABLE 1
Unix timestamp | delta | delta-of-delta | Compressed bits
1561889600000 | 0 | 0 | 64
1561889600010 | 10 | 10 | 9
1561889600010 | 0 | -10 | 9
1561889600011 | 1 | 1 | 9
1561889600012 | 1 | 0 | 1
1561889600013 | 1 | 0 | 1
1561889600015 | 2 | 1 | 9
1561889600017 | 2 | 0 | 1
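The delta and delta-of-delta columns of Table 1 can be reproduced with a short sketch. The function names below are hypothetical and not taken from the patent. The bit accounting in `table1_bit_cost` (64 bits for the initial timestamp, 1 bit for a zero delta-of-delta, 9 bits for a non-zero one) is only a reading of the "Compressed bits" column of Table 1 and is an assumption, not an encoding the text specifies.

```python
# Timestamps from Table 1 (Unix time in milliseconds).
TABLE_1 = [
    1561889600000, 1561889600010, 1561889600010, 1561889600011,
    1561889600012, 1561889600013, 1561889600015, 1561889600017,
]


def delta_encode(timestamps):
    """Return the initial timestamp and the list of first differences."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return timestamps[0], deltas


def delta_of_delta_encode(timestamps):
    """Return the initial timestamp and the list of second differences.

    The delta preceding the first real delta is taken to be 0,
    which matches the second row of Table 1 (delta 10, delta-of-delta 10).
    """
    first, deltas = delta_encode(timestamps)
    dods = []
    previous = 0
    for delta in deltas:
        dods.append(delta - previous)
        previous = delta
    return first, dods


def delta_of_delta_decode(first, dods):
    """Rebuild the original timestamps from the compressed form."""
    timestamps = [first]
    delta = 0
    for dod in dods:
        delta += dod
        timestamps.append(timestamps[-1] + delta)
    return timestamps


def table1_bit_cost(dods):
    """Assumed accounting read off Table 1: 64 + 1 per zero + 9 per non-zero."""
    return 64 + sum(1 if dod == 0 else 9 for dod in dods)
```

Applied to `TABLE_1`, the second differences come out as the delta-of-delta column of the table, and decoding restores the original sequence, so the scheme is lossless.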
As an optional embodiment, the method further comprises: if the data type corresponding to the subdata set is determined to be the character string type, converting the character string type data in the subdata set into corresponding coded data according to a preset mapping relation table, wherein the mapping relation between the character string type data and the coded data is stored in the preset mapping relation table, and the data type of the coded data is integer; and compressing the encoded data according to an integer data compression algorithm to obtain a compressed subdata set.
Specifically, in the track data, the data type of the spatial data or the time data may be the character string type, or the data type of other data included in the track data may be the character string type, for example, province, city, district, and county information. In a conventional storage mode, every character string is stored in full, so the same character string is stored repeatedly and storage space is wasted.
Therefore, in this embodiment, a data dictionary is adopted: character string data such as the names of provinces, cities, districts, and counties is encoded (the codes are usually integers), and only the codes and the mapping relation table between the character string data and the codes need to be stored. This greatly reduces the space occupied by the character string data. At the same time, because the character string data is converted into integers such as 1, 2, and 3, it can be compressed with the integer data compression method, which further facilitates storage.
Fig. 5 is a schematic diagram of compressing string data according to an embodiment of the present invention. As shown in Fig. 5, the integers 1, 2, and 3 are mapped to Beijing, Shanghai, and Shanxi Province, respectively. If track points 1, 2, 3, and 4 are in Beijing, only the corresponding code 1 needs to be stored; if track points 5, 6, and 7 are in Shanghai, only the corresponding code 2 needs to be stored; if track points 8, 9, 10, and 12 are in Shanxi Province, only the corresponding code 3 needs to be stored.
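The data dictionary approach of Fig. 5 can be illustrated with a minimal sketch. The table contents follow the Beijing/Shanghai/Shanxi example in the text; the names `PRESET_TABLE`, `encode_strings`, and `decode_strings` are hypothetical, and the handling of strings missing from the table (a `KeyError`) is an assumption, since the text does not say what happens in that case.

```python
# Preset mapping relation table from the Fig. 5 example: string -> integer code.
PRESET_TABLE = {"Beijing": 1, "Shanghai": 2, "Shanxi": 3}


def encode_strings(values, table):
    """Replace each string by its integer code from the preset table.

    Raises KeyError for a string that is not in the table (assumed behavior).
    """
    return [table[value] for value in values]


def decode_strings(codes, table):
    """Invert the mapping to recover the original strings."""
    reverse = {code: value for value, code in table.items()}
    return [reverse[code] for code in codes]
```

Only the integer codes and one copy of the table are stored, and the codes can then be handed to the integer data compression algorithm.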
As an optional embodiment, if it is determined that the data type corresponding to the sub data set is the integer, compressing the data in the sub data set according to an integer data compression algorithm to obtain a compressed sub data set.
Specifically, in the trace data, the data type of the spatial data or the temporal data may also be integer, or the data type corresponding to other data included in the trace data is integer, and the data may be compressed according to a compression method of the integer data and then stored.
As an alternative embodiment, the integer data compression algorithm comprises: storing binary data corresponding to the integer data according to the first part and the second part; the first part is the number of leading zero bits in the binary data, and the second part is the effective information part in the binary data.
Specifically, at the lowest level of a computer, integers are stored as binary numbers, and for convenience of storage and calculation the type and length of these binary numbers are usually fixed; for example, the Int type has a default length of 32 bits. Table 2 provides an example of binary storage of integer data according to an embodiment of the present invention. The problem with this storage method is that the binary digits often begin with a long run of meaningless "0" bits, which wastes storage space.
TABLE 2
Original value | Binary (Int type, 32 bits)
12 | 00000000000000000000000000001100
1000 | 00000000000000000000001111101000
Therefore, in this embodiment, the binary data is stored in two parts: the first part stores the number of leading zero bits in the binary data, that is, the number of meaningless "0" bits at the head of the number, and the second part stores the effective information part of the binary data, thereby compressing the binary data as a whole.
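The two-part storage can be sketched as follows for the 32-bit values of Table 2. The text does not specify how wide the leading-zero count field is or how negative values are handled; this sketch assumes a fixed 6-bit count field (enough for counts 0 to 32) and non-negative inputs, and the function names are hypothetical. Bit strings are used instead of packed bytes purely for readability.

```python
WIDTH = 32       # fixed length of the Int type used in Table 2
COUNT_BITS = 6   # assumed width of the leading-zero count (covers 0..32)


def compress_int(value):
    """Return a bit string: leading-zero count followed by the significant bits."""
    if not 0 <= value < (1 << WIDTH):
        raise ValueError("value must fit in an unsigned 32-bit integer")
    leading_zeros = WIDTH - value.bit_length()
    significant = format(value, "b") if value else ""
    return format(leading_zeros, "0{}b".format(COUNT_BITS)) + significant


def decompress_int(bits):
    """Recover the integer from the two-part bit string."""
    leading_zeros = int(bits[:COUNT_BITS], 2)
    significant = bits[COUNT_BITS:]
    if len(significant) != WIDTH - leading_zeros:
        raise ValueError("bit string length does not match the leading-zero count")
    return int(significant, 2) if significant else 0
```

Under these assumptions, the value 12 from Table 2 takes 10 bits (6 for the count, 4 for `1100`) and the value 1000 takes 16 bits, compared with 32 bits each in the fixed-length form.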
To facilitate further understanding of the embodiment, Fig. 6 is a schematic diagram of a track data compression method according to an embodiment of the present invention. As shown in Fig. 6, a data distributor divides the track data to be compressed into floating-point longitude and latitude data, a long-integer timestamp sequence, string data, and integer data. A hardware-based floating-point data compression method is applied to the floating-point longitude and latitude data, a delta-of-delta compression algorithm to the long-integer timestamp sequence, a data dictionary compression method to the string data, and an integer data compression algorithm to the integer data, each yielding a corresponding compressed sub data set. The compressed sub data sets are then integrated through the data distributor to obtain the compressed track data, which is stored.
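The dispatch step of Fig. 6 can be sketched as a column split followed by a choice of method per data type. Everything here is illustrative: the function names and the method labels are hypothetical, track points are assumed to be dictionaries with identical fields, and, because Python does not distinguish integer from long integer, a column is treated as long integer when any value falls outside the signed 32-bit range.

```python
INT32_LIMIT = 2 ** 31  # assumed boundary between "integer" and "long integer"


def split_by_content(points):
    """Divide track points into one sub data set (column) per field."""
    columns = {}
    for point in points:
        for field, value in point.items():
            columns.setdefault(field, []).append(value)
    return columns


def choose_method(column):
    """Pick a compression method label from the data type of a column."""
    if all(isinstance(v, float) for v in column):
        return "xor"              # floating-point longitude and latitude data
    if all(isinstance(v, str) for v in column):
        return "dictionary"       # string data
    if all(isinstance(v, int) for v in column):
        if any(abs(v) >= INT32_LIMIT for v in column):
            return "delta-of-delta"   # long-integer timestamp sequence
        return "leading-zero"         # integer data
    raise TypeError("mixed or unsupported column type")
```

Each column would then be compressed independently and the results combined into the compressed track data.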
According to the track data compression method provided by the embodiment of the present invention, if the data type corresponding to the longitude and latitude data is determined to be floating point, the data common part is determined to be the common value of all the longitude and latitude data, and the data characteristic part is the XOR result of each longitude and latitude datum with the common value; the leading zero bits of the XOR result are folded, and the common value together with the folded XOR results forms the compressed sub data set. If the data type corresponding to the timestamp sequence is determined to be long integer, the data common part is determined to be the initial timestamp of the timestamp sequence and the data characteristic part is the first difference, where the first difference is the change of a subsequent timestamp relative to the previous timestamp; the initial timestamp and the first differences form the compressed sub data set. The embodiment of the present invention thus first determines the data types corresponding to the spatial data and the time data and then compresses each with the compression method best suited to its data type, thereby realizing the maximum lossless compression potential of the different data types, improving compression performance, and enabling efficient, fine-grained query operations on the track data.
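The XOR step for floating-point longitude and latitude data can be illustrated with a minimal sketch. How the common value is chosen is not restated in this passage, so the sketch assumes, purely for illustration, that the first value of the sequence serves as the common value. Folding is represented by recording the leading-zero count next to the XOR result rather than by real bit packing, and the function names are hypothetical.

```python
import struct


def float_to_bits(value):
    """Reinterpret a 64-bit IEEE 754 double as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]


def bits_to_float(bits):
    """Reinterpret an unsigned 64-bit integer as an IEEE 754 double."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


def xor_compress(values):
    """Return the common value bits and (leading zeros, XOR result) pairs.

    Assumption: the first value is used as the common value.
    Only the low 64 - leading_zeros bits of each XOR result carry information.
    """
    common_bits = float_to_bits(values[0])
    folded = []
    for value in values:
        xor_result = float_to_bits(value) ^ common_bits
        leading_zeros = 64 - xor_result.bit_length()
        folded.append((leading_zeros, xor_result))
    return common_bits, folded


def xor_decompress(common_bits, folded):
    """Recover the original doubles exactly."""
    return [bits_to_float(xor_result ^ common_bits) for _, xor_result in folded]
```

Nearby coordinates share their sign, exponent, and high mantissa bits, so their XOR results begin with many zero bits, which is what makes the folding worthwhile.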
Fig. 7 is a schematic structural diagram of a track data compression apparatus according to an embodiment of the present invention, and as shown in fig. 7, the track data compression apparatus includes:
the dividing module 10 is configured to divide track data to be compressed into different sub data sets according to data content; a determining module 20, configured to determine a data type corresponding to the sub data set, and determine, according to the data type, a data public part corresponding to all data in the sub data set and a data characteristic part corresponding to each data, where the data public part and the data characteristic part corresponding to each data form a compressed sub data set; an integrating module 30, configured to integrate the compressed sub-data sets to obtain compressed track data.
As an alternative embodiment of the invention, the data content comprises at least one of: spatial data, temporal data, and other data; the data types include at least one of: floating point, long integer, string, and integer.
As an optional embodiment of the present invention, the spatial data is longitude and latitude data, and the determining module 20 is specifically configured to: if the data type corresponding to the longitude and latitude data is determined to be a floating point type, determining that the common part of the data is the common value of all the longitude and latitude data, and determining that the characteristic part of the data is the exclusive OR result of each longitude and latitude data and the common value; and folding the leading zero position of the XOR result, wherein the common value and the XOR result after folding form the compressed subdata set.
As an optional embodiment of the present invention, the time data is a time stamp sequence, and the determining module 20 is specifically configured to: if the data type corresponding to the timestamp sequence is determined to be a long integer, determining that the common part of the data is an initial timestamp of the timestamp sequence, the characteristic part of the data is a first difference value, and the initial timestamp and the first difference value form the compressed subdata set; wherein the first difference is a change value of a subsequent timestamp relative to a previous timestamp.
As an optional embodiment of the present invention, the time data is a time stamp sequence, and the determining module 20 is specifically configured to: if the data type corresponding to the timestamp sequence is determined to be a long integer, determining that the common part of the data is an initial timestamp of the timestamp sequence, the characteristic part of the data is a second difference value, and the initial timestamp and the second difference value form the compressed subdata set; the second difference is a change value of a subsequent first difference relative to a previous first difference, and the first difference is a change value of a subsequent timestamp relative to a previous timestamp.
As an alternative embodiment of the present invention, the determining module 20 is further configured to: if the data type corresponding to the subdata set is determined to be the character string type, converting the character string type data in the subdata set into corresponding coded data according to a preset mapping relation table, wherein the mapping relation between the character string type data and the coded data is stored in the preset mapping relation table, and the data type of the coded data is integer; and compressing the coded data according to an integer data compression algorithm to obtain a compressed subdata set.
As an alternative embodiment of the present invention, the determining module 20 is further configured to: and if the data type corresponding to the sub data set is determined to be the integer, compressing the data in the sub data set according to an integer data compression algorithm to serve as the compressed sub data set.
As an alternative embodiment of the present invention, the determining module 20 is further configured to implement the integer data compression algorithm, where the integer data compression algorithm includes: storing binary data corresponding to the integer data according to the first part and the second part; the first part is the number of leading zero bits in the binary data, and the second part is the effective information part in the binary data.
The implementation principle and technical effect of the track data compression apparatus provided in this embodiment are similar to those of the above embodiments, and are not described herein again.
The track data compression device provided by the embodiment of the invention is used for dividing track data to be compressed into different subdata sets according to data contents through the dividing module; a determining module, configured to determine a data type corresponding to the sub data set, and determine, according to the data type, a data common part corresponding to all data in the sub data set and a data characteristic part corresponding to each data, where the data common part and the data characteristic part corresponding to each data form a compressed sub data set; the integration module is used for integrating each compressed subdata set to obtain compressed track data; in the embodiment of the invention, the track data to be compressed is firstly divided according to the data content, so that the data in each subdata set has great similarity, then the data type of each subdata set can be determined, and the data is divided into the public part and the characteristic part according to the data type of the subdata set for storage, thereby realizing the maximum lossless compression potential of different data types, improving the compression performance, and realizing the efficient and fine-grained query operation on the track data.
As shown in fig. 8, an embodiment of the present invention provides an electronic device, which includes a processor 111, a communication interface 112, a memory 113, and a communication bus 114, where the processor 111, the communication interface 112, and the memory 113 complete mutual communication through the communication bus 114,
a memory 113 for storing a computer program;
in an embodiment of the present invention, the processor 111 is configured to implement the steps of the method for compressing track data provided in any one of the foregoing method embodiments when executing the program stored in the memory 113.
Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method for compressing trajectory data provided in any of the foregoing method embodiments.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A method for compressing trace data, comprising:
dividing track data to be compressed into different subdata sets according to data contents;
determining a data type corresponding to the sub data set, and determining a data public part corresponding to all data in the sub data set and a data characteristic part corresponding to each data according to the data type, wherein the data public part and the data characteristic part corresponding to each data form the compressed sub data set;
and integrating the compressed subdata sets to obtain compressed track data.
2. The method of claim 1, wherein the data content comprises at least one of: spatial data, temporal data, and other data;
the data types include at least one of: floating point, long integer, string, and integer.
3. The method of claim 2, wherein the spatial data is latitude and longitude data, the determining the data type corresponding to the sub data set, and determining a data common portion corresponding to all data in the sub data set and a data characteristic portion corresponding to each data according to the data type, the data common portion and the data characteristic portion corresponding to each data constituting a compressed sub data set, comprises:
if the data type corresponding to the longitude and latitude data is determined to be a floating point type, determining that the common part of the data is the common value of all the longitude and latitude data, and determining that the characteristic part of the data is the exclusive OR result of each longitude and latitude data and the common value;
and folding the leading zero position of the XOR result, wherein the common value and the XOR result after folding form the compressed subdata set.
4. The method of claim 2, wherein the time data is a time stamp sequence, the determining the data type corresponding to the sub data set, and determining a data common portion corresponding to all data in the sub data set and a data characteristic portion corresponding to each data according to the data type, the data common portion and the data characteristic portion corresponding to each data constituting a compressed sub data set comprises:
if the data type corresponding to the timestamp sequence is determined to be a long integer, determining that the common part of the data is an initial timestamp of the timestamp sequence, the characteristic part of the data is a first difference value, and the initial timestamp and the first difference value form the compressed subdata set;
wherein the first difference is a change value of a subsequent timestamp relative to a previous timestamp.
5. The method of claim 2, wherein the time data is a time stamp sequence, the determining the data type corresponding to the sub data set, and determining a data common portion corresponding to all data in the sub data set and a data characteristic portion corresponding to each data according to the data type, the data common portion and the data characteristic portion corresponding to each data constituting a compressed sub data set comprises:
if the data type corresponding to the timestamp sequence is determined to be a long integer, determining that the common part of the data is an initial timestamp of the timestamp sequence, the characteristic part of the data is a second difference value, and the initial timestamp and the second difference value form the compressed subdata set;
the second difference is a change value of a subsequent first difference relative to a previous first difference, and the first difference is a change value of a subsequent timestamp relative to a previous timestamp.
6. The method of claim 2, further comprising:
if the data type corresponding to the subdata set is determined to be the character string type, converting the character string type data in the subdata set into corresponding coded data according to a preset mapping relation table, wherein the mapping relation between the character string type data and the coded data is stored in the preset mapping relation table, and the data type of the coded data is integer;
and compressing the encoded data according to an integer data compression algorithm to obtain a compressed subdata set.
7. The method of claim 2, further comprising:
and if the data type corresponding to the sub data set is determined to be the integer, compressing the data in the sub data set according to an integer data compression algorithm to serve as the compressed sub data set.
8. The method of claim 6 or 7, wherein the integer data compression algorithm comprises:
storing binary data corresponding to the integer data according to the first part and the second part; the first part is the number of leading zero bits in the binary data, and the second part is the effective information part in the binary data.
9. An apparatus for compressing trace data, comprising:
the dividing module is used for dividing the track data to be compressed into different subdata sets according to data contents;
a determining module, configured to determine a data type corresponding to the sub data set, and determine, according to the data type, a data common part corresponding to all data in the sub data set and a data characteristic part corresponding to each data, where the data common part and the data characteristic part corresponding to each data form a compressed sub data set;
and the integration module is used for integrating the compressed subdata sets to obtain compressed track data.
10. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the steps of the method of compressing trajectory data according to any one of claims 1 to 8 when executing a program stored in the memory.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of compressing trajectory data according to any one of claims 1 to 8.
CN202210102571.0A 2022-01-27 2022-01-27 Track data compression method and device, electronic equipment and storage medium Pending CN114490546A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210102571.0A CN114490546A (en) 2022-01-27 2022-01-27 Track data compression method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210102571.0A CN114490546A (en) 2022-01-27 2022-01-27 Track data compression method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114490546A true CN114490546A (en) 2022-05-13

Family

ID=81477219

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210102571.0A Pending CN114490546A (en) 2022-01-27 2022-01-27 Track data compression method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114490546A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination