WO2016138619A1 - Data incremental update method - Google Patents

Data incremental update method

Info

Publication number
WO2016138619A1
Authority
WO
WIPO (PCT)
Prior art keywords
segment
incremental
old
node
data
Prior art date
Application number
PCT/CN2015/073510
Inventor
倪桂强
陈志龙
姜劲松
罗健欣
马遥
严英姿
Original Assignee
倪桂强
陈志龙
姜劲松
罗健欣
马遥
严英姿
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 倪桂强, 陈志龙, 姜劲松, 罗健欣, 马遥, 严英姿
Priority to PCT/CN2015/073510
Publication of WO2016138619A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating

Definitions

  • The present invention relates to the field of data updating for computers, smart devices, consumer electronic products and the like, and more particularly to a method for incrementally updating data.
  • Figure 1 shows the basic flow of incremental data update in the prior art. The old version file (software or data) and the new version file are stored on the server side. Before a new version is released, the server compares the new file with the old file and expresses the difference between the two as an incremental package. The incremental package is compressed into an incremental compressed package, which is transmitted over the communication network to the device that needs the update. The device receives the incremental compressed package, decompresses it, and converts the old file into the new file according to the instructions and data in the incremental package.
  • The data update process shown in Figure 1 is subject to practical constraints: limited network bandwidth, cost charged by traffic, and the small memory and limited battery power of consumer electronic devices. A smaller incremental compressed package reduces traffic cost and download time. Lower memory consumption and fewer CPU clock cycles when applying the incremental package reduce the time needed to generate the new file, save device power, reduce the impact on other running applications, and improve the user experience. The main evaluation criteria for an incremental update method for consumer electronic devices are therefore the size of the incremental package, the size of the incremental compressed package, and the time needed to apply the incremental package. Reducing the size of the incremental package and the incremental compressed package, and speeding up the application of the incremental package, are thus important problems for data update methods.
  • In the prior art, data incremental update methods mainly include: the RDIFF method proposed by A. Tridgell in "Efficient Algorithms for Sorting and Synchronization" (Australian National University, 1999); the BSDIFF method proposed by C. Percival in "Naive differences of executable code" (University of Oxford, 2003); and the VCDIFF method proposed by D. Korn et al. in "The VCDIFF Generic Differencing and Compression Data Format" (RFC 3284 (Proposed Standard), June 2002).
  • The RDIFF method divides the old file and the new file into consecutive data blocks of the same size, calculates the hash value of each data block, and searches for identical data blocks between the old and new files according to the hash values; the incremental package includes the pair [...]
  • The method has a simple structure and fast calculation speed, and is suitable for solving the RDC (Remote Differential Compression) problem. Its disadvantage is that it cannot comprehensively collect all the similar information between the old and new files.
  • Software that applies the method includes Rsync and rdiff-backup.
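The block matching described for RDIFF can be sketched as follows. This is a minimal illustration, assuming fixed-size blocks on both sides and SHA-256 as the block hash; the function name `matching_blocks` and the block size are hypothetical, and the rolling checksum that rsync uses to find matches at arbitrary offsets is not shown.

```python
import hashlib


def matching_blocks(old: bytes, new: bytes, block_size: int = 4):
    """Return (old_offset, new_offset) pairs of identical fixed-size blocks."""
    # Index every complete block of the old file by its hash value.
    index = {}
    for off in range(0, len(old) - block_size + 1, block_size):
        digest = hashlib.sha256(old[off:off + block_size]).digest()
        index.setdefault(digest, off)

    # Look up every complete block of the new file in that index.
    matches = []
    for off in range(0, len(new) - block_size + 1, block_size):
        digest = hashlib.sha256(new[off:off + block_size]).digest()
        if digest in index:
            matches.append((index[digest], off))
    return matches
```

Because only whole blocks are compared, similar content that is shifted by a few bytes is missed, which illustrates the disadvantage noted above.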
  • The BSDIFF method finds exactly identical data segments between the new file and the old file, and then extends the prefix and suffix of each such segment byte by byte, trying to find a segment that is approximate but not identical, that is, an approximate data segment.
  • The BSDIFF method uses the suffix array sorting algorithm proposed by N. J. Larsson and K. Sadakane.
  • The instructions in the incremental package are ADD, INSERT and SEEK.
  • The parameters of the ADD operation are the approximate data segment length and the correction amount; the parameters of the INSERT operation are the insertion segment length and the insertion content;
  • the parameter of the SEEK operation is the jump span of the old file read pointer.
  • The correction parameter of the ADD operation is the main component of the incremental package.
  • The similarity between the old and new files is proportional to the share of "0" values in the correction amount, and a higher share of "0" values makes the incremental compressed package smaller relative to the size of the new file.
  • The incremental package of the BSDIFF method is small, but both constructing and applying the incremental package are computationally intensive and therefore take a lot of time.
  • The BSDIFF method is the most widely used incremental update algorithm; the Bsdiff tool applies the BSDIFF method.
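To make the role of the ADD correction amount concrete, here is a minimal sketch assuming the correction is the byte-wise difference modulo 256 between an approximate old segment and the corresponding new segment. The function names `correction` and `apply_add` are hypothetical and are not the Bsdiff tool's API.

```python
def correction(old_seg: bytes, new_seg: bytes) -> bytes:
    """Byte-wise difference (mod 256) that turns old_seg into new_seg."""
    if len(old_seg) != len(new_seg):
        raise ValueError("segments must have the same length")
    return bytes((n - o) % 256 for o, n in zip(old_seg, new_seg))


def apply_add(old_seg: bytes, corr: bytes) -> bytes:
    """Rebuild the new segment by adding the correction to the old segment."""
    return bytes((o + c) % 256 for o, c in zip(old_seg, corr))


old_seg = b"she is beautiful"
new_seg = b"she is beautifuL"
corr = correction(old_seg, new_seg)
zero_share = corr.count(0) / len(corr)  # high when the segments are similar
```

A correction that is mostly zeros compresses very well, which is why similar files yield a small incremental compressed package.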
  • The VCDIFF method adapts the LZ77 (Lempel-Ziv 77) compression algorithm to incremental update.
  • The LZ77 compression algorithm relies on the fact that a later data segment in a data stream is often similar or related to an earlier data segment.
  • The VCDIFF method concatenates the old file and the new file into one data stream, compresses that stream with the LZ77 compression algorithm,
  • and takes the part of the compressed stream that represents the new file as the incremental package.
  • The performance of the VCDIFF method lies between that of the RDIFF method and the BSDIFF method: it sacrifices part of the compression ratio in exchange for faster execution.
  • Xdelta is one of the programs that use the VCDIFF method; it improves on VCDIFF by optimizing the instruction set, which further reduces the size of the incremental compressed package.
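The idea of compressing the new file against the old one can be imitated with Python's `zlib` module, whose DEFLATE compressor is LZ77-based and accepts a preset dictionary. This is only an analogy under that assumption: it does not produce the VCDIFF format of RFC 3284, and the function names are hypothetical.

```python
import zlib


def make_lz_delta(old: bytes, new: bytes) -> bytes:
    """Compress `new` with `old` as a preset dictionary (back-references may point into it)."""
    comp = zlib.compressobj(zdict=old)
    return comp.compress(new) + comp.flush()


def apply_lz_delta(old: bytes, delta: bytes) -> bytes:
    """Rebuild `new` from the delta; the same `old` bytes must be supplied."""
    decomp = zlib.decompressobj(zdict=old)
    return decomp.decompress(delta) + decomp.flush()
```

Only the stream describing the new data is emitted; the old data acts as shared context on both sides. Note that DEFLATE only looks back over a 32 KiB window, so this analogy suits small files.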
  • The technical problem to be solved by the present invention is that the data incremental update methods in the prior art generally produce a large incremental package, have a low compression ratio and a long running time, and cannot balance these characteristics at the same time. To address this, the invention provides a data incremental update method that constructs the incremental package by converting the problem into a search for the minimum cost (shortest) path.
  • The technical solution adopted by the present invention is a data incremental update method comprising the following steps.
  • First step, constructing a similar information set Segment: the bytes contained in the new file array New are compared with the bytes contained in the old file array Old, and the identical character segments segment(s, t, l) are output, where s is the position of the character segment in the old file array Old, t is its position in the new file array New, and l is its number of bytes. The similar information set Segment is obtained as:
  • $\mathrm{Segment} = \{\mathrm{segment}(s, t, l)\}$
  • Second step, constructing the horizontal line segment map: the character segments segment(s, t, l) in the similar information set Segment are converted into horizontal line segments seg_i(t, s-t) in the horizontal line segment map.
  • The left end point of the horizontal line segment seg_i(t, s-t) has coordinates (t, s-t), its length is l, and i is the serial number of the horizontal line segment in the horizontal line segment map.
  • Third step, constructing a path map: the horizontal line segments seg_i(t, s-t) in the horizontal line segment map are mapped to nodes V_i in the path map, node edges are constructed between the nodes V_i, and the edge cost of each node edge is calculated.
  • Fourth step, constructing the minimum cost path: there are multiple paths from the starting node segment(0, 0, 0) through the nodes V_i of the path map to the terminating node segment(newSize, 0, 0). The sum of the edge costs of the node edges on each path is calculated, and the path with the smallest sum is the minimum cost path; newSize represents the number of bytes in the new file array New.
  • The method for comparing the bytes contained in the new file array New with the bytes contained in the old file array Old includes:
  • the old file array Old is suffix-sorted to obtain the suffix array I of the old file array Old; then, using the suffix array I, the segment {old[s], old[s+1], ..., old[s+l-1]} of the old file array Old that matches the longest prefix of {new[t], new[t+1], ..., new[newSize-1]} in the new file array New is found, and the matching segment(s, t, l) is output.
  • The threshold value L_min = 3, or L_min is another positive integer value less than 10.
  • In constructing the path map, for one node V_x corresponding to the character segment segment(s_x, t_x, l_x) and another node V_y corresponding to
  • the character segment segment(s_y, t_y, l_y), a node edge from the node V_x to the node V_y exists if and only if (t_x + l_x) < (t_y + l_y) is satisfied.
  • The node V_x constructs node edges with the other nodes adjacent to it, and the number of constructed node edges is not greater than the node degree MAX_CONNECTION.
  • The method for calculating the edge cost of the node edge from the node V_x to the node V_y is: representation bytes are formed from the encoded instructions and the encoded data, and the number of representation bytes required to move from the right end point of the horizontal line segment seg_x(t_x, s_x - t_x) corresponding to the node V_x to the right end point of the horizontal line segment seg_y(t_y, s_y - t_y) corresponding to the node V_y is determined; this number is the edge cost.
  • In constructing the minimum cost path, the Dijkstra algorithm is employed to calculate the minimum cost path.
  • In constructing the incremental package, the instruction set includes the "insert", "copy", "forward jump" and "backward jump" instructions.
  • The data set is composed of the character parameters operated on by the "insert" instruction; the instruction code includes the instruction identifier and the instruction parameter.
  • The instruction identifier occupies 2 bits. The four codes 00, 01, 10 and 11 are assigned one-to-one to the four instructions "insert", "copy", "forward jump" and "backward jump".
  • The instruction encoding structure is: the first byte is instruction identifier + 0 + instruction parameter; each middle byte is 0 + instruction parameter; the last byte is 1 + instruction parameter.
  • Regarding the instruction encoding length: when the value of the instruction parameter is less than 32, the instruction encoding length is 1 byte; when it is greater than or equal to 32 and less than 4096, the length is 2 bytes; when it is greater than or equal to 4096 and less than 524288, the length is 3 bytes.
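The length thresholds above follow from the bit budget: 5 parameter bits in the first byte (2^5 = 32) and 7 more in each further byte (2^12 = 4096, 2^19 = 524288). A small sketch, with the hypothetical function name `instruction_length`, extends the same rule to larger parameters:

```python
def instruction_length(param: int) -> int:
    """Number of bytes needed to encode one instruction with parameter `param`."""
    if param < 0:
        raise ValueError("instruction parameters are non-negative")
    length = 1
    capacity = 32          # 5 parameter bits fit in the first byte
    while param >= capacity:
        capacity *= 128    # each additional byte contributes 7 parameter bits
        length += 1
    return length
```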
  • Constructing the incremental package further comprises compressing the incremental package to obtain an incremental compressed package; applying the incremental package then includes decompressing the incremental compressed package to recover the incremental package.
  • The incremental package is compressed with the LZMA compression algorithm to obtain the incremental compressed package.
  • The beneficial effects of the present invention are as follows. Through the above steps of constructing the similar information set, constructing the horizontal line segment map, constructing the path map, constructing the minimum cost path, constructing the incremental package and applying the incremental package, the present invention represents the degree of similarity between the new and old files in the form of a graph,
  • transforms the problem of generating the minimum incremental package into a shortest path problem, and generates the minimum incremental package from that path. Data incremental update based on the method of the present invention saves 69.3% of the data volume on average. Compared with the data incremental update methods in the prior art, its compression ratio is the highest and the running time for applying the incremental package is short. The method has a wide range of application: it is applicable not only to consumer electronics but also to other platforms and systems.
  • FIG. 2 is a flow chart of an embodiment of a method for incrementally updating data according to the present invention.
  • FIG. 3 is a schematic diagram of an embodiment of constructing a horizontal line segment diagram in another embodiment of a data incremental update method according to the present invention.
  • FIG. 4 is a schematic diagram of an embodiment of a construction path diagram in another embodiment of a data incremental update method according to the present invention.
  • FIG. 5 is a schematic diagram of an embodiment of calculating a side cost in another embodiment of a data incremental update method according to the present invention.
  • FIG. 6 is a flow chart of an embodiment of constructing an incremental packet in another embodiment of a data incremental update method according to the present invention.
  • FIG. 7 is a structural diagram of instruction coding in another embodiment of a data incremental update method according to the present invention.
  • FIG. 8 is a flow chart of another embodiment of a data incremental update method in accordance with the present invention.
  • FIG. 9 is a schematic diagram of an embodiment of an optimized horizontal line segment diagram in another embodiment of a data incremental update method according to the present invention.
  • FIG. 10 is a diagram showing a comparison analysis of incremental update compression ratios according to another embodiment of the data incremental update method of the present invention.
  • FIG. 11 is a diagram showing a comparison analysis of the running time for applying the incremental package according to another embodiment of the data incremental update method of the present invention.
  • FIG. 2 is a flow chart of an embodiment of a data incremental update method according to the present invention, comprising the steps of: constructing a similar information set (S201); constructing a horizontal line segment map (S202); constructing a path map (S203); constructing a minimum cost path (S204); constructing an incremental package (S205); and applying the incremental package (S206).
  • The above steps are described in detail below in conjunction with specific embodiments.
  • In step S201, the old file and the new file are compared, the identical character segments between the contents of the old file and the new file are found, and these identical character segments make up the similar information set.
  • The content of the old file in this embodiment is "You do not love a woman because she is beautiful, but she is beautiful because you love her."
  • The content of the new file in this embodiment is "She love a man because he Do not just love her beauty.She is beautiful because a beautiful love."
  • The old file and the new file are represented as arrays, defined as the old file array Old and the new file array New respectively.
  • The 0th byte Old[0] of the old file array Old corresponds to the first character "Y" of the old file,
  • the 1st byte Old[1] corresponds to the character "o" of the old file,
  • the 2nd byte Old[2] corresponds to the character "u" of the old file,
  • the 3rd byte Old[3] corresponds to the space character " " of the old file, and so on.
  • The new file array New is composed in the same way as the old file array Old and is not described again. Because of layout limitations the contents of the old file and the new file occupy two lines each, but they contain no line-break characters, only English letters, space characters and punctuation characters.
  • An identical character segment is denoted segment(s, t, l), where s is the position of the character segment in the old file array Old,
  • t is the position of the character segment in the new file array New, and
  • l is the number of bytes occupied by the character segment. The similar information set is:
  • $\mathrm{Segment} = \{\mathrm{segment}(s, t, l)\}$
  • The method of comparing the characters contained in the new file array New with the characters contained in the old file array Old includes:
  • the old file array Old is suffix-sorted, which returns the suffix array I of the old file array Old; then, using the suffix array I, the segment {old[s], old[s+1], ..., old[s+l-1]} of the old file array Old that matches the longest prefix of {new[t], new[t+1], ..., new[newSize-1]} in the new file array New is found, where newSize represents the number of bytes in the new file array New, and the matching segment(s, t, l) is output.
  • The number of character segments in the similar information set Segment is large, and its value range is:
  • newSize represents the number of bytes in the new file array New,
  • oldSize represents the number of bytes in the old file array Old.
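The suffix-array lookup described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the suffix array is built with a naive sort (a real tool would use a dedicated suffix sorting algorithm), and the names `build_suffix_array`, `common_prefix_len` and `longest_match` are hypothetical.

```python
def build_suffix_array(data: bytes) -> list:
    """Naive suffix sort of `data`; adequate for a sketch, too slow for large files."""
    return sorted(range(len(data)), key=lambda i: data[i:])


def common_prefix_len(a: bytes, i: int, b: bytes, j: int) -> int:
    """Length of the common prefix of a[i:] and b[j:]."""
    n = 0
    while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
        n += 1
    return n


def longest_match(old: bytes, suffix_array: list, new: bytes, t: int):
    """Return (s, l): start in `old` and length of the longest prefix match of new[t:]."""
    target = new[t:]
    lo, hi = 0, len(suffix_array)
    # Binary search for the position where new[t:] would be inserted
    # among the sorted suffixes of the old file.
    while lo < hi:
        mid = (lo + hi) // 2
        if old[suffix_array[mid]:] < target:
            lo = mid + 1
        else:
            hi = mid
    # The longest match is always one of the two neighbours of that position.
    best_s, best_l = 0, 0
    for k in (lo - 1, lo):
        if 0 <= k < len(suffix_array):
            length = common_prefix_len(old, suffix_array[k], new, t)
            if length > best_l:
                best_s, best_l = suffix_array[k], length
    return best_s, best_l


old = b"You do not love a woman because she is beautiful, but she is beautiful because you love her."
new = b"She love a man because he Do not just love her beauty.She is beautiful because a beautiful love."
suffix_array = build_suffix_array(old)
s, l = longest_match(old, suffix_array, new, 3)
```

For t = 3 the sketch finds s = 10 and l = 8 (the text " love a "), that is segment(10, 3, 8); its offset s - t = 7 agrees with the horizontal line segment seg_2(3, 7) mentioned for FIG. 3.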
  • The reduction method comprises:
  • removing from the similar information set Segment the character segments whose number of bytes l is smaller than the threshold L_min.
  • Preferably L_min = 3; L_min can also be another positive integer value less than 10.
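The length-based reduction is a simple filter. A minimal sketch, assuming segments are stored as (s, t, l) tuples and using the hypothetical name `prune_short`:

```python
def prune_short(segments, l_min: int = 3):
    """Keep only the segments (s, t, l) whose length l is at least l_min."""
    return [(s, t, l) for (s, t, l) in segments if l >= l_min]
```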
  • Preferably, after the three reduction methods are applied at the same time, the reduced similar information set Segment_base is obtained.
  • The process then proceeds to the step of constructing the horizontal line segment map (S202) in FIG. 2: the character segments segment(s, t, l) in the similar information set Segment obtained in step S201 are converted into horizontal line segments seg_i(t, s-t) in a horizontal line segment map.
  • The left end point of the horizontal line segment seg_i(t, s-t) has coordinates (t, s-t), that is, abscissa t and ordinate s-t; its length is l, and i is the serial number of the horizontal line segment in the horizontal line segment map.
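The conversion is a change of coordinates, sketched below. Numbering the segments in order of increasing t is an assumption made for illustration, and `to_horizontal` is a hypothetical name.

```python
def to_horizontal(segments):
    """Map each segment (s, t, l) to (i, x, y, length) with x = t and y = s - t."""
    ordered = sorted(segments, key=lambda seg: seg[1])  # order by position in New
    return [(i, t, s - t, l) for i, (s, t, l) in enumerate(ordered, start=1)]
```

The ordinate s - t is the offset between the read position in the old file and the write position in the new file, so two segments on the same horizontal level can be copied without moving the read pointer relative to the output.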
  • FIG. 3 is a schematic diagram of an embodiment of constructing a horizontal line segment diagram in another embodiment of a data incremental update method in accordance with the present invention.
  • Table 1: Coordinates, lengths and character segments of each horizontal line segment
  • In the step of constructing the path map (S203), the horizontal line segments seg_i(t, s-t) in the horizontal line segment map are mapped to the nodes V_i in the path map, node edges are built between the nodes V_i, and the edge cost of each node edge is calculated. This is described below in conjunction with FIG. 4.
  • FIG. 4 is a schematic diagram of an embodiment of a construction path diagram in another embodiment of a data incremental update method in accordance with the present invention.
  • The figure includes a start point and an end point: the start point corresponds to segment(0, 0, 0), the end point corresponds to segment(newSize, 0, 0), and newSize represents the number of bytes in the new file array New.
  • The node V_1 in FIG. 4 corresponds to seg_1(1, 32) in FIG. 3,
  • the node V_2 corresponds to seg_2(3, 7),
  • the node V_3 corresponds to seg_3(11, 9), and so on.
  • The node V_1 is connected to the nodes V_2, V_3 and V_4 by dotted lines; such a connection between two nodes is called a node edge.
  • The node edge between V_1 and V_2 is marked with the value 2, the node edge between V_1 and V_3 is marked with the value 10,
  • and the node edge between V_1 and V_4 is marked with the value 21.
  • These values are called edge costs.
  • The general method for calculating the edge cost is: representation bytes are formed from the encoded instructions and the encoded data, and the number of representation bytes required to move from the right end point of the horizontal line segment seg_x(t_x, s_x - t_x) corresponding to the node V_x
  • to the right end point of the horizontal line segment seg_y(t_y, s_y - t_y) corresponding to the node V_y is determined; this number is the edge cost of the node edge from the node V_x to the node V_y.
  • The calculation of the edge cost is described in detail below with reference to FIGS. 4 and 5.
  • FIG. 5 is a schematic diagram of an embodiment of calculating edge cost in another embodiment of a data delta update method in accordance with the present invention.
  • As shown in FIG. 5, the edge cost of the node edge from the node V_3 to the node V_6 is 5.
  • The similar information seg_3(11, 9) corresponding to V_3 has been used to generate the first 22 characters of the new file, i.e. "She love a man because". To use the similar information seg_6(25, -22) corresponding to V_6 on this basis, 2 characters, i.e. "he", must first be inserted; the read pointer is then moved and the matching characters are copied.
  • The number of representation bytes needed for this operation to reach the right end point of the horizontal line segment seg_6(25, -22) corresponding to the node V_6
  • is 5 (one instruction byte and two data bytes for the insertion, one byte for the jump and one byte for the copy), which is the edge cost from the node V_3 to the node V_6.
  • The derivation of the edge cost from the node V_13 to the node V_14 is shown in FIG. 5(b).
  • When the new file has been generated up to its first 90 bytes, the read pointer of the old file is moved forward from character position (90 + (-21)) by ((-9) - (-21)) positions, i.e. to the 81st character of the old file, and the 4 characters starting at that position, with content "love", are copied.
  • At least one byte of instruction code is required to express "forward jump 12 characters", and
  • at least one byte of instruction code is required to express "copy 4 characters",
  • so the edge cost from V_13 to V_14 is 2.
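The edge cost calculation can be sketched as follows. This is an illustration under stated assumptions rather than the exact rule of the patent: segments are (s, t, l) tuples, an insert costs its instruction bytes plus one byte per inserted character, overlapping segments copy only the part beyond the end of the previous segment, and no jump is charged when the offset s - t does not change. The names `instruction_length` and `edge_cost` are hypothetical.

```python
def instruction_length(param: int) -> int:
    """Bytes needed to encode one instruction (5 bits in byte 1, 7 bits per further byte)."""
    length, capacity = 1, 32
    while param >= capacity:
        capacity *= 128
        length += 1
    return length


def edge_cost(x, y) -> int:
    """Bytes needed to extend the new file from the right end of x to the right end of y."""
    s_x, t_x, l_x = x
    s_y, t_y, l_y = y
    end_x = t_x + l_x
    gap = t_y - end_x
    cost = 0
    if gap > 0:
        # Characters between the two segments are inserted literally.
        cost += instruction_length(gap) + gap
        copy_len = l_y
    else:
        # Overlapping segments: only the part beyond end_x is still needed.
        copy_len = t_y + l_y - end_x
    if copy_len > 0:
        jump = (s_y - t_y) - (s_x - t_x)  # change of the read-pointer offset
        if jump != 0:
            cost += instruction_length(abs(jump))
        cost += instruction_length(copy_len)
    return cost
```

With seg_3 taken as segment(20, 11, 12) and seg_6 as segment(3, 25, 8), where the lengths are assumed so as to reproduce "insert 2", "backward jump 31" and "copy 8", `edge_cost` gives 5. With a node ending at position 90 at offset -21 followed by a 4-byte segment at offset -9, it gives 2. Both agree with the values quoted in the text.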
  • In FIG. 4 there are only three node edges from the node V_1, leading to the nodes V_2, V_3 and V_4. In theory there should also be node edges from the node V_1 to the other nodes V_5, V_6, V_7, V_8, V_9, V_10, V_11, V_12, V_13 and V_14. These node edges are not selected because node edges are constructed mainly between adjacent nodes, which reduces the space complexity of the path map.
  • The other nodes in FIG. 4 are in a similar situation to the node V_1.
  • The node edges to be selected are defined mainly as follows: between one node V_x corresponding to the character segment segment(s_x, t_x, l_x) and another node V_y corresponding to the character segment segment(s_y, t_y, l_y), a node edge from the node V_x to the node V_y exists if and only if (t_x + l_x) < (t_y + l_y) is satisfied; the node V_x constructs node edges with the other nodes adjacent to it, and the number of constructed node edges is not greater than the node edge threshold MAX_CONNECTION.
  • Preferably, the node edge threshold MAX_CONNECTION = 5. The condition (t_x + l_x) < (t_y + l_y) means that, in the horizontal line segment map, the right end point of the horizontal line segment corresponding to the node V_y lies further to the right than the right end point of the horizontal line segment corresponding to V_x, which ensures that moving from V_x to V_y contributes new character information.
  • Node edges are constructed mainly between adjacent nodes, and a maximum threshold is set for the number of node edges, in order to reduce the number of unnecessary node edges and the space complexity of the path map.
  • If the number of nodes in the path map is n',
  • the maximum number of node edges is n'^2 without this restriction;
  • with the restriction, the maximum number of node edges is reduced to MAX_CONNECTION × n'.
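The edge selection rule can be sketched as follows. "Adjacent" is interpreted here as "the next nodes in order of t", which is an assumption made for illustration; `build_edges` is a hypothetical name and the nodes are (s, t, l) tuples sorted by t.

```python
def build_edges(nodes, max_connection: int = 5):
    """Return {x: [y, ...]} with an edge x -> y only if t_x + l_x < t_y + l_y."""
    ends = [t + l for (_s, t, l) in nodes]
    edges = {}
    for x in range(len(nodes)):
        successors = []
        for y in range(x + 1, len(nodes)):
            if ends[x] < ends[y]:
                successors.append(y)
                if len(successors) == max_connection:
                    break  # cap the out-degree to bound the size of the graph
        edges[x] = successors
    return edges
```

With n' nodes this builds at most MAX_CONNECTION × n' edges instead of roughly n'^2.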
  • The process then enters the step of constructing the minimum cost path (S204) in FIG. 2. There are multiple paths from the starting node segment(0, 0, 0) through the nodes V_i of the path map to the terminating node segment(newSize, 0, 0). The sum of the edge costs of the node edges on each path is calculated, and the path with the smallest sum is the minimum cost path; newSize represents the number of bytes in the new file array New.
  • The problem of constructing the smallest incremental package is thereby translated into a shortest path problem, and the smallest incremental package is generated from the minimum cost path.
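A minimum cost path over such a graph can be found with Dijkstra's algorithm, which the method names for this step. The sketch below is a generic implementation over an adjacency dictionary; the function name `min_cost_path` and the node labels in the example are hypothetical.

```python
import heapq


def min_cost_path(graph, start, goal):
    """graph: {node: [(neighbor, cost), ...]}. Return (total_cost, path) or None."""
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return dist[goal], path[::-1]


example = {
    "start": [("a", 4), ("b", 1)],
    "b": [("a", 2), ("end", 6)],
    "a": [("end", 1)],
}
result = min_cost_path(example, "start", "end")
```

Because every edge cost is a positive byte count, the non-negative weight requirement of Dijkstra's algorithm is satisfied.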
  • In the step of constructing the incremental package (S205), the incremental package is generated along the minimum cost path constructed as described above.
  • The incremental package is a file consisting mainly of an instruction set and a data set. Specifically, the instruction codes between adjacent nodes are determined one after another from the starting point along the minimum cost path, using the instruction set and the data set.
  • The instruction set and the data set constitute the incremental package.
  • The instruction set includes the "insert", "copy", "forward jump" and "backward jump" instructions.
  • The data set is composed of the character parameters operated on by the "insert" instruction,
  • and the instruction code includes an instruction identifier and an instruction parameter.
  • The minimum cost path in FIG. 6 is the minimum cost path of the embodiment shown in FIG. 4. The instructions required from the starting point to the node V_2 are "insert", "forward jump" and "copy".
  • The instruction parameter of the "insert" instruction is "3", and the corresponding insertion data is "She"; the instruction parameter of the "forward jump" instruction is "7"; the instruction parameter of the "copy" instruction is "8".
  • The instructions between other nodes are similar. For example, the instructions between the nodes V_3 and V_6, with their parameters and data, are "insert 2 he", "backward jump 31" and "copy 8".
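How such an instruction sequence rebuilds the new file can be sketched with a small interpreter. It assumes that the read position in the old file equals the current output length plus an offset that the jump instructions change, which is consistent with the "(90 + (-21))" read-pointer example in the text; the tuple representation of instructions and the name `apply_instructions` are hypothetical.

```python
def apply_instructions(old: bytes, instructions) -> bytes:
    """Rebuild new-file content from (op, argument) pairs."""
    out = bytearray()
    offset = 0  # read position in `old` minus write position in the output
    for op, arg in instructions:
        if op == "insert":
            out += arg                      # arg is the literal data to insert
        elif op == "forward":
            offset += arg                   # move the read pointer forward
        elif op == "backward":
            offset -= arg                   # move the read pointer backward
        elif op == "copy":
            start = len(out) + offset
            if start < 0 or start + arg > len(old):
                raise ValueError("copy outside the old file")
            out += old[start:start + arg]   # arg is the number of bytes to copy
        else:
            raise ValueError("unknown instruction: %r" % (op,))
    return bytes(out)


old = b"You do not love a woman because she is beautiful, but she is beautiful because you love her."
head = apply_instructions(old, [("insert", b"She"), ("forward", 7), ("copy", 8)])
```

Running the three instructions given for the path from the starting point to V_2 yields "She love a ", the first 11 bytes of the new file.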
  • These instructions and their parameters are represented in binary by means of instruction encoding, which is described below.
  • FIG. 7 shows an embodiment of the structure of the instruction code. The first 2 bits of the first byte of the instruction code are the instruction identifier, which indicates the type of the instruction: for example, "00" corresponds to the "copy" instruction, "01" to the "insert" instruction, "10" to the "forward jump" instruction and "11" to the "backward jump" instruction. Other assignments are possible, as long as the correspondence is one-to-one.
  • The instruction code in FIG. 7 may consist of multiple bytes: the first byte uses 5 bits to represent the instruction parameter, and each remaining byte uses 7 bits to represent the instruction parameter. Specifically,
  • the composition of the bytes is: the first byte is instruction identifier + 0 + instruction parameter; each middle byte is 0 + instruction parameter; the last byte is 1 + instruction parameter.
  • The instruction parameter is a non-negative integer and the encoding length is variable; the end of the encoding is marked by a byte beginning with 1, that is, the tail byte.
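The variable-length instruction code can be sketched as an encoder and decoder. Several details are assumptions made for illustration and are not stated in the text: the most significant parameter bits are stored first, and in a one-byte code the flag bit after the identifier is set to 1 to mark that the first byte is also the tail byte. The identifier assignment follows the example given for FIG. 7; all names are hypothetical.

```python
OPCODES = {"copy": 0b00, "insert": 0b01, "forward": 0b10, "backward": 0b11}
NAMES = {code: name for name, code in OPCODES.items()}


def encode(op: str, param: int) -> bytes:
    """Encode one instruction: 2-bit identifier, end flag, 5 + 7k parameter bits."""
    if param < 0:
        raise ValueError("instruction parameters are non-negative")
    tail = []
    while param >= 32:              # peel off 7-bit groups until 5 bits remain
        tail.append(param & 0x7F)
        param >>= 7
    tail.reverse()                  # most significant group first (assumption)
    end_flag = 0 if tail else 1     # single-byte code: first byte is the tail byte
    out = bytearray([(OPCODES[op] << 6) | (end_flag << 5) | param])
    for k, group in enumerate(tail):
        is_last = k == len(tail) - 1
        out.append((0x80 if is_last else 0x00) | group)
    return bytes(out)


def decode(buf: bytes, pos: int = 0):
    """Decode one instruction at `pos`; return (op, param, next_pos)."""
    first = buf[pos]
    pos += 1
    op = NAMES[first >> 6]
    param = first & 0x1F
    done = bool(first & 0x20)
    while not done:
        byte = buf[pos]
        pos += 1
        param = (param << 7) | (byte & 0x7F)
        done = bool(byte & 0x80)
    return op, param, pos
```

The encoded lengths match the stated thresholds: parameters below 32 take 1 byte, below 4096 take 2 bytes, and below 524288 take 3 bytes.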
  • The incremental package can also be compressed to obtain an incremental compressed package.
  • In that case the incremental compressed package must be decompressed to restore the incremental package before the incremental package is applied.
  • Preferably, the incremental package is compressed with the LZMA (Lempel-Ziv-Markov chain algorithm) compression algorithm.
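Compression and decompression of the incremental package can be sketched with Python's standard `lzma` module. The function names are hypothetical, and the default container format and preset of `lzma.compress` are used, which the patent does not specify.

```python
import lzma


def compress_delta(delta: bytes) -> bytes:
    """Server side: turn the incremental package into an incremental compressed package."""
    return lzma.compress(delta)


def decompress_delta(blob: bytes) -> bytes:
    """Device side: restore the incremental package before applying it."""
    return lzma.decompress(blob)
```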
  • FIG. 8 shows this optimization process.
  • Steps S801 (constructing the similar information set) through S806 (applying the incremental package) in FIG. 8 correspond one-to-one to steps S201 through S206 in FIG. 2 and have the same method and function, so they are not described again here.
  • The main difference is that, after the incremental package is compressed, it is determined whether the incremental compressed package is the minimum. If it is not, the horizontal line segment map is optimized, the path map, the minimum cost path and the incremental package are constructed a second time, and the incremental package is compressed again, until the minimum incremental compressed package is finally obtained.
  • The specific process in FIG. 8 is as follows. After the incremental package has been constructed (S805), the process proceeds to the step of compressing the incremental package (S8051), which yields the incremental compressed package. The size of the incremental compressed package is then judged,
  • mainly to determine whether the incremental compressed package is the minimum (this process may need to be repeated several times). If it is the minimum, then before the incremental package is applied (S806) the decompression step (S8061)
  • restores the incremental package from the incremental compressed package, and the incremental package is applied. If the incremental compressed package is not the minimum, the process enters the step of optimizing the horizontal line segment map (S8021), in which the original horizontal line segment map is optimized; the path map, the minimum cost path and the incremental package are then constructed a second time,
  • and the incremental package is compressed, until a minimum incremental compressed package is finally obtained.
  • The meaning of this set is that the new file and the old file contain passages that are similar but not identical.
  • The existence of such passages reflects the historical inheritance relationship between the new file and the old file.
  • Exploiting this relationship is the fundamental reason why the incremental package can reduce the amount of data transmitted, so the nodes of this class in the path map
  • are a key part of the incremental package. Removing horizontal line segments of short length keeps the insertion data of the insert instructions from being excessively dispersed, which improves the compression efficiency of the incremental package and reduces the size of the incremental compressed package.
  • Figure 9 shows an embodiment of optimizing a horizontal line segment map.
  • FIG. 9(a) is the horizontal line segment map corresponding to all nodes, that is, to all identical character segments in the similar information set;
  • FIG. 9(b) shows, after the minimum cost path has been determined among all nodes,
  • the horizontal line segments corresponding to the nodes on the minimum cost path; and FIG. 9(c) shows the horizontal line segments obtained by optimizing FIG. 9(b). The comparison shows that after optimization the horizontal line segments 91, 92, 93, 94 and 95 of FIG. 9(b) have been deleted.
  • The representative tools applying the aforementioned RDIFF, VCDIFF and BSDIFF methods are Rsync, Xdelta and Bsdiff respectively.
  • The representative tool applying the data incremental update method of the present invention is Ddiff, and the experimental samples are 6 software packages for the Linux, Android and Win32 platforms. Referring to FIG. 10, according to the experimental data the average compression ratios of Rsync, Xdelta, Bsdiff and Ddiff are 13.2%, 60.3%, 63.6% and 69.3% respectively. The compression ratio is calculated as:
  • $\mathrm{Compression\_ratio} = (\mathrm{ASize} - \mathrm{BSize}) / \mathrm{ASize}$
  • where ASize represents the size of the file before compression and
  • BSize represents the size of the file after compression.
  • Xdelta, Bsdiff, and Ddiff have similar compression ratios when processing Sample 3 and Sample 6.
  • For Sample 1, Sample 2, and Sample 4, the new file and the old file are similar, but the version changes are more complex: besides content modification and addition, content blocks exchange positions and are copied with modifications, so the identical content segments shared by the new file and the old file are short and very numerous.
  • The compression ratio of the embodiment of the data incremental update method of the present invention is close to that of the other methods in extreme cases and significantly better than theirs in ordinary cases.
  • When analyzing the similarity between the old and new files, Rsync splits the files at a coarse granularity. The calculation is simple, but it does not favor an optimal incremental update scheme, so its compression ratio is significantly lower than that of the other tools.
  • Figure 11 shows the running time of applying the incremental package; all experiments were run on the same hardware platform.
  • The average running times of Rsync, Xdelta, Bsdiff, and Ddiff when applying the incremental package are 7750 ms, 546.5 ms, 1153.2 ms, and 602.8 ms, respectively.
  • Rsync runs significantly longer than the other tools because its incremental package is obtained in several parts, and after applying the current part it has to wait for the next one.
  • The new files in Sample 4 and Sample 6 are the largest. In these cases the running times of Xdelta and Ddiff are both small, because the main operation these two tools perform when generating the new file is string copying, with very few operations such as addition.
  • The new file in Sample 1 is the smallest; apart from Rsync, the running times of the tools are close to zero.
  • In Sample 2, Sample 3, and Sample 5, the running time of Ddiff is never the lowest, because Ddiff uses more data segments of the old file. This reduces the size of the incremental package, but the repeated jumps between read positions in the old file take more time, so the running time of applying the incremental package is slightly higher.
  • For consumer electronic devices applying the incremental package, Xdelta, with its efficient dictionary strategy, has the shortest running time, and the embodiment of the data incremental update method of the present invention, which consists of string copy and pointer jump operations, has a running time close to that of Xdelta. Bsdiff, which performs many additions, has a relatively high running time, and Rsync, whose incremental update process is unusual, has a long running time that depends strongly on the network rate.
  • Compared with the other data incremental update methods, the embodiment of the present invention has the highest compression ratio and a running time for applying the incremental package that is close to the minimum, giving the best overall incremental update performance.
  • The data incremental update method of the present invention converts the problem of generating the smallest incremental package into a shortest-path problem and generates the smallest incremental package from that path. The steps of constructing the similar information set, constructing the minimum cost path, and constructing the horizontal line segment graph are all optimized so that the final incremental compressed package is minimal, saving 69.3% of the data volume on average.
  • Compared with prior-art data incremental update methods, the compression ratio is the highest and the running time of applying the incremental package is short. The method of the invention has a wide range of application and is applicable not only to consumer electronic products but also to other platforms and systems.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

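The invention discloses a data incremental update method comprising the steps of constructing a similar information set, constructing a horizontal line segment graph, constructing a path graph, constructing a minimum cost path, constructing an incremental package, and applying the incremental package. With this technical solution, the invention represents the degree of similarity between the new and old files as a graph, converts the problem of generating the smallest incremental package into a shortest-path problem, and generates the smallest incremental package from that path. Incremental updating based on the method saves 69.3% of the data volume on average. Compared with prior-art data incremental update methods, it achieves a high compression ratio and a short running time when applying the incremental package. The method has a wide range of application: it is suitable not only for consumer electronic products but also for other platforms and systems.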

Description

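Data incremental update method
[Technical Field]
The present invention relates to the field of data updating for computers, smart devices, consumer electronic products and the like, and in particular to a data incremental update method.
[Background Art]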
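As consumer electronic products such as smartphones and wearable devices provide more and more application services, data update tasks such as system updates, application upgrades, and security vulnerability fixes are applied ever more widely. Changes in code or data are the main reason the new and old versions of a piece of software differ, and the difference between the two versions is often far smaller than the software itself. Updating incrementally is therefore very efficient: the incremental update finds the differences between the new file and the old file and expresses them as an incremental file containing instructions and data, and the device being updated uses the incremental file to convert the old file into the new file.
Figure 1 shows the basic flow of data incremental updating in the prior art. The old version file (software or data) and the new version file are stored on the server. Before a new version is released, the server compares the new file with the old file and expresses their differences as an incremental package. The incremental package is compressed into an incremental compressed package, which is then transmitted over the communication network to the device that needs the update. The device receives the incremental compressed package, decompresses it, and converts the old file into the new file according to the instructions and data in the incremental package.
In practice, the data update flow of Figure 1 is subject to constraints such as limited network bandwidth, traffic-based charges, and the small memory and limited battery power of consumer electronic devices. A smaller incremental compressed package reduces traffic charges and download time, while lower memory consumption and fewer CPU clock cycles when applying the incremental package shorten the time needed to generate the new file, save device power, and reduce the impact on other running applications, which ultimately improves the user experience. The main evaluation criteria for a data incremental update method on consumer electronic devices are therefore the size of the incremental package, the size of the incremental compressed package, and the time needed to apply the incremental package. Reducing both sizes and speeding up application of the incremental package are the key problems a data update method needs to solve.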
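In the prior art, data incremental update methods mainly include the RDIFF method proposed by A. Tridgell in "Efficient Algorithms for Sorting and Synchronization" (Australian National University, 1999), the BSDIFF method proposed by C. Percival in "Naive differences of executable code" (University of Oxford, 2003), and the VCDIFF method proposed by D. Korn et al. in "The VCDIFF Generic Differencing and Compression Data Format" (RFC 3284 (Proposed Standard), June 2002).
The RDIFF method divides the old file and the new file into consecutive data blocks of equal size, computes a hash value for each block, and uses the hash values to find identical blocks in the new and old files. The incremental package contains either a reference to a block of the old file or a complete new block. The method is structurally simple, fast to compute, and suited to the RDC (Remote Differential Compression) problem. Its drawback is that it cannot capture all of the similar information between the new and old files. Software applying this method includes Rsync and rdiff-backup.
The BSDIFF method first looks for data segments that are exactly identical in the new file and the old file, then extends the prefix and suffix of each such segment byte by byte to find segments that are close but not identical, called approximate data segments. BSDIFF uses the suffix array sorting algorithm proposed by N. J. Larsson and K. Sadakane. The instructions in its incremental package are ADD, INSERT, and SEEK: the parameters of ADD are the length of the approximate data segment and the correction values, the parameters of INSERT are the length and content of the inserted segment, and the parameter of SEEK is the distance by which the read pointer of the old file jumps. The correction values of the ADD operation make up most of the incremental package. The more similar the new and old files are, the higher the proportion of zeros among the correction values, and a high proportion of zeros makes the incremental compressed package smaller than the new file. The incremental package produced by BSDIFF is small, but constructing and applying it is computationally expensive and therefore time-consuming. BSDIFF is currently the most widely used incremental update algorithm, and the Bsdiff tool applies it.
The VCDIFF method implements incremental updating by improving on the LZ77 (Lempel-Ziv 77) compression algorithm. LZ77 compresses data by exploiting the fact that later segments of a data stream are similar or related to earlier ones. Borrowing this idea, VCDIFF concatenates the old file and the new file into one data stream, compresses it with LZ77, and takes the part of the compressed stream that expresses the new file as the incremental package. The performance of VCDIFF lies between that of RDIFF and BSDIFF: it gives up some compression ratio in exchange for faster execution. Xdelta is one of the programs that apply VCDIFF, and it further improves on VCDIFF by optimizing the instruction set, which reduces the size of the incremental compressed package even more.
A data incremental update method is therefore needed that, compared with the prior art, delivers significant improvements in reducing the size of the incremental package, raising its compression ratio, and speeding up execution on the client device.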
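[Summary of the Invention]
The main technical problem solved by the invention is that prior-art data incremental update methods generally suffer from large incremental packages, low compression ratios, and long running times, and cannot balance these properties well at the same time. The invention provides a data incremental update method that constructs the incremental package by converting the task into a search for a minimum cost path.
To solve the above technical problem, one technical solution adopted by the invention is to provide a data incremental update method comprising the following steps. Step 1, construct the similar information set Segment: compare the characters in the new file array New with the characters in the old file array Old and output multiple identical character fragments segment(s,t,l), where s is the position of the fragment in the old file array Old, t is its position in the new file array New, and l is its length in bytes. The similar information set Segment is then: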
Segment={segment(s,t,l)|old[s+i]=new[t+i],i=0,1,2,...,l-1};
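Step 2, construct the horizontal line segment graph: convert the character fragments segment(s,t,l) in the similar information set Segment into corresponding horizontal line segments seg_i(t,s-t) in the horizontal line segment graph, where the left endpoint of seg_i(t,s-t) has coordinates (t,s-t), its length is l, and i is its index in the graph.
Step 3, construct the path graph: map the horizontal line segments seg_i(t,s-t) to nodes V_i of the path graph, build node edges between the nodes V_i, and compute the edge cost of every node edge.
Step 4, construct the minimum cost path: there are multiple paths from the start node segment(0,0,0) through the nodes V_i of the path graph to the end node segment(newSize,0,0), where newSize is the number of bytes in the new file array New. Compute the sum of the edge costs of the node edges on each path; the path with the smallest sum is the minimum cost path.
Step 5, construct the incremental package: based on the old file array Old and the new file array New, and using an instruction set and a data set, determine the instruction codes between adjacent nodes in order along the minimum cost path from the start node; these instruction codes make up the incremental package.
Step 6, apply the incremental package: starting from the old file array Old, generate the new file byte by byte according to the instruction codes in the incremental package.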
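In another embodiment of the data incremental update method of the invention, the method of comparing the bytes in the new file array New with the bytes in the old file array Old comprises:
First, suffix-sort the old file array Old to obtain its suffix array I. Then, using the suffix array I, search the old file array Old for the fragment {old[s],old[s+1],...,old[s+l-1]} that has the longest prefix match with {new[t],new[t+1],...,new[newSize-1]} of the new file array New, and output:
new[t]=old[s],new[t+1]=old[s+1],...,new[t+l-1]=old[s+l-1].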
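In another embodiment of the data incremental update method of the invention, constructing the similar information set Segment further includes pruning it. The pruning methods are: first, if segment(s',t',l')∈Segment and segment(s″,t″,l″)∈Segment exist with s″=s'+k, t″=t'+k, l″=l'-k, and 0<k<l', remove segment(s″,t″,l″) from the similar information set Segment; and/or, second, the byte count l of a character fragment segment(s,t,l) must be greater than or equal to a threshold Lmin, so remove from the similar information set Segment every character fragment whose byte count l is less than Lmin; and/or, third, if segment(s',t',l')∈Segment and segment(s″,t″,l″)∈Segment exist with s″≠s', t″=t', and l″<l', remove segment(s″,t″,l″) from the similar information set Segment.
In another embodiment of the data incremental update method of the invention, the threshold Lmin=3, or Lmin is another positive integer less than 10.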
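In another embodiment of the data incremental update method of the invention, in constructing the path graph, a node edge from node V_x to node V_y exists between the character fragment segment(s_x,t_x,l_x) of node V_x and the character fragment segment(s_y,t_y,l_y) of node V_y if and only if (t_x+l_x)<(t_y+l_y). Node V_x builds node edges to other nodes adjacent to it, and the number of node edges built does not exceed the node degree MAX_CONECTION.
In another embodiment of the data incremental update method of the invention, the node degree MAX_CONECTION=3.
In another embodiment of the data incremental update method of the invention, the edge cost of the node edge from node V_x to node V_y is computed as follows: expression bytes are formed from instruction codes and encoded data, and the number of expression bytes needed to move from the right endpoint of the horizontal line segment seg_x(t_x,s_x-t_x) of node V_x to the right endpoint of the horizontal line segment seg_y(t_y,s_y-t_y) of node V_y is the value of the edge cost.
In another embodiment of the data incremental update method of the invention, the Dijkstra algorithm is used to compute the minimum cost path in the step of constructing the minimum cost path.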
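In another embodiment of the data incremental update method of the invention, in constructing the incremental package, the instruction set includes the "insert", "copy", "jump forward", and "jump backward" instructions, the data set consists of the character parameters operated on by the "insert" instruction, and each instruction code includes an instruction identifier and an instruction parameter.
In another embodiment of the data incremental update method of the invention, the instruction identifier occupies 2 bits with the four codes 00, 01, 10, and 11, corresponding to the four instructions "insert", "copy", "jump forward", and "jump backward".
In another embodiment of the data incremental update method of the invention, the structure of the instruction code is: first byte: instruction identifier + 0 + instruction parameter; middle byte: 0 + instruction parameter; last byte: 1 + instruction parameter.
In another embodiment of the data incremental update method of the invention, the instruction code is 1 byte long when the value of the instruction parameter is less than 32, 2 bytes long when the value is greater than or equal to 32 and less than 4096, and 3 bytes long when the value is greater than or equal to 4096 and less than 524288.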
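In another embodiment of the data incremental update method of the invention, constructing the incremental package further includes compressing the incremental package to obtain an incremental compressed package, and applying the incremental package includes decompressing the incremental compressed package to recover the incremental package.
In another embodiment of the data incremental update method of the invention, the LZMA compression algorithm is used to compress the incremental package into the incremental compressed package.
In another embodiment of the data incremental update method of the invention, before the incremental package is compressed with the LZMA compression algorithm, the method returns to the step of constructing the horizontal line segment graph and optimizes that graph; the path graph, the minimum cost path, and the incremental package are then constructed a second time, and the incremental package is compressed to obtain the smallest incremental compressed package.
In another embodiment of the data incremental update method of the invention, the method of optimizing the horizontal line segment graph comprises: first, if s-t=0 for a horizontal line segment seg_i(t,s-t), keep that segment; and, second, if a horizontal line segment seg_i(t,s-t) is scattered, cannot form an approximately horizontal straight line with the other segments around it, and has a length l less than a length threshold N, delete that segment.
In another embodiment of the data incremental update method of the invention, the length threshold N=3, or N is another positive integer less than 10.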
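The beneficial effects of the invention are as follows. Through the steps of constructing the similar information set, constructing the horizontal line segment graph, constructing the path graph, constructing the minimum cost path, constructing the incremental package, and applying the incremental package, the invention represents the degree of similarity between the new and old files as a graph, converts the problem of generating the smallest incremental package into a shortest-path problem, and generates the smallest incremental package from that path. Incremental updating based on the method saves 69.3% of the data volume on average. Compared with prior-art data incremental update methods, it has the highest compression ratio and a short running time when applying the incremental package. The method has a wide range of application: it is suitable not only for consumer electronic products but also for other platforms and systems.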
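[Brief Description of the Drawings]
Figure 1 is a flowchart of prior-art data incremental updating;
Figure 2 is a flowchart of one embodiment of the data incremental update method of the invention;
Figure 3 is a schematic diagram of an embodiment of constructing the horizontal line segment graph in another embodiment of the data incremental update method of the invention;
Figure 4 is a schematic diagram of an embodiment of constructing the path graph in another embodiment of the data incremental update method of the invention;
Figure 5 is a schematic diagram of an embodiment of computing edge costs in another embodiment of the data incremental update method of the invention;
Figure 6 is a flowchart of an embodiment of constructing the incremental package in another embodiment of the data incremental update method of the invention;
Figure 7 is a structural diagram of the instruction code in another embodiment of the data incremental update method of the invention;
Figure 8 is a flowchart of another embodiment of the data incremental update method of the invention;
Figure 9 is a schematic diagram of an embodiment of optimizing the horizontal line segment graph in another embodiment of the data incremental update method of the invention;
Figure 10 is a comparative analysis chart of incremental update compression ratios for another embodiment of the data incremental update method of the invention;
Figure 11 is a comparative analysis chart of the running time of applying the incremental package for another embodiment of the data incremental update method of the invention.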
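[Detailed Description]
To make the invention easier to understand, it is described in more detail below with reference to the drawings and specific embodiments. The drawings show preferred embodiments of the invention. The invention can, however, be implemented in many different forms and is not limited to the embodiments described in this specification. Rather, these embodiments are provided so that the disclosure of the invention is understood more thoroughly and completely.
Unless otherwise defined, all technical and scientific terms used in this specification have the meanings commonly understood by those skilled in the technical field of the invention. The terms used in the specification serve only to describe specific embodiments and are not intended to limit the invention. The term "and/or" as used in this specification includes any and all combinations of one or more of the associated listed items.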
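Figure 2 is a flowchart of one embodiment of the data incremental update method of the invention, comprising the following steps: constructing the similar information set S201; constructing the horizontal line segment graph S202; constructing the path graph S203; constructing the minimum cost path S204; constructing the incremental package S205; applying the incremental package S206. Each step is described below with reference to specific embodiments.
First, step S201 compares the old file with the new file to find the identical character fragments in their contents; these identical fragments make up the similar information set. For illustration, suppose the content of an example old file is "You do not love a woman because she is beautiful,but she is beautiful because you love her." and the content of an example new file is "She love a man because he do not just love her beauty.She is beautiful because a beautiful love.". The old file and the new file are represented as arrays, defined as the old file array Old and the new file array New. Byte 0 of the old file array, Old[0], corresponds to the first character "Y" of the old file, byte 1, Old[1], to the character "o", byte 2, Old[2], to the character "u", byte 3, Old[3], to the space character " ", and so on. The new file array New is organized in the same way as the old file array Old, so the description is not repeated. Although page layout makes each text occupy two lines here, neither file contains a line-break character; both consist only of English letters, spaces, and punctuation marks.
To describe the identical character fragments in the old file array Old and the new file array New, define the notation segment(s,t,l), where s is the index of the fragment in the old file array Old, t is its index in the new file array New, and l is the number of bytes it occupies. For example, segment(33,1,3)="he " is the identical fragment formed by bytes Old[33], Old[34], Old[35] of the old file array and bytes New[1], New[2], New[3] of the new file array. Likewise, segment(10,3,8)=" love a " corresponds to bytes Old[10] to Old[17] of the old file array and bytes New[3] to New[10] of the new file array. The similar information set Segment made up of these identical fragments can then be written as:
Segment={segment(s,t,l)|old[s+i]=new[t+i],i=0,1,2,...,l-1}.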
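To find the identical character fragments in the old file array Old and the new file array New, the characters in New are compared with the characters in Old as follows:
First, suffix-sort the old file array Old and return its suffix array I. Then, using the suffix array I, search the old file array Old for the fragment {old[s],old[s+1],...,old[s+l-1]} that has the longest prefix match with {new[t],new[t+1],...,new[newSize-1]} of the new file array New, where newSize is the number of bytes in New, and output:
new[t]=old[s],new[t+1]=old[s+1],...,new[t+l-1]=old[s+l-1].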
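The suffix-array search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names are hypothetical, the suffix array is built by naive sorting rather than a dedicated suffix sorting algorithm, and the longest match is located by binary search over the sorted suffixes.

```python
def build_suffix_array(old):
    """Return the start positions of all suffixes of `old` in lexicographic order."""
    return sorted(range(len(old)), key=lambda i: old[i:])


def common_prefix_len(a, b):
    """Length of the longest common prefix of two byte strings."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def longest_match(old, suffix_array, target):
    """Return (s, l): a position in `old` and the length of the longest
    prefix of `target` that occurs in `old` starting at that position."""
    lo, hi = 0, len(suffix_array)
    # Binary search for the first suffix that is >= target.
    while lo < hi:
        mid = (lo + hi) // 2
        if old[suffix_array[mid]:] < target:
            lo = mid + 1
        else:
            hi = mid
    # The best match is one of the two suffixes around the insertion point.
    best = (0, 0)
    for idx in (lo - 1, lo):
        if 0 <= idx < len(suffix_array):
            s = suffix_array[idx]
            l = common_prefix_len(old[s:], target)
            if l > best[1]:
                best = (s, l)
    return best


OLD = b"You do not love a woman because she is beautiful,but she is beautiful because you love her."
NEW = b"She love a man because he do not just love her beauty.She is beautiful because a beautiful love."

I = build_suffix_array(OLD)
s, l = longest_match(OLD, I, NEW[11:])  # longest match for the text starting at t = 11
print(s, l, OLD[s:s + l])
```

With the sample files, the search from t=11 recovers the fragment that the description writes as segment(20,11,12).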
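However, the similar information set Segment built this way contains many identical character fragments; that is, its cardinality |Segment| is large, with the range:
0≤|Segment|≤((newSize×(newSize+1)×(3×oldSize-newSize+1))/6)
In this formula, newSize is the number of bytes in the new file array New and oldSize is the number of bytes in the old file array Old. The formula holds when newSize≤oldSize; if newSize>oldSize, newSize and oldSize swap places, giving:
0≤|Segment|≤((oldSize×(oldSize+1)×(3×newSize-oldSize+1))/6)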
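As a quick check of how large this bound becomes, the two formulas can be evaluated directly. The sketch below is illustrative only; the function name is hypothetical, and it simply picks whichever of the two formulas applies.

```python
def max_segments(new_size, old_size):
    """Upper bound on |Segment| from the two formulas above."""
    a, b = sorted((new_size, old_size))  # a is the smaller size, b the larger
    return a * (a + 1) * (3 * b - a + 1) // 6


n = 2 ** 20  # both files are 1 MiB
bound = max_segments(n, n)
print(bound, bound.bit_length())
```

For two 1 MiB files the bound reduces to n(n+1)(2n+1)/6, which is roughly 2^60/3 and so lies between 2^58 and 2^59.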
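This shows that the number of identical character fragments in the similar information set Segment can be extremely large. For example, if the new file and the old file are both 1 MB (2^20 bytes), the bound above gives a theoretical maximum for |Segment| of about 2^60/3, which is more than 2^58. The identical character fragments in the similar information set Segment therefore need to be pruned. If |Segment| is too small after pruning, important similar information is lost and the later incremental update suffers; if |Segment| is too large, computation takes too long or becomes intractable. The size of the pruned similar information set Segment therefore has to be kept in an optimal range. Preferably, the pruning methods include: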
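Method 1: if segment(s',t',l')∈Segment and segment(s″,t″,l″)∈Segment exist with s″=s'+k, t″=t'+k, l″=l'-k, and 0<k<l', remove segment(s″,t″,l″) from the similar information set Segment.
With the old and new file examples above, Method 1 applies as follows: let segment(s',t',l')=segment(20,11,12)="man because " and segment(s″,t″,l″)=segment(21,12,11)="an because ". Clearly "man because " already contains "an because ", so segment(21,12,11)="an because " is removed.
Method 2: if segment(s',t',l')∈Segment and segment(s″,t″,l″)∈Segment exist with s″≠s', t″=t', and l″<l', remove segment(s″,t″,l″) from the similar information set Segment.
With the old and new file examples above, Method 2 applies as follows: let segment(s',t',l')=segment(82,38,8)="love her" and segment(s″,t″,l″)=segment(11,38,4)="love". Clearly "love her" already contains "love", so segment(11,38,4)="love" is removed.
Method 3: the byte count l of a character fragment segment(s,t,l) must be greater than or equal to the threshold Lmin, so every character fragment whose byte count l is less than Lmin is removed from the similar information set Segment.
With the old and new file examples above, Method 3 applies as follows: with Lmin=3, segment(3,25,3)=" do" is removed, while segment(3,25,8)=" do not " is kept. Lmin can also be another positive integer less than 10.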
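Preferably, when all three pruning methods are applied together, the cardinality of the pruned similar information set Segment lies in the range:
0≤|Segment|≤newSize+1-Lmin
Clearly, pruning keeps the size of the similar information set Segment under effective control.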
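The three pruning rules can be sketched as a single pass over a collection of (s, t, l) tuples. This is a hedged illustration: the function name `prune` and the order in which the rules are applied are assumptions, and the fragment (4,26,2) in the usage example is a hypothetical short fragment added to exercise the threshold rule.

```python
def prune(segments, l_min=3):
    """Apply the three pruning rules to (s, t, l) tuples and return the survivors sorted by t."""
    # Threshold rule: drop fragments shorter than l_min bytes.
    kept = {seg for seg in segments if seg[2] >= l_min}

    # Suffix rule: fragments with the same offset s - t and the same end t + l
    # are suffixes of the longest one, so keep only the longest of each group.
    longest = {}
    for s, t, l in kept:
        key = (s - t, t + l)
        if key not in longest or l > longest[key][2]:
            longest[key] = (s, t, l)
    kept = set(longest.values())

    # Same-start rule: drop a fragment if another fragment starts at the same t,
    # comes from a different s, and is strictly longer.
    by_t = {}
    for seg in kept:
        by_t.setdefault(seg[1], []).append(seg)
    survivors = [
        seg for seg in kept
        if not any(o[0] != seg[0] and o[2] > seg[2] for o in by_t[seg[1]])
    ]
    return sorted(survivors, key=lambda seg: (seg[1], seg[0]))


candidates = [
    (20, 11, 12), (21, 12, 11),  # "man because " and its suffix "an because "
    (82, 38, 8), (11, 38, 4),    # "love her" and the shorter "love" at the same t
    (3, 25, 8), (4, 26, 2),      # " do not " and a hypothetical 2-byte fragment
]
print(prune(candidates))
```

Only the three longest fragments survive, which mirrors the worked examples given for the individual rules.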
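Next comes step S202 of Figure 2, constructing the horizontal line segment graph. The character fragments segment(s,t,l) in the similar information set Segment obtained in step S201 are converted into corresponding horizontal line segments seg_i(t,s-t) in a horizontal line segment graph. The left endpoint of seg_i(t,s-t) has coordinates (t,s-t), with abscissa t and ordinate s-t, its length is l, and i is its index in the graph.
Figure 3 is a schematic diagram of an embodiment of constructing the horizontal line segment graph in another embodiment of the data incremental update method of the invention. It is based on the old file example "You do not love a woman because she is beautiful,but she is beautiful because you love her." and the new file example "She love a man because he do not just love her beauty.She is beautiful because a beautiful love.". Each line segment in Figure 3 corresponds to one identical character fragment of the similar information set Segment, so the figure shows how similar the new file and the old file are: the longer or the more numerous the segments, the stronger the correlation between the two files. Take seg3(11,9) as an example. Its left endpoint has coordinates (11,9) and its length is 12, which means that the 12-character fragment starting at character 11 of the new file is identical to the 12-character fragment starting at character 20 (11+9) of the old file; the shared fragment is "man because ". Table 1 below lists the coordinates of each horizontal line segment in Figure 3 and the corresponding identical character fragment.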
Table 1. Coordinates, lengths, and character fragments of the horizontal line segments
Horizontal line segment seg_i(t,s-t)    Length l    Character fragment
seg1(1,32)                              3           "he "
seg2(3,7)                               8           " love a "
seg3(11,9)                              12          "man because "
seg4(21,64)                             4           "e he"
seg5(23,31)                             3           "he "
seg6(25,-22)                            8           " do not "
seg7(37,-27)                            6           " love "
seg8(37,44)                             9           " love her"
seg9(46,-9)                             6           " beaut"
seg10(55,-1)                            24          "he is beautiful because "
seg11(70,-47)                           9           " because "
seg12(77,-63)                           4           "e a "
seg13(80,-21)                           11          " beautiful "
seg14(90,-9)                            5           " love"
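The mapping between a fragment segment(s,t,l) and its horizontal line segment (t, s-t) of length l is a simple change of coordinates. The sketch below shows it in both directions and checks a few rows of Table 1 against the sample files; the helper names are hypothetical.

```python
OLD = "You do not love a woman because she is beautiful,but she is beautiful because you love her."
NEW = "She love a man because he do not just love her beauty.She is beautiful because a beautiful love."


def to_horizontal(segment):
    """Map segment(s, t, l) to (t, s - t, l): left endpoint (t, s - t) and length l."""
    s, t, l = segment
    return (t, s - t, l)


def from_horizontal(horizontal):
    """Inverse mapping: recover (s, t, l) from (t, s - t, l)."""
    t, offset, l = horizontal
    return (t + offset, t, l)


def fragment(horizontal):
    """Return the shared text of a horizontal segment, or None if old and new differ there."""
    s, t, l = from_horizontal(horizontal)
    return NEW[t:t + l] if OLD[s:s + l] == NEW[t:t + l] else None


print(to_horizontal((20, 11, 12)))   # seg3 in Table 1
print(repr(fragment((11, 9, 12))))
print(repr(fragment((55, -1, 24))))  # seg10 in Table 1
```

Because the ordinate is the offset s-t, fragments that come from one shifted but otherwise unchanged passage share the same ordinate and line up horizontally.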
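Next comes step S203 of Figure 2, constructing the path graph. The horizontal line segments seg_i(t,s-t) of the horizontal line segment graph are mapped to nodes V_i of the path graph, node edges are built between the nodes V_i, and the edge cost of every node edge is computed. This is explained below with reference to Figure 4.
Figure 4 is a schematic diagram of an embodiment of constructing the path graph in another embodiment of the data incremental update method of the invention. The figure includes a start point, corresponding to segment(0,0,0), and an end point, corresponding to segment(newSize,0,0), where newSize is the number of bytes in the new file array New. Node V1 in Figure 4 corresponds to seg1(1,32) in Figure 3, node V2 to seg2(3,7), node V3 to seg3(11,9), and so on. Figure 4 also shows dashed connections from node V1 to each of nodes V2, V3, and V4. Such a connection between two nodes is called a node edge. The node edge from V1 to V2 is labeled 2, the edge from V1 to V3 is labeled 10, and the edge from V1 to V4 is labeled 21; these values are called edge costs. In general, an edge cost is computed as follows: expression bytes are formed from instruction codes and encoded data, and the number of expression bytes needed to move from the right endpoint of the horizontal line segment seg_x(t_x,s_x-t_x) of node V_x to the right endpoint of the horizontal line segment seg_y(t_y,s_y-t_y) of node V_y is the edge cost of the node edge from V_x to V_y. The computation is illustrated below with reference to Figures 4 and 5.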
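Figure 5 is a schematic diagram of an embodiment of computing edge costs in another embodiment of the data incremental update method of the invention. In Figure 4, the edge cost from node V3 to V6 is 5. Referring to Figure 5(a), while the new file is being generated from the old file using the similar information set Segment, once the similar information seg3(11,9) of V3 has been used the new file has been generated up to character 22, that is, "She love a man because ". To then use the similar information seg6(25,-22) of V6, 2 characters, "he", must first be inserted. Note that expressing "insert 2 bytes" as an instruction code takes at least 1 byte of instruction code, as explained further under construction of the incremental package below, while "he" itself takes 2 bytes, one for each of the characters "h" and "e". Then, after seg3(11,9) has been used and the 2 characters "he" have been inserted, the old-file pointer, which points to character (24+9), must be moved backward by (9-(-22)) positions so that it points to character 2 of the old file, and the 8 characters starting at the pointer position, " do not ", are copied. Expressing "jump backward 31 characters" takes at least 1 byte of instruction code, and expressing "copy 8 characters" takes at least 1 more byte. The number of expression bytes needed to move from the right endpoint of seg3(11,9) of node V3 to the right endpoint of seg6(25,-22) of node V6 is therefore 5 (1 + 2 + 1 + 1), which is the edge cost from V3 to V6.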
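Figure 5(b) shows how the edge cost from node V13 to V14 is obtained. After the similar information seg13(80,-21) of V13 has been used, the new file has been generated up to byte 90. The read pointer in the old file moves forward from character position (90+(-21)) by ((-9)-(-21)) positions so that it points to character 81 of the old file, and the 4 characters starting at the pointer position, "love", are copied. Expressing "jump forward 12 characters" takes at least 1 byte of instruction code, and expressing "copy 4 characters" takes at least 1 more byte, so the edge cost from node V13 to V14 is 2 (1 + 1).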
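The edge cost calculation walked through above can be sketched as a small function. This is one reading of the description rather than a definitive implementation: the function names are hypothetical, parameter lengths follow the 1-, 2-, and 3-byte thresholds of the instruction code (32, 4096, 524288), and it is assumed that no jump or copy instruction is emitted when nothing remains to be copied.

```python
def encoded_len(value):
    """Bytes needed for one instruction code whose parameter is `value`
    (5 parameter bits in the first byte, 7 in each following byte)."""
    n = 1
    while value >= 1 << (5 + 7 * (n - 1)):
        n += 1
    return n


def edge_cost(x, y):
    """Expression bytes needed to move from the right end of fragment x
    to the right end of fragment y, both given as (s, t, l)."""
    sx, tx, lx = x
    sy, ty, ly = y
    end_x, end_y = tx + lx, ty + ly
    if end_y <= end_x:
        raise ValueError("an edge requires (tx + lx) < (ty + ly)")
    cost = 0
    gap = ty - end_x                      # new-file bytes not covered by x or y
    if gap > 0:
        cost += encoded_len(gap) + gap    # insert instruction plus its data bytes
    copy_len = end_y - max(end_x, ty)     # part of y that is still missing
    if copy_len > 0:
        jump = abs((sy - ty) - (sx - tx))  # change of the offset s - t
        if jump:
            cost += encoded_len(jump)     # jump forward or backward
        cost += encoded_len(copy_len)     # copy instruction
    return cost


V1, V2, V3, V4 = (33, 1, 3), (10, 3, 8), (20, 11, 12), (85, 21, 4)
V6, V13, V14 = (3, 25, 8), (59, 80, 11), (81, 90, 5)
print(edge_cost(V3, V6), edge_cost(V13, V14))
print(edge_cost(V1, V2), edge_cost(V1, V3), edge_cost(V1, V4))
```

Under these assumptions the function reproduces the costs quoted in the text for V3 to V6 and V13 to V14, as well as the three labels on the edges leaving V1 in Figure 4.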
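Further, in Figure 4 node V1 has only 3 node edges, to nodes V2, V3, and V4, whereas in theory node edges from V1 to the other nodes V5, V6, V7, V8, V9, V10, V11, V12, V13, and V14 should exist as well. These edges are not selected mainly because node edges are built primarily between adjacent nodes, which lowers the space complexity of constructing the path graph. The other nodes in Figure 4 are treated in the same way as node V1.
The selection of node edges therefore needs to be restricted. The main rule is: between the character fragment segment(s_x,t_x,l_x) of a node V_x and the character fragment segment(s_y,t_y,l_y) of another node V_y, a node edge from V_x to V_y exists if and only if (t_x+l_x)<(t_y+l_y). Node V_x builds node edges to other nodes adjacent to it, and the number of node edges built does not exceed the node edge threshold MAX_CONECTION. Preferably, the node edge threshold MAX_CONECTION=5. The condition (t_x+l_x)<(t_y+l_y) means that, in the horizontal line segment graph, the right endpoint of the segment of node V_y lies further to the right than the right endpoint of the segment of node V_x. Node V_y thus advances to the right relative to node V_x, which guarantees that new character information is gained. Building node edges mainly between adjacent nodes and capping the number of node edges both serve to avoid unnecessary node edges and to lower the space complexity of the path graph.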
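In terms of technical effect, if the path graph has n' nodes, the maximum number of node edges is n'^2 when edge construction is unrestricted, and it falls to MAX_CONNECTION×n' once node edges are restricted.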
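A minimal sketch of the edge restriction follows. The text does not define "adjacent" precisely, so this sketch assumes that nodes are ordered by their start position t and that each node connects to the first few later nodes satisfying (t_x+l_x)<(t_y+l_y). The function name and the default cap of 3, which matches the three edges drawn from V1 in Figure 4, are assumptions; the start and end nodes are left out for brevity.

```python
def build_edges(nodes, max_connection=3):
    """Return {node index: [successor indices]} for fragments given as (s, t, l)."""
    # Assumed notion of adjacency: order nodes by start position t, then by right end.
    order = sorted(range(len(nodes)),
                   key=lambda i: (nodes[i][1], nodes[i][1] + nodes[i][2]))
    edges = {i: [] for i in order}
    for pos, x in enumerate(order):
        end_x = nodes[x][1] + nodes[x][2]
        for y in order[pos + 1:]:
            if len(edges[x]) >= max_connection:
                break  # cap on outgoing edges reached
            if end_x < nodes[y][1] + nodes[y][2]:
                edges[x].append(y)  # y ends further right, so it adds new text
    return edges


# V1 .. V5 from Table 1, written as (s, t, l).
nodes = [(33, 1, 3), (10, 3, 8), (20, 11, 12), (85, 21, 4), (54, 23, 3)]
edges = build_edges(nodes)
print(edges)
```

With the cap in place the number of edges grows linearly with the number of nodes instead of quadratically.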
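Next comes step S204 of Figure 2, constructing the minimum cost path. There are multiple paths from the start point segment(0,0,0) through the nodes V_i of the path graph to the end point segment(newSize,0,0), where newSize is the number of bytes in the new file array New. The sum of the edge costs of the node edges on each path is computed, and the path with the smallest sum is the minimum cost path.
From Figure 4 it can be computed that, among the paths from the start point segment(0,0,0) to the end point segment(newSize,0,0), the path from the start point through V2, V3, V6, V8, V9, V10, V13, and V14 to the end point is the minimum cost path; it is marked with solid lines in Figure 4. The well-established Dijkstra algorithm can be used to find the minimum cost path in the path graph, and the minimum cost path in Figure 4 was determined with it.
The problem of constructing the smallest incremental package is thus converted into a shortest-path problem, and the smallest incremental package is the one generated from the minimum cost path.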
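The shortest-path search can be sketched with Dijkstra's algorithm on a small adjacency list. The graph below is a toy example: the node names echo Figure 4, and the edge weights are illustrative values, not the full set of costs from the figure.

```python
import heapq


def min_cost_path(graph, start, goal):
    """Dijkstra's algorithm over {node: [(neighbor, cost), ...]}.
    Returns (total cost, list of nodes from start to goal)."""
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        if u == goal:
            break
        for v, w in graph.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if goal not in dist:
        raise ValueError("goal is not reachable from start")
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return dist[goal], path[::-1]


# Toy path graph with illustrative edge costs.
graph = {
    "start": [("V1", 5), ("V2", 6)],
    "V1": [("V2", 2), ("V3", 10), ("V4", 21)],
    "V2": [("V3", 2)],
    "V3": [("end", 9)],
    "V4": [("end", 4)],
}
print(min_cost_path(graph, "start", "end"))
```

Because every edge cost is a non-negative byte count, Dijkstra's algorithm is applicable without modification.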
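In step S205 of Figure 2, constructing the incremental package, the incremental package is generated along the minimum cost path constructed above. The incremental package is essentially a file made up of an instruction set and a data set. Specifically, using the instruction set and the data set, the instruction codes between adjacent nodes are determined in order along the minimum cost path from the start point, and the instruction codes and the data set are then assembled in order into the incremental package. Preferably, the instruction set includes the "insert", "copy", "jump forward", and "jump backward" instructions, the data set consists of the character parameters operated on by the "insert" instruction, and each instruction code includes an instruction identifier and an instruction parameter.
The generation of the incremental package is explained with reference to Figure 6. The minimum cost path in Figure 6 is the one from the embodiment of Figure 4. The instructions needed from the start point to node V2 are, in order, "insert", "jump forward", and "copy": the parameter of the "insert" instruction is 3, followed by the inserted data "She"; the parameter of the "jump forward" instruction is 7; and the parameter of the "copy" instruction is 8. The instructions between other nodes are similar. For example, the instructions between nodes V3 and V6, with their parameters and data, are in order "insert 2 he", "jump backward 31", and "copy 8". The "insert", "copy", "jump forward", and "jump backward" instructions and their parameters are represented in binary by means of instruction codes, as explained further below with reference to Figure 7.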
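Figure 7 shows one embodiment of the structure of the instruction code. The first 2 bits of the first byte of the instruction code are the instruction identifier, which indicates the type of instruction: for example, "00" corresponds to the "copy" instruction, "01" to the "insert" instruction, "10" to the "jump forward" instruction, and "11" to the "jump backward" instruction. Other assignments are possible as long as the mapping is one-to-one. The instruction parameter in the structure of Figure 7 can span several bytes: the first byte carries 5 bits of the parameter, and every other byte carries 7 bits. The structure of each byte is: first byte: instruction identifier + 0 + instruction parameter; middle byte: 0 + instruction parameter; last byte: 1 + instruction parameter. The instruction parameter is thus a non-negative integer with a variable-length encoding, and the end of the encoding is marked by the byte that begins with 1, that is, the last byte. When the instruction parameter is less than 2^5=32, the instruction code is 1 byte long; when it is greater than or equal to 32 and less than 2^(5+7)=4096, the instruction code is 2 bytes long; when it is greater than or equal to 4096 and less than 2^(5+7+7)=524288, the instruction code is 3 bytes long; and so on.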
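One possible bit-level reading of this variable-length instruction code is sketched below. Several details are assumptions not fixed by the text: the parameter is written most significant bits first, the flag bit follows the 2-bit identifier in the first byte and is the top bit of later bytes, and the flag is set to 1 on whichever byte ends the code, including a single-byte code. The identifier values follow the example assignment given for Figure 7.

```python
COPY, INSERT, FORWARD, BACKWARD = 0b00, 0b01, 0b10, 0b11


def encode(op, param):
    """Encode one instruction as bytes: 2-bit identifier, end flag, variable-length parameter."""
    if param < 0:
        raise ValueError("parameter must be non-negative")
    n = 1
    while param >= 1 << (5 + 7 * (n - 1)):
        n += 1
    first = (op << 6) | ((param >> (7 * (n - 1))) & 0x1F)
    if n == 1:
        first |= 0x20  # assumed: the flag marks a single byte as the last byte
    out = bytearray([first])
    for i in range(n - 2, -1, -1):
        byte = (param >> (7 * i)) & 0x7F
        if i == 0:
            byte |= 0x80  # the last byte starts with 1
        out.append(byte)
    return bytes(out)


def decode(buf, pos=0):
    """Decode one instruction at `pos`; return (op, param, next position)."""
    first = buf[pos]
    op, param, done = first >> 6, first & 0x1F, bool(first & 0x20)
    pos += 1
    while not done:
        byte = buf[pos]
        pos += 1
        param = (param << 7) | (byte & 0x7F)
        done = bool(byte & 0x80)
    return op, param, pos


for value in (3, 31, 32, 4095, 4096, 524287):
    code = encode(INSERT, value)
    print(value, len(code), decode(code))
```

The encoded lengths change at 32, 4096, and 524288, matching the thresholds stated above.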
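Once the incremental package has been obtained, the new file can be generated byte by byte from the old file array Old according to the instruction codes in the incremental package. This is done in step S206 of Figure 2, applying the incremental package. Applying an incremental package to generate a new file and replace the old file is prior art and is not described further.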
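Applying the instructions can be sketched with already-decoded instructions. This is a hedged interpretation of the worked examples: the read position in the old file is modelled as the current write position plus an offset s-t, "jump forward" increases that offset, "jump backward" decreases it, and an insert leaves the offset unchanged. The tuple format and function name are hypothetical.

```python
OLD = "You do not love a woman because she is beautiful,but she is beautiful because you love her."


def apply_delta(old, instructions):
    """Rebuild new-file text from `old` and a list of (operation, argument) pairs."""
    out = []
    written = 0   # number of characters of the new file produced so far
    offset = 0    # current value of s - t (old position minus new position)
    for op, arg in instructions:
        if op == "insert":
            out.append(arg)
            written += len(arg)
        elif op == "forward":
            offset += arg
        elif op == "backward":
            offset -= arg
        elif op == "copy":
            start = written + offset
            if start < 0 or start + arg > len(old):
                raise ValueError("copy outside the old file")
            out.append(old[start:start + arg])
            written += arg
        else:
            raise ValueError("unknown instruction: %r" % (op,))
    return "".join(out)


# Instructions for start -> V2 -> V3 -> V6 on the minimum cost path.
delta = [
    ("insert", "She"), ("forward", 7), ("copy", 8),
    ("forward", 2), ("copy", 12),
    ("insert", "he"), ("backward", 31), ("copy", 8),
]
print(repr(apply_delta(OLD, delta)))
```

The first and third groups are the instruction sequences quoted for Figure 6; the middle group (jump forward 2, copy 12) is derived here from the offsets of seg2 and seg3.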
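A further optimization is possible: after the incremental package has been constructed, it can be compressed into an incremental compressed package, which must then be decompressed to recover the incremental package when the package is applied. The well-established LZMA (Lempel-Ziv-Markov chain algorithm) compression method can be used to compress the incremental package into the incremental compressed package.
This optimization has a side effect, however. The incremental package generated from the minimum cost path is the smallest possible before compression, but the incremental compressed package obtained from it is not necessarily the smallest. The main reason is that the minimum cost path contains scattered nodes. These nodes make the constructed incremental package smaller, but the data set in the package (the collection of all data inserted by the insert instructions) then comes from scattered positions and has little internal correlation, so the incremental package compresses poorly.
The horizontal line segment graph therefore needs further optimization: some nodes on the minimum cost path are removed, which enlarges the incremental package slightly but reduces the incremental compressed package. Figure 8 shows this optimization process. Steps S801 (constructing the similar information set) to S806 (applying the incremental package) in Figure 8 correspond one to one to steps S201 to S206 in Figure 2, with the same methods and functions, and are not described again. The main difference is that after the incremental package has been compressed, the method checks whether the incremental compressed package is the smallest. If it is not, the horizontal line segment graph is optimized, the path graph, the minimum cost path, and the incremental package are constructed a second time, and the incremental package is compressed again, until the smallest incremental compressed package is finally obtained. In Figure 8 this is implemented as follows. After the incremental package has been constructed in S805, step S8051 compresses it into an incremental compressed package. The size of the incremental compressed package is then evaluated, mainly to determine whether it is the smallest (this may take several iterations). If it is the smallest, the decompression step S8061 recovers the incremental package from the incremental compressed package before S806 applies it. If it is not the smallest, step S8021 optimizes the existing horizontal line segment graph, the path graph (S803), the minimum cost path (S804), and the incremental package (S805) are constructed a second time, and the incremental package is compressed again, until the smallest incremental compressed package is finally obtained.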
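The compress-and-compare part of this loop can be sketched with Python's standard `lzma` module. The sketch assumes the candidate incremental packages have already been built from successively optimized graphs and only shows how they would be compared; the function names and the sample byte strings are hypothetical.

```python
import lzma


def compressed_size(delta):
    """Size in bytes of the LZMA-compressed incremental package."""
    return len(lzma.compress(delta))


def pick_smallest(candidates):
    """Return the candidate incremental package whose compressed form is smallest."""
    return min(candidates, key=compressed_size)


# Two hypothetical packages of equal raw size but different internal correlation.
scattered = bytes(range(256)) * 4   # many distinct byte values
correlated = b"a" * 1024            # highly repetitive data
best = pick_smallest([scattered, correlated])

packed = lzma.compress(best)        # step S8051: compress the incremental package
restored = lzma.decompress(packed)  # step S8061: recover it before applying
print(len(best), len(packed), restored == best)
```

The comparison is made on compressed size rather than raw size, which is why a slightly larger but more correlated package can win.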
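Preferably, the method used in step S8021, optimizing the horizontal line segment graph, comprises: if s-t=0 for a horizontal line segment seg_i(t,s-t), keep that segment; if a horizontal line segment seg_i(t,s-t) is scattered, cannot form an approximately horizontal straight line with the other segments around it, and has a length l less than a length threshold N, delete that segment. Preferably, the length threshold N=3, or N is another positive integer less than 10.
The main reasons for these two rules, and their technical effects, are as follows. Nodes with s-t=0 in the horizontal line segment graph are key nodes; they form the backbone of the minimum cost path and must be kept. A set of horizontal line segments distributed approximately along the same horizontal line means that the new file and the old file contain a passage that is similar but not identical. The existence of such a passage reflects the historical inheritance relationship between the new file and the old file, and exploiting this relationship is the fundamental reason an incremental package can reduce the amount of transmitted data, so nodes of this kind in the path graph are a key part of the incremental package. Removing some of the short horizontal line segments keeps the data inserted by the insert instructions from being overly dispersed, which improves the compression efficiency of the incremental package and reduces the size of the incremental compressed package.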
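The two optimization rules can be sketched as a filter over horizontal segments (t, s-t, l). The text does not quantify "scattered", so this sketch introduces a hypothetical tolerance: a segment counts as isolated when no other segment has an ordinate within `offset_tol` of its own. Both that parameter and the sample segments are assumptions made for illustration.

```python
def optimize(segments, n_min=3, offset_tol=2):
    """Filter horizontal segments given as (t, offset, l), where offset = s - t."""
    kept = []
    for i, (t, offset, l) in enumerate(segments):
        if offset == 0:
            kept.append((t, offset, l))  # rule 1: always keep s - t = 0
            continue
        # Assumed test for "forms a nearly horizontal line with its neighbors".
        has_neighbor = any(
            j != i and abs(other_offset - offset) <= offset_tol
            for j, (_, other_offset, _) in enumerate(segments)
        )
        if l < n_min and not has_neighbor:
            continue                     # rule 2: drop short, isolated segments
        kept.append((t, offset, l))
    return kept


segments = [
    (0, 0, 2),    # on the axis s - t = 0: kept even though it is short
    (10, 5, 20),  # long segment: kept
    (31, 6, 2),   # short, but close to the line of the previous segment: kept
    (50, 40, 2),  # short and isolated: removed
]
print(optimize(segments))
```

Raising `n_min` or lowering `offset_tol` removes more segments, trading a larger raw package for inserted data that is less scattered.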
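Figure 9 shows an embodiment of optimizing the horizontal line segment graph. Figure 9(a) is the horizontal line segment graph for all nodes, that is, for all identical character fragments in the similar information set. Figure 9(b) is the horizontal line segment graph for the nodes on the minimum cost path, once that path has been determined among all nodes. Figure 9(c) is the horizontal line segment graph obtained by optimizing Figure 9(b). The comparison shows that the optimization deletes horizontal line segments 91, 92, 93, 94, and 95 of Figure 9(b). What these segments have in common is that they are short, lie far from the axis s-t=0, and can hardly form a horizontal straight line with neighboring nodes.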
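To illustrate the technical effect of the invention, Figures 10 and 11 are used below to show the significant advantages of the embodiments in raising the compression ratio and reducing the time needed to apply the incremental package.
The representative tools applying the aforementioned RDIFF, VCDIFF, and BSDIFF methods are Rsync, Xdelta, and Bsdiff, respectively, and the representative tool applying the embodiment of the data incremental update method of the invention is Ddiff. The experimental samples are six software packages from the Linux, Android, and Win32 platforms. Referring to Figure 10, the experimental data give average compression ratios of 13.2%, 60.3%, 63.6%, and 69.3% for Rsync, Xdelta, Bsdiff, and Ddiff, respectively. The compression ratio is calculated as:
Compression_ratio=(ASize-BSize)/ASize
Here ASize is the size of the file before compression and BSize is the size of the file after compression. Figure 10 shows that in all six tests the compression ratio of the embodiment of the invention is better than that of the other methods. For Sample 5, all incremental update methods have low compression ratios, which indicates that the new file and the old file are not very similar. In this case Xdelta and Ddiff have relatively high compression ratios because the compression algorithms they use, LZ77 and LZMA, compress well. For Sample 3 and Sample 6, all incremental update algorithms have high compression ratios, which indicates that the new file and the old file are very similar. In this case the changes between the files are only a small amount of content modification or addition, the identical content segments shared by the two files are long and few, and an incremental update scheme is easy to generate. Xdelta, Bsdiff, and Ddiff therefore have similar compression ratios when processing Sample 3 and Sample 6. For Sample 1, Sample 2, and Sample 4, the new file and the old file are similar, but the version changes are more complex: besides content modification and addition, content blocks exchange positions and are copied with modifications, so the identical content segments shared by the two files are short and very numerous. This is the hard case that incremental update methods need to solve, and Ddiff achieves the highest compression ratio in these cases. In short, the compression ratio of the embodiment of the invention is close to that of the other methods in extreme cases and significantly better in ordinary cases.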
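The compression ratio formula is a one-line computation. The sketch below uses hypothetical sizes chosen only to show how a ratio of 69.3% arises; they are not measurements from the experiments.

```python
def compression_ratio(a_size, b_size):
    """(ASize - BSize) / ASize, with both sizes in bytes."""
    if a_size <= 0:
        raise ValueError("ASize must be positive")
    return (a_size - b_size) / a_size


# Hypothetical example: 1000 bytes before compression, 307 bytes after.
print("{:.1%}".format(compression_ratio(1000, 307)))
```

A higher ratio means less data has to be transmitted for the same update.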
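In addition, when Rsync analyzes the similarity between the new and old files it splits the files at a coarse granularity. The calculation is simple, but it does not favor an optimal incremental update scheme, so its compression ratio is significantly lower than that of the other tools.
Figure 11 shows the running time of applying the incremental package; all experiments were run on the same hardware platform. According to the experimental data, the average running times of Rsync, Xdelta, Bsdiff, and Ddiff when applying the incremental package are 7750 ms, 546.5 ms, 1153.2 ms, and 602.8 ms, respectively. In Figure 11, Rsync runs significantly longer than the other tools because its incremental package is obtained in several parts, and after applying the current part it has to wait for the next one. The new files in Sample 4 and Sample 6 are the largest; in these cases the running times of Xdelta and Ddiff are both small, because the main operation these two tools perform when generating the new file is string copying, with very few operations such as addition. The new file in Sample 1 is the smallest, and apart from Rsync the running times of the tools are close to 0. In Sample 2, Sample 3, and Sample 5, the running time of Ddiff is never the lowest, because Ddiff uses more data segments of the old file. This reduces the size of the incremental package, but the repeated jumps between read positions in the old file take more time, so the running time of applying the incremental package is slightly higher. For consumer electronic devices applying the incremental package, Xdelta, with its efficient dictionary strategy, has the shortest running time, and the embodiment of the invention, which consists of string copy and pointer jump operations, has a running time close to that of Xdelta. Bsdiff, which performs many additions, has a relatively high running time, and Rsync, whose incremental update process is unusual, has a long running time that depends strongly on the network rate.
Taking Figures 10 and 11 together: compared with the other data incremental update methods, the embodiment of the invention has the highest compression ratio and a running time for applying the incremental package that is close to the minimum, giving the best overall incremental update performance.
In this way, the data incremental update method of the invention converts the problem of generating the smallest incremental package into a shortest-path problem and generates the smallest incremental package from that path. The steps of constructing the similar information set, constructing the minimum cost path, and constructing the horizontal line segment graph are all optimized so that the final incremental compressed package is minimal, saving 69.3% of the data volume on average. Compared with prior-art data incremental update methods, the compression ratio is the highest and the running time of applying the incremental package is short. The method has a wide range of application: it is suitable not only for consumer electronic products but also for other platforms and systems.
The above describes only embodiments of the invention and does not thereby limit the patent scope of the invention. Any equivalent structural transformation made using the contents of the specification and drawings of the invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the invention.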

Claims (17)

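  1. A data incremental update method, characterized in that the method comprises the following steps:
    Step 1, constructing a similar information set Segment: comparing the characters in a new file array New with the characters in an old file array Old and outputting multiple identical character fragments segment(s,t,l), where s is the position of the character fragment in the old file array Old, t is the position of the character fragment in the new file array New, and l is the number of bytes of the character fragment, so that the similar information set Segment is:
    Segment={segment(s,t,l)|old[s+i]=new[t+i],i=0,1,2,...,l-1};
    Step 2, constructing a horizontal line segment graph: converting the character fragments segment(s,t,l) in the similar information set Segment into corresponding horizontal line segments seg_i(t,s-t) in the horizontal line segment graph, where the left endpoint of the horizontal line segment seg_i(t,s-t) has coordinates (t,s-t), its length is l, and i is the index of the horizontal line segment seg_i(t,s-t) in the horizontal line segment graph;
    Step 3, constructing a path graph: mapping the horizontal line segments seg_i(t,s-t) in the horizontal line segment graph to nodes V_i in the path graph, building node edges between the nodes V_i, and computing the edge cost of every node edge;
    Step 4, constructing a minimum cost path: there being multiple paths from a start node segment(0,0,0) through the nodes V_i in the path graph to an end node segment(newSize,0,0), computing the sum of the edge costs of the node edges on each path and taking the path with the smallest sum as the minimum cost path, where newSize is the number of bytes in the new file array New;
    Step 5, constructing an incremental package: based on the old file array Old and the new file array New, and using an instruction set and a data set, determining the instruction codes between adjacent nodes in order along the minimum cost path from the start node, the instruction codes making up the incremental package;
    Step 6, applying the incremental package: based on the old file array Old, generating the new file byte by byte according to the instruction codes in the incremental package.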
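  2. The data incremental update method according to claim 1, characterized in that the method of comparing the characters in the new file array New with the characters in the old file array Old comprises:
    first, suffix-sorting the old file array Old to obtain the suffix array I of the old file array Old;
    then, using the suffix array I, searching the old file array Old for the fragment {old[s],old[s+1],...,old[s+l-1]} that has the longest prefix match with {new[t],new[t+1],...,new[newSize-1]} of the new file array New, and outputting:
    new[t]=old[s],new[t+1]=old[s+1],...,new[t+l-1]=old[s+l-1].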
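  3. The data incremental update method according to claim 2, characterized in that constructing the similar information set Segment further comprises pruning the similar information set Segment, the pruning method comprising:
    first, if segment(s',t',l')∈Segment and segment(s″,t″,l″)∈Segment exist with s″=s'+k, t″=t'+k, l″=l'-k, and 0<k<l', removing segment(s″,t″,l″) from the similar information set Segment; and/or,
    second, the byte count l of a character fragment segment(s,t,l) having to be greater than or equal to a threshold Lmin, removing from the similar information set Segment every character fragment whose byte count l is less than the threshold Lmin; and/or,
    third, if segment(s',t',l')∈Segment and segment(s″,t″,l″)∈Segment exist with s″≠s', t″=t', and l″<l', removing segment(s″,t″,l″) from the similar information set Segment.
  4. The data incremental update method according to claim 3, characterized in that the threshold Lmin=3, or Lmin is another positive integer less than 10.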
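  5. The data incremental update method according to claim 3, characterized in that, in constructing the path graph, a node edge from a node V_x to a node V_y exists between the character fragment segment(s_x,t_x,l_x) of the node V_x and the character fragment segment(s_y,t_y,l_y) of the node V_y if and only if (t_x+l_x)<(t_y+l_y), the node V_x builds node edges to other nodes adjacent to the node V_x, and the number of node edges built does not exceed the node degree MAX_CONECTION.
  6. The data incremental update method according to claim 5, characterized in that the node degree MAX_CONECTION=3.
  7. The data incremental update method according to claim 5 or 6, characterized in that the edge cost of the node edge from the node V_x to the node V_y is computed as follows: expression bytes are formed from instruction codes and encoded data, and the number of expression bytes needed to move from the right endpoint of the horizontal line segment seg_x(t_x,s_x-t_x) of the node V_x to the right endpoint of the horizontal line segment seg_y(t_y,s_y-t_y) of the node V_y is the value of the edge cost.
  8. The data incremental update method according to claim 7, characterized in that the Dijkstra algorithm is used to compute the minimum cost path in constructing the minimum cost path.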
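  9. The data incremental update method according to claim 8, characterized in that, in constructing the incremental package, the instruction set includes the "insert", "copy", "jump forward", and "jump backward" instructions, the data set consists of the character parameters operated on by the "insert" instruction, and the instruction code includes an instruction identifier and an instruction parameter.
  10. The data incremental update method according to claim 9, characterized in that the instruction identifier occupies 2 bits with the four codes 00, 01, 10, and 11, corresponding to the "insert", "copy", "jump forward", and "jump backward" instructions.
  11. The data incremental update method according to claim 10, characterized in that the structure of the instruction code is: first byte: instruction identifier + 0 + instruction parameter; middle byte: 0 + instruction parameter; last byte: 1 + instruction parameter.
  12. The data incremental update method according to claim 11, characterized in that the instruction code is 1 byte long when the value of the instruction parameter is less than 32, 2 bytes long when the value of the instruction parameter is greater than or equal to 32 and less than 4096, and 3 bytes long when the value of the instruction parameter is greater than or equal to 4096 and less than 524288.
  13. The data incremental update method according to claim 9, characterized in that constructing the incremental package further comprises compressing the incremental package to obtain an incremental compressed package, and applying the incremental package comprises decompressing the incremental compressed package to obtain the incremental package.
  14. The data incremental update method according to claim 13, characterized in that the LZMA compression algorithm is used to compress the incremental package into the incremental compressed package.
  15. The data incremental update method according to claim 14, characterized in that, before the incremental package is compressed with the LZMA compression algorithm, the method returns to constructing the horizontal line segment graph and optimizes the horizontal line segment graph, constructs the path graph, the minimum cost path, and the incremental package a second time, and then compresses the incremental package to obtain the smallest incremental compressed package.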
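  16. The data incremental update method according to claim 15, characterized in that the method of optimizing the horizontal line segment graph comprises:
    first, if s-t=0 for the horizontal line segment seg_i(t,s-t), keeping the horizontal line segment seg_i(t,s-t); and,
    second, if the horizontal line segment seg_i(t,s-t) is scattered, cannot form an approximately horizontal straight line with the other horizontal line segments around it, and has a length l less than a length threshold N, deleting the horizontal line segment seg_i(t,s-t).
  17. The data incremental update method according to claim 16, characterized in that the length threshold N=3, or N is another positive integer less than 10.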
PCT/CN2015/073510 2015-03-02 2015-03-02 Data incremental update method WO2016138619A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/073510 WO2016138619A1 (zh) 2015-03-02 2015-03-02 Data incremental update method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/073510 WO2016138619A1 (zh) 2015-03-02 2015-03-02 Data incremental update method

Publications (1)

Publication Number Publication Date
WO2016138619A1 true WO2016138619A1 (zh) 2016-09-09

Family

ID=56849165

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/073510 WO2016138619A1 (zh) 2015-03-02 2015-03-02 Data incremental update method

Country Status (1)

Country Link
WO (1) WO2016138619A1 (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968327A (zh) * 2012-12-14 2013-03-13 沈阳美行科技有限公司 Embedded POI data incremental update method supporting incremental updating
CN103685585A (zh) * 2012-09-07 2014-03-26 中国科学院计算机网络信息中心 Highly reliable DNS data update method and system
CN104834539A (zh) * 2015-03-02 2015-08-12 倪桂强 Data incremental update method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103685585A (zh) * 2012-09-07 2014-03-26 中国科学院计算机网络信息中心 Highly reliable DNS data update method and system
CN102968327A (zh) * 2012-12-14 2013-03-13 沈阳美行科技有限公司 Embedded POI data incremental update method supporting incremental updating
CN104834539A (zh) * 2015-03-02 2015-08-12 倪桂强 Data incremental update method

Similar Documents

Publication Publication Date Title
US10715618B2 (en) Compressibility estimation for lossless data compression
CN104834539B (zh) Data incremental update method
EP1641219A2 (en) Efficient algorithm for finding candidate objects for remote differential compression
CN105207678B (zh) Hardware implementation system for an improved LZ4 compression algorithm
US20060047855A1 (en) Efficient chunking algorithm
US7623047B2 (en) Data sequence compression
US10187081B1 (en) Dictionary preload for data compression
US20050235043A1 (en) Efficient algorithm and protocol for remote differential compression
CN103975533A (zh) High-bandwidth decompression of variable-length encoded data streams
US10680645B2 (en) System and method for data storage, transfer, synchronization, and security using codeword probability estimation
US11722148B2 (en) Systems and methods of data compression
US10476519B2 (en) System and method for high-speed transfer of small data sets
JP2021527376A (ja) Data compression
US10735025B2 (en) Use of data prefixes to increase compression ratios
US11070618B2 (en) Techniques for updating files
CN103248369A (zh) FPGA-based compression system and method
CN111370064A (zh) Method and system for fast gene sequence classification using a SIMD-based hash function
Lenhardt et al. Gipfeli-high speed compression algorithm
US9287893B1 (en) ASIC block for high bandwidth LZ77 decompression
CN112559462A (zh) Data compression method, apparatus, computer device, and storage medium
WO2016138619A1 (zh) Data incremental update method
Jozsa et al. Universal quantum information compression and degrees of prior knowledge
Tiwari et al. Aggregated Deflate-RLE compression technique for body sensor network
US10762281B1 (en) Prefix compression for keyed values
Tamakoshi et al. From run length encoding to LZ78 and back again

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15883676

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.02.2018)

122 Ep: pct application non-entry in european phase

Ref document number: 15883676

Country of ref document: EP

Kind code of ref document: A1