WO2024066753A1 - Method for compressing data and related apparatus - Google Patents

Method for compressing data and related apparatus

Info

Publication number
WO2024066753A1
Authority
WO
WIPO (PCT)
Prior art keywords
file
string
information
compressed
compression
Prior art date
Application number
PCT/CN2023/111784
Other languages
English (en)
French (fr)
Inventor
王亚伟
伊利亚谢列兹尼奥夫
彼得罗琴科帕维尔
丹尼斯杰尼先科
陈绪金
Original Assignee
华为云计算技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为云计算技术有限公司
Publication of WO2024066753A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22: Indexing; Data structures therefor; Storage structures

Definitions

  • Embodiments of the present application relate to the field of information technology, and more specifically, to a method for compressing data and related devices.
  • a software package is a collection of files and directories required for a software product.
  • Software packages are usually designed and generated by application developers after the application code development is completed. Software products need to be generated into one or more packages so that they can be easily distributed and installed.
  • Object files are an important part of software packages.
  • Object files contain object codes.
  • Object codes are the codes generated by compilers or assemblers after processing source codes. Object codes usually consist of machine codes or codes close to machine languages.
  • the embodiments of the present application provide a method and related device for compressing data, which can reduce the size of a software package and improve user experience.
  • an embodiment of the present application provides a method for compressing data, comprising: determining N target files included in a software package, where N is a positive integer greater than or equal to 1; determining constant string information, where the constant string information indicates at least one constant string and a constant string identifier corresponding to each constant string in the at least one constant string, and each target file in the N target files includes the at least one constant string; determining N pieces of special string information, where the N pieces of special string information correspond one-to-one with the N target files, first special string information indicates at least one special string in a first target file and a special string identifier corresponding to each special string in the at least one special string, the first special string information is any one of the N pieces of special string information, and the first target file is the target file corresponding to the first special string information; replacing the constant strings and special strings of each target file of the N target files with the corresponding identifiers to obtain N replaced target files; and compressing the software package according to first information to be compressed, where the first information to be compressed includes the constant string information, the N pieces of special string information, and the N replaced target files.
  • the size of a software package determines the user experience. The larger the software package, the longer it takes for the user to download it; the smaller the software package, the less time the user spends downloading it.
  • the above technical solution can reduce the size of the software package by compressing the target files in the software package, so that the user can download/transfer the software package faster, thereby achieving the purpose of improving the user experience.
  • the software package may be a software package in an integrated development environment (IDE).
  • the software package may be an IDE main program installation package, or an IDE extension program installation package, etc.
  • the IDE may be a traditional IDE running on a local computer device, or a cloud IDE (which may be called an online integrated development environment or a network IDE (web IDE), etc.).
  • the software package may be a compressed file.
  • the software package may be decompressed to obtain a non-compressed file, and then the target file in the software package may be determined.
  • the method further includes: determining M non-target files in the software package, where M is a positive integer greater than or equal to 1; grouping the M non-target files to obtain second information to be compressed, where the second information to be compressed includes at least one file set, and files belonging to the same file set have the same characteristics; compressing the software package according to the first information to be compressed includes: compressing the first information to be compressed and the second information to be compressed to obtain a compressed software package.
  • Compression algorithms usually predict the next Y bits based on the X bits that precede the current bit. If the prediction fails, the algorithm retries with X-1 preceding bits, and so on, until it succeeds. Therefore, compressing similar content (files) together can improve the overall prediction accuracy, thereby shortening the compression time and improving the compression ratio. In addition, grouping files according to their similarity and using customized compression methods for specific files can further improve the compression ratio.
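  • The effect of compressing similar files together can be illustrated with a minimal sketch. It uses Python's zlib (DEFLATE) purely as a stand-in compressor, not the algorithm of the embodiments, and the two file contents and the function names are hypothetical.

```python
import zlib


def separate_size(files):
    """Total size when every file is compressed on its own."""
    return sum(len(zlib.compress(f)) for f in files)


def joint_size(files):
    """Size when all files are concatenated and compressed as one stream."""
    return len(zlib.compress(b"".join(files)))


# Two hypothetical, similar source snippets.
file_a = b"import java.util.List;\npublic class Alpha { private List<String> names; }\n"
file_b = b"import java.util.List;\npublic class Beta { private List<String> names; }\n"

print(separate_size([file_a, file_b]), joint_size([file_a, file_b]))
```

  • In the joint stream the second file can largely be encoded as references to the first, which is why grouping similar files tends to give a smaller result.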
  • the at least one file set includes a first file set, and the first file set includes at least one small file, where a small file is a non-target file among the M non-target files whose size is less than or equal to a file size threshold.
  • the multiple file sets include at least one second file set, wherein multiple non-target files belonging to the same file set have the same extension, the same encoding method, and/or the same file type.
  • the method before compressing the first information to be compressed and the second information to be compressed to obtain a compressed software package, the method further includes: determining K compression workloads, the K compression workloads corresponding one-to-one to K bit streams, each of the K bit streams including part or all of files from the same object to be compressed, wherein the object to be compressed includes the constant string information, the special string information, the replaced target file, and the file set, and K is a positive integer greater than or equal to 2; according to the K compression workloads, allocating the K bit streams to P operation units for compression, wherein the difference between the first workload and the second workload is less than a workload threshold, wherein the first workload is the sum of the workloads of the bit streams allocated to the first operation unit, and the second workload is the sum of the workloads of the bit streams allocated to the second operation unit, the first operation unit and the second operation unit are any two operation units among the P operation units, and P is a positive integer greater than or equal to 2.
  • the above technical solution allocates different bit streams to different operation units for compression based on the compression workload of the bit stream, which can further shorten the compression time.
  • the compressibility score of the i-th bit stream is the distance between an amplitude histogram of information included in the i-th bit stream and a Gaussian white noise amplitude histogram.
  • an embodiment of the present application provides a computer device, which includes a unit for implementing the first aspect or any possible implementation manner of the first aspect.
  • an embodiment of the present application provides a computer device, which includes a processor, wherein the processor is used to couple with a memory, read and execute instructions and/or program codes in the memory to execute the first aspect or any possible implementation of the first aspect.
  • an embodiment of the present application provides a chip system, which includes a logic circuit, which is used to couple with an input/output interface and transmit data through the input/output interface to execute the first aspect or any possible implementation method of the first aspect.
  • an embodiment of the present application provides a computer-readable storage medium, which stores program code.
  • When the program code stored in the computer-readable storage medium runs on a computer, it enables the computer to execute the first aspect or any possible implementation of the first aspect.
  • an embodiment of the present application provides a computer program product, the computer program product comprising computer program code; when the computer program code runs on a computer, the computer executes the first aspect or any possible implementation of the first aspect.
  • FIG. 1 is a schematic flow chart of a method for compressing data provided by an embodiment of the present application.
  • FIG. 2 is a schematic flow chart of a method for compressing data provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural block diagram of a computer device provided according to an embodiment of the present application.
  • the computer device referred to in the embodiments of the present application may be a desktop computer, a laptop computer, a tablet computer, a server, or other computer device.
  • Fig. 1 is a schematic flow chart of a method for compressing data provided by an embodiment of the present application.
  • the method shown in Fig. 1 can be executed by a computer device or a component (such as a chip or a system chip, etc.) in a computer device.
  • In step 101, determine whether the software package is compressed. If the software package is compressed, execute steps 102 and 103; if the software package is not compressed, directly execute step 103.
  • the embodiment of the present application does not limit the compression format of the software package.
  • the compression format of the software package can be jar format, zip format, rar format, etc.
  • the software package may be a software package in an integrated development environment (IDE).
  • the software package may be an IDE main program installation package, or an IDE extension program installation package, etc.
  • the file can be divided into a compressed file and an uncompressed file.
  • the embodiment of the present application does not limit the format of the compressed file.
  • the format of the compressed file can be a jar format, a zip format, a rar format, etc.
  • An uncompressed file may include a target file, and may also include any one or more files other than the compression formats such as jar, zip or rar.
  • an uncompressed file may include any one or more types of files: an executable file (e.g., a file with an extension of .exe), a library file (e.g., a file with an extension of .lib, .dll, .a or .so, etc.), a text file (e.g., a file with an extension of .txt, .doc, etc.), a sound file (e.g., a file with an extension of .mp3, .wav, .flac, etc.), a video file (e.g., a file with an extension of .mp4, .mkv, .avi, or .rmvb, etc.), or a picture file (e.g., a file with an extension of .jpg, .gif, .bmp, etc.), etc.
  • uncompressed files are referred to as program files.
  • If the software package includes a compressed file, step 104 may be executed; if the software package does not include a compressed file, step 105 may be executed.
  • Step 105 may be performed on the uncompressed files first, and then performed again after the compressed files are decompressed to obtain the uncompressed files (i.e., step 104 is performed first for those files).
  • the compressed file referred to in the embodiments of the present application may be a compressed file obtained after a single compression, or a file obtained after nested compression. If the compressed file is obtained after a single compression, then the files obtained after decompressing it are all uncompressed files. If the compressed file is the result of nested compression, then decompressing it can yield further compressed files.
  • the embodiments of the present application do not limit the number of nested layers. For example, the number of nested layers may be one layer, two layers, or more than two layers.
  • If no compressed files remain after decompression, step 105 can be executed; if compressed files remain after decompression, continue to decompress them until no compressed files remain.
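  • The repeated decompression of nested compressed files can be sketched as follows. This is an illustrative sketch only: it handles zip-based formats (zip, jar) through Python's zipfile module, and the function names and the "!/" path separator are assumptions introduced here.

```python
import io
import zipfile

# Assumption: only zip-based archive formats are recognised in this sketch.
ARCHIVE_EXTENSIONS = (".zip", ".jar")


def extract_all(archive_bytes, prefix=""):
    """Return {path: bytes} of uncompressed files, unpacking nested archives."""
    files = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            data = zf.read(info)
            path = prefix + info.filename
            if info.filename.lower().endswith(ARCHIVE_EXTENSIONS):
                # Nested compressed file: keep decompressing.
                files.update(extract_all(data, prefix=path + "!/"))
            else:
                files[path] = data
    return files


def make_zip(entries):
    """Helper for the example: build an in-memory zip from {name: bytes}."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    return buf.getvalue()
```

  • For instance, calling extract_all on a package that contains a jar inside a zip returns only the uncompressed files, with the jar's entries listed under the jar's path.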
  • the target file can be determined by the file extension.
  • common target file extensions include .obj, .o, .class, etc.
  • program files can be divided into target files and non-target files.
  • the software package contains N target files and M non-target files, where N and M are both positive integers greater than or equal to 1.
  • the non-target file may be any type of file, such as a text file, a video file, an audio file, an executable file, etc.
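  • Determining target files by extension can be sketched as below. The extension set is taken from the examples above; the function name is hypothetical.

```python
import os

# Extensions listed in the text as common target file extensions.
TARGET_EXTENSIONS = {".obj", ".o", ".class"}


def split_program_files(paths):
    """Split program file paths into (target files, non-target files)."""
    targets, non_targets = [], []
    for path in paths:
        ext = os.path.splitext(path)[1].lower()
        if ext in TARGET_EXTENSIONS:
            targets.append(path)
        else:
            non_targets.append(path)
    return targets, non_targets
```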
  • the target file is generated by compiling the source file. Since the target file is closely related to the compilation system, the metadata of the target file contains a large number of standard strings defined by the compilation system, including constant strings and special strings. Constant strings are strings that appear in every target file, while special strings are strings that appear in a certain target file.
  • N target files can have only one constant string information, and the constant string information includes a constant string that appears in each target file and a constant string identifier used to distinguish the constant string.
  • N target files have N special string information, and the N special string information corresponds to the N target files one by one.
  • the first target file is any target file among the N target files, and the first special string information is the special string information corresponding to the first target file.
  • the first special string information may include a special string contained in the first target file and a special string identifier used to distinguish the special string contained in the first target file.
  • Each special string information in the N special string information may include an identity identifier, and the identity identifier is used to indicate in which target file the special string contained in the special string information appears.
  • each constant string has a corresponding constant string identifier
  • each special string has a corresponding special string identifier.
  • the length of the constant string identifier can be X bits
  • the length of the special string identifier can be Y bits.
  • X and Y are both positive integers greater than or equal to 1. The values of X and Y can be the same.
  • the values of X and Y may be predetermined.
  • X (or Y) may be equal to 8, 12, 16, 24, etc.
  • the values of X and Y can be determined according to the number of constant character strings. For example, assuming that the number of constant character strings is Num_C, then X must be greater than or equal to the number of bits of the binary number corresponding to Num_C. For example, assume that the number of constant character strings is 80. The binary representation of 80 is 1010000, a total of seven bits, so the value of X can be any positive integer greater than or equal to 7. For example, X can be equal to 7 or 8.
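  • The lower bound on X in this example can be checked with a short sketch; the function name is hypothetical.

```python
def min_identifier_bits(num_constant_strings):
    """Smallest X such that num_constant_strings fits in an X-bit binary number."""
    if num_constant_strings < 1:
        raise ValueError("there must be at least one constant string")
    return num_constant_strings.bit_length()


print(format(80, "b"))          # 1010000
print(min_identifier_bits(80))  # 7
```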
  • the specific positions of the constant string identifier and the special string identifier can be used to distinguish whether the identifier is a constant string identifier or a special string identifier. For example, assuming that the values of X and Y are both 16, the first 8 bits can be used to distinguish between the constant string identifier and the special string identifier. For example, the first 8 bits of the constant string identifier can be 00000000, and the first 8 bits of the special string identifier can be 11111111. In this way, it can be determined whether an identifier is a constant string identifier or a special string identifier based on the first 8 bits.
  • the constant string and the special string can be determined from the metadata of the target file.
  • the metadata of the target file is stored in the target file warehouse. Therefore, the metadata of the N target files can be queried from the target file warehouse to obtain the constant string and the special string of each target file.
  • the metadata of the first target file contains the following information:
  • the metadata of the first target file indicates that the first target file contains two standard string nodes, "References" and "Namepool".
  • the metadata of the first target file only shows the constant string and special string of the node "References".
  • the node "References" contains 5 constant strings such as Methodref, corresponding to identifiers 1-5; special strings are matched by regular expressions.
  • the above regular expression "Regex:(?)#SN-)([A-Z]+[0-9]+)\b" means that the special string includes any package starting with SN.
  • Special strings include letters A to Z and numbers 0 to 9.
  • special strings need to filter any results containing the special characters '@', '.', '/'. That is, even if a string starts with SN and contains letters A to Z and numbers 0 to 9, if the string contains '@', '.', '/', then the string is not a special string.
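  • The matching and filtering rule for special strings can be sketched as follows. The regular expression printed in the text is not fully legible, so the pattern below is an assumed stand-in that only encodes the stated rule (starts with SN, followed by letters A to Z and digits 0 to 9); the names are hypothetical.

```python
import re

# Assumed stand-in pattern, NOT the expression from the metadata.
SPECIAL_PATTERN = re.compile(r"SN[A-Z]+[0-9]+")
# Any candidate containing one of these characters is filtered out.
FORBIDDEN = set("@./")


def find_special_strings(metadata_text):
    """Return special strings in order of first appearance, without duplicates."""
    found = []
    for token in metadata_text.split():
        if any(ch in FORBIDDEN for ch in token):
            continue
        if SPECIAL_PATTERN.fullmatch(token) and token not in found:
            found.append(token)
    return found
```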
  • the constant string identifier can be determined according to the identifier of the constant string in the metadata. For example, if the identifier of the constant string Methodref in the metadata of the first target file is 1, then the constant string identifier of the constant string Methodref can be 0000 0000 0000 0001, where the first 8 bits are flag bits used to distinguish constant strings from special strings, and the last 8 bits are the identifier of the constant string Methodref in the metadata.
  • the special string identifier can be determined based on the position where the special string appears in the metadata. For example, the last 8 bits of the special string identifier of the first appearing special string may correspond to a decimal number of 1, the last 8 bits of the special string identifier of the second appearing special string may correspond to a decimal number of 2, and so on.
  • the special string identifier of the 8th appearing special string may be 1111 1111 0000 1000, where the first 8 bits are a flag for distinguishing a constant string from a special string, and the last 8 bits are the position where the special string appears in the metadata (i.e., the 8th appearance).
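  • The 16-bit identifier layout used in these examples (an 8-bit flag followed by an 8-bit index) can be sketched as below. The constant and function names are hypothetical; the flag values are the ones given in the text.

```python
CONSTANT_FLAG = 0x00  # first 8 bits 00000000: constant string identifier
SPECIAL_FLAG = 0xFF   # first 8 bits 11111111: special string identifier


def make_identifier(flag, index):
    """Combine an 8-bit flag and an 8-bit index into a 16-bit identifier."""
    if not 0 <= index <= 0xFF:
        raise ValueError("index must fit in 8 bits")
    return (flag << 8) | index


def format_bits(identifier):
    """Render a 16-bit identifier as four groups of four bits."""
    bits = format(identifier, "016b")
    return " ".join(bits[i:i + 4] for i in range(0, 16, 4))


def is_special(identifier):
    """Tell special string identifiers from constant ones by the first 8 bits."""
    return (identifier >> 8) == SPECIAL_FLAG


print(format_bits(make_identifier(CONSTANT_FLAG, 1)))  # 0000 0000 0000 0001
print(format_bits(make_identifier(SPECIAL_FLAG, 8)))   # 1111 1111 0000 1000
```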
  • the special string identifier can also be determined based on the identifier of the constant string in the metadata and the position where the special string appears in the metadata. For example, the maximum value of the identifier of the constant string in the first target file is 5, then the last 8 bits of the special string identifier of the first special string that appears can correspond to the decimal number 6, the last 8 bits of the special string identifier of the second special string that appears can correspond to the decimal number 7, and so on.
  • Under this scheme, the special string identifier of the 8th special string that appears can be 1111 1111 0000 1101, where the first 8 bits are a flag for distinguishing between a constant string and a special string, and the last 8 bits correspond to the decimal number 13, that is, the maximum constant string identifier (5) plus the position at which the special string appears in the metadata (8).
  • the constant string information may include each constant string and a constant string identifier of each constant string.
  • the constant string information may include N_const constant strings and N_const constant string identifiers.
  • the N_const constant strings correspond to the N_const constant string identifiers one by one.
  • the constant string information may include each constant string, the order in which each constant string appears, and the constant string identifier of the first constant string that appears.
  • the constant string identifier of each constant string can be determined based on the constant string identifier of the first constant string that appears and the order in which the constant strings appear. For example, assuming that there are N_const constant strings in total, the constant string information may include the N_const constant strings, the constant string identifier of the first constant string that appears among the N_const constant strings, and the order in which the N_const constant strings appear.
  • the constant string identifier of the n-th constant string that appears among the N_const constant strings (where n is a positive integer greater than or equal to 2 and less than or equal to N_const) and the constant string identifier of the first constant string may satisfy the following relationship:
  • $ID_n = ID_1 + (n - 1)\,\Delta$
  • where $ID_n$ is the constant string identifier of the n-th constant string, $ID_1$ is the constant string identifier of the first constant string, and $\Delta$ is a positive integer greater than or equal to 1.
  • the special string information may include each special string and the special string identifier of each special string.
  • the special string information may include each special string, the order of occurrence of each special string and the special string identifier of the first special string that occurs.
  • For example, if the constant string identifier of the constant string Methodref is 0000 0000 0000 0001, then replace the constant string Methodref that appears in each of the N target files with 0000 0000 0000 0001.
  • Similarly, replace the special strings in the N target files with the corresponding special string identifiers.
  • For example, if the special string identifier of the special string SNAZ14389 in the first target file is 1111 1111 0000 1000, then replace every occurrence of the special string SNAZ14389 in the first target file with 1111 1111 0000 1000.
  • the target file that completes the replacement of the constant string identifier and the special string identifier can be called a replaced target file.
  • Replacing constant strings and special strings in the target file with their respective identifiers can reduce the size of the target file.
  • the constant string identifier and the special string identifier also include multiple repeated numbers, such as flags used to distinguish constant strings from special strings. Such repeated numbers can have a higher compression rate when compressed.
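  • The replacement step can be sketched as a plain byte substitution, using the two identifiers from the examples above. This is a simplified illustration: the function name is hypothetical, and a real implementation would also need a way to tell identifiers apart from ordinary file bytes when decompressing, which is omitted here.

```python
def replace_with_identifiers(content, table):
    """Replace each string in `table` (bytes -> 16-bit identifier) within `content`."""
    # Longer strings go first so a string is not broken up by one of its substrings.
    for text in sorted(table, key=len, reverse=True):
        content = content.replace(text, table[text].to_bytes(2, "big"))
    return content


table = {
    b"Methodref": 0b0000000000000001,  # constant string identifier
    b"SNAZ14389": 0b1111111100001000,  # special string identifier
}
original = b"Methodref SNAZ14389 Methodref"
replaced = replace_with_identifiers(original, table)
print(len(original), len(replaced))  # 29 8
```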
  • For the convenience of description, the constant string information, the N special string information and the N replaced target files may be collectively referred to as the first information to be compressed (first compression information).
  • a file set may be determined based on the size of the non-target files, and the file set includes all non-target files whose sizes are less than or equal to the file size threshold among the M non-target files.
  • For ease of description, non-target files whose file sizes are less than or equal to the file size threshold are referred to as small files, and non-target files whose file sizes are greater than the file size threshold are referred to as large files.
  • the file set including all small files may be referred to as file set 1. It is understandable that if the file sizes of the M non-target files are all greater than the file size threshold, then the file set 1 may not be included in the second compression information.
  • the file size threshold may be a system default or a setting.
  • the file size threshold may be less than or equal to 1024 bytes (byte, B).
  • the file size threshold may be 1024B, 1000B, 512B, 300B, 256B, 200B, 128B, or 100B, etc.
  • the file size threshold may be less than or equal to 512B.
  • the file size threshold may be 512B, 500B, 300B, 256B, 200B, 128B, or 100B, etc.
  • the file size threshold may be less than or equal to 256B.
  • the file size threshold may be 256B, 200B, 128B, or 100B, etc.
  • the files included in the file set 1 are only selected according to the file size. Therefore, the file set 1 may include files of various formats.
  • the file set 1 may include one or more of text files, library files, image files, etc.
  • Files in the same file set have the same characteristics.
  • the characteristic of all files in file set 1 is that they are less than or equal to the file size threshold.
  • the multiple file sets may further include multiple file sets 2.
  • Each file set in the multiple file sets 2 includes non-target files whose file size is greater than the file size threshold. Files belonging to the same file set 2 have the same characteristics and can be compressed using the same compression algorithm.
  • non-target files may include files of different types.
  • non-target files may include text files, video files, audio files, executable files, etc.
  • the grouping of non-target files can be grouped according to file type, and files of the same type belong to the same file set.
  • Different file sets include different types of non-target files. For example, file set 2-1 includes text files; file set 2-2 includes executable files; file set 2-3 includes audio files, file set 2-4 includes picture files, etc.
  • non-target files may be grouped according to their extensions. For example, file set 2-1 includes all files with an extension of .dll; file set 2-2 includes all files with an extension of .exe; and file set 2-3 includes all files with an extension of .txt.
  • Situation 1: files with different extensions can achieve good compression effects using the same compression algorithm, and these files with different extensions may also be of different file types.
  • Situation 2: files of the same type with different extensions can achieve better compression effects using different compression algorithms.
  • Situation 3: files with the same extension but different encoding methods can achieve better compression effects using different compression algorithms. Therefore, in some embodiments, grouping information can be preset, so that non-target files can be grouped directly according to the grouping information.
  • For example, the Lempel-Ziv-Markov chain algorithm 2 (LZMA2) has a good compression effect on files with extensions of .exe and .dll. Therefore, all files with extensions of .exe and .dll can belong to the same file set 2.
  • txt file (ASCII) indicates a text file whose encoding method is the American Standard Code for Information Interchange (ASCII); txt file (Unicode) indicates a text file whose encoding method is Unicode.
  • file set 2-1 includes files with extensions of dll and exe
  • file set 2-2 includes text files encoded in ASCII
  • file set 2-3 includes text files encoded in unicode.
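  • The grouping of non-target files described above can be sketched as follows. The 256-byte threshold is one of the example values from the text; the hard-coded grouping rules, the ASCII check used to approximate the encoding method, and the function name are assumptions made for illustration.

```python
import os
from collections import defaultdict

SIZE_THRESHOLD = 256  # bytes; one of the example thresholds in the text


def group_non_target_files(files, size_threshold=SIZE_THRESHOLD):
    """Group {name: bytes} into file set 1 (small files) and file sets 2-x."""
    groups = defaultdict(list)
    for name, data in files.items():
        ext = os.path.splitext(name)[1].lower()
        if len(data) <= size_threshold:
            key = "file set 1"
        elif ext in (".exe", ".dll"):
            key = "file set 2-1"
        elif ext == ".txt":
            # Assumption: pure-ASCII content stands for "ASCII encoded".
            key = "file set 2-2" if data.isascii() else "file set 2-3"
        else:
            key = "ungrouped " + ext
        groups[key].append(name)
    return dict(groups)
```

  • Small files are assigned first, so file set 1 can contain files of any format, as stated above.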
  • the information to be compressed in the software package includes the first information to be compressed and the second information to be compressed determined in the above steps.
  • the object to be compressed can be the constant string information, the special string information, the replaced target file and the file set.
  • Each of the N_TB objects to be compressed is the constant string information, one of the N special string information, one of the N replaced target files, file set 1, or one of the N_S file sets 2.
  • Each bit stream in the multiple bit streams includes part or all of the files in the same object to be compressed.
  • the multiple bitstreams correspond one-to-one to the objects to be compressed determined in step 107, and each bitstream includes all files in the corresponding object to be compressed.
  • the object to be compressed may be divided into multiple bitstreams, each of which has a size not exceeding the bitstream threshold.
  • one object to be compressed may correspond to multiple bitstreams, each of which contains only a portion of the files in the corresponding object to be compressed.
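  • Dividing one object to be compressed into several bitstreams can be sketched with a simple greedy packing of whole files. This is an assumption about one possible division: the function name is hypothetical, and a single file larger than the threshold is kept as its own bitstream here rather than being split.

```python
def split_into_bitstreams(file_sizes, bitstream_threshold):
    """Pack (name, size) pairs, in order, into bitstreams of limited total size."""
    streams, current, current_size = [], [], 0
    for name, size in file_sizes:
        # Start a new bitstream when adding this file would exceed the threshold.
        if current and current_size + size > bitstream_threshold:
            streams.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        streams.append(current)
    return streams
```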
  • the bitstream corresponding to the small file may not be compressed.
  • the main reason is that the compression efficiency of the small file is not good.
  • the compression ratio of the small file is not high; or, although the compression ratio of the small file is relatively high, the computing resources occupied are not worth it compared with the compression ratio of the small file.
  • a small file of 100B may only be 30B after compression, but the same computing resources can compress a file of 100 megabytes (MB) to 40MB. Therefore, the same computing resources can only save 70B of capacity by compressing a small file. Relative to a software package, the saved capacity is very small. Therefore, from the perspective of saving computing resources and improving compression efficiency, not compressing small files can complete the compression of the software package faster, and will not have a substantial impact on the final size of the software package.
  • small files may also be compressed.
  • the compression workload of the bitstream can be determined based on the compressibility score of the bitstream and the size of the bitstream.
  • the compressibility score of the bitstream is the distance between the amplitude histogram of the information included in the bitstream and the Gaussian white noise histogram.
  • the bitstream is digitally mapped (each 8bit or 1byte is mapped to the 0-255 interval) to obtain the gradient histogram of the bitstream, and then the gradient amplitude histogram is further calculated.
  • the distance between the amplitude histogram and the Gaussian white noise histogram can be the Euclidean distance, standard Euclidean distance, or Mahalanobis distance between the amplitude histogram and the Gaussian white noise histogram.
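  • A simplified sketch of such a score is shown below. It is not the computation of the embodiments: it uses a plain byte-value histogram instead of the gradient amplitude histogram, the Euclidean distance, and an assumed discretised Gaussian (mean 127.5, sigma 40) as the reference histogram. All names and parameters are hypothetical.

```python
import math


def byte_histogram(data):
    """Normalised histogram of byte values mapped to the 0-255 interval."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    total = len(data) or 1
    return [c / total for c in counts]


def gaussian_reference(mean=127.5, sigma=40.0):
    """Assumed reference histogram: a discretised Gaussian over 0-255."""
    weights = [math.exp(-((v - mean) ** 2) / (2 * sigma ** 2)) for v in range(256)]
    total = sum(weights)
    return [w / total for w in weights]


def euclidean_distance(h1, h2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))


def compressibility_score(data):
    """Distance between the data histogram and the reference histogram."""
    return euclidean_distance(byte_histogram(data), gaussian_reference())
```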
  • Comp_i is the compression workload of the i-th bit stream
  • Grd_i is the compressibility score of the i-th bit stream
  • Size_i is the size of the i-th bit stream.
  • the compression workload of the i-th bitstream may reflect the proportion of the compression time of the i-th bitstream in the total compression time of the K bitstreams. For example, if the compression workload of the i-th bitstream is 10, it means that the compression time of the i-th bitstream accounts for 10% of the total compression time of the K bitstreams.
  • the compression workload of the bitstream can be determined according to a predetermined corresponding relationship.
  • Table 2 shows the corresponding relationship between the compression workload, the bitstream size, and the compressibility score.
  • the compression workload of the bitstream is 10; if the compressibility score of the bitstream is greater than or equal to S1 and less than S2 and the bitstream size of the bitstream is greater than 10MB, then the compression workload of the bitstream is 30.
  • an operation unit is allocated to each bit stream.
  • the K bit streams can be evenly distributed to the P operation units, so that the sum of the compression workload of the bit streams distributed to different operation units is the same or similar.
  • For example, the workload of bitstream 1 is 10, the workload of bitstream 2 is 20, the workload of bitstream 3 is 30, and the workload of bitstream 4 is 40.
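  • One way to distribute such workloads evenly can be sketched with a greedy heuristic: take the bitstreams in descending order of workload and always give the next one to the least loaded operation unit. The heuristic, the choice of two operation units, and the function name are assumptions for illustration; the text does not prescribe a particular allocation algorithm.

```python
import heapq


def allocate(workloads, num_units):
    """Assign bitstreams to units; return a (total workload, bitstreams) pair per unit."""
    heap = [(0, unit, []) for unit in range(num_units)]
    heapq.heapify(heap)
    for name, load in sorted(workloads.items(), key=lambda kv: kv[1], reverse=True):
        total, unit, assigned = heapq.heappop(heap)  # least loaded unit so far
        assigned.append(name)
        heapq.heappush(heap, (total + load, unit, assigned))
    return [(total, assigned) for total, unit, assigned in sorted(heap, key=lambda t: t[1])]


workloads = {"bitstream 1": 10, "bitstream 2": 20, "bitstream 3": 30, "bitstream 4": 40}
for total, assigned in allocate(workloads, 2):
    print(total, assigned)
```

  • With the four workloads above and two operation units, each unit ends up with a total workload of 50.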
  • the computing unit may be a processor or a component (e.g., a core) in a processor.
  • a computer device may include multiple processors, each of which may be a computing unit.
  • a computer device may include a processor including multiple cores.
  • a computing unit is a core in the processor.
  • the operation unit compresses the allocated bit stream to obtain a compressed bit stream.
  • if the bitstream corresponding to file set 1 is not compressed, the bitstreams to be assembled in step 112 also include the bitstream corresponding to file set 1.
  • Fig. 2 is a schematic flow chart of a method for compressing data provided by an embodiment of the present application.
  • the method shown in Fig. 2 can be executed by a computer device or a component (such as a chip or a system chip, etc.) in a computer device.
  • the target file can be determined by the file extension.
  • common target file extensions include .obj, .o, .class, etc.
  • program files can be divided into target files and non-target files.
  • the software package contains N target files and M non-target files, where N and M are both positive integers greater than or equal to 1.
  • the non-target file may be any type of file, such as a text file, a video file, an audio file, an executable file, etc.
  • the software package is a compressed file or the software package includes one or more compressed files.
  • the compressed file can be first decompressed to obtain an uncompressed file, and then the target file in the uncompressed file can be determined. If the compressed file is nested, the nested compressed file can be decompressed after the compressed file is decompressed until there is no compressed file after decompression.
  • the constant string information is used to indicate at least one constant string and a constant string identifier corresponding to each constant string in the at least one constant string.
  • Each target file in the N target files includes the at least one constant string.
  • the N special string information corresponds to the N target files one by one. Assuming that the first string information is any special string information among the N special string information, the first target file is the target file corresponding to the first string information among the N target files.
  • the first string information can be used to indicate at least one special string in the first target file and a string identifier corresponding to each string in the at least one special string.
  • the method for determining the constant string information and the special string information may refer to the embodiment shown in FIG. 1 , and for the sake of brevity, it will not be described in detail here.
  • the replacement method of the constant character string and the special character string can refer to the embodiment shown in FIG1 , and for the sake of brevity, it will not be described in detail here.
  • the first compression information may include the constant string information, the N special string information, and the N replaced target files.
  • bitstream 1 includes the first compression information
  • bitstream 2 includes M non-target files in the software package.
  • Bitstream 1 and bitstream 2 are compressed respectively to obtain a compression result of bitstream 1 and a compression result of bitstream 2. Then, the compression result of bitstream 1 and the compression result of bitstream 2 are combined to obtain a compressed software package.
  • the non-target files may be classified first to obtain two file sets, the first file set including non-target files whose file size is less than or equal to the file size threshold; the second file set including non-target files whose file size is greater than the file size threshold.
  • three bitstreams may be determined, bitstream 1, bitstream 2 and bitstream 3.
  • Bitstream 1 includes the constant string information, the N special string information and the N replaced target files
  • bitstream 2 may include files in the first file set
  • bitstream 3 may include files in the second file set.
  • Bitstream 1 and bitstream 3 are compressed respectively to obtain the compression result of bitstream 1 and the compression result of bitstream 3.
  • Bitstream 2 (left uncompressed), the compression result of bitstream 1, and the compression result of bitstream 3 are combined to obtain a compressed software package.
  • non-target files whose file size is less than or equal to the file size threshold are not compressed, but only the decomposition and replacement results of the target file (i.e., the constant string, the N special strings and the N replaced target files) and the non-target files whose file size is greater than the file size threshold are compressed.
  • the computing resources of the computer device can be saved and the compressed software package can be obtained more quickly.
  • the size of all non-target files is less than or equal to the file size threshold.
  • the constant string, the N special strings and the N replaced target files can be compressed, and then the compression result and the non-target files are combined to obtain a compressed software package.
  • all non-target file sizes are greater than the file size threshold.
  • the non-target files, the constant string, the N special strings and the N replaced target files can be compressed, and then the compression results are combined to obtain a compressed software package.
  • the decomposition and replacement results of the target file are all compressed in one bitstream.
  • the constant string, the N special strings, and the N replaced target files may belong to different bitstreams, respectively.
  • four bitstreams may be determined, bitstream 1 to bitstream 4.
  • Bitstream 1 includes the constant string.
  • Bitstream 2 includes the N special strings.
  • Bitstream 3 includes the N replaced target files.
  • Bitstream 4 includes M non-target files.
  • bitstream 1 to bitstream 4 may be compressed respectively to obtain the compression results of bitstream 1 to bitstream 4. Then, the compression results of bitstream 1 to bitstream 4 are combined to obtain a compressed software package.
  • five bitstreams may be determined, bitstream 1 to bitstream 5.
  • Bitstream 1 includes the constant string.
  • Bitstream 2 includes the N special strings.
  • Bitstream 3 includes the N replaced target files.
  • Bitstream 4 includes the first file set.
  • Bitstream 5 includes the second file set. Then, bitstreams 1 to 3 and bitstream 5 are compressed respectively to obtain compression results of bitstream 1, compression results of bitstream 2, compression results of bitstream 3 and compression results of bitstream 5.
  • Bitstream 4 (left uncompressed), the compression result of bitstream 1, the compression result of bitstream 2, the compression result of bitstream 3, and the compression result of bitstream 5 are combined to obtain a compressed software package.
  • the compression workload of each bit stream can be determined separately, and then an operation unit is allocated to each bit stream according to the compression workload of each bit stream.
  • the method for determining the compression workload and the method for allocating the operation units can refer to the embodiment shown in FIG1 , and for the sake of brevity, they will not be described here.
  • Fig. 3 is a schematic structural block diagram of a computer device provided according to an embodiment of the present application.
  • the computer device shown in Fig. 3 includes a processing unit 301 and a compression unit 302.
  • the processing unit 301 is used to determine N target files included in the software package, where N is a positive integer greater than or equal to 1.
  • the processing unit 301 is further configured to determine constant string information, the constant string information being used to indicate at least one constant string and a constant string identifier corresponding to each constant string in the at least one constant string, each target file in the N target files including the at least one constant string.
  • the processing unit 301 is also used to determine N special string information, where the N special string information corresponds one-to-one to the N target files, the first special string information is used to indicate at least one special string in the first target file and a special string identifier corresponding to each special string in the at least one special string, the first special string information is any one of the N special string information, and the first target file is the target file corresponding to the first special string information.
  • the processing unit 301 is further configured to replace the constant character string and the special character string of each of the N target files with a corresponding identifier to obtain N replaced target files.
  • the compression unit 302 is used to compress the software package according to the first information to be compressed, where the first information to be compressed includes the constant string information, the N special string information and the N replaced target files.
  • the specific functions and beneficial effects of the processing unit 301 and the compression unit 302 can be found in the description of the above embodiments, and will not be described again here for the sake of brevity.
  • the processing unit 301 and the compression unit 302 may be implemented by a processor.
  • the embodiment of the present application further provides a computer device, including a processor and a memory.
  • the processor is used to couple with the memory, read and execute instructions and/or program codes in the memory, so as to execute the steps in the above method embodiment.
  • the processor can be a chip.
  • the processor can be a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), a graphics processing unit (GPU), a system on chip (SoC), a central processing unit (CPU), a network processor (NP), a digital signal processor (DSP), a microcontroller unit (MCU), a programmable logic device (PLD), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or another integrated chip.
  • each step of the above method can be completed by an integrated logic circuit of hardware in a processor or an instruction in the form of software.
  • the steps of the method disclosed in conjunction with the embodiment of the present application can be directly embodied as a hardware processor for execution, or a combination of hardware and software modules in a processor for execution.
  • the software module can be located in a storage medium mature in the art such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory or an electrically erasable programmable memory, a register, etc.
  • the storage medium is located in a memory, and the processor reads the information in the memory and completes the steps of the above method in conjunction with its hardware. To avoid repetition, it is not described in detail here.
  • the processor in the embodiment of the present application can be an integrated circuit chip with signal processing capabilities.
  • each step of the above method embodiment can be completed by an integrated logic circuit of hardware in the processor or an instruction in the form of software.
  • the general processor can be a microprocessor or the processor can also be any conventional processor, etc.
  • the steps of the method disclosed in the embodiment of the present application can be directly embodied as a hardware decoding processor to be executed, or a combination of hardware and software modules in the decoding processor to be executed.
  • the software module can be located in a random access memory, a flash memory, a read-only memory, a programmable read-only memory or an electrically erasable programmable memory, a register, and other mature storage media in the art.
  • the storage medium is located in a memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
  • the memory in the embodiments of the present application can be a volatile memory or a non-volatile memory, or can include both volatile and non-volatile memories.
  • the non-volatile memory can be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory.
  • the volatile memory can be a random access memory (RAM), which is used as an external cache.
  • by way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
  • the present application also provides a computer program product, which includes: a computer program code, when the computer program code is run on a computer, the computer executes each step in the above embodiments.
  • the present application also provides a computer-readable medium, which stores program code; when the program code is run on a computer, the computer is caused to execute each step in the above embodiments.
  • an embodiment of the present application provides a chip system, which includes a logic circuit, which is used to couple with an input/output interface and transmit data through the input/output interface to execute the various steps in the above embodiments.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the units is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
  • Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be through some interfaces, indirect coupling or communication connection of devices or units, which can be electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • if the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions for a computer device (which can be a personal computer, server, or network device, etc.) to perform all or part of the steps of the methods described in each embodiment of the present application.
  • the aforementioned storage media include: USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk or optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

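Embodiments of the present application provide a method for compressing data and a related apparatus. The method includes: determining N object files in a software package; determining constant string information and N pieces of special string information corresponding to the object files; replacing the constant strings and special strings in each of the N object files with corresponding identifiers to obtain N replaced object files; and compressing the software package according to the constant string information, the N pieces of special string information, and the N replaced object files. The size of a software package affects user experience: the larger the package, the longer the download takes; the smaller the package, the shorter the download. By compressing the object files in a software package, the above technical solution reduces the package size so that users can download or transfer the package faster, thereby improving user experience.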

Description

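Method for compressing data and related apparatus
This application claims priority to Russian Federation patent application No. 2022125457, entitled "Method for compressing data and related apparatus", filed with the Russian Federation Patent Office on September 29, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present application relate to the field of information technology, and more specifically, to a method for compressing data and a related apparatus.
Background
A software package is a collection of the files and directories required by a software product. Software packages are usually designed and generated by application developers after the application code has been developed. A software product needs to be built into one or more software packages so that it can be distributed and installed easily.
Object files are an important part of software packages. An object file contains object code, which is the code generated after a compiler or assembler processes source code. Object code generally consists of machine code or code close to machine language.
Commonly used compression tools do not compress object files well. How to compress object files better is therefore a problem the industry needs to solve.
Summary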
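Embodiments of the present application provide a method for compressing data and a related apparatus, which can reduce the size of a software package and improve user experience.
In a first aspect, an embodiment of the present application provides a method for compressing data, including: determining N object files included in a software package, where N is a positive integer greater than or equal to 1; determining constant string information, where the constant string information indicates at least one constant string and a constant string identifier corresponding to each of the at least one constant string, and each of the N object files includes the at least one constant string; determining N pieces of special string information, where the N pieces of special string information are in one-to-one correspondence with the N object files, first special string information indicates at least one special string in a first object file and a special string identifier corresponding to each of the at least one special string, the first special string information is any one of the N pieces of special string information, and the first object file is the object file corresponding to the first special string information; replacing the constant strings and special strings in each of the N object files with the corresponding identifiers to obtain N replaced object files; and compressing the software package according to first information to be compressed, where the first information to be compressed includes the constant string information, the N pieces of special string information, and the N replaced object files.
The size of a software package affects user experience: the larger the package, the longer the download takes; the smaller the package, the shorter the download. By compressing the object files in a software package, the above technical solution reduces the package size so that users can download or transfer the package faster, thereby improving user experience.
Optionally, the software package may be a software package in an integrated development environment (IDE). For example, the software package may be the installation package of the IDE's own main program, or the installation package of an IDE extension. The IDE may be a traditional IDE running on a local computer device, or a cloud IDE (which may also be called an online integrated development environment or a web IDE).
In scenarios such as cross-environment development with an IDE (for example, remote development or development with a cloud IDE), software packages are frequently transferred between multiple physical environments (for example, between multiple computer devices or between multiple systems). The above technical solution identifies and compresses the object files in a software package, which reduces the package size, makes the package easier to transfer, and improves user experience.
Optionally, the software package may be a compressed file. In this case, the software package may first be decompressed to obtain uncompressed files, and the object files in the software package are then determined.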
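With reference to the first aspect, in a possible implementation of the first aspect, the method further includes: determining M non-object files in the software package, where M is a positive integer greater than or equal to 1; and grouping the non-object files to obtain second information to be compressed, where the second information to be compressed includes at least one file set, and files belonging to the same file set share the same feature. The compressing the software package according to the first information to be compressed includes: compressing the first information to be compressed and the second information to be compressed to obtain a compressed software package.
A compression algorithm typically predicts the next Y bits from the X bits preceding the current bit. If the prediction fails, it retries with X-1 bits, and so on until it succeeds. Compressing similar content (files) together therefore improves overall prediction accuracy, which shortens compression time and improves the compression ratio. In addition, grouping files by similarity and applying a customized compression method to specific files can further improve the compression ratio.
With reference to the first aspect, in a possible implementation of the first aspect, the file sets include a first file set, the first file set includes at least one small file, and a small file is a non-object file among the M non-object files whose size is less than or equal to a file size threshold.
From the perspective of saving computing resources and improving compression efficiency, leaving small files uncompressed allows the package to be compressed faster without materially affecting its final size. The above technical solution places the small files among the non-object files into one group. In subsequent processing, the small files are not compressed; they are combined directly with the compressed files to obtain the compressed software package.
With reference to the first aspect, in a possible implementation of the first aspect, the file sets include at least one second file set, where the non-object files belonging to the same file set have the same extension, the same encoding, and/or the same file type.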
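With reference to the first aspect, in a possible implementation of the first aspect, before the compressing the first information to be compressed and the second information to be compressed to obtain the compressed software package, the method further includes: determining K compression workloads, where the K compression workloads are in one-to-one correspondence with K bitstreams, each of the K bitstreams includes some or all of the files from a same object to be compressed, the objects to be compressed include the constant string information, the special string information, the replaced object files, and the file sets, and K is a positive integer greater than or equal to 2; and allocating, according to the K compression workloads, the K bitstreams to P operation units for compression, where the difference between a first workload and a second workload is less than a workload threshold, the first workload is the sum of the workloads of the bitstreams allocated to a first operation unit, the second workload is the sum of the workloads of the bitstreams allocated to a second operation unit, the first operation unit and the second operation unit are any two of the P operation units, and P is a positive integer greater than or equal to 2.
Based on the compression workloads of the bitstreams, the above technical solution allocates different bitstreams to different operation units for compression, which can further shorten the compression time.
With reference to the first aspect, in a possible implementation of the first aspect, the determining K compression workloads includes: determining a compressibility score of the i-th bitstream among the K bitstreams, i = 1, …, K; and determining the i-th compression workload among the K compression workloads according to the compressibility score of the i-th bitstream and the size of the i-th bitstream.
With reference to the first aspect, in a possible implementation of the first aspect, the compressibility score of the i-th bitstream is the distance between the amplitude histogram of the information included in the i-th bitstream and the amplitude histogram of Gaussian white noise.
The more random the data, the more closely it follows a Gaussian distribution. The Gaussian white noise histogram therefore provides a convenient way to determine the randomness of a bitstream. The smaller the distance between the amplitude histogram and the Gaussian white noise histogram, the more random the bitstream, the larger the amount of compression, and the larger the corresponding workload; conversely, the larger the distance, the less random the bitstream, the smaller the amount of compression, and the smaller the corresponding workload.
With reference to the first aspect, in a possible implementation of the first aspect, the determining the i-th compression workload among the K compression workloads according to the compressibility score of the i-th bitstream and the size of the i-th bitstream includes determining the i-th compression workload according to the following formula: $\mathrm{Comp}_i = (1 - \mathrm{Grd}_i) \times \mathrm{Size}_i$, where $\mathrm{Comp}_i$ is the i-th compression workload, $\mathrm{Grd}_i$ is the compressibility score of the i-th bitstream, and $\mathrm{Size}_i$ is the size of the i-th bitstream.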
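In a second aspect, an embodiment of the present application provides a computer device, which includes units for implementing the first aspect or any possible implementation of the first aspect.
In a third aspect, an embodiment of the present application provides a computer device, which includes a processor configured to be coupled to a memory and to read and execute instructions and/or program code in the memory, so as to perform the first aspect or any possible implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a chip system, which includes a logic circuit configured to be coupled to an input/output interface and to transmit data through the input/output interface, so as to perform the first aspect or any possible implementation of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer-readable storage medium storing program code; when the program code is run on a computer, the computer is caused to perform the first aspect or any possible implementation of the first aspect.
In a sixth aspect, an embodiment of the present application provides a computer program product, which includes computer program code; when the computer program code is run on a computer, the computer is caused to perform the first aspect or any possible implementation of the first aspect.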
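Brief Description of the Drawings
Fig. 1 is a schematic flowchart of a method for compressing data according to an embodiment of the present application.
Fig. 2 is a schematic flowchart of a method for compressing data according to an embodiment of the present application.
Fig. 3 is a schematic structural block diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings.
The computer device referred to in the embodiments of the present application may be a desktop computer, a notebook computer, a tablet computer, a server, or a similar computer device.
Fig. 1 is a schematic flowchart of a method for compressing data according to an embodiment of the present application. The method shown in Fig. 1 may be executed by a computer device or by a component of a computer device (for example, a chip or a system chip).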
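101: Determine whether the software package is compressed. If the software package is compressed, perform step 102 and step 103; if it is not compressed, step 103 may be performed directly.
The embodiments of the present application do not limit the compression format of the software package. For example, the compression format may be the jar format, the zip format, the rar format, or the like.
In some embodiments, the software package may be a software package in an integrated development environment (IDE). For example, the software package may be the installation package of the IDE's own main program, or the installation package of an IDE extension.
102: Decompress the compressed software package to obtain a decompressed software package.
Depending on whether they are compressed, files can be divided into compressed files and uncompressed files. The embodiments of the present application do not limit the format of compressed files either. For example, the format of a compressed file may be the jar format, the zip format, the rar format, or the like. Uncompressed files may include object files, and may also include files of any one or more formats other than compressed formats such as jar, zip, or rar. For example, uncompressed files may include any one or more of the following types of files: executable files (for example, files with the extension .exe), library files (for example, files with the extension .lib, .dll, .a, or .so), text files (for example, files with the extension .txt or .doc), sound files (for example, files with the extension .mp3, .wav, or .flac), video files (for example, files with the extension .mp4, .mkv, .avi, or .rmvb), or image files (for example, files with the extension .jpg, .gif, or .bmp).
For ease of description, uncompressed files are referred to as program files in the embodiments of the present application.
103: Determine whether the software package includes compressed files.
In some embodiments, if the software package contains compressed files, step 104 may be performed; if the software package does not contain compressed files, step 105 may be performed.
In other embodiments, if the software package contains both compressed files and uncompressed files, step 105 may first be performed on the uncompressed files, and step 105 is then performed on the compressed files after they have been decompressed into uncompressed files (that is, after step 104 has been performed).
104: Decompress the compressed files.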
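A compressed file referred to in the embodiments of the present application may be a file obtained by a single compression, or a file obtained by nested compression. If the compressed file was obtained by a single compression, decompressing it yields only uncompressed files. If the compressed file is nested, decompressing it may yield further compressed files. The embodiments of the present application do not limit the nesting depth. For example, the nesting depth may be one level, two levels, or more than two levels.
If decompressing a compressed file yields only uncompressed files (program files), step 105 may be performed; if compressed files remain after decompression, decompression continues until no compressed file remains.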
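The nested decompression of step 104 can be sketched as a recursive unpacking routine. The following is a minimal illustration, not the method of the application: it assumes that every archive is zip-based (.zip or .jar), works entirely in memory, and uses a hypothetical `!/` separator to name files found inside nested archives. The function name `unpack_nested` is likewise an assumption.

```python
import io
import zipfile

# Assumption: only zip-based archives are handled in this sketch.
ARCHIVE_EXTENSIONS = (".zip", ".jar")


def unpack_nested(data: bytes, prefix: str = "") -> dict:
    """Unpack a zip archive recursively until no archive remains.

    Returns a mapping from file path to file content for every
    uncompressed (program) file found at any nesting depth.
    """
    files = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            payload = archive.read(info)
            path = prefix + info.filename
            if path.lower().endswith(ARCHIVE_EXTENSIONS):
                # Nested archive: keep decompressing.
                files.update(unpack_nested(payload, prefix=path + "!/"))
            else:
                files[path] = payload
    return files
```

Recursion stops naturally once a level contains no further archives, which corresponds to decompressing until no compressed file remains.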
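105: Determine the object files in the software package.
Object files can be identified by their file extension. For example, common object file extensions include .obj, .o, and .class.
Depending on whether they are object files, program files can be divided into object files and non-object files. For ease of description, it is assumed below that the software package contains N object files and M non-object files, where N and M are both positive integers greater than or equal to 1.
Non-object files may be files of any type, for example text files, video files, audio files, or executable files.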
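Step 105 can be illustrated with a short sketch that separates object files from non-object files by extension. The extension set contains only the three examples given in the text; the function name `split_object_files` is an assumption made for illustration.

```python
from pathlib import PurePosixPath

# The three example extensions named in the text; a real tool may need more.
OBJECT_EXTENSIONS = {".obj", ".o", ".class"}


def split_object_files(paths):
    """Partition file paths into (object files, non-object files)."""
    object_files, other_files = [], []
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower()
        if suffix in OBJECT_EXTENSIONS:
            object_files.append(path)
        else:
            other_files.append(path)
    return object_files, other_files
```

Here N is the length of the first list and M the length of the second.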
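106: Decompose the N object files and perform replacement to obtain constant string information, N pieces of special string information, and N replaced object files.
Object files are produced by compiling source files. Because object files are closely tied to the compilation system, the metadata of an object file contains a large number of specification strings defined by the compilation system, including constant strings and special strings. A constant string is a string that appears in every object file, whereas a special string is a string that appears in a particular object file.
It follows that the N object files may share a single piece of constant string information, which includes the constant strings that appear in every object file and the constant string identifiers used to distinguish them. The N object files have N pieces of special string information, in one-to-one correspondence with the N object files. Assume that the first object file is any one of the N object files and that the first special string information is the special string information corresponding to the first object file. The first special string information may then include the special strings contained in the first object file and the special string identifiers used to distinguish them. Each of the N pieces of special string information may contain an identity that indicates in which object file its special strings appear.
As described above, each constant string has a corresponding constant string identifier, and each special string has a corresponding special string identifier. A constant string identifier may be X bits long, and a special string identifier may be Y bits long, where X and Y are both positive integers greater than or equal to 1. X and Y may have the same value.
In some embodiments, the values of X and Y may be predetermined. For example, X (or Y) may be equal to 8, 12, 16, or 24.
In some embodiments, the values of X and Y may be determined according to the number of strings. For example, if the number of constant strings is NumC, X must be greater than or equal to the number of digits in the binary representation of NumC. Suppose the number of constant strings is 80. The binary representation of 80 is 1010000, which has seven digits. X can then be any positive integer greater than or equal to 7, for example 7 or 8.
In some embodiments, specific bit positions of constant string identifiers and special string identifiers may be used to distinguish the two kinds of identifier. For example, if X and Y are both 16, the first 8 bits may be used for this purpose: the first 8 bits of a constant string identifier may be 00000000, and the first 8 bits of a special string identifier may be 11111111. The first 8 bits then suffice to determine whether an identifier is a constant string identifier or a special string identifier.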
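The identifier layout described above can be sketched in a few lines. This is an illustration under the assumptions stated in the text (16-bit identifiers whose leading 8 bits are 00000000 for constant strings and 11111111 for special strings); the helper names `min_id_bits`, `make_id`, `is_special`, and `as_bits` are hypothetical.

```python
CONSTANT_FLAG = 0b00000000  # leading 8 bits of a constant string identifier
SPECIAL_FLAG = 0b11111111   # leading 8 bits of a special string identifier


def min_id_bits(num_strings: int) -> int:
    """Smallest identifier width, in bits, able to represent num_strings."""
    return max(1, num_strings.bit_length())


def make_id(flag: int, index: int) -> int:
    """Build a 16-bit identifier: 8 flag bits followed by an 8-bit index."""
    if not 0 <= index <= 0xFF:
        raise ValueError("index does not fit in 8 bits")
    return (flag << 8) | index


def is_special(identifier: int) -> bool:
    """Tell the two identifier kinds apart from the leading 8 bits alone."""
    return (identifier >> 8) == SPECIAL_FLAG


def as_bits(identifier: int) -> str:
    """Render an identifier as four groups of four binary digits."""
    bits = format(identifier, "016b")
    return " ".join(bits[i:i + 4] for i in range(0, 16, 4))
```

For 80 constant strings, `min_id_bits(80)` gives 7, matching the seven binary digits of 1010000 in the example above.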
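Constant strings and special strings can be determined from the metadata of the object files. The metadata of object files is stored in an object file repository. The metadata of the N object files can therefore be queried from the object file repository to obtain the constant strings and the special strings of each object file.
For example, the metadata of the first object file contains the following information:
The metadata of the first object file above indicates that the first object file contains two specification-string nodes, "References" and "Namepool". For ease of description, the metadata shows only the constant strings and special strings of the node "References". As shown above, the node "References" contains five constant strings, including Methodref, with identifiers 1 to 5; special strings are matched by a regular expression. The regular expression "Regex:(?!#SN-)([A-Z]+[0-9]+)\b" indicates that special strings include any string that begins with SN and consists of the letters A to Z and the digits 0 to 9. In addition, any result containing the special characters '@', '.', or '/' must be filtered out. In other words, even if a string begins with SN and consists of the letters A to Z and the digits 0 to 9, it is not a special string if it contains '@', '.', or '/'.
A constant string identifier can be determined from the identifier of the constant string in the metadata. For example, if the constant string Methodref has identifier 1 in the metadata of the first object file, its constant string identifier may be 0000 0000 0000 0001, where the first 8 bits are the flag bits that distinguish constant strings from special strings and the last 8 bits are the identifier of Methodref in the metadata.
In some embodiments, a special string identifier can be determined from the position at which the special string appears in the metadata. For example, the last 8 bits of the identifier of the first special string to appear may correspond to the decimal number 1, those of the second special string to appear may correspond to the decimal number 2, and so on. For example, the 8th special string to appear may have the special string identifier 1111 1111 0000 1000, where the first 8 bits are the flag bits that distinguish constant strings from special strings and the last 8 bits give the position at which the special string appears in the metadata (the 8th).
In other embodiments, a special string identifier can also be determined from both the constant string identifiers in the metadata and the position at which the special string appears in the metadata. For example, if the maximum constant string identifier in the first object file is 5, the last 8 bits of the identifier of the first special string to appear may correspond to the decimal number 6, those of the second special string to appear may correspond to the decimal number 7, and so on. For example, the 8th special string to appear may have the special string identifier 1111 1111 0000 1101, where the first 8 bits are the flag bits that distinguish constant strings from special strings and the last 8 bits (decimal 13) are the maximum constant string identifier (5) plus the position at which the special string appears (the 8th).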
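Both numbering schemes for special string identifiers can be sketched with one helper. This is an illustration only: it assumes 16-bit identifiers with the 11111111 flag prefix described earlier, and the function name `assign_special_ids` and the sample strings are hypothetical. With `max_constant_id=0` the index is the order of appearance; with a non-zero value the numbering continues after the largest constant string identifier.

```python
SPECIAL_FLAG = 0xFF  # leading 8 bits 11111111, as in the text


def assign_special_ids(special_strings, max_constant_id=0):
    """Assign identifiers to special strings by order of first appearance."""
    ids = {}
    for text in special_strings:
        if text in ids:
            continue  # repeated occurrences reuse the first identifier
        index = max_constant_id + len(ids) + 1
        if index > 0xFF:
            raise ValueError("too many strings for an 8-bit index")
        ids[text] = (SPECIAL_FLAG << 8) | index
    return ids
```

For instance, with a maximum constant string identifier of 5, the first two special strings receive the low bytes 6 and 7.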
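In some embodiments, the constant string information may include each constant string and its constant string identifier. For example, if there are Nconst constant strings in total, the constant string information may include the Nconst constant strings and Nconst constant string identifiers, in one-to-one correspondence.
In other embodiments, the constant string information may include each constant string, the order in which the constant strings appear, and the constant string identifier of the first constant string to appear. The identifier of each constant string can then be determined from the identifier of the first constant string and the order of appearance. For example, if there are Nconst constant strings in total, the constant string information may include the Nconst constant strings, the constant string identifier of the first of them to appear, and the order in which the Nconst constant strings appear. The identifier of the n-th constant string to appear (where n is a positive integer greater than or equal to 2 and less than or equal to Nconst) and the identifier of the first constant string may satisfy the following relationship:
$\mathrm{ID}_n = \mathrm{ID}_1 + n \times \alpha$ (Formula 1)
where $\mathrm{ID}_n$ is the constant string identifier of the n-th constant string to appear, $\mathrm{ID}_1$ is the constant string identifier of the first constant string, and $\alpha$ is a positive integer greater than or equal to 1. In this way, the identifier of the n-th constant string to appear can be determined from the identifier of the first constant string and the order of appearance.
Similarly, in some embodiments, the special string information may include each special string and its special string identifier. In other embodiments, the special string information may include each special string, the order in which the special strings appear, and the special string identifier of the first special string to appear.
The constant strings in the N object files are replaced with constant string identifiers. For example, if the constant string identifier of Methodref is 0000 0000 0000 0001, every occurrence of Methodref in each of the N object files is replaced with 0000 0000 0000 0001. Similarly, the special strings in the N object files are replaced with special string identifiers. For example, if the special string SNAZ14389 in the first object file has the special string identifier 1111 1111 0000 1000, every occurrence of SNAZ14389 in the first object file is replaced with 1111 1111 0000 1000. An object file in which the constant strings and special strings have been replaced with identifiers is called a replaced object file.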
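The replacement step can be sketched as a byte-level substitution. This is a simplified illustration, not the procedure of the application: it assumes the strings are stored as UTF-8 text in the file content, writes each 16-bit identifier as two big-endian bytes, and does not address how the identifiers are told apart from ordinary data during restoration. The function name `replace_strings` is hypothetical.

```python
def replace_strings(content: bytes, id_table: dict) -> bytes:
    """Replace every occurrence of each string with its 2-byte identifier.

    id_table maps a string to its 16-bit identifier. Longer strings are
    replaced first so that a string is not broken up by one of its own
    substrings.
    """
    for text in sorted(id_table, key=len, reverse=True):
        identifier = id_table[text].to_bytes(2, "big")
        content = content.replace(text.encode("utf-8"), identifier)
    return content
```

Using the two identifiers from the text, each 9-byte string shrinks to 2 bytes:

```python
table = {"Methodref": 0x0001, "SNAZ14389": 0xFF08}
replaced = replace_strings(b"call Methodref SNAZ14389 Methodref", table)
```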
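Replacing the constant strings and special strings in an object file with their identifiers reduces the size of the object file. In addition, the identifiers themselves contain repeated digits, such as the flag bits that distinguish constant strings from special strings, and such repeated digits compress well.
For ease of description, the constant string information, the N pieces of special string information, and the N replaced object files are collectively referred to as the first information to be compressed.
107: Group the M non-object files to obtain second information to be compressed, where the second information to be compressed includes multiple file sets.
In some embodiments, one file set may first be determined according to the sizes of the non-object files; this file set includes all non-object files among the M non-object files whose size is less than or equal to a file size threshold. For ease of description, non-object files whose size is less than or equal to the file size threshold are referred to below as small files, and non-object files whose size is greater than the file size threshold are referred to as large files. The file set that includes all small files may be called file set 1. It can be understood that if all M non-object files are larger than the file size threshold, the second information to be compressed need not include file set 1.
The file size threshold may be a system default or a configured value. For example, in some embodiments, the file size threshold may be less than or equal to 1024 bytes (B), for example 1024B, 1000B, 512B, 300B, 256B, 200B, 128B, or 100B. In other embodiments, the file size threshold may be less than or equal to 512B, for example 512B, 500B, 300B, 256B, 200B, 128B, or 100B. In still other embodiments, the file size threshold may be less than or equal to 256B, for example 256B, 200B, 128B, or 100B.
It can be understood that the files in file set 1 are selected by file size only. File set 1 may therefore contain files of various formats, for example one or more of text files, library files, and image files.
Files belonging to the same file set share the same feature. For example, the feature shared by all files in file set 1 is that their size is less than or equal to the file size threshold.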
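In some embodiments, the multiple file sets may also include multiple file sets 2. Every file in each file set 2 is a non-object file whose size is greater than the file size threshold. Files belonging to the same file set 2 share the same feature and can be compressed with the same compression algorithm.
As described above, non-object files may include files of different types, for example text files, video files, audio files, and executable files.
In some embodiments, non-object files may be grouped by file type, so that files of the same type belong to the same file set and different file sets contain non-object files of different types. For example, file set 2-1 includes text files, file set 2-2 includes executable files, file set 2-3 includes audio files, and file set 2-4 includes image files.
Files of different types have different extensions, and files of the same type may also have different extensions. In some embodiments, non-object files may therefore be grouped by extension. For example, file set 2-1 includes all files with the extension .dll, file set 2-2 includes all files with the extension .exe, and file set 2-3 includes all files with the extension .txt.
The following cases may also arise. Case 1: files with different extensions, possibly of different file types, compress well with the same compression algorithm. Case 2: files of the same type but with different extensions compress better with different compression algorithms. Case 3: files with the same extension but different encodings compress better with different compression algorithms. In some embodiments, grouping information may therefore be set in advance, and the non-object files are grouped directly according to this grouping information.
For example, the Lempel-Ziv-Markov chain algorithm 2 (LZMA2) compresses both .exe files and .dll files well. All files with the extension .exe and all files with the extension .dll can therefore belong to the same file set 2.
As another example, Huffman coding compresses Unicode-encoded text files well, so all Unicode-encoded text files can be placed in the same file set. The Lempel-Ziv-Welch (LZW) algorithm compresses text files encoded in the American Standard Code for Information Interchange (ASCII) well. Therefore, although both are text files, Unicode-encoded text files and ASCII-encoded text files belong to two different file sets 2 because their encodings differ.
The grouping information shown in Table 1 can be obtained from the above relationships.
Table 1
In Table 1, *.dll denotes all files with the extension dll; *.exe denotes all files with the extension exe; txt file (ASCII) denotes text files encoded in ASCII; and txt file (unicode) denotes text files encoded in Unicode.
According to the grouping information shown in Table 1, three file sets 2 can be determined: file set 2-1 includes files with the extensions dll and exe; file set 2-2 includes ASCII-encoded text files; and file set 2-3 includes Unicode-encoded text files.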
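The grouping of step 107 can be sketched as a function that maps each non-object file to a file set. This is a minimal illustration under several assumptions: the set names are invented labels, the default threshold of 1024 bytes is one of the example values from the text, and the encoding test (a text file counts as ASCII when every byte is below 128, otherwise as Unicode) is a stand-in for whatever encoding detection a real tool would use.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_key(path: str, data: bytes, threshold: int = 1024) -> str:
    """Return the (hypothetical) name of the file set a file belongs to."""
    if len(data) <= threshold:
        return "set1-small"            # file set 1: small files of any type
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in (".dll", ".exe"):
        return "set2-dll-exe"          # same algorithm suits both extensions
    if suffix == ".txt":
        # Same extension, different encoding: different file sets.
        return "set2-txt-ascii" if data.isascii() else "set2-txt-unicode"
    return "set2-other" + suffix       # fallback: group by extension


def group_files(files: dict, threshold: int = 1024) -> dict:
    """Group a mapping of path -> content into file sets."""
    groups = defaultdict(list)
    for path, data in files.items():
        groups[group_key(path, data, threshold)].append(path)
    return dict(groups)
```

Files that fall under the threshold land in file set 1 regardless of type, which is why the size check comes first.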
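108: Determine multiple bitstreams.
The information in the software package that needs to be compressed includes the first information to be compressed and the second information to be compressed determined in the above steps. For ease of description, the concept of an object to be compressed is introduced. An object to be compressed may be the constant string information, a piece of special string information, a replaced object file, or a file set. As described above, the first information to be compressed includes one piece of constant string information, N pieces of special string information, and N replaced object files. If the second information to be compressed is assumed to include one file set 1 and NS file sets 2, the software package contains a total of 1+N+N+1+NS=NTB objects to be compressed. Each of the NTB objects to be compressed is the constant string information, one of the N pieces of special string information, one of the N replaced object files, file set 1, or one of the NS file sets 2.
Each of the multiple bitstreams includes some or all of the files belonging to a same object to be compressed.
In some embodiments, the multiple bitstreams are in one-to-one correspondence with the objects to be compressed determined in steps 106 and 107, and each bitstream includes all files of the corresponding object to be compressed.
In other embodiments, if the size of an object to be compressed exceeds a bitstream threshold, the object may be divided into multiple bitstreams, each no larger than the bitstream threshold. In this case, one object to be compressed may correspond to multiple bitstreams, each containing only some of the files of that object.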
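Splitting one object to be compressed into several bitstreams can be sketched as a greedy packing of its files. This is an illustration under assumptions: the function name `split_into_bitstreams` is hypothetical, files are kept whole, and a single file larger than the threshold ends up alone in a bitstream that exceeds it (a real implementation would need a policy for that case).

```python
def split_into_bitstreams(files: dict, max_size: int) -> list:
    """Pack the files of one object into bitstreams of at most max_size bytes.

    files maps path -> content. Returns a list of bitstreams, each given
    as the list of file paths it contains, in the original file order.
    """
    streams, current, current_size = [], [], 0
    for path, data in files.items():
        if current and current_size + len(data) > max_size:
            streams.append(current)       # close the full bitstream
            current, current_size = [], 0
        current.append(path)
        current_size += len(data)
    if current:
        streams.append(current)
    return streams
```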
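109: Determine the compression workload of the bitstreams that need to be compressed.
In some embodiments, the bitstream corresponding to small files may be left uncompressed, mainly because compressing small files yields little benefit. For example, the compression ratio of a small file may be low; or the compression ratio may be high but not worth the computing resources consumed. For example, a 100B small file may shrink to 30B after compression, whereas the same computing resources could compress a 100-megabyte (MB) file to 40MB. Spending those resources on the small file saves only 70B, which is negligible relative to the size of a software package. From the perspective of saving computing resources and improving compression efficiency, leaving small files uncompressed therefore allows the package to be compressed faster without materially affecting its final size.
Of course, in other embodiments, small files may also be compressed.
For any bitstream, the compression workload can be determined according to the compressibility score of the bitstream and the size of the bitstream. The compressibility score of a bitstream is the distance between the amplitude histogram of the information included in the bitstream and the Gaussian white noise histogram. The bitstream is mapped to numbers (every 8 bits, that is, 1 byte, is mapped to the range 0-255) to obtain the gradient histogram of the bitstream, from which the gradient amplitude histogram is then calculated. The distance between the amplitude histogram and the Gaussian white noise histogram may be the Euclidean distance, the standardized Euclidean distance, the Mahalanobis distance, or the like.
The smaller the distance between the amplitude histogram and the Gaussian white noise histogram, the more random the bitstream, the larger the amount of compression, and the larger the corresponding workload; conversely, the larger the distance, the less random the bitstream, the smaller the amount of compression, and the smaller the corresponding workload.
In some embodiments, the compression workload of a bitstream can be determined according to the following formula:
$\mathrm{Comp}_i = (1 - \mathrm{Grd}_i) \times \mathrm{Size}_i$
where $\mathrm{Comp}_i$ is the compression workload of the i-th bitstream, $\mathrm{Grd}_i$ is the compressibility score of the i-th bitstream, and $\mathrm{Size}_i$ is the size of the i-th bitstream.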
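The workload formula can be sketched as follows. The `compression_workload` function implements the formula directly. The `compressibility_score` function is only a simplified stand-in: instead of the gradient amplitude histogram and Gaussian white noise histogram described in the text, it assumes a plain byte-value histogram compared against a flat reference histogram, using the Euclidean distance normalized to the range 0 to 1. Both function names are hypothetical.

```python
import math
from collections import Counter


def compressibility_score(data: bytes) -> float:
    """Simplified stand-in for Grd: distance of the byte-value histogram
    from a flat reference histogram, scaled to [0, 1].

    0 means the bytes look evenly spread (random-like); 1 means all
    bytes are identical.
    """
    if not data:
        return 1.0
    counts = Counter(data)
    reference = 1.0 / 256
    squared = sum(
        (counts.get(value, 0) / len(data) - reference) ** 2
        for value in range(256)
    )
    # sqrt(255/256) is the largest possible distance (a one-value histogram).
    return min(1.0, math.sqrt(squared) / math.sqrt(255 / 256))


def compression_workload(score: float, size: int) -> float:
    """Comp_i = (1 - Grd_i) * Size_i."""
    return (1.0 - score) * size
```

With this stand-in, a bitstream whose byte values are evenly spread gets a score near 0 and a workload close to its size, while a highly repetitive bitstream gets a score near 1 and a workload near 0, which follows the trend described above.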
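In some embodiments, the compression workload of the i-th bitstream may reflect the share of the compression time of the i-th bitstream in the total compression time of the K bitstreams. For example, a compression workload of 10 for the i-th bitstream means that its compression time accounts for 10% of the total compression time of the K bitstreams.
In other embodiments, the compression workload of a bitstream can be determined according to a predetermined correspondence. For example, Table 2 shows the correspondence among compression workload, bitstream size, and compressibility score.
Table 2
As shown in Table 2, if the compressibility score of a bitstream is greater than or equal to S1 and less than S2 and the bitstream size is less than 1000KB, the compression workload of the bitstream is 10; if the compressibility score is greater than or equal to S1 and less than S2 and the bitstream size is greater than 10MB, the compression workload of the bitstream is 30.
110: Allocate an operation unit to each bitstream according to the compression workloads of the bitstreams.
Assume that the compression workloads of K bitstreams have been determined and that P operation units are available for compression. The K bitstreams can then be distributed across the P operation units so that the sums of the compression workloads allocated to the different operation units are equal or close.
For example, the workload of bitstream 1 is 10, the workload of bitstream 2 is 20, the workload of bitstream 3 is 30, and the workload of bitstream 4 is 40, and two operation units, operation unit 1 and operation unit 2, are available for compression. Bitstream 1 and bitstream 4 can then be allocated to operation unit 1, and bitstream 2 and bitstream 3 to operation unit 2. In this way, each of the two operation units is allocated a workload of 50.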
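One way to balance the workloads in step 110 is a greedy heuristic: take the bitstreams in order of decreasing workload and always give the next one to the operation unit with the smallest total so far. The text does not prescribe a particular balancing algorithm, so this choice, along with the function name `allocate`, is an assumption made for illustration.

```python
import heapq


def allocate(workloads: dict, num_units: int) -> dict:
    """Distribute bitstreams over operation units with near-equal totals.

    workloads maps a bitstream name to its compression workload.
    Returns a mapping from unit index (0-based) to the list of
    bitstream names allocated to that unit.
    """
    heap = [(0, unit) for unit in range(num_units)]  # (total workload, unit)
    assignment = {unit: [] for unit in range(num_units)}
    ordered = sorted(workloads.items(), key=lambda item: item[1], reverse=True)
    for name, load in ordered:
        total, unit = heapq.heappop(heap)  # least-loaded unit so far
        assignment[unit].append(name)
        heapq.heappush(heap, (total + load, unit))
    return assignment
```

Applied to the example above (workloads 10, 20, 30, and 40 on two units), the heuristic pairs bitstream 4 with bitstream 1 and bitstream 3 with bitstream 2, so each unit receives a workload of 50. The greedy rule does not guarantee an optimal split in general, but it keeps the totals close.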
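In some embodiments, an operation unit may be a processor or a component of a processor (for example, a core). For example, in some embodiments, the computer device may include multiple processors, each of which may be an operation unit. In other embodiments, the computer device may include one processor with multiple cores, in which case an operation unit is one core of that processor.
111: Each operation unit compresses the bitstreams allocated to it to obtain compressed bitstreams.
112: Assemble the compressed bitstreams to obtain the compressed software package.
It can be understood that if the bitstream corresponding to file set 1 is not compressed, the bitstreams to be assembled in step 112 also include the bitstream corresponding to file set 1.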
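Steps 111 and 112 can be sketched as compressing the bitstreams concurrently and then concatenating the results, with the small-file bitstream stored as is. Everything about the container here is an assumption made for illustration: the record layout (a 1-byte raw flag, a 2-byte name length, a 4-byte body length, then the name and the body), the use of zlib for every bitstream, and the function names `assemble_package` and `unpack_package`. The thread pool lets the executor hand out work itself rather than following a workload-based allocation.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

HEADER = struct.Struct(">BHI")  # raw flag, name length, body length


def assemble_package(streams: dict, raw_names=(), workers: int = 2) -> bytes:
    """Compress bitstreams in parallel and assemble them into one package.

    streams maps a bitstream name to its bytes; bitstreams listed in
    raw_names (for example the small-file bitstream) are stored uncompressed.
    """
    names = [name for name in streams if name not in raw_names]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        bodies = list(pool.map(lambda name: zlib.compress(streams[name]), names))
    compressed = dict(zip(names, bodies))
    records = []
    for name, data in streams.items():
        is_raw = name in raw_names
        body = data if is_raw else compressed[name]
        encoded = name.encode("utf-8")
        records.append(HEADER.pack(int(is_raw), len(encoded), len(body)) + encoded + body)
    return b"".join(records)


def unpack_package(package: bytes) -> dict:
    """Inverse of assemble_package: recover the original bitstreams."""
    streams, offset = {}, 0
    while offset < len(package):
        is_raw, name_len, body_len = HEADER.unpack_from(package, offset)
        offset += HEADER.size
        name = package[offset:offset + name_len].decode("utf-8")
        offset += name_len
        body = package[offset:offset + body_len]
        offset += body_len
        streams[name] = body if is_raw else zlib.decompress(body)
    return streams
```

A round trip through `assemble_package` and `unpack_package` returns the original bitstreams, including the one that was stored without compression.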
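Fig. 2 is a schematic flowchart of a method for compressing data according to an embodiment of the present application. The method shown in Fig. 2 may be executed by a computer device or by a component of a computer device (for example, a chip or a system chip).
201: Determine the object files in the software package.
Object files can be identified by their file extension. For example, common object file extensions include .obj, .o, and .class.
Depending on whether they are object files, program files can be divided into object files and non-object files. For ease of description, it is assumed below that the software package contains N object files and M non-object files, where N and M are both positive integers greater than or equal to 1.
Non-object files may be files of any type, for example text files, video files, audio files, or executable files.
Optionally, in some embodiments, the software package is a compressed file, or the software package includes one or more compressed files. In this case, the compressed file may first be decompressed to obtain uncompressed files, and the object files among the uncompressed files are then determined. If the compressed file is nested, the nested compressed files are decompressed in turn after the outer file is decompressed, until no compressed file remains.
202: Determine constant string information.
The constant string information indicates at least one constant string and a constant string identifier corresponding to each of the at least one constant string. Each of the N object files includes the at least one constant string.
203: Determine N pieces of special string information.
The N pieces of special string information are in one-to-one correspondence with the N object files. Assume that the first special string information is any one of the N pieces of special string information and that the first object file is the object file, among the N object files, that corresponds to the first special string information. The first special string information may indicate at least one special string in the first object file and a special string identifier corresponding to each of the at least one special string.
For the method of determining the constant string information and the special string information, reference may be made to the embodiment shown in Fig. 1; for brevity, details are not repeated here.
204: Replace the constant strings and special strings in each of the N object files with the corresponding identifiers to obtain N replaced object files.
For the method of replacing constant strings and special strings, reference may be made to the embodiment shown in Fig. 1; for brevity, details are not repeated here.
205: Compress the software package according to the first information to be compressed. The first information to be compressed may include the constant string information, the N pieces of special string information, and the N replaced object files.
In the above technical solution, the constant strings and special strings in the object files are replaced with the corresponding identifiers before the replaced object files are compressed. This reduces the space occupied by special strings and constant strings in the object files and thereby improves the compression efficiency of the object files.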
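In some embodiments, two bitstreams may be determined, bitstream 1 and bitstream 2. Bitstream 1 includes the first information to be compressed, and bitstream 2 includes the M non-object files in the software package. Bitstream 1 and bitstream 2 are compressed separately to obtain the compression result of bitstream 1 and the compression result of bitstream 2. The two compression results are then combined to obtain the compressed software package.
In other embodiments, the non-object files may first be classified into two file sets: a first file set that includes the non-object files whose size is less than or equal to the file size threshold, and a second file set that includes the non-object files whose size is greater than the file size threshold. In this case, three bitstreams may be determined, bitstream 1, bitstream 2, and bitstream 3. Bitstream 1 includes the constant string information, the N pieces of special string information, and the N replaced object files; bitstream 2 may include the files in the first file set; and bitstream 3 may include the files in the second file set. Bitstream 1 and bitstream 3 are compressed separately to obtain their compression results. Bitstream 2, the compression result of bitstream 1, and the compression result of bitstream 3 are then combined to obtain the compressed software package. In other words, in this embodiment, non-object files whose size is less than or equal to the file size threshold are not compressed; only the decomposition and replacement results of the object files (that is, the constant string, the N special strings, and the N replaced object files) and the non-object files whose size is greater than the file size threshold are compressed. This saves computing resources of the computer device and produces the compressed software package faster.
In some embodiments, all non-object files are no larger than the file size threshold. In this case, only the constant string, the N special strings, and the N replaced object files are compressed, and the compression result is then combined with the non-object files to obtain the compressed software package.
In other embodiments, all non-object files are larger than the file size threshold. In this case, the non-object files, the constant string, the N special strings, and the N replaced object files are compressed, and the compression results are then combined to obtain the compressed software package.
In the above embodiments, the decomposition and replacement results of the object files (that is, the constant string, the N special strings, and the N replaced object files) are all compressed in one bitstream. In other embodiments, the constant string, the N special strings, and the N replaced object files may belong to different bitstreams. For example, in some embodiments, four bitstreams may be determined, bitstream 1 to bitstream 4. Bitstream 1 includes the constant string. Bitstream 2 includes the N special strings. Bitstream 3 includes the N replaced object files. Bitstream 4 includes the M non-object files. In this case, bitstream 1 to bitstream 4 may be compressed separately to obtain their compression results, which are then combined to obtain the compressed software package. As another example, in some embodiments, five bitstreams may be determined, bitstream 1 to bitstream 5. Bitstream 1 includes the constant string. Bitstream 2 includes the N special strings. Bitstream 3 includes the N replaced object files. Bitstream 4 includes the first file set. Bitstream 5 includes the second file set. Bitstreams 1 to 3 and bitstream 5 are then compressed separately to obtain the compression results of bitstream 1, bitstream 2, bitstream 3, and bitstream 5. Bitstream 4 and the compression results of bitstream 1, bitstream 2, bitstream 3, and bitstream 5 are combined to obtain the compressed software package.
As in the embodiment shown in Fig. 1, if there are multiple bitstreams to compress, the compression workload of each bitstream may be determined separately, and an operation unit is then allocated to each bitstream according to its compression workload. For the method of determining the compression workload and the method of allocating operation units, reference may be made to the embodiment shown in Fig. 1; for brevity, details are not repeated here.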
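Fig. 3 is a schematic structural block diagram of a computer device according to an embodiment of the present application. The computer device shown in Fig. 3 includes a processing unit 301 and a compression unit 302.
The processing unit 301 is configured to determine N object files included in a software package, where N is a positive integer greater than or equal to 1.
The processing unit 301 is further configured to determine constant string information, where the constant string information indicates at least one constant string and a constant string identifier corresponding to each of the at least one constant string, and each of the N object files includes the at least one constant string.
The processing unit 301 is further configured to determine N pieces of special string information, where the N pieces of special string information are in one-to-one correspondence with the N object files, first special string information indicates at least one special string in a first object file and a special string identifier corresponding to each of the at least one special string, the first special string information is any one of the N pieces of special string information, and the first object file is the object file corresponding to the first special string information.
The processing unit 301 is further configured to replace the constant strings and special strings in each of the N object files with the corresponding identifiers to obtain N replaced object files.
The compression unit 302 is configured to compress the software package according to first information to be compressed, where the first information to be compressed includes the constant string information, the N pieces of special string information, and the N replaced object files.
For the specific functions and beneficial effects of the processing unit 301 and the compression unit 302, reference may be made to the description in the above embodiments; for brevity, details are not repeated here.
The processing unit 301 and the compression unit 302 may be implemented by a processor.
An embodiment of the present application further provides a computer device, including a processor and a memory. The processor is configured to be coupled to the memory and to read and execute instructions and/or program code in the memory, so as to perform the steps in the above method embodiments.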
应理解,上述处理器可以是一个芯片。例如,该处理器可以是现场可编程门阵列(field programmable gate array,FPGA),可以是专用集成芯片(application specific integrated circuit,ASIC),还可以是图形处理器(graphics processing unit,GPU),还可以是系统芯片(system on chip,SoC),还可以是中央处理器(central processor unit,CPU),还可以是网络处理器(network processor,NP),还可以是数字信号处理电路(digital signal processor,DSP),还可以是微控制器(micro controller unit,MCU),还可以是可编程控制器(programmable logic device,PLD)、其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件,或其他集成芯片。
在实现过程中,上述方法的各步骤可以通过处理器中的硬件的集成逻辑电路或者软件形式的指令完成。结合本申请实施例所公开的方法的步骤可以直接体现为硬件处理器执行完成,或者用处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器,处理器读取存储器中的信息,结合其硬件完成上述方法的步骤。为避免重复,这里不再详细描述。
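It should be noted that the processor in the embodiments of this application may be an integrated circuit chip with signal processing capability. In an implementation process, the steps of the foregoing method embodiments may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed with reference to the embodiments of this application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium that is mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory; the processor reads information from the memory and completes the steps of the foregoing method in combination with its hardware.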
It can be understood that the memory in the embodiments of this application may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically EPROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM). It should be noted that the memory of the systems and methods described herein is intended to include, but is not limited to, these and any other suitable types of memory.
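According to the method provided in the embodiments of this application, this application further provides a computer program product. The computer program product includes computer program code, and when the computer program code is run on a computer, the computer is caused to perform the steps in the foregoing embodiments.
According to the method provided in the embodiments of this application, this application further provides a computer-readable medium. The computer-readable medium stores program code, and when the program code is run on a computer, the computer is caused to perform the steps in the foregoing embodiments.
According to the method provided in the embodiments of this application, an embodiment of this application provides a chip system. The chip system includes a logic circuit, and the logic circuit is configured to be coupled to an input/output interface and to transmit data through the input/output interface, so as to perform the steps in the foregoing embodiments.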
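A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the particular application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such an implementation should not be considered to go beyond the scope of this application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments. Details are not repeated here.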
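In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.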
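If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.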
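The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement that a person skilled in the art can readily think of within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.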

Claims (19)

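  1. A method for compressing data, characterized by comprising:
    determining N target files included in a software package, where N is a positive integer greater than or equal to 1;
    determining constant string information, where the constant string information indicates at least one constant string and a constant string identifier corresponding to each of the at least one constant string, and each of the N target files includes the at least one constant string;
    determining N pieces of special string information, where the N pieces of special string information are in one-to-one correspondence with the N target files, first special string information indicates at least one special string in a first target file and a special string identifier corresponding to each of the at least one special string, the first special string information is any one of the N pieces of special string information, and the first target file is the target file corresponding to the first special string;
    replacing the constant strings and special strings in each of the N target files with the corresponding identifiers, to obtain N replaced target files;
    compressing the software package according to first to-be-compressed information, where the first to-be-compressed information includes the constant string information, the N pieces of special string information, and the N replaced target files.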
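  2. The method according to claim 1, characterized in that the method further comprises: determining M non-target files in the software package, where M is a positive integer greater than or equal to 1;
    grouping the non-target files to obtain third to-be-compressed information, where the third to-be-compressed information includes at least one file set, and files belonging to the same file set have the same feature;
    the compressing the software package according to the first to-be-compressed information comprises:
    compressing the first to-be-compressed information and the second to-be-compressed information to obtain a compressed software package.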
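  3. The method according to claim 2, characterized in that the plurality of file sets include a first file set that includes at least one small file, where the small file is a non-target file among the M non-target files whose size is less than or equal to a file size threshold.
  4. The method according to claim 2 or 3, characterized in that the plurality of file sets include at least one second file set, where multiple non-target files belonging to the same file set have the same extension, the same encoding, and/or the same file type.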
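  5. The method according to any one of claims 2 to 4, characterized in that, before the compressing the first to-be-compressed information and the second to-be-compressed information to obtain the compressed software package, the method further comprises:
    determining K compression workloads, where the K compression workloads are in one-to-one correspondence with K bitstreams, each of the K bitstreams includes some or all of the files from a same to-be-compressed object, the to-be-compressed object includes the constant string information, the special string information, the replaced target files, and the file set, and K is a positive integer greater than or equal to 2;
    allocating, according to the K compression workloads, the K bitstreams to P computing units for compression, where the difference between a first workload and a second workload is less than a workload threshold, the first workload is the sum of the workloads of the bitstreams allocated to a first computing unit, the second workload is the sum of the workloads of the bitstreams allocated to a second computing unit, the first computing unit and the second computing unit are any two of the P computing units, and P is a positive integer greater than or equal to 2.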
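  6. The method according to claim 5, characterized in that the determining K compression workloads comprises:
    determining a compressibility score of the i-th bitstream among the K bitstreams, i = 1, …, K;
    determining the i-th compression workload among the K compression workloads according to the compressibility score of the i-th bitstream and the size of the i-th bitstream.
  7. The method according to claim 6, characterized in that the compressibility score of the i-th bitstream is the distance between the amplitude histogram of the information included in the i-th bitstream and the amplitude histogram of Gaussian white noise.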
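  8. The method according to claim 6 or 7, characterized in that the determining the i-th compression workload among the K compression workloads according to the compressibility score of the i-th bitstream and the size of the i-th bitstream comprises:
    determining the i-th compression workload according to the following formula:
    Comp_i = (1 - Grd_i) × Size_i
    where Comp_i is the i-th compression workload, Grd_i is the compressibility score of the i-th bitstream, and Size_i is the size of the i-th bitstream.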
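  9. A computer device, characterized by comprising:
    a processing unit, configured to determine N target files included in a software package, where N is a positive integer greater than or equal to 1;
    the processing unit is further configured to determine constant string information, where the constant string information indicates at least one constant string and a constant string identifier corresponding to each of the at least one constant string, and each of the N target files includes the at least one constant string;
    the processing unit is further configured to determine N pieces of special string information, where the N pieces of special string information are in one-to-one correspondence with the N target files, first special string information indicates at least one special string in a first target file and a special string identifier corresponding to each of the at least one special string, the first special string is any one of the N special strings, and the first target file is the target file corresponding to the first special string;
    the processing unit is further configured to replace the constant strings and special strings in each of the N target files with the corresponding identifiers, to obtain N replaced target files;
    a compression unit, configured to compress the software package according to first to-be-compressed information, where the first to-be-compressed information includes the constant string information, the N pieces of special string information, and the N replaced target files.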
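  10. The computer device according to claim 9, characterized in that the processing unit is further configured to determine M non-target files in the software package and group the non-target files to obtain third to-be-compressed information, where the third to-be-compressed information includes at least one file set, M is a positive integer greater than or equal to 1, and files belonging to the same file set have the same feature;
    the compression unit is specifically configured to compress the first to-be-compressed information and the second to-be-compressed information to obtain a compressed software package.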
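  11. The computer device according to claim 10, characterized in that the plurality of file sets include a first file set that includes at least one small file, where the small file is a non-target file among the M non-target files whose size is less than or equal to a file size threshold.
  12. The computer device according to claim 10 or 11, characterized in that the plurality of file sets include at least one second file set, where multiple non-target files belonging to the same file set have the same extension, the same encoding, and/or the same file type.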
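  13. The computer device according to any one of claims 10 to 12, characterized in that the processing unit is further configured to determine K compression workloads, where the K compression workloads are in one-to-one correspondence with K bitstreams, each of the K bitstreams includes some or all of the files from a same to-be-compressed object, the to-be-compressed object includes the constant string information, the special string information, the replaced target files, and the file set, and K is a positive integer greater than or equal to 2;
    and to allocate, according to the K compression workloads, the K bitstreams to P computing units for compression, where the difference between a first workload and a second workload is less than a workload threshold, the first workload is the sum of the workloads of the bitstreams allocated to a first computing unit, the second workload is the sum of the workloads of the bitstreams allocated to a second computing unit, the first computing unit and the second computing unit are any two of the P computing units, and P is a positive integer greater than or equal to 2.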
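  14. The computer device according to claim 13, characterized in that the processing unit is specifically configured to determine a compressibility score of the i-th bitstream among the K bitstreams, i = 1, …, K;
    and to determine the i-th compression workload among the K compression workloads according to the compressibility score of the i-th bitstream and the size of the i-th bitstream.
  15. The computer device according to claim 14, characterized in that the compressibility score of the i-th bitstream is the distance between the amplitude histogram of the information included in the i-th bitstream and the amplitude histogram of Gaussian white noise.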
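  16. The computer device according to claim 14 or 15, characterized in that the processing unit is specifically configured to determine the i-th compression workload according to the following formula:
    Comp_i = (1 - Grd_i) × Size_i
    where Comp_i is the i-th compression workload, Grd_i is the compressibility score of the i-th bitstream, and Size_i is the size of the i-th bitstream.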
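  17. A computer device, characterized by comprising: a processor, where the processor is configured to be coupled to a memory, and to read and execute instructions and/or program code in the memory, so as to perform the method according to any one of claims 1-8.
  18. A chip system, characterized by comprising: a logic circuit, where the logic circuit is configured to be coupled to an input/output interface and to transmit data through the input/output interface, so as to perform the method according to any one of claims 1-8.
  19. A computer-readable medium, characterized in that the computer-readable medium stores program code, and when the computer program code is run on a computer, the computer is caused to perform the method according to any one of claims 1-8.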
PCT/CN2023/111784 2022-09-29 2023-08-08 Method for compressing data and related apparatus WO2024066753A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
RU2022125457 2022-09-29
RU2022125457 2022-09-29

Publications (1)

Publication Number Publication Date
WO2024066753A1 (zh) 2024-04-04

Family

ID=90475975

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/111784 WO2024066753A1 (zh) 2022-09-29 2023-08-08 Method for compressing data and related apparatus

Country Status (1)

Country Link
WO (1) WO2024066753A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118509485A (zh) * 2024-07-17 2024-08-16 杭州新中大科技股份有限公司 Method, apparatus, device, medium, and product for processing transmitted data

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
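CN103023511A (zh) * 2012-12-05 2013-04-03 云之朗科技有限公司 Compression encoding method and apparatus for an application
US20150188565A1 (en) * 2012-09-21 2015-07-02 Fujitsu Limited Compression device, compression method, and recording medium
CN105846825A (zh) * 2015-01-30 2016-08-10 富士通株式会社 Compression method, decompression method, compression apparatus, and decompression apparatus
CN109298940A (zh) * 2018-09-28 2019-02-01 考拉征信服务有限公司 Computing task allocation method and apparatus, electronic device, and computer storage medium
CN111683046A (zh) * 2020-04-29 2020-09-18 平安国际智慧城市科技股份有限公司 Method, apparatus, device, and storage medium for file compression and retrieval
CN114463068A (zh) * 2022-02-11 2022-05-10 麒麟合盛网络技术股份有限公司 Data processing method and apparatus
CN114579571A (zh) * 2022-03-01 2022-06-03 珠海金山数字网络科技有限公司 Data processing method and apparatus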

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23869999

Country of ref document: EP

Kind code of ref document: A1