CN113641643A - File writing method and device - Google Patents

Info

Publication number
CN113641643A
Authority
CN
China
Prior art keywords
file
written
read
data block
original data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110751158.2A
Other languages
Chinese (zh)
Inventor
李文坦
汪翔
沈春辉
杨成虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Innovation Co
Original Assignee
Alibaba Singapore Holdings Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Singapore Holdings Pte Ltd filed Critical Alibaba Singapore Holdings Pte Ltd
Priority to CN202110751158.2A priority Critical patent/CN113641643A/en
Publication of CN113641643A publication Critical patent/CN113641643A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/174Redundancy elimination performed by the file system
    • G06F16/1744Redundancy elimination performed by the file system using compression, e.g. sparse files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/13File access structures, e.g. distributed indices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/16File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

One or more embodiments of the present disclosure provide a file writing method and apparatus. When a file is written, the file to be written is split into a plurality of original data blocks, each original data block is compressed, and the resulting compressed data blocks are written to obtain a written file. Index information is generated for the written file; the index information includes data representing the position of each original data block in the file to be written and data representing the position of each compressed data block in the written file. When data is read, the compressed data block or blocks to which the data to be read belongs are determined according to the index information, and the determined compressed data blocks are decompressed to obtain the data to be read. The data to be read can therefore be obtained by decompressing only the relevant compressed data blocks instead of the whole file, which makes data reading more flexible and improves file reading efficiency.

Description

File writing method and device
Technical Field
One or more embodiments of the present disclosure relate to the field of computer application technologies, and in particular, to a method and an apparatus for writing a file.
Background
In most computer systems, data is stored in the form of files. To use the storage space of a computer system more efficiently, files are usually compressed before being stored.
In the related art, when a file is written, the file is the minimum compression unit: the entire file is input into a compression algorithm to obtain a compressed file, and the compressed file is written to a designated location. Consequently, when the file needs to be read, the whole file must be decompressed first. In some application scenarios, however, only certain specified portions of the data in a file need to be read rather than the entire file content. With the existing compression technology the whole file still has to be decompressed, which results in low reading efficiency and poor flexibility.
Disclosure of Invention
In view of this, one or more embodiments of the present disclosure provide a file writing method and apparatus.
According to a first aspect of one or more embodiments of the present specification, there is provided a file writing method, including:
determining a file to be written;
splitting the file to be written according to a preset splitting algorithm to obtain a plurality of original data blocks;
respectively compressing the plurality of original data blocks to obtain a plurality of compressed data blocks respectively corresponding to the plurality of original data blocks;
writing the obtained compressed data blocks to obtain a written file;
generating index information aiming at the written file, and performing associated storage on the generated index information and the written file; wherein the index information includes: original data block position information and compressed data block position information, wherein the original data block position information is used for representing the position of each original data block in the file to be written, and the compressed data block position information is used for representing the position of each compressed data block in the written file.
According to a second aspect of one or more embodiments of the present specification, a file reading method is provided, configured to read data of a file to be read, where the file to be read is written according to the file writing method described in the first aspect of the embodiments of the present specification; the method comprises the following steps:
determining a file to be read and the position of data to be read in the file to be written corresponding to the file to be read;
determining at least one compressed data block corresponding to the data to be read according to the position of the data to be read in the file to be written corresponding to the file to be read, and the position information of the original data block and the position information of the compressed data block included in the index information of the file to be read;
and decompressing the determined at least one compressed data block, and outputting the data to be read according to the position of the data to be read in the file to be written corresponding to the file to be read and the position information of the original data block.
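The reading steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `write_blocks` and `read_range`, the dictionary layout of the index, and the use of `zlib` are all assumptions made for the example. The index stores the first-byte offset of each original data block and of each compressed data block, and a binary search locates the blocks that cover the requested range.

```python
import bisect
import zlib


def write_blocks(data: bytes, block_len: int):
    """Hypothetical writer: compress fixed-size blocks and record their offsets."""
    written = bytearray()
    index = {"original_offsets": [], "compressed_offsets": [],
             "original_size": len(data)}
    for start in range(0, len(data), block_len):
        index["original_offsets"].append(start)         # position in file to be written
        index["compressed_offsets"].append(len(written))  # position in written file
        written += zlib.compress(data[start:start + block_len])
    index["written_size"] = len(written)
    return bytes(written), index


def read_range(written: bytes, index: dict, pos: int, length: int) -> bytes:
    """Return data[pos:pos+length] by decompressing only the covering blocks."""
    orig = index["original_offsets"]
    comp = index["compressed_offsets"]
    end = min(pos + length, index["original_size"])
    if pos < 0 or pos >= end:
        return b""
    # Index of the block containing the first and the last requested byte.
    first = bisect.bisect_right(orig, pos) - 1
    last = bisect.bisect_right(orig, end - 1) - 1
    plain = bytearray()
    for i in range(first, last + 1):
        c_start = comp[i]
        c_end = comp[i + 1] if i + 1 < len(comp) else index["written_size"]
        plain += zlib.decompress(written[c_start:c_end])
    # Cut the requested bytes out of the decompressed blocks.
    skip = pos - orig[first]
    return bytes(plain[skip:skip + (end - pos)])
```

A read that falls inside a single block decompresses one block only; a read that straddles a block boundary decompresses the two adjacent blocks.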
According to a third aspect of one or more embodiments of the present specification, there is provided a file writing apparatus, the apparatus including:
the to-be-written file determining module is used for determining a to-be-written file;
a to-be-written file splitting module, configured to split the file to be written according to a preset splitting algorithm to obtain a plurality of original data blocks;
the original data block compression module is used for respectively compressing the plurality of original data blocks to obtain a plurality of compressed data blocks respectively corresponding to the plurality of original data blocks;
the compressed data block writing module is used for executing writing operation on the obtained plurality of compressed data blocks to obtain written files;
the index information generation module is used for generating index information aiming at the written file and storing the generated index information and the written file in a correlation manner; wherein the index information includes: original data block position information and compressed data block position information, wherein the original data block position information is used for representing the position of each original data block in the file to be written, and the compressed data block position information is used for representing the position of each compressed data block in the written file.
According to a fourth aspect of the embodiments of the present specification, there is provided a file reading apparatus, configured to read data of a file to be read, where the file to be read is written according to the file writing method described in the first aspect of the embodiments of the present specification; the device comprises:
a to-be-read file determining module, configured to determine a file to be read and the position of data to be read in the file to be written corresponding to the file to be read;
the compressed data block determining module is used for determining at least one compressed data block corresponding to the data to be read according to the position of the data to be read in the file to be written corresponding to the file to be read, and the original data block position information and the compressed data block position information which are included in the index information of the file to be read;
and the data output module is used for decompressing the determined at least one compressed data block and outputting the data to be read according to the position of the data to be read in the file to be written corresponding to the file to be read and the position information of the original data block.
According to a fifth aspect of embodiments herein, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement a method according to the first or second aspect of embodiments herein.
According to a sixth aspect of embodiments herein, there is provided an electronic apparatus comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor implements the method according to the first aspect or the second aspect of the embodiments of the present specification by executing the executable instructions.
According to a seventh aspect of embodiments herein, there is provided a computer program which, when executed, implements a method as described in the first or second aspect of embodiments herein.
In one or more embodiments of the present specification, when a file is written, the file to be written is split into a plurality of original data blocks, each original data block is compressed, and the resulting compressed data blocks are written to obtain a written file. Index information is generated for the written file; the index information includes data representing the position of each original data block in the file to be written and data representing the position of each compressed data block in the written file. When data is read, the compressed data block or blocks to which the data to be read belongs are determined according to the index information, and the determined compressed data blocks are decompressed to obtain the data to be read. The data to be read can therefore be obtained by decompressing only the relevant compressed data blocks instead of the whole file, which makes data reading more flexible and improves file reading efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the specification.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present specification and together with the description, serve to explain the principles of the specification.
FIG. 1 is a flowchart illustrating a file writing method according to an exemplary embodiment of the present specification.
FIG. 2 is a schematic diagram illustrating the correspondence between a file to be written and a written file according to an embodiment of the present specification.
FIG. 3 is a flowchart illustrating a file reading method according to an exemplary embodiment of the present specification.
FIG. 4 is a schematic diagram illustrating a file reading method according to an embodiment of the present specification.
FIG. 5 is a flowchart illustrating a file writing method according to an embodiment of the present specification.
FIG. 6 is a flowchart illustrating a file reading method according to an embodiment of the present specification.
FIG. 7 is a block diagram of a file writing apparatus according to an exemplary embodiment of the present specification.
FIG. 8 is a block diagram of a file reading apparatus according to an exemplary embodiment of the present specification.
FIG. 9 is a hardware configuration diagram of a computer device in which a file writing apparatus or a file reading apparatus is located, according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of one or more embodiments of the specification, as detailed in the claims which follow.
It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described herein. In some other embodiments, the method may include more or fewer steps than those described herein. Moreover, a single step described in this specification may be broken down into multiple steps for description in other embodiments; multiple steps described in this specification may be combined into a single step in other embodiments.
In most computer systems, data is stored in the form of files. A file typically comprises a file identity (typically a file name) for distinguishing between different files and data comprised in the file. In the related art, in order to save the limited storage space and more fully utilize the storage space of the computer system, files are generally compressed and then stored.
Current file compression methods generally use the file as the minimum compression unit: the whole file is input into the compression algorithm to obtain the compressed file, so all the data of the file must be decompressed together when it is read. In some cases, for example in a database system, reading a file does not require all of its contents; only the data at a specified position in the file before compression needs to be read. Existing compression methods cannot read the file flexibly according to such a requirement. Content that does not need to be read is decompressed as well, which is redundant work and lowers file reading efficiency.
To solve the above problems, the present specification provides a file writing method. When a file is written, the file to be written is split into a plurality of original data blocks, each original data block is compressed, and the resulting compressed data blocks are written to obtain a written file. Index information is generated for the written file; the index information includes data representing the position of each original data block in the file to be written and data representing the position of each compressed data block in the written file. When data is read, the compressed data block or blocks to which the data to be read belongs are determined according to the index information, and the determined compressed data blocks are decompressed to obtain the data to be read. The data to be read can therefore be obtained by decompressing only the relevant compressed data blocks instead of the whole file, which makes data reading more flexible and improves file reading efficiency.
Next, a file writing method provided in this specification will be described in detail.
FIG. 1 is a flowchart of a file writing method according to an exemplary embodiment. The method includes the following steps:
Step 101, determining a file to be written.
The determining of the file to be written may be determining identification information of the file to be written, or acquiring all contents of the file to be written. In other words, in this embodiment of the present specification, the content of the file to be written may be acquired while performing the compressed writing operation, or after the entire content of the file to be written is acquired, the compressed writing operation may be started on the file to be written.
Step 103, splitting the file to be written according to a preset splitting algorithm to obtain a plurality of original data blocks.
Splitting according to a preset splitting algorithm may include deciding, based on the size of the file to be written, whether to split it at all; for example, the file is split when its size exceeds a preset file size threshold. When the file is split, the original data blocks may be the same size or different sizes. When the original data blocks of one file to be written are the same size, the original data blocks of different files to be written may or may not share that size; this may be set according to actual requirements and is not limited in this specification. It should be noted that the file writing method and the file reading method provided in this specification improve reading efficiency and enable flexible reading only when the number of original data blocks is greater than or equal to 2.
To keep the index information as small as possible, the length of the original data block may be preset. In that case, as long as the original data block length and the number of original data blocks are stored in the index information, the offsets of the first byte and the last byte of every original data block in the file to be written can be derived, and hence the position of each original data block in the file to be written. Specifically, the method further comprises: acquiring a set original data block length. Splitting the file to be written according to a preset splitting algorithm to obtain a plurality of original data blocks, and compressing the plurality of original data blocks respectively to obtain a plurality of compressed data blocks respectively corresponding to them, then comprises: repeatedly executing the following step until all data of the file to be written has been compressed: when the size of the input, not yet compressed data of the file to be written reaches the set original data block length, or when all data of the file to be written has been input, taking the uncompressed data as one original data block and compressing it. With this method, the data of the file to be written is compressed while it is being input, which further improves file compression and writing efficiency. It should also be noted that, although an original data block length is set, not every original data block necessarily reaches that length: the last original data block may be shorter than the set original data block length.
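The compress-while-inputting loop described above can be sketched as follows. The function name `compress_stream`, its parameters, the index field names, and the choice of `zlib` are assumptions for illustration only. Input arrives in small pieces; whenever the buffered uncompressed data reaches the set original data block length, one block is cut off and compressed, and whatever remains when the input ends becomes the (possibly shorter) last block.

```python
import io
import zlib


def compress_stream(src, dst, block_len=64 * 1024, read_size=8192):
    """Compress src into dst block by block while input is still arriving.

    src and dst are binary file-like objects; dst must support tell().
    Returns a small index: block length, block count, sizes, and the
    first-byte offset of each compressed block in dst.
    """
    compressed_offsets = []
    buf = bytearray()
    total = 0

    def flush(block):
        compressed_offsets.append(dst.tell())
        dst.write(zlib.compress(bytes(block)))

    while True:
        piece = src.read(read_size)
        if not piece:                      # all data has been input
            break
        buf += piece
        total += len(piece)
        while len(buf) >= block_len:       # a full original data block is ready
            flush(buf[:block_len])
            del buf[:block_len]
    if buf:                                # last block may be shorter
        flush(buf)
    return {
        "block_len": block_len,
        "block_count": len(compressed_offsets),
        "original_size": total,
        "compressed_offsets": compressed_offsets,
    }
```

For example, 2500 bytes of input with `block_len=1000` yields three blocks of 1000, 1000, and 500 original bytes.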
Not every input file to be written has to be compressed; the compressed writing operation may be performed on a file to be written only when a compression condition is met. Specifically, splitting the file to be written according to a preset splitting algorithm to obtain a plurality of original data blocks includes: when the compression condition is met, splitting the file to be written according to the preset splitting algorithm to obtain a plurality of original data blocks.
Further, when the compression condition is not satisfied, the file may be written directly without compression. Specifically, the method further comprises: when the compression condition is not met, directly writing the data in the file to be written into the written file; and generating index information for the written file, wherein the generated index information indicates that the written file is not compressed, and storing the generated index information in association with the written file. When the file is read, the index information makes it possible to determine whether the file was compressed, so that the correct data can be presented to the user. Alternatively, when the compression condition is not satisfied, a notification that the compression condition is not satisfied may be sent to the party that issued the file write instruction, or the write operation on the file to be written may not be performed.
The compression condition may be that the user designates the file for compression, or whether to compress may be decided comprehensively from the available storage space of the magnetic disk or hard disk, the size of the file to be written, and the utilization of the disk over a past period. For example, if the disk currently has ample available storage space, the file to be written is small, and disk utilization has been low for a long time, the file may be written directly to improve writing efficiency; if the disk currently has ample available storage space but the file to be written is large and disk utilization has been consistently high for a long time, the file may be compressed before being written so that more files can be stored in the limited disk space.
In addition, not all files to be written are suitable for compression. For example, if a file to be written was already compressed before it was input, compressing it again reduces its size only to a limited extent and wastes processing resources of the computer system. Moreover, if the file to be written contains a large amount of data, compressing it reduces the overall file writing efficiency. Therefore, the embodiments of this specification also support deciding, according to a predicted compression rate, whether to compress a file before writing it. Specifically, the compression condition includes that the predicted compression rate of the file to be written is less than a preset compression rate threshold.
The predicted compression rate may be obtained by compressing one original data block of the file to be written, calculating the compression rate of that block, and using it as the predicted compression rate of the file to be written. Alternatively, the average compression rate of files in the same file format over a statistical period may be determined from the file format of the file to be written and used as the predicted compression rate. Other ways of obtaining the predicted compression rate are also possible, and this specification does not limit them.
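The first prediction approach, compressing one original data block and using its compression rate for the whole file, might look like the sketch below. It assumes that "compression rate" means compressed size divided by original size (so a smaller value is better), which matches the "less than a threshold" condition but is not stated explicitly in the text. The function names, the 0.9 threshold, and the use of `zlib` are illustrative assumptions.

```python
import os
import zlib


def predicted_compression_rate(data: bytes, block_len: int = 4096) -> float:
    """Estimate the rate from the first original data block only.

    Assumption: rate = compressed size / original size.
    """
    sample = data[:block_len]
    if not sample:
        return 1.0                      # nothing to gain on an empty file
    return len(zlib.compress(sample)) / len(sample)


def should_compress(data: bytes, threshold: float = 0.9,
                    block_len: int = 4096) -> bool:
    """Compression condition: predicted rate below a preset threshold."""
    return predicted_compression_rate(data, block_len) < threshold


# Repetitive text compresses well; random bytes stand in for data that
# was already compressed before being input.
text_like = b"key=value;" * 2000
noise = os.urandom(8192)
```

With these inputs, `should_compress(text_like)` is true, while `should_compress(noise)` is false because random bytes do not shrink under `zlib`.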
The description above decides for the file to be written as a whole whether the compression condition is met. Alternatively, the decision may be made for each original data block individually, so that original data blocks meeting the compression condition are compressed before being written and original data blocks not meeting it are written directly without compression. If a written file contains both uncompressed original data blocks and compressed data blocks, the index information needs to indicate for each data block whether it is compressed. This makes the compression method more flexible.
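A per-block decision with a compressed flag in the index can be sketched as follows. The names `write_mixed` and `read_block`, the entry fields, the 0.9 threshold, and the definition of the rate as compressed size over original size are assumptions for illustration; the sketch also compresses every block once to measure it, which is only one possible way to evaluate the condition.

```python
import os
import zlib


def write_mixed(data: bytes, block_len: int = 4096, threshold: float = 0.9):
    """Write each block compressed only if it meets the (assumed) condition."""
    written = bytearray()
    entries = []
    for start in range(0, len(data), block_len):
        block = data[start:start + block_len]
        packed = zlib.compress(block)
        use_packed = len(packed) / len(block) < threshold
        payload = packed if use_packed else block
        entries.append({
            "original_offset": start,        # position in the file to be written
            "written_offset": len(written),  # position in the written file
            "written_len": len(payload),
            "compressed": use_packed,        # per-block flag kept in the index
        })
        written += payload
    return bytes(written), entries


def read_block(written: bytes, entry: dict) -> bytes:
    """Return the original bytes of one block, decompressing only if flagged."""
    start = entry["written_offset"]
    raw = written[start:start + entry["written_len"]]
    return zlib.decompress(raw) if entry["compressed"] else raw
```

For a file whose first block is repetitive and whose second block is random, the first entry is flagged as compressed and the second is stored as is.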
Step 105, respectively compressing the plurality of original data blocks to obtain a plurality of compressed data blocks respectively corresponding to the plurality of original data blocks.
And step 107, performing writing operation on the obtained plurality of compressed data blocks to obtain a written file.
Next, step 105 and step 107 will be collectively described.
The algorithm used to compress the original data blocks may be a compression algorithm such as ZSTD, ZLIB, or LZ4; this specification does not limit it. Because the data in different original data blocks is not necessarily the same, the compressed data blocks corresponding to them are not necessarily the same size even when the original data blocks are. After writing, the correspondence between the written file and the file to be written is as shown in FIG. 2. It should be noted that, although FIG. 2 draws every original data block at the same size and every compressed data block at the same size, in practical applications different original data blocks are not necessarily the same size, and neither are different compressed data blocks.
Step 109, generating index information for the written file, and storing the generated index information and the written file in an associated manner; wherein the index information includes: original data block position information and compressed data block position information, wherein the original data block position information is used for representing the position of each original data block in the file to be written, and the compressed data block position information is used for representing the position of each compressed data block in the written file.
Storing the generated index information in association with the written file may be done by storing the index information in the written file itself (for example, in a header or in a file metadata portion), or by creating a new index information file, storing the index information of the written file in it, and storing the correspondence between the index information and the written file identifier. In other words, one index information file may hold the index information of many written files (one-to-many) or of a single written file (one-to-one).
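One way to keep the index inside the written file is a length-prefixed header, sketched below. The layout (a 4-byte marker, a 4-byte little-endian length, then the index serialized as JSON), the marker value `BLKZ`, and the function names are all hypothetical choices for this example; the specification only says the index may live in a header or metadata portion.

```python
import json
import struct

MAGIC = b"BLKZ"  # hypothetical 4-byte marker, not defined by the specification


def pack_with_index(body: bytes, index: dict) -> bytes:
    """Prefix the compressed body with a header that carries the index."""
    raw = json.dumps(index).encode("utf-8")
    return MAGIC + struct.pack("<I", len(raw)) + raw + body


def unpack_with_index(blob: bytes):
    """Split a packed file back into (index, compressed body)."""
    if blob[:4] != MAGIC:
        raise ValueError("not a file written by pack_with_index")
    (index_len,) = struct.unpack_from("<I", blob, 4)
    index = json.loads(blob[8:8 + index_len].decode("utf-8"))
    return index, blob[8 + index_len:]
```

Because the header length is stored explicitly, the offsets recorded in the index can stay relative to the start of the compressed body.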
Under the condition that the length of the original data block is preset, the position information of the original data block in the index information can be represented by the preset length of the original data block, or the length of the original data block and the number of the original data blocks; under the condition that different original data blocks are different in length, the storage of the position information of the original data blocks can be realized by storing the offset of the first byte of each original data block. Since the length of each compressed data block cannot be known from the length of the original data block, the position information of the compressed data block may be represented by an offset of the first byte of each compressed data block, and of course, the position information of the compressed data block may be represented by other methods besides the offset, which is not limited herein. In addition, if the compression method encapsulates a plurality of compression algorithms, the compression algorithm corresponding to the written file needs to be marked in the index information, so that decompression can be performed according to the correct compression algorithm during decompression; of course, the index information may also include other information used for indicating the positions of the compressed data blocks in the written file and the original data blocks in the file to be written, such as the number of original data blocks, the length of each compressed data block, and the like.
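When the original data block length is preset, block positions need not be stored one by one; they follow from arithmetic. The sketch below shows this under stated assumptions: the function names are hypothetical, and the total original size is passed in as an extra value so that the end of the shorter last block can be computed exactly.

```python
def original_block_span(block_len: int, block_count: int,
                        original_size: int, i: int):
    """Return (first_byte_offset, last_byte_offset) of original block i.

    original_size is an assumed extra index field used to bound the last block.
    """
    if not 0 <= i < block_count:
        raise IndexError("block number out of range")
    first = i * block_len
    last = min(first + block_len, original_size) - 1
    return first, last


def block_containing(block_len: int, position: int) -> int:
    """Number of the original block that holds the byte at `position`."""
    return position // block_len
```

With a block length of 1000 and a 2500-byte file, block 2 spans bytes 2000 to 2499, and byte 2000 falls in block 2. The number found this way selects the matching entry in the list of compressed data block offsets.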
In addition, the method of the present specification may be implemented by a separate program or by an operating system. Considering that if the method is configured in an operating system, individual configuration cannot be performed for each program, and flexible processing cannot be performed according to user requirements, for example, a computer system is provided with two programs, namely a and B, when the program a writes a file, the program a requires to be compressed and then written, and when the program B requires to be written into the file, direct writing without compression is required to improve reading and writing efficiency, and if the method is configured in the operating system, individual configuration may not be performed for each program, so that the method can be implemented by the program. If the method is implemented according to a program, the method can be implemented according to a file writing instruction of a user, and the method can also be configured in other programs to realize separate configuration of different programs.
Further, it is considered that more than one program needs to use the file writing function, and if a separate configuration is performed for each program, it is troublesome, for example, in the file writing process of the program a, a file is copied into two copies and then written into the copy, and a write module developed for the program a, which can be compressed and then written into the copy, cannot be directly transplanted into other programs that do not need to copy into two copies. Therefore, it is further considered that each program calls a file read-write interface (specifically, a file write interface and a file read interface, in the file write method, the file write interface is called, and in the file read method, the file read interface is called), when writing and reading a file, the write method is encapsulated in the file read-write interface, so that the effects of convenient transplantation and transparency to the program (that is, the program does not sense the compression write process) are achieved. In other words, the method is configured in a file writing program; other programs call the file writing program by calling the file reading and writing interface.
In addition, because the file writing program corresponding to the file writing method in this specification is invoked by calling the file read-write interface, the method can also be applied to different database systems. A distributed big data system generally comprises several different database systems. If a compressed writing method were developed independently for each database system, the development burden would increase, because the database systems write data in different ways (for example, some need to store replicated copies, and some need to divide one piece of data into several pieces). With the method described here, a compressed writing interface can conveniently be provided for the several database systems with heterogeneous kernels in the distributed big data system.
In addition, when the file writing program is invoked by calling the file read-write interface as provided in this specification, several compression algorithms can be encapsulated in the file writing program and different compression algorithms can be used for different programs, which gives the user more flexible configuration. In the related art, different compression algorithms require different forms of input; for example, some require input as an array and some require input as an InputStream. The input of each compression algorithm therefore has to be adapted to the file output form of each program before that program can use the algorithm. In this specification, by contrast, the compression algorithms are encapsulated and a single input interface is exposed; the conversion of the data into the form required by each compression algorithm is performed inside the file writing program, so the file writing program can be applied to different programs without modification.
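The encapsulation described above can be illustrated with a minimal Python sketch. It is not the implementation of this specification: the registry `CODECS`, the helper `to_bytes`, and the functions `compress_any` and `decompress_any` are hypothetical names, and a binary file-like object stands in for the InputStream form of input.

```python
import bz2
import io
import lzma
import zlib

# Hypothetical registry: algorithm name -> (compress, decompress).
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

def to_bytes(data):
    """Normalize array-like or binary stream-like input to bytes."""
    if hasattr(data, "read"):      # stream form (binary file-like object)
        return data.read()
    return bytes(data)             # array form (bytes, bytearray, list of ints)

def compress_any(data, algorithm="zlib"):
    """Single input interface exposed to callers, whatever the input form."""
    compress, _ = CODECS[algorithm]
    return compress(to_bytes(data))

def decompress_any(blob, algorithm="zlib"):
    """Inverse of compress_any for the same algorithm name."""
    _, decompress = CODECS[algorithm]
    return decompress(blob)

# The same call works for a stream and for an array of byte values.
from_stream = compress_any(io.BytesIO(b"abc" * 10), "lzma")
from_array = compress_any([97, 98, 99], "bz2")
```

Under these assumptions, a calling program never adapts its output form to a particular algorithm; only the algorithm name changes.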
In addition, the file writing program can be configured flexibly. For example, some of the writing functions of another program (a program that calls the file read-write interface) can be set to invoke the file writing program while the others call the original file read-write interface directly, so that different writing functions within the same program can be configured differently. When the file read-write interface is called, the file identification information of the file to be written and the byte stream of the file (i.e., the data portion of the file) generally need to be provided.
With this calling method, different file writing methods can be configured flexibly for different programs. Because the file writing program is invoked through the file read-write interface, it is transparent to the other programs (the programs that invoke it): calling the file read-write interface is already part of their execution logic even without this method, so compressed writing of files is achieved without those programs being aware of it.
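As a rough illustration of this transparency, the following Python sketch places compression behind a read-write interface that takes a file identifier and a byte stream. The class name `FileReadWriteInterface`, the `compress` flag, and the in-memory dictionary used as storage are all assumptions made for the example, and whole-file compression is used here only to keep the sketch short.

```python
import zlib

class FileReadWriteInterface:
    """Hypothetical facade; callers pass a file identifier and a byte stream."""

    def __init__(self, compress=True):
        self._compress = compress   # per-program configuration
        self._store = {}            # stand-in for the real storage backend

    def write(self, file_id, data):
        """Store data, compressing it only if this instance is configured to."""
        if self._compress:
            self._store[file_id] = (True, zlib.compress(data))
        else:
            self._store[file_id] = (False, bytes(data))

    def read(self, file_id):
        """Return the original bytes; the caller never sees the stored form."""
        compressed, payload = self._store[file_id]
        return zlib.decompress(payload) if compressed else payload

# Program A wants compression, program B does not; both use the same calls.
program_a_io = FileReadWriteInterface(compress=True)
program_b_io = FileReadWriteInterface(compress=False)
program_a_io.write("a.dat", b"x" * 100)
program_b_io.write("b.dat", b"y" * 100)
```

Both programs call `write` and `read` in exactly the same way, which is the sense in which the compressed writing is not perceived by the caller.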
Correspondingly, the present specification also provides a file reading method, which is described in detail below.
Fig. 3 shows a file reading method according to an exemplary embodiment. The method is used for reading data of a file to be read, where the file to be read was written according to the file writing method described in the first aspect of the embodiments of the present specification. The method comprises the following steps:
Step 301, determining a file to be read and the position of the data to be read in the file to be written corresponding to the file to be read.
The file to be read was produced by the file writing method of the first aspect (the method shown in fig. 1) of the embodiments of the present specification; that is, the file to be read in the present specification has corresponding index information. The position of the data to be read in the file to be written corresponding to the file to be read may be represented by an offset or in other ways, and this specification places no limitation on this.
Step 303, determining at least one compressed data block corresponding to the data to be read according to the position of the data to be read in the file to be written corresponding to the file to be read, and the original data block position information and the compressed data block position information included in the index information of the file to be read.
Step 305, decompressing the determined at least one compressed data block, and outputting the data to be read according to the position of the data to be read in the file to be written corresponding to the file to be read and the position information of the original data block.
For example, suppose the data to be read is the 100th to 120th bytes of the file to be written corresponding to the file to be read. According to the original data block position information, the first original data block holds bytes 0-109 of the file to be written and the second original data block holds bytes 110-219, so the original data blocks corresponding to the data to be read are the first and second original data blocks, and the data to be read is therefore compressed in the first and second compressed data blocks. Then, according to the compressed data block position information, the first compressed data block occupies bytes 0-78 of the file to be read and the second compressed data block occupies bytes 79-139, so the two compressed data blocks can be located in the file to be read. Once located, the two compressed data blocks are decompressed to obtain the decompressed data blocks (namely the original data blocks), and the output data is determined according to the original data block position information and the position of the data to be read in the file to be written. This is the process shown in fig. 4.
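The block lookup in this example can be sketched in Python as follows. The representation of the index as lists of (start, length) pairs and the function name `blocks_for_range` are assumptions made for illustration; the byte ranges are the ones given in the example.

```python
# Index information from the example, as (start offset, length) pairs.
ORIGINAL_BLOCKS = [(0, 110), (110, 110)]   # bytes 0-109 and 110-219
COMPRESSED_BLOCKS = [(0, 79), (79, 61)]    # bytes 0-78 and 79-139

def blocks_for_range(original_blocks, offset, length):
    """Return the indices of the original blocks that overlap the request."""
    end = offset + length                  # exclusive end of the request
    return [
        i
        for i, (start, size) in enumerate(original_blocks)
        if start < end and offset < start + size
    ]

# Bytes 100-120 inclusive: offset 100, length 21.
needed = blocks_for_range(ORIGINAL_BLOCKS, offset=100, length=21)
to_decompress = [COMPRESSED_BLOCKS[i] for i in needed]
```

Here `needed` identifies the first and second blocks, and `to_decompress` gives the byte ranges of the file to be read that must be decompressed; a request that falls inside a single block would yield only that block.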
During decompression, the length of the decompressed data block needs to be output, so that the program executing the method can determine where the decompressed data block ends, and can determine whether decompression succeeded by comparing this length with the length of the original data block.
In addition, if the step of judging whether the compression condition is met was performed when the file to be written corresponding to the file to be read was written, the method further comprises: judging, according to the index information of the file to be read, whether the file to be read is a compressed file; and, if the file to be read is not a compressed file, outputting the data to be read directly according to its position in the file to be written corresponding to the file to be read.
As described in the first aspect of the embodiments of the present specification, it may be judged, before the data blocks are split, whether the file to be written meets the compression condition. If the condition is met, the file to be written is split into a plurality of original data blocks, which are compressed respectively and then written; if not, the file to be written is written directly. In this case, the index information must be consulted to determine whether the file is a compressed file before it is read; otherwise correct data may not be read. If the compression condition was instead judged for each original data block, it is necessary to judge from the index information whether each data block in the file to be read is a compressed data block, and to apply the corresponding reading strategy to each data block.
In addition, the file reading method, like the file writing method described above, is configured in a program (such as a file reading program), and other programs invoke the file reading program by calling the file read-write interface. Thus, the file reading method does not intrude on the execution logic of other programs.
With the file writing method and the file reading method described above, the file to be written is compressed in blocks before being written. When the file is read, only the data blocks that cover the position of the data to be read in the corresponding file to be written need to be decompressed, and the required data can be read without decompressing the whole file. This reduces redundant operations, improves reading efficiency, and allows the file to be read more flexibly.
In addition, the file writing method in this specification also judges whether the compression condition is satisfied, so that data is compressed before writing only when the compression condition is satisfied, which improves writing efficiency. When the compression condition is that the predicted compression rate is smaller than the preset compression rate threshold, the method also reduces the processing capacity spent on unproductive compression and writes files more efficiently.
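One way to predict the compression rate, in line with the embodiment of this specification that judges the first data block of the file, is to compress that block and compare the result with a threshold. The Python sketch below assumes that the compression rate means compressed size divided by original size, uses zlib as a stand-in algorithm, and uses 0.8 as an arbitrary threshold; the function name `worth_compressing` is hypothetical.

```python
import os
import zlib

def worth_compressing(first_block, threshold=0.8):
    """Judge a file by its first block: True if the block shrinks enough.

    Assumes compression rate = compressed size / original size, so a
    smaller rate means better compression.
    """
    if not first_block:
        return False                       # nothing to compress
    rate = len(zlib.compress(first_block)) / len(first_block)
    return rate < threshold

repetitive = b"a" * 4096                   # compresses very well
random_like = os.urandom(4096)             # effectively incompressible
decisions = (worth_compressing(repetitive), worth_compressing(random_like))
```

Highly repetitive data is judged worth compressing, whereas random data, which zlib cannot shrink, is written directly.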
The file writing method and the file reading method in this specification may also be configured in a corresponding file writing program or a corresponding file reading program (of course, the two programs may be one and the same program), and other programs invoke these programs by calling the file write interface or the file read interface, so that the methods are conveniently applied to different programs and database systems without intruding on the original execution logic of the other programs.
Next, a file writing method and a file reading method provided in the present specification will be described in detail by a specific embodiment.
Fig. 5 is a flowchart of a file writing method according to an embodiment, including the following steps:
Step 501, determining a file to be written according to a file read-write interface calling instruction.
In this embodiment, the method is integrated in a file writing program, and other programs invoke the file writing program by issuing a file read-write interface calling instruction. In other words, when data needs to be written, the other programs call the file read-write interface, which in turn invokes the file writing program, and the other programs are unaware of the specific file writing process.
Step 503, determining whether the size of the received uncompressed data of the file to be written has reached the set original data block length, or whether the file to be written has been completely input. If yes, step 505 is executed; if no, step 503 is repeated.
Step 505, the uncompressed data is used as an original data block.
That is, once the original data block length has been preset, each time the length of the input data reaches the original data block length, that data is taken as one original data block. The last data block does not necessarily reach the original data block length, so when the file to be written has been completely input, the remaining uncompressed data also needs to be taken as one original data block.
Step 507, determining whether the judger has already judged whether the file to be written is worth compressing. If the judgment has been made and the file is worth compressing, go to step 509; if the judgment has been made and the file is not worth compressing, go to step 511; if the judgment has not yet been made, go to step 513.
In this embodiment, whether compression is worthwhile is judged per file. The judger is the module that makes this judgment; for example, it may judge whether the compression rate of the first data block of the file to be written is smaller than a preset threshold value.
Step 509, compressing the original data block to obtain a compressed data block, and performing a write operation on the compressed data block; determining the length of the compressed data block, and updating the compressed data block position information in the index information according to that length.
The compressed data block position information may include the first-byte offset and the length of each compressed data block.
In step 511, a write operation is performed on the original data block.
Step 513, inputting the original data block into the judger to judge whether the file to be written is worth compressing.
Step 515, determine whether the file to be written is completely written. If all writes are complete, step 517 is performed, and if not, step 503 is performed.
Step 517, determining whether the index information includes the position information of the compressed data block, if so, executing step 519, and if not, executing step 521.
Step 519, determining the set length of the original data blocks, the number of the original data blocks and the compression algorithm, adding the determined length, number and compression algorithm to the index information, and writing the added index information into the header of the obtained written file.
Step 521, setting the index information to indicate that the written file is not compressed, and writing the set index information into the header of the obtained written file.
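The flow of steps 501 to 521 can be sketched in Python as below. This is an illustrative sketch under several assumptions, not the implementation of this specification: the whole file is held in memory instead of arriving as a stream, zlib stands in for the compression algorithm, the judger compares the compression rate of the first block with an arbitrary threshold of 0.8, and the header layout (a 4-byte length followed by JSON index information) and the name `write_blocks` are invented for the example.

```python
import json
import struct
import zlib

def write_blocks(data, block_len=64 * 1024, threshold=0.8):
    """Return header + body bytes for `data`, following the flow of fig. 5."""
    body = bytearray()
    compressed_index = []    # (offset, length) of each compressed block in body
    compress = None          # None until the judger has seen the first block
    for start in range(0, len(data), block_len):     # steps 503 and 505
        block = data[start:start + block_len]
        if compress is None:                         # steps 507 and 513
            compress = len(zlib.compress(block)) / len(block) < threshold
        if compress:                                 # step 509
            packed = zlib.compress(block)
            compressed_index.append((len(body), len(packed)))
            body += packed
        else:                                        # step 511
            body += block
    if compressed_index:                             # steps 517 and 519
        index = {"compressed": True, "block_len": block_len,
                 "num_blocks": len(compressed_index), "algorithm": "zlib",
                 "blocks": compressed_index}
    else:                                            # step 521
        index = {"compressed": False}
    header = json.dumps(index).encode("utf-8")
    return struct.pack(">I", len(header)) + header + bytes(body)

def split_header(written):
    """Hypothetical helper: separate the index information from the body."""
    (header_len,) = struct.unpack(">I", written[:4])
    index = json.loads(written[4:4 + header_len])
    return index, written[4 + header_len:]

index, body = split_header(write_blocks(b"abc" * 1000, block_len=1000))
```

Because the original data blocks have a fixed set length, the sketch stores only that length and the number of blocks as the original data block position information, together with an (offset, length) pair for each compressed data block.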
Corresponding to the file writing method, the specification also provides a file reading method. Fig. 6 is a flowchart of a file reading method according to an embodiment of the present disclosure, including the following steps:
Step 601, determining the file to be read and the position of the data to be read in the file to be written corresponding to the file to be read according to the file read-write interface calling instruction.
Step 603, judging whether the file to be read is a compressed file according to the index information. If so, step 607 is performed, and if not, step 605 is performed.
Step 605, directly outputting the data at the position according to the position of the data to be read in the file to be written corresponding to the file to be read.
Step 607, determining at least one compressed data block corresponding to the data to be read according to the compressed data block position information and the original data block position information; and decompressing the determined at least one compressed data block to obtain the corresponding original data block.
Step 609, outputting the data to be read according to the position of the data to be read in the file to be written corresponding to the file to be read and the position information of the original data block.
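Steps 601 to 609 can likewise be sketched in Python. The index layout (a dictionary with `compressed`, `block_len`, and `blocks` entries), the helper `build_sample`, and the function `read_range` are assumptions made for this example, with zlib standing in for the compression algorithm and fixed-length original data blocks assumed.

```python
import zlib

def build_sample(data, block_len):
    """Hypothetical helper: compress data block by block and build its index."""
    body, blocks = bytearray(), []
    for start in range(0, len(data), block_len):
        packed = zlib.compress(data[start:start + block_len])
        blocks.append((len(body), len(packed)))   # (offset, length) in body
        body += packed
    index = {"compressed": True, "block_len": block_len, "blocks": blocks}
    return index, bytes(body)

def read_range(index, body, offset, length):
    """Return `length` bytes starting at `offset` of the original file."""
    if not index["compressed"]:                   # steps 603 and 605
        return body[offset:offset + length]
    block_len = index["block_len"]
    first = offset // block_len                   # step 607: covering blocks
    last = min((offset + length - 1) // block_len, len(index["blocks"]) - 1)
    out = bytearray()
    for c_off, c_len in index["blocks"][first:last + 1]:
        out += zlib.decompress(body[c_off:c_off + c_len])
    skip = offset - first * block_len             # step 609: trim to request
    return bytes(out[skip:skip + length])

# 220 bytes split into two 110-byte blocks; read bytes 100-120 inclusive.
data = bytes(i % 251 for i in range(220))
index, body = build_sample(data, block_len=110)
result = read_range(index, body, offset=100, length=21)
```

Only the blocks that cover the requested range are decompressed; the rest of the body is never touched.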
Corresponding to the method embodiments, the present specification also provides embodiments of a file writing device, a file reading device, and a terminal to which they are applied.
Fig. 7 is a block diagram of a file writing apparatus according to an exemplary embodiment of the present specification. The apparatus includes:
a to-be-written file determining module 710, configured to determine a to-be-written file;
a to-be-written file splitting module 720, configured to split the to-be-written file according to a preset splitting algorithm, so as to obtain a plurality of original data blocks;
an original data block compressing module 730, configured to compress the plurality of original data blocks respectively to obtain a plurality of compressed data blocks respectively corresponding to the plurality of original data blocks;
a compressed data block writing module 740, configured to perform a writing operation on the obtained compressed data blocks, so as to obtain a written file; and
an index information generating module 750, configured to generate index information for the written file, and store the generated index information in association with the written file; wherein the index information includes: original data block position information and compressed data block position information, wherein the original data block position information is used for representing the position of each original data block in the file to be written, and the compressed data block position information is used for representing the position of each compressed data block in the written file.
The to-be-written file splitting module is specifically used for splitting the to-be-written file according to a preset splitting algorithm under the condition that a compression condition is met, so as to obtain a plurality of original data blocks. The compression condition comprises that the predicted compression rate of the file to be written is smaller than a preset compression rate threshold value. On this basis, the device further comprises: a to-be-written file writing module 760, configured to directly write data in a to-be-written file into a written file if a compression condition is not satisfied; and generating index information of the written file, wherein the generated index information is used for representing that the written file is not compressed, and storing the generated index information in association with the written file.
Furthermore, the apparatus further comprises: an original data block length obtaining module 770, configured to obtain a set original data block length. The to-be-written file splitting module and the original data block compression module are specifically configured to repeatedly execute the following step until all data of the file to be written is compressed: when the size of the input uncompressed data of the file to be written reaches the set original data block length, or when all data of the file to be written has been input, taking the uncompressed data as an original data block and compressing the original data block.
In addition, the device is configured in a file writing program; other programs call the file writing program by calling the file reading and writing interface.
Fig. 8 is a block diagram of a file reading apparatus according to an exemplary embodiment of this specification. The apparatus is configured to read data of a file to be read, where the file to be read was written according to the file writing method described in the first aspect of this specification. The apparatus includes:
a to-be-read file determining module 810, configured to determine a to-be-read file and the position of the data to be read in the to-be-written file corresponding to the to-be-read file;
a compressed data block determining module 820, configured to determine at least one compressed data block corresponding to the data to be read according to the position of the data to be read in the file to be written corresponding to the file to be read, and the original data block position information and the compressed data block position information included in the index information of the file to be read; and
a data output module 830, configured to decompress the determined at least one compressed data block, and output the data to be read according to the position of the data to be read in the file to be written corresponding to the file to be read and the original data block position information.
If whether the compression condition is met was judged when the file was written, the device further comprises: a to-be-read file reading module 840, configured to determine whether the to-be-read file is a compressed file according to the index information of the to-be-read file; and, if the file to be read is not a compressed file, to output the data to be read according to the position of the data to be read in the file to be written corresponding to the file to be read.
The implementation process of the functions and actions of each module in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, wherein the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in the specification. One of ordinary skill in the art can understand and implement it without inventive effort.
As shown in fig. 9, fig. 9 is a hardware configuration diagram of a computer device in which a file writing apparatus or a file reading apparatus is located, and the device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by programs or firmware, relevant program codes are stored in the memory 1020 and called by the processor 1010 to be executed.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The input/output module may be configured as a component in the device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
Embodiments of the present specification also provide a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the method according to the first or second aspect of the embodiments of the present specification.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
Embodiments of the present specification also provide a computer program, which when executed, implements the method according to the first or second aspect of the embodiments of the present specification.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present description to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments herein. The word "if" as used herein may be interpreted as "at the time of", "when", or "in response to a determination", depending on the context.

Claims (13)

1. A method of writing a file, the method comprising:
determining a file to be written;
splitting the file to be written according to a preset splitting algorithm to obtain a plurality of original data blocks;
respectively compressing the plurality of original data blocks to obtain a plurality of compressed data blocks respectively corresponding to the plurality of original data blocks;
writing the obtained compressed data blocks to obtain a written file;
generating index information for the written file, and storing the generated index information in association with the written file; wherein the index information includes: original data block position information and compressed data block position information, wherein the original data block position information is used for representing the position of each original data block in the file to be written, and the compressed data block position information is used for representing the position of each compressed data block in the written file.
2. The method according to claim 1, wherein splitting the file to be written according to a preset splitting algorithm to obtain a plurality of original data blocks includes:
and splitting the file to be written according to a preset splitting algorithm under the condition of meeting the compression condition to obtain a plurality of original data blocks.
3. The method according to claim 2, wherein the compression condition comprises that the predicted compression rate of the file to be written is less than a preset threshold value of compression rate.
4. The method of claim 2, further comprising:
under the condition that the compression condition is not met, directly writing the data in the file to be written into the written file;
and generating index information of the written file, wherein the generated index information is used for representing that the written file is not compressed, and storing the generated index information in association with the written file.
5. The method of claim 1,
the method further comprises the following steps:
acquiring a set original data block length;
splitting the file to be written according to a preset splitting algorithm to obtain a plurality of original data blocks; compressing the plurality of original data blocks respectively to obtain a plurality of compressed data blocks respectively corresponding to the plurality of original data blocks, comprising:
repeatedly executing the following steps until all data of the file to be written are compressed:
and under the condition that the size of the input uncompressed data of the file to be written reaches the set length of the original data block or the input of all data of the file to be written is finished, taking the uncompressed data as an original data block and compressing the original data block.
6. The method of claim 1, wherein the method is configured in a file writing program;
other programs call the file writing program by calling the file reading and writing interface.
7. A file reading method for reading data of a file to be read, the file to be read being written according to the file writing method of any one of claims 1 to 6; the method comprises the following steps:
determining a file to be read and the position of data to be read in the file to be written corresponding to the file to be read;
determining at least one compressed data block corresponding to the data to be read according to the position of the data to be read in the file to be written corresponding to the file to be read, and the position information of the original data block and the position information of the compressed data block included in the index information of the file to be read;
and decompressing the determined at least one compressed data block, and outputting the data to be read according to the position of the data to be read in the file to be written corresponding to the file to be read and the position information of the original data block.
8. The method according to claim 7, wherein the file to be read is written according to the file writing method according to claim 4;
the method further comprises the following steps:
judging whether the file to be read is a compressed file or not according to the index information of the file to be read;
and under the condition that the file to be read is not a compressed file, outputting the data to be read according to the position of the data to be read in the file to be written corresponding to the file to be read.
9. A file writing apparatus, the apparatus comprising:
the to-be-written file determining module is used for determining a to-be-written file;
the to-be-written file splitting module is used for splitting the to-be-written file according to a preset splitting algorithm to obtain a plurality of original data blocks;
the original data block compression module is used for respectively compressing the plurality of original data blocks to obtain a plurality of compressed data blocks respectively corresponding to the plurality of original data blocks;
the compressed data block writing module is used for executing writing operation on the obtained plurality of compressed data blocks to obtain written files;
the index information generation module is used for generating index information for the written file and storing the generated index information in association with the written file; wherein the index information includes: original data block position information and compressed data block position information, wherein the original data block position information is used for representing the position of each original data block in the file to be written, and the compressed data block position information is used for representing the position of each compressed data block in the written file.
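The cooperation of the splitting, compression, writing and index generation modules can be sketched as follows. This is an illustrative sketch only: fixed-length splitting stands in for the "preset splitting algorithm", `zlib` is an assumed compressor, and the names `BlockIndexEntry` and `write_compressed` are hypothetical.

```python
import zlib
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BlockIndexEntry:
    orig_offset: int  # position of the original data block in the file to be written
    orig_len: int     # length of the original data block
    comp_offset: int  # position of the compressed data block in the written file
    comp_len: int     # length of the compressed data block


def write_compressed(data: bytes, block_len: int) -> Tuple[bytes, List[BlockIndexEntry]]:
    """Split, compress and concatenate blocks; return the written file and its index."""
    if block_len <= 0:
        raise ValueError("block_len must be positive")
    written = bytearray()
    index: List[BlockIndexEntry] = []
    for orig_offset in range(0, len(data), block_len):
        block = data[orig_offset:orig_offset + block_len]  # one original data block
        comp = zlib.compress(block)                        # its compressed data block
        index.append(BlockIndexEntry(orig_offset, len(block), len(written), len(comp)))
        written += comp                                    # append to the written file
    return bytes(written), index
```

In this sketch the index is returned next to the written bytes; how the index is actually stored in association with the written file is left open here.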
10. A file reading apparatus for reading data of a file to be read, the file to be read being written according to the file writing method of any one of claims 1 to 6; the device comprises:
the to-be-read file determining module is used for determining a to-be-read file and the position of data to be read in the to-be-written file corresponding to the to-be-read file;
the compressed data block determining module is used for determining at least one compressed data block corresponding to the data to be read according to the position of the data to be read in the file to be written corresponding to the file to be read, and the original data block position information and the compressed data block position information which are included in the index information of the file to be read;
and the data output module is used for decompressing the determined at least one compressed data block and outputting the data to be read according to the position of the data to be read in the file to be written corresponding to the file to be read and the position information of the original data block.
11. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor implements the method of any one of claims 1-8 by executing the executable instructions.
12. A computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method of any one of claims 1-8.
13. A computer program which when executed implements the method of any one of claims 1-8.
CN202110751158.2A 2021-07-02 2021-07-02 File writing method and device Pending CN113641643A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110751158.2A CN113641643A (en) 2021-07-02 2021-07-02 File writing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110751158.2A CN113641643A (en) 2021-07-02 2021-07-02 File writing method and device

Publications (1)

Publication Number Publication Date
CN113641643A true CN113641643A (en) 2021-11-12

Family

ID=78416612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110751158.2A Pending CN113641643A (en) 2021-07-02 2021-07-02 File writing method and device

Country Status (1)

Country Link
CN (1) CN113641643A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116303297A (en) * 2023-05-25 2023-06-23 深圳市东信时代信息技术有限公司 File compression processing method, device, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109582653A (en) * 2018-11-14 2019-04-05 网易(杭州)网络有限公司 Compression, decompression method and the equipment of file
CN110784225A (en) * 2018-07-31 2020-02-11 华为技术有限公司 Data compression method, data decompression method, related device, electronic equipment and system
CN110879800A (en) * 2018-09-05 2020-03-13 阿里巴巴集团控股有限公司 Data writing, compressing and reading method, data processing method and device
CN110888851A (en) * 2018-08-15 2020-03-17 阿里巴巴集团控股有限公司 Method and device for creating and decompressing compressed file, electronic and storage device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110784225A (en) * 2018-07-31 2020-02-11 华为技术有限公司 Data compression method, data decompression method, related device, electronic equipment and system
CN110888851A (en) * 2018-08-15 2020-03-17 阿里巴巴集团控股有限公司 Method and device for creating and decompressing compressed file, electronic and storage device
CN110879800A (en) * 2018-09-05 2020-03-13 阿里巴巴集团控股有限公司 Data writing, compressing and reading method, data processing method and device
CN109582653A (en) * 2018-11-14 2019-04-05 网易(杭州)网络有限公司 Compression, decompression method and the equipment of file

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张素琴 (Zhang Suqin): "《C++程序设计语言》" (C++ Programming Language), 31 August 1995, 清华大学出版社 (Tsinghua University Press), pages: 194 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116303297A (en) * 2023-05-25 2023-06-23 深圳市东信时代信息技术有限公司 File compression processing method, device, equipment and medium
CN116303297B (en) * 2023-05-25 2023-09-29 深圳市东信时代信息技术有限公司 File compression processing method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN110879800B (en) Data writing, compressing and reading method, data processing method and device
CN109614372B (en) Object storage and reading method and device and service server
JP2008065834A (en) Fusion memory device and method
CN109857454B (en) Method, device, electronic equipment and storage medium for generating and caching installation package
CN110362547B (en) Method and device for encoding, analyzing and storing log file
US11249987B2 (en) Data storage in blockchain-type ledger
CN110597461B (en) Data storage method, device and equipment in block chain type account book
CN112181471A (en) Differential upgrading method and device, storage medium and computer equipment
CN115203148A (en) Method and device for modifying file
US20140258247A1 (en) Electronic apparatus for data access and data access method therefor
CN113641643A (en) File writing method and device
CN108053034B (en) Model parameter processing method and device, electronic equipment and storage medium
CN107577474B (en) Processing method and device for upgrading file and electronic equipment
CN113031871A (en) Data adding and aggregating method and device, electronic equipment and readable storage medium
CN113064556A (en) BIOS data storage method, device, equipment and storage medium
US20180246657A1 (en) Data compression with inline compression metadata
CN110597462A (en) Data storage method, system, device and equipment in block chain type account book
CN110851433B (en) Key optimization method for key value storage system, storage medium, electronic device and system
CN115765754A (en) Data coding method and coded data comparison method
US9019134B2 (en) System and method for efficiently translating media files between formats using a universal representation
CN113377391B (en) Method, device, equipment and medium for making and burning image file
CN113010113B (en) Data processing method, device and equipment
CN108234552B (en) Data storage method and device
CN116431585A (en) File compression method and device, and file decompression method and device
CN111090854A (en) Target program execution and conversion method, device, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40069617

Country of ref document: HK

TA01 Transfer of patent application right

Effective date of registration: 20240229

Address after: # 03-06, Lai Zan Da Building 1, 51 Belarusian Road, Singapore

Applicant after: Alibaba Innovation Co.

Country or region after: Singapore

Address before: Room 01, 45th Floor, AXA Building, 8 Shanton Road, Singapore

Applicant before: Alibaba Singapore Holdings Ltd.

Country or region before: Singapore