CN116263758A - Data writing method and device and computing equipment - Google Patents

Data writing method and device and computing equipment

Publication number
CN116263758A
Authority
CN
China
Prior art keywords
file
directory
tracked
catalog
tracking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211455826.8A
Other languages
Chinese (zh)
Inventor
陈海峰
范云博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Suzhou Software Technology Co Ltd
Priority to CN202211455826.8A
Publication of CN116263758A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/162 Delete operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629 Configuration or reconfiguration of storage systems
    • G06F3/0631 Configuration or reconfiguration of storage systems by allocating resources to storage systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a data writing method, a data writing apparatus and a computing device. The method comprises: monitoring file update events of a directory to be tracked in a storage system; updating information in a target memory according to the file information of the updated file corresponding to a file update event; deleting the directory tracking file stored in the storage system for the directory to be tracked; generating an updated directory tracking file for the directory to be tracked according to the information in the target memory and uploading it to the storage system; and, when a management requirement for the directory to be tracked is detected, reading the directory tracking file for the directory to be tracked from the storage system and cleaning up invalid file data under the directory to be tracked according to the directory tracking file that was read. In this way, the performance of writing data into the storage system can be improved and invalid data can be avoided.

Description

Data writing method and device and computing equipment
Technical Field
The present invention relates to the field of database technologies, and in particular, to a data writing method, device and computing equipment.
Background
HDFS (Hadoop Distributed File System) is the default file-based storage system of the big data ecosystem, and many big data compute engines are designed and implemented on top of its API. Object storage is a different way of storing data from HDFS. With the trend toward separating storage from compute, many enterprises are attempting to build database architectures on object storage, which has led big data compute engines to increasingly use object storage as their storage system. A big data compute engine can access object storage using HDFS semantics to build a computing and analysis platform and to meet analysis requirements across multiple dimensions.
When a big data compute engine uses HDFS semantics to write data into object storage, the intermediate results of the computation are generally written into a temporary directory in order to prevent dirty data, and the temporary directory is renamed to the final directory after all analysis results have been written. Object storage is very limited in both functionality and performance, and once data has been written to object storage the data object cannot be changed. A directory rename with HDFS semantics on object storage therefore traverses every file under the directory, copying it and then deleting it. The complexity of the rename is positively correlated with the size and number of the files under the directory: the more files there are and the larger they are, the more complex the rename and the greater the performance impact. A rename with HDFS semantics also raises an atomicity problem: one rename operation is decomposed into two operations, a copy and a delete, so users can easily see inconsistent views of the data.
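The cost and atomicity issue described above can be illustrated with a minimal sketch. It assumes a plain Python dict standing in for an object store (key to bytes); the function name `rename_directory` and the key names are hypothetical and are not taken from any real object storage API.

```python
# Hypothetical in-memory stand-in for an object store: key -> bytes.
def rename_directory(store: dict, src_prefix: str, dst_prefix: str) -> int:
    """Emulate an HDFS-style directory rename on a flat key space.

    Returns the number of objects that had to be copied.
    """
    keys = [k for k in store if k.startswith(src_prefix)]
    # Phase 1: copy every object to its new key. The work grows with the
    # number and size of the objects under the directory.
    for key in keys:
        store[dst_prefix + key[len(src_prefix):]] = store[key]
    # Phase 2: delete the originals. Between the two phases a reader can
    # see both directories, which is the atomicity problem.
    for key in keys:
        del store[key]
    return len(keys)


store = {"out/_temporary/part-0": b"a", "out/_temporary/part-1": b"b"}
copied = rename_directory(store, "out/_temporary/", "out/final/")
```

Every object under the source prefix is touched twice, once for the copy and once for the delete, which is why the rename becomes slower as the directory grows.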
Disclosure of Invention
The present invention has been made in view of the above problems, and it is an object of the present invention to provide a data writing method, apparatus and computing device that overcomes or at least partially solves the above problems.
According to an aspect of the present invention, there is provided a data writing method, the method comprising:
monitoring file update events of a directory to be tracked in a storage system;
updating information in a target memory according to the file information of the updated file corresponding to a file update event;
deleting the directory tracking file stored in the storage system for the directory to be tracked;
generating an updated directory tracking file for the directory to be tracked according to the information in the target memory, and uploading the updated directory tracking file to the storage system; and
when a management requirement for the directory to be tracked is detected, reading the directory tracking file for the directory to be tracked from the storage system, and cleaning up invalid file data under the directory to be tracked according to the directory tracking file that was read.
Optionally, updating the information in the target memory according to the file information of the update target file corresponding to the file update event further includes:
if a file addition event of the directory to be tracked is detected, adding the file information of the newly added file corresponding to the file addition event to the target memory;
if a file deletion event of the directory to be tracked is detected, deleting the file information of the file to be deleted corresponding to the file deletion event from the target memory.
Optionally, the directory tracking file further includes generation time information; reading the directory tracking file for the directory to be tracked from the storage system further comprises:
if a plurality of directory tracking files for the directory to be tracked are stored in the storage system, reading the directory tracking file whose generation time information is closest to the current time.
Optionally, the directory tracking file further includes file expiration time information of each file under the directory to be tracked; cleaning up invalid file data under the directory to be tracked according to the directory tracking file that was read further comprises:
deleting the data blocks of expired files under the directory to be tracked according to the file expiration time information of each file contained in the directory tracking file that was read.
Optionally, cleaning invalid file data under the directory to be tracked according to the read directory tracking file further includes:
determining the files to be cleaned up under the directory to be tracked according to the directory tracking file read this time and the directory tracking file read last time, and deleting the data blocks of the files to be cleaned up.
Optionally, the directory tracking file further includes file size information of each file under the directory to be tracked; the method further comprises:
counting the amount of file data under the directory to be tracked according to the file size information of each file contained in the directory tracking file that was read.
Optionally, the method further comprises: determining each information item contained in the directory tracking file according to the dimensions of the directory management requirement;
generating an updated directory tracking file corresponding to the directory to be tracked according to the information in the target memory further comprises:
and generating an updated directory tracking file corresponding to the directory to be tracked according to each piece of information corresponding to each information item contained in the target memory.
According to another aspect of the present invention, there is provided a data writing apparatus, comprising:
a monitoring module, adapted to monitor file update events of a directory to be tracked in a storage system;
an information updating module, adapted to update information in a target memory according to the file information of the updated file corresponding to a file update event;
a file deleting module, adapted to delete the directory tracking file stored in the storage system for the directory to be tracked;
a file generation module, adapted to generate an updated directory tracking file for the directory to be tracked according to the information in the target memory;
an uploading module, adapted to upload the updated directory tracking file to the storage system;
an acquisition module, adapted to read the directory tracking file for the directory to be tracked from the storage system when a management requirement for the directory to be tracked is detected; and
a data cleaning module, adapted to clean up invalid file data under the directory to be tracked according to the directory tracking file that was read.
Optionally, the information updating module is further adapted to: if a file addition event of the directory to be tracked is detected, add the file information of the newly added file corresponding to the file addition event to the target memory; if a file deletion event of the directory to be tracked is detected, delete the file information of the file to be deleted corresponding to the file deletion event from the target memory.
Optionally, the directory tracking file further includes generation time information; the acquisition module is further adapted to: if a plurality of directory tracking files for the directory to be tracked are stored in the storage system, read the directory tracking file whose generation time information is closest to the current time.
Optionally, the directory tracking file further includes file expiration time information of each file under the directory to be tracked; the data cleaning module is further adapted to: delete the data blocks of expired files under the directory to be tracked according to the file expiration time information of each file contained in the directory tracking file that was read.
Optionally, the data cleaning module is further adapted to: determine the files to be cleaned up under the directory to be tracked according to the directory tracking file read this time and the directory tracking file read last time, and delete the data blocks of the files to be cleaned up.
Optionally, the directory tracking file further includes file size information of each file under the directory to be tracked; the apparatus further comprises: a statistics module, adapted to count the amount of file data under the directory to be tracked according to the file size information of each file contained in the directory tracking file that was read.
Optionally, the apparatus further comprises: an information management module, adapted to determine each information item contained in the directory tracking file according to the dimensions of the directory management requirement; the file generation module is further adapted to: generate the updated directory tracking file for the directory to be tracked according to the information in the target memory that corresponds to each information item.
According to yet another aspect of the present invention, there is provided a computing device comprising: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the data writing method.
According to still another aspect of the present invention, there is provided a computer storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the above-described data writing method.
According to the data writing method, the data writing apparatus and the computing device of the present invention, file update events of a directory to be tracked in a storage system are monitored; information in a target memory is updated according to the file information of the updated file corresponding to a file update event; the directory tracking file stored in the storage system for the directory to be tracked is deleted; an updated directory tracking file for the directory to be tracked is generated according to the information in the target memory and uploaded to the storage system; and, when a management requirement for the directory to be tracked is detected, the directory tracking file for the directory to be tracked is read from the storage system and invalid file data under the directory to be tracked is cleaned up according to the directory tracking file that was read. In this way, data is written directly into the target directory, and the file information of the target directory is identified by a custom directory tracking file that is regenerated dynamically as the files of the target directory change. The directory tracking file is read and parsed to clean up dirty data in the target directory. This avoids the rename operation when data is written into the target directory, avoids dirty data in the target directory, greatly improves data writing performance, and avoids inconsistent user views of the directory.
The foregoing is only an overview of the technical solution of the present invention, provided so that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages of the invention are more readily apparent.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 shows a flow chart of a data writing method provided by an embodiment of the invention;
FIG. 2 is a flow chart of a data writing method according to another embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a data writing apparatus according to an embodiment of the present invention;
FIG. 4a is a schematic diagram of a system architecture in another embodiment of the invention;
FIG. 4b is a schematic diagram showing a directory file tracking system according to another embodiment of the present invention;
FIG. 5a is a diagram illustrating a format of a directory trace file according to an embodiment of the present invention;
FIG. 5b is a schematic diagram illustrating the format of an extended directory trace file in accordance with one embodiment of the present invention;
FIG. 6 illustrates a schematic diagram of a computing device provided by an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Fig. 1 shows a flowchart of a data writing method according to an embodiment of the present invention, where the method is applied to any device having computing power. As shown in fig. 1, the method comprises the steps of:
Step S110, monitoring file update events of the directory to be tracked in the storage system.
The storage system may be built on object storage. Writing data into the object storage adds new files to the directory to be tracked or deletes existing files from it, and these file addition and file deletion events of the directory to be tracked are monitored.
Step S120, updating the information in the target memory according to the file information of the updated file corresponding to the file update event.
When an update to the files under the directory to be tracked is detected, for example a new file is written to the directory to be tracked or a file under it is deleted, the information in the target memory is updated according to the file information of the updated file.
Step S130, deleting the directory tracking file stored in the storage system for the directory to be tracked.
The directory tracking file that is already stored in the object storage for the directory to be tracked is deleted. The directory tracking file identifies the file content under the directory to be tracked.
Step S140, generating an updated directory tracking file for the directory to be tracked according to the information in the target memory and uploading it to the storage system.
After the file information in the target memory has been updated, an updated directory tracking file for the directory to be tracked is generated from the file information in the target memory and uploaded to the object storage.
Step S150, when a management requirement for the directory to be tracked is detected, reading the directory tracking file for the directory to be tracked from the storage system, and cleaning up invalid file data under the directory to be tracked according to the directory tracking file that was read.
For example, when a preset time is reached, it is determined that a management requirement for the directory to be tracked exists. The directory tracking file for the directory to be tracked is read from the storage system, the invalid files under the directory to be tracked are determined according to the directory tracking file that was read, and the data blocks of the invalid files are deleted, thereby removing dirty data under the directory to be tracked.
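Steps S120 to S140 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: a Python dict stands in for the object storage, the tracking file is serialized as JSON, and the class name `DirectoryTracker`, the `.dirtrack-` key prefix and the `generated_at` field are all hypothetical names introduced for the example.

```python
import json
import time


class DirectoryTracker:
    """Sketch of steps S120-S140 against a dict used as a fake object store."""

    def __init__(self, store: dict, directory: str):
        self.store = store                    # hypothetical store: key -> bytes
        self.directory = directory.rstrip("/") + "/"
        self.memory = {}                      # "target memory": file name -> info
        self.tracking_key = None              # key of the tracking file now stored

    def on_file_added(self, name: str, size: int) -> None:
        # S120 for a file addition event: record the new file's information.
        self.memory[name] = {"size": size}
        self._republish()

    def on_file_deleted(self, name: str) -> None:
        # S120 for a file deletion event: drop the file's information.
        self.memory.pop(name, None)
        self._republish()

    def _republish(self) -> None:
        # S130: delete the tracking file already stored for this directory.
        if self.tracking_key is not None:
            self.store.pop(self.tracking_key, None)
        # S140: generate the updated tracking file and upload it.
        generated_at = time.time_ns()
        body = {"generated_at": generated_at, "files": self.memory}
        self.tracking_key = f"{self.directory}.dirtrack-{generated_at}"
        self.store[self.tracking_key] = json.dumps(body).encode()


store = {}
tracker = DirectoryTracker(store, "warehouse/t1")
tracker.on_file_added("part-0", 10)
tracker.on_file_added("part-1", 20)
tracker.on_file_deleted("part-0")
```

After these three events the store holds a single tracking file that lists only `part-1`, so a later management pass can learn the directory content from one object instead of listing the directory.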
In the prior art, during Spark-based offline computation and analysis on object storage, a result file is first named with a temp marker or a suffix. When the computing task completes, the result file is immediately renamed to the final file. If the result file is very large, the rename operation has a significant impact on Spark analysis performance.
According to the data writing method provided by this embodiment of the invention, data is written directly into the target directory, and the file information of the target directory is identified by a custom directory tracking file that is regenerated dynamically as the files of the target directory change. The directory tracking file is read and parsed to clean up dirty data in the target directory. This avoids the rename operation when data is written into object storage, avoids dirty data in the target directory, greatly improves data writing performance, and avoids inconsistent user views of the directory.
Fig. 2 is a flowchart of a data writing method according to another embodiment of the present invention, where the method is applied to any device having computing power. As shown in fig. 2, the method comprises the steps of:
Step S210, registering and generating a directory to be tracked in the storage system according to a tracking registration request.
For example, the big data engine initiates a tracking registration request and a directory to be tracked is registered and generated in the storage system. After the directory to be tracked has been generated, new files can be added to it and files in it can be deleted.
Step S220, monitoring file update events of the directory to be tracked in the storage system.
The update status of the files under the directory to be tracked is monitored. If a file addition event of the directory to be tracked is detected, step S230 is executed; if a file deletion event of the directory to be tracked is detected, step S240 is executed.
Step S230, adding the file information of the newly added file corresponding to the file addition event to the target memory.
The file addition event may be a file addition instruction or a file addition operation. When a newly added file under the directory to be tracked is detected, the file information of the newly added file is added to the target memory. The file information may include the file name and the file size; that is, the directory tracking file contains the file name and file size of each file under the directory.
Step S240, deleting the file information of the file to be deleted corresponding to the file deletion event from the target memory.
The file deletion event may be a file deletion instruction. When a file deletion instruction for the directory to be tracked is detected, the file information of the corresponding file to be deleted is deleted from the target memory.
Step S250, deleting the directory tracking file stored in the storage system for the directory to be tracked.
The directory tracking file that is already stored in the storage system for the directory to be tracked is deleted.
Step S260, generating an updated directory tracking file for the directory to be tracked according to the information in the target memory and uploading it to the storage system.
When a file is added to or deleted from the directory to be tracked, the file information of the newly added file is added to the target memory, or the file information of the file to be deleted is removed from the target memory. After the data in the target memory has been updated, a new directory tracking file is generated from the information in the target memory and replaces the directory tracking file stored in the storage system for the directory to be tracked.
Step S270, when a management requirement for the directory to be tracked is detected, if a plurality of directory tracking files for the directory to be tracked are stored in the storage system, reading the directory tracking file whose generation time information is closest to the current time, and cleaning up invalid file data under the directory to be tracked according to the directory tracking file that was read.
The latest directory tracking file for the directory to be tracked is read from the storage system and parsed, the invalid files under the directory to be tracked are determined from the parsing result, and the data blocks of the invalid files are cleaned up.
In the solution of this embodiment, when the file under the directory to be tracked changes, a new directory trace file is generated to replace the existing directory trace file in the storage system, so, in theory, only one directory trace file exists in the directory to be tracked in the storage system, that is, the directory trace file triggered and generated by the latest file update of the directory to be tracked.
In practice, however, two directory tracking files for the directory to be tracked may be stored in the storage system at the same time. For example, if a new directory tracking file has been generated and the system crashes while the old directory tracking file is being deleted, two directory tracking files exist after the system restarts. In this case it must be determined which directory tracking file to read. Specifically, the generation time is written into the directory tracking file when it is generated, and the directory tracking file whose generation time is closest to the current time is read. For example, if the directory tracking file contains a generation timestamp, the directory tracking file with the largest timestamp is read. In this way, the directory tracking file that accurately identifies the files under the directory to be tracked can be read.
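The selection rule described above (read the directory tracking file with the largest generation timestamp) can be sketched as follows. The dict-based store, the JSON layout, the `.dirtrack-` key prefix and the `generated_at` field are assumptions made for illustration only.

```python
import json


def read_latest_tracking_file(store: dict, directory: str) -> dict:
    """Return the parsed tracking file with the largest generation timestamp."""
    prefix = directory.rstrip("/") + "/.dirtrack-"   # hypothetical naming scheme
    candidates = [json.loads(body) for key, body in store.items()
                  if key.startswith(prefix)]
    if not candidates:
        raise FileNotFoundError(f"no tracking file under {directory}")
    # With several tracking files present (for example after a crash),
    # the one generated most recently wins.
    return max(candidates, key=lambda t: t["generated_at"])


store = {
    "d/.dirtrack-100": json.dumps(
        {"generated_at": 100, "files": {"a": {"size": 1}}}).encode(),
    "d/.dirtrack-200": json.dumps(
        {"generated_at": 200, "files": {"b": {"size": 2}}}).encode(),
    "d/b": b"..",
}
latest = read_latest_tracking_file(store, "d")
```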
In an alternative way, the directory tracking file further comprises file expiration time information of each file under the directory to be tracked, and the step of cleaning up invalid file data under the directory to be tracked according to the directory tracking file that was read comprises: deleting the data blocks of expired files under the directory to be tracked according to the file expiration time information of each file contained in the directory tracking file that was read. That is, the expired files are determined from the expiration time of each file recorded in the directory tracking file, and the data blocks of the expired files are deleted.
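A minimal sketch of this expiration-based cleanup is given below. It assumes a dict-based store and a parsed tracking file whose per-file entries may carry an `expires_at` value; that field name and the function `clean_expired_files` are hypothetical.

```python
def clean_expired_files(store: dict, directory: str, tracking: dict, now: int) -> list:
    """Delete the data of files whose recorded expiration time has passed."""
    prefix = directory.rstrip("/") + "/"
    removed = []
    for name, info in tracking["files"].items():
        expires_at = info.get("expires_at")      # hypothetical field name
        # Files without an expiration time are never treated as expired.
        if expires_at is not None and expires_at <= now:
            store.pop(prefix + name, None)
            removed.append(name)
    return sorted(removed)


store = {"d/old": b"x", "d/new": b"y", "d/keep": b"z"}
tracking = {"files": {"old": {"expires_at": 50},
                      "new": {"expires_at": 500},
                      "keep": {}}}
removed = clean_expired_files(store, "d", tracking, now=100)
```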
In an alternative way, when a file deletion operation on the directory to be tracked is detected, the file is not deleted immediately; instead, the file's data blocks are deleted when directory management is performed according to the directory tracking file. Specifically, the files to be cleaned up under the directory to be tracked are determined according to the directory tracking file read this time and the directory tracking file read last time, and the data blocks of those files are deleted. The file names contained in the directory tracking file read this time are compared with the file names contained in the directory tracking file read last time. The file names that appear in the directory tracking file read last time but not in the one read this time identify the files to be cleaned up, and the data blocks of those files are deleted.
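The comparison described above amounts to a set difference between the file names of two tracking files. The following sketch assumes parsed tracking files with a `files` mapping and a dict-based store; the function names are hypothetical.

```python
def files_to_clean(previous: dict, current: dict) -> list:
    """Names listed in the previously read tracking file but not the current one."""
    return sorted(set(previous["files"]) - set(current["files"]))


def clean_deleted_files(store: dict, directory: str,
                        previous: dict, current: dict) -> list:
    """Delete the data of files that disappeared between two tracking files."""
    prefix = directory.rstrip("/") + "/"
    names = files_to_clean(previous, current)
    for name in names:
        store.pop(prefix + name, None)    # deferred physical deletion
    return names


store = {"d/a": b"1", "d/b": b"2", "d/c": b"3"}
previous = {"files": {"a": {}, "b": {}, "c": {}}}
current = {"files": {"b": {}}}
cleaned = clean_deleted_files(store, "d", previous, current)
```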
In an optional manner, the amount of file data under the directory may also be counted according to the directory tracking file. Specifically, the amount of file data under the directory to be tracked is counted according to the file size information of each file contained in the directory tracking file that was read.
In the prior art, obtaining the amount of file data in a directory requires iterating over the directory in the object storage and fetching the size of every file in turn. The complexity of obtaining the directory data volume on object storage is positively correlated with the number of subdirectories and files under the directory: the more subdirectories and files there are, the greater the network throughput and the latency of obtaining the data volume, the higher the complexity, and the poorer the performance. With the method provided by this embodiment of the invention, there is no need to query the size of each file under the directory to be tracked; the amount of file data in the directory to be tracked can be obtained simply by parsing its directory tracking file.
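As a small illustration of this statistic, the total can be computed from the sizes recorded in a parsed tracking file without touching the individual objects. The `files` mapping and `size` field are assumed names.

```python
def directory_data_volume(tracking: dict) -> int:
    """Sum the file sizes recorded in a parsed directory tracking file."""
    return sum(info["size"] for info in tracking["files"].values())


tracking = {"files": {"part-0": {"size": 10}, "part-1": {"size": 20}}}
total = directory_data_volume(tracking)
```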
In an alternative way, in order to manage the directory at a finer granularity, information for authenticating access to the directory is also added to the directory tracking file, specifically a permission identifier, an owner identifier and/or a group identifier. The identity of the initiator of the management requirement is authenticated against this information, and the initiator is allowed to obtain the other file information in the directory tracking file only if authentication passes. Information for directory quota management may also be added to the directory tracking file, including the maximum capacity and/or the maximum number of files, so that the initiator of the management requirement can learn the capacity status of the directory to be tracked.
In the prior art, the directory operation based on the object storage cannot be managed, coarse granularity management can be performed on the bucket level when the object storage is used for creating the bucket, for example, the reading and writing of another sub account number are directly limited, and for example, the size of one bucket is directly and coarsely limited, and once the bucket is created, the set permissions cannot be modified. Let alone fine-grained directory management of HDFS semantics such as directory authentication, quota, and file expiration policies. The embodiment of the invention can also realize directory authentication, quota and file expiration strategies by expanding the information such as authentication, quota and file expiration in the directory tracking file.
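A minimal sketch of how the authentication and quota fields might be evaluated is given below. It assumes, purely for illustration, that the permission identifier is encoded as POSIX-style mode bits and that the quota is expressed as a maximum file count and a maximum byte count; the class and function names are hypothetical and not taken from the embodiment.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryPolicy:
    """Authentication and quota fields assumed to be carried in the tracking file."""
    owner: str
    group: str
    permission: int                 # assumed POSIX-style bits, e.g. 0o750
    max_files: int
    max_bytes: int
    files: dict = field(default_factory=dict)  # file name -> size in bytes


def can_read(policy: DirectoryPolicy, user: str, groups: list) -> bool:
    """Check the read bit that applies to the requesting user."""
    if user == policy.owner:
        return bool(policy.permission & 0o400)
    if policy.group in groups:
        return bool(policy.permission & 0o040)
    return bool(policy.permission & 0o004)


def within_quota(policy: DirectoryPolicy, name: str, size: int) -> bool:
    """Check whether writing (or overwriting) one file keeps the directory in quota."""
    count = len(policy.files) + (0 if name in policy.files else 1)
    total = sum(policy.files.values()) - policy.files.get(name, 0) + size
    return count <= policy.max_files and total <= policy.max_bytes


policy = DirectoryPolicy("alice", "analysts", 0o750, max_files=2,
                         max_bytes=4096, files={"File_A": 1024})
```

Because both checks read only the tracking file, they can be applied per directory rather than per bucket.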
In another alternative manner, the information recorded in the directory tracking file may be customized according to the management requirement: each information item contained in the directory tracking file is determined according to the dimensions of the directory management requirement, and the updated directory tracking file corresponding to the directory to be tracked is then generated from the information in the target memory that corresponds to each of those information items. Examples of information items are: file name, file size, file expiration time, permission identifier, owner identifier, maximum number of files, maximum capacity, timestamp, and so on. For example, the information items of multiple dimensions to be recorded in the directory tracking file are determined according to the type of the directory to be tracked; fixed information is added to the target memory in advance, while dynamic information is updated in the target memory whenever a file is updated. In this way, a user can customize the file information recorded in the directory tracking file, which meets the initiator's personalized requirements for tracking the directory.
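The customization of information items can be sketched as a projection of the in-memory records onto a configured list of items. The JSON serialization, the field names and the function `build_tracking_file` are assumptions for illustration, not the format defined by the embodiment.

```python
import json
import time


def build_tracking_file(memory: dict, items: tuple, now=None) -> str:
    """Serialize only the configured information items for every tracked file."""
    generated_at = time.time() if now is None else now
    files = [
        {item: info[item] for item in items if item in info}
        for info in memory.values()
    ]
    return json.dumps({"generated_at": generated_at, "files": files}, sort_keys=True)


# Target memory holding more dimensions than this directory needs to publish.
memory = {
    "File_C": {"name": "File_C", "size": 10, "expires_at": 1700003600, "owner": "alice"},
}
text = build_tracking_file(memory, items=("name", "size"), now=1700000000)
```

Changing the `items` tuple changes what is published without changing how the target memory is maintained.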
According to the data writing method provided by the embodiment of the invention, data is written directly into the directory, a directory tracking file identifying the file information of the directory is generated dynamically, and dirty data under the directory is cleaned periodically by acquiring and parsing the directory tracking file. Dirty data under the directory is thus avoided, and so is the directory renaming operation (that is, the copying and deletion of large numbers of data blocks in the object store), which eliminates the time-consuming migration of large numbers of data blocks in the object store. As a result, the performance of writing massive data into the object store is greatly improved, the problem of operation atomicity is solved, and inconsistent user views of the directory are avoided. Further, multiple network queries over subdirectories are replaced by acquiring and parsing a single directory tracking file, which greatly improves the performance of querying the directory data amount.
Fig. 3 shows a schematic structural diagram of a data writing device according to an embodiment of the present invention, as shown in fig. 3, the device includes:
a monitoring module 31 adapted to monitor file update events of directories to be tracked in the storage system;
the information updating module 32 is adapted to update information in the target memory according to the file information of the update target file corresponding to the file update event;
The file deleting module 33 is adapted to delete the directory tracking file corresponding to the directory to be tracked stored in the storage system;
the file generating module 34 is adapted to generate an updated directory tracking file corresponding to the directory to be tracked according to the information in the target memory;
an uploading module 35 adapted to upload the updated catalog tracking file to the storage system;
the obtaining module 36 is adapted to read, when monitoring a management requirement of the directory to be tracked, a directory tracking file corresponding to the directory to be tracked from the storage system;
the data cleaning module 37 is adapted to clean invalid file data under the to-be-tracked directory according to the read directory tracking file.
In an alternative way, the information update module 32 is further adapted to: if a file adding event of the catalog to be tracked is monitored, adding file information of a new file corresponding to the file adding event into a target memory; if a file deleting event of the catalog to be tracked is monitored, deleting the file information of the file to be deleted corresponding to the file deleting event from the target memory.
In an alternative manner, the directory tracking file further includes generation time information; the acquisition module 36 is further adapted to: if a plurality of directory tracking files corresponding to the directory to be tracked are stored in the storage system, read the directory tracking file whose generation time information is closest to the current time.
In an alternative way, the directory trace file further comprises: file expiration time information of each file under the catalog to be tracked; the data cleaning module 37 is further adapted to: and deleting the data blocks of the expired files under the directory to be tracked according to the file expiration time information of each file contained in the read directory tracking file.
In an alternative, the data cleaning module 37 is further adapted to: determine the files to be cleaned under the directory to be tracked according to the directory tracking file read this time and the directory tracking file read last time, and delete the data blocks of the files to be cleaned.
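The two cleaning strategies of the data cleaning module can be sketched as follows. The dictionary layout with an optional `expires_at` field is an assumption for illustration, and the functions only return the names of the files to clean; the deletion of data blocks itself is left out.

```python
def expired_files(tracking: dict, now: float) -> list:
    """Names of files whose recorded expiration time has passed."""
    return sorted(
        f["name"] for f in tracking["files"]
        if f.get("expires_at") is not None and f["expires_at"] <= now
    )


def removed_since(previous: dict, current: dict) -> list:
    """Names listed in the previously read tracking file but not in the current one."""
    before = {f["name"] for f in previous["files"]}
    after = {f["name"] for f in current["files"]}
    return sorted(before - after)


previous = {"files": [{"name": "File_A"}, {"name": "File_B", "expires_at": 50},
                      {"name": "File_C"}]}
current = {"files": [{"name": "File_A"}, {"name": "File_B", "expires_at": 50}]}
```

Files returned by `removed_since` are no longer visible through the tracking file, so their data blocks can be deleted in the background without affecting readers.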
In an alternative way, the directory trace file further comprises: file size information of each file under the catalog to be tracked; the apparatus further comprises: and the statistics module is suitable for counting the data quantity of the files under the catalog to be tracked according to the file size information of each file contained in the read catalog tracking file.
In an alternative, the apparatus further comprises: the information management module is suitable for determining each information item contained in the directory tracking file according to the directory management requirement dimension; the file generation module 34 is further adapted to: and generating an updated directory tracking file corresponding to the directory to be tracked according to each piece of information corresponding to each information item contained in the target memory.
FIG. 4a is a schematic diagram of a system architecture according to another embodiment of the present invention, as shown in FIG. 4a, a directory file tracking system is embedded in a big data computing engine, and includes a directory operation interface module, a directory tracking module, and a directory supervision center.
And the catalog operation interface module is responsible for registering the catalog to be tracked to the catalog supervision center and is the unique catalog operation entry of the big data calculation engine for the catalog to be tracked.
The catalog tracking module is responsible for managing catalog tracking files, and when the files under the catalog to be tracked are changed, the corresponding data in the memory, the local disk and the tracking files in the object storage are changed.
The directory supervision center is responsible for monitoring and managing the tracking directories, cleaning dirty data files and supervising a plurality of tracking directories.
Fig. 4b is a schematic structural diagram of a directory file tracking system according to another embodiment of the present invention, where, as shown in fig. 4b, a monitoring module, an information updating module, a file generating module, a file deleting module, an uploading module, and an information managing module in the data writing device are disposed in a directory tracking module in the directory file tracking system, and an obtaining module, a data cleaning module, and a statistics module in the data writing device are disposed in a directory supervision center in the directory tracking system.
FIG. 5a shows a schematic diagram of a format of a directory trace file in an embodiment of the present invention, and FIG. 5b shows a schematic diagram of a format of an extended directory trace file in an embodiment of the present invention, where the extended directory trace file includes more-dimensional file information for achieving fine-grained management based on object-store HDFS semantic directory operations.
First, the big data computing engine registers the directory to be tracked with the directory supervision center through the directory operation interface module; second, the directory operation interface module generates directory tracking file T1 in the memory and on the local disk through the directory tracking module. Directory tracking file T1 includes information such as the timestamp at which T1 was generated, file names, and file sizes.
After the directory tracking file T1 is generated, the directory tracking module uploads the local directory tracking file T1 to the object store.
When a new file is generated in the directory to be tracked, the new file name and file size are immediately added to the memory, the in-memory data is written out again as directory tracking file T2, directory tracking file T2 is uploaded to the object store, and directory tracking file T1 is deleted.
When a file is deleted from the directory to be tracked, the file name and file size are immediately deleted from the memory, the updated in-memory data is written out again as a local directory tracking file T1, directory tracking file T1 is uploaded to the object store, and directory tracking file T2 is deleted.
That is, one directory tracking file reflects the file contents of the corresponding directory at a given moment. When the directory changes (that is, files are added or deleted), the directory tracking file changes as well: the directory tracking file reflecting the directory contents switches from T1 to T2 or from T2 to T1. The old directory tracking file is not modified in place because data in the object store does not support modification; instead, a new directory tracking file is generated dynamically to replace the old one whenever the files under the directory are updated.
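The alternation between T1 and T2 can be sketched with a plain dictionary standing in for the object store. The class name `DirectoryTracker`, the key naming scheme `<directory>/.T1` and the JSON body are hypothetical; the sketch uploads the new tracking file before deleting the old one, following the order of the description of the add and delete cases above.

```python
import json
import time


class DirectoryTracker:
    """Keeps the file listing in memory and republishes it on every change."""

    def __init__(self, store: dict, directory: str, clock=time.time):
        self.store = store          # dict standing in for the object store
        self.directory = directory
        self.clock = clock
        self.memory = {}            # file name -> file size
        self.current = None         # name of the tracking file now in the store
        self._publish()             # registration produces the first file, T1

    def _key(self, name: str) -> str:
        return f"{self.directory}/.{name}"

    def _publish(self) -> None:
        # Objects are immutable, so write a new tracking file under the other name.
        nxt = "T1" if self.current != "T1" else "T2"
        body = json.dumps({"generated_at": self.clock(), "files": self.memory},
                          sort_keys=True)
        self.store[self._key(nxt)] = body
        if self.current is not None:
            self.store.pop(self._key(self.current), None)
        self.current = nxt

    def add_file(self, name: str, size: int) -> None:
        self.memory[name] = size
        self._publish()

    def delete_file(self, name: str) -> None:
        self.memory.pop(name, None)
        self._publish()
```

With an empty `store`, creating a tracker publishes T1, a subsequent `add_file` replaces it with T2, and a subsequent `delete_file` switches back to T1.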
The directory supervision center acquires directory tracking files through the directory tracking module. The logic for acquiring the directory tracking file of a specified directory is as follows: first, determine whether a directory tracking file exists; if so, further determine whether two directory tracking files exist. If two exist, namely directory tracking file T1 and directory tracking file T2, compare their timestamps and read the one with the larger timestamp. If only one tracking file exists in the object store, either T1 or T2, that directory tracking file is selected and returned to the directory supervision center. The directory supervision center then dynamically and periodically manages the files under the directory according to the acquired directory tracking file.
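The selection logic above can be sketched as follows, again with a dictionary standing in for the object store. The key naming scheme and the `generated_at` field are assumptions for illustration.

```python
import json


def read_latest_tracking_file(store: dict, directory: str):
    """Return the parsed tracking file with the largest timestamp, or None."""
    candidates = []
    for name in ("T1", "T2"):
        body = store.get(f"{directory}/.{name}")
        if body is not None:
            candidates.append(json.loads(body))
    if not candidates:
        return None                      # no tracking file exists yet
    # With a single candidate, max() simply returns it.
    return max(candidates, key=lambda t: t["generated_at"])
```

Two tracking files can coexist briefly, for example between the upload of the new file and the deletion of the old one, which is the case the timestamp comparison handles.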
The big data computing engine perceives the contents of the directory through the directory tracking file. Even if dirty data is generated under the target directory, the big data computing engine cannot perceive it before the directory file tracking system periodically cleans it up. The scheme of this embodiment therefore preserves data writing performance while keeping dirty data from affecting the engine.
For example, in the prior art, three files, child_A/File_A, child_B/File_B and File_C, are first written into a temporary directory, and the temporary directory is then renamed to the Parent directory. This renaming operation involves copying and deleting object store data blocks, which severely degrades the performance of writing massive data into the object store.
With the method of the embodiment of the invention, the three files /Parent/child_A/File_A, /Parent/child_B/File_B and /Parent/File_C are written directly into the target directory Parent. When the writing of these three files is completed, directory tracking file T1 is generated; it contains the generation timestamp, child_A/File_A (file name) and its file size, child_B/File_B (file name) and its file size, and File_C (file name) and its file size. When an operation deleting File_C under the directory Parent is detected at some moment, the data blocks of File_C do not need to be deleted immediately; only the data in the memory needs to be modified, and directory tracking file T2 is written to the object store as an atomic operation. Directory tracking file T2 contains the generation timestamp, child_A/File_A (file name) and its file size, and child_B/File_B (file name) and its file size. The directory supervision center acquires directory tracking file T2 and cleans up the expired file File_C. To compute the file data amount of the directory Parent, only directory tracking file T2 needs to be read and parsed; the child_A and child_B subdirectories under Parent do not need to be queried separately. Each change to the files under the directory causes the tracking file to be regenerated, with the tracking file name alternating between T1 and T2.
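The Parent example can be traced with concrete values. The file sizes and timestamps below are invented for illustration, and the dictionary layout is an assumption rather than the format of FIG. 5a.

```python
# Tracking file T1, generated after the three files are written directly to Parent.
t1 = {
    "generated_at": 1,
    "files": {"child_A/File_A": 100, "child_B/File_B": 200, "File_C": 300},
}

# Deleting File_C only edits the in-memory listing; no data block is touched yet.
memory = dict(t1["files"])
del memory["File_C"]

# Tracking file T2 is then published in place of T1.
t2 = {"generated_at": 2, "files": memory}

# The supervision center derives the files to clean and the directory size from
# the tracking files alone, without listing child_A or child_B.
to_clean = sorted(set(t1["files"]) - set(t2["files"]))
parent_size = sum(t2["files"].values())
```

Here `to_clean` contains only `File_C`, and `parent_size` is the sum of the two remaining file sizes.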
Currently, some optimizations have also been made to renaming operations on object storage systems, roughly in two ways. The first adds a database-like system between the object storage system and the computing and analysis engine to manage the mapping of directory file names; its operational complexity affects the robustness of the system. The second proposes, at the conceptual level, reconstructing the object storage server to add support for renaming operations, on a principle similar to having the server support modification of object keys. However, this method is complicated, and because the optimization is proposed only in concept, its practicability is low.
By adopting the data writing method provided by the embodiment of the invention, the performance of directory and file operations in big data computing engines can be improved. For example, HBase on HDFS uses a large number of table temporary directories, which mainly store intermediate results during Flush and Compaction. Taking Flush as an example, the HFile formed when KV data in the MemStore is flushed to disk is first generated under the tmp directory and, once complete, is moved from the tmp directory to the corresponding real file directory. When HBase uses object storage, renaming the table temporary directory may involve copying and deleting a large number of data blocks, and the operation is not atomic. The directory tracking technique solves these problems for HBase data: KV data in the MemStore is flushed directly into the actual file directory, and after all flush operations are completed the file name is added to the directory tracking file, at which point HBase can perform merge operations on the Store's HFiles. As another example, when Spark SQL stores Hive query results in object storage, it similarly writes to a tmp directory first and then renames the files under that directory into the final directory one by one. The directory file tracking technique can solve the low performance of Spark on Hive and the lack of atomicity when files are written out.
The system of the embodiment of the invention makes full use of existing computing and storage resources: it reuses the object storage system to persist the state of the directory tracking files, and the dynamic directory tracking logic is embedded in the big data computing engine. The custom directory tracking files are generated dynamically and alternately in the object storage system, and are acquired and parsed in the big data computing engine. While solving the performance and safety problems, the system also leaves the original business system architecture unchanged. In addition, the method can track the files of multiple directories at the same time; the system is small, depends little on computing resources, and is highly flexible.
Embodiments of the present invention provide a non-volatile computer storage medium storing at least one executable instruction that may perform the data writing method of any of the above-described method embodiments.
FIG. 6 illustrates a schematic diagram of an embodiment of a computing device of the present invention, and the embodiments of the present invention are not limited to a particular implementation of the computing device.
As shown in fig. 6, the computing device may include: a processor 602, a communication interface (Communications Interface) 604, a memory 606, and a communication bus 608.
Wherein: processor 602, communication interface 604, and memory 606 perform communication with each other via communication bus 608. Communication interface 604 is used to communicate with network elements of other devices, such as clients or other servers. The processor 602 is configured to execute the program 610, and may specifically perform relevant steps in the data writing method embodiment for a computing device.
In particular, program 610 may include program code including computer-operating instructions.
The processor 602 may be a central processing unit CPU or a specific integrated circuit ASIC (Application Specific Integrated Circuit) or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors included by the computing device may be the same type of processor, such as one or more CPUs; but may also be different types of processors such as one or more CPUs and one or more ASICs.
A memory 606 for storing a program 610. The memory 606 may comprise high-speed RAM memory or may further comprise non-volatile memory (non-volatile memory), such as at least one disk memory.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. The required structure for a construction of such a system is apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It will be appreciated that the teachings of the present invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided for disclosure of enablement and best mode of the present invention.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the above description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting the intention that: i.e., the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
Various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functionality of some or all of the components according to embodiments of the present invention may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). The present invention can also be implemented as an apparatus or device program (e.g., a computer program and a computer program product) for performing a portion or all of the methods described herein. Such a program embodying the present invention may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specifically stated.

Claims (10)

1. A method of writing data, the method comprising:
monitoring a file updating event of a directory to be tracked in a storage system;
updating information in a target memory according to file information of an update target file corresponding to the file update event;
deleting the catalog tracking file corresponding to the catalog to be tracked stored in the storage system;
generating an updated catalog tracking file corresponding to the catalog to be tracked according to the information in the target memory and uploading the catalog tracking file to the storage system;
when the management requirement on the catalog to be tracked is monitored, the catalog tracking file corresponding to the catalog to be tracked is read from the storage system, and invalid file data under the catalog to be tracked is cleaned according to the read catalog tracking file.
2. The method of claim 1, wherein updating the information in the target memory according to the file information of the update target file corresponding to the file update event further comprises:
if a file adding event of the catalog to be tracked is monitored, adding file information of a new file corresponding to the file adding event into the target memory;
If the file deleting event of the catalog to be tracked is monitored, deleting the file information of the file to be deleted corresponding to the file deleting event from the target memory.
3. The method of claim 1 or 2, wherein the directory tracking file further comprises generation time information; the reading the directory tracking file corresponding to the directory to be tracked from the storage system further includes:
and if a plurality of directory tracking files corresponding to the directory to be tracked are stored in the storage system, reading the directory tracking file with the generation time information closest to the current time information contained in the directory tracking files.
4. The method of claim 1, wherein the directory trace file further comprises: file expiration time information of each file under the catalog to be tracked; the step of cleaning the invalid file data under the catalog to be tracked according to the read catalog tracking file further comprises:
and deleting the data block of the expired file under the catalog to be tracked according to the file expiration time information of each file contained in the read catalog tracking file.
5. The method of claim 1, wherein the cleaning invalid file data under the directory to be tracked according to the read directory tracking file further comprises:
and determining the file to be cleaned under the directory to be tracked according to the directory tracking file read this time and the directory tracking file read last time, and deleting the data block of the file to be cleaned.
6. The method of claim 1, wherein the directory trace file further comprises: file size information of each file under the catalog to be tracked; the method further comprises the steps of:
and counting the file data quantity under the catalog to be tracked according to the file size information of each file contained in the read catalog tracking file.
7. The method according to any one of claims 1-6, further comprising: determining each information item contained in the directory tracking file according to the directory management requirement dimension;
the generating the updated directory tracking file corresponding to the directory to be tracked according to the information in the target memory further includes:
and generating an updated directory tracking file corresponding to the directory to be tracked according to each piece of information corresponding to each information item contained in the target memory.
8. A data writing apparatus, the apparatus comprising:
The monitoring module is suitable for monitoring file updating events of the catalogues to be tracked in the storage system;
the information updating module is suitable for updating the information in the target memory according to the file information of the update target file corresponding to the file update event;
the file deleting module is suitable for deleting the catalog tracking file corresponding to the catalog to be tracked stored in the storage system;
the file generation module is suitable for generating an updated directory tracking file corresponding to the directory to be tracked according to the information in the target memory;
the uploading module is suitable for uploading the updated directory tracking file to a storage system;
the acquisition module is suitable for reading the catalog tracking file corresponding to the catalog to be tracked from the storage system when the management requirement on the catalog to be tracked is monitored;
and the data cleaning module is suitable for cleaning the invalid file data under the to-be-tracked directory according to the read directory tracking file.
9. A computing device, comprising: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
the memory is configured to store at least one executable instruction, where the executable instruction causes the processor to perform operations corresponding to the data writing method according to any one of claims 1 to 7.
10. A computer storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the data writing method of any of claims 1-7.
CN202211455826.8A 2022-11-21 2022-11-21 Data writing method and device and computing equipment Pending CN116263758A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211455826.8A CN116263758A (en) 2022-11-21 2022-11-21 Data writing method and device and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211455826.8A CN116263758A (en) 2022-11-21 2022-11-21 Data writing method and device and computing equipment

Publications (1)

Publication Number Publication Date
CN116263758A (en) 2023-06-16

Family

ID=86722882

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211455826.8A Pending CN116263758A (en) 2022-11-21 2022-11-21 Data writing method and device and computing equipment

Country Status (1)

Country Link
CN (1) CN116263758A (en)

Similar Documents

Publication Publication Date Title
JP6669892B2 (en) Versioned hierarchical data structure for distributed data stores
JP7410181B2 (en) Hybrid indexing methods, systems, and programs
Liu et al. Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS
CN106547914B (en) Data acquisition management system and method thereof
EP3069274B1 (en) Managed service for acquisition, storage and consumption of large-scale data streams
US8078653B1 (en) Process for fast file system crawling to support incremental file system differencing
CN109960686B (en) Log processing method and device for database
US9367579B1 (en) System and method for maintaining a file change log within a distributed file system
US8131723B2 (en) Recovering a file system to any point-in-time in the past with guaranteed structure, content consistency and integrity
US20210103522A1 (en) Estimating worker nodes needed for performing garbage collection operations
US10417265B2 (en) High performance parallel indexing for forensics and electronic discovery
US20160267132A1 (en) Abstraction layer between a database query engine and a distributed file system
CN106484906B (en) Distributed object storage system flash-back method and device
US8452788B2 (en) Information retrieval system, registration apparatus for indexes for information retrieval, information retrieval method and program
US20070094312A1 (en) Method for managing real-time data history of a file system
US10769025B2 (en) Indexing a relationship structure of a filesystem
JP2006107446A (en) Batch indexing system and method for network document
US11188423B2 (en) Data processing apparatus and method
US20200310964A1 (en) Marking impacted similarity groups in garbage collection operations in deduplicated storage systems
CN112269781A (en) Data life cycle management method, device, medium and electronic equipment
CN110287201A (en) Data access method, device, equipment and storage medium
Hu et al. Extracting deltas from column oriented NoSQL databases for different incremental applications and diverse data targets
US9922043B1 (en) Data management platform
US10872073B1 (en) Lock-free updates to a data retention index
CN116263758A (en) Data writing method and device and computing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination