CN116126209A - Data storage method, system, device, storage medium and program product - Google Patents

Data storage method, system, device, storage medium and program product

Info

Publication number
CN116126209A
Authority
CN
China
Prior art keywords
data
storage
uploading
local
local disk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111341519.2A
Other languages
Chinese (zh)
Inventor
段蒙
蒋杰
邵赛赛
马骏杰
齐赫
章超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111341519.2A
Publication of CN116126209A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601: Interfaces specially adapted for storage systems
    • G06F 3/0602: Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/0604: Improving or facilitating administration, e.g. storage management
    • G06F 3/0628: Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0646: Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F 3/0647: Migration mechanisms
    • G06F 3/065: Replication mechanisms
    • G06F 3/0652: Erasing, e.g. deleting, data cleaning, moving of data to a wastebasket
    • G06F 3/0668: Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/067: Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G06F 3/0671: In-line storage system
    • G06F 3/0673: Single storage device
    • G06F 3/0674: Disk device
    • G06F 3/0676: Magnetic disk device
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data storage method, system, device, storage medium and program product, belonging to the field of computer data processing. The method comprises the following steps: receiving distributed storage data sent by an execution node, wherein the distributed storage data is data stored by the execution node in a distributed storage mode; transferring the distributed storage data to a local disk for storage, wherein the local disk is a local storage area corresponding to the current device; in response to the local disk meeting a data uploading condition, uploading the local storage data in the local disk to a cloud storage system according to a preset uploading strategy, wherein the preset uploading strategy is determined based on the average file size in the local disk; and in response to the completion of the uploading, clearing the local storage data from the local disk. The method and device make the local storage system and the cloud storage system cooperate, and solve the problem that big-data workloads cannot be effectively deployed in online-offline co-located (mixed-deployment) clusters because storage and computation cannot be separated.

Description

Data storage method, system, device, storage medium and program product
Technical Field
The present invention relates to the field of computer data processing, and in particular, to a data storage method, system, apparatus, storage medium and program product.
Background
With the advent of the big data age, a single computer has limited performance and cannot complete the more complex computing tasks, so the concept of the distributed computing framework was introduced. Rather than requiring one enormous computer, a distributed computing framework organizes ordinary servers into a computing cluster and provides a parallel computing software framework that handles complex operations such as inter-server communication, load balancing, task computation and processing, and task storage.
In the related art, a Shuffle service node in the distributed computing framework generally stores data either directly to a local disk or directly to the Hadoop Distributed File System (HDFS).
However, whether the data is written directly to a local disk or to HDFS, it is stored on the individual Map (mapping) nodes, so computation and storage cannot be separated, and the workload therefore cannot be effectively deployed in an online-offline co-located cluster.
Disclosure of Invention
The embodiment of the application provides a data storage method, a system, a device, a storage medium and a program product, which are used for improving data processing efficiency. The technical scheme is as follows:
in one aspect, a data storage method is provided, applied to a transit service node in a distributed computing framework, and the method includes:
receiving distributed storage data sent by an execution node, wherein the distributed storage data is stored by the execution node in a distributed storage mode;
transferring the distributed storage data to a local disk for storage, wherein the local disk is a local storage area corresponding to the current equipment;
in response to the local disk meeting a data uploading condition, uploading local storage data in the local disk to a cloud storage system according to a preset uploading strategy, wherein the preset uploading strategy is determined based on the average file size in the local disk;
and in response to the completion of the uploading of the local storage data, clearing the local storage data from the local disk.
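The four claimed steps (receive, store locally, upload on a condition, clear) can be illustrated with a minimal Python sketch. All names are illustrative, not from the patent, and the upload condition used here (total local size exceeding a threshold) is an assumed example of "the local disk meeting a data uploading condition":

```python
class TransitServiceNode:
    """Illustrative sketch of the claimed method: receive distributed storage
    data, store it on the local disk, upload when a condition is met, clear."""

    def __init__(self, cloud, upload_threshold_bytes):
        self.local_disk = {}               # file name -> bytes, stands in for the local disk
        self.cloud = cloud                 # dict standing in for the cloud storage system
        self.upload_threshold = upload_threshold_bytes

    def receive(self, name, data):
        # Steps 1 and 2: receive the data and transfer it to the local disk.
        self.local_disk[name] = data
        if self._meets_upload_condition():
            self._upload_and_clear()

    def _meets_upload_condition(self):
        # Step 3 trigger (assumed): total local size reaches a threshold.
        return sum(len(v) for v in self.local_disk.values()) >= self.upload_threshold

    def _upload_and_clear(self):
        # Steps 3 and 4: upload everything, then clear the local disk.
        for name, data in self.local_disk.items():
            self.cloud[name] = data
        self.local_disk.clear()
```

In this sketch the cloud system is a plain dictionary; in the patent it would be a remote store such as HDFS.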
In one aspect, a data storage system is provided for use in a distributed computing framework, the system comprising:
the execution node is used for performing task processing on the candidate data to obtain distributed storage data, and for transmitting the distributed storage data to a transit service node, wherein the distributed storage data is stored in a distributed storage mode;
the transit service node is used for receiving the distributed storage data; transferring the distributed storage data to a local disk for storage, wherein the local disk is a local storage area corresponding to the current device; and, in response to the local disk meeting a data uploading condition, uploading local storage data in the local disk to a cloud storage system according to a preset uploading strategy, wherein the preset uploading strategy is determined based on the average file size in the local disk;
the cloud storage system is used for receiving the local storage data uploaded by the local disk;
and the transit service node is further used for clearing the local storage data from the local disk in response to the completion of the uploading of the local storage data.
In one aspect, there is provided a data storage device, the device comprising:
the receiving module is configured to receive distributed storage data sent by an executing node, wherein the distributed storage data is data stored by the executing node in a distributed storage mode;
The storage module is configured to transfer the distributed storage data to a local disk for storage, wherein the local disk is a local storage area corresponding to the current equipment;
the uploading module is configured to, in response to the local disk meeting a data uploading condition, upload the local storage data in the local disk to a cloud storage system according to a preset uploading strategy, wherein the preset uploading strategy is determined based on the average file size in the local disk;
and the clearing module is configured to clear the local storage data from the local disk in response to the completion of the uploading of the local storage data.
In one aspect, a computer device is provided that includes a processor and a memory. The memory stores at least one instruction, at least one program, a set of codes, or a set of instructions that are loaded and executed by the processor to implement the data storage method as described above.
In one aspect, a computer readable storage medium is provided, having stored therein at least one instruction, at least one program, a code set, or an instruction set that is loaded and executed by a processor to implement the data storage method described above.
In one aspect, a computer program product is provided. The computer program product comprises computer programs/instructions stored in a computer readable storage medium; a processor of a computer device reads the computer programs/instructions from the computer readable storage medium and executes them, causing the computer device to implement the data storage method described above.
The technical solutions provided by the embodiments of the present application have the following beneficial effects:
A data transmission process is established between the local disk and the cloud storage system, and the distributed storage data stored on the local disk is uploaded to the cloud storage system according to a preset uploading strategy, so that the local storage system and the cloud storage system cooperate. This solves the problem that big-data workloads cannot be effectively deployed in online-offline co-located clusters because storage and computation cannot be separated, and improves data processing efficiency to a certain degree.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a data storage system provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a multi-level storage management unit provided based on FIG. 1;
FIG. 3 is a flow chart of a method for storing data according to an embodiment of the present application;
FIG. 4 is a schematic diagram of data uploading based on FIG. 3;
FIG. 5 is a schematic diagram of another data uploading based on FIG. 3;
FIG. 6 is a flow chart of a data storage method according to another embodiment of the present application;
FIG. 7 is a flow chart of a method of proactive upload based on the one shown in FIG. 6;
FIG. 8 is a flow chart of a data storage method according to another embodiment of the present application;
FIG. 9 is a block diagram of a data storage device according to one embodiment of the present application;
FIG. 10 is a block diagram of a data storage device according to another embodiment of the present application;
FIG. 11 is a schematic diagram of a server provided in an exemplary embodiment of the present application;
fig. 12 is a server framework diagram provided in an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
First, the terms involved in the embodiments of the present application are described:
Distributed computing framework: computing tasks are distributed across a plurality of servers, each of which bears part of the computing and data storage work. During distributed computation, the framework must also fetch the data corresponding to each task, merge the computation results, and perform rollback operations for failed computations.
The embodiments of the present application are mainly described using the MapReduce distributed computing framework as an example. MapReduce is a batch-oriented distributed computing framework that is mainly divided into four stages: Split, Map, Shuffle and Reduce.
The Split node mainly divides the input metadata into data blocks of preset length and hands the divided blocks to the Map nodes for task processing; the number of splits determines the number of Map node tasks.
The Map node mainly maps the data blocks sliced by the Split node to obtain an intermediate calculation result.
The Shuffle node is the intermediate link between the Map node and the Reduce node, and is mainly responsible for Partition (partitioning), Sort (sorting), Spill (overflow writing), Merge (merging), Fetch (fetching) and other work. Partitioning determines which partition each piece of data output by a Map task belongs to, and hence which Reduce task will process it; the number of Reduce tasks determines the number of partitions.
The specific execution flow is as follows: the metadata is first segmented by the Split node into Blocks; a corresponding Map node computes an intermediate result on each Block and stores the result in key-value form; the Shuffle node then takes a hash of each key modulo the number of Reduce tasks and forms the corresponding files according to the number of Reduce nodes; each Reduce node fetches the corresponding files produced by the Map nodes, merges them after reading completes, and stores the merged files on the corresponding local disk; the local storage data is then uploaded from the local disk to the cloud storage system, where the cloud storage system in the embodiments of the present application comprises at least one of HDFS, COS, S. In the embodiments of the present application, the transit service node corresponds to the Shuffle node, and the execution node corresponds to the Map node.
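The hash-modulo partitioning step described above (hashing the key modulo the number of Reduce tasks, so that each Map output record lands in the file for exactly one Reduce task) can be sketched as follows; the function names are illustrative:

```python
def partition_for_key(key, num_reduce_tasks):
    """Assign a key-value record to a Reduce partition by hashing the key
    modulo the number of Reduce tasks, as in the flow described above."""
    return hash(key) % num_reduce_tasks

def shuffle(map_output, num_reduce_tasks):
    """Group Map-node output records into one file (here, a list) per
    Reduce partition; every record with the same key lands in the same file."""
    files = {p: [] for p in range(num_reduce_tasks)}
    for key, value in map_output:
        files[partition_for_key(key, num_reduce_tasks)].append((key, value))
    return files
```

Because all records sharing a key are routed to the same partition, each Reduce task can merge its files knowing it has every value for the keys it owns.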
Distributed storage data: data obtained by the execution node performing task processing on the metadata (the metadata corresponds to the candidate data). Optional task processing includes, but is not limited to, data storage, data calculation, data merging, data screening, data identification and data division.
In the related art, when data is processed using a distributed computing framework, the distributed storage data is usually either stored directly on a local disk or written directly into HDFS. If the transit service node (Shuffle service node) stores the distributed storage data directly on its local disk, clusters with smaller local disks (for example, computing clusters and online service clusters) easily reach their capacity limit, resulting in a higher data overflow rate. If the distributed storage data is written directly into HDFS, then a large number of small Shuffle jobs (understood here as tasks with smaller distributed storage data) require a large number of small files to be written frequently into HDFS, adding a large amount of work to the transit service node (Shuffle service node) and reducing its performance.
The embodiments of the present application introduce a multi-level storage architecture combining the local disk and the cloud storage system, make local storage and cloud storage cooperate, and provide a multi-level cache management mode. This solves the problem of clusters with limited local disk capacity, lightens the workload of the transit service node (Shuffle service node), and improves its performance.
In the embodiments of the present application, the data storage system/method can be implemented using cloud technology, where cloud technology refers to hosting technology that unifies a series of resources such as hardware, software and networks in a wide area network or local area network to realize the calculation, storage, processing and sharing of data.
Cloud technology is the general name for the network technology, information technology, integration technology, management platform technology, application technology and so on applied under the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Background services of technical network systems require large amounts of computing and storage resources, for example video websites, picture websites and many portals. With the rapid development and application of the internet industry, each item may have its own identification mark in the future, which will need to be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data need strong backing system support, which can only be realized through cloud computing.
Cloud technology comprises cloud computing, cloud storage, databases and big data. Cloud computing is a delivery and usage mode of IT infrastructure, referring to obtaining the required resources through the network in an on-demand, easily scalable manner; cloud computing in the broad sense is a delivery and usage mode of services, referring to obtaining the required services through the network in an on-demand, easily scalable manner. Such services may be IT, software or internet related, or other services. Cloud computing is a product of the fusion and development of traditional computer and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage, virtualization and load balancing. With the development of the internet, real-time data streams and the diversification of connected devices, and driven by demand for search services, social networks, mobile commerce and open collaboration, cloud computing has developed rapidly. Unlike earlier parallel distributed computing, the emergence of cloud computing will promote a revolutionary change in the entire internet model and enterprise management model.
Cloud storage is a new concept extended and developed from the concept of cloud computing. A distributed cloud storage system (hereinafter referred to simply as a storage system) is a storage system that, through functions such as cluster applications, grid technology and distributed storage file systems, brings together a large number of storage devices of various types in a network (storage devices are also called storage nodes) to work cooperatively through application software or application interfaces, jointly providing data storage and service access functions externally. At present, the storage method of the storage system is as follows: when a logical volume is created, each logical volume is allocated physical storage space, which may consist of the disks of one or several storage devices. The client stores data on a certain logical volume, that is, the data is stored on a file system. The file system divides the data into many parts; each part is an object, which contains not only the data but also additional information such as a data identification (ID). The file system writes each object into the physical storage space of the logical volume and records the storage location information of each object, so that when the client requests access to the data, the file system can let the client access the data according to the storage location information of each object.
The process by which the storage system allocates physical storage space for a logical volume is as follows: the physical storage space is divided in advance into stripes according to the estimated capacity of the objects to be stored on the logical volume (an estimate that often leaves a large margin over the capacity actually needed) and according to the redundant array of independent disks (RAID) scheme. A logical volume can be understood as a stripe, and physical storage space is thereby allocated to the logical volume.
A database can be regarded as an electronic filing cabinet, a place for storing electronic files, in which users can add, query, update and delete the data. A "database" is a collection of data that is stored together in a way that can be shared by multiple users, has as little redundancy as possible, and is independent of the application. A database management system (DBMS) is computer software designed for managing databases, and generally has basic functions such as storage, retrieval, security and backup. Database management systems can be classified by the database model they support, such as relational or XML (Extensible Markup Language); by the type of computer supported, such as server clusters or mobile phones; by the query language used, such as SQL (Structured Query Language) or XQuery; or by their performance emphasis, such as maximum scale or maximum speed. Regardless of which classification is used, some DBMSs cross categories, for example supporting multiple query languages simultaneously.
Big data refers to data sets that cannot be captured, managed and processed with conventional software tools within a certain time range; it is a massive, high-growth and diversified information asset that requires new processing modes to provide stronger decision-making power, insight and process optimization capability. With the advent of the cloud era, big data has attracted more and more attention; big data requires special techniques to effectively process large amounts of data within a tolerable elapsed time. Technologies applicable to big data include massively parallel processing databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the internet and scalable storage systems.
Next, referring to FIG. 1, which shows a schematic diagram of a data storage system according to an embodiment of the present application, the data storage method provided by the embodiments of the present application is described in detail. The data storage system 100 includes an execution node 10, a transit service node 11, a local disk 12 corresponding to the transit service node 11, a cloud storage system 14, a registration node 15, a scheduling node 16 and a processing node (not shown in FIG. 1). The data interaction process between the nodes in the data storage system 100 is described in detail as follows:
The registration node 15 receives registration information corresponding to the transit service node 11. The registration information includes at least one of identity information (Identity Document, ID), partition information, quantity information and heartbeat information of the transit service node. The ID information uniquely identifies the transit service node; the partition information indicates the management service area corresponding to the transit service node 11; the quantity information indicates the number of tasks being executed by the transit service node 11 or the storage quantity (capacity) corresponding to the transit service node 11; and the heartbeat information, which the transit service node 11 reports to the registration node 15 at a fixed frequency, indicates the working state of the transit service node 11, where the working state includes a normal working state and an abnormal state. Optionally, the transit service node 11 sends registration information to the registration node 15 and keeps sending heartbeat information carrying its state information; or, the registration node 15 receives the registration information and heartbeat information sent by the transit service node 11 and determines the state information of the current transit service node according to the heartbeat information.
Optionally, the registration node 15 receives registration information sent by at least one transit service node and classifies the transit service nodes in a preset manner according to the registration information, where the preset manner includes classification by at least one of the partition information, the quantity information and the working state. For example, the registration node receives registration information corresponding to transit service node a, transit service node b and transit service node c, where transit service node a corresponds to partition 1, transit service node b corresponds to partition 1, and transit service node c corresponds to partition 3; classifying the transit service nodes a-c by partition yields (partition 1: transit service node a, transit service node b) and (partition 3: transit service node c).
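The partition-based grouping in the example above can be sketched as a simple classification of (node, partition) registration records; the function and field names are illustrative:

```python
from collections import defaultdict

def group_by_partition(registrations):
    """Classify transit service nodes by the partition named in their
    registration information, one of the preset grouping criteria above.
    `registrations` is a list of (node_id, partition) pairs."""
    groups = defaultdict(list)
    for node_id, partition in registrations:
        groups[partition].append(node_id)
    return dict(groups)
```

With the example registrations from the text (a and b in partition 1, c in partition 3), this yields the two groups described above.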
After each transit service node sends registration information to the registration node 15 to complete registration, the task of storing the candidate data is executed. The specific process is as follows: the scheduling node 16 sends the registration node 15 a list acquisition request indicating the list of transit service nodes that will handle the current candidate data; the scheduling node 16 distributes the acquired transit service node list to the corresponding execution nodes 10 according to the partition information; and each execution node 10 sends its distributed storage data (which is determined based on the candidate data) to the allocated transit service node according to the partition information.
The transit service node 11 aggregates the distributed storage data at partition-information granularity and stores it on the local disk 12. In response to the local disk 12 meeting the data uploading condition, the locally stored data on the local disk 12 is uploaded to the cloud storage system 14 according to a preset uploading strategy, where the preset uploading strategy is determined based on the average file size on the local disk 12.
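The patent states only that the uploading strategy is determined by the average file size on the local disk. One plausible reading, sketched below, is that many small files are merged into one batch upload (avoiding the small-file problem discussed earlier) while larger files are uploaded individually. This policy and every name in it are assumptions for illustration, not the patent's exact rule:

```python
def choose_upload_plan(files, small_file_threshold):
    """Pick an upload plan from the average file size on the local disk:
    if files are small on average, merge them into one batch upload;
    otherwise upload each file on its own. Hypothetical policy."""
    sizes = [len(data) for _, data in files]
    average = sum(sizes) / len(sizes)
    if average < small_file_threshold:
        # Small-file workload: one merged upload keeps write count low.
        return [("batch", [name for name, _ in files])]
    # Large-file workload: per-file uploads avoid oversized merged objects.
    return [("single", [name]) for name, _ in files]
```

The design point is that the decision is driven by the *average* size, so one large outlier among many tiny Shuffle files still yields a batched plan.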
The transit service node 11 further includes a buffer memory 17 configured to receive and store the distributed storage data sent from the executing node 10; optionally, after the storage capacity of the buffer memory 17 reaches a preset buffer capacity threshold, the distributed storage data stored in the buffer memory 17 is transferred to the local disk 12 for storage.
Optionally, after the uploading of the local storage data is completed, the transit service node 11 clears the local storage data from the local disk 12.
The processing node receives an acquisition request instructing it to acquire stored distributed storage data, determines the partition information corresponding to the distributed storage data to be acquired based on the request, and requests that data from the corresponding transit service node 11 according to the determined partition information. The specific acquisition process is as follows: the processing node searches the local disk 12 corresponding to the transit service node 11 for the distributed storage data to be acquired; if found, the data is acquired directly from the local disk 12; if not found, the processing node acquires the data from the cloud storage system 14 and feeds it back to the terminal device corresponding to the acquisition request.
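The local-first read path described above can be sketched with two dictionaries standing in for the local disk and the cloud storage system. This is an illustrative model under assumed names; the real lookup would go through directories and partition metadata rather than in-memory maps.

```python
def fetch_partition_data(partition, local_disk, cloud_storage):
    """Look up the requested partition's data on the local disk first;
    fall back to the cloud storage system if it is absent locally."""
    if partition in local_disk:
        return local_disk[partition], "local"
    return cloud_storage[partition], "cloud"

# Hypothetical contents: partition-1 is still on disk, partition-2 was uploaded.
local = {"partition-1": b"hot data"}
cloud = {"partition-1": b"hot data", "partition-2": b"cold data"}
```

The second element of the return value records which tier served the request, matching the two branches in the paragraph above.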
It should be noted that the system may be embedded into an existing distributed computing framework or used alone to form a new distributed computing framework; the application scenario of the system is not limited in this application. In the embodiments of the present application, the system is applied to a MapReduce distributed computing framework to store data.
In an alternative embodiment, referring specifically to fig. 2, fig. 2 shows a multi-level cache management unit provided in an embodiment of the present application, where the multi-level cache management unit 20 is configured to reflect a management flow of data in a distributed computing framework formed by the above system, and includes a metadata management unit 21, a write control unit 22, an upload unit 23, and a cleaning unit 24.
The metadata management unit 21 stores first directory information corresponding to the distributed storage data, first partition information corresponding to the executing nodes, second partition information corresponding to the uploaded distributed storage data, and reading certificates. The first directory information includes the directory dimensions under which the distributed storage data is stored, as well as the size information and quantity information of the files under the corresponding directory; optionally, the size information and quantity information may be calculated per single file or across all files. For example, the distributed storage data is stored under an index directory that contains files with a total size of 1 GB and a total count of 20. The first partition information includes the partition corresponding to each executing node, list information obtained by summarizing that partition information, and the capacity size corresponding to each partition. For example, executing node a corresponds to partition 1, executing node b corresponds to partition 2, executing node c corresponds to partition 2, and executing node d corresponds to partition 1; the metadata management unit 21 summarizes the partition information corresponding to executing nodes a to d to obtain a transit service list: (partition 1: executing node a, executing node d), (partition 2: executing node b, executing node c); the transit service list also includes the storage sizes corresponding to partition 1 and partition 2. The second partition information indicates the partition list and file size information corresponding to the distributed storage data uploaded from the local disk to the cloud storage system. The reading certificate indicates certification information that the read distributed storage data can be cleaned from the local disk; for example, when a reading certificate is carried in distributed storage data a, the transit service node directly cleans that data from the local disk according to the reading certificate. Optionally, the metadata management unit 21 further stores a directory mapping relationship reflecting the mapping between directories on the local disk and in the cloud storage system; for example, a local disk index directory corresponds to an index directory in the cloud storage system, where the local directory may have the same name as, or a different name from, the cloud directory, which is not limited in this application.
The write control unit 22 is configured to indicate the writing rule applied when writing distributed storage data into the cloud storage system and/or the local disk, where the writing rule includes a high water level and a low water level: writing of distributed storage data into the corresponding directory is blocked when the high water level is reached, and the writing state is restored when the storage space recovers to the low water level. The high and low water levels can be controlled by configuration parameters.
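The high/low water-level rule is a classic hysteresis gate: once usage crosses the high mark, writes stay blocked until usage drops back to the low mark, not merely below the high mark. A minimal sketch, with the threshold values as assumed configuration parameters:

```python
class WriteController:
    """Gate writes by disk usage with two water levels: block new writes
    once usage reaches the high level; resume only after usage falls
    back to the low level (hysteresis avoids rapid on/off flapping)."""

    def __init__(self, high=0.9, low=0.7):
        self.high, self.low = high, low
        self.blocked = False

    def allow_write(self, usage):
        if self.blocked and usage <= self.low:
            self.blocked = False      # storage space recovered: resume writing
        elif not self.blocked and usage >= self.high:
            self.blocked = True       # high water level reached: block writes
        return not self.blocked

wc = WriteController(high=0.9, low=0.7)
```

Note that a usage of 0.8 is allowed before the high mark is hit but still blocked afterwards, which is exactly the "restore only at the low water level" behavior in the paragraph above.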
The uploading unit 23 is configured to upload the distributed storage data from the local disk to the cloud storage system according to a preset policy, where the preset policy is determined based on the average file size in the local disk and includes a first uploading mode and a second uploading mode (see step 303 of the specification: the first uploading mode applies when the average file size reaches a preset threshold, and the second uploading mode applies when the average file size is smaller than the preset threshold); the first uploading mode corresponds to the first uploading subunit 230 and the second uploading mode corresponds to the second uploading subunit 231.
The cleaning unit 24 is configured to clean distributed storage data that meets a preset cleaning condition, where the preset cleaning condition includes active cleaning and passive cleaning: active cleaning instructs the transit service node to clean distributed storage data that has already been read, and passive cleaning instructs the transit service node to clean distributed storage data whose corresponding heartbeat information has timed out.
In summary, in the data storage system provided by the embodiments of the present application, a data transmission process is established between the local disk and the cloud storage system, and the distributed storage data stored on the local disk is uploaded to the cloud storage system according to a preset uploading policy. This achieves cooperation between local storage and the cloud storage system and solves the problem of poor data processing efficiency caused by the inability to separate storage from computation; it also enables the cloud storage system to support co-located (hybrid-deployment) clusters and solves the problem of limited local disk storage.
Referring to fig. 3, a data storage method provided in an embodiment of the present application is applied to a transit service node in the data storage system shown in fig. 1, and in the embodiment of the present application, the data storage system is applied to a MapReduce distributed computing framework, where the data storage method includes:
step 301, receiving distributed storage data sent by an executing node.
In this embodiment, the executing node is configured to perform task processing on the candidate data to obtain distributed storage data, where the task processing includes at least one of data storage, data calculation, data merging, data screening, data identification, and data division.
Optionally, when the executing node performs task processing, data aggregation processing is performed according to partition information provided by the scheduling node, or candidate data are combined after the executing node performs all tasks, so as to obtain distributed storage data.
In the embodiments of the present application, the executing nodes include at least two executing nodes, and the specific process of task processing by the at least two executing nodes includes at least one of the following modes:
firstly, candidate data are distributed to at least two execution nodes in batches to perform task processing, the at least two execution nodes perform task processing to obtain corresponding distributed storage data, before the at least two distributed storage data are sent to a transfer service node, the distributed storage data obtained by the at least two execution nodes are subjected to granularity aggregation according to partition information to obtain distributed storage data belonging to the same partition, and the distributed storage data belonging to the same partition are sent to the transfer service node.
Secondly, the candidate data are directly distributed to the designated execution nodes according to the partition information to perform task processing, namely, different execution nodes execute the data processing tasks under the corresponding partition, for example, the execution node A fixedly executes the data processing tasks from the partition a, and the execution node B fixedly executes the data processing tasks from the partition B; that is, the distributed storage data obtained by the transit service node is the data which has been aggregated according to the partition, and the transit service node does not need to perform the partition aggregation processing separately.
In the embodiment of the application, the execution nodes comprise at least two execution nodes, the transfer service node comprises at least two transfer service nodes, and in the process of obtaining the distributed storage data, the at least two execution nodes can aggregate the candidate data according to the partition to obtain the distributed storage data; the distributed storage data sent by the at least two execution nodes can be aggregated and stored by the at least two transit service nodes according to the partition; the candidate data may be subjected to primary granularity aggregation by at least two execution nodes, and the distributed storage data subjected to primary granularity aggregation may be subjected to secondary granularity aggregation by at least two transit service nodes, which is not limited in the application.
In this embodiment, the distributed storage data is data that is stored by the execution node in a distributed storage manner, where the distributed storage manner includes, but is not limited to, aggregate storage according to partition information, aggregate storage according to the execution node, and the like.
Step 302, the distributed storage data is transferred to a local disk for storage.
Optionally, the transit service node further includes at least one buffer and a corresponding local disk, where the local disk may indicate a local storage area bound to the transit service node, or may indicate a local storage area corresponding to the current device.
The transfer service node stores the received distributed storage data into a buffer, and after the distributed storage data stored in the buffer reaches a preset capacity threshold, the distributed storage data stored in the buffer is transferred to a corresponding directory in a local disk for storage.
Optionally, when the distributed storage data stored in the buffer reaches a preset capacity threshold or the upper limit of the buffer's storage capacity, the distributed storage data stored before the threshold was reached is transferred to the local disk for storage.
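The buffer-then-flush behavior of steps 301–302 can be sketched as follows. The class and its byte-counting threshold are illustrative assumptions; the patent's buffer is a memory region keyed by partition, abstracted here as a single list.

```python
class BufferedPartitionStore:
    """Accumulate distributed storage data in a buffer and transfer it
    to the local disk once the buffered size reaches a capacity threshold."""

    def __init__(self, threshold_bytes):
        self.threshold = threshold_bytes
        self.buffer = []
        self.buffered_size = 0
        self.local_disk = []          # stand-in for the on-disk directory

    def receive(self, chunk: bytes):
        self.buffer.append(chunk)
        self.buffered_size += len(chunk)
        if self.buffered_size >= self.threshold:
            self.flush()

    def flush(self):
        """Transfer everything buffered so far to the local disk."""
        self.local_disk.extend(self.buffer)
        self.buffer.clear()
        self.buffered_size = 0

store = BufferedPartitionStore(threshold_bytes=8)
store.receive(b"aaaa")   # 4 bytes: below threshold, stays buffered
store.receive(b"bbbb")   # reaches 8 bytes: flushed to the local disk
```

Batching writes this way is what lets the transit service node turn many small network-sized chunks into fewer, larger disk writes.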
In this embodiment, the distributed storage data carries heartbeat information corresponding to the execution node, and the transit service node receives the distributed storage data sent by the execution node in response to the heartbeat information being in a normal state; responding to the abnormal state of the heartbeat information, and ignoring the data sent by the executing node by the transit service node; the authenticity and the validity of the distributed data are ensured; optionally, each node in the data storage system provided in the embodiment of the present application sends its own heartbeat information to the next interaction node when performing data interaction, and the next interaction node selects to receive data/ignore data according to the received heartbeat information.
In this embodiment, after receiving the distributed storage data, the transit service node determines the data size of the distributed storage data, and transfers the distributed storage data to a local disk for storage in response to the distributed storage data being smaller than a preset data size; in response to the distributed storage data being greater than or equal to the preset data size, the distributed storage data is directly uploaded to the cloud storage system according to a preset uploading policy, wherein the specific description of the preset uploading policy is described in the following step 303, and the details are not described herein.
Optionally, in response to the distributed storage data being smaller than the preset data size, storing the distributed storage data in a directory corresponding to the local disk, merging the distributed storage data stored in the corresponding directory when the distributed storage data stored in the directory corresponding to the local disk reaches the specified data size, uploading the merged distributed storage data to the cloud storage system according to a preset uploading strategy, and fully combining the storage performances of the local disk and the cloud storage system to achieve the purpose of storing small data in the local disk and large data in the cloud storage system.
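The size-based routing described in the two paragraphs above (small data to the local disk, large data straight to the cloud) reduces to a single threshold comparison. A minimal sketch with assumed names; the cloud-side upload policy of step 303 is abstracted as a direct append:

```python
def route_data(data: bytes, size_limit: int, local_disk: list, cloud: list):
    """Store data smaller than the limit on the local disk; send data at
    or above the limit directly to the cloud storage system."""
    if len(data) < size_limit:
        local_disk.append(data)
        return "local"
    cloud.append(data)
    return "cloud"

local_disk, cloud = [], []
small = route_data(b"abc", size_limit=1024, local_disk=local_disk, cloud=cloud)
large = route_data(b"x" * 2048, size_limit=1024, local_disk=local_disk, cloud=cloud)
```

The design choice here is to keep small writes cheap (local disk) while sparing the disk from large objects that the cloud store handles better.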
Step 303, in response to the local disk meeting the data uploading condition, uploading the local storage data in the local disk to the cloud storage system according to a preset uploading strategy.
Optionally, the data uploading condition is used for indicating that a condition for uploading the distributed storage data from the local disk to the cloud storage system is met.
After the data uploading condition is met, the transit service node uploads the local storage data in the local disk to the cloud storage system according to a preset uploading policy, where the preset uploading policy is determined based on the average file size in the local disk, and the specific uploading process includes at least one of the following modes:
first, copying the local storage data to a cloud storage system in response to the average file size of the local storage data reaching a preset threshold.
Referring to fig. 4, fig. 4 shows a schematic diagram of a data uploading manner provided in an embodiment of the present application. The transit service node stores distributed storage data under the corresponding datafile and indexfile directories in the local disk 40. When the transit service node determines that the average file size of the data under the local disk 40 directory reaches a preset threshold, it directly copies all files under the aggregated directory corresponding to partition-n into the corresponding datafile and indexfile directories of the cloud storage system 41, and after the local storage data has been copied, or after it has been read, the uploaded/read files are deleted directly from the datafile and indexfile directories of the local disk 40. For example, when the local storage data is 1 GB (reaching the preset threshold), the transit service node directly copies the 1 GB of data and stores it under the corresponding directory of the cloud storage system 41. In the embodiments of the present application, this uploading mode is defined as a non-merge (non-combination) uploading mode.
Second, in response to the average file size of the local storage data being smaller than a preset threshold, the local storage data is read serially, and the target data in the local storage data that does not overlap with the data already stored in the cloud storage system is uploaded to the cloud storage system, where the target data is the data found, during the transit service node's retrieval of the local storage data, not yet to have been uploaded to the cloud storage system.
In the embodiments of the present application, the transit service node detects the average file size of the local storage data to be uploaded; if the average file size is smaller than the preset threshold, it obtains a stored-data list from the cloud storage system, checks against the retrieved list whether the local storage data to be uploaded is already stored, and appends the local storage data not yet uploaded (the target data) to the corresponding directory files of the cloud storage system.
Optionally, after retrieving the local storage data (target data) that is not uploaded, the relay service node merges the target data and obtains merged data, and the relay service node uploads the merged data to a directory corresponding to the cloud storage system.
For example, when the average file size of data B and data C is smaller than the preset threshold, data B and data C are merged to obtain data D, and the transit service node stores data D in the cloud storage system. Referring to fig. 5, fig. 5 shows a schematic diagram of another data uploading manner provided in an embodiment of the present application. The transit service node aggregates the distributed storage data by partition according to the partition information (partition-1, …, partition-k, …, partition-n), and the aggregated data is stored in the local disk 50. When the transit service node determines that the average file size of the distributed storage data stored for partition-1, …, partition-k, …, partition-n is smaller than the preset threshold, it merges the distributed storage data corresponding to these partitions by directory (datafile and indexfile) to obtain merged data, and uploads the merged data to the corresponding datafile and indexfile directories of the cloud storage system 51. In the embodiments of the present application, this uploading mode is defined as a combined (merge) uploading mode.
Third, the transit service node receives an uploading instruction when the used space of a specified directory reaches the high water level; the uploading instruction indicates uploading the local storage data in the local disk to the corresponding directory of the cloud storage system, forcing the upload without waiting for the local storage data to be read, and after the upload is completed, the data in the specified directory is cleared. For example, when the used space of the directory reaches the high water level, a forced-upload mode is used, and after the upload is completed, the directory files are cleared so that the used space of the directory becomes 0, where the high water level indicates that the used space of the directory has reached its upper limit.
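The first two uploading modes above hinge on a single statistic, the average file size. A minimal sketch of the mode selection and of the merge step, with all names assumed (the patent does not name these functions):

```python
def choose_upload_mode(file_sizes, threshold):
    """Select non-merge upload (direct copy) when the average file size
    reaches the threshold; otherwise select the combined (merge) upload."""
    average = sum(file_sizes) / len(file_sizes)
    return "non-merge" if average >= threshold else "merge"

def merge_upload(files, cloud_dir):
    """Combined uploading mode: merge the small local files into one
    blob and append it to the cloud-side directory."""
    merged = b"".join(files)
    cloud_dir.append(merged)
    return merged

cloud_dir = []
merged = merge_upload([b"ab", b"cd"], cloud_dir)
```

The rationale is that object stores handle one large write far better than many small ones, so small files are merged before upload while already-large files are copied as-is.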
And step 304, in response to the completion of the uploading of the local storage data, clearing the local storage data from the local disk.
Optionally, after the local storage data is uploaded to the cloud storage system, the cloud storage system feeds back an acknowledgement signal to the transit service node, and the transit service node deletes the uploaded local storage data from the corresponding directory of the local disk based on the acknowledgement signal.
Optionally, the transit service node reads the local storage data to be uploaded to the cloud storage system from the local disk; after reading is finished, it writes a corresponding reading certificate for the local storage data, uploads the data to the cloud storage system according to the reading certificate, and clears it from the local disk, where the reading certificate indicates that the local storage data has been completely read. In the embodiments of the present application, uploading the local storage data to the cloud storage system and clearing it from the local disk may be processed in parallel or performed sequentially, and the order of sequential execution is not limited.
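The read–certify–upload–clear sequence of steps 303–304 can be sketched as below. The dictionaries and the certificate marker are illustrative stand-ins for the on-disk directory, the cloud store, and the reading certificate of the metadata management unit.

```python
def upload_and_clear(local_disk: dict, cloud: dict, key: str):
    """Read the local data, record a reading certificate once the read
    completes, upload the data to the cloud store, then clear the local
    copy only if the certificate confirms the read finished."""
    data = local_disk[key]
    read_certificate = {key: "read-complete"}   # written after the read ends
    cloud[key] = data                           # upload
    if read_certificate.get(key) == "read-complete":
        del local_disk[key]                     # safe to clean locally
    return data

local = {"partition-1": b"payload"}
cloud = {}
uploaded = upload_and_clear(local, cloud, "partition-1")
```

Gating the deletion on the certificate is what prevents clearing data whose read or upload was interrupted partway through.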
In summary, in the data storage method provided by the embodiments of the present application, a data transmission process is established between the local disk and the cloud storage system, and the distributed storage data stored on the local disk is uploaded to the cloud storage system according to a preset uploading policy. This achieves cooperation between local storage and the cloud storage system, solves the problem that big-data workloads cannot be effectively deployed in online-offline co-located clusters because storage and computation cannot be separated, and improves data processing efficiency to a certain degree.
Referring to fig. 6, a data storage method provided in an embodiment of the present application is applied to a transit service node in the data storage system shown in fig. 1, and in the embodiment of the present application, the data storage system is applied to a MapReduce distributed computing framework, where the data storage method includes:
step 601, receiving distributed storage data sent by an executing node.
In this embodiment, the distributed storage data is data that is stored by the execution node in a distributed storage manner, where the distributed storage manner includes, but is not limited to, aggregate storage according to partition information, aggregate storage according to the execution node, and the like.
The specific flow of this step is the same as that of step 301, and this step is not repeated.
Step 602, transferring the distributed storage data to a local disk for storage.
And the transfer service node stores the received distributed storage data into a buffer, and after the distributed storage data stored in the buffer reaches a preset capacity threshold, the distributed storage data stored in the buffer is stored under a corresponding directory in the local disk.
Optionally, at least two executing nodes perform task processing on the candidate data to obtain corresponding distributed storage data. After the transit service node receives the distributed storage data, it classifies the data according to partition information and stores it in the corresponding buffers, that is, different buffers correspond to different partitions. For example, executing node A sends data a to d to the transit service node, executing node B sends data e to g to the transit service node, and the transit service node receives data a to d and data e to g and classifies them by partition to obtain the classification results (partition 1: data a, data e, data g) and (partition 2: data b, data c, data d, data f); the data corresponding to partition 1 is stored in buffer m, and the data corresponding to partition 2 is stored in buffer n.
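The per-partition buffering above is a routing step: each incoming record lands in the buffer for its partition. A minimal sketch mirroring the example (the record format and names are assumptions):

```python
def classify_into_buffers(records):
    """Route received records into per-partition buffers, so each buffer
    holds data for exactly one partition."""
    buffers = {}
    for partition, item in records:
        buffers.setdefault(partition, []).append(item)
    return buffers

# Mirrors the example: data a-g from nodes A and B, classified by partition.
records = [(1, "a"), (2, "b"), (2, "c"), (2, "d"),
           (1, "e"), (2, "f"), (1, "g")]
buffers = classify_into_buffers(records)
```

Each resulting list corresponds to one buffer (buffer m for partition 1, buffer n for partition 2 in the example), ready to be flushed to the local disk as a unit.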
The specific flow of this step is the same as that of step 302, and this step is not repeated.
Step 603, receiving a data uploading instruction, and uploading the locally stored data in the local disk to the cloud storage system according to a preset uploading strategy based on the data uploading instruction.
In the embodiment of the application, the data uploading instruction comprises an active uploading instruction and a passive uploading instruction.
In the embodiment of the present application, the specific process of receiving the active upload instruction for uploading the local storage data is referred to as step 603a and step 603b (refer to fig. 7).
Step 603a, receive an active upload instruction.
Optionally, the active uploading instruction is used for indicating to actively trigger an uploading instruction of the local storage data; the active uploading instruction comprises size information, quantity information and heartbeat information corresponding to the executing node of the local storage data to be uploaded.
Step 603b, uploading the local storage data to the cloud storage system according to a preset uploading strategy based on the active uploading instruction.
Optionally, the transfer service node reads the local storage data to be uploaded in the local disk based on the active uploading instruction, and uploads the read local storage data to the cloud storage system, where the preset uploading policy may be specifically referred to the step 303, which is not described herein. For example, a user actively selects a file in an index file in a local disk to upload, a transfer service node receives an uploading instruction, acquires the index file from the local disk, and uploads the file of the index file of the local disk to the index file in a cloud storage system.
The passive upload instructions include, but are not limited to, the following:
First, in response to the storage capacity in the local disk exceeding the storage threshold, the local storage data in the local disk is uploaded to the cloud storage system according to the preset uploading policy; this includes determining the average file size of the local storage data in the local disk and, based on the average file size, selecting the non-merge uploading mode or the combined uploading mode to upload the local storage data to the cloud storage system, where the two modes are described in step 303 above and are not repeated here.
Second, the local storage data is uploaded to the cloud storage system according to the preset uploading policy on a preset upload period, where the preset period may be an hour, a day, a week, or a month; for example, the transit service node uploads the local storage data on a weekly cycle.
Third, historically unused local storage data is uploaded to the cloud storage system, where historically unused local storage data refers to data that has not been operated on within a preset time period; the preset time period may be a specified number of days, weeks, months, or years. For example, the transit service node searches the local disk for local storage data that has not been used for three months and uploads the retrieved data to the cloud storage system according to the preset uploading policy, which prevents long-unused data from permanently occupying the storage space of the local disk and increases the usable storage capacity of the local disk. The preset uploading policy is described in step 303 above and is not repeated here.
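The three passive-upload triggers above (capacity exceeded, period elapsed, data idle) can be checked independently. A minimal sketch; the parameter names and the use of plain numeric timestamps are illustrative assumptions:

```python
def passive_upload_due(used_bytes, capacity_threshold,
                       now, last_upload, period_seconds,
                       last_access, idle_seconds):
    """Return the list of reasons a passive upload should fire:
    'capacity'  - local disk usage exceeds the storage threshold,
    'period'    - the preset upload period has elapsed,
    'idle'      - the data has not been accessed within the idle window."""
    reasons = []
    if used_bytes > capacity_threshold:
        reasons.append("capacity")
    if now - last_upload >= period_seconds:
        reasons.append("period")
    if now - last_access >= idle_seconds:
        reasons.append("idle")
    return reasons
```

Returning all applicable reasons (rather than the first match) lets a caller log or meter each trigger separately, which is useful when tuning the thresholds.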
In response to the local storage data upload being completed, the local storage data is purged from the local disk, step 604.
Optionally, after the local storage data is uploaded to the cloud storage system, the cloud storage system feeds back an acknowledgement signal to the transit service node, and the transit service node deletes the uploaded local storage data from the corresponding directory of the local disk based on the acknowledgement signal.
The specific implementation of step 604 is referred to above in step 304, and will not be described here again.
In summary, according to the data storage method provided by the embodiment of the application, a data transmission process is established between the local disk and the cloud storage system, distributed storage data stored to the local disk is uploaded to the cloud storage system according to a preset uploading strategy, the effect of cooperating the local storage and the cloud storage system is achieved, and the problem of poor data processing efficiency caused by incapability of separating storage and calculation is solved.
In the data storage method provided by the embodiment, the execution node performs primary aggregation on the candidate data to obtain the distributed storage data, and then performs secondary aggregation on the distributed storage data through the transit service node to obtain the local storage data, so that related data can be written in the cloud storage system efficiently.
Referring to fig. 8, a data storage method provided in an embodiment of the present application is applied to a transit service node in the data storage system shown in fig. 1, and in the embodiment of the present application, the data storage system is applied to a MapReduce distributed computing framework, where the data storage method includes:
step 801, distributed storage data sent by at least two executing nodes is received.
In this embodiment of the present application, the data storage system includes at least two executing nodes, and when performing task processing, the scheduling node obtains, from the registration node, a list corresponding to a transit service node required for executing the task processing, where the scheduling node distributes the list to at least two executing nodes, and the at least two executing nodes perform task processing on candidate data, where the task processing includes, but is not limited to, data storage, data calculation, data merging, data screening, data identification, and data division.
And after the candidate data are processed by the at least two execution nodes, corresponding distributed storage data are obtained, and the distributed storage data are sent to the corresponding transit service nodes.
Optionally, when the execution node performs task processing on the candidate data, only the calculation operation is performed, and the storage operation is not performed, that is, the execution node performs data calculation, and the transfer service node performs data storage, so as to realize calculation storage separation.
Step 802, aggregating the distributed storage data sent by at least two execution nodes according to the partition information to obtain partition aggregate storage data.
Optionally, the transfer service node aggregates the received distributed storage data according to the partition information to obtain partition aggregate storage data; or, aggregating the received distributed storage data according to the execution node to obtain node aggregate storage data.
Optionally, each node in the data storage system sends its own heartbeat information to the next node when performing data interaction processing, and the next node determines, according to the received heartbeat information, whether the processing has timed out. For example, the transit service node receives the distributed storage data of executing node A together with the corresponding heartbeat information, and determines from the heartbeat information that executing node A is abnormal, where the abnormality includes transmission interruption, transmission timeout, and the like; the transit service node then clears the abnormal node's locally stored data.
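The heartbeat-based timeout check can be sketched as follows, assuming a hypothetical table of last-seen timestamps and an illustrative timeout value:

```python
def find_stale_nodes(heartbeats, now, timeout):
    """Return executing nodes whose last heartbeat is older than
    `timeout` seconds; the transit service node would then clear the
    data buffered for those nodes. Field names and the timeout value
    are assumptions for illustration, not from the patent."""
    return [node for node, last_seen in heartbeats.items()
            if now - last_seen > timeout]

# exec-a last reported 60 s ago, exec-b 2 s ago.
heartbeats = {"exec-a": 100.0, "exec-b": 158.0}
stale = find_stale_nodes(heartbeats, now=160.0, timeout=30.0)
```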
Step 803, transferring the partition aggregate storage data to a local disk for storage.
Optionally, the partition aggregate storage data or the node aggregate storage data is first stored in a buffer in the transit service node, and is transferred from the buffer to a local disk for storage when the buffer reaches a preset storage capacity threshold.
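A minimal sketch of this buffer-then-spill behavior, with an assumed callback standing in for the local-disk write:

```python
class SpillBuffer:
    """Accumulate aggregate storage data in memory and spill it once a
    preset capacity threshold is reached. A sketch only; the transit
    service node's actual buffering logic is not specified in the patent."""

    def __init__(self, threshold_bytes, spill):
        self.threshold = threshold_bytes
        self.spill = spill          # callback that writes to the local disk
        self.chunks, self.size = [], 0

    def write(self, chunk):
        self.chunks.append(chunk)
        self.size += len(chunk)
        if self.size >= self.threshold:       # buffer reached threshold
            self.spill(b"".join(self.chunks)) # transfer to local disk
            self.chunks, self.size = [], 0    # buffer is now empty

spilled = []  # stands in for the local disk
buf = SpillBuffer(threshold_bytes=4, spill=spilled.append)
buf.write(b"ab")   # below threshold, stays in the buffer
buf.write(b"cd")   # reaches threshold, spills everything
```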
And step 804, uploading the partition aggregate storage data to the cloud storage system according to a preset uploading strategy in response to the partition aggregate storage data meeting the data uploading condition.
In this embodiment of the present application, the data uploading conditions include the following:
firstly, receiving an active uploading instruction, and actively uploading the partition aggregate storage data or the node aggregate storage data to the cloud storage system for storage; after the data is read or uploaded, the transit service node clears it from the local disk;
secondly, when the storage capacity in the local disk exceeds a threshold, uploading the partition aggregate storage data or the node aggregate storage data in the local disk to the cloud storage system for storage; or
thirdly, uploading the partition aggregate storage data or the node aggregate storage data to a cloud storage system in a preset period;
fourthly, uploading the partition aggregate storage data or the node aggregate storage data which are not used in the history to a cloud storage system for storage.
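The first three uploading conditions above can be combined into a single check; the fourth (historically unused data) requires access history and is omitted here. All parameter names and values are illustrative assumptions:

```python
def should_upload(active_instruction, disk_used, disk_threshold,
                  now, last_upload, period):
    """Decide whether locally stored data meets a data-uploading
    condition: an active upload instruction, local-disk capacity over
    its threshold, or the periodic upload timer firing. A sketch of
    the conditions described, not the patent's exact logic."""
    if active_instruction:                 # first condition
        return True
    if disk_used > disk_threshold:         # second condition
        return True
    if now - last_upload >= period:        # third condition
        return True
    return False

idle = should_upload(False, disk_used=10, disk_threshold=100,
                     now=5.0, last_upload=0.0, period=60.0)
full = should_upload(False, disk_used=150, disk_threshold=100,
                     now=5.0, last_upload=0.0, period=60.0)
```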
For the preset uploading policy, reference may be made to step 303, which is not described herein again.
In summary, in the data storage method provided by the embodiments of the present application, a data transmission process is established between the local disk and the cloud storage system, and the distributed storage data stored to the local disk is uploaded to the cloud storage system according to a preset uploading policy. This achieves cooperation between local storage and the cloud storage system, solves the problem that big data workloads cannot be effectively deployed in an online-offline co-located cluster because storage and computation cannot be separated, and improves data processing efficiency to a certain extent.
Referring to fig. 9, a flowchart of an implementation of a data storage system according to an embodiment of the present application is shown. In this embodiment, the data storage system is applied to a MapReduce distributed computing framework and includes an executing node, a transit service node, and a cloud storage system. It may be understood that there may be at least two executing nodes and likewise at least two transit service nodes; the data interaction process among the three is described in detail below.
S91, the execution node performs task processing on the candidate data to obtain distributed storage data.
Optionally, there are at least two executing nodes, and the candidate data is distributed to the at least two executing nodes for task processing, where the task processing includes but is not limited to data storage, data calculation, data merging, data screening, data identification, and data division; the at least two executing nodes process the candidate data in batches to obtain at least two pieces of distributed storage data.
Optionally, the candidate data are distributed to designated executing nodes for performing tasks according to the partition information, and the executing nodes process the candidate data belonging to the same partition according to task processing requirements to obtain corresponding distributed storage data.
S92, the executing node sends the distributed storage data to the transit service node.
Optionally, the executing node performs classification aggregation on the distributed storage data processed by the task according to the executing node type or according to the partition information, and sends the classified and aggregated distributed storage data to the transit service node.
Optionally, after the execution node processes the candidate data, the distributed storage data processed by the same execution node is aggregated, or the distributed storage data processed by the same partition is aggregated, and the aggregated distributed storage data is sent according to the type of the execution node or partition information.
S93, the transfer service node transfers the received distributed storage data to a local disk.
Optionally, the transfer service node receives the distributed storage data sent by the execution node, pre-stores the distributed storage data in the buffer, and transfers the distributed storage data in the buffer to a local disk for storage after the storage capacity of the buffer reaches a threshold value, where the local disk is used to indicate a local storage area corresponding to the current device or to indicate a local storage area bound with the transfer service node.
And S94, uploading the local storage data to the cloud storage system by the transit service node according to a preset uploading strategy.
Optionally, in response to the local storage data meeting the data uploading condition, the transit service node uploads the local storage data to the cloud storage system according to a preset uploading policy.
For the specific implementation process of step S94, reference may be made to step 303 above, which is not described herein again.
S95, the cloud storage system stores the local storage data under the corresponding directory file.
Optionally, the cloud storage system receives the local storage data sent by the transit service node, detects a directory dimension corresponding to the local storage data, and the directory dimension is used for indicating a folder corresponding to the local storage data.
And the cloud storage system stores the local storage data to a directory consistent with the storage position of the local disk.
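Keeping the cloud directory consistent with the local disk layout might look like the following path mapping; the prefix scheme and names are assumptions for illustration:

```python
def cloud_key(local_path, local_root, cloud_prefix):
    """Map a local-disk file path to a cloud storage key so that the
    cloud copy lands under a directory consistent with the local
    layout. The key format is an assumption, not the patent's scheme."""
    relative = local_path[len(local_root):].lstrip("/")
    return f"{cloud_prefix}/{relative}"

key = cloud_key("/data/shuffle/part-0/block-3",
                "/data/shuffle",
                "cloud://bucket/shuffle")
```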
In summary, in the data storage system provided by the embodiments of the present application, a data transmission process is established between the local disk and the cloud storage system, and the distributed storage data stored to the local disk is uploaded to the cloud storage system according to a preset uploading policy. This achieves cooperation between local storage and the cloud storage system, solves the problem that big data workloads cannot be effectively deployed in an online-offline co-located cluster because storage and computation cannot be separated, and improves data processing efficiency to a certain extent.
Referring to fig. 10, a block diagram of a data storage device according to an embodiment of the present application is shown, where the data storage device may be applied to a transit service node in the data storage system shown in fig. 1. The data storage device includes:
a receiving module 1010, configured to receive distributed storage data sent by an executing node, where the distributed storage data is data stored by the executing node in a distributed storage manner;
the storage module 1020 is configured to transfer the distributed storage data to a local disk for storage, where the local disk is a local storage area corresponding to the current device;
The uploading module 1030 is configured to upload the local storage data in the local disk to the cloud storage system according to a preset uploading policy in response to the local disk meeting a data uploading condition, where the preset uploading policy is a policy determined based on an average file size in the local disk;
and the clearing module 1040 is configured to clear the data from the local disk in response to the completion of the data uploading.
In an optional embodiment, the uploading module 1030 is further configured to copy and store the local storage data to the cloud storage system in response to the average file size of the local storage data reaching a preset threshold;
the uploading module 1030 is further configured to serially read the locally stored data in response to the average file size of the locally stored data being less than the preset threshold; and uploading target data which are not overlapped with the data stored in the cloud storage system in the local storage data to the cloud storage system.
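The average-file-size policy described by these two branches can be sketched as a mode selector; the threshold value and the mode names are assumptions, not terms from the patent:

```python
def choose_upload_mode(file_sizes, preset_threshold):
    """Pick an upload strategy from the average file size on the local
    disk: files large on average are copied directly to cloud storage,
    while small files are read serially (and may be merged) before
    upload. A sketch of the described policy; names are illustrative."""
    average = sum(file_sizes) / len(file_sizes)
    return "direct_copy" if average >= preset_threshold else "serial_merge"

large = choose_upload_mode([64, 128, 256], preset_threshold=100)
small = choose_upload_mode([1, 2, 3], preset_threshold=100)
```

Merging many small files before upload reduces per-object overhead in the cloud store, which is presumably why the serial branch merges non-duplicated target data first.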
In an alternative embodiment, as shown in fig. 11, the apparatus further comprises a merge module 1050;
the merging module 1050 is configured to merge target data that is not overlapped with data stored in the cloud storage system in the local storage data, so as to obtain merged data;
The upload module 1030 is further configured to upload the merged data to a cloud storage system.
In an alternative embodiment, the receiving module 1010 is further configured to receive an active upload instruction, where the active upload instruction is used to trigger uploading the locally stored data;
the uploading module 1030 is further configured to upload the local storage data to the cloud storage system according to the preset uploading policy based on the active uploading instruction.
In an optional embodiment, the uploading module 1030 is further configured to upload the local storage data in the local disk to the cloud storage system according to the preset uploading policy in response to the storage capacity in the local disk exceeding a storage threshold; or to upload the local storage data to the cloud storage system according to the preset uploading policy in a preset uploading period.
In an alternative embodiment, the receiving module 1010 is further configured to receive the distributed storage data sent by the at least two executing nodes;
the storage module 1020 is further configured to aggregate the distributed storage data sent by the at least two execution nodes according to partition information to obtain partition aggregate storage data; and transferring the partitioned aggregate storage data to the local disk for storage.
In an alternative embodiment, the distributed storage data is data obtained after the task processing of the execution node.
In summary, in the data storage device provided by the embodiments of the present application, a data transmission process is established between the local disk and the cloud storage system, and the distributed storage data stored to the local disk is uploaded to the cloud storage system according to a preset uploading policy. This achieves cooperation between local storage and the cloud storage system, solves the problem that big data workloads cannot be effectively deployed in an online-offline co-located cluster because storage and computation cannot be separated, and improves data processing efficiency to a certain extent.
Referring to fig. 12, fig. 12 is a schematic diagram illustrating the structure of a server according to an exemplary embodiment of the present application. The server 1200 has the data storage system shown in fig. 1 integrated therein.
The server 1200 includes a central processing unit (CPU) 1201, a system memory 1204 including a random access memory (RAM) 1202 and a read-only memory (ROM) 1203, and a system bus 1205 connecting the system memory 1204 and the central processing unit 1201. The server 1200 also includes a basic input/output (I/O) system 1206, which facilitates the transfer of information between devices within the computer, and a mass storage device 1207 for storing an operating system 1213, application programs 1214, and other program modules 1215.
The basic input/output system 1206 includes a display 1208 for displaying information and an input device 1209, such as a mouse, keyboard, etc., for user input of information. Wherein both the display 1208 and the input device 1209 are coupled to the central processing unit 1201 via an input-output controller 1210 coupled to a system bus 1205. The basic input/output system 1206 can also include an input/output controller 1210 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input output controller 1210 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1207 is connected to the central processing unit 1201 through a mass storage controller (not shown) connected to the system bus 1205. The mass storage device 1207 and its associated computer-readable media provide non-volatile storage for the server 1200. That is, the mass storage device 1207 may include a computer-readable medium (not shown), such as a hard disk or a compact disc read-only memory (CD-ROM) drive.
Computer readable media may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid state storage media, CD-ROM, digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the ones described above. The system memory 1204 and mass storage device 1207 described above may be collectively referred to as memory.
According to various embodiments of the present application, the server 1200 may also operate by being connected to a remote computer on a network, such as the Internet. That is, the server 1200 may be connected to the network 1212 through a network interface unit 1211 coupled to the system bus 1205, or alternatively, the network interface unit 1211 may be used to connect to other types of networks or remote computer systems (not shown).
The memory also includes one or more programs, one or more programs stored in the memory and configured to be executed by the CPU.
One embodiment of the present application provides a computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement a data storage method as described above.
One embodiment of the present application provides a data storage device comprising a processor and a memory having at least one instruction stored therein, the instruction being loaded and executed by the processor to implement a data storage method as described above.
It should be noted that: in the data storage device provided in the above embodiment, only the division of the above functional modules is used for illustration, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the data storage device is divided into different functional modules, so as to perform all or part of the functions described above. In addition, the data storage device and the data storage method provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the data storage device and the data storage method are detailed in the method embodiments and are not repeated herein.
It will be appreciated that the specific embodiments of the present application involve related data such as locally stored data. When the embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use, and processing of related data need to comply with the relevant laws, regulations, and standards of the relevant countries and regions.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing is not intended to limit the present application, but is intended to cover any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the present application.

Claims (12)

1. A data storage method for use in a transit service node within a distributed computing framework, the method comprising:
receiving distributed storage data sent by an execution node, wherein the distributed storage data is stored by the execution node in a distributed storage mode;
transferring the distributed storage data to a local disk for storage, wherein the local disk is a local storage area corresponding to the current equipment;
in response to the fact that the local disk meets data uploading conditions, uploading local storage data in the local disk to a cloud storage system according to a preset uploading strategy, wherein the preset uploading strategy is determined based on the average file size in the local disk;
and in response to the completion of the uploading of the local storage data, clearing the local storage data from the local disk.
2. The method of claim 1, wherein uploading the locally stored data in the local disk to the cloud storage system according to a preset uploading policy comprises:
Copying and storing the local storage data to the cloud storage system in response to the average file size of the local storage data reaching a preset threshold;
serially reading the locally stored data in response to the average file size of the locally stored data being less than the preset threshold; and uploading target data which are not overlapped with the data stored in the cloud storage system in the local storage data to the cloud storage system.
3. The method of claim 2, wherein uploading the target data of the local storage data that is not coincident with the data stored by the cloud storage system to the cloud storage system comprises:
merging target data which are not overlapped with the data stored in the cloud storage system in the local storage data to obtain merged data;
and uploading the combined data to a cloud storage system.
4. The method according to any one of claims 1 to 3, wherein the uploading the locally stored data in the local disk to the cloud storage system according to a preset uploading policy in response to the local disk meeting a data uploading condition includes:
Receiving an active uploading instruction, wherein the active uploading instruction is used for triggering uploading of the local storage data;
and uploading the local storage data to the cloud storage system according to the preset uploading strategy based on the active uploading instruction.
5. The method according to any one of claims 1 to 3, wherein the uploading the locally stored data in the local disk to the cloud storage system according to a preset uploading policy in response to the local disk meeting a data uploading condition includes:
in response to the storage capacity in the local disk exceeding a storage threshold, uploading the local storage data in the local disk to the cloud storage system according to the preset uploading strategy;
or
and uploading the local storage data to the cloud storage system according to the preset uploading strategy in a preset uploading period.
6. A method according to any one of claims 1 to 3, wherein the receiving the distributed storage data sent by the executing node comprises:
receiving distributed storage data sent by at least two execution nodes;
the transferring the distributed storage data to a local disk for storage comprises the following steps:
Aggregating the distributed storage data sent by the at least two execution nodes according to the partition information to obtain partition aggregate storage data;
and transferring the partitioned aggregate storage data to the local disk for storage.
7. A method according to any one of claims 1 to 3, wherein,
the distributed storage data are data obtained after the task processing of the execution node.
8. A data storage system for use in a distributed computing framework, the system comprising:
the execution node is used for performing task processing on the candidate data to obtain distributed storage data; transmitting the distributed storage data to a transfer service node, wherein the distributed storage data is stored in a distributed storage mode;
the transfer service node is used for receiving the distributed storage data; transferring the distributed storage data to a local disk for storage, wherein the local disk is a local storage area corresponding to the current equipment; in response to the fact that the local disk meets data uploading conditions, uploading local storage data in the local disk to a cloud storage system according to a preset uploading strategy, wherein the preset uploading strategy is determined based on the average file size in the local disk;
The cloud storage system is used for receiving the local storage data uploaded by the local disk;
and the transit service node is further used for clearing the local storage data from the local disk in response to the completion of the uploading of the local storage data.
9. A data storage device, the device comprising:
the receiving module is configured to receive distributed storage data sent by an executing node, wherein the distributed storage data is data stored by the executing node in a distributed storage mode;
the storage module is configured to transfer the distributed storage data to a local disk for storage, wherein the local disk is a local storage area corresponding to the current equipment;
the uploading module is configured to respond to the fact that the local disk meets data uploading conditions, and upload the local storage data in the local disk to a cloud storage system according to a preset uploading strategy, wherein the preset uploading strategy is determined based on the average file size in the local disk;
and the clearing module is configured to clear the data from the local disk in response to the completion of the data uploading.
10. A computer device comprising a processor and a memory having stored therein at least one instruction, at least one program, code set or instruction set, the at least one instruction, at least one program, code set or instruction set being loaded and executed by the processor to implement a data storage method as claimed in any one of claims 1 to 7.
11. A computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set, the at least one instruction, the at least one program, the code set, or instruction set being loaded and executed by a processor to implement the data storage method of any of claims 1 to 7.
12. A computer program product comprising computer programs/instructions stored in a computer-readable storage medium, wherein a processor of a computer device reads the computer programs/instructions from the computer-readable storage medium and executes them, so that the computer device implements the data storage method of any one of claims 1 to 7.
CN202111341519.2A 2021-11-12 2021-11-12 Data storage method, system, device, storage medium and program product Pending CN116126209A (en)

Publications (1)

Publication Number Publication Date
CN116126209A true CN116126209A (en) 2023-05-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40086091
Country of ref document: HK