CN113778318A - Data storage method and device - Google Patents


Info

Publication number
CN113778318A
CN113778318A (application CN202010575945.1A)
Authority
CN
China
Prior art keywords
data
user hierarchical
hierarchical data
user
partition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010575945.1A
Other languages
Chinese (zh)
Other versions
CN113778318B (en)
Inventor
林艳
周德辉
文小东
史金昊
崔词茗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202010575945.1A priority Critical patent/CN113778318B/en
Publication of CN113778318A publication Critical patent/CN113778318A/en
Application granted granted Critical
Publication of CN113778318B publication Critical patent/CN113778318B/en
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608Saving storage space on storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/0644Management of space entities, e.g. partitions, extents, pools
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data storage method and a data storage device, and relates to the field of computer technology. In one specific embodiment, the method includes: receiving user hierarchical data obtained through a user hierarchical computation task, and writing the user hierarchical data into a preset hot data set based on storage time and model; according to an archiving task, archiving the user hierarchical data in the hot data set that satisfies a preset period into a preset cold data set according to a dump model, thereby obtaining partition codes of the user hierarchical data in the cold data set; and calling an archiving metadata writing service, obtaining the storage time, model, archiving time and partition codes corresponding to the user hierarchical data, and writing them into a data archiving metadata table of the database. The embodiment of the invention can therefore solve the problems of low efficiency and difficult management in existing user hierarchical data storage.

Description

Data storage method and device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data storage method and apparatus.
Background
The e-commerce platform needs to provide a digital user operation system for platform merchants to support user operation management on the platform. To make marketing activities and user operation strategies more targeted, a 4A user hierarchical model is established, which divides users into four tiers of cognition (Aware), attraction (Appeal), action (Act) and advocacy (Advocate) according to the depth of their interaction with brand commodities.
Because commodities vary widely, with some being hot-selling items and others attracting relatively little attention, the volume of user hierarchical data differs greatly between models: a single day may produce anywhere from hundreds of millions of records for one model down to only a few hundred for another, so the storage space occupied per day ranges from hundreds of GB to hundreds of KB.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
the e-commerce platform builds its data warehouse on Apache Hive (a large-scale data warehouse software open-sourced by the Apache Software Foundation) and its derived versions. Hive's underlying file storage system is HDFS, and when user hierarchical data is stored, each partition becomes a folder in HDFS, so too many small files are generated and they are difficult to consolidate. The NameNode of HDFS (the component responsible for managing the HDFS file system namespace and controlling access by external clients) loads all file metadata into memory; if there are too many small files, they occupy a large amount of NameNode memory, degrading performance and putting the NameNode under excessive pressure. When Hive executes a task, too many small files also produce more scan tasks, which wastes resources and makes management difficult.
Disclosure of Invention
In view of this, embodiments of the present invention provide a data storage method and apparatus that can solve the problems of low efficiency and difficult management in existing user hierarchical data storage.
To achieve the above object, according to one aspect of the embodiments of the present invention, there is provided a data storage method including: receiving user hierarchical data obtained through a user hierarchical computation task, and writing the user hierarchical data into a preset hot data set based on storage time and model;
according to an archiving task, archiving the user hierarchical data in the hot data set that satisfies a preset period into a preset cold data set according to a dump model, thereby obtaining partition codes of the user hierarchical data in the cold data set;
and calling an archiving metadata writing service, obtaining the storage time, model, archiving time and partition codes corresponding to the user hierarchical data, and writing them into a data archiving metadata table of the database.
Optionally, after writing into the data archive metadata table of the database, the method further includes:
and calling an archiving metadata reading service according to a reading task of the user hierarchical data, inquiring storage information of the target model in a preset time period from a data archiving metadata table of the database based on a reading request, and returning a partition code so as to read the user hierarchical data in the cold data set according to the partition code.
Optionally, querying, from a data archive metadata table of the database, storage information of the target model in a preset time period based on the read request includes:
judging whether the data archiving metadata table of the database stores the storage information of the target model within a preset time period or not based on the reading request, if so, extracting the storage information with the maximum version number, and returning the corresponding partition code so as to read the user hierarchical data in the cold data set according to the partition code; if not, informing that the data is not archived, and reading the user hierarchical data in the hot data set based on the reading request.
Optionally, archiving the user hierarchical data that satisfies the preset period in the hot data set into the preset cold data set according to a dump model includes:
grouping the user hierarchical data that satisfies the preset period by model, and sorting the groups in descending order of stored data volume;
acquiring the total amount of user hierarchical data satisfying the preset period and the number of partitions, to obtain the average user hierarchical data volume per partition;
taking out the groups of user hierarchical data one by one in order, and performing the following process in a loop for each group until all user hierarchical data has been dumped:
judging whether the data volume not yet dumped in the current group is greater than or equal to the average user hierarchical data volume; if so, judging whether the cold data set contains a partition whose data volume is 0, dumping into such a partition if one exists, and otherwise dumping into the partition with the largest remaining space in the cold data set; if not, judging whether there is a partition for which the difference between the average user hierarchical data volume and its currently stored data volume is greater than or equal to the data volume not yet dumped in the current group, dumping into the partition with the smallest remaining space among them if such partitions exist, and otherwise dumping into the partition with the largest remaining space in the cold data set.
Optionally, performing the dump includes:
acquiring the user hierarchical data to be dumped in ascending order of storage time within the group;
labeling each piece of user hierarchical data to be dumped with the partition code used for the dump in the cold data set;
merging the labeled user hierarchical data to be dumped into files sized to the HDFS block, and writing them into the cold data set Hive table.
Optionally, the writing into a data archive metadata table of the database includes:
judging whether the storage information with the same model code and storage time exists in the current data archiving metadata table;
if so, setting the storage information as invalid, and adding 1 to the version number in the storage information to serve as the version number of the archived user hierarchical data; and if not, setting the version number of the archived user hierarchical data to be 1.
Optionally, writing the user hierarchical data into a preset hot data set based on a storage time and a model, including:
partition codes in a hot data set are generated according to storage time and a model to write the user hierarchical data into the hot data set.
In addition, the invention also provides a data storage device, which comprises a first module, a second module and a third module, wherein the first module is used for receiving the user hierarchical data obtained by the user hierarchical computing task and writing the user hierarchical data into a preset hot data set based on the storage time and the model;
the second module is used for archiving the user hierarchical data meeting the preset period in the hot data set to a preset cold data set according to a dump model according to an archiving task so as to obtain the partition codes of the user hierarchical data in the cold data set;
and the third module is used for calling an archiving metadata writing service, acquiring the storage time, the model, the archiving time and the partition codes corresponding to the user hierarchical data, and further writing the storage time, the model, the archiving time and the partition codes into a data archiving metadata table of the database.
One embodiment of the above invention has the following advantages or benefits: the user hierarchical data is written into a preset hot data set based on storage time and model; according to the archiving task, the user hierarchical data in the hot data set that satisfies a preset period is archived into a preset cold data set according to a dump model, thereby obtaining the partition codes of the user hierarchical data in the cold data set; and an archiving metadata writing service is called to obtain the storage time, model, archiving time and partition codes corresponding to the user hierarchical data and write them into a data archiving metadata table of the database. These technical means can accommodate various data processing scenarios: when the usage frequency of the user hierarchical data decreases, small-file fragments can be consolidated without occupying excessive NameNode memory, which reduces the pressure on the NameNode and the storage system while maintaining the read and write efficiency of the user hierarchical model data.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of a main flow of a data storage method according to one embodiment of the present invention;
FIG. 2 is an architecture diagram of a data storage method according to an embodiment of the present invention;
FIG. 3 is a schematic flow diagram of a primary process for hot data set archiving, according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a main flow of writing a hot data set to a cold data set according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of a user hierarchical data reading method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the main modules of a data storage device according to an embodiment of the present invention;
FIG. 7 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 8 is a schematic structural diagram of a computer system suitable for implementing a terminal device or a server according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of a main flow of a data storage method according to an embodiment of the present invention, as shown in fig. 1, the data storage method includes:
step S101, receiving user hierarchical data obtained through a user hierarchical computing task, and writing the user hierarchical data into a preset hot data set based on storage time and a model.
In an embodiment, the invention provides a hot data set and a cold data set. The hot data set stores recently produced user hierarchical data for use by subsequent computation tasks. The cold data set stores user hierarchical data that is no longer frequently used. For example, once the user hierarchical data computed by one user hierarchical model within a defined period has been consumed by all of its fixed statistical tasks, that data can be considered no longer frequently used. Tables 1 and 2 below list, respectively, the fields of the hot data set produced by the user hierarchical computation task and the fields of the cold data set written by the archiving task.
TABLE 1 (hot data set fields; rendered as an image in the original publication and not reproduced here)
TABLE 2 (cold data set fields; rendered as an image in the original publication and not reproduced here)
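Because Tables 1 and 2 are only available as images, their exact layout cannot be reproduced; the sketch below is a purely illustrative guess at how such a pair of Hive tables could be declared through Spark SQL. Every column and table name here (user_id, tier, dt, model_code, cold_partition, hot_user_tier_data, cold_user_tier_data) is an assumption introduced for the example and is not specified by the embodiment.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical hot data set: partitioned by storage time and model.
spark.sql("""
    CREATE TABLE IF NOT EXISTS hot_user_tier_data (
        user_id BIGINT,
        tier    STRING                               -- e.g. Aware / Appeal / Act / Advocate
    )
    PARTITIONED BY (dt STRING, model_code STRING)    -- storage time and model
    STORED AS PARQUET
""")

# Hypothetical cold data set: partitioned only by the assigned archive partition code.
spark.sql("""
    CREATE TABLE IF NOT EXISTS cold_user_tier_data (
        user_id    BIGINT,
        tier       STRING,
        dt         STRING,                           -- original storage time, kept as a plain column
        model_code STRING                            -- original model, kept as a plain column
    )
    PARTITIONED BY (cold_partition STRING)           -- assigned archive partition code
    STORED AS PARQUET
""")
```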
Preferably, in the embodiment of the present invention, a Spark distributed computing framework is used to execute steps S101 to S103, where Spark is a large-scale distributed data computing framework open-sourced by the Apache Software Foundation.
It should be noted that a preferred embodiment of the present invention generates partition codes in the hot data set from the storage time and the model, and writes the user hierarchical data into the hot data set according to those partition codes.
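For illustration only, the following PySpark sketch shows one way the hot-data-set write of step S101 could look. The table name hot_user_tier_data and the partition columns dt (storage time) and model_code are the hypothetical names from the table sketch above, not names given by the embodiment.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def write_to_hot_set(tier_df, storage_date, model_code):
    """Append one batch of user hierarchical data to the hot data set,
    partitioned by storage time and model (hypothetical column names)."""
    out = (tier_df
           .withColumn("dt", F.lit(storage_date))        # storage time, e.g. "2020-06-22"
           .withColumn("model_code", F.lit(model_code)))  # user hierarchical model code
    (out.write
        .mode("append")
        .partitionBy("dt", "model_code")                  # partition encoding of the hot set
        .saveAsTable("hot_user_tier_data"))
```

Because each (dt, model_code) pair becomes its own folder in HDFS, this layout is convenient while the data is hot, but it is also exactly what produces the small files that the archiving task later consolidates.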
Step S102: according to the archiving task, the user hierarchical data in the hot data set that satisfies the preset period is archived into a preset cold data set according to a dump model, and the partition codes of the user hierarchical data in the cold data set are obtained.
In some embodiments, when archiving the user hierarchical data that satisfies the preset period from the hot data set into the preset cold data set according to the dump model, the user hierarchical data satisfying the preset period can be grouped by model and the groups sorted in descending order of stored data volume; the total amount of user hierarchical data satisfying the preset period and the number of partitions are then obtained, giving the average user hierarchical data volume per partition.
Then, the groups of user hierarchical data are taken out one by one in order, and the following process is executed in a loop for each group until all user hierarchical data has been dumped:
judging whether the data volume not yet dumped in the current group is greater than or equal to the average user hierarchical data volume; if so, judging whether the cold data set contains a partition whose data volume is 0, dumping into such a partition if one exists, and otherwise dumping into the partition with the largest remaining space in the cold data set; if not, judging whether there is a partition for which the difference between the average user hierarchical data volume and its currently stored data volume is greater than or equal to the data volume not yet dumped in the current group, dumping into the partition with the smallest remaining space among them if such partitions exist, and otherwise dumping into the partition with the largest remaining space in the cold data set.
As a further embodiment, when dumping from the hot data set into the cold data set, the user hierarchical data to be dumped may be obtained in ascending order of storage time within the group; each piece of user hierarchical data to be dumped is then labeled with the partition code used for the dump in the cold data set; finally, the labeled user hierarchical data is merged into files sized to the HDFS block and written into the cold data set Hive table. HDFS is a distributed file storage system. Hive is a data warehouse tool based on Hadoop that can map structured data files onto database tables and provides complete SQL query functionality.
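A minimal PySpark sketch of this dump step is given below, assuming each row already carries the archive partition code assigned by the dump model. The column name cold_partition and the table cold_user_tier_data are assumptions carried over from the earlier sketches, and tuning output file sizes toward the HDFS block size is simplified to a repartition by the partition code.

```python
def dump_to_cold_set(labeled_df):
    """labeled_df: user hierarchical data labeled with its assigned archive
    partition code in a 'cold_partition' column. Rows are shuffled so that
    each cold partition is written as a few large files rather than many
    small ones; in practice the file count would be tuned toward the HDFS
    block size."""
    (labeled_df
        .repartition("cold_partition")
        .write
        .mode("append")
        .partitionBy("cold_partition")
        .saveAsTable("cold_user_tier_data"))
```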
It can be seen that the present invention assigns archived data to partitions in order of record count, by model and by storage time. Models with small data volumes are thereby consolidated toward the normal storage file size, and data of the same model, and of the same model on adjacent dates, is archived together as far as possible, which preserves scanning efficiency when the cold data set is used. At the same time, the data volume of each partition of the cold data set is kept as balanced as possible, avoiding severe data skew and the performance bottleneck that uneven partition storage would otherwise cause when the data is scanned.
Step S103, calling an archiving metadata writing service, acquiring storage time, a model, archiving time and partition codes corresponding to the user hierarchical data, and further writing the storage time, the model, the archiving time and the partition codes into a data archiving metadata table of the database.
In some embodiments, after writing into the data archive metadata table of the database, the present invention can invoke an archive metadata reading service according to a reading task of the user hierarchical data, query the storage information of the target model in a preset time period from the data archive metadata table of the database based on the reading request, and return the partition code to read the user hierarchical data in the cold data set according to the partition code. Preferably, the database may be a relational database Mysql.
It should be noted that, after writing into the data archive metadata table of the database, the archive metadata read service may also be invoked according to the read task of the user hierarchical data. Then, judging whether the data filing metadata table of the database stores storage information of a target model within a preset time period or not based on the reading request, if so, extracting the storage information with the maximum version number, and returning a corresponding partition code to read user hierarchical data in a cold data set according to the partition code; if not, informing that the data is not archived, and reading the user hierarchical data in the hot data set based on the reading request.
As another example, when writing into the data archive metadata table of the database, the present invention may judge whether storage information with the same model code and storage time already exists in the current data archive metadata table. If it does, that storage information is marked invalid, and its version number plus 1 is used as the version number of the newly archived user hierarchical data; if not, the version number of the archived user hierarchical data is set to 1.
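This version handling can be sketched as two statements against the metadata table. The MySQL table archive_metadata and its columns (model_code, storage_date, archive_time, partition_code, version, valid) are hypothetical names introduced for the example; the embodiment only requires that superseded records be invalidated and the version number incremented.

```python
import pymysql

def write_archive_metadata(conn, model_code, storage_date, archive_time, partition_code):
    """Invalidate any earlier record for the same model and storage time,
    then insert a new record whose version number is the previous one plus 1
    (or 1 if no earlier record exists)."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT MAX(version) FROM archive_metadata "
            "WHERE model_code = %s AND storage_date = %s",
            (model_code, storage_date))
        prev = cur.fetchone()[0] or 0
        if prev:
            # Mark earlier records for this model and storage time as no longer valid.
            cur.execute(
                "UPDATE archive_metadata SET valid = 0 "
                "WHERE model_code = %s AND storage_date = %s",
                (model_code, storage_date))
        cur.execute(
            "INSERT INTO archive_metadata "
            "(model_code, storage_date, archive_time, partition_code, version, valid) "
            "VALUES (%s, %s, %s, %s, %s, 1)",
            (model_code, storage_date, archive_time, partition_code, prev + 1))
    conn.commit()
```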
The method and device can manage and maintain the archived data through the read and write services, ensure the integrity of the stored archive meta-information when data is refreshed, and use the version number to distinguish whether the data is the latest valid copy. In addition, when a scenario needs the user hierarchical data, the storage information of the relevant model can be obtained by directly calling the archiving metadata reading service, without having to infer the storage location of the underlying data from the business scenario. Finally, the same mechanism can be extended to locate the storage of other types of data, so it has good extensibility.
It is worth noting that the user hierarchical data could instead be stored partitioned by model alone, with newly added user hierarchical data appended directly to the relevant partitions and the files merged periodically. However, because the data volume of each model is not uniform, a model with a large data volume makes the data-scanning task inefficient and puts pressure on downstream tasks, while a model whose user hierarchical data volume is too small still suffers from the small-file problem. Furthermore, the workflow for updating data in Hive is complex, so appending directly to the relevant partitions makes the user hierarchical data difficult to maintain.
In summary, the data storage method of the invention realizes hierarchical data access for large-scale e-commerce users. On one hand, it divides the data into a cold data set and a hot data set and can adapt to various data processing scenarios: while the user hierarchical data is frequently used, the hot data set is used, which keeps scanning the current model's user hierarchical data efficient and avoids the management difficulties caused by excessive Hive partitions. On the other hand, it solves the small-file problem caused by too many partitions by merging and consolidating user hierarchical model data with small data volumes; when the usage frequency of the user hierarchical data decreases, small-file fragments can be consolidated without occupying excessive NameNode memory, which reduces the pressure on the NameNode and the storage system. Moreover, while reducing the pressure on the NameNode, the invention maintains the read and write efficiency of the user hierarchical model data.
Fig. 2 is an architecture diagram of a data storage method according to an embodiment of the present invention, where the data storage method obtains user hierarchical data through a user hierarchical computation task, and writes the user hierarchical data into a preset hot data set based on a storage time and a model. According to the archiving task, user hierarchical data meeting a preset period in the hot data set is archived to a preset cold data set (namely written into the cold data set) according to a dump model, and partition codes of the user hierarchical data in the cold data set are obtained. And then, calling an archiving metadata writing service according to the archiving task, acquiring storage time, a model, archiving time and partition codes corresponding to the user hierarchical data, and writing the storage time, the model, the archiving time and the partition codes into a data archiving metadata table of the Mysql database. And simultaneously deleting the user hierarchical data which are archived in the hot data set according to the archiving task.
In addition, through a data calculation task of a user hierarchy, an archiving metadata reading service can be called to inquire the storage information of the target model in a preset time period from a data archiving metadata table of the Mysql database, and a partition code is returned, so that the user hierarchy data in the cold data set can be read according to the partition code. If the partition code is not returned, it indicates that the user hierarchical data in the hot data set is not written into the cold data set, and the user hierarchical data in the hot data set is read directly.
It is worth noting that the user-tier computing tasks, the archiving tasks, and the user-tier based data computing tasks (i.e., the read service) may be performed in a Spark engine.
FIG. 3 is a schematic flow chart of a hot data set archiving method according to an embodiment of the present invention, including:
the method comprises the following steps: an archived list based on models and storage times is obtained.
Step two: user hierarchical data in the hot dataset is scanned.
Step three: whether the hot data set contains user hierarchical data corresponding to the archive list is judged; if so, step four is performed; if not, the process exits.
Step four: archive partition encodings are assigned according to a dump model.
Step five: the cold data set is written in accordance with the partition encoding.
Step six: whether all writes succeeded is judged; if so, step seven is executed; if not, the user hierarchical data already archived into the cold data set is deleted and the process returns to step five.
Step seven: and calling an archiving metadata writing service and writing the metadata information of the cold data set archived at this time.
Step eight: whether the meta-information was written successfully is judged; if so, the successfully archived user hierarchical data is deleted from the hot data set; if not, the process returns to step seven.
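Putting the steps of FIG. 3 together, the whole archive task could be driven by a loop like the sketch below. This is only an illustration: assign_cold_partitions (standing in for the dump-model allocation of step four) and delete_from_hot_set (the cleanup of step eight) are hypothetical helpers, the table name hot_user_tier_data is the assumption reused from the earlier sketches, and the retry branches of steps six and eight are omitted.

```python
from datetime import datetime

def run_archive_task(spark, conn, archive_list):
    """archive_list: (model_code, storage_date) pairs due for archiving
    (FIG. 3, step one). Scans the hot set, dumps matching data to the cold
    set, records archive metadata, then removes the archived hot data."""
    hot = spark.read.table("hot_user_tier_data")                      # step two
    for model_code, storage_date in archive_list:
        batch = hot.where((hot.model_code == model_code) & (hot.dt == storage_date))
        if batch.rdd.isEmpty():                                       # step three
            continue
        labeled = assign_cold_partitions(batch)                       # step four (hypothetical helper)
        dump_to_cold_set(labeled)                                     # step five
        for row in labeled.select("cold_partition").distinct().collect():
            write_archive_metadata(conn, model_code, storage_date,    # step seven
                                   datetime.now(), row["cold_partition"])
        delete_from_hot_set(spark, model_code, storage_date)          # step eight (hypothetical helper)
```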
FIG. 4 is a schematic diagram of a main flow of writing a hot data set into a cold data set according to an embodiment of the present invention, including:
the method comprises the following steps: and acquiring user hierarchical data in the thermal data set according to a preset period.
Step two: grouping is performed according to the models, and the total number of each model is calculated.
Step three: the models are sorted from large to small according to the total number of the models, and the models are sorted from small to large according to the dates.
Step four: and calculating the total number of the strips, and confirming the number of the partitions part _ num to obtain the average number of the strips row _ num of each partition.
Step five: and judging whether unallocated user hierarchical data exists or not, if so, performing a sixth step, and otherwise, exiting the process.
Step six: and taking out a group of models from small to large, and initializing the total number of the models to be the left _ count of the residual unallocated number.
Step seven: and judging whether the left _ count is greater than or equal to the row _ num, if so, performing the step eight, and if not, performing the step nine.
Step eight: and judging whether partitions without user hierarchical data are available, if so, performing the step ten, and otherwise, performing the step nine.
Step nine: and judging whether partitions with the difference of row _ num and block _ num larger than or equal to left _ count exist, if so, performing the step eleven, and otherwise, performing the step twelve.
Wherein the partition has a number of stored data pieces block _ num.
Step ten: the partition to which data has not been allocated is fetched and then step thirteen is performed.
Step eleven: and finding the model list of which the block _ num is larger than or equal to left _ count subtracted from the row _ num to obtain the partition with the maximum block _ num, and then performing the thirteen step.
Step twelve: and taking out the partition with the maximum row _ num minus block _ num, and then carrying out the step thirteen.
Step thirteen: and sequentially taking out the unallocated user hierarchical data, allocating the user hierarchical data to the taken-out partition, and updating block _ num and left _ count.
Fourteen steps: and judging whether the distribution of the current model is finished, if so, returning to the step five, and if not, performing the step fifteen.
Step fifteen: and judging whether block _ num is larger than or equal to row _ num, if so, returning to the seventh step, and if not, returning to the thirteenth step.
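The allocation loop of FIG. 4 amounts to a small greedy bin-packing routine. The sketch below uses the variable names from the steps above (part_num, row_num, block_num, left_count); its input and output shapes are assumptions made for the example, and the record-by-record assignment of steps thirteen to fifteen is collapsed into chunk-sized assignments.

```python
def allocate_cold_partitions(groups, part_num):
    """groups: list of (model_code, record_count) pairs, already sorted in
    descending order of record_count (step three); records inside a group are
    assumed to be in ascending storage-time order. Returns a plan of
    (model_code, partition_index, allocated_count) chunks."""
    total = sum(count for _, count in groups)
    row_num = max(1, total // part_num)            # step four: average records per partition
    block_num = [0] * part_num                     # records already assigned to each partition
    plan = []

    for model_code, count in groups:               # step six: next group
        left_count = count
        while left_count > 0:
            if left_count >= row_num:              # step seven
                empty = next((i for i, b in enumerate(block_num) if b == 0), None)  # step eight
                target = empty if empty is not None else max(
                    range(part_num), key=lambda i: row_num - block_num[i])  # most free space
            else:
                fitting = [i for i in range(part_num)
                           if row_num - block_num[i] >= left_count]         # step nine
                target = (max(fitting, key=lambda i: block_num[i]) if fitting  # smallest space that fits
                          else max(range(part_num), key=lambda i: row_num - block_num[i]))
            free = row_num - block_num[target]
            take = min(left_count, free) if free > 0 else left_count        # step thirteen
            plan.append((model_code, target, take))
            block_num[target] += take
            left_count -= take                     # steps fourteen and fifteen drive this loop
    return plan
```

Called as allocate_cold_partitions([("m1", 900), ("m2", 90), ("m3", 10)], part_num=2), for instance, it spreads the large model across both partitions and packs the two small models into the partition that still has room, which is the balancing behaviour the steps above aim for.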
Fig. 5 is a schematic main flow chart of user hierarchical data reading according to an embodiment of the present invention, including:
the method comprises the following steps: and after a computing task based on the user hierarchical data is started, sending a request to an archiving metadata reading service to request data storage information of the required user hierarchical data.
The request may include the number of the user hierarchical model, the start time of the storage time, and the end time.
Step two: and the filing metadata reading service receives the request and judges whether the filing metadata table has the requested metadata information or not, if so, the fourth step is carried out, and if not, the third step is carried out.
Step three: and returning the position of the user hierarchical data corresponding to the hot data set request, and performing the sixth step.
In an embodiment, at this point the metadata of the user hierarchical data in the hot data set has not yet been written into the data archive metadata table of the database, and the user hierarchical data in the hot data set has not been written into the cold data set. In other words, in the embodiment of the present invention, archiving the metadata of the user hierarchical data to the database and writing the user hierarchical data into the cold data set are separate operations.
Step four: whether multiple pieces of metadata storage information exist is judged; if so, step five is performed; otherwise the metadata storage information is returned directly and step six is performed.
Step five: the latest valid storage information is returned.
In an embodiment, the valid meta-information with the largest version number is returned.
Step six: and analyzing the returned storage information to read the user hierarchical data in the cold data set.
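The read path of FIG. 5 reduces to a lookup against the same hypothetical archive_metadata table used in the earlier sketch. Because the write path marks superseded records invalid, selecting only valid rows already yields the highest-version storage information, and an empty result means the data must still be read from the hot data set.

```python
def read_storage_info(conn, model_code, start_date, end_date):
    """Return (storage_date, partition_code, version) rows for the requested
    model and storage-time range, or None if nothing has been archived yet,
    in which case the caller reads the hot data set instead (step three)."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT storage_date, partition_code, version FROM archive_metadata "
            "WHERE model_code = %s AND storage_date BETWEEN %s AND %s "
            "AND valid = 1",                        # only the latest, valid versions
            (model_code, start_date, end_date))
        rows = cur.fetchall()
    return list(rows) or None
```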
Fig. 6 is a schematic diagram of the main modules of a data storage device according to an embodiment of the present invention. As shown in fig. 6, the data storage device 600 includes a first module 601, a second module 602, and a third module 603. The first module 601 receives user hierarchical data obtained through a user hierarchical computation task, and writes the user hierarchical data into a preset hot data set based on storage time and model; the second module 602, according to an archiving task, archives the user hierarchical data in the hot data set that satisfies a preset period into a preset cold data set according to a dump model, thereby obtaining the partition codes of the user hierarchical data in the cold data set; the third module 603 calls an archiving metadata writing service, obtains the storage time, model, archiving time and partition codes corresponding to the user hierarchical data, and writes them into a data archiving metadata table of the database.
In some embodiments, after the third module 603 writes to a data archive metadata table of the database, the method further comprises:
and calling an archiving metadata reading service according to a reading task of the user hierarchical data, inquiring storage information of the target model in a preset time period from a data archiving metadata table of the database based on a reading request, and returning a partition code so as to read the user hierarchical data in the cold data set according to the partition code.
In some embodiments, the third module 603 queries, based on the read request, a data archive metadata table of the database for storage information of the target model in a preset time period, including:
judging whether the data archiving metadata table of the database stores the storage information of the target model within a preset time period or not based on the reading request, if so, extracting the storage information with the maximum version number, and returning the corresponding partition code so as to read the user hierarchical data in the cold data set according to the partition code; if not, informing that the data is not archived, and reading the user hierarchical data in the hot data set based on the reading request.
In some embodiments, the second module 602 archives the user hierarchical data satisfying the preset period in the hot data set into the preset cold data set according to a dump model, including:
grouping the user hierarchical data that satisfies the preset period by model, and sorting the groups in descending order of stored data volume;
acquiring the total amount of user hierarchical data satisfying the preset period and the number of partitions, to obtain the average user hierarchical data volume per partition;
taking out the groups of user hierarchical data one by one in order, and performing the following process in a loop for each group until all user hierarchical data has been dumped:
judging whether the data volume not yet dumped in the current group is greater than or equal to the average user hierarchical data volume; if so, judging whether the cold data set contains a partition whose data volume is 0, dumping into such a partition if one exists, and otherwise dumping into the partition with the largest remaining space in the cold data set; if not, judging whether there is a partition for which the difference between the average user hierarchical data volume and its currently stored data volume is greater than or equal to the data volume not yet dumped in the current group, dumping into the partition with the smallest remaining space among them if such partitions exist, and otherwise dumping into the partition with the largest remaining space in the cold data set.
In some embodiments, the second module 602, when performing the dump, includes:
acquiring the user hierarchical data to be dumped in ascending order of storage time within the group;
labeling each piece of user hierarchical data to be dumped with the partition code used for the dump in the cold data set;
merging the labeled user hierarchical data to be dumped into files sized to the HDFS block, and writing them into the cold data set Hive table.
In some embodiments, the third module 603 writes to a data archive metadata table of the database, including:
judging whether the storage information with the same model code and storage time exists in the current data archiving metadata table;
if so, setting the storage information as invalid, and adding 1 to the version number in the storage information to serve as the version number of the archived user hierarchical data; and if not, setting the version number of the archived user hierarchical data to be 1.
In some embodiments, the first module 601 writes the user hierarchical data into a preset hot data set based on the storage time and the model, including:
partition codes in a hot data set are generated according to storage time and a model to write the user hierarchical data into the hot data set.
It should be noted that the data storage method and the data storage device of the present invention correspond to each other in their specific implementations, so the repeated content is not described again.
FIG. 7 illustrates an exemplary system architecture 700 to which the data storage method or the data storage device of embodiments of the present invention may be applied.
As shown in fig. 7, the system architecture 700 may include terminal devices 701, 702, 703, a network 704, and a server 705. The network 704 serves to provide a medium for communication links between the terminal devices 701, 702, 703 and the server 705. Network 704 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 701, 702, 703 to interact with a server 705 over a network 704, to receive or send messages or the like. The terminal devices 701, 702, 703 may have installed thereon various communication client applications, such as a shopping-like application, a web browser application, a search-like application, an instant messaging tool, a mailbox client, social platform software, etc. (by way of example only).
The terminal devices 701, 702, 703 may be various electronic devices that have a display screen and support web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 705 may be a server providing various services, such as a background management server (for example only) providing support for shopping websites browsed by users using the terminal devices 701, 702, 703. The backend management server may analyze and perform other processing on the received data such as the product information query request, and feed back a processing result (for example, target push information, product information — just an example) to the terminal device.
It should be noted that the data storage method provided by the embodiment of the present invention is generally executed by the server 705, and accordingly, the data storage device is generally disposed in the server 705.
It should be understood that the number of terminal devices, networks, and servers in fig. 7 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 8, shown is a block diagram of a computer system 800 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU) 801 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The RAM 803 also stores various programs and data necessary for the operation of the computer system 800. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on the drive 810 as necessary, so that a computer program read out therefrom is installed into the storage section 808 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program executes the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 801.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a first module, a second module, and a third module. Wherein the names of the modules do not in some cases constitute a limitation of the module itself.
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to: receive user hierarchical data obtained through a user hierarchical computation task, and write the user hierarchical data into a preset hot data set based on storage time and model; according to an archiving task, archive the user hierarchical data in the hot data set that satisfies a preset period into a preset cold data set according to a dump model, thereby obtaining partition codes of the user hierarchical data in the cold data set; and call an archiving metadata writing service, obtain the storage time, model, archiving time and partition codes corresponding to the user hierarchical data, and write them into a data archiving metadata table of the database.
According to the technical scheme of the embodiment of the invention, the problems of low storage efficiency and difficult management of the existing user hierarchical data can be solved.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of storing data, comprising:
receiving user hierarchical data obtained through a user hierarchical computing task, and writing the user hierarchical data into a preset hot data set based on storage time and a model;
according to an archiving task, archiving the user hierarchical data in the hot data set that satisfies a preset period into a preset cold data set according to a dump model, thereby obtaining partition codes of the user hierarchical data in the cold data set;
and calling an archiving metadata writing service, acquiring storage time, a model, archiving time and partition codes corresponding to the user hierarchical data, and further writing the storage time, the model, the archiving time and the partition codes into a data archiving metadata table of the database.
2. The method of claim 1, after writing to a data archive metadata table of a database, further comprising:
and calling an archiving metadata reading service according to a reading task of the user hierarchical data, inquiring storage information of the target model in a preset time period from a data archiving metadata table of the database based on a reading request, and returning a partition code so as to read the user hierarchical data in the cold data set according to the partition code.
3. The method according to claim 2, wherein the step of querying a data archive metadata table of the database for the storage information of the target model in the preset time period based on the read request comprises the following steps:
judging whether the data archiving metadata table of the database stores the storage information of the target model within a preset time period or not based on the reading request, if so, extracting the storage information with the maximum version number, and returning the corresponding partition code so as to read the user hierarchical data in the cold data set according to the partition code; if not, informing that the data is not archived, and reading the user hierarchical data in the hot data set based on the reading request.
4. The method of claim 1, wherein archiving user-level data satisfying a predetermined period in a hot dataset into a predetermined cold dataset according to a dump model comprises:
grouping the user hierarchical data that satisfies the preset period by model, and sorting the groups in descending order of stored data volume;
acquiring the total amount of user hierarchical data satisfying the preset period and the number of partitions, to obtain the average user hierarchical data volume per partition;
taking out the groups of user hierarchical data one by one in order, and performing the following process in a loop for each group until all user hierarchical data has been dumped:
judging whether the data volume not yet dumped in the current group is greater than or equal to the average user hierarchical data volume; if so, judging whether the cold data set contains a partition whose data volume is 0, dumping into such a partition if one exists, and otherwise dumping into the partition with the largest remaining space in the cold data set; if not, judging whether there is a partition for which the difference between the average user hierarchical data volume and its currently stored data volume is greater than or equal to the data volume not yet dumped in the current group, dumping into the partition with the smallest remaining space among them if such partitions exist, and otherwise dumping into the partition with the largest remaining space in the cold data set.
5. The method of claim 4, when performing a dump, comprising:
acquiring the user hierarchical data to be dumped in ascending order of storage time within the group;
labeling each piece of user hierarchical data to be dumped with the partition code used for the dump in the cold data set;
merging the labeled user hierarchical data to be dumped into files sized to the HDFS block, and writing them into the cold data set Hive table.
6. The method of claim 1, wherein writing to a data archive metadata table of a database comprises:
judging whether the storage information with the same model code and storage time exists in the current data archiving metadata table;
if so, setting the storage information as invalid, and adding 1 to the version number in the storage information to serve as the version number of the archived user hierarchical data; and if not, setting the version number of the archived user hierarchical data to be 1.
7. The method of any of claims 1-6, wherein writing the user hierarchical data into a preset hot data set based on storage time and model comprises:
partition codes in a hot data set are generated according to storage time and a model to write the user hierarchical data into the hot data set.
8. A data storage device, comprising:
a first module for receiving user hierarchical data obtained through a user hierarchical computation task, writing the user hierarchical data into a preset hot data set based on a storage time and a model;
the second module is used for archiving the user hierarchical data meeting the preset period in the hot data set to a preset cold data set according to a dump model according to an archiving task so as to obtain the partition codes of the user hierarchical data in the cold data set;
and the third module is used for calling an archiving metadata writing service, acquiring the storage time, the model, the archiving time and the partition codes corresponding to the user hierarchical data, and further writing the storage time, the model, the archiving time and the partition codes into a data archiving metadata table of the database.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN202010575945.1A 2020-06-22 2020-06-22 Data storage method and device Active CN113778318B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010575945.1A CN113778318B (en) 2020-06-22 2020-06-22 Data storage method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010575945.1A CN113778318B (en) 2020-06-22 2020-06-22 Data storage method and device

Publications (2)

Publication Number Publication Date
CN113778318A true CN113778318A (en) 2021-12-10
CN113778318B CN113778318B (en) 2024-09-20

Family

ID=78835202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010575945.1A Active CN113778318B (en) 2020-06-22 2020-06-22 Data storage method and device

Country Status (1)

Country Link
CN (1) CN113778318B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509624A (en) * 2018-04-08 2018-09-07 武汉斗鱼网络科技有限公司 A kind of database filing method for cleaning and system, server and storage medium
DE102018129366A1 (en) * 2018-11-21 2020-05-28 Deepshore Gmbh System for processing and storing data requiring archiving
CN109726174A (en) * 2018-12-28 2019-05-07 江苏满运软件科技有限公司 Data archiving method, system, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MUHAMMED ZIYA KOMSUL: "A real-time hot swapping technique for SSD RAID systems", IEEE, 11 August 2016 (2016-08-11) *
魏学才; 宫庆媛; 沈佳杰; 周扬帆; 王新: "Design and empirical study of a multi-coding architecture adapted to cold and hot data storage" (适应冷热数据存储的多编码架构的设计与实证), Computer Applications and Software (计算机应用与软件), no. 02, 15 February 2017 (2017-02-15) *

Also Published As

Publication number Publication date
CN113778318B (en) 2024-09-20

Similar Documents

Publication Publication Date Title
CN109254733B (en) Method, device and system for storing data
CN108629029B (en) Data processing method and device applied to data warehouse
CN107729399B (en) Data processing method and device
CN107704202B (en) Method and device for quickly reading and writing data
CN108897874B (en) Method and apparatus for processing data
CN107480205B (en) Method and device for partitioning data
CN109947373B (en) Data processing method and device
US11836132B2 (en) Managing persistent database result sets
CN110019367B (en) Method and device for counting data characteristics
CN111061680A (en) Data retrieval method and device
CN111400304A (en) Method and device for acquiring total data of section dates, electronic equipment and storage medium
CN109697019B (en) Data writing method and system based on FAT file system
CN111753019B (en) Data partitioning method and device applied to data warehouse
CN112783887A (en) Data processing method and device based on data warehouse
CN110851419B (en) Data migration method and device
CN112182138A (en) Catalog making method and device
CN112395337B (en) Data export method and device
CN113760966A (en) Data processing method and device based on heterogeneous database system
CN112817930A (en) Data migration method and device
CN113778318B (en) Data storage method and device
CN107665241B (en) Real-time data multi-dimensional duplicate removal method and device
CN111177109A (en) Method and device for deleting overdue key
CN115295164A (en) Medical insurance data processing method and device, electronic equipment and storage medium
CN113760600B (en) Database backup method, database restoration method and related devices
CN113448957A (en) Data query method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant