CN110866068A - Announcement data storage method and device based on HDFS

Info

Publication number: CN110866068A
Application number: CN201911090968.7A
Authority: CN
Inventors: 姚嘉华; 张晓军; 杨博
Original assignee: SSE INFONET Ltd
Current assignee: SSE INFONET Ltd
Priority date: 2019-11-09
Filing date: 2019-11-09
Publication date: 2020-03-06
Anticipated expiration: 2039-11-09
Also published as: CN110866068B

Abstract

The invention relates to the technical field of data storage, and in particular to an announcement data storage method and device based on HDFS (Hadoop Distributed File System). The method comprises the following steps: step S101: archiving the announcement data; step S102: synchronizing the announcement data from network storage to the HDFS at a set period; step S103: caching the archived announcement data; step S104: retrieving the announcement data according to a download request. Compared with the prior art, the invention has the following advantage: by archiving massive announcement data, synchronizing the data from NAS (network-attached storage) to the HDFS, caching the archived data in a way designed around the characteristics of announcement data, and retrieving the announcement data according to download requests, it solves the low efficiency of existing storage methods in handling listed-company announcement data.

Description

Announcement data storage method and device based on HDFS

Technical Field

The invention relates to the technical field of data storage, and in particular to an announcement data storage method and device based on HDFS (Hadoop Distributed File System).

Background

To ensure data security, current technology downloads data and stores it directly in the HDFS. Although existing data storage methods can store small files efficiently, they are not designed around the characteristics of listed-company announcement data, so they cannot cope with the large number of download requests received every day and cannot guarantee efficient operation of the storage system.

The Shanghai and Shenzhen markets generate a large amount of announcement data every day. It covers the business change information of all companies listed on the main board, and its types are numerous and its content complex. For example, the exchange divides listed-company announcements into 35 major categories and 376 subcategories (Shanghai Stock Exchange, 2013), including major-event announcements, trading-reminder announcements, rights-issue announcements, additional-issuance announcements and the like.

The continuously growing number of announcements places ever higher demands on data storage. The data is characterized by a large number of files, each smaller than an HDFS block. In the long term, massive numbers of small files affect the scalability and performance of the HDFS. Meanwhile, the system faces massive numbers of announcement download requests every day, which poses a serious challenge to the performance of the overall storage system. Therefore, how to store listed-company announcement data efficiently has become a technical problem to be solved.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing an HDFS-based announcement data storage method and device, solving the technical problem that storage efficiency is low because the prior art is not designed around the characteristics of listed-company announcement data.

To achieve the above object, an HDFS-based announcement data storage method is designed, the method comprising the following steps: step S101: archiving the announcement data; step S102: synchronizing the announcement data from network storage to the HDFS at a set period; step S103: caching the archived announcement data; step S104: retrieving the announcement data according to a download request.

Preferably, each announcement in the announcement data has two parameters: an archiving date and the market to which the announcement belongs. The folder for the announcement is determined from these two parameters, and the files required for archiving are initialized: a task state file, a list of files to be downloaded, an error log and a packaging log. The task state file contains the following three attributes: the task state, whose possible values are "success" and "failure" and which initially has no value; the task progress, whose possible values are "downloading", "packaging", "uploading" and "deleting HDFS files" and which initially has no value; and the run count, which records the number of archiving rounds that have run and is initially 0.
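The task state file described above can be pictured as a small record persisted next to the other working files. The following is a minimal sketch only: the JSON encoding, the file names (`task_state.json`, `download_file.list`, `error.log`, `package.log`), the `market/date` folder layout and the English spellings of the attribute values are all assumptions, since the source specifies none of them.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional

# Allowed values from the description; the English spellings are assumptions.
TASK_STATUS = ("success", "failure")
TASK_PROGRESS = ("downloading", "packaging", "uploading", "deleting_hdfs_files")


@dataclass
class TaskState:
    task_status: Optional[str] = None    # initially has no value
    task_progress: Optional[str] = None  # initially has no value
    run_count: int = 0                   # number of archiving rounds run so far

    def save(self, path: Path) -> None:
        path.write_text(json.dumps(asdict(self)), encoding="utf-8")

    @classmethod
    def load(cls, path: Path) -> "TaskState":
        # A missing file is treated as a fresh task with initial values.
        if not path.exists():
            return cls()
        return cls(**json.loads(path.read_text(encoding="utf-8")))


def archive_folder(root: Path, market: str, archive_date: str) -> Path:
    """Hypothetical layout: one folder per market and archiving date."""
    return root / market / archive_date


def init_archive_files(folder: Path) -> TaskState:
    """Create the working files if they are missing and return the task state."""
    folder.mkdir(parents=True, exist_ok=True)
    for name in ("download_file.list", "error.log", "package.log"):
        (folder / name).touch(exist_ok=True)
    state_path = folder / "task_state.json"
    state = TaskState.load(state_path)
    state.save(state_path)
    return state
```

Calling `init_archive_files` a second time for the same market and date returns the previously saved state rather than resetting it, which is what lets a later archiving round resume a failed task.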

Preferably, the step S101 specifically includes the following steps:

(1) reading the two parameters: the archiving date of the announcements and the market to which they belong;

(2) initializing the files required for archiving the announcements;

(3) reading the task state file of the announcements: if the value of the task state attribute is "success", going to step (13); if it is "failure", going to step (4); if it has no value, setting the task progress attribute to "downloading" and going to step (5);

(4) reading the value of the task progress attribute: if a value exists, resetting the task progress to "downloading"; if no value exists, setting it to "deleting HDFS files";

(5) processing according to the value of the task progress attribute: if it is "downloading", going to step (6); if it is "packaging", going to step (7); if it is "uploading", going to step (8); if it is "deleting HDFS files", going to step (9);

(6) downloading the announcement files in sequence according to the list of files to be downloaded; if the download succeeds, continuing to the next step; if it fails, jumping to step (11);

(7) performing the packaging operation, generating a package file named after the market and the archiving date; if it succeeds, continuing to the next step; if it fails, jumping to step (11);

(8) uploading the package file to the HDFS; if the upload succeeds, continuing to the next step; if it fails, jumping to step (11);

(9) deleting the files on the HDFS in sequence; if the deletion succeeds, continuing to the next step; if it fails, jumping to step (11);

(10) once the preceding steps have succeeded, setting the task state in the task state file to "success" and going to step (12); if execution fails, jumping to step (11);

(11) adding 1 to the run count (run_count): if run_count is greater than the limit N, setting the task state in the task state file to "failure" and continuing to the next step; if run_count is less than the limit N, jumping to step (5), where the limit N is a preset upper limit;

(12) deleting the locally downloaded files and the compressed file;

(13) finishing the archiving.

Preferably, the archiving operation is performed once a day.

Preferably, the method for synchronizing the announcement data from the network storage to the HDFS is as follows:

(1) filtering the announcement files according to the archiving date parameter to be synchronized, and computing the list of files to be synchronized by this task, target_file.list; generating a failed-file list, failed_file.list, which is initially empty; and checking whether the completed-file list, completed_file.list, exists, creating it if it does not and otherwise doing nothing;

(2) starting synchronization and recording task start information: the start time of the current synchronization task and the expected number and names of the files to be synchronized;

(3) reading a file and judging whether it has already been synchronized: taking the files in sequence from target_file.list and checking whether each one is present in completed_file.list; if so, recording the time and file name, marking it "skipped", and jumping to step (6); otherwise, continuing;

(4) executing synchronization: synchronizing the file from the local NAS to the remote HDFS;

(5) judging whether the synchronization succeeded: if the exit code returned by rsync is 0, writing the file name into completed_file.list, recording the time and file name in rsync.log, and marking it "synchronization success"; otherwise, writing the file name into failed_file.list, recording the time and file name in rsync.log, and marking it "failure" together with the error code;

(6) checking whether target_file.list has been read completely: if not, jumping to step (3); if so, continuing;

(7) recording synchronization end information: rsync.log records the end time, the time taken by the synchronization, the number of files synchronized, the number of successful files and the number of files that reported errors.

Preferably, a cache pool of a specified size is created in Hadoop and a maximum time to live is set for the cache pool; when a file of the announcement data is accessed, the data file is stored in the cache pool.

Preferably, the size of the cache pool is specified as 20G, and the maximum time to live is set to 1 hour.

Preferably, the method for retrieving the announcement data according to the download request is specifically as follows:

(1) confirming the archiving date of the announcement and the market parameter to which it belongs;

(2) determining whether the file has been archived according to the archiving date and the market parameter; if it has been archived, going to step (3); if not, jumping to step (6);

(3) calculating a path: determining the path on the HDFS of the archive file to which the file belongs according to the archiving date and the market parameter; if the path exists in the HDFS, continuing; if it does not, jumping to step (6);

(4) acquiring the file: downloading the archive file to the local machine, extracting the required file, and deleting the local copy of the archive file;

(5) storing into the cache: storing the archive file into the cache area of the HDFS, and jumping to step (9);

(6) calculating the HDFS path of the file and checking whether the file exists; if it does not, returning 404 and jumping to step (9); if it does, continuing with the following steps;

(7) storing the file into the HDFS cache area.

The invention also relates to a device for the HDFS-based announcement data storage method, which comprises: an archiving module for archiving the announcement data; a synchronization module for synchronizing the announcement data from network storage to the HDFS; a cache module for caching the archived data; and a retrieval module for retrieving the announcement data according to a download request.

Compared with the prior art, the invention has the following advantage: by archiving massive announcement data, synchronizing the data from NAS (network-attached storage) to the HDFS, caching the archived data in a way designed around the characteristics of announcement data, and retrieving the announcement data according to download requests, it solves the low efficiency of existing storage methods in handling listed-company announcement data.

Drawings

FIG. 1 is a schematic flow diagram of the present invention;

FIG. 2 is a schematic flow chart of archiving the announcement data in the present invention;

FIG. 3 is a schematic flow chart of synchronizing the announcement data from network storage to the HDFS in the present invention;

FIG. 4 is a schematic flow chart of retrieving the announcement data according to a download request in the present invention;

FIG. 5 is a schematic diagram of a device for the HDFS-based announcement data storage method according to the present invention.

Detailed Description

The structure and principle of such a method and apparatus will be apparent to those skilled in the art from the following description of the invention, taken in conjunction with the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Example 1

This embodiment provides an HDFS-based announcement data storage method. It should be noted that although a logical order is shown in the flow chart, in some cases the steps shown or described may be performed in a different order. FIG. 1 shows a schematic flow diagram of the method, which, with reference to FIG. 1, comprises the following steps:

S101: archiving the announcement data;

S102: synchronizing the announcement data from the NAS to the HDFS;

S103: caching the archived announcement data according to the characteristics of the announcement data;

S104: retrieving the announcement data according to a download request.

As described in the background, the prior art is not designed around the characteristics of listed-company announcement data, so the overall storage system is inefficient and cannot meet demand. To solve the problems described in the background, as shown in FIG. 1, this embodiment first archives the announcement data; meanwhile, the announcement data is periodically synchronized from the NAS to the HDFS; the archived announcement data is cached according to the characteristics of listed-company announcement data; and the announcement data is retrieved according to download requests. In this way, the efficiency of the storage system for listed-company announcement data can be improved and the defects of the prior art overcome.

Specifically, referring to FIG. 2, the specific steps of archiving the announcement data are as follows:

(1) Determine two parameters for each announcement in the announcement data: its archiving date and the market to which it belongs. Optionally, each day is used as an archiving date.

(2) Initialize the files. According to the archiving date and the market to which the announcements belong, determine the announcement folder and initialize the files required for archiving: a task state file, a list of files to be downloaded, an error log and a packaging log. The task state file contains three attributes: the task state (hereinafter task_status), whose possible values are "success" and "failure"; the task progress (hereinafter task_progress), whose possible values are "downloading", "packaging", "uploading" and "deleting HDFS files"; and the run count (hereinafter run_count), which records the number of archiving rounds that have run and is initially 0.

(3) Read the task state file. If task_status is "success", jump to step (13); if it is "failure", jump to step (4); if task_status has no value, set task_progress to "downloading" and jump to step (5).

(4) Read the value of task_progress. If the value exists, that is, it is "downloading", "packaging" or "uploading", reset task_progress to "downloading"; if the value of task_progress is not one of "downloading", "packaging" and "uploading", set it to "deleting HDFS files".

(5) Process according to the value of task_progress. If it is "downloading", jump to step (6); if it is "packaging", jump to step (7); if it is "uploading", jump to step (8); if it is "deleting HDFS files", jump to step (9).

(6) With task_progress set to "downloading", download the announcement files in sequence according to the list of files to be downloaded. If the download succeeds, continue to the next step; if it fails, jump to step (11).

(7) Set task_progress to "packaging" and perform the packaging operation, generating a package file named after the market and the date. If it succeeds, continue to the next step; if it fails, jump to step (11).

(8) Set task_progress to "uploading" and perform the upload operation, uploading the package file to the HDFS. If it succeeds, continue to the next step; if it fails, jump to step (11).

(9) Set task_progress to "deleting HDFS files" and delete the files on the HDFS in sequence according to the file list. If the deletion succeeds, continue to the next step; if it fails, jump to step (11).

(10) Set task_status to "success" and jump to step (12).

(11) Add 1 to run_count. If run_count is greater than the limit N (an upper limit whose specific value is chosen by the engineer according to actual conditions), set task_status to "failure" and continue to the next step; if run_count is less than the limit N, jump to step (5).

(12) Delete the locally downloaded files and the tar file.

(13) Finish the archiving.
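The archiving steps amount to a resumable state machine with a bounded number of retries. The sketch below is one possible reading, not the patented implementation: the four stage actions and the cleanup are passed in as hypothetical callables that return `True` on success, the stage names are assumed English spellings, and the case where run_count equals the limit N, which the text leaves open, is treated here as "retry".

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Stage order for steps (6) to (9); the names are assumed spellings.
STAGES = ["downloading", "packaging", "uploading", "deleting_hdfs_files"]


@dataclass
class TaskState:
    task_status: Optional[str] = None
    task_progress: Optional[str] = None
    run_count: int = 0


def run_archive(state: TaskState,
                actions: Dict[str, Callable[[], bool]],
                cleanup: Callable[[], None],
                limit_n: int) -> TaskState:
    # Step (3): a task that already succeeded goes straight to the end.
    if state.task_status == "success":
        return state
    if state.task_status == "failure":
        # Step (4): restart from the download unless only deletion was left.
        if state.task_progress in ("downloading", "packaging", "uploading"):
            state.task_progress = "downloading"
        else:
            state.task_progress = "deleting_hdfs_files"
    else:
        state.task_progress = "downloading"

    while True:
        # Step (5): resume at the current stage, then run steps (6) to (9).
        succeeded = True
        for stage in STAGES[STAGES.index(state.task_progress):]:
            state.task_progress = stage
            if not actions[stage]():
                succeeded = False
                break
        if succeeded:
            state.task_status = "success"      # step (10)
            break
        state.run_count += 1                   # step (11)
        if state.run_count > limit_n:
            state.task_status = "failure"
            break
        # Assumption: run_count <= limit_n retries from the failed stage.

    cleanup()                                  # step (12)
    return state                               # step (13)
```

Because `task_progress` is left at the stage that failed, a retry inside the loop repeats only that stage and the ones after it, while a fresh run after a recorded "failure" goes through the reset in step (4) first.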

Specifically, referring to FIG. 3, the announcement data is synchronized from the NAS to the HDFS. Synchronization of the announcement data must be timely and accurate. The specific steps are as follows:

(1) Write a synchronization program and generate the files required for synchronization. The synchronization program filters the announcement files according to the date parameter for which synchronization is to be executed, and computes the list of files to be synchronized by this task, target_file.list; it also checks whether the completed-file list, completed_file.list, exists, creating it if it does not and otherwise doing nothing.

(2) Start synchronization and record task start information. rsync.log records the start time of the synchronization task and the expected number and names of the files to be synchronized.

(3) Read a file and judge whether it has already been synchronized. Take a file from target_file.list and check whether it is present in completed_file.list. If so, rsync.log records a notice entry with the time, the file name and the mark "skipped", and the process jumps to step (6); otherwise, continue.

(4) Execute synchronization. The file is synchronized from the local NAS to the remote HDFS using rsync.

(5) Judge whether the synchronization succeeded. If the exit code returned by rsync is 0, write the file name into completed_file.list and record in rsync.log a notice entry with the time, the file name and the mark "synchronization success"; otherwise, write the file name into failed_file.list and record in rsync.log an error entry with the time, the file name, the mark "failure" and the error code.

(6) Check whether target_file.list has been read completely. If not, jump to step (3); if so, continue.

(7) Record synchronization end information. rsync.log records a notice entry with the time, the time taken by the synchronization, the number of files synchronized, the number of successful files and the number of files that reported errors.

(8) End.
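The bookkeeping in the synchronization steps can be sketched as a loop over three list files and one log. This is an illustration under stated assumptions rather than the actual synchronization program: `targets` is assumed to be already filtered by date, `sync_one` is a hypothetical callable that returns an rsync-style exit code (for example a wrapper around an rsync invocation against an HDFS mount point, which the source does not describe), and the log line layout is invented for the example.

```python
import time
from pathlib import Path
from typing import Callable, Dict, List


def read_list(path: Path) -> List[str]:
    """Return the non-empty lines of a list file."""
    return [line for line in path.read_text(encoding="utf-8").splitlines() if line]


def append_line(path: Path, line: str) -> None:
    with path.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")


def log(path: Path, level: str, message: str) -> None:
    # Assumed log layout: timestamp, level, free-text message.
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    append_line(path, f"{stamp} {level} {message}")


def synchronize(work_dir: Path, targets: List[str],
                sync_one: Callable[[str], int]) -> Dict[str, int]:
    target_list = work_dir / "target_file.list"
    completed_list = work_dir / "completed_file.list"
    failed_list = work_dir / "failed_file.list"
    rsync_log = work_dir / "rsync.log"

    # Step (1): write the target list, reset the failed list,
    # and create the completed list only if it is missing.
    target_list.write_text("".join(f"{name}\n" for name in targets), encoding="utf-8")
    failed_list.write_text("", encoding="utf-8")
    completed_list.touch(exist_ok=True)

    # Step (2): record the start of the task.
    started = time.monotonic()
    log(rsync_log, "notice", f"start, {len(targets)} files expected: {' '.join(targets)}")

    completed = set(read_list(completed_list))
    stats = {"synced": 0, "skipped": 0, "failed": 0}
    for name in read_list(target_list):          # steps (3) and (6)
        if name in completed:
            log(rsync_log, "notice", f"{name} skipped")
            stats["skipped"] += 1
            continue
        code = sync_one(name)                    # step (4)
        if code == 0:                            # step (5)
            append_line(completed_list, name)
            completed.add(name)
            log(rsync_log, "notice", f"{name} synchronization success")
            stats["synced"] += 1
        else:
            append_line(failed_list, name)
            log(rsync_log, "error", f"{name} failure code={code}")
            stats["failed"] += 1

    # Step (7): record the end of the task with summary counts.
    elapsed = time.monotonic() - started
    log(rsync_log, "notice", f"end, elapsed={elapsed:.2f}s {stats}")
    return stats
```

Because completed_file.list survives between runs while failed_file.list is reset, rerunning the same task retries only the files that failed or were never reached.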

Specifically, the archived announcement data is cached according to the characteristics of the announcement data. A cache pool of a specified size is created in Hadoop and a maximum time to live is set for the cache pool; when a file of the announcement data is accessed, the data file is stored in the cache pool.

Preferably, the size of the cache pool is specified as 20G, and the maximum time to live may be set to 1 hour.
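The text does not name the mechanism used to create the cache pool. One plausible reading is HDFS centralized cache management, whose `hdfs cacheadmin` tool has `-addPool` (with `-limit` in bytes and `-maxTtl`) and `-addDirective` (with `-path`, `-pool` and `-ttl`). The sketch below only builds the command lines; the pool name, the example path and the reading of "20G" as 20 GiB are assumptions.

```python
from typing import List

POOL_NAME = "announcement_pool"        # hypothetical pool name
POOL_LIMIT_BYTES = 20 * 1024 ** 3      # "20G" read as 20 GiB (assumption)
MAX_TTL = "1h"                         # maximum time to live of 1 hour


def add_pool_command(pool: str, limit_bytes: int, max_ttl: str) -> List[str]:
    """Command that creates a cache pool with a size limit and a maximum TTL."""
    return ["hdfs", "cacheadmin", "-addPool", pool,
            "-limit", str(limit_bytes), "-maxTtl", max_ttl]


def add_directive_command(path: str, pool: str, ttl: str) -> List[str]:
    """Command that asks HDFS to cache one path in the given pool."""
    return ["hdfs", "cacheadmin", "-addDirective",
            "-path", path, "-pool", pool, "-ttl", ttl]


# On a host with the Hadoop client installed, each list could be passed to
# subprocess.run(command, check=True); that call is not made here.
create_pool = add_pool_command(POOL_NAME, POOL_LIMIT_BYTES, MAX_TTL)
cache_file = add_directive_command("/archive/SH/SH_2019-11-09.tar", POOL_NAME, MAX_TTL)
```

Issuing the directive at access time, as the description suggests, means only recently requested files occupy the pool, and the TTL lets them expire without a separate eviction job.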

Specifically, referring to FIG. 4, the announcement data is retrieved. Because synchronization between Hadoop nodes is subject to network delay and batch processing takes a long time, a file can be retrieved from the HDFS either as the original file or from an archive file, depending on the actual situation. The specific steps are as follows:

(1) Confirm the data release date and the market parameter. The data release date is determined from the announcement file name in the download request.

(2) Determine whether the file has been archived according to the date and the market parameter. If it has been archived, go to step (3); if not, go to step (6).

(3) Calculate the path. Determine the path on the HDFS of the archive file to which the file belongs according to the date and the market parameter. If the archive file exists in the HDFS, continue; if it does not, jump to step (6).

(4) Acquire the file. Download the archive file to the local machine, extract the required file, and delete the local copy of the archive file.

(5) Store into the cache. Store the archive file into the cache area of the HDFS, and jump to step (9).

(6) Calculate the HDFS path of the file and check whether the file exists. If it does not, return 404 and jump to step (9); if it does, continue with the following steps.

(7) Store the file into the HDFS cache area.

(8) Return the file.

(9) End.
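The retrieval steps choose between two lookups: the archive for the given market and date, or the original file. The following sketch models the HDFS with an in-memory stand-in so the control flow can be followed end to end. The class `FakeHdfs`, both path layouts, the `is_archived` callable, the 200 status code and the 404 returned when an archive lacks the requested member are all assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set, Tuple


@dataclass
class FakeHdfs:
    """In-memory stand-in for the HDFS (hypothetical)."""
    files: Dict[str, bytes] = field(default_factory=dict)                # original files
    archives: Dict[str, Dict[str, bytes]] = field(default_factory=dict)  # archive members
    cached: Set[str] = field(default_factory=set)                        # cached paths


def archive_path(market: str, date: str) -> str:
    # Assumed layout: one archive per market and date.
    return f"/archive/{market}/{market}_{date}.tar"


def original_path(market: str, date: str, name: str) -> str:
    # Assumed layout for files that have not been archived.
    return f"/announcements/{market}/{date}/{name}"


def retrieve(hdfs: FakeHdfs, name: str, market: str, date: str,
             is_archived: Callable[[str, str], bool]) -> Tuple[int, Optional[bytes]]:
    if is_archived(market, date):                    # step (2)
        path = archive_path(market, date)            # step (3)
        if path in hdfs.archives:
            # Step (4): stands in for downloading the archive and extracting.
            data = hdfs.archives[path].get(name)
            hdfs.cached.add(path)                    # step (5)
            if data is None:
                return 404, None                     # assumption: member missing
            return 200, data
    path = original_path(market, date, name)         # step (6)
    if path not in hdfs.files:
        return 404, None
    hdfs.cached.add(path)                            # step (7)
    return 200, hdfs.files[path]                     # step (8)
```

Falling through to the original file when the archive path is absent covers the window in which a date counts as archived but the package has not yet reached the HDFS.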

It should be noted that the foregoing method embodiments are described as a series of acts or combinations for simplicity in explanation, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts or acts described, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.

Example 2

Referring to FIG. 5, FIG. 5 illustrates an aperiodic announcement storage device for the HDFS-based announcement data storage method. The device includes: an archiving module for archiving the listed-company announcement data; a synchronization module for synchronizing the announcement data from the NAS to the HDFS; a cache module for caching the archived data according to the characteristics of the announcement data; and a retrieval module for retrieving the announcement data according to a download request.

Claims

1. An HDFS-based announcement data storage method is characterized by comprising the following steps:

step S101: archiving the announcement data;

step S102: synchronizing the announcement data from network storage to the HDFS at a set period;

step S103: caching the archived announcement data;

step S104: retrieving the announcement data according to a download request.

2. The announcement data storage method as claimed in claim 1, wherein each announcement in the announcement data has two parameters, an archiving date and the market to which the announcement belongs; the folder to which the announcement belongs is determined according to the archiving date and the market; and the files required for archiving are initialized, including a task state file, a list of files to be downloaded, an error log and a packaging log, wherein the task state file includes the following three attributes:

the task state, whose possible values are "success" and "failure" and which initially has no value;

the task progress, whose possible values are "downloading", "packaging", "uploading" and "deleting HDFS files" and which initially has no value;

and the run count, which records the number of archiving rounds that have run and is initially 0.

3. The announcement data storage method according to claim 2, wherein the step S101 is as follows:

(2) initializing the files required for archiving the announcements;

(12) deleting the locally downloaded files and the compressed file;

(13) finishing the archiving.

4. The announcement data storage method according to claim 1, wherein the archiving is performed once a day.

5. The announcement data storage method according to claim 1, wherein the method for synchronizing the announcement data from the network storage to the HDFS comprises the following steps:

6. The announcement data storage method as claimed in claim 1, wherein a cache pool of a specified size is created in Hadoop and a maximum time to live is set for the cache pool, and when a file of the announcement data is accessed, the data file is stored in the cache pool.

7. The announcement data storage method according to claim 6, wherein the size of the cache pool is specified as 20G and the maximum time to live is set to 1 hour.

8. The announcement data storage method according to claim 1, wherein the method for retrieving the announcement data according to the download request is as follows:

(7) storing the file into the HDFS cache area.

9. A device for the HDFS-based announcement data storage method according to claim 1, comprising:

an archiving module for archiving the announcement data;

a synchronization module for synchronizing the announcement data from network storage to the HDFS;

a cache module for caching the archived data;

and a retrieval module for retrieving the announcement data according to a download request.