CN110866068B - Announcement data storage method and device based on HDFS

Info

Publication number: CN110866068B
Application number: CN201911090968.7A
Authority: CN
Inventors: 姚嘉华; 张晓军; 杨博
Original assignee: SSE INFONET Ltd
Current assignee: SSE INFONET Ltd
Priority date: 2019-11-09
Filing date: 2019-11-09
Publication date: 2024-02-02
Anticipated expiration: 2039-11-09
Also published as: CN110866068A

Abstract

The invention relates to the technical field of data storage, and in particular to an HDFS-based announcement data storage method and device. The method comprises the following steps: step S101: archiving the announcement data; step S102: synchronizing the announcement data from network storage to HDFS at a set period; step S103: caching the archived announcement data; step S104: retrieving the announcement data according to a download request. Compared with the prior art, the invention has the following advantage: by archiving massive announcement data, synchronizing the data from NAS (network-attached storage) to HDFS (Hadoop Distributed File System), designing a cache according to the characteristics of the announcement data, and retrieving the announcement data according to download requests, it solves the low efficiency of existing storage methods when handling the announcement data of listed companies.

Description

Announcement data storage method and device based on HDFS

Technical Field

The invention relates to the technical field of data storage, and in particular to an HDFS-based announcement data storage method and device.

Background

To ensure data security, current technology downloads data and stores it directly in HDFS. Existing data storage methods include efficient schemes for small files, but they are not designed around the characteristics of listed-company announcement data: they cannot cope with the massive number of download requests received every day, and they cannot guarantee efficient operation of the storage system.

The Shanghai and Shenzhen stock markets publish a huge volume of announcement data every day, covering the business-change information of all companies listed on the two markets. The announcements are varied in kind and complex in content: those of the Shanghai market are divided into 35 major categories and 376 minor categories (Shanghai Stock Exchange, 2013), including major-event announcements, trading-notice announcements, stock announcements, additional-issuance announcements and the like.

The growing number of announcements places increasing demands on data storage. The data is characterized by a large number of files, each smaller than an HDFS block. In the long term, massive numbers of small files impair the scalability and performance of HDFS. Meanwhile, the system faces massive numbers of announcement download requests every day, which poses a serious challenge to the performance of the whole storage system. How to store the announcement data of listed companies efficiently is therefore a technical problem to be solved.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing an HDFS-based announcement data storage method and device, solving the technical problem that storage efficiency is low because the prior art is not designed around the characteristics of listed-company announcement data.

To achieve the above object, an HDFS-based announcement data storage method is designed, comprising the following steps: step S101: archiving the announcement data; step S102: synchronizing the announcement data from network storage to HDFS at a set period; step S103: caching the archived announcement data; step S104: retrieving the announcement data according to a download request.

Preferably, each announcement in the announcement data has two parameters: an archive date and the market to which it belongs. The folder to which the announcement belongs is determined from the archive date and the market, and the files required for archiving are initialized: a task state file, a list of files to be downloaded, an error log and a packaging log. The task state file contains the following three attributes: the attribute task state, whose possible values are success and failure, with no initial value; the attribute task progress, whose possible values are download, package, upload and delete HDFS files, with no initial value; and the attribute run count, which records the number of archiving rounds run, with an initial value of 0.
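As an illustration only, the task state file and the other working files could be initialized as in the following minimal sketch. The JSON encoding, the file names (`task_state.json`, `download.list`, `error.log`, `package.log`) and the `root/market/date` folder layout are assumptions made for this example; the source names the files and attributes but does not specify their format.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Optional


@dataclass
class TaskState:
    task_status: Optional[str] = None    # "success" or "failure"; no initial value
    task_progress: Optional[str] = None  # download / package / upload / delete; no initial value
    run_count: int = 0                   # number of archiving rounds run so far


def archive_folder(root: Path, market: str, archive_date: str) -> Path:
    """Folder for one (market, archive date) pair; this layout is an assumption."""
    return root / market / archive_date


def init_archive_files(folder: Path) -> TaskState:
    """Create the working files if they are missing and return the stored state."""
    folder.mkdir(parents=True, exist_ok=True)
    # Hypothetical file names for the download list, error log and packaging log.
    for name in ("download.list", "error.log", "package.log"):
        (folder / name).touch(exist_ok=True)
    state_path = folder / "task_state.json"
    if not state_path.exists():
        state_path.write_text(json.dumps(asdict(TaskState())))
    return TaskState(**json.loads(state_path.read_text()))
```

Calling `init_archive_files` a second time on the same folder leaves an existing state file untouched, which is what allows a later run to resume from a recorded failure.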

Preferably, step S101 is specifically as follows:

(1) Read the two parameters of the announcement: the archive date and the market to which it belongs;

(2) Initialize the files required for archiving the announcements;

(3) Read the task state file of the announcement. If the value of the attribute task state is "success", go to step (13); if it is "failure", go to step (4); if it has no value, set the attribute task progress to "download" and go to step (5);

(4) Read the value of the attribute task progress. If the value is "download", "package" or "upload", reset the attribute task progress to "download"; otherwise, set it to "delete HDFS files";

(5) Proceed according to the value of the attribute task progress: if it is "download", go to step (6); if it is "package", go to step (7); if it is "upload", go to step (8); if it is "delete HDFS files", go to step (9);

(6) Download the announcement files in sequence according to the list of files to be downloaded; if the download succeeds, continue to the next step; if it fails, jump to step (11);

(7) Perform the packaging operation, generating a package file named after the market and the archive date; if successful, continue to the next step; on failure, jump to step (11);

(8) Upload the file to HDFS; if successful, continue to the next step; on failure, jump to step (11);

(9) Delete the files on HDFS in sequence; if successful, continue to the next step; on failure, jump to step (11);

(10) After successful execution, set the task state in the task state file to "success" and go to step (12); if execution fails, jump to step (11);

(11) Add 1 to the value of the attribute run count. If run_count > limit N, set the task state in the task state file to "failure" and continue to the next step; if run_count < limit N, jump to step (5), where limit N is an upper limit set at implementation;

(12) Delete the locally downloaded files and the compressed file;

(13) Archiving ends.

Preferably, the archiving operation is performed once a day.

Preferably, the method for synchronizing the announcement data from network storage to HDFS is specifically as follows:

(1) Filter the announcement files by the archive-date parameter to be synchronized and compute the task's list of files to synchronize, target_file.list; generate a failed-file list, failed_file.list, which is initially an empty file; meanwhile, check whether the completed-file list completed_file.list exists, creating it if it does not and otherwise doing nothing;

(2) Start synchronization and record task start information: record the start time of the current synchronization task and the expected number and names of the files to be synchronized;

(3) Read a file and determine whether it has already been synchronized: take files from target_file.list in sequence and check whether the file is present in completed_file.list; if so, record the time and the file name, mark the entry "skip", and jump to step (6); otherwise, continue;

(4) Perform synchronization: synchronize the file from the local NAS to the remote HDFS;

(5) Determine whether synchronization succeeded: if the exit code returned by rsync is 0, write the file name to completed_file.list and record the time and the file name in rsync.log, marked "synchronization successful"; if it is not 0, write the file name to failed_file.list and record the time and the file name in rsync.log, marked with the failure and the error code;

(6) Check whether target_file.list has been read completely; if not, jump to step (3); if so, continue;

(7) Record synchronization end information: record in rsync.log the end time and the time taken by this synchronization, the number of files synchronized, the number of successful files and the number of failed files.

Preferably, a cache pool with a specified capacity is created in Hadoop and a maximum time-to-live is set for it; when a file of the announcement data is accessed, the data file is stored in the cache pool.

Preferably, the capacity of the cache pool is set to 20G and the maximum time-to-live is set to 1 hour.

Preferably, the method for retrieving the announcement data according to a download request is specifically as follows:

(1) Confirm the archive date of the announcement and the market to which it belongs;

(2) Determine from the archive date and the market whether the file has been archived; if it has, jump to step (3); if not, jump to step (6);

(3) Compute the path: determine from the archive date and the market the HDFS path of the archive file to which the file belongs; if the archive file exists in HDFS, continue; if not, jump to step (6);

(4) Obtain the file: download the archive file locally, extract the required file, and delete the archive file;

(5) Store in the cache: store the archive file in the HDFS cache, and jump to step (9);

(6) Compute the HDFS path of the file and check whether the file exists; if not, return 404 and jump to step (9); if so, continue with the following steps;

(7) Store the file in the HDFS cache.

The invention also relates to a device for the HDFS-based announcement data storage method, comprising: an archiving module for archiving the announcement data; a synchronization module for synchronizing the announcement data from network storage to HDFS; a caching module for caching the archived data; and a retrieval module for retrieving the announcement data according to a download request.

Compared with the prior art, the invention has the following advantage: by archiving massive announcement data, synchronizing the data from NAS (network-attached storage) to HDFS (Hadoop Distributed File System), designing a cache according to the characteristics of the announcement data, and retrieving the announcement data according to download requests, it solves the low efficiency of existing storage methods when handling the announcement data of listed companies.

Drawings

FIG. 1 is a schematic flow chart of the present invention;

FIG. 2 is a flow chart of archiving announcement data in the present invention;

FIG. 3 is a flow chart of synchronizing announcement data from network storage to HDFS in the present invention;

FIG. 4 is a flow chart of retrieving announcement data according to a download request in the present invention;

FIG. 5 is a schematic diagram of a device for the HDFS-based announcement data storage method according to the present invention.

Detailed Description

The construction and principles of such methods and apparatus will be readily apparent to those skilled in the art from the following description of the invention taken in conjunction with the accompanying drawings. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Example 1

This embodiment provides an HDFS-based announcement data storage method. It should be noted that although the flowchart shows a logical order, in some cases the steps shown or described may be performed in a different order. Fig. 1 shows a schematic flow chart of the method; referring to fig. 1, the method comprises the following steps:

S101: archiving the announcement data;

S102: synchronizing the announcement data from NAS to HDFS;

S103: caching the archived announcement data according to the characteristics of the announcement data;

S104: retrieving the announcement data according to a download request.

As described in the Background, the prior art is not designed around the characteristics of listed-company announcement data, so the whole storage system is inefficient and cannot meet demand. To address these problems, referring to fig. 1, this embodiment first archives the announcement data; meanwhile, it periodically synchronizes the announcement data from NAS to HDFS; it caches the archived announcement data according to the characteristics of listed-company announcement data; and it retrieves the announcement data according to download requests. The efficiency of the storage system for listed-company announcement data is thereby improved and the defects of the prior art are overcome.

Specifically, referring to fig. 2, the steps for archiving the announcement data are as follows:

(1) Determine the two parameters of each announcement: the archive date and the market to which it belongs. Optionally, each day is taken as an archive date.

(2) Initialize the files. The announcement folder is determined from the archive date and the market to which the announcement belongs, and the files required for archiving are initialized: the task state file, the list of files to be downloaded, the error log and the packaging log. The task state file contains three attributes. The attribute task state (hereinafter task_status) may take the values success and failure; the attribute task progress (hereinafter task_progress) may take the values download, package, upload and delete HDFS files; the attribute run count (hereinafter run_count) records the number of archiving rounds run, with an initial value of 0.

(3) Read the task state file. If task_status is "success", jump to step (13); if it is "failure", jump to step (4); if task_status has no value, set task_progress to "download" and jump to step (5).

(4) Read the value of task_progress. If the value is "download", "package" or "upload", reset task_progress to "download"; if the value is not one of (download, package, upload), set it to "delete HDFS files".

(5) Proceed according to the value of task_progress. If it is "download", jump to step (6); if it is "package", jump to step (7); if it is "upload", jump to step (8); if it is "delete HDFS files", jump to step (9).

(6) If task_progress is "download", download the announcement files in sequence according to the list of files to be downloaded; if successful, continue to the next step; on failure, jump to step (11).

(7) Set task_progress to "package" and perform the packaging operation, generating a package file named after the market and the date; if successful, continue to the next step; on failure, jump to step (11).

(8) Set task_progress to "upload" and perform the upload operation, uploading the file to HDFS; if successful, continue to the next step; on failure, jump to step (11).

(9) Set task_progress to "delete HDFS files" and delete the files on HDFS in sequence according to the file list; if successful, continue to the next step; on failure, jump to step (11).

(10) Set task_status to "success" and go to step (12).

(11) Increment run_count by 1. If run_count > limit N (limit N is an upper limit whose value is chosen by the engineer according to the actual situation), set task_status to "failure" and continue to the next step; if run_count < limit N, go to step (5).

(12) Delete the locally downloaded files and the tar file.

(13) Archiving ends.
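The control flow of steps (3) to (13) can be sketched as a small resumable state machine. This is a minimal illustration under stated assumptions, not the patented implementation: the stage names in `STAGES`, the `actions` callables (each returning `True` on success) and the `cleanup` callable are hypothetical. The text does not say what happens when run_count equals limit N; the sketch assumes a retry in that case.

```python
from typing import Callable, Dict

# Hypothetical spellings of the four task_progress values.
STAGES = ["download", "package", "upload", "delete_hdfs_files"]


def run_archive(state: Dict[str, object],
                actions: Dict[str, Callable[[], bool]],
                cleanup: Callable[[], None],
                limit_n: int) -> Dict[str, object]:
    """Run one archiving task; state holds task_status, task_progress, run_count."""
    if state.get("task_status") == "success":       # step (3): already archived
        return state                                # step (13)
    if state.get("task_status") == "failure":       # step (4): resume after a failure
        if state.get("task_progress") in ("download", "package", "upload"):
            state["task_progress"] = "download"
        else:
            state["task_progress"] = "delete_hdfs_files"
    else:                                           # step (3): first run
        state["task_progress"] = "download"

    index = STAGES.index(state["task_progress"])    # step (5): dispatch on progress
    while index < len(STAGES):
        state["task_progress"] = STAGES[index]
        if actions[STAGES[index]]():                # steps (6) to (9)
            index += 1
            continue
        state["run_count"] = int(state.get("run_count", 0)) + 1   # step (11)
        if state["run_count"] > limit_n:
            state["task_status"] = "failure"
            break
        # Assumption: run_count == limit_n also retries the same stage.
    else:
        state["task_status"] = "success"            # step (10)
    cleanup()                                       # step (12): remove local files
    return state
```

Because a failed stage leaves `task_progress` unchanged, a retry within the same run resumes at the stage that failed, while a later run that finds `task_status` set to "failure" restarts from the download or the HDFS deletion as step (4) prescribes.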

Specifically, referring to fig. 3, the announcement data is synchronized from NAS to HDFS. Synchronization of the announcement data must be timely and accurate; the specific steps are as follows:

(1) Write a synchronization program and generate the files needed for synchronization. The synchronization program filters the announcement files by the date parameter to be synchronized and computes the task's list of files to synchronize, target_file.list; it also checks whether the completed-file list exists, creating it if it does not and otherwise doing nothing.

(2) Start synchronization and record task start information. Start the synchronization program and record in rsync.log the start time of the synchronization task and the expected number and names of the files to be synchronized.

(3) Read a file and determine whether it has already been synchronized. Take a file from target_file.list and check whether it is present in completed_file.list; if so, record the time, a notice and the file name in rsync.log, mark the entry "skip", and jump to step (6); otherwise, continue.

(4) Perform synchronization. The file is synchronized from the local NAS to the remote HDFS using rsync.

(5) Determine whether synchronization succeeded. If the exit code returned by rsync is 0, write the file name to completed_file.list and record the time, a notice and the file name in rsync.log, marked "synchronization successful"; if it is not 0, write the file name to failed_file.list and record the time, an error and the file name in rsync.log, marked with the failure and the error code.

(6) Check whether target_file.list has been read completely. If not, jump to step (3); if so, continue.

(7) Record synchronization end information. Record in rsync.log the time, a notice, the time taken by this synchronization, the number of files synchronized, the number of successful files and the number of failed files.

(8) End.
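The bookkeeping in these steps (skip files already completed, record failures, log every outcome) can be sketched as follows. This is a minimal sketch, not the patent's program: the `copy` callable stands in for the rsync invocation and is assumed to return its exit code, and the log line layout and the `INFO`/`ERROR` labels are assumptions chosen for the example.

```python
import time
from pathlib import Path
from typing import Callable, Dict, List


def sync_files(target_files: List[str], work_dir: Path,
               copy: Callable[[str], int]) -> Dict[str, int]:
    """Synchronize each target file once, resuming from completed_file.list."""
    completed_path = work_dir / "completed_file.list"
    failed_path = work_dir / "failed_file.list"
    log_path = work_dir / "rsync.log"
    completed_path.touch(exist_ok=True)      # step (1): create only if missing
    failed_path.write_text("")               # step (1): failed list starts empty
    done = set(completed_path.read_text().splitlines())

    counts = {"synced": 0, "skipped": 0, "failed": 0}
    start = time.time()
    with log_path.open("a") as log:
        def record(level: str, name: str, mark: str) -> None:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} {level} {name} {mark}\n")

        record("INFO", f"{len(target_files)} files", "start")    # step (2)
        for name in target_files:                                 # steps (3) and (6)
            if name in done:
                record("INFO", name, "skip")
                counts["skipped"] += 1
                continue
            code = copy(name)                                     # step (4)
            if code == 0:                                         # step (5): success
                with completed_path.open("a") as f:
                    f.write(name + "\n")
                done.add(name)
                record("INFO", name, "synced")
                counts["synced"] += 1
            else:                                                 # step (5): failure
                with failed_path.open("a") as f:
                    f.write(name + "\n")
                record("ERROR", name, f"failed code={code}")
                counts["failed"] += 1
        record("INFO", f"{time.time() - start:.1f}s {counts}", "end")   # step (7)
    return counts
```

Keeping the completed list across runs makes a rerun idempotent: only files that failed or were never attempted are copied again.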

Specifically, the archived announcement data is cached according to the characteristics of the announcement data. A cache pool with a specified capacity is created in Hadoop and a maximum time-to-live is set for it; when a file of the announcement data is accessed, the data file is stored in the cache pool.

Preferably, the capacity of the cache pool is set to 20G, and the maximum time-to-live may be set to 1 hour.
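One way such a pool might be realized is with HDFS centralized cache management. The source does not name the mechanism, so treating the pool as an `hdfs cacheadmin` cache pool is an assumption, as are the pool name, the example path and the reading of 20G as 20 GiB. The sketch below only builds the command lines; it does not run them.

```python
from typing import List


def cache_pool_commands(pool: str, path: str,
                        limit_bytes: int = 20 * 1024**3,
                        max_ttl: str = "1h") -> List[List[str]]:
    """Build (not run) commands that create a cache pool and cache one path."""
    return [
        # Create the pool with a byte limit and a maximum time-to-live.
        ["hdfs", "cacheadmin", "-addPool", pool,
         "-limit", str(limit_bytes), "-maxTtl", max_ttl],
        # Cache an accessed file or directory in that pool for the same period.
        ["hdfs", "cacheadmin", "-addDirective",
         "-path", path, "-pool", pool, "-ttl", max_ttl],
    ]
```

Each returned list can be handed to `subprocess.run` on a host with an HDFS client; the second command would be issued whenever an announcement file is accessed.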

Specifically, referring to fig. 4, the announcement data is retrieved. Because synchronization between Hadoop nodes is subject to network latency and batch processing takes a long time, a file looked up in HDFS may be retrieved from either the original files or the archive files, depending on the actual situation. The specific steps are as follows:

(1) Confirm the release date of the data and the market to which it belongs. The release date is determined from the announcement file name in the download request.

(2) Determine from the date and the market whether the file has been archived; if it has, jump to step (3); if not, jump to step (6).

(3) Compute the path. Determine from the date and the market the HDFS path of the archive file to which the file belongs; if the archive file exists in HDFS, continue; if not, jump to step (6).

(4) Obtain the file. Download the archive file locally, extract the required file, and delete the archive file.

(5) Store in the cache. Store the archive file in the HDFS cache, and jump to step (9).

(6) Compute the HDFS path of the file and check whether the file exists; if not, return 404 and jump to step (9); if so, continue with the following steps.

(7) Store the file in the HDFS cache.

(8) Return the file.

(9) End.
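The lookup order (archive first, original file as fallback, 404 otherwise) can be sketched as below. This is a hedged illustration: the path layouts, the HTTP-style status codes in the return value and every injected callable (`is_archived`, `exists`, `fetch_from_archive`, `fetch_file`, `cache`) are hypothetical stand-ins for HDFS operations. Downloading the archive, extracting one member and deleting the downloaded copy are assumed to happen inside `fetch_from_archive`.

```python
from typing import Callable, Optional, Tuple


def retrieve(name: str, market: str, date: str, *,
             is_archived: Callable[[str, str], bool],
             exists: Callable[[str], bool],
             fetch_from_archive: Callable[[str, str], bytes],
             fetch_file: Callable[[str], bytes],
             cache: Callable[[str], None]) -> Tuple[int, Optional[bytes]]:
    """Return (status, content) for one announcement download request."""
    if is_archived(market, date):                            # step (2)
        # Step (3): hypothetical archive layout, named by market and date.
        archive = f"/archive/{market}/{market}_{date}.tar"
        if exists(archive):
            data = fetch_from_archive(archive, name)         # step (4)
            cache(archive)                                   # step (5)
            return 200, data
    # Step (6): hypothetical layout for the original, unarchived file.
    path = f"/announcements/{market}/{date}/{name}"
    if not exists(path):
        return 404, None
    cache(path)                                              # step (7)
    return 200, fetch_file(path)                             # step (8)
```

Falling through to the original file when the archive is missing covers the window in which a date is marked archived but the archive has not yet reached every node.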

It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.

Example 2

Referring to fig. 5, fig. 5 illustrates an aperiodic announcement storage device used for the HDFS-based announcement data storage method. The device comprises: an archiving module for archiving the announcement data of listed companies; a synchronization module for synchronizing the announcement data from NAS to HDFS; a caching module for caching the archived data according to the characteristics of the announcement data; and a retrieval module for retrieving the announcement data according to a download request.

Claims

1. An HDFS-based announcement data storage method, characterized by comprising the following steps:

step S101: archiving the announcement data;

step S102: synchronizing the announcement data from network storage to HDFS at a set period;

(7) Recording synchronization end information: recording in rsync.log the end time and the time taken by the synchronization, the number of files synchronized, the number of successful files and the number of failed files;

step S103: caching the archived announcement data;

step S104: retrieving the announcement data according to a download request.

2. The announcement data storage method according to claim 1, wherein each announcement in the announcement data has two parameters, an archive date and the market to which it belongs; the folder to which the announcement belongs is determined from the archive date and the market; and the files required for archiving are initialized, including a task state file, a list of files to be downloaded, an error log and a packaging log, wherein the task state file includes three attributes:

the attribute task state, whose possible values are success and failure, with no initial value;

the attribute task progress, whose possible values are download, package, upload and delete HDFS files, with no initial value;

and the attribute run count, which records the number of archiving rounds run, with an initial value of 0.

3. The announcement data storage method according to claim 2, wherein step S101 is specifically as follows:

(2) Initializing the files required for archiving the announcements;

(12) Deleting the locally downloaded files and the compressed file;

(13) Archiving ends.

4. The announcement data storage method according to claim 1, wherein archiving is performed once a day.

5. The method of claim 1, wherein a cache pool with a specified capacity is created in Hadoop and a maximum time-to-live is set for the cache pool, and the data file is stored in the cache pool when a file of the announcement data is accessed.

6. The method of claim 5, wherein the capacity of the cache pool is set to 20G and the maximum time-to-live is set to 1 hour.

7. The announcement data storage method according to claim 1, characterized in that the method of retrieving the announcement data according to a download request is as follows:

(7) Storing the file in the HDFS cache.

8. A device for the HDFS-based announcement data storage method of claim 1, comprising:

an archiving module for archiving the announcement data;

a synchronization module for synchronizing the announcement data from network storage to HDFS;

a caching module for caching the archived data;

and a retrieval module for retrieving the announcement data according to a download request.