CN114185484A - Method, device, equipment and medium for clustering document storage

Info

Publication number: CN114185484A
Application number: CN202111297292.6A
Authority: CN
Inventors: 张辉; 吴桂荣
Original assignee: Fujian Centerm Information Co Ltd
Current assignee: Fujian Centerm Information Co Ltd
Priority date: 2021-11-04
Filing date: 2021-11-04
Publication date: 2022-03-15

Abstract

The invention provides a method, an apparatus, a device and a medium for document storage clustering. The method comprises a file storage process and a file downloading process. In the file storage process, a file uploaded by a user is received through an nginx server and forwarded, according to a load balancing principle, to any one document system server in a back-end document system server cluster; the file is stored across the Hadoop servers of a Hadoop server cluster in the form of distributed data blocks, and the number, order and storage addresses of the data blocks are recorded by a NameNode component. In the file downloading process, a file download request from a user is received through the nginx server and forwarded, according to the load balancing principle, to any one document system server in the document system server cluster; the document system server requests the file from the Hadoop server cluster, obtains all data blocks of the file according to the storage address of each data block, reassembles the file and returns it to the user.

Description

Method, device, equipment and medium for clustering document storage

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for clustering document storage.

Background

Conventionally, a document is stored directly in the local storage of a single server, so there is only one copy of the file (the file itself). If the server goes down or its storage is damaged, the file is inevitably lost, which may cause irreparable consequences.

To address this, a hot standby server is commonly deployed to synchronize the file data of the main server in real time, for example by using rsync to compare and synchronize files, so that the standby server can take over from the main server and provide file data services once the main server becomes unavailable or its storage is damaged. However, the so-called real-time master-slave synchronization of the prior art is not truly synchronous: synchronization happens at intervals, so data written between two synchronizations can still be lost, and the master-slave mode also increases the cost of data maintenance.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a method, an apparatus, a device and a medium for document storage clustering, in which files are stored as data blocks fully distributed across at least two machines of a Hadoop server cluster, each data block has at least two copies, no data synchronization is required, and distributed storage of data is realized in software.

In a first aspect, the present invention provides a method for clustering document storage, including:

a file storage process: receiving a file uploaded by a user through the nginx server, and forwarding the file to any one document system server in the back-end document system server cluster according to a load balancing principle; the document system server stores the file into a Hadoop server cluster, where the file is stored in distributed fashion as data blocks on the Hadoop servers of the cluster, the number of data blocks, the order of the data blocks and their storage addresses are recorded by a NameNode component, and each data block is backed up on at least two Hadoop servers at the same time;

a file downloading process: displaying a file list through the nginx server, receiving a file download request from a user, and forwarding the request to any one document system server in the document system server cluster according to the load balancing principle; the document system server requests the file from the Hadoop server cluster, obtains all data blocks of the file from the Hadoop servers in a streaming manner according to the storage address of each data block recorded by the NameNode component, reassembles the file according to the number and order of the data blocks, and returns the file to the user through the nginx server.

In a second aspect, the present invention provides a document storage clustering apparatus, including:

an nginx service module, used for receiving a file uploaded by a user through the nginx server, displaying a file list through the nginx server, receiving a file download request from the user, and forwarding the file download request to any one document system server in the back-end document system server cluster according to a load balancing principle;

a document service module, used for, when a file is uploaded, storing the file in distributed fashion as data blocks on the Hadoop servers of a Hadoop server cluster and recording the number of data blocks, the order of the data blocks and their storage addresses through a NameNode component, each data block being backed up on at least two Hadoop servers at the same time; and, when a file is downloaded, having the document system server request the file from the Hadoop server cluster, obtain all data blocks of the file from the Hadoop servers in a streaming manner according to the storage address of each data block recorded by the NameNode component, reassemble the file according to the number and order of the data blocks, and return the file to the user through the nginx server.

In a third aspect, the present invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of the first aspect when executing the program.

In a fourth aspect, the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method of the first aspect.

One or more technical solutions provided in the embodiments of the present invention have at least the following technical effects or advantages: files are stored in distributed fashion as data blocks on the Hadoop servers of the Hadoop server cluster, so that file storage is clustered, which improves the concurrency and throughput the system can offer its users and thus its performance; distributed storage of the files also ensures high availability and integrity of the data; because the data blocks are distributed, a file cannot be read directly on the Hadoop server cluster and can only be read through the document system server according to the storage addresses, which improves data security; and each data block is backed up on a plurality of Hadoop servers, which further improves the security and availability of the data.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

The invention is further described below by way of embodiments with reference to the accompanying drawings.

FIG. 1 is a schematic block diagram of the system of the present invention;

FIG. 2 is a flow chart of a method according to one embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an apparatus according to a second embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a medium according to a fourth embodiment of the present invention.

Detailed Description

The embodiments of the present application provide a method, an apparatus, a device and a medium for document storage clustering, in which files are stored as data blocks fully distributed across at least two machines of a Hadoop server cluster, each data block has at least two copies, no data synchronization is required, and files cannot be read directly on the Hadoop server cluster, which helps protect the data.

The general idea of the technical solution in the embodiments of the present application is as follows: files are stored in distributed fashion as data blocks on the Hadoop servers of the Hadoop server cluster, so that file storage is clustered, which improves the concurrency and throughput the system can offer its users and thus its performance; distributed storage of the files also ensures high availability and integrity of the data; because the data blocks are distributed, a file cannot be read directly on the Hadoop server cluster and can only be read through the document system server, which improves data security; and each data block is backed up on a plurality of Hadoop servers, which further improves the security and availability of the data.

Before the specific embodiments are described, the system framework corresponding to the method of the embodiments of the present application is introduced. As shown in FIG. 1, the system is divided into the following parts:

The nginx server receives all user requests and forwards them to the back-end document servers.

The document system server cluster comprises at least two document servers, handles the users' day-to-day requests, carries the file operation services, and connects to the data servers required by each service, such as the Hadoop server cluster, the database servers, the document conversion server cluster and the redis cache server cluster.

The Hadoop server cluster comprises at least two Hadoop servers and is used for storing file data blocks, ensuring distributed storage of files and high availability of the Hadoop cluster. Hadoop is a distributed system infrastructure developed by the Apache Software Foundation. It lets users develop distributed programs without knowing the underlying details of the distributed system, and makes full use of the cluster for high-speed computation and storage. One component of Hadoop is HDFS (Hadoop Distributed File System), a distributed file system. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware; it provides high-throughput access to application data and is suitable for applications with large data sets. HDFS relaxes some POSIX (Portable Operating System Interface) requirements so that data in the file system can be accessed in a streaming manner.

The database main and standby system is used for storing the users' service data. Two servers are deployed, a main database server and a standby database server, so as to avoid a single point of failure and ensure high availability.

The document conversion server cluster may comprise one or more document conversion servers and is used for deploying the document conversion service that users need in order to preview files.

The redis cache server cluster is used for sharing user sessions across the cluster. It comprises at least two redis cache servers and stores user session information, so that whichever document system server a user's request is forwarded to, that server can read the user's session information from the same redis cache server cluster, and the user is not forced to log in again. Because the system of the invention is a clustered, distributed architecture, each request from a user may be forwarded by the nginx server to a different document server. If a user logged in and was verified on document server A, and the session information were stored directly on document server A, then a subsequent request sent to document server B would fail verification, because document server B does not hold that session information, and a re-login prompt would be returned. The session information of a logged-in user is therefore stored in the shared redis server cluster, and every document server obtains the user's session information from that cluster; sessions are thus shared across the cluster and users are not forced to log in again.

Example one

As shown in fig. 2, the present embodiment provides a method for clustering document storage, including:

A file storage process: receiving a file uploaded by a user through the nginx server, and forwarding the file to any one document system server in the back-end document system server cluster according to a load balancing principle; the document system server stores the file into a Hadoop server cluster, where the file is stored in distributed fashion as data blocks on the Hadoop servers of the cluster, the number of data blocks, the order of the data blocks and their storage addresses are recorded by a NameNode component, and each data block is backed up on a plurality of Hadoop servers at the same time. Therefore, when a storage server goes down, its data blocks can still be read from other servers, and when the data on a server is damaged, backups can be found on other servers for recovery.
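The storage process above can be illustrated with a minimal in-memory sketch. It is not the HDFS implementation: the names `BlockRecord`, `NameNodeSketch`, `store_file` and `read_file` are hypothetical, each Hadoop server is modelled as a plain dictionary, and the rotating placement of copies is an assumption made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BlockRecord:
    index: int          # position of the block within the file
    servers: list[str]  # names of the servers that hold a copy


@dataclass
class NameNodeSketch:
    """Toy stand-in for NameNode metadata: file name -> ordered block records."""
    files: dict[str, list[BlockRecord]] = field(default_factory=dict)


def store_file(name, data, block_size, servers, namenode, replicas=2):
    """Split data into blocks and place each block on `replicas` distinct servers."""
    if replicas < 2 or replicas > len(servers):
        raise ValueError("need at least two replicas and enough servers")
    names = sorted(servers)
    records = []
    for index, start in enumerate(range(0, len(data), block_size)):
        block = data[start:start + block_size]
        # Rotate through the servers so that copies land on different machines.
        targets = [names[(index + k) % len(names)] for k in range(replicas)]
        for target in targets:
            servers[target][(name, index)] = block
        records.append(BlockRecord(index, targets))
    namenode.files[name] = records


def read_file(name, servers, namenode):
    """Fetch every block in recorded order from any server that still has a copy."""
    parts = []
    for record in sorted(namenode.files[name], key=lambda r: r.index):
        for target in record.servers:
            block = servers.get(target, {}).get((name, record.index))
            if block is not None:
                parts.append(block)
                break
        else:
            raise IOError(f"block {record.index} of {name} is unavailable")
    return b"".join(parts)
```

In this sketch, removing one server from the `servers` dictionary after `store_file` has run still lets `read_file` rebuild the file, because every block has a second copy on another server.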

A file downloading process: displaying a file list through the nginx server, receiving a file download request from a user, and forwarding the request to any one document system server in the document system server cluster according to the load balancing principle; the document system server requests the file from the Hadoop server cluster, obtains all data blocks of the file from the Hadoop servers in a streaming manner according to the storage address of each data block recorded by the NameNode component, reassembles the file according to the number and order of the data blocks, and returns the file to the user through the nginx server. The load balancing principle is the nginx load balancing principle, which has four configurations: IP hash, polling (round robin), weight, and least connections; the configuration can be modified manually.
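The four load balancing configurations named above can be sketched as simple selection policies. The classes below are hypothetical illustrations of the general ideas only; nginx's own implementations differ in detail, and the use of CRC32 as the hash is an assumption chosen so that results are repeatable across runs.

```python
import itertools
import zlib


class RoundRobin:
    """Polling: hand requests to the servers in turn."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self, client_ip=None):
        return next(self._cycle)


class Weighted:
    """Weight: a server with weight w is chosen w times per cycle."""

    def __init__(self, weights):
        expanded = [s for s, w in weights.items() for _ in range(w)]
        self._cycle = itertools.cycle(expanded)

    def pick(self, client_ip=None):
        return next(self._cycle)


class IpHash:
    """IP hash: the same client address always maps to the same server."""

    def __init__(self, servers):
        self._servers = list(servers)

    def pick(self, client_ip):
        # crc32 is deterministic across runs, unlike the built-in hash() for str.
        return self._servers[zlib.crc32(client_ip.encode()) % len(self._servers)]


class LeastConnections:
    """Least connections: choose the server with the fewest active requests."""

    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self, client_ip=None):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call when a request finishes so the count reflects live connections.
        self.active[server] -= 1
```

Of the four, only IP hash keeps a given user on one document server; the other three may send consecutive requests from the same user to different servers, which is why the session information has to be shared.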

As a more preferred or specific implementation manner of this embodiment, the method further includes a data block definition process, a user session management process, a file preview service process, and a data recovery process.

The data block definition process further includes:

(1) a manual configuration process: the nginx server provides a manual configuration interface through which a user manually configures the number of data blocks of a file, and the data block size is the quotient of the file data size and the number of data blocks. For example, if the number of manually configured data blocks is 6 and the data block size works out to 5 MB, the file is divided into 6 data blocks of equal size for storage.

(2) an automatic configuration process: the data block size is set automatically according to a balance principle between the data transmission time and the disk addressing time. The balance principle is that the addressing time is 1% of the data transmission time, that is, the optimal data transmission time $T_c$ is 100 times the average addressing time $T_x$. The data block size is calculated as follows:

$$\text{data block size} = T_c \times V_c = 100 \times T_x \times V_c$$

where $V_c$ is the prevailing data transmission speed.

In general, the number of data blocks of a file can be configured manually. However, the addressing speed and the transmission speed are both affected by the data block size. If the data block is too large, the time to transfer the data from disk is much longer than the addressing time, so a program processes each block very slowly. If the data block is too small, then on the one hand a file is split into a large number of small blocks, whose metadata occupies a large amount of memory in the Hadoop NameNode component, and the NameNode's memory is limited, so this is undesirable; on the other hand, the addressing time increases, so the program spends much of its time locating the start of blocks. The block size is therefore made large enough to keep the addressing overhead small, so that the time to transfer a file made up of several blocks depends mainly on the transfer speed of the disk. This is why the automatic configuration process sets the size according to the balance between data transmission time and disk addressing time.

For example, the average addressing time $T_x$ in Hadoop's HDFS is approximately 10 ms. Extensive testing found that the optimum is reached when the addressing time is 1% of the transmission time, so the optimal transmission time $T_c$ is 10 ms / 0.01 = 1000 ms = 1 s. With a prevailing disk transmission speed $V_c$ of 100 MB/s, the optimal block size is 100 MB/s × 1 s = 100 MB, and the block size can be set to 128 MB.
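Both ways of defining the block size reduce to short arithmetic, sketched below. The function names are hypothetical. Rounding the manual result up to a whole byte and rounding 100 MB up to the next power of two to reach 128 MB are assumptions made for illustration; the text does not state a rounding rule.

```python
MB = 1024 * 1024


def manual_block_size(file_size_bytes, block_count):
    """Manual mode: block size is the file size divided by the configured count."""
    if block_count <= 0:
        raise ValueError("block_count must be positive")
    # Round up so that block_count blocks always cover the whole file.
    return -(-file_size_bytes // block_count)


def automatic_block_size(seek_time_s, transfer_rate_bytes_per_s, ratio=0.01):
    """Automatic mode: pick the size whose transfer time is seek_time / ratio."""
    transfer_time_s = seek_time_s / ratio
    return transfer_time_s * transfer_rate_bytes_per_s


def next_power_of_two(n):
    """Smallest power of two that is >= n (assumed rounding step, see lead-in)."""
    n = int(n)
    return 1 if n <= 1 else 1 << (n - 1).bit_length()
```

With the figures from the text, `automatic_block_size(0.010, 100 * MB)` corresponds to 100 MB, and `next_power_of_two(100)` gives 128. As an illustrative manual case, `manual_block_size(30 * MB, 6)` corresponds to 5 MB per block.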

User session management process: before a user uploads or downloads a file, a login request from the user is received through the nginx server and forwarded to any one document system server in the back-end document system server cluster according to the load balancing principle; the document system server obtains the user information from the database main and standby system for verification; if verification succeeds, the user's login information is stored in the redis server cluster as the user's session credential, and a login success message is returned. After a successful login, the user's file list is typically obtained through the Jackrabbit component and displayed so that the user can select files to download.

The redis server cluster shares user sessions across the cluster: whichever document system server the nginx server forwards a user's request to, that document system server can read the user's session information from the same redis cache server cluster.
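The effect of a shared session store can be shown with a small sketch. Here a plain in-memory object stands in for the redis server cluster and a dictionary stands in for the database lookup; `SharedSessionStore`, `DocumentServer` and the token format are hypothetical names introduced only for this illustration.

```python
import secrets


class SharedSessionStore:
    """In-memory stand-in for the shared redis cluster (hypothetical interface)."""

    def __init__(self):
        self._sessions = {}

    def put(self, token, user):
        self._sessions[token] = user

    def get(self, token):
        return self._sessions.get(token)


class DocumentServer:
    def __init__(self, name, store, credentials):
        self.name = name
        self.store = store              # the same store object on every server
        self.credentials = credentials  # stands in for the database lookup

    def login(self, user, password):
        # A real system would compare password hashes, not plain text.
        if self.credentials.get(user) != password:
            return None
        token = secrets.token_hex(16)
        self.store.put(token, user)
        return token

    def handle(self, token):
        user = self.store.get(token)
        if user is None:
            return "please log in again"
        return f"{self.name} serving {user}"
```

If a user logs in on one `DocumentServer` and the next request reaches another instance built on the same store, `handle` recognises the token. Giving each server its own private store reproduces the forced re-login described above.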

File preview service process: during the file storage process or the file downloading process, a file preview request from a user is received through the nginx server and forwarded to any one document system server in the back-end document system server cluster according to the load balancing principle; the document system server forwards all data blocks of the file to the document conversion server cluster according to the load balancing principle, and the document conversion server cluster converts the data into a previewable format and displays the preview.

Data recovery process: as described above, because each data block is backed up on a plurality of Hadoop servers at the same time, when the data on a Hadoop server in the Hadoop server cluster is damaged, the document system server reads a copy of the damaged data block from another Hadoop server, according to the storage addresses of the data block recorded by the NameNode component, in order to recover it.
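The recovery step can be sketched as restoring the copy count of every block that had a copy on the damaged server. This is a toy model rather than the mechanism HDFS itself uses: `recover_blocks`, the metadata layout and the choice of replacement servers are all assumptions made for illustration.

```python
def recover_blocks(block_locations, servers, failed, min_copies=2):
    """Restore the copy count of every block that had a copy on `failed`.

    block_locations: {block_id: [server names]}  (hypothetical metadata layout)
    servers:         {server name: {block_id: bytes}}
    Returns the list of block ids that were recovered.
    """
    healthy = [s for s in sorted(servers) if s != failed]
    restored = []
    for block_id, holders in block_locations.items():
        if failed not in holders:
            continue
        survivors = [s for s in holders if s != failed and block_id in servers[s]]
        if not survivors:
            raise IOError(f"no surviving copy of block {block_id!r}")
        data = servers[survivors[0]][block_id]
        # Copy the block to healthy servers that do not hold it yet.
        for candidate in healthy:
            if len(survivors) >= min_copies:
                break
            if candidate not in survivors:
                servers[candidate][block_id] = data
                survivors.append(candidate)
        # Update the recorded storage addresses to the servers that now hold it.
        block_locations[block_id] = survivors
        restored.append(block_id)
    return restored
```

Recovery is possible only while at least one other copy of the block survives, which is why each data block is kept on more than one Hadoop server in the first place.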

Based on the same inventive concept, the application also provides a device corresponding to the method in the first embodiment, which is detailed in the second embodiment.

Example two

As shown in fig. 3, in this embodiment, an apparatus for clustering document storage is provided, including:

an nginx service module, used for receiving a file uploaded by a user through the nginx server, displaying a file list through the nginx server, receiving a file download request from the user, and forwarding the file download request to any one document system server in the document system server cluster according to a load balancing principle;

a document service module, used for, when a file is uploaded, storing the file in distributed fashion as data blocks on the Hadoop servers of a Hadoop server cluster and recording the number of data blocks, the order of the data blocks and their storage addresses through a NameNode component, each data block being backed up on a plurality of Hadoop servers at the same time; and, when a file is downloaded, having the document system server request the file from the Hadoop server cluster, obtain all data blocks of the file from the Hadoop servers in a streaming manner according to the storage address of each data block recorded by the NameNode component, reassemble the file according to the number and order of the data blocks, and return the file to the user through the nginx server.

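As a more preferred or specific implementation of this embodiment, the apparatus further includes a data block definition module, a user session management module, a file preview service module and a data recovery module.

The data block definition module is used for providing the following processes:

(1) a manual configuration process: the nginx server provides a manual configuration interface through which a user manually configures the number of data blocks of a file, and the data block size is the quotient of the file data size and the number of data blocks;

(2) an automatic configuration process: the data block size is set automatically according to the balance principle between the data transmission time and the disk addressing time, namely that the optimal data transmission time $T_c$ is 100 times the average addressing time $T_x$, so that the data block size is

$$\text{data block size} = T_c \times V_c = 100 \times T_x \times V_c$$

where $V_c$ is the prevailing data transmission speed.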

The user session management module is used for receiving a login request from a user through the nginx server before the user uploads or downloads a file, and forwarding the login request to any one document system server in the back-end document system server cluster according to the load balancing principle; the document system server obtains the user information from the database main and standby system for verification; if verification succeeds, the user's login information is stored in the redis server cluster as the user's session credential, and a login success message is returned;

the file preview service module is used for receiving a file preview request from a user through the nginx server during the file storage process or the file downloading process, and forwarding the file preview request to any one document system server in the back-end document system server cluster according to the load balancing principle; the document system server forwards all data blocks of the file to the document conversion server cluster according to the load balancing principle, and the document conversion server cluster converts the data into a previewable format and displays the preview;

the data recovery module is used for, when the data on a Hadoop server in the Hadoop server cluster is damaged, having the document system server read a copy of the damaged data block from another Hadoop server according to the storage addresses of the data block recorded by the NameNode component, so as to recover the damaged data block.

Since the apparatus described in the second embodiment is the apparatus used to implement the method of the first embodiment, a person skilled in the art can understand its specific structure and variations from the method described in the first embodiment, so the details are not repeated here. Any apparatus used to implement the method of the first embodiment of the present invention falls within the scope of protection of the present invention.

Based on the same inventive concept, the application provides an electronic device embodiment corresponding to the first embodiment, which is detailed in the third embodiment.

Example three

This embodiment provides an electronic device, as shown in FIG. 4, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, any implementation of the first embodiment can be realized.

Since the electronic device described in this embodiment is a device used for implementing the method in the first embodiment of the present application, based on the method described in the first embodiment of the present application, a specific implementation of the electronic device in this embodiment and various variations thereof can be understood by those skilled in the art, and therefore, how to implement the method in the first embodiment of the present application by the electronic device is not described in detail herein. The equipment used by those skilled in the art to implement the methods in the embodiments of the present application is within the scope of the present application.

Based on the same inventive concept, the application provides a storage medium corresponding to the first embodiment, which is described in detail in the fourth embodiment.

Example four

This embodiment provides a computer-readable storage medium, as shown in FIG. 5, on which a computer program is stored; when the computer program is executed by a processor, any implementation of the first embodiment can be realized.

The technical solutions provided in the embodiments of the present application have at least the following technical effects or advantages: files are stored in distributed fashion as data blocks on the Hadoop servers of the Hadoop server cluster, so that file storage is clustered, which improves the concurrency and throughput the system can offer its users and thus its performance; distributed storage of the files also ensures high availability and integrity of the data; because the data blocks are distributed, a file cannot be read directly on the Hadoop server cluster and can only be read through the document system server according to the storage addresses, which improves data security; and each data block is backed up on a plurality of Hadoop servers, which further improves the security and availability of the data.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus or system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Although specific embodiments of the invention have been described above, it will be understood by those skilled in the art that the specific embodiments described are illustrative only and are not limiting upon the scope of the invention, and that equivalent modifications and variations can be made by those skilled in the art without departing from the spirit of the invention, which is to be limited only by the appended claims.

Claims

1. A method of document storage clustering, characterized in that the method comprises:

a file downloading process: displaying a file list through the nginx server, receiving a file download request from a user, and forwarding the request to any one document system server in the document system server cluster according to a load balancing principle; the document system server requests the file from the Hadoop server cluster, obtains all data blocks of the file from the Hadoop servers in a streaming manner according to the storage address of each data block recorded by the NameNode component, reassembles the file according to the number and order of the data blocks, and returns the file to the user through the nginx server.

2. The method of document storage clustering of claim 1, wherein the method further comprises a data block definition process, which further comprises:

in the formula, V_c is the prevailing data transmission speed.

3. The method of document storage clustering of claim 1, wherein: further comprising:

a user session management process: before a user uploads or downloads a file, a login request from the user is received through the nginx server and forwarded to any one document system server in the back-end document system server cluster according to a load balancing principle; the document system server obtains the user information from the database main and standby system for verification; if verification succeeds, the user's login information is stored in the redis server cluster as the user's session credential, and a login success message is returned.

4. The method of document storage clustering of claim 1, wherein: further comprising:

a data recovery process: when the data on a Hadoop server in the Hadoop server cluster is damaged, the document system server reads a copy of the damaged data block from another Hadoop server according to the storage addresses of the data block recorded by the NameNode component, and recovers the damaged data block.

5. An apparatus for document storage clustering, characterized in that the apparatus comprises:

6. The apparatus of claim 5, wherein: further comprising a data block definition module for providing the following processes:

in the formula, V_c is the prevailing data transmission speed.

7. The apparatus of claim 5, wherein: further comprising:

8. The apparatus of claim 5, wherein: further comprising:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 4 when executing the program.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 4.