CN109299056B - Data synchronization method and device based on a distributed file system

Info

Publication number: CN109299056B
Application number: CN201811096362.XA
Authority: CN
Inventors: 张慧如; 周建明; 冯娜
Original assignee: Weifang Engineering Vocational College
Current assignee: Weifang Engineering Vocational College
Priority date: 2018-09-19
Filing date: 2018-09-19
Publication date: 2019-10-01
Anticipated expiration: 2038-09-19
Also published as: CN109299056A

Abstract

The present invention relates to a data synchronization method and device based on a distributed file system. A data server uses a dual-server working mode to communicate with a physical server and a virtual server simultaneously; a memory cluster and a metadata cluster based on a database are established; data is written to a temporary data block according to the received difference signal; and the file content in the original data node is replaced with the content of the temporary data block, thereby synchronizing the data. The invention adopts a working mode in which the physical server and the virtual server communicate interactively as a dual-server pair, with each server responsible for its own tasks, and switches to a single-server working mode when necessary, which effectively ensures the operating efficiency of the system. In addition, data storage is split internally into a memory cluster and a metadata cluster, so that the content to be synchronized is handled separately by each cluster, which effectively ensures a reasonable allocation of synchronization resources and the accuracy of data synchronization.

Description

A data synchronization method and device based on a distributed file system

Technical field

The present application belongs to the field of distributed processing and, in particular, relates to a data synchronization method and device based on a distributed file system.

Background art

As people's quality of life continues to improve, Internet applications are becoming ever more widespread. To serve users more conveniently, these applications keep developing and evolving, and the network security problems they bring are growing in number, which steadily increases market demand for network security products. Among the many kinds of network security problems, the change or loss of important information caused by accidental file deletion is the most serious, and research on protecting web pages against accidental deletion is therefore ongoing both at home and abroad.

When web page anti-tampering systems first appeared, their internal structure was very simple and they were divided into few functional modules. Although they could solve some basic website security problems, they had many shortcomings. If a hacker launched a large-scale, sustained tampering attack on an important website, such a system could not protect the site. As people rely more and more on the Internet, web page traffic keeps growing, and a simple anti-tampering system is powerless to guard website security under these conditions. Therefore, to prevent web pages from being tampered with and to protect website security, anti-tampering systems have gradually developed and matured. As anti-tampering technology has improved, the internal structure of these systems has become increasingly complex and the number of functional modules has grown. Such a system performs its function through the interaction and cooperation of its modules, which are closely connected, interlocked, and each indispensable.

Therefore, in a web page anti-tampering system, the distributed file synchronization system needs further optimization, so that multi-machine publication of files can be completed through simpler system configuration operations, making the system easier to use. It is also necessary to carry out further in-depth research on designs that improve the file transfer efficiency of the synchronization system, and to enrich and optimize its functions so that it integrates better with the anti-tampering system and performs its intended role.

Summary of the invention

The present invention claims a data synchronization method based on a distributed file system. Overall, it uses a working mode in which a physical server and a virtual server communicate interactively as a dual-server pair, which is used for the synchronization check of data interaction. Internally, in-memory data and metadata are handled by separate clusters, and each server performs its own work, so that data is synchronized promptly and updated accurately.

A data synchronization method based on a distributed file system, characterized in that:

While a data server uses a dual-server working mode to communicate with a physical server and a virtual server simultaneously, the working states of the physical server and the virtual server are monitored;

wherein the communication interaction includes: signal interaction with the physical server, and data interaction with the physical server and the virtual server simultaneously;

A memory cluster and a metadata cluster based on a database are established; the in-memory data of the distributed file system is stored in the distributed database of the memory cluster, and the metadata cluster stores the metadata of the distributed database and performs processing operations on it;

If the data server determines that the physical server has failed and the virtual server is working normally, it sends a single-server working mode switching instruction to the virtual server;

Physical servers are migrated to virtual servers automatically, with hierarchical management of virtual machine resources defined according to the application's logical architecture, intelligent allocation of computing resources, and online dynamic expansion of resources;

The data server receives the first confirmation response returned by the virtual server and, in single-server working mode, continues the signal interaction and the data interaction with the virtual server;

The data node of the virtual server creates a temporary data block, writes data to the temporary data block according to the received difference signal, and replaces the file content in the original data node with the content of the temporary data block, thereby synchronizing the data.
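
The temporary-block replacement in the last step can be illustrated with a minimal Python sketch. It is not the patented implementation: the patent does not specify the format of the difference signal, so the sketch assumes it is a mapping from byte offsets to replacement bytes, and the function name `apply_difference` is hypothetical.

```python
import os
import tempfile


def apply_difference(block_path, difference):
    """Write patched data to a temporary block, then swap it in.

    `difference` maps a byte offset to the bytes that should start there
    (an assumed encoding of the difference signal).
    """
    with open(block_path, "rb") as f:
        content = bytearray(f.read())
    for offset, patch in sorted(difference.items()):
        content[offset:offset + len(patch)] = patch

    # Create the temporary block beside the original so the swap stays on one volume.
    directory = os.path.dirname(os.path.abspath(block_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(content)
        # Replace the original block content with the temporary block.
        os.replace(tmp_path, block_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Writing to a temporary block first means a reader of the original block sees either the old content or the new content, never a half-written mixture.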

As can be seen from the above, the present invention adopts a working mode in which the physical server and the virtual server communicate interactively as a dual-server pair, with each server responsible for its own tasks, and switches to single-server working mode when necessary, which effectively ensures the operating efficiency of the system. In addition, data storage is split internally into a memory cluster and a metadata cluster, so that the content to be synchronized is handled separately by each cluster, which effectively ensures a reasonable allocation of synchronization resources and the accuracy of data synchronization.
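
The switch from dual-server to single-server working mode can be sketched as a small state machine. This is an illustrative sketch only: the class names `Mode`, `VirtualServer`, and `DataServer` and the method `on_status` are hypothetical, and the boolean returned by `switch_mode` merely stands in for the first confirmation response.

```python
from enum import Enum


class Mode(Enum):
    DUAL = "dual-server"
    SINGLE = "single-server"


class VirtualServer:
    """Stand-in for the virtual server; it acknowledges a mode switch."""

    def __init__(self):
        self.mode = Mode.DUAL

    def switch_mode(self, mode):
        self.mode = mode
        return True  # plays the role of the first confirmation response


class DataServer:
    def __init__(self, virtual):
        self.virtual = virtual
        self.mode = Mode.DUAL

    def on_status(self, physical_ok, virtual_ok):
        """React to one monitoring sample of both servers."""
        if self.mode is Mode.DUAL and not physical_ok and virtual_ok:
            # Send the switching instruction and wait for the confirmation.
            if self.virtual.switch_mode(Mode.SINGLE):
                # Signal and data interaction now go to the virtual server only.
                self.mode = Mode.SINGLE
        return self.mode
```

The text does not describe switching back to dual-server mode, so the sketch stays in single-server mode once the switch has happened.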

Brief description of the drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a workflow diagram of the data synchronization method based on a distributed file system according to the present invention.

Figure 2 is a structural module diagram of the data synchronization device based on a distributed file system according to the present invention.

Detailed description of the embodiments

The present invention first protects a data synchronization method based on a distributed file system. Referring to Figure 1, which is the workflow diagram of the method, the method is characterized in that:

While a data server uses a dual-server working mode to communicate with a physical server and a virtual server simultaneously, the working states of the physical server and the virtual server are monitored, wherein the communication interaction includes: signal interaction with the physical server, and data interaction with the physical server and the virtual server simultaneously;

Preferably, the process in which the physical server and the virtual server communicate simultaneously further includes:

A heartbeat connection is established with the virtual server so that the virtual server can learn the current state of the connecting server at any time, including its status, load, and updates. The connecting server also waits for task requests from the virtual server; after receiving a client connection request forwarded by the virtual server, it establishes a connection with the client, listens for the client's operation requests, and responds promptly;

After the current physical server has received an update submitted by a client, the data cache server synchronizes the update to the physical server and, once that is complete, submits an update message to the virtual server, which directs the other data cache servers to synchronize with the physical server.
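
A heartbeat that carries status, load, and update information might look like the following sketch. The JSON encoding, the field names, and the `HeartbeatTable` class are assumptions made for illustration; the patent does not define a message format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Heartbeat:
    server_id: str
    status: str       # e.g. "ok" or "degraded" (assumed values)
    load: float       # fraction of capacity in use
    last_update: int  # version of the latest update applied


def encode(hb):
    """Serialize a heartbeat for sending over the connection."""
    return json.dumps(asdict(hb)).encode("utf-8")


class HeartbeatTable:
    """Kept by the virtual server: the latest heartbeat per server."""

    def __init__(self):
        self.latest = {}

    def receive(self, payload):
        hb = Heartbeat(**json.loads(payload.decode("utf-8")))
        self.latest[hb.server_id] = hb

    def needing_sync(self, newest_update):
        """Servers that have not yet applied the newest update."""
        return sorted(
            sid for sid, hb in self.latest.items()
            if hb.last_update < newest_update
        )
```

With such a table, the virtual server can decide which data cache servers still have to be told to synchronize after an update message arrives.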

Synchronizing a small file generally requires two things to be obtained from the metadata node first: the location of the data node that holds the small file's index, and the stream of the data file. For a write, the stream obtained is the output stream of the data file; for a read, it is the input stream. If consecutively accessed small files belong to the same directory, this information is identical for all of them. The system therefore caches, on the client, the index location for each directory together with the input and output streams of the data file under that directory. This reduces interaction with the metadata node and so speeds up file access; it also avoids frequent requests to modify the update flag on the metadata node, which are needed only when the update flag changes. By default the client caches 20 index-location entries, and the user can change this number. The cache uses an LRU replacement policy by default; because the client is not accessed by multiple threads, the cache does not use a locking mechanism.
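
The client-side cache described above can be sketched with an ordered dictionary. The class name `IndexLocationCache` and the shape of the cached value are hypothetical; the default capacity of 20, the LRU policy, and the absence of locking come from the text.

```python
from collections import OrderedDict


class IndexLocationCache:
    """Single-threaded LRU cache keyed by directory (no locking needed)."""

    def __init__(self, capacity=20):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, directory):
        if directory not in self._entries:
            return None  # caller must ask the metadata node
        self._entries.move_to_end(directory)  # mark as most recently used
        return self._entries[directory]

    def put(self, directory, location):
        self._entries[directory] = location
        self._entries.move_to_end(directory)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used
```

A cache miss (`None`) is the only case in which the client has to contact the metadata node again.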

Because the client caches the index location and update flag of small files, it does not need to access the metadata node frequently, which greatly improves system performance but creates a consistency problem for the indexes on the data nodes. According to the selection rule, the master data node is the location in the index location mapping table where indexes are created, so its content is always the newest. For example, after client 1 queries replica data node 1, the update flag in its cache becomes N, and it modifies the corresponding entry in the metadata node's mapping table. Client 2 then obtains the mapping table and learns that the update flag of replica data node 1 is N. If client 1 now creates an index, the update flag of replica data node 1 becomes Y, but client 2 has no way of learning this on its own initiative, so it cannot see the small file that client 1 has just created.

Further preferably, the synchronization control node in the data node obtains a copy list from the metadata node of the source file system according to the source path entered by the client, creates a thread pool, and assigns source files to each thread according to the copy list. The copy list is the list of all source files under the source path and includes the file name, size, and file path of each source file.

Each thread of the synchronization control node obtains the metadata of its assigned source files from the metadata node of the source file system and, according to that metadata, obtains from the corresponding source data nodes the checksum of each data block that the source file contains.

Each thread of the synchronization control node obtains from the metadata node of the target file system the metadata of the target file corresponding to each source file, compares the sizes of the source and target files and, according to the result, asks the metadata node of the target file system to create or delete data blocks of the target file so that the target file is the same size as the source file.

Each thread of the synchronization control node then re-obtains the metadata of each target file from the metadata node of the target file system and, according to that metadata, obtains from the corresponding target data nodes the checksums of all data blocks that the target file contains.

Each thread of the synchronization control node generates a file checksum list from the metadata of its source and target files and the checksums of all source and target data blocks. The file checksum list includes: the sequence number of the data block, the source data block ID, the source data block checksum, the source data node ID, the target data block ID, the target data block checksum, the target data node ID, and a flag indicating whether the target data block is a newly created data block.
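
The per-file steps above can be sketched in Python. Several details are assumptions rather than part of the patent: blocks are modelled as in-memory `(block_id, node_id, data)` tuples, CRC32 stands in for the unspecified checksum, the placeholder IDs for new blocks are invented, and `blocks_to_copy` shows one plausible use of the checksum list.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class ChecksumEntry:
    seq: int
    source_block_id: str
    source_checksum: int
    source_node_id: str
    target_block_id: str
    target_checksum: int
    target_node_id: str
    newly_created: bool


def resize_target(source_blocks, target_blocks):
    """Create or drop target blocks so both files have the same block count."""
    kept = list(target_blocks[:len(source_blocks)])
    flags = [False] * len(kept)
    for seq in range(len(kept), len(source_blocks)):
        kept.append((f"new-{seq}", "target-node", b""))  # hypothetical IDs
        flags.append(True)
    return kept, flags


def build_checksum_list(source_blocks, target_blocks):
    """Pair source and target blocks and record their checksums."""
    targets, flags = resize_target(source_blocks, target_blocks)
    entries = []
    for seq, (src, tgt) in enumerate(zip(source_blocks, targets)):
        entries.append(ChecksumEntry(
            seq, src[0], zlib.crc32(src[2]), src[1],
            tgt[0], zlib.crc32(tgt[2]), tgt[1], flags[seq],
        ))
    return entries


def blocks_to_copy(entries):
    """Sequence numbers whose target block is new or differs from the source."""
    return [e.seq for e in entries
            if e.newly_created or e.source_checksum != e.target_checksum]


def sync_all(files, workers=4):
    """`files` maps a name to (source_blocks, target_blocks); one task per file."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(build_checksum_list, s, t)
                   for name, (s, t) in files.items()}
        return {name: blocks_to_copy(f.result()) for name, f in futures.items()}
```

Comparing checksums instead of block contents means only the blocks that differ need to travel between the source and target data nodes.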

The source data space can be split randomly, split evenly along the dimension with the longest span, split by clustering along each dimension, and so on. The cluster of a large-scale distributed file system usually spans multiple racks; communication between computers in different racks must pass through switches, so its transmission cost is high, and in most cases the bandwidth between two computers in different racks is lower than that between two computers in the same rack. The current replica placement policy of the distributed file system is to store replicas in two different racks, which prevents data loss when one rack fails. When data is read, the nearest-replica principle is applied: the node storing the data that is closest to the client is accessed, or the bandwidth between different racks is used, to reduce read time. Moving the computation close to the node that stores the data is clearly more efficient than moving the data to the computing node, because moving the computation costs less than moving the data.
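
The nearest-replica principle can be illustrated with a simple rack-aware distance. The three-level distance (same host, same rack, other rack) and the function names are assumptions for the sketch; the patent does not give a distance metric.

```python
def distance(client, replica):
    """Both arguments are (rack, host) pairs.

    0 = same host, 1 = same rack, 2 = different rack (assumed scale).
    """
    if client == replica:
        return 0
    if client[0] == replica[0]:
        return 1
    return 2


def nearest_replica(client, replicas):
    """Pick the replica with the smallest rack-aware distance to the client."""
    return min(replicas, key=lambda r: distance(client, r))
```

For example, a client on `("rack1", "h1")` choosing between replicas on `("rack2", "h7")` and `("rack1", "h3")` reads from the second one, which avoids crossing a switch between racks.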

Further preferably, the distributed file system works with MapReduce. According to the number of directories to create and the number of small files to create under each directory, both provided by the user, the MapReduce program creates in the distributed file system as many control files as there are directories, and these serve as the input files of the Map function.

The Map function of the test mainly creates small files of the specified number and size using the mass small-file storage system, and uses that system's small-file read interface to read all the small files written under the created directories during the write test.

The write test and the read test use the same Reduce function, which aggregates the output data of each Map, such as the total size and total number of the files tested by the MapReduce program and the total running time, and stores these data in the distributed file system.

The result analysis function reads the statistics produced by Reduce from the distributed file system and, using given formulas, calculates figures such as the speed at which the distributed file system or the mass small-file storage system writes and reads small files.

After each MapReduce program finishes, the memory usage of the mass small-file storage system and of the distributed file system must be recorded.
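
The aggregation and analysis steps can be sketched without a MapReduce framework. The tuple layout of a Map output and the two formulas (megabytes per second and files per second) are assumptions, since the patent only refers to "given formulas".

```python
def reduce_stats(map_outputs):
    """Aggregate Map outputs of the form (bytes, files, seconds)."""
    return {
        "bytes": sum(o[0] for o in map_outputs),
        "files": sum(o[1] for o in map_outputs),
        "seconds": sum(o[2] for o in map_outputs),
    }


def analyze(stats):
    """Assumed formulas: throughput in MB/s and in files per second."""
    return {
        "mb_per_s": stats["bytes"] / (1024 * 1024) / stats["seconds"],
        "files_per_s": stats["files"] / stats["seconds"],
    }
```

Because the write test and the read test share the same Reduce step, the same two functions serve both; only the Map outputs differ.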

Further preferably, the metadata cluster authorizes users to access the data cache server cluster and quickly establishes a connection between the client and the data server. It monitors the state of each data cache server in real time and, based on this status information, assigns each user the data cache server that can provide the best service. It also uses a cache consistency policy to ensure the consistency and stability of user data across the data cache server cluster. The metadata cluster is controlled by the virtual server; it is responsible for exchanging data with clients, monitors the state of each user's data in real time, and submits the status information to the virtual server so that the control server can trace the state of user data. A heartbeat connection is established with the transaction control server, through which factors that affect service quality, such as the cluster's own available network bandwidth and CPU utilization, are passed to the virtual server.
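
Assigning the data cache server that can provide the best service might be sketched as follows. The scoring rule (available bandwidth weighted by idle CPU) and the status field names are purely illustrative assumptions; the patent names bandwidth and CPU utilization as factors but gives no formula.

```python
def score(status):
    """Higher is better; the weighting is an illustrative assumption."""
    return status["bandwidth_mbps"] * (1.0 - status["cpu_util"])


def assign_server(statuses):
    """Pick the live data cache server with the highest score."""
    healthy = {sid: s for sid, s in statuses.items() if s["alive"]}
    if not healthy:
        raise RuntimeError("no data cache server available")
    return max(healthy, key=lambda sid: score(healthy[sid]))
```

A server that is heavily loaded can lose to one with less bandwidth, and a server that is not alive is never chosen, whatever its reported capacity.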

In the storage system, each physical file corresponds to one logical representation, which consists of metadata information. When a file is read, the logical file is read first; the corresponding data blocks are then fetched from the storage system in the order given by the metadata information, and finally a copy of the physical file is reconstructed. The data file stores the data of small files in a key/value data structure. This not only reduces the amount of metadata that mass small files generate in the distributed file system but also speeds up access to small files by reducing interaction with the metadata node; it also favors MapReduce-based data processing and so supports distributed computing.

All small files that a client stores in the same directory are saved in the data file under that directory, which is itself a file in the distributed file system. An index is generated at the same time; it records the position of each small file within the data file and other related information. The index is handed over to the data nodes for maintenance and management, and the data nodes provide the index service to clients. The metadata node records which data node maintains the small file index. When a client needs to request the index service for the small files in a directory from a data node, it first obtains the location of that data node from the distributed file system. The client caching mechanism caches the location of the data node that maintains the small file index and the data file information, reducing the number of accesses to the metadata node and so greatly increasing small file access speed.
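
Packing the small files of one directory into a single data file with an index can be sketched in memory. The class name `DirectoryDataFile` and the `(offset, length)` index entry are assumptions; the patent states only that the index records each small file's position in the data file and other related information.

```python
import io


class DirectoryDataFile:
    """Packs the small files of one directory into a single data file."""

    def __init__(self):
        self._data = io.BytesIO()  # stands in for the data file in the DFS
        self._index = {}           # file name -> (offset, length)

    def put(self, name, content):
        offset = self._data.seek(0, io.SEEK_END)  # always append at the end
        self._data.write(content)
        self._index[name] = (offset, len(content))

    def get(self, name):
        offset, length = self._index[name]
        self._data.seek(offset)
        return self._data.read(length)
```

The file system then tracks one data file per directory rather than one entry per small file, which is where the reduction in metadata comes from.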

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A data synchronization method based on a distributed file system, characterized in that:

while a data server uses a dual-server working mode to communicate with a physical server and a virtual server simultaneously, the working states of the physical server and the virtual server are monitored, wherein the communication interaction includes: signal interaction with the physical server, and data interaction with the physical server and the virtual server simultaneously;

a memory cluster and a metadata cluster based on a database are established; the in-memory data of the distributed file system is stored in the memory cluster, and the metadata cluster stores the metadata of the file system in a distributed manner and performs processing operations on it;

if the data server determines that the physical server has failed and the virtual server is working normally, it sends a single-server working mode switching instruction to the virtual server;

physical servers are migrated to virtual servers automatically, with hierarchical management of virtual machine resources defined according to the application's logical architecture, intelligent allocation of computing resources, and online dynamic expansion of resources;

the data server receives the first confirmation response returned by the virtual server and, in single-server working mode, continues the signal interaction and the data interaction with the virtual server;

the data node of the virtual server creates a temporary data block, writes data to the temporary data block according to the received difference signal, and replaces the file content in the original data node with the content of the temporary data block, thereby synchronizing the data;

the process in which the physical server and the virtual server communicate simultaneously further includes:

a heartbeat connection is established with the virtual server so that the virtual server can learn the current state of the connecting server at any time, including its status, load, and update status; the connecting server also waits for task requests from the virtual server and, after receiving a client connection request forwarded by the virtual server, establishes a connection with the client, listens for the client's operation requests, and responds promptly; after the current physical server has received an update submitted by a client, the data cache server synchronizes the update to the physical server and, once that is complete, submits an update message to the virtual server, which directs the other data cache servers to synchronize with the physical server;

the synchronization control node in the data node obtains a copy list from the metadata node of the source file system according to the source path entered by the client, creates a thread pool, and assigns source files to each thread according to the copy list, the copy list being the list of all source files under the source path and including the file name, size, and file path of each source file;

each thread of the synchronization control node obtains the metadata of its assigned source files from the metadata node of the source file system and, according to that metadata, obtains from the corresponding source data nodes the checksum of each data block that the source file contains;

each thread of the synchronization control node obtains from the metadata node of the target file system the metadata of the target file corresponding to each source file, compares the sizes of the source and target files and, according to the result, asks the metadata node of the target file system to create or delete data blocks of the target file so that the target file is the same size as the source file;

each thread of the synchronization control node re-obtains the metadata of each target file from the metadata node of the target file system and, according to that metadata, obtains from the corresponding target data nodes the checksums of all data blocks that the target file contains;

each thread of the synchronization control node generates a file checksum list from the metadata of its source and target files and the checksums of all source and target data blocks, the file checksum list including: the sequence number of the data block, the source data block ID, the source data block checksum, the source data node ID, the target data block ID, the target data block checksum, the target data node ID, and a flag indicating whether the target data block is a newly created data block;

the distributed file system works with MapReduce; according to the number of directories to create and the number of small files to create under each directory, both provided by the user, the MapReduce program creates in the distributed file system as many control files as there are directories, which serve as the input files of the Map function;

the Map function of the test mainly creates small files of the specified number and size using the mass small-file storage system, and uses that system's small-file read interface to read all the small files written under the created directories during the write test;

the write test and the read test use the same Reduce function, which aggregates the output data of each Map, including the total size and total number of the files tested by the MapReduce program and the total running time, and stores these data in the distributed file system;

the result analysis function reads the statistics produced by Reduce from the distributed file system and, using given formulas, calculates the speed at which the distributed file system or the mass small-file storage system writes and reads small files;

after each MapReduce program finishes, the memory usage of the mass small-file storage system and of the distributed file system must be recorded;

the metadata cluster authorizes users to access the data cache server cluster and quickly establishes a connection between the client and the data server; it monitors the state of each data cache server in real time and, based on this status information, assigns each user the data cache server that can provide the best service, while using a cache consistency policy to ensure the consistency and stability of user data across the data cache server cluster; the metadata cluster is controlled by the virtual server, is responsible for exchanging data with clients, monitors the state of each user's data in real time, and submits the status information to the virtual server so that the control server can trace the state of user data; a heartbeat connection is established with the transaction control server, through which factors that affect service quality, namely the cluster's own available network bandwidth and CPU utilization, are passed to the virtual server.

2. A data synchronization device based on a distributed file system, characterized in that:

the device includes a data server, a physical server, a virtual server, and a database; wherein

a heartbeat connection is also established with the virtual server so that the virtual server can learn the current state of the connecting server at any time, including its status, load, and update status; the connecting server also waits for task requests from the virtual server and, after receiving a client connection request forwarded by the virtual server, establishes a connection with the client, listens for the client's operation requests, and responds promptly; after the current physical server has received an update submitted by a client, the data cache server synchronizes the update to the physical server and, once that is complete, submits an update message to the virtual server, which directs the other data cache servers to synchronize with the physical server;