CN108388658A - Data file reliable storage method - Google Patents

Data file reliable storage method

Info

Publication number
CN108388658A
CN108388658A (application CN201810186581.0A)
Authority
CN
China
Prior art keywords
data
block
datanode
relay unit
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810186581.0A
Other languages
Chinese (zh)
Inventor
杨晓莹
吴伟杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Xinte Electronic Technology Co Ltd
Original Assignee
Chengdu Xinte Electronic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Xinte Electronic Technology Co Ltd
Priority to CN201810186581.0A
Publication of CN108388658A
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/13File access structures, e.g. distributed indices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • G06F16/1824Distributed file systems implemented using Network-attached Storage [NAS] architecture
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1001Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1001Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1034Reaction to server failures by a load balancer
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1097Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/50Network services
    • H04L67/60Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources
    • H04L67/61Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources taking into account QoS or priority requirements

Abstract

The present invention provides a reliable storage method for data files. The method includes: dividing received data according to data block name and transferring it to the relay unit of a DataNode through the channel corresponding to that data block name; and transferring result data from the relay unit to the corresponding cloud storage server. The invention thereby achieves efficient real-time processing of large, continuously changing data sets.

Description

Data file reliable storage method
Technical field
The present invention relates to cloud storage, and more particularly to a reliable storage method for data files.
Background art
With the rapid development of information technology, mass data sources have brought explosive growth in data scale, and performing complex computations on such big data far exceeds the processing capacity of a single computer. This has driven the evolution toward big-data cloud computing systems. In a cloud computing system, big data that requires complex computation is divided into small blocks, the blocks are handed to multiple DataNodes for parallel processing, and the local computation results are merged into a final result. In heterogeneous big-data environments, however, there is unstructured data that is transmitted in real time and generated continuously, such as monitoring data produced by sensors and real-time communication data produced by social networks. If this constantly changing big data cannot be processed efficiently and in real time, the key information carried in the data blocks will be missed. Existing cloud computing systems cannot integrate data from multiple heterogeneous data sources (covering numerical computation, data mining and model prediction) to provide users with the results they care about in real time, nor can storage resources be shared across different servers. They also cannot satisfy the storage demands of multi-path environments and multi-node access in cloud computing systems, including access-conflict prevention and resource balancing.
Summary of the invention
To solve the problems of the above prior art, the present invention proposes a reliable storage method for data files, comprising:
dividing received data according to data block name and transferring it to the relay unit of a DataNode through the channel corresponding to that data block name; and transferring result data from the relay unit to the corresponding cloud storage server.
Preferably, the relay unit receives the divided data and places the received data into a queue named after the pending business;
based on the priority of each pending business, a pending business is started and sent to the computing unit of the DataNode;
the computing unit, for the started pending business, computes on the data from the relay unit and outputs the computed data blocks to the relay unit.
Preferably, the method further comprises:
to isolate the data transmission of the DataNode from its internal computation logic, dividing the input data by data block name and transferring the data to the relay unit; the relay unit maintains, according to the association between the data and the pending businesses of the current DataNode, a hierarchical queue of all pending businesses in the ready state.
Preferably, the relay unit determines, according to the load of the DataNode, how many businesses to start, and selects the corresponding number of highest-priority pending businesses from the hierarchical queue to start;
the relay unit also transfers the data to the computing unit that executes the pending business, and receives the result data processed by the computing unit; in a cloud computing system comprising the above DataNodes, the division, fusion and processing of input or result data are all completed in memory.
Compared with the prior art, the present invention has the following advantages:
the present invention proposes a reliable storage method for data files that achieves efficient real-time processing of large, continuously changing data sets.
Description of the drawings
Fig. 1 is a flow chart of the data file reliable storage method according to an embodiment of the present invention.
Detailed description of the embodiments
A detailed description of one or more embodiments of the invention is provided below, together with the accompanying drawings that illustrate the principles of the invention. The invention is described in connection with such embodiments, but is not limited to any particular embodiment. The scope of the invention is limited only by the claims, and the invention covers many alternatives, modifications and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for exemplary purposes, and the invention may also be practiced according to the claims without some or all of these details.
One aspect of the present invention provides a reliable storage method for data files. Fig. 1 is a flow chart of the data file reliable storage method according to an embodiment of the present invention. The cloud computing system for big-data processing of the present invention comprises multiple DataNodes. Each DataNode comprises:
a division unit, which divides received data according to data block name and transfers it to the relay unit through the channel corresponding to that data block name, and which also transfers result data from the relay unit to the corresponding cloud storage server;
a relay unit, which receives the divided data from the division unit, places the received data into queues named after the pending businesses, starts pending businesses based on the priority of each pending business and sends them to the computing unit, and receives result data from the computing unit and forwards it to the division unit;
a computing unit, which, for the started pending business, computes on the data from the relay unit and outputs the processed data blocks to the relay unit.
The division unit implements data forwarding between the DataNode and external nodes, isolating the DataNode's data transmission from its internal computation logic. Specifically, the division unit divides the input data by data block name and transfers the data to the relay unit. The relay unit maintains, according to the association between the data and the pending businesses of the current DataNode, a hierarchical queue of all pending businesses in the ready state. The relay unit determines, according to the load of the DataNode, how many businesses to start, and selects the corresponding number of highest-priority pending businesses from the hierarchical queue to start. In addition, the relay unit transfers the data to the computing unit that executes the pending business and receives the result data processed by the computing unit, as illustrated by the sketch below.
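The following Python sketch shows one way such a ready queue ordered by priority could be scheduled against the DataNode load; the class and parameter names (Business, RelayUnit, capacity) are illustrative assumptions, not part of the patent.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Business:
    priority: int                       # lower value = higher priority
    name: str = field(compare=False)
    data_blocks: list = field(compare=False, default_factory=list)

class RelayUnit:
    """Minimal sketch of the relay unit's ready queue of pending businesses."""

    def __init__(self):
        self.ready = []                 # heap of Business, ordered by priority

    def enqueue(self, business: Business):
        heapq.heappush(self.ready, business)

    def start_businesses(self, node_load: float, capacity: float):
        # Decide how many businesses to start from the DataNode load, then
        # pop that many highest-priority businesses from the ready queue.
        slots = max(1, int(capacity / max(node_load, 1e-6)))
        started = []
        for _ in range(min(slots, len(self.ready))):
            started.append(heapq.heappop(self.ready))
        return started                  # these would be handed to computing units
```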
In a cloud computing system comprising the above DataNodes, the division, fusion and processing of input or result data are all completed in memory. To ensure the accuracy of the system's computation results, each DataNode preferably further comprises a data backup unit. When the scheduling node receives from the computing unit result data whose processing and computation have finished, the result data is sent through the corresponding channel to the data backup unit, which stores the result data on a blade disk; according to the association between the result data and the cloud storage servers, the result data is then placed into a shared-memory queue named after the cloud storage server and sent out uniformly by the division unit.
The above four units share and multiplex information through explicit inter-process communication, and through their cooperation they together constitute a node of the cloud computing system.
In addition, the relay unit further listens for client requests on a port, establishes connections, and dispatches the connections to suitable computing units for execution. Each DataNode manages multiple client connection requests using an I/O multiplexing interface. The division unit, the relay unit and the data backup unit all manage multiple event sources through I/O multiplexing interfaces and are coupled to each other through channels.
The relay unit and the data backup unit manage, through the I/O multiplexing interface, the channel ports used for data transmission between the units. All units process in parallel and execute their logic asynchronously. When the DataNode starts, the relay unit performs initialization and listening: it listens on a designated port, receives connection requests from external nodes, and initializes the threads of the computing unit. According to the load of each computing-unit thread, the relay unit decides to which thread a data connection, packaged as a business, is dispatched for execution.
The cloud computing system uses an adaptive load-balancing strategy to decide how many threads the relay unit starts and into which thread a newly received business is placed. Specifically, the relay unit monitors the load of the DataNode in real time; when the CPU occupancy exceeds a threshold, it randomly selects a thread and closes it after its current business finishes, thereby reducing the concurrency of the DataNode. A new connection is assigned to the thread with the fewest established connections, and the assignment is performed by sending the business packaged from the connection into that thread's business queue, as in the sketch below.
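A minimal sketch of this adaptive strategy follows; the cpu_threshold parameter and the worker methods (close_after_current_business, business_queue) are assumed names used only for illustration.

```python
import random

class AdaptiveBalancer:
    """Sketch: close a random worker when CPU occupancy is high, otherwise
    assign new connections to the worker with the fewest connections."""

    def __init__(self, workers, cpu_threshold=0.85):
        self.workers = list(workers)              # worker threads
        self.connections = {w: 0 for w in self.workers}
        self.cpu_threshold = cpu_threshold

    def on_load_sample(self, cpu_usage: float):
        # When CPU occupancy exceeds the threshold, retire a random worker
        # after its current business, reducing the node's concurrency.
        if cpu_usage > self.cpu_threshold and len(self.workers) > 1:
            victim = random.choice(self.workers)
            victim.close_after_current_business()  # assumed worker API
            self.workers.remove(victim)
            self.connections.pop(victim)

    def dispatch(self, connection):
        # Assign the new connection to the least-connected worker by
        # pushing the packaged business into that worker's queue.
        target = min(self.workers, key=lambda w: self.connections[w])
        self.connections[target] += 1
        target.business_queue.put(connection)      # assumed worker API
```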
The division unit handles data transmission with peer nodes, including receiving data from clients and pushing result data to the cloud storage servers; it completely separates the DataNode's data transmission from the upper-layer application logic. To manage multiple I/O data sources, the division unit uses the I/O multiplexing interface model. The two parties in a data transfer negotiate before a data block is formally transmitted, and the cloud storage server notifies the client of the position at which the previous data block transfer ended. Based on the asynchronous read/write characteristics of the I/O multiplexing interface, the division unit implements a data-transmission state machine that supports resumable (breakpoint) transfers.
Each thread starts a division unit at initialization. When a business sent by the relay unit appears in the thread's business queue, the division unit takes the connection port out of the business and adds it to its own I/O multiplexing event loop. The division unit reads data from the connection and divides it by data name. When a data block belonging to some data is received by the division unit for the first time, the division unit creates the channel corresponding to that data block name, opens the channel with the write flag and transmits the data; at the same time the data block name is sent to the relay unit through a socket. After the relay unit receives the data block name, it opens the corresponding channel with the read flag and receives the divided data sent by the division unit.
The relay unit determines the number of business operators to start according to the DataNode load, and the starting order is computed from business priority; the priority is determined from the importance of the business within the overall business and the operating condition of the DataNode. The relay unit obtains, in real time from the server, the association between data blocks and external processing businesses, and places each received data block into the queue corresponding to the business name.
After storing the result data on the blade disk, the data backup unit assigns an expiration time to each data block and periodically deletes expired blocks from the blade disk. When the transmission speed of the client exceeds the processing speed of the cloud storage server and data packets accumulate in the client kernel's buffer so that they cannot be sent, the data backup unit forms a cache on the cloud storage server side.
The cloud computing system node implements a positioning-and-sending protocol in the division unit to locate the position where the previous data block transfer ended and to take out the data at the corresponding position from the backup unit, thereby recovering data blocks. Similarly to the way the division unit cooperates with the relay unit to complete data transmission, the data backup unit listens on a designated port that is added to the I/O multiplexing interface handle for a long period; when it receives a result data name sent by the relay unit, it opens the channel with the read flag and adds the channel file descriptor to the I/O multiplexing loop. The data backup unit then continuously reads the processed result data from the channel and stores it into the blade disk array under the result data name. Data blocks in the blade disk array are stored as key-value pairs, the key being the timestamp of the data block, which allows a block to be located quickly in the hard-disk queue when retransmission is needed.
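The sketch below illustrates one way such a timestamp-keyed backup store with periodic expiry could look; the BackupStore class and its retention parameter are assumptions for illustration only, and an in-memory dictionary stands in for the blade disk array.

```python
import time

class BackupStore:
    """Sketch: result blocks keyed by timestamp, with periodic expiry."""

    def __init__(self, retention_seconds=3600):
        self.blocks = {}                     # timestamp -> (result name, bytes)
        self.retention = retention_seconds

    def put(self, name: str, data: bytes) -> float:
        ts = time.time()
        self.blocks[ts] = (name, data)
        return ts                            # key used to relocate the block quickly

    def get(self, ts: float):
        return self.blocks.get(ts)           # fast lookup when retransmitting

    def expire(self):
        # Periodically delete blocks whose expiration time has passed.
        cutoff = time.time() - self.retention
        for ts in [t for t in self.blocks if t < cutoff]:
            del self.blocks[ts]
```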
The data backup unit queries the server for the association between the result data and the cloud storage servers, and sends the data backed up in the blade disk array into the shared-memory queue named after the cloud storage server. So that the division unit behaves consistently when sending and receiving data, the data backup unit packages the cloud storage server name into a business and puts it into the business queue; the division unit takes the cloud storage server name out of the business, takes the data out of the corresponding shared-memory queue, and transmits the data according to the configuration of that cloud storage server.
The cloud computing system of the present invention uses a virtual-address mechanism to allow every DataNode in the cloud computing system to access the blade disk array. If a DataNode fails, the blade disk array access gateway can switch between DataNodes, providing high availability of blade disk array access. Meanwhile, a feedback-based virtual-address balancing policy reasonably allocates virtual addresses to the DataNodes of the cloud computing system, guaranteeing the processing capacity and service quality of the blade disk array.
The specific blade disk array access method comprises the following:
(1) An access-path list for accessing the blade disk array is provided; each access path comprises a virtual address, a port and a channel ID. The corresponding logical disk is obtained through each access path, the logical disk being the logical mapping of the blade disk array at the presentation layer.
The logical-unit-number information of the same disk array is identical across all nodes of the cloud computing system; the access-path lists of all presentation layers are identical; and each disk array corresponds one-to-one with a unique virtual address. Specifically, a logical unit number is added to the disk arrays of all DataNodes of the cloud computing system; each DataNode may have multiple disk arrays, but a given disk array has one and only one logical unit number. Any presentation layer may access any disk array and its logical-unit-number information.
The virtual-address mechanism is implemented by the component manager of the cloud computing system. The component manager comprises a cloud platform data manager, a local data manager and a message manager.
The cloud platform data manager responds to and makes decisions on the various events of the cloud computing system, where the events include the creation and deletion of virtual addresses and link exceptions. The local data manager provides the metadata for virtual addresses and block storage operations. The logical relation between virtual addresses and the block-storage metadata operations involved in address resource allocation is decided by the cloud platform data manager, which also configures the local metadata.
The message manager handles message passing between the cloud platform data manager and the local data manager, as well as membership management within the cloud computing system.
Specifically, in the cloud storage system, virtual addresses are managed as follows:
Step S11: create a data block vector using the local metadata: <virtual address, blade disk array, logical unit number>.
Step S12: set the attributes of the data block vector. Specifically, the data block vector attributes include the boot sequence of the resource, etc.
Step S13: if the attribute setting fails, delete the data block vector; if the attribute setting succeeds, map the data block vector to the presentation layer; if mapping the data block vector to the presentation layer fails, delete the data block vector; if mapping the resource vector to the presentation layer succeeds, update the resource-vector database information.
The resource-vector database is stored in the data backup unit.
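Steps S11 to S13 amount to a small create/validate/map/rollback sequence. A sketch under assumed names (DataBlockVector, map_to_presentation_layer, update_resource_vector) follows; the database interface is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataBlockVector:
    virtual_address: str
    blade_disk_array: str
    logical_unit_number: int
    attributes: Optional[dict] = None

def create_virtual_address(meta: dict, db) -> Optional[DataBlockVector]:
    # S11: build the data block vector from local metadata
    vec = DataBlockVector(meta["virtual_address"],
                          meta["blade_disk_array"],
                          meta["logical_unit_number"])
    # S12: set the vector attributes (e.g. the resource boot sequence)
    if "boot_sequence" not in meta:
        return None                              # S13: attribute setting failed, drop the vector
    vec.attributes = {"boot_sequence": meta["boot_sequence"]}
    # S13: map to the presentation layer; drop the vector on failure
    if not db.map_to_presentation_layer(vec):    # assumed database interface
        return None
    db.update_resource_vector(vec)               # success: update the resource-vector database
    return vec
```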
(2) Based on feedback, virtual addresses are allocated to the DataNodes of the cloud computing system so that the DataNodes are evenly loaded. When allocating virtual addresses, the following operations are performed iteratively:
Step S21: set the minimum redundancy amount M_min of allocatable virtual addresses.
Step S22: compute the virtual-address redundancy amount of each DataNode of the cloud computing system according to the formula M_i = M_n + k_1·Δt·C/L − k_2·R_n/C;
where M_n is the load redundancy amount of the DataNode passed in by the relay unit of the cloud computing system when the last timestamp arrived, k_1·Δt·C/L is the load completed by the DataNode during the period Δt, and k_2·R_n/C is the load added by new requests submitted by the relay unit of the cloud computing system during the period Δt. Here k_1 and k_2 are predefined coefficients; R_n is the number of requests added during the period Δt; C is the performance of the DataNode; L is the current load of the DataNode; and Δt is the time difference between the current time and the arrival of the last timestamp.
Step S23: select all DataNodes of the cloud computing system that satisfy the condition M_i > M_min, where M_i is the virtual-address redundancy amount of the node. If no DataNode in the cloud computing system satisfies the condition, reset the minimum redundancy amount of allocatable virtual addresses until DataNodes satisfying the condition are selected.
Step S24: add the selected DataNodes to a candidate set.
Step S25: compute the weight of each DataNode in the candidate set.
Specifically, the weight of each DataNode in the candidate set is computed according to the formula W = C/L.
Step S26: select the DataNode with the largest weight in the candidate set.
Step S27: compute the load change of the DataNode with the largest weight in the candidate set.
Specifically, −k_1·Δt·C/L + k/C is the load change of the DataNode with the largest weight during the period Δt, where k is a user-defined parameter. The current load of the DataNode with the largest weight is therefore L_i − k_1·Δt·C/L + k/C, where L_i is the load value of that DataNode passed in by the relay unit of the cloud computing system when the last timestamp arrived.
Step S28: according to the load change of the DataNode with the largest weight in the candidate set, modify the virtual-address redundancy amount of that DataNode.
Specifically, the virtual-address redundancy amount of the DataNode with the largest weight is modified according to the formula M = M_i + k_1·Δt·C/L − k/C, where M_i is the virtual-address redundancy amount of that DataNode and −k_1·Δt·C/L + k/C is its load change during the period Δt, i.e. the change in its virtual-address redundancy amount during Δt. The virtual-address redundancy amount of the DataNode with the largest weight is therefore set to the difference between its original virtual-address redundancy amount and its load change during Δt.
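The allocation loop of steps S21 to S28 can be sketched as follows; the Node fields and the coefficients k1, k2 and k carry over from the formulas above, while the function names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Node:
    M_n: float   # load redundancy reported at the last timestamp
    R_n: float   # requests added during the period dt
    C: float     # node performance
    L: float     # current load
    M_i: float = 0.0

def vaddr_redundancy(node: Node, dt: float, k1: float, k2: float) -> float:
    # S22: M_i = M_n + k1*dt*C/L - k2*R_n/C
    return node.M_n + k1 * dt * node.C / node.L - k2 * node.R_n / node.C

def allocate_virtual_address(nodes, dt, M_min, k1, k2, k):
    # S23/S24: the candidate set holds nodes whose redundancy exceeds the minimum
    for n in nodes:
        n.M_i = vaddr_redundancy(n, dt, k1, k2)
    candidates = [n for n in nodes if n.M_i > M_min]
    if not candidates:
        return None                          # caller resets M_min and retries (S23)
    # S25/S26: weight W = C/L, choose the node with the largest weight
    target = max(candidates, key=lambda n: n.C / n.L)
    # S27: load change of the chosen node over dt
    delta = -k1 * dt * target.C / target.L + k / target.C
    # S28: M = M_i + k1*dt*C/L - k/C, i.e. old redundancy minus the load change
    target.M_i = target.M_i - delta
    target.L = target.L + delta              # current load after the change
    return target
```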
In the big-data cloud computing system of the present invention, N blade disks are located on the component manager side, and each blade disk array is divided into discs of equal size; the discs within each blade disk are numbered from low to high according to their addresses. Each of the N blade disks has a blade array ID, and each disc has a disc identifier formed by combining the blade array ID with the number of the disc.
The storage space of a parallel computing task of the cloud computing system is composed of target blade disks. The target blade disks are composed as follows: the component manager monitors the allocation state of the discs and the heat of the N blade disks; after receiving a parallel-computing-task creation request, the component manager determines the storage-space requirement of the parallel computing task to be created, determines from the disc allocation states which discs are unallocated, and selects M discs from the unallocated discs as target blade disks, where the storage space of the M discs is greater than or equal to the storage-space requirement.
The M discs are each located on different blade disks. The component manager responds to the parallel-computing-task creation request and builds the parallel computing task on the target blade disks.
If the parallel computing task has a data-storage requirement during operation, it first obtains the identifiers of the target blade disks and sends heat query requests to the target blade disks, each heat query request carrying the identifier of the target blade disk.
The parallel computing task receives the heat values returned by the blade disks corresponding to the M discs.
The parallel computing task divides the data to be stored into fewer than M target data pieces and, in order of the heat of the blade disks corresponding to the M discs from low to high, stores each target data piece into a disc of the target blade disks. At the system level, the implementation process is as follows:
101: The component manager monitors the allocation state of the discs and the heat of the N blade disks; the heat is the data throughput of the blade disk, either current or aggregated over its history, or the ratio of that data throughput to the data-storage capacity of the corresponding blade disk.
102: After receiving a parallel-computing-task creation request, the component manager determines the storage-space requirement of the parallel computing task to be created; determines from the disc allocation states which discs are unallocated; and selects M discs from the unallocated discs as target blade disks, where the storage space of the M discs is greater than or equal to the storage-space requirement. The M discs are each located on different blade disks.
Since different parallel computing tasks are likely to be used differently, the selective allocation of discs achieves a first level of balancing.
103: The component manager responds to the parallel-computing-task creation request and builds the parallel computing task on the target blade disks. The parallel computing task knows which target blade disks have been assigned to it and where those target blade disks are located.
104: If the parallel computing task has a data-storage requirement during operation, it first obtains the identifiers of the target blade disks and sends heat query requests to the target blade disks, each heat query request carrying the identifier of the target blade disk.
The parallel computing task needs to keep track of which blade disk each disc of its target blade disks belongs to; on this basis, the parallel computing task can query the heat directly, without going through the component manager.
105: The parallel computing task receives the heat values returned by the blade disks corresponding to the M discs.
106: The parallel computing task divides the data to be stored into at most M/2 target data pieces and, in order of the heat of the blade disks corresponding to the M discs from low to high, stores each target data piece into a disc of the target blade disks. The embodiments of the present invention define a specific format for blade-disk identifiers, which facilitates subsequent blade-disk lookup; in addition, the blade-disk assignment process enables a parallel computing task to be assigned more suitable blade disks, reducing congestion; furthermore, dividing the data to be stored and distributing it according to blade-disk heat improves the safety of data storage. A sketch of this selection-and-placement flow is given below.
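As a rough illustration of steps 101 to 106, the following sketch selects unallocated discs from distinct blade disks and places data pieces on the coolest disks first; all names (pick_target_discs, place_by_heat, the disc dictionary keys) are assumptions, not taken from the patent.

```python
def pick_target_discs(discs, space_needed, disc_size):
    """Steps 101-102: choose M unallocated discs, each on a different blade disk."""
    chosen, used_blades, total = [], set(), 0
    for d in discs:
        if d["allocated"] or d["blade_id"] in used_blades:
            continue
        chosen.append(d)
        used_blades.add(d["blade_id"])
        total += disc_size
        if total >= space_needed:
            return chosen
    return None                                  # not enough free space

def place_by_heat(pieces, target_discs, blade_heat):
    """Steps 104-106: store pieces on the discs whose blade disks are coolest."""
    ordered = sorted(target_discs, key=lambda d: blade_heat[d["blade_id"]])
    return {d["disc_id"]: piece for d, piece in zip(ordered, pieces)}
```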
The blade array ID is a P-digit hexadecimal value, the disc identifier is a Q-digit hexadecimal value, and the storage space of each disc is R. The method further comprises:
after determining that an access operation is needed, the parallel computing task determines the virtual address specified by the access operation. The target blade disks are composed of the discs they contain, ordered from low to high by the blade array ID of each disc; the virtual address is obtained by numbering sequentially from the starting address of the target blade disks as the starting virtual address. An address mapping table is stored in the parallel computing task, and each entry of the address mapping table contains a virtual disk number and a disc identifier.
The parallel computing task computes the integer quotient of the virtual address and R to obtain the virtual disk number of the virtual address, and computes the remainder of the virtual address divided by R to obtain the offset.
The parallel computing task looks up the address mapping table to obtain the entry containing the virtual disk number of the virtual address, and determines the disc identifier contained in that entry as the target disc identifier.
The parallel computing task takes the first P digits of the disc identifier as the target blade array ID and sends a read request to the blade disk corresponding to that target blade array ID; the read request contains the disc identifier and the offset, causing the disc corresponding to the disc identifier to return the data at the physical address offset by the offset from the starting position of the disc.
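The address translation just described boils down to a quotient/remainder split and a table lookup. A sketch under assumed names follows; R, P and the mapping-table layout mirror the text above.

```python
def resolve_virtual_address(vaddr: int, R: int, P: int, mapping_table: dict):
    """Translate a virtual address into (blade array ID, disc ID, offset)."""
    virtual_disk_no = vaddr // R               # integer quotient -> virtual disk number
    offset = vaddr % R                         # remainder -> offset within the disc
    disc_id = mapping_table[virtual_disk_no]   # table entry: virtual disk no -> disc id
    blade_array_id = disc_id[:P]               # first P hex digits name the blade array
    return blade_array_id, disc_id, offset

# Usage sketch: the read request would then be sent to the blade disk named by
# blade_array_id, carrying disc_id and offset.
```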
After the parallel computing task has been created, the method further comprises: if the parallel computing task needs to be deleted, the allocation state of each disc included in the target blade disks is set to unallocated, and the data already written to those discs is not deleted. After the allocation states of the discs included in the target blade disks have been set to unallocated, the method further comprises: when a new parallel computing task is created next time, the discs required by the new parallel computing task are obtained randomly, and at most two of the obtained discs belong to the discs included in the previous target blade disks.
The division unit also dynamically divides data blocks through a group of division vectors F according to the data block size, the number of blade disks and the load condition. For small files below a selection threshold T, or when the number of available blade disks in the system is N = 1, variable division is performed using the byte partition strategy; when more than one blade disk is available, files exceeding the selection threshold are handled using the reconstruct partition strategy. When a data block is divided, the divided data is stored evenly across the blade disks, making full use of the storage space of each blade disk while reducing the amount of file metadata.
When a data block is divided using the byte partition strategy, the remaining sub-block produced by the division is encrypted, backed up and transmitted to the corresponding blade disk, and the file division information, key information and file storage directory information generated during the division and storage process are saved into the encrypted area of the local flash memory chip. When data is divided using the reconstruct partition strategy, the cross-assignment function f_c and the reconstruction function f_r are called to perform cross-reconstruction on the divided data; the reconstructed data blocks are redundancy-encoded, encrypted and transmitted in parallel to the corresponding blade disks, and the file division information, key information and file storage directory information generated during the division and storage process are saved into the encrypted area of the local flash memory chip.
In the byte partition strategy, when the data block Block of a single blade disk is used, the client divides the data block Block, according to its size Size and the number of available blade disks W, into a byte sub-block and a remaining sub-block. The byte sub-block is formed by extracting a small number of bytes from the user file, and the remaining sub-block consists of the file data remaining after those bytes have been extracted. After the data block is divided, the client encrypts and backs up the remaining sub-block and transmits it to the corresponding remote blade disk, and the file control information generated during the division and storage process is stored together into the encrypted area of the local flash memory chip. The byte partition strategy divides the data block Block through the following two processes:
(1) The value range of the position sequence Array is determined to be 1 to Size according to the size Size of the data block Block; the default size r of the position sequence Array is then determined according to Size; random numbers of the corresponding quantity are then generated in the range 1 to Size; and finally the generated values are sorted by size and used in turn as the elements of the position sequence Array.
The default size d of the position sequence Array is determined according to the size Size of the data block Block. According to the number N of available blade disks in the system, N seeds E_i (i ∈ {1, 2, 3, ..., N}) in ascending order are generated within the determined value range 1..Size; this group of seeds is called the seed sequence S. A position sequence Array of size k, with k < d, is finally generated through the cyclic function f, where f = f_s + f_j: with the seed sequence S and the number of blade disks N as inputs, each element p_ji (i, j ∈ {1, 2, 3, ..., N}) of the position sequence Array is the output, p_ji denoting the i-th position element in the j-th cycle; f_s = E_i is a constant, an element of the seed sequence S; and f_j = (j − 1) × N denotes the cycle number. The detailed process is as follows:
Step 1: the seed sequence S = {E_1, E_2, E_3, ..., E_N}, generated randomly within the range 1..Size, and the number of blade disks N are taken as input values for the first cycle, where the cyclic function gives f(E_i, N) = p_1i. After each run, the cyclic function compares the number of position elements generated so far with d; if the two are equal, the cycle is exited directly and the generated position elements are sorted to produce the position sequence Array. If, when f(E_N, N) has been computed, the number of generated position elements is less than d and f(E_N, N) < Size, the second cycle is performed; otherwise the generated position element values are returned as the final result and the generation of the position sequence Array ends, each position element value being an element of the seed sequence S.
Step 2: in the second cycle, when i = 1 the cyclic function gives p_21. When p_21 > Size, or when the number of position elements already generated equals d, the cycle is exited and the generated position elements are sorted to produce the position sequence Array. When p_21 < Size and the number of position elements already generated is less than d, f(E_2, N) = p_22 is computed.
When p_22 > Size and the number of generated position elements is less than d, the seed E_2 and all subsequent seeds are deleted from the seed sequence S, the seed sequence is regenerated as S = {E_1}, and this cycle is exited to enter the next cycle. If the number of generated position elements equals d, the cycle is exited and the generated position elements are sorted to produce the position sequence Array. And so on: when i = N, the cyclic function gives f(E_N, N) = p_2N; if p_2N > Size and the number of generated position elements is less than d, the seed E_N is deleted from the seed sequence S, the seed sequence is regenerated as S = {E_1, E_2, E_3, ..., E_(N−1)}, and this cycle is exited to enter the next cycle; if the number of position elements already generated equals d, the cycle is exited and the generated position elements are sorted to produce the position sequence Array.
Step 3: from the second cycle onward, each pass is handled in exactly the same way: each time the cyclic function runs, the newly generated position element p_ji is compared once with the data block size Size. When p_ji < Size, the number of generated position elements is compared with d; if that number is less than d the cycle continues, and if it equals d the cycle is exited and the generated position elements are sorted to produce the position sequence Array. When p_ji ≥ Size, the number of generated position elements is likewise compared with d; if that number is less than d, the current seed E_i and all subsequent seeds are deleted from the seed sequence S, the seed sequence is regenerated, and this cycle is exited to enter the next cycle; if that number equals d, the cycle is exited and the generated position elements are sorted to produce the position sequence Array.
(2) After the position sequence Array has been generated successfully, the bytes at the corresponding positions of the original file are extracted in turn according to the value of each position element in the position sequence Array, and the extracted bytes, arranged in order, form the byte sub-block. The byte sub-block and the position sequence Array occur in pairs and are stored together into the local flash memory chip; the data remaining after the bytes have been extracted is called the remaining sub-block, and this block is stored on the remote blade disk.
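The extraction step can be illustrated with the simplified sketch below. Note that the patent derives the positions from the seed sequence and cyclic function described above; random sampling stands in for that here, and the function name byte_partition is an assumption.

```python
import random

def byte_partition(block: bytes, d: int):
    """Simplified sketch of the byte partition strategy: pick d sorted byte
    positions, extract those bytes as the byte sub-block, and keep the rest
    as the remaining sub-block."""
    size = len(block)
    positions = sorted(random.sample(range(size), d))   # stands in for the position sequence Array
    taken = set(positions)
    byte_sub_block = bytes(block[p] for p in positions)
    remaining_sub_block = bytes(b for i, b in enumerate(block) if i not in taken)
    # byte_sub_block and the positions stay on the local flash chip;
    # remaining_sub_block is encrypted and sent to the remote blade disk.
    return positions, byte_sub_block, remaining_sub_block
```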
The idea of the reconstruct strategy is: when multiple blade disks are used for the data block Block, if the size Size of the data block Block exceeds the selection threshold T, the data block Block is divided into multiple data blocks of equal size, and after the division is finished the system transmits the equally sized data blocks to the multiple available blade disks, improving file-access efficiency through parallel transmission.
The underlying principle is that when the data block Block is divided evenly, the amount of metadata is reduced as much as possible while improving the efficiency of parallel access to multiple blade disks. When the data block Block is divided for storage, it is first segmented: the data block Block is divided equally according to the number of available blade disks, each part produced by the segmentation being called a sub-block, of size P = Size/N. The division vector F then determines the appropriate division threshold for each sub-block; once the division threshold is determined, the client uses it to divide each sub-block. After the division, each sub-block contains one or more memory blocks, each memory block having size L_j. After the data block Block has been divided, the cross-assignment function f_c and the reconstruction function f_r combine the generated memory blocks crosswise into disk blocks; each disk block maintains a corresponding mapping relationship with a blade disk. The size B_i of a disk block uses memory blocks as its basic unit and its length is variable, defaulting to N memory blocks; if fewer than N memory blocks remain, the combination is made from the actual number of memory blocks. Finally, each disk block is transmitted in parallel over the network to the corresponding blade disk.
After the division of the data block Block is finished, the client redundancy-encodes and encrypts each generated disk block and transmits them in parallel to the remote blade disks, while the file control information is stored together into the encrypted area of the local flash memory chip. When the user needs the data block Block, the client reads the file control information in the flash chip, establishes communication connections with each blade disk, and downloads the required disk blocks in parallel; at the same time the client decrypts the obtained disk blocks and assembles them into the data block Block required by the user.
The reconstruct partition strategy divides and stores the data block Block in the following two stages:
(1) After the data block Block has been segmented, if the size P of each generated sub-block is less than the minimum division threshold specified by the division vector F, the client divides each part using the byte partition strategy; if the size P of each generated sub-block exceeds the minimum division threshold specified by the division vector F, the client determines the optimal division threshold Z_t through the division vector F and then divides each sub-block using that threshold. The detailed process of dividing each part is:
1. A group of division vectors F = {Z_0, Z_1, Z_2, Z_3, ..., Z_t, ..., Z_s} is defined, where Z_0 < Z_1 < Z_2 < Z_3 < ... < Z_t < ... < Z_s and each Z_t is a positive integer; through the division thresholds Z_t in this group of division vectors F, each sub-block can be divided flexibly.
2. After the data block Block has been segmented, the client determines the most suitable division threshold Z_t through the division vector F. First, the client computes each generated sub-block against each division threshold in the division vector F in turn; different division thresholds yield different division counts S. When the size P of a sub-block is divisible by Z_t, the division count is S = P/Z_t; when a sub-block is not divisible by Z_t, the division count is S = ⌊P/Z_t⌋ + 1, where ⌊·⌋ denotes the floor operation. The client then compares each computed division count S with the number N of available blade disks in turn. If there exist division counts S ≤ N, the division threshold whose S is closest to the number of blade disks N is taken as the optimal division threshold Z_t; if all resulting division counts S are greater than the number of blade disks N, the division threshold for which S mod N is closest to N is taken as the optimal division threshold Z_t. Finally, the client divides each sub-block using the optimal division threshold Z_t.
3. When the size P of each sub-block does not exceed the minimum division threshold Z_0 in the division vector F, the byte partition strategy is called to perform the division, and the default size r of each position sequence is equal. After each part has been divided, the client redundancy-encodes each remaining sub-block, transmits it to the corresponding blade disk and stores it.
When the size P of each sub-block exceeds the minimum division threshold Z_0 in the division vector F, the optimal division threshold Z_t determined in step 2 is used to divide each sub-block. After the division of each part, each sub-block is divided into one or more memory blocks. If each sub-block contains n memory blocks, the data block Block is divided into n × N memory blocks, the generated memory blocks being denoted chunk_1, chunk_2, ..., chunk_(n×N), and the intersection of any two memory blocks is empty; the union of all memory blocks is therefore the data block Block, i.e. chunk_1 ∪ chunk_2 ∪ ... ∪ chunk_(n×N) = Block. During the division, if each sub-block is divisible by the optimal division threshold Z_t, the size L_j of each generated memory block is exactly the division threshold Z_t; if a sub-block is not divisible by the optimal division threshold Z_t, then apart from the last memory block of that sub-block, the size L_j of every other memory block equals the division threshold Z_t, and the size of the last memory block of the sub-block is P − (n − 1) × Z_t. The effect of the division vector F is thus that the memory blocks are obtained evenly by the division.
(3) After the client has divided the data block Block, it calls the cross-assignment function f_c and the reconstruction function f_r to combine the n × N memory blocks contained in the file Block crosswise into disk blocks. The detailed process is: the memory blocks contained in each sub-block are serialized uniformly; if the last memory block of a sub-block differs in size from the other memory blocks, the sequence number of the last memory block of each sub-block is set in turn to n × N − (N − i), where i is the ID of the sub-block. After serialization, every memory block in the data block Block possesses a unique sequence number A, A ∈ {1, 2, 3, ..., n × N}. All memory blocks are then shuffled through the cross-assignment function f_c, where f_c = {A} mod N, {A} is the set of sequence numbers of the memory blocks contained in the data block Block, and N is the number of blade disks available for storage; applying the function f_c to the sequence-number set {A} yields N groups of memory-block sets.
After all memory blocks have been shuffled, the reconstruction function f_r is used to reconstruct each group of memory-block sets; after processing by the reconstruction function f_r, every group of memory-block sets contains the same number of disk blocks, where f_r = T_i/N and T_i denotes the number of memory blocks in the i-th group of memory-block sets. Finally, the disk blocks contained in each group of memory-block sets are transmitted to the respective blade disk, the number of groups of memory-block sets corresponding to the number of available blade disks; each group interacts with its corresponding blade disk in parallel.
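The skeleton of the reconstruct partition strategy (segment, choose Z_t, split into memory blocks, cross-assign by sequence number mod N) can be sketched as follows. The function names are assumptions, the threshold selection is a simplified reading of step 2 above, and the sketch assumes the block length is divisible by N.

```python
def choose_threshold(P: int, F: list, N: int) -> int:
    """Pick the division threshold from F whose division count S is closest to N."""
    def count(z):                                   # division count S for threshold z
        return P // z if P % z == 0 else P // z + 1
    small = [z for z in F if count(z) <= N]
    if small:
        return min(small, key=lambda z: N - count(z))
    return min(F, key=lambda z: abs(N - count(z) % N))

def reconstruct_partition(block: bytes, F: list, N: int):
    """Segment Block into N sub-blocks, split each by Z_t, then cross-assign
    memory blocks to N groups by sequence number mod N (f_c = {A} mod N)."""
    P = len(block) // N                             # sub-block size (assumes divisibility)
    sub_blocks = [block[i * P:(i + 1) * P] for i in range(N)]
    Zt = choose_threshold(P, F, N)
    memory_blocks = []                              # (sequence number A, chunk)
    seq = 1
    for sub in sub_blocks:
        for off in range(0, len(sub), Zt):
            memory_blocks.append((seq, sub[off:off + Zt]))
            seq += 1
    groups = {g: [] for g in range(N)}              # one group per blade disk
    for A, chunk in memory_blocks:
        groups[A % N].append(chunk)                 # cross-assignment f_c
    return groups                                   # each group goes to one blade disk
```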
When a request related to the cloud computing system's business is received from a user, an API based on the HTTP protocol is provided to the user, and the user submits requests related to the cloud computing system's business by calling that API. When a user needs to use the services of the cloud computing system, the user first packages their own algorithm into an open-source engine image according to a certain specification, uploads it to the open-source engine image repository, and then calls the API to submit the request. After the user has submitted a request, the request related to the cloud computing system's business is received from the user.
When determining DataNodes, upon receiving the above request, one or more DataNodes for carrying out the above cloud-computing-system business are determined according to the virtual-address information of the DataNodes of the cloud computing system. Specifically, the sum of the allocated virtual resources of all computing tasks on a physical machine may exceed the virtual resources within the physical machine's limit, and there may be a situation in which the remaining virtual resources, obtained by subtracting from the virtual resources within the physical machine's limit the sum of the virtual resources actually used by all ordinary parallel computing tasks, are not less than the allocated virtual resources of the DataNode.
In the storage management of the blade disks, the cloud computing system of the present invention presents multiple independent blade disks as a single logical disk, and through the corresponding file division, encryption and transmission mechanisms realizes local, secure handling of cloud data blocks, improving the user's ability to manage the data they own.
The logical disk is loaded through the flash memory chip, so only a legitimate user who possesses the flash chip can load the logical disk and obtain the required service. A secure storage management mechanism is established through data block division and encryption. Dividing the data block ensures that no single blade disk stores the complete information of a user file, guaranteeing the privacy of the user data. The file division information is saved on the user terminal, and the data blocks are transmitted to the blade disks.
The flash chip possessed by the user is the credential of the user's legal identity. After identity authentication passes, the terminal device loads the logical disk according to the volume file specified in the flash chip; through the logical disk the user synchronously manages data across multiple blade disks. When the user stores a data block through the logical disk, the user terminal first divides the target file into one or more memory blocks; each memory block is then encrypted to enhance the confidentiality of the stored data, and finally each encrypted block is transmitted to multiple blade disks. The file control information generated during the division and encryption of the target file is saved into the encrypted area of the flash chip; by separating the control information from the data objects themselves, control over the data blocks is transferred.
When the user reads data of the cloud computing system through the logical disk, the terminal first reads the control information of the corresponding file in the flash chip, then downloads the corresponding data blocks from each blade disk in parallel, and finally decrypts each data block and verifies its integrity. If the data-integrity verification succeeds, the data blocks are assembled and reconstructed into the required file, which is presented to the user in plaintext; if the data-integrity verification fails, redundant data blocks are downloaded from the corresponding blade disks to restore the lost or damaged data, as illustrated by the sketch below.
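A minimal sketch of this read path follows, assuming hypothetical helper objects for the flash chip, the blade disks and the cipher; none of these names come from the patent, and SHA-256 stands in for whatever integrity check is actually used.

```python
import hashlib

def read_file(flash_chip, blade_disks, file_name):
    """Sketch of the read path: control info from the flash chip, block
    download, integrity check, fallback to the redundant copy."""
    ctrl = flash_chip.read_control_info(file_name)            # assumed flash-chip API
    blocks = []
    for ref in ctrl["block_refs"]:                             # one entry per data block
        raw = blade_disks[ref["disk"]].download(ref["block_id"])
        block = ctrl["cipher"].decrypt(raw)                    # assumed cipher object
        if hashlib.sha256(block).hexdigest() != ref["digest"]:
            # Integrity check failed: fetch the redundant copy instead.
            raw = blade_disks[ref["redundant_disk"]].download(ref["block_id"])
            block = ctrl["cipher"].decrypt(raw)
        blocks.append(block)
    return b"".join(blocks)                                    # reassembled plaintext file
```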
In conclusion the present invention proposes a kind of data file reliable storage method, the big data of real-time change is realized The efficient real-time processing of set.
Obviously, it should be appreciated by those skilled in the art each units or each step of, the above-mentioned present invention can be with general Computing system realize that they can be concentrated in single computing system, or be distributed in multiple computing systems and formed Network on, optionally, they can be realized with the program code that computing system can perform, it is thus possible to they are stored It is executed within the storage system by computing system.In this way, the present invention is not limited to any specific hardware and softwares to combine.
It should be understood that the above-mentioned specific implementation mode of the present invention is used only for exemplary illustration or explains the present invention's Principle, but not to limit the present invention.Therefore, that is done without departing from the spirit and scope of the present invention is any Modification, equivalent replacement, improvement etc., should all be included in the protection scope of the present invention.In addition, appended claims purport of the present invention Covering the whole variations fallen into attached claim scope and boundary or this range and the equivalent form on boundary and is repairing Change example.

Claims (4)

1. A data file reliable storage method, characterized by comprising:
dividing received data according to data block name and transferring it to the relay unit of a DataNode through the channel corresponding to that data block name; and transferring result data from the relay unit to the corresponding cloud storage server.
2. The method according to claim 1, characterized in that:
the relay unit receives the divided data and places the received data into a queue named after the pending business;
based on the priority of each pending business, a pending business is started and sent to the computing unit of the DataNode;
the computing unit, for the started pending business, computes on the data from the relay unit and outputs the computed data blocks to the relay unit.
3. The method according to claim 1, characterized in that the method further comprises:
to isolate the data transmission of the DataNode from its internal computation logic, dividing the input data by data block name and transferring the data to the relay unit; the relay unit maintains, according to the association between the data and the pending businesses of the current DataNode, a hierarchical queue of all pending businesses in the ready state.
4. The method according to claim 1, characterized in that the relay unit determines, according to the load of the DataNode, how many businesses to start, and selects the corresponding number of highest-priority pending businesses from the hierarchical queue to start;
the relay unit also transfers the data to the computing unit that executes the pending business, and receives the result data processed by the computing unit; in a cloud computing system comprising the above DataNodes, the division, fusion and processing of input or result data are all completed in memory.
CN201810186581.0A 2018-03-07 2018-03-07 Data file reliable storage method Pending CN108388658A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810186581.0A CN108388658A (en) 2018-03-07 2018-03-07 Data file reliable storage method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810186581.0A CN108388658A (en) 2018-03-07 2018-03-07 Data file reliable storage method

Publications (1)

Publication Number Publication Date
CN108388658A true CN108388658A (en) 2018-08-10

Family

ID=63066846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810186581.0A Pending CN108388658A (en) 2018-03-07 2018-03-07 Data file reliable storage method

Country Status (1)

Country Link
CN (1) CN108388658A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI745966B (en) * 2020-05-15 2021-11-11 昕力資訊股份有限公司 Computer program product and apparatus for managing data caches
CN115587393A (en) * 2022-08-17 2023-01-10 广州红海云计算股份有限公司 Distributed performance data processing method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103905553A (en) * 2014-04-04 2014-07-02 江苏林洋电子股份有限公司 Cloud architecture of energy efficiency management system and operation method thereof
CN106101213A (en) * 2016-06-08 2016-11-09 四川新环佳科技发展有限公司 Information-distribution type storage method
CN106453360A (en) * 2016-10-26 2017-02-22 上海爱数信息技术股份有限公司 Distributed block storage data access method and system based on iSCSI (Internet Small Computer System Interface) protocol
CN106681834A (en) * 2016-12-28 2017-05-17 上海优刻得信息科技有限公司 Distributed calculating method and management device and system
CN106970830A (en) * 2017-03-22 2017-07-21 佛山科学技术学院 The storage controlling method and virtual machine of a kind of distributed virtual machine
CN107046510A (en) * 2017-01-13 2017-08-15 广西电网有限责任公司电力科学研究院 A kind of node and its system of composition suitable for distributed computing system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103905553A (en) * 2014-04-04 2014-07-02 江苏林洋电子股份有限公司 Cloud architecture of energy efficiency management system and operation method thereof
CN106101213A (en) * 2016-06-08 2016-11-09 四川新环佳科技发展有限公司 Information-distribution type storage method
CN106453360A (en) * 2016-10-26 2017-02-22 上海爱数信息技术股份有限公司 Distributed block storage data access method and system based on iSCSI (Internet Small Computer System Interface) protocol
CN106681834A (en) * 2016-12-28 2017-05-17 上海优刻得信息科技有限公司 Distributed calculating method and management device and system
CN107046510A (en) * 2017-01-13 2017-08-15 广西电网有限责任公司电力科学研究院 A kind of node and its system of composition suitable for distributed computing system
CN106970830A (en) * 2017-03-22 2017-07-21 佛山科学技术学院 The storage controlling method and virtual machine of a kind of distributed virtual machine

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王帅 (Wang Shuai): "Research and Implementation of a Terminal-side Transparent Encrypted Storage System for Multiple Cloud Disks", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI745966B (en) * 2020-05-15 2021-11-11 昕力資訊股份有限公司 Computer program product and apparatus for managing data caches
CN115587393A (en) * 2022-08-17 2023-01-10 广州红海云计算股份有限公司 Distributed performance data processing method and device

Similar Documents

Publication Publication Date Title
JP7304118B2 (en) Secure, consensual endorsements for self-monitoring blockchains
CN106233259B (en) The method and system of more generation storing datas is retrieved in decentralized storage networks
CN112153085B (en) Data processing method, node and block chain system
US9298732B2 (en) Searching cloud-based distributed storage resources using a set of expendable probes
CN102971724B (en) The method and apparatus relevant with the management based on modular virtual resource in data center environment
CN108769146B (en) Data transmission method and device based on block chain and block chain system
Shakarami et al. Data replication schemes in cloud computing: a survey
Wagh et al. Differentially private oblivious ram
CN112835977B (en) Database management method and system based on block chain
US11113244B1 (en) Integrated data pipeline
US20130304774A1 (en) Determining File Allocation Based on File Operations
CN113875206A (en) Private virtual network replication of cloud databases
WO2018157768A1 (en) Method and device for scheduling running device, and running device
CN108388658A (en) Data file reliable storage method
CN111699481A (en) Reducing model update induced errors
KR101428649B1 (en) Encryption system for mass private information based on map reduce and operating method for the same
CN108399099A (en) File security stores and content protecting method
CN107204998B (en) Method and device for processing data
CN116703601B (en) Data processing method, device, equipment and storage medium based on block chain network
CN104954452B (en) Cipher card resource dynamic control method under a kind of virtualized environment
CN110493323A (en) Fairness document distribution method, system and storage medium based on block chain
CN107465717B (en) Password on-demand service method, device and equipment
CN108334291A (en) The method for establishing mobile terminal trusted context
Naveenkumar et al. Evaluation of Active Storage System Realized Through Hadoop
CN108228099A (en) A kind of method and device of data storage

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20180810