CN108920110A - Parallel-processing big data storage system and method based on an in-memory computing model - Google Patents


Info

Publication number
CN108920110A
CN108920110A (application CN201810826423.7A)
Authority
CN
China
Prior art keywords
data
memory
vector
big data
parallel processing
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810826423.7A
Other languages
Chinese (zh)
Inventor
吴勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Mechanical and Electrical Polytechnic
Original Assignee
Hunan Mechanical and Electrical Polytechnic
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Hunan Mechanical and Electrical Polytechnic
Priority claimed from CN201810826423.7A
Publication of CN108920110A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614 Improving the reliability of storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629 Configuration or reconfiguration of storage systems
    • G06F3/0631 Configuration or reconfiguration of storage systems by allocating resources to storage systems
    • G06F3/0634 Configuration or reconfiguration of storage systems by changing the state or mode of one or more devices
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0644 Management of space entities, e.g. partitions, extents, pools
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Abstract

The invention belongs to the technical field of information retrieval and database structures, and discloses a parallel-processing big data storage system and method based on an in-memory computing model. The invention designs a novel hybrid memory hierarchy combining storage-class memory (SCM) with conventional DRAM, substantially increasing memory capacity while preserving advantages in cost and energy demand. Computation can then proceed not only in DRAM but also in SCM, giving big data processing a data-centric processing model built on this hybrid memory architecture and markedly improving its timeliness. Building a large-capacity hybrid memory hierarchy on novel non-volatile memory devices to accelerate data processing in this way, and thereby significantly improve the timeliness of big data processing, is referred to as in-memory computing. From an architectural standpoint, the emergence of the in-memory computing model makes it possible to support big data processing with strong timeliness, high performance, and high throughput.

Description

Parallel-processing big data storage system and method based on an in-memory computing model
Technical field
The invention belongs to the technical field of information retrieval and database structures, and in particular relates to a parallel-processing big data storage system and method based on an in-memory computing model.
Background technique
In the current state of the art, big data poses the "4V" challenges. Volume: data sets keep growing, from terabytes to petabytes and beyond. Variety: data types are diverse, spanning traditional structured data as well as unstructured data such as text, video, images, and audio, with the share of unstructured data rising rapidly. Value: value density is low, making computations such as predictive analysis, operational intelligence, and decision support difficult. Velocity: the speed of big data processing is an increasingly prominent problem, and timeliness is hard to guarantee. In essence, the challenges of big data processing arise from the mismatch between the processing capacity of computer systems and the scale of the data. The fast growth rate and low temporal locality exhibited by big data objectively aggravate this mismatch, so that the traditional computation-centric model faces limited memory capacity, heavy input/output (I/O) pressure, low cache hit rates, and poor overall processing performance; it is difficult to obtain an optimal balance among performance, energy consumption, and cost, and current computer systems cannot effectively process big data at the petabyte scale and above. On the distributed-systems side, frameworks such as MapReduce (Hadoop) have been proposed for big data processing. By exposing just two functions, Map and Reduce, over key-value data, MapReduce easily achieves good scalability and fault tolerance in distributed systems. However, MapReduce must fetch data from disk and write intermediate result data back to disk; this disk-based design makes it inefficient, imposes heavy I/O overhead, and makes it unsuitable for applications with online and real-time demands. Although processing data on multiple nodes simultaneously can relieve some of the challenges facing big data processing, such distributed systems rely mainly on coarse-grained parallelism and do not fully exploit the resource capabilities of existing computing units. In short, current optimizations of big data processing are all built on the traditional memory-disk access model; despite various optimizations, the key "data I/O bottleneck" of data processing always remains.
In summary, the problems in the prior art are:
(1) Current computer systems cannot effectively process big data at the petabyte scale and above. For PB-scale data, memory space is limited, and pages must constantly be swapped between external storage and memory, degrading the efficiency of data processing. Although present big data systems decompose large jobs through technical means such as sharding, Mappers, and Reducers, memory limits mean that during processing the data involved is repeatedly read from and written to external storage, so the real-time performance of big data processing suffers badly.
(2) MapReduce must fetch data from disk. Data produced by a Mapper is not written straight to disk but is first written to memory and flushed to disk once it reaches a certain volume; that is, when the data volume being processed is too large, intermediate result data is written back to disk. This disk-based design makes it inefficient, imposes heavy I/O overhead, and makes it unsuitable for applications with online and real-time demands.
The difficulties and significance of solving the above technical problems are:
1. How to coordinate the unified addressing and use of the novel storage SCM and the DRAM: once they are treated as a single whole, how should they be addressed?
2. How to place data with different write-operation frequencies in SCM and DRAM respectively. Only SCM's read speed is comparable to DRAM's; its write speed is 10 to 100 times slower or worse, and SCM suffers permanent failure after on the order of millions of writes. Reducing the number of writes, so as to maximize SCM's favourable factors and minimize its unfavourable ones, is the difficult point of this scheme.
3. In this hybrid memory hierarchy, how to guarantee the accuracy of data reads and writes.
By designing a novel hybrid SCM/DRAM in-memory computing system, the invention on the one hand increases memory capacity, improves computational efficiency, and reduces power consumption, and on the other hand prevents data loss on power failure, thus protecting the data.
Summary of the invention
In view of the problems in the prior art, the present invention provides a parallel-processing big data storage system and method based on an in-memory computing model.
The invention is realized as follows. In the parallel-processing big data storage method based on the in-memory computing model, novel storage-class memory (SCM) and DRAM are combined into one block serving collectively as main memory; SCM mainly stores raw data for read operations, while DRAM stores data that is frequently read and written.
The parallel-processing big data storage method based on the in-memory computing model comprises: addressing the novel SCM and the DRAM as a unified space; storing raw data read in from external storage in SCM; storing the frequently accessed intermediate data and check data produced during program execution in DRAM; and writing the buffered intermediate data into SCM once it reaches a certain volume.
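The placement policy described above can be sketched as follows. This is an illustrative toy model, not the patent's implementation: the class name, the page representation, and the flush threshold are all assumptions made for the example.

```python
# Toy sketch of a unified SCM+DRAM address space (illustrative assumptions):
# raw pages loaded from external storage go to SCM (read-mostly); frequently
# written intermediate/check pages stay in DRAM and are migrated to SCM only
# in batches, which limits the number of wear-inducing SCM writes.
class HybridMemory:
    def __init__(self, flush_threshold=4):
        self.scm = {}                      # large capacity, slow to write
        self.dram = {}                     # small capacity, fast read/write
        self.flush_threshold = flush_threshold

    def load_raw(self, addr, page):
        """Raw data read in from external storage is placed in SCM."""
        self.scm[addr] = page

    def write_intermediate(self, addr, page):
        """Hot intermediate data is buffered in DRAM first."""
        self.dram[addr] = page
        if len(self.dram) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Batch-migrate buffered pages into SCM, then clear DRAM."""
        self.scm.update(self.dram)
        self.dram.clear()

    def read(self, addr):
        """Unified addressing: check DRAM first, then fall back to SCM."""
        return self.dram.get(addr, self.scm.get(addr))
```

The batching in `flush()` is the point of the design: SCM writes happen once per batch rather than once per intermediate update.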
The data processing of the parallel-processing big data storage method based on the in-memory computing model further comprises:
(1) Let G_{r·m} be any binary encoder matrix with entries in {0, 1}; the matrix G_{r·m} is used to generate the redundant data.
(2) From the number of "1"s in each row vector l_1, l_2, ..., l_{r·m} of the binary encoder matrix, determine the number of XOR operations required to compute the check bits from that vector, and compute the number of differing bits between any two vectors l_a and l_b.
(3) If the number of elements equal to "1" in a vector l_a is k, then generating the redundant data from that vector requires k-1 XOR operations.
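Steps (1) to (3) can be illustrated with a short sketch. The function names are assumptions for the example; blocks are modelled as integers so that `^` is the per-bit XOR.

```python
# Illustrative sketch: generating one redundant (check) block from a 0/1
# row of the encoder matrix. A row with k ones selects k data blocks, so
# producing the check block costs k-1 pairwise XOR operations.
from functools import reduce

def check_block(row, data_blocks):
    """XOR together the data blocks selected by the 0/1 row vector."""
    selected = [b for bit, b in zip(row, data_blocks) if bit == 1]
    return reduce(lambda x, y: x ^ y, selected)

def xor_cost(row):
    """XOR count for this row: k-1, where k is the number of ones."""
    return sum(row) - 1

def diff_bits(row_a, row_b):
    """Number of positions at which two row vectors differ."""
    return sum(a != b for a, b in zip(row_a, row_b))
```

Here `xor_cost` corresponds to the k-1 count of step (3) and `diff_bits` to the pairwise comparison of step (2).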
Further, in the parallel-processing big data storage method based on the in-memory computing model, the optimization of the encoding computation that the entire encoder matrix G_{r·m} applies to the original file comprises:
(1) From the number of "1"s in each row vector of G_{r·m}, determine the number of XORs needed to compute the check bits from that row vector: if the number of "1"s in the row vector is k, the number of XORs required to compute the check bits using that row vector is (k-1)·m, where m is the size of each original data block participating in the check computation.
(2) For every pair of row vectors in the encoder matrix, compare the number of positions at which the elements are identical and the number at which they differ, denoted (e/d), where e is the number of positions at which the elements of the two vectors agree and d the number at which they differ.
(3) If the number of XORs required for some row vector l_i (1 ≤ i ≤ r·m) is less than or equal to the differing-bit count d from step (2), compute the check data block corresponding to that row directly from the vector, and denote the vector l_j.
(4) Using the vector l_j determined in (3) and the ratio of identical to differing bits from step (2), determine the next row vector to compute: when some row vector l_k differs from l_j in fewer positions than it agrees, and the number of positions at which l_k differs from l_j is the minimum over all remaining vectors, derive the check data determined by l_k from the check data already computed from l_j.
(5) If check bits remain uncomputed, apply the rule of (4) with l_k as the base vector to find the next vector to compute, and return to (4).
(6) Determine whether the complete check-bit computation order has been established; if so, save the order in which the check bits are computed; if not, compute according to the original correspondence.
Further, the storage and index processing method for the check data comprises:
(1) First store each row, i.e., each record, into HBase with its primary key as the rowkey; each attribute name serves as a column-family name, every column family has exactly one column with a fixed column name, and the attribute value is stored as the cell value. Because each attribute is stored in its own column family, when a check rule matches a certain attribute value by primary key, unrelated attribute values need not be read in.
(2) Then build an index table using the attribute-field value involved in the check rule as the rowkey. The rowkey format of the index table is {main-table indexed column value}, and the value format is {main-table rowkey 1, main-table rowkey 2, ...}; each main-table rowkey is stored under its own column name, so adding a main-table rowkey only requires adding a column. When a check rule needs to match other attribute values against some attribute value, all records with the same attribute value can be found quickly and checked.
(3) An index table based on timestamps allows data within a fixed time interval to be queried rapidly for checking: the rowkey is the timestamp and the value is the primary key. Full-volume data processing is the data storage and index processing performed when quality checks are run over the large volumes of historically accumulated data; its input differs from incremental data processing in that, after the incremental data and indexes are loaded into HBase by the data storage and indexing method, the attribute fields relevant to the full-volume check rules are additionally extracted and stored into an HDFS index file.
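The layout in steps (1) and (2) can be mimicked with plain dictionaries. This is a schematic model only; the record fields and function names are invented for the example and do not use the HBase API.

```python
# Schematic model of the main table plus secondary index table:
# the main table is keyed by primary key; the index table is keyed by an
# attribute value and holds the set of matching main-table rowkeys, so a
# check rule can find all records sharing a value without a full scan.
def build_tables(records, indexed_attr):
    main = {}    # rowkey (primary key) -> {attribute name: value}
    index = {}   # attribute value -> set of main-table rowkeys
    for pk, attrs in records:
        main[pk] = dict(attrs)
        index.setdefault(attrs[indexed_attr], set()).add(pk)
    return main, index

def lookup_by_attr(index, value):
    """All main-table rowkeys whose indexed attribute equals `value`."""
    return sorted(index.get(value, set()))
```

Adding a new matching record only adds one entry under the existing index rowkey, mirroring the "add a column" behaviour described in step (2).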
Further, the reuse-based query method over the frequently read and written stored data comprises:
For the data warehouse D = {S_1}, before loading the target table T, obtain the triple M' using the SchemaMatching() and Filter() algorithms; then save the relevant information of M' into the RTable; finally, the DataLoad_Reusing() algorithm completes the loading of the original data.
(1) Query matching: by querying the RTable it is known that no reuse information can be found for A'_4 in T, while reusable data for A'_1, A'_2, and A'_3 can be found in S_1.
(2) Query rewriting: first ensure the query statements are equivalent. In Q_1, for A'_1 > const1, since A'_1 = A_11 the data of the two columns are equivalent and the selection condition need not change. For A'_2 > const2, there is a conversion relation between the data of A'_2 and A_12, and whether the selection condition must change depends on the source of the target data. Let sum1 and sum2 be the record totals of T and S_1 respectively. If sum1 = sum2, A'_2 completely reuses A_12, and the original query statement is rewritten per f' as Q'_1: SELECT A'_1, A'_2, A'_3, A'_4 FROM T WHERE A'_1 > const1 AND A'_2 > (const2/0.1). Otherwise, when sum1 > sum2, the data come both from the reusable data set in A_12 and from externally imported data; for the externally imported data the selection condition remains A'_2 > const2 and the query Q_1 is unchanged, while for the reusable data the query is rewritten as Q'_1.
(3) Query execution: for complete reuse, execute Q_1 directly. Otherwise the query is decomposed by data source: the start and end data blocks of each reuse relation are obtained from the blk_id_list corresponding to <col'_t, col'_s>; Q'_1 is executed over the reusable data, while Q_1 is still executed over the externally imported data.
(4) Result integration: the data items satisfying the conditions in the reusable data are read, each converted according to col'_t = f(col'_s), and finally the converted data are merged and the final result is output.
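The rewrite in step (2) can be illustrated for the scaling case implied by the const2/0.1 term. This is a hedged sketch under the assumption that the conversion f is a linear scaling A'_2 = A_12 · factor; the function names and data are invented for the example.

```python
# Sketch of query rewriting over reusable source data: when the target
# column equals a source column times a known factor, the predicate
# "target > const" is pushed to the source as "source > const / factor",
# and matching rows are converted to the target schema on output.
def rewrite_threshold(const, factor):
    """Equivalent source-side threshold for 'target > const'."""
    return const / factor

def query_reused(source_rows, const1, const2, factor=0.1):
    """Answer the target query from reusable (A11, A12) source rows."""
    out = []
    for a11, a12 in source_rows:
        # A'_1 = A11, so the first condition is unchanged.
        if a11 > const1 and a12 > rewrite_threshold(const2, factor):
            out.append((a11, a12 * factor))   # convert A12 -> A'_2
    return out
```

With factor = 0.1, the rewritten condition is A_12 > const2 / 0.1, matching the rewritten query Q'_1 above.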
Another object of the present invention is to provide a parallel-processing big data storage system based on the in-memory computing model that realizes the above storage method. The system comprises:
a user program module, connected to the multi-core module, the memory module, and the disk module respectively, for outputting the processed data;
a disk module, connected to the memory module; the memory module obtains data from disk, i.e., via the traditional memory-disk access model;
a memory module, connected to the multi-core module, for processing the stored data through the multi-core module.
Another object of the present invention is to provide a computer program implementing the parallel-processing big data storage method based on the in-memory computing model.
Another object of the present invention is to provide an information data processing terminal implementing the parallel-processing big data storage method based on the in-memory computing model.
Another object of the present invention is to provide a computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to execute the parallel-processing big data storage method based on the in-memory computing model.
In conclusion advantages of the present invention and good effect are:The present invention is based on novel storage level memory (Storage Class Memory, SCM) and the novel mixing memory hierarchy of traditional DRAM design, under the premise of keeping cost and energy demand advantages Memory size is substantially improved, making to calculate can not only carry out on DRAM memory, can also carry out on SCM, be at big data Reason provide it is a kind of based on mixing memory architecture data-centered tupe, significantly promoted big data processing when Effect property.One is provided effectively based on novel memory devices part and traditional novel mixing memory hierarchy of DRAM design for big data processing Support technology;Large capacity mixing memory hierarchy is constructed based on novel non-volatile memory devices to accelerate data processing Mode, to significantly promote the timeliness of big data processing, referred to as memory is calculated.From the point of view of architecturally, memory is calculated The appearance of mode provides strong timeliness, high-performance, the high architecture handled up for big data processing and supports to bring possibility.Based on interior The parallel processing system (PPS) for depositing calculating mode is mainly faced with the challenge of three key technical problems:Isomery collaboration, efficient parallel and Adaptive scheduling management.Isomery collaboration refers to how architecture and operating system level realize isomery level memory hierarchy Coordinated management, transparent service data processing back-up environment;Efficient parallel refers in programming model and parallel processing level how It calculates based on memory, realizes the efficient parallel processing environment of big data;Adaptive scheduling management refers to that memory calculates parallel ring It, can be dynamically using different suitable how according to calculate node structure and characteristic in 
border, and the characteristics of application data processing Resource scheduling management strategy, with realize big data parallel processing system (PPS) resource load stabilization and efficiently utilize.
Compared with the prior art, the present invention optimizes the encoding process and reduces its computational load. When the storage system encodes data for storage, the computation order of the original check data blocks can be changed according to the characteristics of each row vector of the encoder matrix, thereby reducing the number of operations in the encoding process. After the computation order of the encoder matrix has been optimized with the proposed method, the optimized order can be stored in the computer and followed in every subsequent computation. The encoding-optimization method proposed by the invention is applicable to all binary matrices; in particular, it can be applied to any procedure computed from a binary matrix, not only the encoding process at storage time but also the reconstruction of lost data blocks from the binary check matrix when data blocks are lost, and it therefore has value for wide adoption.
The check-data processing method of the invention was tested in a single-node, single-rule checking experiment over the full marketing and GIS tables, using the HDFS index with the data loaded into memory: the test took 42 s, of which 40 s was spent loading the entire HDFS index data from HDFS into memory and only 2 s scanning the in-memory data for the full-volume check. By contrast, the existing database-based data-checking production system takes about 40 min for a single rule of full-volume checking; when the Hadoop-based checking system performs single-rule full-volume checks on the GIS and marketing tables, it is therefore roughly 50 times faster than the existing database-based production system. If a certain number of Hadoop nodes are deployed to execute multi-rule full-volume checks in parallel, then even allowing for some performance degradation from shared HDFS access during multi-rule execution, the total full-volume checking time is expected to improve on the current database-based production system by at least an order of magnitude.
Brief description of the drawings
Fig. 1 is a schematic diagram of the architecture of the parallel-processing big data storage system based on the in-memory computing model provided by an embodiment of the present invention.
In the figure: 1, user program module; 2, multi-core module; 3, memory module; 4, disk module.
Specific embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is further elaborated below with reference to the embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
The present invention aims to address the facts that current computer systems cannot effectively process big data at the petabyte scale and above, and that MapReduce must fetch data from disk and write intermediate result data back to disk, a disk-based design that is inefficient, imposes heavy I/O overhead, and is unsuitable for applications with online and real-time demands. Architecturally, the emergence of the in-memory computing model makes it possible to support big data processing with strong timeliness, high performance, and high throughput.
The application principle of the invention is explained in detail below with reference to the drawings.
As shown in Fig. 1, the parallel-processing big data storage system based on the in-memory computing model provided by an embodiment of the present invention comprises:
a user program module 1, connected to the multi-core module 2, the memory module 3, and the disk module 4 respectively, for outputting the processed data;
a disk module 4, connected to the memory module 3; the memory module 3 obtains data from disk, i.e., via the traditional memory-disk access model;
a memory module 3, connected to the multi-core module 2, for processing the stored data through the multi-core module 2.
The storage method of the parallel-processing big data storage system based on the in-memory computing model is to design a novel memory system mixing the novel storage-class memory SCM with traditional DRAM.
In the parallel-processing big data storage system and method based on the in-memory computing model provided by the embodiment of the present invention, the novel storage-class memory SCM and the DRAM are combined into one block serving collectively as main memory: SCM mainly stores raw data for read operations, while DRAM stores the data that is frequently read and written. The novel SCM and the DRAM are addressed as a unified space; raw data read in from external storage is stored in SCM; the frequently accessed intermediate data and check data produced during program execution are stored in DRAM; and the buffered intermediate data is written into SCM once it reaches a certain volume, improving the efficiency of data processing.
The data processing of parallel processing big data storage method for calculating mode based on memory further comprises:
(1) if being arbitrarily G by the binary system encoder matrix that " 0,1 " determinesr·m, Gr·mServe as reasons " 0,1 " composition two into Matrix processed, matrix are embodied as generating redundant data:
(2) according to the row vector l of binary coded matrix1, l2..., lr·mIn the number of " 1 " determine according to the vector Required XOR calculation times when check bit are calculated, and calculate any two vectors la, lbBetween different digit;
(3) if vector laMiddle element is that the digit of " 1 " is k, then system carries out generating redundant data needs using the vector Carry out k-1 XOR operation.
Further, the parallel processing big data storage method for calculating mode based on memory is directed to entire encoder matrix Gr·mTo original document carry out coding calculating optimization method include:
(1) according to G in encoder matrixr·mEach row vector in " 1 " number, determine according to the row vector calculate school XOR number required for position is tested, the number of " 1 " is marked with k in row vector, then is calculated required for check bit using the row vector XOR number be (k-1) m, wherein m be it is each participate in verification calculate original data block size;
(2) compare the element identical bits in encoder matrix between any two row vector and the number of element difference position, remember For (e/d), wherein e indicates the identical position number of element in two vectors;D indicates the position number that element is different in two vectors;
(3) if a certain row vector liXOR number required for (1≤i≤rm) is less than or equal in step B not isotopic number D, then directly according to the vector calculate the row corresponding to verification data block, and the vector is denoted as lj
(4) the vector l for utilizing (3) to determinej, according to digit identical in step B and not the ratio between isotopic number, determine next meter Row vector is calculated, as certain row vector lkWith vector ljIsotopic number is not less than identical digit, and lkWith vector ljIsotopic number is not each with remaining A vector is not when isotopic number reaches minimum, then according to vector ljThe verification data that have calculated that are calculated by lkDetermining check number According to;
(5) check bit is not calculated if still having, according to (4) computation rule, with lkFor basic vector, find next to be calculated Vector, and return to (4);
(6) complete verification position calculating process whether is had determined that, if so, check bit successively calculating process is saved, if it is not, then It is calculated according to original corresponding relationship.
Further, the storage of the inspection data includes with index process method:
(1) First, store each row, i.e. each record, into HBase with the primary key as the rowkey and the attribute name as the column family name; every column family holds exactly one column with a fixed column name, and the attribute value is stored as the cell value. Since each attribute is stored in its own column family, when a verification rule matches a certain attribute value by primary key, the irrelevant attribute values need not be read in;
(2) then build an index table with the attribute field values involved in the verification rules as the rowkey; the row-key format of the index table is {main-table indexed column value}, and the value format is {main table row key 1, main table row key 2, ...}. Each main-table row key is stored as a column name, so adding a main-table row key only requires adding one column; when a verification rule needs to match other attribute values by some attribute value, all records with the same attribute value can be found quickly and verified;
(3) a timestamp-based index table allows the data within a fixed time interval to be queried and verified quickly; the row key is the timestamp and the key value is the primary key. Full-volume data processing is the data storage and index processing performed when quality verification is carried out on the large batches of historically accumulated data; what its data storage and index processing shares with incremental data processing is that, as the incremental data and indexes are loaded into HBase by the data storage and indexing method, the attribute fields relevant to the full-volume verification rules are simultaneously extracted and stored in an HDFS index file.
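The row-key layout of steps (1)-(2) can be modelled with plain dictionaries standing in for HBase tables (a sketch only: the table and function names are assumptions, and a real deployment would go through an HBase client).

```python
main_table = {}    # rowkey = primary key -> {column_family: {column: value}}
index_table = {}   # rowkey = attribute value -> {main-table row key: ''}

def put_record(primary_key, record, indexed_attrs):
    # each attribute goes into its own single-column column family,
    # so a rule that matches one attribute never touches the others
    main_table[primary_key] = {
        attr: {attr: value} for attr, value in record.items()
    }
    # index table: attribute value as rowkey, each main-table row key
    # stored as a column name (adding a key = adding one column)
    for attr in indexed_attrs:
        index_table.setdefault(record[attr], {})[primary_key] = ''

def find_by_attr(value):
    """All main-table records sharing an indexed attribute value."""
    return [main_table[rk] for rk in index_table.get(value, {})]
```

With two records indexed on the same attribute value, `find_by_attr` fans out from one index row to both main-table rows, which is the lookup the verification rules rely on.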
Further, the data reuse query method for the frequently read and written storage comprises:
Given the data warehouse D={S1}, before loading the target table T, the SchemaMatching() algorithm and the Filter() algorithm are applied to obtain the triple M'; the relevant information of M' is then saved in the RTable table, and finally the Dataload_Reusing() algorithm completes the loading of the original data;
(1) query matching: by querying the RTable table it is found that A'4 in T has no reuse information, while A'1, A'2 and A'3 can all find reusable data in S1;
(2) query rewriting: first ensure that the query statements are equivalent. In Q1, for A'1 > const1, A'1 = A11, i.e. the two columns of data are equivalent, so the selection condition needs no change; for A'2 > const2, there is a transformation relation between the data of A'2 and A12, and whether the selection condition must change depends on the source of the target data. Let sum1 and sum2 be the record totals of T and S1 respectively. If sum1 = sum2, A'2 fully reuses A12, and the original query statement is rewritten according to f' as Q'1: SELECT A'1, A'2, A'3, A'4 FROM T WHERE A'1 > const1 AND A'2 > (const2/0.1); otherwise, when sum1 > sum2, the data come partly from the reusable data set in A12 and partly from externally imported data; for the externally imported data the selection condition remains A'2 > const2 and the Q1 query statement is unchanged, while for the reusable data the query is rewritten as Q'1;
(3) query execution: for full reuse, Q1 is executed directly; otherwise the query is decomposed by data source, the start and end data blocks of the two parts of each reuse relation being obtained from the blk_id_list corresponding to <col't, col's>; Q'1 is executed on the reusable data, while Q1 is still executed on the externally imported data;
(4) result integration: the data items satisfying the condition in the reusable data are read and each converted according to col't = f(col's); finally the converted data are integrated and the final result is output.
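A toy version of the rewrite-and-integrate flow in steps (2)-(4), under the assumption that the column transformation is f(v) = 0.1·v (matching the const2/0.1 rewrite above); the function names and data are illustrative, not from the patent.

```python
def f(v):
    # assumed transformation between the source and target columns
    return v * 0.1

def reuse_query(reusable_src, imported, const2):
    """Run the rewritten query on reusable source data and the original
    query on externally imported data, then integrate the results."""
    # Q'1 on reusable data: condition rewritten onto the source column
    hits = [f(v) for v in reusable_src if v > const2 / 0.1]
    # Q1 on externally imported data: original condition on the target column
    hits += [v for v in imported if v > const2]
    return sorted(hits)   # result integration
```

Only source values exceeding const2/0.1 survive the rewritten predicate, and after conversion they satisfy the original condition, so the integrated result matches what running Q1 over fully materialized data would return.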
In the above embodiments, implementation may be carried out in whole or in part by software, hardware, firmware or any combination thereof. When implemented in whole or in part in the form of a computer program product, the computer program product comprises one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one web site, computer, server or data center to another web site, computer, server or data center by wired (e.g. coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g. infrared, radio, microwave) means. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a server or data center, integrating one or more usable media. The usable medium may be a magnetic medium (e.g. floppy disk, hard disk, magnetic tape), an optical medium (e.g. DVD), or a semiconductor medium (e.g. a solid state disk (SSD)).
The foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (8)

1. A parallel processing big data storage method based on the in-memory computing mode, characterized in that the parallel processing big data storage method based on the in-memory computing mode combines novel storage-class memory (SCM) and DRAM into a single hybrid memory, the SCM mainly storing original data for read operations and the DRAM storing the data of frequent read and write operations;
the parallel processing big data storage method based on the in-memory computing mode comprises: addressing the novel SCM and the DRAM uniformly; storing in the SCM the original data read in from external storage; storing in the DRAM the intermediate data frequently read and written during program operation together with the check data; and writing the written intermediate data back to the SCM once it reaches a certain amount;
the data processing of the parallel processing big data storage method based on the in-memory computing mode further comprises:
(1) letting Gr·m be an arbitrary binary encoder matrix determined by "0, 1", i.e. a binary matrix composed of "0"s and "1"s, the matrix being embodied in generating redundant data;
(2) determining, according to the number of "1"s in the row vectors l1, l2, ..., lr·m of the binary encoder matrix, the number of XOR calculations required when computing the check bits from each vector, and calculating the number of differing bits between any two vectors la and lb;
(3) if the number of elements equal to "1" in vector la is k, the system needing to perform k-1 XOR operations to generate redundant data with that vector.
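Claim 1 combines two ideas: a DRAM write buffer in front of SCM, and XOR-cost accounting over the encoder matrix rows. A minimal sketch of both (class names, the flush threshold, and the dict-based "memories" are illustrative assumptions, not the patent's implementation):

```python
class HybridMemory:
    """DRAM buffers frequently written data; a full buffer flushes to SCM."""
    def __init__(self, flush_threshold=4):
        self.scm = {}                    # read-mostly original + flushed data
        self.dram = {}                   # hot intermediate / check data
        self.flush_threshold = flush_threshold

    def load_original(self, key, value):
        self.scm[key] = value            # data read in from external storage

    def write_intermediate(self, key, value):
        self.dram[key] = value           # frequent writes land in DRAM
        if len(self.dram) >= self.flush_threshold:
            self.scm.update(self.dram)   # batch write-back at "a certain amount"
            self.dram.clear()

    def read(self, key):
        return self.dram.get(key, self.scm.get(key))

def xor_cost(row):
    """k ones in a row vector -> k-1 XOR operations for its redundant block."""
    return sum(row) - 1

def differing_bits(la, lb):
    """Number of positions where two row vectors differ."""
    return sum(1 for x, y in zip(la, lb) if x != y)
```

Reads check DRAM first and fall back to SCM, so data remains visible before and after a flush; the two helpers are exactly the quantities steps (2)-(3) of the claim ask for.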
2. The parallel processing big data storage method based on the in-memory computing mode according to claim 1, characterized in that the encoding calculation optimization method for encoding the original file with the entire encoder matrix Gr·m comprises:
(1) according to the number of "1"s in each row vector of the encoder matrix Gr·m, determining the number of XOR operations required to compute the check bit from that row vector: if the number of "1"s in a row vector is k, the number of XOR operations required to compute the check bit with that row vector is (k-1)m, where m is the size of each original data block participating in the check computation;
(2) comparing, for every pair of row vectors in the encoder matrix, the number of positions where the elements are identical and the number where they differ, denoted (e/d), where e is the number of identical positions and d the number of differing positions in the two vectors;
(3) if the number of XOR operations required by some row vector li (1≤i≤r·m) is less than or equal to the differing-position count d from step (2), computing the check data block corresponding to that row directly from the vector, and denoting the vector lj;
(4) using the vector lj determined in (3) and the ratio of identical to differing positions from step (2), determining the next row vector to compute: when a row vector lk differs from lj in fewer positions than it agrees, and the number of differing positions between lk and lj is the minimum over all remaining vectors, computing the check data determined by lk from the check data already computed from lj;
(5) if check bits remain uncomputed, applying the rule of (4) with lk as the base vector to find the next vector to compute, and returning to (4);
(6) determining whether a computation process has been established for all check bits; if so, saving the sequential check-bit computation process; if not, computing according to the original correspondence.
3. The parallel processing big data storage method based on the in-memory computing mode according to claim 2, characterized in that the storage and index processing method of the check data comprises:
(1) first storing each row, i.e. each record, into HBase with the primary key as the rowkey and the attribute name as the column family name, every column family holding exactly one column with a fixed column name and the attribute value stored as the cell value; each attribute being stored in its own column family, so that when a verification rule matches a certain attribute value by primary key, the irrelevant attribute values need not be read in;
(2) then building an index table with the attribute field values involved in the verification rules as the rowkey, the row-key format of the index table being {main-table indexed column value} and the value format being {main table row key 1, main table row key 2, ...}; each main-table row key being stored as a column name, so that adding a main-table row key only requires adding one column; when a verification rule needs to match other attribute values by some attribute value, all records with the same attribute value being found quickly and verified;
(3) a timestamp-based index table allowing the data within a fixed time interval to be queried and verified quickly, the row key being the timestamp and the key value being the primary key; full-volume data processing being the data storage and index processing performed when quality verification is carried out on the large batches of historically accumulated data, and sharing with incremental data processing the feature that, as the incremental data and indexes are loaded into HBase by the data storage and indexing method, the attribute fields relevant to the full-volume verification rules are simultaneously extracted and stored in an HDFS index file.
4. The parallel processing big data storage method based on the in-memory computing mode according to claim 2, characterized in that the data reuse query method for the frequently read and written storage comprises:
given the data warehouse D={S1}, before loading the target table T, applying the SchemaMatching() algorithm and the Filter() algorithm to obtain the triple M', then saving the relevant information of M' in the RTable table, and finally completing the loading of the original data with the Dataload_Reusing() algorithm;
(1) query matching: by querying the RTable table it is found that A'4 in T has no reuse information, while A'1, A'2 and A'3 can all find reusable data in S1;
(2) query rewriting: first ensuring that the query statements are equivalent; in Q1, for A'1 > const1, A'1 = A11, i.e. the two columns of data are equivalent, so the selection condition needs no change; for A'2 > const2, there is a transformation relation between the data of A'2 and A12, and whether the selection condition must change depends on the source of the target data; sum1 and sum2 being the record totals of T and S1 respectively, if sum1 = sum2, A'2 fully reuses A12 and the original query statement is rewritten according to f' as Q'1: SELECT A'1, A'2, A'3, A'4 FROM T WHERE A'1 > const1 AND A'2 > (const2/0.1); otherwise, when sum1 > sum2, the data come partly from the reusable data set in A12 and partly from externally imported data; for the externally imported data the selection condition remains A'2 > const2 and the Q1 query statement is unchanged, while for the reusable data the query is rewritten as Q'1;
(3) query execution: for full reuse, executing Q1 directly; otherwise decomposing the query by data source, the start and end data blocks of the two parts of each reuse relation being obtained from the blk_id_list corresponding to <col't, col's>; executing Q'1 on the reusable data and still executing Q1 on the externally imported data;
(4) result integration: reading the data items satisfying the condition in the reusable data, converting each according to col't = f(col's), finally integrating the converted data and outputting the final result.
5. A parallel processing big data storage system based on the in-memory computing mode, implementing the parallel processing big data storage method based on the in-memory computing mode according to claim 1, characterized in that the parallel processing big data storage system based on the in-memory computing mode comprises:
a user program module, connected to the multi-core module, the memory module and the disk module respectively, for outputting the processed data;
a disk module, connected to the memory module, the memory module obtaining data from the disk, i.e. based on the traditional memory-disk access mode;
a memory module, connected to the multi-core module, for processing the stored data through the multi-core module.
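The module relationships of claim 5 can be wired up as a toy object graph (all class names and the sorting "workload" are illustrative assumptions): the memory module pulls data from the disk module and hands it to the multi-core module for processing, and the user program module outputs the result.

```python
class DiskModule:
    def read(self):
        return [3, 1, 2]          # stand-in for data held on disk

class MultiCoreModule:
    def process(self, data):
        return sorted(data)       # stand-in for parallel processing

class MemoryModule:
    """Connected to both the disk module (data source) and the multi-core
    module (data processing), as in the traditional memory-disk mode."""
    def __init__(self, disk, cores):
        self.disk, self.cores = disk, cores
    def load_and_process(self):
        return self.cores.process(self.disk.read())

class UserProgramModule:
    """Outputs the processed data to the user."""
    def __init__(self, memory):
        self.memory = memory
    def run(self):
        return self.memory.load_and_process()
```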
6. A computer program implementing the parallel processing big data storage method based on the in-memory computing mode according to any one of claims 1 to 4.
7. An information data processing terminal implementing the parallel processing big data storage method based on the in-memory computing mode according to any one of claims 1 to 4.
8. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to execute the parallel processing big data storage method based on the in-memory computing mode according to any one of claims 1 to 4.
CN201810826423.7A 2018-07-25 2018-07-25 A kind of parallel processing big data storage system and method calculating mode based on memory Pending CN108920110A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810826423.7A CN108920110A (en) 2018-07-25 2018-07-25 A kind of parallel processing big data storage system and method calculating mode based on memory


Publications (1)

Publication Number Publication Date
CN108920110A true CN108920110A (en) 2018-11-30

Family

ID=64416718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810826423.7A Pending CN108920110A (en) 2018-07-25 2018-07-25 A kind of parallel processing big data storage system and method calculating mode based on memory

Country Status (1)

Country Link
CN (1) CN108920110A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6599132B1 (en) * 1999-11-30 2003-07-29 The Board Of Trustees Of The Leland Stanford Junior University Scanning capacitance sample preparation technique
CN101349979A (en) * 2008-09-05 2009-01-21 清华大学 Method for updating double-magnetic head user data of large scale fault-tolerant magnetic disk array storage system
CN103838649A (en) * 2014-03-06 2014-06-04 中国科学院成都生物研究所 Method for reducing calculation amount in binary coding storage system
CN105446899A (en) * 2015-11-09 2016-03-30 上海交通大学 Memory data quick persistence method based on storage-class memory
CN105930356A (en) * 2016-04-08 2016-09-07 上海交通大学 Method for implementing log type heterogeneous hybrid memory file system
CN105938458A (en) * 2016-04-13 2016-09-14 上海交通大学 Software-defined heterogeneous hybrid memory management method


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Zhang Zhiliang et al.: "Hadoop-based power grid data quality verification method and verification system", Journal of Computer Research and Development *
Wang Mei et al.: "A data reuse strategy in column-store data warehouses", Chinese Journal of Computers *
Luo Le et al.: "A survey of in-memory computing technology", Journal of Software *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110968269A (en) * 2019-11-18 2020-04-07 华中科技大学 SCM and SSD-based key value storage system and read-write request processing method
WO2021169635A1 (en) * 2020-02-27 2021-09-02 华为技术有限公司 Data processing method for memory device, apparatus, and system
CN112308328A (en) * 2020-11-09 2021-02-02 中国科学院计算技术研究所 Top-Down network measurement system-oriented parallel measurement task optimization method and system
CN112308328B (en) * 2020-11-09 2023-06-06 中国科学院计算技术研究所 Top-Down network measurement system-oriented parallel measurement task optimization method and system

Similar Documents

Publication Publication Date Title
Chambi et al. Better bitmap performance with roaring bitmaps
US10296498B2 (en) Coordinated hash table indexes to facilitate reducing database reconfiguration time
CN104794123B (en) A kind of method and device building NoSQL database indexes for semi-structured data
US10353923B2 (en) Hadoop OLAP engine
CN103177027B (en) Obtain the method and system of dynamic Feed index
Kepner et al. Achieving 100,000,000 database inserts per second using Accumulo and D4M
CN103812939B (en) Big data storage system
US20130191523A1 (en) Real-time analytics for large data sets
Sethi et al. RecShard: statistical feature-based memory optimization for industry-scale neural recommendation
CN112269792B (en) Data query method, device, equipment and computer readable storage medium
CN108595664B (en) Agricultural data monitoring method in hadoop environment
CN108920110A (en) A kind of parallel processing big data storage system and method calculating mode based on memory
CN106570113B (en) Mass vector slice data cloud storage method and system
CN104036029A (en) Big data consistency comparison method and system
Mizrahi et al. Blockchain state sharding with space-aware representations
Hu et al. Trix: Triangle counting at extreme scale
CN105706092A (en) Methods and systems of four-valued simulation
CN111104457A (en) Massive space-time data management method based on distributed database
Hashem et al. An Integrative Modeling of BigData Processing.
Xiong et al. HaDaap: a hotness‐aware data placement strategy for improving storage efficiency in heterogeneous Hadoop clusters
CN110019299A (en) A kind of method and apparatus for creating or refreshing the off-line data set of analytic type data warehouse
Purdilă et al. Single‐scan: a fast star‐join query processing algorithm
Ma et al. Blockchain retrieval model based on elastic bloom filter
Bai et al. An efficient skyline query algorithm in the distributed environment
Kanojia et al. IT Infrastructure for Smart City: Issues and Challenges in Migration from Relational to NoSQL Databases

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181130