CN104615736B

CN104615736B - Database-based method for fast parsing and storage of big data

Info

Publication number: CN104615736B
Application number: CN201510070607.1A
Authority: CN
Inventors: 彭成志; 刘钧钧; 咸峰
Original assignee: Upper Seabird Scape Computer System Co Ltd
Current assignee: Shanghai Chuangjing Information Technology Co Ltd
Priority date: 2015-02-10
Filing date: 2015-02-10
Publication date: 2017-10-27
Anticipated expiration: 2035-02-10
Also published as: CN104615736A

Abstract

A database-based method for fast parsing and storage of big data, comprising: Step 1: according to the data content actually to be parsed, defining the data format in a file; Step 2: reading the data format from the file and constructing a data structure in memory, to serve as the basis for data parsing; Step 3: preparation before data parsing; Step 4: data parsing; Step 5: data storage; Step 6: after the list in step 5 has been stored, emptying the thread's result list, setting its processing-complete flag to false, setting its state to idle, and returning it to the thread pool to wait for a new data block to be allocated. The method provided by the invention parses quickly; the data structure to be parsed is configurable, giving strong generality; the parsing results are stored in a database with little storage redundancy; and the stored result data are convenient to analyze afterwards.

Description

Database-based method for fast parsing and storage of big data

Technical field

The present invention relates to computer data processing, and in particular to solving the problems of fast parsing and after-the-fact analysis of big data.

Background art

Many engineering projects need to process large volumes of data generated at high speed. If these data cannot be processed in time, the effect on the whole software system can be fatal. Existing schemes for parsing big data essentially all handle fixed formats, so their extensibility and compatibility are very poor. On a single computer their processing speed is not fast enough, because they do not make full use of existing techniques to exploit the computer's processing capacity as far as possible. They therefore require distributed computing, which places high demands on network speed and stability. They also provide no efficient, information-complete method of data storage, so extracting the data for later processing is inflexible.

A search found that Application No. 200810097594.7 discloses a method and system for processing large volumes of data, which addresses the problem that large volumes of data cannot be processed within a specified time, causing processing delays and eventually a system crash. The method comprises: allocating a server according to the original file's naming rule and splitting the original file into small files; and, for each small file after splitting, allocating a server again according to the small file's naming rule and processing the split small file. That invention can deploy multiple servers to split and process large data files simultaneously, greatly improving the processing capacity of the system and ensuring that the system finishes processing the files within the stipulated time. Moreover, the system is highly scalable: when files become larger or more numerous, demand can be met simply by adding servers, that is, it scales linearly, with no need to buy higher-end servers or to redeploy servers already in operation. However, it has the following technical shortcomings:

1. Generality: the files in the above patent must have a preset data format, so extensibility and generality are poor;

2. Processing efficiency: the design logic of the above patent is largely concerned with how to split and combine files, and the corresponding disk reads and writes and network transmission consume a great deal of time;

3. Data integrity: the above patent uses a client-server model and requires distributed computing, yet each individual client does not make full use of the computer's resources when processing data, and if the network connection fails, unpredictable delays or data loss may result, with serious consequences.

Summary of the invention

In view of the technical problems in the prior art described above, the present invention provides a database-based method for fast parsing and storage of big data. It provides a data format definition method, enabling a general-purpose data parsing method; a method of processing data quickly, so that massive data are processed efficiently per unit time; and an efficient data storage method that facilitates later extraction and analysis.

To achieve the above object, the technical solution adopted by the invention is as follows:

A database-based method for fast parsing and storage of big data, comprising the following steps:

Step 1: according to the data content actually to be parsed, define the data format in a file;

Step 2: read the data format from the file and construct a data structure in memory, to serve as the basis for data parsing;

Step 3: preparation before data parsing: cut the raw data into data blocks of a specified size, create a specified number of threads, and save the thread identifiers in a list, which is regarded as a thread pool; for each thread, create a list for saving its parsing results;

Step 4: data parsing: allocate the data blocks cut in step 3 to idle threads in the thread pool, and at the same time assign sequence numbers to these threads according to the order of the data blocks; after receiving its data, each thread parses it by matching against the ordered data item list in the data structure obtained in step 2, and stores the parsing results in the result list created for it in step 3;

Step 5: data storage: create a data table T1 for storing the binary content of the serialized data structure corresponding to the data format definition obtained in step 2, and assign this record a unique data definition ID; create a data table T2, associated with the data definition ID of T1, for storing the parsing results of the thread that, in step 4, has set its processing-complete flag and has the smallest sequence number in the thread pool;

Step 6: after the list in step 5 has been stored, empty the thread's result list, set its processing-complete flag to false, then set its state to idle and return it to the thread pool to wait for a new data block to be allocated.

The specific method of step 1 is: write the data format in Extensible Markup Language (XML), so that data of any format can be defined. For each data item in the data format, define its identifier, name, type, length, whether it is signed, and its relative start position. If a data item can be subdivided into several data items, define data sub-items, each defined in the same way as a data item. For a data item of indeterminate length, define a calculation formula.

The calculation formula may include ordinary mathematical expressions, or references to the lengths of other data items, so that the actual length of the data item can be calculated dynamically from the specific data content given.
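As an illustration of step 1, the following is a minimal sketch of how a length formula for a variable-length data item might be evaluated. The patent does not specify an XML schema, so the element and attribute names (`item`, `id`, `length`, `length_formula`, `signed`, `offset`) are hypothetical. The sketch also assumes that a name inside a formula resolves to the parsed integer value of an earlier item, such as a count or length field; that reading is an assumption, not something the text states.

```python
import ast
import operator
import xml.etree.ElementTree as ET

# Hypothetical format definition: 'payload' has no fixed length; its length
# is computed from the parsed value of the earlier 'count' item.
FORMAT_XML = """
<format>
  <item id="count" name="Record count" type="uint" length="2" signed="false" offset="0"/>
  <item id="payload" name="Payload" type="bytes" length_formula="count * 4 + 2" offset="2"/>
</format>
"""

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.FloorDiv: operator.floordiv}


def eval_formula(formula, values):
    """Evaluate an integer arithmetic formula; names look up other items."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.Name):
            return values[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(formula, mode="eval").body)


def item_lengths(xml_text, data):
    """Return the actual byte length of every item for the given data."""
    values, lengths, pos = {}, {}, 0
    for item in ET.fromstring(xml_text):
        formula = item.get("length_formula")
        length = eval_formula(formula, values) if formula else int(item.get("length"))
        if item.get("type") == "uint":
            # Remember integer values so later formulas can refer to them.
            values[item.get("id")] = int.from_bytes(data[pos:pos + length], "big")
        lengths[item.get("id")] = length
        pos += length
    return lengths
```

Walking the expression tree by hand, rather than calling `eval`, keeps a format file from executing arbitrary code.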

The specific method of step 2 is: the data structure preserves the tree structure among the data items defined in the data format definition file, and also organizes this tree into an ordered list of data items. After the data have been parsed, the content of a data item that contains sub-items can thus be extracted easily, and during parsing the data can be parsed quickly by matching item by item against the ordered list.
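The dual representation described for step 2 can be sketched as a tree of items plus a depth-first flattening of its leaves. The class and field names here (`DataItem`, `ident`, `children`) are illustrative assumptions; the patent names no concrete types.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataItem:
    ident: str
    length: int = 0  # length in bytes; 0 for items that only group sub-items
    children: List["DataItem"] = field(default_factory=list)


def flatten(item):
    """Depth-first list of leaf items, in the order they occur in the data."""
    if not item.children:
        return [item]
    leaves = []
    for child in item.children:
        leaves.extend(flatten(child))
    return leaves


# 'header' is a data item with two sub-items; 'body' is a plain item.
header = DataItem("header", children=[DataItem("version", 1), DataItem("flags", 1)])
root = DataItem("record", children=[header, DataItem("body", 4)])
ordered = flatten(root)
```

The tree is kept so that a composite item such as `header` can later be extracted as a whole, while `ordered` drives the item-by-item matching during parsing.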

In step 3, if the raw data are stored on disk, the data are read into memory through a file-mapping mechanism before being cut, to improve the data reading speed.
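A minimal sketch of cutting a disk file into fixed-size blocks through a memory mapping, using Python's standard `mmap` module as a stand-in for the file-mapping mechanism the patent mentions. The function name `iter_blocks` and the block size are assumptions. The source does not describe how block boundaries are aligned with record boundaries, so this sketch cuts purely by size.

```python
import mmap
import os
import tempfile


def iter_blocks(path, block_size):
    """Yield (sequence_number, bytes) blocks of a file via a memory mapping."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; mapping an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for seq, start in enumerate(range(0, len(mm), block_size)):
                yield seq, mm[start:start + block_size]


# Usage: a 10-byte temporary file cut into 4-byte blocks.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(10)))
blocks = list(iter_blocks(tmp.name, 4))
os.remove(tmp.name)
```

The sequence number attached to each block is what later allows results to be stored in the original order of the data.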

In step 4, the length of every data item in each parsing result is by now determined, so each record in the result list holds the actual lengths of all variable-length fields in that record, and each result contains one binary data stream laid out according to the ordered data item list of the data definition. To use a data item, it is only necessary to extract data of the corresponding length from this binary stream according to the item's length and start position. When the data block allocated to a thread has been fully parsed, the processing-complete flag of the thread and its list is set to true.
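The extraction described above can be sketched as follows: given the ordered item list, the recorded lengths of the variable-length fields, and the binary stream, any item is recovered by accumulating lengths up to its start position. The names `ITEMS`, `extract`, and the sample record are hypothetical.

```python
from typing import List, Optional, Tuple

# Ordered item list: (identifier, fixed length in bytes, or None if variable).
ITEMS: List[Tuple[str, Optional[int]]] = [("count", 2), ("payload", None), ("crc", 1)]


def extract(stream, var_lengths, wanted):
    """Slice one data item out of a parsing result's binary stream."""
    pos = 0
    for ident, fixed in ITEMS:
        length = fixed if fixed is not None else var_lengths[ident]
        if ident == wanted:
            return stream[pos:pos + length]
        pos += length
    raise KeyError(wanted)


# One parsing result: its binary stream plus its variable-length field sizes.
result_stream = bytes([0, 3, 0xAA, 0xBB, 0xCC, 0x7F])
result_var_lengths = {"payload": 3}
```

Storing only the stream and the variable lengths, rather than one column per field, is what keeps each result record compact.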

The specific method of step 5 is: first create a data table T1, serialize the data structure corresponding to the data format definition obtained in step 2 into a temporary file, store the binary content of that file in the table, and assign this record a unique data definition ID. Then create a data table T2 for parsing results: in T2, define a foreign key associated with the data definition ID in T1, a field holding the actual lengths of all variable-length fields, and a field holding the binary data stream of each parsing result. Examine the thread information from step 4: if a thread has set its processing-complete flag and its sequence number is the smallest in the thread pool, store its parsing result list in data table T2. When storing, set the foreign key of each record to the ID assigned in T1 to the data structure corresponding to the data format definition obtained in step 2, and store the record's variable-length field lengths and binary data stream in the corresponding fields of T2.
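A minimal sketch of the two-table layout, using SQLite and JSON purely as stand-ins: the patent names neither a database product nor a serialization format, and the column names below are assumptions. The sketch serializes the definition directly to bytes and skips the temporary file mentioned in the text.

```python
import json
import sqlite3


def create_tables(conn):
    conn.execute("PRAGMA foreign_keys = ON")
    # T1: one row per serialized data structure definition.
    conn.execute("""CREATE TABLE T1 (
        definition_id INTEGER PRIMARY KEY,
        definition    BLOB NOT NULL)""")
    # T2: one row per parsing result, linked to its definition by foreign key.
    conn.execute("""CREATE TABLE T2 (
        id            INTEGER PRIMARY KEY,
        definition_id INTEGER NOT NULL REFERENCES T1(definition_id),
        var_lengths   TEXT NOT NULL,
        stream        BLOB NOT NULL)""")


def store_definition(conn, items):
    """Serialize the item list and return its unique data definition ID."""
    blob = json.dumps(items).encode("utf-8")
    return conn.execute("INSERT INTO T1 (definition) VALUES (?)", (blob,)).lastrowid


def store_results(conn, definition_id, results):
    """Store one thread's result list: (variable lengths, binary stream) pairs."""
    conn.executemany(
        "INSERT INTO T2 (definition_id, var_lengths, stream) VALUES (?, ?, ?)",
        [(definition_id, json.dumps(lengths), stream) for lengths, stream in results])
    conn.commit()


conn = sqlite3.connect(":memory:")
create_tables(conn)
def_id = store_definition(conn, [["count", 2], ["payload", None]])
store_results(conn, def_id, [({"payload": 3}, bytes([0, 3, 1, 2, 3]))])
```

Because the definition is stored once in T1 and referenced by ID, result rows in T2 carry no per-field metadata.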

The method of extracting the data stored in step 5 is: read the corresponding binary data from data table T1 into a temporary file, deserialize the temporary file to obtain the data structure definition, then take the binary data streams from data table T2 and match them in order, by data identifier, against the obtained data structure to obtain the data content of each field.
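The extraction path can be sketched as a round trip: rebuild the definition from T1, then decode each T2 stream against it. As before, SQLite, JSON, and the column names are assumptions made for illustration, the sample rows stand in for the output of step 5, and the temporary file is omitted.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T1 (definition_id INTEGER PRIMARY KEY, definition BLOB)")
conn.execute("CREATE TABLE T2 (definition_id INTEGER REFERENCES T1(definition_id), "
             "var_lengths TEXT, stream BLOB)")
# Sample rows: one definition with a fixed and a variable item, one result.
conn.execute("INSERT INTO T1 VALUES (1, ?)",
             (json.dumps([["count", 2], ["payload", None]]).encode("utf-8"),))
conn.execute("INSERT INTO T2 VALUES (1, ?, ?)",
             (json.dumps({"payload": 3}), bytes([0, 3, 1, 2, 3])))


def load_records(conn, definition_id):
    """Rebuild the item list from T1, then decode every T2 stream with it."""
    (blob,) = conn.execute("SELECT definition FROM T1 WHERE definition_id = ?",
                           (definition_id,)).fetchone()
    items = json.loads(blob.decode("utf-8"))
    records = []
    rows = conn.execute("SELECT var_lengths, stream FROM T2 WHERE definition_id = ?",
                        (definition_id,))
    for var_lengths, stream in rows:
        lengths, pos, record = json.loads(var_lengths), 0, {}
        for ident, fixed in items:
            length = fixed if fixed is not None else lengths[ident]
            record[ident] = stream[pos:pos + length]
            pos += length
        records.append(record)
    return records


records = load_records(conn, 1)
```

No external configuration file is consulted: everything needed to decode a stream is read back from the database itself.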

Compared with the method and system for processing large volumes of data disclosed in prior-art application 200810097594.7, the above technical solution of the invention has the following advantages:

1. Generality: the invention extracts the data characteristic information into a dedicated file, which users can edit to add, delete, or modify data characteristic descriptions; the software parses data according to the rules set out in this file, and is therefore highly extensible;

2. Processing efficiency: the invention greatly reduces disk read and write time through the file-mapping mechanism, and processes data on multiple threads concurrently through the thread pool, so it is fast and efficient;

3. Data integrity: by using the thread pool and the file-mapping mechanism, the invention makes full use of the processing capacity of a single computer, so data integrity is assured.

The beneficial effects of the invention are:

1) parsing is fast;

2) the data structure to be parsed is configurable, giving strong generality;

3) parsing results are stored in a database with little storage redundancy;

4) the stored result data are convenient to analyze afterwards.

Brief description of the drawings

Other features, objects, and advantages of the invention will become more apparent on reading the detailed description of a non-limiting embodiment made with reference to the following drawing:

Fig. 1 is a flow chart of the method provided by the invention.

Detailed description of the embodiment

The invention is described in detail below with reference to a specific embodiment. The following embodiment will help those skilled in the art to understand the invention further, but does not limit the invention in any way. It should be noted that those of ordinary skill in the art may make various modifications and improvements without departing from the inventive concept, all of which fall within the scope of protection of the invention.

As shown in Fig. 1, the database-based method for fast parsing and storage of big data provided by the invention comprises the following steps:

Step 1: according to the data content actually to be parsed, define the data format in a file.

The data format is written in Extensible Markup Language (XML), so that data of any format can be defined. For each data item in the data format, its identifier, name, type, length, signedness, relative start position, and so on are defined. If a data item can be subdivided into several data items, data sub-items are defined, each in the same way as a data item. For a data item of indeterminate length, a calculation formula is defined; the formula may include ordinary mathematical expressions as well as references to the lengths of other data items, so that the actual length of the data item can be calculated dynamically from the specific data content given.

Step 2: read the data format from the file and construct a data structure in memory, to serve as the basis for data parsing.

This data structure preserves the tree structure among the data items defined in the data format definition file, and also organizes this tree into an ordered list of data items. In this way the content of a data item that contains sub-items can be extracted easily after parsing, and during parsing the data can be parsed quickly by matching item by item against the ordered list.

Step 3: preparation before data parsing: cut the raw data into data blocks of a specified size.

If the raw data are stored on disk, the data are read into memory through a file-mapping mechanism before being cut, which improves the data reading speed. A specified number of threads is created and their identifiers are saved in a list, which is regarded as a thread pool. For each thread, a list for saving its parsing results is created.

Step 4: data parsing.

The data blocks cut in step 3 are allocated to idle threads in the thread pool, and at the same time sequence numbers are assigned to these threads according to the order of the data blocks. After receiving its data, each thread parses it by matching against the ordered data item list in the data structure obtained in step 2, and stores the parsing results in the result list created for it in step 3.

At this point the length of every data item in each parsing result is determined, so each record in the result list holds the actual lengths of all variable-length fields in that record, and each result contains one binary data stream laid out according to the ordered data item list of the data definition. To use a data item, it is only necessary to extract data of the corresponding length from this binary stream according to the item's length and start position. When the data block allocated to a thread has been fully parsed, the processing-complete flag of the thread and its list is set to true.

Step 5: data storage.

First a data table T1 is created; the data structure corresponding to the data format definition obtained in step 2 is serialized into a temporary file, the binary content of that file is stored in the table, and this record is assigned a unique data definition ID. Then a data table T2 is created for parsing results: in T2, a foreign key associated with the data definition ID in T1 is defined, together with a field holding the actual lengths of all variable-length fields and a field holding the binary data stream of each parsing result. The thread information from step 4 is examined: if a thread has set its processing-complete flag and its sequence number is the smallest in the thread pool, its parsing result list is stored in data table T2. When storing, the foreign key (data definition ID) of each record is set to the ID assigned in T1 to the data structure corresponding to the data format definition obtained in step 2, and the record's variable-length field lengths and binary data stream are stored in the corresponding fields of T2.

The database stored in this step records the complete data structure of the data with little redundancy in the data records, and later analysis does not rely on any other configuration file. It is only necessary to read the corresponding binary data from data table T1 into a temporary file, deserialize the temporary file to obtain the data structure definition, and then take the binary data streams from table T2 and match them in order, by data identifier, against the obtained data structure to obtain the data content of each field. Data extraction is therefore very simple.

Step 6: after a thread's parsing result list has been stored, empty its result list, set its processing-complete flag to false, then set its state to idle and return it to the thread pool to wait for a new data block to be allocated.
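The interplay of steps 3 to 6 (parallel parsing, storage in block order, and reuse of workers) can be approximated with Python's standard `concurrent.futures.ThreadPoolExecutor`. This is a substitute technique, not the hand-managed thread list, flags, and states the patent describes: the executor recycles idle workers itself, and `map()` returns results in submission order, which plays the role of the "smallest sequence number first" rule. `parse_block` is a hypothetical stand-in parser.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_block(block):
    """Stand-in parser: split a block into 2-byte records."""
    return [block[i:i + 2] for i in range(0, len(block), 2)]


def parse_and_store(blocks, store, workers=4):
    """Parse blocks in parallel; hand result lists to `store` in block order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so the block with the
        # smallest sequence number is always stored first, even if a later
        # block finishes parsing earlier.
        for seq, results in enumerate(pool.map(parse_block, blocks)):
            store(seq, results)


stored = []
parse_and_store([b"aabb", b"ccdd", b"ee"],
                lambda seq, res: stored.append((seq, res)))
```

In a fuller version, `store` would write each result list to table T2 instead of appending to an in-memory list.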

A specific embodiment of the invention has been described above. It should be understood that the invention is not limited to the particular embodiment described; those skilled in the art may make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Claims

1. A database-based method for fast parsing and storage of big data, characterized in that it comprises the following steps:

Step 3: preparation before data parsing: cutting the raw data into data blocks of a specified size, creating a specified number of threads, and saving the thread identifiers in a list, which is regarded as a thread pool; for each thread, creating a list for saving its parsing results;

Step 4: data parsing: allocating the data blocks cut in step 3 to idle threads in the thread pool, and at the same time assigning sequence numbers to these threads according to the order of the data blocks; after receiving its data, each thread parsing it by matching against the ordered data item list in the data structure obtained in step 2, and storing the parsing results in the result list created for it in step 3;

Step 5: data storage: creating a data table T1 for storing the binary content of the serialized data structure corresponding to the data format definition obtained in step 2, and assigning this record a unique data definition ID; creating a data table T2, associated with the data definition ID of T1, for storing the parsing results of the thread that, in step 4, has set its processing-complete flag and has the smallest sequence number in the thread pool;

Step 6: after the list in step 5 has been stored, emptying the thread's result list, setting its processing-complete flag to false, then setting its state to idle and returning it to the thread pool to wait for a new data block to be allocated.

2. The database-based method for fast parsing and storage of big data according to claim 1, characterized in that the specific method of step 1 is: writing the data format in Extensible Markup Language (XML), so that data of any format can be defined; for each data item in the data format, defining its identifier, name, type, length, whether it is signed, and its relative start position; if a data item can be subdivided into several data items, defining data sub-items, each defined in the same way as a data item; and, for a data item of indeterminate length, defining a calculation formula.

3. The database-based method for fast parsing and storage of big data according to claim 2, characterized in that the calculation formula includes ordinary mathematical expressions, or references to the lengths of other data items, so that the actual length of the data item can be calculated dynamically from the specific data content given.

4. The database-based method for fast parsing and storage of big data according to claim 1, characterized in that the specific method of step 2 is: the data structure preserves the tree structure among the data items defined in the data format definition file, and also organizes this tree into an ordered list of data items, so that the content of a data item that contains sub-items can be extracted easily after the data have been parsed, and the data can be parsed quickly during parsing by matching item by item against the ordered list.

5. The database-based method for fast parsing and storage of big data according to claim 1, characterized in that, in step 3, if the raw data are stored on disk, the data are read into memory through a file-mapping mechanism before being cut, to improve the data reading speed.

6. The database-based method for fast parsing and storage of big data according to claim 1, characterized in that, in step 4, the length of every data item in each parsing result is by now determined, so each record in the result list holds the actual lengths of all variable-length fields in that record, and each result contains one binary data stream laid out according to the ordered data item list of the data definition; to use a data item, it is only necessary to extract data of the corresponding length from this binary stream according to the item's length and start position; and when the data block allocated to a thread has been fully parsed, the processing-complete flag of the thread and its list is set to true.

7. The database-based method for fast parsing and storage of big data according to claim 1, characterized in that the specific method of step 5 is: first creating a data table T1, serializing the data structure corresponding to the data format definition obtained in step 2 into a temporary file, storing the binary content of that file in the table, and assigning this record a unique data definition ID; then creating a data table T2 for parsing results, and in T2 defining a foreign key associated with the data definition ID in T1, a field holding the actual lengths of all variable-length fields, and a field holding the binary data stream of each parsing result; examining the thread information from step 4, and, if a thread has set its processing-complete flag and its sequence number is the smallest in the thread pool, storing its parsing result list in data table T2; and, when storing, setting the foreign key of each record to the ID assigned in T1 to the data structure corresponding to the data format definition obtained in step 2, and storing the record's variable-length field lengths and binary data stream in the corresponding fields of T2.

8. The database-based method for fast parsing and storage of big data according to claim 7, characterized in that the method of extracting the data stored in step 5 is: reading the corresponding binary data from data table T1 into a temporary file, deserializing the temporary file to obtain the data structure definition, and then taking the binary data streams from data table T2 and matching them in order, by data identifier, against the obtained data structure to obtain the data content of each field.