Database-based method for fast parsing and storage of big data
Technical field
The present invention relates to computer data processing, and in particular to a method for the fast
parsing and after-the-fact analysis of big data.
Background art
Many engineering projects must process large volumes of data produced at high speed. If these data
cannot be processed in time, the consequences for the whole software system can be fatal. Existing
big data parsing schemes essentially handle only fixed formats, so their extensibility and
compatibility are poor. On a single computer their processing speed is insufficient, because they do
not make full use of existing techniques to exploit the machine's processing capacity. They therefore
need distributed computing, which places high demands on network speed and stability. They also
provide no efficient, information-complete data storage method, so extracting the data for later
processing is inflexible.
A search found that application No. 200810097594.7 discloses a large-volume data processing method
and system. It addresses the problem that large volumes of data cannot be processed within the
specified time, which delays processing and eventually crashes the system. The method includes:
assigning a server according to the naming rule of the original file and splitting the original file
into small files; and, for each small file after splitting, assigning a server again according to the
small-file naming rule and processing the small file. That invention can deploy multiple servers to
split and process large data files simultaneously, which greatly improves the processing capacity of
the system and ensures that files are processed within the specified time. The system also scales
well: when files become larger or more numerous, demand can be met simply by adding servers, that is,
it scales linearly, with no need to buy higher-end servers or to redeploy servers already in
operation. It nevertheless has the following technical defects:
1. Generality: the files in the above patent must have a preset data format, so extensibility and
generality are poor;
2. Processing efficiency: much of the design logic in the above patent concerns how to split and
recombine files, and the corresponding disk reads and writes and network transfers consume
considerable time;
3. Data integrity: the above patent uses a client-server model and requires distributed computing,
yet a single client does not make full use of computer resources when processing data. Moreover, if
the network connection fails, unpredictable delays or data loss may result, with serious
consequences.
Summary of the invention
In view of the technical problems in the above prior art, the present invention provides a
database-based method for fast parsing and storage of big data. It provides a data format definition
method, yielding a general-purpose data parsing method; a method for processing data quickly, so that
large volumes of data are handled efficiently per unit time; and an efficient data storage method
that facilitates later extraction and analysis.
To achieve the above object, the present invention adopts the following technical solution:
A database-based method for fast parsing and storage of big data, comprising the following steps:
Step 1: according to the data content actually to be analysed, define the data format in a file;
Step 2: read the data format file and construct the data structure in memory, as the basis for data
parsing;
Step 3: preparation before data parsing: cut the raw data into data blocks of a specified size and
create a specified number of threads, saving the thread identifiers in a list that is treated as a
thread pool; for each thread, create a list for holding its parsing results;
Step 4: data parsing: distribute the data blocks cut in step 3 to idle threads in the thread pool,
assigning each of these threads a sequence number according to the order of the data blocks; after
taking its data, each thread parses it by matching against the ordered data item list in the data
structure obtained in step 2, and stores the results in the result list created for it in step 3;
Step 5: data storage: create a data table T1 for storing the binary content of the serialized data
structure that corresponds to the data format definition obtained in step 2, and assign this record
a unique data definition ID; create a data table T2 associated with the data definition ID of T1, for
storing the parsing results of the thread that, in step 4, has set its processing-complete flag and
has the smallest sequence number in the thread pool;
Step 6: after the result list in step 5 has been stored, empty it, set the processing-complete flag
to false, set the thread's state to idle, and return the thread to the thread pool to wait for a new
data block.
The specific method of step 1 is: the data format is written in Extensible Markup Language (XML), so
that data of any format can be defined. For each data item in the data format, define its identifier,
name, type, length, whether it is signed, and its relative start position. If a data item can be
subdivided into several data items, define data sub-items, each defined in the same way as a data
item. For a data item whose length is not fixed, define a calculation formula.
The calculation formula may contain ordinary mathematical expressions and may refer to the lengths
of other data items, so that the actual length of the data item can be calculated dynamically from
the specific data content given.
The specific method of step 2 is: the data structure preserves the tree structure among the data
items defined in the data format definition file, and also organizes this tree into an ordered data
item list. As a result, the content of data items that contain sub-items can be extracted
conveniently after parsing, and during parsing the data can be parsed quickly by matching item by
item against the ordered list.
In step 3, if the raw data are stored on disk, they are read into memory through a file mapping
mechanism before being cut, in order to improve read speed.
In step 4, the length of every data item in each parsing result is by then determined. Every record
in the result list therefore stores the actual length information of all variable-length fields in
that record, and each result contains one binary data stream laid out in the order of the ordered
data item list. To use a data item, it is only necessary to extract data of the corresponding length
from this binary data stream according to the item's length and start position. When the data block
assigned to a thread has been fully parsed, the processing-complete flag of that thread and its list
is set to true.
The specific method of step 5 is: first create a data table T1; serialize the data structure
corresponding to the data format definition obtained in step 2 into a temporary file, store the
binary content of that file in the table, and assign the record a unique data definition ID. Then
create a data table T2 for parsing results: in T2, define a foreign key associated with the data
definition ID in T1, one field for the actual length information of all variable-length fields, and
another field for the binary data stream of each parsing result. Examine the thread information from
step 4: if a thread has set its processing-complete flag and its sequence number is the smallest in
the thread pool, store its result list in T2. When storing, set the foreign key of each record to the
ID assigned in T1 to the data structure corresponding to the data format definition obtained in step
2, and store the record's variable-length field length information and binary data stream in the
corresponding fields of T2.
The data stored in step 5 are extracted as follows: read the corresponding binary data from data
table T1 into a temporary file, deserialize the data structure definition from the temporary file,
and then take the binary data streams from data table T2 and match them in order, by data
identifier, against the data structure obtained, which yields the data content of each field.
Compared with the large-volume data processing method and system disclosed in prior art
200810097594.7, the above technical solution of the present invention has the following advantages:
1. Generality: the present invention extracts the data feature information into a dedicated file,
which users can add to, delete from and modify; the software parses data according to the rules
defined in this file, and therefore has good extensibility;
2. Processing efficiency: the present invention greatly reduces disk read and write time through the
file mapping mechanism, and processes data with multiple threads simultaneously through the thread
pool, so it is fast and efficient;
3. Data integrity: by using the thread pool and the file mapping mechanism, the present invention
makes full use of the processing capacity of a single computer, so data integrity is guaranteed.
The beneficial effects of the present invention are:
1) parsing is fast;
2) the data structure to be parsed is configurable, so the method is highly general;
3) parsing results are stored in a database with little storage redundancy;
4) the stored result data are convenient to analyse afterwards.
Brief description of the drawings
Other features, objects and advantages of the present invention will become more apparent on reading
the detailed description of non-limiting embodiments made with reference to the following drawing:
Fig. 1 is a flow chart of the method provided by the present invention.
Embodiment
The present invention is described in detail below with reference to a specific embodiment. The
following embodiment will help those skilled in the art to understand the present invention further,
but does not limit the invention in any way. It should be noted that a person of ordinary skill in
the art can make various modifications and improvements without departing from the inventive
concept. These all fall within the scope of protection of the present invention.
As shown in Fig. 1, the database-based method for fast parsing and storage of big data provided by
the present invention comprises the following steps:
Step 1: according to the data content actually to be analysed, define the data format in a file.
The data format is written in Extensible Markup Language (XML), so that data of any format can be
defined. For each data item in the data format, define its identifier, name, type, length, whether
it is signed, its relative start position, and so on. If a data item can be subdivided into several
data items, define data sub-items, each defined in the same way as a data item. For a data item
whose length is not fixed, define a calculation formula. The formula may contain ordinary
mathematical expressions and may refer to the lengths of other data items, so that the actual length
of the data item can be calculated dynamically from the specific data content given.
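As an illustration of step 1, the following is a minimal sketch of what such a definition file and a length formula evaluator might look like. The element and attribute names (format, item, id, length, signed, offset), the sample record layout and the function eval_length are all hypothetical; the description does not fix a schema. Resolving a name in a formula to the parsed value of another item (here count) is likewise an assumption.

```python
import ast
import operator
import xml.etree.ElementTree as ET

# Hypothetical definition file: one struct item with two sub-items, plus one
# variable-length item whose length is given by a formula.
FORMAT_XML = """
<format name="sample_record">
  <item id="hdr" name="Header" type="struct" offset="0">
    <item id="count" name="PayloadCount" type="uint" length="2" signed="false" offset="0"/>
    <item id="flags" name="Flags" type="uint" length="1" signed="false" offset="2"/>
  </item>
  <item id="payload" name="Payload" type="bytes" length="count * 4" offset="3"/>
</format>
"""

# Only these arithmetic operators are accepted in a formula.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.FloorDiv: operator.floordiv}

def eval_length(formula, values):
    """Evaluate a formula built from integers, + - * // and item references."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.Name):
            return values[node.id]          # reference to another data item
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported formula: {formula!r}")
    return walk(ast.parse(formula, mode="eval"))

root = ET.fromstring(FORMAT_XML)
payload = root.find("./item[@id='payload']")
# With count = 5 taken from the data, the payload occupies 5 * 4 bytes.
payload_length = eval_length(payload.get("length"), {"count": 5})
```

Walking the syntax tree instead of calling eval keeps a user-editable definition file from running arbitrary code.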
Step 2: read the data format file and construct the data structure in memory, as the basis for data
parsing.
This data structure preserves the tree structure among the data items defined in the data format
definition file, and also organizes this tree into an ordered data item list. In this way the
content of data items that contain sub-items can be extracted conveniently after parsing, and during
parsing the data can be parsed quickly by matching item by item against the ordered list.
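The dual representation in step 2 can be sketched as follows: a tree of items built from the definition, and a depth-first flattening of its leaves into the ordered list used for matching. The class DataItem, the functions build_tree and flatten, and the XML attribute names are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import xml.etree.ElementTree as ET

@dataclass
class DataItem:
    ident: str
    type: str
    length: Optional[str]                      # literal or formula text; None for containers
    children: List["DataItem"] = field(default_factory=list)

def build_tree(elem):
    """Recursively turn an <item> element into a DataItem tree node."""
    return DataItem(
        ident=elem.get("id"),
        type=elem.get("type"),
        length=elem.get("length"),
        children=[build_tree(child) for child in elem.findall("item")],
    )

def flatten(items):
    """Depth-first walk that returns the leaf items in definition order."""
    ordered = []
    for item in items:
        if item.children:
            ordered.extend(flatten(item.children))
        else:
            ordered.append(item)
    return ordered

xml_text = """
<format>
  <item id="hdr" type="struct">
    <item id="count" type="uint" length="2"/>
    <item id="flags" type="uint" length="1"/>
  </item>
  <item id="payload" type="bytes" length="count * 4"/>
</format>
"""
tree = [build_tree(e) for e in ET.fromstring(xml_text).findall("item")]
ordered_items = flatten(tree)   # leaves only: count, flags, payload
```

The tree answers "give me everything under hdr" after parsing, while the flat list drives the item-by-item matching during parsing.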
Step 3: preparation before data parsing: cut the raw data into data blocks of a specified size.
If the raw data are stored on disk, they are read into memory through a file mapping mechanism
before being cut, which improves read speed. Create a specified number of threads and save the
thread identifiers in a list; this list is treated as a thread pool. For each thread, create a list
for holding its parsing results.
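A minimal sketch of this preparation, using Python's mmap module as the file mapping mechanism. The functions split_blocks and make_pool and the slot fields (idle, seq, done, results) are hypothetical names. The sketch cuts at fixed byte offsets and assumes that block boundaries coincide with record boundaries, which the description does not spell out; the dictionaries stand in for the list of thread identifiers.

```python
import mmap
import os
import tempfile

def split_blocks(path, block_size):
    """Map the file into memory and return (sequence_number, bytes) blocks.

    Note: mmap cannot map an empty file, so the file must be non-empty.
    """
    blocks = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for seq, start in enumerate(range(0, len(mm), block_size)):
                # Slicing copies the bytes out of the mapping.
                blocks.append((seq, mm[start:start + block_size]))
    return blocks

def make_pool(n_threads):
    """A plain list of worker slots, each with its own result list."""
    return [{"id": i, "idle": True, "seq": None, "done": False, "results": []}
            for i in range(n_threads)]

# Small demonstration on a 10-byte temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(10)))
    demo_path = tmp.name
blocks = split_blocks(demo_path, block_size=4)
pool = make_pool(3)
os.remove(demo_path)
```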
Step 4: data parsing.
Distribute the data blocks cut in step 3 to idle threads in the thread pool, assigning each of these
threads a sequence number according to the order of the data blocks. After taking its data, each
thread parses it by matching against the ordered data item list in the data structure obtained in
step 2, and stores the results in the result list created for it in step 3.
At this point the length of every data item in each parsing result is determined. Every record in
the result list therefore stores the actual length information of all variable-length fields in that
record, and each result contains one binary data stream laid out in the order of the ordered data
item list. To use a data item, it is only necessary to extract data of the corresponding length from
this binary data stream according to the item's length and start position. When the data block
assigned to a thread has been fully parsed, the processing-complete flag of that thread and its list
is set to true.
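The result layout described in step 4 can be sketched as below: each parsing result is a pair of variable-length information and one binary stream, and a field is recovered by walking the ordered item list and slicing. The record layout (count, flags, payload), the names ITEMS, parse_block, extract and record are hypothetical. The sketch uses concurrent.futures.ThreadPoolExecutor in place of the hand-managed thread list of the description, and assumes every block holds whole records.

```python
from concurrent.futures import ThreadPoolExecutor

# Ordered item list: (identifier, fixed length in bytes, or None if variable).
ITEMS = [("count", 2), ("flags", 1), ("payload", None)]

def parse_block(block):
    """Match items one by one; return a list of (variable_lengths, stream)."""
    results, pos = [], 0
    while pos < len(block):
        start = pos
        count = int.from_bytes(block[pos:pos + 2], "big")
        payload_len = count * 4            # length formula for "payload"
        pos += 2 + 1 + payload_len         # count + flags + payload
        results.append(({"payload": payload_len}, block[start:pos]))
    return results

def extract(result, item_id):
    """Slice one item out of a stored stream using lengths and start offsets."""
    var_lengths, stream = result
    offset = 0
    for ident, length in ITEMS:
        size = length if length is not None else var_lengths[ident]
        if ident == item_id:
            return stream[offset:offset + size]
        offset += size
    raise KeyError(item_id)

def record(count, flags, fill):
    """Build one sample record for the demonstration."""
    return count.to_bytes(2, "big") + bytes([flags]) + bytes([fill]) * (count * 4)

# Two blocks with sequence numbers 0 and 1, parsed by two worker threads.
blocks = [(0, record(1, 7, 0xAA) + record(2, 1, 0xBB)), (1, record(0, 9, 0))]
with ThreadPoolExecutor(max_workers=2) as executor:
    parsed = dict(zip((seq for seq, _ in blocks),
                      executor.map(parse_block, (data for _, data in blocks))))
```

Keeping each result as one opaque stream plus a small length table is what lets a single T2 row shape serve any data format.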
Step 5: data storage.
First create a data table T1; serialize the data structure corresponding to the data format
definition obtained in step 2 into a temporary file, store the binary content of that file in the
table, and assign the record a unique data definition ID. Then create a data table T2 for parsing
results: in T2, define a foreign key associated with the data definition ID in T1, one field for the
actual length information of all variable-length fields, and another field for the binary data
stream of each parsing result. Examine the thread information from step 4: if a thread has set its
processing-complete flag and its sequence number is the smallest in the thread pool, store its
result list in T2. When storing, set the foreign key (data definition ID) of each record to the ID
assigned in T1 to the data structure corresponding to the data format definition obtained in step 2,
and store the record's variable-length field length information and binary data stream in the
corresponding fields of T2.
The database stored in this step records the data structure of the data completely, with little
redundancy in the data records. Later analysis needs no other configuration files: it is enough to
read the corresponding binary data from T1 into a temporary file, deserialize the data structure
definition from that file, and then take the binary data streams from T2 and match them in order, by
data identifier, against the data structure obtained, which yields the data content of each field.
Data extraction is therefore very simple.
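The two-table layout and the later extraction can be sketched with SQLite. This is only an illustration under assumptions: the column names (def_id, definition, var_lengths, stream), the JSON serialization and the function read_field are hypothetical, and the definition is serialized straight to bytes rather than through the temporary file mentioned in the description.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# T1: one row per data format definition, holding the serialized structure.
conn.execute("CREATE TABLE T1 (def_id INTEGER PRIMARY KEY, definition BLOB NOT NULL)")
# T2: one row per parsing result, linked to its definition by a foreign key.
conn.execute("""CREATE TABLE T2 (
    id INTEGER PRIMARY KEY,
    def_id INTEGER NOT NULL REFERENCES T1(def_id),
    var_lengths TEXT NOT NULL,
    stream BLOB NOT NULL)""")

# Ordered item list: [identifier, fixed length or null if variable].
definition = [["count", 2], ["flags", 1], ["payload", None]]
cur = conn.execute("INSERT INTO T1 (definition) VALUES (?)",
                   (json.dumps(definition).encode("utf-8"),))
def_id = cur.lastrowid

# One parsing result: variable-length info plus the binary stream.
results = [({"payload": 4}, b"\x00\x01\x07\xaa\xaa\xaa\xaa")]
conn.executemany(
    "INSERT INTO T2 (def_id, var_lengths, stream) VALUES (?, ?, ?)",
    [(def_id, json.dumps(v), s) for v, s in results])
conn.commit()

def read_field(conn, row_id, item_id):
    """Rebuild the definition from T1 and slice one field out of a T2 stream."""
    blob, var_json, stream = conn.execute(
        "SELECT T1.definition, T2.var_lengths, T2.stream "
        "FROM T2 JOIN T1 ON T1.def_id = T2.def_id WHERE T2.id = ?",
        (row_id,)).fetchone()
    var_lengths, offset = json.loads(var_json), 0
    for ident, length in json.loads(blob.decode("utf-8")):
        size = length if length is not None else var_lengths[ident]
        if ident == item_id:
            return stream[offset:offset + size]
        offset += size
    raise KeyError(item_id)
```

Because the definition travels with the data in T1, read_field needs nothing outside the database to interpret a stored stream.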
Step 6: after a thread's result list has been stored, empty the list, set the processing-complete
flag to false, set the thread's state to idle, and return the thread to the thread pool to wait for
a new data block.
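The ordering rule of step 5 (store only the finished thread with the smallest sequence number) and the recycling of step 6 can be sketched together. The function flush_ready, the store callback and the slot fields are hypothetical. The sketch is single-threaded; a real implementation would need to guard the slots with a lock.

```python
def flush_ready(pool, store):
    """Store finished result lists in sequence order, then recycle the slots."""
    while True:
        busy = [slot for slot in pool if not slot["idle"]]
        if not busy:
            return
        first = min(busy, key=lambda slot: slot["seq"])
        if not first["done"]:
            return                       # the earliest block is still being parsed
        store(first["seq"], list(first["results"]))
        first["results"].clear()         # empty the result list
        first["done"] = False            # reset the processing-complete flag
        first["seq"] = None
        first["idle"] = True             # return the slot to the pool

# Demonstration: blocks 0 and 1 are finished, block 2 is still in progress.
pool = [
    {"id": 0, "idle": False, "seq": 1, "done": True, "results": ["b"]},
    {"id": 1, "idle": False, "seq": 0, "done": True, "results": ["a"]},
    {"id": 2, "idle": False, "seq": 2, "done": False, "results": []},
]
stored = []
flush_ready(pool, lambda seq, rows: stored.append((seq, rows)))
```

Storing only the smallest finished sequence number keeps the rows of T2 in the same order as the raw data, even when later blocks finish first.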
A specific embodiment of the present invention has been described above. It should be understood
that the invention is not limited to the particular implementation described; those skilled in the
art can make various variations or modifications within the scope of the claims without affecting
the substance of the present invention.