CN104615736A - Quick analysis and storage method of big data based on database

Info

Publication number: CN104615736A
Application number: CN201510070607.1A
Authority: CN
Inventors: 彭成志; 刘钧钧; 咸峰
Original assignee: Upper Seabird Scape Computer System Co Ltd
Current assignee: Shanghai Chuangjing Information Technology Co ltd
Priority date: 2015-02-10
Filing date: 2015-02-10
Publication date: 2015-05-13
Anticipated expiration: 2035-02-10
Also published as: CN104615736B

Abstract

A method for fast parsing and storage of big data based on a database comprises the following steps: 1, defining a data format in a file according to the data content actually to be parsed; 2, reading the data format from the file into a data structure constructed in memory, as the basis for data parsing; 3, making preparations before data parsing; 4, parsing the data; 5, storing the data; 6, after the result list in step 5 has been stored, clearing the result list, setting the thread's processing-complete flag to false, setting the thread's state to idle, returning the thread to the thread pool, and waiting for a new data block to be allocated. The method has the advantages that parsing is fast, the parsed data structure is configurable and highly general, data storage redundancy is low, and the stored result data are convenient to analyze afterwards.

Description

Method for fast parsing and storage of big data based on a database

Technical field

The present invention relates to microcomputer data processing, and in particular to solving the problems of fast parsing and after-the-fact analysis of big data.

Background art

Many engineering projects encounter massive amounts of data that are generated at high speed and must be processed. If these data cannot be processed in time, the effect on the whole software system can be fatal. Existing schemes for parsing big data basically all handle fixed formats, so their compatibility and extensibility are very poor. Their processing speed on a single computer is not fast enough, because they do not make the fullest use of existing technology to exploit the computer's processing power; they require distributed computing, which places high demands on network speed and stability. They also do not provide a data storage method that is both efficient and complete in the information it records, so extracting data for later processing is inflexible.

A search shows that application No. 200810097594.7 discloses a method and system for processing large-volume data, which addresses the problem that large-volume data cannot be processed within the specified time, causing processing delays and ultimately a system crash. The method comprises: allocating servers according to a source-file naming rule and splitting each source file into small files; and, for each small file after splitting, allocating servers again according to a small-file naming rule and processing the small files. That invention can deploy multiple servers to split and process large data files simultaneously, which greatly improves the processing capacity of the system and ensures that the system finishes processing the files within the specified time. The system also has very good extensibility: when files become larger or more numerous, demand can be met simply by adding servers, that is, it scales linearly, without purchasing higher-end servers and without redeploying servers that are already running. However, it has the following technical defects:

1, Generality: the files in the above patent must have a preset data format, so extensibility and generality are poor;

2, Processing efficiency: the design logic of the above patent is how to split and reassemble large files, and the corresponding disk reads and writes and network transmission consume a great deal of time;

3, Data integrity: the above patent adopts a client-server model and requires distributed computing, yet the processing of data by a single client does not make full use of computer resources; if the network connection becomes abnormal, unpredictable delays or data loss may result, with serious consequences.

Summary of the invention

In view of the technical problems in the above prior art, the present invention provides a method for fast parsing and storage of big data based on a database. It provides a data format definition method, realizing a general-purpose data parsing method; it provides a method for processing data quickly, so that massive data are processed efficiently per unit time; and it provides an efficient data storage method that facilitates offline calculation and analysis.

To achieve the above object, the technical solution adopted by the present invention is as follows:

A method for fast parsing and storage of big data based on a database comprises the following steps:

Step 1: define a data format in a file according to the data content actually to be parsed;

Step 2: read the data format from the file into memory and construct a data structure, as the basis for data parsing;

Step 3: preparation before data parsing: split the raw data into data blocks of a specified size; create a specified number of threads and save the identifiers of these threads in a list, which is regarded as a thread pool; for each thread, create a list for saving its parsing results;

Step 4: data parsing: allocate the data blocks split in step 3 to idle threads in the thread pool, and at the same time assign sequence numbers to these threads according to the order of the data blocks; after receiving its data, each thread performs data matching and parsing according to the ordered data item list in the data structure obtained in step 2, and stores the results in the result list created for it in step 3;

Step 5: data storage: create a data table T1 for storing the binary content of the serialized data structure corresponding to the data format definition obtained in step 2, and assign a unique data definition ID to this record; create a data table T2, associated with the data definition ID of table T1, for storing the parsing results of any thread from step 4 whose processing-complete flag is set and whose sequence number is the smallest in the thread pool;

Step 6: after the list in step 5 has been stored, clear the thread's result list, set its processing-complete flag to false, then set its state to idle and return it to the thread pool to wait for a new data block to be allocated.

The specific method of step 1 is as follows: the data format is written in Extensible Markup Language (XML) so that data of any format can be defined. For each data item involved in the data format, its identifier, name, type, length, signedness (whether it distinguishes positive and negative values) and relative start position are defined. If a data item can be further subdivided into multiple data items, data sub-items are defined, each in the same way as a data item. For a data item whose length is not fixed, a calculation formula is defined.

The calculation formula may comprise ordinary mathematical expressions, or references to the lengths of other data items, so that the actual length of the data item can be calculated dynamically from the specific data content given.
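As an illustration of such a calculation formula, the following is a minimal sketch in Python, not the patented implementation. It assumes that a formula is a plain arithmetic expression in which names stand for the already known lengths of other data items. The function name `eval_length`, the item names and the permitted operators are all hypothetical.

```python
import ast
import operator

# Arithmetic operators permitted in a length formula (an assumed subset).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
}

def eval_length(formula, lengths):
    """Evaluate a length formula; names refer to already resolved item lengths."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.Name):
            return lengths[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported element in formula: {formula!r}")
    return walk(ast.parse(formula, mode="eval"))

# Hypothetical item lengths that are already known when the formula is evaluated.
lengths = {"header": 4, "name": 10}
payload_len = eval_length("header * 2 + name", lengths)  # 4 * 2 + 10
```

Walking the syntax tree instead of calling `eval` keeps the formula restricted to integer arithmetic, which seems a reasonable precaution for formulas read from a user-editable definition file.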

The specific method of step 2 is as follows: the data structure preserves the tree structure among the data items defined in the data format definition file, and also organizes this tree structure into an ordered data item list. After the data have been parsed, the content of a data item that contains sub-items can thus be extracted easily, and during parsing the data can be parsed quickly by matching item by item against the ordered data item list.

In step 3, if the raw data are stored on disk, the data are read into memory through a file mapping mechanism and then split, to improve data reading speed.

In step 4, because the length of every data item in each parsing result has been determined by this point, every record in the result list carries the actual length information of all variable-length fields in that record, together with one binary data stream that holds the result for each data item in the order of the ordered data item list of the data definition. When a data item is to be used, only the data of the corresponding length need be extracted from this binary data stream according to the item's length and start position. After the data block allocated to a thread has been parsed, the processing-complete flag of that thread and its list is set to true.

The specific method of step 5 is as follows: first create a data table T1, serialize the data structure corresponding to the data format definition obtained in step 2 into a temporary file, then store the binary content of that file in the table and assign a unique data definition ID to the record. Then create a data table T2 for holding the parsing results: in T2, define a foreign key associated with the data definition ID in T1, define one field that holds the actual length information of all variable-length fields, and define another field that holds the binary data stream of each parsing result. Examine the thread information from step 4: if a thread's processing-complete flag has been set and its sequence number is the smallest in the thread pool, store its parsing result list in table T2. When storing, set the foreign key of each record to the ID assigned in T1 to the data structure corresponding to the data format definition obtained in step 2, and store the record's variable-length field length information and binary data stream in the corresponding fields of T2.

The method for extracting the data stored in step 5 is as follows: simply read the corresponding binary data from table T1 into a temporary file, deserialize the data structure definition from the temporary file, then take the binary data stream from table T2 and match it in order, by data identifier, against the data structure obtained, which yields the data content of each field.

Compared with the large-volume data processing method and system disclosed in prior art 200810097594.7, the above technical solution of the present invention has the following advantages:

1, Generality: the present invention extracts the data characteristic information into a specific file, in which the user can add, delete and modify the data characteristic descriptions, and the software parses the data according to the rules laid down in this file, so the invention has good extensibility;

2, Processing efficiency: the present invention greatly reduces disk read/write time through the file mapping mechanism, and processes data simultaneously on multiple threads from a thread pool, giving high speed and efficiency;

3, Data integrity: the present invention uses the thread pool and the file mapping mechanism to make full use of the processing power of a single computer, so data integrity is guaranteed.

The beneficial effects of the present invention are as follows:

1) parsing is fast;

2) the parsed data structure is configurable and highly general;

3) parsing results are stored in a database, with little data storage redundancy;

4) after-the-fact analysis of the stored result data is convenient.

Brief description of the drawings

Other features, objects and advantages of the present invention will become more apparent from the detailed description of non-limiting embodiments given with reference to the following drawing:

Fig. 1 is a flow chart of the method provided by the present invention.

Detailed description of the embodiments

The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the present invention, but do not limit it in any way. It should be noted that those skilled in the art can make several variations and improvements without departing from the inventive concept. These all fall within the scope of protection of the present invention.

As shown in Fig. 1, the method for fast parsing and storage of big data based on a database provided by the present invention specifically comprises the following steps:

Step 1: define a data format in a file according to the data content actually to be parsed.

The data format is written in Extensible Markup Language (XML) so that data of any format can be defined. For each data item involved in the data format, its identifier, name, type, length, signedness (whether it distinguishes positive and negative values), relative start position and so on are defined. If a data item can be further subdivided into multiple data items, data sub-items are defined, each in the same way as a data item. For a data item whose length is not fixed, a calculation formula is defined. The formula may comprise ordinary mathematical expressions and may also comprise references to the lengths of other data items, so that the actual length of the data item can be calculated dynamically from the specific data content given.
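To make the format definition concrete, the following sketch shows what such an XML file might look like and how it could be read in Python. The patent does not publish a schema, so the element name `item` and the attributes `id`, `name`, `type`, `length`, `signed`, `offset` and `length_formula` are assumptions chosen to mirror the properties listed above. The item `note_len` is a hypothetical item assumed to carry the length of the variable-length item `note`.

```python
import xml.etree.ElementTree as ET

# Hypothetical format-definition file; all element and attribute names are illustrative.
FORMAT_XML = """
<format name="sensor_record">
  <item id="1" name="timestamp" type="uint" length="4" signed="false" offset="0"/>
  <item id="2" name="position" type="group" length="8" signed="false" offset="4">
    <item id="3" name="x" type="int" length="4" signed="true" offset="0"/>
    <item id="4" name="y" type="int" length="4" signed="true" offset="4"/>
  </item>
  <item id="5" name="note_len" type="uint" length="1" signed="false" offset="12"/>
  <item id="6" name="note" type="bytes" length_formula="note_len" signed="false" offset="13"/>
</format>
"""

root = ET.fromstring(FORMAT_XML)

# Element.iter() walks the tree in document order, so sub-items follow their parent.
names = [item.get("name") for item in root.iter("item")]

for item in root.iter("item"):
    # A fixed-length item has "length"; a variable-length item has "length_formula".
    length = item.get("length") or "formula: " + item.get("length_formula")
    print(item.get("id"), item.get("name"), item.get("type"), length)
```

Because the definition lives in a separate file, adding or removing an item in this sketch would change only the XML text, not the parsing code.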

Step 2: read the data format from the file into memory and construct a data structure, as the basis for data parsing.

This data structure preserves the tree structure among the data items defined in the data format definition file, and also organizes this tree structure into an ordered data item list. After the data have been parsed, the content of a data item that contains sub-items can thus be extracted easily, and during parsing the data can be parsed quickly by matching item by item against the ordered data item list.
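One possible in-memory representation is sketched below in Python. It keeps the tree and derives the ordered list by a pre-order walk. The class `DataItem`, the function `flatten` and the choice to place only leaf items in the ordered list are assumptions made for illustration; the patent does not say whether group items themselves appear in the list.

```python
from dataclasses import dataclass, field

@dataclass
class DataItem:
    ident: int
    name: str
    length: int
    children: list = field(default_factory=list)  # sub-items, empty for a leaf

def flatten(items):
    """Pre-order walk returning leaf items in the order they occur in the data."""
    ordered = []
    for item in items:
        if item.children:
            ordered.extend(flatten(item.children))
        else:
            ordered.append(item)
    return ordered

# Hypothetical definition: "position" is a group made of two sub-items.
tree = [
    DataItem(1, "timestamp", 4),
    DataItem(2, "position", 8, [DataItem(3, "x", 4), DataItem(4, "y", 4)]),
    DataItem(5, "flags", 1),
]
ordered = flatten(tree)
names = [item.name for item in ordered]
```

Under this assumption the parser only ever walks `ordered`, while the content of a group such as `position` can still be recovered afterwards as the concatenation of its sub-items, because the tree is kept alongside the list.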

Step 3: preparation before data parsing: split the raw data into data blocks of a specified size.

If the raw data are stored on disk, the data are read into memory through a file mapping mechanism and then split, which improves data reading speed. Create a specified number of threads and save the identifiers of these threads in a list; this list can be regarded as a thread pool. For each thread, create a list for saving its parsing results.
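The preparation step can be sketched in Python with the standard `mmap` module, which is one way to realise a file mapping mechanism. The block size, the function names and the dictionary layout of a pool entry are assumptions. The sketch cuts at fixed byte offsets, whereas the patent does not state how block boundaries relate to record boundaries, and it models the pool only as bookkeeping entries without starting actual threads.

```python
import mmap
import os
import tempfile

BLOCK_SIZE = 4096  # assumed block size in bytes; the text only says "specified size"

def split_blocks(path, block_size=BLOCK_SIZE):
    """Map the file into memory and cut it into (sequence number, block) pairs."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return [
                (seq, mm[start:start + block_size])
                for seq, start in enumerate(range(0, len(mm), block_size))
            ]

def make_pool(thread_count):
    """One bookkeeping entry per worker thread, each with its own result list."""
    return [
        {"id": n, "state": "idle", "done": False, "seq": None, "results": []}
        for n in range(thread_count)
    ]

# Demonstration on a throwaway 10,000-byte file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(10_000))
blocks = split_blocks(tmp.name)
os.remove(tmp.name)
pool = make_pool(4)
```

Slicing the mapping copies each block into a `bytes` object, so the blocks remain valid after the mapping is closed. A real implementation might instead hand out offsets into the mapping to avoid the copy.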

Step 4: data parsing.

Allocate the data blocks split in step 3 to idle threads in the thread pool, and at the same time assign sequence numbers to these threads according to the order of the data blocks. After receiving its data, each thread performs data matching and parsing according to the ordered data item list in the data structure obtained in step 2, and stores the results in the result list created for it in step 3.

Because the length of every data item in each parsing result has been determined by this point, every record in the result list carries the actual length information of all variable-length fields in that record, together with one binary data stream that holds the result for each data item in the order of the ordered data item list of the data definition. When a data item is to be used, only the data of the corresponding length need be extracted from this binary data stream according to the item's length and start position. After the data block allocated to a thread has been parsed, the processing-complete flag of that thread and its list is set to true.
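The record layout described here, a list of variable-field lengths plus one binary stream, can be sketched as follows in Python. The item list `ITEMS`, the big-endian integer encoding and the rule that a variable-length item takes its length from the value of a named earlier item are assumptions for illustration only.

```python
import struct

# Ordered item list: (name, fixed length or None, item that supplies the length).
ITEMS = [("timestamp", 4, None), ("note_len", 1, None), ("note", None, "note_len")]

def parse_record(buf, pos=0):
    """Match items in order; return (variable lengths, raw record bytes, next position)."""
    start, var_lengths, values = pos, [], {}
    for name, length, length_from in ITEMS:
        if length is None:
            length = values[length_from]      # resolve the variable length
            var_lengths.append(length)
        else:
            values[name] = int.from_bytes(buf[pos:pos + length], "big")
        pos += length
    return var_lengths, buf[start:pos], pos

def extract(name, var_lengths, stream):
    """Slice one item out of a stored stream using fixed and recorded lengths."""
    offset, var_iter = 0, iter(var_lengths)
    for item, length, _ in ITEMS:
        if length is None:
            length = next(var_iter)
        if item == name:
            return stream[offset:offset + length]
        offset += length
    raise KeyError(name)

# Two hypothetical records packed back to back in one data block.
block = struct.pack(">IB", 1, 5) + b"hello" + struct.pack(">IB", 2, 2) + b"ok"
first_lengths, first_stream, pos = parse_record(block)
second_lengths, second_stream, pos = parse_record(block, pos)
note = extract("note", second_lengths, second_stream)
```

Storing the resolved lengths next to the stream means that `extract` never has to re-evaluate a length formula, which is the point the paragraph above makes about later use of each data item.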

Step 5: data storage.

First create a data table T1, serialize the data structure corresponding to the data format definition obtained in step 2 into a temporary file, then store the binary content of that file in the table and assign a unique data definition ID to the record. Then create a data table T2 for holding the parsing results: in T2, define a foreign key associated with the data definition ID in T1, define one field that holds the actual length information of all variable-length fields, and define another field that holds the binary data stream of each parsing result. Examine the thread information from step 4: if a thread's processing-complete flag has been set and its sequence number is the smallest in the thread pool, store its parsing result list in table T2. When storing, set the foreign key (data definition ID) of each record to the ID assigned in T1 to the data structure corresponding to the data format definition obtained in step 2, and store the record's variable-length field length information and binary data stream in the corresponding fields of T2.
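A minimal sketch of the two tables and the ordered storing rule is given below, using Python's `sqlite3` and `pickle` modules. The patent names neither a database product nor a serialization format, so both are assumptions, as are the column names and the worker dictionaries. The sketch also serializes in memory with `pickle.dumps` instead of going through a temporary file.

```python
import pickle
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE T1 (def_id INTEGER PRIMARY KEY, definition BLOB NOT NULL)")
conn.execute(
    "CREATE TABLE T2 ("
    " def_id INTEGER NOT NULL REFERENCES T1(def_id),"  # foreign key to the definition
    " var_lengths BLOB NOT NULL,"                      # lengths of variable fields
    " stream BLOB NOT NULL)"                           # binary data of one record
)

# Hypothetical ordered item list standing in for the structure built in step 2.
definition = [("timestamp", 4), ("note_len", 1), ("note", None)]
def_id = conn.execute(
    "INSERT INTO T1 (definition) VALUES (?)", (pickle.dumps(definition),)
).lastrowid

# Hypothetical worker states after step 4: two finished, one still parsing.
workers = [
    {"seq": 1, "done": True, "stored": False,
     "results": [([2], b"\x00\x00\x00\x02\x02ok")]},
    {"seq": 0, "done": True, "stored": False,
     "results": [([5], b"\x00\x00\x00\x01\x05hello")]},
    {"seq": 2, "done": False, "stored": False, "results": []},
]

def store_ready(workers):
    """Repeatedly store the pending worker with the smallest sequence number."""
    stored = 0
    while True:
        pending = [w for w in workers if not w["stored"]]
        if not pending:
            break
        first = min(pending, key=lambda w: w["seq"])
        if not first["done"]:
            break  # the earliest block is unfinished, so later ones must wait
        conn.executemany(
            "INSERT INTO T2 (def_id, var_lengths, stream) VALUES (?, ?, ?)",
            [(def_id, pickle.dumps(v), s) for v, s in first["results"]],
        )
        first["stored"] = True
        stored += 1
    conn.commit()
    return stored

stored = store_ready(workers)
rows = conn.execute("SELECT stream FROM T2 ORDER BY rowid").fetchall()
```

Storing only the finished worker with the smallest sequence number keeps the rows of T2 in the same order as the original data blocks, even though the workers may finish out of order.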

In a database stored by this step, the data structure of the data is recorded completely and the data records carry little redundant information. After-the-fact analysis does not depend on any other configuration file: simply read the corresponding binary data from table T1 into a temporary file, deserialize the data structure definition from the temporary file, then take the binary data stream from table T2 and match it in order, by data identifier, against the data structure obtained to get the data content of each field. Data extraction is very simple.
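The read-back path can be sketched in the same way. The example below builds a tiny database so that it is self-contained, then recovers every field using only what is stored in T1 and T2. As before, `sqlite3`, `pickle`, the column names and the item list are assumptions, and deserialization is done in memory and not through a temporary file.

```python
import pickle
import sqlite3

# Build a small hypothetical database so the example can run on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T1 (def_id INTEGER PRIMARY KEY, definition BLOB)")
conn.execute(
    "CREATE TABLE T2 (def_id INTEGER REFERENCES T1(def_id),"
    " var_lengths BLOB, stream BLOB)"
)
definition = [("timestamp", 4), ("note_len", 1), ("note", None)]  # None: variable
conn.execute("INSERT INTO T1 VALUES (1, ?)", (pickle.dumps(definition),))
conn.execute(
    "INSERT INTO T2 VALUES (1, ?, ?)",
    (pickle.dumps([5]), b"\x00\x00\x00\x01\x05hello"),
)

def load_records(conn, def_id):
    """Deserialize the definition from T1, then slice each T2 stream by item."""
    (blob,) = conn.execute(
        "SELECT definition FROM T1 WHERE def_id = ?", (def_id,)
    ).fetchone()
    items = pickle.loads(blob)
    records = []
    query = "SELECT var_lengths, stream FROM T2 WHERE def_id = ?"
    for lengths_blob, stream in conn.execute(query, (def_id,)):
        var_iter, offset, fields = iter(pickle.loads(lengths_blob)), 0, {}
        for name, length in items:
            if length is None:
                length = next(var_iter)   # recorded length of a variable field
            fields[name] = stream[offset:offset + length]
            offset += length
        records.append(fields)
    return records

records = load_records(conn, 1)
```

Note that `pickle.loads` should only be applied to data from a trusted source; a production system would more likely use a schema-based serialization format.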

Step 6: after a thread's parsing result list has been stored, clear its result list, set its processing-complete flag to false, then set its state to idle and return it to the thread pool to wait for a new data block to be allocated.
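The reset in step 6 amounts to a few state changes, sketched below in Python. The `Worker` class, its field names and the use of a plain list as the pool of idle workers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Worker:
    ident: int
    state: str = "idle"
    done: bool = False                 # processing-complete flag
    seq: Optional[int] = None          # sequence number of the current block
    results: list = field(default_factory=list)

def recycle(worker, idle_pool):
    """Reset a worker whose results have been stored and return it to the pool."""
    worker.results.clear()             # empty the result list
    worker.done = False                # processing-complete flag back to false
    worker.state = "idle"
    worker.seq = None
    idle_pool.append(worker)           # available for the next data block

# A worker that has finished block 7 and whose results were just stored.
worker = Worker(ident=0, state="busy", done=True, seq=7, results=[([5], b"...")])
idle_pool = []
recycle(worker, idle_pool)
```

Clearing the flag before the state is set to idle follows the order given in the step, so that a worker is never seen as both idle and complete.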

Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the above particular embodiments; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the present invention.

Claims

1. A method for fast parsing and storage of big data based on a database, characterized in that it comprises the following steps:

2. The method for fast parsing and storage of big data based on a database according to claim 1, characterized in that the specific method of step 1 is: the data format is written in Extensible Markup Language (XML) so that data of any format can be defined; for each data item involved in the data format, its identifier, name, type, length, signedness and relative start position are defined; if a data item can be further subdivided into multiple data items, data sub-items are defined, each in the same way as a data item; and for a data item whose length is not fixed, a calculation formula is defined.

3. The method for fast parsing and storage of big data based on a database according to claim 2, characterized in that the calculation formula comprises ordinary mathematical expressions, or references to the lengths of other data items, so that the actual length of the data item is calculated dynamically from the specific data content given.

4. The method for fast parsing and storage of big data based on a database according to claim 1, characterized in that the specific method of step 2 is: the data structure preserves the tree structure among the data items defined in the data format definition file, and also organizes this tree structure into an ordered data item list, so that the content of a data item containing sub-items can be extracted easily after the data have been parsed, and so that during parsing the data can be parsed quickly by matching item by item against the ordered data item list.

5. The method for fast parsing and storage of big data based on a database according to claim 1, characterized in that, in step 3, if the raw data are stored on disk, the data are read into memory through a file mapping mechanism and then split, to improve data reading speed.

6. The method for fast parsing and storage of big data based on a database according to claim 1, characterized in that, in step 4, because the length of every data item in each parsing result has been determined, every record in the result list carries the actual length information of all variable-length fields in that record, together with one binary data stream that holds the result for each data item in the order of the ordered data item list of the data definition; when a data item is to be used, only the data of the corresponding length need be extracted from this binary data stream according to the item's length and start position; and after the data block allocated to a thread has been parsed, the processing-complete flag of that thread and its list is set to true.

7. The method for fast parsing and storage of big data based on a database according to claim 1, characterized in that the specific method of step 5 is: first create a data table T1, serialize the data structure corresponding to the data format definition obtained in step 2 into a temporary file, then store the binary content of that file in the table and assign a unique data definition ID to the record; then create a data table T2 for holding the parsing results, and in T2 define a foreign key associated with the data definition ID in T1, define one field that holds the actual length information of all variable-length fields, and define another field that holds the binary data stream of each parsing result; examine the thread information from step 4, and if a thread's processing-complete flag has been set and its sequence number is the smallest in the thread pool, store its parsing result list in table T2; when storing, set the foreign key of each record to the ID assigned in T1 to the data structure corresponding to the data format definition obtained in step 2, and store the record's variable-length field length information and binary data stream in the corresponding fields of T2.

8. The method for fast parsing and storage of big data based on a database according to claim 7, characterized in that the method for extracting the data stored in step 5 is: simply read the corresponding binary data from table T1 into a temporary file, deserialize the data structure definition from the temporary file, then take the binary data stream from table T2 and match it in order, by data identifier, against the data structure obtained, which yields the data content of each field.