Processing method, apparatus, device, and computer storage medium for column-stored data
Technical Field
The present application relates to the field of computer technology, and more particularly to a processing method, apparatus, device, and computer storage medium for column-stored data.
Background
Unlike traditional row storage, which stores the values of each data row contiguously, a columnar storage format serializes some or all of the data values of one column and stores them contiguously in the file, and then stores some or all of the data values of another column. The tail of the data file may carry Footer information, that is, information describing the storage format of the data file, including the metadata of the file, the number of columns, relative positions, data type information, statistical information, and the like.
In practice, new data often needs to be added to an existing data file. In the related art, when new data is obtained, it is typically stored by creating a new data file. However, each new data file consumes resources and reduces data processing efficiency, and a data query must then search the original data file and the new data file separately, so query efficiency is low.
Summary of the Invention
To overcome the problems in the related art, the present application provides a processing method, apparatus, device, and computer storage medium for column-stored data.
According to a first aspect of the embodiments of the present application, a processing method for column-stored data is provided, the method including:
receiving newly added data for a raw data file, where the raw data file is stored in a columnar storage format and the Footer information of the raw data file is recorded in a Footer file independent of the raw data file; and
writing the newly added data to the tail of the raw data file in the columnar storage format, and adding Footer information for the newly added data to the Footer file, to obtain an updated raw data file and an updated Footer file.
In an optional implementation, after receiving the newly added data for the raw data file, the method includes:
loading the received newly added data into a high-speed storage space;
and writing the newly added data to the tail of the raw data file in the columnar storage format includes:
when the newly added data loaded in the high-speed storage space meets a preset storage condition, writing the loaded newly added data to the tail of the raw data file in the columnar storage format.
In an optional implementation, the preset storage condition includes one or more of the following conditions:
the data volume of the newly added data reaches a preset data volume threshold; or
the loading duration of the newly added data in the high-speed storage space reaches a preset duration threshold.
In an optional implementation, the method further includes:
when a data query request for the raw data file is obtained, reading, according to the Footer file before the update, first data satisfying the request from the raw data file, and reading second data satisfying the request from the newly added data loaded in the high-speed storage space; and
merging the first data and the second data and outputting the result.
In an optional implementation, the method further includes:
generating a replica file of the loaded newly added data, and storing the replica file of the loaded newly added data in the same directory as the replica file of the raw data file.
According to a second aspect of the embodiments of the present application, a processing apparatus for column-stored data is provided, including:
a data receiving module, configured to receive newly added data for a raw data file, where the raw data file is stored in a columnar storage format and the Footer information of the raw data file is recorded in a Footer file independent of the raw data file; and
a data writing module, configured to write the newly added data to the tail of the raw data file in the columnar storage format, and to add Footer information for the newly added data to the Footer file, to obtain an updated raw data file and an updated Footer file.
In an optional implementation, the data receiving module is further configured to:
after receiving the newly added data for the raw data file, load the received newly added data into a high-speed storage space;
and the data writing module is specifically configured to:
when the newly added data loaded in the high-speed storage space meets a preset storage condition, write the loaded newly added data to the tail of the raw data file in the columnar storage format.
In an optional implementation, the preset storage condition includes one or more of the following conditions:
the data volume of the newly added data reaches a preset data volume threshold; or
the loading duration of the newly added data in the high-speed storage space reaches a preset duration threshold.
In an optional implementation, the apparatus further includes a reading module, configured to:
when a data query request for the raw data file is obtained, read, according to the Footer file before the update, first data satisfying the request from the raw data file, and read second data satisfying the request from the newly added data loaded in the high-speed storage space; and
merge the first data and the second data and output the result.
In an optional implementation, the apparatus further includes a replica processing module, configured to:
generate a replica file of the loaded newly added data, and store the replica file of the loaded newly added data in the same directory as the replica file of the raw data file.
According to a third aspect of the embodiments of the present application, a computer device is provided, including:
a processor; and
a memory for storing processor-executable instructions;
where the processor is configured to:
receive newly added data for a raw data file, where the raw data file is stored in a columnar storage format and the Footer information of the raw data file is recorded in a Footer file independent of the raw data file; and
write the newly added data to the tail of the raw data file in the columnar storage format, and add Footer information for the newly added data to the Footer file, to obtain an updated raw data file and an updated Footer file.
According to a fourth aspect of the embodiments of the present application, a computer storage medium is provided, the storage medium storing program instructions, the program instructions including:
receiving newly added data for a raw data file, where the raw data file is stored in a columnar storage format and the Footer information of the raw data file is recorded in a Footer file independent of the raw data file; and
writing the newly added data to the tail of the raw data file in the columnar storage format, and adding Footer information for the newly added data to the Footer file, to obtain an updated raw data file and an updated Footer file.
The technical solutions provided by the embodiments of the present application can have the following beneficial effects:
In the related art, the Footer information of the raw data file is stored at the tail of the file. In the present application, the Footer information is instead recorded in an independent Footer file, so newly added data can be appended directly to the tail of the raw data file, while the Footer information of the newly added data is recorded in the Footer file. The embodiments of the present application can thus achieve efficient streaming append for column-stored data: newly added data is appended to the raw data file rather than recorded in a newly created file, so processing efficiency is higher, fewer resources are occupied, and data queries are faster.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present application.
Brief Description of the Drawings
The accompanying drawings are incorporated into and form part of this specification, show embodiments consistent with the present application, and are used together with the specification to explain the principles of the present application.
Fig. 1A is a schematic diagram of a columnar storage format in the related art.
Fig. 1B is a schematic diagram of another columnar storage format in the related art.
Fig. 2 is a flowchart of a processing method for column-stored data according to an exemplary embodiment of the present application.
Fig. 3A is an application scenario diagram of a processing method for column-stored data according to an exemplary embodiment of the present application.
Fig. 3B is a schematic diagram of newly added data loaded in memory according to an exemplary embodiment of the present application.
Fig. 4 is a hardware structure diagram of a computer device in which the processing apparatus for column-stored data of the present application is located.
Fig. 5 is a block diagram of a processing apparatus for column-stored data according to an exemplary embodiment of the present application.
Detailed Description
Exemplary embodiments are described in detail here, and examples of them are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present application as detailed in the appended claims.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to limit the present application. The singular forms "a", "said", and "the" used in the present application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, and so on may be used in the present application to describe various information, the information should not be limited by these terms. The terms are only used to distinguish information of the same type from each other. For example, without departing from the scope of the present application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
The columnar storage format of a file is described first. Fig. 1A is a schematic diagram of a columnar storage format in the related art. The format in Fig. 1A serializes all the data values of one column in the data file, stores them contiguously on disk, and then stores all the data values of another column. A columnar storage format may also operate on only part of the data values of a column. Fig. 1B is a schematic diagram of another columnar storage format in the related art: the overall data is first divided by rows into two blocks (like the RowGroup concept in Parquet; the number of blocks may also be another value), and the data within each block is then stored in columnar format.
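The two layouts of Fig. 1A and Fig. 1B can be illustrated with a minimal sketch. The function name `to_columnar` and the parameter `row_group_size` are hypothetical and chosen only for illustration; the sketch keeps values in Python lists and does not perform any on-disk serialization.

```python
def to_columnar(rows, row_group_size=None):
    """Lay rows out column by column, optionally one row group at a time.

    With row_group_size=None, all values of a column are kept together
    (the Fig. 1A style). Otherwise the rows are first divided into row
    groups, and each group is laid out column by column (the Fig. 1B style).
    """
    if row_group_size is None:
        row_group_size = max(len(rows), 1)
    groups = []
    for start in range(0, len(rows), row_group_size):
        chunk = rows[start:start + row_group_size]
        # zip(*chunk) transposes a list of rows into a list of columns.
        groups.append([list(column) for column in zip(*chunk)])
    return groups


rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
whole_file = to_columnar(rows)
two_groups = to_columnar(rows, row_group_size=2)
```

Here `whole_file` holds a single group with two complete columns, while `two_groups` holds two groups that each contain the two columns for their own rows.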
In addition, the tail of the data file carries Footer information, which may include the metadata of the data file, the number of columns, relative positions, data type information, statistical information, and the like. Because the columnar storage format serializes all or part of the data of a column in the data file, the Footer information records how the file is stored and is used when the data is read.
In practice, new data often needs to be added to an existing data file. In the related art, because the tail of the data file carries the Footer information, new data is usually stored by creating a new data file.
In the related art, the Footer information of the raw data file is stored at the tail of the file. The embodiments of the present application instead record the Footer information in an independent Footer file, so newly added data can be appended directly to the tail of the raw data file, while the Footer information of the newly added data is recorded in the Footer file. The embodiments of the present application can thus achieve efficient streaming append for column-stored data: newly added data is appended to the raw data file rather than recorded in a newly created file, so processing efficiency is higher, fewer resources are occupied, and data queries are faster. The embodiments of the present application are described in detail below.
Fig. 2 is a flowchart of a processing method for column-stored data according to an exemplary embodiment of the present application. The embodiment can be applied in a database that uses a columnar storage format, and includes the following steps 201 to 202:
In step 201, newly added data for a raw data file is received, where the raw data file is stored in a columnar storage format and the Footer information of the raw data file is recorded in a Footer file independent of the raw data file.
In step 202, the newly added data is written to the tail of the raw data file in the columnar storage format, and Footer information for the newly added data is added to the Footer file, to obtain an updated raw data file and an updated Footer file.
For a raw data file in columnar storage, this embodiment records the Footer information of the raw data file in a separately created Footer file. With this arrangement, when data needs to be added, the newly added data can be appended directly to the tail of the raw data file as a new data block of that file. The newly added data is therefore still written into the raw data file, which achieves efficient streaming append. The Footer information is then updated in the Footer file.
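Steps 201 and 202 can be sketched as follows. This is only an illustration under stated assumptions, not the format used by the application: the function names `append_row_group` and `read_column`, the JSON footer layout, the file names, and the restriction to 64-bit integer columns are all hypothetical. For brevity the footer file is rewritten in full on each append, which a real system would likely avoid.

```python
import json
import os
import struct
import tempfile


def append_row_group(data_path, footer_path, columns):
    """Append one row group to the data file and record it in the footer file.

    `columns` maps a column name to a list of ints (a hypothetical schema).
    """
    entry = {}
    # Mode "ab" always writes at the current end of the file.
    with open(data_path, "ab") as data:
        for name, values in columns.items():
            offset = data.seek(0, os.SEEK_END)
            data.write(struct.pack(f"<{len(values)}q", *values))
            entry[name] = {"offset": offset, "count": len(values)}

    # The footer lives in its own file, so the data file's tail stays free.
    footer = []
    if os.path.exists(footer_path):
        with open(footer_path) as f:
            footer = json.load(f)
    footer.append(entry)
    with open(footer_path, "w") as f:
        json.dump(footer, f)


def read_column(data_path, footer_path, name):
    """Read every value of one column by following the footer entries."""
    with open(footer_path) as f:
        footer = json.load(f)
    values = []
    with open(data_path, "rb") as data:
        for entry in footer:
            meta = entry[name]
            data.seek(meta["offset"])
            raw = data.read(8 * meta["count"])
            values.extend(struct.unpack(f"<{meta['count']}q", raw))
    return values


with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "table.data")
    footer_path = os.path.join(tmp, "table.footer")
    append_row_group(data_path, footer_path, {"id": [1, 2], "qty": [10, 20]})
    append_row_group(data_path, footer_path, {"id": [3], "qty": [30]})
    ids = read_column(data_path, footer_path, "id")
    qtys = read_column(data_path, footer_path, "qty")
    data_size = os.path.getsize(data_path)
```

The second call appends after the first row group without touching the bytes already written, and a reader locates both row groups purely through the separate footer file.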
In practice, the raw data file has already been stored on disk in columnar format. In some examples, newly added data may be obtained in real time and written to the raw data file in real time. In other examples, the volume of newly added data may be large and the data may keep arriving. To improve data processing efficiency, after receiving the newly added data for the raw data file, the method may further include:
loading the received newly added data into a high-speed storage space.
Writing the newly added data to the tail of the raw data file in the columnar storage format then includes:
when the newly added data loaded in the high-speed storage space meets a preset storage condition, writing the loaded newly added data to the tail of the raw data file in the columnar storage format.
The high-speed storage space may include space that stores data temporarily, such as memory or a cache, or a buffer space used for data exchange. It can be determined according to the hardware environment of the actual database system, and this embodiment does not limit it.
In this way, newly added data can be temporarily loaded into a high-speed storage space such as memory; specifically, the data loaded into memory may be held in row storage. The newly added data is later written to the raw data file in a batch, which improves data processing efficiency. The preset storage condition specifies when the newly added data is written from the high-speed storage space to the raw data file, and can be configured flexibly in practice. For example, the condition may be that the loaded data is appended to the raw data file when the current utilization of the high-speed storage space becomes high, to prevent data loss caused by overflow of the memory or cache. Alternatively, the loaded data may be appended when the current utilization is low, so that the data is processed while the hardware is idle. Other conditions are also possible.
In an optional implementation, the preset storage condition may include one or more of the following conditions:
First, the data volume of the newly added data reaches a preset data volume threshold. In this mode, the preset storage condition takes data volume as the consideration: when the volume of newly added data is large, the data is promptly appended to the raw data file, which prevents problems such as data loss caused by an excessive data volume. The data volume threshold can be configured flexibly as needed, and this embodiment does not limit it.
Second, the loading duration of the newly added data in memory reaches a preset duration threshold. In this mode, the preset storage condition takes the loading duration of the data as the consideration: after the newly added data has been held in memory for a certain time, it is promptly appended to the raw data file, which prevents problems such as data loss caused by data staying loaded for too long. The duration threshold can be configured flexibly as needed, and this embodiment does not limit it.
With the two modes above, a reasonable time can be determined for appending the newly added data loaded in the high-speed storage space to the raw data file. Taking data volume and/or duration as the consideration avoids the resource consumption caused by appending newly added data to the raw data file too frequently, while still appending the data promptly enough to prevent problems such as data loss.
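The two conditions can be sketched as a small in-memory buffer. The class name `AppendBuffer`, its parameters, and the use of a row count as the measure of data volume are assumptions made for illustration; the clock is injectable so that the duration condition can be exercised without waiting.

```python
import time


class AppendBuffer:
    """Hold newly added rows in memory until a preset storage condition is met."""

    def __init__(self, flush, max_rows=1000, max_age_s=5.0, clock=time.monotonic):
        self.flush = flush            # callback that appends rows to the data file
        self.max_rows = max_rows      # data volume threshold (counted in rows here)
        self.max_age_s = max_age_s    # loading duration threshold, in seconds
        self.clock = clock
        self.rows = []
        self.first_loaded_at = None

    def add(self, row):
        if not self.rows:
            # The loading duration is measured from the oldest buffered row.
            self.first_loaded_at = self.clock()
        self.rows.append(row)
        self.maybe_flush()

    def should_flush(self):
        if not self.rows:
            return False
        too_big = len(self.rows) >= self.max_rows
        too_old = self.clock() - self.first_loaded_at >= self.max_age_s
        return too_big or too_old

    def maybe_flush(self):
        """Flush if either condition holds; return True when a flush happened."""
        if not self.should_flush():
            return False
        self.flush(self.rows)
        self.rows = []
        self.first_loaded_at = None
        return True


now = [0.0]                      # fake clock value, advanced by hand
flushed = []
buf = AppendBuffer(flushed.append, max_rows=3, max_age_s=10.0, clock=lambda: now[0])
buf.add("r1")
buf.add("r2")                    # below both thresholds, stays in memory
buf.add("r3")                    # volume threshold reached, batch is flushed
buf.add("r4")
now[0] = 10.0
flushed_by_age = buf.maybe_flush()   # duration threshold reached
```

In a running system `maybe_flush` would typically also be called from a timer, so that a quiet buffer is still written out once the duration threshold passes.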
In the related art, some hardware such as memory or cache may require data to be stored in blocks when it is loaded. For this case, in this embodiment, loading the newly added data into memory may include: splitting the newly added data into one or more data blocks in units of a preset data block size, and loading the data blocks into the memory.
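Splitting by a preset block size can be sketched as below. The function name `split_into_blocks` is hypothetical, and the sketch assumes the newly added data is already available as a byte string.

```python
def split_into_blocks(payload: bytes, block_size: int):
    """Split payload into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [payload[i:i + block_size] for i in range(0, len(payload), block_size)]


blocks = split_into_blocks(b"abcdefghij", 4)
```

The last block may be shorter than the preset size when the payload length is not a multiple of it.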
In a database system, newly added data usually raises the issue of data visibility timeliness, that is, how soon new data becomes visible to queries. For example, after newly added data has been loaded into memory, a user may need to query some data, and the related art typically queries only the raw data file. The newly added data, however, may still be in memory and not yet stored on disk. If some of the data the user queries is still in memory, that part is not output, so the data returned to the user is incomplete and the visibility timeliness of the loaded newly added data is poor. To address this problem, the method of the embodiments of the present application may further include:
when a data query request for the raw data file is obtained, reading, according to the Footer file before the update, first data satisfying the request from the raw data file, and reading second data satisfying the request from the newly added data loaded in the high-speed storage space; and
merging the first data and the second data and outputting the result.
In this embodiment, when a data query request for the raw data file is obtained, the first data satisfying the request is read from the raw data file stored on disk, and the second data satisfying the request is also read from the newly added data loaded in the high-speed storage space. The first data and the second data are then merged, and the merged data is output as the response to the data query request. Because the newly added data loaded in the high-speed storage space is also queried, the problem of returning incomplete data to the user is avoided, and the visibility timeliness of the data is higher.
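The merged read can be sketched as follows. The names `scan_disk`, `scan_memory`, and `query` are hypothetical, rows are modeled as dictionaries, and the footer snapshot is modeled simply as the list of row group indexes that the not-yet-updated Footer file knows about.

```python
def scan_disk(row_groups, footer_snapshot, predicate):
    """Read matching rows from the row groups listed in the footer snapshot."""
    matches = []
    for group_id in footer_snapshot:
        matches.extend(row for row in row_groups[group_id] if predicate(row))
    return matches


def scan_memory(buffered_rows, predicate):
    """Read matching rows from newly added data that is still in memory."""
    return [row for row in buffered_rows if predicate(row)]


def query(row_groups, footer_snapshot, buffered_rows, predicate):
    first_data = scan_disk(row_groups, footer_snapshot, predicate)
    second_data = scan_memory(buffered_rows, predicate)
    # Merge the persisted and in-memory results before returning them.
    return first_data + second_data


row_groups = [
    [{"id": 1, "qty": 10}, {"id": 2, "qty": 25}],
    [{"id": 3, "qty": 40}],
]
footer_snapshot = [0, 1]
buffered_rows = [{"id": 4, "qty": 30}, {"id": 5, "qty": 5}]
result = query(row_groups, footer_snapshot, buffered_rows, lambda r: r["qty"] >= 20)
```

Row 4 appears in the result even though it has not yet been written to disk, which is the visibility improvement the paragraph above describes.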
In practice, to prevent data loss, replica files are usually created for the raw data file as a data backup. If the received newly added data is loaded into the high-speed storage space, it is temporarily not written to the raw data file, so data loss may occur. In this embodiment, the method further includes:
generating a replica file of the loaded newly added data, and storing the replica file of the loaded newly added data in the same directory as the replica file of the raw data file.
Because a corresponding replica file is generated for the loaded newly added data, data loss can be prevented. In addition, the replica file of the loaded newly added data is stored in the same directory as the replica file of the raw data file, so the two replica files correspond to each other, which facilitates data recovery.
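One way to place the replica of the in-memory data next to the replica of the raw data file is sketched below. The function names, the `.pending` suffix, and the JSON encoding are hypothetical choices for illustration; the application does not specify a replica format.

```python
import json
import os
import tempfile


def write_buffer_replica(raw_replica_path, buffered_rows):
    """Persist buffered rows in the same directory as the raw file's replica."""
    replica_dir = os.path.dirname(os.path.abspath(raw_replica_path))
    # The ".pending" suffix is a hypothetical naming choice.
    name = os.path.basename(raw_replica_path) + ".pending"
    path = os.path.join(replica_dir, name)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(buffered_rows, f)
        f.flush()
        os.fsync(f.fileno())      # push the bytes to stable storage
    os.replace(tmp_path, path)    # swap the finished file into place
    return path


def recover_buffer(replica_path):
    """Reload buffered rows from their replica file after a failure."""
    with open(replica_path) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    raw_replica = os.path.join(tmp, "table.data.replica")
    open(raw_replica, "wb").close()
    rows = [{"id": 3, "qty": 30}]
    pending = write_buffer_replica(raw_replica, rows)
    same_directory = (
        os.path.dirname(pending) == os.path.dirname(os.path.abspath(raw_replica))
    )
    recovered = recover_buffer(pending)
```

Keeping both replicas in one directory means a recovery routine only needs to look in a single place to restore the raw data file and the rows that had not yet been appended to it.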
The solution provided by the present application is described in further detail below through a specific embodiment.
Fig. 3A is an application scenario diagram of a processing method for column-stored data according to an exemplary embodiment of the present application. Fig. 3A includes a database management system, which may be any database management system that supports a columnar storage format such as Parquet or ORCFile. The processing scheme for column-stored data provided by the embodiments of the present application can run in the database management system as an independent module or process to handle the column-stored data.
As shown in Fig. 3A, the database management system maintains a raw data file, which is stored at some location on disk in a columnar storage format. The raw data file is stored in the manner of Fig. 1B: the overall data is first divided by rows into multiple blocks (RowGroups), and the data within each block is then stored in columnar format. The Footer information of the raw data file is recorded in a Footer file independent of the raw data file.
During a certain period, newly added data for the raw data file is continuously input to the database management system. In this embodiment, the newly added data can be received continuously and placed in memory. Fig. 3B is a schematic diagram of newly added data loaded in memory according to an exemplary embodiment of the present application. Following the loading mechanism of the memory, the newly added data is written (write) to memory by streaming append (AppendRecord), and may exist in memory as the DataBlocks (data blocks) shown in Fig. 3B. When the total data volume of the DataBlocks corresponding to the newly added data reaches a certain size, or the loading duration of the DataBlocks reaches a certain length, an underlying interface of the database management system can be called to append the DataBlocks corresponding to the newly added data to the tail of the raw data file, forming a new RowGroup. When the newly added data corresponds to many DataBlocks, or when other conditions are met, multiple DataBlocks may first be merged (compaction) and then appended to the tail of the raw data file, to reduce the number of DataBlocks. Whether to use this DataBlock merging can be configured flexibly in practice, and this embodiment does not limit it. In addition, Footer information for the newly added data is added to the Footer file, giving the updated raw data file and Footer file. Because the Footer information is recorded, the received newly added data can be appended directly to the raw data file without being sorted.
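The merging (compaction) of several DataBlocks into one RowGroup can be sketched as below. The function name `compact` is hypothetical, and the sketch assumes each DataBlock is a list of row dictionaries that share the same keys, in line with the row-oriented in-memory layout mentioned earlier.

```python
def compact(blocks):
    """Merge several row-oriented DataBlocks into one columnar row group."""
    rows = [row for block in blocks for row in block]
    if not rows:
        return {}
    # Assumes every row has the same keys as the first row.
    names = list(rows[0].keys())
    return {name: [row[name] for row in rows] for name in names}


blocks = [
    [{"id": 1, "tag": "a"}],
    [{"id": 2, "tag": "b"}, {"id": 3, "tag": "c"}],
]
row_group = compact(blocks)
```

The result is one columnar row group that can be appended in a single write, so the raw data file gains one RowGroup instead of one per DataBlock.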
After data is written to memory, the possibility of data loss must be considered, so multi-replica processing can be applied to the data. For the newly added data loaded in memory, a corresponding replica can be generated and kept in the same location as the replica of the raw data file; that is, the replica file of the newly added data is stored in the same directory as the replica file of the raw data file.
In the data reading process, besides reading the normal disk file, this embodiment also needs to consider the newly added data loaded in memory, which is the key to improving data visibility timeliness. When a data query request is received, a read operation (Reader) can be performed as shown in Fig. 3B: the request is split into two sub-requests, a disk file data read (DiskTableScan) and an in-memory data read (MemTableScan), and the two partial results are merged and returned as the result of the request. Fig. 3B takes a query issued through a Query Engine as an example, in which a table scan (TableScan) is performed to obtain the query result.
Corresponding to the foregoing embodiments of the processing method for column-stored data, the present application also provides embodiments of a processing apparatus for column-stored data and of the computer device to which it is applied.
The embodiments of the processing apparatus for column-stored data of the present application can be applied in a computer device. The apparatus embodiments may be implemented in software, or in hardware or a combination of software and hardware. Taking a software implementation as an example, the apparatus in the logical sense is formed when the processor of the computer device in which it is located reads the corresponding computer program instructions from the non-volatile memory into memory and runs them. At the hardware level, Fig. 4 is a hardware structure diagram of a computer device in which the processing apparatus for column-stored data of the present application is located. In addition to the processor 410, the memory 430, the network interface 420, and the non-volatile memory 440 shown in Fig. 4, the computer device in which the apparatus 431 is located may also include other hardware according to the actual functions of the computer device, which is not described further here.
Fig. 5 is a block diagram of a processing apparatus for column-stored data according to an exemplary embodiment of the present application, including:
a data receiving module 51, configured to receive newly added data for a raw data file, where the raw data file is stored in a columnar storage format and the Footer information of the raw data file is recorded in a Footer file independent of the raw data file; and
a data writing module 52, configured to write the newly added data to the tail of the raw data file in the columnar storage format, and to add Footer information for the newly added data to the Footer file, to obtain an updated raw data file and an updated Footer file.
In an optional implementation, the data receiving module 51 is further configured to:
after receiving the newly added data for the raw data file, load the received newly added data into a high-speed storage space;
and the data writing module 52 is specifically configured to:
when the newly added data loaded in the high-speed storage space meets a preset storage condition, write the loaded newly added data to the tail of the raw data file in the columnar storage format.
In an optional implementation, the preset storage condition includes one or more of the following conditions:
the data volume of the newly added data reaches a preset data volume threshold; or
the loading duration of the newly added data in the high-speed storage space reaches a preset duration threshold.
In an optional implementation, the apparatus further includes a reading module, configured to:
when a data query request for the raw data file is obtained, read, according to the Footer file before the update, first data satisfying the request from the raw data file, and read second data satisfying the request from the newly added data loaded in the high-speed storage space; and
merge the first data and the second data and output the result.
In an optional implementation, the apparatus further includes a replica processing module, configured to:
generate a replica file of the loaded newly added data, and store the replica file of the loaded newly added data in the same directory as the replica file of the raw data file.
According to the third aspect of the embodiments of the present application, a processing apparatus for column-stored data is provided, including: a processor; and a memory for storing processor-executable instructions; where the processor is configured to:
receive newly added data for a raw data file, where the raw data file is stored in a columnar storage format and the Footer information of the raw data file is recorded in a Footer file independent of the raw data file; and
write the newly added data to the tail of the raw data file in the columnar storage format, and add Footer information for the newly added data to the Footer file, to obtain an updated raw data file and an updated Footer file.
For the implementation of the functions and roles of the modules in the above processing apparatus for column-stored data, refer to the implementation of the corresponding steps in the above processing method for column-stored data; details are not repeated here.
Accordingly, the present application also provides a computer storage medium storing program instructions, the program instructions including:
receiving newly added data for a raw data file, where the raw data file is stored in a columnar storage format and the Footer information of the raw data file is recorded in a Footer file independent of the raw data file; and
writing the newly added data to the tail of the raw data file in the columnar storage format, and adding Footer information for the newly added data to the Footer file, to obtain an updated raw data file and an updated Footer file.
The embodiments of the present application may take the form of a computer program product implemented on one or more storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) that contain program code. Computer-usable storage media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to: phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
Because the apparatus embodiments basically correspond to the method embodiments, refer to the description of the method embodiments for the relevant parts. The apparatus embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present application. Those of ordinary skill in the art can understand and implement this without creative effort.
Other embodiments of the present application will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the present application that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed in the present application. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present application indicated by the following claims.
It should be understood that the present application is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present application is limited only by the appended claims.
The above are merely preferred embodiments of the present application and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of protection of the present application.