CN110109910A - Data processing method and system, electronic equipment and computer readable storage medium - Google Patents
Data processing method and system, electronic equipment and computer readable storage medium
- Publication number
- CN110109910A (application CN201810014799.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- field
- row
- column
- storage table
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/221—Column-oriented storage; Management thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/328—Management therefor
Abstract
The application provides a data processing method and system, an electronic device, and a computer-readable storage medium. The method comprises: determining, among the column fields of a column storage table, the first column fields that need to be combined, where the row fields of the column storage table represent documents and the column fields represent attributes; and merging the first column fields into a composite column field and merging the data of the first column fields under each row field, the merged data serving as the data of the composite column field under that row field. The scheme retains the efficient data updates of column storage while achieving a read efficiency similar to row storage when the data of a single document is read, thereby combining the advantages of both approaches and effectively improving the efficiency of data processing.
Description
Technical field
This application relates to the field of big data, and in particular to a data processing method and system, an electronic device, and a computer-readable storage medium.
Background
In data acquisition, the data fetched by a crawler is usually defined as a document (Document). A document often has many attributes. Taking a web document as an example, its attributes include the article title (Title), the article content (Body), the article click count (click), the words the document contains, the number of times each word occurs in the document (term frequency), and the positions where it occurs (for example, offsets relative to the start of the document). In practice, to improve the efficiency and effectiveness of data acquisition, the crawled data can be built into an inverted index, which supports fast retrieval, and a forward index, which is convenient for data analysis (for example, computing the relevance between a document and a query).
The forward index takes the document as its dimension: the attributes of each document are extracted and stored. There are two main storage modes: row storage and column storage. In the above scenario, the advantage of row storage is that reading the attribute data of a single document is efficient, but its disadvantage is that partial data updates are inefficient. The advantage of column storage is that data writes are efficient and individual fields can be updated quickly, but its disadvantage is that reading the attribute data of a single document is inefficient and requires multiple reads.
In practice, for example in search scenarios, both reads of a single document's data and updates of some attribute data occur frequently, and neither storage mode satisfies both demands at once. The inefficient reads of column storage cannot meet high-performance requirements, and the inefficient partial updates of row storage cannot guarantee timeliness.
Summary of the invention
The application provides a data processing method and system, an electronic device, and a computer-readable storage medium, to solve the problem that existing data processing schemes cannot process data efficiently across different scenarios.
A first aspect of the application provides a data processing method, comprising: determining, among the column fields of a column storage table, the first column fields that need to be combined, where the row fields of the column storage table represent documents and the column fields represent attributes; and merging the first column fields into a composite column field and merging the data of the first column fields under each row field, the merged data serving as the data of the composite column field under that row field.
A second aspect of the application provides a data processing system, comprising: a grouping module, configured to determine, among the column fields of a column storage table, the first column fields that need to be combined, where the row fields of the column storage table represent documents and the column fields represent attributes; and a merging module, configured to merge the first column fields into a composite column field and to merge the data of the first column fields under each row field, the merged data serving as the data of the composite column field under that row field.
A third aspect of the application provides an electronic device, comprising: at least one processor and a memory; the memory stores computer-executable instructions; and the at least one processor executes the computer-executable instructions stored in the memory to perform the method described above.
A fourth aspect of the application provides a computer-readable storage medium storing program instructions that, when executed by a processor, implement the method described above.
With the data processing method and system, electronic device, and computer-readable storage medium provided by the application, document attribute data is stored in column storage, where the column fields of the column storage table represent attributes and the row fields represent documents. The scheme determines, among the column fields of the column storage table, the column fields that need to be combined, merges them into a composite column field, and merges their data to obtain the data of the composite column field, so that attribute data is merged in groups. The scheme thus retains the efficient data updates of column storage and, when the data of a single document is read, achieves a read efficiency similar to row storage, combining the advantages of both approaches and effectively improving the efficiency of data processing.
Brief description of the drawings
To describe the technical solutions in the embodiments of the application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the application; a person of ordinary skill in the art can obtain other drawings from them.
Fig. 1A and Fig. 1B are structural examples of a row storage table and a column storage table, respectively;
Fig. 2A to Fig. 2C are schematic flowcharts of a data processing method provided by Embodiment 1 of the application;
Fig. 3 is an example diagram of a data processing method provided by an embodiment of the application;
Fig. 4 is a schematic flowchart of a data processing method provided by Embodiment 2 of the application;
Fig. 5 is a schematic flowchart of a data processing method provided by Embodiment 3 of the application;
Fig. 6A and Fig. 6B are schematic structural diagrams of the data processing system provided by Embodiment 4 of the application;
Fig. 7 is a schematic structural diagram of a data processing system provided by Embodiment 5 of the application;
Fig. 8 is a schematic structural diagram of a data processing system provided by Embodiment 6 of the application;
Fig. 9 is an example architecture diagram of a data processing system provided by Embodiment 7;
Fig. 10 is a schematic structural diagram of the data processing system provided by Embodiment 8 of the application.
Detailed description of the embodiments
To make the purposes, technical solutions, and advantages of the embodiments of the application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the application.
Fig. 1A and Fig. 1B show structural examples of a row storage table and a column storage table, respectively. In the figures, the black-background parts of a table are the row fields and column fields, and the white-background parts are the data. Row denotes a row field and Column denotes a column field. The difference is that a row storage table is organized in units of rows, whereas a column storage table is organized in units of column data sets, or column families (Column family). For a row storage table, reading and writing follow the same procedure: each row is processed in turn, from its first column to its last column; once that row has been read or written, the next row is processed from its first column to its last column, and so on until all the data to be read or written has been processed. For a column storage table, data can be read in units of columns, fetching one or more columns at a time; when data is written, it must be split according to the content type that each column represents, and each piece is appended to the end of the corresponding column.
In this scheme, for example, the row fields (Row) of a storage table represent documents and the column fields (Column) represent attributes. The attributes here are attributes of the documents, and they can be set according to the needs of data statistics and analysis; they may include, but are not limited to, title, author, type (for example, comedy, tragedy, horror), and click count. Data in the figures denotes data, and labels indicate which data belongs to the same document. For example, Data1-1, Data1-2, ..., Data1-M in Row1 are the data of one document (for example, document 1) under different attributes (for example, attribute 1 represented by Column1, attribute 2 represented by Column2, ..., attribute M represented by ColumnM). Likewise, DataN-1, DataN-2, ..., DataN-M in RowN are the data of one document (for example, document N) under different attributes.
Continuing the example above, suppose the data of document 1 needs to be read from the row storage table. Following the read procedure of a row storage table, the entire row corresponding to row field Row1, which represents document 1, is read. Reads from a row storage table are therefore efficient, and because the data of a single document is stored sequentially, the probability of cache misses is also reduced. However, when some fields of document 1 need to be updated in the row storage table, the data of the column fields under document 1 has to be traversed. For example, to update the data of attribute 5 of document 1, Column1, Column2, and so on under Row1 must be traversed up to Column5 before the data of Column5 under Row1 (Data1-5 in the figure) can be updated. Partial field updates in a row storage table are therefore delayed, which makes updates inefficient.
Still with the example above, for a column storage table, suppose the data of document 1 needs to be read. The data of every column field has to be traversed, the part belonging to document 1 extracted from each, and the parts spliced together to obtain the data of document 1. Reads from a column storage table are therefore inefficient: reading the data of a single document requires traversing the data of each column field in turn. The advantage of a column storage table is that partial updates are efficient. When some field of document 1 (for example, attribute 5) needs to be updated, only the data under document 1 in the data of Column5, which represents attribute 5, has to be located and updated.
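The contrast between the two layouts can be sketched in a few lines of Python. This is only an illustrative in-memory model, not the storage format of the application; the table contents and the names `row_table`, `column_table`, `ATTRS`, and the helper functions are assumptions introduced for the example.

```python
# Hypothetical in-memory layouts; all names and values are illustrative.
ATTRS = ["title", "author", "clicks"]  # position of each attribute in a row

# Row storage: one record per document, attributes stored side by side.
row_table = {
    "doc1": ["Title A", "Alice", 10],
    "doc2": ["Title B", "Bob", 25],
}

# Column storage: one mapping per attribute, keyed by document.
column_table = {
    "title": {"doc1": "Title A", "doc2": "Title B"},
    "author": {"doc1": "Alice", "doc2": "Bob"},
    "clicks": {"doc1": 10, "doc2": 25},
}


def read_doc_from_rows(doc_id):
    # A single lookup returns every attribute of the document.
    return dict(zip(ATTRS, row_table[doc_id]))


def read_doc_from_columns(doc_id):
    # One lookup per column, then the pieces are spliced together.
    return {attr: column[doc_id] for attr, column in column_table.items()}


def update_clicks_in_columns(doc_id, clicks):
    # A partial update touches only the column of the attribute concerned.
    column_table["clicks"][doc_id] = clicks


update_clicks_in_columns("doc1", 11)
```

In this model the column-store read performs as many lookups as there are columns, which is the cost that merging columns is meant to reduce.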
To address the fact that neither the row storage table nor the column storage table supports both efficient reads and efficient partial updates, Embodiment 1 of the application provides a data processing method, whose schematic flowchart is shown in Fig. 2A. The method improves the efficiency of partial data updates while keeping reads efficient. Specifically, the data processing method includes:
101. Determine, among the column fields of the column storage table, the first column fields that need to be combined, where the row fields of the column storage table represent documents and the column fields represent attributes;
102. Merge the first column fields into a composite column field, and merge the data of the first column fields under each row field, the merged data serving as the data of the composite column field under that row field.
In practice, the method may be executed by a data processing system. The data processing system may be implemented as software code, or as a medium storing the relevant executable code, for example, a USB flash drive; alternatively, it may be a physical device in which the relevant executable code is integrated or installed, for example, a chip, a smart terminal, a computer, a database, a server, or another electronic device.
As an example from a practical scenario: the collected document attributes are stored in a column storage table, whose row fields represent documents and whose column fields represent attributes. The column fields that need to be combined are determined from the column storage table; different screening conditions can be set as required to determine them. Once they are determined, the column fields are merged into a composite column field on the one hand, and the data in those column fields is merged on the other. Specifically, the data of these column fields under each row field is merged, and the merged data is the data of the composite column field under that row field. The scheme merges the attributes to be combined into one composite attribute for storage, and the storage mode remains a column storage table. It thereby combines the advantages of row and column storage and meets the timeliness requirements of both scenarios described above: reads of a single document's data and partial data updates.
For a more intuitive understanding, the scheme is illustrated with Fig. 3. As shown in Fig. 3, a column storage table has three column families, which represent attribute 1 (attr1), attr2, and attr3, respectively. When the data of document 1 needs to be read, attr1, attr2, and attr3 must be traversed in turn to obtain the data of document 1 in the different columns, and the pieces are then spliced together to obtain the data of document 1. Under this scheme, suppose attr2 and attr3 need to be merged: attr2 and attr3 are merged into composite attribute attr4, and the data of attr2 and attr3 is merged into the data of attr4. Subsequently, when the data of document 1 needs to be read, the data of both attributes is obtained together from the data of composite attribute attr4, so only two columns (attr1 and attr4) need to be traversed to obtain the data of document 1. Because reading from composite attribute attr4 requires traversing only that one column, reading is considerably faster than reading the same data from separate columns, which improves the efficiency of reading a single document's data.
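The Fig. 3 example can be sketched as follows. This is a minimal illustration under stated assumptions: the table is modeled as a dictionary of columns, every document has a value in each column being merged, and the function name `merge_columns` and the sample values are hypothetical.

```python
def merge_columns(table, first_columns, composite_name):
    """Return a new table in which `first_columns` are replaced by one
    composite column holding their per-document data as a tuple."""
    doc_ids = table[first_columns[0]].keys()
    composite = {
        doc_id: tuple(table[name][doc_id] for name in first_columns)
        for doc_id in doc_ids
    }
    # Keep every column that was not merged, then add the composite column.
    merged = {name: col for name, col in table.items() if name not in first_columns}
    merged[composite_name] = composite
    return merged


table = {
    "attr1": {"doc1": 7, "doc2": 3},
    "attr2": {"doc1": "Title A", "doc2": "Title B"},
    "attr3": {"doc1": "Alice", "doc2": "Bob"},
}
merged = merge_columns(table, ["attr2", "attr3"], "attr4")
```

After the merge, reading document 1 takes two column lookups (attr1 and attr4) instead of three, while an update to attr1 still touches only its own column.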
Specifically, the scheme first determines the column fields that need to be combined. They can be determined as required, for example, specified by a user or determined through data analysis. Note that the attribute represented by a column field to be combined may be a single attribute or a composite attribute; that is, composite attributes can be merged further.
In one possible implementation, column fields with similar access frequencies can be merged. Correspondingly, as shown in Fig. 2B, on the basis of any embodiment, 101 may specifically include:
1011. Count the access frequency of each column field in the column storage table;
1012. Take the column fields with similar access frequencies as the first column fields.
Specifically, the access frequency reflects how often a column is read. After attributes with similar access frequencies are merged, their data can be obtained together in a single traversal of one column, which further improves the efficiency of reading a single document's data. "Similar" here covers both identical and approximate; for example, the access frequencies of the attributes are identical, or they differ by no more than a preset range.
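One way to realize steps 1011 and 1012 is sketched below. The text does not prescribe a grouping rule, so the rule used here is an assumption: columns are sorted by access count, and a column joins the current group when its count is within `tolerance` of the group's first member. The function name and the sample counts are hypothetical.

```python
def group_by_access_frequency(access_counts, tolerance):
    """Group columns whose access counts lie within `tolerance` of the
    first (lowest) count in their group; singleton groups are dropped."""
    groups = []
    for name, count in sorted(access_counts.items(), key=lambda item: item[1]):
        if groups and count - groups[-1][0][1] <= tolerance:
            groups[-1].append((name, count))
        else:
            groups.append([(name, count)])
    # Only groups with at least two columns are candidates for merging.
    return [[name for name, _ in group] for group in groups if len(group) > 1]


access_counts = {"title": 980, "author": 1000, "clicks": 40, "body": 990}
first_columns = group_by_access_frequency(access_counts, tolerance=50)
```

With these sample counts, title, body, and author form one group, and clicks is left as a separate column.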
In another possible implementation, column fields whose data is updated infrequently can be merged. Correspondingly, as shown in Fig. 2C, on the basis of any embodiment, 101 may specifically include:
1013. Count the data update frequency of each column field in the column storage table;
1014. Take the column fields whose data update frequency is below a preset frequency as the first column fields.
In practice, some attributes, such as the document title and the document author, are updated infrequently. Attributes that do not need frequent updates can be merged, which improves the efficiency of reading a single document's data while preserving the efficiency of partial data updates.
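Steps 1013 and 1014 can be sketched in the same style. The update log, the threshold, and the function name `rarely_updated_columns` are assumptions made for illustration; a real system would gather update statistics in its own way.

```python
from collections import Counter


def rarely_updated_columns(columns, update_log, threshold):
    """Return the columns that were updated fewer than `threshold` times."""
    counts = Counter(update_log)  # columns never updated count as 0
    return [name for name in columns if counts[name] < threshold]


columns = ["title", "author", "clicks"]
# Hypothetical log: one entry per update, naming the column that changed.
update_log = ["clicks"] * 5 + ["title"]
first_columns = rarely_updated_columns(columns, update_log, threshold=2)
```

Here title and author are selected for merging, and the frequently updated clicks column stays separate so that its updates remain cheap.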
Note that the above ways of determining the column fields to be combined are only examples; in practice, the column fields to be combined can also be determined in other ways.
Correspondingly, after the column fields to be combined are determined, their data needs to be merged, and likewise there are several ways to do so. In practice, merging the data of the first column fields must take into account that the data of each attribute has to be identifiable when the merged data is later read. Several implementations are therefore described below as examples.
In one implementation, on the basis of any embodiment, merging the data of the first column fields under each row field in 102, with the merged data serving as the data of the composite column field under that row field, may specifically include:
According to the preset rank of each first column field, sorting the data of the first column fields under each row field and then merging it, the merged data serving as the data of the composite column field under that row field.
Specifically, in this implementation a rank is preset for each column field. The rank is the position of that attribute's data within the merged data. Taking Fig. 3 again, the rank of attr2 can be preset as first and the rank of attr3 as second. When attr2 and attr3 need to be merged, they are merged into composite attribute attr4 on the one hand, and their data is merged on the other. Under this implementation, the data of attr2 and attr3 under each row field is ordered by the preset ranks. As shown in Fig. 3, taking the data under document 1 as an example, the data of attr2 under document 1 comes first (rank 1) and the data of attr3 under document 1 comes second (rank 2); after ordering, the two are merged into a single piece of data, which is the data of attr4 under document 1. The data of attr4 under all documents is obtained in the same way, which completes the merging of the attr2 and attr3 columns. Storing the data of a composite attribute in a fixed order effectively improves read efficiency, achieves fast reads similar to row storage, and can effectively improve the efficiency of online retrieval.
This implementation merges data in a preset order, so the merged data can be accessed sequentially: within a document's composite attribute data, the data of each attribute is stored in order. Reading a document's composite attribute data is therefore a sequential access, which achieves the efficient reads of row storage.
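The rank-based merge can be illustrated with a short sketch. It assumes the preset ranks are kept in a dictionary (`PRESET_RANK`, a hypothetical name) and that merged data is a tuple; the application does not fix either representation.

```python
# Hypothetical preset ranks: attr2 is stored first, attr3 second.
PRESET_RANK = {"attr2": 0, "attr3": 1}


def merge_in_order(values_by_attr):
    """Order one document's attribute data by preset rank and merge it."""
    ordered_attrs = sorted(values_by_attr, key=lambda attr: PRESET_RANK[attr])
    return tuple(values_by_attr[attr] for attr in ordered_attrs)


def read_by_rank(merged, attr):
    """Recover one attribute's data from its fixed position."""
    return merged[PRESET_RANK[attr]]


# The input order does not matter; the preset rank decides the layout.
attr4_doc1 = merge_in_order({"attr3": "Alice", "attr2": "Title A"})
```

Because each attribute always sits at the same position, a reader needs only the rank table, not per-value markers, to split the merged data again.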
In another implementation, on the basis of any embodiment, merging the data of the first column fields under each row field in 102, with the merged data serving as the data of the composite column field under that row field, may specifically include:
Adding the identifier of the column field to the data of each first column field under each row field;
Merging the identified data of the first column fields under each row field, the merged data serving as the data of the composite column field under that row field.
Specifically, in this implementation an identifier is added to the data of each column field to indicate the attribute to which the data belongs. Taking Fig. 3 again, when attr2 and attr3 need to be merged, they are merged into composite attribute attr4 on the one hand, and their data is tagged on the other. Under this implementation, the attr2 identifier is added to the data of attr2 under each row field, and likewise the attr3 identifier is added to the data of attr3 under each row field; the two parts are then merged to obtain the data of composite attribute attr4. When the data in attr4 is subsequently read, the attribute of each piece of data can be determined from the identifier it carries.
By adding attribute identifiers to the data to be merged, this implementation makes the data of different attributes identifiable within the merged data and improves the flexibility of data merging.
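The identifier-based merge can be sketched in the same way. Representing the merged data as a list of (identifier, value) pairs is an assumption for illustration; the function names are hypothetical.

```python
def merge_with_tags(values_by_attr):
    """Tag each value with the identifier of its column field, then merge."""
    return [(attr, value) for attr, value in values_by_attr.items()]


def read_by_tag(merged, attr):
    """Find the value carrying the requested identifier."""
    for tag, value in merged:
        if tag == attr:
            return value
    raise KeyError(attr)


attr4_doc1 = merge_with_tags({"attr2": "Title A", "attr3": "Alice"})
```

Compared with the rank-based merge, the layout no longer has to be fixed in advance, at the cost of storing an identifier alongside every value.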
In practice, to facilitate subsequent reads, description information of the composite attribute can also be recorded. The description information is information related to the composite attribute, for example, the attributes that the composite attribute represents and the way its data was merged. Optionally, in one implementation, the data processing method further includes:
Updating the description file of the column storage table, where the description file records each column field of the column storage table and the attribute that each column field represents.
Specifically, updating here includes, but is not limited to, creating, deleting, and modifying. For example, suppose the column fields corresponding to certain attributes (including, but not limited to, column fields corresponding to single attributes and/or composite column fields produced by merging attributes) are merged. After the merge, the resulting column field and the attributes it represents need to be created in the description file; the attributes it represents include the attributes represented by each of the merged column fields. For example, column field column4 is produced by merging column2 and column3, where column2 represents the document title and column3 represents the document author. After the data of column4 is generated, the attributes that column4 represents, namely document title and document author, need to be created in the description file of the column storage table.
Further, the merged column fields and their description information can be removed from the description file of the column storage table to save storage space. Optionally, on the basis of this implementation, updating the description file of the column storage table includes:
Deleting the first column fields recorded in the description file of the column storage table and the attributes they represent;
Adding the composite column field and the attributes it represents to the description file of the column storage table, where the attributes represented by the composite column field include the attributes represented by each first column field.
In general, in this embodiment a description file is created and maintained for the column storage table. When the description information of a column field in the column storage table changes, for example, at least two column fields are merged, the attribute that a column field represents changes, or a column field is deleted, the description file of the column storage table needs to be updated. The description file records the description information of each column field of the column storage table, which includes, but is not limited to, the attributes that the column field represents. For a composite column field obtained by merging, the description information also records how the data was merged, for example, the rank of each attribute's data in the merge, so that the data of the different attributes can be located and spliced when the composite column field is read. Taking Fig. 3 again, the description file of the column storage table shown in the figure can record the attributes represented by attr1 and attr4, where the attributes represented by attr4 include attr2 and attr3. The description file also records the merge mode of the data of attr2 and attr3, which is used to identify the data of attr2 and attr3 within the data of attr4. The merge mode can take many concrete forms, which this embodiment does not limit.
Maintaining the description file of the column storage table maintains the correspondence between column fields and attributes, and the attributes represented by each column field are updated promptly as column fields are merged. This keeps the stored data up to date and makes subsequent reads more convenient and accurate.
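The description-file update for the column2/column3 example can be sketched as follows. The description file is modeled here as a dictionary, and the keys `attributes` and `merge_mode`, the value `"ordered"`, and the function name `record_merge` are all assumptions; the application leaves the concrete format open.

```python
def record_merge(description, first_columns, composite_name, merge_mode):
    """Return an updated description: the merged column fields are deleted
    and the composite column field is added with the attributes it covers."""
    updated = {
        name: info for name, info in description.items()
        if name not in first_columns
    }
    attributes = []
    for name in first_columns:
        # A merged column may itself be composite, so extend rather than append.
        attributes.extend(description[name]["attributes"])
    updated[composite_name] = {"attributes": attributes, "merge_mode": merge_mode}
    return updated


description = {
    "column1": {"attributes": ["clicks"]},
    "column2": {"attributes": ["title"]},
    "column3": {"attributes": ["author"]},
}
updated = record_merge(description, ["column2", "column3"], "column4", "ordered")
```

A reader of column4 can then consult the recorded attribute list and merge mode to split a merged value back into title and author.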
With the data processing method provided by this embodiment, document attribute data is stored in column storage, where the column fields of the column storage table represent attributes and the row fields represent documents. The scheme determines, among the column fields of the column storage table, the column fields that need to be combined, merges them into a composite column field, and merges their data to obtain the data of the composite column field, so that the data of different attributes is merged in groups. The scheme thus retains the efficient data updates of column storage and, when the data of a single document is read, achieves a read efficiency similar to row storage, combining the advantages of both approaches and effectively improving the efficiency of data processing.
In practice, different types of scenarios collect different types of documents. To facilitate data management, the attribute data of the documents can also be divided by type and managed separately. Correspondingly, Fig. 4 is a schematic flowchart of a data processing method provided by Embodiment 2 of the application. Referring to Fig. 4, this embodiment also provides a data processing method, which further manages different types of document attribute data separately. Specifically, on the basis of any embodiment, the data processing method further includes:
201. From the column storage tables corresponding to the various types, find the first column storage table, which corresponds to the type of the document to be written;
202. According to the attribute represented by each column field in the first column storage table, extract the corresponding attribute data from the document to be written;
203. Write the attribute data into the first column storage table.
Specifically, different types of scenarios, such as shopping, education, travel, and daily life, collect different types of data. For example, a shopping scenario mainly needs statistics related to a user's shopping behavior, such as purchase history and ad click information; an education scenario mainly needs the user's personal information, such as age, profession, and educational background; and a travel scenario mainly needs the user's historical location data. It can be understood that the attributes of data collected in scenarios of the same type tend to be similar. This implementation therefore classifies and manages the document data collected in different types of scenarios by type, and establishes a column storage table for each type. For example, separate column storage tables can be established for types such as shopping and education. Document attribute data collected from shopping websites or applications such as Taobao and JD.com is stored in the column storage table corresponding to the shopping type, and document attribute data collected from education websites or applications such as Hujiang Online School is stored in the column storage table corresponding to the education type.
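As an illustration only, steps 201 to 203 could be sketched as follows. The dictionary layout, the attribute names, and the function `write_document` are assumptions of this sketch and are not taken from the application.

```python
# Hypothetical layout: one column storage table per document type.
# Each table maps a column field (attribute) to a list of values,
# with one list position per row field (document).
tables = {
    "shopping": {"purchase_history": [], "ad_clicks": []},
    "education": {"age": [], "profession": [], "education_level": []},
}

def write_document(doc_type, document):
    table = tables[doc_type]                              # step 201: find the table for this type
    row = {attr: document.get(attr) for attr in table}    # step 202: extract matching attributes
    for attr, value in row.items():                       # step 203: write into the table
        table[attr].append(value)
    return len(next(iter(table.values()))) - 1            # position of the row field just written

row_id = write_document(
    "shopping",
    {"purchase_history": ["book"], "ad_clicks": 3, "age": 30},
)
```

In this sketch the `age` value is ignored because the shopping table has no such column field, and the education table is not touched at all.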
This implementation establishes a column storage table for each type and assigns document data with similar attributes to the same type. Storing and managing data with the per-type column storage table as the unit isolates different types of data from one another. For example, the update frequency of document attribute data differs by type: document attribute data of the shopping type may require frequent updates to data such as a user's ad click count, whereas document attribute data of the education type is usually relatively stable and does not need frequent updates within a given period. If the document attribute data of all types were stored together, an update to shopping data would spread to other types of data, reducing update efficiency and making erroneous operations more likely. This implementation therefore stores and manages the document attribute data of different types separately, which prevents the types from interfering with one another and improves the efficiency and accuracy of data processing. In other words, with the management mode of this implementation, updating document attribute data with different update frequencies under different types does not affect the data in the entire storage system, which reduces data processing overhead.
In this embodiment, the document attribute data of each type is managed in the form of a single column storage table, and the column storage tables of the different types are independent of one another. In addition, the column storage table of a given type can itself be further partitioned for management.
Optionally, in one implementation, on the basis of Embodiment 2, step 203 may specifically include:
2031. Determining, according to the current row field to be written of the first column storage table, the first version to which the row field to be written belongs from among the versions corresponding to the first column storage table, where different versions represent different row field ranges of the first column storage table;
2032. Detecting whether the scale of data written in the first version has reached a preset saturation condition; if so, establishing a second version and writing the attribute data into a row field represented by the second version; otherwise, writing the attribute data into the row field to be written.
Specifically, different versions correspond to different row field ranges. For example, version 0 corresponds to row field 1 through row field 10 of the column storage table, and version 1 corresponds to row field 11 through row field 20. The row field to be written is the row field into which the document attribute data currently to be written will be placed. It can be determined in several ways. For example, when rows are written sequentially, the first row field that currently holds no data can be found and used as the current row field to be written. The version to which the row field to be written belongs is then determined. Continuing the example above, if the current row field to be written is row field 19, it belongs to version 1. In this implementation, writing data refers to writing the data of a new document, that is, the data of a document that does not yet exist in the current column storage table.
In this implementation, the data in the column storage table of each type is stored and managed in the form of versions. Versions can be divided according to timeliness, that is, new versions are established in real time according to the data that needs to be written. To reduce the resources consumed by data maintenance and updates, a new version can be created only when the data written to the current version reaches a certain scale, that is, when a preset saturation condition is met. The saturation condition reflects how much of the version's space is occupied. For example, the saturation condition may be that the version is full, that the proportion of row fields already written reaches a certain threshold, or that the number of remaining unwritten row fields does not exceed a certain threshold. Continuing the example above, suppose the current row field to be written is row field 19, which belongs to version 1. If the preset saturation condition requires that at least 2 row fields remain unwritten (in this example, only row field 19 and row field 20 are currently unwritten), version 2 is established and the document attribute data to be written is written into a row field of version 2. It can be understood that if all established versions, or the most recently established version, have reached the preset saturation condition, a new version is established directly and the data is written to it. In practice, to avoid documents written in the same batch being saved in different versions, which would make later data maintenance inconvenient, a DUMP operation can be initiated before the new version is created, that is, no data is written while the new version is being created.
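The version selection in steps 2031 and 2032 can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the class `Version`, the constants `ROWS_PER_VERSION` and `MIN_FREE_ROWS`, and the function `write` are hypothetical names, and the saturation rule used here (the remaining unwritten row fields do not exceed a threshold) is only one of the conditions the text mentions.

```python
from dataclasses import dataclass, field

ROWS_PER_VERSION = 10   # assumed size of the row field range covered by one version
MIN_FREE_ROWS = 2       # assumed rule: saturated when unwritten row fields <= this value

@dataclass
class Version:
    number: int
    first_row: int                              # first row field of this version's range
    rows: dict = field(default_factory=dict)    # row field -> attribute data

    def saturated(self):
        return ROWS_PER_VERSION - len(self.rows) <= MIN_FREE_ROWS

def write(versions, attribute_data):
    current = versions[-1]                      # version owning the row field to be written
    if current.saturated():                     # step 2032: saturation check
        current = Version(current.number + 1, current.first_row + ROWS_PER_VERSION)
        versions.append(current)
    row_field = current.first_row + len(current.rows)   # rows are written sequentially
    current.rows[row_field] = attribute_data
    return current.number, row_field

versions = [Version(0, 1)]
results = [write(versions, {"doc": i}) for i in range(9)]
```

With these assumed constants, the first eight documents land in row fields 1 to 8 of version 0, and the ninth triggers the creation of version 1 and is written to row field 11.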
In addition, storage space can be freed by recycling versions. Optionally, on the basis of Embodiment 2, the method may further include:
204. Detecting whether a third version exists among the versions corresponding to the first column storage table, where the amount of valid data stored in the row fields represented by the third version is below a preset threshold, and valid data is data that has not been deleted;
205. Counting, among the row fields represented by the third version, a first quantity of row fields that store valid data, and determining a fourth version from the versions corresponding to the first column storage table, where the number of unwritten row fields among the row fields represented by the fourth version is not less than the first quantity;
206. Transferring the valid data stored in the third version to the fourth version, and marking the third version as an invalid version.
Specifically, this implementation merges versions dynamically according to the scale of valid data in each version. In practice, data updates may produce invalid data, for example data that has been deleted. Accordingly, when a version stores little valid data, for example less than a preset threshold, the version can be merged into another version, which frees the storage space occupied by invalid data in that version and improves the efficiency of subsequent data reading and retrieval. Optionally, version merging can be triggered in several ways. For example, step 204 can be executed periodically to scan whether each version in the column storage table needs to be merged to remove invalid data. In practice, a version that has been merged into another version can be removed, or it can be marked as an invalid version so that subsequent reads and retrievals do not need to scan the data in it. Optionally, after the invalid data in an invalid version has been cleared, the version can be used for writing data again, and once data has been written the version is re-marked as a valid version.
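A rough sketch of the version recycling in steps 204 to 206 is given below. The data layout (a dictionary of versions whose `rows` map a slot number to data, with `None` standing for deleted data) and the function `recycle_versions` are assumptions made for illustration only.

```python
def recycle_versions(versions, capacity, threshold):
    """versions: version number -> {"rows": {slot: data or None}, "valid": bool}.
    None marks deleted (invalid) data; capacity is the number of row fields per version."""
    for number, source in versions.items():
        live = [d for d in source["rows"].values() if d is not None]
        # Step 204: only valid, already-written versions below the threshold qualify.
        if not source["valid"] or not source["rows"] or len(live) >= threshold:
            continue
        for other, target in versions.items():
            free_slots = [s for s in range(capacity) if s not in target["rows"]]
            # Step 205: the target needs at least as many unwritten row fields.
            if other == number or not target["valid"] or len(free_slots) < len(live):
                continue
            for slot, data in zip(free_slots, live):     # step 206: move the valid data
                target["rows"][slot] = data
            source["rows"].clear()
            source["valid"] = False                      # mark the source as an invalid version
            break

versions = {
    0: {"rows": {0: "a", 1: None, 2: None}, "valid": True},
    1: {"rows": {0: "x", 1: "y", 2: "z"}, "valid": True},
}
recycle_versions(versions, capacity=5, threshold=2)
```

Here version 0 holds a single valid value, so it is merged into version 1 and marked invalid. A real system would presumably also update the index entries of the moved data, which this sketch omits.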
The data processing method provided by this embodiment stores document attribute data of different types in the corresponding column storage tables. The column storage tables of different types are independent of one another, which prevents document attribute data of different types from interfering with one another and improves the efficiency and accuracy of data processing.
In addition, to make it easier to retrieve data in the column storage table, an index can be established and maintained for it. Accordingly, Fig. 5 is a schematic flowchart of a data processing method provided by Embodiment 3 of this application. Referring to Fig. 5, this embodiment also provides a data processing method, which further establishes and maintains an index for the column storage table. Specifically, on the basis of any of the foregoing embodiments, the data processing method further includes:
301. Establishing an index for the column storage table, where the index includes the storage address of the data of each column field under each row field of the column storage table.
The index can take many forms. Preferably, a k-v index (primary key index) is used, so that the storage location of data can be located quickly through the primary key (pk). Specifically, the index of the column storage table contains the storage address of the data in each cell of the column storage table, that is, the storage address of the data of each column field under each row field. The format of the storage address can be determined by the format type of the data. Preferably, for fixed-length data, whose length is constant, the storage address only needs to record the address of the first data unit. For variable-length data, the storage address needs to record both the address of the first data unit and the length of the variable-length data.
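The two address formats described above can be illustrated with the following sketch. The names `Address`, `FIXED_WIDTH`, `append_value`, and `read_value`, as well as the use of a single in-memory byte buffer as the storage area, are assumptions of this example.

```python
from typing import NamedTuple, Optional

class Address(NamedTuple):
    offset: int                    # storage address of the first byte of the value
    length: Optional[int] = None   # recorded only for variable-length data

FIXED_WIDTH = 4  # assumed width, in bytes, of one fixed-length value

def append_value(blob: bytearray, value: bytes, fixed: bool) -> Address:
    offset = len(blob)
    blob.extend(value)
    # Fixed-length data needs only the start address; variable-length data also needs a length.
    return Address(offset) if fixed else Address(offset, len(value))

def read_value(blob: bytearray, addr: Address) -> bytes:
    length = FIXED_WIDTH if addr.length is None else addr.length
    return bytes(blob[addr.offset:addr.offset + length])

blob = bytearray()
index = {}  # k-v index: (primary key of the row field, column field) -> Address
index[("doc1", "clicks")] = append_value(blob, (7).to_bytes(FIXED_WIDTH, "big"), fixed=True)
index[("doc1", "title")] = append_value(blob, b"column store", fixed=False)
```

Looking up `("doc1", "title")` in the index yields an address with both an offset and a length, while the entry for the fixed-length `clicks` value carries the offset only.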
In practice, although each individual item of document attribute data is usually small, the total amount of document attribute data is usually very large. Performing a data update every time a piece of data changes would require scheduling and consuming substantial processing resources. Therefore, preferably, taking data deletion as an example, on the basis of the above implementation, the method further includes:
302. Searching the index corresponding to the column storage table to which the data to be deleted belongs for the storage address of the data to be deleted;
303. Adding a first tag to the storage address of the data to be deleted in the index, to indicate that the data to be deleted is invalid.
Specifically, when data in the column storage table needs to be deleted, the storage address of the data can be found in the index of the column storage table, and a first tag indicating that the data is invalid is added to that storage address in the index. In other words, in the index of the column storage table, the first tag is added to the index entry of the data to be deleted to indicate that the data is invalid. The effect of deletion is achieved without physically deleting the data. The invalid data is actually removed later, when a unified update operation is triggered, for example when the amount of invalid data reaches a certain level and all current invalid data is cleared together.
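The tag-based deletion in steps 302 and 303 amounts to a logical delete followed by a deferred cleanup. A minimal sketch is shown below; the index layout, the `invalid` flag standing in for the first tag, and the `PURGE_THRESHOLD` trigger are assumptions of this example.

```python
index = {  # (row field, column field) -> index entry
    ("doc1", "title"): {"address": 0, "invalid": False},
    ("doc2", "title"): {"address": 16, "invalid": False},
}
PURGE_THRESHOLD = 2  # assumed number of invalid entries that triggers the unified cleanup

def delete(key):
    index[key]["invalid"] = True      # steps 302-303: tag the address; the data stays in place

def lookup(key):
    entry = index.get(key)
    return None if entry is None or entry["invalid"] else entry["address"]

def purge_if_needed():
    dead = [k for k, e in index.items() if e["invalid"]]
    if len(dead) < PURGE_THRESHOLD:
        return 0                      # not enough invalid data yet; defer the cleanup
    for k in dead:
        del index[k]                  # a real system would also reclaim the stored bytes
    return len(dead)

delete(("doc1", "title"))
first_purge = purge_if_needed()       # below the threshold, nothing is removed
visible = lookup(("doc1", "title"))   # tagged data is no longer visible to readers
delete(("doc2", "title"))
second_purge = purge_if_needed()      # threshold reached, both entries are removed
```

The design choice illustrated here is that a delete touches only one index entry, while the more expensive physical cleanup is batched.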
Another scenario is updating data that has already been stored. Accordingly, on the basis of any of the foregoing embodiments, the method further includes:
304. Searching for a second column storage table according to the document to which the data to be written belongs, where the second column storage table contains a first row field representing the document to which the data to be written belongs;
305. Updating, according to the attribute to which the data to be written belongs, the data of a second column field under the first row field in the second column storage table, where the second column field represents the attribute to which the data to be written belongs.
It can be understood that if the second column storage table can be found, the document to which the data to be written belongs is a stored document, that is, the operation is an update to stored data. If the second column storage table cannot be found, the data to be written is the data of a new document; in a scenario where this embodiment is implemented in combination with Embodiment 2, the related flow of Embodiment 2 may be involved. It should be noted that the implementations of the embodiments in this scheme can be carried out individually or in combination, provided they do not conflict, and this embodiment places no limitation on this. Specifically, in this implementation, after the document to which the data to be written belongs is determined to be a stored document, the current data under the attribute to which the data to be written belongs is updated.
That is, in practice, when existing data in the column storage table changes, the data in the column storage table needs to be updated accordingly. Data of different formats can be updated in different ways. Preferably, in one implementation, step 305 may specifically include:
3051. If the data format corresponding to the second column field is fixed-length data, updating the data of the second column field under the first row field to the data to be written;
3052. If the data format corresponding to the second column field is variable-length data, storing the data to be written as a patch file of the data of the second column field under the first row field, and, in the index corresponding to the second column storage table, recording the storage address of the data of the second column field under the first row field as the storage address of the data to be written.
Specifically, this implementation proposes different strategies for updating stored data of different formats. Fixed-length data is updated in place, with the new value replacing the old one. For variable-length data, a patch file is used: the data to be written is stored as a patch file of the data to be updated, and the storage address in the corresponding cell of the column storage table's index is updated to the storage address of the data to be written, which keeps data reading and retrieval accurate. Likewise, to save data processing resources, this implementation does not rewrite the stored data in real time when it needs to be updated; instead, it stores the data to be written as a patch file, which achieves the effect of a data update. When a unified data update is later triggered, for example when the number of patch files in the column storage table reaches a certain level, all data to be updated in the column storage table is replaced with the data in the corresponding latest patch files and all patch files are cleared. This centralizes data updates and frees storage space, avoiding the frequent resource consumption caused by real-time processing.
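The two update strategies and the deferred unified update can be sketched as follows. The class `Column`, its in-memory dictionaries standing in for stored data and patch files, and the `max_patches` trigger are assumptions made for illustration.

```python
class Column:
    def __init__(self, fixed_length):
        self.fixed_length = fixed_length
        self.base = {}      # row field -> stored value
        self.patches = {}   # row field -> latest patch value (variable-length data only)
        self.index = {}     # row field -> "base" or "patch", i.e. where the live value is

    def put(self, row, value):
        self.base[row] = value
        self.index[row] = "base"

    def update(self, row, value):
        if self.fixed_length:
            self.base[row] = value        # step 3051: replace in place
        else:
            self.patches[row] = value     # step 3052: store the new value as a patch
            self.index[row] = "patch"     # the index now points at the patch

    def get(self, row):
        store = self.patches if self.index[row] == "patch" else self.base
        return store[row]

    def compact(self, max_patches):
        if len(self.patches) >= max_patches:   # unified update trigger
            self.base.update(self.patches)     # apply the latest patches to the stored data
            for row in self.patches:
                self.index[row] = "base"
            self.patches.clear()               # release the patch storage

clicks = Column(fixed_length=True)
clicks.put("doc1", 1)
clicks.update("doc1", 2)

title = Column(fixed_length=False)
title.put("doc1", "old")
title.update("doc1", "a much longer title")
before_compaction = (title.base["doc1"], title.get("doc1"))
title.compact(max_patches=1)
```

Before compaction the old value is still stored but readers already see the patched value through the index; after compaction the patch has been folded into the stored data.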
The data processing method provided by this embodiment establishes and maintains an index for the column storage table and, on that basis, performs operations such as deleting data and updating existing data. It provides diverse data processing functions while avoiding the resource consumption caused by frequent processing.
Fig. 6A is a schematic structural diagram of a data processing system provided by Embodiment 4 of this application. Referring to Fig. 6A, the data processing system includes:
A grouping module 61, configured to determine, among the column fields of a column storage table, the first column fields that need to be combined, where the row fields of the column storage table represent documents and the column fields of the column storage table represent attributes;
A merging module 62, configured to merge the first column fields into a composite column field and to merge the data of the first column fields under each row field, the merged data serving as the data of the composite column field under that row field.
In practice, the data processing system can be implemented as software code. The data processing system can also be a medium storing the relevant executable code, for example a USB flash drive. Alternatively, the data processing system can be a physical device in which the relevant executable code is integrated or installed, for example a chip, a smart terminal, a computer, a database, a server, or various other electronic devices.
Specifically, the grouping module 61 can set different screening conditions as needed to determine the column fields that need to be combined. Once these column fields have been determined, the merging module 62 merges them into a composite column field and also merges the data in these column fields. The grouping module 61 first determines the column fields that need to be combined. They can be determined as needed, for example specified by a user, or determined on the basis of data analysis. It should be noted that the attribute represented by a column field to be combined in this scheme can be a single attribute or a composite attribute, that is, composite attributes can be merged further.
In one possible implementation, column fields with similar access frequencies can be merged. Accordingly, on the basis of any of the foregoing embodiments, the grouping module 61 may include:
A first statistics unit, configured to count the access frequency of each column field in the column storage table;
A first grouping unit, configured to take column fields with similar access frequencies as the first column fields.
Specifically, access frequency reflects how often data is read. After attributes with similar access frequencies are merged, their data can be obtained together in a single traversal of the column data, which further improves the efficiency of reading the data of a single document. "Similar" here covers both identical and approximate, for example the access frequencies of the attributes being equal, or differing from one another within a preset range.
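One possible way to group column fields by similar access frequency is sketched below. The function name `group_by_access_frequency`, the `tolerance` parameter, and the rule of comparing each field with the lowest frequency in its group are assumptions of this sketch; the application itself only requires that the frequencies be equal or within a preset range.

```python
def group_by_access_frequency(access_counts, tolerance):
    """Group column fields whose access counts differ by at most `tolerance`
    from the lowest count in the group. Returns only groups worth combining."""
    groups, current = [], []
    for name, count in sorted(access_counts.items(), key=lambda kv: kv[1]):
        if current and count - access_counts[current[0]] > tolerance:
            groups.append(current)     # close the group once the gap is too large
            current = []
        current.append(name)
    if current:
        groups.append(current)
    return [g for g in groups if len(g) > 1]

candidates = group_by_access_frequency(
    {"title": 100, "author": 95, "clicks": 5000, "body": 40},
    tolerance=10,
)
```

In this example only `author` and `title` are close enough to be treated as first column fields; `body` and `clicks` remain separate column fields.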
In another possible implementation, column fields with low data update frequencies can be merged. Accordingly, on the basis of any of the foregoing embodiments, the grouping module 61 may include:
A second statistics unit, configured to count the data update frequency of each column field in the column storage table;
A second grouping unit, configured to take column fields whose data update frequency is below a preset frequency as the first column fields.
In practice, some attributes, such as the document title and document author, are updated infrequently. Attributes that do not need frequent updates can be merged, which improves the efficiency of reading the data of a single document while preserving the update efficiency of the remaining data.
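Selecting rarely updated column fields could look like the following short sketch, in which the function `select_stable_columns` and the `max_updates` threshold (standing in for the preset frequency) are hypothetical.

```python
def select_stable_columns(update_counts, max_updates):
    # Column fields updated fewer than `max_updates` times in the observed
    # period are treated as first column fields to be combined.
    return sorted(name for name, count in update_counts.items() if count < max_updates)

stable = select_stable_columns({"title": 1, "author": 0, "ad_clicks": 900}, max_updates=10)
```

Frequently updated fields such as `ad_clicks` stay as separate column fields, so updating them does not require rewriting the merged data.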
It should be noted that the above ways of determining the column fields to be combined are only examples of this scheme. In practice, the column fields to be combined can also be determined in other ways.
Accordingly, after the column fields to be combined have been determined, their data needs to be merged, and the merging can likewise be done in several ways. In practice, when the data of the first column fields is merged, it must remain possible to identify the data corresponding to each attribute when a row's data is read. Several implementations are described below as examples.
In one implementation, on the basis of any of the foregoing embodiments, the merging module 62 may include:
A first processing unit, configured to sort the data of the first column fields under each row field according to a preset arrangement order of the first column fields and then merge it, the merged data serving as the data of the composite column field under that row field.
Specifically, in this implementation an arrangement order is preset for the column fields. The arrangement order refers to the position of each attribute's data within the merged data when the data is merged. It can be understood that storing the data of a composite attribute in a fixed order effectively improves the efficiency of data reading, achieves fast reads similar to those of row storage, and effectively improves the efficiency of online retrieval.
This implementation merges data in a preset order so that the merged data can be accessed sequentially, that is, the data of each attribute is stored sequentially within the composite attribute data of a document. When the composite attribute data of the document is read, the data is accessed sequentially, achieving the efficient reads of row storage.
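Merging in a preset order can be illustrated with the sketch below, in which the position of a value within the merged data identifies its attribute. The constant `ORDER`, the column names, and the helper functions are assumptions of this example.

```python
ORDER = ["title", "author", "language"]   # assumed preset arrangement order

def merge_row(row):
    # Composite value for one row field: attribute data placed in the preset order.
    return tuple(row[name] for name in ORDER)

def read_attribute(composite, name):
    # The position alone identifies the attribute, so no per-value tag is stored.
    return composite[ORDER.index(name)]

table = {
    "title": ["A", "B"],
    "author": ["x", "y"],
    "language": ["en", "zh"],
    "clicks": [1, 2],
}
composite = [merge_row({n: table[n][i] for n in ORDER}) for i in range(2)]
merged_table = {"title+author+language": composite, "clicks": table["clicks"]}
```

After merging, reading all three attributes of a document requires visiting a single composite column field instead of three separate ones.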
In another implementation, on the basis of any of the foregoing embodiments, the merging module 62 may include:
An identification unit, configured to add the identifier of the column field to the data of each first column field under each row field;
A second processing unit, configured to merge the identified data of the first column fields under each row field, the merged data serving as the data of the composite column field under that row field.
Specifically, in this implementation a corresponding identifier is added to the data of each column field to indicate the attribute to which the data belongs. By adding attribute identifiers to the data to be merged, this implementation makes it possible to identify the data of each attribute within the merged data and makes data merging more flexible.
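Merging with identifiers can be sketched as follows. Representing the merged data as a list of `(identifier, value)` pairs and skipping attributes that have no value are assumptions of this example, as are the function names.

```python
def merge_tagged(row, columns):
    # Each value carries the identifier of its column field, so the order is free
    # and attributes without a value can simply be left out.
    return [(name, row[name]) for name in columns if row.get(name) is not None]

def read_tagged(composite, name):
    # Find the value by its identifier rather than by its position.
    return next((value for tag, value in composite if tag == name), None)

composite = merge_tagged(
    {"title": "A", "author": None, "language": "en"},
    ["language", "title", "author"],
)
```

Compared with the fixed-order approach, this trades some storage for flexibility: columns can be merged in any order, and sparse attributes do not need placeholders.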
In practice, to facilitate subsequent data reading, description information of the composite attribute can also be recorded. The description information refers to information related to the composite attribute. Optionally, as shown in Fig. 6B, in one implementation the system further includes:
A description module 63, configured to update the description file of the column storage table, where the description file records each column field of the column storage table and the attribute represented by each column field.
Specifically, updating here includes but is not limited to operations such as creating, deleting, and modifying. Optionally, on the basis of this implementation, the description module 63 includes:
A description deletion unit, configured to delete, from the description file of the column storage table, the recorded first column fields and the attributes represented by the first column fields;
A description adding unit, configured to add, to the description file of the column storage table, the composite column field and the attributes represented by the composite column field, where the attributes represented by the composite column field include the attributes represented by each of the first column fields.
In general, this embodiment establishes and maintains a description file for the column storage table. When the description information of a column field in the column storage table changes, for example when at least two column fields are merged, the attribute represented by a column field changes, or a column field is deleted, the description file of the column storage table needs to be updated. The description file records the description information of each column field of the column storage table. The description information includes but is not limited to the attribute represented by the column field and, for a composite column field obtained by merging, the way its data was merged.
Maintaining the description file of the column storage table keeps track of the correspondence between column fields and attributes, and the attributes represented by each column field are updated promptly as column fields are combined. This keeps stored data up to date and makes subsequent data reading more convenient and accurate.
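Updating the description file after a merge can be sketched as below. The JSON layout, the `merge_mode` key, and the function `describe_merge` are hypothetical; the application does not specify a file format.

```python
import json

def describe_merge(description, merged, composite_name, merge_mode):
    # Collect the attributes of the column fields being merged.
    attrs = [a for name in merged for a in description[name]["attributes"]]
    # Deletion step: drop the entries of the individual first column fields.
    updated = {n: d for n, d in description.items() if n not in merged}
    # Adding step: record the composite column field, its attributes, and how it was merged.
    updated[composite_name] = {"attributes": attrs, "merge_mode": merge_mode}
    return updated

description = {
    "title": {"attributes": ["title"]},
    "author": {"attributes": ["author"]},
    "clicks": {"attributes": ["clicks"]},
}
description = describe_merge(description, ["title", "author"], "title_author", "ordered")
text = json.dumps(description, indent=2)   # what would be persisted as the description file
```

A reader can then consult the description file to learn that `title_author` holds the `title` and `author` attributes and that they were merged in a preset order.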
The data processing system provided by this embodiment retains the efficient data updates of column storage and, when the data of a single document is read, achieves efficient reads similar to those of row storage. It therefore combines the advantages of both approaches and effectively improves the efficiency of data processing.
In practice, to facilitate data management, Fig. 7 is a schematic structural diagram of a data processing system provided by Embodiment 5 of this application. Referring to Fig. 7, on the basis of any of the foregoing embodiments, the data processing system further includes:
A storage management module 71, configured to search, among the column storage tables corresponding to the respective types, for a first column storage table corresponding to the type of the document to be written;
The storage management module 71 is further configured to extract the corresponding attribute data from the document to be written according to the attributes represented by the column fields of the first column storage table;
The storage management module 71 is further configured to write the attribute data into the first column storage table.
This implementation establishes a column storage table for each type and assigns document data with similar attributes to the same type. Storing and managing data with the per-type column storage table as the unit isolates different types of data from one another. In this embodiment, the document attribute data of each type is managed in the form of a single column storage table, and the column storage tables of the different types are independent of one another. In addition, the column storage table of a given type can itself be further partitioned for management.
Optionally, in one implementation, on the basis of Embodiment 5, the storage management module 71 includes:
A version management unit, configured to determine, according to the current row field to be written of the first column storage table, the first version to which the row field to be written belongs from among the versions corresponding to the first column storage table, where different versions represent different row field ranges of the first column storage table;
The version management unit is further configured to detect whether the scale of data written in the first version has reached a preset saturation condition; if so, to establish a second version and write the attribute data into a row field represented by the second version; otherwise, to write the attribute data into the row field to be written.
Specifically, different versions correspond to different row field ranges. The row field to be written is the row field into which the document attribute data currently to be written will be placed, and it can be determined in several ways. In this implementation, writing data refers to writing the data of a new document, that is, the data of a document that does not yet exist in the current column storage table.
In addition, storage space can be freed by recycling versions. Optionally, on the basis of Embodiment 5, the system further includes:
A version merging module, configured to detect whether a third version exists among the versions corresponding to the first column storage table, where the amount of valid data stored in the row fields represented by the third version is below a preset threshold, and valid data is data that has not been deleted;
The version merging module is further configured to count, among the row fields represented by the third version, a first quantity of row fields that store valid data, and to determine a fourth version from the versions corresponding to the first column storage table, where the number of unwritten row fields among the row fields represented by the fourth version is not less than the first quantity;
The version merging module is further configured to transfer the valid data stored in the third version to the fourth version and to mark the third version as an invalid version.
Specifically, this implementation merges versions dynamically according to the scale of valid data in each version. When a version stores little valid data, the version can be merged into another version, which frees the storage space occupied by invalid data in that version and improves the efficiency of subsequent data reading and retrieval.
The data processing system provided by this embodiment stores document attribute data of different types in the corresponding column storage tables. The column storage tables of different types are independent of one another, which prevents document attribute data of different types from interfering with one another and improves the efficiency and accuracy of data processing.
In addition, to make it easier to retrieve data in the column storage table, an index can be established and maintained for it. Accordingly, Fig. 8 is a schematic structural diagram of a data processing system provided by Embodiment 6 of this application. Referring to Fig. 8, on the basis of any of the foregoing embodiments, the data processing system further includes:
An index management module 81, configured to establish an index for the column storage table, where the index includes the storage address of the data of each column field under each row field of the column storage table.
The index can take many forms. Preferably, a k-v index (primary key index) is used. The format of the storage address can be determined by the format type of the data. Preferably, for fixed-length data, whose length is constant, the storage address only needs to record the address of the first data unit. For variable-length data, the storage address needs to record both the address of the first data unit and the length of the variable-length data.
Preferably, taking data deletion as an example, on the basis of the above implementation, the system further includes:
A data deletion module, configured to search the index corresponding to the column storage table to which the data to be deleted belongs for the storage address of the data to be deleted;
The data deletion module is further configured to add a first tag to the storage address of the data to be deleted in the index, to indicate that the data to be deleted is invalid.
Specifically, when data in a column storage table needs to be deleted, its storage address can be found in
the table's index, and a first mark indicating that the data is invalid is added to that storage address
in the index. The deletion takes effect without physically deleting the data. When a subsequent unified
update operation is triggered, for example when the amount of invalid data reaches a certain threshold,
all current invalid data is removed together.
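The mark-then-clean-up behaviour described above can be sketched as follows. This is a hedged illustration only: `MarkedTable`, `IndexEntry`, and `cleanup_threshold` are hypothetical names, and a Python list stands in for the storage area.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class IndexEntry:
    address: int           # position of the value in the storage area
    deleted: bool = False  # the "first mark": True means the data is invalid


class MarkedTable:
    """Deletes by marking index entries; removes data only in batches."""

    def __init__(self, cleanup_threshold: int) -> None:
        self.cells: List[str] = []
        self.index: Dict[str, IndexEntry] = {}
        self.cleanup_threshold = cleanup_threshold  # assumed trigger condition

    def insert(self, pk: str, value: str) -> None:
        self.index[pk] = IndexEntry(len(self.cells))
        self.cells.append(value)

    def get(self, pk: str) -> Optional[str]:
        entry = self.index.get(pk)
        if entry is None or entry.deleted:
            return None
        return self.cells[entry.address]

    def delete(self, pk: str) -> None:
        entry = self.index.get(pk)
        if entry is None or entry.deleted:
            return
        entry.deleted = True  # mark only; the stored value is left untouched
        invalid = sum(1 for e in self.index.values() if e.deleted)
        if invalid >= self.cleanup_threshold:
            self.cleanup()

    def cleanup(self) -> None:
        # Unified removal: rebuild storage and index from the valid data only.
        live = [(pk, self.cells[e.address])
                for pk, e in self.index.items() if not e.deleted]
        self.cells, self.index = [], {}
        for pk, value in live:
            self.insert(pk, value)
```

A delete in this sketch costs one index lookup and one flag write; the more expensive rewrite of the storage area happens once per batch rather than once per deletion.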
Another scenario is updating stored data. Accordingly, on the basis of any embodiment, the system further
includes:
Data update module, configured to look up a second column storage table according to the document to which
the data to be written belongs, where the second column storage table contains a first row field
representing the document to which the data to be written belongs;
The data update module is further configured to update, according to the attribute to which the data to be
written belongs, the data of a second column field under the first row field in the second column storage
table, where the second column field represents the attribute to which the data to be written belongs.
It can be understood that if the second column storage table can be found, the document to which the data
to be written belongs is an already stored document, so the operation is an update to stored data. If the
second column storage table cannot be found, the data to be written belongs to a new document; where this
embodiment is implemented in combination with Embodiment 5, the relevant flow of Embodiment 5 may apply.
In this implementation, after the document to which the data to be written belongs is determined to be a
stored document, the current data under the attribute to which the data to be written belongs is updated.
The update method can differ for data of different formats. Preferably, in one implementation, the data
update module is specifically configured to, if the data format corresponding to the second column field
is fixed-length data, update the data of the second column field under the first row field to the data to
be written. The data update module is further specifically configured to, if the data format corresponding
to the second column field is variable-length data, store the data to be written as a patch file for the
data of the second column field under the first row field and, in the index corresponding to the second
column storage table, record the storage address of the data of the second column field under the first
row field as the storage address of the data to be written.
In this implementation, likewise to save data processing resources, stored data that needs updating is not
rewritten in real time. Instead, the data to be written is stored as a patch file, which achieves the
effect of a data update. When a subsequent unified data update is triggered, for example when the number
of patch files in the column storage table reaches a certain threshold, all data to be updated in the
column storage table is replaced with the data in the corresponding latest patch file, and all patch files
are removed. This centralizes data updates and the release of storage space and avoids the frequent
resource consumption caused by real-time processing.
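The two update paths can be sketched in Python as below. This is a simplified model under stated assumptions, not the patented implementation: `UpdatableColumn`, `merge_threshold`, and the in-memory lists standing in for the main storage area and the patch files are all hypothetical.

```python
from typing import Dict, List, Tuple


class UpdatableColumn:
    """One column field of a column storage table, keyed by document pk."""

    def __init__(self, fixed_length: bool, merge_threshold: int = 2) -> None:
        self.fixed_length = fixed_length
        self.merge_threshold = merge_threshold       # assumed trigger condition
        self.base: List[object] = []                 # main storage area
        self.patches: List[object] = []              # one entry per patch "file"
        self.index: Dict[str, Tuple[str, int]] = {}  # pk -> (area, position)

    def insert(self, pk: str, value: object) -> None:
        self.index[pk] = ("base", len(self.base))
        self.base.append(value)

    def read(self, pk: str) -> object:
        area, pos = self.index[pk]
        return self.base[pos] if area == "base" else self.patches[pos]

    def update(self, pk: str, value: object) -> None:
        if self.fixed_length:
            # Fixed-length data: overwrite in place, the address is unchanged.
            _, pos = self.index[pk]
            self.base[pos] = value
            return
        # Variable-length data: store a patch and point the index at it.
        self.patches.append(value)
        self.index[pk] = ("patch", len(self.patches) - 1)
        if len(self.patches) >= self.merge_threshold:
            self.merge_patches()

    def merge_patches(self) -> None:
        # Unified update: keep the latest value of every pk, drop all patches.
        new_base: List[object] = []
        new_index: Dict[str, Tuple[str, int]] = {}
        for pk in self.index:
            new_index[pk] = ("base", len(new_base))
            new_base.append(self.read(pk))
        self.base, self.index, self.patches = new_base, new_index, []
```

Redirecting the index entry is what makes the patch visible to readers immediately, while the old value stays in the main area until the batch merge rewrites it.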
The data processing system provided in this embodiment builds and maintains an index for the column storage
table and, on that basis, performs operations such as data deletion and updates to existing data. This
provides diverse data processing functions and avoids the resource consumption of frequent processing.
Fig. 9 is an example architecture diagram of a data processing system provided in Embodiment 7. The terms
involved are explained as follows:
Schema: the description file of each column storage table in the storage system, which records the
attribute represented by each column field in the column storage table;
Table: the column storage table corresponding to each type; a single storage table is the unit of
management, and the column storage tables of different types are independent of one another;
Segment: the data in a single column storage table is managed in versions according to update version, and
newly added data is stored in a newly added segment;
Primary key index: a k-v index is maintained so that the storage address of attribute data can be located
quickly by pk;
Patch: updates to variable-length data are stored in the form of patch files;
Version: states the status of the current version, which is either available or unavailable.
Specifically, in this scheme the document attribute data of each type is managed as a single column storage
table (table), and tables are independent of one another. Within a table, the data for each attribute
(field) is managed in segments. The attribute data of a new document is written to the current latest
segment; once the data written to the current latest version reaches a certain scale, creation of a new
segment is triggered. Segments are therefore created in time order. Segments also do not grow without
bound: versions can be merged dynamically according to the amount of valid data in each segment. In
addition, to make the data in a table easy to retrieve, each table maintains a corresponding k-v index,
through which a data storage location can be located quickly by primary key pk (data in the figure denotes
the storage location of the first data item of the document attribute data; for variable-length data the
storage location includes, besides the location of the first data item, the length of the variable-length
data, that is, the stored offset). For a data deletion operation, it is only necessary to add a mark to the
storage location of the data to be deleted in the primary key index table. For updates to existing data,
this scheme proposes two strategies: fixed-length data is updated in place, and variable-length data is
updated by writing a patch file.
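The segment lifecycle described above (write to the latest segment, open a new segment once it is saturated, merge sparse segments into others) can be sketched as follows. The sketch makes several assumptions that the source does not specify: saturation means the segment has no free row slots, `merge_threshold` is a count of valid rows, and deletion is modelled by dropping the row from a dictionary while its slot stays consumed. `Segment` and `SegmentedTable` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Segment:
    capacity: int                                        # row slots in this version
    rows: Dict[str, dict] = field(default_factory=dict)  # valid rows only
    used: int = 0                                        # slots consumed, incl. deleted
    available: bool = True                               # version state

    def free_slots(self) -> int:
        return self.capacity - self.used


class SegmentedTable:
    def __init__(self, capacity: int, merge_threshold: int) -> None:
        self.capacity = capacity
        self.merge_threshold = merge_threshold
        self.segments: List[Segment] = [Segment(capacity)]

    def write(self, pk: str, attrs: dict) -> None:
        latest = self.segments[-1]
        if latest.free_slots() == 0:  # assumed saturation condition
            latest = Segment(self.capacity)
            self.segments.append(latest)
        latest.rows[pk] = attrs
        latest.used += 1

    def delete(self, pk: str) -> None:
        for seg in self.segments:
            if seg.available and pk in seg.rows:
                del seg.rows[pk]  # the row slot remains consumed
                return

    def merge_sparse(self) -> None:
        for src in self.segments:
            if src is self.segments[-1] or not src.available:
                continue  # never retire the current write target
            if len(src.rows) >= self.merge_threshold:
                continue
            for dst in self.segments:
                if (dst is not src and dst.available
                        and dst.free_slots() >= len(src.rows)):
                    dst.used += len(src.rows)
                    dst.rows.update(src.rows)
                    src.rows = {}
                    src.available = False  # mark the source version invalid
                    break
```

The destination is chosen only if its free slots can hold every valid row of the source, so a merge never splits one sparse segment across several targets.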
The data processing system provided in this embodiment retains the efficient data updates of column storage
and, when reading the data of a single document, also achieves read efficiency similar to that of row
storage. It thereby combines the advantages of both approaches and effectively improves the efficiency of
data processing.
Fig. 10 is a schematic structural diagram of the data processing system provided in Embodiment 8 of the
present application. As shown in Fig. 10, the data processing system 700 includes at least one processor
701, a memory 702, and a communication interface 703, connected by a bus 704. The memory 702 stores
computer-executable instructions. The at least one processor 701 executes the computer-executable
instructions stored in the memory 702, so that the data processing system exchanges data with an external
server through the communication interface 703 to perform the method of any of the foregoing embodiments.
The processor 701 of the data processing system 700 may include processors of different types or processors
of the same type. The processor may be any of the following: a central processing unit (CPU), an ARM
processor, a field programmable gate array (FPGA), a dedicated processor, or another device with computing
capability. In an optional implementation, the at least one processor may also be integrated into a
many-core processor.
The memory 702 in the data processing system 700 may be any one or any combination of the following:
random access memory (RAM), read-only memory (ROM), non-volatile memory (NVM), solid state drive (SSD),
mechanical hard disk, magnetic disk, disk array, and other storage media.
The communication interface 703 is used by the data processing system 700 to exchange data with other
devices. The communication interface may be any one or any combination of the following: a network
interface (such as an Ethernet interface), a wireless network card, or another device with network access
capability.
The bus may include an address bus, a data bus, a control bus, and so on; for ease of representation, the
bus is drawn as a single thick line in the figure. The bus may be any one or any combination of the
following: an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an
Extended Industry Standard Architecture (EISA) bus, or another device for wired data transmission.
The application also provides a computer-readable storage medium that stores computer-executable
instructions. When at least one processor of the data processing system executes the computer-executable
instructions, the data processing system performs the method of any of the above embodiments.
The application also provides an electronic device that includes computer-executable instructions stored in
a computer-readable storage medium. At least one processor of the data processing system can read the
computer-executable instructions from the computer-readable storage medium, and the at least one processor
executes them to perform the method of any of the above embodiments.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the
specific working process of the system described above can be found in the corresponding process of the
foregoing method embodiments and is not repeated here.
Those of ordinary skill in the art will appreciate that all or some of the steps of the above method
embodiments can be carried out by hardware under the control of program instructions. The program can be
stored in a computer-readable storage medium. When executed, the program performs the steps of the above
method embodiments. The storage medium includes ROM, RAM, magnetic disks, optical discs, and other media
that can store program code.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical
solutions of the application, not to limit them. Although the application has been described in detail
with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the
technical solutions described in those embodiments can still be modified, or some or all of their
technical features can be replaced by equivalents, and that such modifications or replacements do not
cause the essence of the corresponding technical solutions to depart from the scope of the technical
solutions of the embodiments of the application.
Claims (30)
1. A data processing method, characterized by comprising:
determining, among the column fields of a column storage table, the first column fields that need to be
combined, where the row fields of the column storage table represent documents and the column fields of
the column storage table represent attributes;
merging the first column fields into a composite column field, and merging the data of the first column
fields under each row field, the merged data serving as the data of the composite column field under each
row field.
2. The method according to claim 1, characterized in that merging the data of the first column fields
under each row field, the merged data serving as the data of the composite column field under each row
field, comprises:
sorting the data of the first column fields under each row field according to a preset ordering of the
first column fields and then merging it, the merged data serving as the data of the composite column field
under that row field.
3. The method according to claim 1, characterized in that merging the data of the first column fields
under each row field, the merged data serving as the data of the composite column field under each row
field, comprises:
adding, to the data of each first column field under each row field, an identifier of that column field;
merging the identified data of the first column fields under each row field, the merged data serving as
the data of the composite column field under that row field.
4. The method according to claim 1, characterized in that determining, among the column fields of the
column storage table, the first column fields that need to be combined comprises:
counting the access frequency of each column field in the column storage table;
using column fields with similar access frequencies as the first column fields.
5. The method according to claim 1, characterized in that determining, among the column fields of the
column storage table, the first column fields that need to be combined comprises:
counting the data update frequency of each column field in the column storage table;
using column fields whose data update frequency is below a preset frequency as the first column fields.
6. The method according to claim 1, characterized in that the method further comprises:
updating the description file of the column storage table, where the description file records each column
field of the column storage table and the attribute represented by each column field.
7. The method according to claim 6, characterized in that updating the description file of the column
storage table comprises:
deleting the first column fields recorded in the description file of the column storage table and the
attributes represented by the first column fields;
adding the composite column field and the attribute represented by the composite column field to the
description file of the column storage table, where the attribute represented by the composite column
field includes the attributes represented by the first column fields.
8. The method according to any one of claims 1-7, characterized in that the method further comprises:
looking up, among the column storage tables corresponding to the respective types, a first column storage
table corresponding to the type of a document to be written;
extracting the corresponding attribute data from the document to be written according to the attributes
represented by the column fields of the first column storage table;
writing the attribute data into the first column storage table.
9. The method according to claim 8, characterized in that writing the attribute data into the first column
storage table comprises:
determining, according to the row field currently to be written in the first column storage table and
among the versions corresponding to the first column storage table, a first version to which the row field
to be written belongs, where different versions represent different row field ranges of the first column
storage table;
detecting whether the scale of data written in the first version reaches a preset saturation condition; if
so, creating a second version and writing the attribute data into a row field represented by the second
version; otherwise, writing the attribute data into the row field to be written.
10. The method according to claim 8, characterized in that the method further comprises:
detecting whether a third version exists among the versions corresponding to the first column storage
table, where the amount of valid data stored in the row fields represented by the third version is below a
preset threshold, the valid data being data that has not been deleted;
counting a first quantity of row fields that store valid data among the row fields represented by the
third version, and determining a fourth version among the versions corresponding to the first column
storage table, where the number of row fields with no data written among the row fields represented by the
fourth version is not less than the first quantity;
transferring the valid data stored in the third version to the fourth version, and marking the third
version as an invalid version.
11. The method according to any one of claims 1-7, characterized in that the method further comprises:
building an index for the column storage table, where the index includes the storage address of the data
of each column field under each row field in the column storage table.
12. The method according to claim 11, characterized in that the method further comprises:
looking up the storage address of data to be deleted in the index corresponding to the column storage
table to which the data to be deleted belongs;
adding a first mark to the storage address of the data to be deleted in the index, to indicate that the
data to be deleted is invalid.
13. The method according to claim 11, characterized in that the method further comprises:
looking up a second column storage table according to the document to which data to be written belongs,
where the second column storage table contains a first row field representing the document to which the
data to be written belongs;
updating, according to the attribute to which the data to be written belongs, the data of a second column
field under the first row field in the second column storage table, where the second column field
represents the attribute to which the data to be written belongs.
14. The method according to claim 13, characterized in that updating, according to the attribute to which
the data to be written belongs, the data of the second column field under the first row field in the
second column storage table comprises:
if the data format corresponding to the second column field is fixed-length data, updating the data of the
second column field under the first row field to the data to be written;
if the data format corresponding to the second column field is variable-length data, storing the data to
be written as a patch file for the data of the second column field under the first row field and, in the
index corresponding to the second column storage table, recording the storage address of the data of the
second column field under the first row field as the storage address of the data to be written.
15. A data processing system, characterized by comprising:
a grouping module, configured to determine, among the column fields of a column storage table, the first
column fields that need to be combined, where the row fields of the column storage table represent
documents and the column fields of the column storage table represent attributes;
a merging module, configured to merge the first column fields into a composite column field, and to merge
the data of the first column fields under each row field, the merged data serving as the data of the
composite column field under each row field.
16. The system according to claim 15, characterized in that the merging module comprises:
a first processing unit, configured to sort the data of the first column fields under each row field
according to a preset ordering of the first column fields and then merge it, the merged data serving as
the data of the composite column field under that row field.
17. The system according to claim 15, characterized in that the merging module comprises:
an identification unit, configured to add, to the data of each first column field under each row field, an
identifier of that column field;
a second processing unit, configured to merge the identified data of the first column fields under each
row field, the merged data serving as the data of the composite column field under that row field.
18. The system according to claim 15, characterized in that the grouping module comprises:
a first statistics unit, configured to count the access frequency of each column field in the column
storage table;
a first grouping unit, configured to use column fields with similar access frequencies as the first column
fields.
19. The system according to claim 15, characterized in that the grouping module comprises:
a second statistics unit, configured to count the data update frequency of each column field in the column
storage table;
a second grouping unit, configured to use column fields whose data update frequency is below a preset
frequency as the first column fields.
20. The system according to claim 15, characterized in that the system further comprises:
a description module, configured to update the description file of the column storage table, where the
description file records each column field of the column storage table and the attribute represented by
each column field.
21. The system according to claim 20, characterized in that the description module comprises:
a description deletion unit, configured to delete the first column fields recorded in the description file
of the column storage table and the attributes represented by the first column fields;
a description adding unit, configured to add the composite column field and the attribute represented by
the composite column field to the description file of the column storage table, where the attribute
represented by the composite column field includes the attributes represented by the first column fields.
22. The system according to any one of claims 15-21, characterized in that the system further comprises:
a storage management module, configured to look up, among the column storage tables corresponding to the
respective types, a first column storage table corresponding to the type of a document to be written;
the storage management module is further configured to extract the corresponding attribute data from the
document to be written according to the attributes represented by the column fields of the first column
storage table;
the storage management module is further configured to write the attribute data into the first column
storage table.
23. The system according to claim 22, characterized in that the storage management module comprises:
a version management unit, configured to determine, according to the row field currently to be written in
the first column storage table and among the versions corresponding to the first column storage table, a
first version to which the row field to be written belongs, where different versions represent different
row field ranges of the first column storage table;
the version management unit is further configured to detect whether the scale of data written in the first
version reaches a preset saturation condition; if so, to create a second version and write the attribute
data into a row field represented by the second version; otherwise, to write the attribute data into the
row field to be written.
24. The system according to claim 22, characterized in that the system further comprises:
a version merging module, configured to detect whether a third version exists among the versions
corresponding to the first column storage table, where the amount of valid data stored in the row fields
represented by the third version is below a preset threshold, the valid data being data that has not been
deleted;
the version merging module is further configured to count a first quantity of row fields that store valid
data among the row fields represented by the third version, and to determine a fourth version among the
versions corresponding to the first column storage table, where the number of row fields with no data
written among the row fields represented by the fourth version is not less than the first quantity;
the version merging module is further configured to transfer the valid data stored in the third version to
the fourth version and to mark the third version as an invalid version.
25. The system according to any one of claims 15-21, characterized in that the system further comprises:
an index management module, configured to build an index for the column storage table, where the index
includes the storage address of the data of each column field under each row field in the column storage
table.
26. The system according to claim 25, characterized in that the system further comprises:
a data deletion module, configured to look up the storage address of data to be deleted in the index
corresponding to the column storage table to which the data to be deleted belongs;
the data deletion module is further configured to add a first mark to the storage address of the data to
be deleted in the index, to indicate that the data to be deleted is invalid.
27. The system according to claim 25, characterized in that the system further comprises:
a data update module, configured to look up a second column storage table according to the document to
which data to be written belongs, where the second column storage table contains a first row field
representing the document to which the data to be written belongs;
the data update module is further configured to update, according to the attribute to which the data to be
written belongs, the data of a second column field under the first row field in the second column storage
table, where the second column field represents the attribute to which the data to be written belongs.
28. The system according to claim 27, characterized in that:
the data update module is specifically configured to, if the data format corresponding to the second
column field is fixed-length data, update the data of the second column field under the first row field to
the data to be written;
the data update module is further specifically configured to, if the data format corresponding to the
second column field is variable-length data, store the data to be written as a patch file for the data of
the second column field under the first row field and, in the index corresponding to the second column
storage table, record the storage address of the data of the second column field under the first row field
as the storage address of the data to be written.
29. An electronic device, characterized by comprising: at least one processor and a memory;
the memory stores computer-executable instructions; the at least one processor executes the
computer-executable instructions stored in the memory to perform the method according to any one of
claims 1-14.
30. A computer-readable storage medium, characterized in that program instructions are stored in the
computer-readable storage medium, and the program instructions, when executed by a processor, implement
the method according to any one of claims 1-14.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810014799.8A CN110109910A (en) | 2018-01-08 | 2018-01-08 | Data processing method and system, electronic equipment and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110109910A true CN110109910A (en) | 2019-08-09 |
Family
ID=67483096
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810014799.8A Pending CN110109910A (en) | 2018-01-08 | 2018-01-08 | Data processing method and system, electronic equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110109910A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111046074A (en) * | 2019-12-13 | 2020-04-21 | 北京百度网讯科技有限公司 | Streaming data processing method, device, equipment and medium |
CN111259107A (en) * | 2020-01-10 | 2020-06-09 | 北京百度网讯科技有限公司 | Storage method and device of determinant text and electronic equipment |
CN111581331A (en) * | 2020-04-27 | 2020-08-25 | 北京字节跳动网络技术有限公司 | Method and device for processing file, electronic equipment and computer readable medium |
CN111639091A (en) * | 2020-06-04 | 2020-09-08 | 山东汇贸电子口岸有限公司 | Multi-table merging method based on table merging |
CN112069172A (en) * | 2020-08-21 | 2020-12-11 | 南京南瑞继保电气有限公司 | Power grid data processing method and device, electronic equipment and storage medium |
CN112632939A (en) * | 2020-12-30 | 2021-04-09 | 北京达佳互联信息技术有限公司 | Data processing method, data display method, data processing device and storage medium |
CN113064919A (en) * | 2021-03-31 | 2021-07-02 | 北京达佳互联信息技术有限公司 | Data processing method, data storage system, computer device and storage medium |
CN113591485A (en) * | 2021-06-17 | 2021-11-02 | 国网浙江省电力有限公司 | Intelligent data quality auditing system and method based on data science |
CN114706527A (en) * | 2022-03-24 | 2022-07-05 | 北京涵鑫盛科技有限公司 | Distributed storage space release method and distributed system |
CN115438114A (en) * | 2022-11-09 | 2022-12-06 | 浪潮电子信息产业股份有限公司 | Storage format conversion method, system, device, electronic equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5899986A (en) * | 1997-02-10 | 1999-05-04 | Oracle Corporation | Methods for collecting query workload based statistics on column groups identified by RDBMS optimizer |
CN101354723A (en) * | 2008-09-10 | 2009-01-28 | 金蝶软件(中国)有限公司 | Method and apparatus for implementing combined field |
CN106528821A (en) * | 2016-11-16 | 2017-03-22 | 济南浪潮高新科技投资发展有限公司 | Method for importing change column data into database |
CN107038202A (en) * | 2016-12-28 | 2017-08-11 | 阿里巴巴集团控股有限公司 | Data processing method, device and equipment, computer-readable recording medium |
Non-Patent Citations (3)
Title |
---|
XIANGWU DING 等: "A Column-based Self-organizing Hybrid Storage", 《THE 2ND INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE AND ENGINEERING》 * |
丁祥武: "列存储系统的若干关键技术研究", 《中国博士学位论文全文数据库 信息科技辑》 * |
鲍玉斌 等: "数据仓库环境下以用户为中心的数据清洗过程模型", 《计算机科学》 * |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111046074A (en) * | 2019-12-13 | 2020-04-21 | 北京百度网讯科技有限公司 | Streaming data processing method, device, equipment and medium |
CN111046074B (en) * | 2019-12-13 | 2023-09-01 | 北京百度网讯科技有限公司 | Streaming data processing method, device, equipment and medium |
CN111259107A (en) * | 2020-01-10 | 2020-06-09 | 北京百度网讯科技有限公司 | Storage method and device of determinant text and electronic equipment |
CN111259107B (en) * | 2020-01-10 | 2023-08-18 | 北京百度网讯科技有限公司 | Determinant text storage method and device and electronic equipment |
CN111581331A (en) * | 2020-04-27 | 2020-08-25 | 北京字节跳动网络技术有限公司 | Method and device for processing file, electronic equipment and computer readable medium |
CN111581331B (en) * | 2020-04-27 | 2023-08-25 | 抖音视界有限公司 | Method, device, electronic equipment and computer readable medium for processing text |
CN111639091A (en) * | 2020-06-04 | 2020-09-08 | 山东汇贸电子口岸有限公司 | Multi-table merging method based on table merging |
CN111639091B (en) * | 2020-06-04 | 2023-09-19 | 山东汇贸电子口岸有限公司 | Multi-table merging method based on merging table |
CN112069172B (en) * | 2020-08-21 | 2022-07-22 | 南京南瑞继保电气有限公司 | Power grid data processing method and device, electronic equipment and storage medium |
CN112069172A (en) * | 2020-08-21 | 2020-12-11 | 南京南瑞继保电气有限公司 | Power grid data processing method and device, electronic equipment and storage medium |
CN112632939A (en) * | 2020-12-30 | 2021-04-09 | 北京达佳互联信息技术有限公司 | Data processing method, data display method, data processing device and storage medium |
CN113064919A (en) * | 2021-03-31 | 2021-07-02 | 北京达佳互联信息技术有限公司 | Data processing method, data storage system, computer device and storage medium |
CN113064919B (en) * | 2021-03-31 | 2022-11-22 | 北京达佳互联信息技术有限公司 | Data processing method, data storage system, computer device and storage medium |
CN113591485A (en) * | 2021-06-17 | 2021-11-02 | 国网浙江省电力有限公司 | Intelligent data quality auditing system and method based on data science |
CN114706527A (en) * | 2022-03-24 | 2022-07-05 | 北京涵鑫盛科技有限公司 | Distributed storage space release method and distributed system |
CN114706527B (en) * | 2022-03-24 | 2022-09-20 | 北京涵鑫盛科技有限公司 | Distributed storage space release method and distributed system |
CN115438114B (en) * | 2022-11-09 | 2023-03-24 | 浪潮电子信息产业股份有限公司 | Storage format conversion method, system, device, electronic equipment and storage medium |
CN115438114A (en) * | 2022-11-09 | 2022-12-06 | 浪潮电子信息产业股份有限公司 | Storage format conversion method, system, device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110109910A (en) | Data processing method and system, electronic equipment and computer readable storage medium | |
CN100458779C (en) | Index and its extending and searching method | |
CN102339315B (en) | Index updating method and system of advertisement data | |
US9672241B2 (en) | Representing an outlier value in a non-nullable column as null in metadata | |
CN102541757B (en) | Write cache method, cache synchronization method and device | |
US11449564B2 (en) | System and method for searching based on text blocks and associated search operators | |
KR101740271B1 (en) | Method and device for constructing on-line real-time updating of massive audio fingerprint database | |
CN107491487A (en) | A kind of full-text database framework and bitmap index establishment, data query method, server and medium | |
US11327985B2 (en) | System and method for subset searching and associated search operators | |
CN102955792A (en) | Method for implementing transaction processing for real-time full-text search engine | |
CN101136013A (en) | Method for quick updating data domain in full text retrieval system | |
CN102231168A (en) | Method for quickly retrieving resume from resume database | |
CN103186622A (en) | Updating method of index information in full text retrieval system and device thereof | |
CN105630934A (en) | Data statistic method and system | |
CN102411632B (en) | Chain table-based memory database page type storage method | |
US9047363B2 (en) | Text indexing for updateable tokenized text | |
CN103473324A (en) | Multi-dimensional service attribute retrieving device and method based on unstructured data storage | |
CN101963993B (en) | Method for fast searching database sheet table record | |
CN112416992B (en) | Industry type identification method, system and equipment based on big data and keywords | |
CN111708895B (en) | Knowledge graph system construction method and device | |
JP3666907B2 (en) | Database file storage management system | |
CN116450664A (en) | Data processing method, device, equipment and storage medium | |
CN106528590B (en) | Query method and device | |
CN112131215B (en) | Bottom-up database information acquisition method and device | |
CN108984720B (en) | Data query method and device based on column storage, server and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| TA01 | Transfer of patent application right | Effective date of registration: 20200420. Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province. Applicant after: Alibaba (China) Co.,Ltd. Address before: 510627 Guangdong city of Guangzhou province Whampoa Tianhe District Road No. 163 Xiping Yun Lu Yun Ping square B radio tower 13 layer self unit 01. Applicant before: Guangdong Shenma Search Technology Co.,Ltd. |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190809 |