CN110109910A - Data processing method and system, electronic equipment and computer readable storage medium - Google Patents

Data processing method and system, electronic equipment and computer readable storage medium

Info

Publication number
CN110109910A
Authority
CN
China
Prior art keywords
data
field
row
column
storage table
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810014799.8A
Other languages
Chinese (zh)
Inventor
王俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Guangdong Shenma Search Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Shenma Search Technology Co Ltd
Priority to CN201810014799.8A
Publication of CN110109910A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/221 Column-oriented storage; Management thereof
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/328 Management therefor

Abstract

The application provides a data processing method and system, electronic equipment, and a computer-readable storage medium. The method comprises: determining, among the column fields of a column storage table, the first column fields that need to be combined, where the row fields of the column storage table represent documents and the column fields represent attributes; merging the first column fields into a combined column field, and merging the data of the first column fields under each row field, the merged data serving as the data of the combined column field under that row field. The scheme retains the efficient data updates of column storage and, when the data of a single document is read, achieves efficient reads similar to those of row storage. It thereby combines the advantages of both storage modes and effectively improves the efficiency of data processing.

Description

Data processing method and system, electronic equipment and computer readable storage medium
Technical field
The application relates to the field of big data, and in particular to a data processing method and system, electronic equipment, and a computer-readable storage medium.
Background
In data acquisition, the data fetched by a crawler is usually defined as a document (Document). A document typically has many attributes. Taking a web document as an example, the attributes include the article title (Title), the article body (Body), the article click count (click), the words the document contains, the number of times each word occurs in the document (term frequency), and the positions at which it occurs (for example, offsets relative to the start of the document). In practice, to improve the efficiency and effectiveness of data acquisition, the crawled data can be built into an inverted index for fast retrieval and a forward index that facilitates data analysis (for example, computing the relevance between a document and a query).
The forward index works from the document dimension: the attributes of each document are extracted and stored. There are two main storage modes: row storage and column storage. In this scenario, the advantage of row storage is that the attribute data of a single document can be read efficiently, but partial data updates are inefficient. The advantage of column storage is that writes are efficient and individual fields can be updated quickly, but reading the attribute data of a single document is inefficient because multiple reads are required.
In practice, for example in search scenarios, reading the data of a single document and updating part of its attribute data both occur frequently, and neither storage mode meets both needs at once. The inefficient reads of column storage cannot satisfy the demand for high performance, and the inefficient partial updates of row storage cannot guarantee timeliness.
Summary of the invention
The application provides a data processing method and system, electronic equipment, and a computer-readable storage medium, to solve the problem that existing data processing schemes cannot process data efficiently across different scenarios.
A first aspect of the application provides a data processing method, comprising: determining, among the column fields of a column storage table, the first column fields that need to be combined, where the row fields of the column storage table represent documents and the column fields represent attributes; merging the first column fields into a combined column field, and merging the data of the first column fields under each row field, the merged data serving as the data of the combined column field under that row field.
A second aspect of the application provides a data processing system, comprising: a grouping module, configured to determine, among the column fields of a column storage table, the first column fields that need to be combined, where the row fields of the column storage table represent documents and the column fields represent attributes; and a merging module, configured to merge the first column fields into a combined column field and to merge the data of the first column fields under each row field, the merged data serving as the data of the combined column field under that row field.
A third aspect of the application provides electronic equipment, comprising at least one processor and a memory. The memory stores computer-executable instructions, and the at least one processor executes the computer-executable instructions stored in the memory to perform the method described above.
A fourth aspect of the application provides a computer-readable storage medium that stores program instructions. When executed by a processor, the program instructions implement the method described above.
In the data processing method and system, electronic equipment, and computer-readable storage medium provided by the application, document attribute data is stored in column storage: the column fields of the column storage table represent attributes and the row fields represent documents. The scheme determines, among the column fields of the column storage table, the column fields that need to be combined, merges them into a combined column field, and merges their data to obtain the data of the combined column field, thereby merging attribute data by group. The scheme retains the efficient data updates of column storage and, when the data of a single document is read, achieves efficient reads similar to those of row storage. It thereby combines the advantages of both storage modes and effectively improves the efficiency of data processing.
Brief description of the drawings
To describe the technical solutions in the embodiments of the application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the application, and a person of ordinary skill in the art can derive other drawings from them.
Fig. 1A and Fig. 1B are structural example diagrams of a row storage table and a column storage table, respectively;
Fig. 2A to Fig. 2C are flow diagrams of a data processing method provided by Embodiment 1 of the application;
Fig. 3 is an example diagram of a data processing method provided by the embodiments of the application;
Fig. 4 is a flow diagram of a data processing method provided by Embodiment 2 of the application;
Fig. 5 is a flow diagram of a data processing method provided by Embodiment 3 of the application;
Fig. 6A and Fig. 6B are structural diagrams of a data processing system provided by Embodiment 4 of the application;
Fig. 7 is a structural diagram of a data processing system provided by Embodiment 5 of the application;
Fig. 8 is a structural diagram of a data processing system provided by Embodiment 6 of the application;
Fig. 9 is an example architecture diagram of a data processing system provided by Embodiment 7;
Fig. 10 is a structural diagram of a data processing system provided by Embodiment 8 of the application.
Detailed description
To make the purposes, technical solutions, and advantages of the embodiments of the application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, rather than all, of the embodiments of the application.
Fig. 1A and Fig. 1B show structural examples of a row storage table and a column storage table, respectively. In the figures, the cells with a black background represent the row fields and column fields of the tables, and the cells with a white background represent the data in the tables. Row denotes a row field and Column denotes a column field. The difference is that a row storage table is organized in units of rows, while a column storage table is organized in units of column data sets, or column families. For a row storage table, reads and writes follow the same process: each row is processed in turn from its first column to its last, and once that row has been read or written, the next row is processed from its first column to its last, and so on until all the data to be read or written has been processed. For a column storage table, data can be read by column, with one or more columns read together. When data is written, it must be split according to the content type that each column represents, and the split data is appended to the end of the corresponding columns.
For example, in this scheme, the row fields (Row) of the storage table represent documents and the column fields (Column) represent attributes. The attributes here are attributes of a document, and their content can be set according to the needs of data statistics and analysis. They may include, but are not limited to, title, author, type (for example, comedy, tragedy, horror), and click count. In the figures, Data denotes data, and the data of the same document is identified by its label. For example, Data1-1, Data1-2, ..., Data1-M in Row1 are the data of the same document (for example, document 1) under different attributes (for example, attribute 1 represented by Column1, attribute 2 represented by Column2, ..., attribute M represented by ColumnM). Likewise, DataN-1, DataN-2, ..., DataN-M in RowN are the data of the same document (for example, document N) under different attributes.
Based on this example, suppose the data of document 1 must be read from the row storage table. Following the read pattern of a row storage table, the entire row corresponding to row field Row1, which represents document 1, is read. Reads from a row storage table are therefore efficient, and because the data of a single document is stored sequentially, the probability of cache misses is also reduced. However, when some fields of document 1 must be updated in the row storage table, the data of all column fields under document 1 has to be traversed. For example, to update the data of document 1 under attribute 5, Column1, Column2, and so on up to Column5 under Row1 must be traversed before the data of Column5 under Row1 (Data1-5 in the figure) can be updated. A partial field update in a row storage table is thus amplified, which makes updates inefficient.
Continuing the example, suppose the data of document 1 must be read from the column storage table. The data of every column field has to be traversed, the part belonging to document 1 extracted from each, and the parts spliced together to obtain the data of document 1. Reads from a column storage table are therefore inefficient, since reading one document's data requires traversing the data of every column field. The advantage of the column storage table is that partial data updates are efficient: when some field of document 1 (for example, attribute 5 of document 1) must be updated, it is enough to find the data of document 1 in the data of Column5, which represents attribute 5, and update it.
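The trade-off described above can be sketched with two small in-memory layouts. This is only an illustration: the dictionary layouts, the attribute names, and the helper functions are assumptions made for the example and do not come from the application.

```python
# Hypothetical in-memory layouts (illustrative names, not from the application).
row_store = {
    "doc1": ["Title A", "Alice", 10],   # one contiguous record per document
    "doc2": ["Title B", "Bob", 3],
}
column_store = {
    "title":  {"doc1": "Title A", "doc2": "Title B"},
    "author": {"doc1": "Alice", "doc2": "Bob"},
    "clicks": {"doc1": 10, "doc2": 3},
}

def read_doc_from_rows(doc_id):
    # A single lookup returns every attribute of the document.
    return list(row_store[doc_id])

def read_doc_from_columns(doc_id):
    # One lookup per column, then the pieces are spliced together.
    return [column[doc_id] for column in column_store.values()]

def update_clicks_in_columns(doc_id, clicks):
    # Only the affected column is touched; the other columns are not visited.
    column_store["clicks"][doc_id] = clicks

update_clicks_in_columns("doc1", 11)
doc1_via_columns = read_doc_from_columns("doc1")
```

Reading a document from the column layout costs one lookup per column, whereas the partial update touches a single column; this is the asymmetry the scheme aims to balance.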
Because neither the row storage table nor the column storage table can satisfy efficient reads and partial data updates at the same time, this embodiment provides a data processing method. Fig. 2A is a flow diagram of a data processing method provided by Embodiment 1 of the application. Referring to Fig. 2A, the method improves the efficiency of partial data updates while keeping reads efficient. Specifically, the data processing method includes:
101, determining, among the column fields of a column storage table, the first column fields that need to be combined, where the row fields of the column storage table represent documents and the column fields represent attributes;
102, merging the first column fields into a combined column field, and merging the data of the first column fields under each row field, the merged data serving as the data of the combined column field under that row field.
In practice, the method can be executed by a data processing system. The data processing system can be implemented in software code. It can also be a medium that stores the relevant executable code, for example a USB flash drive, or a physical device that integrates or has the relevant executable code installed, for example a chip, a smart terminal, a computer, a database, a server, or another electronic device.
An example based on a practical scenario: the collected document attributes are stored in a column storage table whose row fields represent documents and whose column fields represent attributes. The column fields that need to be combined are determined from the column storage table; different screening conditions can be set as required to determine them. Once they are determined, these column fields are merged into a combined column field, and the data in these column fields is merged as well. Specifically, the data of these column fields under each row field is merged, and the merged data is the data of the combined column field under that row field. The scheme merges the attributes that need to be combined into one combined attribute for storage, while the storage mode remains a column storage table. It thereby combines the advantages of row and column storage and meets the timeliness requirements of both scenarios described above: reading the data of a single document and updating part of its data.
For a more intuitive understanding, consider the example in Fig. 3. A column storage table has three column families, which represent attribute 1 (attr1), attr2, and attr3. To read the data of document 1, attr1, attr2, and attr3 must be traversed in turn to obtain the data of document 1 in the different columns, and the pieces are then spliced together to obtain the data of document 1. Under this scheme, suppose attr2 and attr3 need to be merged. Then attr2 and attr3 are merged into combined attribute attr4, and the data of attr2 and attr3 is merged into the data of attr4. Afterwards, when the data of document 1 is read, the data of both attributes is obtained together from the data of combined attribute attr4, so only two columns need to be traversed to obtain the data of document 1. Reading data from combined attribute attr4 requires traversing only that one column, so reads are considerably faster than reading from separate columns, which improves the efficiency of reading the data of a single document.
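Steps 101 and 102, applied to the Fig. 3 example, can be sketched as follows. This is a minimal sketch under stated assumptions: the table is modeled as a dictionary of columns, the merged data is modeled as a tuple, and `merge_columns` and `read_document` are hypothetical helper names rather than anything defined in the application.

```python
def merge_columns(table, first_columns, combined_name):
    """Merge the given columns of a column store into one combined column."""
    doc_ids = table[first_columns[0]].keys()
    # For every row field (document), merge the data of the first column fields.
    combined = {
        doc_id: tuple(table[name][doc_id] for name in first_columns)
        for doc_id in doc_ids
    }
    # Keep the untouched columns and add the combined column field.
    merged = {name: col for name, col in table.items() if name not in first_columns}
    merged[combined_name] = combined
    return merged

def read_document(store, doc_id):
    # One lookup per remaining column.
    return {name: col[doc_id] for name, col in store.items()}

table = {
    "attr1": {"doc1": "a1", "doc2": "b1"},
    "attr2": {"doc1": "a2", "doc2": "b2"},
    "attr3": {"doc1": "a3", "doc2": "b3"},
}
merged = merge_columns(table, ["attr2", "attr3"], "attr4")
```

After the merge, `read_document(merged, "doc1")` visits two columns (attr1 and attr4) instead of three, which mirrors the read saving described for Fig. 3.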
Specifically, the scheme first determines the column fields that need to be combined. They can be determined as required, for example specified by the user or determined through data analysis. Note that the attribute represented by a column field to be combined can be a single attribute or a combined attribute; that is, combined attributes can be merged further.
As one possible implementation, column fields with similar access frequencies can be merged. Accordingly, as shown in Fig. 2B, on the basis of any embodiment, step 101 may specifically include:
1011, counting the access frequency of each column field in the column storage table;
1012, taking column fields with similar access frequencies as the first column fields.
Specifically, the access frequency reflects how often a column field is read. After attributes with similar access frequencies are merged, their data can be obtained together by traversing a single column, which further improves the efficiency of reading the data of a single document. "Similar" here covers both identical and approximately equal frequencies, for example the access frequencies of the attributes are the same, or they differ by no more than a preset range.
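Steps 1011 and 1012 leave open how "similar" is decided. The sketch below shows one possible reading: sort the columns by access count and group those whose counts lie within a preset tolerance of the first member of the group. The grouping rule, the function name, and the sample counts are all assumptions for illustration.

```python
def group_by_access_frequency(access_counts, tolerance):
    """Group columns whose access counts lie within `tolerance` of the
    first (lowest-count) member of the current group."""
    groups = []
    for name, count in sorted(access_counts.items(), key=lambda kv: kv[1]):
        if groups and count - groups[-1][0][1] <= tolerance:
            groups[-1].append((name, count))
        else:
            groups.append([(name, count)])
    # Only groups with at least two columns are candidates for merging.
    return [[name for name, _ in group] for group in groups if len(group) > 1]

# Hypothetical access statistics (step 1011).
access_counts = {"title": 980, "author": 1000, "clicks": 40, "body": 1010}
# Columns with similar access frequency become the first column fields (step 1012).
candidates = group_by_access_frequency(access_counts, tolerance=50)
```

Other definitions of similarity, such as relative rather than absolute differences, would fit the description equally well.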
As another possible implementation, column fields with low data update frequencies can be merged. Accordingly, as shown in Fig. 2C, on the basis of any embodiment, step 101 may specifically include:
1013, counting the data update frequency of each column field in the column storage table;
1014, taking column fields whose data update frequency is below a preset frequency as the first column fields.
In practice, some attributes are updated rarely, such as the document title and document author. Attributes that do not need frequent updates can be merged, which improves the efficiency of reading the data of a single document while preserving the efficiency of partial data updates.
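Steps 1013 and 1014 amount to a threshold filter over per-column update statistics. A minimal sketch, assuming the statistics are available as a dictionary of update counts; the function name, the counts, and the threshold value are hypothetical.

```python
def select_rarely_updated(update_counts, threshold):
    """Return the columns whose update count is below the preset threshold."""
    return [name for name, count in update_counts.items() if count < threshold]

# Hypothetical update statistics (step 1013).
update_counts = {"title": 2, "author": 1, "clicks": 5000}
# Rarely updated columns become the first column fields (step 1014).
first_columns = select_rarely_updated(update_counts, threshold=10)
```

Here the frequently updated click count stays in its own column, so updating it never rewrites the merged title and author data.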
Note that the above ways of determining the column fields to be combined are only examples. In practice, the column fields that need to be combined can also be determined in other ways.
After the column fields to be combined are determined, their data must be merged, and this too can be done in several ways. In practice, the merge must ensure that the data of each attribute can still be identified when the data is read. The following embodiments illustrate how.
In one embodiment, on the basis of any embodiment, merging the data of the first column fields under each row field in step 102, with the merged data serving as the data of the combined column field under that row field, may specifically include:
sorting the data of the first column fields under each row field according to the preset rank of each first column field and then merging it, the merged data serving as the data of the combined column field under that row field.
Specifically, in this embodiment a rank is preset for each column field. The rank is the position of that attribute's data within the merged data. Continuing with Fig. 3, the rank of attr2 can be preset as first and the rank of attr3 as second. When attr2 and attr3 are merged, they are merged into combined attribute attr4, and their data is merged as well: under each row field, the data of attr2 and attr3 is ordered according to the preset ranks. As shown in Fig. 3, taking the data under document 1 as an example, the data of attr2 under document 1 comes first and the data of attr3 under document 1 comes second; once ordered, the two are merged into a single piece of data, which is the data of attr4 under document 1. The data of attr4 under all other documents is obtained in the same way, which completes the merge of the attr2 and attr3 columns. Storing the data of the combined attribute in a fixed order effectively improves read efficiency, achieves fast reads similar to those of row storage, and can effectively improve the efficiency of online retrieval.
By merging data in a preset order, this embodiment gives the merged data sequential-access characteristics: within a document's combined attribute data, the data of each attribute is stored in order. Reading the document's combined attribute data is therefore a sequential access, which achieves the efficient reads of row storage.
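The rank-based merge can be sketched as follows. The sketch assumes that ranks are given as a mapping from column name to position and that merged data is a plain list; `merge_by_rank` and `read_attribute` are hypothetical helpers, and the application does not prescribe this representation.

```python
def merge_by_rank(values_by_column, ranks):
    """Order one document's values by the preset rank of their column, then merge."""
    ordered = sorted(values_by_column, key=lambda name: ranks[name])
    return [values_by_column[name] for name in ordered]

def read_attribute(merged, ranks, name):
    """Locate an attribute's data by its position in the preset order."""
    position = sorted(ranks, key=ranks.get).index(name)
    return merged[position]

# Preset ranks as in the Fig. 3 example: attr2 first, attr3 second.
ranks = {"attr2": 1, "attr3": 2}
# The input order does not matter; the preset ranks decide the layout.
doc1_attr4 = merge_by_rank({"attr3": "a3", "attr2": "a2"}, ranks)
```

Because the position of each attribute is fixed by its rank, no per-value marker is needed, at the cost of every reader having to know the preset order.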
In another embodiment, on the basis of any embodiment, merging the data of the first column fields under each row field in step 102, with the merged data serving as the data of the combined column field under that row field, may specifically include:
adding the identifier of the column field to the data of each first column field under each row field;
merging the identified data of the first column fields under each row field, the merged data serving as the data of the combined column field under that row field.
Specifically, in this embodiment an identifier is added to the data of each column field to indicate the attribute to which the data belongs. Continuing with Fig. 3, when attr2 and attr3 are merged, they are merged into combined attribute attr4, and their data is tagged: under each row field, the identifier of attr2 is added to the data of attr2 and the identifier of attr3 is added to the data of attr3. The two parts are then merged to obtain the data of combined attribute attr4. When the data in attr4 is read later, the identifier carried by each piece of data shows which attribute it belongs to.
By adding attribute identifiers to the data to be merged, this embodiment makes it possible to identify the data of each attribute within the merged data and makes data merging more flexible.
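The identifier-based merge can be sketched in a similar way. Here the identifier is assumed to be the column name itself and merged data is modeled as a list of (identifier, value) pairs; both choices, and the helper names, are assumptions for illustration.

```python
def merge_with_tags(values_by_column):
    """Attach each column's identifier to its value, then merge the tagged values."""
    return [(name, value) for name, value in values_by_column.items()]

def read_tagged(merged, name):
    """Find an attribute's data by the identifier it carries."""
    for tag, value in merged:
        if tag == name:
            return value
    raise KeyError(name)

doc1_attr4 = merge_with_tags({"attr2": "a2", "attr3": "a3"})
```

Compared with the rank-based merge, the tagged form tolerates any order of the merged values, which is the flexibility the embodiment refers to, at the cost of storing one identifier per value.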
In practice, description information for the combined attribute can also be recorded to facilitate later reads. The description information is information related to the combined attribute, for example the attributes that the combined attribute represents and the way its data was merged. Optionally, in one embodiment, the data processing method further includes:
updating the description file of the column storage table, where the description file records each column field of the column storage table and the attributes that each column field represents.
Specifically, updating here includes, but is not limited to, creating, deleting, and modifying. For example, suppose the column fields of certain attributes (including but not limited to column fields of single attributes and/or combined column fields formed by merging attributes) are merged. After the merge, the resulting column field and the attributes it represents must be created in the description file; the attributes represented by this column field include the attributes represented by every column field that was merged. For example, if column field column4 is formed by merging column2 and column3, where column2 represents the document title and column3 represents the document author, then after the data of column4 is generated, the attributes represented by column4, namely document title and document author, must be created in the description file of the column storage table.
Further, the merged column fields and their description information can be removed from the description file of the column storage table to save storage space. Optionally, on the basis of this embodiment, updating the description file of the column storage table includes:
deleting, from the description file of the column storage table, the first column fields and the attributes they represent;
adding, to the description file of the column storage table, the combined column field and the attributes it represents, where the attributes represented by the combined column field include the attributes represented by each first column field.
In general, this embodiment establishes and maintains a description file for the column storage table. When the description information of any column field in the column storage table changes, for example at least two column fields are merged, the attributes represented by a column field change, or a column field is deleted, the description file must be updated. The description file records the description information of each column field of the column storage table, which includes, but is not limited to, the attributes the column field represents. For a combined column field obtained by merging, the description information also records how the data was merged, for example the order of each attribute's data in the merge, so that the data of the different attributes can be located and spliced when the data of the combined column field is read. Continuing with Fig. 3, the description file of the column storage table shown in the figure can record the attributes represented by attr1 and attr4, where the attributes represented by attr4 include attr2 and attr3. The description file also records the merge mode of the data of attr2 and attr3, which is used to identify the data of attr2 and attr3 within the data of attr4. The merge mode can take many concrete forms, and this embodiment does not limit it.
By maintaining the description file of the column storage table, the correspondence between column fields and attributes is maintained, and the attributes represented by each column field are updated promptly as column fields are merged. This keeps stored metadata up to date, facilitates later reads, and improves the convenience and accuracy of reading data.
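The description-file update for the column2/column3/column4 example can be sketched as follows. The application does not specify a file format, so the description is modeled here as a dictionary; the `attributes` and `merge_mode` keys and the `update_description` helper are hypothetical.

```python
def update_description(description, first_columns, combined_name, merge_mode):
    """Return a new description in which the merged columns are replaced
    by the combined column and the attributes it represents."""
    attributes = []
    for name in first_columns:
        attributes.extend(description[name]["attributes"])
    # Delete the entries of the merged (first) column fields.
    updated = {n: e for n, e in description.items() if n not in first_columns}
    # Add the combined column field, its attributes, and how its data was merged.
    updated[combined_name] = {"attributes": attributes, "merge_mode": merge_mode}
    return updated

description = {
    "column1": {"attributes": ["body"]},
    "column2": {"attributes": ["title"]},
    "column3": {"attributes": ["author"]},
}
updated = update_description(
    description,
    ["column2", "column3"],
    "column4",
    {"kind": "ordered", "order": ["column2", "column3"]},
)
```

A reader of column4 can consult the recorded merge mode to split the merged data back into title and author.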
Data processing method provided in this embodiment carries out data by the way of column storage for document properties data Storage, the column field characterization attributes of column storage table, row field list are solicited articles shelves, and this programme is determined from the column field of column storage table Need combined column field, these column fields merged into composite column field, and to the data of these column fields merge with The data of composite column field are obtained, realize the packet combining of different attribute data.Data processing scheme provided by the present application, both had Standby column store the advantages of efficiently more new data, and can realize when reading the data of same document and be similar to row storage mode Effect is efficiently read, so that the advantages of taking into account two ways, effectively improves the efficiency of data processing.
In practical application, for different types of scene, the Doctype of acquisition is different.For the ease of carrying out data pipe Reason can also be carried out dividing according to attribute data of the different types to each document and individually be managed.Correspondingly, Fig. 4 is this Shen Please embodiment two provide a kind of data processing method flow diagram;With reference to Fig. 4 it is found that the present embodiment still provides one kind Data processing method is managed different types of document properties data for further realizing.Specifically, in any implementation On the basis of example, the data processing method further include:
201. From the column storage tables corresponding to the respective types, look up the first column storage table corresponding to the type of the document to be written;
202. According to the attribute characterized by each column field in the first column storage table, extract the corresponding attribute data from the document to be written;
203. Write the attribute data into the first column storage table.
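Steps 201 to 203 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the in-memory layout (type, then column field, then row field), the function name `write_document`, and the sample attribute names are all assumptions made for the example.

```python
# Hypothetical layout: type -> {column field (attribute) -> {row field (document id) -> value}}
tables = {
    "shopping": {"purchase_history": {}, "ad_clicks": {}},
    "education": {"age": {}, "major": {}, "degree": {}},
}


def write_document(tables, doc):
    # Step 201: find the column storage table that matches the document's type.
    table = tables[doc["type"]]
    # Step 202: extract only the attributes that the table's column fields characterize.
    extracted = {attr: doc["attrs"][attr] for attr in table if attr in doc["attrs"]}
    # Step 203: write each extracted value under the document's row field.
    for attr, value in extracted.items():
        table[attr][doc["id"]] = value
    return extracted


doc = {"id": "doc1", "type": "education",
       "attrs": {"age": 20, "major": "math", "hobby": "chess"}}
written = write_document(tables, doc)
```

Because each type has its own table, writing an education document leaves the shopping table untouched, and attributes with no matching column field (here `hobby`) are simply not extracted.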
Specifically, consider different types of scenarios, for example shopping, education, travel, and daily life. The type of data collected differs between them. A shopping scenario mainly needs statistics related to a user's shopping characteristics, such as purchase history and ad click information. An education scenario mainly needs the user's personal information, such as age, major, and educational background. A travel scenario mainly needs the user's historical location data. Data collected within the same type of scenario tends to have similar attributes. This implementation therefore classifies the document data collected in different scenarios by type and establishes a column storage table for each type. For example, column storage tables can be established for the shopping and education types respectively: document attribute data collected from shopping websites or applications such as Taobao and JD.com is stored in the column storage table of the shopping type, and document attribute data collected from education websites or applications such as Hujiang Online School is stored in the column storage table of the education type.
This implementation establishes a column storage table for each type, assigns document data with similar attributes to the same type, and stores and manages data with each type's column storage table as the unit, which isolates different types of data from one another. For example, the update frequency of document attribute data varies by type: document attribute data of the shopping type may require frequent updates to data such as a user's ad click count, whereas document attribute data of the education type is usually stable and needs no frequent updates over a given period. If the document attribute data of all types were stored together, an update to shopping-type data could spread to other types of data, reducing update efficiency and making erroneous operations more likely. This implementation therefore stores and manages different types of document attribute data separately, which prevents them from interfering with one another and improves the efficiency and accuracy of data processing. In other words, under this management mode, updating document attribute data with different update frequencies under different types does not affect the data of the entire storage system, which reduces data processing overhead.
In this embodiment, the document attribute data of each type is managed in the form of a single column storage table, and the column storage tables of different types are independent of one another. The column storage table of a given type can also be further divided for management.
Optionally, in one implementation, on the basis of Embodiment 2, step 203 may specifically include:
2031. According to the row field currently to be written in the first column storage table, determine, from the versions corresponding to the first column storage table, the first version to which that row field belongs, where different versions characterize different row field ranges of the first column storage table;
2032. Detect whether the amount of data written in the first version has reached a preset saturation condition; if so, establish a second version and write the attribute data into a row field characterized by the second version; otherwise, write the attribute data into the row field to be written.
Specifically, different versions correspond to different row field ranges. For example, version 0 corresponds to row field 1 through row field 10 of the column storage table, and version 1 corresponds to row field 11 through row field 20. The row field to be written is the row field that will receive the document attribute data currently being written, and it can be determined in several ways. For example, when rows are written sequentially, the first row field that has not yet been written is taken as the current row field to be written. The version to which this row field belongs is then determined. Continuing the example, if the current row field to be written is row field 19, it belongs to version 1. In this implementation, writing data refers to writing the data of a new document, that is, a document that does not yet exist in the current column storage table.
In this implementation, the data in the column storage table of each type is stored and managed in versions. Versions can be divided by time, that is, new versions are established as data needs to be written. To reduce the resources consumed by data maintenance and updates, a new version can be created only when the data written to the current version reaches a certain scale, that is, when the preset saturation condition is met. The saturation condition reflects how much of the version's space is occupied. For example, the condition may be that the version is full, that the proportion of row fields already written reaches a certain threshold, or that the number of row fields still unwritten does not exceed a certain threshold. Continuing the example, suppose the current row field to be written is row field 19, which belongs to version 1, and the preset saturation condition requires that at least 2 row fields remain unwritten (in the example, only row field 19 and row field 20 are currently unwritten). Version 2 is then established, and the document attribute data to be written is written into a row field of version 2.
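The version check of steps 2031 and 2032 can be sketched for the example above. The sketch assumes sequential writes, a fixed version size of 10 row fields, and the "keep at least 2 unwritten row fields" saturation rule; the constants and function names are hypothetical.

```python
ROWS_PER_VERSION = 10  # assumed: every version spans 10 row fields
MIN_FREE_ROWS = 2      # assumed saturation rule: keep at least 2 unwritten row fields


def version_of(row):
    """Map a 1-based row field number to its version (rows 1-10 -> 0, 11-20 -> 1)."""
    return (row - 1) // ROWS_PER_VERSION


def choose_row(next_row, num_versions):
    """Return (row field to write, number of versions afterwards)."""
    first_version = version_of(next_row)              # step 2031
    last_row = (first_version + 1) * ROWS_PER_VERSION
    free_rows = last_row - next_row + 1               # unwritten rows, including next_row
    if free_rows - 1 < MIN_FREE_ROWS:                 # step 2032: writing would saturate
        second_version = num_versions                 # versions are numbered from 0
        return second_version * ROWS_PER_VERSION + 1, num_versions + 1
    return next_row, num_versions


print(choose_row(19, 2))  # (21, 3)
```

With versions 0 and 1 in place, a write aimed at row field 19 is redirected to the first row field of the newly created version 2, while a write aimed at row field 15 stays in version 1.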
It is understood that, it is assumed that current all version to have been established or the data write-in scale of most newly-established version reaches preset saturation Condition then directly establishes new version and carries out data write-in.In practical application, in order to avoid the document of homogeneous write-in is saved The problem of different editions cause follow-up data inconvenience to be safeguarded, DUMP operation can be initiated before newly-built version, i.e., in newly-built version This when, is written without data.
In addition, storage space can be freed by recycling versions. Optionally, on the basis of Embodiment 2, the method may further include:
204. Detect whether a third version exists among the versions corresponding to the first column storage table, where the amount of valid data stored in the row fields characterized by the third version is below a preset threshold, valid data being data that has not been deleted;
205. Count a first quantity, that is, the number of row fields that store valid data among the row fields characterized by the third version, and determine a fourth version from the versions corresponding to the first column storage table, where the number of unwritten row fields among the row fields characterized by the fourth version is not less than the first quantity;
206. Transfer the valid data stored in the third version to the fourth version, and mark the third version as an invalid version.
Specifically, this implementation merges versions dynamically according to the amount of valid data in each version. In practical applications, data updates may produce invalid data, for example data that has been deleted. When the valid data stored in a version is sparse, for example below the preset threshold, that version can be merged into another version to release the storage space occupied by its invalid data and to improve the efficiency of subsequent data reading and retrieval. Version merging can be triggered in several ways. For example, step 204 can be executed periodically to scan each version of the column storage table and check whether it needs to be merged to remove invalid data. A version that has been merged into another version can be removed, or it can be marked as an invalid version so that subsequent reads and retrievals skip its data. Optionally, once the invalid data in an invalid version has been cleared, the version can be reused for writing data; after data is written, the version is marked as valid again.
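Steps 204 to 206 can be illustrated with a small sketch. The per-version bookkeeping (`capacity`, `used`, `live`, `invalid`) and the function name are assumptions chosen for the example; the source does not prescribe a data structure.

```python
def merge_sparse_version(versions, threshold):
    """Move the valid rows of one sparse version into another version.

    Each version is a dict with:
      capacity - number of row fields in the version
      used     - row fields already written (valid or deleted)
      live     - {document id: data} for rows that have not been deleted
      invalid  - True once the version has been recycled
    Returns (third version id, fourth version id), or None if nothing was merged.
    """
    for src_id, src in versions.items():
        # Step 204: look for a version whose valid data is below the threshold.
        if src["invalid"] or len(src["live"]) >= threshold:
            continue
        needed = len(src["live"])  # step 205: the "first quantity"
        for dst_id, dst in versions.items():
            if dst_id == src_id or dst["invalid"]:
                continue
            if dst["capacity"] - dst["used"] >= needed:
                # Step 206: transfer the valid data, then mark the source invalid.
                dst["live"].update(src["live"])
                dst["used"] += needed
                src["live"] = {}
                src["invalid"] = True
                return src_id, dst_id
    return None


versions = {
    0: {"capacity": 10, "used": 10, "live": {"d1": "a"}, "invalid": False},
    1: {"capacity": 10, "used": 4,
        "live": {"d2": "b", "d3": "c", "d4": "d", "d5": "e"}, "invalid": False},
}
result = merge_sparse_version(versions, threshold=3)
```

In this example version 0 holds a single valid row among ten written rows, so its data moves into version 1, which still has six unwritten row fields.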
The data processing method provided in this embodiment stores each type of document attribute data in its own column storage table. The column storage tables of different types are independent of one another, which prevents different types of document attribute data from interfering with each other and improves the efficiency and accuracy of data processing.
In addition, to facilitate retrieval of the data in the column storage table, an index can be established and maintained for it. Accordingly, Fig. 5 is a flow diagram of a data processing method provided by Embodiment 3 of the present application. Referring to Fig. 5, this embodiment provides a data processing method that further establishes and maintains an index for the column storage table. Specifically, on the basis of any of the foregoing embodiments, the data processing method further includes:
301. Establish an index for the column storage table, the index including the storage address of the data of each column field under each row field in the column storage table.
The index can take various forms. Preferably, a k-v index (primary key index) is used, so that the storage location of data can be found quickly through the primary key (pk). Specifically, the index of the column storage table includes the storage address of the data in each cell of the column storage table, that is, the storage address of the data of each column field under each row field. The format of the storage address can be determined by the format type of the data. Preferably, for fixed-length data, whose length is constant, the storage address need only record the starting address of the data; for variable-length data, the storage address records the length of the data in addition to its starting address.
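The two address formats can be sketched as follows. This is an illustrative layout only: the byte buffer, the `(row, column)` key, and the `fixed_width` mapping are assumptions, not details given in the source.

```python
def build_index(table, fixed_width):
    """Lay cells out in one byte buffer and index them by (row, column).

    table: column -> {row -> bytes}
    fixed_width: column -> byte width, for fixed-length columns only
    """
    storage = bytearray()
    index = {}
    for column, cells in table.items():
        for row, data in cells.items():
            offset = len(storage)
            storage += data
            if column in fixed_width:
                index[(row, column)] = (offset,)            # start address only
            else:
                index[(row, column)] = (offset, len(data))  # start address and length
    return bytes(storage), index


def read_cell(storage, index, fixed_width, row, column):
    entry = index[(row, column)]
    length = fixed_width[column] if column in fixed_width else entry[1]
    return storage[entry[0]:entry[0] + length]


table = {
    "clicks": {"doc1": (7).to_bytes(4, "little")},  # fixed-length, 4 bytes
    "title": {"doc1": b"hello"},                    # variable-length
}
fixed_width = {"clicks": 4}
storage, index = build_index(table, fixed_width)
```

The fixed-length entry stays smaller because the width is known from the column, whereas the variable-length entry must carry its own length to delimit the read.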
In practical applications, although each individual item of document attribute data is usually small, the number of items is usually very large. If a data update were performed every time some item changed, scheduling and executing the updates would consume considerable processing resources. Preferably, therefore, taking data deletion as an example, on the basis of the above implementation the method further includes:
302. From the index corresponding to the column storage table to which the data to be deleted belongs, look up the storage address of the data to be deleted;
303. In the index, add a first mark to the storage address of the data to be deleted, to indicate that the data to be deleted is invalid.
Specifically, when data in the column storage table needs to be deleted, its storage address can be found in the index of the column storage table, and a first mark indicating that the data is invalid is added to that storage address in the index. In other words, the index entry of the data to be deleted is marked to show that the data is invalid, which achieves the effect of deletion without physically deleting the data. The invalid data is removed together later, when a unified update operation is triggered, for example when the amount of invalid data reaches a certain threshold.
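Mark-based deletion (steps 302 and 303) followed by a deferred clean-up can be sketched as below. The index entry shape, the function names, and the threshold parameter are assumptions for illustration; only index entries are removed here, and reclaiming the underlying storage is left out of the sketch.

```python
def mark_deleted(index, key):
    """Steps 302-303: find the address entry and attach the 'invalid' mark."""
    index[key]["invalid"] = True


def lookup(index, key):
    """Return the storage address, or None if the data is missing or marked invalid."""
    entry = index.get(key)
    if entry is None or entry["invalid"]:
        return None
    return entry["addr"]


def purge(index, threshold):
    """Unified clean-up: drop all marked entries once enough have accumulated."""
    marked = [key for key, entry in index.items() if entry["invalid"]]
    if len(marked) < threshold:
        return []
    for key in marked:
        del index[key]
    return marked


index = {
    ("doc1", "title"): {"addr": 0, "invalid": False},
    ("doc2", "title"): {"addr": 16, "invalid": False},
}
mark_deleted(index, ("doc1", "title"))
```

After `mark_deleted`, readers already treat the cell as gone, yet nothing has been moved or rewritten; the cost of actual removal is paid once, in `purge`.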
Another scenario is updating data that has already been stored. Accordingly, on the basis of any of the foregoing embodiments, the method further includes:
304. According to the document to which the data to be written belongs, look up a second column storage table, the second column storage table containing a first row field that characterizes the document to which the data to be written belongs;
305. According to the attribute to which the data to be written belongs, update the data of a second column field under the first row field in the second column storage table, the second column field characterizing the attribute to which the data to be written belongs.
It is understood that, if it is possible to find the second storage table, then illustrate that document belonging to data to be written is to have deposited text Shelves, that is, belong to the update to canned data, if searching the number for illustrating that data to be written are new document less than the second storage table According to may relate to the related procedure in embodiment two in the case where the present embodiment is combined with embodiment two scene implemented.It needs to illustrate , the embodiment of each embodiment can individually be implemented or combine to implement under the premise of not conflicting in this programme, this Embodiment is not limited.Specifically, present embodiment, is determining that document belonging to current data to be written is to have deposited document Afterwards, the current data under attribute belonging to data to be written is updated.
That is, in practical applications, when existing data in the column storage table changes, the corresponding data in the column storage table needs to be updated. Data in different formats can be updated in different ways. Preferably, in one implementation, step 305 may specifically include:
3051. If the data format corresponding to the second column field is fixed-length data, update the data of the second column field under the first row field to the data to be written;
3052. If the data format corresponding to the second column field is variable-length data, store the data to be written as a patch file for the data of the second column field under the first row field, and, in the index corresponding to the second column storage table, record the storage address of the data of the second column field under the first row field as the storage address of the data to be written.
Specifically, this implementation applies different strategies to updating stored data of different formats. Fixed-length data is updated in place, that is, the old value is overwritten. Variable-length data is updated by patching: the data to be written is stored as a patch file for the data being updated, and the storage address in the corresponding cell of the column storage table's index is changed to the storage address of the data to be written, which keeps data reading and retrieval accurate. To save processing resources, such stored data is not rewritten in real time; the patch file achieves the effect of an update, and the actual rewrite waits until a unified data update is triggered. For example, when the number of patch files in the column storage table reaches a certain threshold, all data to be updated in the column storage table is replaced with the data in the corresponding latest patch files and all patch files are removed. Updates are thus applied in a batch and storage space is released, avoiding the frequent resource consumption that real-time processing would cause.
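The two update paths of steps 3051 and 3052 can be sketched as follows. The patch "files" are modeled as entries in an in-memory list, and the index entry fields (`where`, `addr`, `length`) are hypothetical names introduced for the example.

```python
def update_cell(storage, patches, index, key, new_data, fixed_length):
    entry = index[key]
    if fixed_length:
        # Step 3051: overwrite the old value in place.
        if len(new_data) != entry["length"]:
            raise ValueError("fixed-length data must keep its length")
        storage[entry["addr"]:entry["addr"] + entry["length"]] = new_data
    else:
        # Step 3052: keep the new value as a patch and repoint the index at it.
        patches.append(new_data)
        index[key] = {"where": "patch", "addr": len(patches) - 1,
                      "length": len(new_data)}


def read_cell(storage, patches, index, key):
    entry = index[key]
    if entry["where"] == "patch":
        return patches[entry["addr"]]
    return bytes(storage[entry["addr"]:entry["addr"] + entry["length"]])


storage = bytearray(b"\x01\x00\x00\x00hello")
patches = []
index = {
    ("doc1", "clicks"): {"where": "base", "addr": 0, "length": 4},
    ("doc1", "title"): {"where": "base", "addr": 4, "length": 5},
}
update_cell(storage, patches, index, ("doc1", "clicks"),
            (2).to_bytes(4, "little"), fixed_length=True)
update_cell(storage, patches, index, ("doc1", "title"),
            b"hello world", fixed_length=False)
```

The longer title would not fit in its original five bytes, which is why the variable-length path avoids touching the base storage and only redirects the index; the stale bytes remain until a batch rewrite folds the patches back in.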
The data processing method provided in this embodiment establishes and maintains an index for the column storage table and, on that basis, performs operations such as deleting data and updating existing data. This provides a variety of data processing functions while avoiding the resource consumption of frequent processing.
Fig. 6A is a structural diagram of a data processing system provided by Embodiment 4 of the present application. Referring to Fig. 6A, the data processing system includes:
a grouping module 61, configured to determine, among the column fields of a column storage table, the first column fields that need to be combined, where the row fields of the column storage table characterize documents and the column fields characterize attributes;
a merging module 62, configured to merge the first column fields into a composite column field and to merge the data of the first column fields under each row field, the merged data serving as the data of the composite column field under that row field.
In practical applications, the data processing system can be implemented as software code; it can be a medium storing the relevant executable code, for example a USB flash drive; or it can be a physical device in which the relevant executable code is integrated or installed, for example a chip, a smart terminal, a computer, a database, a server, or another electronic device.
Specifically, the grouping module 61 can set different screening conditions as needed to determine the column fields to be combined. Once these column fields are determined, the merging module 62 merges them into a composite column field and also merges the data in them. The column fields to be combined can be chosen as needed, for example specified by a user or determined through data analysis. It should be noted that the attribute characterized by a column field to be combined may be a single attribute or a composite attribute; that is, composite attributes can themselves be merged further.
In one possible implementation, column fields with similar access frequencies can be merged. Accordingly, on the basis of any of the foregoing embodiments, the grouping module 61 may include:
a first statistics unit, configured to count the access frequency of each column field in the column storage table;
a first grouping unit, configured to take column fields with similar access frequencies as the first column fields.
Specifically, the access frequency reflects how often a column field is read. After attributes with similar access frequencies are merged, they can be obtained together in a single traversal of the column data, which further improves the efficiency of reading the data of a single document. Here, similar covers both identical and approximate: for example, the access frequencies of the attributes are identical, or they differ from one another by no more than a preset range.
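One way to pick column fields with similar access frequencies is sketched below. The grouping rule (sort by frequency, then keep a column in the current group while it is within `tolerance` of the group's lowest frequency) is an assumption for illustration; the source only requires that frequencies be identical or within a preset range.

```python
def group_by_access_frequency(freq, tolerance):
    """Return groups of two or more columns whose frequencies lie within tolerance."""
    if not freq:
        return []
    ordered = sorted(freq, key=freq.get)
    groups, current = [], [ordered[0]]
    for name in ordered[1:]:
        if freq[name] - freq[current[0]] <= tolerance:
            current.append(name)
        else:
            groups.append(current)
            current = [name]
    groups.append(current)
    # A column that is alone in its group has nothing to be merged with.
    return [group for group in groups if len(group) > 1]


freq = {"title": 100, "author": 95, "clicks": 10, "price": 12, "summary": 50}
groups = group_by_access_frequency(freq, tolerance=5)
print(groups)  # [['clicks', 'price'], ['author', 'title']]
```

Each returned group is a candidate set of first column fields for one composite column field.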
In another possible implementation, column fields with low data update frequencies can be merged. Accordingly, on the basis of any of the foregoing embodiments, the grouping module 61 may include:
a second statistics unit, configured to count the data update frequency of each column field in the column storage table;
a second grouping unit, configured to take column fields whose data update frequency is below a preset frequency as the first column fields.
In practical applications, some attributes, such as document title and document author, are updated infrequently. Attributes that do not need frequent updates can be merged, which improves the efficiency of reading the data of a single document while preserving the update efficiency of the remaining data.
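The update-frequency criterion reduces to a simple filter, sketched here under the assumption that update frequencies are available as numbers per column field; the function name and sample values are hypothetical.

```python
def low_update_columns(update_freq, preset_frequency):
    """Select the column fields whose update frequency is below the preset frequency."""
    return sorted(name for name, f in update_freq.items() if f < preset_frequency)


update_freq = {"title": 0.1, "author": 0.0, "ad_clicks": 50.0}  # e.g. updates per day
first_columns = low_update_columns(update_freq, preset_frequency=1.0)
print(first_columns)  # ['author', 'title']
```

The frequently updated `ad_clicks` column stays on its own, so its updates keep the efficiency of plain column storage.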
It should be noted that the above implementations for determining the column fields to be combined are only examples of this scheme; in practical applications, the column fields to be combined can also be determined in other ways.
Accordingly, after the column fields to be combined are determined, their data needs to be merged, and this can likewise be done in several ways. In practical applications, the merging of the data of the first column fields must ensure that the data corresponding to each attribute can still be identified when a row is read. The following implementations are given as examples.
In one implementation, on the basis of any of the foregoing embodiments, the merging module 62 may include:
a first processing unit, configured to sort the data of the first column fields under each row field according to a preset order of the first column fields and then merge it, the merged data serving as the data of the composite column field under that row field.
Specifically, in this implementation an arrangement order is preset for the column fields. The arrangement order refers to the position that each attribute's data occupies in the merged data. Storing the data of a composite attribute in a fixed order effectively improves read efficiency, achieves a fast read similar to that of row storage, and can effectively improve the efficiency of online retrieval.
Because this implementation merges data in a preset order, the merged data supports sequential access: the data of each attribute is stored sequentially within the document's composite attribute data, so reading the composite attribute data of a document is a sequential access, which achieves the efficient reads of row storage.
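Merging in a preset order can be sketched as follows. The table layout (column field, then document id) and the use of a tuple for the merged data are assumptions made for the example.

```python
def merge_in_order(table, order, composite_name):
    """Merge the columns listed in `order` into one composite column.

    table: column -> {document id -> value}
    order: the preset order of the first column fields
    """
    rows = sorted(set().union(*(table[name] for name in order)))
    # Each attribute's value sits at a fixed position given by `order`.
    composite = {row: tuple(table[name].get(row) for name in order) for row in rows}
    merged = {name: cells for name, cells in table.items() if name not in order}
    merged[composite_name] = composite
    return merged


def read_attribute(merged, composite_name, order, row, attribute):
    """Recover one attribute from the merged data by its preset position."""
    return merged[composite_name][row][order.index(attribute)]


table = {
    "title": {"doc1": "A", "doc2": "B"},
    "author": {"doc1": "Wang", "doc2": "Li"},
    "clicks": {"doc1": 10, "doc2": 3},
}
order = ["title", "author"]
merged = merge_in_order(table, order, "title_author")
```

Since the position alone identifies the attribute, no per-value tag is stored, and one lookup of `title_author` returns all merged attributes of a document together.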
In another implementation, on the basis of any of the foregoing embodiments, the merging module 62 may include:
an identification unit, configured to add the identifier of the corresponding column field to the data of each first column field under each row field;
a second processing unit, configured to merge the identifier-tagged data of the first column fields under each row field, the merged data serving as the data of the composite column field under that row field.
Specifically, in this implementation an identifier is added to the data of each column field to indicate the attribute to which the data corresponds. By tagging the data to be merged with attribute identifiers, the data corresponding to each attribute can be identified within the merged data, which makes data merging more flexible.
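Identifier-based merging can be sketched in the same style. Using the column name as the identifier and a list of `(identifier, value)` pairs as the merged data are assumptions for illustration.

```python
def merge_with_identifiers(table, first_columns, composite_name):
    """Merge columns into one composite column, tagging each value with its column."""
    rows = sorted(set().union(*(table[name] for name in first_columns)))
    composite = {
        # Only attributes the document actually has are stored, each with its tag.
        row: [(name, table[name][row]) for name in first_columns if row in table[name]]
        for row in rows
    }
    merged = {name: cells for name, cells in table.items() if name not in first_columns}
    merged[composite_name] = composite
    return merged


def read_attribute(merged, composite_name, row, attribute):
    """Find an attribute in the merged data by its identifier."""
    for identifier, value in merged[composite_name][row]:
        if identifier == attribute:
            return value
    return None


table = {
    "title": {"doc1": "A", "doc2": "B"},
    "author": {"doc1": "Wang"},  # doc2 has no author
}
merged = merge_with_identifiers(table, ["title", "author"], "title_author")
```

Compared with position-based merging, the tags cost extra space but let documents omit attributes and let the set of merged attributes vary from row to row.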
In practical applications, to facilitate subsequent data reading, description information about a composite attribute, that is, information relating to the composite attribute, can also be recorded. Optionally, as shown in Fig. 6B, in one implementation the system further includes:
a description module 63, configured to update the description file of the column storage table, the description file recording each column field of the column storage table and the attribute characterized by each column field.
Specifically, updating here includes but is not limited to operations such as creating, deleting, and modifying. Optionally, on the basis of this implementation, the description module 63 includes:
a description deletion unit, configured to delete, from the description file of the column storage table, the recorded first column fields and the attributes they characterize;
a description addition unit, configured to add, to the description file of the column storage table, the composite column field and the attribute it characterizes, the attribute characterized by the composite column field including the attributes characterized by the first column fields.
In general, in this embodiment a description file is established and maintained for the column storage table. When the description information of a column field in the column storage table changes, for example when at least two column fields are merged, the attribute characterized by a column field changes, or a column field is deleted, the description file of the column storage table needs to be updated. The description file records the description information of each column field of the column storage table. The description information includes but is not limited to the attribute characterized by the column field and, for a composite column field obtained by merging, the way in which its data was merged.
By maintaining the description file of the column storage table, the correspondence between column fields and attributes can be maintained, and the attribute characterized by each column field can be updated promptly whenever column fields are combined and merged. This keeps the stored data up to date and makes subsequent data reading more convenient and accurate.
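The work of the description deletion unit and the description addition unit can be sketched as below. Representing the description file as a dictionary, and the field names `attributes` and `merge`, are assumptions for the example; the source does not specify a file format.

```python
def update_description(description, first_columns, composite_name, merge_mode):
    """Replace the entries of the merged columns with one composite entry.

    description: column field -> {"attributes": [...], "merge": mode or None}
    """
    attributes = []
    for name in first_columns:
        # Deletion unit: remove each first column field and its attributes.
        attributes.extend(description.pop(name)["attributes"])
    # Addition unit: record the composite field, its attributes, and the merge mode.
    description[composite_name] = {"attributes": attributes, "merge": merge_mode}
    return description


description = {
    "title": {"attributes": ["title"], "merge": None},
    "author": {"attributes": ["author"], "merge": None},
    "clicks": {"attributes": ["clicks"], "merge": None},
}
update_description(description, ["title", "author"], "title_author", "preset_order")
```

Recording the merge mode alongside the attribute list is what lets a reader later split the composite data back into individual attributes.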
The data processing system provided in this embodiment retains the efficient data updates of column storage while achieving, when the data of a single document is read, an efficient read similar to that of row storage. It thus combines the advantages of both approaches and effectively improves the efficiency of data processing.
In practical applications, data can also be managed by type to facilitate data management. Fig. 7 is a structural diagram of a data processing system provided by Embodiment 5 of the present application. Referring to Fig. 7, on the basis of any of the foregoing embodiments, the data processing system further includes:
a storage management module 71, configured to look up, from the column storage tables corresponding to the respective types, the first column storage table corresponding to the type of the document to be written;
the storage management module 71 being further configured to extract the corresponding attribute data from the document to be written according to the attribute characterized by each column field in the first column storage table;
the storage management module 71 being further configured to write the attribute data into the first column storage table.
This implementation establishes a column storage table for each type, assigns document data with similar attributes to the same type, and stores and manages data with each type's column storage table as the unit, which isolates different types of data from one another. In this embodiment, the document attribute data of each type is managed in the form of a single column storage table, and the column storage tables of different types are independent of one another. The column storage table of a given type can also be further divided for management.
Optionally, in one implementation, on the basis of Embodiment 5, the storage management module 71 includes:
a version management unit, configured to determine, according to the row field currently to be written in the first column storage table, the first version to which that row field belongs from the versions corresponding to the first column storage table, where different versions characterize different row field ranges of the first column storage table;
the version management unit being further configured to detect whether the amount of data written in the first version has reached a preset saturation condition; if so, to establish a second version and write the attribute data into a row field characterized by the second version; otherwise, to write the attribute data into the row field to be written.
Specifically, different versions correspond to different row field ranges. The row field to be written is the row field that will receive the document attribute data currently being written, and it can be determined in several ways. In this implementation, writing data refers to writing the data of a new document, that is, a document that does not yet exist in the current column storage table.
In addition, storage space can be freed by recycling versions. Optionally, on the basis of Embodiment 5, the system further includes:
a version merging module, configured to detect whether a third version exists among the versions corresponding to the first column storage table, where the amount of valid data stored in the row fields characterized by the third version is below a preset threshold, valid data being data that has not been deleted;
the version merging module being further configured to count a first quantity, that is, the number of row fields that store valid data among the row fields characterized by the third version, and to determine a fourth version from the versions corresponding to the first column storage table, where the number of unwritten row fields among the row fields characterized by the fourth version is not less than the first quantity;
the version merging module being further configured to transfer the valid data stored in the third version to the fourth version and to mark the third version as an invalid version.
Specifically, this implementation merges versions dynamically according to the amount of valid data in each version. When the valid data stored in a version is sparse, that version can be merged into another version to release the storage space occupied by its invalid data and to improve the efficiency of subsequent data reading and retrieval.
The data processing system provided in this embodiment stores each type of document attribute data in its own column storage table. The column storage tables of different types are independent of one another, which prevents different types of document attribute data from interfering with each other and improves the efficiency and accuracy of data processing.
In addition, to facilitate retrieval of the data in the column storage table, an index can be established and maintained for it. Accordingly, Fig. 8 is a structural diagram of a data processing system provided by Embodiment 6 of the present application. Referring to Fig. 8, on the basis of any of the foregoing embodiments, the data processing system further includes:
an index management module 81, configured to establish an index for the column storage table, the index including the storage address of the data of each column field under each row field in the column storage table.
The index can take various forms. Preferably, a k-v index (primary key index) is used. The format of the storage address can be determined by the format type of the data. Preferably, for fixed-length data, whose length is constant, the storage address need only record the starting address of the data; for variable-length data, the storage address records the length of the data in addition to its starting address.
Preferably, taking data deletion as an example, on the basis of the above embodiment the system further includes:
A data deletion module, configured to look up the storage address of the data to be deleted in the index corresponding to the column storage table to which the data to be deleted belongs;
The data deletion module is further configured to add a first mark to the storage address of the data to be deleted in the index, to indicate that the data to be deleted is invalid.
Specifically, when data in a column storage table needs to be deleted, the storage address of the data can be found in the table's index, and a first mark indicating that the data is invalid is added to that storage address in the index. The effect of deletion is thus achieved without physically deleting the data. When a subsequent unified update operation is triggered, for example when the amount of invalid data reaches a certain threshold, all current invalid data are removed together.
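Marking instead of physically deleting can be sketched as below. The class `TombstoneIndex`, its `purge_threshold` parameter and the in-memory dictionaries are hypothetical stand-ins for the index and the stored data; the sketch also assumes each key is written only once.

```python
class TombstoneIndex:
    """Index that marks entries as deleted and purges them in batches (illustrative only)."""

    def __init__(self, purge_threshold):
        self.purge_threshold = purge_threshold  # invalid entries tolerated before a purge
        self.entries = {}    # key -> {"address": int, "deleted": bool}
        self.storage = {}    # address -> stored value
        self._next_address = 0

    def put(self, key, value):
        """Store a value and record its storage address in the index."""
        address = self._next_address
        self._next_address += 1
        self.storage[address] = value
        self.entries[key] = {"address": address, "deleted": False}

    def get(self, key):
        """Return the value, or None if the key is missing or marked invalid."""
        entry = self.entries.get(key)
        if entry is None or entry["deleted"]:
            return None
        return self.storage[entry["address"]]

    def delete(self, key):
        """Add the invalid mark only; data is physically removed later, in bulk."""
        entry = self.entries.get(key)
        if entry is None or entry["deleted"]:
            return
        entry["deleted"] = True
        invalid = sum(1 for e in self.entries.values() if e["deleted"])
        if invalid >= self.purge_threshold:
            self.purge()

    def purge(self):
        """Remove all data currently marked invalid, together with its index entries."""
        for key in [k for k, e in self.entries.items() if e["deleted"]]:
            del self.storage[self.entries[key]["address"]]
            del self.entries[key]
```

With `purge_threshold=2`, the first `delete` only hides the value, and the second triggers the bulk removal of both.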
Another scenario is updating stored data. Accordingly, on the basis of any of the foregoing embodiments, the system further includes:
A data update module, configured to look up a second column storage table according to the document to which the data to be written belongs, the second column storage table containing a first row field that characterizes the document to which the data to be written belongs;
The data update module is further configured to update, according to the attribute to which the data to be written belongs, the data of a second column field under the first row field in the second column storage table, the second column field characterizing the attribute to which the data to be written belongs.
It can be understood that if the second column storage table can be found, the document to which the data to be written belongs is an already stored document, so the operation is an update to stored data. If the second column storage table cannot be found, the data to be written belongs to a new document; where this embodiment is combined with Embodiment 5, the relevant procedure of Embodiment 5 may apply. In this embodiment, after the document to which the data to be written belongs is determined to be a stored document, the current data under the attribute to which the data to be written belongs is updated.
Data in different formats may be updated in different ways. Preferably, in one embodiment, the data update module is specifically configured to, if the data format corresponding to the second column field is fixed-length data, update the data of the second column field under the first row field to the data to be written. The data update module is further specifically configured to, if the data format corresponding to the second column field is variable-length data, store the data to be written as a patch file for the data of the second column field under the first row field, and, in the index corresponding to the second column storage table, record the storage address of the data of the second column field under the first row field as the storage address of the data to be written.
In this embodiment, likewise, to save data processing resources, variable-length stored data is not rewritten in real time when it needs to be updated. Instead, the data to be written is stored as a patch file, which achieves the effect of an update. When a subsequent unified data update is triggered, for example when the number of patch files in the column storage table reaches a certain threshold, all data to be updated in the column storage table are replaced with the data in the corresponding latest patch files and all patch files are removed. This centralizes data updates and releases storage space, and avoids the frequent resource consumption caused by real-time processing.
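The two update strategies and the batched patch merge can be sketched as follows. The class `ColumnStore`, the parameters `fixed_columns` and `max_patches`, and the use of an in-memory list in place of patch files are all assumptions for illustration; for simplicity the sketch also routes the first write of a variable-length value through the patch path.

```python
class ColumnStore:
    """In-place updates for fixed-length columns, patches for variable-length ones (sketch)."""

    def __init__(self, fixed_columns, max_patches):
        self.fixed_columns = set(fixed_columns)  # columns holding fixed-length data
        self.max_patches = max_patches           # patch count that triggers a unified update
        self.base = {}     # (pk, column) -> value in the base data
        self.patches = []  # stand-in for patch files
        self.index = {}    # (pk, column) -> ("base", key) or ("patch", position)

    def update(self, pk, column, value):
        key = (pk, column)
        if column in self.fixed_columns:
            # Fixed-length data: overwrite in place.
            self.base[key] = value
            self.index[key] = ("base", key)
            return
        # Variable-length data: store a patch and point the index at it.
        self.patches.append(value)
        self.index[key] = ("patch", len(self.patches) - 1)
        if len(self.patches) >= self.max_patches:
            self.compact()

    def read(self, pk, column):
        kind, ref = self.index[(pk, column)]
        return self.base[ref] if kind == "base" else self.patches[ref]

    def compact(self):
        """Fold the latest patch of each value into the base data and drop all patches."""
        for key, (kind, ref) in list(self.index.items()):
            if kind == "patch":
                self.base[key] = self.patches[ref]
                self.index[key] = ("base", key)
        self.patches.clear()
```

Because the index always points at the newest patch for a value, older patches for the same value are simply discarded during `compact`.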
The data processing system provided in this embodiment establishes and maintains an index for the column storage table and, on that basis, performs operations such as deleting data and updating existing data, thereby providing diverse data processing functions while avoiding the resource consumption caused by frequent processing.
Fig. 9 is an example architecture diagram of a data processing system provided in Embodiment 7. The terms involved are explained as follows:
Schema: the description file of each column storage table in the storage system, recording the attribute characterized by each column field in the column storage table;
Table: the column storage table corresponding to each type; a single storage table is the unit of management, and the column storage tables corresponding to different types are independent of one another;
Segment: the data in a single column storage table is managed in versions according to when it was updated; newly added data may be stored by creating a new segment;
Primary key index: a maintained k-v index that quickly locates the storage address of attribute data according to the primary key (pk);
Patch: updates to variable-length data are stored in the form of patch files;
Version: states the status of the current version, either available or unavailable.
Specifically, in this scheme the document attribute data of each type is managed in the form of a single column storage table (table), and the tables are independent of one another. Within a table, the data corresponding to each attribute (field) is managed in segments. The attribute data of a new document is written to the current latest segment, and once the data written to the current latest version reaches a certain scale, the creation of a new segment is triggered. Segments are therefore created in chronological order. Segments also do not grow without limit: versions can be merged dynamically according to the amount of valid data in each segment. In addition, to make the data in a table easier to retrieve, each table maintains a corresponding k-v index, through which the storage location of data can be found quickly by primary key pk (the data in the figure denotes the storage location of the first datum of the document attribute data; for variable-length data, the storage location also includes, besides the storage location of the first datum, the length of the variable-length data, that is, the stored offset). For data deletion, it is only necessary to add a mark to the storage location of the data to be deleted in the primary key index table. For updates to existing data, this scheme proposes two strategies: fixed-length data is updated in place, and variable-length data is updated by means of patch files.
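The write path of this architecture (a schema-driven table, segments that roll over once full, and a primary key index) can be sketched as below. The class `Table`, the row-count saturation rule `segment_capacity` and the `(segment number, row number)` address format are hypothetical simplifications, and the sketch assumes a non-empty schema.

```python
class Table:
    """A column storage table split into segments, with a primary key index (sketch)."""

    def __init__(self, schema, segment_capacity):
        self.schema = list(schema)                # attributes characterized by the column fields
        self.segment_capacity = segment_capacity  # rows per segment before a new one is created
        self.segments = [self._new_segment()]
        self.pk_index = {}                        # pk -> (segment number, row number)

    def _new_segment(self):
        # Each segment keeps one list of values per column field.
        return {column: [] for column in self.schema}

    def _row_count(self, segment):
        return len(segment[self.schema[0]])

    def write_document(self, pk, document):
        """Write a new document's attribute data to the latest segment."""
        if self._row_count(self.segments[-1]) >= self.segment_capacity:
            self.segments.append(self._new_segment())  # latest segment is saturated
        segment = self.segments[-1]
        for column in self.schema:
            segment[column].append(document.get(column))
        self.pk_index[pk] = (len(self.segments) - 1, self._row_count(segment) - 1)

    def read_document(self, pk):
        """Locate the row through the primary key index and read every column."""
        segment_number, row = self.pk_index[pk]
        segment = self.segments[segment_number]
        return {column: segment[column][row] for column in self.schema}
```

Keeping one list per column inside each segment preserves the column layout, while the index lets all attributes of one document be read with a single lookup.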
The data processing system provided in this embodiment not only has the advantage of column storage in updating data efficiently, but can also, when reading the data of a single document, achieve efficient reads similar to those of row storage, thereby combining the advantages of both approaches and effectively improving the efficiency of data processing.
Fig. 10 is a schematic structural diagram of the data processing system provided in Embodiment 8 of the present application. As shown in Fig. 10, the data processing system 700 includes at least one processor 701, a memory 702, and a communication interface 703, connected by a bus 704. The memory 702 stores computer-executable instructions; the at least one processor 701 executes the computer-executable instructions stored in the memory 702, so that the data processing system exchanges data with an external server through the communication interface 703 to carry out the method of any of the foregoing embodiments.
The processors 701 of the above data processing system 700 may include processors of different types or processors of the same type. A processor may be any of the following: a central processing unit (CPU), an ARM processor, a field programmable gate array (FPGA), an application-specific processor, or another device with computing capability. In an optional embodiment, the at least one processor may also be integrated as a many-core processor.
The memory 702 of the above data processing system 700 may be any one or any combination of the following storage media: random access memory (RAM), read-only memory (ROM), non-volatile memory (NVM), a solid state drive (SSD), a mechanical hard disk, a magnetic disk, or a disk array.
The communication interface 703 is used for data exchange between the data processing system 700 and other devices. The communication interface may be any one or any combination of the following devices with network access capability: a network interface (such as an Ethernet interface), a wireless network card, and the like.
The bus may include an address bus, a data bus, a control bus, and so on; for ease of representation, the bus is shown as a single thick line in the figure. The bus may be any one or any combination of the following devices for wired data transmission: an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, and the like.
The present application also provides a computer-readable storage medium storing computer-executable instructions; when at least one processor of the data processing system executes the computer-executable instructions, the data processing system carries out the method of any of the above embodiments.
The present application also provides an electronic device that includes computer-executable instructions stored in a computer-readable storage medium. At least one processor of the data processing system can read the computer-executable instructions from the computer-readable storage medium, and the at least one processor executes them to carry out the method of any of the above embodiments.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working process of the system described above may refer to the corresponding process in the foregoing method embodiments, and is not repeated here.
Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be completed by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments. The storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical discs.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some or all of their technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (30)

1. A data processing method, characterized by comprising:
determining, among the column fields of a column storage table, the first column fields that need to be combined, wherein the row fields of the column storage table characterize documents and the column fields of the column storage table characterize attributes;
merging the first column fields into a composite column field, and merging the data of the first column fields under each row field, the merged data serving as the data of the composite column field under that row field.
2. The method according to claim 1, characterized in that merging the data of the first column fields under each row field, the merged data serving as the data of the composite column field under that row field, comprises:
sorting the data of the first column fields under each row field according to a preset order of the first column fields and then merging them, the merged data serving as the data of the composite column field under that row field.
3. The method according to claim 1, characterized in that merging the data of the first column fields under each row field, the merged data serving as the data of the composite column field under that row field, comprises:
adding, to the data of each first column field under each row field, an identifier of that column field;
merging the identified data of the first column fields under each row field, the merged data serving as the data of the composite column field under that row field.
4. The method according to claim 1, characterized in that determining, among the column fields of the column storage table, the first column fields that need to be combined comprises:
counting the access frequency of each column field in the column storage table;
taking column fields with similar access frequencies as the first column fields.
5. The method according to claim 1, characterized in that determining, among the column fields of the column storage table, the first column fields that need to be combined comprises:
counting the data update frequency of each column field in the column storage table;
taking column fields whose data update frequency is below a preset frequency as the first column fields.
6. The method according to claim 1, characterized in that the method further comprises:
updating the description file of the column storage table, the description file recording each column field of the column storage table and the attribute characterized by each column field.
7. The method according to claim 6, characterized in that updating the description file of the column storage table comprises:
deleting the first column fields recorded in the description file of the column storage table and the attributes characterized by the first column fields;
adding the composite column field and the attribute characterized by the composite column field to the description file of the column storage table, the attribute characterized by the composite column field including the attributes characterized by the first column fields.
8. The method according to any one of claims 1-7, characterized in that the method further comprises:
looking up, among the column storage tables corresponding to the various types, a first column storage table corresponding to the type of a document to be written;
extracting the corresponding attribute data from the document to be written according to the attribute characterized by each column field in the first column storage table;
writing the attribute data to the first column storage table.
9. The method according to claim 8, characterized in that writing the attribute data to the first column storage table comprises:
determining, according to the current row field to be written of the first column storage table, the first version to which the row field to be written belongs from among the versions corresponding to the first column storage table, wherein different versions characterize different row field ranges of the first column storage table;
detecting whether the amount of data written in the first version reaches a preset saturation condition; if so, establishing a second version and writing the attribute data in a row field characterized by the second version; otherwise, writing the attribute data in the row field to be written.
10. The method according to claim 8, characterized in that the method further comprises:
detecting whether a third version exists among the versions corresponding to the first column storage table, the amount of valid data stored in the row fields characterized by the third version being lower than a preset threshold, the valid data being data that has not been deleted;
counting, among the row fields characterized by the third version, a first quantity of row fields in which valid data is stored, and determining a fourth version from among the versions corresponding to the first column storage table, the number of row fields in which no data has been written among the row fields characterized by the fourth version being not less than the first quantity;
transferring the valid data stored in the third version to the fourth version, and marking the third version as an invalid version.
11. The method according to any one of claims 1-7, characterized in that the method further comprises:
establishing an index for the column storage table, the index including the storage address of the data of each column field under each row field in the column storage table.
12. The method according to claim 11, characterized in that the method further comprises:
looking up the storage address of data to be deleted in the index corresponding to the column storage table to which the data to be deleted belongs;
adding a first mark to the storage address of the data to be deleted in the index, to indicate that the data to be deleted is invalid.
13. The method according to claim 11, characterized in that the method further comprises:
looking up a second column storage table according to the document to which data to be written belongs, the second column storage table containing a first row field that characterizes the document to which the data to be written belongs;
updating, according to the attribute to which the data to be written belongs, the data of a second column field under the first row field in the second column storage table, the second column field characterizing the attribute to which the data to be written belongs.
14. The method according to claim 13, characterized in that updating, according to the attribute to which the data to be written belongs, the data of the second column field under the first row field in the second column storage table comprises:
if the data format corresponding to the second column field is fixed-length data, updating the data of the second column field under the first row field to the data to be written;
if the data format corresponding to the second column field is variable-length data, storing the data to be written as a patch file for the data of the second column field under the first row field, and, in the index corresponding to the second column storage table, recording the storage address of the data of the second column field under the first row field as the storage address of the data to be written.
15. A data processing system, characterized by comprising:
a grouping module, configured to determine, among the column fields of a column storage table, the first column fields that need to be combined, wherein the row fields of the column storage table characterize documents and the column fields of the column storage table characterize attributes;
a merging module, configured to merge the first column fields into a composite column field and to merge the data of the first column fields under each row field, the merged data serving as the data of the composite column field under that row field.
16. The system according to claim 15, characterized in that the merging module comprises:
a first processing unit, configured to sort the data of the first column fields under each row field according to a preset order of the first column fields and then merge them, the merged data serving as the data of the composite column field under that row field.
17. The system according to claim 15, characterized in that the merging module comprises:
an identification unit, configured to add, to the data of each first column field under each row field, an identifier of that column field;
a second processing unit, configured to merge the identified data of the first column fields under each row field, the merged data serving as the data of the composite column field under that row field.
18. The system according to claim 15, characterized in that the grouping module comprises:
a first statistics unit, configured to count the access frequency of each column field in the column storage table;
a first grouping unit, configured to take column fields with similar access frequencies as the first column fields.
19. The system according to claim 15, characterized in that the grouping module comprises:
a second statistics unit, configured to count the data update frequency of each column field in the column storage table;
a second grouping unit, configured to take column fields whose data update frequency is below a preset frequency as the first column fields.
20. The system according to claim 15, characterized in that the system further comprises:
a description module, configured to update the description file of the column storage table, the description file recording each column field of the column storage table and the attribute characterized by each column field.
21. The system according to claim 20, characterized in that the description module comprises:
a description deletion unit, configured to delete the first column fields recorded in the description file of the column storage table and the attributes characterized by the first column fields;
a description addition unit, configured to add the composite column field and the attribute characterized by the composite column field to the description file of the column storage table, the attribute characterized by the composite column field including the attributes characterized by the first column fields.
22. The system according to any one of claims 15-21, characterized in that the system further comprises:
a storage management module, configured to look up, among the column storage tables corresponding to the various types, a first column storage table corresponding to the type of a document to be written;
the storage management module being further configured to extract the corresponding attribute data from the document to be written according to the attribute characterized by each column field in the first column storage table;
the storage management module being further configured to write the attribute data to the first column storage table.
23. The system according to claim 22, characterized in that the storage management module comprises:
a version management unit, configured to determine, according to the current row field to be written of the first column storage table, the first version to which the row field to be written belongs from among the versions corresponding to the first column storage table, wherein different versions characterize different row field ranges of the first column storage table;
the version management unit being further configured to detect whether the amount of data written in the first version reaches a preset saturation condition; if so, to establish a second version and write the attribute data in a row field characterized by the second version; otherwise, to write the attribute data in the row field to be written.
24. The system according to claim 22, characterized in that the system further comprises:
a version merging module, configured to detect whether a third version exists among the versions corresponding to the first column storage table, the amount of valid data stored in the row fields characterized by the third version being lower than a preset threshold, the valid data being data that has not been deleted;
the version merging module being further configured to count, among the row fields characterized by the third version, a first quantity of row fields in which valid data is stored, and to determine a fourth version from among the versions corresponding to the first column storage table, the number of row fields in which no data has been written among the row fields characterized by the fourth version being not less than the first quantity;
the version merging module being further configured to transfer the valid data stored in the third version to the fourth version, and to mark the third version as an invalid version.
25. The system according to any one of claims 15-21, characterized in that the system further comprises:
an index management module, configured to establish an index for the column storage table, the index including the storage address of the data of each column field under each row field in the column storage table.
26. The system according to claim 25, characterized in that the system further comprises:
a data deletion module, configured to look up the storage address of data to be deleted in the index corresponding to the column storage table to which the data to be deleted belongs;
the data deletion module being further configured to add a first mark to the storage address of the data to be deleted in the index, to indicate that the data to be deleted is invalid.
27. The system according to claim 25, characterized in that the system further comprises:
a data update module, configured to look up a second column storage table according to the document to which data to be written belongs, the second column storage table containing a first row field that characterizes the document to which the data to be written belongs;
the data update module being further configured to update, according to the attribute to which the data to be written belongs, the data of a second column field under the first row field in the second column storage table, the second column field characterizing the attribute to which the data to be written belongs.
28. The system according to claim 27, characterized in that:
the data update module is specifically configured to, if the data format corresponding to the second column field is fixed-length data, update the data of the second column field under the first row field to the data to be written;
the data update module is further specifically configured to, if the data format corresponding to the second column field is variable-length data, store the data to be written as a patch file for the data of the second column field under the first row field, and, in the index corresponding to the second column storage table, record the storage address of the data of the second column field under the first row field as the storage address of the data to be written.
29. An electronic device, characterized by comprising: at least one processor and a memory;
the memory storing computer-executable instructions; the at least one processor executing the computer-executable instructions stored in the memory, to carry out the method according to any one of claims 1-14.
30. A computer-readable storage medium, characterized in that program instructions are stored in the computer-readable storage medium, and the program instructions, when executed by a processor, implement the method according to any one of claims 1-14.
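As an informal illustration of the column merging recited in claims 1 to 3, and not part of the claims themselves, the following sketch merges several column fields into one composite column field, tagging each value with the identifier of its original column. The function `merge_columns`, the dictionary-of-lists table layout and the use of `(column, value)` pairs as identifiers are assumptions made for this sketch.

```python
def merge_columns(table, first_columns, composite_name):
    """Merge the given column fields into one composite column field.

    table maps each column field to a list holding one value per row field.
    The order of first_columns serves as the preset order of the merged data.
    """
    row_count = len(next(iter(table.values()))) if table else 0
    # For every row, collect the tagged values of the columns being combined.
    merged = [
        [(column, table[column][row]) for column in first_columns]
        for row in range(row_count)
    ]
    # Keep the untouched columns and replace the merged ones with the composite.
    result = {c: values for c, values in table.items() if c not in first_columns}
    result[composite_name] = merged
    return result


table = {"title": ["A", "B"], "author": ["x", "y"], "views": [1, 2]}
combined = merge_columns(table, ["title", "author"], "title_author")
```

After the call, `combined` holds the `views` column unchanged and a `title_author` column whose per-row data contains both original values, so attributes that are usually read together can be fetched with one column access.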
CN201810014799.8A 2018-01-08 2018-01-08 Data processing method and system, electronic equipment and computer readable storage medium Pending CN110109910A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810014799.8A CN110109910A (en) 2018-01-08 2018-01-08 Data processing method and system, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN110109910A true CN110109910A (en) 2019-08-09

Family

ID=67483096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810014799.8A Pending CN110109910A (en) 2018-01-08 2018-01-08 Data processing method and system, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN110109910A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046074A (en) * 2019-12-13 2020-04-21 北京百度网讯科技有限公司 Streaming data processing method, device, equipment and medium
CN111259107A (en) * 2020-01-10 2020-06-09 北京百度网讯科技有限公司 Storage method and device of determinant text and electronic equipment
CN111581331A (en) * 2020-04-27 2020-08-25 北京字节跳动网络技术有限公司 Method and device for processing file, electronic equipment and computer readable medium
CN111639091A (en) * 2020-06-04 2020-09-08 山东汇贸电子口岸有限公司 Multi-table merging method based on table merging
CN112069172A (en) * 2020-08-21 2020-12-11 南京南瑞继保电气有限公司 Power grid data processing method and device, electronic equipment and storage medium
CN112632939A (en) * 2020-12-30 2021-04-09 北京达佳互联信息技术有限公司 Data processing method, data display method, data processing device and storage medium
CN113064919A (en) * 2021-03-31 2021-07-02 北京达佳互联信息技术有限公司 Data processing method, data storage system, computer device and storage medium
CN113591485A (en) * 2021-06-17 2021-11-02 国网浙江省电力有限公司 Intelligent data quality auditing system and method based on data science
CN114706527A (en) * 2022-03-24 2022-07-05 北京涵鑫盛科技有限公司 Distributed storage space release method and distributed system
CN115438114A (en) * 2022-11-09 2022-12-06 浪潮电子信息产业股份有限公司 Storage format conversion method, system, device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5899986A (en) * 1997-02-10 1999-05-04 Oracle Corporation Methods for collecting query workload based statistics on column groups identified by RDBMS optimizer
CN101354723A (en) * 2008-09-10 2009-01-28 金蝶软件(中国)有限公司 Method and apparatus for implementing combined field
CN106528821A (en) * 2016-11-16 2017-03-22 济南浪潮高新科技投资发展有限公司 Method for importing change column data into database
CN107038202A (en) * 2016-12-28 2017-08-11 阿里巴巴集团控股有限公司 Data processing method, device and equipment, computer-readable recording medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
XIANGWU DING 等: "A Column-based Self-organizing Hybrid Storage", 《THE 2ND INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE AND ENGINEERING》 *
丁祥武: "列存储系统的若干关键技术研究", 《中国博士学位论文全文数据库 信息科技辑》 *
鲍玉斌 等: "数据仓库环境下以用户为中心的数据清洗过程模型", 《计算机科学》 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046074A (en) * 2019-12-13 2020-04-21 北京百度网讯科技有限公司 Streaming data processing method, device, equipment and medium
CN111046074B (en) * 2019-12-13 2023-09-01 北京百度网讯科技有限公司 Streaming data processing method, device, equipment and medium
CN111259107A (en) * 2020-01-10 2020-06-09 北京百度网讯科技有限公司 Storage method and device of determinant text and electronic equipment
CN111259107B (en) * 2020-01-10 2023-08-18 北京百度网讯科技有限公司 Determinant text storage method and device and electronic equipment
CN111581331A (en) * 2020-04-27 2020-08-25 北京字节跳动网络技术有限公司 Method and device for processing file, electronic equipment and computer readable medium
CN111581331B (en) * 2020-04-27 2023-08-25 抖音视界有限公司 Method, device, electronic equipment and computer readable medium for processing text
CN111639091A (en) * 2020-06-04 2020-09-08 山东汇贸电子口岸有限公司 Multi-table merging method based on table merging
CN111639091B (en) * 2020-06-04 2023-09-19 山东汇贸电子口岸有限公司 Multi-table merging method based on merging table
CN112069172B (en) * 2020-08-21 2022-07-22 南京南瑞继保电气有限公司 Power grid data processing method and device, electronic equipment and storage medium
CN112069172A (en) * 2020-08-21 2020-12-11 南京南瑞继保电气有限公司 Power grid data processing method and device, electronic equipment and storage medium
CN112632939A (en) * 2020-12-30 2021-04-09 北京达佳互联信息技术有限公司 Data processing method, data display method, data processing device and storage medium
CN113064919A (en) * 2021-03-31 2021-07-02 北京达佳互联信息技术有限公司 Data processing method, data storage system, computer device and storage medium
CN113064919B (en) * 2021-03-31 2022-11-22 北京达佳互联信息技术有限公司 Data processing method, data storage system, computer device and storage medium
CN113591485A (en) * 2021-06-17 2021-11-02 国网浙江省电力有限公司 Intelligent data quality auditing system and method based on data science
CN114706527A (en) * 2022-03-24 2022-07-05 北京涵鑫盛科技有限公司 Distributed storage space release method and distributed system
CN114706527B (en) * 2022-03-24 2022-09-20 北京涵鑫盛科技有限公司 Distributed storage space release method and distributed system
CN115438114B (en) * 2022-11-09 2023-03-24 浪潮电子信息产业股份有限公司 Storage format conversion method, system, device, electronic equipment and storage medium
CN115438114A (en) * 2022-11-09 2022-12-06 浪潮电子信息产业股份有限公司 Storage format conversion method, system, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110109910A (en) Data processing method and system, electronic equipment and computer readable storage medium
CN100458779C (en) Index and its extending and searching method
CN102339315B (en) Index updating method and system of advertisement data
US9672241B2 (en) Representing an outlier value in a non-nullable column as null in metadata
CN102541757B (en) Write cache method, cache synchronization method and device
US11449564B2 (en) System and method for searching based on text blocks and associated search operators
KR101740271B1 (en) Method and device for constructing on-line real-time updating of massive audio fingerprint database
CN107491487A (en) A kind of full-text database framework and bitmap index establishment, data query method, server and medium
US11327985B2 (en) System and method for subset searching and associated search operators
CN102955792A (en) Method for implementing transaction processing for real-time full-text search engine
CN101136013A (en) Method for quick updating data domain in full text retrieval system
CN102231168A (en) Method for quickly retrieving resume from resume database
CN103186622A (en) Updating method of index information in full text retrieval system and device thereof
CN105630934A (en) Data statistic method and system
CN102411632B (en) Chain table-based memory database page type storage method
US9047363B2 (en) Text indexing for updateable tokenized text
CN103473324A (en) Multi-dimensional service attribute retrieving device and method based on unstructured data storage
CN101963993B (en) Method for fast searching database sheet table record
CN112416992B (en) Industry type identification method, system and equipment based on big data and keywords
CN111708895B (en) Knowledge graph system construction method and device
JP3666907B2 (en) Database file storage management system
CN116450664A (en) Data processing method, device, equipment and storage medium
CN106528590B (en) Query method and device
CN112131215B (en) Bottom-up database information acquisition method and device
CN108984720B (en) Data query method and device based on column storage, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200420

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Applicant after: Alibaba (China) Co.,Ltd.

Address before: 510627 Unit 01 (self-numbered), 13th floor, Tower B, Guangdian Pingyun Plaza, No. 163 Pingyun Road, Huangpu Avenue West, Tianhe District, Guangzhou, Guangdong Province

Applicant before: Guangdong Shenma Search Technology Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20190809
