CN105574093A - Method for establishing index in HDFS based spark-sql big data processing system - Google Patents

Method for establishing index in HDFS based spark-sql big data processing system

Info

Publication number
CN105574093A
Authority
CN
China
Prior art keywords
index
data
sql
spark
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510918956.4A
Other languages
Chinese (zh)
Other versions
CN105574093B (en)
Inventor
张鋆
冯骏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huaxun Ark Photoelectric Technology Co ltd
Shenzhen Huaxun Fangzhou Software Technology Co ltd
Original Assignee
Shenzhen Huaxun Fangzhou Software Technology Co Ltd
Shenzhen Huaxun Ark Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Huaxun Fangzhou Software Technology Co Ltd, Shenzhen Huaxun Ark Technology Co Ltd
Priority to CN201510918956.4A (CN105574093B)
Publication of CN105574093A
Priority to PCT/CN2016/094925 (WO2017096939A1)
Application granted
Publication of CN105574093B
Expired - Fee Related (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/134 Distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for establishing an index in an HDFS-based spark-sql big data processing system. In this system, indexes are added and deleted and data is inserted and deleted through SQL statements. During a data query, the system automatically determines whether an index exists on the queried column; if one exists, it looks up the file blocks recorded in the index and filters out the file blocks that do not need to be queried. Giving spark-sql an index function effectively increases query speed. For example, a typical spark-sql data table of 1,000 GB stored as 1 GB files comprises 1,000 files. Querying a single record requires scanning all 1,000 files with the conventional method, whereas only one file needs to be scanned once an index is established, a 1,000-fold improvement in efficiency. Drawing on experience with conventional relational databases, it is estimated that, in the general case, a spark-sql database with an index executes sql queries 100 to 10,000 times faster, or more, than one without an index.

Description

A method for establishing an index in an HDFS-based spark-sql big data processing system
Technical field
The present invention relates to a method for establishing an index in a data processing system, and in particular to a method for establishing an index in an HDFS-based spark-sql big data processing system.
Background art
Spark-sql adds support for standard sql query statements on top of the big data processing platform spark.
Spark is a general-purpose parallel framework, similar to Hadoop MapReduce, that was open-sourced by Berkeley, and it shares the advantages of Hadoop MapReduce. However, writing a spark big data processing program requires mastering the scala language and coding against the open api function interfaces, which is tedious and complicated, and the sql language that a large number of traditional database developers have mastered cannot be used on spark. Spark-sql was created to solve these problems. It applies the traditional database table concept to the spark processing framework, so that users can create and query tables with sql statements just as they would operate traditional database tables. Spark-sql automatically converts the corresponding operations into spark internal functions and hides the complicated processing details.
However, owing to the particular nature of the spark big data processing platform, spark-sql does not support establishing an index on a data table; that is, it does not support index creation statements like those of a traditional database, for example:
create index myindex on table t(b);
This means: establish an ordinary index named myindex on column b of table t.
After receiving the above command, a traditional relational database begins to establish an index on column b of table t.
There are many types of index, such as the B-tree index, Hash index, GiST index, and GIN index. Taking the B-tree index as an example, a relational database establishes an index as follows:
The database sets aside a separate storage area for storing the index tree.
A B-tree is generated from the fields in the column to be indexed (the column named b in the example), and the tree is saved to the designated storage area. Each node of the B-tree corresponds to one element of column b, and each node also contains a pointer that records where the corresponding element is stored in the database file.
When a new element is inserted into column b, it is also inserted into the B-tree (the B-tree adjusts itself automatically), and the tree node records the position of the element in the database file.
When an element is deleted from column b, it is also deleted from the B-tree (the B-tree adjusts itself automatically).
A database index is built on the physical files in which the database is stored, which is what "the position in the database file" above refers to. The format of a database file can be user-defined as required, so the position of data within a file can be represented in different ways. The general idea, however, is always to record the exact position of an element in the file, so that a later lookup of the element does not need to traverse the file but can locate the record quickly from the recorded position, thereby speeding up the search.
As shown in Figure 1, the data file corresponds to table t, which has two columns, a and b, and an index is established on column b. The tree structure on the right is the index. Each node in the index holds a pointer to the position of the corresponding element in the data file, and the index itself is also stored as a file.
When querying data, for example with the query statement
Select * from t where b = 22;
which retrieves all rows of table t whose column b equals 22, the database first parses the sql statement and then finds that an index exists on column b.
It then quickly finds element 22 directly in the index tree and, following the pointer of element 22, locates the row whose column b has the value 22, at address 0x90. It then fetches this row directly by its address and returns the result "5 22".
When an element is inserted into table t, the corresponding index tree is modified dynamically according to the value of column b; likewise, when an element is deleted from table t, the corresponding content in the index tree is deleted dynamically.
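The traditional lookup described above can be sketched in Python. This is only an illustration under stated assumptions, not the implementation of any particular database: a sorted list searched with `bisect` stands in for the B-tree, byte offsets stand in for row addresses such as 0x90, and the sample rows and the names `build_offset_index` and `lookup` are hypothetical.

```python
import bisect
import io

# Hypothetical data file for table t: one "a b" row per line.
ROWS = ["1 10", "5 22", "7 31", "9 15"]
DATA = "".join(row + "\n" for row in ROWS).encode()

def build_offset_index(data):
    """Return a sorted list of (b_value, byte_offset) pairs, one per row."""
    index = []
    offset = 0
    for line in data.splitlines(keepends=True):
        _a, b = line.decode().split()
        index.append((int(b), offset))
        offset += len(line)
    index.sort()  # a sorted list stands in for the B-tree here
    return index

def lookup(index, data, value):
    """Binary-search the index, then read only the rows it points to."""
    results = []
    reader = io.BytesIO(data)
    i = bisect.bisect_left(index, (value, -1))
    while i < len(index) and index[i][0] == value:
        reader.seek(index[i][1])  # jump straight to the recorded position
        results.append(reader.readline().decode().strip())
        i += 1
    return results

INDEX = build_offset_index(DATA)
print(lookup(INDEX, DATA, 22))  # ['5 22']
```

The point of the sketch is that the data file is never traversed: only the index is searched, and the file is read at the recorded position.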
Index technology in traditional relational databases is relatively mature. What is stated here is only its general principle; implementations vary widely and usually take no fixed form, but the basic principle is the same throughout. The establishment and implementation steps of other index types are not repeated here.
Spark-sql, by contrast, was designed without an index creation function, because its underlying file storage differs from that of a traditional relational database (it usually uses HDFS rather than an ordinary Linux or Windows file system) and because a table is usually very large and may be associated with thousands of physical files. Its design emphasizes the concurrency of data processing and neglects the efficiency of processing.
When spark-sql performs a data query, it normally searches the entire data table. The volume of data that spark-sql processes is usually very large, and one database may correspond to multiple physical files; spark-sql uses concurrency to search all of the files.
As shown in Figure 2, for the same sql query statement:
Select * from t where b = 22;
Spark-sql first parses the sql statement and then locates the data files of table t (four files here). These files are split into blocks of a specific size and distributed to different worker processes. The worker processes sequentially scan all the file blocks of the whole table to find the rows whose column b has the value 22 and return the results once they are found.
It can be seen that, without an index, spark-sql searches a table in a relatively simple and inefficient way: it has to scan every row of every data file.
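A rough Python sketch of this index-free scan follows. It is an illustration only: threads stand in for the spark worker processes, in-memory lists stand in for the data files, and the file names, row contents, and function names are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical contents of the four data files of table t; each row is (a, b).
TABLE_FILES = {
    "file1": [(1, 10), (2, 11)],
    "file2": [(3, 12), (4, 13)],
    "file3": [(5, 22), (6, 14)],
    "file4": [(7, 15), (8, 16)],
}

def scan_file(name, b_value):
    """Worker task: read every row of one file and keep the matching rows."""
    rows = TABLE_FILES[name]
    return name, len(rows), [row for row in rows if row[1] == b_value]

def full_scan(b_value):
    """Without an index, every file of the table is handed to a worker."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(lambda n: scan_file(n, b_value), TABLE_FILES))
    matches = [row for _, _, found in results for row in found]
    rows_read = sum(count for _, count, _ in results)
    return matches, len(results), rows_read

matches, files_scanned, rows_read = full_scan(22)
print(matches, files_scanned, rows_read)  # [(5, 22)] 4 8
```

Parallelism shortens the elapsed time, but the total work is unchanged: all four files and all eight rows are read to find a single matching row.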
Beyond simple select statements, a traditional relational database can apply index technology wherever a query is involved, including complex queries, subqueries, and nested queries, to reduce the query workload and speed up the query. Spark-sql has no such mechanism.
In summary, because spark-sql currently has no data index mechanism, it cannot achieve optimal query speed and is inefficient compared with a traditional relational database.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for establishing an index in an HDFS-based spark-sql big data processing system. The method allows spark-sql to adapt to more, and more flexible, application scenarios, speeds up the queries that spark-sql executes through sql statements, improves the execution efficiency of spark-sql, and makes fuller use of spark-sql's ability to process big data.
To solve the above technical problem, the present invention provides a method for establishing an index in an HDFS-based spark-sql big data processing system. In this system, indexes are added and deleted and data is inserted and deleted through SQL statements. When data is queried, the system automatically determines whether an index exists on the queried column; if one exists, it looks up the file blocks recorded in the index and filters out the file blocks that do not need to be queried.
When an index is added, a new index file is first created. The format of the index file can be set by configuration or other instructions; common formats include the B-tree and the Hash index. All records in the original table are then traversed. For each record, the location in HDFS or another file system of the value of the column to be indexed is determined, and the column value of the record and the corresponding file information are recorded and written into the index tree structure. After all records have been traversed, the index structure is saved as a file. Finally, the table metadata is updated: the new index information is written into the metadata of the table for use by subsequent queries.
When the index on a column of a table is deleted, it is only necessary to locate the corresponding index file and delete it, and at the same time to update the table metadata by deleting the index information from it.
After data is inserted, it is determined whether the inserted data involves an index. If it does, the corresponding index structure must be adjusted by adding the data and its associated file information to the index structure.
Throughout this flow, the data insertion flow of the data table remains unchanged from the original flow. Only after the insertion completes is the name of the file in which the data resides recorded, and an index node is constructed from the returned file name.
After data is deleted, it is determined whether the deleted data involves an index. If it does, the corresponding index structure must be adjusted by deleting the index node associated with the data.
Throughout this flow, the data deletion flow of the data table remains unchanged from the original flow. Only after the deletion completes is a step added that deletes the index information corresponding to the deleted data.
When data is queried, the corresponding node element is looked up in the index file according to the column value, and the element value is read to obtain the name of the file that holds the data. The original query flow then continues. The original query flow would eventually read all data files of the table; before it does so, the invalid files are filtered out according to the file names obtained in the previous step, and the query flow is executed only on the remaining files. The SQL operation is then performed on the queried data, and the query result is returned.
The index on a field of a record points to a file; that is, it records which file contains the record. When the record is later looked up, it is only necessary to go directly to that file according to the index, without scanning all the files that the table comprises.
Compared with the prior art, the method of the present invention for establishing an index in an HDFS-based spark-sql big data processing system has the following beneficial effects.
Adding an index function to spark-sql can effectively increase query speed. For example, a typical spark-sql data table of 1000 GB stored as 1 GB files is divided into 1000 files. To query a single record, the original approach needs to scan 1000 files; after an index is established, only 1 file needs to be scanned, a 1000-fold improvement in efficiency. Estimated for the general case, and drawing on experience with traditional relational databases, a spark-sql database with an index executes sql query statements 100 to 10000 times faster, or more, than one without an index.
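The arithmetic behind the 1000-fold figure can be written out as a short Python calculation. It counts files scanned only; it is not a measurement of query time, and it ignores the cost of the index lookup and any parallelism. The function name `files_to_scan` is hypothetical.

```python
def files_to_scan(table_size_gb, file_size_gb, matching_files=None):
    """Number of files read: all of them without an index, or only the
    files the index points to when matching_files is given."""
    total_files = table_size_gb // file_size_gb
    return total_files if matching_files is None else matching_files

without_index = files_to_scan(1000, 1)                 # 1000 GB in 1 GB files
with_index = files_to_scan(1000, 1, matching_files=1)  # record lives in 1 file
reduction = without_index / with_index
print(without_index, with_index, reduction)  # 1000 1 1000.0
```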
Brief description of the drawings
The method of the present invention for establishing an index in an HDFS-based spark-sql big data processing system is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a schematic diagram of an ordinary data table and index tree structure in the prior art.
Fig. 2 is a schematic diagram of a query without an index in an HDFS-based spark-sql big data processing system in the prior art.
Fig. 3 is a flowchart of adding an index according to the present invention.
Fig. 4 is a flowchart of deleting an index according to the present invention.
Fig. 5 is a flowchart of adding data according to the present invention.
Fig. 6 is a flowchart of deleting data according to the present invention.
Fig. 7 is a flowchart of querying data according to the present invention.
Fig. 8 is a schematic diagram of the structure of the HDFS distributed storage system.
Fig. 9 is a schematic diagram of a data table and index tree structure in the distributed storage system of the present invention.
Detailed description of the embodiments
As shown in Figs. 3 to 9, the method of this embodiment for establishing an index in an HDFS-based spark-sql big data processing system adds index support to spark-sql. As in a traditional relational database, indexes can be added and deleted and data inserted and deleted through SQL statements. When data is queried, the system automatically determines whether an index exists on the queried column; if one exists, it looks up the file blocks recorded in the index and filters out the file blocks that do not need to be queried, thereby speeding up the query.
1) The flow for adding an index, as shown in Figure 3.
Adding an index means adding an index for a column of an existing data table, so that subsequent queries on that column can be accelerated by the index.
When an index is added, a new index file is first created. The format of the index file can be set by configuration or other instructions; common formats include the B-tree and the Hash index. All records in the original table are then traversed. For each record, the location in HDFS or another file system of the value of the column to be indexed is determined, and the column value of the record and the corresponding file information are recorded and written into the index tree structure. After all records have been traversed, the index structure is saved as a file. Finally, the table metadata is updated: the new index information is written into the metadata of the table for use by subsequent queries.
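The index-building flow above can be sketched in Python under simplifying assumptions: a dictionary serialized as JSON on the local disk stands in for a B-tree or Hash index stored in HDFS, in-memory lists stand in for the data files, and `create_index`, the metadata layout, and the sample rows are hypothetical names chosen for the example.

```python
import json
import os
import tempfile
from collections import defaultdict

def create_index(table_files, column, index_path, metadata):
    """Scan every record once and map each column value to its data files."""
    index = defaultdict(set)
    for file_name, rows in table_files.items():
        for row in rows:
            index[row[column]].add(file_name)
    # Save the index structure as a file (JSON keys must be strings).
    with open(index_path, "w") as f:
        json.dump({str(k): sorted(v) for k, v in index.items()}, f)
    # Record the new index in the table metadata for later queries.
    metadata.setdefault("indexes", {})[column] = index_path
    return metadata

# Hypothetical table with two of its data files populated.
table_files = {
    "t1-p1": [{"a": 1, "b": 10}],
    "t1-p7": [{"a": 5, "b": 22}],
}
metadata = {}
index_path = os.path.join(tempfile.mkdtemp(), "myindex.json")
create_index(table_files, "b", index_path, metadata)

with open(index_path) as f:
    saved = json.load(f)
print(saved["22"])  # ['t1-p7']
```

Note that each index entry holds file names rather than positions inside a file, which is the file-level granularity the method relies on.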
2) The flow for deleting an index, as shown in Figure 4.
The flow for deleting the index on a column of a table is relatively simple: it is only necessary to locate the corresponding index file and delete it, and at the same time to update the table metadata by deleting the index information from it.
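A minimal Python sketch of this flow follows, assuming that the table metadata is a dictionary mapping indexed columns to index file paths. That layout and the name `drop_index` are assumptions made for illustration; the source does not specify a metadata format.

```python
import os
import tempfile

def drop_index(column, metadata):
    """Delete the index file for a column and remove it from the metadata."""
    index_path = metadata["indexes"].pop(column)  # update the table metadata
    if os.path.exists(index_path):
        os.remove(index_path)                     # delete the index file
    return metadata

# Set up a hypothetical index file so that there is something to drop.
fd, path = tempfile.mkstemp(suffix=".idx")
os.close(fd)
metadata = {"indexes": {"b": path}}

drop_index("b", metadata)
print(os.path.exists(path), metadata)  # False {'indexes': {}}
```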
3) The flow for inserting data (the table already has an index), as shown in Figure 5.
After data is inserted (including batch insertion, which in practice is a series of consecutive single-record insertions), it is determined whether the inserted data involves an index. If it does, the corresponding index structure must be adjusted by adding the data and its associated file information to the index structure.
Throughout this flow, the data insertion flow of the data table remains unchanged from the original flow. Only after the insertion completes is the name of the file in which the data resides recorded, and an index node is constructed from the returned file name.
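The insert flow can be sketched in Python as an unchanged insert followed by one added step. Here `original_insert` is a hypothetical stand-in for the existing spark-sql insert path, and the in-memory dictionaries used for data files and indexes are assumptions made for the example.

```python
def original_insert(table_files, row, target_file):
    """Stand-in for the unchanged insert flow; returns the file written to."""
    table_files.setdefault(target_file, []).append(row)
    return target_file

def insert_with_index(table_files, indexes, row, target_file):
    """Run the original insert, then add index entries for indexed columns."""
    file_name = original_insert(table_files, row, target_file)
    # Added step: build an index entry from the returned file name.
    for column, index in indexes.items():
        index.setdefault(row[column], set()).add(file_name)
    return file_name

table_files = {"t1-p1": [{"a": 1, "b": 10}]}
indexes = {"b": {10: {"t1-p1"}}}  # column name -> {value -> set of file names}
insert_with_index(table_files, indexes, {"a": 5, "b": 22}, "t1-p7")
print(indexes["b"])  # {10: {'t1-p1'}, 22: {'t1-p7'}}
```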
4) The flow for deleting data (the table already has an index), as shown in Figure 6.
After data is deleted (including batch deletion, which in practice is a series of consecutive single-record deletions), it is determined whether the deleted data involves an index. If it does, the corresponding index structure must be adjusted by deleting the index node associated with the data.
Throughout this flow, the data deletion flow of the data table remains unchanged from the original flow. Only after the deletion completes is a step added that deletes the index information corresponding to the deleted data.
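A Python sketch of the delete flow follows, under the same in-memory assumptions. One detail is an assumption of this sketch rather than something the source states: because an index entry points to a file and not to a single row, the file name is kept in the entry while any remaining row in that file still carries the value.

```python
def delete_rows(table_files, indexes, predicate):
    """Delete rows matching predicate, then prune the affected index entries."""
    for file_name, rows in table_files.items():
        # Original delete flow (stand-in): drop the matching rows.
        deleted = [r for r in rows if predicate(r)]
        kept = [r for r in rows if not predicate(r)]
        table_files[file_name] = kept
        # Added step: remove index entries that no longer match any row.
        for column, index in indexes.items():
            still_present = {r[column] for r in kept}
            for value in {r[column] for r in deleted} - still_present:
                files = index.get(value)
                if files is None:
                    continue
                files.discard(file_name)
                if not files:
                    del index[value]

table_files = {
    "t1-p1": [{"a": 1, "b": 10}, {"a": 2, "b": 22}],
    "t1-p7": [{"a": 5, "b": 22}],
}
indexes = {"b": {10: {"t1-p1"}, 22: {"t1-p1", "t1-p7"}}}
delete_rows(table_files, indexes, lambda r: r["a"] == 5)
print(indexes["b"])  # {10: {'t1-p1'}, 22: {'t1-p1'}}
```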
5) The flow for querying data (the table has an index), as shown in Figure 7.
When data is queried, the corresponding node element is looked up in the index file according to the column value, and the element value is read to obtain the name of the file that holds the data. The original query flow then continues. The original query flow would eventually read all data files of the table; before it does so, the invalid files are filtered out according to the file names obtained in the previous step, and the query flow is executed only on the remaining files. Filtering greatly reduces the number of files and lightens the query load. The SQL operation is then performed on the queried data, and the query result is returned.
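The query flow can be sketched in Python as a file-pruning step placed in front of the original scan. The seven file names follow the t1-p1 to t1-p7 naming used in this description; the row contents, the in-memory index layout, and the name `query_equals` are assumptions made for illustration.

```python
# Hypothetical table of seven data files; only two hold rows in this example.
TABLE_FILES = {f"t1-p{i}": [] for i in range(1, 8)}
TABLE_FILES["t1-p1"] = [{"a": 1, "b": 10}]
TABLE_FILES["t1-p7"] = [{"a": 5, "b": 22}]

# Column name -> {value -> set of file names}; only column b is indexed.
INDEXES = {"b": {10: {"t1-p1"}, 22: {"t1-p7"}}}

def query_equals(column, value):
    """Return (matching rows, files actually scanned) for column = value."""
    files = set(TABLE_FILES)  # the original flow would read all of these
    if column in INDEXES:
        # Index exists: keep only the files the index names for this value.
        files &= INDEXES[column].get(value, set())
    scanned = sorted(files)
    rows = [r for name in scanned for r in TABLE_FILES[name] if r[column] == value]
    return rows, scanned

print(query_equals("b", 22))          # ([{'a': 5, 'b': 22}], ['t1-p7'])
print(len(query_equals("a", 5)[1]))   # 7, because column a has no index
```

The scan itself is the original one; the index only shrinks the list of files handed to it.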
It should be emphasized here that the spark-sql-based index of this embodiment differs from the index of a traditional database in that it is designed for processing large volumes of data. A traditional database typically stores about 10 GB of data, whereas spark-sql can handle 1 PB, that is, 100,000 times the storage capacity of a typical traditional database.
A data table in an ordinary database generally corresponds to a few physical files in the file system. Spark-sql, in contrast, is typically deployed in combination with HDFS and stores files in a distributed manner, so that one data table can correspond to thousands or even tens of thousands of files stored on HDFS, as shown in Figure 8.
A spark-sql deployment usually consists of several spark nodes, and its underlying storage uses the HDFS distributed storage system; that is, the data files reside in HDFS. In the figure, t1-p1 denotes part1 of t1 and is one physical file; likewise, t1-p2 denotes the part2 file of t1. The whole table t1 consists of seven files, p1 to p7. Similarly, table t2 consists of three files.
When the original query flow executes a sql query statement, it scans all files of the table.
For example: Select * from t where b = 22
Spark-sql parses the above sql statement and then looks up the database files corresponding to table t. The result is seven files in total: t1-p1, t1-p2, t1-p3, t1-p4, t1-p5, t1-p6, and t1-p7. Leaving aside the splitting of oversized files, spark-sql sets up seven query tasks, one for each of these files, and each task scans its file until the qualifying record rows are found, so that all files are scanned.
On this basis, the present invention draws on the principle of the ordinary index and adapts it to the storage characteristics of spark-sql.
The index granularity of the present invention differs from that of a traditional database index. A traditional database index generally points to the address of a record within a file. Because a spark-sql database table usually consists of many files, the approach taken by the present method is that the index on a field of a record points to a file; that is, it records which file contains the record. When the record is later looked up, it is only necessary to go directly to that file according to the index, without scanning all the files that the table comprises.
Continuing the above example, table t consists of seven files in total, t1-p1, t1-p2, t1-p3, t1-p4, t1-p5, t1-p6, and t1-p7, and has two fields, a and b, with an index established on field b. Assuming that the table contains a number of records (not all are shown here), the index is established as shown in Figure 9.
Here, the table records are the original records inserted into the table; as they are inserted, a B-tree index is established on column b. Each node in the index tree records the value of the corresponding database record and the physical file in the HDFS file system in which that record resides.
When querying data, for example with the query statement
Select * from t where b = 22;
which retrieves all rows of table t whose column b equals 22, the database first parses the sql statement and then finds that an index exists on column b.
It then quickly finds element 22 directly in the index tree and, following the pointer of element 22, determines that the physical file containing the row whose column b has the value 22 is t1-p7. It then reads and searches only this file and returns the record once it is found.
When an element is inserted into table t, the corresponding index tree is modified dynamically according to the value of column b; likewise, when an element is deleted from table t, the corresponding content in the index tree is deleted dynamically.
It can be seen that, although the index concept of spark-sql in the present invention is similar to that of a traditional database, there is a fundamental difference. In view of the big data characteristics of spark-sql processing, the present invention changes the index granularity from a position within a file, as in a traditional database, to a file in the spark database, thereby avoiding the scanning of a large number of invalid files and the waste of system resources.
The index of the present invention applies to all sql statements. Whether a sql query is simple or complex, any query operation involving an indexed column first locates the files according to the index and then performs the sql query operation within the located files, which is fundamentally different from the approach of a traditional relational database.
Key points of the present invention
1. A mechanism supporting indexes is added to spark-sql, for example support for the following sql statements:
Create an index: create index myindex on t (b); the keywords are create index on
View indexes: show index from t; the keywords are show index from
Delete an index: drop index myindex on t; the keywords are drop index on
2. A file-based index mechanism
Spark-sql differs from a traditional relational database. One key point of the present invention is that the index is built at file level: the index points to a specific file in HDFS or another file system rather than to content within a file, so its granularity is coarser than in a traditional database. When an index is established on a database table according to the present invention, invalid query files can be filtered out effectively and the range of files queried can be reduced, thereby improving query efficiency.
3. The indexes established include, but are not limited to, unique indexes, primary key indexes, multi-attribute indexes, partial indexes, and expression indexes. These index types are consistent with the corresponding concepts in traditional databases. The data structures used to establish an index include, but are not limited to, the B-tree, Hash, GiST, and GIN; these data structures are consistent with the corresponding concepts in traditional databases.
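As an illustration of key point 1, the three index statements could be recognized with a few regular expressions, as in the Python sketch below. This is not how spark-sql parses SQL, and the source does not describe the parser; the patterns, the optional `table` keyword (which appears in the background example), and the name `parse_index_statement` are assumptions made for this example.

```python
import re

# One hypothetical pattern per statement kind; matching is case-insensitive.
PATTERNS = {
    "create": re.compile(
        r"^\s*create\s+index\s+(\w+)\s+on\s+(?:table\s+)?(\w+)"
        r"\s*\(\s*(\w+)\s*\)\s*;?\s*$",
        re.IGNORECASE,
    ),
    "show": re.compile(r"^\s*show\s+index\s+from\s+(\w+)\s*;?\s*$", re.IGNORECASE),
    "drop": re.compile(
        r"^\s*drop\s+index\s+(\w+)\s+on\s+(\w+)\s*;?\s*$", re.IGNORECASE
    ),
}

def parse_index_statement(sql):
    """Return (kind, *names) for an index statement, or None for other SQL."""
    for kind, pattern in PATTERNS.items():
        match = pattern.match(sql)
        if match:
            return (kind,) + match.groups()
    return None

print(parse_index_statement("create index myindex on t (b);"))
# ('create', 'myindex', 't', 'b')
print(parse_index_statement("show index from t;"))        # ('show', 't')
print(parse_index_statement("drop index myindex on t;"))  # ('drop', 'myindex', 't')
print(parse_index_statement("Select * from t where b = 22;"))  # None
```

A statement that returns None here would simply be passed on to the normal query path.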
The advantages of the present invention are as follows.
At present, no implementation scheme or method for supporting index technology in spark-sql has been publicly disclosed.
Consequently, in the currently disclosed technology, none of the database tables created in spark-sql has an index, and their query speed and efficiency are limited. Establishing an index mechanism for spark-sql can improve query speed by several orders of magnitude, making query efficiency and speed on massive data comparable to those of a traditional relational database.
It should be noted that the embodiments described above with reference to the drawings are intended only to illustrate the present invention, not to limit its scope. Those of ordinary skill in the art should understand that modifications or equivalent substitutions made to the present invention without departing from its spirit and scope shall all fall within the scope of the present invention. In addition, unless the context indicates otherwise, words appearing in the singular include the plural, and vice versa. Furthermore, unless specifically stated otherwise, all or part of any embodiment may be used in combination with all or part of any other embodiment.

Claims (9)

1. A method for establishing an index in an HDFS-based spark-sql big data processing system, characterized in that: in the HDFS-based spark-sql big data processing system, indexes are added and deleted and data is inserted and deleted through SQL statements; when data is queried, it is automatically determined whether an index exists on the queried column; and if one exists, the file blocks recorded in the index are looked up and the file blocks that do not need to be queried are filtered out.
2. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 1, characterized in that: when an index is added, a new index file is first created, the format of which can be set by configuration or other instructions, common formats including the B-tree and the Hash index; all records in the original table are then traversed; for each record, the location in HDFS or another file system of the value of the column to be indexed is determined, and the column value of the record and the corresponding file information are recorded and written into the index tree structure; after all records have been traversed, the index structure is saved as a file; and finally the table metadata is updated by writing the new index information into the metadata of the table for use by subsequent queries.
3. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 1, characterized in that: when the index on a column of a table is deleted, it is only necessary to locate the corresponding index file and delete it, and at the same time to update the table metadata by deleting the index information from it.
4. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 1, characterized in that: after data is inserted, it is determined whether the inserted data involves an index; if it does, the corresponding index structure is adjusted by adding the data and its associated file information to the index structure.
5. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 1, characterized in that: throughout the flow, the data insertion flow of the data table remains unchanged from the original flow; only after the insertion completes is the name of the file in which the data resides recorded, and an index node is constructed from the returned file name.
6. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 1, characterized in that: after data is deleted, it is determined whether the deleted data involves an index; if it does, the corresponding index structure is adjusted by deleting the index node associated with the data.
7. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 1, characterized in that: throughout the flow, the data deletion flow of the data table remains unchanged from the original flow; only after the deletion completes is a step added that deletes the index information corresponding to the deleted data.
8. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 1, characterized in that: when data is queried, the corresponding node element is looked up in the index file according to the column value, and the element value is read to obtain the name of the file that holds the data; the original query flow then continues; the original query flow would eventually read all data files of the table, and before it does so the invalid files are filtered out according to the file names obtained in the previous step, after which the query flow is executed on the remaining files; the SQL operation is then performed on the queried data, and the query result is returned.
9. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 1, characterized in that: the index on a field of a record points to a file, that is, it records which file contains the record; when the record is later looked up, it is only necessary to go directly to that file according to the index, without scanning all the files that the table comprises.
CN201510918956.4A 2015-12-10 2015-12-10 Method for establishing index in HDFS-based spark-sql big data processing system Expired - Fee Related CN105574093B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201510918956.4A CN105574093B (en) 2015-12-10 2015-12-10 Method for establishing index in HDFS-based spark-sql big data processing system
PCT/CN2016/094925 WO2017096939A1 (en) 2015-12-10 2016-08-12 Method for establishing index on hdfs-based spark-sql big-data processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510918956.4A CN105574093B (en) 2015-12-10 2015-12-10 Method for establishing index in HDFS-based spark-sql big data processing system

Publications (2)

Publication Number Publication Date
CN105574093A (en) 2016-05-11
CN105574093B (en) 2019-09-10

Family

ID=55884224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510918956.4A Expired - Fee Related CN105574093B (en) 2015-12-10 2015-12-10 Method for establishing index in HDFS-based spark-sql big data processing system

Country Status (2)

Country Link
CN (1) CN105574093B (en)
WO (1) WO2017096939A1 (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599062A (en) * 2016-11-18 2017-04-26 北京奇虎科技有限公司 Data processing method and device in SparkSQL system
CN106777278A (en) * 2016-12-29 2017-05-31 海尔优家智能科技(北京)有限公司 A kind of data processing method and device based on Spark
CN106844415A (en) * 2016-11-18 2017-06-13 北京奇虎科技有限公司 A kind of data processing method and device in SparkSQL systems
WO2017096939A1 (en) * 2015-12-10 2017-06-15 深圳市华讯方舟软件技术有限公司 Method for establishing index on hdfs-based spark-sql big-data processing system
CN107092685A (en) * 2017-04-24 2017-08-25 广州新盛通科技有限公司 A kind of method that file system and RDBMS store transaction data are used in combination
CN107391555A (en) * 2017-06-07 2017-11-24 中国科学院信息工程研究所 A kind of metadata real time updating method towards Spark Sql retrievals
CN108132986A (en) * 2017-12-14 2018-06-08 北京航天测控技术有限公司 A kind of immediate processing method of aircraft magnanimity biosensor assay data
CN107368517B (en) * 2017-06-02 2018-07-13 上海恺英网络科技有限公司 A kind of method and apparatus of high amount of traffic inquiry
CN108874897A (en) * 2018-05-23 2018-11-23 新华三大数据技术有限公司 Data query method and device
CN110019497A (en) * 2017-08-07 2019-07-16 北京国双科技有限公司 A kind of method for reading data and device
CN110046176A (en) * 2019-04-28 2019-07-23 南京大学 A kind of querying method of the large-scale distributed DataFrame based on Spark
CN111177102A (en) * 2019-12-25 2020-05-19 苏州浪潮智能科技有限公司 Optimization method and system for realizing HDFS (Hadoop distributed File System) starting acceleration
CN112015729A (en) * 2019-05-29 2020-12-01 核桃运算股份有限公司 Data management apparatus, method and computer storage medium thereof

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674154B (en) * 2019-09-26 2023-04-07 浪潮软件股份有限公司 Spark-based method for inserting, updating and deleting data in Hive
CN110928835A (en) * 2019-10-12 2020-03-27 虏克电梯有限公司 Novel file storage system and method based on mass storage
CN111125216B (en) * 2019-12-10 2024-03-12 中盈优创资讯科技有限公司 Method and device for importing data into Phoenix
CN111752804B (en) * 2020-06-29 2022-09-09 中国电子科技集团公司第二十八研究所 Database cache system based on database log scanning
CN113297204B (en) * 2020-07-15 2024-03-08 阿里巴巴集团控股有限公司 Index generation method and device
CN112231321B (en) * 2020-10-20 2022-09-20 中国电子科技集团公司第二十八研究所 Oracle secondary index and index real-time synchronization method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060041606A1 (en) * 2004-08-19 2006-02-23 Fujitsu Services Limited Indexing system for a computer file store
CN101344881A (en) * 2007-07-09 2009-01-14 中国科学院大气物理研究所 Index generation method and device and search system for mass file type data
CN101727465A (en) * 2008-11-03 2010-06-09 中国移动通信集团公司 Methods for establishing and inquiring index of distributed column storage database, device and system thereof
CN103631910A (en) * 2013-11-26 2014-03-12 烽火通信科技股份有限公司 Distributed database multi-column composite query system and method
CN104462291A (en) * 2014-11-27 2015-03-25 杭州华为数字技术有限公司 Method and device for data processing

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104133867A (en) * 2014-07-18 2014-11-05 中国科学院计算技术研究所 DOT in-fragment secondary index method and DOT in-fragment secondary index system
CN105574093B (en) * 2015-12-10 2019-09-10 深圳市华讯方舟软件技术有限公司 Method for establishing index in HDFS-based spark-sql big data processing system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
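Tang Zhenkun: "Design and Implementation of a Spark-Based Machine Learning Platform", China Dissertation Full-Text Database (Wanfang Data) *
Chao Pingfu et al.: "A Distributed Computing System Supporting Query and Analysis of Communication Data", Journal of East China Normal University (Natural Science) *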

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017096939A1 (en) * 2015-12-10 2017-06-15 深圳市华讯方舟软件技术有限公司 Method for establishing index on hdfs-based spark-sql big-data processing system
CN106844415A (en) * 2016-11-18 2017-06-13 北京奇虎科技有限公司 A kind of data processing method and device in SparkSQL systems
CN106599062A (en) * 2016-11-18 2017-04-26 北京奇虎科技有限公司 Data processing method and device in SparkSQL system
CN106844415B (en) * 2016-11-18 2021-08-20 北京奇虎科技有限公司 Data processing method and device in spark SQL system
CN106777278A (en) * 2016-12-29 2017-05-31 海尔优家智能科技(北京)有限公司 A kind of data processing method and device based on Spark
CN107092685A (en) * 2017-04-24 2017-08-25 广州新盛通科技有限公司 A kind of method that file system and RDBMS store transaction data are used in combination
CN107368517B (en) * 2017-06-02 2018-07-13 上海恺英网络科技有限公司 A kind of method and apparatus of high amount of traffic inquiry
CN107391555B (en) * 2017-06-07 2020-08-04 中国科学院信息工程研究所 Spark-Sql retrieval-oriented metadata real-time updating method
CN107391555A (en) * 2017-06-07 2017-11-24 中国科学院信息工程研究所 A kind of metadata real time updating method towards Spark Sql retrievals
CN110019497A (en) * 2017-08-07 2019-07-16 北京国双科技有限公司 A kind of method for reading data and device
CN108132986A (en) * 2017-12-14 2018-06-08 北京航天测控技术有限公司 A kind of immediate processing method of aircraft magnanimity biosensor assay data
CN108874897A (en) * 2018-05-23 2018-11-23 新华三大数据技术有限公司 Data query method and device
CN108874897B (en) * 2018-05-23 2019-09-13 新华三大数据技术有限公司 Data query method and device
CN110046176A (en) * 2019-04-28 2019-07-23 南京大学 A kind of querying method of the large-scale distributed DataFrame based on Spark
CN110046176B (en) * 2019-04-28 2023-03-31 南京大学 Spark-based large-scale distributed DataFrame query method
CN112015729A (en) * 2019-05-29 2020-12-01 核桃运算股份有限公司 Data management apparatus, method and computer storage medium thereof
CN112015729B (en) * 2019-05-29 2024-04-02 核桃运算股份有限公司 Data management device, method and computer storage medium thereof
CN111177102A (en) * 2019-12-25 2020-05-19 苏州浪潮智能科技有限公司 Optimization method and system for realizing HDFS (Hadoop distributed File System) starting acceleration

Also Published As

Publication number Publication date
CN105574093B (en) 2019-09-10
WO2017096939A1 (en) 2017-06-15

Similar Documents

Publication Publication Date Title
CN105574093A (en) Method for establishing index in HDFS based spark-sql big data processing system
CN109299102B (en) HBase secondary index system and method based on Elastcissearch
US9672235B2 (en) Method and system for dynamically partitioning very large database indices on write-once tables
US20150310129A1 (en) Method of managing database, management computer and storage medium
US11269954B2 (en) Data searching method of database, apparatus and computer program for the same
US20140201192A1 (en) Automatic data index establishment method
US11030196B2 (en) Method and apparatus for processing join query
CN112231321B (en) Oracle secondary index and index real-time synchronization method
US8880553B2 (en) Redistribute native XML index key shipping
WO2021179782A1 (en) Method, device and apparatus for improving execution efficiency of database appliance, and medium
US10990573B2 (en) Fast index creation system for cloud big data database
CN109597829B (en) Middleware method for realizing searchable encryption relational database cache
CN113704248B (en) Block chain query optimization method based on external index
US10558636B2 (en) Index page with latch-free access
JP5287071B2 (en) Database management system and program
CN111125216B (en) Method and device for importing data into Phoenix
KR101679011B1 (en) Method and Apparatus for moving data in DBMS
EP3091447B1 (en) Method for modifying root nodes and modifying apparatus
CN113961730A (en) Graph data query method, system, computer device and readable storage medium
KR101642072B1 (en) Method and Apparatus for Hybrid storage
CN112182028B (en) Data line number query method and device based on table of distributed database
JP5936465B2 (en) Multiple database automatic search device
CN111061721B (en) Data processing method and device
CN115033547A (en) Data processing method and device, electronic equipment and system
CN118069685A (en) Hudi data lake index creation method, use method and related products

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 518102 Guangdong Province, Baoan District Xixiang street Shenzhen City Tian Yi Lu Chen Tian Bao Industrial District thirty-seventh building 3 floor

Applicant after: SHENZHEN HUAXUN FANGZHOU SOFTWARE TECHNOLOGY Co.,Ltd.

Applicant after: CHINA COMMUNICATION TECHNOLOGY Co.,Ltd.

Address before: 518102 Guangdong Province, Baoan District Xixiang street Shenzhen City Tian Yi Lu Chen Tian Bao Industrial District thirty-seventh building 3 floor

Applicant before: SHENZHEN HUAXUN FANGZHOU SOFTWARE TECHNOLOGY Co.,Ltd.

Applicant before: CHINA COMMUNICATION TECHNOLOGY Co.,Ltd.

COR Change of bibliographic data
GR01 Patent grant
PP01 Preservation of patent right

Effective date of registration: 20210630

Granted publication date: 20190910

PD01 Discharge of preservation of patent

Date of cancellation: 20230421

Granted publication date: 20190910

TR01 Transfer of patent right

Effective date of registration: 20230606

Address after: 518102 room 404, building 37, chentian Industrial Zone, chentian community, Xixiang street, Bao'an District, Shenzhen City, Guangdong Province

Patentee after: Shenzhen Huaxun ark Photoelectric Technology Co.,Ltd.

Patentee after: SHENZHEN HUAXUN FANGZHOU SOFTWARE TECHNOLOGY Co.,Ltd.

Address before: 518102 3rd floor, building 37, chentian Industrial Zone, Baotian 1st Road, Xixiang street, Bao'an District, Shenzhen City, Guangdong Province

Patentee before: SHENZHEN HUAXUN FANGZHOU SOFTWARE TECHNOLOGY Co.,Ltd.

Patentee before: CHINA COMMUNICATION TECHNOLOGY Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190910