CN105574093B - Method for establishing an index in an HDFS-based spark-sql big data processing system - Google Patents

Method for establishing an index in an HDFS-based spark-sql big data processing system

Info

Publication number
CN105574093B
Authority
CN
China
Prior art keywords
index
data
sql
spark
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510918956.4A
Other languages
Chinese (zh)
Other versions
CN105574093A (en)
Inventor
张鋆
冯骏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huaxun Ark Photoelectric Technology Co ltd
Shenzhen Huaxun Fangzhou Software Technology Co ltd
Original Assignee
Shenzhen Huaxun Fangzhou Software Technology Co Ltd
Shenzhen Huaxun Ark Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Huaxun Fangzhou Software Technology Co Ltd and Shenzhen Huaxun Ark Technology Co Ltd
Priority to CN201510918956.4A
Publication of CN105574093A
Priority to PCT/CN2016/094925 (WO2017096939A1)
Application granted
Publication of CN105574093B
Expired - Fee Related (current legal status)
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/13: File access structures, e.g. distributed indices
    • G06F16/134: Distributed indices
    • G06F16/14: Details of searching files based on file metadata

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for establishing an index in an HDFS-based spark-sql big data processing system. SQL statements are used to add an index, delete an index, insert data, and delete data in the system. When data is queried, the system automatically determines whether an index exists on the queried column; if it does, the system looks up the file blocks recorded in the index and filters out the file blocks that do not need to be queried. Adding an index function to spark-sql can effectively increase query speed. For example, a typical spark-sql data table of 1000 GB, stored as files of 1 GB each, is split into 1000 files. Querying a single record originally requires scanning all 1000 files; once an index is established, only 1 file needs to be scanned, a 1000-fold improvement in efficiency. Estimated for the general case and based on experience with traditional relational databases, SQL queries on an indexed spark-sql database run 100 to 10,000 times faster, or more, than on one without an index.

Description

Method for establishing an index in an HDFS-based spark-sql big data processing system
Technical field
The present invention relates to a method for establishing an index in a data processing system, and in particular to a method for establishing an index in an HDFS-based spark-sql big data processing system.
Background art
Spark-sql adds support for standard SQL query statements on top of the big data processing platform spark.
The spark big data processing platform is a general-purpose parallel framework, similar to Hadoop MapReduce, that was open-sourced by Berkeley, and it has the advantages of Hadoop MapReduce. However, writing spark big data processing programs requires mastery of the Scala language and coding against the open API functions, which is tedious and complex, and the SQL skills that a large number of traditional database developers already have cannot be used on spark. Spark-sql solves this problem. It applies the traditional database table concept to the spark processing framework: users create and query tables with SQL statements as they would in a traditional database, and spark-sql automatically converts these operations into spark-internal operations, hiding the complex processing details.
However, owing to the particular nature of the spark big data processing platform, spark-sql does not support establishing an index on a data table; that is, it does not support index creation statements like those of traditional databases, such as:
create index myindex on table t(b);
This means: create an ordinary index named myindex on column b of table t.
On receiving this command, a traditional relational database starts building an index on column b of the table.
There are many types of index, such as B-tree, hash, GiST, and GIN indexes. Taking the B-tree index as an example, a relational database builds an index as follows:
The database sets aside a separate storage area for the index tree.
A B-tree is generated from the values in the column to be indexed (column b in the example) and saved in the designated storage area. Each node of the B-tree corresponds to an element of column b and also contains a pointer that records where the corresponding element is stored in the database file.
When a new element is inserted into column b, it is also inserted into the B-tree (the B-tree adjusts automatically), and the tree node records the position of the element in the database file.
When an element is deleted from column b, it is also deleted from the B-tree (the B-tree adjusts automatically).
A database index is built on the physical files in which the database stores its data, which is the "position in the database file" mentioned above. Database file formats can be user-defined as needed, so the position of data in a file can be expressed in different ways. The general idea, however, is always the same: the exact position of an element in the file is recorded, so that a later lookup of that element does not need to traverse the file and can instead locate the record quickly from the recorded position, which speeds up the search.
As shown in Fig. 1, the data file corresponds to table t, which has two columns, a and b, with an index established on column b. The tree structure on the right is the index; the element in each index node has a pointer to the position of the corresponding element in the data file. The index itself is also stored as a file.
When querying data, for example with the query statement
select * from t where b = 22;
which selects all rows of table t whose column b equals 22, the database first parses the SQL statement and finds that an index exists on column b.
It then finds element 22 directly in the index tree and follows that element's pointer to the row whose column b is 22, at address 0x90. It reads this row directly from that address and returns the result "5 22".
When an element is inserted into table t, the corresponding index tree is modified dynamically according to the value of column b; correspondingly, when an element is deleted from table t, the corresponding content in the index tree is also deleted dynamically.
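The record-level index described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular database implements a B-tree: a sorted list searched with `bisect` stands in for the tree, and the class name `RecordIndex`, the `rows` dictionary, and every address other than 0x90 are hypothetical.

```python
import bisect

class RecordIndex:
    """Sorted (value, address) pairs standing in for a B-tree on one column."""

    def __init__(self):
        self._entries = []  # kept sorted by (value, address)

    def insert(self, value, address):
        # Called whenever a row is inserted into the indexed column.
        bisect.insort(self._entries, (value, address))

    def delete(self, value, address):
        # Called whenever a row is deleted from the indexed column.
        i = bisect.bisect_left(self._entries, (value, address))
        if i < len(self._entries) and self._entries[i] == (value, address):
            del self._entries[i]

    def lookup(self, value):
        """Return the addresses of all rows whose indexed column equals value."""
        i = bisect.bisect_left(self._entries, (value, -1))
        addresses = []
        while i < len(self._entries) and self._entries[i][0] == value:
            addresses.append(self._entries[i][1])
            i += 1
        return addresses

# Hypothetical data file: address -> (a, b). Only the row at 0x90 comes from the text.
rows = {0x10: (1, 7), 0x50: (3, 15), 0x90: (5, 22)}

index_on_b = RecordIndex()
for address, (a, b) in rows.items():
    index_on_b.insert(b, address)

# select * from t where b = 22;  follows the pointer instead of scanning the file
result = [rows[address] for address in index_on_b.lookup(22)]
```

The point of the sketch is the granularity: each index entry points at the exact position of one row, which is the property the file-level index of the invention later relaxes.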
Indexing in traditional relational databases is a mature technology. Only the general principle is described here; implementations vary and take many forms, but the basic principle is the same. The construction methods and implementation steps of other index types are not repeated here.
In spark-sql, the underlying file storage differs from that of a traditional relational database (HDFS is generally used rather than an ordinary Linux or Windows file system), and tables are usually very large; a single table may even be associated with thousands of physical files. Spark-sql was therefore designed without an index creation function. Its design emphasizes concurrent data processing and neglects processing efficiency.
When spark-sql runs a data query, it normally searches the entire data table. The data volumes that spark-sql handles are usually very large, and one database may correspond to many physical files; spark-sql searches all of these files concurrently.
As shown in Fig. 2, take the same SQL query statement as an example:
select * from t where b = 22;
Spark-sql first parses the SQL statement and then locates the data files of table t (4 files here). These files are then split into blocks of a particular size and distributed to different worker processes. The worker processes sequentially scan all the file blocks of the whole table, looking for rows whose column b is 22, and return the result once such rows are found.
As can be seen, without an index spark-sql searches a table in a simple but inefficient way: it must scan every row of every data file.
Beyond simple select statements, a traditional relational database can apply indexes wherever a lookup is involved, including complex queries, subqueries, and nested queries, to reduce the amount of data examined and speed up the query. Spark-sql has no such mechanism.
In summary, because spark-sql currently has no data indexing mechanism, its query speed cannot be optimal, and it is inefficient compared with a traditional relational database.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for establishing an index in an HDFS-based spark-sql big data processing system. The method allows spark-sql to fit more, and more flexible, application scenarios, speeds up the execution of SQL queries in spark-sql, improves the execution efficiency of spark-sql, and more fully exploits spark-sql's ability to process big data.
To solve the above technical problem, the present invention provides a method for establishing an index in an HDFS-based spark-sql big data processing system, in which SQL statements are used to add an index, delete an index, insert data, and delete data in the system; when data is queried, the system automatically determines whether an index exists on the queried column and, if so, looks up the file blocks recorded in the index and filters out the file blocks that do not need to be queried.
When an index is added, a new index file is first created. The format of the index file can be set according to configuration and other instructions; common formats include B-tree and hash indexes. All records in the original table are then traversed. For each record, the position in HDFS or another file system of the value of the column to be indexed is determined, and the column value of the record and the corresponding file information are recorded and written into the index tree structure. After all records have been traversed, the index structure is saved as a file. Finally, the table metadata is updated: the new index information is written into the table's metadata for use by subsequent queries.
When the index on a table column is deleted, it is only necessary to locate the corresponding index file and delete it, and to update the table metadata by deleting the index information from it.
After a data record is inserted, the system determines whether the inserted data involves an index. If it does, the corresponding index structure must be adjusted: the record and its associated file information are added to the index structure.
Throughout this process, the table's existing data insertion flow is reused unchanged; the only addition is that, once the insertion completes, the name of the file holding the data is recorded, and an index node is built from the returned file name.
After a data record is deleted, the system determines whether the deleted data involves an index. If it does, the corresponding index structure must be adjusted: the index node associated with the record is deleted.
Throughout this process, the table's existing data deletion flow is reused unchanged; the only addition is that, once the deletion completes, the index information corresponding to the data is deleted.
When data is queried, the corresponding node element is looked up in the index file according to the column value, and the element's value is read to obtain the name of the file that holds the data. The original query flow then continues. The original query flow would eventually read and query all data files of the table; now, files that cannot contain the data are filtered out according to the file names obtained in the previous step, and the query flow continues on the remaining files only. The SQL operation is then executed on the queried data, and the query result is returned.
The index on a field of a record locates the record to a particular file; that is, it records which file contains the record. When the record is looked up later, the index leads directly to that file, and there is no need to scan all the files that make up the table.
Compared with the prior art, the method of the present invention for establishing an index in an HDFS-based spark-sql big data processing system has the following beneficial effects.
Adding an index function to spark-sql can effectively increase query speed. For example, a typical spark-sql data table of 1000 GB, stored as files of 1 GB each, is split into 1000 files. Querying a single record originally requires scanning all 1000 files; once an index is established, only 1 file needs to be scanned, a 1000-fold improvement in efficiency. Estimated for the general case and based on experience with traditional relational databases, SQL queries on an indexed spark-sql database run 100 to 10,000 times faster, or more, than on one without an index.
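The arithmetic behind the 1000-fold figure can be checked with a short calculation. It counts files scanned, assuming that the record lives in exactly one file and that the cost of the index lookup itself is negligible; it is not a measured result, and the variable names are illustrative.

```python
table_size_gb = 1000
file_size_gb = 1

# Number of physical files that make up the table.
files_in_table = table_size_gb // file_size_gb

# Without an index every file is scanned; with the index only the matching file is.
files_scanned_without_index = files_in_table
files_scanned_with_index = 1

reduction_factor = files_scanned_without_index / files_scanned_with_index
```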
Brief description of the drawings
The method of the present invention for establishing an index in an HDFS-based spark-sql big data processing system is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a schematic diagram of an ordinary data table and index tree structure in the prior art.
Fig. 2 is a diagram of the query principle in a prior-art HDFS-based spark-sql big data processing system without an index.
Fig. 3 is the flowchart for adding an index according to the invention.
Fig. 4 is the flowchart for deleting an index according to the invention.
Fig. 5 is the flowchart for adding data according to the invention.
Fig. 6 is the flowchart for deleting data according to the invention.
Fig. 7 is the flowchart for querying data according to the invention.
Fig. 8 is a schematic structural diagram of the HDFS distributed storage system.
Fig. 9 is a schematic diagram of a data table and index tree structure in the distributed storage system according to the invention.
Detailed description of the embodiments
As shown in Figs. 3 to 9, in this embodiment the method for establishing an index in an HDFS-based spark-sql big data processing system adds index support to spark-sql. As in a traditional relational database, SQL statements can be used to add an index, delete an index, insert data, and delete data. When data is queried, the system automatically determines whether an index exists on the queried column; if so, it looks up the file blocks recorded in the index and filters out the file blocks that do not need to be queried, thereby speeding up the query.
1) As shown in Fig. 3, the index addition flow.
Adding an index means adding an index on a column of an existing data table, so that subsequent queries on that column can be accelerated by the index.
When an index is added, a new index file is first created. The format of the index file can be set according to configuration and other instructions; common formats include B-tree and hash indexes. All records in the original table are then traversed. For each record, the position in HDFS or another file system of the value of the column to be indexed is determined, and the column value of the record and the corresponding file information are recorded and written into the index tree structure. After all records have been traversed, the index structure is saved as a file. Finally, the table metadata is updated: the new index information is written into the table's metadata for use by subsequent queries.
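The index addition flow can be sketched as follows. This is a minimal illustration under several assumptions that are not in the patent: the local file system stands in for HDFS, a JSON file stands in for the B-tree or hash index format, the table metadata is a plain dictionary, and the names `build_file_index` and `create_index` are hypothetical.

```python
import json
import os
import tempfile
from collections import defaultdict

def build_file_index(table_files, column):
    """Traverse all records and map each column value to the files containing it."""
    index = defaultdict(set)
    for file_name, rows in table_files.items():
        for row in rows:
            index[row[column]].add(file_name)
    return {value: sorted(names) for value, names in index.items()}

def create_index(table_meta, table_files, index_name, column, index_dir):
    """Build the index, save it as a file, and register it in the table metadata."""
    index = build_file_index(table_files, column)
    path = os.path.join(index_dir, index_name + ".json")
    with open(path, "w") as f:
        # JSON object keys must be strings, so column values are stored as strings.
        json.dump({str(value): names for value, names in index.items()}, f)
    # Final step: write the new index information into the table metadata.
    table_meta.setdefault("indexes", {})[index_name] = {"column": column, "path": path}
    return index

# Hypothetical table with two data files.
table_files = {
    "t1-p1": [{"a": 1, "b": 7}],
    "t1-p7": [{"a": 5, "b": 22}],
}
table_meta = {"name": "t"}
index = create_index(table_meta, table_files, "myindex", "b", tempfile.mkdtemp())
```

Note that each index entry holds file names only, not row offsets, which matches the file-level granularity the description goes on to explain.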
2) As shown in Fig. 4, the index deletion flow.
Deleting the index on a table column is relatively simple: it is only necessary to locate the corresponding index file and delete it, and to update the table metadata by deleting the index information from it.
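A sketch of the index deletion flow, assuming the table metadata is a dictionary that stores each index's column and file path, and that the index file lives on the local file system rather than HDFS. The function name `drop_index` and the metadata layout are hypothetical.

```python
import os
import tempfile

def drop_index(table_meta, index_name):
    """Delete the index file and remove the index entry from the table metadata."""
    entry = table_meta["indexes"].pop(index_name)
    if os.path.exists(entry["path"]):
        os.remove(entry["path"])

# Hypothetical setup: an index file already registered in the table metadata.
fd, index_path = tempfile.mkstemp(suffix=".json")
os.close(fd)
table_meta = {"name": "t", "indexes": {"myindex": {"column": "b", "path": index_path}}}

drop_index(table_meta, "myindex")
```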
3) As shown in Fig. 5, the data insertion flow (table has an index).
After a data record is inserted (batch insertion is in effect a sequence of single-record insertions), the system determines whether the inserted data involves an index. If it does, the corresponding index structure must be adjusted: the record and its associated file information are added to the index structure.
Throughout this process, the table's existing data insertion flow is reused unchanged; the only addition is that, once the insertion completes, the name of the file holding the data is recorded, and an index node is built from the returned file name.
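The insertion flow can be sketched as below, assuming in-memory dictionaries for the data files and for the index (column value to set of file names). The function `insert_row` and the choice of passing the target file as an argument are illustrative; in a real system the existing insert path would decide where the row is written and return that file name.

```python
def insert_row(table_files, indexes, row, target_file):
    """Append a row via the unchanged insert path, then update any affected index.

    `indexes` maps an indexed column name to {value: set of file names}.
    """
    # Original insert flow: write the row and learn which file received it.
    table_files.setdefault(target_file, []).append(row)
    written_to = target_file

    # Added step: only columns that carry an index need maintenance.
    for column, index in indexes.items():
        if column in row:
            index.setdefault(row[column], set()).add(written_to)
    return written_to

table_files = {"t1-p1": [{"a": 1, "b": 7}]}
indexes = {"b": {7: {"t1-p1"}}}

insert_row(table_files, indexes, {"a": 5, "b": 22}, "t1-p7")
```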
4) As shown in Fig. 6, the data deletion flow (table has an index).
After a data record is deleted (batch deletion is in effect a sequence of single-record deletions), the system determines whether the deleted data involves an index. If it does, the corresponding index structure must be adjusted: the index node associated with the record is deleted.
Throughout this process, the table's existing data deletion flow is reused unchanged; the only addition is that, once the deletion completes, the index information corresponding to the data is deleted.
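A sketch of the deletion flow, using the same hypothetical in-memory layout (data files as lists of rows, index as column value to set of file names). One detail the text leaves open is handled explicitly here as an assumption: a file is removed from an index entry only when no remaining row in that file still carries the value.

```python
def delete_rows(table_files, indexes, column, value):
    """Delete rows where column == value, then prune the affected index entries."""
    deleted = 0
    for file_name in list(table_files):
        rows = table_files[file_name]
        removed = [r for r in rows if r.get(column) == value]
        if not removed:
            continue
        # Original delete flow: rewrite the file without the deleted rows.
        kept = [r for r in rows if r.get(column) != value]
        table_files[file_name] = kept
        deleted += len(removed)

        # Added step: drop index entries that no longer point at a live row.
        for indexed_column, index in indexes.items():
            still_present = {r[indexed_column] for r in kept if indexed_column in r}
            for r in removed:
                v = r.get(indexed_column)
                if v in index and v not in still_present:
                    index[v].discard(file_name)
                    if not index[v]:
                        del index[v]
    return deleted

table_files = {
    "t1-p1": [{"a": 1, "b": 7}, {"a": 2, "b": 22}],
    "t1-p7": [{"a": 5, "b": 22}],
}
indexes = {"b": {7: {"t1-p1"}, 22: {"t1-p1", "t1-p7"}}}

deleted = delete_rows(table_files, indexes, "a", 5)
```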
5) As shown in Fig. 7, the data query flow (table has an index).
When data is queried, the corresponding node element is looked up in the index file according to the column value, and the element's value is read to obtain the name of the file that holds the data. The original query flow then continues. The original query flow would eventually read and query all data files of the table; now, files that cannot contain the data are filtered out according to the file names obtained in the previous step, and the query flow continues on the remaining files only. After filtering, the number of files is greatly reduced, which lightens the query load. The SQL operation is then executed on the queried data, and the query result is returned.
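The query flow can be sketched as a file-pruning step placed in front of the unchanged scan. The data layout and the function `indexed_query` are hypothetical, and only equality predicates are handled; the patent does not specify how other predicates are treated.

```python
def indexed_query(table_files, indexes, column, value):
    """Return matching rows and the list of files actually scanned."""
    # Original flow: every file of the table is a candidate.
    candidate_files = list(table_files)

    # Added step: if the queried column is indexed, keep only the files it names.
    if column in indexes:
        wanted = indexes[column].get(value, set())
        candidate_files = [f for f in candidate_files if f in wanted]

    # The original scan then runs on the remaining files only.
    matches = [row for f in candidate_files
               for row in table_files[f] if row.get(column) == value]
    return matches, candidate_files

# Hypothetical table of seven files with an index on column b.
table_files = {f"t1-p{i}": [{"a": i, "b": 10 + i}] for i in range(1, 7)}
table_files["t1-p7"] = [{"a": 5, "b": 22}]
indexes = {"b": {row["b"]: {name} for name, rows in table_files.items() for row in rows}}

matches, scanned = indexed_query(table_files, indexes, "b", 22)
```

A query on a column without an index, such as `a` here, falls through to the original behavior and scans all seven files.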
It should be emphasized that the index of this embodiment is based on spark-sql and differs from the index of a traditional database: it is designed for processing large data volumes. Taking 10 GB as a common storage capacity for a traditional database, spark-sql can handle 1 PB, that is, 100,000 times that capacity.
In an ordinary database, a data table generally corresponds to a few physical files in the file system. Spark-sql, by contrast, is typically deployed together with HDFS, which stores files in a distributed manner, and one data table may correspond to thousands or even tens of thousands of files stored on HDFS, as shown in Fig. 8.
A spark-sql system usually consists of several spark nodes, with the HDFS distributed storage system as the underlying storage; that is, the data files reside in HDFS. In the figure, t1-p1 denotes part 1 of table t1 and is a physical file; likewise, t1-p2 denotes the part 2 file of table t1. The whole table t1 consists of 7 files, p1 to p7; similarly, table t2 consists of 3 files.
When executing an SQL query, the original query flow scans all files of the table.
For example: select * from t where b = 22
Spark-sql parses this SQL statement and then looks up the database files corresponding to table t. The result is t1-p1, t1-p2, t1-p3, t1-p4, t1-p5, t1-p6, and t1-p7, 7 files in total. Leaving aside the splitting of oversized files, spark-sql creates 7 query tasks, one per file, and starts scanning; all files are scanned until the matching records are found.
The present invention draws on the principle of ordinary indexes and adapts it to the storage characteristics of spark-sql.
The index granularity of the invention differs from that of a traditional database index. A traditional database index generally points to the address of a record within a file. Because a spark-sql table usually consists of many files, the approach taken by this method is to have the index on a field of a record locate the record to a particular file; that is, the index records which file contains the record. When the record is looked up later, the index leads directly to that file, and there is no need to scan all the files that make up the table.
Continuing the example above, table t consists of 7 files, t1-p1, t1-p2, t1-p3, t1-p4, t1-p5, t1-p6, and t1-p7, and has 2 fields, a and b, with an index established on field b. Assume the table holds several records (not all are shown here); the index is established as shown in Fig. 9.
The table records are the original records inserted into the table; as they are inserted, a B-tree index is built on column b. Each node of the index tree records the value of the corresponding database record and the physical file in the HDFS file system in which that record is stored.
When querying data, for example with the query statement
select * from t where b = 22;
which selects all rows of table t whose column b equals 22, the system first parses the SQL statement and finds that an index exists on column b.
It then finds element 22 directly in the index tree and, following that element's pointer, determines that the physical file holding the row whose column b is 22 is t1-p7. Only this file is read and searched, and the record is returned once found.
When an element is inserted into table t, the corresponding index tree is modified dynamically according to the value of column b; correspondingly, when an element is deleted from table t, the corresponding content in the index tree is also deleted dynamically.
As can be seen, although the index concept of the present invention in spark-sql resembles that of a traditional database, there is a fundamental difference: to suit the big data processing characteristics of spark-sql, the invention changes the index granularity from a position within a file, as in a traditional database, to a particular file in the spark database, which avoids scanning a large number of irrelevant files and wasting system resources.
The index of the present invention applies to all SQL statements. In any SQL query, simple or complex, that involves a lookup on an indexed column, the files are first located through the index and the SQL query is then carried out within the located files. This differs fundamentally from the approach of a traditional relational database.
Key points of the invention
1. A mechanism supporting indexes is added to spark-sql, for example supporting the following SQL statements:
Create an index: create index myindex on t (b); where the keywords are create index ... on
Show indexes: show index from t; where the keywords are show index from
Delete an index: drop index myindex on t; where the keywords are drop index ... on
2. File-based indexing mechanism
Spark-sql differs from a traditional relational database, and one key point of the invention is that the index is built at file level: an index entry points to a specific file in HDFS or another file system rather than to content inside a file, so its granularity is coarser than in a traditional database. Once an index has been established on a database table according to the invention, files that cannot match the query are filtered out effectively and the range of files queried is reduced, which improves query efficiency.
3. The indexes that can be established include, but are not limited to, unique indexes, primary key indexes, multi-attribute indexes, partial indexes, and expression indexes. These index types have the same meaning as in traditional databases. The data structures used to build the indexes include, but are not limited to, B-tree, hash, GiST, and GIN; these data structures have the same meaning as in traditional databases.
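The three statements listed under key point 1 could be recognized along the following lines. This is only a sketch: the regular expressions and the function `parse_index_statement` are hypothetical, they accept only the exact forms shown in key point 1, and an actual implementation would more likely extend spark-sql's own SQL grammar, which is not shown here.

```python
import re

# Hypothetical patterns for the three index statements of key point 1.
PATTERNS = {
    "create": re.compile(
        r"^\s*create\s+index\s+(\w+)\s+on\s+(\w+)\s*\(\s*(\w+)\s*\)\s*;?\s*$", re.I),
    "show": re.compile(r"^\s*show\s+index\s+from\s+(\w+)\s*;?\s*$", re.I),
    "drop": re.compile(r"^\s*drop\s+index\s+(\w+)\s+on\s+(\w+)\s*;?\s*$", re.I),
}

def parse_index_statement(sql):
    """Return (action, details) for an index statement, or None for other SQL."""
    m = PATTERNS["create"].match(sql)
    if m:
        return "create", {"index": m.group(1), "table": m.group(2), "column": m.group(3)}
    m = PATTERNS["show"].match(sql)
    if m:
        return "show", {"table": m.group(1)}
    m = PATTERNS["drop"].match(sql)
    if m:
        return "drop", {"index": m.group(1), "table": m.group(2)}
    return None

created = parse_index_statement("create index myindex on t (b);")
shown = parse_index_statement("show index from t;")
dropped = parse_index_statement("drop index myindex on t;")
```

Statements that match none of the patterns return `None` and would be passed on to the normal SQL path.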
The advantages of the present invention are as follows.
At present there is no publicly disclosed implementation scheme or method that supports index technology in spark-sql.
Consequently, in the currently disclosed technology, database tables created in spark-sql have no indexes, and query speed and query efficiency are limited. Establishing an indexing mechanism for spark-sql can improve query speed by several orders of magnitude, so that even with massive data volumes, query efficiency and query speed are comparable to those of a traditional relational database.
It should be noted that the embodiments described above with reference to the drawings only illustrate the invention and do not limit its scope. Those of ordinary skill in the art should understand that modifications or equivalent replacements made to the invention without departing from its spirit and scope shall all fall within the scope of the invention. In addition, unless the context indicates otherwise, words in the singular include the plural, and vice versa. Furthermore, unless otherwise stated, all or part of any embodiment may be used in combination with all or part of any other embodiment.

Claims (7)

1. A method for establishing an index in an HDFS-based spark-sql big data processing system, characterized in that: spark-sql adds support for standard SQL query statements on top of the big data processing platform spark; SQL statements are used to add an index, delete an index, insert data, and delete data in the HDFS-based spark-sql big data processing system; when data is queried, whether an index exists on the queried column is determined automatically, and if so, the file blocks recorded in the index are looked up and the file blocks that do not need to be queried are filtered out;
wherein adding the index locates the index on a field of a record to a particular file, that is, records which file contains the record, so that when the record is subsequently looked up, the index leads directly to that file without scanning all the files that make up the table;
wherein, when data is queried, the corresponding node element is looked up in the index file according to the column value, and the element's value is read to obtain the name of the file that holds the data; the original query flow then continues; the original query flow would eventually read and query all data files of the table, whereas here files that cannot contain the data are filtered out according to the file names obtained in the previous step, the query flow continues on the remaining files, the SQL operation is executed on the queried data, and the query result is returned.
2. the method for establishing index in the spark-sql big data processing system based on HDFS according to claim 1, It is characterized in that: when increasing index, it is necessary first to increase an index file newly, the format of index file can be arranged according to configuration, Then all records in original table are traversed, determine that the value of the column indexed required for every record is located at position in HDFS file system It sets, the train value and corresponding the file information, write-in index tree for re-recording the record loop through all records, with document form Save index structure, final updating table metadata information, by the metadata of new index information write in table, in case subsequent query It uses.
3. the method for establishing index in the spark-sql big data processing system based on HDFS according to claim 1, It is characterized in that: when deleting the index of certain table column, it is only necessary to navigate to corresponding index file and be deleted, and update list cell number It is believed that breath, while deleting the index information in metadata.
4. the method for establishing index in the spark-sql big data processing system based on HDFS according to claim 1, It is characterized in that: after one data of insertion, judging whether the data of insertion are related to indexing, if being related to indexing, need to adjust This data and its associated the file information are also added in index structure by corresponding index structure.
5. the method for establishing index in the spark-sql big data processing system based on HDFS according to claim 4, Be characterized in that: in whole flow process, it is constant that tables of data insertion data flow continues to use former process, only after the completion of data insertion, note The filename where data is recorded, index node is constructed according to the filename that this is returned.
6. the method for establishing index in the spark-sql big data processing system based on HDFS according to claim 1, It is characterized in that: after deleting a data, judging whether the data deleted are related to indexing, if being related to indexing, need to adjust Corresponding index structure deletes the associated index node of this data.
7. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 6, characterized in that: throughout the flow, the data deletion flow of the data table follows the original flow unchanged; only after the data deletion is completed is the index information corresponding to the data deleted.
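
The index lifecycle described in claims 2 to 7 (create, drop, maintain on insert, maintain on delete) can be illustrated with a minimal Python sketch. It is not the patented implementation: a plain dictionary stands in for the index tree, a local directory stands in for HDFS, a JSON file stands in for the configurable index file format, and every name (`IndexedTable`, `create_index`, `drop_index`, `lookup`, and so on) is hypothetical.

```python
import json
import os
import tempfile


class IndexedTable:
    """Toy table whose rows live in named data files (stand-ins for HDFS files)."""

    def __init__(self, name, index_dir):
        self.name = name
        self.index_dir = index_dir
        self.files = {}                  # data file name -> list of row dicts
        self.indexes = {}                # column -> {column value: [data file names]}
        self.metadata = {"indexes": {}}  # column -> path of the saved index file

    def _save(self, column):
        # Persist the index structure in file form (JSON is an assumed format).
        path = os.path.join(self.index_dir, f"{self.name}.{column}.idx.json")
        with open(path, "w") as fh:
            json.dump(self.indexes[column], fh)
        return path

    def create_index(self, column):
        # Claim 2: traverse all records, map each column value to its file,
        # save the index, then record it in the table metadata.
        tree = {}
        for file_name, rows in self.files.items():
            for row in rows:
                files = tree.setdefault(row[column], [])
                if file_name not in files:
                    files.append(file_name)
        self.indexes[column] = tree
        self.metadata["indexes"][column] = self._save(column)

    def drop_index(self, column):
        # Claim 3: delete the index file and remove it from the metadata.
        os.remove(self.metadata["indexes"].pop(column))
        del self.indexes[column]

    def insert(self, row, file_name):
        # Claims 4-5: the insert itself is unchanged; afterwards the file name
        # that received the row is used to build the index node.
        self.files.setdefault(file_name, []).append(row)
        for column, tree in self.indexes.items():
            if column in row:
                files = tree.setdefault(row[column], [])
                if file_name not in files:
                    files.append(file_name)
                self._save(column)

    def delete(self, row, file_name):
        # Claims 6-7: the delete itself is unchanged; afterwards the index
        # node is removed if the file no longer holds that column value.
        self.files[file_name].remove(row)
        for column, tree in self.indexes.items():
            if column not in row:
                continue
            value = row[column]
            if not any(r.get(column) == value for r in self.files[file_name]):
                tree[value].remove(file_name)
                if not tree[value]:
                    del tree[value]
                self._save(column)

    def lookup(self, column, value):
        # A query consults the index to learn which files to read.
        return self.indexes[column].get(value, [])


table = IndexedTable("calls", tempfile.mkdtemp())
table.insert({"user": "alice", "n": 1}, "part-0")
table.insert({"user": "bob", "n": 2}, "part-1")
table.create_index("user")
table.insert({"user": "alice", "n": 3}, "part-1")
print(table.lookup("user", "alice"))  # ['part-0', 'part-1']
table.delete({"user": "bob", "n": 2}, "part-1")
print(table.lookup("user", "bob"))    # []
```

In this sketch the index is rewritten in full after every change, which keeps the example short; a real system would more plausibly update the index file incrementally.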

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201510918956.4A CN105574093B (en) 2015-12-10 2015-12-10 A method of index is established in the spark-sql big data processing system based on HDFS
PCT/CN2016/094925 WO2017096939A1 (en) 2015-12-10 2016-08-12 Method for establishing index on hdfs-based spark-sql big-data processing system

Publications (2)

Publication Number Publication Date
CN105574093A (en) 2016-05-11
CN105574093B (en) 2019-09-10

Family

ID=55884224

Family Applications (1)

Application Number Status Publication Number Priority Date Filing Date Title
CN201510918956.4A Expired - Fee Related CN105574093B (en) 2015-12-10 2015-12-10 A method of index is established in the spark-sql big data processing system based on HDFS

Country Status (2)

Country Link
CN (1) CN105574093B (en)
WO (1) WO2017096939A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101344881A (en) * 2007-07-09 2009-01-14 中国科学院大气物理研究所 Index generation method and device and search system for mass file type data
CN101727465A (en) * 2008-11-03 2010-06-09 中国移动通信集团公司 Methods for establishing and inquiring index of distributed column storage database, device and system thereof
CN103631910A (en) * 2013-11-26 2014-03-12 烽火通信科技股份有限公司 Distributed database multi-column composite query system and method
CN104462291A (en) * 2014-11-27 2015-03-25 杭州华为数字技术有限公司 Method and device for data processing

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2417342A (en) * 2004-08-19 2006-02-22 Fujitsu Serv Ltd Indexing system for a computer file store
CN104133867A (en) * 2014-07-18 2014-11-05 中国科学院计算技术研究所 DOT in-fragment secondary index method and DOT in-fragment secondary index system
CN105574093B (en) * 2015-12-10 2019-09-10 深圳市华讯方舟软件技术有限公司 A method of index is established in the spark-sql big data processing system based on HDFS

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
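Design and Implementation of a Spark-Based Machine Learning Platform (基于Spark的机器学习平台设计与实现); Tang Zhenkun (唐振坤); China Dissertations Full-text Database (Wanfang Data); 2015-01-06; abstract, pp. 23-25
A Distributed Computing System Supporting Query and Analysis of Communication Data (支持通信数据查询分析的分布式计算系统); Chao Pingfu (晁平复) et al.; Journal of East China Normal University (Natural Science); 2014-09-30 (No. 05); pp. 89, 91-92, 95-98, 101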

Also Published As

Publication number Publication date
CN105574093A (en) 2016-05-11
WO2017096939A1 (en) 2017-06-15

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 518102 3rd floor, building 37, Chentian Industrial Zone, Baotian 1st Road, Xixiang Street, Bao'an District, Shenzhen City, Guangdong Province

Applicant after: SHENZHEN HUAXUN FANGZHOU SOFTWARE TECHNOLOGY Co.,Ltd.

Applicant after: CHINA COMMUNICATION TECHNOLOGY Co.,Ltd.

Address before: 518102 3rd floor, building 37, Chentian Industrial Zone, Baotian 1st Road, Xixiang Street, Bao'an District, Shenzhen City, Guangdong Province

Applicant before: SHENZHEN HUAXUN FANGZHOU SOFTWARE TECHNOLOGY Co.,Ltd.

Applicant before: CHINA COMMUNICATION TECHNOLOGY Co.,Ltd.

COR Change of bibliographic data
GR01 Patent grant
PP01 Preservation of patent right

Effective date of registration: 20210630

Granted publication date: 20190910

PD01 Discharge of preservation of patent

Date of cancellation: 20230421

Granted publication date: 20190910

TR01 Transfer of patent right

Effective date of registration: 20230606

Address after: 518102 room 404, building 37, chentian Industrial Zone, chentian community, Xixiang street, Bao'an District, Shenzhen City, Guangdong Province

Patentee after: Shenzhen Huaxun ark Photoelectric Technology Co.,Ltd.

Patentee after: SHENZHEN HUAXUN FANGZHOU SOFTWARE TECHNOLOGY Co.,Ltd.

Address before: 518102 3rd floor, building 37, chentian Industrial Zone, Baotian 1st Road, Xixiang street, Bao'an District, Shenzhen City, Guangdong Province

Patentee before: SHENZHEN HUAXUN FANGZHOU SOFTWARE TECHNOLOGY Co.,Ltd.

Patentee before: CHINA COMMUNICATION TECHNOLOGY Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190910