CN113868230B - Large-scale connection optimization method based on Spark computing framework - Google Patents

Large-scale connection optimization method based on Spark computing framework

Info

Publication number
CN113868230B
CN113868230B (application CN202111220042.2A)
Authority
CN
China
Prior art keywords
data
spark
connection
probability
rdda
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111220042.2A
Other languages
Chinese (zh)
Other versions
CN113868230A (en)
Inventor
付蔚
宾茂梨
张棚
李正
刘庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Posts and Telecommunications
Original Assignee
Chongqing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN202111220042.2A
Publication of CN113868230A
Application granted
Publication of CN113868230B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/217 Database tuning
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G06F16/24558 Binary matching operations
    • G06F16/2456 Join operations

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a large-table join optimization method based on the Spark computing framework, and belongs to the field of big data computing. The method comprises the following steps: S1: performing data cleaning by combining predicate pushdown with a compressed Bloom filter, filtering the large volume of invalid data out of the large tables and preventing useless data from entering the shuffle stage; S2: building a Spark-based data skew detection model and counting the global Map-stage Key distribution with a reservoir sampling algorithm; S3: adopting an intermediate data cluster splitting strategy that cuts the skewed data clusters according to the average rated load capacity, so that high-frequency keys are routed to other, faster partitions and the keys approach a uniform distribution. The invention filters out a large amount of useless data, mitigates data skew, shortens join query time, resolves memory overflow on Spark cluster nodes, and improves user satisfaction.

Description

Large-scale connection optimization method based on Spark computing framework
Technical Field
The invention belongs to the field of big data computing and relates to a large-table join optimization method based on the Spark computing framework. It can be used to mitigate structured data skew in fields such as smart grids, the Internet, and industry, and to perform large-table joins quickly and accurately.
Background
As the Internet, the Internet of Things, and social networks weave themselves ever deeper into everyday life, we have entered an age of data explosion: data of every kind (for example, from smart grids, the Internet, and industry) grows exponentially, and the sheer scale of the data and the complexity of its relationships pose new challenges for data analysis. Massive data is generally stored and computed in table form, whether in the Spark big data computing framework or in large databases, and the join is the most frequent and fundamental operation in data processing. In a big data environment data tables become very large, and in traditional relational databases such as MySQL, Oracle, and DB2, join processing is time-consuming; optimizing large-table join operations is therefore necessary.
Spark is fully compatible with the Hadoop distributed storage access interface and greatly improves big data computing performance by processing data in distributed memory. Spark SQL, the Spark module for processing structured data, substantially lowers the difficulty of data analysis, but once the data tables are very large a join sends a large amount of invalid data into the shuffle. Moreover, during the shuffle of the Spark SQL framework all intermediate data is treated as key/value pairs, and Spark uses a hash algorithm to pull identical keys into the same Task. If the data volumes of different keys differ widely, with a few keys accounting for a particularly large share of the data, data skew occurs: the CPU and memory resources of the system cannot be fully utilized while the Spark application runs, oversized Tasks prolong the job, and memory-overflow exceptions can even abort the job entirely.
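By way of orientation (not part of the patented method), a minimal Scala sketch of this skew mechanism is given below; the key names and record counts are invented for illustration, and it simply shows Spark's default hash partitioning concentrating a hot key in one partition:

```scala
// Minimal skew demo: 90% of records share one key, so the partition that the
// hash partitioner assigns to "hot" ends up carrying most of the load.
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object SkewDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("skew-demo").setMaster("local[4]"))
    val records = sc.parallelize(
      Seq.fill(90000)(("hot", 1)) ++ (1 to 10000).map(i => (s"k$i", 1)))
    val partitioned = records.partitionBy(new HashPartitioner(4))
    // Per-partition record counts: one partition dominates.
    partitioned.mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
      .collect().foreach { case (idx, n) => println(s"partition $idx -> $n records") }
    sc.stop()
  }
}
```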
Disclosure of Invention
In view of the above, the present invention aims to provide a large-table join optimization method based on the Spark computing framework, suitable for structured data from smart grids, the Internet, industry, and similar fields, that mitigates data skew when two large tables are joined, thereby addressing the high time cost of the join, preventing node memory overflow, shortening join query time, and improving user satisfaction.
In order to achieve the above purpose, the present invention provides the following technical solutions:
A large-table join optimization method based on the Spark computing framework first cleans the two tables with a predicate pushdown strategy combined with a compressed Bloom filter, filtering the large volume of useless data out of both tables. After the data is cleaned, a data skew detection model is built; once skew is detected, the intermediate data clusters are split so that the data volume on each reduce node is balanced. This mitigates the skew and resolves the memory overflow and long running times that occur when Spark nodes join large tables.
The method specifically comprises the following steps:
S1: performing data cleaning by combining predicate pushdown with a compressed Bloom filter, filtering the large volume of invalid data out of the large tables and preventing useless data from entering the shuffle stage;
S2: building a Spark-based data skew detection model, and counting the global Map-stage Key distribution with a reservoir sampling algorithm;
S3: adopting an intermediate data cluster splitting strategy that cuts the skewed data clusters according to the average rated load capacity, so that high-frequency keys are routed to other, faster partitions and the keys approach a uniform distribution.
Further, step S1 specifically comprises: first, the filter expressions of the SQL statement are pushed down to the storage layer so the data is filtered there directly, reducing the volume of data passed to the compute layer and avoiding the I/O of scanning the other columns in rows whose values are invalid; then the compressed Bloom Filter performs hash mapping to find the connection attribute values common to both tables and store them in new bit arrays A and B, and the compressed Bloom Filter broadcasts bit arrays A and B over the network to remove the remaining invalid data that takes no part in the join stage.
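To make the pushdown half of this step concrete, a hedged Spark (Scala) sketch follows; the Parquet paths, the `status` filter column, and the `join_key` column are hypothetical, and the sketch shows only that a filter declared before the join can be pushed into the Parquet scan:

```scala
// Predicate pushdown sketch: the filter is declared before the join, so Spark's
// optimizer can push it into the Parquet scan (visible as PushedFilters in the plan).
import org.apache.spark.sql.SparkSession

object PushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pushdown").master("local[*]").getOrCreate()
    // Enabled by default in Spark; set explicitly here for clarity.
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")

    val tableA = spark.read.parquet("/data/tableA") // hypothetical path
    val tableB = spark.read.parquet("/data/tableB") // hypothetical path

    // Rows failing the predicate are dropped at the storage layer, not after the shuffle.
    val cleanedA = tableA.filter("status = 'valid'")
    val cleanedB = tableB.filter("status = 'valid'")
    val joined = cleanedA.join(cleanedB, "join_key")
    joined.explain(true) // inspect the scan node of the plan for PushedFilters
    spark.stop()
  }
}
```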
Further, the step S1 specifically includes the following steps:
S11: after the Spark execution engine identifies the filters that can be pushed down, it passes them to Parquet and merges them, placing the merged filters on the leaf nodes so that they execute at the data source; the filtering of invalid data is thus genuinely performed inside the data source;
S12: the connection attributes of the left and right tables remaining after step S11 are placed into two new RDDs, denoted RDDA and RDDB;
S13: on each node of the Spark cluster, the connection attribute values of RDDA and RDDB are read in turn, computed with n hash functions, and the results inserted into a compressed Bloom filter to generate a bit array, until the connection attributes on all nodes have been processed;
S14: the RDDA bit arrays of the nodes processed in step S13 are combined with a bitwise OR, and likewise the RDDB bit arrays, until every node's bit array has been merged, yielding CBFA and CBFB, which are broadcast into the Spark cluster; because the compressed Bloom filter is a highly space-efficient data structure, the network cost remains acceptable;
S15: after each node of the Spark cluster receives the broadcast CBFA and CBFB, RDDA is filtered with CBFB: each connection attribute in RDDA is mapped through the n hash functions of CBFB, and if every mapped position in the CBFB bit array is set, the connection attribute is common to RDDA and RDDB; RDDB is filtered with CBFA in the same way, and the subsequent join is performed on the filtered RDDA and RDDB (a sketch of this exchange is given below).
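The following hedged sketch illustrates the exchange in steps S13–S15. Stock Spark does not ship the patent's compressed Bloom filter, so the sketch substitutes Spark's built-in (uncompressed) `org.apache.spark.util.sketch.BloomFilter` as an approximation of the idea; paths and column names are again hypothetical:

```scala
// Mutual Bloom-filter filtering before the join: each side keeps only keys that
// might exist on the other side, so non-joining rows never enter the shuffle.
import org.apache.spark.sql.SparkSession

object BloomJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bloom-join").master("local[*]").getOrCreate()
    import spark.implicits._

    val rddA = spark.read.parquet("/data/tableA").select($"join_key") // hypothetical
    val rddB = spark.read.parquet("/data/tableB").select($"join_key") // hypothetical

    // Spark builds one filter per partition and merges them internally -- the
    // analogue of the bitwise-OR step S14 -- returning one serializable filter.
    val cbfA = rddA.stat.bloomFilter("join_key", 1000000L, 0.01)
    val cbfB = rddB.stat.bloomFilter("join_key", 1000000L, 0.01)

    // Broadcast both filters to every node of the cluster (S14).
    val bA = spark.sparkContext.broadcast(cbfA)
    val bB = spark.sparkContext.broadcast(cbfB)

    // S15: filter each side with the other side's filter, then join.
    val filteredA = rddA.filter(r => bB.value.mightContain(r.getAs[Any]("join_key")))
    val filteredB = rddB.filter(r => bA.value.mightContain(r.getAs[Any]("join_key")))
    println(filteredA.join(filteredB, "join_key").count())
    spark.stop()
  }
}
```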
Further, step S2 specifically comprises: in a Master-Slave mode, each Slave node extracts the Key distribution and data with a reservoir sampling algorithm, in which every sample is extracted with the same probability K/N, where K is the number of samples drawn and N is the total number of samples; from the frequency of each Key in the sample, a distribution approximating the whole is computed and used to judge whether the large-table Key values are skewed.
Further, the step S2 specifically includes the following steps:
S21: before the map tasks run, a uniform random sample of fixed size k is first drawn from the input data without replacement; the goal of this step is to form the base reservoir;
S22: from the (k+1)-th sample onward, the probability that an element of the reservoir is replaced by the (k+1)-th data item equals the probability that the (k+1)-th sample is selected times the probability that it displaces the i-th element, i.e. (k/(k+1)) × (1/k) = 1/(k+1), so the probability of being retained is 1 − 1/(k+1) = k/(k+1);
S23: for the j-th data item with j > k, the probability of being selected is k/j, and the probability of not being displaced by the (j+1)-th item is 1 − (k/(j+1)) × (1/k) = j/(j+1); by the time the n-th item has been processed, retained probability = selection probability × probability of never being displaced, i.e. the product of conditional probabilities (k/j) × (j/(j+1)) × ((j+1)/(j+2)) × … × ((n−1)/n) = k/n, so every data item is retained with probability k/n;
S24: finally, each Slave node reports its result to the Master node to produce the reservoir sample, whose Key distribution is close to that of the original data as a whole (a sketch of this sampling is given below).
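The sampling core of steps S21–S24 is the classic Algorithm R; a self-contained Scala sketch under that assumption is given below, with the Master/Slave aggregation reduced to a single local call and all names illustrative:

```scala
// Reservoir sampling (Algorithm R): every element of a stream of length n ends
// up in the size-k reservoir with the same probability k/n, matching S22-S23.
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object ReservoirSample {
  def sample[T](stream: Iterator[T], k: Int, rng: Random = new Random()): Vector[T] = {
    val reservoir = ArrayBuffer.empty[T]
    var n = 0
    for (x <- stream) {
      n += 1
      if (n <= k) reservoir += x        // S21: fill the base reservoir
      else {
        val j = rng.nextInt(n)          // uniform index in [0, n)
        if (j < k) reservoir(j) = x     // replace a slot with probability k/n
      }
    }
    reservoir.toVector
  }

  def main(args: Array[String]): Unit = {
    // Each Slave node would run this over its partition and report the Key
    // frequencies of its sample to the Master (S24).
    println(sample(Iterator.range(0, 100000), 10))
  }
}
```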
Further, step S3 specifically comprises: after data skew is detected in step S2, the average rated load capacity of the data clusters is computed and a skew tolerance is set; when the network transmission time incurred by cutting a data cluster would exceed the node's processing time, that cluster is not cut; the remaining skewed data clusters are cut according to the average rated load capacity, keeping the resulting clusters as close to equal in size as possible.
Further, the step S3 specifically includes the following steps:
S31: denote the data set sampled in step S2 by SC = {SC_i}, where SC_i is the number of key-value pairs in the i-th sampled data cluster;
S32: compute the standard rated capacity Havg of each bucket, Havg = (Σ_{i=1}^{m} SC_i) / h, where m is the number of data clusters and h is the number of buckets; the current remaining capacities of the buckets are denoted {DB_1, DB_2, …, DB_h};
S33: sort the SC_i in descending order; if SC_i ≥ DB_1, a new segment is split off SC_i and loaded into DB_1, and the remainder SC_i − DB_1, together with the remaining clusters, passes to the next iteration;
S34: when SC_i < DB_i, put SC_i into DB_i, then recheck whether the current second-largest cluster SC_{i−1} can fill the space still left in DB_i; if SC_i + SC_{i−1} ≥ DB_i, SC_{i−1} is split, and the remaining key-value pairs are checked against all remaining DB_i in turn to see whether they fit;
S35: after each iteration the SC_i are re-sorted; with the skew tolerance set, an SC_i with Havg < SC_i ≤ Havg × 1.1 is not cut, because the network overhead of cutting it would exceed the current bucket's data processing time, so clusters within the skew tolerance are left unprocessed (a sketch of this splitting loop is given below).
The invention has the beneficial effects that, when two large tables in the Spark computing framework are joined, it filters out a large amount of useless data, mitigates data skew, shortens join query time, resolves memory overflow on Spark cluster nodes, and improves user satisfaction.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions, and advantages of the present invention more apparent, the preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings:
FIG. 1 is the overall flowchart of the large-table join optimization method based on the Spark computing framework of the present invention;
FIG. 2 is a diagram of data filtering by the predicate pushdown strategy combined with the compressed Bloom filter according to the present invention;
FIG. 3 is a flowchart of building the data skew detection model.
Detailed Description
The embodiments of the invention are described below through specific examples, from which those skilled in the art can readily understand further advantages and effects of the invention. The invention may also be implemented or applied through other, different embodiments, and the details of this description may be modified or varied in various ways without departing from the spirit of the invention. It should be noted that the illustrations provided with the following embodiments merely illustrate the basic idea of the invention, and the embodiments and their features may be combined with each other provided there is no conflict.
Referring to fig. 1 to 3, the present invention provides a large table connection optimization method based on Spark computing framework, which specifically includes the following steps:
Step 1: clean the data with the compressed Bloom Filter and filter out the large volume of useless data; the data here includes structured data from smart grids, the Internet, industry, and similar sources.
As shown in FIG. 2, the predicate pushdown strategy first performs a primary filtering of the data in tables A and B that does not satisfy the join condition, and two new RDDs are generated holding the connection attributes of tables A and B respectively. The compressed Bloom filter then performs the hash mapping operation on RDDA and RDDB to obtain the bit arrays of the individual nodes, and finally a bitwise OR over these bit arrays produces the filtered tables A and B.
Step 2: build the data skew detection model and count the global Map-stage Key distribution by reservoir sampling;
As shown in FIG. 3, the detection model uses a Master-Slave pattern: the Master is deployed on Spark's Driver node and each Slave on a Worker node. Reservoir sampling is performed at a preset ratio with the RDD sample operator; a Key counter is incremented as sampling proceeds, and the Simon model from long-tail theory is used to judge whether the Key distribution histogram has reached a steady state, at which point sampling stops. Finally, each Slave node returns its Key distribution histogram to the Master node to produce the global Key data skew detection model.
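A minimal Scala sketch of the per-Slave sampling and Key counting is given below; the fixed sampling ratio replaces the Simon-model stopping test (a simplifying assumption), and the function name is hypothetical:

```scala
// Approximate the global Key distribution from a sample of the pair RDD using
// Spark's built-in sample operator and a Key count.
import org.apache.spark.rdd.RDD

object KeyHistogramSketch {
  def keyHistogram(data: RDD[(String, Long)], ratio: Double): scala.collection.Map[String, Long] =
    data.sample(withReplacement = false, fraction = ratio)
      .countByKey() // per-partition counts are aggregated at the Driver (Master)
}
```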
Step 3: a load-balancing strategy for the Reduce nodes targeting data skew;
1) Denote the sampled data set by SC = {SC_i}, where SC_i is the number of key-value pairs in the i-th sampled data cluster;
2) Compute the standard rated capacity Havg of each bucket, Havg = (Σ_{i=1}^{m} SC_i) / h, where m is the number of data clusters and h is the number of buckets; the current remaining capacities of the buckets are denoted {DB_1, DB_2, …, DB_h};
3) Sort the SC_i in descending order; if SC_i ≥ DB_1, a new segment is split off SC_i and loaded into DB_1, and the remainder SC_i − DB_1, together with the remaining clusters, passes to the next iteration;
4) When SC_i < DB_i, put SC_i into DB_i, then recheck whether the current second-largest cluster SC_{i−1} can fill the space still left in DB_i; if SC_i + SC_{i−1} ≥ DB_i, SC_{i−1} is split, and the remaining key-value pairs are checked against all remaining DB_i in turn to see whether they fit;
5) After each iteration the SC_i are re-sorted; with the skew tolerance set, an SC_i with Havg < SC_i ≤ Havg × 1.1 is not cut, because the network overhead of cutting it would exceed the current bucket's data processing time, so clusters within the skew tolerance are left unprocessed.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.

Claims (3)

1. A large-table join optimization method based on the Spark computing framework, characterized by comprising the following steps:
S1: performing data cleaning by combining predicate pushdown with a compressed Bloom filter, filtering the large volume of invalid data out of the large tables and preventing useless data from entering the shuffle stage, specifically comprising: first pushing the filter expressions of the SQL statement down to the storage layer to filter the data there directly; then performing hash mapping with the compressed Bloom Filter, finding the connection attribute values common to both tables and storing them in new bit arrays A and B, and broadcasting bit arrays A and B over the network with the compressed Bloom Filter to remove the remaining invalid data that takes no part in the join stage;
S2: building a Spark-based data skew detection model, and counting the global Map-stage Key distribution with a reservoir sampling algorithm, specifically comprising: in a Master-Slave mode, each Slave node extracting the Key distribution and data with a reservoir sampling algorithm, in which every sample is extracted with the same probability K/N, where K is the number of samples drawn and N is the total number of samples; computing, from the frequency of each Key in the sample, a distribution approximating the whole, and judging whether the large-table Key values are skewed;
S3: adopting an intermediate data cluster splitting strategy that cuts the skewed data clusters according to the average rated load capacity, so that high-frequency keys are routed to other, faster partitions and the keys approach a uniform distribution, specifically comprising: after data skew is detected in step S2, computing the average rated load capacity of the data clusters and setting a skew tolerance; when the network transmission time incurred by cutting a data cluster would exceed the node's processing time, not cutting that cluster; and cutting the remaining skewed data clusters according to the average rated load capacity;
the step S3 specifically comprises the following steps:
S31: denoting the data set sampled in step S2 by SC = {SC_i}, where SC_i is the number of key-value pairs in the i-th sampled data cluster;
S32: computing the standard rated capacity Havg of each bucket, Havg = (Σ_{i=1}^{m} SC_i) / h, where m is the number of data clusters and h is the number of buckets, the current remaining capacities of the buckets being denoted {DB_1, DB_2, …, DB_h};
S33: sorting the SC_i in descending order; if SC_i ≥ DB_1, splitting a new segment off SC_i and loading it into DB_1, the remainder SC_i − DB_1 passing to the next iteration together with the remaining clusters;
S34: when SC_i < DB_i, putting SC_i into DB_i, then rechecking whether the current second-largest cluster SC_{i−1} can fill the space still left in DB_i; if SC_i + SC_{i−1} ≥ DB_i, splitting SC_{i−1} and checking the remaining key-value pairs against all remaining DB_i in turn to see whether they fit;
S35: after each iteration, re-sorting the SC_i while the skew tolerance is set, and not cutting an SC_i when Havg < SC_i ≤ Havg × 1.1.
2. The large table connection optimization method according to claim 1, wherein the step S1 specifically includes the steps of:
S11: after the Spark execution engine identifies the filters that can be pushed down, passing them to Parquet and merging them, the merged filters being placed on the leaf nodes so that they execute at the data source and the filtering of invalid data is genuinely performed inside the data source;
S12: placing the connection attributes of the left and right tables remaining after step S11 into two new RDDs, denoted RDDA and RDDB;
S13: on each node of the Spark cluster, reading the connection attribute values of RDDA and RDDB in turn, computing them with n hash functions, and inserting the results into a compressed Bloom filter to generate a bit array, until the connection attributes on all nodes have been processed;
S14: combining the RDDA bit arrays of the nodes processed in step S13 with a bitwise OR, and likewise the RDDB bit arrays, until every node's bit array has been merged, yielding CBFA and CBFB, and broadcasting CBFA and CBFB into the Spark cluster;
S15: after each node of the Spark cluster receives the broadcast CBFA and CBFB, filtering RDDA with CBFB by mapping each connection attribute in RDDA through the n hash functions of CBFB, a connection attribute being common to RDDA and RDDB if every mapped position in the CBFB bit array is set; filtering RDDB with CBFA in the same way; and performing the subsequent join on the filtered RDDA and RDDB.
3. The large table connection optimization method according to claim 1, wherein the step S2 specifically includes the steps of:
S21: before the map tasks run, first drawing a uniform random sample of fixed size k from the input data without replacement to form the base reservoir;
S22: from the (k+1)-th sample onward, the probability that an element of the reservoir is replaced by the (k+1)-th data item equals the probability that the (k+1)-th sample is selected times the probability that it displaces the i-th element, i.e. (k/(k+1)) × (1/k) = 1/(k+1), the probability of being retained being 1 − 1/(k+1) = k/(k+1);
S23: for the j-th data item with j > k, the probability of being selected being k/j and the probability of not being displaced by the (j+1)-th item being 1 − (k/(j+1)) × (1/k) = j/(j+1); by the time the n-th item has been processed, retained probability = selection probability × probability of never being displaced, i.e. the product of conditional probabilities (k/j) × (j/(j+1)) × ((j+1)/(j+2)) × … × ((n−1)/n) = k/n, so every data item is retained with probability k/n;
S24: finally, each Slave node reporting its result to the Master node to produce the reservoir sample.
CN202111220042.2A 2021-10-20 2021-10-20 Large-scale connection optimization method based on Spark computing framework Active CN113868230B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111220042.2A CN113868230B (en) 2021-10-20 2021-10-20 Large-scale connection optimization method based on Spark computing framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111220042.2A CN113868230B (en) 2021-10-20 2021-10-20 Large-scale connection optimization method based on Spark computing framework

Publications (2)

Publication Number Publication Date
CN113868230A CN113868230A (en) 2021-12-31
CN113868230B (en) 2024-06-04

Family

Family ID: 78996709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111220042.2A Active CN113868230B (en) 2021-10-20 2021-10-20 Large-scale connection optimization method based on Spark computing framework

Country Status (1)

Country Link
CN (1) CN113868230B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116561171B (en) * 2023-07-10 2023-09-15 浙江邦盛科技股份有限公司 Method, device, equipment and medium for processing dual-time-sequence distribution of skewed data
CN116719846A (en) * 2023-08-07 2023-09-08 北京滴普科技有限公司 Distributed computing engine data query optimization method, device and storage medium
CN117633024B (en) * 2024-01-23 2024-04-23 天津南大通用数据技术股份有限公司 Database optimization method based on preprocessing optimization join

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106598729A (en) * 2016-11-18 2017-04-26 深圳市证通电子股份有限公司 Data distribution method and system of distributed parallel computing system
CN108572873A (en) * 2018-04-24 2018-09-25 中国科学院重庆绿色智能技术研究院 A load-balancing method and device for solving the Spark data skew problem
CN108628889A (en) * 2017-03-21 2018-10-09 北京京东尚科信息技术有限公司 Data sampling methods, systems and devices based on timeslice
CN108763489A (en) * 2018-05-28 2018-11-06 东南大学 A method of optimizing the Spark SQL execution workflow
CN110659304A (en) * 2019-09-09 2020-01-07 杭州中科先进技术研究院有限公司 Multi-path data stream connection system based on data inclination
CN112000467A (en) * 2020-07-24 2020-11-27 广东技术师范大学 Data tilt processing method and device, terminal equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299131B (en) * 2018-11-14 2020-05-29 百度在线网络技术(北京)有限公司 Spark query method and system supporting trusted computing

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106598729A (en) * 2016-11-18 2017-04-26 深圳市证通电子股份有限公司 Data distribution method and system of distributed parallel computing system
CN108628889A (en) * 2017-03-21 2018-10-09 北京京东尚科信息技术有限公司 Data sampling methods, systems and devices based on timeslice
CN108572873A (en) * 2018-04-24 2018-09-25 中国科学院重庆绿色智能技术研究院 A load-balancing method and device for solving the Spark data skew problem
CN108763489A (en) * 2018-05-28 2018-11-06 东南大学 A method of optimizing the Spark SQL execution workflow
CN110659304A (en) * 2019-09-09 2020-01-07 杭州中科先进技术研究院有限公司 Multi-path data stream connection system based on data inclination
CN112000467A (en) * 2020-07-24 2020-11-27 广东技术师范大学 Data tilt processing method and device, terminal equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Spatial data management in Apache Spark: the GeoSpark perspective and beyond. Springer, 2018, vol. 23, pp. 37-78. *
Research on join query optimization strategies for structured data based on Spark; Bin Maoli; Master's thesis, Chongqing University of Posts and Telecommunications; 2024-04-18; pp. 1-81 *
Research on approximate query processing technology based on multidimensional analysis of big data; Xie Jinxing; China Masters' Theses Full-text Database, Information Science & Technology; 2018-03-15 (No. 03); I138-1242 *

Also Published As

Publication number Publication date
CN113868230A (en) 2021-12-31

Similar Documents

Publication Publication Date Title
CN113868230B (en) Large-scale connection optimization method based on Spark computing framework
Hilprecht et al. Deepdb: Learn from data, not from queries!
CN109669934B (en) Data warehouse system suitable for electric power customer service and construction method thereof
US11003649B2 (en) Index establishment method and device
CN109241093B (en) Data query method, related device and database system
CN110837585B (en) Multi-source heterogeneous data association query method and system
US9135301B2 (en) Pushdown of sorting and set operations (union, intersection, minus) to a large number of low-power cores in a heterogeneous system
CN102521406B (en) Distributed query method and system for complex task of querying massive structured data
US8935233B2 (en) Approximate index in relational databases
CN111259933B (en) High-dimensional characteristic data classification method and system based on distributed parallel decision tree
US8037057B2 (en) Multi-column statistics usage within index selection tools
CN110389950B (en) Rapid running big data cleaning method
CN105117488B (en) A kind of distributed storage RDF data balanced division method based on hybrid hierarchy cluster
CN110909111A (en) Distributed storage and indexing method based on knowledge graph RDF data characteristics
US9110949B2 (en) Generating estimates for query optimization
WO2020211466A1 (en) Non-redundant gene clustering method and system, and electronic device
CN109325062B (en) Data dependency mining method and system based on distributed computation
CN108073641B (en) Method and device for querying data table
WO2019184325A1 (en) Community division quality evaluation method and system based on average mutual information
CN112148830A (en) Semantic data storage and retrieval method and device based on maximum area grid
CN110597929A (en) Parallel data cube construction method based on MapReduce
CN116701351A (en) Function dependence approximation discovery method suitable for big data
WO2024016569A1 (en) Index recommendation method and apparatus based on data feature
Svynchuk et al. Modification of Query Processing Methods in Distributed Databases Using Fractal Trees.
Behr et al. Learn What Really Matters: A Learning-to-Rank Approach for ML-based Query Optimization

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant