WO2017096941A1 - Background refresh method based on a Spark-SQL big data processing platform - Google Patents

Background refresh method based on a Spark-SQL big data processing platform

Info

Publication number
WO2017096941A1
WO2017096941A1 PCT/CN2016/095361 CN2016095361W
Authority
WO
WIPO (PCT)
Prior art keywords
data
spark
sql
big data
query
Prior art date
Application number
PCT/CN2016/095361
Other languages
English (en)
Chinese (zh)
Inventor
王成
冯骏
Original Assignee
深圳市华讯方舟软件技术有限公司
华讯方舟科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市华讯方舟软件技术有限公司, 华讯方舟科技有限公司 filed Critical 深圳市华讯方舟软件技术有限公司
Publication of WO2017096941A1 publication Critical patent/WO2017096941A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/1737 Details of further file system functions for reducing power consumption or coping with limited storage space, e.g. in mobile devices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Definitions

  • The invention relates to a background refresh method for a big data processing platform, and in particular to a background refresh method based on the Spark-SQL big data processing platform.
  • The data import function of the Spark big data processing platform is implemented through Spark-SQL, which in turn is implemented as Hive on Spark.
  • A Hive query can be submitted to the Spark cluster and computed there as a Spark task. Compared with Impala and Shark, Hive has more comprehensive SQL syntax support and a broader user base.
  • Data import usually involves key points such as import content, storage format, and import speed:
  • The imported content can be a formatted or unformatted text file, in which records and fields are delimited by specific separators or by the file format.
  • The data can be transmitted either as files or as a data stream, and its size is uncertain.
  • The data can be stored either in text format or in a compressed format to reduce disk usage.
  • The compression formats supported by Spark-SQL include ZIP, SNAPPY, and PARQUET.
  • Imported data can be partitioned based on its content and stored by partition to speed up queries.
  • The existing Spark-SQL data import and data refresh scheme (for external data files in text format) is as follows:
  • On the first query, the directory structure on the distributed file system HDFS is scanned and the metastore is updated; in a big data context, this first query therefore takes a long time. For subsequent (non-first) queries, HDFS is no longer scanned and the scan results of the first query are reused directly, so as to shorten the query time.
  • The advantage of this mechanism is that non-first queries are faster, but it has a drawback that cannot be ignored: after the first-query scan, any direct modification of the table space on HDFS is not recognized.
  • HDFS does not, in principle, support in-place modification operations.
  • Any insert and delete operations can only be performed through Spark-SQL and executed in Spark.
  • Both reads and writes then occupy a share of system resources, which indirectly reduces data import speed and query speed.
  • If files are removed from HDFS directly, queries on Spark fail with "file does not exist" errors; the only remedy is to restart the Spark-SQL process and re-run the first query so that HDFS is scanned again.
  • The first Spark-SQL query scans the entire table space of the queried table in the HDFS distributed file system and saves a snapshot of the table space. In a big data context, this first query takes a very long time and cannot meet the time requirements. Any changes made to the table after the scan are not recognized by Spark-SQL.
  • Scala is a pure object-oriented programming language; the Scalac compiler compiles source files into Java class files, i.e. bytecode executed on the JVM rather than native code, so query and import are comparatively less efficient.
  • In the Standalone mode of the Spark big data processing platform, resources on the control node are wasted.
  • The Spark big data processing platform is generally deployed as a cluster composed of several machines.
  • While the cluster is running, the import of external data and real-time queries on the data usually proceed concurrently, so the machines' resources are allocated to the data import program and the data query program at the same time.
  • In terms of I/O, CPU time, and memory, the two applications therefore conflict to a greater or lesser extent, and in severe cases the performance of both drops sharply.
  • The technical problem to be solved by the present invention is to avoid the step of scanning the distributed file system HDFS on the first query in a big data context, and thereby greatly shorten the first-query time of the Spark-SQL big data processing platform.
  • The background refresh method of the present invention, based on the Spark-SQL big data processing platform, creates a refresh process and sets up a timed refresh mechanism in the Spark-SQL entry function, and periodically scans the file directory structure of the specified table space on the distributed file system HDFS.
  • Before the refresh process has performed its first refresh, there is no directory structure information for the specified table space in memory.
  • In that case the original first-query policy is used: the file directory structure of the specified table space on the distributed file system HDFS is scanned before the query is answered. Once the refresh process has performed its first refresh, the directory structure information of the specified table space on HDFS is held in memory; when Spark-SQL receives a query, it no longer scans HDFS but directly uses the in-memory directory structure information of the table space, thereby shortening the query time. A sketch of this mechanism follows.
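  • Below is a minimal Scala sketch of the mechanism just described, written under stated assumptions: the class name, constructor parameters, and single-level directory listing are invented for illustration and are not the patent's actual implementation. A scheduled thread periodically lists the table-space directory on HDFS and publishes the listing atomically; until the first refresh completes, a query falls back to scanning on demand, mirroring the original first-query policy.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicReference
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Background refresher sketch: periodically scans one table-space directory on HDFS
// and keeps the latest listing in memory for queries to use.
class BackgroundRefresher(tablePath: String, intervalSeconds: Long, hadoopConf: Configuration) {

  // Latest directory snapshot; None until the first refresh has completed.
  private val snapshot = new AtomicReference[Option[Array[FileStatus]]](None)

  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  // Scan the table directory on HDFS and atomically replace the previous snapshot.
  private def refreshOnce(): Unit = {
    val fs      = FileSystem.get(hadoopConf)
    val listing = fs.listStatus(new Path(tablePath)) // recursive walk omitted for brevity
    snapshot.set(Some(listing))                      // old result stays visible until this call
  }

  // Start the timed refresh mechanism.
  def start(): Unit =
    scheduler.scheduleAtFixedRate(new Runnable { def run(): Unit = refreshOnce() },
      0L, intervalSeconds, TimeUnit.SECONDS)

  // What a query would consult: the cached listing if present, otherwise a scan on demand
  // (the fallback used before the first refresh has happened).
  def listingForQuery(): Array[FileStatus] =
    snapshot.get().getOrElse { refreshOnce(); snapshot.get().get }

  def stop(): Unit = scheduler.shutdown()
}
```

  • In the method described here the refresh loop lives inside the Spark-SQL entry function and the in-memory result is attached to the table's metadata, but the publish-then-read pattern is the same.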
  • The refresh interval is one tenth to one half of the time taken for a single refresh, or alternatively 5 to 10 seconds; the refresh interval may be customized according to product or user requirements.
  • the external data file is compressed and stored, and the compression format is ZIP, BZ2, SNAPPY or PARQUET.
  • Creating the temporary table: a temporary table for storing the text-format data is created according to the data model; this temporary table serves as the data source of the final data table.
  • Creating the big data table with partition information: in a big data context, creating a big data table with partition information improves data query speed. In practice the data can be partitioned by time (month, week, day, or hour), by a substring of a string field, by integer interval, or by a combination of these, further dividing the data and improving query speed.
  • Importing the data file into the temporary table: according to the data file format, a Spark-SQL statement or a Hadoop-supported Load statement is executed, and the text-format data is imported directly into the temporary table.
  • Processing the temporary-table data and storing it in the big data table with partition information: a Spark-SQL statement specifying the partition format and the storage format is executed; the data in the temporary table is analyzed and processed according to the specified partition format and then written to the final big data table in the specified (compressed) storage format. In this step, Spark first divides the data in the temporary table space into RDD data blocks according to the configuration, and each RDD data block is assigned to a designated task for parallel processing; the internal transformation mechanism of Spark-SQL then converts the partition information in the SQL statement into concrete operations on the RDD data blocks, so that the data is partitioned on the basis of RDD data blocks, and the partitioned data is compressed and written to the distributed file system HDFS. A sketch of these four steps follows.
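  • The four steps above can be illustrated with the Spark 2.x SparkSession API as follows; the table names, columns, file path, and delimiter are invented for the example, and the SET statements simply enable Hive dynamic partitioning so the INSERT can derive partition values from the data itself.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("import-sketch").enableHiveSupport().getOrCreate()

// Allow INSERT ... PARTITION to take its partition values from the selected columns.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

// 1) Temporary (staging) table holding the raw text data.
spark.sql("""
  CREATE TABLE IF NOT EXISTS staging_records (phone STRING, payload STRING, ts STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
""")

// 2) Final big data table with partition information, stored as PARQUET.
spark.sql("""
  CREATE TABLE IF NOT EXISTS records (phone STRING, payload STRING, ts STRING)
  PARTITIONED BY (hr INT, prefix STRING)
  STORED AS PARQUET
""")

// 3) Load the text file directly into the staging table.
spark.sql("LOAD DATA INPATH '/data/incoming/records.csv' INTO TABLE staging_records")

// 4) Repartition, compress, and write into the final table.
spark.sql("""
  INSERT INTO TABLE records PARTITION (hr, prefix)
  SELECT phone, payload, ts, hour(ts) AS hr, substr(phone, 1, 3) AS prefix
  FROM staging_records
""")
```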
  • Compared with the prior art, the background refresh method based on the Spark-SQL big data processing platform has the following beneficial effects.
  • The first-query time of the Spark-SQL big data processing platform is greatly shortened. Taking 20 TB of data as an example, the big data table is first partitioned by hour into 25 partitions (hours 0 to 23 plus one default partition), then sub-partitioned by the first 3 digits of the mobile phone number into 1001 partitions (000-999 plus one default partition), and stored compressed in PARQUET format; the test query asks for the total number of records of a given number segment in a given time period.
  • The original first-query time is about 20 minutes.
  • With the background refresh method of the present invention, the first-query time is shortened to about 45 seconds.
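  • Under the same assumptions as the earlier sketch (hypothetical table and column names, and the SparkSession named spark), the two-level partition layout of this example corresponds to hour and phone-prefix partition columns, and the benchmark query then touches only the partitions it names, so far fewer HDFS directories have to be listed and read.

```scala
// Hour (plus a default bucket) as the first-level partition, first three digits of the
// phone number (plus a default bucket) as the second level, stored as PARQUET.
spark.sql("""
  CREATE TABLE IF NOT EXISTS cdr (phone STRING, payload STRING, ts STRING)
  PARTITIONED BY (hr STRING, prefix STRING)
  STORED AS PARQUET
""")

// "Total number of records of a given number segment in a given time period":
spark.sql("SELECT count(*) FROM cdr WHERE hr BETWEEN '08' AND '10' AND prefix = '138'").show()
```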
  • The native Spark data import mechanism is the Spark-SQL data import statement.
  • While the data import program runs, it occupies some or all of the computing resources of the Spark big data processing platform, which greatly affects data query speed and efficiency.
  • Using a more efficient data importer to process data independently from Spark makes system utilization even higher.
  • the background refresh adopts an independent process, which does not occupy the original Spark's system resources.
  • the common compression formats in Spark are ZIP, BZ2, SNAPPY, and PARQUET.
  • The PARQUET format is available to every project in the Hadoop ecosystem, provides an efficiently compressed columnar data representation, and is independent of the data processing framework, data model, and programming language.
  • The PARQUET format is therefore the preferred big data storage format.
  • However, the Spark big data processing platform has certain limitations when querying data in PARQUET format: for large data tables stored as PARQUET, Spark-SQL scans the table's directory structure on HDFS only when the table is queried for the first time and does not scan it again, so directories added or deleted after the first query cannot be recognized.
  • The background refresh technique of the present invention effectively solves this problem.
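  • For comparison, Spark itself only offers a manual way to drop a Parquet table's cached file listing (Spark 2.x API shown below, against the hypothetical records table from the earlier sketch); the background refresh described here effectively performs the equivalent on a timer, so user queries never trigger the rescan themselves.

```scala
// Manual cache invalidation that Spark provides out of the box:
spark.catalog.refreshTable("records")   // programmatic form
spark.sql("REFRESH TABLE records")      // SQL form
```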
  • FIG. 1 is a schematic diagram of an overall framework of a Spark big data processing platform in the prior art.
  • FIG. 2 is a flow chart of a background refresh method based on the Spark-SQL big data processing platform of the present invention.
  • Figure 3 is a flow chart of the modified data query.
  • The background refresh method based on the Spark-SQL big data processing platform in this embodiment creates a refresh process in the Spark-SQL entry function, sets up a timed refresh mechanism, and periodically scans the file directory structure of the specified table space on the distributed file system HDFS.
  • As a preference, the refresh result is stored in memory to serve query requests on the table data.
  • Before the refresh process has performed its first refresh, there is no directory structure information for the specified table space in memory.
  • The first-refresh policy is to scan the file directory structure of the specified table space on the distributed file system HDFS before the query. Once the refresh process has performed its first refresh, the directory structure information of the specified table space on HDFS is held in memory; when Spark-SQL receives a query, HDFS is no longer scanned and the in-memory directory structure information of the table space is used directly, thereby shortening the query time.
  • The refresh interval is one tenth to one half of the time taken for a single refresh, or alternatively 5 to 10 seconds; the refresh interval may be customized according to product or user requirements.
  • the external data file is compressed and stored, and the compression format is ZIP, BZ2, SNAPPY or PARQUET.
  • Creating the temporary table: a temporary table for storing the text-format data is created according to the data model; this temporary table serves as the data source of the final data table.
  • Creating the big data table with partition information: in a big data context, creating a big data table with partition information improves data query speed. In practice the data can be partitioned by time (month, week, day, or hour), by a substring of a string field, by integer interval, or by a combination of these, further dividing the data and improving query speed.
  • Importing the text-format data file into the temporary table: according to the data file format, a Spark-SQL statement or a Hadoop-supported Load statement is executed, and the data is imported directly into the temporary table.
  • Processing the temporary-table data and storing it in the big data table with partition information: a Spark-SQL statement specifying the partition format and the storage format is executed; the data in the temporary table is analyzed and processed according to the specified partition format and then written to the final big data table in the specified (compressed) storage format. In this step, Spark first divides the data in the temporary table space into RDD data blocks according to the configuration, and each RDD data block is assigned to a designated task for parallel processing; the internal transformation mechanism of Spark-SQL then converts the partition information in the SQL statement into concrete operations on the RDD data blocks, so that the data is partitioned on the basis of RDD data blocks, and the partitioned data is compressed and written to the distributed file system HDFS. A DataFrame-based sketch of this write step follows.
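  • The same partition-compress-write step can also be expressed with the DataFrame writer instead of an INSERT statement; this Spark 2.x sketch reuses the hypothetical staging table and SparkSession from the earlier example and an invented warehouse path, writing one HDFS subdirectory per partition value as Snappy-compressed Parquet files.

```scala
val staged = spark.table("staging_records")

staged
  .selectExpr("phone", "payload", "ts", "hour(ts) AS hr", "substr(phone, 1, 3) AS prefix")
  .write
  .mode("append")
  .option("compression", "snappy")   // codec used for the Parquet files
  .partitionBy("hr", "prefix")       // one subdirectory per partition value on HDFS
  .parquet("/warehouse/records")
```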
  • FIG. 2 shows the background refresh flow.
  • Spark-SQL is programmed in Scala; a background refresh process is added to the Spark-SQL entry function, which periodically scans the directory structure of the specified table space on the distributed file system HDFS and saves it to memory for use by data queries.
  • After Spark-SQL starts, it first reads the hive-site.xml configuration file, parses out the configuration items related to the background refresh process, and sets up the timed refresh mechanism so that refreshes are triggered periodically by messages. A configuration-reading sketch follows.
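  • The patent does not disclose the names of these configuration items, so the keys below are invented for illustration; the sketch only shows how such items could be read from hive-site.xml with the standard Hadoop Configuration API.

```scala
import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
conf.addResource("hive-site.xml")   // resolved from the classpath, e.g. $SPARK_HOME/conf

// Hypothetical property names: whether to start the refresh process, how often to
// refresh, and which big data table spaces to refresh.
val refreshEnabled  = conf.getBoolean("background.refresh.enabled", false)
val refreshInterval = conf.getLong("background.refresh.interval.seconds", 10L)
val refreshTables   = conf.getStrings("background.refresh.tables")   // comma-separated list
```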
  • For each big data table to be refreshed, Spark-SQL creates a query plan, locates from the query plan the in-memory space that stores the table information, and calls the refresh method on that attribute to scan the table's directory structure on the distributed file system HDFS. The refresh method overwrites the previous scan result, and the old result is not cleared before being overwritten, ensuring that data remains available if a data query request arrives while a refresh is in progress.
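  • The "overwrite without clearing first" rule can be pictured as a single atomic publication: the new listing is built completely off to the side and only then swapped in, so a query arriving mid-refresh still reads the previous, fully consistent result. This is a generic sketch of the pattern, not the patent's internal data structure.

```scala
import java.util.concurrent.atomic.AtomicReference

val current = new AtomicReference[Seq[String]](Seq.empty)   // directory listing seen by queries

def refresh(scan: () => Seq[String]): Unit = {
  val fresh = scan()   // may take a while; readers keep using `current` in the meantime
  current.set(fresh)   // single atomic swap; nothing is emptied beforehand
}
```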
  • FIG. 3 is a flow chart of the modified data query.
  • The configuration items are located in the hive-site.xml file in the conf folder of the Spark installation directory; they control whether the refresh process is enabled, the refresh interval, and the set of big data table spaces to be refreshed.
  • the refresh process supports all data compression formats supported by Spark, such as PARQUET, SNAPPY, and ZIP.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Library & Information Science (AREA)

Abstract

The present invention relates to a background refresh method based on the Spark-SQL big data processing platform. A new process is created and a timed refresh mechanism is set up in the Spark-SQL entry function, and the file directory structure of a specified table space of the Hadoop distributed file system (HDFS) is scanned periodically. Configuration items are added to hive-site.xml under the conf folder of the Spark installation directory, so that whether to start a refresh process, the refresh interval, and the set of big data table spaces to refresh can all be configured in a customized manner. With the present invention, in a big data context, the first-query time of the Spark-SQL big data processing platform is greatly shortened. Taking 20 TB of data as an example, a big data table is partitioned into 25 regions using the hour as the first-level partition, partitioned into 1001 regions using the first three digits of the mobile phone number as the second-level partition, and stored compressed in PARQUET format; for a query requesting the total amount of data of a given number segment over a given time period, the original first-query time is approximately 20 minutes, while with the background refresh method optimized by the present invention the first-query time is reduced to approximately 45 seconds.
PCT/CN2016/095361 2015-12-11 2016-08-15 Background refresh method based on a Spark-SQL big data processing platform WO2017096941A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510919868.6A CN105550293B (zh) 2015-12-11 2015-12-11 Background refresh method based on a Spark-SQL big data processing platform
CN201510919868.6 2015-12-11

Publications (1)

Publication Number Publication Date
WO2017096941A1 (fr) 2017-06-15

Family

ID=55829482

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/095361 WO2017096941A1 (fr) 2015-12-11 2016-08-15 Background refresh method based on a Spark-SQL big data processing platform

Country Status (2)

Country Link
CN (1) CN105550293B (fr)
WO (1) WO2017096941A1 (fr)


Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550293B (zh) * 2015-12-11 2018-01-16 深圳市华讯方舟软件技术有限公司 一种基于Spark‑SQL大数据处理平台的后台刷新方法
US10305967B2 (en) * 2016-03-14 2019-05-28 Business Objects Software Ltd. Unified client for distributed processing platform
CN106570129A (zh) * 2016-10-27 2017-04-19 南京邮电大学 一种对实时数据进行快速分析的存储系统及其存储方法
CN106777278B (zh) * 2016-12-29 2021-02-23 海尔优家智能科技(北京)有限公司 一种基于Spark的数据处理方法及装置
CN106682213B (zh) * 2016-12-30 2020-08-07 Tcl科技集团股份有限公司 基于Hadoop平台的物联网任务订制方法及系统
CN108959952B (zh) * 2017-05-23 2020-10-30 中国移动通信集团重庆有限公司 数据平台权限控制方法、装置和设备
CN107391555B (zh) * 2017-06-07 2020-08-04 中国科学院信息工程研究所 一种面向Spark-Sql检索的元数据实时更新方法
CN108108490B (zh) * 2018-01-12 2019-08-27 平安科技(深圳)有限公司 Hive表扫描方法、装置、计算机设备及存储介质
CN109491973A (zh) * 2018-09-25 2019-03-19 中国平安人寿保险股份有限公司 电子装置、保单变化数据分布式分析方法及存储介质
CN109189798B (zh) * 2018-09-30 2021-12-17 浙江百世技术有限公司 一种基于spark同步更新数据的方法
CN109473178B (zh) * 2018-11-12 2022-04-01 北京懿医云科技有限公司 医疗数据整合的方法、系统、设备及存储介质
CN109800782A (zh) * 2018-12-11 2019-05-24 国网甘肃省电力公司金昌供电公司 一种基于模糊knn算法的电网故障检测方法及装置
CN110222009B (zh) * 2019-05-28 2021-08-06 咪咕文化科技有限公司 一种Hive入库异常文件自动处理方法及装置
CN110209654A (zh) * 2019-06-05 2019-09-06 深圳市网心科技有限公司 一种文本文件数据入库方法、系统及电子设备和存储介质
CN111159235A (zh) * 2019-12-20 2020-05-15 中国建设银行股份有限公司 数据预分区方法、装置、电子设备及可读存储介质
CN111427887A (zh) * 2020-03-17 2020-07-17 中国邮政储蓄银行股份有限公司 一种快速扫描HBase分区表的方法、装置、系统
CN114238450B (zh) * 2022-02-22 2022-08-16 阿里云计算有限公司 时间分区方法及装置


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699676B (zh) * 2013-12-30 2017-02-15 厦门市美亚柏科信息股份有限公司 基于mssql server表分区及自动维护方法及系统

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8516022B1 (en) * 2012-01-11 2013-08-20 Emc Corporation Automatically committing files to be write-once-read-many in a file system
CN104239377A (zh) * 2013-11-12 2014-12-24 新华瑞德(北京)网络科技有限公司 跨平台的数据检索方法及装置
CN104767795A (zh) * 2015-03-17 2015-07-08 浪潮通信信息系统有限公司 一种基于hadoop的lte mro数据统计方法及系统
CN105550293A (zh) * 2015-12-11 2016-05-04 深圳市华讯方舟软件技术有限公司 一种基于Spark-SQL大数据处理平台的后台刷新方法

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110136777A (zh) * 2018-02-09 2019-08-16 深圳先进技术研究院 一种基于Spark框架的重测序序列比对方法
CN111666260A (zh) * 2019-03-08 2020-09-15 杭州海康威视数字技术股份有限公司 数据处理方法及装置
CN110162563A (zh) * 2019-05-28 2019-08-23 深圳市网心科技有限公司 一种数据入库方法、系统及电子设备和存储介质
CN110162563B (zh) * 2019-05-28 2023-11-17 深圳市网心科技有限公司 一种数据入库方法、系统及电子设备和存储介质
CN110727684A (zh) * 2019-10-08 2020-01-24 浪潮软件股份有限公司 一种用于大数据统计分析的增量数据同步的方法
CN110727684B (zh) * 2019-10-08 2023-07-25 浪潮软件股份有限公司 一种用于大数据统计分析的增量数据同步的方法
CN110765154A (zh) * 2019-10-16 2020-02-07 华电莱州发电有限公司 火电厂海量实时生成数据的处理方法及装置
CN110990669A (zh) * 2019-10-16 2020-04-10 广州丰石科技有限公司 一种基于规则生成的dpi解析方法和系统
CN110990340A (zh) * 2019-11-12 2020-04-10 上海麦克风文化传媒有限公司 一种大数据多层次存储架构
CN110990340B (zh) * 2019-11-12 2024-04-12 上海麦克风文化传媒有限公司 一种大数据多层次存储架构
CN111179048B (zh) * 2019-12-31 2023-05-02 中国银行股份有限公司 基于spark的用户资讯个性化分析方法、装置及系统
CN111179048A (zh) * 2019-12-31 2020-05-19 中国银行股份有限公司 基于spark的用户资讯个性化分析方法、装置及系统
CN111488323B (zh) * 2020-04-14 2023-06-13 中国农业银行股份有限公司 一种数据处理方法、装置及电子设备
CN111488323A (zh) * 2020-04-14 2020-08-04 中国农业银行股份有限公司 一种数据处理方法、装置及电子设备
CN112163030A (zh) * 2020-11-03 2021-01-01 北京明略软件系统有限公司 多表批量操作方法、系统及计算机设备
CN112783923A (zh) * 2020-11-25 2021-05-11 辽宁振兴银行股份有限公司 一种基于Spark和Impala高效采集数据库的实现方法
CN113553533A (zh) * 2021-06-10 2021-10-26 国网安徽省电力有限公司 一种基于数字化内部五级市场考核体系的指标计算方法
CN113434608A (zh) * 2021-07-06 2021-09-24 中国银行股份有限公司 Hive数据仓库的数据处理方法及装置

Also Published As

Publication number Publication date
CN105550293A (zh) 2016-05-04
CN105550293B (zh) 2018-01-16

Similar Documents

Publication Publication Date Title
WO2017096941A1 (fr) 2017-06-15 Background refresh method based on a Spark-SQL big data processing platform
US9678969B2 (en) Metadata updating method and apparatus based on columnar storage in distributed file system, and host
CN109891402B (zh) 可撤销和在线模式转换
US11119997B2 (en) Lock-free hash indexing
US11556396B2 (en) Structure linked native query database management system and methods
WO2017096940A1 (fr) Procédé d'importation de données pour une plateforme de traitement de big data basée sur spark sql
EP3170109B1 (fr) Procédé et système de construction et de mise à jour adaptatives d'une base de données de stockage en colonnes à partir d'une base de données de stockage en lignes sur la base de demandes d'interrogations
WO2019128205A1 (fr) Procédé et dispositif de réalisation d'une publication d'échelle de gris, nœud informatique et système
US11275759B2 (en) Data storage method and apparatus, server, and storage medium
CN111797121B (zh) 读写分离架构业务系统的强一致性查询方法、装置及系统
US9418094B2 (en) Method and apparatus for performing multi-stage table updates
CN104679898A (zh) 一种大数据访问方法
CN112286941B (zh) 一种基于Binlog+HBase+Hive的大数据同步方法和装置
CN104778270A (zh) 一种用于多文件的存储方法
US20230418811A1 (en) Transaction processing method and apparatus, computing device, and storage medium
Sears et al. Rose: Compressed, log-structured replication
WO2020041950A1 (fr) Procédé de mise à jour de données, dispositif, et dispositif de stockage employant une indexation avec arbre b+
CN113050886B (zh) 面向嵌入式内存数据库的非易失性内存存储方法及系统
US10558636B2 (en) Index page with latch-free access
CN113672556A (zh) 一种批量文件的迁移方法及装置
Schindler Profiling and analyzing the I/O performance of NoSQL DBs
Valvag et al. Cogset vs. hadoop: Measurements and analysis
US20190163799A1 (en) Database management system and database management method
US11816106B2 (en) Memory management for KLL sketch
US20240143566A1 (en) Data processing method and apparatus, and computing system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16872137

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16872137

Country of ref document: EP

Kind code of ref document: A1