WO2023124543A1 - Data processing method and data processing apparatus for big data - Google Patents

Data processing method and data processing apparatus for big data

Info

Publication number
WO2023124543A1
Authority
WO
WIPO (PCT)
Prior art keywords
program
data processing
virtual machine
operator
big data
Application number
PCT/CN2022/130286
Other languages
English (en)
Chinese (zh)
Inventor
俞博文
冯冠宇
曹焕琦
郑纬民
陈文光
Original Assignee
清华大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-12-27
Filing date
2022-11-07
Publication date
2023-07-06
Application filed by 清华大学
Publication of WO2023124543A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/40: Transformation of program code
    • G06F8/41: Compilation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]

Definitions

  • optimizing the first part of the program includes: setting at least one operator in the first part of the program to process multiple data items as a batch on each invocation.
  • optimizing the first part of the program may include: for the data to be processed by the first part of the program, arranging data of the same data type so that it is stored contiguously in memory. This optimization may be referred to as a data arrangement operation.
  • the function pointer in the second part of the program is converted into a pointer address that can actually call a function in the predetermined virtual machine big data processing system, and the functions called in this way implement the second operator in the second part of the program.
  • the virtual machine code calls the functional modules of the predetermined virtual machine big data processing system (for example, the processing module of the corresponding operator), avoiding the need to re-implement these functional modules.
  • "predetermined" here means determined in advance, that is, the target processing system pre-selected for code conversion and invocation.
  • the code compiled in the native system includes the function implementation of the operator, so it can be run by the native system using the virtual machine's mechanism for executing machine code (e.g., JNI, JNA). Once this part of the program has run in the native system, the operators' functions have been carried out.
  • the code compiled in the native system does not contain the function implementation of the operator, but is converted into virtual machine code supported by the predetermined virtual machine big data processing system, so it can be executed by the predetermined virtual machine big data processing system residing in the virtual machine.
  • the engine of the predetermined virtual machine big data processing system can be reused, so as to realize functions such as distributed execution, elasticity, straggler mitigation, and monitoring.
  • the engine of the native system big data processing system in the embodiment of the present disclosure may include a driver in the virtual machine system, which is an application of a virtual machine big data processing system (such as Spark), commonly implemented in a virtual machine programming language (such as Java).
  • an application program for loading compiled program code (i.e., a "loadable module")
  • the driver can be submitted to a Spark-compatible cluster just like a normal Spark application. After a successful submission, the driver loads the loadable module from the native system, registers the engine implementation with the loadable module, and starts the main program contained in the loadable module.
  • the driver can also instruct the newly created executor to prepare the environment, such as downloading the above-mentioned loadable modules.
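
The batch-processing and data-arrangement optimizations described in the first two definitions can be illustrated with a minimal sketch. It is not the patented implementation: the record layout, the function names `to_columns` and `scale_batched`, and the batch size are all assumptions chosen for illustration, and Python's standard `array` module stands in for contiguous native memory.

```python
from array import array


def to_columns(records):
    """Data arrangement: store each field of one type contiguously.

    `records` is a hypothetical list of (id, value) tuples.
    """
    ids = array("q", (r[0] for r in records))      # signed 64-bit integers
    values = array("d", (r[1] for r in records))   # 64-bit floats
    return ids, values


def scale_batched(values, factor, batch_size=4):
    """Batch processing: the operator handles a slice of items per call
    instead of being invoked once per item."""
    out = array("d")
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]   # contiguous slice
        out.extend(v * factor for v in batch)
    return out


records = [(1, 0.5), (2, 1.5), (3, 2.5), (4, 3.5), (5, 4.5)]
ids, values = to_columns(records)
scaled = scale_batched(values, 2.0)
```

Keeping each column in its own typed array means a batch is a single contiguous slice, which is the property a native implementation would rely on for cache-friendly or vectorized processing.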

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention relates to a data processing method and a data processing apparatus for big data. The data processing method comprises: acquiring a main program written in a native programming language (S101); compiling the main program in a native system and generating a loadable module, the loadable module comprising a first program part and a second program part (S102); loading the loadable module by a virtual machine running in the native system, and converting the second program part into virtual machine code supported by a predetermined virtual machine big data processing system (S103); and executing, by the virtual machine, the main program contained in the loadable module, the virtual machine constructing a directed acyclic graph program corresponding to the loadable module and then executing the directed acyclic graph program, transferring the first program part to the native system for execution, and transferring the converted second program part to the predetermined virtual machine big data processing system for execution (S104). With this data processing method, a high-performance big data processing framework can be built while integrating the existing virtual machine big data software ecosystem.
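
The dispatch step in S104 can be pictured with a small sketch under stated assumptions. The operator names, the `"native"` and `"vm"` labels, and the `run_dag` helper are hypothetical; both execution targets are simulated by plain Python functions, whereas a real system would hand the first part to compiled native code and the second part to a virtual machine big data engine.

```python
from graphlib import TopologicalSorter

# Hypothetical operators: each entry is (execution target, function).
# "native" stands for the first program part, "vm" for the converted second part.
OPERATORS = {
    "source": ("native", lambda _: [1, 2, 3, 4]),
    "square": ("native", lambda xs: [x * x for x in xs]),
    "sum":    ("vm",     lambda xs: [sum(xs)]),
}

# Directed acyclic graph: each node maps to the set of nodes it depends on.
EDGES = {"source": set(), "square": {"source"}, "sum": {"square"}}


def run_dag(operators, edges):
    """Execute operators in dependency order, recording where each one ran."""
    results, log = {}, []
    for name in TopologicalSorter(edges).static_order():
        target, fn = operators[name]
        preds = sorted(edges[name])
        # This sketch only supports operators with at most one input.
        data = results[preds[0]] if preds else None
        results[name] = fn(data)
        log.append((name, target))
    return results, log


results, log = run_dag(OPERATORS, EDGES)
```

Here `log` records which target each operator was routed to, which mirrors the split between the natively executed part and the part delegated to the virtual machine system.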
PCT/CN2022/130286 2021-12-27 2022-11-07 Data processing method and data processing apparatus for big data WO2023124543A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111618375.0 2021-12-27
CN202111618375.0A CN114327479A (zh) 2021-12-27 2021-12-27 用于大数据的数据处理方法和数据处理装置

Publications (1)

Publication Number Publication Date
WO2023124543A1 (fr) 2023-07-06

Family

ID=81014410

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/130286 WO2023124543A1 (fr) 2021-12-27 2022-11-07 Data processing method and data processing apparatus for big data

Country Status (2)

Country Link
CN (1) CN114327479A (fr)
WO (1) WO2023124543A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114327479A (zh) * 2021-12-27 2022-04-12 清华大学 用于大数据的数据处理方法和数据处理装置
CN115378789B (zh) * 2022-10-24 2023-01-10 中国地质大学(北京) 一种多层次协作的流资源管理方法及系统

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103777997A (zh) * 2013-12-25 2014-05-07 中软信息系统工程有限公司 一种基于mips的java虚拟机硬件无关化平台及其无关化改进方法
CN106648681A (zh) * 2016-12-29 2017-05-10 南京科远自动化集团股份有限公司 一种可编程语言编译装载系统及方法
CN111309449A (zh) * 2020-03-17 2020-06-19 上海蓝载信息科技有限公司 面向元编程、交互式编程和区块链互操作的与编程语言无关的虚拟机
CN111767116A (zh) * 2020-06-03 2020-10-13 江苏中科重德智能科技有限公司 面向机械臂程序开发编程语言的虚拟机及对汇编文件的运行方法
US20210124600A1 (en) * 2019-10-29 2021-04-29 International Business Machines Corporation Rescheduling jit compilation based on jobs of parallel distributed computing framework
CN114327479A (zh) * 2021-12-27 2022-04-12 清华大学 用于大数据的数据处理方法和数据处理装置

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550296B (zh) * 2015-12-10 2018-10-30 深圳市华讯方舟软件技术有限公司 一种基于spark-SQL大数据处理平台的数据导入方法

Also Published As

Publication number Publication date
CN114327479A (zh) 2022-04-12

Similar Documents

Publication Publication Date Title
US10437573B2 (en) General purpose distributed data parallel computing using a high level language
KR102370568B1 (ko) 모놀리식 레거시 애플리케이션들에 기초한 마이크로서비스들의 컨테이너화된 전개
Elser et al. An evaluation study of bigdata frameworks for graph processing
Chen et al. Flinkcl: An opencl-based in-memory computing architecture on heterogeneous cpu-gpu clusters for big data
WO2023124543A1 (fr) Procédé de traitement de données et appareil de traitement de données pour mégadonnées
Murray et al. CIEL: A universal execution engine for distributed data-flow computing
US11556396B2 (en) Structure linked native query database management system and methods
US8572575B2 (en) Debugging a map reduce application on a cluster
Yuan et al. Spark-GPU: An accelerated in-memory data processing engine on clusters
Isard et al. Distributed data-parallel computing using a high-level programming language
Raychev et al. Parallelizing user-defined aggregations using symbolic execution
US8863096B1 (en) Parallel symbolic execution on cluster of commodity hardware
US20090271775A1 (en) Optimizing Just-In-Time Compiling For A Java Application Executing On A Compute Node
US10749984B2 (en) Processing requests for multi-versioned service
US11848980B2 (en) Distributed pipeline configuration in a distributed computing system
US10019473B2 (en) Accessing an external table in parallel to execute a query
Miceli et al. Programming abstractions for data intensive computing on clouds and grids
de Carvalho Junior et al. Contextual abstraction in a type system for component-based high performance computing platforms
Asadi et al. Hybrid quantum programming with PennyLane Lightning on HPC platforms
Schneider et al. Language Runtime and Optimizations in IBM Streams.
Thor et al. Cloudfuice: A flexible cloud-based data integration system
Lei et al. Chitu: Accelerating Serverless Workflows with Asynchronous State Replication Pipelines
Ren et al. Efficient shuffle management for DAG computing frameworks based on the FRQ model
Kukreti et al. CloneHadoop: Process Cloning to Reduce Hadoop's Long Tail
Tardieu et al. X10 for productivity and performance at scale

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22913806

Country of ref document: EP

Kind code of ref document: A1