CN111831713A - A data processing method, device and equipment

Info

Publication number: CN111831713A
Application number: CN201910312700.7A
Authority: CN
Inventors: 周祥; 王烨; 李鸣翔
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2019-04-18
Filing date: 2019-04-18
Publication date: 2020-10-27
Anticipated expiration: 2039-04-18
Also published as: WO2020211717A1; CN111831713B

Abstract

The present application provides a data processing method, apparatus and device. The method includes: obtaining a data processing request, the data processing request including a first input format and a second output format; obtaining a target processing unit, where the target conversion information of the target processing unit is first conversion information, and the first conversion information is used to realize conversion from the first input format to the second output format; obtaining first data in the first input format from a data source according to the data processing request, and outputting the first data to the target processing unit, so that the target processing unit converts the first data into second data in the second output format using the first conversion information; and obtaining the second data from the target processing unit and outputting the second data. With the technical solution of the present application, the computing resources of the data lake analysis system can be saved and the processing performance can be improved.

Description

A data processing method, device and equipment

Technical Field

The present application relates to the field of Internet technologies, and in particular to a data processing method, apparatus, and device.

Background

Data Lake Analytics provides users with serverless query and analysis services. It can analyze and query massive amounts of data along any dimension, and supports high concurrency, low latency (millisecond-level response), real-time online analysis, massive data queries, and similar functions.

A data lake analysis system includes a storage cluster and a computing cluster. The storage cluster includes data sources of different types, and these data sources use different data formats. The computing cluster includes multiple computing nodes, and different computing nodes may use different data formats. The data format used by a data source usually differs from the data format used by a computing node, so the data format needs to be converted.

For example, data in data format A1 is read from a data source and converted into data in data format B1, and the data in data format B1 is output to a computing node, which processes it. Because different types of data sources use different data formats, and different computing nodes also use different data formats, the data lake analysis system has to support conversion between a wide variety of data formats and has to provide a large amount of computing resources to perform these conversions. As the number of users increases, the demand for computing resources increases as well.

Summary of the Invention

The present application provides a data processing method, the method including:

obtaining a data processing request, the data processing request including a first input format and a second output format;

obtaining a target processing unit, where the target conversion information of the target processing unit is first conversion information, and the first conversion information is used to realize conversion from the first input format to the second output format;

obtaining first data in the first input format from a data source according to the data processing request, and outputting the first data to the target processing unit, so that the target processing unit converts the first data into second data in the second output format using the first conversion information;

obtaining the second data from the target processing unit, and outputting the second data.

The present application provides a data processing method applied to a data lake analysis system, the data lake analysis system being used to provide users with serverless data processing services, the method including:

obtaining a target processing unit from multiple processing units of the data lake analysis system, where the target conversion information of the target processing unit is first conversion information, and the first conversion information is used to realize conversion from the first input format to the second output format;

obtaining the second data from the target processing unit, and outputting the second data;

where the data source includes a cloud database provided by the data lake analysis system.

obtaining first data in a first input format from a data source according to the data processing request;

outputting the first data in the first input format to a target processing unit, so that the target processing unit converts the first data into second data in a second output format;

The present application provides a data processing method applied to a data lake analysis system. For a processing unit among multiple processing units of the data lake analysis system, the processing unit includes multiple different pieces of conversion information, and different conversion information is used to realize data conversion between different formats. The method includes:

the processing unit obtaining first data in a first input format;

if the target conversion information of the processing unit is first conversion information, and the first conversion information is used to realize conversion from the first input format to the second output format, converting the first data into second data in the second output format using the first conversion information;

the processing unit outputting the second data.

The present application provides a data processing apparatus, the apparatus including:

an obtaining module, configured to obtain a data processing request, the data processing request including a first input format and a second output format, and to obtain a target processing unit, where the target conversion information of the target processing unit is first conversion information, and the first conversion information is used to realize conversion from the first input format to the second output format;

a processing module, configured to obtain first data in the first input format from a data source according to the data processing request, and to output the first data to the target processing unit, so that the target processing unit converts the first data into second data in the second output format using the first conversion information;

The present application provides a data processing device, including:

a processor and a machine-readable storage medium, where a number of computer instructions are stored on the machine-readable storage medium, and the processor performs the following processing when executing the computer instructions:

Based on the above technical solution, in the embodiments of the present application, the target conversion information of the target processing unit is set to the first conversion information, so that the target processing unit uses the first conversion information to convert the first data in the first input format into the second data in the second output format. In other words, the data format conversion is performed by the target processing unit. Because the target processing unit is usually implemented by a logic chip, and a logic chip has high processing performance, this saves computing resources of the data lake analysis system (such as CPU (Central Processing Unit) resources), improves the overall processing performance of the data lake analysis system, improves its overall efficiency of use and user experience, and accelerates data processing and computation. Hardware acceleration is used to handle the data interface to the storage cluster and to provide a data interface to the computing cluster.

Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some of the embodiments recorded in the present application, and a person of ordinary skill in the art may derive other drawings from them.

FIG. 1 is a schematic flowchart of a data processing method in an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a data lake analysis system in an embodiment of the present application;

FIG. 3A to FIG. 3E are schematic diagrams of a data scanning cluster in an embodiment of the present application;

FIG. 4 is a schematic diagram of data format conversion in an embodiment of the present application;

FIG. 5A and FIG. 5B are structural diagrams of a data scanning cluster in an embodiment of the present application;

FIG. 6 is a schematic flowchart of a data processing method in an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a data processing apparatus in an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a data processing device in an embodiment of the present application.

Detailed Description

The terms used in the embodiments of the present application are only for the purpose of describing specific embodiments and are not intended to limit the present application. The singular forms "a," "said," and "the" used in the present application and the claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, and so on may be used in the embodiments of the present application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. In addition, depending on the context, the word "if" as used herein may be interpreted as "at the time of," "when," or "in response to determining."

An embodiment of the present application proposes a data processing method that can be applied to any device, such as any device of a data lake analysis system. FIG. 1 is a flowchart of the method, which may include:

Step 101: obtain a data processing request, where the data processing request includes a first input format (that is, the format of the data in the data source) and a second output format (that is, the format of the data to be output).

Step 102: obtain a target processing unit. The target conversion information of the target processing unit is first conversion information, and the first conversion information is used to realize conversion from the first input format to the second output format. Based on the first conversion information, the target processing unit can convert data in the first input format into data in the second output format.

Optionally, in one example, obtaining the target processing unit may include, but is not limited to: arbitrarily selecting a processing unit from the multiple processing units of the data lake analysis system and using the selected processing unit as the target processing unit. Alternatively, the target conversion information of the multiple processing units of the data lake analysis system may be obtained, a processing unit may be selected from the multiple processing units using the target conversion information of each processing unit, and the selected processing unit may be used as the target processing unit.

In one example, each processing unit of the data lake analysis system may be either a processing unit that is currently idle (that is, the processing unit is not currently performing a data conversion operation) or a processing unit that is currently working (that is, the processing unit is currently performing a data conversion operation).

In one example, selecting a processing unit from the multiple processing units using the target conversion information of each processing unit, and using the selected processing unit as the target processing unit, may include, but is not limited to: if there is a processing unit whose target conversion information is the first conversion information (used to realize conversion from the first input format to the second output format), determining that processing unit as the target processing unit; or, if there is no processing unit whose target conversion information is the first conversion information, arbitrarily selecting a processing unit from the multiple processing units and determining the selected processing unit as the target processing unit.
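By way of illustration only, the selection rule described above might be sketched in Python as follows. The `ProcessingUnit` class, the representation of conversion information as an `(input_format, output_format)` pair, and the format labels `A1`, `A2`, `B1` are assumptions introduced for this sketch and are not defined by the present application.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Assumption: conversion information is modelled as an (input, output) pair.
Conversion = Tuple[str, str]


@dataclass
class ProcessingUnit:
    """Hypothetical software model of one processing unit."""
    name: str
    target_conversion: Optional[Conversion] = None  # currently active conversion
    busy: bool = False  # idle and working units are both eligible


def select_target_unit(units: Sequence[ProcessingUnit],
                       wanted: Conversion,
                       rng: Optional[random.Random] = None) -> ProcessingUnit:
    """Prefer a unit whose target conversion already equals `wanted`;
    if none exists, fall back to an arbitrary unit."""
    if not units:
        raise ValueError("no processing units available")
    for unit in units:
        if unit.target_conversion == wanted:
            return unit
    return (rng or random).choice(list(units))


units = [ProcessingUnit("pu0", ("A2", "B1")), ProcessingUnit("pu1", ("A1", "B1"))]
chosen = select_target_unit(units, ("A1", "B1"))  # pu1 already matches
```

Preferring an already matching unit avoids changing its target conversion information; the arbitrary fallback corresponds to the second branch of the example above.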

Optionally, in one example, after the target processing unit is obtained, the method may further include, but is not limited to: if the target conversion information of the target processing unit is the first conversion information, keeping the target conversion information of the target processing unit unchanged according to the first input format and the second output format; or, if the target conversion information of the target processing unit is second conversion information (the second conversion information is not used to realize conversion from the first input format to the second output format), modifying the target conversion information of the target processing unit to the first conversion information according to the first input format and the second output format.
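The keep-or-modify rule above can be illustrated with a minimal Python sketch. The `ProcessingUnit` class and the function name `ensure_target_conversion` are hypothetical, and how a real logic chip would actually be switched between conversions is outside the scope of this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumption: conversion information is modelled as an (input, output) pair.
Conversion = Tuple[str, str]


@dataclass
class ProcessingUnit:
    """Hypothetical software model of one processing unit."""
    name: str
    target_conversion: Optional[Conversion] = None


def ensure_target_conversion(unit: ProcessingUnit,
                             input_format: str,
                             output_format: str) -> bool:
    """Return True if the unit had to be switched, False if left unchanged."""
    wanted = (input_format, output_format)
    if unit.target_conversion == wanted:
        return False  # already the first conversion information: keep it
    unit.target_conversion = wanted  # was some second conversion information
    return True
```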

Optionally, in one example, for step 102, it may also be determined whether the data lake analysis system supports conversion from the first input format to the second output format. If so, that is, a processing unit of the data lake analysis system supports conversion from the first input format to the second output format, the target processing unit is obtained from the multiple processing units of the data lake analysis system. If not, that is, none of the processing units of the data lake analysis system supports conversion from the first input format to the second output format, the conventional flow is used for processing.
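As a rough sketch of this support check, assuming each processing unit advertises the set of conversions it holds conversion information for (the `unit_capabilities` structure and the path labels are illustrative names only):

```python
from typing import FrozenSet, Sequence, Tuple

# Assumption: conversion information is modelled as an (input, output) pair.
Conversion = Tuple[str, str]


def system_supports(unit_capabilities: Sequence[FrozenSet[Conversion]],
                    wanted: Conversion) -> bool:
    """True if at least one processing unit supports the wanted conversion."""
    return any(wanted in caps for caps in unit_capabilities)


def choose_path(unit_capabilities: Sequence[FrozenSet[Conversion]],
                wanted: Conversion) -> str:
    """Pick the processing-unit path when supported, else the conventional flow."""
    if system_supports(unit_capabilities, wanted):
        return "processing_unit"
    return "traditional"
```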

Optionally, in one example, for step 102, the data processing request may further include a number of shards; the number of target processing units is determined according to the number of shards, and that number of target processing units is obtained.
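The present application does not specify how the number of shards maps to the number of target processing units. The following sketch assumes, purely for illustration, one unit per shard capped by the number of available units, with shards distributed round-robin; both functions are hypothetical.

```python
from typing import Dict, List


def target_unit_count(shard_count: int, available_units: int) -> int:
    """Assumed policy: one unit per shard, capped by what is available."""
    if shard_count < 1 or available_units < 1:
        raise ValueError("shard_count and available_units must be positive")
    return min(shard_count, available_units)


def assign_shards(shard_count: int, unit_count: int) -> Dict[int, List[int]]:
    """Distribute shard indices over unit indices round-robin."""
    assignment: Dict[int, List[int]] = {u: [] for u in range(unit_count)}
    for shard in range(shard_count):
        assignment[shard % unit_count].append(shard)
    return assignment
```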

Step 103: obtain first data in the first input format from the data source according to the data processing request, and output the first data to the target processing unit, so that the target processing unit converts the first data into second data in the second output format using the first conversion information. The conversion process itself is not described further here.

Step 104: obtain the second data from the target processing unit and output the second data. For example, the second data may be output to a computing node, so that the computing node uses the second data for processing.
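Steps 101 to 104 can be illustrated end to end with the following minimal sketch. It is a software stand-in only: the `Request` and `ProcessingUnit` classes, the `source_key` field, and the choice of `csv` and `json` as example formats are all assumptions made for this sketch, whereas the application describes the processing unit as typically being a logic chip.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Assumption: conversion information is modelled as an (input, output) pair.
Conversion = Tuple[str, str]


@dataclass
class Request:
    input_format: str   # format of the data in the data source
    output_format: str  # format of the data to be output
    source_key: str     # hypothetical locator of the data to read


def csv_to_json(text: str) -> str:
    """Toy converter: header line plus rows -> JSON array of objects."""
    header, *rows = [line.split(",") for line in text.strip().splitlines()]
    return json.dumps([dict(zip(header, row)) for row in rows])


class ProcessingUnit:
    """Software stand-in for a unit holding several conversions."""

    def __init__(self, converters: Dict[Conversion, Callable[[str], str]]):
        self.converters = converters
        self.target_conversion: Optional[Conversion] = None

    def convert(self, first_data: str) -> str:
        return self.converters[self.target_conversion](first_data)


def handle(request: Request, source: Dict[str, str], unit: ProcessingUnit) -> str:
    wanted = (request.input_format, request.output_format)  # step 101
    unit.target_conversion = wanted                          # step 102
    first_data = source[request.source_key]                  # step 103: read
    second_data = unit.convert(first_data)                   # step 103: convert
    return second_data                                       # step 104: output
```

Here the data source is modelled as a plain dictionary; in the described system it would be a database in the storage cluster, and the returned second data would be handed to a computing node.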

In one example, the data processing request may further include a service mode. If the service mode is a traffic mode, the total amount of data may be obtained, virtual resource information (such as fee information) may be determined according to the total amount of data, and the virtual resource information may be output. Alternatively, if the service mode is an instance mode, the number of target processing units may be obtained, virtual resource information may be determined according to the number of target processing units, and the virtual resource information may be output.
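A minimal sketch of the two service modes follows. The linear pricing and the parameters `price_per_gib` and `price_per_unit` are hypothetical; the application only states that the virtual resource information is determined from the total amount of data or from the number of target processing units.

```python
from enum import Enum


class ServiceMode(Enum):
    TRAFFIC = "traffic"    # charged by total amount of data
    INSTANCE = "instance"  # charged by number of target processing units


def virtual_resource_cost(mode: ServiceMode, *,
                          total_bytes: int = 0,
                          unit_count: int = 0,
                          price_per_gib: float = 1.0,
                          price_per_unit: float = 1.0) -> float:
    """Assumed linear pricing for each service mode."""
    if mode is ServiceMode.TRAFFIC:
        return total_bytes / 2**30 * price_per_gib
    if mode is ServiceMode.INSTANCE:
        return unit_count * price_per_unit
    raise ValueError(f"unknown service mode: {mode}")
```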

In the above embodiments, the target processing unit includes multiple different pieces of conversion information, and different conversion information is used to realize data conversion between different formats. The target processing unit is implemented by a logic chip, and the logic chip may include, but is not limited to: an FPGA (Field Programmable Gate Array), a CPLD (Complex Programmable Logic Device), an ASIC (Application Specific Integrated Circuit), and the like, which is not limited here.

In one example, the above execution order is only an example given for convenience of description. In practical applications, the execution order of the steps may be changed, and the execution order is not limited. Moreover, in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification, and the method may include more or fewer steps than those described in this specification. In addition, a single step described in this specification may be split into multiple steps in other embodiments, and multiple steps described in this specification may be merged into a single step in other embodiments.

Based on the above technical solution, in the embodiments of the present application, the data format conversion is performed by the target processing unit. The target processing unit is usually implemented by a logic chip and has high processing performance, which saves computing resources of the data lake analysis system (such as CPU resources), improves the overall processing performance of the data lake analysis system, improves its overall efficiency of use and user experience, and accelerates data processing and computation. Hardware acceleration is used to handle the data interface to the storage cluster and to provide a data interface to the computing cluster.

Based on the same inventive concept as the above method, an embodiment of the present application further proposes another data processing method, which can be applied to a data lake analysis system (such as a cloud computing platform in the data lake analysis system). The data lake analysis system is used to provide users with serverless data processing services, and the method includes:

obtaining a data processing request, the data processing request including a first input format and a second output format; obtaining a target processing unit from multiple processing units of the data lake analysis system, where the target conversion information of the target processing unit is first conversion information, and the first conversion information is used to realize conversion from the first input format to the second output format; obtaining first data in the first input format from a data source according to the data processing request, and outputting the first data to the target processing unit, so that the target processing unit converts the first data into second data in the second output format using the first conversion information; and obtaining the second data from the target processing unit and outputting the second data; where the data source includes a cloud database provided by the data lake analysis system.

The above data source may include a cloud database provided by the data lake analysis system, and the cloud database may be used to provide serverless query and analysis services. The data lake analysis system may be a storage-oriented cloud platform focused on data storage, a computing-oriented cloud platform focused on data processing, or an integrated cloud computing platform that handles both computation and data storage. The data lake analysis system is not limited in this respect.

The cloud database provided by the data lake analysis system can be used to provide users with serverless query and analysis services. It can analyze and query massive amounts of data along any dimension, and supports high concurrency, low latency (millisecond-level response), real-time online analysis, massive data queries, and similar functions.

In one example, the data lake analysis system is specifically a data lake analysis system in which storage and computing are separated. The data lake analysis system includes a storage cluster and a computing cluster; the storage cluster includes multiple data sources that use different input formats, and the computing cluster includes multiple computing nodes that use different output formats. Further, the data lake analysis system may also include a data scanning cluster, and the data scanning cluster includes multiple processing units. The data scanning cluster may be a built-in module of the computing cluster, deployed on the same nodes as the computing resources of the computing cluster; or the data scanning cluster may be an independent module of the computing cluster, deployed on different nodes from the computing resources of the computing cluster; or the data scanning cluster may be an independent cluster separate from the computing cluster.

Based on the same inventive concept as the above method, an embodiment of the present application further proposes a data processing method, which may include: obtaining a data processing request, where the data processing request may include a first input format and a second output format; obtaining first data in the first input format from a data source according to the data processing request, and outputting the first data to a target processing unit, so that the target processing unit converts the first data into second data in the second output format; and obtaining the second data from the target processing unit and outputting the second data.

Based on the same inventive concept as the above method, an embodiment of the present application further proposes a data processing method applied to a data lake analysis system. The data lake analysis system includes multiple processing units. Each of the multiple processing units includes multiple different pieces of conversion information, and different conversion information is used to realize data conversion between different formats. The method includes:

the processing unit obtaining first data in a first input format; if the target conversion information of the processing unit is first conversion information, and the first conversion information is used to realize conversion from the first input format to a second output format, the processing unit converting the first data into second data in the second output format using the first conversion information; and the processing unit outputting the second data.

In one example, before the processing unit converts the first data into the second data in the second output format using the first conversion information, if the target conversion information of the processing unit is not the first conversion information, the processing unit modifies its target conversion information to the first conversion information.
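Seen from the processing unit's side, the behaviour above might be sketched as follows. This is a software model only; the class, its `switch_count` counter, and the toy conversions `bytes.upper` and `bytes.lower` are assumptions for illustration and say nothing about how a logic chip would implement or switch conversions.

```python
from typing import Callable, Dict, Optional, Tuple

# Assumption: conversion information is modelled as an (input, output) pair.
Conversion = Tuple[str, str]


class ProcessingUnit:
    """Holds several conversions; exactly one is the active target."""

    def __init__(self, conversions: Dict[Conversion, Callable[[bytes], bytes]]):
        self._conversions = dict(conversions)
        self.target: Optional[Conversion] = None
        self.switch_count = 0  # how often the target conversion was modified

    def process(self, first_data: bytes,
                input_format: str, output_format: str) -> bytes:
        wanted = (input_format, output_format)
        if wanted not in self._conversions:
            raise KeyError(f"unsupported conversion: {wanted}")
        if self.target != wanted:
            # Target is not the first conversion information: modify it first.
            self.target = wanted
            self.switch_count += 1
        return self._conversions[self.target](first_data)


pu = ProcessingUnit({("A1", "B1"): bytes.upper, ("A2", "B1"): bytes.lower})
second_data = pu.process(b"abc", "A1", "B1")
```

Repeated requests for the same conversion leave the target untouched, so a switch only occurs when the requested formats change.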

The above data processing method is further described below with reference to a specific application scenario.

FIG. 2 is a schematic structural diagram of a data lake analysis (Data Lake Analytics) system. The data lake analysis system may include a client, a load balancing device, front-end nodes (front nodes, also referred to as front-end servers), computing nodes (compute nodes, also referred to as computing servers), and databases. The data lake analysis system may of course also include other servers, which is not limited here.

FIG. 2 takes three front-end nodes as an example; in practical applications, the number of front-end nodes may be different, which is not limited here. FIG. 2 takes four computing nodes as an example; in practical applications, the number of computing nodes may be different, which is not limited here. Because every front-end node follows the same processing flow, and every computing node follows the same processing flow, the subsequent embodiments describe the processing flow of one front-end node and one computing node for convenience.

FIG. 2 takes five databases as an example; in practical applications, the number of databases may be different, which is not limited here. These databases are the data sources. This embodiment may address a scenario with heterogeneous data sources, that is, these databases may be of the same type or of different types. These databases may be relational databases or non-relational databases.

进一步的，对于每个数据库来说，这个数据库的类型还可以包括但不限于：OSS(Object Storage Service，对象存储服务)、TableStore(表格存储)、HBase(HadoopDatabase，Hadoop数据库)、HDFS(Hadoop Distributed File System，Hadoop分布式文件系统)、MySQL(即关系型数据库)、RDS(Relational Database Service，关系型数据库服务)、DRDS(Distribute Relational Database Service，分布式关系型数据库服务)、RDBMS(Relational Database Management System，关系数据库管理系统)、SQLServer(即关系型数据库)、PostgreSQL(即对象关系型数据库)，MongoDB(即基于分布式文件存储的数据库)等，当然，上述只是数据库类型的几个示例，对此数据库的类型不做限制。Further, for each database, the type of the database may also include but is not limited to: OSS (Object Storage Service, object storage service), TableStore (table storage), HBase (HadoopDatabase, Hadoop database), HDFS (Hadoop Distributed) File System, Hadoop distributed file system), MySQL (that is, relational database), RDS (Relational Database Service, relational database service), DRDS (Distribute Relational Database Service, distributed relational database service), RDBMS (Relational Database Management) System, relational database management system), SQLServer (i.e. relational database), PostgreSQL (i.e. object-relational database), MongoDB (i.e. database based on distributed file storage), etc. Of course, the above are just a few examples of database types. The type of this database is not limited.

The databases store data of various types, and the type of data is not limited here; for example, it may be user data, commodity data, map data, video data, image data, audio data, and the like.

The client may be an APP (application) on a terminal device (such as a PC (personal computer), a notebook computer, or a mobile terminal), or a browser on the terminal device, which is not limited here. The load balancing device balances the load of client data requests; for example, after receiving a data request, it distributes the request across the front-end nodes.

In one example, multiple front-end nodes provide the same function and form a resource pool of front-end nodes. Each front-end node in the resource pool receives the data request sent by the client, performs SQL (Structured Query Language) parsing on the data request, generates multiple execution plans from the parsing result, and processes these execution plans. For example, the front-end node may send the execution plans to one or more computing nodes, which then process them.

In one example, multiple computing nodes provide the same function and form a resource pool of computing nodes. For each computing node in the resource pool, if the computing node receives an execution plan sent by a front-end node, it may process the execution plan and return the processing result to the front-end node.

In summary, the data lake analysis system adopts an architecture that separates storage from computing. The computing nodes read data from different data sources, and these data sources are the various types of databases.

In one example, the data lake analysis system is specifically an architecture in which storage and computing are separated. That is, the data lake analysis system includes a storage cluster and a computing cluster; the storage cluster includes multiple data sources (i.e., databases) that use different input formats, and the computing cluster includes multiple computing nodes that use different output formats. On this basis, in the embodiments of the present application, the data lake analysis system may further include a data scanning cluster, which may include multiple processing units, for example, processing units implemented with FPGAs.

Referring to FIG. 3A, the data scanning cluster may serve as an independent module of the computing cluster and be deployed on different nodes from the computing resources (such as CPU resources) of the computing cluster. That is, the processing units of the data scanning cluster are deployed in the computing cluster, but on different nodes from the computing cluster's computing resources. Specifically, in a data lake analysis system with separated storage and computing, the data scanning cluster, as a module within the computing cluster, is the functional module of the computing cluster that directly faces the storage cluster.

Referring to FIG. 3B, the data scanning cluster may serve as a built-in module of the computing cluster and be deployed on the same nodes as the computing resources (such as CPU resources) of the computing cluster. That is, the processing units of the data scanning cluster are deployed in the computing cluster as built-in modules of its computing nodes, on the same node as the CPU-based operators. Computing task scheduling decides whether to use the data scanning cluster for data format conversion; if it is not used, the data format conversion is performed by a CPU software module on the computing node.

Referring to FIG. 3C, the data scanning cluster may be an independent cluster separate from the computing cluster. In a data lake analysis system with separated storage and computing, the data scanning cluster then serves as a functional module facing both the computing cluster and the storage cluster. The data scanning cluster is a fully independent cluster on the cloud that can, in the form of a service, concurrently respond to data scan requests from multiple different computing clusters on the cloud. It runs entirely independently and has its own elastic cluster management and scaling.

For convenience of description, the following embodiments take the case in which the data scanning cluster is an independent cluster as an example.

In one example, the data lake analysis system may include multiple computing clusters, each of which includes multiple computing nodes. Each computing cluster may be oriented to SQL (Structured Query Language) computing, to machine learning, or to deep learning, which is not limited here.

Specifically, referring to FIG. 3D, these computing clusters may include, but are not limited to: Presto-based computing clusters, Spark-based computing clusters, Hadoop-based computing clusters, Flink-based computing clusters, TensorFlow-based computing clusters, PyTorch-based computing clusters, and so on.

For each type of computing cluster, a data access interface adapted to that cluster is provided, so that the data output to the cluster matches the cluster's data format. A Presto-based computing cluster receives data that matches the Presto data format; a Spark-based computing cluster receives data that matches the Spark data format; a Hadoop-based computing cluster receives data that matches the Hadoop data format; a Flink-based computing cluster receives data that matches the Flink data format; a TensorFlow-based computing cluster receives data that matches the TensorFlow data format; a PyTorch-based computing cluster receives data that matches the PyTorch data format; and so on.

In one example, the data lake analysis system may include a storage cluster that includes multiple data sources. A data source may be a database, such as a cloud database. The cloud database provides users with serverless query and analysis services, can analyze and query massive data in any dimension, and supports high concurrency, low latency (millisecond-level response), real-time online analysis, massive data queries, and so on.

In one example, these data sources may include, but are not limited to: OSS-based data sources, TableStore-based data sources, HBase-based data sources, HDFS-based data sources, MySQL-based data sources, RDS-based data sources, DRDS-based data sources, RDBMS-based data sources, PostgreSQL-based data sources, and so on. These are only examples and are not limiting.

Referring to FIG. 3E, because the data sources are of different types, the data they hold is in different data formats. For example, the data formats may include, but are not limited to: the parquet, orc, text, json, kv, rcfile, avro, and arrow data formats. These are only examples; other data formats are possible and are not limited here.

In summary, because the data format of a data source differs from the data format of a computing cluster, the data format must be converted so that the computing cluster can process the data correctly. For example, if the data format of the data source is the json data format and the computing cluster is a Presto-based computing cluster, the data in the json data format must be converted into data that matches the Presto data format.

In the embodiments of the present application, this data format conversion is achieved by providing the data scanning cluster, that is, by the processing units (such as FPGAs) in the data scanning cluster.

In one example, to perform data format conversion, conversion information may be configured in a processing unit (such as an FPGA), and the processing unit may use the conversion information to convert the data format. The content of the conversion information is not limited here, as long as the processing unit can use it to convert the data format.

For example, conversion information A1 is configured in the processing unit in advance; based on conversion information A1, the processing unit can convert data in the json data format into data that matches the Presto data format.

In one example, multiple different pieces of conversion information may be configured in a processing unit (such as an FPGA), each used for a different format conversion. For example, conversion information A1, A2, A3, A4, and so on are configured in the processing unit. Based on conversion information A1, the processing unit can convert data in the json data format into data that matches the Presto data format. Based on conversion information A2, it can convert data in the json data format into data that matches the Spark data format. Based on conversion information A3, it can convert data in the text data format into data that matches the Presto data format. Based on conversion information A4, it can convert data in the text data format into data that matches the Spark data format, and so on.
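As a minimal sketch of how conversion information A1 to A4 might be indexed, the following maps each piece of conversion information to the (input format, output format) pair it serves. The names `CONVERSION_INFO` and `find_conversion` and the lowercase format labels are assumptions made for illustration; the application does not prescribe how conversion information is stored.

```python
from typing import Dict, Optional, Tuple

FormatPair = Tuple[str, str]  # (input format, output format)

# Hypothetical table mirroring the A1..A4 examples in the text.
CONVERSION_INFO: Dict[str, FormatPair] = {
    "A1": ("json", "presto"),
    "A2": ("json", "spark"),
    "A3": ("text", "presto"),
    "A4": ("text", "spark"),
}


def find_conversion(input_format: str, output_format: str) -> Optional[str]:
    """Return the name of the conversion information for a format pair, or None."""
    for name, pair in CONVERSION_INFO.items():
        if pair == (input_format, output_format):
            return name
    return None
```

For instance, `find_conversion("json", "presto")` yields `"A1"`, while a pair with no configured conversion information yields `None`.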

The above are only examples of conversion information. In practice, more conversion information may be configured in the processing unit to support conversions between various data formats. FIG. 4 is a schematic diagram of data format conversion: the first column lists the data formats supported by the data sources, and the first row lists the data formats supported by the computing clusters. A "Yes" in FIG. 4 indicates that conversion between the two data formats is supported, and a "No" indicates that it is not. On this basis, multiple pieces of conversion information may be configured in the processing unit so that it supports every conversion marked "Yes".

In summary, because multiple different pieces of conversion information are configured in the processing unit, each used for a different format conversion, the computing capability of the processing unit can be fully used and its utilization improved. For example, if the processing unit is configured only with conversion information A1, it can only convert data in the json data format into data that matches the Presto data format. When no such task exists, the processing unit sits idle and its computing capability is wasted. If the processing unit is configured with both conversion information A1 and conversion information A2, it can convert data in the json data format into data that matches either the Presto data format or the Spark data format. When there is no json-to-Presto task, the processing unit can still perform json-to-Spark conversion, which avoids idle time and improves the utilization of the processing unit.

In one example, the processing units in the data scanning cluster serve a relatively fixed purpose: accelerating the data scan tasks of different computing clusters. Referring to FIG. 5A, the data scanning cluster may include basic modules such as instruction storage, data storage, constant storage, a register group, a data storage linked list, and instruction execution. Further, referring to FIG. 5B, the data scanning cluster may also include multiple processing units (such as FPGAs), each used to convert between different data formats. In addition, the data scanning cluster may include a scheduling and management module, an input module, an output module, and so on.

In the above application scenario, FIG. 6 is a flowchart of the data processing method proposed in the embodiments of the present application. The method may be applied to the data scanning cluster of a data lake analysis system and may include the following steps:

Step 601: obtain a data processing request, such as a data scan request.

Specifically, the client may send the data processing request to the data lake analysis system through the load balancing device, so that the data scanning cluster of the data lake analysis system obtains the request. For example, the scheduling and management module of the data scanning cluster may obtain the data processing request.

Step 602: determine whether the data lake analysis system supports the data format conversion corresponding to the data processing request. If yes, perform step 603; if no, report that the data processing request is not supported.

Specifically, the data processing request may include an input data format (the format of the data in the data source, referred to below as the first input format, for example the json data format) and an output target format (the format of the data to be output, referred to below as the second output format, for example the Presto data format). It can therefore be determined whether the data lake analysis system supports conversion from the first input format to the second output format. If yes, perform step 603; if no, report that the data processing request is not supported.

For example, the scheduling and management module of the data scanning cluster may obtain the first input format and the second output format from the data processing request and query whether the data lake analysis system supports conversion from the first input format to the second output format. Specifically, assume that the data lake analysis system includes a capability registry that records all data format conversions the system supports, as shown in FIG. 4. If the capability registry does not contain the first input format and/or the second output format, the system is determined not to support the conversion. If the capability registry contains both formats and the corresponding entry is "No", the system is determined not to support the conversion. If the capability registry contains both formats and the corresponding entry is "Yes", the system is determined to support the conversion.
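The three cases of the step 602 check can be sketched as a lookup in a nested table. This is an illustrative sketch only: the yes/no entries below are made up, since the actual contents of FIG. 4 are not reproduced in the text, and the names `CAPABILITY_REGISTRY` and `supports_conversion` are hypothetical.

```python
from typing import Dict

# Hypothetical registry: rows are data source formats, columns are computing
# cluster formats. The True/False values are invented for illustration.
CAPABILITY_REGISTRY: Dict[str, Dict[str, bool]] = {
    "json": {"presto": True, "spark": True, "tensorflow": False},
    "text": {"presto": True, "spark": True, "tensorflow": False},
}


def supports_conversion(
    registry: Dict[str, Dict[str, bool]], input_format: str, output_format: str
) -> bool:
    """Return True only if both formats are registered and the entry is 'Yes'."""
    row = registry.get(input_format)
    if row is None or output_format not in row:
        # Case 1: the input format and/or the output format is not registered.
        return False
    # Case 2 ('No') and case 3 ('Yes') are decided by the table entry itself.
    return row[output_format]
```

A caller would proceed to step 603 when the function returns `True` and report an unsupported request otherwise.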

Step 603: select a target processing unit from the multiple processing units of the data lake analysis system.

Each of the multiple processing units may be either a currently idle processing unit (one that is not currently performing a data conversion operation) or a currently working processing unit (one that is currently performing a data conversion operation).

For example, the target processing unit may be selected from processing unit 1, processing unit 2, and processing unit 3. Processing unit 1 may be currently idle or currently working, processing unit 2 may be currently idle or currently working, and so on.

In one example, the data processing request may also include a service mode. If the service mode is the traffic mode, the user is billed by total data volume, and the user may therefore share processing units with other users. In this case, the candidate processing units of the data lake analysis system may be currently idle or currently working; that is, either an idle processing unit or a working processing unit may be used as the target processing unit. If the service mode is the instance mode, the user is billed by the number of processing units and uses those processing units exclusively. In this case, the candidate processing units are limited to currently idle processing units; that is, only an idle processing unit may be used as the target processing unit.
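The effect of the service mode on the candidate set can be sketched as a simple filter. The `ProcessingUnit` class, the `candidate_units` function, and the mode strings `"traffic"` and `"instance"` are assumptions chosen for illustration, not names taken from the application.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProcessingUnit:
    name: str
    busy: bool  # True if the unit is currently performing a conversion


def candidate_units(units: List[ProcessingUnit], service_mode: str) -> List[ProcessingUnit]:
    """Return the units that may be chosen as target units for a service mode."""
    if service_mode == "traffic":
        # Billed by data volume: units may be shared, so busy units qualify too.
        return list(units)
    if service_mode == "instance":
        # Billed by unit count: the user needs exclusive use, so idle units only.
        return [unit for unit in units if not unit.busy]
    raise ValueError(f"unknown service mode: {service_mode}")
```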

In one example, a processing unit may be selected arbitrarily from the multiple processing units of the data lake analysis system and used as the target processing unit. Alternatively, the target conversion information of each of the multiple processing units may be obtained, and a processing unit may be selected from them based on that target conversion information and used as the target processing unit.

For example, assume that the scheduling and management module of the data scanning cluster needs to select the target processing unit from processing unit 1, processing unit 2, and processing unit 3. It may select a processing unit at random, for example processing unit 1, and use it as the target processing unit. Alternatively, it may select a processing unit according to the target conversion information of processing unit 1, processing unit 2, and processing unit 3, for example processing unit 2, and use it as the target processing unit.

Selecting a processing unit based on the target conversion information of each processing unit may include, but is not limited to, the following. If there is a processing unit whose target conversion information is the first conversion information (the conversion information used to convert the first input format into the second output format), that processing unit may be determined as the target processing unit. Otherwise, if no processing unit has the first conversion information as its target conversion information, a processing unit may be selected at random from the multiple processing units and determined as the target processing unit.

The target conversion information is the conversion information currently enabled in a processing unit, that is, the conversion information the processing unit is currently using. For example, suppose a processing unit is configured with conversion information A1 (for converting data in the json data format into data that matches the Presto data format) and conversion information A2 (for converting data in the json data format into data that matches the Spark data format). If the target conversion information is conversion information A1, the processing unit is currently used for json-to-Presto conversion and not for json-to-Spark conversion. If the target conversion information is conversion information A2, the processing unit is currently used for json-to-Spark conversion, and so on.

Assume that the first input format is the json data format and the second output format is the Presto data format. The first conversion information is then conversion information A1, which converts the json data format into the Presto data format. If the target conversion information of processing unit 1 is conversion information A1, then the target conversion information of processing unit 1 is the first conversion information, and processing unit 1 may be determined as the target processing unit.
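The selection rule described above (prefer a unit whose enabled conversion information already matches, otherwise pick at random) can be sketched as follows. The `ProcessingUnit` fields and the optional `rng` parameter are hypothetical; the latter is added only so that the random fallback can be made reproducible.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProcessingUnit:
    name: str
    target_conversion: Optional[str]  # currently enabled conversion info, e.g. "A1"


def select_target_unit(
    units: List[ProcessingUnit],
    first_conversion: str,
    rng: Optional[random.Random] = None,
) -> ProcessingUnit:
    """Prefer a unit already enabled for first_conversion; otherwise pick at random."""
    if not units:
        raise ValueError("no processing units available")
    for unit in units:
        if unit.target_conversion == first_conversion:
            return unit
    return (rng or random.Random()).choice(units)
```

Preferring a matching unit avoids switching the enabled conversion information in step 604, which is presumably the motivation for checking the target conversion information first.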

In one example, the data processing request may also include a shard count, which indicates the number of processing units the user needs. The number of target processing units can therefore be determined from the shard count, and that number of target processing units is then selected from the multiple processing units of the data lake analysis system.

For example, if the shard count is 5, the number of processing units is 5, and 5 target processing units must be selected from the multiple processing units of the data lake analysis system. The selection method is as described in the above embodiments.
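A sketch of selecting several target units from a shard count is shown below. Two points are assumptions rather than statements from the application: units that already match are taken first and the rest are taken in list order (the text allows a random choice), and a request for more units than exist raises an error (the text does not say what happens in that case). Units are represented as hypothetical `(name, target_conversion)` tuples.

```python
from typing import List, Optional, Tuple

Unit = Tuple[str, Optional[str]]  # (unit name, currently enabled conversion info)


def select_target_units(
    units: List[Unit], first_conversion: str, shard_count: int
) -> List[Unit]:
    """Pick shard_count units, taking those already enabled for first_conversion first."""
    if shard_count < 1 or shard_count > len(units):
        raise ValueError("shard count must be between 1 and the number of units")
    matching = [unit for unit in units if unit[1] == first_conversion]
    others = [unit for unit in units if unit[1] != first_conversion]
    return (matching + others)[:shard_count]
```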

Step 604: according to the first input format and the second output format, set the target conversion information of the target processing unit (one or more target processing units, for example 5) to the first conversion information.

The first conversion information is used to convert the first input format into the second output format, that is, to convert data in the first input format into data in the second output format.

Specifically, if the target conversion information of the target processing unit is already the first conversion information, it is left unchanged. If the target conversion information of the target processing unit is second conversion information (conversion information that does not convert the first input format into the second output format), the target conversion information of the target processing unit is changed from the second conversion information to the first conversion information, according to the first input format and the second output format.

For example, assume that the first input format is the json data format and the second output format is the Presto data format. The first conversion information is then conversion information A1, which converts the json data format into the Presto data format. If the target conversion information of the target processing unit is conversion information A1, it can be left unchanged; the target conversion information is still conversion information A1. If the target conversion information of the target processing unit is conversion information A2 (for converting the json data format into the Spark data format), it can be changed to conversion information A1. The target processing unit then no longer performs json-to-Spark conversion and instead performs json-to-Presto conversion.
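Step 604 amounts to a keep-or-switch update of the enabled conversion information. The sketch below assumes a hypothetical `ProcessingUnit` with a `configured` set (the conversion information loaded on the unit) and a `target_conversion` field (the one currently enabled); the check that the requested conversion is configured is an added safeguard, not something the text spells out.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ProcessingUnit:
    name: str
    configured: Set[str] = field(default_factory=set)  # e.g. {"A1", "A2"}
    target_conversion: Optional[str] = None            # currently enabled one


def set_target_conversion(unit: ProcessingUnit, first_conversion: str) -> bool:
    """Enable first_conversion on the unit; return True if a switch was needed."""
    if first_conversion not in unit.configured:
        raise ValueError(f"{unit.name} is not configured with {first_conversion}")
    if unit.target_conversion == first_conversion:
        return False  # already enabled: keep the target conversion info unchanged
    unit.target_conversion = first_conversion  # e.g. switch from A2 to A1
    return True
```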

Steps 601 to 604 may be performed by the scheduling and management module of the data scanning cluster.

Step 605: according to the data processing request, obtain first data in the first input format from the data source (the data in the data source is referred to as the first data), and output the first data to the target processing unit.

具体的，数据处理请求可以包括数据源的信息，基于数据源的信息，可以从数据源获取第一数据，且第一数据的数据格式为第一输入格式，对此获取过程不再赘述。然后，可以将第一输入格式的第一数据输出给目标处理单元。Specifically, the data processing request may include information of the data source. Based on the information of the data source, the first data may be obtained from the data source, and the data format of the first data is the first input format, and the obtaining process will not be repeated. Then, the first data in the first input format can be output to the target processing unit.

例如，数据扫描集群的输入模块可以从数据源获取第一输入格式的第一数据，并将第一输入格式的第一数据输出给目标处理单元。For example, the input module of the data scanning cluster may acquire the first data in the first input format from the data source, and output the first data in the first input format to the target processing unit.

Step 606: the target processing unit uses the first conversion information to convert the first data into second data in the second output format (the converted data is referred to as the second data); the conversion process is not described in detail here.

Specifically, as in the above embodiment, the target conversion information of the target processing unit is the first conversion information, for example conversion information A1, which implements the conversion between the json data format and the Presto data format. Assuming that the first input format is the json data format and the second output format is the Presto data format, the first data is in the json data format, and the target processing unit can use conversion information A1 to convert the first data in the json data format into second data in the Presto data format.

Step 607: acquire the second data in the second output format from the target processing unit, and output the second data.

For example, the output module of the data scanning cluster acquires the second data in the second output format (such as the Presto data format) from the target processing unit, and outputs it to a computing node, such as a computing node in a Presto-based computing cluster. Because the data output to the computing node is already in the Presto data format, the computing node can process the second data directly.
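The data flow of steps 605 to 607 can be sketched in software as follows. This is only an illustrative model under stated assumptions: in the embodiment the conversion is performed by the target processing unit rather than by Python code, the function names are hypothetical, and a list of tuples stands in for the Presto data format, whose actual in-memory layout is not described in this section.

```python
import json


def read_first_data(source_lines):
    """Step 605 (input module): read the first data, here json records, from the data source."""
    return list(source_lines)


def convert_json_to_rows(first_data, columns):
    """Step 606 (target processing unit): stand-in for applying conversion information A1.
    Each json record becomes one row tuple ordered by `columns`; missing fields become None."""
    rows = []
    for line in first_data:
        record = json.loads(line)
        rows.append(tuple(record.get(col) for col in columns))
    return rows


def output_second_data(second_data, compute_node_buffer):
    """Step 607 (output module): hand the second data to a computing node.
    Returns the number of rows delivered."""
    compute_node_buffer.extend(second_data)
    return len(second_data)


# Usage: two json records flow through the three steps.
source = ['{"id": 1, "name": "a"}', '{"id": 2}']
buffer = []
first = read_first_data(source)
second = convert_json_to_rows(first, ["id", "name"])
delivered = output_second_data(second, buffer)
```

Keeping the three steps as separate functions mirrors the division into input module, processing unit and output module described above.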

In one example, the data processing request may further include a service mode. If the service mode is the traffic mode (that is, the shared service type), the user is billed by total data volume and may therefore share processing units with other users. In this case, the total amount of data (that is, the total amount of data read from the data source) is obtained, virtual resource information (such as cost information) is determined according to the total amount of data, and the virtual resource information is output, for example to the user. If the service mode is the instance mode (that is, the exclusive instance type), the user is billed by the number of processing units and uses those processing units exclusively. In this case, the number of target processing units is obtained, virtual resource information (such as cost information) is determined according to the number of target processing units, and the virtual resource information is output, for example to the user.
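The two service modes can be sketched as a single function. This is a minimal sketch, assuming the virtual resource information is a cost; the function name, the mode strings and both unit prices are hypothetical placeholders and do not come from the present application.

```python
def virtual_resource_info(service_mode, total_bytes=0, unit_count=0,
                          price_per_gib=0.5, price_per_unit=2.0):
    """Return cost information for a request.

    traffic mode:  billed by the total amount of data read from the data source.
    instance mode: billed by the number of target processing units used exclusively.
    The prices are illustrative placeholders only.
    """
    if service_mode == "traffic":
        return (total_bytes / 2**30) * price_per_gib
    if service_mode == "instance":
        return unit_count * price_per_unit
    raise ValueError(f"unknown service mode: {service_mode!r}")
```

With these placeholder prices, reading 2 GiB in traffic mode and using 3 processing units in instance mode give costs of 1.0 and 6.0 respectively.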

Based on the above technical solution, in the embodiments of the present application the data format conversion is performed by the target processing unit. The target processing unit is usually implemented by a logic chip with high processing performance, which saves the computing resources (such as CPU resources) of the data lake analysis system, improves its overall processing performance, and improves its overall efficiency of use and user experience. Data processing and computation are accelerated: hardware acceleration is used to handle the data connection to the storage cluster and to provide a data interface to the computing cluster. The data scanning cluster in this embodiment is more general and better suited to productization, which broadens the range of computing clusters that can be connected and accelerated and improves the productization capability of cloud products. FPGA data scanning acceleration services are provided in multiple modes, and a general FPGA data scanning engine is proposed that has built-in input and output support for multiple data formats, so that a specific FPGA data scanning acceleration core can be developed for a specific computing engine.

Based on the same application concept as the above method, an embodiment of the present application further provides a data processing apparatus. FIG. 7 is a structural diagram of the data processing apparatus, which includes:

an acquisition module 71, configured to acquire a data processing request, where the data processing request includes a first input format and a second output format; and to acquire a target processing unit, where the target conversion information of the target processing unit is first conversion information, and the first conversion information is used to implement the conversion between the first input format and the second output format;

a processing module 72, configured to acquire first data in the first input format from a data source according to the data processing request, and output the first data to the target processing unit, so that the target processing unit uses the first conversion information to convert the first data into second data in the second output format;

When acquiring the target processing unit, the acquisition module 71 is specifically configured to: acquire the target conversion information of multiple processing units of the data lake analysis system, and use the target conversion information to select a processing unit from the multiple processing units as the target processing unit.

In one example, the processing module 72 is further configured to:

if the target conversion information of the target processing unit is the first conversion information, keep the target conversion information of the target processing unit unchanged according to the first input format and the second output format; or,

if the target conversion information of the target processing unit is second conversion information, modify the target conversion information of the target processing unit to the first conversion information according to the first input format and the second output format.

Based on the same application concept as the above method, an embodiment of the present application further provides a data processing device, including a processor and a machine-readable storage medium. Several computer instructions are stored on the machine-readable storage medium, and the processor performs the following processing when executing the computer instructions:

An embodiment of the present application further provides a machine-readable storage medium on which several computer instructions are stored; when the computer instructions are executed, the following processing is performed:

FIG. 8 is a structural diagram of the data processing device proposed in an embodiment of the present application. The data processing device 80 may include a processor 81, a network interface 82, a bus 83 and a memory 84. The memory 84 may be any electronic, magnetic, optical or other physical storage device, and may contain or store information such as executable instructions and data. For example, the memory 84 may be RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (such as a hard disk drive), a solid-state drive, or any type of storage disk (such as an optical disc or DVD).

The systems, apparatuses, modules or units described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementing device is a computer, which may take the form of a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.

For convenience of description, the above apparatus is described in terms of units divided by function. Of course, when implementing the present application, the functions of the units may be implemented in the same one or more pieces of software and/or hardware.

Those skilled in the art will appreciate that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the embodiments of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) containing computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

The above descriptions are merely embodiments of the present application and are not intended to limit it. Various modifications and variations of the present application are possible for those skilled in the art. Any modification, equivalent replacement or improvement made within the spirit and principle of the present application shall fall within the scope of the claims of the present application.

Claims

1. A data processing method, characterized in that the method comprises:

acquiring a data processing request, where the data processing request includes a first input format and a second output format;

acquiring a target processing unit, where target conversion information of the target processing unit is first conversion information, and the first conversion information is used to implement the conversion between the first input format and the second output format;

acquiring first data in the first input format from a data source according to the data processing request, and outputting the first data to the target processing unit, so that the target processing unit uses the first conversion information to convert the first data into second data in the second output format;

acquiring the second data from the target processing unit, and outputting the second data.

2. The method according to claim 1, wherein

acquiring the target processing unit comprises:

acquiring target conversion information of multiple processing units of a data lake analysis system, and using the target conversion information to select a processing unit from the multiple processing units as the target processing unit.

3. The method according to claim 2, wherein using the target conversion information to select a processing unit from the multiple processing units as the target processing unit comprises:

if there is a processing unit whose target conversion information is the first conversion information, determining that processing unit as the target processing unit; or,

if there is no processing unit whose target conversion information is the first conversion information, selecting a processing unit from the multiple processing units, and determining the selected processing unit as the target processing unit.

4. The method of claim 1, wherein

after acquiring the target processing unit, the method further comprises:

if the target conversion information of the target processing unit is the first conversion information, keeping the target conversion information of the target processing unit unchanged according to the first input format and the second output format; or,

if the target conversion information of the target processing unit is second conversion information, modifying the target conversion information of the target processing unit to the first conversion information according to the first input format and the second output format.

5. The method of claim 1, wherein

acquiring the target processing unit comprises:

determining whether a data lake analysis system supports the conversion between the first input format and the second output format;

if so, acquiring the target processing unit from multiple processing units of the data lake analysis system.

6. The method of claim 1, wherein

the data processing request further includes a number of shards, and acquiring the target processing unit comprises:

determining a number of target processing units according to the number of shards;

acquiring that number of target processing units.

7. The method of claim 1, wherein the method further comprises:

the data processing request further includes a service mode; if the service mode is a traffic mode, obtaining the total amount of data, determining virtual resource information according to the total amount of data, and outputting the virtual resource information;

if the service mode is an instance mode, obtaining the number of target processing units, determining virtual resource information according to the number of target processing units, and outputting the virtual resource information.

8. A data processing method, characterized in that it is applied to a data lake analysis system, the data lake analysis system being used to provide users with serverless data processing services, the method comprising:

acquiring a target processing unit from multiple processing units of the data lake analysis system, wherein target conversion information of the target processing unit is first conversion information, and the first conversion information is used to implement the conversion between the first input format and the second output format;

acquiring the second data from the target processing unit, and outputting the second data;

wherein the data source includes a cloud database provided by the data lake analysis system.

9. The method of claim 8, wherein:

the data lake analysis system is specifically a data lake analysis system in which storage and computing are separated; the data lake analysis system includes a storage cluster and a computing cluster, the storage cluster includes multiple data sources using different input formats, and the computing cluster includes multiple computing nodes using different output formats;

the data lake analysis system further includes a data scanning cluster, and the data scanning cluster includes multiple processing units; the data scanning cluster, as a built-in module of the computing cluster, is deployed on the same node as the computing resources of the computing cluster; or, the data scanning cluster, as an independent module of the computing cluster, is deployed on a different node from the computing resources of the computing cluster; or, the data scanning cluster is an independent cluster different from the computing cluster.

10. A data processing method, wherein the method comprises:

acquiring first data in a first input format from a data source according to the data processing request;

outputting the first data in the first input format to a target processing unit, so that the target processing unit converts the first data into second data in a second output format;

11. A data processing method, characterized in that it is applied to a data lake analysis system, wherein, for a processing unit among multiple processing units of the data lake analysis system, the processing unit includes multiple different pieces of conversion information, and different conversion information is used to implement data conversion between different formats, the method comprising:

the processing unit acquiring first data in a first input format;

if the target conversion information of the processing unit is first conversion information, and the first conversion information is used to implement the conversion between the first input format and a second output format, using the first conversion information to convert the first data into second data in the second output format;

the processing unit outputting the second data.

12. The method according to claim 11, wherein before converting the first data into the second data in the second output format by using the first conversion information, the method further comprises:

if the target conversion information of the processing unit is not the first conversion information, the processing unit modifying the target conversion information of the processing unit to the first conversion information.

13. A data processing apparatus, wherein the apparatus comprises:

an acquisition module, configured to acquire a data processing request, where the data processing request includes a first input format and a second output format; and to acquire a target processing unit, where the target conversion information of the target processing unit is first conversion information, and the first conversion information is used to implement the conversion between the first input format and the second output format;

a processing module, configured to acquire first data in the first input format from a data source according to the data processing request, and output the first data to the target processing unit, so that the target processing unit uses the first conversion information to convert the first data into second data in the second output format;

14. The apparatus of claim 13, wherein

when acquiring the target processing unit, the acquisition module is specifically configured to:

15. The apparatus according to claim 13, wherein the processing module is further configured to:

16. A data processing device, comprising:

A processor and a machine-readable storage medium, where several computer instructions are stored on the machine-readable storage medium, and the processor performs the following processing when executing the computer instructions: