CN112799799A - Data consumption method and device

Info

Publication number: CN112799799A
Application number: CN202011592829.7A
Authority: CN
Inventors: 张普; 李翔远
Original assignee: Hangzhou Tuya Information Technology Co Ltd
Current assignee: Hangzhou Tuya Information Technology Co Ltd
Priority date: 2020-12-29
Filing date: 2020-12-29
Publication date: 2021-05-14

Abstract

The application discloses a data consumption method and device. The data consumption method comprises the following steps: a computing engine acquires a plurality of subtasks; a main task election is initiated to a coordination service module based on the plurality of subtasks, so that the coordination service module elects a main task from the plurality of subtasks; and the computing engine acquires, from a message queue, the messages specified for each subtask by an allocation scheme formulated by the main task, so that each subtask performs operation processing on its specified messages. The method and the device can solve the problem that a subtask pulls data from an environment area different from the area where the subtask is located, which causes cross-region traffic consumption.

Description

Data consumption method and device

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a data consumption method and apparatus.

Background

As technology develops, the amount of data keeps increasing, and "big data" technology has penetrated all lines of business. Many devices now collect large amounts of data, which must be processed in a timely manner to mine its value. For example, data generated by smartphones, sensors, Internet of Things devices, social networks, and online transaction systems needs to be collected continuously and analyzed in real time to enable a quick response. Today, a task is divided into a series of short, small subtasks that are processed in a distributed manner by a computing engine such as Spark, Storm, or Flink, so as to improve the capability of real-time data analysis and processing.

However, when the computing engine extracts data from the message queue, it determines by default which data the current subtask should pull using a fixed hash-remainder (modulo) rule. This may cause the current subtask to pull data from an environment area different from the area where the subtask is located, resulting in cross-region traffic consumption.
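
To make the problem concrete, the following minimal sketch shows how a fixed hash-remainder rule can hand a partition to a subtask in a different region. The region labels, the partition list, and the helpers `default_assign` and `cross_region_pulls` are hypothetical and only illustrate the default behavior described above; they are not taken from any particular computing engine.

```python
from typing import Dict, List


def default_assign(partition_regions: List[str], subtask_count: int) -> Dict[int, List[int]]:
    """Assign partition i to subtask i % n, ignoring regions entirely (assumed default rule)."""
    plan: Dict[int, List[int]] = {s: [] for s in range(subtask_count)}
    for p in range(len(partition_regions)):
        plan[p % subtask_count].append(p)
    return plan


def cross_region_pulls(plan: Dict[int, List[int]],
                       partition_regions: List[str],
                       subtask_regions: List[str]) -> int:
    """Count partitions whose region differs from the region of the assigned subtask."""
    return sum(
        1
        for s, parts in plan.items()
        for p in parts
        if partition_regions[p] != subtask_regions[s]
    )


partition_regions = ["A", "A", "B", "B"]  # regions of partitions 0..3 (hypothetical)
subtask_regions = ["A", "B"]              # regions of subtasks 0..1 (hypothetical)

plan = default_assign(partition_regions, len(subtask_regions))
crossings = cross_region_pulls(plan, partition_regions, subtask_regions)
print(plan, crossings)
```

In this toy layout, half of the partitions end up being pulled across regions even though a fully local assignment exists.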

Disclosure of Invention

The application provides a data consumption method and device, which can solve the problem of cross-region traffic consumption caused by a subtask pulling data from an environment area different from the area where the subtask is located.

To solve the above problem, the present application provides a data consumption method, including:

the computing engine acquires a plurality of subtasks;

initiating a main task election to the coordination service module based on the plurality of subtasks so that the coordination service module elects the main task from the plurality of subtasks;

and the computing engine acquires, from the message queue, the messages specified for each subtask by the allocation scheme formulated by the main task, so that each subtask performs operation processing on its specified messages.

Wherein, initiating a main task election to the coordination service module based on the plurality of subtasks comprises:

informing the coordination service module of the characteristics and the area of each subtask;

the method further comprises: the main task determines the messages to be consumed by each subtask based on the characteristics of the plurality of subtasks and the areas where they are located, both acquired from the coordination service module, and on the distribution of messages in the message queue, so as to obtain the allocation scheme.

Wherein the area where each subtask is located is the same as the area of the messages specified for it.

Wherein, initiating a main task election to the coordination service module based on the plurality of subtasks comprises:

sending a registration request for each subtask to the coordination service module to inform it of the task to which each subtask belongs, so that the coordination service module elects the main task from all the subtasks once it confirms that the registration requests of all the subtasks of the task have been acquired.

To solve the above problem, the present application further provides a data consumption method, including:

in response to a main task election initiated by the computing engine to the coordination service module based on the plurality of subtasks, the coordination service module elects the main task from the plurality of subtasks;

interacting with the computing engine so that the main task formulates an allocation scheme, wherein the allocation scheme specifies, for each of the subtasks, the messages in the message queue.

Wherein, responding to a main task election initiated by the computing engine to the coordination service module based on the plurality of subtasks comprises:

acquiring a registration request of each subtask from a computing engine to acquire information of a task to which each subtask belongs;

and when the coordination service module confirms that the registration requests of all the subtasks of the task are acquired, selecting the main task from all the subtasks of the task.

Wherein, the method also comprises:

the allocation scheme is obtained from the computing engine, so that each subtask on the computing engine can monitor the allocation scheme on the coordination service module and acquire from the message queue the messages that the allocation scheme specifies for it.

To solve the above problem, the present application provides a data processing system, including:

a computing engine for obtaining a plurality of subtasks; initiating a main task election to the coordination service module based on the plurality of subtasks; acquiring a message specified by the allocation scheme for each subtask from the message queue based on the allocation scheme formulated by the main task, and enabling each subtask to perform operation processing on the specified message;

a coordination service module for electing the main task from the plurality of subtasks in response to the main task election initiated by the computing engine based on the plurality of subtasks, and for interacting with the computing engine so that the main task formulates the allocation scheme.

To solve the above problem, the present application provides an electronic device, which includes a processor; the processor is used for executing instructions to realize the method.

To solve the above problems, the present application provides a computer-readable storage medium for storing instructions/program data that can be executed to implement the above-described method.

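According to the method, a computing engine acquires a plurality of subtasks and then initiates a main task election to a coordination service module based on the plurality of subtasks, so that the coordination service module elects a main task from the plurality of subtasks. Based on the allocation scheme formulated by the main task, the computing engine can then acquire from the message queue the data specified for each subtask, which reduces the amount of data pulled across regions and therefore the cross-region traffic consumption.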

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a schematic flow chart diagram of an embodiment of a data consumption method of the present application;

FIG. 2 is a schematic flow chart diagram illustrating another embodiment of a data consumption method of the present application;

FIG. 3 is a block diagram of an embodiment of a data processing system according to the present application;

FIG. 4 is a schematic structural diagram of an embodiment of an electronic device of the present application;

FIG. 5 is a schematic structural diagram of an embodiment of a computer-readable storage medium according to the present application.

Detailed Description

The description and drawings illustrate the principles of the application. Those skilled in the art will thus be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the application and are included within its scope. Moreover, all examples described herein are principally intended expressly for pedagogical purposes, to aid the reader in understanding the principles of the application and the concepts contributed by the inventors to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Additionally, the term "or" as used herein refers to a non-exclusive "or" (i.e., "and/or") unless otherwise indicated (e.g., "or else" or "or in the alternative"). Moreover, the various embodiments described herein are not necessarily mutually exclusive, as some embodiments may be combined with one or more other embodiments to form new embodiments.

Referring to FIG. 1, FIG. 1 is a schematic flow chart of a first embodiment of the data consumption method of the present application. It should be noted that the following step numbers are used only to simplify the description and are not intended to limit the execution order of the steps; the execution order of the steps in the present embodiment may be changed arbitrarily without departing from the technical idea of the present application.

S101: the computing engine acquires a plurality of subtasks.

The computing engine acquires the plurality of subtasks so that, in the subsequent steps, it can initiate a main task election to the coordination service module based on them and the coordination service module can elect the main task from among them. Based on the allocation scheme formulated by the main task, the computing engine can then acquire from the message queue the data specified for each subtask. In this way the data to be consumed by each subtask is confirmed with the help of the coordination service module, and the allocation scheme formulated by the main task reduces the amount of data pulled across regions and therefore the traffic consumed by cross-region pulling.

Optionally, before step S101, a scheduling module in the data processing system may divide a task into a plurality of short, small subtasks and then allocate the subtasks to a plurality of computing units in the computing engine, so that each computing unit that has acquired a subtask performs the computation for that subtask.

S102: initiating a main task election to the coordination service module based on the plurality of subtasks, so that the coordination service module elects the main task from the plurality of subtasks.

After acquiring the plurality of subtasks, the computing engine can initiate a main task election to the coordination service module based on the plurality of subtasks, so that the coordination service module elects the main task from the plurality of subtasks.

Optionally, each computing unit in the computing engine that has acquired a subtask may register the subtask with the coordination service module to notify it that the subtask needs an allocation scheme. When the coordination service module confirms that the registered subtasks satisfy a preset condition, it treats this as having received the main task election request initiated by the computing engine based on the plurality of subtasks, and elects one of the registered subtasks as the main task.

When registering a subtask with the coordination service module, each computing unit that has acquired a subtask may send information about the task to which the subtask belongs (for example, the name of the task that was divided into the subtasks and the names of all the subtasks into which it was divided) together with information about the computing unit itself. Based on the information of the registered subtasks and of the task to which each subtask belongs, the coordination service module can confirm whether registrations have been received for all subtasks of a task; if so, it confirms that the registered subtasks satisfy the preset condition and elects one of the subtasks of that task as the main task.

Of course, in other implementations, the coordination service module may instead confirm that the registered subtasks satisfy the preset condition when the number of registered subtasks received exceeds a number threshold, and then elect one of the registered subtasks as the main task.
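
The two preset conditions described above can be sketched as follows. This is a minimal illustration that uses an in-memory `TaskRegistry` class as a stand-in for the coordination service module; the class, its method names, and the task name `job` are assumptions made for the example, not part of the application.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class TaskRegistry:
    """In-memory stand-in for the coordination service module (hypothetical)."""
    expected: Dict[str, Set[str]] = field(default_factory=dict)     # task -> names of all its subtasks
    registered: Dict[str, List[str]] = field(default_factory=dict)  # task -> names in arrival order

    def register(self, task: str, subtask: str, all_subtasks: List[str]) -> None:
        """Record one registration request together with the task information it carries."""
        self.expected.setdefault(task, set(all_subtasks))
        names = self.registered.setdefault(task, [])
        if subtask not in names:
            names.append(subtask)

    def all_registered(self, task: str) -> bool:
        """First condition: every subtask of the task has registered."""
        return task in self.expected and set(self.registered[task]) == self.expected[task]

    def threshold_reached(self, task: str, threshold: int) -> bool:
        """Second condition: the number of registrations exceeds a number threshold."""
        return len(self.registered.get(task, [])) > threshold


registry = TaskRegistry()
registry.register("job", "1-A", ["1-A", "2-B"])
ready_after_first = registry.all_registered("job")
registry.register("job", "2-B", ["1-A", "2-B"])
ready_after_second = registry.all_registered("job")
print(ready_after_first, ready_after_second)
```

Only when one of the conditions holds would the stand-in proceed to elect a main task.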

The coordination service module may elect one subtask from the plurality of subtasks by any of the following methods, without being limited thereto.

For example, the coordination service module may take the subtask that completes registration first as the main task.

As another example, the coordination service module may arbitrarily select one of the plurality of subtasks as the main task.

As yet another example, the coordination service module may determine, based on the subtask information and the computing unit information, which subtask's computing unit has the most available computing resources, and take that subtask as the main task. Having the computing unit with the most available computing resources determine the allocation scheme can improve the efficiency of formulating it.
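
The three election strategies listed above can be sketched as small selection functions. The `Registration` record, its `order` and `free_resources` fields, and the function names are hypothetical; in particular, how available computing resources are measured is an assumption of this sketch.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Registration:
    subtask: str
    order: int             # sequence number in which registration completed
    free_resources: float  # hypothetical resource score reported by the computing unit


def elect_first_registered(regs: List[Registration]) -> str:
    """Take the subtask that completed registration first."""
    return min(regs, key=lambda r: r.order).subtask


def elect_arbitrary(regs: List[Registration], rng: Optional[random.Random] = None) -> str:
    """Pick any registered subtask; here the arbitrary choice is made at random."""
    return (rng or random).choice(regs).subtask


def elect_most_resources(regs: List[Registration]) -> str:
    """Take the subtask whose computing unit reports the most available resources."""
    return max(regs, key=lambda r: r.free_resources).subtask


regs = [Registration("1-A", 0, 2.0), Registration("2-B", 1, 8.0)]
print(elect_first_registered(regs), elect_most_resources(regs), elect_arbitrary(regs))
```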

S103: the computing engine acquires, from the message queue, the data specified for each subtask by the allocation scheme formulated by the main task, so that each subtask performs operation processing on its specified data.

After the computing engine has initiated the main task election to the coordination service module based on the plurality of subtasks and the coordination service module has elected the main task from the plurality of subtasks, the computing engine can acquire from the message queue the data specified for each subtask by the allocation scheme formulated by the main task, so that each subtask can perform operation processing on the data specified for it in the allocation scheme.

Before step S103, that is, after the main task is elected, the computing unit on which the main task runs may determine the data that each subtask needs to pull, so as to obtain the allocation scheme. The allocation scheme may record the address of the data that each subtask needs to pull; for example, the data that task A needs to pull is shard M of queue B in data center A.
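
One possible way to record such addresses is sketched below. The `DataAddress` record and its field names (`data_center`, `queue`, `shard`) are assumptions that mirror the three parts of the address example above (a room or data center, a queue, and a fragment or shard); the JSON encoding is likewise only one plausible serialization, not a format given in the application.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DataAddress:
    data_center: str  # room or data center that hosts the queue (assumed field)
    queue: str        # queue name (assumed field)
    shard: str        # fragment or shard within the queue (assumed field)


# Allocation scheme: subtask name -> addresses of the data it should pull.
scheme: Dict[str, List[DataAddress]] = {"A": [DataAddress("A", "B", "M")]}


def encode(s: Dict[str, List[DataAddress]]) -> str:
    """Serialize the scheme so that it could be stored and shared."""
    return json.dumps({t: [asdict(a) for a in addrs] for t, addrs in s.items()}, sort_keys=True)


def decode(raw: str) -> Dict[str, List[DataAddress]]:
    """Rebuild the scheme from its serialized form."""
    return {t: [DataAddress(**a) for a in addrs] for t, addrs in json.loads(raw).items()}


restored = decode(encode(scheme))
print(restored)
```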

Further, in step S102, when registering a subtask with the coordination service module, each computing unit that has acquired a subtask may send the subtask's characteristics (for example, the message type the subtask operates on, the area where the subtask is located, and the like), so that the coordination service module obtains the characteristics of every registered subtask. In addition, the coordination service module may store the distribution of messages in the message queue (for example, the message type and message quantity stored by each topic in the message queue, the area where the topic is located, and the like). After confirming the main task, the coordination service module can share the characteristics of the plurality of subtasks and the message distribution with the main task, so that the main task can formulate a reasonable allocation scheme based on them. Preferably, the main task takes cross-region consumption into account when formulating the allocation scheme: based on the area where each subtask is located and the area where each topic is located, it makes the area of each subtask the same as the area of its specified messages wherever possible. Where this cannot be achieved, it reduces as far as possible the number of subtasks that must consume across regions, or the total amount of data consumed across regions, so as to minimize cross-region traffic consumption. In addition, the main task may consider balanced consumption when formulating the allocation scheme, so that the data in the message queue is consumed as evenly as possible.
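
A minimal sketch of such a region-aware, load-balancing allocation is given below. It is a simple greedy heuristic chosen for illustration, not the allocation procedure of the application, which leaves the exact strategy open. The `Partition` and `Subtask` records, the function names, and the sample data are hypothetical, and matching on message type is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Partition:
    topic: str
    index: int
    region: str
    messages: int  # message quantity held by this partition


@dataclass(frozen=True)
class Subtask:
    name: str
    region: str


def make_allocation(subtasks: List[Subtask], partitions: List[Partition]) -> Dict[str, List[Partition]]:
    """Greedy heuristic: prefer a same-region subtask, then the least-loaded one."""
    plan: Dict[str, List[Partition]] = {s.name: [] for s in subtasks}
    load: Dict[str, int] = {s.name: 0 for s in subtasks}
    # Place the largest partitions first so that balancing acts on the big items.
    for p in sorted(partitions, key=lambda part: part.messages, reverse=True):
        local = [s for s in subtasks if s.region == p.region]
        candidates = local or subtasks  # cross-region only when no local subtask exists
        chosen = min(candidates, key=lambda s: load[s.name])
        plan[chosen.name].append(p)
        load[chosen.name] += p.messages
    return plan


def cross_region_messages(plan: Dict[str, List[Partition]], subtasks: List[Subtask]) -> int:
    """Total number of messages that would be consumed across regions."""
    region = {s.name: s.region for s in subtasks}
    return sum(p.messages for name, parts in plan.items() for p in parts if p.region != region[name])


subtasks = [Subtask("1-A", "A"), Subtask("2-B", "B")]
partitions = [
    Partition("t", 0, "A", 100),
    Partition("t", 1, "A", 50),
    Partition("t", 2, "B", 80),
    Partition("t", 3, "C", 10),  # no subtask in region C, so this one must cross regions
]
plan = make_allocation(subtasks, partitions)
crossing = cross_region_messages(plan, subtasks)
print({name: [p.index for p in parts] for name, parts in plan.items()}, crossing)
```

In this sample, only the partition with no local subtask is consumed across regions, and it goes to the subtask with the lighter load.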

After the main task formulates the allocation scheme, it may transmit the scheme to the coordination service module, so that the coordination service module informs the remaining subtasks of the data specified for them. Specifically, the main task may write the allocation scheme to a specified path of the coordination service module, and the computing units of the plurality of subtasks monitor that path to determine whether the main task has written the allocation scheme there. Once the scheme is confirmed to have been written, each subtask can read from the specified path the information about the data it needs to pull (namely, the data information that the allocation scheme specifies for that subtask), and its computing unit can then read the required data from the message queue based on that information. In other embodiments, the main task need not monitor whether the allocation scheme has been written to the coordination service module; it may directly pull from the message queue the data that its own allocation scheme specifies for it.
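
The write-then-monitor interaction can be sketched with an in-memory store in place of a real coordination service. The `CoordinationStore` class, the path `/jobs/demo/allocation`, and the partition names are hypothetical; a real deployment would rely on the watch or notification mechanism of whatever coordination service is used.

```python
import json
import threading
from typing import Dict, List, Optional


class CoordinationStore:
    """In-memory stand-in for paths on a coordination service (hypothetical)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._cond = threading.Condition()

    def write(self, path: str, value: str) -> None:
        """Store a value at the path and wake up every waiting subtask."""
        with self._cond:
            self._data[path] = value
            self._cond.notify_all()

    def wait_for(self, path: str, timeout: float = 5.0) -> Optional[str]:
        """Block until the path has been written; return None on timeout."""
        with self._cond:
            self._cond.wait_for(lambda: path in self._data, timeout=timeout)
            return self._data.get(path)


PLAN_PATH = "/jobs/demo/allocation"  # hypothetical specified path
store = CoordinationStore()
results: Dict[str, List[str]] = {}


def run_subtask(name: str) -> None:
    """Monitor the specified path, then keep only this subtask's own entry."""
    raw = store.wait_for(PLAN_PATH)
    if raw is not None:
        results[name] = json.loads(raw)[name]


workers = [threading.Thread(target=run_subtask, args=(n,)) for n in ("1-A", "2-B")]
for w in workers:
    w.start()

# The main task publishes the allocation scheme to the specified path.
store.write(PLAN_PATH, json.dumps({"1-A": ["topic-0"], "2-B": ["topic-1"]}))

for w in workers:
    w.join()
print(results)
```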

In this embodiment, the computing engine acquires a plurality of subtasks and then initiates a main task election to the coordination service module based on the plurality of subtasks, so that the coordination service module elects the main task from the plurality of subtasks. Based on the allocation scheme formulated by the main task, the computing engine can acquire from the message queue the data specified for each subtask. The data to be consumed by each subtask is thus confirmed with the help of the coordination service module, and the allocation scheme formulated by the main task reduces the amount of data pulled across regions and therefore the traffic consumed by cross-region pulling.

Referring to FIG. 2, the data consumption method of the present embodiment may include the following steps. It should be noted that the following step numbers are used only to simplify the description and are not intended to limit the execution order of the steps; the execution order of the steps in the present embodiment may be changed arbitrarily without departing from the technical idea of the present application.

S201: in response to a main task election initiated by the computing engine to the coordination service module based on the plurality of subtasks, the coordination service module elects the main task from the plurality of subtasks.

Wherein step S201 may include: acquiring a registration request for each subtask from the computing engine to obtain information about the task to which each subtask belongs; and, when the coordination service module confirms that the registration requests of all the subtasks of the task have been acquired, electing the main task from all the subtasks of the task.

S202: interacting with the computing engine so that the main task formulates an allocation scheme.

Wherein the allocation scheme specifies, for each subtask, the messages in the message queue.

After step S202, the coordination service module may further obtain the allocation scheme from the computing engine, so that each subtask on the computing engine can monitor the allocation scheme on the coordination service module and acquire from the message queue the messages that the allocation scheme specifies for it.

To better explain the data consumption method of the present application, the following data consumption example is provided for illustration:

Example 1

As shown in FIG. 3, subtask (Task) 1-A and subtask 2-B register with the coordination service module before initialization, that is, before deciding which partitions of the topic to pull, so as to initiate a main task election request;

the coordination service module responds to the main task election request and elects subtask 1-A from subtask 1-A and subtask 2-B as the main task;

the main task formulates an allocation scheme that determines the data to be consumed by subtasks 1-A and 2-B;

the main task writes the allocation scheme to the path specified by the coordination service module;

subtasks 1-A and 2-B, acting as clients of the coordination service module, detect that the allocation scheme has been completed and then each acquire from the specified path the topic-partition information that they need to consume;

finally, subtask 1-A and subtask 2-B pull data from the message queue according to the consumption data specified for them.
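
The sequence of Example 1 can be traced end to end with a small single-process simulation. Everything below is a simplified stand-in: the region labels, the partition names, the `/allocation` path, and the first-to-register election rule are assumptions chosen to keep the sketch short.

```python
from typing import Dict, List

partitions: Dict[str, str] = {"topic-0": "A", "topic-1": "B"}  # partition -> region (assumed)
subtasks: Dict[str, str] = {"1-A": "A", "2-B": "B"}            # subtask -> region (assumed)

registered: List[str] = []                 # registrations held by the stand-in coordinator
paths: Dict[str, Dict[str, List[str]]] = {}  # paths held by the stand-in coordinator

# 1. Both subtasks register before deciding which partitions to pull.
for name in subtasks:
    registered.append(name)

# 2. The coordinator elects a main task (here: the first subtask to register).
main_task = registered[0]

# 3. The main task formulates the allocation scheme, preferring same-region partitions
#    and taking any partition without a local subtask itself.
scheme: Dict[str, List[str]] = {name: [] for name in subtasks}
for part, region in partitions.items():
    owner = next((n for n, r in subtasks.items() if r == region), main_task)
    scheme[owner].append(part)

# 4. The main task writes the scheme to the specified path.
paths["/allocation"] = scheme

# 5. Each subtask reads its own topic-partition entry from the specified path,
#    which is what it would then pull from the message queue.
pulled = {name: paths["/allocation"][name] for name in subtasks}
print(main_task, pulled)
```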

Further, the data consumption method in the above embodiments may be applied to the data processing system 10. The data processing system 10 may include a computing engine 12, a coordination service module 13, and a message queue 11.

The message queue 11 may be used to store various types of business data centrally as a data source for the computing engine 12. The business data may include raw data, computed results, snapshot data, baseline data, alarm data, and the like. The message queue 11 of the present application may be a Hive message queue, a Kafka message queue, a RocketMQ message queue, or the like, and is not limited herein.

The computing engine 12 may perform distributed processing on the acquired plurality of subtasks. The computing engine 12 may be used to acquire a plurality of subtasks; initiate a main task election to the coordination service module 13 based on the plurality of subtasks; and, based on the allocation scheme formulated by the main task, acquire from the message queue 11 the messages specified for each subtask, so that each subtask performs operation processing on its specified messages. When processing a subtask, the computing engine 12 may pull data from the message queue 11 and compute on it to obtain the calculation result of the subtask. The computing engine 12 of the present application may be Flink, Spark, Storm, S4, Samza, or the like, and is not limited thereto. It is understood that the computing engine 12 may be composed of multiple computing units, which may each process a subtask concurrently, and that these computing units may be distributed in different environment areas.

In the present application, the coordination service module 13 may be configured to elect the main task from the plurality of subtasks in response to a main task election initiated by the computing engine 12 based on the plurality of subtasks, and to interact with the computing engine 12 so that the main task formulates the allocation scheme. The coordination service module 13 may also provide coordination services for the message queue 11 and the computing engine 12, including configuration service, distributed synchronization, and node monitoring. In addition, the coordination service module 13 may store related plug-in programs and the Schema configuration files of the business database object set. The coordination service module 13 of the present application may be any external storage, such as ZooKeeper or Redis, and is not limited herein.

In addition, the distributed data processing system 10 of the present application may further include a visualization control module and a data cache cluster.

The visualization control module is used for displaying and managing data through a web interface.

The data cache cluster may be an auxiliary in-memory storage cluster for the computing engine 12, used to reduce the memory overhead of the computing engine 12 during large-batch computation.

In addition, referring to FIG. 4, FIG. 4 is a schematic structural diagram of an embodiment of the electronic device 20 according to the present application. The electronic device 20 of the present application includes a processor 22, and the processor 22 is configured to execute instructions to implement the method provided by any of the embodiments of the data consumption method of the present application and any non-conflicting combination thereof.

The electronic device 20 may be a terminal such as a mobile phone or a notebook computer, a server, or an Internet of Things device, such as a refrigerator or an air conditioner, that forms a local area network with a foot-worn wearable device.

The processor 22 may also be referred to as a CPU (Central Processing Unit). The processor 22 may be an integrated circuit chip having signal processing capabilities. The processor 22 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general purpose processor may be a microprocessor, or the processor 22 may be any conventional processor or the like.

The electronic device 20 may further include a memory 21 for storing instructions and data required for operation of the processor 22.

Referring to FIG. 5, FIG. 5 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application. The computer-readable storage medium 30 of the embodiments of the present application stores instructions/program data 31 which, when executed, implement the method provided by any of the above method embodiments of the present application and any non-conflicting combination thereof. The instructions/program data 31 may form a program file stored in the storage medium 30 in the form of a software product, so as to enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium 30 includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code, or a device such as a computer, a server, a mobile phone, or a tablet.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative: the division into units is merely a logical division, and an actual implementation may use another division; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in an electrical, mechanical, or other form.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

The above embodiments are merely examples and are not intended to limit the scope of the present disclosure. All equivalent structures or equivalent process transformations made using the contents of the specification and drawings of the present disclosure, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of the present disclosure.

Claims

1. A method of data consumption, the method comprising:

the computing engine acquires a plurality of subtasks;

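initiating a main task election to a coordination service module based on the plurality of subtasks so that the coordination service module elects the main task from the plurality of subtasks;

and the computing engine acquires, from a message queue, the messages specified for each subtask by an allocation scheme formulated by the main task, so that each subtask performs operation processing on its specified messages.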

2. The method of claim 1, wherein initiating a main task election to a coordination service module based on the plurality of subtasks comprises:

informing the coordination service module of the characteristics and the area of each subtask;

the method further comprises: the main task determines the messages to be consumed by each subtask based on the characteristics and the areas of the plurality of subtasks obtained from the coordination service module and on the distribution of messages in the message queue, so as to obtain the allocation scheme.

3. The method of claim 2, wherein all subtasks are located in the same area as the messages specified for them.

4. The method of claim 1, wherein initiating a main task election to a coordination service module based on the plurality of subtasks comprises:

sending a registration request for each subtask to the coordination service module to inform it of the task to which each subtask belongs, so that the coordination service module elects the main task from all the subtasks when it confirms that the registration requests of all the subtasks of the task have been acquired.

5. A method of data consumption, the method comprising:

in response to a main task election initiated by a computing engine to a coordination service module based on a plurality of subtasks, the coordination service module elects a main task from the plurality of subtasks;

interacting with the computing engine so that the main task formulates an allocation scheme, wherein the allocation scheme specifies, for each subtask, messages in a message queue.

6. The method of claim 5, wherein the responding to a main task election initiated by the computing engine to a coordination service module based on the plurality of subtasks comprises:

acquiring a registration request of each subtask from the computing engine to acquire information of a task to which each subtask belongs;

and when the coordination service module confirms that the registration requests of all the subtasks of the task are acquired, the main task is selected from all the subtasks of the task.

7. The method of claim 5, further comprising:

acquiring the allocation scheme from the computing engine, so that each subtask on the computing engine can monitor the allocation scheme on the coordination service module and acquire from the message queue the messages that the allocation scheme specifies for it.

8. A data processing system, characterized in that the system comprises:

a computing engine for obtaining a plurality of subtasks; initiating a main task election to a coordination service module based on the plurality of subtasks; acquiring a message specified by the allocation scheme for each subtask from a message queue based on the allocation scheme formulated by the main task, and enabling each subtask to perform operation processing on the specified message;

the coordination service module for electing a main task from the plurality of subtasks in response to the main task election initiated by the computing engine based on the plurality of subtasks, and for interacting with the computing engine so that the main task formulates the allocation scheme.

9. An electronic device, characterized in that the electronic device comprises a processor; the processor is configured to execute instructions to implement the method of any one of claims 1-7.

10. A computer-readable storage medium for storing instructions/program data executable to implement the method of any one of claims 1-7.