CN115495220A - Data processing system, method and device and electronic equipment - Google Patents


Info

Publication number
CN115495220A
CN115495220A
Authority
CN
China
Prior art keywords
data
processing unit
main task
task processing
distributed node
Prior art date
Legal status
Pending
Application number
CN202211200945.9A
Other languages
Chinese (zh)
Inventor
卞嘉骏
唐成山
陈军
Current Assignee
China Construction Bank Corp
CCB Finetech Co Ltd
Original Assignee
China Construction Bank Corp
CCB Finetech Co Ltd
Priority date
Filing date
Publication date
Application filed by China Construction Bank Corp and CCB Finetech Co Ltd
Priority claimed from application CN202211200945.9A
Publication of CN115495220A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/48: Program initiating; program switching, e.g. by interrupt
    • G06F 9/4806: Task transfer initiation or dispatching
    • G06F 9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083: Techniques for rebalancing the load in a distributed system

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

The application relates to the technical field of data processing, and in particular to a data processing system, method, and apparatus, and an electronic device, which address the waste of cluster resources among the high-frequency processing units of the various branch offices. The data processing method comprises the following steps: acquiring the to-be-processed data corresponding to each distributed node; determining a mapping relationship between the distributed nodes and the main task processing units so that the difference between the to-be-processed data volumes corresponding to the main task processing units falls within a preset range; distributing the to-be-processed data to the main task processing units according to the mapping relationship; and processing all the to-be-processed data corresponding to each main task processing unit using that main task processing unit and its corresponding sub-task processing unit. In this way, the data processing load is balanced across the main task processing units, avoiding waste of their cluster resources.

Description

Data processing system, method and device and electronic equipment
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data processing system, method, and apparatus, and an electronic device.
Background
The global IT market is undergoing enormous change as new technologies such as cloud computing, big data, the Internet of Things, and artificial intelligence develop rapidly. As the core equipment behind these technologies, the mainframe host has long been favored by organizations for its high availability and high throughput. In practice, a large organization typically registers new branch offices in different regions, and as branch business grows, the workload on the head office's host equipment gradually increases and faces ever more serious challenges.
To address this, major organizations are actively exploring data processing schemes that move workloads off the mainframe, providing a batch processing framework suited to a distributed architecture to support the original host's batch processing function. A high-frequency scheduling device is usually used in the framework's parallel computing. For example, if head office A registers four branch offices A1, A2, A3, and A4, the high-frequency scheduling device is divided into four high-frequency processing units accordingly, and while the batch processing framework runs, the batch tasks of head office A are completed by the high-frequency processing units corresponding to branches A1, A2, A3, and A4, each unit processing the service data of its own region.
However, because the service data of each area is unevenly distributed, grows at different rates, and reaches its host operation peak at different times for different branch offices, under the existing data processing scheme the high-frequency processing units of some branch offices may be unable to keep up with their data volume while those of other branch offices sit idle, causing cluster resource waste among the branch offices' high-frequency processing units.
Disclosure of Invention
The embodiments of the present application provide a data processing system, method, and apparatus, and an electronic device, which are used to solve the problem of cluster resource waste among the high-frequency processing units of the various branch offices.
In a first aspect, the present application provides a data processing system, where the system includes a main task access unit, a mapping unit, N main task processing units, and a sub task processing unit corresponding to each main task processing unit, where the main task access unit and the N main task processing units are respectively connected to the mapping unit, and each main task processing unit is connected to a corresponding sub task processing unit, where N is an integer greater than or equal to 1;
the main task access unit is used for acquiring the data to be processed of each current distributed node and sending all the data to be processed to the mapping unit; the mapping unit is used for determining the main task processing units corresponding to the distributed nodes so as to balance the data volumes to be processed corresponding to the main task processing units; and each main task processing unit and each corresponding sub task processing unit are used for processing the data to be processed of each distributed node determined by the mapping unit.
With this system, the to-be-processed data volumes of the distributed nodes corresponding to each main task processing unit are balanced by the mapping unit, so that the cluster resources of each main task processing unit are fully utilized and cluster resource waste corresponding to each distributed node is avoided.
In one possible design, the system further includes N message queue units, where the N message queue units are connected with the N main task processing units and the N subtask processing units in a one-to-one correspondence, a plurality of data channels are established between each message queue unit and the subtask processing units corresponding to each other, the plurality of data channels at least include M dedicated channels and 1 common channel, and a value of M is greater than or equal to the number of distributed nodes served by the message queue unit; each dedicated channel is used for each distributed node to independently process data; the common channel may be used by multiple distributed nodes simultaneously.
Through this system, each distributed node is guaranteed a dedicated channel for its exclusive use, and when the processing capacity of its dedicated channels is insufficient, the common channel can assist in completing the data processing, so that the data channels are fully utilized and cluster resource waste is avoided.
In one possible design, the message queue unit is a Kafka messaging system.
In one possible design, the subtask processing unit includes a preset number of servers, where each server establishes a connection relationship with a preset data channel of the multiple data channels in a polling manner.
Through the system, each distributed node is provided with a special processing channel, the sharing and utilization of cluster resources are realized, and the condition that the server is idle is avoided.
In a second aspect, the present application provides a data processing method, based on any one of the data processing systems, the method including:
acquiring data to be processed corresponding to each distributed node;
determining a mapping relation between each distributed node and each main task processing unit so as to enable a difference value between the data volumes to be processed corresponding to each main task processing unit to be in a preset range;
distributing each data to be processed to each main task processing unit according to the mapping relation;
and processing all the to-be-processed data corresponding to each main task processing unit according to each main task processing unit and the sub-task processing unit corresponding to each main task processing unit.
By the method, the data processing amount corresponding to each main task processing unit can be balanced, so that cluster resource waste of the main task processing units is avoided.
In one possible design, the determining a mapping relationship between each distributed node and each main task processing unit includes:
obtaining region information of each distributed node and historical data information of each distributed node, wherein the historical data information at least comprises data volume and data processing peak time period;
and determining the mapping relation between each distributed node and each main task processing unit according to each region information, each historical data information and the number of the main task processing units.
By the method, historical data information and region information of each distributed node are fully considered, so that the determined mapping relation can fully ensure that the to-be-processed data amount corresponding to each main task processing unit is balanced.
In a possible design, the processing all the to-be-processed data respectively corresponding to each main task processing unit according to each main task processing unit and the sub task processing unit respectively corresponding to each main task processing unit includes:
determining a plurality of data channels between the main task processing unit and the sub task processing unit which correspond to each other; the plurality of data channels at least comprise M dedicated channels and 1 public channel, and the value of M is greater than or equal to the number of distributed nodes served by the plurality of data channels;
determining the dedicated channels corresponding to each distributed node according to the numbers assigned to the data channels, and detecting whether the data processing capacity of the dedicated channels corresponding to the same distributed node is greater than or equal to that node's to-be-processed data volume;
if so, transmitting the data to be processed corresponding to the same distributed node in the main task processing unit to the sub-task processing units corresponding to each other for processing based on each special channel corresponding to the same distributed node;
if not, transmitting the data to be processed corresponding to the same distributed node in the main task processing unit to the sub task processing units corresponding to each other for processing based on each dedicated channel corresponding to the same distributed node and the common data channel.
By this method, the data channels occupied by each distributed node are dynamically adjusted and handled in a differentiated manner, so that the data channels are fully utilized and cluster resource waste is avoided. Meanwhile, each distributed node can have both dedicated channels and a common channel: the dedicated channels ensure the independence of data transmission, while the common channel further guarantees transmission capacity and prevents task blocking.
In a possible design, the transmitting the data to be processed corresponding to the same distributed node in the main task processing unit to the sub task processing units corresponding to each other for processing includes:
determining each server corresponding to the same distributed node in a subtask processing unit corresponding to the same distributed node according to the number of each data channel corresponding to the same distributed node, wherein the number corresponding to each server is determined in a polling mode;
and instructing each server to process the data to be processed corresponding to the same distributed node.
By the method, each distributed node is provided with a special data channel, the sharing and utilization of cluster resources are realized, and the condition that the server is idle is avoided.
In a third aspect, the present application provides a data processing apparatus, comprising:
the acquisition module is used for acquiring the data to be processed corresponding to each distributed node;
the determining module is used for determining the mapping relation between each distributed node and each main task processing unit so that the difference between the to-be-processed data volumes corresponding to the main task processing units falls within a preset range;
the distribution module is used for distributing each data to be processed to each main task processing unit according to the mapping relation;
and the processing module is used for processing all the to-be-processed data respectively corresponding to each main task processing unit according to each main task processing unit and the subtask processing unit respectively corresponding to each main task processing unit.
In one possible design, the determining module is specifically configured to:
obtaining region information of each distributed node and historical data information of each distributed node, wherein the historical data information at least comprises data volume and data processing peak time period;
and determining the mapping relation between each distributed node and each main task processing unit according to each region information, each historical data information and the number of the main task processing units.
In one possible design, the processing module is specifically configured to:
determining a plurality of data channels between the main task processing unit and the sub task processing unit corresponding to each other; the plurality of data channels comprise at least M dedicated channels and 1 common channel, and the value of M is greater than or equal to the number of distributed nodes served by the plurality of data channels;
determining the dedicated channels corresponding to each distributed node according to the numbers assigned to the data channels, and detecting whether the data processing capacity of the dedicated channels corresponding to the same distributed node is greater than or equal to that node's to-be-processed data volume;
if yes, transmitting the data to be processed corresponding to the same distributed node in the main task processing unit to the corresponding subtask processing units for processing based on each special channel corresponding to the same distributed node;
if not, transmitting the data to be processed corresponding to the same distributed node in the main task processing unit to the sub task processing units corresponding to each other for processing based on each dedicated channel corresponding to the same distributed node and the common data channel.
In one possible design, the processing module is further to:
determining each server corresponding to the same distributed node in a subtask processing unit corresponding to the same distributed node according to the number of each data channel corresponding to the same distributed node, wherein the number corresponding to each server is determined in a polling mode;
and instructing each server to process the data to be processed corresponding to the same distributed node.
In a fourth aspect, the present application provides an electronic device, comprising:
a memory for storing program instructions;
a processor for calling the program instructions stored in the memory and executing the steps included in the data processing method according to any one of the second aspect according to the obtained program instructions.
In a fifth aspect, the present application provides a computer-readable storage medium storing a computer program comprising program instructions that, when executed by a computer, cause the computer to perform the data processing method of any one of the second aspects.
In a sixth aspect, the present application provides a computer program product comprising: computer program code which, when run on a computer, causes the computer to perform the data processing method of any one of the second aspects.
Any one of the data processing system, method, apparatus, electronic device, storage medium, and computer program product provided by the present application can ensure that the to-be-processed data volumes corresponding to the main task processing units are balanced. Further, the data channels occupied by each distributed node can be dynamically adjusted and handled in a differentiated manner, so that the data channels are fully utilized and cluster resource waste is avoided. In addition, each distributed node can have both dedicated channels and a common channel, so that the dedicated channels ensure the independence of data transmission while the common channel further guarantees transmission capacity and prevents task blocking.
Drawings
Fig. 1 is a schematic diagram of a framework cluster deployment structure of a high-frequency scheduling apparatus according to an embodiment of the present application;
FIG. 2 is a block diagram of a data processing system according to an embodiment of the present application;
FIG. 3 is a block diagram of another data processing system architecture according to an embodiment of the present application;
FIG. 4 is a block diagram of another data processing system architecture according to an embodiment of the present application;
fig. 5 is a flowchart of a data processing method according to an embodiment of the present application;
FIG. 6 is a block diagram of a data processing architecture according to an embodiment of the present application;
FIG. 7 is a schematic diagram of another data processing architecture according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a sub-task processing unit according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 10 is a structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. In the present application, the embodiments and features of the embodiments may be arbitrarily combined with each other without conflict. Also, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different than here.
The terms "first" and "second" in the description and claims of the present application and in the above drawings are used to distinguish different objects, not to describe a particular order. Furthermore, the term "comprises" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to those steps or elements, but may include other steps or elements not expressly listed or inherent to such a process, method, article, or apparatus. "Plurality" in the present application means at least two, for example two, three, or more, and the embodiments of the present application are not limited in this respect.
In the technical scheme, the data acquisition, transmission, use and the like all meet the requirements of relevant national laws and regulations.
Before describing the data processing system provided by the embodiments of the present application, for ease of understanding, the following detailed description of the technical background of the embodiments of the present application will be provided.
In the field of data processing, major organizations are actively exploring data processing schemes that move workloads off the mainframe, providing a batch processing framework suited to a distributed architecture to support the original host's batch processing function. A high-frequency scheduling device is generally used in the framework's parallel computing. Referring to fig. 1, the framework cluster deployment of an existing high-frequency scheduling device includes four high-frequency main task processing units A1, A2, A3, and A4, which respectively correspond to branch offices in different areas, and each high-frequency main task processing unit also corresponds to one high-frequency sub task processing unit. While the batch processing framework runs, each high-frequency main task processing unit and its corresponding sub task processing unit process the service data of their own branch office.
However, because the service data of each area is unevenly distributed, grows at different rates, and reaches its host operation peak at different times for different branch offices, under the existing data processing scheme the high-frequency processing units of some branch offices may be unable to keep up with their data volume while those of other branch offices sit idle, causing cluster resource waste among the branch offices' high-frequency main task processing units and high-frequency sub task processing units.
To solve the above problem, embodiments of the present application provide a data processing system, method, and apparatus, and an electronic device, in which the to-be-processed data volumes of the distributed nodes corresponding to each main task processing unit are balanced by a mapping unit, so that the cluster resources of each main task processing unit are fully utilized and cluster resource waste corresponding to each distributed node is avoided.
Based on the above technical effects, the preferred embodiments of the present application are described below with reference to the drawings of the specification, it should be understood that the preferred embodiments described herein are only for illustrating and explaining the present application, and are not used to limit the present application, and the features in the embodiments and examples of the present application may be combined with each other without conflict.
As shown in fig. 2, a data processing system provided for the embodiment of the present application includes a main task access unit 21, a mapping unit 22, N main task processing units 23, and a sub task processing unit 24 corresponding to each main task processing unit 23, where the main task access unit 21 and the N main task processing units 23 are respectively connected to the mapping unit 22, each main task processing unit 23 is connected to a corresponding sub task processing unit 24, and N is an integer greater than or equal to 1;
the main task access unit 21 is configured to obtain to-be-processed data of each current distributed node, and send all to-be-processed data to the mapping unit 22; after receiving the to-be-processed data corresponding to each distributed node, the mapping unit 22 further determines the main task processing unit 23 corresponding to each distributed node, so that the to-be-processed data amount corresponding to each main task processing unit 23 is balanced. For example, in the case of only considering the data amount, the data amounts to be processed corresponding to the 3 distributed nodes B1, B2, and B3 are B1, B2, and B3, respectively, and B1+ B2 ≈ B3, and in the case of only two main task processing units C1 and C2, it may be determined that the distributed nodes B1 and B2 correspond to the main task processing unit C1 at the same time, and the distributed node B3 corresponds to the main task processing unit C2.
Each main task processing unit 23 and each corresponding sub task processing unit 24 are configured to process the to-be-processed data of each distributed node determined by the mapping unit 22.
With this system, the to-be-processed data volumes of the distributed nodes corresponding to each main task processing unit are balanced by the mapping unit, so that the cluster resources of each main task processing unit are fully utilized and cluster resource waste corresponding to each distributed node is avoided.
In a possible application scenario, as shown in fig. 3, the data processing system further includes N message queue units 31, where the N message queue units 31 are connected with the N main task processing units 23 and the N subtask processing units 24 in a one-to-one correspondence manner, a plurality of data channels are established between each message queue unit 31 and the subtask processing units 24 corresponding to each other, the plurality of data channels at least include M dedicated channels and 1 common channel, and a value of M is greater than or equal to the number of distributed nodes served by the message queue unit 31; each dedicated channel is used for each distributed node to independently process data; the common channel may be used by multiple distributed nodes simultaneously.
Optionally, the message queue unit 31 is a Kafka message system.
In a possible application scenario, as shown in fig. 4, any subtask processing unit 24 in the data processing system includes a preset number of servers 41, where each server 41 establishes a connection relationship with a preset data channel of the multiple data channels in a polling manner.
For example, in fig. 4, each subtask processing unit has 5 servers and there are M + 1 data channels. When M is 24, the data channels may be allocated to the servers in a polling manner: channels numbered 1, 6, 11, 16, and 21 are connected to server 1; channels 2, 7, 12, 17, and 22 to server 2; channels 3, 8, 13, 18, and 23 to server 3; channels 4, 9, 14, 19, and 24 to server 4; and channels 5, 10, 15, 20, and 25 to server 5.
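The polling assignment just described can be sketched as follows. The function name and the returned dictionary layout are assumptions for illustration; only the round-robin rule itself comes from the text.

```python
def assign_channels(num_channels: int, num_servers: int) -> dict:
    """Return {server number: [channel numbers]} assigned round-robin."""
    assignment = {s: [] for s in range(1, num_servers + 1)}
    for channel in range(1, num_channels + 1):
        # Poll servers 1..num_servers in turn for each successive channel.
        server = (channel - 1) % num_servers + 1
        assignment[server].append(channel)
    return assignment

# 25 channels (M = 24 dedicated plus 1 common) over 5 servers, as in fig. 4.
table = assign_channels(25, 5)
```

This yields channels 1, 6, 11, 16, 21 on server 1 through channels 5, 10, 15, 20, 25 on server 5, matching the example above.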
Through the system, each distributed node is provided with a special processing channel, the sharing and utilization of cluster resources are realized, and the condition that the server is idle is avoided.
Based on any one of the data processing systems, as shown in fig. 5, an embodiment of the present application provides a data processing method, where an execution flow of the method includes the following steps:
s51, acquiring to-be-processed data corresponding to each distributed node;
s52, determining the mapping relation between each distributed node and each main task processing unit so as to enable the difference value between the data to be processed corresponding to each main task processing unit to be in a preset range;
s53, distributing each data to be processed to each main task processing unit according to the mapping relation;
and S54, processing all the to-be-processed data respectively corresponding to each main task processing unit according to each main task processing unit and the subtask processing unit respectively corresponding to each main task processing unit.
Specifically, when the to-be-processed data corresponding to each distributed node is obtained through the main task access unit, processing it directly node by node may, because of differences in data distribution, leave the data channels of some distributed nodes idle while the data channels of other distributed nodes are congested.
To avoid this, in the embodiment of the present application, when the main task access unit obtains the to-be-processed data corresponding to each distributed node, a mapping relationship between the distributed nodes and the main task processing units is first determined, as follows: obtain the region information of each distributed node and the historical data information of each distributed node, the historical data information comprising at least the data volume and the data processing peak time period; then determine the mapping relationship between the distributed nodes and the main task processing units according to the region information, the historical data information, and the number of main task processing units, so that the difference between the data volumes to be processed by the main task processing units falls within a preset range, achieving a relative balance of data processing load among the main task processing units.
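One plausible way to fold the historical peak time period into the mapping, sketched under the assumption of a simple two-key ranking (fewest peak-hour collisions first, then lowest accumulated volume), is shown below. The patent does not fix a concrete scoring rule; everything beyond "use volume and peak period" is an illustrative assumption.

```python
from typing import Dict, List, Tuple

def map_nodes(history: Dict[str, Tuple[int, int]], units: List[str]) -> Dict[str, str]:
    """history: node -> (data volume, peak hour). Returns node -> unit."""
    load = {u: 0 for u in units}    # accumulated volume per unit
    peaks = {u: [] for u in units}  # peak hours already placed on each unit
    mapping = {}
    for node, (volume, peak) in sorted(history.items(), key=lambda kv: -kv[1][0]):
        # Prefer the unit with the fewest nodes sharing this peak hour,
        # breaking ties toward the least-loaded unit, so peaks are staggered.
        target = min(units, key=lambda u: (peaks[u].count(peak), load[u]))
        mapping[node] = target
        load[target] += volume
        peaks[target].append(peak)
    return mapping

# Two nodes peaking at hour 9 and one at hour 14, two main task processing units.
mapping = map_nodes({"B1": (50, 9), "B2": (50, 9), "B3": (60, 14)}, ["C1", "C2"])
```

With these assumed inputs, the two nodes that peak at the same hour end up on different units, which is the kind of staggering the peak-period information enables.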
Alternatively, the mapping may be determined by table lookup: a binding relationship between each distributed node and a main task processing unit is established in advance, and when the to-be-processed data of any distributed node is obtained, the corresponding main task processing unit can be found by querying the table.
After the mapping relationship between each distributed node and each main task processing unit is determined, the to-be-processed data of each distributed node is allocated to its corresponding main task processing unit according to the mapping relationship, so that each main task processing unit, together with its corresponding subtask processing unit, processes all the to-be-processed data allocated to it. The specific processing method is as follows:
First, the plurality of data channels between a mutually corresponding main task processing unit and subtask processing unit are determined, where the plurality of data channels include at least M dedicated channels and 1 common channel, and the value of M is greater than or equal to the number of distributed nodes served by these channels, ensuring that every distributed node has at least 1 independent dedicated channel. The number of dedicated channels for each distributed node may further be determined from its historical data volume range; for example, 1 additional dedicated channel may be allocated whenever the data volume exceeds a given threshold.
Then, the dedicated channels corresponding to each distributed node are determined from the channel numbers. In this embodiment of the application, every data channel between the main task processing unit and the subtask processing unit is assigned a number, and the dedicated channels of each distributed node can be looked up by number. After the dedicated channels are found, it is detected whether the combined data processing capacity of the dedicated channels of a given distributed node is greater than or equal to its amount of to-be-processed data. For example, if the dedicated channels of distributed node P are determined to be data channel 1 and data channel 2, the combined processing capacity of data channels 1 and 2 is calculated first and then compared against the amount of to-be-processed data corresponding to node P.
If the processing capacity is greater than or equal to the amount of to-be-processed data, the to-be-processed data of that distributed node in the main task processing unit is transmitted to the corresponding subtask processing unit over the node's dedicated channels for processing. For example, if the combined capacity of data channels 1 and 2 exceeds the amount of to-be-processed data of node P, the dedicated channels of node P meet the requirement; no other data channel needs to carry the transmission, no task congestion occurs, and the independence of node P is preserved.
If the processing capacity is less than the amount of to-be-processed data, the to-be-processed data of that distributed node in the main task processing unit is transmitted to the corresponding subtask processing unit over the node's dedicated channels together with the common channel. For example, if the combined capacity of data channels 1 and 2 is less than the amount of to-be-processed data of node P, the dedicated channels alone cannot meet the requirement; the common channel is then used to carry the overflow, which avoids task congestion and makes full use of every data channel.
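The capacity check and channel-selection rule described above can be condensed into a small sketch. The function name and the channel/capacity figures below are illustrative assumptions: a node's dedicated channels are used alone when their combined capacity covers the pending data volume, and the common channel is appended as overflow otherwise.

```python
# Illustrative sketch of the dedicated-vs-common channel selection rule.
def select_channels(dedicated, common, capacity, pending):
    """dedicated: channel numbers reserved for the node; common: the
    shared channel number; capacity: {channel: processable data volume};
    pending: the node's amount of to-be-processed data."""
    total = sum(capacity[c] for c in dedicated)
    if total >= pending:
        return list(dedicated)            # dedicated capacity suffices
    return list(dedicated) + [common]     # spill over to the common channel

capacity = {1: 50, 2: 50, 0: 100}         # channel 0 plays the common role
print(select_channels([1, 2], 0, capacity, 80))   # [1, 2]
print(select_channels([1, 2], 0, capacity, 130))  # [1, 2, 0]
```

The second call mirrors the node-P example: 130 units of pending data exceed the 100 units the two dedicated channels can carry, so the common channel joins the transmission.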
In this way, the data channels occupied by each distributed node are adjusted dynamically and handled differentially, so that the channels are fully utilized and cluster resources are not wasted. At the same time, each data node has both dedicated and common channels: the dedicated channels guarantee the independence of its data transmission, while the common channel provides additional transmission capacity and prevents task congestion.
In a possible application scenario, the transmitting the to-be-processed data corresponding to the same distributed node in the main task processing unit to the sub task processing units corresponding to each other for processing includes:
determining, according to the number of each data channel corresponding to the same distributed node, each server in the corresponding subtask processing unit that serves that node, where the number corresponding to each server is determined in a polling (round-robin) manner; and then instructing each of those servers to process the to-be-processed data of that distributed node.
In this way, each server polls and subscribes to its related data channels: for example, server 1 in fig. 4 subscribes to and consumes the channels numbered 1, 6, 11, 16, and 21, server 2 subscribes to channels 2, 7, 12, 17, and 22, and so on. Every distributed node thus has dedicated processing channels, cluster resources are shared and utilized, and no server sits idle.
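The round-robin subscription pattern above reduces to a simple stride rule: with S servers and channels numbered 1..C, server k consumes channels k, k+S, k+2S, and so on. A minimal illustrative helper (not the actual subscription code) reproduces the numbers in the example:

```python
# Sketch of round-robin channel subscription: server k out of
# num_servers consumes every num_servers-th channel starting at k.
def round_robin_channels(server_id, num_servers, num_channels):
    return [c for c in range(1, num_channels + 1)
            if (c - 1) % num_servers == server_id - 1]

print(round_robin_channels(1, 5, 25))  # [1, 6, 11, 16, 21]
print(round_robin_channels(2, 5, 25))  # [2, 7, 12, 17, 22]
```

The same rule with 13 servers and 65 channels yields the east-region assignment described later (server 1 consumes channels 1, 14, 27, 40, 53).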
To explain the data processing method in more detail, it is described below with reference to a specific application scenario.
In this application scenario, according to the data distribution and the database storage space, branch institutions in 38 different regions are grouped by region and data volume into 4 units (east, south, west, and north), each unit handling the data of roughly 10 institutions, so that the data is distributed approximately evenly. A "branch institution to region unit" configuration table is then maintained for the 4 units, so that each branch institution can find its corresponding region unit for data processing. With this configuration, when the to-be-processed data corresponding to each branch institution is acquired, it is handled according to the flow shown in fig. 6.
During operation, the branch institutions within the 4 units (east, south, west, and north) run independently, and a branch task with a large data volume does not block a branch task with a small data volume. To this end, the application scenario uses Kafka's partition mechanism: the topic of each region unit's subtask is divided into multiple numbered partition channels, for example 65 data channels per region unit, which are consumed by the subtask processing unit of that region's cluster. The specific architecture is shown in fig. 7.
Next, a configuration table is set up for each branch institution, where the primary key of the table is the branch number and the value field lists the branch's data channels. During actual operation, the data channels occupied by each branch institution can be adjusted dynamically according to its data volume and resource utilization, so that branches with large data volumes occupy more channels and branches with small data volumes occupy fewer; at the same time, at least one dedicated channel is allocated to every branch institution, so that each branch has its own dedicated channel, and the remaining data channels serve as common channels.
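One way to build such a configuration table is to give each branch a one-channel floor and distribute the remaining dedicated channels in proportion to data volume. The sketch below is an assumption about how the allocation might look, not the patented implementation; it presumes the number of dedicated channels is at least the number of branches, matching the constraint M >= node count stated earlier.

```python
# Illustrative configuration-table builder: primary key = branch number,
# value = that branch's dedicated channel list. Requires
# num_dedicated >= number of branches (each branch gets >= 1 channel).
def build_channel_table(branch_volumes, num_dedicated):
    branches = sorted(branch_volumes, key=branch_volumes.get, reverse=True)
    extras = num_dedicated - len(branches)      # channels beyond the 1-each floor
    total = sum(branch_volumes.values())
    # Floor of one channel per branch, extras allocated by volume share.
    counts = {b: 1 + (extras * branch_volumes[b]) // total for b in branches}
    # Hand any rounding remainder to the largest branches first.
    leftover = num_dedicated - sum(counts.values())
    for b in branches[:leftover]:
        counts[b] += 1
    table, nxt = {}, 1
    for b in branches:
        table[b] = list(range(nxt, nxt + counts[b]))
        nxt += counts[b]
    return table

table = build_channel_table({"A": 300, "B": 100, "C": 100}, 5)
print(table)  # {'A': [1, 2, 3], 'B': [4], 'C': [5]}
```

Rebalancing at runtime would amount to recomputing this table from fresh volume statistics and swapping it in, which is the dynamic adjustment the text describes.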
After the channel relationships are established, taking the east region as an example (see fig. 8), 13 servers are configured for the subtask processing unit, and they subscribe to the Kafka partition channels in a polling manner; the polling strategy uses the Kafka configuration file to realize round-robin subscription across the servers.
After the polling subscription, the 13 servers each subscribe to their related channels: for example, east server 1 subscribes to and consumes data channels 1, 14, 27, 40, and 53, server 2 consumes channels 2, 15, 28, 41, and 54, and so on. Every branch institution thus has dedicated processing channels, cluster resources are shared and utilized, and no server sits idle.
The technical scheme can realize the following technical effects:
(1) A cluster resource sharing mechanism is realized through Kafka messaging and its partition channel feature.
(2) Each branch-dimension task executes over a dedicated channel within the cluster, avoiding task congestion.
(3) Through the channel maintenance configuration, the dedicated and common channels of each branch institution can be adjusted dynamically according to actual running conditions, accelerating task processing and improving resource utilization.
Based on the same inventive concept, an embodiment of the present application provides a data processing apparatus, please refer to fig. 9, the apparatus includes:
the acquiring module 91 is configured to acquire to-be-processed data corresponding to each distributed node;
a determining module 92, configured to determine a mapping relationship between each distributed node and each main task processing unit, so that a difference between to-be-processed data amounts corresponding to each main task processing unit is within a preset range;
the allocating module 93 is configured to allocate each to-be-processed data to each main task processing unit according to the mapping relationship;
and a processing module 94, configured to process all to-be-processed data corresponding to each main task processing unit according to each main task processing unit and the sub task processing unit corresponding to each main task processing unit.
In one possible design, the determining module 92 is specifically configured to:
obtaining region information of each distributed node and historical data information of each distributed node, wherein the historical data information at least comprises data volume and data processing peak time period;
and determining the mapping relation between each distributed node and each main task processing unit according to each region information, each historical data information and the number of the main task processing units.
In one possible design, the processing module 94 is specifically configured to:
determining a plurality of data channels between the main task processing unit and the sub task processing unit which correspond to each other; the plurality of data channels at least comprise M dedicated channels and 1 common channel, and the value of M is greater than or equal to the number of distributed nodes served by the plurality of data channels;
determining each special channel corresponding to each distributed node according to the number corresponding to any one of the data channels, and detecting whether the data processing amount of each special channel corresponding to the same distributed node is greater than or equal to the data amount to be processed;
if so, transmitting the data to be processed corresponding to the same distributed node in the main task processing unit to the sub-task processing units corresponding to each other for processing based on each special channel corresponding to the same distributed node;
if not, transmitting the data to be processed corresponding to the same distributed node in the main task processing unit to the sub task processing units corresponding to each other for processing based on each dedicated channel corresponding to the same distributed node and the common data channel.
In one possible design, the processing module 94 is further configured to:
determining each server corresponding to the same distributed node in a subtask processing unit corresponding to the same distributed node according to the number of each data channel corresponding to the same distributed node, wherein the number corresponding to each server is determined in a polling mode;
and instructing each server to process the data to be processed corresponding to the same distributed node.
With this apparatus, the amounts of to-be-processed data corresponding to the main task processing units can be kept balanced. The data channels occupied by each distributed node can further be adjusted dynamically and handled differentially, so that the channels are fully utilized and cluster resources are not wasted. In addition, each data node has both dedicated and common channels: the dedicated channels guarantee independent data transmission, while the common channel provides additional transmission capacity and prevents task congestion.
Based on the same inventive concept, an embodiment of the present application further provides an electronic device, where the electronic device may implement the functions of the foregoing data processing method and apparatus, and with reference to fig. 10, the electronic device includes:
at least one processor 101, and a memory 102 connected to the at least one processor 101. The specific connection medium between the processor 101 and the memory 102 is not limited in this embodiment of the application; fig. 10 takes the connection through a bus 100 as an example. The bus 100 is shown in fig. 10 with a thick line, and the connection manner between the other components is merely illustrative and not limiting. The bus 100 may be divided into an address bus, a data bus, a control bus, and so on; it is drawn with only one thick line in fig. 10 for ease of illustration, which does not mean there is only one bus or one type of bus. Alternatively, the processor 101 may also be referred to as a controller; the name is not limiting.
In the embodiment of the present application, the memory 102 stores instructions executable by the at least one processor 101, and the at least one processor 101 can execute the data processing method discussed above by executing the instructions stored in the memory 102. The processor 101 may implement the functions of the various modules in the apparatus shown in fig. 9.
The processor 101 is the control center of the device: it connects the various parts of the whole device through various interfaces and lines, and performs the device's functions and processes its data by running or executing the instructions stored in the memory 102 and calling the data stored in the memory 102, thereby monitoring the device as a whole.
In one possible design, processor 101 may include one or more processing units, and processor 101 may integrate an application processor, which primarily handles operating systems, user interfaces, application programs, and the like, and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 101. In some embodiments, the processor 101 and the memory 102 may be implemented on the same chip, or in some embodiments, they may be implemented separately on separate chips.
The processor 101 may be a general-purpose processor, such as a Central Processing Unit (CPU), a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like, that implements or performs the methods, steps, and logic blocks disclosed in embodiments of the present application. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the data processing method disclosed in the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor.
Memory 102, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 102 may include at least one type of storage medium, for example a flash memory, a hard disk, a multimedia card, a card-type memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a magnetic memory, a magnetic disk, an optical disc, and the like. The memory 102 may also be, without limitation, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 102 in the embodiments of the present application may further be circuitry or any other device capable of performing a storage function, for storing program instructions and/or data.
By programming the processor 101, the code corresponding to the data processing method described in the foregoing embodiment may be solidified in the chip, so that the chip can execute the steps of the data processing method of the embodiment shown in fig. 5 when running. How to program the processor 101 is well known to those skilled in the art and will not be described in detail herein.
Based on the same inventive concept, an embodiment of the present application provides a computer-readable storage medium storing computer program code which, when run on a computer, causes the computer to perform any of the data processing methods discussed above. Since the principle by which the computer-readable storage medium solves the problem is similar to that of the data processing method, its implementation can refer to the implementation of the method, and repeated details are not repeated.
Based on the same inventive concept, the embodiment of the present application further provides a computer program product, which includes: computer program code which, when run on a computer, causes the computer to perform any of the data processing methods as discussed in the foregoing. Because the principle of solving the problems of the computer program product is similar to that of the data processing method, the implementation of the computer program product can refer to the implementation of the method, and repeated descriptions are omitted.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (15)

1. A data processing system is characterized by comprising a main task access unit, a mapping unit, N main task processing units and a subtask processing unit corresponding to each main task processing unit, wherein the main task access unit and the N main task processing units are respectively connected with the mapping unit, each main task processing unit is connected with the corresponding subtask processing unit, and N is an integer greater than or equal to 1;
the main task access unit is used for acquiring the data to be processed of each current distributed node and sending all the data to be processed to the mapping unit; the mapping unit is used for determining the main task processing units corresponding to the distributed nodes so as to balance the data volumes to be processed corresponding to the main task processing units; and each main task processing unit and each corresponding sub task processing unit are used for processing the data to be processed of each distributed node determined by the mapping unit.
2. The data processing system of claim 1, wherein the system further comprises N message queue units, wherein the N message queue units are connected with the N main task processing units and the N subtask processing units in a one-to-one correspondence, a plurality of data channels are established between each message queue unit and the corresponding subtask processing unit, the plurality of data channels at least include M dedicated channels and 1 common channel, and a value of M is greater than or equal to the number of distributed nodes served by the message queue unit; each dedicated channel is used for each distributed node to independently process data; the common channel may be used by multiple distributed nodes simultaneously.
3. The data processing system of claim 2, wherein the message queue unit is a Kafka messaging system.
4. The data processing system of claim 2, wherein the subtask processing unit includes a predetermined number of servers, and wherein each server establishes a connection with a predetermined data channel of the plurality of data channels in a round-robin manner.
5. A data processing method applied to the data processing system according to any one of claims 1 to 4, wherein the method comprises:
acquiring data to be processed corresponding to each distributed node;
determining a mapping relation between each distributed node and each main task processing unit so as to enable a difference value between the data volumes to be processed corresponding to each main task processing unit to be in a preset range;
distributing each data to be processed to each main task processing unit according to the mapping relation;
and processing all the data to be processed corresponding to each main task processing unit according to each main task processing unit and the sub-task processing unit corresponding to each main task processing unit.
6. The method of claim 5, wherein determining a mapping relationship between each distributed node and each primary task processing unit comprises:
obtaining region information of each distributed node and historical data information of each distributed node, wherein the historical data information at least comprises data volume and data processing peak time period;
and determining the mapping relation between each distributed node and each main task processing unit according to each region information, each historical data information and the number of the main task processing units.
7. The method according to claim 5, wherein the processing all the to-be-processed data respectively corresponding to each main task processing unit according to each main task processing unit and the sub task processing unit respectively corresponding to each main task processing unit comprises:
determining a plurality of data channels between the main task processing unit and the sub task processing unit which correspond to each other; the plurality of data channels at least comprise M dedicated channels and 1 common channel, and the value of M is greater than or equal to the number of distributed nodes served by the plurality of data channels;
determining each special channel corresponding to each distributed node according to the number corresponding to any one of the data channels, and detecting whether the data processing amount of each special channel corresponding to the same distributed node is greater than or equal to the data amount to be processed;
if so, transmitting the data to be processed corresponding to the same distributed node in the main task processing unit to the sub-task processing units corresponding to each other for processing based on each special channel corresponding to the same distributed node;
if not, transmitting the data to be processed corresponding to the same distributed node in the main task processing unit to the sub task processing units corresponding to each other for processing based on each dedicated channel corresponding to the same distributed node and the common data channel.
8. The method as claimed in claim 7, wherein said transmitting the data to be processed corresponding to the same distributed node in the main task processing unit to the sub task processing units corresponding to each other for processing comprises:
determining each server corresponding to the same distributed node in a subtask processing unit corresponding to the same distributed node according to the number of each data channel corresponding to the same distributed node, wherein the number corresponding to each server is determined in a polling mode;
and instructing each server to process the data to be processed corresponding to the same distributed node.
9. A data processing apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring the data to be processed corresponding to each distributed node;
the determining module is used for determining the mapping relation between each distributed node and each main task processing unit so as to enable the difference value between the data to be processed corresponding to each main task processing unit to be in a preset range;
the distribution module is used for distributing each data to be processed to each main task processing unit according to the mapping relation;
and the processing module is used for processing all the data to be processed corresponding to each main task processing unit according to each main task processing unit and the subtask processing unit corresponding to each main task processing unit.
10. The apparatus of claim 9, wherein the determination module is specifically configured to:
obtaining region information of each distributed node and historical data information of each distributed node, wherein the historical data information at least comprises data volume and data processing peak time period;
and determining the mapping relation between each distributed node and each main task processing unit according to each region information, each historical data information and the number of the main task processing units.
11. The apparatus of claim 9, wherein the processing module is specifically configured to:
determining a plurality of data channels between the main task processing unit and the sub task processing unit which correspond to each other; the plurality of data channels at least comprise M dedicated channels and 1 common channel, and the value of M is greater than or equal to the number of distributed nodes served by the plurality of data channels;
determining each special channel corresponding to each distributed node according to the number corresponding to any one of the data channels, and detecting whether the data processing amount of each special channel corresponding to the same distributed node is greater than or equal to the data amount to be processed;
if so, transmitting the data to be processed corresponding to the same distributed node in the main task processing unit to the sub-task processing units corresponding to each other for processing based on each special channel corresponding to the same distributed node;
if not, transmitting the data to be processed corresponding to the same distributed node in the main task processing unit to the corresponding subtask processing units for processing based on each dedicated channel corresponding to the same distributed node and the common data channel.
12. The apparatus of claim 11, wherein the processing module is further to:
determining each server corresponding to the same distributed node in a subtask processing unit corresponding to the same distributed node according to the number of each data channel corresponding to the same distributed node, wherein the number corresponding to each server is determined in a polling mode;
and instructing each server to process the data to be processed corresponding to the same distributed node.
13. An electronic device, comprising:
a memory, configured to store program instructions; and
a processor, configured to call the program instructions stored in the memory and to perform, in accordance with the obtained program instructions, the steps of the method of any one of claims 5 to 8.
14. A computer-readable storage medium storing a computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method of any one of claims 5 to 8.
15. A computer program product comprising computer program code which, when run on a computer, causes the computer to perform the method of any one of claims 5 to 8.
CN202211200945.9A 2022-09-29 2022-09-29 Data processing system, method and device and electronic equipment Pending CN115495220A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211200945.9A CN115495220A (en) 2022-09-29 2022-09-29 Data processing system, method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211200945.9A CN115495220A (en) 2022-09-29 2022-09-29 Data processing system, method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN115495220A true CN115495220A (en) 2022-12-20

Family

ID=84472182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211200945.9A Pending CN115495220A (en) 2022-09-29 2022-09-29 Data processing system, method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115495220A (en)

Similar Documents

Publication Publication Date Title
CN107688492B (en) Resource control method and device and cluster resource management system
US11036556B1 (en) Concurrent program execution optimization
CN109218355B (en) Load balancing engine, client, distributed computing system and load balancing method
EP3073374B1 (en) Thread creation method, service request processing method and related device
CN110896355B (en) Network slice selection method and device
CN108268317B (en) Resource allocation method and device
CN115328663B (en) Method, device, equipment and storage medium for scheduling resources based on PaaS platform
CN109564528B (en) System and method for computing resource allocation in distributed computing
CN104750557A (en) Method and device for managing memories
CN105791254B (en) Network request processing method and device and terminal
CN112463375A (en) Data processing method and device
CN111309491A (en) Operation cooperative processing method and system
EP3698246A1 (en) Management of a virtual network function
CN115460216A (en) Calculation force resource scheduling method and device, calculation force resource scheduling equipment and system
CN115495220A (en) Data processing system, method and device and electronic equipment
CN111158911A (en) Processor configuration method and device, processor and network equipment
CN115766582A (en) Flow control method, device and system, medium and computer equipment
CN111309467B (en) Task distribution method and device, electronic equipment and storage medium
EP3811210B1 (en) Method and supporting node for supporting process scheduling in a cloud system
CN110046040B (en) Distributed task processing method and system and storage medium
CN114489978A (en) Resource scheduling method, device, equipment and storage medium
CN111858019B (en) Task scheduling method and device and computer readable storage medium
CN111796932A (en) GPU resource scheduling method
CN112616143A (en) Method and device for distributing communication number, electronic equipment and storage medium
CN106385385B (en) Resource allocation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination