Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not restrictive of it. It should also be noted that, for convenience of description, only the portions related to the invention are shown in the drawings.
It should be noted that the embodiments in the present application, and the features of those embodiments, may be combined with each other when they do not conflict. The present application will be described in detail below with reference to the embodiments and the attached drawings.
Fig. 1 illustrates an exemplary system architecture 100 to which the distributed computing method of embodiments of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include a system background server 101, a master control center 102, a processing task master node server 103, a processing task sub-node server 104, a network 105, and a terminal device 106. The network 105 is used to provide a medium for communication links between the system background server 101, the master control center 102, the processing task master node server 103, and the processing task sub-node server 104. The network 105 may include various connection types, such as wired links, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal device 106 to interact with system backend server 101 over network 105 to receive or send messages and the like.
Terminal device 106 may be any of a variety of electronic devices, including, but not limited to, smart phones, tablet computers, laptop computers, desktop computers, and the like.
The system background server 101, the master control center 102, the processing task master node server 103, and the processing task sub-node server 104 may be servers providing various services, for example, a background server providing support for processing data from the terminal device 106. The background server may analyze and otherwise process received data such as the data processing information, and feed back a processing result (e.g., the processed data) to the terminal device.
It should be noted that the system background server 101, the master control center 102, the processing task master node server 103, and the processing task sub-node server 104 may be hardware or software. When implemented as hardware, each of the servers 101, 102, 103, and 104 may be a distributed server cluster composed of a plurality of servers, or a single server. When implemented as software, each may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. No particular limitation is imposed here.
It should be understood that the number of the various components of the system architecture in FIG. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow diagram 200 of one embodiment of a method applied to a system background server of a distributed computing system is shown, in accordance with the present application. The method comprises the following steps:
Step 201, receiving data processing information input by a target user.
In this embodiment, an execution body of the method applied to the system background server (e.g., the system background server 101 shown in fig. 1) may receive, via a wired or wireless connection, data processing information input by a target user from the terminal device used by that user (e.g., the terminal device 106 shown in fig. 1).
Wherein the target user may be a user using a terminal device as shown in fig. 1. The data processing information may include information such as keywords, start time, and end time. For example, the keyword may be a type name of the data to be processed, and the start time and the end time may be a start time and an end time of a time period for generating the data to be processed, which are selected by the target user.
Step 202, generating processing task information based on the data processing information.
In this embodiment, the execution body may generate the processing task information based on the data processing information. The processing task information may be used to characterize the manner in which the data is processed; for example, the manner of processing may include, but is not limited to, at least one of: querying, screening, calculating, and the like. The processing task information may include various information, for example, identification information for identifying the processing task, status information for characterizing the execution status of the processing task, and parameter information (e.g., a keyword, a start time, an end time, a region code, a table name, etc.).
In the present embodiment, various information included in the above-described processing task information may be stored in the task information table. In general, the task information table may be stored in a preset relational database.
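By way of a non-limiting illustration, such a task information table in a relational database could be sketched as follows. The column names and sample values are assumptions introduced for this example only; an in-memory SQLite database stands in for the preset relational database.

```python
import sqlite3

# Minimal sketch of the task information table described above, using an
# in-memory SQLite database; column names and values are illustrative
# assumptions, not taken from the application.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE task_info (
        task_id     TEXT PRIMARY KEY,  -- identification information
        status      TEXT NOT NULL,     -- execution status code
        keyword     TEXT,              -- parameter information follows
        start_time  TEXT,
        end_time    TEXT,
        region_code TEXT,
        table_name  TEXT
    )
""")
conn.execute(
    "INSERT INTO task_info VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("task-001", "0", "media", "2020-01-01", "2020-01-05", "0350", "source_a"),
)
conn.commit()
row = conn.execute(
    "SELECT status FROM task_info WHERE task_id = ?", ("task-001",)
).fetchone()
print(row[0])  # "0"
```

Any relational database with an SQL interface would serve equally well; SQLite is used here only to keep the sketch self-contained.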
Step 203, sending the processing task information to a master control center.
In this embodiment, the execution body may send the processing task information generated in step 202 to a master control center (e.g., the master control center 102 shown in fig. 1). The master control center is used for processing the processing tasks represented by the processing task information.
Step 204, in response to receiving the processing task state information sent by the master control center, updating the processing task state information corresponding to the processing task information in a preset processing task state information table.
In this embodiment, the execution body may update the processing task state information corresponding to the processing task information in the preset processing task state information table in response to receiving the processing task state information sent by the master control center. The processing task state information table may be included in the task information table described under step 202. In practice, the state characterized by the processing task state information may include, but is not limited to, at least one of: unprocessed (e.g., processing task state information of "0"), pending (e.g., "1"), in process (e.g., "2"), successful (e.g., "3"), and failed (e.g., "9").
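As a minimal sketch of maintaining such a state table, the status codes above can be mapped to readable names and updated when the master control center reports a new state. The helper function and dictionary-based table are assumptions for illustration only.

```python
# The status codes listed above, mapped to readable state names; the update
# helper is a minimal sketch of maintaining the state information table.
STATUS = {"0": "unprocessed", "1": "pending", "2": "in process",
          "3": "successful", "9": "failed"}

def update_task_status(state_table, task_id, status_code):
    """Record the latest status code reported for a task."""
    if status_code not in STATUS:
        raise ValueError("unknown status code: %s" % status_code)
    state_table[task_id] = status_code

state_table = {"task-001": "0"}                    # task starts unprocessed
update_task_status(state_table, "task-001", "2")   # center reports "in process"
print(STATUS[state_table["task-001"]])  # in process
```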
According to the method provided by this embodiment of the application, the execution state of a task is monitored in real time by receiving the processing task state information sent by the master control center, which helps to improve the efficiency and accuracy of data processing in the distributed computing system.
With further reference to FIG. 3, a flow 300 of one embodiment of a method applied to a master control center of a distributed computing system is shown, in accordance with the present application. The method comprises the following steps:
Step 301, receiving processing task information sent by a system background server.
In this embodiment, an execution body of the method applied to the master control center (for example, the master control center 102 shown in fig. 1) may receive processing task information sent by the system background server via a wired or wireless connection.
Step 302, in response to determining that the processing task information is legal information and that the processing task represented by the processing task information is a task to be executed, parsing the processing task information to obtain a task code.
In this embodiment, the execution body may, in response to determining that the processing task information is legal information and that the processing task represented by the processing task information is a task to be executed, parse the processing task information to obtain a task code. Specifically, the execution body may parse a namespace from the processing task information and determine, according to the namespace, whether the processing task information is legal information. The namespace may include letters, digits, and the like, and its length may be preset, for example, "jinxiang0". The processing task information may carry the namespace as a field serving as an identifier of the processing task information.
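The namespace legality check could be sketched as follows. The exact character set and length bounds in the pattern are assumptions, since the application only states that the namespace may contain letters and digits and has a preset length.

```python
import re

# Legality check sketched from the description: the namespace carried in the
# processing task information contains letters and digits and has a preset
# length; the exact pattern below is an illustrative assumption.
NAMESPACE_RE = re.compile(r"^[A-Za-z0-9]{4,16}$")

def is_legal(task_info):
    """Return True if the task information carries a well-formed namespace."""
    return bool(NAMESPACE_RE.fullmatch(task_info.get("namespace", "")))

print(is_legal({"namespace": "jinxiang0"}))         # True
print(is_legal({"namespace": "no spaces allowed"})) # False
```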
The execution body may determine in various ways that the processing task represented by the processing task information is a task to be executed. For example, it may determine, according to a predetermined execution order of the tasks, whether the processing task represented by the processing task information is the task that currently needs to be executed; if so, it determines that the processing task is a task to be executed.
In some optional implementations of this embodiment, after step 301, the execution body may further perform the following steps:
First, an execution order in which the processing tasks characterized by the processing task information are executed is determined. The execution order may be characterized by a preconfigured task number, sequence number, or the like. For example, with tasks numbered "001", "002", and so on, the order of the numbers characterizes the execution order of the tasks.
Then, based on the execution order, nodes and edges characterizing the processing tasks are generated in a preset directed acyclic graph. An edge connects two nodes and characterizes the execution order of the tasks represented by the two connected nodes. A Directed Acyclic Graph (DAG) may include a plurality of nodes (or vertices) and edges connecting the nodes, and its topology contains no cycles.
Because some tasks are constrained to execute earlier than others, a set of tasks that must be ordered into a queue can be represented by a DAG, where each vertex represents a task and each edge represents a constraint; a topological ordering algorithm can then be used to generate a valid execution sequence. DAGs may also be used to model information passing through a network of processors in a consistent direction. The reachability relation within a DAG constitutes a partial order, and any finite partial order may be represented by the reachability relation of some DAG.
Furthermore, DAGs can compactly represent sets of sequences with overlapping subsequences. In some task scheduling problems, tasks depend on one another, and certain tasks can only be done after others are completed; such dependencies between tasks are naturally expressed by a directed acyclic graph.
In some optional implementations of this embodiment, the execution body may determine whether a processing task is a task to be executed according to the node characterizing the processing task and the edges connecting that node. Specifically, the execution body may locate the node characterizing the processing task in the DAG and then determine the other nodes connected to it by edges, thereby determining the execution order of the tasks and whether the processing task is a task to be executed.
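The DAG-based ordering described above can be sketched with the standard library's topological sorter. The dependency data and the "done set" check are illustrative assumptions built from the numbered-task example in the text.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# The task DAG as adjacency data: each key lists the tasks that must finish
# before it may run, using preconfigured numbers "001", "002", ...
dag = {
    "002": {"001"},
    "003": {"001"},
    "004": {"002", "003"},
}

# A topological ordering algorithm yields a valid execution sequence.
order = list(TopologicalSorter(dag).static_order())
print(order)  # e.g. ['001', '002', '003', '004']

def is_task_to_execute(task, done):
    """A task is 'to be executed' once every predecessor node is done."""
    return dag.get(task, set()) <= done

print(is_task_to_execute("002", {"001"}))         # True: predecessor done
print(is_task_to_execute("004", {"001", "002"}))  # False: "003" still missing
```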
Step 303, determining whether the processing task information needs to be forwarded according to the task code, and in response to determining that the processing task information needs to be forwarded, sending the processing task information to a processing task master node server.
In this embodiment, the execution body may determine whether the processing task information needs to be forwarded according to the task code. Specifically, the task code may be used to characterize the type of the task, so the execution body can determine from the task code whether the task needs to be forwarded. For example, the code ds_402 may be defined as the data source task code of a media class; the prefix "ds" indicates the data source stage, whose tasks need to be forwarded to a processing task master node server.
In response to determining that forwarding is needed, the execution body sends the processing task information to a processing task master node server with which a correspondence has been established in advance for the processing task information. The processing task master node server may generate processing task state information according to the running condition of the executed processing task. As an example, the task code of the processing task information may be associated in advance with a processing task master node server, and the execution body may determine the corresponding master node server according to the task code.
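The prefix-based forwarding decision and the pre-established code-to-server correspondence could be sketched as below. The registry contents and server name are hypothetical examples, not values from the application.

```python
from typing import Optional

# Forwarding sketch: the prefix of the task code identifies the stage, and
# "ds" codes must be forwarded; the registry mapping codes to master node
# servers is a hypothetical example of the pre-established correspondence.
MASTER_NODE_FOR_CODE = {"ds_402": "master-node-103"}

def needs_forwarding(task_code):
    """Data-source ("ds") tasks are the ones forwarded to a master node."""
    return task_code.split("_", 1)[0] == "ds"

def route(task_code) -> Optional[str]:
    if needs_forwarding(task_code):
        return MASTER_NODE_FOR_CODE.get(task_code)
    return None

print(route("ds_402"))    # master-node-103
print(route("calc_101"))  # None
```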
Step 304, in response to receiving the processing task state information sent by the processing task master node server, sending the processing task state information to the system background server.
In this embodiment, the execution body may send the processing task state information to a system background server (e.g., the system background server 101 shown in fig. 1) in response to receiving the processing task state information sent by the processing task master node server, so that the system background server updates the processing task state information corresponding to the processing task information in the processing task state information table.
According to the method provided by this embodiment of the application, the processing task information sent by the system background server is received, and the processing task master node server that will execute the processing task is determined from that information. This helps to accurately select the master node server and to improve the accuracy and efficiency of data processing in the distributed computing system.
With further reference to FIG. 4, a flow diagram 400 of one embodiment of a method applied to a processing task master node server of a distributed computing system in accordance with the present application is shown. The method comprises the following steps:
Step 401, in response to receiving processing task information sent by the master control center, determining the total amount of data to be processed according to parameters included in the processing task information.
In this embodiment, an execution body of the method applied to the processing task master node server (for example, the processing task master node server 103 shown in fig. 1) may determine, in response to receiving the processing task information sent by the master control center, the total amount of data to be processed according to the parameters included in the processing task information. Specifically, the parameters may include a keyword, a start time, an end time, a region code, a table name, and the like. The execution body may query the full-text retrieval cluster for the total amount of data matching the information included in the parameters.
The full-text retrieval cluster refers to a Solr or Elasticsearch full-text search cluster storing massive data, and is used to provide data support for data analysis and applications. Elasticsearch is a real-time distributed search and analytics engine that can process large-scale data at high speed. It can be used for full-text search, structured search, and analytics, and the three can also be combined. Elasticsearch is a search engine built on the full-text search library Apache Lucene, an efficient, full-featured open-source search engine framework. The full-text retrieval cluster may be a server (which may be hardware or software) included in the distributed computing system, or a server included in another system communicatively connected to the distributed computing system.
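As a sketch of how the task parameters might be turned into a count request against such a cluster, the snippet below builds an Elasticsearch-style query body of the kind that could be posted to a `_count` endpoint. The field names ("keyword", "created_at") are assumptions introduced for illustration; the application does not specify the index schema.

```python
import json

# Sketch: build an Elasticsearch-style count query body from the task
# parameters; field names are illustrative assumptions.
def build_count_query(keyword, start_time, end_time):
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"keyword": keyword}},
                    {"range": {"created_at": {"gte": start_time,
                                              "lte": end_time}}},
                ]
            }
        }
    }

body = build_count_query("media", "2020-01-01", "2020-01-05")
print(json.dumps(body, sort_keys=True))
```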
Step 402, determining, according to the total amount, whether the processing task represented by the processing task information needs to be split into at least two subtasks.
In this embodiment, the execution body may determine, according to the total amount, whether the processing task represented by the processing task information needs to be split into at least two subtasks. As an example, the execution body may determine whether splitting is required according to a preset data amount threshold. For example, if the data amount threshold is 10000 and the determined total amount exceeds 10000, the task is split into subtasks by a preset number of days (for example, five days). If the data amount of a single subtask after splitting still exceeds the data amount threshold, that subtask is further split by another preset number of days (for example, one day).
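The day-based splitting rule above can be sketched as follows; the five-day window and ten-day task range are the example values from the text, and the function itself is an illustrative implementation, not the claimed one.

```python
from datetime import date, timedelta

DATA_THRESHOLD = 10000  # example threshold from the text

def split_by_days(start, end, days):
    """Split the inclusive date range [start, end] into consecutive windows
    of at most `days` days each (a sketch of the splitting rule above)."""
    windows = []
    cur = start
    while cur <= end:
        win_end = min(cur + timedelta(days=days - 1), end)
        windows.append((cur, win_end))
        cur = win_end + timedelta(days=1)
    return windows

# A task whose total amount exceeds the threshold is split into five-day
# subtasks, as in the example above.
subtasks = split_by_days(date(2020, 1, 1), date(2020, 1, 10), 5)
print(len(subtasks))  # 2
```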
Step 403, in response to determining that splitting is required, generating at least two pieces of subtask information characterizing the subtasks, and sending each piece of subtask information to the corresponding processing task sub-node server.
In this embodiment, the execution body may generate at least two pieces of subtask information characterizing the subtasks in response to determining that splitting is required. The subtask information may include identification information for identifying the subtask information, parameter information (e.g., a start time, an end time, etc.), and other information. There may be at least one processing task sub-node server, and the correspondence between the processing task sub-node servers and the subtask information may be preset. For example, the execution body may look up the processing task sub-node server corresponding to the identification information of the subtask information in a preset correspondence table characterizing the correspondence between processing task sub-node servers and subtask information.
Then, the execution body may send each piece of subtask information to the corresponding processing task sub-node server.
Step 404, in response to receiving the subtask state information sent by the processing task sub-node servers, generating processing task state information characterizing the state of the processing task based on the received subtask state information.
In this embodiment, the execution body may generate, in response to receiving the subtask state information sent by the processing task sub-node servers, processing task state information characterizing the state of the processing task based on each piece of received subtask state information. As an example, the execution body may use the set of all subtask state information as the processing task state information, or add each piece of subtask state information to a new message to be sent and use that message as the processing task state information.
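One way to derive a single task state from the subtask states is sketched below, using the state codes listed earlier ("2" in process, "3" successful, "9" failed). The specific aggregation rule is an assumption; the text only requires that the processing task state be based on the received subtask states.

```python
# Aggregation sketch: derive one processing task state from the subtask
# states. The rule itself (fail-fast, all-success) is an assumption.
def aggregate(sub_states):
    if any(s == "9" for s in sub_states):
        return "9"   # any failed subtask fails the whole task
    if all(s == "3" for s in sub_states):
        return "3"   # every subtask succeeded
    return "2"       # otherwise the task is still in process

print(aggregate(["3", "3", "3"]))  # 3
print(aggregate(["3", "2", "3"]))  # 2
print(aggregate(["3", "9", "2"]))  # 9
```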
Step 405, sending the processing task state information to the master control center.
In this embodiment, the execution body may send the processing task state information to the master control center (e.g., the master control center 102 shown in fig. 1).
According to the method provided by this embodiment of the application, the total amount of data to be processed is determined from the parameters included in the processing task information, and it is determined from that total amount whether the processing task represented by the information needs to be split into at least two subtasks. If splitting is required, subtask information characterizing the at least two subtasks is generated, and each piece of subtask information is sent to the corresponding processing task sub-node server. The processing task can thus be split effectively, which facilitates distributed processing of the task and improves the data processing efficiency of the distributed computing system.
With further reference to FIG. 5, a flow 500 of one embodiment of a method applied to a processing task sub-node server of a distributed computing system according to the present application is illustrated. The method comprises the following steps:
Step 501, receiving subtask information sent by a processing task master node server.
In this embodiment, an execution body of the method applied to the processing task sub-node server (for example, the processing task sub-node server 104 shown in fig. 1) may receive subtask information sent by the processing task master node server (for example, the processing task master node server 103 shown in fig. 1). For the subtask information, reference may be made to the steps of the embodiment shown in fig. 4, which are not described again here.
Step 502, determining the data amount of the processing data corresponding to the subtask information according to the parameters included in the subtask information.
In this embodiment, the execution body may determine the data amount of the processing data corresponding to the subtask information according to the parameters included in the subtask information. Specifically, the parameters may include a keyword, a start time, an end time, a region code, a table name, and the like. The execution body may query the full-text retrieval cluster for the total amount of data matching the parameters included in the subtask information. The processing data may include data extracted from the full-text retrieval cluster.
Step 503, based on the data amount, determining the data queue to which the processing data belongs.
In this embodiment, the execution body may determine the data queue to which the processing data belongs based on the data amount. The data queues are preset message queues capable of accommodating different data volumes.
In some optional implementations of this embodiment, the types of the data queues may include at least one of: a fast queue, a regular queue, and a long-time queue. Each type of data queue corresponds to a different data volume interval. As an example, the upper limit of the data amount corresponding to the fast queue is A, the upper limit corresponding to the regular queue is B, and the upper limit corresponding to the long-time queue is C, where A < B < C.
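The selection of a queue type by data amount could be sketched as follows; the concrete values of the upper limits A, B, and C are assumptions, since the text leaves them unspecified.

```python
# Queue selection sketch using upper limits A < B < C as described above;
# the concrete values are illustrative assumptions.
A, B, C = 1_000, 100_000, 10_000_000

def choose_queue(data_amount):
    """Pick the smallest queue type whose upper limit fits the data."""
    if data_amount <= A:
        return "fast"
    if data_amount <= B:
        return "regular"
    if data_amount <= C:
        return "long"
    raise ValueError("data amount exceeds every queue's upper limit")

print(choose_queue(500))        # fast
print(choose_queue(50_000))     # regular
print(choose_queue(5_000_000))  # long
```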
In some optional implementations of this embodiment, the execution subject may determine, based on the data amount, a data queue to which the processing data belongs according to the following steps:
First, usage information of the data queue is determined. The usage information may characterize whether the data queue is congested, or may be the idle rate of the data queue. The idle rate may be the ratio of the currently free capacity of the data queue to the upper limit of the data amount the queue can hold.
Then, in response to determining that the usage information characterizes the data queue as being in an abnormal state, the data in the data queue is transferred to another data queue. Specifically, the execution body may select the other data queue on the principle that data is transferred to a queue that can accommodate a larger data volume than the current one. By monitoring the usage of the data queues, the queue used by a processing subtask can be adjusted in real time, thereby improving the efficiency of data processing.
Optionally, while running the processing task, the execution body may estimate the data amount of the currently used data queue after a preset time period, and transfer the data in the currently used data queue to another data queue if the estimated data amount exceeds the capacity of the currently used data queue. The other data queue may be one capable of accommodating more data, for example, moving from the fast queue to the regular queue or the long-time queue. By estimating the data amount of the data queue, the queue used by a processing subtask can be adjusted in real time, thereby improving data processing efficiency.
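The estimate-and-migrate behavior could be sketched as below. The linear-extrapolation estimate and the capacity values are assumptions; the text does not fix how the future data amount is estimated.

```python
# Migration sketch: estimate the queue's data amount after a preset period
# with a simple linear extrapolation (an assumption) and move to the next
# larger queue if the estimate exceeds the current queue's capacity.
CAPACITY = {"fast": 1_000, "regular": 100_000, "long": 10_000_000}
NEXT_LARGER = {"fast": "regular", "regular": "long"}

def maybe_migrate(queue_name, current_amount, rate_per_sec, horizon_sec):
    estimated = current_amount + rate_per_sec * horizon_sec
    if estimated > CAPACITY[queue_name] and queue_name in NEXT_LARGER:
        return NEXT_LARGER[queue_name]  # transfer data to the larger queue
    return queue_name

print(maybe_migrate("fast", 800, 10.0, 60.0))     # 1400 > 1000 -> regular
print(maybe_migrate("regular", 800, 10.0, 60.0))  # stays regular
```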
Step 504, storing the data in the data queue into a preset storage area.
In this embodiment, the execution body may store the data in the data queue in a preset storage area. As an example, the execution body may store the data extracted from the full-text retrieval cluster in a preset format (for example, JSON format) on a local hard disk. Alternatively, the execution body may store the data in the data queue in a preset database, for example, an HBase cluster or a Spark distributed computing cluster.
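The local-disk option can be sketched as follows; the record contents and file name are illustrative assumptions.

```python
import json
import os
import tempfile

# Storage sketch for the local-disk option: records drained from the data
# queue are serialized in JSON format and written to a file on disk.
records = [{"id": 1, "keyword": "media"}, {"id": 2, "keyword": "media"}]

path = os.path.join(tempfile.mkdtemp(), "batch.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False)

with open(path, encoding="utf-8") as f:
    stored = json.load(f)
print(len(stored))  # 2
```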
Step 505, generating subtask state information characterizing the running state of the processing subtask indicated by the subtask information, and sending the subtask state information to the processing task master node server.
In this embodiment, the execution body may generate subtask state information characterizing the running state of the processing subtask indicated by the subtask information, and send the subtask state information to the processing task master node server. Specifically, the subtask state information may be used to characterize the running state of the processing subtask indicated by the subtask information, and the running state may include, but is not limited to, at least one of: unprocessed (e.g., subtask state information of "0"), pending (e.g., "1"), in process (e.g., "2"), successful (e.g., "3"), and failed (e.g., "9").
According to the method provided by this embodiment of the application, subtask information sent by the processing task master node server is received, the data amount of the processing data corresponding to the subtask information is determined from that information, the data queue to which the processing data belongs is determined based on the data amount, and the data in the data queue is stored in a preset storage area. This makes it possible to transmit data quickly and efficiently using different types of data queues, improving the data processing efficiency of the distributed computing system.
With further reference to FIG. 6, there is shown a schematic block diagram of a distributed computing system 600 according to the present application. The system 600 includes: a system background server 601, a main control center 602, a processing task main node server 603, and a processing task sub-node server 604.
The system background server 601 is configured to execute the method described in the embodiment of fig. 2; the master control center 602 is configured to execute the method described in the embodiment of fig. 3; the processing task master node server 603 is configured to execute the method described in the embodiment of fig. 4; and the processing task sub-node server 604 is configured to execute the method described in the embodiment of fig. 5. It should be noted that there may be at least one processing task sub-node server, and all or some of the components included in the system 600 may be arranged together in the same electronic device, or in different electronic devices that are communicatively connected to each other.
When the components of the system are disposed in a plurality of communicatively connected electronic devices, information may be transmitted, in whole or in part, between the system background server 601, the master control center 602, the processing task master node server 603, and the processing task sub-node server 604 via RPC (Remote Procedure Call). RPC is a protocol for requesting services from a remote computer program over a network without knowledge of the underlying network technology. The RPC protocol assumes the existence of a transmission protocol, such as TCP (Transmission Control Protocol) or UDP (User Datagram Protocol), to carry information data between the communicating programs. In the OSI network communication model, RPC spans the transport and application layers. RPC makes it easier to develop applications, including distributed multi-program network applications.
Optionally, as shown in fig. 6, the system 600 may further include an in-memory database 605 for storing recent statistical data; an HBase cluster 606 for storing the data source data extracted by the processing task sub-node server 604 and the result data generated by data preprocessing and data analysis; and a Spark distributed computing cluster 607 providing the computing resources needed for efficient real-time distributed computing. HBase clusters and Spark distributed computing clusters are well-known technologies that are widely studied and used at present, and are not described again here.
The system provided by this embodiment of the application comprises the system background server, the master control center, the processing task master node server, and the processing task sub-node server. Through the scheduling and processing of tasks by these components, the system improves the capacity of the distributed computing system to handle massive data and the efficiency with which data is processed.
Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Liquid Crystal Display (LCD) and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program, when executed by the Central Processing Unit (CPU)701, performs the above-described functions defined in the method of the present application. It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
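As an illustrative sketch only (not part of the claimed subject matter), the case where program code executes partly on the user's computer and partly on a remote computer may be realized by shipping input over a network connection and collecting the result. The function names, the loopback address, and the upper-casing workload below are hypothetical placeholders chosen for the example:

```python
import socket
import threading

def serve_once(host="127.0.0.1", port=0):
    """Remote part: accept one connection and answer with the received
    text upper-cased. Binding to port 0 lets the OS pick a free port;
    the actual (host, port) is returned so the user side can connect."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, port))
    srv.listen(1)

    def handler():
        conn, _ = srv.accept()
        with conn:
            data = conn.recv(1024)
            conn.sendall(data.upper())  # the "remote" portion of the work
        srv.close()

    threading.Thread(target=handler, daemon=True).start()
    return srv.getsockname()

def run_on_remote(addr, text):
    """User-side part: send the input to the remote computer and
    return the computed result."""
    with socket.create_connection(addr) as cli:
        cli.sendall(text.encode())
        return cli.recv(1024).decode()

if __name__ == "__main__":
    addr = serve_once()
    print(run_on_remote(addr, "hello"))  # prints "HELLO"
```

In a deployment matching the text above, the two halves would run on separate machines connected by a LAN, WAN, or the Internet; the loopback address is used here only so the sketch is self-contained.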
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions.
The above description is only a description of preferred embodiments of the present application and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the invention disclosed herein is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but are not limited to) features having similar functions disclosed in the present application.