CN115617469A - Data processing method in cluster, electronic device and storage medium - Google Patents
- Publication number
- CN115617469A (application number CN202110802810.9A)
- Authority
- CN
- China
- Prior art keywords
- record
- processed
- node
- data
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
All entries fall under G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING:
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
- G06F9/5083—Techniques for rebalancing the load in a distributed system
- G06F9/542—Event management; Broadcasting; Multicasting; Notifications
- G06F9/546—Message passing systems or structures, e.g. queues
- G06F9/547—Remote procedure calls [RPC]; Web services
- G06F2209/484—Precedence (indexing scheme relating to G06F9/48)
- G06F2209/544—Remote (indexing scheme relating to G06F9/54)
- G06F2209/547—Messaging middleware (indexing scheme relating to G06F9/54)
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Embodiments of the invention disclose a data processing method in a cluster, an electronic device, and a storage medium. The method includes the following steps: acquiring a record to be processed; according to the data partition identifier in the record to be processed, determining the priority processing node corresponding to that data partition identifier in the cluster as the target node for processing the record, and distributing the record to the target node; and updating the history record set cached on the target node according to the record to be processed. In the scheme provided by the embodiments of the disclosure, data are distributed according to the partition identifier of the data to be processed, making use of each data partition's preferred node, so that each node in the cluster caches and processes the data of relatively fixed partitions. This avoids both the conflicts of multi-node cache data processing within the cluster and the excessive per-node cache volume caused by randomly assigning processing nodes, and significantly improves the processing performance and stability of the cluster system.
Description
Technical Field
The present invention relates to, but is not limited to, the field of computer information, and in particular to a data processing method in a cluster, an electronic device, and a storage medium.
Background
As the application requirements of various application systems continue to grow, the real-time requirements of data processing become ever higher. At the same time, application scale keeps increasing, and clusters have become widespread for distributed processing. To meet real-time requirements, an application server (a node in the cluster) caches part of the key service data in memory, reducing the access overhead of disk or external data and improving service processing efficiency. In cluster schemes, how to coordinate the distribution of data processing tasks among multiple nodes while avoiding cache-data conflicts between different nodes is a direction the field continues to explore.
Disclosure of Invention
Embodiments of the present disclosure provide a data processing method in a cluster, an electronic device, and a storage medium. A relatively fixed priority processing node corresponding to each data partition is used to distribute data according to the partition identifier of the data to be processed, so that each node in the cluster caches and processes the data of relatively fixed partitions. This avoids conflicts of multi-node cache data processing in the cluster, prevents the excessive per-node cache volume caused by randomly assigning processing nodes, and significantly improves the processing performance and stability of the cluster system.
In one aspect, an embodiment of the present disclosure provides a data processing method in a cluster, including obtaining a record to be processed;
according to the data partition identification in the record to be processed, determining a priority processing node corresponding to the data partition identification in the cluster as a target node for processing the record to be processed, and distributing the record to be processed to the target node;
and updating the history record set cached on the target node according to the record to be processed.
On the other hand, the embodiment of the present disclosure further provides an electronic device, including:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the data processing method in the cluster according to any embodiment of the present disclosure.
In another aspect, the present disclosure provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the data processing method in a cluster according to any embodiment of the present disclosure.
Embodiments of the invention fully utilize the configured relationship between data partitions and partition priority processing nodes, keeping the data range relatively stable when each node in the cluster updates its cached data.
Other aspects will be apparent upon reading and understanding the attached drawings and detailed description.
Drawings
To illustrate the embodiments or technical solutions of the present invention more clearly, the drawings used in the embodiments or in the description of the prior art are briefly introduced below. The drawings in the following description are only some embodiments of the present invention; those skilled in the art can obtain other drawings from the structures shown without creative effort.
Fig. 1 is a flowchart of a method for processing data in a cluster according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a cluster system for high-performance vehicle archive aggregation in an example of the present invention;
Fig. 3 is a schematic diagram of a node failure in a cluster system for high-performance vehicle archive aggregation according to an example of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
It should be noted that all directional indicators (such as up, down, left, right, front, back, etc.) in the embodiments of the present invention are only used to explain the relative positional relationship, motion situation, etc. between the components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indication changes accordingly.
In addition, descriptions such as "first", "second", etc. in the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of the feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless explicitly specified otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "connected," "secured," and the like are to be construed broadly, and for example, "secured" may be a fixed connection, a removable connection, or an integral part; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or they may be interconnected within two elements or in a relationship where two elements interact with each other unless otherwise specifically limited. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In addition, the technical solutions in the embodiments of the present invention may be combined with each other, but it must be based on the realization of those skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination of technical solutions should not be considered to exist, and is not within the protection scope of the present invention.
In real life, various real-time data processing business systems can be seen everywhere, and cluster schemes are widely applied according to application scale and system availability requirements. Taking a vehicle dynamic monitoring system as an example, business functions such as statistics, management, and history tracing all need to acquire the dynamic archive information of vehicles in a city or region, which describes the vehicles that have appeared in that city or region within a period. Such a period is also called a parking period; if the passing time of a vehicle is not within the period, the record needs to be deleted from the vehicle dynamic archive information, a process also called vehicle archive aggregation. The module or subsystem storing the aggregated vehicle archive data is called the aggregate archive library, and the vehicle passing information in it is required to be globally unique with the vehicle identifier as the primary key. Because the volume of daily vehicle traffic data in a city or region is very large, and stable, reliable service guarantees are needed, the overall application system must be fault tolerant and support online expansion of the application servers and the aggregate archive library. Therefore, a cluster scheme is adopted to deploy the related systems, and multiple nodes in the cluster perform distributed service processing.
In a related scheme, to achieve accurate aggregation of vehicle passing data, streaming engines such as Spark, Flink, Storm, or Flume process the data in real time, and global uniqueness checking is performed based on Redis. If Redis is used in single-machine mode, the overall system cannot scale linearly, multiple nodes cannot be introduced for load sharing, and large data volumes cannot be handled. If Redis is used in cluster mode, processing each piece of service data requires interaction with Redis, which significantly increases the network I/O overhead of the server system and degrades overall service processing performance.
In view of the above disadvantages, the embodiments of the present disclosure provide a data processing scheme for a cluster environment in which data are distributed to relatively fixed processing nodes according to their partition identifiers, so that different processing nodes cache and process data (for example, perform global uniqueness checking) within relatively fixed partition ranges. The cache volume of each processing node is thereby effectively controlled, cache-data conflicts among multiple nodes are avoided, and the reliability and stability of the overall service system are ensured.
It should be noted that the data processing method in a cluster provided by the embodiments of the present disclosure can be applied to various service systems and is not limited to the vehicle archive application described in the examples; it can also be used for personnel monitoring systems, logistics monitoring, and the like. Following the aspects described in the embodiments of the present disclosure, those skilled in the art can adjust the related aspects to achieve the corresponding technical goals.
Before the related embodiments are described, several terms involved in the embodiments of the present disclosure are introduced:
spark: apache Spark is a fast, general-purpose computing engine designed specifically for large-scale data processing. Spark provides a large number of libraries including Spark Core, spark SQL, spark Streaming, MLlib, graphX, and the like.
Redis: remote Dictionary Server), which is an open source log-type and Key-Value database written in ANSI C language, supporting network, based on memory and persistent, and providing API in multiple languages.
And (4) Flink: apache Flink is an open source stream processing framework developed by the Apache software foundation, at the heart of which is a distributed stream data stream engine written in Java and Scala. Flink executes arbitrary stream data programs in a data parallel and pipelined manner, and Flink's pipelined runtime system can execute batch and stream processing programs.
Storm: storm is a distributed real-time big data processing framework of Twitter open source, and can simply and reliably process a large number of data streams.
Flume: the Flume is a highly available, highly reliable and distributed system for collecting, aggregating and transmitting mass logs provided by Cloudera.
Spark actuator: the Spark executor is the basic unit of true calculation processing data of Spark operation.
Spark driver: the main method for executing the Spark task is responsible for issuing the computing task to the executor.
Kafka: kafka is an open source stream processing platform developed by the Apache software foundation, written by Scala and Java. Kafka is a high throughput distributed publish-subscribe messaging system that can handle all the action flow data of a consumer in the relevant system.
RocketMQ: the RocketMQ is a piece of distributed, open source message middleware of a queue model donated by Ali to Apache.
Gathering the files: also referred to as archiving, drops the deduplicated filtered data to persistent storage.
The present disclosure provides a data processing method in a cluster, as shown in Fig. 1, including:
Step 101: acquiring a record to be processed;
Step 102: according to the data partition identifier in the record to be processed, determining the priority processing node corresponding to that data partition identifier in the cluster as the target node for processing the record, and distributing the record to the target node;
Step 103: updating the history record set cached on the target node according to the record to be processed.
In some exemplary embodiments, the record to be processed includes at least the following information: a target identifier and a record time. The target identifier is the key used for duplicate checking in the aggregate archive library or in each node's cache; that is, each target identifier corresponds to exactly one record there. The record time indicates the occurrence time of the relevant event, the generation time of the record, or another time used to judge the temporal validity of the record, and is not limited to the examples in the embodiments of the present disclosure.
In some exemplary embodiments, the records to be processed obtained in step 101 may come from message middleware such as Kafka or RocketMQ and be retrieved in real time using Spark Streaming.
In some exemplary embodiments, the record to be processed further includes other related information required by the application. For example, the record to be processed is a vehicle passing record from a monitoring system. The target identifier is the license plate number; alternatively, it is the plate number plus the plate color, or other data that can uniquely identify the vehicle. The record may further include the vehicle color, license plate type, vehicle brand, and so on.
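As a concrete illustration of the fields above, a vehicle passing record could be modeled as follows; the class name and field names are hypothetical, chosen only to mirror the target identifier, record time, and extra attributes described in this embodiment:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PendingRecord:
    # Target identifier used for duplicate checking, e.g. the plate
    # number, optionally combined with the plate color.
    target_id: str
    # Occurrence time of the event (here, the vehicle passing time),
    # expressed as a Unix timestamp.
    record_time: float
    # Application-specific extras such as vehicle color or brand.
    extras: Optional[dict] = None

record = PendingRecord(target_id="ABC123-blue", record_time=1626300000.0,
                       extras={"vehicle_color": "white"})
```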
It should be noted that different application systems involve different management targets or service subjects, so the content of the target identifier in the corresponding record to be processed differs, as do other aspects; these are not limited to the scope and forms exemplified in the present disclosure.
In some exemplary embodiments, the priority processing node corresponding to a data partition identifier in step 102 is determined as follows: hash the data partition identifier and take the result modulo the number of processing nodes in the cluster; the result selects the priority processing node for that partition identifier from among the cluster's processing nodes.
In some exemplary embodiments, the data partition identifier in the record to be processed is determined as follows: hash the target identifier in the record and take the result modulo the number of partitions; the result is the record's data partition identifier. In some exemplary embodiments, the data partition identifier is also referred to as a partition number.
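The two hash-modulo mappings just described can be sketched as follows. The use of CRC32 is an assumption for illustration (any deterministic hash works); the embodiment only specifies hashing followed by a modulo:

```python
import zlib

def stable_hash(s: str) -> int:
    # Python's built-in hash() is salted per process, so CRC32 is used
    # here to keep the mapping stable across restarts (an illustrative
    # choice, not mandated by the disclosure).
    return zlib.crc32(s.encode("utf-8"))

def partition_id(target_id: str, num_partitions: int) -> int:
    # Data partition identifier: hash of the target identifier
    # modulo the planned number of partitions.
    return stable_hash(target_id) % num_partitions

def priority_node(part_id: int, num_nodes: int) -> int:
    # Priority processing node: hash of the data partition identifier
    # modulo the number of processing nodes in the cluster.
    return stable_hash(str(part_id)) % num_nodes
```

Because both mappings are deterministic, the same target identifier always lands in the same partition, and each partition keeps the same priority node for as long as the cluster membership is unchanged.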
It can be seen that once the cluster system is established and the number of distributed processing nodes it contains is determined, a relatively fixed priority processing node, also called the partition's preferred node, can be determined for each partition in the above manner. The planned number of partitions for the overall data is also fixed when the cluster is established; for the same management target or service subject, the target identifier remains unchanged, so the data partition identifier determined in the above manner also remains unchanged.
It should be noted that when the target is the same, the target identifier in the record obtained in step 101 is the same, and the record's data partition identifier matches that of the target's history records; the target node determined in step 102 is therefore the same node that processed the target's history records. In other words, under the scheme of the embodiments of the present disclosure, as long as each node in the cluster is normally available, records of the same target are processed on the same node, and the latest record data of the target are cached on that node. Equivalently, records belonging to the same partition are processed on that partition's priority processing node. In this way, the data of one or more fixed partitions are cached and processed on distinct nodes of the cluster, the cache volume on each node is effectively controlled, processing conflicts of the same target's records among different nodes under a shared cache are avoided, the accuracy of duplicate checking across the overall system is ensured, and the efficiency of subsequently synchronizing cached data to the aggregate archive library is improved.
In some exemplary embodiments, the number of partitions is greater than or equal to the number of nodes in the cluster system. In this way, each node can be determined to be the priority processing node for at least one partition.
In some exemplary embodiments, the method further includes step 100, in which a node in the cluster loads cache data, including: determining the one or more data partition identifiers for which the node serves as the priority processing node, and, according to those identifiers, obtaining the corresponding history records from the aggregate archive library and loading them into the cache.
It should be noted that after the cluster is established, its member nodes are determined, and according to the above scheme each node is determined as the priority processing node for one or more data partitions. Therefore, after the cluster is started, that is, after all nodes are started, each node loads the history records of its corresponding data partitions from the aggregate archive library into its cache, so that records can subsequently be updated, added, or deleted in the cache as new records are processed.
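A minimal sketch of this step 100 startup loading, with the aggregate archive library modeled as an in-memory dict keyed by partition (a hypothetical interface; a real deployment would query persistent storage):

```python
import zlib

def _h(s: str) -> int:
    # Deterministic hash, consistent with the partition-to-node mapping.
    return zlib.crc32(s.encode("utf-8"))

def load_startup_cache(node_index: int, num_nodes: int,
                       num_partitions: int, archive_db: dict) -> dict:
    # On startup, a node loads the history records of every partition
    # for which it is the priority processing node.
    cache = {}
    for part in range(num_partitions):
        if _h(str(part)) % num_nodes == node_index:
            cache.update(archive_db.get(part, {}))
    return cache
```

Across all nodes, every partition is loaded exactly once, since each partition has exactly one priority processing node.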
In some exemplary embodiments, in step 102, determining, according to the data partition identifier in the record to be processed, that a priority processing node corresponding to the data partition identifier in the cluster is a target node for processing the record to be processed, includes:
determining a priority processing node corresponding to the data partition identifier in the cluster according to the data partition identifier in the record to be processed;
determining the priority processing node as a target node for processing the record to be processed under the condition that the priority processing node is available;
when the priority processing node is unavailable, acquiring the number of currently available nodes in the cluster; performing hash modulo over the number of currently available nodes to determine a temporary processing node corresponding to the data partition identifier among the currently available nodes; and determining the temporary processing node as the target node for processing the record to be processed.
That is, according to the result of the hash modulo, a corresponding temporary processing node is determined for each data partition identifier from among the currently available nodes in the cluster.
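The available/unavailable branches above can be combined into a single routing function; the node identifiers and membership sets here are illustrative assumptions:

```python
import zlib

def _h(s: str) -> int:
    return zlib.crc32(s.encode("utf-8"))

def select_target_node(part_id: int, all_nodes: list, available: set):
    # Preferred: hash modulo over the full cluster membership.
    preferred = all_nodes[_h(str(part_id)) % len(all_nodes)]
    if preferred in available:
        return preferred, "priority"
    # Fallback: re-hash over the currently available nodes only, so the
    # remaining nodes temporarily share the unavailable node's
    # partitions. Sorting makes every node compute the same assignment.
    alive = sorted(available)
    return alive[_h(str(part_id)) % len(alive)], "temporary"
```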
It should be noted that the nodes in the cluster do not always remain all available. When determining the target node for the pending record at step 102, it may happen that the priority processing node is available or unavailable:
(I) When the priority processing node (i.e., the partition's preferred node) is available, the record to be processed is distributed normally to it, and the cache update of step 103 is executed by this priority processing node. Since the priority processing node loaded all the history records of the record's home partition at startup, duplicate checking, history updating, and record insertion based on those history records are accurate.
(II) When the priority processing node (the partition's preferred node) is unavailable, only the records originally destined for the unavailable node are redistributed, within the range of the currently available nodes, so that other available nodes share the unavailable node's processing tasks; the nodes participating in this secondary distribution are called temporary processing nodes for the partition concerned. That is, when a node is unavailable, the pending records of its partitions are shared among other nodes so that the overall service continues uninterrupted, improving the fault tolerance of the system. In this case, these other available nodes are simultaneously priority processing nodes for records of one or more partitions and temporary processing nodes sharing the record processing tasks of other partitions.
Correspondingly, in step 103, updating the history record set cached on the target node according to the record to be processed includes:
when the target node is the priority processing node corresponding to the data partition identifier in the record to be processed, updating the cached history record for the record's target identifier, or adding the record to the target node's cache;
when the target node is not the priority processing node corresponding to the data partition identifier in the record to be processed, obtaining the corresponding history records from the aggregate archive library according to the data partition identifier and synchronizing them into the target node's cache, then updating the cached history record for the target identifier or adding the record to the target node's cache.
It should be noted that, because the data distribution of step 102 falls into the above two cases depending on whether the priority processing node is available, the node receiving the record to be processed likewise executes step 103 in two cases:
(I) If the node (target node) receiving the record to be processed is the priority processing node for the record's partition, it processes the cached data according to the record, including: when a history record for the record's target identifier exists in the cache, updating that history record according to the record; when no such history record exists, adding the record to the cache. The node receiving the record thus executes step 103 in the role of priority processing node.
(II) If the node (target node) receiving the record to be processed is not the priority processing node for the record's partition, that is, the partition's priority processing node is currently unavailable and cannot process the record normally, the record is shared to a temporary processing node.
At this time, the node receiving the pending record further performs step 103 in the role of a temporary processing node, including:
obtaining the corresponding history records from the aggregate archive library according to the data partition identifier in the received record and synchronizing them into its own cache; then processing the cached data according to the record, including: when a history record for the record's target identifier exists in the cache, updating it according to the record; when no such history record exists, adding the record to the cache.
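Whichever role the target node plays, the per-record cache update itself is an upsert keyed by the target identifier. A sketch, where keeping the record with the newer record time is one plausible merge rule (the embodiment leaves the exact update policy open):

```python
def update_cache(cache: dict, record: dict) -> str:
    # Upsert by target identifier: update the existing history record
    # if present, otherwise add the pending record as a new entry.
    key = record["target_id"]
    if key in cache:
        if record["record_time"] >= cache[key]["record_time"]:
            cache[key] = record
        return "updated"
    cache[key] = record
    return "added"
```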
It should be noted that, for each node in the cluster, when all nodes are normally available, records are distributed according to the priority processing node (the partition's preferred node) corresponding to each data partition, and a record so received is called a first record to be processed for that node. When one or more nodes are unavailable, the first records that would have been distributed to them are redistributed, and a record received by another available node through this secondary distribution is called a second record to be processed, also called a temporary record to be processed. "First" and "second" serve only to distinguish the records; they carry no other connotation and do not indicate processing priority on any node.
In some exemplary embodiments, the above steps further include: recording the data partition identifiers synchronized from the archive library; or recording both the data partition identifiers synchronized from the archive library and the synchronization time.
In some exemplary embodiments, the node receiving the pending record also performs step 103 in the role of a temporary processing node, including:
in the case that a loading condition for the second-record cache is determined to be met, acquiring the corresponding history records from the archive library according to the data partition identifier in the received record to be processed, and synchronizing them into the node's cache; and processing the cached data according to the record to be processed, including: when a history record with the target identifier in the record to be processed exists in the cache, updating that history record according to the record to be processed; and when no such history record exists in the cache, adding the record to be processed into the cache.
The loading condition for the second-record cache is met as follows:
when the partition data indicated by the data partition identifier in the received second record to be processed has not yet been loaded, determining that the loading condition for the second-record cache is met.
In some exemplary embodiments, whether the partition data has been loaded may be determined based on the recorded data partition identifiers synchronized from the archive library. Alternatively, one skilled in the art may determine this in other ways.
For example, when node A in the cluster becomes unavailable, the process proceeds to step 102, and the records of the partitions having node A as priority processing node are distributed again to other nodes. These other available nodes perform local cache updates both in the role of priority processing node under scheme (I) and in the role of temporary processing node under scheme (II). According to the scheme for writing to the archive library, the cached data on these nodes is synchronized into the archive library when a preset archiving condition is met.
When node A becomes available again, after starting, node A executes step 100 to load the latest history records of its corresponding partitions from the archive library. From this point on, the partition data having node A as priority processing node no longer needs to be distributed to other nodes a second time; node A receives the records to be processed according to scheme (I) and performs local cache processing. Other nodes are no longer assigned the data to be processed for which node A is the priority processing node, and the whole cluster system returns to normal operation.
In some exemplary embodiments, when a plurality of nodes in the cluster are unavailable, the handling is similar to the single-node case: distribution is first performed according to the priority processing nodes determined when all nodes are available, and then the records to be processed that were distributed to the unavailable nodes are secondarily distributed among the currently available nodes. Those skilled in the art can derive the detailed implementation from the above description, so it is omitted here.
It should be noted that how to determine whether each node in the cluster system is available, unavailable, or recovered, and how to inform the other nodes in the cluster, is implemented according to the related cluster scheme; the specific scheme is not within the scope protected or limited by the present application. For example, when a node goes offline or crashes, the Spark driver performs remote procedure call (rpc) communication with the executors of all nodes in the cluster to learn which nodes are unavailable, which are available, and the new number of nodes; when the offline or crashed node recovers, it performs rpc communication with the Spark driver, and the Spark driver, once informed, performs rpc communication with all nodes in the cluster so that they obtain the information on the newly available node and the new number of available nodes.
In some exemplary embodiments, the method further comprises:
Step 104, clearing the expired data cached on the target node when a preset cache cleaning condition is met.
In some exemplary embodiments, meeting the preset cache cleaning condition includes one or more of:
when the number of history records cached on the current node is larger than a preset first number threshold, determining that the preset cache cleaning condition is met;
when the total storage size of the history records cached on the current node is larger than a preset first storage amount threshold, determining that the preset cache cleaning condition is met;
when a preset cache cleaning time interval is reached, determining that the preset cache cleaning condition is met;
and when the current node acquires a new batch of records to be processed, determining that the preset cache cleaning condition is met.
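The triggering conditions above can be combined into a single check, sketched here in Python; the function and parameter names, and all default thresholds, are illustrative assumptions rather than values fixed by the embodiment.

```python
import time

def should_clean_cache(cache, *, max_records=10_000_000, max_bytes=None,
                       interval_s=None, last_clean_ts=0.0,
                       new_batch=False, total_bytes=0):
    """Return True when any preset cache cleaning condition holds."""
    if len(cache) > max_records:          # first number threshold exceeded
        return True
    if max_bytes is not None and total_bytes > max_bytes:
        return True                        # first storage amount threshold
    if interval_s is not None and time.time() - last_clean_ts >= interval_s:
        return True                        # cleaning time interval reached
    if new_batch:                          # a new batch of records arrived
        return True
    return False
```

Each condition is independent, matching the "one or more of" wording: the first that holds triggers a cleanup.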
In some exemplary embodiments, clearing the expired data cached on the target node includes one or more of the following:
calculating the cached duration of each history record in the cache, and deleting the cached records whose cached duration is greater than a preset first duration threshold; wherein cached duration = current time - recording time;
deleting the N history records with the earliest recording time according to the recording time of each history record; N is a positive integer;
and deleting the earliest M percent of history records according to the recording time of each history record; M is a positive number.
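The three clearing strategies can be sketched together in illustrative Python; the field name `record_time` and the parameter names are assumptions of this example.

```python
def clear_stale(cache, now, *, max_age_s=None, drop_n=None, drop_pct=None):
    """Apply one or more of the three expiry strategies to `cache`.

    `cache` maps a target identifier to a record dict carrying its
    `record_time`; `now` is the current time in the same units.
    """
    # Strategy 1: delete records cached longer than the first duration threshold.
    if max_age_s is not None:
        for k in [k for k, r in cache.items()
                  if now - r["record_time"] > max_age_s]:
            del cache[k]
    # Strategies 2 and 3: delete the N, or earliest M percent, oldest records.
    n = drop_n or 0
    if drop_pct:
        n = max(n, int(len(cache) * drop_pct / 100))
    if n:
        oldest = sorted(cache, key=lambda k: cache[k]["record_time"])[:n]
        for k in oldest:
            del cache[k]
```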
In some exemplary embodiments, satisfying the preset cache flush condition further comprises:
and when the current node, acting as a temporary processing node, has not received second records to be processed for one or more data partition identifiers within a preset duration, determining that the preset cache cleaning condition is met. The preset duration is not limited to a specific value and can be set flexibly according to business requirements.
Accordingly, in this case, clearing the expired data cached on the target node includes: deleting the history records corresponding to the one or more data partition identifiers from the cache.
It should be noted that, when the current node, acting as a temporary processing node, no longer receives records to be processed for one or more partitions outside its own processing range within the preset duration, it first determines the partition identifiers concerned and then deletes the cached history records of those partitions. Not receiving such records beyond the preset duration indicates that the second records to be processed for those partition identifiers probably no longer need to be shared by temporary processing nodes, because the home node (the priority processing node determined by the data partition identifier) has recovered; the current node therefore deletes the corresponding cached history records.
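The timeout-based eviction of temporarily served partitions might look as follows; tracking per-partition arrival times in a `last_seen` map is an assumption of this sketch, not a requirement of the embodiment.

```python
def evict_idle_temp_partitions(cache, last_seen, now, timeout_s, own_partitions):
    """Drop cached history for partitions this node served only temporarily
    and has not received records for within `timeout_s` (the home node has
    presumably recovered and taken them back)."""
    idle = {pid for pid, ts in last_seen.items()
            if pid not in own_partitions and now - ts > timeout_s}
    # Remove every cached record belonging to an idle temporary partition.
    for tid in [tid for tid, r in cache.items()
                if r["partition_id"] in idle]:
        del cache[tid]
    for pid in idle:
        del last_seen[pid]
```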
In some exemplary embodiments, the method further comprises:
Step 105, synchronizing the records cached on the target node to the archive library when a preset archiving condition is met.
In some exemplary embodiments, meeting the preset archiving condition includes one or more of:
when a preset cache archiving time interval is reached, determining that the preset archiving condition is met;
when a cache record is deleted, determining that the preset archiving condition is met;
when a cache record is updated, determining that the preset archiving condition is met;
when a cache record is newly added, determining that the preset archiving condition is met;
and when the number of updated and/or newly added cache records is larger than a preset change number threshold, determining that the preset archiving condition is met.
The archive library can be a database, a file, or another form of persistent storage; the specific form is not limited.
In some exemplary embodiments, the execution order of step 104 and step 105 may be adjusted according to the situation, and is not limited to a specific order.
It should be noted that the specific manner of synchronizing the cached data on each node to the archive library does not fall within the scope protected or defined by the present application; those skilled in the art may implement it according to the related art, and the specific manner is not limited.
In some exemplary embodiments, the method further comprises:
Step 106, clearing the expired data in the archive library when a preset archive cleaning condition is met.
In some exemplary embodiments, meeting the preset archive cleaning condition includes one or more of:
when the number of history records in the archive library is larger than a preset second number threshold, determining that the preset archive cleaning condition is met;
when the total storage size of the history records in the archive library is larger than a preset second storage amount threshold, determining that the preset archive cleaning condition is met;
and when a preset archive cleaning time interval is reached, determining that the preset archive cleaning condition is met.
In some exemplary embodiments, clearing the expired data in the archive library includes one or more of:
calculating the saved duration of each history record in the archive library, and deleting the history records whose saved duration is greater than a preset second duration threshold; wherein saved duration = current time - recording time;
deleting the P history records with the earliest recording time according to the recording time of each history record in the archive library; P is a positive integer;
and deleting the earliest Q percent of history records according to the recording time of each history record in the archive library; Q is a positive number.
The first duration threshold and the second duration threshold are set independently, and may be the same or different.
It should be noted that a node described in the embodiments of the present disclosure represents a logical node in a cluster executing a distributed task, not necessarily a physical node. In a Spark cluster processing scheme, one physical node contains one or more executors, each executor can independently execute a distributed task, and each executor is therefore also regarded as a node in the embodiments of the present application.
It can be seen that the data processing method in the cluster provided by the embodiments of the present disclosure uses a mechanism of priority processing nodes corresponding to data partitions and distributes data according to the data partition identifier of each record to be processed, so that each node in the cluster caches and processes the data of relatively fixed partitions. This avoids conflicts in multi-node cache data processing within the cluster, avoids the excessive per-node cache volume caused by randomly assigning processing nodes, and improves the processing performance of the system as a whole. In some exemplary embodiments, a secondary distribution mode shares the data processing tasks of unavailable nodes in time while preserving the original data processing ranges of the normally available nodes, improving the availability and robustness of the system overall.
An embodiment of the present disclosure further provides an electronic device, including:
one or more processors;
a storage device for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data processing method in a cluster as in any one of the above embodiments.
The embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the data processing method in the cluster as described in any one of the above embodiments.
Examples of the invention
In this example, a Spark-based scheme implements the cluster processing logic, taking the archive aggregation of vehicle-passing data in a vehicle monitoring system as an example; the overall business system logic is shown in fig. 2. In this example, each executor is a logical node, and in the following description a node means an executor unless otherwise specified.
Step 1: the vehicle-passing data detected and identified by the vehicle monitoring front-end equipment is recorded as in the following table:
Table 1: vehicle archive table (vehicle-passing records)
Each time a vehicle passes, the front-end vehicle monitoring equipment generates a new vehicle-passing record. The target identifier (id), namely the license plate color plus the license plate number, is used as the primary key (key) and hashed modulo the number of partitions to determine the data partition identifier, also called partition_id; in this example the recording time of the record to be processed is the passing time. Alternatively, the partition identifier (partition_id) in a newly generated vehicle-passing record may be null.
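The key-to-partition mapping of step 1 can be sketched as follows. The choice of MD5 as the hash function and the partition count of 16 are illustrative assumptions; the embodiment only requires some hash of the key taken modulo the number of partitions.

```python
import hashlib

NUM_PARTITIONS = 16  # illustrative partition count

def partition_id(plate_color, plate_number, num_partitions=NUM_PARTITIONS):
    """Hash the target identifier (license plate color + number) and take it
    modulo the partition count to obtain the partition_id."""
    key = f"{plate_color}:{plate_number}"
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions
```

Because the mapping is deterministic, every record for the same vehicle lands in the same partition, which is what later lets the cluster pin a vehicle's records to one node.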
Step 2: the records to be processed (vehicle-passing records) generated by the front-end vehicle monitoring equipment are first written to message middleware such as Kafka or RocketMQ, and the cluster system acquires the vehicle data (records to be processed) in real time using Spark Streaming.
Step 3: each Spark executor in the cluster performs the following processing:
3.1. Each node determines the data partition identifiers/partition numbers for which it is the priority processing node, and loads the archived vehicle-passing data from the archive library into executor memory as the historical vehicle-passing data cache. The partition number is taken modulo the number of executors to determine the priority processing node corresponding to that partition, and in this way each node can determine the partitions for which it is the priority processing node (preference node).
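The partition-to-node preference of step 3.1 reduces to a modulo mapping, sketched here with illustrative names:

```python
def preferred_node(pid, num_executors):
    """Priority (preference) node for a partition: the partition number
    taken modulo the executor count, as in step 3.1."""
    return pid % num_executors

def partitions_for_node(node_idx, num_partitions, num_executors):
    """Partitions for which the given executor is the priority node,
    i.e. the partitions whose history it should preload from the archive."""
    return [p for p in range(num_partitions)
            if preferred_node(p, num_executors) == node_idx]
```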
3.2. When the data partition identifier (partition_id) in the vehicle data to be processed acquired by Spark Streaming is empty, the key (the vehicle's unique identifier (id): license plate color + license plate number) is hashed modulo the number of partitions to determine the data partition identifier (partition_id), which is written into the vehicle data to be processed.
3.3. When Spark Streaming processes different batches, data with the same key is not guaranteed to be distributed to the same node. Therefore a priority processing node for each partition, also called Spark partition node preference, is determined first; Spark then distributes the data of each partition to the designated node according to the priority processing node (node preference), ensuring that the same vehicle is processed on the same node. The determination rule is: the data partition identifier/partition number is taken modulo the number of executors to obtain the priority processing node (preference node) of the partition, and the vehicle records to be processed are sent to the corresponding priority processing nodes. This ensures that the same key is sent only to the same executor on the same physical node, and each node is informed of the data partition identifiers/partition numbers to be sent to it.
3.4. To avoid the situation where the priority processing node (node preference) cannot be satisfied due to cluster failover, the following processing is performed:
1. When the priority processing node corresponding to a record to be processed is offline or down, the Spark driver performs rpc communication with the executors of all nodes in the cluster to obtain the new node count (executor count). Partition data (vehicle records to be processed) handled by the originally available nodes is still hashed modulo the original node count (original executor count), while partition data left without a processing node is hashed modulo the new node count. After a new node number is computed for such partition data, Spark sends it to the determined new node for processing. As shown in fig. 3, the partition data (vehicle records to be processed) originally processed by the executor on physical node 0 is redistributed to executors on other physical nodes. Fig. 3 schematically shows only the data streams secondarily distributed to executors on physical node 1 and does not depict the data streams secondarily distributed to executors on other physical nodes.
2. When the nodes receive the distributed vehicle records to be processed, they also learn the data partition identifiers/partition numbers delivered to them for processing. Before processing the partition data (vehicle records to be processed), a node reads the data partition identifier/partition number carried by the data; when it belongs to the set of partition identifiers the node is supposed to process, the data is processed normally. When it does not, the node loads all cached data corresponding to the partition identifier/partition number of the current vehicle record from the archive library into executor memory, and the Spark job continues to run.
3. When an offline or crashed node recovers, it performs rpc communication with the Spark driver; the Spark driver, once informed, performs rpc communication with all nodes in the cluster, the new node count is obtained, hash modulo is performed again, and processing continues as in item 2.
4. After a period of time (a preset, configurable duration), when a node no longer reads one or more partition numbers that do not belong to its own processing range, it deletes the cached history records corresponding to those partition numbers from its cache.
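The routing with failover fallback described in step 3.4 can be sketched as follows; the list-based executor registry and the set of available executors are assumptions of the example.

```python
def route(pid, executors, available):
    """Route a partition to its priority executor when available; otherwise
    fall back to a temporary executor chosen among the available ones
    (the secondary distribution of step 3.4).

    `executors` is the original executor list (its length is the original
    node count); `available` is the set of currently reachable executors.
    """
    primary = executors[pid % len(executors)]  # original modulo mapping
    if primary in available:
        return primary
    # Re-hash orphaned partitions over the currently available executors.
    alive = [e for e in executors if e in available]
    return alive[pid % len(alive)]
```

Note that partitions whose priority executor is still alive keep their original assignment; only orphaned partitions are re-hashed, which matches item 1 above.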
Step 4: the vehicle-passing data of each partition is filtered using the historical vehicle-passing information cached in executor (node) memory; the vehicle-passing information is updated in or newly added to the cache, and the newly added vehicle-passing information (including the partition number) is written into the archive library.
Step 5: an excessive amount of cached data in executor (node) memory can overflow the memory and crash the Spark cluster, so a first quantity threshold is set for the executors in a node (for example, 10 million, roughly the number of records remaining after deduplicating the vehicle-passing information of a city).
Before the Spark Streaming job processes each batch of data, the memory cache is cleaned first, with the following rules:
1. Traverse all data in the cache, compare the passing time of each record with the current server system time, and delete from the cache the records exceeding the given cache retention period (the first duration threshold).
2. Check the current amount of data in the cache; when it exceeds the maximum capacity (the first quantity threshold, 10 million), sort the cached data by passing time and delete the 1 million records (one tenth of the cache's maximum capacity) with the earliest passing times.
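The two per-batch cleaning rules of step 5 can be sketched together; the thresholds shown are the example's values, and the field name `pass_time` is an assumption.

```python
def clean_cache(cache, now, *, max_age_s, max_records=10_000_000):
    """Per-batch cache cleaning from step 5: drop records past the retention
    period, then, if still over capacity, drop the oldest tenth."""
    # Rule 1: age-based eviction against the first duration threshold.
    for k in [k for k, r in cache.items()
              if now - r["pass_time"] > max_age_s]:
        del cache[k]
    # Rule 2: capacity-based eviction of the records with the earliest
    # passing times (one tenth of the maximum capacity).
    if len(cache) > max_records:
        n = max_records // 10
        for k in sorted(cache, key=lambda k: cache[k]["pass_time"])[:n]:
            del cache[k]
```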
Step 6: delete from the vehicle archive table in the archive library the records whose passing time exceeds the archive retention period (the second duration threshold).
It can be seen that the related technical scheme judges deduplication based on the global uniqueness of a Redis cluster and requires frequent interaction with the Redis cluster, which easily brings the Redis cluster down and greatly increases server load.
In this embodiment, caching in Spark cluster memory avoids impact on other components; adding nodes to the Spark cluster linearly scales throughput and achieves load balancing, so the Spark cluster can implement vehicle archive aggregation with high performance, high real-time capability, high scalability, and high fault tolerance.
The data processing method in the cluster provided by this example uses a distributed in-memory streaming computing framework, determines partitions by hashing the vehicle's unique identifier modulo the partition count, and distributes vehicle data to executors through the mechanism of assigning priority processing nodes per partition (partition node preference), achieving load balancing. Executor memory serves as a local cache of the vehicle archive library, making each executor's data processing partitioned and localized and allowing flexible linear scaling. For the case where failover in the cluster takes an executor offline so that vehicle-passing data cannot be forwarded according to the priority processing node (partition node preference), a temporary archive cache loading mechanism provides fault tolerance for cluster failures and significantly improves the robustness of the whole system.
It can be understood by those skilled in the art that, when a cluster scheme of another application processes its own recorded data, the data processing scheme provided by the embodiments of the present disclosure likewise keeps each node in the cluster caching and processing the data of relatively fixed partitions, avoiding conflicts in multi-node cache processing and the excessive per-node cache volume caused by randomly assigning processing nodes, and significantly improving the processing performance and stability of the cluster system. The vehicle-passing-record archive scheme described in the embodiments of the present disclosure illustrates the relevant details and does not limit the scope of application of the embodiments.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, and functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, or suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those skilled in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media as known to those skilled in the art.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all modifications and equivalents of the present invention, which are made by the contents of the present specification and the accompanying drawings, or directly/indirectly applied to other related technical fields, are included in the scope of the present invention.
Claims (10)
1. A method of data processing in a cluster, comprising:
acquiring a record to be processed;
according to the data partition identification in the record to be processed, determining that a priority processing node corresponding to the data partition identification in the cluster is a target node for processing the record to be processed, and distributing the record to be processed to the target node;
and updating the history record set cached on the target node according to the record to be processed.
2. The method of claim 1,
the priority processing node corresponding to the data partition identifier in the cluster is determined according to the following modes:
and performing hash modulo according to the data partition identifier and the number of processing nodes included in the cluster, and determining a priority processing node corresponding to the data partition identifier in the processing nodes included in the cluster.
3. The method of claim 2,
the method further comprises the following steps: loading cache data by nodes in the cluster;
the method for loading the cache data by the nodes in the cluster comprises the following steps:
and determining one or more corresponding data partition identifications when the node serves as a priority processing node, acquiring the corresponding history records from the archive library according to the one or more data partition identifications, and loading them into a cache.
4. The method of claim 2,
the determining, according to the data partition identifier in the record to be processed, that the priority processing node corresponding to the data partition identifier in the cluster is the target node for processing the record to be processed includes:
determining a priority processing node corresponding to the data partition identifier in the cluster according to the data partition identifier in the record to be processed;
determining the priority processing node as a target node for processing the record to be processed under the condition that the priority processing node is available;
acquiring the number of currently available nodes in the cluster under the condition that the priority processing node is unavailable; performing hash modulo according to the data partition identification and the number of the current available nodes in the cluster, and determining a temporary processing node corresponding to the data partition identification in the current available nodes; and determining the temporary processing node as a target node for processing the record to be processed.
5. The method of claim 3,
the updating the history record set cached on the target node according to the record to be processed includes:
under the condition that the target node is the priority processing node corresponding to the data partition identification in the record to be processed, updating the history record of the target identifier cached on the target node or adding the record to be processed into the cache of the target node according to the target identifier in the record to be processed;
under the condition that the target node is not the priority processing node corresponding to the data partition identification in the record to be processed, acquiring the corresponding history records from the archive library according to the data partition identification in the record to be processed and synchronizing them into the cache of the target node; and updating the history record of the target identifier cached on the target node or adding the record to be processed into the cache of the target node according to the target identifier in the record to be processed.
6. The method of any one of claims 1 to 5,
the method further comprises the following steps:
and when a preset cache cleaning condition is met, cleaning the overdue data cached on the target node.
7. The method of any one of claims 1 to 5,
the method further comprises the following steps:
and synchronizing the records cached on the target node to the archive library when a preset archiving condition is met.
8. The method of any one of claims 1 to 5,
the data partition identification in the record to be processed is determined according to the following mode:
and performing Hash modulo according to the target identifier in the record to be processed and the number of partitions to determine the data partition identifier in the record to be processed.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the data processing method in the cluster of any of claims 1-8.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the data processing method in the cluster according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110802810.9A CN115617469A (en) | 2021-07-15 | 2021-07-15 | Data processing method in cluster, electronic device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115617469A true CN115617469A (en) | 2023-01-17 |
Family
ID=84856270
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110802810.9A Pending CN115617469A (en) | 2021-07-15 | 2021-07-15 | Data processing method in cluster, electronic device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115617469A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116561171A (en) * | 2023-07-10 | 2023-08-08 | 浙江邦盛科技股份有限公司 | Method, device, equipment and medium for processing dual-time-sequence distribution of inclination data |
CN116561171B (en) * | 2023-07-10 | 2023-09-15 | 浙江邦盛科技股份有限公司 | Method, device, equipment and medium for processing dual-time-sequence distribution of inclination data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10454754B1 (en) | Hybrid cluster recovery techniques | |
CN109597567B (en) | Data processing method and device | |
CN113111129B (en) | Data synchronization method, device, equipment and storage medium | |
US8874960B1 (en) | Preferred master election | |
US10394782B2 (en) | Chord distributed hash table-based map-reduce system and method | |
CN102938001A (en) | Data loading device and data loading method | |
CN106201839B (en) | Information loading method and device for business object | |
CN104679897A (en) | Data retrieval method under big data environment | |
CN112256433B (en) | Partition migration method and device based on Kafka cluster | |
CN103501319A (en) | Low-delay distributed storage system for small files | |
CN104679896A (en) | Intelligent retrieval method under big data environment | |
CN107133334B (en) | Data synchronization method based on high-bandwidth storage system | |
CN110362426B (en) | Selective copy realization method and system for bursty load | |
CN115756955A (en) | Data backup and data recovery method and device and computer equipment | |
CN115617469A (en) | Data processing method in cluster, electronic device and storage medium | |
CN115599807A (en) | Data access method, device, application server and storage medium | |
CN104679893A (en) | Information retrieval method based on big data | |
CN112632058A (en) | Track determination method, device and equipment and storage medium | |
US20240176762A1 (en) | Geographically dispersed hybrid cloud cluster | |
CN113268518B (en) | Flow statistics method and device and distributed flow statistics system | |
US20220147486A1 (en) | System and method for managing timeseries data | |
CN109828718B (en) | Disk storage load balancing method and device | |
CN113835613A (en) | File reading method and device, electronic equipment and storage medium | |
CN106959888B (en) | Task processing method and device in cloud storage system | |
CN115695551B (en) | Vehicle data sending method and system, cache server and intermediate server |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||