CN115934304A - Data processing method and device, computer equipment and readable storage medium

Info

Publication number
CN115934304A
CN115934304A (application CN202110949487.8A)
Authority
CN
China
Prior art keywords
data processing
parallelism
data
processing engine
state information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110949487.8A
Other languages
Chinese (zh)
Inventor
郑祥云 (Zheng Xiangyun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110949487.8A priority Critical patent/CN115934304A/en
Publication of CN115934304A publication Critical patent/CN115934304A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The embodiment of the application discloses a data processing method, a data processing device, computer equipment and a readable storage medium, wherein the data processing method comprises the following steps: acquiring state information of a first stream data processing engine when processing data stored in a message middleware; if the state information meets the parallelism adjusting condition, determining the target parallelism according to the state information; and adjusting the parallelism of the message middleware according to the target parallelism, and determining a second stream data processing engine according to the target parallelism, wherein the second stream data processing engine is used for processing the data stored in the message middleware after the parallelism is adjusted. By the embodiment of the application, the parallelism can be adjusted reasonably by detecting and monitoring faults in the data processing process in real time, elastic scaling of the parallelism can be realized without stopping the stream data processing engine, and lossless, real-time data processing can be ensured.

Description

Data processing method and device, computer equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method and apparatus, a computer device, and a readable storage medium.
Background
In a real-time stream computing system, when stream computation encounters a sudden increase in data volume, that is, when data reporting traffic increases, the stream processing computation engine cannot process the data in time. For example, when Kafka is used as the message queue, data accumulates and Flink (a data stream execution engine) consumption generates backpressure, so the parallelism of Flink and Kafka needs to be expanded to improve the real-time data processing capability.
However, current solutions adjust the parallelism only coarsely, which may result in wasted or insufficient resources, and they usually implement expansion or reduction of parallelism by stopping the processing of data streams to reallocate resources, which causes data processing delay or data loss. For this reason, it is necessary to design a scheme that can both appropriately adjust the parallelism and guarantee the data processing quality.
Disclosure of Invention
The embodiment of the application provides a data processing method, a data processing device, a computer device and a readable storage medium, which can reasonably adjust the parallelism by detecting and monitoring faults in the data processing process in real time, realize elastic scaling of the parallelism without stopping the stream data processing engine, and ensure lossless and real-time data processing.
An embodiment of the present application provides a data processing method, including:
acquiring state information of a first stream data processing engine when processing data stored in a message middleware;
if the state information meets the parallelism adjusting condition, determining the target parallelism according to the state information;
and adjusting the parallelism of the message middleware according to the target parallelism, and determining a second stream data processing engine according to the target parallelism, wherein the second stream data processing engine is used for processing the data stored in the message middleware after the parallelism is adjusted.
An embodiment of the present application provides a data processing apparatus, including:
the acquisition module is used for acquiring state information of the first stream data processing engine when processing data stored in the message middleware;
the determining module is used for determining the target parallelism according to the state information if the state information meets the parallelism adjusting condition;
the determining module is further configured to adjust the parallelism of the message middleware according to the target parallelism, and determine a second stream data processing engine according to the target parallelism, where the second stream data processing engine is configured to process data stored in the message middleware after the parallelism is adjusted.
An aspect of an embodiment of the present application provides a computer device, including: a processor, a memory, and a network interface; the processor is connected with the memory and the network interface, wherein the network interface is used for providing a network communication function, the memory is used for storing program codes, and the processor is used for calling the program codes to execute the data processing method in the embodiment of the application.
In one aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program, where the computer program includes program instructions, and when the program instructions are executed by a processor, the data processing method in the embodiments of the present application is performed.
Accordingly, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the data processing method provided by one aspect of the embodiment of the application.
In the embodiment of the application, the target parallelism is determined according to the state information of the first stream data processing engine when it processes the data stored in the message middleware, and the parallelism can be acquired as required, so that the parallelism matches the specific data processing requirement and resources are not wasted. In addition, a new stream data processing engine, namely the second stream data processing engine, is determined according to the target parallelism and processes the data of the message middleware after the parallelism adjustment in time; the data in the message middleware can be processed concurrently without stopping the first stream data processing engine, so that data processing is neither delayed nor lossy, and the real-time performance and accuracy of data processing are improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a diagram of a network architecture of a data processing system provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a data processing method according to an embodiment of the present application;
fig. 3 is a schematic diagram illustrating an effect of a data processing method applied to an actual environment according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a data processing method according to an embodiment of the present application;
fig. 5 is a functional block diagram of a data processing method according to an embodiment of the present application;
fig. 6 is a schematic flowchart of parallelism expansion according to an embodiment of the present application;
fig. 7 is a schematic flowchart of a data processing method according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The scheme provided by the embodiment of the application belongs to cloud computing, which falls within the field of cloud technology. Cloud technology is a hosting technology for unifying series resources such as hardware, software and network in a wide area network or a local area network to realize calculation, storage, processing and sharing of data. It is a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like applied in the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. Background services of technical network systems require a large amount of computing and storage resources, such as video websites, picture websites and other web portals. With the rapid development and application of the internet industry, each article may have its own identification mark that needs to be transmitted to a background system for logic processing; data at different levels are processed separately, and all kinds of industrial data need strong system background support, which can only be realized through cloud computing. In this scheme, the data stored in the message middleware is processed by the stream data processing engine, that is, the data is processed through cloud computing.
Cloud computing is a computing model that distributes computing tasks over a resource pool formed of large numbers of computers, enabling various application systems to obtain computing power, storage space and information services as needed. The network that provides the resources is referred to as the "cloud". To the user, the resources in the "cloud" appear infinitely expandable, available on demand at any time, and paid for per use.
The cloud computing resource pool mainly comprises computing devices (virtualized machines including an operating system), storage devices and network devices, which constitute the Infrastructure as a Service (IaaS) layer. According to logical function division, a Platform as a Service (PaaS) layer can be deployed on the IaaS layer, and a Software as a Service (SaaS) layer can be deployed on the PaaS layer or directly on the IaaS layer.
Fig. 1 is a network architecture diagram of a data processing system according to an embodiment of the present application, including a plurality of terminal devices 100 and servers with different functions, where the servers include a data source access server 101, a monitoring server 102, a message middleware server 103, a data processing server 104, a warehousing server 105, and a database 106.
The terminal device 100 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited thereto. As a data source, it may generate various types of data to be processed, such as image data, text data and audio data. Data sources may be classified from different angles, e.g., log data, structured data, IoT data (a kind of highly unstructured data) and file data; to some extent, a data source may also be data stored in other databases. The data to be processed in a specific scenario is usually massive: for example, in an e-commerce system, the data generated by a terminal device may be a user's browsing records, order numbers and transaction amounts for a commodity, or the sales volume, access frequency and collection amount of a commodity within a fixed time. In different application scenarios the data objects to be processed may differ, and the magnitude and type of data processed by the terminal device 100 and the network architecture are not limited herein.
The data source access server 101 serves as an access layer that performs access processing on data sources from terminal devices or other computer devices. Data access realizes unified management of data from different sources and reduces access cost mainly by standardizing a unified access mode; as a basic and important link of a data processing platform, a well-built data access bottom layer provides more stable and reliable transmission service for the upper layer and fully realizes the value of the data. In this embodiment of the present application, mass data may be fed to the data processing platform through a message middleware cluster (e.g., a Kafka cluster) deployed in the message middleware server 103, where the Kafka cluster serves as a message delivery system and plays the role of a data transmission pipeline, and the data processing platform refers to a stream data processing engine (e.g., Flink) deployed in the data processing server 104.
The data processing server 104 obtains the normalized data from the message middleware server 103, and the stream data processing engine started in it realizes real-time or offline processing of the massive data, for example sorting, screening, aggregating and calculating over millions of records, to produce a data processing result. The result is stored into the database 106 through the message middleware cluster deployed in the warehousing server 105; that is, the warehousing server 105 has a data transmission function similar to that of the message middleware server 103, which ensures that data is not lost when the database 106 or the data processing server 104 fails (such as downtime or message congestion). The database 106 may be Druid, Elasticsearch, etc., and is not limited thereto.
The monitoring server 102 plays a key role in the data processing system: it can monitor in real time the change in consumption lag of the Kafka data offset in the message middleware server 103. When too large a data backlog is detected, the monitoring server 102 can determine the parallelism according to the specific change and issue an instruction to the message middleware server 103 to expand the message middleware according to the determined parallelism; at the same time, it issues an instruction to the data processing server 104 to pull up a stream data processing engine with expanded parallelism, i.e., a new Flink stream, while the original Flink stream in the data processing server 104 stops running because its parallelism cannot support real-time processing of the data stream. Thus, when the data volume surges, the capacity of the data transmission pipeline in the message middleware server 103 and of the stream data processing engine in the data processing server 104 can be expanded in time, so that the data is processed promptly and data loss or delayed processing is avoided. Conversely, when data is being processed at high concurrency and the data amount falls rapidly within a certain period, the parallelism can be obtained according to the specific data processing requirement, and the parallelism of the message middleware and the stream data processing engine adjusted in the same way, which is not repeated here.
It can be seen that, with the monitoring server 102 monitoring the message middleware server 103 in real time, when data processing requires it the monitoring server 102 issues an instruction for the data processing server 104 to pull up a new Flink stream with adjusted parallelism, so the data in the message middleware can be processed by the new stream data processing engine without suspending the original Flink stream; the monitoring server 102 can then issue an instruction for the data processing server 104 to stop the original stream data processing engine at a suitable time, realizing a seamless handover of data processing and achieving the goal of no loss and no delay. The parallelism is determined from the data lag, so the adjustment of the parallelism is more reasonable and more accurate. The monitoring service in the monitoring server 102, acting as a management tool, can monitor and handle faults such as data backpressure in real time, so that the parallelism is adjusted flexibly.
It should be noted that the server may be an independent physical server, may also be a server cluster or distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. The terminal device 100 and the server (here, the data source access server 101) may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
Further, for ease of understanding, the methods provided in the following embodiments of the present application are all described as being implemented by the monitoring server 102. Referring to fig. 2, fig. 2 is a schematic flow chart of a data processing method according to an embodiment of the present application, where the data processing method at least includes the following steps S101 to S103:
s101, state information of the first stream data processing engine in processing the data stored in the message middleware is obtained.
In one embodiment, the stream data processing engine may be Flink, an open-source distributed streaming framework that provides functions such as data distribution, data communication and fault-tolerance mechanisms for distributed computation over data streams. The message middleware may be Kafka, a messaging system designed for distributed high-throughput systems that can deliver messages from one endpoint to another. For ease of understanding, in the embodiments of the present application the stream data processing engine is Flink and the message middleware is Kafka. Accordingly, the first stream data processing engine refers to the Flink data stream execution engine that currently acquires and processes Kafka data. Based on the architecture of the data processing system provided in fig. 1, data from different sources is normalized by the data source access server 101 and stored in the message middleware Kafka, and the stored data is processed by the stream data processing engine in the data processing server 104; the processing may essentially be any one or more of aggregation, conversion and calculation on data read in real time from a Kafka topic, with the processed result then written into a new topic. From one point of view, the stream data processing engine can therefore be regarded as a consumer that actively pulls data from the message middleware for processing (or consumption), while Kafka, as the message middleware, is what a producer writes messages into, i.e., the collected data is stored into the message middleware, and a real-time stream data pipeline is established through the message middleware so that data is obtained reliably between systems or applications.
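As a concrete illustration of the read-process-write pattern just described, the following is a minimal sketch using Flink's KafkaSource/KafkaSink connectors (available since Flink 1.14); the broker addresses, topic names, group id and placeholder transformation are assumptions for illustration, not taken from the patent:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ConsumeTransformProduce {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Consumer side: pull data from the input topic of the message middleware.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka-a:9092")            // hypothetical address
                .setTopics("raw-events")                        // hypothetical input topic
                .setGroupId("flink-consumer-group")             // hypothetical group id
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Producer side: write the processed result into a new topic.
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("kafka-b:9092")            // hypothetical address
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("processed-events")           // hypothetical output topic
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-a")
                .map(s -> s.trim().toLowerCase())               // placeholder transformation
                .returns(Types.STRING)
                .sinkTo(sink);

        env.execute("consume-transform-produce");
    }
}
```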
In the process of data processing, objective or subjective factors, such as a sudden sharp increase in data volume or downtime of the stream data processing engine, may cause the consumer's consumption speed to fall behind the producer's data production speed, resulting in a backlog of data, so that data received by an upstream node cannot be processed by the downstream node in time, and problems such as data loss or processing delay occur. In the scheme provided by the application, the monitoring server 102 acquires the state information of the stream data processing engine in real time as an important index to control the data consumption progress, where the state information indicates whether the stream data processing engine can process the data stored in the message middleware in time. The state information may optionally be obtained with a command line tool script bundled with the message middleware, through API programming, or in other ways; the data representation used for the state information is not limited herein.
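One hedged illustration of the API route: the sketch below totals the consumer Lag with the Kafka AdminClient (the group id is an assumption; the kafka-consumer-groups.sh script bundled with Kafka reports the same figures from the command line):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import java.util.HashMap;
import java.util.Map;

final class LagProbe {
    /** Total Lag = sum over partitions of (log end offset - committed offset). */
    static long totalLag(AdminClient admin, String groupId) throws Exception {
        Map<TopicPartition, OffsetAndMetadata> committed = admin
                .listConsumerGroupOffsets(groupId)
                .partitionsToOffsetAndMetadata().get();

        Map<TopicPartition, OffsetSpec> latestSpec = new HashMap<>();
        committed.keySet().forEach(tp -> latestSpec.put(tp, OffsetSpec.latest()));
        Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(latestSpec).all().get();

        long lag = 0;
        for (Map.Entry<TopicPartition, OffsetAndMetadata> e : committed.entrySet()) {
            if (e.getValue() == null) continue;     // no committed offset yet
            lag += latest.get(e.getKey()).offset() - e.getValue().offset();
        }
        return lag;
    }
}
```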
And S102, if the state information meets the parallelism adjusting condition, determining the target parallelism according to the state information.
In an embodiment, when the state information acquired by the monitoring server 102 indicates that, as the data processing server 104 processes data in the message middleware server 103 in real time, the speed at which data is produced upstream of some node of the data pipeline exceeds the speed at which that node processes it (i.e., backpressure), the parallelism of the current data processing needs to be expanded. Alternatively, after the stream data processing engine has processed data at high saturation for a period of time, the state information may indicate that the amount of collected data has clearly fallen, and to save computing resources the parallelism required for the data processing can be reduced. In short, whether the parallelism needs to be adjusted can be measured by the state information, and the other role of the state information is to determine the target parallelism ultimately required for the data processing.
Optionally, the state information includes a lag amount of data consumption, and before this step the method further includes: if the duration of continuous increase of the lag amount of data consumption reaches a first duration and the increase per unit time is greater than or equal to a first quantity threshold, or the lag amount of data consumption is greater than or equal to a second quantity threshold, determining that the state information satisfies the parallelism adjustment condition. The lag amount of data consumption is referred to herein as the Lag value, which represents the amount of data by which the consumer lags behind the producer; that is, the Lag value is the difference between the displacement value of the latest consumed message and the displacement value of the latest produced message (or the difference between the topic record amount and the consumer group's consumption progress (offset)). It can be understood simply as how far the stream data processing engine's processing lags behind the data stored in the message middleware. Under normal processing conditions the Lag amount should be close to 0, indicating that the consumer consumes the messages produced by the producer in time with only a very small degree of lag. If the Lag value is small and within the range that the stream data processing engine can bear, that is, the stream data processing engine and the message middleware can absorb a small Lag by themselves, the parallelism need not be adjusted. Conversely, when the Lag value grows large or keeps increasing past a certain threshold, indicating that the consumer cannot keep up with the producer and downstream message processing is slowing, and the Lag significantly exceeds what the stream data processing engine can process and the message middleware can accept, necessary measures must be taken to expand the capacity of the relevant data tools; the specifics are not detailed here. That boundary is exactly the judgment on whether the parallelism adjustment condition is satisfied, and it has two forms: one is the joint judgment of the duration of continuous Lag growth together with the increase per unit time, and the other is the comparison of the Lag value itself against the second quantity threshold; when the Lag satisfies either of them, it can be determined that the state information satisfies the parallelism adjustment condition. Illustratively, let the first duration be 1 h (hour), the first quantity threshold be 2w (w denotes ten thousand), and the second quantity threshold be 200w. The monitoring service in the monitoring server 102 checks the delay condition of the stream data processing engine every 1 minute; if the Lag value of the message middleware has increased continuously for 1 h, growing by 2w per minute, it may be determined that the Lag satisfies the parallelism adjustment condition, and likewise if the Lag value is greater than 200w. Otherwise, if neither is reached, the parallelism adjustment condition is not satisfied. In this case, because of the increasing trend of the Lag value, the corresponding parallelism adjustment means expanding the parallelism.
Alternatively, whether the state information satisfies the parallelism adjustment condition may be judged as follows: if the duration of continuous decrease of the lag amount of data consumption reaches a second duration and the decrease per unit time is greater than or equal to a third quantity threshold, or the lag amount of data consumption is less than or equal to a fourth quantity threshold, it is determined that the state information satisfies the parallelism adjustment condition. This judgment mirrors the adjustment condition for expanding the parallelism and is the decision condition for reducing it. Generally, when the stream data processing engine and the message middleware both process data at high concurrency, the lag amount of data consumption necessarily keeps decreasing, and when it falls close to 0, the stream data processing engine can carry the data processing load. In another case, however, the data amount shrinks at the source and the corresponding lag amount also gradually decreases; the data then no longer needs such a high parallelism to carry the data processing function. Therefore, when the lag amount of data consumption in the state information monitored by the monitoring server 102 decreases continuously, and the decrease duration and the decrease per unit time reach the respective thresholds (the second duration and the third quantity threshold), or the Lag value is directly less than or equal to the fourth quantity threshold, it can be determined that the state information satisfies the parallelism adjustment condition. Illustratively, if the monitoring server 102 detects that the current Lag value has decreased from 500w at a rate of 5w per minute (the third quantity threshold) for 30 minutes (the second duration), or that the Lag value has dropped from 500w directly to 200w (the fourth quantity threshold), it may be determined that the state information satisfies the parallelism adjustment condition, and the current parallelism is adjusted after the target parallelism is determined according to the state information.
It should be noted that the second time duration may be the same as or different from the first time duration, and similarly, the third quantity threshold may be the same as or different from the first quantity threshold, and the fourth quantity threshold may also be the same as or different from the second quantity threshold, which is not limited herein.
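A minimal sketch of the trigger logic above, assuming the monitoring service feeds one Lag sample per minute; the constants mirror the examples in the text (w = 10,000 records), a single one-hour window is shared by both directions for brevity, and in practice the ceiling and floor thresholds would be chosen apart from each other:

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class ScalingTrigger {
    private static final long W = 10_000L;
    private static final int WINDOW_MINUTES = 60;      // first/second duration: 1 h
    private static final long GROWTH_PER_MIN = 2 * W;  // first quantity threshold
    private static final long LAG_CEILING = 200 * W;   // second quantity threshold
    private static final long SHRINK_PER_MIN = 5 * W;  // third quantity threshold
    private static final long LAG_FLOOR = 200 * W;     // fourth quantity threshold

    private final Deque<Long> samples = new ArrayDeque<>();

    /** Feed one Lag sample per minute before querying either condition. */
    void record(long lag) {
        samples.addLast(lag);
        if (samples.size() > WINDOW_MINUTES) samples.removeFirst();
    }

    boolean shouldExpand() {
        if (!samples.isEmpty() && samples.peekLast() >= LAG_CEILING) return true;
        return sustainedChange(GROWTH_PER_MIN);        // grew >= 2w every minute for 1 h
    }

    boolean shouldShrink() {
        if (!samples.isEmpty() && samples.peekLast() <= LAG_FLOOR) return true;
        return sustainedChange(-SHRINK_PER_MIN);       // fell >= 5w every minute for 1 h
    }

    private boolean sustainedChange(long perMinute) {
        if (samples.size() < WINDOW_MINUTES) return false;
        Long[] s = samples.toArray(new Long[0]);
        for (int i = 1; i < s.length; i++) {
            long delta = s[i] - s[i - 1];
            // For growth require delta >= perMinute; for shrink require delta <= perMinute.
            if (perMinute > 0 ? delta < perMinute : delta > perMinute) return false;
        }
        return true;
    }
}
```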
In an embodiment, an optional implementation of determining the target parallelism according to the state information may be: acquiring the reference data processing amount corresponding to a single parallelism and the current parallelism of the message middleware; determining a parallelism adjustment amount according to the reference data processing amount and the lag amount of data consumption; and determining the target parallelism according to the current parallelism and the parallelism adjustment amount. In this embodiment, the maximum amount of message middleware data that a single parallelism of the stream data processing engine can consume is called the reference data processing amount, and the current parallelism of the message middleware is the parallelism before adjustment. The ratio of the lag amount of data consumption to the reference data processing amount is used as the parallelism adjustment amount: for example, with 100w as the reference data processing amount corresponding to one parallelism and a lag amount of 200w, the adjustment is 200w/100w = 2 parallelisms; that is, the parallelism adjustment amount corresponds to the lag amount. Depending on the situation, the target parallelism is obtained by adding the parallelism adjustment amount to, or subtracting it from, the current parallelism, corresponding respectively to expansion and reduction. Optionally, according to the service characteristics and the actual condition of the current physical resources, the user may also use parameter configuration in the monitoring service to control whether to automatically reduce capacity after automatic expansion, or whether to automatically expand again after automatic reduction, which enhances the flexibility of the system.
In short, the adjustment of parallelism requires reference to the current parallelism, the reference data processing amount, and the lag amounts at different moments. Illustratively, if the lag amount increases from 0 to 200w, the parallelism needs to be increased on the basis of the current parallelism, where 0 and 200w correspond to the lag amounts of data consumption at different moments; specifically, two parallelisms are added according to the rule that 100w corresponds to one parallelism. Conversely, if the lag decreases from 200w to 0, the parallelism needs to be reduced on the basis of the current parallelism, and similarly 2 parallelisms are removed. It should be noted that the calculation of the parallelism adjustment amount and the target parallelism can be realized by the monitoring service application in the monitoring server 102. The parallelism adjustment amount for expansion can be determined from the lag amount at the final moment and the reference data processing amount, but the adjustment for reduction needs to refer to the lag amounts at different moments and the reference data processing amount, because in that case the current parallelism is entirely sufficient, indeed redundant, for supporting the data processing; deriving the parallelism adjustment amount from the change in the lag amount lets the reduced configuration complete the data processing function while making full use of the existing resources.
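Read as code, the rule might look like the sketch below; rounding up when the Lag is not an exact multiple of the reference amount, and the floor of one parallelism, are our assumptions:

```java
final class ParallelismPlanner {
    /**
     * Derive the target parallelism from the Lag change between two moments,
     * assuming a single parallelism carries `reference` records (100w in the
     * example above). A positive change expands, a negative change reduces.
     */
    static int targetParallelism(int current, long lagBefore, long lagNow, long reference) {
        long delta = lagNow - lagBefore;
        int adjustment = (int) Math.ceil(Math.abs((double) delta) / reference);
        return delta >= 0
                ? current + adjustment                 // expansion
                : Math.max(1, current - adjustment);   // reduction, never below 1
    }
}

// e.g. targetParallelism(2, 0, 2_000_000, 1_000_000) == 4  (Lag grew 0 -> 200w)
//      targetParallelism(4, 2_000_000, 0, 1_000_000) == 2  (Lag fell 200w -> 0)
```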
It can be seen that whether the parallelism of the stream data processing engine or that of the message middleware is being expanded or reduced, it can be determined from the lag amount of data consumption and the maximum data processing amount carried by a single parallelism, so that the adjustment of the parallelism is reasonable and widely applicable.
S103, adjusting the parallelism of the message middleware according to the target parallelism, and determining a second stream data processing engine according to the target parallelism, wherein the second stream data processing engine is used for processing the data stored in the message middleware after the parallelism is adjusted.
In one embodiment, since the message middleware establishes the real-time stream data pipelines, the parallelism of the message middleware may be the number of real-time stream data pipelines, understood colloquially as one real-time data pipeline corresponding to one parallelism. Adjusting the parallelism of the message middleware according to the target parallelism means adjusting the current parallelism of the message middleware to the target parallelism, which covers either expanding or reducing the parallelism. Optionally, the parallelism of the message middleware can be expanded or reduced by calling the API interface of the message middleware Kafka, so that after adjustment the message middleware can store the messages (i.e., data) produced by the producer in time, or the existing message middleware can be fully utilized to store the data.
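As an illustration of that API call, a sketch using AdminClient.createPartitions; note as an aside that Kafka itself only allows increasing a topic's partition count, so the reduction direction would leave the partitions in place and simply consume them with fewer threads:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;
import java.util.Collections;

final class TopicScaler {
    /** Grow the topic's partition count to the target parallelism. */
    static void expandTo(AdminClient admin, String topic, int targetParallelism) throws Exception {
        admin.createPartitions(Collections.singletonMap(
                topic, NewPartitions.increaseTo(targetParallelism)))
             .all().get();   // block until the broker applies the change
    }
}
```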
In addition, there is a corresponding adjustment for the stream data processing engine: a second stream data processing engine is determined according to the target parallelism. The second stream data processing engine can be a Flink stream pulled up anew according to the target parallelism, and unlike the first stream data processing engine, the parallelism with which it processes data matches the target parallelism. The data processing capability of the second stream data processing engine is adapted to the data volume, so resources can be fully utilized to consume the data stored in the message middleware. In this process, because the second stream data processing engine is determined separately, its parallelism is expanded or reduced relative to that of the first stream data processing engine without stopping the first stream data processing engine that is currently processing data, thereby avoiding a system halt and any delay or loss in data processing.
Based on this scheme, when the reported data amount surges and the stream data processing engine cannot keep up, losslessly expanding Kafka and Flink keeps the system running without pause and without data loss, the parallelism can be expanded to a reasonable number, and resources are used reasonably. For the effect of the scheme in a real environment, see fig. 3: the single-parallelism processing capability of a Flink stream is 100w/min, and at around 20:00 backpressure arose because Flink could not complete the computation; through calculation the monitoring server reasonably expanded the upstream Kafka parallelism to 2 and the Flink parallelism to 2, after which the Flink throughput naturally rose, reaching a data throughput of 1,609,189 at 22:38, i.e., about 160w/min. The different curves represent the amount of Kafka data processed by the Flink stream on different dates.
In summary, the embodiments of the present application have at least the following advantages:
the state information of the consumption data of the streaming data engine is monitored in real time through the monitoring server, the real-time monitoring of the data processing state is realized, and corresponding adjustment is timely made according to the state information, so that problems occurring in the data processing process are timely responded; under the condition of proper parallelism adjustment, the expansion or contraction quantity of the parallelism (namely the target parallelism) is accurately determined according to the state information, so that the parallelism of the message middleware can be reasonably adjusted, and the situations of resource waste or insufficient resources are avoided; in addition, a new stream data processing engine is determined according to the target parallelism degree to correspondingly process the data in the message middleware after the parallelism degree is adjusted, so that the real-time property of data processing is prevented from being sacrificed, the data is ensured not to be lost or delayed during capacity expansion or capacity reduction, and further the lossless capacity expansion is realized.
Referring to fig. 4, fig. 4 is a schematic flowchart of a data processing method according to an embodiment of the present disclosure. The data processing method may include at least the following steps S201 to S204:
s201, state information of the first stream data processing engine in processing the data stored in the message middleware is obtained.
And S202, if the state information meets the parallelism adjusting condition, determining the target parallelism according to the state information.
The specific implementation manner of steps S201 to S202 may refer to S101 to S102 in the corresponding embodiment of fig. 2, which is not described herein again.
S203, adjusting the parallelism of the message middleware according to the target parallelism, and creating a second stream data processing engine on the data processing server.
In an embodiment, the parallelism of the message middleware is adjusted as in the foregoing embodiment, which is not repeated here. Alternatively, the data processing server may refer to the data processing server 104 shown in fig. 1, in which a stream data processing engine (e.g., Flink) for real-time computation may be installed, so that a second stream data processing engine may be created in the data processing server. Specifically, the monitoring service in the monitoring server may issue an instruction to the data processing server and pull up a new Flink stream there, that is, a new stream data processing engine. This stream data processing engine has the same function as the first stream data processing engine, namely processing the data stored in the message middleware after the parallelism adjustment, differing only in the size of its data processing capability. The principle followed in creating the second stream data processing engine is to match the real-time requirements of the data processing; for the specific matching rules, see the subsequent steps.
And S204, starting one or more data processing threads matched with the target parallelism in the second stream data processing engine.
In one embodiment, the stream data processing engine allows the system to provide high concurrency while providing strong consistency guarantees, meeting high concurrency and low latency, i.e., computing large amounts of data quickly, which is supported primarily by parallel data processing threads. Therefore, data processing threads need to be started in the stream data processing engine; correspondingly, one or more data processing threads matching the target parallelism are started in the second stream data processing engine, i.e., as many data processing threads as the target parallelism. For example, if the target parallelism is 5, the number of started data processing threads may be 5, and these threads can concurrently process the data in the message middleware to achieve fast computation over massive data. It should be noted that the data processing threads are started in distributed data processing servers, and each data processing server may host one or more data processing threads. The target parallelism may be greater or less than the parallelism of the first stream data processing engine: for example, if the parallelism of the data processing threads started in the first stream data processing engine is 2 and cannot carry the current data load, and the target parallelism determined from the lag amount is 4, this indicates that the real-time requirement of data processing can only be met by extending by two parallelisms, so the parallelism of the data processing threads started in the second stream data processing engine created on the data processing server is also 4. Of course, the above example is a case of expanding the parallelism of the data processing threads; in the case of capacity reduction a second stream data processing engine with parallelism smaller than that of the first stream data processing engine may likewise be created, which is not repeated here.
In an embodiment, the second stream data processing engine includes a receiving unit, a computing unit and a warehousing unit, and an optional implementation of step S204 may be: starting data processing threads of the target parallelism in the receiving unit; starting one or more data processing threads in the computing unit and the warehousing unit in equal proportion to the data quantity received by the receiving unit; and setting an elastic maximum parallelism for the computing unit and the warehousing unit, where the elastic maximum parallelism is used to adjust the parallelism of the data processing threads running in the computing unit and the warehousing unit according to the data volume and the computational complexity passed from the receiving unit to the computing unit while the second stream data processing engine processes data.
The receiving unit, computing unit and warehousing unit mentioned above, collectively referred to as data processing units, may be deployed in the data processing server in the form of program code; each data processing unit and its parallelism can be independent of the others, and they are linked together when the stream data processing engine is constructed. The receiving unit acquires data from the message middleware, the computing unit performs operations such as aggregation, conversion and calculation on the data acquired by the receiving unit, and the warehousing unit receives the computing unit's processing results and writes them into the message middleware. Because the data processing units have different functions, the parallelism of their data processing threads is likewise supplied on demand. Optionally, the data processing threads started in the receiving unit match the target parallelism, where matching means that the number of data processing threads started by the receiving unit equals the target parallelism, so that the parallelism of the receiving unit equals the parallelism of the message middleware and data can be acquired in real time. In addition, the number of data processing threads started in the computing unit and the warehousing unit is adjusted in equal proportion to that quantity; equal-proportion adjustment includes scaling up or scaling down, and the parallelism can be adjusted according to the data quantity received by the receiving unit's threads and the maximum data quantity that a thread started in the computing unit can process, that is, on the basis of the original parallelism, the expanded or reduced parallelism is determined from the increased or decreased data quantity and the unit data processing amount. Because the core of the stream data processing engine's capability lies in the computing capability of the computing unit, in general the parallelism of the data processing threads started by the computing unit and the warehousing unit is greater than or equal to that of the receiving unit, i.e., no lower than the target parallelism, so that the threads process data concurrently, the computing unit can comfortably handle the computation of complex data and obtain results quickly, and the efficiency with which the warehousing unit stores the results is further improved. It should be noted that the numbers of data processing threads started in the computing unit and the warehousing unit may be the same or different, and the first stream data processing engine may also include these data processing units.
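Assuming the three units map onto Flink's source, transformation and sink operators, the per-unit parallelism might be wired as in the sketch below (the 2x scale factor for the computing and warehousing units and the maximum-parallelism ceiling of 128 are illustrative values):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

final class ScaledJob {
    /** Wire receiving, computing and warehousing units with per-operator parallelism. */
    static void build(StreamExecutionEnvironment env, KafkaSource<String> source,
                      KafkaSink<String> sink, int targetParallelism) {
        env.setMaxParallelism(128);                       // elastic maximum parallelism (ceiling)

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "receiving-unit")
           .setParallelism(targetParallelism)             // = Kafka parallelism
           .map(s -> s.trim())                            // placeholder computing unit
           .returns(Types.STRING)
           .setParallelism(targetParallelism * 2)         // scaled in equal proportion
           .sinkTo(sink)
           .setParallelism(targetParallelism * 2);        // warehousing unit
    }
}
```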
In addition, while pulling up the new Flink stream in the data processing server, the monitoring service in the monitoring server can set the elastic maximum parallelism for the computing unit and the warehousing unit, so that when the data volume passed from the receiving unit to the computing unit decreases, or the computational complexity drops due to service requirements, one or more data processing threads can be closed or stopped in the computing unit and the warehousing unit, realizing elastic capacity reduction and saving computing resources. It should be noted that when the new Flink stream is pulled up on the data processing server, two streams consume Kafka data simultaneously: the new Flink stream (i.e., the second stream data processing engine) and the old Flink stream (i.e., the first stream data processing engine) are in the same Kafka consumer group and consume data from the expanded-parallelism Kafka at the same time, so no data is consumed twice and none is missed. However, having both process data at the same time essentially wastes computing resources, so the monitoring service running in the monitoring server stops the old Flink stream at a suitable time and lets the new Flink stream consume the Kafka data, completing lossless expansion or reduction. It should also be noted that in data processing units with different functions, the corresponding functions of the data processing threads differ as well, corresponding one-to-one to the functions of the units.
This scheme is applied to a data processing pipeline that takes Flink as the real-time computing framework: when the data volume surges, Flink can be expanded losslessly without interrupting the data stream, improving the data processing capability, avoiding data processing delay and improving the user experience. The functional block diagram of the data processing method may be as shown in fig. 5, which includes a data source 501, an access layer 502, message middleware A (Kafka-A) 503, a Flink stream 504, message middleware B (Kafka-B) 505, a database (e.g., Druid) 506, a monitoring service 507 and a new Flink stream 508. The access layer 502 is the receiving node, the Flink stream 504 is the computing node, and the database 506 is the warehousing node. The access layer 502 is mainly responsible for receiving data from the different sources included in data source 501, unifying them and standardizing the data format; the Flink stream 504 is responsible for acquiring massive data from message middleware A (Kafka-A) 503, i.e., Kafka-A, performing aggregation calculation, sorting and other processing, and writing the processing results into message middleware B (Kafka-B) 505; and the database 506 is responsible for acquiring the data processing results from Kafka-B and storing them in a unified manner. The monitoring service 507 monitors the growth of the Kafka consumption lag (i.e., offset deviation) in message middleware A (Kafka-A) 503 in real time, expands the Kafka parallelism (specifically, the Kafka-A parallelism) through the API interface when the consumption lag growth reaches the specified condition, and pulls up a new Flink stream 508 with expanded parallelism to consume the Kafka data, i.e., to consume the data in message middleware A (Kafka-A) 503 at the expanded parallelism; the new Flink stream 508 sets the elastic maximum parallelism and can release resources when the data amount becomes small. In addition, the monitoring service 507 stops the old Flink stream 504 to release resources, thereby achieving lossless expansion. It should be noted that after the old Flink stream and the newly pulled-up Flink stream consume the data, both write their results into message middleware B (Kafka-B) 505, where they can be called by other applications, and the results are stored in the database 506 so that the data is backed up.
Referring to fig. 6, step 601 is executed first to monitor the growth of the Kafka consumption lag; specifically, the service checks the Flink delay condition every 1 min and decides from it whether to expand the parallelism. That is, after the lag growth condition is obtained, step 602 is executed to judge whether the Kafka lag has increased continuously for 1 h with the data growing by 2w per minute, or whether the lag value is greater than 200w. If yes, the following expansion steps are executed in order. Step 603: expand the parallelism of the message middleware; the monitoring service first calculates the parallelism that needs to be added (currently computed on the basis that 100w of data volume corresponds to one parallelism) and calls the Kafka interface to expand the Kafka parallelism. Step 604: the monitoring service pulls up a Flink stream with expanded parallelism (the parallelism of the receiving unit equals that of Kafka, and the parallelism of the computing unit and warehousing unit is expanded in equal proportion), and the Flink stream sets the elastic maximum parallelism so that it can elastically shrink when the data volume decreases. At this point the two streams may consume Kafka data simultaneously. Step 605: the monitoring service stops the old Flink stream, and from then on the new Flink stream consumes the Kafka data. Lossless expansion is complete.
In summary, the embodiments of the present application have at least the following advantages:
the flexibility maximum parallelism is set for the computing unit and the storage unit, the parallelism is flexibly adjusted when the data volume changes, the parallelism of different data processing units is dynamically managed, and the parallelism of the data processing thread included in the created stream data processing engine is configured according to the requirement under different service scenes, so that the data processing capability of the stream data processing engine is matched with the real-time requirement of a specific service scene, resources are reasonably scheduled, and the resource utilization rate and the data processing capability of the computing system are improved.
Referring to fig. 7, fig. 7 is a schematic flowchart of a data processing method according to an embodiment of the present disclosure. The data processing method may include at least the following steps S301 to S305:
s301, state information of the first stream data processing engine in processing the data stored in the message middleware is obtained.
And S302, if the state information meets the parallelism adjusting condition, determining the target parallelism according to the state information.
And S303, adjusting the parallelism of the message middleware according to the target parallelism, and determining a second stream data processing engine according to the target parallelism, wherein the second stream data processing engine is used for processing the data stored in the message middleware after the parallelism is adjusted.
The specific implementation manner of steps S301 to S303 can refer to S101 to S103 in the corresponding embodiment of fig. 2, which is not described herein again.
S304, acquiring the position information of the data currently processed by the first stream data processing engine.
In an embodiment, after the second stream data processing engine is determined according to the target parallelism, the first and second stream data processing engines process the data stored in the message middleware simultaneously; but since the second stream data processing engine matches the data processing requirement better than the first, the monitoring server needs to stop the first stream data processing engine at a suitable time, and the second stream data processing engine takes over to continue processing the data stored in the message middleware. To that end, the monitoring service in the monitoring server first needs to acquire the position information of the data currently being processed by the first stream data processing engine, which is actively reported to the monitoring server; the position information is used to decide whether the first stream data processing engine can be stopped.
S305, if the location information indicates that the currently processed data is the last data in the data received by the first stream data processing engine, after the first stream data processing engine successfully processes the currently processed data, the first stream data processing engine is turned off, and the location information is sent to the second stream data processing engine.
In one embodiment, the location information is used by the second stream data processing engine to obtain data from the parallelism-adjusted message middleware. The location information indicates the offset of the most recently consumed message. When the most recently consumed data is the last data in the data received by the receiving unit of the first stream data processing engine, and the first stream data processing engine has successfully processed the currently processed data (that is, has output the processing result), the first stream data processing engine can be closed. At the same time, the monitoring service sends the location information to the second stream data processing engine, informing it to start processing from the data at the position immediately after the data last processed by the first stream data processing engine, so that a seamless handover is achieved and the system does not need to pause. The data processed by the second stream data processing engine is the data stored in the message middleware after the parallelism adjustment; if the message middleware's parallelism has been expanded, the volume of data processed in parallel by the second stream data processing engine is greatly increased and the data processing capability is significantly improved.
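Seen from the Kafka side, the handover could look like the following sketch, which assumes the monitoring service hands the new engine the offset of the last record the first engine processed; the topic, partition, group, and connection details are illustrative, while the consumer calls themselves (assign, seek, poll) are the real Kafka client API.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Sketch of the new engine resuming one position past the reported offset.
public class HandoffConsumer {
    public static void main(String[] args) {
        long lastProcessedOffset = Long.parseLong(args[0]); // reported by the monitoring service
        TopicPartition tp = new TopicPartition("input-topic", 0);

        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "stream-engine-v2");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, lastProcessedOffset + 1); // seamless continuation
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.printf("resuming at offset %d%n", r.offset()));
        }
    }
}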
It should be noted that closing the first stream data processing engine may mean directly destroying it, cutting off its communication connection with the message middleware without destroying it, or backing it up before destroying it. In the last case, after the second stream data processing engine has processed the mass data and the received data volume has decreased to a certain condition, the second stream data processing engine can be closed and the backed-up first stream data processing engine re-enabled. After the stream data processing engine processes the data, the processed data can be written into a new Kafka topic for users and applications to consume, and can also be backed up into a database.
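As an illustration of the final sentence, a sketch that publishes processed results to a new Kafka topic; the topic name, serializer settings, and class name are assumptions, and the database backup path is omitted here.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch of writing processed data into a new Kafka topic for downstream use.
public class ResultWriter {
    private final KafkaProducer<String, String> producer;

    public ResultWriter() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producer = new KafkaProducer<>(props);
    }

    // Publish one processed record to the downstream topic; a real engine
    // would also back the record up into a database.
    public void write(String key, String processedValue) {
        producer.send(new ProducerRecord<>("processed-topic", key, processedValue));
    }

    public void close() {
        producer.close();
    }
}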
In summary, the embodiments of the present application have at least the following advantages:
The monitoring service monitors the position of the data most recently processed by the first stream data processing engine and notifies the new stream data processing engine (that is, the second stream data processing engine) of the obtained location information, so that the second stream data processing engine seamlessly takes over from the first stream data processing engine in processing the data in the message middleware.
The method of the embodiments of the present application is set forth in detail above, and the apparatus of the embodiments of the present application is provided below.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a data processing apparatus 80 according to an embodiment of the present disclosure. The data processing apparatus 80 may be a stand-alone device, or may be a component in a stand-alone device, such as a chip or an integrated circuit. The data processing apparatus 80 includes functional modules for implementing the embodiments shown in fig. 2, fig. 4, and fig. 7.
In a possible implementation, the data processing apparatus 80 may include an obtaining module 801 and a determining module 802. Optionally, a stopping module 803 and a sending module 804 may also be included.
An obtaining module 801, configured to obtain status information of a first stream data processing engine when processing data stored in a message middleware;
a determining module 802, configured to determine a target parallelism according to the state information if the state information meets a parallelism adjustment condition;
the determining module 802 is further configured to adjust the parallelism of the message middleware according to the target parallelism, and determine a second stream data processing engine according to the target parallelism, where the second stream data processing engine is configured to process data stored in the message middleware after the parallelism is adjusted.
In an embodiment, the state information includes a lag amount of data consumption, and the determining module 802 is specifically configured to: determine that the state information satisfies the parallelism adjustment condition if the duration of continuous increase of the lag amount of data consumption reaches a first duration and the increase per unit time is greater than or equal to a first quantity threshold, or if the lag amount of data consumption is greater than or equal to a second quantity threshold.
In an embodiment, the state information includes a lag amount of data consumption, and the determining module 802 is further configured to: determine that the state information satisfies the parallelism adjustment condition if the duration of continuous decrease of the lag amount of data consumption reaches a second duration and the decrease per unit time is greater than or equal to a third quantity threshold, or if the lag amount of data consumption is less than or equal to a fourth quantity threshold.
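The scale-up and scale-down triggers above are symmetric, and could be stated directly as a pair of predicates, as in the following sketch; all thresholds and durations are supplied by the caller, since the concrete values are left to the deployment.

// Sketch of the two parallelism-adjustment conditions as pure predicates;
// durations are counted in one-minute samples.
public final class AdjustmentConditions {
    public static boolean scaleUp(long lag, int minutesIncreasing, long growthPerMin,
                                  int firstDuration, long firstThreshold, long secondThreshold) {
        return (minutesIncreasing >= firstDuration && growthPerMin >= firstThreshold)
                || lag >= secondThreshold;
    }

    public static boolean scaleDown(long lag, int minutesDecreasing, long dropPerMin,
                                    int secondDuration, long thirdThreshold, long fourthThreshold) {
        return (minutesDecreasing >= secondDuration && dropPerMin >= thirdThreshold)
                || lag <= fourthThreshold;
    }
}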
In an embodiment, the state information includes a lag amount of data consumption, and the determining module 802 is further configured to: acquire the reference data processing amount corresponding to a single parallelism and the current parallelism of the message middleware; determine a parallelism adjustment amount according to the reference data processing amount and the lag amount of data consumption; and determine the target parallelism according to the current parallelism and the parallelism adjustment amount.
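As a worked reading of this computation, assuming the document's example rate of one parallelism per 1,000,000 (100w) records, the adjustment amount could be derived as follows; the ceiling rounding is an assumption, chosen so that residual lag still receives a worker.

// Sketch of deriving the target parallelism from the consumption lag.
public final class ParallelismPlanner {
    private static final long RECORDS_PER_PARALLELISM = 1_000_000L; // reference amount per parallelism

    public static int targetParallelism(int currentParallelism, long consumptionLag) {
        // One extra parallelism for every RECORDS_PER_PARALLELISM of lag, rounded up.
        int adjustment = (int) ((consumptionLag + RECORDS_PER_PARALLELISM - 1)
                / RECORDS_PER_PARALLELISM);
        return currentParallelism + adjustment;
    }
}

For example, a current parallelism of 4 and a lag of 2,500,000 records would yield an adjustment of 3 and a target parallelism of 7 under these assumptions.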
In an embodiment, the determining module 802 is specifically configured to: creating a second streaming data processing engine on the data processing server; one or more data processing threads matching the target parallelism are launched in the second streaming data processing engine.
In an embodiment, the second stream data processing engine includes a receiving unit, a computing unit, and a warehousing unit, and the determining module 802 is specifically configured to: launch data processing threads of the target parallelism in the receiving unit; launch one or more data processing threads in the computing unit and the warehousing unit in equal proportion according to the data volume received by the receiving unit; and set the elastic maximum parallelism of the computing unit and the warehousing unit, where the elastic maximum parallelism is used to adjust, during data processing by the second stream data processing engine, the parallelism of the data processing threads running in the computing unit and the warehousing unit according to the data volume and computational complexity transmitted from the receiving unit to the computing unit.
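A sketch of what launching such a second engine could look like in Flink follows. The KafkaSource connector of Flink 1.14+ is assumed; the topic, group, parallelism values, and the elastic bound of 4x are illustrative, and print() stands in for the warehousing unit's real sink.

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Sketch of a second engine whose receiving unit matches the Kafka
// parallelism and whose computing/warehousing units are scaled in proportion.
public class ScaledStreamJob {
    public static void main(String[] args) throws Exception {
        int receiverParallelism = 8; // equals the expanded Kafka partition count
        int scaledParallelism = 16;  // computing/warehousing units scaled in proportion

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("input-topic")
                .setGroupId("stream-engine-v2")
                .setStartingOffsets(OffsetsInitializer.committedOffsets())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Receiving unit: one source subtask per Kafka partition.
        DataStream<String> received = env
                .fromSource(source, WatermarkStrategy.noWatermarks(), "receiving-unit")
                .setParallelism(receiverParallelism);

        // Computing unit with an elastic upper bound, then the warehousing unit.
        received.map(new MapFunction<String, String>() {
                    @Override
                    public String map(String value) {
                        return value.trim(); // placeholder computation
                    }
                })
                .setParallelism(scaledParallelism)
                .setMaxParallelism(scaledParallelism * 4) // elastic maximum parallelism
                .print()
                .setParallelism(scaledParallelism);

        env.execute("scaled-stream-job");
    }
}

Flink's setMaxParallelism bounds how far the runtime can rescale a stateful operator, which maps naturally onto the "elastic maximum parallelism" described above.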
In an embodiment, the obtaining module 801 is further configured to obtain location information of data currently processed by the first streaming data processing engine;
a stopping module 803, configured to, if the location information indicates that the currently processed data is the last data in the data received by the first streaming data processing engine, close the first streaming data processing engine after the first streaming data processing engine successfully processes the currently processed data;
a sending module 804, configured to send the location information to the second stream data processing engine; and the position information is used for the second stream data processing engine to acquire data from the message middleware after the parallelism is adjusted.
It should be understood that, in the apparatus embodiments of the present application, the division into multiple units or modules is only a logical division according to function and is not a limitation on the specific structure of the apparatus. In a specific implementation, some functional modules may be subdivided into finer-grained functional modules, and some functional modules may be combined into one functional module; whether subdivided or combined, the overall flow performed by the apparatus during data processing is the same. Generally, each unit (or module) corresponds to respective program code (or program instructions); when the respective program code runs on a processor, the unit is controlled by the processor to execute the corresponding procedure so as to realize the corresponding function.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a computer device 90 according to an embodiment of the present disclosure. The computer device 90 may comprise a standalone device (e.g., one or more of a server, a node, a terminal, etc.) or may comprise a component (e.g., a chip, a software module, or a hardware module, etc.) within the standalone device. The computer device 90 may comprise at least one processor 901 and a communication interface 902, further optionally the computer device 90 may further comprise at least one memory 903 and a bus 904. The processor 901, the communication interface 902, and the memory 903 are connected by a bus 904.
The processor 901 is a module for performing arithmetic and/or logical operations, and may specifically be one or a combination of processing modules such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Microprocessor Unit (MPU), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Complex Programmable Logic Device (CPLD), a coprocessor (assisting the central processing unit to complete corresponding processing and applications), and a Microcontroller Unit (MCU).
The communication interface 902 may be used to provide information input or output to the at least one processor. And/or, the communication interface 902 may be used for receiving and/or transmitting data from/to the outside, and may be a wired link interface such as an ethernet cable, and may also be a wireless link (Wi-Fi, bluetooth, general wireless transmission, etc.) interface.
The memory 903 is used to provide a storage space in which data, such as an operating system and computer programs, may be stored. The memory 903 may be one or a combination of Random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or portable read-only memory (CD-ROM), among others.
At least one processor 901 in the computer device 90 is configured to invoke the computer program stored in the at least one memory 903 to execute the aforementioned data processing method, such as the data processing method described in the embodiments shown in fig. 2, fig. 4, and fig. 7.
In one possible embodiment, the processor 901 in the computer device 90 is configured to invoke the computer program stored in the at least one memory 903 to perform the following operations: acquiring, through the communication interface 902, state information of a first stream data processing engine when processing data stored in message middleware; if the state information satisfies a parallelism adjustment condition, determining a target parallelism according to the state information; and adjusting the parallelism of the message middleware according to the target parallelism, and determining a second stream data processing engine according to the target parallelism, where the second stream data processing engine is configured to process the data stored in the message middleware after the parallelism is adjusted.
In one embodiment, the state information includes a lag amount of data consumption, and the processor 901 is further configured to: determine that the state information satisfies the parallelism adjustment condition if the duration of continuous increase of the lag amount of data consumption reaches a first duration and the increase per unit time is greater than or equal to a first quantity threshold, or if the lag amount of data consumption is greater than or equal to a second quantity threshold.
In one embodiment, the state information includes a lag amount of data consumption, and the processor 901 is further configured to: determine that the state information satisfies the parallelism adjustment condition if the duration of continuous decrease of the lag amount of data consumption reaches a second duration and the decrease per unit time is greater than or equal to a third quantity threshold, or if the lag amount of data consumption is less than or equal to a fourth quantity threshold.
In one embodiment, the state information includes a lag amount of data consumption, and the processor 901 is further configured to: acquire, through the communication interface 902, the reference data processing amount corresponding to a single parallelism and the current parallelism of the message middleware; determine a parallelism adjustment amount according to the reference data processing amount and the lag amount of data consumption; and determine the target parallelism according to the current parallelism and the parallelism adjustment amount.
In an embodiment, the processor 901 is specifically configured to: create a second stream data processing engine on the data processing server; and launch one or more data processing threads matching the target parallelism in the second stream data processing engine.
In an embodiment, the second stream data processing engine includes a receiving unit, a computing unit, and a warehousing unit, and the processor 901 is specifically configured to: launch data processing threads of the target parallelism in the receiving unit; launch one or more data processing threads in the computing unit and the warehousing unit in equal proportion according to the data volume received by the receiving unit; and set the elastic maximum parallelism of the computing unit and the warehousing unit, where the elastic maximum parallelism is used to adjust, during data processing by the second stream data processing engine, the parallelism of the data processing threads running in the computing unit and the warehousing unit according to the data volume and computational complexity transmitted from the receiving unit to the computing unit.
In an embodiment, the processor 901 is further configured to: acquiring the position information of the data currently processed by the first stream data processing engine through the communication interface 902; if the location information indicates that the currently processed data is the last data in the data received by the first stream data processing engine, the first stream data processing engine is closed after the first stream data processing engine successfully processes the currently processed data, and the location information is sent to the second stream data processing engine through the communication interface 902; and the position information is used for the second stream data processing engine to acquire data from the message middleware after the parallelism is adjusted.
It should be understood that the computer device 90 described in the embodiment of the present application may perform the description of the data processing method in the embodiment corresponding to fig. 2, or fig. 4, or fig. 7, and may also perform the description of the data processing apparatus 80 in the embodiment corresponding to fig. 8, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
Further, it should be noted that an embodiment of the present application also provides a computer-readable storage medium, in which the computer program executed by the aforementioned computer device 90 is stored. The computer program includes program instructions, and when the processor executes the program instructions, the data processing method described in the embodiments corresponding to fig. 2, fig. 4, and fig. 7 can be performed, which is not repeated here. Likewise, the beneficial effects of the same method are not described again. For technical details not disclosed in the embodiments of the computer-readable storage medium of the present application, refer to the description of the method embodiments of the present application.
The computer readable storage medium may be the data processing apparatus provided in any of the foregoing embodiments or an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Memory Card (SMC), a Secure Digital (SD) card, a flash memory card (flash card), and the like provided on the computer device. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the computer device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the computer device. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.
In one aspect of the application, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided by one aspect of the embodiments of the present application.
The above disclosure describes only preferred embodiments of the present application and is not intended to limit the scope of the claims; equivalent variations made according to the claims of the present application therefore still fall within the scope of the present application.

Claims (10)

1. A method of data processing, the method comprising:
acquiring state information of a first stream data processing engine when processing data stored in a message middleware;
if the state information meets the parallelism adjusting condition, determining the target parallelism according to the state information;
and adjusting the parallelism of the message middleware according to the target parallelism, and determining a second stream data processing engine according to the target parallelism, wherein the second stream data processing engine is used for processing the data stored in the message middleware after the parallelism is adjusted.
2. The method of claim 1, wherein the state information includes a lag amount of data consumption, and wherein before determining the target parallelism according to the state information if the state information satisfies the parallelism adjustment condition, the method further comprises:
determining that the state information satisfies the parallelism adjustment condition if the duration of continuous increase of the lag amount of data consumption reaches a first duration and the increase per unit time is greater than or equal to a first quantity threshold, or if the lag amount of data consumption is greater than or equal to a second quantity threshold.
3. The method of claim 1, wherein the state information includes a lag amount of data consumption, and wherein before determining the target parallelism according to the state information if the state information satisfies the parallelism adjustment condition, the method further comprises:
determining that the state information satisfies the parallelism adjustment condition if the duration of continuous decrease of the lag amount of data consumption reaches a second duration and the decrease per unit time is greater than or equal to a third quantity threshold, or if the lag amount of data consumption is less than or equal to a fourth quantity threshold.
4. The method of any of claims 1-3, wherein the state information includes a lag amount of data consumption, and wherein determining the target parallelism according to the state information comprises:
acquiring the reference data processing amount corresponding to a single parallelism and the current parallelism of the message middleware;
determining a parallelism adjustment amount according to the reference data processing amount and the lag amount of data consumption;
and determining the target parallelism according to the current parallelism and the parallelism adjustment quantity.
5. The method of any of claims 1-3, wherein determining a second stream data processing engine based on the target parallelism comprises:
creating a second streaming data processing engine on the data processing server;
starting one or more data processing threads in the second streaming data processing engine that match the target parallelism.
6. The method of claim 5, wherein the second streaming data processing engine comprises a receiving unit, a computing unit, and a warehousing unit, and wherein launching one or more data processing threads matching the target parallelism in the second streaming data processing engine comprises:
starting the data processing threads with the target parallelism in the receiving unit;
starting one or more data processing threads in the computing unit and the warehousing unit in equal proportion according to the data volume received by the receiving unit;
and setting the elastic maximum parallelism of the computing unit and the warehousing unit, wherein the elastic maximum parallelism is used for adjusting the parallelism of data processing threads running in the computing unit and the warehousing unit according to the data volume and the computing complexity transmitted to the computing unit by the receiving unit in the data processing process of the second streaming data processing engine.
7. The method of claim 1, wherein after adjusting the parallelism of the message middleware based on the target parallelism and determining a second streaming data processing engine based on the target parallelism, the method further comprises:
acquiring the position information of the data currently processed by the first stream data processing engine;
if the location information indicates that the currently processed data is the last data in the data received by the first stream data processing engine, after the first stream data processing engine successfully processes the currently processed data, closing the first stream data processing engine, and sending the location information to the second stream data processing engine;
and the position information is used for the second stream data processing engine to acquire data from the message middleware after the parallelism adjustment.
8. A data processing apparatus, comprising:
the acquisition module is used for acquiring state information of the first stream data processing engine when processing data stored in the message middleware;
the determining module is used for determining the target parallelism according to the state information if the state information meets the parallelism adjusting condition;
the determining module is further configured to adjust the parallelism of the message middleware according to the target parallelism, and determine a second stream data processing engine according to the target parallelism, where the second stream data processing engine is configured to process data stored in the message middleware after the parallelism is adjusted.
9. A computer device, comprising: a processor, a memory, and a network interface;
the processor is connected with the memory and the network interface, wherein the network interface is used for providing a network communication function, the memory is used for storing program codes, and the processor is used for calling the program codes to execute the data processing method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, perform the data processing method of any of claims 1-7.
CN202110949487.8A 2021-08-18 2021-08-18 Data processing method and device, computer equipment and readable storage medium Pending CN115934304A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110949487.8A CN115934304A (en) 2021-08-18 2021-08-18 Data processing method and device, computer equipment and readable storage medium


Publications (1)

Publication Number Publication Date
CN115934304A true CN115934304A (en) 2023-04-07

Family

ID=86549362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110949487.8A Pending CN115934304A (en) 2021-08-18 2021-08-18 Data processing method and device, computer equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115934304A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116382921A (en) * 2023-05-08 2023-07-04 深圳市欧朗博科技有限公司 Baseband chip architecture and method for pre-allocation and parallelism self-adjustment of data streams


Similar Documents

Publication Publication Date Title
CN107577805B (en) Business service system for log big data analysis
US20150295970A1 (en) Method and device for augmenting and releasing capacity of computing resources in real-time stream computing system
US10542086B2 (en) Dynamic flow control for stream processing
US20140115044A1 (en) Stream processing using a client-server architecture
CN109522100B (en) Real-time computing task adjusting method and device
CN108134814B (en) Service data processing method and device
CN112650575B (en) Resource scheduling method, device and cloud service system
WO2019223599A1 (en) Data acquisition system and method, node device and storage medium
CN114064211B (en) Video stream analysis system and method based on end-side-cloud computing architecture
CN110971430A (en) Automatic capacity expansion and reduction control method and device, storage medium and processor
CN115934304A (en) Data processing method and device, computer equipment and readable storage medium
WO2018097058A1 (en) Analysis node, method for managing resources, and program recording medium
CN114827049A (en) Accumulated data consumption method based on kafka, terminal equipment and storage medium
CN107291370B (en) Cloud storage system scheduling method and device
CN111984393A (en) Distributed large-scale real-time data scheduling engine system and data scheduling method thereof
CN112068940A (en) Real-time task scheduling method, device, scheduling system and storage medium
CN110809050A (en) Personalized push system and method based on streaming computing
WO2012124295A1 (en) Computer system, control system, control method and control program
CN114048228A (en) State storage updating method, device, equipment and storage medium
CN113703982A (en) Data consumption method, apparatus, terminal device and medium using KAFKA
JP2015036957A (en) Information processing system, control method for information processing system and control program for management device
CN111124268B (en) Data copying method, device and system and electronic equipment
CN104158835A (en) Method for intelligent desktop system server to control clients
US11095522B2 (en) Dynamic scaling for data processing streaming system
CN116346578A (en) Kafka double-cluster real-time flow automatic switching method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination