CN113778996A - Large data stream data processing method and device, electronic equipment and storage medium

Info

Publication number: CN113778996A
Application number: CN202111110684.7A
Authority: CN
Inventors: 杨万强; 毕小根
Original assignee: Shanghai Fu Shen Lan Software Co ltd
Current assignee: Shanghai Fu Shen Lan Software Co ltd
Priority date: 2021-09-18
Filing date: 2021-09-18
Publication date: 2021-12-10

Abstract

The application provides a method and a device for processing large data stream data, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring source data in a plurality of data sources; analyzing whether the source data contains change identification and/or attribute identification; if so, performing data organization on the source data and sending the source data to the message middleware, analyzing at least one piece of organized source data acquired from the message middleware, and matching the at least one piece of organized source data into a target data record base according to the primary key information; judging the processing priority of the data in the target data record base according to the attribute identification of the data in the target data record base, and forming a data group to be processed according to the processing priority of the data in the target data record base; and judging the operation standard of the data group to be processed according to the change identifier of the data group to be processed and forming a new target database. The invention can solve the data processing problem caused by the disorder of the data of the large data stream without changing the technical architecture of the original data.

Description

Large data stream data processing method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for processing large data stream data, an electronic device, and a storage medium.

Background

The rapid development of emerging information technologies and application modes such as cloud computing, the Internet of Things, mobile internet, and social media has driven rapid growth in global data volume and brought human society into the big data era. Faced with an endless flood of data, a means is urgently needed to capture and analyze information that would otherwise be lost in an instant. Streaming big data technology arose in this context. Acquiring data faster and more completely, so that its value can be mined faster and more fully, has become a consensus across industries in the big data era.

Streaming big data is now characterized by its real-time nature, volatility, burstiness, disorder, and unboundedness. Because each data source is independent and operates in a different space-time environment, the relative order of data elements across data streams cannot be guaranteed. Even within the same data stream, which is dynamic, changes in time and environment mean that the order of data elements in a replayed stream cannot be guaranteed to match that of the earlier stream. This disorder leads to problems such as the following:

1) The interval between modifications of source-end data is short, so data modified earlier may be processed later, leaving the information out of order.

2) Data deleted at the source end cannot be deleted at the target end, leaving garbage data.

3) When the source-end system database is split, the target-end data cannot be processed correctly, and subsequent analysis data is distorted.

4) When the source-end system database is archived, the target-end data cannot be processed correctly, and subsequent analysis data is distorted.

These problems cause data confusion and distortion and reduce the accuracy of the data as a whole. They are also difficult to detect when they occur, so the value of the data is hard to realize during data mining, and actual decisions may even be affected.

Disclosure of Invention

The invention aims to provide a method and a device for processing large-data-stream data, electronic equipment and a storage medium, which solve the problem caused by disorder in the stream data processing process.

In order to achieve the above purpose, the embodiments of the present application adopt the following technical solutions:

In a first aspect, an embodiment of the present application provides a method for processing large data stream data, including: acquiring source data in a plurality of data sources; analyzing whether the source data contains a change identifier and/or an attribute identifier; if so, performing data organization on the source data and sending the source data to a message middleware, wherein the organized source data after the data organization comprises a system, a table name, a primary key, the change identifier and/or the attribute identifier; analyzing at least one piece of organized source data acquired from the message middleware, and matching the at least one piece of organized source data into a target data record base according to the primary key information; judging the processing priority of the data in the target data record base according to the attribute identifier of the data in the target data record base, and forming a data group to be processed according to the processing priority of the data in the target data record base; and judging the operation standard of the data group to be processed according to the change identifier of the data group to be processed and forming a new target database.

Wherein the determining of the processing priority of the data in the target data record base according to the attribute identifier of the data in the target data record base comprises: identifying the database attribute of the acquired data in the target data record base, determining the processing priority of the acquired data in the target data record base, and taking, among the data of the target data record base with the same primary key, the data with the optimal database attribute processing priority as a first data group to be processed, wherein the database attributes include database migration, database splitting, and/or database archiving; identifying the time attribute of the first data group to be processed, determining the processing priority of the first data group to be processed, and taking, among the first data group to be processed with the same primary key, the data with the optimal time attribute processing priority as a second data group to be processed, wherein the time attribute comprises time information of the change of the organized source data; and identifying the system attribute of the second data group to be processed, determining the processing priority of the second data group to be processed, and taking, among the second data group to be processed with the same primary key, the data with the optimal system attribute processing priority as a third data group to be processed.

Further, the "determining the operation standard of the data group to be processed according to the change identifier of the data group to be processed and forming a new target database" includes: identifying the change identifier of each piece of target data in the third data group to be processed; if the target data contains a modified or newly added change identifier, updating the target data to form updated target data; if the target data contains a deleted change identifier, performing a logical deletion operation on the target data to form deleted target data; and forming a new target database from the data in the third data group to be processed that has no change identifier, the updated target data, and the deleted target data.

Optionally, the analyzing of whether the source data contains a change identifier and/or an attribute identifier includes:

if the source data in the data source contains the change identifier or the attribute identifier, the data source records the change identifier or the attribute identifier of the source data it sends and passes it downstream; or

the downstream parses the log information of the source database in the data source so as to identify whether the source data in the data source contains a change identifier or an attribute identifier.

In a second aspect, an embodiment of the present application provides a large data stream data processing apparatus, including:

the acquisition module is used for acquiring source data in a plurality of data sources;

the analysis module is used for analyzing whether the source data contains a change identifier or an attribute identifier; if so, performing data organization on the source data and sending the source data to a message middleware, wherein the organized source data after the data organization comprises a system, a table name, a primary key, the change identifier and/or the attribute identifier;

the first judgment module is used for analyzing at least one piece of organized source data acquired from the message middleware and matching the at least one piece of organized source data into a target data record base according to the primary key information;

the second judging module is used for judging the processing priority of the data in the target data record base according to the attribute identification of the data in the target data record base and forming a data group to be processed according to the processing priority of the data in the target data record base;

and the third judging module is used for judging the operation standard of the data group to be processed according to the change identifier of the data group to be processed and forming a new target database.

Further, the second determining module includes:

a database data judging unit, configured to identify a database attribute of data in an acquired target data record base, determine a processing priority of the acquired target data record base, and use, as a first to-be-processed data group, data of the target data record base with an optimal database attribute processing priority among data of the target data record base with the same primary key; the database attributes include database migration, database splitting, and/or database archiving;

a time attribute judging unit, configured to identify a time attribute of the first to-be-processed data group, determine a processing priority of the first to-be-processed data group, and use, as a second to-be-processed data group, the first to-be-processed data group with an optimal time attribute processing priority among the first to-be-processed data groups with the same primary key; the time attribute comprises time information of source data change after organization;

a system attribute judging unit, configured to identify a system attribute of the second to-be-processed data group, determine a processing priority of the second to-be-processed data group, and use, as a third to-be-processed data group, the second to-be-processed data group with an optimal system attribute processing priority among the second to-be-processed data groups with the same primary key.

In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory and a bus, where the memory stores all machine-readable instructions executable by the processor, and when the electronic device runs, the processor and the memory communicate with each other through the bus, and the processor executes the machine-readable instructions to perform the steps of the large data stream data processing method according to any one of the above embodiments.

In a fourth aspect, the present application provides a readable storage medium, where a computer program is stored, and when the computer program is executed, the method for processing big data stream data according to any one of the above-mentioned steps is implemented.

The embodiment of the invention has the advantages that the problem of data processing caused by disorder can be solved by only adding the change identifier and the attribute identifier in the data acquisition and data processing processes without changing the original technical architecture, and the accuracy and consistency of data in large data stream processing are improved.

Drawings

FIG. 1 is a flow chart of a large data stream data processing method of the present invention;

FIG. 2 is a detailed flowchart of step S4 of a large data stream data processing method according to the present invention;

FIG. 3 is a schematic diagram of a large data stream data processing apparatus according to the present invention;

FIG. 4 is a schematic structural diagram of a second determining module according to the present invention;

FIG. 5 is a schematic structural diagram of an electronic device of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention are described below in detail and completely with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

In the description of the present invention, the terms "first", "second", "third", etc. are used only for distinguishing the description, and are not intended to indicate or imply relative importance.

Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.

Referring to FIG. 1, which is a schematic flowchart of a large data stream data processing method according to an embodiment of the present invention, the method includes:

S1: acquiring source data in a plurality of data sources;

S2: analyzing whether the source data contains a change identifier and/or an attribute identifier;

Specifically, there are various ways to determine whether source data has changed during stream data processing; they generally fall into two modes, active and passive.

The active mode is: if the source data in the data source contains the change identifier or the attribute identifier, the data source records the change identifier or the attribute identifier of the source data it sends and passes it downstream. It should be noted that, here and below, the downstream refers to the message middleware.

The passive mode is: the downstream parses the log information of the source database in the data source, identifies information such as the change timestamp of the source system, and thereby identifies whether the source data in the data source contains a change identifier or an attribute identifier.

In general, to reduce the system pressure and workload at the source end, data changes are mostly sensed in the passive mode, by using a tool to parse the source-end database log.

The change identifier is identification information recorded when the source data changes, so that the change can be recognized during data processing. The change identifier may be addition, modification, or deletion, and clearly indicates how the data record changed relative to its history.

Example:

Serial number	Identifier	Description
1	I	Add
2	U	Modify
3	D	Delete
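As a minimal, non-limiting sketch, the three change identifiers in the table above could be modeled as an enumeration. The names `ChangeType` and `parse_change_type` are assumptions introduced for illustration and do not appear in the application.

```python
from enum import Enum


class ChangeType(Enum):
    """Change identifiers as listed in the table above."""
    INSERT = "I"   # newly added record
    UPDATE = "U"   # modified record
    DELETE = "D"   # deleted record


def parse_change_type(code: str) -> ChangeType:
    """Map a one-letter identifier to a ChangeType.

    Raises ValueError when the code is not one of I, U or D.
    """
    return ChangeType(code.strip().upper())
```

Rejecting unknown codes early keeps records with a malformed identifier from silently reaching the later processing steps.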

The attribute identifier refers to the system attribute, database attribute, and time attribute corresponding to the source data. It is added to identify these attributes of the information, and data processing is carried out according to the system attribute, database attribute, and time attribute.

S2-1, if the source data contains the change identification and/or the attribute identification, performing data organization on the source data and sending the source data to a message middleware, wherein the organized source data after the data organization contains a system, a table name, a primary key, the change identification and/or the attribute identification;

For example, the organized information contains the change identifier and attribute identifier information. Considering the extensibility of the data columns, the information is defined in JSON format, which is exemplified as follows:

The organization of information according to the above method can be accomplished by a program or a tool.
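A minimal sketch of such a program is given below, assuming a flat JSON envelope. Every key name (`system`, `table`, `primary_key`, `change_id`, `attributes`, `data`) and every sample value is hypothetical; the application only requires that the organized data carry the system, table name, primary key, change identifier and/or attribute identifier.

```python
import json


def organize(system, table, primary_key, change_id, attributes, payload):
    """Wrap one changed row in a JSON envelope carrying the identifiers.

    All key names are illustrative assumptions, not the application's format.
    """
    message = {
        "system": system,
        "table": table,
        "primary_key": primary_key,
        "change_id": change_id,      # "I", "U" or "D"
        "attributes": attributes,    # database / time / system attributes
        "data": payload,             # column values of the row
    }
    return json.dumps(message, ensure_ascii=False, sort_keys=True)


example = organize(
    system="crm",
    table="customer",
    primary_key="3400001",
    change_id="U",
    attributes={"db_attr": "A01", "change_time_ms": 1627801281260, "system_priority": 1},
    payload={"value": 8},
)
```

Keeping the row's own columns under a separate `data` key is one way to let new columns be added without changing the envelope.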

In addition, when the organized source data is sent to the message middleware, different handling can be applied according to the characteristics of the middleware in use. Taking Kafka as an example, the corresponding Topic must be created before messages are sent, and the number of partitions can be initialized according to throughput, chiefly the throughput at the Producer and Consumer ends. Messages can also be configured to be routed by the hash value of their key, so that messages with the same key are consumed in order as far as possible, which improves data accuracy.
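The key-based routing described above can be illustrated without a broker. The sketch below uses CRC32 purely as a stand-in hash; a real Kafka client applies its own partitioner to the message key, and the function name `partition_for` and the key format are assumptions.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Pick a partition from a stable hash of the message key.

    CRC32 is an illustrative stand-in, not Kafka's actual partitioner.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Messages with the same key always map to the same partition, so the
# consumer of that partition sees them in the order they were produced.
keys = ["customer:3400001", "customer:3400002", "customer:3400001"]
assigned = [partition_for(k, 6) for k in keys]
```

Ordering is only preserved within a partition, which is why the routing key would typically be derived from the table name and primary key.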

And if the source data does not contain the change identification and/or the attribute identification, directly sending the source data to the message middleware.

S3: analyzing at least one piece of organized source data acquired from the message middleware, and matching the at least one piece of organized source data into a target data record base according to the primary key information;
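One possible reading of step S3, sketched under the assumption that each organized message is a dictionary with hypothetical `table` and `primary_key` fields, is to collect the messages that refer to the same target record under one key:

```python
from collections import defaultdict


def match_by_primary_key(messages):
    """Group organized messages by (table, primary key).

    The source system is deliberately left out of the key so that records
    from several systems that map to the same target row end up together.
    """
    record_base = defaultdict(list)
    for msg in messages:
        record_base[(msg["table"], msg["primary_key"])].append(msg)
    return dict(record_base)


incoming = [
    {"system": "crm", "table": "customer", "primary_key": "3400001", "change_id": "U"},
    {"system": "billing", "table": "customer", "primary_key": "3400001", "change_id": "U"},
    {"system": "crm", "table": "customer", "primary_key": "3400002", "change_id": "I"},
]
record_base = match_by_primary_key(incoming)
```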

S4: judging the processing priority of the data in the target data record base according to the attribute identifier of the data in the target data record base, and forming a data group to be processed according to the processing priority of the data in the target data record base;

Specifically, as shown in FIG. 2, this step includes:

S401: identifying the database attribute of the acquired data in the target data record base, determining the processing priority of the acquired data in the target data record base, and taking, among the data of the target data record base with the same primary key, the data with the optimal database attribute processing priority as a first data group to be processed; the database attributes include database migration, database splitting, and/or database archiving;

The database attribute information of the acquired data in the target data record base is identified, and the data is then processed according to the configured database priority, for example for the following data:

Primary key	Database attribute	Data value
3400001	A01	8
3400001	A02	9
3400001	B01	10
3400001	B02	11
3400001	C01	7

The priority order is determined by the A, B, and C code values of the database attribute set in the database priority configuration. In this example, among the records with the same primary key, the record with database attribute C01 has the highest priority, and whenever data is processed, the record with the highest priority is taken as the valid record. If incoming data has a lower priority than the target data, it is not processed.

The database attribute represents the attribute of the source database. Where one system has several databases, such as a main database, sub-databases, and an archive database, the database attribute is defined according to the actual situation, and data is processed according to the different database attributes.

Defining this attribute resolves the problems caused by inconsistent source-end data during stream data processing when the source-end data undergoes database migration, database splitting, or database archiving. The main database, sub-databases, and archive database can be represented by the letters A, B, and C, and 01, 02, and 03 represent successive migrations of a database with the same attribute. By default the attribute value of a database is A01; after a migration it becomes A02; after a split the corresponding database has the attribute value B01; and after archiving the attribute value is C01.
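The selection in S401 can be sketched as follows for the sample rows given earlier. The text states only that C01 has the highest priority in that sample; the ranking of the other letters and the use of the migration number as a tie-breaker are assumptions, as are the names `LETTER_RANK`, `db_attr_rank`, and `select_by_db_attr`.

```python
# Hypothetical priority configuration: a larger rank wins.
LETTER_RANK = {"A": 1, "B": 2, "C": 3}


def db_attr_rank(db_attr: str) -> tuple:
    """Split a code such as 'A02' into (letter rank, migration number)."""
    return (LETTER_RANK[db_attr[0]], int(db_attr[1:]))


def select_by_db_attr(records):
    """Keep only the records whose database attribute ranks highest."""
    best = max(db_attr_rank(r["db_attr"]) for r in records)
    return [r for r in records if db_attr_rank(r["db_attr"]) == best]


rows = [
    {"primary_key": "3400001", "db_attr": "A01", "value": 8},
    {"primary_key": "3400001", "db_attr": "A02", "value": 9},
    {"primary_key": "3400001", "db_attr": "B01", "value": 10},
    {"primary_key": "3400001", "db_attr": "B02", "value": 11},
    {"primary_key": "3400001", "db_attr": "C01", "value": 7},
]
first_group = select_by_db_attr(rows)  # only the C01 row remains
```

Because the ranking lives in a configuration table rather than in code, a deployment that prefers the main database over the archive could reverse the order without touching the selection logic.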

Examples are as follows:

S402: identifying the time attribute of the first data group to be processed, determining the processing priority of the first data group to be processed, and taking, among the first data group to be processed with the same primary key, the data with the optimal time attribute processing priority as a second data group to be processed; the time attribute comprises time information of the change of the organized source data;

Specifically, the first data group to be processed is handled according to the configured time attribute information; for example, either the latest data or the oldest data can be configured to be processed. When the data with the latest time is configured as the valid data, newly arrived data proceeds to the next processing step only if its time is later than that of the record in the target library; otherwise it is not processed.

The time attribute is time information indicating when the data changed. Defining this attribute resolves out-of-order processing times caused by network conditions, concurrency, and the like. The value of the time attribute is recorded to the millisecond, so that the time sequence of data changes is recorded more accurately. To facilitate subsequent data processing, the millisecond-precision time can also be converted into a timestamp, for example 2021-08-01 15:01:21.260 into 1627801281260.
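The conversion and the latest-wins comparison can be sketched as below. The sample value 2021-08-01 15:01:21.260 maps to 1627801281260 only when the time is read as UTC+8, so that offset is assumed here; the names `to_epoch_ms` and `is_newer` are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: source timestamps are expressed in UTC+8.
SOURCE_TZ = timezone(timedelta(hours=8))
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(text: str) -> int:
    """Convert 'YYYY-MM-DD HH:MM:SS.mmm' to integer milliseconds since the epoch."""
    local = datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f").replace(tzinfo=SOURCE_TZ)
    # Integer division of timedeltas avoids floating-point rounding.
    return (local - EPOCH) // timedelta(milliseconds=1)


def is_newer(incoming_ms: int, target_ms: int) -> bool:
    """Latest-wins policy: accept the incoming change only if it is later."""
    return incoming_ms > target_ms
```

Comparing integer timestamps rather than formatted strings keeps the check independent of how each source system prints its times.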

S403: and identifying the system attribute of the second data group to be processed, determining the processing priority of the second data group to be processed, and taking the second data group to be processed with the optimal system attribute processing priority in the second data group to be processed with the same primary key as a third data group to be processed.

Specifically, the second data group to be processed is handled according to the configured system attribute information. For example, if a smaller priority number in the system attribute column is configured to mean a higher priority, new data proceeds to the next processing step as valid data only when its priority is higher than that of the target data; otherwise no subsequent processing is performed.

It should be explained that the system attribute represents the priority of the source system and may be defined as a column attribute of the data record. Defining this attribute handles the situation in which multiple systems correspond to the same target. Source priority can be represented by the numbers 1, 2, 3, …, with the magnitude of the number indicating the processing priority.
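A minimal sketch of S403 follows, assuming that a smaller number means a higher priority as in the example configuration discussed above. The field names `source_system` and `system_priority` and the sample systems are hypothetical.

```python
def select_by_system_priority(records):
    """Keep the records that come from the highest-priority source system.

    Assumes a smaller priority number means a higher priority.
    """
    best = min(r["system_priority"] for r in records)
    return [r for r in records if r["system_priority"] == best]


candidates = [
    {"primary_key": "3400001", "source_system": "billing", "system_priority": 2},
    {"primary_key": "3400001", "source_system": "crm", "system_priority": 1},
]
third_group = select_by_system_priority(candidates)
```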

Examples are as follows:

During data processing, the target end can store the data in columnar storage, with the attribute information "information number", "source system", and "system priority" marked in each column; this information is used for logical judgment during stream data processing.

S5: and judging the operation standard of the data group to be processed according to the change identifier of the data group to be processed and forming a new target database.

Specifically, the change identifier of each piece of target data in the third data group to be processed is identified. If the target data contains a modified or newly added change identifier, the target data is updated to form updated target data. If the target data contains a deleted change identifier, a logical deletion operation is performed on the target data to form deleted target data. The data in the third data group to be processed that has no change identifier, the updated target data, and the deleted target data together form the new target database.
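Step S5 can be sketched as below, assuming the target is held as a dictionary keyed by primary key. The `is_deleted` flag used for logical deletion and the record layout are assumptions made for illustration.

```python
def apply_changes(target: dict, group: list) -> dict:
    """Apply each record's change identifier to a copy of the target table.

    'I' and 'U' write the record's data, 'D' marks an existing row as
    logically deleted, and a record with no change identifier is written
    into the new target as-is.
    """
    result = {key: dict(row) for key, row in target.items()}
    for rec in group:
        key, change = rec["primary_key"], rec.get("change_id")
        if change == "D":
            if key in result:
                result[key]["is_deleted"] = True  # logical, not physical, delete
        elif change in ("I", "U", None):
            result[key] = {**rec["data"], "is_deleted": False}
    return result


target = {
    "1": {"value": 8, "is_deleted": False},
    "2": {"value": 5, "is_deleted": False},
}
group = [
    {"primary_key": "1", "change_id": "U", "data": {"value": 9}},
    {"primary_key": "2", "change_id": "D", "data": {}},
    {"primary_key": "3", "change_id": None, "data": {"value": 4}},
]
new_target = apply_changes(target, group)
```

Marking rows as deleted rather than removing them keeps the deletion visible to downstream consumers and avoids the garbage data described in the Background.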

The processed target database can be sent to downstream application.

According to the big data stream data processing method provided by the embodiment of the application, the problem of data processing caused by disorder can be solved only by adding the change identifier and the attribute identifier in the data acquisition and data processing processes without changing the original technical architecture, and the accuracy and consistency of data in big data stream type processing are improved.

Based on the same inventive concept, an embodiment of the present invention further provides a large data stream data processing apparatus, and fig. 3 is a schematic structural diagram of the large data stream data processing apparatus provided in the embodiment of the present invention, as shown in fig. 3, the apparatus includes:

an obtaining module 100, configured to obtain source data in a plurality of data sources;

an analysis module 200, configured to analyze whether the source data includes a change identifier or an attribute identifier; if so, performing data organization on the source data and sending the source data to a message middleware, wherein the organized source data after the data organization comprises a system, a table name, a primary key, the change identifier and/or the attribute identifier;

a first determining module 300, configured to analyze at least one piece of post-organization source data obtained from the message middleware, and match the at least one piece of post-organization source data to a target data record base according to the primary key information;

a second judging module 400, configured to judge, according to the attribute identifier of the data in the target data record base, a processing priority order of the data in the target data record base, and form a to-be-processed data group according to the processing priority order of the data in the target data record base;

a third determining module 500, configured to determine an operation standard of the to-be-processed data set according to the change identifier of the to-be-processed data set, and form a new target database.

Wherein the second determining module 400 comprises:

a database data determining unit 401, configured to identify a database attribute of data in an acquired target data record base, determine a processing priority of the acquired target data record base, and use, as a first to-be-processed data group, data of the target data record base with an optimal database attribute processing priority among data of the target data record base with the same primary key; the database attributes include database migration, database splitting, and/or database archiving;

a time attribute judging unit 402, configured to identify a time attribute of the first to-be-processed data group, determine a processing priority of the first to-be-processed data group, and use the first to-be-processed data group with an optimal time attribute processing priority in the first to-be-processed data group with the same primary key as a second to-be-processed data group; the time attribute comprises time information of source data change after organization;

a system attribute determining unit 403, configured to identify a system attribute of the second to-be-processed data group, determine a processing priority of the second to-be-processed data group, and use the second to-be-processed data group with the optimal system attribute processing priority in the second to-be-processed data group with the same primary key as a third to-be-processed data group.

The above-mentioned apparatus is used for executing the method provided by the foregoing embodiment, and the implementation principle and technical effect are similar, which are not described herein again.

These modules may be one or more integrated circuits configured to implement the above method, such as one or more Application Specific Integrated Circuits (ASICs), one or more digital signal processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). For another example, when one of the above modules is implemented as program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling program code. For another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).

Fig. 5 is a schematic diagram of an electronic device according to an embodiment of the present application, where the electronic device may be integrated in a terminal device or a chip of the terminal device, and the terminal may be a computing device with a data processing function.

As shown in FIG. 5, the electronic device includes a processor 501, a memory 502, and a bus, wherein the memory 502 stores program instructions executable by the processor 501. When the electronic device operates, the processor 501 communicates with the memory 502 through the bus, and the processor 501 executes the program instructions to perform the above method embodiments. The specific implementation and technical effects are similar and are not described herein again.

Optionally, the present invention further provides a storage medium, on which a computer program is stored, the computer program being configured to, when executed by a processor, perform the above-mentioned method embodiments.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.

It is to be understood that the above-described embodiments of the present invention merely illustrate or explain the principles of the invention and are not to be construed as limiting the invention. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention should be included in the protection scope of the present invention. Further, it is intended that the appended claims cover all such variations and modifications as fall within the scope and boundaries of the appended claims or the equivalents of such scope and boundaries.

Claims

1. A large data stream data processing method is characterized by comprising the following steps:

acquiring source data in a plurality of data sources;

analyzing whether the source data contains a change identifier and/or an attribute identifier;

if so, performing data organization on the source data and sending the organized source data to a message middleware, wherein the organized source data comprises a system, a table name, a primary key, and the change identifier and/or the attribute identifier;

analyzing at least one piece of organized source data acquired from the message middleware, and matching the at least one piece of organized source data into a target data record base according to the primary key;

judging the processing priority of the data in the target data record base according to the attribute identifier of the data in the target data record base, and forming a data group to be processed according to the processing priority of the data in the target data record base;

and judging the operation standard of the data group to be processed according to the change identifier of the data group to be processed and forming a new target database.
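The first steps of claim 1 (checking for identifiers, organizing the source data, sending it to the message middleware and matching it by primary key) can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: the field names (`system`, `table`, `pk`, `change_id`, `attribute_id`, `payload`), the functions `organize`, `publish` and `match_into_record_base`, and the use of an in-process `queue.Queue` in place of a real message middleware are all hypothetical and are not specified by the claim.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from queue import Queue
from typing import Dict, Iterable, List, Optional


@dataclass
class OrganizedRecord:
    """Organized source data: system, table name, primary key, identifiers."""
    system: str
    table: str
    primary_key: str
    change_id: Optional[str] = None      # assumed values: "insert", "update", "delete"
    attribute_id: Optional[str] = None   # assumed values: "migration", "split", "archive"
    payload: dict = field(default_factory=dict)


def organize(raw: dict) -> Optional[OrganizedRecord]:
    """Return an organized record, or None when raw carries neither identifier."""
    change_id = raw.get("change_id")
    attribute_id = raw.get("attribute_id")
    if change_id is None and attribute_id is None:
        return None
    return OrganizedRecord(
        system=raw["system"],
        table=raw["table"],
        primary_key=raw["pk"],
        change_id=change_id,
        attribute_id=attribute_id,
        payload=raw.get("payload", {}),
    )


def publish(sources: Iterable[Iterable[dict]], middleware: Queue) -> int:
    """Organize data from several sources and push it to the stand-in middleware."""
    sent = 0
    for source in sources:
        for raw in source:
            record = organize(raw)
            if record is not None:
                middleware.put(record)
                sent += 1
    return sent


def match_into_record_base(middleware: Queue) -> Dict[str, List[OrganizedRecord]]:
    """Drain the middleware and group the organized records by primary key."""
    record_base: Dict[str, List[OrganizedRecord]] = defaultdict(list)
    while not middleware.empty():
        record = middleware.get()
        record_base[record.primary_key].append(record)
    return dict(record_base)
```

In this sketch a record that carries neither identifier is simply skipped, and records from different sources that share a primary key end up in the same list of the record base, ready for the priority judgment of the later steps.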

2. The big data stream data processing method as claimed in claim 1, wherein said determining the processing priority of the data in the target data record base according to the attribute identifier of the data in the target data record base comprises:

identifying database attributes of the acquired data in the target data record base, determining the processing priority of the acquired data in the target data record base, and taking the data of the target data record base with the optimal database attribute processing priority in the data of the target data record base with the same primary key as a first data group to be processed; the database attributes include database migration, database splitting, and/or database archiving;

identifying the time attribute of the first data group to be processed and determining the processing priority of the first data group to be processed, and using the first data group to be processed with the optimal time attribute processing priority in the first data group to be processed with the same primary key as a second data group to be processed; the time attribute comprises time information of source data change after organization;

and identifying the system attribute of the second data group to be processed, determining the processing priority of the second data group to be processed, and taking the second data group to be processed with the optimal system attribute processing priority in the second data group to be processed with the same primary key as a third data group to be processed.
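The three-stage selection of claim 2 can be read as successive filters applied to the records that share one primary key. The sketch below is one possible reading, not the claimed implementation: the claim does not state which database attribute, time or system is "optimal", so the rankings `DB_RANK` and `SYSTEM_RANK`, the rule that the latest change time wins, and the names `Record`, `keep_best` and `select_for_key` are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, NamedTuple


class Record(NamedTuple):
    primary_key: str
    db_attribute: str   # "migration", "split" or "archive"
    changed_at: float   # time of the change to the organized source data
    system: str
    change_id: str


# Hypothetical rankings (lower value = higher priority); not given by the claim.
DB_RANK = {"migration": 0, "split": 1, "archive": 2}
SYSTEM_RANK = {"core": 0, "legacy": 1}


def keep_best(records: Iterable[Record], rank: Callable[[Record], float]) -> List[Record]:
    """Keep only the records whose rank equals the best (lowest) rank."""
    records = list(records)
    if not records:
        return []
    best = min(rank(r) for r in records)
    return [r for r in records if rank(r) == best]


def select_for_key(records: Iterable[Record]) -> List[Record]:
    """Apply the database, time and system filters to records of one primary key."""
    first_group = keep_best(records, lambda r: DB_RANK[r.db_attribute])
    # Assumption: the most recent change has the optimal time priority.
    second_group = keep_best(first_group, lambda r: -r.changed_at)
    third_group = keep_best(second_group, lambda r: SYSTEM_RANK[r.system])
    return third_group
```

Each stage only narrows the output of the previous one, so a later attribute acts as a tie-breaker among records that are equal on the earlier attributes.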

3. The method as claimed in claim 2, wherein said determining the operation standard of the data group to be processed according to the change identifier of the data group to be processed and forming a new target database comprises:

identifying a change identifier of each target data in the third data group to be processed;

if the target data contains the modified or newly added change identification, updating the target data to form updated target data;

if the target data contains the deleted change identification, performing logic deletion operation on the target data to form deleted target data;

and forming a new target database from the data in the third data group to be processed that has no change identifier, the updated target data and the deleted target data.
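The handling of change identifiers in claim 3 can be sketched as follows. This is a hedged illustration only: the dictionary layout, the identifier values `"insert"`, `"update"` and `"delete"`, the `is_deleted` flag used to model logical deletion, and the function name `apply_changes` are assumptions, and data without a change identifier is assumed to be carried into the new target database as is.

```python
from typing import Dict, List


def apply_changes(target: Dict[str, dict], group: List[dict]) -> Dict[str, dict]:
    """Build a new target database from the third data group to be processed.

    target maps primary key -> row; group holds the selected records.
    The input target is not modified.
    """
    new_target = {pk: dict(row) for pk, row in target.items()}
    for item in group:
        pk = item["primary_key"]
        change = item.get("change_id")
        if change in ("insert", "update"):
            # Modified or newly added data: update (or create) the row.
            row = new_target.setdefault(pk, {})
            row.update(item["payload"])
            row["is_deleted"] = False
        elif change == "delete":
            # Logical deletion: keep the row but flag it as deleted.
            row = new_target.setdefault(pk, {})
            row["is_deleted"] = True
        else:
            # No change identifier: keep an existing row, else add the data as is.
            new_target.setdefault(pk, dict(item["payload"]))
    return new_target
```

Flagging rows instead of removing them keeps the deleted target data available in the new target database, which matches the logical deletion described in the claim.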

4. The method as claimed in claim 1, wherein said analyzing whether the source data contains a change identifier and/or an attribute identifier comprises:

5. The method as claimed in claim 1, wherein said analyzing whether the source data contains a change identifier and/or an attribute identifier comprises:

6. A large data stream data processing apparatus, comprising:

7. The big-data-stream data processing apparatus according to claim 6, wherein the second determining module comprises:

8. An electronic device, comprising a processor, a memory and a bus, wherein the memory stores machine-readable instructions executable by the processor, when the electronic device is running, the processor and the memory communicate with each other via the bus, and the processor executes the machine-readable instructions to perform the steps of the big data stream data processing method according to any one of claims 1 to 5.

9. A readable storage medium, characterized in that the readable storage medium stores a computer program which, when executed, implements the steps of the big data stream data processing method according to any one of claims 1 to 5.