CN116755804B - Assembled integrated big data processing method and system

Info

Publication number: CN116755804B
Application number: CN202310803713.0A
Authority: CN
Inventors: 唐虎强; 李宁
Original assignee: Hongyou Software Co ltd
Current assignee: Hongyou Software Co ltd
Priority date: 2023-07-03
Filing date: 2023-07-03
Publication date: 2024-04-26
Anticipated expiration: 2043-07-03
Also published as: CN116755804A

Abstract

The invention provides an assembled integrated big data processing method and system. The system comprises: a first definition module; a second definition module for predefining data objects; a dividing module; a data processing unit generating module; a data processing unit dependency relationship table establishing module; and a data processing unit execution module. The method and system have the following advantages: a big data processing system is flexibly assembled and constructed from scenes, technologies and algorithms, which realizes the integrated construction of fused applications of different technologies and of complex application scenes in big data processing, and effectively improves system development efficiency.

Description

Assembled integrated big data processing method and system

Technical Field

The invention belongs to the technical field of big data processing, and particularly relates to an assembled integrated big data processing method and system.

Background

In recent years, big data technology has developed rapidly, and informatization has entered a stage of 'big integration, high sharing and deep application'. With the wide application of the Internet of Things and intelligent equipment, structured, semi-structured and unstructured data have all shown explosive growth. Building a big data processing system with stronger processing capacity, easier expansion and higher performance makes it possible to meet the requirements of high computation, high storage and high load, to analyze and mine massive data, and to maximize the value of the data. However, big data processing technologies are numerous, and different technologies suit different types of data and different application scenarios. In addition, big data processing involves multiple links such as data acquisition, convergence, cleaning, aggregation and analysis, and the technology and business knowledge involved in each link differ, so designing and developing a complete big data processing system is technically difficult. At present, enterprises generally build multiple systems, including but not limited to a data acquisition system, a data management system, a data analysis system and a data index system, which are implemented step by step by different teams. As a result, the construction period is long, the construction cost is high, and it is difficult to respond quickly to rapid changes in data processing business requirements and to iterative updates of the technology.

Therefore, the following technical problem urgently needs to be solved in the field of big data processing: given the wide variety of big data technologies and of big data processing requirements, how to provide a method for constructing a big data processing system that integrates multiple big data technologies, serves multiple application scenarios and responds quickly to business and technology changes, thereby realizing an integrated data processing mechanism that covers all links of data processing.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides an assembled integrated big data processing method and system, which can effectively solve the problems.

The technical scheme adopted by the invention is as follows:

The invention provides an assembled integrated big data processing method, which comprises the following steps:

Step 1, predefining a data processing scene, a data processing service, a data processing technical component and a data processing algorithm plug-in; the data processing service is provided with a service calling interface for calling the data processing service; the data processing technology component is provided with a component calling interface used for calling the data processing technology component; the data processing algorithm plug-in is provided with a plug-in calling interface and is used for calling the data processing algorithm plug-in;

Step 2, predefining data objects; specifically, uniformly registering data source objects and data storage target objects as data objects, wherein each data object has a unique data object identifier;

Step 3, loading a data processing flow model which needs to be subjected to data processing; dividing the data processing flow model into n data processing scenes;

The data processing units corresponding to each data processing scene are generated, and the specific generation method is as follows:

1) From the data objects defined in Step 2, determining the data object that serves as the data source of the data processing unit, called the data source object, and the data object that serves as the storage target of the unit's execution result, called the data storage target object;

2) Screening out required data processing services, data processing technical components and data processing algorithm plug-ins according to the data processing scene requirements, assembling and packaging according to the calling relation among the data processing services, the data processing technical components and the data processing algorithm plug-ins, and associating data source objects and data storage target objects to obtain the configured data processing unit;

Step 4, establishing a data processing unit dependency relationship table according to the associations between data source objects and data storage target objects among the data processing units; the data processing unit dependency relationship table stores the dependency degree of each data processing unit and its dependency relationships with the other data processing units, and marks the execution state of each data processing unit, wherein the execution state is either an unexecuted state or an execution completed state;

Specifically, for the i-th data processing unit among the n data processing units, i=1,2,3,…,n, the dependency degree and the dependency relationships are calculated as follows:

For the j-th data processing unit among the other n-1 data processing units, j=1,2,3,…,n and j≠i: if a data object serving as a data storage target object of the j-th data processing unit is the same as a data object serving as a data source object of the i-th data processing unit, the i-th data processing unit depends on the j-th data processing unit, and the dependency degree of the i-th data processing unit is increased by 1. The other n-1 data processing units are compared in turn in this way, thereby obtaining the dependency degree and the dependency relationships of the i-th data processing unit;

Step 5, reading the dependency relationship table of the data processing units, extracting each data processing unit marked as an unexecuted state and having a dependency degree of 0, and forming a data processing unit set;

Step 6, traversing the data processing unit set, asynchronously executing each data processing unit in the data processing unit set, after the execution of the data processing unit is completed, storing an execution result to a corresponding data storage target object by the data processing unit, and updating an execution state mark of a data processing unit dependency relationship table to mark the data processing unit as an execution completion state;

In this step, the execution method of each data processing unit is as follows:

1) Reading a data source according to the configured data source object;

2) Calling data processing service through a service gateway route; the data processing service loads and calls the corresponding data processing technology component through a reflection technology; the data processing technology component loads and calls the corresponding data processing algorithm plug-in through a reflection technology;

The called data processing algorithm plug-in performs data processing on the data source, and returns a data processing intermediate result to the called data processing technology component; the called data processing technology component further performs data processing on the data processing intermediate result to obtain a data processing result, and returns the data processing result to the called data processing service; therefore, the called data processing service obtains a data processing result, namely the execution result;

Step 7, reading the current data processing unit dependency relationship table and judging whether any data processing unit marked as being in the unexecuted state exists; if not, ending the flow; if so, executing Step 8;

Step 8, updating the dependency degree and the dependency relationships in the data processing unit dependency relationship table to obtain an updated data processing unit dependency relationship table, wherein the updating method is as follows:

For the i-th data processing unit among the n data processing units, i=1,2,3,…,n: after its execution is completed, searching the data processing unit dependency relationship table for each data processing unit that is in the unexecuted state and depends on the i-th data processing unit, and subtracting 1 from the dependency degree of each data processing unit found;

Step 9, returning to Step 5 and executing cyclically.
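The loop formed by Steps 5 to 9 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the function name `run_units`, the `depends_on` mapping (standing in for the dependency relationship table produced in Step 4), the thread pool used for asynchronous execution, and the error raised when no unit is executable are all introduced here for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_units(depends_on, execute):
    """Execute units round by round; return the list of rounds.

    depends_on: dict mapping a unit id to the set of unit ids it depends on
                (an assumed, simplified form of the dependency table).
    execute:    callable that runs one unit and stores its own result.
    """
    table = {
        unit: {"degree": len(deps), "done": False}
        for unit, deps in depends_on.items()
    }
    rounds = []
    with ThreadPoolExecutor() as pool:
        # Step 7: continue while any unit is still unexecuted.
        while not all(row["done"] for row in table.values()):
            # Step 5: unexecuted units whose dependency degree is 0.
            ready = sorted(
                unit for unit, row in table.items()
                if not row["done"] and row["degree"] == 0
            )
            if not ready:
                # Not part of the source text: guard against cycles.
                raise ValueError("no executable unit: circular or unknown dependency")
            # Step 6: run the whole set concurrently and wait for completion.
            list(pool.map(execute, ready))
            for unit in ready:
                table[unit]["done"] = True
            rounds.append(ready)
            # Step 8: lower the degree of unexecuted units that depended
            # on a unit that has just completed.
            for unit in ready:
                for other, deps in depends_on.items():
                    if not table[other]["done"] and unit in deps:
                        table[other]["degree"] -= 1
    return rounds


print(run_units({"P1": set(), "P2": {"P1"}, "P3": {"P1", "P2"}}, lambda unit: None))
# [['P1'], ['P2'], ['P3']]
```

Units that become executable in the same round are independent of one another, which is why they can safely be submitted to the pool together.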

Preferably, each class of data processing services configures multiple versions of the data processing services; each type of data processing technology component configures multiple versions of the data processing technology component; each class of data processing algorithm plugins configures multiple versions of the data processing algorithm plugins.

Preferably, the data processing scene comprises a data acquisition scene, a data synchronization scene, a data aggregation scene, a data cleaning scene and a data analysis scene.

Preferably, the data acquisition scene is: realizing the data acquisition and storage of the data source to a designated data storage target object;

The data synchronization scene is: the data source is synchronized among different data storage target objects;

The data aggregation scene is: constructing dimensions for the basic data, and constructing a business wide table through dimension association;

The data cleaning scene is as follows: realizing the cleaning treatment and standardization treatment of dirty data;

The data analysis scene is: realizing data mining analysis.

Preferably, the assembly and encapsulation method comprises the following steps:

Constructing multiple classes of data processing services under each data processing scene, wherein each class of data processing service builds multiple data processing service versions to adapt to different running environments and changing business scenes;

Constructing multiple classes of data processing technology components under each class of data processing service, wherein each class of data processing technology component is loaded and called by the data processing service through a reflection technology according to a unified interface standard; each class of data processing technology component builds multiple data processing technology component versions to adapt to changes of the data processing service;

Constructing multiple classes of data processing algorithm plug-ins under each class of data processing technology component, wherein each class of data processing algorithm plug-in is loaded and called by the data processing technology component through a reflection technology according to a unified interface standard. According to the calling mode, the data processing algorithm plug-ins are of two types: the first is the internal loading call, which requires the implementation technology of the data processing algorithm plug-in to be consistent with that of the data processing technology component, and the data processing technology component calls the plug-in through an internal loading call method; the second is the HTTP call, in which the data processing technology component calls the data processing algorithm plug-in through an HTTP interface. Each class of data processing algorithm plug-in builds multiple plug-in versions to accommodate extensions and changes of the data processing logic.
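One way to keep several versions of each service, component and plug-in side by side is a small registry keyed by layer and name. The sketch below is an assumption made for illustration: the class `VersionedRegistry`, the layer label `"plugin"`, the plug-in name `"deduplicate"`, the tuple-based version numbers and the "latest version by default" rule do not come from the source text.

```python
class VersionedRegistry:
    """Holds every registered version of each service, component or plug-in."""

    def __init__(self):
        # (layer, name) -> {version tuple: implementation}
        self._versions = {}

    def register(self, layer, name, version, implementation):
        self._versions.setdefault((layer, name), {})[version] = implementation

    def resolve(self, layer, name, version=None):
        available = self._versions[(layer, name)]
        if version is None:
            # Assumed policy: fall back to the highest version;
            # tuples compare element by element.
            version = max(available)
        return available[version]


registry = VersionedRegistry()
# Version 1.0 keeps first-seen order, version 2.0 returns sorted output.
registry.register("plugin", "deduplicate", (1, 0), lambda rows: list(dict.fromkeys(rows)))
registry.register("plugin", "deduplicate", (2, 0), lambda rows: sorted(set(rows)))

latest = registry.resolve("plugin", "deduplicate")
pinned = registry.resolve("plugin", "deduplicate", version=(1, 0))
print(latest(["b", "a", "b"]))  # ['a', 'b']
print(pinned(["b", "a", "b"]))  # ['b', 'a']
```

Pinning a version lets an existing data processing unit keep its behaviour while newer units pick up the upgraded logic.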

The invention also provides a system for the assembled integrated big data processing method, comprising:

The first defining module is used for predefining a data processing scene, a data processing service, a data processing technical component and a data processing algorithm plug-in; the data processing service is provided with a service calling interface for calling the data processing service; the data processing technology component is provided with a component calling interface used for calling the data processing technology component; the data processing algorithm plug-in is provided with a plug-in calling interface and is used for calling the data processing algorithm plug-in;

A second definition module for predefining the data object; specifically, uniformly registering a data source object and a data storage target object as data objects, wherein each data object has a unique data object identifier;

The dividing module is used for loading a data processing flow model which needs to be subjected to data processing; dividing the data processing flow model into n data processing scenes;

The data processing unit generating module is used for generating data processing units corresponding to each data processing scene, and the specific generating method comprises the following steps:

1) From the defined data objects, determining the data object that serves as the data source of the data processing unit, namely the data source object, and the data object that serves as the storage target of the unit's execution result, namely the data storage target object;

The data processing unit dependency relationship table establishing module is used for establishing a data processing unit dependency relationship table according to the associations between data source objects and data storage target objects among the data processing units; the data processing unit dependency relationship table stores the dependency degree of each data processing unit and its dependency relationships with the other data processing units, and marks the execution state of each data processing unit, wherein the execution state is either an unexecuted state or an execution completed state;

A data processing unit execution module comprising:

Step 1, reading a data processing unit dependency relationship table, extracting each data processing unit marked as an unexecuted state and having a dependency degree of 0, and forming a data processing unit set;

Step 2, traversing the data processing unit set, asynchronously executing each data processing unit in the data processing unit set, after the execution of the data processing unit is completed, storing an execution result to a corresponding data storage target object by the data processing unit, and updating an execution state mark of a data processing unit dependency relationship table to mark the data processing unit as an execution completion state;

In this step, the execution method of each data processing unit is as follows:

1) Reading a data source according to the configured data source object;

Step 3, reading the current data processing unit dependency relationship table and judging whether any data processing unit marked as being in the unexecuted state exists; if not, ending the flow; if so, executing Step 4;

Step 4, updating the dependency degree and the dependency relationships in the data processing unit dependency relationship table to obtain an updated data processing unit dependency relationship table, wherein the updating method is as follows:

Step 5, returning to Step 1 and executing cyclically.

The method and the system for processing the assembled integrated big data have the following advantages:

The invention provides a big data processing system and method that are flexibly assembled and constructed from scenes, technologies and algorithms, realizes the integrated construction of fused applications of different technologies and of complex application scenes in big data processing, and effectively improves system development efficiency.

Drawings

FIG. 1 is a schematic flow chart of an assembled integrated big data processing method provided by the invention;

FIG. 2 is a schematic diagram of the logical structure of a data processing unit according to the present invention.

Detailed Description

In order to make the technical problems solved, the technical schemes adopted and the beneficial effects achieved by the invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Big data processing systems have varied application scenes and complex business logic that is hard to handle consistently, while the big data technology stack is huge and is updated so quickly that compatibility is difficult to maintain. To address these problems, the invention provides a big data processing system and method that are flexibly assembled and constructed from scenes, technologies and algorithms, realizing the integrated construction of fused applications of different technologies and of complex application scenes in big data processing, and effectively improving system development efficiency.

The invention provides an assembled integrated big data processing method, referring to FIG. 1, comprising the following steps:

In a specific implementation, each class of data processing service is configured with multiple versions of the data processing service; each class of data processing technology component is configured with multiple versions of the data processing technology component; each class of data processing algorithm plug-in is configured with multiple versions of the data processing algorithm plug-in.

In particular, the data processing scenes include, but are not limited to, a data acquisition scene, a data synchronization scene, a data aggregation scene, a data cleaning scene and a data analysis scene.

The data acquisition scene is as follows: realizing the data acquisition and storage of the data source to a designated data storage target object;

The data analysis scene is: realizing data mining analysis, including but not limited to the following. Data association analysis scene: realizing association analysis among multiple data tables. Index multidimensional aggregation analysis scene: realizing aggregation analysis of index items along different dimensions. Data mining analysis scene: realizing deep mining analysis of the data through machine learning algorithms.

The assembly and encapsulation method comprises the following steps:

In this step, the execution method of each data processing unit is as follows:

1) Reading a data source according to the configured data source object;

For the i-th data processing unit among the n data processing units, i=1,2,3,…,n: after its execution is completed, searching the data processing unit dependency relationship table for each data processing unit that is in the unexecuted state and depends on the i-th data processing unit, and subtracting 1 from the dependency degree of each data processing unit found;

Step 9, returning to Step 5 and executing cyclically.

For ease of understanding, examples are as follows:

For example:

data objects D1, D2, D3, D4, D5, and data processing units P1, P2, P3 are provided.

The data source of P1 is D1, and the data storage targets are D2 and D3;

The data source of P2 is D2, and the data storage target is D4;

the data sources of P3 are D3 and D4, and the data storage targets are D5.

Then:

For the data processing unit P1, the data source is D1, and D1 is not a data storage target of P2 and P3, so that the dependency of the data processing unit P1 is 0, and P1 does not depend on P2 and P3;

for the data processing unit P2, the data source is D2, and D2 is simultaneously used as the data storage target of P1, so that the dependency of the data processing unit P2 is 1, and the P2 depends on P1;

For the data processing unit P3, the data sources are D3 and D4, while D3 is simultaneously used as the data storage target of P1, and D4 is simultaneously used as the data storage target of P2, so that the dependency of the data processing unit P3 is 2, and the P3 depends on P1 and P2;

Thus, the data processing unit dependency relationship table is as shown in Table 1. The second row of the table describes the dependency degree and dependency relationships of P1; the third row, those of P2; the fourth row, those of P3.

TABLE 1

Unit	P1	P2	P3	Degree of dependence	Execution state
P1				0	Unexecuted state
P2	√			1	Unexecuted state
P3	√	√		2	Unexecuted state

Therefore, initially, the dependency of P1 is 0, P1 is executed first, and after the execution of P1 is completed, the execution state of P1 is updated to generate table 2:

TABLE 2

Unit	P1	P2	P3	Degree of dependence	Execution state
P1				0	Execution completed state
P2	√			1	Unexecuted state
P3	√	√		2	Unexecuted state

At this time, since there are P2 and P3 in the unexecuted state, it is necessary to update the dependency and the dependency relationship of table 2, and the update method is:

After P1 execution is completed, since P2 depends on P1, the dependency of P2 is reduced by 1; since P3 depends on P1, the dependency of P3 is reduced by 1, thereby generating table 3:

TABLE 3

Unit	P1	P2	P3	Degree of dependence	Execution state
P1				0	Execution completed state
P2	√			0	Unexecuted state
P3	√	√		1	Unexecuted state

Then, according to Table 3, P2 is unexecuted and has a dependency degree of 0, so P2 is executed and the next cycle begins.
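The dependency degrees in this example can be reproduced with a few lines of Python. The sketch below is illustrative only; the function names `build_dependency_table` and `mark_completed`, the dictionary layout and the state labels are assumptions, while the data objects D1 to D5 and the units P1 to P3 are taken from the example above.

```python
def build_dependency_table(units):
    """units: dict mapping unit id -> (set of source ids, set of target ids)."""
    table = {}
    for unit_id, (sources, _targets) in units.items():
        # Unit i depends on unit j when a storage target of j is a source of i.
        depends_on = sorted(
            other_id
            for other_id, (_sources, other_targets) in units.items()
            if other_id != unit_id and other_targets & sources
        )
        table[unit_id] = {
            "depends_on": depends_on,
            "degree": len(depends_on),
            "state": "unexecuted",
        }
    return table


def mark_completed(table, finished_id):
    """Mark one unit as completed and lower the degree of its dependants."""
    table[finished_id]["state"] = "completed"
    for row in table.values():
        if row["state"] == "unexecuted" and finished_id in row["depends_on"]:
            row["degree"] -= 1


units = {
    "P1": ({"D1"}, {"D2", "D3"}),
    "P2": ({"D2"}, {"D4"}),
    "P3": ({"D3", "D4"}, {"D5"}),
}

table_1 = build_dependency_table(units)   # corresponds to Table 1
table_3 = build_dependency_table(units)
mark_completed(table_3, "P1")             # corresponds to Table 3

print({u: row["degree"] for u, row in table_1.items()})  # {'P1': 0, 'P2': 1, 'P3': 2}
print({u: row["degree"] for u, row in table_3.items()})  # {'P1': 0, 'P2': 0, 'P3': 1}
```

Note that the degree counts depended-on units rather than shared data objects, which matches the rule that each matching unit raises the degree by 1.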

The key technology of the invention is as follows:

In the invention, a data processing technology component is constructed for the technical implementation within a single data processing scene. Different big data technology implementations are encapsulated in data processing technology components according to the data processing requirements, which shields the data processing service from the details of the component implementation. The data processing technology components are loaded and called by the data processing service through a reflection technology according to a unified interface standard; data processing services of the same class need to define a unified component interface standard, which facilitates reuse of the data processing technology components. Multiple data processing technology components can be built under each data processing service, and each data processing technology component can have multiple versions to accommodate changes in the data processing service and upgrades of big data technologies.
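Loading a component by reflection against a unified interface might look like the following Python sketch, which uses the standard `importlib` module. The interface `TechnologyComponent`, its `run` method, the loader `load_component` and the module name `demo_components` are hypothetical names chosen for this example; the source does not specify any language or interface signature.

```python
import importlib
import sys
import types
from abc import ABC, abstractmethod


class TechnologyComponent(ABC):
    """Assumed unified interface that every component of a service class implements."""

    @abstractmethod
    def run(self, records, parameters):
        ...


def load_component(dotted_path):
    """Load a component class from a 'module.ClassName' string and instantiate it."""
    module_name, _, class_name = dotted_path.rpartition(".")
    candidate = getattr(importlib.import_module(module_name), class_name)
    if not (isinstance(candidate, type) and issubclass(candidate, TechnologyComponent)):
        raise TypeError(f"{dotted_path} does not implement TechnologyComponent")
    return candidate()


class UppercaseComponent(TechnologyComponent):
    def run(self, records, parameters):
        return [record.upper() for record in records]


# For a self-contained demo, expose the class through an in-memory module;
# in practice the component would live in its own installable package.
demo_module = types.ModuleType("demo_components")
demo_module.UppercaseComponent = UppercaseComponent
sys.modules["demo_components"] = demo_module

component = load_component("demo_components.UppercaseComponent")
print(component.run(["a", "b"], {}))  # ['A', 'B']
```

Because the service only knows the configured path string and the interface, a different component or version can be substituted by changing configuration rather than service code.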

In the invention, a data processing algorithm plug-in is constructed for the complex logic units within a data processing scene. The data processing algorithm plug-ins are loaded and called by the data processing technology component through a reflection technology according to a unified interface standard, and are divided into two types according to the calling mode: the first is the internal loading call, which requires the implementation technology of the data processing algorithm plug-in to be consistent with that of the calling data processing technology component; the second is the HTTP call, which is used to integrate data processing algorithm services and is realized by calling an HTTP interface. Multiple data processing algorithm plug-ins may be built under each data processing technology component, and each data processing algorithm plug-in may have multiple versions to accommodate extensions and changes of the data processing logic.
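The two calling modes can sit behind one interface so that the component does not care which one it is using. The sketch below is an assumption for illustration: the class names `InternalPlugin` and `HttpPlugin`, the `invoke` method, the JSON payload shape `{"records": [...]}` and the URL `http://plugins.example/double` are all hypothetical. To keep the example runnable without a network, a stand-in `fake_urlopen` replaces `urllib.request.urlopen`.

```python
import io
import json
import urllib.request


class InternalPlugin:
    """Internal loading call: the plug-in runs in the component's own process."""

    def __init__(self, func):
        self._func = func

    def invoke(self, records):
        return self._func(records)


class HttpPlugin:
    """HTTP call: the plug-in is a remote algorithm service."""

    def __init__(self, url, urlopen=urllib.request.urlopen, timeout=10.0):
        self._url = url
        self._urlopen = urlopen
        self._timeout = timeout

    def invoke(self, records):
        body = json.dumps({"records": records}).encode("utf-8")
        request = urllib.request.Request(
            self._url,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with self._urlopen(request, timeout=self._timeout) as response:
            return json.loads(response.read().decode("utf-8"))["records"]


def fake_urlopen(request, timeout):
    """Stand-in transport for the demo: doubles each record, no network used."""
    records = json.loads(request.data.decode("utf-8"))["records"]
    reply = {"records": [record * 2 for record in records]}
    return io.BytesIO(json.dumps(reply).encode("utf-8"))


plugins = [
    InternalPlugin(lambda records: [record * 2 for record in records]),
    HttpPlugin("http://plugins.example/double", urlopen=fake_urlopen),
]
results = [plugin.invoke([1, 2, 3]) for plugin in plugins]
print(results)  # [[2, 4, 6], [2, 4, 6]]
```

The HTTP mode removes the requirement that the plug-in share the component's implementation technology, at the cost of serialization and a network round trip.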

In the invention, a single data processing scene is constructed as an independent data processing unit, and a user configures a data processing unit description model according to actual needs. As shown in FIG. 2, in one specific implementation the data processing unit description model consists of three parts, namely a data processing task, a data source object and a data storage target object:

Data processing task: describes the data processing service, data processing technology component and data processing algorithm plug-in called to execute the task, together with control parameters and the like.

Data source object: the data input to the data processing task; there may be more than one.

Data storage target object: the target in which the execution result of the data processing task is stored; there may be more than one.
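The three-part description model could be represented with plain data classes, as in the sketch below. All field names (`service`, `component`, `plugin`, `parameters`, `source_objects`, `target_objects`) and the sample values are assumptions for illustration; the source only states which three parts the model contains.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProcessingTask:
    """Which service, component and plug-in to call, plus control parameters."""
    service: str
    component: str
    plugin: str
    parameters: dict = field(default_factory=dict)


@dataclass(frozen=True)
class ProcessingUnit:
    """Description model: one task, one or more sources, one or more targets."""
    unit_id: str
    task: ProcessingTask
    source_objects: tuple
    target_objects: tuple


unit = ProcessingUnit(
    unit_id="P3",
    task=ProcessingTask(
        service="analysis-service",      # hypothetical identifiers
        component="sql-component",
        plugin="join-plugin",
        parameters={"join_key": "id"},
    ),
    source_objects=("D3", "D4"),
    target_objects=("D5",),
)
print(unit.source_objects, unit.target_objects)
```

Keeping the model as declarative data means an execution instance can be built from it at run time without code specific to each unit.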

It should be noted that, the system registers data entities with different sources and different structures as a unified data object.
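A unified registration of heterogeneous data entities might be sketched as follows. The class `DataObjectRegistry`, the `kind` and `location` descriptors, the sample entries and the sequential `D1`, `D2` identifiers are assumptions made for the example; the source only requires that each data object have a unique identifier.

```python
import itertools


class DataObjectRegistry:
    """Registers data entities of any kind under one unique identifier each."""

    def __init__(self):
        self._ids = {}       # (kind, location) -> identifier
        self._objects = {}   # identifier -> description
        self._counter = itertools.count(1)

    def register(self, kind, location):
        key = (kind, location)
        if key not in self._ids:
            object_id = f"D{next(self._counter)}"
            self._ids[key] = object_id
            self._objects[object_id] = {"kind": kind, "location": location}
        # Assumed behaviour: registering the same entity again returns
        # the identifier it already has.
        return self._ids[key]

    def describe(self, object_id):
        return self._objects[object_id]


data_objects = DataObjectRegistry()
orders_id = data_objects.register("relational_table", "crm.orders")
events_id = data_objects.register("json_file", "/raw/events.json")
print(orders_id, events_id)  # D1 D2
```

Because a unit's source and another unit's target resolve to the same identifier, the dependency relationship table can be derived by comparing identifiers alone.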

The data processing units are connected and combined according to service requirements, and are arranged into a step-by-step data processing flow, so that an integrated complex data processing scene comprising data acquisition, cleaning, convergence, index aggregation, association analysis, intelligent analysis and the like is realized.

The data processing units are connected through input and output data objects to establish a data processing unit dependency relationship table. Each data processing unit is executed asynchronously through the data processing unit dependency table.

The invention also provides a system for realizing the assembled integrated big data processing method, comprising:

The dividing module is used for loading a data processing flow model which needs to be subjected to data processing; dividing the data processing flow model into n data processing scenes;

A data processing unit execution module comprising:

In this step, the execution method of each data processing unit is as follows:

1) Reading a data source according to the configured data source object;

Step 5, returning to Step 1 and executing cyclically.

According to the invention, an execution instance of the data processing unit is automatically constructed from the information in the data processing unit description model, and the configured data processing service interface is called for execution. The data processing technology component is loaded and called within the data processing service through a reflection technology, and the data processing algorithm plug-in is loaded and called by the data processing technology component, so that data processing services, data processing technology components and data processing algorithm plug-ins can be flexibly plugged in and removed, which ensures high scalability of the system.
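The way a result travels from plug-in to component to service and finally into the storage targets can be illustrated with a short Python sketch. Everything named here is hypothetical: `trim_plugin`, `dedupe_component`, `cleaning_service`, `execute_unit` and the in-memory `store` dictionary stand in for real plug-ins, components, services and storage systems.

```python
def trim_plugin(records):
    """Algorithm plug-in: produces the intermediate result."""
    return [record.strip() for record in records]


def dedupe_component(records):
    """Technology component: calls the plug-in, then processes its output further."""
    intermediate = trim_plugin(records)
    return sorted(set(intermediate))


def cleaning_service(records):
    """Service: entry point that delegates to the component."""
    return dedupe_component(records)


def execute_unit(unit, store, service):
    # Read every configured data source object.
    records = [row for source in unit["sources"] for row in store[source]]
    # Call the service; the result comes back up the call chain.
    result = service(records)
    # Store the execution result in every data storage target object.
    for target in unit["targets"]:
        store[target] = list(result)
    return result


store = {"D1": [" a", "b ", "a"]}
unit = {"sources": ["D1"], "targets": ["D2", "D3"]}
execute_unit(unit, store, cleaning_service)
print(store["D2"], store["D3"])  # ['a', 'b'] ['a', 'b']
```

In the described system each call in this chain would be resolved by reflection from the unit's description model; direct function references are used here only to keep the flow visible.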

The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to be covered by the present invention.

Claims

1. An assembled integrated big data processing method is characterized by comprising the following steps:

Step 3, loading a data processing flow model which needs to be subjected to data processing; dividing the data processing flow model into n data processing scenes;

In this step, the execution method of each data processing unit is as follows:

1) Reading a data source according to the configured data source object;

Step 8, updating the dependency degree in the data processing unit dependency relationship table to obtain an updated data processing unit dependency relationship table, wherein the updating method is as follows:

Step 9, returning to Step 5 and executing cyclically.

2. The assembled integrated big data processing method according to claim 1, wherein each class of data processing service is configured with multiple versions of the data processing service; each class of data processing technology component is configured with multiple versions of the data processing technology component; and each class of data processing algorithm plug-in is configured with multiple versions of the data processing algorithm plug-in.

3. The method for processing the assembled integrated big data according to claim 1, wherein the data processing scene comprises a data acquisition scene, a data synchronization scene, a data aggregation scene, a data cleaning scene and a data analysis scene.

4. The assembled integrated big data processing method according to claim 3, wherein the data acquisition scene is: realizing the data acquisition and storage of the data source to a designated data storage target object;

The data analysis scene is: realizing data mining analysis.

5. The method for processing the assembled integrated big data according to claim 1, wherein the assembling and packaging method is as follows:

6. A system for the assembled integrated big data processing method according to any one of claims 1 to 5, comprising:

A data processing unit execution module comprising:

In this step, the execution method of each data processing unit is as follows:

1) Reading a data source according to the configured data source object;

Step 4, updating the dependency degree in the data processing unit dependency relationship table to obtain an updated data processing unit dependency relationship table, wherein the updating method is as follows:

Step 5, returning to Step 1 and executing cyclically.