CN116755804B - Assembled integrated big data processing method and system - Google Patents

Publication number
CN116755804B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310803713.0A
Other languages
Chinese (zh)
Other versions
CN116755804A (en)
Inventor
唐虎强
李宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hongyou Software Co ltd
Original Assignee
Hongyou Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hongyou Software Co ltd
Priority to CN202310803713.0A
Publication of CN116755804A
Application granted
Publication of CN116755804B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/448 Execution paradigms, e.g. implementations of programming paradigms
    • G06F9/4488 Object-oriented
    • G06F9/449 Object-oriented method invocation or resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/252 Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating
    • G06F9/44521 Dynamic linking or loading; Link editing at or after load time, e.g. Java class loading
    • G06F9/44526 Plug-ins; Add-ons
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/547 Remote procedure calls [RPC]; Web services
    • G06F9/548 Object oriented; Remote method invocation [RMI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/54 Indexing scheme relating to G06F9/54
    • G06F2209/549 Remote execution
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an assembled integrated big data processing method and system. The system comprises: a first definition module; a second definition module for predefining data objects; a dividing module; a data processing unit generating module; a data processing unit dependency table establishing module; and a data processing unit execution module. The method and system have the following advantages: the invention provides a big data processing system and method that are flexibly assembled from scenarios, technologies, and algorithms, achieving the integrated construction of applications that fuse different technologies and of complex application scenarios in big data processing, and effectively improving system development efficiency.

Description

Assembled integrated big data processing method and system
Technical Field
The invention belongs to the technical field of big data processing, and particularly relates to an assembled integrated big data processing method and system.
Background
In recent years, big data technology has developed rapidly, and informatization has entered a stage of 'big integration, high sharing and deep application'. With the wide adoption of the Internet of Things and smart devices, structured, semi-structured, and unstructured data of all kinds have grown explosively. Building a big data processing system with stronger processing capacity, easier scalability, and higher performance makes it possible to meet the demands of heavy computation, large-scale storage, and high load, to analyze and mine massive data, and to maximize the value of the data. However, big data processing technologies are numerous, and different technologies suit different types of data and different application scenarios. In addition, big data processing involves multiple stages, such as data acquisition, aggregation, cleaning, and analysis, and the technologies and business knowledge involved differ from stage to stage, so designing and developing a complete big data processing system is technically difficult. At present, enterprises generally build multiple systems, including but not limited to a data acquisition system, a data management system, a data analysis system, and a data index system, which are implemented step by step by different teams. As a result, the construction period is long, the construction cost is high, and it is difficult to respond quickly to rapidly changing data processing business requirements and to iterative updates of the technology.
Therefore, the following technical problem urgently needs to be solved in the field of big data processing: given the wide variety of big data technologies and of big data processing requirements, how to provide a method for constructing a big data processing system that integrates multiple big data technologies, serves multiple application scenarios, and responds quickly to business and technology changes, thereby realizing an integrated data processing mechanism that covers every stage of data processing.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides an assembled integrated big data processing method and system that effectively solve the problems described above.
The technical scheme adopted by the invention is as follows:
The invention provides an assembled integrated big data processing method, which comprises the following steps:
Step 1: predefine data processing scenarios, data processing services, data processing technology components, and data processing algorithm plug-ins. Each data processing service has a service calling interface through which the service is called; each data processing technology component has a component calling interface through which the component is called; each data processing algorithm plug-in has a plug-in calling interface through which the plug-in is called.
Step 2: predefine data objects. Specifically, register data source objects and data storage target objects uniformly as data objects, each with a unique data object identifier.
Step 3: load the data processing flow model on which data processing is to be performed, and divide the data processing flow model into n data processing scenarios.
Generate the data processing unit corresponding to each data processing scenario, as follows:
1) From the data objects defined in Step 2, determine the data object that serves as the data source of the data processing unit, called the data source object, and the data object that serves as the storage target for the unit's execution result, called the data storage target object.
2) According to the requirements of the data processing scenario, select the required data processing services, data processing technology components, and data processing algorithm plug-ins; assemble and encapsulate them according to the calling relationships among them; and associate the data source object and the data storage target object, thereby obtaining the configured data processing unit.
Step 4: establish a data processing unit dependency table according to the associations between data source objects and data storage target objects across the data processing units. The dependency table stores the dependency degree of each data processing unit and its dependency relationships with the other data processing units, and marks the execution state of each data processing unit; the execution state is either unexecuted or execution completed.
Specifically, for the i-th of the n data processing units, i = 1, 2, 3, …, n, the dependency degree and dependency relationships are calculated as follows:
For the j-th of the other n-1 data processing units, j = 1, 2, 3, …, n, j ≠ i: if a data storage target object of the j-th data processing unit is the same data object as a data source object of the i-th data processing unit, then the i-th data processing unit depends on the j-th data processing unit, and the dependency degree of the i-th data processing unit is increased by 1. Comparing the i-th unit against each of the other n-1 data processing units in turn yields its dependency degree and dependency relationships.
Step 5: read the data processing unit dependency table and extract every data processing unit that is marked as unexecuted and has a dependency degree of 0, forming a data processing unit set.
Step 6: traverse the data processing unit set and execute each data processing unit in it asynchronously. When a data processing unit finishes executing, it stores its execution result in the corresponding data storage target object, and the execution state mark in the dependency table is updated to mark that unit as execution completed.
In this step, each data processing unit is executed as follows:
1) Read the data source according to the configured data source object.
2) Call the data processing service through service gateway routing. The data processing service loads and calls the corresponding data processing technology component through reflection, and the data processing technology component loads and calls the corresponding data processing algorithm plug-in through reflection.
The called data processing algorithm plug-in processes the data source and returns an intermediate result to the called data processing technology component; the component processes the intermediate result further to obtain the data processing result and returns it to the called data processing service. The data processing service thereby obtains the data processing result, which is the execution result.
Step 7: read the current data processing unit dependency table and determine whether any data processing unit is still marked as unexecuted. If none is, end the flow; if one is, execute Step 8.
Step 8: update the dependency degrees and dependency relationships in the data processing unit dependency table to obtain an updated dependency table, as follows:
For the i-th of the n data processing units, i = 1, 2, 3, …, n, once its execution is completed, search the dependency table for every data processing unit that is in the unexecuted state and depends on the i-th data processing unit, and subtract 1 from the dependency degree of each unit found.
Step 9: return to Step 5 and repeat.
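The dependency bookkeeping in Steps 4, 5, and 8 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the class names `ProcessingUnit` and `TableRow`, the helper functions, and the sample unit and data object names are all hypothetical, and a unit is reduced to the identifiers of its source and target data objects.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProcessingUnit:
    """A configured data processing unit, reduced to its data objects (hypothetical)."""
    name: str
    sources: frozenset   # identifiers of data source objects
    targets: frozenset   # identifiers of data storage target objects


@dataclass
class TableRow:
    """One row of the dependency table (hypothetical layout)."""
    depends_on: set = field(default_factory=set)  # names of upstream units
    degree: int = 0                               # dependency degree
    done: bool = False                            # False means unexecuted


def build_dependency_table(units):
    """Step 4: unit i depends on unit j when a target of j is a source of i."""
    table = {u.name: TableRow() for u in units}
    for i in units:
        for j in units:
            if j.name != i.name and j.targets & i.sources:
                table[i.name].depends_on.add(j.name)
                table[i.name].degree += 1
    return table


def ready_units(table):
    """Step 5: unexecuted units whose dependency degree is 0."""
    return sorted(n for n, row in table.items() if not row.done and row.degree == 0)


def mark_done(table, finished):
    """Steps 6 and 8: mark a unit completed and decrement its unexecuted dependents."""
    table[finished].done = True
    for row in table.values():
        if not row.done and finished in row.depends_on:
            row.degree -= 1


units = [
    ProcessingUnit("acquire", frozenset({"src_db"}), frozenset({"raw"})),
    ProcessingUnit("clean", frozenset({"raw"}), frozenset({"cleaned"})),
    ProcessingUnit("analyze", frozenset({"cleaned", "raw"}), frozenset({"report"})),
]
table = build_dependency_table(units)
```

In this sample, only `acquire` is runnable at first; calling `mark_done(table, "acquire")` lowers the dependency degree of both `clean` and `analyze`, which makes `clean` runnable next.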
Preferably, multiple versions are configured for each class of data processing service, for each class of data processing technology component, and for each class of data processing algorithm plug-in.
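One way to picture multi-version configuration is a registry keyed by kind, name, and version. The sketch below is an assumption for illustration only: the text does not say how a version is chosen, so the rule "use the highest version unless one is pinned", the class `VersionRegistry`, and the `dedupe` plug-in are all hypothetical.

```python
class VersionRegistry:
    """Maps (kind, name) to the versions registered for it (hypothetical design)."""

    def __init__(self):
        self._entries = {}

    def register(self, kind, name, version, impl):
        # version is a tuple such as (1, 0) so that versions compare element by element
        self._entries.setdefault((kind, name), {})[version] = impl

    def resolve(self, kind, name, version=None):
        versions = self._entries[(kind, name)]
        if version is None:
            version = max(versions)  # assumed default: the highest registered version
        return versions[version]


registry = VersionRegistry()
# Two versions of the same hypothetical plug-in with different processing logic.
registry.register("plugin", "dedupe", (1, 0), lambda rows: list(dict.fromkeys(rows)))
registry.register("plugin", "dedupe", (1, 1), lambda rows: sorted(set(rows)))

pinned = registry.resolve("plugin", "dedupe", (1, 0))   # explicit version
latest = registry.resolve("plugin", "dedupe")           # highest version
```

Pinning lets an existing data processing unit keep its behavior while newer units pick up the latest version.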
Preferably, the data processing scenarios comprise a data acquisition scenario, a data synchronization scenario, a data aggregation scenario, a data cleansing scenario, and a data analysis scenario.
Preferably, the data acquisition scenario acquires data from the data source and stores it in a designated data storage target object.
The data synchronization scenario synchronizes data from the data source among different data storage target objects.
The data aggregation scenario builds dimensions for the base data and constructs a business wide table through dimension association.
The data cleansing scenario cleans and standardizes dirty data.
The data analysis scenario performs data mining and analysis.
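As an illustration of the aggregation scenario, building a wide table through dimension association amounts to attaching dimension attributes to each base record. The following sketch uses plain Python dictionaries; the function `build_wide_table` and the sample order, region, and product data are invented for the example and do not come from the patent.

```python
def build_wide_table(facts, dimensions):
    """Attach the attributes of each referenced dimension row to every base row.

    facts: list of dicts; dimensions: {key_field: {key_value: attribute_dict}}.
    """
    wide = []
    for fact in facts:
        row = dict(fact)  # copy so the base data is left untouched
        for key_field, lookup in dimensions.items():
            # Unknown or missing keys simply contribute no extra columns.
            row.update(lookup.get(fact.get(key_field), {}))
        wide.append(row)
    return wide


orders = [
    {"order_id": 1, "region_id": "N", "product_id": "p1", "amount": 120},
    {"order_id": 2, "region_id": "S", "product_id": "p2", "amount": 80},
]
dims = {
    "region_id": {"N": {"region": "North"}, "S": {"region": "South"}},
    "product_id": {"p1": {"product": "Pump"}, "p2": {"product": "Valve"}},
}
wide_rows = build_wide_table(orders, dims)
```

Each row of `wide_rows` carries the original order fields plus the `region` and `product` attributes, so downstream analysis can group by either dimension without further joins.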
Preferably, the assembly and encapsulation method is as follows:
Multiple classes of data processing services are constructed under each data processing scenario, and multiple versions are constructed for each class of data processing service, to adapt to different runtime environments and changes in business scenarios.
Multiple classes of data processing technology components are constructed under each class of data processing service. Each class of data processing technology component conforms to a unified interface standard and is loaded and called by the data processing service through reflection. Multiple versions are constructed for each class of data processing technology component, to adapt to changes in the data processing service.
Multiple classes of data processing algorithm plug-ins are constructed under each class of data processing technology component. Each class of data processing algorithm plug-in conforms to a unified interface standard and is loaded and called by the data processing technology component through reflection. There are two calling modes. The first is an internal loading call, which requires the plug-in to be implemented in the same technology as the data processing technology component; the component calls the plug-in through an internal loading call method. The second is an HTTP call, in which the data processing technology component calls the data processing algorithm plug-in through an HTTP interface. Multiple versions are constructed for each class of data processing algorithm plug-in, to accommodate extensions and changes to the data processing logic.
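The two plug-in calling modes can be sketched as a small dispatcher. This is a hedged illustration, not the patent's implementation: the descriptor dictionary and the function names are hypothetical, Python's `importlib` stands in for the reflection mechanism, and the standard-library function `statistics.mean` stands in for an algorithm plug-in. The HTTP branch assumes a plug-in that accepts and returns JSON, which the source does not specify.

```python
import importlib
import json
import urllib.request


def call_internal(module_name, func_name, payload):
    """Internal loading call: resolve the plug-in by name at run time, call it in-process."""
    plugin = getattr(importlib.import_module(module_name), func_name)
    return plugin(payload)


def call_http(url, payload, timeout=10):
    """HTTP call: POST the payload as JSON to a plug-in behind an HTTP interface (assumed)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def call_plugin(descriptor, payload):
    """Dispatch on the calling mode recorded in a hypothetical plug-in descriptor."""
    if descriptor["mode"] == "internal":
        return call_internal(descriptor["module"], descriptor["function"], payload)
    if descriptor["mode"] == "http":
        return call_http(descriptor["url"], payload)
    raise ValueError(f"unknown call mode: {descriptor['mode']}")


# statistics.mean stands in for an algorithm plug-in loaded by name.
mean_plugin = {"mode": "internal", "module": "statistics", "function": "mean"}
result = call_plugin(mean_plugin, [2, 4, 9])
```

Because the component only sees the descriptor, a plug-in written in a different technology can be swapped in by changing the mode to `http` and supplying a URL, without touching the component's code.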
The invention also provides a system for the assembled integrated big data processing method, comprising:
A first definition module, configured to predefine data processing scenarios, data processing services, data processing technology components, and data processing algorithm plug-ins. Each data processing service has a service calling interface through which the service is called; each data processing technology component has a component calling interface through which the component is called; each data processing algorithm plug-in has a plug-in calling interface through which the plug-in is called.
A second definition module, configured to predefine data objects; specifically, to register data source objects and data storage target objects uniformly as data objects, each with a unique data object identifier.
A dividing module, configured to load the data processing flow model on which data processing is to be performed and to divide the data processing flow model into n data processing scenarios.
A data processing unit generating module, configured to generate the data processing unit corresponding to each data processing scenario, as follows:
1) From the defined data objects, determine the data object that serves as the data source of the data processing unit, called the data source object, and the data object that serves as the storage target for the unit's execution result, called the data storage target object.
2) According to the requirements of the data processing scenario, select the required data processing services, data processing technology components, and data processing algorithm plug-ins; assemble and encapsulate them according to the calling relationships among them; and associate the data source object and the data storage target object, thereby obtaining the configured data processing unit.
A data processing unit dependency table establishing module, configured to establish a data processing unit dependency table according to the associations between data source objects and data storage target objects across the data processing units. The dependency table stores the dependency degree of each data processing unit and its dependency relationships with the other data processing units, and marks the execution state of each data processing unit; the execution state is either unexecuted or execution completed.
Specifically, for the i-th of the n data processing units, i = 1, 2, 3, …, n, the dependency degree and dependency relationships are calculated as follows:
For the j-th of the other n-1 data processing units, j = 1, 2, 3, …, n, j ≠ i: if a data storage target object of the j-th data processing unit is the same data object as a data source object of the i-th data processing unit, then the i-th data processing unit depends on the j-th data processing unit, and the dependency degree of the i-th data processing unit is increased by 1. Comparing the i-th unit against each of the other n-1 data processing units in turn yields its dependency degree and dependency relationships.
A data processing unit execution module, configured to perform the following steps:
Step 1: read the data processing unit dependency table and extract every data processing unit that is marked as unexecuted and has a dependency degree of 0, forming a data processing unit set.
Step 2: traverse the data processing unit set and execute each data processing unit in it asynchronously. When a data processing unit finishes executing, it stores its execution result in the corresponding data storage target object, and the execution state mark in the dependency table is updated to mark that unit as execution completed.
In this step, each data processing unit is executed as follows:
1) Read the data source according to the configured data source object.
2) Call the data processing service through service gateway routing. The data processing service loads and calls the corresponding data processing technology component through reflection, and the data processing technology component loads and calls the corresponding data processing algorithm plug-in through reflection.
The called data processing algorithm plug-in processes the data source and returns an intermediate result to the called data processing technology component; the component processes the intermediate result further to obtain the data processing result and returns it to the called data processing service. The data processing service thereby obtains the data processing result, which is the execution result.
Step 3: read the current data processing unit dependency table and determine whether any data processing unit is still marked as unexecuted. If none is, end the flow; if one is, execute Step 4.
Step 4: update the dependency degrees and dependency relationships in the data processing unit dependency table to obtain an updated dependency table, as follows:
For the i-th of the n data processing units, i = 1, 2, 3, …, n, once its execution is completed, search the dependency table for every data processing unit that is in the unexecuted state and depends on the i-th data processing unit, and subtract 1 from the dependency degree of each unit found.
Step 5: return to Step 1 and repeat.
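The execution module's loop can be sketched with Python's `asyncio`. This is an illustrative sketch under assumptions rather than the patented implementation: the dependency table is reduced to a mapping from unit name to the set of upstream unit names, `fake_execute` is a placeholder for the real read, service call, and store sequence, and the circular-dependency check is an added safeguard that the text does not describe.

```python
import asyncio


async def run_all(units, execute):
    """units maps a unit name to the set of upstream unit names it depends on."""
    degree = {name: len(deps) for name, deps in units.items()}
    done = {name: False for name in units}
    batches = []  # record of which units were launched together

    while not all(done.values()):                      # Step 3: anything unexecuted?
        # Step 1: unexecuted units with dependency degree 0
        batch = sorted(n for n in units if not done[n] and degree[n] == 0)
        if not batch:
            # Added safeguard (not in the source): avoid looping forever on a cycle.
            raise RuntimeError("no runnable unit: circular dependency")
        # Step 2: execute the whole set concurrently
        await asyncio.gather(*(execute(name) for name in batch))
        batches.append(batch)
        for finished in batch:
            done[finished] = True
            # Step 4: decrement every unexecuted unit that depends on the finished one
            for name, deps in units.items():
                if not done[name] and finished in deps:
                    degree[name] -= 1
    return batches                                     # Step 5 is the while loop itself


async def fake_execute(name):
    await asyncio.sleep(0)  # placeholder for read source -> call service -> store result


graph = {
    "acquire": set(),
    "sync": set(),
    "clean": {"acquire"},
    "analyze": {"clean", "sync"},
}
batches = asyncio.run(run_all(graph, fake_execute))
```

With this sample graph, `acquire` and `sync` run together in the first pass, `clean` in the second, and `analyze` in the third, since it must wait for both `clean` and `sync`.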
The assembled integrated big data processing method and system provided by the invention have the following advantages:
The invention provides a big data processing system and method that are flexibly assembled from scenarios, technologies, and algorithms, achieving the integrated construction of applications that fuse different technologies and of complex application scenarios in big data processing, and effectively improving system development efficiency.
Drawings
FIG. 1 is a schematic flow chart of an assembled integrated big data processing method provided by the invention;
FIG. 2 is a schematic diagram of the logical structure of a data processing unit according to the present invention.
Detailed Description
To make the technical problems solved by the invention, its technical solutions, and its beneficial effects clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein serve only to illustrate the invention and are not intended to limit its scope.
Big data processing systems face problems such as varied application scenarios, complex business logic that is difficult to handle consistently, and a large big data technology stack whose rapid iteration makes compatibility difficult. To address these problems, the invention provides a big data processing system and method that are flexibly assembled from scenarios, technologies, and algorithms, achieving the integrated construction of applications that fuse different technologies and of complex application scenarios in big data processing, and effectively improving system development efficiency.
The invention provides an assembled integrated big data processing method, referring to FIG. 1, comprising the following steps:
Step 1: predefine data processing scenarios, data processing services, data processing technology components, and data processing algorithm plug-ins. Each data processing service has a service calling interface through which the service is called; each data processing technology component has a component calling interface through which the component is called; each data processing algorithm plug-in has a plug-in calling interface through which the plug-in is called.
In a specific implementation, multiple versions are configured for each class of data processing service, for each class of data processing technology component, and for each class of data processing algorithm plug-in.
Step 2: predefine data objects. Specifically, register data source objects and data storage target objects uniformly as data objects, each with a unique data object identifier.
Step 3: load the data processing flow model on which data processing is to be performed, and divide the data processing flow model into n data processing scenarios.
In particular, the data processing scenarios include, but are not limited to, a data acquisition scenario, a data synchronization scenario, a data aggregation scenario, a data cleansing scenario, and a data analysis scenario.
The data acquisition scenario acquires data from the data source and stores it in a designated data storage target object.
The data synchronization scenario synchronizes data from the data source among different data storage target objects.
The data aggregation scenario builds dimensions for the base data and constructs a business wide table through dimension association.
The data cleansing scenario cleans and standardizes dirty data.
The data analysis scenario performs data mining and analysis, including but not limited to the following. Data association analysis scenario: performs association analysis across multiple data tables. Index multidimensional aggregation analysis scenario: performs aggregation analysis of index items along different dimensions. Data mining analysis scenario: performs in-depth mining and analysis of the data through machine learning algorithms.
Generate the data processing unit corresponding to each data processing scenario, as follows:
1) From the data objects defined in Step 2, determine the data object that serves as the data source of the data processing unit, called the data source object, and the data object that serves as the storage target for the unit's execution result, called the data storage target object.
2) According to the requirements of the data processing scenario, select the required data processing services, data processing technology components, and data processing algorithm plug-ins; assemble and encapsulate them according to the calling relationships among them; and associate the data source object and the data storage target object, thereby obtaining the configured data processing unit.
The assembly and encapsulation method is as follows:
Multiple classes of data processing services are constructed under each data processing scenario, and multiple versions are constructed for each class of data processing service, to adapt to different runtime environments and changes in business scenarios.
Multiple classes of data processing technology components are constructed under each class of data processing service. Each class of data processing technology component conforms to a unified interface standard and is loaded and called by the data processing service through reflection. Multiple versions are constructed for each class of data processing technology component, to adapt to changes in the data processing service.
Multiple classes of data processing algorithm plug-ins are constructed under each class of data processing technology component. Each class of data processing algorithm plug-in conforms to a unified interface standard and is loaded and called by the data processing technology component through reflection. There are two calling modes. The first is an internal loading call, which requires the plug-in to be implemented in the same technology as the data processing technology component; the component calls the plug-in through an internal loading call method. The second is an HTTP call, in which the data processing technology component calls the data processing algorithm plug-in through an HTTP interface. Multiple versions are constructed for each class of data processing algorithm plug-in, to accommodate extensions and changes to the data processing logic.
Step 4: establish a data processing unit dependency table according to the associations between data source objects and data storage target objects across the data processing units. The dependency table stores the dependency degree of each data processing unit and its dependency relationships with the other data processing units, and marks the execution state of each data processing unit; the execution state is either unexecuted or execution completed.
Specifically, for the i-th of the n data processing units, i = 1, 2, 3, …, n, the dependency degree and dependency relationships are calculated as follows:
For the j-th of the other n-1 data processing units, j = 1, 2, 3, …, n, j ≠ i: if a data storage target object of the j-th data processing unit is the same data object as a data source object of the i-th data processing unit, then the i-th data processing unit depends on the j-th data processing unit, and the dependency degree of the i-th data processing unit is increased by 1. Comparing the i-th unit against each of the other n-1 data processing units in turn yields its dependency degree and dependency relationships.
Step 5, reading the dependency relationship table of the data processing units, extracting each data processing unit marked as an unexecuted state and having a dependency degree of 0, and forming a data processing unit set;
Step 6, traversing the data processing unit set, asynchronously executing each data processing unit in the data processing unit set, after the execution of the data processing unit is completed, storing an execution result to a corresponding data storage target object by the data processing unit, and updating an execution state mark of a data processing unit dependency relationship table to mark the data processing unit as an execution completion state;
In this step, the execution method of each data processing unit is as follows:
1) Reading a data source according to the configured data source object;
2) Calling data processing service through a service gateway route; the data processing service loads and calls the corresponding data processing technology component through a reflection technology; the data processing technology component loads and calls the corresponding data processing algorithm plug-in through a reflection technology;
The called data processing algorithm plug-in performs data processing on the data source, and returns a data processing intermediate result to the called data processing technology component; the called data processing technology component further performs data processing on the data processing intermediate result to obtain a data processing result, and returns the data processing result to the called data processing service; therefore, the called data processing service obtains a data processing result, namely the execution result;
Step 7, reading the current data processing unit dependency relationship table and judging whether any data processing unit marked as the unexecuted state exists; if not, ending the flow; if so, executing step 8;
Step 8, updating the dependency degrees and the dependency relationships in the data processing unit dependency relationship table to obtain an updated data processing unit dependency relationship table, wherein the updating method is as follows:
For the ith data processing unit among the n data processing units, i = 1, 2, 3, …, n, after its execution is completed, searching the data processing unit dependency relationship table for each data processing unit that is in the unexecuted state and depends on the ith data processing unit, and subtracting 1 from the dependency degree of each data processing unit found;
Step 9, returning to step 5 and executing cyclically.
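Steps 4 to 9 can be illustrated with the following minimal Python sketch. It is not the implementation of the invention: the function names `build_table`, `run_unit` and `schedule`, the dictionary layout of the table, and the check for circular dependencies are all assumptions introduced for illustration, and `run_unit` is only a placeholder for the real execution of a data processing unit through the service gateway.

```python
import asyncio


def build_table(units):
    """Step 4: build the dependency relationship table.

    units maps a unit name to (source object ids, target object ids).
    Unit i depends on unit j when a target of j is also a source of i.
    """
    table = {}
    for i, (sources_i, _) in units.items():
        deps = {
            j
            for j, (_, targets_j) in units.items()
            if j != i and set(targets_j) & set(sources_i)
        }
        table[i] = {"deps": deps, "degree": len(deps), "done": False}
    return table


async def run_unit(name):
    # Placeholder (assumption): a real unit would read its data sources,
    # call the data processing service and store the execution result.
    await asyncio.sleep(0)
    return name


async def schedule(units):
    """Steps 5 to 9: run units in rounds until none is unexecuted."""
    table = build_table(units)
    rounds = []
    # Step 7: continue while some unit is still in the unexecuted state.
    while any(not row["done"] for row in table.values()):
        # Step 5: collect unexecuted units whose dependency degree is 0.
        ready = sorted(
            name
            for name, row in table.items()
            if not row["done"] and row["degree"] == 0
        )
        if not ready:
            # Not part of the described method: guards against cycles.
            raise RuntimeError("circular dependency between units")
        # Step 6: execute the whole set asynchronously, then mark it done.
        await asyncio.gather(*(run_unit(name) for name in ready))
        for name in ready:
            table[name]["done"] = True
        # Step 8: decrement the degree of unexecuted dependants.
        for name in ready:
            for row in table.values():
                if not row["done"] and name in row["deps"]:
                    row["deps"].discard(name)
                    row["degree"] -= 1
        rounds.append(ready)  # Step 9: loop back to step 5.
    return rounds
```

Calling `asyncio.run(schedule(units))` returns the units grouped by execution round; units within one round have no unfinished dependencies and may therefore run concurrently.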
For ease of understanding, an example is given below:
data objects D1, D2, D3, D4, D5, and data processing units P1, P2, P3 are provided.
The data source of P1 is D1, and the data storage targets are D2 and D3;
The data source of P2 is D2, and the data storage target is D4;
The data sources of P3 are D3 and D4, and the data storage target is D5.
Then:
For the data processing unit P1, the data source is D1, and D1 is not a data storage target of P2 or P3, so the dependency degree of P1 is 0, and P1 does not depend on P2 or P3;
For the data processing unit P2, the data source is D2, and D2 is also a data storage target of P1, so the dependency degree of P2 is 1, and P2 depends on P1;
For the data processing unit P3, the data sources are D3 and D4; D3 is also a data storage target of P1, and D4 is also a data storage target of P2, so the dependency degree of P3 is 2, and P3 depends on P1 and P2;
Thus, the data processing unit dependency relationship table is as shown in table 1: the second row describes the dependency degree and dependency relationships of P1; the third row describes those of P2; the fourth row describes those of P3;
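TABLE 1
Unit | Depends on P1 | Depends on P2 | Depends on P3 | Degree of dependence | Execution state
P1 | - | - | - | 0 | Unexecuted state
P2 | Yes | - | - | 1 | Unexecuted state
P3 | Yes | Yes | - | 2 | Unexecuted state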
Therefore, initially, the dependency of P1 is 0, P1 is executed first, and after the execution of P1 is completed, the execution state of P1 is updated to generate table 2:
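TABLE 2
Unit | Depends on P1 | Depends on P2 | Depends on P3 | Degree of dependence | Execution state
P1 | - | - | - | 0 | Execution completed state
P2 | Yes | - | - | 1 | Unexecuted state
P3 | Yes | Yes | - | 2 | Unexecuted state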
At this time, since there are P2 and P3 in the unexecuted state, it is necessary to update the dependency and the dependency relationship of table 2, and the update method is:
After P1 execution is completed, since P2 depends on P1, the dependency of P2 is reduced by 1; since P3 depends on P1, the dependency of P3 is reduced by 1, thereby generating table 3:
TABLE 3
Unit | Depends on P1 | Depends on P2 | Depends on P3 | Degree of dependence | Execution state
P1 | - | - | - | 0 | Execution completed state
P2 | - | - | - | 0 | Unexecuted state
P3 | - | Yes | - | 1 | Unexecuted state
Then, according to table 3, P2 is executed, and the next cycle begins.
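The dependency degrees in tables 1 and 3 can be reproduced with a short Python sketch. The names `UNITS`, `depends_on` and `complete` are hypothetical and serve only to illustrate the example above; the sketch assumes that a completed unit is removed from the dependency relationships of every unexecuted unit.

```python
# Data sources and data storage targets of P1, P2 and P3 from the example.
UNITS = {
    "P1": {"sources": {"D1"}, "targets": {"D2", "D3"}},
    "P2": {"sources": {"D2"}, "targets": {"D4"}},
    "P3": {"sources": {"D3", "D4"}, "targets": {"D5"}},
}


def depends_on(units):
    """Return, for each unit, the set of units it depends on."""
    return {
        i: {
            j
            for j in units
            if j != i and units[j]["targets"] & units[i]["sources"]
        }
        for i in units
    }


def complete(deps, done, finished):
    """Mark `finished` as completed and drop it from unexecuted units."""
    done = done | {finished}
    deps = {
        unit: (d if unit in done else d - {finished})
        for unit, d in deps.items()
    }
    return deps, done


deps = depends_on(UNITS)
table1 = {unit: len(d) for unit, d in deps.items()}

deps_after_p1, done = complete(deps, set(), "P1")
table3 = {unit: len(d) for unit, d in deps_after_p1.items()}

print(table1)  # {'P1': 0, 'P2': 1, 'P3': 2}
print(table3)  # {'P1': 0, 'P2': 0, 'P3': 1}
```

After P1 completes, P2 is the only unexecuted unit with dependency degree 0, which is why it is executed next.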
The key technology of the invention is as follows:
In the invention, a data processing technology component is constructed for the technical implementation in a single data processing scene; different big data technology implementations are encapsulated in data processing technology components according to the data processing requirements, which shields the data processing service from the implementation details of the data processing technology components. The data processing technology components are loaded and called by the data processing service through a reflection technology according to a unified interface standard, and data processing services of the same class need to define a unified component interface standard, which facilitates reuse of the data processing technology components. Multiple data processing technology components can be built under each data processing service, and each data processing technology component can be built in multiple versions to accommodate changes in the data processing service and upgrades of big data technologies.
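As an illustration of loading a component by reflection against a unified interface, the following Python sketch resolves a class from a configured dotted path with `importlib`. The module name `demo_components`, the classes `CleanComponentV1` and `CleanComponentV2`, and the interface method `process` are assumptions made for this example; the invention does not specify a language or an interface signature.

```python
import importlib
import sys
import types


def load_by_reflection(dotted_path, required_method="process"):
    """Instantiate a component class named by a 'module.Class' string.

    The unified interface is modelled (as an assumption) by requiring a
    callable method called `required_method` on the loaded instance.
    """
    module_name, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    component_class = getattr(module, class_name)
    instance = component_class()
    if not callable(getattr(instance, required_method, None)):
        raise TypeError(f"{dotted_path} does not implement {required_method}()")
    return instance


class CleanComponentV1:
    """Hypothetical version 1 of a cleaning component: trims whitespace."""

    def process(self, rows):
        return [row.strip() for row in rows]


class CleanComponentV2:
    """Hypothetical version 2: trims whitespace and lower-cases."""

    def process(self, rows):
        return [row.strip().lower() for row in rows]


# Register an in-memory module so that the example is self-contained.
# In a real deployment the components would live in installed packages.
demo = types.ModuleType("demo_components")
demo.CleanComponentV1 = CleanComponentV1
demo.CleanComponentV2 = CleanComponentV2
demo.NotAComponent = object  # lacks process(), so loading it is rejected
sys.modules["demo_components"] = demo

component = load_by_reflection("demo_components.CleanComponentV2")
print(component.process(["  Foo ", "BAR"]))  # ['foo', 'bar']
```

Because the caller only holds a configured string, switching from version 1 to version 2 of a component is a configuration change rather than a code change in the data processing service.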
In the invention, a data processing algorithm plug-in is constructed for a complex logic unit in a data processing scene. The data processing algorithm plug-ins are loaded and called by the data processing technology component through a reflection technology according to a unified interface standard, and are divided into two types according to the calling mode: the first is internal loading call, which requires the implementation technology of the data processing algorithm plug-in to be consistent with that of the calling data processing technology component; the second is HTTP call, which is used for integrating data processing algorithm services and is implemented by calling an HTTP interface. Multiple data processing algorithm plug-ins may be built under each data processing technology component, and each data processing algorithm plug-in may be built in multiple versions to accommodate extensions and changes of the data processing logic.
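The two calling modes can be sketched in Python as a small dispatcher. Everything named here is an assumption for illustration: the configuration keys `mode`, `name` and `url`, the plug-in method `run`, the JSON request and response bodies, and the injectable `send` parameter, which exists only so that the HTTP path can be exercised without a network.

```python
import json
import urllib.request


def call_internal(plugin, payload):
    """Mode 1: the plug-in runs in-process, in the component's own language."""
    return plugin.run(payload)


def call_http(url, payload, send=None):
    """Mode 2: the plug-in is an external service reached over HTTP."""
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    if send is None:
        # Default transport: a real HTTP request with a 10 second timeout.
        def send(req):
            with urllib.request.urlopen(req, timeout=10) as response:
                return response.read()
    return json.loads(send(request))


def call_plugin(config, payload, registry, send=None):
    """Dispatch on the configured calling mode of the plug-in."""
    if config["mode"] == "internal":
        return call_internal(registry[config["name"]], payload)
    if config["mode"] == "http":
        return call_http(config["url"], payload, send)
    raise ValueError(f"unknown calling mode: {config['mode']!r}")


class UpperCasePlugin:
    """Hypothetical in-process plug-in that upper-cases every row."""

    def run(self, payload):
        return {"rows": [row.upper() for row in payload["rows"]]}
```

A component would call, for example, `call_plugin({"mode": "internal", "name": "upper"}, {"rows": ["a"]}, {"upper": UpperCasePlugin()})`. Keeping the mode in configuration lets a plug-in written in a different technology be integrated through HTTP without changing the component.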
In the invention, a single data processing scene is constructed as an independent data processing unit, and the user configures a data processing unit description model according to actual needs. As shown in fig. 2, in one specific implementation, the data processing unit description model is composed of three parts: a data processing task, a data source object and a data storage target object:
Data processing task: describes the data processing service, data processing technology component and data processing algorithm plug-in called to perform the data processing task, together with control parameters and the like.
Data source object: the data input to the data processing task; there may be more than one.
Data storage target object: the target in which the execution result of the data processing task is stored; there may be more than one.
It should be noted that, the system registers data entities with different sources and different structures as a unified data object.
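The three-part description model can be sketched with Python dataclasses. The class and field names (`DataObject`, `ProcessingTask`, `ProcessingUnit`, `object_id`, `location`, `params`) are assumptions chosen for this illustration and do not come from the invention, which leaves the concrete schema open.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataObject:
    """A registered data entity with a unique data object identifier."""

    object_id: str
    location: str  # hypothetical: where the underlying data entity lives


@dataclass
class ProcessingTask:
    """Names of the service, component and plug-in to call, plus parameters."""

    service: str
    component: str
    plugin: str
    params: dict = field(default_factory=dict)


@dataclass
class ProcessingUnit:
    """Description model: one task, its data sources and its storage targets."""

    name: str
    task: ProcessingTask
    sources: list
    targets: list

    def source_ids(self):
        return {obj.object_id for obj in self.sources}

    def target_ids(self):
        return {obj.object_id for obj in self.targets}


d1 = DataObject("D1", "jdbc:source_db.orders")
d2 = DataObject("D2", "hdfs:/warehouse/orders_clean")
unit = ProcessingUnit(
    name="P1",
    task=ProcessingTask("cleaning", "CleanComponentV1", "dedupe", {"batch": 500}),
    sources=[d1],
    targets=[d2],
)
print(unit.source_ids(), unit.target_ids())  # {'D1'} {'D2'}
```

Because sources and targets refer to data objects by identifier, two units can be compared on those identifiers alone, regardless of where or how the underlying data entities are stored.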
The data processing units are connected and combined according to service requirements, and are arranged into a step-by-step data processing flow, so that an integrated complex data processing scene comprising data acquisition, cleaning, convergence, index aggregation, association analysis, intelligent analysis and the like is realized.
The data processing units are connected through input and output data objects to establish a data processing unit dependency relationship table. Each data processing unit is executed asynchronously through the data processing unit dependency table.
The invention also provides a system for implementing the assembled integrated big data processing method, comprising:
The first defining module is used for predefining a data processing scene, a data processing service, a data processing technical component and a data processing algorithm plug-in; the data processing service is provided with a service calling interface for calling the data processing service; the data processing technology component is provided with a component calling interface used for calling the data processing technology component; the data processing algorithm plug-in is provided with a plug-in calling interface and is used for calling the data processing algorithm plug-in;
A second definition module for predefining the data object; specifically, uniformly registering a data source object and a data storage target object as data objects, wherein each data object has a unique data object identifier;
The dividing module is used for loading a data processing flow model which needs to be subjected to data processing; dividing the data processing flow model into n data processing scenes;
The data processing unit generating module is used for generating data processing units corresponding to each data processing scene, and the specific generating method comprises the following steps:
1) Determining a data object which is a data source of the data processing unit from defined data objects, namely a data source object, and a data object which is a storage target as a result of execution of the data processing unit, namely a data storage target object;
2) Screening out required data processing services, data processing technical components and data processing algorithm plug-ins according to the data processing scene requirements, assembling and packaging according to the calling relation among the data processing services, the data processing technical components and the data processing algorithm plug-ins, and associating data source objects and data storage target objects to obtain the configured data processing unit;
The data processing unit dependency relationship table establishing module is used for establishing a data processing unit dependency relationship table according to the association between the data source objects and the data storage target objects among the data processing units; the data processing unit dependency relationship table is used for storing the dependency degree of each data processing unit and the dependency relationships between that data processing unit and the other data processing units, and for marking the execution state of each data processing unit, wherein the execution state is one of: an unexecuted state and an execution completed state;
Specifically, for the ith data processing unit among the n data processing units, i = 1, 2, 3, …, n, the dependency degree and the dependency relationships are calculated as follows:
For the jth data processing unit among the other n-1 data processing units, j = 1, 2, 3, …, n and j ≠ i, if a data object serving as a data storage target object of the jth data processing unit is the same as a data object serving as a data source object of the ith data processing unit, the ith data processing unit depends on the jth data processing unit, and the dependency degree of the ith data processing unit is increased by 1; the other n-1 data processing units are compared and identified in turn, thereby obtaining the dependency degree and the dependency relationships of the ith data processing unit;
A data processing unit execution module comprising:
Step 1, reading a data processing unit dependency relationship table, extracting each data processing unit marked as an unexecuted state and having a dependency degree of 0, and forming a data processing unit set;
Step 2, traversing the data processing unit set, asynchronously executing each data processing unit in the data processing unit set, after the execution of the data processing unit is completed, storing an execution result to a corresponding data storage target object by the data processing unit, and updating an execution state mark of a data processing unit dependency relationship table to mark the data processing unit as an execution completion state;
In this step, the execution method of each data processing unit is as follows:
1) Reading a data source according to the configured data source object;
2) Calling data processing service through a service gateway route; the data processing service loads and calls the corresponding data processing technology component through a reflection technology; the data processing technology component loads and calls the corresponding data processing algorithm plug-in through a reflection technology;
The called data processing algorithm plug-in performs data processing on the data source, and returns a data processing intermediate result to the called data processing technology component; the called data processing technology component further performs data processing on the data processing intermediate result to obtain a data processing result, and returns the data processing result to the called data processing service; therefore, the called data processing service obtains a data processing result, namely the execution result;
Step 3, reading the current data processing unit dependency relationship table and judging whether any data processing unit marked as the unexecuted state exists; if not, ending the flow; if so, executing step 4;
Step 4, updating the dependency degrees and the dependency relationships in the data processing unit dependency relationship table to obtain an updated data processing unit dependency relationship table, wherein the updating method is as follows:
For the ith data processing unit among the n data processing units, i = 1, 2, 3, …, n, after its execution is completed, searching the data processing unit dependency relationship table for each data processing unit that is in the unexecuted state and depends on the ith data processing unit, and subtracting 1 from the dependency degree of each data processing unit found;
Step 5, returning to step 1 and executing cyclically.
According to the invention, an execution instance of the data processing unit is automatically constructed from the description model information of the data processing unit, and the configured data processing service interface is called for execution. The data processing technology component is loaded and called within the data processing service through the reflection technology, and the data processing algorithm plug-in is loaded and called by the data processing technology component, so that data processing services, data processing technology components and data processing algorithm plug-ins can be flexibly plugged in and out, which ensures high scalability of the system.
The invention provides a big data processing system and method which are flexibly assembled and constructed through scenes, technologies and algorithms, realizes the integrated construction of fusion application and complex application scenes of different technologies in big data processing, and effectively improves the system development efficiency.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to be covered by the present invention.

Claims (6)

1. An assembled integrated big data processing method is characterized by comprising the following steps:
Step 1, predefining a data processing scene, a data processing service, a data processing technical component and a data processing algorithm plug-in; the data processing service is provided with a service calling interface for calling the data processing service; the data processing technology component is provided with a component calling interface used for calling the data processing technology component; the data processing algorithm plug-in is provided with a plug-in calling interface and is used for calling the data processing algorithm plug-in;
step 2, predefining a data object; specifically, uniformly registering a data source object and a data storage target object as data objects, wherein each data object has a unique data object identifier;
Step 3, loading a data processing flow model which needs to be subjected to data processing; dividing the data processing flow model into n data processing scenes;
The data processing units corresponding to each data processing scene are generated, and the specific generation method is as follows:
1) Determining a data object serving as a data source of the data processing unit from the data objects in the step 2, wherein the data object is called a data source object, and the data object serving as a storage target of a result after the data processing unit executes is called a data storage target object;
2) Screening out required data processing services, data processing technical components and data processing algorithm plug-ins according to the data processing scene requirements, assembling and packaging according to the calling relation among the data processing services, the data processing technical components and the data processing algorithm plug-ins, and associating data source objects and data storage target objects to obtain the configured data processing unit;
Step 4, establishing a data processing unit dependency relationship table according to the association between the data source objects and the data storage target objects among the data processing units; the data processing unit dependency relationship table is used for storing the dependency degree of each data processing unit and the dependency relationships between that data processing unit and the other data processing units, and for marking the execution state of each data processing unit, wherein the execution state is one of: an unexecuted state and an execution completed state;
Specifically, for the ith data processing unit among the n data processing units, i = 1, 2, 3, …, n, the dependency degree and the dependency relationships are calculated as follows:
For the jth data processing unit among the other n-1 data processing units, j = 1, 2, 3, …, n and j ≠ i, if a data object serving as a data storage target object of the jth data processing unit is the same as a data object serving as a data source object of the ith data processing unit, the ith data processing unit depends on the jth data processing unit, and the dependency degree of the ith data processing unit is increased by 1; the other n-1 data processing units are compared and identified in turn, thereby obtaining the dependency degree and the dependency relationships of the ith data processing unit;
Step 5, reading the dependency relationship table of the data processing units, extracting each data processing unit marked as an unexecuted state and having a dependency degree of 0, and forming a data processing unit set;
Step 6, traversing the data processing unit set, asynchronously executing each data processing unit in the data processing unit set, after the execution of the data processing unit is completed, storing an execution result to a corresponding data storage target object by the data processing unit, and updating an execution state mark of a data processing unit dependency relationship table to mark the data processing unit as an execution completion state;
In this step, the execution method of each data processing unit is as follows:
1) Reading a data source according to the configured data source object;
2) Calling data processing service through a service gateway route; the data processing service loads and calls the corresponding data processing technology component through a reflection technology; the data processing technology component loads and calls the corresponding data processing algorithm plug-in through a reflection technology;
The called data processing algorithm plug-in performs data processing on the data source, and returns a data processing intermediate result to the called data processing technology component; the called data processing technology component further performs data processing on the data processing intermediate result to obtain a data processing result, and returns the data processing result to the called data processing service; therefore, the called data processing service obtains a data processing result, namely the execution result;
step 7, reading the current data processing unit dependency relationship table, judging whether a data processing unit marked as an unexecuted state exists or not, and ending the flow if the data processing unit marked as the unexecuted state does not exist; if so, executing step 8;
step 8, updating the dependency of the data processing unit dependency table to obtain an updated data processing unit dependency table, wherein the updating method comprises the following steps:
for the ith data processing unit in the n data processing units, i=1, 2,3, …, n, searching for an unexecuted state from the data processing unit dependency table and depending on each data processing unit of the ith data processing unit after execution thereof is completed, and subtracting 1 from the searched dependency of the data processing unit;
Step 9, returning to step 5 and executing cyclically.
2. The assembled integrated big data processing method according to claim 1, wherein each class of data processing service is configured with multiple versions of the data processing service; each class of data processing technology component is configured with multiple versions of the data processing technology component; each class of data processing algorithm plug-in is configured with multiple versions of the data processing algorithm plug-in.
3. The method for processing the assembled integrated big data according to claim 1, wherein the data processing scene comprises a data acquisition scene, a data synchronization scene, a data aggregation scene, a data cleaning scene and a data analysis scene.
4. The assembled integrated big data processing method according to claim 3, wherein the data acquisition scene is: collecting data from the data source and storing it to a designated data storage target object;
The data synchronization scene is: synchronizing the data source among different data storage target objects;
The data aggregation scene is: constructing dimensions for the basic data, and constructing a business wide table through dimension association;
The data cleaning scene is: cleaning and standardizing dirty data;
The data analysis scene is: realizing data mining analysis.
5. The method for processing the assembled integrated big data according to claim 1, wherein the assembling and packaging method is as follows:
constructing multiple types of data processing services under each data processing scene, and constructing multiple data processing service versions by each type of data processing service so as to adapt to the changes of different running environments and business scenes;
Constructing a plurality of types of data processing technology components under each type of data processing service, wherein each type of data processing technology components is loaded and called by the data processing service through a reflection technology according to a unified interface standard; each type of data processing technology component constructs a plurality of data processing technology component versions to adapt to the change of the data processing service;
Constructing multiple classes of data processing algorithm plug-ins under each class of data processing technology component, wherein each class of data processing algorithm plug-in is loaded and called by the data processing technology component through a reflection technology according to a unified interface standard; according to the calling mode, the data processing algorithm plug-ins are divided into two types: the first is internal loading call, which requires the implementation technology of the data processing algorithm plug-in to be consistent with that of the data processing technology component, and the data processing technology component calls the data processing algorithm plug-in through an internal loading call method; the second is HTTP call, in which the data processing technology component calls the data processing algorithm plug-in through an HTTP interface; each class of data processing algorithm plug-in is built in multiple versions to accommodate extensions and changes of the data processing logic.
6. A system for implementing the assembled integrated big data processing method according to any one of claims 1 to 5, comprising:
The first defining module is used for predefining a data processing scene, a data processing service, a data processing technical component and a data processing algorithm plug-in; the data processing service is provided with a service calling interface for calling the data processing service; the data processing technology component is provided with a component calling interface used for calling the data processing technology component; the data processing algorithm plug-in is provided with a plug-in calling interface and is used for calling the data processing algorithm plug-in;
A second definition module for predefining the data object; specifically, uniformly registering a data source object and a data storage target object as data objects, wherein each data object has a unique data object identifier;
The dividing module is used for loading a data processing flow model which needs to be subjected to data processing; dividing the data processing flow model into n data processing scenes;
The data processing unit generating module is used for generating data processing units corresponding to each data processing scene, and the specific generating method comprises the following steps:
1) Determining a data object which is a data source of the data processing unit from defined data objects, namely a data source object, and a data object which is a storage target as a result of execution of the data processing unit, namely a data storage target object;
2) Screening out required data processing services, data processing technical components and data processing algorithm plug-ins according to the data processing scene requirements, assembling and packaging according to the calling relation among the data processing services, the data processing technical components and the data processing algorithm plug-ins, and associating data source objects and data storage target objects to obtain the configured data processing unit;
The data processing unit dependency relationship table establishing module is used for establishing a data processing unit dependency relationship table according to the association between the data source objects and the data storage target objects among the data processing units; the data processing unit dependency relationship table is used for storing the dependency degree of each data processing unit and the dependency relationships between that data processing unit and the other data processing units, and for marking the execution state of each data processing unit, wherein the execution state is one of: an unexecuted state and an execution completed state;
Specifically, for the ith data processing unit among the n data processing units, i = 1, 2, 3, …, n, the dependency degree and the dependency relationships are calculated as follows:
For the jth data processing unit among the other n-1 data processing units, j = 1, 2, 3, …, n and j ≠ i, if a data object serving as a data storage target object of the jth data processing unit is the same as a data object serving as a data source object of the ith data processing unit, the ith data processing unit depends on the jth data processing unit, and the dependency degree of the ith data processing unit is increased by 1; the other n-1 data processing units are compared and identified in turn, thereby obtaining the dependency degree and the dependency relationships of the ith data processing unit;
A data processing unit execution module comprising:
Step 1, reading a data processing unit dependency relationship table, extracting each data processing unit marked as an unexecuted state and having a dependency degree of 0, and forming a data processing unit set;
Step 2, traversing the data processing unit set, asynchronously executing each data processing unit in the data processing unit set, after the execution of the data processing unit is completed, storing an execution result to a corresponding data storage target object by the data processing unit, and updating an execution state mark of a data processing unit dependency relationship table to mark the data processing unit as an execution completion state;
In this step, the execution method of each data processing unit is as follows:
1) Reading a data source according to the configured data source object;
2) Calling data processing service through a service gateway route; the data processing service loads and calls the corresponding data processing technology component through a reflection technology; the data processing technology component loads and calls the corresponding data processing algorithm plug-in through a reflection technology;
The called data processing algorithm plug-in performs data processing on the data source, and returns a data processing intermediate result to the called data processing technology component; the called data processing technology component further performs data processing on the data processing intermediate result to obtain a data processing result, and returns the data processing result to the called data processing service; therefore, the called data processing service obtains a data processing result, namely the execution result;
Step 3, reading the current data processing unit dependency relationship table and judging whether any data processing unit marked as the unexecuted state exists; if not, ending the flow; if so, executing step 4;
Step 4, updating the dependency degrees in the data processing unit dependency relationship table to obtain an updated data processing unit dependency relationship table, wherein the updating method is as follows:
For the ith data processing unit among the n data processing units, i = 1, 2, 3, …, n, after its execution is completed, searching the data processing unit dependency relationship table for each data processing unit that is in the unexecuted state and depends on the ith data processing unit, and subtracting 1 from the dependency degree of each data processing unit found;
Step 5, returning to step 1 and executing cyclically.
CN202310803713.0A 2023-07-03 2023-07-03 Assembled integrated big data processing method and system Active CN116755804B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310803713.0A CN116755804B (en) 2023-07-03 2023-07-03 Assembled integrated big data processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310803713.0A CN116755804B (en) 2023-07-03 2023-07-03 Assembled integrated big data processing method and system

Publications (2)

Publication Number Publication Date
CN116755804A (en) 2023-09-15
CN116755804B (en) 2024-04-26

Family

ID=87960811

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310803713.0A Active CN116755804B (en) 2023-07-03 2023-07-03 Assembled integrated big data processing method and system

Country Status (1)

Country Link
CN (1) CN116755804B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007242051A (en) * 2007-05-21 2007-09-20 Nomura Research Institute Ltd Device for mounting/executing business logic program
WO2016036813A2 (en) * 2014-09-02 2016-03-10 Ab Initio Technology Llc Controlling data processing tasks
CN110532038A (en) * 2019-08-19 2019-12-03 杭州趣链科技有限公司 A kind of parallel execution method based on Java intelligence contract
CN110597572A (en) * 2018-06-13 2019-12-20 中移(苏州)软件技术有限公司 Service calling relation analysis method and computer system
CN111651451A (en) * 2020-04-25 2020-09-11 复旦大学 Scene-driven single system micro-service splitting method
CN111754073A (en) * 2020-05-19 2020-10-09 北京吉威空间信息股份有限公司 Centralized processing and distributed operation framework construction method for spatial data service
CN112379884A (en) * 2020-11-13 2021-02-19 李斌 Spark and parallel memory computing-based process engine implementation method and system
CN114675943A (en) * 2020-12-24 2022-06-28 珠海市魅族科技有限公司 Multi-program cooperation method, system, device and medium based on different scenes
CN115794262A (en) * 2022-12-07 2023-03-14 百度(中国)有限公司 Task processing method, device, equipment, storage medium and program product

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7774751B2 (en) * 2003-12-26 2010-08-10 Yefim Zhuk Knowledge-driven architecture
US20100050156A1 (en) * 2008-08-20 2010-02-25 International Business Machines Corporation Using build history information to optimize a software build process
US9773070B2 (en) * 2014-06-30 2017-09-26 Microsoft Technology Licensing, Llc Compound transformation chain application across multiple devices

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
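Design and Implementation of a University Internal Control Management System Based on a Workflow Engine; Yu Shuang; Jiangsu University of Science and Technology; 20211231; full text *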

Also Published As

Publication number Publication date
CN116755804A (en) 2023-09-15

Similar Documents

Publication Publication Date Title
Wu et al. A multilevel index model to expedite web service discovery and composition in large-scale service repositories
Silva et al. Exploiting common subexpressions for cloud query processing
Fard et al. Towards efficient query processing on massive time-evolving graphs
US20160239544A1 (en) Collaborative planning for accelerating analytic queries
Wang et al. BENU: Distributed subgraph enumeration with backtracking-based framework
Reza et al. Prunejuice: pruning trillion-edge graphs to a precise pattern-matching solution
CN112379884A (en) Spark and parallel memory computing-based process engine implementation method and system
CN104615703A (en) RDF data distributed parallel inference method combined with Rete algorithm
Oliveira et al. Rigorous development of component-based systems using component metadata and patterns
Singh et al. A data structure perspective to the RDD-based Apriori algorithm on Spark
Sampath et al. An efficient weighted rule mining for web logs using systolic tree
Gombos et al. Spar (k) ql: SPARQL evaluation method on Spark GraphX
CN116775041B (en) Real-time decision engine implementation method based on stream calculation and RETE algorithm
CN116755804B (en) Assembled integrated big data processing method and system
CN105701605A (en) Waveform list management module applied to integrated communication navigation identification system
CN114138811A (en) Column calculation optimization method based on Spark SQL
Tehreem et al. Parallel architecture for implementation of frequent itemset mining using FP-growth
Abdolazimi et al. Connected components of big graphs in fixed mapreduce rounds
CN110851178B (en) Inter-process program static analysis method based on distributed graph reachable computation
CN111198766B (en) Database access operation deployment method, database access method and device
Fegaras Supporting bulk synchronous parallelism in map-reduce queries
Cabodi et al. A graph‐labeling approach for efficient cone‐of‐influence computation in model‐checking problems with multiple properties
CN109918410B (en) Spark platform based distributed big data function dependency discovery method
Lin et al. Double resource optimization for a robust computer network subject to a transmission budget
Elmaghraoui et al. Dynamic web service composition using AND/OR directed graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant