CN116795816A - Stream processing-based data warehouse construction method and system
- Publication number
- CN116795816A (CN202310603864.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- layer
- service
- ods
- stream processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
- G06F16/2365—Ensuring data consistency and integrity
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
- G06F21/6254—Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1095—Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/55—Push-based network services
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
- H04L67/566—Grouping or aggregating service requests, e.g. for unified processing
Abstract
The application discloses a stream processing-based data warehouse construction method and system. The method comprises: parsing and restoring service data from structured or unstructured standard data packets, monitoring and capturing database changes and parsing them, and pushing the data to a data convergence layer (ODS); cleaning, converting, desensitizing, and associating the data of the ODS layer to form a data detail layer (DWD); distributing the data in the DWD layer to form a summary data layer (DWS), or synchronizing the data to a cloud component according to service requirements to form standardized data query services; and forming, in the DWS layer, wide tables or thematic libraries from the DWD layer through MYLink SQL data distribution, and outputting the calculated data to the cloud component to provide service queries and offline computation and analysis. The stream processing-based data warehouse construction method and system adapt well to scenarios with high real-time data requirements, can be deployed quickly and maintained easily, greatly reduce enterprise costs, and improve adaptability.
Description
Technical Field
The application relates to the technical field of data warehouse construction, and in particular to a stream processing-based data warehouse construction method and system.
Background
In the computer field, a data warehouse is a system for data analysis and reporting and an important component of business intelligence. A data warehouse is a central repository of integrated data from one or more disparate sources, which may be heterogeneous. Data warehouses typically store current and historical data together to create business value. A data warehouse represents a way of managing and using data; it is a complete system that typically comprises extraction, transformation, and loading (ETL), modeling, layered use, and data integration and access layers. A data warehouse is a subject-oriented, integrated, time-variant yet relatively stable collection of data used to support management decision-making.
With the sharp growth of internet users, especially in the shift from the PC era to the mobile internet era, massive user behavior data are generated every day, growing from the PB level to the EB and even ZB level. Enterprises urgently need to mine effective business information from these data (business data, user behavior data, abnormal data, and the like), extract valuable information, and make business decisions for users. As data scale and business data volume keep growing, data warehouse construction methods and architectures have iterated continuously, from the traditional data warehouse to the offline data warehouse, and on to the current Lambda architecture that combines offline and real-time processing.
In a traditional data warehouse, structured, semi-structured, and unstructured source data are loaded periodically into the warehouse through offline ETL, results are then computed by a calculation engine, and those results are provided to a front end or service. This combination of offline warehouse and compute engine is commonly built on large OLTP databases (mainly conventional relational databases such as Oracle and SQL Server). The traditional data warehouse supports transactional business processing very well but does not adapt to the storage and computation of massive data, mainly for the following three reasons:
1. The traditional data warehouse relies on a pre-built model, which must be rebuilt continually as data volumes grow or user needs change, so efficiency is very low.
2. Because the massive data of big data represents a change from quantity to quality, the traditional data warehouse cannot meet the requirement for rapid analytical response over massive data.
3. Expanding a traditional data warehouse cluster is very expensive, and it is difficult to scale horizontally to increase computing capacity.
As data scale keeps growing and business data volume rises rapidly, the traditional data warehouse struggles to handle massive data, and traditional database storage technology faces storage pressure, so costs keep rising. Meanwhile, the popularization of big data technology has made it possible to build an offline data warehouse with big data technology, which handles storage and computation for the offline warehouse. Data warehouse construction on big data keeps the traditional data warehouse construction architecture but replaces the traditional OLTP database with big data tools, yielding an offline big data warehouse architecture; this offline construction method largely overcomes the defects of the traditional data warehouse. As processing capacity and business demands keep changing, however, practice shows that although offline batch processing greatly improves capacity, it cannot satisfy business scenarios with very high requirements on data processing and service timeliness. Building a dual-link data warehouse with both offline and stream computing is a compromise, transitional scheme, but it has a number of defects in actual production:
1. Use of computing resources increases. Because the offline and streaming pipelines exist at the same time, they may occupy computing resources in different time periods: offline computation mostly runs from midnight to 6 a.m., while streaming computation mostly runs from daytime until midnight. The resources of neither pipeline are fully utilized, and overall resource occupation increases.
2. Two sets of code are maintained simultaneously. With separate offline and streaming pipelines, the code must be implemented once on the offline engine and once on the streaming engine, with two sets of test procedures. The operation and maintenance cost of the data warehouse business doubles.
3. The timeliness of offline computation is poor. As services keep changing, more and more services demand higher timeliness from tasks that were originally offline. Offline computing can only meet T+n requirements; shrinking n down to the minute level places ever higher demands on server resources, and timeliness still cannot be guaranteed.
4. Cluster storage requirements are high. Both the offline and the streaming pipelines must store data in the cluster, and intermediate computation generates large amounts of temporary data and logs, so data volume expands rapidly and puts significant strain on server storage.
Disclosure of Invention
In order to solve the technical problems in the prior art, the application provides a stream processing-based data warehouse construction method and system.
According to a first aspect of the present application, a stream processing-based data warehouse construction method is provided, including:
S1: parsing and restoring service data from structured or unstructured standard data packets, monitoring and capturing database changes and parsing them, and pushing the data to a data convergence layer ODS;
S2: cleaning, converting, desensitizing, and associating the data of the data convergence layer ODS to form a data detail layer DWD;
S3: distributing the data in the data detail layer DWD to form a summary data layer DWS, or synchronizing the data to a cloud component according to service requirements to form standardized data query services;
S4: forming, in the summary data layer DWS, wide tables or thematic libraries from the data detail layer DWD through MYLink SQL data distribution, and outputting the calculated data to the cloud component to provide service queries and offline computation and analysis.
In some specific embodiments, S1 further includes collecting data from a source service library according to the collection rules of the data convergence layer ODS, where one source service library may correspond to one or more data convergence layers ODS.
In some specific embodiments, S1 specifically includes parsing the service data of structured or unstructured standard data packets with the sSend tool, parsing and restoring structured or unstructured data with Datax, and monitoring and capturing database changes with FlinkCDC for parsing.
In some specific embodiments, the data convergence layer ODS in S2 stores the service library data and preserves the original form of the service data, and the MYLink engine cleans, converts, desensitizes, and associates the data in the SQL+UDF manner to form the data detail layer DWD.
In some specific embodiments, S2 further comprises outputting the data directly to the cloud component through the MYLink engine to provide tracking queries on the raw data.
In some specific embodiments, the cloud component comprises a Hua Cheng cloud authentication component or Tencent Cloud.
In some specific embodiments, the sSend tool, Datax, and FlinkCDC all support a consumption queue as the data convergence layer ODS.
According to a second aspect of the present application, a computer-readable storage medium is presented, on which one or more computer programs are stored which, when executed by a computer processor, implement the above-described method.
According to a third aspect of the present application, there is provided a stream processing-based data warehouse construction system, comprising:
a data processing unit configured to parse and restore service data from structured or unstructured standard data packets, monitor and capture database changes and parse them, and push the data to the data convergence layer ODS, where the data of the data convergence layer ODS is cleaned, converted, desensitized, and associated to form a data detail layer DWD;
a data distribution unit configured to distribute the data in the data detail layer DWD to form a summary data layer DWS, or to synchronize the data to the cloud component according to service requirements to form standardized data query services;
and a query analysis unit configured to form, in the summary data layer DWS, wide tables or thematic libraries from the data detail layer DWD through MYLink SQL data distribution, and to output the calculated data to the cloud component to provide service queries and offline computation and analysis.
In some specific embodiments, the system further includes a data acquisition unit configured to collect data from a source service library according to the collection rules of the data convergence layer ODS, where one source service library may correspond to one or more data convergence layers ODS.
In some specific embodiments, the sSend tool parses the service data of structured or unstructured standard data packets, Datax parses and restores structured or unstructured data, and FlinkCDC monitors and captures database changes for parsing; the data convergence layer ODS stores the business library data and preserves the original form of the business data, and the MYLink engine cleans, converts, desensitizes, and associates the data in the SQL+UDF manner to form the data detail layer DWD; the MYLink engine can also output the data directly to a cloud component to provide tracking queries on the original data.
In some specific embodiments, the sSend tool, Datax, and FlinkCDC all support a consumption queue as the data convergence layer ODS.
The application provides a stream processing-based data warehouse construction method and system that suit business scenarios in which streams and batches coexist. Stream and batch processing use the same set of code and share the same resources, so resource utilization is high and resource cost is low. Because only one set of code needs to be developed, tested, and released online, the difficulty is greatly reduced, and later operation and maintenance costs are also low. The method adapts well to scenarios with high real-time data requirements, can be deployed quickly and maintained easily, greatly reduces enterprise costs, and improves adaptability.
Drawings
The accompanying drawings are included to provide a further understanding of the embodiments and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments and together with the description serve to explain the principles of the application. Other embodiments and many of their intended advantages will be readily appreciated as they become better understood by reference to the following detailed description. Other features, objects and advantages of the present application will become more apparent upon reading the detailed description of non-limiting embodiments, made with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of a stream processing-based data warehouse construction method according to one embodiment of the application;
FIG. 2 is an overall architecture diagram of a streaming data warehouse in accordance with a specific embodiment of the present application;
FIG. 3 is a flow chart of stream processing-based data warehouse construction in accordance with a specific embodiment of the present application;
FIG. 4 is a data warehouse architecture diagram of one particular embodiment of the present application;
FIG. 5 is an architecture diagram of a stream processing-based data warehouse construction system according to one embodiment of the present application;
FIG. 6 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.
Detailed Description
The application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be noted that, for convenience of description, only the portions related to the present application are shown in the drawings.
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
Fig. 1 shows a flow chart of a stream processing-based data warehouse construction method according to an embodiment of the application. As shown in Fig. 1, the method comprises the following steps:
S101: Parse and restore service data from structured or unstructured standard data packets, monitor and capture database changes and parse them, and push the data to the data convergence layer ODS.
In a specific embodiment, S101 further includes collecting data from the source service library according to the collection rules of the data convergence layer ODS, where one source service library may correspond to one or more data convergence layers ODS. The sSend tool parses the service data of structured or unstructured standard data packets, Datax parses and restores structured or unstructured data, and FlinkCDC monitors and captures database changes for parsing.
S102: The data of the data convergence layer ODS is cleaned, converted, desensitized, and associated to form a data detail layer DWD. The data convergence layer ODS stores the business library data and preserves the original form of the business data, and the MYLink engine cleans, converts, desensitizes, and associates the data in the SQL+UDF manner to form the data detail layer DWD. The MYLink engine can also output the data directly to the cloud component to provide tracking queries on the original data. The sSend tool, Datax, and FlinkCDC all support a consumption queue as the data convergence layer ODS.
S103: The data in the data detail layer DWD is distributed to form a summary data layer DWS, or is synchronized to the cloud component according to service requirements to form standardized data query services.
S104: In the summary data layer DWS, wide tables or thematic libraries are formed from the data detail layer DWD through MYLink SQL data distribution, and the calculated data is output to the cloud component to provide service queries and offline computation and analysis.
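Steps S101 to S104 can be pictured as a chain of small transformations. The following is only a minimal in-memory sketch under stated assumptions: the function names (`to_ods`, `to_dwd`, `to_dws`), the record fields, and the masking rule are hypothetical illustrations and are not taken from the application or from MYLink.

```python
from collections import defaultdict

def to_ods(raw_events):
    """S101 (sketch): land parsed records in the ODS layer unchanged."""
    return [dict(e) for e in raw_events]

def to_dwd(ods_rows, users):
    """S102 (sketch): clean, convert, desensitize, and associate ODS rows."""
    dwd = []
    for row in ods_rows:
        if row.get("amount") is None:               # cleaning: drop incomplete rows
            continue
        name = row["name"]
        dwd.append({
            "user_id": row["user_id"],
            "amount": float(row["amount"]),         # conversion: text to number
            "name": name[0] + "*" * (len(name) - 1),  # desensitization: mask the name
            "city": users.get(row["user_id"], {}).get("city", "unknown"),  # association
        })
    return dwd

def to_dws(dwd_rows):
    """S103/S104 (sketch): summarize detail rows into one row per city."""
    totals = defaultdict(lambda: {"orders": 0, "amount": 0.0})
    for row in dwd_rows:
        totals[row["city"]]["orders"] += 1
        totals[row["city"]]["amount"] += row["amount"]
    return dict(totals)

raw = [
    {"user_id": 1, "amount": "10.5", "name": "Alice"},
    {"user_id": 2, "amount": None, "name": "Bob"},
    {"user_id": 1, "amount": "4.5", "name": "Alice"},
]
users = {1: {"city": "Xiamen"}}          # hypothetical dimension data

dwd = to_dwd(to_ods(raw), users)
dws = to_dws(dwd)
print(dws)
```

In a real deployment each layer would be a queue or storage component rather than a Python list; the sketch only shows the order of the steps and what each one contributes.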
Fig. 2 shows an overall architecture diagram of a streaming data warehouse according to a specific embodiment of the present application. The streaming data warehouse is built on the new-generation streaming computing engine MYLink and forms the TJPlat platform. The core idea is to solve the problems of data warehouse construction by integrating and reworking the streaming computing system, so that real-time computation and batch processing can both be built in the same way. TJPlat can also recompute historical data when needed, which reflects the flexibility of the streaming data warehouse.
In a specific embodiment, the stream processing data warehouse construction of the application integrates warehouse layering into the construction method according to a four-level construction idea (the "four principles"): everything as a resource, resources cataloged, catalog globalized, and global standardized. All layered warehouse resources are defined as resources. Every resource is registered in a resource catalog for resource management, data item management, hierarchical classification of resources, and the like; the downstream resources a resource is extracted into, the extraction rules, the data cleaning rules (filtering, duplicate removal, format conversion, validation, and the like), and data associations (Join operations between resources) are also defined or designed in the resource catalog. The catalog globally defines the components on which the warehouse is deployed and stored (for example ES, HBase, MongoDB, HDFS, ClickHouse). Global standardization provides unified standards as a global guideline for subsequent resource service queries, data applications, and even BI analysis. The "four principles" give the stream processing data warehouse a concrete business direction; they can also be seen as the process of moving a business from the abstract to the concrete.
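To make the idea of defining cleaning rules in a resource catalog more concrete, the sketch below models one catalog entry as plain data and applies its rules generically. The `ResourceDef` class, its fields, and `apply_rules` are hypothetical names chosen for illustration; the application does not specify the catalog's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ResourceDef:
    """Hypothetical catalog entry: a resource plus its cleaning rules."""
    name: str
    layer: str                                       # e.g. "ODS", "DWD", "DWS"
    key: str                                         # field used for duplicate removal
    filters: list = field(default_factory=list)      # row -> bool (filtering/validation)
    converters: dict = field(default_factory=dict)   # field name -> conversion function

def apply_rules(resource, rows):
    """Apply the rules declared on a catalog entry to a list of rows."""
    seen, out = set(), []
    for row in rows:
        if not all(check(row) for check in resource.filters):
            continue                                 # filtered out or failed validation
        if row[resource.key] in seen:
            continue                                 # duplicate removal
        seen.add(row[resource.key])
        out.append({k: resource.converters.get(k, lambda v: v)(v)
                    for k, v in row.items()})        # format conversion per field
    return out

orders = ResourceDef(
    name="dwd_orders", layer="DWD", key="order_id",
    filters=[lambda r: r["amount"] not in ("", None)],
    converters={"amount": float, "created": lambda s: s.replace("/", "-")},
)
rows = [
    {"order_id": "A1", "amount": "12.0", "created": "2023/05/26"},
    {"order_id": "A1", "amount": "12.0", "created": "2023/05/26"},
    {"order_id": "A2", "amount": "", "created": "2023/05/26"},
]
cleaned = apply_rules(orders, rows)
print(cleaned)
```

Keeping the rules as data attached to the resource, rather than as code scattered through jobs, is one plausible reading of why the catalog is described as the place where extraction and cleaning are defined.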
In a specific embodiment, data flows from source service libraries to the convergence library (ODS) according to the ODS collection rules defined in the "four principles"; one source library may feed n (one or more) convergence libraries. MDatax, which is based on open source, collects data into an offline library or a consumption queue (Kafka), and business rules for collection can be defined during the collection process. sSend collects structured, semi-structured, and even unstructured data of specific service scenarios into an offline library, object storage, or a consumption queue, which suits data collection in specific big data scenarios. FlinkCDC is mainly used for real-time change data capture (Change Data Capture) from mainstream databases such as MySQL, Oracle, and Postgres; its principle is to monitor and capture database changes (inserts, updates, and deletes), record them completely in the order they occur, and write them to a consumption queue (Kafka) for MYLink to subscribe to and consume. All data collection is controlled on a definable basis. Datax, sSend, and FlinkCDC must all support a consumption queue (Kafka) as the ODS, which is a necessary precondition for building the streaming data warehouse.
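The change data capture principle described here, recording inserts, updates, and deletes in order and letting a consumer replay them, can be sketched without any external system. In the example below a `deque` stands in for the Kafka consumption queue, and the event format (`seq`, `op`, `key`, `row`) is an assumption made for illustration; it is not the format FlinkCDC emits.

```python
from collections import deque

queue = deque()   # stands in for a Kafka topic (assumption for this sketch)

def capture(op, table, key, row=None):
    """Append a change event; the sequence number preserves change order."""
    queue.append({"seq": len(queue) + 1, "op": op, "table": table,
                  "key": key, "row": row})

def replay(events):
    """Rebuild the current table state by applying events in order."""
    state = {}
    for ev in sorted(events, key=lambda e: e["seq"]):
        if ev["op"] in ("insert", "update"):
            state[ev["key"]] = ev["row"]
        elif ev["op"] == "delete":
            state.pop(ev["key"], None)
    return state

capture("insert", "users", 1, {"name": "Li", "city": "Xiamen"})
capture("insert", "users", 2, {"name": "Wang", "city": "Fuzhou"})
capture("update", "users", 1, {"name": "Li", "city": "Beijing"})
capture("delete", "users", 2)

state = replay(queue)
print(state)
```

Because the events carry their order, a downstream consumer can reconstruct the source table at any point, which is what lets the queue itself serve as the ODS layer.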
In a specific embodiment, MYLink is a new-generation streaming calculation engine developed on Flink's unified batch and stream processing idea. It is implemented entirely with SQL-like semantics and incorporates the "four principles" of the application, which lowers the threshold for building a streaming data warehouse. MYLink retains Flink's real-time computing features, including: support for both bounded and unbounded data sources; support for Join and Union; low latency; and unified, customizable connectors.
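The benefit of supporting bounded and unbounded sources with one engine is that the transformation is written once. The sketch below illustrates that idea with Python generators only; it does not use Flink or MYLink, and the source functions and record fields are hypothetical.

```python
import itertools

def enrich(records):
    """One transformation, written once, for batch and stream alike."""
    for r in records:
        if r["value"] >= 0:                      # drop invalid readings
            yield {"id": r["id"], "value": r["value"] * 2}

def bounded_source():
    """A finite (batch) source."""
    return [{"id": i, "value": v} for i, v in enumerate([1, -1, 3])]

def unbounded_source():
    """An endless (stream) source; it never terminates on its own."""
    for i in itertools.count():
        yield {"id": i, "value": i - 1}

batch_result = list(enrich(bounded_source()))
# Take only the first three results from the endless stream.
stream_result = list(itertools.islice(enrich(unbounded_source()), 3))
print(batch_result)
print(stream_result)
```

The same `enrich` function runs over both sources, which mirrors the single-codebase argument made for the streaming data warehouse.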
Combined with the "four principles", MYLink also introduces a new way of building a streaming data warehouse, embodied in the following aspects:
(1) Through the relationship between resource pairs in the ODS and DWD layers, MYLink performs stream computation (ETL processes such as data cleaning, data association, and data extraction and distribution) according to the data processing rules defined for the resource pair, and automatically outputs the data in parallel, layer by layer, according to the deployment definition;
(2) The order and position of each operator in the streaming computation can be adjusted freely and dynamically to meet service requirements;
(3) Within the streaming computation, users can write their own SQL statements to handle specific service requirements and can define UDFs for special or complex services;
(4) MYLink provides dynamic window adjustment for allocating the batch computation windows of a service;
(5) More storage components are supported: besides open source components, cloud components of some vendors are supported, such as Hua Cheng cloud authentication components (ES, HIVE, HDFS, Mongodb, and the like) and Tencent Cloud (ES, HIVE, HDFS, and the like).
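The application does not describe how MYLink adjusts windows, so the following is only a generic sketch of a tumbling window whose size is a parameter, to show what changing a batch computation window does to the results. The function name and the `(timestamp, value)` event shape are assumptions.

```python
from collections import defaultdict

def tumbling_window_sums(events, window_seconds):
    """Group (timestamp, value) events into fixed windows and sum each window."""
    sums = defaultdict(int)
    for ts, value in events:
        window_start = ts - ts % window_seconds   # start of the window containing ts
        sums[window_start] += value
    return dict(sorted(sums.items()))

events = [(0, 1), (30, 2), (65, 3), (130, 4)]   # timestamps in seconds

per_minute = tumbling_window_sums(events, 60)
per_two_minutes = tumbling_window_sums(events, 120)
print(per_minute)
print(per_two_minutes)
```

Widening the window from 60 to 120 seconds merges adjacent buckets without touching the aggregation logic, which is the kind of adjustment a configurable window makes cheap.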
With reference to Fig. 3, the stream processing-based data warehouse construction flow of a specific embodiment of the application comprises the following steps:
The platform parses the service data of structured or unstructured standard data packets with the sSend tool, parses and restores structured or unstructured data with MDatax, and uses the data collection tool FlinkCDC specifically to monitor and capture database changes for parsing; the data is then pushed to the ODS layer library.
The ODS layer (original library) stores business library data and preserves the business data in its original form. The MYLink engine can clean, convert, desensitize, and associate ODS-layer data using SQL+UDF to form the DWD layer. The data can also be output directly through MYLink to components such as ES and HDFS, so that data analysts can trace and query the original data, and it can be provided to data applications or third-party services through cxLevelS or the Tianhe service platform.
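The SQL+UDF pattern for turning ODS data into DWD data can be sketched with Python's built-in `sqlite3` module standing in for the engine. This is only an analogy: MYLink's SQL dialect and UDF registration are not described in the source, and the table names, columns, and the `mask_phone` function are hypothetical.

```python
import sqlite3


def mask_phone(phone):
    # Hypothetical desensitization UDF: keep the first 3 and last 4 digits.
    if phone is None or len(phone) < 7:
        return None
    return phone[:3] + "*" * (len(phone) - 7) + phone[-4:]


conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL statements can call it.
conn.create_function("mask_phone", 1, mask_phone)

conn.executescript(
    """
    CREATE TABLE ods_user (id INTEGER, name TEXT, phone TEXT);
    INSERT INTO ods_user VALUES
        (1, '  Alice ', '13812345678'),
        (2, NULL,       '13900001111'),
        (3, 'Bob',      '123');
    """
)

# Cleaning (drop rows without a name), conversion (TRIM) and
# desensitization (the UDF) expressed as one SQL statement.
conn.execute(
    """
    CREATE TABLE dwd_user AS
    SELECT id, TRIM(name) AS name, mask_phone(phone) AS phone_masked
    FROM ods_user
    WHERE name IS NOT NULL
    """
)

dwd_rows = conn.execute(
    "SELECT id, name, phone_masked FROM dwd_user ORDER BY id"
).fetchall()
ods_count = conn.execute("SELECT COUNT(*) FROM ods_user").fetchone()[0]
```

The ODS table is left untouched, so the original data remains available for tracing queries while the DWD table holds the governed result.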
The DWD layer (data detail layer) holds governed data. DWD-layer data can be distributed to form the DWS layer (summary data layer), for example user behavior, commodity attribute, or thematic libraries; alternatively, it can be synchronized to ES through MYLink according to business requirements to provide data-detail-layer queries, forming various standardized data query services.
The DWS layer is derived from the DWD layer. Data distribution through MYLink SQL forms wide tables or thematic libraries (index libraries), and the computed data is output to components such as MongoDB and HDFS, which provide business queries and offline computation and analysis, respectively.
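To make the DWD-to-DWS step concrete, the following sketch aggregates detail rows into a per-user wide table. The field names and the `build_user_wide_table` function are illustrative assumptions; the source performs this step with MYLink SQL, whereas plain Python is used here.

```python
from collections import defaultdict

# Hypothetical DWD detail rows: one row per order.
dwd_orders = [
    {"user_id": 1, "category": "book", "amount": 30.0},
    {"user_id": 1, "category": "food", "amount": 12.5},
    {"user_id": 2, "category": "book", "amount": 8.0},
]


def build_user_wide_table(rows):
    # One wide row per user, summarizing all of that user's detail rows.
    wide = defaultdict(
        lambda: {"order_count": 0, "total_amount": 0.0, "categories": set()}
    )
    for row in rows:
        agg = wide[row["user_id"]]
        agg["order_count"] += 1
        agg["total_amount"] += row["amount"]
        agg["categories"].add(row["category"])
    # Convert sets to sorted lists so the result is easy to serialize.
    return {
        uid: {**agg, "categories": sorted(agg["categories"])}
        for uid, agg in wide.items()
    }


dws_user = build_user_wide_table(dwd_orders)
```

Each entry of `dws_user` corresponds to one row of a wide table keyed by user, which is the kind of summary that a document store or an offline analysis job could then consume.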
In a specific embodiment, the offline data warehouse construction method addresses the shortcomings of the traditional data warehouse:
1. Batch computation largely solves the problem that a traditional data warehouse cannot process massive data, and the results can be queried in batches after the batch computation completes.
2. Low-cost capacity expansion. The offline data warehouse uses the Hadoop ecosystem, which runs well on low-cost small machines and gives offline computation a low-cost horizontal scaling path, whereas a traditional data warehouse can only scale vertically by purchasing larger machines.
3. Offline batch computation can wait until the data required by the business is in place and then compute it uniformly, which ensures data consistency and integrity.
The offline data warehouse uniformly collects heterogeneous data from different sources into offline storage (HDFS or Hive) through the open-source Datax and Flume or the self-developed sSend tool, and can also collect important basic data into a relational database or the like. The collected data forms a collection ledger log. Data collected into the offline library can be stored by partition, by table, or by database according to the condition of the resource, and is then processed by offline batch computation through HiveQL or SparkSQL on QBI (a self-developed offline analysis platform), either once or on a schedule. After business data is output according to business requirements, the unified release platform CXLeves can provide data services for data applications or service applications.
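Partitioned storage and the collection ledger mentioned above might look like the following sketch. The daily `dt=YYYY-MM-DD` partition naming, the record layout, and the ledger fields are assumptions chosen for illustration, not details from the source.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(record):
    # Assumed scheme: one partition per UTC calendar day of the event time.
    ts = datetime.fromtimestamp(record["ts"], tz=timezone.utc)
    return ts.strftime("dt=%Y-%m-%d")


def collect(records, source):
    # Group records by partition and write one ledger entry per partition.
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_key(record)].append(record)
    ledger = [
        {"source": source, "partition": key, "rows": len(rows)}
        for key, rows in sorted(partitions.items())
    ]
    return dict(partitions), ledger


records = [
    {"id": 1, "ts": 1685059200},  # 2023-05-26 00:00:00 UTC
    {"id": 2, "ts": 1685062800},  # 2023-05-26 01:00:00 UTC
    {"id": 3, "ts": 1685145600},  # 2023-05-27 00:00:00 UTC
]
partitions, ledger = collect(records, source="orders_db")
```

A scheduled batch job can then process one partition at a time and use the ledger to confirm that the expected data has arrived before it starts.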
During construction, the offline data warehouse can be built and managed in layers according to data warehouse layering, so that data flows in an orderly way as designed. Data warehouse layering provides an important theoretical basis for managing data warehouse construction. During construction, offline data (tables or resources) can form a data system with complex dependencies, confused levels, and even circular dependencies. To keep data organized, a data layering system is needed that organizes, manages, and uses data effectively. Data layering mainly makes data easier to control during data management, and has the following advantages:
1. Clear data structure. Each data layer has its own scope, which makes the data easier to locate and understand.
2. Simplified complexity. A complex task is decomposed into multiple steps, with each layer handling only a specific problem, so each step is relatively simple and easy to understand, and data accuracy and consistency are easier to maintain.
3. Less repeated work. Standardized data layering with general-purpose intermediate-layer tables greatly reduces repeated computation.
4. Unified data definitions. Data layering provides a unified data outlet, so that the business data definitions exposed externally are consistent and ambiguity is avoided.
5. Data source tracing. A business table is unique externally, but it may be derived from one or more sources; tracing the lineage of data tables makes it possible to locate data sources quickly.
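Lineage tracing and the circular-dependency problem described above can both be expressed over a small dependency graph. The sketch below uses the standard-library `graphlib` module (Python 3.9+); the table names and the `upstream_sources` and `build_order` helpers are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Each table maps to the set of tables it is built from (hypothetical names).
lineage = {
    "dwd_order": {"ods_order", "ods_user"},
    "dws_user_wide": {"dwd_order"},
    "ads_report": {"dws_user_wide"},
}


def upstream_sources(table, graph):
    # Walk the lineage graph backwards to find every table this one depends on.
    seen = set()
    stack = [table]
    while stack:
        for parent in graph.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def build_order(graph):
    # Sources come first; raises CycleError if tables depend on each other in a loop.
    return list(TopologicalSorter(graph).static_order())


sources = upstream_sources("ads_report", lineage)
order = build_order(lineage)

try:
    build_order({"a": {"b"}, "b": {"a"}})
    has_cycle = False
except CycleError:
    has_cycle = True
```

A layered design keeps this graph acyclic by construction, since each layer only reads from the layers beneath it.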
FIG. 4 shows a data warehouse architecture diagram of a specific embodiment of the present application. As shown in FIG. 4, the data warehouse is generally divided into three layers: a data convergence layer (ODS), a data warehouse layer (DW), and a data application layer (ADS). The data warehouse layer (DW) can be further divided into a data detail layer (DWD), a summary data layer (DWS), a common dimension layer (DIM), and a temporary data layer (TMP).
To address the problems in building and operating traditional data warehouses and offline data warehouses (including offline and stream computing data warehouses), the application provides a new data warehouse construction method. The stream processing-based data warehouse construction method and system can also be called a stream-batch integrated data warehouse construction method and system. Stream-batch integration has the following advantages in data warehouse construction:
1. Unified computation. The same code and the same logic can process both streaming tasks and batch tasks, which reduces learning and maintenance costs and improves resource utilization.
2. Unified storage. Stream-batch integration allows the storage system to hold both streaming data and batch data, and to coordinate metadata updates effectively.
3. Low data latency. Stream processing is designed for real-time use; data latency is on the order of seconds or even milliseconds, which suits business scenarios with strict real-time requirements.
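The first advantage, one set of logic for both batch and stream, can be shown with a short sketch. The `running_total` function and the `endless_source` generator are illustrative names; the same function is applied to a bounded list and to an unbounded generator.

```python
import itertools
from typing import Iterable, Iterator


def running_total(amounts: Iterable[float]) -> Iterator[float]:
    # One piece of logic: emit the cumulative sum after each input value.
    total = 0.0
    for amount in amounts:
        total += amount
        yield total


# Batch: the input is a bounded collection that is processed to the end.
batch_result = list(running_total([10.0, 5.0, 2.5]))


def endless_source() -> Iterator[float]:
    # Stream: an unbounded source that never finishes on its own.
    for i in itertools.count(1):
        yield float(i)


# The same function consumes the stream lazily; only three results are taken here.
stream_result = list(itertools.islice(running_total(endless_source()), 3))
```

Because the logic is written against an iterator rather than a finished data set, it does not need to know whether its input is bounded, which is the essence of treating batch as a special case of stream.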
FIG. 5 shows an architecture diagram of a stream processing-based data warehouse construction system according to an embodiment of the present application. As shown in FIG. 5, the system includes a data processing unit 501, a data distribution unit 502, and a query analysis unit 503. The data processing unit 501 is configured to parse and restore business data from structured or unstructured standard data packets, monitor and capture database changes and parse them, and push the data to the data convergence layer ODS, where the data convergence layer ODS cleans, converts, desensitizes, and associates the data to form the data detail layer DWD. The data distribution unit 502 is configured to distribute data in the data detail layer DWD to form the summary data layer DWS, or to synchronize the data to the cloud component according to business requirements to form a standardized data query service. In the query analysis unit 503, the summary data layer DWS is formed by distributing data of the data detail layer DWD through MYLink SQL into a wide table or a thematic library, and the computed data is output to the cloud component to provide business query and offline computation and analysis.
In a specific embodiment, the data processing unit 501 further includes a data acquisition unit configured to acquire data from a source service library according to the acquisition rules of the data convergence layer ODS, where one source service library may correspond to one or more data convergence layers ODS.
The stream processing-based data warehouse construction method and system are well suited to business scenarios in which streams and batches coexist. Streams and batches can share the same code and the same resources, giving high resource utilization and low resource overhead. Because only one set of code needs to be developed, tested, and released, the difficulty is greatly reduced and later operation and maintenance costs are also low. Stream processing has broad application to current time-sensitive business needs, such as fraud prediction. Fraud occurs frequently in the financial field; it unfolds quickly and has a large impact, and preventing it is a problem that many financial companies and banks have had to address in recent years. Conventional anti-fraud approaches are no longer adequate. In the past, it took several hours to compute transaction data and user behavior indicators, judge suspicious users by the corresponding rules, and combine this with case investigation and screening, by which time the criminals could already have transferred the funds many times over. With the streaming data warehouse construction method, the corresponding indicators can be computed within seconds or even milliseconds by stream computation over the relevant data, and real-time transaction flows can then be flagged or intercepted in real time to avoid losses. The streaming data warehouse construction of the application adapts well to scenarios with high real-time data requirements. It can be deployed quickly and maintained easily, which greatly reduces enterprise costs and improves adaptability.
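The fraud scenario can be sketched as a sliding-window rule evaluated on each transaction as it arrives. The rule below (total amount per account within a time window exceeding a threshold) is an invented example, not a rule from the source; `VelocityRule`, its parameters, and the sample transactions are all hypothetical, and timestamps are assumed to arrive in order.

```python
from collections import defaultdict, deque


class VelocityRule:
    """Flag an account whose total amount within a sliding window exceeds a limit."""

    def __init__(self, window_seconds, max_amount):
        self.window_seconds = window_seconds
        self.max_amount = max_amount
        self._recent = defaultdict(deque)  # account -> deque of (ts, amount)

    def observe(self, account, ts, amount):
        window = self._recent[account]
        window.append((ts, amount))
        # Evict transactions that have fallen out of the window.
        while window and window[0][0] <= ts - self.window_seconds:
            window.popleft()
        # The decision is made immediately, per event, not hours later.
        return sum(a for _, a in window) > self.max_amount


rule = VelocityRule(window_seconds=60, max_amount=1000.0)
transactions = [
    ("acct-1", 0, 400.0),
    ("acct-1", 20, 400.0),
    ("acct-1", 40, 400.0),   # 1200 within 60 s, so this one is flagged
    ("acct-1", 120, 400.0),  # earlier transactions have expired
]
alerts = [rule.observe(account, ts, amount) for account, ts, amount in transactions]
```

The window length is an ordinary attribute, so it could be changed at run time, loosely mirroring the dynamic window adjustment described earlier in the document.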
Referring now to FIG. 6, there is shown a schematic diagram of a computer system suitable for use in implementing an electronic device of an embodiment of the present application. The electronic device shown in fig. 6 is only an example and should not be construed as limiting the functionality and scope of use of the embodiments of the application.
As shown in fig. 6, the computer system includes a Central Processing Unit (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, mouse, etc.; an output portion 607 including a Liquid Crystal Display (LCD) or the like, a speaker or the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The drive 610 is also connected to the I/O interface 605 as needed. Removable media 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on drive 610 so that a computer program read therefrom is installed as needed into storage section 608.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 609, and/or installed from the removable medium 611. The above-described functions defined in the method of the present application are performed when the computer program is executed by a Central Processing Unit (CPU) 601. The computer readable medium of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In the present application, however, a computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present application may be implemented in software or in hardware.
As another aspect, the present application also provides a computer-readable storage medium that may be contained in the electronic device described in the above embodiment, or may exist alone without being incorporated into the electronic device. The computer-readable storage medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: parse and restore business data from structured or unstructured standard data packets, monitor and capture database changes and parse them, and push the data to a data convergence layer ODS; clean, convert, desensitize, and associate the data in the data convergence layer ODS to form a data detail layer DWD; distribute the data in the data detail layer DWD to form a summary data layer DWS, or synchronize the data to a cloud component according to business requirements to form a standardized data query service; and form the summary data layer DWS by distributing data of the data detail layer DWD through MYLink SQL into a wide table or a thematic library, and output the computed data to the cloud component to provide business query and offline computation and analysis.
The above description is only illustrative of the preferred embodiments of the present application and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the application is not limited to the specific combinations of the technical features described above, but also covers other technical solutions formed by any combination of the technical features described above or their equivalents without departing from the inventive concept described above, for example, technical solutions formed by replacing the above features with technical features of similar function disclosed in (but not limited to) the present application.
Claims (12)
1. A stream processing-based data warehouse construction method, characterized by comprising the following steps:
S1: parsing and restoring business data from structured or unstructured standard data packets, monitoring and capturing database changes and parsing them, and pushing the data to a data convergence layer ODS;
S2: the data convergence layer ODS cleans, converts, desensitizes, and associates the data to form a data detail layer DWD;
S3: the data in the data detail layer DWD is distributed to form a summary data layer DWS, or is synchronized to a cloud component according to business requirements to form a standardized data query service;
S4: the summary data layer DWS is formed by distributing data of the data detail layer DWD through MYLink SQL into a wide table or a thematic library, and the computed data is output to the cloud component to provide business query and offline computation and analysis.
2. The stream processing-based data warehouse construction method according to claim 1, wherein S1 further comprises acquiring data from a source service library according to the acquisition rules of the data convergence layer ODS, wherein one source service library can correspond to one or more data convergence layers ODS.
3. The stream processing-based data warehouse construction method according to claim 1, wherein S1 specifically comprises parsing the business data of the structured or unstructured standard data packets using the sSend tool, parsing and restoring the structured or unstructured data using Datax, and monitoring, capturing, and parsing database changes using FlinkCDC.
4. The stream processing-based data warehouse construction method according to claim 1, wherein in S2 the data convergence layer ODS is used to store business library data and preserve the business data in its original form, and the data is cleaned, converted, desensitized, and associated using SQL+UDF through the MYLink engine to form the data detail layer DWD.
5. The stream processing-based data warehouse construction method according to claim 4, wherein S2 further comprises outputting data directly to a cloud component through the MYLink engine to provide tracing queries of the original data.
6. The stream processing-based data warehouse construction method according to claim 1, wherein the cloud component comprises a cloud authentication component or Tencent Cloud.
7. The stream processing-based data warehouse construction method according to claim 3, wherein the sSend tool, Datax, and FlinkCDC each support a consumption queue as the data convergence layer ODS.
8. A computer readable storage medium having stored thereon one or more computer programs, which when executed by a computer processor implement the method of any of claims 1-7.
9. A stream processing-based data warehouse construction system, comprising:
a data processing unit configured to parse and restore business data from structured or unstructured standard data packets, monitor and capture database changes and parse them, and push the data to the data convergence layer ODS, wherein the data convergence layer ODS cleans, converts, desensitizes, and associates the data to form a data detail layer DWD;
a data distribution unit configured to distribute the data in the data detail layer DWD to form a summary data layer DWS, or to synchronize the data to a cloud component according to business requirements to form a standardized data query service;
and a query analysis unit, in which the summary data layer DWS is formed by distributing data of the data detail layer DWD through MYLink SQL into a wide table or a thematic library, and the computed data is output to the cloud component to provide business query and offline computation and analysis.
10. The stream processing-based data warehouse construction system according to claim 9, further comprising a data acquisition unit configured to acquire data from a source service library according to the acquisition rules of the data convergence layer ODS, wherein one source service library may correspond to one or more data convergence layers ODS.
11. The stream processing-based data warehouse construction system according to claim 9, wherein the business data of the structured or unstructured standard data packets is parsed using the sSend tool, the structured or unstructured data is parsed and restored using Datax, and database changes are monitored, captured, and parsed using FlinkCDC; the data convergence layer ODS is used to store business library data and preserve the business data in its original form, and the data is cleaned, converted, desensitized, and associated using SQL+UDF through the MYLink engine to form the data detail layer DWD; and data is output directly to a cloud component through the MYLink engine to provide tracing queries of the original data.
12. The stream processing-based data warehouse construction system according to claim 11, wherein the sSend tool, Datax, and FlinkCDC each support a consumption queue as the data convergence layer ODS.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310603864.1A CN116795816A (en) | 2023-05-26 | 2023-05-26 | Stream processing-based multi-bin construction method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310603864.1A CN116795816A (en) | 2023-05-26 | 2023-05-26 | Stream processing-based multi-bin construction method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116795816A true CN116795816A (en) | 2023-09-22 |
Family
ID=88040632
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310603864.1A Pending CN116795816A (en) | 2023-05-26 | 2023-05-26 | Stream processing-based multi-bin construction method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116795816A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117785983A (en) * | 2024-02-20 | 2024-03-29 | 四川大学华西医院 | Target object evaluation method, system, electronic device and storage medium |
-
2023
- 2023-05-26 CN CN202310603864.1A patent/CN116795816A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112685385B (en) | Big data platform for smart city construction | |
CN112181960B (en) | Intelligent operation and maintenance framework system based on AIOps | |
CN104021194A (en) | Mixed type processing system and method oriented to industry big data diversity application | |
CN113064866B (en) | Power business data integration system | |
CN103514223A (en) | Data synchronism method and system of database | |
CN111160867A (en) | Large-scale regional parking lot big data analysis system | |
CN106126601A (en) | A kind of social security distributed preprocess method of big data and system | |
CN109213752A (en) | A kind of data cleansing conversion method based on CIM | |
CN112241402A (en) | Empty pipe data supply chain system and data management method | |
CN112527886A (en) | Data warehouse system based on urban brain | |
CN112817958A (en) | Electric power planning data acquisition method and device and intelligent terminal | |
CN116842055A (en) | System and method for integrated processing of internet of things data batch flow | |
CN111538720B (en) | Method and system for cleaning basic data of power industry | |
CN112559634A (en) | Big data management system based on computer cloud computing | |
CN116795816A (en) | Stream processing-based multi-bin construction method and system | |
CN114218218A (en) | Data processing method, device and equipment based on data warehouse and storage medium | |
CN109308290A (en) | A kind of efficient data cleaning conversion method based on CIM | |
CN114691762A (en) | Intelligent construction method for enterprise data | |
CN110750582A (en) | Data processing method, device and system | |
CN112306992A (en) | Big data platform based on internet | |
CN116644136A (en) | Data acquisition method, device, equipment and medium for increment and full data | |
CN116523328A (en) | Intelligent decision-making method for cooperation of aviation equipment and manufacturing industry chain | |
CN109165203A (en) | Large public building energy consumption data based on Hadoop framework stores analysis method | |
CN114969183A (en) | Information management service platform applied to highway construction | |
Li | Construction of an interactive sharing platform for competitive intelligence data of marine resources under the background of intelligence construction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||