CN114490610A - Data processing method and device for data bin, storage medium and electronic device - Google Patents
- Publication number
- CN114490610A (application CN202210090577.0A)
- Authority
- CN
- China
- Prior art keywords
- data
- bin
- preset
- target
- cleaning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/02—Banking, e.g. interest calculation or account maintenance
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- Development Economics (AREA)
- General Business, Economics & Management (AREA)
- Quality & Reliability (AREA)
- Economics (AREA)
- Marketing (AREA)
- Strategic Management (AREA)
- Technology Law (AREA)
- Computing Systems (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses a data processing method and apparatus for a data warehouse, a storage medium and an electronic apparatus. The method comprises the steps of: acquiring data from different data sources based on the Flink framework, and storing the data in the Hudi data lake; establishing a target data warehouse based on preset data extraction, conversion and loading processing; and executing a preset data processing operation based on the target data warehouse, wherein the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation. The method and apparatus solve the technical problem of poor storage and computation performance for internet financial data, and achieve lightweight, easily migrated deployment with visual configuration under different business scenarios.
Description
Technical Field
The present application relates to the field of data warehouse technology, and in particular, to a data processing method and apparatus for a data warehouse, a storage medium, and an electronic apparatus.
Background
In internet finance, bank staff can obtain a large amount of important client-related information in real time by applying advanced information technology to financial business, which brings convenience and improves efficiency.
At present, existing data warehouses still fall short of the requirements for incremental data processing and real-time data query.
For the problem of poor storage and computation performance for internet financial data in the related art, no effective solution has yet been proposed.
Disclosure of Invention
The present application mainly aims to provide a data processing method and apparatus for a data warehouse, a storage medium, and an electronic apparatus, so as to solve the problem of poor storage and computation performance for internet financial data, for which no effective solution has yet been proposed.
To achieve the above object, according to one aspect of the present application, there is provided a data processing method for a data warehouse.
The data processing method for a data warehouse according to the application comprises: acquiring data from different data sources based on the Flink framework, and storing the data in the Hudi data lake; establishing a target data warehouse based on preset data extraction, conversion and loading processing; and executing a preset data processing operation based on the target data warehouse, wherein the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation.
Further, the establishing of the target data warehouse based on the preset data extraction, conversion and loading processing further includes: establishing the target data warehouse based on preset full and/or incremental data extraction, conversion and loading processing, wherein the full/incremental data is extracted to the Kafka queue using the Canal component.
Further, the establishing of the target data warehouse based on the preset data extraction, conversion and loading processing includes: reading data from a data source, performing data type conversion and dirty data cleaning on it, and then loading it into the target data warehouse.
Further, the executing of the preset data processing operation based on the target data warehouse further includes: analyzing user behavior data by printing the tracking (embedded-point) logs to a fixed file through a preset data model and collecting the log files; extracting the log files to a Kafka queue; and passing the data into the target data warehouse through its input source, cleaning the data with Flink SQL, then sorting it into the required data and storing it in HBase.
Further, in the executing of the preset data processing operation based on the target data warehouse, the preset data processing operation includes a report generation operation: after the target data warehouse accesses an input source, the data is cleaned based on SQL to obtain data in the required target format; the data is then stored through an output source accessed by the target data warehouse and displayed through a BI tool.
Further, in the executing of the preset data processing operation based on the target data warehouse, the preset data processing operation includes a decision engine raw data cleaning operation: a data source is accessed through the target data warehouse, and the data is cleaned using preset Flink SQL, wherein the Flink SQL is configured according to the business data and the execution result of each Flink SQL statement is stored in the Hudi data lake as intermediate data; the data is cleaned into preset decision raw fields and stored through an output source; and incremental data is extracted, converted and loaded using the update field to establish the target data warehouse.
Further, in the executing of the preset data processing operation based on the target data warehouse, the preset data processing operation includes a business data query operation: the full data is extracted to a Kafka queue; the data warehouse cleans the data through an input source; and the ES is configured as the output source for the cleaning result.
To achieve the above object, according to another aspect of the present application, there is provided a data processing apparatus for a data warehouse.
The data processing apparatus for a data warehouse according to the present application comprises: a data processing module, configured to acquire data from different data sources based on the Flink framework and store the data in the Hudi data lake; an establishing module, configured to establish a target data warehouse based on preset data extraction, conversion and loading processing; and an execution processing module, configured to execute a preset data processing operation based on the target data warehouse, where the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation.
In order to achieve the above object, according to yet another aspect of the present application, there is provided a computer-readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the method when executed.
In order to achieve the above object, according to yet another aspect of the present application, there is provided an electronic device comprising a memory and a processor, the memory having a computer program stored therein, the processor being configured to execute the computer program to perform the method.
In the data processing method and apparatus for a data warehouse, the storage medium, and the electronic apparatus of the embodiments of the application, data is acquired from different data sources based on the Flink framework and stored in the Hudi data lake; a target data warehouse is established based on preset data extraction, conversion and loading processing; and a preset data processing operation is executed based on the target data warehouse, the preset data processing operation including at least one of a report generation operation, a business data query operation, and a decision engine raw data cleaning operation. This achieves the purposes of completing report processing, querying business data in time, and cleaning the decision engine raw data, thereby optimizing data storage and computation and solving the technical problem of poor storage and computation performance for internet financial data.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, serve to provide a further understanding of the application and to enable other features, objects, and advantages of the application to be more apparent. The drawings and their description illustrate the embodiments of the invention and do not limit it. In the drawings:
FIG. 1 is a hardware architecture diagram of a data processing method for a data bin according to an embodiment of the application;
FIG. 2 is a schematic flow chart diagram of a data processing method for a data bin according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a data processing apparatus for data bins according to an embodiment of the present application;
FIG. 4 is a flow diagram illustrating a data processing method for a data bin according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the above drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances, such that the embodiments of the application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In this application, the terms "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "lateral", "longitudinal", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings. These terms are used primarily to better describe the present application and its embodiments, and are not used to limit the indicated devices, elements or components to a particular orientation or to be constructed and operated in a particular orientation.
Moreover, some of the above terms may be used to indicate other meanings besides the orientation or positional relationship, for example, the term "on" may also be used to indicate some kind of attachment or connection relationship in some cases. The specific meaning of these terms in this application will be understood by those of ordinary skill in the art as appropriate.
Furthermore, the terms "mounted," "disposed," "provided," "connected," and "sleeved" are to be construed broadly. For example, it may be a fixed connection, a removable connection, or a unitary construction; can be a mechanical connection, or an electrical connection; may be directly connected, or indirectly connected through intervening media, or may be in internal communication between two devices, elements or components. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 is a schematic hardware structure diagram of a data processing method for a data warehouse according to an embodiment of the present application, in which the data warehouse 100 is a subject-oriented, integrated, relatively stable data storage collection reflecting historical changes, used to support an enterprise's analysis reports and decisions. The data in the warehouse comes from the integration of different data sources, such as data source 1, data source 2, and data source 3, which may use different storage systems, such as MySQL, Oracle, and Hive; ETL operations are therefore required to integrate them. The ETL operations include, but are not limited to: data extraction, i.e., reading data from a data source; data conversion, i.e., data type conversion and dirty data cleaning; and data loading, i.e., loading the processed data into a target such as the data warehouse. Business functions such as data reporting, data mining, and data analysis may be provided based on the data warehouse 100.
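As an informal illustration only (not part of the patent's disclosure), the extract/convert/clean/load cycle described above can be sketched as follows; the field names and the dirty-data rules are hypothetical assumptions:

```python
def convert_types(record):
    """Cast raw string fields to the target schema's types; return None on failure."""
    try:
        return {
            "user_id": int(record["user_id"]),
            "amount": float(record["amount"]),
            "currency": record["currency"].strip().upper(),
        }
    except (KeyError, ValueError, AttributeError):
        return None  # unparseable or incomplete record: treated as dirty data

def etl(source_records, target_warehouse):
    """Read from the source, convert and clean, then load into the target."""
    for raw in source_records:
        row = convert_types(raw)
        if row is None or row["amount"] < 0:  # illustrative dirty-data rule
            continue
        target_warehouse.append(row)

warehouse = []
etl([
    {"user_id": "1", "amount": "9.5", "currency": " cny "},
    {"user_id": "x", "amount": "3"},                        # dirty: bad id, missing field
    {"user_id": "2", "amount": "-1", "currency": "CNY"},    # dirty: negative amount
], warehouse)
```

Only the first record survives the conversion and cleaning rules and is loaded into the warehouse.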
As shown in fig. 2, the method includes steps S201 to S203 as follows:
step S201, acquiring data from different data sources based on the Flink framework, and storing the data in the Hudi data lake;
step S202, establishing a target data warehouse based on preset data extraction, conversion and loading processing;
step S203, executing a preset data processing operation based on the target data warehouse, wherein the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation.
From the above description, it can be seen that the following technical effects are achieved by the present application:
the method comprises the steps of establishing a target data bin based on preset data extraction, conversion and loading processing, and executing preset data processing operation based on the target data bin, wherein the preset data processing operation at least comprises one of the following operations: the method comprises the steps of generating a report operation, inquiring business data and cleaning the original data of a decision engine, and achieves the purposes of completing report processing, inquiring the business data in time and cleaning the original data of the decision engine, so that the technical effects of optimizing data storage and calculation processes are achieved, and the technical problems of poor storage and calculation processing effects of internet financial data are solved.
In the above steps S201 and S202, after an input source is configured on the data warehouse (platform), the platform extracts data according to the configured data source, puts the extracted data into the Hudi data lake for cleaning, and then outputs the cleaned data to the configured output source.
As a preferred option in this embodiment, the establishing of the target data warehouse based on preset data extraction, conversion and loading processing includes: extracting data from the data sources integrated in a preset component manner, converting it based on the Hudi data lake storage, and loading it based on the Flink technology framework; the data is read from a data source, subjected to data type conversion and dirty data cleaning, and then loaded into the target data warehouse.
As a preferred embodiment, the Flink technology framework carries the whole data warehouse system, the data is stored in the Hudi data lake, and the data sources are integrated as components supporting conventional data source operations, where the data sources include but are not limited to Oracle, MySQL, HBase, Kafka, ES, MongoDB, and Redis.
As a preferred embodiment, report processing is completed based on the data warehouse, business data is queried in time, and the decision engine raw data is cleaned.
As a preferred embodiment, because the data warehouse is lightweight, its deployment is easy to migrate.
As a preferred embodiment, visual configuration based on the data warehouse is simple under different preset business scenarios. The preset business scenarios include, but are not limited to, scenarios such as real-time data requirements and data tracking; that is, visual configuration is supported.
In the above step S203, a preset data processing operation is executed based on the target data warehouse, where the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation.
In a preferred embodiment, for the incremental data processing mechanism, the CDC mechanism of the database is used to extract data; for databases without a CDC mechanism, the data is instead synchronized to a Kafka queue, which is then configured as an input source in the data warehouse (platform) to perform the data cleaning operation.
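A minimal sketch of this routing decision (an illustration only; the `supports_cdc` flag and the `sync.*` topic naming are invented for the example):

```python
def incremental_input_source(db):
    """Pick the warehouse input source for a database's incremental data:
    use its CDC mechanism directly if available, otherwise fall back to a
    Kafka queue that the changes are synchronized into first."""
    if db.get("supports_cdc"):
        return {"type": "cdc", "database": db["name"]}
    return {"type": "kafka", "topic": f"sync.{db['name']}"}

sources = [incremental_input_source(db) for db in (
    {"name": "orders_db", "supports_cdc": True},
    {"name": "legacy_db", "supports_cdc": False},
)]
```

The first database is read through CDC; the second is routed through its synchronization topic.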
In a preferred embodiment, for the business data query operation, the tracking (embedded-point) log is printed to a fixed file in the business system, synchronized to Kafka through a file extraction tool, and the data is then cleaned by configuring a Kafka input source on the warehouse platform.
In a preferred embodiment, for the decision engine raw data cleaning operation, either database data or Kafka data serves as the data input source; ETL cleansing is then performed through Flink SQL, which can be configured per business, and the result of each SQL execution is stored as intermediate data in the data lake.
As a preferable example in this embodiment, the establishing of the target data warehouse based on preset data extraction, conversion and loading processing further includes: establishing the target data warehouse based on preset full and/or incremental data extraction, conversion and loading processing, where the full/incremental data is extracted to the Kafka queue using the Canal component.
In a specific implementation, both full data and incremental data exist in different scenarios, so both parts of the data must be synchronized when establishing the data warehouse; that is, the full/incremental data is extracted to the Kafka queue using the Canal component.
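As an illustration only (the event shape and topic naming below are assumptions, not Canal's actual wire format), a binlog-style row change or a full-snapshot row can be wrapped as a Kafka-bound message like this:

```python
import json

def to_kafka_message(event, snapshot=False):
    """Wrap a row change (or a full-snapshot row) as a message destined for
    the warehouse's Kafka queue, tagging it with the operation type."""
    return {
        "topic": f"warehouse.{event['table']}",
        "key": str(event["pk"]),  # partition by primary key to keep row order
        "value": json.dumps({
            "op": "SNAPSHOT" if snapshot else event["op"],  # INSERT/UPDATE/DELETE
            "row": event["row"],
        }),
    }

msg = to_kafka_message(
    {"table": "loans", "pk": 42, "op": "UPDATE", "row": {"status": "PAID"}})
```

Full-load rows reuse the same envelope with `snapshot=True`, so downstream cleaning can treat both streams uniformly.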
As a preference in this embodiment, the executing of the preset data processing operation based on the target data warehouse further includes: analyzing user behavior data by printing the tracking logs to a fixed file through a preset data model and collecting the log files; extracting the log files to a Kafka queue; and passing the data into the target data warehouse through its input source, cleaning it with Flink SQL, then sorting it into the required data and storing it in HBase.
In a specific implementation, business data is first collected into a log file through the data model; the logs are then extracted to Kafka through a component; the data is passed into the warehouse platform by connecting to its input source and cleaned with Flink SQL; and the cleaned data is finally sorted into the required form and put into HBase for business use.
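A toy sketch of this log-collection-and-cleaning path (the JSON log format and the HBase-style `user#timestamp` row key are hypothetical; a plain dict stands in for HBase):

```python
import json

def clean_tracking_logs(log_lines, store):
    """Parse tracking-log lines, drop dirty or incomplete entries, and keep
    only the fields the business needs, keyed like an HBase row key."""
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # dirty log line: skip
        if not isinstance(event, dict) or not {"user_id", "ts", "action"} <= event.keys():
            continue  # incomplete event: skip
        store[f"{event['user_id']}#{event['ts']}"] = {"action": event["action"]}

hbase_like = {}
clean_tracking_logs([
    '{"user_id": 7, "ts": 1700000000, "action": "click"}',
    "not-json",                            # dropped: unparseable
    '{"user_id": 8, "ts": 1700000001}',    # dropped: missing action field
], hbase_like)
```

Only the complete, parseable event lands in the store; the two dirty lines are silently filtered, mirroring the cleaning step before HBase.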
As a preference in this embodiment, in the executing of the preset data processing operation based on the target data warehouse, the preset data processing operation includes a report generation operation: after the target data warehouse accesses an input source, the data is cleaned based on SQL to obtain data in the required target format; the data is then stored through an output source accessed by the target data warehouse and displayed through a BI tool.
In a specific implementation, an input source is accessed through the warehouse platform, data cleaning is performed with SQL (possibly generating intermediate tables), and the various data required by the target format are finally produced through multiple cleaning passes; an output source is then accessed through the warehouse platform, the data is stored, and the data is displayed through a BI tool.
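An informal illustration of such multi-pass SQL cleaning with an intermediate table, using SQLite in place of Flink SQL; the table and column names are assumptions made for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_orders(user_id INTEGER, amount REAL, status TEXT);
INSERT INTO raw_orders VALUES
  (1, 10.0, 'PAID'), (1, 5.0, 'PAID'), (2, -3.0, 'PAID'), (3, 7.0, 'CANCELLED');

-- first cleaning pass: drop dirty rows into an intermediate table
CREATE TABLE stage_orders AS
  SELECT * FROM raw_orders WHERE amount > 0 AND status = 'PAID';

-- second pass: aggregate the intermediate table into the report-format table
CREATE TABLE report_orders AS
  SELECT user_id, SUM(amount) AS total FROM stage_orders GROUP BY user_id;
""")
rows = conn.execute(
    "SELECT user_id, total FROM report_orders ORDER BY user_id").fetchall()
```

The negative-amount and cancelled rows are removed in the first pass, and the second pass produces the table a BI tool would display.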
As a preference in this embodiment, in the executing of the preset data processing operation based on the target data warehouse, the preset data processing operation includes a decision engine raw data cleaning operation: a data source is accessed through the target data warehouse and the data is cleaned using preset Flink SQL, where the Flink SQL is configured according to the business data and the execution result of each Flink SQL statement is stored in the Hudi data lake as intermediate data; the data is cleaned into preset decision raw fields and stored through an output source; and incremental data is extracted, converted and loaded using the update field to establish the target data warehouse.
In a specific implementation, a data source is accessed through the warehouse platform, the data is cleaned with SQL into the decision raw fields, and the result is stored into the target database through an output source. For incremental data, the update field is used for extraction to the data warehouse (platform).
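A hypothetical sketch of mapping cleaned records onto decision raw fields and using the update field as an incremental watermark (the field set is invented for illustration, not taken from the patent):

```python
def to_decision_fields(record):
    """Map a cleaned business record onto the raw fields a decision engine consumes."""
    return {
        "applicant_id": record["user_id"],
        "monthly_income": round(record["income"], 2),
        "updated_at": record["updated_at"],  # the update field driving incremental loads
    }

def incremental_batch(records, last_watermark):
    """Keep only rows whose update field is newer than the previous load."""
    return [to_decision_fields(r) for r in records if r["updated_at"] > last_watermark]

batch = incremental_batch([
    {"user_id": 1, "income": 1234.567, "updated_at": 100},
    {"user_id": 2, "income": 900.0, "updated_at": 50},  # already loaded last time
], last_watermark=60)
```

Only records updated after the last watermark are re-extracted, which is the role the update field plays in the incremental ETL described above.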
As a preference in this embodiment, in the executing of the preset data processing operation based on the target data warehouse, the preset data processing operation includes a business data query operation: the full data is extracted to a Kafka queue; the data warehouse cleans the data through an input source; and the ES is configured as the output source for the cleaning result.
In a specific implementation, the business data has relatively high real-time requirements; the full data is extracted to the Kafka queue, the data warehouse (platform) cleans the data through an input source, and the ES is configured as the output source after cleaning is completed, making the business data easy to retrieve.
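As a hedged illustration of feeding the cleaning result to an ES output source, the rows can be serialized in the newline-delimited shape of an Elasticsearch bulk-index payload (the index name and document shape are assumptions):

```python
import json

def to_es_bulk(cleaned_rows, index="business_data"):
    """Serialize cleaned rows as a bulk-index payload: one action line
    followed by one document line per row, terminated by a newline."""
    lines = []
    for row in cleaned_rows:
        lines.append(json.dumps({"index": {"_index": index, "_id": str(row["id"])}}))
        lines.append(json.dumps(row))
    return "\n".join(lines) + "\n"

payload = to_es_bulk([{"id": 1, "status": "PAID"}, {"id": 2, "status": "OPEN"}])
```

Indexing each cleaned row under its business `id` is what makes the data retrievable for the business query scenario described above.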
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
There is also provided, in accordance with an embodiment of the present application, a data processing apparatus for a data warehouse, for implementing the above method; as shown in fig. 3, the apparatus includes:
the data processing module 301, configured to acquire data from different data sources based on the Flink framework and store the data in the Hudi data lake;
an establishing module 302, configured to establish a target data bin based on preset data extraction, conversion, and loading processing;
an execution processing module 303, configured to execute a preset data processing operation based on the target data bin, where the preset data processing operation at least includes one of: generating report operation, business data query operation and decision engine original data cleaning operation.
After an input source is configured on the data warehouse (platform) via the data processing module 301 and the establishing module 302 of the present application, the platform extracts data according to the configured data source, puts the extracted data into the Hudi data lake for cleaning, and then outputs the cleaned data to the configured output source.
As a preferred option in this embodiment, the establishing of the target data warehouse based on preset data extraction, conversion and loading processing includes: extracting data from the data sources integrated in a preset component manner, converting it based on the Hudi data lake storage, and loading it based on the Flink technology framework; the data is read from a data source, subjected to data type conversion and dirty data cleaning, and then loaded into the target data warehouse.
As a preferred embodiment, the Flink technology framework carries the whole data warehouse system, the data is stored in the Hudi data lake, and the data sources are integrated as components supporting conventional data source operations, where the data sources include but are not limited to Oracle, MySQL, HBase, Kafka, ES, MongoDB, and Redis.
As a preferred embodiment, report processing is completed based on the data warehouse, business data is queried in time, and the decision engine raw data is cleaned.
As a preferred embodiment, because the data warehouse is lightweight, its deployment is easy to migrate.
As a preferred embodiment, visual configuration based on the data warehouse is simple under different preset business scenarios. The preset business scenarios include, but are not limited to, scenarios such as real-time data requirements and data tracking.
The execution processing module 303 of the present application executes a preset data processing operation based on the target data warehouse, where the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation.
In a preferred embodiment, above the incremental data processing mechanism, the CDC mechanism of the database is adopted to extract the data, the kafka queue adopted by the partial database without the CDC mechanism synchronizes the data to the kafka queue, and then the kafka queue is configured as an input source in the data warehouse (platform) to perform the data cleaning operation.
In a preferred embodiment, for business data query operation, the embedded point log is printed to a fixed file in a business system, synchronized to kafka through a file extraction tool, and then the data is cleaned by configuring a kafka input source on a warehouse platform.
In a preferred embodiment, for the decision engine raw data cleaning operation, either database data or Kafka data serves as the data input source; ETL cleaning is then performed through Flink SQL, each SQL statement can be configured per service, and the result of each SQL execution is stored in the data lake as intermediate data.
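A minimal sketch of such a configurable Flink SQL cleaning step (the table names, fields, and dirty-data rules are illustrative assumptions): it casts types, discards obviously dirty rows, and persists the result as intermediate data in the lake.

```sql
-- Illustrative cleaning statement: cast types, drop dirty rows,
-- and persist the result as intermediate data. Names are assumed.
INSERT INTO cleaned_orders_hudi
SELECT
  CAST(order_id AS BIGINT)       AS order_id,
  CAST(amount AS DECIMAL(10, 2)) AS amount,
  UPPER(TRIM(status))            AS status
FROM raw_orders
WHERE order_id IS NOT NULL
  AND amount >= 0;
```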
It will be apparent to those skilled in the art that the modules or steps of the present application described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices, and they may alternatively be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device, fabricated separately as individual integrated circuit modules, or fabricated as a single integrated circuit module from multiple modules or steps. Thus, the present application is not limited to any specific combination of hardware and software.
In order to better understand the flow of the above data processing method for the data bin, the above technical solutions are explained below with reference to preferred embodiments, but the technical solutions of the embodiments of the present invention are not limited thereto.
The data processing method for the data bin in the embodiment of the present application completes report processing based on the data warehouse (platform), queries business data in a timely manner, and cleans the raw data for the decision engine. Because the data bin is lightweight, its deployment is easy to migrate, and visual configuration is simple under different business scenarios.
As shown in fig. 4, which is a schematic flowchart of the data processing method for a data bin in the embodiment of the present application, the specific implementation includes the following steps:
step S401, a target data bin is established based on preset data extraction, conversion, and loading processing.
Step S402, determining whether incremental data exists.
Step S403, establishing the target data bin based on preset full and/or incremental data extraction, conversion, and loading processing, where the full/incremental data is extracted to the Kafka queue using a Canal component.
On top of the incremental data processing mechanism, the CDC mechanism of the database is used to extract data; for databases without a CDC mechanism, the data is synchronized to Kafka instead, and Kafka is then configured as an input source on the data warehouse platform to perform the data cleaning operation.
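As a hedged sketch of the CDC-based extraction described above, the Flink SQL below registers a MySQL table as a CDC source via the flink-cdc connector; the host, credentials, and schema are illustrative assumptions.

```sql
-- Illustrative CDC source: change events are captured directly from
-- the database binlog, so no separate extraction job is needed.
-- Connection details and schema are assumptions.
CREATE TABLE orders_cdc (
  order_id BIGINT,
  amount   DECIMAL(10, 2),
  status   STRING,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'flink',
  'password' = 'flink-pw',
  'database-name' = 'biz',
  'table-name' = 'orders'
);
```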
For process-tracking business data, the buried-point log is printed to a fixed file in the business system, the file is synchronized to Kafka through a file extraction tool, and a Kafka queue input source is then configured on the data warehouse platform to clean the data.
In the specific data cleaning process, either database data or Kafka queue data serves as the data input source; ETL cleaning is then performed through Flink SQL, each service can be configured in SQL, and the result of each SQL execution is stored in the data lake as intermediate data.
Step S404, extracting data from the data source according to the preset component-based integration mode, converting it based on the data lake Hudi storage format, and loading it with the Flink framework.
Step S405, reading the data from the data source, and loading it into the target data bin after data type conversion and dirty data cleaning.
Step S406, executing a preset data processing operation based on the target data bin, where the preset data processing operation includes at least one of the following: a report generation operation, a business data query operation, and a decision engine raw data cleaning operation.
For the report generation operation, an input source is accessed through the data warehouse platform and the data is cleaned with SQL; an intermediate table may be generated, and multiple rounds of cleaning finally produce the various data required in the target format. An output source is then accessed through the data warehouse platform, the data is stored, and it is displayed through a BI (business intelligence) tool.
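A hedged sketch of this multi-round cleaning in Flink SQL: a first statement builds an intermediate table, and a second aggregates it into the report format a BI tool can read from the output source (all table and field names are illustrative assumptions).

```sql
-- Round 1: build an intermediate table from the raw input (names assumed).
INSERT INTO orders_stage
SELECT order_id, region, amount
FROM raw_orders
WHERE amount IS NOT NULL;

-- Round 2: aggregate the intermediate table into the report format.
INSERT INTO daily_report
SELECT region, SUM(amount) AS total_amount, COUNT(*) AS order_cnt
FROM orders_stage
GROUP BY region;
```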
For the decision engine raw data cleaning operation, a data source is accessed through the data warehouse platform, the data is cleaned with SQL into the raw decision fields, and the raw decision fields are stored into a target database through an output source. The update field is used for incremental extraction to the data warehouse platform.
For the business data query operation, the business has high real-time requirements on the data, so the full data needs to be extracted to Kafka; the data warehouse (platform) cleans the data through an input source, and after cleaning is completed, ES is configured as an output source so that the business data can be retrieved conveniently. The full and incremental data here are extracted to the Kafka queue using a Canal component.
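A hedged sketch of configuring ES as the output source in Flink SQL (the index name, hosts, and fields are illustrative assumptions):

```sql
-- Illustrative Elasticsearch output source: cleaned business data is
-- written to an index so it can be retrieved quickly. Names assumed.
CREATE TABLE biz_search_sink (
  order_id BIGINT,
  customer STRING,
  status   STRING,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'elasticsearch-7',
  'hosts' = 'http://localhost:9200',
  'index' = 'biz_orders'
);

INSERT INTO biz_search_sink
SELECT order_id, customer, status FROM cleaned_orders;
```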
In addition, for the user behavior data analysis operation, the business data is first collected into a log file through a data model; a component then extracts the logs into a Kafka queue, the data is transferred into the data warehouse (platform) through an input source of the data warehouse platform, cleaned through Flink SQL, and finally arranged into the required data and placed into HBase for business use.
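A hedged sketch of the final HBase step in Flink SQL (the table name, column family, and fields are illustrative assumptions):

```sql
-- Illustrative HBase sink for cleaned user-behavior data.
-- Table name, column family, and fields are assumptions.
CREATE TABLE user_behavior_sink (
  rowkey STRING,
  cf ROW<user_id STRING, action STRING, ts BIGINT>,
  PRIMARY KEY (rowkey) NOT ENFORCED
) WITH (
  'connector' = 'hbase-2.2',
  'table-name' = 'user_behavior',
  'zookeeper.quorum' = 'localhost:2181'
);

INSERT INTO user_behavior_sink
SELECT user_id, ROW(user_id, action, ts)
FROM cleaned_behavior_events;
```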
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (10)
1. A data processing method for a data bin, comprising:
acquiring data from different data sources based on a Flink framework, and storing the data according to the data lake Hudi;
establishing a target data bin based on preset data extraction, conversion and loading processing;
executing preset data processing operation based on the target data bin, wherein the preset data processing operation at least comprises one of the following operations: generating report operation, business data query operation and decision engine original data cleaning operation.
2. The method of claim 1, wherein establishing a target data bin based on a pre-defined data extraction, conversion, and loading process further comprises:
establishing the target data bin based on preset full and/or incremental data extraction, conversion, and loading processing, wherein the full/incremental data is extracted to the Kafka queue using a Canal component.
3. The method of claim 2, wherein establishing a target data bin based on a pre-defined data extraction, conversion and loading process comprises:
the data is loaded to a target data bin after being read out from a data source and subjected to data type conversion and dirty data cleaning.
4. The method of claim 1, wherein performing a pre-set data processing operation based on the target data bin further comprises: the user behavior data analysis operation is performed,
printing the service data to a fixed file by a preset data model, and collecting a log file;
extracting the log file to a kafka queue;
and transmitting the data into the target data bin through the input source of the target data bin, cleaning the data by adopting the Flink SQL, then sorting the data into the required data and storing the data into the Hbase.
5. The method of claim 1, wherein executing the preset data processing operation based on the target data bin, the preset data processing operation comprising at least one of: generating report operation, business data query operation and decision engine original data cleaning operation, comprises:
the preset data processing operation comprises: the operation of generating a report is carried out,
after an input source is accessed through the target data bin, cleaning data based on SQL to obtain data required by a target format;
and accessing an output source through the target data bin, storing the data, and displaying the data through a BI tool.
6. The method of claim 1, wherein executing the preset data processing operation based on the target data bin, the preset data processing operation comprising at least one of: generating report operation, business data query operation and decision engine original data cleaning operation, comprises:
the preset data processing operation comprises: the decision engine raw data is flushed of operations,
accessing a data source through the target data bin, and cleaning data by using preset Flink SQL, wherein the Flink SQL is configured according to service data, and the execution result of each Flink SQL is stored into the data lake hudi as intermediate data;
cleaning data into a preset decision original field, and storing the data into the target data bin through an output source;
and performing extraction, conversion and loading processing on incremental data by using the updated field, and establishing the target data bin.
7. The method of claim 1, wherein executing the preset data processing operation based on the target data bin, the preset data processing operation comprising at least one of: generating report operation, business data query operation and decision engine original data cleaning operation, comprises:
the preset data processing operation comprises: a business data query operation is performed on the business data,
extracting the full amount of data to the Kafka queue;
cleaning the data through an input source of the data bin;
and configuring ES as an output source for the cleaning result.
8. A data processing apparatus for counting bins, comprising:
the data processing module is used for acquiring data from different data sources based on a Flink framework and storing the data according to the data lake Hudi;
the establishing module is used for establishing a target data bin based on preset data extraction, conversion and loading processing;
an execution processing module, configured to execute a preset data processing operation based on the target data bin, where the preset data processing operation at least includes one of: generating report operation, business data query operation and decision engine original data cleaning operation.
9. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 7 when executed.
10. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210090577.0A CN114490610A (en) | 2022-01-25 | 2022-01-25 | Data processing method and device for data bin, storage medium and electronic device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210090577.0A CN114490610A (en) | 2022-01-25 | 2022-01-25 | Data processing method and device for data bin, storage medium and electronic device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114490610A true CN114490610A (en) | 2022-05-13 |
Family
ID=81473619
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210090577.0A Pending CN114490610A (en) | 2022-01-25 | 2022-01-25 | Data processing method and device for data bin, storage medium and electronic device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114490610A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117609315A (en) * | 2024-01-22 | 2024-02-27 | 中债金融估值中心有限公司 | Data processing method, device, equipment and readable storage medium |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117609315A (en) * | 2024-01-22 | 2024-02-27 | 中债金融估值中心有限公司 | Data processing method, device, equipment and readable storage medium |
CN117609315B (en) * | 2024-01-22 | 2024-04-16 | 中债金融估值中心有限公司 | Data processing method, device, equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111339071B (en) | Method and device for processing multi-source heterogeneous data | |
CN111400326B (en) | Smart city data management system and method thereof | |
CN107958028A (en) | Method, apparatus, storage medium and the terminal of data acquisition | |
CN107193967A (en) | A kind of multi-source heterogeneous industry field big data handles full link solution | |
CN110502509B (en) | Traffic big data cleaning method based on Hadoop and Spark framework and related device | |
CN110347899B (en) | Distributed internet data acquisition system and method based on event-driven model | |
CN107103064B (en) | Data statistical method and device | |
CN104021460A (en) | Work flow management system and work flow handling method | |
CN112287015A (en) | Image generation system, image generation method, electronic device, and storage medium | |
CN108037919A (en) | A kind of visualization big data workflow configuration method and system based on WEB | |
CN108399186A (en) | A kind of collecting method and device | |
CN112115113B (en) | Data storage system, method, device, equipment and storage medium | |
CN109753502A (en) | A kind of collecting method based on NiFi | |
CN112286957B (en) | API application method and system of BI system based on structured query language | |
CN107145576B (en) | Big data ETL scheduling system supporting visualization and process | |
CN106780149A (en) | A kind of equipment real-time monitoring system based on timed task scheduling | |
CN110555076A (en) | Data marking method, processing method and device | |
CN108984583A (en) | A kind of searching method based on journal file | |
CN113468166A (en) | Metadata processing method and device, storage medium and server | |
CN114490610A (en) | Data processing method and device for data bin, storage medium and electronic device | |
CN105786941B (en) | Information mining method and device | |
CN108959356A (en) | A kind of intelligence adapted TV university Data application system Data Mart method for building up | |
CN114254033A (en) | Data processing method and system based on BS architecture | |
CN114567633A (en) | Cloud platform system supporting full life cycle of multi-stack database and management method | |
CN111949743A (en) | Method, device and equipment for acquiring network operation data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||