CN116383182A - Big data cleaning method - Google Patents
- Publication number
- CN116383182A (application CN202310249934.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- component
- platform
- execution
- flow
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2219—Large Object storage; Management thereof
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a big data cleaning method and relates to the technical field of data processing. The invention is a new-generation data integration platform, independently researched, developed, and innovated on a micro-service architecture; it is fully Web-configurable out of the box, supports distributed deployment, and supports the scheduling and execution of tens of thousands of flows. Through the platform, data is exchanged among heterogeneous data sources of various kinds, so a data fusion platform can be built quickly; a lightweight data middle platform can also be built quickly by layering an API service platform on top. The platform is developed specifically for complex data integration scenarios: it supports cross-database object control, guaranteeing high consistency of data transmission across multiple data sources; supports integrated batch and stream processing, greatly shortening data acquisition and synchronization time; supports merging, splitting, aggregation, and similar operations on data flows among multiple data sources; and supports automatic conversion of data types among different data sources, greatly accelerating the construction of integration flows.
Description
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a big data cleaning method.
Background
With the wide application of big data technology, the requirements placed on big data cleaning techniques keep rising, and the field has developed and advanced considerably; in particular, there remains large room for development in executing cleaning flows and correcting or self-learning them in time.
Existing big data cleaning techniques, such as CN115309735A (big data cleaning method, device, computer equipment and storage medium), focus mainly on processing and analyzing data from multiple data sources and can ultimately monitor the relations among all processes. In addition, CN115062722A (AI training method based on cloud-service big data cleaning and an artificial-intelligence cloud system) has a clear industry orientation: it targets noise and performs repeated AI comparisons against certain noise patterns.
At present, data cleaning is traditionally done with ETL tools, but when there are too many tasks, only the system's scheduled task mechanism can be used for management and log writing, so unified management cannot be achieved. For each cleaning flow, the traditional approach represents the relations among conversion components as a list, so those relations cannot be seen intuitively. Some of this software is deployed on the client, which raises memory usage at runtime and prevents the most efficient use of server resources. From a processing standpoint, ETL tools can meet most needs of industry software large and small, apart from possible differences in the industries they target; the present invention is advantageous in its specific components, arrangement, and unified management, though these are not necessarily pain points in other industries.
Disclosure of Invention
The invention provides a big data cleaning method based on a new-generation data integration platform that is completely independently developed and innovated on a micro-service architecture. It offers enterprises a one-stop data processing platform that integrates business-system data and handles data transmission among heterogeneous data sources, and it is fully Web-configurable out of the box. Through the platform, data can be exchanged quickly among heterogeneous data sources of various kinds to build a data fusion platform, and a lightweight data middle platform can be built quickly by layering an API service platform on top. A data integration flow can be assembled with a few clicks through visual drag-and-drop, realizing functions such as data extraction, conversion, cleaning, desensitization, and loading; the system comprehensively surpasses common open-source data cleaning tools in architecture, usability, transmission performance, visualization, and functional richness. Containerization supports a large-scale distributed deployment architecture, with dynamic elastic scaling according to resource utilization to schedule and run tens of thousands of flows concurrently.
In order to solve the technical problems, the invention is realized by the following technical scheme:
The invention discloses a big data cleaning method, developed on a micro-service architecture to support distributed deployment and the scheduling and execution of tens of thousands of flows, comprising the following steps:
S1, access the system interface; the user selects two to three components to complete a data cleaning flow, namely input, conversion, and output;
S2, select an input component in the standard system interface and choose its input type, which may be a database, a file, an API interface, or Kafka;
S3, apply conversions to the input component's data according to service requirements, including adding, modifying, deleting, column pivoting, and de-duplication; a series of transformations is performed according to the configuration of one or more transformation groups;
S4, an output component receives the converted data and writes it out directly to a database, a file, an interface, or Kafka;
S5, choose whether the current flow has a timed task, and if so configure the timed task's execution time;
S6, configure whether the current flow has dependencies; if so, it depends on other flows, i.e., the current flow executes only after the flows it depends on have finished executing;
S7, record the result of each flow execution in a database, including the completion status of each component;
S8, expose the generated flow as a configured API so that other systems can invoke the current flow;
S9, perform statistical analysis over all flows, enumerate the abnormal components, and optimize them according to their different services.
Further, in step S7 the completion status of each component records, under the component's name, what operation was performed on the system, and counts each component's execution time and the amount of data processed.
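The per-component completion record of step S7 could take a shape like the following; the field names and sample values are assumptions for illustration only:

```python
# Hypothetical record of one component's execution (step S7): the component's
# name, the operation it performed, its execution time, and rows processed.
from dataclasses import dataclass

@dataclass
class ComponentRecord:
    name: str
    operation: str
    execution_ms: int
    rows_processed: int

record = ComponentRecord("dedup-1", "de-duplicate column 'id'", 12, 1000)
print(record.name, record.rows_processed)  # dedup-1 1000
```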
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention is a new-generation data integration platform, completely independently researched, developed, and innovated on a micro-service architecture. It offers enterprises a one-stop data processing platform integrating business-system data and data transmission among heterogeneous data sources; it is fully Web-configurable out of the box, uses a fully Web-based front-end/back-end-separated architecture, and exposes all capabilities as APIs so third-party business systems can be connected easily. It supports distributed deployment and the scheduling and execution of tens of thousands of flows;
(2) Data is exchanged among heterogeneous data sources of various kinds through the platform, so a data fusion platform can be built quickly; a lightweight data middle platform can also be built quickly by layering an API service platform on top, the platform being developed specifically for complex data integration scenarios. Cross-database object control is supported, guaranteeing high consistency of data transmission across multiple data sources; integrated batch and stream processing is supported, greatly shortening data acquisition and synchronization time;
(3) A data integration flow can be assembled with a few clicks through visual drag-and-drop, realizing functions such as data extraction, conversion, cleaning, desensitization, and loading; merging, splitting, aggregation, and similar operations on data flows among multiple data sources are supported; automatic conversion of data types among different data sources is supported, greatly accelerating the construction of integration flows;
(4) The system comprehensively surpasses common open-source data cleaning tools in architecture, usability, transmission performance, visualization, and functional richness;
(5) Containerization supports a large-scale distributed deployment architecture, with dynamic elastic scaling according to resource utilization to schedule and run tens of thousands of flows concurrently.
Of course, it is not necessary for any one product to practice the invention to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart showing the steps of a big data cleaning method according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, the big data cleaning method of the invention is developed on a micro-service architecture, supporting distributed deployment and the scheduling and execution of tens of thousands of flows. Functionally, its core is flow management, chiefly performing various configurations on flows, and it comprises the following steps:
s1, accessing a system interface, wherein a user needs to select 2-3 components for data cleaning to finish the process of cleaning, namely input, conversion and output; the device is realized by an input assembly, a conversion assembly and an output assembly;
the input component has no too much obstacle between various types, and the concept of multiple data is to combine excel data with other (such as database) data for subsequent operation;
the conversion component comprises various types of content of the cleaning data, the implementation is different according to different implementations of the components, such as deduplication, and then the same data in a certain column is deduplicated, similar to the deduplication item in excel;
the output component outputs the processed data to a designated place, for example, an excel+database is subjected to duplicate removal processing, and the processed data is output to the database after completion;
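The Excel-plus-database de-duplication example can be sketched with standard-library pieces; here an in-memory list stands in for the Excel sheet, another for the source database table, and `sqlite3` for the destination database — all names and rows are illustrative:

```python
import sqlite3

# Rows from a hypothetical Excel sheet and from a database table.
excel_rows = [("alice", "a@example.com"), ("bob", "b@example.com")]
db_rows = [("bob", "b@example.com"), ("carol", "c@example.com")]

# Merge the two sources, de-duplicate, and write the result to the database.
merged = list(dict.fromkeys(excel_rows + db_rows))  # order-preserving dedup

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", merged)
count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count)  # 3 unique rows survive out of 4
```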
s2, selecting an input component in a standard system interface, and selecting the input type, wherein the input type comprises a database, a file, an API interface and a kafka mode for adding; inputting different specific contents according to types, such as excel, uploading files, filling data links if a database is selected, and the like;
s3, carrying out certain conversion on the input assembly according to service requirements, wherein the conversion comprises adding, modifying, deleting, column transferring and de-duplicating of data; performing a series of transformations on the input component according to the configuration of one or more transformation groups;
s4, according to the converted content, an output component is needed to receive the converted data, the output data is directly written in and output in a database, a file, an interface and a kafka mode;
s5, selecting whether a timing task exists in the current flow, and configuring timing task execution time;
s6, whether the configuration of the current flow is dependent or not, if so, other flows are dependent, namely, the current flow is executed after the dependent flow is selected to be executed; the dependency refers to the upper level of the dependency, and it can be understood that 1 program can be associated with a plurality of programs, and the execution result is that after one program is executed, other programs are driven to be executed together; the immediate execution is only the execution mode, namely, the operation is performed after the manual click operation; configuration information includes, among other things, the manner in which execution is to be performed, such as whether timed execution is to be configured for monthly/daily/yearly execution;
s7, recording the result of each execution flow in a database, and recording the completion condition of each component;
s8, completing interface call of the generated flow by configuring an API (application program interface) to complete call of other systems to the current flow;
s9, carrying out statistical analysis on all the processes, enumerating out abnormal components, and carrying out optimization processing on the abnormal components according to different services.
In step S7, the completion status of each component records, under the component's name, what operation was performed on the system, and counts each component's execution time and the amount of data processed.
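Step S9's statistical analysis — enumerating abnormal components across all flows — might look like the following; the failure-rate criterion, threshold, and component names are assumptions, since the patent does not specify how abnormality is judged:

```python
# Hypothetical per-component statistics (step S9); a component is flagged as
# abnormal here when its failure rate exceeds a threshold (an assumed criterion).
def abnormal_components(stats, max_failure_rate=0.2):
    flagged = []
    for name, s in stats.items():
        rate = s["failures"] / s["runs"] if s["runs"] else 0.0
        if rate > max_failure_rate:
            flagged.append((name, rate))
    # Worst offenders first, so they can be optimized per service.
    return sorted(flagged, key=lambda x: x[1], reverse=True)

stats = {"input-db":  {"runs": 100, "failures": 1},
         "dedup":     {"runs": 100, "failures": 30},
         "out-kafka": {"runs": 50,  "failures": 20}}
print(abnormal_components(stats))  # [('out-kafka', 0.4), ('dedup', 0.3)]
```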
The preferred embodiments of the invention disclosed above are intended only to assist in the explanation of the invention. The preferred embodiments are not exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. The invention is limited only by the claims and the full scope and equivalents thereof.
Claims (2)
1. A big data cleaning method, developed on a micro-service architecture, supporting distributed deployment and the scheduling and execution of tens of thousands of flows, characterized by comprising the following steps:
S1, access the system interface; the user selects two to three components to complete a data cleaning flow, namely input, conversion, and output;
S2, select an input component in the standard system interface and choose its input type, which may be a database, a file, an API interface, or Kafka;
S3, apply conversions to the input component's data according to service requirements, including adding, modifying, deleting, column pivoting, and de-duplication; a series of transformations is performed according to the configuration of one or more transformation groups;
S4, an output component receives the converted data and writes it out directly to a database, a file, an interface, or Kafka;
S5, choose whether the current flow has a timed task, and if so configure the timed task's execution time;
S6, configure whether the current flow has dependencies; if so, it depends on other flows, i.e., the current flow executes only after the flows it depends on have finished executing;
S7, record the result of each flow execution in a database, including the completion status of each component;
S8, expose the generated flow as a configured API so that other systems can invoke the current flow;
S9, perform statistical analysis over all flows, enumerate the abnormal components, and optimize them according to their different services.
2. The method according to claim 1, wherein in step S7 the completion status of each component records, under the component's name, what operation was performed on the system, and counts each component's execution time and the amount of data processed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310249934.8A CN116383182A (en) | 2023-03-15 | 2023-03-15 | Big data cleaning method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116383182A true CN116383182A (en) | 2023-07-04 |
Family
ID=86968551
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310249934.8A Pending CN116383182A (en) | 2023-03-15 | 2023-03-15 | Big data cleaning method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116383182A (en) |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |