CN113505119B - ETL method and device based on multiple data sources - Google Patents
ETL method and device based on multiple data sources
- Publication number
- CN113505119B (application CN202110862612.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- cleaning
- field
- target
- etl
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/214—Database migration support
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2457—Query processing with adaptation to user needs
- G06F16/24578—Query processing with adaptation to user needs using ranking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses an ETL method and device based on multiple data sources, wherein the method comprises the following steps: configuring custom data; selecting the output destination table and the field cleaning rules; generating a corresponding message format according to the configuration and selection operations, and writing it into a Kafka message queue; and processing the data in the Kafka message queue with the Spark Streaming computing framework, then warehousing the processed data. The beneficial effects are as follows: (1) the Web page simplifies the data-access work of operation and maintenance developers: no commands need to be executed, no configuration files need to be modified, and no programs need to be run, so the migration and access of various complex data can be completed easily, which reduces the learning cost and the operation and maintenance difficulty; (2) by defining a general data-access rule, data from different tables in different databases can all be written in this format, and a single set of pipeline code can complete the ETL of data from all the different sources, which greatly reduces the amount of development.
Description
Technical Field
The application relates to the technical field of data processing, in particular to an ETL method and device based on multiple data sources.
Background
Currently, big data ETL (extract, transform and load) is applied more and more widely, for example in the e-commerce, financial and security fields. In the prior art, data access and storage is mostly handled by developing a custom ETL program for each pair of data source and target; but as data sources multiply and the dimensions of data analysis grow, more and more programs need to be developed, so development and operation and maintenance costs rise linearly.
Disclosure of Invention
To address the defects in the prior art, the embodiments of the application aim to provide an ETL method and device based on multiple data sources.
To achieve the above object, in a first aspect, an embodiment of the present application provides an ETL method based on multiple data sources, including:
configuring custom data;
selecting the output destination table and the field cleaning rules, wherein both the configuration and the selection are performed in a front-end Web page;
generating a corresponding message format according to the configuration and selection operations, and writing it into a Kafka message queue;
and processing the data in the Kafka message queue with the Spark Streaming computing framework, then warehousing the processed data.
As a specific embodiment of the present application, each cleaning rule is a pre-packaged function; the cleaning rules include:
clearing field content and removing whitespace;
NULL value replacement, replacing NULL with a designated value.
As a specific embodiment of the present application, the processing by the Spark Streaming computing framework includes the following steps:
setting the batch processing time according to the data volume and the real-time requirements;
packaging the received Kafka data into Java objects according to a predefined data format;
and splitting the stream into different DStreams according to the target database field in the data format by using Spark's DStream filter operator, and traversing each stream to perform the warehousing operation.
As a specific implementation of the application, the data of the same batch in a stream are grouped according to the target table field, the RDD groupBy operator is used to calculate the number of target tables, the target tables are then traversed, and the RDD filter operator is used to gather the data of the same table into one block, so that each resulting RDD corresponds to a single target table;
when all the data of the same target table are together, a unified cleaning operation is performed: the field value at the corresponding position of the array is taken out, the value of the corresponding cleaning field is taken out, and the cleaning rule is invoked by reflection via its method name to clean the field.
Further, as a preferred embodiment of the present application, the DataFrame in Spark SQL is used for warehousing.
In a second aspect, an embodiment of the present application further provides an ETL apparatus based on multiple data sources, including:
the configuration module is used for configuring the custom data;
the selection module is used for selecting the output destination table and the field cleaning rules; wherein both the configuration and the selection are performed in a front-end Web page;
the packaging module is used for generating a corresponding message format according to the configuration and selection operations and writing it into the Kafka message queue;
and the processing module is used for processing the data in the Kafka message queue with the Spark Streaming computing framework and then warehousing the processed data.
As a specific embodiment of the present application, each cleaning rule is a pre-packaged function; the cleaning rules include:
clearing field content and removing whitespace;
NULL value replacement, replacing NULL with a designated value.
As a specific embodiment of the present application, the processing by the Spark Streaming computing framework includes the following steps:
setting the batch processing time according to the data volume and the real-time requirements;
packaging the received Kafka data into Java objects according to a predefined data format;
and splitting the stream into different DStreams according to the target database field in the data format by using Spark's DStream filter operator, and traversing each stream to perform the warehousing operation.
As a specific implementation of the application, the data of the same batch in a stream are grouped according to the target table field, the RDD groupBy operator is used to calculate the number of target tables, the target tables are then traversed, and the RDD filter operator is used to gather the data of the same table into one block, so that each resulting RDD corresponds to a single target table;
when all the data of the same target table are together, a unified cleaning operation is performed: the field value at the corresponding position of the array is taken out, the value of the corresponding cleaning field is taken out, and the cleaning rule is invoked by reflection via its method name to clean the field.
As a specific embodiment of the present application, the DataFrame in Spark SQL is used for warehousing.
The main beneficial effects of the embodiments of the application are:
(1) The Web page simplifies the data-access work of operation and maintenance developers: no commands need to be executed, no configuration files need to be modified, and no programs need to be run, so the migration and access of various complex data can be completed easily, which reduces the learning cost and the operation and maintenance difficulty;
(2) By defining a general data-access rule, data from different tables in different databases can all be written in this format, so a single set of pipeline code can complete the ETL of data from all the different sources, which greatly reduces the amount of development.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flow chart of an ETL method based on multiple data sources according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an arrangement according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a destination table for selecting output according to an embodiment of the present application;
FIG. 4 is a flow chart of a data processing provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of an ETL device based on multiple data sources according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should be noted that the term of art is defined as commonly understood by those skilled in the computer arts.
As shown in fig. 1 to 4, an embodiment of the present application provides an ETL method based on multiple data sources, the method including:
s101, configuring custom data.
Specifically, the user can freely select the data source table, the fields, the data types, the data lengths, the data precision and so on in the Web page.
S102, selecting the output destination table and the field cleaning rules, wherein both the configuration and the selection are performed in a front-end Web page.
the destination table of the selection output indicates which table is to be written into finally.
Specifically, the user selects the data cleansing rules, the target database, the target data table and the target data types on the Web page.
Wherein each cleaning rule is a pre-packaged function; the cleaning rules include:
clearing field content and removing whitespace;
NULL value replacement, replacing NULL with a designated value.
In practice, for the cleaning-rule part, a big data developer develops all the cleaning rules that may be used in advance, packages each rule into a function, and provides the functions to the front end for the user to select, so that the user can dynamically clean any field, and the developer only needs to update the code when a new cleaning rule is added. A minimal sketch of such packaged rules is shown below.
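The following Scala sketch is illustrative only: the rule names, the single-argument shape and the hard-coded replacement value are assumptions, not taken from the patent. It only shows how the rules described above (clearing content, removing whitespace, NULL replacement) could be packaged as named methods that can later be invoked by reflection:

```scala
// Hypothetical packaged cleaning rules; names and behaviour details are illustrative.
object CleaningRules {

  // Clear the field content entirely.
  def clearContent(value: String): String = ""

  // Delete all whitespace from the field.
  def deleteSpace(value: String): String =
    if (value == null) value else value.replaceAll("\\s+", "")

  // Replace a NULL (or empty) value with a designated replacement value
  // (hard-coded here for simplicity; in practice it would be configurable).
  def replaceNull(value: String): String =
    if (value == null || value.isEmpty) "UNKNOWN" else value
}
```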
S103, generating a corresponding message format according to the configuration and selection operations, and writing it into a Kafka message queue.
Specifically, the user's page operations in the previous steps generate a corresponding message format, which is transmitted from the front end to the back end. The back end collects data from the corresponding library according to the database address and credentials given by the user, encapsulates the data into a fixed message format according to the field names and types specified by the user, and transmits the data from the source database to the Kafka message queue, where it waits for the big data program to fetch it.
The designed data format (i.e. the message format) has the following advantages: a single Kafka message can carry multiple source records, which saves network IO, and the data format is completely decoupled from the business. The Spark data processing program only needs the values; it does not care about the concrete field names, field types and so on, and simply processes the data according to the designed schema. A minimal sketch of such a message envelope is given below.
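The sketch below assumes illustrative field names that are not specified by the patent; the patent only states that one message can carry several source records and that the format is decoupled from any concrete business schema:

```scala
// Hypothetical unified message envelope carried on Kafka; all names are assumptions.
case class FieldSpec(
  name: String,      // field name as supplied by the user
  dataType: String,  // field type as supplied by the user
  cleanRule: String  // method name of the packaged cleaning rule, empty if none selected
)

case class EtlMessage(
  targetDb: String,        // target database the rows should be written to
  targetTable: String,     // target table inside that database
  fields: Seq[FieldSpec],  // field definitions in user-supplied order
  rows: Seq[Seq[String]]   // several source records carried in a single Kafka message
)
```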
S104, the data in the Kafka message queue is processed by the Spark Streaming computing framework, and the processed data is then put into storage.
Specifically, the processing by the Spark Streaming computing framework includes the following steps:
setting the batch processing time according to the data volume and the real-time requirements;
packaging the received Kafka data into Java objects according to a predefined data format;
splitting the stream into different DStreams according to the target database field in the data format by using Spark's DStream filter operator, and traversing each stream to perform the warehousing operation;
that is, in the Spark data cleaning stage, the batch processing time is set according to the data volume and the real-time requirement, because too short an interval generates a large number of small files. After the Kafka data is received, it is packaged into Java objects according to the predefined data format. Since the Kafka topic carries data of many kinds, the first step is to separate the different target databases: Spark's DStream filter operator splits the stream into different DStreams according to the target database field in the data format. Because there are not too many target databases, it is enough to traverse each stream and perform the warehousing operation on it. A sketch of this stage is shown below.
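The sketch builds on the EtlMessage envelope above; the topic name, group id, batch interval, target database list and the `parseMessage`/`writeToWarehouse` helpers are assumptions for illustration, not details given by the patent:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object MultiSourceEtlStream {
  // Hypothetical helpers: parse one Kafka value into the EtlMessage envelope sketched
  // above, and warehouse one micro-batch (the per-table logic is sketched further below).
  def parseMessage(json: String): EtlMessage = ???
  def writeToWarehouse(db: String, rdd: RDD[EtlMessage]): Unit = ()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("multi-source-etl")
    // Batch interval chosen from the data volume and latency needs; too short an
    // interval produces many small files (30 s is only an example value).
    val ssc = new StreamingContext(conf, Seconds(30))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "etl-consumer"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("etl-topic"), kafkaParams))

    // Package each received Kafka record into the predefined message object.
    val messages = stream.map(record => parseMessage(record.value()))

    // Split into one DStream per target database with the DStream filter operator,
    // then traverse each stream and hand every micro-batch to the warehousing step.
    val targetDbs = Seq("db_a", "db_b") // assumption: target databases known up front
    targetDbs.foreach { db =>
      messages.filter(_.targetDb == db).foreachRDD(rdd => writeToWarehouse(db, rdd))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```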
Further, the data of the same batch in a stream are grouped according to the target table field, the RDD groupBy operator is used to calculate the number of target tables, the target tables are then traversed, and the RDD filter operator is used to gather the data of the same table into one block, so that each resulting RDD corresponds to a single target table;
when all the data of the same target table are together, a unified cleaning operation is performed: the field value at the corresponding position of the array is taken out, the value of the corresponding cleaning field is taken out, and the cleaning rule is invoked by reflection via its method name to clean the field.
Once all the data for one target database is in one stream, the data still belongs to different tables. Since we do not know in advance how many tables there are, the first operation is to group the data of this batch by the target table field, and the RDD groupBy operator can be used to calculate how many target tables there are in total. The target tables are then traversed, again using the RDD filter operator to gather the data of the same table into one block. For example, if the grouping yields 7 tables, the filter operator is applied over 7 table traversals and screens out 7 RDDs, each corresponding to a single target table.
Once all the data of the same target table is together, it can be cleaned uniformly: the field value at the corresponding position of the array is taken out, the value of the corresponding cleaning field is taken out, and the cleaning-rule method is invoked by reflection via its method name to clean the field. Using reflection avoids convoluted code, and extracting the cleaning rules into separate methods of a single class makes maintenance easier. A sketch of this grouping and reflection-based cleaning is shown below.
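Continuing the sketch with the same assumed definitions, the per-batch logic below groups by the target table with the RDD groupBy operator, screens each table's data with the RDD filter operator, and invokes the selected cleaning rule by reflection on its method name:

```scala
import org.apache.spark.rdd.RDD

// Per micro-batch of one target database: separate the rows by target table and clean
// each field with the rule selected for it, invoked by reflection via its method name.
// EtlMessage, FieldSpec and CleaningRules are the hypothetical definitions sketched above.
def cleanBatch(rdd: RDD[EtlMessage]): Unit = {
  // groupBy on the target-table field tells us how many target tables this batch contains.
  val tables = rdd.groupBy(_.targetTable).keys.collect()

  // Traverse the tables; filter gathers the data of one table into one RDD, so every
  // resulting RDD corresponds to a single target table.
  tables.foreach { table =>
    val tableRdd = rdd.filter(_.targetTable == table)

    val cleanedRows: RDD[Seq[String]] = tableRdd.flatMap { msg =>
      msg.rows.map { row =>
        row.zip(msg.fields).map { case (value, field) =>
          if (field.cleanRule.isEmpty) value
          else {
            // Look up the packaged cleaning rule by its method name and invoke it.
            val rule = CleaningRules.getClass.getMethod(field.cleanRule, classOf[String])
            rule.invoke(CleaningRules, value).asInstanceOf[String]
          }
        }
      }
    }
    // cleanedRows is now ready for the DataFrame warehousing step sketched below.
  }
}
```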
Finally, after cleaning, the data can be warehoused. Instead of warehousing with standard SQL, we use the DataFrame in Spark SQL. Compared with the JDBC mode, Spark SQL writes files directly to HDFS, which is faster than the TCP-based Thrift connection used by JDBC and avoids operations such as packing and unpacking. Using the DataFrame approach requires constructing a schema and Rows: the field names and field types are extracted in order from the message format supplied by the user to construct the schema, and the field values supplied by the user are used to construct the Rows. A DataFrame is then built from the schema and the Rows, and the DataFrame's write method is called to write the data into the target database.
Target warehouses that support the JDBC connection mode can also be written to with the DataFrame approach; those that do not support JDBC can only be loaded in a generic way. A sketch of the DataFrame warehousing step is shown below.
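The sketch keeps every column as a string for brevity; a real implementation would map the user-supplied types onto the matching Spark SQL types, and the write mode and table naming are assumptions:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Build a schema from the field names carried in the message, wrap the cleaned values
// into Rows, and write the DataFrame into the target table (landing as files rather
// than row by row over a JDBC connection).
def writeTable(spark: SparkSession,
               fields: Seq[FieldSpec],
               cleanedRows: RDD[Seq[String]],
               targetDb: String,
               targetTable: String): Unit = {
  // Schema in the order the user supplied the fields; every column is a string here.
  val schema = StructType(fields.map(f => StructField(f.name, StringType, nullable = true)))

  // One Row per cleaned record.
  val rowRdd = cleanedRows.map(values => Row.fromSeq(values))

  val df = spark.createDataFrame(rowRdd, schema)
  df.write.mode("append").insertInto(s"$targetDb.$targetTable")
}
```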
In addition, it should be noted that a table needs to be built in Hive (i.e. a table is built in the data warehouse; there are many kinds of data warehouse, and we currently use a Hive database and a thunder database, but other kinds of data warehouse can also be used). The local data and the error type of every error report are collected and written into the Hive error table, and the errors are counted and evaluated regularly. Evaluating the error data regularly helps to find hidden bugs in the code so that the code can be modified or optimized and similar errors occur less often; collecting the error report data also prevents key data from being lost because of a code bug, thereby ensuring that no data is lost. A minimal sketch of this error collection is shown below.
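The sketch assumes a hypothetical Hive error table and column names that the patent does not specify:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_timestamp

// Write records that fail cleaning or loading, together with their error type, into an
// error table in the warehouse so they can be counted and reviewed regularly.
def recordErrors(spark: SparkSession, errors: Seq[(String, String)]): Unit = {
  import spark.implicits._
  errors.toDF("raw_data", "error_type")                    // (original record, error type)
    .withColumn("collected_at", current_timestamp())
    .write.mode("append").insertInto("ods.etl_error_log")  // hypothetical Hive error table
}
```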
According to the scheme, the Web page simplifies the data-access work of operation and maintenance developers: no commands need to be executed, no configuration files need to be modified, and no programs need to be run, so the migration and access of various complex data can be completed easily, which reduces the learning cost and the operation and maintenance difficulty;
by defining a general data-access rule, data from different tables in different databases can all be written in this format, so a single set of pipeline code can complete the ETL of data from all the different sources, which greatly reduces the amount of development.
Based on the same inventive concept, referring to fig. 5, the embodiment of the present application further provides an ETL device based on multiple data sources; since the principle by which the device solves the problem is similar to that of the ETL method based on multiple data sources, its specific implementation can refer to the implementation steps of the method and is not repeated here.
The device comprises:
the configuration module is used for configuring the custom data;
the selection module is used for selecting the output destination table and the field cleaning rules; wherein both the configuration and the selection are performed in a front-end Web page; each cleaning rule is a pre-packaged function; the cleaning rules include:
clearing field content and removing whitespace;
NULL value replacement, replacing NULL with a designated value.
The packaging module is used for generating a corresponding message format according to the configuration and selection operations and writing it into the Kafka message queue;
the back end collects the data of the corresponding library according to the database address and credentials given by the user, encapsulates the data into a fixed message format according to the field names and types specified by the user, and transmits the data from the source database to the Kafka message queue, where it waits for the big data program to fetch it.
The processing module is used for processing the data in the Kafka message queue with the Spark Streaming computing framework and then warehousing the processed data.
That is, the batch processing time is set according to the data volume and the real-time requirement;
the received Kafka data is packaged into Java objects according to the predefined data format;
the stream is split into different DStreams according to the target database field in the data format by using Spark's DStream filter operator, and each stream is traversed to perform the warehousing operation;
the data of the same batch in a stream are grouped according to the target table field, the RDD groupBy operator is used to calculate the number of target tables, the target tables are traversed, and the RDD filter operator is used again to gather the data of the same table into one block, so that each resulting RDD corresponds to a single target table;
when all the data of the same target table are together, a unified cleaning operation is performed: the field value at the corresponding position of the array is taken out, the value of the corresponding cleaning field is taken out, and the cleaning rule is invoked by reflection via its method name to clean the field; meanwhile, the DataFrame in Spark SQL is used for warehousing.
The scheme is applied when there are multiple data sources. It replaces the traditional approach of developing a separate program to clean and store each kind of data: the same effect as many programs can be achieved without developing and deploying them, and when certain data cleaning rules need to be changed, this method replaces the traditional approach of modifying code, which greatly reduces the amount of development;
the Web page simplifies the data-access work of operation and maintenance developers, and the migration and access of various complex data can be completed easily without executing programs, which reduces the learning cost and the operation and maintenance difficulty.
While the application has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made without departing from the scope of the application. Therefore, the protection scope of the application is subject to the protection scope of the claims.
Claims (6)
1. An ETL method based on multiple data sources, comprising:
configuring custom data;
selecting the output destination table and the field cleaning rules; wherein both the configuration and the selection are performed in a front-end Web page;
generating a corresponding message format according to the configuration and selection operations, and writing it into a Kafka message queue;
processing the data in the Kafka message queue by using the Spark Streaming computing framework, and then warehousing the processed data;
the processing by the Spark Streaming computing framework comprises the following steps:
setting the batch processing time according to the data volume and real-time requirements;
packaging the received Kafka data into Java objects according to a predefined data format;
splitting the stream into different DStreams according to the target database field in the data format by using Spark's DStream filter operator, and traversing each stream to perform the warehousing operation;
grouping the data of the same batch in a stream according to the target table field, calculating the number of target tables by using the RDD groupBy operator, traversing the target tables, and using the RDD filter operator again to gather the data of the same table into one block, wherein each resulting RDD corresponds to a single target table;
when all the data of the same target table are together, performing a unified cleaning operation: taking out the field value at the corresponding position of the array, taking out the value of the corresponding cleaning field, and invoking the cleaning rule by reflection via its method name to clean the field.
2. The multi-data-source based ETL method of claim 1, wherein each cleaning rule is a pre-packaged function; the cleaning rules include:
clearing field content and removing whitespace;
NULL value replacement, replacing NULL with a designated value.
3. An ETL method based on multiple data sources as claimed in claim 2, wherein the warehousing is performed using the DataFrame in Spark SQL.
4. An ETL apparatus based on multiple data sources, comprising:
the configuration module is used for configuring the custom data;
the selection module is used for selecting the output destination table and the field cleaning rules; wherein both the configuration and the selection are performed in a front-end Web page;
the packaging module is used for generating a corresponding message format according to the configuration and selection operations and writing it into the Kafka message queue;
the processing module is used for processing the data in the Kafka message queue with the Spark Streaming computing framework and then warehousing the processed data;
the processing by the Spark Streaming computing framework comprises the following steps:
setting the batch processing time according to the data volume and real-time requirements;
packaging the received Kafka data into Java objects according to a predefined data format;
splitting the stream into different DStreams according to the target database field in the data format by using Spark's DStream filter operator, and traversing each stream to perform the warehousing operation;
grouping the data of the same batch in a stream according to the target table field, calculating the number of target tables by using the RDD groupBy operator, traversing the target tables, and using the RDD filter operator again to gather the data of the same table into one block, wherein each resulting RDD corresponds to a single target table;
when all the data of the same target table are together, performing a unified cleaning operation: taking out the field value at the corresponding position of the array, taking out the value of the corresponding cleaning field, and invoking the cleaning rule by reflection via its method name to clean the field.
5. The multi-data-source based ETL device of claim 4, wherein each cleaning rule is a pre-packaged function; the cleaning rules include:
clearing field content and removing whitespace;
NULL value replacement, replacing NULL with a designated value.
6. The multi-data-source based ETL device according to claim 4, wherein the warehousing is performed using the DataFrame in Spark SQL.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110862612.1A CN113505119B (en) | 2021-07-29 | 2021-07-29 | ETL method and device based on multiple data sources |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110862612.1A CN113505119B (en) | 2021-07-29 | 2021-07-29 | ETL method and device based on multiple data sources |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113505119A CN113505119A (en) | 2021-10-15 |
CN113505119B (en) | 2023-08-29 |
Family
ID=78015041
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110862612.1A Active CN113505119B (en) | 2021-07-29 | 2021-07-29 | ETL method and device based on multiple data sources |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113505119B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114691644A (en) * | 2022-03-15 | 2022-07-01 | 平安科技(深圳)有限公司 | Data migration method, device, equipment and storage medium |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106372105A (en) * | 2016-08-19 | 2017-02-01 | 中国科学院信息工程研究所 | Spark platform-based microblog data preprocessing method |
CN106528865A (en) * | 2016-12-02 | 2017-03-22 | 航天科工智慧产业发展有限公司 | Quick and accurate cleaning method of traffic big data |
CN109062551A (en) * | 2018-08-08 | 2018-12-21 | 青岛大快搜索计算技术股份有限公司 | Development Framework based on big data exploitation command set |
CN110490229A (en) * | 2019-07-16 | 2019-11-22 | 昆明理工大学 | A kind of electric energy meter calibration error diagnostics method based on spark and clustering algorithm |
CN110601866A (en) * | 2018-06-13 | 2019-12-20 | 阿里巴巴集团控股有限公司 | Flow analysis system, data acquisition device, data processing device and method |
CN111090640A (en) * | 2019-11-13 | 2020-05-01 | 山东中磁视讯股份有限公司 | ETL data cleaning method and system |
CN111273607A (en) * | 2018-12-04 | 2020-06-12 | 沈阳高精数控智能技术股份有限公司 | Spark-based numerical control machine tool running state monitoring method |
CN111858569A (en) * | 2020-07-01 | 2020-10-30 | 长江岩土工程总公司(武汉) | Mass data cleaning method based on stream computing |
CN112488745A (en) * | 2020-10-27 | 2021-03-12 | 广东电力信息科技有限公司 | Intelligent charge control management method, device, equipment and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10445062B2 (en) * | 2016-09-15 | 2019-10-15 | Oracle International Corporation | Techniques for dataset similarity discovery |
- 2021
- 2021-07-29 CN CN202110862612.1A patent/CN113505119B/en active Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106372105A (en) * | 2016-08-19 | 2017-02-01 | 中国科学院信息工程研究所 | Spark platform-based microblog data preprocessing method |
CN106528865A (en) * | 2016-12-02 | 2017-03-22 | 航天科工智慧产业发展有限公司 | Quick and accurate cleaning method of traffic big data |
CN110601866A (en) * | 2018-06-13 | 2019-12-20 | 阿里巴巴集团控股有限公司 | Flow analysis system, data acquisition device, data processing device and method |
CN109062551A (en) * | 2018-08-08 | 2018-12-21 | 青岛大快搜索计算技术股份有限公司 | Development Framework based on big data exploitation command set |
CN111273607A (en) * | 2018-12-04 | 2020-06-12 | 沈阳高精数控智能技术股份有限公司 | Spark-based numerical control machine tool running state monitoring method |
CN110490229A (en) * | 2019-07-16 | 2019-11-22 | 昆明理工大学 | A kind of electric energy meter calibration error diagnostics method based on spark and clustering algorithm |
CN111090640A (en) * | 2019-11-13 | 2020-05-01 | 山东中磁视讯股份有限公司 | ETL data cleaning method and system |
CN111858569A (en) * | 2020-07-01 | 2020-10-30 | 长江岩土工程总公司(武汉) | Mass data cleaning method based on stream computing |
CN112488745A (en) * | 2020-10-27 | 2021-03-12 | 广东电力信息科技有限公司 | Intelligent charge control management method, device, equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
Norms and standards for the data cleansing stage urgently need to be strengthened in the big data era; Qing Sude et al.; 《世界电信》 (World Telecommunications), Issue 07; 55-60 *
Also Published As
Publication number | Publication date |
---|---|
CN113505119A (en) | 2021-10-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11204928B2 (en) | Reducing flow delays in a data streaming application caused by lookup operations | |
CN107506451B (en) | Abnormal information monitoring method and device for data interaction | |
Bajaber et al. | Big data 2.0 processing systems: Taxonomy and open challenges | |
US10956422B2 (en) | Integrating event processing with map-reduce | |
Logothetis et al. | Stateful bulk processing for incremental analytics | |
US11914566B2 (en) | Indexing and relaying data to hot storage | |
US11693912B2 (en) | Adapting database queries for data virtualization over combined database stores | |
US20100313063A1 (en) | Mitigating reduction in availability level during maintenance of nodes in a cluster | |
WO2020238130A1 (en) | Big data log monitoring method and apparatus, storage medium, and computer device | |
KR101785959B1 (en) | Columnar storage representations of records | |
US8438144B2 (en) | Transactionally consistent database replay in an environment with connection pooling | |
CN111752959B (en) | Real-time database cross-database SQL interaction method and system | |
US20150293964A1 (en) | Applications of automated discovery of template patterns based on received requests | |
US20070250517A1 (en) | Method and Apparatus for Autonomically Maintaining Latent Auxiliary Database Structures for Use in Executing Database Queries | |
US10783142B2 (en) | Efficient data retrieval in staged use of in-memory cursor duration temporary tables | |
US20100293161A1 (en) | Automatically avoiding unconstrained cartesian product joins | |
EP3384385B1 (en) | Methods and systems for mapping object oriented/functional languages to database languages | |
US11803550B2 (en) | Workload-aware column imprints | |
CN113360581A (en) | Data processing method, device and storage medium | |
CN111078705A (en) | Spark platform based data index establishing method and data query method | |
CN113505119B (en) | ETL method and device based on multiple data sources | |
Xin et al. | Enhancing the interactivity of dataframe queries by leveraging think time | |
CN113806429A (en) | Canvas type log analysis method based on large data stream processing framework | |
CN116340363B (en) | Data storage and loading method based on relational database and related device | |
US10713150B1 (en) | Accurate test coverage of generated code |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: 266000 Room 302, building 3, Office No. 77, Lingyan Road, Huangdao District, Qingdao, Shandong Province; Applicant after: QINGDAO YISA DATA TECHNOLOGY Co.,Ltd.; Address before: 266000 3rd floor, building 3, optical valley software park, 396 Emeishan Road, Huangdao District, Qingdao City, Shandong Province; Applicant before: QINGDAO YISA DATA TECHNOLOGY Co.,Ltd. |
GR01 | Patent grant | |