CN114328759A - Data construction and management method and terminal of data warehouse - Google Patents

Data construction and management method and terminal of data warehouse

Info

Publication number: CN114328759A
Application number: CN202111622479.9A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 刘远祥, 王仁斌, 张峰, 林镇荣
Current/original assignee: Xiamen Meiya Pico Information Co Ltd (the listed assignee may be inaccurate; Google has not performed a legal analysis)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Priority date: 2021-12-28
Filing date: 2021-12-28
Publication date: 2022-04-12
Application filed by Xiamen Meiya Pico Information Co Ltd

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data construction and governance method and terminal for a data warehouse. After an original data table to be warehoused is obtained, corresponding standard data is generated from it and then mapped against a standard field table to obtain a standard table; alternatively, the corresponding standard table is created directly from the original data table to be warehoused. The original data table and the generated standard data therefore do not both need to be kept in the database, which reduces data redundancy and lets data be governed and warehoused quickly. Meanwhile, the field data in the original data table is filtered, deduplicated, format-converted, and processed by UDF functions, so the data to be stored is highly standardized while its redundancy is reduced, improving data extraction efficiency and lowering data governance cost.

Description

Data construction and management method and terminal of data warehouse
Technical Field
The invention relates to the field of data processing, in particular to a data construction and management method and a terminal for a data warehouse.
Background
With the rapid development and wide application of computer networks and database technology, information management in every industry has entered a new era. Against the background of ever-growing data volumes, the demand of governments and enterprises for data services and data governance is pressing. In the process of managing asset data, problems such as confused, disordered, and incomplete data make the governance process complex. Moreover, data governance is difficult to organize into standardized, pipeline-style processing, so governance efficiency is low and the quality of the governed asset data is poor. Meanwhile, once the governance process has been programmed, its steps can only be learned by inspecting the corresponding program. Current asset data governance therefore cannot meet present requirements.
At the same time, early databases were mostly independent ones, built separately by multiple systems across the data processing field, without standard, unified data specifications or data models. Owing to the lack of unified data planning, trusted data sources, and data standards, the data cannot be normalized against any standard, so a certain amount of redundancy exists. Human intervention also enters the data processing flow, and building a data warehouse on top of a large amount of disordered data drives its cost beyond what small and medium-sized enterprises can bear and introduces unpredictable risk.
Therefore, current data governance suffers from a scattered business knowledge system, complex data warehouse structures, and numerous data extraction procedures, which raise the development cost of a data warehouse and keep data extraction efficiency low; a multidimensional data store that can quickly and efficiently extract, convert, clean, and encrypt data cannot be offered to small and medium-sized teams.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: providing a data construction and governance method and terminal for a data warehouse that improve the data governance efficiency of the data warehouse and reduce its cost.
In order to solve the technical problems, the invention adopts the technical scheme that:
a data construction and management method of a data warehouse comprises the following steps:
acquiring an original data table to be put in storage;
judging whether a standard table corresponding to the original data table to be put in storage exists in the data warehouse;
if yes, acquiring the field data corresponding to the standard table from the original data table according to the standard table;
sequentially performing filtering, deduplication, format conversion and UDF function processing on the field data to obtain standard data;
storing the standard data into the data warehouse;
and if not, creating a standard table corresponding to the original data table to be put in storage.
In order to solve the technical problem, the invention adopts another technical scheme as follows:
a data construction and management terminal of a data warehouse comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the computer program to realize each step of the data construction and management method of the data warehouse.
The invention has the beneficial effects that: after the original data table to be warehoused is obtained, corresponding standard data is generated from it and then mapped with the standard field table to obtain the standard table, or the corresponding standard table is created directly from the original data table to be warehoused, so the original data table and the generated standard data do not both need to be stored in the database, reducing data redundancy and allowing data to be governed and warehoused quickly; meanwhile, the field data in the original data table is filtered, deduplicated, format-converted and processed by UDF functions, so the data to be stored is highly standardized while its redundancy is reduced, improving data extraction efficiency and lowering data governance cost.
Drawings
FIG. 1 is a flowchart illustrating steps of a method for building and managing data in a data warehouse according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating another step of a method for building and managing data in a data warehouse according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a data construction and management terminal of a data warehouse according to an embodiment of the present invention.
Detailed Description
In order to explain technical contents, achieved objects, and effects of the present invention in detail, the following description is made with reference to the accompanying drawings in combination with the embodiments.
Referring to fig. 1, a data construction and management method for a data warehouse includes:
acquiring an original data table to be put in storage;
judging whether a standard table corresponding to the original data table to be put in storage exists in the data warehouse;
if yes, acquiring the field data corresponding to the standard table from the original data table according to the standard table;
sequentially performing filtering, deduplication, format conversion and UDF function processing on the field data to obtain standard data;
storing the standard data into the data warehouse;
and if not, creating a standard table corresponding to the original data table to be put in storage.
As can be seen from the above description, the beneficial effects of the present invention are: after the original data table to be warehoused is obtained, corresponding standard data is generated from it and then mapped with the standard field table to obtain the standard table, or the corresponding standard table is created directly from the original data table to be warehoused, so the original data table and the generated standard data do not both need to be stored in the database, reducing data redundancy and allowing data to be governed and warehoused quickly; meanwhile, the field data in the original data table is filtered, deduplicated, format-converted and processed by UDF functions, so the data to be stored is highly standardized while its redundancy is reduced, improving data extraction efficiency and lowering data governance cost.
Further, the standard table comprises a view table and a physical table;
the judging whether the standard table corresponding to the original data table to be put in storage exists in the data warehouse comprises the following steps:
judging the type of the standard table corresponding to the original data table to be put in storage: if it is a physical table, judging whether an input standard exists for the original data table to be put in storage, and if it does, the standard table corresponding to the original data table to be put in storage exists;
if it is a view table, judging whether a corresponding access configuration exists for the original data table to be put in storage, and if it does, the standard table corresponding to the original data table to be put in storage exists.
According to the above description, different judgment criteria are adopted for the different standard table types, so physical tables and view tables can be distinguished and judged separately, improving the ability to discriminate the original data to be warehoused.
Further, if the standard table does not exist, the creating of the standard table corresponding to the original data table to be put in storage includes:
and if the original data table to be put in storage is a physical table and an input standard does not exist, creating a corresponding standard table according to a preset standard field table.
According to the above description, when the original data table to be put in storage has no corresponding input standard, a standard table is generated automatically for the physical table according to the standard field table, so that data can be warehoused quickly using the standard model.
Further, if the standard table does not exist, the creating of the standard table corresponding to the original data table to be put in storage includes:
and if the original data table to be put in storage is a view table and corresponding access configuration does not exist, setting a corresponding data input layer and a corresponding configuration field according to the original data table to be put in storage and generating corresponding configuration.
It can be known from the above description that, when there is no corresponding access configuration for the original data table to be put in storage, the corresponding data input layer and the configuration field are set and the corresponding configuration is generated according to the original data table to be put in storage, that is, the corresponding step of obtaining the corresponding field data according to the access table is executed after the corresponding access configuration is generated, so that the access of the original data table to be put in storage is more convenient.
Further, before storing the standard data into a data warehouse, the method further includes:
and if the type of the standard table corresponding to the standard data is a physical table, mapping the standard data according to the standard table.
It can be known from the above description that, when the type of the standard table is a physical table, the standard data is mapped according to the standard field table corresponding to the standard table, so as to realize the standardization of the original data table to be put in storage.
Further, after the mapping of the standard data, the method further includes:
judging whether a secondary governance instruction is received; if not, streaming the standard data to a doris operator and storing it in the data warehouse;
if yes, inputting the standard data into a preset governance end, and sequentially acquiring different governance operators to process the standard data into governed data.
According to the above description, whether the standard data needs further processing is judged: simple, small-volume standard data is warehoused directly, while complex, high-volume standard data is warehoused after secondary processing such as completion, so that the standard data can satisfy different service requirements and the practicability of the data warehouse is improved.
Further, after the governed data is obtained, the method further comprises:
acquiring a data grouping statistical operator;
and grouping the governed data according to the data grouping statistical operator to form a service wide table.
According to the above description, the governed data is uniformly classified and grouped by the data grouping statistical operator, data subjects are cleaned quickly by combining different operators, and the original data table to be warehoused is cleaned and warehoused, improving the uniformity of the data and reducing data redundancy.
Further, the governance operators comprise a data filtering operator, a data deduplication operator, a data association operator, a data format conversion operator, a UDF function operator, a window operator, a business code operator and a sink operator.
According to the above description, the data is cleaned over the whole flow based on multiple flink operators and their combinations, realizing data cleaning.
Further, the acquiring of the original data table to be put into storage includes:
judging the data source type of the original data table to be put in storage, and if it is a heterogeneous data source, synchronizing the heterogeneous data source through DataX Web.
As can be seen from the above description, DataX Web is used to access other heterogeneous data sources, so full and incremental synchronization of data can be realized and data can be synchronized efficiently into the doris data warehouse; multiple data sources can also be synchronized simultaneously, further improving data access efficiency.
Referring to fig. 3, the present invention further provides a data construction and management terminal for a data warehouse, comprising a memory, a processor, and a computer program stored in the memory and operable on the processor, wherein the processor executes the computer program to implement the steps of the above data construction and management method for a data warehouse.
The data construction and management method and terminal of the data warehouse are applicable to various kinds of data access, governance, and query, such as message-protocol data based on JMS, Kafka, MQTT, and the like; file data based on CSV, EXCEL, JSON, XML, and the like; and protocol data based on HTTP, REST, and the like. The following specific embodiments illustrate this:
example one
Referring to fig. 1, a data construction and management method for a data warehouse includes:
S1, acquiring an original data table to be put in storage;
S2, judging whether a standard table corresponding to the original data table to be put in storage exists in the data warehouse. Specifically: the standard table comprises a view table and a physical table; the type of the standard table corresponding to the original data table to be put in storage is judged; if it is a physical table, whether an input standard exists for the original data table is judged, and if it does, the standard table corresponding to the original data table exists; if it is a view table, whether a corresponding access configuration exists is judged, and if it does, the standard table corresponding to the original data table exists;
if not, executing S3: creating a standard table corresponding to the original data table to be put in storage, and then executing S4. Specifically: if the original data table to be put in storage is a physical table and no input standard exists, executing S31, creating the corresponding standard table according to a preset standard field table; if the original data table to be put in storage is a view table and no corresponding access configuration exists, executing S32, setting the corresponding data input layer and configuration fields according to the original data table and generating the corresponding configuration;
if yes, executing S4: acquiring the field data corresponding to the standard table from the original data table according to the standard table;
S5, sequentially performing filtering, deduplication, format conversion and UDF function processing on the field data to obtain standard data;
S6, if the type of the standard table corresponding to the standard data is a physical table, mapping the standard data according to the standard table;
S7, storing the standard data into the data warehouse;
the embodiment provides a specific application scenario, where the Data Warehouse includes a Data Detail layer (DWD), a Data dimension topic layer (DWS), and a repository layer; the DWD layer is divided into a view layer and a physical layer; realizing automatic table building, extraction, conversion and loading through a DWD layer and based on a button; the method comprises the following specific steps:
when protocol data based on JSM, kafka, MQTT and the like are used as a data input layer;
s1, creating a conversion task for judging whether the original data table exists, and calling based on the operation, namely acquiring the original data table;
s2, determining whether the standard table corresponding to the original data table is a physical table or a view table based on the term standard table (table _ standard), and if the standard table is a physical table, determining whether the table name exists to determine whether the corresponding physical table exists as follows: whether in _ standard exists or not, if not, executing S31, and creating a corresponding field and a table name based on a standard field table (table _ standard _ field) to obtain a standard table; if the physical table exists, executing S4; if the view table is the view table, judging whether the configured original data table exists, if not, executing S32, then executing step S4, and if so, executing S4;
wherein S32 includes:
S321, taking JMS, Kafka, MQTT, and similar protocol data as the data input layer and configuration fields in kettle, setting a read batch size such as 1000, and configuring the offset mechanism not to commit, i.e., obtaining the fields to match against table_standard_field. This means that the batch of 1000 records is read without committing the offset and without consumption admission control; only the fields and their corresponding types are obtained, and the offset is committed at the next real consumption;
S322, operating on the key strings: trimming leading and trailing spaces and removing special characters from each key;
S323, reading the 1000 rows of data through a preset java script, acquiring all fields and field types, and obtaining each field's maximal value to judge its type (for example, a field whose value fits an int is typed int, and one exceeding the int range is considered Long); generating the standard create statement together with the standard-specified md5 field and timestamp field; creating the table partition strategy based on the standard-customized md5 and timestamp fields, and configuring how many leading partitions to delete. The md5 field is used later for deduplication, deduplication-operator filtering, full-field selection, and the like; if a specific field is designated, deduplication is performed on that field instead. The specified topic data in Kafka is read via java; after reading, record.value() is parsed into json data, and the keys in the json are then read;
S324, creating the table script: calling the doris data source to execute the create statement from S323;
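As a concrete illustration of S321 to S324, the following is a minimal sketch in plain Java rather than the patent's kettle implementation: it samples a batch of Kafka records without committing offsets, infers field types from the sampled values, and assembles a Doris create statement with the standard md5 and timestamp fields. The broker address, topic, group id, and table names are hypothetical, and the DDL omits the partition-deletion configuration described in S323. (A production job would also time out if the topic holds fewer than 1000 records.)

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.time.Duration;
    import java.util.*;

    public class StandardTableBootstrap {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka:9092");
            props.put("group.id", "standard-table-probe");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("enable.auto.commit", "false");   // S321: sample only, never commit the offset
            props.put("auto.offset.reset", "earliest");

            Map<String, String> fieldTypes = new LinkedHashMap<>();
            ObjectMapper mapper = new ObjectMapper();
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("in_topic"));
                int sampled = 0;
                while (sampled < 1000) {                // S321: read a batch of 1000 records
                    for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(2))) {
                        JsonNode row = mapper.readTree(rec.value());   // S323: record.value() -> json
                        row.fields().forEachRemaining(e -> {
                            String key = e.getKey().trim();            // S322: trim spaces in keys
                            fieldTypes.merge(key, inferType(e.getValue()), StandardTableBootstrap::widen);
                        });
                        if (++sampled >= 1000) break;
                    }
                }
            } // closing without commitSync() leaves the offset for the next real consumption

            // S323: standard create statement with the md5 dedup field and the timestamp field;
            // the real statement would also carry the partition strategy on ts
            StringBuilder ddl = new StringBuilder("CREATE TABLE in_standard (\n  md5_id VARCHAR(32),\n  ts DATETIME");
            fieldTypes.forEach((k, t) -> ddl.append(",\n  ").append(k).append(' ').append(t));
            ddl.append("\n) DISTRIBUTED BY HASH(md5_id) BUCKETS 10;");
            System.out.println(ddl); // S324 hands this statement to the doris data source
        }

        // S323 type judgment: values fitting an int are INT, larger ones BIGINT, the rest VARCHAR
        static String inferType(JsonNode v) {
            if (v.canConvertToInt()) return "INT";
            if (v.canConvertToLong()) return "BIGINT";
            return "VARCHAR(" + Math.max(64, v.asText().length() * 2) + ")";
        }

        static String widen(String a, String b) {       // keep the wider of two observed types
            if (a.equals(b)) return a;
            if (a.startsWith("VARCHAR") || b.startsWith("VARCHAR")) return a.length() >= b.length() ? a : b;
            return "BIGINT";
        }
    }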
S4, configuring the JMS, Kafka, MQTT protocol connection information in kettle according to the standard table, configuring the field data and types, consuming from the earliest offset position, and committing the offset position automatically; after completion, judging whether the whole data packet is empty, and if so, not executing further; that is, if all data in the current data packet is empty, or the data packet itself is empty, the flow terminates;
S5, sequentially performing filtering, deduplication, format conversion and UDF function processing on the field data to obtain standard data;
the method specifically comprises the following steps:
S51, data filtering: filtering out null values across all fields and trimming leading and trailing spaces;
S52, data deduplication: performing deduplication over the full fields of the batch and generating the unique md5 field;
S53, data format conversion: performing data type conversions such as string-to-date and string-to-integer, converting generic date formats into YYYY-MM-DD and generic time formats into HH:MM:SS, and inserting the timestamp field;
S54, UDF function processing: converting the corresponding fields with java online-coded UDF functions, for example field completion and field encryption or decryption;
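As an illustration of S51 to S54, the following is a condensed sketch in plain Java; it assumes each row arrives as an ordered field map, and the helper names and sample row are illustrative rather than taken from the patent:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;
    import java.util.*;

    public class CleaningChain {
        // S51 data filtering: drop null or blank values and trim surrounding spaces
        static Map<String, String> filter(Map<String, String> row) {
            Map<String, String> out = new LinkedHashMap<>();
            row.forEach((k, v) -> { if (v != null && !v.trim().isEmpty()) out.put(k.trim(), v.trim()); });
            return out;
        }

        // S52 deduplication: md5 over the full concatenated fields yields the unique md5 field
        static String md5OfAllFields(Map<String, String> row) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(String.join("\u0001", row.values()).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) hex.append(String.format("%02x", b));
            return hex.toString();
        }

        // S53 format conversion: normalize a date string to YYYY-MM-DD
        static String toStandardDate(String raw, String pattern) {
            return LocalDate.parse(raw, DateTimeFormatter.ofPattern(pattern))
                            .format(DateTimeFormatter.ISO_LOCAL_DATE);
        }

        // S54 UDF processing: e.g. field completion; real UDFs are java code compiled online
        static String padIdCard(String id) { return id.length() == 15 ? id + "***" : id; }

        public static void main(String[] args) throws Exception {
            Map<String, String> row = new LinkedHashMap<>();
            row.put(" name ", " Zhang San ");
            row.put("birthday", "2021/12/28");
            Map<String, String> clean = filter(row);
            clean.put("birthday", toStandardDate(clean.get("birthday"), "yyyy/MM/dd"));
            clean.put("md5_id", md5OfAllFields(clean));
            System.out.println(clean); // {name=Zhang San, birthday=2021-12-28, md5_id=...}
        }
    }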
S6, if the type of the standard table corresponding to the standard data is a physical table, mapping the standard data according to the standard table. Specifically, if the standard table is a view table, no mapping is needed; if it is a physical table, the accessed field names are mapped directly through the standard field table (table_standard_field) into the standard fields and the corresponding standard field types;
S7, storing the standard data into the data warehouse;
in another alternative embodiment, if the document protocol data based on documents CSV, EXCEL, JSON, XML, etc. is used as the data input layer; step S321 is changed to S321 a: reading a file (namely an original data table) according to a standard table, configuring fields, compiling java scripts based on a key to carry out uniform transcoding to UTF-8 protocol, and assembling into each json form to carry out data issuing;
meanwhile, step S4 is changed to S4 a: and judging whether the whole data packet is empty or not, if so, not executing downwards, namely, if all data in the current data packet is empty or the data packet is empty, terminating the flow.
Example two
This embodiment differs from the first in that protocol data such as HTTP and REST serves as the data input layer.
Step S321 is changed to S321b:
S3211, acquiring the request header and its parameters based on the kettle configuration;
S3212, configuring and splicing the interface request body content;
S3213, configuring the spliced interface request body, and outputting all request-header parameters to a log for later testing and verification;
S3214, configuring the HTTP access component: configuring the url address, the request protocol (POST/GET), and the return field for the acquired result;
S3215, writing a java script to convert the returned data into a general json format, specified as follows:
{"page": "1", "pageSize": "100", "data": [{"keyField": "", …}]}
where "page": "1" means the current page number, starting from 1;
"pageSize": "100" means the page size is 100;
and "data": [{"keyField": "", …}] represents the returned data, where keyField is a returned field;
subsequently, the returned fields are used as row data, and steps S322 to S324 are executed;
after step S324 is completed, step S325 is executed: repeating S3211 to S3215;
step S326: a js script obtains the current page, feeds current page + 1 back into the above steps, and S3211 to S3215 are executed page by page.
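For illustration, the paging loop of S3211 to S3215 and S326 can be sketched with the JDK's built-in HttpClient; the endpoint url and request-body fields are placeholders, and the patent's kettle components are replaced here by plain Java:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class HttpPagingExtractor {
        public static void main(String[] args) throws Exception {
            HttpClient http = HttpClient.newHttpClient();
            ObjectMapper mapper = new ObjectMapper();
            int page = 1, pageSize = 100;
            while (true) {
                // S3212 to S3214: assemble the request body and call the configured url
                HttpRequest req = HttpRequest.newBuilder(URI.create("http://example.com/api/records"))
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(
                                "{\"page\":" + page + ",\"pageSize\":" + pageSize + "}"))
                        .build();
                String body = http.send(req, HttpResponse.BodyHandlers.ofString()).body();

                // S3215: the response is already assumed converted to the general json format
                JsonNode data = mapper.readTree(body).path("data");
                if (!data.isArray() || data.size() == 0) break;    // empty page terminates the flow
                data.forEach(rowJson -> System.out.println(rowJson)); // each row then runs S322 to S324
                page++;                                            // S326: current page + 1
            }
        }
    }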
EXAMPLE III
This embodiment differs from the first or second in that data from file storage, columnar storage, relational databases, in-memory databases, full-text search engines, document databases, and the like serves as the data input layer.
Step S3 is changed to S3a, comprising the following steps:
S31: executing the sql script to obtain the corresponding field headers and types;
S32: writing the create-table statement based on a java script, generating the standard create statement with the standard-specified md5 field and timestamp field, and applying the partition strategy based on the standard-customized md5 and timestamp fields;
S33: creating the table script: calling the doris data source to execute the create statement;
Step S4 is changed to S4b, comprising the following steps:
S41: querying the total number of rows; the total is used for paged acquisition of the data in the next step;
S42: setting the query parameters: fixing the total query count as a constant, setting the page size and the current page number, and assembling the dynamic query statement;
S43: judging whether the whole data packet is empty, and if so, not executing further; that is, if all data in the current data packet is empty or the data packet itself is empty, the flow terminates; otherwise performing steps S5 to S7;
step S8 follows after step S7 is completed:
S8: acquiring the current page number and adding a judgment statement via a js script: if the number of rows queried so far is greater than or equal to the total, paging extraction stops; otherwise S42 to S8 continue in a loop.
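A sketch of S41 to S43 and the S8 loop for database sources, using plain JDBC in place of the patent's kettle flow; the connection string, credentials, and table name are placeholders:

    import java.sql.*;

    public class JdbcPagedExtractor {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://db:3306/source", "user", "password")) {
                long total;
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM in_table")) {
                    rs.next();
                    total = rs.getLong(1);               // S41: total count drives the paging
                }
                int pageSize = 1000;
                for (long offset = 0; offset < total; offset += pageSize) {
                    // S42: assemble the dynamic query statement for the current page
                    try (PreparedStatement ps = conn.prepareStatement(
                            "SELECT * FROM in_table LIMIT ? OFFSET ?")) {
                        ps.setInt(1, pageSize);
                        ps.setLong(2, offset);
                        try (ResultSet rs = ps.executeQuery()) {
                            boolean empty = true;
                            while (rs.next()) { empty = false; /* rows flow into S5 to S7 */ }
                            if (empty) return;           // S43: empty packet terminates the flow
                        }
                    }
                }   // S8: once the queried count reaches the total, paging extraction stops
            }
        }
    }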
EXAMPLE four
This embodiment differs from the first, second, or third embodiment in that the accessed data is a heterogeneous data source.
When data is accessed into the warehouse, other heterogeneous data sources often need to be synchronized into the doris warehouse; such heterogeneous sources tend to be large in volume but consistent in fields and types. To satisfy both full and incremental synchronization, DataX Web is adopted for data migration into the doris warehouse. If only a data-migration synchronization is needed between the access layer and the DWD layer, or a DWD-layer standard view table is changed into a physical table, for example migrating the original data table (in_table) into the standard physical table (in_standard), a DataX Web user can create a data synchronization task by selecting the data source through the page. Data sources such as RDBMS, Hive, HBase, ClickHouse, MongoDB, and oltp are supported; RDBMS data sources can create synchronization tasks in batches; real-time viewing of synchronization progress and logs is supported; and a synchronization-termination function is provided.
Referring to fig. 2, the data source type of the original data table to be put in storage is judged, and if it is a heterogeneous data source, it is synchronized through DataX Web:
A1, creating the data sources: creating the source database and the target database;
A2, creating a task template: constructing the task synchronization flow and selecting the target table; if the target table does not exist, its table structure is created automatically;
A3, constructing the JSON script: selecting the field mapping and the full or incremental synchronization fields, appending the internal standard full-field md5 field and the insertion timestamp field, generating the mapped json synchronization content, handing it to DataX to generate a python file, and updating on a schedule or invoking DataX once for data synchronization via cluster distribution;
A4, judging whether the data migration is complete by comparing the counts in the migration libraries: if they are consistent, or the count in the new library is greater than or equal to that in the historical library table (the original data table or service table), the migration succeeded. It is then judged, through the standard field table, whether the table is an original data table whose standard table is a view table; if so, the data of the original data table and its table structure are deleted directly; otherwise no processing is performed;
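The json synchronization content of A3 can be sketched as a DataX-style job assembled in Java (a Java 15+ text block). The reader and writer names follow common DataX plugin conventions, but the exact parameter names vary by plugin version, and all connection details, tables, and columns here are placeholders, not the patent's configuration:

    public class DataxJobBuilder {
        public static void main(String[] args) {
            // the column list carries the mapped fields plus the appended md5 and
            // timestamp fields of A3; expressions are spliced into the reader's SELECT
            String job = """
                {
                  "job": {
                    "content": [{
                      "reader": {
                        "name": "mysqlreader",
                        "parameter": {
                          "username": "user", "password": "password",
                          "connection": [{"jdbcUrl": ["jdbc:mysql://db:3306/source"],
                                          "table": ["in_table"]}],
                          "column": ["id", "name", "id_card",
                                     "md5(concat_ws('', id, name, id_card))",
                                     "now()"]
                        }
                      },
                      "writer": {
                        "name": "doriswriter",
                        "parameter": {
                          "username": "user", "password": "password",
                          "column": ["id", "name", "id_card", "md5_id", "ts"],
                          "connection": [{"jdbcUrl": "jdbc:mysql://doris:9030/dw",
                                          "table": ["in_standard"]}]
                        }
                      }
                    }],
                    "setting": {"speed": {"channel": 3}}
                  }
                }
                """;
            System.out.println(job); // DataX Web submits this job and tracks progress (A4)
        }
    }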
the data flow configuration is carried out based on the DataXWeb, and the following beneficial effects are achieved:
1. monitoring from the source data to the target data, with the counts of successful and failed records and the log information viewable; 2. full and incremental synchronization of data, synchronizing efficiently into doris; 3. simultaneous synchronization of multiple heterogeneous data sources; 4. deletion of the original data tables associated with historical DWD physical-layer view tables, reducing input redundancy and disk space while improving data robustness; 5. automatic creation of the target table structure from the original table's field types when the target table does not exist.
EXAMPLE five
This embodiment differs from any of the first to fourth embodiments in that the data undergoes secondary governance before warehousing.
After the step S6 of mapping the standard data, the method further includes:
judging whether a secondary governance instruction is received; if not, streaming the standard data to the doris operator and storing it in the data warehouse;
if yes, inputting the standard data into a preset governance end, and sequentially acquiring different governance operators to process the standard data into governed data; after the governed data is obtained, the following steps are executed:
acquiring the data grouping statistical operator;
grouping the governed data according to the data grouping statistical operator to form a service wide table;
the governance operators comprise a data filtering operator, a data deduplication operator, a data association operator, a data format conversion operator, a UDF function operator, a window operator, a service code operator and a sink operator;
Based on DWD-layer data governance, the corresponding data layer is selected for secondary data governance, compensating for kettle's performance limitations or for customization demands that require writing flink code. Specifically:
B1, dragging a kafka source operator onto the canvas and configuring the kafka data source information: for a data source distributed to kafka through kettle access, entering the kafka consumption address and topic name and selecting the consumption offset (from a specified position, from the starting position, or from the most recent position). At this point the kafka data may be sent by kettle or by other terminals; this step is mutually exclusive with step B2;
B2, dragging a database source operator onto the canvas, configuring the database connection information, account, password, database name, and the like, and writing the query sql statement; this step is mutually exclusive with step B3;
B3, dragging to select the next standard-table field mapping on the canvas: the table name is obtained from the table field in the streaming pipeline data; if it is an original data table name such as in_table, it is replaced with the standard table such as in_standard through the standard-table field mapping, and the correspondingly accessed fields are replaced with the standard fields; any that cannot be matched are kept as-is;
B4, dragging to select the next data filtering operator node on the canvas: entering the filter key names and their corresponding values, and selecting the conditions to satisfy (greater than, equal to, not equal to, less than, greater than or equal to, less than or equal to, regular expression, empty, not empty, contains, fuzzy match, starts with, and ends with), with support for multiple AND/OR condition groups; values meeting the conditions are filtered out and do not flow to the next node. For example, if the current key is name and its value is Zhang San, a "starts with" condition of Zhang matches values beginning with Zhang; if the type of the current key is length and length is 10, the condition is set per the specific requirement, such as greater than 10 or less than 10;
B5, dragging to select the next data deduplication operator node on the canvas: entering one or more filter key field names and setting whether case is ignored for them; values whose key-field combinations are identical are filtered out and do not flow to the next node;
B6, dragging to select the next data association operator node on the canvas: entering one or more groups of key field names together with the equality conditions binding the key-field combination to a table of the repository layer; the values corresponding to the key field names are backfilled, supplementing the qualifying data from the repository-layer data values. For example:
the selected streaming fields are:
sexCode = 1, and sexValue is empty;
the bound repository-layer table is t_sex, in which t_code = 1 and t_sexValue = male;
sexValue is then completed directly as male;
B7, dragging to select the next data format conversion operator node on the canvas: selecting the corresponding key field name and applying the data type conversion, such as string to Date, and the data format conversion, such as YYYY-MM-DD;
B8, dragging to select the next java online UDF function operator node on the canvas: compiled bytecode is obtained by entering java code and compiling it at runtime, after which the class is defined at runtime with a custom class loader, realizing business conversion on the corresponding data fields;
B9, dragging to select the next window operator node on the canvas: grouping by the selected keys, choosing the window type, such as a time window or count window, and entering the corresponding time or count, so that batched data sets are formed and sent downstream;
B10, dragging to select the next service code operator node on the canvas: performing a data cleaning step by configuring the specified java package name and the corresponding class name; for example, a service operator that triggers an http call to another service system;
B11, dragging to select the final write sink operator node on the canvas: configuring the doris connection information and associating the standard table by the table name of the fields. In the view-table case, the record is converted directly into the original data table and its corresponding fields; in the physical-table case, whether the record exists is judged by a database query: if it does not, records are merged and inserted in batches; if it does, an update operation is performed;
B12, generating an xml flow chart from the operators, publishing it to the corresponding flink jar program, and executing the flow chart, finally performing the data governance processing;
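To make the operator chain concrete, the following is a minimal flink DataStream sketch covering B1, B4, B7, and B11 (B5's deduplication would need keyBy plus keyed state and is omitted); the broker, topic, and group id are placeholders, and print() stands in for a doris sink:

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class GovernanceJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // B1: kafka source operator; the canvas's offset choice maps to OffsetsInitializer
            KafkaSource<String> source = KafkaSource.<String>builder()
                    .setBootstrapServers("kafka:9092")
                    .setTopics("in_standard_topic")
                    .setGroupId("governance")
                    .setStartingOffsets(OffsetsInitializer.latest())   // "most recent position"
                    .setValueOnlyDeserializer(new SimpleStringSchema())
                    .build();

            env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
               // B4: data filtering operator; a single "not empty" condition stands in
               // for the configurable AND/OR condition groups
               .filter(json -> json != null && !json.trim().isEmpty())
               // B7: data format conversion operator, e.g. 2021/12/28 -> 2021-12-28
               .map(json -> json.replace("/", "-"))
               // B11: sink operator; print() stands in for a doris sink with batch inserts
               .print();

            env.execute("dwd-governance");   // B12: submit the generated job to flink
        }
    }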
Based on DWS subject-layer governance, the corresponding data layer is selected to perform subject statistics over the data, forming the corresponding service wide table;
specifically, the method comprises the following steps:
C1, selecting the governance layer via the canvas: choosing the data source and connection information of the DWD or DWS layer, filling in the standard query sql statement, and converting the sql statement for the corresponding doris or mongo database through Apache Calcite;
C2, dragging on the canvas and executing steps B4 to B12;
C3, selecting the data grouping statistical operator via the canvas: grouping by a specified key field and performing a Reduce calculation over the fields, forming the columns of the wide table;
C4, selecting the mongo data table in the target-end sink via the canvas and configuring storage into the DWS layer, updating or inserting the data into the mongo database;
C5, generating an xml flow chart from the operators and publishing it to the corresponding flink jar program, producing the DWS subject wide-table data; MapReduce normalization of the subject wide-table data based on the service data operator forms the service subject data;
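A minimal flink sketch of the C3 grouping step, with hypothetical (groupKey, count) tuples standing in for governed rows:

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class WideTableJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // governed rows reduced to (groupKey, count) pairs for the sketch
            env.fromElements(
                    Tuple2.of("fujian", 1L), Tuple2.of("sichuan", 1L), Tuple2.of("fujian", 1L))
               .keyBy(t -> t.f0)                                // group by the specified key field
               .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))  // the Reduce calculation of C3
               .print();                                        // C4 would upsert into the DWS mongo store

            env.execute("dws-wide-table");                      // C5: run as the flink jar program
        }
    }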
The above data governance achieves the following beneficial effects: 1. visual flink flow configuration based on the canvas, so that the data flow direction can be mastered and understood; 2. whole-flow data cleaning based on combinations of multiple flink operators, realizing the data cleaning process; 3. through the standard table, the specified DWD layer is decoupled from the underlying table names and fields, realizing the universality of the DWD layer; the inconsistency of the field data and data types pushed from different cities is alleviated by unifying through the standard table, so the business side deals only with the standard table and needs less knowledge of the data.
EXAMPLE six
This embodiment differs from any of the first to fifth embodiments in that the warehouse data is queried.
DWD-layer business data query rules are not limited to single-table and multi-table association through querying the standard tables. One or more standard tables are parsed, and whether each is a view table or a physical table is judged through the table_standard table; if it is a view table, sql parsing replaces the queried table name and fields with the original data table name and fields. In this way, upper-layer queries proceed without awareness of the underlying tables: business queries go through the standard table, and query standardization is realized (the standard table is converted into the corresponding physical or logical table); the business never touches the actual table names and fields, which reduces the table-name disunity caused by inconsistent fields during data access and remains compatible with inconsistent data-source field types across different cities.
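The view-table branch of this query rewrite can be illustrated with the sketch below, where plain string replacement stands in for a real sql parser and the mapping entries are hypothetical:

    import java.util.Map;

    public class StandardQueryRewriter {
        // view-table branch: swap standard names for original ones before execution
        public static String rewrite(String sql, boolean isViewTable,
                                     Map<String, String> standardToOriginal) {
            if (!isViewTable) return sql;          // physical table: query it directly
            String rewritten = sql;
            for (Map.Entry<String, String> e : standardToOriginal.entrySet()) {
                rewritten = rewritten.replace(e.getKey(), e.getValue());
            }
            return rewritten;                      // upper layers never see the real names
        }

        public static void main(String[] args) {
            Map<String, String> mapping = Map.of("in_standard", "in_table", "std_name", "raw_name");
            System.out.println(rewrite(
                    "SELECT std_name FROM in_standard WHERE std_name = 'Zhang San'", true, mapping));
            // prints: SELECT raw_name FROM in_table WHERE raw_name = 'Zhang San'
        }
    }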
EXAMPLE seven
A data construction and management terminal of a data warehouse, comprising a memory, a processor, and a computer program stored on the memory and operable on the processor, wherein the processor executes the computer program to implement each step of the data construction and management method of a data warehouse according to any one of embodiments one to six.
The terms:

in_table: the original data table (its schema is shown as an image in the original document).

T1, DWD standard view table:

table_standard: the view-layer standard table (schema shown as an image in the original document).

table_standard_field: the standard view field table (schema shown as images in the original document).

table_standard_relation: the mapping configuration between the original data table and the standard view table (schema shown as an image in the original document).
T2, DWD standard physical table:
When in_table, in_table2, and the in_standard view table stand in an n:1 relationship, for example because the data structure and field types of the dishonest-debtor (loss-of-credit executed person) records from Sichuan Province are inconsistent with those from Chengdu, the n:1 case is aimed at a DWD-layer physical table; that is, n original data tables form one DWD physical-layer table.
In this case, the entry in table_standard is changed to in_standard with type 2 (physical table), and the in_standard table and its corresponding fields are created directly (table definition shown as an image in the original document).
a: the newly accessed in_table2 is field-mapped through the original-data-table entry in table_standard_relation and put directly into the standard physical table in_standard.
b: the historical records of the previously accessed in_table are put in full into the standard physical table in_standard through the field mapping configuration of DataX Web, after which the data and table structure of in_table are deleted.
c: the real-time data of the previously accessed in_table is converted into the physical-table fields and table name through the standard-table field mapping of a kettle custom operator, or through the standard-table field mapping of a flink custom operator, and then warehoused.
Mapping configuration of the table_standard_relation original data table and standard table (shown as an image in the original document).
In conclusion, the method and terminal provided by the invention visualize the flow based on kettle, making the data flow direction clear at a glance, governing and warehousing data quickly, and keeping the data flow configuration dynamic; meanwhile, access to various data protocols is realized through kettle without hard-coding the data flow. After the original data table to be warehoused is obtained, corresponding standard data is generated from it and then mapped with the standard field table to obtain the standard table, or the corresponding standard table is created directly from the original data table to be warehoused, so the original data table and the generated standard data do not both need to be stored in the database, reducing data redundancy and allowing data to be governed and warehoused quickly; meanwhile, the field data in the original data table is filtered, deduplicated, format-converted and processed by UDF functions, so the data to be stored is highly standardized while its redundancy is reduced, improving data extraction efficiency and lowering data governance cost.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all equivalent changes made by using the contents of the present specification and the drawings, or applied directly or indirectly to the related technical fields, are included in the scope of the present invention.

Claims (10)

1. A data construction and management method of a data warehouse is characterized by comprising the following steps:
acquiring an original data table to be put in storage;
judging whether a standard table corresponding to the original data table to be put in storage exists in the data warehouse or not;
if yes, acquiring the field data corresponding to the standard table from the original data table according to the standard table;
sequentially performing filtering, deduplication, format conversion and UDF function processing on the field data to obtain standard data;
storing the standard data into a data warehouse;
and if not, creating a standard table corresponding to the original data table to be put in storage.
2. The data construction and management method of a data warehouse according to claim 1, wherein the standard tables include view tables and physical tables;
the judging whether the standard table corresponding to the original data table to be put in storage exists in the data warehouse comprises the following steps:
judging the type of the standard table corresponding to the original data table to be put in storage: if it is a physical table, judging whether an input standard corresponding to the original data table to be put in storage exists, and if it does, the standard table corresponding to the original data table to be put in storage exists;
if it is a view table, judging whether a corresponding access configuration exists for the original data table to be put in storage, and if it does, the standard table corresponding to the original data table to be put in storage exists.
3. The method according to claim 2, wherein if the standard table does not exist, the creating of the standard table corresponding to the original data table to be put in storage includes:
and if the original data table to be put in storage is a physical table and an input standard does not exist, creating a corresponding standard table according to a preset standard field table.
4. The method according to claim 2, wherein if the standard table does not exist, the creating of the standard table corresponding to the original data table to be put in storage includes:
and if the original data table to be put in storage is a view table and corresponding access configuration does not exist, setting a corresponding data input layer and a corresponding configuration field according to the original data table to be put in storage and generating corresponding configuration.
5. The method for building and managing data of a data warehouse of claim 2, wherein before storing the standard data in the data warehouse, the method further comprises:
and if the type of the standard table corresponding to the standard data is a physical table, mapping the standard data according to the standard table.
6. The data construction and management method of a data warehouse according to claim 5, wherein after the mapping of the standard data, the method further comprises:
judging whether a secondary governance instruction is received; if not, streaming the standard data to a doris operator and storing it in the data warehouse;
if yes, inputting the standard data into a preset governance end, and sequentially acquiring different governance operators to process the standard data into governed data.
7. The data construction and management method of a data warehouse according to claim 6, wherein after the governed data is obtained, the method further comprises:
acquiring a data grouping statistical operator;
and grouping the governed data according to the data grouping statistical operator to form a service wide table.
8. The data construction and management method of a data warehouse according to claim 6, wherein the governance operators comprise a data filtering operator, a data deduplication operator, a data association operator, a data format conversion operator, a UDF function operator, a window operator, a business code operator and a sink operator.
9. The method according to claim 1, wherein the obtaining of the original data table to be put into storage comprises:
and judging the data source type of the original data table to be put in storage, and if it is a heterogeneous data source, synchronizing the heterogeneous data source through DataX Web.
10. A data construction and management terminal for a data warehouse, comprising a memory, a processor and a computer program stored on the memory and operable on the processor, wherein the processor executes the computer program to implement the steps of a data construction and management method for a data warehouse according to any one of claims 1 to 9.
CN202111622479.9A, priority date 2021-12-28, filing date 2021-12-28: Data construction and management method and terminal of data warehouse. Status: Pending. Publication: CN114328759A (en).

Priority Applications (1)

CN202111622479.9A, priority date 2021-12-28, filing date 2021-12-28: Data construction and management method and terminal of data warehouse

Publications (1)

CN114328759A, published 2022-04-12

Family ID: 81014607

Country Status (1): CN114328759A (en), China

Cited By (4)

* Cited by examiner, † Cited by third party

Publication number, priority date, publication date, title:
CN117009998A * 2023-08-29 2023-11-07 Data inspection method and system
CN117520408A * 2023-11-01 2024-02-06 Data increment statistical method, device, equipment and storage medium for doris
CN117331513A * 2023-12-01 2024-01-02 Data reduction method and system based on Hadoop architecture
CN117331513B * 2023-12-01 2024-03-19 Data reduction method and system based on Hadoop architecture


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination