CN113297188B - Data processing method and device - Google Patents

Publication number
CN113297188B
Authority
CN
China
Prior art keywords
data
project
partitions
fragments
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110136105.XA
Other languages
Chinese (zh)
Other versions
CN113297188A (en)
Inventor
尤田
孟庆义
沈春辉
古青松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taobao China Software Co Ltd
Original Assignee
Taobao China Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taobao China Software Co Ltd
Priority to CN202110136105.XA
Publication of CN113297188A
Application granted
Publication of CN113297188B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems

Abstract

An embodiment of the present specification provides a data processing method and apparatus, wherein the data processing method includes: analyzing data distribution information of project data to be transmitted in a data source; calling a MapReduce algorithm to segment the project data based on the data distribution information and generate a plurality of data fragments; and calling a database import component to transmit the project data in the plurality of data fragments to a preset number of data partitions under a target database table.

Description

Data processing method and device
Technical Field
The embodiment of the specification relates to the technical field of databases, in particular to a data processing method. One or more embodiments of the present specification also relate to a data processing apparatus, a computing device, and a computer-readable storage medium.
Background
With the rapid development of the internet, the variety and scale of data on the internet are increasing rapidly. Because traditional relational databases hit performance and scale bottlenecks when meeting the storage, query, and analysis requirements of big data, various non-relational databases have emerged and become important tools in the field of data storage and analysis. Data storage in a non-relational database does not require a fixed table structure, usually involves no join operations, and is highly scalable. Based on their storage structures, non-relational databases are classified into key-value databases, column-family databases, document databases, graph databases, and the like.
In practical applications, a non-relational database is often used as a project processing platform to provide services such as project processing. To ensure the real-time performance and efficiency of project processing, massive transaction data cannot be stored in the non-relational database. When a data query service is provided externally through the non-relational database, a data warehouse is therefore usually used as well: a common architecture performs the complex computation in the data warehouse and then periodically flows the computation results back to the non-relational database in batches for full data storage and real-time reading and writing.
Disclosure of Invention
In view of this, the embodiments of the present specification provide a data processing method. One or more embodiments of the present specification also relate to a data processing apparatus, a computing device, and a computer-readable storage medium to address technical deficiencies in the prior art.
According to a first aspect of embodiments herein, there is provided a data processing method including:
analyzing data distribution information of project data to be transmitted in a data source;
calling a MapReduce algorithm, and segmenting the project data based on the data distribution information to generate a plurality of data fragments;
and calling a database import component to transmit the project data in the data fragments to a preset number of data partitions under a target database table.
Optionally, after generating the plurality of data fragments, the method further includes:
sorting the project data contained in the data fragments to generate a plurality of target data fragments containing sorting results;
correspondingly, the invoking the database import component to transmit the project data in the plurality of data fragments to a preset number of data partitions under a target database table includes:
and calling a database import component to transmit the project data in the target data fragments to a preset number of data partitions under a target database table.
Optionally, the invoking the database import component to transmit the project data in the target data fragments to a preset number of data partitions under a target database table includes:
integrating the project data contained in the target data fragments according to the sorting result to generate a data source table, and establishing a first mapping relation between the target data fragments and the data source table;
and reading project data contained in any target data fragment in the data source table according to the first mapping relation, and calling a database import component to transmit the project data contained in any target data fragment to at least one data partition under the target database table.
Optionally, the invoking the database import component to transmit the project data in the target data fragments to a preset number of data partitions under a target database table includes:
determining the number of data fragments;
and constructing a corresponding number of data writing modules in the database import component, and calling the corresponding number of data writing modules to respectively transmit the project data in the target data fragments to a preset number of data partitions under a target database table.
Optionally, the sorting the item data included in each of the plurality of data fragments to generate a plurality of target data fragments including a sorting result includes:
determining the total data amount of the project data to be transmitted according to the data distribution information;
determining the quantity of project data contained in each data fragment according to the total data quantity and the quantity of the data fragments;
and sorting the project data contained in the plurality of data fragments according to the quantity of the project data contained in each data fragment, and generating a plurality of target data fragments containing sorting results, wherein the sorting results among the plurality of target data fragments are continuous.
Optionally, the analyzing data distribution information of the item data to be transmitted in the data source includes:
determining a quantile graph corresponding to project data to be transmitted in a data source, and determining data distribution information of the project data according to the quantile graph;
correspondingly, the invoking a MapReduce algorithm, and segmenting the project data based on the data distribution information to generate a plurality of data fragments includes:
and executing a Map task, and mapping the project data to different data fragments based on the data distribution information.
Optionally, after generating the plurality of data fragments, the method further includes:
executing a Reduce task, sorting the project data respectively contained in the plurality of data fragments, and generating a plurality of target data fragments containing sorting results;
and merging the project data contained in the target data fragments according to the sorting result based on the mergeable summaries theory to generate a data source table.
Optionally, the data processing method further includes:
determining the number of data partitions having a mapping relation with each data fragment in the plurality of data fragments;
and optimizing, according to the number of the data partitions, the data partitions that have a mapping relation with any data fragment in the plurality of data fragments.
Optionally, the determining the number of data partitions having a mapping relation with each data fragment in the plurality of data fragments includes:
establishing a second mapping relation between the plurality of data fragments and the data partitions according to the transmission result;
and determining the number of data partitions having a mapping relation with each data fragment in the plurality of data fragments according to the second mapping relation.
Optionally, the optimizing, according to the number of the data partitions, the data partitions having a mapping relation with any data fragment in the plurality of data fragments includes:
determining a first number of data partitions having a mapping relation with the ith data fragment and a second number of data partitions having a mapping relation with the (i + 1)th data fragment, wherein i belongs to [1, n-1], n is the total number of the data fragments, and n is a positive integer;
judging whether the first number is equal to the second number;
if not, merging or splitting the data partitions having a mapping relation with the (i + 1)th data fragment according to the first number; or,
merging or splitting the data partitions having a mapping relation with the ith data fragment according to the second number.
Optionally, the sorting the project data included in each of the plurality of data fragments includes:
and determining a row primary key of a target database table, and sequencing the item data respectively contained in the plurality of data fragments according to the row primary key.
According to a second aspect of embodiments herein, there is provided a data processing apparatus comprising:
the analysis module is configured to analyze data distribution information of project data to be transmitted in the data source;
the data segmentation module is configured to call a MapReduce algorithm, segment the project data based on the data distribution information, and generate a plurality of data fragments;
and the data transmission module is configured to call the database import component to transmit the project data in the plurality of data fragments to a preset number of data partitions under a target database table.
According to a third aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is for storing computer-executable instructions, and the processor is for executing the computer-executable instructions to implement the steps of the data processing method.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the data processing method.
One embodiment of the present specification analyzes data distribution information of project data to be transmitted in a data source, calls a MapReduce algorithm to segment the project data based on the data distribution information and generate a plurality of data fragments, and calls a database import component to transmit the project data in the plurality of data fragments to a preset number of data partitions under a target database table. Because the project data is segmented according to its data distribution information, and the project data in any data fragment of the segmentation result is transmitted to at least one data partition of the target database table, data skew during data transmission is avoided, and the stability and availability of the data link between the data source and the database are guaranteed.
Drawings
FIG. 1 is a process flow diagram of a data processing method provided in one embodiment of the present specification;
FIG. 2a is a schematic diagram of a data processing process provided in one embodiment of the present description;
FIG. 2b is a schematic diagram of another data processing process provided in one embodiment of the present description;
FIG. 3 is a flowchart illustrating a data processing method according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a data processing apparatus provided in one embodiment of the present description;
FIG. 5 is a block diagram of a computing device according to an embodiment of the present disclosure.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. This description may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein, as those skilled in the art will be able to make and use the present disclosure without departing from the spirit and scope of the present disclosure.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first can also be referred to as a second and, similarly, a second can also be referred to as a first without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "at the time of", "when", or "in response to a determination", depending on the context.
First, the terms referred to in one or more embodiments of the present specification are explained.
Data warehouse: a solution that provides big data computing services, generally used to analyze and process data efficiently.
Bulkload: batch import. In an LSM-tree system, it generally skips the write path, generates the underlying files directly, and loads them into the system by bypass.
Data skew: a large number of key-value pairs are assigned by partitioning to one partition or to a small number of partitions.
MapReduce: a programming model for parallel operation of large scale data sets (greater than 1 TB).
In the present specification, a data processing method is provided, and the present specification relates to a data processing apparatus, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
Fig. 1 shows a process flow diagram of a data processing method provided according to an embodiment of the present specification, including steps 102 to 106.
Step 102: analyze data distribution information of project data to be transmitted in the data source.
Specifically, the projects referred to in the embodiments of the present specification may be any projects that require statistics or processing of project information, such as transaction projects, claim settlement projects, or public service projects; the user needs to make decisions based on the statistics or processing results of the project information (the project data).
The data sources include, but are not limited to, data warehouses, data lakes, or distributed data storage systems.
Take the data source being a data warehouse as an example. In practical applications, the database provides project processing services for a project; taking a transaction project as an example, the database is the project processing platform for that transaction project. To ensure the real-time performance and efficiency of transaction processing, massive transaction data cannot be stored in the database. When a user needs to compute statistics over the historical transaction data of the transaction project and adjust a transaction strategy according to the statistical result, the embodiments of this specification therefore perform the statistics in the data warehouse. After the data warehouse has computed statistics over the historical transaction information and generated the project data, that project data must be transmitted to the database as the data to be transmitted.
The project data may be, for example, the transaction amount or transaction result of each commodity, generated statistically with a day, week, month, quarter, or year as the statistical period.
Before the project data to be transmitted is sent to the database, it must be analyzed to determine its data distribution information. Specifically, a quantile graph corresponding to the project data to be transmitted in the data source is determined, and the data distribution information of the project data is determined from that quantile graph.
The data distribution information includes, but is not limited to, distribution information of data attributes such as the total number of data items and the size of each data item. In practical applications, the project data may be processed with a quantile algorithm (Quantile) to obtain the corresponding quantile graph. The quantile graph reflects the distribution of a single attribute of the project data and may be, without limitation, a percentile (Percentile) graph, a decile (Deciles) graph, or a quartile (Quartiles) graph.
To ensure that the project data is evenly distributed by size across the data fragments after segmentation, the embodiments of this specification may generate the quantile graph from the size of each data item and the total number of data items. The size relationship among the data items can then be read directly from the quantile graph, and segmenting the data according to it helps improve the accuracy of the segmentation result.
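As an illustration only, the following Python sketch shows one way quantile-style cut points could be derived so that each fragment receives roughly the same number of items. It assumes the quantiles are taken over numeric row keys and computed exactly from a sorted list; the specification does not prescribe this, and the names `split_points` and `fragment_of` are hypothetical.

```python
from bisect import bisect_left


def split_points(row_keys, num_fragments):
    """Return num_fragments - 1 cut points (inclusive upper bounds) that
    divide the keys into groups of roughly equal size (hypothetical helper)."""
    ordered = sorted(row_keys)
    n = len(ordered)
    if num_fragments < 1 or n < num_fragments:
        raise ValueError("need at least one row key per fragment")
    # The i-th cut is the largest key belonging to fragment i - 1.
    return [ordered[(i * n) // num_fragments - 1] for i in range(1, num_fragments)]


def fragment_of(row_key, cuts):
    """Index of the fragment whose key range contains row_key."""
    return bisect_left(cuts, row_key)


# 4000 equally sized items split into 4 fragments.
cuts = split_points(range(1, 4001), 4)
print(cuts)  # [1000, 2000, 3000]
```

With these cut points, keys 1 to 1000 map to fragment 0, keys 1001 to 2000 to fragment 1, and so on.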
Step 104: call a MapReduce algorithm, and segment the project data based on the data distribution information to generate a plurality of data fragments.
Specifically, after the data distribution information of the project data is obtained through analysis, the project data can be segmented through the MapReduce algorithm.
In a specific implementation, calling the MapReduce algorithm and segmenting the project data based on the data distribution information to generate a plurality of data fragments can be realized in the following manner:
executing a mapping task, and mapping the project data to different data fragments based on the data distribution information;
executing a Reduce task, sorting the project data respectively contained in the plurality of data fragments, and generating a plurality of target data fragments containing sorting results;
and merging the project data contained in the target data fragments according to the sorting result based on the mergeable summaries theory to generate a data source table.
Specifically, the MapReduce algorithm involves a Map stage and a Reduce stage of a distributed job. According to the task each stage is responsible for, both stages run multiple instances on the slave node devices of the distributed system; the number of instances ranges from thousands to hundreds of thousands, and the instances execute the same computation logic on different data.
Map is responsible for filtering and distributing data: Map instances read files in the distributed file system, perform certain operations, and distribute the data to different Reduce instances. Reduce is responsible for computing and merging data: a Reduce instance receives the output data of all Map instances, performs its operations, and writes files to the distributed file system.
For example, suppose the data in a file is unordered. After the Map stage, the file's data is divided into several parts, which are distributed to different Reduce instances. After receiving the data and processing it, Reduce writes out a file whose data is ordered according to a given rule.
The Reduce task is the task responsible for the Reduce stage of the distributed job. When a Map task executes, it opens its input file and creates multiple output files into which it writes intermediate data; all intermediate files written by Map form a group of distributed files, and the task in which Reduce opens this group of distributed files is the Reduce task.
Therefore, in the embodiments of this specification, the project data is segmented by executing the Map task, the segmentation result is sorted by executing the Reduce task, and finally the project data in the sorted target data fragments is merged using the mergeable summaries algorithm (Mergeable Summaries) to generate a complete data source table containing all the project data.
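The three stages described above can be sketched in a single process as follows. This is a minimal illustration under stated assumptions, not the patented implementation: records are `(row_key, value)` pairs, the cut points are given, and the final merge uses the standard library's `heapq.merge` as a plain stand-in for the mergeable-summaries step. All function names are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict
from heapq import merge


def map_to_fragments(records, cuts):
    """Map step: route each (row_key, value) record to a fragment by key range."""
    fragments = defaultdict(list)
    for key, value in records:
        fragments[bisect_left(cuts, key)].append((key, value))
    return fragments


def reduce_sort(fragments):
    """Reduce step: sort the records inside every fragment by row key."""
    return {fid: sorted(rows) for fid, rows in fragments.items()}


def build_source_table(sorted_fragments):
    """Merge step: combine the sorted fragments into one ordered table.
    heapq.merge is a stand-in here, not the mergeable-summaries algorithm."""
    ordered_ids = sorted(sorted_fragments)
    return list(merge(*(sorted_fragments[fid] for fid in ordered_ids)))


records = [(k, f"v{k}") for k in (7, 3, 9, 1, 5, 8)]
target_fragments = reduce_sort(map_to_fragments(records, cuts=[3, 7]))
source_table = build_source_table(target_fragments)
```

Because the Map step partitions by key range, the sorted fragments are already contiguous, so the merged table is globally ordered by row key.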
Still taking a transaction project as the project and a data warehouse as the data source: if the analyzed data distribution information shows that there are 4000 items of project data in total and that all items are of equal size, the 4000 items are divided into 4 data fragments by executing the Map task, each data fragment containing 1000 items; the project data in the 4 data fragments is then sorted by executing the Reduce task.
Specifically, the project data in the first data fragment may be sorted under the 1000 sequence numbers 1-1000, the project data in the second data fragment under the 1000 sequence numbers 1001-2000, the project data in the third data fragment under the 1000 sequence numbers 2001-3000, and the project data in the fourth data fragment under the 1000 sequence numbers 3001-4000. Finally, the project data in the sorted target data fragments may be merged using the mergeable summaries algorithm (Mergeable Summaries) to generate a data source table containing all the project data.
Step 106: call a database import component to transmit the project data in the data fragments to a preset number of data partitions under a target database table.
Specifically, after the project data to be transmitted in the data source is segmented, the project data contained in the data fragments generated by the segmentation can be transmitted in parallel to the data partitions under the target database table; this can be realized by calling the database import component.
The database according to the embodiments of the present specification may be a non-relational database such as HBase.
In specific implementation, the implementation process of calling the database import component to transmit the project data in the plurality of data fragments to the preset number of data partitions under the target database table is as follows:
determining the number of data fragments;
and constructing a corresponding number of data writing modules in the database import component, and calling the corresponding number of data writing modules to respectively transmit the project data in the target data fragments to a preset number of data partitions under a target database table.
Specifically, after the project data to be transmitted in the data source is segmented, in order to ensure the data transmission efficiency of the project data, in the embodiment of the present specification, the number of data segments is determined, and a corresponding number of data writing modules are constructed in the database import component, so that the project data in each data segment is read in parallel by the constructed corresponding number of data writing modules, thereby implementing parallel transmission of the project data.
Continuing the example above, the total number of project data items is determined to be 4000 from the data distribution information obtained through analysis, and the 4000 items are divided into 4 data fragments: data fragment 1, data fragment 2, data fragment 3, and data fragment 4. Correspondingly, 4 data writing modules are constructed in the database import component: data writing module 1, data writing module 2, data writing module 3, and data writing module 4. Through these 4 data writing modules, the project data in the 4 data fragments can be transmitted in parallel to the data partitions of the database table, with each data writing module responsible for reading and transmitting the project data in one data fragment.
Keeping the number of data writing modules in the import component equal to the number of data fragments allows data to be written in parallel and ensures data transmission efficiency.
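The one-writer-per-fragment arrangement can be sketched with a thread pool whose size equals the number of fragments. This is only an illustration: `write_fragment` is a hypothetical stand-in for a data writing module and stores rows in an in-memory dictionary rather than calling a real database import component.

```python
from concurrent.futures import ThreadPoolExecutor


def write_fragment(fragment_id, rows, sink):
    """Hypothetical data writing module: copies one fragment into `sink`.
    A real module would hand the rows to the database import component."""
    sink[fragment_id] = list(rows)
    return len(rows)


def transmit_in_parallel(fragments, sink):
    """Start one writer per fragment and wait for all of them to finish."""
    if not fragments:
        return {}
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        futures = {
            fid: pool.submit(write_fragment, fid, rows, sink)
            for fid, rows in fragments.items()
        }
        return {fid: future.result() for fid, future in futures.items()}


# 4 fragments of 1000 items each, as in the example above.
fragments = {i: range(i * 1000 + 1, (i + 1) * 1000 + 1) for i in range(4)}
sink = {}
written = transmit_in_parallel(fragments, sink)
```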
In specific implementation, after a plurality of data fragments are generated, the item data included in each of the plurality of data fragments may be further sorted, and a plurality of target data fragments including a sorting result are generated, which may be specifically implemented in the following manner:
determining the total data amount of the project data to be transmitted according to the data distribution information;
determining the quantity of project data contained in each data fragment according to the total data quantity and the quantity of the data fragments;
and sorting the project data contained in the plurality of data fragments according to the quantity of the project data contained in each data fragment, and generating a plurality of target data fragments containing sorting results, wherein the sorting results among the plurality of target data fragments are continuous.
Further, after a plurality of target data fragments containing the sorting result are generated, the database import component can be called to transmit the project data in the plurality of target data fragments to a preset number of data partitions under the target database table, which can be specifically realized by the following method:
integrating the project data contained in the target data fragments according to the sorting result to generate a data source table, and establishing a first mapping relation between the target data fragments and the data source table;
and reading project data contained in any target data fragment in the data source table according to the first mapping relation, and calling a database import component to transmit the project data contained in any target data fragment to at least one data partition under the target database table.
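One possible reading of the first mapping relation is a lookup from each target data fragment to the row range it occupies in the data source table. The sketch below assumes that reading; the function names and the `(start, end)` offset representation are hypothetical and are not taken from the specification.

```python
def build_source_table_with_mapping(sorted_fragments):
    """Concatenate sorted fragments into one table and record, for each
    fragment, the half-open row range [start, end) it occupies."""
    table, mapping = [], {}
    for fid in sorted(sorted_fragments):
        start = len(table)
        table.extend(sorted_fragments[fid])
        mapping[fid] = (start, len(table))
    return table, mapping


def read_fragment(table, mapping, fid):
    """Read back the rows of one fragment using the mapping relation."""
    start, end = mapping[fid]
    return table[start:end]


sorted_fragments = {0: [1, 2, 3], 1: [4, 5], 2: [6, 7, 8, 9]}
table, first_mapping = build_source_table_with_mapping(sorted_fragments)
```

A writer responsible for fragment 1 would then call `read_fragment(table, first_mapping, 1)` and transmit only those rows.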
In the embodiments of this specification, the row primary key (row key) of the target database table may be determined, and the project data contained in each of the plurality of data fragments may be sorted according to the row primary key.
Specifically, after the project data is segmented to generate a plurality of data segments, the data in the data segments can be sorted, and the project data in each sorted data segment is integrated to generate the data source table.
In practical applications, the database referred to in the embodiments of the present disclosure may be a non-relational database (HBase), and the data partition (region) is a basic unit of load balancing in the HBase. Each region has a start line (start Key) and an end line (end Key) by which a region-defined interval can be determined, on which region a piece of data is defined by the start and end lines. For example, the start Key of a region: 100, end Key:200, then row key =108 data falls on the region, row key =1081 data also falls on the region, and row key =108a also falls on the region.
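The region example can be checked with a short sketch. It assumes the usual HBase convention that row keys are byte strings compared lexicographically, with the start key inclusive, the end key exclusive, and an empty key meaning "unbounded"; the helper name `region_contains` is hypothetical.

```python
def region_contains(start_key: bytes, end_key: bytes, row_key: bytes) -> bool:
    """True if row_key falls in [start_key, end_key) under byte-wise
    lexicographic ordering; an empty bound is treated as unbounded."""
    after_start = not start_key or start_key <= row_key
    before_end = not end_key or row_key < end_key
    return after_start and before_end


start, end = b"100", b"200"
for key in (b"108", b"1081", b"108a", b"200", b"3"):
    print(key, region_contains(start, end, key))
```

The first three keys fall in the region because each begins with a byte sequence that sorts between "100" and "200"; the keys "200" and "3" do not.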
Since the start row and the end row of each data partition in the target database table in the HBase are set in advance, in this embodiment of the present description, in order to avoid data skew, that is, uneven data distribution, occurring in the process of transmitting the item data from the data source (data warehouse) to the database, in this embodiment of the present description, after the item data is segmented, the item data contained in each data partition in the target database table in the HBase may be sorted according to the row key (row key) of each data partition.
For example, if the row primary key of each data partition in the target database table in the HBase is 1 to 200, the total data size of the item data may be determined, and the data size included in each data fragment is determined, so that the item data included in each data fragment is sorted according to the number form corresponding to the row primary key, as described above, the item data in the first data fragment may be sorted according to the 1000 serial numbers of 1 to 1000, the item data in the second data fragment may be sorted according to the 1000 serial numbers of 1001 to 2000, the item data in the third data fragment may be sorted according to the 1000 serial numbers of 2001 to 3000, and the item data in the fourth data fragment may be sorted according to the 1000 serial numbers of 3001 to 4000.
After the item data contained in each data fragment is sequenced, the sequencing results can be integrated to integrate the item data in the multiple data fragments according to the sequencing results to generate a data warehouse table, and because the item data needs to be read from the data warehouse table and a database import component needs to be called for data transmission in the data transmission process, the embodiment of the specification needs to establish a mapping relation between the data fragments and the data warehouse table, so that the data writing module in the data import component reads the item data in parallel, and further the parallel transmission of the item data is realized, and the data transmission efficiency is ensured.
In addition, the start key and end key of a region may not correspond to the amount of data contained in one data fragment. For example, region 1 may have start key = 1 and end key = 100, while the sorting result of the project data contained in data fragment 1 is 1 to 150. Therefore, when the database import component is called to transmit the project data contained in any one data fragment to the target database table, the project data in that data fragment may need to be transmitted to one, two or more data partitions under the target database table.
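A minimal sketch of how the data partitions touched by one data fragment could be found from the key ranges. The `Region` type, the region names and the key values beyond the example are hypothetical. Both bounds are treated as inclusive here to match the example; in HBase itself a region's end key is exclusive.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Hypothetical description of one data partition (region)."""
    name: str
    start_key: int  # inclusive
    end_key: int    # inclusive in this sketch (an assumption)


def regions_for_fragment(first_key: int, last_key: int,
                         regions: list[Region]) -> list[str]:
    """Return the names of all regions whose key range overlaps the fragment."""
    return [
        region.name
        for region in regions
        if region.start_key <= last_key and first_key <= region.end_key
    ]


regions = [
    Region("region-1", 1, 100),
    Region("region-2", 101, 200),
    Region("region-3", 201, 300),
]

# Data fragment 1 holds sorted keys 1 to 150, so it spans two regions.
targets = regions_for_fragment(1, 150, regions)
print(targets)
```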
In addition, after the project data in each data fragment is transmitted to the data partition of the database table, the number of the data partitions having a mapping relation with each data fragment in the plurality of data fragments can be determined, and the data partitions having a mapping relation with any data fragment in the plurality of data fragments can be optimized according to the number of the data partitions.
Further, determining the number of data partitions having a mapping relationship with each data slice in the plurality of data slices may specifically be implemented in the following manner:
establishing a second mapping relation between the plurality of data fragments and the data partitions according to the transmission result;
and determining the number of data partitions having a mapping relation with each data fragment in the plurality of data fragments according to the second mapping relation.
In addition, optimizing the data partitions having a mapping relationship with any data fragment of the plurality of data fragments according to the number of the data partitions includes:
determining a first number of data partitions having a mapping relation with the ith data fragment and a second number of data partitions having a mapping relation with the (i + 1) th data fragment, wherein i belongs to [1, n-1], n is the total number of the data fragments, and n is a positive integer;
judging whether the first quantity is equal to the second quantity;
if not, merging or splitting the data partitions which have the mapping relation with the (i + 1) th data fragment according to the first quantity; or,
and merging or splitting the data partitions which have the mapping relation with the ith data slice according to the second quantity.
Specifically, after the project data is transmitted to the data partitions under the target database, the data partitions having a mapping relationship with any one, two or more data fragments of the plurality of data fragments may be optimized according to the number of the data partitions mapped by different data fragments, and may be specifically subjected to merging or splitting processing.
In practical application, the number of the data partitions having a mapping relation with each data fragment can be determined by establishing a second mapping relation between the data fragments and the data partitions.
For example, after it is determined that the project data in the data fragment 1 is transmitted to the database according to the transmission result, and the project data is stored in the data partition 1 and the data partition 2 of the database table, it is determined that the data fragment 1 has a mapping relationship with the data partition 1 and the data partition 2, that is, the number of data partitions having a mapping relationship with the data fragment 1 is 2.
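The second mapping relation and the per-fragment partition count can be sketched as below. The shape of the transmission result (a list of `(fragment_id, partition_id)` pairs) and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def build_second_mapping(transfer_log: list[tuple[int, int]]) -> dict[int, set[int]]:
    """Map each data fragment to the set of data partitions it was written to.

    `transfer_log` is a hypothetical transmission result: one
    (fragment_id, partition_id) pair per write.
    """
    mapping: dict[int, set[int]] = defaultdict(set)
    for fragment_id, partition_id in transfer_log:
        mapping[fragment_id].add(partition_id)
    return dict(mapping)


def partition_counts(mapping: dict[int, set[int]]) -> dict[int, int]:
    """Number of data partitions mapped to each data fragment."""
    return {fragment_id: len(parts) for fragment_id, parts in mapping.items()}


# Fragment 1 was written to partitions 1 and 2; fragment 2 to partitions 2 to 5.
transfer_log = [(1, 1), (1, 2), (1, 1), (2, 2), (2, 3), (2, 4), (2, 5)]
second_mapping = build_second_mapping(transfer_log)
counts = partition_counts(second_mapping)
print(counts)
```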
After the number of the data partitions having a mapping relationship with each data fragment is determined in the above manner, the data partitions having a mapping relationship with any data fragment of the plurality of data fragments can be optimized according to the determined result, that is, the data partitions are merged or split.
For example, if it is determined that the number of data partitions mapped to data slice 1 is 2 and the number of data partitions mapped to data slice 2 is 4, the number of data partitions mapped to data slice 2 may be combined from 4 to 2, or the number of data partitions mapped to data slice 1 may be split from 2 to 4.
Alternatively, if it is determined that the number of data partitions mapped with each of data slice 1, data slice 2, and data slice 4 is 2, but the number of data partitions mapped with data slice 3 is 4, the number of data partitions mapped with data slice 3 may be merged from 4 to 2.
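The pairwise comparison described above can be sketched as a planning step that only decides which fragment's partitions to merge or split; it does not touch any table. The function name, the `adjust` parameter and the tuple layout of the plan are hypothetical.

```python
def plan_adjustments(counts: list[int], adjust: str = "next") -> list[tuple[int, str, int, int]]:
    """Plan merge/split actions by comparing fragment i with fragment i + 1.

    counts[k] is the number of data partitions mapped to fragment k + 1.
    adjust="next" changes fragment i + 1 to match the first quantity;
    adjust="previous" changes fragment i to match the second quantity.
    Returns tuples of (fragment number, action, old count, new count).
    """
    current = list(counts)
    plan = []
    for i in range(len(current) - 1):
        first, second = current[i], current[i + 1]
        if first == second:
            continue
        if adjust == "next":
            index, old, new = i + 1, second, first
        else:
            index, old, new = i, first, second
        action = "merge" if new < old else "split"
        plan.append((index + 1, action, old, new))
        current[index] = new  # later comparisons see the adjusted count
    return plan


print(plan_adjustments([2, 4]))                     # adjust fragment 2
print(plan_adjustments([2, 4], adjust="previous"))  # adjust fragment 1
print(plan_adjustments([2, 2, 4, 2]))               # only fragment 3 differs
```

With `adjust="next"`, one left-to-right pass leaves every fragment with the same count as the first fragment. With `adjust="previous"`, a single pass is not guaranteed to equalize longer lists, so this sketch only illustrates the two-fragment case for that option.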
After a data writing module finishes writing any one data partition, the file corresponding to that data partition can be moved to a target HDFS (Hadoop Distributed File System) directory; after all the project data in the data source table has been written into the data partitions of the target database table, the HBase bulk load (batch loading) of the database can be triggered to load the files in the target HDFS directory.
The embodiment of the specification segments the project data according to the data distribution information of the project data, transmits the project data in any data fragment of the segmentation result to at least one data partition of the target database table, and can adjust and optimize the data partitions under the target database table. This helps to avoid data skew during data transmission and ensures the stability and availability of the data link between the data source and the database.
A schematic diagram of a data processing process provided in an embodiment of the present specification is shown in fig. 2a. First, a quantile graph corresponding to the project data to be transmitted in the data warehouse is determined, and data distribution information of the project data is determined according to the quantile graph. A map task is then executed to map the project data to different data fragments based on the data distribution information. Next, a reduce task is executed to sort the project data contained in each of the data fragments and generate a plurality of target data fragments containing the sorting results, so that the whole data set of the project data is globally ordered.
Finally, the cross-partition intelligent writing modules of the import service read the corresponding data in the data set according to the index offsets of the evenly sized fragments and write it to the target table, with each writing module corresponding to one fragment of the data source. Because one writing module may correspond to several partitions of the target table, the target table partitions can be optimized to avoid data skew, and the cross-partition file writes can be performed concurrently. In addition, after each data writing module finishes one partition, it is responsible for moving the corresponding file to the target HDFS directory, and the files of all the data partitions in the target directory are loaded together after the whole table has been written.
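The mapping and sorting steps above can be sketched with quantile cut points standing in for the data distribution information. This is an interpretation, not the specification's implementation: the function names are hypothetical, and `statistics.quantiles` from the Python standard library is used only as a convenient way to obtain cut points.

```python
import bisect
import statistics


def quantile_boundaries(sample_keys: list[int], fragment_count: int) -> list[float]:
    """Cut points that split the observed keys into roughly equal-sized ranges."""
    return statistics.quantiles(sample_keys, n=fragment_count, method="inclusive")


def map_to_fragments(keys: list[int], boundaries: list[float]) -> list[list[int]]:
    """Mapping step: route each key to the fragment whose range contains it."""
    fragments: list[list[int]] = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        fragments[bisect.bisect_left(boundaries, key)].append(key)
    return fragments


def sort_fragments(fragments: list[list[int]]) -> list[list[int]]:
    """Sorting step: order the keys inside every fragment."""
    return [sorted(fragment) for fragment in fragments]


keys = [17, 3, 42, 8, 25, 1, 36, 12]
boundaries = quantile_boundaries(keys, 4)
fragments = sort_fragments(map_to_fragments(keys, boundaries))

# Fragments cover disjoint, ascending key ranges, so reading them in
# order yields a globally ordered data set.
flattened = [key for fragment in fragments for key in fragment]
print(flattened)
```

Range-partitioning before sorting is what makes the result globally ordered: no merge across fragments is needed to establish the order, only concatenation.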
Another schematic diagram of a data processing process provided in this specification is shown in fig. 2b. After the data distribution information of the source data in the data warehouse is obtained through parsing, the project data may be divided into 4 data fragments by executing a MapReduce task; the project data in the 4 data fragments is sorted, and the project data in the sorted data fragments is then merged to obtain an ordered intermediate table formed by the project data. A database import component is then called: 4 data writing modules in the component read the project data of the 4 data fragments in the intermediate table in parallel, and write the read project data into at least one data partition in the target database table, realizing data writing across partitions. After the writing is finished, the number of the data partitions having a mapping relation with each data fragment in the plurality of data fragments can be determined, and the data partitions having a mapping relation with any data fragment in the plurality of data fragments are optimized according to the number of the data partitions, that is, the data partitions having a mapping relation with each data fragment are merged or split.
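A minimal sketch of the parallel, cross-partition write: one writer per data fragment reads its slice of the ordered intermediate table by offset and groups the rows by target partition. Everything here is illustrative. The in-memory list stands in for the intermediate table, the region tuples use inclusive bounds, and a thread pool stands in for the data writing modules.

```python
from concurrent.futures import ThreadPoolExecutor


def write_fragment(table, offset, length, regions):
    """Read one fragment by offset and group its keys by target region."""
    written = {}
    for key in table[offset:offset + length]:
        for name, start_key, end_key in regions:
            if start_key <= key <= end_key:
                written.setdefault(name, []).append(key)
                break
    return written


def parallel_import(table, fragment_offsets, regions):
    """Run one writer per fragment; results come back in fragment order."""
    with ThreadPoolExecutor(max_workers=len(fragment_offsets)) as pool:
        futures = [
            pool.submit(write_fragment, table, offset, length, regions)
            for offset, length in fragment_offsets
        ]
        return [future.result() for future in futures]


intermediate_table = list(range(1, 13))                   # ordered keys 1..12
fragment_offsets = [(0, 3), (3, 3), (6, 3), (9, 3)]       # 4 fragments of 3 rows
regions = [("region-1", 1, 4), ("region-2", 5, 8), ("region-3", 9, 12)]

results = parallel_import(intermediate_table, fragment_offsets, regions)
for fragment_id, written in enumerate(results, start=1):
    print(fragment_id, written)
```

In this toy layout the second and third fragments each write to two regions, which is the cross-partition case the text describes.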
The method analyzes data distribution information of project data to be transmitted in a data source, calls a MapReduce algorithm to segment the project data based on the data distribution information and generate a plurality of data fragments, and calls a database import component to transmit the project data in the data fragments to a preset number of data partitions under a target database table.
According to the method, the project data is segmented according to its data distribution information, and the project data in any data fragment of the segmentation result is then transmitted to at least one data partition of the target database table. This helps to avoid data skew during data transmission, keeps the time needed for data transmission controllable, and ensures the stability and availability of the data link between the data source and the database.
The following describes the data processing method further by taking an application of the data processing method provided in this specification in an actual scene as an example, with reference to fig. 3. Fig. 3 shows a flowchart of a processing procedure of a data processing method according to an embodiment of the present specification, and specific steps include step 302 to step 320.
Step 302, determining a quantile graph corresponding to project data to be transmitted in a data warehouse, and determining data distribution information of the project data according to the quantile graph.
Step 304, executing a mapping task, and mapping the project data to different data fragments based on the data distribution information.
Step 306, executing a reduce task, sorting the project data respectively contained in the data fragments according to the row primary key of the target database table, and generating a plurality of target data fragments containing the sorting result.
Step 308, merging the project data contained in the target data fragments according to the sorting result, based on the merge principle, to generate a data source table.
Step 310, establishing a first mapping relationship between the target data fragments and the data source table.
Step 312, determining the number of the target data fragments, and constructing a corresponding number of data writing modules in the database import component.
Step 314, invoking a data writing module, and reading the project data contained in any target data fragment in the data source table according to the first mapping relationship.
Step 316, transmitting the project data contained in any target data fragment to at least one data partition in the target database table.
Step 318, determining the number of data partitions having a mapping relation with each target data slice in the plurality of target data slices.
Step 320, optimizing the data partitions which have the mapping relation with any target data fragment in the plurality of target data fragments according to the number of the data partitions.
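Steps 308 to 314 can be sketched as a merge of the sorted target data fragments plus an offset table that plays the role of the first mapping relationship. This is one possible reading: representing the mapping as `(offset, length)` pairs is an assumption, and the offsets are only valid if the fragments cover disjoint, ascending key ranges.

```python
import heapq


def build_source_table(target_fragments):
    """Merge sorted fragments into one table and record where each one sits.

    Assumes fragment k holds only keys smaller than those of fragment k + 1,
    so fragment k occupies one contiguous slice of the merged table.
    """
    table = list(heapq.merge(*target_fragments))
    first_mapping = {}
    offset = 0
    for fragment_id, fragment in enumerate(target_fragments, start=1):
        first_mapping[fragment_id] = (offset, len(fragment))
        offset += len(fragment)
    return table, first_mapping


def read_fragment(table, first_mapping, fragment_id):
    """Read the rows of one target data fragment back out of the merged table."""
    offset, length = first_mapping[fragment_id]
    return table[offset:offset + length]


target_fragments = [[1, 3, 8], [12, 17], [25, 36, 42]]
table, first_mapping = build_source_table(target_fragments)

print(table)
print(first_mapping)
print(read_fragment(table, first_mapping, 2))
```

Keeping only offsets and lengths lets each data writing module locate its rows without scanning the table, which is what makes the parallel read in step 314 cheap.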
The embodiment of the specification first segments the project data according to its data distribution information, and then transmits the project data in any data fragment of the segmentation result to at least one data partition of the target database table. This helps to avoid data skew during data transmission, keeps the time needed for data transmission controllable, and ensures the stability and availability of the data link between the data warehouse and the database.
Corresponding to the above method embodiment, the present specification further provides a data processing apparatus embodiment, and fig. 4 shows a schematic diagram of a data processing apparatus provided in an embodiment of the present specification. As shown in fig. 4, the apparatus includes:
the analysis module 402 is configured to analyze data distribution information of project data to be transmitted in the data source;
a data segmentation module 404 configured to invoke a MapReduce algorithm, segment the project data based on the data distribution information, and generate a plurality of data fragments;
a data transmission module 406 configured to invoke the database import component to transmit the item data in the plurality of data segments to a preset number of data partitions under the target database table.
Optionally, the data processing apparatus further includes:
the sorting module is configured to sort the item data respectively contained in the plurality of data fragments and generate a plurality of target data fragments containing sorting results;
correspondingly, the data transmission module comprises:
and the transmission sub-module is configured to call the database import component to transmit the item data in the target data fragments to a preset number of data partitions under the target database table.
Optionally, the transmission sub-module includes:
the integration unit is configured to integrate the project data contained in the target data fragments according to the sorting result, generate a data source table, and establish a first mapping relation between the target data fragments and the data source table;
and the first transmission unit is configured to read the item data contained in any target data fragment in the data source table according to the first mapping relation, and call a database import component to transmit the item data contained in any target data fragment to at least one data partition under the target database table.
Optionally, the transmission sub-module includes:
a determining unit configured to determine the number of data slices;
and the second transmission unit is configured to construct a corresponding number of data writing modules in the database import component, and call the corresponding number of data writing modules to respectively transmit the item data in the target data fragments to a preset number of data partitions under a target database table.
Optionally, the sorting module includes:
the first determining submodule is configured to determine the total data amount of the project data to be transmitted according to the data distribution information;
the second determining submodule is configured to determine the number of project data contained in each data fragment according to the total data amount and the number of the data fragments;
the first sequencing submodule is configured to sequence the item data included in each of the plurality of data fragments according to the quantity of the item data included in each of the plurality of data fragments, and generate a plurality of target data fragments including a sequencing result, wherein the sequencing results among the plurality of target data fragments are continuous.
Optionally, the parsing module 402 includes:
the distribution information determining submodule is configured to determine a quantile graph corresponding to project data to be transmitted in a data source, and determine data distribution information of the project data according to the quantile graph;
accordingly, the data slicing module 404 includes:
a mapping sub-module configured to perform a mapping task, map the project data to different data shards based on the data distribution information.
Optionally, the data slicing module 404 further includes:
the generation submodule is configured to execute a reduce task, sort the project data respectively contained in the plurality of data fragments, and generate a plurality of target data fragments containing a sorting result;
and the merging submodule is configured to merge the project data contained in the target data fragments according to the sorting result, based on the merge principle, so as to generate a data source table.
Optionally, the data processing apparatus further includes:
a quantity determination module configured to determine a quantity of data partitions having a mapping relationship with each of the plurality of data shards;
and the optimization module is configured to optimize the data partitions which have a mapping relation with any data fragment in the plurality of data fragments according to the number of the data partitions.
Optionally, the number determination module includes:
the establishing submodule is configured to establish a second mapping relation between the plurality of data fragments and the data partition according to the transmission result;
and the quantity determining submodule is configured to determine the quantity of the data partitions having the mapping relation with each data fragment in the plurality of data fragments according to the second mapping relation.
Optionally, the optimization module includes:
a first processing submodule configured to determine a first number of data partitions having a mapping relationship with an ith data slice and a second number of data partitions having a mapping relationship with an (i + 1) th data slice, where i ∈ [1, n-1], n is a total number of data slices, and n is a positive integer;
a determination submodule configured to determine whether the first number and the second number are equal;
if the determination result of the determination submodule is negative, the second processing submodule or the third processing submodule is run;
the second processing submodule is configured to merge or split the data partitions having the mapping relation with the (i + 1) th data fragment according to the first quantity;
and the third processing submodule is configured to perform merging or splitting processing on the data partitions having the mapping relation with the ith data slice according to the second number.
Optionally, the sorting module includes:
and the second sorting submodule is configured to determine a row primary key of the target database table, and sort the item data respectively contained in the plurality of data fragments according to the row primary key.
The above is a schematic configuration of a data processing apparatus of the present embodiment. It should be noted that the technical solution of the data processing apparatus and the technical solution of the data processing method belong to the same concept, and details that are not described in detail in the technical solution of the data processing apparatus can be referred to the description of the technical solution of the data processing method.
FIG. 5 illustrates a block diagram of a computing device 500, provided in accordance with one embodiment of the present specification. The components of the computing device 500 include, but are not limited to, a memory 510 and a processor 520. Processor 520 is coupled to memory 510 via bus 530, and database 550 is used to store data.
Computing device 500 also includes access device 540, access device 540 enabling computing device 500 to communicate via one or more networks 560. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 540 may include one or more of any type of network interface, e.g., a Network Interface Card (NIC), wired or wireless, such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 500, as well as other components not shown in FIG. 5, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 5 is for purposes of example only and is not limiting as to the scope of the present description. Other components may be added or replaced as desired by those skilled in the art.
Computing device 500 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet computer, personal digital assistant, laptop computer, notebook computer, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 500 may also be a mobile or stationary server.
Wherein the memory 510 is configured to store computer-executable instructions for execution by the processor 520 for implementing the steps of the data processing method.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the data processing method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the data processing method.
An embodiment of the present specification also provides a computer readable storage medium storing computer instructions which, when executed by a processor, are used for implementing the steps of the data processing method.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the data processing method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the data processing method.
The foregoing description of specific embodiments has been presented for purposes of illustration and description. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts, but those skilled in the art should understand that the present embodiment is not limited by the described acts, because some steps may be performed in other sequences or simultaneously according to the present embodiment. Further, those skilled in the art should also appreciate that the embodiments described in this specification are preferred embodiments and that acts and modules referred to are not necessarily required for an embodiment of the specification.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to aid in the description of the specification. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, to thereby enable others skilled in the art to best understand and utilize the embodiments. The specification is limited only by the claims and their full scope and equivalents.

Claims (12)

1. A method of data processing, comprising:
analyzing data distribution information of project data to be transmitted in a data source;
calling a MapReduce algorithm, and segmenting the project data based on the data distribution information to generate a plurality of data fragments;
calling a database import component to transmit the project data in the plurality of data fragments to a preset number of data partitions under a target database table;
determining the number of data partitions having a mapping relation with each data fragment in the plurality of data fragments;
optimizing the data partitions which have a mapping relation with any data fragment in the plurality of data fragments according to the number of the data partitions, so that the number of the data partitions which have the mapping relation with each data fragment is the same;
wherein the determining the number of data partitions having a mapping relationship with each data slice of the plurality of data slices comprises:
establishing a second mapping relation between the plurality of data fragments and the data partitions according to the transmission result;
and determining the number of data partitions having a mapping relation with each data fragment in the plurality of data fragments according to the second mapping relation.
2. The data processing method of claim 1, further comprising, after generating the plurality of data slices:
sorting the project data respectively contained by the data fragments to generate a plurality of target data fragments containing sorting results;
correspondingly, the invoking the database import component to transmit the project data in the plurality of data fragments to a preset number of data partitions under a target database table includes:
and calling a database import component to transmit the project data in the target data fragments to a preset number of data partitions under a target database table.
3. The data processing method of claim 2, wherein the invoking the database import component to transfer the item data in the plurality of target data segments to a preset number of data partitions under a target database table comprises:
integrating the project data contained in the target data fragments according to the sorting result to generate a data source table, and establishing a first mapping relation between the target data fragments and the data source table;
and reading project data contained in any target data fragment in the data source table according to the first mapping relation, and calling a database import component to transmit the project data contained in any target data fragment to at least one data partition in the target database table.
4. The data processing method of claim 2, wherein the invoking the database import component to transfer the item data in the plurality of target data segments to a preset number of data partitions under a target database table comprises:
determining the number of data fragments;
and constructing a corresponding number of data writing modules in the database import component, and calling the corresponding number of data writing modules to respectively transmit the project data in the target data fragments to a preset number of data partitions under a target database table.
5. The data processing method according to any one of claims 2 to 4, wherein the sorting the item data respectively included in the plurality of data shards to generate a plurality of target data shards including a sorting result includes:
determining the total data amount of the project data to be transmitted according to the data distribution information;
determining the quantity of project data contained in each data fragment according to the total data quantity and the quantity of the data fragments;
and sequencing the project data contained in the plurality of data fragments according to the quantity of the project data contained in each data fragment, and generating a plurality of target data fragments containing sequencing results, wherein the sequencing results among the plurality of target data fragments are continuous.
6. The data processing method according to claim 1, wherein the analyzing data distribution information of project data to be transmitted in the data source comprises:
determining a quantile graph corresponding to project data to be transmitted in a data source, and determining data distribution information of the project data according to the quantile graph;
correspondingly, the invoking a MapReduce algorithm, and segmenting the project data based on the data distribution information to generate a plurality of data fragments includes:
and executing a mapping task, and mapping the project data to different data fragments based on the data distribution information.
7. The data processing method of claim 6, after generating the plurality of data slices, further comprising:
executing a reduce task, sorting the project data respectively contained by the plurality of data fragments, and generating a plurality of target data fragments containing sorting results;
and merging the project data contained in the target data fragments according to the sorting result, based on the merge principle, to generate a data source table.
8. The data processing method according to claim 1, wherein optimizing the data partition having a mapping relationship with any data slice in the plurality of data slices according to the number of the data partitions includes:
determining a first number of data partitions having a mapping relation with the ith data fragment and a second number of data partitions having a mapping relation with the (i + 1) th data fragment, wherein i belongs to [1, n-1], n is the total number of the data fragments, and n is a positive integer;
judging whether the first quantity is equal to the second quantity;
if not, merging or splitting the data partitions with the mapping relation with the (i + 1) th data fragment according to the first quantity; or,
and merging or splitting the data partitions with the mapping relation with the ith data fragment according to the second number.
9. The data processing method according to claim 2, wherein the sorting the item data respectively contained in the plurality of data slices comprises:
and determining a row primary key of a target database table, and sequencing the item data respectively contained in the plurality of data fragments according to the row primary key.
10. A data processing apparatus comprising:
the analysis module is configured to analyze data distribution information of project data to be transmitted in the data source;
the data segmentation module is configured to call a MapReduce algorithm, segment the project data based on the data distribution information and generate a plurality of data fragments;
the data transmission module is configured to call a database import component to transmit the project data in the plurality of data fragments to a preset number of data partitions under a target database table;
a quantity determination module configured to determine the number of data partitions having a mapping relationship with each of the plurality of data fragments;
the optimization module is configured to optimize the data partitions having a mapping relationship with any data fragment in the plurality of data fragments according to the number of data partitions, so that the number of data partitions having a mapping relationship with each data fragment is the same;
wherein the number determination module comprises:
the establishing submodule is configured to establish a second mapping relation between the plurality of data fragments and the data partition according to the transmission result;
and the quantity determining submodule is configured to determine the quantity of the data partitions having the mapping relation with each data fragment in the plurality of data fragments according to the second mapping relation.
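The two submodules of the quantity determination module can be illustrated with a short sketch. It assumes the transmission result is available as a sequence of `(fragment_id, partition_id)` records, one per written batch; that record format and the names `build_mapping` and `partition_counts` are hypothetical and are not specified by the claims.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_mapping(transfer_log: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Establishing submodule: derive the fragment-to-partition mapping."""
    mapping: Dict[str, Set[str]] = defaultdict(set)
    for fragment_id, partition_id in transfer_log:
        mapping[fragment_id].add(partition_id)
    return dict(mapping)


def partition_counts(mapping: Dict[str, Set[str]]) -> Dict[str, int]:
    """Quantity determining submodule: count partitions per fragment."""
    return {fragment: len(parts) for fragment, parts in mapping.items()}


# Hypothetical transmission result: fragment f1 wrote to two partitions,
# fragment f2 wrote to three.
transfer_log = [
    ("f1", "p1"), ("f1", "p2"), ("f1", "p1"),
    ("f2", "p3"), ("f2", "p4"), ("f2", "p5"),
]
second_mapping = build_mapping(transfer_log)
counts = partition_counts(second_mapping)
```

Using a set per fragment means repeated writes to the same partition are counted once, which is what a per-fragment partition count needs.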
11. A computing device, comprising:
a memory and a processor;
the memory is for storing computer-executable instructions, and the processor is for executing the computer-executable instructions to implement the steps of the data processing method of any one of claims 1 to 9.
12. A computer readable storage medium storing computer instructions which, when executed by a processor, carry out the steps of the data processing method of any one of claims 1 to 9.
CN202110136105.XA 2021-02-01 2021-02-01 Data processing method and device Active CN113297188B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110136105.XA CN113297188B (en) 2021-02-01 2021-02-01 Data processing method and device

Publications (2)

Publication Number Publication Date
CN113297188A CN113297188A (en) 2021-08-24
CN113297188B CN113297188B (en) 2022-11-15

Family

ID=77318873

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110136105.XA Active CN113297188B (en) 2021-02-01 2021-02-01 Data processing method and device

Country Status (1)

Country Link
CN (1) CN113297188B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114490673B (en) * 2022-04-08 2022-07-12 腾讯科技(深圳)有限公司 Data information processing method and device, electronic equipment and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
CN102467570A (en) * 2010-11-17 2012-05-23 日电(中国)有限公司 Connection query system and method for distributed data warehouse
CN106970929A (en) * 2016-09-08 2017-07-21 阿里巴巴集团控股有限公司 Data import method and device
US10657154B1 (en) * 2017-08-01 2020-05-19 Amazon Technologies, Inc. Providing access to data within a migrating data partition
CN111241185A (en) * 2020-04-26 2020-06-05 浙江网商银行股份有限公司 Data processing method and device
CN112037874A (en) * 2020-09-03 2020-12-04 合肥工业大学 Distributed data processing method based on mapping reduction

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
CN102541858B (en) * 2010-12-07 2016-06-15 腾讯科技(深圳)有限公司 Data balancing processing method, apparatus and system based on mapping and reduction
US9355146B2 (en) * 2012-06-29 2016-05-31 International Business Machines Corporation Efficient partitioned joins in a database with column-major layout

Also Published As

Publication number Publication date
CN113297188A (en) 2021-08-24

Similar Documents

Publication Publication Date Title
CN111241185B (en) Data processing method and device
US9460188B2 (en) Data warehouse compatibility
US8364751B2 (en) Automated client/server operation partitioning
CN107515878B (en) Data index management method and device
CN111046237B (en) User behavior data processing method and device, electronic equipment and readable medium
JP6928677B2 (en) Data processing methods and equipment for performing online analysis processing
CN111538605B (en) Distributed data access layer middleware and command execution method and device
CN111475584B (en) Data processing method, system and device
US9424260B2 (en) Techniques for data assignment from an external distributed file system to a database management system
CN111949832A (en) Method and device for analyzing dependency relationship of batch operation
CN113297188B (en) Data processing method and device
US20150120697A1 (en) System and method for analysis of a database proxy
US8655920B2 (en) Report updating based on a restructured report slice
CN109218385A (en) The method and apparatus for handling data
US11544260B2 (en) Transaction processing method and system, and server
CN116089367A (en) Dynamic barrel dividing method, device, electronic equipment and medium
Alam et al. Generating massive scale-free networks: Novel parallel algorithms using the preferential attachment model
CN111723089A (en) Method and device for processing data based on columnar storage format
CN109165257A (en) Data query method and related system, equipment and storage medium
CN110728118B (en) Cross-data-platform data processing method, device, equipment and storage medium
US10049159B2 (en) Techniques for data retrieval in a distributed computing environment
CN112711588A (en) Multi-table connection method and device
CN117194080B (en) Message processing method and device
CN117635334A (en) Factor calculation method and device and calculation equipment
Li et al. A resilient index graph for querying large biological scientific data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40057454
Country of ref document: HK
GR01 Patent grant