CN113778657B

CN113778657B - Data processing method and device

Info

Publication number: CN113778657B
Application number: CN202011017711.1A
Authority: CN
Inventors: 郑思城; 吴怡燃
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2020-09-24
Filing date: 2020-09-24
Publication date: 2024-04-16
Anticipated expiration: 2040-09-24
Also published as: CN113778657A

Abstract

The invention discloses a data processing method and device, and relates to the technical field of computers. One embodiment of the method comprises the following steps: acquiring a configuration file when it is detected that a trigger condition of a merging task is met; determining a target data table in the cluster according to the configuration file, and selecting data partitions to be processed from the target data table according to a threshold condition in the configuration file; calling a merging interface to obtain a merging method, and merging the data partitions to be processed based on a partition target capacity threshold in the configuration file to generate merged data partitions; and calling a load balancing interface to obtain an allocation strategy according to the merged data partitions, so as to allocate the merged data partitions to the devices. The embodiment provides a general method flow for screening out the data partitions that need to be merged and calling the interfaces to complete the merging and the allocation of the merged partitions, thereby improving the merging efficiency of data partitions and avoiding the low system operation efficiency caused by excessive pressure on the system table.

Description

Data processing method and device

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a data processing method and apparatus.

Background

With the continuous expansion of cluster scale and user numbers, service usage patterns are becoming more complex, the HBase framework is being applied more widely, and the number of data partitions in data tables keeps increasing. Whether through improper use or through normal service operation, users often generate a large number of empty partitions or partitions with too small a capacity, which puts excessive pressure on the system table and lowers system operation efficiency. In the prior art, developers typically have to find these empty or undersized partitions themselves and issue merge instructions to merge them.

In the process of implementing the present invention, the inventor finds that at least the following problems exist in the prior art:

for larger clusters with a large number of empty or undersized partitions, manually searching for the partitions and then merging them is too inefficient, prone to omissions, and lacks generality.

Disclosure of Invention

In view of this, the embodiments of the present invention provide a data processing method and apparatus, which provide a general method flow to determine and screen out the data partitions that need to be merged, and to call the corresponding interfaces to complete the merging and the allocation of the merged data partitions, so as to improve the merging efficiency of data partitions and avoid the low system operation efficiency caused by excessive pressure on the system table.

To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a data processing method including:

when the triggering condition meeting the merging task is monitored, acquiring a configuration file;

determining a target data table in a cluster according to the configuration file, and selecting a data partition to be processed from the target data table according to a threshold condition in the configuration file;

calling a preset merging interface to obtain a merging method, and merging the data partitions to be processed by using the merging method based on a partition target capacity threshold in the configuration file to generate merged data partitions;

and calling a load balancing interface, obtaining an allocation strategy according to the merged data partition, and further allocating the merged data partition to each device in the cluster.

Optionally, acquiring the configuration file when it is detected that the trigger condition of the merging task is met includes:

acquiring the configuration file when the current moment reaches the execution moment of the merging task determined by a preset frequency, or when a running index of a thread task is detected to meet a running abnormality threshold, or when an execution instruction for the merging task is received.

Optionally, the determining, according to the configuration file, a target data table in the cluster includes:

invoking a data table constraint condition from the configuration file;

when the constraint condition of the data table is null or the constraint condition of the data table is a full data table, taking all online data tables in a database as the target data table;

and when the data table constraint condition is data table identification range information, taking an online data table indicated by the data table identification range as the target data table.

Optionally, the selecting the data partition to be processed from the target data table according to the threshold condition in the configuration file includes:

retrieving the threshold condition from the configuration file, and determining the partition capacity threshold in the threshold condition;

selecting data partitions whose capacity is smaller than the partition capacity threshold from the target data table as candidate partitions;

judging whether the number of empty partitions among the candidate partitions exceeds one, and if so, taking all the empty data partitions among the candidate partitions as data partitions to be processed; and taking at least two non-empty candidate data partitions that belong to the same data table and have adjacent partition identifiers as data partitions to be processed.

Optionally, the threshold condition further includes: a partition creation time threshold; the selecting the data partition with partition capacity smaller than the partition capacity threshold from the target data table as the candidate partition comprises:

selecting partitions whose creation time is earlier than the partition creation time threshold from the target data table as first candidate partitions; and selecting data partitions whose capacity is smaller than the partition capacity threshold from the first candidate partitions as the candidate partitions.

Optionally, the threshold condition further includes: partition identification range start value and end value; the selecting the data partition with partition capacity smaller than the partition capacity threshold from the target data table as the candidate partition comprises:

selecting partitions whose partition identifier lies between the start value and the end value of the partition identification range from the target data table as second candidate partitions; and selecting data partitions whose capacity is smaller than the partition capacity threshold from the second candidate partitions as the candidate partitions.

Optionally, the merging the to-be-processed data partition by using the merging method based on the partition target capacity threshold in the configuration file to generate a merged data partition, including:

generating a merging strategy for the data partition to be processed based on the partition target capacity threshold; the merging strategy indicates partition identification of at least two data partitions to be processed which participate in the same merging;

taking the data partitions to be processed offline;

merging the underlying data of the data partitions to be processed according to the merging strategy to generate the underlying data and the partition identification of the merged data partition;

adding the partition identification of the merged data partition to a system table;

and bringing the merged data partition online, and updating a system log.

According to still another aspect of an embodiment of the present invention, there is provided a data processing apparatus including:

the monitoring module is used for acquiring a configuration file when the triggering condition meeting the merging task is monitored;

the screening module is used for determining a target data table in the cluster according to the configuration file, and further selecting a data partition to be processed from the target data table according to a threshold condition in the configuration file;

the merging module is used for calling a preset merging interface to obtain a merging method, and merging the data partitions to be processed by using the merging method based on the partition target capacity threshold in the configuration file so as to generate merged data partitions;

and the distribution module is used for calling a load balancing interface, obtaining a distribution strategy according to the combined data partition, and further distributing the combined data partition to each device in the cluster.

Optionally, the monitoring module monitors a triggering condition that meets a merging task, and obtains a configuration file, including:

Optionally, the screening module determines, according to the configuration file, a target data table in the cluster, including:

invoking a data table constraint condition from the configuration file;

Optionally, the screening module selects the data partition to be processed from the target data table according to a threshold condition in the configuration file, including:

Optionally, the threshold condition further includes: a partition creation time threshold; the screening module selects a data partition with partition capacity smaller than the partition capacity threshold from the target data table as a partition to be selected, and the method comprises the following steps:

Optionally, the threshold condition further includes: partition identification range start value and end value; the screening module selects a data partition with partition capacity smaller than the partition capacity threshold from the target data table as a partition to be selected, and the method comprises the following steps:

Optionally, the merging module merges the to-be-processed data partitions by using the merging method based on the partition target capacity threshold in the configuration file, so as to generate merged data partitions, including:

generating a merging strategy for the data partition to be processed based on the partition target capacity threshold; the merging strategy indicates partition identification of at least two data partitions to be processed;

performing offline processing on the data partition to be processed;

According to another aspect of an embodiment of the present invention, there is provided a data processing electronic apparatus including:

one or more processors;

storage means for storing one or more programs,

when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the data processing method provided by the present invention.

According to still another aspect of the embodiments of the present invention, there is provided a computer readable medium having stored thereon a computer program which when executed by a processor implements the data processing method provided by the present invention.

One embodiment of the above invention has the following advantages or benefits: a general method is adopted to determine and screen out the data partitions that need to be merged and to call the corresponding interfaces to complete the merging and the allocation of the merged data partitions. This overcomes the technical problems of the prior art, in which developers must actively search for the data partitions to be merged and actively issue merge instructions, which is inefficient and prone to omissions; it thereby improves the merging efficiency of data partitions and avoids the low system operation efficiency caused by excessive pressure on the system table.

Further effects of the above non-conventional optional implementations are described below in connection with the specific embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic diagram of the main flow of a data processing method according to a first embodiment of the present invention;

FIG. 2 is a schematic diagram of a main flow of merging partitions of data to be processed in a data processing method according to a second embodiment of the present invention;

FIG. 3 is a schematic diagram of the main modules of a data processing apparatus according to an embodiment of the present invention;

FIG. 4 is an exemplary system architecture diagram in which embodiments of the present invention may be applied;

FIG. 5 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Fig. 1 is a schematic diagram of the main flow of a data processing method according to a first embodiment of the present invention, as shown in fig. 1, including:

step S101, acquiring a configuration file when the trigger condition meeting the merging task is monitored;

step S102, determining a target data table in a cluster according to the configuration file, and selecting a data partition to be processed from the target data table according to a threshold condition in the configuration file;

step S103, calling a preset merging interface to obtain a merging method, and merging the data partitions to be processed by using the merging method based on a partition target capacity threshold in the configuration file to generate merged data partitions;

step S104, calling a load balancing interface, obtaining an allocation strategy according to the combined data partition, and further allocating the combined data partition to each device in the cluster.

The data processing method provided by the invention offers a general method flow: it quickly screens out the data partitions that meet the conditions and need to be merged according to the configuration file, calls the corresponding merging interface to obtain the merging method and complete the merging, and calls the load balancing interface to complete the allocation of the merged data partitions. This realizes a complete data partition merging flow, improves the merging efficiency of data partitions, and avoids the low system operation efficiency caused by excessive pressure on the system table.
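As a rough illustration of how steps S102 to S104 fit together, the following minimal sketch treats the merging interface and the load balancing interface as plain callables. All names here (`MergeConfig`, `run_merge_task`, `merge_all`, `to_first_device`) are hypothetical and do not come from the patent or from HBase, and the screening is simplified to a size check that ignores partition adjacency.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Partition = Tuple[str, float]  # (partition id, size in MB); an assumed model


@dataclass
class MergeConfig:
    tables: Optional[List[str]]   # None means every online table
    capacity_threshold_mb: float  # partitions below this size are candidates
    target_capacity_mb: float     # desired size of a merged partition


def run_merge_task(
    config: MergeConfig,
    online_tables: Dict[str, List[Partition]],
    merge: Callable[[Sequence[Partition], float], List[Partition]],
    balance: Callable[[List[Partition]], Dict[str, str]],
) -> Dict[str, str]:
    """Pick target tables, screen partitions, merge them, then allocate."""
    targets = config.tables if config.tables is not None else list(online_tables)
    assignment: Dict[str, str] = {}
    for table in targets:
        small = [p for p in online_tables.get(table, [])
                 if p[1] < config.capacity_threshold_mb]
        if len(small) < 2:
            continue  # nothing to merge in this table
        merged = merge(small, config.target_capacity_mb)
        assignment.update(balance(merged))
    return assignment


def merge_all(parts: Sequence[Partition], target_mb: float) -> List[Partition]:
    # Stub merge method: collapse every candidate into one partition.
    return [("+".join(pid for pid, _ in parts), sum(size for _, size in parts))]


def to_first_device(parts: List[Partition]) -> Dict[str, str]:
    # Stub allocation strategy: send everything to one device.
    return {pid: "device-1" for pid, _ in parts}


if __name__ == "__main__":
    cfg = MergeConfig(tables=None, capacity_threshold_mb=50, target_capacity_mb=2048)
    tables = {"t": [("p1", 10), ("p2", 20), ("p3", 2048)], "u": [("q1", 5)]}
    print(run_merge_task(cfg, tables, merge_all, to_first_device))
```

Passing the merge and allocation behavior in as callables mirrors the text's idea of calling preset interfaces, so either can be swapped without touching the screening logic.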

In some embodiments, the monitoring to meet the triggering condition of the merging task, obtaining the configuration file includes:

The execution time may be preset, for example: the merging task may be set to execute once at 0:00 every day, or once per period at a preset frequency (such as every 6 hours, every 12 hours, or every week);

the abnormality threshold may be a threshold defined on the task run time or on the resource data the task invokes, for example: the run time exceeds a time threshold, or the number of partitions in a data table of the invoked resource exceeds a number threshold;

the trigger may also be a staff member actively issuing an execution instruction for the merging task. Unlike the prior art, however, the staff member does not need to search for the data partitions that need to be merged, call the merging method, or decide how the merged partitions are allocated; with the method provided by the invention, a staff member can complete the whole merging process according to the configuration file by issuing a single execution instruction.
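The three kinds of trigger described above (a scheduled execution time, an abnormality threshold, and a manual instruction) could be checked along the following lines. This is only a sketch: the `TriggerState` fields and the `trigger_reason` function are hypothetical names, and the order in which the triggers are tested is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TriggerState:
    last_run: datetime              # when the merge task last executed
    interval: timedelta             # preset frequency, e.g. every 6 hours
    task_runtime_s: float           # observed run time of a monitored thread task
    runtime_threshold_s: float      # run-time abnormality threshold
    partition_count: int            # partitions in the table the task uses
    partition_count_threshold: int  # partition-count abnormality threshold
    manual_request: bool = False    # a staff member issued an execution instruction


def trigger_reason(state: TriggerState, now: datetime) -> Optional[str]:
    """Return which trigger fired, or None when the merge task should not start."""
    if state.manual_request:
        return "manual"
    if now - state.last_run >= state.interval:
        return "scheduled"
    if (state.task_runtime_s > state.runtime_threshold_s
            or state.partition_count > state.partition_count_threshold):
        return "abnormal"
    return None


if __name__ == "__main__":
    start = datetime(2020, 9, 24, 0, 0)
    state = TriggerState(start, timedelta(hours=6), 1.0, 5.0, 10, 100)
    print(trigger_reason(state, start + timedelta(hours=6)))
```

When `trigger_reason` returns a value, the caller would go on to read the configuration file; otherwise it keeps monitoring.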

In some embodiments, the determining a target data table in the cluster according to the configuration file includes:

invoking a data table constraint condition from the configuration file;

The range of data tables in which data partitions are sought may be specified in the configuration file: it may be the full set of data tables, or it may be specified by data table identifiers; in some practical applications, the range may be left unspecified, in which case all data tables are taken as the target data tables by default.
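The table-selection rule can be sketched as below. The `ALL_TABLES` marker and the choice to model the identification range as a collection of table names are assumptions made for illustration; the patent does not fix a concrete encoding.

```python
from typing import Collection, Iterable, List, Optional, Union

ALL_TABLES = "ALL"  # hypothetical marker meaning "full data table"


def select_target_tables(
    online_tables: Iterable[str],
    constraint: Optional[Union[str, Collection[str]]] = None,
) -> List[str]:
    """Resolve the data table constraint into a list of online target tables."""
    online = list(online_tables)
    # A missing constraint or the full-table marker selects every online table.
    if constraint is None or constraint == ALL_TABLES:
        return online
    # Otherwise keep only the online tables named by the constraint.
    wanted = {constraint} if isinstance(constraint, str) else set(constraint)
    return [table for table in online if table in wanted]


if __name__ == "__main__":
    tables = ["orders", "users", "logs"]
    print(select_target_tables(tables))
    print(select_target_tables(tables, ["logs", "orders", "archived"]))
```

Tables named in the constraint but not currently online (such as `archived` in the usage lines) are silently skipped, which matches the text's restriction to online data tables.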

In some embodiments, the selecting the data partition to be processed from the target data table according to the threshold condition in the configuration file includes:

The partition capacity threshold may be used to screen out data partitions below that capacity; for example, in some practical applications, most normal partitions may have a capacity of 2G, and if there are many data partitions with capacities of 20M or 10M, the partition capacity threshold may be set to a value such as 50M or 40M to screen out the data partitions far below the normal partition capacity;

it should be noted that, because the underlying data of the data partitions is stored in lexicographic order, non-empty data partitions can be merged in the subsequent step only when at least two of them belong to the same data table and have adjacent partition identifiers, which keeps the merged data contiguous and easy to look up.
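The screening rule (all empty partitions when there is more than one, plus runs of at least two adjacent non-empty partitions in the same table) might look like the following sketch. The `Partition` model with an integer `index` standing in for the partition identifier order is an assumption, as is the choice not to let an empty partition bridge two non-empty ones.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Partition:
    table: str
    index: int       # assumed position of the partition in its table's key order
    size_mb: float


def select_partitions(
    partitions: Iterable[Partition], capacity_threshold_mb: float
) -> Tuple[List[Partition], List[List[Partition]]]:
    """Return (empty partitions to process, runs of adjacent small partitions)."""
    candidates = sorted(
        (p for p in partitions if p.size_mb < capacity_threshold_mb),
        key=lambda p: (p.table, p.index),
    )
    empty = [p for p in candidates if p.size_mb == 0]
    # Empty partitions are taken only when more than one exists.
    empty_selected = empty if len(empty) > 1 else []

    runs: List[List[Partition]] = []
    current: List[Partition] = []
    for p in (c for c in candidates if c.size_mb > 0):
        adjacent = (current and p.table == current[-1].table
                    and p.index == current[-1].index + 1)
        if adjacent:
            current.append(p)
        else:
            if len(current) >= 2:
                runs.append(current)
            current = [p]
    if len(current) >= 2:
        runs.append(current)
    return empty_selected, runs


if __name__ == "__main__":
    parts = [Partition("t", 0, 10), Partition("t", 1, 20), Partition("t", 2, 2048),
             Partition("t", 3, 0), Partition("t", 4, 30), Partition("u", 0, 0)]
    print(select_partitions(parts, 50))
```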

In some embodiments, the threshold condition further comprises: a partition creation time threshold; the selecting the data partition with partition capacity smaller than the partition capacity threshold from the target data table as the candidate partition comprises:

In some embodiments, the threshold condition further comprises: partition identification range start value and end value; the selecting the data partition with partition capacity smaller than the partition capacity threshold from the target data table as the candidate partition comprises:

Either or both of these two threshold conditions can be combined with the partition capacity threshold, according to service requirements in practical applications, to form a more precise screening condition; the partition creation time threshold allows expired data partitions to be processed, and the start and end values of the partition identification range restrict screening to data partitions within a specified range, which makes the data table constraint more precise.
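Combining the optional conditions with the capacity threshold can be sketched as a single filter in which each optional condition is skipped when it is not configured. `PartitionInfo`, `filter_candidates`, and the use of string comparison for the identification range (with both bounds inclusive) are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class PartitionInfo:
    partition_id: str
    size_mb: float
    created_at: datetime


def filter_candidates(
    partitions: Iterable[PartitionInfo],
    capacity_threshold_mb: float,
    created_before: Optional[datetime] = None,
    id_start: Optional[str] = None,
    id_end: Optional[str] = None,
) -> List[PartitionInfo]:
    """Apply the optional pre-filters, then the partition capacity threshold."""
    selected = []
    for p in partitions:
        # Optional: keep only partitions created before the time threshold.
        if created_before is not None and not p.created_at < created_before:
            continue
        # Optional: keep only identifiers inside [id_start, id_end].
        if id_start is not None and p.partition_id < id_start:
            continue
        if id_end is not None and p.partition_id > id_end:
            continue
        if p.size_mb < capacity_threshold_mb:
            selected.append(p)
    return selected


if __name__ == "__main__":
    ps = [PartitionInfo("a", 10, datetime(2020, 1, 1)),
          PartitionInfo("b", 10, datetime(2020, 9, 1))]
    print(filter_candidates(ps, 50, created_before=datetime(2020, 6, 1)))
```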

In some embodiments, the merging the to-be-processed data partitions by using the merging method based on the partition target capacity threshold in the configuration file to generate a merged data partition includes:

performing offline processing on the data partition to be processed;

The partition target capacity threshold may be a range; for example, in an application scenario where most normal data partitions have a capacity of about 20G, the partition target capacity threshold may be set to the range of 18G to 22G, so that the capacity of a new data partition after merging differs little from the capacity of a normal data partition.

The merge policy may indicate which two or more data partitions are to be merged into a new data partition such that the new data partition meets the partition target capacity threshold;

in some practical applications, after the merge policy is generated, it is also necessary to check whether the merge can be performed. For example, if a data partition involved in the operation still holds a reference to a parent partition, or if a partition does not exist or the partitions overlap, the update log state of the operation can be set to canceled and the merging operation canceled.
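One possible way to generate such a merge policy is a greedy pass over an ordered run of adjacent partitions, closing a group once its total size enters the target range. The patent does not say how the policy is computed, so this greedy rule, the function name `plan_merges`, and the decision to keep a final undersized group are all assumptions.

```python
from typing import List, Sequence, Tuple


def plan_merges(
    run: Sequence[Tuple[str, float]],
    target_low: float,
    target_high: float,
) -> List[List[str]]:
    """Group an ordered run of adjacent (id, size) partitions into merge groups."""
    groups: List[List[str]] = []
    current: List[str] = []
    total = 0.0
    for pid, size in run:
        if current and total + size > target_high:
            # Adding this partition would overshoot the target range.
            if len(current) >= 2:
                groups.append(current)
            current, total = [], 0.0
        current.append(pid)
        total += size
        if total >= target_low and len(current) >= 2:
            groups.append(current)
            current, total = [], 0.0
    # A leftover group of two or more is still merged, even if below the range.
    if len(current) >= 2:
        groups.append(current)
    return groups


if __name__ == "__main__":
    run = [("r1", 8), ("r2", 7), ("r3", 6), ("r4", 9), ("r5", 3)]
    print(plan_merges(run, target_low=18, target_high=22))
```

Each inner list names the partitions that take part in the same merge, which corresponds to the partition identifications the merge policy is said to indicate.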

Fig. 2 is a schematic diagram of a main flow of merging data partitions to be processed in a data processing method according to a second embodiment of the present invention. As shown in fig. 2, the flow of merging region A (data partition A) and region B (data partition B) includes:

merging the underlying data of region A and region B, and storing the merged underlying data on HDFS;

merging the metadata information of region A and region B to generate the metadata of a new region C, and storing the metadata information of the newly generated region C in the meta table (system table);

checking whether region C is online; if not, assigning it and bringing it online, and then marking the update log state as completed on ZK (ZooKeeper, a distributed coordination service);

if it is online, marking the update log state as completed on ZK directly.
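The flow of merging region A and region B can be traced with a small in-memory simulation. This is not HBase code: the `Cluster` class merely stands in for the data on HDFS, the meta table, and the log state kept on ZK, and every name in it is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Region:
    name: str
    start_key: str
    end_key: str
    rows: Dict[str, str] = field(default_factory=dict)
    online: bool = True


@dataclass
class Cluster:
    """In-memory stand-in for HDFS data, the meta table and the ZK log state."""
    regions: Dict[str, Region] = field(default_factory=dict)
    meta_table: Dict[str, Tuple[str, str]] = field(default_factory=dict)
    log_state: Dict[str, str] = field(default_factory=dict)


def merge_regions(cluster: Cluster, a_name: str, b_name: str, merged_name: str) -> Region:
    a, b = cluster.regions[a_name], cluster.regions[b_name]
    if a.end_key != b.start_key:
        raise ValueError("regions are not adjacent")
    cluster.log_state[merged_name] = "started"
    # Take the source regions offline before touching their data.
    a.online = b.online = False
    # Merge the underlying data, keeping rows in key order.
    rows = dict(sorted({**a.rows, **b.rows}.items()))
    c = Region(merged_name, a.start_key, b.end_key, rows, online=False)
    # Merge the metadata: replace A and B with C in the meta table.
    for name in (a_name, b_name):
        cluster.meta_table.pop(name, None)
        del cluster.regions[name]
    cluster.regions[merged_name] = c
    cluster.meta_table[merged_name] = (c.start_key, c.end_key)
    # Bring C online if it is not online yet, then mark completion.
    if not c.online:
        c.online = True
    cluster.log_state[merged_name] = "completed"
    return c


if __name__ == "__main__":
    cl = Cluster()
    cl.regions["A"] = Region("A", "a", "m", {"b": "1"})
    cl.regions["B"] = Region("B", "m", "z", {"n": "2"})
    cl.meta_table = {"A": ("a", "m"), "B": ("m", "z")}
    print(merge_regions(cl, "A", "B", "C"))
```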

Fig. 3 is a schematic diagram of main modules of a data processing apparatus according to an embodiment of the present invention, and as shown in fig. 3, a data processing apparatus 300 includes:

the monitoring module 301 is configured to obtain a configuration file when it is monitored that a trigger condition of a merging task is met;

the screening module 302 is configured to determine a target data table in a cluster according to the configuration file, and further select a data partition to be processed from the target data table according to a threshold condition in the configuration file;

the merging module 303 is configured to invoke a preset merging interface to obtain a merging method, and merge the to-be-processed data partitions by using the merging method based on the partition target capacity threshold in the configuration file to generate merged data partitions;

and the allocation module 304 is configured to invoke a load balancing interface, obtain an allocation policy according to the merged data partition, and further allocate the merged data partition to each device in the cluster.
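The text does not state which allocation strategy the load balancing interface returns. As one plausible illustration only, the sketch below assigns each merged partition to the device with the lowest current load; the `allocate` function, the load figures, and the least-loaded rule itself are assumptions, and at least one device is assumed to exist.

```python
import heapq
from typing import Dict, List, Tuple


def allocate(
    merged: List[Tuple[str, float]],
    device_load: Dict[str, float],
) -> Dict[str, str]:
    """Assign each merged (id, size) partition to the least-loaded device."""
    heap = [(load, device) for device, load in device_load.items()]
    heapq.heapify(heap)
    assignment: Dict[str, str] = {}
    # Place large partitions first so that small ones can even out the loads.
    for pid, size in sorted(merged, key=lambda p: p[1], reverse=True):
        load, device = heapq.heappop(heap)
        assignment[pid] = device
        heapq.heappush(heap, (load + size, device))
    return assignment


if __name__ == "__main__":
    merged = [("m1", 20), ("m2", 18), ("m3", 5)]
    print(allocate(merged, {"d1": 0, "d2": 10}))
```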

In some embodiments, the monitoring module monitors a triggering condition that meets a merging task, and obtains a configuration file, including:

In some embodiments, the filtering module determines a target data table in the cluster according to the configuration file, including:

invoking a data table constraint condition from the configuration file;

In some embodiments, the filtering module selects a data partition to be processed from the target data table according to a threshold condition in the configuration file, including:

In some embodiments, the threshold condition further comprises: a partition creation time threshold; the screening module selects a data partition with partition capacity smaller than the partition capacity threshold from the target data table as a partition to be selected, and the method comprises the following steps:

In some embodiments, the threshold condition further comprises: partition identification range start value and end value; the screening module selects a data partition with partition capacity smaller than the partition capacity threshold from the target data table as a partition to be selected, and the method comprises the following steps:

In some embodiments, the merging module merges the data partitions to be processed using the merging method based on a partition target capacity threshold in the configuration file to generate merged data partitions, including:

performing offline processing on the data partition to be processed;

The merge policy may indicate which two or more data partitions may be merged into a new data partition, and the new data partition may meet the partition target capacity threshold.

FIG. 4 illustrates an exemplary system architecture 400 in which a data processing method or data processing apparatus of an embodiment of the present invention may be applied.

As shown in fig. 4, the system architecture 400 may include terminal devices 401, 402, 403, a network 404, and a server 405. The network 404 is used as a medium to provide communication links between the terminal devices 401, 402, 403 and the server 405. The network 404 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

A user may interact with the server 405 via the network 404 using the terminal devices 401, 402, 403 to receive or send messages or the like. The terminal devices 401, 402, 403 may have various communication client applications installed thereon, or input interface applications for setting configuration files.

The terminal devices 401, 402, 403 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.

The server 405 may be a server providing various services, such as a background management server providing support for business processes performed by the user using the terminal devices 401, 402, 403. The background management server can analyze and otherwise process received data, such as a product information query request, and feed back the processing result to the terminal device.

It should be noted that, the data processing method provided in the embodiment of the present invention is generally executed by the server 405, and accordingly, the data processing apparatus is generally disposed in the server 405.

It should be understood that the number of terminal devices, networks and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 5, there is illustrated a schematic diagram of a computer system 500 suitable for use in implementing an embodiment of the present invention. The terminal device shown in fig. 5 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiment of the present invention.

As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU) 501, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.

The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. The drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as needed so that a computer program read therefrom is installed into the storage section 508 as needed.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 509, and/or installed from the removable media 511. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 501.

The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor; for example, a processor may be described as including a monitoring module, a screening module, a combining module, and an allocation module. In some cases, the names of these modules do not limit the modules themselves.

As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform: step S101, acquiring a configuration file when it is detected that a trigger condition of the merging task is met; step S102, determining a target data table in a cluster according to the configuration file, and selecting data partitions to be processed from the target data table according to a threshold condition in the configuration file; step S103, calling a preset merging interface to obtain a merging method, and merging the data partitions to be processed by using the merging method, based on a partition target capacity threshold in the configuration file, to generate merged data partitions; step S104, calling a load balancing interface, obtaining an allocation strategy according to the merged data partitions, and allocating the merged data partitions to each device in the cluster.
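As a non-limiting illustration, steps S102 to S104 may be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the names `Partition`, `select_candidates`, `plan_merges`, and `allocate`, and the configuration keys, are hypothetical. The greedy grouping and the least-loaded assignment merely stand in for the merging method and the allocation strategy that the embodiment obtains through the merging interface and the load balancing interface.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Partition:
    partition_id: int  # hypothetical identifier
    size_mb: int       # hypothetical capacity measure


def select_candidates(partitions: List[Partition], capacity_threshold_mb: int) -> List[Partition]:
    """S102 (assumed form): keep partitions whose capacity is below the threshold."""
    return [p for p in partitions if p.size_mb < capacity_threshold_mb]


def plan_merges(candidates: List[Partition], target_capacity_mb: int) -> List[List[Partition]]:
    """S103 (assumed form): group candidates in identifier order so that
    each group's total size does not exceed the target capacity."""
    groups: List[List[Partition]] = []
    current: List[Partition] = []
    current_size = 0
    for p in sorted(candidates, key=lambda x: x.partition_id):
        # Close the current group if adding p would exceed the target.
        if current and current_size + p.size_mb > target_capacity_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(p)
        current_size += p.size_mb
    if current:
        groups.append(current)
    return groups


def allocate(groups: List[List[Partition]], devices: List[str]) -> Dict[str, List[List[int]]]:
    """S104 (assumed form): assign each merged group, largest first,
    to the device with the smallest accumulated load."""
    load = {d: 0 for d in devices}
    assignment: Dict[str, List[List[int]]] = {d: [] for d in devices}
    for g in sorted(groups, key=lambda grp: -sum(p.size_mb for p in grp)):
        target = min(devices, key=lambda d: load[d])
        assignment[target].append([p.partition_id for p in g])
        load[target] += sum(p.size_mb for p in g)
    return assignment


# Hypothetical configuration values and table contents.
config = {"capacity_threshold_mb": 100, "target_capacity_mb": 60}
table = [Partition(1, 10), Partition(2, 20), Partition(3, 200),
         Partition(4, 30), Partition(5, 40)]

candidates = select_candidates(table, config["capacity_threshold_mb"])
merge_plan = plan_merges(candidates, config["target_capacity_mb"])
placement = allocate(merge_plan, ["device-a", "device-b"])
```

In this sketch the large partition 3 is excluded at the screening stage, and the remaining small partitions are grouped before being spread across the two hypothetical devices. An actual deployment would replace the two stand-in functions with whatever the merging interface and load balancing interface return.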

According to the technical solution of the embodiments of the present invention, a general method is used to screen out the data partitions that need to be merged and to call the corresponding interfaces to complete the merging and the allocation of the merged data partitions. This solves the technical problems in the prior art that a developer must actively search for the data partitions that need to be merged and actively issue merge instructions, which is inefficient and prone to omissions. The merging efficiency of data partitions is thereby improved, and low system operating efficiency caused by excessive system pressure is avoided.

The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims

1. A method of data processing, comprising:

2. The method of claim 1, wherein acquiring the configuration file when it is detected that the trigger condition of the merging task is met comprises:

3. The method of claim 1, wherein determining the target data table in the cluster from the configuration file comprises:

invoking a data table constraint condition from the configuration file;

4. The method of claim 1, wherein selecting a data partition to be processed from the target data table according to a threshold condition in the configuration file comprises:

5. The method of claim 4, wherein the threshold condition further comprises a partition creation time threshold; and selecting, from the target data table, the data partitions whose partition capacity is smaller than the partition capacity threshold as candidate partitions comprises:

6. The method of claim 4, wherein the threshold condition further comprises a start value and an end value of a partition identification range; and selecting, from the target data table, the data partitions whose partition capacity is smaller than the partition capacity threshold as candidate partitions comprises:

7. The method according to any one of claims 4-6, wherein the merging the data partitions to be processed using the merging method based on the partition target capacity threshold in the configuration file to generate a merged data partition comprises:

performing offline processing on the data partition to be processed;

8. A data processing apparatus, comprising:

9. A data processing electronic device, comprising:

one or more processors;

storage means for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.

10. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-7.