CN107643906A

CN107643906A - Data processing method and device

Info

Publication number: CN107643906A
Application number: CN201610590825.2A
Authority: CN
Inventors: 李雪斌
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Cloud Computing Technologies Co Ltd
Priority date: 2016-07-22
Filing date: 2016-07-22
Publication date: 2018-01-30
Anticipated expiration: 2036-07-22
Also published as: CN107643906B; WO2018014761A1

Abstract

The invention discloses a data processing method and device, belonging to the field of computer technology. The method includes: obtaining a compression dictionary of at least two target text data blocks; compressing each of the at least two target text data blocks based on the compression dictionary to obtain at least two compressed data blocks; and, when a processing instruction for performing the same processing operation on the at least two target text data blocks is received, processing the compressed data in the at least two compressed data blocks so as to process the at least two target text data blocks. Because the at least two target text data blocks are processed by processing the compressed data in the at least two compressed data blocks, the at least two compressed data blocks do not need to be decompressed, which reduces the amount of data processed, shortens the data processing time, and saves processing resources.

Description

Data processing method and device

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a data processing method and apparatus.

Background

With the development of computer technology, a large amount of text data needs to be stored and analyzed. Text data means data composed of printable characters, including characters 33 to 127 in the American Standard Code for Information Interchange (ASCII), Unicode characters, UTF-8 encoded characters, and the like. To save the time and space occupied by data storage and transmission, the text data is first compressed and the compressed text data is then stored. When the text data is later analyzed, the compressed text data is first decompressed to recover the text data, the text data is then subjected to processing such as comparison, sorting, searching, hash operations, and connection (join) operations, and the text data is analyzed based on the processing result.

At present, a data processing method is provided, which specifically comprises: for each stored text data block in a plurality of text data blocks, generating a compression dictionary of the text data block, wherein the text data block comprises a plurality of text data; compressing the text data block based on the compression dictionary of the text data block to obtain a compressed data block corresponding to the text data block; and storing the compressed data block corresponding to the text data block. When a processing instruction for carrying out the same processing operation on a first text data block and a second text data block is received, acquiring a compressed data block corresponding to the first text data block and acquiring a compressed data block corresponding to the second text data block, wherein the first text data block and the second text data block are any two text data blocks in the plurality of text data blocks; decompressing the compressed data block corresponding to the first text data block to obtain a first text data block, and decompressing the compressed data block corresponding to the second text data block to obtain a second text data block; and processing the text data in the first text data block and the second text data block to obtain a processing result.

In this method, when a processing instruction for performing the same processing operation on the first text data block and the second text data block is received, the two text data blocks can be processed only after the compressed data block corresponding to the first text data block and the compressed data block corresponding to the second text data block have each been decompressed. As a result, the data processing time is long and a large amount of processing resources is consumed.

Disclosure of Invention

In order to solve the problems in the prior art, embodiments of the present invention provide a data processing method and apparatus. The technical scheme is as follows:

in a first aspect, a data processing method is provided, the method including:

obtaining a compression dictionary of at least two target text data blocks, wherein the at least two target text data blocks are stored data blocks that are subsequently processed through the same processing operation, each target text data block comprises a plurality of text data, each text data comprises a plurality of characters, and the compression dictionary comprises a compression code of each text data in each target text data block or comprises a compression code of each character in each target text data block;

compressing each target text data block in the at least two target text data blocks respectively based on the compression dictionary to obtain at least two compressed data blocks, wherein the at least two target text data blocks correspond to the at least two compressed data blocks one to one, each compressed data block comprises a plurality of compressed data, and the plurality of compressed data correspond to the plurality of text data one to one;

and when a processing instruction for carrying out the same processing operation on the at least two target text data blocks is received, processing the compressed data in the at least two compressed data blocks so as to realize the processing of the at least two target text data blocks.

It should be noted that the compression dictionaries of the at least two target text data blocks may include compression codes of each text data in each target text data block, or may include compression codes of each character in each target text data block, that is, the compression codes in the compression dictionaries may correspond to the text data one by one, or may correspond to the characters one by one, which is not specifically limited in this embodiment of the present invention.

In addition, the at least two target text data blocks are data blocks which are subsequently processed through the same processing operation, and the at least two target text data blocks correspond to the same compression dictionary, that is, the at least two target text data blocks share the same compression dictionary, so that a compression dictionary does not need to be generated for each text data block. When a processing instruction for performing the same processing operation on at least two compressed data blocks is received, because the at least two compressed data blocks are obtained by compressing the at least two target text data blocks through the compression dictionary and the at least two target text data blocks share the same compression dictionary, compressed data in the at least two compressed data blocks can be directly processed, thereby realizing the processing of the at least two target text data blocks.
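The idea of a shared compression dictionary can be illustrated with a minimal Python sketch. The dictionary layout (one integer code per distinct text data), the function names, and the sample blocks are all assumptions made for illustration; the embodiment does not prescribe them. Because both blocks are encoded with the same dictionary, equal text data always map to equal codes, so an equality-based operation such as finding the text data common to both blocks can be carried out on the codes alone.

```python
from typing import Dict, List


def build_shared_dictionary(blocks: List[List[str]]) -> Dict[str, int]:
    """Assign one integer code per distinct text data across ALL blocks.

    Hypothetical scheme: codes are handed out in order of first appearance.
    """
    codes: Dict[str, int] = {}
    for block in blocks:
        for text in block:
            if text not in codes:
                codes[text] = len(codes)
    return codes


def compress_block(block: List[str], codes: Dict[str, int]) -> List[int]:
    """Replace every text data in the block with its compression code."""
    return [codes[text] for text in block]


# Two illustrative target text data blocks (sample data, not from the source).
block_a = ["beijing", "shanghai", "shenzhen"]
block_b = ["shenzhen", "hangzhou", "beijing"]

shared = build_shared_dictionary([block_a, block_b])
compressed_a = compress_block(block_a, shared)
compressed_b = compress_block(block_b, shared)

# Equality-based processing directly on compressed data: no decompression.
common_codes = sorted(set(compressed_a) & set(compressed_b))
```

Note that this sketch only relies on equality of codes. Order-dependent operations such as sorting would additionally require the codes to preserve the order of the text data, which is an extra assumption not made here.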

With reference to the first aspect, in a first possible implementation manner of the first aspect, before the obtaining the compression dictionaries of the at least two target text data blocks, the method further includes:

determining the at least two target text data blocks from the stored plurality of text data blocks;

generating a compression dictionary of the at least two target text data blocks.

In order to facilitate subsequent compression of each of the at least two target text data blocks based on the same compression standard, after determining the at least two target text data blocks, compression dictionaries for the at least two target text data blocks may be generated. When generating the compression dictionaries for the at least two target text data blocks, the compression dictionaries for the at least two target text data blocks may be generated based on a specified compression algorithm and the at least two target text data blocks.
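As a hedged illustration of generating one compression dictionary from at least two target text data blocks, the following Python sketch uses a simple fixed-width binary code as the "specified compression algorithm". That choice, the function name `generate_dictionary`, and the sample blocks are assumptions; the embodiment leaves the algorithm open.

```python
from typing import Dict, List


def generate_dictionary(blocks: List[List[str]]) -> Dict[str, str]:
    """Map each distinct text data in the given blocks to a fixed-width bit string.

    Assumed algorithm: sort the distinct text data and number them in binary,
    using the smallest width that can represent every entry.
    """
    vocabulary = sorted({text for block in blocks for text in block})
    if not vocabulary:
        return {}
    # Number of bits needed for the largest index (at least one bit).
    width = max(1, (len(vocabulary) - 1).bit_length())
    return {text: format(i, f"0{width}b") for i, text in enumerate(vocabulary)}


# Illustrative target text data blocks (sample data).
blocks = [["red", "green"], ["blue", "red", "black"]]
dictionary = generate_dictionary(blocks)
```

With four distinct text data, every compression code in this sketch is two bits wide, so all code lengths in the dictionary are equal.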

With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the determining the at least two target text data blocks from the stored multiple text data blocks includes:

selecting at least two text data blocks belonging to a target category from the plurality of text data blocks, and determining the selected text data blocks as target text data blocks; or,

when a selection instruction for at least two text data blocks in the plurality of text data blocks is detected, determining the text data block selected by the selection instruction as a target text data block.

Because text data blocks in the same target category are generally processed through the same processing operation, in the embodiment of the present invention at least two text data blocks belonging to the target category can be selected and determined as the target text data blocks. This determining operation is simple and requires no user participation, so the efficiency of determining the target text data blocks can be improved.

In addition, because the selection instruction is triggered by the user, when the text data block selected by the selection instruction is determined as the target text data block, the target text data block is actually determined according to the user operation, so that the determined target text data block can be ensured to meet the user requirement.

With reference to the first aspect, in a third possible implementation manner of the first aspect, the processing the compressed data in the at least two compressed data blocks to implement the processing of the at least two target text data blocks includes:

determining a plurality of compressed data included in each of the at least two compressed data blocks;

processing the at least two compressed data blocks based on a plurality of compressed data included in each of the at least two compressed data blocks to obtain processing results of the at least two compressed data blocks;

and decompressing the processing results of the at least two compressed data blocks based on the compression dictionary to obtain the processing results of the at least two target text data blocks.

The compression dictionary stores the compression code of each text data in each target text data block, or the compression code of each character in each target text data block. Compressing the at least two target text data blocks therefore amounts to converting the text data in them into compressed data according to a fixed conversion rule. Consequently, after the compressed data is processed, the processing results of the at least two compressed data blocks also consist of compression codes. Decompressing these processing results based on the compression dictionary converts the results formed by compression codes back into text data form, and because the conversion rule is fixed, the converted processing results are the processing results of the at least two target text data blocks.
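The reasoning above can be checked with a small Python sketch: an operation is run on compression codes, only its result is decompressed, and the outcome is compared with running the same operation on the original text data. The dictionary contents, the sample blocks, and the choice of a counting operation are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict

# Hypothetical shared compression dictionary: text data -> compression code.
codes: Dict[str, int] = {"error": 0, "info": 1, "warn": 2}
# Inverse mapping used only to decompress the processing result.
decode: Dict[int, str] = {code: text for text, code in codes.items()}

# Illustrative target text data blocks and their compressed data blocks.
block_a = ["info", "error", "info"]
block_b = ["warn", "info", "error"]
compressed_a = [codes[text] for text in block_a]
compressed_b = [codes[text] for text in block_b]

# Step 1: process the compressed data (count occurrences of each code).
code_counts = Counter(compressed_a) + Counter(compressed_b)

# Step 2: decompress only the processing result, not the data blocks.
text_counts = {decode[code]: n for code, n in code_counts.items()}

# Reference: the same operation performed on the uncompressed text data.
expected = dict(Counter(block_a + block_b))
```

In this sketch only three dictionary lookups are needed to decompress the result, regardless of how many text data the blocks contain.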

It should be noted that text data blocks are generally stored in a distributed manner. In the related art, when a processing instruction for performing the same processing operation on two of the text data blocks is received, because the compression dictionaries of the two text data blocks are different, the two stored compressed data blocks corresponding to the two text data blocks need to be decompressed respectively to obtain the two text data blocks, and the two text data blocks are then transmitted to a device for data processing, so that the device processes the two text data blocks.

In the embodiment of the present invention, the at least two target text data blocks are processed by directly processing the compressed data in the at least two compressed data blocks. Therefore, when the at least two target text data blocks are to be processed, only the at least two compressed data blocks corresponding to them need to be transmitted to the device that performs data processing, and the device processes the compressed data in the at least two compressed data blocks to process the at least two target text data blocks. Compared with the related art, in which a compressed data block must be decompressed before the text data block can be processed, the embodiment of the present invention processes the target text data blocks directly on the basis of the compressed data blocks, which shortens the data processing time and saves processing resources. In addition, compared with the related art, in which text data blocks are transmitted among the devices of the distributed system, only compressed data blocks need to be transmitted in the embodiment of the present invention, which reduces the amount of data transmitted, improves network bandwidth utilization, and saves data transmission time.

With reference to the third possible implementation manner of the first aspect, in a fourth possible implementation manner of the first aspect, the determining a plurality of compressed data included in each of the at least two compressed data blocks includes:

judging whether the code lengths of a plurality of compression codes included in the compression dictionary are equal or not;

when the code lengths of a plurality of compression codes included in the compression dictionary are equal and the compression dictionary includes the compression code of each text data in each target text data block, determining the code length of each compression code as the length of each compression data in the compression data block for each compression data block in the at least two compression data blocks;

and sequentially determining the plurality of compressed data from the compressed data block according to the length of each compressed data in the compressed data block.

As can be seen from the above description, the compression dictionary is generated by a specified compression algorithm. Depending on the compression algorithm, the code lengths of the plurality of compression codes included in the generated compression dictionary may or may not be equal. Because the compressed data blocks are obtained by compression based on the compression dictionary, when determining the plurality of compressed data included in each of the at least two compressed data blocks, it may first be judged whether the code lengths of the plurality of compression codes included in the compression dictionary are equal.

In addition, when the compression dictionary includes the compression code of each text data in each target text data block, each text data is compressed into its own compression code when the target text data block is compressed. Therefore, the code length of each compression code in the compression dictionary may be determined as the length of each compressed data in the compressed data block.
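The equal-code-length case can be sketched in Python as follows. The dictionary contents, the representation of a compressed data block as a string of bits, and the function names are assumptions made for illustration. The sketch first judges whether all code lengths are equal and then cuts the compressed data block into consecutive pieces of that length.

```python
from typing import Dict, List


def codes_have_equal_length(dictionary: Dict[str, str]) -> bool:
    """Judge whether every compression code in the dictionary has the same length."""
    return len({len(code) for code in dictionary.values()}) <= 1


def split_fixed(compressed_block: str, code_length: int) -> List[str]:
    """Sequentially cut the compressed data block into pieces of code_length bits."""
    return [
        compressed_block[i:i + code_length]
        for i in range(0, len(compressed_block), code_length)
    ]


# Hypothetical dictionary holding one code per text data, all two bits wide.
dictionary = {"black": "00", "blue": "01", "green": "10", "red": "11"}

# Illustrative target text data block and its compressed data block.
block = ["red", "blue", "red"]
compressed_block = "".join(dictionary[text] for text in block)

equal = codes_have_equal_length(dictionary)
code_length = len(next(iter(dictionary.values())))
pieces = split_fixed(compressed_block, code_length)
```

Each element of `pieces` is one compressed data, in one-to-one correspondence with the text data of the block.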

With reference to the fourth possible implementation manner of the first aspect, in a fifth possible implementation manner of the first aspect, after the determining whether code lengths of a plurality of compression codes included in the compression dictionary are equal, the method further includes:

when the code lengths of a plurality of compression codes included in the compression dictionary are equal and when the compression dictionary includes the compression code of each character in each target text data block, determining a target text data block corresponding to the compression data block for each of the at least two compression data blocks;

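multiplying the code length of each compression code by the number of characters of each text data in the target text data block to obtain the length of each compressed data in the compressed data block;

and sequentially determining the plurality of compressed data from the compressed data block according to the length of each compressed data in the compressed data block.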

When the compression dictionary includes the compression code of each character in each target text data block, each character is compressed into its own compression code when the target text data block is compressed, and one text data may include a plurality of characters. Therefore, the code length of each compression code in the compression dictionary can be multiplied by the number of characters of each text data in the target text data block to obtain the length of each compressed data in the compressed data block.
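For the per-character case, the length computation can be sketched in Python as below. The character codes, the sample block, and the function names are assumptions made for illustration. The length of each compressed data is the common code length multiplied by the number of characters of the corresponding text data, and the compressed data block is then cut at those lengths.

```python
from typing import Dict, List


def compressed_lengths(block: List[str], code_length: int) -> List[int]:
    """Length of each compressed data = code length x number of characters."""
    return [code_length * len(text) for text in block]


def split_by_lengths(compressed_block: str, lengths: List[int]) -> List[str]:
    """Sequentially cut the compressed data block at the given lengths."""
    pieces: List[str] = []
    start = 0
    for length in lengths:
        pieces.append(compressed_block[start:start + length])
        start += length
    return pieces


# Hypothetical dictionary holding one two-bit code per character.
char_codes: Dict[str, str] = {"a": "00", "b": "01", "c": "10", "d": "11"}

# Illustrative target text data block, compressed character by character.
block = ["ab", "cda", "d"]
compressed_block = "".join(char_codes[ch] for text in block for ch in text)

code_length = len(next(iter(char_codes.values())))
lengths = compressed_lengths(block, code_length)
pieces = split_by_lengths(compressed_block, lengths)
```

The sketch reads the character counts from the target text data block itself; a real system would need those counts to be available without decompressing, which is left open here.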

With reference to the fourth possible implementation manner of the first aspect, in a sixth possible implementation manner of the first aspect, after the determining whether code lengths of a plurality of compression codes included in the compression dictionary are equal, the method further includes:

when the code lengths of a plurality of compressed codes included in the compression dictionary are not equal, for each compressed data block of the at least two compressed data blocks, determining the plurality of compressed data from the compressed data block according to the data index of the compressed data block, where the data index of the compressed data block is used for indicating the position of each compressed data in the plurality of compressed data in the compressed data block.

With reference to the sixth possible implementation manner of the first aspect, in a seventh possible implementation manner of the first aspect, before determining the plurality of compressed data from the compressed data block according to the data index of the compressed data block, the method further includes:

for a target text data block corresponding to the compressed data block, determining the position of compressed data corresponding to each text data in the target text data block in the compressed data block in the process of compressing the target text data block;

and generating a data index of the compressed data block based on the position of the compressed data corresponding to each text data in the compressed data block.

It should be noted that, in the process of compressing the target text data block, when the position of the compressed data corresponding to each text data in the target text data block in the compressed data block is determined, the start position and the end position of the compressed data corresponding to each text data in the compressed data block may be determined, and then the position of the compressed data corresponding to each text data in the compressed data block is uniquely determined according to the start position and the end position. For convenience, the start position of each compressed data in the compressed data block may be determined as the data index of the compressed data block.
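When code lengths differ, the data index described above can be sketched in Python as a list of start positions recorded during compression. The variable-length codes, the sample block, and the function names are assumptions made for illustration. The end position of each compressed data is taken to be the start position of the next one, or the end of the compressed data block for the last one.

```python
from typing import Dict, List, Tuple


def compress_with_index(
    block: List[str], dictionary: Dict[str, str]
) -> Tuple[str, List[int]]:
    """Compress a block and record the start position of every compressed data."""
    parts: List[str] = []
    index: List[int] = []
    position = 0
    for text in block:
        code = dictionary[text]
        index.append(position)  # start position of this compressed data
        parts.append(code)
        position += len(code)
    return "".join(parts), index


def split_by_index(compressed_block: str, index: List[int]) -> List[str]:
    """Recover the individual compressed data using the start positions."""
    ends = index[1:] + [len(compressed_block)]
    return [compressed_block[start:end] for start, end in zip(index, ends)]


# Hypothetical dictionary whose compression codes have unequal lengths.
dictionary = {"info": "0", "warn": "10", "error": "110"}

block = ["warn", "info", "error", "info"]
compressed_block, data_index = compress_with_index(block, dictionary)
pieces = split_by_index(compressed_block, data_index)
```

Storing only start positions keeps the index to one integer per text data, which matches the convenience noted in the paragraph above.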

In a second aspect, there is provided a data processing apparatus, the apparatus comprising:

an acquisition module, configured to obtain a compression dictionary of at least two target text data blocks, wherein the at least two target text data blocks are stored data blocks that are subsequently processed through the same processing operation, each target text data block comprises a plurality of text data, each text data comprises a plurality of characters, and the compression dictionary comprises a compression code of each text data in each target text data block or comprises a compression code of each character in each target text data block;

a compression module, configured to compress each target text data block in the at least two target text data blocks respectively based on the compression dictionary to obtain at least two compressed data blocks, wherein the at least two target text data blocks correspond to the at least two compressed data blocks one to one, each compressed data block comprises a plurality of compressed data, and the plurality of compressed data correspond to the plurality of text data one to one;

and a processing module, configured to, when a processing instruction for performing the same processing operation on the at least two target text data blocks is received, process the compressed data in the at least two compressed data blocks so as to process the at least two target text data blocks.

With reference to the second aspect, in a first possible implementation manner of the second aspect, the apparatus further includes:

a determining module, configured to determine the at least two target text data blocks from the stored plurality of text data blocks;

and a generating module, configured to generate a compression dictionary of the at least two target text data blocks.

With reference to the first possible implementation manner of the second aspect, in a second possible implementation manner of the second aspect, the determining module includes:

a selecting unit, configured to select at least two text data blocks belonging to a target category from the plurality of text data blocks, and determine the selected text data block as a target text data block; or,

a first determining unit, configured to, when a selection instruction for at least two text data blocks of the plurality of text data blocks is detected, determine a text data block selected by the selection instruction as a target text data block.

With reference to the second aspect, in a third possible implementation manner of the second aspect, the processing module includes:

a second determining unit configured to determine a plurality of compressed data included in each of the at least two compressed data blocks;

a processing unit, configured to process the at least two compressed data blocks based on a plurality of compressed data included in each of the at least two compressed data blocks to obtain processing results of the at least two compressed data blocks;

and a decompression unit, configured to decompress the processing results of the at least two compressed data blocks based on the compression dictionary to obtain the processing results of the at least two target text data blocks.

With reference to the third possible implementation manner of the second aspect, in a fourth possible implementation manner of the second aspect, the second determining unit includes:

a judging subunit, configured to judge whether code lengths of a plurality of compression codes included in the compression dictionary are equal;

a first determining subunit, configured to determine, for each of the at least two compressed data blocks, a code length of each compressed code as a length of each of the compressed data in the compressed data block, when code lengths of a plurality of compressed codes included in the compression dictionary are equal and the compression dictionary includes a compressed code of each of the text data in the respective target text data blocks;

a second determining subunit, configured to sequentially determine the plurality of compressed data from the compressed data block according to the length of each compressed data in the compressed data block.

With reference to the fourth possible implementation manner of the second aspect, in a fifth possible implementation manner of the second aspect, the second determining unit further includes:

a third determining subunit, configured to determine, for each of the at least two compressed data blocks, a target text data block to which the compressed data block corresponds when the code lengths of the plurality of compressed codes included in the compression dictionary are equal and when the compression dictionary includes a compressed code of each character in the respective target text data blocks;

an operation subunit, configured to multiply the code length of each compression code by the number of characters of each text data in the target text data block to obtain the length of each compressed data in the compressed data block;

a fourth determining subunit, configured to sequentially determine the multiple pieces of compressed data from the compressed data block according to the length of each piece of compressed data in the compressed data block.

With reference to the fourth possible implementation manner of the second aspect, in a sixth possible implementation manner of the second aspect, the second determining unit further includes:

a fifth determining subunit, configured to determine, for each of the at least two compressed data blocks, the plurality of compressed data according to a data index of the compressed data block when code lengths of a plurality of compressed codes included in the compression dictionary are not equal, where the data index of the compressed data block is used to indicate a position of each of the plurality of compressed data in the compressed data block.

With reference to the sixth possible implementation manner of the second aspect, in a seventh possible implementation manner of the second aspect, the second determining unit further includes:

a sixth determining subunit, configured to determine, for a target text data block corresponding to the compressed data block, a position of compressed data corresponding to each text data in the target text data block in the compressed data block in a process of compressing the target text data block;

and a generating subunit, configured to generate a data index of the compressed data block based on the position of the compressed data corresponding to each text data in the compressed data block.

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

In the embodiment of the present invention, for at least two target text data blocks that are subsequently processed through the same processing operation, a compression dictionary of the at least two target text data blocks is obtained, and each of the at least two target text data blocks is compressed based on the compression dictionary to obtain at least two compressed data blocks. That is, the at least two target text data blocks share the same compression dictionary, and a compression dictionary does not need to be generated for each text data block. In addition, because the at least two target text data blocks are compressed with the same compression dictionary, that is, according to the same compression standard, the result obtained by processing the compressed data in the at least two compressed data blocks and then decompressing the processing result with the compression dictionary is the same as the result obtained by processing the text data in the at least two target text data blocks. The embodiment of the present invention therefore processes the at least two target text data blocks by processing the compressed data in the at least two compressed data blocks, without decompressing the at least two compressed data blocks, which reduces the amount of data processed, shortens the data processing time, and saves processing resources.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

FIG. 1 is a block diagram of a data processing system according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a computer device according to an embodiment of the present invention;

FIG. 3 is a flow chart of a data processing method according to an embodiment of the present invention;

FIG. 4A is a flow chart of another data processing method provided by an embodiment of the invention;

FIG. 4B is a diagram of a target block of text data according to the embodiment of FIG. 4A;

FIG. 4C is a diagram illustrating a correspondence between a target block of text data and a compression dictionary in accordance with the embodiment of FIG. 4A;

FIG. 4D is a diagram of another target block of text data and compression dictionary correspondence according to the embodiment of FIG. 4A;

FIG. 4E is a diagram of a compressed field index according to the embodiment of FIG. 4A;

FIG. 5A is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;

FIG. 5B is a schematic structural diagram of a processing module according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Fig. 1 is a schematic structural diagram of a data processing system according to an embodiment of the present invention. Referring to fig. 1, the system may be a distributed system or, of course, a non-distributed system; the embodiment of the present invention is described using a distributed system as an example. The distributed system includes a plurality of devices, namely device 01, device 02, and device 03.

In addition, each of the plurality of devices may include a data import module and a data processing module. A designated device among the plurality of devices may further include a compression dictionary configuration module and a compression dictionary shared storage module; the designated device may be any one of the plurality of devices. The distributed system may store a plurality of text data blocks. The compression dictionary configuration module is configured to configure compression dictionaries for the plurality of text data blocks stored in the distributed system, and a compression dictionary is used to compress text data blocks to obtain compressed data blocks.

The compression dictionary shared storage module is configured to store the identifiers of the text data blocks in correspondence with their compression dictionaries. The data import module is configured to obtain the compression dictionary of the at least two target text data blocks from the compression dictionary shared storage module when the at least two target text data blocks are to be processed. The data processing module is configured to process the compressed data in the compressed data blocks corresponding to the at least two target text data blocks and to decompress the processing result based on the compression dictionary obtained by the data import module, thereby processing the at least two target text data blocks.
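The role of the compression dictionary shared storage module can be pictured with a minimal in-memory Python sketch. The class name `SharedDictionaryStore`, its methods, and the block identifiers are hypothetical and only illustrate the mapping from text data block identifiers to compression dictionaries; they do not describe how the modules in fig. 1 are actually implemented.

```python
from typing import Dict, List


class SharedDictionaryStore:
    """Hypothetical store mapping text data block identifiers to dictionaries."""

    def __init__(self) -> None:
        self._dictionary_by_block: Dict[str, Dict[str, int]] = {}

    def configure(self, block_ids: List[str], dictionary: Dict[str, int]) -> None:
        """Record that all given blocks use the same compression dictionary."""
        for block_id in block_ids:
            self._dictionary_by_block[block_id] = dictionary

    def get_shared(self, block_ids: List[str]) -> Dict[str, int]:
        """Return the single dictionary shared by the given (non-empty) list of blocks."""
        dictionaries = [self._dictionary_by_block[b] for b in block_ids]
        first = dictionaries[0]
        if any(d is not first for d in dictionaries[1:]):
            raise ValueError("blocks do not share one compression dictionary")
        return first


# Configuration step: two blocks share one dictionary, a third has its own.
store = SharedDictionaryStore()
orders_dictionary = {"paid": 0, "shipped": 1}
store.configure(["orders_2015", "orders_2016"], orders_dictionary)
store.configure(["logs_2016"], {"info": 0, "warn": 1})

# Import step: look up the dictionary for the target text data blocks.
shared = store.get_shared(["orders_2015", "orders_2016"])
```

Requesting blocks that were configured with different dictionaries raises an error in this sketch, reflecting that direct processing of compressed data presupposes a shared dictionary.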

Fig. 2 is a schematic structural diagram of a computer device according to an embodiment of the present invention, and the device in the distributed system in fig. 1 may be implemented by the computer device shown in fig. 2. Referring to fig. 2, the computer device comprises at least one processor 201, a communication bus 202, a memory 203 and at least one communication interface 204.

The processor 201 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs of the solution of the present invention.

The communication bus 202 may include a path that conveys information between the aforementioned components.

The memory 203 may be a Read-Only Memory (ROM) or another type of static storage device that can store static information and instructions, a Random Access Memory (RAM) or another type of dynamic storage device that can store information and instructions, an Electrically Erasable Programmable Read-Only Memory (EEPROM), a compact disc Read-Only Memory (CD-ROM) or other optical disc storage (including compact disc, laser disc, digital versatile disc, blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these. The memory 203 may exist independently and be connected to the processor 201 via the communication bus 202. The memory 203 may also be integrated with the processor 201.

The communication interface 204 uses any transceiver-like apparatus to communicate with other devices or communication networks, such as Ethernet, a Radio Access Network (RAN), or a Wireless Local Area Network (WLAN).

In a specific implementation, as one embodiment, the processor 201 may include one or more CPUs, such as CPU0 and CPU1 shown in fig. 2.

In a specific implementation, as one embodiment, the computer device may include multiple processors, such as processor 201 and processor 208 shown in fig. 2. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).

In a specific implementation, as one embodiment, the computer device may also include an output device 205 and an input device 206. The output device 205 communicates with the processor 201 and may display information in a variety of ways. For example, the output device 205 may be a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display device, a Cathode Ray Tube (CRT) display device, a projector, or the like. The input device 206 communicates with the processor 201 and may receive user input in a variety of ways. For example, the input device 206 may be a mouse, a keyboard, a touch screen device, a sensing device, or the like.

The computer device may be a general purpose computer device or a special purpose computer device. In a specific implementation, the computer device may be a desktop computer, a laptop computer, a network server, a Personal Digital Assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, or an embedded device. The embodiment of the invention does not limit the type of the computer equipment.

The memory 203 is used to store program code for executing the solution of the present invention, and execution is controlled by the processor 201. The processor 201 is configured to execute the program code 210 stored in the memory 203. The program code 210 may include one or more software modules (e.g., a data import module, a data processing module, a compression dictionary configuration module, a compression dictionary shared storage module, etc.). The devices in the distributed system shown in fig. 1 may process data through the processor 201 and the one or more software modules in the program code 210 in the memory 203.

Fig. 3 is a flowchart of a data processing method according to an embodiment of the present invention. Referring to fig. 3, the method includes:

Step 301: Obtain a compression dictionary of at least two target text data blocks, where the at least two target text data blocks are data blocks, among a plurality of stored text data blocks, that are subsequently processed through the same processing operation; each target text data block includes a plurality of text data, each text data includes a plurality of characters, and the compression dictionary includes a compression code of each text data in each target text data block or a compression code of each character in each target text data block.

Step 302: Compress each of the at least two target text data blocks based on the compression dictionary to obtain at least two compressed data blocks, where the at least two target text data blocks are in one-to-one correspondence with the at least two compressed data blocks, each compressed data block includes a plurality of compressed data, and the plurality of compressed data are in one-to-one correspondence with the plurality of text data.

Step 303: When a processing instruction for performing the same processing operation on the at least two target text data blocks is received, process the compressed data in the at least two compressed data blocks to implement the processing of the at least two target text data blocks.

In the embodiment of the present invention, for at least two target text data blocks that are subsequently processed through the same processing operation, a compression dictionary of the at least two target text data blocks is obtained, and each of the at least two target text data blocks is compressed based on that compression dictionary to obtain at least two compressed data blocks. That is, the at least two target text data blocks share the same compression dictionary, and no separate compression dictionary needs to be generated for each text data block. In addition, because the at least two target text data blocks are compressed with the same compression dictionary, that is, by the same compression standard, processing the compressed data in the at least two compressed data blocks and then decompressing the processing result with the compression dictionary yields the same result as processing the text data in the at least two target text data blocks directly. The embodiment of the present invention therefore processes the at least two target text data blocks by processing the at least two compressed data blocks, without decompressing the at least two compressed data blocks, which reduces the amount of data processed, shortens the data processing time, and saves processing resources.
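The equivalence described above can be illustrated with a minimal Python sketch. The dictionary contents, the block contents, and the helper names `compress_block` and `count_matches` are assumptions made for illustration only; the embodiment does not prescribe them. The sketch searches two blocks for one value by comparing compression codes, and checks that the count matches a search over the original text data.

```python
# Hypothetical shared compression dictionary: one code per text data item.
shared_dict = {"101": 0, "102": 1, "103": 2, "104": 3}

def compress_block(block, dictionary):
    """Replace every text data item in a block by its compression code."""
    return [dictionary[item] for item in block]

def count_matches(compressed_blocks, key_code):
    """Search operation carried out on compression codes only."""
    return sum(codes.count(key_code) for codes in compressed_blocks)

# Two illustrative target text data blocks compressed with the same dictionary.
block_a = ["103", "101", "104", "103"]
block_b = ["102", "104", "103"]
compressed = [compress_block(b, shared_dict) for b in (block_a, block_b)]

# The search key is encoded once; the blocks themselves are never decompressed.
hits_compressed = count_matches(compressed, shared_dict["103"])
hits_text = sum(b.count("103") for b in (block_a, block_b))
```

Because both blocks use the same dictionary, a given code means the same text data in either block, which is what makes the cross-block comparison valid.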

Optionally, before obtaining the compression dictionaries of the at least two target text data blocks, the method further includes:

determining at least two target text data blocks from the stored plurality of text data blocks;

generating a compression dictionary of the at least two target text data blocks.

Optionally, determining at least two target text data blocks from the stored plurality of text data blocks comprises:

selecting at least two text data blocks belonging to a target category from the stored plurality of text data blocks, and determining the selected text data blocks as target text data blocks; or,

when a selection instruction for at least two text data blocks of the plurality of text data blocks is detected, determining the text data blocks selected by the selection instruction as the target text data blocks.

Optionally, processing the compressed data in the at least two compressed data blocks to implement processing of the at least two target text data blocks includes:

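determining a plurality of compressed data included in each of the at least two compressed data blocks;

processing the at least two compressed data blocks based on the plurality of compressed data included in each of the at least two compressed data blocks to obtain processing results of the at least two compressed data blocks; and

decompressing the processing results of the at least two compressed data blocks based on the compression dictionary to obtain the processing results of the at least two target text data blocks.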

Optionally, determining a plurality of compressed data included in each of the at least two compressed data blocks comprises:

when the code lengths of the plurality of compression codes included in the compression dictionary are equal and the compression dictionary includes the compression code of each text data in each target text data block, determining, for each compressed data block in the at least two compressed data blocks, the code length of each compression code as the length of each compressed data in the compressed data block;

and sequentially determining a plurality of compressed data from the compressed data block according to the length of each compressed data in the compressed data block.

Optionally, after determining whether the code lengths of the plurality of compressed codes included in the compression dictionary are equal, the method further includes:

when the code lengths of the plurality of compression codes included in the compression dictionary are equal and the compression dictionary includes the compression code of each character in each target text data block, for each compressed data block of the at least two compressed data blocks, determining the target text data block corresponding to the compressed data block; multiplying the code length of each compression code by the number of characters of each text data in the target text data block to obtain the length of each compressed data in the compressed data block; and sequentially determining a plurality of compressed data from the compressed data block according to the length of each compressed data in the compressed data block;

when the code lengths of the plurality of compression codes included in the compression dictionary are not equal, for each compressed data block of the at least two compressed data blocks, determining a plurality of compressed data from the compressed data block according to the data index of the compressed data block, wherein the data index of the compressed data block is used for indicating the position of each of the plurality of compressed data in the compressed data block.

Optionally, before determining a plurality of compressed data from the compressed data block according to the data index of the compressed data block, the method further includes:

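for the target text data block corresponding to the compressed data block, determining, in the process of compressing the target text data block, the position in the compressed data block of the compressed data corresponding to each text data in the target text data block; and

generating a data index of the compressed data block based on the position, in the compressed data block, of the compressed data corresponding to each text data.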

All the above optional technical solutions can be combined arbitrarily to form an optional embodiment of the present invention, which is not described in detail herein.

Fig. 4A is a flowchart of a data processing method according to an embodiment of the present invention. Referring to fig. 4A, the method includes:

Step 401: At least two target text data blocks are determined from the plurality of stored text data blocks.

It should be noted that the plurality of text data blocks may be stored in a specified format, and the specified format may be preset. For example, the specified format may be a text file (TextFile) format, a Parquet format, a SequenceFile format, an RCFile format, an Avro format, and the like, which is not specifically limited in this embodiment of the present invention. To improve the access efficiency of the text data blocks, in practical applications the text data blocks may be stored in a distributed manner, meaning that they are dispersed across multiple independent devices. For example, the text data blocks may be stored based on the Hadoop Distributed File System (HDFS), which is not specifically limited in the embodiment of the present invention.

In addition, the at least two target text data blocks are data blocks that are subsequently processed through the same processing operation, each target text data block includes a plurality of text data, and each text data includes a plurality of characters. For example, as shown in fig. 4B, a certain target text data block of the at least two target text data blocks includes a plurality of text data, namely 101, 102, 103, and 104. The processing operation may include comparison, sorting, searching, a hash operation, a join operation, and the like, which is not specifically limited in this embodiment of the present invention.

The text data is data composed of printable characters, and the printable characters include characters 33 to 127 in ASCII, characters in Unicode, characters in UTF-8, and the like, which is not particularly limited in the embodiments of the present invention.

Specifically, when the distributed system determines the at least two target text data blocks from the plurality of text data blocks, the distributed system may select at least two text data blocks belonging to a target category from the plurality of text data blocks, and determine the selected text data block as the target text data block; or, when a selection instruction for at least two text data blocks of the plurality of text data blocks is detected, determining the text data block selected by the selection instruction as a target text data block.

It should be noted that all the text data blocks in the target category may be processed through the same processing operation, for example, the target category may be a number, time, a user name, and the like, which is not specifically limited in this embodiment of the present invention.

In addition, the selection instruction is used to select a target text data block from the plurality of text data blocks, and the selection instruction may be triggered by a user, where the user may trigger through a specified operation, and the specified operation may be a single-click operation, a double-click operation, a voice operation, and the like, which is not specifically limited in this embodiment of the present invention.

For example, if, among the plurality of text data blocks, the text data blocks belonging to the target category are text data block 1, text data block 2, text data block 3, and text data block 4, then text data blocks 1 to 4 may be selected from the plurality of text data blocks and determined as the target text data blocks.

For another example, if a selection instruction for the text data block 1, the text data block 2, the text data block 3, and the text data block 4 of the plurality of text data blocks is detected, the text data block 1, the text data block 2, the text data block 3, and the text data block 4 selected by the selection instruction may be determined as a target text data block.
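The two ways of determining target text data blocks can be sketched as follows. This is an illustrative Python sketch only: the `TextDataBlock` class, its `category` field, and both function names are hypothetical and are not taken from the embodiment.

```python
from dataclasses import dataclass, field

@dataclass
class TextDataBlock:
    name: str                                  # identifier of the text data block
    category: str                              # e.g. "number", "time", "user name"
    data: list = field(default_factory=list)   # the text data in the block

def _require_at_least_two(selected):
    """The method operates on at least two target text data blocks."""
    if len(selected) < 2:
        raise ValueError("at least two target text data blocks are required")
    return selected

def select_by_category(blocks, target_category):
    """First option: pick every block that belongs to the target category."""
    return _require_at_least_two([b for b in blocks if b.category == target_category])

def select_by_instruction(blocks, selected_names):
    """Second option: pick the blocks named in a selection instruction."""
    wanted = set(selected_names)
    return _require_at_least_two([b for b in blocks if b.name in wanted])

blocks = [
    TextDataBlock("block1", "time"),
    TextDataBlock("block2", "time"),
    TextDataBlock("block3", "user name"),
    TextDataBlock("block4", "time"),
]
by_category = select_by_category(blocks, "time")
by_instruction = select_by_instruction(blocks, ["block2", "block3"])
```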

It should be noted that, in practical applications, after the at least two target text data blocks are determined, the compression dictionary configuration module may further send the identifier of each of the at least two target text data blocks to the compression dictionary shared storage module, so that the compression dictionary shared storage module can determine the association between the at least two target text data blocks, which makes it easier for the compression dictionary shared storage module to subsequently generate the same compression dictionary for the at least two target text data blocks.

The identifier of the target text data block is used to uniquely identify the target text data block, and the identifier of the target text data block may be a name of the target text data block, which is not specifically limited in the embodiment of the present invention.

Step 402: a compression dictionary of the at least two target text data blocks is generated.

It should be noted that the compression dictionary of the at least two target text data blocks may include a compression code of each text data in each target text data block, or may include a compression code of each character in each target text data block, which is not specifically limited in this embodiment of the present invention. For example, suppose the text data included in the at least two target text data blocks are 101, 102, and 103. When the compression dictionary includes the compression code of each text data in the respective target text data blocks, the compression dictionary may correspondingly include compression code 1, compression code 2, and compression code 3, as shown in fig. 4C. When the compression dictionary includes the compression code of each character in the respective target text data blocks, if the characters included in text data 101 are 1, 2, and 3, with corresponding compression codes 1, 2, and 3; the characters included in text data 102 are 4, 5, and 6, with corresponding compression codes 4, 5, and 6; and the characters included in text data 103 are 7, 8, and 9, with corresponding compression codes 7, 8, and 9, then the compression dictionary may be as shown in fig. 4D.
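The two dictionary granularities can be contrasted with a short Python sketch. The numbering scheme (codes assigned in sorted order starting from 1) and the function names are assumptions for illustration; figs. 4C and 4D are not reproduced here.

```python
def build_item_dictionary(blocks):
    """One compression code per distinct text data item across all blocks."""
    items = sorted({item for block in blocks for item in block})
    return {item: code for code, item in enumerate(items, start=1)}

def build_char_dictionary(blocks):
    """One compression code per distinct character across all blocks."""
    chars = sorted({ch for block in blocks for item in block for ch in item})
    return {ch: code for code, ch in enumerate(chars, start=1)}

# Two illustrative target text data blocks sharing one dictionary.
target_blocks = [["101", "102"], ["103", "101"]]
item_dictionary = build_item_dictionary(target_blocks)
char_dictionary = build_char_dictionary(target_blocks)
```

In this sketch a per-item dictionary has one entry per distinct text data, while a per-character dictionary has one entry per distinct character, so the latter stays small even when the blocks contain many distinct text data.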

In addition, the compression dictionary may be generated based on a specified compression algorithm. The specified compression algorithm may be preset and may be an order-preserving compression algorithm, that is, the lexicographic order of the encoded strings remains the same as the lexicographic order of the original character strings. For example, the specified compression algorithm may be a Huffman coding algorithm, a Hu-Tucker coding algorithm, or the like, which is not specifically limited in the embodiment of the present invention.
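The order-preserving property can be illustrated with a deliberately simple scheme. The Python sketch below is neither Huffman coding nor Hu-Tucker coding; it assigns fixed-width binary codes by rank in sorted order, which is an assumption chosen only to show why sorting the codes gives the same order as sorting the original strings. The function name and sample data are hypothetical.

```python
def build_order_preserving_dictionary(items, width):
    """Assign fixed-width binary codes so that code order equals item order."""
    ordered = sorted(set(items))
    if len(ordered) > 2 ** width:
        raise ValueError("width too small for the number of distinct items")
    # Rank-based codes: the smallest item gets 00...0, the next 00...1, and so on.
    return {item: format(rank, "0{}b".format(width)) for rank, item in enumerate(ordered)}

items = ["pear", "apple", "fig", "apple"]
codes = build_order_preserving_dictionary(items, width=2)
decode = {code: item for item, code in codes.items()}

# Sort the compression codes, then decompress only the sorted result.
sorted_via_codes = [decode[c] for c in sorted(codes[i] for i in items)]
```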

The operation of generating the compression dictionary of the at least two target text data blocks based on the specified compression algorithm and the at least two target text data blocks may refer to related technologies, which are not described in detail in the embodiments of the present invention.

It should be noted that, when the at least two target text data blocks are stored in a distributed manner, the device storing the at least two target text data blocks may, upon receiving a compression instruction for them, send the at least two target text data blocks to the compression dictionary shared storage module. The compression dictionary shared storage module then generates a compression dictionary for the at least two target text data blocks and returns it to the device, so that the device can compress the at least two target text data blocks. Of course, the compression dictionary shared storage module may also actively acquire the at least two target text data blocks in order to generate the compression dictionary, which is not specifically limited in this embodiment of the present invention. The compression instruction is used to instruct compression of the at least two target text data blocks and may be triggered by a specified operation, which is not specifically limited in the embodiment of the present invention.

In addition, after the compression dictionary shared storage module generates the compression dictionary, the at least two target text data blocks can be deleted, so that the storage resources of the compression dictionary shared storage module are saved.

Furthermore, after generating the compression dictionary of the at least two target text data blocks, the compression dictionary shared storage module may store the identifier of each of the at least two target text data blocks, together with the compression dictionary, in the correspondence between text data block identifiers and compression dictionaries. When the device subsequently needs the compression dictionary of a target text data block, it can then obtain the dictionary simply and quickly from that correspondence based on the identifier of the target text data block.
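The correspondence between block identifiers and a shared compression dictionary can be sketched as a simple in-memory mapping. The class name `CompressionDictionaryStore` and its methods are hypothetical; the embodiment does not specify how the shared storage module is implemented.

```python
class CompressionDictionaryStore:
    """Illustrative stand-in for the compression dictionary shared storage module."""

    def __init__(self):
        self._by_block_id = {}

    def register(self, block_ids, dictionary):
        """Associate every block identifier with the same dictionary object."""
        for block_id in block_ids:
            self._by_block_id[block_id] = dictionary

    def lookup(self, block_id):
        """Return the dictionary for a block identifier, or None if absent."""
        return self._by_block_id.get(block_id)

store = CompressionDictionaryStore()
shared_dictionary = {"101": "00", "102": "01", "103": "10"}
store.register(["block1", "block2"], shared_dictionary)
```

Both identifiers point at one dictionary object, so the dictionary itself is held once regardless of how many target text data blocks share it.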

It should be noted that, in the related art, when compressing a certain text data block, a compression dictionary corresponding to the text data block is often generated first, then the text data block is compressed based on the compression dictionary of the text data block to obtain a compressed data block corresponding to the text data block, and then the compression dictionary of the text data block and the compressed data block corresponding to the text data block are stored together, so that the compressed data block corresponding to the text data block can be decompressed subsequently based on the compression dictionary of the text data block. That is to say, if a plurality of compressed data blocks are to be stored in the related art, the compression dictionary corresponding to each of the plurality of compressed data blocks needs to be stored, so that a large amount of storage resources are consumed. In the embodiment of the present invention, since the at least two target text data blocks all use the same compression dictionary, the compression dictionaries of the at least two target text data blocks only need to be stored in the compression dictionary shared storage module once, thereby saving storage resources.

In addition, in the related art, when generating the compression dictionaries of a plurality of text data blocks, the generation operation needs to be performed once for each text data block, that is, the compression dictionaries of all the text data blocks can be obtained only by performing the generation operation multiple times, which consumes more processing resources. In the embodiment of the present invention, the at least two target text data blocks may be determined in advance and share the same compression dictionary, so the compression dictionary of the at least two target text data blocks can be obtained by performing the generation operation on the at least two target text data blocks only once, thereby saving processing resources.

It should be noted that, in the embodiment of the present invention, the compression dictionary of the at least two target text data blocks may be determined through the above steps 401 and 402, and the operation of compressing and processing the at least two target text data blocks based on the compression dictionary may be implemented through the following steps 403 to 405.

As can be seen from the above description, the method provided in the embodiment of the present invention is applied to a distributed system, in which the compression dictionary shared storage module is disposed in a specific device of the plurality of devices. The plurality of text data blocks, however, are stored across the plurality of devices in a distributed manner, and because the at least two target text data blocks are among them, the at least two target text data blocks are likewise stored across the plurality of devices. Therefore, after the compression dictionary shared storage module generates the compression dictionary for the at least two target text data blocks, it may store the compression dictionary; when the distributed system needs to perform data processing on the at least two target text data blocks, it obtains the compression dictionary and performs the subsequent processing as shown in steps 403 to 405.

Step 403: compression dictionaries for the at least two target text data blocks are obtained.

It should be noted that the operation of acquiring the compression dictionary of the at least two target text data blocks may be performed by the data import module of the device that stores the at least two target text data blocks. Specifically, the data import module may send a compression dictionary acquisition request carrying the identifier of the target text data block to the compression dictionary shared storage module. On receiving the request, the compression dictionary shared storage module may obtain the compression dictionary of the target text data block from the stored correspondence between text data block identifiers and compression dictionaries based on the identifier of the target text data block, and send the compression dictionary to the data import module.

In addition, in practical application, when a compression instruction for the at least two target text data blocks is received, the compression dictionaries of the at least two target text data blocks may be obtained, and of course, the compression dictionaries of the at least two target text data blocks may also be obtained in other cases, as long as it is ensured that the compression dictionaries of the at least two target text data blocks are obtained before the at least two target text data blocks are compressed, which is not specifically limited in the embodiment of the present invention.

It should be noted that the compressing instruction is used to instruct to compress the at least two target text data blocks, and the compressing instruction may be triggered by a user, and of course, the compressing instruction may also be triggered when a certain trigger event is detected by the distributed system, which is not specifically limited in this embodiment of the present invention.

Further, when the compression dictionary of the target text data block is not stored in the compression dictionary shared storage module, the compression dictionary shared storage module needs to generate the compression dictionary of the target text data block based on the target text data block and a specified compression algorithm.
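The lookup with a fallback to generation can be sketched as below. This Python sketch models the stored correspondence as a plain `dict` and passes the block loader and dictionary builder in as callables; the names `get_or_generate`, `load_block`, and `build_dictionary` are assumptions for illustration, and the rank-based builder merely stands in for whatever specified compression algorithm is used.

```python
def get_or_generate(store, block_id, load_block, build_dictionary):
    """Return the stored dictionary for block_id, generating it if missing."""
    dictionary = store.get(block_id)
    if dictionary is None:
        # Not stored yet: generate from the block content and remember it.
        dictionary = build_dictionary(load_block(block_id))
        store[block_id] = dictionary
    return dictionary

generation_calls = []

def build_dictionary(items):
    """Placeholder builder: rank-based integer codes (illustrative only)."""
    generation_calls.append(1)
    return {item: code for code, item in enumerate(sorted(set(items)))}

stored_blocks = {"block1": ["102", "101", "102"]}
store = {}
first = get_or_generate(store, "block1", stored_blocks.__getitem__, build_dictionary)
second = get_or_generate(store, "block1", stored_blocks.__getitem__, build_dictionary)
```

The second request is served from the stored correspondence, so the generation operation runs only once in this sketch.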

Step 404: Each of the at least two target text data blocks is compressed based on the compression dictionary of the at least two target text data blocks to obtain at least two compressed data blocks.

It should be noted that the at least two target text data blocks correspond to the at least two compressed data blocks one to one, each compressed data block includes a plurality of compressed data, and for a certain compressed data block, a plurality of compressed data included in the compressed data block corresponds to a plurality of text data included in the target text data block corresponding to the compressed data block one to one, that is, each text data in the at least two target text data blocks has a unique corresponding compressed data in the at least two compressed data blocks.

In addition, based on the compression dictionaries of the at least two target text data blocks, operations of compressing each target text data block of the at least two target text data blocks to obtain at least two compressed data blocks may refer to related technologies, which are not described in detail in the embodiments of the present invention.

Step 405: When a processing instruction for performing the same processing operation on the at least two target text data blocks is received, the compressed data in the at least two compressed data blocks is processed, so as to implement the processing of the at least two target text data blocks.

Specifically, the operation of processing the compressed data in the at least two compressed data blocks to realize the processing of the at least two target text data blocks may be: determining a plurality of compressed data included in each of the at least two compressed data blocks; processing the at least two compressed data blocks based on a plurality of compressed data included in each of the at least two compressed data blocks to obtain processing results of the at least two compressed data blocks; and decompressing the processing results of the at least two compressed data blocks based on the compression dictionary to obtain the processing results of the at least two target text data blocks.

In the embodiment of the present invention, when processing the compressed data in the at least two compressed data blocks to implement processing of the at least two target text data blocks, the processing may be performed by a data processing module in the device, that is, the data processing module may process the compressed data in the at least two compressed data blocks, and then decompress the processing result based on the compression dictionaries of the at least two target text data blocks, thereby implementing processing of the at least two target text data blocks.

The operation of processing the at least two compressed data blocks based on the plurality of compressed data included in each of the at least two compressed data blocks is similar to the operation of processing two text data blocks based on the plurality of text data included in each of the two text data blocks in the related art, which is not described in detail in the embodiments of the present invention.

The operation of decompressing the processing results of the at least two compressed data blocks based on the compression dictionary to obtain the processing results of the at least two target text data blocks is similar to the operation of decompressing a certain compressed data block based on a compression dictionary corresponding to the certain compressed data block in the related art, which is not described in detail in the embodiment of the present invention.
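The three operations of step 405 (determining the compressed data, processing it, and decompressing only the result) can be sketched end to end. The Python below assumes fixed-length 2-bit codes held as bit strings and uses a join-style "find common values" operation as the example; the function names, the dictionary, and the block contents are all illustrative assumptions.

```python
def split_fixed(block, code_length):
    """Determine the compressed data in a block of fixed-length codes."""
    return [block[i:i + code_length] for i in range(0, len(block), code_length)]

def find_common(per_block_codes):
    """Example processing operation: codes that occur in every block."""
    common = set(per_block_codes[0])
    for codes in per_block_codes[1:]:
        common &= set(codes)
    return sorted(common)

def process_compressed(blocks, code_length, operation, decode):
    # 1. Determine the compressed data included in each compressed data block.
    per_block_codes = [split_fixed(b, code_length) for b in blocks]
    # 2. Process the blocks using compression codes only.
    result_codes = operation(per_block_codes)
    # 3. Decompress only the processing result with the shared dictionary.
    return [decode[c] for c in result_codes]

decode = {"00": "101", "01": "102", "10": "103", "11": "104"}
compressed_a = "100011"   # 103, 101, 104
compressed_b = "011110"   # 102, 104, 103
common_text = process_compressed([compressed_a, compressed_b], 2, find_common, decode)
```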

When determining the plurality of compressed data included in each of the at least two compressed data blocks, it may first be determined whether the code lengths of the plurality of compression codes included in the compression dictionary are equal, and the plurality of compressed data included in each of the at least two compressed data blocks is then determined according to that result in one of the following three cases:

In a first case, when the code lengths of the plurality of compression codes included in the compression dictionary are equal and the compression dictionary includes a compression code of each text data in each target text data block, for each of the at least two compressed data blocks, the code length of each compression code is determined as the length of each compressed data in the compressed data block, and a plurality of compressed data are sequentially determined from the compressed data block according to the length of each compressed data in the compressed data block.

When a plurality of compressed data are sequentially determined from the compressed data block according to the length of each compressed data in the compressed data block, the compressed data block may be sequentially divided according to the length of each compressed data in the compressed data block to obtain a plurality of compressed data.
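The first case can be sketched as follows, assuming the compressed data block is held as a string of bits and every compression code is 2 bits long. The dictionary contents and function names are hypothetical.

```python
def compress_block(items, dictionary):
    """Concatenate the fixed-length code of every text data item."""
    return "".join(dictionary[item] for item in items)

def split_fixed(compressed_block, code_length):
    """Divide the block sequentially into pieces of code_length bits each."""
    if len(compressed_block) % code_length != 0:
        raise ValueError("block length is not a multiple of the code length")
    return [compressed_block[i:i + code_length]
            for i in range(0, len(compressed_block), code_length)]

# Per-item dictionary in which every compression code has the same length.
dictionary = {"101": "00", "102": "01", "103": "10"}
compressed = compress_block(["101", "103", "102"], dictionary)
pieces = split_fixed(compressed, code_length=2)
```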

In a second case, when the code lengths of the plurality of compression codes included in the compression dictionary are equal and the compression dictionary includes a compression code of each character in each target text data block, for each compressed data block of the at least two compressed data blocks, the target text data block corresponding to the compressed data block is determined; the code length of each compression code is multiplied by the number of characters of each text data in the target text data block to obtain the length of each compressed data in the compressed data block; and a plurality of compressed data are sequentially determined from the compressed data block according to the length of each compressed data in the compressed data block.
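The second case can be sketched in the same style. The sketch assumes that the number of characters of each text data in the corresponding target text data block is available as a list (`char_counts`); how that information is obtained is not specified here, and the dictionary and function names are illustrative.

```python
def compress_block_by_char(items, char_dictionary):
    """Concatenate the fixed-length code of every character of every item."""
    return "".join(char_dictionary[ch] for item in items for ch in item)

def split_by_char_counts(compressed_block, code_length, char_counts):
    """Length of each compressed data = code length x number of characters."""
    pieces, position = [], 0
    for count in char_counts:
        length = code_length * count
        pieces.append(compressed_block[position:position + length])
        position += length
    if position != len(compressed_block):
        raise ValueError("character counts do not match the block length")
    return pieces

# Per-character dictionary with 2-bit codes (hypothetical values).
char_dictionary = {"a": "00", "b": "01", "c": "10"}
items = ["ab", "cab"]
compressed = compress_block_by_char(items, char_dictionary)
pieces = split_by_char_counts(compressed, 2, [len(item) for item in items])
```

Here the first text data has 2 characters and so occupies 2 x 2 = 4 bits, and the second has 3 characters and occupies 2 x 3 = 6 bits.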

In a third case, when the code lengths of the plurality of compression codes included in the compression dictionary are not equal, for each of the at least two compressed data blocks, a plurality of compressed data are determined from the compressed data block according to the data index of the compressed data block, where the data index of the compressed data block is used to indicate the position of each of the plurality of compressed data in the compressed data block.

Further, before determining a plurality of compressed data from the compressed data block according to the data index of the compressed data block, the method further includes: for the target text data block corresponding to the compressed data block, determining the position of compressed data corresponding to each text data in the target text data block in the compressed data block in the process of compressing the target text data block; and generating a data index of the compressed data block based on the position of the compressed data corresponding to each text data in the compressed data block.

For example, suppose that a certain text data aaabbbccc of the target text data block is compressed into the compressed data 0001010, where aaa corresponds to the code 00, bbb corresponds to 010, and ccc corresponds to 10. In order to obtain the compressed data corresponding to each field without decompressing the data, the positions are determined during the process of compressing the text data: the start position of aaa after compression is 0, the end position is 2, and the length is 2 - 0 = 2; the start position of bbb after compression is 2, the end position is 5, and the length is 5 - 2 = 3; the start position of ccc after compression is 5, the end position is 7, and the length is 7 - 5 = 2. Referring to fig. 4E, the start position 0 of compressed data 00, the start position 2 of compressed data 010, and the start position 5 of compressed data 10 can be determined as the data index of the compressed data block.
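The example above can be reproduced with a short sketch. It assumes that the data index is simply the list of start positions recorded during compression and that the block is a string of '0' and '1' characters; the function names compress_with_index and split_by_index are hypothetical.

```python
def compress_with_index(fields: list[str], dictionary: dict[str, str]) -> tuple[str, list[int]]:
    """Compress fields with variable-length codes and record each start position."""
    bits = []
    index = []
    pos = 0
    for field in fields:
        code = dictionary[field]
        index.append(pos)  # start position of this compressed data in the block
        bits.append(code)
        pos += len(code)
    return "".join(bits), index


def split_by_index(block: str, index: list[int]) -> list[str]:
    """Recover each compressed data from the block without decompressing it."""
    ends = index[1:] + [len(block)]  # each item ends where the next one starts
    return [block[start:end] for start, end in zip(index, ends)]


dictionary = {"aaa": "00", "bbb": "010", "ccc": "10"}
block, index = compress_with_index(["aaa", "bbb", "ccc"], dictionary)
parts = split_by_index(block, index)
```

Storing only the start positions is sufficient in this sketch, because the end position of each compressed data equals the start position of the next one, and the last one ends at the end of the block.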

Fig. 5A is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention. Referring to fig. 5A, the data processing apparatus includes:

an obtaining module 501, configured to obtain a compression dictionary of at least two target text data blocks, where the at least two target text data blocks are data blocks that are stored in a plurality of text data blocks and are subsequently processed through the same processing operation, each target text data block includes a plurality of text data, each text data includes a plurality of characters, and the compression dictionary includes a compression code of each text data in each target text data block or includes a compression code of each character in each target text data block;

a compressing module 502, configured to compress each target text data block of the at least two target text data blocks respectively based on the compression dictionary to obtain at least two compressed data blocks, where the at least two target text data blocks correspond to the at least two compressed data blocks one to one, each compressed data block includes multiple compressed data, and the multiple compressed data correspond to the multiple text data one to one;

a processing module 503, configured to, when receiving a processing instruction for performing the same processing operation on the at least two target text data blocks, process compressed data in the at least two compressed data blocks to implement processing of the at least two target text data blocks.

Further, the apparatus further comprises:

Wherein the determining module comprises:

a selecting unit configured to select at least two text data blocks belonging to a target category from the plurality of text data blocks, and determine the selected text data block as a target text data block; or,

a first determining unit, configured to, when a selection instruction for at least two of the plurality of text data blocks is detected, determine a text data block selected by the selection instruction as a target text data block.

Referring to fig. 5B, the processing module 503 includes:

a second determining unit 5031, configured to determine a plurality of compressed data included in each of the at least two compressed data blocks;

a processing unit 5032, configured to process, based on a plurality of compressed data included in each of the at least two compressed data blocks, the at least two compressed data blocks to obtain a processing result of the at least two compressed data blocks;

a decompressing unit 5033, configured to decompress the processing result of the at least two compressed data blocks based on the compression dictionary to obtain the processing result of the at least two target text data blocks.
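How the three units cooperate can be sketched as follows. The sketch assumes equal-length codes for each text data, a dictionary shared by both blocks in which each text data has a distinct code, and a simple equality match standing in for the processing operation; the function name match_compressed and the sample dictionary are hypothetical.

```python
def match_compressed(block_a: str, block_b: str, code_len: int,
                     dictionary: dict[str, str]) -> list[str]:
    """Return the text data of block_a that also occur in block_b.

    Assumption: both blocks were compressed with the same one-to-one
    dictionary, so equal compressed codes imply equal text data.
    """
    def split(block: str) -> list[str]:
        # Determining step: cut the block into equal-length compressed data.
        return [block[i:i + code_len] for i in range(0, len(block), code_len)]

    # Processing step: compare compressed data only; nothing is decompressed.
    codes_b = set(split(block_b))
    matched = [code for code in split(block_a) if code in codes_b]

    # Decompressing step: only the processing result is decompressed.
    decode = {code: text for text, code in dictionary.items()}
    return [decode[code] for code in matched]


dictionary = {"red": "00", "green": "01", "blue": "10"}
result = match_compressed("000110", "1000", 2, dictionary)
```

In this sketch only the two matching codes are decompressed, rather than all five compressed data in the two blocks.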

Wherein the second determining unit 5031 includes:

a first determining subunit, configured to determine, for each of the at least two compressed data blocks, a code length of each compressed code as a length of each compressed data in the compressed data block, when code lengths of a plurality of compressed codes included in the compression dictionary are equal and the compression dictionary includes a compressed code of each text data in the respective target text data blocks;

and a second determining subunit, configured to sequentially determine the plurality of compressed data from the compressed data block according to the length of each compressed data in the compressed data block.

Further, the second determining unit 5031 further comprises:

a third determining subunit, configured to, when the code lengths of the plurality of compressed codes included in the compression dictionary are equal and the compression dictionary includes a compressed code for each character in each target text data block, determine, for each of the at least two compressed data blocks, the target text data block corresponding to the compressed data block;

and a fourth determining subunit, configured to sequentially determine the plurality of compressed data from the compressed data block according to the length of each compressed data in the compressed data block.

Further, the second determining unit 5031 further comprises:

a fifth determining subunit, configured to, when the code lengths of the plurality of compressed codes included in the compression dictionary are not equal, determine, for each of the at least two compressed data blocks, the plurality of compressed data from the compressed data block according to a data index of the compressed data block, where the data index of the compressed data block is used to indicate the position of each of the plurality of compressed data in the compressed data block.

Further, the second determining unit 5031 further comprises:

and a generating subunit, configured to generate the data index of the compressed data block based on the position of the compressed data corresponding to each text data in the compressed data block.

It should be noted that, when the data processing apparatus provided in the above embodiment performs data processing, the division into the above functional modules is used only as an example. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the data processing apparatus and the data processing method provided by the above embodiments belong to the same concept; the specific implementation process is described in detail in the method embodiments and is not repeated here.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A method of data processing, the method comprising:

2. The method of claim 1, wherein before obtaining the compression dictionary of the at least two target text data blocks, the method further comprises:

3. The method of claim 2, wherein said determining the at least two target text data blocks from the stored plurality of text data blocks comprises:

4. The method of claim 1, wherein the processing compressed data in the at least two compressed data blocks to achieve processing of the at least two target text data blocks comprises:

5. The method of claim 4, wherein the determining the plurality of compressed data that each of the at least two compressed data blocks includes comprises:

6. The method of claim 5, wherein after determining whether the code lengths of the plurality of compressed codes included in the compression dictionary are equal, the method further comprises:

7. The method of claim 5, wherein after determining whether the code lengths of the plurality of compressed codes included in the compression dictionary are equal, the method further comprises:

8. The method of claim 7, wherein before determining the plurality of compressed data from the compressed data block according to the data index of the compressed data block, the method further comprises:

9. A data processing apparatus, characterized in that the apparatus comprises:

10. The apparatus of claim 9, wherein the apparatus further comprises:

11. The apparatus of claim 10, wherein the determining module comprises:

12. The apparatus of claim 9, wherein the processing module comprises:

13. The apparatus of claim 12, wherein the second determining unit comprises:

and a second determining subunit, configured to sequentially determine the plurality of compressed data from the compressed data block according to the length of each compressed data in the compressed data block.

14. The apparatus of claim 13, wherein the second determining unit further comprises:

15. The apparatus of claim 13, wherein the second determining unit further comprises:

16. The apparatus of claim 15, wherein the second determining unit further comprises: