WO2018014761A1 - Data processing method and apparatus - Google Patents

Data processing method and apparatus

Info

Publication number
WO2018014761A1
WO2018014761A1 PCT/CN2017/092527 CN2017092527W WO2018014761A1 WO 2018014761 A1 WO2018014761 A1 WO 2018014761A1 CN 2017092527 W CN2017092527 W CN 2017092527W WO 2018014761 A1 WO2018014761 A1 WO 2018014761A1
Authority
WO
WIPO (PCT)
Prior art keywords
text data
compressed
compressed data
data blocks
target text
Prior art date
Application number
PCT/CN2017/092527
Other languages
English (en)
French (fr)
Inventor
李雪斌
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2018014761A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present invention relates to the field of computer technologies, and in particular, to a data processing method and apparatus.
  • the text data refers to data composed of printable characters, including characters 33 to 127 of the American Standard Code for Information Interchange (ASCII), characters in Unicode, and characters in UTF-8.
  • ASCII American Standard Code for Information Interchange
  • the text data needs to be compressed first, and the compressed text data is then stored.
  • when the text data is to be processed, the compressed text data may be decompressed to obtain the text data; the text data is then subjected to processing such as comparison, sorting, searching, hashing, and joining operations, and the text data is analyzed based on the processing result.
  • a data processing method, specifically: generating, for each text data block in a plurality of stored text data blocks, a compression dictionary of the text data block, the text data block including a plurality of text data;
  • compressing the text data block based on the compression dictionary of the text data block to obtain a compressed data block corresponding to the text data block; and storing the compressed data block corresponding to the text data block.
  • a first text data block and a second text data block are any two text data blocks of the plurality of text data blocks; the compressed data block corresponding to the first text data block is decompressed to obtain the first text data block, and the compressed data block corresponding to the second text data block is decompressed to obtain the second text data block; the text data in the first text data block and the second text data block is then processed to obtain a processing result.
  • the compressed data block corresponding to the first text data block and the compressed data block corresponding to the second text data block must each be decompressed before the first text data block and the second text data block can be processed. Therefore, data processing takes longer and consumes more processing resources.
  • an embodiment of the present invention provides a data processing method and apparatus.
  • the technical solution is as follows:
  • a data processing method comprising:
  • obtaining a compression dictionary of at least two target text data blocks, wherein the at least two target text data blocks are data blocks that are subsequently processed by the same processing operation, each target text data block includes a plurality of text data, each text data includes a plurality of characters, and the compression dictionary includes a compressed code of each text data in the respective target text data blocks, or a compressed code of each character in the respective target text data blocks;
  • compressing each target text data block in the at least two target text data blocks based on the compression dictionary to obtain at least two compressed data blocks, wherein the at least two target text data blocks are in one-to-one correspondence with the at least two compressed data blocks, each of the compressed data blocks includes a plurality of compressed data, and the plurality of compressed data are in one-to-one correspondence with the plurality of text data;
  • the compression dictionary of the at least two target text data blocks may include a compression code of each text data in each target text data block, or may include a compression code of each character in each target text data block, that is,
  • the compression code in the compression dictionary may be in one-to-one correspondence with the text data, or may be in one-to-one correspondence with the characters, which is not specifically limited in the embodiment of the present invention.
  • the at least two target text data blocks are data blocks that are subsequently processed by the same processing operation, and the at least two target text data blocks correspond to the same compression dictionary; that is, the at least two target text data blocks share the same compression dictionary, so there is no need to generate a separate compression dictionary for each text data block.
  • the at least two compressed data blocks are obtained by compressing the at least two target text data blocks with the compression dictionary, and the at least two target text data blocks share the same compression dictionary; therefore, the compressed data in the at least two compressed data blocks can be processed directly, thereby implementing processing of the at least two target text data blocks.
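  • The contrast between per-block dictionaries and a shared dictionary can be sketched as follows. This is only an illustration under stated assumptions: the helper `build_dictionary`, the integer codes, and the sample blocks are hypothetical and are not taken from the application, which leaves the compression algorithm open.

```python
from typing import Dict, List


def build_dictionary(texts: List[str]) -> Dict[str, int]:
    """Hypothetical helper: give each distinct text data the next free integer code."""
    dictionary: Dict[str, int] = {}
    for text in texts:
        dictionary.setdefault(text, len(dictionary))
    return dictionary


block_a = ["red", "green"]
block_b = ["green", "blue"]

# One dictionary per block: the same text data receives different codes.
dict_a = build_dictionary(block_a)
dict_b = build_dictionary(block_b)
per_block_a = [dict_a[t] for t in block_a]
per_block_b = [dict_b[t] for t in block_b]

# One dictionary shared by both blocks: equal codes mean equal text data.
shared = build_dictionary(block_a + block_b)
shared_a = [shared[t] for t in block_a]
shared_b = [shared[t] for t in block_b]

# "green" is comparable across blocks only when the dictionary is shared.
green_matches_per_block = per_block_a[1] == per_block_b[0]
green_matches_shared = shared_a[1] == shared_b[0]
```

  • With per-block dictionaries the codes for "green" differ between the two blocks, so a comparison would require decompression first; with the shared dictionary the codes can be compared as they are.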
  • before the acquiring of the compression dictionary of the at least two target text data blocks, the method further includes:
  • after the at least two target text data blocks are determined, the compression dictionary of the at least two target text data blocks may be generated.
  • when generating the compression dictionary of the at least two target text data blocks, the compression dictionary may be generated based on a specified compression algorithm and the at least two target text data blocks. Of course, in practical applications, the compression dictionary of the at least two target text data blocks may be generated in other manners, which is not specifically limited in the embodiment of the present invention.
  • the determining of the at least two target text data blocks from the stored plurality of text data blocks includes:
  • when a selection instruction for at least two text data blocks in the plurality of text data blocks is received, the text data block selected by the selection instruction is determined as a target text data block.
  • because the text data blocks under the target category can be processed by the same processing operation, in the embodiment of the present invention, at least two text data blocks belonging to the target category can be selected, and the selected text data blocks are determined as the target text data blocks.
  • this determination operation is simple and convenient and requires no user participation, so the target text data blocks are determined with high efficiency.
  • when the target text data block is determined according to the selection instruction, it is determined according to the user operation, which ensures that the determined target text data block meets the user's needs.
  • the processing of the compressed data in the at least two compressed data blocks to implement processing of the at least two target text data blocks includes:
  • the compression dictionary stores the compressed code of each text data in each target text data block, or the compressed code of each character in each target text data block.
  • compressing the at least two target text data blocks therefore converts the text data in the at least two target text data blocks into compressed data.
  • because the conversion rule is fixed, the processing result obtained by operating on the compressed data in the at least two compressed data blocks is also composed of compressed codes.
  • therefore, when the processing result of the at least two compressed data blocks is decompressed based on the compression dictionary, the result composed of compressed codes is converted into text data form; because the conversion rule is fixed, the converted result is the processing result of the at least two target text data blocks.
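  • The equivalence described above can be checked on a small case. The sketch below assumes a text-level dictionary with integer codes and uses a join-like operation (keeping the text data of one block that also appears in the other); the function names and sample values are hypothetical.

```python
from typing import Dict, List


def compress(block: List[str], dictionary: Dict[str, int]) -> List[int]:
    """Replace each text data with its compressed code."""
    return [dictionary[text] for text in block]


def decompress(codes: List[int], dictionary: Dict[str, int]) -> List[str]:
    """Map compressed codes back to text data using the same dictionary."""
    reverse = {code: text for text, code in dictionary.items()}
    return [reverse[code] for code in codes]


# Shared dictionary for both target text data blocks (illustrative codes).
shared = {"101": 0, "102": 1, "103": 2, "104": 3}
block_a = ["101", "102", "103"]
block_b = ["103", "104", "101"]

codes_a = compress(block_a, shared)
codes_b = compress(block_b, shared)

# Process the compressed data directly: keep codes of block A found in block B.
codes_in_b = set(codes_b)
result_codes = [code for code in codes_a if code in codes_in_b]

# Decompress only the processing result, not the blocks themselves.
result_from_compressed = decompress(result_codes, shared)

# Reference: the same operation performed on the uncompressed text data.
texts_in_b = set(block_b)
result_from_text = [text for text in block_a if text in texts_in_b]
```

  • Both paths yield the same text data here because the mapping between text data and codes is one-to-one. Operations that depend on ordering, such as sorting, would additionally need codes that preserve the order of the text data, which this sketch does not assume.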
  • in the related art, when the plurality of text data blocks are stored in a distributed manner and the same processing operation is performed on two text data blocks in the plurality of text data blocks, the compression dictionaries of the two text data blocks are different. The two compressed data blocks corresponding to the two stored text data blocks therefore need to be decompressed separately to obtain the two text data blocks, and the two text data blocks are then transferred to a device that performs data processing and are processed by that device.
  • in the embodiment of the present invention, processing the compressed data in the at least two compressed data blocks implements processing of the at least two target text data blocks.
  • therefore, when the at least two target text data blocks are processed, the at least two compressed data blocks corresponding to the at least two target text data blocks may be transmitted to the device performing data processing, and the device processes the compressed data in the at least two compressed data blocks, thereby implementing processing of the at least two target text data blocks. Compared with the related-art method of decompressing the compressed data blocks before processing the text data blocks, the embodiment of the present invention processes the target text data blocks directly on the basis of the compressed data blocks, which reduces data processing time and saves processing resources.
  • the determining of the plurality of compressed data included in each of the at least two compressed data blocks includes:
  • when the code lengths of the plurality of compression codes included in the compression dictionary are equal and the compression dictionary includes a compression code of each text data in each of the target text data blocks, for each compressed data block of the at least two compressed data blocks, determining the code length of each compression code as the length of each compressed data in the compressed data block;
  • the compression dictionary is generated by a specified compression algorithm, and for different compression algorithms the code lengths of the plurality of compression codes included in the generated compression dictionary may or may not be equal. Because the compressed data block is obtained by compression based on the compression dictionary, when determining the plurality of compressed data included in each of the at least two compressed data blocks, it may first be determined whether the code lengths of the plurality of compression codes included in the compression dictionary are equal.
  • when the compression dictionary includes a compression code of each text data in each target text data block, each text data was compressed as a unit based on its compression code when the target text data block was compressed; therefore, the code length of each compression code in the compression dictionary can be determined as the length of each compressed data in the compressed data block.
  • when the compression dictionary includes a compression code of each character in each target text data block, each character was compressed based on its compression code when the target text data block was compressed; since one text data can include a plurality of characters, the code length of each compression code in the compression dictionary can be multiplied by the number of characters of each text data in the target text data block to obtain the length of each compressed data in the compressed data block.
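  • The two equal-code-length cases can be illustrated as follows. The sketch represents a compressed data block as a string of "0" and "1" characters purely for readability; the dictionaries, code lengths, and the helper `split_by_lengths` are assumptions made for the example.

```python
from typing import List


def split_by_lengths(compressed: str, lengths: List[int]) -> List[str]:
    """Cut the compressed data block into consecutive pieces of the given lengths."""
    pieces, position = [], 0
    for length in lengths:
        pieces.append(compressed[position:position + length])
        position += length
    return pieces


# Case 1: one 2-bit code per text data, so every compressed data is 2 bits long.
text_codes = {"101": "00", "102": "01", "103": "10", "104": "11"}
block_1 = ["101", "102", "103", "104"]
compressed_1 = "".join(text_codes[text] for text in block_1)
code_length_1 = 2
pieces_1 = split_by_lengths(compressed_1, [code_length_1] * len(block_1))

# Case 2: one 3-bit code per character, so the length of each compressed data
# is the code length multiplied by the number of characters of that text data.
char_codes = {"0": "000", "1": "001", "2": "010", "3": "011", "4": "100"}
block_2 = ["10", "1024", "333"]
compressed_2 = "".join(char_codes[ch] for text in block_2 for ch in text)
code_length_2 = 3
lengths_2 = [code_length_2 * len(text) for text in block_2]
pieces_2 = split_by_lengths(compressed_2, lengths_2)
```

  • In the second case the character counts come from the target text data block that corresponds to the compressed data block, which is why that block has to be identified before the lengths can be computed.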
  • before the determining of the plurality of compressed data from the compressed data block according to the data index of the compressed data block, the method further includes:
  • the start position and the end position, in the compressed data block, of the compressed data corresponding to each text data may be determined, and the position of the compressed data corresponding to each text data in the compressed data block is then uniquely determined by the start position and the end position.
  • the starting position of each compressed data in the compressed data block may be determined as the data index of the compressed data block.
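  • For codes of unequal length, a data index of start positions can be recorded while compressing and used later to locate each compressed data. The following is a minimal sketch assuming a prefix-free variable-length dictionary and a bit-string representation; the names `compress_with_index` and `split_by_index` are hypothetical.

```python
from typing import Dict, List, Tuple


def compress_with_index(block: List[str], dictionary: Dict[str, str]) -> Tuple[str, List[int]]:
    """Compress a block and record where each compressed data starts."""
    pieces: List[str] = []
    index: List[int] = []
    position = 0
    for text in block:
        code = dictionary[text]
        index.append(position)      # start position of this compressed data
        pieces.append(code)
        position += len(code)       # the next compressed data starts here
    return "".join(pieces), index


def split_by_index(compressed: str, index: List[int]) -> List[str]:
    """Recover each compressed data from its start position in the index."""
    ends = index[1:] + [len(compressed)]
    return [compressed[start:end] for start, end in zip(index, ends)]


# Variable-length codes (illustrative values only).
variable_codes = {"101": "0", "102": "10", "103": "110", "104": "111"}
block = ["102", "101", "104", "101"]

compressed, data_index = compress_with_index(block, variable_codes)
compressed_data = split_by_index(compressed, data_index)
```

  • Storing only start positions is enough in this sketch because each compressed data ends where the next one begins, and the last one ends at the end of the block.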
  • a data processing apparatus comprising:
  • an obtaining module configured to acquire a compression dictionary of at least two target text data blocks, where the at least two target text data blocks are data blocks that are subsequently processed by the same processing operation, each target text data block includes a plurality of text data, each text data includes a plurality of characters, and the compression dictionary includes a compression code of each text data in the respective target text data blocks, or a compression code of each character in the respective target text data blocks;
  • each compressed data block includes a plurality of compressed data, and the plurality of compressed data are in one-to-one correspondence with the plurality of text data;
  • a processing module configured to, when a processing instruction for performing the same processing operation on the at least two target text data blocks is received, process the compressed data in the at least two compressed data blocks to implement processing of the at least two target text data blocks.
  • the device further includes:
  • a determining module configured to determine the at least two target text data blocks from the stored plurality of text data blocks
  • a generating module configured to generate a compression dictionary of the at least two target text data blocks.
  • the determining module includes:
  • a selecting unit configured to select at least two text data blocks belonging to the target category from the plurality of text data blocks, and determine the selected text data blocks as the target text data blocks;
  • a first determining unit configured to, when a selection instruction for at least two text data blocks in the plurality of text data blocks is received, determine the text data block selected by the selection instruction as a target text data block.
  • the processing module includes:
  • a second determining unit configured to determine a plurality of compressed data included in each of the at least two compressed data blocks
  • a processing unit configured to process the at least two compressed data blocks based on the plurality of compressed data included in each of the at least two compressed data blocks to obtain a processing result of the at least two compressed data blocks;
  • a decompression unit configured to decompress the processing result of the at least two compressed data blocks based on the compression dictionary to obtain a processing result of the at least two target text data blocks.
  • the second determining unit includes:
  • a determining subunit configured to determine whether code lengths of the plurality of compressed codes included in the compression dictionary are equal
  • a first determining subunit configured to: when the code lengths of the plurality of compressed codes included in the compression dictionary are equal and the compression dictionary includes a compression code of each text data in each of the target text data blocks, for each compressed data block of the at least two compressed data blocks, determine the code length of each compressed code as the length of each compressed data in the compressed data block;
  • a second determining subunit configured to sequentially determine the plurality of compressed data from the compressed data block according to a length of each compressed data in the compressed data block
  • the second determining unit further includes:
  • a third determining subunit configured to: when the code lengths of the plurality of compressed codes included in the compression dictionary are equal and the compression dictionary includes a compression code of each character in each of the target text data blocks, for each compressed data block of the at least two compressed data blocks, determine the target text data block corresponding to the compressed data block;
  • An operation subunit configured to multiply a code length of each compressed code by a number of characters of each text data in the target text data block to obtain a length of each compressed data in the compressed data block;
  • a fourth determining subunit configured to sequentially determine the plurality of compressed data from the compressed data block according to a length of each compressed data in the compressed data block.
  • the second determining unit further includes:
  • a fifth determining subunit configured to, when the code lengths of the plurality of compressed codes included in the compression dictionary are not equal, for each compressed data block of the at least two compressed data blocks, determine the plurality of compressed data from the compressed data block according to the data index of the compressed data block, where the data index of the compressed data block is used to indicate the position of each of the plurality of compressed data in the compressed data block.
  • the second determining unit further includes:
  • a sixth determining subunit configured to, for the target text data block corresponding to the compressed data block, determine, in the process of compressing the target text data block, the location in the compressed data block of the compressed data corresponding to each text data in the target text data block;
  • a generating subunit configured to generate a data index of the compressed data block based on the location of the compressed data corresponding to the respective text data in the compressed data block.
  • in the embodiment of the present invention, for at least two target text data blocks that are subsequently processed by the same processing operation, a compression dictionary of the at least two target text data blocks is acquired, and each target text data block in the at least two target text data blocks is compressed based on the compression dictionary to obtain at least two compressed data blocks; that is, the at least two target text data blocks can share the same compression dictionary, and a compression dictionary does not need to be generated for each text data block.
  • because the compression dictionary of the at least two target text data blocks is the same, that is, the at least two target text data blocks are compressed by the same compression standard, the result of processing the compressed data in the at least two compressed data blocks and then decompressing the processing result with the compression dictionary is the same as the result of processing the text data in the at least two target text data blocks. The embodiment of the present invention therefore implements processing of the at least two target text data blocks by processing the compressed data in the at least two compressed data blocks, without decompressing the at least two compressed data blocks, which reduces the amount of data processed, shortens data processing time, and saves processing resources.
  • FIG. 1 is a schematic structural diagram of a data processing system according to an embodiment of the present invention.
  • FIG. 2 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
  • FIG. 3 is a flowchart of a data processing method according to an embodiment of the present invention.
  • FIG. 4A is a flowchart of another data processing method according to an embodiment of the present invention.
  • FIG. 4B is a schematic diagram of a target text data block involved in the embodiment of FIG. 4A;
  • FIG. 4C is a schematic diagram of a correspondence between a target text data block and a compression dictionary according to the embodiment of FIG. 4A;
  • FIG. 4D is a schematic diagram of a correspondence between another target text data block and a compression dictionary according to the embodiment of FIG. 4A;
  • FIG. 4E is a schematic diagram of a compressed field index involved in the embodiment of FIG. 4A;
  • FIG. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention.
  • FIG. 5B is a schematic structural diagram of a processing module according to an embodiment of the present invention.
  • FIG. 1 is a schematic structural diagram of a data processing system according to an embodiment of the present invention.
  • the system may or may not be a distributed system.
  • a distributed system is taken as an example for description.
  • the distributed system includes a plurality of devices, namely device 01, device 02, device 03, and device 0n; the plurality of devices are connected to each other, and each of the plurality of devices may be a terminal or a server, which is not specifically limited in this embodiment of the present invention.
  • each of the plurality of devices may include a data importing module and a data processing module.
  • a designated device of the plurality of devices may include not only a data importing module and a data processing module, but also a compression dictionary configuration module and a compression dictionary shared storage module; the designated device may be any one of the plurality of devices.
  • the distributed system can store a plurality of text data blocks
  • the compression dictionary configuration module is configured to configure a compression dictionary for the plurality of text data blocks stored in the distributed system, where the compression dictionary is used to compress the text data block.
  • the distributed system may further store a plurality of compressed data blocks, where the plurality of compressed data blocks are in one-to-one correspondence with the plurality of text data blocks. Note that at least two text data blocks in the plurality of text data blocks that need to undergo the same processing operation may share a compression dictionary; that is, the at least two text data blocks may correspond to the same compression dictionary.
  • the compression dictionary shared storage module is configured to correspondingly store the identifiers of the plurality of text data blocks and the corresponding compression dictionary;
  • the data importing module is configured to obtain the compression dictionary of the at least two target text data blocks from the compression dictionary shared storage module when data processing is performed on the at least two target text data blocks;
  • the data processing module is configured to process the compressed data in the compressed data blocks corresponding to the at least two target text data blocks, and to decompress the processing result based on the compression dictionary obtained by the data importing module, to implement processing of the at least two target text data blocks.
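  • The role of the compression dictionary shared storage module can be pictured with a small in-memory stand-in. This is a hedged sketch only: the class name, its methods, and the block identifiers are hypothetical, and the application does not specify how the module is implemented.

```python
from typing import Dict, Iterable, List


class CompressionDictionarySharedStorage:
    """Hypothetical stand-in: maps text data block identifiers to their dictionary."""

    def __init__(self) -> None:
        self._dictionary_of_block: Dict[str, Dict[str, int]] = {}

    def store(self, block_ids: Iterable[str], dictionary: Dict[str, int]) -> None:
        """Record that every listed block corresponds to the same dictionary."""
        for block_id in block_ids:
            self._dictionary_of_block[block_id] = dictionary

    def lookup(self, block_ids: List[str]) -> Dict[str, int]:
        """Return the dictionary shared by the listed blocks, or fail if they differ."""
        dictionaries = [self._dictionary_of_block[block_id] for block_id in block_ids]
        first = dictionaries[0]
        if any(candidate is not first for candidate in dictionaries[1:]):
            raise ValueError("blocks do not share one compression dictionary")
        return first


storage = CompressionDictionarySharedStorage()
shared = {"101": 0, "102": 1, "103": 2, "104": 3}
storage.store(["block-1", "block-2"], shared)   # two blocks, one dictionary
storage.store(["block-3"], {"abc": 0})          # a block with its own dictionary

# What a data importing module would fetch before processing block-1 and block-2.
imported = storage.lookup(["block-1", "block-2"])
```

  • Keying the store by block identifier mirrors the description that identifiers and dictionaries are stored correspondingly; rejecting blocks with different dictionaries is a design choice of this sketch.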
  • the computer device includes at least one processor 201, a communication bus 202, a memory 203, and at least one communication interface 204.
  • the processor 201 can be a general purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the program of the present invention.
  • CPU central processing unit
  • ASIC application-specific integrated circuit
  • Communication bus 202 can include a path for communicating information between the components described above.
  • the memory 203 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
  • ROM read-only memory
  • RAM random access memory
  • Memory 203 may be present independently and coupled to processor 201 via communication bus 202.
  • the memory 203 can also be integrated with the processor 201.
  • the communication interface 204 uses any transceiver-type device to communicate with other devices or communication networks, such as Ethernet, a Radio Access Network (RAN), a Wireless Local Area Network (WLAN), and the like.
  • RAN Radio Access Network
  • WLAN Wireless Local Area Networks
  • processor 201 may include one or more CPUs, such as CPU0 and CPU1 shown in FIG.
  • a computer device can include multiple processors, such as processor 201 and processor 208 shown in FIG. Each of these processors can be a single-CPU processor or a multi-core processor.
  • a processor herein may refer to one or more devices, circuits, and/or processing cores for processing data, such as computer program instructions.
  • the computer device can also include an output device 205 and an input device 206.
  • Output device 205 is in communication with processor 201 and can display information in a variety of ways.
  • the output device 205 can be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, a projector, or the like.
  • the input device 206 is in communication with the processor 201 and can receive user input in a variety of ways.
  • the input device 206 can be a mouse, a keyboard, a touch screen device, a sensing device, or the like.
  • the computer device described above may be a general purpose computer device or a special purpose computer device.
  • the computer device may be a desktop computer, a portable computer, a network server, a personal digital assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, or an embedded device.
  • PDA personal digital assistant
  • Embodiments of the invention do not limit the type of computer device.
  • the memory 203 is used to store program code for executing the solution of the present invention, and is controlled by the processor 201 for execution.
  • the processor 201 is configured to execute the program code 210 stored in the memory 203.
  • one or more software modules (e.g., a data import module, a data processing module, a compression dictionary configuration module, and a compression dictionary shared storage module) may be included in the program code 210.
  • the devices in the distributed system shown in FIG. 1 may process the data through the processor 201 and one or more software modules in the program code 210 in the memory 203.
  • FIG. 3 is a flowchart of a data processing method according to an embodiment of the present invention. Referring to Figure 3, the method includes:
  • Step 301 Acquire a compression dictionary of at least two target text data blocks, where the at least two target text data blocks are data blocks that are subsequently processed by the same processing operation, and each target text data block includes a plurality of text data, each of the text data including a plurality of characters, the compression dictionary including a compression code of each text data in the respective target text data blocks, or a compression code including each character in the respective target text data blocks.
  • Step 302 Compress each target text data block in the at least two target text data blocks respectively according to the compression dictionary, to obtain at least two compressed data blocks, and the at least two target text data blocks and the at least two The compressed data blocks are in one-to-one correspondence, and each of the compressed data blocks includes a plurality of compressed data, and the plurality of compressed data are in one-to-one correspondence with the plurality of text data.
  • Step 303 When receiving a processing instruction for performing the same processing operation on the at least two target text data blocks, process the compressed data in the at least two compressed data blocks to implement processing of the at least two target text data blocks.
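  • Steps 301 to 303 can be walked through end to end with a search operation as the processing instruction. The sketch assumes a text-level dictionary with integer codes; `acquire_dictionary`, `compress_block`, `search`, and the sample blocks are hypothetical names and values chosen for illustration.

```python
from typing import Dict, List, Tuple


def acquire_dictionary(blocks: List[List[str]]) -> Dict[str, int]:
    """Step 301 (sketch): one dictionary covering all target text data blocks."""
    dictionary: Dict[str, int] = {}
    for block in blocks:
        for text in block:
            dictionary.setdefault(text, len(dictionary))
    return dictionary


def compress_block(block: List[str], dictionary: Dict[str, int]) -> List[int]:
    """Step 302 (sketch): one compressed data per text data, in the same order."""
    return [dictionary[text] for text in block]


def search(compressed_blocks: List[List[int]], dictionary: Dict[str, int],
           key: str) -> List[Tuple[int, int]]:
    """Step 303 (sketch): find the key by comparing codes, without decompressing."""
    code = dictionary.get(key)
    if code is None:
        return []  # the key occurs in none of the target blocks
    return [(block_no, position)
            for block_no, block in enumerate(compressed_blocks)
            for position, value in enumerate(block)
            if value == code]


target_blocks = [["101", "102", "103", "104"], ["103", "105", "103"]]
dictionary = acquire_dictionary(target_blocks)
compressed_blocks = [compress_block(block, dictionary) for block in target_blocks]

hits = search(compressed_blocks, dictionary, "103")
misses = search(compressed_blocks, dictionary, "999")
```

  • Each hit is a (block number, position) pair. Only the search key is translated through the dictionary; the compressed data blocks are scanned as stored.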
  • in the embodiment of the present invention, for at least two target text data blocks that are subsequently processed by the same processing operation, a compression dictionary of the at least two target text data blocks is acquired, and each target text data block in the at least two target text data blocks is compressed based on the compression dictionary to obtain at least two compressed data blocks; that is, the at least two target text data blocks can share the same compression dictionary without generating a compression dictionary for each text data block.
  • because the compression dictionary of the at least two target text data blocks is the same, that is, the at least two target text data blocks are compressed by the same compression standard, the result of processing the compressed data in the at least two compressed data blocks and then decompressing the processing result with the compression dictionary is the same as the result of processing the text data in the at least two target text data blocks. The embodiment of the present invention therefore implements processing of the at least two target text data blocks by processing the compressed data in the at least two compressed data blocks, without decompressing the at least two compressed data blocks, which reduces the amount of data processed, shortens data processing time, and saves processing resources.
  • before acquiring the compression dictionary of the at least two target text data blocks, the method further includes:
  • generating a compression dictionary of the at least two target text data blocks.
  • determining at least two target text data blocks from the stored plurality of text data blocks includes:
  • the text data block selected by the selection instruction is determined as the target text data block.
  • processing the compressed data in the at least two compressed data blocks to implement processing of the at least two target text data blocks includes:
  • determining a plurality of compressed data included in each of the at least two compressed data blocks includes:
  • when the code lengths of the plurality of compressed codes included in the compression dictionary are equal and the compression dictionary includes a compression code of each text data in each target text data block, for each compressed data block of the at least two compressed data blocks, the code length of each compressed code is determined as the length of each compressed data in the compressed data block;
  • a plurality of compressed data are sequentially determined from the compressed data block in accordance with the length of each compressed data in the compressed data block.
  • the method further includes:
  • a plurality of compressed data are sequentially determined from the compressed data block in accordance with the length of each compressed data in the compressed data block.
  • the method further includes:
  • the data index of the compressed data block is used to indicate a location of each of the plurality of compressed data in the compressed data block.
  • Before determining the plurality of compressed data from the compressed data block according to the data index of the compressed data block, the method further includes:
  • for the target text data block corresponding to the compressed data block, determining, in the process of compressing the target text data block, the location in the compressed data block of the compressed data corresponding to each text data in the target text data block;
  • generating a data index of the compressed data block based on the location in the compressed data block of the compressed data corresponding to each text data.
  • FIG. 4A is a flowchart of a data processing method according to an embodiment of the present invention. Referring to Figure 4A, the method includes:
  • Step 401 Determine at least two target text data blocks from the stored plurality of text data blocks.
  • the plurality of text data blocks may be stored in a specified format, and the specified format may be preset.
  • the specified format may be a text file format, a Parquet format, a SequenceFile format, an RCFile format, an Avro format, or the like.
  • the embodiment of the present invention does not specifically limit this.
  • The plurality of text data blocks may be stored in a distributed manner, meaning that the plurality of text data blocks are distributed across and stored on multiple independent devices. For example, the plurality of text data blocks may be stored based on the Hadoop Distributed File System (HDFS), which is not specifically limited in the embodiment of the present invention.
  • the at least two target text data blocks are data blocks that are subsequently processed by the same processing operation, and each target text data block includes a plurality of text data, and each of the text data includes a plurality of characters.
  • a certain target text data block in the at least two target text data blocks includes a plurality of text data, and the plurality of text data are 101, 102, 103, and 104.
  • The processing operation may include a comparison, a sort, a search, a hash operation, a join operation, and the like, which is not specifically limited in the embodiment of the present invention.
  • The text data refers to data composed of printable characters, including characters 33 to 127 in ASCII, characters in UNICODE, characters in UTF-8, and the like, which is not specifically limited in the embodiment of the present invention.
  • At least two text data blocks belonging to the target category may be selected from the plurality of text data blocks, and the selected text data blocks are determined as the target text data blocks; or, when a selection instruction for at least two of the plurality of text data blocks is detected, the text data blocks selected by the selection instruction are determined as the target text data blocks.
  • the text data blocks in the target category can be processed by the same processing operation.
  • the target category can be a number, a time, a user name, and the like, which is not specifically limited in this embodiment of the present invention.
  • The selection instruction is used to select target text data blocks from the plurality of text data blocks. The selection instruction may be triggered by a user through a specified operation, which may be a click operation, a double-click operation, a voice operation, or the like, and is not specifically limited in the embodiment of the present invention.
  • Because the text data blocks under the target category can be processed by the same processing operation, in the embodiment of the present invention at least two text data blocks belonging to the target category can be selected and determined as the target text data blocks. This is simple and convenient to perform and does not require user participation, so the efficiency of determining the target text data blocks can be improved. When the text data blocks selected by a selection instruction are determined as the target text data blocks, the target text data blocks are in effect determined according to the user operation, which ensures that the determined target text data blocks meet the user's needs.
  • For example, if the text data blocks belonging to the target category are text data block 1, text data block 2, text data block 3, and text data block 4, then text data block 1, text data block 2, text data block 3, and text data block 4 may be selected from the plurality of text data blocks, after which the selected text data block 1, text data block 2, text data block 3, and text data block 4 are determined as target text data blocks.
  • For another example, the text data block 1, the text data block 2, the text data block 3, and the text data block 4 selected by the selection instruction may be determined as target text data blocks.
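  • The category-based selection described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `select_target_blocks`, the mapping from block identifier to category, and the fifth block are hypothetical and do not come from the embodiment.

```python
def select_target_blocks(block_categories, target_category):
    """Return the identifiers of the text data blocks in the target category.

    block_categories: hypothetical mapping {block identifier: category}.
    """
    selected = [block_id
                for block_id, category in block_categories.items()
                if category == target_category]
    # The method operates on at least two target text data blocks.
    if len(selected) < 2:
        raise ValueError("need at least two target text data blocks")
    return selected


# Hypothetical catalogue of stored text data blocks and their categories.
block_categories = {
    "text data block 1": "number",
    "text data block 2": "number",
    "text data block 3": "number",
    "text data block 4": "number",
    "text data block 5": "user name",
}

targets = select_target_blocks(block_categories, "number")
```

  • Selection by a user-triggered selection instruction would simply supply the list of identifiers directly instead of filtering by category.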
  • The operation of determining the at least two target text data blocks may be performed by a compression dictionary configuration module in the distributed system. In actual application, after determining the at least two target text data blocks, the compression dictionary configuration module may further send the identifier of each target text data block in the at least two target text data blocks to the compression dictionary shared storage module, so that the compression dictionary shared storage module can determine the association relationship between the at least two target text data blocks, which facilitates subsequent generation of the same compression dictionary by the compression dictionary shared storage module for the at least two target text data blocks.
  • the identifier of the target text data block is used to uniquely identify the target text data block, and the identifier of the target text data block may be the name of the target text data block, etc., which is not specifically limited in this embodiment of the present invention.
  • Step 402 Generate a compression dictionary of the at least two target text data blocks.
  • The compression dictionary of the at least two target text data blocks may be generated. When generating the compression dictionary of the at least two target text data blocks, the compression dictionary may be generated based on the specified compression algorithm and the at least two target text data blocks. Of course, in practical applications, the compression dictionary of the at least two target text data blocks may also be generated in other manners, which is not specifically limited in the embodiment of the present invention.
  • The compression dictionary of the at least two target text data blocks may include a compression code of each text data in each target text data block, or may include a compression code of each character in each target text data block, which is not specifically limited in the embodiment of the present invention.
  • For example, the text data included in the at least two target text data blocks are 101, 102, and 103. When the compression dictionary of the at least two target text data blocks includes the compression code of each text data in each target text data block, the text data 101, 102, and 103 may correspond to a compression code 1, a compression code 2, and a compression code 3, respectively, and the compression dictionary may be as shown in FIG. 4C. When the compression dictionary of the at least two target text data blocks includes a compression code of each character in each target text data block, the compressed codes corresponding to the characters included in the text data 101 are the compressed code 1, the compressed code 2, and the compressed code 3; the characters included in the text data 102 are 4, 5, and 6, and the corresponding compressed codes are a compressed code 4, a compressed code 5, and a compressed code 6; the characters included in the text data 103 are 7, 8, and 9, and the corresponding compressed codes are a compressed code 7, a compressed code 8, and a compressed code 9. In this case the compression dictionary may be as shown in FIG. 4D.
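  • The two dictionary granularities described above can be sketched as follows. The helper `assign_codes`, the fixed-width binary codes, and the contents of the two blocks are assumptions made for illustration; the embodiment does not prescribe how the codes are chosen.

```python
def assign_codes(symbols):
    """Give each distinct symbol a fixed-width binary code, in sorted order."""
    ordered = sorted(set(symbols))
    # Smallest width that can number every distinct symbol (at least 1 bit).
    width = max(1, (len(ordered) - 1).bit_length())
    return {sym: format(rank, f"0{width}b") for rank, sym in enumerate(ordered)}


# Two hypothetical target text data blocks that will share one dictionary.
blocks = [["101", "102"], ["102", "103"]]
all_texts = [text for block in blocks for text in block]

# Text-level dictionary: one compression code per text data.
text_dictionary = assign_codes(all_texts)

# Character-level dictionary: one compression code per character.
all_chars = [ch for text in all_texts for ch in text]
char_dictionary = assign_codes(all_chars)
```

  • In both cases a single dictionary is built from the contents of all target text data blocks together, which is what allows the blocks to share it.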
  • The specified compression algorithm may be preset, and may be an order-preserving compression algorithm, that is, one in which the lexicographic order of the encoded values is kept the same as the lexicographic order of the original strings. For example, the specified compression algorithm may be the Huffman coding algorithm, the Hu-Tucker coding algorithm, or the like, which is not specifically limited in the embodiment of the present invention.
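  • The order-preserving property can be illustrated with a deliberately simple stand-in: fixed-width codes assigned by sorted rank. This is not the Huffman or Hu-Tucker algorithm, and the function name and sample strings are hypothetical; the sketch only shows what it means for code order to match string order.

```python
def order_preserving_dictionary(texts):
    """Assign fixed-width binary codes by sorted rank of the distinct texts."""
    ordered = sorted(set(texts))
    width = max(1, (len(ordered) - 1).bit_length())
    return {text: format(rank, f"0{width}b") for rank, text in enumerate(ordered)}


texts = ["pear", "apple", "fig", "apple"]
dictionary = order_preserving_dictionary(texts)
codes = [dictionary[text] for text in texts]

# Sort the compressed codes, then decode only the sorted result.
decode = {code: text for text, code in dictionary.items()}
sorted_via_codes = [decode[code] for code in sorted(codes)]
```

  • Because all codes have the same width, comparing two codes as strings gives the same answer as comparing the original texts, so sorting and comparison can run on the codes alone.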
  • the device storing the at least two target text data blocks may receive a compression instruction for the at least two target text data blocks.
  • The compression dictionary shared storage module may also actively acquire the at least two target text data blocks in order to generate the compression dictionary of the at least two target text data blocks, which is not specifically limited in the embodiment of the present invention.
  • the compression instruction is used to indicate that the at least two target text data blocks are compressed, and the compression instruction may be triggered by a specified operation, which is not specifically limited in this embodiment of the present invention.
  • the at least two target text data blocks may be deleted, so as to save storage resources of the compression dictionary shared storage module.
  • The identifier of each target text data block in the at least two target text data blocks and the compression dictionary of the at least two target text data blocks may further be stored in a correspondence between text data block identifiers and compression dictionaries, so that when the device looks up the compression dictionary of a target text data block, it can use the identifier of the target text data block to obtain that compression dictionary from the correspondence simply and quickly.
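  • The correspondence between text data block identifiers and compression dictionaries can be sketched as a small in-memory store. The class name `CompressionDictionaryStore` and its methods are hypothetical stand-ins for the compression dictionary shared storage module, and the dictionary contents are made up for illustration.

```python
class CompressionDictionaryStore:
    """Hypothetical stand-in for the compression dictionary shared storage module."""

    def __init__(self):
        # Correspondence: text data block identifier -> compression dictionary.
        self._by_block_id = {}

    def register(self, block_ids, dictionary):
        """Associate one shared dictionary with every given block identifier."""
        for block_id in block_ids:
            self._by_block_id[block_id] = dictionary

    def lookup(self, block_id):
        """Return the compression dictionary stored for a block identifier."""
        return self._by_block_id[block_id]


store = CompressionDictionaryStore()
shared_dictionary = {"101": "00", "102": "01", "103": "10"}
store.register(["text data block 1", "text data block 2"], shared_dictionary)
```

  • A lookup with either identifier returns the same dictionary object, which reflects that the target text data blocks share one compression dictionary.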
  • The at least two target text data blocks may be determined in advance, and the compression dictionary of each target text data block in the at least two target text data blocks is the same. Therefore, the generating operation needs to be performed only once for the at least two target text data blocks to obtain their compression dictionary, thereby saving processing resources.
  • The compression dictionary of the at least two target text data blocks may be determined through the foregoing steps 401-402, and the operation of compressing and processing the at least two target text data blocks based on that compression dictionary can be implemented through the following steps 403-405.
  • The method provided by the embodiment of the present invention is used in a distributed system. The compression dictionary shared storage module is disposed in a designated device among the plurality of devices included in the distributed system, whereas the plurality of text data blocks are stored across the plurality of devices in a distributed manner. Because the at least two target text data blocks are included in the plurality of text data blocks, they are also stored across the plurality of devices in a distributed manner. After the compression dictionary shared storage module generates the compression dictionary of the at least two target text data blocks, it may store the compression dictionary. When the distributed system needs to perform data processing on the at least two target text data blocks, the compression dictionary of the at least two target text data blocks is acquired and the subsequent processing is then performed, as described in the following steps 403-405.
  • Step 403 Acquire a compression dictionary of the at least two target text data blocks.
  • The operation in which the distributed system acquires the compression dictionary of the at least two target text data blocks may be performed by a data import module included in the device storing the at least two target text data blocks. Specifically, the data import module may send a compression dictionary acquisition request to the compression dictionary shared storage module, where the compression dictionary acquisition request carries the identifier of the target text data block. When the compression dictionary shared storage module receives the compression dictionary acquisition request, it may obtain the compression dictionary of the target text data block from the stored correspondence between text data block identifiers and compression dictionaries based on the identifier of the target text data block, and send the compression dictionary of the target text data block to the data import module.
  • the compression dictionary of the at least two target text data blocks may be obtained in other manners, which is not specifically limited in the embodiment of the present invention.
  • The compression dictionary of the at least two target text data blocks may be acquired when a compression instruction for the at least two target text data blocks is received. Of course, the compression dictionary may also be acquired in other cases, as long as it is obtained before the at least two target text data blocks are compressed, which is not specifically limited in the embodiment of the present invention.
  • The compression instruction is used to indicate compression of the at least two target text data blocks. The compression instruction may be triggered by a user, or may be triggered when the distributed system detects a trigger event, which is not specifically limited in the embodiment of the present invention.
  • The compression dictionary shared storage module needs to generate the compression dictionary of the target text data block based on the target text data block and the specified compression algorithm.
  • Step 404 Compress each target text data block in the at least two target text data blocks based on the compression dictionary of the at least two target text data blocks to obtain at least two compressed data blocks.
  • The at least two target text data blocks are in one-to-one correspondence with the at least two compressed data blocks, and each compressed data block includes a plurality of compressed data. For a given compressed data block, the plurality of compressed data it includes are in one-to-one correspondence with the plurality of text data included in the target text data block corresponding to that compressed data block. That is, each text data in the at least two target text data blocks has a unique corresponding compressed data in the at least two compressed data blocks.
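  • Step 404 can be sketched as follows, assuming a text-level dictionary. The function name `compress_block`, the dictionary contents, and the block contents are hypothetical; each compressed data block is kept as a list here so that the one-to-one correspondence with the text data is easy to see.

```python
def compress_block(text_block, dictionary):
    """Replace each text data with its compression code, keeping the order."""
    return [dictionary[text] for text in text_block]


# One compression dictionary shared by all target text data blocks.
dictionary = {"101": "00", "102": "01", "103": "10", "104": "11"}

target_blocks = [["101", "103"], ["104", "102", "101"]]
compressed_blocks = [compress_block(block, dictionary) for block in target_blocks]
```

  • Each target text data block yields exactly one compressed data block, and each text data yields exactly one compressed data at the same position.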
  • Step 405 When receiving a processing instruction for performing the same processing operation on the at least two target text data blocks, process the compressed data in the at least two compressed data blocks to implement processing of the at least two target text data blocks.
  • In the related art, the plurality of text data blocks are stored in a distributed manner. When the same processing operation is performed on two text data blocks in the plurality of text data blocks, because the compression dictionaries of the two text data blocks are different, the two stored compressed data blocks corresponding to the two text data blocks need to be decompressed separately to obtain the two text data blocks, and the two text data blocks are then transmitted to the device performing data processing and processed by that device. In the embodiment of the present invention, processing the compressed data in the at least two compressed data blocks implements processing of the at least two target text data blocks. Therefore, when the at least two target text data blocks are processed, the at least two compressed data blocks corresponding to them may be transmitted to the device performing data processing, and that device processes the compressed data in the at least two compressed data blocks to implement processing of the at least two target text data blocks. Compared with the related-art method of decompressing the compressed data blocks before processing the text data blocks, the embodiment of the present invention can implement processing of the target text data blocks directly on the compressed data blocks, which shortens data processing time and saves processing resources.
  • The operation of processing the compressed data in the at least two compressed data blocks to implement processing of the at least two target text data blocks may be: determining the plurality of compressed data included in each compressed data block of the at least two compressed data blocks; processing the at least two compressed data blocks based on the plurality of compressed data included in each compressed data block to obtain a processing result of the at least two compressed data blocks; and decompressing the processing result of the at least two compressed data blocks based on the compression dictionary to obtain a processing result of the at least two target text data blocks.
  • Because the compression dictionary stores the compressed code of each text data in each target text data block, or the compressed code of each character in each target text data block, compressing the at least two target text data blocks converts the text data in the at least two target text data blocks into compressed data. Since the conversion rule is fixed, the processing result obtained by computing on the compressed data is also composed of compressed codes. Decompressing the processing result of the at least two compressed data blocks based on the compression dictionary therefore converts a result composed of compressed codes into text data form, and because the conversion rule is fixed, the converted result is the processing result of the at least two target text data blocks.
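  • The process-then-decompress flow can be sketched as follows. The dictionary is assumed to be order-preserving and text-level, the block contents are made up, and the two operations shown (a sort across both blocks and a search for common values) are only examples of the kind of processing the embodiment mentions.

```python
# Shared, order-preserving dictionary (hypothetical contents).
dictionary = {"101": "00", "102": "01", "103": "10", "104": "11"}
decode = {code: text for text, code in dictionary.items()}

block_a = ["103", "101"]
block_b = ["104", "101", "102"]
compressed_a = [dictionary[text] for text in block_a]
compressed_b = [dictionary[text] for text in block_b]

# Process the compressed data only: sort across both blocks and
# find the values present in both blocks.
sorted_codes = sorted(compressed_a + compressed_b)
common_codes = set(compressed_a) & set(compressed_b)

# Decompress only the processing result, not the blocks themselves.
sorted_result = [decode[code] for code in sorted_codes]
common_result = {decode[code] for code in common_codes}
```

  • The decoded results equal what would be obtained by sorting or intersecting the original text data, which only holds because both blocks were compressed with the same dictionary.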
  • This operation may be performed by a data processing module in the device. That is, the data processing module may process the compressed data in the at least two compressed data blocks, and then decompress the processing result based on the compression dictionary of the at least two target text data blocks, thereby implementing processing of the at least two target text data blocks.
  • When determining the plurality of compressed data included in each compressed data block of the at least two compressed data blocks, it may first be determined whether the code lengths of the plurality of compressed codes included in the compression dictionary are equal, and then, depending on the determination result, the plurality of compressed data included in each compressed data block are determined according to one of the following three cases:
  • In a first case, when the code lengths of the plurality of compressed codes included in the compression dictionary are equal and the compression dictionary includes a compression code of each text data in each target text data block, for each compressed data block of the at least two compressed data blocks, the code length of each compressed code is determined as the length of each compressed data in the compressed data block; and according to the length of each compressed data in the compressed data block, a plurality of compressed data are sequentially determined from the compressed data block.
  • The compression dictionary is generated by the specified compression algorithm, and for different compression algorithms the code lengths of the plurality of compressed codes included in the generated compression dictionary may be equal or may differ from one another. Because the compressed data block is obtained by compression based on the compression dictionary, when determining the plurality of compressed data included in each compressed data block of the at least two compressed data blocks, it may first be determined whether the code lengths of the plurality of compressed codes included in the compression dictionary are equal.
  • When the compression dictionary includes a compression code of each text data in each target text data block, it can be determined that the target text data block was compressed text data by text data, based on the compression code of each text data in each target text data block. Therefore, the code length of each compressed code in the compression dictionary can be determined as the length of each compressed data in the compressed data block.
  • When a plurality of compressed data are sequentially determined from the compressed data block according to the length of each compressed data in the compressed data block, the compressed data block may be divided sequentially according to that length to obtain the plurality of compressed data.
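  • A minimal sketch of the first case follows, assuming the compressed data block is represented as a string of bits and the dictionary is text-level with equal code lengths. The function name `split_fixed` and the sample dictionary are hypothetical.

```python
def split_fixed(compressed_block, code_length):
    """Divide a compressed data block into pieces of one fixed code length."""
    if len(compressed_block) % code_length != 0:
        raise ValueError("block length is not a multiple of the code length")
    return [compressed_block[i:i + code_length]
            for i in range(0, len(compressed_block), code_length)]


dictionary = {"101": "00", "102": "01", "103": "10", "104": "11"}
# All codes have the same length, so any one of them gives the piece length.
code_length = len(next(iter(dictionary.values())))

target_block = ["101", "102", "103", "104"]
compressed_block = "".join(dictionary[text] for text in target_block)
pieces = split_fixed(compressed_block, code_length)
```

  • Each piece is the compressed data of one text data, recovered without consulting the original target text data block.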
  • In a second case, when the code lengths of the plurality of compressed codes included in the compression dictionary are equal and the compression dictionary includes a compression code of each character in each target text data block, for each compressed data block of the at least two compressed data blocks, the target text data block corresponding to the compressed data block is determined; the code length of each compressed code is multiplied by the number of characters of each text data in the target text data block to obtain the length of each compressed data in the compressed data block; and according to the length of each compressed data in the compressed data block, a plurality of compressed data are sequentially determined from the compressed data block.
  • When the compression dictionary includes a compression code of each character in each target text data block, it can be determined that the target text data block was compressed character by character, based on the compression code of each character in each target text data block. Because one text data can include a plurality of characters, the code length of each compressed code in the compression dictionary can be multiplied by the number of characters of each text data in the target text data block to obtain the length of each compressed data in the compressed data block.
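  • The second case can be sketched as follows, assuming a character-level dictionary with equal code lengths and a bit-string representation of the compressed data block. The function name `split_by_char_counts` and the sample data are hypothetical; the character counts are taken from the corresponding target text data block, as the text above describes.

```python
def split_by_char_counts(compressed_block, code_length, char_counts):
    """Split a block where piece length = code length x characters per text data."""
    pieces, pos = [], 0
    for count in char_counts:
        length = code_length * count
        pieces.append(compressed_block[pos:pos + length])
        pos += length
    return pieces


char_dictionary = {"0": "00", "1": "01", "2": "10", "3": "11"}
code_length = 2  # every character code is two bits long

target_block = ["101", "32", "0"]
compressed_block = "".join(char_dictionary[ch]
                           for text in target_block for ch in text)

char_counts = [len(text) for text in target_block]
pieces = split_by_char_counts(compressed_block, code_length, char_counts)
```

  • Here the text data have 3, 2, and 1 characters, so the compressed data are 6, 4, and 2 bits long respectively.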
  • In a third case, when the code lengths of the plurality of compressed codes included in the compression dictionary are not equal, for each compressed data block of the at least two compressed data blocks, a plurality of compressed data are determined from the compressed data block according to the data index of the compressed data block, where the data index of the compressed data block is used to indicate the location of each of the plurality of compressed data in the compressed data block.
  • Before the plurality of compressed data are determined from the compressed data block according to the data index of the compressed data block, the method further includes: for the target text data block corresponding to the compressed data block, determining, in the process of compressing the target text data block, the location in the compressed data block of the compressed data corresponding to each text data in the target text data block; and generating the data index of the compressed data block based on the location in the compressed data block of the compressed data corresponding to each text data.
  • When determining the location in the compressed data block of the compressed data corresponding to each text data, the start position and the end position of that compressed data in the compressed data block may be determined, and the location of the compressed data corresponding to each text data in the compressed data block is then uniquely determined by the start position and the end position.
  • the starting position of each compressed data in the compressed data block may be determined as the data index of the compressed data block.
  • For example, the obtained compressed data is 0001010, in which aaa corresponds to the code 00, bbb corresponds to the code 010, and ccc corresponds to the code 10. By recording where each code is located, the compressed data corresponding to each field can be obtained without decompressing the data.
  • For example, the start position 0 of the compressed data 00, the start position 2 of the compressed data 010, and the start position 5 of the compressed data 10 can be determined as the data index of the compressed data block.
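  • The data index of the third case can be sketched with the example values above (aaa, bbb, and ccc mapped to 00, 010, and 10). The function names `compress_with_index` and `split_with_index`, and the choice to store only start positions, are assumptions for illustration.

```python
def compress_with_index(text_block, dictionary):
    """Compress a block and record the start position of each compressed data."""
    parts, index, pos = [], [], 0
    for text in text_block:
        code = dictionary[text]
        index.append(pos)      # start position of this compressed data
        parts.append(code)
        pos += len(code)       # the next compressed data starts here
    return "".join(parts), index


def split_with_index(compressed_block, index):
    """Recover each compressed data from the block using the start positions."""
    ends = index[1:] + [len(compressed_block)]
    return [compressed_block[start:end] for start, end in zip(index, ends)]


dictionary = {"aaa": "00", "bbb": "010", "ccc": "10"}  # unequal code lengths
compressed_block, data_index = compress_with_index(["aaa", "bbb", "ccc"], dictionary)
pieces = split_with_index(compressed_block, data_index)
```

  • Storing start positions is enough because each compressed data ends where the next one starts, and the last one ends at the end of the block.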
  • In the embodiment of the present invention, for at least two target text data blocks that are subsequently processed by the same processing operation, a compression dictionary of the at least two target text data blocks is acquired, and each target text data block in the at least two target text data blocks is compressed based on the compression dictionary to obtain at least two compressed data blocks. That is, the at least two target text data blocks can share the same compression dictionary, without a compression dictionary being generated for each text data block. Because the compression dictionary of the at least two target text data blocks is the same, the at least two target text data blocks are compressed by the same compression standard. Therefore, when the compressed data in the at least two compressed data blocks is processed, the result of decompressing the processing result with the compression dictionary is the same as the result of processing the text data in the at least two target text data blocks. The embodiment of the present invention thus implements processing of the at least two target text data blocks by processing the at least two compressed data blocks, without decompressing the at least two compressed data blocks, which reduces the amount of data to be processed, shortens data processing time, and saves processing resources.
  • FIG. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention.
  • the data processing apparatus includes:
  • the obtaining module 501 is configured to acquire a compression dictionary of at least two target text data blocks, where the at least two target text data blocks are data blocks, among the stored plurality of text data blocks, that are subsequently processed by the same processing operation; each target text data block includes a plurality of text data, each text data includes a plurality of characters, and the compression dictionary includes a compression code of each text data in the respective target text data blocks, or a compression code of each character in the respective target text data blocks;
  • the compression module 502 is configured to compress each target text data block in the at least two target text data blocks based on the compression dictionary to obtain at least two compressed data blocks, where the at least two target text data blocks are in one-to-one correspondence with the at least two compressed data blocks, each compressed data block includes a plurality of compressed data, and the plurality of compressed data are in one-to-one correspondence with the plurality of text data;
  • the processing module 503 is configured to, when receiving a processing instruction for performing the same processing operation on the at least two target text data blocks, process the compressed data in the at least two compressed data blocks to implement processing of the at least two target text data blocks.
  • the device further includes:
  • a determining module configured to determine the at least two target text data blocks from the stored plurality of text data blocks
  • a generating module configured to generate a compression dictionary of the at least two target text data blocks.
  • the determining module comprises:
  • a selecting unit configured to select at least two text data blocks belonging to the target category from the plurality of text data blocks, and determine the selected text data blocks as the target text data blocks;
  • a first determining unit configured to determine, when a selection instruction for at least two text data blocks in the plurality of text data blocks is detected, the text data blocks selected by the selection instruction as the target text data blocks.
  • the processing module 503 includes:
  • a second determining unit 5031 configured to determine a plurality of compressed data included in each compressed data block of the at least two compressed data blocks
  • the processing unit 5032 is configured to process the at least two compressed data blocks based on the plurality of compressed data included in each of the at least two compressed data blocks to obtain a processing result of the at least two compressed data blocks.
  • the decompressing unit 5033 is configured to decompress the processing results of the at least two compressed data blocks based on the compression dictionary to obtain processing results of the at least two target text data blocks.
  • the second determining unit 5031 includes:
  • a determining subunit configured to determine whether the code lengths of the plurality of compressed codes included in the compression dictionary are equal
  • a first determining subunit configured to: when the code lengths of the plurality of compressed codes included in the compression dictionary are equal and the compression dictionary includes a compression code of each text data in the respective target text data blocks, for the at least two compressed data Each compressed data block in the block determines a code length of each compressed code as a length of each compressed data in the compressed data block;
  • a second determining subunit configured to sequentially determine the plurality of compressed data from the compressed data block according to the length of each compressed data in the compressed data block.
  • the second determining unit 5031 further includes:
  • a third determining subunit configured to: when the code lengths of the plurality of compressed codes included in the compression dictionary are equal and when the compression dictionary includes a compression code of each character in the respective target text data blocks, for the at least two compressed data Determining, by each compressed data block in the block, a target text data block corresponding to the compressed data block;
  • An operation subunit configured to multiply a code length of each compressed code by a number of characters of each text data in the target text data block to obtain a length of each compressed data in the compressed data block;
  • a fourth determining subunit configured to sequentially determine the plurality of compressed data from the compressed data block according to the length of each compressed data in the compressed data block.
  • the second determining unit 5031 further includes:
  • a fifth determining subunit configured to: when the code lengths of the plurality of compressed codes included in the compression dictionary are not equal, for each compressed data block of the at least two compressed data blocks, according to the data index of the compressed data block, From the compressed data block, the plurality of compressed data is determined, and a data index of the compressed data block is used to indicate a location of each of the plurality of compressed data in the compressed data block.
  • the second determining unit 5031 further includes:
  • a sixth determining subunit configured to determine, for the target text data block corresponding to the compressed data block and in the process of compressing the target text data block, the location in the compressed data block of the compressed data corresponding to each text data in the target text data block;
  • a generating subunit configured to generate a data index of the compressed data block based on the location in the compressed data block of the compressed data corresponding to the respective text data.
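  • The way the second determining unit 5031 chooses among its subunits can be sketched as a single dispatch function. This is an illustrative assumption about how the three cases might be combined in code: the function name `split_compressed_block`, its parameters, and the bit-string representation are hypothetical, and the sample data are made up.

```python
def split_compressed_block(compressed_block, dictionary, per_character=False,
                           char_counts=None, data_index=None):
    """Determine the compressed data in a block according to the three cases."""
    code_lengths = {len(code) for code in dictionary.values()}
    if len(code_lengths) == 1:
        code_length = code_lengths.pop()
        if per_character:
            # Equal lengths, character-level codes: length = code length x characters.
            lengths = [code_length * count for count in char_counts]
        else:
            # Equal lengths, text-level codes: every piece has the code length.
            lengths = [code_length] * (len(compressed_block) // code_length)
        starts, pos = [], 0
        for length in lengths:
            starts.append(pos)
            pos += length
    else:
        # Unequal lengths: rely on the data index recorded during compression.
        starts = list(data_index)
    ends = starts[1:] + [len(compressed_block)]
    return [compressed_block[start:end] for start, end in zip(starts, ends)]


case_one = split_compressed_block(
    "00011011", {"a": "00", "b": "01", "c": "10", "d": "11"})
case_two = split_compressed_block(
    "010001111000", {"0": "00", "1": "01", "2": "10", "3": "11"},
    per_character=True, char_counts=[3, 2, 1])
case_three = split_compressed_block(
    "0001010", {"aaa": "00", "bbb": "010", "ccc": "10"}, data_index=[0, 2, 5])
```

  • Reducing all three cases to a list of start positions keeps the final slicing step identical, so only the way the positions are obtained differs between the cases.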
  • In the embodiment of the present invention, for at least two target text data blocks that are subsequently processed by the same processing operation, a compression dictionary of the at least two target text data blocks is acquired, and each target text data block in the at least two target text data blocks is compressed based on the compression dictionary to obtain at least two compressed data blocks. That is, the at least two target text data blocks can share the same compression dictionary, without a compression dictionary being generated for each text data block. Because the compression dictionary of the at least two target text data blocks is the same, the at least two target text data blocks are compressed by the same compression standard. Therefore, when the compressed data in the at least two compressed data blocks is processed, the result of decompressing the processing result with the compression dictionary is the same as the result of processing the text data in the at least two target text data blocks. The embodiment of the present invention thus implements processing of the at least two target text data blocks by processing the at least two compressed data blocks, without decompressing the at least two compressed data blocks, which reduces the amount of data to be processed, shortens data processing time, and saves processing resources.
  • It should be noted that the division into functional modules described for the data processing apparatus in the foregoing embodiment is only an example. In practice, the functions may be allocated to different functional modules as required; that is, the internal structure of the apparatus may be divided into different functional modules to perform all or some of the functions described above.
  • In addition, the data processing apparatus provided in the foregoing embodiment and the data processing method embodiments are based on the same concept. For the specific implementation process, see the method embodiments; details are not repeated here.
  • A person of ordinary skill in the art will understand that all or some of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing related hardware. The program may be stored in a computer-readable storage medium.
  • The storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

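A data processing method and apparatus, belonging to the field of computer technologies. The method includes: obtaining a compression dictionary of at least two target text data blocks (301); compressing, based on the compression dictionary, each of the at least two target text data blocks separately to obtain at least two compressed data blocks (302); and, when a processing instruction for performing a same processing operation on the at least two target text data blocks is received, processing the compressed data in the at least two compressed data blocks so as to process the at least two target text data blocks (303). Because the at least two target text data blocks are processed by operating on the compressed data in the at least two compressed data blocks, the compressed data blocks do not need to be decompressed, which reduces the amount of data to be processed, shortens the processing time, and saves processing resources.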

Description

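Data Processing Method and Apparatus
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data processing method and apparatus.
Background
With the development of computer technologies, large amounts of text data need to be stored and analyzed. Text data is data composed of printable characters, which include the characters at positions 33 to 127 of the American Standard Code for Information Interchange (ASCII), characters in Unicode (UNICODE), characters in UTF-8, and so on. To save the time and space consumed when storing and transmitting text data, the text data is first compressed and the compressed text data is then stored. Later, when the text data is analyzed, the compressed text data is first decompressed to recover the text data, the text data is then processed with operations such as comparison, sorting, searching, hashing, and joining, and the text data is analyzed based on the processing result.
Currently, a data processing method is provided, specifically: for each text data block of a plurality of stored text data blocks, a compression dictionary of the text data block is generated, where the text data block includes a plurality of pieces of text data; the text data block is compressed based on its compression dictionary to obtain a corresponding compressed data block; and the compressed data block is stored. When a processing instruction for performing a same processing operation on a first text data block and a second text data block is received, the compressed data block corresponding to the first text data block and the compressed data block corresponding to the second text data block are obtained, where the first and second text data blocks are any two of the plurality of text data blocks; the two compressed data blocks are decompressed to obtain the first text data block and the second text data block; and the text data in the first and second text data blocks is processed to obtain a processing result.
Because the compressed data blocks corresponding to the first and second text data blocks must each be decompressed before the first and second text data blocks can be processed, data processing takes a long time and consumes considerable processing resources.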
发明内容
为了解决现有技术的问题,本发明实施例提供了一种数据处理方法及装置。所述技术方案如下:
第一方面,提供了一种数据处理方法,所述方法包括:
获取至少两个目标文本数据块的压缩字典,所述至少两个目标文本数据块为存储的多个文本数据块中后续通过同一处理操作进行处理的数据块,各个目标文本数据块均包括多个文本数据,各个文本数据均包括多个字符,所述压缩字典包括所述各个目 标文本数据块中每个文本数据的压缩码,或者包括所述各个目标文本数据块中每个字符的压缩码;
基于所述压缩字典,分别对所述至少两个目标文本数据块中的每个目标文本数据块进行压缩,得到至少两个压缩数据块,所述至少两个目标文本数据块与所述至少两个压缩数据块一一对应,各个压缩数据块均包括多个压缩数据,所述多个压缩数据与所述多个文本数据一一对应;
当接收到对所述至少两个目标文本数据块进行同一处理操作的处理指令时,对所述至少两个压缩数据块中的压缩数据进行处理,以实现所述至少两个目标文本数据块的处理。
需要说明的是,该至少两个目标文本数据块的压缩字典可以包括各个目标文本数据块中每个文本数据的压缩码,或者可以包括各个目标文本数据块中每个字符的压缩码,也即是,该压缩字典中的压缩码可以与文本数据一一对应,也可以与字符一一对应,本发明实施例对此不做具体限定。
另外,该至少两个目标文本数据块为后续通过同一处理操作进行处理的数据块,且该至少两个目标文本数据块对应同一压缩字典,也即是,该至少两个目标文本数据块共享同一压缩字典,如此,无需对每个文本数据块分别生成一个压缩字典。进而在接收到对至少两个压缩数据块进行同一处理操作的处理指令时,由于该至少两个压缩数据块是通过该压缩字典对该至少两个目标文本数据块进行压缩得到,且该至少两个目标文本数据块共享同一压缩字典,因此,可以直接对该至少两个压缩数据块中的压缩数据进行处理,从而实现对该至少两个目标文本数据块的处理。
结合第一方面,在上述第一方面的第一种可能的实现方式中,所述获取至少两个目标文本数据块的压缩字典之前,还包括:
从存储的多个文本数据块中,确定所述至少两个目标文本数据块;
生成所述至少两个目标文本数据块的压缩字典。
为了便于后续可以基于同一压缩标准对该至少两个目标文本数据块中的每个目标文本数据块进行压缩,在确定该至少两个目标文本数据块后,可以生成该至少两个目标文本数据块的压缩字典。且在生成该至少两个目标文本数据块的压缩字典时,可以基于指定压缩算法和该至少两个目标文本数据块,生成该至少两个目标文本数据块的压缩字典,当然,实际应用中,也可以以其它方式,生成该至少两个目标文本数据块的压缩字典,本发明实施例对此不做具体限定。
结合第一方面的第一种可能的实现方式,在上述第一方面的第二种可能的实现方式中,所述从存储的多个文本数据块中,确定所述至少两个目标文本数据块,包括:
从所述多个文本数据块中,选择属于目标类别的至少两个文本数据块,将选择的文本数据块确定为目标文本数据块;或者,
当检测到针对所述多个文本数据块中至少两个文本数据块的选择指令时,将所述选择指令所选择的文本数据块确定为目标文本数据块。
由于通常情况下目标类别下的文本数据块均可以通过同一处理操作进行处理,因此,本发明实施例中可以选择属于目标类别的至少两个文本数据块,并将选择的文本数据块确定为目标文本数据块,该确定操作简单方便,且无需用户参与,从而可以提 高目标文本数据块的确定效率。
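In addition, because the selection instruction is triggered by a user, determining the text data blocks selected by the selection instruction as the target text data blocks in effect determines the target text data blocks according to the user's operation, which ensures that the determined target text data blocks meet the user's needs.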
结合第一方面,在上述第一方面的第三种可能的实现方式中,所述对所述至少两个压缩数据块中的压缩数据进行处理,以实现所述至少两个目标文本数据块的处理,包括:
确定所述至少两个压缩数据块中每个压缩数据块包括的多个压缩数据;
基于所述至少两个压缩数据块中每个压缩数据块包括的多个压缩数据,对所述至少两个压缩数据块进行处理,得到所述至少两个压缩数据块的处理结果;
基于所述压缩字典,对所述至少两个压缩数据块的处理结果进行解压,得到所述至少两个目标文本数据块的处理结果。
由于压缩字典中存储的是各个目标文本数据块中每个文本数据的压缩码,或者各个目标文本数据块中每个字符的压缩码,因此,对该至少两个目标文本数据块进行压缩,也即是将该至少两个目标文本数据块中的文本数据转换为了压缩数据,由于该转换规则是一定的,因此,在对该压缩数据进行计算后,得到的该至少两个压缩数据块的处理结果也是由压缩码所组成的,因此,基于该压缩字典,对该至少两个压缩数据块的处理结果进行解压后,则是将该压缩码所组成的压缩结果转换为了文本数据形式,且由于转化规则是一定的,因此,转换后的处理结果即为该至少两个目标文本数据块的处理结果。
需要说明的是,由于一般情况下,该多个文本数据块是进行分布式存储的,也即是,在接收到对该多个文本数据块中的某两个文本数据块进行同一处理操作的处理指令时,由于该两个文本数据块的压缩字典不同,因此,需要先对存储的这两个文本数据块对应的两个压缩数据块分别进行解压,以得到这两个文本数据块,进而将这两个文本数据块传输给进行数据处理的设备,由该设备对这两个文本数据块进行处理。
而本发明实施例中,由于是直接对该至少两个压缩数据块中的压缩数据进行处理,来实现该至少两个目标文本数据块的处理,因此,在对该至少两个目标文本数据块进行处理时,可以只将该至少两个目标文本数据块对应的至少两个压缩数据块传输给进行数据处理的设备,由该设备对该至少两个压缩数据块中的压缩数据进行处理,来实现对该至少两个目标文本数据块的处理。从而相比于相关技术中在对文本数据块进行处理之前需要先对压缩数据块进行解压的方式,本发明实施例中直接基于该压缩数据块即可实现对目标文本数据块的处理,从而可以减少数据处理时间,节省处理资源。另外,相比于相关技术中在该分布式系统包括的多个设备之间传输文本数据块的方式,本发明实施例中只需要传输压缩数据块,从而可以降低数据传输量,提升网络带宽利用率,且节省数据传输时间,再者,相比于相关技术中直接对文本数据进行处理的方式,本发明实施例中只需要对压缩数据进行处理,从而可以减小数据处理量,节省数据处理时间,节省处理资源。
结合第一方面的第三种可能的实现方式,在上述第一方面的第四种可能的实现方式中,所述确定所述至少两个压缩数据块中每个压缩数据块包括的多个压缩数据,包 括:
判断所述压缩字典包括的多个压缩码的码长是否相等;
当所述压缩字典包括的多个压缩码的码长相等且所述压缩字典包括所述各个目标文本数据块中每个文本数据的压缩码时,对于所述至少两个压缩数据块中的每个压缩数据块,将各个压缩码的码长确定为所述压缩数据块中各个压缩数据的长度;
按照所述压缩数据块中各个压缩数据的长度,从所述压缩数据块中,依次确定所述多个压缩数据。
由上述描述可知,压缩字典是通过指定压缩算法来生成,而对于不同的压缩算法,生成的压缩字典中包括的多个压缩码的码长可能不相等,且该压缩字典中包括的各个压缩码之间的码长可能也不相等,并且压缩数据块是基于该压缩字典进行压缩的,因此,在确定该至少两个压缩数据块中每个压缩数据块包括的多个压缩数据块,可以先判断该压缩字典包括的多个压缩码的码长是否相等。
另外,当该压缩字典包括各个目标文本数据块中每个文本数据的压缩码时,确定在进行目标文本数据块的压缩时,是基于各个目标文本数据块中每个文本数据的压缩码,对每个文本数据进行压缩的,因此,可以将该压缩字典中该各个压缩码的码长确定为该压缩数据块中各个压缩数据块的长度。
结合第一方面的第四种可能的实现方式,在上述第一方面的第五种可能的实现方式中,所述判断所述压缩字典包括的多个压缩码的码长是否相等之后,还包括:
当所述压缩字典包括的多个压缩码的码长相等且当所述压缩字典包括所述各个目标文本数据块中每个字符的压缩码时,对于所述至少两个压缩数据块中的每个压缩数据块,确定所述压缩数据块对应的目标文本数据块;
将各个压缩码的码长分别与所述目标文本数据块中各个文本数据的字符个数相乘,得到所述压缩数据块中各个压缩数据的长度;
按照所述压缩数据块中各个压缩数据的长度,从所述压缩数据块中,依次确定所述多个压缩数据。
其中,当该压缩字典包括各个目标文本数据块中每个字符的压缩码时,确定在进行目标文本数据块的压缩时,是基于各个目标文本数据块中每个字符的压缩码,对每个字符进行压缩的,且由于一个文本数据可以包括多个字符,因此,可以将该压缩字典中该各个压缩码的码长分别与目标文本数据块中各个文本数据的字符个数相乘,进而得到该压缩数据块中各个压缩数据块的长度。
结合第一方面的第四种可能的实现方式,在上述第一方面的第六种可能的实现方式中,所述判断所述压缩字典包括的多个压缩码的码长是否相等之后,还包括:
当所述压缩字典包括的多个压缩码的码长不相等时,对于所述至少两个压缩数据块中的每个压缩数据块,根据所述压缩数据块的数据索引,从所述压缩数据块中,确定所述多个压缩数据,所述压缩数据块的数据索引用于指示所述多个压缩数据中每个压缩数据在所述压缩数据块中所处的位置。
结合第一方面的第六种可能的实现方式,在上述第一方面的第七种可能的实现方式中,所述根据所述压缩数据块的数据索引,从所述压缩数据块中,确定所述多个压缩数据之前,还包括:
对于所述压缩数据块对应的目标文本数据块,在对所述目标文本数据块进行压缩的过程中,确定所述目标文本数据块中各个文本数据对应的压缩数据在所述压缩数据块中所处的位置;
基于所述各个文本数据对应的压缩数据在所述压缩数据块中所处的位置,生成所述压缩数据块的数据索引。
需要说明的是,在对目标文本数据块进行压缩的过程中,确定目标文本数据块中各个文本数据对应的压缩数据在该压缩数据块中所处的位置时,可以确定各个文本数据对应的压缩数据在该压缩数据块中的起始位置和结束位置,进而通过起始位置和结束位置唯一地确定各个文本数据对应的压缩数据在该压缩数据块中所处的位置。其中,为了方便起见,可以将各个压缩数据在该压缩数据块中的起始位置确定为该压缩数据块的数据索引。
第二方面,提供了一种数据处理装置,所述装置包括:
获取模块,用于获取至少两个目标文本数据块的压缩字典,所述至少两个目标文本数据块为存储的多个文本数据块中后续通过同一处理操作进行处理的数据块,各个目标文本数据块均包括多个文本数据,各个文本数据均包括多个字符,所述压缩字典包括所述各个目标文本数据块中每个文本数据的压缩码,或者包括所述各个目标文本数据块中每个字符的压缩码;
压缩模块,用于基于所述压缩字典,分别对所述至少两个目标文本数据块中的每个目标文本数据块进行压缩,得到至少两个压缩数据块,所述至少两个目标文本数据块与所述至少两个压缩数据块一一对应,各个压缩数据块均包括多个压缩数据,所述多个压缩数据与所述多个文本数据一一对应;
处理模块,用于当接收到对所述至少两个目标文本数据块进行同一处理操作的处理指令时,对所述至少两个压缩数据块中的压缩数据进行处理,以实现所述至少两个目标文本数据块的处理。
结合第二方面,在上述第二方面的第一种可能的实现方式中,所述装置还包括:
确定模块,用于从存储的多个文本数据块中,确定所述至少两个目标文本数据块;
生成模块,用于生成所述至少两个目标文本数据块的压缩字典。
结合第二方面的第一种可能的实现方式,在上述第二方面的第二种可能的实现方式中,所述确定模块包括:
选择单元,用于从所述多个文本数据块中,选择属于目标类别的至少两个文本数据块,将选择的文本数据块确定为目标文本数据块;或者,
第一确定单元,用于当检测到针对所述多个文本数据块中至少两个文本数据块的选择指令时,将所述选择指令所选择的文本数据块确定为目标文本数据块。
结合第二方面,在上述第二方面的第三种可能的实现方式中,所述处理模块包括:
第二确定单元,用于确定所述至少两个压缩数据块中每个压缩数据块包括的多个压缩数据;
处理单元,用于基于所述至少两个压缩数据块中每个压缩数据块包括的多个压缩数据,对所述至少两个压缩数据块进行处理,得到所述至少两个压缩数据块的处理结 果;
解压单元,用于基于所述压缩字典,对所述至少两个压缩数据块的处理结果进行解压,得到所述至少两个目标文本数据块的处理结果。
结合第二方面的第三种可能的实现方式,在上述第二方面的第四种可能的实现方式中,所述第二确定单元包括:
判断子单元,用于判断所述压缩字典包括的多个压缩码的码长是否相等;
第一确定子单元,用于当所述压缩字典包括的多个压缩码的码长相等且所述压缩字典包括所述各个目标文本数据块中每个文本数据的压缩码时,对于所述至少两个压缩数据块中的每个压缩数据块,将各个压缩码的码长确定为所述压缩数据块中各个压缩数据的长度;
第二确定子单元,用于按照所述压缩数据块中各个压缩数据的长度,从所述压缩数据块中,依次确定所述多个压缩数据
结合第二方面的第四种可能的实现方式,在上述第二方面的第五种可能的实现方式中,所述第二确定单元还包括:
第三确定子单元,用于当所述压缩字典包括的多个压缩码的码长相等且当所述压缩字典包括所述各个目标文本数据块中每个字符的压缩码时,对于所述至少两个压缩数据块中的每个压缩数据块,确定所述压缩数据块对应的目标文本数据块;
运算子单元,用于将各个压缩码的码长分别与所述目标文本数据块中各个文本数据的字符个数相乘,得到所述压缩数据块中各个压缩数据的长度;
第四确定子单元,用于按照所述压缩数据块中各个压缩数据的长度,从所述压缩数据块中,依次确定所述多个压缩数据。
结合第二方面的第四种可能的实现方式,在上述第二方面的第六种可能的实现方式中,所述第二确定单元还包括:
第五确定子单元,用于当所述压缩字典包括的多个压缩码的码长不相等时,对于所述至少两个压缩数据块中的每个压缩数据块,根据所述压缩数据块的数据索引,从所述压缩数据块中,确定所述多个压缩数据,所述压缩数据块的数据索引用于指示所述多个压缩数据中每个压缩数据在所述压缩数据块中所处的位置。
结合第二方面的第六种可能的实现方式,在上述第二方面的第七种可能的实现方式中,所述第二确定单元还包括:
第六确定子单元,用于对于所述压缩数据块对应的目标文本数据块,在对所述目标文本数据块进行压缩的过程中,确定所述目标文本数据块中各个文本数据对应的压缩数据在所述压缩数据块中所处的位置;
生成子单元,用于基于所述各个文本数据对应的压缩数据在所述压缩数据块中所处的位置,生成所述压缩数据块的数据索引。
本发明实施例提供的技术方案带来的有益效果是:
在本发明实施例中,对于后续通过同一处理操作进行处理的至少两个目标文本数据块,获取该至少两个目标文本数据块的压缩字典,基于该压缩字典对该至少两个目标文本数据块中的每个目标文本数据块进行压缩,得到至少两个压缩数据块,也即是,该至少两个目标文本数据块可以共享同一压缩字典,而无需对每个文本数据块都生成 一个压缩字典。另外,由于该至少两个目标文本数据块的压缩字典相同,也即是,该至少两个目标文本数据块是通过同一压缩标准进行压缩的,因此,对该至少两个压缩数据块中的压缩数据进行处理,并对处理结果通过该压缩字典进行解压后的结果与对该至少两个目标文本数据块中的文本数据进行处理得到的处理结果相同,所以本发明实施例通过对该至少两个压缩数据块中的压缩数据进行处理,来实现对该至少两个目标文本数据块的处理,而无需对该至少两个压缩数据块进行解压,减小了数据处理量,进而缩短了数据的处理时间,以及节省了处理资源。
附图说明
为了更清楚地说明本发明实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本发明实施例提供的一种数据处理系统的结构示意图;
图2是本发明实施例提供的一种计算机设备的结构示意图;
图3是本发明实施例提供的一种数据处理方法的流程图;
图4A是本发明实施例提供的另一种数据处理方法的流程图;
图4B是图4A实施例所涉及的一种目标文本数据块的示意图;
图4C是图4A实施例所涉及的一种目标文本数据块与压缩字典的对应关系的示意图;
图4D是图4A实施例所涉及的另一种目标文本数据块与压缩字典的对应关系的示意图;
图4E是图4A实施例所涉及的压缩字段索引的示意图;
图5A是本发明实施例提供的一种数据处理装置的结构示意图;
图5B是本发明实施例提供的一种处理模块的结构示意图。
具体实施方式
为使本发明的目的、技术方案和优点更加清楚,下面将结合附图对本发明实施方式作进一步地详细描述。
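FIG. 1 is a schematic structural diagram of a data processing system according to an embodiment of the present invention. Referring to FIG. 1, the system may be a distributed system or, of course, a non-distributed system; the embodiments of the present invention use a distributed system as an example. The distributed system includes a plurality of devices, namely device 01, device 02, device 03, ..., device 0n. The devices are connected to one another, and each device may be a terminal or a server, which is not specifically limited in the embodiments of the present invention.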
另外,该多个设备中的每个设备均可以包括数据导入模块和数据处理模块,该多个设备中的指定设备中不仅可以包括数据导入模块和数据处理模块,还可以包括压缩字典配置模块和压缩字典共享存储模块,该指定设备可以为该多个设备中的任一设备。其中,该分布式系统中可以存储多个文本数据块,压缩字典配置模块用于对该分布式系统中存储的多个文本数据块配置压缩字典,该压缩字典用于对文本数据块进行压缩, 从而得到压缩数据块,相应地,该分布式系统中还可以存储多个压缩数据块,该多个压缩数据块与该多个文本数据块一一对应,值得注意的是,该多个文本数据块中后续需要进行同一处理操作的至少两个文本数据块可以共享一个压缩字典,也即是,该至少两个文本数据可以对应相同的压缩字典。
压缩字典共享存储模块用于对应存储该多个文本数据块的标识与对应的压缩字典;数据导入模块用于在对至少两个目标文本数据块进行数据处理时,从压缩字典共享存储模块中获取该至少两个目标文本数据块的压缩字典;数据处理模块用于对该至少两个目标文本数据块对应的压缩数据块中的压缩数据进行处理,并基于数据导入模块获取的压缩字典对处理结果进行解压,从而实现对该至少两个目标文本数据块的处理。
图2为本发明实施例提供的一种计算机设备的结构示意图,图1中的分布式系统中的设备可以以图2中所示的计算机设备来实现。参见图2,该计算机设备包括至少一个处理器201,通信总线202,存储器203以及至少一个通信接口204。
处理器201可以是一个通用中央处理器(CPU),微处理器,特定应用集成电路(application-specific integrated circuit,ASIC),或一个或多个用于控制本发明方案程序执行的集成电路。
通信总线202可包括一通路,在上述组件之间传送信息。
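The memory 203 may be a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other compact disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or another magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 203 may exist independently and be connected to the processor 201 through the communication bus 202. The memory 203 may also be integrated with the processor 201.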
通信接口204,使用任何收发器一类的装置,用于与其它设备或通信网络通信,如以太网,无线接入网(RAN),无线局域网(Wireless Local Area Networks,WLAN)等。
在具体实现中,作为一种实施例,处理器201可以包括一个或多个CPU,例如图2中所示的CPU0和CPU1。
在具体实现中,作为一种实施例,计算机设备可以包括多个处理器,例如图2中所示的处理器201和处理器208。这些处理器中的每一个可以是一个单核(single-CPU)处理器,也可以是一个多核(multi-CPU)处理器。这里的处理器可以指一个或多个设备、电路、和/或用于处理数据(例如计算机程序指令)的处理核。
在具体实现中,作为一种实施例,计算机设备还可以包括输出设备205和输入设备206。输出设备205和处理器201通信,可以以多种方式来显示信息。例如,输出设备205可以是液晶显示器(liquid crystal display,LCD),发光二级管(light emitting diode,LED)显示设备,阴极射线管(cathode ray tube,CRT)显示设备,或投影仪(projector)等。输入设备206和处理器201通信,可以以多种方式接收用 户的输入。例如,输入设备206可以是鼠标、键盘、触摸屏设备或传感设备等。
上述的计算机设备可以是一个通用计算机设备或者是一个专用计算机设备。在具体实现中,计算机设备可以是台式机、便携式电脑、网络服务器、掌上电脑(Personal Digital Assistant,PDA)、移动手机、平板电脑、无线终端设备、通信设备或者嵌入式设备。本发明实施例不限定计算机设备的类型。
其中,存储器203用于存储执行本发明方案的程序代码,并由处理器201来控制执行。处理器201用于执行存储器203中存储的程序代码210。程序代码210中可以包括一个或多个软件模块(例如,数据导入模块、数据处理模块、压缩字典配置模块和压缩字典共享存储模块等等)。图1中所示的分布式系统中的设备可以通过处理器201以及存储器203中的程序代码210中的一个或多个软件模块,对数据进行处理。
图3是本发明实施例提供的一种数据处理方法的流程图。参见图3,该方法包括:
步骤301:获取至少两个目标文本数据块的压缩字典,该至少两个目标文本数据块为存储的多个文本数据块中后续通过同一处理操作进行处理的数据块,各个目标文本数据块均包括多个文本数据,各个文本数据均包括多个字符,该压缩字典包括该各个目标文本数据块中每个文本数据的压缩码,或者包括该各个目标文本数据块中每个字符的压缩码。
步骤302:基于该压缩字典,分别对该至少两个目标文本数据块中的每个目标文本数据块进行压缩,得到至少两个压缩数据块,该至少两个目标文本数据块与该至少两个压缩数据块一一对应,各个压缩数据块均包括多个压缩数据,该多个压缩数据与该多个文本数据一一对应。
步骤303:当接收到对该至少两个目标文本数据块进行同一处理操作的处理指令时,对该至少两个压缩数据块中的压缩数据进行处理,以实现该至少两个目标文本数据块的处理。
在本发明实施例中,对于后续通过同一处理操作进行处理的至少两个目标文本数据块,获取该至少两个目标文本数据块的压缩字典,基于该压缩字典对该至少两个目标文本数据块中的每个目标文本数据块进行压缩,得到至少两个压缩数据块,也即是,该至少两个目标文本数据块可以共享同一压缩字典,而无需对每个文本数据块都生成一个压缩字典。另外,由于该至少两个目标文本数据块的压缩字典相同,也即是,该至少两个目标文本数据块是通过同一压缩标准进行压缩的,因此,对该至少两个压缩数据块中的压缩数据进行处理,并对处理结果通过该压缩字典进行解压后的结果与对该至少两个目标文本数据块中的文本数据进行处理得到的处理结果相同,所以本发明实施例通过对该至少两个压缩数据块进行处理,来实现对该至少两个目标文本数据块的处理,而无需对该至少两个压缩数据块进行解压,减小了数据处理量,进而缩短了数据的处理时间,以及节省了处理资源。
可选地,获取至少两个目标文本数据块的压缩字典之前,还包括:
从存储的多个文本数据块中,确定至少两个目标文本数据块;
生成该至少两个目标文本数据块的压缩字典。
可选地,从存储的多个文本数据块中,确定至少两个目标文本数据块,包括:
从存储的多个文本数据块中,选择属于目标类别的至少两个文本数据块,将选择 的文本数据块确定为目标文本数据块;或者,
当检测到针对该多个文本数据块中至少两个文本数据块的选择指令时,将选择指令所选择的文本数据块确定为目标文本数据块。
可选地,对该至少两个压缩数据块中的压缩数据进行处理,以实现该至少两个目标文本数据块的处理,包括:
确定该至少两个压缩数据块中每个压缩数据块包括的多个压缩数据;
基于该至少两个压缩数据块中每个压缩数据块包括的多个压缩数据,对该至少两个压缩数据块进行处理,得到该至少两个压缩数据块的处理结果;
基于该压缩字典,对该至少两个压缩数据块的处理结果进行解压,得到至少两个目标文本数据块的处理结果。
可选地,确定该至少两个压缩数据块中每个压缩数据块包括的多个压缩数据,包括:
判断该压缩字典包括的多个压缩码的码长是否相等;
当该压缩字典包括的多个压缩码的码长相等且压缩字典包括各个目标文本数据块中每个文本数据的压缩码时,对于该至少两个压缩数据块中的每个压缩数据块,将各个压缩码的码长确定为该压缩数据块中各个压缩数据的长度;
按照该压缩数据块中各个压缩数据的长度,从该压缩数据块中,依次确定多个压缩数据。
可选地,判断该压缩字典包括的多个压缩码的码长是否相等之后,还包括:
当该压缩字典包括的多个压缩码的码长相等且当压缩字典包括各个目标文本数据块中每个字符的压缩码时,对于该至少两个压缩数据块中的每个压缩数据块,确定该压缩数据块对应的目标文本数据块;
将各个压缩码的码长分别与目标文本数据块中各个文本数据的字符个数相乘,得到该压缩数据块中各个压缩数据的长度;
按照该压缩数据块中各个压缩数据的长度,从该压缩数据块中,依次确定多个压缩数据。
可选地,判断该压缩字典包括的多个压缩码的码长是否相等之后,还包括:
当该压缩字典包括的多个压缩码的码长不相等时,对于该至少两个压缩数据块中的每个压缩数据块,根据该压缩数据块的数据索引,从该压缩数据块中,确定多个压缩数据,该压缩数据块的数据索引用于指示该多个压缩数据中每个压缩数据在该压缩数据块中所处的位置。
可选地,根据该压缩数据块的数据索引,从该压缩数据块中,确定多个压缩数据之前,还包括:
对于该压缩数据块对应的目标文本数据块,在对目标文本数据块进行压缩的过程中,确定目标文本数据块中各个文本数据对应的压缩数据在该压缩数据块中所处的位置;
基于各个文本数据对应的压缩数据在该压缩数据块中所处的位置,生成该压缩数据块的数据索引。
上述所有可选技术方案,均可按照任意结合形成本发明的可选实施例,本发明实 施例对此不再一一赘述。
图4A是本发明实施例提供的一种数据处理方法的流程图。参见图4A,该方法包括:
步骤401:从存储的多个文本数据块中,确定至少两个目标文本数据块。
需要说明的是,该多个文本数据块可以以指定格式进行存储,该指定格式可以预先设置,如该指定格式可以为文本文件(TextFile)格式、Parquet格式、SequenceFile格式、RCFile格式、Avro格式等,本发明实施例对此不做具体限定。且为了提高该多个文本数据块的存取效率,实际应用中通常可以将该多个文本数据块进行分布式存储,该分布式存储是指将该多个文本数据块分散存储在多台独立的设备上,如可以将该多个文本数据基于分布式文件系统(Hadoop Distributed File System,HDFS)进行存储等,本发明实施例对此不做具体限定。
另外,该至少两个目标文本数据块为后续通过同一处理操作进行处理的数据块,各个目标文本数据块均包括多个文本数据,各个文本数据均包括多个字符。例如,如图4B所示,该至少两个目标文本数据块中的某一目标文本数据块中包括多个文本数据,该多个文本数据为101、102、103和104。其中,该处理操作可以包括比较、排序、查找、哈希运算、连接运算等,本发明实施例对此不做具体限定。
再者,文本数据是指由可打印字符组成的数据,该可打印字符包括ASCII中的33~127位的字符、UNICODE中的字符、UTF-8中的字符等,本发明实施例对此不做具体限定。
具体地,该分布式系统从该多个文本数据块中,确定该至少两个目标文本数据块时,可以从该多个文本数据块中,选择属于目标类别的至少两个文本数据块,将选择的文本数据块确定为目标文本数据块;或者,当检测到针对该多个文本数据块中至少两个文本数据块的选择指令时,将选择指令所选择的文本数据块确定为目标文本数据块。
需要说明的是,目标类别下的文本数据块均可以通过同一处理操作进行处理,如目标类别可以为编号、时间、用户名称等,本发明实施例对此不做具体限定。
另外,选择指令用于从该多个文本数据块中选择目标文本数据块,且该选择指令可以由用户触发,该用户可以通过指定操作触发,该指定操作可以为单击操作、双击操作、语音操作等,本发明实施例对此不做具体限定。
由于通常情况下目标类别下的文本数据块均可以通过同一处理操作进行处理,因此,本发明实施例中可以选择属于目标类别的至少两个文本数据块,并将选择的文本数据块确定为目标文本数据块,该确定操作简单方便,且无需用户参与,从而可以提高目标文本数据块的确定效率。
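In addition, because the selection instruction is triggered by a user, determining the text data blocks selected by the selection instruction as the target text data blocks in effect determines the target text data blocks according to the user's operation, which ensures that the determined target text data blocks meet the user's needs.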
例如,该多个文本数据块中,属于目标类别的文本数据块为文本数据块1、文本数据块2、文本数据块3和文本数据块4,则可以从该多个文本数据块中,选择该文本数据块1、文本数据块2、文本数据块3和文本数据块4,之后,可以将该选择的文本 数据块1、文本数据块2、文本数据块3和文本数据块4确定为目标文本数据块。
再例如,检测到针对该多个文本数据块中的文本数据块1、文本数据块2、文本数据块3和文本数据块4的选择指令,则可以将该选择指令所选择的文本数据块1、文本数据块2、文本数据块3和文本数据块4确定为目标文本数据块。
需要说明的是,从该多个文本数据块中,确定该至少两个目标文本数据块的操作可以由该分布式系统中的压缩字典配置模块执行,且实际应用中,在确定该至少两个目标文本数据块之后,该压缩字典配置模块还可以将该至少两个目标文本数据块中每个目标文本数据块的标识发送给压缩字典共享存储模块,以便该压缩字典共享存储模块可以确定该至少两个目标文本数据块之间的关联关系,进而便于后续该压缩字典共享存储模块为该至少两个目标文本数据块生成同一压缩字典。
其中,目标文本数据块的标识用于唯一标识该目标文本数据块,且目标文本数据块的标识可以为该目标文本数据块的名称等,本发明实施例对此不做具体限定。
步骤402:生成该至少两个目标文本数据块的压缩字典。
为了便于后续可以基于同一压缩标准对该至少两个目标文本数据块中的每个目标文本数据块进行压缩,在确定该至少两个目标文本数据块后,可以生成该至少两个目标文本数据块的压缩字典。且在生成该至少两个目标文本数据块的压缩字典时,可以基于指定压缩算法和该至少两个目标文本数据块,生成该至少两个目标文本数据块的压缩字典,当然,实际应用中,也可以以其它方式,生成该至少两个目标文本数据块的压缩字典,本发明实施例对此不做具体限定。
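It should be noted that the compression dictionary of the at least two target text data blocks may include a compression code for each piece of text data in the target text data blocks, or may include a compression code for each character in the target text data blocks, which is not specifically limited in the embodiments of the present invention. For example, suppose the text data included in the at least two target text data blocks is 101, 102, and 103. When the compression dictionary includes a compression code for each piece of text data, it may correspondingly include compression code 1, compression code 2, and compression code 3, as shown in FIG. 4C. When the compression dictionary includes a compression code for each character, suppose text data 101 includes characters 1, 2, and 3 with compression codes 1, 2, and 3, text data 102 includes characters 4, 5, and 6 with compression codes 4, 5, and 6, and text data 103 includes characters 7, 8, and 9 with compression codes 7, 8, and 9; the compression dictionary may then be as shown in FIG. 4D.
In addition, the specified compression algorithm may be preset, and may be an order-preserving compression algorithm, that is, one for which the lexicographic order of the encoded values is the same as the lexicographic order of the original strings. For example, the specified compression algorithm may be the Huffman coding algorithm, the Hu-Tucker coding algorithm, or the like, which is not specifically limited in the embodiments of the present invention.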
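The order-preserving property can be illustrated with a minimal sketch. It does not implement Huffman or Hu-Tucker coding; it assumes a hypothetical rank-based scheme in which every distinct piece of text data across the target blocks receives a fixed-width binary code according to its sorted rank. The function name `build_order_preserving_dictionary` and the sample blocks are illustrative assumptions, not part of the original disclosure.

```python
def build_order_preserving_dictionary(blocks):
    """Assign fixed-width binary codes by sorted rank (hypothetical scheme)."""
    # Collect the distinct text data of all target blocks and sort them.
    values = sorted({text for block in blocks for text in block})
    # Width needed to represent the largest rank in binary (at least 1 bit).
    width = max(1, (len(values) - 1).bit_length())
    return {value: format(rank, f"0{width}b") for rank, value in enumerate(values)}


block_a = ["pear", "apple"]
block_b = ["fig", "apple", "kiwi"]

# One dictionary is generated once and shared by both blocks.
shared = build_order_preserving_dictionary([block_a, block_b])
reverse = {code: text for text, code in shared.items()}

# Sort the compressed codes, then decode only the sorted result.
codes = [shared[text] for text in block_a + block_b]
sorted_via_codes = [reverse[code] for code in sorted(codes)]
```

Because all codes have the same width, comparing them as strings orders them by rank, so sorting the codes and decoding the result gives the same order as sorting the original text.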
其中,基于指定压缩算法和该至少两个目标文本数据块,生成该至少两个目标文本数据块的压缩字典的操作可以参考相关技术,本发明实施例对此不进行详细阐述。
需要说明的是,当该至少两个目标文本数据块是分布式存储时,存储该至少两个目标文本数据块的设备可以在接收到针对该至少两个目标文本数据块的压缩指令时,将该至少两个目标文本数据块发送到压缩字典共享存储模块,以便该压缩字典共享存储模块生成该至少两个文本数据块的压缩字典,并将该压缩字典返回该设备,以便该 设备可以对该至少两个文本数据块进行压缩,当然,该压缩字典共享存储模块也可以主动获取该至少两个目标文本数据块,以便生成该至少两个目标文本数据块的压缩字典,本发明实施例对此不做具体限定。其中,压缩指令用于指示对该至少两个目标文本数据块进行压缩,且该压缩指令可以通过指定操作触发,本发明实施例对此不做具体限定。
另外,该压缩字典共享存储模块生成该压缩字典后,可以将该至少两个目标文本数据块进行删除,以便节省该压缩字典共享存储模块的存储资源。
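Furthermore, after generating the compression dictionary of the at least two target text data blocks, the compression dictionary shared storage module may also store the identifier of each of the at least two target text data blocks together with the compression dictionary in a correspondence between text data block identifiers and compression dictionaries. Later, when the device described above needs the compression dictionary of a target text data block, it can obtain that dictionary simply and quickly from the correspondence based on the identifier of the target text data block.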
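The correspondence between block identifiers and a shared dictionary might be kept as a simple in-memory mapping. The sketch below is only an assumption about one possible shape; the class name `SharedDictionaryStore`, its methods, and the block identifiers are hypothetical and do not come from the original disclosure.

```python
class SharedDictionaryStore:
    """Hypothetical store mapping text data block identifiers to a dictionary."""

    def __init__(self):
        self._dictionary_by_block_id = {}

    def register(self, block_ids, dictionary):
        # Every target block identifier points at the same dictionary object,
        # so the dictionary itself is stored only once.
        for block_id in block_ids:
            self._dictionary_by_block_id[block_id] = dictionary

    def lookup(self, block_id):
        # Returns None when no dictionary has been stored for this block.
        return self._dictionary_by_block_id.get(block_id)


store = SharedDictionaryStore()
shared = {"101": "00", "102": "01", "103": "10"}
store.register(["block1", "block2"], shared)
```

A lookup for either registered identifier returns the one shared dictionary, while an unknown identifier returns `None`, which corresponds to the case where the dictionary still has to be generated.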
需要说明的是,相关技术中在压缩某一文本数据块时,往往是先生成该文本数据块对应的压缩字典,再基于该文本数据块的压缩字典,对该文本数据块进行压缩,得到该文本数据块对应的压缩数据块,之后,将该文本数据块的压缩字典和该文本数据块对应的压缩数据块共同存储,以便后续可以基于该文本数据块的压缩字典,对该文本数据块对应的压缩数据块进行解压。也就是说,相关技术中如果要存储多个压缩数据块,则需要对该多个压缩数据块中每个压缩数据块对应的压缩字典也进行存储,从而消耗了较多的存储资源。而本发明实施例中,由于该至少两个目标文本数据块均使用同一压缩字典,因此,只需要对该至少两个目标文本数据块的压缩字典在该压缩字典共享存储模块中进行一次存储,从而节省了存储资源。
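In addition, in the related art, generating compression dictionaries for a plurality of text data blocks requires one generation operation per text data block, that is, multiple generation operations are needed to obtain the compression dictionaries of all of the text data blocks, which consumes considerable processing resources. In the embodiments of the present invention, the at least two target text data blocks can be determined in advance and all share the same compression dictionary, so a single generation operation on the at least two target text data blocks yields their compression dictionary, which saves processing resources.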
需要说明的是,本发明实施例中,可以通过上述步骤401-402确定该至少两个目标文本数据块的压缩字典,而基于该至少两个目标文本数据块的压缩字典,对该至少两个目标文本数据块进行压缩和处理的操作可以通过如下步骤403-405来实现。
由上述描述可知,本发明实施例提供的方法用于分布式系统中,且压缩字典共享存储模块设置在该分布式系统包括的多个设备中的指定设备中,但是,该多个文本数据块又是分布式地存储在该多个设备中,该至少两个目标文本数据块包含于该多个文本数据块中,也即是,该至少两个目标文本数据块也分布式地存储在该多个设备中,因此,在压缩字典共享存储模块生成该至少两个目标文本数据块的压缩字典之后,可以存储该压缩字典,当该分布式系统需要对该至少两个目标文本数据块进行数据处理时,再按照如下方式来获取该至少两个目标文本数据块的压缩字典,从而进行后续的 处理步骤,具体如下步骤403-405所示。
步骤403:获取该至少两个目标文本数据块的压缩字典。
需要说明的是,分布式系统获取该至少两个目标文本数据块的压缩字典的操作可以通过存储该至少两个目标文本数据块的设备所包括的数据导入模块来执行,具体地,存储该至少两个目标文本数据块的设备包括的数据导入模块可以向该压缩字典共享存储模块发送压缩字典获取请求,该压缩字典获取请求中携带目标文本数据块的标识;当该压缩字典共享存储模块接收到该压缩字典获取请求时,可以基于目标文本数据块的标识,从存储的文本数据块标识与压缩字典之间的对应关系中,获取目标文本数据块的压缩字典,并将目标文本数据块的压缩字典发送给该数据导入模块,当然,实际应用中,也可以以其它方式获取该至少两个目标文本数据块的压缩字典,本发明实施例对此不做具体限定。
另外,实际应用中,可以在接收到针对该至少两个目标文本数据块的压缩指令时,获取该至少两个目标文本数据块的压缩字典,当然,也可以在其它情况下获取该至少两个目标文本数据块的压缩字典,只要保证在对该至少两个目标文本数据块进行压缩之前,已经获取到该至少两个目标文本数据块的压缩字典即可,本发明实施例对此不做具体限定。
需要说明的是,该压缩指令用于指示对该至少两个目标文本数据块进行压缩,且该压缩指令可以由用户触发,当然,该压缩指令也可以在分布式系统检测到某一触发事件时触发,本发明实施例对此不做具体限定。
进一步地,当压缩字典共享存储模块中未存储有该目标文本数据块的压缩字典时,此时,该压缩字典共享存储模块需要基于该目标文本数据块和指定压缩算法,生成该目标文本数据块的压缩字典。
步骤404:基于该至少两个目标文本数据块的压缩字典,分别对该至少两个目标文本数据块中的每个目标文本数据块进行压缩,得到至少两个压缩数据块。
需要说明的是,该至少两个目标文本数据块与该至少两个压缩数据块一一对应,各个压缩数据块均包括多个压缩数据,且对于某个压缩数据块,该压缩数据块中包括的多个压缩数据与该压缩数据块对应的目标文本数据块中包括的多个文本数据一一对应,也即是,该至少两个目标文本数据块中的每个文本数据均在该至少两个压缩数据块中具有唯一对应的压缩数据。
另外,基于该至少两个目标文本数据块的压缩字典,分别对该至少两个目标文本数据块中的每个目标文本数据块进行压缩,得到至少两个压缩数据块的操作可以参考相关技术,本发明实施例对此不进行详细阐述。
步骤405:当接收到对该至少两个目标文本数据块进行同一处理操作的处理指令时,对该至少两个压缩数据块中的压缩数据进行处理,以实现至少两个目标文本数据块的处理。
需要说明的是,由于一般情况下,该多个文本数据块是进行分布式存储的,也即是,在接收到对该多个文本数据块中的某两个文本数据块进行同一处理操作的处理指令时,由于该两个文本数据块的压缩字典不同,因此,需要先对存储的这两个文本数据块对应的两个压缩数据块分别进行解压,以得到这两个文本数据块,进而将这两个 文本数据块传输给进行数据处理的设备,由该设备对这两个文本数据块进行处理。
而本发明实施例中,由于是直接对该至少两个压缩数据块中的压缩数据进行处理,来实现该至少两个目标文本数据块的处理,因此,在对该至少两个目标文本数据块进行处理时,可以只将该至少两个目标文本数据块对应的至少两个压缩数据块传输给进行数据处理的设备,由该设备对该至少两个压缩数据块中的压缩数据进行处理,来实现对该至少两个目标文本数据块的处理。从而相比于相关技术中在对文本数据块进行处理之前需要先对压缩数据块进行解压的方式,本发明实施例中直接基于该压缩数据块即可实现对目标文本数据块的处理,从而可以减少数据处理时间,节省处理资源。另外,相比于相关技术中在该分布式系统包括的多个设备之间传输文本数据块的方式,本发明实施例中只需要传输压缩数据块,从而可以降低数据传输量,提升网络带宽利用率,且节省数据传输时间,再者,相比于相关技术中直接对文本数据进行处理的方式,本发明实施例中只需要对压缩数据进行处理,从而可以减小数据处理量,节省数据处理时间,节省处理资源。
具体地,对该至少两个压缩数据块中的压缩数据进行处理,以实现该至少两个目标文本数据块的处理的操作可以为:确定该至少两个压缩数据块中每个压缩数据块包括的多个压缩数据;基于该至少两个压缩数据块中每个压缩数据块包括的多个压缩数据,对该至少两个压缩数据块进行处理,得到该至少两个压缩数据块的处理结果;基于该压缩字典,对该至少两个压缩数据块的处理结果进行解压,得到该至少两个目标文本数据块的处理结果。
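The compression dictionary stores a compression code for each piece of text data, or for each character, in the target text data blocks. Compressing the at least two target text data blocks therefore amounts to converting the text data in those blocks into compressed data under a fixed conversion rule. Consequently, the result obtained by computing on the compressed data, that is, the processing result of the at least two compressed data blocks, also consists of compression codes. Decompressing that result based on the compression dictionary converts it back into text form and, because the conversion rule is fixed, the converted result is the processing result of the at least two target text data blocks.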
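As a hedged illustration of this reasoning, the sketch below performs a simple join (keeping the entries of one block that also occur in the other) directly on compressed codes and decompresses only the result. The dictionary contents, block contents, and function names are assumptions chosen for the example; each compressed block is modelled as a list of codes for simplicity.

```python
def compress_block(block, dictionary):
    """Replace each piece of text data with its code from the shared dictionary."""
    return [dictionary[text] for text in block]


def join_compressed(left_codes, right_codes):
    """Keep the codes of the left block that also occur in the right block."""
    right_set = set(right_codes)
    return [code for code in left_codes if code in right_set]


def decompress(codes, dictionary):
    """Map codes back to text data using the same shared dictionary."""
    reverse = {code: text for text, code in dictionary.items()}
    return [reverse[code] for code in codes]


shared = {"101": "00", "102": "01", "103": "10", "104": "11"}
block_a = ["101", "102", "103"]
block_b = ["103", "104", "102"]

# Process the compressed data, then decompress only the processing result.
result_codes = join_compressed(compress_block(block_a, shared),
                               compress_block(block_b, shared))
result = decompress(result_codes, shared)

# The same operation applied to the uncompressed text, for comparison.
direct = [text for text in block_a if text in set(block_b)]
```

The equality of `result` and `direct` depends on both blocks using the same dictionary: with two different dictionaries, equal codes would not imply equal text.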
在本发明实施例中,对该至少两个压缩数据块中的压缩数据进行处理,以实现至少两个目标文本数据块的处理时,可以通过设备中的数据处理模块来执行,也即是,该数据处理模块可以对该至少两个压缩数据块中的压缩数据进行处理,进而基于该至少两个目标文本数据块的压缩字典对处理结果进行解压,从而实现对该至少两个目标文本数据块的处理。
其中,基于该至少两个压缩数据块中每个压缩数据块包括的多个压缩数据,对该至少两个压缩数据块进行处理的操作与相关技术中基于某两个文本数据块中每个文本数据块包括的多个文本数据,对这两个文本数据块进行处理的操作类似,本发明实施例对此不进行详细阐述。
其中,基于该压缩字典,对该至少两个压缩数据块的处理结果进行解压,得到该至少两个目标文本数据块的处理结果的操作与相关技术中基于某个压缩数据块对应的压缩字典,对某个压缩数据块进行解压的操作类似,本发明实施例对此不进行详细阐 述。
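When determining the plurality of pieces of compressed data included in each of the at least two compressed data blocks, it may first be determined whether the code lengths of the compression codes included in the compression dictionary are equal; depending on the result, the compressed data is then determined in one of the following three cases:
Case 1: when the code lengths of the compression codes included in the compression dictionary are equal and the compression dictionary includes a compression code for each piece of text data in the target text data blocks, for each compressed data block of the at least two compressed data blocks, the code length of the compression codes is determined as the length of each piece of compressed data in the compressed data block; the pieces of compressed data are then determined in sequence from the compressed data block according to that length.
As described above, the compression dictionary is generated by the specified compression algorithm and, depending on the algorithm, the compression codes in the generated dictionary may not all have the same code length. Because the compressed data blocks are produced using the compression dictionary, when determining the pieces of compressed data included in each of the at least two compressed data blocks, it may first be determined whether the code lengths of the compression codes in the dictionary are equal.
In addition, when the compression dictionary includes a compression code for each piece of text data in the target text data blocks, each piece of text data was compressed using its own compression code, so the code length of the compression codes in the dictionary can be taken as the length of each piece of compressed data in the compressed data block.
When the pieces of compressed data are determined in sequence from the compressed data block according to the length of each piece of compressed data, the compressed data block may be divided successively according to that length, yielding the plurality of pieces of compressed data.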
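Case 1 can be sketched as a fixed-width split. The example assumes the compressed data block is represented as a string of '0' and '1' characters and that the dictionary shown is the shared dictionary; both are illustrative assumptions, as is the function name `split_fixed_length`.

```python
def split_fixed_length(compressed_block, code_length):
    """Divide a compressed block into pieces of one fixed code length."""
    if code_length <= 0 or len(compressed_block) % code_length != 0:
        raise ValueError("block length is not a multiple of the code length")
    return [compressed_block[i:i + code_length]
            for i in range(0, len(compressed_block), code_length)]


# Hypothetical dictionary with one equal-length code per piece of text data.
shared = {"101": "00", "102": "01", "103": "10", "104": "11"}
compressed_block = "".join(shared[t] for t in ["101", "102", "103", "104"])

pieces = split_fixed_length(compressed_block, code_length=2)
```

No index is needed in this case, because every piece of compressed data occupies exactly one code length.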
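Case 2: when the code lengths of the compression codes included in the compression dictionary are equal and the compression dictionary includes a compression code for each character in the target text data blocks, for each compressed data block of the at least two compressed data blocks, the target text data block corresponding to the compressed data block is determined; the code length of the compression codes is multiplied by the number of characters in each piece of text data in the target text data block to obtain the length of each piece of compressed data in the compressed data block; and the pieces of compressed data are determined in sequence from the compressed data block according to those lengths.
When the compression dictionary includes a compression code for each character in the target text data blocks, each character was compressed using its own compression code, and a piece of text data may include multiple characters. The code length of the compression codes in the dictionary can therefore be multiplied by the number of characters in each piece of text data in the target text data block to obtain the length of each piece of compressed data in the compressed data block.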
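Case 2 can be sketched in the same style. The example assumes a bit-string representation of the compressed block and assumes that the character count of each piece of text data is available from the corresponding target text data block; the dictionary, the sample text, and the function name are hypothetical.

```python
from itertools import accumulate


def split_by_char_counts(compressed_block, code_length, char_counts):
    """Split a block whose pieces are code_length * (character count) long."""
    lengths = [code_length * count for count in char_counts]
    ends = list(accumulate(lengths))   # end offset of each piece
    starts = [0] + ends[:-1]           # start offset of each piece
    if ends and ends[-1] != len(compressed_block):
        raise ValueError("character counts do not match the block length")
    return [compressed_block[s:e] for s, e in zip(starts, ends)]


# Hypothetical per-character dictionary with equal-length codes.
char_codes = {"a": "00", "b": "01", "c": "10"}
text_block = ["ab", "cab", "c"]
compressed_block = "".join(char_codes[ch] for text in text_block for ch in text)

pieces = split_by_char_counts(compressed_block, 2, [len(t) for t in text_block])
```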
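Case 3: when the code lengths of the compression codes included in the compression dictionary are not equal, for each compressed data block of the at least two compressed data blocks, the pieces of compressed data are determined from the compressed data block according to the data index of the compressed data block, where the data index indicates the location of each piece of compressed data in the compressed data block.
Further, before the pieces of compressed data are determined from the compressed data block according to its data index, the method also includes: for the target text data block corresponding to the compressed data block, determining, in the process of compressing the target text data block, the location in the compressed data block of the compressed data corresponding to each piece of text data in the target text data block; and generating the data index of the compressed data block based on those locations.
It should be noted that, when these locations are determined during compression, the start position and end position in the compressed data block of the compressed data corresponding to each piece of text data may be determined, so that the start and end positions uniquely identify where that compressed data lies in the compressed data block. For convenience, the start position of each piece of compressed data in the compressed data block may be used as the data index of the compressed data block.
For example, suppose a piece of text data aaabbbccc in the target text data block is compressed into the compressed data 0001010, and the compression dictionary records that aaa is encoded as 00, bbb as 010, and ccc as 10. To obtain the compressed data corresponding to each field without decompressing, the positions are recorded while the text data is compressed: aaa starts at position 0 and ends at position 2, so its length is 2-0=2; bbb starts at position 2 and ends at position 5, so its length is 5-2=3; ccc starts at position 5 and ends at position 7, so its length is 7-5=2. Referring to FIG. 4E, the start position 0 of compressed data 00, the start position 2 of compressed data 010, and the start position 5 of compressed data 10 may be determined as the data index of the compressed data block.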
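The aaabbbccc example can be reproduced with a short sketch that records start positions while compressing and later uses them to split the block without decompressing. Representing the compressed data as a string of '0' and '1' characters and the function names used here are assumptions made for illustration.

```python
def compress_with_index(fields, dictionary):
    """Concatenate the code of each field and record where each code starts."""
    parts, index, position = [], [], 0
    for field in fields:
        code = dictionary[field]
        index.append(position)      # start position of this piece
        parts.append(code)
        position += len(code)       # the end position is the next start
    return "".join(parts), index


def split_by_index(compressed_block, index):
    """Recover the pieces of compressed data from their start positions."""
    ends = index[1:] + [len(compressed_block)]
    return [compressed_block[s:e] for s, e in zip(index, ends)]


# Variable-length codes taken from the example above.
dictionary = {"aaa": "00", "bbb": "010", "ccc": "10"}
compressed_block, data_index = compress_with_index(["aaa", "bbb", "ccc"], dictionary)
pieces = split_by_index(compressed_block, data_index)
```

Storing only the start positions is sufficient because the end of each piece coincides with the start of the next one, and the last piece ends at the end of the block.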
在本发明实施例中,对于后续通过同一处理操作进行处理的至少两个目标文本数据块,获取该至少两个目标文本数据块的压缩字典,基于该压缩字典对该至少两个目标文本数据块中的每个目标文本数据块进行压缩,得到至少两个压缩数据块,也即是,该至少两个目标文本数据块可以共享同一压缩字典,而无需对每个文本数据块都生成一个压缩字典。另外,由于该至少两个目标文本数据块的压缩字典相同,也即是,该至少两个目标文本数据块是通过同一压缩标准进行压缩的,因此,对该至少两个压缩数据块中的压缩数据进行处理,并对处理结果通过该压缩字典进行解压后的结果与对该至少两个目标文本数据块中的文本数据进行处理得到的处理结果相同,所以本发明实施例通过对该至少两个压缩数据块进行处理,来实现对该至少两个目标文本数据块的处理,而无需对该至少两个压缩数据块进行解压,减小了数据处理量,进而缩短了数据的处理时间,以及节省了处理资源。
图5A是本发明实施例提供的一种数据处理装置的结构示意图。参照图5A,该数据处理装置包括:
获取模块501,用于获取至少两个目标文本数据块的压缩字典,该至少两个目标文本数据块为存储的多个文本数据块中后续通过同一处理操作进行处理的数据块,各个目标文本数据块均包括多个文本数据,各个文本数据均包括多个字符,该压缩字典包括该各个目标文本数据块中每个文本数据的压缩码,或者包括该各个目标文本数据块中每个字符的压缩码;
压缩模块502,用于基于该压缩字典,分别对该至少两个目标文本数据块中的每个目标文本数据块进行压缩,得到至少两个压缩数据块,该至少两个目标文本数据块与该至少两个压缩数据块一一对应,各个压缩数据块均包括多个压缩数据,该多个压缩数据与该多个文本数据一一对应;
处理模块503,用于当接收到对该至少两个目标文本数据块进行同一处理操作的处理指令时,对该至少两个压缩数据块中的压缩数据进行处理,以实现该至少两个目标文本数据块的处理。
进一步地,该装置还包括:
确定模块,用于从存储的多个文本数据块中,确定该至少两个目标文本数据块;
生成模块,用于生成该至少两个目标文本数据块的压缩字典。
其中,确定模块包括:
选择单元,用于从该多个文本数据块中,选择属于目标类别的至少两个文本数据块,将选择的文本数据块确定为目标文本数据块;或者,
第一确定单元,用于当检测到针对该多个文本数据块中至少两个文本数据块的选择指令时,将该选择指令所选择的文本数据块确定为目标文本数据块。
参照图5B,处理模块503包括:
第二确定单元5031,用于确定该至少两个压缩数据块中每个压缩数据块包括的多个压缩数据;
处理单元5032,用于基于该至少两个压缩数据块中每个压缩数据块包括的多个压缩数据,对该至少两个压缩数据块进行处理,得到该至少两个压缩数据块的处理结果;
解压单元5033,用于基于该压缩字典,对该至少两个压缩数据块的处理结果进行解压,得到该至少两个目标文本数据块的处理结果。
其中,第二确定单元5031包括:
判断子单元,用于判断该压缩字典包括的多个压缩码的码长是否相等;
第一确定子单元,用于当该压缩字典包括的多个压缩码的码长相等且该压缩字典包括该各个目标文本数据块中每个文本数据的压缩码时,对于该至少两个压缩数据块中的每个压缩数据块,将各个压缩码的码长确定为该压缩数据块中各个压缩数据的长度;
第二确定子单元,用于按照该压缩数据块中各个压缩数据的长度,从该压缩数据块中,依次确定该多个压缩数据。
进一步地,第二确定单元5031还包括:
第三确定子单元,用于当该压缩字典包括的多个压缩码的码长相等且当该压缩字典包括该各个目标文本数据块中每个字符的压缩码时,对于该至少两个压缩数据块中的每个压缩数据块,确定该压缩数据块对应的目标文本数据块;
运算子单元,用于将各个压缩码的码长分别与该目标文本数据块中各个文本数据的字符个数相乘,得到该压缩数据块中各个压缩数据的长度;
第四确定子单元,用于按照该压缩数据块中各个压缩数据的长度,从该压缩数据块中,依次确定该多个压缩数据。
进一步地,第二确定单元5031还包括:
第五确定子单元,用于当该压缩字典包括的多个压缩码的码长不相等时,对于该至少两个压缩数据块中的每个压缩数据块,根据该压缩数据块的数据索引,从该压缩数据块中,确定该多个压缩数据,该压缩数据块的数据索引用于指示该多个压缩数据中每个压缩数据在该压缩数据块中所处的位置。
进一步地,第二确定单元5031还包括:
第六确定子单元,用于对于该压缩数据块对应的目标文本数据块,在对该目标文本数据块进行压缩的过程中,确定该目标文本数据块中各个文本数据对应的压缩数据 在该压缩数据块中所处的位置;
生成子单元,用于基于该各个文本数据对应的压缩数据在该压缩数据块中所处的位置,生成该压缩数据块的数据索引。
在本发明实施例中,对于后续通过同一处理操作进行处理的至少两个目标文本数据块,获取该至少两个目标文本数据块的压缩字典,基于该压缩字典对该至少两个目标文本数据块中的每个目标文本数据块进行压缩,得到至少两个压缩数据块,也即是,该至少两个目标文本数据块可以共享同一压缩字典,而无需对每个文本数据块都生成一个压缩字典。另外,由于该至少两个目标文本数据块的压缩字典相同,也即是,该至少两个目标文本数据块是通过同一压缩标准进行压缩的,因此,对该至少两个压缩数据块中的压缩数据进行处理,并对处理结果通过该压缩字典进行解压后的结果与对该至少两个目标文本数据块中的文本数据进行处理得到的处理结果相同,所以本发明实施例通过对该至少两个压缩数据块进行处理,来实现对该至少两个目标文本数据块的处理,而无需对该至少两个压缩数据块进行解压,减小了数据处理量,进而缩短了数据的处理时间,以及节省了处理资源。
需要说明的是:上述实施例提供的数据处理装置在数据处理时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的数据处理装置与数据处理方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
以上所述仅为本发明的较佳实施例,并不用以限制本发明,凡在本发明的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本发明的保护范围之内。

Claims (16)

  1. 一种数据处理方法,其特征在于,所述方法包括:
    获取至少两个目标文本数据块的压缩字典,所述至少两个目标文本数据块为存储的多个文本数据块中后续通过同一处理操作进行处理的数据块,各个目标文本数据块均包括多个文本数据,各个文本数据均包括多个字符,所述压缩字典包括所述各个目标文本数据块中每个文本数据的压缩码,或者包括所述各个目标文本数据块中每个字符的压缩码;
    基于所述压缩字典,分别对所述至少两个目标文本数据块中的每个目标文本数据块进行压缩,得到至少两个压缩数据块,所述至少两个目标文本数据块与所述至少两个压缩数据块一一对应,各个压缩数据块均包括多个压缩数据,所述多个压缩数据与所述多个文本数据一一对应;
    当接收到对所述至少两个目标文本数据块进行同一处理操作的处理指令时,对所述至少两个压缩数据块中的压缩数据进行处理,以实现所述至少两个目标文本数据块的处理。
  2. 如权利要求1所述的方法,其特征在于,所述获取至少两个目标文本数据块的压缩字典之前,还包括:
    从存储的多个文本数据块中,确定所述至少两个目标文本数据块;
    生成所述至少两个目标文本数据块的压缩字典。
  3. 如权利要求2所述的方法,其特征在于,所述从存储的多个文本数据块中,确定所述至少两个目标文本数据块,包括:
    从所述多个文本数据块中,选择属于目标类别的至少两个文本数据块,将选择的文本数据块确定为目标文本数据块;或者,
    当检测到针对所述多个文本数据块中至少两个文本数据块的选择指令时,将所述选择指令所选择的文本数据块确定为目标文本数据块。
  4. 如权利要求1所述的方法,其特征在于,所述对所述至少两个压缩数据块中的压缩数据进行处理,以实现所述至少两个目标文本数据块的处理,包括:
    确定所述至少两个压缩数据块中每个压缩数据块包括的多个压缩数据;
    基于所述至少两个压缩数据块中每个压缩数据块包括的多个压缩数据,对所述至少两个压缩数据块进行处理,得到所述至少两个压缩数据块的处理结果;
    基于所述压缩字典,对所述至少两个压缩数据块的处理结果进行解压,得到所述至少两个目标文本数据块的处理结果。
  5. 如权利要求4所述的方法,其特征在于,所述确定所述至少两个压缩数据块中每个压缩数据块包括的多个压缩数据,包括:
    判断所述压缩字典包括的多个压缩码的码长是否相等;
    当所述压缩字典包括的多个压缩码的码长相等且所述压缩字典包括所述各个目标文本数据块中每个文本数据的压缩码时,对于所述至少两个压缩数据块中的每个压缩数据块,将各个压缩码的码长确定为所述压缩数据块中各个压缩数据的长度;
    按照所述压缩数据块中各个压缩数据的长度,从所述压缩数据块中,依次确定所述多个压缩数据。
  6. 如权利要求5所述的方法,其特征在于,所述判断所述压缩字典包括的多个压缩码的码长是否相等之后,还包括:
    当所述压缩字典包括的多个压缩码的码长相等且当所述压缩字典包括所述各个目标文本数据块中每个字符的压缩码时,对于所述至少两个压缩数据块中的每个压缩数据块,确定所述压缩数据块对应的目标文本数据块;
    将各个压缩码的码长分别与所述目标文本数据块中各个文本数据的字符个数相乘,得到所述压缩数据块中各个压缩数据的长度;
    按照所述压缩数据块中各个压缩数据的长度,从所述压缩数据块中,依次确定所述多个压缩数据。
  7. 如权利要求5所述的方法,其特征在于,所述判断所述压缩字典包括的多个压缩码的码长是否相等之后,还包括:
    当所述压缩字典包括的多个压缩码的码长不相等时,对于所述至少两个压缩数据块中的每个压缩数据块,根据所述压缩数据块的数据索引,从所述压缩数据块中,确定所述多个压缩数据,所述压缩数据块的数据索引用于指示所述多个压缩数据中每个压缩数据在所述压缩数据块中所处的位置。
  8. 如权利要求7所述的方法,其特征在于,所述根据所述压缩数据块的数据索引,从所述压缩数据块中,确定所述多个压缩数据之前,还包括:
    对于所述压缩数据块对应的目标文本数据块,在对所述目标文本数据块进行压缩的过程中,确定所述目标文本数据块中各个文本数据对应的压缩数据在所述压缩数据块中所处的位置;
    基于所述各个文本数据对应的压缩数据在所述压缩数据块中所处的位置,生成所述压缩数据块的数据索引。
  9. 一种数据处理装置,其特征在于,所述装置包括:
    获取模块,用于获取至少两个目标文本数据块的压缩字典,所述至少两个目标文本数据块为存储的多个文本数据块中后续通过同一处理操作进行处理的数据块,各个目标文本数据块均包括多个文本数据,各个文本数据均包括多个字符,所述压缩字典包括所述各个目标文本数据块中每个文本数据的压缩码,或者包括所述各个目标文本数据块中每个字符的压缩码;
    压缩模块,用于基于所述压缩字典,分别对所述至少两个目标文本数据块中的每个目标文本数据块进行压缩,得到至少两个压缩数据块,所述至少两个目标文本数据块与所述至少两个压缩数据块一一对应,各个压缩数据块均包括多个压缩数据,所述多个压缩数据与所述多个文本数据一一对应;
    处理模块,用于当接收到对所述至少两个目标文本数据块进行同一处理操作的处理指令时,对所述至少两个压缩数据块中的压缩数据进行处理,以实现所述至少两个目标文本数据块的处理。
  10. 如权利要求9所述的装置,其特征在于,所述装置还包括:
    确定模块,用于从存储的多个文本数据块中,确定所述至少两个目标文本数据块;
    生成模块,用于生成所述至少两个目标文本数据块的压缩字典。
  11. 如权利要求10所述的装置,其特征在于,所述确定模块包括:
    选择单元,用于从所述多个文本数据块中,选择属于目标类别的至少两个文本数据块,将选择的文本数据块确定为目标文本数据块;或者,
    第一确定单元,用于当检测到针对所述多个文本数据块中至少两个文本数据块的选择指令时,将所述选择指令所选择的文本数据块确定为目标文本数据块。
  12. 如权利要求9所述的装置,其特征在于,所述处理模块包括:
    第二确定单元,用于确定所述至少两个压缩数据块中每个压缩数据块包括的多个压缩数据;
    处理单元,用于基于所述至少两个压缩数据块中每个压缩数据块包括的多个压缩数据,对所述至少两个压缩数据块进行处理,得到所述至少两个压缩数据块的处理结果;
    解压单元,用于基于所述压缩字典,对所述至少两个压缩数据块的处理结果进行解压,得到所述至少两个目标文本数据块的处理结果。
  13. 如权利要求12所述的装置,其特征在于,所述第二确定单元包括:
    判断子单元,用于判断所述压缩字典包括的多个压缩码的码长是否相等;
    第一确定子单元,用于当所述压缩字典包括的多个压缩码的码长相等且所述压缩字典包括所述各个目标文本数据块中每个文本数据的压缩码时,对于所述至少两个压缩数据块中的每个压缩数据块,将各个压缩码的码长确定为所述压缩数据块中各个压缩数据的长度;
    第二确定子单元,用于按照所述压缩数据块中各个压缩数据的长度,从所述压缩数据块中,依次确定所述多个压缩数据。
  14. 如权利要求13所述的装置,其特征在于,所述第二确定单元还包括:
    第三确定子单元,用于当所述压缩字典包括的多个压缩码的码长相等且当所述压缩字典包括所述各个目标文本数据块中每个字符的压缩码时,对于所述至少两个压缩数据块中的每个压缩数据块,确定所述压缩数据块对应的目标文本数据块;
    运算子单元,用于将各个压缩码的码长分别与所述目标文本数据块中各个文本数据的字符个数相乘,得到所述压缩数据块中各个压缩数据的长度;
    第四确定子单元,用于按照所述压缩数据块中各个压缩数据的长度,从所述压缩数据块中,依次确定所述多个压缩数据。
  15. 如权利要求13所述的装置,其特征在于,所述第二确定单元还包括:
    第五确定子单元,用于当所述压缩字典包括的多个压缩码的码长不相等时,对于所述至少两个压缩数据块中的每个压缩数据块,根据所述压缩数据块的数据索引,从所述压缩数据块中,确定所述多个压缩数据,所述压缩数据块的数据索引用于指示所述多个压缩数据中每个压缩数据在所述压缩数据块中所处的位置。
  16. 如权利要求15所述的装置,其特征在于,所述第二确定单元还包括:
    第六确定子单元,用于对于所述压缩数据块对应的目标文本数据块,在对所述目标文本数据块进行压缩的过程中,确定所述目标文本数据块中各个文本数据对应的压缩数据在所述压缩数据块中所处的位置;
    生成子单元,用于基于所述各个文本数据对应的压缩数据在所述压缩数据块中所处的位置,生成所述压缩数据块的数据索引。
PCT/CN2017/092527 2016-07-22 2017-07-11 数据处理方法及装置 WO2018014761A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610590825.2 2016-07-22
CN201610590825.2A CN107643906B (zh) 2016-07-22 2016-07-22 数据处理方法及装置

Publications (1)

Publication Number Publication Date
WO2018014761A1 true WO2018014761A1 (zh) 2018-01-25

Family

ID=60992963

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/092527 WO2018014761A1 (zh) 2016-07-22 2017-07-11 数据处理方法及装置

Country Status (2)

Country Link
CN (1) CN107643906B (zh)
WO (1) WO2018014761A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765111A (zh) * 2019-10-21 2021-05-07 伊姆西Ip控股有限责任公司 用于处理数据的方法、设备和计算机程序产品
US11061571B1 (en) * 2020-03-19 2021-07-13 Nvidia Corporation Techniques for efficiently organizing and accessing compressible data
CN114979794B (zh) * 2022-05-13 2023-11-14 深圳智慧林网络科技有限公司 一种数据发送方法及装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080294863A1 (en) * 2007-05-21 2008-11-27 Sap Ag Block compression of tables with repeated values
CN101320372A (zh) * 2008-05-22 2008-12-10 上海爱数软件有限公司 一种重复数据的压缩方法
CN102473175A (zh) * 2009-07-31 2012-05-23 惠普开发有限公司 Xml数据的压缩
CN103326732A (zh) * 2013-05-10 2013-09-25 华为技术有限公司 压缩数据的方法、解压数据的方法、编码器和解码器
CN104023070A (zh) * 2014-06-16 2014-09-03 杜海洋 基于云存储的文件压缩方法

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104283777B (zh) * 2013-07-03 2018-08-21 华为技术有限公司 报文压缩的方法和装置
CN105893337B (zh) * 2015-01-04 2020-07-10 伊姆西Ip控股有限责任公司 用于文本压缩和解压缩的方法和设备

Also Published As

Publication number Publication date
CN107643906B (zh) 2021-01-05
CN107643906A (zh) 2018-01-30


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17830395

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17830395

Country of ref document: EP

Kind code of ref document: A1