CN107643906A - Data processing method and device - Google Patents

Publication number
CN107643906A
Authority
CN
China
Prior art keywords: data block, compression, compressed, data, text data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610590825.2A
Other languages
Chinese (zh)
Other versions
CN107643906B (en)
Inventor
李雪斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201610590825.2A (CN107643906B)
Priority to PCT/CN2017/092527 (WO2018014761A1)
Publication of CN107643906A
Application granted
Publication of CN107643906B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a data processing method and device, belonging to the field of computer technology. The method includes: obtaining a compression dictionary of at least two target text data blocks; compressing, based on the compression dictionary, each of the at least two target text data blocks to obtain at least two compressed data blocks; and, when a processing instruction to perform the same processing operation on the at least two target text data blocks is received, processing the compressed data in the at least two compressed data blocks so as to process the at least two target text data blocks. Because the invention processes the compressed data directly, the at least two compressed data blocks do not need to be decompressed, which reduces the amount of data processed, shortens processing time, and saves processing resources.

Description

Data processing method and device
Technical field
The present invention relates to the field of computer technology, and in particular to a data processing method and device.
Background
With the development of computer technology, large amounts of text data need to be stored and analyzed. Text data refers to data composed of printable characters, including the characters numbered 33 to 127 in the American Standard Code for Information Interchange (ASCII), characters in Unicode, characters in UTF-8, and so on. To save the time and space taken up during storage and transmission, text data is first compressed and then stored in compressed form. Later, when the text data is to be analyzed, the compressed text data is first decompressed to recover the text data; the text data is then processed, for example by comparison, sorting, searching, hash operations, or concatenation operations; and the text data is analyzed based on the processing result.
At present, a data processing method is provided as follows. For each text data block among multiple stored text data blocks, a compression dictionary of the text data block is generated, where the text data block includes multiple pieces of text data. The text data block is compressed based on its compression dictionary to obtain the corresponding compressed data block, and the compressed data block is stored. When a processing instruction to perform the same processing operation on a first text data block and a second text data block is received, the compressed data block corresponding to the first text data block and the compressed data block corresponding to the second text data block are obtained, where the first and second text data blocks are any two of the multiple text data blocks. The compressed data block corresponding to the first text data block is decompressed to obtain the first text data block, and the compressed data block corresponding to the second text data block is decompressed to obtain the second text data block. The text data in the first and second text data blocks is then processed to obtain a processing result.
Because both compressed data blocks must be decompressed before the first and second text data blocks can be processed, data processing takes a long time and consumes considerable processing resources.
Summary of the invention
To solve the problems of the prior art, embodiments of the present invention provide a data processing method and device. The technical solution is as follows:
In a first aspect, a data processing method is provided, the method including:
Obtaining a compression dictionary of at least two target text data blocks, where the at least two target text data blocks are data blocks, among multiple stored text data blocks, that are subsequently to be processed by the same processing operation; each target text data block includes multiple pieces of text data, and each piece of text data includes multiple characters; the compression dictionary includes a compressed code for each piece of text data in each target text data block, or a compressed code for each character in each target text data block;
Compressing, based on the compression dictionary, each of the at least two target text data blocks to obtain at least two compressed data blocks, where the at least two target text data blocks correspond one-to-one with the at least two compressed data blocks, each compressed data block includes multiple pieces of compressed data, and the multiple pieces of compressed data correspond one-to-one with the multiple pieces of text data;
When a processing instruction to perform the same processing operation on the at least two target text data blocks is received, processing the compressed data in the at least two compressed data blocks, so as to process the at least two target text data blocks.
It should be noted that the compression dictionary of the at least two target text data blocks can include a compressed code for each piece of text data in each target text data block, or a compressed code for each character in each target text data block. That is, the compressed codes in the compression dictionary can correspond one-to-one with pieces of text data or one-to-one with characters, which the embodiments of the present invention do not specifically limit.
In addition, the at least two target text data blocks are data blocks that are subsequently processed by the same processing operation, and they correspond to the same compression dictionary. That is, the at least two target text data blocks share one compression dictionary, so a separate compression dictionary does not need to be generated for each text data block. When a processing instruction to perform the same processing operation on the at least two compressed data blocks is received, the compressed data in those blocks can be processed directly, because the blocks were obtained by compressing the target text data blocks with the shared compression dictionary. Processing the compressed data thereby processes the at least two target text data blocks.
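The shared-dictionary idea above can be illustrated with a minimal sketch. It assumes the compressed codes are small integers assigned per piece of text data and that the processing operation is an equality comparison; the function names (`build_shared_dictionary`, `compress_block`, `equal_positions`) are hypothetical and do not come from the patent.

```python
from typing import Dict, List


def build_shared_dictionary(blocks: List[List[str]]) -> Dict[str, int]:
    """Assign one integer code per distinct piece of text data across all blocks.

    Assumption: codes are sequential integers; the patent does not fix a code format.
    """
    dictionary: Dict[str, int] = {}
    for block in blocks:
        for text in block:
            if text not in dictionary:
                dictionary[text] = len(dictionary)
    return dictionary


def compress_block(block: List[str], dictionary: Dict[str, int]) -> List[int]:
    """Replace each piece of text data with its code from the shared dictionary."""
    return [dictionary[text] for text in block]


def equal_positions(a: List[int], b: List[int]) -> List[int]:
    """Compare two compressed blocks position by position without decompressing."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x == y]


block_a = ["beijing", "shanghai", "shenzhen"]
block_b = ["beijing", "hangzhou", "shenzhen"]

shared = build_shared_dictionary([block_a, block_b])
comp_a = compress_block(block_a, shared)
comp_b = compress_block(block_b, shared)
matches = equal_positions(comp_a, comp_b)
```

Because both blocks use the same dictionary, equal codes imply equal text data, so the comparison on codes gives the same positions as a comparison on the original strings. With per-block dictionaries the same code could stand for different strings, and this shortcut would not be valid.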
With reference to the first aspect, in a first possible implementation of the first aspect, before obtaining the compression dictionary of the at least two target text data blocks, the method further includes:
Determining the at least two target text data blocks from the multiple stored text data blocks;
Generating the compression dictionary of the at least two target text data blocks.
So that each of the at least two target text data blocks can subsequently be compressed according to the same compression standard, the compression dictionary can be generated once the at least two target text data blocks have been determined. The compression dictionary can be generated based on a specified compression algorithm and the at least two target text data blocks. Of course, in practical applications the compression dictionary can also be generated in other ways, which the embodiments of the present invention do not specifically limit.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, determining the at least two target text data blocks from the multiple stored text data blocks includes:
Selecting, from the multiple text data blocks, at least two text data blocks that belong to a target category, and determining the selected text data blocks as target text data blocks; or
When a selection instruction for at least two of the multiple text data blocks is detected, determining the text data blocks selected by the selection instruction as target text data blocks.
Text data blocks in the same target category are normally processed by the same processing operation. The embodiments of the present invention can therefore select at least two text data blocks that belong to the target category and determine them as target text data blocks. This determination is simple and requires no user involvement, which improves the efficiency of determining target text data blocks.
In addition, because the selection instruction is triggered by a user, determining the text data blocks selected by the selection instruction as target text data blocks means that the target text data blocks are determined by user operation, which ensures that they meet the user's needs.
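The two ways of determining target text data blocks can be sketched as follows. This is only an illustration: the `TextBlock` record, its `category` field, and the `determine_targets` function are assumptions introduced here, and the patent does not say how categories or selection instructions are represented.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class TextBlock:
    name: str        # hypothetical identifier of a stored text data block
    category: str    # hypothetical category label
    rows: List[str]  # the pieces of text data in the block


def determine_targets(
    blocks: Sequence[TextBlock],
    target_category: Optional[str] = None,
    selected_names: Optional[Sequence[str]] = None,
) -> List[TextBlock]:
    """Pick target blocks by explicit user selection, else by category."""
    if selected_names is not None:
        wanted = set(selected_names)
        chosen = [b for b in blocks if b.name in wanted]
    elif target_category is not None:
        chosen = [b for b in blocks if b.category == target_category]
    else:
        raise ValueError("either a category or a selection is required")
    if len(chosen) < 2:
        raise ValueError("at least two target text data blocks are required")
    return chosen


stored = [
    TextBlock("orders_city", "city", ["beijing", "shanghai"]),
    TextBlock("users_city", "city", ["shenzhen", "beijing"]),
    TextBlock("comments", "free_text", ["good", "bad"]),
]

by_category = determine_targets(stored, target_category="city")
by_selection = determine_targets(stored, selected_names=["users_city", "comments"])
```

The category path needs no user input, while the selection path follows whatever the user picked, which mirrors the trade-off described above.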
With reference to the first aspect, in a third possible implementation of the first aspect, processing the compressed data in the at least two compressed data blocks, so as to process the at least two target text data blocks, includes:
Determining the multiple pieces of compressed data included in each of the at least two compressed data blocks;
Processing the at least two compressed data blocks based on the multiple pieces of compressed data included in each of them, to obtain a processing result of the at least two compressed data blocks;
Decompressing, based on the compression dictionary, the processing result of the at least two compressed data blocks, to obtain the processing result of the at least two target text data blocks.
The compression dictionary stores the compressed code of each piece of text data, or of each character, in each target text data block. Compressing the at least two target text data blocks therefore amounts to converting their text data into compressed data under a fixed conversion rule. It follows that the processing result computed from the compressed data is itself made up of compressed codes. Decompressing that result based on the compression dictionary converts it from compressed codes back into text form, and because the conversion rule is fixed, the converted result is the processing result of the at least two target text data blocks.
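A small sketch of "process the codes, then decompress only the result" is given below. It assumes integer codes per piece of text data and uses an intersection of two blocks as the processing operation; both choices, and the names `intersect_compressed` and `decode`, are illustrative assumptions rather than the patent's own definitions.

```python
from typing import Dict, List


def intersect_compressed(a: List[int], b: List[int]) -> List[int]:
    """Return the distinct codes present in both blocks, in order of first use in a."""
    in_b = set(b)
    result: List[int] = []
    for code in a:
        if code in in_b and code not in result:
            result.append(code)
    return result


def decode(codes: List[int], dictionary: Dict[str, int]) -> List[str]:
    """Map result codes back to text data using the shared dictionary."""
    reverse = {code: text for text, code in dictionary.items()}
    return [reverse[code] for code in codes]


# Hypothetical shared dictionary and two blocks already compressed with it.
dictionary = {"apple": 0, "pear": 1, "plum": 2, "fig": 3}
comp_a = [0, 1, 2, 1]   # apple, pear, plum, pear
comp_b = [2, 3, 0]      # plum, fig, apple

result_codes = intersect_compressed(comp_a, comp_b)
result_text = decode(result_codes, dictionary)
```

Only the (usually small) result is decoded; the two input blocks are never decompressed. Note that an order-dependent operation such as sorting would additionally require the codes to preserve the order of the text data, which this sketch does not attempt and the passage above does not specify.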
It should be noted that the multiple text data blocks are generally stored in a distributed manner. In the related art, when a processing instruction to perform the same processing operation on two of the text data blocks is received, the compression dictionaries of the two text data blocks differ. The two stored compressed data blocks must therefore first be decompressed separately to obtain the two text data blocks, which are then transmitted to the device that performs data processing and processed by that device.
In the embodiments of the present invention, by contrast, the compressed data in the at least two compressed data blocks is processed directly in order to process the at least two target text data blocks. When the at least two target text data blocks are to be processed, only the corresponding compressed data blocks need to be transmitted to the device that performs data processing, and that device processes the compressed data in them. Compared with the related art, in which compressed data blocks must be decompressed before the text data blocks are processed, the embodiments can process the target text data blocks directly from the compressed data blocks, which reduces data processing time and saves processing resources. In addition, compared with the related art, in which text data blocks are transmitted between the devices of the distributed system, only compressed data blocks need to be transmitted, which reduces the volume of transmitted data, improves network bandwidth utilization, and saves transmission time. Furthermore, compared with the related art, in which text data is processed directly, only compressed data needs to be processed, which reduces the amount of data processed and saves processing time and resources.
With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, determining the multiple pieces of compressed data included in each of the at least two compressed data blocks includes:
Judging whether the code lengths of the multiple compressed codes included in the compression dictionary are equal;
When the code lengths of the multiple compressed codes are equal and the compression dictionary includes a compressed code for each piece of text data in each target text data block, determining, for each of the at least two compressed data blocks, the code length of each compressed code as the length of each piece of compressed data in the compressed data block;
Determining the multiple pieces of compressed data in turn from the compressed data block according to the length of each piece of compressed data in the compressed data block.
As described above, the compression dictionary is generated by a specified compression algorithm, and depending on the compression algorithm, the code lengths of the compressed codes in the generated dictionary may be unequal. Because the compressed data blocks are produced by compressing with this dictionary, determining the multiple pieces of compressed data included in each compressed data block starts by judging whether the code lengths of the compressed codes in the dictionary are equal.
In addition, when the compression dictionary includes a compressed code for each piece of text data in each target text data block, each piece of text data is compressed as a whole into its compressed code. The code length of each compressed code in the dictionary can therefore be taken as the length of each piece of compressed data in the compressed data block.
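For the equal-length case with one code per piece of text data, splitting a compressed block is a matter of cutting it into fixed-size pieces. The sketch below assumes codes are byte strings and a compressed block is their concatenation; `codes_have_equal_length` and `split_fixed` are hypothetical helper names.

```python
from typing import Dict, List


def codes_have_equal_length(dictionary: Dict[str, bytes]) -> bool:
    """Judge whether every compressed code in the dictionary has the same length."""
    return len({len(code) for code in dictionary.values()}) <= 1


def split_fixed(compressed: bytes, code_length: int) -> List[bytes]:
    """Cut a compressed block into pieces of code_length bytes each."""
    if code_length <= 0 or len(compressed) % code_length != 0:
        raise ValueError("block length is not a multiple of the code length")
    return [
        compressed[i:i + code_length]
        for i in range(0, len(compressed), code_length)
    ]


# Hypothetical dictionary: one 2-byte code per piece of text data.
dictionary = {"red": b"\x00\x01", "green": b"\x00\x02", "blue": b"\x00\x03"}
block = ["red", "blue", "green"]
compressed = b"".join(dictionary[text] for text in block)

assert codes_have_equal_length(dictionary)
code_length = len(next(iter(dictionary.values())))
pieces = split_fixed(compressed, code_length)
```

No per-block index is needed here, since the position of the n-th piece of compressed data is simply n times the code length.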
With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, after judging whether the code lengths of the multiple compressed codes included in the compression dictionary are equal, the method further includes:
When the code lengths of the multiple compressed codes are equal and the compression dictionary includes a compressed code for each character in each target text data block, determining, for each of the at least two compressed data blocks, the target text data block corresponding to the compressed data block;
Multiplying the code length of each compressed code by the number of characters in each piece of text data in the target text data block, to obtain the length of each piece of compressed data in the compressed data block;
Determining the multiple pieces of compressed data in turn from the compressed data block according to the length of each piece of compressed data in the compressed data block.
When the compression dictionary includes a compressed code for each character in each target text data block, each character is compressed individually into its compressed code. Because a piece of text data can include multiple characters, the code length of each compressed code can be multiplied by the number of characters in each piece of text data in the target text data block to obtain the length of each piece of compressed data in the compressed data block.
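The per-character case can be sketched in the same style: the length of each piece of compressed data is the code length multiplied by the character count of the corresponding text data. The sketch assumes 1-byte codes per character and that the character counts are known from the corresponding target text data block; `split_by_char_counts` is a hypothetical name.

```python
from typing import Dict, List


def split_by_char_counts(
    compressed: bytes, code_length: int, char_counts: List[int]
) -> List[bytes]:
    """Cut a block into pieces of code_length * character_count bytes."""
    pieces: List[bytes] = []
    position = 0
    for count in char_counts:
        length = code_length * count  # length of this piece of compressed data
        pieces.append(compressed[position:position + length])
        position += length
    if position != len(compressed):
        raise ValueError("character counts do not match the block length")
    return pieces


# Hypothetical per-character dictionary with equal-length (1-byte) codes.
char_codes: Dict[str, bytes] = {"a": b"\x01", "b": b"\x02", "c": b"\x03"}
block = ["ab", "c", "cba"]
compressed = b"".join(char_codes[ch] for text in block for ch in text)

char_counts = [len(text) for text in block]
pieces = split_by_char_counts(compressed, 1, char_counts)
```

In a real system the character counts would have to be stored or derivable without decompressing; here they are read from the original block purely to keep the example short.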
With reference to the fourth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, after judging whether the code lengths of the multiple compressed codes included in the compression dictionary are equal, the method further includes:
When the code lengths of the multiple compressed codes included in the compression dictionary are unequal, determining, for each of the at least two compressed data blocks, the multiple pieces of compressed data from the compressed data block according to the data index of the compressed data block, where the data index indicates the position of each piece of compressed data within the compressed data block.
With reference to the sixth possible implementation of the first aspect, in a seventh possible implementation of the first aspect, before determining the multiple pieces of compressed data from the compressed data block according to the data index of the compressed data block, the method further includes:
For the target text data block corresponding to the compressed data block, determining, while the target text data block is being compressed, the position within the compressed data block of the compressed data corresponding to each piece of text data in the target text data block;
Generating the data index of the compressed data block based on the position within the compressed data block of the compressed data corresponding to each piece of text data.
It should be noted that, when these positions are determined during compression of the target text data block, the start position and end position within the compressed data block of the compressed data corresponding to each piece of text data can be determined, and the start and end positions together uniquely identify where that compressed data is located. For convenience, the start position of each piece of compressed data within the compressed data block can be used as the data index of the compressed data block.
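For unequal code lengths, the data index described above can be sketched as a list of start positions recorded during compression. The sketch assumes byte-string codes and stores only start offsets, with each end position taken as the next start (or the end of the block); `compress_with_index` and `split_by_index` are hypothetical names.

```python
from typing import Dict, List, Tuple


def compress_with_index(
    block: List[str], dictionary: Dict[str, bytes]
) -> Tuple[bytes, List[int]]:
    """Compress a block and record the start position of each piece of compressed data."""
    out = bytearray()
    starts: List[int] = []
    for text in block:
        starts.append(len(out))  # start position of this piece
        out += dictionary[text]
    return bytes(out), starts


def split_by_index(compressed: bytes, starts: List[int]) -> List[bytes]:
    """Recover the pieces of compressed data from the start-position index."""
    ends = starts[1:] + [len(compressed)]
    return [compressed[s:e] for s, e in zip(starts, ends)]


# Hypothetical dictionary with variable-length codes.
dictionary = {"yes": b"\x01", "no": b"\x02\x03", "maybe": b"\x04\x05\x06"}
block = ["no", "yes", "maybe", "no"]

compressed, index = compress_with_index(block, dictionary)
pieces = split_by_index(compressed, index)
```

Storing only start positions halves the index size compared with storing start and end positions, at the cost of needing the block length to delimit the last piece.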
In a second aspect, a data processing device is provided, the device including:
An acquisition module, configured to obtain a compression dictionary of at least two target text data blocks, where the at least two target text data blocks are data blocks, among multiple stored text data blocks, that are subsequently to be processed by the same processing operation; each target text data block includes multiple pieces of text data, and each piece of text data includes multiple characters; the compression dictionary includes a compressed code for each piece of text data in each target text data block, or a compressed code for each character in each target text data block;
A compression module, configured to compress, based on the compression dictionary, each of the at least two target text data blocks to obtain at least two compressed data blocks, where the at least two target text data blocks correspond one-to-one with the at least two compressed data blocks, each compressed data block includes multiple pieces of compressed data, and the multiple pieces of compressed data correspond one-to-one with the multiple pieces of text data;
A processing module, configured to process, when a processing instruction to perform the same processing operation on the at least two target text data blocks is received, the compressed data in the at least two compressed data blocks, so as to process the at least two target text data blocks.
With reference to the second aspect, in a first possible implementation of the second aspect, the device further includes:
A determining module, configured to determine the at least two target text data blocks from the multiple stored text data blocks;
A generation module, configured to generate the compression dictionary of the at least two target text data blocks.
With reference to the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the determining module includes:
A selecting unit, configured to select, from the multiple text data blocks, at least two text data blocks that belong to a target category, and determine the selected text data blocks as target text data blocks; or
A first determining unit, configured to determine, when a selection instruction for at least two of the multiple text data blocks is detected, the text data blocks selected by the selection instruction as target text data blocks.
With reference to the second aspect, in a third possible implementation of the second aspect, the processing module includes:
A second determining unit, configured to determine the multiple pieces of compressed data included in each of the at least two compressed data blocks;
A processing unit, configured to process the at least two compressed data blocks based on the multiple pieces of compressed data included in each of them, to obtain a processing result of the at least two compressed data blocks;
A decompression unit, configured to decompress, based on the compression dictionary, the processing result of the at least two compressed data blocks, to obtain the processing result of the at least two target text data blocks.
With reference to the third possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the second determining unit includes:
A judging subunit, configured to judge whether the code lengths of the multiple compressed codes included in the compression dictionary are equal;
A first determining subunit, configured to determine, when the code lengths of the multiple compressed codes are equal and the compression dictionary includes a compressed code for each piece of text data in each target text data block, for each of the at least two compressed data blocks, the code length of each compressed code as the length of each piece of compressed data in the compressed data block;
A second determining subunit, configured to determine the multiple pieces of compressed data in turn from the compressed data block according to the length of each piece of compressed data in the compressed data block.
With reference to the fourth possible implementation of the second aspect, in a fifth possible implementation of the second aspect, the second determining unit further includes:
A third determining subunit, configured to determine, when the code lengths of the multiple compressed codes are equal and the compression dictionary includes a compressed code for each character in each target text data block, for each of the at least two compressed data blocks, the target text data block corresponding to the compressed data block;
A computing subunit, configured to multiply the code length of each compressed code by the number of characters in each piece of text data in the target text data block, to obtain the length of each piece of compressed data in the compressed data block;
A fourth determining subunit, configured to determine the multiple pieces of compressed data in turn from the compressed data block according to the length of each piece of compressed data in the compressed data block.
With reference to the fourth possible implementation of the second aspect, in a sixth possible implementation of the second aspect, the second determining unit further includes:
A fifth determining subunit, configured to determine, when the code lengths of the multiple compressed codes included in the compression dictionary are unequal, for each of the at least two compressed data blocks, the multiple pieces of compressed data from the compressed data block according to the data index of the compressed data block, where the data index indicates the position of each piece of compressed data within the compressed data block.
With reference to the sixth possible implementation of the second aspect, in a seventh possible implementation of the second aspect, the second determining unit further includes:
A sixth determining subunit, configured to determine, for the target text data block corresponding to the compressed data block and while the target text data block is being compressed, the position within the compressed data block of the compressed data corresponding to each piece of text data in the target text data block;
A generating subunit, configured to generate the data index of the compressed data block based on the position within the compressed data block of the compressed data corresponding to each piece of text data.
The technical solution provided by the embodiments of the present invention has the following beneficial effects:
In the embodiments of the present invention, for at least two target text data blocks that are subsequently to be processed by the same processing operation, a compression dictionary of the at least two target text data blocks is obtained, and each target text data block is compressed based on that dictionary to obtain at least two compressed data blocks. That is, the at least two target text data blocks share one compression dictionary, and a separate dictionary does not need to be generated for each text data block. Because the compression dictionary is the same for all of the target text data blocks, they are compressed according to the same compression standard. Consequently, processing the compressed data in the at least two compressed data blocks and then decompressing the processing result with the compression dictionary yields the same result as processing the text data in the at least two target text data blocks directly. The embodiments therefore process the target text data blocks by processing the compressed data, without decompressing the at least two compressed data blocks, which reduces the amount of data processed, shortens processing time, and saves processing resources.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of a data processing system provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a computer device provided by an embodiment of the present invention;
Fig. 3 is a flowchart of a data processing method provided by an embodiment of the present invention;
Fig. 4A is a flowchart of another data processing method provided by an embodiment of the present invention;
Fig. 4B is a schematic diagram of a target text data block involved in the embodiment of Fig. 4A;
Fig. 4C is a schematic diagram of a correspondence between target text data blocks and a compression dictionary involved in the embodiment of Fig. 4A;
Fig. 4D is a schematic diagram of another correspondence between target text data blocks and a compression dictionary involved in the embodiment of Fig. 4A;
Fig. 4E is a schematic diagram of the compressed field index involved in the embodiment of Fig. 4A;
Fig. 5A is a schematic structural diagram of a data processing device provided by an embodiment of the present invention;
Fig. 5B is a schematic structural diagram of a processing module provided by an embodiment of the present invention.
Detailed Description of the Embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Fig. 1 is a schematic structural diagram of a data processing system according to an embodiment of the present invention. Referring to Fig. 1, the system may be a distributed system or, of course, a non-distributed system; the embodiments of the present invention are described by using a distributed system as an example. The distributed system includes multiple devices, namely device 01, device 02, device 03, ..., device 0n. The multiple devices are connected to one another, and each of them may be a terminal or a server, which is not specifically limited in the embodiments of the present invention.
In addition, each of the multiple devices may include a data import module and a data processing module. A designated device among the multiple devices may include, in addition to the data import module and the data processing module, a compression dictionary configuration module and a compression dictionary shared storage module; the designated device may be any one of the multiple devices. The distributed system can store multiple text data blocks, and the compression dictionary configuration module is used to configure compression dictionaries for the multiple text data blocks stored in the distributed system. A compression dictionary is used to compress a text data block to obtain a compressed data block. Correspondingly, the distributed system can also store multiple compressed data blocks, which correspond one-to-one to the multiple text data blocks. It is worth noting that at least two text data blocks among the multiple text data blocks that subsequently need to undergo the same processing operation can share one compression dictionary; that is, the at least two text data blocks can correspond to the same compression dictionary.
The compression dictionary shared storage module is used to store, in correspondence, the identifiers of the multiple text data blocks and the corresponding compression dictionaries. The data import module is used to obtain, when data processing is to be performed on at least two target text data blocks, the compression dictionary of the at least two target text data blocks from the compression dictionary shared storage module. The data processing module is used to process the compressed data in the compressed data blocks corresponding to the at least two target text data blocks, and to decompress the processing result based on the compression dictionary obtained by the data import module, thereby implementing processing of the at least two target text data blocks.
Fig. 2 is a schematic structural diagram of a computer device according to an embodiment of the present invention. Each device in the distributed system in Fig. 1 may be implemented by the computer device shown in Fig. 2. Referring to Fig. 2, the computer device includes at least one processor 201, a communication bus 202, a memory 203, and at least one communication interface 204.
Processor 201 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling execution of the program of the solution of the present invention.
Communication bus 202 may include a path for transferring information between the foregoing components.
Memory 203 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions; a random access memory (RAM) or another type of dynamic storage device that can store information and instructions; or an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or another magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. Memory 203 may exist independently and be connected to processor 201 through communication bus 202. Memory 203 may also be integrated with processor 201.
Communication interface 204 uses any transceiver-type apparatus to communicate with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
In a specific implementation, as an embodiment, processor 201 may include one or more CPUs, such as CPU0 and CPU1 shown in Fig. 2.
In a specific implementation, as an embodiment, the computer device may include multiple processors, such as processor 201 and processor 208 shown in Fig. 2. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (such as computer program instructions).
In a specific implementation, as an embodiment, the computer device may further include an output device 205 and an input device 206. Output device 205 communicates with processor 201 and can display information in a variety of ways. For example, output device 205 may be a liquid crystal display (LCD), a light-emitting diode (LED) display device, a cathode-ray tube (CRT) display device, or a projector. Input device 206 communicates with processor 201 and can receive user input in a variety of ways. For example, input device 206 may be a mouse, a keyboard, a touchscreen device, or a sensing device.
The foregoing computer device may be a general-purpose computer device or a special-purpose computer device. In a specific implementation, the computer device may be a desktop computer, a portable computer, a network server, a personal digital assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, or an embedded device. The embodiments of the present invention do not limit the type of the computer device.
Memory 203 is used to store the program code for executing the solution of the present invention, and execution is controlled by processor 201. Processor 201 is used to execute program code 210 stored in memory 203. Program code 210 may include one or more software modules (for example, the data import module, the data processing module, the compression dictionary configuration module, and the compression dictionary shared storage module). A device in the distributed system shown in Fig. 1 may process data through processor 201 and one or more software modules in program code 210 in memory 203.
Fig. 3 is a flowchart of a data processing method according to an embodiment of the present invention. Referring to Fig. 3, the method includes:
Step 301: Obtain a compression dictionary of at least two target text data blocks, where the at least two target text data blocks are data blocks, among multiple stored text data blocks, that will subsequently be processed by the same processing operation; each target text data block includes multiple pieces of text data, and each piece of text data includes multiple characters; the compression dictionary includes a compressed code for each piece of text data in each target text data block, or a compressed code for each character in each target text data block.
Step 302: Based on the compression dictionary, compress each target text data block in the at least two target text data blocks separately to obtain at least two compressed data blocks, where the at least two target text data blocks correspond one-to-one to the at least two compressed data blocks, each compressed data block includes multiple pieces of compressed data, and the multiple pieces of compressed data correspond one-to-one to the multiple pieces of text data.
Step 303: When a processing instruction for performing the same processing operation on the at least two target text data blocks is received, process the compressed data in the at least two compressed data blocks to implement processing of the at least two target text data blocks.
In the embodiments of the present invention, for at least two target text data blocks that will subsequently be processed by the same processing operation, a compression dictionary of the at least two target text data blocks is obtained, and each target text data block in the at least two target text data blocks is compressed based on the compression dictionary to obtain at least two compressed data blocks. That is, the at least two target text data blocks can share the same compression dictionary, and there is no need to generate a compression dictionary for each text data block. In addition, because the at least two target text data blocks have the same compression dictionary, that is, they are compressed according to the same compression standard, processing the compressed data in the at least two compressed data blocks and then decompressing the processing result with the compression dictionary yields the same result as processing the text data in the at least two target text data blocks directly. The embodiments of the present invention therefore process the at least two compressed data blocks to implement processing of the at least two target text data blocks, without decompressing the at least two compressed data blocks, which reduces the amount of data to be processed, shortens the data processing time, and saves processing resources.
Optionally, before the obtaining a compression dictionary of at least two target text data blocks, the method further includes:
determining the at least two target text data blocks from the multiple stored text data blocks; and
generating the compression dictionary of the at least two target text data blocks.
Optionally, the determining the at least two target text data blocks from the multiple stored text data blocks includes:
selecting, from the multiple stored text data blocks, at least two text data blocks that belong to a target category, and determining the selected text data blocks as the target text data blocks; or
when a selection instruction for at least two text data blocks among the multiple text data blocks is detected, determining the text data blocks selected by the selection instruction as the target text data blocks.
Optionally, the processing the compressed data in the at least two compressed data blocks to implement processing of the at least two target text data blocks includes:
determining the multiple pieces of compressed data included in each compressed data block in the at least two compressed data blocks;
processing the at least two compressed data blocks based on the multiple pieces of compressed data included in each compressed data block, to obtain a processing result of the at least two compressed data blocks; and
decompressing the processing result of the at least two compressed data blocks based on the compression dictionary, to obtain a processing result of the at least two target text data blocks.
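The three sub-steps above can be illustrated with a minimal Python sketch. It assumes that the processing operation is a simple join that keeps the values present in both blocks, and that the shared dictionary maps each piece of text data to a fixed-width bit string; the function names and the sample dictionary are hypothetical and are not taken from the embodiments.

```python
def compress(block, dictionary):
    """Replace every piece of text data with its code from the shared dictionary."""
    return [dictionary[text] for text in block]


def common_codes(compressed_a, compressed_b):
    """Join performed purely on codes: keep the codes of block A that also occur in block B."""
    seen = set(compressed_b)
    return [code for code in compressed_a if code in seen]


def decompress(codes, dictionary):
    """Map codes back to text data; only the (small) result is decompressed."""
    inverse = {code: text for text, code in dictionary.items()}
    return [inverse[code] for code in codes]


# Hypothetical shared dictionary for two target text data blocks.
shared_dictionary = {"101": "000", "102": "001", "103": "010", "104": "011", "105": "100"}
block_a = ["101", "102", "103", "104"]
block_b = ["103", "101", "105"]

compressed_a = compress(block_a, shared_dictionary)
compressed_b = compress(block_b, shared_dictionary)
result_codes = common_codes(compressed_a, compressed_b)   # processing on compressed data
result_text = decompress(result_codes, shared_dictionary)  # decompress the result only
```

Because both blocks were compressed with the same dictionary, equal codes imply equal text data, so the join on codes gives the same answer as a join on the original text.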
Optionally, the determining the multiple pieces of compressed data included in each compressed data block in the at least two compressed data blocks includes:
judging whether the code lengths of the multiple compressed codes included in the compression dictionary are equal;
when the code lengths of the multiple compressed codes included in the compression dictionary are equal and the compression dictionary includes a compressed code for each piece of text data in each target text data block, for each compressed data block in the at least two compressed data blocks, determining the code length of each compressed code as the length of each piece of compressed data in the compressed data block; and
determining the multiple pieces of compressed data one by one from the compressed data block according to the length of each piece of compressed data in the compressed data block.
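As a rough illustration of the equal-code-length case, the following Python sketch splits a compressed data block into its pieces of compressed data by slicing at a fixed stride. Representing the compressed block as a string of `0`/`1` characters and the name `split_fixed_length` are assumptions made for readability only.

```python
def split_fixed_length(compressed_block, dictionary):
    """Split a compressed block into pieces when all codes have the same length."""
    lengths = {len(code) for code in dictionary.values()}
    if len(lengths) != 1:
        raise ValueError("compressed codes do not all have the same length")
    code_length = lengths.pop()
    # Each piece of compressed data is exactly one code long.
    return [
        compressed_block[start:start + code_length]
        for start in range(0, len(compressed_block), code_length)
    ]


# Hypothetical dictionary with one code per piece of text data.
dictionary = {"101": "00", "102": "01", "103": "10", "104": "11"}
compressed_block = "00011011"
pieces = split_fixed_length(compressed_block, dictionary)
```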
Optionally, after the judging whether the code lengths of the multiple compressed codes included in the compression dictionary are equal, the method further includes:
when the code lengths of the multiple compressed codes included in the compression dictionary are equal and the compression dictionary includes a compressed code for each character in each target text data block, for each compressed data block in the at least two compressed data blocks, determining the target text data block corresponding to the compressed data block;
multiplying the code length of each compressed code by the number of characters of each piece of text data in the target text data block, to obtain the length of each piece of compressed data in the compressed data block; and
determining the multiple pieces of compressed data one by one from the compressed data block according to the length of each piece of compressed data in the compressed data block.
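For the per-character case, the length of each piece of compressed data is the code length multiplied by the character count of the corresponding piece of text data. The Python sketch below assumes a hypothetical three-character dictionary with 2-bit codes and a bit-string representation of the compressed block.

```python
def split_by_character_counts(compressed_block, code_length, text_block):
    """Split a compressed block whose codes are per character and of equal length."""
    pieces, position = [], 0
    for text in text_block:
        # Length of this piece = code length x number of characters in the text data.
        length = code_length * len(text)
        pieces.append(compressed_block[position:position + length])
        position += length
    return pieces


# Hypothetical per-character dictionary; every code is 2 bits long.
char_dictionary = {"a": "00", "b": "01", "c": "10"}
code_length = 2
text_block = ["ab", "c", "cab"]
compressed_block = "".join(char_dictionary[ch] for text in text_block for ch in text)
pieces = split_by_character_counts(compressed_block, code_length, text_block)
```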
Optionally, after the judging whether the code lengths of the multiple compressed codes included in the compression dictionary are equal, the method further includes:
when the code lengths of the multiple compressed codes included in the compression dictionary are not equal, for each compressed data block in the at least two compressed data blocks, determining the multiple pieces of compressed data from the compressed data block according to a data index of the compressed data block, where the data index of the compressed data block is used to indicate the position of each of the multiple pieces of compressed data in the compressed data block.
Optionally, before the determining the multiple pieces of compressed data from the compressed data block according to the data index of the compressed data block, the method further includes:
for the target text data block corresponding to the compressed data block, determining, in the process of compressing the target text data block, the position in the compressed data block of the compressed data corresponding to each piece of text data in the target text data block; and
generating the data index of the compressed data block based on the position in the compressed data block of the compressed data corresponding to each piece of text data.
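When the code lengths differ, the positions must be recorded while compressing. The following Python sketch records an index of `(start, length)` pairs during compression and later uses it to recover the pieces; the index layout and the variable-length dictionary are assumptions for illustration, not the format used in the embodiments.

```python
def compress_with_index(text_block, dictionary):
    """Compress a block and record where each piece of compressed data lands."""
    parts, index, position = [], [], 0
    for text in text_block:
        code = dictionary[text]
        index.append((position, len(code)))  # position of this piece in the block
        parts.append(code)
        position += len(code)
    return "".join(parts), index


def split_with_index(compressed_block, index):
    """Recover the individual pieces of compressed data from the data index."""
    return [compressed_block[start:start + length] for start, length in index]


# Hypothetical dictionary with codes of unequal length.
dictionary = {"101": "0", "102": "10", "103": "110", "104": "111"}
compressed_block, data_index = compress_with_index(["101", "103", "102", "104"], dictionary)
pieces = split_with_index(compressed_block, data_index)
```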
All of the foregoing optional technical solutions may be combined in any manner to form optional embodiments of the present invention, which are not described one by one here.
Fig. 4A is a flowchart of a data processing method according to an embodiment of the present invention. Referring to Fig. 4A, the method includes:
Step 401: Determine at least two target text data blocks from multiple stored text data blocks.
It should be noted that the multiple text data blocks may be stored in a specified format, and the specified format may be set in advance. For example, the specified format may be the text (TextFile) format, the Parquet format, the SequenceFile format, the RCFile format, or the Avro format, which is not specifically limited in the embodiments of the present invention. To improve the access efficiency of the multiple text data blocks, in practical applications the multiple text data blocks are usually stored in a distributed manner. Distributed storage means that the multiple text data blocks are stored dispersedly on multiple independent devices; for example, the multiple text data blocks may be stored based on a distributed file system (Hadoop Distributed File System, HDFS), which is not specifically limited in the embodiments of the present invention.
In addition, the at least two target text data blocks are data blocks that will subsequently be processed by the same processing operation. Each target text data block includes multiple pieces of text data, and each piece of text data includes multiple characters. For example, as shown in Fig. 4B, a target text data block in the at least two target text data blocks includes multiple pieces of text data, namely 101, 102, 103, and 104. The processing operation may include comparison, sorting, search, hash operations, join operations, and the like, which is not specifically limited in the embodiments of the present invention.
Furthermore, text data refers to data composed of printable characters. The printable characters include characters 33 to 127 in ASCII, characters in UNICODE, characters in UTF-8, and the like, which is not specifically limited in the embodiments of the present invention.
Specifically, when determining the at least two target text data blocks from the multiple text data blocks, the distributed system may select, from the multiple text data blocks, at least two text data blocks that belong to a target category and determine the selected text data blocks as the target text data blocks; or, when a selection instruction for at least two text data blocks among the multiple text data blocks is detected, determine the text data blocks selected by the selection instruction as the target text data blocks.
It should be noted that text data blocks under a target category can be processed by the same processing operation. For example, the target category may be number, time, user name, or the like, which is not specifically limited in the embodiments of the present invention.
In addition, the selection instruction is used to select the target text data blocks from the multiple text data blocks. The selection instruction may be triggered by a user through a specified operation, and the specified operation may be a single-click operation, a double-click operation, a voice operation, or the like, which is not specifically limited in the embodiments of the present invention.
Because text data blocks under a target category can normally be processed by the same processing operation, in the embodiments of the present invention at least two text data blocks that belong to the target category can be selected and determined as the target text data blocks. This determination is simple and requires no user participation, which improves the efficiency of determining the target text data blocks.
In addition, because the selection instruction is triggered by a user, when the text data blocks selected by the selection instruction are determined as the target text data blocks in the embodiments of the present invention, the target text data blocks are in effect determined according to the user's operation, which ensures that the determined target text data blocks meet the user's requirements.
For example, among the multiple text data blocks, the text data blocks that belong to the target category are text data block 1, text data block 2, text data block 3, and text data block 4. In this case, text data block 1, text data block 2, text data block 3, and text data block 4 may be selected from the multiple text data blocks and then determined as the target text data blocks.
For another example, when a selection instruction for text data block 1, text data block 2, text data block 3, and text data block 4 among the multiple text data blocks is detected, text data block 1, text data block 2, text data block 3, and text data block 4 selected by the selection instruction may be determined as the target text data blocks.
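The two ways of determining the target text data blocks can be sketched in Python as follows. The `TextBlock` record, its `category` field, and both function names are hypothetical; the embodiments do not specify how the category of a block is recorded.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    block_id: str
    category: str  # hypothetical metadata, e.g. "number", "time", "user name"
    data: list


def select_by_category(blocks, target_category):
    """Automatic selection: every block that belongs to the target category."""
    selected = [block for block in blocks if block.category == target_category]
    if len(selected) < 2:
        raise ValueError("at least two target text data blocks are required")
    return selected


def select_by_instruction(blocks, selected_ids):
    """Manual selection: the blocks named by a user-triggered selection instruction."""
    wanted = set(selected_ids)
    return [block for block in blocks if block.block_id in wanted]


stored_blocks = [
    TextBlock("block-1", "number", ["101", "102"]),
    TextBlock("block-2", "number", ["103", "101"]),
    TextBlock("block-3", "user name", ["alice", "bob"]),
]
by_category = select_by_category(stored_blocks, "number")
by_instruction = select_by_instruction(stored_blocks, ["block-1", "block-3"])
```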
It should be noted that the operation of determining the at least two target text data blocks from the multiple text data blocks may be performed by the compression dictionary configuration module in the distributed system. In practical applications, after determining the at least two target text data blocks, the compression dictionary configuration module may also send the identifier of each target text data block in the at least two target text data blocks to the compression dictionary shared storage module, so that the compression dictionary shared storage module can determine the association among the at least two target text data blocks, which makes it convenient for the compression dictionary shared storage module to subsequently generate the same compression dictionary for the at least two target text data blocks.
The identifier of a target text data block is used to uniquely identify the target text data block, and may be the name of the target text data block or the like, which is not specifically limited in the embodiments of the present invention.
Step 402: Generate a compression dictionary of the at least two target text data blocks.
So that each target text data block in the at least two target text data blocks can subsequently be compressed according to the same compression standard, the compression dictionary of the at least two target text data blocks may be generated after the at least two target text data blocks are determined. The compression dictionary may be generated based on a specified compression algorithm and the at least two target text data blocks. Of course, in practical applications, the compression dictionary of the at least two target text data blocks may also be generated in other manners, which is not specifically limited in the embodiments of the present invention.
It should be noted that the compression dictionary of the at least two target text data blocks may include a compressed code for each piece of text data in each target text data block, or may include a compressed code for each character in each target text data block, which is not specifically limited in the embodiments of the present invention. For example, the text data included in the at least two target text data blocks is 101, 102, and 103. When the compression dictionary of the at least two target text data blocks includes a compressed code for each piece of text data in each target text data block, the compression dictionary may correspondingly include compressed code 1, compressed code 2, and compressed code 3, as shown in Fig. 4C. When the compression dictionary of the at least two target text data blocks includes a compressed code for each character in each target text data block, if the characters included in text data 101 are 1, 2, and 3, with corresponding compressed codes compressed code 1, compressed code 2, and compressed code 3; the characters included in text data 102 are 4, 5, and 6, with corresponding compressed codes compressed code 4, compressed code 5, and compressed code 6; and the characters included in text data 103 are 7, 8, and 9, with corresponding compressed codes compressed code 7, compressed code 8, and compressed code 9, the compression dictionary may be as shown in Fig. 4D.
In addition, the specified compression algorithm may be set in advance, and may be an order-preserving compression algorithm, that is, an algorithm for which the lexicographic order after encoding remains the same as the lexicographic order of the original character strings. For example, the specified compression algorithm may be the Huffman coding algorithm, the Hu-Tucker coding algorithm, or the like, which is not specifically limited in the embodiments of the present invention.
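The order-preserving property can be illustrated with a short Python sketch. Note that it deliberately uses a much simpler stand-in than the algorithms named above: each distinct value receives a fixed-width code equal to its rank in sorted order. The function name and sample values are assumptions for illustration.

```python
def order_preserving_dictionary(values):
    """Assign fixed-width binary codes by rank, so code order equals string order."""
    ordered = sorted(set(values))
    width = max(1, (len(ordered) - 1).bit_length())
    return {value: format(rank, f"0{width}b") for rank, value in enumerate(ordered)}


values = ["pear", "apple", "fig", "kiwi"]
dictionary = order_preserving_dictionary(values)
codes = [dictionary[value] for value in values]

# Sort the codes only, then decompress the sorted result.
inverse = {code: value for value, code in dictionary.items()}
sorted_via_codes = [inverse[code] for code in sorted(codes)]
```

With such a dictionary, a sort or comparison carried out on the codes gives the same order as one carried out on the original strings, which is what allows these operations to run on compressed data.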
For the operation of generating the compression dictionary of the at least two target text data blocks based on the specified compression algorithm and the at least two target text data blocks, reference may be made to the related art, and details are not described in the embodiments of the present invention.
It should be noted that when the at least two target text data blocks are stored in a distributed manner, the device that stores the at least two target text data blocks may, upon receiving a compression instruction for the at least two target text data blocks, send the at least two target text data blocks to the compression dictionary shared storage module, so that the compression dictionary shared storage module generates the compression dictionary of the at least two target text data blocks and returns the compression dictionary to the device, and the device can then compress the at least two target text data blocks. Of course, the compression dictionary shared storage module may also actively obtain the at least two target text data blocks in order to generate their compression dictionary, which is not specifically limited in the embodiments of the present invention. The compression instruction is used to instruct that the at least two target text data blocks be compressed, and may be triggered by a specified operation, which is not specifically limited in the embodiments of the present invention.
In addition, after generating the compression dictionary, the compression dictionary shared storage module may delete the at least two target text data blocks to save its storage resources.
Furthermore, after generating the compression dictionary of the at least two target text data blocks, the compression dictionary shared storage module may also store the identifier of each target text data block in the at least two target text data blocks and the compression dictionary of the at least two target text data blocks into a correspondence between text data block identifiers and compression dictionaries. In this way, when the foregoing device subsequently obtains the compression dictionary of a target text data block, the compression dictionary can be obtained quickly and conveniently from the correspondence between text data block identifiers and compression dictionaries, directly based on the identifier of the target text data block.
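The correspondence between text data block identifiers and compression dictionaries can be sketched as a small in-memory store in Python. The class `DictionaryStore` and its method names are hypothetical; an actual shared storage module would also need persistence and network access, which are omitted here.

```python
class DictionaryStore:
    """Minimal stand-in for the correspondence kept by the shared storage module."""

    def __init__(self):
        self._by_block_id = {}

    def register(self, block_ids, dictionary):
        # Every identifier in the group points at the same dictionary object,
        # so the dictionary itself is held only once.
        for block_id in block_ids:
            self._by_block_id[block_id] = dictionary

    def lookup(self, block_id):
        return self._by_block_id[block_id]


store = DictionaryStore()
shared_dictionary = {"101": "00", "102": "01", "103": "10"}
store.register(["block-1", "block-2"], shared_dictionary)
dictionary_for_block_2 = store.lookup("block-2")
```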
It should be noted that in the related art, when a text data block is compressed, a compression dictionary corresponding to the text data block is usually generated first, the text data block is then compressed based on that compression dictionary to obtain the corresponding compressed data block, and the compression dictionary and the compressed data block are stored together, so that the compressed data block can subsequently be decompressed based on the compression dictionary of the text data block. That is, in the related art, storing multiple compressed data blocks also requires storing the compression dictionary corresponding to each of the compressed data blocks, which consumes more storage resources. In the embodiments of the present invention, because the at least two target text data blocks use the same compression dictionary, the compression dictionary of the at least two target text data blocks needs to be stored only once in the compression dictionary shared storage module, which saves storage resources.
In addition, in the related art, generating the compression dictionaries of multiple text data blocks requires performing a generation operation once for each text data block; that is, multiple generation operations must be performed to obtain the compression dictionaries of all of the text data blocks, which consumes more processing resources. In the embodiments of the present invention, the at least two target text data blocks can be determined in advance, and the compression dictionary of each target text data block in the at least two target text data blocks is the same. Therefore, the generation operation needs to be performed only once for the at least two target text data blocks to obtain their compression dictionary, which saves processing resources.
It should be noted that in the embodiments of the present invention, the compression dictionary of the at least two target text data blocks can be determined through the foregoing steps 401 and 402, and the operations of compressing and processing the at least two target text data blocks based on the compression dictionary can be implemented through the following steps 403 to 405.
As can be seen from the foregoing description, the method provided in the embodiments of the present invention is used in a distributed system, and the compression dictionary shared storage module is arranged in a designated device among the multiple devices included in the distributed system. However, the multiple text data blocks are stored in a distributed manner on the multiple devices, and the at least two target text data blocks are included in the multiple text data blocks; that is, the at least two target text data blocks are also stored in a distributed manner on the multiple devices. Therefore, after generating the compression dictionary of the at least two target text data blocks, the compression dictionary shared storage module may store the compression dictionary. When the distributed system needs to perform data processing on the at least two target text data blocks, the compression dictionary of the at least two target text data blocks is obtained in the following manner so that the subsequent processing steps can be performed, as shown in steps 403 to 405.
Step 403:Obtain the compression dictionary of at least two target texts data block.
It should be noted that the operation that distributed system obtains the compression dictionary of at least two target texts data block can To be performed by storing the data import modul included by the equipment of at least two target texts data block, specifically, deposit The data import modul that storing up the equipment of at least two target texts data block includes can store mould to the compression dictionary is shared Block sends compression dictionary and obtains request, and the compression dictionary obtains the mark that target text data block is carried in request;When the compression When dictionary sharing storage module receives the compression dictionary and obtains request, can the mark based on target text data block, from depositing In corresponding relation between the text data block identification and compression dictionary of storage, the compression dictionary of target text data block is obtained, and The compression dictionary of target text data block is sent to the data import modul, certainly, can also be with other sides in practical application Formula obtains the compression dictionary of at least two target texts data block, and the embodiment of the present invention is not specifically limited to this.
In addition, in practical application, can when receiving the compression instruction at least two target texts data block, Obtain the compression dictionary of at least two target texts data block, it is of course also possible to obtain in other cases this at least two The compression dictionary of target text data block, as long as ensureing before being compressed at least two target texts data block, Through getting the compression dictionary of at least two target texts data block, the embodiment of the present invention is not specifically limited to this.
It should be noted that the compression is instructed for indicating to be compressed at least two target texts data block, and Compression instruction can be triggered by user, and certainly, compression instruction can also detect a certain trigger event in distributed system When trigger, the embodiment of the present invention is not specifically limited to this.
Further, when compressing the compression dictionary that is not stored with the target text data block in dictionary sharing storage module When, now, the compression dictionary sharing storage module needs to be based on the target text data block and specified compression algorithm, generates the mesh Mark the compression dictionary of text data block.
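The lookup in step 403, together with generation on a miss, can be sketched as follows. This is only an illustrative sketch: the class name SharedDictionaryStore, the function build_dictionary and the fixed-length code assignment are assumptions standing in for the compression dictionary shared storage module and the unspecified "specified compression algorithm"; the text above does not define them.

```python
from typing import Dict, List, Sequence

Block = List[str]            # one text data block: a list of pieces of text data
Dictionary = Dict[str, str]  # piece of text data -> compressed code (bit string)


def build_dictionary(blocks: Sequence[Block]) -> Dictionary:
    """Stand-in for the specified compression algorithm (an assumption):
    give every distinct piece of text data a fixed-length binary code."""
    items = sorted({item for block in blocks for item in block})
    width = max(1, (len(items) - 1).bit_length())
    return {item: format(i, f"0{width}b") for i, item in enumerate(items)}


class SharedDictionaryStore:
    """Hypothetical compression dictionary shared storage module."""

    def __init__(self) -> None:
        self._by_block_id: Dict[str, Dictionary] = {}
        self.generated = 0  # how many dictionaries were generated so far

    def get(self, block_ids: Sequence[str], blocks: Sequence[Block]) -> Dictionary:
        # Look up the stored correspondence: block identifier -> dictionary.
        found = [self._by_block_id.get(block_id) for block_id in block_ids]
        if found[0] is not None and all(d is found[0] for d in found):
            return found[0]
        # Miss: generate one dictionary shared by all the target blocks.
        dictionary = build_dictionary(blocks)
        self.generated += 1
        for block_id in block_ids:
            self._by_block_id[block_id] = dictionary
        return dictionary


store = SharedDictionaryStore()
target_blocks = [["x", "y"], ["y", "z"]]
first = store.get(["b1", "b2"], target_blocks)
second = store.get(["b1", "b2"], target_blocks)
```

The second request is served from the stored correspondence, so both target text data blocks end up with the same dictionary object and no second dictionary is generated.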
Step 404: Based on the compression dictionary of the at least two target text data blocks, compress each target text data block in the at least two target text data blocks separately, to obtain at least two compressed data blocks.
It should be noted that the at least two target text data blocks correspond one-to-one with the at least two compressed data blocks, and each compressed data block includes multiple pieces of compressed data. For any compressed data block, the multiple pieces of compressed data it includes correspond one-to-one with the multiple pieces of text data included in the corresponding target text data block; that is, each piece of text data in the at least two target text data blocks has a unique corresponding piece of compressed data in the at least two compressed data blocks.
In addition, for the operation of compressing each target text data block in the at least two target text data blocks based on the compression dictionary to obtain the at least two compressed data blocks, reference may be made to the related art, and the embodiment of the present invention does not describe it in detail.
Step 405: When a processing instruction for performing the same processing operation on the at least two target text data blocks is received, process the compressed data in the at least two compressed data blocks, so as to realize the processing of the at least two target text data blocks.
It should be noted that the multiple text data blocks are generally stored in a distributed manner. In the related art, when a processing instruction for performing the same processing operation on two of the text data blocks is received, the compression dictionaries of the two text data blocks are different, so the two compressed data blocks corresponding to the two text data blocks must first be decompressed separately to obtain the two text data blocks. The two text data blocks are then transmitted to the device that performs the data processing, and that device processes them.
In the embodiment of the present invention, by contrast, the compressed data in the at least two compressed data blocks is processed directly in order to realize the processing of the at least two target text data blocks. Therefore, when the at least two target text data blocks are processed, only the at least two compressed data blocks corresponding to them need to be transmitted to the device that performs the data processing, and that device processes the compressed data in the at least two compressed data blocks. Compared with the related art, three savings follow. First, the compressed data blocks do not have to be decompressed before processing, which reduces data processing time and saves processing resources. Second, only compressed data blocks, rather than text data blocks, are transmitted between the multiple devices of the distributed system, which reduces the volume of transmitted data, improves network bandwidth utilization and shortens data transmission time. Third, the processing operates on compressed data rather than directly on text data, which reduces the amount of data processed and further saves processing time and processing resources.
Specifically, the operation of processing the compressed data in the at least two compressed data blocks to realize the processing of the at least two target text data blocks can be: determining the multiple pieces of compressed data included in each compressed data block in the at least two compressed data blocks; processing the at least two compressed data blocks based on the multiple pieces of compressed data included in each compressed data block, to obtain the processing result of the at least two compressed data blocks; and decompressing the processing result of the at least two compressed data blocks based on the compression dictionary, to obtain the processing result of the at least two target text data blocks.
The compression dictionary stores the compressed code of each piece of text data in each target text data block, or the compressed code of each character in each target text data block. Compressing the at least two target text data blocks therefore amounts to converting the text data in them into compressed data under a fixed conversion rule. Because the rule is fixed, the processing result obtained by operating on the compressed data is itself composed of compressed codes. Decompressing that processing result based on the compression dictionary converts the result from compressed codes back into text data form, and because the conversion rule is fixed, the converted result is the processing result of the at least two target text data blocks.
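The three operations above (determine the compressed data, process it, decompress only the result) can be illustrated with a small sketch. The dictionary contents, the block contents and the choice of a frequency count as the "same processing operation" are assumptions made for illustration; each compressed data block is kept here as a list of codes so that the splitting step can be set aside.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical shared dictionary: piece of text data -> compressed code.
dictionary: Dict[str, str] = {"beijing": "00", "shanghai": "01", "shenzhen": "10"}
decode: Dict[str, str] = {code: text for text, code in dictionary.items()}

block_a: List[str] = ["beijing", "shenzhen", "beijing"]
block_b: List[str] = ["shenzhen", "shanghai", "shenzhen"]


def compress(block: List[str]) -> List[str]:
    """Replace every piece of text data with its compressed code."""
    return [dictionary[item] for item in block]


compressed_a, compressed_b = compress(block_a), compress(block_b)

# Process the compressed data directly: count occurrences of each code
# across both compressed data blocks (no decompression of the blocks).
code_counts = Counter(compressed_a) + Counter(compressed_b)

# Decompress only the processing result, using the shared dictionary.
result = {decode[code]: count for code, count in code_counts.items()}

# Reference: the same operation applied to the original text data blocks.
expected = Counter(block_a + block_b)
```

Because both blocks share one dictionary, equal codes always denote equal text data, so the counts computed on codes match the counts computed on the text.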
In the embodiment of the present invention, the processing of the compressed data in the at least two compressed data blocks to realize the processing of the at least two target text data blocks can be performed by a data processing module in the device. That is, the data processing module can process the compressed data in the at least two compressed data blocks and then decompress the processing result based on the compression dictionary of the at least two target text data blocks, thereby realizing the processing of the at least two target text data blocks.
The operation of processing the at least two compressed data blocks based on the multiple pieces of compressed data included in each compressed data block is similar to the operation, in the related art, of processing two text data blocks based on the multiple pieces of text data included in each of them, and the embodiment of the present invention does not describe it in detail.
The operation of decompressing the processing result of the at least two compressed data blocks based on the compression dictionary to obtain the processing result of the at least two target text data blocks is similar to the operation, in the related art, of decompressing a compressed data block based on its corresponding compression dictionary, and the embodiment of the present invention does not describe it in detail.
When determining the multiple pieces of compressed data included in each compressed data block in the at least two compressed data blocks, it can first be judged whether the code lengths of the multiple compressed codes included in the compression dictionary are equal, and then, according to the judgment result, the multiple pieces of compressed data included in each compressed data block can be determined according to one of the following three cases:
In the first case, when the code lengths of the multiple compressed codes included in the compression dictionary are equal and the compression dictionary includes the compressed code of each piece of text data in each target text data block, then, for each compressed data block in the at least two compressed data blocks, the code length of each compressed code is determined as the length of each piece of compressed data in the compressed data block; and the multiple pieces of compressed data are determined in sequence from the compressed data block according to the length of each piece of compressed data in the compressed data block.
As can be seen from the above description, the compression dictionary is generated by the specified compression algorithm. For different compression algorithms, the code lengths of the compressed codes in the generated compression dictionaries may differ, and the compressed codes within a single compression dictionary may also differ from one another in code length. Since a compressed data block is obtained by compression based on the compression dictionary, when determining the multiple pieces of compressed data included in each compressed data block in the at least two compressed data blocks, it can first be judged whether the code lengths of the multiple compressed codes included in the compression dictionary are equal.
In addition, when the compression dictionary includes the compressed code of each piece of text data in each target text data block, it can be determined that, when the target text data block was compressed, each piece of text data was compressed based on its own compressed code. Therefore, the code length of each compressed code in the compression dictionary can be determined as the length of each piece of compressed data in the compressed data block.
When the multiple pieces of compressed data are determined in sequence from the compressed data block according to the length of each piece of compressed data, the compressed data block can be divided in sequence according to that length, to obtain the multiple pieces of compressed data.
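A minimal sketch of the first case follows, assuming the compressed data block is represented as a bit string and the shared code length is already known; the function name split_fixed is hypothetical.

```python
from typing import List


def split_fixed(compressed_block: str, code_length: int) -> List[str]:
    """Divide a compressed data block into pieces of compressed data,
    each exactly one compressed code long (first case)."""
    if code_length <= 0 or len(compressed_block) % code_length != 0:
        raise ValueError("block length must be a multiple of the code length")
    return [
        compressed_block[start:start + code_length]
        for start in range(0, len(compressed_block), code_length)
    ]


# Three pieces of text data, each compressed to a 2-bit code.
pieces = split_fixed("000110", 2)
```

No per-block metadata is needed in this case, because every boundary falls at a multiple of the code length.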
In the second case, when the code lengths of the multiple compressed codes included in the compression dictionary are equal and the compression dictionary includes the compressed code of each character in each target text data block, then, for each compressed data block in the at least two compressed data blocks, the target text data block corresponding to the compressed data block is determined; the code length of each compressed code is multiplied by the number of characters of each piece of text data in the target text data block, to obtain the length of each piece of compressed data in the compressed data block; and the multiple pieces of compressed data are determined in sequence from the compressed data block according to the length of each piece of compressed data in the compressed data block.
When the compression dictionary includes the compressed code of each character in each target text data block, it can be determined that, when the target text data block was compressed, each character was compressed based on its own compressed code. Because one piece of text data can include multiple characters, the code length of each compressed code in the compression dictionary can be multiplied by the number of characters of each piece of text data in the target text data block, to obtain the length of each piece of compressed data in the compressed data block.
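The second case can be sketched in the same style. The character dictionary, the sample block and the function name split_by_char_counts are assumptions for illustration; the character counts are taken as already known for the corresponding target text data block, as the text describes.

```python
from typing import Dict, List

# Hypothetical character-level dictionary with equal code lengths (2 bits).
char_codes: Dict[str, str] = {"a": "00", "b": "01", "c": "10"}
CODE_LENGTH = 2


def compress_chars(block: List[str]) -> str:
    """Compress a target text data block character by character."""
    return "".join(char_codes[ch] for item in block for ch in item)


def split_by_char_counts(compressed_block: str, code_length: int,
                         char_counts: List[int]) -> List[str]:
    """Length of each piece = code length x number of characters (second case)."""
    pieces, position = [], 0
    for count in char_counts:
        length = code_length * count
        pieces.append(compressed_block[position:position + length])
        position += length
    if position != len(compressed_block):
        raise ValueError("character counts do not match the block length")
    return pieces


target_block = ["ab", "c", "cab"]
compressed = compress_chars(target_block)
counts = [len(item) for item in target_block]  # 2, 1, 3 characters
pieces = split_by_char_counts(compressed, CODE_LENGTH, counts)
```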
In the third case, when the code lengths of the multiple compressed codes included in the compression dictionary are not equal, then, for each compressed data block in the at least two compressed data blocks, the multiple pieces of compressed data are determined from the compressed data block according to the data directory of the compressed data block. The data directory of the compressed data block is used to indicate the position of each piece of compressed data in the compressed data block.
Further, before the multiple pieces of compressed data are determined from the compressed data block according to the data directory of the compressed data block, the method also includes: for the target text data block corresponding to the compressed data block, in the process of compressing the target text data block, determining the position in the compressed data block of the compressed data corresponding to each piece of text data in the target text data block; and generating the data directory of the compressed data block based on the position in the compressed data block of the compressed data corresponding to each piece of text data.
It should be noted that, when these positions are determined in the process of compressing the target text data block, the start position and the end position in the compressed data block of the compressed data corresponding to each piece of text data can be determined, and the start position and end position together uniquely determine the position of that compressed data in the compressed data block. For convenience, the start position of each piece of compressed data in the compressed data block can be determined as the data directory of the compressed data block.
For example, suppose the text data aaabbbccc in a target text data block is compressed into the compressed data 0001010, and the compression dictionary records that aaa corresponds to 00, bbb corresponds to 010 and ccc corresponds to 10. So that the compressed data corresponding to each field can be obtained without decompressing the data, the following is determined in the process of compressing the text data: after compression, aaa has start position 0, end position 2 and length 2-0=2; bbb has start position 2, end position 5 and length 5-2=3; and ccc has start position 5, end position 7 and length 7-5=2. Referring to Fig. 4E, the start position 0 of compressed data 00, the start position 2 of compressed data 010 and the start position 5 of compressed data 10 can be determined as the data directory of the compressed data block.
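The aaabbbccc example can be reproduced with a short sketch. It assumes the data directory is simply the list of start positions recorded during compression, as the paragraph above suggests; the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Dictionary from the example above: codes of unequal length.
dictionary: Dict[str, str] = {"aaa": "00", "bbb": "010", "ccc": "10"}


def compress_with_directory(fields: List[str]) -> Tuple[str, List[int]]:
    """Compress the fields and record the start position of each piece."""
    codes, directory, position = [], [], 0
    for field in fields:
        code = dictionary[field]
        directory.append(position)  # start position of this piece
        codes.append(code)
        position += len(code)       # the end position is the next start
    return "".join(codes), directory


def split_by_directory(compressed_block: str, directory: List[int]) -> List[str]:
    """Recover the pieces of compressed data without decompressing them."""
    ends = directory[1:] + [len(compressed_block)]
    return [compressed_block[start:end] for start, end in zip(directory, ends)]


compressed, directory = compress_with_directory(["aaa", "bbb", "ccc"])
pieces = split_by_directory(compressed, directory)
```

Storing only start positions is sufficient here because each end position equals the next start position, and the last end position equals the length of the compressed data block.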
In the embodiment of the present invention, for at least two target text data blocks that are subsequently processed by the same processing operation, the compression dictionary of the at least two target text data blocks is obtained, and each target text data block in the at least two target text data blocks is compressed based on the compression dictionary, to obtain at least two compressed data blocks. That is, the at least two target text data blocks can share the same compression dictionary, and there is no need to generate a separate compression dictionary for each text data block. In addition, because the at least two target text data blocks have the same compression dictionary, they are compressed under the same compression standard. Therefore, the result obtained by processing the compressed data in the at least two compressed data blocks and then decompressing the processing result with the compression dictionary is the same as the result obtained by processing the text data in the at least two target text data blocks. The embodiment of the present invention thus realizes the processing of the at least two target text data blocks by processing the at least two compressed data blocks, without decompressing them first, which reduces the amount of data processed, shortens data processing time and saves processing resources.
Fig. 5A is a schematic structural diagram of a data processing device provided in an embodiment of the present invention. Referring to Fig. 5A, the data processing device includes:
Acquisition module 501, configured to obtain the compression dictionary of at least two target text data blocks, where the at least two target text data blocks are data blocks, among multiple stored text data blocks, that are subsequently processed by the same processing operation; each target text data block includes multiple pieces of text data, each piece of text data includes multiple characters, and the compression dictionary includes the compressed code of each piece of text data in each target text data block, or includes the compressed code of each character in each target text data block;
Compression module 502, configured to compress each target text data block in the at least two target text data blocks separately based on the compression dictionary, to obtain at least two compressed data blocks, where the at least two target text data blocks correspond one-to-one with the at least two compressed data blocks, each compressed data block includes multiple pieces of compressed data, and the multiple pieces of compressed data correspond one-to-one with the multiple pieces of text data;
Processing module 503, configured to, when a processing instruction for performing the same processing operation on the at least two target text data blocks is received, process the compressed data in the at least two compressed data blocks, so as to realize the processing of the at least two target text data blocks.
Further, the device also includes:
Determining module, configured to determine the at least two target text data blocks from the multiple stored text data blocks;
Generation module, configured to generate the compression dictionary of the at least two target text data blocks.
The determining module includes:
Selecting unit, configured to select, from the multiple text data blocks, at least two text data blocks belonging to a target category, and determine the selected text data blocks as the target text data blocks; or
First determining unit, configured to, when a selection instruction for at least two text data blocks in the multiple text data blocks is detected, determine the text data blocks selected by the selection instruction as the target text data blocks.
Referring to Fig. 5B, the processing module 503 includes:
Second determining unit 5031, configured to determine the multiple pieces of compressed data included in each compressed data block in the at least two compressed data blocks;
Processing unit 5032, configured to process the at least two compressed data blocks based on the multiple pieces of compressed data included in each compressed data block, to obtain the processing result of the at least two compressed data blocks;
Decompression unit 5033, configured to decompress the processing result of the at least two compressed data blocks based on the compression dictionary, to obtain the processing result of the at least two target text data blocks.
The second determining unit 5031 includes:
Judgment subunit, configured to judge whether the code lengths of the multiple compressed codes included in the compression dictionary are equal;
First determination subunit, configured to, when the code lengths of the multiple compressed codes included in the compression dictionary are equal and the compression dictionary includes the compressed code of each piece of text data in each target text data block, determine, for each compressed data block in the at least two compressed data blocks, the code length of each compressed code as the length of each piece of compressed data in the compressed data block;
Second determination subunit, configured to determine the multiple pieces of compressed data in sequence from the compressed data block according to the length of each piece of compressed data in the compressed data block.
Further, the second determining unit 5031 also includes:
Third determination subunit, configured to, when the code lengths of the multiple compressed codes included in the compression dictionary are equal and the compression dictionary includes the compressed code of each character in each target text data block, determine, for each compressed data block in the at least two compressed data blocks, the target text data block corresponding to the compressed data block;
Computing subunit, configured to multiply the code length of each compressed code by the number of characters of each piece of text data in the target text data block, to obtain the length of each piece of compressed data in the compressed data block;
Fourth determination subunit, configured to determine the multiple pieces of compressed data in sequence from the compressed data block according to the length of each piece of compressed data in the compressed data block.
Further, the second determining unit 5031 also includes:
Fifth determination subunit, configured to, when the code lengths of the multiple compressed codes included in the compression dictionary are not equal, determine, for each compressed data block in the at least two compressed data blocks, the multiple pieces of compressed data from the compressed data block according to the data directory of the compressed data block, where the data directory of the compressed data block is used to indicate the position of each piece of compressed data in the compressed data block.
Further, the second determining unit 5031 also includes:
Sixth determination subunit, configured to, for the target text data block corresponding to the compressed data block, determine, in the process of compressing the target text data block, the position in the compressed data block of the compressed data corresponding to each piece of text data in the target text data block;
Generation subunit, configured to generate the data directory of the compressed data block based on the position in the compressed data block of the compressed data corresponding to each piece of text data.
In the embodiment of the present invention, for at least two target text data blocks that are subsequently processed by the same processing operation, the compression dictionary of the at least two target text data blocks is obtained, and each target text data block in the at least two target text data blocks is compressed based on the compression dictionary, to obtain at least two compressed data blocks. That is, the at least two target text data blocks can share the same compression dictionary, and there is no need to generate a separate compression dictionary for each text data block. In addition, because the at least two target text data blocks have the same compression dictionary, they are compressed under the same compression standard. Therefore, the result obtained by processing the compressed data in the at least two compressed data blocks and then decompressing the processing result with the compression dictionary is the same as the result obtained by processing the text data in the at least two target text data blocks. The embodiment of the present invention thus realizes the processing of the at least two target text data blocks by processing the at least two compressed data blocks, without decompressing them first, which reduces the amount of data processed, shortens data processing time and saves processing resources.
It should be noted that, when the data processing device provided in the above embodiment performs data processing, the division into the functional modules described above is used only as an example. In practical applications, the above functions can be assigned to different functional modules as needed; that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the data processing device provided in the above embodiment and the data processing method embodiment belong to the same concept; for the specific implementation process, refer to the method embodiment, which is not repeated here.
A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments can be implemented by hardware, or by a program instructing related hardware. The program can be stored in a computer-readable storage medium, and the storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc or the like.
The foregoing are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (16)

1. A data processing method, characterized in that the method comprises:
Obtaining the compression dictionary of at least two target text data blocks, where the at least two target text data blocks are data blocks, among multiple stored text data blocks, that are subsequently processed by the same processing operation, each target text data block includes multiple pieces of text data, each piece of text data includes multiple characters, and the compression dictionary includes the compressed code of each piece of text data in each target text data block, or includes the compressed code of each character in each target text data block;
Based on the compression dictionary, compressing each target text data block in the at least two target text data blocks separately, to obtain at least two compressed data blocks, where the at least two target text data blocks correspond one-to-one with the at least two compressed data blocks, each compressed data block includes multiple pieces of compressed data, and the multiple pieces of compressed data correspond one-to-one with the multiple pieces of text data;
When a processing instruction for performing the same processing operation on the at least two target text data blocks is received, processing the compressed data in the at least two compressed data blocks, so as to realize the processing of the at least two target text data blocks.
2. The method according to claim 1, characterized in that, before the obtaining of the compression dictionary of the at least two target text data blocks, the method further comprises:
Determining the at least two target text data blocks from the multiple stored text data blocks;
Generating the compression dictionary of the at least two target text data blocks.
3. The method according to claim 2, characterized in that the determining of the at least two target text data blocks from the multiple stored text data blocks comprises:
Selecting, from the multiple text data blocks, at least two text data blocks belonging to a target category, and determining the selected text data blocks as the target text data blocks; or
When a selection instruction for at least two text data blocks in the multiple text data blocks is detected, determining the text data blocks selected by the selection instruction as the target text data blocks.
4. The method according to claim 1, characterized in that the processing of the compressed data in the at least two compressed data blocks, so as to realize the processing of the at least two target text data blocks, comprises:
Determining the multiple pieces of compressed data included in each compressed data block in the at least two compressed data blocks;
Processing the at least two compressed data blocks based on the multiple pieces of compressed data included in each compressed data block in the at least two compressed data blocks, to obtain the processing result of the at least two compressed data blocks;
Based on the compression dictionary, decompressing the processing result of the at least two compressed data blocks, to obtain the processing result of the at least two target text data blocks.
5. The method according to claim 4, characterized in that the determining of the multiple pieces of compressed data included in each compressed data block in the at least two compressed data blocks comprises:
Judging whether the code lengths of the multiple compressed codes included in the compression dictionary are equal;
When the code lengths of the multiple compressed codes included in the compression dictionary are equal and the compression dictionary includes the compressed code of each piece of text data in each target text data block, determining, for each compressed data block in the at least two compressed data blocks, the code length of each compressed code as the length of each piece of compressed data in the compressed data block;
Determining the multiple pieces of compressed data in sequence from the compressed data block according to the length of each piece of compressed data in the compressed data block.
6. The method according to claim 5, characterized in that, after the judging of whether the code lengths of the multiple compressed codes included in the compression dictionary are equal, the method further comprises:
When the code lengths of the multiple compressed codes included in the compression dictionary are equal and the compression dictionary includes the compressed code of each character in each target text data block, determining, for each compressed data block in the at least two compressed data blocks, the target text data block corresponding to the compressed data block;
Multiplying the code length of each compressed code by the number of characters of each piece of text data in the target text data block, to obtain the length of each piece of compressed data in the compressed data block;
Determining the multiple pieces of compressed data in sequence from the compressed data block according to the length of each piece of compressed data in the compressed data block.
7. The method according to claim 5, characterized in that, after the judging of whether the code lengths of the multiple compressed codes included in the compression dictionary are equal, the method further comprises:
When the code lengths of the multiple compressed codes included in the compression dictionary are not equal, determining, for each compressed data block in the at least two compressed data blocks, the multiple pieces of compressed data from the compressed data block according to the data directory of the compressed data block, where the data directory of the compressed data block is used to indicate the position of each piece of compressed data in the compressed data block.
8. The method according to claim 7, characterized in that, before the determining of the multiple pieces of compressed data from the compressed data block according to the data directory of the compressed data block, the method further comprises:
For the target text data block corresponding to the compressed data block, determining, in the process of compressing the target text data block, the position in the compressed data block of the compressed data corresponding to each piece of text data in the target text data block;
Generating the data directory of the compressed data block based on the position in the compressed data block of the compressed data corresponding to each piece of text data.
9. A data processing device, characterized in that the device comprises:
An acquisition module, configured to obtain a compression dictionary of at least two target text data blocks, wherein the at least two target text data blocks are data blocks, among multiple stored text data blocks, that are subsequently processed by the same processing operation; each target text data block includes multiple pieces of text data, each piece of text data includes multiple characters, and the compression dictionary includes a compressed code for each piece of text data in each target text data block, or a compressed code for each character in each target text data block;
A compression module, configured to compress each target text data block of the at least two target text data blocks based on the compression dictionary, to obtain at least two compressed data blocks, wherein the at least two target text data blocks correspond one-to-one to the at least two compressed data blocks, each compressed data block includes multiple pieces of compressed data, and the multiple pieces of compressed data correspond one-to-one to the multiple pieces of text data;
A processing module, configured to, when a processing instruction for performing the same processing operation on the at least two target text data blocks is received, process the compressed data in the at least two compressed data blocks, so as to implement the processing of the at least two target text data blocks.
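As a rough sketch of how the compression and processing modules could cooperate, the example below compresses two blocks with one shared dictionary and then runs an equality search directly on the compressed items. The dictionary contents, the function names, and the choice of "equality search" as the processing operation are assumptions for illustration only.

```python
def compress_block(texts, dictionary):
    """Replace every text item with its compressed code (one-to-one)."""
    return [dictionary[text] for text in texts]


def find_equal(compressed_blocks, dictionary, query):
    """Search all compressed blocks for a value without decompressing.

    The query is encoded once with the shared dictionary, then compared
    against the compressed items. Returns (block index, item index) pairs.
    """
    code = dictionary.get(query)
    if code is None:  # value never occurs in any target block
        return []
    return [
        (block_index, item_index)
        for block_index, block in enumerate(compressed_blocks)
        for item_index, item in enumerate(block)
        if item == code
    ]


# Hypothetical shared dictionary and target text data blocks.
dictionary = {"error": "00", "info": "01", "warn": "10"}
target_blocks = [["error", "info", "warn"], ["info", "error", "info"]]
compressed_blocks = [compress_block(block, dictionary) for block in target_blocks]
hits = find_equal(compressed_blocks, dictionary, "info")
```

Sharing one dictionary across the blocks is what makes a single encoded query valid for all of them.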
10. The device according to claim 9, characterized in that the device further comprises:
A determining module, configured to determine the at least two target text data blocks from the multiple stored text data blocks;
A generation module, configured to generate the compression dictionary of the at least two target text data blocks.
11. The device according to claim 10, characterized in that the determining module comprises:
A selecting unit, configured to select, from the multiple text data blocks, at least two text data blocks belonging to a target category, and to determine the selected text data blocks as target text data blocks; or
A first determining unit, configured to, when a selection instruction for at least two text data blocks among the multiple text data blocks is detected, determine the text data blocks selected by the selection instruction as target text data blocks.
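The determining and generation modules of claims 10 and 11 might look like the sketch below, which selects blocks by category and then builds one shared dictionary. The block representation (a dict with `category` and `texts` keys), the fixed-width binary codes, and the function names are assumptions; the claims do not prescribe how categories are stored or how codes are assigned.

```python
def select_by_category(stored_blocks, target_category):
    """Pick every stored block labelled with the target category."""
    selected = [
        block["texts"]
        for block in stored_blocks
        if block["category"] == target_category
    ]
    if len(selected) < 2:
        raise ValueError("need at least two target text data blocks")
    return selected


def generate_dictionary(target_blocks):
    """Give each distinct text value one fixed-width binary code."""
    values = sorted({text for block in target_blocks for text in block})
    # Smallest width that can number all values, with a minimum of 1 bit.
    width = max(1, (len(values) - 1).bit_length())
    return {value: format(index, f"0{width}b") for index, value in enumerate(values)}


# Hypothetical stored blocks; two of them share the same category.
stored_blocks = [
    {"category": "log_level", "texts": ["error", "info", "warn"]},
    {"category": "user_name", "texts": ["alice", "bob"]},
    {"category": "log_level", "texts": ["info", "error", "info"]},
]
targets = select_by_category(stored_blocks, "log_level")
dictionary = generate_dictionary(targets)
```

Selection by an explicit selection instruction would replace `select_by_category` with a lookup of the block identifiers named in that instruction; the dictionary step stays the same.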
12. The device according to claim 9, characterized in that the processing module comprises:
A second determining unit, configured to determine the multiple pieces of compressed data included in each compressed data block of the at least two compressed data blocks;
A processing unit, configured to process the at least two compressed data blocks based on the multiple pieces of compressed data included in each compressed data block of the at least two compressed data blocks, to obtain a processing result of the at least two compressed data blocks;
A decompression unit, configured to decompress the processing result of the at least two compressed data blocks based on the compression dictionary, to obtain a processing result of the at least two target text data blocks.
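One way to picture the "process, then decompress only the result" flow of claim 12 is a counting operation: occurrences are tallied on the compressed codes, and only the final tally is mapped back to text. The counting operation, the names, and the sample data are assumptions chosen for illustration.

```python
from collections import Counter


def count_codes(compressed_blocks):
    """Processing step: count each compressed item across all blocks."""
    counts = Counter()
    for block in compressed_blocks:
        counts.update(block)
    return counts


def decompress_result(code_counts, dictionary):
    """Decompression step: translate only the result back to text."""
    inverse = {code: text for text, code in dictionary.items()}
    return {inverse[code]: count for code, count in code_counts.items()}


# Hypothetical shared dictionary and already-split compressed blocks.
dictionary = {"error": "00", "info": "01", "warn": "10"}
compressed_blocks = [["00", "01", "10"], ["01", "00", "01"]]
text_counts = decompress_result(count_codes(compressed_blocks), dictionary)
```

Here six compressed items are processed but only three dictionary lookups are needed for decompression, which is the saving the claim is aiming at.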
13. The device according to claim 12, characterized in that the second determining unit comprises:
A judging subunit, configured to judge whether the code lengths of the multiple compressed codes included in the compression dictionary are equal;
A first determining subunit, configured to, when the code lengths of the multiple compressed codes included in the compression dictionary are equal and the compression dictionary includes the compressed code of each piece of text data in each target text data block, for each compressed data block of the at least two compressed data blocks, determine the code length of each compressed code as the length of each piece of compressed data in the compressed data block;
A second determining subunit, configured to determine the multiple pieces of compressed data in sequence from the compressed data block according to the length of each piece of compressed data in the compressed data block.
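When every code has the same length and each code stands for a whole piece of text data, splitting a compressed block reduces to fixed-size slicing. The sketch below assumes the block is held as a string of `'0'`/`'1'` characters; that representation and the function names are illustrative choices, not part of the claims.

```python
def codes_have_equal_length(dictionary):
    """Judging step: are all compressed codes the same length?"""
    return len({len(code) for code in dictionary.values()}) == 1


def split_fixed(block_bits, code_length):
    """Cut the block into consecutive items of code_length bits each."""
    if len(block_bits) % code_length != 0:
        raise ValueError("block length is not a multiple of the code length")
    return [
        block_bits[start:start + code_length]
        for start in range(0, len(block_bits), code_length)
    ]


# Hypothetical dictionary with one 2-bit code per piece of text data.
dictionary = {"error": "00", "info": "01", "warn": "10"}
equal = codes_have_equal_length(dictionary)
items = split_fixed("000110", 2)
```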
14. The device according to claim 13, characterized in that the second determining unit further comprises:
A third determining subunit, configured to, when the code lengths of the multiple compressed codes included in the compression dictionary are equal and the compression dictionary includes the compressed code of each character in each target text data block, for each compressed data block of the at least two compressed data blocks, determine the target text data block corresponding to the compressed data block;
A computing subunit, configured to multiply the code length of each compressed code by the number of characters of each piece of text data in the target text data block, to obtain the length of each piece of compressed data in the compressed data block;
A fourth determining subunit, configured to determine the multiple pieces of compressed data in sequence from the compressed data block according to the length of each piece of compressed data in the compressed data block.
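With equal-length per-character codes, the length of each compressed item is the code length multiplied by that item's character count. A minimal sketch, assuming the character counts are taken from the corresponding target text data block and that bits are modelled as a `'0'`/`'1'` string (both assumptions, as are the names):

```python
def compress_chars(text, char_dictionary):
    """Encode one piece of text data character by character."""
    return "".join(char_dictionary[ch] for ch in text)


def split_by_char_counts(block_bits, code_length, target_block):
    """Slice the block using length = code_length * number of characters."""
    items = []
    position = 0
    for text in target_block:
        length = code_length * len(text)
        items.append(block_bits[position:position + length])
        position += length
    return items


# Hypothetical per-character dictionary with 2-bit codes.
char_dictionary = {"a": "00", "b": "01", "c": "10"}
target_block = ["ab", "c", "cab"]
block_bits = "".join(compress_chars(text, char_dictionary) for text in target_block)
items = split_by_char_counts(block_bits, 2, target_block)
```

In the example the item lengths come out as 4, 2 and 6 bits for the 2-, 1- and 3-character texts.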
15. The device according to claim 13, characterized in that the second determining unit further comprises:
A fifth determining subunit, configured to, when the code lengths of the multiple compressed codes included in the compression dictionary are unequal, for each compressed data block of the at least two compressed data blocks, determine the multiple pieces of compressed data from the compressed data block according to the data directory of the compressed data block, wherein the data directory of the compressed data block indicates the position of each piece of compressed data within the compressed data block.
16. The device according to claim 15, characterized in that the second determining unit further comprises:
A sixth determining subunit, configured to, for the target text data block corresponding to the compressed data block, determine, while the target text data block is being compressed, the position within the compressed data block of the compressed data corresponding to each piece of text data in the target text data block;
A generating subunit, configured to generate the data directory of the compressed data block based on the position, within the compressed data block, of the compressed data corresponding to each piece of text data.
CN201610590825.2A 2016-07-22 2016-07-22 Data processing method and device Active CN107643906B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610590825.2A CN107643906B (en) 2016-07-22 2016-07-22 Data processing method and device
PCT/CN2017/092527 WO2018014761A1 (en) 2016-07-22 2017-07-11 Data processing method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610590825.2A CN107643906B (en) 2016-07-22 2016-07-22 Data processing method and device

Publications (2)

Publication Number Publication Date
CN107643906A (en) 2018-01-30
CN107643906B (en) 2021-01-05

Family

ID=60992963

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610590825.2A Active CN107643906B (en) 2016-07-22 2016-07-22 Data processing method and device

Country Status (2)

Country Link
CN (1) CN107643906B (en)
WO (1) WO2018014761A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103326732A (en) * 2013-05-10 2013-09-25 Huawei Technologies Co., Ltd. Method for packing data, method for unpacking data, coder and decoder
CN104283777A (en) * 2013-07-03 2015-01-14 Huawei Technologies Co., Ltd. Message compression method and device
US20160197621A1 (en) * 2015-01-04 2016-07-07 Emc Corporation Text compression and decompression

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7769729B2 (en) * 2007-05-21 2010-08-03 Sap Ag Block compression of tables with repeated values
CN101320372B (en) * 2008-05-22 2012-07-04 Shanghai Eisoo Software Co., Ltd. Compression method for repeated data
EP2460091A4 (en) * 2009-07-31 2013-07-03 Hewlett Packard Development Co Compression of xml data
CN104023070B (en) * 2014-06-16 2017-02-15 Du Haiyang File compression method based on cloud storage

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Hu Rong et al.: "Parallel design model for file compression and decompression based on OpenMP", Journal of Central South University *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765111A (en) * 2019-10-21 2021-05-07 EMC IP Holding Company LLC Method, apparatus and computer program product for processing data
CN113495687A (en) * 2020-03-19 2021-10-12 NVIDIA Corporation Techniques for efficiently organizing and accessing compressible data
CN114979794A (en) * 2022-05-13 2022-08-30 Shenzhen Zhihuilin Network Technology Co., Ltd. Data sending method and device
CN114979794B (en) * 2022-05-13 2023-11-14 Shenzhen Zhihuilin Network Technology Co., Ltd. Data transmission method and device

Also Published As

Publication number Publication date
WO2018014761A1 (en) 2018-01-25
CN107643906B (en) 2021-01-05

Similar Documents

Publication Publication Date Title
KR102535450B1 (en) Data storage method and apparatus, and computer device and storage medium thereof
CN105988996B (en) Index file generation method and device
CN111291103A (en) Interface data analysis method and device, electronic equipment and storage medium
CN109359237A (en) Method and apparatus for searching for a hosted program
CN107643906A (en) Data processing method and device
CA2936485C (en) Optimized data condenser and method
US20090028266A1 (en) Compact encoding of arbitrary length binary objects
CN107729523A (en) Data service method, electronic device and storage medium
CN107622040A (en) Control method and system for laser engraving data
CN116842012A (en) Method, device, equipment and storage medium for storing Redis cluster in fragments
CN107580015A (en) Data processing method and device, server
CN116738954A (en) Report export method, report template configuration device and computer equipment
CN104077282B (en) Method and apparatus for processing data
CN105930104A (en) Data storing method and device
CN110221778A (en) Hotel data processing method and system, storage medium, and electronic device
CN113393288B (en) Order processing information generation method, device, equipment and computer readable medium
CN111639260B (en) Content recommendation method, content recommendation device and storage medium
TWI719537B (en) Text comparison method, system and computer program product
CN105653534B (en) Data processing method and device
CN112328960B (en) Optimization method and device for data operation, electronic equipment and storage medium
CN109918374A (en) Mass data storage method and terminal device
CN113343639B (en) Product identification code diagram generation and information query method based on product identification code diagram
CN117235236B (en) Dialogue method, dialogue device, computer equipment and storage medium
CN113032003B (en) Development file export method, development file export device, electronic equipment and computer storage medium
CN115801228B (en) Interactive information encryption method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220211

Address after: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province

Patentee after: Huawei Cloud Computing Technology Co.,Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee before: HUAWEI TECHNOLOGIES Co.,Ltd.