CN107643906A - Data processing method and device - Google Patents
Data processing method and device
- Publication number
- CN107643906A (application CN201610590825.2A)
- Authority
- CN
- China
- Prior art keywords
- data block
- compression
- compressed
- data
- text data
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
The invention discloses a data processing method and device, belonging to the field of computer technology. The method includes: obtaining a compression dictionary of at least two target text data blocks; compressing each of the at least two target text data blocks based on the compression dictionary to obtain at least two compressed data blocks; and, when a processing instruction to perform the same processing operation on the at least two target text data blocks is received, processing the compressed data in the at least two compressed data blocks so as to process the at least two target text data blocks. Because the compressed data is processed directly, the at least two compressed data blocks do not need to be decompressed, which reduces the amount of data to be processed, shortens the processing time, and saves processing resources.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a data processing method and device.
Background
With the development of computer technology, large volumes of text data need to be stored and analyzed. Text data is data made up of printable characters, which include the characters numbered 33 to 127 in the American Standard Code for Information Interchange (ASCII), Unicode characters, UTF-8 characters, and so on. To save the time and space taken up by storage and transmission, text data is first compressed, and the compressed text data is then stored. Later, when the text data is to be analyzed, the compressed text data is first decompressed to recover the text data; the text data is then processed by operations such as comparison, sorting, searching, hashing, and joining, and is analyzed based on the processing results.
At present, the following data processing method is available. For each text data block among multiple stored text data blocks, a compression dictionary for that text data block is generated, where the text data block includes multiple text data items. The text data block is compressed based on its compression dictionary to obtain the corresponding compressed data block, and the compressed data block is stored. When a processing instruction to perform the same processing operation on a first text data block and a second text data block is received (the first and second text data blocks being any two of the multiple text data blocks), the compressed data block corresponding to each of them is obtained. The two compressed data blocks are decompressed to recover the first text data block and the second text data block, and the text data in the two blocks is then processed to obtain a processing result.
When a processing instruction to perform the same processing operation on the first text data block and the second text data block is received, the compressed data blocks corresponding to the two text data blocks must each be decompressed before the two text data blocks can be processed. As a result, data processing takes a long time and consumes considerable processing resources.
Summary of the invention
To solve the problems of the prior art, embodiments of the present invention provide a data processing method and device. The technical solutions are as follows.
In a first aspect, a data processing method is provided, the method including:
obtaining a compression dictionary of at least two target text data blocks, where the at least two target text data blocks are data blocks, among multiple stored text data blocks, that will subsequently be processed by the same processing operation, each target text data block includes multiple text data items, each text data item includes multiple characters, and the compression dictionary includes a compression code for each text data item in each target text data block, or a compression code for each character in each target text data block;
compressing, based on the compression dictionary, each of the at least two target text data blocks to obtain at least two compressed data blocks, where the at least two target text data blocks correspond one-to-one to the at least two compressed data blocks, each compressed data block includes multiple compressed data items, and the multiple compressed data items correspond one-to-one to the multiple text data items;
when a processing instruction to perform the same processing operation on the at least two target text data blocks is received, processing the compressed data in the at least two compressed data blocks so as to process the at least two target text data blocks.
It should be noted that the compression dictionary of the at least two target text data blocks may include a compression code for each text data item in each target text data block, or a compression code for each character in each target text data block. In other words, the compression codes in the dictionary may correspond one-to-one to text data items or to characters; the embodiments of the present invention do not specifically limit this.
In addition, the at least two target text data blocks are data blocks that will subsequently be processed by the same processing operation, and they correspond to the same compression dictionary. That is, the at least two target text data blocks share one compression dictionary, so a separate dictionary does not have to be generated for each text data block. Consequently, when a processing instruction to perform the same processing operation on the at least two compressed data blocks is received, the compressed data in those blocks can be processed directly, because the blocks were obtained by compressing the at least two target text data blocks with the one dictionary they share. This achieves the processing of the at least two target text data blocks.
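The role of the shared dictionary can be illustrated with a minimal Python sketch. The helper `make_codes`, the integer codes, and the sample blocks are assumptions made for illustration and are not taken from the patent; the sketch only shows why codes from per-block dictionaries cannot be compared directly, while codes from a shared dictionary can.

```python
def make_codes(items):
    """Give each distinct item the next unused integer code (hypothetical scheme)."""
    codes = {}
    for item in items:
        codes.setdefault(item, len(codes))
    return codes

block_a = ["red", "green", "blue"]
block_b = ["blue", "red", "black"]

# One dictionary per block: the same text can receive different codes,
# so comparing codes across blocks says nothing about the text.
codes_a = make_codes(block_a)
codes_b = make_codes(block_b)
per_block_match = codes_a["red"] == codes_b["red"]

# One dictionary shared by both blocks: equal codes mean equal text,
# so the comparison can run on the compressed data directly.
shared = make_codes(block_a + block_b)
comp_a = [shared[text] for text in block_a]
comp_b = [shared[text] for text in block_b]
shared_match = comp_a[0] == comp_b[1]  # both positions hold "red"
```

With per-block dictionaries, "red" is coded differently in the two blocks, which is why the related art has to decompress before comparing.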
With reference to the first aspect, in a first possible implementation of the first aspect, before obtaining the compression dictionary of the at least two target text data blocks, the method further includes:
determining the at least two target text data blocks from the multiple stored text data blocks;
generating the compression dictionary of the at least two target text data blocks.
So that each of the at least two target text data blocks can later be compressed according to the same compression standard, the compression dictionary of the at least two target text data blocks can be generated once those blocks have been determined. The dictionary can be generated based on a specified compression algorithm and the at least two target text data blocks. In practice, the dictionary can of course also be generated in other ways; the embodiments of the present invention do not specifically limit this.
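The patent does not name the specified compression algorithm. As a hedged illustration only, the sketch below assumes a simple scheme that assigns every distinct text data item across the target blocks a fixed-width binary code; the function name `generate_dictionary` and the sample data are hypothetical.

```python
def generate_dictionary(blocks):
    """Build one dictionary for all target blocks (assumed fixed-width scheme)."""
    # Collect the distinct text data items across every target block.
    distinct = sorted({item for block in blocks for item in block})
    # Smallest bit width that can number all distinct items (at least 1 bit).
    width = max(1, (len(distinct) - 1).bit_length())
    return {item: format(i, f"0{width}b") for i, item in enumerate(distinct)}

target_blocks = [["cat", "dog"], ["dog", "emu", "fox"]]
dictionary = generate_dictionary(target_blocks)
```

A frequency-based variable-length scheme would also fit the description; it is not shown here because it leads to the unequal-code-length case discussed later.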
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, determining the at least two target text data blocks from the multiple stored text data blocks includes:
selecting, from the multiple text data blocks, at least two text data blocks that belong to a target category, and determining the selected text data blocks as the target text data blocks; or
when a selection instruction for at least two of the multiple text data blocks is detected, determining the text data blocks selected by the selection instruction as the target text data blocks.
Because text data blocks in the same target category can normally be processed by the same processing operation, the embodiments of the present invention can select at least two text data blocks belonging to the target category and determine them as the target text data blocks. This determination is simple and requires no user involvement, which improves the efficiency of determining the target text data blocks.
Alternatively, because the selection instruction is triggered by the user, determining the text data blocks selected by the selection instruction as the target text data blocks means that the target text data blocks are determined by user operation, which ensures that they meet the user's needs.
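The two ways of determining the target text data blocks can be sketched as follows. The `TextBlock` record, its `category` field, and both function names are assumptions introduced for this example; the patent does not describe how categories or selection instructions are represented.

```python
from dataclasses import dataclass

@dataclass
class TextBlock:
    block_id: str   # hypothetical identifier of a stored text data block
    category: str   # hypothetical category label
    items: list     # the text data items in the block

def select_by_category(blocks, target_category):
    """Pick blocks of the target category; at least two are required."""
    chosen = [b for b in blocks if b.category == target_category]
    return chosen if len(chosen) >= 2 else []

def select_by_instruction(blocks, selected_ids):
    """Pick the blocks named in a user selection; at least two are required."""
    chosen = [b for b in blocks if b.block_id in selected_ids]
    return chosen if len(chosen) >= 2 else []

stored = [
    TextBlock("b1", "city", ["Paris", "Rome"]),
    TextBlock("b2", "name", ["Ann", "Bob"]),
    TextBlock("b3", "city", ["Rome", "Oslo"]),
]
by_category = select_by_category(stored, "city")
by_user = select_by_instruction(stored, {"b1", "b2"})
```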
With reference to the first aspect, in a third possible implementation of the first aspect, processing the compressed data in the at least two compressed data blocks so as to process the at least two target text data blocks includes:
determining the multiple compressed data items included in each of the at least two compressed data blocks;
processing the at least two compressed data blocks based on the multiple compressed data items included in each of them, to obtain a processing result of the at least two compressed data blocks;
decompressing the processing result of the at least two compressed data blocks based on the compression dictionary, to obtain the processing result of the at least two target text data blocks.
The compression dictionary stores the compression code of each text data item in each target text data block, or of each character in each target text data block, so compressing the at least two target text data blocks amounts to converting their text data into compressed data. Because this conversion rule is fixed, the processing result obtained by operating on the compressed data is itself made up of compression codes. Decompressing that processing result based on the compression dictionary therefore converts it from compression-code form into text-data form, and, because the conversion rule is fixed, the converted result is the processing result of the at least two target text data blocks.
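The three steps above can be sketched in Python for one illustrative operation. The choice of operation (finding the items common to two blocks), the integer codes, and all function names are assumptions for this example; the patent lists comparison, sorting, searching, hashing, and joining without fixing an implementation.

```python
def compress(block, codes):
    """Replace each text data item with its code from the shared dictionary."""
    return [codes[item] for item in block]

def intersect_compressed(comp_a, comp_b):
    """Process compressed data only: keep codes of comp_a that occur in comp_b."""
    seen = set(comp_b)
    return [code for code in comp_a if code in seen]

def decompress(result, codes):
    """Convert a result made of codes back to text with the same dictionary."""
    inverse = {code: text for text, code in codes.items()}
    return [inverse[code] for code in result]

codes = {"apple": 0, "pear": 1, "plum": 2, "fig": 3}  # hypothetical shared dictionary
block_a = ["apple", "pear", "plum"]
block_b = ["plum", "fig", "apple"]

result_codes = intersect_compressed(compress(block_a, codes), compress(block_b, codes))
result_text = decompress(result_codes, codes)

# Reference: the same operation carried out on the uncompressed text.
expected = [item for item in block_a if item in set(block_b)]
```

Only the small result is decompressed; the two compressed data blocks themselves are never expanded.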
It should be noted that the multiple text data blocks are generally stored in a distributed manner. In the related art, when a processing instruction to perform the same processing operation on two of the text data blocks is received, the compression dictionaries of the two blocks differ. The two stored compressed data blocks corresponding to the two text data blocks therefore have to be decompressed separately to recover the two text data blocks, which are then transmitted to the device that performs the data processing, and that device processes the two text data blocks.
In the embodiments of the present invention, by contrast, the compressed data in the at least two compressed data blocks is processed directly in order to process the at least two target text data blocks. When the at least two target text data blocks are to be processed, only the corresponding compressed data blocks need to be transmitted to the device that performs the data processing, and that device processes the compressed data in them. This has three advantages. First, compared with the related art, which must decompress the compressed data blocks before processing the text data blocks, the target text data blocks can be processed directly from the compressed data blocks, which reduces processing time and saves processing resources. Second, compared with the related art, which transmits text data blocks between the devices of the distributed system, only compressed data blocks need to be transmitted, which reduces the volume of transmitted data, improves network bandwidth utilization, and shortens transmission time. Third, compared with the related art, which processes the text data directly, only the compressed data needs to be processed, which reduces the amount of data processed and again saves processing time and resources.
With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, determining the multiple compressed data items included in each of the at least two compressed data blocks includes:
judging whether the code lengths of the multiple compression codes included in the compression dictionary are equal;
when the code lengths of the multiple compression codes are equal and the compression dictionary includes a compression code for each text data item in each target text data block, determining, for each of the at least two compressed data blocks, the code length of each compression code as the length of each compressed data item in that compressed data block;
determining the multiple compressed data items from the compressed data block in sequence, according to the length of each compressed data item in the compressed data block.
As described above, the compression dictionary is generated by a specified compression algorithm. Depending on the algorithm, the compression codes in the generated dictionary may or may not all have the same code length, and the compressed data blocks are produced by compressing with that dictionary. Therefore, when determining the multiple compressed data items included in each of the at least two compressed data blocks, it is first judged whether the code lengths of the compression codes in the dictionary are equal.
In addition, when the compression dictionary includes a compression code for each text data item in each target text data block, the target text data blocks were compressed item by item using those compression codes. The code length of each compression code in the dictionary can therefore be taken as the length of each compressed data item in the compressed data block.
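For the equal-code-length case with one code per text data item, a compressed data block can be cut into its compressed data items by stepping through it one code length at a time. The sketch below assumes that compressed data is represented as a string of '0' and '1' characters; that representation and the function names are illustrative choices, not part of the patent.

```python
def code_lengths_equal(codes):
    """Check whether every compression code in the dictionary has the same length."""
    return len({len(code) for code in codes.values()}) == 1

def split_fixed(compressed, code_length):
    """Cut a compressed block into items of code_length symbols each, in order."""
    if len(compressed) % code_length != 0:
        raise ValueError("block length is not a multiple of the code length")
    return [compressed[i:i + code_length]
            for i in range(0, len(compressed), code_length)]

codes = {"cat": "00", "dog": "01", "emu": "10"}  # hypothetical per-item dictionary
block = ["dog", "emu", "cat"]
compressed = "".join(codes[text] for text in block)

if code_lengths_equal(codes):
    code_length = len(next(iter(codes.values())))
    pieces = split_fixed(compressed, code_length)
```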
With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, after judging whether the code lengths of the multiple compression codes included in the compression dictionary are equal, the method further includes:
when the code lengths of the multiple compression codes are equal and the compression dictionary includes a compression code for each character in each target text data block, determining, for each of the at least two compressed data blocks, the target text data block corresponding to that compressed data block;
multiplying the code length of each compression code by the number of characters in each text data item of the target text data block, to obtain the length of each compressed data item in the compressed data block;
determining the multiple compressed data items from the compressed data block in sequence, according to the length of each compressed data item in the compressed data block.
When the compression dictionary includes a compression code for each character in each target text data block, the target text data blocks were compressed character by character using those compression codes. Because one text data item can include multiple characters, the code length of each compression code in the dictionary is multiplied by the number of characters in each text data item of the target text data block to obtain the length of each compressed data item in the compressed data block.
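For the per-character case, the length of each compressed data item is the code length multiplied by the character count of the corresponding text data item. A minimal sketch follows, assuming a bit-string representation and assuming that the character counts of the original text data items are available; the names and sample data are hypothetical.

```python
def compress_by_char(block, char_codes):
    """Compress each text data item character by character and concatenate."""
    return "".join(char_codes[ch] for text in block for ch in text)

def split_by_char_counts(compressed, code_length, char_counts):
    """Cut the block using item length = code_length * number of characters."""
    pieces, pos = [], 0
    for count in char_counts:
        length = code_length * count
        pieces.append(compressed[pos:pos + length])
        pos += length
    return pieces

char_codes = {"a": "00", "b": "01", "c": "10"}  # equal-length per-character codes
block = ["ab", "c", "cab"]
compressed = compress_by_char(block, char_codes)
pieces = split_by_char_counts(compressed, 2, [len(text) for text in block])
```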
With reference to the fourth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, after judging whether the code lengths of the multiple compression codes included in the compression dictionary are equal, the method further includes:
when the code lengths of the multiple compression codes are not equal, determining, for each of the at least two compressed data blocks, the multiple compressed data items from the compressed data block according to the data index of that compressed data block, where the data index indicates the position of each of the multiple compressed data items within the compressed data block.
With reference to the sixth possible implementation of the first aspect, in a seventh possible implementation of the first aspect, before determining the multiple compressed data items from the compressed data block according to its data index, the method further includes:
for the target text data block corresponding to the compressed data block, determining, while the target text data block is being compressed, the position within the compressed data block of the compressed data item corresponding to each text data item in the target text data block;
generating the data index of the compressed data block based on the position within the compressed data block of the compressed data item corresponding to each text data item.
It should be noted that, when determining these positions during compression of the target text data block, the start position and end position within the compressed data block of the compressed data item corresponding to each text data item can be determined, and the two together uniquely identify where that compressed data item lies. For convenience, the start positions of the compressed data items within the compressed data block can be used as the data index of the compressed data block.
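When code lengths differ, the start positions recorded during compression are enough to recover every compressed data item, because each item ends where the next one starts. The following sketch assumes a bit-string representation and a hypothetical variable-length dictionary; the function names are illustrative.

```python
def compress_with_index(block, codes):
    """Compress a block and record the start position of each compressed item."""
    parts, index, pos = [], [], 0
    for text in block:
        code = codes[text]
        index.append(pos)   # start position of this compressed data item
        parts.append(code)
        pos += len(code)
    return "".join(parts), index

def split_with_index(compressed, index):
    """Recover the items: each one runs from its start to the next item's start."""
    ends = index[1:] + [len(compressed)]
    return [compressed[start:end] for start, end in zip(index, ends)]

codes = {"the": "0", "quick": "10", "fox": "110"}  # unequal code lengths
block = ["quick", "the", "fox", "the"]
compressed, data_index = compress_with_index(block, codes)
pieces = split_with_index(compressed, data_index)
```

Storing only start positions keeps the index to one number per item, at the cost of deriving each end position from the following entry.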
In a second aspect, a data processing device is provided, the device including:
an acquisition module, configured to obtain a compression dictionary of at least two target text data blocks, where the at least two target text data blocks are data blocks, among multiple stored text data blocks, that will subsequently be processed by the same processing operation, each target text data block includes multiple text data items, each text data item includes multiple characters, and the compression dictionary includes a compression code for each text data item in each target text data block, or a compression code for each character in each target text data block;
a compression module, configured to compress, based on the compression dictionary, each of the at least two target text data blocks to obtain at least two compressed data blocks, where the at least two target text data blocks correspond one-to-one to the at least two compressed data blocks, each compressed data block includes multiple compressed data items, and the multiple compressed data items correspond one-to-one to the multiple text data items;
a processing module, configured to, when a processing instruction to perform the same processing operation on the at least two target text data blocks is received, process the compressed data in the at least two compressed data blocks so as to process the at least two target text data blocks.
With reference to the second aspect, in a first possible implementation of the second aspect, the device further includes:
a determining module, configured to determine the at least two target text data blocks from the multiple stored text data blocks;
a generation module, configured to generate the compression dictionary of the at least two target text data blocks.
With reference to the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the determining module includes:
a selecting unit, configured to select, from the multiple text data blocks, at least two text data blocks that belong to a target category, and to determine the selected text data blocks as the target text data blocks; or
a first determining unit, configured to, when a selection instruction for at least two of the multiple text data blocks is detected, determine the text data blocks selected by the selection instruction as the target text data blocks.
With reference to the second aspect, in a third possible implementation of the second aspect, the processing module includes:
a second determining unit, configured to determine the multiple compressed data items included in each of the at least two compressed data blocks;
a processing unit, configured to process the at least two compressed data blocks based on the multiple compressed data items included in each of them, to obtain a processing result of the at least two compressed data blocks;
a decompression unit, configured to decompress the processing result of the at least two compressed data blocks based on the compression dictionary, to obtain the processing result of the at least two target text data blocks.
With reference to the third possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the second determining unit includes:
a judging subunit, configured to judge whether the code lengths of the multiple compression codes included in the compression dictionary are equal;
a first determining subunit, configured to, when the code lengths of the multiple compression codes are equal and the compression dictionary includes a compression code for each text data item in each target text data block, determine, for each of the at least two compressed data blocks, the code length of each compression code as the length of each compressed data item in that compressed data block;
a second determining subunit, configured to determine the multiple compressed data items from the compressed data block in sequence, according to the length of each compressed data item in the compressed data block.
With reference to the fourth possible implementation of the second aspect, in a fifth possible implementation of the second aspect, the second determining unit further includes:
a third determining subunit, configured to, when the code lengths of the multiple compression codes are equal and the compression dictionary includes a compression code for each character in each target text data block, determine, for each of the at least two compressed data blocks, the target text data block corresponding to that compressed data block;
a computing subunit, configured to multiply the code length of each compression code by the number of characters in each text data item of the target text data block, to obtain the length of each compressed data item in the compressed data block;
a fourth determining subunit, configured to determine the multiple compressed data items from the compressed data block in sequence, according to the length of each compressed data item in the compressed data block.
With reference to the fourth possible implementation of the second aspect, in a sixth possible implementation of the second aspect, the second determining unit further includes:
a fifth determining subunit, configured to, when the code lengths of the multiple compression codes included in the compression dictionary are not equal, determine, for each of the at least two compressed data blocks, the multiple compressed data items from the compressed data block according to the data index of that compressed data block, where the data index indicates the position of each of the multiple compressed data items within the compressed data block.
With reference to the sixth possible implementation of the second aspect, in a seventh possible implementation of the second aspect, the second determining unit further includes:
a sixth determining subunit, configured to, for the target text data block corresponding to the compressed data block, determine, while the target text data block is being compressed, the position within the compressed data block of the compressed data item corresponding to each text data item in the target text data block;
a generating subunit, configured to generate the data index of the compressed data block based on the position within the compressed data block of the compressed data item corresponding to each text data item.
The technical solutions provided by the embodiments of the present invention bring the following beneficial effects.
In the embodiments of the present invention, for at least two target text data blocks that will subsequently be processed by the same processing operation, a compression dictionary of the at least two target text data blocks is obtained, and each of the target text data blocks is compressed based on that dictionary to obtain at least two compressed data blocks. That is, the at least two target text data blocks can share one compression dictionary, and a separate dictionary does not have to be generated for each text data block. Moreover, because the compression dictionary of the at least two target text data blocks is the same, the blocks are compressed according to the same compression standard. Processing the compressed data in the at least two compressed data blocks and then decompressing the processing result with the compression dictionary therefore yields the same result as processing the text data in the at least two target text data blocks. The embodiments of the present invention thus process the at least two target text data blocks by processing the compressed data in the at least two compressed data blocks, without decompressing those blocks, which reduces the amount of data to be processed, shortens the processing time, and saves processing resources.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of a data processing system provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a computer device provided by an embodiment of the present invention;
Fig. 3 is a flowchart of a data processing method provided by an embodiment of the present invention;
Fig. 4A is a flowchart of another data processing method provided by an embodiment of the present invention;
Fig. 4B is a schematic diagram of a target text data block involved in the embodiment of Fig. 4A;
Fig. 4C is a schematic diagram of a correspondence between target text data blocks and a compression dictionary involved in the embodiment of Fig. 4A;
Fig. 4D is a schematic diagram of another correspondence between target text data blocks and a compression dictionary involved in the embodiment of Fig. 4A;
Fig. 4E is a schematic diagram of the packed field index involved in the embodiment of Fig. 4A;
Fig. 5A is a schematic structural diagram of a data processing device provided by an embodiment of the present invention;
Fig. 5B is a schematic structural diagram of a processing module provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.
Fig. 1 is a schematic structural diagram of a data processing system provided by an embodiment of the present invention. Referring to Fig. 1, the system may be a distributed system or, of course, a non-distributed system; the embodiments of the present invention are described taking a distributed system as an example. The distributed system includes multiple devices, namely device 01, device 02, device 03, ..., device 0n. The devices are interconnected, and each may be a terminal or a server; the embodiments of the present invention do not specifically limit this.
In addition, each of the multiple devices may include a data import module and a data processing module. A designated device among the multiple devices may include not only a data import module and a data processing module but also a compression dictionary configuration module and a compression dictionary shared storage module; the designated device may be any one of the multiple devices. The distributed system may store multiple text data blocks. The compression dictionary configuration module is used to configure compression dictionaries for the multiple text data blocks stored in the distributed system, and a compression dictionary is used to compress a text data block to obtain a compressed data block. Correspondingly, the distributed system may also store multiple compressed data blocks, which correspond one-to-one to the multiple text data blocks. Notably, at least two text data blocks among the multiple text data blocks that will subsequently undergo the same processing operation may share one compression dictionary; that is, the at least two text data blocks may correspond to the same compression dictionary.
The compression dictionary shared storage module is used to store, in correspondence, the identifiers of the multiple text data blocks and their compression dictionaries. The data import module is used to obtain the compression dictionary of at least two target text data blocks from the compression dictionary shared storage module when data processing is to be performed on the at least two target text data blocks. The data processing module is used to process the compressed data in the compressed data blocks corresponding to the at least two target text data blocks, and to decompress the processing result based on the compression dictionary obtained by the data import module, thereby processing the at least two target text data blocks.
Fig. 2 is a schematic structural diagram of a computer device according to an embodiment of the present invention. The devices in the distributed system of Fig. 1 may be implemented by the computer device shown in Fig. 2. Referring to Fig. 2, the computer device includes at least one processor 201, a communication bus 202, a memory 203, and at least one communication interface 204.
The processor 201 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling execution of the program of the solution of the present invention.
The communication bus 202 may include a path for transferring information between the above components.
The memory 203 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions; a random access memory (RAM) or another type of dynamic storage device that can store information and instructions; an electrically erasable programmable read-only memory (EEPROM); a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like); a magnetic disk storage medium or another magnetic storage device; or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 203 may exist independently and be connected to the processor 201 through the communication bus 202. The memory 203 may also be integrated with the processor 201.
The communication interface 204 uses any transceiver-type apparatus to communicate with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
In a specific implementation, as an embodiment, the processor 201 may include one or more CPUs, such as CPU0 and CPU1 shown in Fig. 2.
In a specific implementation, as an embodiment, the computer device may include multiple processors, such as the processor 201 and the processor 208 shown in Fig. 2. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (for example, computer program instructions).
In a specific implementation, as an embodiment, the computer device may further include an output device 205 and an input device 206. The output device 205 communicates with the processor 201 and can display information in a variety of ways. For example, the output device 205 may be a liquid crystal display (LCD), a light-emitting diode (LED) display device, a cathode-ray tube (CRT) display device, or a projector. The input device 206 communicates with the processor 201 and can receive user input in a variety of ways. For example, the input device 206 may be a mouse, a keyboard, a touchscreen device, or a sensing device.
The above computer device may be a general-purpose computer device or a special-purpose computer device. In a specific implementation, the computer device may be a desktop computer, a portable computer, a network server, a personal digital assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, or an embedded device. The embodiments of the present invention do not limit the type of the computer device.
The memory 203 is used to store program code for executing the solution of the present invention, and execution is controlled by the processor 201. The processor 201 is used to execute the program code 210 stored in the memory 203. The program code 210 may include one or more software modules (for example, the data import module, the data processing module, the compression dictionary configuration module, and the compression dictionary shared storage module). A device in the distributed system shown in Fig. 1 may process data through the processor 201 and one or more software modules in the program code 210 in the memory 203.
Fig. 3 is a flowchart of a data processing method according to an embodiment of the present invention. Referring to Fig. 3, the method includes:
Step 301: Obtain a compression dictionary of at least two target text data blocks. The at least two target text data blocks are data blocks, among multiple stored text data blocks, that are subsequently processed by the same processing operation. Each target text data block includes multiple pieces of text data, and each piece of text data includes multiple characters. The compression dictionary includes a compression code for each piece of text data in each target text data block, or a compression code for each character in each target text data block.
Step 302: Based on the compression dictionary, compress each of the at least two target text data blocks separately to obtain at least two compressed data blocks. The at least two target text data blocks correspond one-to-one to the at least two compressed data blocks. Each compressed data block includes multiple pieces of compressed data, which correspond one-to-one to the multiple pieces of text data.
Step 303: When a processing instruction for performing the same processing operation on the at least two target text data blocks is received, process the compressed data in the at least two compressed data blocks, so as to process the at least two target text data blocks.
In the embodiments of the present invention, for at least two target text data blocks that are subsequently processed by the same processing operation, a compression dictionary of the at least two target text data blocks is obtained, and each of the at least two target text data blocks is compressed based on that compression dictionary to obtain at least two compressed data blocks. That is, the at least two target text data blocks can share the same compression dictionary, and there is no need to generate a separate compression dictionary for each text data block. In addition, because the at least two target text data blocks have the same compression dictionary, that is, they are compressed according to the same compression standard, processing the compressed data in the at least two compressed data blocks and then decompressing the processing result with the compression dictionary yields the same result as processing the text data in the at least two target text data blocks directly. The embodiments of the present invention therefore process the at least two target text data blocks by processing the at least two compressed data blocks, without decompressing the compressed data blocks. This reduces the amount of data to be processed, shortens the data processing time, and saves processing resources.
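The equivalence claimed above can be illustrated with a minimal sketch. It assumes a dictionary that assigns one integer code per distinct piece of text data, and it uses a join-style operation (finding the records common to two blocks) as the shared processing operation. The function names and the sample records are hypothetical and are not taken from the patent.

```python
def build_dictionary(blocks):
    """Assign one integer code per distinct record across all target blocks."""
    codes = {}
    for block in blocks:
        for record in block:
            codes.setdefault(record, len(codes))
    return codes


def compress(block, dictionary):
    """Replace every record with its code from the shared dictionary."""
    return [dictionary[record] for record in block]


def decompress(codes, dictionary):
    """Map codes back to records using the same shared dictionary."""
    reverse = {code: record for record, code in dictionary.items()}
    return [reverse[code] for code in codes]


block_a = ["101", "102", "103", "104"]
block_b = ["103", "104", "105"]

shared = build_dictionary([block_a, block_b])
comp_a = compress(block_a, shared)
comp_b = compress(block_b, shared)

# Process the compressed data only, then decompress just the result.
common_codes = set(comp_a) & set(comp_b)
result = decompress(sorted(common_codes), shared)

# Reference result computed directly on the text data.
expected = sorted(set(block_a) & set(block_b))
```

Because both blocks were encoded with the same dictionary, equal records always map to equal codes, so the intersection of the codes decodes to the intersection of the records. With per-block dictionaries this would not hold, since the same code could stand for different records in different blocks.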
Optionally, before the compression dictionary of the at least two target text data blocks is obtained, the method further includes:
determining at least two target text data blocks from multiple stored text data blocks; and
generating the compression dictionary of the at least two target text data blocks.
Optionally, determining at least two target text data blocks from multiple stored text data blocks includes:
selecting, from the multiple stored text data blocks, at least two text data blocks that belong to a target category, and determining the selected text data blocks as the target text data blocks; or
when a selection instruction for at least two text data blocks among the multiple text data blocks is detected, determining the text data blocks selected by the selection instruction as the target text data blocks.
Optionally, processing the compressed data in the at least two compressed data blocks, so as to process the at least two target text data blocks, includes:
determining the multiple pieces of compressed data included in each of the at least two compressed data blocks;
processing the at least two compressed data blocks based on the multiple pieces of compressed data included in each of them, to obtain a processing result of the at least two compressed data blocks; and
decompressing the processing result of the at least two compressed data blocks based on the compression dictionary, to obtain a processing result of the at least two target text data blocks.
Optionally, determining the multiple pieces of compressed data included in each of the at least two compressed data blocks includes:
judging whether the code lengths of the multiple compression codes included in the compression dictionary are equal;
when the code lengths of the multiple compression codes included in the compression dictionary are equal and the compression dictionary includes a compression code for each piece of text data in each target text data block, for each of the at least two compressed data blocks, determining the code length of each compression code as the length of each piece of compressed data in the compressed data block; and
determining the multiple pieces of compressed data from the compressed data block in sequence according to the length of each piece of compressed data in the compressed data block.
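The fixed-length case above can be sketched as follows. The sketch assumes byte-aligned compression codes stored back to back in the compressed data block; the patent does not specify the physical layout, and the helper names and sample dictionary are hypothetical.

```python
from typing import Optional


def uniform_code_length(dictionary: dict) -> Optional[int]:
    """Return the common code length, or None if the codes differ in length."""
    lengths = {len(code) for code in dictionary.values()}
    return lengths.pop() if len(lengths) == 1 else None


def split_fixed(compressed: bytes, code_len: int) -> list:
    """Cut a compressed block into consecutive records of code_len bytes."""
    if code_len <= 0 or len(compressed) % code_len != 0:
        raise ValueError("block length is not a multiple of the code length")
    return [compressed[i:i + code_len] for i in range(0, len(compressed), code_len)]


# Hypothetical dictionary: one 2-byte code per piece of text data.
dictionary = {"101": b"\x00\x01", "102": b"\x00\x02", "103": b"\x00\x03"}
block = b"".join(dictionary[text] for text in ["101", "103", "102"])

code_len = uniform_code_length(dictionary)
records = split_fixed(block, code_len) if code_len is not None else []
```

When every code has the same length, record boundaries follow from arithmetic alone, so no per-block index is needed.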
Optionally, after judging whether the code lengths of the multiple compression codes included in the compression dictionary are equal, the method further includes:
when the code lengths of the multiple compression codes included in the compression dictionary are equal and the compression dictionary includes a compression code for each character in each target text data block, for each of the at least two compressed data blocks, determining the target text data block corresponding to the compressed data block;
multiplying the code length of each compression code by the number of characters of each piece of text data in the target text data block, to obtain the length of each piece of compressed data in the compressed data block; and
determining the multiple pieces of compressed data from the compressed data block in sequence according to the length of each piece of compressed data in the compressed data block.
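For the per-character case, a minimal sketch under similar assumptions (byte-aligned codes of one fixed length, and character counts of the original text data available to the caller) might look like this. The names and sample values are hypothetical.

```python
def split_by_char_counts(compressed: bytes, code_len: int, char_counts: list) -> list:
    """Split a block where each record occupies code_len * (character count) bytes."""
    records = []
    pos = 0
    for count in char_counts:
        length = code_len * count
        records.append(compressed[pos:pos + length])
        pos += length
    if pos != len(compressed):
        raise ValueError("character counts do not match the block length")
    return records


# Hypothetical per-character dictionary with 1-byte codes.
char_codes = {"0": b"\x00", "1": b"\x01", "2": b"\x02", "a": b"\x0a"}
texts = ["101", "12", "a0a1"]

compressed = b"".join(char_codes[ch] for text in texts for ch in text)
char_counts = [len(text) for text in texts]
records = split_by_char_counts(compressed, 1, char_counts)
```

Here the length of each piece of compressed data is the product of the code length and the character count of the corresponding piece of text data, which is why the character counts must be known before the block can be split.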
Optionally, after judging whether the code lengths of the multiple compression codes included in the compression dictionary are equal, the method further includes:
when the code lengths of the multiple compression codes included in the compression dictionary are not equal, for each of the at least two compressed data blocks, determining the multiple pieces of compressed data from the compressed data block according to a data index of the compressed data block, where the data index of the compressed data block is used to indicate the position of each of the multiple pieces of compressed data in the compressed data block.
Optionally, before determining the multiple pieces of compressed data from the compressed data block according to the data index of the compressed data block, the method further includes:
for the target text data block corresponding to the compressed data block, determining, in the process of compressing the target text data block, the position in the compressed data block of the compressed data corresponding to each piece of text data in the target text data block; and
generating the data index of the compressed data block based on the position in the compressed data block of the compressed data corresponding to each piece of text data.
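For variable-length codes, the data index can be built while the block is being compressed and used later to locate each piece of compressed data. The following is a minimal sketch that assumes the index is a list of (offset, length) pairs; the patent only says that the index indicates positions, so this representation and the names used are assumptions.

```python
def compress_with_index(texts, dictionary):
    """Compress a block and record where each compressed record starts and how long it is."""
    out = bytearray()
    index = []
    for text in texts:
        code = dictionary[text]
        index.append((len(out), len(code)))  # position is known at write time
        out += code
    return bytes(out), index


def split_with_index(compressed, index):
    """Recover the individual compressed records using the stored index."""
    return [compressed[offset:offset + length] for offset, length in index]


# Hypothetical dictionary whose codes have different lengths.
dictionary = {"101": b"\x01", "102": b"\x02\x00", "103": b"\x03\x00\x00"}

compressed, index = compress_with_index(["102", "101", "103", "101"], dictionary)
records = split_with_index(compressed, index)
```

Recording the offsets during compression costs almost nothing, because the writer already knows the current output position, whereas reconstructing boundaries afterwards would require decoding the block.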
All of the above optional technical solutions may be combined in any manner to form optional embodiments of the present invention, which are not described one by one here.
Fig. 4A is a flowchart of a data processing method according to an embodiment of the present invention. Referring to Fig. 4A, the method includes:
Step 401: Determine at least two target text data blocks from multiple stored text data blocks.
It should be noted that the multiple text data blocks may be stored in a specified format, and the specified format may be preset. For example, the specified format may be the text (TextFile) format, the Parquet format, the SequenceFile format, the RCFile format, the Avro format, or the like, which is not specifically limited in the embodiments of the present invention. To improve the access efficiency of the multiple text data blocks, in practical applications they are usually stored in a distributed manner, meaning that the multiple text data blocks are dispersed across multiple independent devices. For example, the multiple text data blocks may be stored based on the Hadoop Distributed File System (HDFS), which is not specifically limited in the embodiments of the present invention.
In addition, the at least two target text data blocks are data blocks that are subsequently processed by the same processing operation. Each target text data block includes multiple pieces of text data, and each piece of text data includes multiple characters. For example, as shown in Fig. 4B, one of the at least two target text data blocks includes multiple pieces of text data, namely 101, 102, 103, and 104. The processing operation may include comparison, sorting, searching, hash operations, join operations, and the like, which is not specifically limited in the embodiments of the present invention.
Furthermore, text data refers to data composed of printable characters. The printable characters include characters 33 to 127 in ASCII, characters in Unicode, characters in UTF-8, and the like, which is not specifically limited in the embodiments of the present invention.
Specifically, when the distributed system determines the at least two target text data blocks from the multiple text data blocks, it may select, from the multiple text data blocks, at least two text data blocks that belong to a target category and determine the selected text data blocks as the target text data blocks. Alternatively, when a selection instruction for at least two text data blocks among the multiple text data blocks is detected, it may determine the text data blocks selected by the selection instruction as the target text data blocks.
It should be noted that text data blocks in the target category can be processed by the same processing operation. For example, the target category may be a number, a time, a user name, or the like, which is not specifically limited in the embodiments of the present invention.
In addition, the selection instruction is used to select target text data blocks from the multiple text data blocks. The selection instruction may be triggered by a user through a specified operation, and the specified operation may be a click operation, a double-click operation, a voice operation, or the like, which is not specifically limited in the embodiments of the present invention.
Because text data blocks in the target category can normally be processed by the same processing operation, the embodiments of the present invention can select at least two text data blocks that belong to the target category and determine the selected text data blocks as the target text data blocks. This determination is simple and requires no user involvement, which improves the efficiency of determining the target text data blocks.
In addition, because the selection instruction is triggered by a user, determining the text data blocks selected by the selection instruction as the target text data blocks in effect determines the target text data blocks according to the user's operation, which ensures that the determined target text data blocks meet the user's needs.
For example, if the text data blocks that belong to the target category among the multiple text data blocks are text data block 1, text data block 2, text data block 3, and text data block 4, then text data block 1, text data block 2, text data block 3, and text data block 4 can be selected from the multiple text data blocks, and the selected text data blocks can then be determined as the target text data blocks.
As another example, if a selection instruction for text data block 1, text data block 2, text data block 3, and text data block 4 among the multiple text data blocks is detected, then text data block 1, text data block 2, text data block 3, and text data block 4 selected by the selection instruction can be determined as the target text data blocks.
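The two ways of determining the target text data blocks can be sketched as follows. The `TextBlock` record, its `category` field, and the sample identifiers are hypothetical; the patent does not say how category membership is recorded.

```python
from typing import NamedTuple


class TextBlock(NamedTuple):
    block_id: str
    category: str


def select_by_category(blocks, target_category):
    """Automatic selection: every stored block in the target category."""
    return [block for block in blocks if block.category == target_category]


def select_by_instruction(blocks, selected_ids):
    """Manual selection: the blocks named by a user-triggered selection instruction."""
    wanted = set(selected_ids)
    return [block for block in blocks if block.block_id in wanted]


def require_at_least_two(chosen):
    """A shared dictionary is only meaningful for two or more target blocks."""
    if len(chosen) < 2:
        raise ValueError("at least two target text data blocks are required")
    return chosen


stored = [
    TextBlock("block1", "time"),
    TextBlock("block2", "time"),
    TextBlock("block3", "user_name"),
    TextBlock("block4", "time"),
]

targets_auto = require_at_least_two(select_by_category(stored, "time"))
targets_manual = require_at_least_two(select_by_instruction(stored, ["block1", "block3"]))
```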
It should be noted that the operation of determining the at least two target text data blocks from the multiple text data blocks may be performed by the compression dictionary configuration module in the distributed system. In practical applications, after the at least two target text data blocks are determined, the compression dictionary configuration module may also send the identifier of each of the at least two target text data blocks to the compression dictionary shared storage module, so that the compression dictionary shared storage module can determine the association between the at least two target text data blocks and can subsequently generate the same compression dictionary for them.
The identifier of a target text data block is used to uniquely identify the target text data block and may be the name of the target text data block or the like, which is not specifically limited in the embodiments of the present invention.
Step 402: Generate the compression dictionary of the at least two target text data blocks.
So that each of the at least two target text data blocks can subsequently be compressed according to the same compression standard, the compression dictionary of the at least two target text data blocks can be generated after the at least two target text data blocks are determined. The compression dictionary may be generated based on a specified compression algorithm and the at least two target text data blocks. Of course, in practical applications the compression dictionary of the at least two target text data blocks may also be generated in other ways, which is not specifically limited in the embodiments of the present invention.
It should be noted that the compression dictionary of the at least two target text data blocks may include a compression code for each piece of text data in each target text data block, or may include a compression code for each character in each target text data block, which is not specifically limited in the embodiments of the present invention. For example, suppose the text data included in the at least two target text data blocks is 101, 102, and 103. When the compression dictionary includes a compression code for each piece of text data in each target text data block, the compression dictionary may correspondingly include compression code 1, compression code 2, and compression code 3, as shown in Fig. 4C. When the compression dictionary includes a compression code for each character in each target text data block, if the characters included in text data 101 are 1, 2, and 3, with corresponding compression codes 1, 2, and 3, the characters included in text data 102 are 4, 5, and 6, with corresponding compression codes 4, 5, and 6, and the characters included in text data 103 are 7, 8, and 9, with corresponding compression codes 7, 8, and 9, then the compression dictionary may be as shown in Fig. 4D.
In addition, the specified compression algorithm may be preset, and it may be an order-preserving compression algorithm; that is, the lexicographic order of the encoded data remains the same as the lexicographic order of the original strings. For example, the specified compression algorithm may be the Huffman coding algorithm, the Hu-Tucker coding algorithm, or the like, which is not specifically limited in the embodiments of the present invention.
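The order-preserving property can be illustrated with a deliberately simple scheme that is not Huffman or Hu-Tucker coding: each distinct record is assigned its rank in sorted order as an integer code. This is a swapped-in stand-in chosen only to show why order preservation lets sorting and comparison run on compressed data; the names and sample records are hypothetical.

```python
def build_order_preserving_dictionary(blocks):
    """Assign each distinct record its rank in lexicographic order (illustrative scheme)."""
    values = sorted({record for block in blocks for record in block})
    return {value: rank for rank, value in enumerate(values)}


blocks = [["pear", "apple"], ["fig", "apple", "kiwi"]]
dictionary = build_order_preserving_dictionary(blocks)
reverse = {code: value for value, code in dictionary.items()}

# Sort the compressed data across both blocks, then decode only the result.
all_codes = [dictionary[record] for block in blocks for record in block]
sorted_via_codes = [reverse[code] for code in sorted(all_codes)]

# Reference: sort the original text data directly.
sorted_direct = sorted(record for block in blocks for record in block)
```

Because a smaller record always receives a smaller code, comparing two codes gives the same answer as comparing the two original records, so the sorted order computed on codes matches the order computed on text.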
For the operation of generating the compression dictionary of the at least two target text data blocks based on the specified compression algorithm and the at least two target text data blocks, reference may be made to the related art; it is not described in detail in the embodiments of the present invention.
It should be noted that, when the at least two target text data blocks are stored in a distributed manner, the device storing the at least two target text data blocks may, upon receiving a compression instruction for them, send the at least two target text data blocks to the compression dictionary shared storage module. The compression dictionary shared storage module then generates the compression dictionary of the at least two target text data blocks and returns it to the device, so that the device can compress the at least two target text data blocks. Of course, the compression dictionary shared storage module may also actively obtain the at least two target text data blocks in order to generate their compression dictionary, which is not specifically limited in the embodiments of the present invention. The compression instruction is used to instruct compression of the at least two target text data blocks and may be triggered by a specified operation, which is not specifically limited in the embodiments of the present invention.
In addition, after the compression dictionary shared storage module generates the compression dictionary, it may delete the at least two target text data blocks to save its storage resources.
Furthermore, after the compression dictionary shared storage module generates the compression dictionary of the at least two target text data blocks, it may also store the identifier of each of the at least two target text data blocks, together with the compression dictionary, in a correspondence between text data block identifiers and compression dictionaries. When the above device subsequently obtains the compression dictionary of a target text data block, the compression dictionary can then be obtained quickly and conveniently from that correspondence, directly based on the identifier of the target text data block.
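The correspondence between text data block identifiers and compression dictionaries can be sketched as a small in-memory store. The class name, its methods, and the use of a Python dict are assumptions made for illustration; the patent does not describe the storage structure.

```python
class SharedDictionaryStore:
    """Maps text data block identifiers to compression dictionaries (illustrative)."""

    def __init__(self):
        self._by_block_id = {}

    def register(self, block_ids, dictionary):
        """Associate every target block identifier with the same dictionary object."""
        for block_id in block_ids:
            self._by_block_id[block_id] = dictionary

    def lookup(self, block_id):
        """Return the dictionary stored for a block identifier."""
        return self._by_block_id[block_id]

    def distinct_dictionaries(self):
        """Count how many dictionary objects are actually held."""
        return len({id(d) for d in self._by_block_id.values()})


store = SharedDictionaryStore()
shared = {"101": 0, "102": 1, "103": 2}
store.register(["block1", "block2", "block3"], shared)

same_object = store.lookup("block1") is store.lookup("block3")
held = store.distinct_dictionaries()
```

Every identifier points at one shared dictionary object, so the dictionary is held once no matter how many target text data blocks use it.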
It should be noted that, in the related art, compressing a text data block usually involves first generating the compression dictionary corresponding to that text data block, then compressing the text data block based on its compression dictionary to obtain the corresponding compressed data block, and then storing the compression dictionary together with the compressed data block, so that the compressed data block can later be decompressed based on the compression dictionary. That is, in the related art, storing multiple compressed data blocks also requires storing the compression dictionary corresponding to each of them, which consumes considerable storage resources. In the embodiments of the present invention, because the at least two target text data blocks use the same compression dictionary, the compression dictionary of the at least two target text data blocks only needs to be stored once in the compression dictionary shared storage module, which saves storage resources.
In addition, in the related art, generating the compression dictionaries of multiple text data blocks requires one generation operation for each of the text data blocks; that is, multiple generation operations must be performed to obtain the compression dictionaries of all of the text data blocks, which consumes considerable processing resources. In the embodiments of the present invention, the at least two target text data blocks can be determined in advance, and every one of them has the same compression dictionary, so only one generation operation needs to be performed on the at least two target text data blocks to obtain their compression dictionary, which saves processing resources.
It should be noted that, in the embodiments of the present invention, the compression dictionary of the at least two target text data blocks can be determined through steps 401-402 above, and the operations of compressing and processing the at least two target text data blocks based on that compression dictionary can be implemented through steps 403-405 below.
As can be seen from the above description, the method provided in the embodiments of the present invention is used in a distributed system, and the compression dictionary shared storage module is located in a designated device among the multiple devices of the distributed system. However, the multiple text data blocks are stored in a distributed manner across the multiple devices, and the at least two target text data blocks are among them; that is, the at least two target text data blocks are also stored in a distributed manner across the multiple devices. Therefore, after the compression dictionary shared storage module generates the compression dictionary of the at least two target text data blocks, it may store the compression dictionary. When the distributed system needs to perform data processing on the at least two target text data blocks, it then obtains the compression dictionary in the following manner and carries out the subsequent processing steps, as shown in steps 403-405 below.
Step 403: Obtain the compression dictionary of the at least two target text data blocks.
It should be noted that the operation of obtaining the compression dictionary of the at least two target text data blocks may be performed by the data import module of the device that stores the at least two target text data blocks. Specifically, the data import module of that device may send a compression dictionary acquisition request to the compression dictionary shared storage module, where the request carries the identifier of a target text data block. When the compression dictionary shared storage module receives the request, it may obtain the compression dictionary of the target text data block from the stored correspondence between text data block identifiers and compression dictionaries, based on the identifier of the target text data block, and send the compression dictionary to the data import module. Of course, in practical applications the compression dictionary of the at least two target text data blocks may also be obtained in other ways, which is not specifically limited in the embodiments of the present invention.
In addition, in practical applications, the compression dictionary of the at least two target text data blocks may be obtained when a compression instruction for the at least two target text data blocks is received. It may of course also be obtained in other circumstances, as long as the compression dictionary has been obtained before the at least two target text data blocks are compressed, which is not specifically limited in the embodiments of the present invention.
It should be noted that the compression instruction is used to instruct compression of the at least two target text data blocks. The compression instruction may be triggered by a user, or it may be triggered when the distributed system detects a certain trigger event, which is not specifically limited in the embodiments of the present invention.
Further, when the compression dictionary of the target text data block is not stored in the compression dictionary shared storage module, the compression dictionary shared storage module needs to generate the compression dictionary of the target text data block based on the target text data block and the specified compression algorithm.
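The request handling in step 403, including the fallback when no dictionary is stored yet, can be sketched as a get-or-generate lookup. The function names, the plain dict used as the store, and the rank-based `build_dictionary` are hypothetical stand-ins for the shared storage module and the specified compression algorithm.

```python
def get_or_generate(store, block_id, load_block, build_dictionary):
    """Return the stored dictionary for block_id, generating and storing it on a miss."""
    dictionary = store.get(block_id)
    if dictionary is None:
        dictionary = build_dictionary(load_block(block_id))
        store[block_id] = dictionary
    return dictionary


build_calls = []


def build_dictionary(records):
    """Stand-in for the specified compression algorithm; counts how often it runs."""
    build_calls.append(1)
    return {value: code for code, value in enumerate(sorted(set(records)))}


blocks = {"block1": ["103", "101", "103"]}
store = {}

first = get_or_generate(store, "block1", blocks.__getitem__, build_dictionary)
second = get_or_generate(store, "block1", blocks.__getitem__, build_dictionary)
```

The second request is served from the store, so the generation operation runs only once for a given identifier.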
Step 404: Based on the compression dictionary of the at least two target text data blocks, compress each target text data block of the at least two target text data blocks separately, to obtain at least two compressed data blocks.
It should be noted that the at least two target text data blocks are in one-to-one correspondence with the at least two compressed data blocks, and each compressed data block includes multiple compressed data items. For any given compressed data block, the multiple compressed data items it includes are in one-to-one correspondence with the multiple text data items included in the target text data block corresponding to that compressed data block. That is, each text data item in the at least two target text data blocks has a uniquely corresponding compressed data item in the at least two compressed data blocks.
In addition, the compression dictionary based at least two target texts data block, respectively at least two target text
Each target text data block in data block is compressed, and the operation for obtaining at least two compression data blocks may be referred to correlation
Technology, the embodiment of the present invention is to this without elaborating.
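As a rough illustration of Step 404, the sketch below compresses two text data blocks with one shared dictionary. The function name `compress_block`, the sample blocks, and the representation of a compressed data block as a string of bits are assumptions for the example; the description itself defers the compression operation to the related art.

```python
def compress_block(text_items, dictionary):
    """Replace each text item with its code and concatenate the codes."""
    return "".join(dictionary[item] for item in text_items)


# One dictionary shared by both target text data blocks (sample values).
shared_dictionary = {"aaa": "00", "bbb": "01", "ccc": "10"}
block_a = ["aaa", "ccc", "aaa"]
block_b = ["bbb", "ccc"]

# One compressed data block per target text data block, in the same order.
compressed_blocks = [compress_block(b, shared_dictionary) for b in (block_a, block_b)]
```

Because the codes are concatenated in item order, each text item maps to exactly one run of bits in its compressed block, which is the one-to-one correspondence noted above.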
Step 405: When a processing instruction for performing the same processing operation on the at least two target text data blocks is received, process the compressed data in the at least two compressed data blocks, so as to process the at least two target text data blocks.
It should be noted that, in general, the multiple text data blocks are stored in a distributed manner. In the related art, when a processing instruction for performing the same processing operation on two of the multiple text data blocks is received, the compression dictionaries of the two text data blocks are different. The two stored compressed data blocks corresponding to the two text data blocks therefore have to be decompressed separately first to obtain the two text data blocks, which are then transmitted to the device that performs the data processing and processed by that device.
In the embodiments of the present invention, by contrast, the compressed data in the at least two compressed data blocks is processed directly in order to process the at least two target text data blocks. Therefore, when the at least two target text data blocks are to be processed, only the at least two compressed data blocks corresponding to them need to be transmitted to the device that performs the data processing, and that device processes the compressed data in the at least two compressed data blocks. Compared with the related art, in which a compressed data block must be decompressed before the text data block can be processed, the embodiments of the present invention process the target text data blocks directly on the basis of the compressed data blocks, which reduces data processing time and saves processing resources. In addition, compared with the related art, in which text data blocks are transmitted between the multiple devices included in the distributed system, only compressed data blocks need to be transmitted, which reduces the volume of transmitted data, improves network bandwidth utilization, and saves data transmission time. Furthermore, compared with the related art, in which the text data is processed directly, only the compressed data needs to be processed, which reduces the amount of data processed, shortens data processing time, and saves processing resources.
Specifically, the operation of processing the compressed data in the at least two compressed data blocks so as to process the at least two target text data blocks may be: determining the multiple compressed data items included in each compressed data block of the at least two compressed data blocks; processing the at least two compressed data blocks based on the multiple compressed data items included in each of them, to obtain a processing result of the at least two compressed data blocks; and decompressing the processing result of the at least two compressed data blocks based on the compression dictionary, to obtain the processing result of the at least two target text data blocks.
The compression dictionary stores the compressed code of each text data item in each target text data block, or the compressed code of each character in each target text data block. Compressing the at least two target text data blocks therefore amounts to converting the text data in them into compressed data. Because the conversion rule is fixed, the processing result obtained by computing on the compressed data is also composed of compressed codes. Decompressing that processing result based on the compression dictionary converts the result composed of compressed codes back into text form and, again because the conversion rule is fixed, the converted result is the processing result of the at least two target text data blocks.
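The reasoning above can be made concrete with a small sketch. The processing operation chosen here (finding the text items that occur in both blocks) is only an example of "the same processing operation", and the function name `common_text_items` and the sample data are assumptions. The sketch also assumes that the compressed data items of each block have already been separated into a list of codes.

```python
def common_text_items(codes_a, codes_b, dictionary):
    """Find the text items present in both blocks by comparing codes only."""
    # The processing runs on compressed data: with a shared dictionary,
    # equal codes imply equal text items.
    common_codes = set(codes_a) & set(codes_b)
    # Only the (small) processing result is decompressed.
    inverse = {code: text for text, code in dictionary.items()}
    return sorted(inverse[code] for code in common_codes)


shared_dictionary = {"aaa": "00", "bbb": "01", "ccc": "10"}
block_a = ["aaa", "ccc", "aaa"]
block_b = ["bbb", "ccc"]
codes_a = [shared_dictionary[t] for t in block_a]
codes_b = [shared_dictionary[t] for t in block_b]

result_from_codes = common_text_items(codes_a, codes_b, shared_dictionary)
result_from_text = sorted(set(block_a) & set(block_b))
```

Both results are the same list, which is the property the paragraph relies on: a fixed, shared conversion rule lets the operation run on codes and defer decompression to the result.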
In the embodiments of the present invention, the processing of the compressed data in the at least two compressed data blocks, so as to process the at least two target text data blocks, may be performed by a data processing module in the device. That is, the data processing module can process the compressed data in the at least two compressed data blocks and then decompress the processing result based on the compression dictionary of the at least two target text data blocks, thereby processing the at least two target text data blocks.
The operation of processing the at least two compressed data blocks based on the multiple compressed data items included in each of them is similar to the related-art operation of processing two text data blocks based on the multiple text data items included in each of them, and is not described in detail in the embodiments of the present invention.
The operation of decompressing the processing result of the at least two compressed data blocks based on the compression dictionary, to obtain the processing result of the at least two target text data blocks, is similar to the related-art operation of decompressing a compressed data block based on its corresponding compression dictionary, and is not described in detail in the embodiments of the present invention.
When determining the multiple compressed data items included in each compressed data block of the at least two compressed data blocks, it may first be judged whether the code lengths of the multiple compressed codes included in the compression dictionary are equal. Depending on the judgment result, the multiple compressed data items included in each compressed data block are then determined according to one of the following three cases:
First case: when the code lengths of the multiple compressed codes included in the compression dictionary are equal and the compression dictionary includes the compressed code of each text data item in each target text data block, then, for each compressed data block of the at least two compressed data blocks, the code length of each compressed code is determined as the length of each compressed data item in the compressed data block; and the multiple compressed data items are determined in sequence from the compressed data block according to the length of each compressed data item in the compressed data block.
As can be seen from the above description, the compression dictionary is generated by a specified compression algorithm. For different compression algorithms, the code lengths of the compressed codes included in the generated compression dictionaries may differ, and the code lengths of the individual compressed codes within one compression dictionary may also differ from one another. Because the compressed data blocks are obtained by compressing based on the compression dictionary, when determining the multiple compressed data items included in each compressed data block of the at least two compressed data blocks, it may first be judged whether the code lengths of the multiple compressed codes included in the compression dictionary are equal.
In addition, when the compression dictionary includes the compressed code of each text data item in each target text data block, it follows that, when the target text data block was compressed, each text data item was compressed based on its own compressed code. The code length of each compressed code in the compression dictionary can therefore be determined as the length of each compressed data item in the compressed data block.
When the multiple compressed data items are determined in sequence from the compressed data block according to the length of each compressed data item, the compressed data block can be divided in sequence according to that length to obtain the multiple compressed data items.
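The first case can be sketched as a simple fixed-width split. The function name `split_fixed_length` and the representation of a compressed data block as a string of bits are assumptions for the example.

```python
def split_fixed_length(block_bits, dictionary):
    """Split a compressed block into items when all codes have equal length."""
    code_lengths = {len(code) for code in dictionary.values()}
    if len(code_lengths) != 1:
        raise ValueError("codes are not all the same length")
    width = code_lengths.pop()
    # Divide the block in sequence, one code length at a time.
    return [block_bits[i:i + width] for i in range(0, len(block_bits), width)]


dictionary = {"aaa": "00", "bbb": "01", "ccc": "10"}
items = split_fixed_length("001000", dictionary)
```

No per-item metadata is needed in this case, because the code length alone fixes every item boundary.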
Second case: when the code lengths of the multiple compressed codes included in the compression dictionary are equal and the compression dictionary includes the compressed code of each character in each target text data block, then, for each compressed data block of the at least two compressed data blocks, the target text data block corresponding to the compressed data block is determined; the code length of each compressed code is multiplied by the number of characters of each text data item in the target text data block, to obtain the length of each compressed data item in the compressed data block; and the multiple compressed data items are determined in sequence from the compressed data block according to the length of each compressed data item in the compressed data block.
When the compression dictionary includes the compressed code of each character in each target text data block, it follows that, when the target text data block was compressed, each character was compressed based on its own compressed code. Because one text data item can include multiple characters, the code length of each compressed code in the compression dictionary can be multiplied by the number of characters of each text data item in the target text data block, to obtain the length of each compressed data item in the compressed data block.
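For the second case, each item length is the code length multiplied by that item's character count. The sketch below assumes the character counts of the text items in the corresponding target text data block are available as a list; how they are obtained is not specified in the description, and the name `split_by_char_counts` is hypothetical.

```python
def split_by_char_counts(block_bits, code_length, char_counts):
    """Split a block compressed character by character with equal-length codes."""
    items, position = [], 0
    for count in char_counts:
        length = code_length * count  # item length = code length x characters
        items.append(block_bits[position:position + length])
        position += length
    return items


# Per-character dictionary with 2-bit codes (sample values).
char_dictionary = {"a": "00", "b": "01", "c": "10"}
text_items = ["ab", "cab"]
block_bits = "".join(char_dictionary[ch] for item in text_items for ch in item)
items = split_by_char_counts(block_bits, 2, [len(t) for t in text_items])
```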
Third case: when the code lengths of the multiple compressed codes included in the compression dictionary are unequal, then, for each compressed data block of the at least two compressed data blocks, the multiple compressed data items are determined from the compressed data block according to the data directory of the compressed data block, where the data directory of the compressed data block indicates the position of each of the multiple compressed data items in the compressed data block.
Further, before the multiple compressed data items are determined from the compressed data block according to the data directory of the compressed data block, the method also includes: for the target text data block corresponding to the compressed data block, determining, while the target text data block is being compressed, the position in the compressed data block of the compressed data item corresponding to each text data item in the target text data block; and generating the data directory of the compressed data block based on the position in the compressed data block of the compressed data item corresponding to each text data item.
It should be noted that, when the position in the compressed data block of the compressed data item corresponding to each text data item is determined while the target text data block is being compressed, the start position and end position of that compressed data item in the compressed data block can be determined; the start position and end position together uniquely determine the position of the compressed data item in the compressed data block. For convenience, the start position of each compressed data item in the compressed data block can be used as the data directory of the compressed data block.
For example, after a text data item aaabbbccc of a target text data block is compressed, the resulting compressed data is 0001010, and the compression dictionary records that aaa corresponds to 00, bbb corresponds to 010, and ccc corresponds to 10. So that the compressed data corresponding to each field can be obtained without decompressing the data, the following is determined while the text data is being compressed: after compression, aaa has start position 0, end position 2, and length 2 - 0 = 2; bbb has start position 2, end position 5, and length 5 - 2 = 3; and ccc has start position 5, end position 7, and length 7 - 5 = 2. Referring to Fig. 4E, the start position 0 of compressed data 00, the start position 2 of compressed data 010, and the start position 5 of compressed data 10 can be determined as the data directory of the compressed data block.
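The worked example above can be reproduced with a short sketch. The function names `compress_with_directory` and `split_by_directory` are hypothetical, and the data directory is modeled simply as a list of start positions, which is one reading of the description rather than a prescribed layout.

```python
def compress_with_directory(fields, dictionary):
    """Compress fields and record each compressed item's start position."""
    pieces, starts, position = [], [], 0
    for field in fields:
        code = dictionary[field]
        starts.append(position)   # start position of this compressed item
        pieces.append(code)
        position += len(code)     # end position, and start of the next item
    return "".join(pieces), starts


def split_by_directory(block_bits, starts):
    """Recover the compressed items from start positions alone."""
    ends = starts[1:] + [len(block_bits)]
    return [block_bits[s:e] for s, e in zip(starts, ends)]


dictionary = {"aaa": "00", "bbb": "010", "ccc": "10"}
block_bits, directory = compress_with_directory(["aaa", "bbb", "ccc"], dictionary)
items = split_by_directory(block_bits, directory)
```

Storing only start positions is sufficient because each item ends where the next begins, and the last item ends at the end of the block.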
In the embodiments of the present invention, for at least two target text data blocks that are subsequently to be processed by the same processing operation, the compression dictionary of the at least two target text data blocks is obtained, and each target text data block is compressed based on that compression dictionary to obtain at least two compressed data blocks. That is, the at least two target text data blocks can share one compression dictionary, and a separate compression dictionary does not need to be generated for each text data block. In addition, because the compression dictionary of the at least two target text data blocks is the same, that is, the at least two target text data blocks are compressed by the same compression standard, the result obtained by processing the compressed data in the at least two compressed data blocks and then decompressing the processing result with the compression dictionary is identical to the result obtained by processing the text data in the at least two target text data blocks. The embodiments of the present invention therefore process the at least two target text data blocks by processing the at least two compressed data blocks, without decompressing the at least two compressed data blocks, which reduces the amount of data processed, shortens data processing time, and saves processing resources.
Fig. 5A is a schematic structural diagram of a data processing device provided in an embodiment of the present invention. Referring to Fig. 5A, the data processing device includes:
an acquisition module 501, configured to obtain a compression dictionary of at least two target text data blocks, where the at least two target text data blocks are data blocks, among multiple stored text data blocks, that are subsequently to be processed by the same processing operation, each target text data block includes multiple text data items, each text data item includes multiple characters, and the compression dictionary includes the compressed code of each text data item in each target text data block, or includes the compressed code of each character in each target text data block;
a compression module 502, configured to compress, based on the compression dictionary, each target text data block of the at least two target text data blocks separately to obtain at least two compressed data blocks, where the at least two target text data blocks are in one-to-one correspondence with the at least two compressed data blocks, each compressed data block includes multiple compressed data items, and the multiple compressed data items are in one-to-one correspondence with the multiple text data items;
a processing module 503, configured to process, when a processing instruction for performing the same processing operation on the at least two target text data blocks is received, the compressed data in the at least two compressed data blocks, so as to process the at least two target text data blocks.
Further, the device also includes:
a determining module, configured to determine the at least two target text data blocks from the multiple stored text data blocks;
a generation module, configured to generate the compression dictionary of the at least two target text data blocks.
The determining module includes:
a selecting unit, configured to select, from the multiple text data blocks, at least two text data blocks that belong to a target category, and determine the selected text data blocks as the target text data blocks; or
a first determining unit, configured to determine, when a selection instruction for at least two text data blocks among the multiple text data blocks is detected, the text data blocks selected by the selection instruction as the target text data blocks.
Referring to Fig. 5B, the processing module 503 includes:
a second determining unit 5031, configured to determine the multiple compressed data items included in each compressed data block of the at least two compressed data blocks;
a processing unit 5032, configured to process the at least two compressed data blocks based on the multiple compressed data items included in each compressed data block of the at least two compressed data blocks, to obtain a processing result of the at least two compressed data blocks;
a decompression unit 5033, configured to decompress the processing result of the at least two compressed data blocks based on the compression dictionary, to obtain the processing result of the at least two target text data blocks.
The second determining unit 5031 includes:
a judgment subunit, configured to judge whether the code lengths of the multiple compressed codes included in the compression dictionary are equal;
a first determining subunit, configured to determine, when the code lengths of the multiple compressed codes included in the compression dictionary are equal and the compression dictionary includes the compressed code of each text data item in each target text data block, for each compressed data block of the at least two compressed data blocks, the code length of each compressed code as the length of each compressed data item in the compressed data block;
a second determining subunit, configured to determine the multiple compressed data items in sequence from the compressed data block according to the length of each compressed data item in the compressed data block.
Further, the second determining unit 5031 also includes:
a third determining subunit, configured to determine, when the code lengths of the multiple compressed codes included in the compression dictionary are equal and the compression dictionary includes the compressed code of each character in each target text data block, for each compressed data block of the at least two compressed data blocks, the target text data block corresponding to the compressed data block;
a computing subunit, configured to multiply the code length of each compressed code by the number of characters of each text data item in the target text data block, to obtain the length of each compressed data item in the compressed data block;
a fourth determining subunit, configured to determine the multiple compressed data items in sequence from the compressed data block according to the length of each compressed data item in the compressed data block.
Further, the second determining unit 5031 also includes:
a fifth determining subunit, configured to determine, when the code lengths of the multiple compressed codes included in the compression dictionary are unequal, for each compressed data block of the at least two compressed data blocks, the multiple compressed data items from the compressed data block according to the data directory of the compressed data block, where the data directory of the compressed data block indicates the position of each of the multiple compressed data items in the compressed data block.
Further, the second determining unit 5031 also includes:
a sixth determining subunit, configured to determine, for the target text data block corresponding to the compressed data block and while the target text data block is being compressed, the position in the compressed data block of the compressed data item corresponding to each text data item in the target text data block;
a generating subunit, configured to generate the data directory of the compressed data block based on the position in the compressed data block of the compressed data item corresponding to each text data item.
In the embodiments of the present invention, for at least two target text data blocks that are subsequently to be processed by the same processing operation, the compression dictionary of the at least two target text data blocks is obtained, and each target text data block is compressed based on that compression dictionary to obtain at least two compressed data blocks. That is, the at least two target text data blocks can share one compression dictionary, and a separate compression dictionary does not need to be generated for each text data block. In addition, because the compression dictionary of the at least two target text data blocks is the same, that is, the at least two target text data blocks are compressed by the same compression standard, the result obtained by processing the compressed data in the at least two compressed data blocks and then decompressing the processing result with the compression dictionary is identical to the result obtained by processing the text data in the at least two target text data blocks. The embodiments of the present invention therefore process the at least two target text data blocks by processing the at least two compressed data blocks, without decompressing the at least two compressed data blocks, which reduces the amount of data processed, shortens data processing time, and saves processing resources.
It should be noted that, when the data processing device provided in the above embodiment processes data, the division into the functional modules described above is used only as an example. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the data processing device provided in the above embodiment belongs to the same concept as the data processing method embodiment; for its specific implementation process, refer to the method embodiment, which is not repeated here.
A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (16)
1. A data processing method, characterized in that the method comprises:
obtaining a compression dictionary of at least two target text data blocks, the at least two target text data blocks being data blocks, among multiple stored text data blocks, that are subsequently to be processed by the same processing operation, each target text data block including multiple text data items, each text data item including multiple characters, and the compression dictionary including the compressed code of each text data item in each target text data block, or including the compressed code of each character in each target text data block;
based on the compression dictionary, compressing each target text data block of the at least two target text data blocks separately to obtain at least two compressed data blocks, the at least two target text data blocks being in one-to-one correspondence with the at least two compressed data blocks, each compressed data block including multiple compressed data items, and the multiple compressed data items being in one-to-one correspondence with the multiple text data items;
when a processing instruction for performing the same processing operation on the at least two target text data blocks is received, processing the compressed data in the at least two compressed data blocks, so as to process the at least two target text data blocks.
2. The method according to claim 1, characterized in that, before the obtaining of the compression dictionary of the at least two target text data blocks, the method further comprises:
determining the at least two target text data blocks from the multiple stored text data blocks;
generating the compression dictionary of the at least two target text data blocks.
3. The method according to claim 2, characterized in that the determining of the at least two target text data blocks from the multiple stored text data blocks comprises:
selecting, from the multiple text data blocks, at least two text data blocks that belong to a target category, and determining the selected text data blocks as the target text data blocks; or
when a selection instruction for at least two text data blocks among the multiple text data blocks is detected, determining the text data blocks selected by the selection instruction as the target text data blocks.
4. The method according to claim 1, characterized in that the processing of the compressed data in the at least two compressed data blocks, so as to process the at least two target text data blocks, comprises:
determining the multiple compressed data items included in each compressed data block of the at least two compressed data blocks;
processing the at least two compressed data blocks based on the multiple compressed data items included in each compressed data block of the at least two compressed data blocks, to obtain a processing result of the at least two compressed data blocks;
decompressing the processing result of the at least two compressed data blocks based on the compression dictionary, to obtain the processing result of the at least two target text data blocks.
5. The method according to claim 4, characterized in that the determining of the multiple compressed data items included in each compressed data block of the at least two compressed data blocks comprises:
judging whether the code lengths of the multiple compressed codes included in the compression dictionary are equal;
when the code lengths of the multiple compressed codes included in the compression dictionary are equal and the compression dictionary includes the compressed code of each text data item in each target text data block, determining, for each compressed data block of the at least two compressed data blocks, the code length of each compressed code as the length of each compressed data item in the compressed data block;
determining the multiple compressed data items in sequence from the compressed data block according to the length of each compressed data item in the compressed data block.
6. The method according to claim 5, characterized in that, after the judging of whether the code lengths of the multiple compressed codes included in the compression dictionary are equal, the method further comprises:
when the code lengths of the multiple compressed codes included in the compression dictionary are equal and the compression dictionary includes the compressed code of each character in each target text data block, determining, for each compressed data block of the at least two compressed data blocks, the target text data block corresponding to the compressed data block;
multiplying the code length of each compressed code by the number of characters of each text data item in the target text data block, to obtain the length of each compressed data item in the compressed data block;
determining the multiple compressed data items in sequence from the compressed data block according to the length of each compressed data item in the compressed data block.
7. The method according to claim 5, characterized in that, after the judging of whether the code lengths of the multiple compressed codes included in the compression dictionary are equal, the method further comprises:
when the code lengths of the multiple compressed codes included in the compression dictionary are unequal, determining, for each compressed data block of the at least two compressed data blocks, the multiple compressed data items from the compressed data block according to the data directory of the compressed data block, the data directory of the compressed data block indicating the position of each of the multiple compressed data items in the compressed data block.
8. The method according to claim 7, characterized in that, before the determining of the multiple compressed data items from the compressed data block according to the data directory of the compressed data block, the method further comprises:
for the target text data block corresponding to the compressed data block, determining, while the target text data block is being compressed, the position in the compressed data block of the compressed data item corresponding to each text data item in the target text data block;
generating the data directory of the compressed data block based on the position in the compressed data block of the compressed data item corresponding to each text data item.
9. A data processing device, characterized in that the device comprises:
an acquisition module, configured to obtain a compression dictionary of at least two target text data blocks, the at least two target text data blocks being data blocks, among multiple stored text data blocks, that are subsequently to be processed by the same processing operation, each target text data block including multiple text data items, each text data item including multiple characters, and the compression dictionary including the compressed code of each text data item in each target text data block, or including the compressed code of each character in each target text data block;
a compression module, configured to compress, based on the compression dictionary, each target text data block of the at least two target text data blocks separately to obtain at least two compressed data blocks, the at least two target text data blocks being in one-to-one correspondence with the at least two compressed data blocks, each compressed data block including multiple compressed data items, and the multiple compressed data items being in one-to-one correspondence with the multiple text data items;
a processing module, configured to process, when a processing instruction for performing the same processing operation on the at least two target text data blocks is received, the compressed data in the at least two compressed data blocks, so as to process the at least two target text data blocks.
10. The device according to claim 9, characterized in that the device further comprises:
a determining module, configured to determine the at least two target text data blocks from the multiple stored text data blocks;
a generation module, configured to generate the compression dictionary of the at least two target text data blocks.
11. The device according to claim 10, characterized in that the determining module comprises:
a selecting unit, configured to select, from the multiple text data blocks, at least two text data blocks that belong to a target category, and determine the selected text data blocks as the target text data blocks; or
a first determining unit, configured to determine, when a selection instruction for at least two text data blocks among the multiple text data blocks is detected, the text data blocks selected by the selection instruction as the target text data blocks.
12. The device according to claim 9, characterized in that the processing module comprises:
A second determining unit, configured to determine the multiple pieces of compressed data included in each compression data block of the at least two compression data blocks;
A processing unit, configured to process the at least two compression data blocks based on the multiple pieces of compressed data included in each compression data block of the at least two compression data blocks, to obtain a processing result of the at least two compression data blocks;
A decompression unit, configured to decompress, based on the compression dictionary, the processing result of the at least two compression data blocks, to obtain a processing result of the at least two target text data blocks.
13. The device according to claim 12, characterized in that the second determining unit comprises:
A judging subunit, configured to judge whether the code lengths of the multiple compressed codes included in the compression dictionary are equal;
A first determining subunit, configured to, when the code lengths of the multiple compressed codes included in the compression dictionary are equal and the compression dictionary includes the compressed code of each piece of text data in each target text data block, determine, for each compression data block of the at least two compression data blocks, the code length of each compressed code as the length of each piece of compressed data in the compression data block;
A second determining subunit, configured to determine the multiple pieces of compressed data in sequence from the compression data block according to the length of each piece of compressed data in the compression data block.
14. The device according to claim 13, characterized in that the second determining unit further comprises:
A third determining subunit, configured to, when the code lengths of the multiple compressed codes included in the compression dictionary are equal and the compression dictionary includes the compressed code of each character in each target text data block, determine, for each compression data block of the at least two compression data blocks, the target text data block corresponding to the compression data block;
A computing subunit, configured to multiply the code length of each compressed code by the number of characters of each piece of text data in the target text data block, to obtain the length of each piece of compressed data in the compression data block;
A fourth determining subunit, configured to determine the multiple pieces of compressed data in sequence from the compression data block according to the length of each piece of compressed data in the compression data block.
15. The device according to claim 13, characterized in that the second determining unit further comprises:
A fifth determining subunit, configured to, when the code lengths of the multiple compressed codes included in the compression dictionary are unequal, determine, for each compression data block of the at least two compression data blocks, the multiple pieces of compressed data from the compression data block according to the data directory of the compression data block, wherein the data directory of the compression data block indicates the position, in the compression data block, of each of the multiple pieces of compressed data.
16. The device according to claim 15, characterized in that the second determining unit further comprises:
A sixth determining subunit, configured to, for the target text data block corresponding to the compression data block, determine, while the target text data block is being compressed, the position in the compression data block of the compressed data corresponding to each piece of text data in the target text data block;
A generating subunit, configured to generate the data directory of the compression data block based on the position, in the compression data block, of the compressed data corresponding to each piece of text data.
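The device claims above describe three ways of locating individual pieces of compressed data inside a compression data block, followed by processing on the compressed data and decompression of only the result. The Python sketch below is one possible illustration of those rules and is not taken from the patent. All function names are hypothetical, compressed codes are modeled as strings of "0" and "1" for readability, and the "items common to every block" operation is an assumed example of a same processing operation.

```python
from itertools import accumulate
from typing import Dict, List


def build_dictionary(blocks: List[List[str]]) -> Dict[str, str]:
    """Give every distinct text item across all blocks a fixed-length code (assumed scheme)."""
    items = sorted({item for block in blocks for item in block})
    width = max(1, (len(items) - 1).bit_length())
    return {item: format(i, f"0{width}b") for i, item in enumerate(items)}


def compress_block(block: List[str], dictionary: Dict[str, str]) -> str:
    """Concatenate one code per text item, so compressed items map one-to-one to text items."""
    return "".join(dictionary[item] for item in block)


def split_fixed(compressed: str, code_len: int) -> List[str]:
    """Equal code lengths, item-level dictionary: each compressed item is code_len long."""
    return [compressed[i:i + code_len] for i in range(0, len(compressed), code_len)]


def split_char_level(compressed: str, char_counts: List[int], code_len: int) -> List[str]:
    """Equal code lengths, character-level dictionary: item length = code_len * character count."""
    bounds = list(accumulate((code_len * n for n in char_counts), initial=0))
    return [compressed[s:e] for s, e in zip(bounds, bounds[1:])]


def build_index(block: List[str], dictionary: Dict[str, str]) -> List[int]:
    """Unequal code lengths: record the start offset of each compressed item (the data directory)."""
    lengths = [len(dictionary[item]) for item in block]
    return list(accumulate(lengths, initial=0))[:-1]


def split_indexed(compressed: str, index: List[int]) -> List[str]:
    """Cut the block at the offsets stored in the data directory."""
    ends = index[1:] + [len(compressed)]
    return [compressed[s:e] for s, e in zip(index, ends)]


def common_items(compressed_blocks: List[str], code_len: int,
                 dictionary: Dict[str, str]) -> List[str]:
    """Assumed example operation: find items present in every block by comparing codes only,
    then decompress just the result."""
    code_sets = [set(split_fixed(b, code_len)) for b in compressed_blocks]
    shared = set.intersection(*code_sets)
    reverse = {code: item for item, code in dictionary.items()}
    return sorted(reverse[code] for code in shared)


blocks = [["apple", "pear", "apple"], ["pear", "plum"]]
dictionary = build_dictionary(blocks)
compressed = [compress_block(b, dictionary) for b in blocks]
code_len = len(next(iter(dictionary.values())))
print(common_items(compressed, code_len, dictionary))  # ['pear']
```

In this sketch only the final result is mapped back through the dictionary, so the blocks themselves are never decompressed. A real implementation would presumably pack codes into bytes rather than character strings.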
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610590825.2A CN107643906B (en) | 2016-07-22 | 2016-07-22 | Data processing method and device |
PCT/CN2017/092527 WO2018014761A1 (en) | 2016-07-22 | 2017-07-11 | Data processing method and apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610590825.2A CN107643906B (en) | 2016-07-22 | 2016-07-22 | Data processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107643906A (en) | 2018-01-30 |
CN107643906B (en) | 2021-01-05 |
Family
ID=60992963
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610590825.2A Active CN107643906B (en) | 2016-07-22 | 2016-07-22 | Data processing method and device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN107643906B (en) |
WO (1) | WO2018014761A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112765111A (en) * | 2019-10-21 | 2021-05-07 | 伊姆西Ip控股有限责任公司 | Method, apparatus and computer program product for processing data |
CN113495687A (en) * | 2020-03-19 | 2021-10-12 | 辉达公司 | Techniques for efficiently organizing and accessing compressible data |
CN114979794A (en) * | 2022-05-13 | 2022-08-30 | 深圳智慧林网络科技有限公司 | Data sending method and device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103326732A (en) * | 2013-05-10 | 2013-09-25 | 华为技术有限公司 | Method for packing data, method for unpacking data, coder and decoder |
CN104283777A (en) * | 2013-07-03 | 2015-01-14 | 华为技术有限公司 | Message compression method and device |
US20160197621A1 (en) * | 2015-01-04 | 2016-07-07 | Emc Corporation | Text compression and decompression |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7769729B2 (en) * | 2007-05-21 | 2010-08-03 | Sap Ag | Block compression of tables with repeated values |
CN101320372B (en) * | 2008-05-22 | 2012-07-04 | 上海爱数软件有限公司 | Compression method for repeated data |
EP2460091A4 (en) * | 2009-07-31 | 2013-07-03 | Hewlett Packard Development Co | Compression of xml data |
CN104023070B (en) * | 2014-06-16 | 2017-02-15 | 杜海洋 | file compression method based on cloud storage |
Non-Patent Citations (1)
Title |
---|
胡荣 et al.: "Parallel design model for file compression and decompression based on OpenMP", 《中南大学学报》 (Journal of Central South University) * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112765111A (en) * | 2019-10-21 | 2021-05-07 | 伊姆西Ip控股有限责任公司 | Method, apparatus and computer program product for processing data |
CN113495687A (en) * | 2020-03-19 | 2021-10-12 | 辉达公司 | Techniques for efficiently organizing and accessing compressible data |
CN114979794A (en) * | 2022-05-13 | 2022-08-30 | 深圳智慧林网络科技有限公司 | Data sending method and device |
CN114979794B (en) * | 2022-05-13 | 2023-11-14 | 深圳智慧林网络科技有限公司 | Data transmission method and device |
Also Published As
Publication number | Publication date |
---|---|
WO2018014761A1 (en) | 2018-01-25 |
CN107643906B (en) | 2021-01-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102535450B1 (en) | Data storage method and apparatus, and computer device and storage medium thereof | |
CN105988996B (en) | Index file generation method and device | |
CN111291103A (en) | Interface data analysis method and device, electronic equipment and storage medium | |
CN109359237A (en) | Method and apparatus for searching for hosted programs | |
CN107643906A (en) | Data processing method and device | |
CA2936485C (en) | Optimized data condenser and method | |
US20090028266A1 (en) | Compact encoding of arbitrary length binary objects | |
CN107729523A (en) | Data service method, electronic device and storage medium | |
CN107622040A (en) | Control method and system for laser engraving data | |
CN116842012A (en) | Method, device, equipment and storage medium for storing Redis cluster in fragments | |
CN107580015A (en) | Data processing method and device, server | |
CN116738954A (en) | Report export method, report template configuration device and computer equipment | |
CN104077282B (en) | Method and apparatus for processing data | |
CN105930104A (en) | Data storing method and device | |
CN110221778A (en) | Hotel data processing method, system, storage medium and electronic device | |
CN113393288B (en) | Order processing information generation method, device, equipment and computer readable medium | |
CN111639260B (en) | Content recommendation method, content recommendation device and storage medium | |
TWI719537B (en) | Text comparison method, system and computer program product | |
CN105653534B (en) | Data processing method and device | |
CN112328960B (en) | Optimization method and device for data operation, electronic equipment and storage medium | |
CN109918374A (en) | Method and terminal device for mass data storage | |
CN113343639B (en) | Product identification code diagram generation and information query method based on product identification code diagram | |
CN117235236B (en) | Dialogue method, dialogue device, computer equipment and storage medium | |
CN113032003B (en) | Development file export method, development file export device, electronic equipment and computer storage medium | |
CN115801228B (en) | Interactive information encryption method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20220211. Address after: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province. Patentee after: Huawei Cloud Computing Technology Co.,Ltd. Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen. Patentee before: HUAWEI TECHNOLOGIES Co.,Ltd. |