CN109062987A - File processing method and device

Info

Publication number: CN109062987A
Application number: CN201810714009.7A
Authority: CN
Inventors: 冉世友; 陈正; 殷舒; 刘胜
Original assignee: Union Mobile Pay Co Ltd
Current assignee: Union Mobile Pay Co Ltd
Priority date: 2018-06-29
Filing date: 2018-06-29
Publication date: 2018-12-21

Abstract

Embodiments of the present invention relate to the technical field of data processing, and in particular to a file processing method and device that reduce storage space usage and save resources. An embodiment of the present invention includes: for first node data in a target file, if it is determined that the content of the first node data differs from the content of every node data stored in a database, storing the first node data in the database and determining a first position of the first node data in the target file, where the first node data is any node data in the target file; forming a mapping relation between the first position and the content of the first node data; and adding the mapping relation to an index file of the database.

Description

File processing method and device

Technical field

The present invention relates to the technical field of data processing, and in particular to a file processing method and device.

Background

With the continuous development of information technology, sending, receiving, and storing files have become important parts of information processing. In general, when files are stored and compressed, multiple files are often stored or transmitted together. Before a file is sent, the original file can be compressed to obtain a compressed package smaller than the original, and the compressed package is transmitted. After the compressed package is received, the original file is recovered by decompressing it, which reduces resource consumption during file transmission.

When a large number of similar files must be handled, such as electronic contracts, the series of files is usually stored or compressed directly. This occupies a large amount of space and wastes resources.

Summary of the invention

The present application provides a file processing method and device to reduce space usage and save resources.

A file processing method provided by an embodiment of the present invention comprises:

For first node data in a target file, if it is determined that the content of the first node data differs from the content of every node data stored in a database, storing the first node data in the database, and determining a first position of the first node data in the target file; the first node data is any node data in the target file;

Forming a mapping relation between the first position and the content of the first node data;

Adding the mapping relation to an index file of the database.

Optionally, the method further comprises:

If it is determined that the content of second node data stored in the database is the same as the content of the first node data, forming a mapping relation between the first position and the content of the second node data;

Adding the mapping relation to the index file of the database.

Optionally, the mapping relations of the index file further include the hash value corresponding to the content of the node data;

Storing the first node data in the database if it is determined that the content of the first node data differs from the content of every node data stored in the database comprises:

Determining the hash value of the first node data according to the content of the first node data;

Determining whether a hash value identical to the hash value of the first node data exists in the database;

If not, storing the first node data in the database, and adding the hash value of the first node data to the index file;

Forming the mapping relation between the first position and the content of the first node data comprises:

Forming a mapping relation between the first position and the hash value of the first node data.

Optionally, the target file is any one of multiple files to be processed, and the multiple files to be processed are of the same file type;

The node data stored in the database is node data of any of the multiple files to be processed.

Optionally, storing the first node data in the database comprises:

Compressing the content of the first node data and then storing it in the database;

After adding the mapping relation to the index file of the database, the method further comprises:

Compressing the index file and storing it in the database.

An embodiment of the present invention further provides a file processing device, comprising:

A storage unit, configured to, for first node data in a target file, if it is determined that the content of the first node data differs from the content of every node data stored in a database, store the first node data in the database and determine a first position of the first node data in the target file; the first node data is any node data in the target file;

A mapping unit, configured to form a mapping relation between the first position and the content of the first node data;

An indexing unit, configured to add the mapping relation to an index file of the database.

Optionally, the mapping unit is further configured to:

If it is determined that the content of second node data stored in the database is the same as the content of the first node data, form a mapping relation between the first position and the content of the second node data.

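Optionally, the mapping relations of the index file further include the hash value corresponding to the content of the node data;

The storage unit is further configured to:

Determine the hash value of the first node data according to the content of the first node data; determine whether a hash value identical to the hash value of the first node data exists in the database; if not, store the first node data in the database and add the hash value of the first node data to the index file;

The mapping unit is further configured to:

Form a mapping relation between the first position and the hash value of the first node data.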

Optionally, the device further comprises a compression unit, configured to:

Compress the content of the first node data;

Compress the index file.

An embodiment of the present invention further provides an electronic device, comprising:

At least one processor; and

A memory communicatively connected to the at least one processor; wherein,

The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the above method.

An embodiment of the present invention further provides a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause a computer to execute the above method.

In the embodiment of the present invention, any node data in the target file is taken as the first node data and compared with all node data stored in the database. If it is determined that the content of the first node data differs from the content of every stored node data, the first node data is stored in the database, its first position in the target file is determined, a mapping relation between the first position and the content of the first node data is formed, and the mapping relation is added to the index file of the database. In this way, only node data not already in the database is stored, which avoids storing duplicate file content and saves storage space and transmission resources.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a file processing method provided by an embodiment of the present invention;

Fig. 2 is a schematic flowchart of the specific file processing method provided by Embodiment 1 of the present invention;

Fig. 3 is a schematic structural diagram of the PDF files provided by Embodiment 2 of the present invention;

Fig. 4 to Fig. 8 are tree diagrams of the node data of file 1 to file 5, respectively, provided by Embodiment 2 of the present invention;

Fig. 9 is a schematic structural diagram of a file processing device provided by an embodiment of the present invention;

Fig. 10 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.

Detailed description

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings. The described embodiments are obviously only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

An embodiment of the present invention provides a file processing method. As shown in Fig. 1, the file processing method provided by the embodiment of the present invention comprises the following steps:

Step 101, for first node data in a target file, if it is determined that the content of the first node data differs from the content of every node data stored in a database, store the first node data in the database, and determine a first position of the first node data in the target file; the first node data is any node data in the target file.

Step 102, form a mapping relation between the first position and the content of the first node data.

Step 103, add the mapping relation to the index file of the database.

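The above steps cover the case in which the content of the first node data differs from the content of every node data stored in the database. There is also the case in which the content of the first node data is the same as the content of some node data stored in the database. For that case, the embodiment of the present invention further comprises:

If it is determined that the content of second node data stored in the database is the same as the content of the first node data, forming a mapping relation between the first position and the content of the second node data;

Adding the mapping relation to the index file of the database.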

In the embodiment of the present invention, if it is determined that the content of the first node data is the same as the content of second node data stored in the database, the first node data does not need to be stored again; it is only necessary to form a mapping relation between the first position and the content of the second node data and add it to the index file of the database. When the first node data in the target file later needs to be retrieved, the second node data can be found in the database according to the mapping relation between the first position and the content of the second node data. Because the content of the first node data is the same as that of the second node data, the first node data is thereby recovered.

As can be seen, in the embodiment of the present invention the target file is divided into multiple node data, and each node data is compared with the node data stored in the database. If the database does not yet store the content of the node data, the node data is stored in the database. If the database already holds the content, the node data is not stored again; only a mapping relation between the content of the node data and its position in the target file is formed and added to the index file. The mapping relations of all node data of a target file form the index of that target file. When the target file needs to be retrieved, the contents of all its node data are looked up in the database according to the mapping relations in the index file and combined to rebuild the target file.
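The store-or-reference logic and the rebuild step described above can be sketched in Python as follows. This is a minimal illustration, not the implementation of the embodiment: the names `store_file` and `restore_file`, the list used as the database, the dictionary used as the index file, and the integer positions are all assumptions made for the example.

```python
def store_file(nodes, database, index):
    """Process one target file given as a list of (position, content) pairs."""
    for position, content in nodes:
        if content not in database:
            # Content differs from every stored node data: store it.
            database.append(content)
        # In both cases, map the position to the stored content
        # (represented here by its slot in the database list).
        index[position] = database.index(content)


def restore_file(index, database):
    """Rebuild the node contents of the target file in position order."""
    return [database[index[position]] for position in sorted(index)]


# Hypothetical file whose first and last node data have the same content.
database, index = [], {}
store_file([(0, b"header"), (10, b"clause A"), (20, b"header")], database, index)
restored = restore_file(index, database)
```

In this sketch the duplicated content is stored once, and both of its positions point at the same database slot, so the file can still be rebuilt in full.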

To make storage and lookup easier, the embodiment of the present invention calculates a hash value from the content of the node data; the mapping relations of the index file then further include the hash value corresponding to the content of the node data.

In step 101 above, storing the first node data in the database if it is determined that the content of the first node data differs from the content of every node data stored in the database comprises:

Determining the hash value of the first node data according to the content of the first node data;

Determining whether a hash value identical to the hash value of the first node data exists in the database;

If not, storing the first node data in the database, and adding the hash value of the first node data to the index file.

In step 102, forming the mapping relation between the first position and the content of the first node data comprises:

Forming a mapping relation between the first position and the hash value of the first node data.

A hash function transforms an input of arbitrary length (also called the pre-image) into an output of fixed length through a hashing algorithm; that output is the hash value. A hash function has the following basic property: if two hash values computed with the same function are different, then the original inputs are also different. Equivalently, node data with identical content always produce identical hash values. Conversely, with a suitable hash function, two node data with different content produce different hash values except in the rare case of a collision. Therefore, whether the contents of two node data are the same can be determined by comparing their hash values. In the embodiment of the present invention, the hash value of a node data effectively serves as the identifier of its content: node data with different names but identical content still correspond to the same hash value. The hash value of the first node data is determined according to the content of the first node data. It is then determined whether a hash value identical to that of the first node data exists in the database. If it exists, the database already holds node data with the same content as the first node data, so the first node data is not stored again; only a mapping relation between the first position of the first node data in the target file and the hash value of the first node data is formed and saved in the index file. If it does not exist, the content of every node data in the database differs from the content of the first node data, so the first node data is stored in the database, and a mapping relation between the first position and the hash value of the first node data is formed and saved in the index file.
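A minimal sketch of this hash-based check, under stated assumptions: the embodiment does not name a hash algorithm, so SHA-256 from Python's standard `hashlib` is used here purely as an example, and the function names `node_hash` and `put_node` as well as the dictionary layout are hypothetical.

```python
import hashlib


def node_hash(content: bytes) -> str:
    """Hash of the node content; SHA-256 is an assumed choice."""
    return hashlib.sha256(content).hexdigest()


def put_node(database: dict, index: dict, position: int, content: bytes) -> bool:
    """Store content only if its hash is new; always record position -> hash.

    Returns True when the node data was newly stored.
    """
    h = node_hash(content)
    is_new = h not in database
    if is_new:
        database[h] = content      # key: hash value, value: node data
    index[position] = h            # mapping relation kept in the index
    return is_new


db, idx = {}, {}
first = put_node(db, idx, 0, b"party information")
second = put_node(db, idx, 5, b"party information")   # same content, new position
```

Looking up the hash as a dictionary key replaces a content-by-content comparison against everything already stored, which is the convenience the paragraph above describes.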

Preferably, the embodiment of the present invention is suited to processing multiple files: the target file is any one of multiple files to be processed, and the multiple files to be processed are of the same file type; the node data stored in the database is node data of any of the multiple files to be processed. Comparing node data across a series of files of the same type finds more identical node data, which reduces the node data stored in the database and avoids the extra workload caused by storing excess data.

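To further save database space, data can be compressed before node data is stored or transmitted. Storing the first node data in the database comprises:

Compressing the content of the first node data and then storing it in the database;

After the mapping relation is added to the index file of the database, the method further comprises:

Compressing the index file and storing it in the database.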

For a clearer understanding of the present invention, the above process is described in detail below with specific embodiments. The specific steps of Embodiment 1 are shown in Fig. 2 and comprise:

Step 201, select any one of the multiple files as the target file, and determine the multiple node data in the target file.

Step 202, for any node data in the target file, calculate the hash value of the node data.

Step 203, judge whether the hash value of the node data is in the database; if so, execute step 205; otherwise, execute step 204.

Step 204, compress the node data and store it in the database, where the key in the database is the hash value of the node data and the value is the compressed node data.

Step 205, establish a mapping relation between the position of the node data in the target file and its hash value, and add the mapping relation to the index file.

Step 206, judge whether hash values have been calculated for all node data in the target file; if so, execute step 207; otherwise, execute step 202.

Step 207, judge whether every one of the multiple files has been used as the target file; if so, execute step 208; otherwise, execute step 201.

Step 208, compress the index file.
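Steps 201 to 208 can be sketched as a single loop. The sketch below is illustrative only and rests on several assumptions not stated in the embodiment: SHA-256 as the hash, `zlib` as the compressor, JSON as the index file format, and the names `process_files` and `restore`.

```python
import hashlib
import json
import zlib


def process_files(files):
    """files: {file name: [(position, content bytes), ...]}."""
    database = {}   # key: hash value, value: compressed node data (step 204)
    index = {}      # file name -> list of [position, hash value] (step 205)
    for name, nodes in files.items():                    # steps 201 and 207
        entries = []
        for position, content in nodes:                  # steps 202 and 206
            h = hashlib.sha256(content).hexdigest()      # step 202
            if h not in database:                        # step 203
                database[h] = zlib.compress(content)     # step 204
            entries.append([position, h])                # step 205
        index[name] = entries
    # Step 208: compress the index file.
    compressed_index = zlib.compress(json.dumps(index).encode("utf-8"))
    return database, compressed_index


def restore(name, database, compressed_index):
    """Rebuild the node contents of one file, ordered by position."""
    index = json.loads(zlib.decompress(compressed_index).decode("utf-8"))
    return [zlib.decompress(database[h]) for _, h in sorted(index[name])]


files = {
    "file1": [(0, b"c1"), (1, b"c2"), (2, b"c1")],
    "file2": [(0, b"c1"), (1, b"c3")],
}
database, compressed_index = process_files(files)
```

Here five node data across two hypothetical files yield three database entries, and each file can be rebuilt from the compressed index and the database alone.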

Embodiment 2 is illustrated with online-loan electronic contracts: five PDF (Portable Document Format) files need to be compressed. The five PDF files are named file 1, file 2, file 3, file 4, and file 5, and are shown in Fig. 3. The names are only for convenience of illustration and do not imply any order.

First, the same rights-holder information appears at the beginning and end of each electronic contract, so each of the five files has identical data at its beginning and end. For example, file 1 has identical data 1 at its beginning and end, file 2 has identical data 3 at its beginning and end, and the other files are similar.

Second, file 1, file 2, file 3, and file 4 are generated from the same contract template. Node data with different numbers have different content, and node data with the same number have the same content.

Finally, file 5 is a contract newly generated after the contract template was changed; the changed template content is inserted as new data before data 4 and is labeled data 8.

The node data tree diagrams corresponding to the five PDF files are shown in Fig. 4 to Fig. 8. Note that Fig. 4 to Fig. 8 show only the node data that differ and their positions in the tree, where data 1 to data 8 are denoted c1 to c8; the other node data, not shown, are identical in every file. In Embodiment 2, the processing proceeds as follows:

For ease of description, processing starts from file 1; it could equally start from any other file.

First, the hash value of the Pages Root in file 1 is calculated, and it is judged whether that hash value is already in the database. Since it is not, the data of the Pages Root is compressed and put into the database, where the key is the hash value of the Pages Root and the value is the compressed Pages Root data content.

A mapping relation is generated between the position of the Pages Root in file 1 and the hash value of the Pages Root, and is added to the index file. Because the cross-reference table of a PDF maintains the position (offset address) of each node data in the whole file, the offset address of the Pages Root can be used directly as the position of the Pages Root in file 1.

For the child nodes under the Pages Root in file 1 and the child nodes under each Page, the operations of calculating the hash value of each node data, compressing and storing it, and establishing the mapping relation are repeated. Note that for Page3 of file 1, the data c1 it contains has the same content as the data c1 contained in Page1, so the hash value of the data c1 in Page3 already exists in the database. Therefore, it is only necessary to generate a mapping relation between the position in file 1 of the data c1 in Page3 and the hash value of data c1, and add it to the index file.

Files 1 to 5 are traversed and the above operations are performed on each, generating the index file shown in Table 1.

Table 1

In addition, the correspondence between hash values and compressed data in the database is shown in Table 2.

Table 2

Key	Value
Pages Root hash	Compressed Pages Root node data of file 1
Page hash	Compressed Page node data of file 1
c1 hash	Compressed c1 node data of file 1
c2 hash	Compressed c2 node data of file 1
c3 hash	Compressed c3 node data of file 2
c4 hash	Compressed c4 node data of file 2
c5 hash	Compressed c5 node data of file 3
c6 hash	Compressed c6 node data of file 5
c7 hash	Compressed c7 node data of file 4
c8 hash	Compressed c8 node data of file 5

For file 1 to file 5, assume that each difference-content node data (that is, each of data c1 to c8) is 10k in size, that files 1 to 4 are each 100k, and that file 5 is 110k. Then the node data that is fully duplicated across the 5 files is 70k, the difference-content nodes total 160k, and 80k of those difference-content nodes are duplicates. Assume the compression software used achieves a 30% compression ratio on text. Because the index file is very small by comparison, its size is ignored and only the compression of the original data in the files is considered. The compression ratio of this scheme is then (70+160-80)*30%/510 ≈ 8.8%.
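The arithmetic can be checked with a short Python calculation. The figures are the ones assumed in the paragraph above; the variable names are introduced here only for illustration. The denominator 510 is the total size of the five original files (4 × 100k + 110k).

```python
file_sizes_kb = [100, 100, 100, 100, 110]   # files 1 to 5
common_kb = 70                  # node data fully duplicated across the 5 files
difference_total_kb = 160       # all difference-content nodes, duplicates included
difference_duplicates_kb = 80   # duplicated difference-content nodes
software_ratio = 0.30           # assumed compression ratio of the software

# Only one copy of each distinct node data is kept, then compressed.
unique_kb = common_kb + difference_total_kb - difference_duplicates_kb
stored_kb = unique_kb * software_ratio
overall_ratio = stored_kb / sum(file_sizes_kb)

print(f"{unique_kb}k unique, ratio {overall_ratio:.1%}")
# prints "150k unique, ratio 8.8%"
```

For comparison, compressing the five files directly at the same 30% ratio would leave 510k × 30% = 153k, so under these assumptions the deduplication step accounts for the reduction from 30% to about 8.8%.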

An embodiment of the present invention further provides a file processing device, as shown in Fig. 9, comprising:

A storage unit 901, configured to, for first node data in a target file, if it is determined that the content of the first node data differs from the content of every node data stored in a database, store the first node data in the database and determine a first position of the first node data in the target file; the first node data is any node data in the target file;

A mapping unit 902, configured to form a mapping relation between the first position and the content of the first node data;

An indexing unit 903, configured to add the mapping relation to the index file of the database.

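Preferably, the mapping unit 902 is further configured to:

If it is determined that the content of second node data stored in the database is the same as the content of the first node data, form a mapping relation between the first position and the content of the second node data.

Preferably, the mapping relations of the index file further include the hash value corresponding to the content of the node data;

The storage unit 901 is further configured to:

Determine the hash value of the first node data according to the content of the first node data; determine whether a hash value identical to the hash value of the first node data exists in the database; if not, store the first node data in the database and add the hash value of the first node data to the index file;

The mapping unit 902 is further configured to:

Form a mapping relation between the first position and the hash value of the first node data.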

Preferably, the target file is any one of multiple files to be processed, and the multiple files to be processed are of the same file type.

Preferably, the device further comprises a compression unit 904, configured to:

Compress the content of the first node data;

Compress the index file.

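Based on the same principle, the present invention further provides an electronic device, as shown in Fig. 10, comprising:

A processor 1001, a memory 1002, a transceiver 1003, and a bus interface 1004, where the processor 1001, the memory 1002, and the transceiver 1003 are connected through the bus interface 1004;

The processor 1001 is configured to read the program in the memory 1002 and execute the following method:

For first node data in a target file, if it is determined that the content of the first node data differs from the content of every node data stored in a database, storing the first node data in the database, and determining a first position of the first node data in the target file; the first node data is any node data in the target file;

Forming a mapping relation between the first position and the content of the first node data;

Adding the mapping relation to the index file of the database.

Further, the processor 1001 is specifically configured to:

If it is determined that the content of second node data stored in the database is the same as the content of the first node data, form a mapping relation between the first position and the content of the second node data;

Add the mapping relation to the index file of the database.

Further, the processor 1001 is specifically configured to:

Compress the content of the first node data and then store it in the database;

Compress the index file and store it in the database.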

An embodiment of the present application provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium. The computer program comprises program instructions that, when executed by a computer, cause the computer to execute any of the methods described above.

The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction means that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is executed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.

Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.

Claims

1. A file processing method, characterized by comprising:

For first node data in a target file, if it is determined that the content of the first node data differs from the content of every node data stored in a database, storing the first node data in the database, and determining a first position of the first node data in the target file; the first node data is any node data in the target file;

Forming a mapping relation between the first position and the content of the first node data;

Adding the mapping relation to an index file of the database.

2. The method of claim 1, characterized by further comprising:

If it is determined that the content of second node data stored in the database is the same as the content of the first node data, forming a mapping relation between the first position and the content of the second node data;

Adding the mapping relation to the index file of the database.

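3. The method of claim 1, characterized in that the mapping relations of the index file further include the hash value corresponding to the content of the node data;

Storing the first node data in the database if it is determined that the content of the first node data differs from the content of every node data stored in the database comprises:

Determining the hash value of the first node data according to the content of the first node data;

Determining whether a hash value identical to the hash value of the first node data exists in the database;

If not, storing the first node data in the database, and adding the hash value of the first node data to the index file;

Forming the mapping relation between the first position and the content of the first node data comprises:

Forming a mapping relation between the first position and the hash value of the first node data.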

4. The method of any one of claims 1 to 3, characterized in that the target file is any one of multiple files to be processed, and the multiple files to be processed are of the same file type;

The node data stored in the database is node data of any of the multiple files to be processed.

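5. The method of claim 4, characterized in that storing the first node data in the database comprises:

Compressing the content of the first node data and then storing it in the database;

After adding the mapping relation to the index file of the database, the method further comprises:

Compressing the index file and storing it in the database.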

6. A file processing device, characterized by comprising:

A storage unit, configured to, for first node data in a target file, if it is determined that the content of the first node data differs from the content of every node data stored in a database, store the first node data in the database and determine a first position of the first node data in the target file; the first node data is any node data in the target file;

A mapping unit, configured to form a mapping relation between the first position and the content of the first node data;

An indexing unit, configured to add the mapping relation to an index file of the database.

7. The device of claim 6, characterized in that the mapping unit is further configured to:

If it is determined that the content of second node data stored in the database is the same as the content of the first node data, form a mapping relation between the first position and the content of the second node data.

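8. The device of claim 6, characterized in that the mapping relations of the index file further include the hash value corresponding to the content of the node data;

The storage unit is further configured to:

Determine the hash value of the first node data according to the content of the first node data; determine whether a hash value identical to the hash value of the first node data exists in the database; if not, store the first node data in the database and add the hash value of the first node data to the index file;

The mapping unit is further configured to:

Form a mapping relation between the first position and the hash value of the first node data.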

9. The device of any one of claims 6 to 8, characterized in that the target file is any one of multiple files to be processed, and the multiple files to be processed are of the same file type.

10. The device of claim 9, characterized by further comprising a compression unit, configured to:

Compress the content of the first node data;

Compress the index file.