CN102646179A - PDF (Portable Document Format) document information embedding and extraction method based on PDF documents

Info

Publication number: CN102646179A
Application number: CN2012100457361A
Authority: CN
Inventors: 刘红梅; 李雷
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2012-02-27
Filing date: 2012-02-27
Publication date: 2012-08-22

Abstract

The invention belongs to the field of multimedia signal processing and relates in particular to a method, based on the PDF file body, for embedding information in a PDF (Portable Document Format) file and extracting it again. The method uses the new file body that is added during an incremental (append-style) update of a PDF file as the carrier for hidden information. The hidden information can be written invisibly when the file is first created, has no effect on how the file is displayed, travels with the document content when the file is transmitted over the internet, offers a large embedding capacity, and is not destroyed by transmission or by common document editing operations. It is also well concealed, so an attacker cannot easily find or destroy it. As a PDF authentication method, the invention can invisibly embed authentication information such as the author, source and copyright of a file into the PDF file, which makes it practical for copyright authentication, distinguishing genuine files from forgeries, and similar uses.

Description

A PDF file information embedding and extraction method based on the PDF file body

Technical field

The invention belongs to the field of multimedia signal processing and relates in particular to a PDF file information embedding and extraction method based on the PDF file body.

Background technology

In recent years, with the rapid development of network technology, people increasingly transmit and obtain information over the internet. At the same time, new ways of working such as e-commerce and e-government are being widely adopted, and more and more administrative and business documents, such as powers of attorney, registration forms, contracts and invoices, now circulate and are transmitted as electronic documents. In the open environment of the internet, however, malicious acts such as copying and tampering constantly threaten the copyright ownership of electronic documents, and problems such as copyright theft, illegal distribution and information forgery arise continually. Against this background, data hiding for electronic documents is increasingly becoming a principal means of authenticating copyright, verifying authenticity and resolving disputes.

PDF (Portable Document Format) is an electronic document format developed by Adobe. The format is common to operating systems such as Windows, Unix and Mac and is independent of the operating system platform. A PDF file can encapsulate text, fonts, layout, color, and device- and resolution-independent graphics and images in a single file. It can also contain electronic information such as hypertext links, sound and moving images, supports very long documents, and offers a high degree of integration as well as high security and reliability. In addition, PDF files use industry-standard compression algorithms and are easy to transmit and store. These properties make PDF an ideal format for distributing electronic documents and disseminating digital information on the internet. Research on information hiding techniques for PDF documents is therefore of great practical significance in the current application environment. The structure of a PDF file in the prior art is briefly analyzed below to aid understanding of the invention.

Figure 1 shows the file structure of an original PDF file, which comprises four parts: the file header (Header), the file body (Body), the cross-reference table and the file trailer (Trailer). The header identifies the version of the PDF file. The body consists of a series of indirect objects and essentially holds the content of the PDF file. The cross-reference table holds the address information of the indirect objects and initially consists of a single section. The trailer records information such as the root object of the PDF file and the start address of the cross-reference table.
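The four-part layout can be illustrated with a minimal Python sketch. It is not part of the patent: the helper names `build_minimal_pdf` and `describe_pdf` are hypothetical, the two-object file is a toy, and the sketch assumes a classic plain-text cross-reference table rather than a cross-reference stream.

```python
import re


def build_minimal_pdf() -> bytes:
    """Assemble a toy PDF: header, body (two objects), xref table, trailer."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [] /Count 0 >>",
    ]
    out = bytearray(b"%PDF-1.4\n")                 # header: version information
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))                   # byte offset of this object
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)
    xref_pos = len(out)                            # where the xref table starts
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"                # object 0 is always free
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos  # %%%% renders as %%
    return bytes(out)


def describe_pdf(data: bytes) -> dict:
    """Locate the header version, the xref table and the trailer."""
    header = data[: data.index(b"\n")].decode("ascii")
    # The last startxref in the file is the one a reader uses.
    xref_offset = int(re.findall(rb"startxref\s+(\d+)", data)[-1])
    return {
        "version": header[len("%PDF-"):],
        "xref_offset": xref_offset,
        "trailer_offset": data.rindex(b"trailer"),
    }
```

Calling `describe_pdf(build_minimal_pdf())` returns the version string together with the byte offsets at which the `xref` and `trailer` keywords begin.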

Figure 2 shows the structure of a PDF file after an incremental (append-style) update. In one incremental update, every new or modified object is appended after the trailer of the original PDF file, forming a new file body; the new cross-reference table and new trailer corresponding to that new file body are likewise appended at the end.

Figure 3 shows an example of a PDF cross-reference table. Each cross-reference table contains the entries for a range of consecutively numbered objects. Each cross-reference table begins with a line containing the keyword xref. The next line contains two numbers separated by a space: the first is the object number of the first object in this file body, and the second is the number of objects in this file body. There then follows one entry per line for each corresponding object of the PDF file, with the following structure:

nnnnnnnnnn ggggg x y

Here nnnnnnnnnn is a 10-digit byte offset giving the number of bytes from the start of the PDF file to the start of the object; an offset with fewer than 10 digits is padded with leading zeros. ggggg is a 5-digit generation number. Except for object 0, the initial generation number of every object in the cross-reference table is 0; each time an entry is reused it is given a new generation number, up to a maximum of 65535. x is the object state keyword. There are three keywords, n, f and eol: n indicates an entry that is in use, f indicates an entry that has been freed, and eol is the end-of-line marker. The example in Figure 3 gives this information for six objects, numbered 0 to 5.
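The entry layout can be sketched in Python as follows. The function names are hypothetical. The sketch also assumes that the end-of-line marker is a space followed by a line feed, which is one of the two-byte forms the PDF specification permits and which makes every entry exactly 20 bytes long.

```python
def format_xref_entry(offset: int, generation: int, in_use: bool) -> bytes:
    """Build one cross-reference entry: offset, generation, state, eol."""
    if not 0 <= offset <= 9_999_999_999:
        raise ValueError("offset must fit in 10 decimal digits")
    if not 0 <= generation <= 65535:
        raise ValueError("generation number must be between 0 and 65535")
    state = b"n" if in_use else b"f"
    # Zero-padded 10-digit offset, 5-digit generation, keyword, 2-byte eol.
    return b"%010d %05d %s \n" % (offset, generation, state)


def parse_xref_entry(entry: bytes) -> tuple:
    """Split an entry back into (offset, generation, in_use)."""
    offset, generation, state = entry.split()
    return int(offset), int(generation), state == b"n"
```

For example, `format_xref_entry(17, 0, True)` yields `b"0000000017 00000 n \n"`, and the free entry for object 0 is `format_xref_entry(0, 65535, False)`.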

Summary of the invention

The technical problem solved by the invention is to overcome the deficiencies of the prior art by providing a PDF file information embedding and extraction method based on the PDF file body, in which information is embedded in a newly created file body of the PDF file and can later be extracted from the PDF file in order to authenticate it. Embedding information in a PDF file with the invention effectively addresses PDF copyright authentication and the distinguishing of genuine files from forgeries, and the invention is robust to editing of the PDF document.

To solve the above technical problem, the technical solution of the invention is as follows:

A PDF file information embedding and extraction method based on the PDF file body, comprising the following steps:

(1) Embedding the hidden information, specifically:

Read in the original PDF file stream;

Read in the hidden information and divide it into segments, scramble each hidden-information segment, and record the scrambling parameter;

Search the original PDF file stream to determine its largest object number;

Add 1 to the largest object number to obtain the first new object number of the new file body to be inserted, encode each hidden-information segment and write it in turn into the original PDF file as a new object of the new file body, and generate a new-object position flag;

After the hidden information has been embedded, write the new cross-reference table and new trailer corresponding to the new file body, completing one incremental update;

Output the PDF file carrying the hidden information, and output the scrambling parameter and the new-object position flag as the key;

(2) Extracting the hidden information, specifically:

Read the stream of the PDF file carrying the hidden information, together with the key;

According to the new-object position flag in the key, search the data stream of the PDF file to locate the new objects written by the incremental update;

Extract the data streams in the located new objects and decode them;

According to the scrambling parameter in the key, apply inverse scrambling to the decoded new-object data streams;

Combine the inverse-scrambled data streams in order and output them to obtain the hidden information.

In the above scheme, the step of scrambling each hidden-information segment and recording the scrambling parameter specifically comprises scrambling each hidden-information segment with a chaotic map and recording the map parameters as the scrambling parameter.
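The patent does not name a particular chaotic map or say how the map drives the scrambling. The following Python sketch is therefore only one possible reading: it assumes the logistic map x -> mu * x * (1 - x), uses the ranking of the generated sequence as a permutation of byte positions, and treats the initial value `x0` and the parameter `mu` as the recorded scrambling parameters. All function names are hypothetical.

```python
def logistic_permutation(n: int, x0: float, mu: float = 3.99,
                         burn_in: int = 100) -> list:
    """Derive a permutation of range(n) from a logistic-map sequence."""
    if not 0.0 < x0 < 1.0:
        raise ValueError("x0 must lie strictly between 0 and 1")
    x = x0
    for _ in range(burn_in):          # discard the transient iterations
        x = mu * x * (1.0 - x)
    sequence = []
    for _ in range(n):
        x = mu * x * (1.0 - x)
        sequence.append(x)
    # Sorting the indices by their chaotic value gives the permutation.
    return sorted(range(n), key=sequence.__getitem__)


def scramble(segment: bytes, x0: float, mu: float = 3.99) -> bytes:
    """Reorder the bytes of one segment according to the permutation."""
    perm = logistic_permutation(len(segment), x0, mu)
    return bytes(segment[i] for i in perm)


def unscramble(scrambled: bytes, x0: float, mu: float = 3.99) -> bytes:
    """Invert scramble() given the same parameters."""
    perm = logistic_permutation(len(scrambled), x0, mu)
    restored = bytearray(len(scrambled))
    for position, source in enumerate(perm):
        restored[source] = scrambled[position]
    return bytes(restored)
```

Because the permutation is regenerated from `x0` and `mu` alone, only those two values need to be kept in the key; a receiver without them cannot restore the original byte order directly.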

In the above scheme, the new-object position flag is the set of new object numbers corresponding to all the hidden-information segments.

In the above scheme, when the hidden information is read in and divided into hidden-information segments, the number of hidden-information segments is also recorded;

the new-object position flag is then the first new object number in the inserted new file body together with the number of hidden-information segments.
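The two alternative forms of the position flag carry the same information when the new objects are numbered consecutively, which is what adding 1 to the largest object number and writing the segments in turn implies. A small Python sketch, in which the class name `ExtractionKey` and its field names are hypothetical, shows how either form resolves to the same list of object numbers:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExtractionKey:
    """Key = scrambling parameter + new-object position flag."""
    scramble_param: float
    object_numbers: Optional[List[int]] = None  # form 1: every object number
    first_object: Optional[int] = None          # form 2: first number ...
    segment_count: Optional[int] = None         # ... plus segment count

    def resolve(self) -> List[int]:
        """Return the object numbers that hold the hidden segments."""
        if self.object_numbers is not None:
            return list(self.object_numbers)
        if self.first_object is None or self.segment_count is None:
            raise ValueError("position flag is incomplete")
        return list(range(self.first_object,
                          self.first_object + self.segment_count))
```

The second form keeps the key at a fixed size regardless of how many segments are embedded, whereas the first form would also cover objects that are not numbered consecutively.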

Compared with the prior art, the technical solution of the invention has the following beneficial effects:

The invention uses the new file body added during an incremental update of a PDF file as the carrier for the hidden information. The hidden information is written invisibly when the file is first created, has no effect on how the file is displayed, travels with the document content when it is transmitted over the internet, offers a sufficiently large embedding capacity, and is not destroyed by transmission or by common document editing operations. It is well concealed from an attacker and is not easily found or destroyed. As a PDF document authentication method, the invention can invisibly embed authentication information such as the author, source and copyright of a file into the PDF file, and is practical for copyright authentication, distinguishing genuine files from forgeries, and similar uses.

Description of drawings

Fig. 1 is a schematic diagram of the structure of an original PDF file;

Fig. 2 is a structural diagram of a PDF file after an incremental update;

Fig. 3 is an example of a cross-reference table of a PDF file;

Fig. 4 is a flowchart of embedding hidden information according to the invention;

Fig. 5 is a flowchart of extracting hidden information according to the invention;

Fig. 6 shows how the original PDF file is displayed in a specific embodiment of the invention;

Fig. 7 shows how the PDF file is displayed after hidden information has been embedded in a specific embodiment of the invention;

Fig. 8 shows the result of applying various annotation and markup operations to the PDF file with embedded hidden information in a specific embodiment of the invention;

Fig. 9 shows how a form-type PDF file with embedded hidden information is displayed in a specific embodiment of the invention;

Fig. 10 shows how the form-type PDF file with embedded hidden information is displayed after editing in a specific embodiment of the invention.

Embodiment

The technical solution of the invention is further described below with reference to the drawings and an embodiment.

Figs. 4 and 5 are flowcharts of the PDF file information embedding and extraction method based on the PDF file body according to the invention. The specific steps of the method are as follows:

(S1) As shown in Fig. 4, hidden information is embedded in the original PDF file, with the following specific steps:

(S11) Read in the original PDF file stream;

(S12) Divide the hidden information that has been read in into segments of fixed length, then scramble each hidden-information segment using a chaotic map, record the map parameters as the scrambling parameter, and record the number of hidden-information segments;

(S13) Search the original PDF file stream to determine its largest object number, and from it determine the object numbers of the objects to be added in the incremental update;

(S14) Add 1 to the largest object number to obtain the first new object number of the new file body to be inserted, encode each hidden-information segment and write it in turn into the original PDF file as a new object of the new file body, and generate the new-object position flag; the new-object position flag is either the new object numbers corresponding to all the hidden-information segments, or the first new object number in the inserted new file body together with the number of hidden-information segments;

(S15) After the hidden information has been embedded, write the new cross-reference table and new trailer corresponding to the new file body, completing one incremental update; at this point the PDF file carrying the hidden information is complete, with the hidden information embedded in the new file body;

(S16) Output the PDF file carrying the hidden information, and output the scrambling parameter and the new-object position flag as the key.
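Steps S11 to S16 can be sketched in Python as below. This is an illustrative sketch under several assumptions that the patent does not state: the chaotic map is taken to be the logistic map, each segment is stored as a FlateDecode-compressed stream object, the input file uses a classic plain-text cross-reference table, and object numbers are found with a simple regular expression that could be misled by binary stream data in a real file. The `/Prev` and `/Root` trailer entries follow the general PDF rules for incremental updates. The names `embed` and `logistic_perm` and the key layout are hypothetical.

```python
import re
import zlib


def logistic_perm(n: int, x0: float, mu: float = 3.99) -> list:
    """Permutation of range(n) ranked by a logistic-map sequence."""
    x, sequence = x0, []
    for _ in range(n):
        x = mu * x * (1.0 - x)
        sequence.append(x)
    return sorted(range(n), key=sequence.__getitem__)


def embed(pdf: bytes, secret: bytes, x0: float, seg_len: int = 64):
    """Append the secret as new stream objects in one incremental update."""
    if not secret:
        raise ValueError("nothing to embed")
    # S12: fixed-length segments, each scrambled with the chaotic map.
    segments = [secret[i:i + seg_len] for i in range(0, len(secret), seg_len)]
    scrambled = [bytes(s[i] for i in logistic_perm(len(s), x0))
                 for s in segments]
    # S13: largest object number in the original stream.
    max_obj = max(int(m) for m in re.findall(rb"(\d+)\s+\d+\s+obj\b", pdf))
    root = re.findall(rb"/Root\s+(\d+\s+\d+\s+R)", pdf)[-1]
    prev_xref = int(re.findall(rb"startxref\s+(\d+)", pdf)[-1])
    # S14: write each encoded segment as a new object after the old trailer.
    out = bytearray(pdf)
    if not out.endswith(b"\n"):
        out += b"\n"
    first_new = max_obj + 1
    offsets = []
    for k, segment in enumerate(scrambled):
        payload = zlib.compress(segment)
        offsets.append(len(out))
        out += (b"%d 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n"
                % (first_new + k, len(payload)))
        out += payload + b"\nendstream\nendobj\n"
    # S15: new cross-reference section and trailer for the new file body.
    xref_pos = len(out)
    out += b"xref\n%d %d\n" % (first_new, len(offsets))
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += (b"trailer\n<< /Size %d /Root %s /Prev %d >>\n"
            % (first_new + len(offsets), root, prev_xref))
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    # S16: the key is the scrambling parameter plus the position flag.
    key = {"x0": x0, "first_object": first_new,
           "segment_count": len(segments)}
    return bytes(out), key
```

The original bytes are left untouched at the front of the output, and nothing in the new objects is referenced from the page tree, which is why the added objects do not change how the file is displayed.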

(S2) As shown in Fig. 5, the hidden information is extracted from the PDF file that carries it, specifically:

(S21) Read the stream of the PDF file carrying the hidden information, together with the key;

(S22) According to the new-object position flag in the key, search the data stream of the PDF file to locate the new objects written by the incremental update;

(S23) Extract the data streams in the located new objects and decode them. In the prior art, if the content of a PDF object is encoded and compressed, the object contains a filter entry identifying the encoding used; most PDF objects use the FlateDecode compression filter. The invention likewise encodes by default with a prior-art algorithm, and when the hidden information is extracted the data can be decoded with the corresponding prior-art decoding algorithm. For hidden information that was encrypted, extraction and recovery require the corresponding key, namely the scrambling parameter; this key is recorded when the hidden information is embedded so that it can be used to extract it;

(S24) According to the scrambling parameter in the key, apply inverse scrambling to the decoded new-object data streams;

(S25) Combine the inverse-scrambled data streams in order and output them to obtain the hidden information.
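Steps S21 to S25 can be sketched in Python in the same spirit. The sketch assumes that each segment was stored as a FlateDecode stream object with a direct `/Length` value, that the scrambling was a logistic-map permutation of byte positions, and that the key has the hypothetical layout `{"x0", "first_object", "segment_count"}`. None of these details is fixed by the patent, and the regular-expression lookup is a simplification that a real PDF parser would replace.

```python
import re
import zlib


def logistic_perm(n: int, x0: float, mu: float = 3.99) -> list:
    """Permutation of range(n) ranked by a logistic-map sequence."""
    x, sequence = x0, []
    for _ in range(n):
        x = mu * x * (1.0 - x)
        sequence.append(x)
    return sorted(range(n), key=sequence.__getitem__)


def extract(pdf: bytes, key: dict) -> bytes:
    """Recover the hidden information from the appended stream objects."""
    first, count = key["first_object"], key["segment_count"]
    recovered = []
    for number in range(first, first + count):
        # S22: locate the object named by the position flag.
        pattern = (rb"(?<!\d)%d 0 obj\b.*?/Length (\d+).*?stream\r?\n"
                   % number)
        match = re.search(pattern, pdf, re.S)
        if match is None:
            raise ValueError("object %d not found" % number)
        # S23: cut out the stream data and undo the FlateDecode encoding.
        length = int(match.group(1))
        segment = zlib.decompress(pdf[match.end():match.end() + length])
        # S24: inverse scrambling with the parameter from the key.
        perm = logistic_perm(len(segment), key["x0"])
        plain = bytearray(len(segment))
        for position, source in enumerate(perm):
            plain[source] = segment[position]
        recovered.append(bytes(plain))
    # S25: concatenate the segments in object-number order.
    return b"".join(recovered)
```

Because the lookup goes by object number rather than by file position, later incremental updates that append further objects after the hidden ones would not disturb this sketch, which is consistent with the robustness to editing that the patent reports.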

Fig. 6 shows how the original PDF file is displayed, and Fig. 7 shows how the PDF file obtained with the invention is displayed after hidden information has been embedded. The figures show that embedding the hidden information has no effect on how the PDF file is displayed, so the invention conceals the hidden information well visually.

Fig. 8 shows the result of applying various annotation and markup operations to the PDF file with embedded hidden information; the annotations and markup were made with Adobe Acrobat 9 Professional. In the experiment, the hidden information was extracted from the edited PDF file using the invention with an extraction accuracy of 100%, so the invention is robust to common editing operations.

Figs. 9 and 10 show a form-type PDF file with embedded hidden information before and after editing. Fig. 9 shows the form-type PDF file, which carries embedded hidden information yet can still be edited in any way, and Fig. 10 shows the document obtained after the content of the file in Fig. 9 was edited and saved. In the experiment, the hidden information was extracted from the edited form-type file using the invention with an extraction accuracy of 100%. The invention is therefore equally robust to such editing, and is well suited to copyright authentication, distinguishing genuine PDF files from forgeries, and similar uses.

Claims

1. A PDF file information embedding and extraction method based on the PDF file body, characterized by comprising the following steps:

Embedding the hidden information, specifically:

Read in the original PDF file stream;

Extracting the hidden information, specifically:

Read the stream of the PDF file carrying the hidden information, together with the key;

Extract the data streams in the located new objects and decode them;

2. The PDF file information embedding and extraction method based on the PDF file body according to claim 1, characterized in that the step of scrambling each hidden-information segment and recording the scrambling parameter specifically comprises scrambling each hidden-information segment with a chaotic map and recording the map parameters as the scrambling parameter.

3. The PDF file information embedding and extraction method based on the PDF file body according to claim 1 or 2, characterized in that the new-object position flag is the new object numbers corresponding to all the hidden-information segments.

4. The PDF file information embedding and extraction method based on the PDF file body according to claim 1 or 2, characterized in that, when the hidden information is read in and divided into hidden-information segments, the number of hidden-information segments is also recorded;