CN102646179A - PDF (Portable Document Format) document information embedding and extraction method based on PDF documents - Google Patents

PDF (Portable Document Format) document information embedding and extraction method based on PDF documents

Info

Publication number
CN102646179A
Authority
CN
China
Prior art keywords
pdf document
info
pdf
document
new
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012100457361A
Other languages
Chinese (zh)
Inventor
刘红梅
李雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2012-02-27
Filing date
2012-02-27
Publication date
2012-08-22
Application filed by Sun Yat Sen University
Priority to CN2012100457361A
Publication of CN102646179A

Landscapes

  • Editing Of Facsimile Originals (AREA)

Abstract

The invention belongs to the field of multimedia signal processing and relates in particular to a method for embedding information in, and extracting it from, a PDF (Portable Document Format) file, based on the PDF file body. The method uses the new file body added by an incremental (append-style) update of the PDF file as the carrier for the hidden information. The hidden information can be written invisibly when the file is created, has no effect on how the file is displayed, travels over the internet together with the document content, offers a large embedding capacity, is not destroyed by transmission or by common document editing, and is concealed so that attackers cannot easily find or destroy it. As a PDF authentication method, it can invisibly embed authentication information such as the author, provenance and copyright of the file, which makes it practical for copyright authentication and for distinguishing genuine PDF files from forgeries.

Description

A PDF file information embedding and extraction method based on the PDF file body
Technical field
The invention belongs to the field of multimedia signal processing and relates in particular to a method for embedding information in, and extracting it from, a PDF file, based on the PDF file body.
Background art
In recent years, with the rapid development of network technology, people increasingly transmit and obtain information over the internet. At the same time, new ways of working such as e-commerce and e-government are being widely adopted, and a growing number of administrative and business documents, such as powers of attorney, registration forms, contracts and invoices, are circulated and transmitted as electronic documents. In the open environment of the internet, however, malicious acts such as copying and tampering constantly threaten the copyright ownership of electronic documents, and problems such as copyright theft, illegal distribution and information forgery keep emerging. Against this background, information hiding for electronic documents is increasingly becoming a principal means of copyright authentication, authenticity verification and dispute resolution.
PDF (Portable Document Format) is an electronic document format developed by Adobe. The format is common to operating systems such as Windows, Unix and Mac and is independent of the operating-system platform. A PDF file can encapsulate text, fonts, formatting, colors, and device- and resolution-independent graphics and images in a single file. It can also contain electronic information such as hypertext links, sound and moving images, supports very long documents, and offers a high degree of integration and high security and reliability. In addition, PDF files use industry-standard compression algorithms, which makes them easy to transmit and store. These properties make PDF an ideal document format for electronic document distribution and digital information dissemination on the internet. Research on information hiding techniques for PDF documents is therefore of great practical significance in the current application environment. The structure of a PDF file in the prior art is briefly analyzed below to aid understanding of the invention.
Fig. 1 shows the file structure of an original PDF file, which comprises four parts: the header, the body, the cross-reference table and the trailer. The header identifies the PDF version. The body consists of a series of indirect objects and holds essentially all of the content of the PDF file. The cross-reference table contains the address information of the indirect objects and initially has only one section. The trailer records information such as the root object of the PDF file and the start address of the cross-reference table.
Fig. 2 shows the structure of a PDF file after an incremental (append-style) update. In one incremental update, every new or modified object is appended after the trailer of the original PDF file, forming a new file body; the new cross-reference table and new trailer corresponding to the new file body are likewise appended at the end.
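Because each incremental update ends with its own trailer, the revisions of a file can be roughly located by looking for the end-of-file markers. The following Python sketch illustrates this; the function name `revision_ends` is hypothetical, and the approach is a heuristic that assumes the byte sequence `%%EOF` does not also occur inside binary stream data.

```python
def revision_ends(pdf: bytes) -> list[int]:
    """Return the byte offset just past each %%EOF marker, oldest revision first.

    Heuristic only: assumes %%EOF never appears inside stream data.
    """
    marker = b"%%EOF"
    ends = []
    pos = pdf.find(marker)
    while pos != -1:
        ends.append(pos + len(marker))
        pos = pdf.find(marker, pos + len(marker))
    return ends
```

A file that has been updated once would be expected to yield two offsets: the first marks the end of the original file, and everything after it belongs to the appended body, cross-reference table and trailer.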
Fig. 3 shows an example of a PDF cross-reference table. Each cross-reference table contains the entries for a range of consecutively numbered objects. Each table begins with a line containing the keyword xref. The following line contains two numbers separated by a space: the first is the object number of the first object in this file body, and the second is the number of objects in this file body. Next come the entries, one per line, for each object of the PDF file. The entry structure is:
nnnnnnnnnn ggggg x y
Here nnnnnnnnnn is a 10-digit offset giving the number of bytes from the start of the PDF file to the start of the object; an offset with fewer than 10 digits is padded with leading zeros. ggggg is a 5-digit generation number: apart from object 0, every object number has an initial generation number of 0 in the cross-reference table, and each time an entry is reused it is given a new generation number, up to a maximum of 65535. x is the object-state keyword, n for an entry in use or f for an entry that has been freed, and y is the end-of-line marker (eol) that terminates the entry. The example in Fig. 3 gives the information for six objects, numbered 0 to 5.
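To make the entry layout concrete, the following Python sketch parses one cross-reference entry. It assumes the classic fixed-width form in which every entry is exactly 20 bytes including the end-of-line marker; the names `XrefEntry` and `parse_xref_entry` are illustrative and do not come from the patent.

```python
from typing import NamedTuple


class XrefEntry(NamedTuple):
    offset: int       # byte offset of the object (for a free entry: next free object number)
    generation: int   # generation number, 0..65535
    in_use: bool      # True for keyword "n", False for keyword "f"


def parse_xref_entry(line: bytes) -> XrefEntry:
    """Parse one 20-byte entry of the form b'nnnnnnnnnn ggggg x' plus a 2-byte end of line."""
    if len(line) != 20:
        raise ValueError("a cross-reference entry must be exactly 20 bytes")
    offset, generation, keyword = line[0:10], line[11:16], line[17:18]
    if keyword not in (b"n", b"f"):
        raise ValueError("unknown object-state keyword")
    return XrefEntry(int(offset), int(generation), keyword == b"n")
```

For example, `parse_xref_entry(b"0000000017 00000 n \n")` describes an in-use object at byte offset 17 with generation number 0.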
Summary of the invention
The technical problem solved by the invention is to overcome the deficiencies of the prior art by providing a PDF file information embedding and extraction method, based on the PDF file body, that embeds information in a newly created file body of the PDF file and can extract the embedded information from the PDF file in order to authenticate it. Embedding information in a PDF file with the invention effectively addresses PDF copyright authentication and the distinguishing of genuine files from forgeries, and the invention is robust to editing of the PDF document.
To solve the above technical problem, the technical solution of the invention is as follows:
A PDF file information embedding and extraction method based on the PDF file body, comprising the following steps:
(1) Embedding the hidden information, specifically:
Read in the original PDF file stream;
Read in the hidden information and divide it into segments, scramble each hidden-information segment, and record the scrambling parameters;
Search the original PDF file stream to determine its largest object number;
Take the largest object number plus 1 as the first new object number of the new file body to be inserted, encode each hidden-information segment and write it into the original PDF file, in order, as a new object of the new file body, and generate a new-object position marker;
After the hidden information has been embedded, write the new cross-reference table and new trailer corresponding to the new file body, completing one incremental update;
Output the PDF file carrying the hidden information, and output the scrambling parameters and the new-object position marker as the key;
(2) Extracting the hidden information, specifically:
Read the PDF file stream carrying the hidden information and the key;
According to the new-object position marker in the key, search the data stream of the PDF file to locate the new objects written by the incremental update;
Extract the data streams of the located new objects and decode them;
According to the scrambling parameters in the key, apply inverse scrambling to the decoded new-object data streams;
Merge the inverse-scrambled data streams in order and output the result, obtaining the hidden information.
In the above solution, the step of scrambling each hidden-information segment and recording the scrambling parameters specifically uses a chaotic map to scramble each segment and records the map parameters as the scrambling parameters.
In the above solution, the new-object position marker is the set of new object numbers corresponding to all hidden-information segments.
In the above solution, when the hidden information is read in and segmented to obtain the hidden-information segments, the number of segments is also recorded;
the new-object position marker is then the first new object number in the inserted new file body together with the number of hidden-information segments.
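The text does not say which chaotic map is used for scrambling. As one possible illustration, the following Python sketch assumes a logistic map, x → μ·x·(1 − x), and uses the ordering of its iterates to permute the bytes of a segment; the pair (x0, μ) then plays the role of the scrambling parameters kept in the key. The function names and the default μ = 3.99 are assumptions made for this sketch only.

```python
def logistic_permutation(n: int, x0: float, mu: float = 3.99) -> list[int]:
    """Derive a permutation of range(n) from n iterates of the logistic map."""
    xs = []
    x = x0
    for _ in range(n):
        x = mu * x * (1.0 - x)
        xs.append(x)
    # The rank order of the chaotic sequence defines the permutation.
    return sorted(range(n), key=xs.__getitem__)


def scramble(segment: bytes, x0: float, mu: float = 3.99) -> bytes:
    """Reorder the bytes of one segment; (x0, mu) are the scrambling parameters."""
    perm = logistic_permutation(len(segment), x0, mu)
    return bytes(segment[i] for i in perm)


def unscramble(scrambled: bytes, x0: float, mu: float = 3.99) -> bytes:
    """Inverse scrambling: regenerate the same permutation and undo it."""
    perm = logistic_permutation(len(scrambled), x0, mu)
    out = bytearray(len(scrambled))
    for dst, src in enumerate(perm):
        out[src] = scrambled[dst]
    return bytes(out)
```

With the same parameters, `unscramble(scramble(segment, x0), x0)` returns the original segment, which is the property the extraction side relies on.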
Compared with the prior art, the technical solution of the invention has the following beneficial effects:
The invention uses the new file body added by an incremental update of the PDF file as the carrier for the hidden information. The hidden information is written invisibly when the file is created, has no effect on how the file is displayed, travels over the internet together with the document content, offers ample embedding capacity, and is not destroyed by transmission or by common document editing. It is concealed from attackers, who cannot easily find or destroy it. As a PDF authentication method, the invention can invisibly embed authentication information such as the author, source and copyright of the file in the PDF file, which makes it practical for copyright authentication and for distinguishing genuine PDF files from forgeries.
Description of drawings
Fig. 1 is a schematic diagram of the structure of an original PDF file;
Fig. 2 is a structural diagram of a PDF file after an incremental update;
Fig. 3 is an example of a cross-reference table of a PDF file;
Fig. 4 is a flowchart of embedding the hidden information according to the invention;
Fig. 5 is a flowchart of extracting the hidden information according to the invention;
Fig. 6 shows how the original PDF file is displayed in the specific embodiment of the invention;
Fig. 7 shows how the PDF file with embedded hidden information is displayed in the specific embodiment of the invention;
Fig. 8 shows the result of applying various annotation and markup operations to the PDF file with embedded hidden information in the specific embodiment of the invention;
Fig. 9 shows how a form-type PDF file with embedded hidden information is displayed in the specific embodiment of the invention;
Fig. 10 shows how the form-type PDF file with embedded hidden information is displayed after editing in the specific embodiment of the invention.
Embodiment
The technical solution of the invention is further described below with reference to the drawings and an embodiment.
Figs. 4 and 5 are flowcharts of the PDF file information embedding and extraction method based on the PDF file body according to the invention. The specific steps of the method are as follows:
(S1) As shown in Fig. 4, embed the hidden information in the original PDF file. The specific steps are:
(S11) Read in the original PDF file stream;
(S12) Divide the hidden information that has been read in into segments of fixed length, then scramble each hidden-information segment using a chaotic map; record the map parameters as the scrambling parameters, and record the number of hidden-information segments;
(S13) Search the original PDF file stream to determine its largest object number, and from it determine the object numbers of the objects to be added in the incremental update;
(S14) Take the largest object number plus 1 as the first new object number of the new file body to be inserted, encode each hidden-information segment and write it into the original PDF file, in order, as a new object of the new file body, and generate the new-object position marker. The new-object position marker is either the set of new object numbers corresponding to all hidden-information segments, or the first new object number in the inserted new file body together with the number of hidden-information segments;
(S15) After the hidden information has been embedded, write the new cross-reference table and new trailer corresponding to the new file body, completing one incremental update. The PDF file carrying the hidden information is now complete, with the hidden information embedded in the new file body;
(S16) Output the PDF file carrying the hidden information, and output the scrambling parameters and the new-object position marker as the key.
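Steps (S13) to (S16) can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not stated in the patent: the input uses a classic cross-reference table rather than a cross-reference stream, object headers of the form `N G obj` start at the beginning of a line, the segments passed in have already been scrambled, and "encoding" means Flate compression via `zlib`. The function name `embed_segments` and the layout of the stream dictionary are hypothetical.

```python
import re
import zlib


def embed_segments(pdf: bytes, segments: list[bytes]) -> tuple[bytes, tuple[int, int]]:
    """Append each segment as a new stream object in one incremental update.

    Returns the new file and a position marker (first new object number, segment count).
    """
    # (S13) Largest object number present in the original file.
    obj_nums = [int(m.group(1)) for m in re.finditer(rb"(?m)^(\d+) (\d+) obj\b", pdf)]
    first_new = max(obj_nums) + 1
    # The new trailer must point back to the previous table and repeat /Root.
    prev_xref = int(re.findall(rb"startxref\s+(\d+)", pdf)[-1])
    root = re.findall(rb"/Root\s+(\d+ \d+ R)", pdf)[-1]

    out = bytearray(pdf)
    if not out.endswith(b"\n"):
        out += b"\n"
    # (S14) Write one Flate-compressed stream object per segment.
    offsets = []
    for i, seg in enumerate(segments):
        data = zlib.compress(seg)
        offsets.append(len(out))
        out += b"%d 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % (first_new + i, len(data))
        out += data + b"\nendstream\nendobj\n"
    # (S15) New cross-reference section and trailer for the appended body.
    xref_pos = len(out)
    out += b"xref\n%d %d\n" % (first_new, len(segments))
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # 20-byte entry
    out += b"trailer\n<< /Size %d /Root %s /Prev %d >>\n" % (first_new + len(segments), root, prev_xref)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    # (S16) The marker, together with the scrambling parameters, forms the key.
    return bytes(out), (first_new, len(segments))
```

In this sketch the appended objects are not referenced from the page tree, so a viewer has no reason to render them; the original bytes are left untouched and only new bytes are added at the end.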
(S2) As shown in Fig. 5, extract the hidden information from the PDF file carrying it. The specific steps are:
(S21) Read the PDF file stream carrying the hidden information and the key;
(S22) According to the new-object position marker in the key, search the data stream of the PDF file to locate the new objects written by the incremental update;
(S23) Extract the data streams of the located new objects and decode them. In the prior art, if the content of a PDF object is encoded or compressed, the object carries an entry identifying the filter used for the encoding; most PDF objects use the FlateDecode compression filter. The invention likewise uses this prior-art algorithm for encoding by default, so when the hidden information is extracted the object data can be decoded with the corresponding prior-art decoding algorithm. The scrambled hidden information has a corresponding key, the scrambling parameters, for its extraction and recovery; this key is recorded when the hidden information is embedded and is used to extract it;
(S24) According to the scrambling parameters in the key, apply inverse scrambling to the decoded new-object data streams;
(S25) Merge the inverse-scrambled data streams in order and output the result, obtaining the hidden information.
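Steps (S22) to (S25) can be sketched in Python as follows. The sketch assumes that the position marker is the pair (first new object number, segment count), that each hidden object is a FlateDecode stream with a direct `/Length` entry and generation number 0, and that the inverse scrambling is supplied by the caller as a function. The name `extract_hidden` and its parameters are hypothetical.

```python
import re
import zlib
from typing import Callable


def extract_hidden(
    pdf: bytes,
    first_obj: int,
    count: int,
    unscramble: Callable[[bytes], bytes] = lambda seg: seg,
) -> bytes:
    """Locate the hidden stream objects, decode them, descramble and merge."""
    segments = []
    for num in range(first_obj, first_obj + count):
        # (S22) Find the object header and its stream dictionary.
        head = re.search(
            rb"^%d 0 obj\s*<<(.*?)>>\s*stream\r?\n" % num, pdf, flags=re.M | re.S
        )
        if head is None:
            raise ValueError("object %d not found" % num)
        length = int(re.search(rb"/Length\s+(\d+)", head.group(1)).group(1))
        # (S23) Slice exactly /Length bytes and inflate them.
        raw = pdf[head.end():head.end() + length]
        decoded = zlib.decompress(raw)
        # (S24) Inverse scrambling with the parameters from the key.
        segments.append(unscramble(decoded))
    # (S25) Merge the segments in object-number order.
    return b"".join(segments)
```

Reading the stream by its declared `/Length`, rather than searching for `endstream`, avoids being misled by compressed bytes that happen to resemble a keyword.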
Fig. 6 shows how the original PDF file is displayed, and Fig. 7 shows how the PDF file with embedded hidden information obtained with the invention is displayed. The figures show that embedding the hidden information has no effect on how the PDF file is displayed, so the hidden information is visually well concealed.
Fig. 8 shows the result of applying various annotation and markup operations to the PDF file with embedded hidden information, performed with Adobe Acrobat 9 Professional. In experiments, the hidden information was extracted from the edited PDF file with the invention, and the extraction accuracy was 100%; the invention is thus robust to common editing.
Figs. 9 and 10 show a form-type PDF file with embedded hidden information before and after editing: Fig. 9 shows the form-type PDF file with the hidden information embedded but before any editing, and Fig. 10 shows the document obtained after the content of the file in Fig. 9 was edited and saved. In experiments, the hidden information was extracted from the edited form-type file with the invention, and the extraction accuracy was 100%. The invention is therefore equally robust to such editing, and is well suited to copyright authentication and to distinguishing genuine PDF files from forgeries.

Claims (4)

1. A PDF file information embedding and extraction method based on the PDF file body, characterized in that it comprises the following steps:
Embedding the hidden information, specifically:
Read in the original PDF file stream;
Read in the hidden information and divide it into segments, scramble each hidden-information segment, and record the scrambling parameters;
Search the original PDF file stream to determine its largest object number;
Take the largest object number plus 1 as the first new object number of the new file body to be inserted, encode each hidden-information segment and write it into the original PDF file, in order, as a new object of the new file body, and generate a new-object position marker;
After the hidden information has been embedded, write the new cross-reference table and new trailer corresponding to the new file body, completing one incremental update;
Output the PDF file carrying the hidden information, and output the scrambling parameters and the new-object position marker as the key;
Extracting the hidden information, specifically:
Read the PDF file stream carrying the hidden information and the key;
According to the new-object position marker in the key, search the data stream of the PDF file to locate the new objects written by the incremental update;
Extract the data streams of the located new objects and decode them;
According to the scrambling parameters in the key, apply inverse scrambling to the decoded new-object data streams;
Merge the inverse-scrambled data streams in order and output the result, obtaining the hidden information.
2. The PDF file information embedding and extraction method based on the PDF file body according to claim 1, characterized in that the step of scrambling each hidden-information segment and recording the scrambling parameters specifically uses a chaotic map to scramble each hidden-information segment and records the map parameters as the scrambling parameters.
3. The PDF file information embedding and extraction method based on the PDF file body according to claim 1 or 2, characterized in that the new-object position marker is the set of new object numbers corresponding to all hidden-information segments.
4. The PDF file information embedding and extraction method based on the PDF file body according to claim 1 or 2, characterized in that when the hidden information is read in and segmented to obtain the hidden-information segments, the number of hidden-information segments is also recorded;
the new-object position marker is the first new object number in the inserted new file body together with the number of hidden-information segments.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012100457361A CN102646179A (en) 2012-02-27 2012-02-27 PDF (Portable Document Format) document information embedding and extraction method based on PDF documents

Publications (1)

Publication Number Publication Date
CN102646179A true CN102646179A (en) 2012-08-22

Family

ID=46658996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012100457361A Pending CN102646179A (en) 2012-02-27 2012-02-27 PDF (Portable Document Format) document information embedding and extraction method based on PDF documents

Country Status (1)

Country Link
CN (1) CN102646179A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7783972B2 (en) * 2001-01-08 2010-08-24 Enfocus NV Ensured workflow system and method for editing a consolidated file
CN101082981A (en) * 2007-05-22 2007-12-05 中山大学 Watermark embeding and extracting method of binary image

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
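刘友继: "Information Hiding and Detection Based on PDF Documents", China Master's Theses Full-text Database (Information Science and Technology) *
刘友继 et al.: "A New Information Hiding Algorithm Based on PDF Document Structure", Computer Engineering *
顾艳春 et al.: "A Text Digital Watermarking Algorithm Based on PDF Documents and Scrambling", Journal of Foshan University (Natural Science Edition) *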

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103118139B (en) * 2013-03-05 2016-03-30 中国科学技术大学苏州研究院 Distributed information hides transmission system and transmission method thereof
CN103118139A (en) * 2013-03-05 2013-05-22 中国科学技术大学苏州研究院 Distributed information hiding transmission system and transmission method thereof
CN103488739A (en) * 2013-09-18 2014-01-01 天脉聚源(北京)传媒科技有限公司 Method and device for processing PDF (portable document format) files, and mobile communication equipment
CN103544408A (en) * 2013-09-23 2014-01-29 中山大学 Method for embedment and extraction of PDF document hidden information according to composite font
CN103761488A (en) * 2014-02-24 2014-04-30 赛特斯信息科技股份有限公司 Information hiding achievement method based on file header control file contents
CN104134023B (en) * 2014-08-15 2017-01-04 北京邮电大学 A kind of watermark handling method and system for processing watermark
CN104134023A (en) * 2014-08-15 2014-11-05 北京邮电大学 Watermark processing method and system
CN105843783A (en) * 2016-03-21 2016-08-10 哈尔滨工程大学 Chinese PDF file text content extraction method oriented to network flow transmission
CN105956477A (en) * 2016-04-20 2016-09-21 广州慧睿思通信息科技有限公司 PDF document recovery device and method
CN105956477B (en) * 2016-04-20 2018-12-21 广州慧睿思通信息科技有限公司 A kind of PDF document recovery device and method
CN107992761A (en) * 2016-10-27 2018-05-04 北京京东尚科信息技术有限公司 Strengthen the method and system of PDF document content security
CN109948123A (en) * 2018-11-27 2019-06-28 阿里巴巴集团控股有限公司 A kind of image combining method and device
CN109948123B (en) * 2018-11-27 2023-06-02 创新先进技术有限公司 Image merging method and device
CN110008663A (en) * 2018-12-27 2019-07-12 杭州基尔区块链科技有限公司 A method of the information protected for PDF document and distribute tracking is quickly embedded in and extracts
CN110008663B (en) * 2018-12-27 2020-12-08 杭州基尔区块链科技有限公司 Method for quickly embedding and extracting information for PDF document protection and distribution tracking
CN110287157A (en) * 2019-06-28 2019-09-27 北京金山安全软件有限公司 File processing method, file reading method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120822