CN112053275A

CN112053275A - Printing and scanning attack resistant PDF document watermarking method and device

Info

Publication number: CN112053275A
Application number: CN202010677029.9A
Authority: CN
Inventors: 王道顺; 刘景财; 刘乃熙; 贾星星; 巩林明
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2020-07-14
Filing date: 2020-07-14
Publication date: 2020-12-08
Anticipated expiration: 2040-07-14
Also published as: CN112053275B

Abstract

The application discloses a PDF document watermarking method and device resistant to print-scan attacks. The method comprises the following steps: parsing an initial PDF document to obtain a text line spacing signal of the initial PDF document; embedding preset watermark information into the text line spacing signal to generate a watermarked line spacing signal; generating a final PDF document from the initial PDF document based on the watermarked line spacing signal; and generating a document image from the final PDF document, detecting bounding boxes of text lines in the document image, and calculating a line spacing signal of the final PDF image to extract the watermark information. By introducing text detection technology, the embodiments of the application can not only calculate the line spacing signal accurately but also compensate for most of the attacks introduced by the print-scan process.

Description

Printing and scanning attack resistant PDF document watermarking method and device

Technical Field

The present application relates to the field of text watermarking technologies, and in particular to a Portable Document Format (PDF) document watermarking method and device resistant to print-scan attacks.

Background

A print-scan attack refers to the process in which a watermarked text document is output as a hard copy by a printer and then converted into a text document image by a scanner. Print-scan attacks are a common compound attack type in text watermarking; that is, they combine several types of attack, mainly noise attacks, nonlinear attacks (such as gamma correction and digital halftoning) and geometric attacks (such as rotation, scaling and cropping).

Therefore, a digital watermarking algorithm that resists print-scan attacks must be highly robust to all of the above attacks.

Summary of the Application

The present application is directed to solving, at least to some extent, one of the technical problems in the related art.

Therefore, a first objective of the present application is to provide a printing and scanning attack resistant PDF document watermarking method, which not only can accurately calculate a line spacing signal by introducing a text detection technology, but also can compensate most of the attacks introduced by the printing and scanning process.

The second purpose of the present application is to provide a PDF document watermarking device that is resistant to print-scan attacks.

In order to achieve the above object, an embodiment of a first aspect of the present application provides a PDF document watermarking method resistant to print-scan attacks, including the following steps: parsing an initial PDF document to obtain a text line spacing signal of the initial PDF document; embedding preset watermark information into the text line spacing signal to generate a watermarked line spacing signal; generating a final PDF document from the initial PDF document based on the watermarked line spacing signal; and generating a document image from the final PDF document, detecting bounding boxes of text lines in the document image, and calculating a line spacing signal of the final PDF image to extract the watermark information.

According to the printing and scanning attack resistant PDF document watermarking method, by analyzing various attacks introduced in the printing and scanning process, various attacks generated in the printing and scanning process are compensated by using a text detection technology, a line spacing signal is calculated from a document image generated after printing and scanning as accurately as possible, and the robustness of a PDF document watermarking algorithm is improved.

In addition, the printing and scanning attack resistant PDF document watermarking method according to the above embodiment of the present application may further have the following additional technical features:

In one embodiment of the present application, the method further includes: quantizing the line spacing signal based on a preset strategy of optimizing the quantization step size, so as to resist fixed gain attacks.

In an embodiment of the present application, before detecting the bounding boxes of text lines in the document image, the method further includes: determining the requirement of the text detection; and determining, according to the requirement, whether the condition for a text detection mode based on sliding window classification or for a text detection mode based on connected component analysis is met.

In an embodiment of the present application, when the condition for the text detection mode based on sliding window classification is met, detecting the bounding boxes of text lines in the document image and calculating the line spacing signal of the final PDF image includes: sliding windows of multiple scales over the document image and labeling each window as a text segment window or a non-text segment window by means of a classifier; and performing morphological operations on the text segment windows to determine the text regions.

In an embodiment of the present application, when the condition for the text detection mode based on connected component analysis is met, detecting the bounding boxes of text lines in the document image and calculating the line spacing signal of the final PDF image includes: extracting text characters from the document image; combining the text characters into text segments; filtering out the non-text segments among them; and combining the filtered text segments into text lines.

In order to achieve the above object, an embodiment of a second aspect of the present application provides a PDF document watermarking device resistant to print-scan attacks, including: a parsing module, configured to parse an initial PDF document to obtain a text line spacing signal of the initial PDF document; an embedding module, configured to embed preset watermark information into the text line spacing signal to generate a watermarked line spacing signal; a generating module, configured to generate a final PDF document from the initial PDF document based on the watermarked line spacing signal; and an extraction module, configured to generate a document image from the final PDF document, detect bounding boxes of text lines in the document image, and calculate a line spacing signal of the final PDF image so as to extract the watermark information.

According to the printing and scanning attack resistant PDF document watermarking device, by analyzing various attacks introduced in the printing and scanning process, various attacks generated in the printing and scanning process are compensated by using a text detection technology, a line spacing signal is calculated from a document image generated after printing and scanning as accurately as possible, and the robustness of a PDF document watermarking algorithm is improved.

In one embodiment of the present application, the device further includes: a processing module, configured to quantize the line spacing signal based on a preset strategy of optimizing the quantization step size, so as to resist fixed gain attacks.

In one embodiment of the present application, the device further includes: a detection module, configured to determine the requirement of the text detection before the bounding boxes of text lines are detected in the document image; and a judging module, configured to determine, according to the requirement, whether the condition for a text detection mode based on sliding window classification or for a text detection mode based on connected component analysis is met.

In an embodiment of the application, when the condition for the text detection mode based on sliding window classification is met, the extraction module is further configured to slide windows of multiple scales over the document image, label each window as a text segment window or a non-text segment window by means of a classifier, and perform morphological operations on the text segment windows to determine the text regions.

In an embodiment of the application, when the condition for the text detection mode based on connected component analysis is met, the extraction module is further configured to extract text characters from the document image, combine the text characters into text segments, filter out the non-text segments among them, and combine the filtered text segments into text lines.

Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.

Drawings

The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flowchart of a PDF document watermarking method resistant to print-scan attacks according to an embodiment of the present application;

FIG. 2 is a flowchart of a PDF document watermarking method resistant to print-scan attacks according to an embodiment of the present application;

FIG. 3 is a diagram illustrating robustness against fixed gain attacks and Gaussian noise attacks according to an embodiment of the present application;

FIG. 4 is a diagram illustrating a robustness analysis for Gaussian noise according to an embodiment of the present application;

FIG. 5 is a schematic diagram illustrating a text detection process and its effect according to an embodiment of the present application;

FIG. 6 is an example diagram of a PDF document watermarking device resistant to print-scan attacks according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.

The embodiments of the present application aim to optimize two PDF document watermarking algorithms based on quantization index modulation (QIM) so that they are more robust to print-scan attacks. Both algorithms encode the watermark information using line shifting, so the watermark extraction algorithm must be able to calculate the text line spacing accurately from the document image produced by printing and scanning. To solve this problem, the embodiments of the present application provide a PDF document watermarking method and device, based on text detection technology, that are resistant to print-scan attacks.
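As a minimal sketch of how quantization index modulation can carry one bit per line spacing, the following Python example moves each spacing to the nearest point of one of two interleaved lattices. The function names, the dither values (0 and half a step), the step size and the sample spacings are illustrative assumptions and are not taken from the application.

```python
def qim_embed(spacing, bit, step):
    """Move one line spacing to the nearest point of the lattice selected by `bit`."""
    dither = bit * step / 2.0  # bit 0: multiples of step; bit 1: shifted by step / 2
    return step * round((spacing - dither) / step) + dither


def qim_extract(spacing, step):
    """Return the bit whose lattice lies closest to the received spacing."""
    d0 = abs(spacing - qim_embed(spacing, 0, step))
    d1 = abs(spacing - qim_embed(spacing, 1, step))
    return 0 if d0 <= d1 else 1


spacings = [22.3, 21.9, 22.1, 22.6]  # hypothetical line spacings in points
bits = [1, 0, 1, 1]                  # hypothetical watermark bits
step = 1.0                           # illustrative quantization step

marked = [qim_embed(s, b, step) for s, b in zip(spacings, bits)]
recovered = [qim_extract(y, step) for y in marked]
print(marked, recovered)
```

Each spacing moves by at most half a step, so a smaller step gives a less visible watermark but leaves less room for detection noise.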

The PDF document watermarking method and device resistant to print-scan attacks according to the embodiments of the present application are described below with reference to the accompanying drawings, beginning with the method.

Specifically, FIG. 1 is a schematic flowchart of a PDF document watermarking method resistant to print-scan attacks according to an embodiment of the present application.

As shown in fig. 1, the printing and scanning attack resistant PDF document watermarking method includes the following steps:

In step S101, the initial PDF document is parsed to obtain a text line spacing signal of the initial PDF document.

As shown in FIG. 2, the PDF document is parsed by a parser to obtain the text line spacing signal x.

In step S102, preset watermark information is embedded into the text line spacing signal to generate a watermarked line spacing signal.

As shown in FIG. 2, the watermark information m is embedded into x by an embedder, yielding the watermarked line spacing signal.

In step S103, a final PDF document is generated from the initial PDF document based on the watermarked line spacing signal.

As shown in FIG. 2, the watermarked PDF document is output by a generator.

In step S104, a document image is generated from the final PDF document, bounding boxes of text lines are detected in the document image, and a line spacing signal of the final PDF image is calculated to extract the watermark information.

As shown in FIG. 2, the watermarked PDF document passes through a printer and a scanner to produce a document image I; a text detector detects the bounding boxes of the text lines in the document image I, the line spacing signal is then calculated from them, and the watermark information m is recovered by an extractor.
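The step from detected bounding boxes to a line spacing signal can be sketched as follows. This is only an illustration under stated assumptions: boxes are given as (x, y, width, height) in pixels with y measured from the top of the image, and the spacing is taken as the distance between the vertical centers of consecutive lines. The application does not specify either convention, and the function names and box values are hypothetical.

```python
def line_spacing_signal(boxes):
    """Distances between vertical centers of consecutive text lines.

    boxes: iterable of (x, y, width, height), y measured from the top of the image.
    """
    centers = sorted(y + h / 2.0 for _, y, _, h in boxes)
    return [b - a for a, b in zip(centers, centers[1:])]


def scale_boxes(boxes, factor):
    """Simulate a uniformly scaled scan by scaling every box coordinate."""
    return [(x * factor, y * factor, w * factor, h * factor) for x, y, w, h in boxes]


# Hypothetical detector output, deliberately out of order.
boxes = [(100, 300, 1800, 44), (100, 208, 1800, 44), (100, 392, 1800, 44)]

signal = line_spacing_signal(boxes)
scaled = line_spacing_signal(scale_boxes(boxes, 1.25))
print(signal, scaled)
```

Scaling every box by the same factor multiplies every spacing by that factor, so a resized scan acts on the line spacing signal as a fixed gain.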

By introducing text detection technology, the embodiment of the application can not only calculate the line spacing signal accurately but also compensate for most of the attacks introduced by the print-scan process, with the exception of the scaling attack. In fact, after processing by the text detector, the scaling attack is converted into a fixed gain attack: the document image obtained after printing and scanning is scaled in size, so the line spacing calculated by the text detector may be enlarged or reduced by a fixed factor relative to the line spacing of the original document. In addition, the text detector itself may introduce a noise attack because of detection errors.

Therefore, the embodiment of the application converts the print-scan attack into two independent problems: first, improving the PDF document watermarking algorithm so that it is sufficiently robust to fixed gain attacks; and second, developing or implementing a high-precision text detection algorithm that is robust to noise attacks, nonlinear attacks and geometric attacks. Each is described in detail below.

In one embodiment of the present application, the method further includes: quantizing the line spacing signal based on a preset strategy of optimizing the quantization step size, so as to resist fixed gain attacks.

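Specifically, a fixed gain attack means that the carrier signal (for example, the line spacing signal) received at the receiving end has been multiplied by a scaling factor ρ > 0. Denoting the watermarked output signal by y, the received signal is therefore ρ·y (Equation 1).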

Watermarking algorithms based on quantization index modulation and its improved variants are not very robust to fixed gain attacks, because the scaled signal easily leaves its original quantization cell. Many researchers have optimized such algorithms to resist fixed gain attacks, but almost all of that work targets image watermarking and optimizes the quantization step size, or the form of the embedder and extractor, using the Watson perceptual model. The algorithm of the embodiment of the present application achieves resistance to fixed gain attacks by optimizing the quantization step size Δ while preserving the adaptivity of the algorithm. Taking the PDF document watermarking algorithm based on STDM (Spread Transform Dither Modulation) as an example, the adaptive function of the algorithm is expressed as:

where p = xᵀv is the projection of the i-th sub-signal x of the carrier signal along the vector v, yⁱ⁻¹ is the output signal of the (i−1)-th sub-signal of the carrier signal, and ‖yⁱ⁻¹‖_β is its β-norm (although, strictly speaking, it is not a norm), namely:

where L is the length of each of the non-overlapping sub-signals into which the carrier signal is divided, H is the number of watermark bits, and yⁱ is the output signal of the i-th sub-signal of the carrier signal. To simplify the notation, the superscript is omitted in the derivation below.

Under a fixed gain attack alone, it follows from Equation 1 that:

then:

As can be seen from Equation 4, the watermark bits extracted from the unattacked output signal y and from the distorted signal subjected to the fixed gain attack are identical, which guarantees that the algorithm of the embodiment of the present application remains accurate and effective under a fixed gain attack. The remaining problem is to determine the value of β. In this case,

Therefore,

Let

Equation 7 can then be simplified as:

Further, similarly to the STDM-based PDF document watermarking algorithm, the adaptive function of the PDF document watermarking algorithm based on DM (Dither Modulation) is expressed as:

where x_k is the k-th sample of the i-th sub-signal x of the carrier signal. It can likewise be shown that this adaptive function is effective against fixed gain attacks.
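The effect of a fixed gain on quantization-based extraction can be illustrated numerically. The sketch below uses scalar dither modulation and a hypothetical step rule, `adaptive_step`, that makes the step proportional to the mean magnitude of the earlier samples. It is not the adaptive function of the application, whose expression is not reproduced here, and it shares with it only the property that scaling the signal by ρ scales the step by ρ. The sample values, `alpha`, and the choice to leave the first sample unmarked as a reference are likewise assumptions.

```python
def dm_embed(x, bit, step):
    """Scalar dither modulation: quantize x onto the lattice selected by `bit`."""
    dither = bit * step / 2.0
    return step * round((x - dither) / step) + dither


def dm_extract(z, step):
    """Return the bit whose lattice lies closest to z."""
    d0 = abs(z - dm_embed(z, 0, step))
    d1 = abs(z - dm_embed(z, 1, step))
    return 0 if d0 <= d1 else 1


def adaptive_step(previous, alpha):
    """Hypothetical step rule: proportional to the mean magnitude of earlier samples."""
    return alpha * sum(abs(v) for v in previous) / len(previous)


def embed_adaptive(x, bits, alpha):
    y = [x[0]]  # assumption: the first sample stays unmarked and acts as a reference
    for xk, b in zip(x[1:], bits):
        y.append(dm_embed(xk, b, adaptive_step(y, alpha)))
    return y


def extract_adaptive(z, alpha):
    # The step is recomputed from the received samples, so it scales with the gain.
    return [dm_extract(z[k], adaptive_step(z[:k], alpha)) for k in range(1, len(z))]


rho = 1.03  # illustrative fixed gain

# Fixed step: bit 1 is embedded at 22.5, but the scaled value 23.175 decodes as 0.
fixed_bit = dm_extract(rho * dm_embed(22.3, 1, 1.0), 1.0)

# Signal-proportional step: the bits survive the same gain.
x = [22.0, 22.3, 21.9, 22.1, 22.6]
bits = [1, 0, 1, 1]
alpha = 0.04
y = embed_adaptive(x, bits, alpha)
adaptive_bits = extract_adaptive([rho * v for v in y], alpha)
print(fixed_bit, adaptive_bits)
```

With a fixed step, the scaled sample drifts toward the other lattice and the bit flips. When the step is derived from the received signal, the ratio between each sample and its step is unchanged by the gain, so extraction is unaffected.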

The quantization-based watermarking algorithm RDM (Rational Dither Modulation) is highly robust to fixed gain attacks but poorly robust to Gaussian noise. As shown in FIG. 3, the two improved algorithms proposed by the embodiment of the present application, STDM-SF and DM-SF, are robust to both fixed gain attacks and Gaussian noise attacks. Compared with the conventional STDM algorithm and with the algorithms before improvement (STDM-SA and DM-SA), however, the two improved algorithms are less robust to Gaussian noise attacks.

FIG. 4 shows how the robustness of the above algorithms to Gaussian noise varies with the maximum distortion. As can be seen from FIG. 4, for the same maximum distortion, the DM-based PDF document watermarking algorithm has a larger average distortion than the STDM-based algorithm, and the two improved algorithms proposed by the embodiment of the present application have a smaller average distortion than the algorithms before improvement, which is also why the improved algorithms are less robust to Gaussian noise. In addition, the STDM-SF algorithm is more robust to Gaussian noise than the DM-SF algorithm when MXAE is less than 0.6, whereas the DM-SF algorithm is more robust when MXAE is greater than 0.6.

Further, in an embodiment of the present application, before detecting the bounding boxes of text lines in the document image, the method further includes: determining the requirement of the text detection; and determining, according to the requirement, whether the condition for the text detection mode based on sliding window classification or for the text detection mode based on connected component analysis is met.

It should be noted that in recent years, scene text detection technology has been extensively studied, and particularly with the development of deep learning technology, many researchers use the existing mature depth models or depth models developed for specific text detection tasks to locate the position of text regions in complex scene images. The greatest advantage of deep learning-based text detection methods is that automatic feature learning can free researchers from designing and testing a large number of potential manual features. However, in consideration of the high complexity and low computational efficiency of the deep learning method, and the single background color and regular text shape of the scanned PDF text document image compared with the complex scene image, the conventional method is used herein to implement the text detection task.

The conventional scene text detection technology can be roughly divided into two categories, namely, text detection based on Sliding Window Classification (Sliding Window Classification) and text detection based on Connected Component Analysis (Connected Component Analysis).

In an embodiment of the present application, when the condition for the text detection mode based on sliding window classification is met, detecting the bounding boxes of text lines in the document image and calculating the line spacing signal of the final PDF image includes: sliding windows of multiple scales over the document image and labeling each window as a text segment window or a non-text segment window by means of a classifier; and performing morphological operations on the text segment windows to determine the text regions.

Specifically, text detection based on sliding window classification first slides windows of various scales over the input image and labels each window as a text segment or a non-text segment by means of a classifier; the windows labeled as text segments are then grouped into text regions by morphological operations. The advantage of this approach is its simple and adaptive training and detection architecture. However, when a complex classification method is used and a large number of windows must be classified, the computational cost can be very high. Text detection based on connected component analysis takes a bottom-up approach, obtaining all text regions by repeatedly joining smaller text components into larger ones. It uses manually designed rules or automatically trained classifiers to filter out the non-text components extracted from scene images. In general, text detection based on connected component analysis is more computationally efficient, because the number of text components to be processed is relatively small and because rotation, scaling and font changes of the text have little effect on it.
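The sliding-window approach described above can be sketched on a tiny binary image. This is a deliberately simplified illustration. A plain ink-ratio threshold stands in for the trained classifier, painting whole windows into a mask stands in for the morphological grouping step, and the window sizes, threshold and test image are all hypothetical.

```python
def ink_ratio(image, top, left, h, w):
    """Fraction of ink pixels (value 1) inside the window."""
    total = sum(image[r][c] for r in range(top, top + h) for c in range(left, left + w))
    return total / float(h * w)


def detect_text_mask(image, window_sizes, threshold=0.6):
    """Slide windows of several sizes and mark every window classified as text."""
    rows, cols = len(image), len(image[0])
    mask = [[0] * cols for _ in range(rows)]
    for h, w in window_sizes:
        for top in range(rows - h + 1):
            for left in range(cols - w + 1):
                # Stand-in classifier: a window is "text" if it holds enough ink.
                if ink_ratio(image, top, left, h, w) >= threshold:
                    for r in range(top, top + h):
                        for c in range(left, left + w):
                            mask[r][c] = 1  # painting the window also bridges character gaps
    return mask


def row_bands(mask):
    """Group consecutive rows containing text pixels into (first_row, last_row) bands."""
    bands, start = [], None
    for r, row in enumerate(mask):
        has_text = any(row)
        if has_text and start is None:
            start = r
        elif not has_text and start is not None:
            bands.append((start, r - 1))
            start = None
    if start is not None:
        bands.append((start, len(mask) - 1))
    return bands


text_row = [1, 1, 0] * 4  # four two-pixel "characters" separated by gaps
blank = [0] * 12
image = [blank, text_row, text_row, blank, blank, text_row, text_row, blank, blank]

mask = detect_text_mask(image, window_sizes=[(2, 3), (2, 6)])
bands = row_bands(mask)
print(bands)
```

Every window position is classified at every scale, which is why the cost of this approach grows quickly with image size and classifier complexity.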

In one embodiment of the present application, when the condition for the text detection mode based on connected component analysis is met, detecting the bounding boxes of text lines in the document image and calculating the line spacing signal of the final PDF image includes: extracting text characters from the document image; combining the text characters into text segments; filtering out the non-text segments among them; and combining the filtered text segments into text lines.

Specifically, the text detection algorithm used in the embodiment of the present application is implemented based on a connected component analysis method, and includes the following steps:

(1) Extracting text characters: MSER (Maximally Stable Extremal Regions) is an efficient method for extracting candidate text connected components (characters or parts of characters) from video and natural scene images. Suppose that pixels of the grayscale image I below a certain threshold are marked "black" and pixels at or above the threshold are marked "white". As the threshold increases from small to large, the image I gradually changes from "all white" to "all black". Over a certain range of threshold values, some regions of the image I remain unchanged; these are extremal regions, because the intensity of the pixels inside them is higher (or lower) than that of the pixels outside. The set of maximal extremal regions in the image I that remain stable as the threshold varies is defined as the MSER. Since the contrast between text and background in a PDF document image is relatively high, MSER is well suited to extracting candidate text characters from the PDF document, as shown in (b) of FIG. 5.

(2) Constructing text segments: candidate text characters are combined into text segments by a clustering method, after which non-text segments are filtered out by a classifier. The text detection algorithm used in the embodiment of the present application likewise constructs text segments by clustering; specifically, it spatially clusters the center coordinates of the extracted candidate text characters, as shown in (c) and (d) of FIG. 5.

(3) Filtering non-text segments: the constructed text segments may include some non-text segments (such as background) whose features resemble text; these can be filtered out by a classifier or by rules. Here, rules based on the geometric features of text in PDF documents are designed to filter out segments with unusual shapes, such as a very large or very small aspect ratio or a very large or very small area ratio. Filtering non-text segments by rules is therefore also called geometric filtering [50, 54].

(4) Constructing text lines: the document image obtained after printing and scanning may be rotated by some angle, so the text segments are combined into text lines by projection clustering. First, the filtered text segments with a large aspect ratio are analyzed to calculate the rotation angle of the document; then the center coordinates of all text segments are projected along the direction given by the rotation angle; finally, the projected data are clustered to construct the text lines, as shown in (e) and (f) of FIG. 5.
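Steps (3) and (4) above can be sketched as follows. The sketch rests on several assumptions that are not taken from the application: text segments are given as (center x, center y, width, height), the filter thresholds are arbitrary, the rotation angle is passed in rather than estimated, and the projected centers are clustered by splitting the sorted values at large gaps.

```python
import math


def geometric_filter(segments, min_aspect=1.5, max_aspect=60.0, min_area=50.0):
    """Keep segments whose aspect ratio and area look like text (illustrative thresholds)."""
    kept = []
    for cx, cy, w, h in segments:
        if min_aspect <= w / h <= max_aspect and w * h >= min_area:
            kept.append((cx, cy, w, h))
    return kept


def cluster_lines(segments, angle_deg, gap):
    """Project centers onto the axis perpendicular to the text direction,
    then split the sorted projections wherever two neighbors differ by more than `gap`."""
    if not segments:
        return []
    t = math.radians(angle_deg)
    proj = sorted(-cx * math.sin(t) + cy * math.cos(t) for cx, cy, _, _ in segments)
    groups = [[proj[0]]]
    for p in proj[1:]:
        if p - groups[-1][-1] > gap:
            groups.append([p])
        else:
            groups[-1].append(p)
    return [sum(g) / len(g) for g in groups]  # one position per text line


def rotate(point, angle_deg):
    """Rotate a point about the origin, used here only to build a skewed test page."""
    t = math.radians(angle_deg)
    x, y = point
    return (x * math.cos(t) - y * math.sin(t), x * math.sin(t) + y * math.cos(t))


# Hypothetical page: three lines 22 units apart, three segments each, skewed by 2 degrees.
segments = []
for line_y in (100.0, 122.0, 144.0):
    for x in (50.0, 200.0, 350.0):
        cx, cy = rotate((x, line_y), 2.0)
        segments.append((cx, cy, 120.0, 12.0))
segments.append((400.0, 60.0, 3.0, 3.0))  # a small speck that should be filtered out

kept = geometric_filter(segments)
line_positions = cluster_lines(kept, angle_deg=2.0, gap=5.0)
spacing = [b - a for a, b in zip(line_positions, line_positions[1:])]
print(len(kept), line_positions, spacing)
```

Projecting along the estimated rotation angle removes the skew before clustering, so the recovered spacings match those of the unrotated page.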

The experimental results and performance of the PDF document watermarking method resistant to print-scan attacks in a specific application are analyzed below.

The basic quantizer and the dither signal used by the improved PDF document watermarking algorithms proposed in the embodiment of the application are as follows:

The watermark signal and the projection vector are both randomly generated. In the embodiment of the application, a text document is edited in WPS 2019 and exported in PDF format. The text is set in the Song typeface at Chinese size five (10.5 pt) with a line spacing of 22 pt. Because the length of the PDF document is limited, the watermark length is H = 11 bits and the sub-signal length is L = 8. The printer and the scanner used in the experiment are both an AURORA AD220MC monochrome laser multifunction device; the printer keeps its default settings, and the scanner outputs grayscale JPG images at 300 DPI. The experimental results of the print-scan attack are shown in Table 1.

TABLE 1

The robustness of the STDM-SF, DM-SF and RDM algorithms to the print-scan attack was tested under various distortion conditions. As can be seen from Table 1, all three algorithms are robust to the print-scan attack when the maximum (or average) distortion is large. As the maximum (or average) distortion decreases, that is, as the watermark becomes less perceptible, the robustness of the RDM algorithm degrades most severely, with an error rate above 50% in the worst case. This is because the RDM algorithm is poorly robust to Gaussian noise: when the distortion of the text document is small, it is easily disturbed by the detection error of the text detector. The advantage of the RDM algorithm is its high watermark embedding capacity. The DM-SF algorithm is the next most affected: when the maximum distortion MXAE is 0.1, its error rate is about 40%, because the DM-SF algorithm is also poorly robust to Gaussian noise when the maximum distortion is small, as shown in (b) of FIG. 5. The STDM-SF algorithm performs best of the three, with the number of watermark error bits (EB) remaining 0 at every distortion level. For all three algorithms, with a maximum distortion of 1 the watermark is still sufficiently imperceptible, and the difference between the watermarked PDF document and the original document is difficult for the human eye to perceive.

In summary, the PDF document watermarking method resistant to print-scan attacks according to the embodiment of the present application analyzes the various attacks introduced by the print-scan process, compensates for them using text detection technology, and calculates the line spacing signal from the document image generated after printing and scanning as accurately as possible, thereby improving the robustness of the PDF document watermarking algorithm. In addition, to address the problem that the calculated line spacing signal is enlarged or reduced by a fixed factor relative to the line spacing signal of the original text document, two improved PDF document watermarking algorithms that are sufficiently robust to fixed gain attacks are provided. According to the experimental results, the proposed PDF document watermarking scheme is highly robust to print-scan attacks.

The PDF document watermarking device against print scan attack proposed according to the embodiment of the present application is described next with reference to the drawings.

Fig. 6 is a schematic block diagram of a PDF document watermarking apparatus resisting print scan attack according to an embodiment of the present application.

As shown in fig. 6, the PDF document watermarking device 10 resistant to print scan attack includes: parsing module 100, embedding module 200, generating module 300, and extracting module 400.

The parsing module 100 is configured to parse an initial PDF document to obtain a text line spacing signal of the initial PDF document; the embedding module 200 is configured to embed preset watermark information into the text line spacing signal to generate a watermarked line spacing signal; the generating module 300 is configured to generate a final PDF document from the initial PDF document based on the watermarked line spacing signal; the extracting module 400 is configured to generate a document image from the final PDF document, detect bounding boxes of text lines in the document image, and calculate a line spacing signal of the final PDF image to extract the watermark information. By introducing text detection technology, the device 10 of the embodiment of the application can not only calculate the line spacing signal accurately but also compensate for most of the attacks introduced by the print-scan process.

In one embodiment of the present application, the device 10 further includes a processing module. The processing module is configured to quantize the line spacing signal based on a preset strategy of optimizing the quantization step size, so as to resist fixed gain attacks.

In one embodiment of the present application, the device 10 further includes a detection module and a judging module. The detection module is configured to determine the requirement of the text detection before the bounding boxes of text lines are detected in the document image; the judging module is configured to determine, according to the requirement, whether the condition for the text detection mode based on sliding window classification or for the text detection mode based on connected component analysis is met.

In an embodiment of the present application, when the text detection mode based on sliding window classification is satisfied, the extraction module 400 is further configured to slide windows of multiple scales over the document image, label each window as a text segment window or a non-text segment window by means of a classifier, and perform a morphological operation on the text segment windows to determine the text region.
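The sliding window mode can be sketched in a few lines of dependency-free Python. Everything below is a simplification under stated assumptions: the image is a binary list of lists (1 = ink), the classifier is a placeholder ink-density threshold rather than a trained model, and the morphological operation is a one-dimensional closing over image rows. The embodiment does not specify any of these choices.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def density_classifier(patch: Image, threshold: float = 0.75) -> bool:
    """Placeholder classifier (assumption): 'text' if enough of the window is ink."""
    total = sum(len(row) for row in patch)
    return sum(map(sum, patch)) >= threshold * total


def text_row_mask(img: Image, scales: Sequence[int],
                  classifier: Callable[[Image], bool]) -> List[bool]:
    """Slide square windows of every scale; mark the rows covered by text windows."""
    h, w = len(img), len(img[0])
    mask = [False] * h
    for size in scales:
        stride = max(1, size // 2)
        for top in range(0, h - size + 1, stride):
            for left in range(0, w - size + 1, stride):
                patch = [row[left:left + size] for row in img[top:top + size]]
                if classifier(patch):
                    for y in range(top, top + size):
                        mask[y] = True
    return mask


def close_1d(mask: Sequence[bool], radius: int) -> List[bool]:
    """Morphological closing (dilation, then erosion) with a window of 2*radius+1 rows."""
    pad = [False] * radius
    m = pad + list(mask) + pad  # pad so the closing behaves correctly at the borders
    n = len(m)
    dilated = [any(m[max(0, i - radius):i + radius + 1]) for i in range(n)]
    eroded = [i - radius >= 0 and i + radius < n
              and all(dilated[i - radius:i + radius + 1]) for i in range(n)]
    return eroded[radius:n - radius]


def runs(mask: Sequence[bool]) -> List[Tuple[int, int]]:
    """Return (first_row, last_row) for every maximal run of True values."""
    regions, start = [], None
    for y, flag in enumerate(list(mask) + [False]):
        if flag and start is None:
            start = y
        elif not flag and start is not None:
            regions.append((start, y - 1))
            start = None
    return regions


def detect_text_rows(img: Image, scales: Sequence[int] = (2, 4), radius: int = 1):
    return runs(close_1d(text_row_mask(img, scales, density_classifier), radius))


# Toy page: 20 rows x 16 columns with two solid "text lines" at rows 2-5 and 12-15.
page = [[1] * 16 if 2 <= y <= 5 or 12 <= y <= 15 else [0] * 16 for y in range(20)]
print(detect_text_rows(page))  # [(2, 5), (12, 15)]
```

The closing step fills gaps of up to `2 * radius` rows inside a text region without enlarging its outer extent, which is why it is applied to the window labels before the regions are read off.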

In an embodiment of the present application, when the text detection mode based on connected component analysis is satisfied, the extraction module 400 is further configured to extract text characters from the document image, synthesize text segments based on the text characters, filter out the non-text segments among the text segments, and synthesize the filtered text segments into text lines.
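The connected component mode can be sketched in the same style. The example below assumes a binary image (1 = ink), uses 4-connectivity, and collapses the segment and line grouping into a single greedy merge by vertical overlap; the filter for non-text segments is reduced to a minimum character count. These are simplifications chosen for the example, not details taken from the embodiment.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive pixels


def connected_components(img: List[List[int]]) -> List[Box]:
    """Bounding boxes of 4-connected ink regions (candidate characters)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if not img[y0][x0] or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue = deque([(y0, x0)])
            left, top, right, bottom = x0, y0, x0, y0
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return boxes


def group_into_lines(boxes: List[Box], min_chars: int = 2) -> List[Box]:
    """Merge boxes whose vertical extents overlap; drop groups with too few characters."""
    lines = []  # each entry: [left, top, right, bottom, character_count]
    for l, t, r, b in sorted(boxes, key=lambda bx: (bx[1], bx[0])):
        for line in lines:
            if t <= line[3] and b >= line[1]:  # vertical overlap with this line
                line[0], line[1] = min(line[0], l), min(line[1], t)
                line[2], line[3] = max(line[2], r), max(line[3], b)
                line[4] += 1
                break
        else:
            lines.append([l, t, r, b, 1])
    return [tuple(line[:4]) for line in lines if line[4] >= min_chars]


# Toy page: three "characters" on rows 1-2, two on rows 6-7, one noise pixel at (4, 10).
page = [[0] * 12 for _ in range(10)]
for top, cols in ((1, (1, 4, 7)), (6, (1, 4))):
    for c in cols:
        for y in (top, top + 1):
            page[y][c] = page[y][c + 1] = 1
page[4][10] = 1

print(group_into_lines(connected_components(page)))  # [(1, 1, 8, 2), (1, 6, 5, 7)]
```

The isolated noise pixel forms a group of one component and is discarded by the `min_chars` filter, while the two real lines survive with their full bounding boxes.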

It should be noted that the explanation of the foregoing embodiment of the PDF document watermarking method for resisting print scan attack is also applicable to the PDF document watermarking device for resisting print scan attack in this embodiment, and is not repeated herein.

According to the print-scan-attack-resistant PDF document watermarking device provided by the embodiment of the application, the various attacks introduced in the printing and scanning process are analyzed and compensated for using text detection technology, so that the line spacing signal is calculated as accurately as possible from the document image generated after printing and scanning, which improves the robustness of the PDF document watermarking algorithm. In addition, to solve the problem that the calculated line spacing signal is amplified or reduced by a fixed multiple relative to the line spacing signal of the original text document, two improved PDF document watermarking algorithms that are sufficiently robust to fixed gain attacks are provided. According to experimental results, the proposed PDF document watermarking scheme is strongly robust to print-scan attacks.

In the description herein, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art can combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided that they do not contradict one another.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "N" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

Claims

1. A printing and scanning attack resistant PDF document watermarking method is characterized by comprising the following steps:

analyzing an initial PDF document to obtain a text line spacing signal of the initial PDF document;

embedding preset watermark information into the text line spacing signal to generate a line spacing signal containing a watermark;

generating a final PDF document from the initial PDF document based on the line spacing signal containing the watermark;

and generating a document image according to the final PDF document, detecting a bounding box of a text line from the document image, and calculating a line spacing signal of the final PDF image to extract watermark information.

2. The method of claim 1, further comprising:

and quantizing the line spacing signals based on a preset strategy of optimizing quantization step length so as to resist fixed gain attack.

3. The method of claim 1, further comprising, prior to detecting a bounding box of a text line from the document image:

detecting the requirement of text detection;

and judging, according to the requirement, whether a text detection mode based on sliding window classification or a text detection mode based on connected component analysis is satisfied.

4. The method according to claim 3, wherein, when the text detection mode based on sliding window classification is satisfied, the detecting a bounding box of a text line from the document image and calculating a line spacing signal of the final PDF image comprises:

sliding windows of multiple scales over the document image, and labeling each window as a text segment window or a non-text segment window by means of a classifier;

and performing a morphological operation on the text segment windows to determine a text region.

5. The method according to claim 3, wherein, when the text detection mode based on connected component analysis is satisfied, the detecting a bounding box of a text line from the document image and calculating a line spacing signal of the final PDF image comprises:

extracting text characters of the document image;

and synthesizing text segments based on the text characters, and filtering non-text segments in the text segments to synthesize the filtered text segments into text lines.

6. A PDF document watermarking device for resisting print scanning attack is characterized by comprising:

a parsing module, configured to parse an initial PDF document to obtain a text line spacing signal of the initial PDF document;

the embedding module is used for embedding preset watermark information into the text line spacing signal to generate a line spacing signal containing the watermark;

a generating module, configured to generate a final PDF document from the initial PDF document based on the line spacing signal containing the watermark;

and an extraction module, configured to generate a document image according to the final PDF document, detect a bounding box of a text line from the document image, and calculate a line spacing signal of the final PDF image to extract watermark information.

7. The apparatus of claim 6, further comprising:

and the processing module is used for carrying out quantization processing on the line spacing signal based on a preset strategy for optimizing quantization step length so as to resist fixed gain attack.

8. The apparatus of claim 6, further comprising:

the detection module is used for detecting the requirement of text detection before detecting the bounding box of the text line in the document image;

and the judging module is used for judging, according to the requirement, whether a text detection mode based on sliding window classification or a text detection mode based on connected component analysis is satisfied.

9. The apparatus of claim 8, wherein when the text detection mode based on sliding window classification is satisfied, the extraction module is further configured to slide windows of multiple scales over the document image, label each window as a text segment window or a non-text segment window by means of a classifier, and perform a morphological operation on the text segment windows to determine a text region.

10. The apparatus of claim 8, wherein when the text detection mode based on connected component analysis is satisfied, the extraction module is further configured to extract text characters of the document image, synthesize text segments based on the text characters, and filter non-text segments in the text segments to synthesize the filtered text segments into text lines.