WO2023155302A1 - PDF layout segmentation method and apparatus, electronic device, and storage medium - Google Patents

Publication number
WO2023155302A1
Authority
WO
WIPO (PCT)
Prior art keywords
pdf
layout
segmentation
loss function
layout segmentation
Prior art date
Application number
PCT/CN2022/090674
Other languages
English (en)
French (fr)
Inventor
唐小初
舒畅
陈又新
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2023155302A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Definitions

  • The present application relates to the technical field of artificial intelligence, and in particular to a PDF layout segmentation method and apparatus, an electronic device, and a storage medium.
  • As a cross-platform file format that supports multimedia, PDF is widely used in the Internet age, especially in the financial field. Although PDFs are convenient to transmit and read, extracting their content is inconvenient. PDFs that need to be edited or copied must therefore first be converted into HTML, Word, or other formats. Such a conversion first requires the reading order of the paragraphs. Because PDF typesetting often involves multi-column layouts and inserted tables, the reading order is difficult to obtain directly.
  • the embodiment of the present application proposes a PDF layout segmentation method, including:
  • the PDF image is input to a pre-trained PDF layout segmentation model for layout segmentation, to obtain a set of boundary points corresponding to the PDF image, and generate a set of horizontal lines and a set of vertical lines according to the set of boundary points;
  • The embodiment of the present application proposes a PDF layout segmentation apparatus, including:
  • an acquisition module, configured to acquire a PDF document and convert the PDF document into a PDF image;
  • a horizontal and vertical segmentation module, configured to input the PDF image into the pre-trained PDF layout segmentation model for layout segmentation and obtain the set of horizontal lines and the set of vertical lines corresponding to the PDF image;
  • a connectivity module, configured to obtain the reading blocks of the PDF image from the set of horizontal lines and the set of vertical lines according to preset connectivity rules;
  • a segmentation result module, configured to obtain the PDF layout segmentation result according to the reading order of the reading blocks.
  • the embodiment of the present application provides an electronic device, including:
  • the program is stored in a memory, and the processor executes the at least one program to implement a PDF layout segmentation method, wherein the PDF layout segmentation method includes:
  • the PDF image is input to a pre-trained PDF layout segmentation model for layout segmentation, to obtain a set of boundary points corresponding to the PDF image, and generate a set of horizontal lines and a set of vertical lines according to the set of boundary points;
  • The embodiment of the present application provides a storage medium. The storage medium is a computer-readable storage medium that stores computer-executable instructions, and the computer-executable instructions are used to cause a computer to execute a PDF layout segmentation method, wherein the PDF layout segmentation method includes:
  • the PDF image is input to a pre-trained PDF layout segmentation model for layout segmentation, to obtain a set of boundary points corresponding to the PDF image, and generate a set of horizontal lines and a set of vertical lines according to the set of boundary points;
  • The PDF layout segmentation method and apparatus, electronic device, and storage medium proposed in the embodiments of the present application are applicable not only to general PDF documents but also to PDF documents that contain multi-column layouts or inserted charts. Obtaining the reading order facilitates subsequent operations on the PDF content and improves the efficiency and accuracy of PDF layout segmentation.
  • Fig. 1 is a flow chart of a PDF layout segmentation method provided by an embodiment of the present application.
  • Fig. 2 is another flow chart of the PDF layout segmentation method provided by the embodiment of the present application.
  • FIG. 3 is a schematic diagram of PDF layout training samples of the PDF layout segmentation method provided by the embodiment of the present application.
  • Fig. 4 is a schematic diagram of the layout segmentation label of the PDF layout segmentation method provided by the embodiment of the present application.
  • Fig. 5 is a schematic diagram of layout segmentation of the PDF layout segmentation method provided by the embodiment of the present application.
  • FIG. 6 is another schematic diagram of PDF layout training samples of the PDF layout segmentation method provided by the embodiment of the present application.
  • FIG. 7 is yet another flow chart of the PDF layout segmentation method provided by the embodiment of the present application.
  • Fig. 8 is a structural block diagram of the PDF layout segmentation device provided by the embodiment of the present application.
  • FIG. 9 is a schematic diagram of a hardware structure of an electronic device provided by an embodiment of the present application.
  • PDF (Portable Document Format): an electronic file format developed by Adobe Systems for exchanging files independently of applications, operating systems, and hardware.
  • The PDF file is based on the PostScript language imaging model. The format is independent of the operating system platform and is supported on Windows, Unix, and Mac OS. On any printer it ensures accurate color and a faithful print result, that is, PDF faithfully reproduces every character, color, and image of the original. It is an ideal format for electronic document distribution and digital information dissemination.
  • Unet network model: an image segmentation network.
  • The segmentation process is: input an image, encode (downsample) it, then decode (upsample) it, and output the target segmentation result of the image. The image segmentation network is then trained by backpropagation according to the difference between the target segmentation result and the real segmentation result.
  • Its network structure is mainly divided into three parts: downsampling, upsampling, and skip connections.
  • The network can be analyzed as a left part and a right part.
  • The left part is the compression process, that is, the encoder, which reduces the image size through convolution and downsampling and extracts some shallow features.
  • The right part is the decoding process, that is, the decoder, which obtains some deep features through convolution and upsampling.
  • The convolutions use valid padding so that every result is computed from complete context features; as a consequence, the image size shrinks after each convolution.
  • The shallow features obtained in the encoding stage are combined with the deep features obtained in the decoding stage to refine the image, prediction is performed on the combined result, and a 1x1 convolution then classifies each position to obtain the target segmentation result.
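  • The effect of valid padding on the image size can be checked with a little arithmetic. The sketch below is only an illustration: the 3x3 kernel, the stride of 1, and the 572-pixel input side are assumed values that do not come from the application.

```python
def valid_conv_output_size(size: int, kernel: int = 3, stride: int = 1) -> int:
    """Spatial size after one convolution with valid padding (no padding added)."""
    return (size - kernel) // stride + 1


# Assumed example: a 572-pixel side passed through two 3x3 valid convolutions.
size = 572
for _ in range(2):
    size = valid_conv_output_size(size)
print(size)  # 568: each 3x3 valid convolution removes 2 pixels per side length
```

  • This shrinkage is why skip connections in a valid-padding Unet usually have to crop the encoder features before combining them with the decoder features.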
  • ResNest model: an improved version of the ResNet model, used for tasks such as object detection and image segmentation.
  • It introduces the Split-Attention module on top of the ResNet model.
  • In essence, Split-Attention can be understood as an attention supervision mechanism applied to feature slices.
  • The ResNest model, especially ResNeSt-50, achieves better image classification performance on the ImageNet dataset.
  • Models that use ResNeSt-50 as the backbone, such as the Faster-RCNN model and the DeeplabV3 model, also compare favorably with their ResNet-50 counterparts on metrics such as mIOU.
  • As a cross-platform file format that supports multimedia, PDF is widely used in the Internet age, especially in the financial field. Although PDFs are convenient to transmit and read, extracting their content is inconvenient. PDFs that need to be edited or copied must therefore first be converted into HTML, Word, or other formats. Such a conversion first requires the reading order of the paragraphs. Because PDF typesetting often involves multi-column layouts and inserted tables, the reading order is difficult to obtain directly. In the related art, some PDF layout analysis algorithms divide the layout directly by the writing order stored in the PDF. However, a PDF is often edited out of order (later content first, earlier content afterwards), so the stored writing order does not always match the reading order. This is especially true for documents with multi-column layouts or inserted charts, for which the reading order obtained by such algorithms is not very accurate. Other approaches combine OCR with manual operation, and their processing efficiency is low.
  • On this basis, the embodiments of the present application provide a PDF layout segmentation method and apparatus, an electronic device, and a storage medium. The method acquires a PDF document, converts it into a PDF image, and inputs the PDF image into the pre-trained PDF layout segmentation model for layout segmentation to obtain the set of boundary points corresponding to the PDF image. A set of horizontal lines and a set of vertical lines are generated from the set of boundary points, the reading blocks of the PDF image are obtained from these two sets according to preset connectivity rules, and finally the PDF layout segmentation result is obtained according to the reading order of the reading blocks.
  • This embodiment is not only applicable to general PDF documents, but also applicable to PDF documents containing operations such as dividing columns or inserting charts. Obtaining the reading order facilitates subsequent PDF content operations, and improves the efficiency and accuracy of PDF layout segmentation.
  • Embodiments of the present application provide a PDF layout segmentation method and device, an electronic device, and a storage medium, which are specifically described through the following embodiments. Firstly, the PDF layout segmentation method in the embodiment of the present application is described.
  • The embodiments of the present application may acquire and process relevant data based on artificial intelligence technology.
  • Artificial intelligence (AI) is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the PDF page segmentation method provided in the embodiment of the present application relates to the technical field of artificial intelligence, especially to the technical field of data mining.
  • the PDF page segmentation method provided by the embodiment of the present application can be applied to a terminal or a server, and can also be software running on the terminal or the server.
  • the terminal can be a smart phone, a tablet computer, a notebook computer, a desktop computer or a smart watch, etc.
  • The server can be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.
  • The software can be an application that implements the PDF layout segmentation method, but is not limited to the above forms.
  • The application can be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
  • This application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • the application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • program modules may be located in both local and remote computer storage media including storage devices.
  • Fig. 1 is an optional flow chart of the PDF layout segmentation method provided by the embodiment of the present application.
  • The method in Fig. 1 may include, but is not limited to, steps S110 to S140.
  • Step S110: acquire a PDF document and convert the PDF document into a PDF image.
  • The first method: take each page of the PDF document and convert it into a corresponding image, that is, convert one PDF document into as many image files as it has pages. For example, a PDF document of 100 pages is converted into 100 image files.
  • The second method: read the height and width of each page in the PDF document, take the sum of the page heights as the target height and the largest page width as the target width, then store the content of each page in memory and stitch the data in memory into one long image.
  • This embodiment can also use the open-source PDFBox tool to convert PDF documents into PDF images.
  • The above conversion methods are only examples; this embodiment is not limited to these two methods, and a different conversion method can be chosen as needed.
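  • The second conversion method (sum of page heights, largest page width, pages stitched into one long image) can be sketched as follows. This is a minimal illustration, not the application's implementation: the page representation (rows of grayscale pixel values), the white padding value, and the names `target_size` and `stitch_pages` are all assumptions.

```python
from typing import List, Tuple

Page = List[List[int]]  # assumed representation: rows of grayscale pixel values


def target_size(page_sizes: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (target_height, target_width) for a list of (height, width) pages."""
    target_height = sum(h for h, _ in page_sizes)  # heights are added up
    target_width = max(w for _, w in page_sizes)   # the largest width wins
    return target_height, target_width


def stitch_pages(pages: List[Page], background: int = 255) -> Page:
    """Stack pages vertically, padding narrower pages on the right (assumed choice)."""
    sizes = [(len(page), len(page[0])) for page in pages]
    _, width = target_size(sizes)
    long_image: Page = []
    for page in pages:
        for row in page:
            long_image.append(row + [background] * (width - len(row)))
    return long_image


# Two tiny "pages": one 2x2 and one 1x3.
pages = [[[0, 0], [0, 0]], [[1, 1, 1]]]
long_image = stitch_pages(pages)
print(len(long_image), len(long_image[0]))  # 3 3
```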
  • Step S120: input the PDF image into the pre-trained PDF layout segmentation model for layout segmentation, obtain the set of boundary points corresponding to the PDF image, and generate a set of horizontal lines and a set of vertical lines from the set of boundary points.
  • The PDF layout segmentation model is an improved Unet network model.
  • The improvement is that the downsampling module of the encoder in the Unet network model is replaced with a ResNest50 feature extraction module.
  • The Unet network model in the related art is trained from scratch: every sample is brand-new data for the model, and no model weights from earlier training are reused. This embodiment therefore replaces the downsampling module of the Unet encoder with the ResNest50 feature extraction module, which performs the downsampling that reduces the image size and extracts shallow features for the subsequent image segmentation.
  • The ResNest model introduces the Split-Attention module on top of the ResNet model for attention supervision, and ResNeSt-50, a common configuration of the ResNest model, has better image classification performance on the ImageNet dataset. Building the PDF layout segmentation model on the improved Unet network model can therefore improve the accuracy of image segmentation.
  • Step S120 specifically includes: inputting the PDF image into the pre-trained PDF layout segmentation model for layout segmentation to obtain the boundary points of each content block in the PDF image; forming the set of boundary points from these boundary points; obtaining the boundary lines of each content block from the coordinates of the boundary points in the set; and generating the set of horizontal lines and the set of vertical lines from the boundary lines.
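  • As a rough illustration of how boundary points could be turned into boundary lines, the sketch below assumes rectangular content blocks and represents each horizontal line by its y coordinate and each vertical line by its x coordinate. The function names and the simplified line representation are assumptions made for this example only.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates of one boundary point


def boundary_lines(points: List[Point]) -> Dict[str, List[int]]:
    """Boundary lines of one rectangular block: two horizontal (y) and two vertical (x)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return {"horizontal": [min(ys), max(ys)], "vertical": [min(xs), max(xs)]}


def first_line_sets(blocks: List[List[Point]]) -> Tuple[List[int], List[int]]:
    """Collect the boundary lines of all blocks into sorted, de-duplicated sets."""
    horizontal, vertical = set(), set()
    for points in blocks:
        lines = boundary_lines(points)
        horizontal.update(lines["horizontal"])
        vertical.update(lines["vertical"])
    return sorted(horizontal), sorted(vertical)


blocks = [
    [(10, 10), (200, 10), (10, 80), (200, 80)],     # block A corners
    [(10, 100), (90, 100), (10, 300), (90, 300)],   # block B corners
]
h_lines, v_lines = first_line_sets(blocks)
print(h_lines)  # [10, 80, 100, 300]
print(v_lines)  # [10, 90, 200]
```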
  • the process of training the PDF layout segmentation model includes but is not limited to steps S210 to S240:
  • Step S210 constructing an initial PDF layout segmentation model and layout training data set.
  • The initial PDF layout segmentation model is an improved Unet network model.
  • The layout training data set includes PDF layout training samples and the corresponding layout segmentation labels. The PDF layout training samples are PDF images converted from PDF documents, and each layout segmentation label is a horizontal line or a vertical line with a preset width.
  • In this embodiment the preset width of the layout segmentation label is 40 pixels. That is, the PDF layout training samples are annotated with thick horizontal or vertical lines rather than the thin lines used in the related art, because thick lines are easier to recognize during segmentation. The preset width is given only as an illustration and is not a limitation.
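  • A thick-line label can be pictured as a binary mask in which the pixels of the line are 1 and all other pixels are 0. The sketch below draws one horizontal line of the preset width (40 pixels); the mask representation, the centering of the line on `center_y`, and the function name are assumptions for illustration.

```python
from typing import List


def horizontal_line_label(height: int, width: int, center_y: int,
                          line_width: int = 40) -> List[List[int]]:
    """Binary mask with one horizontal band of `line_width` rows set to 1.

    The band is clipped to the image if it would extend past the bottom edge.
    """
    top = max(0, center_y - line_width // 2)
    bottom = min(height, top + line_width)
    return [[1 if top <= y < bottom else 0 for _ in range(width)]
            for y in range(height)]


mask = horizontal_line_label(height=100, width=4, center_y=50)
print(sum(row[0] for row in mask))  # 40 rows belong to the line
```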
  • Referring to FIG. 3, which is a schematic diagram of a PDF layout training sample in an embodiment of the present application.
  • The PDF layout training sample is a converted PDF image.
  • The figure shows 7 different content blocks, block 1 to block 7.
  • The content of a block can be text, a table, or a picture, which is not limited here.
  • Referring to FIG. 4, which is a schematic diagram of a layout segmentation label in an embodiment of the present application.
  • Different horizontal lines and vertical lines are marked as layout segmentation labels according to the distribution of content blocks in Figure 4, and subsequent layout segmentation can be performed according to the relationship between the horizontal lines and the vertical lines.
  • The method also includes performing sample expansion on the PDF layout training samples in the layout training data set, specifically: 1) selecting a scaling value in the range from the lower limit to the upper limit of the scaling threshold; 2) scaling the PDF layout training sample by the scaling value to obtain a scaled sample.
  • For example, the lower limit of the scaling threshold can be 0.9 and the upper limit can be 1.1. A scaling value is selected between 0.9 and 1.1, the PDF layout training sample (that is, the converted PDF image) is scaled by it to obtain a scaled sample, and the scaled sample together with its corresponding layout segmentation label is added to the layout training data set, thereby expanding the data set.
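  • The scaling-based sample expansion can be sketched as below. The uniform sampling, the rounding of the new size, and the function names are assumptions; the application only states that a scaling value between 0.9 and 1.1 is selected. Presumably the label image would be resized with the same factor so that it stays aligned with the sample.

```python
import random
from typing import Optional, Tuple


def pick_scale(lower: float = 0.9, upper: float = 1.1,
               rng: Optional[random.Random] = None) -> float:
    """Draw a scaling value between the lower and upper scaling thresholds."""
    rng = rng or random.Random()
    return rng.uniform(lower, upper)


def scaled_size(height: int, width: int, scale: float) -> Tuple[int, int]:
    """Image size after scaling both sides by the same factor."""
    return max(1, round(height * scale)), max(1, round(width * scale))


rng = random.Random(0)  # fixed seed so the example is repeatable
scale = pick_scale(rng=rng)
new_height, new_width = scaled_size(1000, 700, scale)
print(scale, new_height, new_width)
```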
  • Step S220: input the PDF layout training sample into the initial PDF layout segmentation model for layout segmentation and obtain a layout segmentation prediction value.
  • The initial PDF layout segmentation model performs layout segmentation on the input PDF layout training sample, obtains the boundary points of each content block according to the principle of image recognition, and forms a set of boundary points. The boundary lines of each content block are then obtained from the coordinates of the boundary points in the set. For example, if content blocks are delimited as rectangles, each content block has a rectangular boundary made of two horizontal lines and two vertical lines. The boundary lines of all content blocks together constitute the first set of horizontal lines and the first set of vertical lines.
  • Horizontal line division proceeds from top to bottom: starting from the first horizontal line, horizontal lines are selected downwards one by one. If a horizontal line passes through a content block, it is deleted and selection continues downwards until a horizontal line that does not pass through any content block is found. The interval between these two horizontal lines is called a segment. The next segment is selected in the same way until all horizontal lines in the first set of horizontal lines have been considered.
  • The two horizontal lines that lie between adjacent segments are merged into one horizontal line with the preset width. The position of the merged line only needs to lie between the two content blocks and is not otherwise restricted. This yields the set of horizontal lines.
  • Vertical line division proceeds from left to right within each segment (that is, considering only the boundary lines of the content blocks in that segment): vertical lines are selected from the first set of vertical lines in the same way as for the set of horizontal lines, and they constitute the set of vertical lines.
  • The set of horizontal lines and the set of vertical lines are used as the layout segmentation prediction value.
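  • The top-to-bottom selection of horizontal lines can be sketched as follows. This is one possible reading of the procedure, under stated assumptions: each content block is reduced to its (top, bottom) y range, a line "passes through" a block when it lies strictly inside that range, and a segment is an interval between two kept lines that contains at least one block. The coordinates are made up for the example.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (top, bottom) y range of one content block


def crosses_block(y: int, blocks: List[Span]) -> bool:
    """True if the horizontal line at height y cuts through any content block."""
    return any(top < y < bottom for top, bottom in blocks)


def filter_horizontal_lines(lines: List[int], blocks: List[Span]) -> List[int]:
    """Scan lines top to bottom and drop those that pass through a block."""
    return [y for y in sorted(lines) if not crosses_block(y, blocks)]


def find_segments(kept: List[int], blocks: List[Span]) -> List[Span]:
    """Intervals between consecutive kept lines that contain at least one block."""
    return [
        (upper, lower)
        for upper, lower in zip(kept, kept[1:])
        if any(upper <= top and bottom <= lower for top, bottom in blocks)
    ]


blocks = [(10, 100), (30, 70), (120, 200), (140, 260), (280, 400), (330, 400)]
lines = sorted({y for span in blocks for y in span})
kept = filter_horizontal_lines(lines, blocks)
print(kept)                         # [10, 100, 120, 260, 280, 400]
print(find_segments(kept, blocks))  # [(10, 100), (120, 260), (280, 400)]
```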
  • Referring to FIG. 5, which is a schematic diagram of layout division in an embodiment of the present application.
  • The PDF layout training sample in Figure 3 is divided into layouts: the boundary points of each content block are obtained according to the principle of image recognition (the boundary points of each content block are shown in the figure), and the boundary lines of each content block are then obtained from the coordinates of the boundary points. For example, if content blocks are delimited as rectangles, each content block has a rectangular boundary made of two horizontal lines and two vertical lines.
  • The boundary lines obtained for each content block are:
  • block 1: horizontal boundary lines 1-1 and 1-4, vertical boundary lines 2-1 and 2-5;
  • block 2: horizontal boundary lines 1-2 and 1-3, vertical boundary lines 2-7 and 2-10;
  • block 3: horizontal boundary lines 1-2 and 1-3, vertical boundary lines 2-1 and 2-2;
  • block 4: horizontal boundary lines 1-5 and 1-8, vertical boundary lines 2-3 and 2-4;
  • block 5: horizontal boundary lines 1-6 and 1-10, vertical boundary lines 2-76 and 2-11;
  • block 6: horizontal boundary lines 1-11 and 1-13, vertical boundary lines 2-1 and 2-8;
  • block 7: horizontal boundary lines 1-12 and 1-13, vertical boundary lines 2-9 and 2-11.
  • Together these form the first set of horizontal lines {1-1, 1-2, 1-3, 1-4, 1-5, 1-6, 1-7, 1-8, 1-9, 1-10, 1-11, 1-12, 1-13} and the first set of vertical lines.
  • Segments are then filtered from the first set of horizontal lines and the first set of vertical lines to obtain the set of horizontal lines and the set of vertical lines. Specifically, starting from the first horizontal line and moving from top to bottom, horizontal lines are selected one by one. If a horizontal line passes through a content block, it is deleted and selection continues until a horizontal line that does not pass through any content block is found. The interval between these two horizontal lines is called a segment. The next segment is selected in the same way until all horizontal lines in the first set of horizontal lines have been considered, which yields the set of horizontal lines.
  • In Figure 5, the first horizontal line is 1-1. Selecting downwards, horizontal line 1-2 passes through block 1 and is deleted; horizontal line 1-3 also passes through block 1 and is deleted; horizontal line 1-4 does not pass through any content block. The interval from horizontal line 1-1 to horizontal line 1-4 is therefore a segment, called segment 1.
  • The two horizontal lines between adjacent segments are merged into one horizontal line with the preset width: horizontal lines 1-4 and 1-5 can be merged, and horizontal lines 1-10 and 1-11 can be merged.
  • This yields the set of horizontal lines {1, 2, 3, 4} for the PDF layout training sample in Figure 5: horizontal line 1-1 in Figure 5 corresponds to horizontal line 1 of preset width in Figure 4; horizontal lines 1-4 and 1-5 in Figure 5 are merged into horizontal line 2 of preset width in Figure 4; horizontal lines 1-10 and 1-11 in Figure 5 are merged into horizontal line 3 of preset width in Figure 4; and horizontal line 1-13 in Figure 5 is horizontal line 4 of preset width in Figure 4.
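  • Merging the two border lines between adjacent segments into a single line can be sketched as below. Placing the merged line at the midpoint is an assumption made for this example; the text only requires the merged line to lie between the two content blocks. The segment coordinates are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (upper line y, lower line y) of one segment


def merge_segment_borders(segments: List[Segment]) -> List[float]:
    """Return one line position per border: top of the first segment,
    one merged line between each pair of adjacent segments, bottom of the last."""
    lines = [float(segments[0][0])]
    for (_, lower), (upper, _) in zip(segments, segments[1:]):
        lines.append((lower + upper) / 2)  # assumed placement: midpoint of the gap
    lines.append(float(segments[-1][1]))
    return lines


# Three segments give four horizontal lines, as in the {1, 2, 3, 4} example.
merged = merge_segment_borders([(10, 100), (120, 260), (280, 400)])
print(merged)  # [10.0, 110.0, 270.0, 400.0]
```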
  • the longitudinal line 2 of the preset width in Fig. 4 is the preset width in Fig. 4 Set the vertical line 7 of width.
  • The vertical line set of segment 2 is {8, 9, 10, 11}.
  • The vertical line set of segment 3 is {12, 13, 14}.
  • The set of horizontal lines is {1, 2, 3, 4}.
  • The set of vertical lines is {5, 6, 7, 8, 9, 10, 11, 12, 13, 14}.
  • The set of horizontal lines and the set of vertical lines are used as the layout segmentation prediction value.
  • Referring to FIG. 6, which is another schematic diagram of a PDF layout training sample in an embodiment of the present application.
  • According to the layout division method above, block 6, block 7, and block 8 all belong to segment 3. When segment 3 is divided by vertical lines, block 7 and block 8 can therefore be treated as one complete content block; after the vertical division is complete, that block is divided by horizontal lines to obtain the corresponding set of horizontal lines.
  • Step S230: calculate the loss value from the layout segmentation prediction value and the corresponding layout segmentation label.
  • The method above yields the layout segmentation prediction value, that is, the set of horizontal lines and the set of vertical lines, which is compared with the corresponding layout segmentation label to calculate the loss value.
  • For example, the loss value can be calculated from the position difference between the horizontal and vertical lines in the predicted line sets and the horizontal or vertical lines in the layout segmentation label, that is, the loss value is the position error of the corresponding horizontal or vertical line.
  • Step S240: use the loss function to adjust the model weights of the initial PDF layout segmentation model according to the loss value until the loss function satisfies the convergence condition, thereby obtaining the trained PDF layout segmentation model.
  • PDF layout segmentation can be treated as a boundary detection problem, and boundary detection can in turn be regarded as a semantic segmentation problem: boundary pixels are labeled 1 and all other areas are labeled 0, which gives a binary semantic segmentation problem. Adjusting the model with a single loss function works poorly, so the loss function in this embodiment is a composite loss function obtained from a first loss function and a second loss function, expressed as:
  • $L = \alpha L_1 + \beta L_2$, with $\alpha + \beta = 1$
  • $\alpha$ represents the first weight
  • $\beta$ represents the second weight
  • $L$ represents the composite loss function
  • $L_1$ represents the first loss function
  • $L_2$ represents the second loss function
  • the first loss function is a Dice loss function
  • the second loss function is a weighted cross-entropy loss function
  • The Dice loss function is based on the Dice coefficient, a measure of the similarity of two samples. The coefficient ranges from 0 to 1, and the larger the value, the closer the two samples are. It is expressed as:
  • $\mathrm{Dice}(X, Y) = \dfrac{2\,|X \cap Y|}{|X| + |Y|}$
  • $X$ and $Y$ are the two samples
  • $\cap$ means taking the intersection
  • $|X|$ and $|Y|$ represent the number of elements in sample $X$ and sample $Y$ respectively.
  • In terms of pixels, the first loss function can be expressed as:
  • $L_1 = 1 - \dfrac{2 \sum_{i=1}^{N} p_i g_i}{\sum_{i=1}^{N} p_i + \sum_{i=1}^{N} g_i}$
  • $N$ represents the number of pixels
  • $g_i$ represents the sample label
  • $p_i$ represents the predicted value.
  • The values of $p_i$ and $g_i$ are 0 or 1 and indicate whether pixel $i$ is a boundary pixel: 1 if it is, 0 otherwise.
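  • A Dice loss over per-pixel values can be sketched as follows, using the common form "1 minus the Dice coefficient". The small smoothing term `eps`, which avoids division by zero when both inputs are empty, is an assumption added for this example and is not mentioned in the text.

```python
from typing import Sequence


def dice_loss(pred: Sequence[float], label: Sequence[float],
              eps: float = 1e-7) -> float:
    """1 - 2*sum(p*g) / (sum(p) + sum(g)), with a small smoothing term."""
    intersection = sum(p * g for p, g in zip(pred, label))
    total = sum(pred) + sum(label)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


# One boundary pixel is predicted correctly, one is a false positive.
loss = dice_loss([1, 1, 0, 0], [1, 0, 0, 0])
print(round(loss, 4))  # 0.3333
```

  • A perfect prediction gives a loss of 0, and a prediction with no overlap gives a loss close to 1.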
  • The second loss function is a weighted cross-entropy loss function. The unweighted cross-entropy loss function is expressed as:
  • $L_2' = -\left[\, y_{true} \log y_{pred} + (1 - y_{true}) \log (1 - y_{pred}) \,\right]$
  • $L_2'$ represents the cross-entropy loss function
  • $y_{true}$ represents the sample label
  • $y_{pred}$ represents the predicted value.
  • When the sample label is positive and the predicted value is negative, the prediction is a false negative; when the sample label is negative and the predicted value is positive, the prediction is a false positive.
  • In this embodiment false positives are penalized more heavily: different weights are applied to the two parts of the cross-entropy loss function, which gives the weighted cross-entropy loss function and improves the adjustment ability of the loss function.
  • The weighted cross-entropy loss function is expressed as:
  • $L_2 = -\left[\, \omega_0\, y_{true} \log y_{pred} + \omega_1\, (1 - y_{true}) \log (1 - y_{pred}) \,\right]$
  • $\omega_0$ and $\omega_1$ represent the weight of false negatives and the weight of false positives respectively.
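  • The weighted cross-entropy and its combination with a Dice loss value can be sketched as below. The weight values (1.0 for false negatives, 2.0 for false positives, 0.5 for the mix), the averaging over pixels, the clamping of predictions, and the names `w_fn`, `w_fp`, and `alpha` are all assumptions chosen for illustration; the text does not give concrete values.

```python
import math
from typing import Sequence


def weighted_cross_entropy(pred: Sequence[float], label: Sequence[int],
                           w_fn: float = 1.0, w_fp: float = 2.0,
                           eps: float = 1e-7) -> float:
    """Mean binary cross-entropy with separate weights for the two terms.

    w_fn scales the term that penalizes false negatives (label 1, low prediction);
    w_fp scales the term that penalizes false positives (label 0, high prediction).
    """
    total = 0.0
    for p, y in zip(pred, label):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(w_fn * y * math.log(p) + w_fp * (1 - y) * math.log(1.0 - p))
    return total / len(pred)


def composite_loss(first_loss: float, second_loss: float, alpha: float = 0.5) -> float:
    """Weighted sum of two loss values whose weights add up to 1."""
    return alpha * first_loss + (1.0 - alpha) * second_loss


wce = weighted_cross_entropy([0.5, 0.5], [1, 0])
print(round(wce, 4))                       # 1.0397
print(round(composite_loss(0.2, wce), 4))  # 0.6199
```

  • With these assumed weights, an uncertain prediction on a non-boundary pixel costs twice as much as the same prediction on a boundary pixel, which reflects the heavier penalty on false positives.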
  • The parameters of the initial PDF layout segmentation model are adjusted according to the loss value until the loss function meets the convergence condition, which yields the PDF layout segmentation model.
  • The convergence condition may be that the loss function is minimized, that is, the parameters of the initial PDF layout segmentation model are optimized by minimizing each loss function.
  • Step S130: obtain the reading blocks of the PDF image from the horizontal line set and the vertical line set according to the preset connectivity rule.
  • step S130 includes but is not limited to steps S131 to S133:
  • Step S131: read two horizontal lines from the horizontal line set in order from top to bottom.
  • Step S132: read, in order from left to right, all vertical lines in the vertical line set that lie between the two horizontal lines.
  • Step S133: generate a reading block between every two adjacent vertical lines, until all horizontal lines in the horizontal line set have been read.
  • the set of horizontal lines is {1, 2, 3, 4}
  • the set of vertical lines is {5, 6, 7, 8, 9, 10, 11, 12, 13, 14}.
  • the vertical lines lying between horizontal line 1 and horizontal line 2 are vertical lines 5, 6, and 7; generating a reading block between every two adjacent vertical lines gives reading block 1 (corresponding to block 1 in Figure 4) and reading block 2 (corresponding to block 2 in Figure 4).
  • Step S140: obtain the PDF layout segmentation result according to the reading order of the reading blocks.
  • two horizontal lines are first selected from the horizontal line set from top to bottom, and the reading blocks between them are then read from left to right; this is repeated until all horizontal lines in the horizontal line set have been read, and the PDF layout segmentation result is obtained from the resulting reading order of the reading blocks.
  • the reading order finally generated by the above embodiment is: reading block 1 -> reading block 2 -> reading block 3 -> reading block 4 -> reading block 5 -> reading block 6 -> reading block 7, which is the PDF layout segmentation result.
  • the PDF layout segmentation method acquires a PDF document and converts it into a PDF image, inputs the PDF image into a pre-trained PDF layout segmentation model for layout segmentation to obtain the boundary point set corresponding to the PDF image, generates a horizontal line set and a vertical line set from the boundary point set, obtains the reading blocks of the PDF image from the two line sets according to the preset connectivity rule, and finally obtains the PDF layout segmentation result according to the reading order of the reading blocks.
  • This embodiment applies not only to ordinary PDF documents but also, in particular, to PDF documents with multi-column layouts or inserted charts. Obtaining the reading order facilitates subsequent operations on the PDF content and improves the efficiency and accuracy of PDF layout segmentation.
  • the embodiment of the present application is based on image segmentation, which can draw on the image for more complete layout information; annotating the PDF image by drawing lines makes the reading order relatively easy to solve for. This avoids the non-learnable nature of purely rule-based judgment in the related art, and also avoids the poor robustness of NLP algorithms that segment at the fine granularity of paragraphs or sentences.
  • the reading order of the reading blocks in the PDF document is obtained, the coherent content of the document is obtained, and subsequent analysis can be conveniently performed on this basis.
  • the embodiment of the present application also provides a PDF layout splitting device, which can implement the above PDF layout splitting method.
  • the device includes:
  • an acquisition module 810, configured to acquire a PDF document and convert the PDF document into a PDF image;
  • a horizontal and vertical segmentation module 820, configured to input the PDF image into the pre-trained PDF layout segmentation model for layout segmentation and obtain the horizontal line set and the vertical line set corresponding to the PDF image;
  • a connectivity module 830, configured to obtain the reading blocks of the PDF image from the horizontal line set and the vertical line set according to the preset connectivity rule;
  • a segmentation result module 840, configured to obtain the PDF layout segmentation result according to the reading order of the reading blocks.
  • the acquisition module 810 can convert the acquired PDF document into a PDF image in two ways.
  • the first method: take each page of the PDF document and convert it into one image, so that a PDF document becomes as many image files as it has pages; for example, a PDF document of 100 pages is converted into 100 image files.
  • the second method: from the height and width of each page in the PDF document, add up the page heights and take the sum as the target height, compare the page widths and take the largest as the target width, then store the content of each page and stitch the data in memory into a single long image.
  • this embodiment can also use the open-source PDFBox tool to convert PDF documents into PDF images.
  • the above conversion methods are only examples and do not mean that this embodiment can convert PDF documents into PDF images only in these two ways; a different conversion method can be chosen as needed.
  • the PDF layout segmentation model is an improved Unet network model
  • the improvement is that the downsampling module of the encoder in the Unet network model is replaced with the ResNest50 feature extraction module. In the related art, the Unet network model trains on samples from scratch, that is, every sample is entirely new data to the model and the model weight information from earlier training is not used; this embodiment therefore replaces the encoder's downsampling module with the ResNest50 feature extraction module, which downsamples to reduce the image size and extracts shallow features for the subsequent image segmentation.
  • the ResNest model introduces the Split-Attention module on top of the ResNet model for attention supervision
  • the ResNeSt-50 model, a common configuration of the ResNest model, performs well at image classification on the ImageNet dataset, so in this embodiment building the PDF layout segmentation model from the improved Unet network model can improve the accuracy of image segmentation.
  • the loss function of the PDF layout segmentation model is obtained from the first loss function and the second loss function, where the first loss function can be a dice loss function and the second loss function can be a weighted cross-entropy loss function.
  • the model training module 850 is used to train the PDF layout segmentation model.
  • the training process includes: 1) building an initial PDF layout segmentation model and a layout training data set, where the layout training data set includes PDF layout training samples and the corresponding layout segmentation labels, and a layout segmentation label is a horizontal line of preset width or a vertical line of preset width; 2) inputting the PDF layout training samples into the initial PDF layout segmentation model for layout segmentation to obtain the layout segmentation prediction; 3) calculating the loss value from the layout segmentation prediction and the corresponding layout segmentation label; 4) using the loss function to adjust the model weights of the initial PDF layout segmentation model according to the loss value, until the PDF layout segmentation model is obtained through training.
  • the PDF layout segmentation device of this embodiment is suitable not only for layout segmentation of ordinary PDF documents but also, in particular, for PDF documents with multi-column layouts or inserted charts. Obtaining the reading order facilitates subsequent operations on the PDF content and can improve the efficiency and accuracy of PDF layout segmentation.
  • the specific implementation manner of the PDF layout segmentation device of this embodiment is basically the same as the specific implementation manner of the above-mentioned PDF layout segmentation method, and will not be repeated here.
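  • the embodiment of the present application also provides an electronic device, including at least one memory, at least one processor, and at least one program.
  • the program is stored in the memory, and the processor executes the at least one program to implement a PDF layout segmentation method, wherein the PDF layout segmentation method includes: acquiring a PDF document and converting it into a PDF image; inputting the PDF image into a pre-trained PDF layout segmentation model for layout segmentation to obtain the boundary point set corresponding to the PDF image, and generating a horizontal line set and a vertical line set from the boundary point set; obtaining the reading blocks of the PDF image from the horizontal line set and the vertical line set according to a preset connectivity rule; and obtaining the PDF layout segmentation result according to the reading order of the reading blocks.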
  • the electronic device may be any intelligent terminal including a mobile phone, a tablet computer, a personal digital assistant (PDA for short), a vehicle-mounted computer, and the like.
  • FIG. 9 illustrates a hardware structure of an electronic device in another embodiment.
  • the electronic device includes:
  • the processor 901 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute related programs so as to realize the technical solutions provided by the embodiments of the present application;
  • the memory 902 may be implemented in the form of a ROM (ReadOnly Memory, read-only memory), a static storage device, a dynamic storage device, or a RAM (Random Access Memory, random access memory).
  • the memory 902 can store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented in software or firmware, the relevant program code is stored in the memory 902 and called by the processor 901 to execute the PDF layout segmentation method of the embodiments of the present application;
  • the input/output interface 903 is used to realize information input and output
  • the communication interface 904 is used to realize the communication interaction between the device and other devices, and the communication can be realized through a wired method (such as USB, network cable, etc.), or can be realized through a wireless method (such as a mobile network, WIFI, Bluetooth, etc.); and
  • bus 905 for transferring information between various components of the device (such as processor 901, memory 902, input/output interface 903 and communication interface 904);
  • the processor 901 , the memory 902 , the input/output interface 903 and the communication interface 904 are connected to each other within the device through the bus 905 .
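  • the embodiment of the present application also provides a storage medium. The storage medium is a computer-readable storage medium that stores computer-executable instructions, and the computer-executable instructions are used to make a computer perform a PDF layout segmentation method, wherein the PDF layout segmentation method includes: acquiring a PDF document and converting it into a PDF image; inputting the PDF image into a pre-trained PDF layout segmentation model for layout segmentation to obtain the boundary point set corresponding to the PDF image, and generating a horizontal line set and a vertical line set from the boundary point set; obtaining the reading blocks of the PDF image from the horizontal line set and the vertical line set according to a preset connectivity rule; and obtaining the PDF layout segmentation result according to the reading order of the reading blocks.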
  • This embodiment is not only applicable to general PDF documents, but also applicable to PDF documents containing operations such as dividing columns or inserting charts. Obtaining the reading order facilitates subsequent PDF content operations, and improves the efficiency and accuracy of PDF layout segmentation.
  • the computer-readable storage medium may be non-volatile or volatile.
  • memory can be used to store non-transitory software programs and non-transitory computer-executable programs.
  • the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage devices.
  • the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor via a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • "At least one (item)" means one or more, and "multiple" means two or more.
  • "And/or" describes the association between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural.
  • the character "/" generally indicates an "or" relationship between the objects before and after it.
  • "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items.
  • "At least one of a, b or c" can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, c can each be single or multiple.
  • the disclosed devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components can be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • if the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include various media that can store programs, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Character Input (AREA)

Abstract

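Embodiments of the present application provide a PDF layout segmentation method and apparatus, an electronic device, and a storage medium, relating to the technical field of artificial intelligence. The PDF layout segmentation method includes: acquiring a PDF document and converting it into a PDF image; inputting the PDF image into a pre-trained PDF layout segmentation model for layout segmentation to obtain the boundary point set corresponding to the PDF image, and generating a horizontal line set and a vertical line set from the boundary point set; obtaining the reading blocks of the PDF image from the horizontal line set and the vertical line set according to a preset connectivity rule; and finally obtaining the PDF layout segmentation result according to the reading order of the reading blocks. The embodiments apply not only to ordinary PDF documents but also, in particular, to PDF documents with multi-column layouts or inserted charts; obtaining the reading order facilitates subsequent operations on the PDF content and improves the efficiency and accuracy of PDF layout segmentation.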

Description

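PDF layout segmentation method and apparatus, electronic device, and storage medium
This application claims priority to Chinese patent application No. 202210143646.X, entitled "PDF layout segmentation method and apparatus, electronic device, and storage medium", filed with the China Patent Office on February 16, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of artificial intelligence, and in particular to a PDF layout segmentation method and apparatus, an electronic device, and a storage medium.
Background
PDF, a cross-platform file format that supports multimedia, is very widely used in the Internet era, especially in the financial field. Although PDF files are convenient to transmit and read, extracting their content is very inconvenient. A PDF whose content needs to be edited or copied must therefore first be converted into a format such as HTML or Word. PDF conversion first requires the reading order of the paragraphs; because PDF typesetting often involves multiple columns, inserted tables, and similar operations, obtaining the reading order directly is difficult.
Technical Problem
The following are technical problems of the prior art as recognized by the inventors. Some PDF layout analysis algorithms segment the layout directly according to the order in which content was written into the PDF, but during editing later parts are often edited before earlier parts, so the reading order does not fully match the writing order; for PDF documents with multi-column layouts or inserted charts in particular, the reading order obtained by such algorithms is not very accurate. Other approaches combine OCR with manual operation, which is inefficient.
Technical Solution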
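In a first aspect, an embodiment of the present application provides a PDF layout segmentation method, including:
acquiring a PDF document and converting the PDF document into a PDF image;
inputting the PDF image into a pre-trained PDF layout segmentation model for layout segmentation to obtain a boundary point set corresponding to the PDF image, and generating a horizontal line set and a vertical line set from the boundary point set;
obtaining reading blocks of the PDF image from the horizontal line set and the vertical line set according to a preset connectivity rule;
obtaining a PDF layout segmentation result according to the reading order of the reading blocks.
In a second aspect, an embodiment of the present application provides a PDF layout segmentation apparatus, including:
an acquisition module, configured to acquire a PDF document and convert the PDF document into a PDF image;
a horizontal and vertical segmentation module, configured to input the PDF image into a pre-trained PDF layout segmentation model for layout segmentation to obtain a horizontal line set and a vertical line set corresponding to the PDF image;
a connectivity module, configured to obtain reading blocks of the PDF image from the horizontal line set and the vertical line set according to a preset connectivity rule;
a segmentation result module, configured to obtain a PDF layout segmentation result according to the reading order of the reading blocks.
In a third aspect, an embodiment of the present application provides an electronic device, including:
at least one memory;
at least one processor;
at least one program;
the program is stored in the memory, and the processor executes the at least one program to implement a PDF layout segmentation method, wherein the PDF layout segmentation method includes:
acquiring a PDF document and converting the PDF document into a PDF image;
inputting the PDF image into a pre-trained PDF layout segmentation model for layout segmentation to obtain a boundary point set corresponding to the PDF image, and generating a horizontal line set and a vertical line set from the boundary point set;
obtaining reading blocks of the PDF image from the horizontal line set and the vertical line set according to a preset connectivity rule;
obtaining a PDF layout segmentation result according to the reading order of the reading blocks.
In a fourth aspect, an embodiment of the present application provides a storage medium. The storage medium is a computer-readable storage medium storing computer-executable instructions, and the computer-executable instructions are used to make a computer perform a PDF layout segmentation method, wherein the PDF layout segmentation method includes:
acquiring a PDF document and converting the PDF document into a PDF image;
inputting the PDF image into a pre-trained PDF layout segmentation model for layout segmentation to obtain a boundary point set corresponding to the PDF image, and generating a horizontal line set and a vertical line set from the boundary point set;
obtaining reading blocks of the PDF image from the horizontal line set and the vertical line set according to a preset connectivity rule;
obtaining a PDF layout segmentation result according to the reading order of the reading blocks.
Beneficial Effects
The PDF layout segmentation method and apparatus, electronic device, and storage medium proposed in the embodiments of the present application apply not only to ordinary PDF documents but also, in particular, to PDF documents with multi-column layouts or inserted charts; obtaining the reading order facilitates subsequent operations on the PDF content and improves the efficiency and accuracy of PDF layout segmentation.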
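Brief Description of the Drawings
FIG. 1 is a flowchart of the PDF layout segmentation method provided by an embodiment of the present application.
FIG. 2 is another flowchart of the PDF layout segmentation method provided by an embodiment of the present application.
FIG. 3 is a schematic diagram of a PDF layout training sample for the PDF layout segmentation method provided by an embodiment of the present application.
FIG. 4 is a schematic diagram of the layout segmentation labels for the PDF layout segmentation method provided by an embodiment of the present application.
FIG. 5 is a schematic diagram of layout segmentation in the PDF layout segmentation method provided by an embodiment of the present application.
FIG. 6 is another schematic diagram of a PDF layout training sample for the PDF layout segmentation method provided by an embodiment of the present application.
FIG. 7 is another flowchart of the PDF layout segmentation method provided by an embodiment of the present application.
FIG. 8 is a structural block diagram of the PDF layout segmentation apparatus provided by an embodiment of the present application.
FIG. 9 is a schematic diagram of the hardware structure of the electronic device provided by an embodiment of the present application.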
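Embodiments of the Invention
To make the purpose, technical solutions, and advantages of the present application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the application and are not intended to limit it.
It should be noted that although functional modules are divided in the apparatus diagrams and a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed with a module division different from that in the apparatus, or in an order different from that in the flowcharts.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present application. The terms used herein are only for the purpose of describing the embodiments of the present application and are not intended to limit it.
First, several terms used in the present application are explained:
PDF (Portable Document Format): an electronic file format developed by Adobe Systems for exchanging files in a manner independent of applications, operating systems, and hardware. PDF files are based on the PostScript language imaging model. The format is independent of the operating system platform and is usable across Windows, Unix, and Mac OS; it ensures accurate color and accurate printing on any printer, that is, a PDF faithfully reproduces every character, color, and image of the original, making it an ideal document format for electronic document distribution and digital information dissemination.
Unet network model: an image segmentation network. The segmentation process is as follows: an image is input and encoded (downsampled), then decoded (upsampled), and the target segmentation result for the image is output; the network is then trained by backpropagation on the difference between the target segmentation result and the ground-truth segmentation. The network structure has three main parts: downsampling, upsampling, and skip connections. Viewed as a left half and a right half, the left half is the contracting path, i.e. the encoder, which mainly reduces the image size through convolution and downsampling and extracts shallow features. The right half is the decoding path, i.e. the decoder, which mainly obtains deep features through convolution and upsampling. The convolutions use valid padding so that every result is based on context features with nothing missing; consequently the image shrinks after each convolution. In between, the shallow features from the encoding stage are combined with the deep features from the decoding stage by concatenation (concat) to refine the image; segmentation is predicted from the result, and a 1x1 convolution performs classification to produce the target segmentation result.
ResNest model: an improved version of the ResNet model, used for tasks such as object detection and image segmentation. It introduces the Split-Attention module on top of the ResNet model; Split-Attention can be understood as an attention supervision mechanism applied to feature slices. The ResNest model performs well at image classification on the ImageNet dataset, especially the ResNeSt-50 variant. For example, a model with a ResNeSt-50 backbone (such as Faster-RCNN) scores 3.08% higher on mAP than the same model with ResNet-50, and a model with a ResNeSt-50 backbone (such as DeeplabV3) scores 3.02% higher on mIOU than the same model with ResNet-50.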
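PDF, a cross-platform file format that supports multimedia, is very widely used in the Internet era, especially in the financial field. Although PDF files are convenient to transmit and read, extracting their content is very inconvenient. A PDF whose content needs to be edited or copied must therefore first be converted into a format such as HTML or Word. PDF conversion first requires the reading order of the paragraphs; because PDF typesetting often involves multiple columns, inserted tables, and similar operations, obtaining the reading order directly is difficult. In the related art, some PDF layout analysis algorithms segment the layout directly according to the order in which content was written into the PDF, but during editing later parts are often edited before earlier parts, so the reading order does not fully match the writing order; for PDF documents with multi-column layouts or inserted charts in particular, the reading order obtained by such algorithms is not very accurate. Other approaches combine OCR with manual operation, which is inefficient.
On this basis, the embodiments of the present application provide a PDF layout segmentation method and apparatus, an electronic device, and a storage medium. The PDF layout segmentation method acquires a PDF document and converts it into a PDF image, inputs the PDF image into a pre-trained PDF layout segmentation model for layout segmentation to obtain the boundary point set corresponding to the PDF image, generates a horizontal line set and a vertical line set from the boundary point set, obtains the reading blocks of the PDF image from the two line sets according to a preset connectivity rule, and finally obtains the PDF layout segmentation result according to the reading order of the reading blocks. This embodiment applies not only to ordinary PDF documents but also, in particular, to PDF documents with multi-column layouts or inserted charts; obtaining the reading order facilitates subsequent operations on the PDF content and improves the efficiency and accuracy of PDF layout segmentation.
The PDF layout segmentation method and apparatus, electronic device, and storage medium provided by the embodiments of the present application are described through the following embodiments, beginning with the PDF layout segmentation method.
The embodiments of the present application can acquire and process the relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, methods, technology, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.
The PDF layout segmentation method provided by the embodiments of the present application relates to the technical field of artificial intelligence, and in particular to the field of data mining. The method can be applied in a terminal or on a server, or can be software running in a terminal or on a server. In some embodiments, the terminal may be a smartphone, tablet computer, notebook computer, desktop computer, smart watch, or the like; the server may be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (CDN), and big data and artificial intelligence platforms; the software may be an application that implements the PDF layout segmentation method, among others, but is not limited to the above forms.
The present application can be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices. The application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and so on that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments, where tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.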
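FIG. 1 is an optional flowchart of the PDF layout segmentation method provided by an embodiment of the present application. The method in FIG. 1 may include, but is not limited to, steps S110 to S140.
Step S110: acquire a PDF document and convert the PDF document into a PDF image.
In one embodiment, the acquired PDF document can be converted into a PDF image in two ways.
The first method: take each page of the PDF document and convert it into one image, so that a PDF document becomes as many image files as it has pages; for example, a PDF document of 100 pages is converted into 100 image files.
The second method: from the height and width of each page in the PDF document, add up the page heights and take the sum as the target height, compare the page widths and take the largest as the target width; once the target height and target width are determined, store the content of each page of the PDF document and stitch the data in memory into a single long image.
It can be understood that this embodiment can also use the open-source PDFBox tool to convert the PDF document into a PDF image. The above conversion methods are only examples and do not mean that this embodiment can convert a PDF document into a PDF image only in these two ways; a different conversion method can be chosen as needed.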
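The size bookkeeping for the second conversion method can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the patent's implementation: the function name `long_image_layout`, the `(width, height)` tuple layout, and the returned per-page vertical offsets are introduced here for illustration, and rendering pages or pasting pixels is left out.

```python
from typing import List, Tuple


def long_image_layout(page_sizes: List[Tuple[int, int]]) -> Tuple[int, int, List[int]]:
    """Return (target_width, target_height, offsets) for stitching pages vertically.

    page_sizes holds one (width, height) pair per page, in page order.
    offsets[i] is the y coordinate at which page i starts in the long image
    (a hypothetical convenience, not something the source describes).
    """
    if not page_sizes:
        raise ValueError("at least one page is required")
    # Target width is the largest page width.
    target_width = max(width for width, _ in page_sizes)
    offsets: List[int] = []
    y = 0
    for _, height in page_sizes:
        offsets.append(y)
        y += height  # target height accumulates as the sum of all page heights
    return target_width, y, offsets
```

For three pages of 600x800, 650x900, and 600x800 pixels, the sketch gives a 650x2500 canvas, with pages starting at y = 0, 800, and 1700.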
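Step S120: input the PDF image into the pre-trained PDF layout segmentation model for layout segmentation to obtain the boundary point set corresponding to the PDF image, and generate the horizontal line set and the vertical line set from the boundary point set.
In one embodiment, the PDF layout segmentation model is an improved Unet network model; the improvement is that the downsampling module of the encoder in the Unet network model is replaced with a ResNest50 feature extraction module. In the related art, the Unet network model trains on samples from scratch, that is, every sample is entirely new data to the Unet network model and the model weight information from earlier training is not used; this embodiment therefore replaces the encoder's downsampling module with the ResNest50 feature extraction module, which downsamples to reduce the image size and extracts shallow features for the subsequent image segmentation. Because the ResNest model introduces the Split-Attention module on top of the ResNet model for attention supervision, and the ResNeSt-50 model, a common configuration of the ResNest model, performs well at image classification on the ImageNet dataset, building the PDF layout segmentation model from the improved Unet network model in this embodiment can improve segmentation accuracy.
In one embodiment, step S120 specifically includes: inputting the PDF image into the pre-trained PDF layout segmentation model for layout segmentation to obtain the boundary points of each content block in the PDF image, forming the boundary point set from the boundary points, obtaining the boundary lines of each content block from the coordinates of the boundary points in the set, and generating the horizontal line set and the vertical line set from the boundary lines.
In one embodiment, referring to FIG. 2, the process of training the PDF layout segmentation model includes, but is not limited to, steps S210 to S240:
Step S210: build the initial PDF layout segmentation model and the layout training data set.
In one embodiment, the initial PDF layout segmentation model is the improved Unet network model, and the layout training data set includes PDF layout training samples and the corresponding layout segmentation labels, where a PDF layout training sample is a PDF image obtained by converting a PDF document and the corresponding layout segmentation label is a horizontal line of preset width or a vertical line of preset width.
In one embodiment, the preset width of a layout segmentation label is 40 pixels, that is, the PDF layout training samples are annotated with relatively thick horizontal or vertical lines. This differs from the thin lines used in the related art; thick lines are better suited to segment recognition. It can be understood that the preset width is only illustrative and is not limiting.
Referring to FIG. 3, a schematic diagram of a PDF layout training sample in an embodiment of the present application. In this embodiment the PDF layout training sample is a converted PDF image; the figure shows seven different content blocks, block 1 to block 7, whose content may be text, tables, or pictures, without specific limitation.
Referring to FIG. 4, a schematic diagram of the layout segmentation labels in an embodiment of the present application. For the PDF layout training sample of FIG. 3, different horizontal and vertical lines are annotated in FIG. 4 as layout segmentation labels according to the distribution of the content blocks, and the subsequent layout segmentation can be carried out according to the relationship between the horizontal and vertical lines.
In one embodiment, the method further includes sample augmentation of the PDF layout training samples in the layout training data set, specifically: 1) selecting a scaling value within the range between the lower and upper scaling thresholds; 2) scaling the PDF layout training sample by the scaling value, thereby augmenting the PDF layout training samples with scaled samples.
In the above embodiment, the lower scaling threshold may be 0.9x and the upper scaling threshold may be 1.1x, that is, a scaling value is selected between 0.9x and 1.1x and the PDF layout training sample (the converted PDF image) is scaled by that value to obtain a scaled sample; the scaled sample and the corresponding layout segmentation label are added to the layout training data set, which augments the layout training data set.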
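The scaling augmentation can be sketched as choosing a factor and deriving the new image size. This is a minimal Python sketch under stated assumptions: the source only says a value is "selected" between 0.9x and 1.1x, so the uniform sampling, the function name `sample_scaled_size`, and the rounding to whole pixels are assumptions made here. The same factor would also have to be applied to the label lines so that sample and label stay aligned.

```python
import random
from typing import Optional, Tuple


def sample_scaled_size(
    size: Tuple[int, int],
    low: float = 0.9,
    high: float = 1.1,
    rng: Optional[random.Random] = None,
) -> Tuple[float, Tuple[int, int]]:
    """Pick a scale in [low, high] and return (scale, (new_width, new_height))."""
    if not 0 < low <= high:
        raise ValueError("expected 0 < low <= high")
    rng = rng or random.Random()
    # Uniform sampling is an assumption; the source does not name a distribution.
    scale = rng.uniform(low, high)
    width, height = size
    # Round to whole pixels and never let a dimension collapse to zero.
    return scale, (max(1, round(width * scale)), max(1, round(height * scale)))
```

Passing a seeded `random.Random` instance makes the augmentation reproducible across training runs.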
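Step S220: input the PDF layout training sample into the initial PDF layout segmentation model for layout segmentation to obtain the layout segmentation prediction.
In one embodiment, the initial PDF layout segmentation model performs layout segmentation on the input PDF layout training sample: the boundary points of each content block are obtained by image recognition and form the boundary point set, and the boundary lines of each content block are then obtained from the coordinates of the boundary points in the set. For example, if the content blocks are segmented into rectangles, each content block has a rectangular boundary made up of two horizontal lines and two vertical lines. The boundary lines of every content block are obtained in this way and form the first horizontal line set and the first vertical line set.
The above embodiment then performs horizontal line segmentation from top to bottom, starting at the first horizontal line and selecting horizontal lines downward one at a time. If a horizontal line passes through a content block, it is deleted and selection continues downward until a horizontal line is found that passes through no content block; the region between these two horizontal lines is called a segment. The next segment is selected in the same way until all horizontal lines in the first horizontal line set have been processed.
In one embodiment, if there is no content block between two horizontal lines, the two lines are merged into a single horizontal line of preset width; this line only needs to lie between the two content blocks, without further limitation. This yields the horizontal line set.
The above embodiment then performs vertical line segmentation from left to right: within each segment (that is, considering the boundary lines of the content blocks in that segment), vertical lines are selected from the first vertical line set in the same way as for the horizontal line set to form the vertical line set. The horizontal line set and the vertical line set together serve as the layout segmentation prediction.
Referring to FIG. 5, a schematic diagram of layout segmentation in an embodiment of the present application. Layout segmentation is performed on the PDF layout training sample of FIG. 3: the boundary points of each content block are obtained by image recognition (the figure shows the boundary points of each content block), and the boundary lines of each content block are then obtained from the boundary point coordinates. For example, if the content blocks are segmented into rectangles, each content block has a rectangular boundary made up of two horizontal lines and two vertical lines. For block 1, the horizontal boundary lines are 1-1 and 1-4 and the vertical boundary lines are 2-1 and 2-5. The boundary lines of the other content blocks are obtained in the same way: block 2 has horizontal boundary lines 1-2 and 1-3 and vertical boundary lines 2-7 and 2-10; block 3 has horizontal boundary lines 1-2 and 1-3 and vertical boundary lines 2-1 and 2-2; block 4 has horizontal boundary lines 1-5 and 1-8 and vertical boundary lines 2-3 and 2-4; block 5 has horizontal boundary lines 1-6 and 1-10 and vertical boundary lines 2-76 and 2-11; block 6 has horizontal boundary lines 1-11 and 1-13 and vertical boundary lines 2-1 and 2-8; and block 7 has horizontal boundary lines 1-12 and 1-13 and vertical boundary lines 2-9 and 2-11. Together these form the first horizontal line set {1-1, 1-2, 1-3, 1-4, 1-5, 1-6, 1-7, 1-8, 1-9, 1-10, 1-11, 1-12, 1-13} and the first vertical line set {2-1, 2-2, 2-3, 2-4, 2-5, 2-6, 2-7, 2-8, 2-9, 2-10, 2-11}. As FIG. 5 shows, depending on how the content blocks are laid out, some boundary lines may coincide.
Segments are then filtered from the first horizontal line set and the first vertical line set to obtain the horizontal line set and the vertical line set. Specifically: starting at the first horizontal line and working from top to bottom, horizontal lines are selected downward one at a time. If a horizontal line passes through a content block, it is deleted and selection continues downward until a horizontal line is found that passes through no content block; the region between these two horizontal lines is called a segment. The next segment is selected in the same way until all horizontal lines in the first horizontal line set have been processed, giving the horizontal line set. For example, the first horizontal line is 1-1. Moving down, horizontal line 1-2 passes through block 1 and is deleted; horizontal line 1-3 also passes through block 1 and is deleted; horizontal line 1-4 passes through no content block, so the region from horizontal line 1-1 to horizontal line 1-4 is one segment, called segment 1.
The next segment starts at horizontal line 1-5. Moving down, horizontal line 1-6 passes through block 4 and is deleted; horizontal line 1-7 passes through block 5 and is deleted; horizontal line 1-8 passes through block 3 and is deleted; horizontal line 1-9 passes through block 5 and is deleted; horizontal line 1-10 passes through no content block, so the region from horizontal line 1-5 to horizontal line 1-10 is one segment, called segment 2.
Subsequent segments are selected in the same way until all horizontal lines in the first horizontal line set have been processed, giving the horizontal line set. In FIG. 5, the region from horizontal line 1-11 to horizontal line 1-13 is one segment, called segment 3, after which every horizontal line in the first horizontal line set has been processed.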
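The segment selection walked through above can be sketched in a few lines. This is a minimal Python sketch under stated assumptions: horizontal lines are reduced to y coordinates, content blocks to `(top, bottom)` extents, and "passes through a block" is read as lying strictly inside a block's vertical extent. The name `split_segments` and this data layout are hypothetical and not taken from the source.

```python
from typing import Iterable, List, Optional, Tuple

Extent = Tuple[float, float]  # (top, bottom) of a content block, with top < bottom


def split_segments(h_lines: Iterable[float], blocks: List[Extent]) -> List[Extent]:
    """Pair up the horizontal lines that cross no content block, top to bottom."""

    def crosses_block(y: float) -> bool:
        # A line on a block's own edge does not count as passing through it.
        return any(top < y < bottom for top, bottom in blocks)

    segments: List[Extent] = []
    start: Optional[float] = None
    for y in sorted(set(h_lines)):  # coinciding lines are treated as one line
        if crosses_block(y):
            continue  # the line cuts through a block, so it is dropped
        if start is None:
            start = y  # first surviving line opens a segment
        else:
            segments.append((start, y))  # next surviving line closes it
            start = None
    return segments
```

With blocks at `(0, 100)`, `(20, 80)`, `(110, 200)`, `(120, 180)`, `(210, 300)`, `(240, 300)` and their edges as candidate lines, the surviving lines pair up into three segments, mirroring segments 1 to 3 in the walkthrough (the coordinates are made up for illustration).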
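In one embodiment, if there is no content block between two horizontal lines, the two lines are merged into a single horizontal line of preset width. In FIG. 5 the lines between two segments can be merged, that is, horizontal lines 1-4 and 1-5 can be merged, and horizontal lines 1-10 and 1-11 can be merged. With reference to FIG. 4, this gives the horizontal line set {1, 2, 3, 4} for the PDF layout training sample of FIG. 5: horizontal line 1-1 in FIG. 5 corresponds to horizontal line 1 of preset width in FIG. 4, horizontal lines 1-4 and 1-5 in FIG. 5 are merged into horizontal line 2 of preset width in FIG. 4, horizontal lines 1-10 and 1-11 in FIG. 5 are merged into horizontal line 3 of preset width in FIG. 4, and horizontal line 1-13 in FIG. 5 is horizontal line 4 of preset width in FIG. 4.
After the horizontal line set and the corresponding segments are obtained, the segments are processed in turn; within each segment, vertical lines are selected from the first vertical line set from left to right to form the vertical line set. In FIG. 5, segment 1 contains blocks 1 and 2, and only the boundary lines within that segment are considered, that is, vertical lines are selected from {2-1, 2-4, 2-7, 2-10} in the same way as for the horizontal line set. With reference to FIG. 4, segment 1 yields the vertical line set {5, 6, 7} for the PDF layout training sample of FIG. 5: vertical line 2-1 in FIG. 5 corresponds to vertical line 5 of preset width in FIG. 4, vertical lines 2-4 and 2-7 in FIG. 5 are merged into vertical line 6 of preset width in FIG. 4, and vertical line 2-10 in FIG. 5 is vertical line 7 of preset width in FIG. 4. Likewise, the vertical line set of segment 2 is {8, 9, 10, 11} and that of segment 3 is {12, 13, 14}.
Finally, from the first horizontal line set and the first vertical line set of FIG. 5, the horizontal line set {1, 2, 3, 4} and the vertical line set {5, 6, 7, 8, 9, 10, 11, 12, 13, 14} are obtained, and these two sets serve as the layout segmentation prediction.
It can be understood that if, after segmentation by the horizontal line set, some content blocks within the same segment are stacked vertically, they are treated as one complete content block: vertical line segmentation is first performed as described above, and horizontal or vertical line segmentation is then performed on that content block. Referring to FIG. 6, a schematic diagram of a PDF layout training sample in an embodiment of the present application: under the above segmentation, blocks 6, 7, and 8 all belong to segment 3, so when vertical line segmentation is performed on segment 3, blocks 7 and 8 can be treated as one complete content block; after the vertical line segmentation is complete, that block is segmented by horizontal lines to obtain the corresponding horizontal line set.
Step S230: calculate the loss value from the layout segmentation prediction and the corresponding layout segmentation label.
In one embodiment, the layout segmentation prediction obtained as above, that is, the horizontal line set and the vertical line set, is compared with the corresponding layout segmentation label to calculate the loss value. In this embodiment the loss value can be calculated from the positional difference between the horizontal and vertical lines in the two sets and the horizontal or vertical lines in the layout segmentation label, that is, the loss value is the position error of the corresponding horizontal or vertical line.
Step S240: use the loss function to adjust the model weights of the initial PDF layout segmentation model according to the loss value until the loss function meets the convergence condition, yielding the trained PDF layout segmentation model.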
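In one embodiment, since the purpose of PDF layout segmentation is to obtain the boundary point set, PDF layout segmentation can be treated as a boundary detection problem, and boundary detection can in turn be treated as semantic segmentation: in the annotation, boundaries are simply labeled 1 and all other regions 0, giving a binary semantic segmentation problem. Because a single loss function adjusts the model poorly, the loss function in this embodiment is a composite loss function obtained from a first loss function and a second loss function, expressed as:
$L = \alpha L_1 + \beta L_2$
where $\alpha + \beta = 1$, $\alpha$ denotes the first weight, $\beta$ denotes the second weight, $L$ denotes the composite loss function, $L_1$ denotes the first loss function, and $L_2$ denotes the second loss function.
In one embodiment, the first loss function is the dice loss function and the second loss function is the weighted cross-entropy loss function.
Specifically, the dice loss function is a measure used to evaluate the similarity of two samples. Its value ranges from 0 to 1, and a larger value means the two samples are closer. It is expressed as:
[Formula image PCTCN2022090674-appb-000001]
where X and Y are the two samples, ∩ denotes the intersection, and |X| and |Y| denote the number of elements in sample X and sample Y respectively.
The above first loss function can also be expressed as:
[Formula image PCTCN2022090674-appb-000002]
where N denotes the number of pixels, $g_i$ denotes the sample label, and $p_i$ denotes the predicted value. In boundary detection, $p_i$ and $g_i$ take the value 0 or 1, indicating whether the pixel is a boundary: 1 if it is, 0 otherwise.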
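A per-pixel Dice computation over the symbols defined above can be sketched as follows. The two formula images are not reproduced in this text, so this minimal Python sketch assumes the common form $2\sum_i p_i g_i / (\sum_i p_i + \sum_i g_i)$ for the similarity and takes one minus that value as the loss. Both conventions, the smoothing term `eps`, and the names `dice_coefficient` and `dice_loss` are assumptions rather than the patent's wording.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[float], label: Sequence[float], eps: float = 1e-7) -> float:
    """Dice overlap of two flattened masks; 1.0 means identical, near 0 means disjoint."""
    if len(pred) != len(label):
        raise ValueError("pred and label must have the same number of pixels")
    # For 0/1 masks, sum(p_i * g_i) counts pixels marked as boundary in both masks.
    intersection = sum(p * g for p, g in zip(pred, label))
    total = sum(pred) + sum(label)
    # eps keeps the ratio defined when both masks are empty (assumed smoothing).
    return (2.0 * intersection + eps) / (total + eps)


def dice_loss(pred: Sequence[float], label: Sequence[float], eps: float = 1e-7) -> float:
    """Assumed loss convention: one minus the Dice coefficient, so lower is better."""
    return 1.0 - dice_coefficient(pred, label, eps)
```

For a prediction `[1, 0, 1, 0]` against a label `[1, 0, 0, 0]`, one pixel overlaps out of three marked in total, so the coefficient is about 2/3 and the loss about 1/3.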
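In one embodiment, the second loss function is the weighted cross-entropy loss function, where the cross-entropy loss function is expressed as:
$L_2' = y_{true}\log(y_{pred}) + (1 - y_{true})\log(1 - y_{pred})$
where $L_2'$ denotes the cross-entropy loss function, $y_{true}$ denotes the sample label, and $y_{pred}$ denotes the predicted value. In this embodiment, when the sample label is positive and the predicted value is negative, the prediction is a false negative; when the sample label is negative and the predicted value is positive, the prediction is a false positive. So that the model treats the penalty for false negatives as greater than the penalty for false positives, different weights are placed in front of the two terms of the cross-entropy loss function, giving the weighted cross-entropy loss function and improving the adjustment ability of the loss function. The weighted cross-entropy loss function is expressed as:
$L_2 = \omega_1 y_{true}\log(y_{pred}) + \omega_0 (1 - y_{true})\log(1 - y_{pred})$
where $\omega_0$ and $\omega_1$ denote the false-negative weight and the false-positive weight respectively. In one embodiment, the ratio of the false-negative weight to the false-positive weight may be set to 1:25; this is only illustrative and is not limiting.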
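The effect of weighting the two cross-entropy terms can be sketched as follows. This minimal Python sketch makes several assumptions: it negates the sum and averages over pixels (the usual loss convention, whereas the formula above is written without a leading minus sign), it clamps predictions away from 0 and 1, and it names the weights by the term they multiply (`w_pos` for the $y_{true}\log(y_{pred})$ term, `w_neg` for the other) rather than committing to which of $\omega_0$ and $\omega_1$ should be larger.

```python
import math
from typing import Sequence


def weighted_cross_entropy(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    w_pos: float = 1.0,
    w_neg: float = 1.0,
    eps: float = 1e-7,
) -> float:
    """Mean weighted binary cross-entropy over pixels (negated so lower is better)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp so log() never receives 0
        # w_pos scales the term that is active on boundary pixels (label 1),
        # w_neg the term that is active on background pixels (label 0).
        total += w_pos * t * math.log(p) + w_neg * (1.0 - t) * math.log(1.0 - p)
    return -total / len(y_true)
```

Raising `w_pos` makes a missed boundary pixel (label 1, low prediction) cost proportionally more, which is the kind of asymmetric penalty the weighted form is meant to provide.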
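In one embodiment, the first weight of the first loss function and the second weight of the second loss function are set separately to obtain the loss function; for example, with first weight : second weight = 0.3 : 0.7, the loss function is expressed as:
$L = 0.3 L_1 + 0.7 L_2$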
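In this embodiment, the parameters of the initial PDF layout segmentation model are adjusted according to the loss value until the loss function meets the convergence condition, yielding the PDF layout segmentation model; that is, the parameters of the initial PDF layout segmentation model are updated according to the loss function until the PDF layout segmentation model is obtained. In this embodiment, the convergence condition may be that the loss function is minimized, that is, the parameters of the initial PDF layout segmentation model are optimized by minimizing each loss function.
Step S130: obtain the reading blocks of the PDF image from the horizontal line set and the vertical line set according to the preset connectivity rule.
In one embodiment, after the horizontal line set and the vertical line set are obtained, the reading blocks need to be determined. Referring to FIG. 7, step S130 includes, but is not limited to, steps S131 to S133:
Step S131: read two horizontal lines from the horizontal line set in order from top to bottom.
Step S132: read, in order from left to right, all vertical lines in the vertical line set that lie between the two horizontal lines.
Step S133: generate a reading block between every two adjacent vertical lines, until all horizontal lines in the horizontal line set have been read.
In one embodiment, referring to FIG. 4, the horizontal line set is {1, 2, 3, 4} and the vertical line set is {5, 6, 7, 8, 9, 10, 11, 12, 13, 14}.
In this embodiment, the first two horizontal lines read from the horizontal line set, from top to bottom, are horizontal line 1 and horizontal line 2. All vertical lines in the vertical line set lying between horizontal line 1 and horizontal line 2 are then read from left to right: vertical lines 5, 6, and 7. A reading block is generated between every two adjacent vertical lines, giving reading block 1 (block 1 in FIG. 4) and reading block 2 (block 2 in FIG. 4).
Reading continues from top to bottom with horizontal line 2 and horizontal line 3. All vertical lines lying between horizontal line 2 and horizontal line 3 are read from left to right: vertical lines 8, 9, 10, and 11. A reading block is generated between every two adjacent vertical lines, giving reading block 3 (block 3 in FIG. 4), reading block 4 (block 4 in FIG. 4), and reading block 5 (block 5 in FIG. 4).
Reading then continues with horizontal line 3 and horizontal line 4. All vertical lines lying between horizontal line 3 and horizontal line 4 are read from left to right: vertical lines 12, 13, and 14. A reading block is generated between every two adjacent vertical lines, giving reading block 6 (block 6 in FIG. 4) and reading block 7 (block 7 in FIG. 4).
Step S140: obtain the PDF layout segmentation result according to the reading order of the reading blocks.
In one embodiment, two horizontal lines are first selected from the horizontal line set from top to bottom, and the reading blocks between them are then read from left to right; this is repeated until all horizontal lines in the horizontal line set have been read, and the PDF layout segmentation result is obtained from the resulting reading order of the reading blocks.
Referring to FIG. 4, horizontal line 1 and horizontal line 2 are selected first, and the reading blocks between them are read from left to right: reading block 1 -> reading block 2.
Horizontal line 2 and horizontal line 3 are selected next, and the reading blocks between them are read from left to right: reading block 3 -> reading block 4 -> reading block 5.
Finally, horizontal line 3 and horizontal line 4 are selected, and the reading blocks between them are read from left to right: reading block 6 -> reading block 7.
The reading order finally generated by the above embodiment is: reading block 1 -> reading block 2 -> reading block 3 -> reading block 4 -> reading block 5 -> reading block 6 -> reading block 7, which is the PDF layout segmentation result.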
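Steps S131 to S133 and the resulting reading order can be sketched together. This is a minimal Python sketch under stated assumptions: each horizontal line is reduced to a y coordinate and each vertical line to `(x, y_top, y_bottom)`, so that "lying between two horizontal lines" becomes a simple range check. The name `reading_blocks`, the tuple layouts, and the coordinates in the usage note are hypothetical.

```python
from typing import List, Tuple

VLine = Tuple[float, float, float]         # (x, y_top, y_bottom)
Block = Tuple[float, float, float, float]  # (left, top, right, bottom)


def reading_blocks(h_lines: List[float], v_lines: List[VLine]) -> List[Block]:
    """Return reading blocks in reading order: segments top to bottom, blocks left to right."""
    blocks: List[Block] = []
    ys = sorted(h_lines)
    # S131: take consecutive pairs of horizontal lines, top to bottom.
    for top, bottom in zip(ys, ys[1:]):
        # S132: vertical lines lying between the two horizontal lines, left to right.
        xs = sorted(x for x, y0, y1 in v_lines if top <= y0 and y1 <= bottom)
        # S133: one reading block between every two adjacent vertical lines.
        for left, right in zip(xs, xs[1:]):
            blocks.append((left, top, right, bottom))
    return blocks
```

With four horizontal lines and vertical lines grouped 3, 4, and 3 per segment, as in the FIG. 4 example, the sketch yields 2 + 3 + 2 = 7 blocks, and the list index plus one is the reading block number.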
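The PDF layout segmentation method provided by the embodiments of the present application acquires a PDF document and converts it into a PDF image, inputs the PDF image into a pre-trained PDF layout segmentation model for layout segmentation to obtain the boundary point set corresponding to the PDF image, generates a horizontal line set and a vertical line set from the boundary point set, obtains the reading blocks of the PDF image from the two line sets according to the preset connectivity rule, and finally obtains the PDF layout segmentation result according to the reading order of the reading blocks. This embodiment applies not only to ordinary PDF documents but also, in particular, to PDF documents with multi-column layouts or inserted charts; obtaining the reading order facilitates subsequent operations on the PDF content and improves the efficiency and accuracy of PDF layout segmentation.
The related art segments layouts with NLP-style approaches such as NSP or Pointer Network; with such approaches, data annotation is difficult and the algorithms are not robust enough. The embodiments of the present application instead use image segmentation, which can draw on the image for more complete layout information; annotating the PDF image by drawing lines makes the reading order relatively easy to solve for. This avoids the non-learnable nature of purely rule-based judgment in the related art, and also avoids the poor robustness of NLP algorithms that segment at the fine granularity of paragraphs or sentences. This embodiment obtains the reading order of the reading blocks in the PDF document and thus the coherent content of the document, on which subsequent analysis can conveniently be performed.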
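An embodiment of the present application also provides a PDF layout segmentation apparatus that can implement the above PDF layout segmentation method. Referring to FIG. 8, the apparatus includes:
an acquisition module 810, configured to acquire a PDF document and convert the PDF document into a PDF image;
a horizontal and vertical segmentation module 820, configured to input the PDF image into the pre-trained PDF layout segmentation model for layout segmentation to obtain the horizontal line set and the vertical line set corresponding to the PDF image;
a connectivity module 830, configured to obtain the reading blocks of the PDF image from the horizontal line set and the vertical line set according to the preset connectivity rule;
a segmentation result module 840, configured to obtain the PDF layout segmentation result according to the reading order of the reading blocks.
In one embodiment, the acquisition module 810 can convert the acquired PDF document into a PDF image in two ways. The first method: take each page of the PDF document and convert it into one image, so that a PDF document becomes as many image files as it has pages; for example, a PDF document of 100 pages is converted into 100 image files. The second method: from the height and width of each page in the PDF document, add up the page heights and take the sum as the target height, compare the page widths and take the largest as the target width; once the target height and target width are determined, store the content of each page and stitch the data in memory into a single long image. It can be understood that this embodiment can also use the open-source PDFBox tool to convert the PDF document into a PDF image. The above conversion methods are only examples and do not mean that this embodiment can convert a PDF document into a PDF image only in these two ways; a different conversion method can be chosen as needed.
In one embodiment, in the horizontal and vertical segmentation module 820, the PDF layout segmentation model is an improved Unet network model; the improvement is that the downsampling module of the encoder in the Unet network model is replaced with a ResNest50 feature extraction module. In the related art, the Unet network model trains on samples from scratch, that is, every sample is entirely new data to the Unet network model and the model weight information from earlier training is not used; this embodiment therefore replaces the encoder's downsampling module with the ResNest50 feature extraction module, which downsamples to reduce the image size and extracts shallow features for the subsequent image segmentation. Because the ResNest model introduces the Split-Attention module on top of the ResNet model for attention supervision, and the ResNeSt-50 model, a common configuration of the ResNest model, performs well at image classification on the ImageNet dataset, building the PDF layout segmentation model from the improved Unet network model in this embodiment can improve segmentation accuracy.
In one embodiment, the loss function of the PDF layout segmentation model is obtained from a first loss function and a second loss function, where the first loss function may be the dice loss function and the second loss function may be the weighted cross-entropy loss function.
In one embodiment, the apparatus further includes a model training module 850. Before the horizontal and vertical segmentation module 820 performs layout segmentation, the model training module 850 trains the PDF layout segmentation model. The training process includes: 1) building the initial PDF layout segmentation model and the layout training data set, where the layout training data set includes PDF layout training samples and the corresponding layout segmentation labels, and a layout segmentation label is a horizontal line of preset width or a vertical line of preset width; 2) inputting the PDF layout training samples into the initial PDF layout segmentation model for layout segmentation to obtain the layout segmentation prediction; 3) calculating the loss value from the layout segmentation prediction and the corresponding layout segmentation label; 4) using the loss function to adjust the model weights of the initial PDF layout segmentation model according to the loss value, until the PDF layout segmentation model is obtained through training.
The PDF layout segmentation apparatus of this embodiment is suitable not only for layout segmentation of ordinary PDF documents but also, in particular, for PDF documents with multi-column layouts or inserted charts; obtaining the reading order facilitates subsequent operations on the PDF content and can improve the efficiency and accuracy of PDF layout segmentation.
The specific implementation of the PDF layout segmentation apparatus of this embodiment is essentially the same as that of the above PDF layout segmentation method and is not repeated here.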
本申请实施例还提供了一种电子设备,包括:
至少一个存储器;
至少一个处理器;
至少一个程序;
所述程序被存储在存储器中,处理器执行所述至少一个程序以实现一种PDF版面分割方法,其中,所述PDF版面分割方法包括:
获取PDF文档,并将PDF文档转化为PDF图像;
将PDF图像输入到预先训练的PDF版面分割模型进行版面分割,得到PDF图像对应的边界点集合,并根据边界点集合生成横线集合和纵线集合;
按照预设联通规则,根据横线集合和纵线集合得到PDF图像的阅读块;
根据阅读块的阅读顺序,得到PDF版面分割结果。
该电子设备可以为包括手机、平板电脑、个人数字助理(Personal Digital Assistant,简称PDA)、车载电脑等任意智能终端。
请参阅图9,图9示意了另一实施例的电子设备的硬件结构,电子设备包括:
处理器901,可以采用通用的CPU(CentralProcessingUnit,中央处理器)、微处理器、应用专用集成电路(ApplicationSpecificIntegratedCircuit,ASIC)、或者一个或多个集成电路等方式实现,用于执行相关程序,以实现本申请实施例所提供的技术方案;
存储器902,可以采用ROM(ReadOnlyMemory,只读存储器)、静态存储设备、动态 存储设备或者RAM(RandomAccessMemory,随机存取存储器)等形式实现。存储器902可以存储操作系统和其他应用程序,在通过软件或者固件来实现本说明书实施例所提供的技术方案时,相关的程序代码保存在存储器902中,并由处理器901来调用执行本申请实施例的PDF版面分割方法;
输入/输出接口903,用于实现信息输入及输出;
通信接口904,用于实现本设备与其他设备的通信交互,可以通过有线方式(例如USB、网线等)实现通信,也可以通过无线方式(例如移动网络、WIFI、蓝牙等)实现通信;和
总线905,在设备的各个组件(例如处理器901、存储器902、输入/输出接口903和通信接口904)之间传输信息;
其中处理器901、存储器902、输入/输出接口903和通信接口904通过总线905实现彼此之间在设备内部的通信连接。
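An embodiment of the present application further provides a storage medium. The storage medium is a computer-readable storage medium that stores computer-executable instructions for causing a computer to execute a PDF layout segmentation method, where the PDF layout segmentation method includes:
acquiring a PDF document and converting the PDF document into a PDF image;
inputting the PDF image into a pre-trained PDF layout segmentation model for layout segmentation to obtain a boundary point set corresponding to the PDF image, and generating a horizontal line set and a vertical line set from the boundary point set;
obtaining reading blocks of the PDF image from the horizontal line set and the vertical line set according to a preset connectivity rule;
obtaining a PDF layout segmentation result according to the reading order of the reading blocks.
This embodiment is suitable not only for ordinary PDF documents but especially for PDF documents that contain multiple columns or inserted figures and tables. Obtaining the reading order of such documents facilitates subsequent operations on the PDF content and improves the efficiency and accuracy of PDF layout segmentation.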
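The computer-readable storage medium may be non-volatile or volatile. As a non-transitory computer-readable storage medium, the memory can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some implementations, the memory optionally includes memory located remotely from the processor, and such remote memory may be connected to the processor through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described in the present application are intended to explain the technical solutions of the embodiments more clearly and do not limit the technical solutions provided by the embodiments. Those skilled in the art will appreciate that, as technology evolves and new application scenarios emerge, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
Those skilled in the art will understand that the technical solutions shown in FIGS. 1-5 do not limit the embodiments of the present application, which may include more or fewer steps than shown, combine certain steps, or use different steps.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will understand that all or some of the steps of the methods disclosed above, and the functional modules/units of the systems and devices, may be implemented as software, firmware, hardware, or an appropriate combination thereof.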
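The terms "first", "second", "third", "fourth", and the like (if present) in the description of the present application and in the above drawings are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that data so designated may be interchanged where appropriate, so that the embodiments of the present application described here can be implemented in orders other than those illustrated or described. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
It should be understood that, in the present application, "at least one (item)" means one or more, and "a plurality of" means two or more. "And/or" describes the relationship between associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean: only A exists, only B exists, or both A and B exist, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following (items)" or a similar expression refers to any combination of those items, including any combination of a single item or multiple items. For example, at least one of a, b, or c may mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or multiple.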
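In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, apparatuses, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware or as a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store a program, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, which does not limit the scope of rights of the embodiments of the present application. Any modification, equivalent replacement, or improvement made by those skilled in the art without departing from the scope and essence of the embodiments of the present application shall fall within the scope of rights of the embodiments of the present application.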

Claims (20)

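  1. A PDF layout segmentation method, comprising:
    acquiring a PDF document and converting the PDF document into a PDF image;
    inputting the PDF image into a pre-trained PDF layout segmentation model for layout segmentation to obtain a boundary point set corresponding to the PDF image, and generating a horizontal line set and a vertical line set from the boundary point set;
    obtaining reading blocks of the PDF image from the horizontal line set and the vertical line set according to a preset connectivity rule;
    obtaining a PDF layout segmentation result according to the reading order of the reading blocks.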
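  2. The PDF layout segmentation method according to claim 1, wherein, before the inputting the PDF image into a pre-trained PDF layout segmentation model for layout segmentation, the method further comprises:
    constructing an initial PDF layout segmentation model and a layout training dataset, the layout training dataset comprising PDF layout training samples and corresponding layout segmentation labels, the layout segmentation label being a horizontal line of preset width or a vertical line of preset width;
    inputting the PDF layout training samples into the initial PDF layout segmentation model for layout segmentation to obtain layout segmentation predictions;
    calculating a loss value from the layout segmentation predictions and the corresponding layout segmentation labels;
    using a loss function to adjust model weights of the initial PDF layout segmentation model according to the loss value until the loss function satisfies a convergence condition, thereby obtaining the trained PDF layout segmentation model.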
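  3. The PDF layout segmentation method according to claim 2, wherein the constructing an initial PDF layout segmentation model and a layout training dataset further comprises:
    selecting a scaling value within the range between a lower scaling threshold and an upper scaling threshold;
    scaling the PDF layout training samples according to the scaling value, so as to augment the layout training dataset with scaled samples;
    constructing the layout training dataset from both the scaled samples and the PDF layout training samples.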
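  4. The PDF layout segmentation method according to claim 2, wherein the loss function is a composite loss function;
    correspondingly, before the using a loss function to adjust model weights of the initial PDF layout segmentation model according to the loss value until the trained PDF layout segmentation model is obtained, the method further comprises: calculating the composite loss function from a first loss function and a second loss function different from the first loss function, expressed as:
    $L = \alpha L_1 + \beta L_2$
    where $\alpha + \beta = 1$, $\alpha$ denotes a first weight, $\beta$ denotes a second weight, $L$ denotes the composite loss function, $L_1$ denotes the first loss function, and $L_2$ denotes the second loss function.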
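  5. The PDF layout segmentation method according to any one of claims 1 to 4, wherein the obtaining reading blocks of the PDF image from the horizontal line set and the vertical line set according to a preset connectivity rule comprises:
    reading two horizontal lines from the horizontal line set in top-to-bottom order;
    reading, in left-to-right order, all vertical lines in the vertical line set that lie between the two horizontal lines;
    generating one reading block between every two adjacent vertical lines, until all horizontal lines in the horizontal line set have been read.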
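  6. The PDF layout segmentation method according to claim 5, wherein the obtaining a PDF layout segmentation result according to the reading order of the reading blocks comprises:
    determining the reading order of the reading blocks by first selecting two horizontal lines from the horizontal line set from top to bottom and then reading the reading blocks between the two horizontal lines from left to right, until all horizontal lines in the horizontal line set have been read;
    obtaining the PDF layout segmentation result according to the reading order of the reading blocks.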
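  7. The PDF layout segmentation method according to claim 5, wherein the PDF layout segmentation model is an improved Unet network model in which the downsampling module of the encoder is a ResNeSt-50 feature extraction module;
    the inputting the PDF image into a pre-trained PDF layout segmentation model for layout segmentation to obtain a boundary point set corresponding to the PDF image, and generating a horizontal line set and a vertical line set from the boundary point set, comprises:
    inputting the PDF image into the pre-trained PDF layout segmentation model for layout segmentation to obtain boundary points of each content block in the PDF image;
    forming the boundary point set from the boundary points;
    obtaining a boundary line of each content block from the coordinates of the boundary points in the boundary point set;
    generating the horizontal line set and the vertical line set from the boundary lines.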
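  8. A PDF layout segmentation apparatus, comprising:
    an acquisition module, configured to acquire a PDF document and convert the PDF document into a PDF image;
    a horizontal-vertical segmentation module, configured to input the PDF image into a pre-trained PDF layout segmentation model for layout segmentation to obtain a horizontal line set and a vertical line set corresponding to the PDF image;
    a connectivity module, configured to obtain reading blocks of the PDF image from the horizontal line set and the vertical line set according to a preset connectivity rule;
    a segmentation result module, configured to obtain a PDF layout segmentation result according to the reading order of the reading blocks.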
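  9. An electronic device, comprising:
    at least one memory;
    at least one processor;
    at least one program;
    the program being stored in the memory, and the processor executing the at least one program to implement a PDF layout segmentation method, wherein the PDF layout segmentation method comprises:
    acquiring a PDF document and converting the PDF document into a PDF image;
    inputting the PDF image into a pre-trained PDF layout segmentation model for layout segmentation to obtain a boundary point set corresponding to the PDF image, and generating a horizontal line set and a vertical line set from the boundary point set;
    obtaining reading blocks of the PDF image from the horizontal line set and the vertical line set according to a preset connectivity rule;
    obtaining a PDF layout segmentation result according to the reading order of the reading blocks.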
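  10. The electronic device according to claim 9, wherein, before the inputting the PDF image into a pre-trained PDF layout segmentation model for layout segmentation, the method further comprises:
    constructing an initial PDF layout segmentation model and a layout training dataset, the layout training dataset comprising PDF layout training samples and corresponding layout segmentation labels, the layout segmentation label being a horizontal line of preset width or a vertical line of preset width;
    inputting the PDF layout training samples into the initial PDF layout segmentation model for layout segmentation to obtain layout segmentation predictions;
    calculating a loss value from the layout segmentation predictions and the corresponding layout segmentation labels;
    using a loss function to adjust model weights of the initial PDF layout segmentation model according to the loss value until the loss function satisfies a convergence condition, thereby obtaining the trained PDF layout segmentation model.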
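  11. The electronic device according to claim 10, wherein the constructing an initial PDF layout segmentation model and a layout training dataset further comprises:
    selecting a scaling value within the range between a lower scaling threshold and an upper scaling threshold;
    scaling the PDF layout training samples according to the scaling value, so as to augment the layout training dataset with scaled samples;
    constructing the layout training dataset from both the scaled samples and the PDF layout training samples.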
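  12. The electronic device according to claim 10, wherein the loss function is a composite loss function;
    correspondingly, before the using a loss function to adjust model weights of the initial PDF layout segmentation model according to the loss value until the trained PDF layout segmentation model is obtained, the method further comprises: calculating the composite loss function from a first loss function and a second loss function different from the first loss function, expressed as:
    $L = \alpha L_1 + \beta L_2$
    where $\alpha + \beta = 1$, $\alpha$ denotes a first weight, $\beta$ denotes a second weight, $L$ denotes the composite loss function, $L_1$ denotes the first loss function, and $L_2$ denotes the second loss function.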
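  13. The electronic device according to any one of claims 9 to 12, wherein the obtaining reading blocks of the PDF image from the horizontal line set and the vertical line set according to a preset connectivity rule comprises:
    reading two horizontal lines from the horizontal line set in top-to-bottom order;
    reading, in left-to-right order, all vertical lines in the vertical line set that lie between the two horizontal lines;
    generating one reading block between every two adjacent vertical lines, until all horizontal lines in the horizontal line set have been read.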
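  14. The electronic device according to claim 13, wherein the obtaining a PDF layout segmentation result according to the reading order of the reading blocks comprises:
    determining the reading order of the reading blocks by first selecting two horizontal lines from the horizontal line set from top to bottom and then reading the reading blocks between the two horizontal lines from left to right, until all horizontal lines in the horizontal line set have been read;
    obtaining the PDF layout segmentation result according to the reading order of the reading blocks.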
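  15. A storage medium, the storage medium being a computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions for causing a computer to execute a PDF layout segmentation method, wherein the PDF layout segmentation method comprises:
    acquiring a PDF document and converting the PDF document into a PDF image;
    inputting the PDF image into a pre-trained PDF layout segmentation model for layout segmentation to obtain a boundary point set corresponding to the PDF image, and generating a horizontal line set and a vertical line set from the boundary point set;
    obtaining reading blocks of the PDF image from the horizontal line set and the vertical line set according to a preset connectivity rule;
    obtaining a PDF layout segmentation result according to the reading order of the reading blocks.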
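  16. The storage medium according to claim 15, wherein, before the inputting the PDF image into a pre-trained PDF layout segmentation model for layout segmentation, the method further comprises:
    constructing an initial PDF layout segmentation model and a layout training dataset, the layout training dataset comprising PDF layout training samples and corresponding layout segmentation labels, the layout segmentation label being a horizontal line of preset width or a vertical line of preset width;
    inputting the PDF layout training samples into the initial PDF layout segmentation model for layout segmentation to obtain layout segmentation predictions;
    calculating a loss value from the layout segmentation predictions and the corresponding layout segmentation labels;
    using a loss function to adjust model weights of the initial PDF layout segmentation model according to the loss value until the loss function satisfies a convergence condition, thereby obtaining the trained PDF layout segmentation model.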
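  17. The storage medium according to claim 16, wherein the constructing an initial PDF layout segmentation model and a layout training dataset further comprises:
    selecting a scaling value within the range between a lower scaling threshold and an upper scaling threshold;
    scaling the PDF layout training samples according to the scaling value, so as to augment the layout training dataset with scaled samples;
    constructing the layout training dataset from both the scaled samples and the PDF layout training samples.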
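  18. The storage medium according to claim 16, wherein the loss function is a composite loss function;
    correspondingly, before the using a loss function to adjust model weights of the initial PDF layout segmentation model according to the loss value until the trained PDF layout segmentation model is obtained, the method further comprises: calculating the composite loss function from a first loss function and a second loss function different from the first loss function, expressed as:
    $L = \alpha L_1 + \beta L_2$
    where $\alpha + \beta = 1$, $\alpha$ denotes a first weight, $\beta$ denotes a second weight, $L$ denotes the composite loss function, $L_1$ denotes the first loss function, and $L_2$ denotes the second loss function.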
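  19. The storage medium according to any one of claims 15 to 18, wherein the obtaining reading blocks of the PDF image from the horizontal line set and the vertical line set according to a preset connectivity rule comprises:
    reading two horizontal lines from the horizontal line set in top-to-bottom order;
    reading, in left-to-right order, all vertical lines in the vertical line set that lie between the two horizontal lines;
    generating one reading block between every two adjacent vertical lines, until all horizontal lines in the horizontal line set have been read.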
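  20. The storage medium according to claim 19, wherein the obtaining a PDF layout segmentation result according to the reading order of the reading blocks comprises:
    determining the reading order of the reading blocks by first selecting two horizontal lines from the horizontal line set from top to bottom and then reading the reading blocks between the two horizontal lines from left to right, until all horizontal lines in the horizontal line set have been read;
    obtaining the PDF layout segmentation result according to the reading order of the reading blocks.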
PCT/CN2022/090674 2022-02-16 2022-04-29 PDF layout segmentation method and apparatus, electronic device, and storage medium WO2023155302A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210143646.X 2022-02-16
CN202210143646.XA CN114494303A (zh) 2022-02-16 2022-02-16 PDF layout segmentation method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023155302A1 (zh) 2023-08-24

Family

ID=81482496

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090674 WO2023155302A1 (zh) 2022-02-16 2022-04-29 PDF layout segmentation method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN114494303A (zh)
WO (1) WO2023155302A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
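CN102999758A (zh) * 2012-11-14 2013-03-27 Peking University Comic image layout understanding system and method based on polygon detection
CN105528614A (zh) * 2015-12-02 2016-04-27 Peking University Comic image layout recognition method and automatic recognition system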
US20180322339A1 (en) * 2017-05-08 2018-11-08 Adobe Systems Incorporated Page segmentation of vector graphics documents
US20200074637A1 (en) * 2018-08-28 2020-03-05 International Business Machines Corporation 3d segmentation with exponential logarithmic loss for highly unbalanced object sizes

Also Published As

Publication number Publication date
CN114494303A (zh) 2022-05-13

Similar Documents

Publication Publication Date Title
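CN109933756B (zh) OCR-based image file conversion method, apparatus, device, and readable storage medium
CN111476067B (zh) Text recognition method and apparatus for images, electronic device, and readable storage medium
CN114821622B (zh) Text extraction method, text extraction model training method, apparatus, and device
CN101719142B (zh) Method for detecting text in pictures using sparse representation based on a classification dictionary
WO2023134088A1 (zh) Video summary generation method and apparatus, electronic device, and storage medium
CN114596566B (zh) Text recognition method and related apparatus
CN113780229A (zh) Text recognition method and apparatus
WO2024027349A1 (zh) Printed mathematical formula recognition method, apparatus, and storage medium
CN113177435A (zh) Test paper analysis method and apparatus, storage medium, and electronic device
CN114359943A (zh) OFD layout document paragraph recognition method and apparatus
KR20220034076A (ko) Training method for a character generation model, character generation method, apparatus, and device
CN114218889A (zh) Document processing method, document model training method, apparatus, device, and storage medium
CN111680491A (zh) Document information extraction method and apparatus, and electronic device
CN114495147A (zh) Recognition method, apparatus, device, and storage medium
WO2023155302A1 (zh) PDF layout segmentation method and apparatus, electronic device, and storage medium
CN113486171B (zh) Image processing method and apparatus, and electronic device
JP2023010805A (ja) Method, apparatus, electronic device, storage medium, and computer program for training a document information extraction model and extracting document information
CN115687625A (zh) Text classification method, apparatus, device, and medium
CN115203415A (zh) Resume document information extraction method and related apparatus
CN115294594A (zh) Document analysis method, apparatus, device, and storage medium
CN115223182A (zh) Document layout recognition method and related apparatus
CN111368553B (zh) Intelligent word cloud data processing method, apparatus, device, and storage medium
CN113221718A (zh) Formula recognition method and apparatus, storage medium, and electronic device
CN114399782B (zh) Text image processing method, apparatus, device, storage medium, and program product
CN111753836A (zh) Character recognition method and apparatus, computer-readable medium, and electronic device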

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22926615

Country of ref document: EP

Kind code of ref document: A1