WO2020238054A1 - PDF文档中图表的定位方法、装置及计算机设备 (Method, apparatus and computer device for locating charts in a PDF document) - Google Patents

PDF文档中图表的定位方法、装置及计算机设备 (Method, apparatus and computer device for locating charts in a PDF document)

Info

Publication number
WO2020238054A1
WO2020238054A1 · PCT/CN2019/117747 · CN2019117747W
Authority
WO
WIPO (PCT)
Prior art keywords
target
chart
pdf document
picture
detection model
Prior art date
Application number
PCT/CN2019/117747
Other languages
English (en)
French (fr)
Inventor
刘克亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020238054A1 publication Critical patent/WO2020238054A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/413Classification of content, e.g. text, photographs or tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Definitions

  • This application relates to the field of data processing technology, and in particular to a method, device, computer equipment, and computer-readable storage medium for locating charts in PDF documents.
  • The existing analysis methods for PDF documents can only extract the pictures or the content of a PDF document separately; they cannot determine exactly which position in the PDF document is a table and which position is a graph. Because the positions of the charts in the PDF document cannot be determined accurately, the efficiency of using PDF documents is reduced.
  • The embodiments of this application provide a method, device, computer equipment, and computer-readable storage medium for locating charts in PDF documents, which can solve the problem in traditional technology that PDF documents are used inefficiently because the positions of the charts in them cannot be located accurately.
  • In a first aspect, an embodiment of the present application provides a method for locating charts in a PDF document. The method includes: obtaining a PDF document, and converting each page of the PDF document, in a preset manner and according to the position of that page in the PDF document, into a picture carrying a preset position identifier; recognizing, through a preset target detection model, all pictures that contain charts as target pictures, where the charts include graphs and tables; extracting the chart in each target picture through the target detection model to identify the position of the chart within that target picture; and combining, in a preset order, the position of each target picture in the PDF document with the position of the chart in the corresponding target picture to generate the position of the chart in the PDF document.
  • In a second aspect, an embodiment of the present application also provides a device for locating charts in a PDF document, including: a conversion unit, configured to obtain a PDF document and convert each page of the PDF document, in a preset manner and according to the position of that page in the PDF document, into a picture carrying a preset position identifier; a recognition unit, configured to recognize, through a preset target detection model, all pictures that contain charts as target pictures, where the charts include graphs and tables; an extraction unit, configured to extract the chart in each target picture through the target detection model to identify the position of the chart within that target picture; and a positioning unit, configured to combine, in a preset order, the position of each target picture in the PDF document with the position of the chart in the corresponding target picture to generate the position of the chart in the PDF document.
  • In a third aspect, an embodiment of the present application also provides a computer device, which includes a memory and a processor. The memory stores a computer program, and when the processor executes the computer program, the method for locating charts in a PDF document is implemented.
  • In a fourth aspect, an embodiment of the present application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor performs the method for locating charts in a PDF document.
  • FIG. 1 is a schematic flowchart of a method for positioning a chart in a PDF document provided by an embodiment of the application;
  • FIG. 2 is a schematic diagram of the division of a chart location area in a method for positioning a chart in a PDF document provided by an embodiment of the application;
  • FIG. 3 is a schematic block diagram of a device for locating charts in a PDF document provided by an embodiment of the application.
  • FIG. 4 is a schematic block diagram of a computer device provided by an embodiment of the application.
  • the method for locating charts in a PDF document provided by the embodiments of the present application can be applied to computer equipment such as terminals or servers, and the steps of the method for locating charts in the PDF document are implemented by software installed on the terminal or server.
  • the terminal may be an electronic device such as a mobile phone, a notebook computer, a tablet computer, or a desktop computer, and the server may be a cloud server or a server cluster.
  • Taking a terminal as an example, the specific implementation process of the method for locating charts in a PDF document provided by the embodiment of the application is as follows: the terminal obtains a PDF document, and converts each page of the PDF document, in a preset manner and according to the position of that page in the PDF document, into a picture carrying a preset position identifier; recognizes, through a preset target detection model, all pictures that contain charts as target pictures, where the charts include graphs and tables; extracts the chart in each target picture through the target detection model to identify the position of the chart within that target picture; and combines, in a preset order, the position of each target picture in the PDF document with the position of the chart in the corresponding target picture to generate the position of the chart in the PDF document.
  • FIG. 1 is a schematic flowchart of a method for locating charts in a PDF document provided by an embodiment of the application.
  • The method for locating charts in a PDF document is applied to a terminal or a server to implement all or part of the functions of the method.
  • the method includes the following steps S101-S104:
  • The preset position identifier refers to a description of where a page of the PDF document sits within the whole PDF document, and may be the page-number code of that page within the PDF document. For example, if the document pages are described with the numbers "1, 2, 3…", the preset position identifier may be page 1, page 2, page 3… of the PDF. Further, the preset position identifier may also include the document name or document number of the PDF document; for example, if the document name is document A, the third page of document A may be described as A3. Combining the document name with the document page number can improve the efficiency of identifying the PDF file.
  • The preset manner includes the methods, in different programming languages, for converting a PDF document into pictures. For example, converting a PDF document into pictures in JAVA can be done through a third-party library package, such as the Icepdf package or the Jpedal package.
  • S101: A PDF document is obtained, and each page of the PDF document is converted, in a preset manner and according to the position of that page in the PDF document, into a picture carrying a preset position identifier. After the PDF file is obtained, each page of the PDF document can be converted into one picture by the preset manner; if the PDF document contains multiple pages, it is converted into multiple pictures, which may be in JPG or JPEG format. Converting a PDF document into pictures can be done through a third-party library package: for example, download the Icepdf package, import it into the project, and convert the PDF document into several pictures through the Icepdf control; alternatively, download and import the PDFbox package or the Jpedal package. In each case the PDF document can be converted into a picture format; for example, through the Icepdf control, each page of the PDF document is converted, according to its position in the PDF document, into a picture in JPG or JPEG format carrying a preset position identifier. (An illustrative sketch of this conversion step is given below.)
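  • The application itself names the Icepdf control in JAVA for this step. Purely as a hedged illustration of the same idea, the following minimal Python sketch (an assumption, using the third-party pdf2image library rather than any library named in this application) converts every page into a JPG whose file name carries the preset position identifier, document name plus page number, e.g. A3.jpg for page 3 of document A.

```python
# Hedged sketch (assumption: pdf2image, not the Icepdf/Jpedal packages named in this application).
# Each page of the PDF becomes one picture whose file name carries the preset position identifier.
from pdf2image import convert_from_path

def pdf_to_pictures(pdf_path: str, doc_name: str, out_dir: str = ".") -> list:
    pictures = []
    for page_number, page_image in enumerate(convert_from_path(pdf_path, dpi=150), start=1):
        picture_path = f"{out_dir}/{doc_name}{page_number}.jpg"
        page_image.save(picture_path, "JPEG")  # one picture per page, e.g. "A3.jpg"
        pictures.append(picture_path)
    return pictures

# Example: pdf_to_pictures("A.pdf", "A") -> ["./A1.jpg", "./A2.jpg", ...]
```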
  • S102: Recognize, through a preset target detection model, all pictures that contain charts among the converted pictures as target pictures, where the charts include graphs and tables.
  • Here, a chart refers to a graph or a table. Target detection, also called target extraction, is a kind of image segmentation based on the geometric and statistical features of the target; it combines the segmentation and the recognition of the target into one step. Target detection is not difficult for humans: by perceiving the differently colored regions of a picture, it is easy to locate and classify the target objects. For a computer, however, what it faces is an RGB pixel matrix, and it is hard to obtain the abstract concept of a target directly from the image and locate it; when multiple objects are mixed with a cluttered background, target detection becomes even more difficult. Target detection mainly solves two problems: where the targets are in the image, that is, the target positions, and what the targets are, that is, the target categories.
  • Specifically, a pre-trained preset target detection model is used to recognize each picture and determine whether it contains a chart, where a chart includes a graph and a table. If a picture contains a graph and/or a table, all such pictures are taken as target pictures, and the graphs and/or tables in each target picture are further extracted through the target detection model. If a picture contains no chart, the picture is not processed and is discarded, which can also be called filtering out the picture; that is, the picture requires no processing.
  • Further, the target detection model performs target detection based on a target detection algorithm, and target detection algorithms are mainly based on deep learning models; the embodiment of this application locates the charts in a PDF document based on deep learning. Deep learning detection models can be divided into two categories: (1) two-stage detection algorithms, which split the detection problem into two stages: candidate regions (region proposals) are generated first, and the candidate regions are then classified, generally with position refinement; the typical representatives are the region-proposal-based R-CNN family, such as R-CNN, Fast R-CNN and Faster R-CNN; (2) one-stage detection algorithms, which do not need the region-proposal stage and directly produce the class probability and the position coordinates of the object; typical algorithms include YOLO and SSD.
  • Through the target detection model, multiple objects in a target picture can be recognized, and the different objects can be located, mainly by giving each object's bounding box. Before the target detection model is used to recognize whether a picture contains a chart, the target detection model is trained first.
  • In one embodiment, before the step of recognizing, through a preset target detection model, all pictures that contain charts as target pictures, the method further includes:
  • Training the target detection model. The step of training the target detection model includes: inputting graphs and tables into the target detection model so that the target detection model learns to recognize graphs and tables; inputting pictures carrying graphs and/or tables into the target detection model so that the target detection model recognizes the graphs and/or tables and correspondingly extracts the positions of the graphs and/or the tables; and training the target detection model until its recognition accuracy for graphs and/or tables satisfies a preset condition.
  • the training process of the target detection model is as follows:
  • Target detection (Object Detection) refers to finding the targets, which can also be called objects, in an image and determining their positions and sizes; it is one of the central problems in machine vision. In computer vision there are four categories of image-recognition tasks:
  • 1) Target classification (Classification): deals with the question "what?", that is, given a picture or a piece of video, determine what categories of targets it contains.
  • 2) Target localization (Location): deals with the question "where?", that is, locate the position of the target.
  • 3) Target detection (Detection): deals with the question "what? where?", that is, locate the target and know what it is.
  • 4) Target segmentation (Segmentation): divided into instance-level segmentation and scene-level segmentation; deals with the question of which object or scene each pixel belongs to.
  • Target detectors fall into two families: those based on candidate regions, such as the R-CNN, SPP-net, Fast R-CNN, Faster R-CNN and R-FCN models, and end-to-end target detection methods that need no region nomination, including YOLO and SSD. Since an existing model is used for training in the embodiment of this application, the embodiment takes a target detection model based on Faster R-CNN as an example to explain the technical solution of this application.
  • Specifically, graphs and tables are respectively input into the target detection model, so that the target detection model learns from the input graphs and tables what a graph is and what a table is, and thereby becomes able to recognize graphs and tables.
  • Since the target detection model itself can perform target localization, once it can recognize graphs and tables it can recognize the graphs and tables in an input picture and locate each recognized graph or table correspondingly, extracting their respective positions.
  • After the target detection model can recognize and locate the graphs and tables in input pictures, the model is trained with a large number of samples to improve its recognition accuracy for graphs and tables, and the training continues until the recognition accuracy of the target detection model for graphs and/or tables satisfies a preset condition. The preset condition refers to the required recognition accuracy of the model for graphs and for tables; for example, the model's recognition accuracy for graphs reaches above 90%, and its recognition accuracy for tables reaches above 95%. (A hedged training sketch is given below.)
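  • As a hedged illustration of this training step (not the application's own code), the following Python sketch fine-tunes a torchvision Faster R-CNN with two foreground classes, graph and table, and stops once an assumed held-out accuracy condition (90% for graphs, 95% for tables, as in the example above) is met. The data loaders and the evaluate_accuracy helper are assumptions supplied by the caller.

```python
# Hedged sketch: fine-tuning a Faster R-CNN detector for "graph" / "table" regions.
# Assumptions: the loaders yield (image_tensor, target_dict) pairs in torchvision detection
# format, and evaluate_accuracy(model, loader, device) returns (graph_acc, table_acc).
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 3  # background + graph + table

def build_model():
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)
    return model

def train(model, train_loader, val_loader, evaluate_accuracy, device="cuda", max_epochs=50):
    model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=5e-4)
    for _ in range(max_epochs):
        model.train()
        for images, targets in train_loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            loss = sum(model(images, targets).values())  # dict of detection losses
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        graph_acc, table_acc = evaluate_accuracy(model, val_loader, device)
        if graph_acc >= 0.90 and table_acc >= 0.95:  # assumed preset condition
            break
    return model
```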
  • The trained target detection model can then be used to recognize whether the pictures converted from the PDF contain graphs and/or tables. Specifically, each page of the PDF is first converted into a picture, and the trained target detection model, for example a trained Faster R-CNN target detection model, is then used to detect the converted pictures. If the target detection model detects that a picture contains graphs and/or tables, and the picture contains multiple graphs and/or multiple tables, the detected graphs and tables are classified and located one by one to determine which position in the picture is a graph and which position is a table, so that all the charts in the picture are recognized in turn, omissions are avoided, and the efficiency of locating the charts in the document is improved.
  • S103: Extract the chart in each target picture through the target detection model to identify the position of the chart within that target picture. Specifically, if a picture contains graphs and/or tables, the picture is taken as a target picture; the graphs and/or tables contained in the target picture are classified through the target detection model, which locates which position in the target picture is a graph and which position is a table, and the positions of the graphs and/or tables in the target picture are extracted. The position of a graph or table in the target picture can be represented by the coordinates of its four vertices in the target picture. If the picture contains no graph or table, the picture is discarded. (A hedged inference sketch is given below.)
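  • As an illustration only, assuming the fine-tuned Faster R-CNN from the training sketch above rather than the application's actual implementation, the following Python sketch runs the detector on each page picture, keeps the pictures in which a chart is found as target pictures, and returns the bounding box of every detected graph or table; the 0.5 score threshold is an assumption.

```python
# Hedged sketch: using the fine-tuned detector to select target pictures and extract chart boxes.
import torch
import torchvision.transforms.functional as F
from PIL import Image

LABELS = {1: "graph", 2: "table"}  # class ids assumed to match the training sketch above

@torch.no_grad()
def detect_charts(model, picture_paths, device="cuda", score_thr=0.5):
    model.eval().to(device)
    results = {}
    for path in picture_paths:
        image = F.to_tensor(Image.open(path).convert("RGB")).to(device)
        output = model([image])[0]  # dict with "boxes", "labels", "scores"
        charts = [
            {"type": LABELS[int(label)], "box": box.tolist()}
            for box, label, score in zip(output["boxes"], output["labels"], output["scores"])
            if score >= score_thr
        ]
        if charts:                  # only pictures that contain a chart become target pictures
            results[path] = charts
    return results
```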
  • Further, when a candidate-region-based target detection model (also called a target detector) performs target detection, the first step is region nomination (Region Proposal), that is, finding possible regions of interest (ROI).
  • Region nomination methods include the following:
  • 1) Sliding window. The sliding window is essentially an exhaustive method: all possible blocks of different scales and aspect ratios are enumerated and sent for recognition, and those recognized with high probability are kept. However, such a method is too complex and generates many redundant candidate regions, so it is not feasible in practice.
  • 2) Rule blocks. Some pruning is applied on top of the exhaustive method, using only fixed sizes and aspect ratios. This is very effective in some specific application scenarios, such as Chinese-character detection in a photo-based question-search app, because Chinese characters are square and their aspect ratios are mostly consistent, so using rule blocks for region nomination is an appropriate choice. For ordinary target detection, however, rule blocks still need to visit many positions, and the complexity is high.
  • 3) Selective search. From a machine-learning perspective, the methods above achieve good recall but unsatisfactory precision, so the core problem is how to remove redundant candidate regions effectively. Most redundant candidate regions overlap with one another; selective search exploits this and merges adjacent overlapping regions bottom-up, thereby reducing redundancy. Taking R-CNN as an example: R-CNN is the abbreviation of Region-based Convolutional Neural Networks, a target detection method that combines region nomination (Region Proposal) with convolutional neural networks (CNN). The main steps of R-CNN include: (1) region nomination, extracting about 2000 region candidate boxes from the original image through Selective Search; (2) region size normalization, scaling all candidate boxes to a fixed size, for example 227×227; (3) feature extraction, extracting features through a CNN network; (4) classification and regression, adding two fully connected layers on top of the feature layer, then using SVM classification for recognition and linear regression to fine-tune the position and size of the bounding box, with a separate bounding-box regressor trained for each category.
  • Further, the main steps of Fast R-CNN are as follows: (1) feature extraction, taking the whole picture as input and using a CNN to obtain the feature layer of the picture; (2) region nomination, extracting region candidate boxes from the original picture with methods such as Selective Search and projecting these candidate boxes one by one onto the final feature layer; (3) region normalization, performing RoI Pooling on each region candidate box on the feature layer to obtain a fixed-size feature representation; (4) classification and regression, passing through two fully connected layers, using softmax multi-classification for target recognition and a regression model to fine-tune the position and size of the bounding box.
  • Still further, the main steps of Faster R-CNN are as follows: (1) feature extraction, as in Fast R-CNN, taking the whole picture as input and using a CNN to obtain the feature layer of the picture; (2) region nomination, using k different anchor boxes for nomination on the final convolutional feature layer, with k generally taken as 9; (3) classification and regression, classifying the region corresponding to each anchor box as object/non-object, using k regression models (each corresponding to a different anchor box) to fine-tune the position and size of the candidate box, and finally classifying the target. In short, Faster R-CNN abandons Selective Search and introduces the RPN network, so that region nomination, classification and regression share the convolutional features, which gives a further speedup. However, Faster R-CNN needs to first determine, for about twenty thousand anchor boxes, whether each one is a target (target determination) and then perform target recognition, so the process is split into two steps.
  • S104: Combine, in a preset order, the position of each target picture in the PDF document with the position of the chart in the corresponding target picture to generate the position of the chart in the PDF document. The preset order includes the order in which the position of the target picture in the PDF document comes first and the position of the chart within the corresponding target picture comes second, or the order in which the position of the target picture in the PDF document comes second and the position of the chart within the corresponding target picture comes first.
  • Specifically, the position of the chart in the PDF document is located according to the position of each target picture in the PDF document and the position of the chart within the corresponding target picture: after the position of the chart within each target picture is determined, the position of the chart in the PDF document is finally located according to the position of that target picture in the PDF document. For example, if the coordinates of a chart L on the third page of PDF document A are (x1, y1), the position of chart L in the PDF document can be described as A3(x1, y1), or as (x1, y1)A3. (A small sketch of this combination step follows.)
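  • Purely as an illustration of this combination step (the document name A, the page number and the coordinates are the example values from the text; the formatting function itself is an assumption):

```python
# Hedged sketch: combining the preset position identifier of the target picture with the
# chart's in-picture position, in either of the two preset orders described above.
def chart_position(doc_name: str, page: int, xy: tuple, picture_first: bool = True) -> str:
    picture_id = f"{doc_name}{page}"   # e.g. "A3" for page 3 of document A
    coords = f"({xy[0]}, {xy[1]})"
    return picture_id + coords if picture_first else coords + picture_id

# chart_position("A", 3, ("x1", "y1"))         -> "A3(x1, y1)"
# chart_position("A", 3, ("x1", "y1"), False)  -> "(x1, y1)A3"
```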
  • When the embodiment of this application locates the charts in a PDF document, the PDF file is obtained and converted in a preset manner into separate, independent pictures; all pictures containing charts are recognized through the preset target detection model as target pictures; the position of the chart in each target picture is extracted through the target detection model; and the position of the chart in the PDF document is located according to the position of each target picture in the PDF document and the position of the chart within the corresponding target picture. This makes it possible to automatically identify which area of a PDF document is a graph or a table, and when the charts in the PDF file need to be used (for example, when converting the PDF document to WORD format), the accurate recognition and positioning of the charts improves the efficiency of using the PDF file.
  • In one embodiment, after the step of combining, in a preset order, the position of each target picture in the PDF document with the position of the chart in the corresponding target picture to generate the position of the chart in the PDF document, the method further includes: displaying the information of all the target pictures in list form, in a preset numbering order, according to the order of the target pictures in the PDF document, the information including: the type of the chart, the position of the chart within each target picture, the position of each target picture in the PDF document, and the position of the chart in the PDF document.
  • For example, Table 1 is an example of the information for each target picture containing a chart in a PDF document. As shown in Table 1, the graphs and tables are described with the uniform numbers 1, 2 and 3: the charts contained in PDF document A include table 1, graph 2 and table 3. In Table 1, the coordinates of a single vertex are used to illustrate the position of one vertex of the chart within each target picture: a vertex of table 1 is at coordinates (x1, y1) on page 3 of PDF document A, a vertex of graph 2 is at coordinates (x2, y2) on page 7 of PDF document A, and a vertex of table 3 is at coordinates (x3, y3) on page 9 of PDF document A.
  • In general, the coordinates of the four vertices of a table are enough to determine the position of the table within each target picture, while the position of a graph within each target picture can be determined by the coordinates of its n vertices, where n ≥ 3 and n is an integer. For example, a triangular graph can use the coordinates of its three vertices to describe its position in each target picture, a quadrilateral can use the coordinates of its four vertices, and a pentagonal graph can use the coordinates of its five vertices.
  • Further, the graphs and the tables can also each be described in the order of their own preset numbers 1, 2, 3; that is, the tables are described in the order of the tables' preset numbers 1, 2, 3 and the graphs in the order of the graphs' preset numbers 1, 2, 3, so that the tables can be described as table 1, table 2 and table 3, and the graphs as graph 1, graph 2 and graph 3.
  • Displaying the information of every target picture containing a chart in list form, in a preset numbering order, can be implemented by using JS to create an Excel-like table on the page. JS is JavaScript, the programming language of the Web; it is used with HTML combined with CSS structural and style code. For example, the Table style in CSS can be used to display the information of each target picture containing a chart in table form, where CSS (Cascading Style Sheets) refers to cascading style sheets. (A hedged sketch of assembling this list is given below.)
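  • The application describes rendering this list with JS, HTML and CSS. Purely as an illustrative sketch in Python (an assumption, not the described JS implementation), the records could be assembled and emitted as a simple HTML table, reusing the example values of Table 1:

```python
# Hedged sketch: assembling the per-target-picture information and rendering it as an HTML
# table (the application itself describes a JS/CSS Table implementation for this display).
records = [
    {"number": 1, "type": "table", "in_picture": "(x1, y1)", "picture": "A3", "in_pdf": "A3(x1, y1)"},
    {"number": 2, "type": "graph", "in_picture": "(x2, y2)", "picture": "A7", "in_pdf": "A7(x2, y2)"},
    {"number": 3, "type": "table", "in_picture": "(x3, y3)", "picture": "A9", "in_pdf": "A9(x3, y3)"},
]

def to_html_table(rows):
    header = "".join(f"<th>{key}</th>" for key in rows[0])
    body = "".join("<tr>" + "".join(f"<td>{v}</td>" for v in row.values()) + "</tr>" for row in rows)
    return f"<table><tr>{header}</tr>{body}</table>"

print(to_html_table(records))
```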
  • In one embodiment, the step of extracting the chart in each target picture through the target detection model to identify the position of the chart within that target picture includes: extracting the chart in each target picture through the target detection model to identify the preset-area position of the chart within the corresponding target picture, where the preset area includes m areas, m ≥ 2, and m is an integer.
  • Specifically, in the target detection model, target localization means not only recognizing what an object is, that is, classification, but also predicting the object's position, which is generally marked with a bounding box; target detection is essentially the localization of multiple targets, that is, locating multiple target objects in the target picture, including classification and localization. Therefore, the training of the target detection model includes localizing the target, namely the position of the target in the image. Each page of the PDF can be converted into a target picture, and the target picture can then be divided into m preset areas, m ≥ 2, m an integer, with the preset areas used to describe the position of the chart within each target picture. For example, taking the division of each target picture into four areas as an example, please refer to FIG. 2: the preset areas in FIG. 2 include a first area, a second area, a third area and a fourth area, and the position of the chart within each target picture is described by judging which of the first, second, third or fourth areas the chart falls in. The larger m is, the finer the area division of each page and the more precise the description of the chart's position; the value of m, that is, how many preset areas each target picture is divided into, can be determined according to actual needs. (A small sketch of this area assignment follows.)
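  • As an illustration only (m = 4 here, matching FIG. 2; the function, its box format and the mapping of quadrants to the first to fourth areas are assumptions, since FIG. 2 defines the actual layout), a detected chart box can be assigned to a preset area by comparing its center with the center of the page picture:

```python
# Hedged sketch: assigning a detected chart box to one of m = 4 preset areas of the page picture.
def preset_area(box, picture_width, picture_height):
    x1, y1, x2, y2 = box                          # (left, top, right, bottom) in pixels
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0     # center of the chart box
    col = 0 if cx < picture_width / 2.0 else 1
    row = 0 if cy < picture_height / 2.0 else 1
    names = {(0, 0): "first area", (1, 0): "second area",
             (0, 1): "third area", (1, 1): "fourth area"}  # assumed correspondence
    return names[(col, row)]
```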
  • In one embodiment, the step of extracting the chart in each target picture through the target detection model to identify the position of the chart within that target picture includes: extracting the chart in each target picture through the target detection model to identify the coordinates of the n vertices of the chart within the corresponding target picture, where n ≥ 3 and n is an integer. That is, besides describing the position of the chart by dividing each target picture into areas, the position of the chart within each target picture can also be described with coordinates within the target picture.
  • For example, a triangular graph can use the coordinates of its three vertices to describe its position in each target picture, a table or a quadrilateral graph can use the coordinates of its four vertices, and a pentagonal graph can use the coordinates of its five vertices, which achieves a more precise description of the chart's position. Please continue to refer to Table 1: the graphs and tables are described with the uniform numbers 1, 2 and 3, the charts contained in PDF document A include table 1, graph 2 and table 3, and the coordinates of a single vertex illustrate where one vertex of each chart lies within its target picture, namely a vertex of table 1 at coordinates (x1, y1) on page 3 of PDF document A, a vertex of graph 2 at coordinates (x2, y2) on page 7, and a vertex of table 3 at coordinates (x3, y3) on page 9.
  • As noted above, target localization in the target detection model involves not only recognizing the object, that is, classification, but also predicting its position, which is generally marked with a bounding box; target detection is essentially the localization of multiple targets in the picture, including classification and localization, so the training of the target detection model includes localizing the target, that is, the position of the target in the image.
  • In addition, when a deep learning model is used for table recognition within text recognition, the table is extracted first: OpenCV functions can be used to gray-scale and binarize the picture, and the table lines are obtained after erosion and dilation; the coordinates of the cell intersections are obtained from the table lines, and the coordinates of the four vertices of the table can then be obtained by comparing the sizes of the horizontal and vertical coordinates of the cell intersections. (A hedged OpenCV sketch follows.)
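  • The following Python/OpenCV sketch illustrates that table-corner extraction, under the assumption that the kernel sizes and the min/max rule used to pick the corners match the intent of the description; it is not the application's own code.

```python
# Hedged sketch: extracting table lines with OpenCV (grayscale, binarize, erode/dilate),
# intersecting them to obtain the cell-intersection points, and comparing their coordinates
# to obtain the four table vertices.
import cv2
import numpy as np

def table_corners(image_path):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    binary = cv2.adaptiveThreshold(~gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY, 15, -2)
    # Erode then dilate with long thin kernels to keep only horizontal / vertical table lines.
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (max(gray.shape[1] // 30, 1), 1))
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, max(gray.shape[0] // 30, 1)))
    horizontal = cv2.dilate(cv2.erode(binary, h_kernel), h_kernel)
    vertical = cv2.dilate(cv2.erode(binary, v_kernel), v_kernel)
    intersections = cv2.bitwise_and(horizontal, vertical)   # cell-intersection points
    ys, xs = np.where(intersections > 0)
    if len(xs) == 0:
        return None
    # Compare the horizontal and vertical coordinates of the intersections to get the vertices.
    return {"top_left": (int(xs.min()), int(ys.min())), "top_right": (int(xs.max()), int(ys.min())),
            "bottom_left": (int(xs.min()), int(ys.max())), "bottom_right": (int(xs.max()), int(ys.max()))}
```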
  • FIG. 3 is a schematic block diagram of a positioning device for a chart in a PDF document provided by an embodiment of the application.
  • Corresponding to the above method for locating charts in a PDF document, an embodiment of the present application also provides a device for locating charts in a PDF document. The device includes units for executing the above method and can be configured in a computer device such as a terminal or a server.
  • the positioning device 300 of the chart in the PDF document includes a conversion unit 301, a recognition unit 302, an extraction unit 303 and a positioning unit 304.
  • The conversion unit 301 is configured to obtain a PDF document and convert each page of the PDF document, in a preset manner and according to the position of that page in the PDF document, into a picture carrying a preset position identifier; the recognition unit 302 is configured to recognize, through a preset target detection model, all pictures that contain charts as target pictures, where the charts include graphs and tables; the extraction unit 303 is configured to extract the chart in each target picture through the target detection model to identify the position of the chart within that target picture; and the positioning unit 304 is configured to combine, in a preset order, the position of each target picture in the PDF document with the position of the chart in the corresponding target picture to generate the position of the chart in the PDF document.
  • In one embodiment, the device 300 for locating charts in a PDF document further includes a display unit, configured to display the information of all the target pictures in list form, in a preset numbering order, according to the order of the target pictures in the PDF document, the information including: the type of the chart, the position of the chart within each target picture, the position of each target picture in the PDF document, and the position of the chart in the PDF document.
  • In one embodiment, the extraction unit 303 is configured to extract the chart in each target picture through the target detection model to identify the preset-area position of the chart within the corresponding target picture, where the preset area includes m areas, m ≥ 2, and m is an integer.
  • In one embodiment, the extraction unit 303 is configured to extract the chart in each target picture through the target detection model to identify the coordinates of the n vertices of the chart within the corresponding target picture, where n ≥ 3 and n is an integer.
  • the device 300 for locating charts in the PDF document further includes:
  • the training unit is used to train the target detection model; the training unit includes:
  • The recognition subunit is configured to input graphs and tables into the target detection model so that the target detection model recognizes the graphs and the tables;
  • The extraction subunit is configured to input pictures carrying graphs and/or tables into the target detection model so that the target detection model recognizes the graphs and/or the tables, and correspondingly extracts the positions of the graphs and/or the positions of the tables;
  • the training subunit is used to train the target detection model until the recognition accuracy of the graph and/or the table by the target detection model meets a preset condition.
  • the target detection model is a deep learning model.
  • the deep learning model is a Faster R-CNN model.
  • In one embodiment, the conversion unit 301 is configured to convert, through the Icepdf control, each page of the PDF document, according to the position of that page in the PDF document, into a picture in JPG or JPEG format carrying a preset position identifier.
  • It should be noted that the division and connection of the units in the above device for locating charts in a PDF document are only for illustration; in other embodiments, the device may be divided into different units as needed, or the units in the device may adopt different connection orders and manners, to complete all or part of the functions of the device.
  • the positioning device of the chart in the PDF document can be implemented in the form of a computer program, and the computer program can be run on the computer device as shown in FIG. 4.
  • the computer device 400 may be a computer device such as a desktop computer or a server, or may be a component or component in other devices.
  • the computer device 400 includes a processor 402, a memory, and a network interface 405 connected through a system bus 401, where the memory may include a non-volatile storage medium 403 and an internal memory 404.
  • the non-volatile storage medium 403 can store an operating system 4031 and a computer program 4032.
  • When the computer program 4032 is executed, it can cause the processor 402 to perform the method for locating charts in a PDF document.
  • the processor 402 is used to provide calculation and control capabilities to support the operation of the entire computer device 400.
  • the internal memory 404 provides an environment for the running of the computer program 4032 in the non-volatile storage medium 403.
  • When the computer program 4032 is executed by the processor 402, it can cause the processor 402 to perform the above-mentioned method for locating charts in a PDF document.
  • the network interface 405 is used for network communication with other devices.
  • the specific computer device 400 may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement.
  • the computer device may only include a memory and a processor. In such an embodiment, the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 4, and will not be repeated here.
  • the processor 402 is configured to run a computer program 4032 stored in a memory to implement the method for locating a chart in a PDF document in the embodiment of the present application.
  • the processor 402 may be a central processing unit (Central Processing Unit, CPU), and the processor 402 may also be other general-purpose processors, digital signal processors (DSPs), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • the embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor executes the steps of the method for positioning a chart in a PDF document described in the above embodiments.
  • The storage medium is a physical, non-transitory storage medium, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disk, or another physical storage medium capable of storing a computer program.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of this application provide a method, a device, computer equipment and a computer-readable storage medium for locating charts in a PDF document. The embodiments belong to the field of image processing technology. When locating the charts in a PDF document, a PDF document is obtained; each page of the PDF document is converted, in a preset manner and according to the position of that page in the PDF document, into a picture carrying a preset position identifier; a preset target detection model recognizes, among all the pictures, the pictures that contain charts as target pictures; the target detection model extracts the chart in each target picture to identify the position of the chart within that target picture; and the position of each target picture in the PDF document and the position of the chart within the corresponding target picture are combined in a preset order to generate the position of the chart in the PDF document, thereby accurately locating the charts in the PDF.

Description

PDF文档中图表的定位方法、装置及计算机设备
本申请要求于2019年5月30日提交中国专利局、申请号为201910462305.7、申请名称为“PDF文档中图表的定位方法、装置及计算机设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及数据处理技术领域,尤其涉及一种PDF文档中图表的定位方法、装置、计算机设备及计算机可读存储介质。
背景技术
现有的各类针对PDF文档的解析方式只能单独的提取PDF文档中的图片或内容,不能确切的知道PDF文档中哪块位置是表格,哪块位置是图形,由于无法准确确定PDF文档中的图表位置,降低了PDF文档的使用效率。
发明内容
本申请实施例提供了一种PDF文档中图表的定位方法、装置、计算机设备及计算机可读存储介质,能够解决传统技术中由于无法准确定位PDF文档中图表的位置导致PDF文档的使用效率低的问题。
第一方面,本申请实施例提供了一种PDF文档中图表的定位方法,所述方法包括:获取PDF文档,通过预设方式将所述PDF文档中的每页文档按照所述每页文档在所述PDF文档中的位置转换为携带有预设位置标识的每张图片;通过预设的目标检测模型识别出所有所述图片中包含图表的图片作为目标图片,所述图表包括图形和表格;通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表在对应每张所述目标图片中的位置;以每张所述目标图片在所述PDF文档中的位置及所述图表在对应每张所述目标图片中的位置按照预设顺序组合以生成所述图表在所述PDF文档中的位置。
第二方面,本申请实施例还提供了一种PDF文档中图表的定位装置,包括: 转换单元,用于获取PDF文档,通过预设方式将所述PDF文档中的每页文档按照所述每页文档在所述PDF文档中的位置转换为携带有预设位置标识的每张图片;识别单元,用于通过预设的目标检测模型识别出所有所述图片中包含图表的图片作为目标图片,所述图表包括图形和表格;提取单元,用于通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表在对应每张所述目标图片中的位置;定位单元,用于以每张所述目标图片在所述PDF文档中的位置及所述图表在对应每张所述目标图片中的位置按照预设顺序组合以生成所述图表在所述PDF文档中的位置。
第三方面,本申请实施例还提供了一种计算机设备,其包括存储器及处理器,所述存储器上存储有计算机程序,所述处理器执行所述计算机程序时实现所述PDF文档中图表的定位方法。
第四方面,本申请实施例还提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时使所述处理器执行所述PDF文档中图表的定位方法。
附图说明
为了更清楚地说明本申请实施例技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为本申请实施例提供的PDF文档中图表的定位方法的流程示意图;
图2为本申请实施例提供的PDF文档中图表的定位方法中一个图表位置区域划分示意图;
图3为本申请实施例提供的PDF文档中图表的定位装置的示意性框图;以及
图4为本申请实施例提供的计算机设备的示意性框图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请实施例提供的PDF文档中图表的定位方法可应用于终端或者服务器等计算机设备中,通过安装于终端或者服务器上的软件来实现所述PDF文档中图表的定位方法的步骤,其中所述终端可以为手机、笔记本电脑、平板电脑或者台式电脑等电子设备,所述服务器可以为云服务器或者服务器集群等。以终端为例,本申请实施例提供的PDF文档中图表的定位方法的具体实现过程如下:终端获取PDF文档,通过预设方式将所述PDF文档中的每页文档按照所述每页文档在所述PDF文档中的位置转换为携带有预设位置标识的每张图片;通过预设的目标检测模型识别出所有所述图片中包含图表的图片作为目标图片,所述图表包括图形和表格;通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表在对应每张所述目标图片中的位置;以每张所述目标图片在所述PDF文档中的位置及所述图表在对应每张所述目标图片中的位置按照预设顺序组合以生成所述图表在所述PDF文档中的位置。
需要说明的是,在实际操作过程中,上述PDF文档中图表的定位方法的应用场景仅仅用于说明本申请技术方案,并不用于限定本申请技术方案。
图1为本申请实施例提供的PDF文档中图表的定位方法的示意性流程图。该PDF文档中图表的定位方法应用于终端或者服务器中,以完成PDF文档中图表的定位方法的全部或者部分功能。请参阅图1,如图1所示,该方法包括以下步骤S101-S104:
S101、获取PDF文档,通过预设方式将所述PDF文档中的每页文档按照所述每页文档在所述PDF文档中的位置转换为携带有预设位置标识的每张图片。
其中,预设位置标识指每页PDF文档在整个PDF文档中的位置描述,可以为每页PDF文档在PDF文档中页码编码,比如,文档页码用数字“1、2、3…”等描述,预设位置标识可以为PDF的第1页、第2页、第3页…。进一步地,所述预设位置标识还可以添加上该PDF文档的文档名称或者文档编号,比如, 文档名称为A文档,A文档的第3页可描述为A3,通过文档名称与文档页码的结合,可以提高对PDF文件的辨识效率。
预设方式包括不同编程语言中对应的将PDF文档转换为图片的方法，比如，JAVA中实现PDF文档转换为图片可以通过第三方提供的架包，比如下载Icepdf的架包，或者Jpedal的架包等。
具体地,获取PDF文档,通过预设方式将所述PDF文档中的每页文档按照所述每页文档在所述PDF文档中的位置转换为携带有预设位置标识的每张图片。获取PDF文件后,可以通过预设方式将所述PDF文档每一页转换为一张图片,PDF文档包含多页就对应转换成多张图片,可以转换为JPG格式或者JPEG格式,JAVA中实现将PDF文档转图片可以通过第三方提供的架包,比如下载Icepdf的架包,并导入项目中,通过Icepdf控件将所述PDF文档转换为若干图片。或者下载Pdfbox的架包,并导入项目,还可以采用下载Jpedal的架包,并导入项目中,均可以将所述PDF文档转换为图片格式,比如,通过Icepdf控件将所述PDF文档中的每页文档按照所述每页文档在所述PDF文档中的位置转换为携带有预设位置标识的JPG格式或者JPEG格式的每张图片。
S102、通过预设的目标检测模型识别出所有所述图片中包含图表的图片作为目标图片,所述图表包括图形和表格。
其中,图表是指图形和表格。目标检测,也叫目标提取,是一种基于目标几何和统计特征的图像分割,它将目标的分割和识别合二为一。目标检测对于人类来说并不困难,通过对图片中不同颜色模块的感知很容易定位并分类出其中目标物体,但对于计算机来说,面对的是RGB像素矩阵,很难从图像中直接得到抽象概念对应的目标并定位其位置,再加上有时候多个物体和杂乱的背景混杂在一起,目标检测更加困难。“目标检测”主要解决两个问题:图像上多个目标物在哪里,也就是目标位置,目标是什么,也就是目标的类别。
具体地,使用训练好的预设的目标检测模型识别每张所述图片以判断每张所述图片中是否包含图表,所述图表包括图形和表格,若所述图片中包含图形和/或表格,以所有所述图片中包含图形和/或表格的图片作为目标图片,进一步通过所述目标检测模型提取每张所述目标图片中的图形和/或表格,若所述图片 中不包含图表,对所述图片不处理,丢弃掉该图片,也可以称为过滤掉该图片,也就是对该图片不用处理。
进一步地,目标检测模型是基于目标检测算法进行目标检测的,目标检测算法主要是基于深度学习模型,本申请实施例实现基于深度学习的PDF文档中图表的定位,深度学习模型可以分成两大类:(1)Two-stage检测算法,其将检测问题划分为两个阶段,首先产生候选区域,英文为Region proposals,然后对候选区域分类,一般还需要对位置进行精修,这类算法的典型代表是基于Region proposal的R-CNN系算法,如R-CNN,Fast R-CNN,Faster R-CNN等;(2)One-stage检测算法,其不需要Region proposal阶段,直接产生物体的类别概率和位置坐标值,比较典型的算法如YOLO和SSD。
通过目标检测模型可以识别一张目标图片中的多个物体,并可以定位出不同物体,主要是给出物体的边界框。在使用目标检测模型识别所述图片中是否包含图表之前,先进行目标检测模型的训练。
在一个实施例中,所述通过预设的目标检测模型识别出所有所述图片中包含图表的图片作为目标图片的步骤之前,还包括:
训练所述目标检测模型。所述训练所述目标检测模型的步骤包括:将图形和表格分别输入目标检测模型以使所述目标检测模型识别所述图形和所述表格;将携带有图形和/或表格的图片输入至所述目标检测模型以使所述目标检测模型识别出所述图形和/或所述表格,并对应提取所述图形的位置和/或所述表格的位置;训练所述目标检测模型直至所述目标检测模型对所述图形和/或所述表格的识别准确率满足预设条件。
具体地,目标检测模型的训练过程如下:
(1)先建立目标检测模型。
其中,目标检测,英文为Object Detection,是指找出图像中的目的或者目标,目标又可以称为物体,确定它们的位置和大小,是机器视觉范畴的中心问题之一。计算机视觉中关于图像识别有四大类任务:
1)目标分类,英文为Classification。处置“是什么?”的问题,即给定一张图片或一段视频判别里面包含什么类别的目的。
2)目标定位,英文为Location。处置“在哪里?”的问题,即定位出这个目的的位置。
3)目标检测,英文为Detection。处置“是什么?在哪里?”的问题,即定位出这个目的的位置并且知道目的物是什么。
4)目标分割-Segmentation。分为实例的分割(英文为Instance-level)和场景分割(英文为Scene-level)。处置“每一个像素属于哪个目的物或场景”的问题。其中,基于候选区域的目标检测器,包括基于候选区域的,如R-CNN,SPP-net,Fast R-CNN,Faster R-CNN及R-FCN等模型,基于端到端(End-to-End)的目标检测方法,这些方法无需区域提名,包括YOLO和SSD,由于在本申请实施例中采取现有模型进行训练,在本申请实施例中,采取基于Faster R-CNN的目标检测模型为例来说明本申请技术方案。
(2)训练目标检测模型。建立完目标检测模型后,训练目标检测模型。训练所述目标检测模型的步骤包括:
1)将图形和表格分别输入目标检测模型以使所述目标检测模型识别所述图形和所述表格。
具体地,将图形和表格分别输入目标检测模型,使所述目标检测模型根据输入的图形和表格认识什么是图形及什么是表格,从而使所述目标检测模型能够识别出所述图形和所述表格。其中,训练目标检测模型的图表有以下两种:
1)将图形和表格分别输入目标检测模型,并告诉目标检测模型哪些是图形和哪些是表格,然后输入其他的图形和表格训练所述目标检测模型,直到目标检测模型对图形和表格的识别准确率达到需求,比如目标检测模型对图表的识别准确率在百分之九十之上。
2)输入从PDF中提取的图片,检测所述图片中是否有图形或者表格,假如图片中有图形或者表格,告诉目标检测模型哪些是图形和哪些是表格以让目标检测模型能够识别出图形和表格。
需要说明的是,这里只是教会目标检测模型识别出来什么是图形和什么是表格,重要的是模型能识别出来什么样的是图形和什么样的是表格,训练模型时重要的是能够识别出来图形和表格,而不在于图形或者表格的载体是什么, 也就是不一定非要是图片上的图形或者表格,就像进行人脸识别一样,可以采用活体的人脸识别人的五官,也可以通过照片识别人的五官,只要能识别出来人的五官就可以,五官的载体是次要的。当然,若能使用将PDF转换的图片来训练目标检测模型,效果会更准确。
2)将携带有图形和/或表格的图片输入至所述目标检测模型以使所述目标检测模型识别出所述图形和/或所述表格,并对应提取所述图形的位置和/或所述表格的位置。
具体地,由于目标检测模型本身能够进行目标定位,目标检测模型能够识别出图形和表格后,目标检测模型可以对输入的图片进行图形和表格的识别并对识别出的图形和表格进行对应的定位,提取图形和表格各自的位置,从而完成对输入图片中图形和表格的识别及定位。
3)训练所述目标检测模型直至所述目标检测模型对所述图形和/或所述表格的识别准确率满足预设条件。
具体地,目标检测模型能够对输入图片进行图形和表格各自的识别及定位后,通过大量样本的输入训练目标检测模型,提高目标检测模型对图形和表格识别的准确度,训练所述目标检测模型直至所述目标检测模型对所述图形和/或所述表格的识别准确率满足预设条件,所述预设条件是指目标检测模型对图形的识别准确率及目标检测模型对表格的识别准确率,比如,目标检测模型对图形的识别准确率达到90%以上,及目标检测模型对表格的识别准确率95%以上等。
训练完成的目标检测模型可以用来识别PDF转换成的图片中是否包含图形和/或表格。具体地,首先将PDF每一页转换为一张一张的图片,然后通过训练好的目标检测模型对转换后的图片进行检测,比如训练完成的FASTER-RCNN目标检测模型对图片进行检测,若目标检测模型检测到图片中包含图形和/或表格,若图片中包含多个图形和/或多个表格时,对检测到的图形和/或表格进行分类,并且逐一进行定位以确定图片中哪个位置是图形,哪个位置是表格,从而顺序识别出所述图片中的所有图表,避免对图片中的图表产生遗漏,提高对文档中图表的定位效率。
S103、通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表在对应每张所述目标图片中的位置。
具体地,若所述图片中包含有图形和/或表格,将该图片作为目标图片,通过目标检测模型对目标图片中包含的图形和/或表格进行分类,并定位目标图片中哪个位置是图形,哪个位置是表格,并可以提取所述图形和/或表格在目标图片中的位置,所述图形或者表格在目标图片中的位置可以通过图形或者表格的四个顶点在所述目标图片中的坐标来表示。若所述图片中未包含有图片或者表格,则丢弃该张图片。
进一步地,基于候选区域的目标检测模型(又称为目标检测器)进行目标检测时,目标检测的第一步是要做区域提名(英文为Region Proposal),也就是找出可能的感兴趣区域(英文为Region Of Interest,ROI)。区域提名方法包括以下几种:
1)、滑动窗口。滑动窗口本质上就是穷举法,利用不同的尺度和长宽比把所有可能的大大小小的块都穷举出来,然后送去识别,识别出来概率大的就留下来。但是,这样的方法复杂度太高,产生了很多的冗余候选区域,在现实当中不可行。
2)、规则块。在穷举法的基础上进行了一些剪枝,只选用固定的大小和长宽比。这在一些特定的应用场景是很有效的,比如拍照搜题APP中的汉字检测,因为汉字方方正正,长宽比大多比较一致,因此用规则块做区域提名是一种比较合适的选择。但是对于普通的目标检测来说,规则块依然需要访问很多的位置,复杂度高。
3)、选择性搜索。从机器学习的角度来说,前面的方法召回是不错了,但是精度差强人意,所以问题的核心在于如何有效地去除冗余候选区域。其实冗余候选区域大多是发生了重叠,选择性搜索利用这一点,自底向上合并相邻的重叠区域,从而减少冗余。以R-CNN为例,R-CNN是Region-based Convolutional Neural Networks的缩写,中文翻译是基于区域的卷积神经网络,是一种结合区域提名(英文为Region Proposal)和卷积神经网络(英文为Convolutional Neural Networks,简写为CNN)的目标检测方法,R-CNN的主要步骤包括:(1)、区 域提名,通过Selective Search从原始图片提取2000个左右区域候选框;(2)区域大小归一化,把所有侯选框缩放成固定大小,比如,采用227×227);(3)特征提取,通过CNN网络,提取特征;(4)分类与回归,在特征层的基础上添加两个全连接层,再用SVM分类来做识别,用线性回归来微调边框位置与大小,其中每个类别单独训练一个边框回归器。
进一步地,Fast R-CNN的主要步骤如下:(1)特征提取,以整张图片为输入利用CNN得到图片的特征层;(2)区域提名,通过Selective Search等方法从原始图片提取区域候选框,并把这些候选框一一投影到最后的特征层;(3)区域归一化,针对特征层上的每个区域候选框进行RoI Pooling操作,得到固定大小的特征表示;(4)分类与回归,然后再通过两个全连接层,分别用softmax多分类做目标识别,用回归模型进行边框位置与大小微调。
更进一步地,Faster R-CNN的主要步骤如下:(1)特征提取,同Fast R-CNN,以整张图片为输入,利用CNN得到图片的特征层;(2)区域提名,在最终的卷积特征层上利用k个不同的矩形框(Anchor Box)进行提名,k一般取9;(3)分类与回归,对每个Anchor Box对应的区域进行object/non-object二分类,并用k个回归模型(各自对应不同的Anchor Box)微调候选框位置与大小,最后进行目标分类。
总之,Faster R-CNN抛弃了Selective Search,引入了RPN网络,使得区域提名、分类、回归一起共用卷积特征,从而得到了进一步的加速。但是,Faster R-CNN需要对两万个Anchor Box先判断是否是目标(目标判定),然后再进行目标识别,分成了两步。
S104、以每张所述目标图片在所述PDF文档中的位置及所述图表在对应每张所述目标图片中的位置按照预设顺序组合以生成所述图表在所述PDF文档中的位置。
其中,预设顺序包括每张所述目标图片在所述PDF文档中的位置在前、所述图表在对应每张所述目标图片中的位置在后的顺序,或者每张所述目标图片在所述PDF文档中的位置在后、所述图表在对应每张所述目标图片中的位置在前的顺序。
具体地,根据每张所述目标图片在所述PDF文档中的位置和所述图表在对应每张所述目标图片中的位置定位所述图表在所述PDF文档中的位置,即确定所述图表在对应每张目标图片中的位置后,再根据每张所述目标图片在所述PDF文档中的位置,最后定位所述图表在所述PDF文档中的位置。比如,若有一图表L在PDF文档A的第3页的坐标为(x1,y1),图表L在PDF文档的位置可以描述为A3(x1,y1),或者图表L在PDF文档的位置可以描述为(x1,y1)A3。
本申请实施例实现PDF文档中图表的定位时,通过获取PDF文件,通过预设方式将所述PDF文件转换为一张一张的独立图片,通过预设的目标检测模型识别出所有所述图片中包含图表的图片作为目标图片,通过所述目标检测模型提取每张所述目标图片中所述图表的位置,根据每张目标图片在PDF文档中的位置和图表在对应每张目标图片中的位置定位图表在PDF文档中的位置,能够实现自动识别PDF文档中哪块区域是图形或者表格,当需要使用PDF文件当中的图表时,比如,将PDF文档转换为WORD格式时,由于对PDF文件中的图表进行了准确的识别和定位,可以提高PDF文件的使用效率。
在一个实施例中,所述以每张所述目标图片在所述PDF文档中的位置及所述图表在对应每张所述目标图片中的位置按照预设顺序组合以生成所述图表在所述PDF文档中的位置的步骤之后,还包括:按照每张所述目标图片在所述PDF文档中的顺序以列表形式按照预设编号顺序显示所有所述目标图片的信息,所述信息包括:图表的类型、图表在每张所述目标图片的位置、每张所述目标图片在所述PDF文档中的位置、所述图表在所述PDF文档中的位置。
具体地,按照每张所述目标图片在所述PDF文档中的顺序以列表形式按照预设编号顺序显示所有所述目标图片的信息,所述信息包括:图表的类型、图表在每张所述目标图片的位置、每张所述目标图片在所述PDF文档中的位置、所述图表在所述PDF文档中的位置。比如,请参阅表1,表1为一PDF文档中包含图表的每张所述目标图片的信息示例,如表1所示,其中图形和表格用统一的编号1、2、3描述,PDF文档A包含的图表包括表格1、图形2及表格3,在表1中用一个顶点的坐标来示例描述图表的一个顶点在每张所述目标图片中 的位置,在PDF文档A中的第3页的坐标(x1,y1)位置有表格1的一个顶点,在PDF文档A中的第7页的坐标(x2,y2)位置有图形2的一个顶点,在PDF文档A中的第9页的坐标(x3,y3)位置有表格3的一个顶点,表格一般用表格的四个顶点的坐标就可以确定表格在每张所述目标图片中的位置,图形可以用图形的n个顶点的坐标确定图形在每张所述目标图片中的位置,n≥3,n为整数,比如,三角形图形可以用三角形的三个顶点的坐标来描述三角形在每张所述目标图片中的位置,四边形可以用四边形的四个顶点的坐标来描述表格在每张所述目标图片的位置,五角形图形以五角形的五个顶点的坐标来描述图形在每张所述目标图片中的位置等。
进一步地,其中图形和表格也可以分别用各自的预设编号1、2、3顺序描述,也就是表格用表格的预设编号1、2、3顺序描述,图形用图形的预设编号1、2、3顺序描述,表格可以描述为表格1、表格2及表格3等,图形描述为图形1、图形2及图形3等。
以列表形式按照预设编号顺序显示所有的包含图表的每张所述目标图片的信息,可以利用JS在页面中新建一个Excel表格来实现。JS即JavaScript,JavaScript是Web的编程语言,使用HTML结合CSS结构样式代码,比如使用CSS中的Table样式来实现以表格形式显示包含图表的每张所述目标图片的信息,其中,CSS,英文为Cascading Style Sheets,指层叠样式表。
表1
Figure PCTCN2019117747-appb-000001
在一个实施例中,所述通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表在对应每张所述目标图片中的位置的步骤包括:通过 所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表在对应每张所述目标图片中的预设区域位置,所述预设区域包括m个区域,m≥2,m为整数。
具体地,在目标检测模型中,其中目标定位是不仅仅要识别出来是什么物体,即分类,而且还要预测物体的位置,位置一般用边框(Bounding box)标记,而目标检测实质是多目标的定位,即要在目标图片中定位多个目标物体,包括分类和定位,因此,在目标检测模型训练的过程中,包括对目标的定位,就是目标在图像中的位置。可以将PDF中的每页文档转换为每张目标图片后将目标图片划分为m个预设区域,m≥2,m为整数,以预设区域来描述图表在每张所述目标图片中的位置。比如,以将每张所述目标图片划分为四个区域为例,请参阅图2,图2为本申请实施例提供的PDF文档中图表的定位方法中一个图表位置区域划分示意图,如图2所示,图2中的所述预设区域包括第一区域、第二区域、第三区域及第四区域,通过判断图表在第一区域、第二区域、第三区域或者第四区域中的哪个区域来描述图表在每张所述目标图片中的位置。其中,m越大,每页文档的区域划分越精细,对图表的位置描述越准确,可以根据实际需要确定m的值,也就是将每张所述目标图片划分为多少个预设区域。
在一个实施例中,所述通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表在对应每张所述目标图片中的位置的步骤包括:通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表的n个顶点分别在对应每张所述目标图片中的坐标,其中,n≥3,n为整数。
具体地,除了可以将PDF中每张所述目标图片用区域划分来描述图表在每张所述目标图片中的位置外,还可以以每张所述目标图片中的坐标来描述图表在每张所述目标图片中的位置,通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表的n个顶点分别在对应每张所述目标图片中的坐标,其中,n≥3,n为整数。比如,三角形图形可以用三角形的三个顶点的坐标来描述三角形在每张所述目标图片中的位置,表格以表格的四个顶点的坐标来描述表格在每张所述目标图片的位置,四边形可以用四边形的四个顶点的坐标来描述表格在每张所述目标图片的位置,五角形图形以五角形的五个顶点的坐 标来描述图形在每张所述目标图片中的位置等,以实现对图表位置更精确的描述。请继续参阅表1,如表格1所示,其中图形和表格用统一的编号1、2、3描述,PDF文档A包含的图表包括表格1、图形2及表格3,在表1中用一个顶点的坐标来示例描述图表的一个顶点在每张所述目标图片中的位置,在PDF文档A中的第3页的坐标(x1,y1)位置有表格1的一个顶点,在PDF文档A中的第7页的坐标(x2,y2)位置有图形2的一个顶点,在PDF文档A中的第9页的坐标(x3,y3)位置有表格3的一个顶点。
由于在目标检测模型中,其中目标定位是不仅仅要识别出来是什么物体,即分类,而且还要预测物体的位置,位置一般用边框(Bounding box)标记,而目标检测实质是多目标的定位,即要在图片中定位多个目标物体,包括分类和定位,因此,在目标检测模型训练的过程中,包括对目标的定位,就是目标在图像中的位置。
另外，在使用深度学习模型进行文本识别中的表格识别时，首先进行表格的提取，可以使用OpenCV函数对图片灰度处理即二值化处理，腐蚀和膨胀后得到表格线，由获得的表格线得到单元格交点坐标，根据每个单元格交点坐标中横坐标和竖坐标的大小以判断出表格的顶点坐标。请继续参阅图2，若图2中所示的图为一个坐标系的四个象限，根据坐标系中四个象限的坐标特点可知，B1、B2、B3及B4中各个坐标满足表2所示的属性。根据表2中所示的属性可知：1)在B1所在的象限中，X1最小且Y1最大的坐标为表格的顶点坐标；2)在B2所在的象限中，X2最大且Y2最大的坐标为表格的顶点坐标；3)在B3所在的象限中，X3最大且Y3最小的坐标为表格的顶点坐标；4)在B4所在的象限中，X4最小且Y4最小的坐标为表格的顶点坐标。
根据以上各个坐标的属性,获得表格中的单元格交点坐标以后,通过比较各个单元格交点坐标中的横坐标和纵坐标的大小,即可获得表格的四个顶点的坐标。
表2
点所属象限 坐标属性
B1 X1<0;Y1>0
B2 X2>0;Y2>0
B3 X3>0;Y3<0
B4 X4<0;Y4<0
需要说明的是,上述各个实施例所述的PDF文档中图表的定位方法,可以根据需要将不同实施例中包含的技术特征重新进行组合,以获取组合后的实施方案,但都在本申请要求的保护范围之内。
请参阅图3,图3为本申请实施例提供的PDF文档中图表的定位装置的示意性框图。对应于上述PDF文档中图表的定位方法,本申请实施例还提供一种PDF文档中图表的定位装置。如图3所示,该PDF文档中图表的定位装置包括用于执行上述PDF文档中图表的定位方法的单元,该装置可以被配置于终端或者服务器等计算机设备中。具体地,请参阅图3,该PDF文档中图表的定位装置300包括转换单元301、识别单元302、提取单元303及定位单元304。其中,转换单元301,用于获取PDF文档,通过预设方式将所述PDF文档中的每页文档按照所述每页文档在所述PDF文档中的位置转换为携带有预设位置标识的每张图片;识别单元302,用于通过预设的目标检测模型识别出所有所述图片中包含图表的图片作为目标图片,所述图表包括图形和表格;提取单元303,用于通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表在对应每张所述目标图片中的位置;定位单元304,用于以每张所述目标图片在所述PDF文档中的位置及所述图表在对应每张所述目标图片中的位置按照预设顺序组合以生成所述图表在所述PDF文档中的位置。
在一个实施例中,所述PDF文档中图表的定位装置300还包括:显示单元,用于按照每张所述目标图片在所述PDF文档中的顺序以列表形式按照预设编号顺序显示所有所述目标图片的信息,所述信息包括:图表的类型、图表在每张所述目标图片的位置、每张所述目标图片在所述PDF文档中的位置、所述图表在所述PDF文档中的位置。
在一个实施例中,所述提取单元303,用于通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表在对应每张所述目标图片中的预设区域位置,所述预设区域包括m个区域,m≥2,m为整数。
在一个实施例中,所述提取单元303,用于通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表的n个顶点分别在对应每张所述目标图片中的坐标,其中,n≥3,n为整数。
在一个实施例中,所述PDF文档中图表的定位装置300还包括:
训练单元,用于训练所述目标检测模型;所述训练单元包括:
识别子单元,用于将图形和表格分别输入目标检测模型以使所述目标检测模型识别所述图形和所述表格;
提取子单元,用于将携带有图形和/或表格的图片输入至所述目标检测模型以使所述目标检测模型识别出所述图形和/或所述表格,并对应提取所述图形的位置和/或所述表格的位置;
训练子单元,用于训练所述目标检测模型直至所述目标检测模型对所述图形和/或所述表格的识别准确率满足预设条件。
在一个实施例中,所述目标检测模型为深度学习模型。
在一个实施例中,所述深度学习模型为Faster R-CNN模型。
在一个实施例中,所述转换单元301,用于通过Icepdf控件将所述PDF文档中的每页文档按照所述每页文档在所述PDF文档中的位置转换为携带有预设位置标识的JPG格式或者JPEG格式的每张图片。
需要说明的是,所属领域的技术人员可以清楚地了解到,上述PDF文档中图表的定位装置和各单元的具体实现过程,可以参考前述方法实施例中的相应描述,为了描述的方便和简洁,在此不再赘述。
同时,上述PDF文档中图表的定位装置中各个单元的划分和连接方式仅用于举例说明,在其他实施例中,可将PDF文档中图表的定位装置按照需要划分为不同的单元,也可将PDF文档中图表的定位装置中各单元采取不同的连接顺序和方式,以完成上述PDF文档中图表的定位装置的全部或部分功能。
上述PDF文档中图表的定位装置可以实现为一种计算机程序的形式,该计 算机程序可以在如图4所示的计算机设备上运行。
请参阅图4,图4是本申请实施例提供的一种计算机设备的示意性框图。该计算机设备400可以是台式机电脑或者服务器等计算机设备,也可以是其他设备中的组件或者部件。
参阅图4,该计算机设备400包括通过系统总线401连接的处理器402、存储器和网络接口405,其中,存储器可以包括非易失性存储介质403和内存储器404。
该非易失性存储介质403可存储操作系统4031和计算机程序4032。该计算机程序4032被执行时,可使得处理器402执行一种上述PDF文档中图表的定位方法。
该处理器402用于提供计算和控制能力,以支撑整个计算机设备400的运行。
该内存储器404为非易失性存储介质403中的计算机程序4032的运行提供环境,该计算机程序4032被处理器402执行时,可使得处理器402执行一种上述PDF文档中图表的定位方法。
该网络接口405用于与其它设备进行网络通信。本领域技术人员可以理解,图4中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备400的限定,具体的计算机设备400可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。例如,在一些实施例中,计算机设备可以仅包括存储器及处理器,在这样的实施例中,存储器及处理器的结构及功能与图4所示实施例一致,在此不再赘述。
其中,所述处理器402用于运行存储在存储器中的计算机程序4032,以实现本申请实施例的PDF文档中图表的定位方法。
应当理解,在本申请实施例中,处理器402可以是中央处理单元(Central Processing Unit,CPU),该处理器402还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field-Programmable Gate Array,FPGA)或 者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。其中,通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
本领域普通技术人员可以理解的是实现上述实施例的方法中的全部或部分流程,是可以通过计算机程序来完成,该计算机程序可存储于一计算机可读存储介质。该计算机程序被该计算机系统中的至少一个处理器执行,以实现上述方法的实施例的流程步骤。
因此,本申请实施例还提供一种计算机可读存储介质。该计算机可读存储介质存储有计算机程序,该计算机程序被处理器执行时使处理器执行以上各实施例中所描述的PDF文档中图表的定位方法的步骤。
所述存储介质为实体的、非瞬时性的存储介质,例如可以是U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、磁碟或者光盘等各种可以存储计算机程序的实体存储介质。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
以上所述，仅为本申请的具体实施方式，但本申请的保护范围并不局限于此，任何熟悉本技术领域的技术人员在本申请揭露的技术范围内，可轻易想到各种等效的修改或替换，这些修改或替换都应涵盖在本申请的保护范围之内。因此，本申请的保护范围应以权利要求的保护范围为准。

Claims (20)

  1. 一种PDF文档中图表的定位方法,包括:
    获取PDF文档,通过预设方式将所述PDF文档中的每页文档按照所述每页文档在所述PDF文档中的位置转换为携带有预设位置标识的每张图片;
    通过预设的目标检测模型识别出所有所述图片中包含图表的图片作为目标图片,所述图表包括图形和表格;
    通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表在对应每张所述目标图片中的位置;
    以每张所述目标图片在所述PDF文档中的位置及所述图表在对应每张所述目标图片中的位置按照预设顺序组合以生成所述图表在所述PDF文档中的位置。
  2. 根据权利要求1所述PDF文档中图表的定位方法,其中,所述以每张所述目标图片在所述PDF文档中的位置及所述图表在对应每张所述目标图片中的位置按照预设顺序组合以生成所述图表在所述PDF文档中的位置的步骤之后,还包括:
    按照每张所述目标图片在所述PDF文档中的顺序以列表形式按照预设编号顺序显示所有所述目标图片的信息,所述信息包括:图表的类型、图表在每张所述目标图片的位置、每张所述目标图片在所述PDF文档中的位置、所述图表在所述PDF文档中的位置。
  3. 根据权利要求1所述PDF文档中图表的定位方法,其中,所述通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表在对应每张所述目标图片中的位置的步骤包括:
    通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表在对应每张所述目标图片中的预设区域位置,所述预设区域包括m个区域,m≥2,m为整数。
  4. 根据权利要求1所述PDF文档中图表的定位方法,其中,所述通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表在对应每张所述目标图片中的位置的步骤包括:
    通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图 表的n个顶点分别在对应每张所述目标图片中的坐标,其中,n≥3,n为整数。
  5. 根据权利要求1所述PDF文档中图表的定位方法,其中,所述通过预设的目标检测模型识别出所有所述图片中包含图表的图片作为目标图片的步骤之前,还包括:
    训练所述目标检测模型;
    所述训练所述目标检测模型的步骤包括:
    将图形和表格分别输入目标检测模型以使所述目标检测模型识别所述图形和所述表格;
    将携带有图形和/或表格的图片输入至所述目标检测模型以使所述目标检测模型识别出所述图形和/或所述表格,并对应提取所述图形的位置和/或所述表格的位置;
    训练所述目标检测模型直至所述目标检测模型对所述图形和/或所述表格的识别准确率满足预设条件。
  6. 根据权利要求5所述PDF文档中图表的定位方法,其中,所述目标检测模型为深度学习模型。
  7. 根据权利要求6所述PDF文档中图表的定位方法,其中,所述深度学习模型为FasterR-CNN模型。
  8. 根据权利要求1所述PDF文档中图表的定位方法,其中,所述通过预设方式将所述PDF文档中的每页文档按照所述每页文档在所述PDF文档中的位置转换为携带有预设位置标识的每张图片的步骤包括:
    通过Icepdf控件将所述PDF文档中的每页文档按照所述每页文档在所述PDF文档中的位置转换为携带有预设位置标识的JPG格式或者JPEG格式的每张图片。
  9. 一种PDF文档中图表的定位装置,包括:
    转换单元,用于获取PDF文档,通过预设方式将所述PDF文档中的每页文档按照所述每页文档在所述PDF文档中的位置转换为携带有预设位置标识的每张图片;
    识别单元,用于通过预设的目标检测模型识别出所有所述图片中包含图表 的图片作为目标图片,所述图表包括图形和表格;
    提取单元,用于通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表在对应每张所述目标图片中的位置;
    定位单元,用于以每张所述目标图片在所述PDF文档中的位置及所述图表在对应每张所述目标图片中的位置按照预设顺序组合以生成所述图表在所述PDF文档中的位置。
  10. 根据权利要求9所述PDF文档中图表的定位装置,其中,所述PDF文档中图表的定位装置还包括:
    显示单元,用于按照每张所述目标图片在所述PDF文档中的顺序以列表形式按照预设编号顺序显示所有所述目标图片的信息,所述信息包括:图表的类型、图表在每张所述目标图片的位置、每张所述目标图片在所述PDF文档中的位置、所述图表在所述PDF文档中的位置。
  11. 一种计算机设备,包括存储器以及与所述存储器相连的处理器;所述存储器用于存储计算机程序;所述处理器用于运行所述存储器中存储的计算机程序,以执行如下步骤:
    获取PDF文档,通过预设方式将所述PDF文档中的每页文档按照所述每页文档在所述PDF文档中的位置转换为携带有预设位置标识的每张图片;
    通过预设的目标检测模型识别出所有所述图片中包含图表的图片作为目标图片,所述图表包括图形和表格;
    通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表在对应每张所述目标图片中的位置;
    以每张所述目标图片在所述PDF文档中的位置及所述图表在对应每张所述目标图片中的位置按照预设顺序组合以生成所述图表在所述PDF文档中的位置。
  12. 根据权利要求11所述计算机设备,其中,所述以每张所述目标图片在所述PDF文档中的位置及所述图表在对应每张所述目标图片中的位置按照预设顺序组合以生成所述图表在所述PDF文档中的位置的步骤之后,还包括:
    按照每张所述目标图片在所述PDF文档中的顺序以列表形式按照预设编号顺序显示所有所述目标图片的信息,所述信息包括:图表的类型、图表在每张 所述目标图片的位置、每张所述目标图片在所述PDF文档中的位置、所述图表在所述PDF文档中的位置。
  13. 根据权利要求11所述计算机设备,其中,所述通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表在对应每张所述目标图片中的位置的步骤包括:
    通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表在对应每张所述目标图片中的预设区域位置,所述预设区域包括m个区域,m≥2,m为整数。
  14. 根据权利要求11所述计算机设备,其中,所述通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表在对应每张所述目标图片中的位置的步骤包括:
    通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表的n个顶点分别在对应每张所述目标图片中的坐标,其中,n≥3,n为整数。
  15. 根据权利要求11所述计算机设备,其中,所述通过预设的目标检测模型识别出所有所述图片中包含图表的图片作为目标图片的步骤之前,还包括:
    训练所述目标检测模型;
    所述训练所述目标检测模型的步骤包括:
    将图形和表格分别输入目标检测模型以使所述目标检测模型识别所述图形和所述表格;
    将携带有图形和/或表格的图片输入至所述目标检测模型以使所述目标检测模型识别出所述图形和/或所述表格,并对应提取所述图形的位置和/或所述表格的位置;
    训练所述目标检测模型直至所述目标检测模型对所述图形和/或所述表格的识别准确率满足预设条件。
  16. 根据权利要求15所述计算机设备,其中,所述目标检测模型为深度学习模型。
  17. 根据权利要求16所述计算机设备,其中,所述深度学习模型为Faster R-CNN模型。
  18. 根据权利要求11所述计算机设备,其中,所述通过预设方式将所述PDF文档中的每页文档按照所述每页文档在所述PDF文档中的位置转换为携带有预设位置标识的每张图片的步骤包括:
    通过Icepdf控件将所述PDF文档中的每页文档按照所述每页文档在所述PDF文档中的位置转换为携带有预设位置标识的JPG格式或者JPEG格式的每张图片。
  19. 一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,其中,所述计算机程序被处理器执行时使所述处理器实现如下步骤:
    获取PDF文档,通过预设方式将所述PDF文档中的每页文档按照所述每页文档在所述PDF文档中的位置转换为携带有预设位置标识的每张图片;
    通过预设的目标检测模型识别出所有所述图片中包含图表的图片作为目标图片,所述图表包括图形和表格;
    通过所述目标检测模型提取每张所述目标图片中的所述图表以识别所述图表在对应每张所述目标图片中的位置;
    以每张所述目标图片在所述PDF文档中的位置及所述图表在对应每张所述目标图片中的位置按照预设顺序组合以生成所述图表在所述PDF文档中的位置。
  20. 根据权利要求19所述计算机可读存储介质,其中,所述以每张所述目标图片在所述PDF文档中的位置及所述图表在对应每张所述目标图片中的位置按照预设顺序组合以生成所述图表在所述PDF文档中的位置的步骤之后,还包括:
    按照每张所述目标图片在所述PDF文档中的顺序以列表形式按照预设编号顺序显示所有所述目标图片的信息,所述信息包括:图表的类型、图表在每张所述目标图片的位置、每张所述目标图片在所述PDF文档中的位置、所述图表在所述PDF文档中的位置。
PCT/CN2019/117747 2019-05-30 2019-11-13 Pdf文档中图表的定位方法、装置及计算机设备 WO2020238054A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910462305.7A CN110348294B (zh) 2019-05-30 2019-05-30 Pdf文档中图表的定位方法、装置及计算机设备
CN201910462305.7 2019-05-30

Publications (1)

Publication Number Publication Date
WO2020238054A1 true WO2020238054A1 (zh) 2020-12-03

Family

ID=68174424

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117747 WO2020238054A1 (zh) 2019-05-30 2019-11-13 Pdf文档中图表的定位方法、装置及计算机设备

Country Status (2)

Country Link
CN (1) CN110348294B (zh)
WO (1) WO2020238054A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818894A (zh) * 2021-02-08 2021-05-18 深圳万兴软件有限公司 识别pdf文件中文本框的方法、装置及计算机设备及存储介质
CN113408244A (zh) * 2021-06-22 2021-09-17 平安科技(深圳)有限公司 Java应用生成Word文档方法、装置、设备及介质
CN116758547A (zh) * 2023-06-27 2023-09-15 北京中超伟业信息安全技术股份有限公司 一种纸介质碳化方法、系统及存储介质

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348294B (zh) * 2019-05-30 2024-04-16 平安科技(深圳)有限公司 Pdf文档中图表的定位方法、装置及计算机设备
CN110909123B (zh) * 2019-10-23 2023-08-25 深圳价值在线信息科技股份有限公司 一种数据提取方法、装置、终端设备及存储介质
CN110765739B (zh) * 2019-10-24 2023-10-10 中国人民大学 一种从pdf文档中抽取表格数据和篇章结构的方法
CN111104871B (zh) * 2019-11-28 2023-11-07 北京明略软件系统有限公司 表格区域识别模型生成方法、装置及表格定位方法、装置
CN111178154B (zh) * 2019-12-10 2023-04-07 北京明略软件系统有限公司 表格边框预测模型生成方法、装置及表格定位方法、装置
CN111931021B (zh) * 2020-05-22 2024-07-16 淮阴工学院 一种基于数据挖掘的工程国家标准数据库自适应构建方法
CN112380825B (zh) * 2020-11-17 2022-07-15 平安科技(深圳)有限公司 Pdf文档跨页表格合并方法、装置、电子设备及存储介质
CN113065396A (zh) * 2021-03-02 2021-07-02 国网湖北省电力有限公司 基于深度学习的扫描档案图像的自动化归档处理系统及方法
CN112990110B (zh) * 2021-04-20 2022-03-25 数库(上海)科技有限公司 从研报中进行关键信息提取方法及相关设备
CN113127595B (zh) * 2021-04-26 2022-08-16 数库(上海)科技有限公司 研报摘要的观点详情提取方法、装置、设备和存储介质
CN113111858A (zh) * 2021-05-12 2021-07-13 数库(上海)科技有限公司 自动检测图片中表格的方法、装置、设备和存储介质
CN113723328B (zh) * 2021-09-06 2023-11-03 华南理工大学 一种图表文档面板分析理解方法
CN113989626B (zh) * 2021-12-27 2022-04-05 北京文安智能技术股份有限公司 一种基于目标检测模型的多类别垃圾场景区分方法
CN114155547B (zh) * 2022-02-08 2022-07-12 珠海盈米基金销售有限公司 一种图表识别方法、装置、设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8738553B1 (en) * 2009-07-22 2014-05-27 Google Inc. Image selection based on image quality
CN104517112A (zh) * 2013-09-29 2015-04-15 北大方正集团有限公司 一种表格识别方法与系统
CN106951400A (zh) * 2017-02-06 2017-07-14 北京因果树网络科技有限公司 一种pdf文件的信息抽取方法及装置
CN108415887A (zh) * 2018-02-09 2018-08-17 武汉大学 一种pdf文件向ofd文件转化的方法
CN109446487A (zh) * 2018-11-01 2019-03-08 北京神州泰岳软件股份有限公司 一种解析便携式文档格式文档表格的方法及装置
CN110348294A (zh) * 2019-05-30 2019-10-18 平安科技(深圳)有限公司 Pdf文档中图表的定位方法、装置及计算机设备

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7016532B2 (en) * 2000-11-06 2006-03-21 Evryx Technologies Image capture and identification system and process

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8738553B1 (en) * 2009-07-22 2014-05-27 Google Inc. Image selection based on image quality
CN104517112A (zh) * 2013-09-29 2015-04-15 北大方正集团有限公司 一种表格识别方法与系统
CN106951400A (zh) * 2017-02-06 2017-07-14 北京因果树网络科技有限公司 一种pdf文件的信息抽取方法及装置
CN108415887A (zh) * 2018-02-09 2018-08-17 武汉大学 一种pdf文件向ofd文件转化的方法
CN109446487A (zh) * 2018-11-01 2019-03-08 北京神州泰岳软件股份有限公司 一种解析便携式文档格式文档表格的方法及装置
CN110348294A (zh) * 2019-05-30 2019-10-18 平安科技(深圳)有限公司 Pdf文档中图表的定位方法、装置及计算机设备

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818894A (zh) * 2021-02-08 2021-05-18 深圳万兴软件有限公司 识别pdf文件中文本框的方法、装置及计算机设备及存储介质
CN112818894B (zh) * 2021-02-08 2023-12-15 深圳万兴软件有限公司 识别pdf文件中文本框的方法、装置及计算机设备及存储介质
CN113408244A (zh) * 2021-06-22 2021-09-17 平安科技(深圳)有限公司 Java应用生成Word文档方法、装置、设备及介质
CN113408244B (zh) * 2021-06-22 2023-08-22 平安科技(深圳)有限公司 Java应用生成Word文档方法、装置、设备及介质
CN116758547A (zh) * 2023-06-27 2023-09-15 北京中超伟业信息安全技术股份有限公司 一种纸介质碳化方法、系统及存储介质
CN116758547B (zh) * 2023-06-27 2024-03-12 北京中超伟业信息安全技术股份有限公司 一种纸介质碳化方法、系统及存储介质

Also Published As

Publication number Publication date
CN110348294B (zh) 2024-04-16
CN110348294A (zh) 2019-10-18

Similar Documents

Publication Publication Date Title
WO2020238054A1 (zh) Pdf文档中图表的定位方法、装置及计算机设备
US10762376B2 (en) Method and apparatus for detecting text
CN111488826B (zh) 一种文本识别方法、装置、电子设备和存储介质
US20220253631A1 (en) Image processing method, electronic device and storage medium
US20200004815A1 (en) Text entity detection and recognition from images
US11861919B2 (en) Text recognition method and device, and electronic device
CN109598298B (zh) 图像物体识别方法和系统
CN113837151B (zh) 表格图像处理方法、装置、计算机设备及可读存储介质
JP2021166070A (ja) 文書比較方法、装置、電子機器、コンピュータ読取可能な記憶媒体及びコンピュータプログラム
CN113255501B (zh) 生成表格识别模型的方法、设备、介质及程序产品
CN113239807B (zh) 训练票据识别模型和票据识别的方法和装置
US20230045715A1 (en) Text detection method, text recognition method and apparatus
US11881044B2 (en) Method and apparatus for processing image, device and storage medium
CN113313114B (zh) 证件信息获取方法、装置、设备以及存储介质
CN108021918B (zh) 文字识别方法及装置
CN114445833B (zh) 文本识别方法、装置、电子设备和存储介质
CN113344890B (zh) 医学图像识别方法、识别模型训练方法及装置
CN115880702A (zh) 数据处理方法、装置、设备、程序产品及存储介质
CN114818627A (zh) 一种表格信息抽取方法、装置、设备及介质
CN114120305A (zh) 文本分类模型的训练方法、文本内容的识别方法及装置
CN113538291A (zh) 卡证图像倾斜校正方法、装置、计算机设备和存储介质
CN115497112B (zh) 表单识别方法、装置、设备以及存储介质
CN114998906B (zh) 文本检测方法、模型的训练方法、装置、电子设备及介质
CN112818975B (zh) 文本检测模型训练方法及装置、文本检测方法及装置
CN116704535A (zh) 一种作答图像和题干图像的匹配方法、装置、设备及介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19930352

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19930352

Country of ref document: EP

Kind code of ref document: A1