CN110516221B - Method, equipment and storage medium for extracting chart data in PDF document - Google Patents


Publication number
CN110516221B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910805559.4A
Other languages
Chinese (zh)
Other versions
CN110516221A (en)
Inventor
陆紫华
王凯
童刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qichacha Technology Co ltd
Original Assignee
Qichacha Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qichacha Technology Co ltd
Priority to CN201910805559.4A (CN110516221B)
Priority to PCT/CN2019/115964 (WO2021035954A1)
Publication of CN110516221A
Application granted
Publication of CN110516221B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition

Abstract

The invention discloses a method, equipment and a storage medium for extracting chart data in a PDF document, wherein the method comprises the following steps: converting the PDF document into an SVG file; acquiring coordinate information and statistical graphs of one or more charts in the SVG file; calculating coordinate values of data nodes of the statistical graphs according to the coordinate information; and generating statistical data of the chart according to the coordinate values. Compared with the prior art, the invention automatically extracts chart data from the PDF document, which greatly improves extraction efficiency, saves labor cost, and improves accuracy.

Description

Method, equipment and storage medium for extracting chart data in PDF document
Technical Field
The invention relates to the field of computers, in particular to a method, equipment and a storage medium for extracting diagram data in a PDF document.
Background
The rapid development of the internet has ushered in the era of big data, in which data is vast and complex, and the financial industry has always been an important producer and consumer of big data. As the number of listed companies and debt-issuing institutions grows, the amount of information to be processed every day is increasing explosively. For PDF documents containing charts (such as securities research reports and financial reports in the financial industry), the original manual processing is extremely inefficient and can no longer meet current needs.
Therefore, a need arises for automatically extracting chart data information from a PDF document.
Disclosure of Invention
The invention aims to provide a method, equipment and a storage medium for extracting diagram data in a PDF document.
In order to achieve one of the above objects, an embodiment of the present invention provides a method for extracting diagram data from a PDF document, where the method includes:
converting the PDF document into an SVG file;
acquiring coordinate information and statistical graphs of one or more graphs in the SVG file;
calculating coordinate values of data nodes of the statistical graph according to the coordinate information;
and generating statistical data of the chart according to the coordinate values.
As a further improvement of an embodiment of the present invention, the "acquiring coordinate information and a statistical chart of one or more charts in the SVG file" specifically includes:
extracting a chart area of one or more charts in the SVG file;
extracting the X axis, the Y axis, and the scale values of the X axis and the Y axis in the chart area to obtain coordinate information of the chart corresponding to the chart area;
and extracting one or more statistical graphs in the graph area, wherein the statistical graphs are bar graphs, line graphs or curve graphs.
As a further improvement of an embodiment of the present invention, the step of "extracting a diagram area of one or more diagrams in the SVG file" specifically includes:
identifying and extracting rectangles in the SVG file to obtain a rectangle set;
and screening rectangles with the length and the width meeting the length and the width threshold value in the rectangle set, wherein the range framed by the rectangles is the chart area.
As a further improvement of an embodiment of the present invention, the step of "extracting an X axis and a Y axis in the chart region" specifically includes:
identifying and extracting horizontal lines and vertical lines in the chart area to obtain a horizontal line set and a vertical line set;
screening the longest transverse line in the transverse line set, wherein if the longest transverse line is unique, the longest transverse line is an X axis, otherwise, screening the color or the width of the longest transverse line, and obtaining the unique longest transverse line which is the X axis;
screening the longest vertical line in the vertical line set, wherein if there are exactly one or two longest vertical lines, each of them is a Y axis; otherwise, screening the longest vertical lines by color or width to obtain the one or two distinctive longest vertical lines, which are the Y axes;
and if the vertical line set is empty, constructing a Y axis according to the X axis.
As a further improvement of an embodiment of the present invention, the step of "extracting scale values of the X axis and scale values of the Y axis in the chart region" includes:
recognizing and extracting text data in the chart area to obtain a text data set;
acquiring text data which is positioned at the lower side of the X axis and is closest to the X axis to obtain a scale value of the X axis;
if the Y axis is unique, acquiring text data which is located on the left side of the Y axis and is closest to the Y axis to obtain a scale value of the Y axis;
if there are two Y axes, the Y axis on the left side is a first Y axis and the Y axis on the right side is a second Y axis; text data which is located on the left side of the first Y axis and is closest to the first Y axis is acquired to obtain a scale value of the first Y axis, and text data which is located on the right side of the second Y axis and is closest to the second Y axis is acquired to obtain a scale value of the second Y axis.
As a further improvement of an embodiment of the present invention, the "extracting one or more statistical graphs in the chart area, wherein the statistical graphs are bar graphs, line graphs or curve graphs" specifically includes:
and identifying and extracting the bar charts/line charts/curve graphs in the chart area, and combining the bar charts/line charts/curve graphs with the same attribute if a plurality of bar charts/line charts/curve graphs exist after the interference factors are filtered.
As a further improvement of an embodiment of the present invention, when there is a unique Y axis, the "calculating coordinate values of data nodes of the statistical graph based on the coordinate information" specifically includes:
projecting each data node in the statistical graph on an X axis and a Y axis;
and calculating projection values of each data node in the statistical graph on an X axis and a Y axis according to the scale values to obtain coordinate values of each data node in the statistical graph.
As a further improvement of an embodiment of the present invention, when there are two Y axes and at least two statistical graphs, the "calculating coordinate values of data nodes of the statistical graphs according to the coordinate information" specifically includes:
extracting annotations of the statistical graphs in the chart area according to the color attributes of the statistical graphs, wherein at least two annotations exist;
when the annotation of one statistical graph is positioned above the annotation of another statistical graph, the statistical graph whose annotation is above corresponds to the Y axis on the left side, and the statistical graph whose annotation is below corresponds to the Y axis on the right side;
when the annotations of all the statistical graphs are positioned on the same horizontal plane, the statistical graph with the annotation positioned on the left side corresponds to the Y-axis coordinate on the left side, and the statistical graph with the annotation positioned on the right side corresponds to the Y-axis coordinate on the right side;
projecting each data node in the statistical graph on an X axis and a corresponding Y axis;
and calculating projection values of each data node in the statistical graph on an X axis and a corresponding Y axis according to the scale values to obtain coordinate values of each data node in the statistical graph.
In order to achieve one of the above objects, an embodiment of the present invention provides an electronic device, which includes a memory and a processor, wherein the memory stores a computer program operable on the processor, and the processor executes the computer program to implement any one of the steps in the method for extracting diagram data in a PDF document.
To achieve one of the above objects, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of any one of the above methods for extracting diagram data in a PDF document.
Compared with the prior art, the method and the device have the advantages that the chart data information is automatically extracted from the PDF document, the extraction efficiency is greatly improved, the labor cost is saved, and the accuracy is improved.
Drawings
FIG. 1 is a flow chart of the method for extracting chart data in a PDF document according to the present invention.
Fig. 2 is an example of a graph.
Fig. 3 is a schematic flowchart of step S2 in fig. 1.
Fig. 4 shows the chart of Fig. 2 obtained by extracting the content in the rectangular area.
Fig. 5 is another example of a graph.
FIG. 6 is a tabular representation of data extracted from FIG. 5.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments shown in the drawings. These embodiments are not intended to limit the present invention, and structural, methodological, or functional changes made by those skilled in the art according to these embodiments are included in the scope of the present invention.
As shown in FIG. 1, the method for extracting diagram data in a PDF document of the present invention comprises the following steps:
step S1: and converting the PDF document into an SVG file.
A file in PDF format can be converted to a file in SVG format with a variety of existing conversion techniques, which are not described in detail here. For an example of a chart to be extracted from a PDF document, see Fig. 2. SVG (Scalable Vector Graphics) is a graphics format based on the Extensible Markup Language (XML) for describing two-dimensional vector graphics.
SVG includes various elements. Text is generally defined with a text element, which contains the position information (the x and y values) and the content of the text data (such as "1,000." in the following example):
<text xmlns="http://www.w3.org/2000/svg" transform="matrix(1 0 0 1 0 0)" x="48.768 53.331 55.581 60.144 64.707 69.27" y="240.69" style="fill:#595959;font-family:'MFSGHA+Calibri';" font-size="9">1,000.</text>
SVG generally defines a path with a path element, whose d attribute holds the path instructions. The following example is a white rectangle. M denotes the starting point and is immediately followed by its coordinates (42.24, 549.84), i.e. x = 42.24, y = 549.84. L denotes a straight line: a line is drawn between the node before the L (42.24, 549.84) and the node after it (285.12, 549.84), and so on, until three sides of the rectangle are drawn. z closes the path, so that the figure bounded by the three sides is closed into a rectangle, and the rectangle is filled with white (#FFFFFF), which yields a white rectangle.
<path xmlns="http://www.w3.org/2000/svg" id="" transform="matrix(1 0 0 -1 0 841)" d="M42.24 549.84L285.12 549.84L285.12 549.84L285.12 712.32L285.12 712.32L42.24 712.32z" stroke="none" fill="#FFFFFF" fill-rule="evenodd"/>
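As an illustration of how the node coordinates in such a d attribute might be read, the following minimal Python sketch collects the points named by M and L commands and computes their bounding box. The function names parse_path_points and bounding_box are hypothetical, and the regular expression only assumes the simple "command, x, y" layout of the example above; it is not a full SVG path parser.

```python
import re

def parse_path_points(d):
    """Return the (x, y) points that follow M or L commands in a path string.

    Assumption: every M/L command is followed by exactly one "x y" pair,
    as in the example path; other SVG path commands are ignored.
    """
    points = []
    for _cmd, x, y in re.findall(r"([ML])\s*(-?[\d.]+)[\s,]+(-?[\d.]+)", d):
        points.append((float(x), float(y)))
    return points

def bounding_box(points):
    """Return (min_x, min_y, max_x, max_y) of a non-empty list of points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

d = ("M42.24 549.84L285.12 549.84L285.12 549.84"
     "L285.12 712.32L285.12 712.32L42.24 712.32z")
print(bounding_box(parse_path_points(d)))
```

The bounding box is expressed in the element's own coordinates, before its transform attribute is applied.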
Step S2: and acquiring coordinate information and statistical graphs of one or more graphs in the SVG file.
One or more charts can be included in the SVG file, and coordinate information of each chart and a statistical chart in the chart are extracted. The coordinate information of the graph refers to the position of the coordinate axis in the graph, scale information, and the like. The statistical graph refers to a geometric figure capable of representing statistical data, and the statistical graph mainly refers to a column graph, a line graph or a curve graph. A graph may include one or more statistical plots, such as a graph comprising two line graphs, or a graph comprising a bar graph and a graph, etc. Specifically, as shown in fig. 3, the present step includes the following steps:
step S21: and extracting a chart area of one or more charts in the SVG file.
Before the related data of a chart can be extracted, the chart (that is, the position of the chart) must first be found. The invention identifies and extracts the rectangles in the SVG file with an SVG element parser to obtain a rectangle set. Many rectangles may exist in the SVG file, and not every rectangle is a chart (for example, a rectangle may only represent a small icon), so the rectangles whose length and width meet the length and width thresholds must be screened from the rectangle set (the thresholds can be obtained empirically). In addition, the screening can also use the fill color of the rectangle (the rectangular area of a chart is generally filled with white).
Each rectangle that meets the requirements after screening represents a chart, and the range framed by the rectangle is a chart area. For example, the white rectangle in the foregoing example defines a chart area whose value range is [42.24, 285.12] on the x-axis and [549.84, 712.32] on the y-axis. Note, however, that the fourth parameter of matrix(1 0 0 -1 0 841) is -1, which means that the y-axis of this element points in the opposite direction to that of an element whose fourth parameter is 1. Once the coordinates are converted to a single, uniform y-axis direction (here y' = 841 - y), the value range of the chart area is [42.24, 285.12] on the x-axis and [128.68, 291.16] on the y-axis. As shown in Fig. 4, the chart in Fig. 2 is obtained by extracting the content in the rectangular area.
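The conversion and the size screening described above can be sketched as follows. This is only an illustration: apply_matrix follows the standard SVG meaning of matrix(a b c d e f), while chart_area and the threshold values of 100 are hypothetical, since the description only says that the thresholds are obtained empirically.

```python
def apply_matrix(matrix, point):
    """Apply an SVG matrix(a b c d e f) transform to one (x, y) point."""
    a, b, c, d, e, f = matrix
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)

def chart_area(corners, matrix, min_width, min_height):
    """Return the transformed bounding box of a rectangle, or None if it is
    smaller than the (assumed, empirically chosen) size thresholds."""
    pts = [apply_matrix(matrix, p) for p in corners]
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    box = (min(xs), min(ys), max(xs), max(ys))
    width, height = box[2] - box[0], box[3] - box[1]
    if width >= min_width and height >= min_height:
        return box
    return None

corners = [(42.24, 549.84), (285.12, 549.84), (285.12, 712.32), (42.24, 712.32)]
box = chart_area(corners, (1, 0, 0, -1, 0, 841), min_width=100, min_height=100)
print([round(v, 2) for v in box])
```

For the white rectangle of the example, the rounded result is the range [42.24, 285.12] by [128.68, 291.16] given in the text.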
It should be noted that the SVG element parser is prior art and recognizes various vector graphics by analyzing the characteristics of the SVG elements. For example, in the d attribute of a path element, MLLLLZ represents a rectangle, ML a straight line, MLL…L a polyline, and MCC…C a curve. The identification steps in the rest of this description are generally performed by the SVG element parser.
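A rough sketch of this kind of classification by command letters is given below. The function classify_path is hypothetical and is not the parser referred to in the description; it looks only at the sequence of command letters, so a practical parser would presumably also check the geometry (for example, that a closed path really has right angles). Accepting three or more L commands before Z for a rectangle is an assumption made to cover paths with repeated nodes, such as the white rectangle example.

```python
import re

def classify_path(d):
    """Classify a path by the sequence of its command letters only."""
    commands = "".join(re.findall(r"[A-Za-z]", d)).upper()
    if re.fullmatch(r"ML{3,}Z", commands):
        return "rectangle"   # closed path made of straight segments
    if commands == "ML":
        return "line"        # a single straight segment
    if re.fullmatch(r"ML{2,}", commands):
        return "polyline"    # open chain of straight segments
    if re.fullmatch(r"MC+", commands):
        return "curve"       # chain of cubic Bezier segments
    return "other"

print(classify_path("M111.84 697.2L131.04 697.2"))
```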
Step S22: extracting the X axis, the Y axis, and the scale values of the X axis and the Y axis in the chart area to obtain the coordinate information of the chart corresponding to the chart area.
It should be noted that the SVG file has its own default coordinate axes, denoted here by the lowercase x-axis and y-axis, and each SVG element has corresponding coordinate values in them.
The coordinate axes X and Y of the chart area involved in this step (written in capital letters to distinguish them from the coordinate axes of the SVG) serve as the reference frame for measuring the chart data to be extracted, and can generally be seen directly in the picture of the chart (although some charts do not draw the Y axis; Fig. 2, for example, draws only the X axis). This step extracts the coordinate information of the chart corresponding to the chart area and comprises the following steps:
step S221: identifying and extracting horizontal lines and vertical lines in the chart area to obtain a horizontal line set and a vertical line set;
A horizontal line is a straight line whose end points have equal y values, and a vertical line is a straight line whose end points have equal x values; for example, d="M111.84 697.2L131.04 697.2" represents a horizontal line.
Step S222: and screening the longest transverse line in the transverse line set, wherein if the longest transverse line is unique, the longest transverse line is an X axis, otherwise, screening the color or the width of the longest transverse line, and obtaining the unique longest transverse line which is the X axis.
The X-axis is typically the longest transverse line within the rectangular area of the chart, but when there are multiple longest transverse lines, one transverse line is distinctive, such as a different color or width than the other longest transverse line, and the distinctive longest transverse line found is the X-axis.
Step S223: screening the longest vertical line in the vertical line set, wherein if there are exactly one or two longest vertical lines, each of them is a Y axis; otherwise, screening the longest vertical lines by color or width to obtain the one or two distinctive longest vertical lines, which are the Y axes.
The screening step for the Y axis is similar to the X axis and will not be described herein.
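The screening of step S222 can be sketched as follows; the same idea would apply to vertical lines in step S223. The dictionary keys (x1, x2, y, color, width), the function name pick_x_axis and the length tolerance are hypothetical, and "distinctive" is interpreted here as having a (color, width) pair that no other longest line shares.

```python
from collections import Counter

def line_length(line):
    return abs(line["x2"] - line["x1"])

def pick_x_axis(h_lines, tol=0.01):
    """Pick the X axis from a list of horizontal lines.

    The longest line wins; if several lines tie for longest, the one whose
    (color, width) style is unique among them is chosen. Returns None when
    no single candidate can be identified.
    """
    if not h_lines:
        return None
    longest = max(line_length(l) for l in h_lines)
    candidates = [l for l in h_lines if longest - line_length(l) <= tol]
    if len(candidates) == 1:
        return candidates[0]
    styles = Counter((l["color"], l["width"]) for l in candidates)
    distinctive = [l for l in candidates
                   if styles[(l["color"], l["width"])] == 1]
    return distinctive[0] if len(distinctive) == 1 else None

# Made-up sample: two gray grid lines and one black line of equal length.
lines = [
    {"x1": 48, "x2": 260, "y": 180, "color": "#D9D9D9", "width": 0.5},
    {"x1": 48, "x2": 260, "y": 230, "color": "#D9D9D9", "width": 0.5},
    {"x1": 48, "x2": 260, "y": 133, "color": "#000000", "width": 1.0},
]
print(pick_x_axis(lines)["y"])
```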
Step S224: and if the vertical line set is empty, constructing a Y axis according to the X axis.
If the Y axis selected in step S223 is unsatisfactory (for example, too short, or not intersecting the X axis), or the vertical line set is empty, the Y axis is not drawn in the chart area, and it must be constructed from the X axis. The construction is as follows: the minimum coordinate of the X axis on the x-axis is taken as the x-coordinate of the Y axis, and the value range of the chart area on the y-axis is taken as the value range of the Y axis on the y-axis. For example, if the coordinate values of the X axis are ([48, 260], 133) and the value range of the chart area on the y-axis is [128.68, 291.16], the coordinate values of the constructed Y axis are (48, [128.68, 291.16]).
Similarly, if there are two Y axes, the left and right Y axes can be constructed using the minimum and maximum coordinates of the X axis on the x-axis, respectively.
It should be noted that the number of Y axes can be determined from the Y-axis scale values that can be acquired. In any case, two Y axes are first constructed from the X axis, and an attempt is then made to acquire the scale values near each Y axis (see the description below). If the acquisition succeeds and the acquired scale values meet the requirements (their distance from the Y axis is within the threshold range), that Y axis exists; otherwise it does not.
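The construction and the existence check can be sketched as below, using the numbers of the example. The tuple layout ((x_min, x_max), y) for the X axis, the function names, the label positions and the distance threshold of 25 are all hypothetical choices made for this illustration.

```python
def construct_y_axes(x_axis, area_y_range):
    """Build candidate left and right Y axes from the X axis.

    x_axis is ((x_min, x_max), y); each Y axis is returned as
    (x, (y_min, y_max)), spanning the chart area vertically.
    """
    (x_min, x_max), _y = x_axis
    return (x_min, area_y_range), (x_max, area_y_range)

def axis_has_labels(axis_x, label_xs, max_dist):
    """A constructed Y axis is kept only if some text lies close enough."""
    return any(abs(x - axis_x) <= max_dist for x in label_xs)

left, right = construct_y_axes(((48, 260), 133), (128.68, 291.16))
label_xs = [30.0, 31.5]  # made-up x positions of text found in the chart area
print(left, axis_has_labels(left[0], label_xs, 25))
print(right, axis_has_labels(right[0], label_xs, 25))
```

In this made-up case only the left candidate has text nearby, so only one Y axis would be treated as existing.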
Step S225: and identifying and extracting text data in the chart area to obtain a text data set.
Since the scale values of the X-axis and the Y-axis are stored in the form of text data, all the text data need to be extracted and then filtered by location.
Step S226: acquiring text data which is positioned at the lower side of the X axis and is closest to the X axis to obtain a scale value of the X axis;
step S227: if the Y axis is unique, acquiring text data which is located on the left side of the Y axis and is closest to the Y axis to obtain a scale value of the Y axis;
step S228: if there are two Y axes, the Y axis on the left side is a first Y axis and the Y axis on the right side is a second Y axis; text data which is located on the left side of the first Y axis and is closest to the first Y axis is acquired to obtain a scale value of the first Y axis, and text data which is located on the right side of the second Y axis and is closest to the second Y axis is acquired to obtain a scale value of the second Y axis.
After the scale values are obtained in steps S226, S227 and S228, the scale values may be checked for reasonableness, for example, whether the distance between the scale values and the corresponding axis is within the threshold range is checked.
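Step S226 together with the reasonableness check can be sketched as follows. The sketch assumes that all coordinates have already been normalized so that larger y values are higher on the page, so that "below the X axis" means a smaller y; the dictionary keys, the function name x_axis_labels and both tolerances are hypothetical.

```python
def x_axis_labels(texts, axis_y, max_dist=20.0, row_tol=1.0):
    """Return the row of text items closest below the X axis, left to right.

    Assumption: y grows upward, so items below the axis have y < axis_y.
    Items farther than max_dist from the axis are rejected outright.
    """
    below = [t for t in texts if 0 < axis_y - t["y"] <= max_dist]
    if not below:
        return []
    nearest_y = max(t["y"] for t in below)
    row = [t for t in below if nearest_y - t["y"] <= row_tol]
    return sorted(row, key=lambda t: t["x"])

# Made-up text items: two tick labels, a lower footnote, and a Y-axis label.
texts = [
    {"x": 120.0, "y": 124.0, "content": "2019-02"},
    {"x": 60.0, "y": 124.0, "content": "2019-01"},
    {"x": 60.0, "y": 115.0, "content": "Source: ..."},
    {"x": 30.0, "y": 240.0, "content": "1,000"},
]
print([t["content"] for t in x_axis_labels(texts, axis_y=133.0)])
```

Scale values of a Y axis could be collected the same way, with the roles of x and y swapped.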
Through the above steps, the X axis, the Y axis, and the scale values of the X axis and the Y axis in the chart area are obtained, and thereby the coordinate information of the chart corresponding to the chart area.
Step S23: and extracting one or more statistical graphs in the graph area, wherein the statistical graphs are bar graphs, line graphs or curve graphs.
The bar graphs/line graphs/curve graphs ("/" means "or") in the chart area are identified and extracted, and if a plurality of bar graphs/line graphs/curve graphs remain after interference factors are filtered out, those with the same attribute are merged.
A bar graph (representing one statistical graph) is a group of rectangles that have the same fill color and whose bottoms or tops lie on the same horizontal line. When a bar graph is converted into SVG vector graphics, it is often split into several graphics (for example, a bar graph of 5 rectangles may have two rectangles in one path and the other three in another path). In this case the graphics are clustered by their attribute (here, the fill color), and the bar graphs with the same attribute are merged. After merging, the bar graph can also be checked for plausibility, for example whether the bottoms or tops of all rectangles of the merged bar graph lie on the same horizontal line.
Similarly, a line graph (representing one statistical graph) may be split into several polylines when converted into SVG vector graphics. These are also merged according to the color attribute of the lines, and the merge additionally considers whether two polylines to be merged are adjacent (for example, whether the tail node of the first polyline and the head node of the second polyline are the same).
The merging process for a curve graph (representing one statistical graph) is similar to that of a line graph.
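The merging of split polylines can be sketched as below. The function merge_polylines and the dictionary layout are hypothetical, and the sketch assumes that the pieces arrive in drawing order, so that a piece only ever needs to be appended to the end of an already collected polyline of the same color.

```python
def merge_polylines(polylines):
    """Merge polyline pieces that share a color and are adjacent.

    Two pieces are adjacent when the last node of the earlier piece equals
    the first node of the later piece. Input items are not modified.
    """
    merged = []
    for piece in polylines:
        for target in merged:
            if (target["color"] == piece["color"]
                    and target["points"][-1] == piece["points"][0]):
                # Skip the shared node so it is not duplicated.
                target["points"].extend(piece["points"][1:])
                break
        else:
            merged.append({"color": piece["color"],
                           "points": list(piece["points"])})
    return merged

pieces = [
    {"color": "red", "points": [(0, 0), (1, 2)]},
    {"color": "blue", "points": [(0, 5), (1, 4)]},
    {"color": "red", "points": [(1, 2), (2, 1)]},
]
for line in merge_polylines(pieces):
    print(line["color"], line["points"])
```

Clustering the rectangles of a bar graph by fill color would follow the same pattern without the adjacency test.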
In addition, after one or more statistical graphs are extracted, the annotation corresponding to each statistical graph must be extracted according to its color attribute. In Fig. 4, for example, the bar graph represents the number of commodity housing units sold, and the line graph represents the commodity housing transaction area.
Step S3: and calculating the coordinate value of the data node of the statistical graph according to the coordinate information.
And when a unique Y axis exists, projecting each data node in the statistical graph on an X axis and a Y axis, and calculating projection values of each data node in the statistical graph on the X axis and the Y axis according to the scale values to obtain coordinate values of each data node in the statistical graph.
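One way to turn a projection onto the Y axis into a data value is linear interpolation between two scale values, sketched below. This assumes a linear axis, which the description does not state explicitly; the function names make_scale and nearest_label and all numbers are made up for the illustration.

```python
def make_scale(ticks):
    """Build a position-to-value function from (position, value) tick pairs.

    Assumption: the axis is linear, so the first and last ticks are enough.
    """
    (p0, v0), (p1, v1) = ticks[0], ticks[-1]
    slope = (v1 - v0) / (p1 - p0)
    return lambda p: v0 + (p - p0) * slope

def nearest_label(x, labels):
    """Map a projection on the X axis to the closest (position, text) label."""
    return min(labels, key=lambda item: abs(item[0] - x))[1]

y_scale = make_scale([(133.0, 0.0), (233.0, 1000.0)])  # made-up Y ticks
x_labels = [(60.0, "2019-01"), (120.0, "2019-02")]      # made-up X ticks

node = (118.0, 183.0)  # a data node of a statistical graph
print(nearest_label(node[0], x_labels), y_scale(node[1]))
```

With two Y axes, one such scale function would be built per axis, and each statistical graph would use the scale of its corresponding axis.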
When two Y axes exist, annotations of the statistical graphs in the chart area are extracted according to the color attributes of the statistical graphs, wherein at least two annotations exist;
when the annotation of one statistical graph is positioned above the annotation of another statistical graph, the statistical graph whose annotation is above corresponds to the Y axis on the left side, and the statistical graph whose annotation is below corresponds to the Y axis on the right side;
when the annotations of all the statistical graphs are positioned on the same horizontal plane, the statistical graph with the annotation positioned on the left side corresponds to the Y-axis coordinate on the left side, and the statistical graph with the annotation positioned on the right side corresponds to the Y-axis coordinate on the right side;
projecting each data node in the statistical graph on an X axis and a corresponding Y axis;
and calculating projection values of each data node in the statistical graph on an X axis and a corresponding Y axis according to the scale values to obtain coordinate values of each data node in the statistical graph.
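The two annotation rules can be sketched as follows for the case of exactly two statistical graphs. The sketch assumes that coordinates are normalized so that larger y values are higher on the page; the function name assign_axes, the dictionary keys and the tolerance that decides whether two annotations are on the same horizontal line are hypothetical.

```python
def assign_axes(annotations, row_tol=1.0):
    """Map two series to the left or right Y axis by annotation position.

    Same horizontal line: the annotation further left takes the left axis.
    Otherwise: the annotation higher on the page takes the left axis
    (assumption: y grows upward).
    """
    first, second = annotations[0], annotations[1]
    if abs(first["y"] - second["y"]) <= row_tol:
        left, right = sorted((first, second), key=lambda a: a["x"])
    else:
        left, right = sorted((first, second), key=lambda a: a["y"],
                             reverse=True)
    return {left["series"]: "left", right["series"]: "right"}

side_by_side = [
    {"series": "line", "x": 180.0, "y": 280.0},
    {"series": "bar", "x": 90.0, "y": 280.0},
]
stacked = [
    {"series": "line", "x": 90.0, "y": 270.0},
    {"series": "bar", "x": 90.0, "y": 280.0},
]
print(assign_axes(side_by_side))
print(assign_axes(stacked))
```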
Step S4: and generating statistical data of the chart according to the coordinate values.
The statistical data of the chart is generated by collecting the coordinate values of the data nodes of the statistical graphs.
As shown in fig. 5 and 6, fig. 5 is another example of a graph, and fig. 6 is a table format showing data extracted from fig. 5.
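As one possible way to produce such a table, the sketch below writes the extracted coordinate values as CSV text with one row per X-axis scale value and one column per statistical graph. The function to_csv, the column names and the sample values are made up and are not taken from Fig. 5 or Fig. 6.

```python
import csv
import io

def to_csv(x_labels, series):
    """Render extracted values as CSV: one row per X label, one column
    per statistical graph (series maps a name to a list of values)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["x"] + list(series))
    for i, label in enumerate(x_labels):
        writer.writerow([label] + [values[i] for values in series.values()])
    return buf.getvalue()

# Made-up sample values for two statistical graphs.
print(to_csv(["2019-01", "2019-02"],
             {"bar": [10, 20], "line": [1.5, 2.5]}))
```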
Further, the method for extracting diagram data in a PDF document according to the present invention may further include the following steps:
step S5: the title of the chart above the chart area is obtained.
The chart is titled as text data and can be obtained by location.
The invention also provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the program to realize the steps in the method for extracting the chart data in the PDF document.
The present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps in the above method of extracting chart data in a PDF document.
It should be understood that although this description is organized by embodiments, not every embodiment contains only a single technical solution. The description is written this way for clarity only; those skilled in the art should take the description as a whole, and the technical solutions in the embodiments can also be combined appropriately to form other embodiments that those skilled in the art can understand.
The above-listed detailed description is only a specific description of a possible embodiment of the present invention, and they are not intended to limit the scope of the present invention, and equivalent embodiments or modifications made without departing from the technical spirit of the present invention should be included in the scope of the present invention.

Claims (7)

1. A method of extracting chart data from a PDF document, the method comprising:
converting the PDF document into an SVG file;
acquiring coordinate information and statistical graphs of one or more graphs in the SVG file;
calculating coordinate values of data nodes of the statistical graph according to the coordinate information;
generating statistical data of the chart according to the coordinate values;
the SVG file comprises a plurality of charts, each chart comprises a plurality of statistical graphs, and the step of acquiring the coordinate information and the statistical graphs of one or more charts in the SVG file specifically comprises the following steps:
extracting diagram areas of one or more diagrams in the SVG file: identifying and extracting rectangles in the SVG file to obtain a rectangle set; screening rectangles with the length and the width meeting the length and the width threshold value in the rectangle set, wherein the range framed by the rectangles is a chart area;
extracting the X axis, the Y axis, and the scale values of the X axis and the Y axis in the chart area to obtain coordinate information of the chart corresponding to the chart area;
extracting one or more statistical graphs in the graph area, wherein the statistical graphs are bar graphs, line graphs or curve graphs: and identifying and extracting the bar charts/line charts/curve graphs in the chart area, and combining the bar charts/line charts/curve graphs with the same attribute if a plurality of bar charts/line charts/curve graphs exist after the interference factors are filtered.
2. The method for extracting chart data in a PDF document according to claim 1, wherein said step of "extracting X-axis and Y-axis in said chart region" specifically comprises:
identifying and extracting horizontal lines and vertical lines in the chart area to obtain a horizontal line set and a vertical line set;
screening the longest transverse line in the transverse line set, wherein if the longest transverse line is unique, the longest transverse line is an X axis, otherwise, screening the color or the width of the longest transverse line, and obtaining the unique longest transverse line which is the X axis;
screening the longest vertical line in the vertical line set, wherein if there are exactly one or two longest vertical lines, each of them is a Y axis; otherwise, screening the longest vertical lines by color or width to obtain the one or two distinctive longest vertical lines, which are the Y axes;
and if the vertical line set is empty, constructing a Y axis according to the X axis.
3. The method of claim 1, wherein the step of extracting the scale values of the X-axis and the Y-axis in the graph region comprises:
recognizing and extracting text data in the chart area to obtain a text data set;
acquiring text data which is positioned at the lower side of the X axis and is closest to the X axis to obtain a scale value of the X axis;
if the Y axis is unique, acquiring text data which is located on the left side of the Y axis and is closest to the Y axis to obtain a scale value of the Y axis;
if there are two Y axes, the Y axis on the left side is a first Y axis and the Y axis on the right side is a second Y axis; text data which is located on the left side of the first Y axis and is closest to the first Y axis is acquired to obtain a scale value of the first Y axis, and text data which is located on the right side of the second Y axis and is closest to the second Y axis is acquired to obtain a scale value of the second Y axis.
4. The method according to claim 1, wherein the calculating coordinate values of data nodes of the statistical graph according to the coordinate information when there is a unique Y-axis specifically comprises:
projecting each data node in the statistical graph on an X axis and a Y axis;
and calculating projection values of each data node in the statistical graph on an X axis and a Y axis according to the scale values to obtain coordinate values of each data node in the statistical graph.
5. The method as claimed in claim 1, wherein when there are two Y axes there are at least two statistical graphs, and the calculating coordinate values of data nodes of the statistical graphs according to the coordinate information specifically comprises:
extracting annotations of the statistical graphs in the chart area according to the color attributes of the statistical graphs, wherein at least two annotations exist;
when the annotation of one statistical graph is positioned above the annotation of another statistical graph, the statistical graph whose annotation is above corresponds to the Y axis on the left side, and the statistical graph whose annotation is below corresponds to the Y axis on the right side;
when the annotations of all the statistical graphs are positioned on the same horizontal line, the statistical graph with the annotation positioned on the left side corresponds to the Y-axis coordinate on the left side, and the statistical graph with the annotation positioned on the right side corresponds to the Y-axis coordinate on the right side;
projecting each data node in the statistical graph on an X axis and a corresponding Y axis;
and calculating projection values of each data node in the statistical graph on an X axis and a corresponding Y axis according to the scale values to obtain coordinate values of each data node in the statistical graph.
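The annotation-to-axis rule of claim 5 can be sketched for the simplest case of exactly two statistical graphs. The function name `assign_axes`, the `tol` tolerance, and the input layout are hypothetical. Matching annotations to graphs by color is assumed to have been done already, and SVG coordinates are assumed, so a smaller y means higher on the page.

```python
def assign_axes(annotations, tol=1.0):
    """Map each of two graphs to the "left" or "right" Y axis.

    annotations -- dict of graph name -> (x, y) position of its annotation
    tol         -- y difference below which two annotations count as one row
    """
    if len(annotations) != 2:
        raise ValueError("this sketch handles exactly two graphs")
    (a, (ax, ay)), (b, (bx, by)) = annotations.items()
    if abs(ay - by) <= tol:
        # Same row: the annotation further left takes the left Y axis.
        first = a if ax <= bx else b
    else:
        # Stacked: the upper annotation (smaller y) takes the left Y axis.
        first = a if ay < by else b
    second = b if first == a else a
    return {first: "left", second: "right"}
```

Each graph's data nodes would then be projected onto the X axis and onto the Y axis it was assigned, using that axis's own scale values.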
6. An electronic device comprising a memory and a processor, wherein the memory stores a computer program operable on the processor, and wherein the processor executes the program to implement the steps of the method for extracting chart data in a PDF document according to any one of claims 1 to 5.
7. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of extracting chart data in a PDF document according to any one of claims 1 to 5.
CN201910805559.4A 2019-08-29 2019-08-29 Method, equipment and storage medium for extracting chart data in PDF document Active CN110516221B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910805559.4A CN110516221B (en) 2019-08-29 2019-08-29 Method, equipment and storage medium for extracting chart data in PDF document
PCT/CN2019/115964 WO2021035954A1 (en) 2019-08-29 2019-11-06 Method and device for extracting chart data in pdf document, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910805559.4A CN110516221B (en) 2019-08-29 2019-08-29 Method, equipment and storage medium for extracting chart data in PDF document

Publications (2)

Publication Number Publication Date
CN110516221A CN110516221A (en) 2019-11-29
CN110516221B CN110516221B (en) 2021-07-30

Family

ID=68627786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910805559.4A Active CN110516221B (en) 2019-08-29 2019-08-29 Method, equipment and storage medium for extracting chart data in PDF document

Country Status (2)

Country Link
CN (1) CN110516221B (en)
WO (1) WO2021035954A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861822B (en) * 2021-04-06 2024-03-12 刘羽 Map data processing method based on PDF file analysis
CN113095267B (en) * 2021-04-22 2022-09-27 上海携宁计算机科技股份有限公司 Data extraction method of statistical chart, electronic device and storage medium
CN112989779B (en) * 2021-05-20 2021-08-10 北京世纪好未来教育科技有限公司 Table generation method, electronic equipment and storage medium thereof
US11915389B2 (en) 2021-11-12 2024-02-27 Rockwell Collins, Inc. System and method for recreating image with repeating patterns of graphical image file to reduce storage space
US11954770B2 (en) 2021-11-12 2024-04-09 Rockwell Collins, Inc. System and method for recreating graphical image using character recognition to reduce storage space
US11842429B2 (en) 2021-11-12 2023-12-12 Rockwell Collins, Inc. System and method for machine code subroutine creation and execution with indeterminate addresses
US11887222B2 (en) 2021-11-12 2024-01-30 Rockwell Collins, Inc. Conversion of filled areas to run length encoded vectors

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871128A (en) * 2017-12-11 2018-04-03 广州市标准化研究院(广州市组织机构代码管理中心) A kind of high robust image-recognizing method based on SVG dynamic charts
CN109144504A (en) * 2017-06-26 2019-01-04 华东师范大学 Data visualization image generation method and storage medium based on D3
CN109189997A (en) * 2018-08-10 2019-01-11 武汉优品楚鼎科技有限公司 A kind of method, device and equipment that broken line diagram data extracts
CN109446487A (en) * 2018-11-01 2019-03-08 北京神州泰岳软件股份有限公司 A kind of method and device parsing portable document format document table
CN109461195A (en) * 2018-09-28 2019-03-12 武汉优品楚鼎科技有限公司 A kind of chart extracting method, device and equipment based on SVG

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103019682B (en) * 2012-11-20 2015-09-30 清华大学 A kind of method being combined in display data in SVG by self-defined figure
CN103034502B (en) * 2012-12-26 2016-03-02 江苏西电南自智能电力设备有限公司 A kind of method embedding dynamic realtime monitoring chart in SVG interface
CN107015953A (en) * 2017-04-12 2017-08-04 北京图文天地科技发展有限公司 It is a kind of that folding tablet held before the breast by officials print publishing method is spelled to pdf document
CN108038426A (en) * 2017-11-29 2018-05-15 阿博茨德(北京)科技有限公司 The method and device of chart-information in a kind of extraction document

Also Published As

Publication number Publication date
WO2021035954A1 (en) 2021-03-04
CN110516221A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
CN110516221B (en) Method, equipment and storage medium for extracting chart data in PDF document
US8166037B2 (en) Semantic reconstruction
CN110968667B (en) Periodical and literature table extraction method based on text state characteristics
EP2343670B1 (en) Apparatus and method for digitizing documents
US7698627B2 (en) Method, program, and device for analyzing document structure
CN106709032A (en) Method and device for extracting structured information from spreadsheet document
CN110765739B (en) Method for extracting form data and chapter structure from PDF document
CN110427488B (en) Document processing method and device
US20130124684A1 (en) Visual separator detection in web pages using code analysis
CN114821612B (en) Method and system for extracting information of PDF document in securities future scene
CN112883926A (en) Identification method and device for table medical images
CN112380812B (en) Method, device, equipment and storage medium for extracting incomplete frame line table of PDF (Portable document Format)
CN109656652A (en) Webpage graph making method, apparatus, computer equipment and storage medium
US11914567B2 (en) Text-based machine learning extraction of table data from a read-only document
CN112417826A (en) PDF online editing method and device, electronic equipment and readable storage medium
Yuan et al. An opencv-based framework for table information extraction
CN114155547B (en) Chart identification method, device, equipment and storage medium
CN112416340A (en) Webpage generation method and system based on sketch
CN112347353A (en) Webpage denoising method
CN117217172B (en) Table information acquisition method, apparatus, computer device, and storage medium
CN116306575B (en) Document analysis method, document analysis model training method and device and electronic equipment
US11600028B1 (en) Semantic resizing of line charts
CN117095422B (en) Document information analysis method, device, computer equipment and storage medium
CN117312574A (en) Information extraction method, device, equipment and storage medium
CN115205859A (en) Method, apparatus, and medium for parsing bitmaps into structured data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 503, 5th floor, C1 Building, 88 Dongchang Road, Suzhou Industrial Park, Jiangsu Province, 215000

Applicant after: Qicha Technology Co.,Ltd.

Address before: Room 503, 5th floor, C1 Building, 88 Dongchang Road, Suzhou Industrial Park, Jiangsu Province, 215000

Applicant before: SUZHOU LANGDONG NET TEC Co.,Ltd.

GR01 Patent grant
CP03 Change of name, title or address

Address after: No. 8 Huizhi Street, Suzhou Industrial Park, Suzhou Area, China (Jiangsu) Pilot Free Trade Zone, Suzhou City, Jiangsu Province, 215000

Patentee after: Qichacha Technology Co.,Ltd.

Address before: Room 503, 5th floor, C1 Building, 88 Dongchang Road, Suzhou Industrial Park, Jiangsu Province, 215000

Patentee before: Qicha Technology Co.,Ltd.
