Disclosure of Invention
The invention aims to provide a method, equipment and a storage medium for extracting diagram data in a PDF document.
In order to achieve one of the above objects, an embodiment of the present invention provides a method for extracting diagram data from a PDF document, where the method includes:
converting the PDF document into an SVG file;
acquiring coordinate information and statistical graphs of one or more charts in the SVG file;
calculating coordinate values of data nodes of the statistical graph according to the coordinate information;
and generating statistical data of the chart according to the coordinate values.
As a further improvement of an embodiment of the present invention, the step of "acquiring coordinate information and statistical graphs of one or more charts in the SVG file" specifically includes:
extracting a chart area of one or more charts in the SVG file;
extracting an X axis, a Y axis, and the scale values of the X axis and the Y axis in the chart area to obtain coordinate information of the chart corresponding to the chart area;
and extracting one or more statistical graphs in the graph area, wherein the statistical graphs are bar graphs, line graphs or curve graphs.
As a further improvement of an embodiment of the present invention, the step of "extracting a chart area of one or more charts in the SVG file" specifically includes:
identifying and extracting rectangles in the SVG file to obtain a rectangle set;
and screening the rectangle set for rectangles whose length and width meet the length and width thresholds, wherein the range framed by each such rectangle is a chart area.
As a further improvement of an embodiment of the present invention, the step of "extracting an X axis and a Y axis in the chart region" specifically includes:
identifying and extracting horizontal lines and vertical lines in the chart area to obtain a horizontal line set and a vertical line set;
screening the longest transverse line in the transverse line set, wherein if the longest transverse line is unique, it is the X axis; otherwise, screening the longest transverse lines by color or width to obtain a single distinctive longest transverse line, which is the X axis;
screening the longest vertical line in the vertical line set, wherein if there are only one or only two longest vertical lines, they are taken as the Y axis or Y axes; otherwise, screening the longest vertical lines by color or width to obtain one or two distinctive longest vertical lines, which are the Y axis or Y axes;
and if the vertical line set is empty, constructing a Y axis according to the X axis.
As a further improvement of an embodiment of the present invention, the step of "extracting scale values of the X axis and scale values of the Y axis in the chart region" includes:
recognizing and extracting text data in the chart area to obtain a text data set;
acquiring text data which is positioned at the lower side of the X axis and is closest to the X axis to obtain a scale value of the X axis;
if the Y axis is unique, acquiring text data which is located on the left side of the Y axis and is closest to the Y axis to obtain a scale value of the Y axis;
if there are two Y axes, the Y axis on the left side is a first Y axis and the Y axis on the right side is a second Y axis; text data which is located on the left side of the first Y axis and is closest to the first Y axis is acquired to obtain a scale value of the first Y axis, and text data which is located on the right side of the second Y axis and is closest to the second Y axis is acquired to obtain a scale value of the second Y axis.
As a further improvement of an embodiment of the present invention, the step of "extracting one or more statistical graphs in the chart area, wherein the statistical graphs are bar graphs, line graphs or curve graphs" specifically includes:
identifying and extracting the bar graphs/line graphs/curve graphs in the chart area, and, if a plurality of bar graphs/line graphs/curve graphs exist after the interference factors are filtered out, merging the bar graphs/line graphs/curve graphs that have the same attribute.
As a further improvement of an embodiment of the present invention, when there is a unique Y axis, the "calculating coordinate values of data nodes of the statistical graph based on the coordinate information" specifically includes:
projecting each data node in the statistical graph on an X axis and a Y axis;
and calculating projection values of each data node in the statistical graph on an X axis and a Y axis according to the scale values to obtain coordinate values of each data node in the statistical graph.
As a further improvement of an embodiment of the present invention, when there are two Y axes and at least two statistical graphs, the step of "calculating coordinate values of data nodes of the statistical graph according to the coordinate information" specifically includes:
extracting annotations of the statistical graphs in the chart area according to the color attributes of the statistical graphs, wherein at least two annotations exist;
when the annotation of one statistical graph is positioned above the annotation of another statistical graph, the statistical graph whose annotation is positioned above corresponds to the Y-axis coordinate on the left side, and the statistical graph whose annotation is positioned below corresponds to the Y-axis coordinate on the right side;
when the annotations of all the statistical graphs are positioned on the same horizontal line, the statistical graph with the annotation positioned on the left side corresponds to the Y-axis coordinate on the left side, and the statistical graph with the annotation positioned on the right side corresponds to the Y-axis coordinate on the right side;
projecting each data node in the statistical graph on an X axis and a corresponding Y axis;
and calculating projection values of each data node in the statistical graph on an X axis and a corresponding Y axis according to the scale values to obtain coordinate values of each data node in the statistical graph.
In order to achieve one of the above objects, an embodiment of the present invention provides an electronic device, which includes a memory and a processor, wherein the memory stores a computer program operable on the processor, and the processor executes the computer program to implement any one of the steps in the method for extracting diagram data in a PDF document.
To achieve one of the above objects, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of any one of the above methods for extracting diagram data in a PDF document.
Compared with the prior art, the method and the device have the advantages that the chart data information is automatically extracted from the PDF document, the extraction efficiency is greatly improved, the labor cost is saved, and the accuracy is improved.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments shown in the drawings. These embodiments are not intended to limit the present invention, and structural, methodological, or functional changes made by those skilled in the art according to these embodiments are included in the scope of the present invention.
As shown in FIG. 1, the method for extracting diagram data in a PDF document of the present invention comprises the following steps:
Step S1: converting the PDF document into an SVG file.
A file in PDF format may be converted to a file in SVG format through a variety of existing data conversion techniques, which are not set forth herein in detail. For an example of a chart to be extracted from a PDF document, refer to fig. 2. SVG (Scalable Vector Graphics) is a graphics format based on Extensible Markup Language (XML) for describing two-dimensional vector graphics.
SVG includes various elements. Text is generally defined with a text element, which includes the location information of the text data (the values of x and y) and the content of the text data (1,000. in the following example):
<text xmlns="http://www.w3.org/2000/svg" transform="matrix(1 0 0 1 0 0)" x="48.768 53.331 55.581 60.144 64.707 69.27" y="240.69" style="fill:#595959;font-family:'MFSGHA+Calibri';" font-size="9">1,000.</text>
SVG generally defines a path with a path element, whose d attribute holds the path instructions. The following is a white rectangle: M denotes the starting point and is immediately followed by its coordinates (42.24, 549.84) (i.e. x = 42.24, y = 549.84); L denotes a straight line, drawn between the node before L, (42.24, 549.84), and the node after L, (285.12, 549.84); and so on, until three sides of the rectangle are drawn. z closes the path, so that the figure bounded by the three sides is closed to form a rectangle, and the rectangle is filled with white (#FFFFFF), giving a white rectangle.
<path xmlns="http://www.w3.org/2000/svg" id="" transform="matrix(1 0 0 -1 0 841)" d="M42.24 549.84L285.12 549.84L285.12 549.84L285.12 712.32L285.12 712.32L42.24 712.32z" stroke="none" fill="#FFFFFF" fill-rule="evenodd"/>
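As an illustration only, the path instructions of the white rectangle above could be read into a list of nodes roughly as follows. This is a minimal sketch, not the parser used by the invention; the function names parse_path, path_points and bounding_box are hypothetical, and the sketch assumes that only the absolute commands M, L, C and z/Z occur and that numbers are plain decimals.

```python
import re

def parse_path(d):
    """Split a path 'd' string into (command, [numbers]) pairs.

    Assumption: only absolute M, L, C and z/Z commands are present.
    """
    commands = []
    for cmd, args in re.findall(r"([MLCZz])([^MLCZz]*)", d):
        numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", args)]
        commands.append((cmd, numbers))
    return commands

def path_points(d):
    """Return the (x, y) node of every M and L command, in drawing order."""
    return [(nums[0], nums[1]) for cmd, nums in parse_path(d) if cmd in "ML"]

def bounding_box(points):
    """Return ((x_min, x_max), (y_min, y_max)) of a list of nodes."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), max(xs)), (min(ys), max(ys))

d = ("M42.24 549.84L285.12 549.84L285.12 549.84L285.12 712.32"
     "L285.12 712.32L42.24 712.32z")
points = path_points(d)
box = bounding_box(points)
```

Here box holds the value ranges of the rectangle before the transform attribute is taken into account.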
Step S2: acquiring coordinate information and statistical graphs of one or more charts in the SVG file.
The SVG file can include one or more charts; for each chart, the coordinate information of the chart and the statistical graphs in the chart are extracted. The coordinate information of a chart refers to the positions of the coordinate axes in the chart, the scale information, and the like. A statistical graph is a geometric figure capable of representing statistical data, and here mainly refers to a bar graph, a line graph or a curve graph. A chart may include one or more statistical graphs, for example two line graphs, or a bar graph and a curve graph. Specifically, as shown in fig. 3, this step includes the following steps:
Step S21: extracting a chart area of one or more charts in the SVG file.
Before the related data of a chart is extracted, the chart must first be found (that is, the position where the chart is located must be found). The invention identifies and extracts the rectangles in the SVG file through the SVG element resolver to obtain a rectangle set. Because many rectangles may exist in the SVG file and not every rectangle is a chart (for example, a rectangle may only represent a small icon), it is necessary to screen the rectangle set for rectangles whose length and width meet the length and width thresholds (the thresholds can be obtained from experience). In addition, the screening can also be performed according to the fill color of the rectangle (the rectangular area of a chart is generally filled with white).
Each rectangle that meets the requirements after screening represents a chart, and the range framed by the rectangle is a chart area. For example, the white rectangle in the foregoing example defines a chart area whose value range on the x axis is [42.24,285.12] and whose value range on the y axis is [549.84,712.32]. It should be noted, however, that the fourth parameter in matrix(1 0 0 -1 0 841) is -1, which indicates that the y-axis direction is reversed, whereas 1 indicates that the value of y gradually increases from bottom to top. Therefore, if all cases are uniformly converted to the one in which the y axis points upward, the value range of the chart area on the x axis is [42.24,285.12] and the value range on the y axis is [128.68,291.16] (that is, 841 - 712.32 and 841 - 549.84). As shown in fig. 4, the chart in fig. 2 is obtained by extracting the content in the rectangular area.
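The conversion and the threshold screening described above can be sketched as follows. The sketch applies the six-parameter SVG matrix(a b c d e f) as x' = a·x + c·y + e and y' = b·x + d·y + f, which is the standard SVG definition; the function names and the threshold values of 100 are hypothetical, since the text only says that the thresholds are obtained from experience.

```python
def apply_matrix(matrix, x, y):
    """Apply an SVG matrix(a b c d e f) transform to one point."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)

def chart_area(points, matrix):
    """Return ((x_min, x_max), (y_min, y_max)) after the transform."""
    moved = [apply_matrix(matrix, x, y) for x, y in points]
    xs = [p[0] for p in moved]
    ys = [p[1] for p in moved]
    return (min(xs), max(xs)), (min(ys), max(ys))

def is_chart_area(area, min_width, min_height):
    """Keep only rectangles that are large enough to be a chart."""
    (x0, x1), (y0, y1) = area
    return (x1 - x0) >= min_width and (y1 - y0) >= min_height

# Corners of the white rectangle from the example above.
corners = [(42.24, 549.84), (285.12, 549.84),
           (285.12, 712.32), (42.24, 712.32)]
area = chart_area(corners, (1, 0, 0, -1, 0, 841))

# Hypothetical thresholds: at least 100 units wide and 100 units high.
keep = is_chart_area(area, min_width=100, min_height=100)
```

With these inputs the y range becomes approximately [128.68, 291.16], matching the example, and the rectangle passes the assumed thresholds.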
It should be noted that the SVG element resolver is prior art, and can recognize various vector graphics by analyzing the characteristics of the SVG elements. For example, in a path element, a d attribute whose instructions have the form MLLLLZ represents a rectangle, ML a straight line, MLL…L a polyline, and MCC…C a curve. The identification steps described below are generally performed by the SVG element resolver.
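A rough sketch of classification by instruction sequence is given below. It is not the prior-art resolver itself; classify_path is a hypothetical name, and the rectangle rule is deliberately loosened to "M, three or more L, then Z" as an assumption, because the white rectangle in the earlier example repeats some nodes and therefore contains five L instructions. A real resolver would also check the geometry of the nodes.

```python
import re

def classify_path(d):
    """Classify a path by its sequence of command letters only.

    Assumption: the path uses only M, L, C and Z commands.
    """
    cmds = "".join(re.findall(r"[MLCZ]", d.upper()))
    if re.fullmatch(r"ML{3,}Z", cmds):
        return "rectangle"  # candidate only; geometry is not verified
    if cmds == "ML":
        return "line"
    if re.fullmatch(r"ML{2,}", cmds):
        return "polyline"
    if re.fullmatch(r"MC+", cmds):
        return "curve"
    return "unknown"

rect = ("M42.24 549.84L285.12 549.84L285.12 549.84L285.12 712.32"
        "L285.12 712.32L42.24 712.32z")
kinds = [
    classify_path(rect),
    classify_path("M111.84 697.2L131.04 697.2"),
    classify_path("M0 0L1 1L2 0"),
    classify_path("M0 0C1 1 2 2 3 3C4 4 5 5 6 6"),
]
```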
Step S22: extracting the X axis, the Y axis, and the scale values of the X axis and the Y axis in the chart area to obtain the coordinate information of the chart corresponding to the chart area.
It should be noted that the SVG file has its own default coordinate axes, denoted here by the lowercase x axis and y axis, and each SVG element has corresponding coordinate values on them.
The coordinate axes X and Y in the chart area involved in this step (capital letters are used to distinguish them from the coordinate axes of the SVG) serve as the reference frame for measuring the chart data to be extracted from the chart area, and can generally be seen directly in the picture of the chart (although some charts do not draw the Y axis; as shown in fig. 2, only the X axis is drawn). This step extracts the coordinate information of the chart corresponding to the chart area, and comprises the following steps:
Step S221: identifying and extracting horizontal lines and vertical lines in the chart area to obtain a horizontal line set and a vertical line set.
A horizontal line is a straight line whose y values are equal, and a vertical line is a straight line whose x values are equal; for example, d="M111.84 697.2L131.04 697.2" represents a horizontal line.
Step S222: screening the longest transverse line in the transverse line set, wherein if the longest transverse line is unique, it is the X axis; otherwise, screening the longest transverse lines by color or width to obtain a single distinctive longest transverse line, which is the X axis.
The X axis is typically the longest transverse line within the rectangular area of the chart. When there are several longest transverse lines, one of them is usually distinctive, for example in a different color or width from the other longest transverse lines, and that distinctive longest transverse line is the X axis.
Step S223: screening the longest vertical line in the vertical line set, wherein if there are only one or only two longest vertical lines, they are taken as the Y axis or Y axes; otherwise, screening the longest vertical lines by color or width to obtain one or two distinctive longest vertical lines, which are the Y axis or Y axes.
The screening step for the Y axis is similar to the X axis and will not be described herein.
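The screening of the longest line can be sketched as follows for the single-axis case. This is only an illustration under stated assumptions: each line is represented as a dictionary with hypothetical keys x1, y1, x2, y2, stroke and width, "distinctive" is taken to mean that no other longest line shares the same (stroke, width) pair, and the tolerance tol is an assumed value.

```python
from collections import Counter

def line_length(line):
    """Length of a horizontal or vertical line (one term is zero)."""
    return abs(line["x2"] - line["x1"]) + abs(line["y2"] - line["y1"])

def pick_axis(lines, tol=1e-6):
    """Return the axis line, or None if it cannot be decided.

    Step 1: keep the longest lines. Step 2: if several remain, keep the
    one whose (stroke, width) style is not shared with any other.
    """
    if not lines:
        return None
    longest_len = max(line_length(l) for l in lines)
    longest = [l for l in lines if longest_len - line_length(l) <= tol]
    if len(longest) == 1:
        return longest[0]
    styles = Counter((l["stroke"], l["width"]) for l in longest)
    distinctive = [l for l in longest
                   if styles[(l["stroke"], l["width"])] == 1]
    return distinctive[0] if len(distinctive) == 1 else None

# Hypothetical horizontal lines: one black axis and two gray grid lines
# of the same length, plus a short legend marker.
h_lines = [
    {"x1": 48, "y1": 133, "x2": 260, "y2": 133, "stroke": "#000000", "width": 1.0},
    {"x1": 48, "y1": 180, "x2": 260, "y2": 180, "stroke": "#D9D9D9", "width": 0.5},
    {"x1": 48, "y1": 230, "x2": 260, "y2": 230, "stroke": "#D9D9D9", "width": 0.5},
    {"x1": 48, "y1": 100, "x2": 60, "y2": 100, "stroke": "#FF0000", "width": 2.0},
]
x_axis = pick_axis(h_lines)
```

Returning None when no single distinctive line exists leaves the decision to a later step, such as constructing the axis.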
Step S224: if the vertical line set is empty, constructing a Y axis according to the X axis.
If the Y axis selected in step S223 is not satisfactory (e.g., it is too short, or does not intersect the X axis), or the vertical line set is empty, the Y axis is not drawn in the chart area, and the Y axis then needs to be constructed according to the X axis. The construction method is as follows: the minimum x coordinate of the X axis is taken as the x coordinate of the Y axis, and the value range of the chart area on the y axis is taken as the value range of the Y axis on the y axis. For example, if the coordinate values of the X axis are ([48,260],133) and the value range of the chart area on the y axis is [128.68,291.16], the coordinate values of the constructed Y axis are (48, [128.68,291.16]).
Similarly, if there are two Y axes, the left and right Y axes can be constructed using the minimum and maximum x coordinates of the X axis, respectively.
It should be noted that the number of Y axes can be determined from the Y-axis scale values that can be acquired. In any case, two Y axes are first constructed from the X axis, and an attempt is then made to acquire the scale values near each Y axis (see the description below). If the acquisition succeeds and the acquired scale values meet the requirements (the distance of the scale values from the Y axis is within the threshold range), that Y axis exists; otherwise it does not.
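The construction of the Y axis from the X axis can be sketched as below. The tuple layout ((x_min, x_max), y) for the X axis and (x, (y_min, y_max)) for a Y axis follows the notation of the example above; the function name construct_y_axes and the two_axes flag are hypothetical.

```python
def construct_y_axes(x_axis, area_y_range, two_axes=False):
    """Build Y axes from the X axis and the chart area's y range.

    x_axis is ((x_min, x_max), y); each result is (x, (y_min, y_max)).
    The left axis uses the minimum x of the X axis, and the optional
    right axis uses the maximum x.
    """
    (x_min, x_max), _y = x_axis
    axes = [(x_min, area_y_range)]
    if two_axes:
        axes.append((x_max, area_y_range))
    return axes

x_axis = ((48, 260), 133)
area_y = (128.68, 291.16)
single = construct_y_axes(x_axis, area_y)
both = construct_y_axes(x_axis, area_y, two_axes=True)
```

In this sketch single reproduces the constructed axis (48, [128.68,291.16]) from the example, and both adds a second axis at x = 260.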
Step S225: identifying and extracting text data in the chart area to obtain a text data set.
Since the scale values of the X-axis and the Y-axis are stored in the form of text data, all the text data need to be extracted and then filtered by location.
Step S226: acquiring text data which is located below the X axis and is closest to the X axis to obtain the scale values of the X axis.
Step S227: if there is only one Y axis, acquiring text data which is located on the left side of the Y axis and is closest to the Y axis to obtain the scale values of the Y axis.
Step S228: if there are two Y axes, the Y axis on the left side is a first Y axis and the Y axis on the right side is a second Y axis; text data which is located on the left side of the first Y axis and is closest to the first Y axis is acquired to obtain the scale values of the first Y axis, and text data which is located on the right side of the second Y axis and is closest to the second Y axis is acquired to obtain the scale values of the second Y axis.
After the scale values are obtained in steps S226, S227 and S228, the scale values may be checked for reasonableness, for example, whether the distance between the scale values and the corresponding axis is within the threshold range is checked.
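For the X axis, the position filter and the reasonableness check can be sketched as follows. The sketch assumes that all coordinates have already been converted so that y increases upward (so "below" means a smaller y), that each text item is a dictionary with hypothetical keys x, y and content, and that max_gap and tol are empirically chosen values; none of these details are fixed by the text.

```python
def x_axis_scale_texts(texts, x_axis_y, max_gap, tol=0.5):
    """Return the row of text items directly below the X axis.

    Items farther than max_gap from the axis are rejected, which plays
    the role of the reasonableness check on the distance.
    """
    below = [t for t in texts
             if t["y"] < x_axis_y and x_axis_y - t["y"] <= max_gap]
    if not below:
        return []
    nearest_y = max(t["y"] for t in below)  # closest row under the axis
    row = [t for t in below if abs(t["y"] - nearest_y) <= tol]
    return sorted(row, key=lambda t: t["x"])

# Hypothetical text items inside a chart area.
texts = [
    {"x": 120.0, "y": 125.0, "content": "Feb"},
    {"x": 60.0, "y": 125.0, "content": "Jan"},
    {"x": 90.0, "y": 110.0, "content": "legend"},
    {"x": 30.0, "y": 200.0, "content": "1,000."},
]
labels = [t["content"] for t in x_axis_scale_texts(texts, 133.0, max_gap=20.0)]
```

The same idea applies to a Y axis by comparing x coordinates instead of y coordinates.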
Through the above steps, the X axis, the Y axis, and the scale values of the X axis and the Y axis in the chart area can be obtained, and the coordinate information of the chart corresponding to the chart area is thereby obtained.
Step S23: extracting one or more statistical graphs in the chart area, wherein the statistical graphs are bar graphs, line graphs or curve graphs.
The bar graphs/line graphs/curve graphs in the chart area are identified and extracted ("/" denotes "or"), and, if a plurality of bar graphs/line graphs/curve graphs exist after the interference factors are filtered out, the bar graphs/line graphs/curve graphs that have the same attribute are merged.
A bar graph (representing one statistical graph) is a group of rectangles that have the same fill color and whose bottoms or tops lie on the same horizontal line. When a bar graph is converted into SVG vector graphics, it is often divided into multiple graphics (for example, a bar graph includes 5 rectangles, of which two are placed in one path and the other three in another path). In this case, clustering is performed on an attribute of the graphics (here, the fill color), and the bar graphs with the same attribute are merged. After merging, the bar graph may also be checked for plausibility, for example whether the bottoms or tops of all rectangles of the merged bar graph lie on the same horizontal line.
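Clustering rectangles by fill color and checking the shared edge can be sketched as follows. The dictionary keys fill, x0, y0, x1 and y1, the function names, and the tolerance are hypothetical, and the rectangle values are made-up illustration data.

```python
from collections import defaultdict

def merge_bars(rects):
    """Group rectangles by fill color; each group is one bar graph."""
    groups = defaultdict(list)
    for r in rects:
        groups[r["fill"]].append(r)
    # Sort each bar graph from left to right.
    return {fill: sorted(g, key=lambda r: r["x0"])
            for fill, g in groups.items()}

def shares_edge(bars, tol=0.5):
    """Plausibility check: all bottoms, or all tops, are level."""
    bottoms = [b["y0"] for b in bars]
    tops = [b["y1"] for b in bars]
    return (max(bottoms) - min(bottoms) <= tol
            or max(tops) - min(tops) <= tol)

# Hypothetical rectangles: three blue bars split across paths, one gray box.
rects = [
    {"fill": "#4472C4", "x0": 90, "y0": 133, "x1": 100, "y1": 210},
    {"fill": "#4472C4", "x0": 60, "y0": 133, "x1": 70, "y1": 180},
    {"fill": "#A5A5A5", "x0": 50, "y0": 280, "x1": 55, "y1": 285},
    {"fill": "#4472C4", "x0": 120, "y0": 133, "x1": 130, "y1": 160},
]
bar_graphs = merge_bars(rects)
blue_ok = shares_edge(bar_graphs["#4472C4"])
```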
Similarly, a line graph (representing one statistical graph) may be divided into multiple polylines when it is converted into SVG vector graphics. In this case, merging is likewise performed according to the color attribute of the lines, and whether the two polylines to be merged are adjacent (for example, whether the tail node of the first polyline and the head node of the second polyline are the same) is also considered when merging.
The merging process for a curve graph (representing one statistical graph) is similar to that for a line graph.
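Merging polyline segments by color and adjacency can be sketched as follows. The sketch assumes that the segments arrive in drawing order and that adjacency means the tail node of an already collected polyline equals the head node of the next segment, as in the example given above; the function name and the sample data are hypothetical.

```python
def merge_polylines(polylines):
    """Merge (color, [points]) segments that share color and endpoint.

    A segment is appended to an existing polyline when the colors match
    and the existing tail node equals the segment's head node.
    """
    merged = []
    for color, pts in polylines:
        for item in merged:
            if item[0] == color and item[1][-1] == pts[0]:
                item[1].extend(pts[1:])  # skip the duplicated joint node
                break
        else:
            merged.append([color, list(pts)])
    return [(color, pts) for color, pts in merged]

# Hypothetical segments: a red line split in two, and one blue line.
segments = [
    ("#FF0000", [(0, 0), (1, 2)]),
    ("#0000FF", [(0, 5), (1, 6)]),
    ("#FF0000", [(1, 2), (2, 3)]),
]
lines = merge_polylines(segments)
```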
In addition, after one or more statistical graphs are extracted, the annotation corresponding to each statistical graph needs to be extracted according to its color attribute. As shown in fig. 4, the bar graph represents the number of commodity housing units transacted, and the line graph represents the commodity housing transaction area.
Step S3: calculating the coordinate values of the data nodes of the statistical graph according to the coordinate information.
When there is only one Y axis, each data node in the statistical graph is projected onto the X axis and the Y axis, and the projection values of each data node on the X axis and the Y axis are calculated according to the scale values to obtain the coordinate values of each data node in the statistical graph.
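One possible way to turn a projection into a value is linear interpolation between two scale values, sketched below. This assumes a linear (not logarithmic) Y axis and a categorical X axis whose labels are matched by nearest position; these are assumptions for illustration and are not stated in the text. The names parse_scale, make_axis_mapper and nearest_label, and the sample positions, are hypothetical.

```python
def parse_scale(text):
    """Turn scale text such as '1,000.' into a number."""
    return float(text.replace(",", ""))

def make_axis_mapper(ticks):
    """Build a function mapping an axis position to a data value.

    ticks is a list of (position, value) pairs; the first and last
    pairs define the linear scale.
    """
    if len(ticks) < 2:
        raise ValueError("at least two scale values are required")
    (p0, v0), (p1, v1) = ticks[0], ticks[-1]
    slope = (v1 - v0) / (p1 - p0)
    return lambda p: v0 + (p - p0) * slope

def nearest_label(x, labels):
    """Return the X-axis label whose position is closest to x."""
    return min(labels, key=lambda item: abs(item[0] - x))[1]

# Hypothetical scale: position 133 is value 0, position 233 is '1,000.'.
to_value = make_axis_mapper([(133.0, 0.0), (233.0, parse_scale("1,000."))])
x_labels = [(60.0, "Jan"), (120.0, "Feb")]

node = (118.0, 183.0)  # one data node of a statistical graph
coordinate = (nearest_label(node[0], x_labels), to_value(node[1]))
```

In this sketch the node halfway between the two scale positions maps to the value halfway between the two scale values.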
When there are two Y axes and at least two statistical graphs, the annotations of the statistical graphs in the chart area are extracted according to the color attributes of the statistical graphs, wherein at least two annotations exist;
when the annotation of one statistical graph is positioned above the annotation of another statistical graph, the statistical graph whose annotation is positioned above corresponds to the Y-axis coordinate on the left side, and the statistical graph whose annotation is positioned below corresponds to the Y-axis coordinate on the right side;
when the annotations of all the statistical graphs are positioned on the same horizontal line, the statistical graph with the annotation positioned on the left side corresponds to the Y-axis coordinate on the left side, and the statistical graph with the annotation positioned on the right side corresponds to the Y-axis coordinate on the right side;
projecting each data node in the statistical graph on an X axis and a corresponding Y axis;
and calculating projection values of each data node in the statistical graph on an X axis and a corresponding Y axis according to the scale values to obtain coordinate values of each data node in the statistical graph.
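The rule for pairing two statistical graphs with the left and right Y axes can be sketched as follows. The sketch assumes exactly two annotations, coordinates in which y increases upward, and a hypothetical tolerance tol for deciding that two annotations lie on the same horizontal line; the dictionary keys and the annotation names are illustrative only.

```python
def assign_y_axes(ann_a, ann_b, tol=0.5):
    """Map each annotation's name to 'left' or 'right'.

    If one annotation is above the other, the upper one uses the left
    Y axis; if they are level, the one further left uses the left axis.
    """
    if abs(ann_a["y"] - ann_b["y"]) > tol:
        if ann_a["y"] > ann_b["y"]:
            upper, lower = ann_a, ann_b
        else:
            upper, lower = ann_b, ann_a
        return {upper["name"]: "left", lower["name"]: "right"}
    first, second = sorted((ann_a, ann_b), key=lambda a: a["x"])
    return {first["name"]: "left", second["name"]: "right"}

# Hypothetical annotations (legend entries) of two statistical graphs.
stacked = assign_y_axes({"name": "bars", "x": 200, "y": 280},
                        {"name": "line", "x": 200, "y": 270})
level = assign_y_axes({"name": "line", "x": 180, "y": 280},
                      {"name": "bars", "x": 100, "y": 280})
```

Once each statistical graph has its Y axis, its data nodes are projected onto that axis and the X axis in the same way as in the single-axis case.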
Step S4: generating statistical data of the chart according to the coordinate values.
The statistical data of the chart is generated by compiling the coordinate values of the data nodes of each statistical graph in the chart.
As shown in fig. 5 and 6, fig. 5 is another example of a graph, and fig. 6 is a table format showing data extracted from fig. 5.
Further, the method for extracting diagram data in a PDF document according to the present invention may further include the following steps:
Step S5: acquiring the title of the chart above the chart area.
The title of the chart is text data and can be obtained by its location.
The invention also provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the program to realize the steps in the method for extracting the chart data in the PDF document.
The present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps in the above method of extracting chart data in a PDF document.
It should be understood that although the present description is organized by embodiments, not every embodiment contains only a single technical solution. This manner of description is adopted for clarity only; those skilled in the art should take the description as a whole, and the technical solutions in the embodiments can also be combined appropriately to form other embodiments understood by those skilled in the art.
The above-listed detailed description is only a specific description of a possible embodiment of the present invention, and they are not intended to limit the scope of the present invention, and equivalent embodiments or modifications made without departing from the technical spirit of the present invention should be included in the scope of the present invention.