CN113095267A

CN113095267A - Data extraction method of statistical chart, electronic device and storage medium

Info

Publication number: CN113095267A
Application number: CN202110434064.2A
Authority: CN
Inventors: 王小凤; 张浩波
Original assignee: Shanghai Jining Computer Technology Co ltd
Current assignee: Shanghai Jining Computer Technology Co ltd
Priority date: 2021-04-22
Filing date: 2021-04-22
Publication date: 2021-07-09
Anticipated expiration: 2041-04-22
Also published as: CN113095267B

Abstract

The embodiment of the invention relates to the field of information processing and discloses a data extraction method of a statistical chart, an electronic device and a storage medium. The data extraction method comprises the following steps: performing layer separation on a target image containing a statistical chart according to the type of the statistical chart by using a semantic segmentation model, acquiring a plurality of layers and determining the type of statistical chart corresponding to each layer, wherein each layer is a binary image containing only statistical graphics; acquiring the key point position information of the statistical graphics in each layer; determining the coordinate axes and scale information in the target image by using preset screening conditions; determining coordinate axis labels, by using a preset label screening condition, from text information recognized from the target image by a model; and determining the statistical data represented by each statistical graphic according to the key point position information, the coordinate axes, the scale information and the coordinate axis labels, and generating structured data. The scheme of the invention can extract statistical chart data accurately, completely, effectively and rapidly.

Description

Data extraction method of statistical chart, electronic device and storage medium

Technical Field

The embodiment of the invention relates to the field of information processing, in particular to a data extraction method of a statistical chart, electronic equipment and a storage medium.

Background

A statistical chart presents statistical data intuitively, but users also need to extract information such as the data in the chart so that it can subsequently be integrated or otherwise processed. However, when the statistical chart cannot be edited, for example a statistical chart in a Portable Document Format (PDF) file or a picture containing a statistical chart downloaded from a web page, the data in the chart cannot be exported directly, and the chart must be processed further. At present, commonly used extraction methods follow one of two approaches: first, processing the statistical chart from different aspects with a plurality of models, which respectively extract information such as the statistical data, scales and annotations in the chart; second, converting the file into Scalable Vector Graphics (SVG) format and then extracting data from the SVG file according to a plurality of extraction rules.

However, models and rules each have advantages and disadvantages. Both methods use only models or only rules, so neither can combine the respective advantages of the two, and the speed and accuracy of data extraction cannot be guaranteed to the maximum extent. In particular, when a model is used for extraction, the accuracy of the result depends on the accuracy of the model, and the model is easily affected by interference information, which leads to inaccurate results. To ensure the accuracy of the model, a large amount of corpus data is needed to train it, and the features in that data must be labeled manually, so a large amount of human resources is consumed to give the model a certain accuracy. As for rule-based extraction, compared with statistical charts in formats such as Portable Network Graphics (PNG), a statistical chart in SVG format has lower definition and cannot accurately describe the position of characters; that is, conversion into SVG format reduces the accuracy of the statistical chart and thereby reduces the accuracy of data extraction.

Disclosure of Invention

An object of an embodiment of the present invention is to provide a method for extracting statistical data from a statistical chart, an electronic device, and a storage medium, which are capable of accurately and quickly extracting statistical data from the statistical chart without format conversion, and extracting other information that can assist in understanding the statistical data, so that the extracted information is more complete and effective.

In order to solve the above technical problem, an embodiment of the present invention provides a data extraction method of a statistical chart, including: performing layer separation on a target image containing a statistical chart according to the type of the statistical chart by using a semantic segmentation model, acquiring a plurality of layers and determining the type of statistical chart corresponding to each layer, wherein each layer is a binary image containing only statistical graphics; acquiring the key point position information of the statistical graphics in each layer; determining the coordinate axes and scale information in the target image by using preset screening conditions; determining coordinate axis labels, by using a preset label screening condition, from the text information recognized from the target image by a model; and determining the statistical data represented by each statistical graphic according to the key point position information, the coordinate axes, the scale information and the coordinate axis labels, and generating structured data.

An embodiment of the present invention further provides an electronic device, including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the above-described data extraction method of a statistical chart.

Embodiments of the present invention also provide a computer-readable storage medium storing a computer program, which when executed by a processor implements the above-described data extraction method for a statistical chart.

According to the data extraction method of the statistical graph, provided by the embodiment of the invention, the semantic segmentation model is utilized to separate the layers of the target image according to the type of the statistical graph to obtain a plurality of binary images only containing the statistical graph as the layers, different statistical types correspond to an independent layer, and the layer represented by the binary images only contains the statistical graph of the statistical type, so that interference factors in the layers are reduced to the maximum extent, the acquired information of the positions of the key points of the statistical graph is subjected to small interference and high accuracy, and in the layer separation process, the semantic segmentation model can efficiently and accurately obtain each layer through processing, and the advantages of the semantic segmentation model are fully utilized. And finally, after the coordinate axis label is determined, statistical data and structural data are generated according to the coordinate axis label, the scale information, the coordinate axis and the key point position information, various information in the statistical graph is fully utilized, the statistical data are obtained through comprehensive analysis, and the structural data are generated, so that the data can be extracted quickly and accurately. In addition, the model and the rule participate in the data extraction process together, the advantages of the model and the rule are fully utilized, the processing efficiency and the accuracy are improved, the model and the rule do not need to be converted into an SVG format, and the processing speed is further accelerated.

In addition, in the data extraction method of a statistical chart provided by the embodiment of the present invention, the type of the statistical chart is a histogram, the statistical graphic is a rectangle, the key point position information is the diagonal point position information of the rectangle, and the acquiring of the key point position information of the statistical graphic in the layer includes: detecting whether the rectangles in the layer are complete; if at least one of the rectangles is incomplete, completing the incomplete rectangle; detecting whether the rectangles in the layer are connected; if a plurality of rectangles are connected, segmenting the connected rectangles; and acquiring the diagonal point position information of each rectangle in the layer. The key point position information directly determines the position and geometry of the statistical graphic, and that geometry is the visual expression of the statistical data, so the accuracy of the key point position information directly determines the accuracy of the extracted statistical data.

In addition, in the data extraction method of a statistical chart provided by the embodiment of the present invention, the determining of the coordinate axes in the target image includes: taking the only horizontal line segment in the target image that meets a preset first length condition as the abscissa axis, or, when the statistical chart comprises a histogram, determining the abscissa axis according to the position of the rectangles in the histogram; and, if the statistical chart has an ordinate axis, taking the unique vertical line or lines in the target image that meet a preset second length condition as the ordinate axis, or determining the ordinate axis by using the position of the statistical chart. The method takes into account the actual situation of the statistical chart contained in the target image and provides ways of determining the coordinate axes for different application scenes, so that the correctness of the coordinate axes is ensured as far as possible in different situations, the subsequent operations that depend on the coordinate axes become more precise, and the precision of the extracted data is ultimately improved. Because the abscissa and ordinate axes can be determined in several ways, the method is applicable to statistical charts in different scenes and is more flexible and practical.
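The abscissa-axis rule above can be illustrated with a minimal sketch. It rests on several assumptions that the text does not fix: the "first length condition" is taken to be a minimum fraction of the image width (`min_ratio`), horizontal segments are given as `(x1, x2, y)` tuples, rectangles as `(x1, y1, x2, y2)` in image coordinates, and the fallback uses the most common bottom edge of the rectangles. The function name `find_x_axis` is hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple

Segment = Tuple[int, int, int]    # (x1, x2, y) of a horizontal line segment
Rect = Tuple[int, int, int, int]  # (x1, y1, x2, y2), y grows downward


def find_x_axis(h_segments: List[Segment], bars: List[Rect],
                image_width: int, min_ratio: float = 0.5) -> Optional[int]:
    """Return the y position of the abscissa axis, or None if undetermined."""
    # Assumed length condition: the segment spans at least min_ratio of the width.
    long_segments = [s for s in h_segments
                     if (s[1] - s[0]) >= min_ratio * image_width]
    if len(long_segments) == 1:
        return long_segments[0][2]
    # Assumed fallback: bars usually stand on the axis, so take the most
    # common bottom edge of the rectangles.
    if bars:
        bottoms = Counter(y2 for _, _, _, y2 in bars)
        return bottoms.most_common(1)[0][0]
    return None
```

When exactly one sufficiently long horizontal segment exists it wins; otherwise the rectangles decide, which mirrors the two alternatives named in the paragraph.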

In addition, in the data extraction method of a statistical chart according to the embodiment of the present invention, the scale information includes first scales and second scales, and the determining of the scale information of the target image includes: determining initial scales from the target image and recording the distance between every two adjacent initial scales as a first spacing; grouping equidistant initial scales together according to the first spacing to obtain a plurality of scale groups; taking the initial scales in the scale group containing the largest number of initial scales as the first scales, and taking the distance between two adjacent first scales as a second spacing; and determining the second scales from the remaining initial scales according to the first spacing and the second spacing. The corresponding rules improve the accuracy of the labels as far as possible: after the first scales are screened out, the second scales are further obtained according to the spacing between the first scales, so that the scale information composed of the first and second scales is as complete as possible. This improves the reliability and precision of determinations based on the scale information, and the statistical data finally obtained from it are more accurate.
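The grouping step can be sketched as follows. This is only an illustration under stated assumptions: candidate scale marks are reduced to integer pixel positions along one axis, "equidistant" is read as exact equality of adjacent gaps (a real image would likely need a tolerance), and the function name `pick_first_scales` is hypothetical. The rule for selecting second scales is not spelled out in this passage, so it is not implemented.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def pick_first_scales(positions: List[int]) -> Tuple[List[int], int]:
    """Group candidate scale marks by the gap to their neighbour and return
    the largest group (the first scales) with its gap (the second spacing)."""
    pts = sorted(set(positions))
    groups: Dict[int, Set[int]] = defaultdict(set)
    for a, b in zip(pts, pts[1:]):
        # The gap between adjacent candidates is the "first spacing".
        groups[b - a].update((a, b))
    if not groups:
        return pts, 0  # fewer than two candidates: nothing to group
    spacing, members = max(groups.items(), key=lambda kv: len(kv[1]))
    return sorted(members), spacing
```

For the candidates `[100, 150, 200, 250, 262, 300]`, the spurious mark at 262 breaks the last regular gap, so only 100 to 250 are returned with a spacing of 50. The mark at 300 lies exactly one spacing beyond 250, which is presumably the kind of candidate the second-scale step is meant to recover.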

In addition, in the data extraction method of a statistical chart provided in the embodiment of the present invention, the scale information includes guide lines, and the screening of the initial scales from the target image includes: identifying horizontal lines in the target image; and taking the horizontal lines as the initial scales. The guide lines obtained help in the subsequent checking of the determined longitudinal axis labels. In particular, when no scale point exists on the longitudinal axis or the scale points on the longitudinal axis cannot be determined, the longitudinal axis labels can still be checked. The various kinds of information in the statistical chart are thus fully used, which enhances the reliability and accuracy of the statistical data extracted from that information.

In addition, in the data extraction method of the statistical chart according to the embodiment of the present invention, the scale information includes a scale point, and the determining an initial scale from the target image includes: carrying out binarization on the target image and filling a statistical graph in the target image into a background color; and identifying a line segment which is within a preset threshold value and is vertical to the coordinate axis in the target image as the initial scale. The statistical graph and the binaryzation are filled through the background color, so that the interference in the image can be eliminated, the number of possible scale points is reduced, the workload of subsequent processing is reduced, the influence of the interference is reduced, the accuracy of the determined scale points is improved, the result of detecting the label by using the scale points is more accurate, the accuracy of the label is improved, and finally the precision of statistical data obtained according to the label can be improved.

In addition, in the data extraction method of a statistical chart according to the embodiment of the present invention, the coordinate axis labels include horizontal axis labels, and the determining of the horizontal axis labels, by using a preset label screening condition, from the text information recognized from the target image by a model includes: taking the abscissa axis among the coordinate axes as a reference mark; determining the text information located below the reference mark as first text information; dividing the first text information lying on the same horizontal straight line into one group to obtain a plurality of first text groups; taking the first text group containing the largest number of pieces of first text information as a second text group; taking the first text groups that are closest or second closest to the reference mark in the vertical direction as third text groups; if the second text group is one of the third text groups, taking the first text information in the second text group as the initial horizontal axis labels; if the second text group is not one of the third text groups, changing the reference mark to the statistical graphic and re-determining the initial horizontal axis labels; and screening the initial horizontal axis labels according to the scale information to determine the horizontal axis labels. This way of determining the labels is tailored to the actual situation, and the labels are screened again after the initial horizontal axis labels are determined, which improves the accuracy of the horizontal axis labels and of the structured data generated from them and the other information.

In addition, in the data extraction method of a statistical chart provided in the embodiment of the present invention, the coordinate axis label includes a longitudinal axis label, and the determining of the longitudinal axis label from text information identified by the target image using the model using a preset label screening condition includes: taking one or two groups of text information which are equal in difference and aligned with each other on the two sides of the statistical chart as the initial longitudinal axis label, or taking one or two groups of text information which are equal in difference and aligned with each other on the outer side of the ordinate axis as the initial longitudinal axis label when the determined coordinate axis comprises the ordinate axis; and taking the initial longitudinal axis label with the midpoint in the vertical direction and the corresponding scale information on the same horizontal line as the longitudinal axis label. The characteristics of the longitudinal axis label are fully considered, the longitudinal axis label is adaptively screened from the text of the longitudinal axis label through data equal difference, the screening range is narrowed, the subsequent processing workload is reduced, and the longitudinal axis label is also tested after the initial longitudinal axis label is determined, so that the accuracy of the longitudinal axis label is improved, and the accuracy of statistical data determined according to the longitudinal axis label is improved.
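The "equal difference and aligned" screening can be illustrated with a short sketch. It assumes OCR results are available as `(text, x, y)` tuples, that alignment means sharing an x position within a tolerance `tol`, and that at least three values are needed to call a sequence equal-difference. All function names are hypothetical, and the second check against the scale information is not shown.

```python
import math
from typing import List, Optional, Tuple

TextBox = Tuple[str, float, float]      # (text, x, y_center)
NumBox = Tuple[float, float, float]     # (value, x, y_center)


def to_number(text: str) -> Optional[float]:
    """Parse label text such as '1,000' or '50%'; return None if not numeric."""
    try:
        return float(text.replace(",", "").rstrip("%"))
    except ValueError:
        return None


def is_equal_difference(values: List[float]) -> bool:
    """True if the values form an arithmetic sequence with a non-zero step."""
    if len(values) < 3:
        return False
    diffs = [b - a for a, b in zip(values, values[1:])]
    return diffs[0] != 0 and all(math.isclose(d, diffs[0]) for d in diffs)


def initial_y_labels(boxes: List[TextBox], tol: float = 3.0) -> List[List[NumBox]]:
    """Return the vertically aligned columns of numbers that are equal-difference."""
    numeric = [(to_number(t), x, y) for t, x, y in boxes]
    numeric = sorted((b for b in numeric if b[0] is not None), key=lambda b: b[1])
    columns: List[List[NumBox]] = []
    for box in numeric:
        # Boxes whose x positions agree within tol are treated as aligned.
        if columns and abs(box[1] - columns[-1][0][1]) <= tol:
            columns[-1].append(box)
        else:
            columns.append([box])
    result = []
    for col in columns:
        col.sort(key=lambda b: b[2])  # top to bottom
        if is_equal_difference([v for v, _, _ in col]):
            result.append(col)
    return result
```

A chart with two longitudinal axes would yield two columns here, matching the "one or two groups" wording of the paragraph.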

In addition, the method for extracting data of a statistical graph according to an embodiment of the present invention, where the determining statistical data represented by each statistical graph according to the key point position information, the coordinate axis, the scale information, and the coordinate axis label and generating structural data includes: determining legend information from the target image; determining the corresponding relation between the coordinate axis label and the statistical graph according to the legend information; determining a data value represented by a single pixel point according to the scale information and the coordinate axis label; and obtaining the statistical data according to the key point position information, the coordinate axis, the corresponding relation and the data value represented by the single pixel point and generating the structural data. And acquiring legend information, so that the correspondence between the longitudinal axis label and the statistical graph can be determined subsequently by using the legend information, and further, actual data corresponding to a single pixel point in the statistical graph is accurately calculated to obtain accurate statistical data corresponding to the statistical graph.
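The step of determining the data value represented by a single pixel point amounts to a linear interpolation between two labelled scales. The sketch below assumes image coordinates in which y grows downward, two scale marks given as `(y_pixel, value)` pairs, and hypothetical function names `value_per_pixel` and `bar_value`.

```python
from typing import Tuple

Scale = Tuple[float, float]  # (y position in pixels, labelled value)


def value_per_pixel(scale_a: Scale, scale_b: Scale) -> float:
    """Data value covered by one pixel of height, from two labelled scales."""
    (ya, va), (yb, vb) = scale_a, scale_b
    if ya == yb:
        raise ValueError("the two scales must lie at different heights")
    # y grows downward while values grow upward, hence the swapped order.
    return (va - vb) / (yb - ya)


def bar_value(top_y: float, reference: Scale, per_pixel: float) -> float:
    """Value of a bar whose top edge is at top_y, measured from a labelled scale."""
    ref_y, ref_value = reference
    return ref_value + (ref_y - top_y) * per_pixel
```

For example, with the label 100 at y = 20 and the label 0 at y = 100, one pixel corresponds to 1.25 units, so a rectangle whose upper corner lies at y = 40 represents 75.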

In addition, in the data extraction method of a statistical graph provided in the embodiment of the present invention, the legend information includes legend colors and legend texts, and the determining the legend information of the statistical graph from the target image includes: searching legend color marking blocks in the upper area and the lower area of the statistical graph according to the color of the statistical graph in the target image; determining the text information adjacent to the left side or the right side of the legend color marking block as a legend text; the determining the corresponding relationship between the coordinate axis labels and the statistical graphs according to the legend information includes: detecting whether the legend text carries indication information used for determining the corresponding coordinate axis label; if the legend text carries the indication information, determining the corresponding relation according to the indication information; and if the legend text does not carry the indication information, determining the corresponding relation according to the position of the legend text and/or the meaning of the legend characters. The legend information is searched by utilizing the color of the statistical graph, a more definite search basis is obtained, the speed and the accuracy of searching the legend information are improved, the accuracy of the legend information is further improved, the situation that the corresponding relation between the statistical graph and the longitudinal axis label is determined according to the legend information to cause deviation or error is avoided, the accuracy of the obtained statistical data is guaranteed, meanwhile, various conditions are fully considered, corresponding operation is performed pertinently, the accuracy of the corresponding relation is guaranteed to the maximum degree, and the efficiency of determining the corresponding relation can be improved.

Drawings

One or more embodiments are illustrated by way of example in the corresponding figures of the accompanying drawings, in which like reference numerals refer to similar elements; the figures are not to scale unless otherwise specified.

FIG. 1 is a flow chart of a data extraction method of a statistical chart provided by an embodiment of the present invention;

FIG. 2 is a first schematic diagram of layers involved in the data extraction method of a statistical chart according to the embodiment shown in FIG. 1;

FIG. 3 is a second schematic diagram of layers involved in the data extraction method of a statistical chart according to the embodiment shown in FIG. 1;

FIG. 4 is a diagram illustrating extraction results involved in the data extraction method of a statistical chart provided in the embodiment shown in FIG. 1;

FIG. 5 is a flow chart of a data extraction method of a statistical chart provided by another embodiment of the present invention;

FIG. 6 is a schematic diagram of horizontal straight lines involved in the data extraction method of a statistical chart provided by the embodiment shown in FIG. 5;

FIG. 7 is a diagram illustrating the results of guide line screening involved in the data extraction method of a statistical chart provided in the embodiment shown in FIG. 5;

FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. However, it will be appreciated by those of ordinary skill in the art that numerous technical details are set forth in order to provide a better understanding of the present application in various embodiments of the present invention. However, the technical solution claimed in the present application can be implemented without these technical details and various changes and modifications based on the following embodiments. The following embodiments are divided for convenience of description, and should not constitute any limitation to the specific implementation manner of the present invention, and the embodiments may be mutually incorporated and referred to without contradiction.

The implementation details of the data extraction method of the statistical chart of the present embodiment will be specifically described below with reference to fig. 1, and the following description is only provided for the convenience of understanding, and is not necessary to implement the present solution.

Step 101, performing layer separation on a target image containing a statistical graph according to the type of the statistical graph by using a semantic segmentation model, acquiring a plurality of layers and determining the type of the statistical graph corresponding to each layer.

In this embodiment, the target image may be a picture containing a statistical chart captured from a Portable Document Format (PDF) file, or a picture containing a statistical chart downloaded and saved from a web page. The statistical chart in the target image may be a histogram, a line graph, a superposition of multiple histograms or a superposition of a histogram and a line graph; this embodiment does not specifically limit the statistical chart. A layer is a binary image containing only one kind of statistical graphic, where a statistical graphic is the pattern used to visually represent statistical data in the statistical chart; for example, the statistical graphic of a line graph is a broken line and the statistical graphic of a histogram is a rectangle. The binary image may be a monochrome image, a black-and-white image, etc.; for example, the layer may be an image as shown in fig. 2.

Specifically, after the target image is obtained, the statistical graphics of the statistical chart in the target image are identified, statistical graphics belonging to the same statistical chart type are separated into the same binary image to form a plurality of layers containing only statistical graphics, and the statistical chart type corresponding to the statistical graphics in each layer is determined.

More specifically, a semantic segmentation model for extracting the statistical graphics of chart types such as column graphs and line graphs from images is trained; the trained semantic segmentation model is then used to identify the regions where the statistical graphics in the statistical chart are located and to separate those regions to obtain the layers, and the statistical chart type corresponding to each layer is stored at the same time.
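As a rough illustration of the layer separation, the sketch below starts from the point where a segmentation model has already produced a per-pixel class map; the model itself is not shown. The class ids in `CLASS_NAMES` and the function name `split_layers` are assumptions made for this example, since the text does not specify an output encoding.

```python
from typing import Dict, List

# Hypothetical class ids; 0 is assumed to be background.
CLASS_NAMES = {1: "bar", 2: "line"}


def split_layers(class_map: List[List[int]]) -> Dict[str, List[List[int]]]:
    """Turn a per-pixel class map into one binary layer per chart type."""
    layers: Dict[str, List[List[int]]] = {}
    for class_id, name in CLASS_NAMES.items():
        # 1 where the pixel belongs to this type of graphic, 0 elsewhere.
        layer = [[1 if px == class_id else 0 for px in row] for row in class_map]
        # Keep only layers that actually contain pixels of this type.
        if any(any(row) for row in layer):
            layers[name] = layer
    return layers
```

The dictionary key plays the role of the stored statistical chart type for each layer.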

Step 102, acquiring the key point position information of the statistical graphics in the layer.

In this embodiment, because the layers corresponding to different types of statistical graphs actually include different statistical graphs, the representation methods of the position information of the key points in different layers are different, for example, the position information of the key points in the histogram is the diagonal point of the rectangle, and accordingly, the operation of determining the position information of the key points is also different. Specifically, the following may exist in determining the key point location information:

First, the type of statistical chart corresponding to the layer is a column graph, the statistical graphic is a rectangle, and the key points are diagonal points of the rectangle, such as the upper left and lower right corner points.

In this case, to obtain accurate key point position information, contour detection may be performed on the layer, rectangle approximation is then performed on each contour obtained, and the diagonal point information of each approximated rectangle is acquired. Rectangle approximation represents every detected contour uniformly as a rectangle. For example, the 4th rectangle from the left in fig. 2 is divided into an upper and a lower contour, and the upper left point and lower right point of the upper contour are taken as the upper left and lower right corner points of a rectangle constructed for that contour.

In one example, the statistical chart in the target image includes both a bar graph and a line graph, and the broken line covers part of the area of some rectangles, so some rectangles may be incomplete after layer separation. For example, the upper and lower outlines of the 4th rectangle from the left in fig. 2 are actually independent of each other; after rectangle approximation they are recognized as two rectangles, two sets of diagonal point positions are obtained whose abscissas are correspondingly the same, and an error finally results. Therefore, before the key point position information is obtained, it is also necessary to detect whether the rectangles in the layer are complete. Specifically, after all independent outlines in the layer are approximated as rectangles, it is checked whether any approximated rectangles share the same abscissas, that is, whether the abscissas of the upper left and lower right corner points of two approximated rectangles are equal or approximately equal; if so, an incomplete rectangle is determined to exist. If at least one rectangle is incomplete, the incomplete rectangle is completed. Specifically, after two rectangles with the same abscissas are detected, they are merged: the smaller of the two upper left ordinates and the larger of the two lower right ordinates are taken as the ordinates of the merged rectangle's diagonal points.
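The completion step can be sketched as a merge over approximated rectangles. The sketch assumes rectangles are given as `(x1, y1, x2, y2)` with the upper left and lower right corners in image coordinates (y grows downward), and that "approximately equal" means within a pixel tolerance `tol`; both the tolerance and the function name `merge_split_rects` are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x1, y1, x2, y2), y grows downward


def merge_split_rects(rects: List[Rect], tol: int = 2) -> List[Rect]:
    """Merge rectangles that share (approximately) the same abscissas."""
    merged: List[Rect] = []
    for x1, y1, x2, y2 in sorted(rects):
        for i, (mx1, my1, mx2, my2) in enumerate(merged):
            if abs(x1 - mx1) <= tol and abs(x2 - mx2) <= tol:
                # Same column: keep the smaller top and the larger bottom ordinate.
                merged[i] = (mx1, min(my1, y1), mx2, max(my2, y2))
                break
        else:
            merged.append((x1, y1, x2, y2))
    return merged
```

For instance, the fragments `(30, 20, 40, 45)` and `(30, 55, 40, 100)` of a bar crossed by a line are merged into `(30, 20, 40, 100)`, while a bar at different abscissas is left untouched.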

In another example, the statistical chart in the target image combines several single column graphs whose rectangles are arranged adjacent to one another, so rectangles may be connected in the separated layer, as shown in fig. 3. Only one set of diagonal point information is detected for connected rectangles, which ultimately causes key point positions to be missed. Therefore, before the key point position information is obtained, it is also necessary to detect whether the rectangles in the layer are connected. Specifically, it is judged whether more than two vertical lines exist in a detected outline; if so, connected rectangles are considered to exist. If a plurality of rectangles are connected, the connected rectangles are segmented. Specifically, the outline with the most detected vertical lines is taken as the target, the spacing of the vertical lines in the target other than the two outermost ones is obtained, and each outline is cut according to that spacing, for example by drawing a straight line in the background color to separate the different areas. Alternatively, the spacing between adjacent vertical lines in the target is obtained, and each outline is cut repeatedly along the vertical direction at that interval, so that the width of each resulting graphic equals the interval.

It should be noted that the accuracy and precision of the extracted data are not affected by the width of the rectangle, so the present embodiment does not require the precision of the cutting, that is, the precision of the distance between the cutting lines when the rectangle is cut.
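A minimal sketch of segmenting connected rectangles is given below. It assumes the layer is a binary image stored as rows of 0/1 values and that the x positions of the vertical lines (including the two outermost ones) have already been detected and are passed in as `x_cuts`; line detection itself is not shown, and the function name `split_connected` is hypothetical. Instead of drawing background-colored lines, the sketch slices the layer at the cut positions and takes the bounding box of each slice.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x1, y1, x2, y2), inclusive pixel coordinates


def split_connected(layer: List[List[int]], x_cuts: List[int]) -> List[Rect]:
    """Cut a blob of connected bars at the given x positions and return the
    diagonal points of each resulting rectangle."""
    rects: List[Rect] = []
    for left, right in zip(x_cuts, x_cuts[1:]):
        # Rows that contain foreground pixels within this vertical slice.
        rows = [y for y, row in enumerate(layer) if any(row[left:right])]
        if rows:
            rects.append((left, min(rows), right - 1, max(rows)))
    return rects
```

Because only the height of each rectangle carries data, small errors in the cut positions change the returned widths but not the top and bottom ordinates, which is consistent with the remark above about cutting precision.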

It should be further noted that, since the position information of the key point can directly position the statistical graph, and the statistical graph reflects the statistical data, the accuracy of the position information of the key point directly determines the accuracy of extracting the statistical data, and the above steps detect the rectangle of the layer where the histogram is located and perform corresponding operations, complement and/or partition, thereby ensuring the independence and integrity of the statistical graph in the histogram, avoiding the condition of key information loss or error caused by the incompleteness and independence of the statistical graph when the key point position information is obtained, making the statistical data calculated by using the key point position information more accurate and reliable, and further improving the accuracy of the invention.

Second, the type of statistical chart corresponding to the layer is a line graph, the statistical graphic is a line segment, and the key points are the end points of the line segment.

At this time, in order to obtain accurate key point position information, all line segments in the layer may be determined through line detection, and then end point position information of the line segments is obtained.

It should be noted that the obtained layer may have only one layer, or may include a plurality of layers. When only one layer exists, determining corresponding steps from the two situations according to the type of the actual statistical graph; when there are multiple layers, including both the layer corresponding to the bar graph and the layer corresponding to the line graph, both cases need to be included.

It should be further noted that, in this embodiment, the position information may be a coordinate, where a coordinate system where the coordinate is located uses an upper left point of the layer as an origin, a vertical direction is a positive direction of an ordinate axis, a horizontal direction is a positive direction of an abscissa axis, and a size of one pixel point is a unit scale of the coordinate axis. In other embodiments, the coordinate system may be established in other ways.

Step 103, determining coordinate axes and scale information in the target image by using preset screening conditions.

In this embodiment, the coordinate axis determined in step 103 at least includes an abscissa axis, and the scale information is information used for explaining a scale on the coordinate axis, such as a scale point.

Step 104, determining coordinate axis labels, by using a preset coordinate axis label screening condition, from the text information recognized from the target image by a model.

In this embodiment, the coordinate axis labels include a horizontal axis label and a vertical axis label, where the horizontal axis label refers to a character used for labeling information such as meaning represented by a certain scale of the horizontal axis in the statistical chart, for example, 2018, and similarly, the vertical axis label refers to a character used for labeling information such as meaning represented by a certain scale of the vertical axis in the statistical chart, and is generally a number. The following is a description of determining the horizontal and vertical axis labels:

Determining the horizontal axis label specifically includes the following. First, the abscissa axis among the coordinate axes is taken as a reference mark. Text information located below the reference mark is then determined as first text information, where the text information is text extracted from the target image by a tool such as Optical Character Recognition (OCR). The first text information located on the same horizontal straight line is divided into one group, giving a plurality of first text groups. The first text group containing the largest number of pieces of first text information is taken as the second text group, and the first text groups that are closest or second closest to the reference mark in the vertical direction are taken as the third text group. If the second text group belongs to the third text group, the first text information in the second text group is taken as the initial horizontal axis label. If the second text group does not belong to the third text group, the reference mark is changed to the statistical graph, the method returns to the step of determining the text information located below the reference mark as first text information, and the initial horizontal axis label is determined once more with the statistical graph as the reference mark. Finally, after the initial horizontal axis label is obtained, it is further screened according to the scale information to determine the horizontal axis label. This label determination method is designed around how charts are actually laid out, which improves the accuracy of the horizontal axis label; checking the label after the initial horizontal axis label is determined improves that accuracy further, and in turn improves the accuracy of the structural data generated from the horizontal axis label and other information.

More specifically, further screening the initial horizontal axis label according to the scale information means checking whether the midpoint between two adjacent pieces of scale information and the horizontal midpoint of the corresponding initial horizontal axis label lie on the same vertical line; only an initial horizontal axis label that passes this check is used as the horizontal axis label.

It should be noted that the number of pieces of first text information is not the number of characters; all the characters located in the same text box form one text, which is counted as one.
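The selection of the initial horizontal axis label described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function name `initial_horizontal_labels`, the `(text, x_center, y_center)` box format, and the `row_tolerance` value are assumptions introduced here, and the image coordinate system is assumed to have y increasing downward.

```python
def initial_horizontal_labels(text_boxes, reference_y, row_tolerance=3.0):
    """Pick the initial horizontal axis labels below a reference mark.

    text_boxes: list of (text, x_center, y_center); one entry per OCR text box
    (hypothetical format). reference_y: y position of the reference mark.
    Returns the label texts ordered left to right, or None when the largest
    row is not one of the two rows nearest to the reference mark.
    """
    # First text information: text boxes located below the reference mark.
    below = [box for box in text_boxes if box[2] > reference_y]

    # First text groups: boxes lying on (approximately) the same horizontal line.
    rows = []
    for box in sorted(below, key=lambda b: b[2]):
        if rows and abs(box[2] - rows[-1][-1][2]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    if not rows:
        return None

    # Second text group: the row with the most text boxes.
    largest = max(rows, key=len)
    # Third text group: the rows closest and second closest to the reference
    # mark (rows are already ordered by increasing distance below it).
    if any(largest is row for row in rows[:2]):
        return [text for text, _, _ in sorted(largest, key=lambda b: b[1])]
    return None  # caller may retry with the statistical graph as reference


boxes = [
    ("Revenue", 100, 10),                 # title, above the axis
    ("2018", 50, 210), ("2019", 100, 211), ("2020", 150, 210),
    ("Source: annual report", 100, 240),  # footnote, further below
]
labels = initial_horizontal_labels(boxes, reference_y=200)
```

When the function returns `None`, a caller following the text would change the reference mark to the statistical graph and call it once more.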

Determining the vertical axis label specifically includes: taking one or two groups of text information on the two sides of the statistical chart whose values form an arithmetic progression and which are aligned in the same manner as initial vertical axis labels, or, when the coordinate axes include an ordinate axis, taking one or two such groups of text information on the outer side of the ordinate axis as initial vertical axis labels; and, if the scale information and the vertical midpoint of an initial vertical axis label lie on the same horizontal line, taking that initial vertical axis label as the vertical axis label. This takes the characteristics of vertical axis labels fully into account: screening the candidate text by equal difference narrows the search range and reduces the subsequent processing workload, and checking the label after the initial vertical axis label is determined improves the accuracy of the vertical axis label and therefore of the statistical data determined from it.

More specifically, when the coordinate axes determined in step 103 include an ordinate axis: if only one ordinate axis is determined in step 103, the text on the left side of the ordinate axis whose values form an arithmetic progression and which is aligned in the same manner is used as the initial vertical axis label; if two ordinate axes are determined in step 103, the ordinate axis on the left side of the target image is taken as the left ordinate axis and the one on the right side as the right ordinate axis, the text on the left side of the left ordinate axis whose values form an arithmetic progression and which is aligned in the same manner is used as the initial vertical axis label of the left ordinate axis, and the corresponding text on the right side of the right ordinate axis is used as the initial vertical axis label of the right ordinate axis. Whether an initial vertical axis label is correct is then checked by whether the symmetry axis of the guide line and/or scale point coincides with the symmetry axis of the corresponding initial vertical axis label, and only an initial vertical axis label that passes this check is used as the vertical axis label.

It should be noted that there should be at most one group of vertical axis labels on the left side and at most one group on the right side; if multiple groups are found on one side, this is treated as an error and it is considered that no group was found.
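The equal-difference screening and the "at most one group per side" rule can be illustrated with a short sketch. The helper names (`parse_number`, `is_arithmetic`, `initial_vertical_labels`, `pick_single_group`), the `(text, y_center)` box format, and the tolerance are assumptions made for illustration; the alignment test and the later check against scale information are not shown.

```python
def parse_number(text):
    """Convert label text such as '1,000' or '5%' to a float, else None."""
    try:
        return float(text.replace(",", "").rstrip("%"))
    except ValueError:
        return None


def is_arithmetic(values, rel_tol=1e-6):
    """True when consecutive differences are equal and non-zero."""
    if len(values) < 3:
        return False
    diffs = [b - a for a, b in zip(values, values[1:])]
    step = diffs[0]
    if step == 0:
        return False
    return all(abs(d - step) <= rel_tol * abs(step) for d in diffs)


def initial_vertical_labels(column):
    """column: list of (text, y_center) boxes that share one alignment.

    Returns the boxes ordered top to bottom when their values form an
    arithmetic progression, otherwise None.
    """
    ordered = sorted(column, key=lambda box: box[1])
    values = [parse_number(text) for text, _ in ordered]
    if any(value is None for value in values):
        return None
    return ordered if is_arithmetic(values) else None


def pick_single_group(candidate_columns):
    """Apply the rule that one side holds at most one label group."""
    valid = [g for g in map(initial_vertical_labels, candidate_columns) if g]
    return valid[0] if len(valid) == 1 else None  # several groups: treat as none


left = [("300", 20), ("200", 60), ("100", 100), ("0", 140)]
group = pick_single_group([left, [("2018", 20), ("Q1", 60), ("Q2", 100)]])
```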

It should be noted that, in the present embodiment, in the process of determining the coordinate axis labels (including the horizontal axis label and the vertical axis label), once the initial labels are determined, further screening is performed using the scale information in order to remove labels that do not satisfy the condition from the initial labels and thus obtain more accurate labels. In other embodiments, when the screening is unsuccessful, the initial labels may be used directly, or the labels that do not correspond to a vertical or horizontal line may be culled.

Step 105, determining statistical data represented by each statistical graph according to the key point position information, the coordinate axis, the scale information and the coordinate axis label, and generating structural data.

Specifically, legend information is determined from the target image; the correspondence between the coordinate axis labels and the statistical graphs is determined according to the legend information; the data value represented by a single pixel is determined according to the scale information and the coordinate axis labels; and the statistical data and the structural data are generated according to the key point position information, the coordinate axes, the correspondence, and the data value represented by a single pixel. The legend information includes legend colors and legend texts, and the structural data may be a statistical table as shown in fig. 4. For example, the actual data corresponding to one unit of position is calculated from the distance between two adjacent pieces of scale information, or from the distance between the centers of adjacent vertical axis labels, for instance one pixel corresponds to 100 ten-thousand yuan (1 million yuan); the statistical data of a statistical graph is then calculated from the key point position information and this per-unit value. If the key point position information shows that a rectangle lies below the abscissa axis with a length of 8 pixels, and one pixel in the layer where the bar graph is located corresponds to 100 ten-thousand yuan, the corresponding statistical data is -8 × 100 = -800 ten-thousand yuan. If an end point of a polyline lies above the abscissa axis at a vertical distance of 7 pixels, and one pixel in the layer where the line graph is located corresponds to 5%, the corresponding statistical data is 7 × 5% = 35%.
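The pixel-to-value arithmetic in this example can be written out as a minimal sketch. The function names and the `(value, y_pixel)` label format are hypothetical, and the sketch assumes an image coordinate system whose y values increase downward, so that points above the abscissa axis have smaller y values.

```python
def value_per_pixel(label_a, label_b):
    """Data value represented by one pixel.

    Each label is (value, y_pixel): the numeric value of a vertical axis
    label (or scale) and the vertical pixel position it is aligned with.
    """
    (value_a, y_a), (value_b, y_b) = label_a, label_b
    if y_a == y_b:
        raise ValueError("labels must lie at different vertical positions")
    return abs(value_a - value_b) / abs(y_a - y_b)


def data_value(key_point_y, axis_y, unit):
    """Signed value of a key point relative to the abscissa axis.

    y grows downward (assumption), so a key point below the axis
    (larger y) yields a negative value.
    """
    return (axis_y - key_point_y) * unit


# Bar graph layer: labels 0 and 1000 are 10 pixels apart, so one pixel
# corresponds to 100 (in units of ten-thousand yuan).
bar_unit = value_per_pixel((0, 200), (1000, 190))
bar_value = data_value(key_point_y=208, axis_y=200, unit=bar_unit)

# Line graph layer: labels 0% and 50% are 10 pixels apart, so one pixel
# corresponds to 5 percentage points.
line_unit = value_per_pixel((0, 200), (50, 190))
line_value = data_value(key_point_y=193, axis_y=200, unit=line_unit)
```

With these hypothetical positions, `bar_value` is -800.0 (ten-thousand yuan) and `line_value` is 35.0 (percent), matching the figures in the text.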

In one example, to further improve the accuracy of the extracted statistical data, the statistical data obtained by performing step 105 and the data recognized by a model may be considered together. The recognized data refers to data text that a character recognition model extracts from the target image near a statistical graph and whose horizontal midpoint is aligned with that statistical graph, for example aligned with the horizontal midpoint of the corresponding rectangle in a bar graph or with a line segment end point in a line graph. The error between the recognized data and the statistical data obtained by performing step 105 is then checked: if the error is within a certain range, the recognized data is considered correct and is used as the final statistical data; otherwise, the data obtained by performing step 105 is used as the final statistical data.

It should be noted that the calculation in step 105 depends on the key point position information and the data value represented by a single pixel, and small errors are inevitable when obtaining both, so the accuracy of the data obtained by performing step 105 can only approach 100%. Data recognized from the target image by the character recognition model, by contrast, consists of the data items actually printed in the target image, so when the recognition is correct its accuracy can reach 100%. In practice, however, not every target image marks the data items on its statistical graphs, in which case model-recognized data is unavailable; recognition is also easily disturbed by other text, so not every data item in the target image can be recognized accurately, and when the recognition is wrong the resulting data can contain large errors. This method therefore considers both sources together: the accuracy of the model-recognized data is checked against several set rules, the model-recognized data is used as the final data only when it is found to be accurate, and the data obtained by performing step 105 is used as the final data when the recognition is wrong or model-recognized data is unavailable, which further improves the overall accuracy and precision of the final statistical data.
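The choice between the calculated value and the recognized value might be sketched as below. The text only says the error must be "within a certain range"; the relative-error form of the check, the 5% threshold, and the function name `final_value` are assumptions made for illustration.

```python
def final_value(computed, recognized, rel_tolerance=0.05):
    """Choose the final statistical value for one statistical graph.

    computed: value calculated from key points and the per-pixel data value.
    recognized: value read by character recognition near the graph, or None
    when the image carries no data item for this graph.
    rel_tolerance: hypothetical threshold on the relative error.
    """
    if recognized is None:
        return computed
    scale = max(abs(computed), abs(recognized), 1e-12)
    if abs(recognized - computed) / scale <= rel_tolerance:
        # The printed value agrees with the geometry: trust the exact text.
        return recognized
    # Recognition probably picked up unrelated text: keep the computed value.
    return computed


accepted = final_value(-798.0, -800.0)   # close agreement, recognized value kept
rejected = final_value(35.2, 2018.0)     # e.g. a year mistaken for a data item
missing = final_value(35.2, None)        # no data item printed on the chart
```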

It should be further noted that the structural data includes, in addition to the statistical data, information describing the statistical data, such as horizontal axis labels and legend text. To obtain accurate statistical data when generating the structural data, when the statistical graph types include a bar graph, it is determined whether the midpoint of each rectangle in the bar graph is aligned in the vertical direction with the midpoint of the corresponding horizontal axis label, and when the statistical graph types include a line graph, it is determined whether each line segment end point in the line graph is aligned in the vertical direction with the midpoint of the corresponding horizontal axis label. This provides a further check on the coordinate axis labels.

In one example, determining the legend information of the statistical chart from the target image specifically includes: searching the areas above and below the statistical graph for legend color blocks according to the colors of the statistical graphs in the target image, and then determining the text information adjacent to the left or right side of each legend color block as the legend text. More specifically, the colors of the statistical graphs of the same statistical graph type are extracted, and colors that are the same or similar within an error range are divided into one group. A final color representation is extracted for each color group, for example by averaging the color values of the colors in the group and using the average as the final color representation. A color block whose color is similar to the final color representation is then searched for above or below the statistical graph in the target image. Because the legend text may be located on the left or the right side of a legend color block, text is searched for in the vicinity of each legend color block: if text exists within a short range to the left of every legend color block, the text information on the left side is determined to be the legend text; if text exists within a short range to the right of every legend color block, the text information on the right side is determined to be the legend text.
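The color grouping and averaging step can be sketched as follows. The Euclidean distance in RGB space, the `max_distance` threshold, and the function names are assumptions for illustration; the text does not specify how color similarity is measured.

```python
import math


def mean_color(colors):
    """Channel-wise average of a list of (r, g, b) colors."""
    return tuple(sum(channel) / len(colors) for channel in zip(*colors))


def group_colors(colors, max_distance=30.0):
    """Group similar colors and return one representative per group.

    A color joins the first group whose current mean lies within
    max_distance (hypothetical threshold); otherwise it starts a new group.
    """
    groups = []
    for color in colors:
        for group in groups:
            if math.dist(mean_color(group), color) <= max_distance:
                group.append(color)
                break
        else:
            groups.append([color])
    return [mean_color(group) for group in groups]


# Colors sampled from the bars of one bar graph layer (made-up values):
# two slightly different reds and two slightly different blues.
sampled = [(200, 30, 30), (20, 20, 220), (204, 34, 26), (24, 16, 216)]
representatives = group_colors(sampled)
```

Each representative can then be compared against candidate color blocks above or below the statistical graph with the same distance measure.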

In one example, determining the correspondence between the coordinate axis labels and the statistical graphs according to the legend information includes: detecting whether the legend text carries indication information for determining the corresponding coordinate axis label; if it does, determining the correspondence according to the indication information; and if it does not, determining the correspondence according to the position of the legend text and/or the meaning of the legend text. More specifically, the indication information includes left/right axis text, unit information, and the like. When the legend information contains left/right axis text, such as "right axis" in the legend text "total amount of investment (right axis)", the legend corresponds to the label information of the right axis; the legend color then determines that the legend corresponds to, for example, the layer of the line graph, and these pieces of information are grouped together. When the legend information contains a unit, such as "billion dollars" in a legend text like "global sales (billion dollars)", the unit information is searched for in the text recognized from the original image; if the unit text is found near the left axis, the legend corresponds to the left axis label information, the legend color determines that the legend corresponds to, for example, the layer of the bar graph, and these pieces of information are grouped together. Considering these conditions and handling each one specifically ensures the accuracy of the correspondence as far as possible and improves the efficiency of determining it.
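A minimal sketch of reading the indication information from legend text is given below. The function name `axis_for_legend`, the literal phrases it looks for, and the idea of passing the unit text found near each axis as parameters are assumptions for illustration; the fallback based on legend position or meaning is not covered.

```python
def axis_for_legend(legend_text, left_axis_unit=None, right_axis_unit=None):
    """Return 'left', 'right', or None for one legend entry.

    left_axis_unit / right_axis_unit: unit text recognized near the left or
    right ordinate axis, if any (hypothetical inputs).
    """
    lowered = legend_text.lower()
    # Explicit axis indication inside the legend text.
    if "right axis" in lowered:
        return "right"
    if "left axis" in lowered:
        return "left"
    # Unit indication: match the legend's unit against the unit near an axis.
    if left_axis_unit and left_axis_unit.lower() in lowered:
        return "left"
    if right_axis_unit and right_axis_unit.lower() in lowered:
        return "right"
    return None  # no indication information: fall back to position or meaning


by_axis_text = axis_for_legend("Total amount of investment (right axis)")
by_unit = axis_for_legend("Global sales (billion dollars)",
                          left_axis_unit="billion dollars")
no_hint = axis_for_legend("Growth rate")
```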

It should be noted that if the target image includes a line graph and some line segments in the line graph are not accurately identified in step 102, the bar graph or the horizontal axis labels may be used to assist the determination. If the target image also includes a bar graph, the break points of the line graph may be considered to lie on the vertical symmetry axis of each bar. Otherwise, a point having the same abscissa as the midpoint (in the horizontal axis direction) of each horizontal axis label text may be used as a break point.

As shown in fig. 5, an embodiment of the present invention relates to a data extraction method of a statistical graph, which is different from the embodiment shown in fig. 1 in that some steps are further refined, including:

Step 501, using a semantic segmentation model to perform layer separation on a target image containing a statistical graph according to the type of the statistical graph, acquiring a plurality of layers and determining the type of the statistical graph corresponding to each layer.

Step 501 in this embodiment is substantially the same as step 101 in the previous embodiment, and therefore, the description thereof is not repeated here.

Step 502, obtaining the position information of the key points of the statistical graph in the graph layer.

Step 502 in this embodiment is substantially the same as step 102 in the previous embodiment, and therefore, the description thereof is not repeated here.

Step 503, determining coordinate axes and scale information of the statistical chart from the target image by using preset screening conditions.

Determining the coordinate axis includes the following two cases:

The first is determining the abscissa axis. Specifically, the only horizontal line in the target image that satisfies a preset first length condition is taken as the abscissa axis, or, when the statistical chart includes a bar graph, the abscissa axis is determined according to the positions of the rectangles in the bar graph.

More specifically, horizontal straight line detection is performed on the target image, and the detected lines are screened using the first preset length condition, for example a length greater than 0.7 times the length of the picture. If exactly one horizontal straight line remains after screening, it is considered to be the abscissa axis. If more than one remains, it is detected whether the statistical chart in the target image contains a bar graph; if it does, then because every rectangle in a bar graph necessarily has one side coinciding with the abscissa axis, the horizontal straight line that coincides with one side of every rectangle in the bar graph is the line on which the abscissa axis lies. Alternatively, when the statistical chart contains a bar graph, the abscissa axis is determined directly from the positions of the rectangles in the bar graph.
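The screening just described might look like the sketch below. The data formats for lines and rectangles, the pixel tolerance `tol`, and the use of the image width as "the length of the picture" are assumptions; the alternative of deriving the axis directly from the rectangle positions is not shown.

```python
def find_abscissa_axis(horizontal_lines, image_width, bar_rects=(),
                       min_ratio=0.7, tol=2):
    """Return the y position of the abscissa axis, or None if undecided.

    horizontal_lines: list of (x_start, x_end, y) detected horizontal lines.
    bar_rects: list of (x, y_top, width, height) rectangles of the bar graph.
    min_ratio: first preset length condition (0.7 as in the example above).
    tol: hypothetical pixel tolerance for "coincides with a side".
    """
    long_lines = [line for line in horizontal_lines
                  if (line[1] - line[0]) > min_ratio * image_width]
    if len(long_lines) == 1:
        return long_lines[0][2]
    if long_lines and bar_rects:
        # Every bar has one side on the abscissa axis: its bottom side for
        # bars above the axis, its top side for bars below the axis.
        for _, _, y in long_lines:
            if all(abs(y - top) <= tol or abs(y - (top + height)) <= tol
                   for _, top, _, height in bar_rects):
                return y
    return None


lines = [(10, 390, 100), (10, 390, 200), (10, 390, 300)]  # guide lines and axis
bars = [(50, 150, 20, 150),   # bar above the axis: bottom side at y = 300
        (100, 300, 20, 40)]   # bar below the axis: top side at y = 300
axis_y = find_abscissa_axis(lines, image_width=400, bar_rects=bars)
```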

The second is determining the ordinate axis. Specifically, the only one or only two vertical lines in the target image that satisfy a preset second length condition are taken as the ordinate axes, or the ordinate axis is determined using the position of the statistical graph.

More specifically, vertical straight line detection is performed on the target image, and the detected lines are screened using the preset second length condition. A statistical chart may have only one ordinate axis, or two ordinate axes located on the two sides of the statistical graphs; therefore, if the screening leaves one vertical line, or two vertical lines distributed on the two sides of the statistical graphs, these vertical lines are considered to be the ordinate axes. Alternatively, the ordinate axes are searched for on the two sides of the statistical graphs. This process of determining the coordinate axes takes into account the various forms a statistical chart in the target image may have and provides more than one way to determine the abscissa and ordinate axes, so the method can extract data from statistical charts in different scenarios and is more flexible and practical.

It should be noted that a statistical chart does not necessarily have an ordinate axis, so performing the above method does not necessarily identify one. However, when no ordinate axis exists in the target image, guide lines are generally provided so that the statistical data represented by the statistical graphs can be read visually.

The scale information comprises first scales and second scales, and determining the scale information specifically comprises: determining initial scales from the target image and recording the distance between every two adjacent initial scales as a first distance; dividing equidistant initial scales into groups according to the first distance to obtain a plurality of scale groups; taking the initial scales in the scale group containing the largest number of initial scales as the first scales, and taking the distance between two adjacent initial scales in that group as the second distance; determining the second scales according to the first distance and the second distance; and taking the first scales and the second scales as the final scale information. Determining the initial scales from the target image covers the following two cases:

First, the scale information is scale points. More specifically, the target image is binarized, the statistical graphs in the target image are filled with the background color, and line segments in the target image that are within a preset threshold and perpendicular to a coordinate axis are identified as initial scales.

Second, the scale information is guide lines. More specifically, the horizontal lines in the target image are identified and taken as the initial scales.

It should be noted that determining the second scales according to the first distance and the second distance includes: taking the nearest initial scale outward from either side of the scale information as the scale to be processed; obtaining, from the first distances, the distance between the scale to be processed and the scale information as a third distance; and judging from the third distance whether the scale to be processed is a second scale, and determining the corresponding second scales. This judgment includes: if the third distance is approximately N times the second distance, where N is a positive integer, judging that second scales exist and adding N equidistant second scales, spaced (third distance/N) apart, on the outer side of the scale information; and if the third distance is not approximately N times the second distance, judging that the scale to be processed is not a second scale and taking the next closest initial scale as the scale to be processed.

That is, the initial scale closest to the scale information on either side of the scale information is taken as the scale to be processed, and the distance between it and the scale information is taken as the third distance. If the third distance is approximately N times the second distance, N-1 equidistant scales spaced (third distance/N) apart are added between the scale to be processed and the scale information (the first scales and any second scales already determined), and the newly added scales together with the scale to be processed are taken as second scales. If the third distance is not approximately N times the second distance, the next initial scale beyond the scale to be processed is taken as the scale to be processed, and the judgment is repeated until the last initial scale on that side is reached. After all initial scales on one side have been examined, the search turns to the other side and continues until all initial scales there have been examined.

For example, taking the determination of guide lines as an example, as shown in fig. 6, the horizontal straight lines in the target image are numbered 1 to 9 from top to bottom on the left side, and the number of pixels between each two adjacent detected lines is marked on the right side as their distance. Line 7 cannot be identified and is shown as a dotted line, so line 6 and line 8 are adjacent detected lines with a distance of 12.1. Because the absolute difference between the distance values 3.1 and 3.2 is within the error range, 3.1 and 3.2 can be considered approximately equal, that is, lines 1, 2 and 3 are equally spaced; similarly, lines 3, 4, 5 and 6 are equally spaced. Since lines 3 to 6 form the largest group of equally spaced lines, lines 3, 4, 5 and 6 are taken as the first scales, and the average of their spacings, 6.05, is the second distance. The search then proceeds upward from line 3, with line 2 as the scale to be processed. The distance between line 2 and line 3 is 3.2, which is not approximately an integer multiple of the second distance, so line 2 does not meet the condition, and line 1 becomes the scale to be processed. Because the distance between line 2 and line 3 is 3.2 and the distance between line 1 and line 2 is 3.1, the distance between line 1 and line 3, that is, the third distance, is 6.3, which is approximately 1 times the second distance, so line 1 meets the condition and is kept as a second scale. No further initial scales exist above line 1, so the upward search ends and the downward search begins. Starting from line 6, line 8 is the scale to be processed; the distance between line 6 and line 8 is 12.1, approximately 2 times the second distance, so line 8 is considered a guide line and an unrecognized guide line must exist between line 6 and line 8. Line 8 is therefore kept as a second scale, and a horizontal straight line, namely the unrecognized line 7, is added between line 6 and line 8 as another second scale, at a distance from each of them equal to 1/2 of the distance between line 6 and line 8. Continuing downward, line 9 is found in the same way not to satisfy the condition and is an interference line, and the guide lines shown in fig. 7 are finally determined. The guide lines obtained in this way help the subsequent check of the determined vertical axis labels: the vertical axis labels can still be checked when there is no ordinate axis, when there are no scale points on the ordinate axis, or when the scale points on the ordinate axis cannot be determined. The various kinds of information in the statistical chart are thus fully used, which improves the reliability and accuracy of the statistical data extracted from them.
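The worked example of fig. 6 can be reproduced with a short sketch. The line positions below are made up so that the spacings match the distances quoted in the text (3.1, 3.2, an average of 6.05, and 12.1); the spacing between line 8 and line 9 is not given in the text and is assumed to be 4.0. The function name `find_scales` and both tolerances are likewise assumptions, not values from the embodiment.

```python
def find_scales(positions, gap_tol=0.2, multiple_tol=0.1):
    """Split detected line positions into first and second scales.

    gap_tol: absolute tolerance for treating two adjacent gaps as equal.
    multiple_tol: tolerance, as a fraction of the second distance, for
    "approximately N times the second distance".
    Returns (first_scales, second_scales); second scales include both kept
    initial scales and interpolated (previously unrecognized) ones.
    """
    ys = sorted(positions)
    if len(ys) < 2:
        raise ValueError("need at least two initial scales")
    gaps = [b - a for a, b in zip(ys, ys[1:])]  # first distances

    # Find the longest run of consecutive, approximately equal gaps.
    best_start, best_len, start = 0, 0, 0
    for i in range(1, len(gaps) + 1):
        if i == len(gaps) or abs(gaps[i] - gaps[i - 1]) > gap_tol:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = i
    first = ys[best_start:best_start + best_len + 1]
    pitch = (first[-1] - first[0]) / best_len  # second distance

    def extend(anchor, candidates):
        """Walk outward from the outermost accepted scale."""
        found = []
        for candidate in candidates:  # nearest candidate first
            third = abs(candidate - anchor)  # third distance
            n = round(third / pitch)
            if n >= 1 and abs(third - n * pitch) <= multiple_tol * pitch:
                step = (candidate - anchor) / n
                # n - 1 interpolated scales plus the candidate itself.
                found.extend([anchor + step * k for k in range(1, n + 1)])
                anchor = candidate
        return found

    above = extend(first[0], list(reversed(ys[:best_start])))
    below = extend(first[-1], ys[best_start + best_len + 1:])
    return first, sorted(above + below)


# Lines 1-6, 8 and 9 of the example (line 7 was not detected).
detected = [10.0, 13.1, 16.3, 22.3, 28.4, 34.45, 46.55, 50.55]
first_scales, second_scales = find_scales(detected)
```

With these inputs, lines 3 to 6 become the first scales, line 1 and line 8 are kept as second scales, a scale is interpolated halfway between line 6 and line 8 for the missing line 7, and lines 2 and 9 are discarded.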

It should be noted that the statistical chart in the target image may take several forms, for example containing an abscissa axis and guide lines, or an abscissa axis and scale marks. This embodiment performs the steps for determining each kind of information, acquires whatever information is available, and extracts data using it. In other embodiments, step 503 may determine the coordinate axes and scale information according to the specific situation; for example, when the statistical chart contains an abscissa axis and guide lines, only the steps of determining the coordinate axis and the guide lines are performed.

It should also be noted that, in the case of having both an ordinate axis and a guideline, after determining the guideline and the scale points, the guideline and scale points can also be checked against each other to determine whether the resulting guideline and scale points are correct.

Step 504, determining coordinate axis labels from the text information identified by the target image by using the model, by using a preset coordinate axis label screening condition.

Step 504 in this embodiment is substantially the same as step 104 in the previous embodiment, and thus, is not described herein again.

Step 505, determining statistical data represented by each statistical graph according to the key point position information, the coordinate axis, the scale information and the coordinate axis label, and generating structural data.

An embodiment of the present invention further provides an electronic device, as shown in fig. 8, including:

at least one processor 801; and

a memory 802 communicatively coupled to the at least one processor 801; wherein,

the memory 802 stores instructions executable by the at least one processor 801 to enable the at least one processor 801 to perform the method for extracting data from a statistical map as described in the above embodiments.

Where the memory and processor are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting together one or more of the various circuits of the processor and the memory. The bus may also connect various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor is transmitted over a wireless medium via an antenna, which further receives the data and transmits the data to the processor.

The processor is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And the memory may be used to store data used by the processor in performing operations.

Embodiments of the present invention also provide a computer-readable storage medium storing a computer program. The computer program realizes the above-described method embodiments when executed by a processor.

That is, as those skilled in the art will understand, all or part of the steps of the methods in the above embodiments may be implemented by a program instructing the relevant hardware. The program is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the invention, and that, in practical applications, various changes in form and detail may be made to them without departing from the spirit and scope of the invention.

Claims

1. A data extraction method for a statistical chart, characterized by comprising the following steps:

performing layer separation on a target image containing statistical graphs according to the type of the statistical graphs by using a semantic segmentation model, acquiring a plurality of layers and determining the type of the statistical graph corresponding to each layer, wherein each layer is a binary image containing only the statistical graph;

acquiring the position information of the key points of the statistical graph in the layer;

determining coordinate axes and scale information in the target image by using preset screening conditions;

determining coordinate axis labels, by using a preset label screening condition, from the text information recognized from the target image by using the model;

and determining statistical data represented by each statistical graph according to the key point position information, the coordinate axis, the scale information and the coordinate axis label and generating structural data.

2. The method according to claim 1, wherein the type of the statistical graph is a histogram, the statistical graph is a rectangle, the key point position information is diagonal position information of the rectangle, and the acquiring of the key point position information of the statistical graph in the layer comprises:

detecting whether the rectangle in the layer is complete;

if at least one of the rectangles is incomplete, completing the incomplete rectangle;

detecting whether the rectangles in the layer are connected to one another;

if a plurality of rectangles are connected, dividing the connected rectangles;

and acquiring the diagonal position information of each rectangle in the layer.

3. The method of claim 1, wherein the determining coordinate axes in the target image comprises:

taking the only horizontal line segment in the target image that meets a preset first length condition as the abscissa axis, or, when the statistical chart comprises a histogram, determining the abscissa axis according to the positions of the rectangles in the histogram;

if the statistical chart has an ordinate axis, taking the only one or the only two vertical lines in the target image that meet a preset second length condition as the ordinate axis, or determining the ordinate axis by using the position of the histogram.
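
For illustration only (not part of the claim), the abscissa-axis selection of claim 3 might be sketched as follows. The claim does not state the first length condition or how the rectangle positions are used, so two assumptions are made here: the length condition is a minimum fraction of the image width (`min_ratio`), and the fallback takes the most common bottom edge of upward-pointing rectangles. Both function names are hypothetical.

```python
from collections import Counter


def find_abscissa_axis(h_segments, image_width, min_ratio=0.5):
    """Return the single horizontal segment (x_start, x_end, y) that is long enough.

    Returns None when zero or several segments qualify, for example when
    horizontal guide lines are as long as the axis itself.
    """
    long_enough = [s for s in h_segments if (s[1] - s[0]) >= min_ratio * image_width]
    return long_enough[0] if len(long_enough) == 1 else None


def axis_row_from_rectangles(rects):
    """Assumed fallback: the pixel row shared by most rectangle bottom edges.

    Each rectangle is given by its diagonal corners (x1, y1, x2, y2).
    """
    bottoms = Counter(max(y1, y2) for _, y1, _, y2 in rects)
    return bottoms.most_common(1)[0][0]


# One long segment and one short one: the long segment is the axis.
axis = find_abscissa_axis([(50, 450, 400), (60, 120, 100)], image_width=500)

# Two equally long segments (axis plus a guide line): no unique candidate,
# so the rectangle positions are used instead.
ambiguous = find_abscissa_axis([(50, 450, 400), (50, 450, 200)], image_width=500)
fallback_row = axis_row_from_rectangles([(60, 300, 100, 400), (120, 250, 160, 400)])
```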

4. The method of claim 1, wherein the scale information comprises a first scale and a second scale, and wherein determining the scale information for the target image comprises:

determining initial scales from the target image and recording the distance between every two adjacent initial scales as a first spacing;

grouping the initial scales that are equally spaced according to the first spacing, so as to obtain a plurality of scale groups;

taking the initial scales in the scale group containing the largest number of initial scales as the first scales, and taking the distance between two adjacent first scales as a second spacing;

and determining the second scale according to the first spacing and the second spacing.
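
For illustration only (not part of the claim), the grouping in claim 4 might be sketched as follows, assuming the initial scales are given as pixel positions along one axis and that "equally spaced" is tested within a small tolerance `tol`. The function name `group_scales` and the tolerance are hypothetical. The rule that derives the second scale from the two spacings is not stated in the claim, so it is left out of the sketch.

```python
def group_scales(positions, tol=1.0):
    """Split scale positions into runs whose adjacent spacing is equal within tol."""
    pts = sorted(positions)
    if len(pts) < 2:
        return [pts] if pts else []
    groups = [[pts[0], pts[1]]]
    for prev, cur in zip(pts[1:], pts[2:]):
        spacing = groups[-1][1] - groups[-1][0]
        if abs((cur - prev) - spacing) <= tol:
            groups[-1].append(cur)
        else:
            # The spacing changed: start a new run. The boundary scale `prev`
            # belongs to both the old run and the new one.
            groups.append([prev, cur])
    return groups


# Hypothetical initial scales: four at a spacing of 50 px, then three at 20 px.
groups = group_scales([10, 60, 110, 160, 180, 200])

# The largest group supplies the first scales and the second spacing.
first_scales = max(groups, key=len)
second_spacing = first_scales[1] - first_scales[0]
```

Picking the largest group is a simple majority vote: detection noise such as stray short lines tends to produce small, irregular groups, while genuine scales form the longest equally spaced run.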

5. The method of claim 4, wherein the scale information is a guide line, and the determining of an initial scale from the target image comprises:

identifying horizontal lines in the target image;

and taking the horizontal line as the initial scale.

6. The method of claim 4, wherein the scale information is a scale point, and wherein determining an initial scale from the target image comprises:

carrying out binarization on the target image and filling a statistical graph in the target image into a background color;

and identifying, as the initial scale, a line segment in the target image that is within a preset threshold and is perpendicular to the coordinate axis.

7. The method of claim 1, wherein the coordinate axis labels comprise vertical axis labels, and the determining of the vertical axis labels, by using a preset label screening condition, from the text information recognized from the target image by using the model comprises:

taking, as initial vertical axis labels, one or two groups of text information that are located on the two sides of the statistical chart, have equal differences between adjacent values, and are aligned with each other; or, when the determined coordinate axes comprise the ordinate axis, taking, as the initial vertical axis labels, one or two groups of text information that are located on the outer side of the ordinate axis, have equal differences between adjacent values, and are aligned with each other;

and taking, as the vertical axis labels, the initial vertical axis labels whose midpoint in the vertical direction lies on the same horizontal line as the corresponding scale information.
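
For illustration only (not part of the claim), the two screening conditions of claim 7 might be sketched as follows, assuming each recognized text item has already been parsed into a number with the top and bottom pixel rows of its bounding box, and that "on the same horizontal line" is tested within a tolerance of a few pixels. The names `is_equal_difference` and `match_labels_to_scales` and both tolerances are hypothetical.

```python
def is_equal_difference(values, tol=1e-6):
    """True when consecutive values differ by a constant step (at least 3 values)."""
    if len(values) < 3:
        return False
    step = values[1] - values[0]
    return all(abs((b - a) - step) <= tol for a, b in zip(values, values[1:]))


def match_labels_to_scales(labels, scale_rows, tol=3):
    """Keep labels whose vertical midpoint lies on a scale row, within tol pixels.

    labels: list of (value, top_row, bottom_row); returns {scale_row: value}.
    """
    matched = {}
    if not scale_rows:
        return matched
    for value, top, bottom in labels:
        mid = (top + bottom) / 2
        nearest = min(scale_rows, key=lambda row: abs(row - mid))
        if abs(nearest - mid) <= tol:
            matched[nearest] = value
    return matched


# Hypothetical text boxes to the left of the chart, already aligned in a column.
labels = [(0.0, 392, 408), (50.0, 292, 308), (100.0, 190, 206)]
values = [value for value, _, _ in labels]
candidate_ok = is_equal_difference(values)
matched = match_labels_to_scales(labels, scale_rows=[400, 300, 200])
```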

8. The method of claim 1, wherein the determining of the statistical data represented by each statistical graph according to the key point position information, the coordinate axis, the scale information and the coordinate axis label, and the generating of the structural data, comprise:

determining legend information from the target image;

determining the corresponding relation between the coordinate axis label and the statistical graph according to the legend information;

determining a data value represented by a single pixel point according to the scale information and the coordinate axis label;

and obtaining the statistical data according to the key point position information, the coordinate axis, the corresponding relation and the data value represented by the single pixel point and generating the structural data.

9. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor, the instructions enabling the at least one processor to perform the data extraction method of a statistical chart as claimed in any one of claims 1 to 8.

10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the data extraction method of a statistical chart as claimed in any one of claims 1 to 8.