CN115331013A - Data extraction method and processing equipment for line graph


Info

Publication number
CN115331013A
CN115331013A (application number CN202211264165.0A; granted publication CN115331013B)
Authority
CN
China
Prior art keywords
legend
area
line
identification
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211264165.0A
Other languages
Chinese (zh)
Other versions
CN115331013B
Inventor
孙勇
顾文斌
杨祎聪
李晓平
丁雪纯
于业达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Hengsheng Juyuan Data Service Co ltd
Hangzhou Hengsheng Juyuan Information Technology Co ltd
Original Assignee
Shanghai Hengsheng Juyuan Data Service Co ltd
Hangzhou Hengsheng Juyuan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Hengsheng Juyuan Data Service Co ltd, Hangzhou Hengsheng Juyuan Information Technology Co ltd filed Critical Shanghai Hengsheng Juyuan Data Service Co ltd
Priority to CN202211264165.0A priority Critical patent/CN115331013B/en
Publication of CN115331013A publication Critical patent/CN115331013A/en
Application granted granted Critical
Publication of CN115331013B publication Critical patent/CN115331013B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 10/26: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06T 7/11: Image analysis; segmentation; region-based segmentation
    • G06V 10/56: Extraction of image or video features relating to colour
    • G06V 10/74: Image or video pattern matching; proximity measures in feature spaces
    • G06V 10/764: Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 30/153: Character recognition; segmentation of character regions using recognition of characters or words

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Generation (AREA)

Abstract

The embodiments of the present application provide a data extraction method and processing device for line graphs, relating to the field of image pattern recognition. A legend detection model detects the line graph to be extracted and determines its legend area. Based on the position information of the blank areas within the legend area, the legend area is segmented into at least one legend identification line area and at least one legend identification name area. Each legend identification line area and each legend identification name area is then recognized separately, yielding the feature information of each legend identification line and the content information of each legend identification name. Finally, data corresponding to the feature information is extracted from the data area of the line graph according to the feature information of each legend identification line and the content information of each legend identification name, producing at least one target data value corresponding to the content information. This legend segmentation approach improves legend recognition and, in turn, the accuracy of line graph data extraction.

Description

Data extraction method and processing equipment for line graph
Technical Field
The application relates to the field of image pattern recognition, in particular to a data extraction method and processing equipment of a line graph.
Background
A line graph is a statistical chart that describes the dynamics of aggregate statistical indicators, the interdependencies between study objects, and the distribution of component parts. In the financial field, a large amount of data is presented as line graphs; this data is significant for financial services and financial analysis and needs to be extracted from the graphs. However, currently available line graphs are generally stored in image form, which makes subsequent data extraction and analysis more difficult.
At present, legend data analysis of line graphs mainly applies Optical Character Recognition (OCR) to recognize the legend area as a whole, obtaining the names of the legends; a colored region is then searched forward from each legend name and taken as the line corresponding to that text, and multiple data values corresponding to each legend name are determined in the line graph by matching the color of the obtained legend identification line against the colors in the data area.
However, recognizing the legend area as a whole easily misidentifies the identification lines in the legend as characters, leading to errors in the legend recognition result.
Disclosure of Invention
The application provides a data extraction method and processing device for line graphs that divides the legend area into legend identification line areas and legend identification name areas and recognizes each separately. This avoids the legend recognition errors caused by legend identification lines being recognized as characters when the legend area is recognized as a whole, and improves the accuracy of legend recognition.
The embodiment of the application can be realized as follows:
in a first aspect, an embodiment of the present application provides a method for extracting data of a line graph, including:
detecting a line graph to be extracted by using a legend detection model, and determining a legend area of the line graph to be extracted;
according to the position information of the blank area in the legend area, carrying out segmentation processing on the legend area to obtain at least one legend identification line area and at least one legend identification name area;
respectively identifying each legend identification line area and each legend identification name area to obtain the characteristic information of each legend identification line and the content information of each legend identification name;
and extracting data corresponding to the characteristic information in the data area of the line graph to be extracted according to the characteristic information of each legend identification line and the content information of each legend identification name, and obtaining at least one target data value corresponding to the content information.
In an optional implementation manner, the dividing the legend area to obtain at least one legend identification line area and at least one legend identification name area includes:
obtaining a horizontal arrangement graph corresponding to the legend area;
traversing and identifying the pixel change value in the horizontal arrangement graph;
when the pixel point change value of the target position meets the blank area identification condition, determining the target position as the position information of the blank area;
dividing the horizontal arrangement area according to the position information of the blank areas to determine a plurality of content areas;
and analyzing each content area to determine all the legend identification line areas and the legend identification name areas.
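The blank-area segmentation steps above can be sketched roughly as follows. This is an illustrative Python/NumPy sketch, not the patented implementation: the function name, the binarized input, and the `blank_width` threshold are all assumptions.

```python
import numpy as np

def split_legend_row(binary_row, blank_width=3):
    """Split a horizontally arranged legend strip into content regions.

    binary_row: 2-D array, nonzero where ink (legend line or text) is present.
    A run of columns whose vertical projection is zero for at least
    `blank_width` pixels is treated as a blank area separating regions.
    """
    projection = binary_row.sum(axis=0)          # vertical projection per column
    blank = projection == 0
    regions, start = [], None
    for x, is_blank in enumerate(blank):
        if not is_blank and start is None:
            start = x                            # a content region begins here
        elif is_blank and start is not None:
            run_end = x                          # measure the blank run's width
            while run_end < len(blank) and blank[run_end]:
                run_end += 1
            # close the region only if the blank run is wide enough
            if run_end - x >= blank_width or run_end == len(blank):
                regions.append((start, x))
                start = None
    if start is not None:
        regions.append((start, len(blank)))
    return regions
```

Narrow gaps shorter than `blank_width` (such as the spacing between characters of one legend identification name) stay inside a single region, so each legend identification line and each legend identification name ends up as its own content region.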
In an alternative embodiment, the obtaining the horizontal arrangement pattern corresponding to the legend area includes:
determining the legend distribution type of the legend area according to the arrangement sequence of the sub-areas to be identified in the legend area;
and if the legend distribution type is a vertical distribution type or a table distribution type, horizontally dividing each sub-area to be identified according to the projection result of each sub-area to be identified in the horizontal direction to obtain a plurality of horizontal arrangement patterns.
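For vertically or table-distributed legends, the horizontal division by projection might look like the sketch below (illustrative only; the function name and the zero-projection split criterion are assumptions):

```python
import numpy as np

def split_rows(binary_legend):
    """Split a vertically (or table-) arranged legend area into row strips.

    Rows are separated wherever the horizontal projection (ink count per
    pixel row) drops to zero, mirroring the horizontal division step.
    """
    projection = binary_legend.sum(axis=1)   # ink count per pixel row
    rows, start = [], None
    for y, ink in enumerate(projection):
        if ink > 0 and start is None:
            start = y                        # a row of legend content begins
        elif ink == 0 and start is not None:
            rows.append((start, y))          # blank pixel row ends it
            start = None
    if start is not None:
        rows.append((start, len(projection)))
    return rows
```

Each returned strip is then a horizontal arrangement pattern that can be segmented by the blank-column step described earlier.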
In an optional implementation, the analyzing each content area to determine all the legend identification line areas and the legend identification name areas includes:
performing preliminary traversal on each content area, and dividing each content area into a plurality of initial legend identification line areas and a plurality of initial legend identification name areas;
if each initial legend identification line region satisfies the straight-line identification condition, taking the initial legend identification line regions as legend identification line regions, taking the initial legend identification name regions as legend identification name regions, and marking the type of each legend identification line region as the straight-line identification line region type;
if each initial legend identification line region satisfies the symbol identification condition, taking the initial legend identification line regions as legend identification line regions, taking the initial legend identification name regions as legend identification name regions, and marking the type of each legend identification line region as the symbol identification line region type.
In an alternative embodiment, the method further comprises:
and correcting each legend identification line region, and taking the legend identification line region meeting identification name recognition conditions as a legend identification name region.
In an optional implementation manner, the determining, when the pixel point change value of the target position satisfies the blank area identification condition, that the target position is the position information of the blank area includes:
performing vertical projection on the horizontally arranged pattern to obtain a vertical projection pattern;
performing projection calculation on each pixel point in the vertical projection graph to obtain a projection value and a projection variation value of each pixel point;
and traversing each pixel point in the horizontal arrangement graph according to the projection value and the projection variation value of each pixel point, and taking the position information of the area where a plurality of pixels meeting the blank area identification condition are located as the position information of the blank area.
In an optional implementation manner, the extracting, according to feature information of each legend identification line and content information of each legend identification name, data corresponding to the feature information in a data area of the line graph to be extracted to obtain at least one target data value corresponding to the content information includes:
identifying the line graph to be extracted by using a straight line detection model, and determining a data area and coordinate information of the line graph to be extracted;
and matching the characteristic information of the legend identification line with an area corresponding to the coordinate information in the data area to obtain at least one target data value corresponding to the content information of each legend identification name.
In an optional implementation manner, the matching the feature information of the legend identification line with an area corresponding to coordinate information in the data area to obtain at least one target data value corresponding to content information of each legend identification name includes:
if the characteristic information of the legend identification line is color characteristics, calculating color value distances between the color characteristics and the color characteristics of a plurality of pixel points in an area corresponding to each horizontal axis scale point in the coordinate information, and taking the pixel points of which the color value distances are smaller than a preset color threshold and meet a preset slope condition as corresponding data value points;
and determining a target data value corresponding to each data value point according to the relative position relation between each data value point and the coordinate information.
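A minimal sketch of the colour-matching step above, assuming an RGB image array and precomputed x-axis tick columns. All names, the Euclidean colour distance, and the threshold value are illustrative assumptions; the preset slope condition is mentioned in a comment but not implemented:

```python
import numpy as np

def match_color_points(image, legend_rgb, x_ticks, threshold=30.0):
    """For each x-axis tick column, find the pixel whose colour is closest
    to the legend line's colour, accepting it if the colour-value distance
    is below `threshold`. Returns a {x: y} mapping for matched columns.
    A fuller implementation would also apply the slope-continuity check
    between neighbouring data value points described in the method.
    """
    legend = np.asarray(legend_rgb, dtype=float)
    points = {}
    for x in x_ticks:
        column = image[:, x, :].astype(float)             # every pixel in this column
        dist = np.linalg.norm(column - legend, axis=1)    # colour-value distance
        y = int(dist.argmin())
        if dist[y] < threshold:
            points[x] = y
    return points

def pixel_to_value(y, y_axis_top, y_axis_bottom, v_top, v_bottom):
    """Map a matched pixel row back to a data value via axis calibration."""
    frac = (y_axis_bottom - y) / (y_axis_bottom - y_axis_top)
    return v_bottom + frac * (v_top - v_bottom)
```

The second helper illustrates the final step: converting each data value point to a target data value from its position relative to the coordinate information.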
In an optional implementation manner, the matching the feature information of the legend identification line with an area corresponding to the coordinate information in the data area to obtain at least one target data value corresponding to the content information of each legend identification name includes:
if the feature information of the legend identification line is a symbol feature, performing pattern matching between the symbol feature and the region corresponding to each horizontal-axis scale point in the coordinate information to obtain at least one candidate region;
taking the center point coordinates of the at least one candidate region as the corresponding data value points;
and determining a target data value corresponding to each data value point according to the relative position relation between each data value point and the coordinate information.
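The symbol pattern-matching step could look roughly like the following. This is a dependency-free sketch with assumed names; an exact-match sum-of-squared-differences criterion stands in for a real template-matching score such as the one cv2.matchTemplate would provide:

```python
import numpy as np

def match_symbol(region, template):
    """Slide the legend-symbol template over a candidate region and return
    the centre coordinates of every exact match (SSD == 0), corresponding
    to the data value points of a symbol-type legend identification line.
    """
    rh, rw = region.shape
    th, tw = template.shape
    centres = []
    for y in range(rh - th + 1):
        for x in range(rw - tw + 1):
            patch = region[y:y + th, x:x + tw]
            # exact match: sum of squared differences is zero
            if np.sum((patch.astype(int) - template.astype(int)) ** 2) == 0:
                centres.append((x + tw // 2, y + th // 2))
    return centres
```

Each returned centre point would then be converted to a target data value from its position relative to the coordinate information, exactly as in the colour-feature case.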
In a second aspect, an embodiment of the present application further provides a data extraction apparatus for a line graph, including:
the legend detection module is used for detecting the line graph to be extracted by using a legend detection model and determining the legend area of the line graph to be extracted;
the legend segmentation module is used for segmenting the legend area according to the position information of the blank area in the legend area to obtain at least one legend identification line area and at least one legend identification name area;
a legend identification module, configured to identify each legend identification line region and each legend identification name region respectively, to obtain feature information of each legend identification line and content information of each legend identification name;
and the data extraction module is used for extracting data corresponding to the characteristic information in the data area of the line graph to be extracted according to the characteristic information of each legend identification line and the content information of each legend identification name, and acquiring at least one target data value corresponding to the content information.
The legend segmentation module is specifically further used for obtaining a horizontal arrangement pattern corresponding to the legend area; traversing and identifying the pixel change value in the horizontal arrangement graph; when the pixel point variation value of the target position meets the blank area identification condition, determining the target position as the position information of the blank area; dividing the horizontal arrangement area according to the position information of the blank areas to determine a plurality of content areas; and analyzing each content area to determine all the legend identification line areas and the legend identification name areas.
The legend segmentation module is further specifically configured to determine a legend distribution type of the legend region according to an arrangement order of each sub-region to be identified in the legend region; and if the legend distribution type is a vertical distribution type or a table distribution type, horizontally dividing each sub-area to be identified according to the projection result of each sub-area to be identified in the horizontal direction to obtain a plurality of horizontal arrangement patterns.
The legend segmentation module is further specifically configured to perform a preliminary traversal of each content area and divide each content area into a plurality of initial legend identification line areas and a plurality of initial legend identification name areas; if each initial legend identification line region satisfies the straight-line identification condition, take the initial legend identification line regions as legend identification line regions, take the initial legend identification name regions as legend identification name regions, and mark the type of each legend identification line region as the straight-line identification line region type; if each initial legend identification line region satisfies the symbol identification condition, take the initial legend identification line regions as legend identification line regions, take the initial legend identification name regions as legend identification name regions, and mark the type of each legend identification line region as the symbol identification line region type.
The legend segmentation module is further specifically configured to correct each legend identification line region and take any legend identification line region satisfying the identification name recognition condition as a legend identification name region.
The legend segmentation module is specifically further used for vertically projecting the horizontally arranged graphs to obtain vertical projection graphs; performing projection calculation on each pixel point in the vertical projection graph to obtain a projection value and a projection variation value of each pixel point; and traversing each pixel point in the horizontal arrangement graph according to the projection value and the projection variation value of each pixel point, and taking the position information of the area where a plurality of pixels meeting the blank area identification condition are located as the position information of the blank area.
The data extraction module is specifically further configured to identify the line graph to be extracted by using a straight line detection model, and determine a data area and coordinate information of the line graph to be extracted; and matching the characteristic information of the legend identification line with an area corresponding to the coordinate information in the data area to obtain at least one target data value corresponding to the content information of each legend identification name.
The data extraction module is specifically further configured to, if the feature information of the legend identification line is a color feature, calculate color value distances between the color feature and color features of a plurality of pixel points in an area corresponding to each horizontal axis scale point in the coordinate information, and take pixel points, of which the color value distances are smaller than a preset color threshold and meet a preset slope condition, as corresponding data value points; and determining a target data value corresponding to each data value point according to the relative position relation between each data value point and the coordinate information.
The data extraction module is specifically further configured to, if the feature information of the legend identification line is a symbol feature, perform pattern matching between the symbol feature and the region corresponding to each horizontal-axis scale point in the coordinate information to obtain at least one candidate region; take the center point coordinates of the at least one candidate region as the corresponding data value points; and determine the target data value corresponding to each data value point according to the relative position relationship between each data value point and the coordinate information.
In a third aspect, an embodiment of the present application provides a processing apparatus, including: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating via the bus when the processing device is running, the processor executing the machine-readable instructions to perform the steps of the data extraction method of the line graph according to any one of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps of the data extraction method for a line graph according to any one of the first aspect.
The beneficial effects of the embodiment of the application include:
by adopting the data extraction method and processing device for line graphs provided by the embodiments of the present application, the legend area can be segmented to obtain each legend identification line area and each legend identification name area, and each area is then recognized to determine the corresponding legend identification line and legend identification name. This solves the problem of legend identification lines and legend identification names being confused when the legend area is recognized as a whole, and improves the accuracy of legend recognition.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered limiting of the scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
FIG. 1 is a content identifier included in a line graph;
fig. 2 is a schematic flowchart illustrating steps of a data extraction method for a line graph according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a segmentation of a data extraction method of a line graph according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an X-axis projection of a data extraction method for a line graph provided in an embodiment of the present application;
fig. 5 is a schematic diagram of an identification result of a data extraction method of a line graph according to an embodiment of the present application;
fig. 6 is a flowchart illustrating another step of the data extraction method for line charts according to the embodiment of the present application;
fig. 7 is a schematic Y-axis projection diagram of a data extraction method of a line graph according to an embodiment of the present application;
FIG. 8 is a flowchart illustrating another step of a data extraction method for line graphs according to an embodiment of the present application;
FIG. 9 is a table distribution type diagram of a line graph provided in an embodiment of the present application;
FIG. 10 is a flowchart illustrating another step of a data extraction method for line graphs according to an embodiment of the present application;
fig. 11 is a schematic diagram illustrating a broken line identification of a data extraction method of a line graph according to an embodiment of the present application;
FIG. 12 is a flowchart illustrating another step of a data extraction method for line graphs according to an embodiment of the present application;
FIG. 13 is a flowchart illustrating another step of a data extraction method for line graphs according to an embodiment of the present application;
FIG. 14 is a flowchart illustrating another step of a data extraction method for line graphs according to an embodiment of the present application;
FIG. 15 is a flowchart illustrating another step of a data extraction method for line graphs according to an embodiment of the present application;
FIG. 16 is a schematic diagram illustrating symbolic feature recognition of a data extraction method for a line graph according to an embodiment of the present application;
fig. 17 is a schematic structural diagram of a data extraction device of a line graph according to an embodiment of the present application;
fig. 18 is a schematic structural diagram of a processing apparatus according to an embodiment of the present application.
Reference numerals: 101-legend area; 102-legend identification line; 103-legend identification name; 104-Y axis; 105-Y axis scale point; 106-X axis scale point; 107-X axis; 108-data area; 1011-first legend identification line area; 1012-first legend identification name area; 1013-second legend identification line area; 1014-second legend identification name area; 100-data extraction device of a line graph; 1001-legend detection module; 1002-legend segmentation module; 1003-legend identification module; 1004-data extraction module; 2001-processor; 2002-memory.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
Furthermore, the appearances of the terms "first," "second," and the like, if any, are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
It should be noted that the features of the embodiments of the present application may be combined with each other without conflict.
The financial field contains a large amount of data represented in chart form, such as line graphs, bar graphs, scatter plots, and pie charts, and different charts represent data differently. The line graph is one of the most widely used chart types and contains a large amount of financial data that may be significant for financial services and financial analysis. As shown in fig. 1, a line graph mainly includes: a legend area 101, a legend identification line 102, a legend identification name 103, a Y-axis 104, a Y-axis tick mark 105, an X-axis tick mark 106, an X-axis 107, and a data area 108 where the polyline is located.
The legend area 101 includes at least one legend, and each set of legends includes a legend identification line 102 and a legend identification name 103 having a corresponding relationship. In addition, the data area 108 where the polyline is located comprises either the first quadrant enclosed by the positive half-axis of the X-axis 107 and the positive half-axis of the Y-axis 104, or the first and fourth quadrants corresponding to the positive half-axis of the X-axis 107 and the positive half-axis of the Y-axis 104.
Although extracting the data in a line graph is a necessary part of the data analysis process, line graphs obtained from data sources such as news, reports, and books are currently generally available only in image form, which creates considerable difficulty for subsequent data extraction.
Currently, legend data analysis of line graphs mainly adopts a pattern recognition approach: OCR technology is used to recognize the legend area 101 in fig. 1 as a whole to obtain the text content of the legend identification names, and forward detection is then performed from the position of each legend identification name; if a colored line is detected before a legend identification name, it is taken as the legend identification line corresponding to that name. The colors of the legend identification lines are then matched against the polylines of the data area 108 to determine the polyline corresponding to each legend identification name in the data area. Finally, multiple data points corresponding to each legend identification name are extracted according to the relative positions of the polyline and the coordinate axes.
In the prior art, where the whole legend area 101 is recognized to determine the legend identification names, graphic information is easily recognized as text information (or text as graphics) when a legend identification line is close to a legend identification name. For example, a short line segment may be recognized as the character "one", or, when two legends are too close to each other, they may be recognized as a single merged legend, resulting in incorrect legend recognition.
Based on this, the embodiments of the present application provide a data extraction method and processing device for line graphs that segments the legend area to obtain each legend identification line area and legend identification name area and then recognizes each area separately to determine the corresponding legend identification lines and legend identification names. This solves the recognition confusion caused by recognizing the legend area as a whole and improves the accuracy of legend recognition.
The embodiments of the present application provide a method for extracting data of a line graph and a processing device, which are described in detail below.
Fig. 2 is a schematic flowchart illustrating steps of a data extraction method of a line graph according to an embodiment of the present application, where an execution subject of the method may be a computer device with computing and processing capabilities, as shown in fig. 2, the method includes:
S201, detecting the line graph to be extracted by using a legend detection model, and determining the legend area of the line graph to be extracted.
The line graph to be extracted may have the same content form as the line graph shown in fig. 1, and is stored as a picture. Optionally, the position of the legend area 101 in the line graph to be extracted is not limited here.
The legend detection model analyzes the line graph to be extracted using an image segmentation network and determines, via the minimum circumscribed rectangle, the outline position of the minimum rectangular area occupied by the legend area; this outline position marks the legend area of the line graph to be extracted. Optionally, the legend detection model may be a U-NET model, or another common semantic segmentation model such as FCN or Seg-NET, which is not limited here.
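As a minimal sketch of the minimum-circumscribed-rectangle step, assuming the segmentation network outputs a binary mask of the legend area (the model itself is not reproduced here, and the helper name is illustrative), the outline position can be recovered with an axis-aligned bounding box:

```python
import numpy as np

def legend_bounding_rect(mask):
    """Return (x, y, w, h) of the minimal axis-aligned rectangle
    enclosing all nonzero pixels of a binary segmentation mask."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None  # no legend detected
    x0, x1 = xs.min(), xs.max()
    y0, y1 = ys.min(), ys.max()
    return (int(x0), int(y0), int(x1 - x0 + 1), int(y1 - y0 + 1))

# toy mask: the legend occupies rows 2..4, columns 3..9
mask = np.zeros((8, 12), dtype=np.uint8)
mask[2:5, 3:10] = 1
print(legend_bounding_rect(mask))  # (3, 2, 7, 3)
```

A rotated minimum-area rectangle (for example OpenCV's `minAreaRect`) could be substituted when legend areas are not axis-aligned.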
S202, according to the position information of the blank area in the legend area, the legend area is divided to obtain at least one legend identification line area and at least one legend identification name area.
With continued reference to fig. 1, at least one different set of legends may be included within legend area 101, each set of legends including a legend identification line and a legend identification name having a correspondence. Referring to fig. 3, the legend area includes two groups of legends as an example, and optionally, each group of legends is respectively formed by a legend identification name and a legend identification line.
As shown in fig. 3, the legend area 101 may be divided according to the location information of the blank area in the legend area, and then a first legend identification line area 1011, a second legend identification line area 1013, a first legend identification name area 1012, and a second legend identification name area 1014 are obtained according to different characteristics corresponding to the legend identification names and legend identification lines in the legend area 101.
S203, identifying each legend identification line area and each legend identification name area respectively, to obtain the feature information of each legend identification line and the content information of each legend identification name.
Optionally, the above feature information may be used to uniquely identify the type of the legend identification line. Specifically, the feature information may be a color feature distinguished by color, or a symbol feature distinguished by symbol. For example, referring to fig. 1, where different broken lines have different colors, such as black and gray, the different legend identification lines and their corresponding broken lines can be uniquely identified based on color features. For another example, in fig. 3, the first legend identification line region 1011 contains a triangle-symbol legend identification line, and the second legend identification line region 1013 contains a square-symbol legend identification line. It is understood that different types of legend identification lines have different feature information: a color legend identification line has color features, a symbol legend identification line has symbol features, and the manner of extracting feature information from legend identification line regions containing the different types of legend identification lines differs accordingly.
If the legend identification line in a certain legend identification line region is identified as a color legend identification line type, in a possible implementation manner, the RGB value of the legend identification line may be extracted as the feature information of the legend identification line.
If the legend identification line in a certain legend identification line area is identified as the type of the symbol legend identification line, in a possible manner, the symbol image of the legend identification line may be extracted as the feature information of the legend identification line.
Optionally, after at least one legend identification name area is determined, an OCR method may be used to identify each legend identification name area to determine the text content of each legend identification name as the content information of each legend identification name.
S204, according to the feature information of each legend identification line and the content information of each legend identification name, extracting data corresponding to the feature information from the data area of the line graph to be extracted, to obtain at least one target data value corresponding to the content information.
The broken line corresponding to each legend identification line can be found in the data area of the line graph according to its feature information, and data extraction is performed on each broken line to obtain a plurality of data values corresponding to each legend identification line. Further, according to the correspondence between each legend identification line and its legend identification name, the plurality of data values corresponding to the content information of that legend identification name are determined.
In step S203, when it is recognized that the legend identification line in a certain legend identification line area is a symbol legend identification line type, in order to extract and obtain a symbol image of the legend identification line as feature information, the following method may be adopted:
First, an X-axis projection, that is, a vertical projection, is applied to the legend identification line region to obtain a corresponding projection sequence. As shown in fig. 4, the upper image of fig. 4 is projected onto the X-axis to obtain the projection sequence shown in the lower image of fig. 4. Note that each pixel in the projection sequence corresponds to its own projection value.
Each pixel in the projection sequence is then traversed from left to right. The interval between the coordinate value y1 of the first pixel point whose projection value is not 0 and the coordinate value y2 of the last pixel point whose projection value is not 0 is taken as the area where the legend identification line is located, and the remaining parts are removed as vertical blanks, thereby obtaining the legend identification line area with the vertical blanks removed.
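A sketch of the X-axis (vertical) projection and the removal of vertical blanks, assuming the legend identification line region is given as a binary NumPy array (the helper names `x_projection` and `trim_blank_columns` are illustrative):

```python
import numpy as np

def x_projection(region):
    """Vertical (X-axis) projection: count of foreground pixels per column."""
    return (region > 0).sum(axis=0)

def trim_blank_columns(region):
    """Keep only the columns between the first and last nonzero
    projection value, discarding the blank margins."""
    proj = x_projection(region)
    nz = np.nonzero(proj)[0]
    return region[:, nz[0]:nz[-1] + 1]

region = np.zeros((4, 10), dtype=np.uint8)
region[2, 2:8] = 1              # a horizontal legend line in columns 2..7
trimmed = trim_blank_columns(region)
print(trimmed.shape)            # (4, 6)
```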
Further, the neighborhood projection variation value std_i of each pixel in the legend identification line area with the vertical blanks removed can be calculated by the following formula:

std_i = (1/(2k)) · Σ_{j=i−k}^{i+k−1} |p_j − p_{j+1}|

where k denotes the neighborhood range size of the current pixel, p_i denotes the projection value of the current pixel i, and |p_i − p_{i+1}| denotes the absolute value of the difference between the projection value of the current pixel i and that of the next pixel i + 1. Illustratively, k may be 2.
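As one possible reading of the neighborhood projection variation value std_i, taken here as the mean absolute difference of adjacent projection values within a ±k window (an assumption, since the patent's formula image is not reproduced in the text; the window is clipped at the sequence boundaries), a small sketch:

```python
import numpy as np

def neighborhood_std(proj, k=2):
    """Mean absolute difference |p_j - p_{j+1}| over the +/-k neighborhood
    of each pixel i of a projection sequence (boundary-clipped)."""
    n = len(proj)
    out = np.zeros(n)
    for i in range(n):
        lo = max(i - k, 0)
        hi = min(i + k, n - 1)       # differences taken for j in [lo, hi-1]
        diffs = [abs(proj[j] - proj[j + 1]) for j in range(lo, hi)]
        out[i] = sum(diffs) / len(diffs) if diffs else 0.0
    return out

proj = np.array([0., 0., 3., 3., 3., 3., 3., 3., 0., 0.])
std = neighborhood_std(proj, k=2)
print(std[4], std[5])   # 0.0 0.0  (the flat interior of a line barely varies)
```

Pixels in the flat horizontal-line part of a legend identification line get std values near zero, while pixels near a symbol or an edge get large values, which is what the conditions below exploit.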
If, within the legend identification line area with the vertical blanks removed, there are m consecutive pixels whose neighborhood projection variation values std are all smaller than a preset variation threshold and whose projection values are all greater than zero, the section from the first of these m pixels to the last pixel a of the run can be determined as the starting horizontal-line section of the legend identification line. Illustratively, m may be 5, and the preset variation threshold may be 0.1.
If, within the legend identification line area with the vertical blanks removed, the neighborhood projection variation values std of the pixels following the end pixel a of the starting horizontal-line section are all larger than the preset variation threshold, and the distance between those subsequent pixels and pixel a is greater than a preset distance threshold, then pixel a is reset as the starting coordinate of the symbol on the legend identification line, and a pixel b is identified as the ending coordinate of the symbol. Here, pixel b is the pixel from which there are again m consecutive pixels whose neighborhood projection variation values std are all smaller than the preset variation threshold and whose projection values are all greater than zero; that is, the m pixels beginning at pixel b form the ending horizontal-line section of the legend identification line.
Then, the image of the area between the starting horizontal line and the ending horizontal line, that is, the area [a, b], is extracted as the symbol identification image. The symbol identification image can be used as the feature information characterizing the legend identification line by its symbol features.
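Putting the above together, a hedged sketch of locating the symbol interval [a, b] between the starting and ending horizontal-line runs; it assumes the projection sequence and its std values are precomputed, and it simplifies away the distance-threshold check on the pixels following pixel a:

```python
import numpy as np

def find_symbol_interval(proj, std, m=5, thr=0.1):
    """Locate the symbol section between the starting and ending
    horizontal-line runs of a symbol-type legend identification line.
    A horizontal-line pixel has proj > 0 and std < thr; a run needs
    m consecutive such pixels."""
    n = len(proj)
    def is_line(i):
        return proj[i] > 0 and std[i] < thr
    a, run = None, 0
    for i in range(n):                       # starting run, scanned from left
        run = run + 1 if is_line(i) else 0
        if run >= m:
            a = i                            # a = last pixel of the starting run
        elif a is not None:
            break
    b, run = None, 0
    for i in range(n - 1, -1, -1):           # ending run, scanned from right
        run = run + 1 if is_line(i) else 0
        if run >= m:
            b = i                            # b = first pixel of the ending run
        elif b is not None:
            break
    if a is None or b is None or b <= a + 1:
        return None                          # no symbol between the runs
    return (a + 1, b - 1)

proj = np.ones(25)
std = np.array([0.0] * 10 + [1.0] * 5 + [0.0] * 10)
print(find_symbol_interval(proj, std))                 # (10, 14)
print(find_symbol_interval(np.ones(8), np.zeros(8)))   # None (plain line)
```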
Fig. 5 is a schematic diagram of the recognition effect of the data extraction method for a line graph provided in the embodiment of the present application. Performing data extraction on the line graph on the left side of fig. 5 by this method yields, for the content of each legend identification name, a plurality of data values at the positions of the horizontal-axis scale marks.
In this embodiment, the legend area is divided to obtain each legend identification line area and legend identification name, and each area is identified respectively to determine the corresponding legend identification line and legend identification name, so that the problem of confusion of legend identification line and legend identification name caused by overall identification of the legend area is solved, and the accuracy of legend identification is improved.
Optionally, since the legend area includes a plurality of legend identification line areas and corresponding legend identification name areas, in order to segment each legend identification line area and its corresponding legend identification name area, as shown in fig. 6, in the above step S202, segmenting the legend area to obtain at least one legend identification line area and at least one legend identification name area may be implemented by the following steps S301 to S305.
S301, obtaining a horizontal arrangement pattern corresponding to the legend area.
The horizontal arrangement pattern is a pattern in which the legends in the legend areas are sequentially arranged in the horizontal direction, for example, a pattern corresponding to the arrangement of the legends in the legend area 101 in fig. 1, or a pattern corresponding to the arrangement of the legends in the legend area 101 shown in fig. 3.
It should be noted that the arrangement manner of the legend may also be a column arrangement, a table format arrangement, or other arrangement manner as shown in fig. 7, and the legend areas arranged in other manner may be converted into horizontal arrangement patterns in advance.
S302, traversing and identifying the pixel change values in the horizontal arrangement graph.
Through the above embodiment, the neighborhood projection variation value std and the projection value can be calculated for each pixel point in the horizontally arranged pattern; the std value and the projection value of each pixel can be used to characterize the current pixel point and the pixel points surrounding it.
Therefore, the horizontally arranged pattern can be traversed and, according to the neighborhood projection variation value std of each pixel, the start and end coordinates of a plurality of blank areas in the horizontally arranged pattern are obtained: [l_1, r_1], [l_2, r_2] ... [l_n, r_n].
S303, when the pixel point variation value of the target position meets the blank area identification condition, determining the target position as the position information of the blank area.
The blank area identification condition may be: when the horizontally arranged pattern is traversed from left to right, the projection variation value std of the pixel point is smaller than the preset variation threshold and its projection value is 0.
Then, according to the target positions of the plurality of pixel points meeting the blank area identification condition in the line graph to be extracted, the position information of the blank area in the line graph to be extracted can be determined.
S304, according to the position information of the blank areas, the horizontal arrangement area is divided, and a plurality of content areas are determined.
The content areas may include a plurality of legend identification line areas and a plurality of legend identification name areas; it should be understood that the blank areas are the gaps between the legend identification line areas and the legend identification name areas. Therefore, according to the start and end coordinates [l_1, r_1], [l_2, r_2] ... [l_n, r_n] of the blank areas, the plurality of discontinuous areas of the horizontally arranged pattern that do not belong to any blank area can be divided into a plurality of content areas.
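A sketch of dividing the horizontally arranged pattern into content areas at runs of blank columns, assuming a binary image and using a projection value of 0 as the blank criterion (the function name is illustrative):

```python
import numpy as np

def split_content_areas(img):
    """Return (start, end) column intervals of the content areas of a
    horizontally arranged legend image, split at all-background columns."""
    proj = (img > 0).sum(axis=0)      # vertical projection per column
    areas, start = [], None
    for x, v in enumerate(proj):
        if v > 0 and start is None:
            start = x                 # a content area begins
        elif v == 0 and start is not None:
            areas.append((start, x - 1))
            start = None              # a blank area begins
    if start is not None:
        areas.append((start, len(proj) - 1))
    return areas

img = np.zeros((3, 20), dtype=np.uint8)
img[1, 1:5] = 1     # legend identification line
img[:, 7:12] = 1    # legend identification name
img[1, 15:19] = 1   # second legend identification line
print(split_content_areas(img))   # [(1, 4), (7, 11), (15, 18)]
```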
S305, analyzing each content area, and determining all legend identification line areas and legend identification name areas.
Further, since the content area includes a plurality of legend identification line areas and a plurality of legend identification name areas, each content area may be analyzed according to a difference between the feature information of the legend identification line in the legend identification line area and the content information of the legend identification name in the legend identification name area, and the plurality of content areas are divided into a plurality of legend identification line areas and a plurality of legend identification name areas.
In this embodiment, the horizontally arranged area is divided into a plurality of content areas according to the plurality of blank areas, which solves the problem that recognizing the legend area as a whole confuses legend identification lines with legend identification names, and improves the accuracy of legend recognition.
The above division method is merely an example; other division orders and directions that achieve the effect of separating the legend identification line areas and legend identification name areas are not excluded. Such solutions can be realized by those skilled in the art without creative work after reading this application.
Optionally, the legends in the legend area may be horizontally arranged, as in fig. 3 above, but they may also be vertically arranged, as in fig. 7 above. In order to obtain the legend identification line areas and legend identification name areas more accurately and efficiently, before the legend area is segmented, legends in other arrangements, for example vertically arranged legends or legends in a table arrangement, may first be converted into a horizontally arranged pattern and then segmented to obtain the legend identification line areas and legend identification name areas. As shown in fig. 8, in the above step S301, obtaining the horizontal arrangement pattern corresponding to the legend area can be realized by the following steps S401 to S402.
S401, determining the legend distribution type of the legend area according to the arrangement sequence of the sub-areas to be identified in the legend area.
Each sub-area to be identified comprises a group of legends, and each group of legends comprises a legend identification line and a corresponding legend identification name.
Optionally, the legend area may be identified by an OCR technique or another identification method to preliminarily determine the arrangement order of the sub-areas to be identified in the legend area. For example, if a plurality of horizontally arranged sub-areas to be identified are obtained, the legend distribution type of the legend area is the horizontal arrangement type; if a plurality of vertically arranged sub-areas to be identified are obtained, it is the vertical arrangement type; and if sub-areas to be identified are arranged in both the horizontal and vertical directions, it is the table arrangement type.
S402, if the legend distribution type is a vertical distribution type or a table distribution type, horizontally dividing each sub-area to be identified according to the projection result of each sub-area to be identified in the horizontal direction to obtain a plurality of horizontal arrangement patterns.
If the legend distribution type is a vertical distribution type, performing Y-axis projection on each sub-area to be identified as shown in fig. 7 to obtain a Y-axis projection graph corresponding to each legend area.
Referring to fig. 7, when the plurality of vertically adjacent sub-areas to be identified are projected onto the Y-axis, the gaps between them produce pixel points with a projection value of 0 in the projected Y-axis pattern. These pixel points can be used as division points between the sub-areas to be identified, and the legend area is vertically divided according to the coordinates of the division points, so as to obtain the plurality of sub-areas to be identified.
Finally, the plurality of sub-areas to be identified can be horizontally divided to obtain a plurality of horizontally arranged patterns.
Alternatively, as shown in fig. 9, if the legend distribution type is the table arrangement type, that is, the legend area includes a plurality of sub-areas to be identified in both the horizontal and vertical directions, the legend area may first be divided according to the above steps into a plurality of legend areas whose distribution type is the vertical arrangement type. That is, after the legend area shown in fig. 9 is divided, three groups of vertically arranged legend areas are obtained: the areas of the legend identification lines and legend identification names corresponding to "soybean" and "corn" form one group, those corresponding to "rice" and "cotton" form another, and those corresponding to "wheat" and "potato" form the third. Then, each legend area is divided vertically to obtain a plurality of areas to be identified. For example, the legend area group of "soybean" and "corn" is divided into two areas to be identified: the area containing the legend identification line and legend identification name corresponding to "soybean", and the area containing those corresponding to "corn". Finally, the areas to be identified in all the legend areas are arranged uniformly in sequence on a single line to obtain the horizontally arranged pattern.
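A sketch of converting a vertically arranged legend area into a horizontally arranged pattern, assuming binary sub-areas separated by blank rows; the helper names and the padding width are illustrative:

```python
import numpy as np

def split_rows(img):
    """Split a vertically arranged legend area into sub-areas at blank
    rows (rows whose Y-axis projection value is 0)."""
    proj = (img > 0).sum(axis=1)
    subs, start = [], None
    for y, v in enumerate(proj):
        if v > 0 and start is None:
            start = y
        elif v == 0 and start is not None:
            subs.append(img[start:y])
            start = None
    if start is not None:
        subs.append(img[start:])
    return subs

def to_horizontal(img, pad=2):
    """Lay the vertically stacked sub-areas out on one line, left to right,
    padding each piece so all pieces share the tallest height."""
    subs = split_rows(img)
    h = max(s.shape[0] for s in subs)
    pieces = []
    for s in subs:
        block = np.zeros((h, s.shape[1] + pad), dtype=img.dtype)
        block[:s.shape[0], :s.shape[1]] = s
        pieces.append(block)
    return np.hstack(pieces)

img = np.zeros((7, 6), dtype=np.uint8)
img[1, :4] = 1     # first legend row
img[4, :5] = 1     # second legend row
flat = to_horizontal(img)
print(flat.shape)  # (1, 16)
```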
In this embodiment, the legend distribution types of other arrangement modes are converted into horizontal arrangement patterns, so that the subsequent unified identification method is facilitated, the complexity of the method is reduced, and the processing efficiency is improved.
Optionally, in the above embodiment, after a plurality of legend identification line areas and a plurality of legend identification name areas are preliminarily determined, a legend identification line area may have been wrongly determined as a legend identification name area, or vice versa, so the areas need to be further analyzed to obtain accurate legend identification line areas and legend identification name areas. As shown in fig. 10, in the above step S305, analyzing each content area to determine all legend identification line areas and legend identification name areas can be realized by the following steps S501 to S503.
And S501, performing preliminary traversal on each content area, and dividing each content area into a plurality of initial legend identification line areas and a plurality of initial legend identification name areas.
Using the formula for the neighborhood projection variation value std in the above embodiment, the std value of each pixel in each content area can be calculated.
Optionally, if there are m consecutive pixel points in a content area whose neighborhood projection variation values std are all smaller than a threshold t, and the distance between the first pixel point traversed from left to right and the other pixel points in the content area is smaller than a threshold s, the content area is divided into an initial legend identification line area. Illustratively, m may be 5, t may be 0.2, and s may be 0.5; the distance between two pixel points may be calculated from the coordinates of the two pixels.
And traversing each content area respectively, dividing at least one content area meeting the conditions into an initial legend identification line area, and dividing at least one remaining content area into an initial legend identification name area.
S502, if the initial identification line regions all meet the straight line identification condition, the initial identification line regions are used as legend identification line regions, the initial legend identification name regions are used as legend identification name regions, and the types of the legend identification line regions are marked as straight line identification line region types.
In one possible implementation, the straight line identification condition may be: the neighborhood projection variation values std of all pixel points in the initial identification line region are smaller than t and, when the region is traversed from left to right, the distance between the first pixel point and the other pixel points is less than s. Illustratively, t may be 0.2 and s may be 0.2.
If a certain initial identification line region meets the above-mentioned straight line identification condition, the type of the initial identification line region may be marked as a straight line identification line region type.
It is to be understood that, for the same line graph, the multiple legend identification lines in the legend area are generally represented in a uniform manner, and the type of the legend identification line area corresponding to each legend identification line should be the same.
Thus, each initial identification line region satisfying the straight line recognition condition can be used as a legend identification line region, and the initial legend identification name region can be used as a legend identification name region.
S503, if the initial marking line areas all meet the symbol identification condition, the initial marking line areas are used as legend marking line areas, the initial legend marking name areas are used as legend marking name areas, and the types of the legend marking line areas are marked as symbol marking line area types.
The symbol identification condition may be: when an initial identification line region is traversed from left to right, the neighborhood projection variation values std of the first n pixel points are all smaller than t, and the distance between the first pixel point and the other pixel points is smaller than s; and when the region is traversed from right to left, the std values of the last n pixel points are likewise all smaller than t, and the distance between the first pixel point and the other pixel points is smaller than s.
As can be seen from the above, for the same line graph, the types of the legend marking line regions corresponding to each legend marking line may be all straight line marking line region types, or all symbol marking line region types. If all the initial identification line regions meet the symbol identification condition, taking each initial identification line region meeting the symbol identification condition as a legend identification line region, and taking the initial legend identification name region as a legend identification name region.
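A simplified sketch of deciding between the straight-line and symbol identification line region types from the std sequence alone; the distance-to-first-pixel checks in the conditions above are omitted for brevity, so this is an approximation rather than the full condition:

```python
def classify_region(std, t=0.2, n=3):
    """Rough classification of an initial identification line region by its
    std sequence: flat everywhere -> straight line; flat only at both
    ends (a symbol varies in the middle) -> symbol."""
    if all(s < t for s in std):
        return "straight"
    if all(s < t for s in std[:n]) and all(s < t for s in std[-n:]):
        return "symbol"
    return "unknown"

print(classify_region([0.0] * 8))                                  # straight
print(classify_region([0.0, 0.0, 0.0, 0.9, 0.9, 0.0, 0.0, 0.0]))  # symbol
```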
Alternatively, as shown in fig. 11, if a legend identification line is a dotted line, the gaps between its dashes may be treated as blank areas, so that the above identification may wrongly divide one content area into a plurality of legend identification line areas. Therefore, the legend identification line areas meeting the straight line identification condition or the symbol identification condition can be further examined: a plurality of consecutive legend identification line areas with no content area of another type between them are merged into a new legend identification line area, and the type of the new legend identification line area is marked as the dotted-line identification line area type.
In this embodiment, the initial identification line region is marked as a straight identification line region type or a symbol identification line region type, so that data extraction can be performed in a targeted manner in subsequent steps, and the accuracy of data extraction is improved.
Optionally, the data extraction method for the line graph provided in the embodiment of the present application further includes: and correcting each legend identification line region, and taking the legend identification line region meeting identification name recognition conditions as a legend identification name region.
The at least one legend identification line area obtained in the above steps S502 to S503 may include areas in which a "一" or a horizontal stroke in a legend identification name was wrongly identified as a legend identification line. Therefore, the lengths of the legend identification line areas can be compared, and any legend identification line area whose length differs from that of the other legend identification line areas by more than sh pixel points is corrected into a legend identification name area. Illustratively, sh may be 5.
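A sketch of this length-based correction, assuming the identification line areas are given as (start, end) column intervals and using the median width as the reference length (an assumption; the patent only compares against the "other" regions):

```python
def correct_line_regions(regions, sh=5):
    """Move identification line regions whose width differs from the
    median width by more than sh pixels into the name-region list.
    regions: list of (start, end) column intervals."""
    widths = [e - s + 1 for s, e in regions]
    median = sorted(widths)[len(widths) // 2]
    lines, names = [], []
    for (s, e), w in zip(regions, widths):
        (lines if abs(w - median) <= sh else names).append((s, e))
    return lines, names

# the third interval is much shorter: likely a mis-detected "一" character
regions = [(0, 30), (40, 70), (80, 95)]
lines, names = correct_line_regions(regions)
print(lines, names)  # [(0, 30), (40, 70)] [(80, 95)]
```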
In this embodiment, the legend identification line areas are further examined, which avoids mistakenly identifying text as a legend identification line and the resulting legend recognition errors, thereby improving the accuracy of recognition.
Optionally, blank space areas exist between the legend identification lines and the legend identification names in each legend area, and therefore, the position information of the blank areas can be identified through vertical projection to serve as separation areas between the legend identification line areas and the legend identification name areas. As shown in fig. 12, in the step S303, when the pixel variation value of the target position satisfies the blank area identification condition, the step S601 to S603 may be implemented to determine that the target position is the position information of the blank area.
And S601, vertically projecting the horizontally arranged pattern to obtain a vertically projected pattern.
Referring to fig. 4, the horizontal arrangement pattern of the left image in fig. 4 is projected on the X-axis, i.e. vertically, so as to obtain the vertical projection pattern shown in the right image in fig. 4.
S602, performing projection calculation on each pixel point in the vertical projection graph to obtain a projection value and a projection variation value of each pixel point.
Using the formula for calculating the projection variation value std of a pixel point in the above embodiment, the projection value and the projection variation value std of each pixel point in the vertical projection pattern can be calculated.
And S603, traversing each pixel point in the horizontal arrangement graph according to the projection value and the projection variation value of each pixel point, and taking the position information of the area where the plurality of pixels meeting the blank area identification condition are as the position information of the blank area.
All pixel points of the horizontally arranged pattern are traversed; a plurality of adjacent pixel points whose projection variation value std is less than the preset variation threshold and whose projection value is 0, that is, which meet the blank area identification condition, are grouped into the same blank area, and the position information corresponding to the outline of the pixel points in each blank area is marked as the position information of that blank area.
In this embodiment, a plurality of blank regions are determined according to the projection value and the projection variation value of each pixel point in the horizontally arranged graph, and the blank regions are determined in an interpretable manner, so that the data extraction efficiency is improved.
Optionally, in order to obtain the data in the line graph to be extracted, the feature information extracted for each legend identification line needs to be matched against the line graph to be extracted to determine the data values corresponding to the content information of each legend identification name. As shown in fig. 13, in the above step S204, extracting data corresponding to the feature information from the data area of the line graph to be extracted according to the feature information of each legend identification line and the content information of each legend identification name, to obtain at least one target data value corresponding to the content information, may be implemented by the following steps S701 to S702.
S701, identifying the line graph to be extracted by using the straight line detection model, and determining the data area and the coordinate information of the line graph to be extracted.
The straight line detection model may be a pre-trained semantic segmentation model. It can identify the longest horizontal line at the bottom of the area of the line graph to be extracted as the X axis, and the leftmost vertical line intersecting the X axis as the Y axis.
Further, as shown in fig. 1, the X-axis scale values 107 are arranged below the X axis, so the area below the X axis can be identified from the position of the X axis, and the text contents corresponding to the X-axis scale points can be obtained. The black pixel points within toph pixels above each text content are identified as the X-axis scale point corresponding to that text content, where toph may be 10. In this way, multiple sets of X-axis coordinate information are obtained.
Alternatively, multiple sets of Y-axis coordinate information may be identified in a manner similar to the above identification of the X-axis coordinate information.
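The axis identification of step S701 can be approximated without a trained model by a simple heuristic on a binarized chart image: take the lowest near-full-width row of foreground pixels as the X axis, and the leftmost long vertical run meeting it as the Y axis. A hedged sketch follows; the coverage thresholds 0.8 and 0.5 are illustrative assumptions, and the patent itself uses a pre-trained semantic segmentation model for this step.

```python
import numpy as np

def detect_axes(binary):
    """Find (x_axis_row, y_axis_col) in a 0/1 chart image.

    X axis: the lowest row whose foreground run covers most of the width.
    Y axis: the leftmost column with a long vertical run that meets the
    X axis.  A simplified stand-in for the line-detection model."""
    h, w = binary.shape
    row_counts = binary.sum(axis=1)
    x_axis_row = max(r for r in range(h) if row_counts[r] >= 0.8 * w)
    col_counts = binary[:x_axis_row + 1].sum(axis=0)
    y_axis_col = min(c for c in range(w)
                     if col_counts[c] >= 0.5 * x_axis_row
                     and binary[x_axis_row, c])
    return x_axis_row, y_axis_col
```

In a production pipeline the same roles could be filled by `cv2.HoughLinesP` line candidates or by the segmentation model the patent describes.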
S702, matching the characteristic information of the figure legend identification line with an area corresponding to the coordinate information in the data area to obtain at least one target data value corresponding to the content information of each figure legend identification name.
The area corresponding to the coordinate information in the data area may refer to the vertical strip of pixels directly above and below each X-axis scale point.
Based on the feature information of each legend identification line obtained in the above embodiment, such as its color feature or symbol feature, the feature can be matched in the area corresponding to the coordinate information in the data area, that is, in the vertical strip above and below each X-axis scale point, so as to obtain the data value point corresponding to each X-axis scale point. The data value corresponding to each data value point is then calculated to obtain at least one target data value. Each target data value is the coordinate value, on the broken line, corresponding to one piece of X-axis coordinate information.
In this embodiment, after the data area of the line graph is determined, only the area indicated by the coordinate information is identified to obtain the target data values. This avoids identifying data in other areas and improves the accuracy of data extraction. Moreover, the data values are determined directly from the broken line corresponding to the coordinate information, which avoids the errors that an additional data-value determination step would introduce.
Optionally, when the feature information of the legend identification line is a color feature, at least one target data value corresponding to the content information of each legend identification name is extracted as follows. As shown in fig. 14, step S702, in which the feature information of the legend identification line is matched with the area corresponding to the coordinate information in the data area to obtain at least one target data value corresponding to the content information of each legend identification name, may be implemented by the following steps S801 to S802.
S801, if the feature information of the legend identification line is a color feature, calculating color value distances between the color feature and the color features of a plurality of pixel points in an area corresponding to each horizontal axis scale point in the coordinate information, and taking the pixel points of which the color value distances are smaller than a preset color threshold and meet a preset slope condition as corresponding data value points.
As described in the above embodiment, the color feature may be the RGB value of the legend identification line. When the vertical strip above and below a given X-axis scale point is traversed, a plurality of pixel points may be detected; the one or more pixel points whose RGB values are closest to the RGB value of the legend identification line are taken as initial data value points.
If only one initial data value point is obtained by matching a legend identification line at a given X-axis scale point, there are no interfering pixel points in the line graph to be extracted, and that initial data value point can be used directly as the data value point corresponding to the scale point.
If more than one initial data value point is obtained at a given X-axis scale point, the initial data value points may include pixel points of a background grid whose color feature is the same as that of the legend identification line. In that case, the slope at each initial data value point can be estimated from its surrounding pixel points. If the slope at one initial data value point is non-zero while the slopes at the others are 0, that point lies on the broken line and is taken as the data value point for the scale point. If the slope at every initial data value point is 0, it can be checked whether each point coincides with a Y-axis scale point; if so, the unique initial data value point that does not coincide with a Y-axis scale point is taken as the data value point for the X-axis scale point.
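The color matching and slope filtering of step S801 might be sketched like this, using Euclidean RGB distance as the "color value distance". The function names and the threshold value are illustrative assumptions; the slope check here uses the simple observation that a point on a horizontal background grid line also matches at the same row in the neighbouring columns (local slope 0).

```python
import numpy as np

def match_color_points(img, x_col, legend_rgb, color_thresh=30.0):
    """Return the rows in column `x_col` of an H x W x 3 image whose RGB
    distance to the legend line colour is below `color_thresh`
    (the initial data value points of step S801)."""
    col = img[:, x_col, :].astype(float)
    dists = np.linalg.norm(col - np.array(legend_rgb, dtype=float), axis=1)
    return [int(r) for r in np.where(dists < color_thresh)[0]]

def filter_by_slope(cands_prev, cands_here, cands_next):
    """Drop candidates lying on a horizontal background grid line: a row
    that also matches at the same height in both neighbouring columns has
    local slope 0 and is treated as grid noise."""
    kept = [r for r in cands_here if not (r in cands_prev and r in cands_next)]
    return kept or cands_here      # keep the originals if all were filtered
```

The fallback on the last line mirrors the text's handling of the case where every candidate has slope 0 and further disambiguation (for example against Y-axis scale points) is needed.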
S802, determining a target data value corresponding to each data value point according to the relative position relation between each data value point and the coordinate information.
According to the position of each data value point (nameK, x_scaleK, (xK, yK)), the Y-axis scale points whose pixel positions (xm, ym) and (xn, yn) bracket its Y coordinate yK, and the coordinate information ([[y_scale0, (x0, y0)], …, [y_scaleJ, (xj, yj)]]) corresponding to the scale points, the target data value corresponding to the data value point is determined by linear interpolation:

valueK = y_scale_m + (yK - ym) / (yn - ym) * (y_scale_n - y_scale_m)

where y_scale_m and y_scale_n are the scale values of the two bracketing Y-axis scale points.
In this embodiment, the color feature of the legend identification line is matched in the area corresponding to each horizontal-axis scale point in the coordinate information, and the pixel points that do not meet the preset slope condition are further removed. This solves the problem that, when a background grid has the same color as the broken line, the excessive noise in the line graph makes it difficult to extract the data accurately and cleanly, and thus improves the accuracy of data extraction from the line graph.
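One way to realize step S802, converting a data value point's pixel position into a target data value using the Y-axis scale points, is linear interpolation between the two scale points whose pixel rows bracket the point. A minimal sketch, with the caveat that the function name and the bracketing strategy are assumptions; points outside the outermost ticks simply fall back to the first interval here.

```python
def pixel_to_value(yK, scale_points):
    """Map a data value point's pixel row yK to a chart value by linear
    interpolation between the two Y-axis scale points whose pixel rows
    bracket yK.  `scale_points` is a list of (scale_value, y_pixel)."""
    pts = sorted(scale_points, key=lambda p: p[1])   # sort by pixel row
    (v_m, ym), (v_n, yn) = pts[0], pts[1]            # default: first interval
    for (a, ya), (b, yb) in zip(pts, pts[1:]):       # find the bracketing pair
        if ya <= yK <= yb:
            (v_m, ym), (v_n, yn) = (a, ya), (b, yb)
            break
    # linear interpolation between the two bracketing scale points
    return v_m + (yK - ym) / (yn - ym) * (v_n - v_m)
```

Note that pixel rows grow downward while chart values usually grow upward; the interpolation handles this automatically because the scale values carry the direction.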
Optionally, when the feature information of the legend identification line is a symbol feature, step S702, in which the feature information of the legend identification line is matched with the area corresponding to the coordinate information in the data area to obtain at least one target data value corresponding to the content information of each legend identification name, may also be implemented by the following steps S901 to S903, as shown in fig. 15.
S901, if the feature information of the legend identification line is a symbol feature, performing pattern matching between the symbol feature and the area corresponding to each horizontal-axis scale point in the coordinate information to obtain at least one candidate area.
As shown in fig. 3 in the above embodiment, the symbol feature may be the symbol identification image in a legend identification line, for example the area where the triangle symbol in the first legend identification line area 1011 is located, or the area where the square symbol in the second legend identification line area 1013 is located. The symbol identification image of each legend identification line is pattern-matched within the vertical strip above and below each X-axis scale point in the data area of the line graph to be extracted, so as to obtain the areas in the line graph indicated by the symbol identification image of each legend identification line.
Fig. 16 shows the result of pattern matching: the boxed areas in fig. 16 mark, on each broken line, the candidate areas that match the symbol identification image of a legend identification line.
S902, taking the central point coordinate of at least one candidate area as a corresponding data value point.
It can be understood that the position of each candidate area is described by its contour information; therefore, the coordinates of the center point of each candidate area can be used as the data value point of that candidate area.
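Steps S901 and S902 together amount to a template match followed by taking box centers. In practice `cv2.matchTemplate` would typically be used; the pure-NumPy sum-of-squared-differences scan below is an illustrative stand-in, and the `max_ssd` tolerance is an assumption.

```python
import numpy as np

def match_symbol(strip, template, max_ssd=0):
    """Exhaustive template match of a legend symbol image over a strip of
    the data area (step S901): return the top-left (row, col) of every
    window whose SSD against the template is <= max_ssd.
    Use signed integer arrays to avoid uint8 wrap-around."""
    H, W = strip.shape
    h, w = template.shape
    hits = []
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            ssd = int(((strip[y:y + h, x:x + w] - template) ** 2).sum())
            if ssd <= max_ssd:
                hits.append((y, x))
    return hits

def center_points(hits, h, w):
    """Step S902: the center of each candidate box is its data value point."""
    return [(y + h // 2, x + w // 2) for (y, x) in hits]
```

Each hit corresponds to one boxed candidate area of fig. 16, and its center becomes the data value point fed into the interpolation of step S903.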
And S903, determining a target data value corresponding to each data value point according to the relative position relationship between each data value point and the coordinate information.
The manner of determining the target data value corresponding to each data value point is the same as that in step S802, and is not described herein again.
In this embodiment, a plurality of target data values are obtained by matching the symbol features of the legend identification lines with the data area of the line graph to be extracted. This solves the problem of extracting data from a line graph whose legend area contains a plurality of legend identification lines that carry symbol identifications and have similar or identical colors.
Referring to fig. 17, an embodiment of the present application further provides a data extraction apparatus 100 for a line graph, which can be used to execute the steps of the data extraction method for a line graph in the foregoing embodiment, including:
a legend detection module 1001, configured to detect a line graph to be extracted by using a legend detection model, and determine a legend area of the line graph to be extracted;
a legend dividing module 1002, configured to divide a legend area to obtain at least one legend identification line area and at least one legend identification name area;
a legend identification module 1003, configured to respectively identify each legend identification line region and each legend identification name region according to location information of a blank region in the legend region, to obtain feature information of each legend identification line and content information of each legend identification name;
and the data extraction module 1004 is configured to extract, according to the feature information of each legend identification line and the content information of each legend identification name, data corresponding to the feature information in the data area of the line graph to be extracted, and obtain at least one target data value corresponding to the content information.
The legend dividing module 1002 is further specifically configured to obtain a horizontal arrangement pattern corresponding to the legend area; traversing and identifying the pixel change value in the horizontal arrangement graph; when the pixel point variation value of the target position meets the blank area identification condition, determining the target position as the position information of the blank area; dividing the horizontal arrangement area according to the position information of the blank areas to determine a plurality of content areas; and analyzing each content area to determine all the legend identification line areas and legend identification name areas.
The legend segmentation module 1002 is further specifically configured to determine a legend distribution type of the legend region according to an arrangement order of each sub-region to be identified in the legend region; and if the legend distribution type is a vertical distribution type or a table distribution type, horizontally dividing each sub-area to be identified according to the projection result of each sub-area to be identified in the horizontal direction to obtain a plurality of horizontal arrangement patterns.
The legend segmentation module 1002 is further configured to perform a preliminary traversal on each content area and divide each content area into a plurality of initial legend identification line regions and a plurality of initial legend identification name regions; if the initial legend identification line regions all meet the straight line identification condition, take the initial legend identification line regions as legend identification line regions, take the initial legend identification name regions as legend identification name regions, and mark the type of the legend identification line regions as the straight line identification line region type; and if the initial legend identification line regions all meet the symbol identification condition, take the initial legend identification line regions as legend identification line regions, take the initial legend identification name regions as legend identification name regions, and mark the type of the legend identification line regions as the symbol identification line region type.
The legend dividing module 1002 is further specifically configured to correct each legend identification line region, and take each legend identification line region that meets the identification name recognition condition as a legend identification name region.
The legend dividing module 1002 is further specifically configured to perform vertical projection on the horizontally arranged graph to obtain a vertical projection graph; performing projection calculation on each pixel point in the vertical projection graph to obtain a projection value and a projection variation value of each pixel point; and traversing each pixel point in the horizontal arrangement graph according to the projection value and the projection variation value of each pixel point, and taking the position information of the area where the plurality of pixels meeting the blank area identification condition are located as the position information of the blank area.
The data extraction module 1004 is further specifically configured to identify the line graph to be extracted by using the straight line detection model, and determine a data area and coordinate information of the line graph to be extracted; and matching the characteristic information of the legend identification line with an area corresponding to the coordinate information in the data area to obtain at least one target data value corresponding to the content information of each legend identification name.
The data extraction module 1004 is further specifically configured to, if the feature information of the legend identification line is a color feature, calculate a color value distance between the color feature and color features of a plurality of pixel points in an area corresponding to each horizontal axis scale point in the coordinate information, and take the pixel points whose color value distance is smaller than a preset color threshold and meets a preset slope condition as corresponding data value points; and determining a target data value corresponding to each data value point according to the relative position relationship between each data value point and the coordinate information.
The data extraction module 1004 is further specifically configured to, if the feature information of the legend identification line is a symbol feature, calculate a region corresponding to each horizontal axis scale point in the symbol feature and the coordinate information, and perform pattern matching to obtain at least one region to be selected; taking the coordinates of the central point of at least one to-be-selected area as corresponding data value points; and determining a target data value corresponding to each data value point according to the relative position relationship between each data value point and the coordinate information.
Referring to fig. 18, this embodiment further provides a processing apparatus, including a processor 2001, a memory 2002, and a bus. The memory 2002 stores machine-readable instructions executable by the processor 2001. When the processing device is running, the processor 2001 and the memory 2002 communicate via the bus, and the processor 2001 executes the machine-readable instructions to perform the steps of the data extraction method of the line graph in the above embodiments.
The memory 2002, the processor 2001, and the bus are electrically connected to one another, directly or indirectly, to enable data transmission or interaction. For example, these components may be electrically connected to one another via one or more communication buses or signal lines. The data processing apparatus of the line graph data extraction system includes at least one software functional module, which may be stored in the memory 2002 in the form of software or firmware, or solidified in the operating system (OS) of the processing device. The processor 2001 executes the executable modules stored in the memory 2002, such as the software functional modules and computer programs included in the data processing apparatus of the line graph data extraction system.
The memory 2002 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
Optionally, the present application further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program performs the steps of the above method embodiments. The specific implementation and technical effects are similar, and are not described herein again.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to corresponding processes in the method embodiments, and are not described in detail in this application. In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and there may be other divisions in actual implementation, and for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed coupling or direct coupling or communication connection between each other may be through some communication interfaces, indirect coupling or communication connection between devices or modules, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The functions may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. A method for extracting data of a line graph, comprising:
detecting a line graph to be extracted by using a legend detection model, and determining a legend area of the line graph to be extracted;
according to the position information of the blank area in the legend area, carrying out segmentation processing on the legend area to obtain at least one legend identification line area and at least one legend identification name area;
identifying each legend identification line area and each legend identification name area respectively to obtain characteristic information of each legend identification line and content information of each legend identification name;
and extracting data corresponding to the characteristic information in the data area of the line graph to be extracted according to the characteristic information of each legend identification line and the content information of each legend identification name, and obtaining at least one target data value corresponding to the content information.
2. The method for extracting data of a line graph according to claim 1, wherein the dividing the legend area into at least one legend identification line area and at least one legend identification name area comprises:
obtaining a horizontal arrangement graph corresponding to the legend area;
traversing and identifying the pixel change values in the horizontal arrangement graph;
when the pixel point variation value of the target position meets the blank area identification condition, determining the target position as the position information of the blank area;
dividing the horizontal arrangement area according to the position information of the blank areas to determine a plurality of content areas;
and analyzing each content area to determine all the legend identification line areas and the legend identification name areas.
3. The method for extracting data of a line graph according to claim 2, wherein the obtaining of the horizontal arrangement graph corresponding to the legend area comprises:
determining the legend distribution type of the legend area according to the arrangement sequence of the sub-areas to be identified in the legend area;
and if the legend distribution type is a vertical distribution type or a table distribution type, horizontally dividing each sub-area to be identified according to the projection result of each sub-area to be identified in the horizontal direction to obtain a plurality of horizontal arrangement patterns.
4. The method for extracting data of a line chart according to claim 2, wherein the analyzing each of the content areas to determine all of the legend identification line areas and the legend identification name areas comprises:
performing preliminary traversal on each content area, and dividing each content area into a plurality of initial legend identification line areas and a plurality of initial legend identification name areas;
if the initial identification line regions all meet the straight line identification condition, taking the initial identification line regions as legend identification line regions, taking the initial legend identification name regions as legend identification name regions, and marking the types of the legend identification line regions as straight line identification line region types;
if the initial identification line regions all meet the symbol identification condition, taking the initial identification line regions as legend identification line regions, taking the initial legend identification name regions as legend identification name regions, and marking the types of the legend identification line regions as symbol identification line region types.
5. The method of line graph data extraction of claim 4, further comprising:
and correcting each legend identification line region, and taking the legend identification line region meeting identification name recognition conditions as a legend identification name region.
6. The method for extracting data of a line graph according to claim 2, wherein the determining that the target position is position information of a blank area when the pixel point variation value of the target position satisfies a blank condition includes:
carrying out vertical projection on the horizontally arranged graph to obtain a vertical projection graph;
performing projection calculation on each pixel point in the vertical projection graph to obtain a projection value and a projection variation value of each pixel point;
and traversing each pixel point in the horizontal arrangement graph according to the projection value and the projection variation value of each pixel point, and taking the position information of the area where a plurality of pixels meeting the blank area identification condition are located as the position information of the blank area.
7. The method for extracting data of a line drawing according to claim 1, wherein the extracting data corresponding to the feature information in the data area of the line drawing to be extracted according to the feature information of each legend identification line and the content information of each legend identification name to obtain at least one target data value corresponding to the content information comprises:
identifying the line graph to be extracted by using a straight line detection model, and determining a data area and coordinate information of the line graph to be extracted;
and matching the characteristic information of the legend identification line with an area corresponding to the coordinate information in the data area to obtain at least one target data value corresponding to the content information of each legend identification name.
8. The method for extracting data of a line graph according to claim 7, wherein the matching the feature information of the legend identification line with the area corresponding to the coordinate information in the data area to obtain at least one target data value corresponding to the content information of each legend identification name includes:
if the characteristic information of the legend identification line is color characteristics, calculating color value distances between the color characteristics and the color characteristics of a plurality of pixel points in an area corresponding to each horizontal axis scale point in the coordinate information, and taking the pixel points of which the color value distances are smaller than a preset color threshold and meet a preset slope condition as corresponding data value points;
and determining a target data value corresponding to each data value point according to the relative position relation between each data value point and the coordinate information.
9. The method for extracting line drawing data according to claim 7, wherein the matching a position corresponding to coordinate information in the data area according to feature information of the legend identification line and content information of each legend identification name to obtain a plurality of data values corresponding to the content information of each legend identification name includes:
if the characteristic information of the legend identification line is symbol characteristics, calculating the symbol characteristics and areas corresponding to all cross-axis scale points in the coordinate information to perform pattern matching to obtain at least one area to be selected;
taking the coordinates of the central point of at least one to-be-selected area as corresponding data value points;
and determining a target data value corresponding to each data value point according to the relative position relation between each data value point and the coordinate information.
10. A data extraction device of a line graph, characterized by comprising:
the legend detection module is used for detecting the line graph to be extracted by using a legend detection model and determining the legend area of the line graph to be extracted;
the legend segmentation module is used for segmenting the legend area according to the position information of the blank area in the legend area to obtain at least one legend identification line area and at least one legend identification name area;
a legend identification module, configured to respectively identify each legend identification line region and each legend identification name region, to obtain feature information of each legend identification line and content information of each legend identification name;
and the data extraction module is used for extracting data corresponding to the characteristic information in the data area of the line graph to be extracted according to the characteristic information of each legend identification line and the content information of each legend identification name, and acquiring at least one target data value corresponding to the content information.
11. A processing device, characterized in that the processing device comprises: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating via the bus when the processing device is operating, the processor executing the machine-readable instructions to perform the steps of the data extraction method of the line graph according to any one of claims 1-9.
12. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, carries out the steps of the method for data extraction of a line graph according to any one of claims 1-9.
CN202211264165.0A 2022-10-17 2022-10-17 Data extraction method and processing equipment for line graph Active CN115331013B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211264165.0A CN115331013B (en) 2022-10-17 2022-10-17 Data extraction method and processing equipment for line graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211264165.0A CN115331013B (en) 2022-10-17 2022-10-17 Data extraction method and processing equipment for line graph

Publications (2)

Publication Number Publication Date
CN115331013A true CN115331013A (en) 2022-11-11
CN115331013B CN115331013B (en) 2023-02-24

Family

ID=83915519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211264165.0A Active CN115331013B (en) 2022-10-17 2022-10-17 Data extraction method and processing equipment for line graph

Country Status (1)

Country Link
CN (1) CN115331013B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115830431A (en) * 2023-02-08 2023-03-21 湖北工业大学 Neural network image preprocessing method based on light intensity analysis

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9208403B1 (en) * 2014-06-16 2015-12-08 Qualcomm Incorporated Systems and methods for processing image data associated with line detection
CN108470350A (en) * 2018-02-26 2018-08-31 阿博茨德(北京)科技有限公司 Broken line dividing method in line chart and device
JP2018181244A (en) * 2017-04-21 2018-11-15 ウォンテッドリー株式会社 Line segment extraction device, method for controlling line segment extraction device, and program
CN110569774A (en) * 2019-08-30 2019-12-13 武汉大学 Automatic line graph image digitalization method based on image processing and pattern recognition
CN110598634A (en) * 2019-09-12 2019-12-20 山东文多网络科技有限公司 Machine room sketch identification method and device based on graph example library
CN110909732A (en) * 2019-10-14 2020-03-24 杭州电子科技大学上虞科学与工程研究院有限公司 Automatic extraction method of data in graph
CN111580894A (en) * 2020-04-02 2020-08-25 深圳壹账通智能科技有限公司 Data analysis early warning method, device, computer system and readable storage medium
CN112507876A (en) * 2020-12-07 2021-03-16 数地科技(北京)有限公司 Wired table picture analysis method and device based on semantic segmentation
CN112651315A (en) * 2020-12-17 2021-04-13 苏州超云生命智能产业研究院有限公司 Information extraction method and device of line graph, computer equipment and storage medium
CN112819871A (en) * 2021-03-02 2021-05-18 华融融通(北京)科技有限公司 Table image registration method based on linear segmentation
CN113095267A (en) * 2021-04-22 2021-07-09 上海携宁计算机科技股份有限公司 Data extraction method of statistical chart, electronic device and storage medium
CN113705576A (en) * 2021-11-01 2021-11-26 江西中业智能科技有限公司 Text recognition method and device, readable storage medium and equipment
CN113723328A (en) * 2021-09-06 2021-11-30 华南理工大学 Method for analyzing and understanding chart document panel
CN113743187A (en) * 2021-06-22 2021-12-03 万翼科技有限公司 Method and device for identifying legend in engineering drawing, electronic equipment and storage medium
CN114283436A (en) * 2021-12-20 2022-04-05 万翼科技有限公司 Table identification method, device, equipment and storage medium
CN114511862A (en) * 2022-02-17 2022-05-17 北京百度网讯科技有限公司 Form identification method and device and electronic equipment
CN114998428A (en) * 2022-04-14 2022-09-02 杭州电子科技大学 Broken line/curve data extraction system and method based on image processing
CN114998912A (en) * 2022-05-26 2022-09-02 网易(杭州)网络有限公司 Data extraction method and device, electronic equipment and storage medium

Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9208403B1 (en) * 2014-06-16 2015-12-08 Qualcomm Incorporated Systems and methods for processing image data associated with line detection
JP2018181244A (en) * 2017-04-21 2018-11-15 ウォンテッドリー株式会社 Line segment extraction device, method for controlling line segment extraction device, and program
CN108470350A (en) * 2018-02-26 2018-08-31 阿博茨德(北京)科技有限公司 Broken line dividing method in line chart and device
US20190266395A1 (en) * 2018-02-26 2019-08-29 Abc Fintech Co., Ltd. Method and device for segmenting lines in line chart
CN110569774A (en) * 2019-08-30 2019-12-13 武汉大学 Automatic line graph image digitalization method based on image processing and pattern recognition
CN110598634A (en) * 2019-09-12 2019-12-20 山东文多网络科技有限公司 Machine room sketch identification method and device based on graph example library
US20210110194A1 (en) * 2019-10-14 2021-04-15 Hangzhou Dianzi University Method for automatic extraction of data from graph
CN110909732A (en) * 2019-10-14 2020-03-24 杭州电子科技大学上虞科学与工程研究院有限公司 Automatic extraction method of data in graph
CN111580894A (en) * 2020-04-02 2020-08-25 深圳壹账通智能科技有限公司 Data analysis early warning method, device, computer system and readable storage medium
CN112507876A (en) * 2020-12-07 2021-03-16 数地科技(北京)有限公司 Wired table picture analysis method and device based on semantic segmentation
CN112651315A (en) * 2020-12-17 2021-04-13 苏州超云生命智能产业研究院有限公司 Information extraction method and device of line graph, computer equipment and storage medium
CN112819871A (en) * 2021-03-02 2021-05-18 华融融通(北京)科技有限公司 Table image registration method based on linear segmentation
CN113095267A (en) * 2021-04-22 2021-07-09 上海携宁计算机科技股份有限公司 Data extraction method of statistical chart, electronic device and storage medium
CN113743187A (en) * 2021-06-22 2021-12-03 万翼科技有限公司 Method and device for identifying legend in engineering drawing, electronic equipment and storage medium
CN113723328A (en) * 2021-09-06 2021-11-30 华南理工大学 Method for analyzing and understanding chart document panel
CN113705576A (en) * 2021-11-01 2021-11-26 江西中业智能科技有限公司 Text recognition method and device, readable storage medium and equipment
CN114283436A (en) * 2021-12-20 2022-04-05 万翼科技有限公司 Table identification method, device, equipment and storage medium
CN114511862A (en) * 2022-02-17 2022-05-17 北京百度网讯科技有限公司 Form identification method and device and electronic equipment
CN114998428A (en) * 2022-04-14 2022-09-02 杭州电子科技大学 Broken line/curve data extraction system and method based on image processing
CN114998912A (en) * 2022-05-26 2022-09-02 网易(杭州)网络有限公司 Data extraction method and device, electronic equipment and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
TAN LU et al.: "Probabilistic homogeneity for document image segmentation", Pattern Recognition *
TUAN ANH TRAN et al.: "A mixture model using Random Rotation Bounding Box to detect table region in document image", Journal of Visual Communication and Image Representation *
HAO Shengli: "Algorithm Improvements in Table Recognition", China Master's Theses Full-text Database, Information Science and Technology Series *
HAN Bing: "Research on Automatic Classification and Information Extraction Methods for Ubiquitous Statistical Charts", China Master's Theses Full-text Database, Basic Sciences Series *


Also Published As

Publication number Publication date
CN115331013B (en) 2023-02-24

Similar Documents

Publication Publication Date Title
CN108470021B (en) Method and device for positioning table in PDF document
CN109740469B (en) Lane line detection method, lane line detection device, computer device, and storage medium
CN110232311B (en) Method and device for segmenting hand image and computer equipment
US10878003B2 (en) System and method for extracting structured information from implicit tables
KR20000047428A (en) Apparatus and Method for Recognizing Character
CN110321837B (en) Test question score identification method, device, terminal and storage medium
CN115331013B (en) Data extraction method and processing equipment for line graph
CN111340020B (en) Formula identification method, device, equipment and storage medium
CN113095267B (en) Data extraction method of statistical chart, electronic device and storage medium
JP4704601B2 (en) Character recognition method, program, and recording medium
CN109409180B (en) Image analysis device and image analysis method
JP3728224B2 (en) Document processing apparatus and method
CN107798355B (en) Automatic analysis and judgment method based on document image format
WO2019185245A2 (en) An image processing system and an image processing method
CN115373534A (en) Handwriting presenting method and device, interactive panel and storage medium
CN112417826A (en) PDF online editing method and device, electronic equipment and readable storage medium
CN112380812A (en) Method, device, equipment and storage medium for extracting incomplete frame line table of PDF (Portable document Format)
CN112084103A (en) Interface test method, device, equipment and medium
CN113449763A (en) Information processing apparatus and recording medium
JP2581353B2 (en) Graph image registration system
JP5402417B2 (en) Image processing device
JP2630261B2 (en) Character recognition device
JP2022051199A (en) Image determination device, image determination method, and program
JP3100825B2 (en) Line recognition method
JP2576080B2 (en) Character extraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant