CN115331013A - Data extraction method and processing equipment for line graph


Info

Publication number
CN115331013A
CN115331013A (application number CN202211264165.0A; granted publication CN115331013B)
Authority
CN
China
Prior art keywords
legend
area
line
identification
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211264165.0A
Other languages
Chinese (zh)
Other versions
CN115331013B
Inventor
孙勇
顾文斌
杨祎聪
李晓平
丁雪纯
于业达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Hengsheng Juyuan Data Service Co ltd
Hangzhou Hengsheng Juyuan Information Technology Co ltd
Original Assignee
Shanghai Hengsheng Juyuan Data Service Co ltd
Hangzhou Hengsheng Juyuan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Hengsheng Juyuan Data Service Co ltd, Hangzhou Hengsheng Juyuan Information Technology Co ltd filed Critical Shanghai Hengsheng Juyuan Data Service Co ltd
Priority to CN202211264165.0A priority Critical patent/CN115331013B/en
Publication of CN115331013A publication Critical patent/CN115331013A/en
Application granted granted Critical
Publication of CN115331013B publication Critical patent/CN115331013B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 10/26: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06T 7/11: Image analysis; segmentation; region-based segmentation
    • G06V 10/56: Extraction of image or video features relating to colour
    • G06V 10/74: Image or video pattern matching; proximity measures in feature spaces
    • G06V 10/764: Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 30/153: Character recognition; segmentation of character regions using recognition of characters or words

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Generation (AREA)

Abstract

The embodiments of the present application provide a data extraction method and processing device for line graphs, relating to the field of image pattern recognition. A legend detection model detects the line graph to be extracted and determines its legend area. Based on the position information of the blank areas within the legend area, the legend area is segmented into at least one legend identification line area and at least one legend identification name area. Each legend identification line area and each legend identification name area is then recognized separately, yielding the feature information of each legend identification line and the content information of each legend identification name. Finally, data corresponding to the feature information is extracted from the data area of the line graph according to the feature information of each legend identification line and the content information of each legend identification name, producing at least one target data value corresponding to the content information. This legend segmentation approach improves legend recognition and, in turn, the accuracy of line graph data extraction.

Description

Data extraction method and processing equipment for line graph
Technical Field
The application relates to the field of image pattern recognition, in particular to a data extraction method and processing equipment of a line graph.
Background
A line graph is a statistical chart that describes the dynamics of aggregate statistical indicators, the interdependencies between study objects, and the distribution of component parts. In the financial field, a large amount of data is presented as line graphs; this data is significant for financial services and financial analysis and needs to be extracted from the graphs. However, currently available line graphs are generally stored in image form, which makes subsequent data extraction and analysis more difficult.
At present, legend data analysis of line graphs mainly applies Optical Character Recognition (OCR) to recognize the legend area as a whole, obtaining the names of the legends; a colored region is then searched forward from each legend name and taken as the line corresponding to that text, and multiple data values corresponding to each legend name are determined in the line graph by matching the color of the obtained legend identification line against the colors in the data area.
However, recognizing the legend area as a whole easily misidentifies the identification lines in the legend as characters, leading to errors in the legend recognition result.
Disclosure of Invention
The application provides a data extraction method and processing device for line graphs that divides the legend area into legend identification line areas and legend identification name areas and recognizes each separately. This avoids the legend recognition errors caused by legend identification lines being recognized as characters when the legend area is recognized as a whole, and improves the accuracy of legend recognition.
The embodiment of the application can be realized as follows:
in a first aspect, an embodiment of the present application provides a method for extracting data of a line graph, including:
detecting a line graph to be extracted by using a legend detection model, and determining a legend area of the line graph to be extracted;
according to the position information of the blank area in the legend area, carrying out segmentation processing on the legend area to obtain at least one legend identification line area and at least one legend identification name area;
respectively identifying each legend identification line area and each legend identification name area to obtain the characteristic information of each legend identification line and the content information of each legend identification name;
and extracting data corresponding to the characteristic information in the data area of the line graph to be extracted according to the characteristic information of each legend identification line and the content information of each legend identification name, and obtaining at least one target data value corresponding to the content information.
In an optional implementation manner, the dividing the legend area to obtain at least one legend identification line area and at least one legend identification name area includes:
obtaining a horizontal arrangement graph corresponding to the legend area;
traversing and identifying the pixel change value in the horizontal arrangement graph;
when the pixel point change value of the target position meets the blank area identification condition, determining the target position as the position information of the blank area;
dividing the horizontal arrangement area according to the position information of the blank areas to determine a plurality of content areas;
and analyzing each content area to determine all the legend identification line areas and the legend identification name areas.
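The blank-area segmentation steps above can be sketched roughly as follows. This is an illustrative Python/NumPy sketch, not the patented implementation: the function name, the binarized input, and the `blank_width` threshold are all assumptions.

```python
import numpy as np

def split_legend_row(binary_row, blank_width=3):
    """Split a horizontally arranged legend strip into content regions.

    binary_row: 2-D array, nonzero where ink (legend line or text) is present.
    A run of columns whose vertical projection is zero for at least
    `blank_width` pixels is treated as a blank area separating regions.
    """
    projection = binary_row.sum(axis=0)          # vertical projection per column
    blank = projection == 0
    regions, start = [], None
    for x, is_blank in enumerate(blank):
        if not is_blank and start is None:
            start = x                            # a content region begins here
        elif is_blank and start is not None:
            run_end = x                          # measure the blank run's width
            while run_end < len(blank) and blank[run_end]:
                run_end += 1
            # close the region only if the blank run is wide enough
            if run_end - x >= blank_width or run_end == len(blank):
                regions.append((start, x))
                start = None
    if start is not None:
        regions.append((start, len(blank)))
    return regions
```

Narrow gaps shorter than `blank_width` (such as the spacing between characters of one legend identification name) stay inside a single region, so each legend identification line and each legend identification name ends up as its own content region.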
In an alternative embodiment, the obtaining the horizontal arrangement pattern corresponding to the legend area includes:
determining the legend distribution type of the legend area according to the arrangement sequence of the sub-areas to be identified in the legend area;
and if the legend distribution type is a vertical distribution type or a table distribution type, horizontally dividing each sub-area to be identified according to the projection result of each sub-area to be identified in the horizontal direction to obtain a plurality of horizontal arrangement patterns.
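For vertically or table-distributed legends, the horizontal division by projection might look like the sketch below (illustrative only; the function name and the zero-projection split criterion are assumptions):

```python
import numpy as np

def split_rows(binary_legend):
    """Split a vertically (or table-) arranged legend area into row strips.

    Rows are separated wherever the horizontal projection (ink count per
    pixel row) drops to zero, mirroring the horizontal division step.
    """
    projection = binary_legend.sum(axis=1)   # ink count per pixel row
    rows, start = [], None
    for y, ink in enumerate(projection):
        if ink > 0 and start is None:
            start = y                        # a row of legend content begins
        elif ink == 0 and start is not None:
            rows.append((start, y))          # blank pixel row ends it
            start = None
    if start is not None:
        rows.append((start, len(projection)))
    return rows
```

Each returned strip is then a horizontal arrangement pattern that can be segmented by the blank-column step described earlier.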
In an optional implementation, the analyzing each content area to determine all the legend identification line areas and the legend identification name areas includes:
performing preliminary traversal on each content area, and dividing each content area into a plurality of initial legend identification line areas and a plurality of initial legend identification name areas;
if each initial legend identification line region satisfies the straight-line identification condition, taking the initial legend identification line regions as legend identification line regions, taking the initial legend identification name regions as legend identification name regions, and marking the type of each legend identification line region as the straight-line identification line region type;
if each initial legend identification line region satisfies the symbol identification condition, taking the initial legend identification line regions as legend identification line regions, taking the initial legend identification name regions as legend identification name regions, and marking the type of each legend identification line region as the symbol identification line region type.
In an alternative embodiment, the method further comprises:
and correcting each legend identification line region, and taking the legend identification line region meeting identification name recognition conditions as a legend identification name region.
In an optional implementation manner, the determining, when the pixel point change value of the target position satisfies the blank area identification condition, that the target position is the position information of the blank area includes:
performing vertical projection on the horizontally arranged pattern to obtain a vertical projection pattern;
performing projection calculation on each pixel point in the vertical projection graph to obtain a projection value and a projection variation value of each pixel point;
and traversing each pixel point in the horizontal arrangement graph according to the projection value and the projection variation value of each pixel point, and taking the position information of the area where a plurality of pixels meeting the blank area identification condition are located as the position information of the blank area.
In an optional implementation manner, the extracting, according to feature information of each legend identification line and content information of each legend identification name, data corresponding to the feature information in a data area of the line graph to be extracted to obtain at least one target data value corresponding to the content information includes:
identifying the line graph to be extracted by using a straight line detection model, and determining a data area and coordinate information of the line graph to be extracted;
and matching the characteristic information of the legend identification line with an area corresponding to the coordinate information in the data area to obtain at least one target data value corresponding to the content information of each legend identification name.
In an optional implementation manner, the matching the feature information of the legend identification line with an area corresponding to coordinate information in the data area to obtain at least one target data value corresponding to content information of each legend identification name includes:
if the characteristic information of the legend identification line is color characteristics, calculating color value distances between the color characteristics and the color characteristics of a plurality of pixel points in an area corresponding to each horizontal axis scale point in the coordinate information, and taking the pixel points of which the color value distances are smaller than a preset color threshold and meet a preset slope condition as corresponding data value points;
and determining a target data value corresponding to each data value point according to the relative position relation between each data value point and the coordinate information.
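A minimal sketch of the colour-matching step above, assuming an RGB image array and precomputed x-axis tick columns. All names, the Euclidean colour distance, and the threshold value are illustrative assumptions; the preset slope condition is mentioned in a comment but not implemented:

```python
import numpy as np

def match_color_points(image, legend_rgb, x_ticks, threshold=30.0):
    """For each x-axis tick column, find the pixel whose colour is closest
    to the legend line's colour, accepting it if the colour-value distance
    is below `threshold`. Returns a {x: y} mapping for matched columns.
    A fuller implementation would also apply the slope-continuity check
    between neighbouring data value points described in the method.
    """
    legend = np.asarray(legend_rgb, dtype=float)
    points = {}
    for x in x_ticks:
        column = image[:, x, :].astype(float)             # every pixel in this column
        dist = np.linalg.norm(column - legend, axis=1)    # colour-value distance
        y = int(dist.argmin())
        if dist[y] < threshold:
            points[x] = y
    return points

def pixel_to_value(y, y_axis_top, y_axis_bottom, v_top, v_bottom):
    """Map a matched pixel row back to a data value via axis calibration."""
    frac = (y_axis_bottom - y) / (y_axis_bottom - y_axis_top)
    return v_bottom + frac * (v_top - v_bottom)
```

The second helper illustrates the final step: converting each data value point to a target data value from its position relative to the coordinate information.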
In an optional implementation manner, the matching the feature information of the legend identification line with an area corresponding to the coordinate information in the data area to obtain at least one target data value corresponding to the content information of each legend identification name includes:
if the feature information of the legend identification line is a symbol feature, performing pattern matching between the symbol feature and the region corresponding to each horizontal-axis scale point in the coordinate information to obtain at least one candidate region;
taking the center point coordinates of the at least one candidate region as the corresponding data value points;
and determining a target data value corresponding to each data value point according to the relative position relation between each data value point and the coordinate information.
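The symbol pattern-matching step could look roughly like the following. This is a dependency-free sketch with assumed names; an exact-match sum-of-squared-differences criterion stands in for a real template-matching score such as the one cv2.matchTemplate would provide:

```python
import numpy as np

def match_symbol(region, template):
    """Slide the legend-symbol template over a candidate region and return
    the centre coordinates of every exact match (SSD == 0), corresponding
    to the data value points of a symbol-type legend identification line.
    """
    rh, rw = region.shape
    th, tw = template.shape
    centres = []
    for y in range(rh - th + 1):
        for x in range(rw - tw + 1):
            patch = region[y:y + th, x:x + tw]
            # exact match: sum of squared differences is zero
            if np.sum((patch.astype(int) - template.astype(int)) ** 2) == 0:
                centres.append((x + tw // 2, y + th // 2))
    return centres
```

Each returned centre point would then be converted to a target data value from its position relative to the coordinate information, exactly as in the colour-feature case.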
In a second aspect, an embodiment of the present application further provides a data extraction apparatus for a line graph, including:
the legend detection module is used for detecting the line graph to be extracted by using a legend detection model and determining the legend area of the line graph to be extracted;
the legend segmentation module is used for segmenting the legend area according to the position information of the blank area in the legend area to obtain at least one legend identification line area and at least one legend identification name area;
a legend identification module, configured to identify each legend identification line region and each legend identification name region respectively, to obtain feature information of each legend identification line and content information of each legend identification name;
and the data extraction module is used for extracting data corresponding to the characteristic information in the data area of the line graph to be extracted according to the characteristic information of each legend identification line and the content information of each legend identification name, and acquiring at least one target data value corresponding to the content information.
The legend segmentation module is specifically further used for obtaining a horizontal arrangement pattern corresponding to the legend area; traversing and identifying the pixel change value in the horizontal arrangement graph; when the pixel point variation value of the target position meets the blank area identification condition, determining the target position as the position information of the blank area; dividing the horizontal arrangement area according to the position information of the blank areas to determine a plurality of content areas; and analyzing each content area to determine all the legend identification line areas and the legend identification name areas.
The legend segmentation module is further specifically configured to determine a legend distribution type of the legend region according to an arrangement order of each sub-region to be identified in the legend region; and if the legend distribution type is a vertical distribution type or a table distribution type, horizontally dividing each sub-area to be identified according to the projection result of each sub-area to be identified in the horizontal direction to obtain a plurality of horizontal arrangement patterns.
The legend segmentation module is further specifically configured to perform a preliminary traversal of each content area and divide each content area into a plurality of initial legend identification line areas and a plurality of initial legend identification name areas; if each initial legend identification line region satisfies the straight-line identification condition, take the initial legend identification line regions as legend identification line regions, take the initial legend identification name regions as legend identification name regions, and mark the type of each legend identification line region as the straight-line identification line region type; if each initial legend identification line region satisfies the symbol identification condition, take the initial legend identification line regions as legend identification line regions, take the initial legend identification name regions as legend identification name regions, and mark the type of each legend identification line region as the symbol identification line region type.
The legend segmentation module is further specifically configured to correct each legend identification line region and take any legend identification line region satisfying the identification name recognition condition as a legend identification name region.
The legend segmentation module is specifically further used for vertically projecting the horizontally arranged graphs to obtain vertical projection graphs; performing projection calculation on each pixel point in the vertical projection graph to obtain a projection value and a projection variation value of each pixel point; and traversing each pixel point in the horizontal arrangement graph according to the projection value and the projection variation value of each pixel point, and taking the position information of the area where a plurality of pixels meeting the blank area identification condition are located as the position information of the blank area.
The data extraction module is specifically further configured to identify the line graph to be extracted by using a straight line detection model, and determine a data area and coordinate information of the line graph to be extracted; and matching the characteristic information of the legend identification line with an area corresponding to the coordinate information in the data area to obtain at least one target data value corresponding to the content information of each legend identification name.
The data extraction module is specifically further configured to, if the feature information of the legend identification line is a color feature, calculate color value distances between the color feature and color features of a plurality of pixel points in an area corresponding to each horizontal axis scale point in the coordinate information, and take pixel points, of which the color value distances are smaller than a preset color threshold and meet a preset slope condition, as corresponding data value points; and determining a target data value corresponding to each data value point according to the relative position relation between each data value point and the coordinate information.
The data extraction module is specifically further configured to, if the feature information of the legend identification line is a symbol feature, perform pattern matching between the symbol feature and the region corresponding to each horizontal-axis scale point in the coordinate information to obtain at least one candidate region; take the center point coordinates of the at least one candidate region as the corresponding data value points; and determine the target data value corresponding to each data value point according to the relative position relationship between each data value point and the coordinate information.
In a third aspect, an embodiment of the present application provides a processing apparatus, including: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating via the bus when the processing device is running, the processor executing the machine-readable instructions to perform the steps of the data extraction method of the line graph according to any one of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps of the data extraction method for a line graph according to any one of the first aspect.
The beneficial effects of the embodiment of the application include:
by adopting the data extraction method and processing device for line graphs provided by the embodiments of the present application, the legend area can be segmented to obtain each legend identification line area and each legend identification name area, and each area is then recognized to determine the corresponding legend identification line and legend identification name. This solves the problem of legend identification lines and legend identification names being confused when the legend area is recognized as a whole, and improves the accuracy of legend recognition.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered limiting of the scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
FIG. 1 is a content identifier included in a line graph;
fig. 2 is a schematic flowchart illustrating steps of a data extraction method for a line graph according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a segmentation of a data extraction method of a line graph according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an X-axis projection of a data extraction method for a line graph provided in an embodiment of the present application;
fig. 5 is a schematic diagram of an identification result of a data extraction method of a line graph according to an embodiment of the present application;
fig. 6 is a flowchart illustrating another step of the data extraction method for line charts according to the embodiment of the present application;
fig. 7 is a schematic Y-axis projection diagram of a data extraction method of a line graph according to an embodiment of the present application;
FIG. 8 is a flowchart illustrating another step of a data extraction method for line graphs according to an embodiment of the present application;
FIG. 9 is a table distribution type diagram of a line graph provided in an embodiment of the present application;
FIG. 10 is a flowchart illustrating another step of a data extraction method for line graphs according to an embodiment of the present application;
fig. 11 is a schematic diagram illustrating a broken line identification of a data extraction method of a line graph according to an embodiment of the present application;
FIG. 12 is a flowchart illustrating another step of a data extraction method for line graphs according to an embodiment of the present application;
FIG. 13 is a flowchart illustrating another step of a data extraction method for line graphs according to an embodiment of the present application;
FIG. 14 is a flowchart illustrating another step of a data extraction method for line graphs according to an embodiment of the present application;
FIG. 15 is a flowchart illustrating another step of a data extraction method for line graphs according to an embodiment of the present application;
FIG. 16 is a schematic diagram illustrating symbolic feature recognition of a data extraction method for a line graph according to an embodiment of the present application;
fig. 17 is a schematic structural diagram of a data extraction device of a line graph according to an embodiment of the present application;
fig. 18 is a schematic structural diagram of a processing apparatus according to an embodiment of the present application.
Reference numerals: 101-legend area; 102-legend identification line; 103-legend identification name; 104-Y axis; 105-Y axis scale point; 106-X axis scale point; 107-X axis; 108-data area; 1011-first legend identification line area; 1012-first legend identification name area; 1013-second legend identification line area; 1014-second legend identification name area; 100-data extraction device of a line graph; 1001-legend detection module; 1002-legend segmentation module; 1003-legend identification module; 1004-data extraction module; 2001-processor; 2002-memory.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
Furthermore, the appearances of the terms "first," "second," and the like, if any, are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
It should be noted that the features of the embodiments of the present application may be combined with each other without conflict.
The financial field contains a large amount of data represented in chart form, such as line graphs, bar graphs, scatter plots, and pie charts, and different charts represent data differently. The line graph is one of the most widely used chart types and contains a large amount of financial data that may be significant for financial services and financial analysis. As shown in fig. 1, a line graph mainly includes: a legend area 101, a legend identification line 102, a legend identification name 103, a Y-axis 104, a Y-axis tick mark 105, an X-axis tick mark 106, an X-axis 107, and a data area 108 where the polyline is located.
The legend area 101 includes at least one legend, and each set of legends includes a legend identification line 102 and a legend identification name 103 having a corresponding relationship. In addition, the data area 108 where the polyline is located comprises either the first quadrant enclosed by the positive half-axis of the X-axis 107 and the positive half-axis of the Y-axis 104, or the first and fourth quadrants corresponding to the positive half-axis of the X-axis 107 and the positive half-axis of the Y-axis 104.
Although extracting the data in a line graph is a necessary part of the data analysis process, line graphs obtained from data sources such as news, reports, and books are currently generally available only in image form, which creates considerable difficulty for subsequent data extraction.
Currently, legend data analysis of line graphs mainly adopts a pattern recognition approach: OCR technology is used to recognize the legend area 101 in fig. 1 as a whole to obtain the text content of the legend identification names, and forward detection is then performed from the position of each legend identification name; if a colored line is detected before a legend identification name, it is taken as the legend identification line corresponding to that name. The colors of the legend identification lines are then matched against the polylines of the data area 108 to determine the polyline corresponding to each legend identification name in the data area. Finally, multiple data points corresponding to each legend identification name are extracted according to the relative positions of the polyline and the coordinate axes.
In the prior art, where the whole legend area 101 is recognized to determine the legend identification names, graphic information is easily recognized as text information (or text as graphics) when a legend identification line is close to a legend identification name. For example, a short line segment may be recognized as the character "one", or, when two legends are too close to each other, they may be recognized as a single merged legend, resulting in incorrect legend recognition.
Based on this, the embodiments of the present application provide a data extraction method and processing device for line graphs that segments the legend area to obtain each legend identification line area and legend identification name area and then recognizes each area separately to determine the corresponding legend identification lines and legend identification names. This solves the recognition confusion caused by recognizing the legend area as a whole and improves the accuracy of legend recognition.
The embodiments of the present application provide a method for extracting data of a line graph and a processing device, which are described in detail below.
Fig. 2 is a schematic flowchart illustrating steps of a data extraction method of a line graph according to an embodiment of the present application, where an execution subject of the method may be a computer device with computing and processing capabilities, as shown in fig. 2, the method includes:
S201, detecting the line graph to be extracted by using a legend detection model, and determining the legend area of the line graph to be extracted.
The line graph to be extracted may have the same content form as the line graph shown in fig. 1, and is stored as a picture. Optionally, the position of the legend area 101 in the line graph to be extracted is not limited here.
The legend detection model analyzes the line graph to be extracted using an image segmentation network and determines, via the minimum circumscribed rectangle, the outline position of the minimum rectangular area occupied by the legend area; this outline position marks the legend area of the line graph to be extracted. Optionally, the legend detection model may be a U-NET model, or another common semantic segmentation model such as FCN or Seg-NET, which is not limited here.
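As a minimal sketch of the minimum-circumscribed-rectangle step, assuming the segmentation network outputs a binary mask of the legend area (the model itself is not reproduced here, and the helper name is illustrative), the outline position can be recovered with an axis-aligned bounding box:

```python
import numpy as np

def legend_bounding_rect(mask):
    """Return (x, y, w, h) of the minimal axis-aligned rectangle
    enclosing all nonzero pixels of a binary segmentation mask."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None  # no legend detected
    x0, x1 = xs.min(), xs.max()
    y0, y1 = ys.min(), ys.max()
    return (int(x0), int(y0), int(x1 - x0 + 1), int(y1 - y0 + 1))

# toy mask: the legend occupies rows 2..4, columns 3..9
mask = np.zeros((8, 12), dtype=np.uint8)
mask[2:5, 3:10] = 1
print(legend_bounding_rect(mask))  # (3, 2, 7, 3)
```

A rotated minimum-area rectangle (for example OpenCV's `minAreaRect`) could be substituted when legend areas are not axis-aligned.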
S202, according to the position information of the blank area in the legend area, the legend area is divided to obtain at least one legend identification line area and at least one legend identification name area.
With continued reference to fig. 1, at least one different set of legends may be included within legend area 101, each set of legends including a legend identification line and a legend identification name having a correspondence. Referring to fig. 3, the legend area includes two groups of legends as an example, and optionally, each group of legends is respectively formed by a legend identification name and a legend identification line.
As shown in fig. 3, the legend area 101 may be divided according to the location information of the blank area in the legend area, and then a first legend identification line area 1011, a second legend identification line area 1013, a first legend identification name area 1012, and a second legend identification name area 1014 are obtained according to different characteristics corresponding to the legend identification names and legend identification lines in the legend area 101.
S203, identifying each legend identification line area and each legend identification name area respectively, to obtain the feature information of each legend identification line and the content information of each legend identification name.
Optionally, the above feature information may be used to uniquely identify the type of the legend identification line. Specifically, the feature information may be a color feature distinguished by color, or a symbol feature distinguished by symbol. For example, referring to fig. 1, where different broken lines have different colors, such as black and gray, the different legend identification lines and their corresponding broken lines can be uniquely identified based on color features. For another example, in fig. 3, the first legend identification line region 1011 contains a triangle-symbol legend identification line, and the second legend identification line region 1013 contains a square-symbol legend identification line. It is understood that different types of legend identification lines have different feature information: a color legend identification line has color features, a symbol legend identification line has symbol features, and the manner of extracting feature information from legend identification line regions containing the different types of legend identification lines differs accordingly.
If the legend identification line in a certain legend identification line region is identified as a color legend identification line type, in a possible implementation manner, the RGB value of the legend identification line may be extracted as the feature information of the legend identification line.
If the legend identification line in a certain legend identification line area is identified as the type of the symbol legend identification line, in a possible manner, the symbol image of the legend identification line may be extracted as the feature information of the legend identification line.
Optionally, after at least one legend identification name area is determined, an OCR method may be used to identify each legend identification name area to determine the text content of each legend identification name as the content information of each legend identification name.
S204, according to the feature information of each legend identification line and the content information of each legend identification name, extracting data corresponding to the feature information from the data area of the line graph to be extracted, to obtain at least one target data value corresponding to the content information.
The broken line corresponding to each legend identification line can be found in the data area of the line graph according to its feature information, and data extraction is performed on each broken line to obtain a plurality of data values corresponding to each legend identification line. Further, according to the correspondence between each legend identification line and its legend identification name, the plurality of data values corresponding to the content information of that legend identification name are determined.
In step S203, when it is recognized that the legend identification line in a certain legend identification line area is a symbol legend identification line type, in order to extract and obtain a symbol image of the legend identification line as feature information, the following method may be adopted:
First, an X-axis projection, that is, a vertical projection, is applied to the legend identification line region to obtain a corresponding projection sequence. As shown in fig. 4, the upper image of fig. 4 is projected onto the X-axis to obtain the projection sequence shown in the lower image of fig. 4. Note that each pixel in the projection sequence corresponds to its own projection value.
Each pixel in the projection sequence is then traversed from left to right. The interval between the coordinate value y1 of the first pixel point whose projection value is not 0 and the coordinate value y2 of the last pixel point whose projection value is not 0 is taken as the area where the legend identification line is located, and the remaining parts are removed as vertical blanks, thereby obtaining the legend identification line area with the vertical blanks removed.
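A sketch of the X-axis (vertical) projection and the removal of vertical blanks, assuming the legend identification line region is given as a binary NumPy array (the helper names `x_projection` and `trim_blank_columns` are illustrative):

```python
import numpy as np

def x_projection(region):
    """Vertical (X-axis) projection: count of foreground pixels per column."""
    return (region > 0).sum(axis=0)

def trim_blank_columns(region):
    """Keep only the columns between the first and last nonzero
    projection value, discarding the blank margins."""
    proj = x_projection(region)
    nz = np.nonzero(proj)[0]
    return region[:, nz[0]:nz[-1] + 1]

region = np.zeros((4, 10), dtype=np.uint8)
region[2, 2:8] = 1              # a horizontal legend line in columns 2..7
trimmed = trim_blank_columns(region)
print(trimmed.shape)            # (4, 6)
```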
Further, the neighborhood projection variation value std_i of each pixel in the legend identification line area with the vertical blanks removed can be calculated by the following formula:

std_i = (1/(2k)) · Σ_{j=i−k}^{i+k−1} |p_j − p_{j+1}|

where k denotes the neighborhood range size of the current pixel, p_i denotes the projection value of the current pixel i, and |p_i − p_{i+1}| denotes the absolute value of the difference between the projection value of the current pixel i and that of the next pixel i + 1. Illustratively, k may be 2.
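As one possible reading of the neighborhood projection variation value std_i, taken here as the mean absolute difference of adjacent projection values within a ±k window (an assumption, since the patent's formula image is not reproduced in the text; the window is clipped at the sequence boundaries), a small sketch:

```python
import numpy as np

def neighborhood_std(proj, k=2):
    """Mean absolute difference |p_j - p_{j+1}| over the +/-k neighborhood
    of each pixel i of a projection sequence (boundary-clipped)."""
    n = len(proj)
    out = np.zeros(n)
    for i in range(n):
        lo = max(i - k, 0)
        hi = min(i + k, n - 1)       # differences taken for j in [lo, hi-1]
        diffs = [abs(proj[j] - proj[j + 1]) for j in range(lo, hi)]
        out[i] = sum(diffs) / len(diffs) if diffs else 0.0
    return out

proj = np.array([0., 0., 3., 3., 3., 3., 3., 3., 0., 0.])
std = neighborhood_std(proj, k=2)
print(std[4], std[5])   # 0.0 0.0  (the flat interior of a line barely varies)
```

Pixels in the flat horizontal-line part of a legend identification line get std values near zero, while pixels near a symbol or an edge get large values, which is what the conditions below exploit.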
If, within the legend identification line area with the vertical blanks removed, there are m consecutive pixels whose neighborhood projection variation values std are all smaller than a preset variation threshold and whose projection values are all greater than zero, the section from the first of these m pixels to the last pixel a of the run can be determined as the starting horizontal-line section of the legend identification line. Illustratively, m may be 5, and the preset variation threshold may be 0.1.
If, within the legend identification line area with the vertical blanks removed, the neighborhood projection variation values std of the pixels following the end pixel a of the starting horizontal-line section are all larger than the preset variation threshold, and the distance between those subsequent pixels and pixel a is greater than a preset distance threshold, then pixel a is reset as the starting coordinate of the symbol on the legend identification line, and a pixel b is identified as the ending coordinate of the symbol. Here, pixel b is the pixel from which there are again m consecutive pixels whose neighborhood projection variation values std are all smaller than the preset variation threshold and whose projection values are all greater than zero; that is, the m pixels beginning at pixel b form the ending horizontal-line section of the legend identification line.
Then, the image of the area between the starting horizontal line and the ending horizontal line, that is, the area [a, b], is extracted as the symbol identification image. The symbol identification image can be used as the feature information characterizing the legend identification line by its symbol features.
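Putting the above together, a hedged sketch of locating the symbol interval [a, b] between the starting and ending horizontal-line runs; it assumes the projection sequence and its std values are precomputed, and it simplifies away the distance-threshold check on the pixels following pixel a:

```python
import numpy as np

def find_symbol_interval(proj, std, m=5, thr=0.1):
    """Locate the symbol section between the starting and ending
    horizontal-line runs of a symbol-type legend identification line.
    A horizontal-line pixel has proj > 0 and std < thr; a run needs
    m consecutive such pixels."""
    n = len(proj)
    def is_line(i):
        return proj[i] > 0 and std[i] < thr
    a, run = None, 0
    for i in range(n):                       # starting run, scanned from left
        run = run + 1 if is_line(i) else 0
        if run >= m:
            a = i                            # a = last pixel of the starting run
        elif a is not None:
            break
    b, run = None, 0
    for i in range(n - 1, -1, -1):           # ending run, scanned from right
        run = run + 1 if is_line(i) else 0
        if run >= m:
            b = i                            # b = first pixel of the ending run
        elif b is not None:
            break
    if a is None or b is None or b <= a + 1:
        return None                          # no symbol between the runs
    return (a + 1, b - 1)

proj = np.ones(25)
std = np.array([0.0] * 10 + [1.0] * 5 + [0.0] * 10)
print(find_symbol_interval(proj, std))                 # (10, 14)
print(find_symbol_interval(np.ones(8), np.zeros(8)))   # None (plain line)
```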
Fig. 5 is a schematic diagram of the recognition effect of the data extraction method for a line graph provided in the embodiment of the present application. Performing data extraction on the line graph on the left side of fig. 5 by this method yields, for the content of each legend identification name, a plurality of data values at the positions of the horizontal-axis scale marks.
In this embodiment, the legend area is divided to obtain each legend identification line area and legend identification name, and each area is identified respectively to determine the corresponding legend identification line and legend identification name, so that the problem of confusion of legend identification line and legend identification name caused by overall identification of the legend area is solved, and the accuracy of legend identification is improved.
Optionally, since the legend area includes a plurality of legend identification line areas and corresponding legend identification name areas, in order to segment each legend identification line area and its corresponding legend identification name area, as shown in fig. 6, in the above step S202, segmenting the legend area to obtain at least one legend identification line area and at least one legend identification name area may be implemented by the following steps S301 to S305.
S301, obtaining a horizontal arrangement pattern corresponding to the legend area.
The horizontal arrangement pattern is a pattern in which the legends in the legend areas are sequentially arranged in the horizontal direction, for example, a pattern corresponding to the arrangement of the legends in the legend area 101 in fig. 1, or a pattern corresponding to the arrangement of the legends in the legend area 101 shown in fig. 3.
It should be noted that the arrangement manner of the legend may also be a column arrangement, a table format arrangement, or other arrangement manner as shown in fig. 7, and the legend areas arranged in other manner may be converted into horizontal arrangement patterns in advance.
S302, traversing and identifying the pixel change values in the horizontal arrangement graph.
Through the above embodiment, the neighborhood projection variation value std and the projection value can be calculated for each pixel point in the horizontally arranged pattern; the std value and the projection value of each pixel can be used to characterize the current pixel point and the pixel points surrounding it.
Therefore, the horizontally arranged pattern can be traversed and, according to the neighborhood projection variation value std of each pixel, the start and end coordinates of a plurality of blank areas in the horizontally arranged pattern are obtained: [l_1, r_1], [l_2, r_2] ... [l_n, r_n].
S303, when the pixel point variation value of the target position meets the blank area identification condition, determining the target position as the position information of the blank area.
The blank area identification condition may be: when the horizontally arranged pattern is traversed from left to right, the projection variation value std of the pixel point is smaller than the preset variation threshold and its projection value is 0.
Then, according to the target positions of the plurality of pixel points meeting the blank area identification condition in the line graph to be extracted, the position information of the blank area in the line graph to be extracted can be determined.
S304, according to the position information of the blank areas, the horizontal arrangement area is divided, and a plurality of content areas are determined.
The content areas may include a plurality of legend identification line areas and a plurality of legend identification name areas; it should be understood that the blank areas are the gaps between the legend identification line areas and the legend identification name areas. Therefore, according to the start and end coordinates [l_1, r_1], [l_2, r_2] ... [l_n, r_n] of the blank areas, the plurality of discontinuous areas of the horizontally arranged pattern that do not belong to any blank area can be divided into a plurality of content areas.
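A sketch of dividing the horizontally arranged pattern into content areas at runs of blank columns, assuming a binary image and using a projection value of 0 as the blank criterion (the function name is illustrative):

```python
import numpy as np

def split_content_areas(img):
    """Return (start, end) column intervals of the content areas of a
    horizontally arranged legend image, split at all-background columns."""
    proj = (img > 0).sum(axis=0)      # vertical projection per column
    areas, start = [], None
    for x, v in enumerate(proj):
        if v > 0 and start is None:
            start = x                 # a content area begins
        elif v == 0 and start is not None:
            areas.append((start, x - 1))
            start = None              # a blank area begins
    if start is not None:
        areas.append((start, len(proj) - 1))
    return areas

img = np.zeros((3, 20), dtype=np.uint8)
img[1, 1:5] = 1     # legend identification line
img[:, 7:12] = 1    # legend identification name
img[1, 15:19] = 1   # second legend identification line
print(split_content_areas(img))   # [(1, 4), (7, 11), (15, 18)]
```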
S305, analyzing each content area, and determining all legend identification line areas and legend identification name areas.
Further, since the content area includes a plurality of legend identification line areas and a plurality of legend identification name areas, each content area may be analyzed according to a difference between the feature information of the legend identification line in the legend identification line area and the content information of the legend identification name in the legend identification name area, and the plurality of content areas are divided into a plurality of legend identification line areas and a plurality of legend identification name areas.
In this embodiment, the horizontally arranged area is divided into a plurality of content areas according to the plurality of blank areas, which solves the problem that recognizing the legend area as a whole confuses legend identification lines with legend identification names, and improves the accuracy of legend recognition.
The above division method is merely an example; other division orders and directions that achieve the effect of separating the legend identification line areas and legend identification name areas are not excluded. Such solutions can be realized by those skilled in the art without creative work after reading this application.
Optionally, the legends in the legend area may be horizontally arranged, as in fig. 3 above, but they may also be vertically arranged, as in fig. 7 above. In order to obtain the legend identification line areas and legend identification name areas more accurately and efficiently, before the legend area is segmented, legends in other arrangements, for example vertically arranged legends or legends in a table arrangement, may first be converted into a horizontally arranged pattern and then segmented to obtain the legend identification line areas and legend identification name areas. As shown in fig. 8, in the above step S301, obtaining the horizontal arrangement pattern corresponding to the legend area can be realized by the following steps S401 to S402.
S401, determining the legend distribution type of the legend area according to the arrangement sequence of the sub-areas to be identified in the legend area.
Each sub-area to be identified comprises a group of legends, and each group of legends comprises a legend identification line and a corresponding legend identification name.
Optionally, the legend area may be identified by an OCR technique or another identification method to preliminarily determine the arrangement order of the sub-areas to be identified in the legend area. For example, if a plurality of horizontally arranged sub-areas to be identified are obtained, the legend distribution type of the legend area is the horizontal arrangement type; if a plurality of vertically arranged sub-areas to be identified are obtained, it is the vertical arrangement type; and if sub-areas to be identified are arranged in both the horizontal and vertical directions, it is the table arrangement type.
S402, if the legend distribution type is a vertical distribution type or a table distribution type, horizontally dividing each sub-area to be identified according to the projection result of each sub-area to be identified in the horizontal direction to obtain a plurality of horizontal arrangement patterns.
If the legend distribution type is a vertical distribution type, performing Y-axis projection on each sub-area to be identified as shown in fig. 7 to obtain a Y-axis projection graph corresponding to each legend area.
Referring to fig. 7, when the plurality of vertically adjacent sub-areas to be identified are projected onto the Y-axis, the gaps between them produce pixel points with a projection value of 0 in the projected Y-axis pattern. These pixel points can be used as division points between the sub-areas to be identified, and the legend area is vertically divided according to the coordinates of the division points, so as to obtain the plurality of sub-areas to be identified.
Finally, the plurality of sub-areas to be identified can be horizontally divided to obtain a plurality of horizontally arranged patterns.
Alternatively, as shown in fig. 9, if the legend distribution type is the table arrangement type, that is, the legend area includes a plurality of sub-areas to be identified in both the horizontal and vertical directions, the legend area may first be divided according to the above steps into a plurality of legend areas whose distribution type is the vertical arrangement type. That is, after the legend area shown in fig. 9 is divided, three groups of vertically arranged legend areas are obtained: the areas of the legend identification lines and legend identification names corresponding to "soybean" and "corn" form one group, those corresponding to "rice" and "cotton" form another, and those corresponding to "wheat" and "potato" form the third. Then, each legend area is divided vertically to obtain a plurality of areas to be identified. For example, the legend area group of "soybean" and "corn" is divided into two areas to be identified: the area containing the legend identification line and legend identification name corresponding to "soybean", and the area containing those corresponding to "corn". Finally, the areas to be identified in all the legend areas are arranged uniformly in sequence on a single line to obtain the horizontally arranged pattern.
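A sketch of converting a vertically arranged legend area into a horizontally arranged pattern, assuming binary sub-areas separated by blank rows; the helper names and the padding width are illustrative:

```python
import numpy as np

def split_rows(img):
    """Split a vertically arranged legend area into sub-areas at blank
    rows (rows whose Y-axis projection value is 0)."""
    proj = (img > 0).sum(axis=1)
    subs, start = [], None
    for y, v in enumerate(proj):
        if v > 0 and start is None:
            start = y
        elif v == 0 and start is not None:
            subs.append(img[start:y])
            start = None
    if start is not None:
        subs.append(img[start:])
    return subs

def to_horizontal(img, pad=2):
    """Lay the vertically stacked sub-areas out on one line, left to right,
    padding each piece so all pieces share the tallest height."""
    subs = split_rows(img)
    h = max(s.shape[0] for s in subs)
    pieces = []
    for s in subs:
        block = np.zeros((h, s.shape[1] + pad), dtype=img.dtype)
        block[:s.shape[0], :s.shape[1]] = s
        pieces.append(block)
    return np.hstack(pieces)

img = np.zeros((7, 6), dtype=np.uint8)
img[1, :4] = 1     # first legend row
img[4, :5] = 1     # second legend row
flat = to_horizontal(img)
print(flat.shape)  # (1, 16)
```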
In this embodiment, the legend distribution types of other arrangement modes are converted into horizontal arrangement patterns, so that the subsequent unified identification method is facilitated, the complexity of the method is reduced, and the processing efficiency is improved.
Optionally, in the above embodiment, after a plurality of legend identification line areas and a plurality of legend identification name areas are preliminarily determined, a legend identification line area may have been wrongly determined as a legend identification name area, or vice versa, so the areas need to be further analyzed to obtain accurate legend identification line areas and legend identification name areas. As shown in fig. 10, in the above step S305, analyzing each content area to determine all legend identification line areas and legend identification name areas can be realized by the following steps S501 to S503.
And S501, performing preliminary traversal on each content area, and dividing each content area into a plurality of initial legend identification line areas and a plurality of initial legend identification name areas.
Using the formula for the neighborhood projection variation value std in the above embodiment, the std value of each pixel in each content area can be calculated.
Optionally, if there are m consecutive pixel points in a content area whose neighborhood projection variation values std are all smaller than a threshold t, and the distance between the first pixel point traversed from left to right and the other pixel points in the content area is smaller than a threshold s, the content area is divided into an initial legend identification line area. Illustratively, m may be 5, t may be 0.2, and s may be 0.5; the distance between two pixel points may be calculated from the coordinates of the two pixels.
And traversing each content area respectively, dividing at least one content area meeting the conditions into an initial legend identification line area, and dividing at least one remaining content area into an initial legend identification name area.
S502, if the initial identification line regions all meet the straight line identification condition, the initial identification line regions are used as legend identification line regions, the initial legend identification name regions are used as legend identification name regions, and the types of the legend identification line regions are marked as straight line identification line region types.
In one possible implementation, the straight line identification condition may be: the neighborhood projection variation values std of all pixel points in the initial identification line region are smaller than t and, when the region is traversed from left to right, the distance between the first pixel point and the other pixel points is less than s. Illustratively, t may be 0.2 and s may be 0.2.
If a certain initial identification line region meets the above-mentioned straight line identification condition, the type of the initial identification line region may be marked as a straight line identification line region type.
It is to be understood that, for the same line graph, the multiple legend identification lines in the legend area are generally represented in a uniform manner, and the type of the legend identification line area corresponding to each legend identification line should be the same.
Thus, each initial identification line region satisfying the straight line recognition condition can be used as a legend identification line region, and the initial legend identification name region can be used as a legend identification name region.
S503, if the initial marking line areas all meet the symbol identification condition, the initial marking line areas are used as legend marking line areas, the initial legend marking name areas are used as legend marking name areas, and the types of the legend marking line areas are marked as symbol marking line area types.
The symbol identification condition may be: when an initial identification line region is traversed from left to right, the neighborhood projection variation values std of the first n pixel points are all smaller than t, and the distance between the first pixel point and the other pixel points is smaller than s; and when the region is traversed from right to left, the std values of the last n pixel points are likewise all smaller than t, and the distance between the first pixel point and the other pixel points is smaller than s.
As can be seen from the above, for the same line graph, the types of the legend marking line regions corresponding to each legend marking line may be all straight line marking line region types, or all symbol marking line region types. If all the initial identification line regions meet the symbol identification condition, taking each initial identification line region meeting the symbol identification condition as a legend identification line region, and taking the initial legend identification name region as a legend identification name region.
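A simplified sketch of deciding between the straight-line and symbol identification line region types from the std sequence alone; the distance-to-first-pixel checks in the conditions above are omitted for brevity, so this is an approximation rather than the full condition:

```python
def classify_region(std, t=0.2, n=3):
    """Rough classification of an initial identification line region by its
    std sequence: flat everywhere -> straight line; flat only at both
    ends (a symbol varies in the middle) -> symbol."""
    if all(s < t for s in std):
        return "straight"
    if all(s < t for s in std[:n]) and all(s < t for s in std[-n:]):
        return "symbol"
    return "unknown"

print(classify_region([0.0] * 8))                                  # straight
print(classify_region([0.0, 0.0, 0.0, 0.9, 0.9, 0.0, 0.0, 0.0]))  # symbol
```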
Alternatively, as shown in fig. 11, if a legend identification line is a dotted line, the gaps between its dashes may be treated as blank areas, so that the above identification may wrongly divide one content area into a plurality of legend identification line areas. Therefore, the legend identification line areas meeting the straight line identification condition or the symbol identification condition can be further examined: a plurality of consecutive legend identification line areas with no content area of another type between them are merged into a new legend identification line area, and the type of the new legend identification line area is marked as the dotted-line identification line area type.
In this embodiment, the initial identification line region is marked as a straight identification line region type or a symbol identification line region type, so that data extraction can be performed in a targeted manner in subsequent steps, and the accuracy of data extraction is improved.
Optionally, the data extraction method for the line graph provided in the embodiment of the present application further includes: and correcting each legend identification line region, and taking the legend identification line region meeting identification name recognition conditions as a legend identification name region.
The at least one legend identification line area obtained in the above steps S502 to S503 may include areas in which a "一" or a horizontal stroke in a legend identification name was wrongly identified as a legend identification line. Therefore, the lengths of the legend identification line areas can be compared, and any legend identification line area whose length differs from that of the other legend identification line areas by more than sh pixel points is corrected into a legend identification name area. Illustratively, sh may be 5.
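A sketch of this length-based correction, assuming the identification line areas are given as (start, end) column intervals and using the median width as the reference length (an assumption; the patent only compares against the "other" regions):

```python
def correct_line_regions(regions, sh=5):
    """Move identification line regions whose width differs from the
    median width by more than sh pixels into the name-region list.
    regions: list of (start, end) column intervals."""
    widths = [e - s + 1 for s, e in regions]
    median = sorted(widths)[len(widths) // 2]
    lines, names = [], []
    for (s, e), w in zip(regions, widths):
        (lines if abs(w - median) <= sh else names).append((s, e))
    return lines, names

# the third interval is much shorter: likely a mis-detected "一" character
regions = [(0, 30), (40, 70), (80, 95)]
lines, names = correct_line_regions(regions)
print(lines, names)  # [(0, 30), (40, 70)] [(80, 95)]
```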
In this embodiment, the legend identification line areas are further examined, which avoids mistakenly identifying text as a legend identification line and the resulting legend recognition errors, thereby improving the accuracy of recognition.
Optionally, blank space areas exist between the legend identification lines and the legend identification names in each legend area, and therefore, the position information of the blank areas can be identified through vertical projection to serve as separation areas between the legend identification line areas and the legend identification name areas. As shown in fig. 12, in the step S303, when the pixel variation value of the target position satisfies the blank area identification condition, the step S601 to S603 may be implemented to determine that the target position is the position information of the blank area.
And S601, vertically projecting the horizontally arranged pattern to obtain a vertically projected pattern.
Referring to fig. 4, the horizontal arrangement pattern of the left image in fig. 4 is projected on the X-axis, i.e. vertically, so as to obtain the vertical projection pattern shown in the right image in fig. 4.
S602, performing projection calculation on each pixel point in the vertical projection graph to obtain a projection value and a projection variation value of each pixel point.
Using the formula for calculating the projection variation value std of a pixel point in the above embodiment, the projection value and the projection variation value std of each pixel point in the vertical projection pattern can be calculated.
And S603, traversing each pixel point in the horizontal arrangement graph according to the projection value and the projection variation value of each pixel point, and taking the position information of the area where the plurality of pixels meeting the blank area identification condition are as the position information of the blank area.
All pixel points of the horizontally arranged pattern are traversed; a plurality of adjacent pixel points whose projection variation value std is less than the preset variation threshold and whose projection value is 0, that is, which meet the blank area identification condition, are grouped into the same blank area, and the position information corresponding to the outline of the pixel points in each blank area is marked as the position information of that blank area.
In this embodiment, a plurality of blank regions are determined according to the projection value and the projection variation value of each pixel point in the horizontally arranged graph, and the blank regions are determined in an interpretable manner, so that the data extraction efficiency is improved.
Optionally, in order to obtain the data in the line graph to be extracted, the feature information extracted for each legend identification line needs to be matched against the line graph to be extracted to determine the data values corresponding to the content information of each legend identification name. As shown in fig. 13, in the above step S204, extracting data corresponding to the feature information from the data area of the line graph to be extracted according to the feature information of each legend identification line and the content information of each legend identification name, to obtain at least one target data value corresponding to the content information, may be implemented by the following steps S701 to S702.
S701, identifying the line graph to be extracted by using the straight line detection model, and determining the data area and the coordinate information of the line graph to be extracted.
The straight line detection model may be a pre-trained semantic segmentation model. It can identify the longest horizontal line at the bottom of the area of the line graph to be extracted as the X axis, and the leftmost vertical line intersecting the X axis as the Y axis.
Further, as shown in fig. 1, the X-axis scale values 107 are arranged below the X axis, so the area below the X axis can be identified from the position of the X axis, and the text contents corresponding to the X-axis scale points can be obtained. The black pixel points within toph pixels above each text content are identified as the X-axis scale point corresponding to that text content, where toph may be 10. In this way, multiple sets of X-axis coordinate information are obtained.
Alternatively, multiple sets of Y-axis coordinate information may be identified in a manner similar to the above identification of the X-axis coordinate information.
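The axis identification of step S701 can be approximated without a trained model by a simple heuristic on a binarized chart image: take the lowest near-full-width row of foreground pixels as the X axis, and the leftmost long vertical run meeting it as the Y axis. A hedged sketch follows; the coverage thresholds 0.8 and 0.5 are illustrative assumptions, and the patent itself uses a pre-trained semantic segmentation model for this step.

```python
import numpy as np

def detect_axes(binary):
    """Find (x_axis_row, y_axis_col) in a 0/1 chart image.

    X axis: the lowest row whose foreground run covers most of the width.
    Y axis: the leftmost column with a long vertical run that meets the
    X axis.  A simplified stand-in for the line-detection model."""
    h, w = binary.shape
    row_counts = binary.sum(axis=1)
    x_axis_row = max(r for r in range(h) if row_counts[r] >= 0.8 * w)
    col_counts = binary[:x_axis_row + 1].sum(axis=0)
    y_axis_col = min(c for c in range(w)
                     if col_counts[c] >= 0.5 * x_axis_row
                     and binary[x_axis_row, c])
    return x_axis_row, y_axis_col
```

In a production pipeline the same roles could be filled by `cv2.HoughLinesP` line candidates or by the segmentation model the patent describes.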
S702, matching the characteristic information of the figure legend identification line with an area corresponding to the coordinate information in the data area to obtain at least one target data value corresponding to the content information of each figure legend identification name.
The area corresponding to the coordinate information in the data area may refer to the vertical strip of pixels directly above and below each X-axis scale point.
Based on the feature information of each legend identification line obtained in the above embodiment, such as its color feature or symbol feature, the feature can be matched in the area corresponding to the coordinate information in the data area, that is, in the vertical strip above and below each X-axis scale point, so as to obtain the data value point corresponding to each X-axis scale point. The data value corresponding to each data value point is then calculated to obtain at least one target data value. Each target data value is the coordinate value, on the broken line, corresponding to one piece of X-axis coordinate information.
In this embodiment, after the data area of the line graph is determined, only the area indicated by the coordinate information is identified to obtain the target data values. This avoids identifying data in other areas and improves the accuracy of data extraction. Moreover, the data values are determined directly from the broken line corresponding to the coordinate information, which avoids the errors that an additional data-value determination step would introduce.
Optionally, when the feature information of the legend identification line is a color feature, at least one target data value corresponding to the content information of each legend identification name is extracted as follows. As shown in fig. 14, step S702, in which the feature information of the legend identification line is matched with the area corresponding to the coordinate information in the data area to obtain at least one target data value corresponding to the content information of each legend identification name, may be implemented by the following steps S801 to S802.
S801, if the feature information of the legend identification line is a color feature, calculating color value distances between the color feature and the color features of a plurality of pixel points in an area corresponding to each horizontal axis scale point in the coordinate information, and taking the pixel points of which the color value distances are smaller than a preset color threshold and meet a preset slope condition as corresponding data value points.
As described in the above embodiment, the color feature may be the RGB value of the legend identification line. When the vertical strip above and below a given X-axis scale point is traversed, a plurality of pixel points may be detected; the one or more pixel points whose RGB values are closest to the RGB value of the legend identification line are taken as initial data value points.
If only one initial data value point is obtained by matching a legend identification line at a given X-axis scale point, there are no interfering pixel points in the line graph to be extracted, and that initial data value point can be used directly as the data value point corresponding to the scale point.
If more than one initial data value point is obtained at a given X-axis scale point, the initial data value points may include pixel points of a background grid whose color feature is the same as that of the legend identification line. In that case, the slope at each initial data value point can be estimated from its surrounding pixel points. If the slope at one initial data value point is non-zero while the slopes at the others are 0, that point lies on the broken line and is taken as the data value point for the scale point. If the slope at every initial data value point is 0, it can be checked whether each point coincides with a Y-axis scale point; if so, the unique initial data value point that does not coincide with a Y-axis scale point is taken as the data value point for the X-axis scale point.
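The color matching and slope filtering of step S801 might be sketched like this, using Euclidean RGB distance as the "color value distance". The function names and the threshold value are illustrative assumptions; the slope check here uses the simple observation that a point on a horizontal background grid line also matches at the same row in the neighbouring columns (local slope 0).

```python
import numpy as np

def match_color_points(img, x_col, legend_rgb, color_thresh=30.0):
    """Return the rows in column `x_col` of an H x W x 3 image whose RGB
    distance to the legend line colour is below `color_thresh`
    (the initial data value points of step S801)."""
    col = img[:, x_col, :].astype(float)
    dists = np.linalg.norm(col - np.array(legend_rgb, dtype=float), axis=1)
    return [int(r) for r in np.where(dists < color_thresh)[0]]

def filter_by_slope(cands_prev, cands_here, cands_next):
    """Drop candidates lying on a horizontal background grid line: a row
    that also matches at the same height in both neighbouring columns has
    local slope 0 and is treated as grid noise."""
    kept = [r for r in cands_here if not (r in cands_prev and r in cands_next)]
    return kept or cands_here      # keep the originals if all were filtered
```

The fallback on the last line mirrors the text's handling of the case where every candidate has slope 0 and further disambiguation (for example against Y-axis scale points) is needed.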
S802, determining a target data value corresponding to each data value point according to the relative position relation between each data value point and the coordinate information.
According to the position of each data value point (nameK, x_scaleK, (xK, yK)), the Y-axis scale points whose pixel positions (xm, ym) and (xn, yn) bracket its Y coordinate yK, and the coordinate information ([[y_scale0, (x0, y0)], …, [y_scaleJ, (xj, yj)]]) corresponding to the scale points, the target data value corresponding to the data value point is determined by linear interpolation:

valueK = y_scale_m + (yK - ym) / (yn - ym) * (y_scale_n - y_scale_m)

where y_scale_m and y_scale_n are the scale values of the two bracketing Y-axis scale points.
In this embodiment, the color feature of the legend identification line is matched in the area corresponding to each horizontal-axis scale point in the coordinate information, and the pixel points that do not meet the preset slope condition are further removed. This solves the problem that, when a background grid has the same color as the broken line, the excessive noise in the line graph makes it difficult to extract the data accurately and cleanly, and thus improves the accuracy of data extraction from the line graph.
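One way to realize step S802, converting a data value point's pixel position into a target data value using the Y-axis scale points, is linear interpolation between the two scale points whose pixel rows bracket the point. A minimal sketch, with the caveat that the function name and the bracketing strategy are assumptions; points outside the outermost ticks simply fall back to the first interval here.

```python
def pixel_to_value(yK, scale_points):
    """Map a data value point's pixel row yK to a chart value by linear
    interpolation between the two Y-axis scale points whose pixel rows
    bracket yK.  `scale_points` is a list of (scale_value, y_pixel)."""
    pts = sorted(scale_points, key=lambda p: p[1])   # sort by pixel row
    (v_m, ym), (v_n, yn) = pts[0], pts[1]            # default: first interval
    for (a, ya), (b, yb) in zip(pts, pts[1:]):       # find the bracketing pair
        if ya <= yK <= yb:
            (v_m, ym), (v_n, yn) = (a, ya), (b, yb)
            break
    # linear interpolation between the two bracketing scale points
    return v_m + (yK - ym) / (yn - ym) * (v_n - v_m)
```

Note that pixel rows grow downward while chart values usually grow upward; the interpolation handles this automatically because the scale values carry the direction.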
Optionally, when the feature information of the legend identification line is a symbol feature, step S702, in which the feature information of the legend identification line is matched with the area corresponding to the coordinate information in the data area to obtain at least one target data value corresponding to the content information of each legend identification name, may also be implemented by the following steps S901 to S903, as shown in fig. 15.
S901, if the feature information of the legend identification line is a symbol feature, performing pattern matching between the symbol feature and the area corresponding to each horizontal-axis scale point in the coordinate information to obtain at least one candidate area.
As shown in fig. 3 in the above embodiment, the symbol feature may be the symbol identification image in a legend identification line, for example the area where the triangle symbol in the first legend identification line area 1011 is located, or the area where the square symbol in the second legend identification line area 1013 is located. The symbol identification image of each legend identification line is pattern-matched within the vertical strip above and below each X-axis scale point in the data area of the line graph to be extracted, so as to obtain the areas in the line graph indicated by the symbol identification image of each legend identification line.
Fig. 16 shows the result of pattern matching: the boxed areas in fig. 16 mark, on each broken line, the candidate areas that match the symbol identification image of a legend identification line.
S902, taking the central point coordinate of at least one candidate area as a corresponding data value point.
It can be understood that the position of each candidate area is described by its contour information; therefore, the coordinates of the center point of each candidate area can be used as the data value point of that candidate area.
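Steps S901 and S902 together amount to a template match followed by taking box centers. In practice `cv2.matchTemplate` would typically be used; the pure-NumPy sum-of-squared-differences scan below is an illustrative stand-in, and the `max_ssd` tolerance is an assumption.

```python
import numpy as np

def match_symbol(strip, template, max_ssd=0):
    """Exhaustive template match of a legend symbol image over a strip of
    the data area (step S901): return the top-left (row, col) of every
    window whose SSD against the template is <= max_ssd.
    Use signed integer arrays to avoid uint8 wrap-around."""
    H, W = strip.shape
    h, w = template.shape
    hits = []
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            ssd = int(((strip[y:y + h, x:x + w] - template) ** 2).sum())
            if ssd <= max_ssd:
                hits.append((y, x))
    return hits

def center_points(hits, h, w):
    """Step S902: the center of each candidate box is its data value point."""
    return [(y + h // 2, x + w // 2) for (y, x) in hits]
```

Each hit corresponds to one boxed candidate area of fig. 16, and its center becomes the data value point fed into the interpolation of step S903.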
And S903, determining a target data value corresponding to each data value point according to the relative position relationship between each data value point and the coordinate information.
The manner of determining the target data value corresponding to each data value point is the same as that in step S802, and is not described herein again.
In this embodiment, a plurality of target data values are obtained by matching the symbol features of the legend identification lines with the data area of the line graph to be extracted. This solves the problem of extracting data from a line graph whose legend area contains a plurality of legend identification lines that carry symbol identifications and have similar or identical colors.
Referring to fig. 17, an embodiment of the present application further provides a data extraction apparatus 100 for a line graph, which can be used to execute the steps of the data extraction method for a line graph in the foregoing embodiment, including:
a legend detection module 1001, configured to detect a line graph to be extracted by using a legend detection model, and determine a legend area of the line graph to be extracted;
a legend dividing module 1002, configured to divide a legend area to obtain at least one legend identification line area and at least one legend identification name area;
a legend identification module 1003, configured to respectively identify each legend identification line region and each legend identification name region according to location information of a blank region in the legend region, to obtain feature information of each legend identification line and content information of each legend identification name;
and the data extraction module 1004 is configured to extract, according to the feature information of each legend identification line and the content information of each legend identification name, data corresponding to the feature information in the data area of the line graph to be extracted, and obtain at least one target data value corresponding to the content information.
The legend dividing module 1002 is further specifically configured to obtain a horizontal arrangement pattern corresponding to the legend area; traversing and identifying the pixel change value in the horizontal arrangement graph; when the pixel point variation value of the target position meets the blank area identification condition, determining the target position as the position information of the blank area; dividing the horizontal arrangement area according to the position information of the blank areas to determine a plurality of content areas; and analyzing each content area to determine all the legend identification line areas and legend identification name areas.
The legend segmentation module 1002 is further specifically configured to determine a legend distribution type of the legend region according to an arrangement order of each sub-region to be identified in the legend region; and if the legend distribution type is a vertical distribution type or a table distribution type, horizontally dividing each sub-area to be identified according to the projection result of each sub-area to be identified in the horizontal direction to obtain a plurality of horizontal arrangement patterns.
The legend segmentation module 1002 is further configured to perform a preliminary traversal on each content area and divide each content area into a plurality of initial legend identification line regions and a plurality of initial legend identification name regions; if the initial legend identification line regions all meet the straight line identification condition, take the initial legend identification line regions as legend identification line regions, take the initial legend identification name regions as legend identification name regions, and mark the type of the legend identification line regions as the straight line identification line region type; and if the initial legend identification line regions all meet the symbol identification condition, take the initial legend identification line regions as legend identification line regions, take the initial legend identification name regions as legend identification name regions, and mark the type of the legend identification line regions as the symbol identification line region type.
The legend dividing module 1002 is further specifically configured to correct each legend identification line region, and take each legend identification line region that meets the identification name recognition condition as a legend identification name region.
The legend dividing module 1002 is further specifically configured to perform vertical projection on the horizontally arranged graph to obtain a vertical projection graph; performing projection calculation on each pixel point in the vertical projection graph to obtain a projection value and a projection variation value of each pixel point; and traversing each pixel point in the horizontal arrangement graph according to the projection value and the projection variation value of each pixel point, and taking the position information of the area where the plurality of pixels meeting the blank area identification condition are located as the position information of the blank area.
The data extraction module 1004 is further specifically configured to identify the line graph to be extracted by using the straight line detection model, and determine a data area and coordinate information of the line graph to be extracted; and matching the characteristic information of the legend identification line with an area corresponding to the coordinate information in the data area to obtain at least one target data value corresponding to the content information of each legend identification name.
The data extraction module 1004 is further specifically configured to, if the feature information of the legend identification line is a color feature, calculate a color value distance between the color feature and color features of a plurality of pixel points in an area corresponding to each horizontal axis scale point in the coordinate information, and take the pixel points whose color value distance is smaller than a preset color threshold and meets a preset slope condition as corresponding data value points; and determining a target data value corresponding to each data value point according to the relative position relationship between each data value point and the coordinate information.
The data extraction module 1004 is further specifically configured to, if the feature information of the legend identification line is a symbol feature, calculate a region corresponding to each horizontal axis scale point in the symbol feature and the coordinate information, and perform pattern matching to obtain at least one region to be selected; taking the coordinates of the central point of at least one to-be-selected area as corresponding data value points; and determining a target data value corresponding to each data value point according to the relative position relationship between each data value point and the coordinate information.
Referring to fig. 18, this embodiment further provides a processing apparatus, including a processor 2001, a memory 2002, and a bus. The memory 2002 stores machine-readable instructions executable by the processor 2001. When the processing device is running, the processor 2001 and the memory 2002 communicate via the bus, and the processor 2001 executes the machine-readable instructions to perform the steps of the data extraction method of the line graph in the above embodiments.
The memory 2002, the processor 2001, and the bus are electrically connected to one another, directly or indirectly, to enable data transmission or interaction. For example, these components may be electrically connected to one another via one or more communication buses or signal lines. The data processing apparatus of the line graph data extraction system includes at least one software functional module, which may be stored in the memory 2002 in the form of software or firmware, or solidified in the operating system (OS) of the processing device. The processor 2001 executes the executable modules stored in the memory 2002, such as the software functional modules and computer programs included in the data processing apparatus of the line graph data extraction system.
The memory 2002 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
Optionally, the present application further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program performs the steps of the above method embodiments. The specific implementation and technical effects are similar, and are not described herein again.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to corresponding processes in the method embodiments, and are not described in detail in this application. In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and there may be other divisions in actual implementation, and for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed coupling or direct coupling or communication connection between each other may be through some communication interfaces, indirect coupling or communication connection between devices or modules, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The functions may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. A method for extracting data of a line graph, comprising:
detecting a line graph to be extracted by using a legend detection model, and determining a legend area of the line graph to be extracted;
according to the position information of the blank area in the legend area, carrying out segmentation processing on the legend area to obtain at least one legend identification line area and at least one legend identification name area;
identifying each legend identification line area and each legend identification name area respectively to obtain characteristic information of each legend identification line and content information of each legend identification name;
and extracting data corresponding to the characteristic information in the data area of the line graph to be extracted according to the characteristic information of each legend identification line and the content information of each legend identification name, and obtaining at least one target data value corresponding to the content information.
2. The method for extracting data of a line graph according to claim 1, wherein the dividing the legend area into at least one legend identification line area and at least one legend identification name area comprises:
obtaining a horizontal arrangement graph corresponding to the legend area;
traversing and identifying the pixel change values in the horizontal arrangement graph;
when the pixel point variation value of the target position meets the blank area identification condition, determining the target position as the position information of the blank area;
dividing the horizontal arrangement area according to the position information of the blank areas to determine a plurality of content areas;
and analyzing each content area to determine all the legend identification line areas and the legend identification name areas.
3. The method for extracting data of a line graph according to claim 2, wherein the obtaining of the horizontal arrangement graph corresponding to the legend area comprises:
determining the legend distribution type of the legend area according to the arrangement sequence of the sub-areas to be identified in the legend area;
and if the legend distribution type is a vertical distribution type or a table distribution type, horizontally dividing each sub-area to be identified according to the projection result of each sub-area to be identified in the horizontal direction to obtain a plurality of horizontal arrangement patterns.
4. The method for extracting data of a line chart according to claim 2, wherein the analyzing each of the content areas to determine all of the legend identification line areas and the legend identification name areas comprises:
performing preliminary traversal on each content area, and dividing each content area into a plurality of initial legend identification line areas and a plurality of initial legend identification name areas;
if the initial identification line regions all meet the straight line identification condition, taking the initial identification line regions as legend identification line regions, taking the initial legend identification name regions as legend identification name regions, and marking the types of the legend identification line regions as straight line identification line region types;
if the initial identification line regions all meet the symbol identification condition, taking the initial identification line regions as legend identification line regions, taking the initial legend identification name regions as legend identification name regions, and marking the types of the legend identification line regions as symbol identification line region types.
5. The method of line graph data extraction of claim 4, further comprising:
and correcting each legend identification line region, and taking the legend identification line region meeting identification name recognition conditions as a legend identification name region.
6. The method for extracting data of a line graph according to claim 2, wherein the determining that the target position is position information of a blank area when the pixel point variation value of the target position satisfies a blank condition includes:
carrying out vertical projection on the horizontally arranged graph to obtain a vertical projection graph;
performing projection calculation on each pixel point in the vertical projection graph to obtain a projection value and a projection variation value of each pixel point;
and traversing each pixel point in the horizontal arrangement graph according to the projection value and the projection variation value of each pixel point, and taking the position information of the area where a plurality of pixels meeting the blank area identification condition are located as the position information of the blank area.
7. The method for extracting data of a line drawing according to claim 1, wherein the extracting data corresponding to the feature information in the data area of the line drawing to be extracted according to the feature information of each legend identification line and the content information of each legend identification name to obtain at least one target data value corresponding to the content information comprises:
identifying the line graph to be extracted by using a straight line detection model, and determining a data area and coordinate information of the line graph to be extracted;
and matching the characteristic information of the legend identification line with an area corresponding to the coordinate information in the data area to obtain at least one target data value corresponding to the content information of each legend identification name.
8. The method for extracting data of a line graph according to claim 7, wherein the matching the feature information of the legend identification line with the area corresponding to the coordinate information in the data area to obtain at least one target data value corresponding to the content information of each legend identification name includes:
if the characteristic information of the legend identification line is color characteristics, calculating color value distances between the color characteristics and the color characteristics of a plurality of pixel points in an area corresponding to each horizontal axis scale point in the coordinate information, and taking the pixel points of which the color value distances are smaller than a preset color threshold and meet a preset slope condition as corresponding data value points;
and determining a target data value corresponding to each data value point according to the relative position relation between each data value point and the coordinate information.
9. The method for extracting line drawing data according to claim 7, wherein the matching a position corresponding to coordinate information in the data area according to feature information of the legend identification line and content information of each legend identification name to obtain a plurality of data values corresponding to the content information of each legend identification name includes:
if the characteristic information of the legend identification line is symbol characteristics, calculating the symbol characteristics and areas corresponding to all cross-axis scale points in the coordinate information to perform pattern matching to obtain at least one area to be selected;
taking the coordinates of the central point of at least one to-be-selected area as corresponding data value points;
and determining a target data value corresponding to each data value point according to the relative position relation between each data value point and the coordinate information.
10. A data extraction device of a line graph, characterized by comprising:
the legend detection module is used for detecting the line graph to be extracted by using a legend detection model and determining the legend area of the line graph to be extracted;
the legend segmentation module is used for segmenting the legend area according to the position information of the blank area in the legend area to obtain at least one legend identification line area and at least one legend identification name area;
a legend identification module, configured to respectively identify each legend identification line region and each legend identification name region, to obtain feature information of each legend identification line and content information of each legend identification name;
and the data extraction module is used for extracting data corresponding to the characteristic information in the data area of the line graph to be extracted according to the characteristic information of each legend identification line and the content information of each legend identification name, and acquiring at least one target data value corresponding to the content information.
11. A processing device, characterized in that the processing device comprises: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating via the bus when the processing device is operating, the processor executing the machine-readable instructions to perform the steps of the data extraction method of the line graph according to any one of claims 1-9.
12. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, carries out the steps of the method for data extraction of a line graph according to any one of claims 1-9.
CN202211264165.0A 2022-10-17 2022-10-17 Data extraction method and processing equipment for line graph Active CN115331013B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211264165.0A CN115331013B (en) 2022-10-17 2022-10-17 Data extraction method and processing equipment for line graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211264165.0A CN115331013B (en) 2022-10-17 2022-10-17 Data extraction method and processing equipment for line graph

Publications (2)

Publication Number Publication Date
CN115331013A true CN115331013A (en) 2022-11-11
CN115331013B CN115331013B (en) 2023-02-24

Family

ID=83915519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211264165.0A Active CN115331013B (en) 2022-10-17 2022-10-17 Data extraction method and processing equipment for line graph

Country Status (1)

Country Link
CN (1) CN115331013B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115830431A (en) * 2023-02-08 2023-03-21 湖北工业大学 Neural network image preprocessing method based on light intensity analysis

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9208403B1 (en) * 2014-06-16 2015-12-08 Qualcomm Incorporated Systems and methods for processing image data associated with line detection
CN108470350A (en) * 2018-02-26 2018-08-31 阿博茨德(北京)科技有限公司 Broken line dividing method in line chart and device
JP2018181244A (en) * 2017-04-21 2018-11-15 ウォンテッドリー株式会社 Line segment extraction device, method for controlling line segment extraction device, and program
CN110569774A (en) * 2019-08-30 2019-12-13 武汉大学 Automatic line graph image digitalization method based on image processing and pattern recognition
CN110598634A (en) * 2019-09-12 2019-12-20 山东文多网络科技有限公司 Machine room sketch identification method and device based on graph example library
CN110909732A (en) * 2019-10-14 2020-03-24 杭州电子科技大学上虞科学与工程研究院有限公司 Automatic extraction method of data in graph
CN111580894A (en) * 2020-04-02 2020-08-25 深圳壹账通智能科技有限公司 Data analysis early warning method, device, computer system and readable storage medium
CN112507876A (en) * 2020-12-07 2021-03-16 数地科技(北京)有限公司 Wired table picture analysis method and device based on semantic segmentation
CN112651315A (en) * 2020-12-17 2021-04-13 苏州超云生命智能产业研究院有限公司 Information extraction method and device of line graph, computer equipment and storage medium
CN112819871A (en) * 2021-03-02 2021-05-18 华融融通(北京)科技有限公司 Table image registration method based on linear segmentation
CN113095267A (en) * 2021-04-22 2021-07-09 上海携宁计算机科技股份有限公司 Data extraction method of statistical chart, electronic device and storage medium
CN113705576A (en) * 2021-11-01 2021-11-26 江西中业智能科技有限公司 Text recognition method and device, readable storage medium and equipment
CN113723328A (en) * 2021-09-06 2021-11-30 华南理工大学 Method for analyzing and understanding chart document panel
CN113743187A (en) * 2021-06-22 2021-12-03 万翼科技有限公司 Method and device for identifying legend in engineering drawing, electronic equipment and storage medium
CN114283436A (en) * 2021-12-20 2022-04-05 万翼科技有限公司 Table identification method, device, equipment and storage medium
CN114511862A (en) * 2022-02-17 2022-05-17 北京百度网讯科技有限公司 Form identification method and device and electronic equipment
CN114998428A (en) * 2022-04-14 2022-09-02 杭州电子科技大学 Broken line/curve data extraction system and method based on image processing
CN114998912A (en) * 2022-05-26 2022-09-02 网易(杭州)网络有限公司 Data extraction method and device, electronic equipment and storage medium

Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9208403B1 (en) * 2014-06-16 2015-12-08 Qualcomm Incorporated Systems and methods for processing image data associated with line detection
JP2018181244A (en) * 2017-04-21 2018-11-15 ウォンテッドリー株式会社 Line segment extraction device, method for controlling line segment extraction device, and program
CN108470350A (en) * 2018-02-26 2018-08-31 阿博茨德(北京)科技有限公司 Broken line dividing method in line chart and device
US20190266395A1 (en) * 2018-02-26 2019-08-29 Abc Fintech Co., Ltd. Method and device for segmenting lines in line chart
CN110569774A (en) * 2019-08-30 2019-12-13 武汉大学 Automatic line graph image digitalization method based on image processing and pattern recognition
CN110598634A (en) * 2019-09-12 2019-12-20 山东文多网络科技有限公司 Machine room sketch identification method and device based on graph example library
US20210110194A1 (en) * 2019-10-14 2021-04-15 Hangzhou Dianzi University Method for automatic extraction of data from graph
CN110909732A (en) * 2019-10-14 2020-03-24 杭州电子科技大学上虞科学与工程研究院有限公司 Automatic extraction method of data in graph
CN111580894A (en) * 2020-04-02 2020-08-25 深圳壹账通智能科技有限公司 Data analysis early warning method, device, computer system and readable storage medium
CN112507876A (en) * 2020-12-07 2021-03-16 数地科技(北京)有限公司 Wired table picture analysis method and device based on semantic segmentation
CN112651315A (en) * 2020-12-17 2021-04-13 苏州超云生命智能产业研究院有限公司 Information extraction method and device of line graph, computer equipment and storage medium
CN112819871A (en) * 2021-03-02 2021-05-18 华融融通(北京)科技有限公司 Table image registration method based on linear segmentation
CN113095267A (en) * 2021-04-22 2021-07-09 上海携宁计算机科技股份有限公司 Data extraction method of statistical chart, electronic device and storage medium
CN113743187A (en) * 2021-06-22 2021-12-03 万翼科技有限公司 Method and device for identifying legend in engineering drawing, electronic equipment and storage medium
CN113723328A (en) * 2021-09-06 2021-11-30 华南理工大学 Method for analyzing and understanding chart document panel
CN113705576A (en) * 2021-11-01 2021-11-26 江西中业智能科技有限公司 Text recognition method and device, readable storage medium and equipment
CN114283436A (en) * 2021-12-20 2022-04-05 万翼科技有限公司 Table identification method, device, equipment and storage medium
CN114511862A (en) * 2022-02-17 2022-05-17 北京百度网讯科技有限公司 Form identification method and device and electronic equipment
CN114998428A (en) * 2022-04-14 2022-09-02 杭州电子科技大学 Broken line/curve data extraction system and method based on image processing
CN114998912A (en) * 2022-05-26 2022-09-02 网易(杭州)网络有限公司 Data extraction method and device, electronic equipment and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
TAN LU et al.: "Probabilistic homogeneity for document image segmentation", Pattern Recognition *
TUAN ANH TRAN et al.: "A mixture model using Random Rotation Bounding Box to detect table region in document image", Journal of Visual Communication and Image Representation *
HAO Shengli: "Algorithm Improvements in Table Recognition", China Master's Theses Full-text Database, Information Science and Technology Series *
HAN Bing: "Research on Automatic Classification and Information Extraction Methods for Ubiquitous Statistical Charts", China Master's Theses Full-text Database, Basic Sciences Series *


Also Published As

Publication number Publication date
CN115331013B (en) 2023-02-24

Similar Documents

Publication Publication Date Title
CN108470021B (en) Method and device for positioning table in PDF document
CN109740469B (en) Lane line detection method, lane line detection device, computer device, and storage medium
CN110232311B (en) Method and device for segmenting hand image and computer equipment
US10878003B2 (en) System and method for extracting structured information from implicit tables
KR20000047428A (en) Apparatus and Method for Recognizing Character
CN110321837B (en) Test question score identification method, device, terminal and storage medium
CN115331013B (en) Data extraction method and processing equipment for line graph
CN111340020B (en) Formula identification method, device, equipment and storage medium
CN113095267B (en) Data extraction method of statistical chart, electronic device and storage medium
JP4704601B2 (en) Character recognition method, program, and recording medium
CN109409180B (en) Image analysis device and image analysis method
JP3728224B2 (en) Document processing apparatus and method
CN107798355B (en) Automatic analysis and judgment method based on document image format
WO2019185245A2 (en) An image processing system and an image processing method
CN115373534A (en) Handwriting presenting method and device, interactive panel and storage medium
CN112417826A (en) PDF online editing method and device, electronic equipment and readable storage medium
CN112380812A (en) Method, device, equipment and storage medium for extracting incomplete frame line table of PDF (Portable document Format)
CN112084103A (en) Interface test method, device, equipment and medium
CN113449763A (en) Information processing apparatus and recording medium
JP2581353B2 (en) Graph image registration system
JP5402417B2 (en) Image processing device
JP2630261B2 (en) Character recognition device
JP2022051199A (en) Image determination device, image determination method, and program
JP3100825B2 (en) Line recognition method
JP2576080B2 (en) Character extraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant