CN116910292A - Document chart retrieval method, device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116910292A
Authority
CN
China
Prior art keywords
chart
title
information
position information
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310685076.1A
Other languages
Chinese (zh)
Inventor
周立运
Name withheld on request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rubik's Cube Medical Technology Suzhou Co ltd
Original Assignee
Rubik's Cube Medical Technology Suzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rubik's Cube Medical Technology Suzhou Co ltd
Priority to CN202310685076.1A
Publication of CN116910292A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a document chart retrieval method, a device, electronic equipment and a storage medium. The method comprises: acquiring a target document from which chart information is to be extracted; inputting the target document into a trained target detection model, and outputting chart position information and title position information of the target document, wherein the trained target detection model is obtained by training on a sample document set in which each pre-labeled chart region covers its title region; and analyzing the chart position information and the title position information to determine the title corresponding to each chart in the target document, and segmenting to obtain chart information containing the title. When search information is received, the chart information containing the title is used to screen the titles based on the search information to obtain target chart information. Because the title corresponding to each chart in the document is determined by analysis, the titles can serve as an index, meeting users' need to screen target charts accurately and efficiently from a massive number of charts.

Description

Document chart retrieval method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a document chart retrieval method, a document chart retrieval device, an electronic device, and a storage medium.
Background
With the development of information technology, it has become very common to present key information in charts, especially in PDF documents across various fields. Accurate chart retrieval is therefore an important prerequisite for reading the key points of a document.
At present, the industry generally uses a target detection model to identify picture, table and text regions in a document. However, this approach relies entirely on the target detection algorithm: when charts are irregular, when a page contains multiple charts, or when a chart has no border or its border lines are incomplete, chart content is extracted with low accuracy, which degrades the quality of chart retrieval results. Searching for charts manually can ensure accuracy, but it is costly and inefficient when faced with massive numbers of documents.
Therefore, how to efficiently and accurately search the key chart information in the document becomes a current urgent problem to be solved.
Disclosure of Invention
The invention provides a document chart retrieval method, a document chart retrieval device, electronic equipment and a storage medium, which are used for solving the defect of low chart information retrieval efficiency and accuracy in the prior art.
The invention provides a document chart retrieval method, which comprises the following steps:
acquiring a target document of chart information to be extracted;
inputting the target document into a trained target detection model, and outputting chart position information and title position information of the target document; the trained target detection model is obtained by training on a sample document set in which each pre-labeled chart region covers its title region;
analyzing the chart position information and the title position information to determine the title corresponding to each chart in the target document, and segmenting to obtain chart information containing the title;
wherein, when search information is received, the chart information containing the title is used to screen the titles in the chart information based on the search information to obtain target chart information.
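As a minimal sketch of the final retrieval step, the following Python snippet filters extracted chart records by matching the search information against their titles. The record type `ChartInfo`, its fields, the file names, and the case-insensitive substring match are all illustrative assumptions; the text does not specify a matching strategy.

```python
from dataclasses import dataclass


@dataclass
class ChartInfo:
    """Hypothetical record for one extracted chart (field names are assumptions)."""
    title: str        # title text determined for the chart
    page: int         # page number in the target document
    image_path: str   # where the segmented chart image is stored


def search_charts(charts, query):
    """Return the charts whose title contains the query, ignoring case."""
    needle = query.strip().lower()
    if not needle:
        return []
    return [c for c in charts if needle in c.title.lower()]


charts = [
    ChartInfo("Figure 1 Patient enrollment flow", 3, "p3_c0.png"),
    ChartInfo("Table 2 Baseline characteristics", 5, "p5_c0.png"),
]
hits = search_charts(charts, "baseline")
```

Because the titles act as the index, the chart images themselves never need to be inspected at query time.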
According to the document chart searching method provided by the invention, after the target document is input into the trained target detection model and the chart position information and the title position information of the target document are output, the method further comprises the following steps:
analyzing the underlying data of the target document to obtain reference position information of each chart;
for each chart, if the reference position information matches the chart position information, retaining the chart position information;
and if the reference position information does not match the chart position information, combining the reference position information and the chart position information to update the chart position information.
According to the document chart searching method provided by the invention, the chart position information and the title position information are analyzed to determine the title corresponding to each chart in the target document, and the chart information containing the title is obtained by segmentation, and the method comprises the following steps:
analyzing the chart position information and the title position information to obtain a first area corresponding to the chart position information and a second area corresponding to the title position information;
calculating the region overlapping degree between the first region and the second region for each page of the target document;
if the region overlapping degree is greater than or equal to a preset overlapping degree threshold value, determining the title position information corresponding to the second region as a candidate title position of a corresponding chart;
and determining a title corresponding to each chart in the target document based on the candidate title position, and dividing to obtain chart information containing the title.
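The overlap test above can be sketched as follows. The text does not define how the region overlapping degree is computed, so this sketch assumes it is the fraction of the title region that lies inside the chart region. Boxes are assumed to be `(x0, y0, x1, y1)` tuples, and the threshold value `0.9` is a placeholder.

```python
def intersection_area(a, b):
    """Area of the intersection of two (x0, y0, x1, y1) boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)


def overlap_degree(chart_box, title_box):
    """Assumed definition: share of the title box covered by the chart box."""
    title_area = (title_box[2] - title_box[0]) * (title_box[3] - title_box[1])
    if title_area <= 0:
        return 0.0
    return intersection_area(chart_box, title_box) / title_area


def candidate_titles(chart_box, title_boxes, threshold=0.9):
    """Keep the title boxes whose overlap degree reaches the threshold."""
    return [t for t in title_boxes if overlap_degree(chart_box, t) >= threshold]


chart = (50, 100, 550, 400)
titles = [
    (60, 105, 540, 125),   # fully inside the chart region
    (60, 420, 540, 440),   # below the chart region
    (500, 380, 600, 400),  # half inside the chart region
]
kept = candidate_titles(chart, titles)
```

Normalizing by the title area rather than by the union suits this setting, since a title region is expected to be much smaller than the chart region that covers it.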
According to the document graph searching method provided by the invention, before calculating the region overlapping degree between the first region and the second region for each page of the target document, the method further comprises the following steps:
Acquiring page height information of each page of the target document;
determining a page invalid region of each page based on the page height information;
and if the second area corresponding to the title position information is overlapped with the page invalid area, screening out the title position information.
According to the document chart searching method provided by the invention, the title corresponding to each chart in the target document is determined based on the candidate title position, and chart information containing the title is obtained by segmentation, and the method comprises the following steps:
determining chart height information of the chart so as to determine a chart invalid area according to the chart height information;
if the second area corresponding to a candidate title position overlaps the chart invalid area, filtering out that candidate title position to obtain the remaining candidate title positions;
and determining the title corresponding to each chart in the target document according to the rest candidate title positions, and dividing to obtain chart information containing the title.
According to the method for searching the document charts, the title corresponding to each chart in the target document is determined according to the rest candidate title positions, and chart information containing the title is obtained by segmentation, and the method comprises the following steps:
If the residual candidate title positions comprise at least two, respectively calculating distance values between each residual candidate title position and the corresponding chart position information;
and screening out the target title positions in the residual candidate title positions based on the distance values to determine the title corresponding to each chart in the target document, and dividing to obtain chart information containing the title.
According to the document chart searching method provided by the invention, the target title position in each remaining candidate title position is screened out based on the distance value to determine the title corresponding to each chart in the target document, and chart information containing the title is obtained by dividing, and the method comprises the following steps:
screening out the residual candidate title positions corresponding to the minimum distance value as target title positions;
if the target title positions comprise at least two, identifying and obtaining target title contents based on each target title position;
screening target title contents containing preset title keywords as titles corresponding to the charts;
and carrying out pixel segmentation on the chart containing the title to obtain the chart information.
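The selection among remaining candidates can be illustrated with the sketch below. It assumes the distance value is the Euclidean distance between box centers, that the title text at each position has already been recognized, and that the preset title keywords are "Figure" and "Table". None of these choices is fixed by the text.

```python
import math


def center(box):
    """Center point of an (x0, y0, x1, y1) box."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def pick_title(chart_box, candidates, keywords=("Figure", "Table")):
    """candidates is a list of (box, text) pairs; returns the chosen text or None."""
    if not candidates:
        return None
    chart_center = center(chart_box)
    dists = [math.dist(chart_center, center(box)) for box, _ in candidates]
    best = min(dists)
    nearest = [text for (_, text), d in zip(candidates, dists)
               if math.isclose(d, best)]
    if len(nearest) == 1:
        return nearest[0]
    # Several candidates tie on distance: prefer one containing a title keyword.
    for text in nearest:
        if any(k in text for k in keywords):
            return text
    return None  # tie that the keywords cannot resolve


chart = (0, 100, 200, 300)
candidates = [
    ((0, 80, 200, 100), "Figure 3 Dose response"),   # just above the chart
    ((0, 300, 200, 320), "Source: internal data"),    # just below the chart
    ((0, 500, 200, 520), "Table 9 Unrelated"),        # far away
]
chosen = pick_title(chart, candidates)
```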
According to the document chart searching method provided by the invention, before the target document is input into the trained target detection model and chart position information and title position information of the target document are output, the method further comprises the following steps:
acquiring a sample document set; wherein the sample document set includes a plurality of sample documents;
converting each sample document into a page image to obtain a labeling result of each page image; the labeling result comprises a first labeling area of the chart and a second labeling area of the title; the second labeling area is covered by the first labeling area;
and acquiring an initial target detection model, and inputting each sample document containing the labeling result into the initial target detection model for model training to obtain the trained target detection model.
According to the document chart searching method provided by the invention, the target title content is identified and obtained based on each target title position, and the method comprises the following steps:
identifying and obtaining target title contents based on the target title positions;
and deleting the target title content under the condition that the target title content contains preset characters and preset sensitive words related to the catalogue.
According to the document chart searching method provided by the invention, after analyzing the chart position information and the title position information to determine the title corresponding to each chart in the target document, the method further comprises the following steps:
and fusing at least two charts when the titles corresponding to the at least two charts are the same.
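How charts are fused is not detailed here. One possible reading, sketched below, is that chart regions on the same page that share a title are merged into a single bounding box. The `(page, title, box)` record layout and the union-box rule are assumptions made for illustration only.

```python
def union_box(a, b):
    """Smallest (x0, y0, x1, y1) box that contains both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def fuse_by_title(charts):
    """Merge (page, title, box) records that share the same page and title."""
    merged = {}
    for page, title, box in charts:
        key = (page, title)
        merged[key] = union_box(merged[key], box) if key in merged else box
    return [(page, title, box) for (page, title), box in merged.items()]


charts = [
    (1, "Table 2 Results", (0, 0, 100, 50)),
    (1, "Table 2 Results", (0, 50, 100, 120)),
    (2, "Figure 1 Flow", (10, 10, 90, 90)),
]
fused = fuse_by_title(charts)
```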
The invention also provides a document chart retrieval device, which comprises:
an acquisition unit for acquiring a target document of chart information to be extracted;
a detection unit for inputting the target document into a trained target detection model and outputting chart position information and title position information of the target document; the trained target detection model is obtained by training on a sample document set in which each pre-labeled chart region covers its title region;
an analysis unit for analyzing the chart position information and the title position information to determine the title corresponding to each chart in the target document, and segmenting to obtain chart information containing the title;
wherein, when search information is received, the chart information containing the title is used to screen the titles in the chart information based on the search information to obtain target chart information.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the document graph searching method according to any one of the above when executing the computer program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a document graph retrieval method as described in any of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a document chart retrieval method as described in any one of the above.
According to the document chart retrieval method, the device, the electronic equipment and the storage medium, as the chart areas marked in advance in each document in the sample document set cover the title areas, the target detection model can accurately output chart position information and title position information of the target document, and accurately determine the title corresponding to each chart based on the chart position information and the title position information, so that target chart information containing retrieval key point information can be accurately and efficiently screened out from a large number of charts based on the title corresponding to the chart, and the retrieval efficiency and the retrieval precision of the document chart information are improved.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of a document chart retrieval method provided by the invention;
FIG. 2 is a schematic flow chart of a chart information extraction method provided by the invention;
FIG. 3 is a schematic diagram of a document chart retrieving apparatus according to the present invention;
FIG. 4 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the current state of the art, chart content in a document is typically extracted by layout analysis. A common layout analysis method uses a target detection model to identify picture, table and text regions in a document (hereinafter pictures and tables are referred to collectively as "charts"). Taking table recognition as an example, the table border is first extracted from the document, the border is then used to obtain the in-border region, and optical character recognition (OCR) is performed on that region to extract the table content. The relationship between a chart and its title is not extracted, which hinders subsequent chart retrieval. Moreover, because the method relies entirely on a target detection algorithm, chart content extraction accuracy is low when charts are irregular, when a page contains multiple charts, or when a chart has no border or its border lines are incomplete.
To address this, the present invention provides a document chart retrieval method. FIG. 1 is a schematic flow chart of the document chart retrieval method provided by the invention. As shown in FIG. 1, the method comprises the following steps:
Step 110, obtaining a target document from which chart information is to be extracted.
Here, the target document may be a document containing pictures and/or tables. The chart information may be character and/or graphic information correspondingly contained within the pictures and/or tables.
The target document may be an electronic document obtained by scanning a paper document, or may be a prestored electronic document. The format of the electronic document may include PDF format, PPT format, and the like.
For example, the target document may be an electronic document obtained by OCR recognition of a paper document, or may be an electronic document obtained from a document library constructed in advance.
Step 120, inputting the target document into the trained target detection model, and outputting the chart position information and the title position information of the target document; the trained target detection model is obtained by training on a sample document set in which each pre-labeled chart region covers its title region.
Specifically, the chart position information is used to characterize the position of the chart region in the target document, which may include the length, width, center point coordinates, etc. of the chart region; the title location information is used to characterize the location of the chart's corresponding title region in the target document, which may include the length, width, center point coordinates, etc. of the title region.
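As a small illustration of this representation, the sketch below converts a region given by its center point, width and height into corner coordinates. The function name and the convention that the origin is at the top left of the page with y increasing downward are assumptions.

```python
def to_corners(cx, cy, width, height):
    """Convert center/size form to (x0, y0, x1, y1) corner form."""
    half_w, half_h = width / 2, height / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


# A chart region centered at (300, 250) that is 500 wide and 300 high.
chart_box = to_corners(300, 250, 500, 300)
```

The corner form is convenient for the later overlap and distance checks, which compare region edges directly.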
The target detection model is used for detecting chart position information and title position information in a target document, and is trained on a sample document set in which chart regions are pre-labeled so as to cover their title regions. That is, each sample document in the set is labeled in advance with chart regions and title regions, and each pre-labeled chart region covers the title region associated with that chart. In other words, every sample document carries pre-labeled chart positions and title positions, and each pre-labeled chart position is bound to its corresponding title position.
For example, when a sample document contains a picture captioned "Figure 1", the caption region of that picture is labeled as a title region, and the picture region together with its caption region is labeled as a chart region.
Because the pre-labeled chart region of each sample document covers the title region, when the labeling quality of the sample document set is high (e.g., each pre-labeled chart region covers exactly one title region), the title corresponding to a chart can be extracted directly based on the target detection model. When a sample document is irregular and the labeling quality is therefore poor (e.g., a pre-labeled chart region covers multiple title regions), keywords may be used to extract the corresponding title from the chart region. In addition, because title regions are labeled in advance in the sample document set, titles can be distinguished from the body text of the document, so the body text need not be filtered and extracted, which further improves the efficiency of chart information extraction.
In some embodiments, step 120 is preceded by:
firstly, acquiring a sample document set; the sample document set comprises a plurality of sample documents, the sample documents can be electronic documents obtained by scanning paper documents or electronic documents obtained from a pre-constructed document library, and optionally, the sample documents can be PDF documents containing charts.
Then, converting each document into a page image to obtain a labeling result of each page image; the labeling result comprises a first labeling area and a second labeling area, wherein the first labeling area is used for indicating the position of the chart, the second labeling area is used for indicating the position of the title, and the second labeling area is covered by the first labeling area.
Optionally, the first labeling area may include a picture labeling area and a table labeling area. The user may use a mouse to select the pictures, tables and titles in the page image for labeling, and assign different labels and colors to the picture labeling area, the table labeling area and the second labeling area to distinguish them. For example, a picture labeling area may be assigned the label "Image" and marked in yellow, a table labeling area may be assigned the label "table" and marked in red, and a second labeling area may be assigned the label "capton" and marked in blue. In addition, the first labeling area and the second labeling area may be labeled in any order: either may be labeled first.
When labeling the second labeling area (i.e., the title labeling area), both the title number (e.g., "Figure X") and the title text must be covered, so that the corresponding title can later be extracted from the chart area by keyword (e.g., "Figure").
Optionally, if the current page image is difficult to label, for example because the chart is incomplete and has no title (e.g., a chart spanning two pages), the current page image may be left unlabeled.
After the labeling results of the page images are obtained, an initial target detection model is obtained, and each sample document containing its labeling result is input into the initial target detection model for model training, so that a trained target detection model is obtained.
The initial target detection model may be understood here as a model whose parameters are randomly initialized. It may be constructed based on a Fast Region-based Convolutional Network (Fast R-CNN), a Mask Region-based Convolutional Neural Network (Mask R-CNN), or the You Only Look Once (YOLO) object detection algorithm. To improve development efficiency, the embodiment of the invention may construct the initial target detection model using the layoutparser framework. The target detection model may detect the chart position information and the title position information using a target detection algorithm or an instance segmentation algorithm, which is not specifically limited by the embodiment of the present invention.
Optionally, after each document in the sample document set is labeled, the sample document set may be converted into COCO format and input into the initial target detection model in that format for model training.
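For illustration, the sketch below assembles labeled page regions into a COCO-style detection annotation dictionary. The category names reuse the labels mentioned above. The input layout (`file_name`, `width`, `height`, `regions`), the file name, and the category ids are hypothetical, and boxes follow the COCO convention of `[x, y, width, height]`.

```python
import json

# Category names follow the labels given in the text; the ids are assumptions.
CATEGORIES = [
    {"id": 1, "name": "Image"},
    {"id": 2, "name": "table"},
    {"id": 3, "name": "capton"},
]


def to_coco(pages):
    """Build a COCO-style dict from pages holding (label, x, y, w, h) regions."""
    name_to_id = {c["name"]: c["id"] for c in CATEGORIES}
    images, annotations = [], []
    for image_id, page in enumerate(pages, start=1):
        images.append({
            "id": image_id,
            "file_name": page["file_name"],
            "width": page["width"],
            "height": page["height"],
        })
        for label, x, y, w, h in page["regions"]:
            annotations.append({
                "id": len(annotations) + 1,
                "image_id": image_id,
                "category_id": name_to_id[label],
                "bbox": [x, y, w, h],
                "area": w * h,
                "iscrowd": 0,
            })
    return {"images": images, "annotations": annotations, "categories": CATEGORIES}


pages = [{
    "file_name": "sample_p1.png",
    "width": 1240,
    "height": 1754,
    # The chart region is drawn so that it covers its title region.
    "regions": [("Image", 100, 200, 800, 520), ("capton", 100, 680, 800, 40)],
}]
coco = to_coco(pages)
coco_json = json.dumps(coco, indent=2)
```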
After the trained target detection model is obtained, the accuracy of the target detection model can be verified by adopting a test data set, and if the accuracy is lower, the sample document can be obtained again to train the target detection model continuously until the accuracy of the target detection model meets the requirement.
In some embodiments, step 120 further comprises, after:
analyzing the underlying data of the target document to obtain the reference position information of each chart;
for each chart, if the reference position information matches the chart position information, the chart position information is retained;
if the reference position information does not match the chart position information, the reference position information and the chart position information are combined to update the chart position information.
The chart position information output by the target detection model may be missing coordinate points, in which case the chart region determined from it may be incomplete, or detected chart regions may overlap. For example, if the upper left vertex coordinates are missing from the chart position information, the upper left corner of the chart region cannot be accurately determined, resulting in an incomplete chart region.
To address this, the embodiment of the invention obtains the reference position information of each chart by analyzing the underlying data of the target document, and matches the reference position information against the chart position information to confirm or update the chart position information.
The underlying data of the target document is used for representing structural characteristics of the target document, namely typesetting, layout, style and other information of texts and charts in the target document. The reference position information of each chart refers to position information of each chart indicated by the underlying data, and the reference position information may include a length, a width, a center point coordinate, a diagonal vertex coordinate, and the like of a corresponding region of each chart.
For example, if the target document is in PDF format, the underlying data is chart coordinate data of the target document. In some cases, the underlying data may be binary data that may be parsed by parsing techniques of an underlying programming language, such as the C language, to obtain reference location information for each chart.
In some embodiments, the reference position information includes a plurality of coordinate point information of the chart region, the chart position information includes a plurality of coordinate point information of the chart region, and matching the reference position information with the chart position information may be understood as the plurality of coordinate point information included in the reference position information completely coincides with the plurality of coordinate point information included in the chart position information. The inconsistency of the matching of the reference position information with the chart position information may be understood as that the plurality of coordinate point information included in the reference position information partially intersects the plurality of coordinate point information included in the chart position information.
If the reference position information matches the chart position information, the chart position information is considered accurate and is retained. For example, suppose the reference position information includes upper left vertex coordinates 1, lower left vertex coordinates 2, upper right vertex coordinates 3, lower right vertex coordinates 4 and center point coordinates 5, and the chart position information includes the same upper left vertex coordinates 1, lower left vertex coordinates 2, upper right vertex coordinates 3, lower right vertex coordinates 4 and center point coordinates 5. The two coincide completely, i.e. they match, so the chart position information can be retained.
If the reference position information does not match the chart position information, the chart position information is considered less accurate, and the two need to be combined to update it. For example, suppose the reference position information includes upper left vertex coordinates 1, lower left vertex coordinates 2, upper right vertex coordinates 3, lower right vertex coordinates 4 and center point coordinates 5, while the chart position information includes only upper left vertex coordinates 1, lower left vertex coordinates 2, upper right vertex coordinates 3 and center point coordinates 5. The chart position information does not cover lower right vertex coordinates 4, so the two only partially intersect, i.e. they do not match. Lower right vertex coordinates 4 from the reference position information can then be merged into the chart position information to update it.
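The matching and merging described above can be sketched as follows, with each position modeled as a dictionary from a point name to its coordinates. The point names and the rule that detected values are kept while missing points are filled from the reference are assumptions; the text only states that the two are combined.

```python
def reconcile(detected, reference):
    """Return the chart position to keep after comparing with the reference."""
    if detected == reference:
        return dict(detected)        # the two match: retain as-is
    merged = dict(reference)         # start from the reference points
    merged.update(detected)          # detected values win where both exist
    return merged


reference = {
    "top_left": (50, 100),
    "bottom_left": (50, 400),
    "top_right": (550, 100),
    "bottom_right": (550, 400),
    "center": (300, 250),
}
# The detector output is missing one vertex.
detected = {k: v for k, v in reference.items() if k != "bottom_right"}
updated = reconcile(detected, reference)
```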
Step 130, analyzing the chart position information and the title position information to determine the title corresponding to each chart in the target document, and dividing to obtain chart information containing the title;
Analyzing the chart position information and the title position information yields a first area corresponding to the chart position information and a second area corresponding to the title position information. A rectangular area may be determined from the four corner vertex coordinates in the chart position information and taken as the first area; likewise, a rectangular area may be determined from the four corner vertex coordinates in the title position information and taken as the second area.
Because the accuracy of the target detection model is not 100%, the title corresponding to each chart needs to be determined by analyzing the chart position information together with the title position information. To this end, the embodiment of the invention proposes several screening conditions for the title position information, so as to determine the title position information corresponding to each chart. This improves the reliability of the finally obtained title, so that the target chart can be retrieved based on the title and the key points of document charts can be obtained more efficiently and accurately.
The screening conditions include, but are not limited to: (1) the title is not within the document page invalid region; (2) the chart position covers the title position; (3) the title is not within the chart invalid region; (4) the distance between the title and the chart meets a threshold; (5) the title content contains a preset title keyword. The screening of the title position information under these conditions is described below:
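As a small illustration of how a region can be derived from four corner vertices, the sketch below builds an axis-aligned rectangle from the minimum and maximum coordinates. The `Region` class and `region_from_vertices` function are hypothetical names introduced for this example only.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Region:
    """Axis-aligned rectangle given by its two extreme corners."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def region_from_vertices(vertices: Iterable[Tuple[float, float]]) -> Region:
    """Build the rectangle spanned by the given corner vertices."""
    xs, ys = zip(*vertices)
    return Region(min(xs), min(ys), max(xs), max(ys))
```

The same helper can serve for both the first area (chart) and the second area (title), since both are described by four corner coordinates.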
(1) The title is not within the document page invalid region:
Specifically, in practical applications the contents of header and footer areas may be mistakenly identified as titles. The embodiment of the invention therefore proposes obtaining the page height information of each page of the target document and determining a page invalid region for each page from that page's height information, so as to preliminarily filter out invalid title position information. The page invalid region includes a first invalid region and a second invalid region; the first invalid region may be located at the top of the page (e.g., the header area), and the second invalid region may be located at the bottom of the page (e.g., the footer area). It should be noted that, since the page height may differ from page to page, the page invalid region is determined separately from the height information of each page, i.e. different pages may have different page invalid regions.
Optionally, the area within 1/16 of the page height from the top edge of the page is taken as the first invalid region, and the area within 1/16 of the page height from the bottom edge of the page is taken as the second invalid region; the embodiment of the invention does not limit the specific extent of the page invalid region.
If the second area corresponding to a piece of title position information overlaps the page invalid region, that title position information is filtered out. The second area is considered to overlap the page invalid region in either of the following cases: (a) the second area intersects the page invalid region, i.e. there is an overlapping part; (b) the second area intersects the page invalid region and the intersection is greater than or equal to a preset page overlap threshold, such as 5%, 10% or 20%, where the specific value may be set according to the precision required by the actual application.
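The header and footer filter can be sketched as below. This is only an illustration under stated assumptions: boxes are `(x0, y0, x1, y1)` with `y` growing downward from the top of the page, the invalid bands span the full page width, and the overlap threshold is interpreted as a fraction of the title's own height (the text does not say what the percentage is relative to). The function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


def page_invalid_bands(page_height: float, fraction: float = 1 / 16) -> List[Tuple[float, float]]:
    """Return the (top, bottom) y-ranges of the header and footer bands."""
    band = page_height * fraction
    return [(0.0, band), (page_height - band, page_height)]


def in_page_invalid_region(title: Box, page_height: float,
                           min_ratio: float = 0.0, fraction: float = 1 / 16) -> bool:
    """True if the title box overlaps a header/footer band by at least min_ratio."""
    _, y0, _, y1 = title
    height = y1 - y0
    if height <= 0:
        return False
    for top, bottom in page_invalid_bands(page_height, fraction):
        overlap = min(y1, bottom) - max(y0, top)
        # With min_ratio = 0.0 any positive overlap counts (case (a));
        # a positive min_ratio corresponds to the thresholded case (b).
        if overlap > 0 and overlap / height >= min_ratio:
            return True
    return False
```

Because the bands are computed from each page's own height, pages of different sizes get different invalid regions, as the text requires.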
(2) The chart position overlays the title position:
Specifically, in order to bind each chart accurately to its corresponding title, the embodiment of the invention proposes calculating, for each page of the target document, the region overlapping degree between the first region and the second region, and using it to determine whether the first region contains the second region, where region overlapping degree = (area of the overlap between the first region and the second region / area of the second region) × 100%.
If the region overlapping degree is greater than or equal to a preset overlapping degree threshold (for example, the threshold may be set to 50%), the first region is considered to contain the second region, i.e. the chart region contains the title region, which is consistent with the pre-labeling strategy of the invention. The title position information corresponding to the second region can then be taken as a candidate title position of the corresponding chart, where the corresponding chart is the chart whose first region is currently being analyzed.
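The overlapping-degree formula above translates directly into code. The following is a minimal sketch; the box layout `(x0, y0, x1, y1)` and the function names are assumptions made for illustration, while the 50% default mirrors the example threshold in the text.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def overlap_degree(chart: Box, title: Box) -> float:
    """Overlap area divided by the title area, as a percentage."""
    width = min(chart[2], title[2]) - max(chart[0], title[0])
    height = min(chart[3], title[3]) - max(chart[1], title[1])
    intersection = max(0.0, width) * max(0.0, height)
    title_area = (title[2] - title[0]) * (title[3] - title[1])
    return 100.0 * intersection / title_area if title_area > 0 else 0.0


def is_candidate_title(chart: Box, title: Box, threshold: float = 50.0) -> bool:
    """A title is a candidate when the chart region covers enough of it."""
    return overlap_degree(chart, title) >= threshold
```

Normalizing by the title area rather than the union means a small title fully inside a large chart scores 100%, which is the intended "chart contains title" test.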
(3) The title is not within the chart invalid region:
Specifically, once the first area is determined to contain the second area, it may be further determined whether the second area lies within the chart invalid region; if so, it is filtered out. More specifically, the chart invalid region may be determined from the chart height information in the chart position information. Statistics over a large number of charts show that the central area of a chart usually records key information and is very unlikely to contain the title. The embodiment of the invention therefore adds a further screening condition, namely that the title cannot appear in the central area of the chart, so the central area of the chart can be set as the chart invalid region. For example, taking the center point of the first region as the origin O, the region extending ±30% of the chart height in the vertical direction and spanning the full chart width in the horizontal direction is taken as the chart invalid region. As another example, the region between 20% and 80% of the chart height is taken as the chart invalid region.
If the second area corresponding to a candidate title position overlaps the chart invalid region, that candidate title position is filtered out, and the candidates that are left form the remaining candidate title positions. The second area is considered to overlap the chart invalid region in either of the following cases: (a) the second area intersects the chart invalid region, i.e. there is an overlapping part; (b) the second area intersects the chart invalid region and the intersection is greater than a preset chart overlap threshold, such as 5%, 10% or 20%, where the specific value may be set according to the precision required by the actual application.
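A sketch of the central-band filter follows, assuming boxes are `(x0, y0, x1, y1)` and using the simple "any intersection" rule of case (a). Note that ±30% of the chart height around the center is the same band as 20% to 80% of the chart height, so the two examples in the text coincide. All names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def chart_invalid_region(chart: Box, half_extent: float = 0.30) -> Box:
    """Central band: full chart width, center +/- half_extent of the height."""
    x0, y0, x1, y1 = chart
    center_y = (y0 + y1) / 2
    half = half_extent * (y1 - y0)
    return (x0, center_y - half, x1, center_y + half)


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share a region of positive area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def remaining_candidates(chart: Box, candidates: List[Box]) -> List[Box]:
    """Drop candidate title boxes that touch the chart's central band."""
    invalid = chart_invalid_region(chart)
    return [title for title in candidates if not overlaps(title, invalid)]
```

A thresholded variant (case (b)) would replace `overlaps` with an area-ratio check against the chosen percentage.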
(4) The distance between the title and the chart meets the threshold:
Specifically, after screening under the above conditions, if only one candidate title position remains, it is taken as the target title position of the corresponding chart; if no candidate title position remains, the corresponding chart is discarded; if two or more candidate title positions remain, the distance value between each remaining candidate title position and the corresponding chart position information is calculated, and the title associated with the chart is determined by analyzing the distance values.
The distance value may be the Euclidean distance between each remaining candidate title and the corresponding chart, and the distance analysis may also take into account the orientation of the title relative to the chart. For example, if the title is located toward the upper left of the chart, the distance value may be the distance between the upper-left vertex of the second region of each remaining candidate title position and the upper-left vertex of the first region of the corresponding chart position information. The smaller the distance value, the more likely the corresponding remaining candidate title position is the target title position. Therefore, once the distance values are determined, the embodiment of the invention selects the remaining candidate title position with the smallest distance value as the target title position.
For example, if chart A has remaining candidate title a and remaining candidate title b, the distance between chart A and candidate title a and the distance between chart A and candidate title b are calculated respectively. If the former is 0.2 and the latter is 0.5, candidate title a is the target title of chart A.
Further, if there is only one target title position, it is taken as the title position of the corresponding chart. If there are at least two target title positions, the target title content is recognized at each target title position. The target title content can be obtained by performing text recognition on the second area corresponding to the target title position, for example using OCR.
After the target title contents are obtained, the target title content containing a preset title keyword is selected and taken as the title corresponding to the chart. The preset title keywords include, but are not limited to, at least one of the following: "graph", "Table", "Exhibit", "Table".
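The distance-based selection can be sketched as follows, assuming boxes are `(x0, y0, x1, y1)` and measuring the Euclidean distance between upper-left vertices, as in the upper-left case described above. The function names are hypothetical, and other orientations would need a different reference vertex.

```python
import math
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def corner_distance(chart: Box, title: Box) -> float:
    """Euclidean distance between the upper-left vertices of the two boxes."""
    return math.hypot(title[0] - chart[0], title[1] - chart[1])


def select_target_title(chart: Box, candidates: List[Box]) -> Optional[Box]:
    """Return the closest remaining candidate, or None if the chart is discarded."""
    if not candidates:
        return None  # no remaining candidate: the chart is discarded
    if len(candidates) == 1:
        return candidates[0]
    return min(candidates, key=lambda title: corner_distance(chart, title))
```

When two candidates are exactly equidistant, `min` returns the first one; the text instead resolves such ties by recognizing the title content and checking keywords.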
If several target title contents all contain a preset title keyword, further screening may be performed based on the following conditions:
(1) The preset title keywords are prioritized. The priority may apply to the position at which the same preset title keyword appears, or to the different preset title keywords themselves; the embodiment of the invention does not limit this. The priority may be preset and fixed, or may be updated in real time based on the proportions observed in historical matching data. For example, regarding the position of the same preset title keyword, a keyword at the beginning of the title has higher priority than one elsewhere in the title: if target title content A is "graph xxxx" and target title content B is "xxxx graph xxxx", target title content A may be preferred as the final title. As another example, when different preset title keywords occur, if target title content C and target title content D contain different preset title keywords, and analysis of the historical matching data shows that the keyword in C has higher priority than the keyword in D, target title content C may be preferred as the final title.
(2) Within the same document, the typesetting conventions are largely fixed, so the title corresponding to the chart can be selected from several target titles according to where the titles of other charts are positioned. For example, if the titles of the other charts in the target document are all at the upper left corner of their charts, the target title located at the upper left corner of the chart may be taken as the title corresponding to the chart.
(3) Within the same document, the formats of chart titles (such as line spacing, font and font size) are generally consistent, so the title corresponding to the chart can be selected from several target titles according to the format of the titles of other charts. For example, if the titles of the other charts in the target document are set in Songti (SimSun) at size five, target title 1 is set in Songti at size five, and target title 2 is set in Songti at size four, then target title 1 can be taken as the title corresponding to the chart.
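Condition (1) above can be illustrated with a small ranking function. This is a sketch under assumptions: the keyword order in the default tuple is a hypothetical fixed priority (the text allows it to be learned from historical data), matching is case-insensitive, and a keyword at the start of the title outranks any keyword found elsewhere. Conditions (2) and (3) depend on layout and font metadata and are not covered.

```python
from typing import Iterable, Optional, Sequence, Tuple


def title_rank(text: str, keywords: Sequence[str]) -> Optional[Tuple[int, int]]:
    """Return a sort key (lower is better), or None if no keyword is present."""
    lowered = text.strip().lower()
    best: Optional[Tuple[int, int]] = None
    for priority, keyword in enumerate(keywords):
        position = lowered.find(keyword.lower())
        if position < 0:
            continue
        # First component: 0 when the keyword opens the title, 1 otherwise.
        # Second component: the keyword's own priority (its index in the list).
        key = (0 if position == 0 else 1, priority)
        if best is None or key < best:
            best = key
    return best


def pick_title(texts: Iterable[str],
               keywords: Sequence[str] = ("graph", "Table", "Exhibit")) -> Optional[str]:
    """Pick the best-ranked title among those containing a preset keyword."""
    ranked = [(title_rank(text, keywords), text) for text in texts]
    ranked = [(key, text) for key, text in ranked if key is not None]
    if not ranked:
        return None
    return min(ranked, key=lambda pair: pair[0])[1]
```

For instance, `pick_title(["xxxx graph xxxx", "graph xxxx"])` prefers the second entry because its keyword appears at the beginning.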
In addition, text recognition can be performed on the target title to obtain the target title content. If the target title content contains preset characters (for example, a run of regularly arranged dots) and preset sensitive words related to a table of contents, the target title is considered to be a table-of-contents entry, and the target title content can be deleted. If at least two charts have the same title, those charts are merged.
The embodiment of the invention thus screens the title position information using the above method, which improves the reliability of the title finally obtained for each chart, so that charts can be retrieved based on their titles and the key points of document charts can be obtained more efficiently and accurately.
It should be noted that the above order of the screening steps is only one option; the embodiment of the invention may adjust the order according to the actual situation. For example, the title position information may be screened in the following order: the chart position covers the title position; the distance between the title and the chart meets the threshold; the title is not within the document page invalid region; the title is not within the chart invalid region; the title content contains a preset title keyword.
In addition, if the region overlapping degree is smaller than the preset overlapping degree threshold, the first region does not contain the second region, i.e. the title region lies outside the chart region. In this case, all text content in the chart region is recognized, and a recognition result containing a keyword is selected as the title corresponding to the chart. If no recognition result contains a keyword, text recognition can be performed over an expanded range centered on the center point of the chart region, and a recognition result containing a keyword is selected as the title corresponding to the chart. If the title corresponding to the chart still cannot be determined, the chart is considered to have no title and is discarded.
After the title corresponding to the chart is obtained, pixel segmentation is performed on the chart containing the title to obtain the chart information. The chart information containing the title is used, when search information is received, to screen the titles in the chart information based on the search information so as to obtain the target chart information.
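The table-of-contents check and the grouping of charts that share a title can be sketched as below. The dot-leader pattern and the sensitive-word list are hypothetical choices for illustration; the text only says that both preset characters and preset sensitive words must be present. The sketch groups chart references by title and leaves the actual merging of chart images unspecified, as the text does.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumption: "regularly arranged dots" means a run of at least four dots,
# optionally separated by single spaces.
DOT_LEADER = re.compile(r"(?:\.\s?){4,}")
# Hypothetical sensitive words related to a table of contents.
TOC_WORDS = ("contents", "目录")


def is_toc_entry(text: str) -> bool:
    """True if the text has both a dot leader and a table-of-contents word."""
    lowered = text.lower()
    has_leader = DOT_LEADER.search(text) is not None
    has_word = any(word in lowered for word in TOC_WORDS)
    return has_leader and has_word


def group_charts_by_title(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group chart references that share the same title, preserving order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for title, chart_ref in pairs:
        groups[title].append(chart_ref)
    return dict(groups)
```

Any group with more than one chart reference is a candidate for merging, for example a table that continues across a page break.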
For example, a database may be constructed, which is configured to store the N pairs of title chart information having a one-to-one association relationship after the title screening, so that the user can quickly retrieve the corresponding target chart by using the mapping relationship between the title and the chart, thereby meeting the retrieval requirement of the user on the chart information in the designated field.
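As a rough illustration of such a title-to-chart mapping, the sketch below keeps the pairs in memory and retrieves charts by a case-insensitive substring match on the title. The `ChartIndex` class, the use of file names as chart references and the matching rule are all assumptions; a real deployment would more likely use a database or search engine.

```python
from typing import List, Tuple


class ChartIndex:
    """In-memory store of (title, chart reference) pairs."""

    def __init__(self) -> None:
        self._entries: List[Tuple[str, str]] = []

    def add(self, title: str, chart_ref: str) -> None:
        """Register one title/chart pair produced by the title screening."""
        self._entries.append((title, chart_ref))

    def search(self, query: str) -> List[Tuple[str, str]]:
        """Return all pairs whose title contains the query, ignoring case."""
        needle = query.strip().lower()
        if not needle:
            return []
        return [(title, ref) for title, ref in self._entries if needle in title.lower()]


index = ChartIndex()
index.add("Table 1: Revenue by region", "doc1_p3.png")
index.add("Exhibit 2: Market share", "doc1_p5.png")
hits = index.search("revenue")
```

Here `hits` contains only the first pair, since only its title includes the query string.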
Therefore, in the document chart retrieval method provided by the embodiment of the invention, the chart areas pre-labeled in each document of the sample document set cover the title areas, so the target detection model can accurately output the chart position information and the title position information of the target document. The title corresponding to each chart is then accurately determined from the chart position information and the title position information, so that target chart information containing the key information being searched for can be screened accurately and efficiently from a large number of charts based on their titles, improving the efficiency and precision of retrieving document chart information.
Fig. 2 is a schematic flow chart of the chart information extraction method provided by the invention. As shown in Fig. 2, a sample document in PDF format is obtained, the sample document is converted into page images, and the labeling results of the page images are obtained. After the labeling results of the page images are obtained, the text files containing the labeling results are input into an initial target detection model for model training, yielding a trained target detection model. The chart positions and title positions of the target document are then detected with the trained target detection model to obtain a detection result, which includes the chart position information and the title position information of the target document.
Then, using the method of the above embodiment, the chart position information is optimized with the underlying data of the target document to obtain optimized chart position information. Based on the optimized chart position information and the title position information, the title position information of each chart is screened using the method of the above embodiment, the title corresponding to each chart is determined, and the chart information containing the title is obtained by segmentation.
The document chart retrieval device provided by the invention is described below; the document chart retrieval device described below and the document chart retrieval method described above may be read in conjunction with each other.
Based on any of the above embodiments, fig. 3 is a schematic structural diagram of a document chart retrieving apparatus provided by the present invention, as shown in fig. 3, the apparatus includes:
an acquisition unit 310 for acquiring a target document of chart information to be extracted;
a detection unit 320 for inputting the target document into the trained target detection model and outputting the chart position information and the title position information of the target document; the trained target detection model is trained on a sample document set in which the pre-labeled chart areas cover the title areas;
an analysis unit 330 for analyzing the chart position information and the title position information to determine the title corresponding to each chart in the target document, and segmenting to obtain chart information containing the title;
wherein the chart information containing the title is used, when search information is received, to screen the titles in the chart information based on the search information to obtain target chart information.
Based on any of the above embodiments, the apparatus further comprises:
a parsing unit for parsing the underlying data of the target document, after the target document is input into the trained target detection model and the chart position information and the title position information of the target document are output, to obtain the reference position information of each chart;
a first matching unit for retaining, for each chart, the chart position information if the reference position information matches the chart position information;
and a second matching unit for combining the reference position information and the chart position information to update the chart position information if the reference position information does not match the chart position information.
Based on any of the above embodiments, the analysis unit 330 includes:
the area determining unit is used for analyzing the chart position information and the title position information to obtain a first area corresponding to the chart position information and a second area corresponding to the title position information;
an overlap calculation unit configured to calculate, for each page of the target document, a region overlap between the first region and the second region;
the screening unit is used for determining the title position information corresponding to the second area as the candidate title position of the corresponding chart if the area overlapping degree is greater than or equal to a preset overlapping degree threshold value;
a dividing unit for determining the title corresponding to each chart in the target document based on the candidate title positions, and segmenting to obtain chart information containing the title.
Based on any of the above embodiments, the apparatus further comprises:
A height information determining unit for acquiring page height information of each page of the target document before calculating a region overlapping degree between the first region and the second region for each page of the target document;
a first invalid region determining unit for determining a page invalid region of each page based on the page height information;
and the first screening unit is used for screening out the title position information if the second area corresponding to the title position information is overlapped with the page invalid area.
Based on any of the above embodiments, the dividing unit includes:
a second invalid region determining unit for determining the chart height information of the chart, so as to determine the chart invalid region according to the chart height information;
a second screening unit for filtering out a candidate title position if the second area corresponding to that candidate title position overlaps the chart invalid region, so as to obtain the remaining candidate title positions;
and a first determining unit for determining the title corresponding to each chart in the target document according to the remaining candidate title positions, and segmenting to obtain chart information containing the title.
Based on any one of the above embodiments, the first determining unit includes:
A distance calculating unit, configured to calculate a distance value between each remaining candidate title position and the corresponding chart position information, respectively, if the remaining candidate title positions include at least two;
and a second determining unit for selecting the target title position from the remaining candidate title positions based on the distance values, so as to determine the title corresponding to each chart in the target document, and segmenting to obtain chart information containing the title.
Based on any one of the above embodiments, the second determining unit includes:
a target title determining unit for selecting the remaining candidate title position with the smallest distance value as the target title position;
an identifying unit for recognizing the target title content at each target title position if there are at least two target title positions;
a title screening unit for selecting the target title content containing a preset title keyword and taking it as the title corresponding to the chart;
and the pixel segmentation unit is used for carrying out pixel segmentation on the chart containing the title to obtain chart information.
Based on any of the above embodiments, the apparatus further comprises:
a sample acquisition unit for acquiring a sample document set before inputting the target document into the trained target detection model and outputting the chart position information and the title position information of the target document; wherein the sample document set comprises a plurality of sample documents;
a conversion unit for converting each sample document into page images so as to obtain the labeling result of each page image; the labeling result comprises a first labeling area for the chart and a second labeling area for the title, and the second labeling area is covered by the first labeling area;
a training unit for acquiring an initial target detection model and inputting the text files containing the labeling results into the initial target detection model for model training, to obtain the trained target detection model.
Based on any of the above embodiments, the identifying unit includes:
the content identification unit is used for identifying and obtaining target title contents based on the positions of the target titles;
and the catalog screening unit is used for deleting the target title content when the target title content contains preset characters and preset sensitive words related to the catalog.
Based on any of the above embodiments, the apparatus further comprises:
and the fusion unit is used for fusing at least two charts under the condition that the titles corresponding to the at least two charts are the same after analyzing the chart position information and the title position information to determine the title corresponding to each chart in the target document.
Fig. 4 is a schematic structural diagram of an electronic device provided by the present invention. As shown in Fig. 4, the electronic device may include: a processor 410, a memory 420, a communication interface (Communications Interface) 430 and a communication bus 440, wherein the processor 410, the memory 420 and the communication interface 430 communicate with each other via the communication bus 440. The processor 410 may invoke logic instructions in the memory 420 to perform a document chart retrieval method comprising: acquiring a target document from which chart information is to be extracted; inputting the target document into a trained target detection model, and outputting chart position information and title position information of the target document, wherein the trained target detection model is trained on a sample document set in which the pre-labeled chart areas cover the title areas; analyzing the chart position information and the title position information to determine the title corresponding to each chart in the target document, and segmenting to obtain chart information containing the title; wherein the chart information containing the title is used, when search information is received, to screen the titles in the chart information based on the search information to obtain target chart information.
Further, the logic instructions in the memory 420 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the document chart retrieval method provided above, the method comprising: acquiring a target document from which chart information is to be extracted; inputting the target document into a trained target detection model, and outputting chart position information and title position information of the target document, wherein the trained target detection model is trained on a sample document set in which the pre-labeled chart areas cover the title areas; analyzing the chart position information and the title position information to determine the title corresponding to each chart in the target document, and segmenting to obtain chart information containing the title; wherein the chart information containing the title is used, when search information is received, to screen the titles in the chart information based on the search information to obtain target chart information.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the document chart retrieval method provided above, the method comprising: acquiring a target document from which chart information is to be extracted; inputting the target document into a trained target detection model, and outputting chart position information and title position information of the target document, wherein the trained target detection model is trained on a sample document set in which the pre-labeled chart areas cover the title areas; analyzing the chart position information and the title position information to determine the title corresponding to each chart in the target document, and segmenting to obtain chart information containing the title; wherein the chart information containing the title is used, when search information is received, to screen the titles in the chart information based on the search information to obtain target chart information.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A document chart retrieval method, comprising:
acquiring a target document from which chart information is to be extracted;
inputting the target document into a trained target detection model, and outputting chart position information and title position information of the target document; the trained target detection model is trained on a sample document set in which the pre-labeled chart areas cover the title areas;
analyzing the chart position information and the title position information to determine a title corresponding to each chart in the target document, and segmenting to obtain chart information containing the title;
wherein the chart information containing the title is used, when search information is received, to screen the titles in the chart information based on the search information to obtain target chart information.
2. The document chart retrieval method according to claim 1, further comprising, after the inputting the target document into a trained target detection model and outputting chart position information and title position information of the target document:
analyzing the underlying data of the target document to obtain reference position information of each chart;
for each chart, if the reference position information matches the chart position information, retaining the chart position information;
and if the reference position information is inconsistent with the chart position information, combining the reference position information and the chart position information to update the chart position information.
3. The document chart retrieval method according to claim 1 or 2, wherein the analyzing the chart position information and the title position information to determine a title corresponding to each chart in the target document, and dividing to obtain chart information containing the title, comprises:
analyzing the chart position information and the title position information to obtain a first area corresponding to the chart position information and a second area corresponding to the title position information;
calculating the region overlapping degree between the first region and the second region for each page of the target document;
if the region overlapping degree is greater than or equal to a preset overlapping degree threshold value, determining the title position information corresponding to the second region as a candidate title position of a corresponding chart;
and determining a title corresponding to each chart in the target document based on the candidate title position, and dividing to obtain chart information containing the title.
4. The document chart retrieval method according to claim 3, further comprising, before the calculating the region overlapping degree between the first region and the second region for each page of the target document:
acquiring page height information of each page of the target document;
determining a page invalid region of each page based on the page height information;
and if the second area corresponding to the title position information overlaps the page invalid region, discarding the title position information.
5. The document chart retrieval method according to claim 3, wherein the determining a title corresponding to each chart in the target document based on the candidate title position, and dividing to obtain chart information containing the title, comprises:
determining chart height information of the chart so as to determine a chart invalid area according to the chart height information;
if the second area corresponding to the candidate title position overlaps the chart invalid area, discarding that candidate title position to obtain the remaining candidate title positions;
and determining the title corresponding to each chart in the target document according to the remaining candidate title positions, and dividing to obtain chart information containing the title.
6. The document chart retrieval method according to claim 5, wherein the determining the title corresponding to each chart in the target document according to the remaining candidate title positions, and dividing to obtain chart information containing the title, comprises:
if there are at least two remaining candidate title positions, respectively calculating a distance value between each remaining candidate title position and the corresponding chart position information;
and selecting a target title position from the remaining candidate title positions based on the distance values to determine the title corresponding to each chart in the target document, and dividing to obtain chart information containing the title.
7. The document chart retrieval method according to claim 6, wherein the selecting a target title position from the remaining candidate title positions based on the distance values to determine the title corresponding to each chart in the target document, and dividing to obtain chart information containing the title, comprises:
selecting the remaining candidate title position corresponding to the minimum distance value as the target title position;
if the target title positions comprise at least two, identifying and obtaining target title contents based on each target title position;
selecting the target title content that contains a preset title keyword as the title corresponding to the chart;
and carrying out pixel segmentation on the chart containing the title to obtain the chart information.
8. A document chart retrieval apparatus, comprising:
an acquisition unit for acquiring a target document of chart information to be extracted;
a detection unit, configured to input the target document into a trained target detection model and output chart position information and title position information of the target document; the trained target detection model is obtained by training on a sample document set in which chart areas are pre-annotated so as to cover their title areas;
an analysis unit, configured to analyze the chart position information and the title position information, so as to determine a title corresponding to each chart in the target document, and to divide to obtain chart information containing the title;
wherein the chart information containing the title is used, when search information is received, for screening the titles in the chart information based on the search information to obtain target chart information.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the document chart retrieval method of any one of claims 1 to 7.
10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the document chart retrieval method of any one of claims 1 to 7.
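
For illustration only, the title-matching steps recited in claims 3, 6 and 7 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the claims do not define how the region overlapping degree or the distance value is computed, so the sketch assumes the overlapping degree is the intersection area divided by the title area, and the distance is the Euclidean distance between box centres. The names `Box`, `Title`, `overlap_degree`, `center_distance` and `pick_title`, the 0.5 threshold, and the keyword list are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    def center(self) -> tuple:
        return ((self.x0 + self.x1) / 2.0, (self.y0 + self.y1) / 2.0)


@dataclass(frozen=True)
class Title:
    """A detected title region plus its recognized text (hypothetical type)."""
    box: Box
    text: str


def overlap_degree(chart: Box, title: Box) -> float:
    """Assumed definition: intersection area divided by the title area."""
    w = min(chart.x1, title.x1) - max(chart.x0, title.x0)
    h = min(chart.y1, title.y1) - max(chart.y0, title.y0)
    if w <= 0 or h <= 0:
        return 0.0
    title_area = title.area()
    return (w * h) / title_area if title_area > 0 else 0.0


def center_distance(chart: Box, title: Box) -> float:
    """Assumed definition: Euclidean distance between the two box centres."""
    cx, cy = chart.center()
    tx, ty = title.center()
    return math.hypot(tx - cx, ty - cy)


def pick_title(
    chart: Box,
    titles: List[Title],
    threshold: float = 0.5,  # assumed preset overlapping degree threshold
    keywords: Sequence[str] = ("Figure", "Fig.", "Table"),  # assumed keywords
) -> Optional[Title]:
    """Select the title for one chart on one page.

    1. Keep titles whose overlapping degree reaches the threshold (claim 3).
    2. Among those, keep the ones at the minimum distance (claims 6 and 7).
    3. If several remain, prefer text containing a preset keyword (claim 7).
    """
    candidates = [t for t in titles if overlap_degree(chart, t.box) >= threshold]
    if not candidates:
        return None
    distances = [center_distance(chart, t.box) for t in candidates]
    best = min(distances)
    nearest = [t for t, d in zip(candidates, distances) if d - best <= 1e-9]
    if len(nearest) > 1:
        keyed = [t for t in nearest if any(k in t.text for k in keywords)]
        if keyed:
            nearest = keyed
    return nearest[0]
```

In this sketch, a chart box `Box(0, 0, 100, 100)` with two equally distant overlapping titles, one reading "Results" and one reading "Figure 1. Dose response", resolves to the second because only it contains an assumed keyword. The invalid-region filters of claims 4 and 5 are deliberately left out and would run before step 2.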
CN202310685076.1A 2023-06-09 2023-06-09 Document chart retrieval method, device, electronic equipment and storage medium Pending CN116910292A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310685076.1A CN116910292A (en) 2023-06-09 2023-06-09 Document chart retrieval method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116910292A true CN116910292A (en) 2023-10-20

Family

ID=88357211

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310685076.1A Pending CN116910292A (en) 2023-06-09 2023-06-09 Document chart retrieval method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116910292A (en)

Similar Documents

Publication Publication Date Title
US20210019287A1 (en) Systems and methods for populating a structured database based on an image representation of a data table
US20210073531A1 (en) Multi-page document recognition in document capture
US10824801B2 (en) Interactively predicting fields in a form
JP3640972B2 (en) A device that decodes or interprets documents
US7149347B1 (en) Machine learning of document templates for data extraction
US7120318B2 (en) Automatic document reading system for technical drawings
US11232300B2 (en) System and method for automatic detection and verification of optical character recognition data
US10489645B2 (en) System and method for automatic detection and verification of optical character recognition data
CN112434691A (en) HS code matching and displaying method and system based on intelligent analysis and identification and storage medium
WO2007117334A2 (en) Document analysis system for integration of paper records into a searchable electronic database
CN111274239A (en) Test paper structuralization processing method, device and equipment
CN111753120A (en) Method and device for searching questions, electronic equipment and storage medium
CN111860487B (en) Inscription marking detection and recognition system based on deep neural network
CN111985462A (en) Ancient character detection, identification and retrieval system based on deep neural network
CN109726369A (en) A kind of intelligent template questions record Implementation Technology based on normative document
CN113610068B (en) Test question disassembling method, system, storage medium and equipment based on test paper image
JP3711636B2 (en) Information retrieval apparatus and method
US20030108243A1 (en) Adaptive technology for automatic document analysis
CN116910292A (en) Document chart retrieval method, device, electronic equipment and storage medium
CN115050025A (en) Knowledge point extraction method and device based on formula recognition
US20150178966A1 (en) System and method to check the correct rendering of a font
CN115063784A (en) Bill image information extraction method and device, storage medium and electronic equipment
CN110727820B (en) Method and system for obtaining label for picture
JPH0743718B2 (en) Multimedia document structuring method
JP3750406B2 (en) Document filing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination