CN107688789B - Document chart extraction method, electronic device and computer readable storage medium - Google Patents

Document chart extraction method, electronic device and computer readable storage medium

Info

Publication number
CN107688789B
Authority
CN
China
Prior art keywords
picture
chart
document
page
characters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710776354.9A
Other languages
Chinese (zh)
Other versions
CN107688789A (en)
Inventor
王鸿滨
王晓伟
汪伟
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201710776354.9A (CN107688789B)
Priority to PCT/CN2017/108810 (WO2019041527A1)
Publication of CN107688789A
Application granted
Publication of CN107688789B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Abstract

The invention discloses a document chart extraction method comprising the following steps: acquiring position information of all characters in a specified document; for each page in the specified document, generating a blank picture with the same size as the page, converting all characters in the page into pixel points in the blank picture according to their position information, and rendering the blank picture; and marking the blank area in the rendered picture as a candidate chart area, and extracting chart information from the candidate chart area as the chart extracted from the specified document page. The invention can improve the efficiency and coverage of chart extraction.

Description

Document chart extraction method, electronic device and computer readable storage medium
Technical Field
The invention relates to the technical field of computer information, in particular to a document chart extraction method, an electronic device and a computer-readable storage medium.
Background
Most existing PDF chart extraction tools and programs operate on the objects stored in the PDF, so they can only extract charts that are stored as a single picture object. PDF documents, however, also contain other chart information (such as Office charts), and these charts intuitively express part of the information in the document. Conventional PDF chart extraction tools and programs cannot accurately extract a chart composed of multiple parts, such as an Office chart. The prior-art document chart extraction methods are therefore inadequately designed and urgently need improvement.
Disclosure of Invention
In view of this, the invention provides a document chart extraction method, an electronic device and a computer-readable storage medium, which extract charts from a PDF document by page rendering, so as to improve the efficiency and coverage of chart extraction.
To achieve the above object, the present invention provides an electronic device comprising a memory and a processor, wherein the memory stores a document chart extraction system operable on the processor, and the document chart extraction system implements the following steps when executed by the processor:
acquiring position information of all characters in a specified document;
for each page in the specified document, generating a blank picture with the same size as the page, converting all characters in the page into pixel points in the blank picture according to the position information of all characters in the page, and rendering the blank picture; and
marking the blank area in the rendered picture as a candidate chart area, and extracting chart information from the candidate chart area as the chart extracted from the specified document page.
Preferably, the rendering of the blank picture includes: for all pixel point positions in the blank picture, rendering the pixel point positions occupied by characters black, and keeping the pixel point positions not occupied by characters white.
Preferably, the document chart extraction system, when executed by the processor, further implements the following step:
processing the rendered picture by an image morphology processing method so that the boundaries of the text information in the rendered picture become distinct.
Preferably, the extracting of the chart information from the candidate chart area includes:
converting the marked candidate chart area into a picture, screening the converted picture through pixel distribution analysis, and selecting the picture containing chart information as the chart extracted from the specified document page.
Preferably, the screening of the converted picture through pixel distribution analysis includes:
performing grayscale processing on the converted picture to convert it into a grayscale image;
counting, row by row, the number and proportion of black pixel points in the grayscale image, and if the number and proportion of black pixel points in a row exceed specified thresholds, judging that the row contains specific content; and
counting the number of rows containing specific content, and if that number is greater than or equal to a set threshold, judging that the converted picture is a picture containing chart information.
In addition, to achieve the above object, the present invention further provides a document chart extraction method applied to an electronic device, the method including:
acquiring position information of all characters in a specified document;
for each page in the specified document, generating a blank picture with the same size as the page, converting all characters in the page into pixel points in the blank picture according to the position information of all characters in the page, and rendering the blank picture; and
marking the blank area in the rendered picture as a candidate chart area, and extracting chart information from the candidate chart area as the chart extracted from the specified document page.
Preferably, the rendering of the blank picture includes:
for all pixel point positions in the blank picture, rendering the pixel point positions occupied by characters black, and keeping the pixel point positions not occupied by characters white;
the document chart extraction method further comprises the following step:
processing the rendered picture by an image morphology processing method so that the boundaries of the text information in the rendered picture become distinct.
Preferably, the extracting of the chart information from the candidate chart area includes:
converting the marked candidate chart area into a picture, screening the converted picture through pixel distribution analysis, and selecting the picture containing chart information as the chart extracted from the specified document page.
Preferably, the screening of the converted picture through pixel distribution analysis includes:
performing grayscale processing on the converted picture to convert it into a grayscale image;
counting, row by row, the number and proportion of black pixel points in the grayscale image, and if the number and proportion of black pixel points in a row exceed specified thresholds, judging that the row contains specific content; and
counting the number of rows containing specific content, and if that number is greater than or equal to a set threshold, judging that the converted picture is a picture containing chart information.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium storing a document chart extraction system, which is executable by at least one processor to cause the at least one processor to perform the steps of the document chart extraction method described above.
Compared with the prior art, the electronic device, the document chart extraction method and the computer-readable storage medium provided by the invention extract charts from a PDF document by page rendering. They can extract the charts that conventional methods can extract as well as charts composed of multiple parts, such as Office charts, that conventional methods cannot, thereby improving the efficiency and coverage of chart extraction.
Drawings
FIG. 1 is a diagram of an alternative hardware architecture for an electronic device of the present invention;
FIG. 2 is a program module diagram of an embodiment of the document chart extraction system in the electronic device of the present invention;
FIG. 3 is a flowchart illustrating an embodiment of a document chart extraction method according to the present invention.
Reference numerals:
Electronic device 2
Memory 21
Processor 22
Network interface 23
Document chart extraction system 20
Acquisition module 201
Rendering module 202
Extraction module 203
Process steps S31-S33
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that descriptions involving "first", "second", etc. in the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, provided that the combination can be realized by a person skilled in the art; when a combination of technical solutions is contradictory or cannot be realized, the combination should be considered not to exist and falls outside the protection scope of the present invention.
It is further noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
First, the present invention provides an electronic device 2.
Fig. 1 is a schematic diagram of an alternative hardware architecture of the electronic device 2 according to the present invention. In this embodiment, the electronic device 2 may include, but is not limited to, a memory 21, a processor 22, and a network interface 23, which may be communicatively connected to each other through a system bus. It is noted that fig. 1 only shows the electronic device 2 with components 21-23, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
The electronic device 2 may be a rack server, a blade server, a tower server, or a cabinet server, and it may be an independent server or a server cluster formed by a plurality of servers.
The memory 21 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 21 may be an internal storage unit of the electronic device 2, such as a hard disk or a memory of the electronic device 2. In other embodiments, the memory 21 may also be an external storage device of the electronic device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like provided on the electronic device 2. Of course, the memory 21 may also comprise both an internal storage unit and an external storage device of the electronic device 2. In this embodiment, the memory 21 is generally used for storing the operating system and the various application software installed in the electronic device 2, such as the program code of the document chart extraction system 20. Further, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 22 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 22 is generally configured to control the overall operation of the electronic device 2, such as performing control and processing related to data interaction or communication with the electronic device 2. In this embodiment, the processor 22 is configured to run the program code or process the data stored in the memory 21, for example, to run the document chart extraction system 20.
The network interface 23 may comprise a wireless network interface or a wired network interface, and is generally used for establishing a communication connection between the electronic device 2 and other electronic devices. For example, the network interface 23 is used to connect the electronic device 2 with an external data platform through a network, and to establish a data transmission channel and a communication connection between the electronic device 2 and the external data platform. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, and the like.
The application environment and the hardware structure and function of the related devices of the various embodiments of the present invention have been described in detail so far. Hereinafter, various embodiments of the present invention will be proposed based on the above-described application environment and related devices.
Referring to FIG. 2, a program module diagram of the document chart extraction system 20 of the electronic device 2 according to an embodiment of the invention is shown. In this embodiment, the document chart extraction system 20 may be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (in this embodiment, the processor 22) to implement the present invention. For example, in FIG. 2, the document chart extraction system 20 may be divided into an acquisition module 201, a rendering module 202, and an extraction module 203. The program modules referred to in the present invention are series of computer program instruction segments capable of performing specific functions, and are better suited than whole programs for describing the execution process of the document chart extraction system 20 in the electronic device 2. The functions of the program modules 201-203 are described in detail below.
The acquisition module 201 is configured to obtain position information of all characters in a specified document (e.g., a PDF document). In this embodiment, a specific text recognition tool (e.g., the pdf2html tool) can be used to obtain the position information of all the text in the specified document. The text recognition tool parses the PDF document into a text file and, at the same time, parses out the specific position information of each line of text in the PDF document (such as the coordinates of its upper-left corner and the length and width of the line).
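One way to hold the per-line position information described above is a small record type. The following is a minimal sketch only: the class name `TextLine`, its field names, and the `scale` parameter are hypothetical and are not taken from the pdf2html tool or from the patent; it merely shows how an upper-left corner plus length and width can be turned into a pixel box.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TextLine:
    """Position of one line of text on a page (hypothetical record layout)."""
    page: int      # zero-based page index
    left: float    # x coordinate of the upper-left corner
    top: float     # y coordinate of the upper-left corner
    width: float   # horizontal extent (length) of the line
    height: float  # vertical extent (width) of the line

    def pixel_box(self, scale: float = 1.0) -> Tuple[int, int, int, int]:
        """Return (x0, y0, x1, y1) in pixels; x1 and y1 are exclusive."""
        x0 = int(self.left * scale)
        y0 = int(self.top * scale)
        # Round the far edges to the nearest pixel.
        x1 = int((self.left + self.width) * scale + 0.5)
        y1 = int((self.top + self.height) * scale + 0.5)
        return x0, y0, x1, y1


line = TextLine(page=0, left=72.0, top=100.0, width=300.0, height=12.0)
print(line.pixel_box())
```

The `scale` factor is an assumption that covers the case where the page units reported by the parsing tool differ from the pixel grid of the rendered picture.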
The rendering module 202 is configured to generate a blank picture with the same size as the page of the specified document for each page in the specified document, convert all the characters in the page into pixel points in the blank picture according to the position information of all the characters in the page, and render the blank picture.
In this embodiment, the specific position that each line of characters occupies on the specified document page can be determined from the position information of all the characters. The rendering of the blank picture comprises: for all pixel point positions in the blank picture, rendering the pixel point positions occupied by characters black, and keeping the pixel point positions not occupied by characters white. After this rendering, the blank picture contains only two colors, black and white, where black marks the text areas and white marks the non-text areas.
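The rendering step can be sketched as follows. This is an illustrative sketch rather than the patented implementation: it uses plain Python lists so that it runs without third-party packages, and the function name `render_text_mask` and the `(x0, y0, x1, y1)` box format are assumptions made for the example.

```python
def render_text_mask(page_width, page_height, text_boxes):
    """Return a page-sized grid: 0 (black) where text lies, 255 (white) elsewhere.

    text_boxes is an iterable of (x0, y0, x1, y1) pixel boxes, x1/y1 exclusive.
    """
    WHITE, BLACK = 255, 0
    # Start from a blank (all-white) picture the same size as the page.
    mask = [[WHITE] * page_width for _ in range(page_height)]
    for x0, y0, x1, y1 in text_boxes:
        # Clip each box to the page so out-of-range coordinates are ignored.
        x0, y0 = max(0, x0), max(0, y0)
        x1, y1 = min(page_width, x1), min(page_height, y1)
        for y in range(y0, y1):
            for x in range(x0, x1):
                mask[y][x] = BLACK
    return mask


# Tiny 8x6 "page" with two lines of text.
mask = render_text_mask(8, 6, [(1, 1, 5, 2), (1, 4, 7, 5)])
for row in mask:
    print("".join("#" if v == 0 else "." for v in row))
```

In practice an array library would normally replace the nested lists; the list-based form is used here only to keep the example self-contained.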
Further, in other embodiments, the rendering module 202 is further configured to process the rendered picture by an image morphology processing method (such as dilation, erosion, and the like) so that the boundaries of the text information in the rendered picture become distinct.
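As an illustration of the morphological step, the sketch below grows the black text regions with a rectangular neighbourhood so that nearby text boxes merge into solid blocks. The function name `dilate_text` and the radius parameters are assumptions for this example; the patent does not specify a structuring element or its size. Note that libraries such as OpenCV treat bright pixels as foreground, so growing black text on a white page corresponds to an erosion call there.

```python
def dilate_text(mask, radius_x=1, radius_y=1):
    """Grow black (0) regions by a (2*radius_x+1) x (2*radius_y+1) rectangle."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    out = [[255] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mask[y][x] != 0:
                continue
            # Paint the neighbourhood of every text pixel black, clipped to the page.
            for ny in range(max(0, y - radius_y), min(height, y + radius_y + 1)):
                for nx in range(max(0, x - radius_x), min(width, x + radius_x + 1)):
                    out[ny][nx] = 0
    return out


# Two text pixels separated by a one-pixel gap merge into one block.
print(dilate_text([[255, 0, 255, 0, 255]], radius_x=1, radius_y=0))
```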
The extraction module 203 is configured to mark a blank area (i.e., a non-text area) in the rendered picture as a candidate chart area, and extract chart information from the candidate chart area as a chart extracted from the specified document page.
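The text does not say how the blank areas are located in the rendered picture. One simple possibility, shown purely as an assumption, is to scan the black-and-white mask for runs of consecutive rows that contain no text pixels and treat each sufficiently tall run as a candidate chart area. The function name `find_blank_bands` and the `min_height` parameter are hypothetical.

```python
def find_blank_bands(mask, min_height=3):
    """Return (top, bottom) row ranges, bottom exclusive, that contain no text pixels."""
    bands = []
    start = None  # first row of the blank run currently being tracked
    for y, row in enumerate(mask):
        is_blank = all(v == 255 for v in row)
        if is_blank and start is None:
            start = y
        elif not is_blank and start is not None:
            if y - start >= min_height:
                bands.append((start, y))
            start = None
    # Close a blank run that extends to the bottom of the page.
    if start is not None and len(mask) - start >= min_height:
        bands.append((start, len(mask)))
    return bands


TEXT, BLANK = [0, 0, 0, 0], [255, 255, 255, 255]
page = [TEXT, BLANK, BLANK, BLANK, BLANK, TEXT, BLANK]
print(find_blank_bands(page, min_height=3))
```

Full-width horizontal bands are only one choice; pages with multi-column layouts would need a region search that also splits along columns.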
Preferably, in this embodiment, the extracting of the chart information from the candidate chart region includes:
and converting the marked candidate chart area into a picture, screening the converted picture through pixel distribution analysis (or content richness analysis), and selecting the picture containing chart information (such as PDF chart information) as the chart extracted from the specified document page. In this embodiment, a specific picture processing tool (e.g., imagemap tool) may be used to convert the labeled candidate chart regions into pictures.
Specifically, the step of screening out a picture containing chart information from the converted picture through pixel distribution analysis includes:
(1) The converted picture is subjected to grayscale processing (for example, using the OpenCV module in Python) and converted into a grayscale image. In the grayscale image, each pixel point of the picture is represented as 0 or 255, where 0 represents black, a pixel point that carries information content in the picture, and 255 represents white, a blank pixel point in the picture.
(2) The number and proportion of black pixel points in the grayscale image are counted row by row, and if the number and proportion of black pixel points in a row exceed specified thresholds (e.g., the number exceeds 5 and the proportion exceeds 50%), the row is determined to contain specific content.
(3) The number of rows containing specific content is counted to judge the richness of the content in the picture: the more rows contain specific content, the richer the content of the picture. If the number of rows containing specific content is greater than or equal to a set threshold (e.g., 2 rows), the converted picture is determined to be rich in content and to be a picture containing chart information. Otherwise, if the number of rows containing specific content is less than the set threshold (e.g., 2 rows), the converted picture is determined to be insufficiently rich in content and to be a blank picture without chart information.
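The three screening steps above can be sketched as follows. The default thresholds (more than 5 black pixels, more than 50% of the row, at least 2 content rows) are the example values given in the text; the function names and the binarization cut-off of 128 are assumptions made for this illustration, and plain lists stand in for an image array.

```python
def binarize(gray, threshold=128):
    """Map each gray value to 0 (black) or 255 (white); the cut-off is an assumption."""
    return [[0 if v < threshold else 255 for v in row] for row in gray]


def row_has_content(row, min_count=5, min_ratio=0.5):
    """A row counts as content when both the count and the share of black pixels exceed the thresholds."""
    if not row:
        return False
    black = sum(1 for v in row if v == 0)
    return black > min_count and black / len(row) > min_ratio


def contains_chart(picture, min_rows=2, min_count=5, min_ratio=0.5):
    """Judge a 0/255 picture as a chart when enough rows contain content."""
    content_rows = sum(
        1 for row in picture if row_has_content(row, min_count, min_ratio)
    )
    return content_rows >= min_rows


content_row = [0] * 6 + [255] * 4   # 6 black pixels out of 10 (60%)
blank_row = [255] * 10
print(contains_chart([blank_row, content_row, content_row, blank_row]))
print(contains_chart([blank_row, content_row, blank_row]))
```

Requiring both the count and the proportion to exceed their thresholds follows the wording of step (2); a looser rule that accepts either condition would keep more sparse pictures, at the cost of more false positives.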
Through the program modules 201-203, the document chart extraction system 20 provided by the invention extracts charts from the PDF document by page rendering. The approach can extract the charts that conventional methods can extract as well as charts composed of multiple parts, such as Office charts, that conventional methods cannot, thereby improving the efficiency and coverage of chart extraction.
In addition, the invention also provides a document chart extraction method.
Fig. 3 is a schematic flow chart diagram illustrating an implementation of an embodiment of the document chart extraction method according to the present invention. In this embodiment, the execution order of the steps in the flowchart shown in fig. 3 may be changed and some steps may be omitted according to different requirements.
In step S31, position information of all the characters in a specified document (e.g., a PDF document) is acquired. In this embodiment, a specific text recognition tool (e.g., the pdf2html tool) can be used to obtain the position information of all the text in the specified document. The text recognition tool parses the PDF document into a text file and, at the same time, parses out the specific position information of each line of text in the PDF document (such as the coordinates of its upper-left corner and the length and width of the line).
Step S32, generating a blank picture with the same size as the page of the specified document for each page in the specified document, then converting all the characters in the page into pixel points in the blank picture according to the position information of all the characters in the page, and rendering the blank picture.
In this embodiment, the specific position that each line of characters occupies on the specified document page can be determined from the position information of all the characters. The rendering of the blank picture comprises: for all pixel point positions in the blank picture, rendering the pixel point positions occupied by characters black, and keeping the pixel point positions not occupied by characters white. After this rendering, the blank picture contains only two colors, black and white, where black marks the text areas and white marks the non-text areas.
Further, in other embodiments, step S32 further includes the following step: processing the rendered picture by an image morphology processing method (such as dilation, erosion, and the like) so that the boundaries of the text information in the rendered picture become distinct.
Step S33, mark a blank area (i.e. a non-text area) in the rendered picture as a candidate chart area, and extract chart information from the candidate chart area as a chart extracted from the specified document page.
Preferably, in this embodiment, the extracting of the chart information from the candidate chart region includes:
and converting the marked candidate chart area into a picture, screening the converted picture through pixel distribution analysis (or content richness analysis), and selecting the picture containing chart information (such as PDF chart information) as the chart extracted from the specified document page. In this embodiment, a specific picture processing tool (e.g., imagemap tool) may be used to convert the labeled candidate chart regions into pictures.
Specifically, the step of screening out a picture containing chart information from the converted picture through pixel distribution analysis includes:
(1) The converted picture is subjected to grayscale processing (for example, using the OpenCV module in Python) and converted into a grayscale image. In the grayscale image, each pixel point of the picture is represented as 0 or 255, where 0 represents black, a pixel point that carries information content in the picture, and 255 represents white, a blank pixel point in the picture.
(2) The number and proportion of black pixel points in the grayscale image are counted row by row, and if the number and proportion of black pixel points in a row exceed specified thresholds (e.g., the number exceeds 5 and the proportion exceeds 50%), the row is determined to contain specific content.
(3) The number of rows containing specific content is counted to judge the richness of the content in the picture: the more rows contain specific content, the richer the content of the picture. If the number of rows containing specific content is greater than or equal to a set threshold (e.g., 2 rows), the converted picture is determined to be rich in content and to be a picture containing chart information. Otherwise, if the number of rows containing specific content is less than the set threshold (e.g., 2 rows), the converted picture is determined to be insufficiently rich in content and to be a blank picture without chart information.
Through steps S31-S33, the document chart extraction method provided by the invention extracts charts from the PDF document by page rendering. The method can extract the charts that conventional methods can extract as well as charts composed of multiple parts, such as Office charts, that conventional methods cannot, thereby improving the efficiency and coverage of chart extraction.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium (such as ROM/RAM, magnetic disk, optical disk) storing a document chart extraction system 20, wherein the document chart extraction system 20 is executable by at least one processor 22 to cause the at least one processor 22 to execute the steps of the document chart extraction method described above.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better embodiment. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The preferred embodiments of the present invention have been described above with reference to the accompanying drawings, and are not to be construed as limiting the scope of the invention. The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments. Additionally, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different than here.
Those skilled in the art can implement the invention with various modifications without departing from its scope and spirit; for example, features of one embodiment can be used in another embodiment to yield a further embodiment. All equivalent structures or equivalent processes made using the contents of the specification and drawings of the invention, whether applied directly or indirectly in other related technical fields, are likewise included in the protection scope of the present invention.

Claims (6)

1. An electronic device comprising a memory and a processor, the memory having stored thereon a document chart extraction system operable on the processor, the document chart extraction system when executed by the processor implementing the steps of:
acquiring position information of all characters in a specified document;
for each page in the specified document, generating a blank picture with the same size as the page, converting all characters in the page into pixel points in the blank picture according to the position information of all characters in the page, and rendering the blank picture; and
marking blank areas in the rendered picture as candidate chart areas, and extracting chart information from the candidate chart areas as the chart extracted from the specified document page;
the acquiring of the position information of all characters in the specified document comprises:
parsing the specified document into a text file, and parsing out, for each line of text in the text file, the coordinates of its upper-left corner in the specified document and the length and width of the line of text;
the extracting of the chart information from the candidate chart region comprises:
converting the marked candidate chart area into a picture, screening the converted picture through pixel distribution analysis, and selecting the picture containing chart information as a chart extracted from the specified document page;
through pixel distribution analysis, screening the converted picture comprises the following steps:
carrying out gray level processing on the converted picture, and converting the converted picture into a gray level image;
counting the number and proportion of black pixel points in the gray-scale image according to a row, and if the number and proportion of the black pixel points in the row exceed a specified threshold, judging that the row contains specific content; and
and counting the number of lines containing specific content, and if the number of lines containing the specific content is greater than or equal to a set threshold value, judging that the converted picture is a picture containing chart information.
2. The electronic device of claim 1, wherein rendering the blank picture comprises: for all pixel point positions in the blank picture, rendering the pixel point positions occupied by characters black and keeping the pixel point positions not occupied by characters white.
3. The electronic device of claim 2, wherein the document chart extraction system, when executed by the processor, further implements the following step:
processing the rendered picture by an image morphology processing method so that the boundaries of the text information in the rendered picture become distinct.
4. A document chart extraction method applied to an electronic device, characterized in that the method comprises the following steps:
acquiring position information of all characters in a specified document;
for each page of the specified document, generating a blank picture of the same size as the page, converting all characters in the page into pixel points in the blank picture according to the position information of all characters in the page, and rendering the blank picture; and
marking blank areas in the rendered picture as candidate chart areas, and extracting chart information from the candidate chart areas as the chart extracted from the page of the specified document;
wherein acquiring the position information of all characters in the specified document comprises:
parsing the specified document into a text file, and parsing out, for each line of text in the text file, the coordinates of its upper-left corner in the specified document and the length and width of the line of text;
wherein extracting the chart information from the candidate chart areas comprises:
converting the marked candidate chart area into a picture, screening the converted picture through pixel distribution analysis, and selecting the picture containing chart information as the chart extracted from the page of the specified document;
wherein screening the converted picture through pixel distribution analysis comprises:
performing grayscale processing on the converted picture to convert it into a grayscale image;
counting, row by row, the number and proportion of black pixel points in the grayscale image, and if the number and proportion of black pixel points in a row exceed a specified threshold, determining that the row contains specific content; and
counting the number of rows containing specific content, and if that number is greater than or equal to a set threshold, determining that the converted picture is a picture containing chart information.
5. The document chart extraction method of claim 4, wherein rendering the blank picture comprises:
for all pixel point positions in the blank picture, rendering the pixel point positions occupied by characters black and keeping the pixel point positions not occupied by characters white; and
wherein the document chart extraction method further comprises the following step:
processing the rendered picture by an image morphology processing method so that the boundaries of the text information in the rendered picture become distinct.
6. A computer-readable storage medium storing a document chart extraction system, the document chart extraction system being executable by at least one processor to cause the at least one processor to perform the steps of the document chart extraction method according to any one of claims 4 to 5.
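By way of non-limiting illustration only, the steps recited in claims 4 and 5 might be sketched in Python as follows. This sketch is not part of the claims and is not the patented implementation. All function names, the (x, y, length, height) box layout, and the default thresholds are assumptions introduced here. The claims do not state how blank areas are located, so the sketch simply treats full-width horizontal bands without text pixels as candidate chart areas. The image morphology step is omitted, and a pixel is assumed to count as black when its gray value lies below an illustrative cut-off.

```python
from typing import List, Tuple

WHITE, BLACK = 255, 0
# Hypothetical box layout: (x, y, length, height), origin at the page's upper-left corner.
Box = Tuple[int, int, int, int]


def render_text_mask(page_w: int, page_h: int, text_lines: List[Box]) -> List[List[int]]:
    """Start from a blank (white) picture and paint every text-line box black."""
    mask = [[WHITE] * page_w for _ in range(page_h)]
    for x, y, length, height in text_lines:
        for row in range(max(0, y), min(page_h, y + height)):
            for col in range(max(0, x), min(page_w, x + length)):
                mask[row][col] = BLACK
    return mask


def blank_bands(mask: List[List[int]], min_height: int) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges, bottom exclusive, that hold no text pixel.

    Assumption: candidate chart areas are full-width horizontal bands.
    """
    bands, start = [], None
    # A sentinel non-blank row closes a band that reaches the bottom of the page.
    for row_idx, row in enumerate(mask + [[BLACK]]):
        is_blank = all(p == WHITE for p in row)
        if is_blank and start is None:
            start = row_idx
        elif not is_blank and start is not None:
            if row_idx - start >= min_height:
                bands.append((start, row_idx))
            start = None
    return bands


def contains_chart(gray: List[List[int]], black_level: int = 128, min_black: int = 0,
                   row_ratio: float = 0.05, min_rows: int = 3) -> bool:
    """Row-wise pixel distribution test on a grayscale crop of a candidate area."""
    rows_with_content = 0
    for row in gray:
        black = sum(1 for p in row if p < black_level)
        # A row "contains specific content" when both the count and the
        # proportion of black pixels exceed their (illustrative) thresholds.
        if row and black > min_black and black / len(row) > row_ratio:
            rows_with_content += 1
    return rows_with_content >= min_rows


# Two text lines on a 20 x 12 page leave one tall blank band between them.
page = render_text_mask(20, 12, [(2, 1, 16, 2), (2, 9, 16, 2)])
print(blank_bands(page, min_height=4))  # [(3, 9)]
```

In practice, each band returned by `blank_bands` would be cropped from a rasterized image of the original page (not from the text mask), converted to grayscale, and passed to `contains_chart`; bands for which the test fails would be discarded as ordinary white space.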
CN201710776354.9A 2017-08-31 2017-08-31 Document chart extraction method, electronic device and computer readable storage medium Active CN107688789B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710776354.9A CN107688789B (en) 2017-08-31 2017-08-31 Document chart extraction method, electronic device and computer readable storage medium
PCT/CN2017/108810 WO2019041527A1 (en) 2017-08-31 2017-10-31 Method of extracting chart in document, electronic device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710776354.9A CN107688789B (en) 2017-08-31 2017-08-31 Document chart extraction method, electronic device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN107688789A CN107688789A (en) 2018-02-13
CN107688789B true CN107688789B (en) 2021-05-18

Family

ID=61155971

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710776354.9A Active CN107688789B (en) 2017-08-31 2017-08-31 Document chart extraction method, electronic device and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN107688789B (en)
WO (1) WO2019041527A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108959346B (en) * 2018-04-12 2020-11-24 腾讯科技(上海)有限公司 Method, device, medium and equipment for determining text file thumbnail information
CN109445652B (en) * 2018-09-26 2021-08-13 中国平安人寿保险股份有限公司 PDF document display method and terminal equipment
CN109656647B (en) * 2018-09-27 2023-04-11 平安科技(深圳)有限公司 Chart picture generation method, device and equipment and computer readable storage medium
CN111414738A (en) * 2019-01-04 2020-07-14 珠海金山办公软件有限公司 Information analysis method and device, computer storage medium and terminal
CN110221888A (en) * 2019-04-28 2019-09-10 中至数据集团股份有限公司 Screenshot processing method, device, readable storage medium and smart device
CN110390269B (en) * 2019-06-26 2023-08-01 平安科技(深圳)有限公司 PDF document table extraction method, device, equipment and computer readable storage medium
CN110502710A (en) * 2019-07-11 2019-11-26 平安普惠企业管理有限公司 Page generation method, device, equipment and readable storage medium
CN110377285B (en) * 2019-07-23 2023-10-03 腾讯科技(深圳)有限公司 Method and device for generating page skeleton screen and computer equipment
CN112579066A (en) * 2019-09-30 2021-03-30 北京国双科技有限公司 Chart display method and device, storage medium and equipment
CN111338627B (en) * 2020-03-05 2023-05-16 苏宁云计算有限公司 Front-end webpage theme color adjustment method and device
CN112748923A (en) * 2021-01-18 2021-05-04 恒安嘉新(北京)科技股份公司 Method and device for creating visual billboard, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101008960A (en) * 2006-01-26 2007-08-01 株式会社理光 Information processing apparatus, information processing method, and computer program product
CN102081736A (en) * 2009-11-27 2011-06-01 株式会社理光 Equipment and method for extracting enclosing rectangles of characters from portable electronic documents
CN102567300A (en) * 2011-12-29 2012-07-11 方正国际软件有限公司 Picture document processing method and device
CN104346615A (en) * 2013-08-08 2015-02-11 北大方正集团有限公司 Device and method for extracting composite graph in format document
CN105159869A (en) * 2011-05-23 2015-12-16 成都科创知识产权研究所 Picture editing method and system
CN106874252A (en) * 2017-02-17 2017-06-20 张家口浩扬科技有限公司 Document identification and display method and mobile terminal thereof
CN106940804A (en) * 2017-02-23 2017-07-11 杭州仟金顶卓筑信息科技有限公司 Method for automatically inputting form data in a construction engineering material management system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4159720B2 (en) * 2000-03-15 2008-10-01 株式会社リコー Table recognition method, table recognition device, character recognition device, and storage medium storing table recognition program
CN101923723B (en) * 2009-06-16 2012-11-28 汉王科技股份有限公司 Method for realizing display of electronic document
CN104063364A (en) * 2013-03-19 2014-09-24 福建福昕软件开发股份有限公司北京分公司 PDF document recognition method
US9448982B2 (en) * 2014-01-29 2016-09-20 Konica Minolta Laboratory U.S.A., Inc. Immediate independent rasterization

Also Published As

Publication number Publication date
CN107688789A (en) 2018-02-13
WO2019041527A1 (en) 2019-03-07

Similar Documents

Publication Publication Date Title
CN107688789B (en) Document chart extraction method, electronic device and computer readable storage medium
CN107689070B (en) Chart data structured extraction method, electronic device and computer-readable storage medium
CN110390269B (en) PDF document table extraction method, device, equipment and computer readable storage medium
CN111476227B (en) Target field identification method and device based on OCR and storage medium
CN107832676B (en) Table information line feed recognition method, electronic device and computer readable storage medium
CN110197238B (en) Font type identification method, system and terminal equipment
CN110728687B (en) File image segmentation method and device, computer equipment and storage medium
CN110675940A (en) Pathological image labeling method and device, computer equipment and storage medium
JP6795195B2 (en) Character type estimation system, character type estimation method, and character type estimation program
CN104915664B (en) Contact object identifier obtaining method and device
CN107844468A (en) Cross-page table data recognition method, electronic device and computer-readable storage medium
CN113221632A (en) Document picture identification method and device and computer equipment
CN112712014A (en) Table picture structure analysis method, system, equipment and readable storage medium
CN112784220B (en) Paper contract tamper-proof verification method and system
CN114005126A (en) Table reconstruction method and device, computer equipment and readable storage medium
CN110610170B (en) Document comparison method based on image accurate correction
CN107688788B (en) Document chart extraction method, electronic device and computer readable storage medium
CN113537184A (en) OCR (optical character recognition) model training method and device, computer equipment and storage medium
CN110363092B (en) Histogram identification method, apparatus, device and computer readable storage medium
CN110287988B (en) Data enhancement method, device and computer readable storage medium
CN111914046A (en) Generation method and device of target seating chart and computer equipment
CN109101973B (en) Character recognition method, electronic device and storage medium
CN107977404B (en) User information screening method, server and computer readable storage medium
CN110895849A (en) Method and device for cutting and positioning crown word number, computer equipment and storage medium
CN115909449A (en) File processing method, file processing device, electronic equipment, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant