CN107688789B - Document chart extraction method, electronic device and computer readable storage medium

Info

Publication number: CN107688789B
Application number: CN201710776354.9A
Authority: CN
Inventors: 王鸿滨; 王晓伟; 汪伟; 肖京
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2017-08-31
Filing date: 2017-08-31
Publication date: 2021-05-18
Anticipated expiration: 2037-08-31
Also published as: CN107688789A; WO2019041527A1

Abstract

The invention discloses a document chart extraction method, which comprises the following steps: acquiring position information of all characters in a specified document; for each page in the specified document, generating a blank picture with the same size as the page, converting all characters in the page into pixel points in the blank picture according to the position information of all characters in the page, and rendering the blank picture; and marking the blank area in the rendered picture as a candidate chart area, and extracting chart information from the candidate chart area as the chart extracted from the specified document page. The invention can improve the efficiency and coverage of chart extraction.

Description

Document chart extraction method, electronic device and computer readable storage medium

Technical Field

The invention relates to the technical field of computer information, in particular to a document chart extraction method, an electronic device and a computer-readable storage medium.

Background

Most existing PDF chart extraction tools and programs operate on PDF storage objects and can therefore extract only charts stored as a single picture object. A PDF document, however, often contains other kinds of chart information (such as Office charts), and these charts can intuitively express part of the information in the document. Conventional PDF chart extraction tools and programs cannot accurately extract a chart composed of multiple parts, such as an Office chart. The document chart extraction methods of the prior art are therefore inadequate and urgently need improvement.

Disclosure of Invention

In view of this, the invention provides a document chart extraction method, an electronic device and a computer-readable storage medium, which extract charts from a PDF document by page rendering, so as to improve the efficiency and coverage of chart extraction.

To achieve the above object, the present invention provides an electronic device, which includes a memory and a processor, wherein the memory stores a document chart extraction system operable on the processor, and the document chart extraction system implements the following steps when executed by the processor:

acquiring position information of all characters in a specified document;

for each page in the specified document, generating a blank picture with the same size as the page, converting all characters in the page into pixel points in the blank picture according to the position information of all characters in the page, and rendering the blank picture; and

marking the blank area in the rendered picture as a candidate chart area, and extracting chart information from the candidate chart area as the chart extracted from the specified document page.

Preferably, the rendering the blank picture includes: for all pixel point positions in the blank picture, rendering the pixel point positions occupied by characters black, and keeping the pixel point positions not occupied by characters white.

Preferably, the document chart extraction system when executed by the processor further implements the steps of:

processing the rendered picture by an image morphology processing method so that the boundaries of the text information in the rendered picture become distinct.

Preferably, the extracting of the chart information from the candidate chart region includes:

converting the marked candidate chart area into a picture, screening the converted picture through pixel distribution analysis, and selecting the picture containing chart information as the chart extracted from the specified document page.

Preferably, the screening the converted picture through the pixel distribution analysis includes:

performing grayscale processing on the converted picture to convert it into a grayscale image;

counting the number and proportion of black pixel points in the grayscale image row by row, and if the number and proportion of black pixel points in a row exceed specified thresholds, judging that the row contains specific content; and

counting the number of rows containing specific content, and if the number of rows containing specific content is greater than or equal to a set threshold, judging that the converted picture is a picture containing chart information.

In addition, in order to achieve the above object, the present invention further provides a document chart extraction method, which is applied to an electronic device, and the method includes:

acquiring position information of all characters in a specified document;

Preferably, the rendering the blank picture includes:

aiming at all pixel point positions in the blank picture, rendering pixel point positions occupied by characters to be black, and keeping pixel point positions not occupied by the characters to be white;

the document chart extraction method further comprises the following steps:

Further, to achieve the above object, the present invention also provides a computer-readable storage medium storing a document chart extraction system, which is executable by at least one processor to cause the at least one processor to perform the steps of the document chart extraction method as described above.

Compared with the prior art, the electronic device, the document chart extraction method and the computer-readable storage medium provided by the invention extract charts from the PDF document by page rendering. This approach can extract not only the charts that conventional methods can extract, but also charts composed of multiple parts, such as Office charts, that conventional methods cannot, thereby improving the efficiency and coverage of chart extraction.

Drawings

FIG. 1 is a diagram of an alternative hardware architecture for an electronic device of the present invention;

FIG. 2 is a program module diagram of an embodiment of the document chart extraction system in the electronic device according to the present invention;

FIG. 3 is a flowchart illustrating an embodiment of a document chart extraction method according to the present invention.

Reference numerals:

Electronic device	2
Memory	21
Processor	22
Network interface	23
Document chart extraction system	20
Acquisition module	201
Rendering module	202
Extraction module	203
Method steps	S31-S33

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the description relating to "first", "second", etc. in the present invention is for descriptive purposes only and is not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.

It is further noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

First, the present invention provides an electronic device 2.

Fig. 1 is a schematic diagram of an alternative hardware architecture of the electronic device 2 according to the present invention. In this embodiment, the electronic device 2 may include, but is not limited to, a memory 21, a processor 22, and a network interface 23, which may be communicatively connected to each other through a system bus. It is noted that fig. 1 only shows the electronic device 2 with components 21-23, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.

The electronic device 2 may be a rack server, a blade server, a tower server, or a cabinet server, and the electronic device 2 may be an independent server or a server cluster formed by a plurality of servers.

The memory 21 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 21 may be an internal storage unit of the electronic device 2, such as a hard disk or a memory of the electronic device 2. In other embodiments, the memory 21 may also be an external storage device of the electronic device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card, or the like provided on the electronic device 2. Of course, the memory 21 may also comprise both an internal storage unit and an external storage device of the electronic device 2. In this embodiment, the memory 21 is generally used for storing an operating system installed in the electronic device 2 and various application software, such as the program code of the document chart extraction system 20. Further, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.

In some embodiments, the processor 22 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is generally configured to control the overall operation of the electronic device 2, such as performing control and processing related to data interaction or communication with the electronic device 2. In this embodiment, the processor 22 is configured to run the program code or process the data stored in the memory 21, for example, to run the document chart extraction system 20.

The network interface 23 may comprise a wireless network interface or a wired network interface, and is generally used for establishing a communication connection between the electronic device 2 and other electronic devices. For example, the network interface 23 connects the electronic device 2 with an external data platform through a network, and establishes a data transmission channel and a communication connection between the electronic device 2 and the external data platform. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, and the like.

The application environment and the hardware structure and function of the related devices of the various embodiments of the present invention have been described in detail so far. Hereinafter, various embodiments of the present invention will be proposed based on the above-described application environment and related devices.

Referring to fig. 2, a program module diagram of the document chart extraction system 20 of the electronic device 2 according to an embodiment of the invention is shown. In this embodiment, the document chart extraction system 20 may be divided into one or more program modules, and the one or more program modules are stored in the memory 21 and executed by one or more processors (in this embodiment, the processor 22) to complete the present invention. For example, in fig. 2, the document chart extraction system 20 may be divided into an acquisition module 201, a rendering module 202, and an extraction module 203. The program modules referred to in the present invention are series of computer program instruction segments capable of performing specific functions, and are better suited than programs for describing the execution process of the document chart extraction system 20 in the electronic device 2. The functions of the program modules 201-203 are described in detail below.

The acquisition module 201 is configured to obtain position information of all characters in a specified document (e.g., a PDF document). In this embodiment, a specific text recognition tool (e.g., pdf2html tool) can be used to obtain the position information of all the characters in the specified document. The text recognition tool can parse the PDF document into a text file and, at the same time, parse the specific position information of each line of text in the PDF document (such as the coordinates of the upper left corner and the length and width of the line of text).
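As an illustration of the position information described above, the following minimal Python sketch models one line of text by its upper-left corner, length and width, and groups lines by page. The `TextLine` record, its field names and the `group_by_page` helper are hypothetical; the patent does not specify the output format of the text recognition tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextLine:
    """Position of one line of text on a page (hypothetical record layout)."""
    page: int      # zero-based page index
    left: float    # x coordinate of the upper-left corner
    top: float     # y coordinate of the upper-left corner
    width: float   # horizontal extent (length) of the line
    height: float  # vertical extent (width) of the line

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def bottom(self) -> float:
        return self.top + self.height


def group_by_page(lines):
    """Group text lines by page so that each page can be rendered separately."""
    pages = {}
    for line in lines:
        pages.setdefault(line.page, []).append(line)
    return pages


# Made-up sample values, for illustration only.
lines = [
    TextLine(page=0, left=72.0, top=90.0, width=450.0, height=12.0),
    TextLine(page=0, left=72.0, top=110.0, width=430.0, height=12.0),
    TextLine(page=1, left=72.0, top=90.0, width=300.0, height=12.0),
]
pages = group_by_page(lines)
```

Grouping by page mirrors the per-page processing of the rendering module; any real parser output would first need to be mapped into such a record.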

The rendering module 202 is configured to generate a blank picture with the same size as the page of the specified document for each page in the specified document, convert all the characters in the page into pixel points in the blank picture according to the position information of all the characters in the page, and render the blank picture.

In this embodiment, the position information of all the characters determines the specific position that each line of characters occupies on the specified document page. The rendering of the blank picture comprises: for all pixel point positions in the blank picture, rendering the pixel point positions occupied by characters black, and keeping the pixel point positions not occupied by characters white. After this rendering, the blank picture contains only two colors, black and white, where black marks a character area and white marks a non-character area.
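The black-and-white rendering can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `render_text_mask`, the `(left, top, width, height)` tuple layout in pixel units, and the use of plain nested lists instead of an image library are all choices made here, not details given in the patent.

```python
BLACK, WHITE = 0, 255


def render_text_mask(page_width, page_height, text_boxes):
    """Return a page-sized picture: black where text sits, white elsewhere.

    text_boxes holds (left, top, width, height) tuples in pixel units.
    """
    # Start from a blank (all-white) picture with the same size as the page.
    picture = [[WHITE] * page_width for _ in range(page_height)]
    for left, top, width, height in text_boxes:
        # Clip each box to the picture so stray coordinates cannot overflow.
        x0, y0 = max(0, left), max(0, top)
        x1 = min(page_width, left + width)
        y1 = min(page_height, top + height)
        for y in range(y0, y1):
            for x in range(x0, x1):
                picture[y][x] = BLACK
    return picture


# A tiny 10 x 8 "page" with two lines of text.
mask = render_text_mask(10, 8, [(1, 1, 8, 2), (1, 6, 5, 1)])
```

In the resulting `mask`, rows 3 to 5 stay entirely white, which is the kind of non-character area that the later steps treat as a candidate chart area.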

Further, in other embodiments, the rendering module 202 is further configured to process the rendered picture by an image morphological processing method (such as dilation, erosion and the like) so that the boundaries of the text information in the rendered picture become distinct.
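One of the morphological operations mentioned above, expansion (usually called dilation), can be sketched as follows. The sketch grows every black text pixel by a fixed radius so that small gaps between neighbouring text regions close up. The function name `dilate_black`, the square neighbourhood and the default radius are assumptions made for illustration; a production system would more likely call an image library such as OpenCV.

```python
BLACK, WHITE = 0, 255


def dilate_black(picture, radius=1):
    """Grow black (text) regions by `radius` pixels in every direction."""
    height, width = len(picture), len(picture[0])
    result = [[WHITE] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if picture[y][x] != BLACK:
                continue
            # Paint the square neighbourhood around each black pixel.
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    result[ny][nx] = BLACK
    return result


picture = [
    [WHITE, WHITE, WHITE, WHITE, WHITE],
    [WHITE, BLACK, WHITE, BLACK, WHITE],
    [WHITE, WHITE, WHITE, WHITE, WHITE],
    [WHITE, WHITE, WHITE, WHITE, WHITE],
]
dilated = dilate_black(picture)
```

Here the one-pixel gap between the two black pixels disappears after dilation, while the bottom row, which is more than one pixel away from any text, stays white.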

The extraction module 203 is configured to mark a blank area (i.e., a non-text area) in the rendered picture as a candidate chart area, and extract chart information from the candidate chart area as a chart extracted from the specified document page.
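The patent does not state how blank areas are located in the rendered picture. One simple possibility, shown here purely as an assumption, is to scan the picture row by row and report every sufficiently tall horizontal band that contains no black pixel. The function name `find_blank_bands` and the `min_height` parameter are hypothetical.

```python
WHITE = 255


def find_blank_bands(picture, min_height=3):
    """Return (top, bottom) row ranges, bottom exclusive, that contain no text."""
    bands = []
    start = None
    for y, row in enumerate(picture):
        is_blank = all(pixel == WHITE for pixel in row)
        if is_blank and start is None:
            start = y                      # a blank band begins
        elif not is_blank and start is not None:
            if y - start >= min_height:
                bands.append((start, y))   # band ended at a text row
            start = None
    # Close a band that runs to the bottom of the page.
    if start is not None and len(picture) - start >= min_height:
        bands.append((start, len(picture)))
    return bands


text_row = [0] * 6
blank_row = [WHITE] * 6
page = [text_row, blank_row, text_row,
        blank_row, blank_row, blank_row, blank_row,
        text_row, blank_row, blank_row, blank_row]
bands = find_blank_bands(page)
```

The single blank row between the first two text rows is ignored as ordinary line spacing, while the two taller bands are returned as candidate chart areas. Full-width bands are a simplification: a layout with a chart beside a text column would need a two-dimensional search.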

Preferably, in this embodiment, the extracting of the chart information from the candidate chart region includes:

converting the marked candidate chart area into a picture, screening the converted picture through pixel distribution analysis (or content richness analysis), and selecting the picture containing chart information (such as PDF chart information) as the chart extracted from the specified document page. In this embodiment, a specific picture processing tool (e.g., imagemap tool) may be used to convert the marked candidate chart areas into pictures.

Specifically, the step of screening out a picture containing chart information from the converted picture through pixel distribution analysis includes:

(1) The converted picture is subjected to grayscale processing (for example, through the OpenCV module in Python) and converted into a grayscale image. In the grayscale image, each pixel point of the picture is represented as 0 or 255, where 0 represents black, a pixel point carrying information content, and 255 represents white, a blank pixel point.

(2) The number and proportion of black pixel points in the grayscale image are counted row by row; if both the number and the proportion of black pixel points in a row exceed the specified thresholds (e.g., the number exceeds 5 and the proportion exceeds 50%), the row is determined to contain specific content.

(3) The number of rows containing specific content is counted to judge the richness of the content in the picture: the more rows contain specific content, the richer the content of the picture. If the number of rows containing specific content is greater than or equal to the set threshold (e.g., 2 rows), the converted picture is determined to be rich in content and to be a picture containing chart information. Otherwise, if the number of rows containing specific content is less than the set threshold (e.g., 2 rows), the converted picture is determined to be insufficiently rich in content and to be a blank picture without chart information.
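Steps (2) and (3) above can be sketched as follows, using the example thresholds from the text (more than 5 black pixels, more than 50% of the row, at least 2 such rows). The function names `row_has_content` and `contains_chart` are hypothetical, and the input is assumed to be an already binarized picture stored as nested lists of 0 and 255 values.

```python
BLACK = 0


def row_has_content(row, min_count=5, min_ratio=0.5):
    """A row counts as content when black pixels exceed both thresholds."""
    black = sum(1 for pixel in row if pixel == BLACK)
    return black > min_count and black / len(row) > min_ratio


def contains_chart(gray, min_rows=2, min_count=5, min_ratio=0.5):
    """Judge a binarized picture as a chart when enough rows carry content."""
    content_rows = sum(
        1 for row in gray if row and row_has_content(row, min_count, min_ratio)
    )
    return content_rows >= min_rows


dense = [0] * 8 + [255] * 2      # 8 black of 10 pixels: passes both tests
sparse = [0] * 2 + [255] * 8     # 2 black of 10 pixels: fails both tests
blank = [255] * 10

chart_like = [blank, dense, dense, sparse, blank]
mostly_blank = [blank, dense, sparse, blank, blank]
```

With these inputs, `contains_chart(chart_like)` returns `True` because two rows carry content, while `contains_chart(mostly_blank)` returns `False` because only one row does. The thresholds are exposed as parameters since the text presents them only as examples.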

Through the program modules 201-203, the document chart extraction system 20 provided by the invention extracts charts from the PDF document by page rendering. This approach can extract not only the charts that conventional methods can extract, but also charts composed of multiple parts, such as Office charts, that conventional methods cannot, thereby improving the efficiency and coverage of chart extraction.

In addition, the invention also provides a document chart extraction method.

Fig. 3 is a schematic flow chart diagram illustrating an implementation of an embodiment of the document chart extraction method according to the present invention. In this embodiment, the execution order of the steps in the flowchart shown in fig. 3 may be changed and some steps may be omitted according to different requirements.

Step S31, acquiring position information of all the characters in a specified document (e.g., a PDF document). In this embodiment, a specific text recognition tool (e.g., pdf2html tool) can be used to obtain the position information of all the characters in the specified document. The text recognition tool can parse the PDF document into a text file and, at the same time, parse the specific position information of each line of text in the PDF document (such as the coordinates of the upper left corner and the length and width of the line of text).

Step S32, generating a blank picture with the same size as the page of the specified document for each page in the specified document, then converting all the characters in the page into pixel points in the blank picture according to the position information of all the characters in the page, and rendering the blank picture.
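Converting characters into pixel points requires mapping each line's position, which may be fractional, onto whole pixels. The sketch below shows one way to do this, rounding the edges outward so the pixel rectangle always covers the whole line. The function name `to_pixel_box` and the `scale` parameter are assumptions; with a picture of the same size as the page, as described in step S32, the scale would simply be 1.

```python
import math


def to_pixel_box(left, top, width, height, scale=1.0):
    """Map a text line's page position to an integer pixel rectangle.

    Returns (x0, y0, x1, y1) with x1 and y1 exclusive. Edges are rounded
    outward so the rectangle always covers the whole line.
    """
    x0 = math.floor(left * scale)
    y0 = math.floor(top * scale)
    x1 = math.ceil((left + width) * scale)
    y1 = math.ceil((top + height) * scale)
    return x0, y0, x1, y1


# Made-up positions, for illustration only.
box = to_pixel_box(72.4, 90.6, 450.2, 12.0)
scaled = to_pixel_box(10.0, 20.0, 5.0, 2.5, scale=2.0)
```

Rounding outward errs on the side of marking slightly too much of the page as text, which avoids leaving thin slivers of text inside a candidate chart area.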

Further, in other embodiments, step S32 further includes: processing the rendered picture by an image morphological processing method (such as dilation, erosion and the like) so that the boundaries of the text information in the rendered picture become distinct.

Step S33, marking a blank area (i.e., a non-text area) in the rendered picture as a candidate chart area, and extracting chart information from the candidate chart area as the chart extracted from the specified document page.

Through steps S31-S33, the document chart extraction method provided by the invention extracts charts from the PDF document by page rendering. This approach can extract not only the charts that conventional methods can extract, but also charts composed of multiple parts, such as Office charts, that conventional methods cannot, thereby improving the efficiency and coverage of chart extraction.

Further, to achieve the above object, the present invention also provides a computer-readable storage medium (such as ROM/RAM, magnetic disk, optical disk) storing a document chart extraction system 20, wherein the document chart extraction system 20 is executable by at least one processor 22, so that the at least one processor 22 executes the steps of the document chart extraction method as described above.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better embodiment. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

The preferred embodiments of the present invention have been described above with reference to the accompanying drawings, and are not to be construed as limiting the scope of the invention. The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments. Additionally, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different than here.

Those skilled in the art can implement the invention in various modifications, such as features from one embodiment can be used in another embodiment to yield yet a further embodiment, without departing from the scope and spirit of the invention. All the equivalent structures or equivalent processes performed by using the contents of the specification and the drawings of the invention, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. An electronic device comprising a memory and a processor, the memory having stored thereon a document chart extraction system operable on the processor, the document chart extraction system when executed by the processor implementing the steps of:

acquiring position information of all characters in a specified document;

marking blank areas in the rendered pictures as candidate chart areas, and extracting chart information from the candidate chart areas to serve as the chart extracted from the specified document page;

the acquiring the position information of all characters in the specified document comprises:

analyzing the specified document into a text file, and analyzing the coordinates of each line of text in the text file at the upper left corner of the specified document and the length and width of the line of text;

the extracting of the chart information from the candidate chart region comprises:

converting the marked candidate chart area into a picture, screening the converted picture through pixel distribution analysis, and selecting the picture containing chart information as a chart extracted from the specified document page;

through pixel distribution analysis, screening the converted picture comprises the following steps:

2. The electronic device of claim 1, wherein the rendering the blank picture comprises: for all pixel point positions in the blank picture, rendering the pixel point positions occupied by characters black, and keeping the pixel point positions not occupied by characters white.

3. The electronic device of claim 2, wherein the document chart extraction system, when executed by the processor, further performs the steps of:

4. A document chart extraction method applied to an electronic device, characterized by comprising the following steps:

acquiring position information of all characters in a specified document;

5. The document chart extraction method of claim 4, wherein the rendering the blank picture comprises:

aiming at all pixel point positions in the blank picture, rendering pixel point positions occupied by characters to be black, and keeping pixel point positions not occupied by the characters to be white; and

the document chart extraction method further comprises the following steps:

6. A computer-readable storage medium storing a document chart extraction system executable by at least one processor to cause the at least one processor to perform the steps of the document chart extraction method according to any one of claims 4-5.