CN116542221A - PDF file analysis preview method, device, equipment and storage medium

Info

Publication number: CN116542221A
Application number: CN202310369133.5A
Authority: CN
Inventors: 顾柏进
Original assignee: Kangjian Information Technology Shenzhen Co Ltd
Current assignee: Kangjian Information Technology Shenzhen Co Ltd
Priority date: 2023-04-03
Filing date: 2023-04-03
Publication date: 2023-08-04

Abstract

The invention relates to the field of development assistance technology and discloses a PDF file analysis preview method that can be used to preview medical diagnosis reports. The method comprises the following steps: performing page segmentation on the target PDF file to obtain PDF pages; when a PDF page is a plain text page, recognizing all text in the PDF page and writing it into a blank file in a preset file format to obtain a corresponding preview page; when the PDF page is not a plain text page, analyzing the PDF page to obtain all block elements in the PDF page and the position information of the block elements; extracting the content of all the block elements and writing the extracted content into a blank file in the preset file format according to the position information to obtain a preview page; and sending the preview pages to the user terminal to realize the preview of the target PDF file. The invention also relates to blockchain technology, and the preview page can be stored in a blockchain node. The invention also provides a PDF file analysis preview device, electronic equipment and a medium. The invention can improve the efficiency of PDF file analysis preview.

Description

PDF file analysis preview method, device, equipment and storage medium

Technical Field

The present invention relates to the field of development assistance technology and digital medical technology, and in particular, to a PDF file analysis preview method, apparatus, electronic device, and storage medium.

Background

PDF is a commonly used electronic file format. When a PDF file on a server needs to be previewed on a terminal (for example, when a user previews a medical diagnosis report in PDF format online), the file has to be processed with data analysis technology: because PDF files can only be viewed with dedicated software, the PDF file is parsed and converted into a general file format so that it can be previewed conveniently.

However, in current PDF file parsing preview methods, every page of the PDF file has to be parsed in order to extract the content of the page and convert it into a preset general file format for previewing, which makes PDF file parsing preview inefficient.

Disclosure of Invention

The invention provides a PDF file analysis preview method, a device, electronic equipment and a storage medium, and mainly aims to improve the efficiency of PDF file analysis preview. To achieve this aim, the PDF file analysis preview method provided by the invention comprises:

acquiring a file preview request of a target PDF file sent by a user terminal, and carrying out page segmentation on the target PDF file to obtain a PDF page;

judging whether the PDF page is a plain text page or not;

when the PDF page is a plain text page, recognizing all text in the PDF page and writing it into a blank file in a preset file format to obtain a corresponding preview page;

when the PDF page is not a plain text page, analyzing the PDF page to obtain all block elements in the PDF page and the position information of the block elements;

extracting the contents in all the block elements, and writing the extracted contents into blank files in a preset file format according to the position information to obtain a preview page;

and sending all the preview pages to the user terminal so as to realize the preview of the target PDF file by viewing the preview pages through the user terminal.
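By way of a non-limiting illustration, the per-page branching described in the steps above can be sketched as follows. The `Block` structure, the function name `build_preview_pages`, and the three callables passed in (`is_plain_text`, `recognize_text`, `parse_blocks`) are hypothetical names introduced only for this sketch; the invention does not prescribe them.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Block:
    """A block element with its content and position (hypothetical structure)."""
    content: str
    x: int
    y: int


def build_preview_pages(
    pages: List[str],
    is_plain_text: Callable[[str], bool],
    recognize_text: Callable[[str], str],
    parse_blocks: Callable[[str], List[Block]],
) -> List[dict]:
    """Create one preview page per PDF page, choosing the path per page."""
    previews = []
    for index, page in enumerate(pages):
        if is_plain_text(page):
            # Plain text page: recognize the text directly, no block parsing.
            previews.append({"page": index, "mode": "text", "text": recognize_text(page)})
        else:
            # Other pages: parse block elements and keep their positions.
            previews.append({"page": index, "mode": "blocks", "blocks": parse_blocks(page)})
    return previews
```

The classifier, the text recognizer and the block parser are passed in as functions so that the sketch stays independent of any particular model or PDF library.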

Optionally, the determining whether the PDF page is a plain text page includes:

converting the PDF page into a gray image;

performing feature extraction on the gray level image by using a feature extraction network in a pre-constructed page classification model to obtain a feature extraction matrix;

mapping the feature extraction matrix into feature values of different preset identification categories by using a feature mapping layer in the page classification model;

normalizing each characteristic value to obtain the identification probability of each identification category;

and judging whether the PDF page is a plain text page according to the identification probability.
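As an illustrative sketch of the normalization step, the feature values of the identification categories may be normalized with a softmax function. The choice of softmax is an assumption made for this example, since the text above only requires that the feature values be normalized into probabilities; the category names other than plain text are likewise hypothetical.

```python
import math
from typing import Dict


def normalize_feature_values(feature_values: Dict[str, float]) -> Dict[str, float]:
    """Softmax: turn per-category feature values into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability; the result is unchanged.
    peak = max(feature_values.values())
    exps = {name: math.exp(value - peak) for name, value in feature_values.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


# Hypothetical feature values for three assumed categories.
probabilities = normalize_feature_values({"plain_text": 2.0, "table": 0.0, "image": 0.0})
```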

Optionally, the determining, according to the recognition probability, whether the PDF page is a plain text page includes:

judging whether the recognition probability larger than a preset recognition threshold exists or not;

if the recognition probability is larger than the recognition threshold, judging whether the recognition category corresponding to the maximum recognition probability is plain text or not;

when the identification category corresponding to the maximum identification probability is plain text, the PDF page is a plain text page;

when the identification category corresponding to the maximum identification probability is not plain text, the PDF page is not a plain text page;

and if the recognition probability larger than the recognition threshold does not exist, the PDF page is not a plain text page.
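The threshold rule above can be sketched as a small function. The function name, the category label `"plain_text"` and the default threshold of 0.5 are assumptions chosen for illustration; the preset recognition threshold is not fixed by the text.

```python
from typing import Dict


def is_plain_text_page(probabilities: Dict[str, float], threshold: float = 0.5) -> bool:
    """Apply the threshold rule to the per-category recognition probabilities."""
    best_category = max(probabilities, key=probabilities.get)
    if probabilities[best_category] <= threshold:
        # No probability exceeds the threshold: the page is not plain text.
        return False
    # Otherwise the page is plain text only if the top category is plain text.
    return best_category == "plain_text"
```

Checking only the maximum probability is sufficient here, because a probability above the threshold exists exactly when the maximum probability is above the threshold.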

Optionally, the recognizing all text in the PDF page and writing it into a blank file in a preset file format to obtain a corresponding preview page includes:

identifying all texts in the PDF page by utilizing an OCR technology to obtain page texts;

and converting all characters in the page text into preset fonts, and writing the preset fonts into the blank file to obtain a preview page corresponding to the PDF page.

Optionally, the sending all the preview pages to the user terminal includes:

combining all the preview pages according to the sequence of the corresponding PDF pages in the target PDF file to obtain a preview file;

and sending the preview file to the user terminal.
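A minimal sketch of the combining step, assuming each preview page is kept together with the index of its source PDF page; the tuple layout and the function name are hypothetical.

```python
from typing import List, Tuple


def combine_preview_pages(pages: List[Tuple[int, bytes]]) -> List[bytes]:
    """Order preview pages by the index of their source PDF page."""
    # Each item is (page index in the target PDF, preview page data).
    return [data for _, data in sorted(pages, key=lambda item: item[0])]
```

Sorting by the source page index means that the preview pages may be produced in any order, for example in parallel, and still be combined in the order of the target PDF file.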

Optionally, the sending all the preview pages to the user terminal includes:

storing the preview file in a preset storage area, and acquiring a storage address of the preview file; and sending the storage address to the user terminal.

In order to solve the above problems, the present invention further provides a PDF file parsing preview device, where the device includes:

the classification judging module is used for acquiring a file preview request of a target PDF file sent by the user terminal, and carrying out page segmentation on the target PDF file to obtain a PDF page; judging whether the PDF page is a plain text page or not;

the page analysis module is used for recognizing all text in the PDF page and writing it into a blank file in a preset file format when the PDF page is a plain text page, so as to obtain a corresponding preview page; when the PDF page is not a plain text page, analyzing the PDF page to obtain all block elements in the PDF page and the position information of the block elements; extracting the contents in all the block elements, and writing the extracted contents into a blank file in the preset file format according to the position information to obtain a preview page;

and the file sending module is used for sending all the preview pages to the user terminal so as to realize the preview of the target PDF file by viewing the preview pages through the user terminal.

Optionally, the sending all the preview pages to the user terminal includes:

In order to solve the above-mentioned problems, the present invention also provides an electronic apparatus including:

a memory storing at least one computer program; and

a processor that executes the computer program stored in the memory to implement the PDF file analysis preview method described above.

In order to solve the above-mentioned problems, the present invention also provides a computer-readable storage medium having stored therein at least one computer program that is executed by a processor in an electronic device to implement the PDF file parsing preview method described above.

In the embodiment of the invention, when the PDF page is a plain text page, all text in the PDF page is recognized and written into a blank file in a preset file format to obtain a corresponding preview page. Because a plain text page does not need to be analyzed, the preview page can be created by directly recognizing the text in the page, which simplifies the creation flow of the preview page, increases the creation speed of the preview page, and thereby improves the efficiency of PDF file analysis preview. Therefore, the PDF file analysis preview method, the PDF file analysis preview device, the electronic equipment and the readable storage medium provided by the embodiment of the invention improve the efficiency of PDF file analysis preview.

Drawings

Fig. 1 is a flowchart of a PDF file parsing preview method according to an embodiment of the present invention;

Fig. 2 is a schematic block diagram of a PDF file analysis preview device according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of an internal structure of an electronic device for implementing a PDF file parsing preview method according to an embodiment of the present invention.

The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

The embodiment of the invention provides a PDF file analysis preview method. The execution subject of the method includes, but is not limited to, at least one of a server, a terminal, and other electronic devices that can be configured to execute the method provided by the embodiments of the present application. In other words, the PDF file analysis preview method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server side may be, without limitation, an independent server or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and big data and artificial intelligence platforms.

Referring to fig. 1, which is a schematic flow chart of a PDF file analysis preview method according to an embodiment of the present invention, in an embodiment of the present invention, the PDF file analysis preview method includes the following steps:

S1, acquiring a file preview request of a target PDF file sent by a user terminal, and carrying out page segmentation on the target PDF file to obtain a PDF page;

In the embodiment of the invention, a preset server receives the file preview request for the target PDF file sent by the user terminal. To make the PDF file easier to convert for preview, the target PDF file is obtained and segmented by page to obtain the corresponding PDF pages. The target PDF file may be a medical diagnosis report.

Further, in the embodiment of the invention, although a PDF page can be converted directly into a picture by taking a screenshot or scanning, pictures converted in this way often have inconsistent formats, and their definition does not meet the preview requirement. The content of the PDF page therefore needs to be identified accurately, and a page in a unified format is constructed from the identified content.

S2, judging whether the PDF page is a plain text page or not;

In the embodiment of the invention, PDF pages differ in the types of elements they contain. When a page contains only text, the text can be recognized directly and the PDF page does not need to be analyzed, which improves the extraction speed of the content in the page.

Further, in the embodiment of the invention, an artificial intelligence model based on machine vision is used to judge whether the PDF page is a plain text page. When the PDF page is a plain text page, the text in the page can be recognized directly without analyzing the PDF page, which improves the extraction speed of the content in the page.

Specifically, in the embodiment of the present invention, determining whether the PDF page is a plain text page includes:

converting the PDF page into a gray image;

Further, in the embodiment of the present invention, feature extraction is performed on the gray level image by using a feature extraction network in a pre-constructed page classification model, so as to obtain a feature extraction matrix, including:

converting the gray scale image into an image matrix;

and carrying out convolution pooling on the image matrix by utilizing the feature extraction network to obtain the feature extraction matrix.

Specifically, in the embodiment of the invention, gray values of all pixels in the gray image are obtained, and the gray values are used as elements in a blank matrix to construct a matrix, so that the image matrix is obtained.
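For illustration only, the construction of the image matrix can be sketched as below, assuming the gray values are supplied as a flat list in row-major order together with the image width and height; this input layout and the function name are assumptions of the sketch.

```python
from typing import List


def to_image_matrix(gray_values: List[int], width: int, height: int) -> List[List[int]]:
    """Arrange a flat, row-major list of gray values into a height x width matrix."""
    if len(gray_values) != width * height:
        raise ValueError("pixel count does not match width * height")
    return [gray_values[row * width:(row + 1) * width] for row in range(height)]
```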

Further, in the embodiment of the present invention, the feature extraction network is obtained by connecting a plurality of convolution layers and a plurality of pooling layers in series in a certain order. The embodiment does not limit the layer structures of the convolution layers and the pooling layers; the first layer of the feature extraction network is a convolution layer, and the order of the remaining layers is not limited. The feature mapping layer is formed by connecting multiple fully connected layers in series, wherein the number of output nodes of the last fully connected layer equals the number of identification categories, each identification category corresponds to exactly one output node, and the output value of an output node is the feature value of the corresponding identification category.
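A minimal, dependency-free sketch of one convolution operation followed by one pooling operation is given below. The kernel values, the window size and the use of max pooling are assumptions chosen for illustration; the embodiment does not limit the layer structures.

```python
from typing import List

Matrix = List[List[float]]


def convolve(image: Matrix, kernel: Matrix) -> Matrix:
    """Valid (no padding) 2D convolution in the cross-correlation form used by CNNs."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(image: Matrix, size: int = 2) -> Matrix:
    """Non-overlapping max pooling; rows or columns that do not fill a window are dropped."""
    out_h, out_w = len(image) // size, len(image[0]) // size
    return [
        [
            max(image[r * size + i][c * size + j] for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]
```

In a trained page classification model the kernel values would be learned parameters; here they are passed in explicitly so that the arithmetic can be followed by hand.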

Specifically, in the embodiment of the present invention, the identifying category includes: plain text.

In detail, in the embodiment of the present invention, the determining whether the PDF page is a plain text page according to the recognition probability includes:

S3, when the PDF page is a plain text page, recognizing all text in the PDF page and writing it into a blank file in a preset file format to obtain a corresponding preview page;

In the embodiment of the invention, when the PDF page is a plain text page, all text in the PDF page is recognized using OCR technology to obtain the page text; all characters in the page text are converted into a preset font and written into the blank file to obtain the preview page corresponding to the PDF page. Specifically, in the embodiment of the present invention, the preset font is a font existing in the server.

Further, in the embodiment of the present invention, the preset file format is a file format that can be parsed by the user terminal; optionally, it may be a file format such as jpg or png.

S4, when the PDF page is not a plain text page, analyzing the PDF page to obtain all block elements in the PDF page and the position information of the block elements;

in the embodiment of the invention, the PDF page is analyzed to obtain all the block elements in the PDF page, wherein the types of the block elements comprise characters, tables or graphs.

S5, extracting the contents in all the block elements, and writing the extracted contents into blank files in a preset file format according to the position information to obtain a preview page;

In the embodiment of the invention, a blank area is constructed in the blank file according to the position information of each block element to obtain the content insertion area of that block element; the extracted content of each block element is then filled into its content insertion area to obtain the preview page.
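As a simplified illustration of filling content into insertion areas by position, the sketch below treats the blank file as a grid of characters and each block element as a text string with a row and column. This character-grid model, the tuple layout and the function name are assumptions of the sketch; an actual preview page would typically be an image.

```python
from typing import List, Tuple


def write_blocks(width: int, height: int, blocks: List[Tuple[int, int, str]]) -> List[str]:
    """Place each block's text at its (row, column) position on a blank character grid."""
    canvas = [[" "] * width for _ in range(height)]
    for row, col, text in blocks:
        for offset, char in enumerate(text):
            # Ignore anything that would fall outside the blank page.
            if 0 <= row < height and 0 <= col + offset < width:
                canvas[row][col + offset] = char
    return ["".join(line) for line in canvas]
```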

Further, in the embodiment of the present invention, when the extracted content of each block element is filled into its content insertion area to obtain the preview page, the font type of the characters in the extracted content is identified; if the identified font type does not exist in the server, the characters are converted into a preset font and stored in the blank area.
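The font fallback can be sketched as a single lookup, assuming the fonts available on the server are known as a set of names; the function name and the font names used in the usage example are hypothetical.

```python
from typing import Set


def resolve_font(detected_font: str, server_fonts: Set[str], preset_font: str) -> str:
    """Keep the detected font if the server has it, otherwise use the preset font."""
    return detected_font if detected_font in server_fonts else preset_font


# Hypothetical usage: a font missing from the server falls back to the preset font.
chosen = resolve_font("FancyScript", {"SimSun", "Arial"}, "SimSun")
```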

In another embodiment of the present invention, the preview page may be stored in a blockchain node, so as to improve the efficiency of data access by using the high throughput characteristic of the blockchain node.

And S6, sending all the preview pages to the user terminal so as to realize the preview of the target PDF file by viewing the preview pages through the user terminal.

In the embodiment of the present invention, sending all the preview pages to the user terminal includes:

and sending the preview file to the user terminal.

In the embodiment of the invention, the preview file is sent to the user terminal to respond to the file preview request, so that the user can realize the preview of the target PDF file by checking the preview file, thereby realizing the analysis preview of the PDF file on the terminal which does not support the PDF file.

Further, in the embodiment of the present invention, the preview file may be sent to the user terminal indirectly. Specifically, the step of sending the preview file to the user terminal may be replaced by: storing the preview file in a preset storage area and obtaining a storage address of the preview file; and sending the storage address to the user terminal, so that the user terminal browses the preview file according to the storage address, thereby realizing the preview of the target PDF file. Optionally, in the embodiment of the present invention, the preset storage area is a data storage area of the server.
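A minimal sketch of the indirect sending variant is shown below, assuming the preset storage area is a directory on the server and the storage address is expressed as a file URI. Both choices, as well as the function and file names, are assumptions for illustration; a deployed system would more likely return an HTTP address.

```python
import tempfile
from pathlib import Path


def store_preview_file(data: bytes, storage_dir: Path, name: str) -> str:
    """Save the preview file in the storage area and return its address as a file URI."""
    storage_dir.mkdir(parents=True, exist_ok=True)
    target = storage_dir / name
    target.write_bytes(data)
    return target.resolve().as_uri()


if __name__ == "__main__":
    # Hypothetical usage with a temporary directory standing in for the storage area.
    with tempfile.TemporaryDirectory() as tmp:
        print(store_preview_file(b"preview", Path(tmp), "report.png"))
```

Only the returned address is sent to the user terminal, so the preview file itself is transferred only when the terminal actually opens it.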

As shown in fig. 2, a functional block diagram of the PDF file analysis preview apparatus of the present invention is shown.

The PDF file analysis preview apparatus 100 of the present invention may be mounted in an electronic device. According to the implemented functions, the PDF file parsing preview apparatus may include a classification judging module 101, a page parsing module 102, and a file sending module 103, where the modules may also be referred to as units, and refer to a series of computer program segments capable of being executed by a processor of an electronic device and performing a fixed function, and are stored in a memory of the electronic device.

In the present embodiment, the functions concerning the respective modules/units are as follows:

the classification judging module 101 is configured to obtain a file preview request of a target PDF file sent by a user terminal, and perform page segmentation on the target PDF file to obtain a PDF page; judging whether the PDF page is a plain text page or not;

the page parsing module 102 is configured to recognize all text in the PDF page and write it into a blank file in a preset file format when the PDF page is a plain text page, so as to obtain a corresponding preview page; when the PDF page is not a plain text page, analyzing the PDF page to obtain all block elements in the PDF page and the position information of the block elements; extracting the contents in all the block elements, and writing the extracted contents into a blank file in the preset file format according to the position information to obtain a preview page;

the file sending module 103 is configured to send all the preview pages to the user terminal, so as to view the preview pages through the user terminal, thereby implementing preview of the target PDF file.

In detail, each module in the PDF file analysis preview device 100 in the embodiment of the present invention adopts the same technical means as the PDF file analysis preview method described in fig. 1 and can generate the same technical effects when in use, which is not described herein.

Fig. 3 is a schematic structural diagram of an electronic device for implementing the PDF file analysis preview method of the present invention.

The electronic device may comprise a processor 10, a memory 11, a communication bus 12 and a communication interface 13, and may further comprise a computer program, such as a PDF file parsing preview program, stored in the memory 11 and executable on the processor 10.

The memory 11 includes at least one type of readable storage medium, including flash memory, a mobile hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device, such as a mobile hard disk of the electronic device. The memory 11 may in other embodiments also be an external storage device of the electronic device, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the electronic device. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used to store not only application software installed in an electronic device and various data, such as a code of a PDF file analysis preview program, but also temporarily store data that has been output or is to be output.

The processor 10 may be comprised of integrated circuits in some embodiments, for example, a single packaged integrated circuit, or may be comprised of multiple integrated circuits packaged with the same or different functions, including one or more central processing units (Central Processing Unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the entire electronic device using various interfaces and lines, and executes various functions of the electronic device and processes data by running or executing programs or modules (e.g., PDF file analysis preview programs, etc.) stored in the memory 11, and calling data stored in the memory 11.

The communication bus 12 may be a peripheral component interconnect (Peripheral Component Interconnect, PCI) bus, or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. The communication bus 12 is arranged to enable connection and communication between the memory 11 and the at least one processor 10, etc. For ease of illustration, the bus is shown in the figure as a single bold line, but this does not mean that there is only one bus or only one type of bus.

Fig. 3 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 3 is not limiting of the electronic device and may include fewer or more components than shown, or may combine certain components, or a different arrangement of components.

For example, although not shown, the electronic device may further include a power source (such as a battery) for supplying power to the respective components. Preferably, the power source may be logically connected to the at least one processor 10 through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device. The power supply may also include any one or more of a direct current or alternating current power supply, a recharging device, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like. The electronic device may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which are not described herein.

Optionally, the communication interface 13 may comprise a wired interface and/or a wireless interface (e.g., a Wi-Fi interface, a Bluetooth interface, etc.), typically used to establish a communication connection between the electronic device and other electronic devices.

Optionally, the communication interface 13 may further comprise a user interface, which may be a display (Display) or an input unit such as a keyboard (Keyboard), and may also be a standard wired interface or a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be referred to as a display screen or display unit, as appropriate, and is used for displaying information processed in the electronic device and for displaying a visual user interface.

It should be understood that the embodiments described are for illustrative purposes only, and the scope of the patent application is not limited by this configuration.

The PDF file analysis preview program stored in the memory 11 in the electronic device is a combination of a plurality of computer programs, and when executed in the processor 10, may implement:

judging whether the PDF page is a plain text page or not;

In particular, the specific implementation method of the processor 10 on the computer program may refer to the description of the relevant steps in the corresponding embodiment of fig. 1, which is not repeated herein.

Further, the modules/units integrated in the electronic device, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. The computer readable medium may be non-volatile or volatile. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM, Read-Only Memory).

Embodiments of the present invention may also provide a computer readable storage medium storing a computer program which, when executed by a processor of an electronic device, may implement:

judging whether the PDF page is a plain text page or not;

Further, the computer-usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created from the use of blockchain nodes, and the like.

In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.

The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

The embodiment of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.

In addition, each functional module in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in the form of hardware, or in the form of hardware plus software functional modules.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.

The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.

The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The Blockchain (Blockchain), which is essentially a decentralised database, is a string of data blocks that are generated by cryptographic means in association, each data block containing a batch of information of network transactions for verifying the validity of the information (anti-counterfeiting) and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
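The cryptographic association between successive data blocks can be illustrated with the following sketch, in which each block stores the hash of the previous block. The block fields, the use of SHA-256 and the function names are assumptions for illustration and do not describe any particular blockchain platform.

```python
import hashlib
import json
from typing import List


def make_block(data: str, previous_hash: str) -> dict:
    """Create a block whose hash covers its data and the previous block's hash."""
    payload = json.dumps({"data": data, "previous_hash": previous_hash}, sort_keys=True)
    return {
        "data": data,
        "previous_hash": previous_hash,
        "hash": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
    }


def is_valid_chain(chain: List[dict]) -> bool:
    """Check every block's own hash and its link to the block before it."""
    for index, block in enumerate(chain):
        expected = make_block(block["data"], block["previous_hash"])["hash"]
        if block["hash"] != expected:
            return False
        if index > 0 and block["previous_hash"] != chain[index - 1]["hash"]:
            return False
    return True
```

Because each hash covers the previous block's hash, changing the data of an earlier block invalidates that block and breaks the link from every later block, which is what allows the validity of stored information to be verified.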

Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the system claims can also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names and do not denote any particular order.

Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims

1. The PDF file analysis preview method is characterized by comprising the following steps of:

judging whether the PDF page is a plain text page or not;

2. The PDF file parsing preview method of claim 1, wherein the determining whether the PDF page is a plain text page includes:

converting the PDF page into a gray image;

3. The PDF file parsing preview method of claim 2, wherein said determining whether the PDF page is a plain text page according to the recognition probability includes:

4. The PDF file parsing preview method of claim 1, wherein the recognizing all text in the PDF page and writing it into a blank file in a preset file format to obtain a corresponding preview page includes:

5. The PDF file parsing preview method of claim 1, wherein said sending all the preview pages to the user terminal includes:

and sending the preview file to the user terminal.

6. The PDF file parsing preview method of any one of claims 1 to 5, wherein said sending all the preview pages to the user terminal includes:

7. A PDF file analysis preview device, comprising:

8. The PDF file parsing preview device of claim 7, wherein said sending all of the preview pages to the user terminal includes:

9. An electronic device, the electronic device comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor;

wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the PDF file parsing preview method of any one of claims 1 to 6.

10. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the PDF file parsing preview method of any one of claims 1 to 6.