CN113486148A

CN113486148A - PDF file conversion method and device, electronic equipment and computer readable medium

Info

Publication number: CN113486148A
Application number: CN202110769021.XA
Authority: CN
Inventors: 万聪; 丁诗璟; 沈文俊; 高明; 胡德清; 余刚; 赵琴; 刘维安; 袁园; 欧阳明; 李亮; 李金灵; 沈冰华; 姚琛; 谢传聪; 苏蜜; 陈思广
Original assignee: China Construction Bank Corp
Current assignee: China Construction Bank Corp
Priority date: 2021-07-07
Filing date: 2021-07-07
Publication date: 2021-10-08

Abstract

The invention discloses a PDF file conversion method and device, electronic equipment and a computer readable medium, and relates to the technical field of natural language processing. One embodiment of the method comprises: performing character recognition on the PDF file so as to output the pixel coordinates and text content of each text block; aggregating the text blocks into paragraphs according to the pixel coordinates and text content of each text block; extracting numbers and the titles corresponding to the numbers from the paragraphs; and forming text content with a hierarchical structure according to the paragraphs, their corresponding numbers and the titles corresponding to the numbers. This embodiment can solve the technical problem that the hierarchical structure of the file cannot be obtained and the retrieval result lacks context.

Description

PDF file conversion method and device, electronic equipment and computer readable medium

Technical Field

The present invention relates to the field of natural language processing technologies, and in particular, to a method and an apparatus for converting a PDF file, an electronic device, and a computer-readable medium.

Background

At present, OCR is generally adopted to recognize the page content of the PDF file from a picture as text, and then the text content containing keywords is retrieved through the keywords.

In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:

1) the content of the PDF file is in the form of a scanned copy, so text retrieval cannot be carried out directly in the file;

2) the hierarchical structure of the file cannot be obtained, and the retrieval result is a text fragment rather than complete text information, so the complete content and its context cannot be obtained quickly, which greatly reduces the efficiency of information retrieval and utilization.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method and an apparatus for converting a PDF file, an electronic device, and a computer-readable medium, so as to solve the technical problem that the hierarchical structure of a file cannot be obtained and the search result lacks context.

In order to achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a method for converting a PDF file, including:

performing character recognition on the PDF file so as to output pixel coordinates and character contents of each character block;

according to the pixel coordinates and the text content of each text block, aggregating each text block to form each paragraph;

extracting numbers and titles corresponding to the numbers from each paragraph;

and forming character contents with a hierarchical structure according to each paragraph, the number corresponding to each paragraph and the title corresponding to the number.

Optionally, performing text recognition on the PDF file, so as to output pixel coordinates and text content of each text block, including:

converting a PDF file into a plurality of continuous picture files by taking a page as a unit;

and performing character recognition on the picture file, thereby outputting the pixel coordinates and the character content of each character block in the picture file.

Optionally, aggregating the text blocks according to the pixel coordinates and the text contents of the text blocks to form paragraphs, where the method includes:

vectorizing the text content of each text block to obtain a vector of each text block;

and for any character block, inputting the vector and the pixel coordinate of the character block into a text classification model, and outputting whether the character block belongs to the previous paragraph or the next paragraph, thereby forming each paragraph.

Optionally, extracting a number and a title corresponding to the number from each paragraph includes:

and extracting the serial number and the title corresponding to the serial number from each paragraph through a trained Bi-LSTM-CRF model.

Optionally, forming the textual content with a hierarchical structure according to the paragraphs, their corresponding numbers, and the titles corresponding to the numbers, includes:

vectorizing the text content of each paragraph to obtain a vector of each paragraph;

for any paragraph, inputting the vector of the paragraph, the pixel coordinates of the text block at the edge of the paragraph, and the number extracted from the paragraph and the title corresponding to the number into a text classification model, and outputting whether the paragraph is classified as an upper level or a lower level, thereby forming the text content with a hierarchical structure.

Optionally, after forming the textual content with a hierarchical structure according to the paragraphs, their corresponding numbers, and the titles corresponding to the numbers, the method further includes:

and importing the number and the title and the text content corresponding to the number into a full text retrieval engine.

Optionally, after the number and the title and the text content corresponding to the number are imported into a full-text search engine, the method further includes:

retrieving a retrieval result corresponding to a target hierarchy and/or a keyword through the full-text retrieval engine according to the target hierarchy and/or the keyword input by a user;

and responding to any item of retrieval result clicked by a user, and displaying the hierarchical structure, the text content corresponding to the any item of retrieval result and the position area of the text content corresponding to the any item of retrieval result in the PDF file.

In addition, according to another aspect of an embodiment of the present invention, there is provided a PDF file conversion apparatus including:

the identification module is used for carrying out character identification on the PDF file so as to output pixel coordinates and character contents of each character block;

the aggregation module is used for aggregating each character block according to the pixel coordinates and the character content of each character block to form each paragraph;

the extraction module is used for extracting the serial numbers and the titles corresponding to the serial numbers from the paragraphs;

and the conversion module is used for forming character contents with a hierarchical structure according to each paragraph, the number corresponding to each paragraph and the title corresponding to the number.

Optionally, the identification module is further configured to:

Optionally, the aggregation module is further configured to:

Optionally, the extraction module is further configured to:

Optionally, the conversion module is further configured to:

Optionally, the system further comprises a retrieving module, configured to:

after forming the text content with a hierarchical structure according to each paragraph, the number corresponding to each paragraph and the title corresponding to the number, importing the number, and the title and the text content corresponding to the number, into a full-text search engine.

Optionally, the retrieving module is further configured to:

after the serial number and the title and the text content corresponding to the serial number are imported into a full-text search engine, searching a search result corresponding to a target hierarchy and/or a keyword through the full-text search engine according to the target hierarchy and/or the keyword input by a user;

According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including:

one or more processors;

a storage device for storing one or more programs,

when the one or more programs are executed by the one or more processors, the one or more processors implement the method of any of the embodiments described above.

According to another aspect of the embodiments of the present invention, there is also provided a computer readable medium, on which a computer program is stored, which when executed by a processor implements the method of any of the above embodiments.

One embodiment of the above invention has the following advantages or benefits: because the numbers and the titles corresponding to the numbers are extracted from the paragraphs, and text content with a hierarchical structure is formed according to the paragraphs, their corresponding numbers and the titles corresponding to the numbers, the technical problem in the prior art that the hierarchical structure of the file cannot be obtained and the search result lacks context is solved. The embodiment of the invention combines OCR and NLP technologies and converts the PDF file into structured, hierarchical text content based on the content of the text and the relative position information of the text, so that a user can fully understand the text content and its context in the file, which greatly improves the effect of text retrieval and the efficiency of information utilization.

Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

Fig. 1 is a schematic diagram of the main flow of a PDF file conversion method according to an embodiment of the present invention;

Fig. 2 is a diagram illustrating a text recognition result according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of the main flow of a PDF file conversion method according to a referential embodiment of the present invention;

Fig. 4 is a schematic diagram of the main flow of a PDF file conversion method according to another referential embodiment of the present invention;

Fig. 5 is a diagram illustrating retrieval results from a full-text retrieval engine according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of a detail page according to an embodiment of the present invention;

Fig. 7 is a schematic diagram of the main modules of a PDF file conversion apparatus according to an embodiment of the present invention;

Fig. 8 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;

Fig. 9 is a schematic structural diagram of a computer system suitable for implementing a terminal device or a server according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Fig. 1 is a schematic diagram of a main flow of a PDF file conversion method according to an embodiment of the present invention. As an embodiment of the present invention, as shown in fig. 1, the method for converting a PDF file may include:

Step 101, performing character recognition on the PDF file, thereby outputting the pixel coordinates and text content of each text block.

Firstly, performing character recognition on a PDF file to be converted, for example, performing character recognition on each page of content in the PDF file by using OCR, so as to obtain pixel coordinates of each character block and character content in each character block.

Optionally, step 101 may comprise: converting the PDF file into a plurality of consecutive picture files, one per page; and performing character recognition on the picture files, thereby outputting the pixel coordinates and text content of each text block in each picture file. Generally, because the content of a PDF file is in scanned-copy form and the text in the file cannot be searched directly, the PDF file needs to be converted into a plurality of consecutive picture files; for example, a 60-page PDF file is converted page by page into 60 consecutive jpg picture files. OCR is then performed on the picture files, thereby outputting the pixel coordinates of each text block and the text content in each text block.

As shown in fig. 2, the pixel coordinates take the upper left corner of the picture as the origin (0, 0). A text block can be uniquely located by the pixel coordinates of its four corners, which also reflect the relative positions of adjacent text blocks. It should be noted that the upper part of fig. 2 is a real example of character recognition and the lower part is a detailed presentation of the recognition result, where text is the text content, type indicates print or handwriting, score is the confidence (the system scores the recognition result, generally on a scale of 0-1000, and the higher the score, the higher the probability of correct recognition), and coords is the pixel coordinates of the four corners.
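The recognition record described above (text, type, score, coords) can be modelled as a small data structure. The following is a minimal sketch only: the class name `TextBlock`, the helper `parse_block`, the label strings and the sample values are assumptions for illustration and are not taken from the OCR system of the embodiment.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) in pixels, origin at the top-left corner


@dataclass
class TextBlock:
    text: str            # recognized text content
    type: str            # "print" or "handwriting" (label strings are assumed)
    score: int           # confidence on the 0-1000 scale described above
    coords: List[Point]  # pixel coordinates of the four corners

    @property
    def confidence(self) -> float:
        """Confidence rescaled to the range 0.0-1.0."""
        return self.score / 1000


def parse_block(raw: dict) -> TextBlock:
    """Validate one raw OCR record and turn it into a TextBlock."""
    coords = [(int(x), int(y)) for x, y in raw["coords"]]
    if len(coords) != 4:
        raise ValueError("a text block needs exactly four corner points")
    score = int(raw["score"])
    if not 0 <= score <= 1000:
        raise ValueError("score outside the assumed 0-1000 range")
    return TextBlock(raw["text"], raw["type"], score, coords)


# Hypothetical sample record, shaped like the fields listed in the text.
raw = {
    "text": "Article 1 General Provisions",
    "type": "print",
    "score": 987,
    "coords": [[120, 80], [640, 80], [640, 112], [120, 112]],
}
block = parse_block(raw)
```

Keeping the four corners rather than a single rectangle preserves the information needed to compare the relative positions of neighbouring blocks in the later steps.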

Step 102, aggregating the text blocks according to the pixel coordinates and text content of each text block to form paragraphs.

In this step, the text blocks are aggregated according to the recognition result of step 101 (i.e., the pixel coordinates and text content of each text block): it is determined whether text blocks belong to the same paragraph, so that the text blocks are aggregated into paragraphs.

Optionally, step 102 may comprise: vectorizing the text content of each text block to obtain a vector of each text block; and for any text block, inputting the vector and the pixel coordinates of the text block into a text classification model, and outputting whether the text block belongs to the previous paragraph or the next paragraph, thereby forming the paragraphs. Specifically, the text content of each text block can be vectorized (word embedding) using a BERT algorithm, so as to obtain the semantic meaning of each text block; then the vector and the pixel coordinates of the text block are taken as the input of a text classification model, which outputs whether the text block belongs to the previous paragraph or the next paragraph, and the paragraphs are formed according to the output of the text classification model.

It should be noted that the text classification model needs to undergo supervised training in advance. Specifically, a large number of input and output samples are labeled manually, and a text classification model is constructed and trained; the training result is a model file, whose function is to aggregate new text blocks into paragraphs without manual labeling. Optionally, the text classification model is a Transformer-CRF model, by which the text blocks can be accurately aggregated into paragraphs.
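Once the classifier has labelled every text block, turning the labels into paragraphs is a simple grouping pass. The sketch below assumes the model output is available as one label per block, with the hypothetical values "prev" (the block continues the previous paragraph) and "next" (the block opens the next paragraph); the classifier itself is not shown.

```python
from typing import List, Sequence


def group_into_paragraphs(texts: Sequence[str], labels: Sequence[str]) -> List[str]:
    """Merge text blocks into paragraphs from per-block labels.

    "next" opens a new paragraph, "prev" appends to the current one.
    The label names are assumptions made for this sketch.
    """
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    paragraphs: List[List[str]] = []
    for text, label in zip(texts, labels):
        if label == "next" or not paragraphs:
            paragraphs.append([text])      # start a new paragraph
        else:
            paragraphs[-1].append(text)    # continue the current paragraph
    return [" ".join(parts) for parts in paragraphs]


# Hypothetical blocks in reading order with hypothetical model output.
texts = ["1. Scope", "This agreement applies to", "all branches.", "2. Definitions"]
labels = ["next", "next", "prev", "next"]
paragraphs = group_into_paragraphs(texts, labels)
```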

Step 103, extracting the number and the title corresponding to the number from each paragraph.

Usually, some paragraphs contain only a number, some paragraphs contain a number and a title corresponding to the number, and some paragraphs contain neither a number nor a title. Therefore, the numbers and the titles corresponding to the numbers can be extracted from the paragraphs by a pre-trained extraction model, and the numbers and titles are used to form a hierarchical structure (a tree-like directory).

Optionally, step 103 may comprise: and extracting the serial number and the title corresponding to the serial number from each paragraph through a trained Bi-LSTM-CRF model. It should be noted that the Bi-LSTM-CRF model requires supervised training in advance. Specifically, a large number of paragraphs are manually marked with numbers and titles corresponding to the numbers, a Bi-LSTM-CRF model is constructed and trained, and the model has the function of extracting the titles corresponding to the numbers and the numbers from the paragraphs.

Optionally, the numbers may be digits, English letters, Roman numerals, or the like, which is not limited by the embodiment of the present invention. It should be noted that the same number may be repeated; for example, the number "(a)" may appear at different levels, so the hierarchical structure cannot be obtained from the numbers alone, and the titles corresponding to the numbers need to be taken into account as well.
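Sequence-labelling models such as Bi-LSTM-CRF typically emit one tag per token, and the number and title are then read off the tagged spans. The sketch below shows only that decoding step, under the assumption of a BIO tagging scheme with the hypothetical labels NUM and TITLE; the source does not state which tag set the trained model uses.

```python
from typing import Dict, List, Sequence


def decode_number_and_title(tokens: Sequence[str], tags: Sequence[str]) -> Dict[str, str]:
    """Collect the first NUM span and the first TITLE span from BIO tags."""
    spans: Dict[str, List[str]] = {}
    current = None  # label of the span currently being extended, if any
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B":
            # Keep only the first span of each label; ignore later ones.
            current = label if label not in spans else None
            if current is not None:
                spans[current] = [token]
        elif prefix == "I" and current == label:
            spans[current].append(token)
        else:
            current = None  # an "O" tag or a stray "I-" tag closes the span
    return {
        "number": " ".join(spans.get("NUM", [])),
        "title": " ".join(spans.get("TITLE", [])),
    }


# Hypothetical tokens and tags for one paragraph.
tokens = ["3.1", "Governing", "Law", "This", "contract", "is", "governed"]
tags = ["B-NUM", "B-TITLE", "I-TITLE", "O", "O", "O", "O"]
result = decode_number_and_title(tokens, tags)
```

A paragraph with neither a number nor a title simply yields two empty strings, which matches the case described above.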

Step 104, forming text content with a hierarchical structure according to the paragraphs, their corresponding numbers, and the titles corresponding to the numbers.

Each level may include one paragraph or a plurality of paragraphs, and in order to show a complete hierarchical structure, in the embodiment of the present invention, the paragraphs are aggregated according to the paragraphs and their corresponding numbers and titles corresponding to the numbers, so as to form the textual contents having a hierarchical structure, and thus the textual contents of each level include at least one paragraph.

Optionally, step 104 may include: vectorizing the text content of each paragraph to obtain a vector of each paragraph; for any paragraph, inputting the vector of the paragraph, the pixel coordinates of the text block at the edge of the paragraph, and the number extracted from the paragraph and the title corresponding to the number into a text classification model, and outputting whether the paragraph is classified as an upper level or a lower level, thereby forming the text content with a hierarchical structure.

Specifically, the text content of each paragraph can be vectorized (word embedding) using a BERT algorithm, so as to obtain the semantic meaning of each paragraph; then the vector of the paragraph, the pixel coordinates of the paragraph (the pixel coordinates of the outermost text blocks in the paragraph), and the number extracted from the paragraph together with the title corresponding to the number (which may be empty for some paragraphs) are taken as the input of a text classification model, which outputs whether the paragraph is classified to the upper level or the lower level, and the text content with a hierarchical structure is formed according to the output of the text classification model.

It should be noted that the text classification model needs to undergo supervised training in advance. Specifically, a large number of input and output samples are labeled manually, and a text classification model is constructed and trained; the training result is a model file, whose function is to organize new paragraphs into a structured hierarchy without manual labeling. Optionally, the text classification model is a Transformer-CRF model, by which the paragraphs can be accurately grouped into a hierarchy.

At this point, the PDF file has been converted into structured, hierarchical text content, whose smallest granularity is the text content of the lowest level.
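The resulting structure can be pictured as a tree whose nodes carry a number, a title and one or more paragraphs. The sketch below assumes that the classifier decisions have already been resolved into an integer depth per numbered paragraph (1 for the top level); that representation, the `Node` class and `build_tree` are hypothetical and only illustrate how the tree-like directory could be assembled.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class Node:
    number: str
    title: str
    paragraphs: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def build_tree(items: Sequence[Tuple[int, str, str, str]]) -> Node:
    """Build a tree from (depth, number, title, text) tuples in reading order.

    A tuple with an empty number is treated as a continuation paragraph
    of the most recently opened node; its depth is ignored.
    """
    root = Node("", "ROOT")
    stack = [root]  # stack[d] is the currently open node at depth d
    for depth, number, title, text in items:
        if not number:
            stack[-1].paragraphs.append(text)
            continue
        depth = max(1, min(depth, len(stack)))  # clamp so a parent always exists
        del stack[depth:]                       # close deeper or sibling nodes
        node = Node(number, title, [text])
        stack[-1].children.append(node)
        stack.append(node)
    return root


# Hypothetical input; note that "(a)" repeats under different parents.
items = [
    (1, "1", "General", "1 General"),
    (2, "(a)", "Scope", "(a) Scope of application"),
    (0, "", "", "Continuation text."),
    (1, "2", "Definitions", "2 Definitions"),
    (2, "(a)", "Terms", "(a) Terms used"),
]
tree = build_tree(items)
```

Because each node is attached to whichever parent is open at that moment, the two "(a)" entries end up under different parents even though their numbers are identical.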

According to the various embodiments described above, it can be seen that, by extracting the numbers and the titles corresponding to the numbers from the paragraphs and forming text content with a hierarchical structure according to the paragraphs, their corresponding numbers and the titles corresponding to the numbers, the embodiments of the present invention solve the technical problem in the prior art that the hierarchical structure of the file cannot be obtained and the search result lacks context. The embodiment of the invention combines OCR and NLP technologies and converts the PDF file into structured, hierarchical text content based on the content of the text and the relative position information of the text, so that a user can fully understand the text content and its context in the file, which greatly improves the effect of text retrieval and the efficiency of information utilization.

Fig. 3 is a schematic diagram of a main flow of a PDF file conversion method according to a referential embodiment of the present invention. As another embodiment of the present invention, as shown in fig. 3, the method for converting a PDF file may include:

Step 301, converting a PDF file into a plurality of continuous picture files in page units.

And receiving the uploaded PDF file, and converting the PDF file into a plurality of continuous picture files by taking a page as a unit.

Step 302, performing character recognition on the picture file, thereby outputting pixel coordinates and character contents of each character block in the picture file.

After the PDF file is converted into a plurality of consecutive picture files, OCR is performed on the picture files, so that the pixel coordinates of each text block and the text content in each text block are output. The pixel coordinates take the upper left corner of the picture as the origin (0, 0); a text block can be uniquely located by the pixel coordinates of its four corners, which also reflect the relative positions of adjacent text blocks.

Step 303, vectorizing the text content of each text block to obtain a vector of each text block.

The text content of each text block can be vectorized by adopting a BERT algorithm, so that the text meaning of each text block is obtained.

Step 304, for any character block, inputting the vector and the pixel coordinate of the character block into a text classification model, and outputting whether the character block belongs to the previous paragraph or the next paragraph, thereby forming each paragraph.

The text classification model needs to undergo supervised training in advance; the training process is not repeated here. The embodiment of the invention determines whether a text block belongs to the previous paragraph or the next paragraph according to the vector and the pixel coordinates of the text block, so that the text blocks can be accurately aggregated into paragraphs. Optionally, the text classification model is a Transformer-CRF model.
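One way to feed both signals to a classifier is to concatenate the text vector with the corner coordinates. The sketch below does this in plain Python. Scaling the coordinates by the page size is an assumption added here for illustration, as are the function name `block_features` and the toy two-dimensional embedding (a real BERT vector would be much longer).

```python
from typing import List, Sequence, Tuple


def block_features(
    embedding: Sequence[float],
    coords: Sequence[Tuple[int, int]],
    page_width: int,
    page_height: int,
) -> List[float]:
    """Concatenate a text embedding with page-normalized corner coordinates."""
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    features = list(embedding)
    for x, y in coords:
        features.append(x / page_width)   # horizontal position in [0, 1]
        features.append(y / page_height)  # vertical position in [0, 1]
    return features


# Toy embedding and four corners on a hypothetical 1000 x 500 pixel page.
features = block_features(
    [0.5, -0.25],
    [(100, 50), (300, 50), (300, 100), (100, 100)],
    page_width=1000,
    page_height=500,
)
```

Normalizing by page size keeps the positional features comparable across pages rendered at different resolutions, which is why it is included in this sketch.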

Step 305, extracting the numbers and the titles corresponding to the numbers from the paragraphs through the trained Bi-LSTM-CRF model.

The number and the title corresponding to the number can be extracted from each paragraph by a pre-trained extraction model (such as a Bi-LSTM-CRF model), and the number and the title corresponding to the number are used for forming the hierarchical structure. If the PDF file is a legal document, the clause number and the clause title corresponding to the clause number may be extracted from each paragraph.

It should be noted that some paragraphs have neither a clause number nor a clause title, some paragraphs have only a clause number, and some paragraphs have both a clause number and a clause title; the Bi-LSTM-CRF model can accurately extract the clause number, or the clause number and its corresponding clause title, from each paragraph. Optionally, the clause number may be digits, English letters, Roman numerals, or the like, which is not limited by the embodiment of the present invention. The Bi-LSTM-CRF model needs to undergo supervised training in advance; the training process is not repeated here.
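For comparison, the number formats mentioned above can also be recognized with a simple rule-based baseline. The regular expression below is a stand-in for illustration only, not the trained Bi-LSTM-CRF model of the embodiment, and the specific formats it accepts are assumptions.

```python
import re
from typing import Optional

# Leading clause number in a few assumed formats.
NUMBER_PATTERN = re.compile(
    r"""^\s*(
        \d+(?:\.\d+)*\.?     # digits, optionally dotted: 1, 1.2, 3.4.5.
      | \([A-Za-z]+\)        # bracketed letters or Roman numerals: (a), (iv)
      | [IVXLCDM]+\.         # upper-case Roman numerals followed by a dot: IV.
    )(?=\s|$)""",
    re.VERBOSE,
)


def leading_number(paragraph: str) -> Optional[str]:
    """Return the clause number at the start of a paragraph, or None."""
    match = NUMBER_PATTERN.match(paragraph)
    return match.group(1) if match else None


# Hypothetical paragraphs.
examples = [
    "1.2 Payment terms",
    "(a) The borrower shall",
    "IV. Miscellaneous",
    "This paragraph has no number",
]
numbers = [leading_number(p) for p in examples]
```

A rule set like this cannot tell which level a repeated number such as "(a)" belongs to, which is one reason a learned model that also looks at titles and context is useful.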

Step 306, vectorizing the text content of each paragraph to obtain a vector of each paragraph.

Optionally, the text content of each paragraph may be vectorized by using a BERT algorithm, so as to obtain the text meaning of each paragraph.

Step 307, for any paragraph, inputting the vector of the paragraph, the pixel coordinates of the text block at the edge of the paragraph, the number extracted from the paragraph, and the title corresponding to the number into a text classification model, and outputting whether the paragraph is classified as the previous level or the next level, thereby forming the text content with a hierarchical structure.

The text classification model needs to undergo supervised training in advance; the training process is not repeated here. According to the embodiment of the invention, whether a paragraph is classified to the upper level or the lower level is determined from the vector of the paragraph, the pixel coordinates of the paragraph, the number extracted from the paragraph and the title corresponding to the number, so that the paragraphs can be accurately aggregated to form the text content with a hierarchical structure. Optionally, the text classification model is a Transformer-CRF model.

In addition, in one embodiment of the present invention, the detailed implementation of the method for converting a PDF file is described in detail in the above-mentioned method for converting a PDF file, and therefore, the repeated description is omitted here.

Fig. 4 is a schematic diagram of a main flow of a PDF file conversion method according to another referential embodiment of the present invention. As another embodiment of the present invention, as shown in fig. 4, the method for converting a PDF file may include:

Step 401, performing character recognition on the PDF file, thereby outputting the pixel coordinates and text content of each text block.

The uploaded PDF file is received and character recognition is performed on it, for example by applying OCR (optical character recognition) to each page of the PDF file, so as to obtain the pixel coordinates of each text block and the text content in each text block. Generally, because the content of a PDF file is in scanned-copy form and the text in the file cannot be searched directly, the PDF file needs to be converted page by page into a plurality of consecutive picture files, and OCR is then performed on the picture files to output the pixel coordinates of each text block and the text content in each text block.

Step 402, aggregating the text blocks according to the pixel coordinates and text content of each text block to form paragraphs.

In this step, the text blocks are aggregated according to the recognition result of step 401 (i.e., the pixel coordinates and text content of each text block): it is determined whether text blocks belong to the same paragraph, so that the text blocks are aggregated into paragraphs.

Step 403, extracting the number and the title corresponding to the number from each paragraph.

Usually, some paragraphs contain only a number, some paragraphs contain a number and a title corresponding to the number, and some paragraphs contain neither a number nor a title. Therefore, the numbers and the titles corresponding to the numbers can be extracted from the paragraphs by a pre-trained extraction model, and the numbers and titles are used for forming a hierarchical structure.

Step 404, forming a text content with a hierarchical structure according to the paragraphs, the numbers corresponding to the paragraphs, and the titles corresponding to the numbers.

Step 405, importing the number and the title and the text content corresponding to the number into a full text search engine.

All the numbers in the PDF file, together with the titles and text content corresponding to the numbers, are imported into a full-text search engine. If the PDF file is a legal document, the clause numbers, clause titles and clause contents are imported into a full-text search engine (such as the ElasticSearch full-text search engine).
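If ElasticSearch is the chosen engine, one common way to load many documents is its bulk API, which takes newline-delimited JSON with an action line followed by a source line for each document. The sketch below only builds such a payload with the standard library; the index name `clauses` and the field names (`number`, `title`, `content`, `level`, `position`) are hypothetical, and sending the payload to a server is left out.

```python
import json
from typing import Dict, Iterable, List


def to_bulk_ndjson(index: str, clauses: Iterable[Dict]) -> str:
    """Serialize clause documents as a bulk-API body (NDJSON)."""
    lines: List[str] = []
    for i, clause in enumerate(clauses):
        # Action line: index this document under a sequential id.
        lines.append(json.dumps({"index": {"_index": index, "_id": str(i)}}))
        # Source line: the clause itself; keep non-ASCII text readable.
        lines.append(json.dumps(clause, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # a bulk body must end with a newline


# Hypothetical clause documents produced by the conversion steps.
clauses = [
    {"number": "1", "title": "General", "content": "1 General", "level": 1, "position": 0},
    {"number": "(a)", "title": "Scope", "content": "(a) Scope of application", "level": 2, "position": 1},
]
payload = to_bulk_ndjson("clauses", clauses)
```

Storing the level and the reading-order position alongside each clause is what later allows results to be filtered by level and listed in the order of the original text.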

Step 406, retrieving a retrieval result corresponding to the target hierarchy and/or the keyword through the full-text retrieval engine according to the target hierarchy and/or the keyword input by the user.

The user can search the text content through the full-text search engine; for example, the user can input a target level, a keyword, a title or the like, and the corresponding search results are retrieved through the full-text search engine. As shown in fig. 5, taking a legal document as an example, the user may input a level of legal clauses and a keyword, and the output search result is a list of legal clauses that contain the keyword at the granularity of the selected level, ordered according to the sequence of the clauses in the original text. Further, filtering fields such as time and country may be added to improve the accuracy of the search.
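A search by level and/or keyword maps naturally onto an ElasticSearch bool query: a `match` clause for the keyword, a `term` filter for the level, and a sort on the original reading order. The sketch below only constructs the request body; the field names `content`, `level` and `position` are hypothetical and would have to match whatever was indexed.

```python
from typing import Dict, List, Optional


def build_query(keyword: Optional[str] = None, level: Optional[int] = None) -> Dict:
    """Build a search body for a keyword and/or a target level."""
    must: List[Dict] = []
    filters: List[Dict] = []
    if keyword:
        must.append({"match": {"content": keyword}})   # full-text match
    if level is not None:
        filters.append({"term": {"level": level}})     # exact level filter
    bool_query: Dict = {"must": must or [{"match_all": {}}]}
    if filters:
        bool_query["filter"] = filters
    return {
        "query": {"bool": bool_query},
        # List hits in the order the clauses appear in the original text.
        "sort": [{"position": {"order": "asc"}}],
    }


query = build_query(keyword="interest rate", level=2)
```

Further filters such as time or country could be appended to the `filter` list in the same way.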

Step 407, responding to any item of search result clicked by the user, and displaying the hierarchical structure, the text content corresponding to any item of search result, and the position area of the text content corresponding to any item of search result in the PDF file.

After a clause in the list shown in fig. 5 is selected, a detail page pops up. As shown in fig. 6, the left side of the detail page is the hierarchical directory (containing clause numbers and clause titles), the upper right side is the clause content, and below it is the corresponding position area in the PDF file.
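The position area shown on the detail page can be derived from the pixel coordinates of the text blocks that make up the clause. A minimal sketch, assuming the area is the axis-aligned rectangle enclosing all corner points of those blocks on one page (the function name `position_area` and the sample coordinates are hypothetical):

```python
from typing import Iterable, List, Sequence, Tuple

Point = Tuple[int, int]


def position_area(blocks: Iterable[Sequence[Point]]) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) enclosing every corner of every block."""
    xs: List[int] = []
    ys: List[int] = []
    for corners in blocks:
        for x, y in corners:
            xs.append(x)
            ys.append(y)
    if not xs:
        raise ValueError("at least one text block is required")
    return min(xs), min(ys), max(xs), max(ys)


# Two hypothetical text blocks belonging to the same clause.
blocks = [
    [(100, 200), (500, 200), (500, 230), (100, 230)],
    [(100, 236), (420, 236), (420, 266), (100, 266)],
]
area = position_area(blocks)
```

A clause that spans several pages would need one such rectangle per page, which this sketch does not handle.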

The invention combines OCR and NLP technologies to convert the PDF file into structured, hierarchical text content based on the content of the text and the relative position information of the text, so that retrieval results are presented as clauses rather than generic text fragments; and the structured, hierarchical information is displayed in a visual and flexible manner, so that the user can understand the complete content and the context of the clauses, which greatly improves the effect of legal text retrieval and the efficiency of information utilization.

In addition, in another embodiment of the present invention, the detailed implementation of the method for converting a PDF file is described in detail in the above-mentioned method for converting a PDF file, and therefore the repeated description is omitted here.

Fig. 7 is a schematic diagram of main blocks of a PDF file conversion apparatus according to an embodiment of the present invention, and as shown in fig. 7, the PDF file conversion apparatus 700 includes an identification module 701, an aggregation module 702, an extraction module 703, and a conversion module 704; the identification module 701 is configured to perform character identification on a PDF file, so as to output pixel coordinates and character contents of each character block; the aggregation module 702 is configured to aggregate the text blocks according to the pixel coordinates and the text contents of the text blocks to form paragraphs; the extraction module 703 is configured to extract the number and the title corresponding to the number from each paragraph; the conversion module 704 is configured to form text contents having a hierarchical structure according to the paragraphs, their corresponding numbers, and the titles corresponding to the numbers.
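The four modules form a straightforward pipeline in which each module consumes the output of the previous one. The sketch below expresses that composition with plain callables; the class name `PdfConversionPipeline` and the stub modules are hypothetical placeholders and do not perform real OCR or model inference.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class PdfConversionPipeline:
    identify: Callable[[bytes], List[Any]]       # identification module 701
    aggregate: Callable[[List[Any]], List[Any]]  # aggregation module 702
    extract: Callable[[List[Any]], List[Any]]    # extraction module 703
    convert: Callable[[List[Any]], Any]          # conversion module 704

    def run(self, pdf_bytes: bytes) -> Any:
        blocks = self.identify(pdf_bytes)        # text blocks with coordinates
        paragraphs = self.aggregate(blocks)      # blocks merged into paragraphs
        numbered = self.extract(paragraphs)      # numbers and titles attached
        return self.convert(numbered)            # hierarchical text content


# Stub modules that only illustrate the data flow.
pipeline = PdfConversionPipeline(
    identify=lambda pdf: ["1 Scope", "applies to all"],
    aggregate=lambda blocks: [" ".join(blocks)],
    extract=lambda paragraphs: [(p.split()[0], p) for p in paragraphs],
    convert=lambda items: {number: text for number, text in items},
)
result = pipeline.run(b"%PDF-")
```

Passing the modules in as callables keeps each stage replaceable, for example to swap the OCR backend or the classification model without touching the rest of the flow.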

Optionally, the identifying module 701 is further configured to:

Optionally, the aggregation module 702 is further configured to:

Optionally, the extracting module 703 is further configured to:

Optionally, the conversion module 704 is further configured to:

Optionally, the system further comprises a retrieving module, configured to:

Optionally, the retrieving module is further configured to:

It should be noted that the specific implementation of the PDF file conversion apparatus of the present invention has already been described in detail in the PDF file conversion method above, and therefore the description is not repeated here.

Fig. 8 shows an exemplary system architecture 800 to which the PDF file conversion method or the PDF file conversion apparatus of an embodiment of the present invention can be applied.

As shown in fig. 8, the system architecture 800 may include terminal devices 801, 802, 803, a network 804, and a server 805. The network 804 serves as the medium that provides communication links between the terminal devices 801, 802, 803 and the server 805. The network 804 may include various types of connections, such as wired links, wireless communication links, or fiber optic cables.

A user may use the terminal devices 801, 802, 803 to interact with the server 805 over the network 804, to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 801, 802, 803, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, and social platform software (by way of example only).

The terminal devices 801, 802, 803 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like.

The server 805 may be a server that provides various services, such as a background management server (by way of example only) that supports shopping websites browsed by users using the terminal devices 801, 802, 803. The background management server can analyze and process received data, such as an article information query request, and feed the processing result back to the terminal device.

It should be noted that the method for converting a PDF file provided by the embodiment of the present invention is generally executed by the server 805, and accordingly, the PDF file converting apparatus is generally disposed in the server 805. The method for converting a PDF file provided by the embodiment of the present invention may also be executed by the terminal devices 801, 802, and 803, and accordingly, the apparatus for converting a PDF file may be disposed in the terminal devices 801, 802, and 803.

It should be understood that the number of terminal devices, networks, and servers in fig. 8 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to fig. 9, a block diagram of a computer system 900 suitable for implementing a terminal device of an embodiment of the present invention is shown. The terminal device shown in fig. 9 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.

As shown in fig. 9, the computer system 900 includes a Central Processing Unit (CPU) 901 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the system 900 are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.

The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. A drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 910 as necessary, so that a computer program read out therefrom is installed into the storage section 908 as necessary.

In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. When the computer program is executed by the Central Processing Unit (CPU) 901, the above-described functions defined in the system of the present invention are performed.

It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium, a computer readable storage medium, or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer programs according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described, for example, as: a processor including an identification module, an aggregation module, an extraction module, and a conversion module. The names of these modules do not, in some cases, constitute a limitation on the modules themselves.

As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, implement the method of: performing character recognition on the PDF file so as to output pixel coordinates and character contents of each character block; according to the pixel coordinates and the text content of each text block, aggregating each text block to form each paragraph; extracting numbers and titles corresponding to the numbers from each paragraph; and forming character contents with a hierarchical structure according to each paragraph, the number corresponding to each paragraph and the title corresponding to the number.

According to the technical scheme of the embodiment of the invention, the numbers and the titles corresponding to the numbers are extracted from the paragraphs, and text content with a hierarchical structure is formed according to the paragraphs, their corresponding numbers, and the titles corresponding to the numbers. This technical means solves the technical problem in the prior art that the hierarchical structure of the file cannot be known and the retrieval result lacks context. The embodiment of the invention combines OCR and NLP technologies, and converts the PDF file into structured, hierarchical text content based on the content of the text and the relative position information of the text, so that a user can fully understand the text content and its context in the file, which greatly improves the effect of text retrieval and the efficiency of information use.

The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A PDF file conversion method is characterized by comprising the following steps:

extracting numbers and titles corresponding to the numbers from each paragraph;

2. The method of claim 1, wherein performing text recognition on the PDF file to output pixel coordinates of each text block and text content comprises:

3. The method of claim 1, wherein aggregating the text blocks to form paragraphs according to the pixel coordinates and text content of the text blocks comprises:

4. The method of claim 1, wherein extracting the number and the title corresponding to the number from each paragraph comprises:

5. The method according to claim 1, wherein forming the text content with a hierarchical structure according to the paragraphs and their corresponding numbers and titles corresponding to the numbers comprises:

6. The method according to claim 1, wherein after forming the text with a hierarchical structure according to the paragraphs, their corresponding numbers and their corresponding titles, the method further comprises:

7. The method of claim 6, wherein after importing the number and the title and text content corresponding to the number into a full text search engine, further comprising:

8. A PDF file conversion apparatus, comprising:

9. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.

10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.