CN113486148A - PDF file conversion method and device, electronic equipment and computer readable medium - Google Patents

PDF file conversion method and device, electronic equipment and computer readable medium

Info

Publication number
CN113486148A
Authority
CN
China
Prior art keywords
text
paragraph
character
block
pdf file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110769021.XA
Other languages
Chinese (zh)
Inventor
万聪
丁诗璟
沈文俊
高明
胡德清
余刚
赵琴
刘维安
袁园
欧阳明
李亮
李金灵
沈冰华
姚琛
谢传聪
苏蜜
陈思广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp
Priority to CN202110769021.XA
Publication of CN113486148A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a PDF file conversion method and device, electronic equipment and a computer readable medium, and relates to the technical field of natural language processing. One embodiment of the method comprises: performing text recognition on a PDF file to output the pixel coordinates and text content of each text block; aggregating the text blocks into paragraphs according to their pixel coordinates and text content; extracting a number and the title corresponding to the number from each paragraph; and forming text content with a hierarchical structure according to the paragraphs, their corresponding numbers and the titles corresponding to the numbers. This embodiment can solve the technical problem that the hierarchical structure of a file cannot be obtained and retrieval results lack context.

Description

PDF file conversion method and device, electronic equipment and computer readable medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method and an apparatus for converting a PDF file, an electronic device, and a computer-readable medium.
Background
At present, OCR is generally used to recognize the page content of a PDF file, converting it from pictures into text, and the text content containing given keywords is then retrieved by keyword.
In the process of implementing the invention, the inventors found that at least the following problems exist in the prior art:
1) the content of the PDF file is in scanned-copy (image) form, so text in the file cannot be searched directly;
2) the hierarchical structure of the file cannot be obtained: a retrieval result is a text fragment that does not carry complete textual information, so the complete content and its context cannot be obtained quickly, which greatly reduces the efficiency of information retrieval and use.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for converting a PDF file, an electronic device, and a computer-readable medium, so as to solve the technical problem that the hierarchical structure of a file cannot be obtained and search results lack context.
In order to achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a method for converting a PDF file, including:
performing character recognition on the PDF file so as to output pixel coordinates and character contents of each character block;
according to the pixel coordinates and the text content of each text block, aggregating each text block to form each paragraph;
extracting numbers and titles corresponding to the numbers from each paragraph;
and forming character contents with a hierarchical structure according to each paragraph, the number corresponding to each paragraph and the title corresponding to the number.
Optionally, performing text recognition on the PDF file, so as to output pixel coordinates and text content of each text block, including:
converting a PDF file into a plurality of continuous picture files by taking a page as a unit;
and performing character recognition on the picture file, thereby outputting the pixel coordinates and the character content of each character block in the picture file.
Optionally, aggregating the text blocks according to their pixel coordinates and text content to form paragraphs includes:
vectorizing the text content of each text block to obtain a vector of each text block;
and for any character block, inputting the vector and the pixel coordinate of the character block into a text classification model, and outputting whether the character block belongs to the previous paragraph or the next paragraph, thereby forming each paragraph.
Optionally, extracting a number and a title corresponding to the number from each paragraph includes:
and extracting the serial number and the title corresponding to the serial number from each paragraph through a trained Bi-LSTM-CRF model.
Optionally, forming the textual content with a hierarchical structure according to the paragraphs, their corresponding numbers, and the titles corresponding to the numbers, includes:
vectorizing the text content of each paragraph to obtain a vector of each paragraph;
for any paragraph, inputting the vector of the paragraph, the pixel coordinates of the text block at the edge of the paragraph, and the number extracted from the paragraph and the title corresponding to the number into a text classification model, and outputting whether the paragraph is classified as an upper level or a lower level, thereby forming the text content with a hierarchical structure.
Optionally, after forming the textual content with a hierarchical structure according to the paragraphs, their corresponding numbers, and the titles corresponding to the numbers, the method further includes:
importing the number, and the title and text content corresponding to the number, into a full-text search engine.
Optionally, after the number, and the title and text content corresponding to the number, are imported into the full-text search engine, the method further includes:
retrieving, through the full-text search engine, the search results corresponding to a target level and/or keyword input by a user;
and in response to the user clicking any search result, displaying the hierarchical structure, the text content corresponding to that search result, and the position area of that text content in the PDF file.
In addition, according to another aspect of an embodiment of the present invention, there is provided a PDF file conversion apparatus including:
the identification module is used for carrying out character identification on the PDF file so as to output pixel coordinates and character contents of each character block;
the aggregation module is used for aggregating each character block according to the pixel coordinates and the character content of each character block to form each paragraph;
the extraction module is used for extracting the serial numbers and the titles corresponding to the serial numbers from the paragraphs;
and the conversion module is used for forming character contents with a hierarchical structure according to each paragraph, the number corresponding to each paragraph and the title corresponding to the number.
Optionally, the identification module is further configured to:
converting a PDF file into a plurality of continuous picture files by taking a page as a unit;
and performing character recognition on the picture file, thereby outputting the pixel coordinates and the character content of each character block in the picture file.
Optionally, the aggregation module is further configured to:
vectorizing the text content of each text block to obtain a vector of each text block;
and for any character block, inputting the vector and the pixel coordinate of the character block into a text classification model, and outputting whether the character block belongs to the previous paragraph or the next paragraph, thereby forming each paragraph.
Optionally, the extraction module is further configured to:
and extracting the serial number and the title corresponding to the serial number from each paragraph through a trained Bi-LSTM-CRF model.
Optionally, the conversion module is further configured to:
vectorizing the text content of each paragraph to obtain a vector of each paragraph;
for any paragraph, inputting the vector of the paragraph, the pixel coordinates of the text block at the edge of the paragraph, and the number extracted from the paragraph and the title corresponding to the number into a text classification model, and outputting whether the paragraph is classified as an upper level or a lower level, thereby forming the text content with a hierarchical structure.
Optionally, the apparatus further comprises a retrieving module, configured to:
after the text content with a hierarchical structure is formed according to the paragraphs, their corresponding numbers and the titles corresponding to the numbers, import the number, and the title and text content corresponding to the number, into a full-text search engine.
Optionally, the retrieving module is further configured to:
after the serial number and the title and the text content corresponding to the serial number are imported into a full-text search engine, searching a search result corresponding to a target hierarchy and/or a keyword through the full-text search engine according to the target hierarchy and/or the keyword input by a user;
and responding to any item of retrieval result clicked by a user, and displaying the hierarchical structure, the text content corresponding to the any item of retrieval result and the position area of the text content corresponding to the any item of retrieval result in the PDF file.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the method of any of the embodiments described above.
According to another aspect of the embodiments of the present invention, there is also provided a computer readable medium, on which a computer program is stored, which when executed by a processor implements the method of any of the above embodiments.
One embodiment of the above invention has the following advantages or benefits: by extracting the numbers and the titles corresponding to the numbers from the paragraphs, and forming text content with a hierarchical structure according to the paragraphs, their corresponding numbers and the titles corresponding to the numbers, it solves the technical problem in the prior art that the hierarchical structure of a file cannot be obtained and search results lack context. The embodiment of the invention combines OCR and NLP technologies and converts the PDF file into structured, hierarchical text content based on the content of the text and its relative position information, so that a user can fully understand the text content of the file and its context, which greatly improves the effectiveness of text retrieval and the efficiency of information use.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
Fig. 1 is a schematic diagram of a main flow of a PDF file conversion method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a text recognition result according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a main flow of a PDF file conversion method according to a referential embodiment of the present invention;
Fig. 4 is a schematic diagram of a main flow of a PDF file conversion method according to another referential embodiment of the present invention;
Fig. 5 is a schematic diagram of retrieval results from a full-text retrieval engine according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of a detail page according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of the main modules of a PDF file conversion apparatus according to an embodiment of the present invention;
Fig. 8 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
Fig. 9 is a schematic structural diagram of a computer system suitable for implementing a terminal device or a server according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of a main flow of a PDF file conversion method according to an embodiment of the present invention. As an embodiment of the present invention, as shown in fig. 1, the method for converting a PDF file may include:
Step 101, performing text recognition on the PDF file, thereby outputting the pixel coordinates and text content of each text block.
First, text recognition is performed on the PDF file to be converted. For example, OCR is applied to each page of the PDF file to obtain the pixel coordinates of each text block and the text content within it.
Optionally, step 101 may comprise: converting the PDF file into a plurality of consecutive picture files by taking a page as a unit; and performing text recognition on the picture files, thereby outputting the pixel coordinates and text content of each text block in the picture files. Generally, because the content of a PDF file is in scanned-copy (image) form and text in the file cannot be searched directly, the PDF file needs to be converted into a plurality of consecutive picture files. For example, a 60-page PDF file is converted, page by page, into 60 consecutive jpg picture files; OCR is then performed on the picture files, thereby outputting the pixel coordinates of each text block and the text content within it.
As shown in Fig. 2, the pixel coordinates take the upper-left corner of the picture as the origin (0, 0). A text block can be uniquely located by the pixel coordinates of its four corners, and these coordinates also reflect the relative positions of adjacent text blocks. It should be noted that the upper part of Fig. 2 is a real example of text recognition and the lower part presents the recognition result in detail, where text is the text content, type indicates print or handwriting, score is the confidence (the system scores the recognition result, generally on a scale of 0 to 1000, and a higher score means a higher probability of correct recognition), and coords is the pixel coordinates of the four corners.
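The recognition result described above can be pictured as a small record per text block. The following is a minimal sketch, not the patent's actual data format: the class name `TextBlock`, the field name `kind` (standing in for the `type` field) and the helper `vertical_gap` are hypothetical, and the corner order is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x, y) in pixels; the origin (0, 0) is the upper-left corner of the page image.
Point = Tuple[int, int]


@dataclass
class TextBlock:
    text: str            # recognized text content
    kind: str            # "print" or "handwriting" (the "type" field in the result)
    score: int           # confidence, assumed here to lie in 0-1000
    coords: List[Point]  # pixel coordinates of the four corners (order assumed)

    @property
    def left(self) -> int:
        return min(x for x, _ in self.coords)

    @property
    def right(self) -> int:
        return max(x for x, _ in self.coords)

    @property
    def top(self) -> int:
        return min(y for _, y in self.coords)

    @property
    def bottom(self) -> int:
        return max(y for _, y in self.coords)


def vertical_gap(upper: TextBlock, lower: TextBlock) -> int:
    """Pixels of white space between two blocks; y grows downward from the origin."""
    return lower.top - upper.bottom


first = TextBlock("1. Scope", "print", 987, [(100, 50), (500, 50), (500, 80), (100, 80)])
second = TextBlock("This agreement", "print", 954, [(100, 90), (420, 90), (420, 120), (100, 120)])
```

Because y increases downward, a small positive `vertical_gap` between consecutive blocks is one of the relative-position signals that later steps can use.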
Step 102, aggregating the text blocks according to the pixel coordinates and text content of the text blocks to form paragraphs.
In this step, the text blocks are aggregated according to the recognition result of step 101 (i.e., the pixel coordinates and text content of each text block): it is determined whether text blocks belong to the same paragraph, so that the text blocks are aggregated into paragraphs.
Optionally, step 102 may comprise: vectorizing the text content of each text block to obtain a vector of each text block; and, for any text block, inputting the vector and the pixel coordinates of the text block into a text classification model, which outputs whether the text block belongs to the previous paragraph or the next paragraph, thereby forming the paragraphs. Specifically, the text content of each text block can be vectorized (word embedding) using a BERT algorithm to capture the meaning of the text in each block; the vector and the pixel coordinates of the text block are then used as the input of the text classification model, which outputs whether the text block belongs to the previous paragraph or the next paragraph, and the paragraphs are formed according to the output of the text classification model.
It should be noted that the text classification model needs supervised training in advance. Specifically, a large number of input and output samples are labeled manually, and a text classification model is constructed and trained; the result of training is a model file whose function is to group new text blocks into paragraphs without manual labeling. Optionally, the text classification model is a Transformer-CRF model, by which the text blocks can be accurately aggregated into paragraphs.
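The aggregation loop around the classifier can be sketched as follows. This is an illustration under stated assumptions, not the trained model described above: `aggregate` and `geometric_rule` are hypothetical names, blocks are simplified to a text plus a bounding box, and the purely geometric rule (with made-up thresholds) is a stand-in for the decision that the trained text classification model would make from the BERT vector and pixel coordinates.

```python
from typing import Callable, List, Sequence, Tuple

# Simplified block: (text, (left, top, right, bottom)) in pixels, origin at top-left.
Block = Tuple[str, Tuple[int, int, int, int]]


def aggregate(
    blocks: Sequence[Block],
    belongs_to_previous: Callable[[Block, Block], bool],
) -> List[List[Block]]:
    """Group blocks (in reading order) into paragraphs using a binary decision."""
    paragraphs: List[List[Block]] = []
    for block in blocks:
        if paragraphs and belongs_to_previous(paragraphs[-1][-1], block):
            paragraphs[-1].append(block)   # continues the previous paragraph
        else:
            paragraphs.append([block])     # starts the next paragraph
    return paragraphs


def geometric_rule(prev: Block, cur: Block, max_gap: int = 12, min_indent: int = 20) -> bool:
    """Stand-in for the classifier: same paragraph if close below and not indented.

    The thresholds are illustrative assumptions, not values from the source.
    """
    _, (p_left, _, _, p_bottom) = prev
    _, (c_left, c_top, _, _) = cur
    small_gap = (c_top - p_bottom) <= max_gap
    indented = (c_left - p_left) >= min_indent
    return small_gap and not indented


blocks = [
    ("1. Scope", (100, 50, 300, 80)),
    ("This agreement applies to", (100, 120, 500, 150)),
    ("all branches.", (100, 155, 260, 185)),
]
paragraphs = aggregate(blocks, geometric_rule)
```

Passing the decision in as a callable keeps the loop unchanged if the geometric rule is later replaced by a call to a trained model.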
Step 103, extracting the number and the title corresponding to the number from each paragraph.
Usually, some paragraphs contain only a number, some paragraphs contain a number and the title corresponding to the number, and some paragraphs contain neither a number nor a title. The numbers and the titles corresponding to the numbers can therefore be extracted from the paragraphs by a pre-trained extraction model, and they are used to form a hierarchical structure (a tree-like directory).
Optionally, step 103 may comprise: extracting the number and the title corresponding to the number from each paragraph through a trained Bi-LSTM-CRF model. It should be noted that the Bi-LSTM-CRF model requires supervised training in advance. Specifically, the numbers and the titles corresponding to the numbers are manually labeled in a large number of paragraphs, and a Bi-LSTM-CRF model is constructed and trained; the function of the model is to extract the numbers and the titles corresponding to the numbers from paragraphs.
Optionally, a number may consist of digits, English letters, Roman numerals, or the like, which is not limited by the embodiment of the present invention. It should be noted that the same number may be repeated; for example, the number "(a)" may appear at different levels, so the hierarchical structure cannot be obtained from the numbers alone and the titles corresponding to the numbers need to be taken into account as well.
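To make the input and output of this step concrete, the sketch below uses a regular expression as a plainly simplified stand-in for the trained Bi-LSTM-CRF model. The function name `extract_number_and_title`, the numbering patterns and the "short first line without terminal punctuation" title rule are all assumptions made for illustration; a rule of this kind will misfire on, for example, a paragraph that begins with a year, which is one reason a trained sequence-labeling model is attractive.

```python
import re
from typing import Optional, Tuple

# Leading numbering patterns covered by this sketch (an assumed, incomplete set).
_NUMBER = re.compile(
    r"""^\s*(
        \d+(?:\.\d+)*\.?      # 1   1.   1.2   1.2.3
      | \([0-9a-zA-Z]+\)      # (1)  (a)  (iv)
      | [IVXLC]+\.            # IV.
    )\s+""",
    re.VERBOSE,
)


def extract_number_and_title(
    paragraph: str, max_title_len: int = 40
) -> Tuple[Optional[str], Optional[str]]:
    """Return (number, title); either may be None when absent."""
    match = _NUMBER.match(paragraph)
    if match is None:
        return None, None
    number = match.group(1)
    first_line = paragraph[match.end():].split("\n", 1)[0].strip()
    # Assumed heuristic: a short first line with no closing punctuation is a title.
    if first_line and len(first_line) <= max_title_len and first_line[-1] not in ".;:,":
        return number, first_line
    return number, None
```

The three cases named in the text map onto the return value: `(None, None)` for an unnumbered paragraph, `(number, None)` for a number without a title, and `(number, title)` when both are present.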
Step 104, forming text content with a hierarchical structure according to the paragraphs, their corresponding numbers, and the titles corresponding to the numbers.
Each level may include one paragraph or a plurality of paragraphs, and in order to show a complete hierarchical structure, in the embodiment of the present invention, the paragraphs are aggregated according to the paragraphs and their corresponding numbers and titles corresponding to the numbers, so as to form the textual contents having a hierarchical structure, and thus the textual contents of each level include at least one paragraph.
Optionally, step 104 may include: vectorizing the text content of each paragraph to obtain a vector of each paragraph; for any paragraph, inputting the vector of the paragraph, the pixel coordinates of the text block at the edge of the paragraph, and the number extracted from the paragraph and the title corresponding to the number into a text classification model, and outputting whether the paragraph is classified as an upper level or a lower level, thereby forming the text content with a hierarchical structure.
Specifically, the text content of each paragraph can be vectorized (word embedding) using a BERT algorithm to capture the meaning of the text in each paragraph. The vector of the paragraph, the pixel coordinates of the paragraph (the pixel coordinates of the outermost text blocks in the paragraph), and the number extracted from the paragraph together with the title corresponding to the number (these may be empty for some paragraphs) are then used as the input of a text classification model, which outputs whether the paragraph is classified into the upper level or the lower level; the text content with a hierarchical structure is formed according to the output of the text classification model.
It should be noted that the text classification model needs supervised training in advance. Specifically, a large number of input and output samples are labeled manually, and a text classification model is constructed and trained; the result of training is a model file whose function is to structure new paragraphs and arrange them into levels without manual labeling. Optionally, the text classification model is a Transformer-CRF model, by which the paragraphs can be accurately aggregated into a hierarchy.
At this point, the PDF file has been converted into structured, hierarchical text content whose finest granularity is the text content of the lowest level.
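The resulting structure can be sketched as a tree in which every node carries a number, an optional title and one or more paragraphs. The code below is an illustration only: `Node`, `numbering_style` and `build_hierarchy` are hypothetical names, and the depth decision (numbering styles ranked by order of first appearance) is a simple stand-in for the trained text classification model. Unlike that model, this rule cannot tell apart a repeated number such as "(a)" used at two different levels.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    number: Optional[str]
    title: Optional[str]
    paragraphs: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def numbering_style(number: str) -> str:
    """Collapse a number into a style key, e.g. '1.2' -> '9.9', '(b)' -> '(a)'."""
    style = re.sub(r"\d+", "9", number)
    style = re.sub(r"[a-z]+", "a", style)
    return re.sub(r"[A-Z]+", "A", style)


def build_hierarchy(items: List[Tuple[Optional[str], Optional[str], str]]) -> Node:
    """items: (number, title, paragraph text) in reading order."""
    root = Node(number=None, title=None)
    stack: List[Node] = [root]            # stack[d] is the open node at depth d
    depth_of_style: Dict[str, int] = {}   # assumed rule: earlier style = higher level
    for number, title, text in items:
        if number is None:
            stack[-1].paragraphs.append(text)   # body text of the current level
            continue
        style = numbering_style(number)
        depth = depth_of_style.setdefault(style, len(depth_of_style) + 1)
        depth = min(depth, len(stack))          # never skip a level
        del stack[depth:]                       # close deeper or sibling nodes
        node = Node(number, title, [text])
        stack[-1].children.append(node)
        stack.append(node)
    return root


tree = build_hierarchy([
    ("1.", "Scope", "1. Scope"),
    (None, None, "This agreement applies to:"),
    ("(a)", None, "(a) all branches;"),
    ("(b)", None, "(b) all subsidiaries."),
    ("2.", "Term", "2. Term"),
])
```

Each node holds at least the paragraph that introduced it, which matches the statement that the text content of every level includes at least one paragraph.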
As can be seen from the embodiments described above, by extracting the numbers and the titles corresponding to the numbers from the paragraphs, and forming text content with a hierarchical structure according to the paragraphs, their corresponding numbers and the titles corresponding to the numbers, the embodiments of the present invention solve the technical problem in the prior art that the hierarchical structure of a file cannot be obtained and search results lack context. The embodiment of the invention combines OCR and NLP technologies and converts the PDF file into structured, hierarchical text content based on the content of the text and its relative position information, so that a user can fully understand the text content of the file and its context, which greatly improves the effectiveness of text retrieval and the efficiency of information use.
Fig. 3 is a schematic diagram of a main flow of a PDF file conversion method according to a referential embodiment of the present invention. As another embodiment of the present invention, as shown in fig. 3, the method for converting a PDF file may include:
Step 301, converting a PDF file into a plurality of consecutive picture files in page units.
The uploaded PDF file is received and converted, page by page, into a plurality of consecutive picture files.
Step 302, performing text recognition on the picture files, thereby outputting the pixel coordinates and text content of each text block in the picture files.
After the PDF file is converted into a plurality of consecutive picture files, OCR is performed on the picture files, thereby outputting the pixel coordinates of each text block and the text content within it. The pixel coordinates take the upper-left corner of the picture as the origin (0, 0). A text block can be uniquely located by the pixel coordinates of its four corners, and these coordinates also reflect the relative positions of adjacent text blocks.
Step 303, vectorizing the text content of each text block to obtain a vector of each text block.
The text content of each text block can be vectorized by adopting a BERT algorithm, so that the text meaning of each text block is obtained.
Step 304, for any character block, inputting the vector and the pixel coordinate of the character block into a text classification model, and outputting whether the character block belongs to the previous paragraph or the next paragraph, thereby forming each paragraph.
The text classification model needs supervised training in advance; the training process is not described again here. This embodiment determines from the vector and pixel coordinates of a text block whether the block belongs to the previous paragraph or the next paragraph, so that the text blocks can be accurately aggregated into paragraphs. Optionally, the text classification model is a Transformer-CRF model, by which the text blocks can be accurately aggregated into paragraphs.
Step 305, extracting the numbers and the titles corresponding to the numbers from the paragraphs through the trained Bi-LSTM-CRF model.
The number and the title corresponding to the number can be extracted from each paragraph by a pre-trained extraction model (such as a Bi-LSTM-CRF model), and the numbers and titles are used to form the hierarchical structure. If the PDF file is a legal document, the clause number and the clause title corresponding to the clause number may be extracted from each paragraph.
It should be noted that some paragraphs have neither a clause number nor a clause title, some paragraphs have only a clause number, and some paragraphs have both a clause number and a clause title; the clause number, or the clause number together with its corresponding title, can be accurately extracted from each paragraph by the Bi-LSTM-CRF model. Optionally, a clause number may consist of digits, English letters, Roman numerals, or the like, which is not limited by the embodiment of the present invention. The Bi-LSTM-CRF model needs supervised training in advance; the training process is not described again here.
Step 306, vectorizing the text content of each paragraph to obtain a vector of each paragraph.
Optionally, the text content of each paragraph may be vectorized by using a BERT algorithm, so as to obtain the text meaning of each paragraph.
Step 307, for any paragraph, inputting the vector of the paragraph, the pixel coordinates of the text block at the edge of the paragraph, the number extracted from the paragraph, and the title corresponding to the number into a text classification model, and outputting whether the paragraph is classified as the previous level or the next level, thereby forming the text content with a hierarchical structure.
The text classification model needs supervised training in advance; the training process is not described again here. This embodiment determines whether a paragraph is classified into the upper level or the lower level from the vector of the paragraph, the pixel coordinates of the paragraph, and the number extracted from the paragraph together with the title corresponding to the number, so that the paragraphs can be accurately aggregated to form text content with a hierarchical structure. Optionally, the text classification model is a Transformer-CRF model, by which the paragraphs can be accurately aggregated into a hierarchy.
In addition, the implementation details of this embodiment have already been described in the PDF file conversion method above and are therefore not repeated here.
Fig. 4 is a schematic diagram of a main flow of a PDF file conversion method according to another referential embodiment of the present invention. As another embodiment of the present invention, as shown in fig. 4, the method for converting a PDF file may include:
step 401, performing character recognition on the PDF file, thereby outputting pixel coordinates and character contents of each character block.
And receiving the uploaded PDF file, and performing character recognition on the PDF file, for example, performing character recognition on each page of content in the PDF file by using an OCR (optical character recognition), so as to obtain pixel coordinates of each character block and character content in each character block. Generally, because PDF file contents are in a copy format and text search in a file cannot be directly performed, it is necessary to convert a PDF file into a plurality of continuous picture files on a page-by-page basis, and then perform OCR recognition on the picture files to output pixel coordinates of each text block and text contents in each text block.
Step 402, aggregating the text blocks according to the pixel coordinates and text content of each text block to form paragraphs.
In this step, the text blocks are aggregated according to the recognition result of step 401 (i.e., the pixel coordinates and text content of each text block): it is determined whether text blocks belong to the same paragraph, and the text blocks are aggregated into paragraphs accordingly.
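The patent makes this decision with a trained text classification model fed with each block's vector and pixel coordinates. As a rough illustration of the aggregation loop only, the sketch below swaps in a simple geometric rule (small vertical gap, no first-line indent); the thresholds `max_gap` and `indent` and all names are assumptions, not part of the described method.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    x0: int
    y0: int
    x1: int
    y1: int
    text: str


def same_paragraph(prev, cur, max_gap=12, indent=20):
    """Heuristic stand-in for the classifier: does `cur` continue `prev`?"""
    gap = cur.y0 - prev.y1                 # vertical whitespace between the blocks
    indented = cur.x0 - prev.x0 >= indent  # a first-line indent starts a new paragraph
    return gap <= max_gap and not indented


def aggregate(blocks):
    """Group blocks (assumed to be in reading order) into paragraph strings."""
    paragraphs = []
    for block in blocks:
        if paragraphs and same_paragraph(paragraphs[-1][-1], block):
            paragraphs[-1].append(block)
        else:
            paragraphs.append([block])
    return [" ".join(b.text for b in p) for p in paragraphs]


blocks = [
    TextBlock(60, 100, 560, 120, "Article 1 This law"),
    TextBlock(40, 124, 560, 144, "applies to contracts."),
    TextBlock(60, 150, 560, 170, "Article 2 Definitions"),
    TextBlock(40, 174, 560, 194, "apply as follows."),
]
paragraphs = aggregate(blocks)
```

A rule like this breaks down on layouts without indentation or with irregular spacing, which is presumably why the embodiment relies on a learned model that also sees the text content.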
Step 403, extracting a number and a title corresponding to the number from each paragraph.
Typically, some paragraphs contain only a number, some contain both a number and a title corresponding to the number, and some contain neither. The numbers and the titles corresponding to the numbers can therefore be extracted from the paragraphs by a pre-trained extraction model, and they are subsequently used to form the hierarchical structure.
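The three cases (number only, number plus title, neither) can be illustrated with a small rule-based extractor. This is a sketch under stated assumptions, not the patent's method: the embodiment uses a trained sequence-labeling model (a Bi-LSTM-CRF is named elsewhere in the text), whereas the regular expression, the "title ends at the first colon" convention, and the function name `extract` are all hypothetical.

```python
import re

# Matches "Chapter 2", "Article 3", or dotted numbers such as "1.2.3" at the start.
NUMBER_RE = re.compile(r"^\s*((?:Chapter|Article)\s+\d+|\d+(?:\.\d+)*)\s*")


def extract(paragraph):
    """Return (number, title); either element is None when it is absent."""
    match = NUMBER_RE.match(paragraph)
    if not match:
        return None, None
    number = match.group(1)
    rest = paragraph[match.end():]
    # Assumed convention: the title is the text before the first colon.
    title, sep, _ = rest.partition(":")
    return number, (title.strip() if sep else None)


with_title = extract("Article 3 Scope of application: This law applies.")
number_only = extract("1.2 The parties shall cooperate.")
neither = extract("This paragraph has no number.")
```

A learned model is better suited to real documents, where numbering styles vary and a leading digit is not always a clause number.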
Step 404, forming text content with a hierarchical structure according to the paragraphs, their corresponding numbers, and the titles corresponding to the numbers.
Each level may include one or more paragraphs. To present a complete hierarchical structure, in this embodiment of the invention the paragraphs are aggregated according to the paragraphs themselves, their corresponding numbers, and the titles corresponding to the numbers, so as to form text content with a hierarchical structure in which the text content of each level includes at least one paragraph.
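One way to picture the resulting structure is a tree whose nodes each hold a number, a title, and at least one paragraph. The sketch below is illustrative only: it infers the level from the number of dots in a dotted number, whereas the embodiment decides "upper level or lower level" with a text classification model; `Node`, `depth`, and `build_tree` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    number: str
    title: str
    paragraphs: list = field(default_factory=list)
    children: list = field(default_factory=list)


def depth(number):
    """Assumed rule: "1" is level 1, "1.2" is level 2, and so on."""
    return number.count(".") + 1


def build_tree(items):
    """items: (number or None, title or None, paragraph text) in document order."""
    root = Node("", "")
    stack = [root]  # path of currently open nodes, root first
    for number, title, text in items:
        if number is None:
            stack[-1].paragraphs.append(text)  # unnumbered text joins the open level
            continue
        del stack[depth(number):]  # close nodes at the same or a deeper level
        node = Node(number, title or "", [text])
        stack[-1].children.append(node)
        stack.append(node)
    return root


root = build_tree([
    ("1", "General", "1 General: overview."),
    ("1.1", "Scope", "1.1 Scope: applies."),
    (None, None, "Continuation."),
    ("1.2", None, "1.2 The parties shall cooperate."),
    ("2", "Liability", "2 Liability: rules."),
])
```

Unnumbered paragraphs are attached to the most recently opened node, so every level ends up with at least one paragraph, as the step requires.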
Step 405, importing the numbers, the titles corresponding to the numbers, and the text content into a full-text search engine.
All the numbers, the titles corresponding to the numbers, and the text content in the PDF file are imported into a full-text search engine. If the PDF file is a legal document, the clause numbers, clause titles, and clause contents are imported into the full-text search engine (for example, Elasticsearch).
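For Elasticsearch, an import of this kind is commonly done through the bulk API, whose request body is newline-delimited JSON: one action line followed by one document line per record, with a trailing newline. The sketch below only builds that body; the index name `clauses` and the document fields (`number`, `title`, `content`, `level`, `seq`) are assumptions chosen for illustration, and sending the request is left out.

```python
import json


def bulk_payload(index, clauses):
    """Build an Elasticsearch bulk (NDJSON) body that indexes each clause."""
    lines = []
    for i, clause in enumerate(clauses):
        # Action line: index this document under a stable id.
        lines.append(json.dumps({"index": {"_index": index, "_id": str(i)}}))
        # Source line: the clause itself.
        lines.append(json.dumps(clause, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


clauses = [
    {"number": "1", "title": "Scope", "content": "This law applies.", "level": 1, "seq": 0},
    {"number": "1.1", "title": "", "content": "Contracts are covered.", "level": 2, "seq": 1},
]
payload = bulk_payload("clauses", clauses)
```

Storing the level and the original sequence position with each clause is what makes level-granular, source-ordered retrieval possible in the next step.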
Step 406, retrieving, through the full-text search engine, search results corresponding to a target level and/or a keyword input by the user.
The user can search the text content through the full-text search engine; for example, the user can input a target level, a keyword, a title, or the like, and the corresponding search results are retrieved by the full-text search engine. As shown in Fig. 5, taking a legal document as an example, the user may input a level of legal clauses and a keyword, and the output is a list of legal clauses that contain the keyword at the granularity of the selected level, ordered according to the sequence of the clauses in the original text. In addition, filter fields such as time and country may be added to improve the accuracy of the search.
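With Elasticsearch as the engine, such a request might be expressed in the query DSL as a `bool` query: a `match` clause for the keyword, a `term` filter for the target level, and a sort on the original position. This is a sketch under assumptions; the field names `content`, `level`, and `seq` are hypothetical and must match whatever was indexed.

```python
def build_query(keyword=None, level=None):
    """Build a query body for a keyword and/or a target level."""
    must = [{"match": {"content": keyword}}] if keyword else [{"match_all": {}}]
    filters = [{"term": {"level": level}}] if level is not None else []
    return {
        "query": {"bool": {"must": must, "filter": filters}},
        # Sort by position in the source document rather than by relevance,
        # so that results follow the order of the original text.
        "sort": [{"seq": {"order": "asc"}}],
    }


query = build_query(keyword="liability", level=2)
level_only = build_query(level=1)
```

Additional filter fields such as time or country would be further `term` or `range` entries in the `filter` list.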
Step 407, in response to the user clicking any search result, displaying the hierarchical structure, the text content corresponding to that search result, and the location area of that text content in the PDF file.
After a clause is selected in the list shown in Fig. 5, a detail page pops up. As shown in Fig. 6, the left side of the detail page is a directory of the hierarchical structure (containing clause numbers and clause titles), the upper right side shows the clause content, and below it is the corresponding location area in the PDF file.
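The location area can be derived from the pixel coordinates recorded in step 401. As a minimal sketch, assuming each clause keeps the `(page, x0, y0, x1, y1)` boxes of its text blocks, the region to highlight on each page is the union of those boxes; the function name and tuple layout are hypothetical.

```python
def location_regions(blocks):
    """Union the boxes of a clause's text blocks, page by page.

    blocks: iterable of (page, x0, y0, x1, y1) tuples in pixels.
    Returns {page: (x0, y0, x1, y1)} with one enclosing box per page.
    """
    regions = {}
    for page, x0, y0, x1, y1 in blocks:
        if page in regions:
            a0, b0, a1, b1 = regions[page]
            regions[page] = (min(a0, x0), min(b0, y0), max(a1, x1), max(b1, y1))
        else:
            regions[page] = (x0, y0, x1, y1)
    return regions


regions = location_regions([
    (0, 60, 100, 560, 120),
    (0, 40, 124, 500, 144),
    (1, 40, 30, 300, 50),
])
```

A clause that runs across a page break yields one region per page, which the detail page can display in sequence.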
The invention combines OCR and NLP technologies to convert a PDF file into structured, hierarchical text content based on the content of the text and its relative position information, so that search results are presented as clauses rather than as generic text fragments. The structured, hierarchical information is displayed in a visual and flexible manner, so that the user can see the complete content and context of each clause, which greatly improves the effectiveness of legal text retrieval and the efficiency of information use.
In addition, the specific implementation of this other embodiment has already been described in detail in the PDF file conversion method above, so the description is not repeated here.
Fig. 7 is a schematic diagram of the main modules of a PDF file conversion apparatus according to an embodiment of the present invention. As shown in Fig. 7, the PDF file conversion apparatus 700 includes an identification module 701, an aggregation module 702, an extraction module 703, and a conversion module 704. The identification module 701 is configured to perform character recognition on a PDF file, so as to output the pixel coordinates and text content of each text block; the aggregation module 702 is configured to aggregate the text blocks according to the pixel coordinates and text content of each text block to form paragraphs; the extraction module 703 is configured to extract the number and the title corresponding to the number from each paragraph; and the conversion module 704 is configured to form text content with a hierarchical structure according to the paragraphs, their corresponding numbers, and the titles corresponding to the numbers.
Optionally, the identification module 701 is further configured to:
converting a PDF file into a plurality of continuous picture files by taking a page as a unit;
and performing character recognition on the picture file, thereby outputting the pixel coordinates and the character content of each character block in the picture file.
Optionally, the aggregation module 702 is further configured to:
vectorizing the text content of each text block to obtain a vector of each text block;
and for any character block, inputting the vector and the pixel coordinate of the character block into a text classification model, and outputting whether the character block belongs to the previous paragraph or the next paragraph, thereby forming each paragraph.
Optionally, the extraction module 703 is further configured to:
and extracting the number and the title corresponding to the number from each paragraph through a trained Bi-LSTM-CRF model.
Optionally, the conversion module 704 is further configured to:
vectorizing the text content of each paragraph to obtain a vector of each paragraph;
for any paragraph, inputting the vector of the paragraph, the pixel coordinates of the text block at the edge of the paragraph, and the number extracted from the paragraph and the title corresponding to the number into a text classification model, and outputting whether the paragraph is classified as an upper level or a lower level, thereby forming the text content with a hierarchical structure.
Optionally, the apparatus further comprises a retrieval module, configured to:
after forming the text content with a hierarchical structure according to each paragraph, the number corresponding to each paragraph and the title corresponding to the number, importing the number, the title corresponding to the number, and the text content into a full-text search engine.
Optionally, the retrieval module is further configured to:
after the number, the title corresponding to the number, and the text content are imported into the full-text search engine, retrieving, through the full-text search engine, a search result corresponding to a target level and/or a keyword input by a user;
and in response to the user clicking any search result, displaying the hierarchical structure, the text content corresponding to that search result, and the location area of that text content in the PDF file.
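The division into modules maps naturally onto a small pipeline object in which each module is a replaceable callable. The sketch below is an assumption about how the four core modules could be wired together, not a description of the apparatus itself; `PdfConverter`, its field names, and the stub functions in the usage example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PdfConverter:
    recognize: Callable  # identification module: PDF path -> text blocks
    aggregate: Callable  # aggregation module: text blocks -> paragraphs
    extract: Callable    # extraction module: paragraph -> (number, title)
    convert: Callable    # conversion module: numbered paragraphs -> hierarchy

    def run(self, pdf_path):
        blocks = self.recognize(pdf_path)
        paragraphs = self.aggregate(blocks)
        numbered = [(p, *self.extract(p)) for p in paragraphs]
        return self.convert(numbered)


# Usage with trivial stubs in place of the OCR engine and trained models.
converter = PdfConverter(
    recognize=lambda path: ["1 Scope", "text"],
    aggregate=lambda blocks: [" ".join(blocks)],
    extract=lambda p: (p.split()[0], p.split()[1]),
    convert=lambda items: {"clauses": items},
)
result = converter.run("law.pdf")
```

Passing the modules in as callables keeps each stage independently testable and lets a heuristic be replaced by a trained model without touching the pipeline.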
As can be seen from the embodiments described above, by extracting the numbers and the titles corresponding to the numbers from the paragraphs and forming text content with a hierarchical structure according to the paragraphs, their corresponding numbers, and the titles corresponding to the numbers, the embodiments of the present invention solve the technical problem in the prior art concerning the hierarchical structure of the file and the lack of context in search results. The embodiments of the invention combine OCR and NLP technologies to convert the PDF file into structured, hierarchical text content based on the content of the text and its relative position information, so that a user can fully understand the text content and its context in the file, which greatly improves the effectiveness of text retrieval and the efficiency of information use.
It should be noted that the specific implementation of the PDF file conversion apparatus according to the present invention has already been described in detail in the PDF file conversion method above, so the description is not repeated here.
Fig. 8 shows an exemplary system architecture 800 of a PDF file conversion method or a PDF file conversion apparatus to which an embodiment of the present invention can be applied.
As shown in Fig. 8, the system architecture 800 may include terminal devices 801, 802, 803, a network 804, and a server 805. The network 804 serves as a medium for providing communication links between the terminal devices 801, 802, 803 and the server 805. The network 804 may include various types of connections, such as wired links, wireless communication links, or fiber optic cables.
A user may use the terminal devices 801, 802, 803 to interact with a server 805 over a network 804 to receive or send messages or the like. The terminal devices 801, 802, 803 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 801, 802, 803 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 805 may be a server that provides various services, such as a background management server (by way of example only) that supports shopping websites browsed by users using the terminal devices 801, 802, 803. The background management server can analyze and process received data, such as an article information query request, and feed the processing result back to the terminal device.
It should be noted that the method for converting a PDF file provided by the embodiment of the present invention is generally executed by the server 805, and accordingly, the PDF file converting apparatus is generally disposed in the server 805. The method for converting a PDF file provided by the embodiment of the present invention may also be executed by the terminal devices 801, 802, and 803, and accordingly, the apparatus for converting a PDF file may be disposed in the terminal devices 801, 802, and 803.
It should be understood that the number of terminal devices, networks, and servers in fig. 8 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to Fig. 9, a block diagram of a computer system 900 suitable for implementing a terminal device of an embodiment of the present invention is shown. The terminal device shown in Fig. 9 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in Fig. 9, the computer system 900 includes a Central Processing Unit (CPU) 901 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. The RAM 903 also stores various programs and data necessary for the operation of the system 900. The CPU 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card or a modem. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 910 as necessary, so that a computer program read from it can be installed into the storage section 908 as needed.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. When the computer program is executed by the Central Processing Unit (CPU) 901, the above-described functions defined in the system of the present invention are performed.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer programs according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor including an identification module, an aggregation module, an extraction module, and a conversion module, where the names of the modules do not, in some cases, constitute a limitation on the modules themselves.
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, implement the method of: performing character recognition on the PDF file so as to output pixel coordinates and character contents of each character block; according to the pixel coordinates and the text content of each text block, aggregating each text block to form each paragraph; extracting numbers and titles corresponding to the numbers from each paragraph; and forming character contents with a hierarchical structure according to each paragraph, the number corresponding to each paragraph and the title corresponding to the number.
According to the technical solution of the embodiments of the present invention, the numbers and the titles corresponding to the numbers are extracted from the paragraphs, and text content with a hierarchical structure is formed according to the paragraphs, their corresponding numbers, and the titles corresponding to the numbers; this solves the technical problem in the prior art concerning the hierarchical structure of the file and the lack of context in search results. The embodiments of the invention combine OCR and NLP technologies to convert the PDF file into structured, hierarchical text content based on the content of the text and its relative position information, so that a user can fully understand the text content and its context in the file, which greatly improves the effectiveness of text retrieval and the efficiency of information use.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A PDF file conversion method is characterized by comprising the following steps:
performing character recognition on the PDF file so as to output pixel coordinates and character contents of each character block;
according to the pixel coordinates and the text content of each text block, aggregating each text block to form each paragraph;
extracting numbers and titles corresponding to the numbers from each paragraph;
and forming character contents with a hierarchical structure according to each paragraph, the number corresponding to each paragraph and the title corresponding to the number.
2. The method of claim 1, wherein performing text recognition on the PDF file to output pixel coordinates of each text block and text content comprises:
converting a PDF file into a plurality of continuous picture files by taking a page as a unit;
and performing character recognition on the picture file, thereby outputting the pixel coordinates and the character content of each character block in the picture file.
3. The method of claim 1, wherein aggregating the text blocks to form paragraphs according to the pixel coordinates and text content of the text blocks comprises:
vectorizing the text content of each text block to obtain a vector of each text block;
and for any character block, inputting the vector and the pixel coordinate of the character block into a text classification model, and outputting whether the character block belongs to the previous paragraph or the next paragraph, thereby forming each paragraph.
4. The method of claim 1, wherein extracting the number and the title corresponding to the number from each paragraph comprises:
and extracting the serial number and the title corresponding to the serial number from each paragraph through a trained Bi-LSTM-CRF model.
5. The method according to claim 1, wherein forming the text content with a hierarchical structure according to the paragraphs and their corresponding numbers and titles corresponding to the numbers comprises:
vectorizing the text content of each paragraph to obtain a vector of each paragraph;
for any paragraph, inputting the vector of the paragraph, the pixel coordinates of the text block at the edge of the paragraph, and the number extracted from the paragraph and the title corresponding to the number into a text classification model, and outputting whether the paragraph is classified as an upper level or a lower level, thereby forming the text content with a hierarchical structure.
6. The method according to claim 1, wherein after forming the text with a hierarchical structure according to the paragraphs, their corresponding numbers and their corresponding titles, the method further comprises:
and importing the number and the title and the text content corresponding to the number into a full text retrieval engine.
7. The method of claim 6, wherein after importing the number and the title and text content corresponding to the number into a full text search engine, further comprising:
retrieving a retrieval result corresponding to a target hierarchy and/or a keyword through the full-text retrieval engine according to the target hierarchy and/or the keyword input by a user;
and responding to any item of retrieval result clicked by a user, and displaying the hierarchical structure, the text content corresponding to the any item of retrieval result and the position area of the text content corresponding to the any item of retrieval result in the PDF file.
8. A PDF file conversion apparatus, comprising:
the identification module is used for carrying out character identification on the PDF file so as to output pixel coordinates and character contents of each character block;
the aggregation module is used for aggregating each character block according to the pixel coordinates and the character content of each character block to form each paragraph;
the extraction module is used for extracting the serial numbers and the titles corresponding to the serial numbers from the paragraphs;
and the conversion module is used for forming character contents with a hierarchical structure according to each paragraph, the number corresponding to each paragraph and the title corresponding to the number.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
the one or more programs, when executed by the one or more processors, implement the method of any of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN202110769021.XA 2021-07-07 2021-07-07 PDF file conversion method and device, electronic equipment and computer readable medium Pending CN113486148A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110769021.XA CN113486148A (en) 2021-07-07 2021-07-07 PDF file conversion method and device, electronic equipment and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110769021.XA CN113486148A (en) 2021-07-07 2021-07-07 PDF file conversion method and device, electronic equipment and computer readable medium

Publications (1)

Publication Number Publication Date
CN113486148A true CN113486148A (en) 2021-10-08

Family

ID=77941755

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110769021.XA Pending CN113486148A (en) 2021-07-07 2021-07-07 PDF file conversion method and device, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN113486148A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination