CN116627912A - Integration and extraction method for multi-modal content of multi-type document

Info

Publication number: CN116627912A
Application number: CN202310885109.7A
Authority: CN
Inventors: 阎德劲; 赵晓虎; 陈凤; 黄金元; 白建亮; 雷文强; 刘法; 向元新; 黎乾隆; 郑大安; 袁焦; 张郭勇; 奂锐; 吴雪松
Original assignee: CETC 10 Research Institute
Current assignee: CETC 10 Research Institute
Priority date: 2023-07-19
Filing date: 2023-07-19
Publication date: 2023-08-22

Abstract

The invention discloses an integration and extraction method for multi-modal content of multi-type documents, and relates to the technical field of natural language processing. The method first judges the type of a target document, then searches it according to keywords, applying a different content extraction method to each modality of data in the target document to generate the document's multi-modal content. Because the documents subject to information extraction are becoming increasingly diverse, the invention combines several algorithms for integrated retrieval and extraction. This solves the problems that current information extraction methods mainly target single-type files and single-modal content, and that recognition accuracy drops on larger unstructured documents.

Description

Integration and extraction method for multi-modal content of multi-type document

Technical Field

The invention relates to the technical field of natural language processing, in particular to an integration and extraction method of multi-modal content of a multi-type document.

Background

As enterprises and organizations demand more digitized information and processing, automated extraction of document content is becoming increasingly important. When building a document content extraction system, it is usually necessary to first determine the type of the document so that the different kinds of unstructured and semi-structured document data can be extracted and recognized accordingly. Because file types in actual production are varied and there is no unified standard for the range of content to be recognized, meeting user needs with a multi-type, multi-modal method for extracting and detecting document content has become a major challenge in applying intelligent computing technology to real production environments.

The history of document content extraction can be traced back to the 1960s, when researchers began to study how to extract useful information from text. Early text extraction techniques were relatively simple and inefficient, and required significant manpower and time. With the continuous development of computer and information processing technology, document content extraction has improved greatly. In the 1980s, automated document extraction scripts based on rules and templates began to appear. These scripts extract text using manually written rules and templates, but they remain relatively inefficient and have difficulty handling complex document structures and grammar. In the 21st century, the development of artificial intelligence technologies such as deep learning and neural networks has advanced document content extraction further. For example, automated document extraction algorithms based on convolutional neural networks can achieve high-accuracy text extraction without rules or templates. Automated document extraction algorithms based on content analysis and pattern recognition are also evolving.

Currently, mainstream document content extraction technology falls into two main categories: template matching and deep learning. Template matching extracts document content through manually constructed rules, which can take many forms, such as string similarity, regular expressions and bag-of-words models. However, complete rules must be formulated in advance, information outside the rules cannot be extracted, and different rules must be set for different scenarios. Deep learning first requires collecting a large amount of data through the Internet; it generalizes well, but it is costly and poorly interpretable.

For practical production environments, the current technology has the following drawbacks:

In a single production environment, a large amount of data cannot be acquired to train a deep learning model, and a template matching model cannot match all the information a user requires.

Current information extraction mainly targets a single file type and cannot meet users' needs for extracting and retrieving content from multiple file types.

Current information extraction mainly targets a single kind of content in a file and cannot meet users' needs for simultaneously extracting and retrieving multi-modal content such as text, tables and pictures.

Disclosure of Invention

The object of the invention is to solve the problems that existing information extraction methods mainly target single-type files and single-modal content, and that recognition accuracy drops on larger unstructured documents.

The above object of the present invention can be achieved by the following technical solutions:

The invention relates to an integrated retrieval and extraction method for multi-modal content of multi-type documents, which comprises the following steps:

obtaining a search keyword and a target document to be searched;

judging the type of the target document;

and searching according to the keyword to obtain multi-modal retrieval information of the target document.

Further, the types of the target document comprise DOC/DOCX files, EXCEL files, PDF files and TXT files, and the multi-modal content comprises text, tables and pictures/block diagrams.

Further, searching according to the keyword to obtain multi-modal retrieval information of the target document specifically includes:

a DOC/DOCX file content extraction method;

an EXCEL file content extraction method;

a PDF file content extraction method;

TXT file content extraction method.

Further, the DOC/DOCX file content extraction method specifically comprises the following steps:

converting the target DOC/DOCX file into HTML format by using Aspose;

extracting text and tables by using an HTML-based keyword fuzzy matching algorithm;

extracting a picture according to whether the block diagram title hits the search keyword;

for an extracted picture in WMF/EMF/VISIO format, converting the picture into PNG format by using LibreOffice, and removing redundant whitespace by using the Python Pillow package;

converting binary data of the picture into base64 and returning it;

and matching and integrating the text, table and picture information, and returning all the extracted content.

Further, extracting the text and tables by using the HTML-based keyword fuzzy matching algorithm specifically comprises the following steps:

retrieving the HTML tag content;

performing fuzzy matching of keywords based on the Levenshtein distance, and calculating a matching degree;

and sorting the keyword matching results from high to low matching degree by using a quicksort algorithm, and returning them.

Further, the EXCEL file content extraction method specifically includes:

performing content matching on the table by using the Python pandas package, and returning the extracted content;

specifically, extracting EXCEL file information by using the pandas library for Python;

and passing the file information as key-value pairs, and finally merging the information and returning the extracted content.

Further, the PDF file content extraction method specifically includes:

when the current page is a picture, extracting content by using an OCR-based image text extraction algorithm;

when the current page is a table, extracting content by using a nesting-based PDF table extraction algorithm;

when the current page is neither a picture nor a table, extracting content by using a pdfplumber-based PDF text retrieval algorithm.

Further, when the current page is a picture, extracting content by using an OCR-based image text extraction algorithm, specifically including:

performing layout analysis on the image page by using PaddleOCR-based deep learning;

performing text analysis on the image page by using PaddleOCR-based deep learning;

integrating the layout information with the text information and returning the extracted content.

Further, when the current page is a table, extracting content by using the nesting-based PDF table extraction algorithm specifically comprises the following steps:

extracting PDF table content by using Python and the pdfplumber tool.

Further, the TXT file content extraction method specifically includes:

extracting TXT text content by using a fuzzy matching algorithm;

extracting TXT table content by using a table extraction algorithm based on format parsing.

The beneficial effects of the invention are as follows:

The invention relates to an integration and extraction method for multi-modal content of multi-type documents, which combines several algorithms to support extraction from multiple document types and of multi-modal content; the OCR-based picture text recognition algorithm improves the determination of text box positions; and exact matching is combined with fuzzy search for keyword-based retrieval of document content. Compared with methods that support only a single document type, the method supports content extraction from multiple document types at the same time, which widens its range of application. Compared with methods that extract only a single kind of content, it supports simultaneous extraction of multi-modal content (text, pictures and block diagrams), which widens the range of information extracted. Compared with a single retrieval algorithm, it supports a more flexible retrieval mode and improves retrieval results.

Drawings

To describe the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and should not be considered limiting in scope; a person skilled in the art can obtain other related drawings from these drawings without inventive effort. In the drawings:

FIG. 1 is a flow chart of an extraction method of the present invention;

FIG. 2 is a flow chart of generating multi-modal retrieval information for a target document by keyword retrieval;

FIG. 3 shows the content extraction method for DOC/DOCX files;

FIG. 4 shows the content extraction method for PDF files.

Detailed Description

The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the particular embodiments described herein are illustrative only and are not intended to limit the invention, i.e., the embodiments described are merely some, but not all, of the embodiments of the invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

It should be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Example 1

This embodiment provides an integration and extraction method for multi-modal content of multi-type documents which, as shown in FIG. 1, comprises the following steps:

S1: obtaining a search keyword and a target document to be searched;

S2: judging the type of the target document;

S3: searching according to the keyword to obtain multi-modal retrieval information of the target document.

Specifically, as shown in FIG. 2, searching according to the keyword and generating the multi-modal retrieval information of the target document comprises the following:

S31: a DOC/DOCX file content extraction method;

S32: an EXCEL file content extraction method;

S33: a PDF file content extraction method;

S34: a TXT file content extraction method.
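Steps S1 to S3 and the branch into S31 to S34 can be sketched as a small dispatcher. This is only a minimal illustration: the patent does not say how the document type is judged, so judging by file extension is an assumption here, and the names `EXTENSION_TO_TYPE`, `judge_document_type`, `retrieve` and the `extractors` mapping are hypothetical.

```python
from pathlib import Path

# Assumed mapping from file extension to the four document types named above.
EXTENSION_TO_TYPE = {
    ".doc": "DOC/DOCX",
    ".docx": "DOC/DOCX",
    ".xls": "EXCEL",
    ".xlsx": "EXCEL",
    ".pdf": "PDF",
    ".txt": "TXT",
}


def judge_document_type(path: str) -> str:
    """Step S2: judge the type of the target document (here, by extension only)."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTENSION_TO_TYPE[suffix]
    except KeyError:
        raise ValueError(f"unsupported document type: {suffix!r}") from None


def retrieve(keyword: str, path: str, extractors: dict) -> dict:
    """Step S3: route the document to the extractor for its type (S31 to S34).

    `extractors` is a hypothetical mapping from a type name to a
    callable(keyword, path) that returns the multi-modal results.
    """
    doc_type = judge_document_type(path)
    return {"type": doc_type, "results": extractors[doc_type](keyword, path)}
```

Keeping the extractors in a mapping means a further document type could be supported by adding one entry, without touching the routing logic.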

As shown in FIG. 3, the DOC/DOCX file content extraction method of step S31 specifically includes:

S311: converting the target DOC/DOCX file into HTML format by using Aspose;

S312: extracting text and tables by using an HTML-based keyword fuzzy matching algorithm;

S313: extracting a picture according to whether the block diagram title hits the search keyword;

S314: for an extracted picture in WMF/EMF/VISIO format, converting the picture into PNG format by using LibreOffice, and removing redundant whitespace by using the Python Pillow package;

S315: converting binary data of the picture into base64 and returning it;

S316: matching and integrating the text, table and picture information, and returning all the extracted content.
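Steps S313 and S315 can be illustrated with the standard library alone. The sketch below assumes that a title "hits" the keyword when it contains it as a case-insensitive substring, which the patent does not specify, and that pictures arrive as hypothetical `(title, raw_bytes)` pairs. The format conversion and whitespace trimming of S314 are not shown.

```python
import base64


def title_hits_keyword(title: str, keyword: str) -> bool:
    """S313 (assumed rule): keep a picture when its title contains the keyword."""
    return keyword.casefold() in title.casefold()


def pictures_to_base64(pictures, keyword):
    """S313 + S315: filter pictures by title, then base64-encode their bytes.

    `pictures` is a hypothetical list of (title, raw_bytes) pairs.
    """
    results = []
    for title, raw in pictures:
        if title_hits_keyword(title, keyword):
            encoded = base64.b64encode(raw).decode("ascii")
            results.append({"title": title, "image_base64": encoded})
    return results
```

Returning the picture as base64 text lets it travel in the same structured response as the text and table results.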

The HTML-based keyword fuzzy matching algorithm of step S312 specifically includes:

S3121: retrieving the HTML tag content;

S3122: performing fuzzy matching of keywords based on the Levenshtein distance, and calculating a matching degree;

S3123: sorting the keyword matching results from high to low matching degree by using a quicksort algorithm, and returning them.
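Steps S3122 and S3123 can be sketched as follows. The patent does not give a formula for the matching degree, so normalizing the distance by the longer string's length is an assumption, as is the `threshold` parameter. Python's built-in `sorted` (Timsort) stands in for the quicksort named in the text; apart from the handling of ties, the descending order is the same.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def matching_degree(keyword: str, text: str) -> float:
    """Assumed normalization: 1 - distance / longer length, a value in [0, 1]."""
    longest = max(len(keyword), len(text))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(keyword, text) / longest


def rank_matches(keyword, candidates, threshold=0.5):
    """Score each candidate, drop weak matches, sort from high to low degree."""
    scored = [(matching_degree(keyword, c), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] >= threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)
```

In the DOC/DOCX flow, the candidates would be the text content of the HTML tags retrieved in S3121.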

The EXCEL file content extraction method of step S32 specifically includes:

S321: performing content retrieval on the table by using the Python pandas package, and returning the extracted content;

the method specifically comprises the following steps:

S3211: extracting EXCEL file information by using the pandas library for Python;

S3212: passing the file information as key-value pairs, and finally merging the information and returning the extracted content.
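A minimal pandas sketch of S3211 and S3212 follows. The row-level substring match and the shape of the returned key-value pairs are assumptions, and `search_sheets` and `search_excel` are hypothetical names. The matching logic takes already-loaded DataFrames so it can be tried without a spreadsheet file.

```python
import pandas as pd


def search_sheets(sheets: dict, keyword: str) -> list:
    """Return every row that contains the keyword, as key-value pairs."""
    hits = []
    for sheet_name, frame in sheets.items():
        as_text = frame.astype(str)  # compare every cell as text
        mask = as_text.apply(
            lambda column: column.str.contains(keyword, case=False, regex=False)
        ).any(axis=1)
        for row_index, row in frame[mask].iterrows():
            hits.append(
                {"sheet": sheet_name, "row": row_index, "values": row.to_dict()}
            )
    return hits


def search_excel(path: str, keyword: str) -> list:
    # sheet_name=None asks pandas for a dict of {sheet name: DataFrame}.
    return search_sheets(pd.read_excel(path, sheet_name=None), keyword)
```

Note that empty cells become the text "nan" after `astype(str)`, so a real implementation would probably want to blank them out before matching.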

As shown in FIG. 4, the PDF file content extraction method of step S33 specifically includes:

S331: when the current page is a picture, extracting content by using an OCR-based image text extraction algorithm;

S332: when the current page is a table, extracting content by using a nesting-based PDF table extraction algorithm;

S333: when the current page is neither a picture nor a table, extracting content by using a pdfplumber-based PDF text retrieval algorithm;

S334: matching and integrating the text, table and picture information, and returning all the extracted content.
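The per-page branching of S331 to S333 might look like the sketch below. The patent does not state how a page is recognized as a picture or a table, so the heuristic here is an assumption: no text layer plus at least one embedded image means a scanned picture, and a detected table means a table page. `classify_page`, `extract_pdf` and the `handlers` mapping are hypothetical names.

```python
def classify_page(page) -> str:
    """Pick the branch (S331, S332 or S333) for a pdfplumber-style page.

    Assumed heuristic: a page with no text layer but at least one embedded
    image is a picture; a page on which a table is detected is a table page.
    """
    text = (page.extract_text() or "").strip()
    if not text and page.images:
        return "picture"
    if page.extract_tables():
        return "table"
    return "text"


def extract_pdf(path: str, keyword: str, handlers: dict) -> list:
    """Route each page to its handler and collect the results (S334).

    `handlers` maps "picture", "table" and "text" to callables(page, keyword).
    """
    import pdfplumber  # third-party; imported here so classify_page runs without it

    results = []
    with pdfplumber.open(path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            kind = classify_page(page)
            results.append(
                {"page": number, "kind": kind, "content": handlers[kind](page, keyword)}
            )
    return results
```

A page that mixes text and a table is routed to the table branch under this heuristic; a fuller implementation would likely run both extractors on such a page.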

In step S331, when the current page is a picture, extracting content by using the OCR-based image text extraction algorithm specifically includes:

S3311: performing layout analysis on the image page by using PaddleOCR-based deep learning;

S3312: performing text analysis on the image page by using PaddleOCR-based deep learning;

S3313: integrating the layout information with the text information and returning the extracted content.
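The integration in S3313 can be sketched independently of any OCR library. The sketch assumes that layout analysis has already produced regions and text recognition has already produced lines, both with `(x0, y0, x1, y1)` boxes. These data shapes and the rule "assign a line to the region containing its center" are illustrative assumptions, not the patent's stated method.

```python
def center(box):
    """Midpoint of an (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def contains(region_box, point) -> bool:
    """True when the point lies inside the box (edges included)."""
    x0, y0, x1, y1 = region_box
    x, y = point
    return x0 <= x <= x1 and y0 <= y <= y1


def integrate(layout_regions, text_lines):
    """Attach each recognized text line to the layout region holding its center.

    Lines are visited top to bottom, then left to right, so each region's
    text comes out in reading order.
    """
    merged = [{"type": r["type"], "box": r["box"], "lines": []} for r in layout_regions]
    for line in sorted(text_lines, key=lambda item: (item["box"][1], item["box"][0])):
        point = center(line["box"])
        for region in merged:
            if contains(region["box"], point):
                region["lines"].append(line["text"])
                break
    return merged
```

Lines whose center falls outside every region are dropped in this sketch; collecting them in a separate list would be a reasonable extension.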

In step S332, when the current page is a table, extracting content by using the nesting-based PDF table extraction algorithm specifically includes:

S3321: extracting PDF table content by using Python and the pdfplumber tool;

S3322: passing the file information as key-value pairs, and finally merging the information and returning the extracted content.
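pdfplumber's `extract_tables()` returns nested lists (tables, then rows, then cells, with `None` for empty cells), so S3322 can be sketched on that shape directly. Treating the first row as the header and matching by substring are assumptions; `table_to_records` and `search_tables` are hypothetical names.

```python
def table_to_records(table):
    """Turn one table (header row plus data rows) into {header: cell} dicts."""
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


def search_tables(tables, keyword):
    """Keep records in which any cell contains the keyword, merged across tables."""
    needle = keyword.casefold()
    merged = []
    for index, table in enumerate(tables):
        for record in table_to_records(table):
            if any(needle in value.casefold() for value in record.values()):
                merged.append({"table": index, "record": record})
    return merged
```

For a page object `page`, the input would be `page.extract_tables()`.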

The TXT file content extraction method of step S34 specifically includes:

S341: extracting TXT text by using a fuzzy matching algorithm;

S342: extracting TXT tables by using a table extraction algorithm based on format parsing.
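The patent does not describe the table format that S342 parses. As one possible reading, the sketch below treats consecutive lines whose cells are separated by tabs or `|` characters, and which have the same number of cells, as one table. The delimiter set and the two-row minimum are assumptions.

```python
import re

# Assumed table format: cells on a line are separated by tabs or "|".
DELIMITER = re.compile(r"\t|\|")


def split_row(line: str):
    """Split one delimited line into stripped cells, ignoring outer pipes."""
    return [cell.strip() for cell in DELIMITER.split(line.strip(" |"))]


def extract_txt_tables(text: str):
    """Group consecutive delimited lines with equal cell counts into tables."""
    tables, current = [], []
    for line in text.splitlines():
        cells = split_row(line) if DELIMITER.search(line) else None
        is_row = bool(cells) and len(cells) > 1
        if is_row and (not current or len(cells) == len(current[0])):
            current.append(cells)
            continue
        # The run of table rows ended: keep it only if it has at least two rows.
        if len(current) >= 2:
            tables.append(current)
        current = [cells] if is_row else []
    if len(current) >= 2:
        tables.append(current)
    return tables
```

The extracted rows could then be searched by keyword in the same way as the text content of S341.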

The invention adopts an integrated retrieval and extraction method for multi-modal content of multi-type documents: it first judges the type of the document, then uses different algorithms to extract the information of each modality in the document, integrates the extracted results, and finally obtains the content related to the search keyword from the whole document. The method avoids the training cost of deep learning algorithms and the low accuracy of template matching algorithms, and solves the problems that current information extraction methods mainly target single-type files and single-modal content, and that recognition accuracy drops on larger unstructured documents.

The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could conceive without inventive effort within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection defined by the claims.

Claims

1. A method for integrating and extracting multi-modal content of multi-type documents, characterized by comprising the following steps:

obtaining a search keyword and a target document to be searched;

judging the type of the target document;

2. The method for integrating and extracting multi-modal content of multi-type documents according to claim 1, wherein the types of the target document comprise DOC/DOCX files, EXCEL files, PDF files and TXT files, and the multi-modal content comprises text, tables and pictures/block diagrams.

3. The method for integrating and extracting multi-modal content of multi-type documents according to claim 2, wherein searching according to the keyword to obtain multi-modal retrieval information of the target document specifically comprises:

a DOC/DOCX file content extraction method;

an EXCEL file content extraction method;

a PDF file content extraction method;

TXT file content extraction method.

4. The method for integrating and extracting multi-modal content of multi-type documents according to claim 3, wherein the DOC/DOCX file content extraction method specifically comprises:

converting the target DOC/DOCX file into HTML format by using Aspose;

converting binary data of the picture into base64 and returning;

5. The method for integrating and extracting multi-modal content of multi-type documents according to claim 4, wherein extracting the text and tables by using the HTML-based keyword fuzzy matching algorithm specifically comprises:

retrieving the HTML tag content;

6. The method for integrating and extracting multi-modal content of multi-type documents according to claim 3, wherein the EXCEL file content extraction method specifically comprises:

7. The method for integrating and extracting multi-modal content of multi-type documents according to claim 3, wherein the PDF file content extraction method specifically comprises:

8. The method for integrating and extracting multi-modal content of multi-type documents according to claim 7, wherein, when the current page is a picture, extracting content by using the OCR-based image text extraction algorithm specifically comprises:

9. The method for integrating and extracting multi-modal content of multi-type documents according to claim 7, wherein, when the current page is a table, extracting content by using the nesting-based PDF table extraction algorithm specifically comprises:

10. The method for integrating and extracting multi-modal content of multi-type documents according to claim 3, wherein the TXT file content extraction method specifically comprises:

extracting TXT text content by using a fuzzy matching algorithm;