CN116627912A - Integration and extraction method for multi-modal content of multi-type document - Google Patents

Publication number
CN116627912A
Authority
CN
China
Prior art keywords
content
extracting
modal
picture
information
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310885109.7A
Other languages
Chinese (zh)
Inventor
阎德劲
赵晓虎
陈凤
黄金元
白建亮
雷文强
刘法
向元新
黎乾隆
郑大安
袁焦
张郭勇
奂锐
吴雪松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 10 Research Institute
Original Assignee
CETC 10 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-07-19
Filing date
2023-07-19
Publication date
2023-08-22
Application filed by CETC 10 Research Institute
Priority to CN202310885109.7A
Publication of CN116627912A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/116 Details of conversion of file system types or formats
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an integration and extraction method for multi-modal content of a multi-type document, and relates to the technical field of natural language processing. The method first judges the type of a target document, then searches according to keywords, applying a different content extraction method to each modality of data in the target document to generate the multi-modal content of the target document. Because the documents targeted by information extraction are increasingly diverse, the invention integrates several algorithms for retrieval and extraction. This addresses two problems of current information extraction methods: they are mainly aimed at single-type files and single-modal content, and their recognition accuracy falls on larger unstructured documents.

Description

Integration and extraction method for multi-modal content of multi-type document
Technical Field
The invention relates to the technical field of natural language processing, in particular to an integration and extraction method of multi-modal content of a multi-type document.
Background
With the increasing demand of enterprises and organizations for digitized information and its processing, automated document content extraction is becoming increasingly important. When constructing a document content extraction system, it is often necessary first to determine the type of a document before extracting and identifying content from different unstructured and semi-structured document data. Because file types in actual production are varied and there is no unified standard for the range of content to be identified, meeting user needs with a multi-type, multi-modal document content extraction and detection method has become a major challenge in applying intelligent computing technology to real production environments.
The history of document content extraction can be traced back to the 1960s, when researchers began to study how to extract useful information from text. In these early studies, text extraction techniques were relatively simple and inefficient, and required significant manpower and time. With the continuous development of computer and information processing technology, document content extraction has improved greatly. In the 1980s, automated document extraction scripts based on rules and templates began to appear. These scripts implement text extraction through manually written rules and templates, but they remain relatively inefficient and struggle with complex document structures and grammars. In the 21st century, with the development of artificial intelligence technologies such as deep learning and neural networks, document content extraction has improved further. For example, automated document extraction algorithms based on convolutional neural networks can achieve high-accuracy text extraction without rules or templates. Automated document extraction algorithms based on content analysis and pattern recognition techniques are also evolving.
Currently, mainstream document content extraction technology falls into two main categories: template matching and deep learning. Template matching extracts document content through manually constructed rules, which can take many forms, such as string similarity, regular expressions and the bag-of-words model. However, complete rules must be formulated in advance, information beyond the rules cannot be extracted, and different rules must be set for different scenarios. Deep learning first requires a large amount of data to be collected through the internet; it generalizes well, but is costly and poorly interpretable.
For practical production environments, the current technology has the following drawbacks:
In a single production environment, a large amount of data cannot be acquired to train a deep learning model, and a template matching model cannot match all the information a user requires.
Current information extraction is mainly aimed at a single file type and cannot meet users' needs for content extraction and retrieval across multiple file types.
Current information extraction is mainly aimed at a single kind of content in a file and cannot meet users' need to extract and retrieve multi-modal content such as text, tables and pictures at the same time.
Disclosure of Invention
The object of the invention is to solve the problems that existing information extraction methods are mainly aimed at single-type files and single-modal content, and that their recognition accuracy falls on larger unstructured documents.
The above object of the present invention can be achieved by the following technical solution:
The invention provides an integrated retrieval and extraction method for multi-modal content of multi-type documents, comprising the following steps:
obtaining a search keyword and a target document to be searched;
judging the type of the target document;
and searching according to the keyword to obtain multi-modal retrieval information of the target document.
Further, the types of the target document comprise DOC/DOCX files, EXCEL files, PDF files and TXT files, and the multi-modal content comprises texts, tables and pictures/block diagrams.
Further, the searching according to the keyword to obtain multi-modal retrieval information of the target document specifically includes:
a DOC/DOCX file content extraction method;
an EXCEL file content extraction method;
a PDF file content extraction method;
a TXT file content extraction method.
Further, the DOC/DOCX file content extraction method specifically comprises the following steps:
converting the target DOC/DOCX file into HTML format by using Aspose;
extracting text and tables by using an HTML-based keyword fuzzy matching algorithm;
extracting a picture according to whether its block diagram title hits the search keyword;
for an extracted picture in WMF, EMF or VISIO format, converting the picture into PNG format by using LibreOffice, and removing surplus blank margins by using the Python Pillow package;
converting the binary data of the picture into base64 and returning it;
and matching and integrating the text, table and picture information, and returning all the extracted content.
Further, the extracting of text and tables by using an HTML-based keyword fuzzy matching algorithm specifically comprises the following steps:
retrieving the HTML tag content;
performing fuzzy matching of keywords based on the Levenshtein distance, and calculating the matching degree;
and sorting the keyword matching results from high to low matching degree by using a quicksort algorithm, and returning them.
Further, the EXCEL file content extraction method specifically includes:
performing content matching on the table by using the Python pandas package, and returning the extracted content;
specifically, extracting the EXCEL file information by using the pandas library of the Python language;
and passing the file information as key-value pairs, then merging the information and returning the extracted content.
Further, the PDF file content extraction method specifically includes:
when the current page is a picture, extracting content by using an OCR-based image text extraction algorithm;
when the current page is a table, extracting content by using a nesting-based PDF table extraction algorithm;
when the current page is neither a picture nor a table, extracting content by using a pdfplumber-based PDF text retrieval algorithm;
and matching and integrating the text, table and picture information, and returning all the extracted content.
Further, when the current page is a picture, extracting content by using an OCR-based image text extraction algorithm specifically includes:
performing layout analysis on the image page by using PaddleOCR-based deep learning;
performing text analysis on the image page by using PaddleOCR-based deep learning;
integrating the layout information with the text information and returning the extracted content.
Further, when the current page is a table, extracting content by using a nesting-based PDF table extraction algorithm specifically comprises the following steps:
extracting the PDF table content by using the Python language and the pdfplumber tool;
and passing the file information as key-value pairs, then merging the information and returning the extracted content.
Further, the TXT file content extraction method specifically includes:
extracting the TXT text content by using a fuzzy matching algorithm;
extracting the TXT table content by using a table extraction algorithm based on format parsing.
The beneficial effects of the invention are as follows:
The invention provides an integration and extraction method for multi-modal content of a multi-type document, which combines several algorithms to support extraction from multiple document types and of multi-modal content. The OCR-based picture text recognition algorithm improves the determination of text box positions, and exact matching is combined with fuzzy search to perform keyword-based document content retrieval. Compared with methods that support only a single document type, the method supports content extraction from multiple document types at the same time, which widens its range of application. Compared with methods that extract only a single kind of content, it extracts multi-modal content (text, pictures and block diagrams) at the same time, which widens the range of information extracted. Compared with a single retrieval algorithm, it supports a more flexible retrieval mode and improves the retrieval effect.
Drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and should not be considered limiting in scope; a person skilled in the art can obtain other related drawings from these drawings without inventive effort. In the drawings:
FIG. 1 is a flow chart of the extraction method of the present invention;
FIG. 2 is a flow chart of generating multi-modal retrieval information of a target document by keyword-based retrieval;
FIG. 3 shows the content extraction method for DOC/DOCX files;
FIG. 4 shows the content extraction method for PDF files.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the particular embodiments described herein are illustrative only and are not intended to limit the invention, i.e., the embodiments described are merely some, but not all, of the embodiments of the invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
It should be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
This embodiment provides an integration and extraction method for multi-modal content of a multi-type document which, as shown in FIG. 1, comprises the following steps:
S1: obtaining a search keyword and a target document to be searched;
S2: judging the type of the target document;
S3: searching according to the keyword to obtain multi-modal retrieval information of the target document.
Specifically, as shown in FIG. 2, searching according to the keyword and generating the multi-modal retrieval information of the target document uses the following methods:
S31: a DOC/DOCX file content extraction method;
S32: an EXCEL file content extraction method;
S33: a PDF file content extraction method;
S34: a TXT file content extraction method.
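The description does not say how the document type is judged in S2. As a minimal sketch, assuming the type can be read from the file extension, steps S2 and S3 could be wired together as follows. The names `EXTENSION_TO_TYPE`, `judge_document_type` and `retrieve` are hypothetical and are not taken from the patent.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the four document types of S31-S34.
EXTENSION_TO_TYPE = {
    ".doc": "DOC/DOCX",
    ".docx": "DOC/DOCX",
    ".xls": "EXCEL",
    ".xlsx": "EXCEL",
    ".pdf": "PDF",
    ".txt": "TXT",
}


def judge_document_type(path):
    """S2: judge the document type from the extension (an assumed heuristic)."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTENSION_TO_TYPE:
        raise ValueError(f"unsupported document type: {suffix!r}")
    return EXTENSION_TO_TYPE[suffix]


def retrieve(keyword, path, extractors):
    """S3: route the document to the extractor registered for its type.

    `extractors` maps a type name to a callable taking (keyword, path).
    """
    doc_type = judge_document_type(path)
    return extractors[doc_type](keyword, path)
```

A caller would register one extractor per type, for example `retrieve("radar", "scan.pdf", {"PDF": extract_pdf_content})`, where `extract_pdf_content` stands for whatever implements S33.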
As shown in FIG. 3, the DOC/DOCX file content extraction method of step S31 specifically includes:
S311: converting the target DOC/DOCX file into HTML format by using Aspose;
S312: extracting text and tables by using an HTML-based keyword fuzzy matching algorithm;
S313: extracting a picture according to whether its block diagram title hits the search keyword;
S314: for an extracted picture in WMF, EMF or VISIO format, converting the picture into PNG format by using LibreOffice, and removing surplus blank margins by using the Python Pillow package;
S315: converting the binary data of the picture into base64 and returning it;
S316: matching and integrating the text, table and picture information, and returning all the extracted content.
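Steps S313 and S315 can be illustrated with the standard library alone. The sketch below assumes that a title "hits" the keyword when it contains the keyword, ignoring case, and that pictures arrive as (title, raw bytes) pairs; both are assumptions made for illustration, and the function names are hypothetical. Format conversion and margin trimming (S314) are left out.

```python
import base64


def title_hits_keyword(title, keyword):
    """S313: case-insensitive containment test (an assumed definition of a hit)."""
    return keyword.strip().lower() in title.lower()


def encode_picture(data):
    """S315: encode raw picture bytes as base64 ASCII text."""
    return base64.b64encode(data).decode("ascii")


def extract_pictures(pictures, keyword):
    """Return base64 payloads for the pictures whose title hits the keyword.

    `pictures` is a hypothetical list of (title, raw_bytes) pairs.
    """
    return [
        {"title": title, "image_base64": encode_picture(data)}
        for title, data in pictures
        if title_hits_keyword(title, keyword)
    ]
```

Returning base64 text rather than raw bytes lets the picture travel in the same JSON-style response as the extracted text and tables.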
The HTML-based keyword fuzzy matching algorithm of step S312 specifically includes:
S3121: retrieving the HTML tag content;
S3122: performing fuzzy matching of keywords based on the Levenshtein distance, and calculating the matching degree;
S3123: sorting the keyword matching results from high to low matching degree by using a quicksort algorithm, and returning them.
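Steps S3121 to S3123 can be sketched as below. The patent does not define the matching degree, so this example assumes one: 1 minus the smallest Levenshtein distance between the keyword and any window of the same length in the fragment, divided by the keyword length. Python's built-in `sorted` (Timsort) is used in place of the quicksort named in S3123; the resulting order is the same. The class and function names, the tag set and the threshold are all hypothetical.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """S3121: collect the text found inside paragraph, heading and table-cell tags."""

    TAGS = {"p", "td", "th", "li", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self._depth = 0
        self._buffer = []
        self.fragments = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.TAGS and self._depth:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._buffer).strip()
                if text:
                    self.fragments.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def matching_degree(keyword, text):
    """S3122: assumed score in [0, 1]; 1.0 means the keyword occurs verbatim."""
    keyword, text = keyword.lower(), text.lower()
    n = len(keyword)
    if n == 0 or not text:
        return 0.0
    windows = [text[i:i + n] for i in range(max(1, len(text) - n + 1))]
    best = min(levenshtein(keyword, window) for window in windows)
    return 1.0 - best / n


def fuzzy_search(html, keyword, threshold=0.5):
    """S3123: return (degree, fragment) pairs sorted from high to low degree."""
    collector = TagTextCollector()
    collector.feed(html)
    collector.close()
    scored = [(matching_degree(keyword, text), text) for text in collector.fragments]
    hits = [pair for pair in scored if pair[0] >= threshold]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)
```

Sliding a keyword-sized window over each fragment keeps long paragraphs from being penalised for their length, which a whole-string distance would do.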
The EXCEL file content extraction method of step S32 specifically includes:
S321: performing content retrieval on the table by using the Python pandas package, and returning the extracted content;
this specifically comprises the following steps:
S3211: extracting the EXCEL file information by using the pandas library of the Python language;
S3212: passing the file information as key-value pairs, then merging the information and returning the extracted content.
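One way to read S3211 and S3212 is sketched below: each sheet is loaded into a list of row dictionaries (the key-value pairs), rows containing the keyword are kept, and the hits from all sheets are merged into one list. The matching rule (case-insensitive containment in any cell) and the function names are assumptions. `pandas` is imported inside the loader, so the search step can be used on plain dictionaries without it.

```python
def load_excel_records(path):
    """S3211: read every sheet of the workbook into a list of row dictionaries."""
    import pandas as pd  # third-party; reading .xlsx also needs an engine such as openpyxl

    # sheet_name=None returns a dict that maps each sheet name to a DataFrame.
    sheets = pd.read_excel(path, sheet_name=None, dtype=str)
    return {
        name: frame.fillna("").to_dict(orient="records")
        for name, frame in sheets.items()
    }


def search_records(records_by_sheet, keyword):
    """S3212: keep rows in which any cell contains the keyword, merged across sheets."""
    needle = keyword.lower()
    merged = []
    for sheet, rows in records_by_sheet.items():
        for index, row in enumerate(rows):
            if any(needle in str(value).lower() for value in row.values()):
                merged.append({"sheet": sheet, "row": index, "cells": row})
    return merged
```

A typical call would be `search_records(load_excel_records("parts.xlsx"), "antenna")`, where the file name is only an example.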
As shown in FIG. 4, the PDF file content extraction method of step S33 specifically includes:
S331: when the current page is a picture, extracting content by using an OCR-based image text extraction algorithm;
S332: when the current page is a table, extracting content by using a nesting-based PDF table extraction algorithm;
S333: when the current page is neither a picture nor a table, extracting content by using a pdfplumber-based PDF text retrieval algorithm;
S334: matching and integrating the text, table and picture information, and returning all the extracted content.
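The description does not state how a page is recognised as a picture or a table. The sketch below assumes a simple heuristic built on pdfplumber's page interface: a page with embedded images and no extractable text is treated as a picture, a page on which `extract_tables()` finds something is treated as a table, and anything else is treated as text. The function names and the `handlers` mapping are hypothetical.

```python
def classify_page(page):
    """Assumed heuristic for S331-S333; `page` exposes pdfplumber's Page interface."""
    text = (page.extract_text() or "").strip()
    if not text and page.images:
        return "picture"  # scanned page: images but no text layer
    if page.extract_tables():
        return "table"
    return "text"


def route_pages(pages, handlers):
    """Apply the handler registered for each page kind and collect the results (S334)."""
    results = []
    for number, page in enumerate(pages, start=1):
        kind = classify_page(page)
        results.append({"page": number, "kind": kind, "content": handlers[kind](page)})
    return results


def extract_pdf(path, handlers):
    """Open the PDF with pdfplumber and route every page."""
    import pdfplumber  # third-party, imported lazily

    with pdfplumber.open(path) as pdf:
        return route_pages(pdf.pages, handlers)
```

Here `handlers` would map "picture", "table" and "text" to whatever implements S331, S332 and S333 respectively. A page that mixes a table with running text is classified as "table" under this heuristic, which a fuller implementation might want to refine.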
In step S331, when the current page is a picture, extracting content by using an OCR-based image text extraction algorithm specifically includes:
S3311: performing layout analysis on the image page by using PaddleOCR-based deep learning;
S3312: performing text analysis on the image page by using PaddleOCR-based deep learning;
S3313: integrating the layout information with the text information and returning the extracted content.
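The integration step S3313 can be sketched independently of any OCR engine. The example below assumes that layout analysis yields typed regions with bounding boxes and that text analysis yields recognised lines with bounding boxes. These dictionary shapes are hypothetical and are not the output format of PaddleOCR or any other library. Each line is assigned to the first region that contains its centre point.

```python
def center(box):
    """Centre point of an (x0, y0, x1, y1) bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def contains(box, point):
    """True when the point lies inside the box, borders included."""
    x0, y0, x1, y1 = box
    x, y = point
    return x0 <= x <= x1 and y0 <= y <= y1


def integrate(regions, lines):
    """Attach each recognised text line to the layout region that contains it.

    regions: [{"type": str, "bbox": (x0, y0, x1, y1)}, ...]
    lines:   [{"text": str, "bbox": (x0, y0, x1, y1)}, ...]
    Returns the regions with their lines in reading order, plus unassigned lines.
    """
    merged = [{"type": r["type"], "bbox": r["bbox"], "lines": []} for r in regions]
    unassigned = []
    # Sort top-to-bottom, then left-to-right, to approximate reading order.
    for line in sorted(lines, key=lambda item: (item["bbox"][1], item["bbox"][0])):
        point = center(line["bbox"])
        for region in merged:
            if contains(region["bbox"], point):
                region["lines"].append(line["text"])
                break
        else:
            unassigned.append(line["text"])
    return merged, unassigned
```

Using the centre point rather than full containment tolerates text boxes that slightly overrun the region border, which is common with detected boxes.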
In step S332, when the current page is a table, extracting content by using a nesting-based PDF table extraction algorithm specifically includes:
S3321: extracting the PDF table content by using the Python language and the pdfplumber tool;
S3322: passing the file information as key-value pairs, then merging the information and returning the extracted content.
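pdfplumber's `extract_tables()` returns each table as a list of rows, and each row as a list of cell strings in which empty cells may be `None`. As a sketch of S3322, assuming the first row of each table is its header, the rows can be turned into key-value records and merged across tables as follows. The function names and the optional keyword filter are assumptions for illustration.

```python
def clean(cell):
    """Collapse whitespace and line breaks; treat a missing cell as empty text."""
    return " ".join((cell or "").split())


def table_to_records(table):
    """Turn one table (list of rows) into dictionaries keyed by the header row."""
    if not table:
        return []
    header = [clean(cell) or f"column_{i}" for i, cell in enumerate(table[0])]
    records = []
    for row in table[1:]:
        cells = [clean(cell) for cell in row]
        if any(cells):  # skip rows that are entirely empty
            records.append(dict(zip(header, cells)))
    return records


def merge_tables(tables, keyword=None):
    """Merge the records of all tables, optionally keeping only keyword hits."""
    merged = []
    for index, table in enumerate(tables):
        for record in table_to_records(table):
            if keyword is None or any(keyword.lower() in value.lower() for value in record.values()):
                merged.append({"table": index, "cells": record})
    return merged
```

With a pdfplumber page object `page`, the input would be `page.extract_tables()`.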
The TXT file content extraction method of step S34 specifically includes:
S341: extracting the TXT text by using a fuzzy matching algorithm;
S342: extracting the TXT table by using a table extraction algorithm based on format parsing.
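The description gives no detail on the format parsing of S342. The sketch below assumes that a table in a TXT file appears as consecutive lines whose cells are separated by tabs or by `|` characters, optionally with rule lines made of `-`, `=`, `+` and `|`. That assumption, the function names and the `min_rows` parameter are all hypothetical.

```python
import re

DELIMITER = re.compile(r"\t|\|")   # assumed cell separators: tab or pipe
RULE = re.compile(r"[-=+|: ]+")    # border or rule lines such as |----|----|


def split_row(line):
    """Split one delimited line into stripped cell strings."""
    stripped = line.strip(" \r\n")
    if stripped.startswith("|"):
        stripped = stripped[1:]
    if stripped.endswith("|"):
        stripped = stripped[:-1]
    return [cell.strip() for cell in DELIMITER.split(stripped)]


def extract_txt_tables(text, min_rows=2):
    """Group consecutive delimited lines with equal cell counts into tables."""
    tables, current = [], []

    def flush():
        if len(current) >= min_rows:
            tables.append(list(current))
        current.clear()

    for line in text.splitlines():
        if RULE.fullmatch(line.strip()):
            continue  # skip rule lines without ending the table
        if DELIMITER.search(line):
            cells = split_row(line)
            if current and len(cells) != len(current[0]):
                flush()  # a change in column count starts a new table
            current.append(cells)
        else:
            flush()  # ordinary text or a blank line ends the table
    flush()
    return tables
```

Each returned table is a list of rows, so it can be passed on in the same row-oriented form as the tables extracted from the other document types.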
The invention adopts an integrated retrieval and extraction method for multi-modal content of multi-type documents: it first judges the type of the document, then uses different algorithms to extract the information of each modality in the document, and finally integrates the extracted results to obtain the content related to the search keyword across the whole document. The method avoids both the training cost of deep learning algorithms and the low accuracy of template matching algorithms, and solves the problems that current information extraction methods are mainly aimed at single-type files and single-modal content and that their recognition accuracy falls on larger unstructured documents.
The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could conceive without creative effort within the technical scope of the present invention shall fall within the scope of the present invention. Therefore, the protection scope of the present invention shall be the protection scope defined by the claims.

Claims (10)

1. A method for integrating and extracting multi-modal content of a multi-type document, characterized by comprising the following steps:
obtaining a search keyword and a target document to be searched;
judging the type of the target document;
and searching according to the keyword to obtain multi-modal retrieval information of the target document.
2. The method for integrating and extracting multi-modal content of a multi-type document according to claim 1, wherein the types of the target document comprise DOC/DOCX files, EXCEL files, PDF files and TXT files, and the multi-modal content comprises text, tables and pictures/block diagrams.
3. The method for integrating and extracting multi-modal content of a multi-type document according to claim 2, wherein the searching according to the keyword to obtain multi-modal retrieval information of the target document specifically includes:
a DOC/DOCX file content extraction method;
an EXCEL file content extraction method;
a PDF file content extraction method;
a TXT file content extraction method.
4. The method for integrating and extracting multi-modal content of a multi-type document according to claim 3, wherein the DOC/DOCX file content extraction method specifically comprises:
converting the target DOC/DOCX file into HTML format by using Aspose;
extracting text and tables by using an HTML-based keyword fuzzy matching algorithm;
extracting a picture according to whether its block diagram title hits the search keyword;
for an extracted picture in WMF, EMF or VISIO format, converting the picture into PNG format by using LibreOffice, and removing surplus blank margins by using the Python Pillow package;
converting the binary data of the picture into base64 and returning it;
and matching and integrating the text, table and picture information, and returning all the extracted content.
5. The method for integrating and extracting multi-modal content of a multi-type document according to claim 4, wherein the extracting of text and tables by using the HTML-based keyword fuzzy matching algorithm specifically comprises:
retrieving the HTML tag content;
performing fuzzy matching of keywords based on the Levenshtein distance, and calculating the matching degree;
and sorting the keyword matching results from high to low matching degree by using a quicksort algorithm, and returning them.
6. The method for integrating and extracting multi-modal content of a multi-type document according to claim 3, wherein the EXCEL file content extraction method specifically comprises:
performing content matching on the table by using the Python pandas package, and returning the extracted content;
specifically, extracting the EXCEL file information by using the pandas library of the Python language;
and passing the file information as key-value pairs, then merging the information and returning the extracted content.
7. The method for integrating and extracting multi-modal content of a multi-type document according to claim 3, wherein the PDF file content extraction method specifically comprises:
when the current page is a picture, extracting content by using an OCR-based image text extraction algorithm;
when the current page is a table, extracting content by using a nesting-based PDF table extraction algorithm;
when the current page is neither a picture nor a table, extracting content by using a pdfplumber-based PDF text retrieval algorithm;
and matching and integrating the text, table and picture information, and returning all the extracted content.
8. The method for integrating and extracting multi-modal content of a multi-type document according to claim 7, wherein when the current page is a picture, extracting content by using an OCR-based image text extraction algorithm specifically comprises:
performing layout analysis on the image page by using PaddleOCR-based deep learning;
performing text analysis on the image page by using PaddleOCR-based deep learning;
integrating the layout information with the text information and returning the extracted content.
9. The method for integrating and extracting multi-modal content of a multi-type document according to claim 7, wherein when the current page is a table, extracting content by using a nesting-based PDF table extraction algorithm specifically comprises:
extracting the PDF table content by using the Python language and the pdfplumber tool;
and passing the file information as key-value pairs, then merging the information and returning the extracted content.
10. The method for integrating and extracting multi-modal content of a multi-type document according to claim 3, wherein the TXT file content extraction method specifically comprises:
extracting the TXT text content by using a fuzzy matching algorithm;
extracting the TXT table content by using a table extraction algorithm based on format parsing.
CN202310885109.7A 2023-07-19 2023-07-19 Integration and extraction method for multi-modal content of multi-type document Pending CN116627912A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310885109.7A CN116627912A (en) 2023-07-19 2023-07-19 Integration and extraction method for multi-modal content of multi-type document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310885109.7A CN116627912A (en) 2023-07-19 2023-07-19 Integration and extraction method for multi-modal content of multi-type document

Publications (1)

Publication Number Publication Date
CN116627912A true CN116627912A (en) 2023-08-22

Family

ID=87621525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310885109.7A Pending CN116627912A (en) 2023-07-19 2023-07-19 Integration and extraction method for multi-modal content of multi-type document

Country Status (1)

Country Link
CN (1) CN116627912A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114564938A (en) * 2020-11-27 2022-05-31 阿里巴巴集团控股有限公司 Document parsing method and device, storage medium and processor
CN113742213A (en) * 2021-07-13 2021-12-03 北京关键科技股份有限公司 Method, system, and medium for data analysis
CN115455935A (en) * 2022-09-14 2022-12-09 华东师范大学 Intelligent text information processing system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CWM22: "Python操作PDF文件笔记:(二)提取表格内容", pages 1 - 4, Retrieved from the Internet <URL:《https://zhuanlan.zhihu.com/p/556886560》> *
WPS达师: "aspose.word获取word内容, 文档操作神器Aspose.Word:轻松提取Word内容", pages 1 - 4, Retrieved from the Internet <URL:《https://www.wpsds.com/wpszixue/48892.html》> *
微笑点燃希望: "Aspose.Words 将Word(DOC / DOCX)转换为HTML教程", pages 1 - 5, Retrieved from the Internet <URL:《https://blog.csdn.net/liuyaokai1990/article/details/110949827》> *
毋建军 等: "《Python语言程序设计及医学应用》", vol. 978, 中国铁道出版社有限公司, pages: 137 - 139 *
空空STAR: "Python利用pdfplumber库提取pdf中的文字", pages 1 - 5, Retrieved from the Internet <URL:《https://www.jb51.net/python/2855725zt.htm》> *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination