CN115455143A - Document processing method and device - Google Patents


Info

Publication number
CN115455143A
Authority
CN
China
Prior art keywords
text
layout
document
model
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211058612.7A
Other languages
Chinese (zh)
Inventor
王则远
刘鹏
周旻
任丽军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lingxi Quantum Beijing Medical Technology Co ltd
Original Assignee
Lingxi Quantum Beijing Medical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lingxi Quantum Beijing Medical Technology Co ltd
Priority to CN202211058612.7A
Publication of CN115455143A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a document processing method and device. A target image document is input into a layout text extraction model, and at least one layout text of the target image document output by the layout text extraction model is obtained; each layout text is then input into a text processing model, and the extracted text whose information category is the target information category, as extracted by the text processing model from each layout text, is obtained. The layout text extraction model can extract every layout text from the target image document, and the text processing model can extract text of the specified information category from each layout text. No manual extraction is needed, which avoids excessive consumption of resources such as manpower and time and improves the efficiency and accuracy of information extraction. Because information is extracted from every layout text, the comprehensiveness of information extraction is also improved.

Description

Document processing method and device
Technical Field
The invention relates to the technical field of document processing, in particular to a document processing method and device.
Background
With the development of science and technology, the role of electronic literature is increasingly important.
In scenes such as scientific research and enterprise production, there is a wide demand for extracting specified category information from electronic documents, for example, the demand for extracting PICO information in medical documents in evidence-based medical research.
However, the prior art cannot effectively extract the specified category information in the electronic document.
Disclosure of Invention
The document processing method and device provided by the invention address the defect of the prior art that specified-category information cannot be effectively extracted from electronic documents, and make effective extraction of such information possible.
In a first aspect, the present invention provides a document processing method comprising:
inputting a target image document into a layout text extraction model, and obtaining at least one layout text of the target image document output by the layout text extraction model;
and inputting each layout text into a text processing model to obtain an extracted text of which the information category extracted from each layout text by the text processing model is a target information category.
Optionally, the layout text extraction model is obtained by jointly training a pre-trained semantic understanding model and an image document layout recognition model.
Optionally, the layout text extraction model includes a first processing layer, a second processing layer, and a third processing layer; wherein: the structure of the first processing layer corresponds to the pre-trained semantic understanding model, and the structure of the second processing layer corresponds to the image document layout recognition model; the third processing layer is configured to output each of the layout texts based on the output data of the first processing layer and the output data of the second processing layer.
Optionally, the input of the first processing layer includes: image document text and text position information, the image document text being text in the target image document, the image document text and the text position information being obtained by the layout text extraction model using Optical Character Recognition (OCR) techniques;
the output of the first processing layer comprises: a text embedding vector representing the semantic understanding of the text, and a position embedding vector representing the mapping relation between text paragraphs and the image.
Optionally, the input of the second processing layer includes: the target image document, image document text and text position information; the output of the second processing layer comprises: a character-level 2D position embedding vector and an image embedding vector embodying image feature information.
Optionally, the training data of the layout text extraction model includes: image documents, image document texts, text position information, and text classification labels, wherein each text classification label is the category of the document layout part to which the text belongs.
Optionally, the text processing model is obtained by fine-tuning a pre-trained natural language processing model, using as training data document layout texts and the corresponding labeled texts whose information category is the target information category.
Optionally, after the obtaining at least one layout text of the target image document output by the layout text extraction model, the document processing method further includes:
performing integration and de-duplication processing on each layout text to obtain at least one corresponding processed text;
the inputting each layout text into a text processing model comprises:
inputting each processed text into the text processing model;
the obtaining of the extracted text in which the information category extracted by the text processing model from each layout text is a target information category includes:
and obtaining an extracted text of which the information category extracted from each processed text by the text processing model is the target information category.
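The integration and de-duplication step is not specified further in the text. The following is a minimal sketch under the assumption that "integration" means normalizing whitespace and that duplicates are layout texts that are identical after case-insensitive normalization. The function name `integrate_and_deduplicate` is hypothetical.

```python
import re
from typing import Iterable, List


def integrate_and_deduplicate(layout_texts: Iterable[str]) -> List[str]:
    """Return cleaned layout texts with duplicates dropped, keeping first-seen order."""
    seen = set()
    processed: List[str] = []
    for text in layout_texts:
        # Assumption: integration collapses runs of whitespace left over from OCR.
        normalized = re.sub(r"\s+", " ", text).strip()
        key = normalized.casefold()
        if normalized and key not in seen:
            seen.add(key)
            processed.append(normalized)
    return processed


processed_texts = integrate_and_deduplicate(
    ["Abstract  text", "abstract text", "   ", "Methods"]
)
```

Each entry of `processed_texts` would then be passed to the text processing model in place of the raw layout text.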
Optionally, the document processing method further includes:
obtaining target text content of a target document; the target document is a document corresponding to the target image document, and the target text content comprises text of at least one layout part in the target document;
respectively determining the similarity between each extracted text and the target text content;
and sorting the extracted texts according to the similarity between the extracted texts and the target text content, and outputting the sorting result.
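The text does not name a similarity measure. As an illustration only, the sketch below assumes Jaccard similarity over lowercase word sets. The function names `jaccard_similarity` and `rank_by_similarity` are hypothetical, and any other text similarity could be substituted.

```python
import re
from typing import List, Tuple


def jaccard_similarity(a: str, b: str) -> float:
    """Share of distinct words common to both texts (assumed measure)."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def rank_by_similarity(
    extracted_texts: List[str], target_text_content: str
) -> List[Tuple[str, float]]:
    """Score each extracted text against the target content, most similar first."""
    scored = [(t, jaccard_similarity(t, target_text_content)) for t in extracted_texts]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_similarity(
    ["patients were enrolled", "aspirin reduces risk"],
    "aspirin reduces stroke risk",
)
```

Here the second extracted text shares three of four distinct words with the target content, so it is ranked first.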
In a second aspect, the present invention also provides a document processing apparatus comprising: a first input unit, a first obtaining unit, a second input unit and a second obtaining unit; wherein:
the first input unit is used for inputting the target image document to the layout text extraction model;
the first obtaining unit is used for obtaining at least one layout text of the target image document output by the layout text extraction model;
the second input unit is used for inputting each layout text into a text processing model;
the second obtaining unit is configured to obtain the extracted text whose information category, as extracted by the text processing model from each layout text, is the target information category.
Optionally, the layout text extraction model is obtained by jointly training a pre-trained semantic understanding model and an image document layout recognition model.
Optionally, the layout text extraction model includes a first processing layer, a second processing layer, and a third processing layer; wherein: the structure of the first processing layer corresponds to the pre-trained semantic understanding model, and the structure of the second processing layer corresponds to the image document layout recognition model; the third processing layer is configured to output each of the layout texts based on the output data of the first processing layer and the output data of the second processing layer.
Optionally, the input of the first processing layer includes: image document text and text position information, the image document text being text in the target image document, the image document text and the text position information being obtained by the layout text extraction model using an Optical Character Recognition (OCR) technique;
the output of the first processing layer comprises: a text embedding vector representing the semantic understanding of the text, and a position embedding vector representing the mapping relation between text paragraphs and the image.
Optionally, the input of the second processing layer includes: the target image document, image document text and text position information; the output of the second processing layer comprises: a character-level 2D position embedding vector and an image embedding vector embodying image feature information.
Optionally, the training data of the layout text extraction model includes: image documents, image document texts, text position information, and text classification labels, wherein each text classification label is the category of the document layout part to which the text belongs.
Optionally, the text processing model is obtained by fine-tuning a pre-trained natural language processing model, using as training data document layout texts and the corresponding labeled texts whose information category is the target information category.
Optionally, the document processing apparatus further includes: a processing unit and a third obtaining unit;
the processing unit is configured to perform integration and deduplication processing on each layout text after obtaining at least one layout text of the target image document output by the layout text extraction model;
the third obtaining unit is configured to obtain at least one corresponding processed text;
the second input unit is used for inputting each processed text into the text processing model;
the second obtaining unit is configured to obtain the extracted text whose information category, as extracted by the text processing model from each processed text, is the target information category.
Optionally, the document processing apparatus further includes: a fourth obtaining unit, a determining unit, a sorting unit and an output unit; wherein:
the fourth obtaining unit is used for obtaining the target text content of the target document; the target document is a document corresponding to the target image document, and the target text content comprises text of at least one layout part in the target document;
the determining unit is used for respectively determining the similarity between each extracted text and the target text content;
the sorting unit is used for sorting the extracted texts according to the similarity between the extracted texts and the target text content;
and the output unit is used for outputting the sorting result.
In a third aspect, the present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the document processing method when executing the program.
In a fourth aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described document processing method.
In a fifth aspect, the present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the document processing method described above.
The document processing method and device provided by the invention input a target image document into a layout text extraction model and obtain at least one layout text of the target image document output by the layout text extraction model; each layout text is then input into a text processing model to obtain the extracted text whose information category is the target information category, as extracted by the text processing model from each layout text. The layout text extraction model can extract every layout text from the target image document, and the text processing model can extract text of the specified information category from each layout text. No manual extraction is needed, which avoids excessive consumption of resources such as manpower and time and improves the efficiency and accuracy of information extraction. Because information is extracted from every layout text, the comprehensiveness of information extraction is also improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a first schematic flowchart of a document processing method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a layout text extraction model training provided by an embodiment of the present invention;
FIG. 3 is a second schematic diagram of a text processing model training process according to an embodiment of the present invention;
FIG. 4 is a schematic flowchart of a PICO information extraction scenario of a medical document according to an embodiment of the present invention;
FIG. 5 is a second schematic flow chart of a document processing method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a document processing apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
The document processing method of the present invention is described below with reference to fig. 1 to 5.
As shown in fig. 1, an embodiment of the present invention proposes a first document processing method, which may include the steps of:
S101, inputting a target image document into a layout text extraction model;
The target image document may be an electronic document in an image format that requires information extraction.
Specifically, if the file format of the target document is a non-image format, such as a PDF format, the present invention may obtain the target image document by converting each page of the target document into an image and then sequentially stitching the converted images.
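The stitching step can be illustrated without committing to any particular imaging library. The sketch below assumes each converted page is already available as a grayscale image stored as rows of pixel values, and stacks the pages top to bottom. The `Image` alias, the `stitch_pages` name, and the white padding value are assumptions made for illustration.

```python
from typing import List

Image = List[List[int]]  # hypothetical grayscale image: a list of pixel rows


def stitch_pages(pages: List[Image], background: int = 255) -> Image:
    """Stack page images in order, padding narrower pages on the right."""
    width = max((len(row) for page in pages for row in page), default=0)
    stitched: Image = []
    for page in pages:
        for row in page:
            # Pad with the background value so every row has the same width.
            stitched.append(row + [background] * (width - len(row)))
    return stitched


page_one = [[0, 0], [0, 0]]
page_two = [[1, 1, 1]]
target_image_document = stitch_pages([page_one, page_two])
```

In practice the page-to-image conversion itself would be done by a PDF rendering tool, which is outside the scope of this sketch.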
The present invention is not limited to the specific field type of the target image document. For example, the target image document may be a document in the medical field, i.e., a medical document; for another example, the target image document may also be a document in the field of vehicle technology; as another example, the target image document may be a document in the field of communication technology.
It should be noted that a document may be composed of layout content of several categories, such as title, author, organization, abstract, body text, table, picture, and references.
The layout text may be the text of a certain piece of layout content in the target image document, such as the abstract text.
The layout text extraction model may be a model for extracting from a document the text of at least one layout part, that is, at least one layout text.
It should be noted that the present invention does not limit the specific model type of the layout text extraction model; it may be, for example, a deep learning model or a convolutional neural network model.
Specifically, the method can take a certain natural language processing model with an OCR function as a basic model, take at least one layout content text of an image document and a marked layout category thereof as training data, train the basic model by using the training data, and take the trained model as a layout text extraction model. After obtaining the image document, the basic model may perform Optical Character Recognition (OCR) processing on the image document to obtain an image document text and corresponding text position information, and then perform machine learning by combining the image document text, the text position information, and each marked layout content text to train the ability of extracting the layout text from the document. The image document text is the text in the image document, and the text position information is the coordinate information of the text in the image.
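One way to picture the data described above is as OCR spans that each carry text, image coordinates, and a marked layout category. The sketch below is an illustration only: the names `LayoutCategory`, `OcrSpan` and `group_by_layout` are hypothetical, the category list follows the examples given earlier in this description, and reading order is assumed to be top to bottom and then left to right.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class LayoutCategory(Enum):
    TITLE = "title"
    AUTHOR = "author"
    ORGANIZATION = "organization"
    ABSTRACT = "abstract"
    BODY = "body"
    TABLE = "table"
    PICTURE = "picture"
    REFERENCE = "reference"


@dataclass(frozen=True)
class OcrSpan:
    text: str                        # image document text
    box: Tuple[int, int, int, int]   # text position: (x0, y0, x1, y1) in pixels
    label: LayoutCategory            # marked layout category


def group_by_layout(spans: List[OcrSpan]) -> Dict[LayoutCategory, str]:
    """Join span texts per layout category in assumed reading order."""
    grouped: Dict[LayoutCategory, List[str]] = {}
    for span in sorted(spans, key=lambda s: (s.box[1], s.box[0])):
        grouped.setdefault(span.label, []).append(span.text)
    return {label: " ".join(texts) for label, texts in grouped.items()}


spans = [
    OcrSpan("world", (50, 10, 90, 20), LayoutCategory.ABSTRACT),
    OcrSpan("Hello", (0, 10, 40, 20), LayoutCategory.ABSTRACT),
    OcrSpan("A Title", (0, 0, 100, 8), LayoutCategory.TITLE),
]
layout_texts = group_by_layout(spans)
```

The grouped strings correspond to the marked layout content texts that the basic model would learn to reproduce.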
According to the method and the device, the pre-training natural language processing model in the corresponding field can be selected as the basic model according to the requirements of the actual application scene, so that the layout text extraction model capable of extracting the layout text in the actual application scene is obtained, and the training efficiency and the layout text extraction efficiency are improved. For example, for an actual application scenario of extracting a layout text in a biomedical document, the invention may select a pre-trained natural language processing model in the biomedical field as the basic model, so as to obtain a layout text extraction model capable of extracting a layout text in a medical document.
S102, obtaining at least one layout text in a target image document output by a layout text extraction model;
Specifically, the present invention may obtain one or more layout texts in the target image document output by the layout text extraction model after the target image document is input to the layout text extraction model.
It should be noted that the layout text extraction model may extract each layout text in the document from the target image document.
S103, inputting each layout text into a text processing model;
The text processing model may be a model for extracting a text whose information category is a target information category from the layout text.
Specifically, the target information category may be set by a technician according to actual conditions, such as PICO information in medical literature.
It should be noted that the present invention does not limit the specific model type of the text processing model; it may be, for example, a deep learning model or a convolutional neural network model.
Specifically, the method can take a certain natural language processing model as a basic model, take a document layout text and a corresponding text marked with a target information category as training data, train the basic model by using the training data, and take the trained model as a text processing model, so that the text processing model can extract the text of the target information category from the layout text.
Specifically, according to the requirements of the actual application scene, the pre-training natural language processing model in the corresponding field can be selected as the basic model, so that the text processing model capable of extracting the text of the target information category in the actual application scene is obtained, and the training efficiency and the text extraction efficiency are improved. For example, for an actual application scenario of extracting the PICO information in the layout text of the biomedical document, the invention can select a pre-trained natural language processing model in the biomedical field as a basic model to obtain a text processing model capable of extracting the text with the information category of the PICO information from the layout text of the medical document.
Specifically, the present invention may input each layout text into the text processing model after extracting each layout text from the target image document.
S104, obtaining the extracted text whose information category, as extracted by the text processing model from each layout text, is the target information category.
The extracted text is the text, extracted from a layout text, whose information category is the target information category.
Specifically, after the layout texts are input into the text processing model, the extracted texts of the target information category that the text processing model outputs for the layout texts can be obtained. For each layout text, the text processing model can output an extracted text whose information category is the target information category.
Specifically, the present invention may input one layout text into the text processing model, and after obtaining the extracted text output by the text processing model based on the layout text, input another layout text into the text processing model until obtaining the extracted texts output by the text processing model based on each layout text.
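The flow of steps S101 to S104 can be sketched as a simple two-stage pipeline. In the sketch below the two trained models are passed in as plain callables. The function `process_document` and the two stand-in models are hypothetical and exist only to show the order of operations.

```python
from typing import Callable, List


def process_document(
    image_document: object,
    extract_layout_texts: Callable[[object], List[str]],
    extract_target_text: Callable[[str], str],
) -> List[str]:
    """Run both stages in order and collect one result per layout text."""
    layout_texts = extract_layout_texts(image_document)  # S101 and S102
    extracted_texts: List[str] = []
    for layout_text in layout_texts:  # S103 and S104, one layout text at a time
        extracted_texts.append(extract_target_text(layout_text))
    return extracted_texts


# Hypothetical stand-ins for the two trained models, for illustration only.
def fake_layout_model(image_document: object) -> List[str]:
    return ["Abstract: 40 adults received aspirin.", "Methods: a randomized trial."]


def fake_text_model(layout_text: str) -> str:
    # Pretend the target category is whatever follows the layout heading.
    return layout_text.split(": ", 1)[1]


results = process_document("page-image-placeholder", fake_layout_model, fake_text_model)
```

Passing the models as callables keeps the sketch independent of any specific model framework.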
Optionally, in another document processing method provided in the embodiment of the present invention, the text processing model is obtained by fine-tuning a pre-trained natural language processing model, using as training data document layout texts and the corresponding labeled texts whose information category is the target information category.
The pre-trained natural language processing model can be a pre-trained natural language processing model in a corresponding field. At this time, the invention can obtain the text processing model for extracting the text with the information category as the target information category from the layout text by using the training data to train and fine-tune the pre-trained natural language processing model in the corresponding field.
Specifically, the invention may obtain in advance a plurality of documents in the corresponding field, extract document layout texts from them, obtain for each document layout text the labeled text of the target information category, and use the document layout texts and labeled texts as training samples to fine-tune the pre-trained natural language processing model, thereby obtaining the text processing model. For example, to meet the need to extract PICO information from biomedical documents, the invention may obtain a plurality of documents from the open PubMed biomedical literature database, extract the layout texts in those documents, and label the text whose information category is PICO information. One layout text together with its labeled text serves as one training sample, and a plurality of training samples are obtained in this way. PubMedBERT, a pre-trained model in the biomedical field, is selected as the pre-trained natural language processing model and is fine-tuned with the training samples to obtain the corresponding fine-tuned PubMedBERT model, namely the text processing model.
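A training sample as described here pairs one layout text with its labeled text. The text does not say how such a pair is encoded for fine-tuning, so the sketch below assumes a common choice, token-level BIO tags, purely for illustration. The names `TrainingSample` and `to_bio_tags` and the tag label `PICO` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TrainingSample:
    layout_text: str   # one layout text taken from a document
    labeled_text: str  # the part marked as the target information category


def to_bio_tags(sample: TrainingSample, label: str = "PICO") -> List[Tuple[str, str]]:
    """Tag the first occurrence of the labeled text with B-/I- tags, all else O."""
    tokens = sample.layout_text.split()
    target = sample.labeled_text.split()
    tags = ["O"] * len(tokens)
    n = len(target)
    for i in range(len(tokens) - n + 1):
        if n and tokens[i:i + n] == target:
            tags[i] = f"B-{label}"
            for j in range(i + 1, i + n):
                tags[j] = f"I-{label}"
            break
    return list(zip(tokens, tags))


sample = TrainingSample(
    layout_text="We enrolled 40 adult patients with asthma",
    labeled_text="40 adult patients",
)
tagged = to_bio_tags(sample)
```

Whitespace tokenization is used only to keep the example short. A real fine-tuning setup would use the tokenizer of the chosen pre-trained model.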
It should be noted that the prior art cannot effectively extract specified-category information from documents, and suffers from low extraction efficiency and incomplete extraction. For example, in evidence-based medical research, one prior-art approach is for a professional evidence-based researcher to look up related documents on the basis of medical experience and manually extract the PICO information from them. Manual lookup and extraction is highly accurate and fits the evidence-based question under study well, but when the number of documents is large it consumes considerable manpower and time, extraction efficiency is low, and the extracted PICO information may be incomplete. Another prior-art approach is to have a machine extract information such as the titles and abstracts of the screened documents. Machine extraction saves time and labor, but the required PICO information may not appear in the title or abstract, so the extracted PICO information may again be incomplete.
Specifically, the inventors combine information extraction with recent natural language processing technology and design the technical solution of the invention to extract text of specified information categories from documents and provide key guidance information for relevant workers. For example, text carrying PICO information, including research subjects, interventions, treatment outcomes and the like, can be extracted from medical documents. This helps researchers improve research efficiency and assists in producing high-quality systematic reviews, so that medical workers can make the best medical decisions on the basis of the best available scientific evidence.
Specifically, according to the method shown in fig. 1, all layout texts of the target image document can be extracted from the target image document by using the layout text extraction model, and the extracted texts of the specified information types can be extracted from each layout text by using the text processing model.
The document processing method provided by the invention inputs a target image document into a layout text extraction model and obtains at least one layout text of the target image document output by the layout text extraction model; each layout text is then input into a text processing model to obtain the extracted text whose information category is the target information category, as extracted by the text processing model from each layout text. The layout text extraction model can extract every layout text from the target image document, and the text processing model can extract text of the specified information category from each layout text. No manual extraction is needed, which avoids excessive consumption of resources such as manpower and time and improves the efficiency and accuracy of information extraction. Because information is extracted from every layout text, the comprehensiveness of information extraction is also improved.
Based on fig. 1, the present embodiment proposes a second document processing method. In the method, the layout text extraction model is obtained by jointly training a pre-trained semantic understanding model and an image document layout recognition model.
The pre-trained semantic understanding model may be a semantic understanding model that has already been pre-trained, and can be used to perform semantic understanding on image documents.
The image document layout recognition model may be a model for recognizing the layout of an image document. Specifically, the image document layout recognition model may capture information such as visual features and the relative positions of text in the image document, and perform layout recognition based on the text, the visual feature information, and the relative text positions of the image document.
The layout text extraction model may be a model obtained by jointly training the pre-trained semantic understanding model and the image document layout recognition model, and is used for extracting layout texts from the target image document. The layout text extraction model may have the functions of both the pre-trained semantic understanding model and the image document layout recognition model: it may perform layout recognition on the image document, perform semantic understanding of the content in the image document, segment the image document into layout texts, and extract each layout text of the image document.
It can be understood that the invention may select the pre-trained semantic understanding model and the image document layout recognition model according to the field of the documents from which information is to be extracted, and jointly train them using data from documents in that field, so that the jointly trained layout text extraction model is able to extract layout texts from image documents in that field and the accuracy of layout text extraction is ensured. For example, when the method is applied to extracting PICO information from medical documents, the pre-trained semantic understanding model may be BioBERT, a semantic understanding model pre-trained for the medical field, and the image document layout recognition model may be a LayoutLMv2 model for the medical field. The invention may jointly train the BioBERT model and the LayoutLMv2 model using data from medical documents, so that the jointly trained layout text extraction model is able to extract layout texts from medical documents in image format and the accuracy of layout text extraction is ensured.
Optionally, the layout text extraction model includes a first processing layer, a second processing layer, and a third processing layer; wherein: the structure of the first processing layer corresponds to the pre-training semantic understanding model, and the structure of the second processing layer corresponds to the image document layout recognition model; the third processing layer is used for outputting each layout text based on the output data of the first processing layer and the output data of the second processing layer.
Specifically, the structure of the first processing layer may be a model structure of a pre-trained semantic understanding model, and the structure of the second processing layer may be a model structure of an image document layout recognition model. At this time, the model structure of the layout text extraction model may include a pre-training semantic understanding model and an image document layout recognition model.
Specifically, the layout text extraction model can perform semantic understanding of the image document through the first processing layer, and capture information such as the visual features of the image document and the relative positions of its text through the second processing layer, so as to handle information at the visual level.
Specifically, the layout text extraction model may have an OCR processing function, and after obtaining the target image document, OCR processing may be performed on the target image document in advance to obtain processed data, and then each layout text may be extracted from the processed data.
Optionally, the input of the first processing layer includes: the image document text and the text position information are obtained by the layout text extraction model by using an Optical Character Recognition (OCR) technology;
the output of the first processing layer includes: a text embedding vector for representing the semantic understanding of the text, and a position embedding vector for representing the mapping relation between text paragraphs and the image.
Specifically, after the layout text extraction model obtains the target image document, the OCR technology may be utilized to process the target image document to obtain an image document text and corresponding text position information, and the image document text and the corresponding text position information are input to the first processing layer; then, the first processing layer can output a text vector for expressing semantic understanding of the text and a position embedding vector for representing the mapping relation between the text paragraph and the image based on the image document text and the corresponding text position information.
Optionally, the input of the second processing layer includes: target image documents, image document text and text position information; the output of the second processing layer comprises: a 2D position embedding vector at the character level and an image embedding vector for embodying image feature information.
The 2D position embedding vector can be used to represent relative position markers in the document and to capture the relationships between symbols in the document; the image embedding vector may be used to capture appearance features such as the orientation, type, and color of the text.
Specifically, after processing the target image document with OCR technology to obtain the image document text and the corresponding text position information, the layout text extraction model may input the target image document, the image document text, and the text position information to the second processing layer; the second processing layer may then output, based on these inputs, the character-level 2D position embedding vector and the image embedding vector for embodying the image feature information.
Specifically, the third processing layer may obtain a text vector for representing semantic understanding of the text and a position embedding vector for representing a mapping relationship between a text paragraph and an image, which are output by the first processing layer, obtain a character-level 2D position embedding vector and an image embedding vector for representing image feature information, which are output by the second processing layer, and output each layout text of the target image document based on the obtained text vector, position embedding vector, 2D position embedding vector, and image embedding vector.
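The way the third processing layer combines the outputs of the first two layers can be illustrated with a minimal sketch. The source does not specify the fusion method, so the concatenation followed by a linear scoring step shown here, as well as the names `LAYOUT_LABELS`, `fuse`, and `third_layer` and the toy weights, are assumptions made purely for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical subset of layout categories, for illustration only.
LAYOUT_LABELS = ["title", "abstract", "body"]


def fuse(first_out: Dict[str, Sequence[float]],
         second_out: Dict[str, Sequence[float]]) -> List[float]:
    """Concatenate the four embedding vectors produced for one text block."""
    return [
        *first_out["text_embedding"],          # semantic understanding of the text
        *first_out["position_embedding"],      # paragraph-to-image mapping
        *second_out["position_2d_embedding"],  # character-level 2D position
        *second_out["image_embedding"],        # image feature information
    ]


def third_layer(fused: Sequence[float],
                weights: Sequence[Sequence[float]]) -> str:
    """Score every layout label with a linear map and return the best one."""
    scores = [sum(w * x for w, x in zip(row, fused)) for row in weights]
    return LAYOUT_LABELS[scores.index(max(scores))]


first = {"text_embedding": [1.0, 0.0], "position_embedding": [0.5]}
second = {"position_2d_embedding": [0.2], "image_embedding": [0.0, 1.0]}
toy_weights = [
    [1, 0, 0, 0, 0, 0],  # title
    [0, 0, 0, 0, 0, 2],  # abstract
    [0, 0, 0, 0, 0, 0],  # body
]
predicted = third_layer(fuse(first, second), toy_weights)
```

In a trained model the weights would be learned during joint training rather than written by hand; the sketch only shows how four vectors from two layers can feed a single output layer.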
Optionally, the training data of the layout text extraction model includes: image documents, image document texts, text position information, and text category labels, where a text category label is the category of the document layout part to which a text belongs.
It should be noted that the text position information is coordinate information where the text is located in an image coordinate system of the image; the text category label can be a category of a document layout portion, such as an abstract and a title, to which the text belongs in the document.
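The structure of one training sample described above could be represented as follows. This is a sketch only: the field names and the `normalize_box` helper are hypothetical, and rescaling pixel coordinates to a 0 to 1000 range is an assumption borrowed from the usual preprocessing convention of LayoutLM-family models, not something the source states.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in the image coordinate system


@dataclass
class TrainingSample:
    image_path: str     # page image of the document
    texts: List[str]    # OCR text segments
    boxes: List[Box]    # text position information, one box per segment
    labels: List[str]   # text category label per segment, e.g. "title"


def normalize_box(box: Box, width: int, height: int, scale: int = 1000) -> Box:
    """Rescale pixel coordinates to a 0..scale range independent of page size."""
    x0, y0, x1, y1 = box
    return (
        scale * x0 // width,
        scale * y0 // height,
        scale * x1 // width,
        scale * y1 // height,
    )


sample = TrainingSample(
    image_path="page_1.png",  # hypothetical file name
    texts=["A Randomized Trial", "Background: ..."],
    boxes=[(100, 200, 300, 400), (100, 500, 900, 800)],
    labels=["title", "abstract"],
)
normalized = [normalize_box(b, width=1000, height=2000) for b in sample.boxes]
```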
Specifically, the invention can first obtain a plurality of PDF documents in the corresponding field, convert each page of each PDF document into an image by using a text parsing tool, obtain the image document text and its text position information by using OCR technology, then divide the extracted text by paragraph into layout texts of different categories such as title, author, institution, abstract, body text, table, figure, and references, and manually label the category of each layout text.
For example, fig. 2 is a training diagram of the layout text extraction model. To meet the requirement of extracting layout texts from medical documents, the present invention may first obtain a plurality of biomedical PDF documents from the 500,000 biomedical documents openly available in the PubMed document database, and parse the biomedical PDF documents with a text parsing tool to obtain a plurality of corresponding image documents. OCR technology is then used to obtain the image document text of each image document and the position coordinate information of the text, i.e., the coordinates of the text in the image; each layout text in the image documents is obtained, and the category of each layout text is labeled manually. The image document, image document text, text position information, and layout text category labels corresponding to one document are used as one training sample, so that a plurality of training samples are obtained. The BioBERT model in the biomedical field is selected as the pre-training semantic understanding model and the LayoutLMv2 model in the biomedical field is selected as the image document layout recognition model, and the plurality of training samples are used to jointly train the BioBERT model and the LayoutLMv2 model to obtain a layout text extraction model that includes the BioBERT model and the LayoutLMv2 model.
Optionally, as shown in the text processing model training diagram of fig. 3, in the scenario of extracting PICO information from biomedical documents, the present invention may use, as training samples, the layout texts extracted by the LXBioLayoutLM model from the image documents corresponding to PDF documents, together with the corresponding layout labels (i.e., the text category labels) and PICO labels (i.e., labels marking text whose information category is PICO information); select the PubMedBERT model in the biomedical field as the base model; and train and fine-tune the PubMedBERT model with the training samples to obtain the LayoutPubMedBERT model, i.e., the text processing model.
It should be noted that, when extracting information of a specified category, the prior art may fail to consider the sequential order of the text, tends to ignore the content of the sentences themselves, may exclude key sentences that occur rarely in the text but carry its main content, and/or cannot capture contextual information in the true sense. For example, when extracting PICO information from evidence-based medical documents, the prior art often performs named entity recognition on texts such as the titles and abstracts displayed on medical journal websites: the relevant text is crawled with tools such as web crawlers, a corresponding data set is labeled, and an entity recognition model is then trained. This approach may suffer from incomplete text data and incomplete coverage of PICO information. In addition, traditional statistical methods such as word frequency, and bag-of-words models, may ignore sequential order and sentence content and exclude key sentences that occur rarely but carry the main content; sequence models based on RNN, LSTM and the like can address sequential order and capture context to a certain extent, but cannot capture contextual information in the true sense. Therefore, the accuracy of existing methods for extracting key information from medical literature may be low.
Specifically, with the development of natural language processing, a new research direction has emerged: capturing contextual information in the true sense through a self-attention mechanism, which represents text information better and has set new accuracy records on a variety of tasks. Accordingly, the inventors combine information extraction with the latest natural language processing technology. That is, a pre-training semantic understanding model and an image layout recognition model capable of capturing context information are selected and jointly trained to obtain a layout text extraction model capable of capturing context information, and a text processing model capable of capturing context information can likewise be obtained. The technical solution of the present invention is thereby designed to extract text of a specified information category from a document and to provide key guidance information for relevant workers. For example, text containing PICO information, including research objects, intervention measures, curative effects and the like, can be extracted from medical documents, helping researchers improve research efficiency and assisting the production of high-quality systematic reviews, so that medical workers can make optimal medical decisions on the basis of the best existing scientific evidence. Compared with the prior-art approach of extracting PICO information only from titles and abstracts, the PICO information extraction of the present invention is more comprehensive and better suited to production requirements, and can effectively ensure the accuracy of information extraction.
The document processing method provided by the invention can obtain the layout text extraction model by performing combined training on the pre-training semantic understanding model and the image document layout recognition model, extract the layout text by using the layout text extraction model, and use the extracted layout text for the subsequent text extraction of the target information category, thereby further ensuring the information extraction efficiency and accuracy.
Based on fig. 1, the embodiment of the present invention proposes a third document processing method, which may further include step S201 after step S102; in this case, step S103 may be specifically step S202, and step S104 may be step S203; wherein:
S201, performing integration and de-duplication processing on each layout text to obtain at least one corresponding processed text;
In practical applications, the output layout texts may be irregular; for example, one paragraph of a document may be recognized as two separate texts, or the same text may be extracted repeatedly. Therefore, after obtaining each layout text extracted from the target image document by the layout text extraction model, the invention first normalizes each layout text and then inputs it into the text processing model.
Specifically, the present invention may be provided with a content formatting correction module that normalizes each layout text. During normalization, a layout parser tool using the Mask R-CNN algorithm may be used to integrate and correct the layout texts output by the layout text extraction model. The tool can divide the layout texts into modular, paragraph-level units and merge texts of the same layout part that were split into two paragraphs, which enhances the integrity of the parsed text and thereby improves the efficiency and accuracy with which the text processing model subsequently extracts text of the specified information category.
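The effect of the integration and de-duplication in step S201 can be sketched with a simple rule-based stand-in. This is not the Mask R-CNN based correction described above: the function name `integrate_and_deduplicate` and the heuristic of merging consecutive fragments of the same layout category when the earlier fragment lacks sentence-ending punctuation are assumptions chosen only to make the idea concrete.

```python
from typing import List, Set, Tuple

LayoutText = Tuple[str, str]  # (layout category, text)


def integrate_and_deduplicate(items: List[LayoutText]) -> List[LayoutText]:
    """Drop repeated layout texts and merge fragments of a split paragraph."""
    seen: Set[LayoutText] = set()
    result: List[LayoutText] = []
    for category, text in items:
        text = " ".join(text.split())  # collapse whitespace left over from OCR
        if not text or (category, text) in seen:
            continue  # skip empty and repeatedly extracted texts
        seen.add((category, text))
        previous_is_open = (
            result
            and result[-1][0] == category
            and not result[-1][1].endswith((".", "!", "?"))
        )
        if previous_is_open:
            # Same layout part, earlier fragment unfinished: treat as one paragraph.
            result[-1] = (category, result[-1][1] + " " + text)
        else:
            result.append((category, text))
    return result


raw = [
    ("abstract", "We study the effect of"),
    ("abstract", "drug A on blood pressure."),
    ("title", "A Trial"),
    ("title", "A Trial"),
]
processed = integrate_and_deduplicate(raw)
```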
S202, inputting each processed text into a text processing model;
Specifically, after obtaining each processed text, the invention can input the processed texts, instead of the layout texts, into the text processing model.
And S203, obtaining extracted texts of which the information types extracted from the processed texts by the text processing model are target information types.
Specifically, the extracted texts extracted by the text processing model aiming at the processed texts can be obtained after the processed texts are input into the text processing model.
Specifically, when the present invention is applied to the PICO information extraction scenario of medical documents, as shown in the overall flow diagram of fig. 4, the present invention may first obtain a PDF medical document and convert it into a corresponding image document, then extract each layout text from the image document by using the LXBioLayoutLM model, normalize each layout text by using the content formatting correction module to obtain the corresponding processed texts, input each processed text into the LayoutPubMedBERT model, and obtain the extracted texts that the LayoutPubMedBERT model extracts from the processed texts, which constitute the PICO information extraction result.
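The overall flow above can be summarized as a chain of three stages. In the following sketch the function `process_document` and its three callable parameters are hypothetical placeholders for the layout text extraction model, the content formatting correction module, and the text processing model; the stand-in lambdas exist only so the example runs, and the PDF-to-image conversion is omitted.

```python
from typing import Callable, List


def process_document(
    image_document: object,
    extract_layout_texts: Callable[[object], List[str]],
    normalize: Callable[[List[str]], List[str]],
    extract_target_texts: Callable[[List[str]], List[str]],
) -> List[str]:
    """Run the three stages of the flow in order and return the extracted texts."""
    layout_texts = extract_layout_texts(image_document)  # steps S101 and S102
    processed_texts = normalize(layout_texts)            # step S201
    return extract_target_texts(processed_texts)         # steps S202 and S203


# Stand-in stages, for illustration only; real models would replace them.
fake_layout_model = lambda doc: ["Title", "Patients with hypertension", "Title"]
fake_normalizer = lambda texts: list(dict.fromkeys(texts))  # order-preserving de-dup
fake_text_model = lambda texts: [t for t in texts if "Patients" in t]

result = process_document(
    "page_1.png", fake_layout_model, fake_normalizer, fake_text_model
)
```

Passing the stages as callables keeps the flow independent of any particular model, which mirrors how the description allows the models to be chosen per document field.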
The document processing method provided by the invention can perform corresponding standard processing on each layout text in advance after each layout text extracted from the target image document by the layout text extraction model is obtained, so that the text integrity is enhanced, and the extraction efficiency and accuracy of the text of the specified information category subsequently used for the text processing model are improved.
Based on fig. 1, as shown in fig. 5, the embodiment of the present invention proposes a fourth document processing method. The method may further comprise:
S501, obtaining target text content of a target document; the target document is a document corresponding to the target image document, and the target text content comprises text of at least one layout part in the target document;
Specifically, the target text content may include the text of one or more layout parts of the target document; for example, the title and the abstract of the target document may be used as the target text content.
S502, respectively determining the similarity between each extracted text and the content of a target text;
Specifically, after the target text content and at least one extracted text are determined, the vector cosine similarity between each extracted text and the target text content is calculated, which gives the text similarity between each extracted text and the target text content.
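The cosine similarity computation in step S502 can be sketched as follows. The source does not state how texts are turned into vectors, so the bag-of-words term counts used here are an assumption for illustration; in practice the vectors could equally come from a language model. The function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Turn a text into a sparse term-count vector (illustrative vectorization)."""
    return Counter(text.lower().split())


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the term-count vectors of two texts."""
    va, vb = bag_of_words(text_a), bag_of_words(text_b)
    dot = sum(va[term] * vb[term] for term in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction, treat as dissimilar
    return dot / (norm_a * norm_b)


target = "effect of drug A on blood pressure"
similarity = cosine_similarity("drug A lowered blood pressure", target)
```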
It should be noted that the timing of executing steps S501 and S502 is not limited in the present invention, and for example, may be before step S101; for another example, after the step S104, as shown in fig. 5.
S503, sorting the extracted texts according to the similarity between the extracted texts and the target text content;
Optionally, the extracted texts can be sorted in descending order of their text similarity to the target text content;
Optionally, the extracted texts may be sorted in ascending order of text similarity;
Optionally, the invention may first remove the extracted texts whose text similarity is lower than a certain threshold, and sort only the extracted texts whose text similarity is not lower than the threshold.
S504, outputting the sorting result.
Specifically, the invention can output the sorting result of the extracted texts, so that technicians can select, based on the sorting result and the actual service requirements, the extracted text that meets their needs; for example, the extracted text with the highest text similarity can be selected.
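The sorting and optional threshold filtering of steps S503 and S504 can be sketched as follows. The function name `rank_extracted_texts` and its parameters are hypothetical, and the similarity scores are assumed to have been computed beforehand.

```python
from typing import List, Tuple


def rank_extracted_texts(
    scored: List[Tuple[str, float]],
    threshold: float = 0.0,
    descending: bool = True,
) -> List[Tuple[str, float]]:
    """Keep texts whose similarity is not below the threshold, then sort them."""
    kept = [(text, score) for text, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=descending)


scored_texts = [("text a", 0.2), ("text b", 0.9), ("text c", 0.5)]
ranking = rank_extracted_texts(scored_texts, threshold=0.3)
best_text = ranking[0][0] if ranking else None
```

Setting `descending=False` gives the ascending order mentioned as the second option, and leaving `threshold` at its default keeps every text with a non-negative similarity.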
It is understood that the above steps S503 and S504 need to be executed after the above step S104.
It should be noted that, through the steps S501, S502, S503 and S504, after each extracted text is obtained, the similarity between each extracted text and the target text content is respectively determined, each extracted text is sorted based on the similarity, and the sorting result is output, so as to provide the relevant information between each extracted text and the target text content for the technical staff, assist the technical staff in selecting the text suitable for the actual service requirement, improve the service quality, and enhance the user stickiness.
The document processing method provided by the invention can, after each extracted text is obtained, determine the similarity between each extracted text and the target text content, sort the extracted texts based on the similarity, and output the sorting result, so as to provide technicians with information on how closely each extracted text relates to the target text content, assist them in selecting the text suited to their actual service requirements, improve service quality, and enhance user stickiness.
The document processing apparatus provided by the present invention is described below, and the document processing apparatus described below and the document processing method described above can be referred to in correspondence with each other.
In accordance with the method shown in fig. 1, as shown in fig. 6, an embodiment of the present invention provides a document processing apparatus. The apparatus may include: a first input unit 601, a first obtaining unit 602, a second input unit 603, and a second obtaining unit 604; wherein:
a first input unit 601 for inputting a target image document to a layout text extraction model;
a first obtaining unit 602, configured to obtain at least one layout text of a target image document output by a layout text extraction model;
a second input unit 603 for inputting each layout text to the text processing model;
a second obtaining unit 604, configured to obtain extracted texts in which information types extracted by the text processing models from the layout texts are target information types.
It should be noted that, specific processing procedures of the first input unit 601, the first obtaining unit 602, the second input unit 603, and the second obtaining unit 604 and technical effects brought by the processing procedures can refer to the related descriptions of steps S101, S102, S103, and S104 in fig. 1, respectively, and are not described herein again.
Optionally, the layout text extraction model is obtained by performing joint training on a pre-training semantic understanding model and an image document layout recognition model.
Optionally, the layout text extraction model includes a first processing layer, a second processing layer, and a third processing layer; wherein: the structure of the first processing layer corresponds to the pre-training semantic understanding model, and the structure of the second processing layer corresponds to the image document layout recognition model; the third processing layer is used for outputting each layout text based on the output data of the first processing layer and the output data of the second processing layer.
Optionally, the input of the first processing layer includes: the image document text and the text position information are obtained by the layout text extraction model by utilizing an Optical Character Recognition (OCR) technology;
the output of the first processing layer comprises: a text embedding vector for representing the semantic understanding of the text, and a position embedding vector for representing the mapping relation between text paragraphs and the image.
Optionally, the input of the second processing layer includes: target image documents, image document text and text position information; the output of the second processing layer comprises: a 2D position embedding vector at a character level and an image embedding vector for embodying image feature information.
Optionally, the training data of the layout text extraction model includes: image documents, image document texts, text position information, and text category labels, where a text category label is the category of the document layout part to which a text belongs.
Optionally, the text processing model is obtained by fine-tuning a pre-trained natural language processing model, using as training data document layout texts and labeled texts whose information category is the target information category.
Optionally, the document processing apparatus further comprises: a processing unit and a third obtaining unit;
the processing unit is used for performing integration and de-duplication processing on each layout text after at least one layout text of the target image document output by the layout text extraction model is obtained;
a third obtaining unit, configured to obtain at least one processed text;
a second input unit 603 for inputting each processed text to the text processing model;
a second obtaining unit 604, configured to obtain extracted texts in which information categories extracted by the text processing model from the processed texts are target information categories.
Optionally, the document processing apparatus further comprises: a fourth obtaining unit, a determining unit, a sorting unit and an output unit; wherein:
a fourth obtaining unit, configured to obtain target text content of the target document; the target document is a document corresponding to the target image document, and the target text content comprises text of at least one layout part in the target document;
the determining unit is used for respectively determining the similarity between each extracted text and the target text content;
the sequencing unit is used for sequencing the extracted texts according to the similarity between the extracted texts and the target text content;
and the output unit is used for outputting the sequencing result.
The document processing method provided by the invention comprises the steps of inputting a target image document into a layout text extraction model, and obtaining at least one layout text of the target image document output by the layout text extraction model; and inputting each layout text into a text processing model to obtain an extracted text with the information category extracted from each layout text by the text processing model as the target information category. According to the method and the device, all the layout texts can be extracted from the target image document by using the layout text extraction model, the extracted texts of the specified information types are extracted from all the layout texts by using the text processing model, a manual extraction mode is not needed, excessive consumption of resources such as manpower and time is avoided, the information extraction efficiency and accuracy are improved, information extraction can be performed on all the layout texts, and the comprehensiveness of the information extraction is improved.
In another aspect, an embodiment of the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement any one of the document processing methods described above, where the document processing method may include:
inputting the target image document into a layout text extraction model to obtain at least one layout text of the target image document output by the layout text extraction model;
and inputting each layout text into a text processing model, and obtaining an extracted text of which the information category extracted from each layout text by the text processing model is a target information category.
Fig. 7 illustrates a physical structure diagram of an electronic device, and as shown in fig. 7, the electronic device may include: a processor (processor) 701, a communication Interface (Communications Interface) 702, a memory (memory) 703 and a communication bus 704, wherein the processor 701, the communication Interface 702 and the memory 703 communicate with each other via the communication bus 704. The processor 701 may call logic instructions in the memory 703 to perform the document processing method described above.
In addition, the logic instructions in the memory 703 can be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of executing any of the above-mentioned document processing methods, the document processing methods comprising:
inputting the target image document into a layout text extraction model to obtain at least one layout text of the target image document output by the layout text extraction model;
and inputting each layout text into a text processing model to obtain an extracted text of which the information category extracted from each layout text by the text processing model is a target information category.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform any of the above document processing methods, the document processing method comprising:
inputting the target image document into a layout text extraction model to obtain at least one layout text of the target image document output by the layout text extraction model;
and inputting each layout text into a text processing model, and obtaining an extracted text of which the information category extracted from each layout text by the text processing model is a target information category.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on the understanding, the above technical solutions substantially or otherwise contributing to the prior art may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method of various embodiments or some parts of embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of document processing, comprising:
inputting a target image document into a layout text extraction model, and obtaining at least one layout text of the target image document output by the layout text extraction model;
and inputting each layout text into a text processing model, and obtaining an extracted text of which the information category extracted from each layout text by the text processing model is a target information category.
2. The document processing method according to claim 1, wherein the layout text extraction model is obtained by performing joint training on a pre-trained semantic understanding model and an image document layout recognition model.
3. The document processing method according to claim 2, wherein the layout text extraction model includes a first processing layer, a second processing layer, and a third processing layer; wherein: the structure of the first processing layer corresponds to the pre-training semantic understanding model, and the structure of the second processing layer corresponds to the image document layout recognition model; the third processing layer is configured to output each of the layout texts based on the output data of the first processing layer and the output data of the second processing layer.
4. The document processing method of claim 3, wherein the input to the first processing layer comprises: image document text and text position information, the image document text being text in the target image document, the image document text and the text position information being obtained by the layout text extraction model using Optical Character Recognition (OCR) techniques;
the output of the first processing layer comprises: a text embedding vector for representing the semantic understanding of the text, and a position embedding vector for representing the mapping relation between text paragraphs and the image.
5. The document processing method of claim 3, wherein the input to the second processing layer comprises: the target image document, image document text and text position information; the output of the second processing layer comprises: a 2D position embedding vector at a character level and an image embedding vector for embodying image feature information.
6. The document processing method according to claim 2, wherein the training data of the layout text extraction model includes: image documents, image document texts, text position information and text classification labels, wherein each text classification label is the class of the document layout part to which a text belongs.
7. The document processing method according to claim 1, wherein the text processing model is obtained by fine-tuning a pre-trained natural language processing model, using as training data document layout texts and label texts whose corresponding information category is the target information category.
8. The document processing method according to claim 1, wherein after said obtaining at least one layout text of the target image document output by the layout text extraction model, the document processing method further comprises:
performing integration and de-duplication processing on each layout text to obtain at least one corresponding processed text;
the inputting each layout text into a text processing model comprises:
inputting each processed text into the text processing model;
the obtaining of the extracted text in which the information category extracted by the text processing model from each layout text is a target information category includes:
and obtaining an extracted text in which the information category extracted from each processed text by the text processing model is the target information category.
9. The document processing method according to any one of claims 1 to 7, further comprising:
obtaining target text content of a target document, wherein the target document is the document corresponding to the target image document, and the target text content comprises the text of at least one layout part of the target document;
determining the similarity between each extracted text and the target text content; and
sorting the extracted texts according to their similarity to the target text content, and outputting the sorting result.
10. A document processing apparatus, comprising: a first input unit, a first obtaining unit, a second input unit and a second obtaining unit; wherein:
the first input unit is configured to input a target image document into a layout text extraction model;
the first obtaining unit is configured to obtain at least one layout text of the target image document output by the layout text extraction model;
the second input unit is configured to input each layout text into a text processing model;
the second obtaining unit is configured to obtain an extracted text that the text processing model extracts from each layout text and whose information category is a target information category.
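
For illustration only, the data flow recited in claims 1, 8 and 9 can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the applicant's implementation: the two models are replaced by hypothetical callables (`extract_layout`, `process_text`) whose interfaces the claims do not define, the de-duplication rule is a simplification, and the character-level similarity from `difflib` is an assumed metric, since the claims do not name one. Claim 9 does not depend on claim 8; the two steps are combined here only to show the whole flow, and de-duplication can be switched off.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Optional

# Hypothetical interfaces (assumptions; not defined in the claims).
LayoutExtractor = Callable[[bytes], Dict[str, str]]  # image -> {layout part: text}
TextProcessor = Callable[[str], Optional[str]]       # layout text -> extracted text, or None


def dedupe(texts: List[str]) -> List[str]:
    """Simplified claim 8 step: drop blanks and repeats, ignoring case and spacing."""
    seen, kept = set(), []
    for text in texts:
        key = " ".join(text.split()).lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(text.strip())
    return kept


def similarity(a: str, b: str) -> float:
    """Assumed metric: character-level ratio in [0, 1]; the claims name none."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def process_document(
    image: bytes,
    extract_layout: LayoutExtractor,
    process_text: TextProcessor,
    target_text: Optional[str] = None,
    deduplicate: bool = True,
) -> List[str]:
    # Claim 1, first step: obtain the layout texts of the image document.
    layout_texts = list(extract_layout(image).values())
    # Claim 8 (optional): integrate and de-duplicate before text processing.
    if deduplicate:
        layout_texts = dedupe(layout_texts)
    # Claim 1, second step: keep only text of the target information category.
    extracted = [t for t in map(process_text, layout_texts) if t]
    # Claim 9 (optional): sort by similarity to the target text content.
    if target_text is not None:
        extracted.sort(key=lambda t: similarity(t, target_text), reverse=True)
    return extracted
```

In a real system, `extract_layout` would wrap the jointly trained layout text extraction model and `process_text` the fine-tuned text processing model; the stubs only fix the shape of the data passed between the two stages.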
CN202211058612.7A 2022-08-30 2022-08-30 Document processing method and device Pending CN115455143A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211058612.7A CN115455143A (en) 2022-08-30 2022-08-30 Document processing method and device


Publications (1)

Publication Number Publication Date
CN115455143A true CN115455143A (en) 2022-12-09

Family

ID=84301274

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211058612.7A Pending CN115455143A (en) 2022-08-30 2022-08-30 Document processing method and device

Country Status (1)

Country Link
CN (1) CN115455143A (en)

Similar Documents

Publication Publication Date Title
CN110750959B (en) Text information processing method, model training method and related device
CN109685056B (en) Method and device for acquiring document information
Wilkinson et al. Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections
CN108804423B (en) Medical text feature extraction and automatic matching method and system
CN113961685A (en) Information extraction method and device
CN107679070B (en) Intelligent reading recommendation method and device and electronic equipment
CN111581367A (en) Method and system for inputting questions
CN113221711A (en) Information extraction method and device
CN112861864A (en) Topic entry method, topic entry device, electronic device and computer-readable storage medium
CN115953788A (en) Green financial attribute intelligent identification method and system based on OCR (optical character recognition) and NLP (natural language processing) technologies
CN114818718A (en) Contract text recognition method and device
CN112036330A (en) Text recognition method, text recognition device and readable storage medium
CN115130437B (en) Intelligent document filling method and device and storage medium
CN114579796B (en) Machine reading understanding method and device
CN116822634A (en) Document visual language reasoning method based on layout perception prompt
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN113111869B (en) Method and system for extracting text picture and description thereof
CN112560849B (en) Neural network algorithm-based grammar segmentation method and system
CN115455143A (en) Document processing method and device
Batomalaque et al. Image to text conversion technique for anti-plagiarism system
EP3757825A1 (en) Methods and systems for automatic text segmentation
CN113962196A (en) Resume processing method and device, electronic equipment and storage medium
CN111311197A (en) Travel data processing method and device
KR102442339B1 (en) Apparatus and method for ocr conversion of learning material
CN114357990B (en) Text data labeling method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination