CN118095205A - Information extraction method, device and equipment of layout file and storage medium - Google Patents

Publication number
CN118095205A
Authority
CN
China
Prior art keywords
target
information
field
file
extracted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410065280.8A
Other languages
Chinese (zh)
Inventor
赵刚
王希萌
弓源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Resources Pharmaceutical Commercial Group Co ltd
Original Assignee
China Resources Pharmaceutical Commercial Group Co ltd

Landscapes

  • Character Input (AREA)

Abstract

The application relates to an information extraction method, apparatus, and device for layout files, and a storage medium. The method comprises: receiving an extraction request sent by a terminal device, wherein the extraction request comprises a target layout file to be processed and layout type information of the target layout file, and the target layout file is a file with fixed typesetting; determining a target field library corresponding to the target layout file according to the layout type information, wherein the target field library comprises target fields to be extracted; extracting information from the target layout file according to the target fields to be extracted in the target field library to obtain structured field information corresponding to the target layout file, wherein the structured field information comprises the target fields to be extracted and the text data corresponding to each target field extracted from the target layout file; and sending the structured field information to the terminal device. The method can improve the generalization of information extraction from layout files.

Description

Information extraction method, device and equipment of layout file and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a storage medium for extracting information of a layout file.
Background
The layout file is a file with a fixed typesetting mode and can comprise an identity card, a business license, a manuscript book and the like. The accurate extraction of key information in layout files is a common requirement in daily life.
In the related art, optical character recognition (OCR) is generally used to extract the text information in sample files, and the extracted text is then labeled to construct training data for a layout file. A deep learning model is then trained on the training data, and the trained model is used to extract the key information in the layout file.
However, although the accuracy of extracting key information in the layout file by using the deep learning model is high, the method is only suitable for the layout file related to training data, so that the generalization of information extraction of the layout file is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, apparatus, device, and storage medium for extracting information of a layout file, which can improve generalization of information extraction of a layout file.
In a first aspect, the present application provides a method for extracting information of a layout file. The method comprises the following steps:
receiving an extraction request sent by a terminal device, wherein the extraction request comprises a target layout file to be processed and layout type information of the target layout file, and the target layout file is a file in a fixed typesetting mode;
Determining a target field library corresponding to the target format file according to the format type information of the target format file, wherein the target field library comprises target fields to be extracted;
According to a target field to be extracted in the target field library, extracting information from the target format file to obtain structured field information corresponding to the target format file, wherein the structured field information comprises the target field to be extracted and text data corresponding to the target field extracted from the target format file;
and sending the structured field information to the terminal equipment.
In one embodiment, the extracting information of the target layout file according to the target field to be extracted in the target field library to obtain the structured field information corresponding to the target layout file includes:
performing optical character recognition on the target format file to obtain full text information in the target format file;
Generating identification indication information of a large language model according to a target field to be extracted in the target field library and the full text information, wherein the identification indication information is used for describing an identification object, an identification type and an output type of the large language model;
and extracting text data corresponding to the target field from the full text information by using the large language model according to the identification indication information so as to generate the structured field information.
In one embodiment, the extracting information of the target layout file according to the target field to be extracted in the target field library to obtain the structured field information corresponding to the target layout file includes:
Determining a target detection model corresponding to the target format file according to the format type information of the target format file, wherein the target detection model is used for detecting the extraction range of the target field to be extracted in the target format file;
acquiring text information of the extraction range from the target layout file through optical character recognition;
and matching the text information of the extraction range according to the target field to be extracted in the target field library, and generating the structured field information according to a matching result.
In one embodiment, after extracting information from the target layout file according to the target field to be extracted in the target field library to obtain structured field information corresponding to the target layout file, the method further includes:
acquiring prior knowledge data corresponding to the target field;
and correcting the structured field information according to the prior knowledge data.
In one embodiment, before the receiving the extraction request sent by the terminal device, the method further includes:
Receiving a creation request of a format type sent by the terminal equipment, wherein the creation request comprises a sample file of the format type to be created, and the sample file contains marking information which is used for indicating a sample field to be extracted corresponding to the format type to be created;
and generating a field library corresponding to the format type to be created according to the sample field indicated by the marking information.
In one embodiment, the labeling information further includes location information of the sample field; after receiving the format type creation request sent by the terminal device, the method further includes:
generating a sample set according to the sample field and the position information of the sample field;
and training an initial detection model by using the sample set to obtain a target detection model corresponding to the format type to be created.
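The creation flow above can be sketched roughly as follows. This is a minimal illustration, not the applicant's implementation: the `Annotation` class, the pixel box layout, and the `class_id` sample format are all assumptions introduced here, and the training of the detection model itself is not shown.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """One labeled sample field (hypothetical structure)."""
    field: str
    box: tuple  # assumed layout: (x_min, y_min, x_max, y_max) in pixels


def build_field_library(annotations):
    """Collect the distinct sample fields, preserving first-seen order."""
    library = []
    for annotation in annotations:
        if annotation.field not in library:
            library.append(annotation.field)
    return library


def build_sample_set(annotations, field_library):
    """Pair each annotated box with the class index of its field."""
    return [
        {"class_id": field_library.index(a.field), "box": a.box}
        for a in annotations
    ]


# Hypothetical labeling information carried by a sample file.
annotations = [
    Annotation("name", (40, 30, 200, 60)),
    Annotation("identity_number", (40, 300, 420, 330)),
]
field_library = build_field_library(annotations)
sample_set = build_sample_set(annotations, field_library)
```

The resulting sample set could then be passed to whatever detection framework is chosen for training.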
In a second aspect, the present application provides an information extraction apparatus for a layout file. The device comprises:
The receiving module is used for receiving an extraction request sent by the terminal equipment, wherein the extraction request comprises a target format file to be processed and information of format types of the target format file, and the target format file is a file in a fixed typesetting mode;
The determining module is used for determining a target field library corresponding to the target layout file according to the layout type information of the target layout file, wherein the target field library comprises target fields to be extracted;
The extraction module is used for extracting information of the target format file according to a target field to be extracted in the target field library to obtain structured field information corresponding to the target format file, wherein the structured field information comprises the target field to be extracted and text data corresponding to the target field extracted from the target format file;
and the sending module is used for sending the structured field information to the terminal equipment.
In one embodiment, the extracting module is specifically configured to perform optical character recognition on the target format file, and obtain full text information in the target format file; generating identification indication information of a large language model according to a target field to be extracted in the target field library and the full text information, wherein the identification indication information is used for describing an identification object, an identification type and an output type of the large language model; and extracting text data corresponding to the target field from the full text information by using the large language model according to the identification indication information so as to generate the structured field information.
In one embodiment, the extracting module is specifically configured to determine, according to information of a format type of the target format file, a target detection model corresponding to the target format file, where the target detection model is used to detect an extracting range of the target field to be extracted in the target format file; acquiring text information of the extraction range from the target layout file through optical character recognition; and matching the text information of the extraction range according to the target field to be extracted in the target field library, and generating the structured field information according to a matching result.
In one embodiment, the information extraction device of the layout file further includes:
the correction module is used for acquiring prior knowledge data corresponding to the target field, and correcting the structured field information according to the prior knowledge data.
In one embodiment, the receiving module is further configured to receive a creation request of a format type sent by the terminal device, where the creation request includes a sample file of the format type to be created, and the sample file includes labeling information, where the labeling information is used to indicate a sample field to be extracted corresponding to the format type to be created;
the information extraction device of the layout file further comprises:
And the generation module is used for generating a field library corresponding to the format type to be created according to the sample field indicated by the marking information.
In one embodiment, the generating module is further configured to generate a sample set according to the sample field and the location information of the sample field; and training an initial detection model by using the sample set to obtain a target detection model corresponding to the format type to be created.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the information extraction method of the layout file in the first aspect when executing the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the information extraction method of a layout file described in the first aspect.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the method for extracting information of a layout file according to the first aspect.
In the above method, apparatus, device, and storage medium for extracting information from a layout file, an extraction request sent by a terminal device is first received, where the extraction request includes a target layout file to be processed and layout type information of the target layout file, and the target layout file is a file with fixed typesetting. Next, a target field library corresponding to the target layout file is determined according to the layout type information, where the target field library includes the target fields to be extracted. Information is then extracted from the target layout file according to the target fields in the target field library to obtain structured field information corresponding to the target layout file, where the structured field information includes the target fields to be extracted and the text data corresponding to each target field extracted from the target layout file. Finally, the structured field information is sent to the terminal device. Because different layout types correspond to different target field libraries, the target fields in layout files of different layout types can be extracted adaptively based on those libraries and returned as structured field information, which gives information extraction from layout files higher generalization.
Drawings
Fig. 1 is an application environment diagram of an information extraction method of a layout file according to an embodiment of the present application;
Fig. 2 is a flow chart of a method for extracting information of a layout file according to an embodiment of the present application;
FIG. 3 is an interface schematic diagram of information extraction of a layout file according to an embodiment of the present application;
Fig. 4 is a schematic diagram of information extraction of a layout file according to an embodiment of the present application;
Fig. 5 is a flow chart of another method for extracting information of layout files according to an embodiment of the present application;
Fig. 6 is a flowchart illustrating a method for extracting information from a layout file according to another embodiment of the present application;
Fig. 7 is a flowchart of another method for extracting information of layout files according to an embodiment of the present application;
FIG. 8 is an interface diagram of information extraction of another layout file according to an embodiment of the present application;
Fig. 9 is a block diagram of an information extraction device for layout files according to an embodiment of the present application;
Fig. 10 is an internal structure diagram of a computer device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The information extraction method of the format file provided by the embodiment of the application can be applied to an application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104 or may be located on a cloud or other network server.
When the user needs to extract the format file information, the user may send an extraction request to the server through the terminal 102, where the extraction request includes the target format file to be processed and the format type information of the target format file, and the target format file is a file in a fixed typesetting mode. Then, the server 104 determines a target field library corresponding to the target layout file according to the layout type information of the target layout file, where the target field library includes target fields to be extracted. And, the server 104 may extract information from the target layout file according to the target field to be extracted in the target field library, so as to obtain structured field information corresponding to the target layout file, where the structured field information includes the target field to be extracted and text data corresponding to the target field extracted from the target layout file. Finally, the server 104 transmits the structured field information to the terminal device.
The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, internet of things devices, and portable wearable devices, where the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart vehicle devices, and the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster of multiple servers. The information extraction system of the layout file may be run on the server 104.
In one embodiment, as shown in fig. 2, there is provided an information extraction method of a layout file, which is described by taking an example that the method is applied to the server in fig. 1, and includes S201-S204:
S201, receiving an extraction request sent by a terminal device, wherein the extraction request comprises a target layout file to be processed and information of the layout type of the target layout file.
In the application, when the user needs to extract the information of the layout file, the user can send an extraction request to the server through the terminal, so that the server receives the target layout file to be processed and the information of the layout type of the target layout file.
The target layout file is a file in a fixed typesetting mode.
It should be understood that embodiments of the present application are not limited to the type of layout of the target layout file, and may include, by way of example, identity documents, business licenses, manuscript books, and the like.
It should be understood that embodiments of the present application do not limit the file type of the target layout file. In some embodiments, the target layout file may be an image-type file; by way of example, it may be a Portable Document Format (PDF) file, a Joint Photographic Experts Group (JPG) file, a Portable Network Graphics (PNG) file, or the like.
It should be understood that, the information of the format type of the target format file may be input by the user through the terminal device, or may be selected by the user from the candidate format types displayed by the terminal device, which is not limited in the embodiment of the present application.
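As a rough sketch, the extraction request of S201 could be modeled as below. The class name, attribute names, and the validation rules are assumptions made for illustration; the application itself does not restrict the file type, so the suffix check only reflects the example formats mentioned above.

```python
from dataclasses import dataclass
from pathlib import PurePath

# Example formats named in the text; an assumption, not an exhaustive list.
EXAMPLE_SUFFIXES = {".pdf", ".jpg", ".jpeg", ".png"}


@dataclass
class ExtractionRequest:
    """Hypothetical payload sent by the terminal device."""
    file_name: str
    file_bytes: bytes
    layout_type: str


def validate_request(request, known_layout_types):
    """Reject requests whose file suffix or layout type is not recognized."""
    suffix = PurePath(request.file_name).suffix.lower()
    if suffix not in EXAMPLE_SUFFIXES:
        raise ValueError(f"unsupported file type: {suffix!r}")
    if request.layout_type not in known_layout_types:
        raise ValueError(f"unknown layout type: {request.layout_type!r}")
    return request


request = validate_request(
    ExtractionRequest("id_card.JPG", b"\xff\xd8", "identity_document"),
    known_layout_types={"identity_document", "business_license"},
)
```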
By way of example, Fig. 3 is a schematic diagram of an interface for information extraction of a layout file according to an embodiment of the present application. As shown in Fig. 3, a user may select a target layout file to be processed, and the selected file may be displayed in a file display area of the interface. The user may then configure the layout type of the target layout file by clicking the layout selection control. After selecting the target layout file and its layout type, the user can click the file upload control to send an extraction request to the server. After the server finishes the extraction, the information extracted from the target layout file can be displayed in a result display area of the interface.
S202, determining a target field library corresponding to the target layout file according to the layout type information of the target layout file.
In this step, after the server receives the extraction request sent by the terminal device, the target field library corresponding to the target layout file may be determined according to the layout type information of the target layout file.
The target field library comprises target fields to be extracted. By way of example, if the layout type is an identity document, the target fields may include name, gender, date of birth, address, identity number, and the like. A target field may be recorded in the target field library by its identification information. The identification information may serve as the key of the target field, and correspondingly the text information of the target field may serve as the value, so as to form a key-value pair.
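A target field library of this kind could be held, for example, as a mapping from layout type to field identifiers. The sketch below is only an assumption about one possible representation; the identifiers and the `business_license` entries are hypothetical.

```python
# Hypothetical field libraries keyed by layout type.
FIELD_LIBRARIES = {
    "identity_document": [
        "name", "gender", "date_of_birth", "address", "identity_number",
    ],
    # Illustrative entries only; the source does not list these fields.
    "business_license": ["company_name", "registration_number"],
}


def get_target_field_library(layout_type):
    """Return the target fields configured for a layout type (S202)."""
    try:
        return FIELD_LIBRARIES[layout_type]
    except KeyError:
        raise ValueError(
            f"no field library configured for layout type {layout_type!r}"
        ) from None


def empty_key_value_pairs(layout_type):
    """Start with every field key mapped to None until a value is extracted."""
    return {field: None for field in get_target_field_library(layout_type)}
```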
It should be understood that the embodiments of the present application do not limit how the target field library is generated, and in some embodiments, the server may preconfigure some layout types and generate the target field library corresponding to the preconfigured layout types. In some embodiments, layout types to be automatically learned or indicated by a user can also be added, and a target field library can be correspondingly generated for selection and use.
And S203, extracting information of the target layout file according to the target field to be extracted in the target field library, and obtaining structured field information corresponding to the target layout file.
In this step, after the server determines the target field library corresponding to the target layout file, information extraction may be performed on the target layout file according to the target field to be extracted in the target field library, so as to obtain structured field information corresponding to the target layout file.
The structured field information comprises a target field to be extracted and text data corresponding to the target field extracted from the target format file. For example, the above structured field information may be a key-value pair of a key and a value, where the key is field identification information of a target field, and the value is text data corresponding to the target field extracted from the target layout file.
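For illustration, structured field information in key-value form might be assembled and serialized as follows. The function name and the sample values are assumptions; fields that were not extracted are mapped to `None`, which becomes `null` in JSON.

```python
import json


def to_structured_field_info(target_fields, extracted):
    """Keep only the target fields; fields without a value map to None."""
    return {field: extracted.get(field) for field in target_fields}


# Hypothetical extraction output containing one field outside the library.
info = to_structured_field_info(
    ["name", "identity_number"],
    {"name": "Zhang San", "issuer": "not a target field"},
)
payload = json.dumps(info, ensure_ascii=False)
```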
It should be understood that the embodiment of the present application does not limit how to extract information from the target layout file. Two ways of extracting information from the target layout file are provided below.
Fig. 4 is a schematic diagram of information extraction of a layout file according to an embodiment of the present application. As shown in fig. 4, in the first manner of extracting information from the target layout file, the server may first perform optical character recognition on the target layout file, to obtain full text information in the target layout file. And then, the server generates identification indication information of the large language model according to the target field to be extracted in the target field library and the full text information. And finally, the server extracts text data corresponding to the target field from the full text information by using a large language model according to the identification indication information so as to generate structured field information.
Optical character recognition is a computer input technology that converts the characters of printed matter such as bills, newspapers, books, and manuscripts into image information through optical input such as scanning, and then converts the image information into usable text through character recognition.
It should be understood that embodiments of the present application do not limit the type of OCR. By way of example, OCR may be performed using an open-source OCR tool (e.g., PaddleOCR), any third-party OCR service, or a self-developed and trained OCR model.
The large language model (LLM) is an artificial intelligence model for understanding and generating human language. Large language models are typically trained on large amounts of text data and can perform a wide range of tasks, including text summarization, translation, and sentiment analysis.
It should be understood that embodiments of the present application do not limit the type of large language model. In some embodiments, the large language model may be any open-source large language model, an independently developed and fine-tuned large language model, or the like.
In some embodiments, the identification indication information is used to describe the recognition object, the recognition type, and the output type for the large language model. Illustratively, the identification indication information may be a prompt, that is, a text or sentence that instructs a machine learning model and guides it to generate output of a specific type, topic, or format.
In some embodiments, after OCR recognition is completed to obtain full text information, structured extraction of text may begin using a large language model. The corresponding target field in the target field library and the extracted full text information can be combined into identification indication information, and the output type is added in the identification indication information.
Illustratively, the identification indication information may be: "OCR recognition result: full text information; category-element relationship information (i.e., target fields): name and document number. Extract the specified category-element relationship information from the OCR recognition result, and output the result in JavaScript Object Notation (JSON) format."
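A minimal sketch of how such identification indication information might be assembled, and how the model's JSON output might be parsed, is given below. The function names are hypothetical, and `fake_llm` is a stand-in for whichever large language model is actually called; no real model API is assumed.

```python
import json


def build_recognition_prompt(target_fields, full_text):
    """Combine the OCR full text, the target fields, and the output type."""
    fields = ", ".join(target_fields)
    return (
        f"OCR recognition result: {full_text}\n"
        f"Category-element relationship information (target fields): {fields}\n"
        "Extract the specified category-element relationship information from "
        "the OCR recognition result and output the result in JSON format."
    )


def parse_model_output(raw_output, target_fields):
    """Parse the model's JSON reply; malformed replies yield all-None fields."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {field: data.get(field) for field in target_fields}


def fake_llm(prompt):
    """Stand-in for a real model call (assumption for this example)."""
    return '{"name": "Zhang San", "document number": "X123"}'


fields = ["name", "document number"]
prompt = build_recognition_prompt(fields, "Name Zhang San No. X123")
structured = parse_model_output(fake_llm(prompt), fields)
```

Falling back to empty values on malformed output is a design choice made here so that one bad model reply does not break the whole request.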
As shown in fig. 4, in the second manner of extracting information from the target layout file, the server may determine, according to the layout type information of the target layout file, a target detection model corresponding to the target layout file, where the target detection model is used to detect an extraction range of a target field to be extracted in the target layout file. Subsequently, the server may obtain text information of the extraction range from the target layout file through optical character recognition. And finally, the server matches the text information of the extraction range according to the target field to be extracted in the target field library, and generates structured field information according to the matching result.
The target detection model may be obtained by training a neural network model, for example a You Only Look Once (YOLO) model, a Region-based Convolutional Neural Network (R-CNN) model, or the like.
For example, the extraction range may be marked in the target layout file in the form of a range frame, and the server then performs OCR recognition on the content marked in the range frame in the target layout file to obtain text information of the extraction range. Then, the text information of the extraction range is matched with the target field in the target field library. If the matching is successful, determining text data corresponding to the target field from the text information of the extraction range, and forming structured field information by the text data corresponding to the target field and the target field.
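The matching step can be sketched as follows, assuming the detection model and OCR have already produced one text string per range frame. The source does not specify the matching rule, so the label-prefix rule used here is purely an assumption for illustration.

```python
def match_range_texts(target_fields, range_texts):
    """Match OCR text from each extraction range against the target fields.

    Assumed rule: a range matches a field when its text starts with the
    field label; the remainder of the text is taken as the field value.
    """
    result = {field: None for field in target_fields}
    for text in range_texts:
        for field in target_fields:
            if text.lower().startswith(field.lower()):
                # Strip separators (ASCII or full-width colon) after the label.
                value = text[len(field):].lstrip(" :：").strip()
                if value and result[field] is None:
                    result[field] = value
                break
    return result


# Hypothetical OCR output for three detected range frames.
matched = match_range_texts(
    ["Name", "Document number"],
    ["Name: Zhang San", "Document number：X123", "unrelated text"],
)
```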
In the present application, the extraction range is first determined by the target detection model, and OCR is then performed only on that range to extract information from the target layout file. Compared with directly using a large language model, this approach requires fewer computing resources and makes the overall extraction more lightweight.
In some embodiments, after obtaining the structured field information corresponding to the target layout file, the server may further obtain prior knowledge data corresponding to the target field, so as to correct the structured field information according to the prior knowledge data.
Illustratively, correcting the structured field information according to the prior knowledge data may include correcting or filtering out unreasonable, erroneous, or otherwise irrelevant information in the structured field information based on the prior knowledge. Alternatively, target fields that were not extracted may be supplemented based on the prior knowledge; for example, target information with distinctive features, such as provinces and regions or numbers, may be supplemented. For fields that are not extracted, the extraction result may be null.
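One possible form of such correction is sketched below, under the assumption that the prior knowledge for a field can be expressed as a regular expression. The pattern for an 18-character identity number and the function name are assumptions chosen for this example, not rules taken from the application.

```python
import re

# Hypothetical prior knowledge: the expected shape of each field's value.
PRIOR_PATTERNS = {
    # Assumed format: 17 digits followed by a digit or X, not inside a longer number.
    "identity_number": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
}


def correct_with_prior_knowledge(info, full_text=""):
    """Filter out values that violate the pattern, then try to supplement them."""
    corrected = dict(info)
    for field, pattern in PRIOR_PATTERNS.items():
        if field not in corrected:
            continue
        value = corrected[field]
        if value is not None and not pattern.fullmatch(value.strip()):
            value = None  # unreasonable value is filtered out
        if value is None:
            found = pattern.search(full_text)  # supplement from the full text
            value = found.group(0) if found else None
        corrected[field] = value
    return corrected


corrected = correct_with_prior_knowledge(
    {"name": "Zhang San", "identity_number": "12345"},
    full_text="ID 11010519491231002X issued",
)
```

Fields without a pattern, such as `name` here, pass through unchanged; a field that cannot be supplemented stays `None`.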
It should be noted that, during information extraction from a layout file, the intermediate results obtained during extraction may be stored synchronously in a database and labeled with the target layout type.
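As a rough illustration, intermediate results could be stored with their layout type as follows. The use of SQLite, the table name, and the column names are all assumptions; the application does not specify a database or schema.

```python
import json
import sqlite3


def store_intermediate_result(conn, layout_type, stage, data):
    """Persist one intermediate result, labeled with its layout type."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS intermediate_results ("
        "id INTEGER PRIMARY KEY, layout_type TEXT NOT NULL, "
        "stage TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    conn.execute(
        "INSERT INTO intermediate_results (layout_type, stage, payload) "
        "VALUES (?, ?, ?)",
        (layout_type, stage, json.dumps(data, ensure_ascii=False)),
    )
    conn.commit()


# In-memory database used only for this example.
conn = sqlite3.connect(":memory:")
store_intermediate_result(
    conn, "identity_document", "ocr_full_text", {"text": "Name Zhang San"}
)
rows = conn.execute(
    "SELECT layout_type, stage FROM intermediate_results"
).fetchall()
```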
S204, the structured field information is sent to the terminal equipment.
In the above information extraction method for a layout file, an extraction request sent by a terminal device is first received, where the extraction request includes a target layout file to be processed and layout type information of the target layout file, and the target layout file is a file with fixed typesetting. Next, a target field library corresponding to the target layout file is determined according to the layout type information, where the target field library includes the target fields to be extracted. Information is then extracted from the target layout file according to the target fields in the target field library to obtain structured field information corresponding to the target layout file, where the structured field information includes the target fields to be extracted and the text data corresponding to each target field extracted from the target layout file. Finally, the structured field information is sent to the terminal device. Because different layout types correspond to different target field libraries, the target fields in layout files of different layout types can be extracted adaptively based on those libraries and returned as structured field information, which gives information extraction from layout files higher generalization.
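Steps S201 to S204 can be tied together in a short sketch such as the one below. The request shape, the function names, and the `fake_extract` stand-in are assumptions for illustration; the actual extraction would use one of the two manners described above.

```python
def handle_extraction_request(request, field_libraries, extract):
    """Sketch of S201-S204 on the server side."""
    # S201: the request carries the file and its layout type.
    layout_type = request["layout_type"]
    # S202: pick the target field library for that layout type.
    target_fields = field_libraries[layout_type]
    # S203: run the chosen extraction manner on the file.
    extracted = extract(request["file"], target_fields)
    # S204: return structured field information limited to the target fields.
    return {field: extracted.get(field) for field in target_fields}


def fake_extract(file_bytes, target_fields):
    """Stand-in for OCR plus LLM, or detection plus OCR (assumption)."""
    return {"name": "Zhang San", "unrelated": "ignored"}


response = handle_extraction_request(
    {"file": b"...", "layout_type": "identity_document"},
    {"identity_document": ["name", "identity_number"]},
    fake_extract,
)
```

Because the extraction function is passed in, a new layout type only needs a new field library entry rather than a change to the handler.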
Fig. 5 is a flow chart of another method for extracting information from a layout file according to an embodiment of the present application, as shown in fig. 5, the method for extracting information from a layout file illustrates a first method for extracting information from a target layout file, and the method for extracting information from a layout file includes S301-S306:
s301, receiving an extraction request sent by the terminal equipment.
The extraction request comprises a target layout file to be processed and information of the layout type of the target layout file, wherein the target layout file is a file in a fixed typesetting mode.
S302, determining a target field library corresponding to the target layout file according to the layout type information of the target layout file.
The target field library comprises target fields to be extracted.
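As a rough illustration of S302, the lookup from layout type to target field library can be sketched with an in-memory mapping. The layout type names, the field names, and the function `get_target_field_library` are assumptions made for this example; the application does not specify how the libraries are stored.

```python
from typing import Dict, List

# Hypothetical field libraries keyed by layout type (illustrative values only).
FIELD_LIBRARIES: Dict[str, List[str]] = {
    "invoice": ["name", "document_number", "amount"],
    "license": ["name", "license_number", "valid_until"],
}


def get_target_field_library(layout_type: str) -> List[str]:
    """Return a copy of the target fields to be extracted for a layout type."""
    try:
        return list(FIELD_LIBRARIES[layout_type])
    except KeyError:
        raise ValueError(f"unknown layout type: {layout_type!r}") from None
```

Returning a copy keeps a caller from accidentally modifying the stored library while processing a single request.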
S303, performing optical character recognition on the target layout file to acquire full text information in the target layout file.
It should be appreciated that embodiments of the present application do not limit the type of OCR. By way of example, OCR may be implemented using an open-source OCR tool (e.g., PaddleOCR), any third-party OCR service, or a self-developed and trained OCR model.
S304, generating identification indication information for the large language model according to the target fields to be extracted in the target field library and the full text information. The identification indication information is used for describing the identification object, the identification type, and the output type for the large language model.
A large language model (LLM) is an artificial intelligence model for understanding and generating human language. Large language models are typically trained on large amounts of text data and can perform a wide range of tasks, including text summarization, translation, and sentiment analysis.
It should be appreciated that embodiments of the present application do not limit the type of large language model; in some embodiments, the large language model may be any open-source large language model or a self-developed and fine-tuned large language model.
In some embodiments, after OCR is completed and the full text information is obtained, the large language model may be used to perform structured extraction on the text. The target fields to be extracted in the target field library and the recognized full text information may be combined into the identification indication information, and the output type is added to the identification indication information.
Illustratively, the identification indication information may be: "OCR recognition result: [full text information]; category-element relationship information (i.e., the target fields): name, document number; based on the OCR recognition result, extract the specified category-element relationship information and output the result in JavaScript Object Notation (JSON) format."
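A minimal sketch of assembling such indication information from the full text and the target fields is given below. The function name `build_prompt` and the exact wording of the template are assumptions for illustration; only the three parts (recognized text, target fields, JSON output type) come from the description above.

```python
from typing import List


def build_prompt(full_text: str, target_fields: List[str]) -> str:
    """Combine the OCR full text and the target fields into one instruction string."""
    fields = ", ".join(target_fields)
    return (
        f"OCR recognition result: {full_text}\n"
        f"Category-element relationship information: {fields}\n"
        "Based on the OCR recognition result, extract the specified "
        "category-element relationship information and output the result "
        "in JSON format."
    )
```

Because the target fields are passed in as a parameter, supporting a different layout type only changes the field list and not the template.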
S305, extracting text data corresponding to the target fields from the full text information by using the large language model according to the identification indication information, so as to generate the structured field information.
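One possible way to turn the model's reply into structured field information is sketched below. The model is represented by a hypothetical callable `llm` that takes a prompt string and returns a reply string; no particular model or service API is assumed. The sketch tolerates extra prose around the JSON object and falls back to null values when the reply cannot be parsed.

```python
import json
from typing import Callable, Dict, List, Optional


def extract_with_llm(
    prompt: str,
    target_fields: List[str],
    llm: Callable[[str], str],  # hypothetical stand-in for any large language model
) -> Dict[str, Optional[str]]:
    """Call the model and keep only the target fields from its JSON reply."""
    reply = llm(prompt)
    start, end = reply.find("{"), reply.rfind("}")
    data: object = {}
    if start != -1 and end > start:
        try:
            data = json.loads(reply[start:end + 1])
        except json.JSONDecodeError:
            data = {}  # unparseable reply: treat every field as not extracted
    if not isinstance(data, dict):
        data = {}
    result: Dict[str, Optional[str]] = {}
    for field in target_fields:
        value = data.get(field)
        # Empty or missing values become None so callers can detect gaps.
        result[field] = None if value is None or str(value).strip() == "" else str(value)
    return result
```

Restricting the output to the keys of the target field library keeps any additional keys the model may produce out of the structured field information.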
S306, the structured field information is sent to the terminal equipment.
Fig. 6 is a flowchart of another information extraction method for a layout file according to an embodiment of the present application. As shown in Fig. 6, this method illustrates a second way of extracting information from the target layout file, and includes S401-S406:
S401, receiving an extraction request sent by a terminal device, wherein the extraction request comprises a target layout file to be processed and layout type information of the target layout file, and the target layout file is a file in a fixed typesetting mode.
S402, determining a target field library corresponding to the target format file according to the format type information of the target format file, wherein the target field library comprises target fields to be extracted.
S403, determining a target detection model corresponding to the target layout file according to the layout type information of the target layout file, wherein the target detection model is used for detecting the extraction range of the target field to be extracted in the target layout file.
The target detection model may be obtained by training a neural network model, for example, a You Only Look Once (YOLO) model or a Region-based Convolutional Neural Network (R-CNN) model.
S404, acquiring text information of an extraction range from the target layout file through optical character recognition.
In some embodiments, the extraction scope may be marked in the target layout file in the form of a scope box, and the server then performs OCR recognition on the content marked in the scope box in the target layout file to obtain text information of the extraction scope.
S405, matching the text information of the extraction range according to the target field to be extracted in the target field library, and generating structured field information according to the matching result.
In some embodiments, if the matching is successful, text data corresponding to the target field is determined from the text information in the extraction range, and the text data corresponding to the target field and the target field form structured field information.
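The matching in S405 is not specified in detail. One simple possibility, shown here purely as an assumption, is to treat each piece of text recognized inside an extraction range as a "label: value" string and match the label against the target fields. The function name `match_fields` and the label format are hypothetical.

```python
import re
from typing import Dict, List, Optional


def match_fields(range_texts: List[str], target_fields: List[str]) -> Dict[str, Optional[str]]:
    """Match 'label: value' strings from the extraction ranges against target fields."""
    structured: Dict[str, Optional[str]] = {field: None for field in target_fields}
    for text in range_texts:
        for field in target_fields:
            # Accept an ASCII or full-width colon between the label and the value.
            pattern = rf"\s*{re.escape(field)}\s*[::]\s*(.+)"
            match = re.match(pattern, text, flags=re.IGNORECASE)
            if match and structured[field] is None:
                structured[field] = match.group(1).strip()
    return structured
```

Fields for which no range text matches remain null, which is consistent with leaving non-extracted fields empty.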
S406, the structured field information is sent to the terminal equipment.
Fig. 7 is a flowchart of another information extraction method for a layout file according to an embodiment of the present application. As shown in Fig. 7, this method illustrates how to create a requested layout type, and includes S501-S508:
S501, receiving a layout type creation request sent by a terminal device, wherein the creation request comprises a sample file of the layout type to be created, the sample file contains annotation information, and the annotation information is used for indicating the sample fields to be extracted corresponding to the layout type to be created.
In the present application, the user can also create a new layout type through a creation mode.
Fig. 8 is an interface schematic diagram of information extraction of another layout file according to an embodiment of the present application. As shown in fig. 8, the user uploads a certain number of sample files of a single layout type in the interface, and marks the sample fields in the sample files. Then, the sample file is sent to the server through the creation request by clicking on the control for uploading the file.
S502, generating a field library corresponding to the format type to be created according to the sample field indicated by the marking information.
In some embodiments, the server may extract sample fields indicated by the annotation information in the sample file and store the sample fields in the database in the form of identification information (e.g., keys), thereby forming a field library corresponding to the type of layout to be created.
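A minimal sketch of building such a field library from annotation information follows. A plain dictionary stands in for the database, and the annotation layout (a list of dictionaries with a `"field"` entry) is an assumption made for this example.

```python
from typing import Dict, List


def create_field_library(
    db: Dict[str, List[str]],            # in-memory stand-in for the database
    layout_type: str,
    annotations: List[Dict[str, str]],   # hypothetical shape: [{"field": "name"}, ...]
) -> List[str]:
    """Store the annotated sample fields as keys for the new layout type."""
    keys: List[str] = []
    for annotation in annotations:
        key = annotation["field"].strip()
        if key and key not in keys:  # skip blanks and duplicates across samples
            keys.append(key)
    db[layout_type] = keys
    return keys
```

Deduplicating while preserving order means that annotating the same field in several sample files yields a single key.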
S503, generating a sample set according to the sample field and the position information of the sample field.
In some embodiments, the labeling information further includes location information of the sample field. For example, the labeling information includes a labeling frame, and the position information may be the coordinates of the upper left corner and the lower right corner of the labeling frame.
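As an illustration of S503, annotation frames given by their upper-left and lower-right corners can be converted into training labels. The sketch below assumes, without confirmation from the application, that the detection model expects YOLO-style normalized labels of the form (class index, center x, center y, width, height); the function names and the annotation layout are likewise hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2): upper-left, lower-right


def to_yolo_label(class_id: int, box: Box, img_w: float, img_h: float) -> Tuple[int, float, float, float, float]:
    """Convert corner coordinates in pixels to a normalized center/size label."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return (class_id, cx, cy, w, h)


def build_sample_set(
    annotations: List[Dict[str, object]],  # hypothetical: {"field": str, "box": Box}
    field_library: List[str],
    img_w: float,
    img_h: float,
) -> List[Tuple[int, float, float, float, float]]:
    """Create one label per annotated sample field; the class is the field's index."""
    return [
        to_yolo_label(field_library.index(a["field"]), a["box"], img_w, img_h)
        for a in annotations
    ]
```

Using the position of a field in the field library as its class index ties the detection classes directly to the target fields of the layout type.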
S504, training an initial detection model by using a sample set to obtain a target detection model corresponding to the format type to be created.
The training may be weakly supervised self-training.
S505, receiving an extraction request sent by the terminal equipment, wherein the extraction request comprises a target layout file to be processed and layout type information of the target layout file, and the target layout file is a file in a fixed typesetting mode.
S506, determining a target field library corresponding to the target format file according to the format type information of the target format file, wherein the target field library comprises target fields to be extracted.
S507, extracting information from the target format file according to the target field to be extracted in the target field library to obtain structured field information corresponding to the target format file, wherein the structured field information comprises the target field to be extracted and text data corresponding to the target field extracted from the target format file.
S508, the structured field information is sent to the terminal equipment.
In some embodiments, for a file to be tested of a newly created layout type, the target fields to be extracted for that layout type may be updated directly in the identification indication information of the large language model, after which the text data corresponding to the target fields to be extracted may be extracted directly.
In some embodiments, the target detection model for the new layout type, obtained by training in the creation mode, may be used to detect the extraction ranges of the target fields in the file to be tested of the new layout type, and the second way of extracting information from the target layout file may then be used to obtain the text data corresponding to the target fields to be extracted.
In other embodiments, the server may update the target detection model periodically, e.g., once a week. When the target detection model is updated, the high-confidence target detection results in the intermediate detection state information stored in the database can be used to perform incremental, iterative fine-tuning of the detection model, thereby further safeguarding the extraction and recognition performance of the whole process.
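The selection of high-confidence detection results for such a periodic update might be sketched as follows. The record layout, the threshold value of 0.9, and the function name `select_finetune_samples` are assumptions for illustration; the application does not state how confidence is measured or what threshold is used.

```python
from typing import Dict, List


def select_finetune_samples(
    records: List[Dict[str, object]],  # hypothetical stored intermediate results
    layout_type: str,
    threshold: float = 0.9,            # assumed cut-off, not given in the source
) -> List[Dict[str, object]]:
    """Keep stored detections of one layout type whose confidence meets the threshold."""
    return [
        record
        for record in records
        if record["layout_type"] == layout_type and record["confidence"] >= threshold
    ]
```

The selected records could then serve as pseudo-labelled samples in the next fine-tuning round, while low-confidence results are left out to limit the propagation of detection errors.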
According to the present application, the information extraction system for layout files on the server has a weakly supervised autonomous learning capability and can learn to identify the target fields contained in a new layout file, so that key information in subsequent new layout files is identified and extracted automatically.
According to the above information extraction method for a layout file, an extraction request sent by a terminal device is first received; the extraction request includes a target layout file to be processed and layout type information of the target layout file, and the target layout file is a file with a fixed typesetting mode. Next, a target field library corresponding to the target layout file is determined according to the layout type information of the target layout file, the target field library including the target fields to be extracted. Information is then extracted from the target layout file according to the target fields to be extracted in the target field library to obtain structured field information corresponding to the target layout file; the structured field information includes the target fields to be extracted and the text data corresponding to the target fields extracted from the target layout file. Finally, the structured field information is sent to the terminal device. Because different layout types correspond to different target field libraries, the target fields in layout files of different layout types can be extracted adaptively based on those libraries and returned as structured field information, which improves the generalization of information extraction from layout files.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are also not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides an information extraction device for the format file, which is used for realizing the information extraction method of the format file. The implementation scheme of the solution provided by the device is similar to the implementation scheme described in the above method, so the specific limitation in the embodiments of the information extraction device for one or more layout files provided below may refer to the limitation of the information extraction method for a layout file hereinabove, and will not be repeated herein.
In one embodiment, as shown in fig. 9, there is provided an information extraction apparatus 600 of a layout file, including: a receiving module 601, a determining module 602, an extracting module 603, a transmitting module 604, a correcting module 605 and a generating module 606, wherein:
The receiving module 601 is configured to receive an extraction request sent by a terminal device, where the extraction request includes a target layout file to be processed and information of a layout type of the target layout file, and the target layout file is a file in a fixed layout mode.
The determining module 602 is configured to determine, according to the layout type information of the target layout file, a target field library corresponding to the target layout file, where the target field library includes target fields to be extracted.
The extracting module 603 is configured to extract information of the target layout file according to a target field to be extracted in the target field library, so as to obtain structured field information corresponding to the target layout file, where the structured field information includes the target field to be extracted and text data corresponding to the target field extracted from the target layout file.
And the sending module 604 is configured to send the structured field information to the terminal device.
In one embodiment, the extracting module 603 is specifically configured to perform optical character recognition on the target layout file, and obtain full text information in the target layout file; generating identification indication information of the large language model according to the target field to be extracted and the full text information in the target field library, wherein the identification indication information is used for describing identification objects, identification types and output types of the large language model; and extracting text data corresponding to the target field from the full text information by using a large language model according to the identification indication information so as to generate structured field information.
In one embodiment, the extracting module 603 is specifically configured to determine, according to information of a format type of the target format file, a target detection model corresponding to the target format file, where the target detection model is configured to detect an extraction range of a target field to be extracted in the target format file; acquiring text information of an extraction range from a target format file through optical character recognition; according to the target field to be extracted in the target field library, matching the text information of the extraction range, and generating structured field information according to the matching result.
In one embodiment, the information extraction apparatus 600 of the layout file further includes:
The correction module 605 is configured to obtain prior knowledge data corresponding to the target field, and to correct the structured field information according to the prior knowledge data.
In one embodiment, the receiving module 601 is further configured to receive a creation request of a format type sent by a terminal device, where the creation request includes a sample file of the format type to be created, and the sample file includes labeling information, where the labeling information is used to indicate a sample field to be extracted corresponding to the format type to be created.
The information extraction device 600 of the layout file further includes:
The generating module 606 is configured to generate a field library corresponding to the layout type to be created according to the sample field indicated by the annotation information.
In one embodiment, the generating module is further configured to generate a sample set according to the sample field and the location information of the sample field; and training the initial detection model by using the sample set to obtain a target detection model corresponding to the format type to be created.
The modules in the information extraction device for a layout file may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in the form of hardware, or may be stored in a memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 10. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing data. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a method of extracting information of a layout file.
It will be appreciated by those skilled in the art that the structure shown in FIG. 10 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and a processor, where the memory stores a computer program, and the processor implements the method for extracting information of a layout file when executing the computer program.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the information extraction method of layout files described above.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the above-described method of extracting information of a layout file.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored on a non-transitory computer readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory may include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database. The processor referred to in the embodiments provided herein may be, but is not limited to, a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, or a data processing logic unit based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of this description.
The foregoing embodiments represent only a few implementations of the application. They are described in relative detail, but are not thereby to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the concept of the application, all of which fall within the scope of protection of the application. Accordingly, the scope of protection of the application shall be determined by the appended claims.

Claims (10)

1. An information extraction method of a layout file, which is characterized by comprising the following steps:
receiving an extraction request sent by a terminal device, wherein the extraction request comprises a target layout file to be processed and layout type information of the target layout file, and the target layout file is a file in a fixed typesetting mode;
Determining a target field library corresponding to the target format file according to the format type information of the target format file, wherein the target field library comprises target fields to be extracted;
According to a target field to be extracted in the target field library, extracting information from the target format file to obtain structured field information corresponding to the target format file, wherein the structured field information comprises the target field to be extracted and text data corresponding to the target field extracted from the target format file;
and sending the structured field information to the terminal equipment.
2. The method of claim 1, wherein the extracting information of the target layout file according to the target field to be extracted in the target field library to obtain the structured field information corresponding to the target layout file includes:
performing optical character recognition on the target format file to obtain full text information in the target format file;
Generating identification indication information of a large language model according to a target field to be extracted in the target field library and the full text information, wherein the identification indication information is used for describing an identification object, an identification type and an output type of the large language model;
and extracting text data corresponding to the target field from the full text information by using the large language model according to the identification indication information so as to generate the structured field information.
3. The method of claim 1, wherein the extracting information of the target layout file according to the target field to be extracted in the target field library to obtain the structured field information corresponding to the target layout file includes:
Determining a target detection model corresponding to the target format file according to the format type information of the target format file, wherein the target detection model is used for detecting the extraction range of the target field to be extracted in the target format file;
acquiring text information of the extraction range from the target layout file through optical character recognition;
And matching the text information of the extraction range according to the target field to be extracted in the target field library, and generating the structured field information according to a matching result.
4. The method according to claim 1, wherein after extracting information from the target format file according to the target field to be extracted in the target field library, the method further includes:
acquiring priori knowledge data corresponding to the target field;
and correcting the structured field information according to the priori knowledge data.
5. The method according to claim 1, characterized in that before the receiving the extraction request sent by the terminal device, the method further comprises:
Receiving a creation request of a format type sent by the terminal equipment, wherein the creation request comprises a sample file of the format type to be created, and the sample file contains marking information which is used for indicating a sample field to be extracted corresponding to the format type to be created;
and generating a field library corresponding to the format type to be created according to the sample field indicated by the marking information.
6. The method of claim 5, wherein the labeling information further comprises location information of the sample field; after receiving the format type creation request sent by the terminal device, the method further includes:
generating a sample set according to the sample field and the position information of the sample field;
and training an initial detection model by using the sample set to obtain a target detection model corresponding to the format type to be created.
7. An information extraction apparatus for a layout file, the apparatus comprising:
The receiving module is used for receiving an extraction request sent by the terminal equipment, wherein the extraction request comprises a target format file to be processed and information of format types of the target format file, and the target format file is a file in a fixed typesetting mode;
The determining module is used for determining a target field library corresponding to the target layout file according to the layout type information of the target layout file, wherein the target field library comprises target fields to be extracted;
The extraction module is used for extracting information of the target format file according to a target field to be extracted in the target field library to obtain structured field information corresponding to the target format file, wherein the structured field information comprises the target field to be extracted and text data corresponding to the target field extracted from the target format file;
and the sending module is used for sending the structured field information to the terminal equipment.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
CN202410065280.8A 2024-01-17 2024-01-17 Information extraction method, device and equipment of layout file and storage medium Pending CN118095205A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410065280.8A CN118095205A (en) 2024-01-17 2024-01-17 Information extraction method, device and equipment of layout file and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410065280.8A CN118095205A (en) 2024-01-17 2024-01-17 Information extraction method, device and equipment of layout file and storage medium

Publications (1)

Publication Number Publication Date
CN118095205A true CN118095205A (en) 2024-05-28

Family

ID=91146920

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410065280.8A Pending CN118095205A (en) 2024-01-17 2024-01-17 Information extraction method, device and equipment of layout file and storage medium

Country Status (1)

Country Link
CN (1) CN118095205A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination