CN117237971A - Food quality inspection report data extraction method based on multi-mode information extraction

Info

Publication number: CN117237971A
Application number: CN202311492020.0A
Authority: CN
Inventors: 林韶军; 陈征宇; 黄炳裕; 侯国通; 赵世豪; 叶威鑫; 刘骏
Original assignee: Evecom Information Technology Development Co ltd
Current assignee: Evecom Information Technology Development Co ltd
Priority date: 2023-11-10
Filing date: 2023-11-10
Publication date: 2023-12-15
Anticipated expiration: 2043-11-10
Also published as: CN117237971B

Abstract

The application relates to a food quality inspection report data extraction method based on multi-mode information extraction, comprising the following steps: step S1, acquiring a food quality inspection report, converting it into a picture format, and preprocessing it; step S2, constructing data sets from the preprocessed picture data; step S3, labeling the data sets to obtain training data set 1 and training data set 2; step S4, constructing a multi-modal information extraction model and training it on training data set 1 and training data set 2 to obtain a first information extraction model and a second information extraction model; and step S5, automatically structuring the preprocessed quality inspection report with the trained first and second information extraction models, and reading the required fields from the JSON file in which the extraction result is stored. The application effectively improves the efficiency and quality of structuring food quality inspection report documents.

Description

Food quality inspection report data extraction method based on multi-mode information extraction

Technical Field

The application relates to the field of data extraction, in particular to a food quality inspection report data extraction method based on multi-mode information extraction.

Background

Food safety is key to guaranteeing residents a high quality of life, and quality inspection reports are one of the main tools for guaranteeing food safety. In daily production and life, the quality inspection reports obtained by enterprises or food safety institutions mainly exist as PDF documents and pictures; PDF documents are divided into text and non-text formats, and pictures are divided into single pictures and spliced pictures. With the continuing digitization of the food safety field in China, converting quality inspection reports in PDF or picture form into structured data, so as to enable effective data analysis and digital empowerment, has become an urgent problem.

In the prior art, directly parsing the PDF is suitable only for text-type PDF documents; it cannot extract text from non-text PDF quality inspection reports or from picture-type quality inspection reports. OCR-based methods can directly extract the text in tables within pictures, but because report formats differ between testing institutions and some tables have complex structures such as merged cells and blank cells, the results extracted by OCR-based methods can suffer from row-column mismatch.

Disclosure of Invention

In order to solve the above problems, the present application aims to provide a method for extracting food quality inspection report data based on multi-mode information extraction, which effectively improves the efficiency and quality of structuring food quality inspection report document data.

In order to achieve the above purpose, the present application adopts the following technical scheme:

A food quality inspection report data extraction method based on multi-mode information extraction comprises the following steps:

step S1, acquiring a food quality inspection report, converting it into a picture format, and preprocessing it;

step S2, based on the preprocessed picture data and according to the format characteristics of the quality inspection report, taking the first page and the second page as the data set for acquiring structured identification information such as the report number, the testing institution and the commissioning unit, recorded as data set 1; the third page and subsequent pages record the test content and serve as the data set for acquiring structured test content such as test items and single-item judgments, recorded as data set 2;

step S3, labeling the segmentation types of data set 1 and the relation types of data set 2 with the labeling software Label Studio to obtain training data set 1 and training data set 2;

step S4, constructing a multi-modal information extraction model and training it on training data set 1 and training data set 2 to generate the weight parameters of the first information extraction model and of the second information extraction model, respectively;

step S5, based on the trained first and second information extraction models, automatically structuring the preprocessed quality inspection report, and reading the required fields from the JSON file in which the extraction result is stored.
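The page split of step S2 and the field lookup of step S5 can be illustrated with a minimal Python sketch. The function names `route_pages` and `read_fields`, the JSON field names, and the sample values are all hypothetical; the application does not specify the JSON schema.

```python
import json

def route_pages(page_numbers):
    """Split 1-based page numbers into the two groups described in step S2."""
    header_pages = [p for p in page_numbers if p <= 2]  # data set 1: identification info
    detail_pages = [p for p in page_numbers if p >= 3]  # data set 2: test content
    return header_pages, detail_pages

def read_fields(json_text, wanted):
    """Return only the requested fields from a stored extraction result."""
    record = json.loads(json_text)
    return {key: record.get(key) for key in wanted}

# Hypothetical stored result; the field names are illustrative only.
stored = json.dumps({
    "report_number": "R-0001",
    "testing_institution": "Example Lab",
    "items": [{"test_item": "lead", "single_item_judgment": "pass"}],
})

header, detail = route_pages([1, 2, 3, 4])
fields = read_fields(stored, ["report_number", "items"])
```

In this sketch the header pages would go to the first information extraction model and the remaining pages to the second; a missing field is returned as `None` rather than raising an error.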

The preprocessing is specifically as follows: determining whether the original data is a PDF file or a picture file; if it is a PDF, automatically converting it into pictures page by page; if it is a picture, determining from its aspect ratio whether it is a spliced picture and, if so, automatically cutting it into separate pictures; then picking out the first page of each quality inspection report by automatic screening, extracting the report number field and the page number from that first page by OCR or an information extraction method, and combining the report number field and the page number into a unique identifier for each report, which is used to rename the picture data of the quality inspection report.

Further, the multi-mode information extraction model is specifically as follows:

the input sequence comprises a text part and a visual part; for the text part, an OCR tool extracts the characters in the picture together with their coordinates, the document tokens produced by the serialization module are used as the text sequence, and, following the preprocessing of BERT-style models, two special tokens [CLS] and [SEP] are added at the beginning and the end of the text sequence respectively;

the text embedding of the token sequence T is:

(1)

where the three layers appearing in formula (1) are, respectively, the token embedding layer, the 1D position embedding layer and the token type embedding layer;

for feature extraction of the visual part, Faster-RCNN is adopted as the backbone of the visual encoder; the document image is fed into the visual backbone, an adaptive pooling layer converts the output into a feature map of fixed width W and height H, the feature map is flattened into a visual sequence V, and a linear layer projects each visual token to the same dimension as the text embedding; the visual embedding is generated taking into account similarity, 1D position and token type,

the visual embedding of the token sequence T is:

(2)

where the three terms of formula (2) correspond, respectively, to similarity, 1D position and token type.
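As a hedged illustration, formulas (1) and (2) can be written out in the notation commonly used for LayoutLMv2-style multi-modal document models. Every symbol below is an assumption made for this sketch rather than the notation of the application: $w_i$ is the $i$-th text token, $s_i$ its token type, $I$ the document image, $N$ the maximum text length, and $\mathrm{[C]}$ the type assigned to visual tokens. The additive combination of the three terms is likewise assumed from the three layers named in the text.

```latex
% Assumed notation; symbol names are illustrative, not taken from the application.
\begin{align}
t_i &= \mathrm{TokEmb}(w_i) + \mathrm{PosEmb1D}(i) + \mathrm{SegEmb}(s_i),
      & 0 \le i < N \tag{1} \\
v_i &= \mathrm{Proj}\bigl(\mathrm{VisTokEmb}(I)_i\bigr) + \mathrm{PosEmb1D}(i) + \mathrm{SegEmb}(\mathrm{[C]}),
      & 0 \le i < WH \tag{2}
\end{align}
```

Under these assumptions, $\mathrm{Proj}$ is the linear layer that maps each flattened feature-map cell to the text embedding dimension, and its output plays the role of the first term listed for formula (2).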

Further, for each text token, the OCR tool obtains its 2D coordinates together with the width and height of its bounding box, where the first coordinate pair is the upper-left corner of the bounding box and the second coordinate pair is the lower-right corner;

and all coordinate values are normalized;

a separate embedding layer is constructed for each of the horizontal and vertical directions, as shown in the following formula:

(3)

where the two layers in formula (3) are the x-axis embedding layer and the y-axis embedding layer; the embedding of each text and visual token is integrated with its corresponding layout embedding, and the text and visual tokens are finally combined into one long sequence whose length is the maximum length of the text part plus the length of the visual sequence; the resulting H is the final input of the model:

(4).
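Formulas (3) and (4) can be sketched in the same hedged way. The symbols are assumptions for illustration: $(x_0, y_0)$ and $(x_1, y_1)$ are the normalized upper-left and lower-right corners of a token's bounding box, $w$ and $h$ its width and height, $t_i$ and $v_i$ the text and visual embeddings, and $N$ the maximum text length. The final input is written in bold as $\mathbf{H}$ to keep it apart from the feature-map height $H$.

```latex
% Assumed notation; symbol names are illustrative, not taken from the application.
\begin{align}
\mathrm{box}_i &= (x_0, y_0, x_1, y_1, w, h) \notag \\
l_i &= \mathrm{Concat}\bigl(\mathrm{PosEmb2D}_x(x_0, x_1, w),\ \mathrm{PosEmb2D}_y(y_0, y_1, h)\bigr),
      & 0 \le i < N + WH \tag{3} \\
\mathbf{H} &= \{\, t_0 + l_0,\ \dots,\ t_{N-1} + l_{N-1},\ v_0 + l_N,\ \dots,\ v_{WH-1} + l_{N+WH-1} \,\} \tag{4}
\end{align}
```

Read this way, each token receives one layout vector built from its horizontal and vertical box features, and the sequence fed to the model has $N + WH$ entries.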

Further, the extraction result is saved to a database and synchronized to a manual verification platform for manual verification; the extracted structured data is corrected based on the verification result, and the first and second information extraction models are further trained on the corrected structured data to optimize the models.

The application has the following beneficial effects:

According to the application, the text content of the quality inspection report is first recognized by an OCR method, the text results are then associated and matched by an NLP method, and finally reports in different formats yield uniform extraction results. This avoids the row-column mismatch caused by differing templates or by complex table structures such as merged cells, and effectively improves the efficiency and quality of data extraction from food quality inspection reports.

Drawings

FIG. 1 is a flow chart of the method of the present application;

FIG. 2 is a schematic representation of the annotation of data set 2 according to an embodiment of the present application;

FIG. 3 is a schematic diagram of the data extraction flow according to an embodiment of the application.

Detailed Description

The application is described in further detail below with reference to the attached drawings and specific examples:

Referring to FIG. 1, the application provides a food quality inspection report data extraction method based on multi-mode information extraction, which comprises steps S1 to S5 as set out above.

In this embodiment, the preprocessing is specifically as follows: determining whether the original data is a PDF file or a picture file; if it is a PDF, automatically converting it into pictures page by page; if it is a picture, determining from its aspect ratio whether it is a spliced picture and, if so, automatically cutting it into separate pictures; then picking out the first page of each quality inspection report by automatic screening, extracting the report number field and the page number from that first page by OCR or an information extraction method, and combining the report number field and the page number into a unique identifier for each report, which is used to rename the picture data of the quality inspection report.
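The aspect-ratio check and the renaming step can be sketched in Python as follows. This is only an illustration under stated assumptions: the application says the decision is based on aspect ratio but gives no threshold, so the A4-like `page_ratio`, the `tolerance`, the assumption that spliced pages are stacked vertically, and the file-name pattern are all hypothetical.

```python
def split_boxes(width, height, page_ratio=1.414, tolerance=0.25):
    """Return crop boxes (left, top, right, bottom) for a possibly spliced picture.

    page_ratio is the assumed height/width ratio of one page (about 1.414 for
    A4 portrait); tolerance is how far a single page may exceed that ratio.
    Pages are assumed to be stacked vertically in a spliced picture.
    """
    ratio = height / width
    if ratio <= page_ratio * (1 + tolerance):
        return [(0, 0, width, height)]           # treated as a single page
    pages = max(1, round(ratio / page_ratio))    # estimated number of stacked pages
    step = height / pages
    return [(0, round(i * step), width, round((i + 1) * step)) for i in range(pages)]

def page_file_name(report_number, page_number, suffix=".png"):
    """Combine report number and page number into a unique picture name."""
    # Replace characters that are unsafe in file names with underscores.
    safe = "".join(c if c.isalnum() or c in "-_" else "_" for c in report_number)
    return f"{safe}_p{page_number:03d}{suffix}"
```

Each returned box could then be passed to an image library's crop routine; the sketch deliberately stops at computing the boxes so that it does not depend on any particular library.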

In this embodiment, the multi-modal information extraction model is specifically as follows:

The text embedding of the token sequence T is:

(1)

The visual embedding of the token sequence T is:

(2)

where the three terms of formula (2) correspond, respectively, to similarity, 1D position and token type.

In this embodiment, for each text token, the OCR tool obtains its 2D coordinates together with the width and height of its bounding box;

all coordinate values are normalized to the range [0, 1000];

similar calculations may also be performed for visual tokens. The layout embedding of a text or visual token is obtained as:

(3)

where the two layers in formula (3) are the x-axis embedding layer and the y-axis embedding layer. To obtain the final input representation of the model, the embedding of each text and visual token is integrated with its corresponding layout embedding, and the text and visual tokens are finally combined into one long sequence whose length is the maximum length of the text part plus the length of the visual sequence; the resulting H is the final input of the model:

(4)
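The normalization of bounding-box coordinates to the range [0, 1000] mentioned in this embodiment can be sketched as below. The function name `normalize_box`, the pixel-coordinate input, and the choice to truncate to integers and clamp are assumptions for illustration; the application only states the target range.

```python
def normalize_box(box, page_width, page_height, scale=1000):
    """Scale pixel coordinates (x0, y0, x1, y1) into the integer range [0, scale]."""
    x0, y0, x1, y1 = box

    def clamp(value):
        # Guard against OCR boxes that slightly overshoot the page border.
        return max(0, min(scale, value))

    nx0 = clamp(int(scale * x0 / page_width))
    ny0 = clamp(int(scale * y0 / page_height))
    nx1 = clamp(int(scale * x1 / page_width))
    ny1 = clamp(int(scale * y1 / page_height))
    # Width and height are derived from the normalized corners.
    return nx0, ny0, nx1, ny1, nx1 - nx0, ny1 - ny0
```

Normalizing against the page size makes the layout features comparable across reports scanned at different resolutions, which is presumably why a fixed range is used.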

In this embodiment, the extraction result is saved to a database and synchronized to a manual verification platform for manual verification; the extracted structured data is corrected based on the verification result, and the first and second information extraction models are further trained on the corrected structured data to optimize the models.
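One way to picture this correction loop is the Python sketch below. It is a simplified illustration only: the record layout, the function names `apply_corrections` and `collect_retraining_samples`, and the rule that only changed records are kept for further training are assumptions, not details given by the application.

```python
def apply_corrections(extracted, corrections):
    """Merge reviewer corrections into an extracted record.

    Returns the corrected record and the list of fields that changed.
    """
    corrected = dict(extracted)
    changed = []
    for field, value in corrections.items():
        if corrected.get(field) != value:
            corrected[field] = value
            changed.append(field)
    return corrected, changed

def collect_retraining_samples(records, reviews):
    """Keep the corrected version of every record that the review changed."""
    samples = []
    for record_id, extracted in records.items():
        corrected, changed = apply_corrections(extracted, reviews.get(record_id, {}))
        if changed:
            samples.append((record_id, corrected))
    return samples

# Hypothetical data: the first record contains an OCR confusion of "O" and "0".
records = {"r1": {"report_number": "R-0O01"}, "r2": {"report_number": "R-0002"}}
reviews = {"r1": {"report_number": "R-0001"}}
samples = collect_retraining_samples(records, reviews)
```

The collected samples would then be added to the training data of whichever model produced the corrected fields.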

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above is only a preferred embodiment of the present application and does not limit the application in any form. Any person skilled in the art may use the technical content disclosed above to make changes or modifications that yield equivalent embodiments. However, any simple modification, equivalent change or variation made to the above embodiments according to the technical substance of the present application still falls within the protection scope of the technical solution of the present application.

Claims

1. The food quality inspection report data extraction method based on multi-mode information extraction is characterized by comprising the following steps of:

step S2, based on the preprocessed picture data and according to the format characteristics of the quality inspection report, taking the first page and the second page as the data set for acquiring the structured identification information of report number, testing institution and commissioning unit, recorded as data set 1; the third page and subsequent pages record the test content and serve as the data set for acquiring the structured test content of test items and single-item judgments, recorded as data set 2;

step S5, based on the trained first information extraction model and the trained second information extraction model, automatically structuring the preprocessed quality inspection report, and reading the required fields from the JSON file in which the extraction result is stored;

the multi-mode information extraction model is specifically as follows:

the text embedding of the token sequence T is:

(1)

the visual embedding of the token sequence T is:

(2)

where the three terms of formula (2) correspond, respectively, to similarity, 1D position and token type.

2. The method for extracting food quality inspection report data based on multi-modal information extraction according to claim 1, wherein the preprocessing specifically comprises: determining whether the original data is a PDF file or a picture file; if it is a PDF, automatically converting it into pictures page by page; if it is a picture, determining from its aspect ratio whether it is a spliced picture and, if so, automatically cutting it into separate pictures; then picking out the first page of each quality inspection report by automatic screening, extracting the report number field and the page number from that first page by OCR or an information extraction method, and combining the report number field and the page number into a unique identifier for each report, which is used to rename the picture data of the quality inspection report.

3. The method for extracting food quality inspection report data based on multi-modal information extraction as set forth in claim 1, wherein for each text token, the OCR tool obtains its 2D coordinates together with the width and height of its bounding box, where the first coordinate pair is the upper-left corner of the bounding box and the second coordinate pair is the lower-right corner;

and all coordinate values are normalized;

(3)

(4).

4. The method for extracting food quality inspection report data based on multi-modal information extraction according to claim 1, wherein the extraction result is saved to a database and synchronized to a manual verification platform for manual verification, the extracted structured data is corrected based on the verification result, and the first information extraction model and the second information extraction model are further trained on the corrected structured data to optimize the models.