CN117237971A - Food quality inspection report data extraction method based on multi-mode information extraction

Publication number
CN117237971A
CN117237971A CN202311492020.0A CN202311492020A CN117237971A CN 117237971 A CN117237971 A CN 117237971A CN 202311492020 A CN202311492020 A CN 202311492020A CN 117237971 A CN117237971 A CN 117237971A
Authority
CN
China
Prior art keywords
quality inspection
inspection report
picture
information extraction
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311492020.0A
Other languages
Chinese (zh)
Other versions
CN117237971B (en)
Inventor
林韶军
陈征宇
黄炳裕
侯国通
赵世豪
叶威鑫
刘骏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Evecom Information Technology Development Co ltd
Original Assignee
Evecom Information Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Evecom Information Technology Development Co ltd
Priority to CN202311492020.0A
Publication of CN117237971A
Application granted
Publication of CN117237971B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Character Input (AREA)

Abstract

The application relates to a food quality inspection report data extraction method based on multi-mode information extraction, comprising the following steps: step S1, acquiring a food quality inspection report, converting it into picture format, and preprocessing it; step S2, constructing data sets from the preprocessed picture data; step S3, labeling the data sets to obtain training data set 1 and training data set 2; step S4, constructing a multi-modal information extraction model and training it on training data set 1 and training data set 2 to obtain a first information extraction model and a second information extraction model; and step S5, automatically structuring the data of the preprocessed quality inspection report using the trained first and second information extraction models, and reading the required fields from the JSON file in which the extraction result is stored. The application effectively improves the efficiency and quality of structuring food quality inspection report document data.

Description

Food quality inspection report data extraction method based on multi-mode information extraction
Technical Field
The application relates to the field of data extraction, in particular to a food quality inspection report data extraction method based on multi-mode information extraction.
Background
Food safety is key to guaranteeing residents a high quality of life, and quality inspection reports are one of the main tools for ensuring food safety. In daily production, the quality inspection reports obtained by enterprises or food safety institutions mainly exist as PDF documents or as pictures; PDF documents are either text-based or non-text, and pictures are either single pictures or spliced pictures. As digitization of the food safety field in China continues to advance, converting quality inspection reports in PDF or picture form into structured data, so as to enable effective data analysis and digital empowerment, has become a problem to be solved urgently.
In the prior art, directly parsing the PDF is only suitable for text-based PDF documents; it cannot extract text content from non-text PDF quality inspection reports or from reports in picture form. OCR-based methods can directly extract the text in tables within pictures, but because report formats differ between testing institutions and some tables have complex structures such as merged cells and blank cells, the results extracted by OCR-based methods can suffer from row-column mismatches.
Disclosure of Invention
In order to solve the above problems, the present application aims to provide a method for extracting food quality inspection report data based on multi-mode information extraction, which effectively improves the efficiency and quality of structuring food quality inspection report document data.
In order to achieve the above purpose, the present application adopts the following technical scheme:
A food quality inspection report data extraction method based on multi-mode information extraction comprises the following steps:
Step S1, acquiring a food quality inspection report, converting it into picture format, and preprocessing it;
Step S2, based on the preprocessed picture data and according to the format characteristics of quality inspection reports, taking the first and second pages as the data set for extracting structured identification information such as the report number, testing institution, and commissioning unit, recorded as data set 1; the third and subsequent pages record the test content and are taken as the data set for extracting structured test content such as test items and single-item determinations, recorded as data set 2;
Step S3, labeling data set 1 with segmentation-type annotations and data set 2 with relation-type annotations using the labeling software Label Studio, to obtain training data set 1 and training data set 2;
Step S4, constructing a multi-modal information extraction model and training it on training data set 1 and training data set 2 respectively, to generate the weight parameters of the first information extraction model and of the second information extraction model;
Step S5, based on the trained first and second information extraction models, automatically structuring the data of the preprocessed quality inspection report, and reading the required fields from the JSON file in which the extraction result is stored.
Further, the preprocessing is specifically as follows: determining whether the original data is a PDF file or a picture file; if it is a PDF, automatically converting it page by page into picture format according to page number; if it is a picture, determining from its aspect ratio whether it is a spliced picture and, if so, automatically cutting it into separate pictures; then automatically screening out the first page of each quality inspection report, extracting the report number field and the page number from it using OCR or an information extraction method, and combining the report number with the page number as the unique identifier of each report, which is used to rename the picture data of the quality inspection report.
Further, the multi-mode information extraction model is specifically as follows:
The input sequence comprises a text part and a visual part. For the text part, an OCR tool extracts the characters in the picture together with their coordinate information, and the document tokens produced by the serialization module are used as the text sequence; following the preprocessing of BERT-style models, two special tokens, [CLS] and [SEP], are added at the beginning and the end of the text sequence, respectively;
the text embedding of the token sequence T is expressed as:
$$T = E_{tk}(T) + E_{1p}(T) + E_{typ}(T) \quad (1)$$
where $E_{tk}$, $E_{1p}$, and $E_{typ}$ denote the token embedding layer, the 1D position embedding layer, and the token type embedding layer, respectively;
for feature extraction of the visual part, Faster-RCNN is adopted as the backbone of the visual encoder: the document image is fed into the visual backbone, an adaptive pooling layer converts the output into a feature map with fixed width W and height H, the feature map is flattened into a visual sequence V, and a linear layer $F_{vs}(\cdot)$ projects each visual token to the same dimension as the text embedding; as with the text embedding, the visual embedding is generated taking the 1D position and token type into account,
the visual embedding of the visual sequence V is expressed as:
$$V = F_{vs}(V) + E_{1p}(V) + E_{typ}(V) \quad (2)$$
where $F_{vs}$, $E_{1p}$, and $E_{typ}$ denote the linear projection layer, the 1D position embedding layer, and the token type embedding layer, respectively.
Further, for each text token, the OCR tool obtains its 2D coordinates together with the width and height of the bounding box, $(x_0, y_0, x_1, y_1, w, h)$,
where $(x_0, y_0)$ are the coordinates of the upper left corner of the bounding box and $(x_1, y_1)$ those of the lower right corner;
all coordinate values are normalized;
separate embedding layers are constructed in the horizontal and vertical directions, as shown in the following formula:
$$L = E_{2x}(x_0, x_1, w) + E_{2y}(y_0, y_1, h) \quad (3)$$
where $E_{2x}$ is the x-axis embedding layer and $E_{2y}$ denotes the y-axis embedding layer; the embedding of each text and visual token is integrated with its corresponding layout embedding, and the two parts are finally combined into a long sequence of length $N + HW$, where $N$ is the maximum length of the text part, and the resulting H is the final input of the model:
$$H = [T + L;\ V + L] \quad (4)$$
Further, the extraction result is saved to a database and synchronized to a manual verification platform for manual verification; the extracted structured data is corrected based on the verification result, and the first and second information extraction models are further trained on the corrected structured data to optimize the models.
The application has the following beneficial effects:
According to the application, the text content of the quality inspection report is first recognized by an OCR method, the text results are then associated and matched by an NLP method, and finally uniform extraction results are output for reports in different formats. This avoids the row-column mismatches caused by differing templates or by complex table structures such as merged cells, and effectively improves the efficiency and quality of data extraction from food quality inspection reports.
Drawings
FIG. 1 is a flow chart of the method of the present application;
FIG. 2 is a schematic representation of the annotation of data set 2 according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the data extraction flow according to an embodiment of the application.
Detailed Description
The application is described in further detail below with reference to the attached drawings and specific examples:
Referring to FIG. 1, the application provides a food quality inspection report data extraction method based on multi-mode information extraction, which comprises the following steps:
Step S1, acquiring a food quality inspection report, converting it into picture format, and preprocessing it;
Step S2, based on the preprocessed picture data and according to the format characteristics of quality inspection reports, taking the first and second pages as the data set for extracting structured identification information such as the report number, testing institution, and commissioning unit, recorded as data set 1; the third and subsequent pages record the test content and are taken as the data set for extracting structured test content such as test items and single-item determinations, recorded as data set 2;
Step S3, labeling data set 1 with segmentation-type annotations and data set 2 with relation-type annotations using the labeling software Label Studio, to obtain training data set 1 and training data set 2;
Step S4, constructing a multi-modal information extraction model and training it on training data set 1 and training data set 2 respectively, to generate the weight parameters of the first information extraction model and of the second information extraction model;
Step S5, based on the trained first and second information extraction models, automatically structuring the data of the preprocessed quality inspection report, and reading the required fields from the JSON file in which the extraction result is stored.
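By way of illustration only, the page routing described in step S2 can be sketched as follows. The tuple layout `(report_id, page_number, image_path)` and the function name `split_report_pages` are assumptions made for this sketch; the application does not specify how pages are represented.

```python
from collections import defaultdict


def split_report_pages(pages):
    """Route report pages into data set 1 (pages 1-2) and data set 2 (pages 3+).

    `pages` is an iterable of (report_id, page_number, image_path) tuples.
    This tuple layout is hypothetical and chosen only for the example.
    """
    dataset_1, dataset_2 = defaultdict(list), defaultdict(list)
    # Sorting keeps the pages of each report in page order.
    for report_id, page_number, image_path in sorted(pages):
        target = dataset_1 if page_number <= 2 else dataset_2
        target[report_id].append(image_path)
    return dict(dataset_1), dict(dataset_2)


pages = [
    ("R001", 3, "R001_3.png"),
    ("R001", 1, "R001_1.png"),
    ("R001", 2, "R001_2.png"),
    ("R002", 1, "R002_1.png"),
]
ds1, ds2 = split_report_pages(pages)
```

Here `ds1` collects the identification pages and `ds2` the test-content pages, so each set can be labeled and used to train its own extraction model.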
In this embodiment, the preprocessing is specifically as follows: determining whether the original data is a PDF file or a picture file; if it is a PDF, automatically converting it page by page into picture format according to page number; if it is a picture, determining from its aspect ratio whether it is a spliced picture and, if so, automatically cutting it into separate pictures; then automatically screening out the first page of each quality inspection report, extracting the report number field and the page number from it using OCR or an information extraction method, and combining the report number with the page number as the unique identifier of each report, which is used to rename the picture data of the quality inspection report.
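A minimal sketch of the aspect-ratio check and the renaming step is given below. It assumes that pages are spliced vertically and that a single page has roughly A4 proportions; the ratio, the tolerance, and the file-name pattern are illustrative values that do not come from the application.

```python
A4_RATIO = 297 / 210  # assumed height/width of one portrait page


def split_boxes(width, height, page_ratio=A4_RATIO, tolerance=0.25):
    """Return crop boxes (left, top, right, bottom) for a possibly spliced picture.

    Assumes vertical splicing of equally sized pages (an assumption of this sketch).
    """
    ratio = height / width
    n_pages = max(1, round(ratio / page_ratio))
    # Treat the picture as a single page if it is not close to a whole
    # number of stacked pages.
    if n_pages == 1 or abs(ratio / page_ratio - n_pages) > tolerance:
        return [(0, 0, width, height)]
    page_height = height / n_pages
    return [
        (0, round(i * page_height), width, round((i + 1) * page_height))
        for i in range(n_pages)
    ]


def page_filename(report_number, page_number, ext="png"):
    """Combine report number and page number into a unique picture name."""
    # Replace characters that are unsafe in file names with underscores.
    safe = "".join(c if c.isalnum() or c in "-_" else "_" for c in report_number)
    return f"{safe}_p{page_number:03d}.{ext}"


single = split_boxes(1000, 1414)
spliced = split_boxes(1000, 4242)
name = page_filename("FJ/2023-0012", 1)
```

The returned boxes can be passed to any image library's crop function; horizontal splicing would need the same check applied to the width.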
In this embodiment, the multi-modal information extraction model is specifically as follows:
The input sequence comprises a text part and a visual part. For the text part, an OCR tool extracts the characters in the picture together with their coordinate information, and the document tokens produced by the serialization module are used as the text sequence; following the preprocessing of BERT-style models, two special tokens, [CLS] and [SEP], are added at the beginning and the end of the text sequence, respectively;
the text embedding of the token sequence T is expressed as:
$$T = E_{tk}(T) + E_{1p}(T) + E_{typ}(T) \quad (1)$$
where $E_{tk}$, $E_{1p}$, and $E_{typ}$ denote the token embedding layer, the 1D position embedding layer, and the token type embedding layer, respectively;
for feature extraction of the visual part, Faster-RCNN is adopted as the backbone of the visual encoder: the document image is fed into the visual backbone, an adaptive pooling layer converts the output into a feature map with fixed width W and height H, the feature map is flattened into a visual sequence V, and a linear layer $F_{vs}(\cdot)$ projects each visual token to the same dimension as the text embedding; as with the text embedding, the visual embedding is generated taking the 1D position and token type into account,
the visual embedding of the visual sequence V is expressed as:
$$V = F_{vs}(V) + E_{1p}(V) + E_{typ}(V) \quad (2)$$
where $F_{vs}$, $E_{1p}$, and $E_{typ}$ denote the linear projection layer, the 1D position embedding layer, and the token type embedding layer, respectively.
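The structure of the text and visual embeddings, each a sum of a content term, a 1D position term, and a token type term, can be sketched in plain Python as follows. The names `E_tk`, `E_1p`, `E_typ`, and `project` are labels chosen for this sketch, the tables hold random values rather than trained weights, and the token ids and the position indexing of visual tokens are assumptions.

```python
import random

DIM = 4  # embedding dimension, kept tiny for readability


def make_table(rows, cols, seed):
    """Build a rows x cols table of random values standing in for trained weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]


E_tk = make_table(10, DIM, seed=0)   # token embedding for a vocabulary of 10 ids
E_1p = make_table(16, DIM, seed=1)   # 1D position embedding
E_typ = make_table(2, DIM, seed=2)   # token type: 0 = text, 1 = visual (assumed)


def text_embedding(token_ids):
    """Token + 1D position + token type embedding for each text token."""
    return [add(E_tk[t], E_1p[i], E_typ[0]) for i, t in enumerate(token_ids)]


def project(feature, weight):
    """Linear layer: map a visual feature vector to DIM dimensions."""
    return [sum(w * f for w, f in zip(row, feature)) for row in weight]


def visual_embedding(features, weight):
    """Projected feature + 1D position + token type embedding per visual token."""
    return [add(project(f, weight), E_1p[i], E_typ[1]) for i, f in enumerate(features)]


token_ids = [1, 5, 7, 2]                  # hypothetical ids; 1 = [CLS], 2 = [SEP]
features = make_table(4, 3, seed=3)       # flattened 2 x 2 feature map, 3 channels
weight = make_table(DIM, 3, seed=4)       # projection weights of the linear layer
T = text_embedding(token_ids)
V = visual_embedding(features, weight)
```

Both sequences end up with vectors of the same dimension, which is what allows them to be concatenated into one input sequence later.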
In this embodiment, for each text token, the OCR tool obtains its 2D coordinates together with the width and height of the bounding box, $(x_0, y_0, x_1, y_1, w, h)$,
where $(x_0, y_0)$ are the coordinates of the upper left corner of the bounding box and $(x_1, y_1)$ those of the lower right corner;
all coordinate values are normalized to the range [0, 1000];
similar calculations may also be performed for visual tokens. To look up the layout embeddings of the text/visual tokens,
separate embedding layers are constructed in the horizontal and vertical directions, as shown in the following formula:
$$L = E_{2x}(x_0, x_1, w) + E_{2y}(y_0, y_1, h) \quad (3)$$
where $E_{2x}$ is the x-axis embedding layer and $E_{2y}$ denotes the y-axis embedding layer. To obtain the final input representation H of the model, the embedding of each text and visual token is integrated with its corresponding layout embedding, and the two parts are finally combined into a long sequence of length $N + HW$, where $N$ is the maximum length of the text part, and the resulting H is the final input of the model:
$$H = [T + L;\ V + L] \quad (4)$$
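The coordinate normalization and the assembly of the final input sequence can be sketched as below. The lookup tables hold random values in place of trained weights, and summing the three look-ups per axis is an assumption about how each axis embedding layer combines its arguments; the application only states that separate layers exist for the two directions.

```python
import random

DIM = 4
_rng = random.Random(0)
# One lookup table per axis, indexed by a normalized coordinate in [0, 1000].
E_2x = [[_rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(1001)]
E_2y = [[_rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(1001)]


def normalize_box(box, page_w, page_h, scale=1000):
    """Scale pixel coordinates (x0, y0, x1, y1) into [0, scale]; append w and h."""
    x0, y0, x1, y1 = box
    nx0, nx1 = int(scale * x0 / page_w), int(scale * x1 / page_w)
    ny0, ny1 = int(scale * y0 / page_h), int(scale * y1 / page_h)
    return nx0, ny0, nx1, ny1, nx1 - nx0, ny1 - ny0


def vec_add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def layout_embedding(norm_box):
    """Layout embedding of one token from its normalized box."""
    x0, y0, x1, y1, w, h = norm_box
    # Assumption: each axis layer sums the look-ups of its three arguments.
    return vec_add(E_2x[x0], E_2x[x1], E_2x[w], E_2y[y0], E_2y[y1], E_2y[h])


def build_input(T, text_boxes, V, visual_boxes):
    """Add layout embeddings to text and visual tokens, then concatenate."""
    text_part = [vec_add(t, layout_embedding(b)) for t, b in zip(T, text_boxes)]
    visual_part = [vec_add(v, layout_embedding(b)) for v, b in zip(V, visual_boxes)]
    return text_part + visual_part


norm = normalize_box((100, 200, 300, 260), page_w=1000, page_h=2000)
T = [[0.0] * DIM for _ in range(3)]   # placeholder text embeddings, N = 3
V = [[1.0] * DIM for _ in range(4)]   # placeholder visual embeddings, H * W = 4
# Visual tokens use the box of their feature-map cell (placeholder box here).
H = build_input(T, [norm] * 3, V, [(0, 0, 500, 500, 500, 500)] * 4)
```

The resulting sequence has N + HW entries, matching the length stated in the text.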
In this embodiment, the extraction result is saved to a database and synchronized to a manual verification platform for manual verification; the extracted structured data is corrected based on the verification result, and the first and second information extraction models are further trained on the corrected structured data to optimize the models.
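As a rough sketch of this correction loop, the snippet below reads an extraction result from JSON and applies the fields changed during manual verification. The field names (`report_number`, `test_item`, `single_item_judgement`) and the function `apply_corrections` are hypothetical; the application does not define the JSON schema or the platform interface.

```python
import json


def apply_corrections(extracted, corrections):
    """Return (corrected_record, changed_fields) without mutating the input.

    `changed_fields` maps each corrected field to (old_value, new_value) and
    could be collected as additional training material.
    """
    corrected = dict(extracted)
    changed = {}
    for field, value in corrections.items():
        if corrected.get(field) != value:
            changed[field] = (corrected.get(field), value)
            corrected[field] = value
    return corrected, changed


# Hypothetical extraction result as it might be stored in the JSON file.
raw = (
    '{"report_number": "FJ2023-0012", "test_item": "Lead (Pb)", '
    '"single_item_judgement": "Qualifed"}'
)
extracted = json.loads(raw)
corrected, changed = apply_corrections(
    extracted, {"single_item_judgement": "Qualified"}
)
```

Keeping the original record untouched makes it possible to store both versions, so the difference between model output and verified value remains available for retraining.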
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above is only a preferred embodiment of the present application and is not intended to limit the application in any way. Any person skilled in the art may use the technical content disclosed above to make modifications or alterations that yield equivalent embodiments. However, any simple modification, equivalent change, or variation made to the above embodiments according to the technical substance of the present application still falls within the protection scope of the technical solution of the present application.

Claims (4)

1. The food quality inspection report data extraction method based on multi-mode information extraction is characterized by comprising the following steps of:
Step S1, acquiring a food quality inspection report, converting it into picture format, and preprocessing it;
Step S2, based on the preprocessed picture data and according to the format characteristics of quality inspection reports, taking the first and second pages as the data set for extracting the structured identification information of report number, testing institution, and commissioning unit, recorded as data set 1; the third and subsequent pages record the test content and are taken as the data set for extracting the structured test content of test items and single-item determinations, recorded as data set 2;
Step S3, labeling data set 1 with segmentation-type annotations and data set 2 with relation-type annotations using the labeling software Label Studio, to obtain training data set 1 and training data set 2;
Step S4, constructing a multi-modal information extraction model and training it on training data set 1 and training data set 2 respectively, to generate the weight parameters of the first information extraction model and of the second information extraction model;
Step S5, based on the trained first and second information extraction models, automatically structuring the data of the preprocessed quality inspection report, and reading the required fields from the JSON file in which the extraction result is stored;
the multi-mode information extraction model is specifically as follows:
the input sequence comprises a text part and a visual part; for the text part, an OCR tool extracts the characters in the picture together with their coordinate information, and the document tokens produced by the serialization module are used as the text sequence; following the preprocessing of BERT-style models, two special tokens, [CLS] and [SEP], are added at the beginning and the end of the text sequence, respectively;
the text embedding of the token sequence T is expressed as:
$$T = E_{tk}(T) + E_{1p}(T) + E_{typ}(T) \quad (1)$$
where $E_{tk}$, $E_{1p}$, and $E_{typ}$ denote the token embedding layer, the 1D position embedding layer, and the token type embedding layer, respectively;
for feature extraction of the visual part, Faster-RCNN is adopted as the backbone of the visual encoder: the document image is fed into the visual backbone, an adaptive pooling layer converts the output into a feature map with fixed width W and height H, the feature map is flattened into a visual sequence V, and a linear layer $F_{vs}(\cdot)$ projects each visual token to the same dimension as the text embedding; as with the text embedding, the visual embedding is generated taking the 1D position and token type into account,
the visual embedding of the visual sequence V is expressed as:
$$V = F_{vs}(V) + E_{1p}(V) + E_{typ}(V) \quad (2)$$
where $F_{vs}$, $E_{1p}$, and $E_{typ}$ denote the linear projection layer, the 1D position embedding layer, and the token type embedding layer, respectively.
2. The method for extracting food quality inspection report data based on multi-modal information extraction according to claim 1, wherein the preprocessing specifically comprises: determining whether the original data is a PDF file or a picture file; if it is a PDF, automatically converting it page by page into picture format according to page number; if it is a picture, determining from its aspect ratio whether it is a spliced picture and, if so, automatically cutting it into separate pictures; then automatically screening out the first page of each quality inspection report, extracting the report number field and the page number from it using OCR or an information extraction method, and combining the report number with the page number as the unique identifier of each report, which is used to rename the picture data of the quality inspection report.
3. The method for extracting food quality inspection report data based on multi-modal information extraction according to claim 1, wherein, for each text token, the OCR tool obtains its 2D coordinates together with the width and height of the bounding box, $(x_0, y_0, x_1, y_1, w, h)$, where $(x_0, y_0)$ are the coordinates of the upper left corner of the bounding box and $(x_1, y_1)$ those of the lower right corner;
all coordinate values are normalized;
separate embedding layers are constructed in the horizontal and vertical directions, as shown in the following formula:
$$L = E_{2x}(x_0, x_1, w) + E_{2y}(y_0, y_1, h) \quad (3)$$
where $E_{2x}$ is the x-axis embedding layer and $E_{2y}$ denotes the y-axis embedding layer; the embedding of each text and visual token is integrated with its corresponding layout embedding, and the two parts are finally combined into a long sequence of length $N + HW$, where $N$ is the maximum length of the text part, and the resulting H is the final input of the model:
$$H = [T + L;\ V + L] \quad (4)$$
4. The method for extracting food quality inspection report data based on multi-modal information extraction according to claim 1, wherein the extraction result is saved to a database and synchronized to a manual verification platform for manual verification, the extracted structured data is corrected based on the verification result, and the first information extraction model and the second information extraction model are further trained on the corrected structured data to optimize the models.
CN202311492020.0A 2023-11-10 2023-11-10 Food quality inspection report data extraction method based on multi-mode information extraction Active CN117237971B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311492020.0A CN117237971B (en) 2023-11-10 2023-11-10 Food quality inspection report data extraction method based on multi-mode information extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311492020.0A CN117237971B (en) 2023-11-10 2023-11-10 Food quality inspection report data extraction method based on multi-mode information extraction

Publications (2)

Publication Number Publication Date
CN117237971A (en) 2023-12-15
CN117237971B (en) 2024-01-30

Family

ID=89091562

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311492020.0A Active CN117237971B (en) 2023-11-10 2023-11-10 Food quality inspection report data extraction method based on multi-mode information extraction

Country Status (1)

Country Link
CN (1) CN117237971B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836650A (en) * 2021-02-05 2021-05-25 广东电网有限责任公司广州供电局 Semantic analysis method and system for quality inspection report scanning image table
CN114529932A (en) * 2022-02-17 2022-05-24 北京译图智讯科技有限公司 Credit investigation report identification method
US20220277218A1 (en) * 2021-02-26 2022-09-01 Inception Institute of Artificial Intelligence Ltd Domain specific pre-training of cross modality transformer model
CN115331075A (en) * 2022-08-11 2022-11-11 杭州电子科技大学 Countermeasures type multi-modal pre-training method for enhancing knowledge of multi-modal scene graph
CN116229494A (en) * 2023-02-15 2023-06-06 上海万达信息系统有限公司 License key information extraction method based on small sample data
CN116543404A (en) * 2023-05-09 2023-08-04 重庆师范大学 Table semantic information extraction method, system, equipment and medium based on cell coordinate optimization
CN116912847A (en) * 2023-07-11 2023-10-20 平安科技(深圳)有限公司 Medical text recognition method and device, computer equipment and storage medium
EP4266195A1 (en) * 2022-04-19 2023-10-25 Microsoft Technology Licensing, LLC Training of text and image models

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
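Zhao Qiang (赵强): "Product summary extraction model incorporating multimodal information" (融合多模态信息的产品摘要抽取模型), Journal of Computer Applications (《计算机应用》), pages 1 - 7 *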

Also Published As

Publication number Publication date
CN117237971B (en) 2024-01-30

Similar Documents

Publication Publication Date Title
CN109840519B (en) Self-adaptive intelligent bill identification and input device and application method thereof
Huang et al. Icdar2019 competition on scanned receipt ocr and information extraction
CN112508011A (en) OCR (optical character recognition) method and device based on neural network
Caldeira et al. Industrial optical character recognition system in printing quality control of hot-rolled coils identification
CN106845467B (en) Aeronautical maintenance work card action recognition methods based on optical character recognition technology
CN115588202B (en) Contour detection-based method and system for extracting characters in electrical design drawing
CN113901933A (en) Electronic invoice information extraction method, device and equipment based on artificial intelligence
CN111461133A (en) Express delivery surface single item name identification method, device, equipment and storage medium
CN117237971B (en) Food quality inspection report data extraction method based on multi-mode information extraction
CN104123527A (en) Mask-based image table document identification method
CN115713775B (en) Method, system and computer equipment for extracting form from document
CN112906817A (en) Intelligent image labeling method
CN110443306B (en) Authenticity identification method for wine cork
JP2004178010A (en) Document processor, its method, and program
CN112613367A (en) Bill information text box acquisition method, system, equipment and storage medium
CN113743159A (en) OCR method applied to power enterprises
CN116681997A (en) Classification method, system, medium and equipment for bad scene images
CN114996500B (en) Trademark graph retrieval method
CN111241955B (en) Bill information extraction method and system
CN114429573A (en) Data enhancement-based household garbage data set generation method
CN109325557B (en) Data intelligence acquisition method based on computer visual image identification
CN114463767A (en) Credit card identification method, device, computer equipment and storage medium
CN109739981B (en) PDF file type judgment method and character extraction method
CN113657373A (en) Automatic document cataloguing method
CN112733735B (en) Method for classifying and identifying drawing layout by adopting machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant