CN116434245A - Multi-mode document classification method, system, computer equipment and storage medium - Google Patents
- Publication number
- CN116434245A (application number CN202310378465.XA)
- Authority
- CN
- China
- Prior art keywords
- classification
- text
- document
- image
- classified
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19173—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/146—Aligning or centring of the image pick-up or image-field
- G06V30/1475—Inclination or skew detection or correction of characters or of image to be recognised
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/15—Cutting or merging image elements, e.g. region growing, watershed or clustering-based techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
Abstract
The invention provides a multi-mode document classification method, system, computer device and storage medium, applied to a multi-mode document classification system. The method comprises: coarsely classifying N document images to be classified to obtain a first coarse classification document image; performing text detection on the first coarse classification document image to obtain at least one text region; segmenting each text region to obtain at least one text image block; inputting the text image blocks into a pre-trained field recognition model for category subdivision to obtain subdivision labels; and determining the target category of the document image to be classified according to the subdivision labels to obtain a target classification document image, thereby solving the classification problem of many categories with few training samples and improving classification efficiency.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, a system, a computer device, and a storage medium for classifying documents in multiple modes.
Background
In the financial insurance industry, different business approval processes generate different business document files, and these files must be classified before they can be processed. The financial field contains many independent evidence documents that are difficult to categorize; a single category may also suffer from template replacement or multiple template styles, and documents of different categories can be so similar that they are hard to distinguish. These classification difficulties cannot be resolved simply by large-scale training, because training samples are insufficient. At present, image classification can combine coarse-granularity and fine-granularity classification methods, but the problem of many categories with few training samples remains hard to solve, so a scheme that can accurately distinguish different documents is needed in business.
Disclosure of Invention
The invention mainly aims to provide a multi-mode document classification method, system, computer device and storage medium, so as to solve the technical problem of low efficiency in existing large-batch document classification.
In order to achieve the above object, the present invention provides a multi-modal document classification method, the method comprising:
coarse classification is carried out on the N to-be-classified document images, and a first coarse classification document image is obtained;
extracting the first rough classification document image for text detection to obtain at least one text region;
dividing each text region to obtain at least one text image block;
inputting the text image block into a pre-trained field recognition model for category subdivision to obtain subdivision labels;
and determining the target category of the document image to be classified according to the subdivision label to obtain a target classification document image.
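Purely as an illustrative sketch, the five claimed steps can be arranged as a toy pipeline. All function names, the string-based "regions", and the subdivision rule below are placeholders invented here for illustration, not part of the patent:

```python
# Toy pipeline mirroring the five claimed steps; every stage is a stub.
def coarse_classify(image):
    # stand-in for the coarse classifier: template documents go to the
    # second coarse class, everything else to the first
    return "second_coarse" if image.get("template") else "first_coarse"

def detect_text_regions(image):
    # stand-in for text detection: regions are given directly as strings
    return image.get("regions", [])

def split_region(region, block_len=4):
    # stand-in for segmenting a text region into text image blocks
    return [region[i:i + block_len] for i in range(0, len(region), block_len)]

def recognize(block):
    # stand-in for the pre-trained field recognition model
    return block.strip().lower()

def classify_document(image):
    if coarse_classify(image) != "first_coarse":
        return "second_coarse"  # template class: no fine classification needed
    labels = [recognize(b)
              for region in detect_text_regions(image)
              for b in split_region(region)]
    # toy subdivision rule: any block containing "inv" marks an invoice
    return "invoice" if any("inv" in lab for lab in labels) else "unknown"

print(classify_document({"template": False, "regions": ["INVOICE NO 42"]}))
```

Each stub corresponds to one claimed step; a real system would replace them with the models described in the embodiments below.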
Optionally, the coarse classification is performed on the N document images to be classified to obtain a first coarse classification document image, and the method includes:
performing image cutting on the document image to be classified to obtain at least one area to be selected;
performing redundancy removal on the regions to be selected according to a preset suppression condition to obtain at least one selected region;
inputting the selected area into a pre-trained classification model, and obtaining a first confidence coefficient of the selected area, wherein the first confidence coefficient is used for determining the probability that the document image to be classified is a first coarse classification in the coarse classification;
and extracting the to-be-classified document image corresponding to the first confidence coefficient meeting the preset condition to obtain a first coarse classification document image.
Optionally, the extracting the first coarse classification document image performs text detection to obtain at least one text region, and the method includes:
extracting features of the first coarse classification document image to obtain at least one feature image block;
predicting the rotation angle of the first coarse classification document image according to the feature image block to obtain an adjustment angle;
performing angle adjustment on the first coarse classification document image according to the adjustment angle;
and performing text detection on the first rough classification document image with the angle adjusted to obtain at least one text region comprising text.
Optionally, the segmenting each text region to obtain at least one text image block, and the method includes:
extracting shallow features of the text region to obtain a text shallow feature set;
performing instance prediction on the text region according to the text shallow feature set to obtain at least one text instance;
and dividing the text instance according to the coordinates of the text region to obtain at least one text image block.
Optionally, the determining the target category of the document image to be classified according to the subdivision label obtains a target classification document image, and the method includes:
extracting corresponding fields in the text image block based on the subdivision labels to obtain matching fields;
extracting a preset matching rule to perform category matching on the matching field to obtain a target classification document image;
and if the matching abnormality is generated, inputting the first coarse classification document corresponding to the matching abnormality into the classification model for classification, and obtaining a target classification document image.
Optionally, after the coarse classification of the N document images to be classified, the method further includes:
performing image cutting on the document image to be classified to obtain at least one area to be selected;
performing redundancy removal on the regions to be selected according to the preset suppression condition to obtain at least one selected region;
inputting the selected region into a pre-trained classification model to obtain a second confidence coefficient of the selected region, wherein the second confidence coefficient is used for determining the probability that the document image to be classified is the second coarse classification in the coarse classification;
And extracting the to-be-classified document image corresponding to the second confidence coefficient meeting the preset condition to obtain a second coarse classification document image.
Optionally, the field recognition model includes a deep convolutional layer, a recurrent layer and a transcription layer, and the inputting the text image block into a pre-trained field recognition model for category subdivision to obtain subdivision labels comprises the following steps:
inputting the text image block into the deep convolutional layer for feature recognition to obtain a feature sequence;
performing label prediction on the feature sequence by using the recurrent layer to obtain a prediction distribution;
and performing de-duplication integration on the prediction distribution and the feature sequence by using the transcription layer to obtain the subdivision label.
In addition, in order to achieve the above object, the present invention also provides a multi-modal document classification system, the system comprising:
coarse classification module: the method comprises the steps of performing coarse classification on N to-be-classified document images to obtain a first coarse classification document image;
text detection module: the method comprises the steps of extracting a first rough classification document image to perform text detection to obtain at least one text region;
text segmentation module: the method comprises the steps of dividing each text region to obtain at least one text image block;
a text recognition module: the text image block is input into a pre-trained field recognition model to conduct category subdivision, and subdivision labels are obtained;
and a fine classification module: and the target classification document image is obtained by determining the target category of the single-needle image to be classified according to the subdivision label.
In addition, to achieve the above object, the present invention also provides a computer device including a memory and a processor, the memory storing a computer program, the processor implementing the steps of the multimodal document classification method as described above when executing the computer program.
In addition, in order to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a data processing program which, when executed by a processor, implements the steps of the multimodal document classification method as described above.
Compared with the prior art, the embodiment of the application has the following main beneficial effects:
the multi-mode document classification method comprises the steps of: coarsely classifying N document images to be classified to obtain a first coarse classification document image; performing text detection on the first coarse classification document image to obtain at least one text region; segmenting each text region to obtain at least one text image block; inputting the text image blocks into a pre-trained field recognition model for category subdivision to obtain subdivision labels; and determining the target category of the document image to be classified according to the subdivision labels to obtain a target classification document image, thereby solving the classification problem of many categories with few training samples and improving classification efficiency.
Drawings
For a clearer description of the solution in the present application, a brief description will be given below of the drawings that are needed in the description of the embodiments of the present application, it being obvious that the drawings in the following description are some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a multimodal document classification method according to the present application;
FIG. 3 is a block diagram of one embodiment of a multimodal document classification system according to the present application;
FIG. 4 is a schematic structural diagram of one embodiment of a computer device according to the present application.
Detailed Description
The multi-modal document classification method provided in the embodiments of the present invention is applied to a data processing system. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains. The terminology used in the description is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The terms "comprising" and "having" and any variations thereof in the description, claims and drawings are intended to cover non-exclusive inclusions. The terms "first", "second" and the like in the description, claims and drawings are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to better understand the technical solutions of the present application, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social networking platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, electronic book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the multi-modal document classification method provided in the embodiments of the present application is generally executed by a server/terminal device, and accordingly, the multi-modal document classification system is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow chart of one embodiment of a multimodal document classification method according to the present application is shown. The embodiments of the application can acquire and process the related data based on artificial intelligence technology. The multi-mode document classification method comprises the following steps:
s210: and carrying out coarse classification on the N to-be-classified document images to obtain a first coarse classification document image.
Specifically, in this embodiment, the document image to be classified is from any system of databases for inputting documents to be classified, and is not limited herein, and the document image to be classified includes various types of documents, certificates, tickets and their equivalents related to the financial insurance industry, and each document image to be classified includes image data, text data and a combination of the image data and the text data, so that a great amount of document data to be classified can be classified by using artificial intelligence, and classification efficiency can be effectively improved.
Further, in this step, a classification model constructed from an NTS-Net model may be used to coarsely classify the document image to be classified. First, image cutting is performed on the document image to be classified: the cutting is based on the features to be identified, and the regions of a single document image that contain those features are cropped out to obtain at least one region to be selected. Further, redundancy removal is performed on the regions to be selected according to a preset suppression condition to obtain at least one selected region: the Navigator agent module in NTS-Net filters the extracted candidate regions to remove the redundant ones, and the suppression condition can be understood as the Navigator agent's way of suppressing redundant feature images. In this embodiment, the Navigator agent may further extract multiple candidate regions through an improved anchor mechanism and filter out the redundant ones through NMS (non-maximum suppression), yielding a filtered set of selected regions {G'_1, G'_2, ..., G'_A}, where G'_n denotes the nth selected region. The region size extracted by the Navigator agent's anchor mechanism is not limited here.
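The NMS filtering mentioned above can be sketched as follows. This is the generic textbook formulation of non-maximum suppression, not code from the patent, and the 0.5 overlap threshold is an assumed value:

```python
def iou(a, b):
    # intersection-over-union of two boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    # keep the highest-scoring box of each overlapping group,
    # suppressing any box whose IoU with a kept box exceeds the threshold
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first and is dropped
```

The kept indices correspond to the filtered set of selected regions passed on to the classification model.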
The selected regions are input into a pre-trained classification model, which can be constructed from the NTS-Net model; confidence calculation is performed on the selected regions by the Teacher agent module in the NTS-Net model to obtain the first confidence coefficient of each selected region, wherein the first confidence coefficient is used for determining the probability that the document image to be classified is the first coarse classification in the coarse classification.
Further, the document image to be classified corresponding to a first confidence coefficient meeting a preset condition is extracted to obtain the first coarse classification document image. The preset condition mentioned in this embodiment can be understood as a preselected setting for the coarse classification category. The first coarse classification category may be the non-template category: when the coarse classification model classifies a document image to be classified and the first confidence coefficient corresponding to the non-template category reaches the preset probability, the image is classified into the first coarse classification category.
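The confidence-based extraction step reduces to a simple filter. A minimal sketch, assuming the preset condition is a fixed probability threshold (the 0.8 value and the identifiers are invented for illustration):

```python
def filter_by_confidence(images, confidences, threshold=0.8):
    # keep document images whose first confidence coefficient
    # meets the preset condition (here: >= an assumed threshold)
    return [img for img, c in zip(images, confidences) if c >= threshold]

imgs = ["doc_a", "doc_b", "doc_c"]
confs = [0.95, 0.40, 0.85]
print(filter_by_confidence(imgs, confs))  # ['doc_a', 'doc_c']
```

Images passing the filter form the first coarse classification set; the remainder can be re-scored against the second coarse classification as described in the next embodiment.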
In another preferred embodiment, after the document image to be classified is cut, the method further classifies it into the second coarse classification category. In this embodiment, the second coarse classification category is the template category: when the coarse classification model classifies a document image to be classified and the second confidence coefficient corresponding to the template category reaches the preset probability, the image is classified into the second coarse classification category.
S220: and extracting the first rough classification document image to perform text detection to obtain at least one text region.
Specifically, in this embodiment, fine-granularity classification is further required for the first coarse classification. It should be noted that a coarse classification document image under the first coarse classification is non-template. Feature extraction is performed on the first coarse classification document image to obtain at least one feature image block, and angle prediction is performed on the feature image blocks to predict the rotation angle of the first coarse classification document image; this embodiment can implement the prediction in a GhostNet layer carried in the field recognition model. The predicted angle is compared with the image angle of the first coarse classification document image, and if a deviation exists, the angle is adjusted according to the predicted angle. Text detection is then performed on the angle-adjusted first coarse classification document image to obtain at least one text region containing text.
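As a simplified illustration of the angle-adjustment step: the patent's GhostNet-based prediction is replaced here by an assumed pre-computed angle, the image by a toy 2-D grid, and the correction snapped to 90-degree steps (all simplifications invented for this sketch):

```python
def rotate90(grid):
    # rotate a 2-D grid 90 degrees clockwise
    return [list(row) for row in zip(*grid[::-1])]

def correct_skew(grid, predicted_angle):
    # snap the predicted rotation to the nearest multiple of 90 degrees
    # and apply that many quarter-turns as the angle adjustment
    steps = (round(predicted_angle / 90) * 90 % 360) // 90
    for _ in range(int(steps)):
        grid = rotate90(grid)
    return grid

page = [[1, 2],
        [3, 4]]
print(correct_skew(page, 90))  # [[3, 1], [4, 2]]
```

A production system would instead rotate the pixel raster by the exact predicted angle before running text detection.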
s230: and dividing each text region to obtain at least one text image block.
Specifically, in this step, the segmentation mode may be set according to actual needs. For example, in some embodiments, the text region may be segmented in a certain proportion based on its height to obtain at least one text image block; or segmented at a fixed length to obtain at least one text image block; or segmented uniformly at intervals based on the overall length of the text region to obtain multiple text image blocks of identical length.
Optionally, a PSENet layer can be carried to perform text prediction on the text image block: shallow features of the text region are extracted to obtain a text shallow feature set; instance prediction is performed on the text region according to the text shallow feature set to obtain at least one text instance; and the text instance is segmented according to the coordinates of the text region to obtain at least one text image block.
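The fixed-length cutting option described above can be sketched as a computation over pixel offsets; the 32-pixel block length is an assumed value, and the last block simply keeps the remainder:

```python
def cut_text_region(region_width, block_len=32):
    # split a text region of the given pixel width into fixed-length
    # blocks, returned as (start, end) pixel offsets
    cuts = range(0, region_width, block_len)
    return [(start, min(start + block_len, region_width)) for start in cuts]

print(cut_text_region(100))  # [(0, 32), (32, 64), (64, 96), (96, 100)]
```

Each (start, end) pair would be used to crop one text image block out of the detected region.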
S240: and inputting the text image block into a pre-trained field recognition model to conduct category subdivision, and obtaining subdivision labels.
Specifically, extracting corresponding fields in the text image block based on the subdivision labels to obtain matching fields; and extracting a preset matching rule to perform category matching on the matching field, and obtaining the target classification document image.
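A minimal sketch of rule-based category matching on the extracted fields. The actual preset matching rules are not disclosed in the patent; the regular expressions and category names below are hypothetical:

```python
import re

# hypothetical preset matching rules: each category maps to a field pattern
RULES = {
    "policy": re.compile(r"policy\s*no[.:]?\s*\w+", re.I),
    "invoice": re.compile(r"invoice\s*no[.:]?\s*\w+", re.I),
}

def match_category(field_text):
    # return the first category whose preset rule matches the field;
    # None models a matching abnormality, handled by falling back to
    # the classification model
    for category, pattern in RULES.items():
        if pattern.search(field_text):
            return category
    return None

print(match_category("Invoice No: A1234"))  # invoice
```

Returning None corresponds to the matching abnormality case described next, where the document is routed back to the classification model.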
In detail, the field is identified by OCR character recognition, the corresponding field data is extracted, and the corresponding sub-category information under the first coarse classification category is obtained. The text image blocks are input into a pre-trained field recognition model for category subdivision, wherein the field recognition model comprises a deep convolutional layer, a recurrent layer and a transcription layer. The text image block is input into the deep convolutional layer for feature recognition to obtain a feature sequence; label prediction is performed on the feature sequence by the recurrent layer to obtain a prediction distribution; and de-duplication integration is performed on the prediction distribution and the feature sequence by the transcription layer to obtain the subdivision label.
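In CRNN-style recognizers of this shape (convolutional layer, recurrent layer, transcription layer), the transcription layer's de-duplication integration is typically realized as CTC-style greedy decoding. The following is a generic sketch under that assumption, not the patent's implementation:

```python
def ctc_greedy_decode(frame_labels, blank="-"):
    # transcription-layer style de-duplication: collapse consecutive
    # repeats, then drop the blank symbol
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return "".join(out)

# per-frame argmax labels taken from the recurrent layer's
# prediction distribution over the feature sequence
frames = list("--hh-ee-ll-ll--oo-")
print(ctc_greedy_decode(frames))  # hello
```

Note the blank between the two "l" runs: it resets the repeat-collapsing so the double letter survives de-duplication.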
In this step, the field recognition model performs feature extraction by using a recognition network corresponding to the first class information of each target object allocated to each target image after the segmentation process, and determines the second class information of each target object according to the extracted features of each target image.
In this embodiment, if the document image to be classified includes a plurality of target class images to be identified, when feature extraction is performed, first class information of each target object and position information of each target object in the document image to be classified are determined according to the features, and the target image of each target object is obtained by performing segmentation processing on the document image to be classified according to the first class information and the position information of each target object, a determination manner of the target image of a single target object is the same as a determination manner of the target image of only one target object in the document image to be classified, which is not described herein again.
In the technical scheme provided by the embodiment, when a document image to be classified comprises a plurality of target objects, an object recognition device performs segmentation processing on the document image to be classified according to first class information and position information of each target object to obtain target images of each target object, distributes each segmented target image to a recognition network corresponding to the first class information of each target object to perform feature extraction, and determines second class information of each target object based on the first class according to the extracted features of each target image.
If a matching abnormality is generated, the first coarse classification document corresponding to the matching abnormality is input into the classification model for classification to obtain the target classification document image.
S250: and determining the target category of the single-needle image to be classified according to the subdivision label, and obtaining a target classification document image.
Specifically, the target classification document images are output and aggregated: target classification document images of the same category are stored in the classification data, and a category index is constructed according to the category labels.
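The aggregation-and-index step can be sketched as grouping classified images by their category label; the function name and the use of a plain dictionary as the "category index" are illustrative assumptions, not the patent's storage scheme.

```python
# Sketch: group target classification document images by category label and
# build a category index mapping each label to its stored image identifiers.
from collections import defaultdict

def build_category_index(classified):
    """classified: iterable of (category_label, image_id) pairs."""
    index = defaultdict(list)
    for label, image_id in classified:
        index[label].append(image_id)
    return dict(index)
```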
Compared with the prior art, the embodiment of the application has the following main beneficial effects:
the multi-modal document classification method performs coarse classification on N document images to be classified to obtain a first coarse classification document image; extracts the first coarse classification document image for text detection to obtain at least one text region; divides each text region to obtain at least one text image block; inputs the text image blocks into a pre-trained field recognition model for category subdivision to obtain subdivision labels; and determines the target category of each document image to be classified according to the subdivision labels to obtain a target classification document image, thereby solving the classification problem of many categories with few training samples and improving classification efficiency.
Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored in a computer-readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a random access Memory (Random Access Memory, RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
With further reference to fig. 4, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a multi-modal document classification system 300, where the apparatus embodiment corresponds to the method embodiment shown in fig. 2, and the apparatus is particularly applicable to various electronic devices.
The embodiment of the invention provides a multi-modal document classification system 300, which comprises:
coarse classification module 301: configured to perform coarse classification on N document images to be classified to obtain a first coarse classification document image;
specifically, in this embodiment, the document images to be classified come from any system or database that inputs documents to be classified, which is not limited here. The document images to be classified include the various documents, certificates, tickets and their equivalents used in the finance and insurance industries, and each document image may contain image data, text data, or a combination of the two; classifying large volumes of such document data with artificial intelligence can therefore effectively improve classification efficiency.
Further, in this step, coarse classification can be performed on the document image to be classified with a classification model built on the NTS-Net model. First, image segmentation is performed on the document image to be classified: segmentation is based on the features to be identified, and the regions of a single document image that contain those features are cropped, yielding at least one candidate region.
Further, redundancy removal is performed on the candidate regions according to a preset suppression condition to obtain at least one selected region: the extracted candidate regions are filtered with the Navigator agent module in NTS-Net to remove redundant ones, and the suppression condition can be understood as the Navigator agent's way of suppressing redundant feature images. In this embodiment, the Navigator agent may further extract multiple candidate regions through an improved anchor mechanism and filter out the redundant ones with NMS (non-maximum suppression), yielding a filtered set of selected regions {G′1, G′2, …, G′A}, where G′n denotes the n-th selected region. The size of the candidate regions extracted by the Navigator agent's anchor mechanism is not limited here.
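The NMS filtering mentioned above can be sketched with a generic greedy non-maximum suppression over scored boxes; this is a minimal textbook NMS, not the patent's exact Navigator implementation, and the box format `(x1, y1, x2, y2)` is an assumption.

```python
# Minimal greedy NMS sketch: keep the highest-scoring boxes, suppressing any
# box whose IoU with an already-kept box exceeds the threshold.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.5):
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep  # indices of the retained candidate regions
```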
The selected region is input into a pre-trained classification model, where the trained classification model can be built on the NTS-Net model, and confidence calculation is performed on the selected region with the Teacher agent module of NTS-Net to obtain a first confidence of the selected region, the first confidence being used to determine the probability that the document image to be classified belongs to the first coarse classification among the coarse classifications.
Further, the document images to be classified whose first confidence meets a preset condition are extracted to obtain the first coarse classification document images. The preset condition mentioned in this embodiment can be understood as a preset setting for the coarse classification category; the first coarse classification category may be a non-template category, and when the coarse classification model classifies a document image to be classified and the corresponding first confidence reaches the preset probability, the image is classified into the first coarse classification category.
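The confidence-threshold routing in this step can be sketched as a simple filter; the threshold value of 0.8 and the function name are illustrative assumptions, not values taken from the patent.

```python
# Sketch: keep the document images whose first confidence meets the preset
# probability, i.e. the images routed into the first coarse classification.

def route_first_coarse(images_with_conf, preset_prob=0.8):
    """images_with_conf: list of (image_id, first_confidence) pairs."""
    return [img for img, conf in images_with_conf if conf >= preset_prob]
```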
In another preferred embodiment, the cut document images to be classified are further classified into a second coarse classification category. In this embodiment, the second coarse classification category is a template category: when the coarse classification model classifies a document image to be classified and the second confidence corresponding to the template category reaches the preset probability, the image is classified into the second coarse classification category.
Text detection module 302: configured to extract the first coarse classification document image for text detection to obtain at least one text region;
specifically, in this embodiment, fine-grained classification is further required for the first coarse classification. It should be noted that coarse classification document images under the first coarse classification are non-template. Feature extraction is performed on the first coarse classification document image to obtain at least one feature image block, and angle prediction is performed on the feature image block, thereby realizing angle prediction for the first coarse classification document image; this embodiment can realize it with the GhostNet layer carried in the field recognition model. The predicted angle is compared with the image angle of the first coarse classification document image, and if there is a deviation, the angle is adjusted according to the predicted angle. Text detection is then performed on the angle-adjusted first coarse classification document image to obtain at least one text region containing text.
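The angle-comparison-and-adjustment step can be sketched as a pure function that returns the corrective rotation; the tolerance value, the sign convention, and the function name are assumptions for illustration, not the patent's implementation.

```python
# Sketch: compare the model's predicted skew angle with the page's current
# angle (both in degrees) and return the rotation needed to correct it.
# A zero return means the deviation is within tolerance and no adjustment
# is made.

def angle_adjustment(image_angle: float, predicted_angle: float,
                     tolerance: float = 0.5) -> float:
    deviation = (predicted_angle - image_angle) % 360
    if deviation > 180:
        deviation -= 360                 # map into (-180, 180]
    return 0.0 if abs(deviation) <= tolerance else -deviation
```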
text segmentation module 303: configured to divide each text region to obtain at least one text image block;
specifically, in this step the segmentation mode may be set according to actual needs. For example, in some embodiments the text region may be segmented in a certain proportion based on its height to obtain at least one text image block; or it may be segmented at a fixed length to obtain at least one text image block; or it may be segmented at uniform intervals based on the overall length of the text region to obtain multiple text image blocks of identical length.
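The fixed-length strategy above (one of the three segmentation modes mentioned) can be sketched in coordinate form; the function name and the choice to return column spans rather than pixel crops are illustrative assumptions.

```python
# Sketch of fixed-length splitting: a text region of a given width is cut
# into blocks of block_len columns, the last block keeping the remainder.
# Actual pixel cropping would use these (start, end) spans.

def split_fixed_length(region_width: int, block_len: int):
    spans = []
    start = 0
    while start < region_width:
        spans.append((start, min(start + block_len, region_width)))
        start += block_len
    return spans
```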
Optionally, a PSENet layer can be used to perform text prediction on the text image block: shallow features of the text region are extracted to obtain a text shallow feature set; instance prediction is performed on the text region according to the text shallow feature set to obtain at least one text instance; and the text instance is divided according to the coordinates of the text region to obtain at least one text image block.
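The final cropping step, dividing text instances by region coordinates, can be sketched with a plain array slice; modelling the region as a nested list of pixel rows and the `(x1, y1, x2, y2)` box format are assumptions for illustration.

```python
# Sketch: each predicted text instance carries a bounding box in region
# coordinates; slicing the region yields one text image block per instance.

def crop_instances(region, boxes):
    """region: 2-D list of pixel rows; boxes: list of (x1, y1, x2, y2)."""
    blocks = []
    for x1, y1, x2, y2 in boxes:
        blocks.append([row[x1:x2] for row in region[y1:y2]])
    return blocks
```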
Text recognition module 304: configured to input the text image block into a pre-trained field recognition model for category subdivision to obtain subdivision labels;
specifically, extracting corresponding fields in the text image block based on the subdivision labels to obtain matching fields; and extracting a preset matching rule to perform category matching on the matching field, and obtaining the target classification document image.
In detail, the fields are recognized by OCR character recognition, the corresponding field data is extracted, and the corresponding sub-category information under the first coarse classification category is obtained. The text image block is input into a pre-trained field recognition model for category subdivision, where the field recognition model comprises a deep convolution layer, a recurrent layer and a transcription layer. The text image block is input into the deep convolution layer for feature recognition to obtain a feature sequence; label prediction is performed on the feature sequence with the recurrent layer to obtain a prediction distribution; and de-duplication integration is performed on the prediction distribution and the feature sequence with the transcription layer to obtain the subdivision label.
If the matching fails, the first coarse classification document corresponding to the failed match is input into the classification model for classification to obtain the target classification document image.
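The match-then-fallback flow above can be sketched as follows; the rule set (predicates over the extracted field text) and the fallback model interface are placeholders, not the patent's actual matching rules.

```python
# Sketch: check the extracted matching field against preset category rules;
# if no rule matches, fall back to the classification model.

def classify_with_rules(field_text, rules, fallback_model):
    """rules: dict mapping a category label to a predicate over the field."""
    for category, predicate in rules.items():
        if predicate(field_text):
            return category              # rule match succeeded
    return fallback_model(field_text)    # match failed: use the classifier
```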
The fine classification module 305: configured to determine the target category of each document image to be classified according to the subdivision label to obtain the target classification document image.
Specifically, the target classification document images are output and aggregated: target classification document images of the same category are stored in the classification data, and a category index is built based on the category labels.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 4, fig. 4 is a basic structural block diagram of a computer device according to the present embodiment.
The computer device 5 comprises a memory 51, a processor 52 and a network interface 53 that are communicatively connected to each other via a system bus. It should be noted that only the computer device 5 with components 51-53 is shown in the figure, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead. Those skilled in the art will appreciate that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices, and the like.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 51 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Programmable Read-Only Memory (PROM), magnetic memory, magnetic disk, optical disk, and the like. In some embodiments, the memory 51 may be an internal storage unit of the computer device 5, such as a hard disk or memory of the computer device 5. In other embodiments, the memory 51 may also be an external storage device of the computer device 5, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card provided on the computer device 5. Of course, the memory 51 may also comprise both an internal storage unit and an external storage device of the computer device 5. In this embodiment, the memory 51 is typically used to store the operating system and various application software installed on the computer device 5, such as the program code of the multi-modal document classification method. Further, the memory 51 may be used to temporarily store various types of data that have been output or are to be output.
The processor 52 may in some embodiments be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 52 is typically used to control the overall operation of the computer device 5. In this embodiment, the processor 52 is configured to run the program code stored in the memory 51 or process data, for example, to run the program code of the multi-modal document classification method.
The network interface 53 may comprise a wireless network interface or a wired network interface, which network interface 53 is typically used to establish communication connections between the computer device 5 and other electronic devices.
The present application also provides another embodiment, namely, a computer readable storage medium storing the multi-modal document classification method program, where the multi-modal document classification method program is executable by at least one processor, so that the at least one processor performs the steps of the multi-modal document classification method as described above.
From the above description of the embodiments, it will be clear to those skilled in the art that the above embodiment methods may be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, though in many cases the former is preferred. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, or optical disk), comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the methods described in the embodiments of the present application.
The subject application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It is apparent that the embodiments described above are only some embodiments of the present application rather than all of them; the drawings show preferred embodiments of the present application but do not limit its patent scope. This application may be embodied in many different forms; these embodiments are provided so that the disclosure will be thorough. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their features. Any equivalent structure made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise falls within the protection scope of the present application.
Claims (10)
1. A method of multimodal document classification, the method comprising:
coarse classification is carried out on the N to-be-classified document images, and a first coarse classification document image is obtained;
extracting the first rough classification document image for text detection to obtain at least one text region;
dividing each text region to obtain at least one text image block;
inputting the text image block into a pre-trained field recognition model for category subdivision to obtain subdivision labels;
and determining the target category of the document image to be classified according to the subdivision label to obtain a target classification document image.
2. The multi-modal document classification method as claimed in claim 1, wherein said performing coarse classification on the N document images to be classified to obtain a first coarse classified document image includes:
performing image cutting on the document image to be classified to obtain at least one area to be selected;
redundant removal is carried out on the area to be selected according to a preset inhibition condition, and at least one selected area is obtained;
inputting the selected area into a pre-trained classification model, and obtaining a first confidence coefficient of the selected area, wherein the first confidence coefficient is used for determining the probability that the document image to be classified is a first coarse classification in the coarse classification;
and extracting the to-be-classified document image corresponding to the first confidence coefficient meeting the preset condition to obtain a first coarse classification document image.
3. The multi-modal document classification method as claimed in claim 2, wherein after said coarse classification of the N document images to be classified, the method further comprises:
performing image cutting on the document image to be classified to obtain at least one area to be selected;
redundant removal is carried out on the area to be selected according to a preset inhibition condition, and at least one selected area is obtained;
inputting the selected area into a pre-trained classification model, and obtaining a second confidence coefficient of the selected area, wherein the second confidence coefficient is used for determining the probability that the document image to be classified is a second coarse classification in the coarse classification;
and extracting the to-be-classified document image corresponding to the second confidence coefficient meeting the preset condition to obtain a second coarse classification document image.
4. The multi-modal document classification method as claimed in claim 1, wherein said extracting the first coarse classification document image for text detection to obtain at least one text region includes:
extracting the characteristics of the first rough classification document image to obtain at least one characteristic image block;
predicting the rotation angle of the first coarse classification document image according to the characteristic image block to obtain an adjustment angle;
performing angle adjustment on the first coarse classification document image according to the adjustment angle;
and performing text detection on the first rough classification document image with the angle adjusted to obtain at least one text region comprising text.
5. The method of claim 1, wherein said segmenting each text region to obtain at least one text image block comprises:
extracting shallow features of the text region to obtain a text shallow feature set;
performing instance prediction on the text region according to the text shallow feature set to obtain at least one text instance;
and dividing the text instance according to the coordinates of the text region to obtain at least one text image block.
6. The multi-modal document classification method as claimed in claim 1, wherein said determining the target category of the document image to be classified according to the subdivision label to obtain a target classification document image comprises:
extracting corresponding fields in the text image block based on the subdivision labels to obtain matching fields;
extracting a preset matching rule to perform category matching on the matching field;
if the matching is successful, obtaining a target classification document image;
and if the matching fails, extracting the first coarse classification document corresponding to the matching failure, inputting the first coarse classification document into the classification model for classification, and obtaining a target classification document image.
7. The multi-modal document classification method as claimed in claim 1, wherein said field recognition model comprises: a deep convolution layer, a recurrent layer, and a transcription layer,
inputting the text image block into a pre-trained field recognition model for category subdivision to obtain subdivision labels, wherein the method comprises the following steps:
inputting the text image block into the deep convolution layer for feature recognition to obtain a feature sequence;
performing label prediction on the characteristic sequence by using the recurrent layer to obtain a prediction distribution;
and carrying out de-duplication integration on the prediction distribution and the characteristic sequence by using a transcription layer to obtain the subdivision tag.
8. A multimodal document classification system, the system comprising:
a coarse classification module: configured to perform coarse classification on N document images to be classified to obtain a first coarse classification document image;
a text detection module: configured to extract the first coarse classification document image for text detection to obtain at least one text region;
a text segmentation module: configured to divide each text region to obtain at least one text image block;
a text recognition module: configured to input the text image block into a pre-trained field recognition model for category subdivision to obtain subdivision labels;
and a fine classification module: configured to determine the target category of the document image to be classified according to the subdivision label to obtain the target classification document image.
9. A computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the multimodal document classification method of any of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the multimodal document classification method of any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310378465.XA CN116434245A (en) | 2023-04-07 | 2023-04-07 | Multi-mode document classification method, system, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310378465.XA CN116434245A (en) | 2023-04-07 | 2023-04-07 | Multi-mode document classification method, system, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116434245A true CN116434245A (en) | 2023-07-14 |
Family
ID=87079224
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310378465.XA Pending CN116434245A (en) | 2023-04-07 | 2023-04-07 | Multi-mode document classification method, system, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116434245A (en) |
-
2023
- 2023-04-07 CN CN202310378465.XA patent/CN116434245A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||