CN114780773B - Document picture classification method and device, storage medium and electronic equipment
- Publication number: CN114780773B (application CN202210253277.XA)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/55—Clustering; Classification
Abstract
The embodiments of this specification disclose a document picture classification method and apparatus, a storage medium, and an electronic device. The method comprises: acquiring information of at least two modalities of a target document picture to be classified, such as image information and text information, and fusing the information of the at least two modalities to obtain multimodal fusion information, so that the target document picture is understood and classified through analysis and processing of the multimodal information.
Description
Technical Field
The embodiments of this specification relate to the field of natural language processing, and in particular to a document picture classification method and apparatus, a storage medium, and an electronic device.
Background
A document picture is a picture containing a plurality of characters, and document picture classification is the technique of classifying document pictures into preset categories using natural language processing methods. As a foundational natural language processing technique, document picture classification is widely applied in fields such as data mining and text processing. In the digital age, sorting and organizing texts is a major pain point for many enterprises. For example, hospitals receive large amounts of text data every day, including medical slips, pay slips, medication slips, CT sheets, and the like.
Disclosure of Invention
The embodiments of this specification provide a document picture classification method and apparatus, a storage medium, and an electronic device, which can automate document picture classification and improve the accuracy of document picture classification and organization. The technical solution is as follows:
In a first aspect, the embodiments of this specification provide a document picture classification method, where the method includes:
Acquiring image information and text information of a target document picture;
Performing multimodal fusion processing on the text information and the image information to obtain multimodal fusion information;
And obtaining classification information of the target document picture according to the multimodal fusion information.
In a second aspect, embodiments of the present disclosure provide a document picture classification apparatus, the apparatus including:
The information acquisition module is used for acquiring image information and text information of a target document picture;
The fusion information module is used for performing multimodal fusion processing on the text information and the image information to obtain multimodal fusion information;
And the document classification module is used for obtaining classification information of the target document picture according to the multimodal fusion information.
In a third aspect, the embodiments of this specification provide a computer storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the above method steps.
In a fourth aspect, embodiments of the present disclosure provide an electronic device, which may include: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the above-mentioned method steps.
The technical solutions provided by some embodiments of this specification bring at least the following benefits:
According to the embodiments of this specification, information of at least two modalities included in a document picture, such as image information and text information, is fused to understand and classify the document picture. Compared with related-art techniques that classify document pictures using information of a single modality only, the embodiments of this specification exploit the complementarity of information across different modalities, which improves the accuracy of understanding and classifying document pictures, offers good robustness, and better meets the classification requirements of document pictures in complex usage environments.
Drawings
To describe the technical solutions in the embodiments of this specification or in the prior art more clearly, the drawings required for the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description are merely some embodiments of this specification; for a person of ordinary skill in the art, other drawings can be obtained from them without inventive effort.
FIGS. 1A-1C are schematic diagrams of some document pictures to be classified according to embodiments of the present disclosure;
FIG. 2 is a schematic flow chart of a document picture classification method according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart of document picture classification according to an embodiment of the present disclosure;
FIG. 4 is a schematic flow chart of a document picture classification method according to an embodiment of the present disclosure;
FIG. 5 is a schematic flow chart of document picture classification according to an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of a neural network according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a document picture classification apparatus according to an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of this specification are described below clearly and completely with reference to the drawings in those embodiments. The described embodiments are only some, not all, of the embodiments of this specification. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this specification without inventive effort shall fall within the protection scope of the embodiments of this specification.
In the description of the embodiments of this specification, it should be understood that terms such as "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. It should also be noted that, unless explicitly stated and limited otherwise, the terms "comprise" and "have" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to the listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus. The specific meanings of these terms in the embodiments of this specification will be understood by those skilled in the art. Furthermore, in the description of the embodiments of this specification, unless otherwise indicated, "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships are possible; for example, "A and/or B" may indicate: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects are in an "or" relationship.
The embodiments of the present specification will be described in detail with reference to specific examples.
Document picture classification is a technique aimed at extracting and structuring unstructured information in scanned files or digital business documents (images, PDF files, etc.) to obtain classification information for document pictures. In the embodiments of this specification, scanned files and digital business documents are collectively referred to as document pictures. Document pictures come from rich sources: for example, when residents or enterprises handle various matters on the Internet, the required document materials are converted into images and uploaded, yielding document pictures corresponding to those materials. Taking financial auditing as an example, when residents or enterprises handle a financial audit on the Internet, they need to photograph or scan checks, deposit agreements, and the like into images and upload them to the transaction platform, so the types of document pictures collected by the platform include at least checks and deposit agreements.
Document pictures contain rich text information and image information, and document pictures of different classification types differ in the characteristics of both. As shown in FIGS. 1A-1C, which are schematic diagrams of some document pictures provided in the embodiments of this specification, FIG. 1A is an invoice-type document picture, FIG. 1B is a menu-type document picture, and FIG. 1C is a case-type document picture. The embodiments also cover document pictures of types such as receipts and business reports; FIGS. 1A-1C are only examples.
Sorting and organizing texts manually consumes a great deal of office staff's time while remaining error-prone and inefficient. In traditional document picture analysis, the most common classification method is to extract the text content of the document contained in the picture and classify the document picture according to that text content. For example, the text content is extracted through optical character recognition (OCR), then recognized and understood through a Bidirectional Encoder Representations from Transformers (BERT) model, and keywords are extracted to classify the document picture. Other document picture analysis techniques classify document pictures according to image-modality information by structuring the document layout information contained in the picture, that is, by extracting the image-modality information of the document picture. However, the above methods generally understand and classify a document picture through single-modality information only: they neither effectively utilize the multiple modalities of information contained in the document picture nor consider the associations among those modalities and fuse them.
In one embodiment, as shown in FIG. 2, a document picture classification method is presented. The method may be implemented by a computer program and may run on a document picture classification device based on the von Neumann architecture. The computer program may be integrated into an application or run as a stand-alone utility application.
Specifically, the method comprises the following steps:
S101, acquiring image information and text information of a target document picture.
Text information can be understood as the information corresponding to the text content in a document picture. Text content refers to text segments composed of Chinese or foreign-language characters that express meaning, covering any amount of text, such as sentences, paragraphs, and articles, and any language, such as English, Chinese, or German. The information corresponding to the text content includes the characters, their semantics, their positions, punctuation, repeated words, and other information a person skilled in the art may need to extract.
Image information can be understood to include visual information such as the overall page style of the document picture, the local image information corresponding to text regions in the document picture, and the image information corresponding to non-text regions. Extracting the local image information corresponding to text regions captures finer detail features, and the image information corresponding to non-text regions may also contain key information pointing to the classification of the document picture.
In other embodiments, the image information may further include: information on the number of forms in the document picture, feature information representing the ratio of form area to image area, feature information representing the ratio of handwritten text to total text, feature information representing the ratio of printed text to total text, and the like.
In one embodiment, before acquiring the image information and the text information of the target document picture, the method further includes: preprocessing the target document picture. For example, preprocessing includes one or more of the following: image deblurring, image brightness enhancement, image contrast enhancement, image super-resolution reconstruction, and image correction. For instance, large-direction and small-angle correction of the document picture is achieved by a four-direction rotation technique and a perspective-correction technique. In this embodiment, the quality of the document picture is enhanced by image processing, which improves the accuracy and information content of the extracted text information and image information and, in turn, the accuracy and reliability of document picture classification.
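As an illustration of this preprocessing step, the following is a minimal Python sketch using OpenCV; the enhancement method (CLAHE) and all parameter values are assumptions chosen for the sketch, not techniques prescribed by this embodiment.

```python
import cv2
import numpy as np

def preprocess_document_image(path: str) -> np.ndarray:
    """Illustrative preprocessing: contrast enhancement and small-angle
    deskew. Methods and parameter values are assumptions for this sketch."""
    img = cv2.imread(path)

    # Contrast/brightness enhancement via CLAHE on the luminance channel.
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    img = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)

    # Small-angle correction: estimate the dominant skew angle from the
    # minimum-area rectangle around the foreground pixels.
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    thresh = cv2.threshold(gray, 0, 255,
                           cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    angle = angle - 90 if angle > 45 else angle
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, m, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
```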
In one embodiment, the method for acquiring the text information and image information in the target document picture includes: segmenting the target document included in the target document picture based on a preset segmentation unit to obtain at least one sub-text, the preset segmentation unit including at least one of the following: word units, sentence units, and paragraph units; and acquiring text information corresponding to each sub-text as well as the image information corresponding to that sub-text in the target document picture.
Specifically, the target document included in the target document picture is acquired; for example, the text content included in the target document picture can be extracted through optical character recognition (OCR).
Further, the target document is segmented based on the preset segmentation unit to obtain at least one sub-text. Segmenting the target document means splitting a sequence of Chinese (or another language) into individual Chinese words or words of the other language, each sub-text corresponding to one such word. For example, the mature jieba word segmentation system may be used; the specific segmentation method is not limited. The preset segmentation unit includes at least one of the following: character units, word units, sentence units, and paragraph units, configured according to the needs of those skilled in the relevant art.
In one embodiment, obtaining at least one sub-text in the target document includes: cleaning the target document or the acquired at least one sub-text. For example, cleaning the target document includes removing stop words and special symbols, which prevents a stop word or special symbol from being taken as a sub-text. Suppose that after processing the number of texts is f and the vocabulary size is C. Removing stop words noticeably reduces C and eliminates redundant information. Special symbols here include punctuation marks, currency symbols, mathematical symbols, and the like occurring in sentences. The target document is cleaned, and the at least one sub-text is then acquired from the cleaned target document.
For another example, the acquired at least one sub-text is cleaned by counting the frequency of occurrence of each particular sub-text in the target document and removing sub-texts according to that frequency. This embodiment counts the frequency of each particular sub-text and removes "extreme frequencies", i.e., sub-texts that occur extremely often or extremely rarely. A sub-text that occurs very frequently across all documents represents a common feature of all documents and contributes little to the document picture classification task. A sub-text that occurs very rarely across all texts is likely a rare word that cannot reflect the common features of a class of document pictures, so it is deleted. That is, the frequency of a retained sub-text ω_i must satisfy ε_low < Freq(ω_i) < ε_high, where ε_low and ε_high are the lower and upper frequency-filtering parameters, tuned to the specific text data. The acquired at least one sub-text is cleaned, and the text information and image information corresponding to each sub-text are then acquired from the cleaned sub-texts.
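A minimal sketch of the cleaning steps just described, using the jieba segmenter named above. The stop-word list and the thresholds ε_low and ε_high are illustrative assumptions, and for brevity the frequencies are counted within a single document here rather than over all documents as the embodiment describes.

```python
from collections import Counter
import jieba

STOP_WORDS = {"的", "了", "和", "是"}   # illustrative stop-word list
EPS_LOW, EPS_HIGH = 2, 10_000           # assumed frequency-filter parameters

def clean_and_segment(document: str) -> list[str]:
    # Segment the document into sub-texts at the word level.
    tokens = [t.strip() for t in jieba.cut(document)]
    # Remove stop words and special symbols (here: any non-alphanumeric token).
    tokens = [t for t in tokens if t and t not in STOP_WORDS and t.isalnum()]
    # Remove "extreme frequency" sub-texts: keep only tokens satisfying
    # eps_low < Freq(w) < eps_high.
    freq = Counter(tokens)
    return [t for t in tokens if EPS_LOW < freq[t] < EPS_HIGH]
```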
In this embodiment, cleaning the target document or the acquired at least one sub-text filters out unnecessary sub-texts, i.e., characters, words, or sentences that do not help, or even negatively affect, the classification of the document picture. This avoids acquiring text information for unnecessary sub-texts, improves the efficiency of document picture classification, and improves the accuracy and reliability of classification.
Further, according to the at least one sub-text, the text information corresponding to each sub-text is acquired, along with the image information corresponding to that sub-text in the target document picture. Specifically, acquiring the text information corresponding to each sub-text means one-hot encoding each sub-text according to its segmentation unit. For example, for sub-texts in character units a matrix is constructed at the character level, and for sub-texts in word units a matrix is constructed at the word level; the numbers of rows and columns of the matrix equal the number of distinct representations. The matrix values are initialized to 0, and in each row the value at the position of the corresponding identifier in the sequence is set to 1.
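A minimal sketch of this one-hot encoding; the example sub-texts are taken from the menu picture of FIG. 1B, and the square lookup-table formulation is an illustrative simplification.

```python
import numpy as np

def one_hot_table(sub_texts: list[str]) -> tuple[dict, np.ndarray]:
    """Build a one-hot lookup for the distinct sub-texts: the matrix has one
    row and one column per distinct sub-text, values initialized to 0, and
    the position of each sub-text's identifier set to 1."""
    vocab = {w: i for i, w in enumerate(dict.fromkeys(sub_texts))}
    table = np.zeros((len(vocab), len(vocab)))
    for i in vocab.values():
        table[i, i] = 1.0
    return vocab, table

# Illustrative usage with sub-texts from the menu picture of FIG. 1B.
vocab, table = one_hot_table(["clear", "stewed", "sheep", "meat", "small dish"])
vector = table[vocab["sheep"]]   # one-hot vector representing the sub-text "sheep"
```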
The image information corresponding to a sub-text in the target document picture is acquired, for example, by using a ResNeXt-FPN network as the image encoder: a first feature map of the document picture is extracted and average-pooled to a fixed size (W × H); the average-pooled second feature map is expanded row by row, and the sub-text is linearly projected onto the second feature map to obtain the feature sequence of the image corresponding to the sub-text, i.e., the image information corresponding to the sub-text.
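A sketch of this image-encoding step under stated assumptions: a plain ResNet-50 trunk from torchvision stands in for the ResNeXt-FPN encoder named above, and the pooled size W × H and the projection dimension are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torchvision

W, H = 7, 7                              # assumed fixed pooled size
# A plain ResNet-50 trunk stands in for the ResNeXt-FPN encoder of the text.
trunk = nn.Sequential(
    *list(torchvision.models.resnet50(weights=None).children())[:-2])
proj = nn.Linear(2048, 768)              # linear projection, assumed dimensions

def image_features(page: torch.Tensor) -> torch.Tensor:
    """page: (1, 3, h, w) document picture -> (W*H, 768) feature sequence."""
    fmap = trunk(page)                                        # first feature map
    pooled = nn.functional.adaptive_avg_pool2d(fmap, (H, W))  # average-pool to W x H
    seq = pooled.flatten(2).transpose(1, 2)                   # expand row-wise
    return proj(seq).squeeze(0)                               # project each position

page = torch.rand(1, 3, 1024, 768)       # placeholder document picture
features = image_features(page)           # (W*H, 768) image information
```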
S102, performing multimodal fusion processing on the text information and the image information to obtain multimodal fusion information.
The modality information at the text level and the modality information at the image level are fused according to the associations between the different modalities to obtain the multimodal fusion information. Fusing the text information and the image information according to these associations can be understood as assigning each modality its own weight in the multimodal fusion information, where the proportion of the weights is related to the classification information of the document picture.
For example, document pictures of some classes contain rich visual or image information with distinctive features such as font type, size, and style, e.g., alert notifications and epidemic-control notifications. When obtaining multimodal information for document pictures of such classes, the image information deserves more attention and is assigned a higher weight. Document pictures of other classes contain rich text information with distinctive features such as keywords of the text content or the spatial relationships of the text content. For example, text in a table is arranged in a grid layout, and the keywords of the headers in the first column or first row can be extracted, so the document picture can be classified according to the keywords and the grid layout, e.g., as an invoice or an examination regulation. When obtaining multimodal information for document pictures of such classes, the text information deserves more attention and is assigned a higher weight.
In one embodiment, the method of obtaining the multimodal fusion information includes: from S101, obtain the vector representing the text information and the vector representing the corresponding image information; connect the two vectors through an added Concat layer; for the two connected vectors, obtain attention vectors for the text vector and the image vector through a bidirectional recurrent neural network with an attention mechanism; normalize the class-prediction probabilities of the attention-augmented text vector and image vector through a fully connected layer of the neural network using a softmax function to obtain a probability distribution of the predicted document classification information; from this probability distribution, obtain the weight vectors corresponding to the text vector and the image vector, respectively; and fuse the text vector and the image vector according to these weight vectors to obtain the fusion vector, i.e., the vector expression of the multimodal fusion information.
In the above embodiment, the bidirectional recurrent neural network is a special arrangement built from two long short-term memory (LSTM) layers: the first layer receives the data in forward order and outputs the state of each time step, the second layer receives the data in reverse order and outputs the state of each time step, and the two states are finally combined into the complete output. Using the bidirectional recurrent neural network, the text information and the image information can be learned jointly, and the locally invariant information of document pictures of different classes can be learned during pre-training; when the network needs to migrate to another classification type, it can be trained and fine-tuned with only a small number of manually labeled document picture samples.
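The fusion procedure of the preceding two paragraphs can be sketched as follows. The layer sizes, the choice of an LSTM as the bidirectional recurrent network, and the way attention-weighted states are pooled into per-modality weights are assumptions made for illustration, not the claimed implementation.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Sketch: connect the text and image vectors, run a bidirectional
    recurrent network with attention, derive per-modality weights with
    softmax, and fuse. All dimensions are illustrative assumptions."""
    def __init__(self, dim: int = 768, hidden: int = 256):
        super().__init__()
        self.birnn = nn.LSTM(dim, hidden, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)    # attention scores per position
        self.fc = nn.Linear(2 * hidden, 2)      # one weight logit per modality

    def forward(self, text_vec: torch.Tensor, image_vec: torch.Tensor):
        # Concat layer: stack the two modality vectors into a short sequence.
        seq = torch.stack([text_vec, image_vec], dim=1)       # (B, 2, dim)
        states, _ = self.birnn(seq)                           # (B, 2, 2*hidden)
        attn = torch.softmax(self.attn(states), dim=1)        # attention vector
        attended = states * attn
        weights = torch.softmax(self.fc(attended.mean(dim=1)), dim=-1)  # (B, 2)
        # Weighted fusion of the two modalities.
        return weights[:, :1] * text_vec + weights[:, 1:] * image_vec
```

A fusion vector produced this way can then be passed on to the classification step of S103.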
S103, obtaining classification information of the target document picture according to the multimodal fusion information.
The multimodal fusion information, which fuses the text information and the image information of the target document picture, is analyzed and processed to obtain the classification information of the target document picture. FIG. 3 is a schematic flow chart of document picture classification according to an embodiment of this specification, using the target document picture shown in FIG. 1B. Text information 301 ("stir-fry") and image information 302 (containing the text content "stir-fry") are extracted from the target document picture; the extraction method is described in S101 and not repeated here. Multimodal fusion processing is then performed on the text information 301 and the image information 302 to obtain the multimodal fusion information 303; the processing is described in S102 above and not repeated here. According to the multimodal fusion information, analysis tasks are executed, such as form understanding (extracting four types of semantic entities from forms in the document picture: questions, answers, titles, and others), receipt understanding (pre-trained on the CORD and SROIE receipt-understanding datasets; at inference, 30 types of key information entities are extracted from the document picture, including name, price, quantity, store name, store address, total price, and consumption time), and understanding of long documents with complicated layouts, among other analysis tasks. According to the analysis results, the classification information 304 of the target document picture is obtained as "menu".
According to the embodiments of this specification, information of at least two modalities included in a document picture, such as image information and text information, is fused to understand and classify the document picture. Compared with related-art techniques that classify document pictures using information of a single modality only, the embodiments of this specification exploit the complementarity of information across different modalities, which improves the accuracy of understanding and classifying document pictures, offers good robustness, and better meets the classification requirements of document pictures in complex usage environments.
In one embodiment, as shown in FIG. 4, a document picture classification method is presented. The method may be implemented by a computer program and may run on a document picture classification device based on the von Neumann architecture. The computer program may be integrated into an application or run as a stand-alone utility application.
Specifically, the method comprises the following steps:
S201, segmenting the target document included in the target document picture based on a preset segmentation unit to obtain at least one sub-text.
The target document included in the target document picture is acquired; for example, the text content is extracted through optical character recognition (OCR). Segmenting the target document means splitting a sequence of Chinese (or another language) into individual Chinese words or words of the other language, each sub-text corresponding to one such word. The preset segmentation unit includes at least one of the following: character units, word units, sentence units, and paragraph units, configured according to the needs of those skilled in the relevant art.
In one embodiment, obtaining at least one sub-text in the target document includes: cleaning the target document or the acquired at least one sub-text. For example, cleaning the target document includes removing stop words and special symbols, preventing a stop word or special symbol from being taken as a sub-text. The target document is cleaned, and the at least one sub-text is then acquired from the cleaned document. Cleaning filters out unnecessary sub-texts, i.e., characters, words, or sentences that do not help, or even negatively affect, the classification of the document picture; this avoids acquiring text information for unnecessary sub-texts, improves classification efficiency, and improves the accuracy and reliability of classification.
S202, obtaining character information corresponding to each sub-text according to the text content included in that sub-text.
Specifically, obtaining the character information corresponding to each sub-text according to its text content means one-hot encoding each sub-text according to its segmentation unit. For example, for sub-texts in word units, a matrix corresponding to the sub-texts is constructed at the word level, with the numbers of rows and columns equal to the number of distinct representations. The matrix values are initialized to 0, and in each row the value at the position of the corresponding identifier in the sequence is set to 1.
S203, obtaining position information corresponding to each sub-text according to its position in the target document picture.
The position information of each sub-text, also called layout information, is represented using a bounding box parallel to the coordinate axes of the target document picture that covers the coordinate range of the sub-text. For example, the coordinates of each sub-text in the document picture are obtained from the text bounding box produced by OCR; after the coordinates are converted into virtual coordinates, the vector representations of the four embedding sublayers of the neural network corresponding to x, y, w, and h are computed, and the position information of the sub-text is finally represented by the vector obtained by connecting the four coordinate embeddings. Other ways of acquiring the position information are also covered.
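A minimal sketch of this layout embedding; the virtual-coordinate range and the embedding dimension are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PositionEmbedding(nn.Module):
    """Each of the x, y, w, h virtual coordinates of a sub-text's bounding
    box passes through its own embedding sublayer; the four embeddings are
    then connected. Sizes are illustrative assumptions."""
    def __init__(self, max_coord: int = 1000, dim: int = 128):
        super().__init__()
        self.x_emb = nn.Embedding(max_coord, dim)
        self.y_emb = nn.Embedding(max_coord, dim)
        self.w_emb = nn.Embedding(max_coord, dim)
        self.h_emb = nn.Embedding(max_coord, dim)

    def forward(self, box: torch.Tensor) -> torch.Tensor:
        # box: (B, 4) long tensor of (x, y, w, h), each in [0, max_coord).
        x, y, w, h = box.unbind(dim=-1)
        return torch.cat([self.x_emb(x), self.y_emb(y),
                          self.w_emb(w), self.h_emb(h)], dim=-1)  # (B, 4*dim)
```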
In this embodiment, the text information corresponding to each sub-text includes the character information obtained in S202 and the position information obtained in S203. Document pictures of some classes contain rich text information whose keywords or spatial relationships, i.e., position information, have distinctive features. For example, text in a table is arranged in a grid layout, and the keywords of the headers in the first column or first row can be extracted, so the document picture can be classified according to the keywords and the grid layout, e.g., as an invoice or an examination regulation. Analyzing the position information of each sub-text therefore improves the accuracy and reliability of document picture classification.
S204, segmenting the target document picture according to the at least one sub-text it includes to obtain the sub-picture corresponding to each sub-text, and obtaining the image information corresponding to each sub-picture.
For the coordinate range covered by each sub-text in the document picture, the partial image region parallel to the picture's coordinate axes that contains the sub-text's content is used as the sub-picture corresponding to that sub-text, and the image information of the sub-picture is acquired as the image information of the sub-text. For example, using a ResNeXt-FPN network as the image encoder, a first feature map of the sub-picture is extracted and average-pooled to a fixed size (W × H); the average-pooled second feature map is expanded row by row, the sub-text is linearly projected onto the second feature map to obtain the sub-image corresponding to the sub-text, and the feature sequence of the sub-image, i.e., the image information corresponding to the sub-text, is extracted through the ResNeXt-FPN network.
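The sub-picture extraction can be sketched as a simple axis-parallel crop whose output is then fed to an image encoder such as the one sketched earlier under S101; the helper below is hypothetical and the pixel coordinates are illustrative.

```python
import torch

def crop_sub_picture(page: torch.Tensor, box: tuple) -> torch.Tensor:
    """Crop the axis-parallel sub-picture covering one sub-text.
    page: (3, H, W) image tensor; box: (x, y, w, h) in pixel coordinates."""
    x, y, w, h = box
    return page[:, y:y + h, x:x + w]

# Illustrative usage: crop the region of one sub-text from a placeholder page.
page = torch.rand(3, 1000, 800)
sub_picture = crop_sub_picture(page, (120, 40, 64, 64))
```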
S205, splicing the vector representing the text information and the vector representing the image information to obtain a spliced vector.
The vector representing the text information comprises a vector representing the character information and a vector representing the position information; the vector representing the character information, the vector representing the position information, and the vector representing the image information are spliced to obtain the spliced vector. The vectors are connected, for example, through a Concat layer.
As shown in FIG. 5, a schematic flow chart of document picture classification provided by an embodiment of this specification, the extracted sub-texts include: text information 5011, comprising the character "clear" with position information "(1, 2)", and the corresponding image information 5012; text information 5021, comprising the character "stewed" with position information "(1, 3)", and the corresponding image information 5022; text information 5031, comprising the character "sheep" with position information "(1, 4)", and the corresponding image information 5032; text information 5041, comprising the character "meat" with position information "(1, 5)", and the corresponding image information 5042; and text information 5051, obtained with a segmentation unit different from that of text information 5011-5041, comprising the word "small dish" with position information "(4, 5)", and the corresponding image information 5052. The sub-texts shown in FIG. 5 are only an example; any other sub-texts produced by any segmentation are also covered.
The vector representing text information 5011 is connected with the vector representing image information 5012 to obtain a first spliced vector; the vector representing text information 5021 is connected with the vector representing image information 5022 to obtain a second spliced vector; the vector representing text information 5031 is connected with the vector representing image information 5032 to obtain a third spliced vector; the vector representing text information 5041 is connected with the vector representing image information 5042 to obtain a fourth spliced vector; and the vector representing text information 5051 is connected with the vector representing image information 5052 to obtain a fifth spliced vector.
S206, configuring weight vectors for the spliced vectors through a fully connected layer of the neural network to obtain multimodal fusion information.
The modality information at the character level, the position level, and the image level is fused according to the associations between the different modalities to obtain the multimodal fusion information. Fusing the text information and the image information according to these associations can be understood as assigning the character information, the position information, and the image information their own weights in the multimodal fusion information, where the proportion of the weights is related to the classification information of the document picture.
The neural network is trained on pre-training document pictures in a training set and their corresponding classification information; the weight vectors corresponding to the character information, position information, and image information of a pre-training document picture are associated with its classification information. For example, document pictures of some classes contain rich visual or image information with distinctive features such as font type, size, and style, e.g., alert notifications and epidemic-control notifications; for such classes the image information deserves more attention and is assigned a higher weight. Document pictures of other classes contain rich text information with distinctive features such as keywords or the spatial relationships of the text content; for example, text in a table is arranged in a grid layout and the header keywords of the first column or first row can be extracted, so the document picture can be classified according to the keywords and the grid layout, e.g., as an invoice or an examination regulation. For such classes the text information deserves more attention and is assigned a higher weight.
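S206's configuration of weight vectors through a fully connected layer can be sketched as follows; the modality segment dimensions and the softmax weighting scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpliceWeighting(nn.Module):
    """Sketch: a fully connected layer assigns a weight to each of the
    character, position, and image segments of a spliced vector.
    Segment dimensions are illustrative assumptions."""
    def __init__(self, char_dim: int = 768, pos_dim: int = 512,
                 img_dim: int = 768):
        super().__init__()
        self.dims = (char_dim, pos_dim, img_dim)
        self.fc = nn.Linear(sum(self.dims), 3)   # one logit per modality

    def forward(self, spliced: torch.Tensor) -> torch.Tensor:
        # spliced: (B, char_dim + pos_dim + img_dim)
        weights = torch.softmax(self.fc(spliced), dim=-1)    # (B, 3)
        parts = torch.split(spliced, self.dims, dim=-1)
        weighted = [w.unsqueeze(-1) * p
                    for w, p in zip(weights.unbind(dim=-1), parts)]
        return torch.cat(weighted, dim=-1)   # weighted multimodal fusion vector
```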
As shown in FIG. 6, a schematic structural diagram of a neural network provided by an embodiment of this specification, a hidden Markov model based on a deep neural network, i.e., a DNN-HMM model, is adopted, and an error back-propagation algorithm is introduced on top of the existing neural network model to optimize it and improve its recognition accuracy.
The deep neural network consists of an input layer, hidden layers, and an output layer, as shown in FIG. 6. The input layer generally comprises a plurality of input units, each of which computes, from the spliced vector fed into it and according to its weight values, the output value passed to the bottom hidden layer.
There are usually several hidden layers, each comprising a plurality of hidden-layer units. Each hidden-layer unit receives input values from the units of the hidden layer below, performs a weighted summation according to the layer's weight values, and outputs the result as its output value to the layer above.
The output layer comprises a plurality of output units. Each output unit receives input values from the units of the topmost hidden layer, performs a weighted summation according to the layer's weight values, and computes the actual output value from the result. Based on the back-propagation of the error between the expected output value and the actual output value from the output layer, the connection weights and thresholds of each layer are adjusted along the output path.
Specifically, in this embodiment, an initial model is created using a DNN-HMM model with an error back-propagation algorithm. After the spliced vector corresponding to the text information and image information of a document picture is extracted, it is input into the neural network model. Training generally comprises forward propagation and back-propagation. During forward propagation, the spliced vector passes from the input layer through the transfer functions (also called activation functions) of the hidden-layer neurons (also called nodes) to the output layer, with the state of each layer of neurons affecting the next; the actual output value, i.e., the multimodal fusion information, is computed at the output layer. The expected error between the actual output value and the expected output value is then computed, and the parameters of the model, including the weight values and threshold of each layer, are adjusted based on this error. After training is completed, the network configures the weight vector for the spliced vector, thereby producing the neural network that outputs the multimodal fusion information.
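A minimal sketch of this forward/back-propagation training; the layer sizes, sigmoid transfer functions, mean-squared-error loss, optimizer, and learning rate are assumptions, and the plain feed-forward network stands in for the DNN portion of the DNN-HMM model.

```python
import torch
import torch.nn as nn

# Placeholder network: input layer -> hidden layers -> output layer.
model = nn.Sequential(
    nn.Linear(1536, 512), nn.Sigmoid(),   # hidden layers with transfer functions
    nn.Linear(512, 512), nn.Sigmoid(),
    nn.Linear(512, 768),                  # output: multimodal fusion vector
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

def train_step(splice_vec: torch.Tensor, expected: torch.Tensor) -> float:
    optimizer.zero_grad()
    actual = model(splice_vec)         # forward propagation through each layer
    loss = loss_fn(actual, expected)   # error between actual and expected output
    loss.backward()                    # back-propagate the error along the path
    optimizer.step()                   # adjust connection weights per layer
    return loss.item()
```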
S207, performing analysis tasks on the multimodal fusion information through a multimodal document understanding model to obtain the classification information of the target document picture.
The multimodal document understanding model is trained on the multimodal fusion information and classification information in a training set. For example, the multimodal document understanding model may use the document understanding pre-trained model LayoutLM 1.0 or the new-generation pre-trained model LayoutLM 2.0, and a spatially aware self-attention mechanism may further be introduced to improve the model's ability to understand and analyze document pictures.
The analysis tasks include at least one or more of the following: document layout analysis, visual information extraction, document image classification, and the like. The document layout analysis task automatically analyzes, recognizes, and understands the positional relationships of images, text, tables, and the like in the document picture. The visual information extraction task extracts entities and relations from the large amount of unstructured content in the document picture, models the visually rich document as a computer-vision problem, and extracts information through semantic segmentation or text-box detection. Through these tasks, the document image classification task is realized: the process of analyzing and recognizing document images and classifying them into different categories, such as scientific papers, resumes, invoices, and receipts.
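As an illustration of running the document image classification task with the LayoutLM pre-trained model mentioned above, through the Hugging Face transformers library; the checkpoint name, the number of labels, and the all-zero placeholder bounding boxes are assumptions, and a real pipeline would align one OCR bounding box to each word piece.

```python
import torch
from transformers import LayoutLMTokenizer, LayoutLMForSequenceClassification

# Checkpoint and label count are assumptions for this sketch.
tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForSequenceClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=5)

encoding = tokenizer("total price 58.00", return_tensors="pt")
seq_len = encoding.input_ids.shape[1]
# LayoutLM expects one (x0, y0, x1, y1) box per token, normalized to 0-1000;
# all-zero boxes stand in for the OCR layout information here.
bbox = torch.zeros((1, seq_len, 4), dtype=torch.long)

with torch.no_grad():
    logits = model(input_ids=encoding.input_ids, bbox=bbox,
                   attention_mask=encoding.attention_mask).logits
predicted_class = logits.argmax(-1).item()   # index of the document class
```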
For example, in the document picture classification flow shown in FIG. 5, the spliced vectors are input into the neural network 601, the multimodal fusion information is obtained through the fully connected layer of the neural network 601, the multimodal fusion information is then input into the multimodal document understanding model 602, and the classification information of the target document picture is obtained through the multimodal document understanding model 602.
According to the embodiments of this specification, information of at least two modalities included in a document picture, such as image information and text information, is fused to understand and classify the document picture. Compared with related-art techniques that classify document pictures using information of a single modality only, the embodiments of this specification exploit the complementarity of information across different modalities, which improves the accuracy of understanding and classifying document pictures, offers good robustness, and better meets the classification requirements of document pictures in complex usage environments.
The following are apparatus embodiments of this specification, which may be used to perform the method embodiments of this specification. For details not disclosed in the apparatus embodiments, please refer to the method embodiments of this specification.
Referring to FIG. 7, a schematic structural diagram of a document picture classification apparatus according to an exemplary embodiment of this specification is shown. The document picture classification apparatus may be implemented as all or part of a device by software, hardware, or a combination of both. The apparatus comprises an information acquisition module 701, a fusion information module 702, and a document classification module 703.
The information acquisition module 701 is configured to acquire image information and text information of a target document picture;
the fusion information module 702 is configured to perform multimodal fusion processing on the text information and the image information to obtain multimodal fusion information;
the document classification module 703 is configured to obtain classification information of the target document picture according to the multimodal fusion information.
In one embodiment, the information acquisition module 701 includes:
The text segmentation unit is used for segmenting the target document included in the target document picture based on a preset segmentation unit to obtain at least one sub-text; the preset segmentation unit includes at least one of the following: word units, sentence units, and paragraph units;
The information acquisition unit is used for acquiring text information corresponding to each sub-text and acquiring image information corresponding to the sub-text in the target document picture.
In one embodiment, the text information includes character information;
and the information acquisition unit is further used for acquiring the character information corresponding to each sub-text according to the text content included in that sub-text.
In one embodiment, the text information further includes position information;
and the information acquisition unit is further used for acquiring the position information corresponding to each sub-text according to its position in the target document picture.
In one embodiment, the information acquisition unit includes:
an image segmentation subunit, configured to segment the target document picture according to the at least one sub-text included in it, to obtain the sub-picture corresponding to each sub-text;
and an image acquisition subunit, configured to acquire the image information corresponding to each sub-picture.
In one embodiment, the fusion information module 702 includes:
The vector splicing unit is used for splicing the vector representing the text information and the vector representing the image information to obtain a spliced vector;
The weight configuration unit is used for configuring weight vectors for the spliced vectors through a fully connected layer of a neural network to obtain the multimodal fusion information; the neural network is trained on pre-training document pictures in a training set and their corresponding classification information, and the weight vectors corresponding to the text information and image information of a pre-training document picture are associated with the classification information of that picture.
In one embodiment, the document classification module 703 includes:
The analysis and classification unit is used for executing analysis tasks on the multimodal fusion information through a multimodal document understanding model to obtain the classification information of the target document picture; the multimodal document understanding model is trained on the multimodal fusion information and classification information in a training set.
In one embodiment, the analysis tasks include at least one or more of the following tasks: document layout analysis, visual information extraction, and document picture classification.
In one embodiment, the document picture classification apparatus further includes:
And the preprocessing module is used for preprocessing the target document picture.
In one embodiment, the preprocessing includes at least one or more of the following: image deblurring, image brightness enhancement, image contrast enhancement, image super-resolution reconstruction, and image correction.
According to the embodiments of this specification, information of at least two modalities included in a document picture, such as image information and text information, is fused to understand and classify the document picture. Compared with related-art techniques that classify document pictures using information of a single modality only, the embodiments of this specification exploit the complementarity of information across different modalities, which improves the accuracy of understanding and classifying document pictures, offers good robustness, and better meets the classification requirements of document pictures in complex usage environments.
It should be noted that when the document picture classification apparatus provided in the above embodiments performs the document picture classification method, the division into the above functional modules is only used for illustration; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the document picture classification apparatus and the document picture classification method provided in the above embodiments belong to the same concept; the detailed implementation process is embodied in the method embodiments and is not repeated here.
The above embodiment numbers of this specification are for description only and do not represent the superiority or inferiority of the embodiments.
The embodiments of this specification further provide a computer storage medium that may store a plurality of instructions adapted to be loaded and executed by a processor; for the specific execution process, refer to the description of the embodiments shown in FIGS. 1-6, which is not repeated here.
The embodiments of this specification further provide a computer program product storing at least one instruction that is loaded and executed by a processor; for the specific execution process, refer to the description of the embodiments shown in FIGS. 1-6, which is not repeated here.
Referring to FIG. 8, a schematic structural diagram of an electronic device is provided in an embodiment of this specification. As shown in FIG. 8, the electronic device 800 may include: at least one processor 801, at least one network interface 804, a user interface 803, a memory 805, and at least one communication bus 802.
The communication bus 802 is used to enable connection and communication between these components.
The user interface 803 may include a display screen (Display) and a camera (Camera); optionally, the user interface 803 may further include a standard wired interface and a wireless interface.
The network interface 804 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), among others.
The processor 801 may include one or more processing cores. Using various interfaces and lines, the processor 801 connects various parts of the overall electronic device 800, and performs the various functions of the electronic device 800 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 805 and by invoking the data stored in the memory 805. Optionally, the processor 801 may be implemented in at least one hardware form of digital signal processing (DSP), field-programmable gate array (FPGA), or programmable logic array (PLA). The processor 801 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is responsible for rendering and drawing the content to be displayed on the display screen; and the modem handles wireless communication. It will be appreciated that the modem may also not be integrated into the processor 801 and may instead be implemented by a separate chip.
The memory 805 may include a random access memory (RAM) or a read-only memory (ROM). Optionally, the memory 805 includes a non-transitory computer-readable storage medium. The memory 805 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 805 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing the operating system, instructions for at least one function (such as a touch function, a sound playing function, or an image playing function), instructions for implementing the above method embodiments, and the like; the data storage area may store the data involved in the above method embodiments, and the like. Optionally, the memory 805 may also be at least one storage device located remotely from the aforementioned processor 801. As shown in fig. 8, the memory 805, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a document picture classification application program.
In the electronic device 800 shown in fig. 8, the user interface 803 is mainly used for providing an input interface for a user, and acquiring data input by the user; and the processor 801 may be used to invoke a document picture classification application stored in the memory 805 and specifically perform the following operations:
Acquiring image information and text information of a target document picture;
Performing multi-mode fusion processing on the text information and the image information to obtain multi-mode fusion information;
And obtaining the classification information of the target document picture according to the multi-mode fusion information.
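For orientation only, the following is a minimal Python sketch of how these three operations could fit together; `run_ocr`, `encode_text`, `encode_image`, and `fusion_classifier` are hypothetical helpers assumed for illustration and are not named anywhere in the specification.

```python
# Hedged end-to-end sketch of the three operations above; every helper
# passed in here (run_ocr, encode_text, encode_image, fusion_classifier)
# is a hypothetical stand-in, not an API defined by the specification.
from PIL import Image


def classify_document_picture(path, run_ocr, encode_text, encode_image,
                              fusion_classifier):
    picture = Image.open(path).convert("RGB")

    # Operation 1: acquire image information and text information.
    text = run_ocr(picture)            # recognized text of the target document
    text_vec = encode_text(text)       # vector representing the text information
    image_vec = encode_image(picture)  # vector representing the image information

    # Operation 2: multi-mode fusion of the two vectors.
    fused = fusion_classifier.fuse(text_vec, image_vec)

    # Operation 3: classification information from the fused representation.
    return fusion_classifier.predict(fused)
```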
In one embodiment, when performing the acquiring of the image information and text information of the target document picture, the processor 801 specifically performs:
dividing a target document included in the target document picture based on a preset dividing unit to obtain at least one sub-text; the preset dividing unit includes at least one of the following: word units, sentence units, and paragraph units;
and acquiring text information corresponding to each sub-text, and acquiring image information corresponding to the sub-text in the target document picture.
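A minimal sketch of the dividing step, assuming the recognized document text is already available as a plain string; a real implementation would rely on layout cues rather than the blank-line convention used here for paragraphs.

```python
import re


def divide_into_subtexts(document_text: str, unit: str = "sentence") -> list[str]:
    """Divide the recognized document text by a preset dividing unit."""
    if unit == "word":
        return document_text.split()
    if unit == "sentence":
        # Split after common sentence-ending punctuation,
        # including CJK fullwidth marks.
        return [s for s in re.split(r"(?<=[.!?。！？])\s*", document_text) if s]
    if unit == "paragraph":
        # Blank lines stand in for paragraph boundaries in this sketch.
        return [p.strip() for p in document_text.split("\n\n") if p.strip()]
    raise ValueError(f"unknown dividing unit: {unit}")
```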
In one embodiment, the text information includes character information;
when acquiring the text information corresponding to each sub-text, the processor 801 specifically performs:
obtaining the character information corresponding to each sub-text according to the text content included in each sub-text.
In one embodiment, the text information further includes position information;
after obtaining the character information corresponding to each sub-text according to the text content included in each sub-text, the processor 801 further performs:
obtaining the position information corresponding to each sub-text according to the position of the sub-text in the target document picture, as sketched below.
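A sketch of carrying both kinds of text information per sub-text. The `SubTextInfo` container and the 0–1000 coordinate normalization are assumptions borrowed from common layout-aware models, not requirements of the specification.

```python
from dataclasses import dataclass


@dataclass
class SubTextInfo:
    text: str                        # character information of the sub-text
    box: tuple[int, int, int, int]   # (x0, y0, x1, y1) position in the picture


def normalize_box(box, width, height, scale=1000):
    # Mapping pixel coordinates to a fixed 0..scale range keeps the position
    # information independent of the picture's resolution (an assumed
    # convention, common in layout-aware document encoders).
    x0, y0, x1, y1 = box
    return (x0 * scale // width, y0 * scale // height,
            x1 * scale // width, y1 * scale // height)
```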
In one embodiment, when acquiring the image information corresponding to the sub-texts in the target document picture, the processor 801 specifically performs:
Dividing the target document picture according to at least one sub-text included in the target document picture to obtain sub-pictures corresponding to each sub-text;
And acquiring image information corresponding to each sub-picture.
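A sketch of the sub-picture step, assuming each sub-text already has a pixel bounding box from text detection; `image_encoder` here is a hypothetical visual backbone (e.g. a CNN), not something the specification names, and the 224×224 input size is likewise an assumption.

```python
import torch
from PIL import Image
from torchvision import transforms

_to_tensor = transforms.Compose([
    transforms.Resize((224, 224)),  # assumed input size for the backbone
    transforms.ToTensor(),
])


def subpicture_features(picture: Image.Image, boxes, image_encoder):
    # Divide the target document picture into one sub-picture per sub-text,
    # then extract one image-information vector per sub-picture.
    subpictures = [picture.crop(box) for box in boxes]
    batch = torch.stack([_to_tensor(p) for p in subpictures])
    with torch.no_grad():
        return image_encoder(batch)
```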
In one embodiment, when performing the multi-mode fusion processing on the text information and the image information to obtain the multi-mode fusion information, the processor 801 specifically performs:
splicing the vector representing the text information and the vector representing the image information to obtain a spliced vector;
configuring a weight vector for the spliced vector through a fully connected layer of a neural network to obtain the multi-mode fusion information, as sketched below; the neural network is trained using pre-training document pictures in a training set and the classification information corresponding to the pre-training document pictures, and the weight vectors respectively corresponding to the text information and the image information of a pre-training document picture are associated with the classification information corresponding to that pre-training document picture.
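A minimal PyTorch sketch of this fusion step. The vector dimensions are illustrative assumptions; the specification fixes neither them nor the framework.

```python
import torch
import torch.nn as nn


class FusionLayer(nn.Module):
    """Splice the text vector and the image vector, then weight the
    spliced vector through a fully connected layer."""

    def __init__(self, text_dim=768, image_dim=512, fused_dim=768):
        super().__init__()
        self.fc = nn.Linear(text_dim + image_dim, fused_dim)

    def forward(self, text_vec, image_vec):
        spliced = torch.cat([text_vec, image_vec], dim=-1)  # spliced vector
        # The learned weight matrix of the fully connected layer plays the
        # role of the configured weight vectors; training it against the
        # classification labels of pre-training document pictures is what
        # associates those weights with the classification information.
        return self.fc(spliced)
```

Training the layer end-to-end with a classification loss is one plausible way to obtain the association between the weight vectors and the classification information described above.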
In one embodiment, when obtaining the classification information of the target document picture according to the multi-mode fusion information, the processor 801 specifically performs:
executing an analysis task on the multi-mode fusion information through a multi-modal document understanding model to obtain the classification information of the target document picture; the multi-modal document understanding model is obtained through training on multi-mode fusion information and classification information in a training set.
In one embodiment, the analysis task includes at least one of the following tasks: document layout analysis, visual information extraction, and document picture classification.
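As a sketch of the classification task only, a single linear head over the fused vector would suffice; the class count below is an arbitrary placeholder, and the other analysis tasks (layout analysis, visual information extraction) would presumably use different heads over the same fused representation.

```python
import torch.nn as nn


class DocumentClassifierHead(nn.Module):
    # Hypothetical head for the document picture classification task;
    # 16 classes is a placeholder, not a value from the specification.
    def __init__(self, fused_dim=768, num_classes=16):
        super().__init__()
        self.head = nn.Linear(fused_dim, num_classes)

    def forward(self, fused):
        # Class probabilities for the target document picture.
        return self.head(fused).softmax(dim=-1)
```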
In one embodiment, before the processor 801 performs the acquiring of the image information and text information of the target document picture, the following is further performed:
preprocessing the target document picture.
In one embodiment, the preprocessing includes at least one of the following: image deblurring, image brightness enhancement, image contrast enhancement, image super-resolution reconstruction, and image correction (two of these operations are sketched below).
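A sketch of two of the listed preprocessing operations using OpenCV; all parameter values are assumptions, and super-resolution reconstruction and image correction are omitted because they typically require dedicated models.

```python
import cv2
import numpy as np


def preprocess(image: np.ndarray) -> np.ndarray:
    # Deblurring via unsharp masking: subtract a blurred copy of the
    # image to sharpen edges (kernel size is derived from sigma).
    blurred = cv2.GaussianBlur(image, (0, 0), sigmaX=3)
    image = cv2.addWeighted(image, 1.5, blurred, -0.5, 0)
    # Brightness (+beta) and contrast (*alpha) enhancement;
    # alpha/beta values here are illustrative only.
    image = cv2.convertScaleAbs(image, alpha=1.2, beta=15)
    return image
```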
According to the embodiments of the present specification, information of at least two modes included in a document picture, such as image information and text information, is fused so as to understand and classify the document picture. Compared with related-art techniques that classify a document picture using information of a single mode only, the embodiments of the present specification use the complementarity of information between different modes to improve the accuracy of understanding and classifying the document picture, offer good robustness, and better meet the classification requirements of document pictures in complex use environments.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program stored in a computer-readable storage medium; when executed, the program may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory, or the like.
The foregoing disclosure describes merely preferred embodiments of the present specification and is not intended to limit its scope, which is defined by the claims appended hereto.
Claims (12)
1. A document picture classification method, the method comprising:
Acquiring image information and text information of a target document picture;
Performing multi-mode fusion processing on the text information and the image information to obtain multi-mode fusion information;
Obtaining classification information of the target document picture according to the multi-mode fusion information;
wherein the performing of the multi-mode fusion processing on the text information and the image information to obtain the multi-mode fusion information comprises:
splicing the vector representing the text information and the vector representing the image information to obtain a spliced vector;
configuring a weight vector for the spliced vector through a fully connected layer of a neural network to obtain the multi-mode fusion information; the neural network is trained using pre-training document pictures in a training set and the classification information corresponding to the pre-training document pictures, and the weight vectors respectively corresponding to the text information and the image information of a pre-training document picture are associated with the classification information corresponding to that pre-training document picture.
2. The document picture classification method according to claim 1, wherein the acquiring image information and text information of the target document picture comprises:
dividing a target document included in the target document picture based on a preset dividing unit to obtain at least one sub-text; the preset dividing unit includes at least one of the following: word units, sentence units, and paragraph units;
and acquiring text information corresponding to each sub-text, and acquiring image information corresponding to the sub-text in the target document picture.
3. The document picture classification method according to claim 2, the text information including character information;
the acquiring of the text information corresponding to each sub-text comprising:
obtaining the character information corresponding to each sub-text according to the text content included in each sub-text.
4. The document picture classification method according to claim 3, the text information further including position information;
after the obtaining of the character information corresponding to each sub-text according to the text content included in each sub-text, the method further comprising:
obtaining the position information corresponding to each sub-text according to the position of the sub-text in the target document picture.
5. The document picture classification method according to claim 2, wherein the obtaining the image information corresponding to the sub-text in the target document picture includes:
Dividing the target document picture according to at least one sub-text included in the target document picture to obtain sub-pictures corresponding to each sub-text;
And acquiring image information corresponding to each sub-picture.
6. The document picture classification method according to claim 1, wherein the obtaining the classification information of the target document picture according to the multimodal fusion information includes:
executing an analysis task on the multi-modal fusion information through a multi-modal document understanding model to obtain the classification information of the target document picture; the multi-modal document understanding model being obtained through training on multi-modal fusion information and classification information in a training set.
7. The document picture classification method according to claim 6, wherein the analysis task includes at least one of the following tasks: document layout analysis, visual information extraction, and document picture classification.
8. The document picture classification method according to claim 1, wherein before the acquiring of the image information and text information of the target document picture, the method further comprises:
preprocessing the target document picture.
9. The document picture classification method according to claim 8, the preprocessing including at least one of the following: image deblurring, image brightness enhancement, image contrast enhancement, image super-resolution reconstruction, and image correction.
10. A document picture classification apparatus, the apparatus comprising:
the information acquisition module is used for acquiring image information and text information of the target document picture;
The fusion information module is used for carrying out multi-mode fusion processing on the text information and the image information to obtain multi-mode fusion information;
the document classification module is used for obtaining classification information of the target document picture according to the multi-mode fusion information;
The fusion information module comprises: a vector splicing unit, configured to splice the vector representing the text information and the vector representing the image information to obtain a spliced vector;
and a weight configuration unit, configured to configure a weight vector for the spliced vector through a fully connected layer of a neural network to obtain the multi-mode fusion information; wherein the neural network is trained using pre-training document pictures in a training set and the classification information corresponding to the pre-training document pictures, and the weight vectors respectively corresponding to the text information and the image information of a pre-training document picture are associated with the classification information corresponding to that pre-training document picture.
11. A computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method steps of any one of claims 1 to 9.
12. An electronic device, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210253277.XA CN114780773B (en) | 2022-03-15 | 2022-03-15 | Document picture classification method and device, storage medium and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210253277.XA CN114780773B (en) | 2022-03-15 | 2022-03-15 | Document picture classification method and device, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114780773A CN114780773A (en) | 2022-07-22 |
CN114780773B (en) | 2024-07-02 |
Family
ID=82424700
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210253277.XA Active CN114780773B (en) | 2022-03-15 | 2022-03-15 | Document picture classification method and device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114780773B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112966522A (en) * | 2021-03-03 | 2021-06-15 | 北京百度网讯科技有限公司 | Image classification method and device, electronic equipment and storage medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11301732B2 (en) * | 2020-03-25 | 2022-04-12 | Microsoft Technology Licensing, Llc | Processing image-bearing electronic documents using a multimodal fusion framework |
CN111581470B (en) * | 2020-05-15 | 2023-04-28 | 上海乐言科技股份有限公司 | Multi-mode fusion learning analysis method and system for scene matching of dialogue system |
2022-03-15: Application CN202210253277.XA filed in China; granted as CN114780773B (status: Active).
Also Published As
Publication number | Publication date |
---|---|
CN114780773A (en) | 2022-07-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108717406B (en) | Text emotion analysis method and device and storage medium | |
RU2699687C1 (en) | Detecting text fields using neural networks | |
CN114155543B (en) | Neural network training method, document image understanding method, device and equipment | |
EP3570208A1 (en) | Two-dimensional document processing | |
RU2721189C1 (en) | Detecting sections of tables in documents by neural networks using global document context | |
US20200004815A1 (en) | Text entity detection and recognition from images | |
CN111488826A (en) | Text recognition method and device, electronic equipment and storage medium | |
CN112949415A (en) | Image processing method, apparatus, device and medium | |
Chen et al. | Towards complete icon labeling in mobile applications | |
CN109598517B (en) | Commodity clearance processing, object processing and category prediction method and device thereof | |
CN113761377B (en) | False information detection method and device based on attention mechanism multi-feature fusion, electronic equipment and storage medium | |
CN111753082A (en) | Text classification method and device based on comment data, equipment and medium | |
US11972625B2 (en) | Character-based representation learning for table data extraction using artificial intelligence techniques | |
US20230267345A1 (en) | Form structure extraction by predicting associations | |
CN114612921B (en) | Form recognition method and device, electronic equipment and computer readable medium | |
CN115034200A (en) | Drawing information extraction method and device, electronic equipment and storage medium | |
Elanwar et al. | Extracting text from scanned Arabic books: a large-scale benchmark dataset and a fine-tuned Faster-R-CNN model | |
CN114417871B (en) | Model training and named entity recognition method, device, electronic equipment and medium | |
CN118364916A (en) | News retrieval method and system based on large language model and knowledge graph | |
Javanmardi et al. | Caps captioning: a modern image captioning approach based on improved capsule network | |
CN114780773B (en) | Document picture classification method and device, storage medium and electronic equipment | |
CN114842482B (en) | Image classification method, device, equipment and storage medium | |
Bhatt et al. | Pho (SC)-CTC—a hybrid approach towards zero-shot word image recognition | |
CN116416640A (en) | Method, device, equipment and storage medium for determining document element | |
CN114898388B (en) | Document picture classification method and device, storage medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant |