CN114780773A - Document picture classification method and device, storage medium and electronic device - Google Patents

Document picture classification method and device, storage medium and electronic device

Info

Publication number
CN114780773A
Authority: CN (China)
Prior art keywords: information, text, document, sub, picture
Legal status: Granted; Active
Application number: CN202210253277.XA
Other languages: Chinese (zh)
Other versions: CN114780773B
Inventors: 夏伯谦, 李亚东, 王洪彬
Current and original assignee: Alipay Hangzhou Information Technology Co Ltd
Application filed by Alipay Hangzhou Information Technology Co Ltd; priority to CN202210253277.XA; publication of CN114780773A; application granted; publication of CN114780773B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval of still image data
    • G06F 16/55 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this specification disclose a document picture classification method and device, a storage medium, and an electronic device. The method comprises: acquiring information of at least two modalities of a target document picture to be classified, such as image information and text information, and fusing the information of the at least two modalities to obtain multi-modal fusion information, so that the target document picture is understood and classified by analyzing and processing the multi-modal information.

Description

Document picture classification method and device, storage medium and electronic device
Technical Field
The embodiments of this specification relate to the field of natural language processing, and in particular to a document picture classification method and device, a storage medium, and an electronic device.
Background
A document picture is a picture containing text, and document picture classification is the technique of assigning document pictures to preset categories using natural language processing methods. As a foundational natural language processing technique, document picture classification is widely applied in fields such as data mining and text processing. In the digital age, classifying and organizing texts is a major pain point for many enterprises. For example, hospitals receive a huge volume of text data every day, including types such as illness records, payment slips, medication orders, and CT films.
Disclosure of Invention
The embodiments of this specification provide a document picture classification method and device, a storage medium, and an electronic device, which can automate document picture classification and improve the accuracy of classifying and organizing document pictures. The technical solution is as follows:
in a first aspect, an embodiment of this specification provides a document picture classification method, the method comprising:
acquiring image information and text information of a target document picture;
performing multi-modal fusion processing on the text information and the image information to obtain multi-modal fusion information; and
obtaining classification information of the target document picture according to the multi-modal fusion information.
In a second aspect, an embodiment of this specification provides a document picture classification device, the device comprising:
an information acquisition module, configured to acquire image information and text information of a target document picture;
an information fusion module, configured to perform multi-modal fusion processing on the text information and the image information to obtain multi-modal fusion information; and
a document classification module, configured to obtain classification information of the target document picture according to the multi-modal fusion information.
In a third aspect, the present specification provides a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the above-mentioned method steps.
In a fourth aspect, embodiments of the present specification provide an electronic device, which may include: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the above-mentioned method steps.
The technical solutions provided by some embodiments of this specification have at least the following beneficial effects:
compared with the related-art approach of classifying document pictures through information of a single modality only, the complementarity of information from different modalities improves the accuracy of understanding and classifying document pictures, provides good robustness, and better meets the needs of document picture classification in complex usage environments.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of this specification or in the prior art, the drawings needed for describing the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of this specification, and that other drawings can be derived from them by those skilled in the art without creative effort.
FIGS. 1A-1C are schematic diagrams of some document pictures to be classified provided by an embodiment of this specification;
FIG. 2 is a schematic flowchart of a document picture classification method provided by an embodiment of this specification;
FIG. 3 is a schematic flowchart of document picture classification provided by an embodiment of this specification;
FIG. 4 is a schematic flowchart of a document picture classification method provided by an embodiment of this specification;
FIG. 5 is a schematic flowchart of document picture classification provided by an embodiment of this specification;
FIG. 6 is a schematic structural diagram of a neural network provided by an embodiment of this specification;
FIG. 7 is a schematic structural diagram of a document picture classification device provided by an exemplary embodiment of this specification;
FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of this specification.
Detailed Description
The technical solutions in the embodiments of this specification will be described clearly and completely below with reference to the drawings in those embodiments. It is obvious that the described embodiments are only a part of the embodiments of this specification, not all of them. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments in this specification without creative effort fall within the protection scope of the embodiments of this specification.
In the description of the embodiments herein, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In the description of the embodiments herein, it is noted that, unless explicitly stated or limited otherwise, "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus. Specific meanings of the above terms in the embodiments of the present specification can be understood in specific cases by those of ordinary skill in the art. In addition, in the description of the embodiments of the present specification, "a plurality" means two or more unless otherwise specified. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The embodiments of the present disclosure will be described in detail with reference to specific embodiments.
Document picture classification is a technique for extracting and structuring unstructured information in scanned files or digital business documents (images, PDF files, etc.) in order to obtain classification information for a document picture. In the embodiments of this specification, scanned documents and digital business documents are collectively referred to as document pictures. The sources of document pictures are very rich; for example, when a resident or an enterprise handles various matters on the Internet, the document materials required for those matters are converted into images and uploaded, yielding document pictures corresponding to the materials. Taking financial audit matters as an example: when a resident or an enterprise handles financial audit matters online, checks, deposit agreements, and the like need to be photographed or scanned into images and uploaded to the transaction platform, so the types of document pictures acquired by the platform include at least checks and deposit agreements.
Document pictures contain rich text information and image information, and document pictures of different classification types differ in the characteristics of both. Fig. 1A to fig. 1C are schematic diagrams of some document pictures provided by the embodiments of this specification: fig. 1A is a document picture of the invoice type, fig. 1B of the menu type, and fig. 1C of the medical-case type. The embodiments of this specification also cover document pictures such as receipts and business reports; figs. 1A to 1C are only examples.
Faced with huge volumes of text data, office staff not only waste large amounts of time classifying and organizing texts manually, but also do so with high error rates and low efficiency. In conventional document picture analysis, the most common classification method is to extract the text content of the document contained in the picture and classify the picture according to that text. For example, the text content contained in the document picture is extracted through Optical Character Recognition (OCR), the text is recognized and understood with a Bidirectional Encoder Representations from Transformers (BERT) model, and keywords are extracted, so that the document picture is classified. Other document picture analysis techniques structure the layout information contained in the document picture, that is, extract the image-modality information, and classify the document picture through that information. However, these methods understand and classify the document picture through single-modality information only; they neither make effective use of the multiple modalities of information contained in a document picture nor consider the associations among those modalities and fuse them.
In one embodiment, as shown in fig. 2, a document picture classification method is proposed. The method can be implemented by means of a computer program and can run on a document picture classification device based on the von Neumann architecture. The computer program may be integrated into an application or may run as an independent tool-type application.
Specifically, the method comprises the following steps:
s101, obtaining image information and text information of the target document picture.
The text information can be understood as the information corresponding to the text content in a document picture. Text content refers to a meaningful text segment composed of Chinese or foreign-language characters; it may be of any length (sentences, paragraphs, articles) and in any language (English, Chinese, German, etc.). The information corresponding to the text content includes whatever a person skilled in the art needs to extract, such as the characters themselves, their semantics, the positions where they occur, punctuation, and the frequency of repeated words.
The image information may be understood as visual information, including the overall page style of the document picture, local image information corresponding to text regions in the document picture, image information corresponding to non-text regions, and the like. Extracting local image information for text regions captures finer-grained features, and the image information of non-text regions may also contain key information pointing to the classification of the document picture.
In still other embodiments, the image information may further include feature information indicating the number of tables in the document picture, feature information indicating the proportion of the picture area occupied by tables, feature information indicating the proportion of handwritten text among all text in the document picture, feature information indicating the proportion of printed text among all text in the document picture, and the like.
In one embodiment, before acquiring the image information and text information of the target document picture, the method further includes preprocessing the target document picture. For example, the preprocessing includes one or more of the following: image deblurring, image brightness enhancement, image contrast enhancement, image super-resolution reconstruction, and image correction. For instance, coarse orientation and small-angle skew are corrected through four-direction rotation and perspective correction of the document picture. In this embodiment, image processing enhances the quality of the document picture, improving the accuracy and information content of the text information and image information extracted from it, which in turn improves the accuracy and reliability of document picture classification.
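As a minimal sketch of such preprocessing, the following Python snippet applies brightness/contrast enhancement, light denoising, and small-angle skew correction with OpenCV. The parameter values and the specific operations chosen are illustrative assumptions, not the preprocessing pipeline of this specification.

```python
import cv2
import numpy as np

def preprocess(path: str) -> np.ndarray:
    img = cv2.imread(path)
    # Linear brightness/contrast enhancement (alpha and beta chosen ad hoc)
    img = cv2.convertScaleAbs(img, alpha=1.2, beta=10)
    # Mild denoising as a stand-in for deblurring
    img = cv2.fastNlMeansDenoisingColored(img, None, 5, 5, 7, 21)
    # Small-angle correction: estimate skew from the text foreground and rotate
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]      # angle convention varies by OpenCV version
    angle = -(90 + angle) if angle < -45 else -angle
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
```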
In one embodiment, the method for acquiring the text information and image information of the target document picture includes: segmenting the target document contained in the target document picture based on a preset segmentation unit to obtain at least one sub-text, where the preset segmentation unit includes at least one of the following: the character, the word, the sentence, or the paragraph; and acquiring the text information corresponding to each sub-text and the image information corresponding to each sub-text in the target document picture.
Specifically, the target document contained in the target document picture is acquired; the text content can be extracted, for example, through Optical Character Recognition (OCR).
Further, the target document is segmented based on the preset segmentation unit to obtain at least one sub-text. Segmenting the target document means splitting a Chinese sequence, or a sequence in another language, into Chinese words or words of that language, each sub-text corresponding to one such word. For example, the mature jieba segmentation system may be used, as sketched below; the specific segmentation method is not limited. The preset segmentation unit includes at least one of the following: the character, the word, the sentence, or the paragraph, set according to the needs of the practitioner.
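A hedged sketch of this step, assuming pytesseract as the OCR engine (with the chi_sim language pack installed) and word-level segmentation with jieba; neither tool is mandated by this specification.

```python
import jieba
import pytesseract
from PIL import Image

image = Image.open("target_document.png")   # hypothetical target document picture
raw_text = pytesseract.image_to_string(image, lang="chi_sim")

# Word-level segmentation; character, sentence, or paragraph units would be
# produced analogously with a different split granularity.
sub_texts = [w for w in jieba.lcut(raw_text) if w.strip()]
print(sub_texts[:10])
```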
In one embodiment, obtaining the at least one sub-text in the target document includes: cleaning the target document or the obtained at least one sub-text. For example, the target document is cleaned by removing stop words and special symbols, avoiding the situation where a stop word or a special symbol becomes a sub-text. Suppose that after processing the number of texts is f and the vocabulary size is C; removing stop words markedly reduces C and removes redundant information. The special symbols referred to here include punctuation marks as well as currency symbols, mathematical symbols, and the like appearing in sentences. The target document is cleaned, and the at least one sub-text is then obtained from the cleaned target document.
For another example, the obtained at least one sub-text is cleaned by counting the frequency with which particular sub-texts occur in the target document and removing them from the at least one sub-text according to that frequency. This example counts the frequency of occurrence of each particular sub-text and removes cases of extreme frequency, i.e., sub-texts that occur with a very high or a very low frequency. A sub-text that occurs with very high frequency across all documents mostly reflects features common to all documents and contributes little to the document picture classification task. A sub-text that occurs with very low frequency across all texts is rare, possibly a rare word, and cannot embody the common features of any class of document pictures, so such words are also deleted. That is, the frequency of a retained sub-text ω_i must satisfy ε_low < Freq(ω_i) < ε_high, where ε_high and ε_low are the upper and lower frequency-filtering parameters, adjusted according to the specific text data. The obtained at least one sub-text is cleaned, and the text information corresponding to each sub-text and the image information corresponding to the sub-text are then obtained from the cleaned sub-texts.
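A minimal sketch of this cleaning and extreme-frequency filtering; the stop-word list, the symbol set, and the threshold values are illustrative assumptions.

```python
from collections import Counter

STOP_WORDS = {"的", "了", "和"}            # assumed (truncated) stop-word list
SPECIAL = set("，。！？；：、$%()（）")      # assumed punctuation/symbol set

def clean_and_filter(sub_texts, eps_low=1e-5, eps_high=0.05):
    tokens = [t for t in sub_texts if t not in STOP_WORDS and t not in SPECIAL]
    freq = Counter(tokens)
    total = sum(freq.values())
    # Keep only sub-texts whose relative frequency satisfies
    # eps_low < Freq(w_i) < eps_high
    return [t for t in tokens if eps_low < freq[t] / total < eps_high]
```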
In this embodiment, by cleaning the target document or the obtained at least one sub-text, unnecessary sub-texts, i.e., characters, words, or sentences that do not help classification or even harm it, can be filtered out. This avoids extracting text information for unnecessary sub-texts, improves the efficiency of document picture classification, and improves its accuracy and reliability.
Further, according to the at least one sub-text, the text information corresponding to each sub-text is obtained, along with the image information corresponding to each sub-text in the target document picture. Specifically, to obtain the text information corresponding to each sub-text, the sub-text is one-hot encoded according to its segmentation unit. For example, for sub-texts with the word as the unit, a matrix is constructed at the word level whose columns index the deduplicated vocabulary. The matrix values are initialized to 0, and in each row the value at the position of the corresponding vocabulary index is set to 1.
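The one-hot scheme described above can be sketched as follows; treating rows as sub-text occurrences and columns as the deduplicated vocabulary is an assumption about the intended matrix layout.

```python
import numpy as np

def one_hot(sub_texts):
    # Columns index the deduplicated vocabulary, rows index the sub-texts
    vocab = {w: i for i, w in enumerate(dict.fromkeys(sub_texts))}
    mat = np.zeros((len(sub_texts), len(vocab)), dtype=np.float32)
    for row, w in enumerate(sub_texts):
        mat[row, vocab[w]] = 1.0
    return mat, vocab
```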
To acquire the image information corresponding to a sub-text in the target document picture, a ResNeXt-FPN network is used, for example, as an image encoder to extract a first feature map of the document picture. The first feature map is average-pooled to a fixed size (W×H), the pooled second feature map is expanded row by row, and a linear projection of the sub-text onto the second feature map yields the feature sequence of the image corresponding to the sub-text, i.e., the image information corresponding to the sub-text.
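A sketch of this image branch in PyTorch, with a plain ResNeXt-50 backbone standing in for the ResNeXt-FPN encoder named above; the pooled size W×H, the hidden size, and the omission of the FPN are simplifying assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnext50_32x4d

W, H, D = 7, 7, 768                         # assumed pooled size and hidden size

backbone = nn.Sequential(*list(resnext50_32x4d(weights=None).children())[:-2])
project = nn.Linear(2048, D)                # 2048 = ResNeXt-50 output channels

def image_features(picture: torch.Tensor) -> torch.Tensor:
    """picture: (B, 3, h, w) -> feature sequence (B, H*W, D)."""
    fmap = backbone(picture)                                   # first feature map
    pooled = nn.functional.adaptive_avg_pool2d(fmap, (H, W))   # second feature map, fixed W x H
    seq = pooled.flatten(2).transpose(1, 2)                    # expand row by row: (B, H*W, 2048)
    return project(seq)                                        # linear projection
```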
S102, performing multi-modal fusion processing on the text information and the image information to obtain multi-modal fusion information.
The modality information of the text level and the modality information of the image level are fused according to the associations between the different modalities to obtain the multi-modal fusion information. Because the text information and the image information are fused according to their inter-modal associations, each carries a different weight in the multi-modal fusion information, and the proportion of these weights is related to the classification information corresponding to the document picture.
For example, document pictures of some classes contain rich visual or image information, such as font type, size, and style, with obvious characteristics; examples are police notices and epidemic-control notices. When obtaining the multi-modal information corresponding to document pictures of these classes, more attention must be paid to the image information, which is assigned a higher weight. Document pictures of other classes contain rich text information whose keywords or spatial arrangement have obvious features: for example, text laid out in a table grid, with keywords usually in the first-column or first-row headings, can be classified according to those keywords and the grid layout (e.g., invoices, examination rules). When obtaining the multi-modal information corresponding to document pictures of these classes, more attention must be paid to the text information, which is assigned a higher weight.
In one embodiment, the method of obtaining the multi-modal fusion information includes: obtaining the vector representing the text information and the vector representing the corresponding image information from S101; adding a Concat layer to connect the two vectors; obtaining, for the two connected vectors, attention vectors for the text vector and the image vector through a bidirectional recurrent neural network with an attention mechanism; normalizing the class-prediction probabilities of the attention-augmented text and image vectors with a softmax function through a fully connected layer of the neural network to obtain a probability distribution over the predicted document classification information; obtaining, from that probability distribution, the weight vectors corresponding to the text vector and the image vector; and fusing the text vector and the image vector according to their respective weight vectors to obtain a fusion vector, i.e., a vector representation of the multi-modal fusion information. A condensed sketch of this fusion step follows.
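In the sketch below, the layer sizes and the exact gating of the two modalities are assumptions; it illustrates the concat-attend-weight pattern described above rather than the exact network of this specification.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        # Bidirectional recurrent layer over the two modality positions
        self.bilstm = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(dim, 1)

    def forward(self, text_vec, image_vec):
        # Concat layer: stack the two modality vectors as a length-2 sequence
        pair = torch.stack([text_vec, image_vec], dim=1)       # (B, 2, dim)
        states, _ = self.bilstm(pair)                          # (B, 2, dim)
        weights = torch.softmax(self.attn(states), dim=1)      # per-modality weights
        return (weights * pair).sum(dim=1)                     # fused vector (B, dim)
```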
In the above embodiment, the bidirectional recurrent neural network is a special long short-term memory (LSTM) layer built on the basis of the LSTM network. It contains two LSTM layers: the first consumes the input in forward order and outputs a state at each time step; the second consumes the input in reverse order and outputs a state at each time step; the two states are finally combined to obtain the complete output. The benefit of jointly learning the text information and the image information with a bidirectional recurrent network is that locally invariant information of document pictures of different classes can be learned during pre-training, and when transfer to another classification type is needed, only a small number of manually labeled document picture samples are required to train and optimize the network.
S103, obtaining the classification information of the target document picture according to the multi-modal fusion information.
The multi-modal fusion information, which fuses the text information and the image information of the target document picture, is analyzed and processed to obtain the classification information of the target document picture. Fig. 3 is a schematic flowchart of document picture classification provided by an embodiment of this specification, using the target document picture shown in fig. 1B. Text information 301 for "small fried dish" and image information 302 (containing the text content "small fried dish") are extracted from the target document picture; for the extraction method, refer to S101 above, not repeated here. Multi-modal fusion processing is then performed on the text information 301 and the image information 302 to obtain multi-modal fusion information 303; for the processing, refer to S102 above, not repeated here. According to the multi-modal fusion information, analysis tasks are executed, for example form understanding (extracting four types of semantic entities from forms in the document picture: questions, answers, titles, and others), receipt understanding (obtained by pre-training on the CORD and SROIE receipt-understanding datasets; in use, it extracts 30 types of key-information entities from the document picture, such as name, price, quantity, store name, store address, total price, and time of consumption), complex-layout long-document understanding, and other analysis tasks. According to the analysis results, the classification information 304 of the target document picture is obtained as "menu".
Compared with the related-art approach of classifying document pictures through information of a single modality only, the complementarity of information from different modalities improves the accuracy of understanding and classifying document pictures, provides good robustness, and better meets the needs of document picture classification in complex usage environments.
In one embodiment, as shown in fig. 4, a document picture classification method is proposed. The method can be implemented by means of a computer program and can run on a document picture classification device based on the von Neumann architecture. The computer program may be integrated into an application or may run as an independent tool-type application.
Specifically, the method comprises the following steps:
S201, segmenting the target document contained in the target document picture based on a preset segmentation unit to obtain at least one sub-text.
The target document contained in the target document picture is acquired, and the text content is extracted, for example, through Optical Character Recognition (OCR). Segmenting the target document means splitting a Chinese sequence, or a sequence in another language, into Chinese words or words of that language, each sub-text corresponding to one such word. The preset segmentation unit includes at least one of the following: the character, the word, the sentence, or the paragraph, set according to the needs of the practitioner.
In one embodiment, obtaining the at least one sub-text in the target document includes: cleaning the target document or the obtained at least one sub-text. For example, the target document is cleaned by removing stop words and special symbols, avoiding the situation where a stop word or a special symbol becomes a sub-text; the at least one sub-text is then obtained from the cleaned document. Cleaning the target document filters out unnecessary sub-texts, i.e., characters, words, or sentences that do not help classification or even harm it, avoids extracting text information for such sub-texts, improves the efficiency of document picture classification, and improves its accuracy and reliability.
S202, obtaining the character information corresponding to each sub-text according to the text content contained in it.
Specifically, the character information corresponding to each sub-text is obtained according to the text content it contains, and each sub-text is one-hot encoded according to its segmentation unit and character information. For example, for sub-texts with the word as the unit, a matrix corresponding to the sub-texts is constructed at the word level, with the columns indexing the deduplicated vocabulary; the matrix is initialized to 0, and in each row the value at the position of the corresponding vocabulary index is set to 1.
S203, obtaining the position information corresponding to each sub-text according to its position in the target document picture.
The position information of each sub-text, also referred to as layout information, is represented using a bounding box parallel to the coordinate axes of the target document picture, corresponding to the coordinate range covered by the sub-text in the picture. For example, the coordinates of each sub-text in the document picture are obtained from the text bounding box produced by OCR; after converting the coordinates corresponding to each sub-text into virtual coordinates, the vector representations of x, y, w, and h are computed through the corresponding embedding layers of the neural network, and the position information of each sub-text is finally represented by the vector obtained by connecting these four vector representations. Other ways of acquiring position information are also covered by this specification.
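A sketch of this position branch, assuming a 0..1000 virtual-coordinate grid and an embedding width borrowed from common layout models; both values are assumptions.

```python
import torch
import torch.nn as nn

GRID, D = 1001, 192          # assumed virtual-coordinate range and embedding width

emb_x, emb_y = nn.Embedding(GRID, D), nn.Embedding(GRID, D)
emb_w, emb_h = nn.Embedding(GRID, D), nn.Embedding(GRID, D)

def position_vector(bbox, page_w, page_h):
    """bbox = (x0, y0, x1, y1) in pixels -> concatenated position embedding."""
    x0, y0, x1, y1 = bbox
    to_grid = lambda v, page: torch.tensor(int(1000 * v / page))
    parts = [emb_x(to_grid(x0, page_w)), emb_y(to_grid(y0, page_h)),
             emb_w(to_grid(x1 - x0, page_w)), emb_h(to_grid(y1 - y0, page_h))]
    return torch.cat(parts, dim=-1)          # connect the four vector representations
```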
In the embodiments of this specification, the text information corresponding to each sub-text includes the character information obtained in S202 and the position information obtained in S203. Document pictures of some classes contain rich text information whose keywords or spatial relationships, i.e., position information, have obvious features: for example, text laid out in a table grid, with keywords usually in the first-column or first-row headings, can be classified according to those keywords and the grid layout (e.g., invoices, examination rules). Analyzing the position information of each sub-text therefore improves the accuracy and reliability of document picture classification.
S204, segmenting the target document picture according to at least one sub-text included in the target document picture to obtain a sub-picture corresponding to each sub-text, and acquiring image information corresponding to each sub-picture.
Corresponding to the coordinate range covered by each sub-text in the document picture, a partial image region that is parallel to the coordinate axes of the document picture and contains the content of the sub-text is used as the sub-picture corresponding to that sub-text, and the image information of the sub-picture is then acquired as the image information of the sub-text. For example, a ResNeXt-FPN network is used as the image encoder: a first feature map of the sub-picture is extracted and average-pooled to a fixed size (W×H); the pooled second feature map is expanded row by row; and the linear projection of the sub-text onto the second feature map yields the feature sequence of the sub-picture, i.e., the image information corresponding to the sub-picture, extracted through the ResNeXt-FPN network.
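Cutting the sub-picture for one sub-text out of the document picture before feeding it to the image branch sketched earlier can be as simple as a bounding-box crop; the margin is an assumption.

```python
from PIL import Image

def crop_sub_picture(picture: Image.Image, bbox, margin=2):
    """bbox = (x0, y0, x1, y1): the sub-text's axis-parallel bounding box."""
    x0, y0, x1, y1 = bbox
    return picture.crop((max(x0 - margin, 0), max(y0 - margin, 0),
                         x1 + margin, y1 + margin))
```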
S205, splicing the vector representing the text information and the vector representing the image information to obtain a spliced vector.
The vector representing the text information comprises a vector representing the character information and a vector representing the position information; the character-information vector, the position-information vector, and the image-information vector are spliced to obtain the spliced vector. The vectors are connected, for example, by means of a Concat layer.
Fig. 5 is a schematic flowchart of document picture classification provided by this embodiment. It includes text information 5011, with character information "clear" and position information "(1, 2)", and corresponding image information 5012; text information 5021, with character information "stew" and position information "(1, 3)", and corresponding image information 5022; text information 5031, with character information "sheep" and position information "(1, 4)", and corresponding image information 5032; and text information 5041, with character information "meat" and position information "(1, 5)", and corresponding image information 5042. It further includes text information 5051, obtained with a segmentation unit different from that of text information 5011-5041: character information "pan" and position information "(4, 5)", with corresponding image information 5052. It should be understood that the sub-texts shown in fig. 5 are only examples; any other segmented sub-texts are also covered by this specification.
The vector representing text information 5011 is connected with the vector representing image information 5012 to obtain a first spliced vector; the vector representing text information 5021 is connected with the vector representing image information 5022 to obtain a second spliced vector; the vector representing text information 5031 is connected with the vector representing image information 5032 to obtain a third spliced vector; the vector representing text information 5041 is connected with the vector representing image information 5042 to obtain a fourth spliced vector; and the vector representing text information 5051 is connected with the vector representing image information 5052 to obtain a fifth spliced vector.
S206, configuring a weight vector for the spliced vector through a fully connected layer of the neural network to obtain the multi-modal fusion information.
The modality information of the character level, the modality information of the position level, and the modality information of the image level are fused according to the associations among the different modalities to obtain the multi-modal fusion information. Because the information is fused according to these inter-modal associations, the character information, the position information, and the image information each carry a different weight in the multi-modal fusion information, and the proportion of these weights is related to the classification information corresponding to the document picture.
The neural network is obtained by training on pre-training document pictures in a training set together with the classification information corresponding to those pictures; the weight vectors corresponding to the character information, the position information, and the image information in the text information of a pre-training document picture are associated with its classification information. For example, document pictures of some classes contain rich visual or image information, such as font type, size, and style, with obvious characteristics (e.g., police notices, epidemic-control notices); when obtaining the multi-modal information corresponding to document pictures of these classes, more attention must be paid to the image information, which is assigned a higher weight. Document pictures of other classes contain rich text information whose keywords or spatial arrangement have obvious features, for example text laid out in a table grid with keywords usually in the first-column or first-row headings, so that the picture can be classified according to those keywords and the grid layout (e.g., invoices, examination rules); when obtaining the multi-modal information corresponding to document pictures of these classes, more attention must be paid to the text information, which is assigned a higher weight.
Fig. 6 is a schematic structural diagram of the neural network provided by an embodiment of this specification. A hidden Markov model based on a deep neural network, i.e., a DNN-HMM model, is adopted, and an error back-propagation algorithm is introduced on the basis of the existing neural network model for optimization, improving the recognition accuracy of the neural network model.
The deep neural network consists of an input layer, hidden layers, and an output layer, as shown in fig. 6. The input layer generally contains multiple input units. After the spliced vector is fed into an input unit, the unit uses the spliced vector and its own weight values to compute the output value passed to the hidden-layer units of the bottom hidden layer.
There are typically multiple hidden layers, each containing multiple hidden-layer units that receive input values from the units of the layer below. Each unit computes a weighted sum of those input values according to the current layer's weights and passes the result up as its output value to the hidden-layer units of the layer above.
The output layer contains multiple output units. Each output unit receives input values from the units of the topmost hidden layer, computes their weighted sum according to its own weight values, and calculates the actual output value from that sum. Based on the error between the expected output value and the actual output value, the connection weights and thresholds of each layer are adjusted along the output path, propagating backward from the output layer.
Specifically, in this embodiment an initial model is created using the DNN-HMM model with the introduced error back-propagation algorithm. After the spliced vector corresponding to the text information and the image information of a document picture is extracted, it is input into the neural network model. The training process of the neural network model generally consists of two parts, forward propagation and back propagation. During forward propagation, the spliced vector is computed from the input layer of the model through the transfer functions (also called activation functions or conversion functions) of the hidden-layer neurons (also called nodes) and passed on to the output layer, the state of each layer of neurons affecting the state of the next; the actual output value, the multi-modal fusion information, is computed at the output layer. The error between the actual output value and the expected output value is then computed, and the parameters of the neural network model, namely the weight values and thresholds of each layer, are adjusted based on this error. When training finishes, the neural network that generates the weight vector configured for the spliced vector, and hence the multi-modal fusion information, is obtained.
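A minimal sketch of the forward/back-propagation training loop described above, using a plain feed-forward network as a stand-in for the DNN(-HMM) model; the layer sizes, loss, and optimizer are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Sequential(                      # input layer -> hidden layers -> output layer
    nn.Linear(1536, 512), nn.ReLU(),        # 1536: assumed spliced-vector width
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 768),                    # 768: assumed fusion-vector width
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(spliced: torch.Tensor, expected: torch.Tensor) -> float:
    fused = model(spliced)                  # forward propagation -> actual output value
    loss = loss_fn(fused, expected)         # error between actual and expected output
    optimizer.zero_grad()
    loss.backward()                         # back-propagate the error from the output layer
    optimizer.step()                        # adjust the weights and thresholds of each layer
    return loss.item()
```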
S207, executing an analysis task on the multi-modal fusion information through a multi-modal document understanding model to obtain the classification information of the target document picture.
The multi-modal document understanding model is obtained by training on the multi-modal fusion information and the classification information in a training set. For example, the multi-modal document understanding model may use the document understanding pre-training model LayoutLM 1.0 or the new-generation document understanding pre-training model LayoutLM 2.0, and a spatially aware self-attention mechanism may also be introduced into the model to further improve its ability to understand and analyze document pictures.
The analysis tasks include at least one or more of the following: Document Layout Analysis, Visual Information Extraction, Document Image Classification, and the like. The document layout analysis task mainly performs automatic analysis, recognition, and understanding of the positional relationships of images, texts, tables, and the like in document pictures. The visual information extraction task mainly extracts entities and relations from the large amount of unstructured content in document pictures, modeling a visually rich document as a computer-vision problem and extracting information through semantic segmentation or text-box detection. Through these tasks, the document picture classification task is accomplished: the process of analyzing and recognizing a document image and assigning it to a category, such as scientific paper, resume, invoice, or receipt.
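A hedged usage sketch follows, using LayoutLM via the Hugging Face transformers library as one concrete multi-modal document understanding model matching the LayoutLM family named above; the label set and the naive box-to-token alignment are assumptions.

```python
import torch
from transformers import LayoutLMTokenizer, LayoutLMForSequenceClassification

LABELS = ["invoice", "menu", "medical record", "receipt", "business report"]  # assumed
tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForSequenceClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=len(LABELS))

def classify(words, boxes):
    """words: sub-texts; boxes: their 0..1000 virtual-coordinate bounding boxes."""
    enc = tokenizer(" ".join(words), return_tensors="pt", truncation=True)
    n = enc.input_ids.shape[1]
    # Naive alignment: one box per token, padded/truncated to the sequence
    # length (a real pipeline aligns boxes to word pieces).
    box_list = ([[0, 0, 0, 0]] + boxes)[:n]
    box_list += [[1000, 1000, 1000, 1000]] * (n - len(box_list))
    logits = model(**enc, bbox=torch.tensor(box_list).unsqueeze(0)).logits
    return LABELS[logits.argmax(-1).item()]
```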
For example, in the document picture classification flow shown in fig. 5, the multiple spliced vectors are input into the neural network 601, the multi-modal fusion information is obtained through the fully connected layer of the neural network 601, the multi-modal fusion information is then input into the multi-modal document understanding model 602, and the classification information of the target document picture is obtained through the multi-modal document understanding model 602.
Compared with the related-art approach of classifying document pictures through information of a single modality only, the complementarity of information from different modalities improves the accuracy of understanding and classifying document pictures, provides good robustness, and better meets the needs of document picture classification in complex usage environments.
The following are device embodiments that can be used to perform the method embodiments of this specification. For details not disclosed in the device embodiments, please refer to the method embodiments of this specification.
Fig. 7 is a schematic structural diagram of a document picture classification device provided by an exemplary embodiment of this specification. The document picture classification device may be implemented as all or part of a device in software, hardware, or a combination of both. The device comprises an information acquisition module 701, an information fusion module 702, and a document classification module 703.
An information acquisition module 701, configured to acquire image information and text information of a target document picture;
an information fusion module 702, configured to perform multi-modal fusion processing on the text information and the image information to obtain multi-modal fusion information; and
a document classification module 703, configured to obtain classification information of the target document picture according to the multi-modal fusion information.
In one embodiment, the information acquisition module 701 includes:
a text segmentation unit, configured to segment the target document contained in the target document picture based on a preset segmentation unit to obtain at least one sub-text, where the preset segmentation unit includes at least one of the following: the character, the word, the sentence, or the paragraph; and
an information acquisition unit, configured to acquire the text information corresponding to each sub-text and the image information corresponding to the sub-text in the target document picture.
In one embodiment, the text information includes character information;
the information acquisition unit is further configured to obtain the character information corresponding to each sub-text according to the text content it contains.
In one embodiment, the text information further includes position information;
the information acquisition unit is further configured to obtain the position information corresponding to each sub-text according to its position in the target document picture.
In one embodiment, the information acquisition unit includes:
an image segmentation subunit, configured to segment the target document picture according to the at least one sub-text it contains to obtain a sub-picture corresponding to each sub-text; and
an image acquisition subunit, configured to acquire the image information corresponding to each sub-picture.
In one embodiment, the information fusion module 702 includes:
a vector splicing unit, configured to splice the vector representing the text information and the vector representing the image information to obtain a spliced vector; and
a weight configuration unit, configured to configure a weight vector for the spliced vector through a fully connected layer of the neural network to obtain the multi-modal fusion information; the neural network is obtained by training on pre-training document pictures in a training set together with their classification information, and the weight vectors corresponding to the text information and the image information of a pre-training document picture are associated with its classification information.
In one embodiment, the document classification module 703 includes:
the analysis and classification unit is used for executing an analysis task on the multi-modal fusion information through a multi-modal document understanding model to obtain the classification information of the target document picture; the multi-modal document understanding model is obtained by training multi-modal fusion information and classification information in a training set.
In one embodiment, the analysis tasks include at least one or more of the following tasks: document layout analysis, visual information extraction and document picture classification.
In one embodiment, the document picture classification device further comprises:
and the preprocessing module is used for preprocessing the target document picture.
In one embodiment, the pre-treatment comprises at least one or more of: image deblurring, image brightness enhancement, image contrast enhancement, image super-resolution reconstruction and image correction.
Compared with the related-art approach of classifying document pictures through information of a single modality only, the complementarity of information from different modalities improves the accuracy of understanding and classifying document pictures, provides good robustness, and better meets the needs of document picture classification in complex usage environments.
It should be noted that when the document picture classification device provided by the above embodiments executes the document picture classification method, the division into the above functional modules is only used as an example; in practical applications, the functions may be assigned to different functional modules as needed, i.e., the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the document picture classification device and the document picture classification method provided by the above embodiments belong to the same concept; details of the implementation process are given in the method embodiments and are not repeated here.
The above embodiment numbers are for description only and do not indicate the relative merits of the embodiments.
An embodiment of this specification further provides a computer storage medium. The computer storage medium may store a plurality of instructions, the instructions being adapted to be loaded by a processor to execute the document picture classification method of the embodiments shown in fig. 1 to fig. 6; for the specific execution process, refer to the descriptions of those embodiments, which are not repeated here.
An embodiment of this specification further provides a computer program product storing at least one instruction. The at least one instruction is loaded by the processor to execute the document picture classification method of the embodiments shown in fig. 1 to fig. 6; for the specific execution process, refer to the descriptions of those embodiments, which are not repeated here.
Please refer to fig. 8, which provides a schematic structural diagram of an electronic device according to an embodiment of this specification. As shown in fig. 8, the electronic device 800 may include: at least one processor 801, at least one network interface 804, a user interface 803, a memory 805, and at least one communication bus 802.
The communication bus 802 is used to realize connection communication among these components.
The user interface 803 may include a Display (Display) and a Camera (Camera), and the optional user interface 803 may further include a standard wired interface and a wireless interface.
The network interface 804 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface).
The processor 801 may include one or more processing cores. Using various interfaces and lines, the processor 801 connects the various parts of the electronic device 800, and performs the various functions of the electronic device 800 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 805 and by invoking data stored in the memory 805. Optionally, the processor 801 may be implemented in at least one hardware form among Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 801 may integrate one of, or a combination of, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, applications, and so on; the GPU renders and draws the content to be displayed on the display screen; and the modem handles wireless communication. The modem may also not be integrated into the processor 801 and instead be implemented by a single chip.
The memory 805 may include Random Access Memory (RAM) or Read-Only Memory (ROM). Optionally, the memory 805 includes a non-transitory computer-readable medium. The memory 805 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 805 may include a program storage area and a data storage area: the program storage area may store instructions for implementing the operating system, instructions for at least one function (such as a touch function, a sound playing function, or an image playing function), instructions for implementing the above method embodiments, and the like; the data storage area may store the data referred to in the above method embodiments. Optionally, the memory 805 may also be at least one storage device located remotely from the processor 801. As shown in fig. 8, the memory 805, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a document picture classification application.
In the electronic device 800 shown in fig. 8, the user interface 803 mainly serves as an interface for user input, acquiring data input by the user, while the processor 801 may be configured to invoke the document picture classification application stored in the memory 805 and specifically perform the following operations:
acquiring image information and text information of a target document picture;
performing multi-modal fusion processing on the text information and the image information to obtain multi-modal fusion information;
and obtaining the classification information of the target document picture according to the multi-modal fusion information.
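For illustration only, the three operations above can be composed as in the following Python sketch; the helper names acquire_information, fuse_modalities, and classify are hypothetical stand-ins for the modules detailed below and do not appear in this disclosure:

    # Minimal pipeline sketch (hypothetical helper names, not from the disclosure).
    def classify_document_picture(picture_path):
        image_info, text_info = acquire_information(picture_path)  # OCR + image features
        fusion_info = fuse_modalities(text_info, image_info)       # multi-modal fusion
        return classify(fusion_info)                               # classification info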
In one embodiment, when acquiring the image information and the text information of the target document picture, the processor 801 specifically performs:
segmenting a target document included in the target document picture based on a preset segmentation unit to obtain at least one sub-text, where the preset segmentation unit includes at least one of the following: character units, word units, sentence units, or paragraph units;
and acquiring text information corresponding to each sub-text, and acquiring image information corresponding to each sub-text in the target document picture, as sketched below.
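A word-level segmentation could, for example, rely on an off-the-shelf OCR engine; the sketch below uses pytesseract purely as an assumed stand-in, since the disclosure does not prescribe a particular OCR tool:

    # Word-level segmentation sketch; pytesseract is an assumed OCR engine.
    import pytesseract
    from PIL import Image

    def segment_into_subtexts(picture_path):
        img = Image.open(picture_path)
        ocr = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
        sub_texts = []
        for i, word in enumerate(ocr["text"]):
            if word.strip():  # skip empty OCR tokens
                box = (ocr["left"][i], ocr["top"][i],
                       ocr["left"][i] + ocr["width"][i],
                       ocr["top"][i] + ocr["height"][i])
                sub_texts.append({"text": word, "box": box})
        return img, sub_texts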
In one embodiment, the text information includes text content information;
when acquiring the text information corresponding to each sub-text, the processor 801 specifically performs:
obtaining the text content information corresponding to each sub-text according to the text content included in that sub-text.
In one embodiment, the text information further includes position information;
after obtaining the text content information corresponding to each sub-text according to the text content included in the sub-text, the processor 801 further performs:
obtaining the position information corresponding to each sub-text according to the position of that sub-text in the target document picture, for example as sketched below.
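The disclosure does not state how the position is encoded; one common convention (an assumption here) is to normalize each sub-text's bounding box to a fixed grid so that position vectors are comparable across picture sizes:

    # Position-information sketch: normalize a bounding box to a 0-1000 grid
    # (the grid size is an illustrative assumption, not from the disclosure).
    def normalize_box(box, img_width, img_height, scale=1000):
        x0, y0, x1, y1 = box
        return (int(scale * x0 / img_width), int(scale * y0 / img_height),
                int(scale * x1 / img_width), int(scale * y1 / img_height))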
In one embodiment, when acquiring the image information corresponding to each sub-text in the target document picture, the processor 801 specifically performs:
segmenting the target document picture according to at least one sub-text included in the target document picture to obtain a sub-picture corresponding to each sub-text;
and acquiring image information corresponding to each sub-picture.
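Continuing the segmentation sketch above, the per-sub-text cropping step could look as follows (Pillow is an assumed image library):

    # Cut one sub-picture per sub-text bounding box.
    def crop_sub_pictures(img, sub_texts):
        # PIL's crop takes a (left, top, right, bottom) tuple.
        return [img.crop(st["box"]) for st in sub_texts]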
In one embodiment, when performing the multi-modal fusion processing on the text information and the image information to obtain the multi-modal fusion information, the processor 801 specifically performs:
splicing the vector representing the text information and the vector representing the image information to obtain a spliced vector;
configuring a weight vector for the spliced vector through a fully connected layer of a neural network to obtain the multi-modal fusion information; the neural network is trained on pre-training document pictures in a training set and the classification information corresponding to those pictures, and the weight vectors corresponding to the text information and the image information of a pre-training document picture each have an association relation with the classification information corresponding to that picture.
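The splice-then-weight fusion described above maps naturally onto a concatenation followed by a fully connected layer; the PyTorch sketch below is a minimal reading of that description, with all dimensions chosen as illustrative assumptions:

    import torch
    import torch.nn as nn

    class MultiModalFusion(nn.Module):
        # Sketch of the splice-then-weight fusion; hidden sizes are assumptions.
        def __init__(self, text_dim=768, image_dim=512, fusion_dim=768):
            super().__init__()
            # The fully connected layer assigns a learned weight to every
            # component of the spliced vector; training against classification
            # labels gives these weights the association described above.
            self.fc = nn.Linear(text_dim + image_dim, fusion_dim)

        def forward(self, text_vec, image_vec):
            spliced = torch.cat([text_vec, image_vec], dim=-1)  # splicing step
            return self.fc(spliced)                             # weighted fusion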
In one embodiment, when obtaining the classification information of the target document picture according to the multi-modal fusion information, the processor 801 specifically performs:
performing an analysis task on the multi-modal fusion information through a multi-modal document understanding model to obtain the classification information of the target document picture; the multi-modal document understanding model is trained on the multi-modal fusion information and classification information in a training set.
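At inference time the classification task could reduce to a head applied to the fusion vector; the following sketch merely stands in for the trained multi-modal document understanding model (the number of classes is an assumption):

    import torch.nn as nn

    class DocumentPictureClassifier(nn.Module):
        # Inference-only stand-in for the document understanding model's
        # classification task; num_classes is an illustrative assumption.
        def __init__(self, fusion_dim=768, num_classes=10):
            super().__init__()
            self.head = nn.Linear(fusion_dim, num_classes)

        def forward(self, fusion_vec):
            logits = self.head(fusion_vec)
            return logits.argmax(dim=-1)  # index of the predicted class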
In one embodiment, the analysis task includes one or more of the following tasks: document layout analysis, visual information extraction, and document picture classification.
In one embodiment, before acquiring the image information and the text information of the target document picture, the processor 801 further performs:
and preprocessing the target document picture.
In one embodiment, the preprocessing includes one or more of the following: image deblurring, image brightness enhancement, image contrast enhancement, image super-resolution reconstruction, and image correction.
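Two of these steps are sketched below with OpenCV; the parameter values are illustrative assumptions, and a simple sharpening kernel stands in for a full deblurring pipeline:

    import cv2
    import numpy as np

    def preprocess(img_bgr):
        # Brightness/contrast enhancement: out = alpha * img + beta.
        img = cv2.convertScaleAbs(img_bgr, alpha=1.2, beta=15)
        # Sharpening kernel as a lightweight stand-in for deblurring.
        kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)
        return cv2.filter2D(img, -1, kernel)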
Compared with related-art techniques that classify document pictures using information of a single modality only, the embodiments of the present disclosure exploit the complementarity of information across different modalities, which improves the accuracy of understanding and classifying document pictures, provides good robustness, and better meets the requirements of document picture classification in complex use environments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program, which may be stored in a computer readable storage medium and executed by a computer to implement the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory or a random access memory.
The above disclosure describes only preferred embodiments of the present disclosure and certainly should not be construed as limiting its scope; equivalent variations made according to the claims of the present disclosure therefore still fall within the scope of the present disclosure.

Claims (13)

1. A document picture classification method, the method comprising:
acquiring image information and text information of a target document picture;
performing multi-modal fusion processing on the text information and the image information to obtain multi-modal fusion information;
and obtaining the classification information of the target document picture according to the multi-modal fusion information.
2. The document picture classification method according to claim 1, wherein the acquiring of the image information and the text information of the target document picture includes:
segmenting a target document included in the target document picture based on a preset segmentation unit to obtain at least one sub-text; wherein the preset segmentation unit includes at least one of the following: character units, word units, sentence units, or paragraph units;
and acquiring text information corresponding to each sub-text, and acquiring image information corresponding to each sub-text in the target document picture.
3. The document picture classification method according to claim 2, wherein the text information includes text content information;
the acquiring of the text information corresponding to each sub-text includes:
obtaining the text content information corresponding to each sub-text according to the text content included in that sub-text.
4. The document picture classification method according to claim 3, wherein the text information further includes position information;
after the text content information corresponding to each sub-text is obtained according to the text content included in each sub-text, the method further includes:
obtaining the position information corresponding to each sub-text according to the position of that sub-text in the target document picture.
5. The document picture classification method according to claim 2, wherein the acquiring of the image information corresponding to each sub-text in the target document picture includes:
segmenting the target document picture according to at least one sub-text included in the target document picture to obtain a sub-picture corresponding to each sub-text;
and acquiring image information corresponding to each sub-picture.
6. The document picture classification method according to claim 1, wherein the performing multi-modal fusion processing on the text information and the image information to obtain multi-modal fusion information includes:
splicing the vector representing the text information and the vector representing the image information to obtain a spliced vector;
configuring a weight vector for the spliced vector through a fully connected layer of a neural network to obtain the multi-modal fusion information; the neural network is trained on pre-training document pictures in a training set and the classification information corresponding to those pictures, and the weight vectors corresponding to the text information and the image information of a pre-training document picture each have an association relation with the classification information corresponding to that picture.
7. The document picture classification method according to claim 1 or 6, wherein the obtaining of the classification information of the target document picture according to the multi-modal fusion information comprises:
performing an analysis task on the multi-modal fusion information through a multi-modal document understanding model to obtain the classification information of the target document picture; the multi-modal document understanding model is trained on the multi-modal fusion information and classification information in a training set.
8. The document picture classification method according to claim 7, wherein the analysis task includes one or more of the following tasks: document layout analysis, visual information extraction, and document picture classification.
9. The document picture classification method according to claim 1, wherein before the obtaining of the image information and the text information of the target document picture, the method further comprises:
and preprocessing the target document picture.
10. The document picture classification method according to claim 9, wherein the preprocessing includes one or more of the following: image deblurring, image brightness enhancement, image contrast enhancement, image super-resolution reconstruction, and image correction.
11. A document picture classification device, the device comprising:
the information acquisition module is used for acquiring image information and text information of the target document picture;
the fusion information module is used for performing multi-modal fusion processing on the text information and the image information to obtain multi-modal fusion information;
and the document classification module is used for obtaining the classification information of the target document picture according to the multi-modal fusion information.
12. A computer storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor and to carry out the method steps according to any one of claims 1 to 10.
13. An electronic device, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1 to 10.
CN202210253277.XA 2022-03-15 2022-03-15 Document picture classification method and device, storage medium and electronic equipment Active CN114780773B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210253277.XA CN114780773B (en) 2022-03-15 2022-03-15 Document picture classification method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210253277.XA CN114780773B (en) 2022-03-15 2022-03-15 Document picture classification method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN114780773A true CN114780773A (en) 2022-07-22
CN114780773B CN114780773B (en) 2024-07-02

Family

ID=82424700

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210253277.XA Active CN114780773B (en) 2022-03-15 2022-03-15 Document picture classification method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114780773B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210303939A1 (en) * 2020-03-25 2021-09-30 Microsoft Technology Licensing, Llc Processing Image-Bearing Electronic Documents using a Multimodal Fusion Framework
CN111581470A (en) * 2020-05-15 2020-08-25 上海乐言信息科技有限公司 Multi-modal fusion learning analysis method and system for dialog system context matching
CN112966522A (en) * 2021-03-03 2021-06-15 北京百度网讯科技有限公司 Image classification method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114780773B (en) 2024-07-02

Similar Documents

Publication Publication Date Title
Baviskar et al. Efficient automated processing of the unstructured documents using artificial intelligence: A systematic literature review and future directions
CN108804530B (en) Subtitling areas of an image
Davila et al. Chart mining: A survey of methods for automated chart analysis
RU2699687C1 (en) Detecting text fields using neural networks
CN111488826B (en) Text recognition method and device, electronic equipment and storage medium
US11886815B2 (en) Self-supervised document representation learning
CN112949415A (en) Image processing method, apparatus, device and medium
CN113011186B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium
Schäfer et al. Arrow R-CNN for handwritten diagram recognition
CN113761377B (en) False information detection method and device based on attention mechanism multi-feature fusion, electronic equipment and storage medium
US20230351115A1 (en) Document image processing including tokenization of non-textual semantic elements
US20230368556A1 (en) Character-based representation learning for table data extraction using artificial intelligence techniques
CN111522979B (en) Picture sorting recommendation method and device, electronic equipment and storage medium
CN115309864A (en) Intelligent sentiment classification method and device for comment text, electronic equipment and medium
CN114612921A (en) Form recognition method and device, electronic equipment and computer readable medium
Bakkali et al. EAML: ensemble self-attention-based mutual learning network for document image classification
Daggubati et al. Barchartanalyzer: Data extraction and summarization of bar charts from images
Javanmardi et al. Caps captioning: a modern image captioning approach based on improved capsule network
CN114417871A (en) Model training and named entity recognition method and device, electronic equipment and medium
CN117015807A (en) Image analysis based document processing for inference of key-value pairs in non-fixed digital documents
US20240028828A1 (en) Machine learning model architecture and user interface to indicate impact of text ngrams
Tannert et al. FlowchartQA: the first large-scale benchmark for reasoning over flowcharts
CN114780773B (en) Document picture classification method and device, storage medium and electronic equipment
Madan et al. Parsing and summarizing infographics with synthetically trained icon detection
Sun et al. Attention-based deep learning methods for document layout analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant