CN114218889A - Document processing method, document model training method, document processing device, document model training equipment and storage medium


Info

Publication number: CN114218889A
Authority: CN (China)
Prior art keywords: text, information, document, vector, processing
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN202111431086.XA
Other languages: Chinese (zh)
Inventors: 彭启明, 罗斌, 曹宇慧, 冯仕堃, 陈永锋
Current Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority: CN202111431086.XA

Classifications

    • G06F 40/10 Handling natural language data; Text processing
    • G06F 18/214 Pattern recognition; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 40/205 Natural language analysis; Parsing
    • G06F 40/30 Semantic analysis
    • G06N 3/045 Neural networks; Combinations of networks
    • G06N 3/08 Neural networks; Learning methods


Abstract

The disclosure provides a document processing method, a document model training method, and corresponding apparatuses, devices, and storage media, relating to the field of computer technology and in particular to artificial intelligence fields such as natural language processing, computer vision, and deep learning. The document processing method includes: acquiring information of at least one modality of a document to be processed, where the information of each modality includes at least one processing unit, the information of the at least one modality includes a text sequence, the processing units include text units, and text units under the same layout are arranged in the text sequence in a preset order; obtaining a representation vector of each of the at least one processing unit; and obtaining a processing result of the document to be processed based on the representation vectors of the processing units. The present disclosure can improve the document processing effect.

Description

Document processing method, document model training method, document processing device, document model training equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technology, in particular to artificial intelligence fields such as natural language processing, computer vision, and deep learning, and specifically to a document processing method, a document model training method, and corresponding apparatuses, devices, and storage media.
Background
With the advent of the digital age, documents have gradually shifted from traditional paper documents to electronic documents. To understand an electronic document, it may be processed with a document model.
As the variety of information contained in electronic documents keeps increasing, improving the document processing effect has become an urgent problem to solve.
Disclosure of Invention
The disclosure provides a method, a device and a storage medium for processing a document and training a document model.
According to an aspect of the present disclosure, a document processing method is provided, including: acquiring information of at least one modality of a document to be processed, where the information of each modality includes at least one processing unit, the information of the at least one modality includes a text sequence, the processing units include text units, and text units under the same layout are arranged in the text sequence in a preset order; obtaining a representation vector of each of the at least one processing unit; and obtaining a processing result of the document to be processed based on the representation vectors of the processing units.
According to another aspect of the present disclosure, a method for training a document model is provided, including: acquiring information of at least one modality of a document sample, where the information of each modality includes at least one processing unit, the information of the at least one modality includes a text sequence, the at least one processing unit includes text units, and text units under the same layout are arranged in the text sequence in a preset order; obtaining a representation vector of each of the at least one processing unit; obtaining a prediction result of the document sample based on the representation vectors of the processing units; constructing a loss function based on the prediction result; and training the document model based on the loss function.
According to another aspect of the present disclosure, a document processing apparatus is provided, including: a first obtaining module, configured to acquire information of at least one modality of a document to be processed, where the information of each modality includes at least one processing unit, the information of the at least one modality includes a text sequence, the processing units include text units, and text units under the same layout are arranged in the text sequence in a preset order; a second obtaining module, configured to obtain a representation vector of each of the at least one processing unit; and a third obtaining module, configured to obtain a processing result of the document to be processed based on the representation vectors of the processing units.
According to another aspect of the present disclosure, a training apparatus for a document model is provided, including: a first obtaining module, configured to acquire information of at least one modality of a document sample, where the information of each modality includes at least one processing unit, the information of the at least one modality includes a text sequence, the at least one processing unit includes text units, and text units under the same layout are arranged in the text sequence in a preset order; a second obtaining module, configured to obtain a representation vector of each of the at least one processing unit; a third obtaining module, configured to obtain a prediction result of the document sample based on the representation vectors of the processing units; a construction module, configured to construct a loss function based on the prediction result; and a training module, configured to train the document model based on the loss function.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to any one of the above aspects.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of the above aspects.
According to the technical solutions of the present disclosure, the document processing effect can be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a sixth embodiment of the present disclosure;
FIG. 7 is a schematic diagram according to a seventh embodiment of the present disclosure;
FIG. 8 is a schematic diagram according to an eighth embodiment of the present disclosure;
FIG. 9 is a schematic diagram according to a ninth embodiment of the present disclosure;
FIG. 10 is a schematic diagram of an electronic device for implementing a document processing method or a training method of a document model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure, which provides a document processing method, including:
101. Acquire information of at least one modality of a document to be processed, where the information of each modality includes at least one processing unit, the information of the at least one modality includes a text sequence, the at least one processing unit includes text units, and text units under the same layout are arranged in the text sequence in a preset order.
102. Obtain a representation vector of each of the at least one processing unit.
103. Obtain a processing result of the document to be processed based on the representation vectors of the processing units.
The execution body of this embodiment may be referred to as a document processing apparatus. The document processing apparatus may be software, hardware, or a combination of both, and may be located in an electronic device. The electronic device may be a server or a terminal device; the server may be a local server or a cloud server, and the terminal device may include: personal computers (PCs), portable computers, mobile devices (such as mobile phones and tablet computers), vehicle-mounted terminals (such as in-car units), wearable devices (such as smart watches and smart bands), smart home devices (such as smart televisions and smart speakers), and the like.
The document processing method can be applied to various scenarios, such as information extraction and document classification. Information extraction includes, for example, extracting contract numbers, times, and goods information from electronic invoices; document classification, for example, divides electronic documents into technical documents, legal documents, treaty documents, and the like.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
Taking information extraction as an example, referring to fig. 2, a user may upload a document to be processed through the terminal device 201, and the terminal device 201 may send the document to be processed to the server 202, so that the server completes information extraction of the document to be processed.
It will be appreciated that the document processing procedure may also be performed locally at the terminal device, if the capabilities of the terminal device are sufficient.
The document to be processed is an electronic document, whose format may be Word, PDF, an image, or the like.
A modality (modal) refers to a representation of information, such as text (text), image (image), video (video), and the like.
In this embodiment, the information of at least one modality includes a text sequence, that is, includes a text modality.
Further, the information of at least one modality may also be an image, i.e. may also include an image modality.
For a text sequence, the text sequence may be composed of at least one text unit.
Text units may also be referred to as subwords, words, tokens, and the like.
Taking Chinese as an example, a text unit may be a single character.
For example, the Chinese sentence meaning "the weather is good today" consists of 6 characters and is thus a text sequence composed of 6 text units.
In the related art, when extracting the text in the document, the text is generally extracted according to a preset sequence, for example, from top to bottom and from left to right.
However, in some cases, because the layout of characters in a document can differ, extracting strictly in top-to-bottom, left-to-right order may produce an incorrect reading order.
For example, referring to fig. 3, if the electronic document contains a two-column region 301 in which text 1 and text 2 form one coherent passage, then extracting directly from left to right without considering the column information yields text 1, text 3, text 2, text 4, destroying the actual reading order.
Therefore, in this embodiment, layout information is consulted when extracting text: text units within the same layout are sorted in a preset order, and text from different layouts is arranged in sequence.
For example, suppose words 1 and 2 belong to one layout (call it the first layout) and words 3 and 4 belong to another (the second layout). The words in the first layout are arranged in the preset order, the words in the second layout are arranged in the preset order, and the words of the different layouts are then spliced (splicing may also be called merging). That is, [word 1, word 2] and [word 3, word 4] are extracted, the two groups are spliced, and the resulting text sequence is [word 1, word 2, word 3, word 4].
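The following Python sketch illustrates this layout-aware arrangement. It is a minimal illustration, not the disclosed implementation: the record fields (text, layout_id, x, y) and the top-to-bottom, left-to-right key standing in for the preset order are assumptions made for the example.

```python
from itertools import groupby

def build_text_sequence(words):
    """Sort text units within each layout by an assumed preset order
    (top-to-bottom, then left-to-right), then splice the layouts in sequence."""
    words = sorted(words, key=lambda w: w["layout_id"])  # keep layouts together, in order
    sequence = []
    for _, group in groupby(words, key=lambda w: w["layout_id"]):
        # Within one layout, apply the preset reading order.
        sequence.extend(sorted(group, key=lambda w: (w["y"], w["x"])))
    return [w["text"] for w in sequence]

# The example from the text: words 1 and 2 in the first layout, words 3 and 4 in the second.
words = [
    {"text": "word 3", "layout_id": 1, "x": 10, "y": 0},
    {"text": "word 1", "layout_id": 0, "x": 0, "y": 0},
    {"text": "word 4", "layout_id": 1, "x": 10, "y": 1},
    {"text": "word 2", "layout_id": 0, "x": 0, "y": 1},
]
print(build_text_sequence(words))  # ['word 1', 'word 2', 'word 3', 'word 4']
```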
The layout refers to a format that affects the reading order, such as column, table, natural segment, and the like.
After the information of at least one modality of the document to be processed is acquired, a representation vector of the processing unit can be acquired for the processing unit of the information of each modality in the at least one modality.
After the representation vector of the processing unit is obtained, a processing result of the document to be processed can be obtained based on the representation vector of the processing unit.
Taking information extraction as an example, the processing result may be the information extracted from the document to be processed; taking document classification as an example, the processing result may be the category of the document.
In the embodiment of the disclosure, the text units in the same layout are arranged in the text sequence according to the preset sequence, so that the sequence of the text units can conform to the actual reading sequence, the semantic consistency of the text sequence is improved, and the document processing effect is further improved.
The above description is for a text sequence. In practice, not only text but also information of other modalities may be present in the document.
In practice, the document may be an image, for example, an image of a paper document may be obtained after scanning or taking a picture of the paper document, that is, the document at this time may be an image document.
For image documents, the information of at least one modality may include, in addition to the text described above (such as in particular the text sequence described above), an image of the document itself.
Accordingly, if the information of a modality is text, its processing units may be called text units (such as Chinese characters); if the information of a modality is an image, its processing units may be called image units (or image blocks).
Whether a text unit or an image unit, a respective corresponding representation vector can be obtained.
For the expression vector of the text unit, the text unit can be extracted first, and then the expression vector is obtained by adopting the embedding layer.
For the representation vector of an image unit, a visual encoder may be used: there is no need to divide the image into multiple image units beforehand; the visual encoder encodes the whole image and produces an encoding vector for each image unit in the image, and the representation vector of each image unit can then be obtained based on its encoding vector.
The text units may be extracted by Optical Character Recognition (OCR), the layout information may be obtained by a layout parser, and the text units may be arranged based on the layout information.
That is, in some embodiments, acquiring the information of at least one modality of the document to be processed includes, for the text sequence included in the information of the at least one modality: performing OCR on the document to be processed to obtain the text units in the document; performing layout analysis on the text units to obtain their layout information; and, based on the layout information, arranging the text units within the same layout in a preset order and splicing the text units of different layouts in sequence.
For example, referring to fig. 4, taking a document as an image document as an example, the image document may be input into an OCR module, and OCR is performed on the image document, and the OCR output may include: a text unit in the image document, and two-dimensional (2D) position information of the text unit.
The text units output by the OCR may be input to a layout parser (layout parser), and the layout parser may obtain layout information of the text units, and arrange the text units in the same layout according to a preset order based on the layout information, and arrange the text units in different layouts in sequence.
For example, as described above, the text unit in the first layout is [ word 1, word 2], the text unit in the second layout is [ word 3, word 4], and the text sequence is [ word 1, word 2, word 3, word 4 ].
In addition, the layout parser may also output one-dimensional (1D) position information of the text unit, segment information of the segment (segment) in which the text unit is located, layout information of the layout (layout) in which the text unit is located, and the like.
The 2D position information may be, for example, a detection box. The detection box is generally a rectangle and can be represented by four values in total: the upper-left coordinates (x1, y1) and the lower-right coordinates (x2, y2) of the rectangle.
The 1D position information may be the serial number of the text unit in the sequence, e.g., 0, 1, 2, and so on.
Segments can be delimited according to actual needs; for example, each line may be regarded as one segment, or each sentence may be regarded as one segment. Segment information may be represented by the segment's number, such as 0, 1, 2, and so on.
The layout refers to a format that affects the reading order, such as columns and tables; layout information can likewise be represented by the layout's number, such as 0, 1, 2, and so on.
The layout parser may then feed the obtained text units, together with their format information, to the text embedding layer. The format information includes, for example, one or more of the above-described 1D position information, 2D position information, segment information, and layout information.
By performing OCR and layout analysis on the document to be processed, text units that are sequentially arranged within the same layout can be obtained, improving the semantic consistency of the text sequence.
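As a concrete illustration of the pipeline just described, the sketch below bundles, for each text unit, the format information produced by OCR and layout analysis into one record ready for the text embedding layer. The class and field names, and the layout_parser callable, are hypothetical stand-ins rather than the disclosed interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class TextUnitRecord:
    """One text unit with its format information: the OCR module supplies the
    text and the 2D detection box; the layout parser supplies the 1D position,
    segment number, and layout number."""
    text: str                        # the text unit itself (e.g. one character)
    box: Tuple[int, int, int, int]   # 2D position: (x1, y1, x2, y2) of the detection box
    pos_1d: int                      # 1D position: serial number in reading order (0, 1, 2, ...)
    segment_id: int                  # segment number (e.g. one line or one sentence per segment)
    layout_id: int                   # layout number (column, table, natural paragraph, ...)

def assemble_records(ocr_words: List[dict], layout_parser: Callable) -> List[TextUnitRecord]:
    """Combine OCR output with layout analysis into inputs for the text embedding layer."""
    parsed = layout_parser(ocr_words)  # hypothetical: reorders units, attaches segment/layout ids
    return [
        TextUnitRecord(w["text"], w["box"], i, w["segment_id"], w["layout_id"])
        for i, w in enumerate(parsed)
    ]
```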
In addition, referring to fig. 4, the image document may also be input to a visual encoder, which encodes the image and outputs the encoding vectors of the image units together with the format information of the image units. Like the format information of a text unit, the format information of an image unit may include one or more of: 1D position information, 2D position information, segment information, and layout information.
Thereafter, the visual encoder may feed the encoding vectors of the image units and the format information of the image units to the image embedding layer.
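The disclosure does not fix a particular visual encoder; the following PyTorch sketch shows one common realization under stated assumptions: a small convolutional backbone encodes the whole page image, and the feature map is pooled into a fixed 7x7 grid so that each cell yields the encoding vector of one image unit (49 units, matching the example in fig. 7).

```python
import torch
import torch.nn as nn

class SimpleVisualEncoder(nn.Module):
    """Minimal stand-in for the visual encoder: encode the whole image,
    then pool into a grid of image units. Backbone and grid size are assumptions."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d((7, 7))  # 7 x 7 = 49 image units

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feat = self.pool(self.backbone(image))  # (B, dim, 7, 7)
        return feat.flatten(2).transpose(1, 2)  # (B, 49, dim): one encoding vector per image unit

encoder = SimpleVisualEncoder()
units = encoder(torch.randn(1, 3, 224, 224))  # shape: (1, 49, 256)
```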
In some embodiments, obtaining the representation vector of each of the at least one processing unit includes: obtaining a semantic representation vector and a format representation vector of each processing unit, where the format representation vector includes at least one of: a position representation vector, a segment representation vector, and a layout representation vector; and obtaining the representation vector of each processing unit based on the semantic representation vector and the format representation vector.
Wherein the position representation vector may comprise a 1D position representation vector, and/or a 2D position representation vector.
Taking as an example the case where the format representation vector includes a position representation vector, a segment representation vector, and a layout representation vector, with the position representation vector including a 1D position representation vector and a 2D position representation vector, referring to fig. 5, the embedding layers (the text embedding layer and the image embedding layer) may each include a semantic embedding layer, a 1D position embedding layer, a 2D position embedding layer, a segment embedding layer, and a layout embedding layer.
Wherein the processing of text units and image units is similar. The following description will be given taking text units as examples.
For a text unit, the semantic embedding layer may be used to convert the text unit into a semantic representation vector, the 1D position embedding layer to convert its 1D position information into a 1D position representation vector, the 2D position embedding layer to convert its 2D position information into a 2D position representation vector, the segment embedding layer to convert its segment information into a segment representation vector, and the layout embedding layer to convert its layout information into a layout representation vector.
Each embedding layer (the semantic, 1D position, 2D position, segment, and layout embedding layers) may be implemented with a deep neural network, whose specific structure can be chosen as needed.
For a text unit, the semantic embedding layer converts the text unit from text form into a semantic representation vector in vector form.
For an image unit, the visual encoder already produces an encoding vector, i.e., the input is already in vector form; however, the encoding vector generally has a different dimension from the semantic representation vector of a text unit, so a conversion network (such as a fully connected layer) can map the encoding vector to a vector with the same dimension as the text semantic representation vectors. The converted vector may be referred to as the semantic representation vector of the image unit.
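A minimal sketch of such a conversion network follows, with assumed dimensions (256 for the visual encoder output, 768 for the text semantic representation vectors):

```python
import torch
import torch.nn as nn

visual_dim, text_dim = 256, 768                 # assumed dimensions
to_semantic = nn.Linear(visual_dim, text_dim)   # the conversion network (a fully connected layer)

encoding_vectors = torch.randn(1, 49, visual_dim)  # encoding vectors from the visual encoder
image_semantic = to_semantic(encoding_vectors)     # semantic representation vectors, shape (1, 49, 768)
```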
Since the semantic embedding layer for text units and the semantic embedding layer for image units do not play exactly the same role, each can have its own model parameters.
For the other, format-related embedding layers (the 1D position, 2D position, segment, and layout embedding layers), the text units and image units may share model parameters.
After the representation vectors of the individual embedding layers are obtained, as shown in fig. 5, they may be added together to obtain the representation vector of each text unit and each image unit.
For example, for a text unit, its representation vector is obtained by the addition operation: semantic representation vector + 1D position representation vector + 2D position representation vector + segment representation vector + layout representation vector.
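The addition described above can be sketched in PyTorch as follows. The vocabulary size and coordinate range are illustrative assumptions; the maximum segment and layout numbers of 128 follow the example in fig. 7.

```python
import torch
import torch.nn as nn

class TextEmbedding(nn.Module):
    """Sketch of the text embedding layer of fig. 5: the representation vector of a
    text unit is the sum of its semantic, 1D position, 2D position, segment, and
    layout representation vectors."""

    def __init__(self, vocab=30000, max_1d=512, max_coord=1000,
                 max_seg=128, max_layout=128, dim=768):
        super().__init__()
        self.semantic = nn.Embedding(vocab, dim)
        self.pos_1d = nn.Embedding(max_1d, dim)
        self.x_embed = nn.Embedding(max_coord, dim)  # 2D position: embed each box coordinate
        self.y_embed = nn.Embedding(max_coord, dim)
        self.segment = nn.Embedding(max_seg, dim)
        self.layout = nn.Embedding(max_layout, dim)

    def forward(self, token_ids, pos_ids, boxes, segment_ids, layout_ids):
        # boxes: long tensor (..., 4) holding (x1, y1, x2, y2) of each detection box.
        pos_2d = (self.x_embed(boxes[..., 0]) + self.y_embed(boxes[..., 1]) +
                  self.x_embed(boxes[..., 2]) + self.y_embed(boxes[..., 3]))
        # The addition operation of fig. 5 over all representation vectors.
        return (self.semantic(token_ids) + self.pos_1d(pos_ids) + pos_2d +
                self.segment(segment_ids) + self.layout(layout_ids))

emb = TextEmbedding()
vectors = emb(torch.tensor([[5, 9]]), torch.tensor([[0, 1]]),
              torch.tensor([[[1, 2, 3, 4], [5, 6, 7, 8]]]),
              torch.tensor([[0, 0]]), torch.tensor([[0, 0]]))  # shape: (1, 2, 768)
```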
By introducing the segment embedding layer and the layout embedding layer, format information at more granularities is brought into document processing, the format content of the document can be learned more fully, and the document processing effect is improved.
In some embodiments, the information of the at least one modality further includes an image corresponding to the document to be processed, and obtaining the semantic representation vector of each processing unit includes: if the information of the at least one modality is the text sequence, performing semantic embedding processing on each text unit in the text sequence to obtain the semantic representation vector of each text unit; and/or, if the information of the at least one modality is the image, visually encoding the image to obtain the semantic representation vectors of the image units in the image.
As shown in fig. 4, semantic representation vectors may be obtained in different ways for information of different modalities (text and image). For text, the text units are obtained first, and a semantic representation vector is then obtained for each text unit; for an image, there is no need to first divide it into image units, and a visual encoder may process the whole image to obtain the semantic representation vector of each image unit.
By matching the way semantic representation vectors are acquired to the modality of the information, more effective semantic representation vectors can be obtained.
In some embodiments, obtaining the processing result of the document to be processed based on the representation vectors of the processing units includes: processing the representation vectors of the processing units with a spatially aware self-attention network to obtain hidden-layer encoding vectors; and decoding the hidden-layer encoding vectors to obtain the processing result of the document to be processed.
Referring to fig. 4, the representation vectors of the text units and of the image units are obtained from the text embedding layer and the image embedding layer, respectively; these representation vectors are then spliced, and the spliced vectors serve as the input to the pre-training model. Vector splicing is analogous to text splicing, i.e., merging the parts together: for example, if one vector is [0, 1] and another is [1, 1], the spliced vector is [0, 1, 1, 1].
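A one-line illustration of the splicing, with assumed sequence lengths and dimensions (512 text units and 49 image units, as in fig. 7):

```python
import torch

text_repr = torch.randn(1, 512, 768)   # representation vectors of the text units
image_repr = torch.randn(1, 49, 768)   # representation vectors of the image units
model_input = torch.cat([text_repr, image_repr], dim=1)  # (1, 561, 768), input to the pre-training model
```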
The pre-training model in fig. 4 is represented by Transformer layers. A standard Transformer includes an encoding part and a decoding part; the Transformer layers here refer specifically to the encoding part.
The output vectors of the Transformer layers (which may be called hidden-layer encoding vectors) may be input to a decoding layer, whose output is the processing result of the document to be processed.
The decoding layer may be configured according to a specific task, and the corresponding model parameters may also be determined in the training stage.
Accordingly, the processing result may also be different depending on the task. For example, if the task is information extraction, the processing result may be information extracted from the image document, or if the task is document classification, the processing result may be a classification result of the image document.
In this embodiment, the self-attention network of the Transformer layers may be a spatially aware self-attention network.
A spatially aware self-attention network explicitly introduces the spatial positional relationships between processing units (tokens).
For example, if the attention score of a conventional self-attention mechanism is denoted $\alpha_{ij}$, the attention score of the spatially aware self-attention mechanism may be expressed as $\alpha'_{ij}$:

$$\alpha'_{ij} = \alpha_{ij} + b^{(1D)}_{j-i} + b^{(2D_x)}_{x_j - x_i} + b^{(2D_y)}_{y_j - y_i}$$

where $b^{(1D)}$, $b^{(2D_x)}$, and $b^{(2D_y)}$ are learned relative-position biases, $(x_i, y_i)$ is the upper-left coordinate of the detection box giving the 2D position of the $i$-th token, and $\alpha_{ij}$ is the attention score between the $i$-th token and the $j$-th token. When determining the 2D relative-position bias, x and y may both be taken from the upper-left corner point; alternatively, x may be the x coordinate of the upper-left corner and y the y coordinate of the upper-right corner.
By adopting a spatially aware self-attention network, hidden-layer encoding vectors containing more accurate spatial position information can be obtained, further improving the document processing effect.
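The following PyTorch sketch shows one way to realize the biased attention score above. The bucketing of relative distances into a bounded range is an implementation assumption, and coordinates are assumed to be quantized to integers.

```python
import torch
import torch.nn as nn

class SpatialBias(nn.Module):
    """Add learned relative-position biases b(1D), b(2Dx), b(2Dy) to the
    ordinary attention scores, per the formula above."""

    def __init__(self, max_rel: int = 128):
        super().__init__()
        self.max_rel = max_rel
        self.bias_1d = nn.Embedding(2 * max_rel + 1, 1)
        self.bias_2d_x = nn.Embedding(2 * max_rel + 1, 1)
        self.bias_2d_y = nn.Embedding(2 * max_rel + 1, 1)

    def _bucket(self, rel: torch.Tensor) -> torch.Tensor:
        # Clamp relative distances into [-max_rel, max_rel], then shift to valid indices.
        return rel.clamp(-self.max_rel, self.max_rel) + self.max_rel

    def forward(self, scores, pos_1d, x, y):
        """scores: (N, N) attention scores alpha_ij; pos_1d, x, y: (N,) long tensors of
        1D positions and upper-left detection-box coordinates of each token."""
        rel_1d = pos_1d[None, :] - pos_1d[:, None]  # j - i
        rel_x = x[None, :] - x[:, None]             # x_j - x_i
        rel_y = y[None, :] - y[:, None]             # y_j - y_i
        bias = (self.bias_1d(self._bucket(rel_1d)) +
                self.bias_2d_x(self._bucket(rel_x)) +
                self.bias_2d_y(self._bucket(rel_y))).squeeze(-1)
        return scores + bias                        # alpha'_ij

bias = SpatialBias()
n = 4
new_scores = bias(torch.randn(n, n), torch.arange(n),
                  torch.tensor([0, 40, 80, 120]), torch.zeros(n, dtype=torch.long))
```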
The above describes a model application process, i.e. processing documents based on a model.
The following describes the training process of the model.
Fig. 6 is a schematic diagram according to a sixth embodiment of the present disclosure, which provides a method for training a document model, including:
601. Acquire information of at least one modality of a document sample, where the information of each modality includes at least one processing unit, the information of the at least one modality includes a text sequence, the at least one processing unit includes text units, and text units under the same layout are arranged in the text sequence in a preset order.
602. Obtain a representation vector of each of the at least one processing unit.
603. Obtain a prediction result of the document sample based on the representation vectors of the processing units.
604. Construct a loss function based on the prediction result.
605. Train a document model based on the loss function.
In the model training stage, the processed document may be called a document sample and the corresponding processing result a prediction result; the processing from the document sample to its prediction result follows the same principles as the model application stage and is not described again.
After obtaining the prediction result, a loss function (loss) can be constructed by using the prediction result and the label value of the document sample.
Document samples can be obtained from existing data sets, by collection, and so on; the label values of the document samples can be annotated manually or obtained from existing data sets, among other ways.
In some embodiments, obtaining the prediction result based on the representation vectors of the processing units includes: executing a plurality of tasks based on the representation vectors of the processing units to obtain a prediction result corresponding to each of the plurality of tasks, where the plurality of tasks include: text tasks, text-image tasks, and layout tasks.
In some embodiments, the text-image tasks include a fine-grained text-image matching task, the processing units include image units, and any one of the image units is randomly replaced. For the fine-grained text-image matching task, executing the plurality of tasks based on the representation vectors of the processing units to obtain the per-task prediction results includes: obtaining, based on the representation vectors of the image units, a prediction result for the fine-grained text-image matching task, where that prediction result predicts which image unit was replaced.
In some embodiments, obtaining the representation vector of each of the at least one processing unit includes: obtaining a semantic representation vector and a format representation vector of each processing unit, where the format representation vector includes at least one of: a position representation vector, a segment representation vector, and a layout representation vector; and obtaining the representation vector of each processing unit based on the semantic representation vector and the format representation vector.
In some embodiments, obtaining the prediction result of the document sample based on the representation vectors of the processing units includes: processing the representation vectors with a spatially aware self-attention network to obtain hidden-layer encoding vectors; and decoding the hidden-layer encoding vectors to obtain the prediction result of the document sample.
In some embodiments, acquiring the information of at least one modality of the document sample includes, for the text sequence included in the information of the at least one modality: performing OCR on the document sample to obtain the text units in the document sample; performing layout analysis on the text units to obtain their layout information; and, based on the layout information, arranging the text units within the same layout in a preset order and splicing the text units of different layouts in sequence.
In some embodiments, the information of the at least one modality further includes an image corresponding to the document sample, and obtaining the semantic representation vector of each processing unit includes: if the information of the at least one modality is the text sequence, performing semantic embedding processing on each text unit in the text sequence to obtain the semantic representation vector of each text unit; and/or, if the information of the at least one modality is the image, visually encoding the image to obtain the semantic representation vectors of the image units in the image.
For details of the process of predicting the result of the document sample from the document sample, reference may be made to the above embodiments.
The contents of the embedding layers can be seen in fig. 7, where Ti (i = 1, 2, ...) is a text unit and Vi (i = 1, 2, ...) is an image unit. The value 511 corresponds to an example in which the text sequence length is 512 (1D positions 0 to 511), 48 corresponds to an example in which the image document is divided into 49 image units (indexed 0 to 48), and 128 is the maximum value of the segment number and layout number. These specific values are examples; other values may be used. The addition in fig. 7 is vector addition: for the semantic embedding layer, each Ti shown in the figure is first converted into its semantic representation vector, and for the 1D position embedding layer, each 1D position number (0, 1, ...) is converted into its 1D position representation vector; the remaining embedding layers are processed similarly.
In the training phase, pre-training tasks need to be set. As shown in fig. 7, the tasks of this embodiment may include: a Masked Visual-Language Model task, a Fine-grained Text-Image Matching task, a Text-Image Alignment task, and a Word Position Prediction task.
The content of each task is as follows:
Masked Visual-Language Model: this task belongs to the text tasks. Based on the information at the input end of the document, characters on the text side are masked, and the difference between the prediction of each masked character and its actual value is minimized. In a document-image scenario, to avoid label leakage from the image side, the words to be masked are blacked out in the document image;
Fine-grained Text-Image Matching: this task belongs to the text-image interaction tasks. A small block of the image is randomly replaced, and the model predicts which block was replaced;
Text-Image Alignment: this task also belongs to the text-image interaction tasks. Text in the image is randomly blacked out, and the model predicts which words were blacked out;
Word Position Prediction: this task belongs to the layout tasks. The model predicts to which image block each word in the document belongs.
When constructing the loss function, one loss function may be constructed per task (here, the loss functions of the 4 tasks above); the per-task loss functions are then added to obtain a total loss function, and the model parameters are adjusted by minimizing the total loss function.
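A minimal sketch of this summation, with the four task names from above and an assumed uniform weighting of 1.0 per task:

```python
import torch

def total_pretraining_loss(task_losses: dict) -> torch.Tensor:
    """Add the per-task loss functions to obtain the total loss to minimize."""
    return sum(task_losses.values())

# Example loss values as they might be produced by the four pre-training task heads.
task_losses = {
    "masked_visual_language_model": torch.tensor(2.1),
    "fine_grained_text_image_matching": torch.tensor(0.7),
    "text_image_alignment": torch.tensor(0.5),
    "word_position_prediction": torch.tensor(0.9),
}
loss = total_pretraining_loss(task_losses)  # back-propagated to adjust the model parameters
```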
Training the model based on the loss function can be implemented with related-art techniques: the model parameters are adjusted through back propagation until a preset end condition is reached, and the parameters at that point are taken as the final model parameters, yielding the trained model. The end condition may be a preset number of iterations, a preset duration, convergence of the loss function, or the like.
The document model obtained in this embodiment may be called a document pre-training model. For a specific task, the pre-training model can be fine-tuned (finetune) to produce a model for that task, which is then used to process documents accordingly.
In this embodiment, the layout analysis module ensures that the spliced text conforms to the actual reading order, improving the semantic consistency of the text. The purpose of text-image interaction in a document scenario is to let the text side learn the layout characteristics in the image and the correspondence between images and text; traditional text-image matching only requires a coarse judgment of whether the text appears in the document, which is too simple. Compared with the traditional text-image matching algorithm, the fine-grained text-image matching algorithm can learn fine-grained text-image correspondences and, at the same time, greatly benefits the model's ability to learn layout. For format information at different levels, segment information (segment id) and layout information (layout id) are added to help the model learn format characteristics better. The method also includes pre-training tasks covering text, text-image interaction, and layout, comprehensively improving the model's analysis capability in these three respects.
Fig. 8 is a schematic diagram according to an eighth embodiment of the present disclosure, which provides a document processing apparatus. As shown in fig. 8, the document processing apparatus 800 includes: a first acquisition module 801, a second acquisition module 802, and a third acquisition module 803.
The first obtaining module 801 is configured to obtain information of at least one modality of a document to be processed, where the information of each modality in the information of at least one modality includes at least one processing unit, the information of at least one modality includes a text sequence, the processing unit includes a text unit, and the text units under the same layout are arranged in the text sequence according to a preset order; the second obtaining module 802 is configured to obtain a representation vector of each processing unit in the at least one processing unit; the third obtaining module 803 is configured to obtain a processing result of the document to be processed based on the representation vectors of the processing units.
In some embodiments, the second obtaining module 802 is further configured to: obtain a semantic representation vector and a format representation vector of each processing unit, where the format representation vector includes at least one of: a position representation vector, a segment representation vector, and a layout representation vector; and obtain the representation vector of each processing unit based on the semantic representation vector and the format representation vector.
In some embodiments, the third obtaining module 803 is further configured to: process the representation vectors of the processing units with a spatially aware self-attention network to obtain hidden-layer encoding vectors; and decode the hidden-layer encoding vectors to obtain the processing result of the document to be processed.
In some embodiments, the first obtaining module 801 is further configured to, for a text sequence included in the information of the at least one modality: performing OCR on the document to be processed to obtain the text unit in the document to be processed; performing layout analysis on the text unit to obtain layout information of the text unit; and sequentially splicing the text units under different layouts based on the layout information, and arranging the text units under the same layout according to a preset sequence.
In some embodiments, the information of the at least one modality further includes an image corresponding to the document to be processed, and the second obtaining module 802 is further configured to: if the information of the at least one modality is the text sequence, perform semantic embedding processing on each text unit in the text sequence to obtain the semantic representation vector of each text unit; and/or, if the information of the at least one modality is the image, visually encode the image to obtain the semantic representation vectors of the image units in the image.
In the embodiment of the disclosure, the text units in the same layout are arranged in the text sequence according to the preset sequence, so that the sequence of the text units can conform to the actual reading sequence, the semantic consistency of the text sequence is improved, and the document processing effect is further improved.
FIG. 9 is a diagram illustrating a ninth embodiment of the present disclosure, which provides a device for training a document model. As shown in fig. 9, the training apparatus 900 for document models includes: a first obtaining module 901, a second obtaining module 902, a third obtaining module 903, a building module 904, and a training module 905.
The first obtaining module 901 is configured to obtain information of at least one modality of a document sample, where the information of each modality in the information of at least one modality includes at least one processing unit, the information of at least one modality includes a text sequence, the at least one processing unit includes text units, and the text units under the same layout are arranged in the text sequence according to a preset order; the second obtaining module 902 is configured to obtain a representation vector of each processing unit in the at least one processing unit; the third obtaining module 903 is configured to obtain a prediction result of the document sample based on the representation vectors of the processing units; a construction module 904 for constructing a loss function based on the prediction result; the training module 905 is configured to train a document model based on the loss function.
In some embodiments, the third obtaining module 903 is further configured to: execute a plurality of tasks based on the representation vectors of the processing units to obtain a prediction result corresponding to each of the plurality of tasks, where the plurality of tasks include: text tasks, text-image tasks, and layout tasks.
In some embodiments, the text-image tasks include a fine-grained text-image matching task, the processing units include image units, and any one of the image units is randomly replaced. For the fine-grained text-image matching task, the third obtaining module 903 is further configured to: obtain, based on the representation vectors of the image units, a prediction result for the fine-grained text-image matching task, where that prediction result predicts which image unit was replaced.
In some embodiments, the second obtaining module 902 is further configured to: obtain a semantic representation vector and a format representation vector of each processing unit, where the format representation vector includes at least one of: a position representation vector, a segment representation vector, and a layout representation vector; and obtain the representation vector of each processing unit based on the semantic representation vector and the format representation vector.
In some embodiments, the third obtaining module 903 is further configured to: process the representation vectors of the processing units with a spatially aware self-attention network to obtain hidden-layer encoding vectors; and decode the hidden-layer encoding vectors to obtain the prediction result of the document sample.
In some embodiments, the first obtaining module 901 is further configured to, for a text sequence included in the information of the at least one modality: performing OCR on the document sample to obtain the text units in the document sample; performing layout analysis on the text unit to obtain layout information of the text unit; and sequentially splicing the text units under different layouts based on the layout information, and arranging the text units under the same layout according to a preset sequence.
In some embodiments, the information of the at least one modality further includes an image corresponding to the document sample, and the third obtaining module 903 is further configured to: if the information of the at least one modality is the text sequence, perform semantic embedding processing on each text unit in the text sequence to obtain the semantic representation vector of each text unit; and/or, if the information of the at least one modality is the image, visually encode the image to obtain the semantic representation vectors of the image units in the image.
In the embodiment of the disclosure, the text units in the same layout are arranged in the text sequence according to the preset sequence, so that the sequence of the text units can conform to the actual reading sequence, the semantic consistency of the text sequence is improved, and the processing effect of the document model is further improved.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
It is to be understood that in the disclosed embodiments, the same or similar elements in different embodiments may be referenced.
It is to be understood that "first", "second", and the like in the embodiments of the present disclosure are used for distinction only, and do not indicate the degree of importance, the order of timing, and the like.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 10 illustrates a schematic block diagram of an example electronic device 1000 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the electronic device 1000 includes a computing unit 1001, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. The RAM 1003 can also store various programs and data necessary for the operation of the electronic device 1000. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to one another by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
A number of components in the electronic device 1000 are connected to the I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and a communication unit 1009 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1009 allows the electronic device 1000 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1001 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 1001 executes the methods and processes described above, such as the document processing method or the document model training method. For example, in some embodiments, the document processing method or the document model training method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the document processing method or the document model training method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured in any other suitable way (e.g., by means of firmware) to perform the document processing method or the document model training method.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that remedies the defects of difficult management and weak service scalability in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, so long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (27)

1. A document processing method, comprising:
obtaining information of at least one modality of a document to be processed, wherein the information of each modality in the information of the at least one modality comprises at least one processing unit, the information of the at least one modality comprises a text sequence, the at least one processing unit comprises text units, and text units under the same layout are arranged in the text sequence according to a preset order;
obtaining a representation vector of each processing unit in the at least one processing unit;
and obtaining the processing result of the document to be processed based on the representation vectors of the processing units.
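By way of illustration only, the three claimed steps can be read as a minimal pipeline. The sketch below (Python) uses hypothetical names and a toy hashing stand-in for a learned embedding; the claim prescribes no particular implementation:

```python
from typing import List

def obtain_modality_info(document: str) -> List[str]:
    # Step 1: information of at least one modality; for a pure text
    # modality this is the ordered text sequence of text units.
    return document.split()

def representation_vectors(units: List[str]) -> List[List[float]]:
    # Step 2: one representation vector per processing unit
    # (a toy hashing stand-in for a learned embedding).
    return [[(hash(u) % 97) / 97.0] for u in units]

def process(document: str) -> int:
    # Step 3: obtain a processing result from the unit vectors
    # (toy result: the unit count; a real head would classify,
    # extract fields, etc.).
    return len(representation_vectors(obtain_modality_info(document)))

print(process("Invoice Total $42.00"))  # 3
```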
2. The method of claim 1, wherein said obtaining a representation vector for each of the at least one processing unit comprises:
obtaining a semantic representation vector of each processing unit and a format representation vector of each processing unit, wherein the format representation vector comprises at least one of the following: a position representation vector, a segment representation vector, and a layout representation vector;
and obtaining the representation vector of each processing unit based on the semantic representation vector and the format representation vector.
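One common way to realize this claim, assumed here for illustration, is to sum the semantic embedding with position, segment, and layout embeddings; all vocabulary sizes and dimensions below are hypothetical:

```python
import torch
import torch.nn as nn

class UnitEmbedding(nn.Module):
    # Representation vector = semantic + position + segment + layout.
    def __init__(self, vocab=30522, dim=128, max_pos=512,
                 n_segments=2, n_layouts=16):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)       # semantic representation
        self.pos = nn.Embedding(max_pos, dim)     # position representation
        self.seg = nn.Embedding(n_segments, dim)  # segment representation
        self.lay = nn.Embedding(n_layouts, dim)   # layout representation

    def forward(self, tok_ids, pos_ids, seg_ids, lay_ids):
        return (self.tok(tok_ids) + self.pos(pos_ids)
                + self.seg(seg_ids) + self.lay(lay_ids))

emb = UnitEmbedding()
ids = torch.tensor([[1, 2, 3]])
pos = torch.arange(3).unsqueeze(0)
zeros = torch.zeros_like(ids)
print(emb(ids, pos, zeros, zeros).shape)  # torch.Size([1, 3, 128])
```

Addition is only one combination rule; concatenation followed by a projection would serve equally under the claim language.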
3. The method according to claim 1 or 2, wherein the obtaining of the processing result of the document to be processed based on the representation vector of each processing unit comprises:
processing the representation vector of each processing unit based on a spatially-aware self-attention network to obtain hidden-layer encoding vectors;
and decoding the hidden-layer encoding vectors to obtain the processing result of the document to be processed.
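One plausible reading of the spatially-aware self-attention network, assumed here and not fixed by the claim, adds a learned bias derived from relative unit coordinates to the attention logits before the softmax (single head, PyTorch):

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    def __init__(self, dim=128, n_buckets=32):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.x_bias = nn.Embedding(n_buckets, 1)  # bias per relative-x bucket
        self.y_bias = nn.Embedding(n_buckets, 1)  # bias per relative-y bucket
        self.n_buckets = n_buckets

    def forward(self, h, xy):
        # h: (B, L, dim) unit representations; xy: (B, L, 2) coordinates
        q, k, v = self.q(h), self.k(h), self.v(h)
        logits = q @ k.transpose(-2, -1) / h.size(-1) ** 0.5
        rel = xy.unsqueeze(2) - xy.unsqueeze(1)   # signed pairwise offsets
        # Shift signed offsets into the bucket range before lookup.
        rel = (rel + self.n_buckets // 2).clamp(0, self.n_buckets - 1)
        logits = logits + self.x_bias(rel[..., 0]).squeeze(-1) \
                        + self.y_bias(rel[..., 1]).squeeze(-1)
        return torch.softmax(logits, -1) @ v      # hidden-layer encodings

attn = SpatialSelfAttention()
print(attn(torch.randn(1, 4, 128), torch.randint(0, 16, (1, 4, 2))).shape)
# torch.Size([1, 4, 128])
```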
4. The method according to claim 1 or 2, wherein the information of the at least one modality comprises a text sequence, and the obtaining information of at least one modality of the document to be processed comprises:
performing OCR on the document to be processed to obtain the text units in the document to be processed;
performing layout analysis on the text units to obtain layout information of the text units;
and sequentially splicing the text units under different layouts based on the layout information, and arranging the text units under the same layout according to the preset order.
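The OCR, layout-analysis, and splicing steps can be made concrete with toy data. The 50-pixel y-band rule below is a deliberately crude, hypothetical stand-in for real layout analysis:

```python
from typing import List, Tuple

def layout_analysis(boxes: List[Tuple[str, int, int]]) -> List[Tuple[str, int]]:
    # Assign each OCR'd text unit a layout id; toy rule: each distinct
    # 50-pixel y-band counts as one layout region.
    return [(text, y // 50) for text, x, y in boxes]

def splice(units: List[Tuple[str, int]]) -> List[str]:
    # Concatenate layouts in order; within a layout, text units keep
    # the preset (here: input/reading) order.
    out = []
    for layout_id in sorted({lid for _, lid in units}):
        out.extend(t for t, lid in units if lid == layout_id)
    return out

ocr = [("Name:", 10, 10), ("Alice", 80, 12), ("Total", 10, 110), ("$5", 80, 112)]
print(splice(layout_analysis(ocr)))  # ['Name:', 'Alice', 'Total', '$5']
```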
5. The method of claim 2, wherein the information of the at least one modality further comprises an image corresponding to the document to be processed, and the obtaining a semantic representation vector of each processing unit comprises:
if the information of the at least one modality is the text sequence, performing semantic embedding processing on each text unit in the text sequence to obtain a semantic representation vector of each text unit; and/or
if the information of the at least one modality is the image, visually encoding the image to obtain semantic representation vectors of the image units in the image.
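A minimal sketch of the two branches of this claim, semantic embedding for text units and visual encoding for image units; the patch-style CNN and all sizes are assumptions rather than the claimed design:

```python
import torch
import torch.nn as nn

class TextEmbedder(nn.Module):
    # Semantic embedding of text units (lookup table).
    def __init__(self, vocab=30522, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)

    def forward(self, ids):           # (B, L) -> (B, L, dim)
        return self.emb(ids)

class VisualEncoder(nn.Module):
    # Visual encoding of the page image: a small CNN whose feature-map
    # cells serve as image units (one plausible reading of the claim).
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Conv2d(3, dim, kernel_size=16, stride=16)

    def forward(self, img):           # (B, 3, H, W) -> (B, N, dim)
        f = self.conv(img)            # (B, dim, H/16, W/16)
        return f.flatten(2).transpose(1, 2)

print(TextEmbedder()(torch.tensor([[1, 2]])).shape)      # (1, 2, 128)
print(VisualEncoder()(torch.randn(1, 3, 64, 64)).shape)  # (1, 16, 128)
```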
6. A method for training a document model, comprising:
obtaining information of at least one modality of a document sample, wherein the information of each modality in the information of the at least one modality comprises at least one processing unit, the information of the at least one modality comprises a text sequence, the at least one processing unit comprises text units, and text units under the same layout are arranged in the text sequence according to a preset order;
obtaining a representation vector of each processing unit in the at least one processing unit;
obtaining a prediction result of the document sample based on the representation vectors of the processing units;
constructing a loss function based on the prediction result;
and training a document model based on the loss function.
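The claimed training steps map onto a standard supervised loop: obtain a prediction from the representation vectors, construct a loss from it, and update the document model. A toy skeleton with stand-in data follows; neither the loss nor the optimizer is fixed by the claim:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 2))
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(3):                    # toy loop over document samples
    vectors = torch.randn(8, 128)        # representation vectors (stand-in)
    labels = torch.randint(0, 2, (8,))   # supervision for the task
    prediction = model(vectors)          # prediction result per unit
    loss = loss_fn(prediction, labels)   # loss constructed from the prediction
    opt.zero_grad()
    loss.backward()
    opt.step()                           # train the document model
    print(f"step {step}: loss={loss.item():.4f}")
```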
7. The method of claim 6, wherein the obtaining a prediction result of the document sample based on the representation vectors of the processing units comprises:
executing a plurality of tasks based on the representation vectors of the processing units to obtain a prediction result corresponding to each task of the plurality of tasks, wherein the plurality of tasks comprises: a text task, an image-text task, and a layout task.
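For the multi-task setting, one conventional choice, assumed here for illustration only, attaches one head per task to the shared unit representations and sums the per-task losses:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 128
text_head = nn.Linear(dim, 1000)   # text task, e.g. masked-unit prediction
match_head = nn.Linear(dim, 2)     # image-text task, e.g. matching
layout_head = nn.Linear(dim, 16)   # layout task, e.g. region prediction

h = torch.randn(8, dim)            # shared unit representations (stand-in)
loss = (F.cross_entropy(text_head(h), torch.randint(0, 1000, (8,)))
        + F.cross_entropy(match_head(h), torch.randint(0, 2, (8,)))
        + F.cross_entropy(layout_head(h), torch.randint(0, 16, (8,))))
print(loss.item())                 # combined multi-task loss
```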
8. The method of claim 7, wherein the image-text task comprises a fine-grained image-text matching task, the processing units comprise image units, any one of the image units is randomly replaced, and, for the fine-grained image-text matching task, the executing a plurality of tasks based on the representation vectors of the processing units to obtain a prediction result corresponding to each task of the plurality of tasks comprises:
and obtaining a prediction result corresponding to the fine-grained image-text matching task based on the representation vectors of the image units, wherein the prediction result corresponding to the fine-grained image-text matching task is used for predicting the replaced image unit.
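The fine-grained image-text matching objective can be sketched as replacement detection: corrupt one randomly chosen image unit and train a head to predict, from each unit's representation vector, which unit was replaced. Every name and size below is an illustrative assumption:

```python
import torch
import torch.nn as nn

def corrupt_one(image_units: torch.Tensor):
    # image_units: (N, dim). Replace one randomly chosen image unit
    # with a random vector; label 1 marks the replaced unit.
    n = image_units.size(0)
    idx = torch.randint(0, n, (1,)).item()
    corrupted = image_units.clone()
    corrupted[idx] = torch.randn_like(corrupted[idx])
    labels = torch.zeros(n, dtype=torch.long)
    labels[idx] = 1
    return corrupted, labels

head = nn.Linear(128, 2)        # per-unit head: replaced / not replaced
loss_fn = nn.CrossEntropyLoss()

units = torch.randn(6, 128)     # representation vectors of image units
x, y = corrupt_one(units)
print(loss_fn(head(x), y).item())
```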
9. The method of any of claims 6-8, wherein the obtaining the representation vector for each of the at least one processing unit comprises:
obtaining a semantic representation vector of each processing unit and a format representation vector of each processing unit, wherein the format representation vector comprises at least one of the following: a position representation vector, a segment representation vector, and a layout representation vector;
and obtaining the representation vector of each processing unit based on the semantic representation vector and the format representation vector.
10. The method according to any one of claims 6-8, wherein the obtaining a prediction result of the document sample based on the representation vectors of the processing units comprises:
processing the representation vector of each processing unit based on a spatially-aware self-attention network to obtain hidden-layer encoding vectors;
and decoding the hidden-layer encoding vectors to obtain the prediction result of the document sample.
11. The method according to any one of claims 6-8, wherein the information of the at least one modality comprises a text sequence, and the obtaining information of at least one modality of the document sample comprises:
performing OCR on the document sample to obtain the text units in the document sample;
performing layout analysis on the text units to obtain layout information of the text units;
and sequentially splicing the text units under different layouts based on the layout information, and arranging the text units under the same layout according to the preset order.
12. The method of claim 9, wherein the information of the at least one modality further comprises an image corresponding to the document sample, and the obtaining a semantic representation vector of each processing unit comprises:
if the information of the at least one modality is the text sequence, performing semantic embedding processing on each text unit in the text sequence to obtain a semantic representation vector of each text unit; and/or
if the information of the at least one modality is the image, visually encoding the image to obtain semantic representation vectors of the image units in the image.
13. A document processing apparatus comprising:
the document processing device comprises a first obtaining module, a second obtaining module and a processing module, wherein the first obtaining module is used for obtaining information of at least one mode of a document to be processed, the information of each mode in the information of the at least one mode comprises at least one processing unit, the information of the at least one mode comprises a text sequence, the processing units comprise text units, and the text units under the same layout are arranged in the text sequence according to a preset sequence;
a second obtaining module, configured to obtain a representation vector of each processing unit in the at least one processing unit;
and a third obtaining module, configured to obtain a processing result of the document to be processed based on the representation vectors of the processing units.
14. The apparatus of claim 13, wherein the second obtaining module is further configured to:
obtaining a semantic representation vector of each processing unit and a format representation vector of each processing unit, wherein the format representation vector comprises at least one of the following: a position representation vector, a segment representation vector, and a layout representation vector;
and obtaining the representation vector of each processing unit based on the semantic representation vector and the format representation vector.
15. The apparatus of claim 13 or 14, wherein the third obtaining module is further configured to:
processing the representation vector of each processing unit based on a spatially-aware self-attention network to obtain hidden-layer encoding vectors;
and decoding the hidden-layer encoding vectors to obtain the processing result of the document to be processed.
16. The apparatus of claim 13 or 14, wherein the information of the at least one modality comprises a text sequence, and the first obtaining module is further configured to:
performing OCR on the document to be processed to obtain the text units in the document to be processed;
performing layout analysis on the text units to obtain layout information of the text units;
and sequentially splicing the text units under different layouts based on the layout information, and arranging the text units under the same layout according to the preset order.
17. The apparatus of claim 14, wherein the information of the at least one modality further comprises an image corresponding to the document to be processed, and the second obtaining module is further configured to:
if the information of the at least one modality is the text sequence, performing semantic embedding processing on each text unit in the text sequence to obtain a semantic representation vector of each text unit; and/or
if the information of the at least one modality is the image, visually encoding the image to obtain semantic representation vectors of the image units in the image.
18. A device for training a document model, comprising:
the document sample processing device comprises a first obtaining module, a second obtaining module and a processing module, wherein the first obtaining module is used for obtaining information of at least one mode of a document sample, the information of each mode in the information of at least one mode comprises at least one processing unit, the information of at least one mode comprises a text sequence, the at least one processing unit comprises text units, and the text units under the same layout are arranged in the text sequence according to a preset sequence;
a second obtaining module, configured to obtain a representation vector of each processing unit in the at least one processing unit;
a third obtaining module, configured to obtain a prediction result of the document sample based on the representation vectors of the processing units;
a construction module, configured to construct a loss function based on the prediction result;
and a training module, configured to train a document model based on the loss function.
19. The apparatus of claim 18, wherein the third obtaining module is further configured to:
executing a plurality of tasks based on the representation vectors of the processing units to obtain a prediction result corresponding to each task of the plurality of tasks, wherein the plurality of tasks comprises: a text task, an image-text task, and a layout task.
20. The apparatus of claim 19, wherein the image-text task comprises a fine-grained image-text matching task, the processing units comprise image units, any one of the image units is randomly replaced, and, for the fine-grained image-text matching task, the third obtaining module is further configured to:
and obtaining a prediction result corresponding to the fine-grained image-text matching task based on the representation vectors of the image units, wherein the prediction result corresponding to the fine-grained image-text matching task is used for predicting the replaced image unit.
21. The apparatus of any of claims 18-20, wherein the second obtaining module is further configured to:
obtaining a semantic representation vector of each processing unit and a format representation vector of each processing unit, wherein the format representation vector comprises at least one of the following: a position representation vector, a segment representation vector, and a layout representation vector;
and obtaining the representation vector of each processing unit based on the semantic representation vector and the format representation vector.
22. The apparatus of any of claims 18-20, wherein the third obtaining module is further configured to:
processing the representation vector of each processing unit based on a spatially-aware self-attention network to obtain hidden-layer encoding vectors;
and decoding the hidden-layer encoding vectors to obtain the prediction result of the document sample.
23. The apparatus according to any one of claims 18-20, wherein the information of the at least one modality comprises a text sequence, and the first obtaining module is further configured to:
performing OCR on the document sample to obtain the text units in the document sample;
performing layout analysis on the text units to obtain layout information of the text units;
and sequentially splicing the text units under different layouts based on the layout information, and arranging the text units under the same layout according to the preset order.
24. The apparatus of claim 19, wherein the information of the at least one modality further comprises an image corresponding to the document sample, and the third obtaining module is further configured to:
if the information of the at least one modality is the text sequence, performing semantic embedding processing on each text unit in the text sequence to obtain a semantic representation vector of each text unit; and/or
if the information of the at least one modality is the image, visually encoding the image to obtain semantic representation vectors of the image units in the image.
25. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-12.
26. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-12.
27. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-12.
CN202111431086.XA 2021-11-29 2021-11-29 Document processing method, document model training method, document processing device, document model training equipment and storage medium Pending CN114218889A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111431086.XA CN114218889A (en) 2021-11-29 2021-11-29 Document processing method, document model training method, document processing device, document model training equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111431086.XA CN114218889A (en) 2021-11-29 2021-11-29 Document processing method, document model training method, document processing device, document model training equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114218889A 2022-03-22

Family

ID=80698764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111431086.XA Pending CN114218889A (en) 2021-11-29 2021-11-29 Document processing method, document model training method, document processing device, document model training equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114218889A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114792424A (en) * 2022-05-30 2022-07-26 北京百度网讯科技有限公司 Document image processing method and device and electronic equipment
CN116029279A (en) * 2023-03-28 2023-04-28 深圳前海环融联易信息科技服务有限公司 Method, device, equipment and medium for analyzing log-in attachment based on multi-mode model
CN117573839A (en) * 2024-01-12 2024-02-20 阿里云计算有限公司 Document retrieval method, man-machine interaction method, electronic device and storage medium
CN117573839B (en) * 2024-01-12 2024-04-19 阿里云计算有限公司 Document retrieval method, man-machine interaction method, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN113657390B (en) Training method of text detection model and text detection method, device and equipment
CN114821622B (en) Text extraction method, text extraction model training method, device and equipment
US20220270382A1 (en) Method and apparatus of training image recognition model, method and apparatus of recognizing image, and electronic device
CN114218889A (en) Document processing method, document model training method, document processing device, document model training equipment and storage medium
CN113204615B (en) Entity extraction method, device, equipment and storage medium
US20220284218A1 (en) Video classification method, electronic device and storage medium
CN113657274B (en) Table generation method and device, electronic equipment and storage medium
EP3876197A2 (en) Portrait extracting method and apparatus, electronic device and storage medium
CN114429637B (en) Document classification method, device, equipment and storage medium
CN114863437B (en) Text recognition method and device, electronic equipment and storage medium
CN114494784A (en) Deep learning model training method, image processing method and object recognition method
CN113408251A (en) Layout document processing method and device, electronic equipment and readable storage medium
CN115982376A (en) Method and apparatus for training models based on text, multimodal data and knowledge
CN113657395A (en) Text recognition method, and training method and device of visual feature extraction model
CN114821255A (en) Method, apparatus, device, medium and product for fusion of multimodal features
CN114445826A (en) Visual question answering method and device, electronic equipment and storage medium
CN114661904B (en) Method, apparatus, device, storage medium, and program for training document processing model
CN116863017A (en) Image processing method, network model training method, device, equipment and medium
CN115565186A (en) Method and device for training character recognition model, electronic equipment and storage medium
CN115376137A (en) Optical character recognition processing and text recognition model training method and device
CN114973333A (en) Human interaction detection method, human interaction detection device, human interaction detection equipment and storage medium
CN115035351A (en) Image-based information extraction model, method, device, equipment and storage medium
CN114972910A (en) Image-text recognition model training method and device, electronic equipment and storage medium
CN113903071A (en) Face recognition method and device, electronic equipment and storage medium
CN114419327A (en) Image detection method and training method and device of image detection model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination