CN114495113A - Text classification method and training method and device of text classification model


Publication number: CN114495113A
Authority: CN (China)
Prior art keywords: text, features, feature, fields, field
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN202210154579.1A
Other languages: Chinese (zh)
Inventors: 王佳阳, 何烩烩, 向宇波, 苏崔聪, 沈俊宇, 刘明浩
Current Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210154579.1A, published as CN114495113A

Classifications

    • G06F16/35 — Information retrieval of unstructured textual data; clustering; classification
    • G06F18/22 — Pattern recognition; matching criteria, e.g. proximity measures
    • G06F18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio
    • G06F18/253 — Fusion techniques of extracted features
    • G06F40/30 — Handling natural language data; semantic analysis
    • G06N3/044 — Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/045 — Neural networks; combinations of networks
    • G06N3/084 — Learning methods; backpropagation, e.g. using gradient descent

Abstract

The disclosure provides a text classification method, a training method and apparatus for a text classification model, an electronic device, and a storage medium. It relates to the field of artificial intelligence, in particular to character recognition, deep learning, and image processing, and can be applied to scenarios such as document information extraction. The specific implementation scheme of the text classification method is as follows: determining the text feature of each of a plurality of fields according to the image features of an image to be processed and the plurality of fields included in the image; determining the structural feature of each field according to the plurality of text features of the fields and the plurality of position information of the fields in the image; and determining the category of each field according to its text feature and structural feature.

Description

Text classification method and training method and device of text classification model
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and in particular to character recognition, deep learning, and image processing, and can be applied to scenarios such as document information extraction. More particularly, it relates to a text classification method, a training method for a text classification model, an apparatus, an electronic device, and a storage medium.
Background
With the development of computer technology and network technology, deep learning technology has been widely used in many fields. For example, a deep learning technique may be employed to extract documents included in an image to enable electronic storage of the documents.
Disclosure of Invention
The present disclosure provides a text classification method, a training method and apparatus for a text classification model, an electronic device, and a storage medium, which improve detection accuracy.
According to an aspect of the present disclosure, there is provided a text classification method including: determining the text feature of each of a plurality of fields according to the image features of an image to be processed and the plurality of fields included in the image; determining the structural feature of each field according to the plurality of text features of the fields and the plurality of position information of the fields in the image; and determining the category of each field according to its text feature and structural feature.
According to another aspect of the present disclosure, a training method for a text classification model is provided, wherein the text classification model includes a text feature extraction network, a graph convolution network, and a category prediction network. The training method includes: inputting the image features of a sample image and a plurality of fields included in the sample image into the text feature extraction network to obtain the text feature of each of the plurality of fields, the sample image further including first information indicating the actual category of each field; determining graph features for the fields by the graph convolution network according to the plurality of text features of the fields and the plurality of position information of the fields in the sample image, the graph features including the structural feature of each field; inputting the plurality of text features and the plurality of structural features of the fields into the category prediction network to obtain the predicted category of each field; and training the text classification model according to the predicted categories and the actual categories.
According to another aspect of the present disclosure, there is provided a text classification apparatus including: a first text feature extraction module configured to determine the text feature of each of a plurality of fields according to the image features of an image to be processed and the plurality of fields included in the image; a structural feature determining module configured to determine the structural feature of each field according to the plurality of text features of the fields and the position information of the fields in the image; and a category determining module configured to determine the category of each field according to its text feature and structural feature.
According to another aspect of the present disclosure, there is provided a training apparatus for a text classification model, wherein the text classification model includes a text feature extraction network, a graph convolution network, and a category prediction network. The training apparatus includes: a second text feature extraction module configured to input the image features of a sample image and the plurality of fields included in the sample image into the text feature extraction network to obtain the text feature of each field, the sample image further including first information indicating the actual category of each field; a graph feature determining module configured to determine graph features for the fields by the graph convolution network according to the plurality of text features of the fields and the position information of the fields in the sample image, the graph features including the structural feature of each field; a category prediction module configured to input the plurality of text features and the plurality of structural features of the fields into the category prediction network to obtain the predicted category of each field; and a first training module configured to train the text classification model according to the predicted categories and the actual categories.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the text classification method and/or the training method of the text classification model provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a text classification method and/or a training method of a text classification model provided by the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising computer programs/instructions which, when executed by a processor, implement the text classification method and/or the training method of a text classification model provided by the present disclosure.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of an application scenario of a method and an apparatus for training a text classification method and/or a text classification model according to an embodiment of the present disclosure;
FIG. 2 is a flow diagram of a method of training a text classification model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of the principle of determining text characteristics of each of a plurality of fields according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of the principle of determining the structural characteristics of each field according to an embodiment of the present disclosure;
FIG. 5 is a flow diagram of a method of training a text classification model according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a training text classification model according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of a structure of a text classification apparatus according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of a training apparatus for a text classification model according to an embodiment of the present disclosure; and
FIG. 9 is a block diagram of an electronic device for implementing a text classification method and/or a training method of a text classification model according to embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The present disclosure provides a text classification method, which includes a text feature extraction stage, a structural feature extraction stage, and a category determination stage. In the text feature extraction stage, the text feature of each field in a plurality of fields is determined according to the image feature of the image to be processed and the plurality of fields included in the image to be processed. In the structural feature extraction stage, the structural feature of each field is determined according to a plurality of text features of the fields and a plurality of position information of the fields in the image to be processed. In the category determining stage, the category of each field is determined according to the text characteristics of each field and the structural characteristics of each field.
An application scenario of the method and apparatus provided by the present disclosure will be described below with reference to fig. 1.
Fig. 1 is a schematic view of an application scenario of a text classification method and a training method and device of a text classification model according to an embodiment of the present disclosure.
As shown in fig. 1, the application scenario 100 of this embodiment may include an electronic device 110, and the electronic device 110 may be various electronic devices with processing functionality, including but not limited to a smartphone, a tablet, a laptop, a desktop computer, a server, and so on.
The electronic device 110 may, for example, perform text detection on the input image 120 resulting in a plurality of fields. Subsequently, the electronic device 110 may further classify the plurality of fields and store the plurality of fields according to the classification result, thereby implementing electronic storage and structured storage of the text in the image 120. In particular, the electronic device 110 may employ a word recognition technique (e.g., OCR, etc.) to detect text, resulting in multiple fields. The electronic device 110 may classify the fields by, for example, semantically recognizing the plurality of fields. Alternatively, the electronic device 110 may integrate the image information of the image 120 and semantic information of a plurality of fields to classify the fields.
In an embodiment, the electronic device 110 may use the identified fields and their respective categories as the detection result 130. Alternatively, the electronic device may pair the fields according to their categories and use the paired key-value pairs as the detection result 130. The electronic device 110 may also store the detection results.
In an embodiment, the electronic device 110 may employ a text classification model 140 to classify the plurality of fields. The text classification model 140 may be trained, for example, by the server 150. The electronic device 110 may be communicatively coupled to the server 150 via a network to send a model acquisition request to the server 150. Accordingly, the server 150 may send the trained text classification model 140 to the electronic device 110 in response to the request.
In one embodiment, the electronic device 110 may also send the input image 120 to the server 150, detect fields in the image 120 by the server 150, and classify the detected fields based on the trained text classification model 140.
In an embodiment, the text classification model may be constructed based on either of the following detection methods: a template matching-based method or a field search-based method. In the template matching-based method, format images are preset as templates according to experience, the positions from which information needs to be extracted are set for each template, the image to be detected is matched against each template, and the text in the image to be detected is detected according to the matching result to obtain the detection result. In the field search-based method, the fields that need to be extracted are enumerated according to experience, and the paragraph positions where those fields may appear are set; the fields identified by character recognition technology are then searched according to the enumerated fields and the set paragraph positions.
In an embodiment, the text classification model may be constructed based on any one of the following models: an object detection model, a Natural Language Processing (NLP) model, a graph convolution network model, and the like. A text classification model constructed based on an object detection model can detect character regions in the image to be processed, classify the character regions, and then recognize the characters in those regions using character recognition technology. A text classification model constructed based on a natural language processing model can convert the recognition result obtained by character recognition of the image to be processed into a document, and then perform sequence labeling on the document. A text classification model constructed based on a graph convolution network model can feed the image to be processed and the recognition result obtained by character recognition into the graph convolution network, and input the output of the graph convolution network into a recurrent network or the like to sequence-label the text in the image to be processed.
It should be noted that the text classification method provided by the present disclosure may be executed by the electronic device 110, and may also be executed by the server 150. Accordingly, the text classification apparatus provided by the present disclosure may be disposed in the electronic device 110, and may also be disposed in the server 150. The training method of the text classification model provided by the present disclosure may be performed by the server 150. Accordingly, the training device of the text classification model provided by the present disclosure may be disposed in the server 150.
It should be understood that the number and type of electronic devices 110 and servers 150 in fig. 1 are merely illustrative. There may be any number and type of electronic devices 110 and servers 150, as desired for an implementation.
The text classification method provided by the present disclosure will be described in detail below with reference to figs. 2 to 4, in conjunction with fig. 1.
As shown in fig. 2, the text classification method 200 of this embodiment may include operations S210 to S230.
In operation S210, a text feature of each of a plurality of fields is determined according to an image feature of an image to be processed and the plurality of fields included in the image to be processed.
According to an embodiment of the present disclosure, the image to be processed may be, for example, an arbitrary image including text. The image may be obtained, for example, by a screenshot, or may be obtained by taking a paper document. The paper document may be, for example, a ticket, a receipt, a test paper, a book, etc., which is not limited in this disclosure. In this embodiment, the character recognition technology may be adopted to perform character recognition on the image to be processed, so as to obtain the plurality of fields included in the image to be processed and the position information of each field in the plurality of fields.
According to the embodiment of the disclosure, the image to be processed can be input into the image feature extraction network, so that the image features of the image to be processed can be obtained. Or, the image features may be extracted by using Histogram of Oriented Gradients (HOG) algorithm, Local Binary Pattern (LBP) operator, and the like. The image feature extraction network may include, for example, a convolutional neural network, which is not limited in this disclosure.
This embodiment can fuse the image features with the semantic features of the fields to obtain the text feature of each field. For example, the image features and the semantic features of each field may be concatenated, and the concatenated features used as the text feature of that field. It will be appreciated that one text feature is obtained for each of the plurality of fields, so that a plurality of text features corresponding one-to-one to the plurality of fields are obtained in total.
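The fusion-by-concatenation step above can be sketched as follows. This is an illustrative sketch, not the patent's implementation; the function names and the use of plain Python lists as feature vectors are assumptions.

```python
# Hypothetical sketch: each field's text feature is formed by concatenating
# the image sub-feature extracted for that field with the field's semantic
# (text embedding) feature, yielding one fused feature per field.

def fuse_field_features(image_sub_feature, text_embedding):
    """Concatenate one field's image sub-feature with its text embedding."""
    return list(image_sub_feature) + list(text_embedding)

def fuse_all_fields(image_sub_features, text_embeddings):
    """One fused text feature per field, one-to-one with the input fields."""
    return [fuse_field_features(img, txt)
            for img, txt in zip(image_sub_features, text_embeddings)]
```

In a real model the concatenated vector would typically be refined further (the embodiment below feeds it through a feature extraction network), but the one-to-one correspondence between fields and text features holds either way.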
In operation S220, a structural feature of each field is determined according to a plurality of text features of the plurality of fields and a plurality of position information of the plurality of fields in the image to be processed.
According to an embodiment of the present disclosure, the position information may be represented by, for example, the position of a bounding box surrounding each field in the image to be processed. The embodiment may extract the structural feature of the field from the plurality of text features using the position information as reference information. The structural features can be extracted from the text features, because the image features corresponding to the fields are fused in the text features, and the image features can reflect the structures of the characters in the fields to a certain extent. It is understood that one location information may be obtained through the aforementioned operation S210 for each of the plurality of fields, and then a plurality of location information corresponding to the plurality of fields one-to-one may be obtained in total for the plurality of fields.
In one embodiment, a graph convolution network may be employed to extract the structural feature of each field. Specifically, the embodiment may input the position information and the text features into a Graph Convolutional Network (GCN), and output the structural feature of each field via the graph convolution network.
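The core propagation step of a GCN over field nodes can be sketched as below. This is a minimal stand-in, assuming each field is a graph node with a feature vector and that edge weights (here, a generic adjacency matrix) encode field relations; the patent's actual network, weights, and normalization are not specified at this level of detail.

```python
# Minimal sketch of one graph-convolution propagation step: each field
# node aggregates its neighbours' features (including its own), weighted
# by the adjacency matrix, and normalizes by the total edge weight.

def gcn_layer(node_features, adjacency):
    """node_features: list of n feature vectors; adjacency: n x n weights.
    Returns the propagated feature vector for each node."""
    n = len(node_features)
    dim = len(node_features[0])
    out = []
    for i in range(n):
        total_w = sum(adjacency[i]) or 1.0  # avoid division by zero
        agg = [0.0] * dim
        for j in range(n):
            w = adjacency[i][j]
            for d in range(dim):
                agg[d] += w * node_features[j][d]
        out.append([v / total_w for v in agg])
    return out
```

Stacking several such steps lets each field's structural feature absorb layout information from increasingly distant fields, which is why a GCN suits document-layout learning.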
In operation S230, a category of each field is determined according to the text feature of each field and the structural feature of each field.
According to the embodiment of the disclosure, the text features and the structural features can be spliced and input into a category prediction network formed by a recurrent neural network and a conditional random field network, and the category of each field is output by the category prediction network. The recurrent neural network may be, for example, a Bi-directional Long-Short Term Memory (Bi-LSTM) network, which is not limited in this disclosure.
According to an embodiment of the present disclosure, the category of each field may be, for example, any one of a plurality of predetermined categories, which may include, for example: time, place, person, and the like. The plurality of categories may also indicate which part of a key-value pair each field represents. The categories of data in key-value-pair form include a key category (Key) and a value category (Value). It is understood that the preset categories may be set according to the actual scenario, and the disclosure is not limited thereto.
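The classification step above (concatenate text and structural features, then predict a category) can be sketched with a toy linear scorer. Note the patent uses a Bi-LSTM + CRF prediction network; this linear argmax classifier is a simplified stand-in, and the category names and weight vectors are invented for illustration.

```python
# Hypothetical sketch of the final step: each field's text feature and
# structural feature are concatenated and scored against per-category
# weight vectors; the highest-scoring category wins.

CATEGORIES = ["key", "value"]  # e.g. the key/value classes mentioned above

def classify_field(text_feature, structural_feature, weights):
    """weights: dict mapping category name -> weight vector whose length
    equals len(text_feature) + len(structural_feature)."""
    combined = list(text_feature) + list(structural_feature)
    scores = []
    for cat in CATEGORIES:
        w = weights[cat]
        scores.append(sum(a * b for a, b in zip(combined, w)))
    return CATEGORIES[scores.index(max(scores))]
```

A sequence model such as Bi-LSTM + CRF improves on this by scoring fields jointly rather than independently, which matters when neighbouring fields constrain each other (e.g. a value usually follows a key).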
According to the text classification method, when the fields are classified, the text characteristics and the structural characteristics of the fields are comprehensively considered. The text features are obtained according to the fields and the image features, and the structural features are obtained according to the text features and the position information of the fields, so that the field classification results are obtained by comprehensively considering the image information, the semantic information and the layout information, the precision of the classification results can be improved to a certain extent, and the precision of identifying the fields in the image to be processed is improved. When the image to be processed is a shot image such as a receipt, good basic information can be provided for the extraction and storage of the structured information by the text classification method of the embodiment.
FIG. 3 is a schematic diagram illustrating the principle of determining a text characteristic of each of a plurality of fields according to an embodiment of the present disclosure.
According to the embodiment of the present disclosure, as shown in fig. 3, when determining the text feature of each field, for each field, the embodiment 300 may first determine the image sub-feature 321 for each field in the image feature 320 according to the position information 310 of each field in the image to be processed. The image sub-features 321 are then fused with the text-embedded features 330 of each field to obtain the text features of each field.
For example, the image feature 320 may be obtained by performing a convolution operation on the image to be processed 301, so that the resulting image feature 320 has a mapping relationship with the image to be processed 301. The embodiment may determine, according to this mapping relationship, the image feature corresponding to the pixels at the position indicated by the position information in the image to be processed 301, and use that corresponding image feature as the image sub-feature 321 for the field. Alternatively, the embodiment may sample the image feature 320 according to the position information to obtain the image sub-feature 321 for the field. Specifically, a Region-of-Interest Pooling (ROI Pooling) method may be adopted to align the bounding box surrounding the field in the image to be processed 301 with the image feature 320, based on the coordinates of the bounding box indicated by the position information, and the partial image feature aligned with the bounding box is used as the image sub-feature 321 for the field.
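The bounding-box-to-feature-map alignment can be illustrated with a simple nearest-cell crop. This is only a sketch of the idea; a real implementation would use ROI Pooling or ROI Align to resample to a fixed output size, and all names and the coordinate convention are assumptions.

```python
# Illustrative sketch: map a field's bounding box from image pixel
# coordinates into feature-map cell coordinates and crop the aligned
# sub-grid, giving the image sub-feature for that field.

def crop_feature_map(feature_map, bbox, image_w, image_h):
    """feature_map: 2-D grid (list of rows); bbox: (x0, y0, x1, y1) in
    image pixel coordinates. Returns the sub-grid aligned with the box."""
    fh, fw = len(feature_map), len(feature_map[0])
    x0 = int(bbox[0] * fw / image_w)
    x1 = max(x0 + 1, int(bbox[2] * fw / image_w))  # keep at least one cell
    y0 = int(bbox[1] * fh / image_h)
    y1 = max(y0 + 1, int(bbox[3] * fh / image_h))
    return [row[x0:x1] for row in feature_map[y0:y1]]
```

The scale factors `fw / image_w` and `fh / image_h` express the mapping relationship between the feature map and the original image that the convolution operation induces.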
Illustratively, each field may be semantically recognized, resulting in text-embedded features 330. Wherein, semantic recognition can be performed on the field by adopting a semantic recognition model. The semantic recognition model may include, for example, a Convolutional Recurrent Neural Network (CRNN) model, which is not limited in this disclosure.
Illustratively, the text-embedded feature 330 of each field and the image sub-feature 321 for each field may be stitched, and the stitched feature may be used as the text feature of each field.
In an embodiment, the text feature 340 of each field may also be extracted from the stitched features resulting from stitching the image sub-features 321 and the text embedding features 330. For example, the embodiment may input the stitched features into a feature extraction network, and extract the text features 340 of each field via the feature extraction network. Therefore, the image sub-features and the text embedding features in the splicing features can be fully learned, and the accuracy of the obtained text features is improved.
Illustratively, the feature extraction network may employ a self-attention model, for example, to sufficiently learn important features in the stitched features, ignore non-important features, and improve the accuracy of the extracted text features and the processing efficiency. As shown in fig. 3, the self-attention model may employ a Transformer model 302, etc., which is not limited in this disclosure.
FIG. 4 is a schematic diagram illustrating the principle of determining structural characteristics of each field according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, in determining the structural feature of each field, for example, the similarity between text features of a plurality of fields and the association relationship between position information of a plurality of fields may be considered. In this way, the association relationship between the fields can be sufficiently learned when the structural features of the fields are determined, so that the determined structural features can better reflect the features of the fields. The similarity between the text features may be represented by the difference between the text features, for example.
In one embodiment, when determining the structural feature of each field, a difference feature representing a difference between a plurality of text features may be determined according to the plurality of text features of the plurality of fields. Meanwhile, a relationship characteristic representing a relative relationship between the plurality of pieces of position information may be determined from the plurality of pieces of position information of the plurality of fields in the image to be processed. Finally, the structural features of each field are determined based on the difference features and the relationship features.
Illustratively, the number of the plurality of text features is set to be n. When determining the difference features, for each text feature, the difference between the text feature and each text feature in the n text features may be determined, resulting in n differences. For n text features, n × n differences can be obtained. The embodiment may determine a difference feature from the n x n differences. This embodiment may also arrange the aforementioned n x n differences in a matrix form.
Illustratively, the number of the plurality of text features is set to be n. When the difference characteristics are determined, the n text characteristics can be sorted into a matrix form to obtain a characteristic matrix. Each element in the feature matrix represents a text feature of a field. For example, the n text features may be arranged into a feature vector, the feature vector may be copied into n parts, and the n parts of feature vectors may be sequentially arranged to form a feature matrix. A transpose operation may then be performed on the feature matrix to obtain a transposed matrix for the feature matrix. And finally, determining the difference characteristic according to the difference between the characteristic matrix and the transposed matrix. And the difference value between the feature matrix and the transposed matrix is in a matrix form and is used for representing the difference between the n text features.
Illustratively, the matrix representing the differences between the n text features may be normalized to obtain the difference feature. Alternatively, that matrix may first be processed by a fully connected layer, and the resulting features normalized to obtain the difference feature.
For example, taking the case where the position information is represented by the position of the bounding box surrounding each field in the image to be processed, when determining the relationship feature representing the relative relationship between the plurality of pieces of position information, at least one of the following parameters may be determined: the horizontal distances between the bounding boxes surrounding the plurality of fields, the vertical distances between the bounding boxes, the aspect ratio of each bounding box, the area ratios between the bounding boxes, and the like. These parameters may then be encoded into features to obtain the relationship features. When encoding the parameters, the resulting features may be normalized, with the normalized features serving as the relationship features. It is to be understood that the above parameters and the determination of the relationship features are provided only as examples to facilitate understanding of the present disclosure, and the present disclosure is not limited thereto.
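As an illustrative sketch of the geometric parameters listed above (in NumPy; the function name, the `(x_min, y_min, x_max, y_max)` box format, and the exact parameter set are assumptions for illustration, not the patent's definitive choice):

```python
import numpy as np

def relation_parameters(box_a, box_b):
    """Geometric relation parameters between two field bounding boxes.

    Boxes are (x_min, y_min, x_max, y_max). Returns the horizontal and
    vertical centre distances, the aspect ratio of box_a, and the ratio
    of the two box areas, as a 1-D parameter vector.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    cx_a, cy_a = (ax0 + ax1) / 2, (ay0 + ay1) / 2
    cx_b, cy_b = (bx0 + bx1) / 2, (by0 + by1) / 2
    w_a, h_a = ax1 - ax0, ay1 - ay0
    w_b, h_b = bx1 - bx0, by1 - by0
    return np.array([
        abs(cx_a - cx_b),           # horizontal distance between boxes
        abs(cy_a - cy_b),           # vertical distance between boxes
        w_a / h_a,                  # aspect ratio of box A
        (w_a * h_a) / (w_b * h_b),  # area ratio A / B
    ])
```

In the full pipeline, vectors like this would be further encoded and normalized by the relationship extraction network to produce the relationship features.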
It can be understood that when the number of text features is n, the number of bounding boxes is also n. For a bounding box A among them, n sets of parameters can be obtained by the method described above, where each set of parameters represents the relationship between bounding box A and one of the bounding boxes (e.g., bounding box B). After each set of parameters is encoded into features, the obtained features are the relationship features between bounding box A and bounding box B.
According to an embodiment of the present disclosure, as shown in fig. 4, when determining the structural features, the embodiment 400 may input the n text features 401 into a difference extraction network 410, which processes them to produce the difference feature. Meanwhile, the n pieces of position information 402 may be input into a relationship extraction network 420, which processes them to produce the relationship feature. It is understood that the structures of the difference extraction network 410 and the relationship extraction network 420 can be set according to actual requirements, so as to implement the functions of extracting the difference features and the relationship features respectively. For example, the difference extraction network 410 may include a transpose layer, a difference calculation layer, a fully connected layer, a normalization layer, and the like. The relationship extraction network 420 may include a parameter extraction layer, a parameter encoding layer, and a dimension mapping layer, where the dimension mapping layer maps the encoded features to the same size as the difference feature. In one implementation, the difference extraction network 410 may first process the n text features through a convolutional layer and/or a fully connected layer, allowing the network to perform non-linear learning on them. The n features thus processed are then arranged in matrix form, and the difference feature is obtained using the method described above.
In an embodiment, after obtaining the difference feature and the relationship feature, the relationship feature may be weighted using the difference feature as the weight, resulting in a weighted feature. If the n text features 401 described above are used as n graph nodes X_0 to X_n, and the weighted features are used as the connecting edges between these graph nodes, a connected graph 403 can be constructed. The embodiment may use the graph convolution network GCN 430 to iteratively update the connected graph 403 until a connected graph 404 satisfying the convergence condition is obtained. The feature represented by each node in the connected graph 404 is then the structural feature of the corresponding field.
In actual processing, the weighted features and the plurality of text features may be input into the graph convolution network GCN 430, and the structural features of the plurality of fields are obtained after processing by the GCN 430.
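A minimal sketch of the graph-convolution step on the weighted graph, assuming the common symmetric normalization and `ReLU(A_hat · X · W)` propagation rule. The real GCN's structure and learned weights are not specified in the text, so the projection matrix is passed in explicitly here as an assumption:

```python
import numpy as np

def gcn_update(node_feats, adjacency, weight, num_iters=2):
    """Simplified iterative graph-convolution over the field graph.

    node_feats: (n, d) text features used as graph nodes X_0..X_n.
    adjacency:  (n, n) weighted edges -- e.g. relationship features
                already weighted by the difference features.
    weight:     (d, d) projection matrix (learned in a real model).
    Each iteration propagates neighbour information: X <- ReLU(A_hat X W).
    """
    # Add self-loops and symmetrically normalise (standard GCN step).
    a = adjacency + np.eye(len(adjacency))
    deg = a.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    a_hat = d_inv_sqrt @ a @ d_inv_sqrt
    x = node_feats
    for _ in range(num_iters):
        x = np.maximum(a_hat @ x @ weight, 0.0)  # ReLU(A_hat X W)
    return x  # one structural feature per node, i.e. per field
```

In the patent's pipeline, iteration would continue until the convergence condition on the connected graph is satisfied, rather than for a fixed `num_iters`.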
In this embodiment, the difference feature is obtained from text features that take the image features into account, and the relationship feature is weighted by the difference feature, so that the weighted relationship feature can express the layout relationship among the fields more accurately. The extracted structural features are therefore more accurate, which improves the accuracy of the determined field categories and thus the accuracy of structured information extraction.
In order to facilitate the execution of the text classification method provided by the present disclosure, the present disclosure also provides a training method of a text classification model. The training method will be described in detail below with reference to fig. 5.
Fig. 5 is a flowchart illustrating a method for training a text classification model according to an embodiment of the present disclosure.
As shown in fig. 5, the training method 500 of the text classification model of this embodiment may include operations S510 to S540. The text classification model can comprise a text feature extraction network, a graph convolution network, and a category prediction network.
In operation S510, the image feature of the sample image and a plurality of fields included in the sample image are input to a text feature extraction network, and a text feature of each of the plurality of fields is obtained.
According to embodiments of the present disclosure, the text feature extraction network may comprise a semantic extraction network (which may employ the semantic recognition model described above). The semantic extraction network is used for extracting semantic features of a plurality of fields as text embedding features. The implementation of operation S510 is similar to the implementation of operation S210 described above, and is not described herein again.
In one embodiment, the text feature extraction network may include a feature segmentation network and a feature extraction network in addition to the semantic extraction network. Wherein the input of the feature segmentation network comprises image features and position information of a plurality of fields in the sample image. The feature segmentation network is configured to segment image sub-features for each of the plurality of fields from the image features based on the location information. The input of the feature extraction network is a splicing feature obtained by splicing the image sub-feature of each field and the text embedding feature of each field, and the output is the text feature of each field.
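A toy sketch of the feature segmentation and splicing described above, assuming mean pooling over the cropped region of the feature map; in the real network, the spliced feature would be fed to the feature extraction network rather than used directly, and the function name and box format are illustrative assumptions:

```python
import numpy as np

def field_text_feature(image_feature, bbox, text_embedding):
    """Segment one field's image sub-feature and splice it with its
    text embedding.

    image_feature:  (H, W, C) feature map of the sample image.
    bbox:           (x0, y0, x1, y1) position of one field.
    text_embedding: (e,) semantic embedding of the field's text.
    """
    x0, y0, x1, y1 = bbox
    sub = image_feature[y0:y1, x0:x1, :]  # segment by position information
    pooled = sub.mean(axis=(0, 1))        # (C,) pooled image sub-feature
    # Splicing feature: image sub-feature + text embedding feature.
    return np.concatenate([pooled, text_embedding])
```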
In operation S520, graph features for the plurality of fields are determined using a graph convolution network, according to the plurality of text features of the plurality of fields and the plurality of pieces of position information of the plurality of fields in the sample image.
According to an embodiment of the present disclosure, an implementation manner of operation S520 is similar to the implementation manner of operation S220 described above, and is not described herein again.
In an embodiment, the text classification model may further include the relationship extraction network described above and the difference extraction network described above. The operation S520 may input the plurality of text features into the difference extraction network to obtain a difference feature representing a difference between the plurality of text features. And simultaneously inputting a plurality of pieces of position information of a plurality of fields in the sample image into a relationship extraction network to obtain relationship characteristics representing the relative relationship among the plurality of pieces of position information. And finally, determining the graph characteristics aiming at the fields by adopting a graph convolution network according to the difference characteristics and the relation characteristics. The graph characteristics include structural characteristics of each field. It is to be understood that the present disclosure does not limit the order of acquisition of the relational features and the differential features. It will be appreciated that the graph features may be represented using the connectivity graph described previously that satisfies the convergence criteria.
For example, the embodiment may weight the relationship feature by taking the difference feature as a weight, and obtain a weighted feature. And finally, inputting the weighted features into a graph convolution network to obtain the graph features aiming at a plurality of fields. The implementation principle of the processing method is similar to the principle described in fig. 4, and is not described herein again.
In operation S530, a plurality of text features of a plurality of fields and a plurality of structural features of a plurality of fields are input into a category prediction network, resulting in a prediction category for each field.
According to an embodiment of the present disclosure, the category prediction network may be composed of the recurrent neural network and the conditional random field network described previously. The predicted category of each field may be any one of the predetermined categories described above. The implementation of operation S530 is similar to that of operation S230 described above, and is not described herein again. It is understood that the category prediction network may also be composed of, for example, a recurrent neural network and a softmax logistic regression layer, which is not limited by this disclosure.
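A minimal pure-Python sketch of the softmax logistic-regression layer mentioned above; the max-subtraction is a standard numerical-stability detail, not something specified by the patent:

```python
import math

def softmax(logits):
    """Convert raw category scores into a probability distribution."""
    m = max(logits)                              # subtract max for stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]                 # probabilities sum to 1
```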
In operation S540, a text classification model is trained according to the prediction class and the actual class.
According to an embodiment of the present disclosure, the sample image may include first information indicating the actual category of each field. Operation S540 may determine the loss of the text classification model according to the difference between the prediction category and the actual category. A back propagation algorithm may then be used to adjust the network parameters in the text classification model so as to minimize this loss, thereby completing the training of the text classification model.
Illustratively, the loss of the text classification model may be determined using a cross-entropy loss function or a Negative Log-Likelihood (NLL) loss function, etc., which is not limited by this disclosure. For example, denoting the value of the negative log-likelihood loss by crf_loss, it can be calculated using the following formula (1):

crf_loss = −log P(y | X)        (1)

where X is the input sequence of the recurrent neural network, y is the prediction category, and P(y | X) is the conditional probability that the category prediction network outputs for the category y given the input sequence X.
FIG. 6 is a schematic diagram of a principle of training a text classification model according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, when training the text classification model, a relation prediction network may be added to the text classification model to predict the association relationships between fields from the relationship features in the graph features output by the graph convolution network. With the actual association relationships used as supervision information, the training of the graph convolution network becomes more targeted and tends to learn the structural information of the fields.
As shown in fig. 6, in this embodiment 600, the text classification model may include a text feature extraction network, a difference extraction network 620, a relationship extraction network 630, a graph convolution network GCN 640, a category prediction network 650, and a relationship prediction network 660. The text feature extraction network may include a semantic extraction network 611, a feature segmentation network 612, and a feature extraction network 613.
When training the text classification model, a text recognition technique may be first used to perform text recognition on the sample image, so as to obtain a plurality of fields included in the sample image and position information of each field in the plurality of fields, which may be represented by a field layout 601, for example, where each bounding box in the layout 601 surrounds one field.
After obtaining the layout 601, the embodiment 600 may input the plurality of fields into the semantic extraction network 611 in a sequence form, so as to obtain semantic features of the plurality of fields. Meanwhile, the image feature of the sample image and the plurality of location information of the plurality of fields may be input to the feature segmentation network 612, resulting in an image sub-feature for each field. The image sub-features and the semantic features are spliced and input to the feature extraction network 613, so that the text features of each field can be obtained, and the text feature sequence 602 is obtained.
After the text feature sequence 602 is obtained, the text feature sequence 602 may be input into the difference extraction network 620, and processed by the difference extraction network 620 to obtain the difference features described above. The plurality of location information is input into the relationship extraction network 630, and the relationship characteristics described above can be obtained after being processed by the relationship extraction network 630. And weighting the relation characteristics by taking the difference characteristics as weights to obtain weighted characteristics. The connected graph 603 can be constructed by taking each text feature in the text feature sequence 602 as graph nodes X _0 to X _ n, and taking the weighted feature as a connecting edge representing n graph nodes. After the connected graph 603 is iteratively updated through the GCN 640, the connected graph 604 satisfying the convergence condition is obtained, and the connected graph 604 can be used as a graph feature for a plurality of fields. The graph nodes 604_1 in the connected graph 604 may represent structural characteristics of the fields, and the edges 604_2 in the connected graph 604 may represent relational characteristics of the fields.
After the connected graph 604 is obtained, the features represented by the graph nodes 604_1 can be used as the structural features of the fields, so as to obtain a structural feature sequence 605. The structural feature sequence 605 and the textual feature sequence 602 are concatenated and input into the category prediction network 650 to obtain the prediction category for each of the plurality of fields. Meanwhile, the relationship characteristic represented by the edge 604_2 may be input into the relationship prediction network 660, resulting in predicted relationship information between the plurality of fields.
In an embodiment, the sample image may further include second information indicating a relationship of the plurality of fields with each other. After obtaining the prediction category and the prediction relationship information, the embodiment 600 may calculate the loss value crf _ loss of the category prediction loss according to the prediction category and the actual category included in the sample image by using the formula (1) described above. Meanwhile, the loss value ce _ loss of the relation prediction loss can be determined according to the prediction relation information and the second information. The text classification model may then be trained based on the two-part loss.
Illustratively, the loss value ce _ loss of the relational prediction loss may be calculated, for example, using a cross-entropy loss function. The embodiment can use the weighted sum of the two losses as the total loss of the text classification model, and adjust the network parameters in the text classification model through a back propagation algorithm to minimize the total loss, thereby completing the training of the text classification model. It is understood that the weight used in calculating the weighted sum of the two losses can be set according to actual requirements, and the disclosure is not limited thereto.
Illustratively, the relationship prediction network 660 may be composed of, for example, a fully connected layer and a softmax logistic regression layer, or may be composed of any network structure that can be used for classification, which is not limited by the present disclosure.
In an embodiment, the text classification model can also be trained in a self-supervised manner according to the difference feature, so that the difference extraction network learns the differences between the text features more accurately. Specifically, a self-supervised loss of the text classification model may be determined based on the difference feature and the plurality of text features, and the text classification model then trained based on this self-supervised loss.
For example, the embodiment may take the weighted sum of the self-supervised loss and the two-part loss described above as the total loss of the text classification model, and then train the model based on this total loss. Alternatively, the embodiment may train the entire text classification model with the category prediction loss as a global loss, train the GCN 640, the relationship extraction network 630 and the difference extraction network 620 with the relation prediction loss as a local loss, and train the difference extraction network 620 with the self-supervised loss as a local loss, which is not limited by this disclosure.
For example, let the number of the plurality of fields be n, so that the difference feature A has size n × n. The element A_ij in the i-th row and j-th column of A characterizes the difference between the text features of the i-th field and the j-th field among the n fields. The value gl_loss of the self-supervised loss can then be represented, for example, by the following formula (2):

gl_loss = (η / n²) · Σ_{i,j=1}^{n} A_ij · ‖v_i − v_j‖² + γ · ‖A‖_F²        (2)

where ‖A‖_F denotes the Frobenius norm (F-norm) of the difference feature A, and v_i and v_j are the features corresponding to the text features of the i-th and j-th fields, which may be obtained by processing those text features through the convolutional layer and/or fully connected layer as described above. η and γ are weight parameters set according to actual requirements.
It is to be understood that the above calculation formula for the self-supervised loss is provided only as an example to facilitate understanding of the present disclosure, and the present disclosure is not limited thereto: any self-supervised loss may be used, as long as it represents the relationship between the differences characterized by the difference feature and the differences computed from the plurality of text features.
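One plausible concrete form of such a self-supervised loss can be sketched in NumPy, assuming the common graph-learning combination of an A-weighted pairwise-distance (smoothness) term and a Frobenius-norm regularizer; the exact functional form and the default weights are assumptions for illustration:

```python
import numpy as np

def gl_loss(a, v, eta=1.0, gamma=0.1):
    """Self-supervised graph-learning loss over the difference feature.

    a: (n, n) difference feature, a[i, j] weighting the (i, j) field pair.
    v: (n, d) per-field features v_i (text features after conv/FC layers).
    eta, gamma: weight parameters set according to actual requirements.
    """
    n = len(v)
    # Pairwise squared distances ||v_i - v_j||^2 via broadcasting.
    sq_dists = ((v[:, None, :] - v[None, :, :]) ** 2).sum(axis=-1)
    smoothness = (a * sq_dists).sum() / (n * n)   # A-weighted smoothness term
    frob = np.linalg.norm(a, ord="fro") ** 2      # ||A||_F^2 regularizer
    return eta * smoothness + gamma * frob
```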
Based on the text classification method provided by the disclosure, the disclosure also provides a text classification device. The apparatus will be described in detail below with reference to fig. 7.
Fig. 7 is a block diagram of a structure of a text classification device according to an embodiment of the present disclosure.
As shown in fig. 7, the text classification apparatus 700 of this embodiment may include a first text feature extraction module 710, a structural feature determination module 720, and a category determination module 730.
The first text feature extraction module 710 is configured to determine a text feature of each of the fields according to an image feature of the image to be processed and a plurality of fields included in the image to be processed. In an embodiment, the first text feature extraction module 710 may be configured to perform the operation S210 described above, which is not described herein again.
The structural feature determining module 720 is configured to determine a structural feature of each field according to a plurality of text features of the plurality of fields and a plurality of position information of the plurality of fields in the image to be processed. In an embodiment, the structural feature determining module 720 may be configured to perform the operation S220 described above, which is not described herein again.
The category determination module 730 is configured to determine a category for each field according to the text features and the structural features of each field. In an embodiment, the category determining module 730 may be configured to perform the operation S230 described above, which is not described herein again.
According to an embodiment of the present disclosure, the structural feature determination module 720 may include a first difference feature determination sub-module, a first relationship feature determination sub-module, and a structural feature determination sub-module. The first difference feature determination submodule is used for determining a difference feature which represents the difference between the plurality of text features according to the plurality of text features. The first relation feature determination submodule is used for determining a relation feature representing the relative relation between the plurality of pieces of position information according to the plurality of pieces of position information of the plurality of fields in the image to be processed. The structural feature determination submodule is used for determining the structural feature of each field according to the difference feature, the relation feature and the text features.
According to an embodiment of the present disclosure, the difference feature determination submodule may include a feature matrix determination unit and a difference feature determination unit. The feature matrix determination unit is used for determining a feature matrix composed of a plurality of text features. The difference characteristic determining unit is used for determining the difference characteristic according to the difference value between the characteristic matrix and the transposed matrix of the characteristic matrix.
According to an embodiment of the present disclosure, the structural feature determination submodule may include a first weighting unit and a first feature determination unit. The first weighting unit is used for weighting the relationship feature by taking the difference feature as the weight to obtain a weighted feature. The first feature determination unit is used for inputting the weighted feature and the plurality of text features into the graph convolution network to obtain the structural feature of each field.
According to an embodiment of the present disclosure, the first text feature extraction module 710 may include a sub-feature determination sub-module, an embedded feature determination sub-module, and a text feature determination sub-module. And the sub-feature determining sub-module is used for determining the image sub-features of each field in the image features according to the position information of each field in the image to be processed. The embedded characteristic determination submodule is used for determining the text embedded characteristic of each field. And the text characteristic determining submodule is used for obtaining the text characteristic of each field according to the image sub-characteristic and the text embedding characteristic.
According to an embodiment of the present disclosure, the text feature determination submodule may include a concatenation unit and an extraction unit. And the splicing unit is used for splicing the image sub-features and the text embedding features to obtain splicing features. The extraction unit is used for extracting the text feature of each field from the splicing features.
Based on the training method of the text classification model provided by the disclosure, the disclosure also provides a training device of the text classification model. The apparatus will be described in detail below with reference to fig. 8.
Fig. 8 is a block diagram of a structure of a training apparatus for a text classification model according to an embodiment of the present disclosure.
As shown in fig. 8, the training apparatus 800 of the text classification model of this embodiment may include a second text feature extraction module 810, a graph feature determination module 820, a category prediction module 830, and a first training module 840. The text classification model comprises a text feature extraction network, a graph convolution network and a category prediction network.
The second text feature extraction module 810 is configured to input the image features of the sample image and the plurality of fields included in the sample image into a text feature extraction network, resulting in text features for each of the plurality of fields. In an embodiment, the second text feature extraction module 810 may be configured to perform the operation S510 described above, which is not described herein again.
The graph feature determination module 820 is configured to determine graph features for the plurality of fields by using a graph convolution network, according to the plurality of text features of the plurality of fields and the plurality of pieces of position information of the plurality of fields in the sample image. The graph features include the structural feature of each field. In an embodiment, the graph feature determination module 820 may be configured to perform the operation S520 described above, which is not described herein again.
The category prediction module 830 is configured to input the plurality of text features of the plurality of fields and the plurality of structural features of the plurality of fields into the category prediction network, so as to obtain the prediction category of each field. In an embodiment, the category prediction module 830 may be configured to perform the operation S530 described above, which is not described herein again.
The first training module 840 is configured to train the text classification model according to the prediction class and the actual class. In an embodiment, the first training module 840 may be configured to perform the operation S540 described above, which is not described herein again.
According to an embodiment of the disclosure, the graph feature further includes a relationship feature of the plurality of fields to each other, the sample image further includes second information indicating a relationship of the plurality of fields to each other, and the text classification model further includes a relationship prediction network. The training apparatus 800 for the text classification model may further include a relationship prediction module and a second training module. The relation prediction module is used for inputting the relation characteristics into the relation prediction network to obtain the prediction relation information among the fields. And the second training module is used for training the text classification model according to the prediction relation information and the second information.
According to an embodiment of the present disclosure, the text classification model further includes a relationship extraction network and a difference extraction network. The graph feature determination module 820 may include a second difference feature determination submodule, a second relationship feature determination submodule, and a graph feature determination submodule. The second difference feature determination submodule is used for inputting the plurality of text features into the difference extraction network to obtain a difference feature representing the differences between the plurality of text features. The second relationship feature determination submodule is used for inputting the plurality of pieces of position information of the plurality of fields in the sample image into the relationship extraction network to obtain a relationship feature representing the relative relationship between the plurality of pieces of position information. The graph feature determination submodule is used for determining the graph features for the plurality of fields by using the graph convolution network according to the difference feature, the relationship feature and the plurality of text features.
According to an embodiment of the present disclosure, the training apparatus 800 of the text classification model may further include a loss determination module and a third training module. The loss determination module is used for determining the self-supervised loss of the text classification model according to the difference feature and the plurality of text features. The third training module is used for training the text classification model according to the self-supervised loss.
According to an embodiment of the present disclosure, the above-described graph characteristic determination submodule may include a second weighting unit and a second characteristic determination unit. The second weighting unit is used for weighting the relation characteristic by taking the difference characteristic as weight to obtain weighted characteristic. The second feature determination unit is used for inputting the weighted features into the graph convolution network to obtain the graph features aiming at the fields.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and application of the personal information of the users involved all comply with the provisions of relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated. In the technical solution of the present disclosure, the authorization or consent of the user is obtained before the user's personal information is acquired or collected.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that may be used to implement the text classification methods and/or the training methods of text classification models of embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the respective methods and processes described above, such as the text classification method and/or the training method of the text classification model. For example, in some embodiments, the text classification method and/or the training method of the text classification model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the text classification method and/or the training method of the text classification model described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured by any other suitable means (e.g., by means of firmware) to perform the text classification method and/or the training method of the text classification model.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of traditional physical hosts and Virtual Private Server (VPS) services, namely high management difficulty and weak business scalability. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (25)

1. A method of text classification, comprising:
determining a text feature of each field in a plurality of fields according to an image feature of an image to be processed and the plurality of fields included in the image to be processed;
determining a structural feature of each field according to a plurality of text features of the plurality of fields and a plurality of pieces of position information of the plurality of fields in the image to be processed; and
determining a category of each field according to the text feature of each field and the structural feature of each field.
2. The method of claim 1, wherein the determining a structural feature of each field comprises:
determining, according to the plurality of text features, a difference feature representing differences between the plurality of text features;
determining, according to the plurality of pieces of position information of the plurality of fields in the image to be processed, a relation feature representing a relative relation between the plurality of pieces of position information; and
determining the structural feature of each field according to the difference feature, the relation feature and the plurality of text features.
3. The method of claim 2, wherein the determining, according to the plurality of text features, a difference feature representing differences between the plurality of text features comprises:
determining a feature matrix composed of the plurality of text features; and
determining the difference feature according to a difference between the feature matrix and a transposed matrix of the feature matrix.
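Outside the claim language, one plausible reading of claim 3 is that the "difference between the feature matrix and its transpose" denotes the pairwise differences of the field features, obtained by broadcasting; the shapes and the broadcasting construction below are assumptions, not part of the claim:

```python
import numpy as np

def difference_feature(text_features: np.ndarray) -> np.ndarray:
    """Pairwise differences between field text features.

    text_features: (n_fields, dim) feature matrix F whose rows are the
    text features of the fields. Returns an (n_fields, n_fields, dim)
    tensor whose (i, j) entry is F[i] - F[j], i.e. the feature matrix
    minus its "transpose" along the field axis.
    """
    f = text_features[:, None, :]    # (n, 1, d)
    f_t = text_features[None, :, :]  # (1, n, d): field axis transposed
    return f - f_t

feats = np.array([[1.0, 2.0],
                  [0.0, 1.0],
                  [3.0, 3.0]])
diff = difference_feature(feats)
print(diff.shape)  # (3, 3, 2)
```

Under this reading the result is antisymmetric (diff[i, j] == -diff[j, i]) with zeros on the diagonal, so identical fields contribute no difference signal.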
4. The method of claim 2, wherein the determining the structural feature of each field according to the difference feature, the relation feature and the plurality of text features comprises:
weighting the relation feature by taking the difference feature as a weight to obtain a weighted feature; and
inputting the weighted feature and the plurality of text features into a graph convolution network to obtain the structural feature of each field.
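As a concrete but hypothetical instance of claim 4, the sketch below weights a relation matrix elementwise by a difference-derived weight, row-normalizes the result into an adjacency, and applies a single graph-convolution layer; the elementwise weighting, the normalization, and the one-layer ReLU network are all assumptions beyond the claim text:

```python
import numpy as np

rng = np.random.default_rng(0)

def graph_conv_layer(adj, x, w):
    """One graph-convolution layer: ReLU(A @ X @ W)."""
    return np.maximum(adj @ x @ w, 0.0)

def structural_features(relation, diff_weight, text_feats, w):
    # Weight the relation feature with the difference-derived weight
    # (elementwise), as claims 4 and 11 describe.
    weighted = relation * diff_weight                      # (n, n)
    # Row-normalize so each field aggregates a convex combination of
    # its neighbors (a common GCN convention, assumed here).
    weighted = weighted / (weighted.sum(axis=1, keepdims=True) + 1e-8)
    return graph_conv_layer(weighted, text_feats, w)

n_fields, dim, hidden = 4, 8, 16
relation = rng.random((n_fields, n_fields))     # e.g. from relative positions
diff_weight = rng.random((n_fields, n_fields))  # e.g. from feature differences
text_feats = rng.random((n_fields, dim))
w = rng.random((dim, hidden))                   # hypothetical learned weight
out = structural_features(relation, diff_weight, text_feats, w)
print(out.shape)  # (4, 16)
```

Each row of `out` would then serve as the structural feature of one field.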
5. The method of claim 1, wherein the determining a text feature of each field in the plurality of fields comprises:
determining an image sub-feature for each field in the image feature according to position information of each field in the image to be processed;
determining a text embedding feature of each field; and
obtaining the text feature of each field according to the image sub-feature and the text embedding feature.
6. The method of claim 5, wherein the obtaining the text feature of each field according to the image sub-feature and the text embedding feature comprises:
splicing the image sub-feature and the text embedding feature to obtain a spliced feature; and
extracting the text feature of each field from the spliced feature.
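A minimal sketch of claims 5–6's splicing step, under the assumption that "splicing" is feature concatenation and "extracting" is a learned linear projection (the projection matrix here is hypothetical):

```python
import numpy as np

def field_text_feature(image_sub, text_embed, proj):
    """Splice (concatenate) a field's image sub-feature with its text
    embedding feature, then extract the text feature via a projection.
    `proj` stands in for whatever learned extraction layer is used."""
    spliced = np.concatenate([image_sub, text_embed])  # splicing step
    return proj @ spliced                              # extraction step

image_sub = np.ones(4)    # hypothetical image sub-feature for one field
text_embed = np.zeros(3)  # hypothetical text embedding feature
proj = np.eye(7)[:5]      # hypothetical 7 -> 5 linear extraction
feat = field_text_feature(image_sub, text_embed, proj)
print(feat.shape)  # (5,)
```

In practice the extraction layer would be trained jointly with the rest of the model; a plain identity-slice projection is used here only to keep the example deterministic.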
7. A training method of a text classification model, wherein the text classification model comprises a text feature extraction network, a graph convolution network and a category prediction network; the method comprises:
inputting an image feature of a sample image and a plurality of fields included in the sample image into the text feature extraction network to obtain a text feature of each field in the plurality of fields, wherein the sample image further comprises first information indicating an actual category of each field;
determining, by using the graph convolution network, graph features for the plurality of fields according to a plurality of text features of the plurality of fields and a plurality of pieces of position information of the plurality of fields in the sample image, wherein the graph features comprise a structural feature of each field;
inputting the plurality of text features of the plurality of fields and a plurality of structural features of the plurality of fields into the category prediction network to obtain a prediction category of each field; and
training the text classification model according to the prediction category and the actual category.
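Claim 7 does not fix a loss function for the final training step; a conventional choice would be per-field cross-entropy between the prediction categories and the actual categories (the first information), sketched below with hypothetical logits:

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy between per-field category logits and labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Hypothetical example: 3 fields, 4 candidate categories.
pred_logits = np.array([[2.0, 0.1, 0.1, 0.1],
                        [0.1, 2.0, 0.1, 0.1],
                        [0.1, 0.1, 0.1, 2.0]])
actual = np.array([0, 1, 3])  # actual category per field (first information)
loss = cross_entropy(pred_logits, actual)
print(round(loss, 3))  # 0.371
```

The loss would then be backpropagated through the category prediction network, the graph convolution network, and the text feature extraction network together.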
8. The method of claim 7, wherein the graph characteristics further include relationship characteristics of the plurality of fields with respect to each other; the sample image further includes second information indicating a relationship of the plurality of fields to each other; the text classification model further comprises a relationship prediction network; the method further comprises the following steps:
inputting the relation feature into the relation prediction network to obtain prediction relation information between the plurality of fields; and
training the text classification model according to the prediction relation information and the second information.
9. The method of claim 7, wherein the text classification model further comprises a relation extraction network and a difference extraction network; the determining, by using the graph convolution network, graph features for the plurality of fields comprises:
inputting the plurality of text features into the difference extraction network to obtain a difference feature representing differences between the plurality of text features;
inputting a plurality of pieces of position information of the plurality of fields in the sample image into the relation extraction network to obtain a relation feature representing a relative relation between the plurality of pieces of position information; and
determining, by using the graph convolution network, graph features for the plurality of fields according to the difference feature, the relation feature and the plurality of text features.
10. The method of claim 9, further comprising:
determining a self-supervised loss of the text classification model based on the difference feature and the plurality of text features; and
training the text classification model according to the self-supervised loss.
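Claim 10 leaves the form of the self-supervised loss open. Purely as an illustration, one loss that can be built from the difference feature alone penalizes field pairs whose text features are nearly identical, pushing the features of distinct fields apart without any labels; this specific form is an assumption, not the patented loss:

```python
import numpy as np

def self_supervised_separation_loss(text_feats):
    """Hypothetical self-supervised loss built from the pairwise
    difference feature: field pairs whose text features are close
    receive a large penalty, pushing distinct fields apart."""
    diff = text_feats[:, None, :] - text_feats[None, :, :]  # difference feature
    dist = np.linalg.norm(diff, axis=-1)                    # (n, n) distances
    off_diag = ~np.eye(len(text_feats), dtype=bool)         # ignore self-pairs
    return float(np.exp(-dist[off_diag]).mean())

feats = np.array([[0.0, 0.0],
                  [3.0, 4.0]])  # the two fields are distance 5 apart
loss = self_supervised_separation_loss(feats)
print(round(loss, 4))  # 0.0067
```

The loss shrinks toward zero as field features separate, so minimizing it jointly with the supervised objective would act as a regularizer.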
11. The method of claim 9, wherein the determining, with the graph convolution network, the graph features for the plurality of fields from the difference feature, the relationship feature, and the plurality of text features comprises:
weighting the relation feature by taking the difference feature as a weight to obtain a weighted feature; and
inputting the weighted feature and the plurality of text features into the graph convolution network to obtain the graph features for the plurality of fields.
12. A text classification apparatus comprising:
a first text feature extraction module, configured to determine a text feature of each field in a plurality of fields according to an image feature of an image to be processed and the plurality of fields included in the image to be processed;
a structural feature determination module, configured to determine a structural feature of each field according to a plurality of text features of the plurality of fields and a plurality of pieces of position information of the plurality of fields in the image to be processed; and
a category determination module, configured to determine a category of each field according to the text feature of each field and the structural feature of each field.
13. The apparatus of claim 12, wherein the structural feature determination module comprises:
a first difference feature determination submodule, configured to determine, according to the plurality of text features, a difference feature representing differences between the plurality of text features;
a first relation feature determination submodule, configured to determine, according to a plurality of pieces of position information of the plurality of fields in the image to be processed, a relation feature representing a relative relation between the plurality of pieces of position information; and
a structural feature determination submodule, configured to determine the structural feature of each field according to the difference feature, the relation feature and the plurality of text features.
14. The apparatus of claim 13, wherein the difference feature determination submodule comprises:
a feature matrix determination unit, configured to determine a feature matrix composed of the plurality of text features; and
a difference feature determination unit, configured to determine the difference feature according to a difference between the feature matrix and a transposed matrix of the feature matrix.
15. The apparatus of claim 13, wherein the structural feature determination submodule comprises:
a first weighting unit, configured to weight the relation feature by taking the difference feature as a weight to obtain a weighted feature; and
a first feature determination unit, configured to input the weighted feature and the plurality of text features into a graph convolution network to obtain the structural feature of each field.
16. The apparatus of claim 12, wherein the first text feature extraction module comprises:
a sub-feature determination submodule, configured to determine an image sub-feature for each field in the image feature according to position information of each field in the image to be processed;
an embedding feature determination submodule, configured to determine a text embedding feature of each field; and
a text feature determination submodule, configured to obtain the text feature of each field according to the image sub-feature and the text embedding feature.
17. The apparatus of claim 16, wherein the text feature determination sub-module comprises:
a splicing unit, configured to splice the image sub-feature and the text embedding feature to obtain a spliced feature; and
an extraction unit, configured to extract the text feature of each field from the spliced feature.
18. A training apparatus of a text classification model, wherein the text classification model comprises a text feature extraction network, a graph convolution network and a category prediction network; the apparatus comprises:
a second text feature extraction module, configured to input an image feature of a sample image and a plurality of fields included in the sample image into the text feature extraction network to obtain a text feature of each field in the plurality of fields, wherein the sample image further comprises first information indicating an actual category of each field;
a graph feature determination module, configured to determine, by using the graph convolution network, graph features for the plurality of fields according to a plurality of text features of the plurality of fields and a plurality of pieces of position information of the plurality of fields in the sample image, wherein the graph features comprise a structural feature of each field;
a category prediction module, configured to input the plurality of text features of the plurality of fields and a plurality of structural features of the plurality of fields into the category prediction network to obtain a prediction category of each field; and
a first training module, configured to train the text classification model according to the prediction category and the actual category.
19. The apparatus of claim 18, wherein the graph features further comprise relationship features of the plurality of fields with respect to each other; the sample image further includes second information indicating a relationship of the plurality of fields to each other; the text classification model further comprises a relationship prediction network; the device further comprises:
a relation prediction module, configured to input the relation feature into the relation prediction network to obtain prediction relation information between the plurality of fields; and
a second training module, configured to train the text classification model according to the prediction relation information and the second information.
20. The apparatus of claim 18, wherein the text classification model further comprises a relationship extraction network and a difference extraction network; the graph feature determination module includes:
a second difference feature determination submodule, configured to input the plurality of text features into the difference extraction network to obtain a difference feature representing differences between the plurality of text features;
a second relation feature determination submodule, configured to input a plurality of pieces of position information of the plurality of fields in the sample image into the relation extraction network to obtain a relation feature representing a relative relation between the plurality of pieces of position information; and
a graph feature determination submodule, configured to determine, by using the graph convolution network, graph features for the plurality of fields according to the difference feature, the relation feature and the plurality of text features.
21. The apparatus of claim 20, further comprising:
a loss determination module, configured to determine a self-supervised loss of the text classification model based on the difference feature and the plurality of text features; and
a third training module, configured to train the text classification model according to the self-supervised loss.
22. The apparatus of claim 20, wherein the graph feature determination submodule comprises:
a second weighting unit, configured to weight the relation feature by taking the difference feature as a weight to obtain a weighted feature; and
a second feature determination unit, configured to input the weighted feature into the graph convolution network to obtain the graph features for the plurality of fields.
23. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-11.
24. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-11.
25. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method according to any one of claims 1 to 11.
CN202210154579.1A 2022-02-18 2022-02-18 Text classification method and training method and device of text classification model Pending CN114495113A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210154579.1A CN114495113A (en) 2022-02-18 2022-02-18 Text classification method and training method and device of text classification model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210154579.1A CN114495113A (en) 2022-02-18 2022-02-18 Text classification method and training method and device of text classification model

Publications (1)

Publication Number Publication Date
CN114495113A true CN114495113A (en) 2022-05-13

Family

ID=81481545

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210154579.1A Pending CN114495113A (en) 2022-02-18 2022-02-18 Text classification method and training method and device of text classification model

Country Status (1)

Country Link
CN (1) CN114495113A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115578735A (en) * 2022-09-29 2023-01-06 北京百度网讯科技有限公司 Text detection method and training method and device of text detection model
CN115578735B (en) * 2022-09-29 2023-09-15 北京百度网讯科技有限公司 Text detection method and training method and device of text detection model
CN116052186A (en) * 2023-01-30 2023-05-02 无锡容智技术有限公司 Multi-mode invoice automatic classification and identification method, verification method and system

Similar Documents

Publication Publication Date Title
CN109492222B (en) Intention identification method and device based on concept tree and computer equipment
US20200004815A1 (en) Text entity detection and recognition from images
CN112966522A (en) Image classification method and device, electronic equipment and storage medium
CN112949415B (en) Image processing method, apparatus, device and medium
WO2021169423A1 (en) Quality test method, apparatus and device for customer service recording, and storage medium
CN114495113A (en) Text classification method and training method and device of text classification model
CN112632226B (en) Semantic search method and device based on legal knowledge graph and electronic equipment
CN113221918B (en) Target detection method, training method and device of target detection model
CN112560504B (en) Method, electronic equipment and computer readable medium for extracting information in form document
CN113780098A (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
CN114724156A (en) Form identification method and device and electronic equipment
CN114581732A (en) Image processing and model training method, device, equipment and storage medium
CN114092948A (en) Bill identification method, device, equipment and storage medium
CN113392218A (en) Training method of text quality evaluation model and method for determining text quality
CN113220999A (en) User feature generation method and device, electronic equipment and storage medium
CN112613293A (en) Abstract generation method and device, electronic equipment and storage medium
CN115482436B (en) Training method and device for image screening model and image screening method
CN108733702B (en) Method, device, electronic equipment and medium for extracting upper and lower relation of user query
CN116383382A (en) Sensitive information identification method and device, electronic equipment and storage medium
CN116416640A (en) Method, device, equipment and storage medium for determining document element
US20230134218A1 (en) Continuous learning for document processing and analysis
CN115563976A (en) Text prediction method, model building method and device for text prediction
CN113806541A (en) Emotion classification method and emotion classification model training method and device
CN114398482A (en) Dictionary construction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination