WO2023024614A1 - Document classification method and apparatus, electronic device and storage medium - Google Patents

Document classification method and apparatus, electronic device and storage medium Download PDF

Info

Publication number
WO2023024614A1
WO2023024614A1 (PCT/CN2022/094788, CN2022094788W)
Authority
WO
WIPO (PCT)
Prior art keywords
text
feature
document
line
fusion
Prior art date
Application number
PCT/CN2022/094788
Other languages
French (fr)
Chinese (zh)
Inventor
李煜林
庾悦晨
钦夏孟
章成全
姚锟
韩钧宇
刘经拓
Original Assignee
北京百度网讯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司 filed Critical 北京百度网讯科技有限公司
Publication of WO2023024614A1 publication Critical patent/WO2023024614A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Definitions

  • the present disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision and deep learning.
  • Documents are an important information carrier and are widely used in various business and office scenarios. In an automated office or input system, classifying different documents is one of the most critical processes.
  • the present disclosure provides a method, device, electronic equipment and storage medium for document classification.
  • a method for document classification, including: acquiring text information and image information of the text included in a document to be processed; performing fusion based on the text information and the image information to obtain a fusion feature; acquiring a feature sequence of the text according to the fusion feature; and
  • determining the category of the document to be processed based on a predefined document category and the feature sequence.
  • the text includes at least one line of text content; acquiring text information of the text included in the document to be processed includes: acquiring at least one line of text content and location information of the at least one line of text content.
  • performing fusion based on the text information and the image information to obtain the fusion feature includes: adding the image information to the position information of the at least one line of text content, and then concatenating the result with the at least one line of text content to obtain the fusion feature.
  • obtaining the feature sequence of the text according to the fusion feature includes: performing an arithmetic mean over the first character features of the at least one line of text content in the fusion feature, and multiplying the result of the arithmetic mean by the first position feature of the at least one line of text content to obtain the feature sequence of the text.
  • obtaining the feature sequence of the text according to the fusion feature includes: inputting the fusion feature into a stacked self-attention network to obtain an enhanced fusion feature; the initial weight of the self-attention network is the fusion feature.
  • the self-attention network is represented as follows:
  • H_0 = V;  H_l = σ((W_l1·H_{l-1})(W_l2·H_{l-1})^T / √d)·H_{l-1}
  • where W_l* represents the learnable parameter matrix of a fully connected layer whose parameters are not shared across layers, * being a positive integer; d represents the feature dimension; H_l represents the output of the l-th self-attention layer; V represents the fusion feature; and σ represents the normalization function.
  • obtaining the feature sequence of the text includes: arithmetically averaging the second character features in the at least one line of text content composed of the features H_l output by the self-attention network, and multiplying the result of the arithmetic mean by the second position feature of the line of text content to obtain the feature sequence of the text.
  • the fusion feature is represented as follows:
  • V = concat(T, F + S)
  • where T is the vector of the encoded single characters of the at least one line of text content; F is the vector obtained by using a region-of-interest pooling algorithm to extract the image information of the at least one line of text content from the whole image; S is the vector obtained by encoding the position of the at least one line of text content; and the vector dimensions of T, F and S are the same.
  • acquiring the text information and image information of the text contained in the document to be processed includes: using a neural network to extract the image information of the document to be processed.
  • determining the category of the document to be processed includes:
  • predefining document categories, and using a classifier function to obtain the probability of the feature sequence of the text over the predefined document categories;
  • taking the predefined document category with the highest probability value as the category of the document.
  • a document classification device including:
  • An acquisition module configured to: acquire text information and image information of the text included in the document to be processed
  • the fusion feature module is used for: performing fusion based on text information and image information to obtain fusion features;
  • the feature sequence acquisition module is used to: acquire the feature sequence of the text according to the fusion feature;
  • the classification module is configured to: determine the category of the document to be processed based on the predefined document category and feature sequence.
  • a method for training a document classification model, including: predefining categories of test documents and the correct probability value for documents of each category; acquiring text information and image information of the text of a test document; processing the text information and the image information to obtain a feature sequence of the text; and determining, according to the feature sequence and the predefined categories, the predicted category of the test document and the predicted probability distribution values of the test document over the categories;
  • the parameters of the document classification model are adjusted based on the correct probability values and the predicted probability distribution values, and a target document classification model is obtained in response to preset conditions being met.
  • an electronic device including:
  • the memory stores instructions that can be executed by at least one processor, and the instructions are executed by at least one processor, so that at least one processor can execute the method in any one of the above method technical solutions.
  • a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to make a computer execute the method in any one of the above-mentioned method technical solutions.
  • a computer program product including a computer program, and when the computer program is executed by a processor, the method in any one of the above method technical solutions is implemented.
  • the technical solution provided in the present disclosure proposes a document classification method for multimodal feature fusion. This method takes text content, text image blocks and text coordinates as input information to enhance the semantic expression of document features.
  • FIG. 1 shows a schematic flowchart of a method for classifying documents provided by an embodiment of the present disclosure
  • FIG. 2 shows a schematic flowchart of an optical character recognition method provided by an embodiment of the present disclosure
  • FIG. 3 shows a schematic flowchart of determining a document category according to a text feature sequence and a predefined document category provided by an embodiment of the present disclosure
  • Fig. 4 shows a schematic diagram of a document classification device provided by an embodiment of the present disclosure
  • FIG. 5 shows a schematic flowchart of a method for training a document classification model provided by an embodiment of the present disclosure
  • Fig. 6 is a block diagram of an electronic device used to implement the document classification method of the embodiment of the present disclosure.
  • Region-of-interest pooling algorithm: ROI Pooling (Region Of Interest pooling); a pooling layer is sandwiched between consecutive convolutional layers to compress the amount of data and parameters and to reduce overfitting; if the input is an image, the main function of the pooling layer is to compress the image.
  • OCR: Optical Character Recognition.
  • In the prior art, methods for document classification include: the manual approach, in which the uploading person fills in a report or an auditor classifies documents manually, which is time-consuming, labor-intensive and inefficient; image classification, which classifies a document using only its visual information; text classification, which classifies a document based on its extracted text content; and classification based on both image and text content, which obtains classification results from the image and the text separately and gives the final result by voting or predefined rules.
  • Most current document classification techniques consider only a single feature and ignore the correlation between document layout and text. The technical solution provided by the present disclosure addresses this problem.
  • Fig. 1 shows a schematic flowchart of a method for document classification provided by an embodiment of the present disclosure. As shown in Fig. 1, the method may mainly include the following steps:
  • S101: Obtain the text information and image information of the text included in the document to be processed. The image of the document to be classified is captured with a camera, which may be a mobile phone, digital camera, tablet computer, scanner, etc.
  • The image must contain text information; otherwise the document does not fall within the scope of classification of this disclosure.
  • The text and character information in the image is obtained through a text recognition algorithm and is one of the bases for document classification; the image information includes the color features, texture features, shape features and spatial relationship features of the image.
  • S102: Perform fusion based on the text information and the image information to obtain a fusion feature.
  • S103: Obtain the feature sequence of the text according to the fusion feature.
  • S104: Determine the category of the document to be processed based on the predefined document categories and the feature sequence.
  • In the present disclosure the text information and image information are processed, and the result of the fusion may be the feature sequence of row text, of column text, or of text in other arrangements. As an example of a feature sequence of text: for the text "我很幸福" ("I am very happy"), the feature sequence is M = {m_r; r ∈ [1, 4]}, where m denotes the feature of a single character.
  • The feature sequence of the text is then processed and, combined with the predefined document categories, the category of the document is determined.
  • The predefined document categories are, for example, bill types such as VAT invoices, taxi receipts, toll receipts, train tickets and itineraries.
  • The technical solution of the present disclosure can be adopted to classify bills of this kind. The documents may also be of other types, such as case sheets, prescriptions, medical record cover pages, inspection reports and other documents in hospital scenarios.
  • the text includes at least one line of text content; acquiring text information of the text included in the document to be processed includes: acquiring at least one line of text content and position information of at least one line of text content.
  • Acquiring at least one line of text content and location information of at least one line of text content includes: acquiring at least one line of text by using an optical character recognition method.
  • Existing text recognition algorithms mostly use optical character recognition (OCR); OCR can be used to obtain the at least one line of text.
  • Fig. 2 shows a schematic flowchart of an optical character recognition method provided by an embodiment of the present disclosure. As shown in Fig. 2, the method may mainly include the following steps:
  • S201 Text detection algorithm: used to obtain position information of at least one line of text content in the document to be processed.
  • S202 Text recognition algorithm: used to obtain at least one line of text content.
  • Here the so-called text includes the position of the text content, the content of the text, and the arrangement of the text, whether horizontal, vertical, oblique or otherwise.
  • The technical solution can recognize row text, column text, or text arranged in other ways, which broadens the scope of application of the present disclosure.
  • The text detection algorithms include the EAST algorithm, which is prior art and is not described in detail here. The text detection algorithm obtains the position information of the at least one line of text content in the document, specifically the upper-left, upper-right, lower-left and lower-right coordinates of each line, S = {s_i; i ∈ N}.
  • The text recognition algorithms include the CTC algorithm, which is also prior art and is not described in detail here. The text recognition algorithm obtains the line text content T = {t_i; i ∈ N}, where a line of text content t_i has k_i single characters, t_i = {c_ij; j ∈ k_i}.
  • Once OCR has recognized the position and the content of each line of text, the text information of the text in the document to be processed has been obtained. A sketch of this stage is given below.
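  • As a minimal illustration of the data this OCR stage produces, the Python sketch below packages the detector and recognizer as pluggable callables; the function name, signatures and data layout are assumptions for illustration, not part of the disclosure.

```python
from typing import Callable, List, Tuple

# A text-line box is assumed to be its four corner points (upper-left, upper-right, lower-left, lower-right).
Box = Tuple[Tuple[float, float], Tuple[float, float], Tuple[float, float], Tuple[float, float]]

def run_ocr(document_image,
            detect_lines: Callable[[object], List[Box]],     # e.g. an EAST-style detector (S201)
            recognize_line: Callable[[object, Box], str]):    # e.g. a CTC-based recognizer (S202)
    boxes = detect_lines(document_image)                        # S = {s_i; i in N}
    lines = [recognize_line(document_image, b) for b in boxes]  # T = {t_i; i in N}
    chars = [list(t_i) for t_i in lines]                        # t_i = {c_ij; j in k_i}
    return boxes, lines, chars
```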
  • Performing fusion based on the text information and the image information to obtain the fusion feature includes: adding the image information to the position information of the at least one line of text content, and then concatenating the result with the at least one line of text content to obtain the fusion feature.
  • Obtaining the feature sequence of the text according to the fusion feature includes: performing an arithmetic mean over the first character features of the at least one line of text content in the fusion feature, and multiplying the result of the arithmetic mean by the first position feature of the at least one line of text to obtain the feature sequence of the text.
  • The first character feature of the at least one line of text content is obtained by encoding each single character t_i of the line into a 768-dimensional vector; the dimension may also be another number, as long as the first character feature can be extracted.
  • The first position feature in the position information of the at least one line of text content is the 768-dimensional image information of the line of text within the whole document, extracted with a pooling algorithm; this dimension may also be another number, but it must be consistent with the dimension of the vector into which the single character t_i is encoded. The first position feature further includes the 4-point spatial coordinates of the at least one line of text, which are likewise encoded as a 768-dimensional vector; this dimension may also be different, but it must be consistent with the dimension of the vector into which the character t_i is encoded.
  • The 4-point spatial coordinates of the at least one line of text are the upper-left, upper-right, lower-left and lower-right coordinates of each piece of text. A sketch of this fusion step is given below.
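  • A minimal PyTorch sketch of this fusion step follows; the character vocabulary size, the use of an nn.Embedding for characters, an nn.Linear layer as the coordinate encoder, and the flattening of the four corner points into eight scalars are illustrative assumptions rather than the disclosed implementation.

```python
import torch
import torch.nn as nn

D = 768  # feature dimension used in the embodiment; any consistent size works

# Assumed toy encoders standing in for the embodiment's character and position encoders.
char_embed = nn.Embedding(num_embeddings=8000, embedding_dim=D)  # each character t_i -> 768-d vector
coord_encode = nn.Linear(8, D)  # 4 corner points (8 scalars) of a line -> 768-d vector

def fuse_line(char_ids: torch.Tensor,     # (k,)  character ids of one text line
              roi_feature: torch.Tensor,  # (D,)  ROI-pooled image feature F of that line
              corners: torch.Tensor       # (8,)  flattened corner coordinates of that line
              ) -> torch.Tensor:
    T = char_embed(char_ids)              # (k, D) encoded characters
    S = coord_encode(corners)             # (D,)   encoded position
    y = (roi_feature + S).unsqueeze(0)    # (1, D) image information added to position information
    V = torch.cat([T, y], dim=0)          # (k+1, D) fusion feature: concat(T, F + S)
    return V
```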
  • At this point the document can already be classified: the category of the document can be determined from the feature sequence of the text and the predefined document categories. However, to further enhance the representation of the fusion feature in the spatial, visual and semantic dimensions, a deep network built by stacking multiple self-attention layers can be used.
  • Obtaining the feature sequence of the text according to the fusion feature includes: inputting the fusion feature into the stacked self-attention network to obtain an enhanced fusion feature; the initial weight of the self-attention network is the fusion feature.
  • The self-attention network is represented as follows:
  • H_0 = V;  H_l = σ((W_l1·H_{l-1})(W_l2·H_{l-1})^T / √d)·H_{l-1}
  • where W_l* represents the learnable parameter matrix of a fully connected layer whose parameters are not shared across layers, * being a positive integer; d represents the feature dimension, which is 768 in the above technical solution; H_l represents the output of the l-th self-attention layer; V represents the fusion feature; and σ represents the normalization function, which in this embodiment is the sigmoid function. Taking the fusion feature as the initial weight, the layers H_l are stacked step by step.
  • Each self-attention layer first applies two fully connected layers (W_l1 and W_l2) to the input feature H_{l-1}, multiplies the resulting matrices, and normalizes the product with the sigmoid function σ to obtain a weight matrix; the weight matrix is then multiplied by H_{l-1} to obtain the new feature H_l, which is output as the l-th layer. A minimal sketch of this network is given below.
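  • A minimal PyTorch sketch of such a stacked self-attention network is shown below; the number of layers and the 1/√d scaling inside the sigmoid are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class StackedSelfAttention(nn.Module):
    """Sketch of the stacked network described above:
    H_0 = V;  H_l = sigmoid((W_l1 H_{l-1})(W_l2 H_{l-1})^T / sqrt(d)) @ H_{l-1}."""

    def __init__(self, d: int = 768, num_layers: int = 12):
        super().__init__()
        self.d = d
        # two fully connected layers per layer l whose parameters are not shared (W_l1, W_l2)
        self.w1 = nn.ModuleList(nn.Linear(d, d, bias=False) for _ in range(num_layers))
        self.w2 = nn.ModuleList(nn.Linear(d, d, bias=False) for _ in range(num_layers))

    def forward(self, V: torch.Tensor) -> torch.Tensor:  # V: (seq_len, d) fusion feature
        H = V                                             # initial weight H_0 is the fusion feature
        for w1, w2 in zip(self.w1, self.w2):
            weights = torch.sigmoid(w1(H) @ w2(H).T / self.d ** 0.5)  # sigmoid-normalized weight matrix
            H = weights @ H                               # new feature H_l, the output of layer l
        return H
```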
  • Obtaining the feature sequence of the text includes: performing an arithmetic mean over the second character features of the at least one line of text content composed of the features H_l output by the self-attention network, and multiplying the result of the arithmetic mean by the second position feature of the line of text content to obtain the feature sequence of the text.
  • The second character feature is obtained by encoding each single character t_i of the at least one line of text content into a 768-dimensional vector; the corresponding encoded features output by the above deep self-attention network are denoted by x.
  • The second position feature is the 768-dimensional image information of the at least one line of text in the document extracted with the pooling algorithm; the corresponding encoded features output by the above deep self-attention network are denoted by y, where y corresponds to the encoded features of the image information F and the position code S of a line of text content.
  • The 768 dimensions are one implementation of the embodiment and may also be another number; however, the dimensions must be consistent before and after encoding.
  • The encoded output H is expressed as:
  • H = (x_{1,1}, x_{1,2}, …, x_{1,k1}, x_{2,1}, x_{2,2}, …, x_{2,k2}, …, x_{n,1}, …, x_{n,kn}, y_1, …, y_n)
  • The specific implementation is: for all second character features x_r of a line of text content (for example the r-th row, the r-th column, or the r-th element of another arrangement), these second character features are arithmetically averaged, and the Hadamard product of the result with the second position feature y_r gives the feature sequence of that line of text content:
  • m_r = mean(x_{r,1}, …, x_{r,kr}) ⊙ y_r;  M = {m_r; r ∈ [1, N]}. A sketch of this computation follows.
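  • A short Python sketch of this computation is given below, assuming the output H is ordered as above (all character features x followed by the per-line features y); the tensor layout is an assumption for illustration.

```python
import torch

def text_feature_sequence(H: torch.Tensor, chars_per_line: list) -> torch.Tensor:
    """H: (total_chars + n, d), ordered (x_11..x_1k1, ..., x_n1..x_nkn, y_1..y_n);
    chars_per_line: [k_1, ..., k_n]. Returns M with one row m_r per text line."""
    n = len(chars_per_line)
    char_feats, line_feats = H[:-n], H[-n:]
    m, offset = [], 0
    for r, k_r in enumerate(chars_per_line):
        x_r = char_feats[offset:offset + k_r]      # second character features of line r
        m_r = x_r.mean(dim=0) * line_feats[r]      # arithmetic mean, then Hadamard product with y_r
        m.append(m_r)
        offset += k_r
    return torch.stack(m)                           # M = {m_r; r in [1, N]}
```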
  • In the fusion feature, T is the vector of the encoded single characters of the at least one line of text content; F is the vector obtained by using the region-of-interest pooling algorithm to extract the image information of the at least one line of text content from the whole image; and S is the vector obtained by encoding the position of the at least one line of text content.
  • Acquiring the text information and image information of the text included in the document to be processed includes: using a neural network to extract the image information of the document to be processed.
  • Neural networks include: convolutional neural networks.
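  • A minimal sketch of extracting per-line image information with a convolutional backbone and region-of-interest pooling is given below; the choice of a ResNet-18 backbone, the torchvision roi_pool call and the (x1, y1, x2, y2) box format are illustrative assumptions, not the disclosed implementation.

```python
import torch
import torchvision
from torchvision.ops import roi_pool

# Any CNN feature extractor works; ResNet-18 without its classification head is used only for illustration.
backbone = torch.nn.Sequential(*list(torchvision.models.resnet18(weights=None).children())[:-2])

def line_image_features(document_image: torch.Tensor, line_boxes: torch.Tensor) -> torch.Tensor:
    """document_image: (1, 3, H, W); line_boxes: (N, 4) as (x1, y1, x2, y2) in image pixels."""
    fmap = backbone(document_image)                     # feature map of the whole document image
    scale = fmap.shape[-1] / document_image.shape[-1]   # map pixel boxes onto the feature map
    pooled = roi_pool(fmap, [line_boxes], output_size=(1, 1), spatial_scale=scale)
    return pooled.flatten(1)                            # one image-feature vector F per text line
```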
  • Fig. 3 shows a schematic flowchart of determining a document category according to a text feature sequence and a predefined document category provided by an embodiment of the present disclosure. As shown in Fig. 3 , the method may mainly include the following steps:
  • S301: Predefine the document categories.
  • S302: Use a classifier function to obtain the probability of the feature sequence of the text over the predefined document categories; the classifier function includes the softmax function.
  • M' = mean(M): all elements m of the text feature sequence M are averaged, the result is mapped by a fully connected layer to a vector whose size equals the number of predefined categories, and the softmax function then maps this vector to a probability distribution, expressed as follows:
  • scores = softmax(fc(M'))
  • where scores is the mapped probability distribution value and fc is the fully connected layer.
  • S303: Take the predefined document category with the highest probability value as the category of the document:
  • cls = argmax(scores)
  • where cls is the classification category of the document and argmax is the function that returns the index of the maximum value. A sketch of this classification head follows.
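  • A minimal PyTorch sketch of this classification head (S301–S303) is shown below; the number of predefined categories is an assumed parameter.

```python
import torch
import torch.nn as nn

class DocumentClassifierHead(nn.Module):
    """Average the text feature sequence M, project to the predefined categories, softmax, argmax."""

    def __init__(self, d: int = 768, num_categories: int = 5):
        super().__init__()
        self.fc = nn.Linear(d, num_categories)  # fully connected layer fc

    def forward(self, M: torch.Tensor):
        m_prime = M.mean(dim=0)                           # M' = mean(M)
        scores = torch.softmax(self.fc(m_prime), dim=-1)  # scores = softmax(fc(M'))
        cls = int(torch.argmax(scores))                   # cls = argmax(scores)
        return scores, cls
```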
  • FIG. 4 shows a schematic diagram of a document classification device provided by the embodiment of the present disclosure.
  • the document classification apparatus 400 includes an acquisition module 401 , a fusion feature module 402 , a feature sequence acquisition module 403 and a classification module 404 .
  • the text includes at least one line of text content; when acquiring the text information of the text included in the document to be processed, the acquisition module 401 is further configured to: acquire at least one line of text content and location information of at least one line of text content.
  • When used to perform fusion based on the text information and the image information to obtain the fusion feature, the fusion feature module 402 is further configured to:
  • add the image information to the position information of the at least one line of text content, and then concatenate the result with the at least one line of text content to obtain the fusion feature.
  • When used to acquire the feature sequence of the text according to the fusion feature, the feature sequence acquisition module 403 is further configured to: arithmetically average the first character features of the at least one line of text content in the fusion feature, and multiply the result of the arithmetic mean by the first position feature of the at least one line of text content to obtain the feature sequence of the text.
  • When used to acquire the feature sequence of the text according to the fusion feature, the feature sequence acquisition module 403 is further configured to: input the fusion feature into the stacked self-attention network to obtain an enhanced fusion feature; the initial weight of the self-attention network is the fusion feature.
  • The self-attention network is represented as above:
  • H_0 = V;  H_l = σ((W_l1·H_{l-1})(W_l2·H_{l-1})^T / √d)·H_{l-1}
  • where W_l* represents the learnable parameter matrix of a fully connected layer whose parameters are not shared across layers, * being a positive integer; d represents the feature dimension; H_l represents the output of the l-th self-attention layer; V represents the fusion feature; and σ represents the normalization function.
  • When used to acquire the feature sequence of the text, the feature sequence acquisition module 403 is further configured to: arithmetically average the second character features of the at least one line of text content composed of the features H_l output by the self-attention network, and multiply the result of the arithmetic mean by the second position feature of the line of text content to obtain the feature sequence of the text.
  • the fusion feature is expressed as follows:
  • V = concat(T, F + S)
  • where T is the vector of the encoded single characters of the at least one line of text content; F is the vector obtained by using the region-of-interest pooling algorithm to extract the image information of the at least one line of text content from the whole image; S is the vector obtained by encoding the position of the at least one line of text content; and the vector dimensions of T, F and S are the same.
  • the obtaining module 401 is further configured to: use a neural network to extract image information of the document to be processed.
  • the classification module 404 is further configured to: predefine document categories; use a classifier function to obtain the probability of the feature sequence of the text over the predefined document categories; and
  • take the predefined document category with the highest probability value as the category of the document.
  • FIG. 5 shows a schematic flowchart of a training method for a document classification model provided by an embodiment of the present disclosure. As shown in FIG. 5 , the method may mainly include the following steps:
  • S501 Predefine categories of test documents, and predefine correct probability values for documents of each category.
  • The predefined categories of test documents are, for example: forms, contracts, bills, certificates, etc.
  • The correct probability values are, for example: the probability corresponding to the labeled category is 1, and the rest are 0.
  • S502: Acquire the text information and image information of the text of the test document.
  • S503: Process the text information and the image information to obtain the feature sequence of the text.
  • S504 Determine the predicted category of the test document and the predicted probability distribution value of the test document belonging to each category according to the feature sequence of the text and the category of the predefined test document.
  • S505 Adjust document classification model parameters based on the correct probability value and the predicted probability distribution value, and obtain a target document classification model in response to preset conditions.
  • The preset conditions include: the number of training rounds, the training time, and whether all training samples have been used; the preset conditions may also include whether the model converges in the later stage of training. For example, the correct probabilities (the probability corresponding to the labeled category is 1, the rest are 0) and the predicted probability distribution are compared with a cross-entropy loss, which is minimized to optimize the model parameters; model snapshots are saved at fixed intervals; when the model converges, that is, when the cross-entropy no longer decreases in the later stage of training, the snapshot version with the minimum cross-entropy is taken as the optimal model for actual prediction. A sketch of such a training loop is given below.
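  • A minimal PyTorch sketch of such a training loop is shown below; it assumes `model` maps a batch of documents to raw category scores and `data_loader` yields batched (inputs, labels) pairs, and the optimizer, learning rate and snapshot interval are illustrative choices rather than part of the disclosure.

```python
import copy
import torch
import torch.nn as nn

def train_document_classifier(model, data_loader, num_epochs: int = 10,
                              lr: float = 1e-4, snapshot_every: int = 1000):
    criterion = nn.CrossEntropyLoss()  # cross-entropy between predicted distribution and labeled category
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, best_snapshot, step = float("inf"), None, 0

    for _ in range(num_epochs):                 # preset condition: number of training rounds
        for inputs, labels in data_loader:
            logits = model(inputs)
            loss = criterion(logits, labels)    # minimize the cross-entropy
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step % snapshot_every == 0 and loss.item() < best_loss:
                best_loss = loss.item()         # keep the snapshot with the minimum cross-entropy
                best_snapshot = copy.deepcopy(model.state_dict())
    return best_snapshot if best_snapshot is not None else model.state_dict()
```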
  • The technical solution provided by the disclosure integrates multiple modalities, namely text content, text position and image information, and avoids deriving the document classification result from a single modality alone.
  • By using multi-modal fusion, it overcomes the limitation of classification based only on the visual attributes of the document, which is restricted by the document format and cannot handle visually similar documents; it addresses the problem of classification based on plain text, which ignores the visual layout of the content and the image information present in the document and can easily lead to semantic confusion; and it addresses the problem of using image and text independently of each other, in which the correlation between the two modalities is not considered and their results may conflict.
  • the technical solution provided by the disclosure can effectively solve document confusion and improve classification accuracy.
  • the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
  • FIG. 6 shows a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure.
  • Electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions, are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • The device 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 602 or loaded from a storage unit 608 into a random-access memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the device 600 can also be stored.
  • the computing unit 601, ROM 602, and RAM 603 are connected to each other through a bus 604.
  • An input/output (I/O) interface 605 is also connected to the bus 604 .
  • A plurality of components in the device 600 are connected to the I/O interface 605, including: an input unit 606, such as a keyboard, a mouse, etc.; an output unit 607, such as various types of displays, speakers, etc.; a storage unit 608, such as a magnetic disk, an optical disk, etc.; and a communication unit 609, such as a network card, a modem, a wireless communication transceiver, and the like.
  • the communication unit 609 allows the device 600 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
  • The computing unit 601 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processors, controllers, microcontrollers, etc.
  • the computing unit 601 executes the various methods and processes described above, such as the document classification method.
  • The document classification method can be implemented as a computer software program tangibly contained in a machine-readable medium, such as the storage unit 608.
  • part or all of the computer program may be loaded and/or installed on the device 600 via the ROM 602 and/or the communication unit 609.
  • When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the document classification method described above may be performed.
  • the computing unit 601 may be configured in any other appropriate way (for example, by means of firmware) to execute the document classification method.
  • Various implementations of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
  • The programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing device, so that when the program code is executed by the processor or controller, the functions/operations specified in the flowcharts and/or block diagrams are implemented.
  • The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer discs, hard drives, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • The systems and techniques described herein can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
  • Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic, speech, or tactile input).
  • The systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LAN), wide area networks (WAN) and the Internet.
  • a computer system may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact through a communication network.
  • the relationship of client and server arises by computer programs running on the respective computers and having a client-server relationship to each other.
  • steps may be reordered, added or deleted using the various forms of flow shown above.
  • each step described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in the present disclosure can be achieved, no limitation is imposed herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a document classification method and apparatus, an electronic device and a storage medium, which relate to the technical field of artificial intelligence, in particular to the technical fields of computer vision and deep learning, and can be applied in smart city and smart finance scenarios. A specific implementation solution is: a document classification method, comprising: acquiring text information and image information of text comprised in a document to be processed; performing fusion on the basis of the text information and the image information to obtain a fusion feature; acquiring a feature sequence of the text according to the fusion feature; and determining the category of the document on the basis of a predefined document category and the feature sequence. The technical solution provided in the present disclosure solves the technical problem of document obfuscation in document classification, and improves the classification accuracy.

Description

Document classification method and apparatus, electronic device and storage medium
The present disclosure claims priority to the Chinese patent application No. 202110994014.X, filed on August 27, 2021 and entitled "Document classification method and apparatus, electronic device and storage medium", the disclosed content of which is incorporated into the present disclosure by reference.
Technical field
The present disclosure relates to the technical field of artificial intelligence, and in particular to the technical fields of computer vision and deep learning.
Background
Documents are an important information carrier and are widely used in various business and office scenarios. In an automated office or input system, classifying different documents is one of the most critical processes.
Summary of the invention
The present disclosure provides a method, apparatus, electronic device and storage medium for document classification.
According to an aspect of the present disclosure, a method for document classification is provided, including:
acquiring text information and image information of the text included in a document to be processed;
performing fusion based on the text information and the image information to obtain a fusion feature;
acquiring a feature sequence of the text according to the fusion feature;
determining the category of the document to be processed based on a predefined document category and the feature sequence.
In some embodiments, the text includes at least one line of text content; acquiring the text information of the text included in the document to be processed includes: acquiring the at least one line of text content and position information of the at least one line of text content.
In some embodiments, performing fusion based on the text information and the image information to obtain the fusion feature includes: adding the image information to the position information of the at least one line of text content, and then concatenating the result with the at least one line of text content to obtain the fusion feature.
In some embodiments, obtaining the feature sequence of the text according to the fusion feature includes: performing an arithmetic mean over the first character features of the at least one line of text content in the fusion feature, and multiplying the result of the arithmetic mean by the first position feature of the at least one line of text content to obtain the feature sequence of the text.
In some embodiments, obtaining the feature sequence of the text according to the fusion feature includes: inputting the fusion feature into a stacked self-attention network to obtain an enhanced fusion feature; the initial weight of the self-attention network is the fusion feature.
In some embodiments, the self-attention network is represented as follows:
H_0 = V
H_l = σ((W_l1·H_{l-1})(W_l2·H_{l-1})^T / √d)·H_{l-1}
where W_l* represents the learnable parameter matrix of a fully connected layer whose parameters are not shared across layers, * being a positive integer; d represents the feature dimension; H_l represents the output of the l-th self-attention layer; V represents the fusion feature; σ represents the normalization function.
In some embodiments, obtaining the feature sequence of the text includes: arithmetically averaging the second character features of the at least one line of text content composed of the features H_l output by the self-attention network, and multiplying the result of the arithmetic mean by the second position feature of the line of text content to obtain the feature sequence of the text.
In some embodiments, the fusion feature is represented as follows:
V = concat(T, F + S)
where T is the vector of the encoded single characters of the at least one line of text content; F is the vector obtained by using a region-of-interest pooling algorithm to extract the image information of the at least one line of text content from the whole image; S is the vector obtained by encoding the position of the at least one line of text content; and the vector dimensions of T, F and S are the same.
In some embodiments, acquiring the text information and image information of the text included in the document to be processed includes: using a neural network to extract the image information of the document to be processed.
In some embodiments, determining the category of the document to be processed based on the predefined document category and the feature sequence includes:
predefining document categories; using a classifier function to obtain the probability of the feature sequence of the text over the predefined document categories;
taking the predefined document category with the highest probability value as the category of the document.
According to a second aspect of the present disclosure, a document classification apparatus is also provided, including:
an acquisition module, configured to acquire text information and image information of the text included in a document to be processed;
a fusion feature module, configured to perform fusion based on the text information and the image information to obtain a fusion feature;
a feature sequence acquisition module, configured to acquire a feature sequence of the text according to the fusion feature;
a classification module, configured to determine the category of the document to be processed based on a predefined document category and the feature sequence.
According to a third aspect of the present disclosure, a method for training a document classification model is also provided, including:
predefining categories of test documents, and predefining the correct probability value for documents of each category;
acquiring text information and image information of the text of a test document;
processing the text information and the image information to obtain a feature sequence of the text;
determining, according to the feature sequence of the text and the predefined categories of test documents, the predicted category of the test document and the predicted probability distribution values of the test document over the categories;
adjusting parameters of the document classification model based on the correct probability values and the predicted probability distribution values, and obtaining a target document classification model in response to preset conditions being met.
According to a fourth aspect of the present disclosure, an electronic device is also provided, including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the method of any one of the above method technical solutions.
According to a fifth aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is also provided, wherein the computer instructions are used to cause a computer to perform the method of any one of the above method technical solutions.
According to a sixth aspect of the present disclosure, a computer program product is also provided, including a computer program which, when executed by a processor, implements the method of any one of the above method technical solutions.
The technical solution provided by the present disclosure has the following beneficial effects:
(1) The technical solution provided by the present disclosure proposes a document classification method based on multi-modal feature fusion. The method takes the text content, the text image blocks and the text coordinates as input information to enhance the semantic expression of document features.
(2) By building a deep self-attention network to fuse the multi-modal features of the document, the technical solution provided by the present disclosure can effectively resolve document confusion and improve classification accuracy.
It should be understood that what is described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become easy to understand from the following description.
Description of the drawings
The accompanying drawings are used for a better understanding of the present solution and do not constitute a limitation of the present disclosure. In the drawings:
Fig. 1 shows a schematic flowchart of a document classification method provided by an embodiment of the present disclosure;
Fig. 2 shows a schematic flowchart of an optical character recognition method provided by an embodiment of the present disclosure;
Fig. 3 shows a schematic flowchart of determining the category of a document according to the feature sequence of the text and predefined document categories, provided by an embodiment of the present disclosure;
Fig. 4 shows a schematic diagram of a document classification apparatus provided by an embodiment of the present disclosure;
Fig. 5 shows a schematic flowchart of a method for training a document classification model provided by an embodiment of the present disclosure;
Fig. 6 is a block diagram of an electronic device used to implement the document classification method of an embodiment of the present disclosure.
Detailed description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the present disclosure are included to facilitate understanding, and they should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and structures are omitted from the following description for clarity and conciseness.
Explanation of terms:
Region-of-interest pooling algorithm: ROI Pooling (Region Of Interest pooling); a pooling layer is sandwiched between consecutive convolutional layers to compress the amount of data and parameters and to reduce overfitting; if the input is an image, the main function of the pooling layer is to compress the image.
OCR: Optical Character Recognition.
In the prior art, methods for document classification include: the manual approach, in which the uploading person fills in a report or an auditor classifies documents manually, which is time-consuming, labor-intensive and inefficient; image classification, which classifies a document using only its visual information; text classification, which classifies a document based on its extracted text content; and classification based on both image and text content, which obtains classification results from the image and the text separately and gives the final result by voting or predefined rules.
Most current document classification techniques consider only a single feature and ignore the correlation between document layout and text. The technical solution provided by the present disclosure solves the above technical problem.
Fig. 1 shows a schematic flowchart of a document classification method provided by an embodiment of the present disclosure. As shown in Fig. 1, the method may mainly include the following steps:
S101: Obtain the text information and image information of the text included in the document to be processed. The image of the document to be classified is captured with a camera, which may be a mobile phone, digital camera, tablet computer, scanner, etc. The image must contain text information; otherwise the document does not fall within the scope of classification of this disclosure. The text and character information in the image is obtained through a text recognition algorithm and is one of the bases for document classification; the image information includes the color features, texture features, shape features and spatial relationship features of the image.
S102: Perform fusion based on the text information and the image information to obtain a fusion feature.
S103: Obtain the feature sequence of the text according to the fusion feature.
S104: Determine the category of the document to be processed based on the predefined document categories and the feature sequence.
In the present disclosure the text information and image information are processed, and the result of the fusion may be the feature sequence of row text, of column text, or of text in other arrangements. As an example of a feature sequence of text: for the text "我很幸福" ("I am very happy"), the feature sequence is M = {m_r; r ∈ [1, 4]}, where m denotes the feature sequence of a single character.
The feature sequence of the text is processed and, combined with the predefined document categories, the category of the document is determined. The predefined document categories are, for example, bill types such as VAT invoices, taxi receipts, toll receipts, train tickets and itineraries. The technical solution of the present disclosure can be adopted to classify bills of this kind. The documents may also be of other types, such as case sheets, prescriptions, medical record cover pages, inspection reports and other documents in hospital scenarios.
The text includes at least one line of text content; acquiring the text information of the text included in the document to be processed includes: acquiring the at least one line of text content and position information of the at least one line of text content.
Acquiring the at least one line of text content and its position information includes: acquiring the at least one line of text by optical character recognition. Existing text recognition algorithms mostly use optical character recognition, i.e. OCR, which can be used to obtain the at least one line of text.
Fig. 2 shows a schematic flowchart of an optical character recognition method provided by an embodiment of the present disclosure. As shown in Fig. 2, the method may mainly include the following steps:
S201: Text detection algorithm: used to obtain the position information of the at least one line of text content in the document to be processed.
S202: Text recognition algorithm: used to obtain the at least one line of text content. The so-called text includes the position of the text content, the content of the text, and the arrangement of the text, whether horizontal, vertical, oblique or otherwise. The technical solution can recognize row text, column text, or text arranged in other ways, which broadens the scope of application of the present disclosure. The text detection algorithms include the EAST algorithm, which is prior art and is not described in detail here. The text detection algorithm obtains the position information of at least one line of text content in the document, specifically the upper-left, upper-right, lower-left and lower-right coordinates of a line of text content, S = {s_i; i ∈ N}. The text recognition algorithms include the CTC algorithm, which is also prior art and is not described in detail here. The text recognition algorithm obtains a line of text content T = {t_i; i ∈ N}, where a line of text content t_i has k_i single characters, t_i = {c_ij; j ∈ k_i}. Once OCR has recognized the position and the content of a line of text, the text information of the text in the document to be processed has been obtained.
Performing fusion based on the text information and the image information to obtain the fusion feature includes: adding the image information to the position information of the at least one line of text content, and then concatenating the result with the at least one line of text content to obtain the fusion feature.
Obtaining the text information and the image information of the document and fusing them is a technical feature that distinguishes this technical solution from the prior art. The semantic expression of the text is enhanced by fusing multi-modal features. The so-called multi-modality refers to the two modalities of the document: text information and image information.
Obtaining the feature sequence of the text according to the fusion feature includes: performing an arithmetic mean over the first character features of the at least one line of text content in the fusion feature, and multiplying the result of the arithmetic mean by the first position feature of the at least one line of text to obtain the feature sequence of the text.
The first character feature of the at least one line of text content is obtained by encoding each single character t_i of the at least one line of text content into a 768-dimensional vector; the dimension may also be another number, as long as the first character feature can be extracted. The first position feature in the position information of the at least one line of text content is the 768-dimensional image information of a line of text within the whole document, extracted with a pooling algorithm; this dimension may also be another number, but it must be consistent with the dimension of the vector into which the single character t_i is encoded. The first position feature further includes the 4-point spatial coordinates of the at least one line of text, which are likewise encoded as a 768-dimensional vector; this dimension may also be different, but it must be consistent with the dimension of the vector into which the character t_i is encoded. The 4-point spatial coordinates of the at least one line of text are the upper-left, upper-right, lower-left and lower-right coordinates of each piece of text.
At this point, the document can already be classified: the category of the document is determined according to the feature sequence of the text and the predefined document categories.

However, in order to further enhance the information representation of the fusion feature in the spatial, visual and semantic dimensions, a technical solution that stacks a deep network from multiple self-attention layers may be adopted.

Acquiring the feature sequence of the text according to the fusion feature includes: inputting the fusion feature into a stacked self-attention network to obtain an enhanced fusion feature; the initial weight of the self-attention network is the fusion feature.
The self-attention network is expressed as follows:

H_0 = V

H_l = σ( (H_{l-1} W_l1)(H_{l-1} W_l2)^T / √d ) H_{l-1}

where W_l* denotes the learnable parameter matrix of a fully connected layer whose learnable parameters are not shared across layers, with * being a positive integer; d denotes the feature dimension, which is 768 in the above technical solution; H_l denotes the output of the l-th self-attention layer; V denotes the fusion feature; and σ denotes a normalization function, which in this embodiment is the sigmoid function. With the fusion feature as the initial weight, H_l is stacked layer by layer. The self-attention network first applies two fully connected layers (W_l1 and W_l2) to the input feature H_{l-1}, multiplies the two resulting matrices, and normalizes the product through the sigmoid function σ to obtain a weight matrix; this weight matrix is then multiplied by H_{l-1} to obtain the new feature H_l, which is output by the l-th layer.
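The layer update above can be sketched numerically as follows. This is a minimal illustration only: the weight matrices are random stand-ins for the learnable W_l1 and W_l2, the scaling by √d reflects one common reading of the formula, and nothing here is tied to a particular deep learning framework.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stacked_self_attention(V, num_layers=2, seed=0):
    """Sketch of H_0 = V and H_l = sigmoid((H_{l-1} W_l1)(H_{l-1} W_l2)^T / sqrt(d)) H_{l-1}."""
    rng = np.random.default_rng(seed)
    n, d = V.shape                    # n fused tokens, d = 768 in the embodiment
    H = V                             # H_0 = V: the fusion feature is the initial input
    for _ in range(num_layers):
        W1 = rng.standard_normal((d, d)) / np.sqrt(d)   # stand-in for the learnable W_l1
        W2 = rng.standard_normal((d, d)) / np.sqrt(d)   # stand-in for the learnable W_l2
        weights = sigmoid((H @ W1) @ (H @ W2).T / np.sqrt(d))  # (n, n) weight matrix
        H = weights @ H               # multiply the weight matrix by H_{l-1} to obtain H_l
    return H

V = np.random.default_rng(1).standard_normal((6, 768))  # toy fused features
print(stacked_self_attention(V).shape)  # (6, 768)
```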
Acquiring the feature sequence of the text includes: performing an arithmetic mean over the second single-character features in the at least one line of text content composed of the features H_l output by the self-attention network, and multiplying the result of the arithmetic mean with the second position feature in the line of text content to obtain the feature sequence of the text. The second single-character feature is obtained by encoding each single character t_i in the at least one line of text content into a 768-dimensional vector and passing it through the deep self-attention network described above; the encoded feature is denoted by x. The second position feature is the 768-dimensional image information, extracted by a pooling algorithm, of the at least one line of text within the whole document, likewise encoded by the deep self-attention network described above; the encoded feature is denoted by y, and y corresponds to the encoded feature of the image information F and the position s of a line of text content. The 768 dimensions are one implementation of the embodiment; other dimensions are possible, but the dimension must remain consistent before and after encoding. The encoded H is expressed as:
H = (x_{1,1}, x_{1,2}, …, x_{1,k_1}, x_{2,1}, x_{2,2}, …, x_{2,k_2}, …, x_{n,1}, …, x_{n,k_n}, y_1, …, y_n)
The second single-character features in the at least one line of text content composed of the features H_l output by the self-attention network are arithmetically averaged, and the result of the arithmetic mean is multiplied with the second position feature in the line of text content to obtain the feature sequence of the text. A specific implementation is as follows: for all second single-character features x_{r,*} of one line of text content (for example the r-th row, the r-th column, or the r-th position in some other arrangement), these second single-character features are arithmetically averaged and the result is combined with the second position feature y_r through a Hadamard product, yielding the feature sequence of the line of text content:

M = {m_r; r ∈ [1, N]};

where

m_r = mean(x_{r,1}, …, x_{r,k_r}) ⊙ y_r

and ⊙ denotes the element-wise (Hadamard) product.
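The per-line computation above can be expressed compactly as follows; the array shapes and toy values are assumptions for illustration, and the element-wise multiplication stands for the Hadamard product.

```python
import numpy as np

def line_feature_sequence(x_by_line, y_by_line):
    """m_r = mean(x_{r,1}, ..., x_{r,k_r}) ⊙ y_r for every line r, giving M = {m_r}."""
    return np.stack([x.mean(axis=0) * y for x, y in zip(x_by_line, y_by_line)])

rng = np.random.default_rng(0)
d = 768
x_by_line = [rng.standard_normal((4, d)), rng.standard_normal((7, d))]  # k_1 = 4, k_2 = 7 characters
y_by_line = [rng.standard_normal(d), rng.standard_normal(d)]            # one position feature per line
M = line_feature_sequence(x_by_line, y_by_line)
print(M.shape)  # (2, 768): one feature m_r per line of text
```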
The fusion feature is expressed as follows:

V = concat(T, F + S)

where T is the vector of the encoded single characters in the at least one line of text content; F is the vector of the image information of the at least one line of text content over the whole image, extracted using a region-of-interest pooling algorithm; S is the vector obtained by position-encoding the at least one line of text content; and the vector dimensions of T, F and S are the same. Acquiring the text information and the image information of the text included in the document to be processed includes: extracting the image information of the document to be processed by using a neural network. The neural network includes a convolutional neural network.
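One plausible reading of this fusion step is sketched below; treating T, F and S as row-stacked matrices and concatenating along the sequence axis is an assumption made for illustration, consistent with the encoded sequence H given earlier.

```python
import numpy as np

def fuse_features(T, F, S):
    """V = concat(T, F + S): character embeddings T and per-line sums F + S stacked along the sequence axis."""
    assert F.shape == S.shape and T.shape[1] == F.shape[1]
    return np.concatenate([T, F + S], axis=0)

rng = np.random.default_rng(0)
d = 768
T = rng.standard_normal((11, d))   # 11 encoded single characters across all lines
F = rng.standard_normal((2, d))    # RoI-pooled image feature for each of 2 lines
S = rng.standard_normal((2, d))    # position encoding for each of the 2 lines
print(fuse_features(T, F, S).shape)  # (13, 768)
```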
Fig. 3 shows a schematic flowchart of determining the category of a document according to the feature sequence of the text and predefined document categories, provided by an embodiment of the present disclosure. As shown in Fig. 3, the method mainly includes the following steps:

S301: Predefine document categories.

First, the categories of documents are determined, for example invoices, hospital examination documents, and so on.

S302: Use a classifier function to obtain the probability of the feature sequence of the text over the predefined document categories; the classifier function includes the softmax function.

M′ = mean(M): all elements m in the text feature sequence M are averaged; a fully connected layer then maps the result to a vector whose size equals the number of predefined categories, and the softmax function maps that vector to a probability distribution, expressed as follows:
scores = softmax(fc(M′))

where scores is the mapped probability distribution value and fc is the fully connected layer.
S303: Take the predefined document category with the highest probability value as the category of the document.

cls = argmax(scores)

where cls is the classification category of the document, and argmax is the function that returns the position of the maximum value.
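Steps S302 and S303 can be sketched numerically as follows; the fully connected layer weights, the number of categories and the feature sequence are made-up assumptions used only to show the shape of the computation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def classify(M, W_fc, b_fc):
    """S302/S303 sketch: M' = mean(M); scores = softmax(fc(M')); cls = argmax(scores)."""
    m_prime = M.mean(axis=0)                 # average all elements of the feature sequence
    scores = softmax(W_fc @ m_prime + b_fc)  # probability over the predefined categories
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(0)
d, num_classes = 768, 4                      # e.g. invoice / contract / bill / licence
M = rng.standard_normal((2, d))              # feature sequence from the previous step
W_fc, b_fc = rng.standard_normal((num_classes, d)), np.zeros(num_classes)
cls, scores = classify(M, W_fc, b_fc)
print(cls, scores.round(3))
```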
At this point, after fusing the features and enhancing the information representation of the fusion feature in the spatial, visual and semantic dimensions, classification is performed on the basis of the feature sequence of the text, which meets the requirements of document classification.
Based on the same principle as the document classification method described above, an embodiment of the present disclosure further provides a document classification apparatus. Fig. 4 shows a schematic diagram of a document classification apparatus provided by an embodiment of the present disclosure. As shown in Fig. 4, the document classification apparatus 400 includes an acquisition module 401, a fusion feature module 402, a feature sequence acquisition module 403 and a classification module 404.

In an embodiment of the present disclosure, the text includes at least one line of text content; when acquiring the text information of the text included in the document to be processed, the acquisition module 401 is further configured to: acquire the at least one line of text content and the position information of the at least one line of text content.

In an embodiment of the present disclosure, when performing fusion based on the text information and the image information to obtain the fusion feature, the fusion feature module 402 is further configured to:

add the image information to the position information of the at least one line of text content, and then concatenate the result with the at least one line of text content to obtain the fusion feature.

In an embodiment of the present disclosure, when acquiring the feature sequence of the text according to the fusion feature, the feature sequence acquisition module 403 is further configured to: perform an arithmetic mean over the first single-character features in the at least one line of text content in the fusion feature, and multiply the result of the arithmetic mean with the first position feature in the at least one line of text content to obtain the feature sequence of the text.

In an embodiment of the present disclosure, when acquiring the feature sequence of the text according to the fusion feature, the feature sequence acquisition module 403 is further configured to: input the fusion feature into a stacked self-attention network to obtain an enhanced fusion feature; the initial weight of the self-attention network is the fusion feature.
In an embodiment of the present disclosure, the self-attention network is expressed as follows:

H_0 = V

H_l = σ( (H_{l-1} W_l1)(H_{l-1} W_l2)^T / √d ) H_{l-1}

where W_l* denotes the learnable parameter matrix of a fully connected layer whose learnable parameters are not shared across layers, with * being a positive integer; d denotes the feature dimension; H_l denotes the output of the l-th self-attention layer; V denotes the fusion feature; and σ denotes a normalization function.
In an embodiment of the present disclosure, when acquiring the feature sequence of the text, the feature sequence acquisition module 403 is further configured to: perform an arithmetic mean over the second single-character features in the at least one line of text content composed of the features H_l output by the self-attention network, and multiply the result of the arithmetic mean with the second position feature in the line of text content to obtain the feature sequence of the text.

In an embodiment of the present disclosure, the fusion feature is expressed as follows:

V = concat(T, F + S)

where T is the vector of the encoded single characters in the at least one line of text content; F is the vector of the image information of the at least one line of text content over the whole image, extracted using a region-of-interest pooling algorithm; S is the vector obtained by position-encoding the at least one line of text content; and the vector dimensions of T, F and S are the same.

In an embodiment of the present disclosure, the acquisition module 401 is further configured to: extract the image information of the document to be processed by using a neural network.
In an embodiment of the present disclosure, the classification module 404 is further configured to:

predefine document categories;

use a classifier function to obtain the probability of the feature sequence of the text over the predefined document categories;

take the predefined document category with the highest probability value as the category of the document.
The present disclosure further provides a training method for a document classification model. Fig. 5 shows a schematic flowchart of a training method for a document classification model provided by an embodiment of the present disclosure. As shown in Fig. 5, the method mainly includes the following steps:

S501: Predefine the categories of test documents, and predefine the correct probability value of each category of document.

Examples of predefined categories of test documents include forms, contracts, bills, certificates and licences. An example of the correct probability values is: the probability corresponding to the annotated category is 1, and the rest are 0.

S502: Obtain the text information and the image information of the text of the test document.

S503: Process the text information and the image information to obtain the feature sequence of the text.

S504: According to the feature sequence of the text and the predefined categories of test documents, determine the predicted category of the test document and the predicted probability distribution values of the test document belonging to each category of document.

S505: Adjust the parameters of the document classification model based on the correct probability values and the predicted probability distribution values, and obtain the target document classification model in response to a preset condition.

The preset condition includes: the number of training rounds, the training time, and whether all training samples have been trained. The preset condition may further include whether the model has converged in the later stage of training. For example, the correct probabilities (the probability corresponding to the annotated category is 1, and the rest are 0) and the predicted probability distribution are used to compute and optimize the model parameters with a minimum cross-entropy algorithm, and model snapshots are saved at fixed intervals; after the model converges, that is, when the cross-entropy no longer decreases in the later stage of training, the snapshot version with the minimum cross-entropy is taken as the optimal model for actual prediction.
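A hedged sketch of the training loop described in S505 is given below; the model, the data loader and the validation-loss callback are assumed placeholders supplied by the caller, and the choice of optimizer and framework is an illustration rather than part of the disclosure.

```python
import copy
import torch
import torch.nn as nn

def train(model, train_loader, val_loss, epochs=10, snapshot_every=1, lr=1e-4):
    """Minimise cross-entropy against the annotated categories, snapshot at a fixed
    interval, and keep the snapshot with the lowest cross-entropy for prediction."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()           # minimum cross-entropy objective
    best_loss, best_state = float("inf"), None
    for epoch in range(epochs):
        for batch, labels in train_loader:      # labels: index of the annotated category
            optimizer.zero_grad()
            loss = criterion(model(batch), labels)
            loss.backward()
            optimizer.step()
        if (epoch + 1) % snapshot_every == 0:   # save a snapshot at a fixed interval
            current = val_loss(model)           # cross-entropy on held-out documents
            if current < best_loss:
                best_loss = current
                best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)           # best snapshot is used for actual prediction
    return model
```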
The technical solution provided by the present disclosure fuses multiple modalities together, that is, it fuses the text content, the text position and the image information, avoiding classification results obtained by processing the information of a single modality only. The multimodal fusion approach effectively addresses three problems: classification based only on the visual attributes of a document is limited to the document layout and cannot handle similar-looking documents; classification using plain text alone ignores the visual layout of the content and the pictures that may exist in the document, which easily leads to semantic confusion; and using images and text independently of each other ignores the correlation between the two modalities, so their results may conflict. The technical solution provided by the present disclosure can effectively resolve document confusion and improve classification accuracy.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium and a computer program product.

Fig. 6 shows a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementations of the present disclosure described and/or claimed herein.

As shown in Fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 602 or a computer program loaded from a storage unit 608 into a random access memory (RAM) 603. Various programs and data required for the operation of the device 600 can also be stored in the RAM 603. The computing unit 601, the ROM 602 and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

A plurality of components in the device 600 are connected to the I/O interface 605, including: an input unit 606, such as a keyboard or a mouse; an output unit 607, such as various types of displays and speakers; a storage unit 608, such as a magnetic disk or an optical disc; and a communication unit 609, such as a network card, a modem or a wireless communication transceiver. The communication unit 609 allows the device 600 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.

The computing unit 601 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller and the like. The computing unit 601 performs the various methods and processing described above, such as the document classification method. For example, in some embodiments, the document classification method may be implemented as a computer software program tangibly contained in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the document classification method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the document classification method in any other appropriate manner (for example, by means of firmware).
Various implementations of the systems and techniques described herein above may be implemented in a digital electronic circuit system, an integrated circuit system, a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), an application-specific standard product (ASSP), a system on chip (SOC), a complex programmable logic device (CPLD), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include being implemented in one or more computer programs, where the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor, and may receive data and instructions from a storage system, at least one input apparatus and at least one output apparatus, and transmit data and instructions to the storage system, the at least one input apparatus and the at least one output apparatus.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer or another programmable data processing apparatus, so that, when executed by the processor or controller, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any suitable combination of the foregoing. More specific examples of the machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer that has: a display apparatus for displaying information to the user (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and a pointing apparatus (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of apparatuses may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback or tactile feedback); and input from the user may be received in any form (including acoustic input, speech input or tactile input).

The systems and techniques described herein may be implemented in a computing system that includes a back-end component (for example, as a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer with a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware or front-end components. The components of the system may be interconnected by digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN) and the Internet.

A computer system may include a client and a server. The client and the server are generally remote from each other and usually interact through a communication network. The relationship between the client and the server arises from computer programs that run on the corresponding computers and have a client-server relationship with each other.
It should be understood that steps may be reordered, added or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially or in a different order, as long as the desired result of the technical solution disclosed in the present disclosure can be achieved; no limitation is imposed herein.

The specific implementations described above do not limit the scope of protection of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present disclosure shall be included within the scope of protection of the present disclosure.

Claims (24)

  1. A method for document classification, comprising:

    acquiring text information and image information of text included in a document to be processed;

    performing fusion based on the text information and the image information to obtain a fusion feature;

    acquiring a feature sequence of the text according to the fusion feature; and

    determining a category of the document to be processed based on predefined document categories and the feature sequence.

  2. The method according to claim 1, wherein the text comprises at least one line of text content; and the acquiring text information of the text included in the document to be processed comprises: acquiring the at least one line of text content and position information of the at least one line of text content.

  3. The method according to claim 2, wherein the performing fusion based on the text information and the image information to obtain a fusion feature comprises:

    adding the image information to the position information of the at least one line of text content, and then concatenating the result with the at least one line of text content to obtain the fusion feature.

  4. The method according to claim 3, wherein the acquiring a feature sequence of the text according to the fusion feature comprises: performing an arithmetic mean over first single-character features in the at least one line of text content in the fusion feature, and multiplying the result of the arithmetic mean with a first position feature in the at least one line of text content to obtain the feature sequence of the text.

  5. The method according to claim 3, wherein the acquiring a feature sequence of the text according to the fusion feature comprises: inputting the fusion feature into a stacked self-attention network to obtain an enhanced fusion feature, wherein an initial weight of the self-attention network is the fusion feature.
  6. The method according to claim 5, wherein the self-attention network is expressed as follows:

    H_0 = V

    H_l = σ( (H_{l-1} W_l1)(H_{l-1} W_l2)^T / √d ) H_{l-1}

    wherein W_l* denotes a learnable parameter matrix of a fully connected layer whose learnable parameters are not shared across layers, * being a positive integer; d denotes a feature dimension; H_l denotes an output of the l-th self-attention layer; V denotes the fusion feature; and σ denotes a normalization function.

  7. The method according to claim 6, wherein the acquiring the feature sequence of the text comprises: performing an arithmetic mean over second single-character features in the at least one line of text content composed of the features H_l output by the self-attention network, and multiplying the result of the arithmetic mean with a second position feature in the line of text content to obtain the feature sequence of the text.
  8. The method according to claim 3, wherein the fusion feature is expressed as follows:

    V = concat(T, F + S)

    wherein T is a vector of the encoded single characters in the at least one line of text content; F is a vector of the image information of the at least one line of text content over the whole image, extracted using a region-of-interest pooling algorithm; S is a vector obtained by position-encoding the at least one line of text content; and the vector dimensions of T, F and S are the same.

  9. The method according to claim 1, wherein the acquiring text information and image information of text included in the document to be processed comprises: extracting the image information of the document to be processed by using a neural network.

  10. The method according to claim 1, wherein the determining a category of the document to be processed based on predefined document categories and the feature sequence comprises:

    predefining document categories;

    using a classifier function to obtain probabilities of the feature sequence of the text over the predefined document categories; and

    taking the predefined document category with the highest probability value among the probabilities as the category of the document.
  11. A document classification apparatus, comprising:

    an acquisition module configured to acquire text information and image information of text included in a document to be processed;

    a fusion feature module configured to perform fusion based on the text information and the image information to obtain a fusion feature;

    a feature sequence acquisition module configured to acquire a feature sequence of the text according to the fusion feature; and

    a classification module configured to determine a category of the document to be processed based on predefined document categories and the feature sequence.

  12. The apparatus according to claim 11, wherein the text comprises at least one line of text content; and when acquiring the text information of the text included in the document to be processed, the acquisition module is further configured to: acquire the at least one line of text content and position information of the at least one line of text content.

  13. The apparatus according to claim 12, wherein, when performing fusion based on the text information and the image information to obtain a fusion feature, the fusion feature module is further configured to:

    add the image information to the position information of the at least one line of text content, and then concatenate the result with the at least one line of text content to obtain the fusion feature.

  14. The apparatus according to claim 13, wherein, when acquiring the feature sequence of the text according to the fusion feature, the feature sequence acquisition module is further configured to: perform an arithmetic mean over first single-character features in the at least one line of text content in the fusion feature, and multiply the result of the arithmetic mean with a first position feature in the at least one line of text content to obtain the feature sequence of the text.

  15. The apparatus according to claim 13, wherein, when acquiring the feature sequence of the text according to the fusion feature, the feature sequence acquisition module is further configured to: input the fusion feature into a stacked self-attention network to obtain an enhanced fusion feature, wherein an initial weight of the self-attention network is the fusion feature.
  16. The apparatus according to claim 15, wherein the self-attention network is expressed as follows:

    H_0 = V

    H_l = σ( (H_{l-1} W_l1)(H_{l-1} W_l2)^T / √d ) H_{l-1}

    wherein W_l* denotes a learnable parameter matrix of a fully connected layer whose learnable parameters are not shared across layers, * being a positive integer; d denotes a feature dimension; H_l denotes an output of the l-th self-attention layer; V denotes the fusion feature; and σ denotes a normalization function.

  17. The apparatus according to claim 16, wherein, when acquiring the feature sequence of the text, the feature sequence acquisition module is further configured to: perform an arithmetic mean over second single-character features in the at least one line of text content composed of the features H_l output by the self-attention network, and multiply the result of the arithmetic mean with a second position feature in the line of text content to obtain the feature sequence of the text.
  18. The apparatus according to claim 13, wherein the fusion feature is expressed as follows:

    V = concat(T, F + S)

    wherein T is a vector of the encoded single characters in the at least one line of text content; F is a vector of the image information of the at least one line of text content over the whole image, extracted using a region-of-interest pooling algorithm; S is a vector obtained by position-encoding the at least one line of text content; and the vector dimensions of T, F and S are the same.

  19. The apparatus according to claim 11, wherein the acquisition module is further configured to: extract the image information of the document to be processed by using a neural network.

  20. The apparatus according to claim 11, wherein the classification module is further configured to:

    predefine document categories;

    use a classifier function to obtain probabilities of the feature sequence of the text over the predefined document categories; and

    take the predefined document category with the highest probability value among the probabilities as the category of the document.
  21. A training method for a document classification model, comprising:

    predefining categories of test documents, and predefining a correct probability value for each category of document;

    obtaining text information and image information of text of a test document;

    processing the text information and the image information to obtain a feature sequence of the text;

    determining, according to the feature sequence of the text and the predefined categories of test documents, a predicted category of the test document and predicted probability distribution values of the test document belonging to each category of document; and

    adjusting parameters of the document classification model based on the correct probability values and the predicted probability distribution values, and obtaining a target document classification model in response to a preset condition.

  22. An electronic device, comprising:

    at least one processor; and

    a memory communicatively connected to the at least one processor; wherein

    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-10.

  23. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-10.

  24. A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-10.
PCT/CN2022/094788 2021-08-27 2022-05-24 Document classification method and apparatus, electronic device and storage medium WO2023024614A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110994014.XA CN113742483A (en) 2021-08-27 2021-08-27 Document classification method and device, electronic equipment and storage medium
CN202110994014.X 2021-08-27

Publications (1)

Publication Number Publication Date
WO2023024614A1 true WO2023024614A1 (en) 2023-03-02

Family

ID=78733361

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/094788 WO2023024614A1 (en) 2021-08-27 2022-05-24 Document classification method and apparatus, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN113742483A (en)
WO (1) WO2023024614A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116189193A (en) * 2023-04-25 2023-05-30 杭州镭湖科技有限公司 Data storage visualization method and device based on sample information
CN116912871A (en) * 2023-09-08 2023-10-20 上海蜜度信息技术有限公司 Identity card information extraction method, system, storage medium and electronic equipment
CN117112734A (en) * 2023-10-18 2023-11-24 中山大学深圳研究院 Semantic-based intellectual property text representation and classification method and terminal equipment

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113742483A (en) * 2021-08-27 2021-12-03 北京百度网讯科技有限公司 Document classification method and device, electronic equipment and storage medium
CN114429637B (en) * 2022-01-14 2023-04-07 北京百度网讯科技有限公司 Document classification method, device, equipment and storage medium
CN114399775A (en) * 2022-01-21 2022-04-26 平安科技(深圳)有限公司 Document title generation method, device, equipment and storage medium
CN114445833B (en) * 2022-01-28 2024-05-14 北京百度网讯科技有限公司 Text recognition method, device, electronic equipment and storage medium
CN114626455A (en) * 2022-03-11 2022-06-14 北京百度网讯科技有限公司 Financial information processing method, device, equipment, storage medium and product
CN114898388B (en) * 2022-03-28 2024-05-24 支付宝(杭州)信息技术有限公司 Document picture classification method and device, storage medium and electronic equipment
CN116152817B (en) * 2022-12-30 2024-01-02 北京百度网讯科技有限公司 Information processing method, apparatus, device, medium, and program product

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344815B (en) * 2018-12-13 2021-08-13 深源恒际科技有限公司 Document image classification method
CN110298338B (en) * 2019-06-20 2021-08-24 北京易道博识科技有限公司 Document image classification method and device
CN111680490B (en) * 2020-06-10 2022-10-28 东南大学 Cross-modal document processing method and device and electronic equipment
CN113204615B (en) * 2021-04-29 2023-11-24 北京百度网讯科技有限公司 Entity extraction method, device, equipment and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200019769A1 (en) * 2018-07-15 2020-01-16 Netapp, Inc. Multi-modal electronic document classification
US20200302016A1 (en) * 2019-03-20 2020-09-24 Adobe Inc. Classifying Structural Features of a Digital Document by Feature Type using Machine Learning
CN111782808A (en) * 2020-06-29 2020-10-16 北京市商汤科技开发有限公司 Document processing method, device, equipment and computer readable storage medium
CN112685565A (en) * 2020-12-29 2021-04-20 平安科技(深圳)有限公司 Text classification method based on multi-mode information fusion and related equipment thereof
CN112966522A (en) * 2021-03-03 2021-06-15 北京百度网讯科技有限公司 Image classification method and device, electronic equipment and storage medium
CN113033534A (en) * 2021-03-10 2021-06-25 北京百度网讯科技有限公司 Method and device for establishing bill type identification model and identifying bill type
CN113742483A (en) * 2021-08-27 2021-12-03 北京百度网讯科技有限公司 Document classification method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SUN QIANG, FU YANWEI: "Stacked Self-Attention Networks for Visual Question Answering", PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 13 June 2019 (2019-06-13), pages 207 - 211, XP093038913, DOI: 10.1145/3323873.3325044 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116189193A (en) * 2023-04-25 2023-05-30 杭州镭湖科技有限公司 Data storage visualization method and device based on sample information
CN116189193B (en) * 2023-04-25 2023-11-10 杭州镭湖科技有限公司 Data storage visualization method and device based on sample information
CN116912871A (en) * 2023-09-08 2023-10-20 上海蜜度信息技术有限公司 Identity card information extraction method, system, storage medium and electronic equipment
CN116912871B (en) * 2023-09-08 2024-02-23 上海蜜度信息技术有限公司 Identity card information extraction method, system, storage medium and electronic equipment
CN117112734A (en) * 2023-10-18 2023-11-24 中山大学深圳研究院 Semantic-based intellectual property text representation and classification method and terminal equipment
CN117112734B (en) * 2023-10-18 2024-02-02 中山大学深圳研究院 Semantic-based intellectual property text representation and classification method and terminal equipment

Also Published As

Publication number Publication date
CN113742483A (en) 2021-12-03


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22859962

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE