CN115601620A - Feature fusion method and device, electronic equipment and computer readable storage medium - Google Patents

Feature fusion method and device, electronic equipment and computer readable storage medium

Info

Publication number
CN115601620A
Authority
CN
China
Prior art keywords
feature
attention
cross
self
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211304730.1A
Other languages
Chinese (zh)
Inventor
李煜林
钦夏孟
章成全
姚锟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211304730.1A priority Critical patent/CN115601620A/en
Publication of CN115601620A publication Critical patent/CN115601620A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a feature fusion method and apparatus, an electronic device, and a computer-readable storage medium, relating to the technical field of artificial intelligence, in particular to deep learning, image processing, large models, and computer vision, and applicable to scenes such as optical character recognition. The specific implementation scheme is as follows: a first input feature and a second input feature are acquired, where the correlation between the first input feature and a target analysis object and the correlation between the second input feature and the target analysis object both satisfy a preset correlation condition; the first input feature and the second input feature are input into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature. The feature fusion network of this scheme performs feature fusion processing on the first input feature and the second input feature to obtain a first cross attention feature and a second cross attention feature that fuse the characteristics of both input features, which can improve the effect of feature fusion.

Description

Feature fusion method and device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to the technical field of artificial intelligence, and in particular, to the technical field of deep learning, image processing, large models, and computer vision, which can be applied to scenes such as Optical Character Recognition (OCR), and in particular, to a feature fusion method, an apparatus, an electronic device, and a computer-readable storage medium.
Background
Feature fusion refers to fusing different features into a new feature, so that the characteristics of the different features are better utilized to improve the performance of the model.
The effect of feature fusion strongly affects the performance of the model, so how to improve the effect of feature fusion when combining different features into a new feature, in order to ensure model performance, has become an important technical problem in the related field.
Disclosure of Invention
In order to solve at least one of the above drawbacks, the present disclosure provides a feature fusion method, an apparatus, an electronic device, and a computer-readable storage medium.
According to a first aspect of the present disclosure, there is provided a feature fusion method, the method comprising:
acquiring a first input feature and a second input feature, wherein the correlation between the first input feature and the target analysis object and the correlation between the second input feature and the target analysis object both meet a preset correlation condition;
inputting the first input feature and the second input feature into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature;
wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork;
the first self-attention subnetwork is used for performing self-attention processing on the first input feature to obtain a first self-attention feature;
the second self-attention sub-network is used for carrying out self-attention processing on the second input feature to obtain a second self-attention feature;
the first cross attention sub-network is used for carrying out cross attention processing on the basis of the first self-attention feature and a second attention weight parameter corresponding to the second self-attention sub-network to obtain a first cross attention feature;
the second cross attention subnetwork is used for carrying out cross attention processing on the basis of the second self attention feature and the first attention weight parameter corresponding to the first self attention subnetwork to obtain a second cross attention feature.
According to a second aspect of the present disclosure, there is provided a feature fusion apparatus, the apparatus comprising:
the feature acquisition module is used for acquiring a first input feature and a second input feature, wherein the correlation between the first input feature and the target analysis object and the correlation between the second input feature and the target analysis object both meet a preset correlation condition;
the feature fusion module is used for inputting the first input feature and the second input feature into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature;
wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork;
the first self-attention subnetwork is used for performing self-attention processing on the first input feature to obtain a first self-attention feature;
the second self-attention sub-network is used for carrying out self-attention processing on the second input feature to obtain a second self-attention feature;
the first cross attention sub-network is used for carrying out cross attention processing on the basis of the first self-attention feature and a second attention weight parameter corresponding to the second self-attention sub-network to obtain a first cross attention feature;
the second cross attention subnetwork is used for performing cross attention processing based on the second self attention feature and the first attention weight parameter corresponding to the first self attention subnetwork to obtain a second cross attention feature.
According to a third aspect of the present disclosure, there is provided an electronic apparatus comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the feature fusion method.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the above-described feature fusion method.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flow chart diagram of a feature fusion method provided in an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart diagram illustrating one embodiment of a feature fusion method provided by the embodiments of the present disclosure;
FIG. 3 is a flowchart illustrating a document image classification method according to an embodiment of the disclosure;
FIG. 4 is a schematic structural diagram of a feature fusion apparatus provided in an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a document image classification device according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of an electronic device for implementing the feature fusion method of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
When different features are fused into a new feature for model processing, the effect of feature fusion has an important influence on the model performance, and therefore how to improve the effect of feature fusion becomes an important technical problem in the related field.
When classifying document images, the related art generally acquires image features and text features of the document image separately and then determines a document classification result based on the image features and the text features, for example by obtaining one classification result from the image features and another from the text features and then determining a final classification result from the two classification results based on rules such as voting. This approach does not take the correlation between features of different modalities into account, resulting in a final classification result that is not accurate enough.
Some feature fusion methods also exist in the related art, which are used for performing feature fusion on image features and text features, but the effect of feature fusion through these feature fusion methods may be poor, and the accuracy of classification results may also be affected. For example, a new feature obtained by fusing an image feature and a text feature may have a poor effect on semantic expression, and may have a semantic confusion problem, which affects the accuracy of a classification result.
The embodiment of the present disclosure provides a feature fusion method, a feature fusion device, an electronic device, and a computer-readable storage medium, which are intended to solve at least one of the above technical problems in the prior art.
Fig. 1 shows a schematic flow diagram of a feature fusion method provided in an embodiment of the present disclosure, and as shown in fig. 1, the method mainly includes:
step S110: and acquiring a first input characteristic and a second input characteristic, wherein the correlation between the first input characteristic and the target analysis object and the correlation between the second input characteristic and the target analysis object both meet a preset correlation condition.
The target analysis object may be an object that needs to perform feature analysis according to the fused features obtained by the feature fusion processing.
The first input feature and the second input feature are both features related to the target analysis object, and the correlation between each of them and the target analysis object satisfies a preset correlation condition. The correlation condition can be set according to actual needs to ensure that the first input feature and the second input feature both have strong correlation with the target analysis object, so that the fused feature obtained by feature fusion of the first input feature and the second input feature is highly effective.
As an example, the target analysis object is a document image to be analyzed, the first input feature is a visual feature extracted based on the document image, and the second input feature is a text feature extracted based on a target text included in the document image. The relevance condition is that the visual feature and the text feature are extracted based on the same document image.
As an example, the target analysis object is a target video to be analyzed, and the first input feature and the second input feature are any one of multi-modal features extracted from the target video. The relevance condition is that the first input feature and the second input feature are extracted based on the same target video. Specifically, the multimodal features include video features, image features, audio features, text features, and the like extracted based on the target video.
As an example, the target analysis object is a target video to be analyzed, the first input feature is any one of multi-modal features extracted from the target video, the second input feature is other features associated with the target video, and the correlation condition is that both the first input feature and the second input feature are related to the target video. The multi-modal features comprise video features, image features, audio features, text features and the like extracted based on the target video. The other characteristics include user behavior characteristics of the associated users of the target video, user social characteristics of the associated users of the target video, advertisement characteristics of the target video, and the like. The associated users include, but are not limited to, uploading users, forwarding users, etc. of the target video.
Step S120: inputting the first input feature and the second input feature into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature;
wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork;
the first self-attention sub-network is used for carrying out self-attention processing on the first input feature to obtain a first self-attention feature;
the second self-attention sub-network is used for carrying out self-attention processing on the second input feature to obtain a second self-attention feature;
the first cross attention sub-network is used for carrying out cross attention processing on the basis of the first self-attention feature and a second attention weight parameter corresponding to the second self-attention sub-network to obtain a first cross attention feature;
the second cross attention subnetwork is used for carrying out cross attention processing on the basis of the second self attention feature and the first attention weight parameter corresponding to the first self attention subnetwork to obtain a second cross attention feature.
In the embodiment of the present disclosure, the feature fusion network is configured to fuse features of the first input feature and the second input feature to obtain a first cross attention feature and a second cross attention feature, where the first cross attention feature and the second cross attention feature both fuse characteristics of the first input feature and characteristics of the second input feature.
In the feature fusion network provided by the embodiment of the present disclosure, the first self-attention subnetwork performs self-attention processing on the first input feature to obtain a first self-attention feature, and the second self-attention subnetwork performs self-attention processing on the second input feature to obtain a second self-attention feature; then the first cross-attention subnetwork performs cross-attention processing based on the first self-attention feature and the second attention weight parameter corresponding to the second self-attention subnetwork to obtain a first cross-attention feature, and the second cross-attention subnetwork performs cross-attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention subnetwork to obtain a second cross-attention feature.
In implementations of the present disclosure, the second attention weight parameter is an attention weight parameter when the second self-attention subnetwork is configured to self-attention process the second input feature. Specifically, the second input feature may be converted to obtain a query feature, a key-value feature, and a content feature, so that the second attention weight parameter is learned according to the query feature, the key-value feature, and the content feature.
As an example, the attention weight parameter may be represented by the following formula one:
w = softmax(QK^T)    (formula one)
where w denotes the attention weight parameter, softmax is the normalized exponential function, Q denotes the query feature, and K denotes the key-value feature. In this example, Q and K are both converted from the second input feature.
In implementations of the present disclosure, the first attention weight parameter is an attention weight parameter when the first self-attention subnetwork is configured to self-attention process the first input feature. Specifically, the first input feature may be converted to obtain a query feature, a key-value feature, and a content feature, so that the first attention weight parameter is learned according to the query feature, the key-value feature, and the content feature.
As an example, the formula expression of the first attention weight parameter may refer to formula one described above, in which Q and K may both be converted from the first input feature.
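As an illustrative sketch (not the reference implementation of the present disclosure), the attention weight parameter of formula one could be computed as follows in Python; the projection names `to_q` and `to_k` and the tensor shapes are assumptions introduced only for the example.

```python
import torch
import torch.nn as nn

class AttentionWeight(nn.Module):
    """Computes w = softmax(QK^T) for one self-attention sub-network (formula one)."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)  # converts the input feature into the query feature Q
        self.to_k = nn.Linear(dim, dim)  # converts the input feature into the key-value feature K

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (sequence_length, dim); Q and K are both converted from the same input feature
        q, k = self.to_q(x), self.to_k(x)
        return torch.softmax(q @ k.transpose(-2, -1), dim=-1)  # w: (sequence_length, sequence_length)
```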
In the embodiment of the present disclosure, since the first cross attention feature is obtained by performing cross attention processing based on the first self-attention feature and the second attention weight parameter corresponding to the second self-attention subnetwork, the recombination of the first self-attention feature is guided by the second attention weight parameter of the second self-attention subnetwork, so that the first cross attention feature can fuse the characteristics of the second input feature on the basis of the characteristics of the first input feature.
In the embodiment of the present disclosure, since the second cross attention feature is obtained by performing cross attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention subnetwork, the recombination of the second self-attention feature is guided by the first attention weight parameter of the first self-attention subnetwork, so that the second cross attention feature can fuse the characteristics of the first input feature on the basis of the characteristics of the second input feature.
According to the method provided by the embodiment of the disclosure, a first input feature and a second input feature are acquired, where the correlation between the first input feature and the target analysis object and the correlation between the second input feature and the target analysis object both satisfy a preset correlation condition; the first input feature and the second input feature are input into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature; wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork; the first self-attention subnetwork is used for performing self-attention processing on the first input feature to obtain a first self-attention feature; the second self-attention subnetwork is used for performing self-attention processing on the second input feature to obtain a second self-attention feature; the first cross-attention subnetwork is used for performing cross-attention processing based on the first self-attention feature and the second attention weight parameter corresponding to the second self-attention subnetwork to obtain the first cross attention feature; the second cross-attention subnetwork is used for performing cross-attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention subnetwork to obtain the second cross attention feature. According to this scheme, feature fusion processing is performed on the first input feature and the second input feature through the feature fusion network, so that a first cross attention feature and a second cross attention feature fusing the characteristics of the first input feature and of the second input feature are obtained, which can improve the effect of feature fusion and thus the performance of the model.
In the embodiment of the disclosure, the first cross attention subnetwork and the second self attention subnetwork share the attention weight parameter, so that the self attention mechanism and the cross attention mechanism can be better combined, the feature fusion effect is improved, the number of model parameters in the feature fusion network can be reduced, and the complexity of model operation is reduced.
In the embodiment of the present disclosure, in order to improve the feature fusion effect, multiple layers of feature fusion networks may be set according to actual needs, and the feature fusion networks of the layers are connected in sequence.
In an alternative aspect of the disclosure, when performing the cross attention processing based on the first self-attention feature and the second attention weight parameter corresponding to the second self-attention sub-network to obtain the first cross attention feature, the first cross attention sub-network is specifically configured to:
converting the first self-attention feature into a first content feature;
performing cross attention processing on the basis of the first content feature and a second attention weight parameter corresponding to a second self-attention subnetwork to obtain a first cross attention feature;
the second cross attention subnetwork is specifically configured to, when performing cross attention processing based on the second self attention feature and the first attention weight parameter corresponding to the first self attention subnetwork to obtain the second cross attention feature:
converting the second self-attention feature into a second content feature;
and performing cross attention processing on the basis of the second content characteristic and the first attention weight parameter corresponding to the first self-attention subnetwork to obtain a second cross attention characteristic.
In an implementation of the present disclosure, when performing the cross attention processing based on the first cross attention subnetwork, the first self attention feature may be converted into a first content feature, and then the cross attention processing is performed based on the first content feature and a second attention weight parameter corresponding to the second self attention subnetwork, so as to obtain the first cross attention feature.
The second attention weight parameter is determined based on the query feature and the key-value feature converted from the second input feature, which is equivalent to guiding the recombination of the first content feature through the query feature and the key-value feature of the second self-attention sub-network, thereby effectively combining the second self-attention sub-network with the first cross-attention sub-network.
In this disclosure, when performing the cross attention processing based on the second cross attention subnetwork, the second self-attention feature may be converted into the second content feature, and then the cross attention processing is performed based on the second content feature and the first attention weight parameter corresponding to the first self-attention subnetwork, so as to obtain the second cross attention feature.
The first attention weight parameter is determined based on the query feature and the key-value feature converted from the first input feature, which is equivalent to guiding the recombination of the second content feature through the query feature and the key-value feature of the first self-attention subnetwork, so that the first self-attention subnetwork and the second cross-attention subnetwork can be effectively combined.
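A minimal sketch of this shared-weight cross attention, assuming the attention weight matrix of the opposite sub-network has already been computed as in formula one; the projection name `to_content` and the requirement that the weight matrix and the content feature have compatible shapes are assumptions of the example.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Re-weights one branch's content feature using the attention weights of the other branch."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_content = nn.Linear(dim, dim)  # converts the self-attention feature into a content feature

    def forward(self, self_attn_feature: torch.Tensor, other_weight: torch.Tensor) -> torch.Tensor:
        # self_attn_feature: (L, dim); other_weight: (L, L) attention weights borrowed from the other sub-network
        content = self.to_content(self_attn_feature)
        return other_weight @ content  # cross attention feature, (L, dim)
```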
In an optional mode of the present disclosure, the feature fusion network further includes a feature fusion layer, and after the first input feature and the second input feature are input into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature, the method further includes:
the first cross attention feature and the second cross attention feature are input to a feature fusion layer to determine a fused feature based on the first cross attention feature and the second cross attention feature.
In an embodiment of the present disclosure, the feature fusion network further includes a feature fusion layer, where the feature fusion layer interfaces the first cross attention subnetwork and the second cross attention subnetwork, and is configured to perform feature fusion on the first cross attention feature output by the first cross attention subnetwork and the second cross attention feature output by the second cross attention subnetwork to obtain the fused feature.
The post-fusion feature is obtained based on the first cross attention feature and the second cross attention feature, has the characteristics of the first cross attention feature and the second cross attention feature, and can be used for subsequent analysis processing.
In an optional aspect of the disclosure, determining the post-fusion feature based on the first cross attention feature and the second cross attention feature comprises:
and performing feature fusion on the first cross attention feature and the second cross attention feature to obtain a fused feature.
In the embodiment of the present disclosure, a specific processing manner of the feature fusion layer may be to directly perform feature fusion on the first cross attention feature and the second cross attention feature to obtain a fused feature.
As an example, the fused feature may be obtained by stitching the first cross attention feature with the second cross attention feature.
In an optional aspect of the present disclosure, determining a post-fusion feature based on the first cross attention feature and the second cross attention feature includes:
performing feature fusion on the first cross attention feature and the first input feature to obtain a first sub-fusion feature;
performing feature fusion on the second cross attention feature and the second input feature to obtain a second sub-fusion feature;
and performing feature fusion on the first sub-fusion feature and the second sub-fusion feature to obtain a fused feature.
In this embodiment of the disclosure, the specific processing manner of the feature fusion layer may further be to perform feature fusion on the first cross attention feature and the first input feature to obtain a first sub-fusion feature, perform feature fusion on the second cross attention feature and the second input feature to obtain a second sub-fusion feature, and perform feature fusion on the first sub-fusion feature and the second sub-fusion feature to obtain a fused feature.
Since the first cross-attention feature and the second cross-attention feature are both subjected to cross-attention processing, confusion may occur between the features, and the characteristics of the first input feature included in the first cross-attention feature and the characteristics of the second input feature included in the second cross-attention feature may be averaged. By performing feature fusion on the first cross attention feature and the first input feature, the first sub-fusion feature can fuse the characteristics of the second input feature on the basis of considering the characteristics of the first input feature. By performing feature fusion on the second cross attention feature and the second input feature, the second sub-fusion feature can fuse the characteristics of the first input feature on the basis of considering the characteristics of the second input feature. Therefore, the fused features obtained based on the fusion of the first sub-fusion features and the second sub-fusion features have better feature fusion effect.
In the embodiment of the disclosure, the cross features after the cross attention processing are fused with the original input features at the feature fusion layer, so that data transmission links in the feature fusion network can be increased, the complexity of the network is increased, and the feature fusion effect of the feature fusion network is improved.
As an example, the specific way of feature fusing the first cross attention feature with the first input feature may be to add a feature matrix corresponding to the first cross attention feature with the feature matrix corresponding to the first input feature, and the specific way of feature fusing the second cross attention feature with the second input feature may be to add a feature matrix corresponding to the second cross attention feature with the feature matrix corresponding to the second input feature.
As an example, the first sub-fusion feature and the second sub-fusion feature may be spliced to obtain the fused feature.
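A short sketch of this variant of the fusion layer, assuming each cross attention feature has the same shape as its original input feature; the matrix addition and the splicing dimension follow the examples above.

```python
import torch

def fuse(v_cross: torch.Tensor, v_in: torch.Tensor,
         t_cross: torch.Tensor, t_in: torch.Tensor) -> torch.Tensor:
    """Residual addition per branch, then splicing of the two sub-fusion features."""
    v_sub = v_cross + v_in                   # first sub-fusion feature (matrix addition)
    t_sub = t_cross + t_in                   # second sub-fusion feature (matrix addition)
    return torch.cat([v_sub, t_sub], dim=0)  # fused feature, spliced along the sequence dimension
```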
In an optional mode of the present disclosure, the target analysis object is a document image including a target text, the first input feature is a visual feature, the second input feature is a text feature, and the correlation condition is that both the visual feature and the text feature are extracted based on the document image.
In the embodiment of the present disclosure, the target analysis object is a document image including a target text, and the corresponding application scenario is to perform feature analysis on the document image according to the feature fusion feature, for example, classify the document image, perform layout analysis on the document image, and identify a named entity in the target text.
The correlation condition is that the visual features and the text features are extracted based on the document images, namely, the visual features and the text features extracted from the same document images are limited to be subjected to feature fusion. The visual features and the text features extracted from the same document image generally have a certain correlation, for example, a header area in the document image is one of the key features for distinguishing the classification of the document image, and the text content of the header can be mined based on the image features of the document image, so that the two have a certain correlation.
The text features and the visual features are fused through the feature fusion network provided by the scheme, a better feature fusion effect is achieved, the obtained fused features have a better effect on semantic expression, the problem of semantic confusion can be avoided, and the analysis effect on the document images is ensured.
In an optional mode of the present disclosure, the method further includes:
extracting image characteristics of the document image and position characteristics of pixel points in the document image;
and carrying out feature fusion on the image features and the position features to obtain the visual features.
In the embodiment of the disclosure, the image features can be extracted from the document image, the position features of the pixel points in the document image are extracted, and the image features and the position features are subjected to feature fusion to obtain the visual features. Because the position information in the document image has important significance for analyzing the document image, the position information in the document image is introduced equivalently by fusing the position characteristics in the visual characteristics, and the subsequent analysis effect on the document image is promoted.
As one example, the fusion of the image features and the position features may be implemented by incorporating the position features into the channel dimension of the image features.
In an optional aspect of the present disclosure, extracting an image feature of a document image includes:
and extracting image features of the document image based on a preset Convolutional Neural Network (CNN).
In the embodiment of the disclosure, a layer of convolutional neural network may be preset to extract the image features of the document image.
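As an example only (the layer sizes, padding, and position-encoding scheme below are assumptions, not details fixed by the disclosure), a single convolution layer can extract the image features and a per-pixel position value can then be folded into the channel dimension:

```python
import torch
import torch.nn as nn

def visual_feature(image: torch.Tensor, d: int = 64, r: int = 3) -> torch.Tensor:
    """Sketch: image (3, h, w) float tensor -> visual feature (d, h, w)."""
    conv = nn.Conv2d(3, d - 1, kernel_size=r, padding=r // 2)  # preset one-layer CNN with d-1 output channels
    f = conv(image.unsqueeze(0))                               # image features F: (1, d-1, h, w)
    _, _, h, w = f.shape
    ys = torch.arange(h).view(h, 1).expand(h, w)
    xs = torch.arange(w).view(1, w).expand(h, w)
    p = ((ys * w + xs).float() / (h * w)).view(1, 1, h, w)     # assumed per-pixel position feature P
    return torch.cat([f, p], dim=1).squeeze(0)                 # position folded into the channel dimension
```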
In an optional manner of this disclosure, the method further includes:
performing Optical Character Recognition (OCR) processing on the document image to obtain a target text;
and performing feature extraction on the target text to obtain text features.
In the embodiment of the disclosure, the target text can be extracted from the document image by means of OCR. Each character in the target text is then encoded by a word vector (Word2Vector) model to obtain word vectors, and a Bidirectional Encoder Representations from Transformers (BERT) model converts the word vectors to obtain the text features.
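A self-contained sketch of this text branch; the `embed` table stands in for the Word2Vector model, `encoder` stands in for BERT, and the mean pooling over tokens is an assumption made only so the example runs.

```python
import torch
import torch.nn as nn

d, vocab_size = 64, 30000
embed = nn.Embedding(vocab_size, d)  # stand-in for the Word2Vector word vectors
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), num_layers=2
)  # stand-in for the BERT encoder

def text_feature(token_ids_per_line):
    """Encodes each OCR'd text line into one d-dimensional vector; returns T of shape (n, d)."""
    line_vectors = []
    for ids in token_ids_per_line:                # ids: (num_tokens,) integer token ids for one line
        word_vectors = embed(ids).unsqueeze(0)    # (1, num_tokens, d) word vectors
        encoded = encoder(word_vectors)           # (1, num_tokens, d) contextual encoding
        line_vectors.append(encoded.mean(dim=1))  # pool to one vector per line (pooling scheme assumed)
    return torch.cat(line_vectors, dim=0)         # text feature T, (n, d)
```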
In an optional mode of the present disclosure, the feature fusion network is connected to a full connection layer, and the method further includes:
inputting the fused features into the full connection layer to obtain any one of the following output of the full connection layer:
a classification result of the document image;
a named entity recognition result in the target text;
and analyzing the layout of the document image.
In the embodiment of the disclosure, the feature fusion network is connected with the full connection layer, and the fused features are input to the full connection layer to obtain an output result of the full connection layer, that is, an analysis result of the document image.
Specifically, the output result of the fully connected layer may include any one of:
a classification result of the document image;
a named entity recognition result in the target text;
and analyzing the layout of the document image.
As an example, a classification method for document images provided by the embodiments of the present disclosure includes the following specific steps:
1. feature extraction
First, a document image I is obtained, where I ∈ R^(h×w×3); the document image is w pixels wide and h pixels high and may have three channels of red, green and blue (RGB).
The document image is processed by one layer of convolutional neural network. The convolution kernel of the convolutional neural network is K, K ∈ R^(r×r×3×(d-1)), where d-1 is the number of convolution kernels. The convolution yields the image feature F, F ∈ R^(h×w×(d-1)).
A position feature P of each pixel of the document image is then obtained, P ∈ R^(h×w×1), where P_(i,j), the position value for row i and column j of the image feature F, is computed from the coordinates of that pixel.
The image feature F is combined with the position feature P to obtain the complete visual feature V = {F, P}, where V ∈ R^(h×w×d).
Then, n lines of target text are extracted from the document image through OCR processing. Each word in the target text is encoded into a d-dimensional vector using a word2vector model. A BERT model is then used to encode the feature vector sequence of each line, giving the text feature T, T = {BERT(t_a), a ∈ [1, n]}, T ∈ R^(n×d), where BERT(·) denotes encoding by the BERT model and a indexes a line of the target text.
2. Feature fusion
The first two dimensions of the visual feature V are flattened to obtain the visual feature V', V' ∈ R^((h·w)×d), so that the visual feature and the text feature have the same feature dimensionality d, which facilitates the cross attention processing.
The visual feature V' and the text feature T are fused through the feature fusion network provided by the disclosure to obtain a first cross attention feature V_x and a second cross attention feature T_x.
The first cross attention feature V_x and the second cross attention feature T_x are spliced to obtain the fused feature M, M = {V_x, T_x}, M ∈ R^((h·w+n)×d).
The post-fusion feature I is written as I = concat(V, T), where concat denotes the concatenation operation of the feature fusion network.
3. Classification of document images
The feature fusion network is followed by a fully connected layer (FC), which maps the fused feature M onto a vector whose length equals the number of classes. This vector is mapped to a probability distribution scores by a softmax function, scores = softmax(FC(M)), where softmax denotes the softmax function and FC denotes the mapping performed by the fully connected layer. The category with the highest probability value may be taken as the predicted category cls of the document: cls = argmax(scores), where argmax returns the index at which the probability distribution reaches its maximum.
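A minimal sketch of this classification head; flattening M before the fully connected layer, and the values of `num_classes`, `d`, and `seq_len`, are assumptions of the example.

```python
import torch
import torch.nn as nn

num_classes, d, seq_len = 10, 64, 128
fc = nn.Linear(seq_len * d, num_classes)  # fully connected layer FC

def classify(m: torch.Tensor) -> int:
    """Maps the fused feature M to a class: scores = softmax(FC(M)), cls = argmax(scores)."""
    scores = torch.softmax(fc(m.reshape(-1)), dim=-1)  # probability distribution over classes
    return int(torch.argmax(scores))                   # predicted category cls
```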
As an example, fig. 2 shows a flowchart of a specific implementation of the feature fusion method provided in the embodiment of the present disclosure.
As shown in fig. 2, the visual features are converted into a query feature Vq, a key-value feature Vk, and a content feature Vv. Self-attention processing in the first self-attention subnetwork is performed based on the query feature Vq, the key-value feature Vk, and the content feature Vv to obtain a first self-attention feature Vn; the first self-attention weight parameter Mv is the attention weight parameter used during the self-attention processing in the first self-attention subnetwork and can be calculated based on the query feature Vq and the key-value feature Vk.
The text features are converted into query features Tq, key-value features Tk, and content features TT. And performing self-attention processing in the second self-attention subnetwork based on the query feature Tq, the key-value feature Tk and the content feature TT to obtain a second self-attention feature Tn, wherein the second self-attention weight parameter Mt is an attention weight parameter during self-attention processing in the second self-attention subnetwork, and the second self-attention weight parameter Mt can be calculated based on the query feature Tq and the key-value feature Tk.
The first cross attention feature Vo can be obtained by performing cross attention processing in the first cross attention sub-network based on the first self-attention feature Vn and the second self-attention weight parameter Mt.
The second cross attention feature To can be obtained by performing cross attention processing in the second cross attention sub-network based on the second self-attention feature Tn and the first self-attention weight parameter Mv.
And performing feature fusion on the first cross attention feature Vo and the visual feature to obtain a first sub-fusion feature Vf.
And performing feature fusion on the second cross attention feature To and the text feature To obtain a second sub-fusion feature Tf.
And fusing the first sub-fusion characteristic Vf and the second sub-fusion characteristic Tf to obtain a fused characteristic I.
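A toy end-to-end sketch of the flow in fig. 2 (not the implementation of the present disclosure): two self-attention branches exchange their attention weight matrices for cross attention, each branch is fused with its original input, and the results are spliced. For simplicity the sketch assumes both branches share the feature dimension and the same sequence length.

```python
import torch
import torch.nn as nn

class SharedWeightFusion(nn.Module):
    """Dual self-attention with exchanged attention weights, residual fusion, and splicing."""

    def __init__(self, dim: int):
        super().__init__()
        self.vq = nn.Linear(dim, dim)         # visual branch: query projection (Vq)
        self.vk = nn.Linear(dim, dim)         # visual branch: key-value projection (Vk)
        self.vv = nn.Linear(dim, dim)         # visual branch: content projection (Vv)
        self.tq = nn.Linear(dim, dim)         # text branch: query projection (Tq)
        self.tk = nn.Linear(dim, dim)         # text branch: key-value projection (Tk)
        self.tv = nn.Linear(dim, dim)         # text branch: content projection
        self.v_content = nn.Linear(dim, dim)  # converts Vn into a content feature for cross attention
        self.t_content = nn.Linear(dim, dim)  # converts Tn into a content feature for cross attention

    def forward(self, v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # v: visual feature V' (L, dim); t: text feature T (L, dim); equal L assumed for the sketch
        mv = torch.softmax(self.vq(v) @ self.vk(v).T, dim=-1)  # first self-attention weight parameter Mv
        mt = torch.softmax(self.tq(t) @ self.tk(t).T, dim=-1)  # second self-attention weight parameter Mt
        vn = mv @ self.vv(v)                                   # first self-attention feature Vn
        tn = mt @ self.tv(t)                                   # second self-attention feature Tn
        vo = mt @ self.v_content(vn)                           # first cross attention feature Vo (uses Mt)
        to = mv @ self.t_content(tn)                           # second cross attention feature To (uses Mv)
        vf, tf = vo + v, to + t                                # sub-fusion features Vf and Tf
        return torch.cat([vf, tf], dim=0)                      # fused feature I

# usage sketch: fusion = SharedWeightFusion(dim=64); i = fusion(torch.randn(50, 64), torch.randn(50, 64))
```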
Fig. 3 shows a flowchart of a document image classification method provided by an embodiment of the present disclosure, and as shown in fig. 3, the method may mainly include:
step S310: extracting image features of the document image and position features of pixel points in the document image, and performing feature fusion on the image features and the position features to obtain visual features;
step S320: performing OCR processing on the document image to obtain a target text, and performing feature extraction on the target text to obtain text features;
step S330: inputting the visual features and the text features into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature;
wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork;
the first self-attention subnetwork is used for performing self-attention processing on the visual features to obtain first self-attention features;
the second self-attention sub-network is used for performing self-attention processing on the text features to obtain second self-attention features;
the first cross attention sub-network is used for carrying out cross attention processing on the basis of the first self-attention feature and a second attention weight parameter corresponding to the second self-attention sub-network to obtain a first cross attention feature;
the second cross attention sub-network is used for carrying out cross attention processing on the basis of the second self-attention feature and the first attention weight parameter corresponding to the first self-attention sub-network to obtain a second cross attention feature;
step S340: performing feature fusion on the first cross attention feature and the visual feature to obtain a first sub-fusion feature;
step S350: performing feature fusion on the second cross attention feature and the text feature to obtain a second sub-fusion feature;
step S360: performing feature fusion on the first sub-fusion feature and the second sub-fusion feature to obtain a fused feature;
step S370: and inputting the fused features into the full-link layer to obtain a classification result of the full-link layer output document image.
In the embodiment of the disclosure, the visual features and the text features can be respectively extracted from the document image containing the target text, feature fusion is performed on the visual features and the text features based on the feature fusion network provided in the embodiment of the disclosure to obtain fused features, and the fused features fuse the characteristics of the first input features and the characteristics of the second input features, so that feature analysis can be performed on the document image according to the fused features, and the document image can be classified.
In the embodiment of the disclosure, a layer of convolutional neural network may be preset to extract the image features of the document image.
In the embodiment of the disclosure, the image features can be extracted from the document image, the position features of the pixel points in the document image are extracted, and the image features and the position features are subjected to feature fusion to obtain the visual features. Because the position information in the document image has important significance for analyzing the document image, the position information in the document image is introduced equivalently by fusing the position characteristics in the visual characteristics, and the subsequent analysis effect on the document image is promoted.
As one example, the fusion of the image features and the position features may be implemented by incorporating the position features into the channel dimension of the image features.
In the embodiment of the disclosure, the target text can be extracted from the document image by adopting an OCR mode. And then coding each character in the target text through a Word2Vector model to obtain a Word Vector, and then converting the Word Vector by adopting a BERT-based model to obtain text characteristics.
In the feature fusion network provided by the embodiment of the present disclosure, the first self-attention sub-network performs self-attention processing on the visual feature to obtain a first self-attention feature, and the second self-attention sub-network performs self-attention processing on the text feature to obtain a second self-attention feature; then the first cross-attention sub-network performs cross-attention processing based on the first self-attention feature and the second attention weight parameter corresponding to the second self-attention sub-network to obtain a first cross-attention feature, and the second cross-attention sub-network performs cross-attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention sub-network to obtain a second cross-attention feature.
In implementations of the present disclosure, the second attention weight parameter is the attention weight parameter when the second self-attention subnetwork is used to self-attention process the text feature. Specifically, the text feature may be converted to obtain the query feature, the key-value feature, and the content feature, so that the second attention weight parameter is learned according to the query feature, the key-value feature, and the content feature.
In implementations of the present disclosure, the first attention weight parameter is an attention weight parameter when the first self-attention subnetwork is configured to self-attention process the visual feature. Specifically, the visual features may be converted to obtain a query feature, a key-value feature, and a content feature, respectively, so that the first attention weight parameter may be learned according to the query feature, the key-value feature, and the content feature.
In the embodiment of the present disclosure, since the first cross attention feature is obtained by performing cross attention processing based on the first self-attention feature and the second attention weight parameter corresponding to the second self-attention subnetwork, the recombination of the first self-attention feature is guided by the second attention weight parameter of the second self-attention subnetwork, so that the first cross attention feature can fuse the characteristics of the text feature on the basis of having the characteristics of the visual feature.
In the embodiment of the present disclosure, since the second cross attention feature is obtained by performing cross attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention sub-network, the recombination of the second self-attention feature is guided by the first attention weight parameter of the first self-attention sub-network, so that the second cross attention feature can fuse the characteristics of the visual feature on the basis of having the characteristics of the text feature.
According to the method provided by the embodiment of the disclosure, visual features and text features are respectively extracted from the document image and input into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature; wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork; the first self-attention subnetwork is used for performing self-attention processing on the visual feature to obtain a first self-attention feature; the second self-attention subnetwork is used for performing self-attention processing on the text feature to obtain a second self-attention feature; the first cross-attention subnetwork is used for performing cross-attention processing based on the first self-attention feature and the second attention weight parameter corresponding to the second self-attention subnetwork to obtain the first cross attention feature; the second cross-attention subnetwork is used for performing cross-attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention subnetwork to obtain the second cross attention feature. According to this scheme, feature fusion processing is performed on the visual features and the text features through the feature fusion network, so that a first cross attention feature and a second cross attention feature fusing the characteristics of the visual features and of the text features are obtained, which can improve the feature fusion effect and thus the accuracy of the subsequent classification result of the document image.
In the embodiment of the disclosure, the first cross attention subnetwork and the second self attention subnetwork share the attention weight parameter, so that the self attention mechanism and the cross attention mechanism can be better combined, the feature fusion effect is improved, the number of model parameters in the feature fusion network can be reduced, and the complexity of model operation is reduced.
In the embodiment of the present disclosure, in order to improve the feature fusion effect, multiple layers of feature fusion networks may be set according to actual needs, and the feature fusion networks of the layers are connected in sequence.
In the embodiment of the present disclosure, the specific processing manner of the feature fusion layer may be to perform feature fusion on the first cross attention feature and the visual feature to obtain a first sub-fusion feature, perform feature fusion on the second cross attention feature and the text feature to obtain a second sub-fusion feature, and perform feature fusion on the first sub-fusion feature and the second sub-fusion feature to obtain a fused feature.
Since the first cross-attention feature and the second cross-attention feature are both subjected to cross-attention processing, confusion may occur between the features, and characteristics of the visual feature included in the first cross-attention feature and characteristics of the text feature included in the second cross-attention feature may be averaged. The first cross attention feature and the visual feature are subjected to feature fusion, so that the first sub-fusion feature can fuse the characteristics of the text feature on the basis of considering the characteristics of the visual feature. The second cross attention feature and the second input feature are subjected to feature fusion, so that the second sub-fusion feature can fuse the characteristics of the visual feature on the basis of considering the characteristics of the text feature. Therefore, the fused features obtained based on the fusion of the first sub-fusion features and the second sub-fusion features have better feature fusion effect.
In the embodiment of the disclosure, the cross features after the cross attention processing are fused with the original input features at the feature fusion layer, so that data transmission links in the feature fusion network can be increased, the complexity of the network is increased, and the feature fusion effect of the feature fusion network is improved.
As an example, the specific way of feature fusing the first cross attention feature and the visual feature may be to add a feature matrix corresponding to the first cross attention feature and a feature matrix corresponding to the visual feature, and the specific way of feature fusing the second cross attention feature and the text feature may be to add a feature matrix corresponding to the second cross attention feature and a feature matrix corresponding to the text feature.
As an example, the first sub-fusion feature and the second sub-fusion feature may be spliced to obtain the post-fusion feature.
In the embodiment of the disclosure, the feature fusion network is connected with the full connection layer, and the fused features are input into the full connection layer to obtain the classification result of the document image output by the full connection layer.
Based on the same principle as the method shown in fig. 1, fig. 4 shows a schematic structural diagram of a feature fusion apparatus provided by an embodiment of the present disclosure, and as shown in fig. 4, the feature fusion apparatus 30 may include:
a feature obtaining module 410, configured to obtain a first input feature and a second input feature, where the correlation between the first input feature and the target analysis object and the correlation between the second input feature and the target analysis object both satisfy a preset correlation condition;
the feature fusion module 420 is configured to input the first input feature and the second input feature into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature;
wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork;
the first self-attention sub-network is used for carrying out self-attention processing on the first input feature to obtain a first self-attention feature;
the second self-attention sub-network is used for carrying out self-attention processing on the second input feature to obtain a second self-attention feature;
the first cross attention sub-network is used for carrying out cross attention processing on the basis of the first self-attention feature and a second attention weight parameter corresponding to the second self-attention sub-network to obtain a first cross attention feature;
the second cross attention subnetwork is used for carrying out cross attention processing on the basis of the second self attention feature and the first attention weight parameter corresponding to the first self attention subnetwork to obtain a second cross attention feature.
According to the device provided by the embodiment of the disclosure, a first input feature and a second input feature are acquired, where the correlation between the first input feature and the target analysis object and the correlation between the second input feature and the target analysis object both satisfy a preset correlation condition; the first input feature and the second input feature are input into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature; wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork; the first self-attention subnetwork is used for performing self-attention processing on the first input feature to obtain a first self-attention feature; the second self-attention subnetwork is used for performing self-attention processing on the second input feature to obtain a second self-attention feature; the first cross attention subnetwork is used for performing cross attention processing based on the first self-attention feature and a second attention weight parameter corresponding to the second self-attention subnetwork to obtain the first cross attention feature; the second cross attention subnetwork is used for performing cross attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention subnetwork to obtain the second cross attention feature. According to the scheme, the first input feature and the second input feature are subjected to feature fusion processing based on the feature fusion network, so that a first cross attention feature and a second cross attention feature which fuse the characteristics of the first input feature and the characteristics of the second input feature are obtained, which improves the effect of feature fusion and thus the performance of the model.
Optionally, when the first cross attention subnetwork performs cross attention processing based on the first self attention feature and a second attention weight parameter corresponding to the second self attention subnetwork to obtain the first cross attention feature, the first cross attention subnetwork is specifically configured to:
converting the first self-attention feature into a first content feature;
and performing cross attention processing on the basis of the first content feature and a second attention weight parameter corresponding to the second self-attention subnetwork to obtain a first cross attention feature.
The second cross attention subnetwork is specifically configured to, when performing cross attention processing based on the second self attention feature and the first attention weight parameter corresponding to the first self attention subnetwork to obtain the second cross attention feature:
converting the second self-attention feature into a second content feature;
and performing cross attention processing on the basis of the second content characteristic and the first attention weight parameter corresponding to the first self-attention subnetwork to obtain a second cross attention characteristic.
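The sketch below illustrates one plausible reading of the sub-networks described above, treated here as an assumption rather than the disclosed implementation: the attention weight parameter of each self-attention sub-network is taken to be its softmax attention map, the content feature is a linear projection of the self-attention feature, and cross attention applies the other stream's attention map to this stream's content feature. Both streams are assumed to share the same sequence length and hidden size so the attention maps can be exchanged directly; all names and shapes are illustrative.

```python
import math
import torch
import torch.nn as nn

class DualStreamAttention(nn.Module):
    """Illustrative sketch of two self-attention sub-networks whose attention
    weight parameters are exchanged for cross attention. Projections, shapes
    and names are assumptions, not taken from the disclosure."""
    def __init__(self, hidden: int):
        super().__init__()
        self.hidden = hidden
        # Per-stream query/key/value projections used for self-attention.
        self.qkv_1 = nn.ModuleDict({k: nn.Linear(hidden, hidden) for k in ("q", "k", "v")})
        self.qkv_2 = nn.ModuleDict({k: nn.Linear(hidden, hidden) for k in ("q", "k", "v")})
        # Projections that convert a self-attention feature into a content feature.
        self.content_1 = nn.Linear(hidden, hidden)
        self.content_2 = nn.Linear(hidden, hidden)

    def _self_attention(self, x, qkv):
        q, k, v = qkv["q"](x), qkv["k"](x), qkv["v"](x)
        # Attention weight parameter of this self-attention sub-network.
        weights = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.hidden), dim=-1)
        return weights @ v, weights

    def forward(self, first_input, second_input):
        first_self, first_weights = self._self_attention(first_input, self.qkv_1)
        second_self, second_weights = self._self_attention(second_input, self.qkv_2)
        # Convert each self-attention feature into a content feature.
        first_content = self.content_1(first_self)
        second_content = self.content_2(second_self)
        # Cross attention: apply the other stream's attention weights to this
        # stream's content feature (sequence lengths assumed equal).
        first_cross = second_weights @ first_content
        second_cross = first_weights @ second_content
        return first_cross, second_cross
```

Under this reading, exchanging the attention maps lets each stream re-weight its own content according to what the other modality attends to, which is one way of realizing the cross attention exchange described above.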
Optionally, the feature fusion network further includes a feature fusion layer, and the apparatus further includes:
and the feature fusion layer processing module is configured to, after the first input feature and the second input feature are input into the preset feature fusion network to obtain the first cross attention feature and the second cross attention feature, input the first cross attention feature and the second cross attention feature into the feature fusion layer, so as to determine a fused feature based on the first cross attention feature and the second cross attention feature.
Optionally, when determining the fused feature based on the first cross attention feature and the second cross attention feature, the feature fusion layer processing module is specifically configured to:
and performing feature fusion on the first cross attention feature and the second cross attention feature to obtain a fused feature.
Optionally, when determining the fused feature based on the first cross attention feature and the second cross attention feature, the feature fusion layer processing module is specifically configured to:
performing feature fusion on the first cross attention feature and the first input feature to obtain a first sub-fusion feature;
performing feature fusion on the second cross attention feature and the second input feature to obtain a second sub-fusion feature;
and performing feature fusion on the first sub-fusion feature and the second sub-fusion feature to obtain a fused feature.
Optionally, the target analysis object is a document image including a target text, the first input feature is a visual feature, the second input feature is a text feature, and the correlation condition is that the visual feature and the text feature are extracted based on the document image.
Optionally, the apparatus further includes a visual feature extraction module, where the visual feature extraction module is configured to:
extracting image characteristics of the document image and position characteristics of pixel points in the document image;
and carrying out feature fusion on the image features and the position features to obtain the visual features.
Optionally, when the visual feature extraction module extracts the image feature of the document image, the visual feature extraction module is specifically configured to:
and extracting the image characteristics of the document image based on a preset convolutional neural network.
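A minimal sketch of such a visual feature extraction module follows, assuming a small convolutional backbone for the image features, a learned embedding of the flattened spatial positions standing in for the position features of pixel points, and fusion by addition; the architecture and dimensions are illustrative assumptions, not taken from the disclosure.

```python
import torch
import torch.nn as nn

class VisualFeatureExtractor(nn.Module):
    """Illustrative sketch: CNN image features plus position features, fused by addition.
    The backbone, position embedding scheme and sizes are hypothetical."""
    def __init__(self, hidden: int, max_positions: int = 4096):
        super().__init__()
        # A small convolutional neural network standing in for the preset backbone.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, hidden, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=2, padding=1),
        )
        # Position features for the flattened spatial locations.
        self.position = nn.Embedding(max_positions, hidden)

    def forward(self, image):                    # image: (batch, 3, H, W)
        feat = self.backbone(image)              # (batch, hidden, h, w)
        feat = feat.flatten(2).transpose(1, 2)   # (batch, h*w, hidden)
        pos_ids = torch.arange(feat.size(1), device=feat.device)
        # Feature fusion of the image features and the position features.
        return feat + self.position(pos_ids)     # visual features
```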
Optionally, the apparatus further includes a text feature extraction module, where the text feature extraction module is configured to:
performing OCR processing on the document image to obtain a target text;
and performing feature extraction on the target text to obtain text features.
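A hedged sketch of this text branch is given below; the disclosure does not name a specific OCR engine or text encoder, so ocr_engine, tokenizer and embedding are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

def extract_text_features(document_image, ocr_engine, tokenizer, embedding: nn.Embedding):
    """Illustrative sketch only: ocr_engine is any callable returning the recognized
    text, tokenizer returns a list of token ids, and embedding maps ids to vectors."""
    # OCR processing of the document image to obtain the target text.
    target_text = ocr_engine(document_image)
    # Feature extraction on the target text to obtain the text features.
    token_ids = torch.tensor([tokenizer(target_text)])   # (1, seq_len)
    text_features = embedding(token_ids)                  # (1, seq_len, hidden)
    return target_text, text_features
```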
Optionally, the feature fusion network is connected to a full connection layer, and the apparatus further includes:
the full connection layer processing module is used for inputting the fused features into the full connection layer to obtain any one of the following outputs of the full connection layer (an illustrative sketch of such task-specific heads follows this list):
a classification result of the document image;
a named entity recognition result in the target text;
a layout analysis result of the document image.
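As noted above, the following is an illustrative sketch of how a single full connection layer stage could expose these alternative outputs; the class counts, pooling choice and task switch are assumptions, not taken from the disclosure.

```python
import torch.nn as nn

class TaskHeads(nn.Module):
    """Illustrative sketch: one full connection head per downstream task.
    All sizes and the task selection mechanism are hypothetical."""
    def __init__(self, fused_dim: int, num_doc_classes: int, num_entity_tags: int, num_layout_labels: int):
        super().__init__()
        self.doc_cls = nn.Linear(fused_dim, num_doc_classes)   # document image classification
        self.ner = nn.Linear(fused_dim, num_entity_tags)       # named entity recognition in the target text
        self.layout = nn.Linear(fused_dim, num_layout_labels)  # layout analysis of the document image

    def forward(self, fused, task: str):
        if task == "classification":
            return self.doc_cls(fused.mean(dim=1))  # pool the fused sequence, then classify
        if task == "ner":
            return self.ner(fused)                  # per-token entity tag logits
        return self.layout(fused)                   # per-token layout label logits
```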
It is understood that the above modules of the feature fusion apparatus in the embodiment of the present disclosure have the functions of implementing the corresponding steps of the feature fusion method in the embodiment shown in fig. 1. These functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above. The modules may be software and/or hardware, and each module may be implemented independently or by integrating a plurality of modules. For the functional description of each module of the above feature fusion apparatus, reference may be made to the corresponding description of the feature fusion method in the embodiment shown in fig. 1, and details are not repeated here.
Based on the same principle as the method shown in fig. 3, fig. 5 shows a schematic structural diagram of a document image classification device provided by the embodiment of the disclosure, and as shown in fig. 5, the document image classification device 50 may include:
the feature extraction module 510 is configured to extract image features of the document image and position features of pixel points in the document image, and perform feature fusion on the image features and the position features to obtain visual features; and to perform OCR processing on the document image to obtain a target text, and perform feature extraction on the target text to obtain text features;
a feature fusion module 520, configured to input the visual features and the text features into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature;
wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork;
the first self-attention sub-network is used for performing self-attention processing on the visual features to obtain first self-attention features;
the second self-attention sub-network is used for performing self-attention processing on the text features to obtain second self-attention features;
the first cross attention sub-network is used for carrying out cross attention processing on the basis of the first self-attention feature and a second attention weight parameter corresponding to the second self-attention sub-network to obtain a first cross attention feature;
the second cross attention sub-network is used for carrying out cross attention processing on the basis of the second self-attention feature and the first attention weight parameter corresponding to the first self-attention sub-network to obtain a second cross attention feature;
the feature fusion layer processing module 530 is configured to perform feature fusion on the first cross attention feature and the visual feature to obtain a first sub-fusion feature, perform feature fusion on the second cross attention feature and the text feature to obtain a second sub-fusion feature, and perform feature fusion on the first sub-fusion feature and the second sub-fusion feature to obtain a fused feature;
and the full connection layer processing module 540 is configured to input the fused features into the full connection layer to obtain the classification result of the document image output by the full connection layer.
According to the device provided by the embodiment of the disclosure, visual features and text features are respectively extracted from the document image and input into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature; wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork; the first self-attention subnetwork is used for performing self-attention processing on the visual features to obtain a first self-attention feature; the second self-attention subnetwork is used for performing self-attention processing on the text features to obtain a second self-attention feature; the first cross attention subnetwork is used for performing cross attention processing based on the first self-attention feature and a second attention weight parameter corresponding to the second self-attention subnetwork to obtain the first cross attention feature; the second cross attention subnetwork is used for performing cross attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention subnetwork to obtain the second cross attention feature. In this scheme, the visual features and the text features are subjected to feature fusion processing based on the feature fusion network, so that a first cross attention feature and a second cross attention feature which fuse the characteristics of the visual features and the characteristics of the text features are obtained, which improves the feature fusion effect and thus the accuracy of the subsequent classification result of the document image.
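Putting the pieces together, the following hypothetical usage example chains the sketches given earlier in this description (VisualFeatureExtractor, DualStreamAttention, FusionAndClassify) into a document image classification forward pass; the text features are stubbed with random tensors because the OCR and text encoding steps are not reproduced here, and all sizes are illustrative.

```python
import torch

# Hypothetical end-to-end use of the earlier sketches; names and sizes are assumptions.
hidden, num_classes = 256, 16
visual_extractor = VisualFeatureExtractor(hidden)
attention = DualStreamAttention(hidden)
fusion_head = FusionAndClassify(hidden, num_classes)

image = torch.randn(1, 3, 256, 256)                  # a document image tensor
visual = visual_extractor(image)                     # visual features, (1, 1024, hidden)
text = torch.randn(1, visual.size(1), hidden)        # stand-in text features (equal length assumed)
first_cross, second_cross = attention(visual, text)  # first and second cross attention features
logits = fusion_head(first_cross, visual, second_cross, text)
predicted_class = logits.argmax(dim=-1)              # classification result of the document image
```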
It is understood that the above modules of the document image classification device in the embodiment of the present disclosure have the functions of implementing the corresponding steps of the document image classification method in the embodiment shown in fig. 3. These functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above. The modules may be software and/or hardware, and each module may be implemented independently or by integrating a plurality of modules. For the functional description of each module of the above document image classification device, reference may be made to the corresponding description of the document image classification method in the embodiment shown in fig. 3, and details are not repeated here.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the users involved all comply with the provisions of relevant laws and regulations, and do not violate public order and good morals.
The present disclosure also provides an electronic device and a readable storage medium according to an embodiment of the present disclosure.
The electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the feature fusion method as provided by the embodiments of the present disclosure.
Compared with the prior art, the electronic device acquires a first input feature and a second input feature, where the correlation between the first input feature and the target analysis object and the correlation between the second input feature and the target analysis object both satisfy the preset correlation condition; inputs the first input feature and the second input feature into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature; wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork; the first self-attention subnetwork is used for performing self-attention processing on the first input feature to obtain a first self-attention feature; the second self-attention subnetwork is used for performing self-attention processing on the second input feature to obtain a second self-attention feature; the first cross attention subnetwork is used for performing cross attention processing based on the first self-attention feature and a second attention weight parameter corresponding to the second self-attention subnetwork to obtain the first cross attention feature; the second cross attention subnetwork is used for performing cross attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention subnetwork to obtain the second cross attention feature. According to the scheme, the first input feature and the second input feature are subjected to feature fusion processing based on the feature fusion network, so that a first cross attention feature and a second cross attention feature which fuse the characteristics of the first input feature and the characteristics of the second input feature are obtained, which improves the effect of feature fusion and thus the performance of the model.
The readable storage medium is a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a feature fusion method as provided by an embodiment of the present disclosure.
Compared with the prior art, with the readable storage medium, a first input feature and a second input feature are acquired, where the correlation between the first input feature and the target analysis object and the correlation between the second input feature and the target analysis object both meet the preset correlation condition; the first input feature and the second input feature are input into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature; wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork; the first self-attention subnetwork is used for performing self-attention processing on the first input feature to obtain a first self-attention feature; the second self-attention subnetwork is used for performing self-attention processing on the second input feature to obtain a second self-attention feature; the first cross attention subnetwork is used for performing cross attention processing based on the first self-attention feature and a second attention weight parameter corresponding to the second self-attention subnetwork to obtain the first cross attention feature; the second cross attention subnetwork is used for performing cross attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention subnetwork to obtain the second cross attention feature. In this scheme, feature fusion processing is performed on the first input feature and the second input feature based on the feature fusion network, so that a first cross attention feature and a second cross attention feature which fuse the characteristics of the first input feature and the characteristics of the second input feature are obtained, which improves the effect of feature fusion and thus the performance of the model.
FIG. 6 illustrates a schematic block diagram of an example electronic device 60 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the electronic device 60 includes a computing unit 610 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 620 or a computer program loaded from a storage unit 680 into a Random Access Memory (RAM) 630. In the RAM 630, various programs and data required for the operation of the device 60 can also be stored. The computing unit 610, the ROM 620, and the RAM 630 are connected to each other by a bus 640. An input/output (I/O) interface 650 is also connected to the bus 640.
Various components in device 60 are connected to I/O interface 650, including: an input unit 660 such as a keyboard, a mouse, etc.; an output unit 670 such as various types of displays, speakers, and the like; a storage unit 680, such as a magnetic disk, optical disk, or the like; and a communication unit 690 such as a network card, modem, wireless communication transceiver, etc. The communication unit 690 allows the device 60 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Computing unit 610 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 610 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 610 performs the feature fusion method provided in the embodiments of the present disclosure. For example, in some embodiments, the feature fusion method provided in the embodiments of the present disclosure may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 680. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 60 via the ROM 620 and/or the communication unit 690. When the computer program is loaded into the RAM 630 and executed by the computing unit 610, one or more steps of the feature fusion method provided in the embodiments of the present disclosure may be performed. Alternatively, in other embodiments, the computing unit 610 may be configured in any other suitable manner (e.g., by means of firmware) to perform the feature fusion method provided in the embodiments of the present disclosure.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. A method of feature fusion, comprising:
acquiring a first input feature and a second input feature, wherein the correlation between the first input feature and a target analysis object and the correlation between the second input feature and the target analysis object both meet a preset correlation condition;
inputting the first input feature and the second input feature into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature;
wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork;
the first self-attention sub-network is used for carrying out self-attention processing on the first input feature to obtain a first self-attention feature;
the second self-attention sub-network is used for performing self-attention processing on the second input feature to obtain a second self-attention feature;
the first cross attention subnetwork is used for performing cross attention processing based on the first self attention feature and a second attention weight parameter corresponding to the second self attention subnetwork to obtain a first cross attention feature;
the second cross attention subnetwork is used for performing cross attention processing based on the second self attention feature and the first attention weight parameter corresponding to the first self attention subnetwork to obtain a second cross attention feature.
2. The method according to claim 1, wherein the first cross attention subnetwork, when performing cross attention processing based on the first self attention feature and a second attention weight parameter corresponding to the second self attention subnetwork to obtain a first cross attention feature, is specifically configured to:
converting the first self-attention feature into a first content feature;
performing cross attention processing on the basis of the first content feature and a second attention weight parameter corresponding to the second self-attention subnetwork to obtain a first cross attention feature;
the second cross attention subnetwork is specifically configured to, when performing cross attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention subnetwork to obtain a second cross attention feature:
converting the second self-attention feature into a second content feature;
and performing cross attention processing on the basis of the second content feature and a first attention weight parameter corresponding to the first self-attention subnetwork to obtain a second cross attention feature.
3. The method according to claim 1 or 2, wherein the feature fusion network further comprises a feature fusion layer, and after the inputting the first input feature and the second input feature into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature, the method further comprises:
inputting the first cross attention feature and the second cross attention feature to the feature fusion layer to determine a post-fusion feature based on the first cross attention feature and the second cross attention feature.
4. The method of claim 3, wherein the determining a post-fusion feature based on the first cross-attention feature and the second cross-attention feature comprises:
and performing feature fusion on the first cross attention feature and the second cross attention feature to obtain a fused feature.
5. The method of claim 3, wherein the determining a post-fusion feature based on the first cross-attention feature and the second cross-attention feature comprises:
performing feature fusion on the first cross attention feature and the first input feature to obtain a first sub-fusion feature;
performing feature fusion on the second cross attention feature and the second input feature to obtain a second sub-fusion feature;
and performing feature fusion on the first sub-fusion feature and the second sub-fusion feature to obtain a fused feature.
6. The method according to any one of claims 3 to 5, wherein the target analysis object is a document image containing target text, the first input feature is a visual feature, the second input feature is a text feature, and the relevance condition is that the visual feature and the text feature are both extracted based on the document image.
7. The method of claim 6, further comprising:
extracting image features of the document image and position features of pixel points in the document image;
and performing feature fusion on the image features and the position features to obtain the visual features.
8. The method of claim 7, wherein said extracting image features of the document image comprises:
and extracting the image characteristics of the document image based on a preset convolutional neural network.
9. The method according to any one of claims 6-8, further comprising:
performing Optical Character Recognition (OCR) processing on the document image to obtain the target text;
and extracting the features of the target text to obtain the text features.
10. The method of any of claims 6-9, wherein the feature fusion network has a fully connected layer connected, the method further comprising:
inputting the fused features into the full connection layer to obtain any one of the following outputs of the full connection layer:
a classification result of the document image;
a named entity recognition result in the target text;
a layout analysis result of the document image.
11. A feature fusion apparatus comprising:
the feature acquisition module is used for acquiring a first input feature and a second input feature, wherein the correlation between the first input feature and a target analysis object and the correlation between the second input feature and the target analysis object both meet a preset correlation condition;
the feature fusion module is used for inputting the first input feature and the second input feature into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature;
wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork;
the first self-attention sub-network is used for carrying out self-attention processing on the first input feature to obtain a first self-attention feature;
the second self-attention sub-network is used for performing self-attention processing on the second input feature to obtain a second self-attention feature;
the first cross attention sub-network is used for performing cross attention processing based on the first self-attention feature and a second attention weight parameter corresponding to the second self-attention sub-network to obtain a first cross attention feature;
the second cross attention subnetwork is used for performing cross attention processing based on the second self attention feature and the first attention weight parameter corresponding to the first self attention subnetwork to obtain a second cross attention feature.
12. The apparatus of claim 11, wherein the first cross attention subnetwork, when performing cross attention processing based on the first self-attention feature and a second attention weight parameter corresponding to the second self-attention subnetwork to obtain a first cross attention feature, is specifically configured to:
converting the first self-attention feature into a first content feature;
performing cross attention processing on the basis of the first content feature and a second attention weight parameter corresponding to the second self-attention subnetwork to obtain a first cross attention feature;
the second cross attention subnetwork is specifically configured to, when performing cross attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention subnetwork to obtain a second cross attention feature:
converting the second self-attention feature into a second content feature;
and performing cross attention processing on the basis of the second content feature and a first attention weight parameter corresponding to the first self-attention subnetwork to obtain a second cross attention feature.
13. The apparatus of claim 11 or 12, wherein the feature fusion network further comprises a feature fusion layer, the apparatus further comprising:
a feature fusion layer processing module, configured to, after the first input feature and the second input feature are input into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature, input the first cross attention feature and the second cross attention feature into the feature fusion layer, so as to determine a fused feature based on the first cross attention feature and the second cross attention feature.
14. The apparatus of claim 13, wherein the feature fusion layer processing module, when determining the post-fusion feature based on the first cross attention feature and the second cross attention feature, is specifically to:
and performing feature fusion on the first cross attention feature and the second cross attention feature to obtain a fused feature.
15. The apparatus of claim 13, wherein the feature fusion layer processing module, when determining the post-fusion feature based on the first cross attention feature and the second cross attention feature, is specifically to:
performing feature fusion on the first cross attention feature and the first input feature to obtain a first sub-fusion feature;
performing feature fusion on the second cross attention feature and the second input feature to obtain a second sub-fusion feature;
and performing feature fusion on the first sub-fusion feature and the second sub-fusion feature to obtain a fused feature.
16. The apparatus according to any one of claims 13-15, wherein the target analysis object is a document image containing target text, the first input feature is a visual feature, the second input feature is a textual feature, and the relevance condition is that the visual feature and the textual feature are both extracted based on the document image.
17. The apparatus of claim 16, further comprising a visual feature extraction module to:
extracting image features of the document image and position features of pixel points in the document image;
and performing feature fusion on the image features and the position features to obtain the visual features.
18. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
19. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-10.
CN202211304730.1A (priority date 2022-10-24, filing date 2022-10-24): Feature fusion method and device, electronic equipment and computer readable storage medium. Status: Pending. Publication: CN115601620A (en).

Priority Applications (1)

Application Number: CN202211304730.1A; Priority Date: 2022-10-24; Filing Date: 2022-10-24; Title: Feature fusion method and device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number: CN202211304730.1A; Priority Date: 2022-10-24; Filing Date: 2022-10-24; Title: Feature fusion method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number: CN115601620A; Publication Date: 2023-01-13

Family ID: 84848774

Family Applications (1)

Application Number: CN202211304730.1A; Status: Pending; Title: Feature fusion method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country: CN; Publication: CN115601620A (en)

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination