CN115601620A - Feature fusion method and device, electronic equipment and computer readable storage medium - Google Patents

Feature fusion method and device, electronic equipment and computer readable storage medium

Info

Publication number
CN115601620A
Authority
CN
China
Prior art keywords
feature
attention
cross
self
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211304730.1A
Other languages
Chinese (zh)
Inventor
李煜林
钦夏孟
章成全
姚锟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211304730.1A priority Critical patent/CN115601620A/en
Publication of CN115601620A publication Critical patent/CN115601620A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a feature fusion method and apparatus, an electronic device, and a computer-readable storage medium, relating to the technical field of artificial intelligence, in particular to deep learning, image processing, large models, and computer vision, and applicable to scenes such as optical character recognition. The specific implementation scheme is as follows: a first input feature and a second input feature are acquired, where the correlation between the first input feature and a target analysis object and the correlation between the second input feature and the target analysis object both satisfy a preset correlation condition; the first input feature and the second input feature are input into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature. The feature fusion network of this scheme performs feature fusion processing on the first input feature and the second input feature to obtain a first cross attention feature and a second cross attention feature that fuse the characteristics of both input features, which can improve the effect of feature fusion.

Description

Feature fusion method and device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to the technical field of artificial intelligence, and in particular, to the technical field of deep learning, image processing, large models, and computer vision, which can be applied to scenes such as Optical Character Recognition (OCR), and in particular, to a feature fusion method, an apparatus, an electronic device, and a computer-readable storage medium.
Background
Feature fusion refers to fusing different features into a new feature, so that the characteristics of the different features are better utilized to improve the performance of the model.
The effect of feature fusion strongly affects the performance of the model, so how to improve the effect of feature fusion when combining different features into a new feature, in order to ensure model performance, has become an important technical problem in the related field.
Disclosure of Invention
In order to solve at least one of the above drawbacks, the present disclosure provides a feature fusion method, an apparatus, an electronic device, and a computer-readable storage medium.
According to a first aspect of the present disclosure, there is provided a feature fusion method, the method comprising:
acquiring a first input feature and a second input feature, wherein the correlation between the first input feature and the target analysis object and the correlation between the second input feature and the target analysis object both meet a preset correlation condition;
inputting the first input feature and the second input feature into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature;
wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork;
the first self-attention subnetwork is used for performing self-attention processing on the first input feature to obtain a first self-attention feature;
the second self-attention sub-network is used for carrying out self-attention processing on the second input feature to obtain a second self-attention feature;
the first cross attention sub-network is used for carrying out cross attention processing on the basis of the first self-attention feature and a second attention weight parameter corresponding to the second self-attention sub-network to obtain a first cross attention feature;
the second cross attention subnetwork is used for carrying out cross attention processing on the basis of the second self attention feature and the first attention weight parameter corresponding to the first self attention subnetwork to obtain a second cross attention feature.
According to a second aspect of the present disclosure, there is provided a feature fusion apparatus, the apparatus comprising:
the feature acquisition module is used for acquiring a first input feature and a second input feature, wherein the correlation between the first input feature and the target analysis object and the correlation between the second input feature and the target analysis object both meet a preset correlation condition;
the feature fusion module is used for inputting the first input feature and the second input feature into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature;
wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork;
the first self-attention subnetwork is used for performing self-attention processing on the first input feature to obtain a first self-attention feature;
the second self-attention sub-network is used for carrying out self-attention processing on the second input feature to obtain a second self-attention feature;
the first cross attention sub-network is used for carrying out cross attention processing on the basis of the first self-attention feature and a second attention weight parameter corresponding to the second self-attention sub-network to obtain a first cross attention feature;
the second cross attention subnetwork is used for performing cross attention processing based on the second self attention feature and the first attention weight parameter corresponding to the first self attention subnetwork to obtain a second cross attention feature.
According to a third aspect of the present disclosure, there is provided an electronic apparatus comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the feature fusion method.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the above-described feature fusion method.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flow chart diagram of a feature fusion method provided in an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart diagram illustrating one embodiment of a feature fusion method provided by the embodiments of the present disclosure;
FIG. 3 is a flowchart illustrating a document image classification method according to an embodiment of the disclosure;
FIG. 4 is a schematic structural diagram of a feature fusion apparatus provided in an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a document image classification device according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of an electronic device for implementing the feature fusion method of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
When different features are fused into a new feature for model processing, the effect of feature fusion has an important influence on the model performance, and therefore how to improve the effect of feature fusion becomes an important technical problem in the related field.
When classifying document images, the related art generally acquires image features and text features of the document image separately and then determines a document classification result based on the image features and the text features, for example by obtaining one classification result from the image features and another from the text features and then determining a final classification result from the two classification results based on rules such as voting. This approach does not take the correlation between features of different modalities into account, resulting in a final classification result that is not accurate enough.
Some feature fusion methods also exist in the related art, which are used for performing feature fusion on image features and text features, but the effect of feature fusion through these feature fusion methods may be poor, and the accuracy of classification results may also be affected. For example, a new feature obtained by fusing an image feature and a text feature may have a poor effect on semantic expression, and may have a semantic confusion problem, which affects the accuracy of a classification result.
The embodiment of the present disclosure provides a feature fusion method, a feature fusion device, an electronic device, and a computer-readable storage medium, which are intended to solve at least one of the above technical problems in the prior art.
Fig. 1 shows a schematic flow diagram of a feature fusion method provided in an embodiment of the present disclosure, and as shown in fig. 1, the method mainly includes:
step S110: and acquiring a first input characteristic and a second input characteristic, wherein the correlation between the first input characteristic and the target analysis object and the correlation between the second input characteristic and the target analysis object both meet a preset correlation condition.
The target analysis object may be an object that needs to perform feature analysis according to the fused features obtained by the feature fusion processing.
The first input feature and the second input feature are both features related to the target analysis object, and the correlation between each of them and the target analysis object satisfies a preset correlation condition. The correlation condition can be set according to actual needs to ensure that the first input feature and the second input feature both have strong correlation with the target analysis object, so that the fused feature obtained by feature fusion of the first input feature and the second input feature is highly effective.
As an example, the target analysis object is a document image to be analyzed, the first input feature is a visual feature extracted based on the document image, and the second input feature is a text feature extracted based on a target text included in the document image. The relevance condition is that the visual feature and the text feature are extracted based on the same document image.
As an example, the target analysis object is a target video to be analyzed, and the first input feature and the second input feature are any one of multi-modal features extracted from the target video. The relevance condition is that the first input feature and the second input feature are extracted based on the same target video. Specifically, the multimodal features include video features, image features, audio features, text features, and the like extracted based on the target video.
As an example, the target analysis object is a target video to be analyzed, the first input feature is any one of multi-modal features extracted from the target video, the second input feature is other features associated with the target video, and the correlation condition is that both the first input feature and the second input feature are related to the target video. The multi-modal features comprise video features, image features, audio features, text features and the like extracted based on the target video. The other characteristics include user behavior characteristics of the associated users of the target video, user social characteristics of the associated users of the target video, advertisement characteristics of the target video, and the like. The associated users include, but are not limited to, uploading users, forwarding users, etc. of the target video.
Step S120: inputting the first input feature and the second input feature into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature;
wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork;
the first self-attention sub-network is used for carrying out self-attention processing on the first input feature to obtain a first self-attention feature;
the second self-attention sub-network is used for carrying out self-attention processing on the second input feature to obtain a second self-attention feature;
the first cross attention sub-network is used for carrying out cross attention processing on the basis of the first self-attention feature and a second attention weight parameter corresponding to the second self-attention sub-network to obtain a first cross attention feature;
the second cross attention subnetwork is used for carrying out cross attention processing on the basis of the second self attention feature and the first attention weight parameter corresponding to the first self attention subnetwork to obtain a second cross attention feature.
In the embodiment of the present disclosure, the feature fusion network is configured to fuse features of the first input feature and the second input feature to obtain a first cross attention feature and a second cross attention feature, where the first cross attention feature and the second cross attention feature both fuse characteristics of the first input feature and characteristics of the second input feature.
In the feature fusion network provided by the embodiment of the present disclosure, the first self-attention subnetwork performs self-attention processing on the first input feature to obtain a first self-attention feature, and the second self-attention subnetwork performs self-attention processing on the second input feature to obtain a second self-attention feature; then the first cross-attention subnetwork performs cross-attention processing based on the first self-attention feature and the second attention weight parameter corresponding to the second self-attention subnetwork to obtain a first cross-attention feature, and the second cross-attention subnetwork performs cross-attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention subnetwork to obtain a second cross-attention feature.
In implementations of the present disclosure, the second attention weight parameter is an attention weight parameter when the second self-attention subnetwork is configured to self-attention process the second input feature. Specifically, the second input feature may be converted to obtain a query feature, a key-value feature, and a content feature, so that the second attention weight parameter is learned according to the query feature, the key-value feature, and the content feature.
As an example, the attention weight parameter may be represented by the following formula one:
w = softmax(QK^T)    (formula one)
where w denotes the attention weight parameter, softmax is the normalized exponential function, Q denotes the query feature, and K denotes the key-value feature. In this example, Q and K are both converted from the second input feature.
In implementations of the present disclosure, the first attention weight parameter is an attention weight parameter when the first self-attention subnetwork is configured to self-attention process the first input feature. Specifically, the first input feature may be converted to obtain a query feature, a key-value feature, and a content feature, so that the first attention weight parameter is learned according to the query feature, the key-value feature, and the content feature.
As an example, the formula expression of the first attention weight parameter may refer to formula one described above, in which Q and K may both be converted from the first input feature.
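As an illustrative sketch (not the reference implementation of the present disclosure), the attention weight parameter of formula one could be computed as follows in Python; the projection names `to_q` and `to_k` and the tensor shapes are assumptions introduced only for the example.

```python
import torch
import torch.nn as nn

class AttentionWeight(nn.Module):
    """Computes w = softmax(QK^T) for one self-attention sub-network (formula one)."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)  # converts the input feature into the query feature Q
        self.to_k = nn.Linear(dim, dim)  # converts the input feature into the key-value feature K

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (sequence_length, dim); Q and K are both converted from the same input feature
        q, k = self.to_q(x), self.to_k(x)
        return torch.softmax(q @ k.transpose(-2, -1), dim=-1)  # w: (sequence_length, sequence_length)
```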
In the embodiment of the present disclosure, since the first cross attention feature is obtained by performing cross attention processing based on the first self-attention feature and the second attention weight parameter corresponding to the second self-attention subnetwork, the recombination of the first self-attention feature is guided by the second attention weight parameter of the second self-attention subnetwork, so that the first cross attention feature can fuse the characteristics of the second input feature on the basis of the characteristics of the first input feature.
In the embodiment of the present disclosure, since the second cross attention feature is obtained by performing cross attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention subnetwork, the recombination of the second self-attention feature is guided by the first attention weight parameter of the first self-attention subnetwork, so that the second cross attention feature can fuse the characteristics of the first input feature on the basis of the characteristics of the second input feature.
According to the method provided by the embodiment of the disclosure, a first input feature and a second input feature are acquired, where the correlation between the first input feature and the target analysis object and the correlation between the second input feature and the target analysis object both satisfy a preset correlation condition; the first input feature and the second input feature are input into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature; wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork; the first self-attention subnetwork is used for performing self-attention processing on the first input feature to obtain a first self-attention feature; the second self-attention subnetwork is used for performing self-attention processing on the second input feature to obtain a second self-attention feature; the first cross-attention subnetwork is used for performing cross-attention processing based on the first self-attention feature and the second attention weight parameter corresponding to the second self-attention subnetwork to obtain the first cross attention feature; the second cross-attention subnetwork is used for performing cross-attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention subnetwork to obtain the second cross attention feature. According to this scheme, feature fusion processing is performed on the first input feature and the second input feature through the feature fusion network, so that a first cross attention feature and a second cross attention feature fusing the characteristics of the first input feature and of the second input feature are obtained, which can improve the effect of feature fusion and thus the performance of the model.
In the embodiment of the disclosure, the first cross attention subnetwork and the second self attention subnetwork share the attention weight parameter, so that the self attention mechanism and the cross attention mechanism can be better combined, the feature fusion effect is improved, the number of model parameters in the feature fusion network can be reduced, and the complexity of model operation is reduced.
In the embodiment of the present disclosure, in order to improve the feature fusion effect, multiple layers of feature fusion networks may be set according to actual needs, and the feature fusion networks of the layers are connected in sequence.
In an alternative aspect of the disclosure, when performing the cross attention processing based on the first self-attention feature and the second attention weight parameter corresponding to the second self-attention sub-network to obtain the first cross attention feature, the first cross attention sub-network is specifically configured to:
converting the first self-attention feature into a first content feature;
performing cross attention processing on the basis of the first content feature and a second attention weight parameter corresponding to a second self-attention subnetwork to obtain a first cross attention feature;
the second cross attention subnetwork is specifically configured to, when performing cross attention processing based on the second self attention feature and the first attention weight parameter corresponding to the first self attention subnetwork to obtain the second cross attention feature:
converting the second self-attention feature into a second content feature;
and performing cross attention processing on the basis of the second content characteristic and the first attention weight parameter corresponding to the first self-attention subnetwork to obtain a second cross attention characteristic.
In an implementation of the present disclosure, when performing the cross attention processing based on the first cross attention subnetwork, the first self attention feature may be converted into a first content feature, and then the cross attention processing is performed based on the first content feature and a second attention weight parameter corresponding to the second self attention subnetwork, so as to obtain the first cross attention feature.
The second attention weight parameter is determined based on the query feature and the key-value feature converted from the second input feature, which is equivalent to guiding the recombination of the first content feature through the query feature and the key-value feature of the second self-attention sub-network, thereby effectively combining the second self-attention sub-network with the first cross-attention sub-network.
In this disclosure, when performing the cross attention processing based on the second cross attention subnetwork, the second self-attention feature may be converted into the second content feature, and then the cross attention processing is performed based on the second content feature and the first attention weight parameter corresponding to the first self-attention subnetwork, so as to obtain the second cross attention feature.
The first attention weight parameter is determined based on the query feature and the key-value feature converted from the first input feature, which is equivalent to guiding the recombination of the second content feature through the query feature and the key-value feature of the first self-attention subnetwork, so that the first self-attention subnetwork and the second cross-attention subnetwork can be effectively combined.
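A minimal sketch of this shared-weight cross attention, assuming the attention weight matrix of the opposite sub-network has already been computed as in formula one; the projection name `to_content` and the requirement that the weight matrix and the content feature have compatible shapes are assumptions of the example.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Re-weights one branch's content feature using the attention weights of the other branch."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_content = nn.Linear(dim, dim)  # converts the self-attention feature into a content feature

    def forward(self, self_attn_feature: torch.Tensor, other_weight: torch.Tensor) -> torch.Tensor:
        # self_attn_feature: (L, dim); other_weight: (L, L) attention weights borrowed from the other sub-network
        content = self.to_content(self_attn_feature)
        return other_weight @ content  # cross attention feature, (L, dim)
```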
In an optional mode of the present disclosure, the feature fusion network further includes a feature fusion layer, and after the first input feature and the second input feature are input into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature, the method further includes:
the first cross attention feature and the second cross attention feature are input to a feature fusion layer to determine a fused feature based on the first cross attention feature and the second cross attention feature.
In an embodiment of the present disclosure, the feature fusion network further includes a feature fusion layer, where the feature fusion layer interfaces the first cross attention subnetwork and the second cross attention subnetwork, and is configured to perform feature fusion on the first cross attention feature output by the first cross attention subnetwork and the second cross attention feature output by the second cross attention subnetwork to obtain the fused feature.
The post-fusion feature is obtained based on the first cross attention feature and the second cross attention feature, has the characteristics of the first cross attention feature and the second cross attention feature, and can be used for subsequent analysis processing.
In an optional aspect of the disclosure, determining the post-fusion feature based on the first cross attention feature and the second cross attention feature comprises:
and performing feature fusion on the first cross attention feature and the second cross attention feature to obtain a fused feature.
In the embodiment of the present disclosure, a specific processing manner of the feature fusion layer may be to directly perform feature fusion on the first cross attention feature and the second cross attention feature to obtain a fused feature.
As an example, the fused feature may be obtained by stitching the first cross attention feature with the second cross attention feature.
In an optional aspect of the present disclosure, determining a post-fusion feature based on the first cross attention feature and the second cross attention feature includes:
performing feature fusion on the first cross attention feature and the first input feature to obtain a first sub-fusion feature;
performing feature fusion on the second cross attention feature and the second input feature to obtain a second sub-fusion feature;
and performing feature fusion on the first sub-fusion feature and the second sub-fusion feature to obtain a fused feature.
In this embodiment of the disclosure, the specific processing manner of the feature fusion layer may further be to perform feature fusion on the first cross attention feature and the first input feature to obtain a first sub-fusion feature, perform feature fusion on the second cross attention feature and the second input feature to obtain a second sub-fusion feature, and perform feature fusion on the first sub-fusion feature and the second sub-fusion feature to obtain a fused feature.
Since the first cross-attention feature and the second cross-attention feature are both subjected to cross-attention processing, confusion may occur between the features, and the characteristics of the first input feature included in the first cross-attention feature and the characteristics of the second input feature included in the second cross-attention feature may be averaged. By performing feature fusion on the first cross attention feature and the first input feature, the first sub-fusion feature can fuse the characteristics of the second input feature on the basis of considering the characteristics of the first input feature. By performing feature fusion on the second cross attention feature and the second input feature, the second sub-fusion feature can fuse the characteristics of the first input feature on the basis of considering the characteristics of the second input feature. Therefore, the fused features obtained based on the fusion of the first sub-fusion features and the second sub-fusion features have better feature fusion effect.
In the embodiment of the disclosure, the cross features after the cross attention processing are fused with the original input features at the feature fusion layer, so that data transmission links in the feature fusion network can be increased, the complexity of the network is increased, and the feature fusion effect of the feature fusion network is improved.
As an example, the specific way of feature fusing the first cross attention feature with the first input feature may be to add a feature matrix corresponding to the first cross attention feature with the feature matrix corresponding to the first input feature, and the specific way of feature fusing the second cross attention feature with the second input feature may be to add a feature matrix corresponding to the second cross attention feature with the feature matrix corresponding to the second input feature.
As an example, the first sub-fusion feature and the second sub-fusion feature may be spliced to obtain the fused feature.
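A short sketch of this variant of the fusion layer, assuming each cross attention feature has the same shape as its original input feature; the matrix addition and the splicing dimension follow the examples above.

```python
import torch

def fuse(v_cross: torch.Tensor, v_in: torch.Tensor,
         t_cross: torch.Tensor, t_in: torch.Tensor) -> torch.Tensor:
    """Residual addition per branch, then splicing of the two sub-fusion features."""
    v_sub = v_cross + v_in                   # first sub-fusion feature (matrix addition)
    t_sub = t_cross + t_in                   # second sub-fusion feature (matrix addition)
    return torch.cat([v_sub, t_sub], dim=0)  # fused feature, spliced along the sequence dimension
```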
In an optional mode of the present disclosure, the target analysis object is a document image including a target text, the first input feature is a visual feature, the second input feature is a text feature, and the correlation condition is that both the visual feature and the text feature are extracted based on the document image.
In the embodiment of the present disclosure, the target analysis object is a document image including a target text, and the corresponding application scenario is to perform feature analysis on the document image according to the feature fusion feature, for example, classify the document image, perform layout analysis on the document image, and identify a named entity in the target text.
The correlation condition is that the visual features and the text features are extracted based on the document images, namely, the visual features and the text features extracted from the same document images are limited to be subjected to feature fusion. The visual features and the text features extracted from the same document image generally have a certain correlation, for example, a header area in the document image is one of the key features for distinguishing the classification of the document image, and the text content of the header can be mined based on the image features of the document image, so that the two have a certain correlation.
The text features and the visual features are fused through the feature fusion network provided by the scheme, a better feature fusion effect is achieved, the obtained fused features have a better effect on semantic expression, the problem of semantic confusion can be avoided, and the analysis effect on the document images is ensured.
In an optional mode of the present disclosure, the method further includes:
extracting image characteristics of the document image and position characteristics of pixel points in the document image;
and carrying out feature fusion on the image features and the position features to obtain the visual features.
In the embodiment of the disclosure, the image features can be extracted from the document image, the position features of the pixel points in the document image are extracted, and the image features and the position features are subjected to feature fusion to obtain the visual features. Because the position information in the document image has important significance for analyzing the document image, the position information in the document image is introduced equivalently by fusing the position characteristics in the visual characteristics, and the subsequent analysis effect on the document image is promoted.
As one example, the fusion of the image features and the position features may be implemented by incorporating the position features into the channel dimension of the image features.
In an optional aspect of the present disclosure, extracting an image feature of a document image includes:
and extracting image features of the document image based on a preset Convolutional Neural Network (CNN).
In the embodiment of the disclosure, a layer of convolutional neural network may be preset to extract the image features of the document image.
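As an example only (the layer sizes, padding, and position-encoding scheme below are assumptions, not details fixed by the disclosure), a single convolution layer can extract the image features and a per-pixel position value can then be folded into the channel dimension:

```python
import torch
import torch.nn as nn

def visual_feature(image: torch.Tensor, d: int = 64, r: int = 3) -> torch.Tensor:
    """Sketch: image (3, h, w) float tensor -> visual feature (d, h, w)."""
    conv = nn.Conv2d(3, d - 1, kernel_size=r, padding=r // 2)  # preset one-layer CNN with d-1 output channels
    f = conv(image.unsqueeze(0))                               # image features F: (1, d-1, h, w)
    _, _, h, w = f.shape
    ys = torch.arange(h).view(h, 1).expand(h, w)
    xs = torch.arange(w).view(1, w).expand(h, w)
    p = ((ys * w + xs).float() / (h * w)).view(1, 1, h, w)     # assumed per-pixel position feature P
    return torch.cat([f, p], dim=1).squeeze(0)                 # position folded into the channel dimension
```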
In an optional manner of this disclosure, the method further includes:
performing Optical Character Recognition (OCR) processing on the document image to obtain a target text;
and performing feature extraction on the target text to obtain text features.
In the embodiment of the disclosure, the target text can be extracted from the document image by means of OCR. Each character in the target text is then encoded by a word vector (Word2Vector) model to obtain word vectors, and a Bidirectional Encoder Representations from Transformers (BERT) model converts the word vectors to obtain the text features.
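A self-contained sketch of this text branch; the `embed` table stands in for the Word2Vector model, `encoder` stands in for BERT, and the mean pooling over tokens is an assumption made only so the example runs.

```python
import torch
import torch.nn as nn

d, vocab_size = 64, 30000
embed = nn.Embedding(vocab_size, d)  # stand-in for the Word2Vector word vectors
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), num_layers=2
)  # stand-in for the BERT encoder

def text_feature(token_ids_per_line):
    """Encodes each OCR'd text line into one d-dimensional vector; returns T of shape (n, d)."""
    line_vectors = []
    for ids in token_ids_per_line:                # ids: (num_tokens,) integer token ids for one line
        word_vectors = embed(ids).unsqueeze(0)    # (1, num_tokens, d) word vectors
        encoded = encoder(word_vectors)           # (1, num_tokens, d) contextual encoding
        line_vectors.append(encoded.mean(dim=1))  # pool to one vector per line (pooling scheme assumed)
    return torch.cat(line_vectors, dim=0)         # text feature T, (n, d)
```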
In an optional mode of the present disclosure, the feature fusion network is connected to a full connection layer, and the method further includes:
inputting the fused features into the full connection layer to obtain any one of the following output of the full connection layer:
a classification result of the document image;
a named entity recognition result in the target text;
and analyzing the layout of the document image.
In the embodiment of the disclosure, the feature fusion network is connected with the full connection layer, and the fused features are input to the full connection layer to obtain an output result of the full connection layer, that is, an analysis result of the document image.
Specifically, the output result of the fully connected layer may include any one of:
a classification result of the document image;
a named entity recognition result in the target text;
and analyzing the layout of the document image.
As an example, a classification method for document images provided by the embodiments of the present disclosure includes the following specific steps:
1. feature extraction
First, a document image I is obtained, where I ∈ R^(h×w×3); the document image is w pixels wide and h pixels high and may have three channels of red, green and blue (RGB).
The document image is processed by one layer of convolutional neural network. The convolution kernel of the convolutional neural network is K, K ∈ R^(r×r×3×(d-1)), where d-1 is the number of convolution kernels. The convolution yields the image feature F, F ∈ R^(h×w×(d-1)).
A position feature P of each pixel of the document image is then obtained, P ∈ R^(h×w×1), where P_(i,j), the position value for row i and column j of the image feature F, is computed from the coordinates of that pixel.
The image feature F is combined with the position feature P to obtain the complete visual feature V = {F, P}, where V ∈ R^(h×w×d).
Then, n lines of target text are extracted from the document image through OCR processing. Each word in the target text is encoded into a d-dimensional vector using a word2vector model. A BERT model is then used to encode the feature vector sequence of each line, giving the text feature T, T = {BERT(t_a), a ∈ [1, n]}, T ∈ R^(n×d), where BERT(·) denotes encoding by the BERT model and a indexes a line of the target text.
2. Feature fusion
The first two dimensions of the visual feature V are flattened to obtain the visual feature V', V' ∈ R^((h·w)×d), so that the visual feature and the text feature have the same feature dimensionality d, which facilitates the cross attention processing.
The visual feature V' and the text feature T are fused through the feature fusion network provided by the disclosure to obtain a first cross attention feature V_x and a second cross attention feature T_x.
The first cross attention feature V_x and the second cross attention feature T_x are spliced to obtain the fused feature M, M = {V_x, T_x}, M ∈ R^((h·w+n)×d).
The post-fusion feature I is written as I = concat(V, T), where concat denotes the concatenation operation of the feature fusion network.
3. Classification of document images
The feature fusion network is followed by a fully connected layer (FC), which maps the fused feature M onto a vector whose length equals the number of classes. This vector is mapped to a probability distribution scores by a softmax function, scores = softmax(FC(M)), where softmax denotes the softmax function and FC denotes the mapping performed by the fully connected layer. The category with the highest probability value may be taken as the predicted category cls of the document: cls = argmax(scores), where argmax returns the index at which the probability distribution reaches its maximum.
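A minimal sketch of this classification head; flattening M before the fully connected layer, and the values of `num_classes`, `d`, and `seq_len`, are assumptions of the example.

```python
import torch
import torch.nn as nn

num_classes, d, seq_len = 10, 64, 128
fc = nn.Linear(seq_len * d, num_classes)  # fully connected layer FC

def classify(m: torch.Tensor) -> int:
    """Maps the fused feature M to a class: scores = softmax(FC(M)), cls = argmax(scores)."""
    scores = torch.softmax(fc(m.reshape(-1)), dim=-1)  # probability distribution over classes
    return int(torch.argmax(scores))                   # predicted category cls
```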
As an example, fig. 2 shows a flowchart of a specific implementation of the feature fusion method provided in the embodiment of the present disclosure.
As shown in fig. 2, the visual features are converted into a query feature Vq, a key-value feature Vk, and a content feature Vv. Self-attention processing in the first self-attention subnetwork is performed based on the query feature Vq, the key-value feature Vk, and the content feature Vv to obtain a first self-attention feature Vn; the first self-attention weight parameter Mv is the attention weight parameter used during the self-attention processing in the first self-attention subnetwork and can be calculated based on the query feature Vq and the key-value feature Vk.
The text features are converted into query features Tq, key-value features Tk, and content features TT. And performing self-attention processing in the second self-attention subnetwork based on the query feature Tq, the key-value feature Tk and the content feature TT to obtain a second self-attention feature Tn, wherein the second self-attention weight parameter Mt is an attention weight parameter during self-attention processing in the second self-attention subnetwork, and the second self-attention weight parameter Mt can be calculated based on the query feature Tq and the key-value feature Tk.
The first cross attention feature Vo can be obtained by performing cross attention processing in the first cross attention sub-network based on the first self-attention feature Vn and the second self-attention weight parameter Mt.
The second cross attention feature To can be obtained by performing cross attention processing in the second cross attention sub-network based on the second self-attention feature Tn and the first self-attention weight parameter Mv.
And performing feature fusion on the first cross attention feature Vo and the visual feature to obtain a first sub-fusion feature Vf.
And performing feature fusion on the second cross attention feature To and the text feature To obtain a second sub-fusion feature Tf.
And fusing the first sub-fusion characteristic Vf and the second sub-fusion characteristic Tf to obtain a fused characteristic I.
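A toy end-to-end sketch of the flow in fig. 2 (not the implementation of the present disclosure): two self-attention branches exchange their attention weight matrices for cross attention, each branch is fused with its original input, and the results are spliced. For simplicity the sketch assumes both branches share the feature dimension and the same sequence length.

```python
import torch
import torch.nn as nn

class SharedWeightFusion(nn.Module):
    """Dual self-attention with exchanged attention weights, residual fusion, and splicing."""

    def __init__(self, dim: int):
        super().__init__()
        self.vq = nn.Linear(dim, dim)         # visual branch: query projection (Vq)
        self.vk = nn.Linear(dim, dim)         # visual branch: key-value projection (Vk)
        self.vv = nn.Linear(dim, dim)         # visual branch: content projection (Vv)
        self.tq = nn.Linear(dim, dim)         # text branch: query projection (Tq)
        self.tk = nn.Linear(dim, dim)         # text branch: key-value projection (Tk)
        self.tv = nn.Linear(dim, dim)         # text branch: content projection
        self.v_content = nn.Linear(dim, dim)  # converts Vn into a content feature for cross attention
        self.t_content = nn.Linear(dim, dim)  # converts Tn into a content feature for cross attention

    def forward(self, v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # v: visual feature V' (L, dim); t: text feature T (L, dim); equal L assumed for the sketch
        mv = torch.softmax(self.vq(v) @ self.vk(v).T, dim=-1)  # first self-attention weight parameter Mv
        mt = torch.softmax(self.tq(t) @ self.tk(t).T, dim=-1)  # second self-attention weight parameter Mt
        vn = mv @ self.vv(v)                                   # first self-attention feature Vn
        tn = mt @ self.tv(t)                                   # second self-attention feature Tn
        vo = mt @ self.v_content(vn)                           # first cross attention feature Vo (uses Mt)
        to = mv @ self.t_content(tn)                           # second cross attention feature To (uses Mv)
        vf, tf = vo + v, to + t                                # sub-fusion features Vf and Tf
        return torch.cat([vf, tf], dim=0)                      # fused feature I

# usage sketch: fusion = SharedWeightFusion(dim=64); i = fusion(torch.randn(50, 64), torch.randn(50, 64))
```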
Fig. 3 shows a flowchart of a document image classification method provided by an embodiment of the present disclosure, and as shown in fig. 3, the method may mainly include:
step S310: extracting image features of the document image and position features of pixel points in the document image, and performing feature fusion on the image features and the position features to obtain visual features;
step S320: performing OCR processing on the document image to obtain a target text, and performing feature extraction on the target text to obtain text features;
step S330: inputting the visual features and the text features into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature;
wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork;
the first self-attention subnetwork is used for performing self-attention processing on the visual features to obtain first self-attention features;
the second self-attention sub-network is used for performing self-attention processing on the text features to obtain second self-attention features;
the first cross attention sub-network is used for carrying out cross attention processing on the basis of the first self-attention feature and a second attention weight parameter corresponding to the second self-attention sub-network to obtain a first cross attention feature;
the second cross attention sub-network is used for carrying out cross attention processing on the basis of the second self-attention feature and the first attention weight parameter corresponding to the first self-attention sub-network to obtain a second cross attention feature;
step S340: performing feature fusion on the first cross attention feature and the visual feature to obtain a first sub-fusion feature;
step S350: performing feature fusion on the second cross attention feature and the text feature to obtain a second sub-fusion feature;
step S360: performing feature fusion on the first sub-fusion feature and the second sub-fusion feature to obtain a fused feature;
step S370: and inputting the fused features into the full-link layer to obtain a classification result of the full-link layer output document image.
In the embodiment of the disclosure, the visual features and the text features can be respectively extracted from the document image containing the target text, feature fusion is performed on the visual features and the text features based on the feature fusion network provided in the embodiment of the disclosure to obtain fused features, and the fused features fuse the characteristics of the first input features and the characteristics of the second input features, so that feature analysis can be performed on the document image according to the fused features, and the document image can be classified.
In the embodiment of the disclosure, a layer of convolutional neural network may be preset to extract the image features of the document image.
In the embodiment of the disclosure, the image features can be extracted from the document image, the position features of the pixel points in the document image are extracted, and the image features and the position features are subjected to feature fusion to obtain the visual features. Because the position information in the document image has important significance for analyzing the document image, the position information in the document image is introduced equivalently by fusing the position characteristics in the visual characteristics, and the subsequent analysis effect on the document image is promoted.
As one example, the fusion of the image features and the position features may be implemented by incorporating the position features into the channel dimension of the image features.
In the embodiment of the disclosure, the target text can be extracted from the document image by adopting an OCR mode. And then coding each character in the target text through a Word2Vector model to obtain a Word Vector, and then converting the Word Vector by adopting a BERT-based model to obtain text characteristics.
In the feature fusion network provided by the embodiment of the present disclosure, the first self-attention sub-network performs self-attention processing on the visual feature to obtain a first self-attention feature, and the second self-attention sub-network performs self-attention processing on the text feature to obtain a second self-attention feature; then the first cross-attention sub-network performs cross-attention processing based on the first self-attention feature and the second attention weight parameter corresponding to the second self-attention sub-network to obtain a first cross-attention feature, and the second cross-attention sub-network performs cross-attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention sub-network to obtain a second cross-attention feature.
In implementations of the present disclosure, the second attention weight parameter is the attention weight parameter when the second self-attention subnetwork is used to self-attention process the text feature. Specifically, the text feature may be converted to obtain the query feature, the key-value feature, and the content feature, so that the second attention weight parameter is learned according to the query feature, the key-value feature, and the content feature.
In implementations of the present disclosure, the first attention weight parameter is an attention weight parameter when the first self-attention subnetwork is configured to self-attention process the visual feature. Specifically, the visual features may be converted to obtain a query feature, a key-value feature, and a content feature, respectively, so that the first attention weight parameter may be learned according to the query feature, the key-value feature, and the content feature.
In the embodiment of the present disclosure, since the first cross attention feature is obtained by performing cross attention processing based on the first self-attention feature and the second attention weight parameter corresponding to the second self-attention subnetwork, the recombination of the first self-attention feature is guided by the second attention weight parameter of the second self-attention subnetwork, so that the first cross attention feature can fuse the characteristics of the text feature on the basis of having the characteristics of the visual feature.
In the embodiment of the present disclosure, since the second cross attention feature is obtained by performing cross attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention sub-network, the recombination of the second self-attention feature is guided by the first attention weight parameter of the first self-attention sub-network, so that the second cross attention feature can fuse the characteristics of the visual feature on the basis of having the characteristics of the text feature.
According to the method provided by the embodiment of the disclosure, visual features and text features are respectively extracted from the document image and input into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature; wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork; the first self-attention subnetwork is used for performing self-attention processing on the visual feature to obtain a first self-attention feature; the second self-attention subnetwork is used for performing self-attention processing on the text feature to obtain a second self-attention feature; the first cross-attention subnetwork is used for performing cross-attention processing based on the first self-attention feature and the second attention weight parameter corresponding to the second self-attention subnetwork to obtain the first cross attention feature; the second cross-attention subnetwork is used for performing cross-attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention subnetwork to obtain the second cross attention feature. According to this scheme, feature fusion processing is performed on the visual features and the text features through the feature fusion network, so that a first cross attention feature and a second cross attention feature fusing the characteristics of the visual features and of the text features are obtained, which can improve the feature fusion effect and thus the accuracy of the subsequent classification result of the document image.
In the embodiment of the disclosure, the first cross attention subnetwork and the second self attention subnetwork share the attention weight parameter, so that the self attention mechanism and the cross attention mechanism can be better combined, the feature fusion effect is improved, the number of model parameters in the feature fusion network can be reduced, and the complexity of model operation is reduced.
In the embodiment of the present disclosure, in order to improve the feature fusion effect, multiple layers of feature fusion networks may be set according to actual needs, and the feature fusion networks of the layers are connected in sequence.
In the embodiment of the present disclosure, the specific processing manner of the feature fusion layer may be to perform feature fusion on the first cross attention feature and the visual feature to obtain a first sub-fusion feature, perform feature fusion on the second cross attention feature and the text feature to obtain a second sub-fusion feature, and perform feature fusion on the first sub-fusion feature and the second sub-fusion feature to obtain a fused feature.
Since the first cross-attention feature and the second cross-attention feature are both subjected to cross-attention processing, confusion may occur between the features, and characteristics of the visual feature included in the first cross-attention feature and characteristics of the text feature included in the second cross-attention feature may be averaged. The first cross attention feature and the visual feature are subjected to feature fusion, so that the first sub-fusion feature can fuse the characteristics of the text feature on the basis of considering the characteristics of the visual feature. The second cross attention feature and the second input feature are subjected to feature fusion, so that the second sub-fusion feature can fuse the characteristics of the visual feature on the basis of considering the characteristics of the text feature. Therefore, the fused features obtained based on the fusion of the first sub-fusion features and the second sub-fusion features have better feature fusion effect.
In the embodiment of the disclosure, the cross features after the cross attention processing are fused with the original input features at the feature fusion layer, so that data transmission links in the feature fusion network can be increased, the complexity of the network is increased, and the feature fusion effect of the feature fusion network is improved.
As an example, the specific way of feature fusing the first cross attention feature and the visual feature may be to add a feature matrix corresponding to the first cross attention feature and a feature matrix corresponding to the visual feature, and the specific way of feature fusing the second cross attention feature and the text feature may be to add a feature matrix corresponding to the second cross attention feature and a feature matrix corresponding to the text feature.
As an example, the first sub-fusion feature and the second sub-fusion feature may be spliced to obtain the post-fusion feature.
In the embodiment of the disclosure, the feature fusion network is connected with the full connection layer, and the fused features are input into the full connection layer to obtain the classification result of the document image output by the full connection layer.
Based on the same principle as the method shown in fig. 1, fig. 4 shows a schematic structural diagram of a feature fusion apparatus provided by an embodiment of the present disclosure, and as shown in fig. 4, the feature fusion apparatus 30 may include:
a feature obtaining module 410, configured to obtain a first input feature and a second input feature, where the correlation between the first input feature and the target analysis object and the correlation between the second input feature and the target analysis object both satisfy a preset correlation condition;
the feature fusion module 420 is configured to input the first input feature and the second input feature into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature;
wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork;
the first self-attention sub-network is used for carrying out self-attention processing on the first input feature to obtain a first self-attention feature;
the second self-attention sub-network is used for carrying out self-attention processing on the second input feature to obtain a second self-attention feature;
the first cross attention sub-network is used for carrying out cross attention processing on the basis of the first self-attention feature and a second attention weight parameter corresponding to the second self-attention sub-network to obtain a first cross attention feature;
the second cross attention subnetwork is used for carrying out cross attention processing on the basis of the second self attention feature and the first attention weight parameter corresponding to the first self attention subnetwork to obtain a second cross attention feature.
According to the device provided by the embodiment of the disclosure, a first input feature and a second input feature are acquired, where the correlation between the first input feature and the target analysis object and the correlation between the second input feature and the target analysis object both satisfy a preset correlation condition; the first input feature and the second input feature are input into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature; wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork; the first self-attention subnetwork is used for performing self-attention processing on the first input feature to obtain a first self-attention feature; the second self-attention subnetwork is used for performing self-attention processing on the second input feature to obtain a second self-attention feature; the first cross attention subnetwork is used for performing cross attention processing based on the first self-attention feature and a second attention weight parameter corresponding to the second self-attention subnetwork to obtain the first cross attention feature; the second cross attention subnetwork is used for performing cross attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention subnetwork to obtain the second cross attention feature. According to the scheme, the first input feature and the second input feature are subjected to feature fusion processing based on the feature fusion network, so that a first cross attention feature and a second cross attention feature which fuse the characteristics of the first input feature and the characteristics of the second input feature are obtained, which improves the effect of feature fusion and thus the performance of the model.
Optionally, when the first cross attention subnetwork performs cross attention processing based on the first self attention feature and a second attention weight parameter corresponding to the second self attention subnetwork to obtain the first cross attention feature, the first cross attention subnetwork is specifically configured to:
converting the first self-attention feature into a first content feature;
and performing cross attention processing on the basis of the first content feature and a second attention weight parameter corresponding to the second self-attention subnetwork to obtain a first cross attention feature.
The second cross attention subnetwork is specifically configured to, when performing cross attention processing based on the second self attention feature and the first attention weight parameter corresponding to the first self attention subnetwork to obtain the second cross attention feature:
converting the second self-attention feature into a second content feature;
and performing cross attention processing on the basis of the second content characteristic and the first attention weight parameter corresponding to the first self-attention subnetwork to obtain a second cross attention characteristic.
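The sketch below illustrates one plausible reading of the sub-networks described above, treated here as an assumption rather than the disclosed implementation: the attention weight parameter of each self-attention sub-network is taken to be its softmax attention map, the content feature is a linear projection of the self-attention feature, and cross attention applies the other stream's attention map to this stream's content feature. Both streams are assumed to share the same sequence length and hidden size so the attention maps can be exchanged directly; all names and shapes are illustrative.

```python
import math
import torch
import torch.nn as nn

class DualStreamAttention(nn.Module):
    """Illustrative sketch of two self-attention sub-networks whose attention
    weight parameters are exchanged for cross attention. Projections, shapes
    and names are assumptions, not taken from the disclosure."""
    def __init__(self, hidden: int):
        super().__init__()
        self.hidden = hidden
        # Per-stream query/key/value projections used for self-attention.
        self.qkv_1 = nn.ModuleDict({k: nn.Linear(hidden, hidden) for k in ("q", "k", "v")})
        self.qkv_2 = nn.ModuleDict({k: nn.Linear(hidden, hidden) for k in ("q", "k", "v")})
        # Projections that convert a self-attention feature into a content feature.
        self.content_1 = nn.Linear(hidden, hidden)
        self.content_2 = nn.Linear(hidden, hidden)

    def _self_attention(self, x, qkv):
        q, k, v = qkv["q"](x), qkv["k"](x), qkv["v"](x)
        # Attention weight parameter of this self-attention sub-network.
        weights = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.hidden), dim=-1)
        return weights @ v, weights

    def forward(self, first_input, second_input):
        first_self, first_weights = self._self_attention(first_input, self.qkv_1)
        second_self, second_weights = self._self_attention(second_input, self.qkv_2)
        # Convert each self-attention feature into a content feature.
        first_content = self.content_1(first_self)
        second_content = self.content_2(second_self)
        # Cross attention: apply the other stream's attention weights to this
        # stream's content feature (sequence lengths assumed equal).
        first_cross = second_weights @ first_content
        second_cross = first_weights @ second_content
        return first_cross, second_cross
```

Under this reading, exchanging the attention maps lets each stream re-weight its own content according to what the other modality attends to, which is one way of realizing the cross attention exchange described above.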
Optionally, the feature fusion network further includes a feature fusion layer, and the apparatus further includes:
and the feature fusion layer processing module is configured to, after the first input feature and the second input feature are input into the preset feature fusion network to obtain the first cross attention feature and the second cross attention feature, input the first cross attention feature and the second cross attention feature into the feature fusion layer, so as to determine a fused feature based on the first cross attention feature and the second cross attention feature.
Optionally, when determining the fused feature based on the first cross attention feature and the second cross attention feature, the feature fusion layer processing module is specifically configured to:
and performing feature fusion on the first cross attention feature and the second cross attention feature to obtain a fused feature.
Optionally, when determining the fused feature based on the first cross attention feature and the second cross attention feature, the feature fusion layer processing module is specifically configured to:
performing feature fusion on the first cross attention feature and the first input feature to obtain a first sub-fusion feature;
performing feature fusion on the second cross attention feature and the second input feature to obtain a second sub-fusion feature;
and performing feature fusion on the first sub-fusion feature and the second sub-fusion feature to obtain a fused feature.
Optionally, the target analysis object is a document image including a target text, the first input feature is a visual feature, the second input feature is a text feature, and the correlation condition is that the visual feature and the text feature are extracted based on the document image.
Optionally, the apparatus further includes a visual feature extraction module, where the visual feature extraction module is configured to:
extracting image characteristics of the document image and position characteristics of pixel points in the document image;
and carrying out feature fusion on the image features and the position features to obtain the visual features.
Optionally, when the visual feature extraction module extracts the image feature of the document image, the visual feature extraction module is specifically configured to:
and extracting the image characteristics of the document image based on a preset convolutional neural network.
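A minimal sketch of such a visual feature extraction module follows, assuming a small convolutional backbone for the image features, a learned embedding of the flattened spatial positions standing in for the position features of pixel points, and fusion by addition; the architecture and dimensions are illustrative assumptions, not taken from the disclosure.

```python
import torch
import torch.nn as nn

class VisualFeatureExtractor(nn.Module):
    """Illustrative sketch: CNN image features plus position features, fused by addition.
    The backbone, position embedding scheme and sizes are hypothetical."""
    def __init__(self, hidden: int, max_positions: int = 4096):
        super().__init__()
        # A small convolutional neural network standing in for the preset backbone.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, hidden, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=2, padding=1),
        )
        # Position features for the flattened spatial locations.
        self.position = nn.Embedding(max_positions, hidden)

    def forward(self, image):                    # image: (batch, 3, H, W)
        feat = self.backbone(image)              # (batch, hidden, h, w)
        feat = feat.flatten(2).transpose(1, 2)   # (batch, h*w, hidden)
        pos_ids = torch.arange(feat.size(1), device=feat.device)
        # Feature fusion of the image features and the position features.
        return feat + self.position(pos_ids)     # visual features
```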
Optionally, the apparatus further includes a text feature extraction module, where the text feature extraction module is configured to:
performing OCR processing on the document image to obtain a target text;
and performing feature extraction on the target text to obtain text features.
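A hedged sketch of this text branch is given below; the disclosure does not name a specific OCR engine or text encoder, so ocr_engine, tokenizer and embedding are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

def extract_text_features(document_image, ocr_engine, tokenizer, embedding: nn.Embedding):
    """Illustrative sketch only: ocr_engine is any callable returning the recognized
    text, tokenizer returns a list of token ids, and embedding maps ids to vectors."""
    # OCR processing of the document image to obtain the target text.
    target_text = ocr_engine(document_image)
    # Feature extraction on the target text to obtain the text features.
    token_ids = torch.tensor([tokenizer(target_text)])   # (1, seq_len)
    text_features = embedding(token_ids)                  # (1, seq_len, hidden)
    return target_text, text_features
```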
Optionally, the feature fusion network is connected to a full connection layer, and the apparatus further includes:
the full connection layer processing module is used for inputting the fused features into the full connection layer to obtain any one of the following outputs of the full connection layer (an illustrative sketch of such task-specific heads follows this list):
a classification result of the document image;
a named entity recognition result in the target text;
a layout analysis result of the document image.
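As noted above, the following is an illustrative sketch of how a single full connection layer stage could expose these alternative outputs; the class counts, pooling choice and task switch are assumptions, not taken from the disclosure.

```python
import torch.nn as nn

class TaskHeads(nn.Module):
    """Illustrative sketch: one full connection head per downstream task.
    All sizes and the task selection mechanism are hypothetical."""
    def __init__(self, fused_dim: int, num_doc_classes: int, num_entity_tags: int, num_layout_labels: int):
        super().__init__()
        self.doc_cls = nn.Linear(fused_dim, num_doc_classes)   # document image classification
        self.ner = nn.Linear(fused_dim, num_entity_tags)       # named entity recognition in the target text
        self.layout = nn.Linear(fused_dim, num_layout_labels)  # layout analysis of the document image

    def forward(self, fused, task: str):
        if task == "classification":
            return self.doc_cls(fused.mean(dim=1))  # pool the fused sequence, then classify
        if task == "ner":
            return self.ner(fused)                  # per-token entity tag logits
        return self.layout(fused)                   # per-token layout label logits
```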
It is understood that the above modules of the feature fusion apparatus in the embodiment of the present disclosure have the functions of implementing the corresponding steps of the feature fusion method in the embodiment shown in fig. 1. These functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above. The modules may be software and/or hardware, and each module may be implemented independently or by integrating a plurality of modules. For the functional description of each module of the above feature fusion apparatus, reference may be made to the corresponding description of the feature fusion method in the embodiment shown in fig. 1, and details are not repeated here.
Based on the same principle as the method shown in fig. 3, fig. 5 shows a schematic structural diagram of a document image classification device provided by the embodiment of the disclosure, and as shown in fig. 5, the document image classification device 50 may include:
the feature extraction module 510 is configured to extract image features of the document image and position features of pixel points in the document image, and perform feature fusion on the image features and the position features to obtain visual features; and to perform OCR processing on the document image to obtain a target text, and perform feature extraction on the target text to obtain text features;
a feature fusion module 520, configured to input the visual features and the text features into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature;
wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork;
the first self-attention sub-network is used for performing self-attention processing on the visual features to obtain first self-attention features;
the second self-attention sub-network is used for performing self-attention processing on the text features to obtain second self-attention features;
the first cross attention sub-network is used for carrying out cross attention processing on the basis of the first self-attention feature and a second attention weight parameter corresponding to the second self-attention sub-network to obtain a first cross attention feature;
the second cross attention sub-network is used for carrying out cross attention processing on the basis of the second self-attention feature and the first attention weight parameter corresponding to the first self-attention sub-network to obtain a second cross attention feature;
the feature fusion layer processing module 530 is configured to perform feature fusion on the first cross attention feature and the visual feature to obtain a first sub-fusion feature, perform feature fusion on the second cross attention feature and the text feature to obtain a second sub-fusion feature, and perform feature fusion on the first sub-fusion feature and the second sub-fusion feature to obtain a fused feature;
and the full connection layer processing module 540 is configured to input the fused features into the full connection layer to obtain the classification result of the document image output by the full connection layer.
According to the device provided by the embodiment of the disclosure, visual features and text features are respectively extracted from the document image and input into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature; wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork; the first self-attention subnetwork is used for performing self-attention processing on the visual features to obtain a first self-attention feature; the second self-attention subnetwork is used for performing self-attention processing on the text features to obtain a second self-attention feature; the first cross attention subnetwork is used for performing cross attention processing based on the first self-attention feature and a second attention weight parameter corresponding to the second self-attention subnetwork to obtain the first cross attention feature; the second cross attention subnetwork is used for performing cross attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention subnetwork to obtain the second cross attention feature. In this scheme, the visual features and the text features are subjected to feature fusion processing based on the feature fusion network, so that a first cross attention feature and a second cross attention feature which fuse the characteristics of the visual features and the characteristics of the text features are obtained, which improves the feature fusion effect and thus the accuracy of the subsequent classification result of the document image.
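Putting the pieces together, the following hypothetical usage example chains the sketches given earlier in this description (VisualFeatureExtractor, DualStreamAttention, FusionAndClassify) into a document image classification forward pass; the text features are stubbed with random tensors because the OCR and text encoding steps are not reproduced here, and all sizes are illustrative.

```python
import torch

# Hypothetical end-to-end use of the earlier sketches; names and sizes are assumptions.
hidden, num_classes = 256, 16
visual_extractor = VisualFeatureExtractor(hidden)
attention = DualStreamAttention(hidden)
fusion_head = FusionAndClassify(hidden, num_classes)

image = torch.randn(1, 3, 256, 256)                  # a document image tensor
visual = visual_extractor(image)                     # visual features, (1, 1024, hidden)
text = torch.randn(1, visual.size(1), hidden)        # stand-in text features (equal length assumed)
first_cross, second_cross = attention(visual, text)  # first and second cross attention features
logits = fusion_head(first_cross, visual, second_cross, text)
predicted_class = logits.argmax(dim=-1)              # classification result of the document image
```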
It is understood that the above modules of the document image classification device in the embodiment of the present disclosure have the functions of implementing the corresponding steps of the document image classification method in the embodiment shown in fig. 3. These functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above. The modules may be software and/or hardware, and each module may be implemented independently or by integrating a plurality of modules. For the functional description of each module of the above document image classification device, reference may be made to the corresponding description of the document image classification method in the embodiment shown in fig. 3, and details are not repeated here.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the users involved all comply with the provisions of relevant laws and regulations, and do not violate public order and good morals.
The present disclosure also provides an electronic device and a readable storage medium according to an embodiment of the present disclosure.
The electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the feature fusion method as provided by the embodiments of the present disclosure.
Compared with the prior art, the electronic device acquires a first input feature and a second input feature, where the correlation between the first input feature and the target analysis object and the correlation between the second input feature and the target analysis object both satisfy the preset correlation condition; inputs the first input feature and the second input feature into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature; wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork; the first self-attention subnetwork is used for performing self-attention processing on the first input feature to obtain a first self-attention feature; the second self-attention subnetwork is used for performing self-attention processing on the second input feature to obtain a second self-attention feature; the first cross attention subnetwork is used for performing cross attention processing based on the first self-attention feature and a second attention weight parameter corresponding to the second self-attention subnetwork to obtain the first cross attention feature; the second cross attention subnetwork is used for performing cross attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention subnetwork to obtain the second cross attention feature. According to the scheme, the first input feature and the second input feature are subjected to feature fusion processing based on the feature fusion network, so that a first cross attention feature and a second cross attention feature which fuse the characteristics of the first input feature and the characteristics of the second input feature are obtained, which improves the effect of feature fusion and thus the performance of the model.
The readable storage medium is a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a feature fusion method as provided by an embodiment of the present disclosure.
Compared with the prior art, with the readable storage medium, a first input feature and a second input feature are acquired, where the correlation between the first input feature and the target analysis object and the correlation between the second input feature and the target analysis object both meet the preset correlation condition; the first input feature and the second input feature are input into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature; wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork; the first self-attention subnetwork is used for performing self-attention processing on the first input feature to obtain a first self-attention feature; the second self-attention subnetwork is used for performing self-attention processing on the second input feature to obtain a second self-attention feature; the first cross attention subnetwork is used for performing cross attention processing based on the first self-attention feature and a second attention weight parameter corresponding to the second self-attention subnetwork to obtain the first cross attention feature; the second cross attention subnetwork is used for performing cross attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention subnetwork to obtain the second cross attention feature. In this scheme, feature fusion processing is performed on the first input feature and the second input feature based on the feature fusion network, so that a first cross attention feature and a second cross attention feature which fuse the characteristics of the first input feature and the characteristics of the second input feature are obtained, which improves the effect of feature fusion and thus the performance of the model.
FIG. 6 illustrates a schematic block diagram of an example electronic device 60 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the electronic device 60 includes a computing unit 610 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 620 or a computer program loaded from a storage unit 680 into a Random Access Memory (RAM) 630. In the RAM 630, various programs and data required for the operation of the device 60 can also be stored. The computing unit 610, the ROM 620, and the RAM 630 are connected to each other by a bus 640. An input/output (I/O) interface 650 is also connected to the bus 640.
Various components in device 60 are connected to I/O interface 650, including: an input unit 660 such as a keyboard, a mouse, etc.; an output unit 670 such as various types of displays, speakers, and the like; a storage unit 680, such as a magnetic disk, optical disk, or the like; and a communication unit 690 such as a network card, modem, wireless communication transceiver, etc. The communication unit 690 allows the device 60 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Computing unit 610 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 610 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 610 performs the feature fusion method provided in the embodiments of the present disclosure. For example, in some embodiments, the feature fusion method provided in the embodiments of the present disclosure may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 680. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 60 via the ROM 620 and/or the communication unit 690. When the computer program is loaded into the RAM 630 and executed by the computing unit 610, one or more steps of the feature fusion method provided in the embodiments of the present disclosure may be performed. Alternatively, in other embodiments, the computing unit 610 may be configured in any other suitable manner (e.g., by means of firmware) to perform the feature fusion method provided in the embodiments of the present disclosure.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. A method of feature fusion, comprising:
acquiring a first input feature and a second input feature, wherein the correlation between the first input feature and a target analysis object and the correlation between the second input feature and the target analysis object both meet a preset correlation condition;
inputting the first input feature and the second input feature into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature;
wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork;
the first self-attention sub-network is used for carrying out self-attention processing on the first input feature to obtain a first self-attention feature;
the second self-attention sub-network is used for performing self-attention processing on the second input feature to obtain a second self-attention feature;
the first cross attention subnetwork is used for performing cross attention processing based on the first self attention feature and a second attention weight parameter corresponding to the second self attention subnetwork to obtain a first cross attention feature;
the second cross attention subnetwork is used for performing cross attention processing based on the second self attention feature and the first attention weight parameter corresponding to the first self attention subnetwork to obtain a second cross attention feature.
2. The method according to claim 1, wherein the first cross attention subnetwork, when performing cross attention processing based on the first self attention feature and a second attention weight parameter corresponding to the second self attention subnetwork to obtain a first cross attention feature, is specifically configured to:
converting the first self-attention feature into a first content feature;
performing cross attention processing on the basis of the first content feature and a second attention weight parameter corresponding to the second self-attention subnetwork to obtain a first cross attention feature;
the second cross attention subnetwork is specifically configured to, when performing cross attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention subnetwork to obtain a second cross attention feature:
converting the second self-attention feature into a second content feature;
and performing cross attention processing on the basis of the second content feature and a first attention weight parameter corresponding to the first self-attention subnetwork to obtain a second cross attention feature.
3. The method according to claim 1 or 2, wherein the feature fusion network further comprises a feature fusion layer, and after the inputting the first input feature and the second input feature into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature, the method further comprises:
inputting the first cross attention feature and the second cross attention feature to the feature fusion layer to determine a post-fusion feature based on the first cross attention feature and the second cross attention feature.
4. The method of claim 3, wherein the determining a post-fusion feature based on the first cross-attention feature and the second cross-attention feature comprises:
and performing feature fusion on the first cross attention feature and the second cross attention feature to obtain a fused feature.
5. The method of claim 3, wherein the determining a post-fusion feature based on the first cross-attention feature and the second cross-attention feature comprises:
performing feature fusion on the first cross attention feature and the first input feature to obtain a first sub-fusion feature;
performing feature fusion on the second cross attention feature and the second input feature to obtain a second sub-fusion feature;
and performing feature fusion on the first sub-fusion feature and the second sub-fusion feature to obtain a fused feature.
6. The method according to any one of claims 3 to 5, wherein the target analysis object is a document image containing target text, the first input feature is a visual feature, the second input feature is a text feature, and the relevance condition is that the visual feature and the text feature are both extracted based on the document image.
7. The method of claim 6, further comprising:
extracting image features of the document image and position features of pixel points in the document image;
and performing feature fusion on the image features and the position features to obtain the visual features.
8. The method of claim 7, wherein said extracting image features of the document image comprises:
and extracting the image characteristics of the document image based on a preset convolutional neural network.
9. The method according to any one of claims 6-8, further comprising:
performing Optical Character Recognition (OCR) processing on the document image to obtain the target text;
and extracting the features of the target text to obtain the text features.
10. The method of any of claims 6-9, wherein the feature fusion network has a fully connected layer connected, the method further comprising:
inputting the fused features into the full connection layer to obtain any one of the following outputs of the full connection layer:
a classification result of the document image;
a named entity recognition result in the target text;
a layout analysis result of the document image.
11. A feature fusion apparatus comprising:
the feature acquisition module is used for acquiring a first input feature and a second input feature, wherein the correlation between the first input feature and a target analysis object and the correlation between the second input feature and the target analysis object both meet a preset correlation condition;
the feature fusion module is used for inputting the first input feature and the second input feature into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature;
wherein the feature fusion network comprises a first self-attention subnetwork, a second self-attention subnetwork, a first cross-attention subnetwork, and a second cross-attention subnetwork;
the first self-attention sub-network is used for carrying out self-attention processing on the first input feature to obtain a first self-attention feature;
the second self-attention sub-network is used for performing self-attention processing on the second input feature to obtain a second self-attention feature;
the first cross attention sub-network is used for performing cross attention processing based on the first self-attention feature and a second attention weight parameter corresponding to the second self-attention sub-network to obtain a first cross attention feature;
the second cross attention subnetwork is used for performing cross attention processing based on the second self attention feature and the first attention weight parameter corresponding to the first self attention subnetwork to obtain a second cross attention feature.
12. The apparatus of claim 11, wherein the first cross attention subnetwork, when performing cross attention processing based on the first self-attention feature and a second attention weight parameter corresponding to the second self-attention subnetwork to obtain a first cross attention feature, is specifically configured to:
converting the first self-attention feature into a first content feature;
performing cross attention processing on the basis of the first content feature and a second attention weight parameter corresponding to the second self-attention subnetwork to obtain a first cross attention feature;
the second cross attention subnetwork is specifically configured to, when performing cross attention processing based on the second self-attention feature and the first attention weight parameter corresponding to the first self-attention subnetwork to obtain a second cross attention feature:
converting the second self-attention feature into a second content feature;
and performing cross attention processing on the basis of the second content feature and a first attention weight parameter corresponding to the first self-attention subnetwork to obtain a second cross attention feature.
13. The apparatus of claim 11 or 12, wherein the feature fusion network further comprises a feature fusion layer, the apparatus further comprising:
a feature fusion layer processing module, configured to, after the first input feature and the second input feature are input into a preset feature fusion network to obtain a first cross attention feature and a second cross attention feature, input the first cross attention feature and the second cross attention feature into the feature fusion layer, so as to determine a fused feature based on the first cross attention feature and the second cross attention feature.
14. The apparatus of claim 13, wherein the feature fusion layer processing module, when determining the post-fusion feature based on the first cross attention feature and the second cross attention feature, is specifically to:
and performing feature fusion on the first cross attention feature and the second cross attention feature to obtain a fused feature.
15. The apparatus of claim 13, wherein the feature fusion layer processing module, when determining the post-fusion feature based on the first cross attention feature and the second cross attention feature, is specifically to:
performing feature fusion on the first cross attention feature and the first input feature to obtain a first sub-fusion feature;
performing feature fusion on the second cross attention feature and the second input feature to obtain a second sub-fusion feature;
and performing feature fusion on the first sub-fusion feature and the second sub-fusion feature to obtain a fused feature.
16. The apparatus according to any one of claims 13-15, wherein the target analysis object is a document image containing target text, the first input feature is a visual feature, the second input feature is a textual feature, and the relevance condition is that the visual feature and the textual feature are both extracted based on the document image.
17. The apparatus of claim 16, further comprising a visual feature extraction module to:
extracting image features of the document image and position features of pixel points in the document image;
and performing feature fusion on the image features and the position features to obtain the visual features.
18. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
19. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-10.
CN202211304730.1A (priority date 2022-10-24, filing date 2022-10-24): Feature fusion method and device, electronic equipment and computer readable storage medium. Status: Pending. Publication: CN115601620A (en).

Priority Applications (1)

Application Number: CN202211304730.1A; Priority Date: 2022-10-24; Filing Date: 2022-10-24; Title: Feature fusion method and device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number: CN202211304730.1A; Priority Date: 2022-10-24; Filing Date: 2022-10-24; Title: Feature fusion method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number: CN115601620A; Publication Date: 2023-01-13

Family ID: 84848774

Family Applications (1)

Application Number: CN202211304730.1A; Status: Pending; Title: Feature fusion method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country: CN; Publication: CN115601620A (en)

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination