CN114821605A - Text processing method, device, equipment and medium - Google Patents

Text processing method, device, equipment and medium Download PDF

Info

Publication number
CN114821605A
CN114821605A CN202210762364.8A CN202210762364A CN114821605A CN 114821605 A CN114821605 A CN 114821605A CN 202210762364 A CN202210762364 A CN 202210762364A CN 114821605 A CN114821605 A CN 114821605A
Authority
CN
China
Prior art keywords
text
features
heterogeneous
feature
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210762364.8A
Other languages
Chinese (zh)
Other versions
CN114821605B (en
Inventor
李晓川
赵雅倩
李仁刚
郭振华
范宝余
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202210762364.8A priority Critical patent/CN114821605B/en
Publication of CN114821605A publication Critical patent/CN114821605A/en
Application granted granted Critical
Publication of CN114821605B publication Critical patent/CN114821605B/en
Priority to PCT/CN2022/141186 priority patent/WO2024001100A1/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/18Extraction of features or characteristics of the image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1912Selecting the most significant subset of features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Compression Of Band Width Or Redundancy In Fax (AREA)

Abstract

The application relates to the technical field of artificial intelligence, and discloses a text processing method, a device, equipment and a medium, which are used for coding acquired images to be analyzed and texts to obtain input characteristics; the text comprises a first text and a second text; the input features include initial image features and initial text features. According to a set homogeneous attention mechanism, carrying out correlation analysis on the initial image characteristics and the initial text characteristics to obtain intermediate image characteristics and intermediate text characteristics; and performing cross-modal analysis on the intermediate image features and the intermediate text features according to a set heterogeneous attention mechanism to obtain heterogeneous image features and heterogeneous text features. And analyzing the heterogeneous image characteristics and the heterogeneous text characteristics by using a scorer to determine a target text matched with the first text. By setting a homogeneous attention mechanism and a heterogeneous attention mechanism, the attributes of the multi-mode features are fully mined, and the target text matched with the first text can be screened out more accurately.

Description

Text processing method, device, equipment and medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a text processing method, apparatus, device, and computer-readable storage medium.
Background
Visual common sense Reasoning (VCR) refers to selecting an answer that best fits the description of a quiz sentence among 4 options and selecting a theoretical basis for supporting the answer among the additional 4 options according to a specified image. The multimode artificial intelligence generally relates to multimode data input such as vision, voice, text, various types of sensing signals and the like, which is similar to the situation appearing in a daily scene, so that the multimode artificial intelligence has a better landing prospect and becomes one of the current international mainstream research directions. The VCR task is a branch of the multi-modal field, and falls within the domain of multi-modal comprehension, which is intended to enable computers to gain the ability to "understand", i.e., to respond by viewing images, based on the target characters involved in the question. The VCR task provides 4 options for the answer and the computer needs to select the best qualified output among the 4 options.
the interface of the input and the output of the transformer structure is relatively flexible, and the dimension of the characteristic is not changed by the structure. At the present stage, the most extensive is a visual common sense reasoning system based on a transformer structure, and answers which best accord with the description of the question sentences are selected for the designated images. First, an input image and several pieces of text are encoded: the image is encoded using a convolutional neural network. Inputting a question sentence, a candidate answer sentence and a candidate explanation sentence, and performing feature extraction by adopting a ready-made text encoder. The candidate answers and the reasonable probabilities of the candidate interpretations are represented by fixed characters, which are represented by a fixed vector code, i.e., probability embedded vectors.
The method realizes the joint coding of the multi-modal characteristics by stacking the transform structures, thereby realizing the interaction among different modal characteristics, and finally predicting the probability that the current candidate answer and the explanation meet the requirements by decoding the specified position characteristics. The transform structure of the full connection layer is simple and roughly splices all the features together, and the relation among all the features is calculated through an attention mechanism, so that the learning difficulty of the model is increased.
Therefore, how to improve the feature screening capability of the model without increasing the difficulty of model learning is a problem to be solved by those skilled in the art.
Disclosure of Invention
The embodiment of the application aims to provide a text processing method, a text processing device, text processing equipment and a computer readable storage medium, which can improve the feature screening capability of a model without increasing the learning difficulty of the model.
In order to solve the foregoing technical problem, an embodiment of the present application provides a text processing method, including:
coding the acquired image and text to be analyzed to obtain input characteristics; wherein the text comprises a first text and a second text; the first text and the second text have a mapping relation; the input features comprise initial image features and initial text features;
according to a set homogeneous attention mechanism, performing correlation analysis on the initial image features and the initial text features to obtain intermediate image features and intermediate text features;
performing cross-modal analysis on the intermediate image features and the intermediate text features according to a set heterogeneous attention mechanism to obtain heterogeneous image features and heterogeneous text features;
analyzing the heterogeneous image features and the heterogeneous text features by using a scorer, and determining a target text matched with the first text; wherein the target text is a text included in the second text.
Optionally, the performing, according to a set homogeneous attention mechanism, correlation analysis on the initial image feature and the initial text feature to obtain an intermediate image feature and an intermediate text feature includes:
constructing a graph structure according to the initial image features, the initial text features and a feature space conversion matrix and a mapping matrix obtained by model training;
fusing the characteristics of each node in the graph structure according to a set characteristic updating rule to obtain the fused characteristics of each node; the fusion features comprise image features added with correlation features and text features added with correlation features;
and coding the fusion characteristics to obtain intermediate image characteristics and intermediate text characteristics.
Optionally, the constructing a graph structure according to the initial image feature, the initial text feature, and a feature space transformation matrix and a mapping matrix obtained by model training includes:
determining initial attention vectors of the initial image features and the initial text features according to a feature space conversion matrix and a mapping matrix obtained by model training;
mapping the initial attention vector according to a mapping matrix obtained by model training to obtain an attention vector;
constructing a graph structure based on the initial image features, the initial text features, and the attention vector.
Optionally, the fusing the features of each node in the graph structure according to the set feature update rule to obtain the fused feature of each node includes:
screening and normalizing the input features to obtain normalized weights among all nodes in the graph structure;
and determining the fusion characteristics of each node in the graph structure according to the set characteristic mapping matrix, the update rate, the input characteristics, the normalization weight and the attention vector.
Optionally, the determining, according to the set feature mapping matrix, the update rate, the input feature, the normalization weight, and the attention vector, the fusion feature of each node in the graph structure includes:
calling a feature updating formula, and analyzing the input features and the attention vectors to obtain updating features; the expression of the characteristic updating formula is as follows:
Figure 904649DEST_PATH_IMAGE001
wherein the content of the first and second substances,
Figure 383035DEST_PATH_IMAGE002
the characteristics of the update are represented by,fthe characteristics of the input are represented by,σwhich is indicative of the rate of update,attnf) A vector of attention is represented that is representative of an attention vector,
Figure 760926DEST_PATH_IMAGE003
which represents the normalized weight of the weight,W d representing a feature mapping matrix;
and superposing the updated feature and the attention vector to obtain a fusion feature.
Optionally, the performing cross-modal analysis on the intermediate image feature and the intermediate text feature according to a set heterogeneous attention mechanism to obtain a heterogeneous image feature and a heterogeneous text feature includes:
constructing a heterogeneous graph structure according to the intermediate image features, the intermediate text features and a feature space conversion matrix and a mapping matrix obtained by model training;
fusing the characteristics of each node in the heterogeneous graph structure according to a set heterogeneous characteristic updating rule to obtain heterogeneous fusion characteristics of each node; the heterogeneous fusion features comprise image features added with heterogeneous features and text features added with heterogeneous features;
and coding the heterogeneous fusion characteristics to obtain heterogeneous image characteristics and heterogeneous text characteristics.
Optionally, the constructing a heterogeneous graph structure according to the intermediate image feature, the intermediate text feature, and a feature space transformation matrix and a mapping matrix obtained by model training includes:
determining initial cross-attention vectors of the intermediate image features and the intermediate text features according to a feature space conversion matrix and a mapping matrix obtained by model training;
mapping the initial attention-crossing vector according to a mapping matrix obtained by model training to obtain an attention-crossing vector;
constructing a allopgram structure based on the intermediate image features, the intermediate text features, and the cross-attention vector.
Optionally, the fusing the features of each node in the heterogeneous graph structure according to the set heterogeneous feature update rule to obtain the heterogeneous fusion features of each node includes:
screening and normalizing the intermediate image features and the intermediate text features to obtain heterogeneous normalization weights among nodes in the heterogeneous graph structure;
and determining heterogeneous fusion characteristics of each node in the heterogeneous graph structure according to the set characteristic mapping matrix, the update rate, the intermediate image characteristics, the intermediate text characteristics, the heterogeneous normalization weight and the cross-attention vector.
Optionally, the determining, according to the set feature mapping matrix, the update rate, the intermediate image feature, the intermediate text feature, the heterogeneous normalization weight, and the cross-attention vector, a heterogeneous fusion feature of each node in the heterogeneous graph structure includes:
calling a first heterogeneous feature updating formula, and analyzing the intermediate image feature, the intermediate text feature and the cross-attention vector to obtain a first heterogeneous updating feature; the expression of the first heterogeneous characteristic updating formula is as follows:
Figure 564934DEST_PATH_IMAGE004
(1);
calling a second heterogeneous feature updating formula, and analyzing the intermediate image feature, the intermediate text feature and the cross-attention vector to obtain a second heterogeneous updating feature; the expression of the second heterogeneous characteristic updating formula is as follows:
Figure 762698DEST_PATH_IMAGE005
(2);
wherein, the first and the second end of the pipe are connected with each other,
Figure 779195DEST_PATH_IMAGE006
the heterogeneous update characteristics are represented by the data of the mobile terminal,pa feature of the intermediate text is represented,gthe characteristics of the intermediate image are represented,σwhich is indicative of the rate of update,crossattnp,g) A cross-attention vector is represented that spans the attention vector,
Figure 480435DEST_PATH_IMAGE003
which represents the normalized weight of the weight,W d representing a feature mapping matrix;
and superposing the heterogeneous updating features and the cross-attention vector to obtain heterogeneous fusion features.
Optionally, the analyzing the heterogeneous image features and the heterogeneous text features by using a scorer, and determining the target text matched with the first text includes:
coding the heterogeneous image features and the heterogeneous text features to obtain coding features;
taking the coding features as input features of the scorer to obtain probability scores corresponding to the second texts;
and taking the second text with the highest probability score as the target text matched with the first text.
Optionally, the first text is a question text, and the second text is an answer text.
Optionally, the answer text is multiple; the analyzing the heterogeneous image features and the heterogeneous text features by using a scorer and determining the target text matched with the first text comprises the following steps:
and analyzing the heterogeneous image characteristics and the heterogeneous text characteristics by using a scorer so as to screen an answer text matched with the question text from a plurality of answer texts.
The embodiment of the application also provides a text processing device, which comprises a coding unit, a correlation analysis unit, a cross-mode analysis unit and a matching unit;
the encoding unit is used for encoding the acquired image to be analyzed and the text to obtain input characteristics; wherein the text comprises a first text and a second text; the first text and the second text have a mapping relation; the input features comprise initial image features and initial text features;
the correlation analysis unit is used for carrying out correlation analysis on the initial image features and the initial text features according to a set homogeneous attention mechanism to obtain intermediate image features and intermediate text features;
the cross-modal analysis unit is used for performing cross-modal analysis on the intermediate image features and the intermediate text features according to a set heterogeneous attention mechanism to obtain heterogeneous image features and heterogeneous text features;
the matching unit is used for analyzing the heterogeneous image characteristics and the heterogeneous text characteristics by using a scorer and determining a target text matched with the first text; wherein the target text is a text included in the second text.
Optionally, the correlation analysis unit comprises a construction subunit, a fusion subunit and an encoding subunit;
the construction subunit is configured to construct a graph structure according to the initial image feature, the initial text feature, and a feature space transformation matrix and a mapping matrix obtained by model training;
the fusion subunit is configured to fuse the features of each node in the graph structure according to a set feature update rule to obtain a fusion feature of each node; the fusion features comprise image features added with correlation features and text features added with correlation features;
and the coding subunit is used for coding the fusion characteristics to obtain intermediate image characteristics and intermediate text characteristics.
Optionally, the constructing subunit is configured to determine, according to a feature space transformation matrix and a mapping matrix obtained by model training, an initial attention vector of the initial image feature and the initial text feature; mapping the initial attention vector according to a mapping matrix obtained by model training to obtain an attention vector; constructing a graph structure based on the initial image features, the initial text features, and the attention vector.
Optionally, the fusion subunit is configured to perform screening and normalization processing on the input features, so as to obtain a normalization weight between nodes in the graph structure; and determining the fusion characteristics of each node in the graph structure according to the set characteristic mapping matrix, the update rate, the input characteristics, the normalization weight and the attention vector.
Optionally, the fusion subunit is configured to invoke a feature update formula, and analyze the input feature and the attention vector to obtain an update feature; the expression of the characteristic updating formula is as follows:
Figure 986503DEST_PATH_IMAGE007
wherein the content of the first and second substances,
Figure 405983DEST_PATH_IMAGE002
the characteristics of the update are represented by,fthe characteristics of the input are represented by,σwhich is indicative of the rate of update,attnf) A vector of attention is represented, and,
Figure 757329DEST_PATH_IMAGE003
which represents the normalized weight of the weight,W d representing a feature mapping matrix;
and superposing the updated feature and the attention vector to obtain a fusion feature.
Optionally, the cross-modal analysis unit includes a construction subunit, a fusion subunit, and an encoding subunit;
the constructing subunit is configured to construct a heterogeneous graph structure according to the intermediate image features, the intermediate text features, and a feature space transformation matrix and a mapping matrix obtained by model training;
the fusion subunit is configured to fuse, according to a set heterogeneous feature update rule, features of each node in the heterogeneous graph structure to obtain a heterogeneous fusion feature of each node; the heterogeneous fusion features comprise image features added with heterogeneous features and text features added with heterogeneous features;
and the coding subunit is used for coding the heterogeneous fusion characteristics to obtain heterogeneous image characteristics and heterogeneous text characteristics.
Optionally, the constructing subunit is configured to determine an initial cross-attention vector of the intermediate image feature and the intermediate text feature according to a feature space transformation matrix and a mapping matrix obtained by model training;
mapping the initial attention-crossing vector according to a mapping matrix obtained by model training to obtain an attention-crossing vector;
constructing a allopgram structure based on the intermediate image features, the intermediate text features, and the cross-attention vector.
Optionally, the fusion subunit is configured to perform screening and normalization processing on the intermediate image features and the intermediate text features to obtain heterogeneous normalization weights between nodes in the heterogeneous graph structure;
and determining heterogeneous fusion characteristics of each node in the heterogeneous graph structure according to the set characteristic mapping matrix, the update rate, the intermediate image characteristics, the intermediate text characteristics, the heterogeneous normalization weight and the cross-attention vector.
Optionally, the fusion subunit is configured to invoke a first heterogeneous feature update formula, and analyze the intermediate image feature, the intermediate text feature, and the cross-attention vector to obtain a first heterogeneous update feature; the expression of the first heterogeneous characteristic updating formula is as follows:
Figure 578655DEST_PATH_IMAGE008
(1);
calling a second heterogeneous feature updating formula, and analyzing the intermediate image feature, the intermediate text feature and the cross-attention vector to obtain a second heterogeneous updating feature; the expression of the second heterogeneous characteristic updating formula is as follows:
Figure 255624DEST_PATH_IMAGE009
(2);
wherein the content of the first and second substances,
Figure 162400DEST_PATH_IMAGE006
the heterogeneous update characteristics are represented by the data of the mobile terminal,pthe intermediate text feature is represented by a representation of,gthe characteristics of the intermediate image are represented,σwhich is indicative of the rate of update,crossattnp,g) A cross-attention vector is represented that spans the attention vector,
Figure 51859DEST_PATH_IMAGE003
which represents the normalized weight of the weight,W d representing a feature mapping matrix;
and superposing the heterogeneous updating features and the cross-attention vector to obtain heterogeneous fusion features.
Optionally, the matching unit includes a feature encoding unit, an input unit, and a processing unit;
the feature coding unit is used for coding the heterogeneous image features and the heterogeneous text features to obtain coding features;
the input unit is used for taking the coding features as the input features of the scorer to obtain the probability scores corresponding to the second texts;
the acting unit is used for taking the second text with the highest probability score as the target text matched with the first text.
Optionally, the first text is a question text, and the second text is an answer text.
Optionally, the answer text is multiple; the matching unit is used for analyzing the heterogeneous image characteristics and the heterogeneous text characteristics by utilizing a scorer so as to screen an answer text matched with the question text from a plurality of answer texts.
An embodiment of the present application further provides an electronic device, including:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of the processing method as described above.
Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the processing method of the foregoing text.
According to the technical scheme, the acquired image to be analyzed and the acquired text are coded to obtain the input characteristics; wherein the text may include a first text and a second text; the input features include initial image features and initial text features. In order to fully mine the relevance between the image to be analyzed and the text, the relevance analysis can be carried out on the initial image characteristic and the initial text characteristic according to a set homogeneous attention mechanism to obtain an intermediate image characteristic and an intermediate text characteristic; and performing cross-modal analysis on the intermediate image features and the intermediate text features according to a set heterogeneous attention mechanism to obtain heterogeneous image features and heterogeneous text features. And analyzing the heterogeneous image characteristics and the heterogeneous text characteristics by using a scorer to determine a target text matched with the first text. In the technical scheme, the image features and the text features are different in type and belong to multi-mode features, the attributes of the multi-mode features can be fully mined by setting a homogeneous attention mechanism and a heterogeneous attention mechanism, the learning difficulty of the model cannot be increased in the implementation process, the target text matched with the first text can be more accurately screened out based on the mined features, and the feature screening capability of the model is improved while the learning difficulty of the model is not increased.
Drawings
In order to more clearly illustrate the embodiments of the present application, the drawings needed for the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of a text processing method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an inference system for selecting text answers according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a homogeneous attention layer according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a heterogeneous attention layer according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application;
fig. 6 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without any creative effort belong to the protection scope of the present application.
The terms "including" and "having," and any variations thereof, in the description and claims of this application and the drawings described above, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may include other steps or elements not expressly listed.
In order that those skilled in the art will better understand the disclosure, the following detailed description will be given with reference to the accompanying drawings.
Next, a method for processing a text provided in an embodiment of the present application will be described in detail. Fig. 1 is a flowchart of a text processing method provided in an embodiment of the present application, where the method includes:
s101: and coding the acquired image and text to be analyzed to obtain input characteristics.
Wherein the text may include a first text and a second text; the first text and the second text have a mapping relation; the input features may include initial image features and initial text features.
In an embodiment of the present application, the second text may include a candidate answer and a candidate explanation. Wherein the candidate interpretation is a resolution of the candidate answer. In practical application, the candidate answers may be first analyzed as an answer file, and after one candidate answer that is most matched with the first text is selected from all the candidate answers, the most matched candidate answer and the candidate explanation may be further analyzed as an answer file, and the candidate explanation that is most matched with the first text is selected from all the candidate explanations.
The encoding processing of the image and text to be analyzed is a conventional way, and is not described herein again.
S102: and according to a set homogeneous attention mechanism, carrying out correlation analysis on the initial image characteristic and the initial text characteristic to obtain an intermediate image characteristic and an intermediate text characteristic.
In the embodiment of the application, in order to fully mine the relevance of the initial image feature and the initial text feature, a homogeneous attention mechanism and a heterogeneous attention mechanism are set. Wherein, the homogeneous attention mechanism can be that all the characteristics are subjected to relevance analysis, and all the characteristics comprise initial image characteristics and initial text characteristics; that is, the attention-homogeneity mechanism may analyze the correlation between the initial image features, the correlation between the initial text features, and the correlation between the initial image features and the initial text features.
In practical application, a homogeneous attention layer can be built based on a homogeneous attention mechanism, and a heterogeneous attention layer can be built based on a heterogeneous attention mechanism.
Fig. 2 is a schematic structural diagram of an inference system for text answer selection according to an embodiment of the present application, and fig. 2 is an example in which an answer file includes candidate answers and candidate interpretations. The inference system of fig. 2 includes feature concatenation, composite features, homogenous attention layers, heterogeneous attention layers, coded features, and a scorer. The feature splicing and the comprehensive features can be used for processing the acquired image and text to be analyzed to obtain initial image features and initial text features. In practical applications, continuous iterative processing needs to be performed on the image features and the text features so as to fully mine the association between the image features and the text features, and therefore, as can be seen from fig. 2, a plurality of homogeneous attention layers and a plurality of heterogeneous attention layers are provided, and each homogeneous attention layer has its corresponding heterogeneous attention layer. The input of each heterogeneous attention layer is the output of the last homogeneous attention layer adjacent to it.
The operation flow required to be executed by each homogeneous attention layer is the same, so in the embodiment of the present application, the description is given by taking the processing flow of one homogeneous attention layer as an example. Similarly, the operation flow required to be executed by each heterogeneous attention layer is the same, and therefore, in the embodiment of the present application, the description is given by taking the processing flow of one heterogeneous attention layer as an example.
The image features and the text features are of different types and belong to multi-modal features. In addition, because the graph structure has a feature aggregation function, the graph structure and the transform structure can be combined, and a graph attention mechanism is designed. The graph attention mechanism aims to solve the problem of correlation among cross-modal features, thereby improving feature effectiveness.
In specific implementation, a graph structure can be constructed according to the initial image features, the initial text features, and a feature space transformation matrix and a mapping matrix obtained by model training.
The number of nodes in the graph structure is the same as the number of all the features of the initial image feature and the initial text feature, one node represents one feature, and the edges of the graph structure can be assigned by the attention weights corresponding to the initial image feature and the initial text feature.
After the graph structure is constructed, in order to fully mine hidden information among the features, the features of each node in the graph structure can be fused according to a set feature update rule to obtain the fusion features of each node; wherein the fusion features comprise image features added with correlation features and text features added with correlation features.
And coding the fusion features to obtain intermediate image features and intermediate text features. In the embodiment of the application, features are mined in a mode of combining a graph structure and a transform structure, so after the fused features are obtained, transform coding can be performed on the fused features, and intermediate image features and intermediate text features are obtained.
S103: and performing cross-modal analysis on the intermediate image features and the intermediate text features according to a set heterogeneous attention mechanism to obtain heterogeneous image features and heterogeneous text features.
The heterogeneous attention mechanism can be used for mining the relevance between the image feature and the text feature after the homogeneous attention mechanism is processed.
In specific implementation, the heterogeneous graph structure can be constructed according to the image features, the text features and the feature space transformation matrix and the mapping matrix obtained by model training.
The number of nodes in the heterogeneous graph structure is the same as the number of features input to the heterogeneous attention layer, one node represents one feature, and the edges of the heterogeneous graph structure can be assigned by the heterogeneous attention weight corresponding to the input feature.
After the heterogeneous graph structure is constructed, in order to fully mine hidden information among the multi-modal features, the features of each node in the heterogeneous graph structure can be fused according to a set heterogeneous feature updating rule to obtain heterogeneous fusion features of each node; the heterogeneous fusion features comprise image features added with heterogeneous features and text features added with heterogeneous features;
and encoding the heterogeneous fusion characteristics to obtain heterogeneous image characteristics and heterogeneous text characteristics. In the embodiment of the application, the feature is mined in a mode of combining a heterogeneous graph structure and a transform structure, so after the heterogeneous fusion feature is obtained, transform coding can be performed on the heterogeneous fusion feature to obtain a heterogeneous image feature and a heterogeneous text feature.
S104: and analyzing the heterogeneous image characteristics and the heterogeneous text characteristics by using a scorer to determine a target text matched with the first text.
With the structure of the inference system shown in fig. 2, after the initial image features and the initial text features are sequentially processed by the multiple homogeneous attention layers and the heterogeneous attention layers, the heterogeneous image features and the heterogeneous text features output by the last heterogeneous attention layer can be finally obtained.
In order to select a target text matched with the first text by using the grader, the heterogeneous image characteristics and the heterogeneous text characteristics can be coded to obtain coding characteristics; taking the coding characteristics as input characteristics of a scorer to obtain probability scores corresponding to the second texts; and taking the second text with the highest probability score as the target text matched with the first text.
When there are a plurality of second texts with the highest probability scores, any one of the second texts with the highest probability scores may be selected as the target text matched with the first text, or all of the plurality of second texts with the highest probability scores may be selected as the target text matched with the first text.
In a specific application, the first text may be a question text and the second text may be an answer text.
Answer texts are often multiple. In practical application, the heterogeneous image features and the heterogeneous text features can be analyzed by a grader to screen out one answer text matched with the question text from the multiple answer texts.
According to the technical scheme, the acquired image to be analyzed and the acquired text are coded to obtain the input characteristics; wherein the text may include a first text and a second text; the input features include initial image features and initial text features. In order to fully mine the relevance between the image to be analyzed and the text, the relevance analysis can be carried out on the initial image characteristic and the initial text characteristic according to a set homogeneous attention mechanism to obtain an intermediate image characteristic and an intermediate text characteristic; and performing cross-modal analysis on the intermediate image features and the intermediate text features according to a set heterogeneous attention mechanism to obtain heterogeneous image features and heterogeneous text features. And analyzing the heterogeneous image characteristics and the heterogeneous text characteristics by using a scorer to determine a target text matched with the first text. In the technical scheme, the image features and the text features are different in type and belong to multi-mode features, the attributes of the multi-mode features can be fully mined by setting a homogeneous attention mechanism and a heterogeneous attention mechanism, the learning difficulty of the model cannot be increased in the implementation process, the target text matched with the first text can be more accurately screened out based on the mined features, and the feature screening capability of the model is improved while the learning difficulty of the model is not increased.
In practical applications, a corresponding structure may be set based on operations to be performed by a homogeneous attention layer, and fig. 3 is a schematic structural diagram of a homogeneous attention layer provided in an embodiment of the present application, where the homogeneous attention layer includes an attention calculation unit, a feature mapping unit, a feature reconstruction unit, a graph operator, a layer normalization unit, a random deletion unit, and an addition unit. The layer normalization unit, the random deletion unit and the addition unit have the same structure as the currently used transform, and the processing flow of these units is not described again. The attention calculation unit, the feature mapping unit, the feature reconstruction unit and the graph operator may be used to construct the graph structure and determine fusion features corresponding to nodes in the graph structure.
For the construction of a graph structure, initial attention vectors of initial image features and initial text features can be determined according to a feature space conversion matrix and a mapping matrix obtained by model training; mapping the initial attention vector according to a mapping matrix obtained by model training to obtain an attention vector; based on the initial image features, the initial text features, and the attention vector, a graph structure may be constructed.
In a specific implementation, the attention calculating unit may perform the calculation of the attention weight according to the following formula,
Figure 258849DEST_PATH_IMAGE010
wherein the content of the first and second substances,
Figure 841140DEST_PATH_IMAGE011
Figure 235212DEST_PATH_IMAGE012
W q andW k are all characterized space transformation matrices and are,W q andW k all sizes of (A) and (B) ared*dA matrix of (a);frepresenting input features obtained by splicing initial image features and initial text features, with the size ofN*dNThe total number of the features is represented,dthe dimensions of each of the features are represented,attn(f)indicating the attention weight.
NCan be used asNA node, thereby constructing oneNGraph structure of individual nodes. For theNAnd the characteristics can be assigned by the input characteristics. For edges of graph structures, the result can be calculated by attentionattn(f)And (7) assigning values.
The feature mapping unit may map the input features into a feature space, the mapping function being:
Figure 69307DEST_PATH_IMAGE013
wherein the content of the first and second substances,W v represents a mapping matrix of sized*dOf the matrix of (a).
The feature reconstruction unit may be configured to calculate attention, where the attention corresponding to the initial image feature and the initial text feature is:sff)= attnf)×mapf)。
in the embodiment of the application, in order to fully mine hidden information of image features and text features, screening and normalization processing can be performed on input features to obtain normalization weights among nodes in a graph structure; and determining the fusion characteristics of each node in the graph structure according to the set characteristic mapping matrix, the update rate, the input characteristics, the normalized weight and the attention vector.
In a specific implementation, the graph operator may include feature filtering using a non-linear unit, the feature filtering formula being:
Figure 740591DEST_PATH_IMAGE014
wherein the content of the first and second substances,LeakyReLuan activation function that represents a deep learning,W a andW b are all the feature transfer matrices and are,zthe characteristics after screening are shown.
After feature screening is completed, the screened features may be normalized, with the normalization formula:
Figure 634729DEST_PATH_IMAGE015
wherein the content of the first and second substances,expthe operation of the exponent is represented by the exponent operator,αto representNThe normalized weights of the individual nodes with respect to each other,lto representNAny one of the nodes.
After the graph arithmetic unit finishes the normalization operation, a feature updating formula can be called, and the input features and the attention vectors are analyzed to obtain updating features; the expression of the feature update formula is:
Figure 516097DEST_PATH_IMAGE016
wherein the content of the first and second substances,
Figure 950621DEST_PATH_IMAGE002
the characteristics of the update are represented by,fthe characteristics of the input are represented by,σwhich is indicative of the rate of update,σis a hyper-parameter that can be set,attnf) A vector of attention is represented, and,
Figure 804307DEST_PATH_IMAGE003
which represents the normalized weight of the weight,W d representing a feature mapping matrix;
the updated feature is superposed with the attention vector to obtain a fused featureffI.e. by
Figure 728401DEST_PATH_IMAGE017
The fusion features are output after being subjected to layer normalization unit, random deletion unit and ending addition, and the output image features and text features can enter a heterogeneous attention layer for processing.
Fig. 4 is a schematic structural diagram of a heterogeneous attention layer provided in an embodiment of the present application, where the heterogeneous attention layer includes two cross-attention calculation units, two feature mapping units, two feature reconstruction units, two graph operators, two layer normalization units, two random deletion units, and two addition units.
In specific implementation, the initial cross-attention vector of the intermediate image feature and the intermediate text feature can be determined according to a feature space conversion matrix and a mapping matrix obtained by model training; mapping the initial attention-crossing vector according to a mapping matrix obtained by model training to obtain an attention-crossing vector; constructing a heterogeneous graph structure based on the intermediate image features, the intermediate text features and the cross-attention vector.
The cross-attention calculation unit may perform the cross-attention weight calculation according to the following formula,
Figure 714375DEST_PATH_IMAGE018
wherein the content of the first and second substances,W q andW k are all characterized space transformation matrices and are,W q andW k all sizes of (A) and (B) ared*dA matrix of (a);pandgrepresenting the input features, taking the cross-attention calculation unit on the left side in fig. 4 as an example,pthe intermediate text feature is represented by a representation of,grepresenting an intermediate image feature;crossattn(pg)cross attention weights are represented.
In practical applications, a corresponding structure may be set based on operations to be performed by a heterogeneous attention layer, and fig. 4 is a schematic structural diagram of the heterogeneous attention layer provided in the embodiment of the present application, where the heterogeneous attention layer includes two cross-attention calculation units, two feature mapping units, two feature reconstruction units, two graph operators, two layer normalization units, two random deletion units, and two addition units. The layer normalization unit, the random deletion unit and the addition unit have the same structure as the currently used transform, and the processing flow of these units is not described again. The cross-attention calculation unit, the feature mapping unit, the feature reconstruction unit and the graph operator can be used for constructing a heterogeneous graph structure and determining heterogeneous fusion features corresponding to nodes in the heterogeneous graph structure.
The left-right cross-attention calculation in fig. 4 is the same, and for convenience of description, the left cross-attention calculation is taken as an example for explanation. Input features assuming a heterogeneous attention layer includeNA characteristic thatNIs characterized by asNA node, thereby constructing oneNA heterogeneous graph structure of individual nodes. For theNAnd the characteristics can be assigned by the input characteristics. For edges of a heterogeneous graph structure, cross-attention weights can be calculated from the results of the cross-attention calculationscrossattn(pg)And (7) assigning values.
After the heterogeneous graph structure is constructed, screening and normalization processing can be carried out on the intermediate image characteristics and the intermediate text characteristics to obtain heterogeneous normalization weights among all nodes in the heterogeneous graph structure; and determining heterogeneous fusion characteristics of each node in the heterogeneous graph structure according to the set characteristic mapping matrix, the update rate, the intermediate image characteristics, the intermediate text characteristics, the heterogeneous normalization weight and the cross-attention vector.
The mapping weight of the graph structure can be calculated by the formula
Figure 749327DEST_PATH_IMAGE019
Wherein the content of the first and second substances,LeakyReLuan activation function that represents a deep learning,W a W b1 andW b2 are all the feature transfer matrices and are,z p representing the mapping weights.
After the graph arithmetic unit finishes the normalization operation, a first heterogeneous feature updating formula can be called, and the intermediate image feature, the intermediate text feature and the cross-attention vector are analyzed to obtain a first heterogeneous updating feature; the expression of the first heterogeneous characteristic update formula is as follows:
Figure 519837DEST_PATH_IMAGE020
(1);
calling a second heterogeneous feature updating formula, and analyzing the intermediate image feature, the intermediate text feature and the cross-attention vector to obtain a second heterogeneous updating feature; the expression of the second heterogeneous characteristic updating formula is as follows:
Figure 349253DEST_PATH_IMAGE021
(2);
wherein the content of the first and second substances,
Figure 674055DEST_PATH_IMAGE006
the heterogeneous update characteristics are represented by the data of the mobile terminal,W d representing a feature mapping matrix;σthe update rate is a settable hyper-parameter.
And fusing the original cross-attention vector and the updated features to obtain heterogeneous fusion features. And outputting the heterogeneous fusion characteristics after the layer normalization unit, the random deletion unit and the ending addition.
The application provides a thought of fusing an image neural network and an attention mechanism and provides a homogeneous attention mechanism aiming at the common feature effectiveness problem in the existing visual sense common sense reasoning system. In order to solve the problem of heterogeneous attributes of multi-modal characteristics, a heterogeneous attention mechanism is designed, a complete reasoning system for text answer selection is set up based on the heterogeneous attention mechanism, and a target text matched with the first text can be accurately screened out by the reasoning system.
Fig. 5 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application, including an encoding unit 51, a correlation analysis unit 52, a cross-mode analysis unit 53, and a matching unit 54;
the encoding unit 51 is configured to perform encoding processing on the acquired image and text to be analyzed to obtain input features; wherein the text comprises a first text and a second text; the first text and the second text have a mapping relation; the input features comprise initial image features and initial text features;
a correlation analysis unit 52, configured to perform correlation analysis on the initial image feature and the initial text feature according to a set homogeneous attention mechanism, so as to obtain an intermediate image feature and an intermediate text feature;
a cross-modal analysis unit 53, configured to perform cross-modal analysis on the intermediate image feature and the intermediate text feature according to a set heterogeneous attention mechanism, so as to obtain a heterogeneous image feature and a heterogeneous text feature;
and the matching unit 54 is used for analyzing the heterogeneous image characteristics and the heterogeneous text characteristics by using the scorer and determining the target text matched with the first text.
Optionally, the correlation analysis unit comprises a construction subunit, a fusion subunit and an encoding subunit;
the construction subunit is used for constructing a graph structure according to the initial image characteristics, the initial text characteristics and the characteristic space conversion matrix and the mapping matrix obtained by model training;
the fusion subunit is used for fusing the characteristics of each node in the graph structure according to the set characteristic update rule to obtain the fusion characteristics of each node; the fusion features comprise image features added with correlation features and text features added with correlation features;
and the coding subunit is used for coding the fusion characteristics to obtain intermediate image characteristics and intermediate text characteristics.
Optionally, the constructing subunit is configured to determine an initial attention vector of the initial image feature and the initial text feature according to a feature space transformation matrix and a mapping matrix obtained by model training; mapping the initial attention vector according to a mapping matrix obtained by model training to obtain an attention vector; a graph structure is constructed based on the initial image features, the initial text features, and the attention vector.
Optionally, the fusion subunit is configured to perform screening and normalization processing on the input features to obtain a normalization weight between each node in the graph structure; and determining the fusion characteristics of each node in the graph structure according to the set characteristic mapping matrix, the update rate, the input characteristics, the normalized weight and the attention vector.
Optionally, the fusion subunit is configured to invoke a feature update formula, and analyze the input feature and the attention vector to obtain an update feature; the expression of the feature update formula is:
Figure 512698DEST_PATH_IMAGE022
wherein the content of the first and second substances,
Figure 137714DEST_PATH_IMAGE002
the characteristics of the update are represented by,fthe characteristics of the input are represented by,σwhich is indicative of the rate of update,attnf) A vector of attention is represented, and,
Figure 403611DEST_PATH_IMAGE003
which represents the normalized weight of the weight,W d representing a feature mapping matrix;
and overlapping the updated features and the attention vectors to obtain fused features.
Optionally, the cross-modal analysis unit includes a construction subunit, a fusion subunit, and an encoding subunit;
the construction subunit is used for constructing a heterogeneous graph structure according to the image characteristics, the text characteristics and the characteristic space conversion matrix and the mapping matrix obtained by model training;
the fusion subunit is used for fusing the characteristics of each node in the heterogeneous graph structure according to the set heterogeneous characteristic update rule to obtain the heterogeneous fusion characteristics of each node; the heterogeneous fusion features comprise image features added with heterogeneous features and text features added with heterogeneous features;
and the coding subunit is used for coding the heterogeneous fusion characteristics to obtain heterogeneous image characteristics and heterogeneous text characteristics.
Optionally, the constructing subunit is configured to determine an initial attention-crossing vector of the intermediate image feature and the intermediate text feature according to a feature space transformation matrix and a mapping matrix obtained by model training;
mapping the initial attention-crossing vector according to a mapping matrix obtained by model training to obtain an attention-crossing vector;
constructing a heterogeneous graph structure based on the intermediate image features, the intermediate text features and the cross-attention vector.
Optionally, the fusion subunit is configured to perform screening and normalization processing on the intermediate image features and the intermediate text features to obtain heterogeneous normalization weights between nodes in the heterogeneous graph structure;
and determining heterogeneous fusion characteristics of each node in the heterogeneous graph structure according to the set characteristic mapping matrix, the update rate, the intermediate image characteristics, the intermediate text characteristics, the heterogeneous normalization weight and the cross-attention vector.
Optionally, the fusion subunit is configured to invoke a first heterogeneous feature update formula, and analyze the intermediate image feature, the intermediate text feature, and the cross-attention vector to obtain a first heterogeneous update feature; the expression of the first heterogeneous characteristic update formula is as follows:
Figure 481288DEST_PATH_IMAGE023
(1);
calling a second heterogeneous feature updating formula, and analyzing the intermediate image feature, the intermediate text feature and the cross-attention vector to obtain a second heterogeneous updating feature; the expression of the second heterogeneous characteristic updating formula is as follows:
Figure 123622DEST_PATH_IMAGE024
(2);
wherein the content of the first and second substances,
Figure 603145DEST_PATH_IMAGE006
the heterogeneous update characteristics are represented by the data of the mobile terminal,pthe intermediate text feature is represented by a representation of,gthe characteristics of the intermediate image are represented,σwhich is indicative of the rate of update,crossattnp,g) A cross-attention vector is represented that is cross-attention,
Figure 305522DEST_PATH_IMAGE003
which represents the normalized weight of the weight,W d representing a feature mapping matrix;
and superposing the heterogeneous updating features and the cross-attention vector to obtain heterogeneous fusion features.
Optionally, the matching unit comprises a feature encoding unit, an input unit and a processing unit;
the feature coding unit is used for coding the heterogeneous image features and the heterogeneous text features to obtain coding features;
the input unit is used for taking the coding characteristics as the input characteristics of the scorer so as to obtain the probability scores corresponding to the second texts;
and the unit is used for taking the second text with the highest probability score as the target text matched with the first text.
Optionally, the first text is a question text, and the second text is an answer text.
Optionally, the answer text is multiple; the matching unit is used for analyzing the heterogeneous image characteristics and the heterogeneous text characteristics by using the scorer so as to screen out one answer text matched with the question text from the multiple answer texts.
The description of the features in the embodiment corresponding to fig. 5 may refer to the related description of the embodiment corresponding to fig. 1, and is not repeated here.
According to the technical scheme, the acquired image to be analyzed and the acquired text are coded to obtain the input characteristics; wherein the text may include a first text and a second text; the input features include initial image features and initial text features. In order to fully mine the relevance between the image to be analyzed and the text, the relevance analysis can be carried out on the initial image characteristic and the initial text characteristic according to a set homogeneous attention mechanism to obtain an intermediate image characteristic and an intermediate text characteristic; and performing cross-modal analysis on the intermediate image features and the intermediate text features according to a set heterogeneous attention mechanism to obtain heterogeneous image features and heterogeneous text features. And analyzing the heterogeneous image characteristics and the heterogeneous text characteristics by using a scorer to determine a target text matched with the first text. In the technical scheme, the image features and the text features are different in type and belong to multi-mode features, the attributes of the multi-mode features can be fully mined by setting a homogeneous attention mechanism and a heterogeneous attention mechanism, the learning difficulty of the model cannot be increased in the implementation process, the target text matched with the first text can be more accurately screened out based on the mined features, and the feature screening capability of the model is improved while the learning difficulty of the model is not increased.
Fig. 6 is a structural diagram of an electronic device according to an embodiment of the present application, and as shown in fig. 6, the electronic device includes: a memory 20 for storing a computer program;
a processor 21 for implementing the steps of the processing method as described in the above embodiment text when executing the computer program.
The electronic device provided by the embodiment may include, but is not limited to, a smart phone, a tablet computer, a notebook computer, or a desktop computer.
The processor 21 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 21 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 21 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 21 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 21 may further include an AI (Artificial Intelligence) processor for processing a calculation operation related to machine learning.
The memory 20 may include one or more computer-readable storage media, which may be non-transitory. Memory 20 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In this embodiment, the memory 20 is at least used for storing the following computer program 201, wherein after being loaded and executed by the processor 21, the computer program can realize the relevant steps of the processing method of the text disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 20 may also include an operating system 202, data 203, and the like, and the storage manner may be a transient storage manner or a permanent storage manner. Operating system 202 may include, among others, Windows, Unix, Linux, and the like. Data 203 may include, but is not limited to, images to be analyzed, text, and the like.
In some embodiments, the electronic device may further include a display 22, an input/output interface 23, a communication interface 24, a power supply 25, and a communication bus 26.
Those skilled in the art will appreciate that the configuration shown in fig. 6 does not constitute a limitation of the electronic device and may include more or fewer components than those shown.
It is to be understood that, if the processing method of the text in the above embodiments is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application, which are essential or part of the prior art, or all or part of the technical solutions may be embodied in the form of a software product, which is stored in a storage medium and executes all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrically erasable programmable ROM, a register, a hard disk, a removable magnetic disk, a CD-ROM, a magnetic or optical disk, and other various media capable of storing program codes.
Based on this, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the processing method of the above text.
The functions of the functional modules of the computer-readable storage medium according to the embodiment of the present invention may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the related description of the foregoing method embodiment, which is not described herein again.
The text processing method, the text processing device, the text processing apparatus, and the computer-readable storage medium provided in the embodiments of the present application are described in detail above. The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
A method, an apparatus, a device and a computer readable storage medium for processing a text provided by the present application are described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present application.

Claims (15)

1. A method for processing text, comprising:
coding the acquired image and text to be analyzed to obtain input characteristics; wherein the text comprises a first text and a second text; the first text and the second text have a mapping relation; the input features comprise initial image features and initial text features;
according to a set homogeneous attention mechanism, performing correlation analysis on the initial image features and the initial text features to obtain intermediate image features and intermediate text features;
performing cross-modal analysis on the intermediate image features and the intermediate text features according to a set heterogeneous attention mechanism to obtain heterogeneous image features and heterogeneous text features;
analyzing the heterogeneous image features and the heterogeneous text features by using a scorer, and determining a target text matched with the first text; wherein the target text is a text included in the second text.
2. The method of claim 1, wherein the performing a correlation analysis on the initial image feature and the initial text feature according to a set homogeneous attention mechanism to obtain an intermediate image feature and an intermediate text feature comprises:
constructing a graph structure according to the initial image features, the initial text features and a feature space conversion matrix and a mapping matrix obtained by model training;
fusing the characteristics of each node in the graph structure according to a set characteristic updating rule to obtain the fused characteristics of each node; the fusion features comprise image features added with correlation features and text features added with correlation features;
and coding the fusion characteristics to obtain intermediate image characteristics and intermediate text characteristics.
3. The method for processing the text according to claim 2, wherein the constructing a graph structure according to the initial image features, the initial text features and a feature space transformation matrix and a mapping matrix obtained by model training comprises:
determining an initial attention vector of the initial image feature and the initial text feature according to a feature space conversion matrix and a mapping matrix obtained by model training;
mapping the initial attention vector according to a mapping matrix obtained by model training to obtain an attention vector;
constructing a graph structure based on the initial image features, the initial text features, and the attention vector.
4. The method of claim 3, wherein the fusing the features of the nodes in the graph structure according to the set feature update rule to obtain the fused features of the nodes comprises:
screening and normalizing the input features to obtain normalized weights among all nodes in the graph structure;
and determining the fusion characteristics of each node in the graph structure according to the set characteristic mapping matrix, the update rate, the input characteristics, the normalization weight and the attention vector.
5. The method of claim 4, wherein the determining the fusion feature of each node in the graph structure according to the set feature mapping matrix, the update rate, the input feature, the normalized weight, and the attention vector comprises:
calling a feature updating formula, and analyzing the input features and the attention vectors to obtain updating features; the expression of the characteristic updating formula is as follows:
Figure 384137DEST_PATH_IMAGE001
wherein the content of the first and second substances,
Figure 521857DEST_PATH_IMAGE002
the characteristics of the update are represented by,fthe characteristics of the input are represented by,σwhich is indicative of the rate of update,attnf) A vector of attention is represented, and,
Figure 984062DEST_PATH_IMAGE003
which represents the normalized weight of the weight,W d representing a feature mapping matrix;
and superposing the updated feature and the attention vector to obtain a fusion feature.
6. The method of processing text according to claim 1, wherein the performing cross-modal analysis on the intermediate image features and the intermediate text features according to a set heterogeneous attention mechanism to obtain heterogeneous image features and heterogeneous text features comprises:
constructing a heterogeneous graph structure according to the intermediate image features, the intermediate text features and a feature space conversion matrix and a mapping matrix obtained by model training;
fusing the characteristics of each node in the heterogeneous graph structure according to a set heterogeneous characteristic updating rule to obtain heterogeneous fusion characteristics of each node; the heterogeneous fusion features comprise image features added with heterogeneous features and text features added with heterogeneous features;
and coding the heterogeneous fusion characteristics to obtain heterogeneous image characteristics and heterogeneous text characteristics.
7. The method for processing the text according to claim 6, wherein the constructing the heterogeneous graph structure according to the intermediate image features, the intermediate text features and the feature space transformation matrix and the mapping matrix obtained by model training comprises:
determining the initial attention-crossing vector of the intermediate image feature and the intermediate text feature according to a feature space conversion matrix and a mapping matrix obtained by model training;
mapping the initial attention-crossing vector according to a mapping matrix obtained by model training to obtain an attention-crossing vector;
constructing a allopgram structure based on the intermediate image features, the intermediate text features, and the cross-attention vector.
8. The method of claim 6, wherein the fusing the features of the nodes in the heterogeneous graph structure according to the set heterogeneous feature update rule to obtain the heterogeneous fused features of the nodes comprises:
screening and normalizing the intermediate image features and the intermediate text features to obtain heterogeneous normalization weights among nodes in the heterogeneous graph structure;
and determining heterogeneous fusion characteristics of each node in the heterogeneous graph structure according to the set characteristic mapping matrix, the update rate, the intermediate image characteristics, the intermediate text characteristics, the heterogeneous normalization weight and the cross-attention vector.
9. The method of claim 8, wherein the determining the heterogeneous fusion features of each node in the heterogeneous graph structure according to the set feature mapping matrix, the update rate, the intermediate image features, the intermediate text features, the heterogeneous normalization weights, and the cross-attention vector comprises:
calling a first heterogeneous feature updating formula, and analyzing the intermediate image feature, the intermediate text feature and the cross-attention vector to obtain a first heterogeneous updating feature; the expression of the first heterogeneous characteristic updating formula is as follows:
Figure 207233DEST_PATH_IMAGE004
(1);
calling a second heterogeneous feature updating formula, and analyzing the intermediate image feature, the intermediate text feature and the cross-attention vector to obtain a second heterogeneous updating feature; the expression of the second heterogeneous characteristic updating formula is as follows:
Figure 413087DEST_PATH_IMAGE005
(2);
wherein the content of the first and second substances,
Figure 670893DEST_PATH_IMAGE006
the heterogeneous update characteristics are represented by the data of the mobile terminal,pa feature of the intermediate text is represented,gthe characteristics of the intermediate image are represented,σwhich is indicative of the rate of update,crossattnp,g) A cross-attention vector is represented that spans the attention vector,
Figure 569579DEST_PATH_IMAGE007
which represents the normalized weight of the weight,W d representing a feature mapping matrix;
and superposing the heterogeneous updating features and the cross-attention vector to obtain heterogeneous fusion features.
10. The method of any one of claims 1 to 9, wherein the analyzing the heterogeneous image features and the heterogeneous text features by using a scorer, and the determining the target text matching the first text comprises:
coding the heterogeneous image features and the heterogeneous text features to obtain coding features;
taking the coding features as input features of the scorer to obtain probability scores corresponding to the second texts;
and taking the second text with the highest probability score as the target text matched with the first text.
11. The method of claim 1, wherein the first text is a question text and the second text is an answer text.
12. The method for processing text according to claim 11, wherein the answer text is plural; the analyzing the heterogeneous image features and the heterogeneous text features by using a scorer, and the determining the target text matched with the first text comprises:
and analyzing the heterogeneous image characteristics and the heterogeneous text characteristics by using a scorer so as to screen an answer text matched with the question text from a plurality of answer texts.
13. The text processing device is characterized by comprising a coding unit, a correlation analysis unit, a cross-mode analysis unit and a matching unit;
the coding unit is used for coding the acquired image to be analyzed and the acquired text to obtain input characteristics; wherein the text comprises a first text and a second text; the first text and the second text have a mapping relation; the input features comprise initial image features and initial text features;
the correlation analysis unit is used for carrying out correlation analysis on the initial image features and the initial text features according to a set homogeneous attention mechanism to obtain intermediate image features and intermediate text features;
the cross-modal analysis unit is used for performing cross-modal analysis on the intermediate image features and the intermediate text features according to a set heterogeneous attention mechanism to obtain heterogeneous image features and heterogeneous text features;
the matching unit is used for analyzing the heterogeneous image characteristics and the heterogeneous text characteristics by using a scorer and determining a target text matched with the first text.
14. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to carry out the steps of the method of processing text according to any one of claims 1 to 12.
15. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the method for processing text according to any one of claims 1 to 12.
CN202210762364.8A 2022-06-30 2022-06-30 Text processing method, device, equipment and medium Active CN114821605B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210762364.8A CN114821605B (en) 2022-06-30 2022-06-30 Text processing method, device, equipment and medium
PCT/CN2022/141186 WO2024001100A1 (en) 2022-06-30 2022-12-22 Method and apparatus for processing text, and device and non-volatile readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210762364.8A CN114821605B (en) 2022-06-30 2022-06-30 Text processing method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN114821605A true CN114821605A (en) 2022-07-29
CN114821605B CN114821605B (en) 2022-11-25

Family

ID=82522683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210762364.8A Active CN114821605B (en) 2022-06-30 2022-06-30 Text processing method, device, equipment and medium

Country Status (2)

Country Link
CN (1) CN114821605B (en)
WO (1) WO2024001100A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115310611A (en) * 2022-10-12 2022-11-08 苏州浪潮智能科技有限公司 Figure intention reasoning method and related device
WO2024001100A1 (en) * 2022-06-30 2024-01-04 苏州元脑智能科技有限公司 Method and apparatus for processing text, and device and non-volatile readable storage medium
WO2024098533A1 (en) * 2022-11-08 2024-05-16 苏州元脑智能科技有限公司 Image-text bidirectional search method, apparatus and device, and non-volatile readable storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117876651B (en) * 2024-03-13 2024-05-24 浪潮电子信息产业股份有限公司 Visual positioning method, device, equipment and medium
CN117992800A (en) * 2024-03-29 2024-05-07 浪潮电子信息产业股份有限公司 Image-text data matching detection method, device, equipment and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761153A (en) * 2021-05-19 2021-12-07 腾讯科技(深圳)有限公司 Question and answer processing method and device based on picture, readable medium and electronic equipment
CN114283430A (en) * 2021-12-03 2022-04-05 苏州大创科技有限公司 Cross-modal image-text matching training method and device, storage medium and electronic equipment
CN114462356A (en) * 2022-04-11 2022-05-10 苏州浪潮智能科技有限公司 Text error correction method, text error correction device, electronic equipment and medium
CN114511472A (en) * 2022-04-19 2022-05-17 苏州浪潮智能科技有限公司 Visual positioning method, device, equipment and medium
CN114511860A (en) * 2022-04-19 2022-05-17 苏州浪潮智能科技有限公司 Difference description statement generation method, device, equipment and medium
CN114625909A (en) * 2022-03-24 2022-06-14 北京明略昭辉科技有限公司 Image text selection method and device, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2537106A4 (en) * 2009-12-18 2013-10-02 Morningside Analytics Llc System and method for attentive clustering and related analytics and visualizations
CN113435203B (en) * 2021-08-30 2021-11-30 华南师范大学 Multi-modal named entity recognition method and device and electronic equipment
CN113901781B (en) * 2021-09-15 2024-04-26 昆明理工大学 Similar case matching method integrating segment coding and affine mechanism
CN114461821A (en) * 2022-02-24 2022-05-10 中南大学 Cross-modal image-text inter-searching method based on self-attention reasoning
CN114821605B (en) * 2022-06-30 2022-11-25 苏州浪潮智能科技有限公司 Text processing method, device, equipment and medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761153A (en) * 2021-05-19 2021-12-07 腾讯科技(深圳)有限公司 Question and answer processing method and device based on picture, readable medium and electronic equipment
CN114283430A (en) * 2021-12-03 2022-04-05 苏州大创科技有限公司 Cross-modal image-text matching training method and device, storage medium and electronic equipment
CN114625909A (en) * 2022-03-24 2022-06-14 北京明略昭辉科技有限公司 Image text selection method and device, electronic equipment and storage medium
CN114462356A (en) * 2022-04-11 2022-05-10 苏州浪潮智能科技有限公司 Text error correction method, text error correction device, electronic equipment and medium
CN114511472A (en) * 2022-04-19 2022-05-17 苏州浪潮智能科技有限公司 Visual positioning method, device, equipment and medium
CN114511860A (en) * 2022-04-19 2022-05-17 苏州浪潮智能科技有限公司 Difference description statement generation method, device, equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MINGYAN WU等: ""Hierarchical Semantic Enhanced Directional Graph Network for Visual Commonsense Reasoning"", 《TRUSTWORTHY AI21》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024001100A1 (en) * 2022-06-30 2024-01-04 苏州元脑智能科技有限公司 Method and apparatus for processing text, and device and non-volatile readable storage medium
CN115310611A (en) * 2022-10-12 2022-11-08 苏州浪潮智能科技有限公司 Figure intention reasoning method and related device
WO2024077891A1 (en) * 2022-10-12 2024-04-18 苏州元脑智能科技有限公司 Character intention reasoning method and related apparatus
WO2024098533A1 (en) * 2022-11-08 2024-05-16 苏州元脑智能科技有限公司 Image-text bidirectional search method, apparatus and device, and non-volatile readable storage medium

Also Published As

Publication number Publication date
CN114821605B (en) 2022-11-25
WO2024001100A1 (en) 2024-01-04

Similar Documents

Publication Publication Date Title
CN114821605B (en) Text processing method, device, equipment and medium
JP7464752B2 (en) Image processing method, device, equipment, and computer program
CN111522962A (en) Sequence recommendation method and device and computer-readable storage medium
EP4109347A2 (en) Method for processing multimodal data using neural network, device, and medium
CN108665055B (en) Method and device for generating graphic description
CN114511860B (en) Difference description statement generation method, device, equipment and medium
CN115438215B (en) Image-text bidirectional search and matching model training method, device, equipment and medium
CN114462356B (en) Text error correction method and device, electronic equipment and medium
CN107832794A (en) A kind of convolutional neural networks generation method, the recognition methods of car system and computing device
CN115310611B (en) Figure intention reasoning method and related device
CN114676704A (en) Sentence emotion analysis method, device and equipment and storage medium
CN115168592B (en) Statement emotion analysis method, device and equipment based on aspect categories
CN111046158B (en) Question-answer matching method, model training method, device, equipment and storage medium
CN115129848B (en) Method, device, equipment and medium for processing visual question-answering task
CN114780768A (en) Visual question-answering task processing method and system, electronic equipment and storage medium
CN113222813A (en) Image super-resolution reconstruction method and device, electronic equipment and storage medium
CN113094533B (en) Image-text cross-modal retrieval method based on mixed granularity matching
CN115862031B (en) Text processing method, neural network training method, device and equipment
CN115905591B (en) Visual question-answering method, system, equipment and readable storage medium
CN115906861B (en) Sentence emotion analysis method and device based on interaction aspect information fusion
CN112070852A (en) Image generation method and system, and data processing method
CN113688232B (en) Method and device for classifying bid-inviting text, storage medium and terminal
CN115809325A (en) Document processing model training method, document processing method, device and equipment
CN115952266A (en) Question generation method and device, computer equipment and storage medium
CN115470798A (en) Training method of intention recognition model, intention recognition method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant