CN114821605A

CN114821605A - Text processing method, device, equipment and medium

Info

Publication number: CN114821605A
Application number: CN202210762364.8A
Authority: CN
Inventors: 李晓川; 赵雅倩; 李仁刚; 郭振华; 范宝余
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2022-06-30
Filing date: 2022-06-30
Publication date: 2022-07-29
Anticipated expiration: 2042-06-30
Also published as: WO2024001100A1; CN114821605B

Abstract

The application relates to the technical field of artificial intelligence, and discloses a text processing method, a device, equipment and a medium, which are used for coding acquired images to be analyzed and texts to obtain input characteristics; the text comprises a first text and a second text; the input features include initial image features and initial text features. According to a set homogeneous attention mechanism, carrying out correlation analysis on the initial image characteristics and the initial text characteristics to obtain intermediate image characteristics and intermediate text characteristics; and performing cross-modal analysis on the intermediate image features and the intermediate text features according to a set heterogeneous attention mechanism to obtain heterogeneous image features and heterogeneous text features. And analyzing the heterogeneous image characteristics and the heterogeneous text characteristics by using a scorer to determine a target text matched with the first text. By setting a homogeneous attention mechanism and a heterogeneous attention mechanism, the attributes of the multi-mode features are fully mined, and the target text matched with the first text can be screened out more accurately.

Description

Text processing method, device, equipment and medium

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a text processing method, apparatus, device, and computer-readable storage medium.

Background

Visual common sense Reasoning (VCR) refers to selecting an answer that best fits the description of a quiz sentence among 4 options and selecting a theoretical basis for supporting the answer among the additional 4 options according to a specified image. The multimode artificial intelligence generally relates to multimode data input such as vision, voice, text, various types of sensing signals and the like, which is similar to the situation appearing in a daily scene, so that the multimode artificial intelligence has a better landing prospect and becomes one of the current international mainstream research directions. The VCR task is a branch of the multi-modal field, and falls within the domain of multi-modal comprehension, which is intended to enable computers to gain the ability to "understand", i.e., to respond by viewing images, based on the target characters involved in the question. The VCR task provides 4 options for the answer and the computer needs to select the best qualified output among the 4 options.

the interface of the input and the output of the transformer structure is relatively flexible, and the dimension of the characteristic is not changed by the structure. At the present stage, the most extensive is a visual common sense reasoning system based on a transformer structure, and answers which best accord with the description of the question sentences are selected for the designated images. First, an input image and several pieces of text are encoded: the image is encoded using a convolutional neural network. Inputting a question sentence, a candidate answer sentence and a candidate explanation sentence, and performing feature extraction by adopting a ready-made text encoder. The candidate answers and the reasonable probabilities of the candidate interpretations are represented by fixed characters, which are represented by a fixed vector code, i.e., probability embedded vectors.

The method realizes the joint coding of the multi-modal characteristics by stacking the transform structures, thereby realizing the interaction among different modal characteristics, and finally predicting the probability that the current candidate answer and the explanation meet the requirements by decoding the specified position characteristics. The transform structure of the full connection layer is simple and roughly splices all the features together, and the relation among all the features is calculated through an attention mechanism, so that the learning difficulty of the model is increased.

Therefore, how to improve the feature screening capability of the model without increasing the difficulty of model learning is a problem to be solved by those skilled in the art.

Disclosure of Invention

The embodiment of the application aims to provide a text processing method, a text processing device, text processing equipment and a computer readable storage medium, which can improve the feature screening capability of a model without increasing the learning difficulty of the model.

In order to solve the foregoing technical problem, an embodiment of the present application provides a text processing method, including:

coding the acquired image and text to be analyzed to obtain input characteristics; wherein the text comprises a first text and a second text; the first text and the second text have a mapping relation; the input features comprise initial image features and initial text features;

according to a set homogeneous attention mechanism, performing correlation analysis on the initial image features and the initial text features to obtain intermediate image features and intermediate text features;

performing cross-modal analysis on the intermediate image features and the intermediate text features according to a set heterogeneous attention mechanism to obtain heterogeneous image features and heterogeneous text features;

analyzing the heterogeneous image features and the heterogeneous text features by using a scorer, and determining a target text matched with the first text; wherein the target text is a text included in the second text.

Optionally, the performing, according to a set homogeneous attention mechanism, correlation analysis on the initial image feature and the initial text feature to obtain an intermediate image feature and an intermediate text feature includes:

constructing a graph structure according to the initial image features, the initial text features and a feature space conversion matrix and a mapping matrix obtained by model training;

fusing the characteristics of each node in the graph structure according to a set characteristic updating rule to obtain the fused characteristics of each node; the fusion features comprise image features added with correlation features and text features added with correlation features;

and coding the fusion characteristics to obtain intermediate image characteristics and intermediate text characteristics.

Optionally, the constructing a graph structure according to the initial image feature, the initial text feature, and a feature space transformation matrix and a mapping matrix obtained by model training includes:

determining initial attention vectors of the initial image features and the initial text features according to a feature space conversion matrix and a mapping matrix obtained by model training;

mapping the initial attention vector according to a mapping matrix obtained by model training to obtain an attention vector;

constructing a graph structure based on the initial image features, the initial text features, and the attention vector.

Optionally, the fusing the features of each node in the graph structure according to the set feature update rule to obtain the fused feature of each node includes:

screening and normalizing the input features to obtain normalized weights among all nodes in the graph structure;

and determining the fusion characteristics of each node in the graph structure according to the set characteristic mapping matrix, the update rate, the input characteristics, the normalization weight and the attention vector.

Optionally, the determining, according to the set feature mapping matrix, the update rate, the input feature, the normalization weight, and the attention vector, the fusion feature of each node in the graph structure includes:

calling a feature updating formula, and analyzing the input features and the attention vectors to obtain updating features; the expression of the characteristic updating formula is as follows:

；

wherein,

the characteristics of the update are represented by,fthe characteristics of the input are represented by,σwhich is indicative of the rate of update,attn（f) A vector of attention is represented that is representative of an attention vector,

which represents the normalized weight of the weight,W _d representing a feature mapping matrix;

and superposing the updated feature and the attention vector to obtain a fusion feature.

Optionally, the performing cross-modal analysis on the intermediate image feature and the intermediate text feature according to a set heterogeneous attention mechanism to obtain a heterogeneous image feature and a heterogeneous text feature includes:

constructing a heterogeneous graph structure according to the intermediate image features, the intermediate text features and a feature space conversion matrix and a mapping matrix obtained by model training;

fusing the characteristics of each node in the heterogeneous graph structure according to a set heterogeneous characteristic updating rule to obtain heterogeneous fusion characteristics of each node; the heterogeneous fusion features comprise image features added with heterogeneous features and text features added with heterogeneous features;

and coding the heterogeneous fusion characteristics to obtain heterogeneous image characteristics and heterogeneous text characteristics.

Optionally, the constructing a heterogeneous graph structure according to the intermediate image feature, the intermediate text feature, and a feature space transformation matrix and a mapping matrix obtained by model training includes:

determining initial cross-attention vectors of the intermediate image features and the intermediate text features according to a feature space conversion matrix and a mapping matrix obtained by model training;

mapping the initial attention-crossing vector according to a mapping matrix obtained by model training to obtain an attention-crossing vector;

constructing a allopgram structure based on the intermediate image features, the intermediate text features, and the cross-attention vector.

Optionally, the fusing the features of each node in the heterogeneous graph structure according to the set heterogeneous feature update rule to obtain the heterogeneous fusion features of each node includes:

screening and normalizing the intermediate image features and the intermediate text features to obtain heterogeneous normalization weights among nodes in the heterogeneous graph structure;

and determining heterogeneous fusion characteristics of each node in the heterogeneous graph structure according to the set characteristic mapping matrix, the update rate, the intermediate image characteristics, the intermediate text characteristics, the heterogeneous normalization weight and the cross-attention vector.

Optionally, the determining, according to the set feature mapping matrix, the update rate, the intermediate image feature, the intermediate text feature, the heterogeneous normalization weight, and the cross-attention vector, a heterogeneous fusion feature of each node in the heterogeneous graph structure includes:

calling a first heterogeneous feature updating formula, and analyzing the intermediate image feature, the intermediate text feature and the cross-attention vector to obtain a first heterogeneous updating feature; the expression of the first heterogeneous characteristic updating formula is as follows:

（1）；

calling a second heterogeneous feature updating formula, and analyzing the intermediate image feature, the intermediate text feature and the cross-attention vector to obtain a second heterogeneous updating feature; the expression of the second heterogeneous characteristic updating formula is as follows:

（2）；

wherein,

the heterogeneous update characteristics are represented by the data of the mobile terminal,pa feature of the intermediate text is represented,gthe characteristics of the intermediate image are represented,σwhich is indicative of the rate of update,crossattn（p,g) A cross-attention vector is represented that spans the attention vector,

and superposing the heterogeneous updating features and the cross-attention vector to obtain heterogeneous fusion features.

Optionally, the analyzing the heterogeneous image features and the heterogeneous text features by using a scorer, and determining the target text matched with the first text includes:

coding the heterogeneous image features and the heterogeneous text features to obtain coding features;

taking the coding features as input features of the scorer to obtain probability scores corresponding to the second texts;

and taking the second text with the highest probability score as the target text matched with the first text.

Optionally, the first text is a question text, and the second text is an answer text.

Optionally, the answer text is multiple; the analyzing the heterogeneous image features and the heterogeneous text features by using a scorer and determining the target text matched with the first text comprises the following steps:

and analyzing the heterogeneous image characteristics and the heterogeneous text characteristics by using a scorer so as to screen an answer text matched with the question text from a plurality of answer texts.

The embodiment of the application also provides a text processing device, which comprises a coding unit, a correlation analysis unit, a cross-mode analysis unit and a matching unit;

the encoding unit is used for encoding the acquired image to be analyzed and the text to obtain input characteristics; wherein the text comprises a first text and a second text; the first text and the second text have a mapping relation; the input features comprise initial image features and initial text features;

the correlation analysis unit is used for carrying out correlation analysis on the initial image features and the initial text features according to a set homogeneous attention mechanism to obtain intermediate image features and intermediate text features;

the cross-modal analysis unit is used for performing cross-modal analysis on the intermediate image features and the intermediate text features according to a set heterogeneous attention mechanism to obtain heterogeneous image features and heterogeneous text features;

the matching unit is used for analyzing the heterogeneous image characteristics and the heterogeneous text characteristics by using a scorer and determining a target text matched with the first text; wherein the target text is a text included in the second text.

Optionally, the correlation analysis unit comprises a construction subunit, a fusion subunit and an encoding subunit;

the construction subunit is configured to construct a graph structure according to the initial image feature, the initial text feature, and a feature space transformation matrix and a mapping matrix obtained by model training;

the fusion subunit is configured to fuse the features of each node in the graph structure according to a set feature update rule to obtain a fusion feature of each node; the fusion features comprise image features added with correlation features and text features added with correlation features;

and the coding subunit is used for coding the fusion characteristics to obtain intermediate image characteristics and intermediate text characteristics.

Optionally, the constructing subunit is configured to determine, according to a feature space transformation matrix and a mapping matrix obtained by model training, an initial attention vector of the initial image feature and the initial text feature; mapping the initial attention vector according to a mapping matrix obtained by model training to obtain an attention vector; constructing a graph structure based on the initial image features, the initial text features, and the attention vector.

Optionally, the fusion subunit is configured to perform screening and normalization processing on the input features, so as to obtain a normalization weight between nodes in the graph structure; and determining the fusion characteristics of each node in the graph structure according to the set characteristic mapping matrix, the update rate, the input characteristics, the normalization weight and the attention vector.

Optionally, the fusion subunit is configured to invoke a feature update formula, and analyze the input feature and the attention vector to obtain an update feature; the expression of the characteristic updating formula is as follows:

；

wherein,

the characteristics of the update are represented by,fthe characteristics of the input are represented by,σwhich is indicative of the rate of update,attn（f) A vector of attention is represented, and,

Optionally, the cross-modal analysis unit includes a construction subunit, a fusion subunit, and an encoding subunit;

the constructing subunit is configured to construct a heterogeneous graph structure according to the intermediate image features, the intermediate text features, and a feature space transformation matrix and a mapping matrix obtained by model training;

the fusion subunit is configured to fuse, according to a set heterogeneous feature update rule, features of each node in the heterogeneous graph structure to obtain a heterogeneous fusion feature of each node; the heterogeneous fusion features comprise image features added with heterogeneous features and text features added with heterogeneous features;

and the coding subunit is used for coding the heterogeneous fusion characteristics to obtain heterogeneous image characteristics and heterogeneous text characteristics.

Optionally, the constructing subunit is configured to determine an initial cross-attention vector of the intermediate image feature and the intermediate text feature according to a feature space transformation matrix and a mapping matrix obtained by model training;

Optionally, the fusion subunit is configured to perform screening and normalization processing on the intermediate image features and the intermediate text features to obtain heterogeneous normalization weights between nodes in the heterogeneous graph structure;

Optionally, the fusion subunit is configured to invoke a first heterogeneous feature update formula, and analyze the intermediate image feature, the intermediate text feature, and the cross-attention vector to obtain a first heterogeneous update feature; the expression of the first heterogeneous characteristic updating formula is as follows:

（1）；

（2）；

wherein,

the heterogeneous update characteristics are represented by the data of the mobile terminal,pthe intermediate text feature is represented by a representation of,gthe characteristics of the intermediate image are represented,σwhich is indicative of the rate of update,crossattn（p,g) A cross-attention vector is represented that spans the attention vector,

Optionally, the matching unit includes a feature encoding unit, an input unit, and a processing unit;

the feature coding unit is used for coding the heterogeneous image features and the heterogeneous text features to obtain coding features;

the input unit is used for taking the coding features as the input features of the scorer to obtain the probability scores corresponding to the second texts;

the acting unit is used for taking the second text with the highest probability score as the target text matched with the first text.

Optionally, the answer text is multiple; the matching unit is used for analyzing the heterogeneous image characteristics and the heterogeneous text characteristics by utilizing a scorer so as to screen an answer text matched with the question text from a plurality of answer texts.

An embodiment of the present application further provides an electronic device, including:

a memory for storing a computer program;

a processor for executing the computer program to implement the steps of the processing method as described above.

Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the processing method of the foregoing text.

According to the technical scheme, the acquired image to be analyzed and the acquired text are coded to obtain the input characteristics; wherein the text may include a first text and a second text; the input features include initial image features and initial text features. In order to fully mine the relevance between the image to be analyzed and the text, the relevance analysis can be carried out on the initial image characteristic and the initial text characteristic according to a set homogeneous attention mechanism to obtain an intermediate image characteristic and an intermediate text characteristic; and performing cross-modal analysis on the intermediate image features and the intermediate text features according to a set heterogeneous attention mechanism to obtain heterogeneous image features and heterogeneous text features. And analyzing the heterogeneous image characteristics and the heterogeneous text characteristics by using a scorer to determine a target text matched with the first text. In the technical scheme, the image features and the text features are different in type and belong to multi-mode features, the attributes of the multi-mode features can be fully mined by setting a homogeneous attention mechanism and a heterogeneous attention mechanism, the learning difficulty of the model cannot be increased in the implementation process, the target text matched with the first text can be more accurately screened out based on the mined features, and the feature screening capability of the model is improved while the learning difficulty of the model is not increased.

Drawings

In order to more clearly illustrate the embodiments of the present application, the drawings needed for the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.

Fig. 1 is a flowchart of a text processing method according to an embodiment of the present application;

fig. 2 is a schematic structural diagram of an inference system for selecting text answers according to an embodiment of the present application;

fig. 3 is a schematic structural diagram of a homogeneous attention layer according to an embodiment of the present disclosure;

FIG. 4 is a schematic structural diagram of a heterogeneous attention layer according to an embodiment of the present disclosure;

fig. 5 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application;

fig. 6 is a block diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without any creative effort belong to the protection scope of the present application.

The terms "including" and "having," and any variations thereof, in the description and claims of this application and the drawings described above, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may include other steps or elements not expressly listed.

In order that those skilled in the art will better understand the disclosure, the following detailed description will be given with reference to the accompanying drawings.

Next, a method for processing a text provided in an embodiment of the present application will be described in detail. Fig. 1 is a flowchart of a text processing method provided in an embodiment of the present application, where the method includes:

s101: and coding the acquired image and text to be analyzed to obtain input characteristics.

Wherein the text may include a first text and a second text; the first text and the second text have a mapping relation; the input features may include initial image features and initial text features.

In an embodiment of the present application, the second text may include a candidate answer and a candidate explanation. Wherein the candidate interpretation is a resolution of the candidate answer. In practical application, the candidate answers may be first analyzed as an answer file, and after one candidate answer that is most matched with the first text is selected from all the candidate answers, the most matched candidate answer and the candidate explanation may be further analyzed as an answer file, and the candidate explanation that is most matched with the first text is selected from all the candidate explanations.

The encoding processing of the image and text to be analyzed is a conventional way, and is not described herein again.

S102: and according to a set homogeneous attention mechanism, carrying out correlation analysis on the initial image characteristic and the initial text characteristic to obtain an intermediate image characteristic and an intermediate text characteristic.

In the embodiment of the application, in order to fully mine the relevance of the initial image feature and the initial text feature, a homogeneous attention mechanism and a heterogeneous attention mechanism are set. Wherein, the homogeneous attention mechanism can be that all the characteristics are subjected to relevance analysis, and all the characteristics comprise initial image characteristics and initial text characteristics; that is, the attention-homogeneity mechanism may analyze the correlation between the initial image features, the correlation between the initial text features, and the correlation between the initial image features and the initial text features.

In practical application, a homogeneous attention layer can be built based on a homogeneous attention mechanism, and a heterogeneous attention layer can be built based on a heterogeneous attention mechanism.

Fig. 2 is a schematic structural diagram of an inference system for text answer selection according to an embodiment of the present application, and fig. 2 is an example in which an answer file includes candidate answers and candidate interpretations. The inference system of fig. 2 includes feature concatenation, composite features, homogenous attention layers, heterogeneous attention layers, coded features, and a scorer. The feature splicing and the comprehensive features can be used for processing the acquired image and text to be analyzed to obtain initial image features and initial text features. In practical applications, continuous iterative processing needs to be performed on the image features and the text features so as to fully mine the association between the image features and the text features, and therefore, as can be seen from fig. 2, a plurality of homogeneous attention layers and a plurality of heterogeneous attention layers are provided, and each homogeneous attention layer has its corresponding heterogeneous attention layer. The input of each heterogeneous attention layer is the output of the last homogeneous attention layer adjacent to it.

The operation flow required to be executed by each homogeneous attention layer is the same, so in the embodiment of the present application, the description is given by taking the processing flow of one homogeneous attention layer as an example. Similarly, the operation flow required to be executed by each heterogeneous attention layer is the same, and therefore, in the embodiment of the present application, the description is given by taking the processing flow of one heterogeneous attention layer as an example.

The image features and the text features are of different types and belong to multi-modal features. In addition, because the graph structure has a feature aggregation function, the graph structure and the transform structure can be combined, and a graph attention mechanism is designed. The graph attention mechanism aims to solve the problem of correlation among cross-modal features, thereby improving feature effectiveness.

In specific implementation, a graph structure can be constructed according to the initial image features, the initial text features, and a feature space transformation matrix and a mapping matrix obtained by model training.

The number of nodes in the graph structure is the same as the number of all the features of the initial image feature and the initial text feature, one node represents one feature, and the edges of the graph structure can be assigned by the attention weights corresponding to the initial image feature and the initial text feature.

After the graph structure is constructed, in order to fully mine hidden information among the features, the features of each node in the graph structure can be fused according to a set feature update rule to obtain the fusion features of each node; wherein the fusion features comprise image features added with correlation features and text features added with correlation features.

And coding the fusion features to obtain intermediate image features and intermediate text features. In the embodiment of the application, features are mined in a mode of combining a graph structure and a transform structure, so after the fused features are obtained, transform coding can be performed on the fused features, and intermediate image features and intermediate text features are obtained.

S103: and performing cross-modal analysis on the intermediate image features and the intermediate text features according to a set heterogeneous attention mechanism to obtain heterogeneous image features and heterogeneous text features.

The heterogeneous attention mechanism can be used for mining the relevance between the image feature and the text feature after the homogeneous attention mechanism is processed.

In specific implementation, the heterogeneous graph structure can be constructed according to the image features, the text features and the feature space transformation matrix and the mapping matrix obtained by model training.

The number of nodes in the heterogeneous graph structure is the same as the number of features input to the heterogeneous attention layer, one node represents one feature, and the edges of the heterogeneous graph structure can be assigned by the heterogeneous attention weight corresponding to the input feature.

After the heterogeneous graph structure is constructed, in order to fully mine hidden information among the multi-modal features, the features of each node in the heterogeneous graph structure can be fused according to a set heterogeneous feature updating rule to obtain heterogeneous fusion features of each node; the heterogeneous fusion features comprise image features added with heterogeneous features and text features added with heterogeneous features;

and encoding the heterogeneous fusion characteristics to obtain heterogeneous image characteristics and heterogeneous text characteristics. In the embodiment of the application, the feature is mined in a mode of combining a heterogeneous graph structure and a transform structure, so after the heterogeneous fusion feature is obtained, transform coding can be performed on the heterogeneous fusion feature to obtain a heterogeneous image feature and a heterogeneous text feature.

S104: and analyzing the heterogeneous image characteristics and the heterogeneous text characteristics by using a scorer to determine a target text matched with the first text.

With the structure of the inference system shown in fig. 2, after the initial image features and the initial text features are sequentially processed by the multiple homogeneous attention layers and the heterogeneous attention layers, the heterogeneous image features and the heterogeneous text features output by the last heterogeneous attention layer can be finally obtained.

In order to select a target text matched with the first text by using the grader, the heterogeneous image characteristics and the heterogeneous text characteristics can be coded to obtain coding characteristics; taking the coding characteristics as input characteristics of a scorer to obtain probability scores corresponding to the second texts; and taking the second text with the highest probability score as the target text matched with the first text.

When there are a plurality of second texts with the highest probability scores, any one of the second texts with the highest probability scores may be selected as the target text matched with the first text, or all of the plurality of second texts with the highest probability scores may be selected as the target text matched with the first text.

In a specific application, the first text may be a question text and the second text may be an answer text.

Answer texts are often multiple. In practical application, the heterogeneous image features and the heterogeneous text features can be analyzed by a grader to screen out one answer text matched with the question text from the multiple answer texts.

In practical applications, a corresponding structure may be set based on operations to be performed by a homogeneous attention layer, and fig. 3 is a schematic structural diagram of a homogeneous attention layer provided in an embodiment of the present application, where the homogeneous attention layer includes an attention calculation unit, a feature mapping unit, a feature reconstruction unit, a graph operator, a layer normalization unit, a random deletion unit, and an addition unit. The layer normalization unit, the random deletion unit and the addition unit have the same structure as the currently used transform, and the processing flow of these units is not described again. The attention calculation unit, the feature mapping unit, the feature reconstruction unit and the graph operator may be used to construct the graph structure and determine fusion features corresponding to nodes in the graph structure.

For the construction of a graph structure, initial attention vectors of initial image features and initial text features can be determined according to a feature space conversion matrix and a mapping matrix obtained by model training; mapping the initial attention vector according to a mapping matrix obtained by model training to obtain an attention vector; based on the initial image features, the initial text features, and the attention vector, a graph structure may be constructed.

In a specific implementation, the attention calculating unit may perform the calculation of the attention weight according to the following formula,

；

wherein,

，

；W _q andW _k are all characterized space transformation matrices and are,W _q andW _k all sizes of (A) and (B) ared*dA matrix of (a);frepresenting input features obtained by splicing initial image features and initial text features, with the size ofN*d，NThe total number of the features is represented,dthe dimensions of each of the features are represented,attn（f）indicating the attention weight.

NCan be used asNA node, thereby constructing oneNGraph structure of individual nodes. For theNAnd the characteristics can be assigned by the input characteristics. For edges of graph structures, the result can be calculated by attentionattn（f）And (7) assigning values.

The feature mapping unit may map the input features into a feature space, the mapping function being:

，

wherein,W _v represents a mapping matrix of sized*dOf the matrix of (a).

The feature reconstruction unit may be configured to calculate attention, where the attention corresponding to the initial image feature and the initial text feature is:sf（f）= attn（f）×map（f）。

in the embodiment of the application, in order to fully mine hidden information of image features and text features, screening and normalization processing can be performed on input features to obtain normalization weights among nodes in a graph structure; and determining the fusion characteristics of each node in the graph structure according to the set characteristic mapping matrix, the update rate, the input characteristics, the normalized weight and the attention vector.

In a specific implementation, the graph operator may include feature filtering using a non-linear unit, the feature filtering formula being:

，

wherein,LeakyReLuan activation function that represents a deep learning,W _a andW _b are all the feature transfer matrices and are,zthe characteristics after screening are shown.

After feature screening is completed, the screened features may be normalized, with the normalization formula:

，

wherein,expthe operation of the exponent is represented by the exponent operator,αto representNThe normalized weights of the individual nodes with respect to each other,lto representNAny one of the nodes.

After the graph arithmetic unit finishes the normalization operation, a feature updating formula can be called, and the input features and the attention vectors are analyzed to obtain updating features; the expression of the feature update formula is:

；

wherein,

the characteristics of the update are represented by,fthe characteristics of the input are represented by,σwhich is indicative of the rate of update,σis a hyper-parameter that can be set,attn（f) A vector of attention is represented, and,

the updated feature is superposed with the attention vector to obtain a fused featureffI.e. by

。

The fusion features are output after being subjected to layer normalization unit, random deletion unit and ending addition, and the output image features and text features can enter a heterogeneous attention layer for processing.

Fig. 4 is a schematic structural diagram of a heterogeneous attention layer provided in an embodiment of the present application, where the heterogeneous attention layer includes two cross-attention calculation units, two feature mapping units, two feature reconstruction units, two graph operators, two layer normalization units, two random deletion units, and two addition units.

In specific implementation, the initial cross-attention vector of the intermediate image feature and the intermediate text feature can be determined according to a feature space conversion matrix and a mapping matrix obtained by model training; mapping the initial attention-crossing vector according to a mapping matrix obtained by model training to obtain an attention-crossing vector; constructing a heterogeneous graph structure based on the intermediate image features, the intermediate text features and the cross-attention vector.

The cross-attention calculation unit may perform the cross-attention weight calculation according to the following formula,

；

wherein,W _q andW _k are all characterized space transformation matrices and are,W _q andW _k all sizes of (A) and (B) ared*dA matrix of (a);pandgrepresenting the input features, taking the cross-attention calculation unit on the left side in fig. 4 as an example,pthe intermediate text feature is represented by a representation of,grepresenting an intermediate image feature;crossattn（p，g）cross attention weights are represented.

In practical applications, a corresponding structure may be set based on operations to be performed by a heterogeneous attention layer, and fig. 4 is a schematic structural diagram of the heterogeneous attention layer provided in the embodiment of the present application, where the heterogeneous attention layer includes two cross-attention calculation units, two feature mapping units, two feature reconstruction units, two graph operators, two layer normalization units, two random deletion units, and two addition units. The layer normalization unit, the random deletion unit and the addition unit have the same structure as the currently used transform, and the processing flow of these units is not described again. The cross-attention calculation unit, the feature mapping unit, the feature reconstruction unit and the graph operator can be used for constructing a heterogeneous graph structure and determining heterogeneous fusion features corresponding to nodes in the heterogeneous graph structure.

The left-right cross-attention calculation in fig. 4 is the same, and for convenience of description, the left cross-attention calculation is taken as an example for explanation. Input features assuming a heterogeneous attention layer includeNA characteristic thatNIs characterized by asNA node, thereby constructing oneNA heterogeneous graph structure of individual nodes. For theNAnd the characteristics can be assigned by the input characteristics. For edges of a heterogeneous graph structure, cross-attention weights can be calculated from the results of the cross-attention calculationscrossattn（p，g）And (7) assigning values.

After the heterogeneous graph structure is constructed, screening and normalization processing can be carried out on the intermediate image characteristics and the intermediate text characteristics to obtain heterogeneous normalization weights among all nodes in the heterogeneous graph structure; and determining heterogeneous fusion characteristics of each node in the heterogeneous graph structure according to the set characteristic mapping matrix, the update rate, the intermediate image characteristics, the intermediate text characteristics, the heterogeneous normalization weight and the cross-attention vector.

The mapping weight of the graph structure can be calculated by the formula

，

Wherein,LeakyReLuan activation function that represents a deep learning,W _a 、W _b1 andW _b2 are all the feature transfer matrices and are,z _p representing the mapping weights.

After the graph arithmetic unit finishes the normalization operation, a first heterogeneous feature updating formula can be called, and the intermediate image feature, the intermediate text feature and the cross-attention vector are analyzed to obtain a first heterogeneous updating feature; the expression of the first heterogeneous characteristic update formula is as follows:

（1）；

（2）；

wherein,

the heterogeneous update characteristics are represented by the data of the mobile terminal,W _d representing a feature mapping matrix;σthe update rate is a settable hyper-parameter.

And fusing the original cross-attention vector and the updated features to obtain heterogeneous fusion features. And outputting the heterogeneous fusion characteristics after the layer normalization unit, the random deletion unit and the ending addition.

The application provides a thought of fusing an image neural network and an attention mechanism and provides a homogeneous attention mechanism aiming at the common feature effectiveness problem in the existing visual sense common sense reasoning system. In order to solve the problem of heterogeneous attributes of multi-modal characteristics, a heterogeneous attention mechanism is designed, a complete reasoning system for text answer selection is set up based on the heterogeneous attention mechanism, and a target text matched with the first text can be accurately screened out by the reasoning system.

Fig. 5 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application, including an encoding unit 51, a correlation analysis unit 52, a cross-mode analysis unit 53, and a matching unit 54;

the encoding unit 51 is configured to perform encoding processing on the acquired image and text to be analyzed to obtain input features; wherein the text comprises a first text and a second text; the first text and the second text have a mapping relation; the input features comprise initial image features and initial text features;

a correlation analysis unit 52, configured to perform correlation analysis on the initial image feature and the initial text feature according to a set homogeneous attention mechanism, so as to obtain an intermediate image feature and an intermediate text feature;

a cross-modal analysis unit 53, configured to perform cross-modal analysis on the intermediate image feature and the intermediate text feature according to a set heterogeneous attention mechanism, so as to obtain a heterogeneous image feature and a heterogeneous text feature;

and the matching unit 54 is used for analyzing the heterogeneous image characteristics and the heterogeneous text characteristics by using the scorer and determining the target text matched with the first text.

the construction subunit is used for constructing a graph structure according to the initial image characteristics, the initial text characteristics and the characteristic space conversion matrix and the mapping matrix obtained by model training;

the fusion subunit is used for fusing the characteristics of each node in the graph structure according to the set characteristic update rule to obtain the fusion characteristics of each node; the fusion features comprise image features added with correlation features and text features added with correlation features;

Optionally, the constructing subunit is configured to determine an initial attention vector of the initial image feature and the initial text feature according to a feature space transformation matrix and a mapping matrix obtained by model training; mapping the initial attention vector according to a mapping matrix obtained by model training to obtain an attention vector; a graph structure is constructed based on the initial image features, the initial text features, and the attention vector.

Optionally, the fusion subunit is configured to perform screening and normalization processing on the input features to obtain a normalization weight between each node in the graph structure; and determining the fusion characteristics of each node in the graph structure according to the set characteristic mapping matrix, the update rate, the input characteristics, the normalized weight and the attention vector.

Optionally, the fusion subunit is configured to invoke a feature update formula, and analyze the input feature and the attention vector to obtain an update feature; the expression of the feature update formula is:

；

wherein,

and overlapping the updated features and the attention vectors to obtain fused features.

the construction subunit is used for constructing a heterogeneous graph structure according to the image characteristics, the text characteristics and the characteristic space conversion matrix and the mapping matrix obtained by model training;

the fusion subunit is used for fusing the characteristics of each node in the heterogeneous graph structure according to the set heterogeneous characteristic update rule to obtain the heterogeneous fusion characteristics of each node; the heterogeneous fusion features comprise image features added with heterogeneous features and text features added with heterogeneous features;

Optionally, the constructing subunit is configured to determine an initial attention-crossing vector of the intermediate image feature and the intermediate text feature according to a feature space transformation matrix and a mapping matrix obtained by model training;

constructing a heterogeneous graph structure based on the intermediate image features, the intermediate text features and the cross-attention vector.

Optionally, the fusion subunit is configured to invoke a first heterogeneous feature update formula, and analyze the intermediate image feature, the intermediate text feature, and the cross-attention vector to obtain a first heterogeneous update feature; the expression of the first heterogeneous characteristic update formula is as follows:

（1）；

（2）；

wherein,

the heterogeneous update characteristics are represented by the data of the mobile terminal,pthe intermediate text feature is represented by a representation of,gthe characteristics of the intermediate image are represented,σwhich is indicative of the rate of update,crossattn（p,g) A cross-attention vector is represented that is cross-attention,

Optionally, the matching unit comprises a feature encoding unit, an input unit and a processing unit;

the input unit is used for taking the coding characteristics as the input characteristics of the scorer so as to obtain the probability scores corresponding to the second texts;

and the unit is used for taking the second text with the highest probability score as the target text matched with the first text.

Optionally, the answer text is multiple; the matching unit is used for analyzing the heterogeneous image characteristics and the heterogeneous text characteristics by using the scorer so as to screen out one answer text matched with the question text from the multiple answer texts.

The description of the features in the embodiment corresponding to fig. 5 may refer to the related description of the embodiment corresponding to fig. 1, and is not repeated here.

Fig. 6 is a structural diagram of an electronic device according to an embodiment of the present application, and as shown in fig. 6, the electronic device includes: a memory 20 for storing a computer program;

a processor 21 for implementing the steps of the processing method as described in the above embodiment text when executing the computer program.

The electronic device provided by the embodiment may include, but is not limited to, a smart phone, a tablet computer, a notebook computer, or a desktop computer.

The processor 21 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 21 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 21 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 21 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 21 may further include an AI (Artificial Intelligence) processor for processing a calculation operation related to machine learning.

The memory 20 may include one or more computer-readable storage media, which may be non-transitory. Memory 20 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In this embodiment, the memory 20 is at least used for storing the following computer program 201, wherein after being loaded and executed by the processor 21, the computer program can realize the relevant steps of the processing method of the text disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 20 may also include an operating system 202, data 203, and the like, and the storage manner may be a transient storage manner or a permanent storage manner. Operating system 202 may include, among others, Windows, Unix, Linux, and the like. Data 203 may include, but is not limited to, images to be analyzed, text, and the like.

In some embodiments, the electronic device may further include a display 22, an input/output interface 23, a communication interface 24, a power supply 25, and a communication bus 26.

Those skilled in the art will appreciate that the configuration shown in fig. 6 does not constitute a limitation of the electronic device and may include more or fewer components than those shown.

It is to be understood that, if the processing method of the text in the above embodiments is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application, which are essential or part of the prior art, or all or part of the technical solutions may be embodied in the form of a software product, which is stored in a storage medium and executes all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrically erasable programmable ROM, a register, a hard disk, a removable magnetic disk, a CD-ROM, a magnetic or optical disk, and other various media capable of storing program codes.

Based on this, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the processing method of the above text.

The functions of the functional modules of the computer-readable storage medium according to the embodiment of the present invention may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the related description of the foregoing method embodiment, which is not described herein again.

The text processing method, the text processing device, the text processing apparatus, and the computer-readable storage medium provided in the embodiments of the present application are described in detail above. The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

A method, an apparatus, a device and a computer readable storage medium for processing a text provided by the present application are described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present application.

Claims

1. A method for processing text, comprising:

2. The method of claim 1, wherein the performing a correlation analysis on the initial image feature and the initial text feature according to a set homogeneous attention mechanism to obtain an intermediate image feature and an intermediate text feature comprises:

3. The method for processing the text according to claim 2, wherein the constructing a graph structure according to the initial image features, the initial text features and a feature space transformation matrix and a mapping matrix obtained by model training comprises:

determining an initial attention vector of the initial image feature and the initial text feature according to a feature space conversion matrix and a mapping matrix obtained by model training;

4. The method of claim 3, wherein the fusing the features of the nodes in the graph structure according to the set feature update rule to obtain the fused features of the nodes comprises:

5. The method of claim 4, wherein the determining the fusion feature of each node in the graph structure according to the set feature mapping matrix, the update rate, the input feature, the normalized weight, and the attention vector comprises:

；

wherein,

6. The method of processing text according to claim 1, wherein the performing cross-modal analysis on the intermediate image features and the intermediate text features according to a set heterogeneous attention mechanism to obtain heterogeneous image features and heterogeneous text features comprises:

7. The method for processing the text according to claim 6, wherein the constructing the heterogeneous graph structure according to the intermediate image features, the intermediate text features and the feature space transformation matrix and the mapping matrix obtained by model training comprises:

determining the initial attention-crossing vector of the intermediate image feature and the intermediate text feature according to a feature space conversion matrix and a mapping matrix obtained by model training;

8. The method of claim 6, wherein the fusing the features of the nodes in the heterogeneous graph structure according to the set heterogeneous feature update rule to obtain the heterogeneous fused features of the nodes comprises:

9. The method of claim 8, wherein the determining the heterogeneous fusion features of each node in the heterogeneous graph structure according to the set feature mapping matrix, the update rate, the intermediate image features, the intermediate text features, the heterogeneous normalization weights, and the cross-attention vector comprises:

（1）；

（2）；

wherein,

10. The method of any one of claims 1 to 9, wherein the analyzing the heterogeneous image features and the heterogeneous text features by using a scorer, and the determining the target text matching the first text comprises:

11. The method of claim 1, wherein the first text is a question text and the second text is an answer text.

12. The method for processing text according to claim 11, wherein the answer text is plural; the analyzing the heterogeneous image features and the heterogeneous text features by using a scorer, and the determining the target text matched with the first text comprises:

13. The text processing device is characterized by comprising a coding unit, a correlation analysis unit, a cross-mode analysis unit and a matching unit;

the coding unit is used for coding the acquired image to be analyzed and the acquired text to obtain input characteristics; wherein the text comprises a first text and a second text; the first text and the second text have a mapping relation; the input features comprise initial image features and initial text features;

the matching unit is used for analyzing the heterogeneous image characteristics and the heterogeneous text characteristics by using a scorer and determining a target text matched with the first text.

14. An electronic device, comprising:

a memory for storing a computer program;

a processor for executing the computer program to carry out the steps of the method of processing text according to any one of claims 1 to 12.

15. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the method for processing text according to any one of claims 1 to 12.