CN114693949A

CN114693949A - Multi-modal evaluation object extraction method based on regional perception alignment network

Info

Publication number: CN114693949A
Application number: CN202210352426.8A
Authority: CN
Inventors: 李露; 李昕玮; 王启鹏; 华梓萱; 魏素忠; 周爱华; 吴含前; 陈锦铭; 叶迪卓然; 陈烨; 焦昊; 郭雅娟
Original assignee: Southeast University; Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Current assignee: Southeast University; Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Priority date: 2022-04-05
Filing date: 2022-04-05
Publication date: 2022-07-01

Abstract

The invention discloses a multi-modal evaluation object extraction method based on a regional perception alignment network, which constructs a multi-modal evaluation object extraction model with a coding layer, a common attention layer and a decoding layer aiming at the characteristics of social media linguistic data, sets hyper-parameters including sentence length, word length and the like in RAN (radio access network) based on the characteristics of the social media linguistic data in the process of constructing the model, and initializes the parameters by using an Xavier method. The model respectively obtains the language material text and the picture characteristics through a coding layer, then the text and the picture characteristics are fused through a common attention layer to obtain a multi-mode characteristic sequence, and finally the obtained multi-mode characteristic sequence is processed through a decoding layer to obtain a label sequence. The results of comparison tests prove that compared with other models, the multi-mode evaluation object extraction model provided by the invention has the best results.

Description

Multi-modal evaluation object extraction method based on regional perception alignment network

Technical Field

The invention relates to a natural language processing method, in particular to a multi-mode evaluation object extraction method based on a Region-aware Alignment Network (RAN).

Background

The sentiment classification can be divided into four subtasks: (1) evaluation object Extraction (Aspect Term Extraction): a sentence is given, and all evaluation objects appearing in the sentence are extracted; (2) evaluation subject emotion classification (Aspect Term policy): giving a sentence, designating an evaluation object appearing in the sentence, and analyzing the emotional polarity of the sentence aiming at the evaluation object; (3) and (3) evaluation category detection: given a sentence, classifying the evaluation object into a predefined evaluation object class; (4) classifying evaluation objects according to emotion of the categories: given a sentence, the emotion polarity of the evaluation object based on the designated category is determined. Therefore, the evaluation object is extracted as an important subtask in emotion analysis, plays an important precursor role in the whole emotion analysis, and is a premise for obtaining good results in subsequent tasks.

The multi-modal learning refers to a category of information integrating multiple modalities (such as characters, voice, pictures, videos and the like), information fusion is realized by analyzing the relationship among the modalities, and finally the capability of processing and understanding multi-source modality information is realized. As the traditional single-mode field learning develops to a certain height, and the demands of multi-mode applications such as visual question answering and image text bidirectional retrieval are increasing, the multi-mode learning also becomes an important research field. The multi-mode learning can be mainly divided into five research directions of multi-mode representation learning, mode conversion, multi-mode alignment, multi-mode fusion and collaborative learning.

In the past, the extraction of the evaluation objects is mostly concentrated on texts, and the multi-mode evaluation object extraction method is still immature, so that a plurality of problems need to be solved. First, due to informality of social media corpora, the text contains a large number of abbreviations and diseases. If the expression based on the traditional word vector is adopted, a large number of words are treated as unknown words because the words are not in a word stock, and the traditional word vector represents all the unknown words by the same vector, so that the effectiveness of the model is seriously reduced. Furthermore, conventional word vectors will result in the loss of phrase information. Second, the most essential difference between the social media corpus and the traditional corpus is that the social media corpus has corresponding picture information. Generally, pictures and characters in the same corpus are highly related, and an evaluation target related to a text appears as a main part in an image in many cases, but since task-independent information called noise is also present in general, it is essential to reduce the influence of noise when merging picture information.

Disclosure of Invention

The purpose of the invention is as follows: based on the defects of the prior art, the invention provides a multi-modal evaluation object extraction method based on a region-aware alignment network, which has a coding layer, a common attention layer and a decoding layer, and fully utilizes the characteristics of pictures and texts to improve the extraction performance of evaluation objects through a multi-modal evaluation object extraction model based on the region-aware alignment network.

The technical scheme is as follows: a multi-mode evaluation object extraction method based on a regional awareness alignment network can be specifically divided into an encoding layer, a common attention layer and a decoding layer. The coding layer is divided into text and picture parts. The text part fully considers the characteristics of the corpus, generates context-dependent word vector codes for the text by adopting BERT, solves the problem of unknown words by adopting character-level vectors, and finally uses bidirectional LSTM to strengthen the time sequence information of the text sequence. And a picture part for capturing picture characteristics through a fast-RCNN target detection network. In the common attention layer, firstly, the picture representation at each time step is obtained through the attention of text to the picture; then, the attention calculation is carried out through the obtained picture representation and the original text sequence to obtain the text representation on each time step; and finally, carrying out inter-modal fusion on the picture representation and the text representation, and removing noise in the picture through a filter gate. In a decoding layer, a CRF algorithm is used for learning the dependency relationship between outputs, the probability of the output label at each time step is calculated, and the maximum value is used as the prediction label of the time step.

The BERT in the coding layer is a transform-based bi-directional coded representation (BERT) proposed by Google in 2018, and uses a transform Encoder as a core, a main structure of a base version of the BERT is formed by stacking 12 transform encoders, and the core content is a self-attention structure. For the basic attention structure, it can be expressed as:

wherein

Respectively representing query vector, key vector and value directionAmount of the compound (A). For the self-attention structure, the three vectors are all from the same input structure, so that the relation proportion between any two units in the input is obtained through normalization after the inner product of the query vector and the key vector is obtained, and the value vector is weighted and summed based on the proportion to obtain the self-attention representation of the input. BERT performed two tasks of pre-training on 33-hundred million text corpora, mask Language Model (Masked Language Model) and Next sentence prediction (Next sententiality prediction), respectively. The task of masking the language model is to randomly replace 15% of the words in the sentence as [ MASK ]]Predicting the covering word by the model prediction; and (3) randomly replacing the context of the speech in the training process by predicting the next sentence, and finally, judging whether the two sentences are the contexts of each other or not by expressing the sentences obtained by BERT. The first training enables the final word vector to represent relevant information containing context, and the second training enables the model to more accurately depict semantic relations between sentences and articles, so that each word vector finally output by the model can contain information of the whole sentence as much as possible.

The character-level vector in the coding layer is subjected to convolution operation on each word through Char-CNN to obtain a one-dimensional vector representation of the character level of each word, and lexical information of the words can be better captured through the vector, so that the problem of excessive unregistered words in the social media corpus is solved. The model performs vector random initialization on all characters appearing in the corpus and completes words to the same length so as to facilitate batch processing. For a single word w_iIn other words, convolution kernels [ C ] of different sizes are used₁,C₂,…,C_k]The character vectors in the words are convolved in one dimension with step size 1. For convolution kernel C_jObtaining a sequence:

where k is the number of the convolution kernel, l_jIs the size of the convolution kernel. Then, the convolution kernel sequence is maximally pooled over time steps to obtain a vector representation for the convolution kernel:

w′_ij＝MaxPoolld(F_ij)

and then, obtaining the character-level vector representation of the word by splicing the word vector representations corresponding to all the k convolution kernels:

by using the bi-directional LSTM, the model can adequately grasp timing information within the input sequence. Corresponding to the ith corpus, the text vector is represented as T_i＝{w₁,w₂,…,w_mGet the corresponding hidden state sequence H after passing through the LSTM layer_i＝{h₁,h₂,…,h_m}. Wherein w_iAnd m is the corpus length of the result after BERT output and character set vector splicing.

The fast-RCNN in the coding layer obtains picture features by inputting pictures into a target detection network, and the invention considers that in most cases the evaluation object of a text corresponds to one of the objects appearing in the pictures. Therefore, the model adopts N one-dimensional feature vectors of the target object identified by fast-RCNN as picture feature input networks, and the pictures with less than N extracted targets are filled with zero vectors.

The coding layer carries out targeted processing on the input of the two modes, and the text and the picture are respectively coded into vectors and sent to an upper network. For the text, the character-level word vector coding is carried out by using Char-CNN, so that the negative influence caused by too many unknown words in the corpus is weakened, external information is introduced by using a BERT pre-training model, the phrase loss caused by word segmentation is avoided, and more context information is blended into the stacked transform structure in the BERT and the subsequent bidirectional LSTM network as far as possible. For pictures, a foreground object appearing in the pictures is captured by using a target detection network fast-RCNN to serve as corresponding picture characteristics, and the mode fusion of the next step is tried to obtain a better effect.

The common attention layer guides the modal fusion of the whole model, aims to fully utilize picture information to guide text annotation, and is a modelThe important parts of the model include text-oriented visual attention and visually-oriented text attention. In most cases, evaluation objects required to be extracted by a task appear in pictures in an explicit manner, the character-oriented visual attention aims to reasonably utilize an attention mechanism, feature vectors of all target objects are added through different weight distributions at different time steps, and the picture feature vector with the highest image weight of the corresponding evaluation object can be obtained at the time step corresponding to an evaluation object word to be extracted through a learning model, so that the purpose of enhancing the significance of the evaluation object is achieved. Alpha is alpha_tAnd the weight vector of the target object corresponding to the t time step. According to the attention distribution, finally obtaining the picture attention feature representation at the time step t:

the visually-oriented textual attention in the common attention layer is intended to learn attention relationships between text, utilizing image features. Beta is a_tAnd the text weight vector corresponding to the t time step. Finally, according to the attention distribution, obtaining a text attention characteristic representation at the time step t:

the co-attention layer determines the fusion between modalities through a gating module after obtaining the text-oriented visual attention and the visual-oriented text attention. For words at time step t, the two attention modules in the foregoing respectively obtain their pictorial representations

And text representations

The gate control unit is obtained by the following formula:

specifically, the two modal vectors are firstly converted to the same dimension by the fully-connected layer, and then are respectively activated by the tanh activation function. Then obtaining the weight g of the picture vector through a weight matrix and a Sigmoid activation function_tAnd weights for text vectors 1-g_tFinally, weighting and summing the two modal vectors to obtain a multi-modal final representation m at the time step t_t. Wherein

Operators represent join operations, σ represents Sigmoid activation functions,

and

is the parameter to be trained. And after the fusion of the modules, a multi-modal feature sequence with fused pictures and texts is obtained.

After the common attention layer obtains the multi-modal feature sequence fused with the pictures and the texts, the degree of association between the pictures and the texts in the corpus is judged through a filter gate to determine how to use the multi-modal features. This process can be represented by the following equation:

wherein s is_tThe value of the filter gate is between 0 and 1, if the word is not associated with the picture, the filter gate stops the multi-modal characteristics from flowing, and if the word is associated with the picture, the filter gate inputs the multi-modal characteristics into the final representation according to the degree of correlation. u. of_tThe multi-modal representation is filtered by a filter gate.

The final vector representation at the decoding layer is fed for time step t. Wherein

And

as the parameters to be trained, the training parameters,

operators represent join operations, h_tAnd outputting a text hidden vector representing the coding layer.

Through the common attention layer, the model obtains a feature vector sequence after interaction and fusion between two modalities of text and picture

The feature vector sequence firstly obtains a picture representation of attention of each word in the sequence, then learns the dependency relationship between texts by taking the picture representation as a query vector, and finally fuses the two-mode vectors based on the importance degree of the pictures.

The decodingThe CRF in the layer, namely the conditional random field, can learn the dependency relationship of the label sequence, thereby avoiding the prediction which does not conform to the labeling rule and improving the probability of correct prediction. It is classified in units of paths, in other words, for the problem of n, k candidate classes with sequence length, CRF considers it as one k candidate classesⁿAnd (5) a classification problem. Namely: for the sequence x ═ x₁,…,x_n) Finding the conditional probability P (y)₁,…,y_n| x) the largest output sequence. For input sequence X, the score for output sequence Y' is given by:

wherein

To transfer the matrix, A_i,jRepresenting the probability of sequence transfer from tag i to tag j.

In order to be a transmit matrix,

representing classification of i-th word in sentence to tag y_jThe fraction of (c). m is the length of the text sequence, and k is the number of the labels.

In decoding, the CRF layer calculates scores of all output sequences according to the constraints of the score matrix and the transition matrix, so that the sequence with the highest score is selected as the class label sequence of the input sequence X. As shown in the formula:

the model needs to maximize the probability of its true label in the training process, i.e. minimize the negative value of its logarithm, which is calculated as follows:

probability that the tag sequence is Y where P (Y | X) is the sequence X, Y_XFor all possible tag sequences, N is the size of the sample set.

The model is initialized with parameters using the Xavier method. The initialization initializes the bias of each layer network to zero vector, and the initialization of the parameter matrix is in accordance with

Wherein n is the number of parameters. The initialization method can solve the gradient vanishing problem by maintaining a gaussian distribution for the output values of each layer to avoid attenuation of variance of activation values. Furthermore, the model is optimized using an Adam optimizer that can dynamically adjust the learning rate and that allows frequently changing parameters to be updated in smaller steps for different parameters, while sparse parameter updates are larger.

Has the advantages that:

1) the method and the device fully consider the characteristics of the social media corpus, so that the social media corpus has superior performance in the extraction of the evaluation objects of the social media corpus.

2) The invention respectively processes the text and the picture in the corpus in the coding layer, fully extracts the characteristics of the text and the picture in the corpus, and can reduce the cost of model construction by using BERT.

3) The method and the system make full fusion of the text characteristic and the picture characteristic by using an attention mechanism, and effectively extract the evaluation object by using multi-mode information.

4) The method and the device represent the text characteristics by combining the BERT and the Char-CNN, effectively relieve the problem of excessive unregistered words in the social media corpus, and fully utilize the text context information.

Drawings

FIG. 1 is a diagram of a model architecture of the present invention;

FIG. 2 is a detailed view of the common attention layer of the present invention;

FIG. 3 is a schematic diagram of a conditional random field output process used in the present invention.

Detailed Description

The technical scheme of the invention is further explained by combining the attached drawings.

The invention constructs a multi-modal evaluation object extraction model based on a regional awareness alignment network. The user's speech in social media is often an opinion about something (or things). The evaluation object extraction aims to explicitly extract the evaluation objects in the corpus. Taking an example of a social media, the user text is "Mario and Luigi doped the dance floor as per usual". The evaluation objects are 'Mario' and 'Luigi' after the analysis. The specific definition of the task is as follows: given a length n sequence of text x₁,x₂,…,x_n}, predicting equilong sequences { y }₁,y₂,…,y_nB in the prediction sequence represents the head word of the evaluation object, I represents other words except the first word, and O represents that the word does not belong to the evaluation object. The model of the invention can be divided into an encoding layer, a common attention layer and a decoding layer, and the whole structure of the model is shown in figure 1.

The coding layer in the invention respectively processes the text and the picture in the social media corpus. The text part fully considers the characteristics of the corpus, adopts a BERT-base pre-training model provided by Google official, and comprises 12 transform layers, and the obtained word vector dimension is 768. The dimension of the character-level word vector is set to 30, which is initialized to follow a uniform distribution of (-0.25, 0.25). Since the BERT word segmentation can divide a word into a plurality of word fragments, when aligning, the model respectively splices the character-level word vector of the word and all the word fragments corresponding to the word. The sentence length and word length are set to 40 and 30, respectively. And (3) cutting the excessive part of the sentence, and filling the insufficient part with [ PAD ], wherein the transducers of different layers of the BERT can learn the linguistic knowledge of different layers, and the output vectors of 12 layers of transducers in the BERT are averaged to be used as the final output of the BERT according to the formula 1.

In order to better capture lexical information of words, and thus alleviate the problem of excessive unknown words in social media corpus, the model uses Char-CNN to convolve each word to obtain a one-dimensional vector representation at its character level. Firstly, all characters appearing in the corpus are subjected to vector random initialization, and the words are filled to the same length so as to facilitate batch processing. For a single word w_iIn other words, convolution kernels [ C ] of different sizes are used₁,C₂,…,C_K]The character vectors in the words are convolved in one dimension with step size 1. For convolution kernel C_jAnd obtaining a sequence shown in the formula 2:

where k is the number of the convolution kernel, l_jIs the size of the convolution kernel. Then, performing maximal pooling on the convolution kernel sequence at time step to obtain a vector representation for the convolution kernel as shown in formula 3:

w′_ij＝MaxPool1d(F_ij) (3)

then, the word vector representation corresponding to all the k convolution kernels is spliced to obtain the character level vector representation of the word as shown in formula 4:

finally, the BERT output is spliced with the character-level word vector to obtain the final word vector which is expressed as the formula 5:

in addition, in order to adequately grasp the timing information inside the input sequence, this chapter, after obtaining the text vector, feeds it into the bi-directional LSTM layer to capture the sequence information. Corresponding to the ith corpus, the text vector is represented as T_i＝{w₁,w₂,…,w_mGet the corresponding hidden state sequence H after the LSTM layer_i＝{h₁,h₂,…,h_m}。

For picture coding, adopting fast-RCNN to identify an input picture, then inputting the identified one-dimensional feature vectors of N target objects into a network as picture features, and filling the extracted picture with less than N targets by using a zero vector. The results of experiments with different values of N are shown in table 1. And in order to improve the reliability of the experiment, the average value of the results of ten experiments is adopted in the table. Experiments show that the performance of the whole model is optimal when N is 4.

TABLE 1 comparison of experimental results with different target area numbers

The common attention layer structure in the invention is shown in fig. 2, which guides the modal fusion of the whole model, aims to make full use of picture information to guide text annotation, and is a key part in the model, wherein the key part comprises character-oriented visual attention and visual-oriented character attention.

Through character-oriented visual attention, the method can reasonably utilize an attention mechanism, sum the characteristic vectors of all target objects through different weight distributions at different time steps, and can obtain the picture characteristic vector with the highest weight of the corresponding evaluation object image at the time step corresponding to the evaluation object word to be extracted through a learning model, thereby enhancing the obvious degree of the evaluation object. The specific steps are that an input corpus pair is given, a picture characteristic representation matrix O and a text hidden vector h corresponding to a time step t are given_tRespectively obtained from the coding layer to which the upper section belongs. Send it to the attention layer to obtain the imageThe attention distribution among the N target objects is as shown in equation 6:

wherein

d is the dimension of the word vector and the picture feature vector;

n is the number of captured target objects;

the operators represent a splicing operation between the picture feature matrix and the text hidden vector, which splices each vector in the matrix with the text hidden vector. W_O，

And

is a parameter matrix to be trained. Alpha is alpha_tAnd the weight vector of the target object corresponding to the t time step. According to the attention distribution, a picture attention characteristic representation at t time step is finally obtained, as shown in formula 7:

by visually guiding the attention of the characters, the invention can utilize the image characteristics as query vectors to learn the attention relationship among the texts. Specifically, the image representation at t time step can be obtained by equation 7

And a text hidden vector matrix x obtained by the coding layer is given by the visual guidance character attention module in the formula 8And (3) treatment:

wherein, x is a hidden vector matrix obtained by an LSTM layer, and n is the maximum text length set by the model.

And splicing the text hidden vector matrix with the picture characteristic vector at the t time step by the operational character. W_x，

And

is the parameter to be trained. Beta is a_tAnd the text weight vector corresponding to the t time step. According to the attention distribution, the text attention characteristic when t time step is finally obtained is shown as formula 9:

through the two modules, the text and picture characteristics are fully interacted, and the text attention vector and the picture attention vector at any time t of the sequence are obtained. Based on these two vectors, the present invention uses a gating module to decide how much of the final multi-modal representation is derived from the text and the picture, respectively. This gating, as a unit in the model, determines the fusion between modalities. For words at time step t, the two attention modules in the foregoing respectively obtain their pictorial representations

And text representations

First of all of the formulaAnd 10, converting the two modal vectors into the same dimension by the full connection layer, and respectively activating by the tanh activation function.

Then, as shown in formula 11, the weight g of the picture vector is obtained through a weight matrix and a Sigmo d activation function_tAnd weights for text vectors 1-g_t。

Finally, as shown in equation 12, the two modal vectors are weighted and summed to obtain the multi-modal final representation m at the time step t_t。

Wherein

Operators represent join operations, sigma represents a Sigmoid activation function,

and

is a parameter to be trained. After the fusion of the modules, the invention obtains the multi-modal feature sequence with fused pictures and texts.

Since in social media corpus, although the picture can be used as an effective supplement to the text in most cases, in some cases, the picture cannot assist the extraction of the evaluation object of the text, but exists purely as noise. As shown in FIG. 2, a multi-modal feature sequence m with fused picture and text is obtained_tThen, it is subjected to noise reduction treatment by a filter gate. The process is shown in equation 13.

The final vector representation of the decoding layer is fed for time step t. Wherein

And

as a function of the parameters to be trained, the training parameters,

After passing through the common attention layer, the invention obtains the feature vector sequence after interaction and fusion between two modalities of text and picture

The sequence is then fed into a decoding layer and the CRF model predicts the tags corresponding to the input sequence. The specific steps are firstly as shown in formula 14, compressing the modal fusion matrix obtained by the common attention layer through the full connection layer.

P=WV_i+b (14)

The above

Is a transmitting matrix in a CRF model, m is the length of a text sequence, k is the number of labels,

representing classification of i-th word in sentence to tag y_jThe fraction of (c). Furthermore, the transition matrix in the CRF model is

A_i,jRepresenting the probability of sequence transfer from tag i to tag j.

Wherein

In order to be a transmit matrix,

representing classification of i-th word in sentence to tag y_jThe fraction of (c). m is the length of the text sequence, and k is the number of the labels. For input sequence X, the score for output sequence Y' is shown in equation 15.

As shown in equation 16, the final output of the decoding layer of the present invention is the highest-score sequence.

Since the probability that the sentence X corresponds to the tag sequence Y is shown in equation 17:

wherein Y is_XFor all possible tag sequences. The model is aimed at making the real targetThe probability of the signature is the largest, the loss function of the model is chosen as shown in equation 18, and the process of training the neural network is actually the process of minimizing the loss function.

Where N is the size of the sample set.

In order to verify that the method has advantages compared with other evaluation object extraction models when the evaluation object extraction is carried out, and each part in the model of the invention contributes to the improvement of the model effect, a series of experiments are carried out. Since most words in the sequence extracted by the evaluation object are irrelevant words, the model can obtain higher Accuracy (Accuracy) if all words are predicted as irrelevant words, and has no practical significance. In contrast, Precision (Precision) represents the proportion of the actual evaluation object in all the extracted evaluation objects; the Recall (Recall) represents the proportion of successfully extracted evaluation objects in all real evaluation objects, and both can evaluate the model more effectively. F1-measure is a harmonic mean of the precision rate and the recall rate, and can carry out overall evaluation on the model performance. The calculation is shown in equation 19.

Wherein TP represents the number of successfully extracted evaluation objects, FP represents the number of extracted evaluation objects that are not actual evaluation objects, and FN represents the number of unsuccessfully extracted evaluation objects among the actual evaluation objects. The experimental computer CPU is 8-core 16-thread

Core^TMi9-9900K, GPU is GigabyteRTX 2080Ti of 11G video memory. The experimental steps mainly comprise three aspects, namely firstly, data preparation; then training a model; and finally, extracting evaluation objects through the trained model, and displaying the subjective and objective effects.

1) Data preparation

The data set used in the experiment is the adaptation Co-attribute Network for the Name Entity registration in Tweets corresponding tweet data set published by Zhang et al and the Visual attribute Model for the Name Tagging in Multimodal social media corresponding tweet data set published by Lu et al. When the corpus is screened, firstly deleting twitter with text except English and twitter without picture in the data set. And in the rest data, if more than one picture is corresponding to the text, randomly selecting one picture as a representative. Finally, items that do not contain any rating objects, have text lengths less than 3, or have text that is difficult to understand are deleted. The corpus labeling follows BIO-2 standard, and the whole corpus is divided into three parts, namely a training set, a verification set and a test set, and the statistical information of the corresponding corpus is shown in Table 2.

TABLE 2 comparison of experimental results with different numbers of target zones

2) Model training

And when the initialization parameters are trained, initializing the parameters by using an Xavier method. The initialization initializes the bias of each layer network to zero vector, and the initialization of the parameter matrix is in accordance with

Wherein n is the number of parameters. The initialization method can solve the gradient vanishing problem by maintaining a gaussian distribution for the output values of each layer to avoid attenuation of variance of activation values. Furthermore, the model is optimized using an Adam optimizer that can dynamically adjust the learning rate and that allows frequently changing parameters to be updated in smaller steps for different parameters, while sparse parameter updates are larger. The initial learning rate of the model was set to 0.001 and the training batch was 16 samples. To verify the superiority of the RAN model of the present invention and the necessity of each part thereof, the following model was selected and subjected to comparative experiments:

CRF the method is a simple CRF model. The input of the model is basic Word2Vec, and the model is sent to CRF model for decoding after being compressed by full-connection layer.

BilSTM + CRF the method uses a bidirectional LSTM model to extract the context semantic relationship of text sequences, and is a basic framework of many sequence labeling models. The model is an end-to-end system and does not require additional feature engineering.

CNN + BilSTM + CRF is based on the last baseline model that employed CNN to capture character-level features of the language. The model innovatively utilizes CNN to relieve the problem of unknown words, and is adopted by a plurality of sequence labeling tasks after being proposed.

BERT + CNN + BilSTM + CRF this method is a model architecture that removes the common attention layer for the RAN model. The method does not make use of picture information.

RAN-CRF this method trades the decoding layer in the RAN network for a simple Softmax function.

RAN (Word2Vec) this method roughly adopts the architecture of the RAN model. Except that the model is encoded using the Word2Vec model for the input text. The method initializes the unknown words with a zero vector.

RAN (vgg) the method substantially adopts the architecture of the RAN model. In the feature extraction part of the picture, the model uses VGG-16 to process the picture, uses the feature vector matrix obtained after the last pooling layer as the picture feature, and the feature equally divides the picture into 49 blocks in a 7 × 7 architecture.

RAN-Fusion this approach roughly exploits the architecture of the RAN model. In contrast, at the common attention level, the method does not use gated multi-modal fusion, but directly adds the previously obtained picture attention feature representation matrix and the text attention feature representation matrix.

RAN-FG this approach roughly exploits the architecture of the RAN model. In contrast, at the common attention level, the method does not use a filter gate, but rather a fusion of gated multimodalities as the output of the decoding layer.

For the Word2Vec Word vectors used therein, the experiment used a model pre-trained beforehand by 300 tens of thousands of twitter, with a Word vector dimension of 200.

3) Results of the experiment

Applying the prepared data to the above model, the results shown in table 3 were obtained. The results show the accuracy, precision, recall and F1-measure of the trained model on the test set, and the larger these evaluation index values are, the more excellent the model is. And because the evaluation object extracts the characteristic of the task, the accuracy index has no practical significance, the model can be more effectively evaluated through the accuracy rate and the recall rate, and the F1-measure is the harmonic mean of the accuracy rate and the recall rate and can be used for integrally evaluating the performance of the model.

TABLE 3 comparative experimental results

From table 3 it can be seen that the RAN model proposed by the present invention achieves the best results. In addition, the results of CRF, RAN-CRF and BERT + CNN + BilSTM + CRF are comprehensively compared to draw the conclusion that: although CRF can model the dependency of sequences, it cannot capture any context of text. It is clear that corpus-related feature engineering of the model is quite necessary. And if CRF is not used and Softmax is used for decoding, the label rule is ignored, and a large number of prediction label sequences which do not accord with the rule appear in the structure, so that the model effect is greatly weakened.

The result of comparing BilSTM + CRF with CNN + BilSTM + CRF shows that the social network corpus text really has a large number of unknown words which cannot be identified by traditional word vectors, and the problem of excessive unknown words can be effectively relieved by using character-level vector characteristics as supplement. And the F1-measure of BERT + CNN + BilSTM + CRF is 0.7% higher than that of CNN + BilSTM + CRF, so that the BERT pre-training model provides rich prior knowledge, and meanwhile, the internal multi-layer transform structure can effectively carry out context semantic coding. The model of RAN architecture used in this table 3 all gave better results than the unused RAN architecture. This shows that the RAN architecture can effectively capture the alignment relationship between the text and the picture in the social network corpus and fuse the two modal vectors. And it can also be seen in the table that the individual modules selected for use in the present invention are beneficial in improving the results.

Claims

1. A multi-modal evaluation object extraction method based on a regional awareness alignment network is characterized in that a model of the method comprises a coding layer, a common attention layer and a decoding layer, the model is initialized by parameters through an Xavier method, the model respectively obtains text and picture characteristics through the coding layer, the text and picture characteristics are fused through the common attention layer to obtain a multi-modal characteristic sequence, and finally a label sequence is obtained through the multi-modal characteristic sequence through the decoding layer.

2. The method as claimed in claim 1, wherein the coding layer includes 4 parts, i.e. BERT, Char-CNN, bi-directional LSTM network and fast-RCNN, the BERT part introduces external information, the Char-CNN part performs character-level word vector coding, the bi-directional LSTM network captures text sequence information from a sequence after splicing between BERT coding results and Char-CNN coding results, and the fast-RCNN captures foreground objects appearing in pictures as corresponding picture features.

3. The method as claimed in claim 2, wherein the BERT is a BERT-base pre-training model and comprises 12 transform layers, output vectors of 12 transform layers in the BERT are averaged to be used as final output of the BERT, and the obtained word vector has a dimension of 768 and a sentence length of 40.

4. The multi-modal assessment object extraction method based on the regional awareness alignment network as claimed in claim 2, wherein the Char-CNN dimension is set to 30, and its initialization follows a uniform distribution of (-0.25,0.25) and the word length is 30.

5. The method as claimed in claim 2, wherein the one-dimensional feature vectors of N target objects identified by the fast-RCNN are input into the network as picture features, and the extracted N-less pictures are filled with zero vectors.

6. The method as claimed in claim 1, wherein the common attention layer comprises a text-oriented visual attention, a visual-oriented text attention, a gated multi-modal fusion unit and a filter gate, the text-oriented visual attention and the visual-oriented text attention fully interact with text and picture features and obtain a text attention vector and a picture attention vector at any time t of the sequence, the gated multi-modal fusion unit determines how much a final multi-modal representation is obtained from the text and the picture respectively, and the filter gate determines how to use the multi-modal features obtained in the previous step by determining how much the picture and the text in the corpus are associated.

7. The method as claimed in claim 6, wherein the picture attention feature at t time step is represented by the following formula:

wherein alpha is_tFor the target object weight vector, α, corresponding to t time step_t,iIs the ith value, v_iIs the picture characteristic of the ith position.

8. The method as claimed in claim 6, wherein the text attention characteristic when the visually oriented text attention is obtained at t time step is represented by the following formula:

wherein beta is_tFor text weight vectors corresponding to t time steps, beta_t,jIs the jth value, h_jIs the text feature of the jth position.

9. The method as claimed in claim 6, wherein the gated multi-modal fusion unit first converts two modal vectors to the same dimension by a full connection layer, then activates the two modal vectors by tanh activation function, and then obtains the weight g of the two modal vectors to the picture vector by a weight matrix and a Sigmoid activation function_tAnd weights for text vectors 1-g_tFinally, weighting and summing the two modal vectors to obtain a multi-modal final representation m at the time step t_t。

10. The method for extracting multi-modal assessment objects based on the regional awareness alignment network as claimed in claim 6, wherein said filter gate prohibits the flow of multi-modal features when the words and pictures are not related, and imports the multi-modal features into the final representation according to the degree of the correlation when the words and pictures are related, and the filtering process is represented as follows:

wherein s is_tA filter gate with a value between 0 and 1, wherein if the word is not associated with the picture, the filter gate prevents the flow of multimodal features, and if the word is associated with the picture, the filter gate looks at the degree of correlation to assemble the multimodal features into a final representation, u_tFor a multi-modal representation after filtering through a filter gate,

a final vector representation is fed into the decoding layer for time step t, where

And

as a function of the parameters to be trained, the training parameters,

11. The method of claim 1, wherein the decoding layer is a CRF model, and for the problem of n candidate categories with sequence length, the CRF is regarded as one k candidate categoriesⁿThe classification problem, namely: for the sequence x ═ x₁,…,x_n) Finding the conditional probability P (y)₁,…,y_n| x) the largest output sequence.

12. The method for extracting multi-modal evaluation objects based on the regional awareness alignment network as claimed in claim 1, wherein a loss function during model training is expressed as follows:

wherein Y is_XFor all possible tag sequences, Y is the true tag sequence, X is the input sequence, score (X, Y) represents the score of the tag sequence Y at input sequence X, and N is the size of the sample set.