CN114693949A - Multi-modal evaluation object extraction method based on regional perception alignment network - Google Patents

Multi-modal evaluation object extraction method based on regional perception alignment network Download PDF

Info

Publication number
CN114693949A
Authority
CN
China
Prior art keywords
text
modal
picture
sequence
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210352426.8A
Other languages
Chinese (zh)
Inventor
李露
李昕玮
王启鹏
华梓萱
魏素忠
周爱华
吴含前
陈锦铭
叶迪卓然
陈烨
焦昊
郭雅娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
Southeast University
Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University, Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd filed Critical Southeast University
Priority to CN202210352426.8A priority Critical patent/CN114693949A/en
Publication of CN114693949A publication Critical patent/CN114693949A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention discloses a multi-modal evaluation object extraction method based on a region-aware alignment network. Aimed at the characteristics of social media corpora, it constructs a multi-modal evaluation object extraction model with a coding layer, a common attention layer and a decoding layer; during model construction, hyper-parameters of the RAN (Region-aware Alignment Network), including sentence length and word length, are set according to the characteristics of the social media corpus, and the parameters are initialized with the Xavier method. The model obtains the corpus text features and picture features through the coding layer, fuses the text and picture features through the common attention layer to obtain a multi-modal feature sequence, and finally processes the obtained multi-modal feature sequence through the decoding layer to obtain the label sequence. Comparative experiments show that the multi-modal evaluation object extraction model proposed by the invention achieves the best results compared with other models.

Description

Multi-modal evaluation object extraction method based on regional perception alignment network
Technical Field
The invention relates to a natural language processing method, and in particular to a multi-modal evaluation object extraction method based on a Region-aware Alignment Network (RAN).
Background
Sentiment classification can be divided into four subtasks: (1) evaluation object extraction (Aspect Term Extraction): given a sentence, extract all evaluation objects appearing in it; (2) evaluation object sentiment classification (Aspect Term Polarity): given a sentence and an evaluation object appearing in it, analyze the sentiment polarity of the sentence toward that evaluation object; (3) evaluation category detection: given a sentence, classify its evaluation objects into predefined evaluation object categories; (4) category-based sentiment classification of evaluation objects: given a sentence, determine the sentiment polarity toward the evaluation object of a specified category. Evaluation object extraction is therefore an important subtask of sentiment analysis; it plays an important precursor role in the whole sentiment analysis pipeline and is a prerequisite for good results in the subsequent tasks.
Multi-modal learning refers to integrating information from multiple modalities (such as text, speech, pictures and videos), realizing information fusion by analyzing the relations among the modalities, and finally achieving the ability to process and understand multi-source modal information. As traditional single-modality learning has developed to a considerable height and the demand for multi-modal applications such as visual question answering and bidirectional image-text retrieval keeps increasing, multi-modal learning has become an important research field. It can be mainly divided into five research directions: multi-modal representation learning, modality conversion, multi-modal alignment, multi-modal fusion and collaborative learning.
In the past, evaluation object extraction has mostly focused on text, and multi-modal evaluation object extraction methods are still immature, so many problems remain to be solved. First, because social media corpora are informal, the text contains a large number of abbreviations and misspellings. With traditional word vector representations, many words are treated as unknown words because they are not in the vocabulary, and traditional word vectors represent all unknown words with the same vector, which seriously reduces the effectiveness of the model; in addition, traditional word vectors cause the loss of phrase information. Second, the most essential difference between social media corpora and traditional corpora is that social media corpora come with corresponding picture information. Generally, the picture and the text in the same corpus item are highly related, and the evaluation object mentioned in the text often appears as the main subject of the image; however, since task-irrelevant information, i.e. noise, is also generally present, it is essential to reduce the influence of noise when fusing the picture information.
Disclosure of Invention
The purpose of the invention is as follows: to address the defects of the prior art, the invention provides a multi-modal evaluation object extraction method based on a region-aware alignment network. Through a multi-modal evaluation object extraction model with a coding layer, a common attention layer and a decoding layer, it makes full use of the characteristics of both pictures and text to improve the performance of evaluation object extraction.
The technical scheme is as follows: a multi-modal evaluation object extraction method based on a region-aware alignment network, which can be divided into a coding layer, a common attention layer and a decoding layer. The coding layer is divided into a text part and a picture part. The text part fully considers the characteristics of the corpus: it generates context-dependent word vector encodings for the text with BERT, solves the unknown-word problem with character-level vectors, and finally uses a bidirectional LSTM to strengthen the temporal information of the text sequence. The picture part captures picture features through a Faster R-CNN object detection network. In the common attention layer, the picture representation at each time step is first obtained through text-guided attention over the picture; then attention is computed between the obtained picture representation and the original text sequence to obtain the text representation at each time step; finally, the picture representation and the text representation are fused across modalities, and the noise in the picture is removed through a filter gate. In the decoding layer, the CRF algorithm learns the dependencies between outputs, computes the probability of each output label at every time step, and takes the maximum as the predicted label for that time step.
The BERT in the coding layer is the Transformer-based bidirectional encoder representation (BERT) proposed by Google in 2018. It uses the Transformer encoder as its core; the main structure of the base version is a stack of 12 Transformer encoders whose core component is the self-attention structure. The basic attention structure can be expressed as:

Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V

where Q, K and V denote the query, key and value vectors, respectively. In the self-attention structure, the three vectors all come from the same input, so the relation between any two units of the input is obtained by normalizing the inner product of the query and key vectors, and the value vectors are weighted and summed according to this distribution to obtain the self-attention representation of the input. BERT is pre-trained on a corpus of about 3.3 billion words with two tasks: the Masked Language Model and Next Sentence Prediction. The masked language model randomly replaces 15% of the words in a sentence with [MASK] and has the model predict the masked words; next sentence prediction randomly swaps sentence contexts during training and uses the sentence representations produced by BERT to judge whether two sentences are each other's context. The first task makes the final word vectors carry contextual information, and the second task lets the model more accurately capture the semantic relations between sentences and documents, so that each output word vector contains as much information about the whole sentence as possible.
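As an illustration of the attention formula above, the following sketch (an assumed rendering of the described computation, not code from the patent) implements scaled dot-product self-attention in PyTorch:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over one input sequence.

    x: (seq_len, d_model); w_q / w_k / w_v: (d_model, d_k) projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # query, key and value vectors
    d_k = q.size(-1)
    scores = q @ k.t() / d_k ** 0.5              # relation between any two input units
    weights = F.softmax(scores, dim=-1)          # normalize to a distribution
    return weights @ v                           # weighted sum of the value vectors

# toy usage: 5 tokens with 8-dimensional embeddings
x = torch.randn(5, 8)
w = [torch.randn(8, 8) for _ in range(3)]
out = self_attention(x, *w)                      # (5, 8) self-attention output
```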
The character-level vector in the coding layer is obtained by applying a Char-CNN convolution to each word, producing a one-dimensional character-level representation of the word. This vector captures the lexical information of words and thereby alleviates the problem of the many out-of-vocabulary words in social media corpora. The model randomly initializes a vector for every character appearing in the corpus and pads words to the same length for batch processing. For a single word w_i, convolution kernels of different sizes [C_1, C_2, …, C_k] are applied as one-dimensional convolutions over the character vectors of the word with stride 1. For convolution kernel C_j the following sequence is obtained:

F_{i,j} = Conv1d_{C_j}(w_i)

where k is the number of convolution kernels and l_j is the size of kernel C_j. The kernel outputs are then max-pooled over the time steps to obtain a vector representation for that kernel:

w'_{i,j} = MaxPool1d(F_{i,j})

Finally, the representations corresponding to all k convolution kernels are concatenated to obtain the character-level vector of the word:

w_i^{char} = [w'_{i,1}; w'_{i,2}; …; w'_{i,k}]
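A minimal sketch of this character-level encoder follows; the kernel sizes and filter counts are illustrative assumptions, not values fixed by the patent:

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Character-level word encoder: multi-width Conv1d + max-over-time pooling."""
    def __init__(self, n_chars, char_dim=30, kernel_sizes=(2, 3, 4), n_filters=30):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim)          # randomly initialized char vectors
        self.convs = nn.ModuleList(
            nn.Conv1d(char_dim, n_filters, k) for k in kernel_sizes
        )

    def forward(self, char_ids):                   # (batch, word_len) padded character ids
        e = self.embed(char_ids).transpose(1, 2)   # (batch, char_dim, word_len)
        pooled = [conv(e).max(dim=2).values for conv in self.convs]  # max-pool over time
        return torch.cat(pooled, dim=1)            # concatenation over all kernels

# usage: a batch of 4 words, each padded to 10 characters
words = torch.randint(0, 80, (4, 10))
char_vecs = CharCNN(n_chars=80)(words)             # (4, 90) character-level vectors
```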
by using the bi-directional LSTM, the model can adequately grasp timing information within the input sequence. Corresponding to the ith corpus, the text vector is represented as Ti={w1,w2,…,wmGet the corresponding hidden state sequence H after passing through the LSTM layeri={h1,h2,…,hm}. Wherein wiAnd m is the corpus length of the result after BERT output and character set vector splicing.
The Faster R-CNN in the coding layer obtains picture features by feeding the picture into an object detection network; the invention assumes that in most cases the evaluation object of a text corresponds to one of the objects appearing in the picture. Therefore, the model takes the N one-dimensional feature vectors of the target objects identified by Faster R-CNN as the picture feature input to the network, and pictures with fewer than N detected targets are padded with zero vectors.
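The assembly of the N region features could be sketched as follows; the detector call is only indicated, and the feature dimension 2048 is an assumption rather than a value stated in the patent:

```python
import torch

def pad_region_features(region_feats, n_regions=4, dim=2048):
    """Stack the detected object features and zero-pad up to n_regions vectors.

    region_feats: list of 1-D tensors produced by an object detector such as Faster R-CNN.
    """
    feats = list(region_feats[:n_regions])
    while len(feats) < n_regions:                  # fewer detected objects than N: pad with zeros
        feats.append(torch.zeros(dim))
    return torch.stack(feats)                      # (n_regions, dim) picture feature input

# usage: a picture in which only two objects were detected
detected = [torch.randn(2048), torch.randn(2048)]
picture_input = pad_region_features(detected)      # (4, 2048)
```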
The coding layer processes the inputs of the two modalities in a targeted way and encodes the text and the picture into vectors that are fed to the upper network. For the text, character-level word vector encoding with Char-CNN weakens the negative influence of the many unknown words in the corpus; the BERT pre-training model introduces external information and avoids the phrase loss caused by word segmentation; and the stacked Transformer structure inside BERT, together with the subsequent bidirectional LSTM network, incorporates as much context information as possible. For the picture, the object detection network Faster R-CNN captures the foreground objects appearing in the picture as the corresponding picture features, so that the subsequent modality fusion can achieve a better effect.
The common attention layer guides the modal fusion of the whole model. Its goal is to make full use of the picture information to guide text labeling, and it is a key part of the model, comprising text-guided visual attention and visually guided text attention. In most cases, the evaluation objects to be extracted appear explicitly in the picture. Text-guided visual attention uses the attention mechanism to sum the feature vectors of all target objects with different weight distributions at different time steps; through learning, the model assigns the highest weight to the image region of the corresponding evaluation object at the time step of the evaluation-object word to be extracted, thereby enhancing the saliency of the evaluation object. α_t is the weight vector over the target objects at time step t. According to this attention distribution, the picture attention feature representation at time step t is obtained as:

v̂_t = Σ_{i=1}^{N} α_{t,i} · v_i

where v_i is the feature vector of the i-th target object.

The visually guided text attention in the common attention layer uses the image features to learn the attention relations within the text. β_t is the text weight vector at time step t. According to this attention distribution, the text attention feature representation at time step t is obtained as:

ĥ_t = Σ_{j=1}^{n} β_{t,j} · h_j

where h_j is the hidden vector of the j-th text position.
the co-attention layer determines the fusion between modalities through a gating module after obtaining the text-oriented visual attention and the visual-oriented text attention. For words at time step t, the two attention modules in the foregoing respectively obtain their pictorial representations
Figure BDA0003581381980000043
And text representations
Figure BDA0003581381980000044
The gate control unit is obtained by the following formula:
Figure BDA0003581381980000045
Figure BDA0003581381980000046
Figure BDA0003581381980000047
Figure BDA0003581381980000048
specifically, the two modal vectors are firstly converted to the same dimension by the fully-connected layer, and then are respectively activated by the tanh activation function. Then obtaining the weight g of the picture vector through a weight matrix and a Sigmoid activation functiontAnd weights for text vectors 1-gtFinally, weighting and summing the two modal vectors to obtain a multi-modal final representation m at the time step tt. Wherein
Figure BDA0003581381980000051
Operators represent join operations, σ represents Sigmoid activation functions,
Figure BDA0003581381980000052
and
Figure BDA0003581381980000053
is the parameter to be trained. And after the fusion of the modules, a multi-modal feature sequence with fused pictures and texts is obtained.
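A sketch of this gating unit, under the reconstruction above (the layer shapes and the single scalar-per-dimension gate are assumptions consistent with the description, not the patent's exact implementation):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse a picture attention vector and a text attention vector with a learned gate."""
    def __init__(self, pic_dim, txt_dim, dim):
        super().__init__()
        self.fc_pic = nn.Linear(pic_dim, dim)      # map both modalities to the same dimension
        self.fc_txt = nn.Linear(txt_dim, dim)
        self.gate = nn.Linear(2 * dim, dim)        # weight matrix producing the gate g_t

    def forward(self, v_hat, h_hat):
        v = torch.tanh(self.fc_pic(v_hat))         # tanh-activated picture vector
        h = torch.tanh(self.fc_txt(h_hat))         # tanh-activated text vector
        g = torch.sigmoid(self.gate(torch.cat([v, h], dim=-1)))
        return g * v + (1 - g) * h                 # multi-modal representation m_t

# usage at one time step
m_t = GatedFusion(2048, 256, 256)(torch.randn(1, 2048), torch.randn(1, 256))
```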
After the common attention layer obtains the multi-modal feature sequence fusing pictures and text, a filter gate judges the degree of association between the picture and the text in the corpus to decide how to use the multi-modal features. This process can be expressed by the following equations:

s_t = σ(W_s · [h_t ⊕ m_t])
u_t = s_t · tanh(W_u · m_t)
r_t = [h_t ⊕ u_t]

where s_t is the filter gate with a value between 0 and 1: if the word is not associated with the picture, the filter gate stops the multi-modal features from flowing through, and if it is associated, the filter gate passes the multi-modal features into the final representation according to the degree of correlation. u_t is the multi-modal representation filtered by the filter gate, and r_t is the final vector representation fed to the decoding layer at time step t. W_s and W_u are the parameters to be trained, ⊕ denotes the concatenation operation, and h_t is the text hidden vector output by the coding layer.
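A corresponding sketch of the filter gate; the symbols W_s, W_u and the concatenated output r_t follow the reconstruction above and are assumptions:

```python
import torch
import torch.nn as nn

class FilterGate(nn.Module):
    """Suppress the multi-modal features for words that are unrelated to the picture."""
    def __init__(self, dim):
        super().__init__()
        self.w_s = nn.Linear(2 * dim, dim)         # gate computed from [h_t ; m_t]
        self.w_u = nn.Linear(dim, dim)             # transform of the fused vector m_t

    def forward(self, h_t, m_t):
        s_t = torch.sigmoid(self.w_s(torch.cat([h_t, m_t], dim=-1)))
        u_t = s_t * torch.tanh(self.w_u(m_t))      # filtered multi-modal representation
        return torch.cat([h_t, u_t], dim=-1)       # final vector fed to the decoding layer

# usage at one time step
r_t = FilterGate(256)(torch.randn(1, 256), torch.randn(1, 256))   # (1, 512)
```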
Through the common attention layer, the model obtains a feature vector sequence produced by the interaction and fusion of the text and picture modalities. For each word in the sequence, this process first obtains a picture representation via attention, then uses that picture representation as the query vector to learn the dependencies within the text, and finally fuses the two modal vectors according to the importance of the picture.
The CRF in the decoding layer, i.e. the conditional random field, can learn the dependencies within the label sequence, thereby avoiding predictions that violate the labeling rules and increasing the probability of a correct prediction. It classifies in units of paths: for a sequence of length n with k candidate classes per position, the CRF treats it as a single classification problem with k^n candidates, namely, for the sequence x = (x_1, …, x_n), finding the output sequence that maximizes the conditional probability P(y_1, …, y_n | x). For an input sequence X, the score of an output sequence Y' is given by:

score(X, Y') = Σ_i A_{y'_i, y'_{i+1}} + Σ_i P_{i, y'_i}

where A is the transition matrix and A_{i,j} denotes the probability of transitioning from tag i to tag j; P is the emission matrix and P_{i,j} denotes the score of classifying the i-th word of the sentence as tag y_j; m is the length of the text sequence and k is the number of labels.

During decoding, the CRF layer computes the scores of all output sequences under the constraints of the emission matrix and the transition matrix, and selects the sequence with the highest score as the class label sequence of the input sequence X, as shown in the formula:

Y* = argmax_{Y' ∈ Y_X} score(X, Y')

During training, the model maximizes the probability of the true label sequence, i.e. minimizes the negative value of its logarithm, which is computed as follows:

P(Y | X) = exp(score(X, Y)) / Σ_{Y' ∈ Y_X} exp(score(X, Y'))
Loss = -(1/N) Σ log P(Y | X)

where P(Y | X) is the probability that sequence X has label sequence Y, Y_X is the set of all possible label sequences, and N is the size of the sample set.
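The sequence score and the negative log-likelihood can be illustrated by the following sketch; it enumerates all k^m paths by brute force for clarity, whereas a practical CRF layer would use dynamic programming (forward algorithm and Viterbi decoding), and start/stop transitions are omitted:

```python
import torch

def sequence_score(emissions, transitions, tags):
    """score(X, Y) = sum of emission scores plus sum of transition scores."""
    emit = emissions[torch.arange(len(tags)), tags].sum()
    trans = transitions[tags[:-1], tags[1:]].sum()
    return emit + trans

def crf_nll(emissions, transitions, tags):
    """Negative log-likelihood over all k^m candidate label paths (brute force)."""
    m, k = emissions.shape
    paths = torch.cartesian_prod(*[torch.arange(k)] * m)   # every possible label sequence
    scores = torch.stack([sequence_score(emissions, transitions, p) for p in paths])
    log_z = torch.logsumexp(scores, dim=0)                 # log partition function
    return log_z - sequence_score(emissions, transitions, tags)

# toy usage: a 4-word sentence with 3 labels (B, I, O)
P = torch.randn(4, 3)            # emission matrix from the fully connected layer
A = torch.randn(3, 3)            # transition matrix
loss = crf_nll(P, A, torch.tensor([0, 1, 2, 2]))
```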
The model parameters are initialized with the Xavier method. This initialization sets the bias of each layer to the zero vector, and the parameter matrices are initialized according to

W ~ U(-√(6/n), √(6/n))

where n is the number of parameters. By keeping the output values of each layer approximately Gaussian-distributed, this initialization avoids the decay of the variance of the activation values and thus alleviates the vanishing-gradient problem. Furthermore, the model is optimized with the Adam optimizer, which dynamically adjusts the learning rate for each parameter: frequently updated parameters receive smaller steps, while sparse parameters receive larger updates.
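A sketch of this initialization and optimizer setup, assuming PyTorch; the layer sizes are placeholders and not taken from the patent:

```python
import torch
import torch.nn as nn

def init_xavier(module):
    """Xavier initialization: weights from the Xavier distribution, biases set to zero."""
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)     # keeps activation variance stable across layers
        nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(768, 256), nn.Tanh(), nn.Linear(256, 3))
model.apply(init_xavier)

# Adam adapts the step size per parameter: frequent parameters get smaller updates
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```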
Advantageous effects:
1) The invention fully considers the characteristics of the social media corpus, so that it achieves superior performance in extracting evaluation objects from social media corpora.
2) The invention processes the text and the picture in the corpus separately in the coding layer, fully extracts the features of both, and reduces the cost of model construction by using BERT.
3) The invention fully fuses the text features and the picture features with an attention mechanism and effectively extracts evaluation objects using the multi-modal information.
4) The invention represents the text features by combining BERT and Char-CNN, which effectively alleviates the problem of the many out-of-vocabulary words in social media corpora and makes full use of the textual context information.
Drawings
FIG. 1 is a diagram of a model architecture of the present invention;
FIG. 2 is a detailed view of the common attention layer of the present invention;
FIG. 3 is a schematic diagram of a conditional random field output process used in the present invention.
Detailed Description
The technical scheme of the invention is further explained below with reference to the accompanying drawings.
The invention constructs a multi-modal evaluation object extraction model based on a region-aware alignment network. User posts in social media are often opinions about something (or some things), and evaluation object extraction aims to explicitly extract the evaluation objects in the corpus. Taking a social media post as an example, the user text is "Mario and Luigi doped the dance floor as per usual"; after analysis, the evaluation objects are "Mario" and "Luigi". The task is defined as follows: given a text sequence {x_1, x_2, …, x_n} of length n, predict an equal-length label sequence {y_1, y_2, …, y_n}, where B in the predicted sequence marks the first word of an evaluation object, I marks the other words of an evaluation object, and O marks a word that does not belong to any evaluation object. The model of the invention consists of a coding layer, a common attention layer and a decoding layer; its overall structure is shown in FIG. 1.
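As a concrete illustration of this BIO labeling scheme, the small helper below (hypothetical, written only to reproduce the example above) assigns B/I/O tags to the tokens of a post:

```python
def bio_labels(tokens, aspect_terms):
    """Label each token B (first word of an evaluation object), I (inside) or O."""
    labels = ["O"] * len(tokens)
    for term in aspect_terms:
        words = term.split()
        for i in range(len(tokens) - len(words) + 1):
            if tokens[i:i + len(words)] == words:
                labels[i] = "B"
                for j in range(1, len(words)):
                    labels[i + j] = "I"
    return labels

tokens = "Mario and Luigi doped the dance floor as per usual".split()
print(bio_labels(tokens, ["Mario", "Luigi"]))
# ['B', 'O', 'B', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
```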
The coding layer of the invention processes the text and the picture in the social media corpus separately. The text part fully considers the characteristics of the corpus and adopts the BERT-base pre-training model released by Google, which contains 12 Transformer layers and produces word vectors of dimension 768. The dimension of the character-level word vector is set to 30, and it is initialized from a uniform distribution over (-0.25, 0.25). Since BERT tokenization may split a word into several word pieces, the model concatenates the character-level word vector of the word with every word piece belonging to it when aligning the two representations. The sentence length and word length are set to 40 and 30, respectively; over-long sentences are truncated and short ones are padded with [PAD]. Because the Transformers at different layers of BERT learn linguistic knowledge at different levels, the output vectors of the 12 Transformer layers are averaged as the final BERT output, as shown in Equation 1:

w_i^{BERT} = (1/12) Σ_{l=1}^{12} h_i^{(l)}    (1)

To better capture the lexical information of words and thus alleviate the problem of the many unknown words in social media corpora, the model applies Char-CNN convolutions to each word to obtain a one-dimensional character-level vector representation. First, a vector is randomly initialized for every character appearing in the corpus, and words are padded to the same length for batch processing. For a single word w_i, convolution kernels of different sizes [C_1, C_2, …, C_K] are applied as one-dimensional convolutions over the character vectors of the word with stride 1. For convolution kernel C_j, the sequence shown in Equation 2 is obtained:

F_{i,j} = Conv1d_{C_j}(w_i)    (2)

where k is the number of convolution kernels and l_j is the size of kernel C_j. The kernel outputs are then max-pooled over the time steps to obtain a vector representation for that kernel, as shown in Equation 3:

w'_{i,j} = MaxPool1d(F_{i,j})    (3)

The representations corresponding to all k convolution kernels are then concatenated to obtain the character-level vector of the word, as shown in Equation 4:

w_i^{char} = [w'_{i,1}; w'_{i,2}; …; w'_{i,k}]    (4)

Finally, the BERT output is concatenated with the character-level word vector to obtain the final word vector, as shown in Equation 5:

w_i = [w_i^{BERT}; w_i^{char}]    (5)
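A sketch of the layer-averaging and concatenation step, assuming the HuggingFace transformers API; the alignment of word pieces to words is simplified, and the character vectors are passed in as a precomputed tensor:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def encode_text(sentence, char_vectors):
    """Average the 12 Transformer layer outputs, then append character-level vectors."""
    inputs = tokenizer(sentence, return_tensors="pt", padding="max_length",
                       max_length=40, truncation=True)
    hidden = bert(**inputs).hidden_states           # 13 tensors: embeddings + 12 layers
    avg = torch.stack(hidden[1:]).mean(dim=0)       # Equation 1: mean of the 12 layers
    return torch.cat([avg, char_vectors], dim=-1)   # Equation 5: concat with char vectors

# usage: 30-dimensional character vectors for each of the 40 (padded) positions
word_vecs = encode_text("Mario and Luigi at the party", torch.randn(1, 40, 30))
```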
In addition, to fully capture the temporal information inside the input sequence, the text vector is fed into a bidirectional LSTM layer to capture sequence information. For the i-th corpus item, the text vector is represented as T_i = {w_1, w_2, …, w_m}; after the LSTM layer, the corresponding hidden state sequence H_i = {h_1, h_2, …, h_m} is obtained.

For picture coding, Faster R-CNN is used to detect objects in the input picture; the one-dimensional feature vectors of the N detected target objects are then fed into the network as picture features, and pictures with fewer than N detected targets are padded with zero vectors. The results of experiments with different values of N are shown in Table 1; to improve reliability, each entry in the table is the average of ten runs. The experiments show that the whole model performs best when N = 4.
TABLE 1 comparison of experimental results with different target area numbers
The structure of the common attention layer of the invention is shown in FIG. 2. It guides the modal fusion of the whole model, aims to make full use of the picture information to guide text labeling, and is a key part of the model, comprising text-guided visual attention and visually guided text attention.
Through text-guided visual attention, the method uses the attention mechanism to sum the feature vectors of all target objects with different weight distributions at different time steps; through learning, the model assigns the highest weight to the image region of the corresponding evaluation object at the time step of the evaluation-object word to be extracted, thereby enhancing the saliency of the evaluation object. Specifically, given an input corpus pair, the picture feature matrix O and the text hidden vector h_t at time step t are obtained from the coding layer described above. They are fed into the attention layer to obtain the attention distribution over the N target objects in the image, as shown in Equation 6:

α_t = softmax(w_α^T · tanh(W_O · O ⊕ W_H · h_t))    (6)

where h_t ∈ R^d, d being the dimension of the word vector and of the picture feature vector; O ∈ R^{d×N}, N being the number of captured target objects; the ⊕ operator denotes the splicing operation between the picture feature matrix and the text hidden vector, which concatenates each vector of the matrix with the text hidden vector. W_O, W_H and w_α are the parameter matrices to be trained. α_t is the weight vector over the target objects at time step t. According to this attention distribution, the picture attention feature representation at time step t is finally obtained, as shown in Equation 7:

v̂_t = Σ_{i=1}^{N} α_{t,i} · v_i    (7)

where v_i is the feature vector of the i-th target object.
Through visually guided text attention, the invention uses the image features as query vectors to learn the attention relations within the text. Specifically, the image representation v̂_t at time step t obtained from Equation 7 and the text hidden vector matrix x obtained from the coding layer are processed by the visually guided text attention module as shown in Equation 8:

β_t = softmax(w_β^T · tanh(W_X · x ⊕ W_V · v̂_t))    (8)

where x is the hidden vector matrix obtained from the LSTM layer and n is the maximum text length set by the model; the ⊕ operator concatenates each vector of the text hidden matrix with the picture feature vector at time step t. W_X, W_V and w_β are the parameters to be trained. β_t is the text weight vector at time step t. According to this attention distribution, the text attention feature representation at time step t is finally obtained, as shown in Equation 9:

ĥ_t = Σ_{j=1}^{n} β_{t,j} · h_j    (9)

where h_j is the hidden vector of the j-th text position.
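A sketch of the two attention directions as reconstructed in Equations 6 to 9; the additive-attention form and the parameter names are assumptions consistent with the description, and the region features are assumed to be already projected to the same dimension as the text hidden vectors:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    """Text-guided visual attention (Eq. 6-7) and visually guided text attention (Eq. 8-9)."""
    def __init__(self, dim):
        super().__init__()
        self.w_o, self.w_h = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.w_x, self.w_v = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.a_pic, self.a_txt = nn.Linear(dim, 1), nn.Linear(dim, 1)

    def forward(self, regions, hidden, h_t):
        # Eq. 6-7: attend over the N region vectors using the word hidden state h_t
        alpha = F.softmax(self.a_pic(torch.tanh(self.w_o(regions) + self.w_h(h_t))), dim=0)
        v_t = (alpha * regions).sum(dim=0)
        # Eq. 8-9: attend over the n text hidden vectors using the image query v_t
        beta = F.softmax(self.a_txt(torch.tanh(self.w_x(hidden) + self.w_v(v_t))), dim=0)
        h_hat = (beta * hidden).sum(dim=0)
        return v_t, h_hat

# usage: 4 region features, 40 text hidden states, one word position
att = CoAttention(256)
v_t, h_hat = att(torch.randn(4, 256), torch.randn(40, 256), torch.randn(256))
```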
Through these two modules, the text and picture features interact fully, and a text attention vector and a picture attention vector are obtained at every time step t of the sequence. Based on these two vectors, the invention uses a gating module to decide how much of the final multi-modal representation comes from the text and from the picture, respectively. This gate, as a unit of the model, determines the fusion between the modalities. For the word at time step t, the two attention modules above yield its picture representation v̂_t and text representation ĥ_t. First, as shown in Equation 10, the two modal vectors are mapped to the same dimension by fully connected layers and activated by the tanh activation function:

ṽ_t = tanh(W_v · v̂_t + b_v),  h̃_t = tanh(W_h · ĥ_t + b_h)    (10)

Then, as shown in Equation 11, the weight g_t of the picture vector and the weight 1 - g_t of the text vector are obtained through a weight matrix and a Sigmoid activation function:

g_t = σ(W_g · [ṽ_t ⊕ h̃_t])    (11)

Finally, as shown in Equation 12, the two modal vectors are weighted and summed to obtain the multi-modal final representation m_t at time step t:

m_t = g_t · ṽ_t + (1 - g_t) · h̃_t    (12)

where ⊕ denotes the concatenation operation, σ denotes the Sigmoid activation function, and W_v, W_h, W_g, b_v and b_h are the parameters to be trained. After this fusion module, the invention obtains the multi-modal feature sequence fusing pictures and texts.
In social media corpora, although the picture can serve as an effective supplement to the text in most cases, in some cases it cannot assist the extraction of the text's evaluation objects and exists purely as noise. Therefore, as shown in FIG. 2, after the multi-modal feature sequence m_t fusing picture and text is obtained, it is denoised through a filter gate. The process is shown in Equation 13:

s_t = σ(W_s · [h_t ⊕ m_t]),  u_t = s_t · tanh(W_u · m_t),  r_t = [h_t ⊕ u_t]    (13)

where s_t is the filter gate with a value between 0 and 1: if the word is not associated with the picture, the filter gate stops the multi-modal features from flowing through, and if it is associated, the filter gate passes the multi-modal features into the final representation according to the degree of correlation. u_t is the multi-modal representation filtered by the filter gate, and r_t is the final vector representation fed to the decoding layer at time step t. W_s and W_u are the parameters to be trained, ⊕ denotes the concatenation operation, and h_t is the text hidden vector output by the coding layer.
After the common attention layer, the invention obtains the feature vector sequence produced by the interaction and fusion of the text and picture modalities. This sequence is then fed into the decoding layer, where the CRF model predicts the tags corresponding to the input sequence. First, as shown in Equation 14, the modal fusion matrix V_i obtained by the common attention layer is compressed through a fully connected layer:
P = W · V_i + b    (14)

where P ∈ R^{m×k} is the emission matrix of the CRF model, m is the length of the text sequence, k is the number of labels, and P_{i,j} denotes the score of classifying the i-th word of the sentence as tag y_j. In addition, the transition matrix of the CRF model is A, where A_{i,j} denotes the probability of transitioning from tag i to tag j. For an input sequence X, the score of an output sequence Y' is shown in Equation 15:

score(X, Y') = Σ_i A_{y'_i, y'_{i+1}} + Σ_i P_{i, y'_i}    (15)

As shown in Equation 16, the final output of the decoding layer of the invention is the sequence with the highest score:

Y* = argmax_{Y' ∈ Y_X} score(X, Y')    (16)

The probability that sentence X corresponds to the tag sequence Y is shown in Equation 17:

P(Y | X) = exp(score(X, Y)) / Σ_{Y' ∈ Y_X} exp(score(X, Y'))    (17)

where Y_X is the set of all possible tag sequences. The goal of the model is to maximize the probability of the true tag sequence, so the loss function of the model is chosen as shown in Equation 18, and training the neural network is in fact the process of minimizing this loss function:

Loss = -(1/N) Σ log P(Y | X)    (18)

where N is the size of the sample set.
To verify that the method has advantages over other evaluation object extraction models, and that each part of the model of the invention contributes to the improvement of the model's effect, a series of experiments were carried out. Since most words in a sequence are irrelevant to the evaluation objects, a model could obtain a high Accuracy simply by predicting every word as irrelevant, which has no practical significance. In contrast, Precision denotes the proportion of real evaluation objects among all extracted evaluation objects, and Recall denotes the proportion of real evaluation objects that are successfully extracted; both evaluate the model more effectively. F1-measure is the harmonic mean of precision and recall and evaluates the overall model performance. The computation is shown in Equation 19:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN),  F1 = 2 · Precision · Recall / (Precision + Recall)    (19)

where TP is the number of successfully extracted evaluation objects, FP is the number of extracted evaluation objects that are not real evaluation objects, and FN is the number of real evaluation objects that were not extracted. The experiments were run on an 8-core, 16-thread Intel Core i9-9900K CPU and a Gigabyte RTX 2080 Ti GPU with 11 GB of video memory. The experimental procedure covers three aspects: first data preparation, then model training, and finally evaluation object extraction with the trained model together with a presentation of the subjective and objective results.
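A small sketch of the metric computation in Equation 19 at the level of extracted spans; the helper and its exact-match convention are hypothetical, used only to illustrate TP, FP and FN:

```python
def prf1(extracted, gold):
    """Precision, recall and F1 over sets of extracted and gold evaluation objects."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)                     # successfully extracted objects
    fp = len(extracted - gold)                     # extracted but not real
    fn = len(gold - extracted)                     # real but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf1([("Mario", 0), ("floor", 5)], [("Mario", 0), ("Luigi", 2)]))
# (0.5, 0.5, 0.5)
```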
1) Data preparation
The data sets used in the experiments are the tweet data set released by Zhang et al. with "Adaptive Co-attention Network for Named Entity Recognition in Tweets" and the tweet data set released by Lu et al. with "Visual Attention Model for Name Tagging in Multimodal Social Media". When screening the corpus, tweets whose text is not in English and tweets without a picture were first deleted from the data sets. In the remaining data, if a text corresponds to more than one picture, one picture was randomly selected as its representative. Finally, items that do not contain any evaluation object, whose text length is less than 3, or whose text is hard to understand were deleted. The corpus is labeled following the BIO-2 standard, and the whole corpus is divided into three parts: a training set, a validation set and a test set; the statistics of the corresponding corpora are shown in Table 2.
TABLE 2 Statistics of the training, validation and test corpora
2) Model training
For initialization during training, the parameters are initialized with the Xavier method. This initialization sets the bias of each layer to the zero vector, and the parameter matrices are initialized according to

W ~ U(-√(6/n), √(6/n))

where n is the number of parameters. By keeping the output values of each layer approximately Gaussian-distributed, this initialization avoids the decay of the variance of the activation values and thus alleviates the vanishing-gradient problem. Furthermore, the model is optimized with the Adam optimizer, which dynamically adjusts the learning rate per parameter: frequently updated parameters receive smaller steps while sparse parameters receive larger updates. The initial learning rate of the model is set to 0.001 and the training batch size is 16 samples. To verify the superiority of the RAN model of the invention and the necessity of each of its parts, the following models were selected for comparative experiments:
CRF: this method is a plain CRF model. Its input is basic Word2Vec, which is compressed by a fully connected layer and then fed into the CRF model for decoding.
BiLSTM + CRF: this method uses a bidirectional LSTM model to extract the contextual semantic relations of text sequences and is the basic framework of many sequence labeling models. The model is an end-to-end system and requires no additional feature engineering.
CNN + BiLSTM + CRF: on top of the previous baseline, this model employs a CNN to capture character-level features of the language. It innovatively uses the CNN to alleviate the out-of-vocabulary problem and has been adopted by many sequence labeling tasks since it was proposed.
BERT + CNN + BiLSTM + CRF: this method is the RAN model architecture with the common attention layer removed. It does not make use of picture information.
RAN-CRF: this method replaces the decoding layer of the RAN network with a simple Softmax function.
RAN (Word2Vec): this method largely adopts the architecture of the RAN model, except that the input text is encoded with the Word2Vec model. Unknown words are initialized with a zero vector.
RAN (VGG): this method largely adopts the architecture of the RAN model. In the picture feature extraction part, the model processes the picture with VGG-16 and uses the feature vector matrix after the last pooling layer as the picture features; these features divide the picture evenly into 49 blocks in a 7 × 7 grid.
RAN-Fusion: this method largely adopts the architecture of the RAN model. The difference is that, in the common attention layer, it does not use gated multi-modal fusion but directly adds the previously obtained picture attention feature matrix and text attention feature matrix.
RAN-FG: this method largely adopts the architecture of the RAN model. The difference is that, in the common attention layer, it does not use the filter gate but feeds the gated multi-modal fusion output directly to the decoding layer.
For the Word2Vec word vectors used above, the experiments used a model pre-trained on 3 million tweets, with a word vector dimension of 200.
3) Results of the experiment
Applying the prepared data to the above models yields the results shown in Table 3. The results report the accuracy, precision, recall and F1-measure of the trained models on the test set; the larger these evaluation indices, the better the model. Because of the nature of the evaluation object extraction task, the accuracy index has no practical significance; precision and recall evaluate the model more effectively, and F1-measure, as the harmonic mean of precision and recall, evaluates the overall model performance.
TABLE 3 comparative experimental results
From Table 3 it can be seen that the RAN model proposed by the invention achieves the best results. In addition, a comprehensive comparison of the results of CRF, RAN-CRF and BERT + CNN + BiLSTM + CRF leads to the following conclusions: although CRF can model the dependencies of sequences, it cannot capture any context of the text, so corpus-related feature engineering is clearly necessary; and if CRF is replaced by Softmax for decoding, the labeling rules are ignored, a large number of predicted label sequences that violate the rules appear in the results, and the model effect is greatly weakened.
The comparison of BiLSTM + CRF with CNN + BiLSTM + CRF shows that social network corpus texts indeed contain a large number of unknown words that traditional word vectors cannot recognize, and that using character-level vector features as a supplement effectively alleviates this problem. The F1-measure of BERT + CNN + BiLSTM + CRF is 0.7% higher than that of CNN + BiLSTM + CRF, which shows that the BERT pre-training model provides rich prior knowledge while its internal multi-layer Transformer structure performs effective contextual semantic encoding. All models in Table 3 that use the RAN architecture achieve better results than those that do not, which shows that the RAN architecture can effectively capture the alignment between text and pictures in social network corpora and fuse the two modal vectors. The table also shows that each module selected in the invention is beneficial to improving the results.

Claims (12)

1. A multi-modal evaluation object extraction method based on a regional awareness alignment network is characterized in that a model of the method comprises a coding layer, a common attention layer and a decoding layer, the model is initialized by parameters through an Xavier method, the model respectively obtains text and picture characteristics through the coding layer, the text and picture characteristics are fused through the common attention layer to obtain a multi-modal characteristic sequence, and finally a label sequence is obtained through the multi-modal characteristic sequence through the decoding layer.
2. The method as claimed in claim 1, wherein the coding layer comprises 4 parts, namely BERT, Char-CNN, a bidirectional LSTM network and Faster R-CNN; the BERT part introduces external information, the Char-CNN part performs character-level word vector encoding, the bidirectional LSTM network captures text sequence information from the sequence obtained by concatenating the BERT encoding results with the Char-CNN encoding results, and the Faster R-CNN captures the foreground objects appearing in the picture as the corresponding picture features.
3. The method as claimed in claim 2, wherein the BERT is a BERT-base pre-training model comprising 12 Transformer layers, the output vectors of the 12 Transformer layers in BERT are averaged as the final BERT output, the obtained word vector has dimension 768, and the sentence length is 40.
4. The multi-modal assessment object extraction method based on the regional awareness alignment network as claimed in claim 2, wherein the dimension of the Char-CNN character vectors is set to 30, their initialization follows a uniform distribution over (-0.25, 0.25), and the word length is 30.
5. The method as claimed in claim 2, wherein the one-dimensional feature vectors of the N target objects identified by the Faster R-CNN are input into the network as picture features, and pictures with fewer than N detected targets are padded with zero vectors.
6. The method as claimed in claim 1, wherein the common attention layer comprises a text-oriented visual attention, a visual-oriented text attention, a gated multi-modal fusion unit and a filter gate, the text-oriented visual attention and the visual-oriented text attention fully interact with text and picture features and obtain a text attention vector and a picture attention vector at any time t of the sequence, the gated multi-modal fusion unit determines how much a final multi-modal representation is obtained from the text and the picture respectively, and the filter gate determines how to use the multi-modal features obtained in the previous step by determining how much the picture and the text in the corpus are associated.
7. The method as claimed in claim 6, wherein the picture attention feature representation at time step t is given by the following formula:

v̂_t = Σ_{i=1}^{N} α_{t,i} · v_i

where α_t is the target object weight vector at time step t, α_{t,i} is its i-th value, and v_i is the picture feature of the i-th position.
8. The method as claimed in claim 6, wherein the text attention feature representation of the visually guided text attention at time step t is given by the following formula:

ĥ_t = Σ_{j=1}^{n} β_{t,j} · h_j

where β_t is the text weight vector at time step t, β_{t,j} is its j-th value, and h_j is the text feature of the j-th position.
9. The method as claimed in claim 6, wherein the gated multi-modal fusion unit first converts the two modal vectors to the same dimension through fully connected layers and activates them with the tanh activation function, then obtains the weight g_t of the picture vector and the weight 1 - g_t of the text vector through a weight matrix and a Sigmoid activation function, and finally weights and sums the two modal vectors to obtain the multi-modal final representation m_t at time step t.
10. The method for extracting multi-modal evaluation objects based on the regional awareness alignment network as claimed in claim 6, wherein the filter gate blocks the flow of the multi-modal features when a word is not related to the picture, and passes the multi-modal features into the final representation according to the degree of correlation when it is related; the filtering process is expressed as follows:

s_t = σ(W_s · [h_t ⊕ m_t])
u_t = s_t · tanh(W_u · m_t)
r_t = [h_t ⊕ u_t]

where s_t is the filter gate with a value between 0 and 1: if the word is not associated with the picture, the filter gate prevents the flow of the multi-modal features, and if it is associated, the filter gate assembles the multi-modal features into the final representation according to the degree of correlation; u_t is the multi-modal representation filtered through the filter gate, and r_t is the final vector representation fed into the decoding layer at time step t, where W_s and W_u are the parameters to be trained, ⊕ denotes the concatenation operation, and h_t is the text hidden vector output by the coding layer.
11. The method of claim 1, wherein the decoding layer is a CRF model, and for a sequence of length n with k candidate categories per position, the CRF treats it as a single classification problem with k^n candidates, namely: for the sequence x = (x_1, …, x_n), finding the output sequence that maximizes the conditional probability P(y_1, …, y_n | x).
12. The method for extracting multi-modal evaluation objects based on the regional awareness alignment network as claimed in claim 1, wherein the loss function during model training is expressed as follows:

Loss = -(1/N) Σ log( exp(score(X, Y)) / Σ_{Y' ∈ Y_X} exp(score(X, Y')) )

where Y_X is the set of all possible tag sequences, Y is the true tag sequence, X is the input sequence, score(X, Y) denotes the score of tag sequence Y for input sequence X, and N is the size of the sample set.
CN202210352426.8A 2022-04-05 2022-04-05 Multi-modal evaluation object extraction method based on regional perception alignment network Pending CN114693949A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210352426.8A CN114693949A (en) 2022-04-05 2022-04-05 Multi-modal evaluation object extraction method based on regional perception alignment network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210352426.8A CN114693949A (en) 2022-04-05 2022-04-05 Multi-modal evaluation object extraction method based on regional perception alignment network

Publications (1)

Publication Number Publication Date
CN114693949A true CN114693949A (en) 2022-07-01

Family

ID=82143153

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210352426.8A Pending CN114693949A (en) 2022-04-05 2022-04-05 Multi-modal evaluation object extraction method based on regional perception alignment network

Country Status (1)

Country Link
CN (1) CN114693949A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117150381A (en) * 2023-08-07 2023-12-01 中国船舶集团有限公司第七〇九研究所 Target function group identification and model training method thereof


Similar Documents

Publication Publication Date Title
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
Wang et al. Application of convolutional neural network in natural language processing
WO2021233112A1 (en) Multimodal machine learning-based translation method, device, equipment, and storage medium
CN110928994B (en) Similar case retrieval method, similar case retrieval device and electronic equipment
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
WO2019153737A1 (en) Comment assessing method, device, equipment and storage medium
CN111783474B (en) Comment text viewpoint information processing method and device and storage medium
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN111488739A (en) Implicit discourse relation identification method based on multi-granularity generated image enhancement representation
CN113065577A (en) Multi-modal emotion classification method for targets
CN110287323B (en) Target-oriented emotion classification method
CN110704601A (en) Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network
Wen et al. Dynamic interactive multiview memory network for emotion recognition in conversation
CN114330354B (en) Event extraction method and device based on vocabulary enhancement and storage medium
CN112861524A (en) Deep learning-based multilevel Chinese fine-grained emotion analysis method
CN113961666B (en) Keyword recognition method, apparatus, device, medium, and computer program product
Kshirsagar et al. A review on application of deep learning in natural language processing
CN113836992A (en) Method for identifying label, method, device and equipment for training label identification model
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN117574904A (en) Named entity recognition method based on contrast learning and multi-modal semantic interaction
Yan et al. Implicit emotional tendency recognition based on disconnected recurrent neural networks
Parvin et al. Transformer-based local-global guidance for image captioning
Zeng et al. Robust multimodal sentiment analysis via tag encoding of uncertain missing modalities
CN114693949A (en) Multi-modal evaluation object extraction method based on regional perception alignment network
CN112579739A (en) Reading understanding method based on ELMo embedding and gating self-attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination