CN113836298A - Text classification method and system based on visual enhancement - Google Patents

Text classification method and system based on visual enhancement

Info

Publication number
CN113836298A
CN113836298A
Authority
CN
China
Prior art keywords
text
image
representation
global
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110894298.5A
Other languages
Chinese (zh)
Inventor
张琨
吴乐
洪日昌
汪萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202110894298.5A priority Critical patent/CN113836298A/en
Publication of CN113836298A publication Critical patent/CN113836298A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis

Abstract

The invention provides a text classification method and system based on visual enhancement, which relate to the technical field of computer vision and natural language understanding.

Description

Text classification method and system based on visual enhancement
Technical Field
The invention relates to the technical field of computer vision and natural language understanding, in particular to a text classification method and system based on visual enhancement.
Background
Text classification is a very important component of the field of natural language processing and a common means of evaluating whether the semantic representation of a sentence is accurate. It is mainly used to classify one or more given sentences, and the classification criteria differ according to the specific task. For example, emotion classification is mainly used to determine the emotion category or polarity of a given sentence, while paraphrase recognition is mainly used to determine whether two given sentences express the same semantics. The underlying technology on which this task focuses is how to characterize the semantics of the input text accurately. The semantic representation of natural language sentences is a fundamental but extremely important research topic in natural language processing and artificial intelligence in general: whether in basic information retrieval and semantic extraction or in complex question-answering and dialogue systems, the semantics of input sentences must be represented comprehensively and accurately before a machine can be expected to understand the complex human language system. With the continuous spread of pervasive computing and intelligent devices, large amounts of data have accumulated in various industries, and such large-scale data provide a solid foundation for fully understanding text semantics and for the practical application of related representation technologies.
At present, methods for text classification can be mainly classified into two categories according to the different types of data used:
1) method based on text information
The input of this method is text information only. Semantic representations of the input text are obtained directly using methods such as convolutional neural networks, recurrent neural networks, or pre-trained models, and the input text is then classified by some simple classification method.
2) Method based on image-text combination
The input of this method comprises both text information and image information. The text information is processed in a way similar to the text-only methods; image features are extracted automatically, for example by a pre-trained image convolution model; more accurate text semantic representations are then generated under the auxiliary guidance of the image information through methods such as an attention mechanism or a gate structure; finally, the representation is input into a classification layer for the specific downstream task, realizing the classification of the input text.
The above methods have achieved very good results in specific text classification tasks (emotion recognition, natural language inference, visual-textual language reasoning, etc.). In particular, the latter has attracted more and more attention by modeling and representing abstract semantics more accurately through multi-modal information. However, the existing methods, especially those based on image-text combination, still have problems: they process multimedia image information and text information relatively simply, using the image information only as auxiliary information for text semantic modeling, ignoring the complex content contained in the image information and the mutual influence between the image information and the text, so that the extracted image features introduce noise information.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the invention provides a text classification method and system based on visual enhancement, which solve the technical problem in the prior art that extracted image features introduce noise information because the complex content contained in the image information and the mutual influence between the image information and the text are ignored.
(II) technical scheme
In order to achieve the purpose, the invention is realized by the following technical scheme:
in a first aspect, the present invention provides a method for text classification based on visual enhancement, the method comprising:
s1, matching corresponding auxiliary images for the target text;
s2, acquiring a text global semantic representation vector and a text local semantic representation matrix of the target text, and acquiring an image global characteristic representation vector and an image local characteristic representation matrix of the auxiliary image;
s3, acquiring semantic representation based on image auxiliary information and image feature representation based on text semantics by using an attention mechanism based on the text global semantic representation vector, the text local semantic representation matrix, the image global feature representation vector and the image local feature representation matrix;
s4, fusing the semantic representation based on the image auxiliary information and the image characteristic representation based on the text semantic by utilizing a heuristic matching fusion method to obtain a fusion representation, and processing the fusion representation through a multilayer perceptron with a classification layer to obtain a text classification result.
Preferably, the S1 specifically includes:
searching in the image-text data set for each target text X, and extracting the corresponding image as the auxiliary image of the target text X if a description sentence in the data set is the same as the target text X; otherwise, inputting the target text X into a search engine and selecting the Top-1 image of the retrieval result as the auxiliary image.
Preferably, the S2 specifically includes:
s201, performing semantic modeling on each word of a target text, and obtaining a text global semantic expression vector and a text local semantic expression matrix by using a first pre-training model;
s202, modeling is carried out on the auxiliary image, and an image global feature representation vector and an image local feature representation matrix are generated by utilizing a second pre-training model.
Preferably, the S3 specifically includes:
s301, acquiring semantic representation based on image auxiliary information by using an attention mechanism based on a text local semantic representation matrix and an image global feature representation vector, specifically comprising the following steps:
using an attention mechanism, under the guidance of an image global feature representation vector h, selecting a semantic part which accords with the current image situation from the text representation, and extracting the semantic part as an expression of text semantics, wherein the process can be expressed as follows:
$$\gamma = \operatorname{softmax}\left(\omega^{\top}\tanh\left(WS + Uh \otimes I\right)\right),$$
$$\hat{s} = \sum_{j=1}^{m} \gamma_j s_j,$$
wherein: {ω, W, U} are parameters to be optimized in the model training process; I represents an all-ones column vector of length m, where m is the number of words in the text word sequence; Uh ⊗ I indicates that the result of Uh is repeated m times; γ represents the weight distribution corresponding to all words in the text word sequence under consideration of the image global feature h; $\hat{s}$ represents the representation vector of the text semantics under consideration of the image global features; tanh() represents a nonlinear activation function; $s_j$ is the j-th column of the matrix S and represents the vector representation of the j-th word in the input text;
s302, acquiring image feature representation based on text semantics by using an attention mechanism based on the global semantic representation and the image local feature representation matrix, specifically comprising the following steps:
using an attention mechanism, under the guidance of a text global semantic representation vector s, selecting information related to text semantics from image local features, and fusing the information to form another representation of the image features, where the process may be represented as follows:
$$\theta = \omega_o^{\top}\tanh\left(W_o H + U_o s \otimes I_o\right),$$
$$\hat{h} = \sum_{i=1}^{n} \frac{\exp(\theta_i)}{\sum_{k=1}^{n}\exp(\theta_k)}\, h_i,$$
wherein: {ω_o, W_o, U_o} represent parameters to be optimized in the model training process; $I_o$ represents an all-ones column vector of length n; n represents the number of extracted local features; θ represents the degree of correlation between the image local features and the text global semantic representation vector s, with $\theta_i$ ($\theta_k$) the degree of correlation between the i-th (k-th) image local feature and s; $\hat{h}$ represents the selection result over the image local features under consideration of the text global semantic representation; $h_i$ is the i-th column of the matrix H and represents the i-th image local feature.
Preferably, the S4 specifically includes:
fusing the obtained semantic representation and the feature representation by using a heuristic matching fusion method to obtain a fused representation; according to the fusion representation, the classification result of the target text is predicted by using two multi-layer perceptrons with classification layers, and the process can be expressed as follows:
$$a = \operatorname{ReLU}\left(\mathrm{MLP}_1([s; h; s \odot h; s - h])\right),$$
$$b = \operatorname{ReLU}\left(\mathrm{MLP}_1([\hat{s}; \hat{h}; \hat{s} \odot \hat{h}; \hat{s} - \hat{h}])\right),$$
$$P(y \mid X, I) = \mathrm{MLP}_2([a; b; a + b]),$$
$$y^{*} = \arg\max_{y} P(y \mid X, I),$$
wherein: [s; h; s ⊙ h; s − h] and [ŝ; ĥ; ŝ ⊙ ĥ; ŝ − ĥ] are both fusion representations; [;] represents the splicing (concatenation) operation; ⊙ and − represent the dot product and subtraction operations, used to measure the similarity and the difference between two variables respectively; P(y|X, I) represents the probability that the classification result is y; y* represents the final classification result of the model; $\mathrm{MLP}_1$ and $\mathrm{MLP}_2$ represent the two multi-layer perceptrons with classification layers.
Preferably, the S3 further includes:
s303, mapping the global representation vector of the text semantics, the semantic representation based on the image auxiliary information, the image global feature representation vector and the image feature representation based on the text semantics to a contrast learning space through a multi-layer perceptron MLP.
Preferably, the method further comprises:
s5, selecting a cross entropy loss function and a contrast loss function as optimization targets, and learning and optimizing the parameters in steps S2 to S3, specifically as follows:
1) cross entropy loss function:
$$\mathrm{Loss}_1 = -\frac{1}{K}\sum_{i=1}^{K} y_i \log P(y \mid X_i, I_i),$$
wherein: log() is the log likelihood function; K represents the number of samples in a training batch; $y_i$ represents the one-hot vector of the real label corresponding to the i-th sample, i.e., only the value at the index position corresponding to the real label is 1, all other positions are 0, and the vector length is the number of all labels.
2) contrast loss function:
$$\mathrm{Loss}_2 = -\frac{1}{K}\sum_{j=1}^{K} \log \frac{\exp\left(\operatorname{sim}(z_1^{(j)}, \hat{z}_1^{(j)})/\tau\right)}{\sum_{k=1}^{K} \mathbb{1}_{[k \neq j]}\exp\left(\operatorname{sim}(z_1^{(j)}, z_1^{(k)})/\tau\right)},$$
$$\mathrm{Loss}_3 = -\frac{1}{K}\sum_{j=1}^{K} \log \frac{\exp\left(\operatorname{sim}(z_2^{(j)}, \hat{z}_2^{(j)})/\tau\right)}{\sum_{k=1}^{K} \mathbb{1}_{[k \neq j]}\exp\left(\operatorname{sim}(z_2^{(j)}, z_2^{(k)})/\tau\right)},$$
wherein: the superscript (j) indexes the j-th sample in the batch; τ is a hyper-parameter used to control the strength of contrast learning; sim() represents a similarity calculation function, such as cosine similarity; $\mathbb{1}_{[k \neq j]}$ is an indicator whose value is 1 if and only if k ≠ j, and 0 otherwise;
3) on the basis of the two loss functions, a final optimization target is obtained by weighting and integrating the two functions:
$$\mathrm{Loss} = \mathrm{Loss}_1 + \lambda\,\mathrm{Loss}_2 + \mu\,\mathrm{Loss}_3,$$
wherein: λ and μ are hyper-parameters used to control the effect of the different loss functions on the final result.
In a second aspect, the present invention provides a visual enhancement based text classification system, the system comprising:
the image-text matching module is used for matching the corresponding auxiliary image for the target text;
the global representation acquisition module is used for acquiring a text global semantic representation vector and a text local semantic representation matrix of the target text and acquiring an image global characteristic representation vector and an image local characteristic representation matrix of the auxiliary image;
the attention mechanism module is used for acquiring semantic representation based on image auxiliary information and image feature representation based on text semantics by utilizing an attention mechanism based on the text global semantic representation vector, the text local semantic representation matrix, the image global feature representation vector and the image local feature representation matrix;
and the classification module is used for fusing the semantic representation based on the image auxiliary information and the image characteristic representation based on the text semantic by utilizing a heuristic matching fusion method to obtain a fusion representation, and processing the fusion representation through a multilayer perceptron with a classification layer to obtain a text classification result.
Preferably, the attention mechanism module is further configured to map the global representation vector of text semantics, the semantic representation based on image auxiliary information, the image global feature representation vector and the image feature representation based on text semantics to a contrast learning space through the multi-layer perceptron MLP.
Preferably, the system further comprises:
and the parameter optimization module is used for selecting the cross entropy loss function and the contrast loss function as optimization targets and learning and optimizing parameters in the global representation acquisition module and the attention mechanism module.
(III) advantageous effects
The invention provides a text classification method and system based on visual enhancement. Compared with the prior art, the method has the following beneficial effects:
according to the invention, the semantic representation based on the image auxiliary information and the image feature representation based on the text semantics are obtained through an attention mechanism; the complex content contained in the image information and the mutual influence between the image information and the text are fully considered, and the introduction of noise information in the process of expressing the text semantics is reduced, finally realizing accurate understanding of the text semantics and improving the accuracy of text classification.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a block diagram of a text classification method based on visual enhancement according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are clearly and completely described, and it is obvious that the described embodiments are a part of the embodiments of the present invention, but not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the application provides a text classification method and system based on visual enhancement, which solve the technical problem that existing methods ignore the complex content contained in image information and the mutual influence between image information and text; the method realizes more accurate feature extraction, reduces the introduction of noise information, and improves the accuracy of text classification.
In order to solve the technical problems, the general idea of the embodiment of the application is as follows:
the existing text classification methods mainly include text information-based methods and image-text combination-based methods. Both of these approaches have been very effective in specific text classification tasks. Especially, the method based on image-text combination realizes more accurate modeling and representation of abstract semantics by considering multi-modal information, and gets more and more attention. However, there are still some problems with the teletext based approach:
1) When using multi-modal data, the existing methods mainly depend on large-scale manually constructed data sets and do not exploit the various multi-modal data already available on the Internet, so the multi-modal data existing in the real world cannot be fully used in text classification.
2) The existing methods are relatively simple when processing multimedia image information and text information: the image information is used only as auxiliary information for text semantic modeling, while the complex content contained in the image information and the mutual influence between the image information and the text are ignored. Image information contains rich semantic information, only part of which is associated with the text semantics; if the influence of the text information is considered in the process of extracting the image features, more accurate feature extraction can be realized, more accurate auxiliary information can be provided for text modeling, and the introduction of noise information can be reduced.
Based on the above, the embodiment of the invention supplements sufficient auxiliary images for the existing text data by using the search engine, and realizes accurate modeling of the mutual influence of the text semantic representation and the auxiliary image feature information through the attention mechanism and the comparative learning, thereby reducing the introduction of noise information. Finally, accurate understanding of text semantics and accurate judgment of text classification are achieved, and the problems that the existing method is insufficient in utilization of internet large-scale data and relatively simple in processing of multimedia image information and text information are solved.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
The embodiment of the invention provides a text classification method based on visual enhancement, which comprises the following steps S1 to S4:
s1, matching corresponding auxiliary images for the target text;
s2, acquiring a text global semantic representation vector and a text local semantic representation matrix of the target text, and acquiring an image global characteristic representation vector and an image local characteristic representation matrix of the auxiliary image;
s3, acquiring semantic representation based on image auxiliary information and image feature representation based on text semantics by using an attention mechanism based on a text global semantic representation vector, a text local semantic representation matrix, an image global feature representation vector and an image local feature representation matrix;
and S4, fusing the semantic representation based on the image auxiliary information and the image characteristic representation based on the text semantic by using a heuristic matching fusion method to obtain a fusion representation, and processing the fusion representation by using a multilayer perceptron with a classification layer to obtain a text classification result.
According to the embodiment of the invention, the semantic representation based on the image auxiliary information and the image feature representation based on the text semantics are obtained through an attention mechanism; the complex content contained in the image information and the mutual influence between the image information and the text are fully considered, the introduction of noise information in the process of expressing the text semantics is reduced, and accurate understanding of the text semantics and accurate text classification are finally realized.
The individual steps are described in detail below:
in step S1, matching a corresponding auxiliary image for the target text, which is implemented as follows:
acquiring a target text and representing it in a unified mathematical form as a word sequence: X = {x_1, x_2, …, x_n}, where n denotes the length of the text sequence and x_i denotes the i-th word. For each target text X, firstly, searching is performed in an image-text data set, and if a description sentence in the data set is the same as the target text X, the corresponding image is extracted as the auxiliary image of the target text X; otherwise, the target text X is input into a search engine (Google, Baidu, etc.), and the Top-1 image of the retrieval result is selected as the auxiliary image. For convenience of subsequent description, the auxiliary image corresponding to the target text X is a three-channel image of fixed size (e.g., 227 × 227 × 3) with each pixel value between 0 and 255, and is denoted as O in the embodiment of the invention.
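For illustration, the following Python sketch summarizes this matching logic. It is a minimal sketch under stated assumptions: caption_to_image stands for a pre-built index from description sentences to images in an image-text data set, and search_top1_image stands for a wrapper around a search-engine API returning the Top-1 image; both names are hypothetical and not part of the invention.

```python
def match_auxiliary_image(target_text, caption_to_image, search_top1_image):
    """Return an auxiliary image for target_text (hypothetical sketch).

    caption_to_image: dict mapping description sentences of an image-text
        data set to their images.
    search_top1_image: callable querying a search engine (Google, Baidu,
        etc.) and returning the Top-1 image for a text query.
    """
    # Step 1: look the target text up in the image-text data set.
    if target_text in caption_to_image:
        return caption_to_image[target_text]
    # Step 2: otherwise fall back to the search engine's Top-1 result.
    return search_top1_image(target_text)
```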
In step S2, a text global semantic representation vector and a text local semantic representation matrix of the target text are acquired, and an image global feature representation vector and an image local feature representation matrix of the auxiliary image are acquired. The specific implementation process is as follows:
s201, performing semantic modeling on each word of the target text, and obtaining a text global semantic expression vector and a text local semantic expression matrix by using a first pre-training model. The method specifically comprises the following steps:
for the target text X, a special symbol "[CLS]" is first added to the front and the back of the word sequence. Next, using the dictionary V corresponding to the first pre-training model (each first pre-training model provides a corresponding dictionary), each word is replaced with its corresponding index representation, and the obtained representation is input into the existing first pre-training model (in the embodiment of the invention, BERT is taken as a representative of the first pre-training model). Assuming that the embodiment of the invention selects the outputs of the last L layers of the first pre-training model, the final vector representation of the target text can be obtained by a weighted summation over the outputs of these L layers, whose weighting parameters {α_1, α_2, …, α_L} are learned along with the training of the whole model. This process can be expressed as:
$$\left[s_l^{[CLS]};\, S_l\right] = \mathrm{BERT}\left(\mathrm{ID}([CLS; X; CLS])\right), \quad l = 1, \dots, L,$$
$$S = \sum_{l=1}^{L} \alpha_l S_l, \qquad s = \sum_{l=1}^{L} \alpha_l s_l^{[CLS]},$$
wherein $s_l^{[CLS]}$ represents the vector corresponding to the first [CLS] at layer l; $S_l$ represents the matrix representation corresponding to the word sequence of the target text at layer l; [CLS; X; CLS] indicates that two [CLS] symbols are spliced before and after the word sequence X respectively; ID() means converting the inputs to IDs, replacing each input with its corresponding index value according to the given dictionary V; S is the text local semantic representation matrix, i.e., the semantic representations output by the first pre-training model for all words in the word sequence of the target text; s is the text global semantic representation vector of the target text obtained through the first pre-training model, i.e., the weighted semantic representation of the sentence.
S202, modeling the auxiliary image, and generating an image global feature representation vector and an image local feature representation matrix by using a second pre-training model, wherein the method specifically comprises the following steps:
for the auxiliary image O, a second pre-training model for visual feature extraction is used directly for processing (in the embodiment of the invention, ResNet-50 is taken as a representative of the second pre-training model). Similar to the text processing, assuming that the embodiment of the invention selects the outputs of the last L' convolutional layers of the second pre-training model, the final local feature representation matrix of the auxiliary image can be obtained by a weighted summation over the outputs of these L' convolutional layers, whose weighting parameters {β_1, β_2, …, β_{L'}} are likewise learned along with the training of the whole model; meanwhile, the result of the last fully connected layer is selected as the global feature representation of the auxiliary image. This process can be formally expressed as:
$$H = \sum_{l=1}^{L'} \beta_l H_l, \qquad h = \mathrm{FC}\left(\mathrm{ResNet}(O)\right),$$
wherein h represents the global feature representation vector of the auxiliary image, and H represents the local feature representation matrix of the auxiliary image.
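The following sketch illustrates one plausible realization with torchvision's ResNet-50 (torchvision ≥ 0.13 is assumed for the weights API). As a simplification, only the last convolutional block supplies the n = 49 local features, whereas the description above uses a learned weighted sum over the last L' convolutional layers; the pooled output stands in for the global vector h.

```python
import torch.nn as nn
from torchvision import models

class ImageEncoder(nn.Module):
    """ResNet-50 global/local image features (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Everything up to (but excluding) average pooling and the fc layer.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, image):            # image: (B, 3, 224, 224)
        fmap = self.backbone(image)      # (B, 2048, 7, 7) feature map
        H = fmap.flatten(2)              # (B, 2048, 49): n = 49 local features
        h = self.pool(fmap).flatten(1)   # (B, 2048): global feature vector h
        return h, H
```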
In step S3, based on the text global semantic representation vector, the text local semantic representation matrix, the image global feature representation vector, and the image local feature representation matrix, an attention mechanism is used to obtain semantic representation based on image auxiliary information and image feature representation based on text semantics, which is implemented as follows:
s301, semantic representation based on image auxiliary information is obtained by using an attention mechanism based on the text local semantic representation matrix and the image global feature representation vector. The method specifically comprises the following steps:
in the embodiment of the invention, an attention mechanism is selected: under the guidance of the image global feature representation vector h, the semantic part that accords with the current image situation (i.e., the part most relevant to the current image) is selected from the text representation and extracted as the expression of the text semantics. This process can be expressed as follows:
$$\gamma = \operatorname{softmax}\left(\omega^{\top}\tanh\left(WS + Uh \otimes I\right)\right),$$
$$\hat{s} = \sum_{j=1}^{m} \gamma_j s_j,$$
wherein: {ω, W, U} are parameters to be optimized in the model training process; I represents an all-ones column vector of length m, where m is the number of words in the text word sequence; Uh ⊗ I indicates that the result of Uh is repeated m times; γ represents the weight distribution corresponding to all words in the text word sequence under consideration of the image global feature h; $\hat{s}$ represents the representation vector of the text semantics under consideration of the image global features; tanh() represents a nonlinear activation function; $s_j$ is the j-th column of the matrix S and represents the vector representation of the j-th word in the input text.
And S302, acquiring image feature representation based on text semantics by using an attention mechanism based on the global semantic representation and the image local feature representation matrix. The method specifically comprises the following steps:
the embodiment of the invention selects an attention mechanism: under the guidance of the text global semantic representation vector s, information related to the text semantics is selected from the image local features and fused to form another representation of the image features. This process can be expressed in the form:
$$\theta = \omega_o^{\top}\tanh\left(W_o H + U_o s \otimes I_o\right),$$
$$\hat{h} = \sum_{i=1}^{n} \frac{\exp(\theta_i)}{\sum_{k=1}^{n}\exp(\theta_k)}\, h_i,$$
wherein: {ω_o, W_o, U_o} denote parameters to be optimized in the model training process; $I_o$ represents an all-ones column vector of length n; n represents the number of extracted local features; θ represents the degree of correlation between the image local features and the text global semantic representation vector s, with $\theta_i$ ($\theta_k$) the degree of correlation between the i-th (k-th) image local feature and s; $\hat{h}$ represents the selection result over the image local features under consideration of the text global semantic representation; $h_i$ is the i-th column of the matrix H and represents the i-th image local feature.
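Both attention steps share the same functional form: a weight distribution over column features, guided by a global vector from the other modality. A minimal PyTorch sketch of this shared form follows; the dimension names are assumptions, and the module is applied once per direction (S301 and S302).

```python
import torch
import torch.nn as nn

class GuidedAttention(nn.Module):
    """gamma = softmax(w^T tanh(W F + U g)) over the columns of F, then a
    weighted sum; matches the shared form of S301 and S302 (sketch)."""

    def __init__(self, feat_dim, guide_dim, att_dim):
        super().__init__()
        self.W = nn.Linear(feat_dim, att_dim, bias=False)
        self.U = nn.Linear(guide_dim, att_dim, bias=False)
        self.w = nn.Linear(att_dim, 1, bias=False)

    def forward(self, F, g):
        # F: (B, m, feat_dim) local features; g: (B, guide_dim) global guide.
        # Broadcasting U(g) over the m positions plays the role of "Uh ⊗ I".
        scores = self.w(torch.tanh(self.W(F) + self.U(g).unsqueeze(1)))  # (B, m, 1)
        gamma = torch.softmax(scores, dim=1)     # weight per position
        return (gamma * F).sum(dim=1)            # (B, feat_dim) selected summary

# S301: s_hat = GuidedAttention(d_text, d_img, d_att)(S_words, h)
# S302: h_hat = GuidedAttention(d_img, d_text, d_att)(H_regions, s)
```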
through the above processing, the embodiment of the invention obtains the text global semantic representation vector s, the semantic representation $\hat{s}$ based on the image auxiliary information, the image global feature representation vector h, and the image feature representation $\hat{h}$ based on the text semantics.
In the embodiment of the present invention, in order to perform self-supervised parameter learning, the embodiment of the present invention further includes:
s303, after respectively obtaining the text global semantic representation vector s, the semantic representation $\hat{s}$ based on the image auxiliary information, the image global feature representation vector h, and the image feature representation $\hat{h}$ based on the text semantics, the embodiment of the invention maps these representations to a contrast learning space through a multi-layer perceptron (MLP), which can be expressed as:
$$z_1 = \operatorname{ReLU}(\mathrm{MLP}(s)), \qquad \hat{z}_1 = \operatorname{ReLU}(\mathrm{MLP}(\hat{s})),$$
$$z_2 = \operatorname{ReLU}(\mathrm{MLP}(h)), \qquad \hat{z}_2 = \operatorname{ReLU}(\mathrm{MLP}(\hat{h})),$$
wherein ReLU() represents a nonlinear activation function, and $z_1$, $\hat{z}_1$, $z_2$, $\hat{z}_2$ respectively represent the vectors mapped into the contrast learning space, laying the foundation for the contrast learning in the objective function.
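As a brief illustration, the projection into the contrast learning space can be sketched as follows; the layer sizes are assumptions (768/2048 follow the BERT/ResNet-50 sketches above), and separate heads are used for the text-side and image-side representations since their dimensions generally differ.

```python
import torch.nn as nn

# Projection heads into the contrast learning space (z = ReLU(MLP(x))).
proj_text = nn.Sequential(nn.Linear(768, 256), nn.ReLU())    # for s and s_hat
proj_image = nn.Sequential(nn.Linear(2048, 256), nn.ReLU())  # for h and h_hat

# z1, z1_hat = proj_text(s), proj_text(s_hat)
# z2, z2_hat = proj_image(h), proj_image(h_hat)
```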
In step S4, a heuristic matching fusion method is used to fuse the semantic representation based on the image auxiliary information and the image feature representation based on the text semantic to obtain a fusion representation, and the fusion representation is processed by a multi-layer perceptron with a classification layer to obtain a classification result of the text, wherein the specific implementation process is as follows:
in this embodiment, the heuristic matching fusion method is first used to fuse the obtained semantic representation and feature representation; two multi-layer perceptrons with classification layers ($\mathrm{MLP}_1$ and $\mathrm{MLP}_2$) are then used to predict the classification result of the target text. This process can be expressed as:
$$a = \operatorname{ReLU}\left(\mathrm{MLP}_1([s; h; s \odot h; s - h])\right),$$
$$b = \operatorname{ReLU}\left(\mathrm{MLP}_1([\hat{s}; \hat{h}; \hat{s} \odot \hat{h}; \hat{s} - \hat{h}])\right),$$
$$P(y \mid X, I) = \mathrm{MLP}_2([a; b; a + b]),$$
$$y^{*} = \arg\max_{y} P(y \mid X, I),$$
wherein: [s; h; s ⊙ h; s − h] and [ŝ; ĥ; ŝ ⊙ ĥ; ŝ − ĥ] are both fusion representations; [;] represents the splicing (concatenation) operation; ⊙ and − represent the dot product and subtraction operations, used to measure the similarity and the difference between two variables respectively; P(y|X, I) represents the probability that the classification result is y; y* represents the final classification result of the model.
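A minimal PyTorch sketch of the heuristic matching fusion and the two classification perceptrons follows. It assumes all four input vectors share one dimension (e.g., after projection), and it realizes MLP_1 and MLP_2 as single linear layers; deeper perceptrons could be substituted.

```python
import torch
import torch.nn as nn

class HeuristicFusionClassifier(nn.Module):
    """Heuristic matching fusion followed by two classification MLPs (sketch)."""

    def __init__(self, dim, hidden, num_classes):
        super().__init__()
        self.mlp1 = nn.Linear(4 * dim, hidden)           # shared MLP_1
        self.mlp2 = nn.Linear(3 * hidden, num_classes)   # MLP_2 with classification layer

    @staticmethod
    def heuristic_match(u, v):
        # [u; v; u * v; u - v]: concatenation, similarity, difference.
        return torch.cat([u, v, u * v, u - v], dim=-1)

    def forward(self, s, h, s_hat, h_hat):
        a = torch.relu(self.mlp1(self.heuristic_match(s, h)))
        b = torch.relu(self.mlp1(self.heuristic_match(s_hat, h_hat)))
        logits = self.mlp2(torch.cat([a, b, a + b], dim=-1))
        return torch.log_softmax(logits, dim=-1)         # log P(y | X, I)
```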
In the embodiment of the invention, in order to optimize the parameters involved in obtaining the global semantic representation of the target text and the image global feature representation vector of the auxiliary image, and in obtaining the semantic representation based on the image auxiliary information and the image feature representation based on the text semantics with the attention mechanism, the embodiment of the invention further includes:
s5, learning and optimizing parameters in the steps S2-S3 by using the selected cross entropy loss function and the selected contrast loss function as optimization targets. The method comprises the following specific steps:
1) Cross entropy loss function: the text classification task is regarded as a classification task, and the cross entropy loss function of the classification task is selected as an optimization target, which can be expressed as:
$$\mathrm{Loss}_1 = -\frac{1}{K}\sum_{i=1}^{K} y_i \log P(y \mid X_i, I_i),$$
wherein: log() is the log likelihood function; K represents the number of samples in a training batch; $y_i$ represents the one-hot vector of the real label corresponding to the i-th sample, i.e., only the value at the index position corresponding to the real label is 1, all other positions are 0, and the vector length is the number of all labels.
2) Contrast loss function: in order to ensure that the semantic representation obtained from the text portion and the feature representation obtained from the image information portion are as similar as possible (they are two different representation views of the same sample), the embodiment of the invention selects the InfoNCE loss function as the optimization target of the contrast learning process, which is specifically expressed as follows:
$$\mathrm{Loss}_2 = -\frac{1}{K}\sum_{j=1}^{K} \log \frac{\exp\left(\operatorname{sim}(z_1^{(j)}, \hat{z}_1^{(j)})/\tau\right)}{\sum_{k=1}^{K} \mathbb{1}_{[k \neq j]}\exp\left(\operatorname{sim}(z_1^{(j)}, z_1^{(k)})/\tau\right)},$$
$$\mathrm{Loss}_3 = -\frac{1}{K}\sum_{j=1}^{K} \log \frac{\exp\left(\operatorname{sim}(z_2^{(j)}, \hat{z}_2^{(j)})/\tau\right)}{\sum_{k=1}^{K} \mathbb{1}_{[k \neq j]}\exp\left(\operatorname{sim}(z_2^{(j)}, z_2^{(k)})/\tau\right)},$$
wherein: the superscript (j) indexes the j-th sample in the batch; τ is a hyper-parameter used to control the strength of contrast learning; sim() represents a similarity calculation function, such as cosine similarity; $\mathbb{1}_{[k \neq j]}$ is an indicator whose value is 1 if and only if k ≠ j, and 0 otherwise.
3) On the basis of the two loss functions, the final optimization target of the embodiment of the invention is obtained by weighting and integrating the two functions together:
$$\mathrm{Loss} = \mathrm{Loss}_1 + \lambda\,\mathrm{Loss}_2 + \mu\,\mathrm{Loss}_3,$$
wherein: λ and μ are hyper-parameters used to control the effect of the different loss functions on the final result.
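A minimal PyTorch sketch of the combined objective follows, with the InfoNCE denominator restricted to in-batch negatives of the same view, matching the indicator 1[k ≠ j] above; the values of τ, λ, and μ are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(z, z_hat, tau=0.1):
    """InfoNCE over a batch: positives are (z_j, z_hat_j); negatives are the
    other in-batch projections z_k with k != j (the indicator 1[k != j])."""
    z, z_hat = F.normalize(z, dim=-1), F.normalize(z_hat, dim=-1)
    pos = (z * z_hat).sum(-1) / tau                 # sim(z_j, z_hat_j) / tau
    sim = z @ z.t() / tau                           # sim(z_j, z_k) / tau
    diag = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    neg = torch.logsumexp(sim.masked_fill(diag, float("-inf")), dim=-1)
    return (neg - pos).mean()                       # mean over the batch

def total_loss(log_probs, labels, z1, z1_hat, z2, z2_hat, lam=0.5, mu=0.5):
    loss1 = F.nll_loss(log_probs, labels)           # cross entropy (Loss_1)
    loss2 = info_nce(z1, z1_hat)                    # text-side contrast (Loss_2)
    loss3 = info_nce(z2, z2_hat)                    # image-side contrast (Loss_3)
    return loss1 + lam * loss2 + mu * loss3
```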
It should be noted that, in the embodiment of the present invention, the parameters in steps S2 to S3, such as the weighting parameters {α_1, α_2, …, α_L} and {β_1, β_2, …, β_{L'}}, and the attention parameters {ω, W, U} and {ω_o, W_o, U_o}, are learned and optimized with the cross entropy loss function and the contrast loss function of step S5 as the optimization targets. When the loss value of the loss function reaches a preset value, the parameters of the current training process are stored; after the parameters are stored, texts to be classified are classified based on visual enhancement through steps S1 to S4. That is, S1 to S5 constitute the training process of the complete model, and steps S1 to S4 constitute the operation process of the model, wherein S303 need not be executed during the operation of the model.
The embodiment of the invention also provides a text classification system based on visual enhancement, which comprises:
and the image-text matching module is used for matching the corresponding auxiliary image for the target text.
And the global representation acquisition module is used for acquiring a text global semantic representation vector and a text local semantic representation matrix of the target text and acquiring an image global characteristic representation vector and an image local characteristic representation matrix of the auxiliary image.
The attention mechanism module is used for acquiring semantic representation based on image auxiliary information and image feature representation based on text semantics by utilizing an attention mechanism based on the text global semantic representation vector, the text local semantic representation matrix, the image global feature representation vector and the image local feature representation matrix; and mapping a global representation vector of text semantics, a semantic representation based on image auxiliary information, an image global feature representation vector and an image feature representation based on the text semantics to a contrast learning space through a multi-layer perceptron MLP.
And the classification module is used for fusing the semantic representation based on the image auxiliary information and the image characteristic representation based on the text semantic by utilizing a heuristic matching fusion method to obtain a fusion representation, and processing the fusion representation through a multilayer perceptron with a classification layer to obtain a text classification result.
And the parameter optimization module is used for learning and optimizing parameters in the global representation acquisition module and the attention mechanism module by using the selected cross entropy loss function and the comparison loss function as optimization targets.
It can be understood that the text classification system based on visual enhancement provided by the embodiment of the present invention corresponds to the text classification method based on visual enhancement, and the explanation, examples, and beneficial effects of the relevant contents thereof can refer to the corresponding contents in the text classification method based on visual enhancement, which are not repeated herein.
In summary, compared with the prior art, the method has the following beneficial effects:
1. According to the embodiment of the invention, the semantic representation based on the image auxiliary information and the image feature representation based on the text semantics are obtained through an attention mechanism; the complex content contained in the image information and the mutual influence between the image information and the text are fully considered, the introduction of noise information in the process of expressing the text semantics is reduced, and accurate understanding of the text semantics and accurate text classification are finally realized.
2. The embodiment of the invention makes full use of the large-scale unlabeled image-text multimedia data in the real world: sufficient auxiliary images are supplemented for the existing text data by means of a search engine, which enriches the auxiliary images, provides a large amount of data for the training and optimization of the model, and further improves the accuracy of text classification.
3. The embodiment of the invention introduces contrast learning, realizes multi-angle expression of the text semantics, and represents the semantic information of the text under a specific situation more comprehensively and accurately.
It should be noted that, through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for text classification based on visual enhancement, the method comprising:
s1, matching corresponding auxiliary images for the target text;
s2, acquiring a text global semantic representation vector and a text local semantic representation matrix of the target text, and acquiring an image global characteristic representation vector and an image local characteristic representation matrix of the auxiliary image;
s3, acquiring semantic representation based on image auxiliary information and image feature representation based on text semantics by using an attention mechanism based on the text global semantic representation vector, the text local semantic representation matrix, the image global feature representation vector and the image local feature representation matrix;
s4, fusing the semantic representation based on the image auxiliary information and the image characteristic representation based on the text semantic by utilizing a heuristic matching fusion method to obtain a fusion representation, and processing the fusion representation through a multilayer perceptron with a classification layer to obtain a text classification result.
2. The method for text classification based on visual enhancement as claimed in claim 1, wherein the S1 specifically includes:
searching in the image-text data set for each target text X, and extracting the corresponding image as the auxiliary image of the target text X if a description sentence in the data set is the same as the target text X; otherwise, inputting the target text X into a search engine and selecting the Top-1 image of the retrieval result as the auxiliary image.
3. The method for text classification based on visual enhancement as claimed in claim 1, wherein the S2 specifically includes:
s201, performing semantic modeling on each word of a target text, and obtaining a text global semantic expression vector and a text local semantic expression matrix by using a first pre-training model;
s202, modeling is carried out on the auxiliary image, and an image global feature representation vector and an image local feature representation matrix are generated by utilizing a second pre-training model.
4. The method for text classification based on visual enhancement as claimed in any one of claims 1 to 3, wherein the S3 specifically comprises:
s301, acquiring semantic representation based on image auxiliary information by using an attention mechanism based on a text local semantic representation matrix and an image global feature representation vector, specifically comprising the following steps:
using an attention mechanism, under the guidance of an image global feature representation vector h, selecting a semantic part which accords with the current image situation from the text representation, and extracting the semantic part as an expression of text semantics, wherein the process can be expressed as follows:
$$\gamma = \operatorname{softmax}\left(\omega^{\top}\tanh\left(WS + Uh \otimes I\right)\right),$$
$$\hat{s} = \sum_{j=1}^{m} \gamma_j s_j,$$
wherein: {ω, W, U} are parameters to be optimized in the model training process; I represents an all-ones column vector of length m, where m is the number of words in the text word sequence; Uh ⊗ I indicates that the result of Uh is repeated m times; γ represents the weight distribution corresponding to all words in the text word sequence under consideration of the image global feature h; $\hat{s}$ represents the representation vector of the text semantics under consideration of the image global features; tanh() represents a nonlinear activation function; $s_j$ is the j-th column of the matrix S and represents the vector representation of the j-th word in the input text;
s302, acquiring image feature representation based on text semantics by using an attention mechanism based on the global semantic representation and the image local feature representation matrix, specifically comprising the following steps:
using an attention mechanism, under the guidance of a text global semantic representation vector s, selecting information related to text semantics from image local features, and fusing the information to form another representation of the image features, where the process may be represented as follows:
$$\theta = \omega_o^{\top}\tanh\left(W_o H + U_o s \otimes I_o\right),$$
$$\hat{h} = \sum_{i=1}^{n} \frac{\exp(\theta_i)}{\sum_{k=1}^{n}\exp(\theta_k)}\, h_i,$$
wherein: {ω_o, W_o, U_o} represent parameters to be optimized in the model training process; $I_o$ represents an all-ones column vector of length n; n represents the number of extracted local features; θ represents the degree of correlation between the image local features and the text global semantic representation vector s, with $\theta_i$ ($\theta_k$) the degree of correlation between the i-th (k-th) image local feature and s; $\hat{h}$ represents the selection result over the image local features under consideration of the text global semantic representation; $h_i$ is the i-th column of the matrix H and represents the i-th image local feature.
5. The method for text classification based on visual enhancement as claimed in any one of claims 1 to 3, wherein the S4 specifically comprises:
fusing the obtained semantic representation and the feature representation by using a heuristic matching fusion method to obtain a fused representation; according to the fusion representation, the classification result of the target text is predicted by using two multi-layer perceptrons with classification layers, and the process can be expressed as follows:
$$a = \operatorname{ReLU}\left(\mathrm{MLP}_1([s; h; s \odot h; s - h])\right),$$
$$b = \operatorname{ReLU}\left(\mathrm{MLP}_1([\hat{s}; \hat{h}; \hat{s} \odot \hat{h}; \hat{s} - \hat{h}])\right),$$
$$P(y \mid X, I) = \mathrm{MLP}_2([a; b; a + b]),$$
$$y^{*} = \arg\max_{y} P(y \mid X, I),$$
wherein: [s; h; s ⊙ h; s − h] and [ŝ; ĥ; ŝ ⊙ ĥ; ŝ − ĥ] are both fusion representations; [;] represents the splicing (concatenation) operation; ⊙ and − represent the dot product and subtraction operations, used to measure the similarity and the difference between two variables respectively; P(y|X, I) represents the probability that the classification result is y; y* represents the final classification result of the model; $\mathrm{MLP}_1$ and $\mathrm{MLP}_2$ represent the two multi-layer perceptrons with classification layers.
6. The visual enhancement-based text classification method of claim 4, wherein the S3 further comprises:
s303, mapping the global representation vector of the text semantics, the semantic representation based on the image auxiliary information, the image global feature representation vector and the image feature representation based on the text semantics to a contrast learning space through a multi-layer perceptron MLP.
7. The visual enhancement-based text classification method of claim 6, wherein the method further comprises:
s5, selecting a cross entropy loss function and a contrast loss function as optimization targets, and learning and optimizing parameters in the steps S2-S3, wherein the parameters are as follows:
1) cross entropy loss function:
$$\mathrm{Loss}_1 = -\frac{1}{K}\sum_{i=1}^{K} y_i \log P(y \mid X_i, I_i),$$
wherein: log() is the log likelihood function; K represents the number of samples in a training batch; $y_i$ represents the one-hot vector of the real label corresponding to the i-th sample, i.e., only the value at the index position corresponding to the real label is 1, all other positions are 0, and the vector length is the number of all labels;
2) contrast loss function:
$$\mathrm{Loss}_2 = -\frac{1}{K}\sum_{j=1}^{K} \log \frac{\exp\left(\operatorname{sim}(z_1^{(j)}, \hat{z}_1^{(j)})/\tau\right)}{\sum_{k=1}^{K} \mathbb{1}_{[k \neq j]}\exp\left(\operatorname{sim}(z_1^{(j)}, z_1^{(k)})/\tau\right)},$$
$$\mathrm{Loss}_3 = -\frac{1}{K}\sum_{j=1}^{K} \log \frac{\exp\left(\operatorname{sim}(z_2^{(j)}, \hat{z}_2^{(j)})/\tau\right)}{\sum_{k=1}^{K} \mathbb{1}_{[k \neq j]}\exp\left(\operatorname{sim}(z_2^{(j)}, z_2^{(k)})/\tau\right)},$$
wherein: $z_1$, $\hat{z}_1$, $z_2$, $\hat{z}_2$ respectively represent the vectors mapped into the contrast learning space, and the superscript (j) indexes the j-th sample in the batch; τ is a hyper-parameter used to control the strength of contrast learning; sim() represents a similarity calculation function; $\mathbb{1}_{[k \neq j]}$ is an indicator whose value is 1 if and only if k ≠ j, and 0 otherwise;
3) on the basis of obtaining the two loss functions, a final optimization target is obtained by weighting and integrating the two functions:
$$\mathrm{Loss} = \mathrm{Loss}_1 + \lambda\,\mathrm{Loss}_2 + \mu\,\mathrm{Loss}_3,$$
wherein: λ and μ are hyper-parameters used to control the effect of the different loss functions on the final result.
8. A system for text classification based on visual enhancement, the system comprising:
the image-text matching module is used for matching the corresponding auxiliary image for the target text;
the global representation acquisition module is used for acquiring a text global semantic representation vector and a text local semantic representation matrix of the target text and acquiring an image global characteristic representation vector and an image local characteristic representation matrix of the auxiliary image;
the attention mechanism module is used for acquiring semantic representation based on image auxiliary information and image feature representation based on text semantics by utilizing an attention mechanism based on the text global semantic representation vector, the text local semantic representation matrix, the image global feature representation vector and the image local feature representation matrix;
and the classification module is used for fusing the semantic representation based on the image auxiliary information and the image characteristic representation based on the text semantic by utilizing a heuristic matching fusion method to obtain a fusion representation, and processing the fusion representation through a multilayer perceptron with a classification layer to obtain a text classification result.
9. The visual enhancement-based text classification system according to claim 8, wherein the attention mechanism module is further configured to map a global representation vector of text semantics, a semantic representation based on image side information, an image global feature representation vector, and an image feature representation based on text semantics to a contrast learning space by a multi-layered perceptron MLP.
10. The visual enhancement-based text classification system of claim 9, wherein the system further comprises:
and the parameter optimization module is used for selecting the cross entropy loss function and the contrast loss function as optimization targets and learning and optimizing parameters in the global representation acquisition module and the attention mechanism module.
CN202110894298.5A 2021-08-05 2021-08-05 Text classification method and system based on visual enhancement Pending CN113836298A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110894298.5A CN113836298A (en) 2021-08-05 2021-08-05 Text classification method and system based on visual enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110894298.5A CN113836298A (en) 2021-08-05 2021-08-05 Text classification method and system based on visual enhancement

Publications (1)

Publication Number Publication Date
CN113836298A true CN113836298A (en) 2021-12-24

Family

ID=78962956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110894298.5A Pending CN113836298A (en) 2021-08-05 2021-08-05 Text classification method and system based on visual enhancement

Country Status (1)

Country Link
CN (1) CN113836298A (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115293109A (en) * 2022-08-03 2022-11-04 合肥工业大学 Text image generation method and system based on fine-grained semantic fusion
CN115293109B (en) * 2022-08-03 2024-03-19 合肥工业大学 Text image generation method and system based on fine granularity semantic fusion
CN115187996B (en) * 2022-09-09 2023-01-06 中电科新型智慧城市研究院有限公司 Semantic recognition method and device, terminal equipment and storage medium
CN115187996A (en) * 2022-09-09 2022-10-14 中电科新型智慧城市研究院有限公司 Semantic recognition method and device, terminal equipment and storage medium
CN115761273A (en) * 2023-01-10 2023-03-07 苏州浪潮智能科技有限公司 Visual common sense reasoning method and device, storage medium and electronic equipment
CN116342332A (en) * 2023-05-31 2023-06-27 合肥工业大学 Auxiliary judging method, device, equipment and storage medium based on Internet
CN116701637B (en) * 2023-06-29 2024-03-08 中南大学 Zero sample text classification method, system and medium based on CLIP
CN116701637A (en) * 2023-06-29 2023-09-05 中南大学 Zero sample text classification method, system and medium based on CLIP
CN116777400A (en) * 2023-08-21 2023-09-19 江苏海外集团国际工程咨询有限公司 Engineering consultation information whole-flow management system and method based on deep learning
CN116777400B (en) * 2023-08-21 2023-10-31 江苏海外集团国际工程咨询有限公司 Engineering consultation information whole-flow management system and method based on deep learning
CN117150436A (en) * 2023-10-31 2023-12-01 上海大智慧财汇数据科技有限公司 Multi-mode self-adaptive fusion topic identification method and system
CN117150436B (en) * 2023-10-31 2024-01-30 上海大智慧财汇数据科技有限公司 Multi-mode self-adaptive fusion topic identification method and system
CN117195903B (en) * 2023-11-07 2024-01-23 北京新广视通科技集团有限责任公司 Generating type multi-mode entity relation extraction method and system based on noise perception
CN117195903A (en) * 2023-11-07 2023-12-08 北京新广视通科技集团有限责任公司 Generating type multi-mode entity relation extraction method and system based on noise perception
CN117493568A (en) * 2023-11-09 2024-02-02 中安启成科技有限公司 End-to-end software function point extraction and identification method
CN117493568B (en) * 2023-11-09 2024-04-19 中安启成科技有限公司 End-to-end software function point extraction and identification method
CN117435739A (en) * 2023-12-21 2024-01-23 深圳须弥云图空间科技有限公司 Image text classification method and device
CN117435739B (en) * 2023-12-21 2024-03-15 深圳须弥云图空间科技有限公司 Image text classification method and device

Similar Documents

Publication Publication Date Title
CN113836298A (en) Text classification method and system based on visual enhancement
CN111026842B (en) Natural language processing method, natural language processing device and intelligent question-answering system
CN109344404B (en) Context-aware dual-attention natural language reasoning method
CN110569508A (en) Method and system for classifying emotional tendencies by fusing part-of-speech and self-attention mechanism
Li et al. A method of emotional analysis of movie based on convolution neural network and bi-directional LSTM RNN
Zhang et al. Relation classification via BiLSTM-CNN
Zhao et al. ZYJ123@ DravidianLangTech-EACL2021: Offensive language identification based on XLM-RoBERTa with DPCNN
CN110750998B (en) Text output method, device, computer equipment and storage medium
CN111275046A (en) Character image recognition method and device, electronic equipment and storage medium
Yang et al. Meta captioning: A meta learning based remote sensing image captioning framework
CN112434142B (en) Method for marking training sample, server, computing equipment and storage medium
CN112765974B (en) Service assistance method, electronic equipment and readable storage medium
CN114330354A (en) Event extraction method and device based on vocabulary enhancement and storage medium
CN112988970A (en) Text matching algorithm serving intelligent question-answering system
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN115658905A (en) Cross-chapter multi-dimensional event image generation method
Rogachev et al. Building artificial neural networks for NLP analysis and classification of target content
Yao Attention-based BiLSTM neural networks for sentiment classification of short texts
Li et al. Detecting relevant differences between similar legal texts
Ermatita et al. Sentiment Analysis of COVID-19 using Multimodal Fusion Neural Networks.
CN113836934A (en) Text classification method and system based on label information enhancement
CN115906824A (en) Text fine-grained emotion analysis method, system, medium and computing equipment
CN115545030A (en) Entity extraction model training method, entity relation extraction method and device
CN113157880B (en) Element content obtaining method, device, equipment and storage medium
Bilah et al. Intent detection on indonesian text using convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination