CN117195903A - Noise-aware generative multimodal entity-relation extraction method and system - Google Patents

Noise-aware generative multimodal entity-relation extraction method and system

Info

Publication number
CN117195903A
CN117195903A · CN202311469190.7A · CN202311469190A
Authority
CN
China
Prior art keywords
text
image
representing
noise
language model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311469190.7A
Other languages
Chinese (zh)
Other versions
CN117195903B (en)
Inventor
吴艳
杨欣洁
李志慧
李阳
徐雅静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xinguangshitong Technology Group Co ltd
Original Assignee
Beijing Xinguangshitong Technology Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xinguangshitong Technology Group Co ltd filed Critical Beijing Xinguangshitong Technology Group Co ltd
Priority to CN202311469190.7A
Publication of CN117195903A
Application granted
Publication of CN117195903B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a noise-aware generative multimodal entity-relation extraction method and system, belonging to the multimodal technical field and comprising the following steps: based on the acquired options, text, and image, obtain the instruction, the vector representation of the instruction, and the image features using a language-model embedding layer, a CLIP visual encoder, and a linear layer, respectively; construct a noise-aware text-image contrastive learning module; perform image-text alignment and image-text fusion to obtain the contrastive learning loss and the image-text fusion instruction; process the image-text fusion instruction with the language model's attention mechanism and train the language model to obtain its cross-entropy loss; obtain the total noise-aware entity-relation extraction loss; and extract entity relations by minimizing that total loss to obtain the entity-relation extraction result. The invention addresses the difficulty existing methods have in handling challenging examples while preserving semantic transfer capability.

Description

Noise-aware generative multimodal entity-relation extraction method and system
Technical Field
The invention belongs to the multimodal technical field, and in particular relates to a noise-aware generative multimodal entity-relation extraction method and system.
Background
Multimodal named entity recognition (MNER) and multimodal relation extraction (MRE) aim to extract the required information with the help of additional image input, and play an important role in fields such as knowledge-graph construction and reading comprehension. Many existing techniques complete the MNER and MRE tasks separately. However, handling the two tasks in isolation ignores the interaction between them, which motivated joint multimodal entity-relation extraction.
Owing to the complexity and diversity of multimedia information, multimodal extraction often suffers from text entities that do not match the visual objects exactly. Existing multimodal extraction methods typically exploit visual objects by selecting the most salient ones with high confidence scores, which may introduce noise from irrelevant or redundant objects. Furthermore, these methods focus on designing graph alignments across modalities, mapping objects to entities between visual and textual graphs; but aligning visual and textual information across graphs rests on the assumption that text entities and visual objects largely agree. Noise in the visual information therefore risks inaccurate entity-relation extraction for the whole model.
To fully exploit the bidirectional interaction between the two tasks, prior work extracts one or more entity-relation triples together with the related entity types. Other studies use word-pair relation labels to jointly classify the relations between entities and the types of the entities involved; although this avoids the error propagation of pipeline frameworks, the semantic information of the given entity types and relation labels is still not used effectively.
Disclosure of Invention
To overcome the above shortcomings of the prior art, the noise-aware generative multimodal entity-relation extraction method and system provided by the invention incorporate visual information into an instruction to guide the generation of multiple entity-relation pairs; the method is further trained end to end, and noise-aware contrastive learning reduces interference from visual-modality noise, addressing the difficulty existing methods have in handling challenging examples while preserving semantic transfer capability.
To achieve the above object, the invention adopts the following technical scheme:
In one aspect, the noise-aware generative multimodal entity-relation extraction method provided by the invention comprises the following steps:
S1, based on the acquired options, text, and image, obtain the instruction, the vector representation of the instruction, and the image features using a language-model embedding layer, a CLIP visual encoder, and a linear layer, respectively;
S2, construct a noise-aware text-image contrastive learning module;
S3, based on the instruction, the vector representation of the instruction, and the image features, perform image-text alignment and image-text fusion with the noise-aware text-image contrastive learning module to obtain the contrastive learning loss and the image-text fusion instruction;
S4, process the image-text fusion instruction with the language model's attention mechanism and train the language model to obtain the language model's cross-entropy loss;
S5, obtain the total noise-aware entity-relation extraction loss from the contrastive learning loss and the language model's cross-entropy loss;
S6, minimize the total noise-aware entity-relation extraction loss and perform entity-relation extraction based on it to obtain the entity-relation extraction result.
The beneficial effects of the invention are as follows: the noise-aware generative multimodal entity-relation extraction method incorporates visual information into the instruction and can extract multiple entity-relation pairs at once, so that the language model understands the additional visual information and makes full use of the semantic information of the labels; for the noise produced by inconsistent text-image pairs, the invention designs a corresponding noise-aware contrastive learning module to reduce interference from visual-modality noise; and text features and image features can be adjusted dynamically according to the degree of consistency between the acquired text and image, so the scheme handles challenging examples effectively while preserving semantic transfer capability.
Further, step S1 comprises the following steps:
S11, sequentially connect the acquired options, a preset image placeholder, and the acquired text using the language-model embedding layer to obtain the instruction:

$E = [O; IS; T]$

where $E$ denotes the instruction, $O$ the acquired options, $IS$ the preset image placeholder, and $T$ the acquired text;
S12, obtain the vector representation of the instruction with the pretrained language model FlanT5 via instruction fine-tuning:

$X = [x_O^{1}, \dots, x_O^{lo}, x_S^{1}, \dots, x_S^{ls}, x_T^{1}, \dots, x_T^{lx}] \in \mathbb{R}^{l \times hs}, \quad l = lo + ls + lx$

where $X$ denotes the vector representation of the instruction, $l$ the number of elements of the vectorized instruction, $x_O$ the vector representation of the options, $lo$ the number of elements of the vectorized options, $x_O^{lo}$ the $lo$-th element of the options' vector representation, $x_S$ the vector representation of the image placeholder, $ls$ the number of elements of the vectorized image placeholder, $x_S^{ls}$ the $ls$-th element of the image placeholder's vector representation, $x_T$ the vector representation of the text, $lx$ the number of elements of the vectorized text, $x_T^{lx}$ the $lx$-th element of the text's vector representation, and $hs$ the dimension of the embedding vectors;
S13, encode the acquired image with the CLIP visual encoder to obtain the hidden state of the image:

$H_I = \mathrm{CLIP}(I) \in \mathbb{R}^{li \times d_1}$

where $H_I$ denotes the hidden state of the image, $li$ the number of image patches, $d_1$ the first feature size of the image patches, $\mathrm{CLIP}$ the CLIP visual encoder, and $I$ the acquired image;
S14, map the hidden state into the feature space with a linear layer to obtain the image features:

$V = \mathrm{Linear}(H_I) \in \mathbb{R}^{li \times d_2}$

where $V$ denotes the image features, $d_2$ the second feature size of the image patches, and $\mathrm{Linear}$ the linear layer.
The beneficial effect of this further scheme is as follows: the proposed instruction construction lets the FlanT5 language model adapt well to the multimodal extraction task without additional pre-training.
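As an illustration of steps S11-S14, the following minimal PyTorch sketch builds the instruction, embeds it with FlanT5, encodes the image with CLIP, and projects the patch states into the embedding space. It assumes the HuggingFace transformers checkpoints named below; the placeholder token, the example option/text strings, and all variable names are ours, not the patent's.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5ForConditionalGeneration, CLIPVisionModel

# S11: connect options O, image placeholder IS, and text T into instruction E.
O = "Options: person, organisation, location, misc."           # example options
IS = "<image>"                                                  # placeholder (assumed token)
T = "I consider this Affleck to be the most daunting Batman."   # example text
E = f"{O} {IS} {T}"

# S12: vector representation X of the instruction via the FlanT5 embedding layer.
tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
lm = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
ids = tok(E, return_tensors="pt").input_ids
X = lm.get_input_embeddings()(ids)            # (1, l, hs)

# S13: hidden state H_I of the image from the CLIP visual encoder.
clip = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
pixels = torch.randn(1, 3, 224, 224)          # stand-in for a preprocessed image
H_I = clip(pixel_values=pixels).last_hidden_state   # (1, li, d1)

# S14: linear layer maps H_I into the language-model feature space.
linear = nn.Linear(H_I.size(-1), X.size(-1))  # d1 -> hs (assuming d2 = hs)
V = linear(H_I)                               # (1, li, hs) image features
```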
Further, the noise-aware text-image contrastive learning module in S2 quantifies the redundancy of the image through a noise factor in order to perform contrastive learning between text and image; the input of the module comprises the image features and the text features corresponding to the vector representation of the text.
The noise factor is computed as:

$n_{ij} = \lambda\,\bigl(1 - \cos(t_i, v_j)\bigr)$

where $n_{ij}$ denotes the noise factor, $\lambda$ a coefficient scaling the degree of image-text inconsistency, $\cos(\cdot,\cdot)$ the cosine similarity function, $t$ the text features corresponding to the vector representation of the text, and $v$ the image features.
The contrastive learning loss of the noise-aware text-image contrastive learning module is computed as:

$\mathcal{L}_{cl} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\cos(t_i, v_j)/\tau_{ij}}}{\sum_{k=1}^{N} e^{\cos(t_i, v_k)/\tau_{ik}}}, \qquad \tau_{ij} = \tau\,(1 + n_{ij})$

where $\mathcal{L}_{cl}$ denotes the contrastive learning loss, $e$ the base of the exponential, $t_i$ the $i$-th text feature in a batch, $v_j$ the matched $j$-th image feature in the batch, $\tau_{ij}$ the noise-aware temperature of the $i$-th text and $j$-th image, $k$ the $k$-th item in the batch, $N$ the batch size, $v_k$ the $k$-th image feature in the batch, $\tau_{ik}$ the noise-aware temperature of the $i$-th text and $k$-th image, $\tau$ the base temperature parameter, and $n_{ij}$ the noise factor of the $i$-th text and $j$-th image.
The beneficial effect of this further scheme is as follows: by introducing the noise factor into the contrastive learning model, the model mitigates the influence of visual-modality noise in joint multimodal entity-relation extraction, so that the visual modality better assists the entity-relation extraction task.
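A minimal sketch of the noise-aware contrastive loss as reconstructed above. The mean-pooling of token-level features into one vector per example, the default values of tau and lam, and the assumption that positive pairs sit on the batch diagonal are ours.

```python
import torch
import torch.nn.functional as F

def noise_aware_contrastive_loss(t, v, tau=0.07, lam=1.0):
    """t, v: (N, d) pooled text / image features of one batch.
    Noise factor n_ij = lam * (1 - cos(t_i, v_j)) raises the per-pair
    temperature tau_ij = tau * (1 + n_ij), so inconsistent image-text
    pairs exert a flatter, weaker pull during alignment."""
    t = F.normalize(t, dim=-1)
    v = F.normalize(v, dim=-1)
    sim = t @ v.T                                  # cos(t_i, v_k) for all pairs
    tau_ij = tau * (1.0 + lam * (1.0 - sim))       # noise-aware temperatures
    logits = sim / tau_ij
    labels = torch.arange(t.size(0), device=t.device)  # positives on diagonal
    return F.cross_entropy(logits, labels)

# toy usage: 4 text/image pairs with 32-dim pooled features
L_cl = noise_aware_contrastive_loss(torch.randn(4, 32), torch.randn(4, 32))
```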
Further, step S3 comprises the following steps:
S31, obtain the text features corresponding to the vector representation of the text within the vector representation of the instruction;
S32, input the image features and the text features into the noise-aware text-image contrastive learning module to obtain the contrastive learning loss;
S33, align the image features and the text features based on the contrastive learning loss;
S34, replace the vector representation of the text and the vector representation of the image placeholder within the vector representation of the instruction with the aligned text features and the aligned image features to obtain the image-text fusion instruction:

$\hat{E} = [x_O; \hat{V}; \hat{T}]$

where $\hat{E}$ denotes the image-text fusion instruction, $\hat{V}$ the aligned image features, and $\hat{T}$ the aligned text features.
The beneficial effect of this further scheme is as follows: the noise-aware text-image contrastive learning module aligns the image with the text while mitigating the influence of irrelevant image elements; the redundancy of the image is quantified by the defined noise factor to guide the contrastive learning process, and the aligned image and text features are embedded into the vector representation of the instruction to obtain the image-text fusion instruction, laying the foundation for noise-aware generative joint multimodal entity-relation extraction.
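The replacement in S34 can be sketched as below; the slice layout and the assumption that the placeholder span holds exactly the li patch features are ours, for illustration only.

```python
import torch

def build_fusion_instruction(X, img_slice, txt_slice, v_hat, t_hat):
    """X: (1, l, hs) vector representation of the instruction from S12.
    Overwrite the image-placeholder span with the aligned image features
    v_hat and the text span with the aligned text features t_hat;
    the option span is left unchanged."""
    E_hat = X.clone()
    E_hat[:, img_slice] = v_hat   # aligned image features replace the placeholder
    E_hat[:, txt_slice] = t_hat   # aligned text features replace the text span
    return E_hat                  # image-text fusion instruction

# usage (span positions depend on how the tokenizer lays out O, IS, and T):
# E_hat = build_fusion_instruction(X, slice(lo, lo + li), slice(lo + li, l), V_hat, T_hat)
```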
Further, step S4 comprises the following steps:
S41, input the image-text fusion instruction into a text encoder-decoder composed of a text encoder and a text decoder, and take this text encoder-decoder as the language model;
S42, use the aligned image features to assist in prompting entity-relation extraction, and train the language model to extract entity-relation pairs from the image-text fusion instruction, obtaining the entity-relation pairs predicted by the language model's decoding and the corresponding cross-entropy loss;
the cross-entropy loss of the language model is computed as:

$\mathcal{L}_{ce} = \mathrm{CrossEntropy}(\hat{y}, y)$

where $\mathcal{L}_{ce}$ denotes the cross-entropy loss of the language model, $\hat{y}$ the entity-relation pairs predicted by the language model's decoding, and $y$ the real entity-relation pairs.
The beneficial effect of this further scheme is as follows: the image-text fusion instruction serves as the input of the text encoder, the aligned image features assist entity-relation extraction, and the attention mechanism inside the language model corresponding to the text encoder-decoder guides the decoding that generates the predicted entity-relation pairs.
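A minimal training-step sketch for S41-S42 under the same assumptions as the earlier snippets (`lm` and `tok` are the FlanT5 model and tokenizer from the S1 sketch, `E_hat` the fusion instruction from the previous sketch); the serialization format of the target entity-relation pairs is illustrative, not the patent's.

```python
# Gold entity-relation pairs serialized as text (illustrative format).
target = "(Affleck, person) | (Affleck, present_in, Batman)"
labels = tok(target, return_tensors="pt").input_ids

# Feed the fusion instruction through the encoder-decoder; HuggingFace
# computes the token-level cross entropy against `labels` internally.
out = lm(inputs_embeds=E_hat, labels=labels)
L_ce = out.loss
```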
Further, the total noise-aware entity-relation extraction loss in step S5 is computed as:

$\mathcal{L} = \mathcal{L}_{ce} + \alpha\,\mathcal{L}_{cl}$

where $\mathcal{L}$ denotes the total noise-aware entity-relation extraction loss and $\alpha$ the weighting coefficient of the contrastive learning loss.
The beneficial effect of this further scheme is as follows: the total noise-aware entity-relation extraction loss combines the weighted contrastive learning loss with the language model's cross-entropy loss, and text generation during decoding is guided by the attention mechanism.
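Combining the two objectives, one optimization step might look as follows, with `L_ce` and `L_cl` taken from the earlier snippets; the value of alpha and the optimizer choice are assumptions, not prescribed by the patent.

```python
import torch

optimizer = torch.optim.AdamW(lm.parameters(), lr=3e-5)

alpha = 0.1                      # contrastive-loss weight (assumed value)
L_total = L_ce + alpha * L_cl    # total noise-aware extraction loss (S5)
L_total.backward()               # S6: minimize the total loss end to end
optimizer.step()
optimizer.zero_grad()
```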
In another aspect, the invention also provides a system for the noise-aware generative multimodal entity-relation extraction method, comprising:
an input processing module, configured to obtain the instruction, the vector representation of the instruction, and the image features using a language-model embedding layer, a CLIP visual encoder, and a linear layer, respectively, based on the acquired options, text, and image;
an alignment-fusion module, configured to construct the noise-aware text-image contrastive learning module and, based on the instruction, the vector representation of the instruction, and the image features, perform image-text alignment and image-text fusion with that module to obtain the contrastive learning loss and the image-text fusion instruction;
an entity-relation extraction module, configured to process the image-text fusion instruction with the language model's attention mechanism, train the language model to obtain its cross-entropy loss, obtain the total noise-aware entity-relation extraction loss from the contrastive learning loss and the cross-entropy loss, minimize that total loss, and perform entity-relation extraction based on it to obtain the entity-relation extraction result.
The beneficial effects of the invention are as follows: the system corresponds to, and implements, the noise-aware generative multimodal entity-relation extraction method, achieving the same effect as the method; text features and image features can be adjusted dynamically according to the degree of consistency between the acquired text and image, so challenging examples are handled effectively while semantic transfer capability is preserved.
Further, the language model is a text encoder-decoder composed of a text encoder and a text decoder; the language model uses the aligned image features to assist in prompting entity-relation extraction.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed by the embodiments are briefly described below. It should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be regarded as limiting its scope; for a person skilled in the art, other related drawings may be obtained from these drawings without inventive effort.
Fig. 1 is a flow chart of the steps of the noise-aware generative multimodal entity-relation extraction method in embodiment 1 of the present invention.
Fig. 2 is a block diagram of the system of the noise-aware generative multimodal entity-relation extraction method in embodiment 2 of the present invention.
Fig. 3 is a schematic diagram of the system of the noise-aware generative multimodal entity-relation extraction method in embodiment 2 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all of them. The components of the embodiments, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments is not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by a person skilled in the art without inventive effort fall within the protection scope of the present invention.
Multimodal information extraction: a task spanning natural language processing (NLP) and computer vision (CV) that aims to extract structured information, such as entities, relations, and events, from data in multiple modalities (e.g., text, images, audio). Unlike conventional information extraction, it requires processing several types of data and integrating them for a more comprehensive understanding.
Generative pre-trained model: a large-scale natural language model trained in a self-encoding or autoregressive manner on massive unlabeled text data; it receives a text sequence as input and outputs a new, transformed text sequence.
Multimodal entity-relation extraction: aims to identify entities and their relations simultaneously from multimodal data, including extracting entities from information in different modalities such as text, images, and audio, and inferring the relations between these entities.
CLIP: integrates natural language understanding (NLU) and computer vision (CV) to achieve cross-modal understanding between text and images. Its primary goal is to enable models to understand the associations between textual descriptions and images and, based on that understanding, to perform tasks such as multimodal extraction, image generation, and image search.
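As a point of reference for the CLIP description above, here is a minimal sketch of scoring text-image associations with the HuggingFace transformers CLIP API; the checkpoint name, the blank stand-in image, and the example strings are ours, not part of the patent.

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))          # stand-in for a real photo
texts = ["a photo of Batman", "a photo of a cat"]
inputs = proc(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
probs = out.logits_per_image.softmax(dim=-1)  # text-image association scores
```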
Example 1:
As shown in fig. 1, in an embodiment, the invention provides a noise-aware generative multimodal entity-relation extraction method comprising the following steps:
S1, based on the acquired options, text, and image, obtain the instruction, the vector representation of the instruction, and the image features using a language-model embedding layer, a CLIP visual encoder, and a linear layer, respectively;
step S1 comprises the following steps:
S11, sequentially connect the acquired options, a preset image placeholder, and the acquired text using the language-model embedding layer to obtain the instruction:

$E = [O; IS; T]$

where $E$ denotes the instruction, $O$ the acquired options, $IS$ the preset image placeholder, and $T$ the acquired text;
S12, obtain the vector representation of the instruction with the pretrained language model FlanT5 via instruction fine-tuning:

$X = [x_O^{1}, \dots, x_O^{lo}, x_S^{1}, \dots, x_S^{ls}, x_T^{1}, \dots, x_T^{lx}] \in \mathbb{R}^{l \times hs}, \quad l = lo + ls + lx$

where $X$ denotes the vector representation of the instruction, $l$ the number of elements of the vectorized instruction, $x_O$ the vector representation of the options, $lo$ the number of elements of the vectorized options, $x_O^{lo}$ the $lo$-th element of the options' vector representation, $x_S$ the vector representation of the image placeholder, $ls$ the number of elements of the vectorized image placeholder, $x_S^{ls}$ the $ls$-th element of the image placeholder's vector representation, $x_T$ the vector representation of the text, $lx$ the number of elements of the vectorized text, $x_T^{lx}$ the $lx$-th element of the text's vector representation, and $hs$ the dimension of the embedding vectors;
S13, encode the acquired image with the CLIP visual encoder to obtain the hidden state of the image:

$H_I = \mathrm{CLIP}(I) \in \mathbb{R}^{li \times d_1}$

where $H_I$ denotes the hidden state of the image, $li$ the number of image patches, $d_1$ the first feature size of the image patches, $\mathrm{CLIP}$ the CLIP visual encoder, and $I$ the acquired image;
S14, map the hidden state into the feature space with a linear layer to obtain the image features:

$V = \mathrm{Linear}(H_I) \in \mathbb{R}^{li \times d_2}$

where $V$ denotes the image features, $d_2$ the second feature size of the image patches, and $\mathrm{Linear}$ the linear layer.
S2, construct a noise-aware text-image contrastive learning module;
the noise-aware text-image contrastive learning module in S2 quantifies the redundancy of the image through a noise factor to perform contrastive learning between text and image; its input comprises the image features and the text features corresponding to the vector representation of the text;
the noise factor is computed as:

$n_{ij} = \lambda\,\bigl(1 - \cos(t_i, v_j)\bigr)$

where $n_{ij}$ denotes the noise factor, $\lambda$ a coefficient scaling the degree of image-text inconsistency, $\cos(\cdot,\cdot)$ the cosine similarity function, $t$ the text features corresponding to the vector representation of the text, and $v$ the image features;
the contrastive learning loss of the noise-aware text-image contrastive learning module is computed as:

$\mathcal{L}_{cl} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\cos(t_i, v_j)/\tau_{ij}}}{\sum_{k=1}^{N} e^{\cos(t_i, v_k)/\tau_{ik}}}, \qquad \tau_{ij} = \tau\,(1 + n_{ij})$

where $\mathcal{L}_{cl}$ denotes the contrastive learning loss, $e$ the base of the exponential, $t_i$ the $i$-th text feature in a batch, $v_j$ the matched $j$-th image feature in the batch, $\tau_{ij}$ the noise-aware temperature of the $i$-th text and $j$-th image, $k$ the $k$-th item in the batch, $N$ the batch size, $v_k$ the $k$-th image feature in the batch, $\tau_{ik}$ the noise-aware temperature of the $i$-th text and $k$-th image, $\tau$ the base temperature parameter, and $n_{ij}$ the noise factor of the $i$-th text and $j$-th image.
Unlike the prior art, which addresses the alignment of multimodal visual-textual information directly with a multimodal attention mechanism, the present method accounts for noise. That alignment approach rests on the assumption that text entities and visual objects are basically consistent, so it struggles when picture and text disagree, and noise in the visual information risks inaccurate entity-relation extraction for the whole model. The invention introduces noise factors into the contrastive learning loss and adjusts the model parameters through that loss to optimize the image and text features and mitigate the influence of noise in the visual information.
S3, based on the instruction, the vector representation of the instruction, and the image features, perform image-text alignment and image-text fusion with the noise-aware text-image contrastive learning module to obtain the contrastive learning loss and the image-text fusion instruction;
step S3 comprises the following steps:
S31, obtain the text features corresponding to the vector representation of the text within the vector representation of the instruction;
S32, input the image features and the text features into the noise-aware text-image contrastive learning module to obtain the contrastive learning loss;
S33, align the image features and the text features based on the contrastive learning loss;
S34, replace the vector representation of the text and the vector representation of the image placeholder within the vector representation of the instruction with the aligned text features and the aligned image features to obtain the image-text fusion instruction:

$\hat{E} = [x_O; \hat{V}; \hat{T}]$

where $\hat{E}$ denotes the image-text fusion instruction, $\hat{V}$ the aligned image features, and $\hat{T}$ the aligned text features.
S4, process the image-text fusion instruction with the language model's attention mechanism and train the language model to obtain the language model's cross-entropy loss;
step S4 comprises the following steps:
S41, input the image-text fusion instruction into a text encoder-decoder composed of a text encoder and a text decoder, and take this text encoder-decoder as the language model;
S42, use the aligned image features to assist in prompting entity-relation extraction, and train the language model to extract entity-relation pairs from the image-text fusion instruction, obtaining the entity-relation pairs predicted by the language model's decoding and the corresponding cross-entropy loss;
the cross-entropy loss of the language model is computed as:

$\mathcal{L}_{ce} = \mathrm{CrossEntropy}(\hat{y}, y)$

where $\mathcal{L}_{ce}$ denotes the cross-entropy loss of the language model, $\hat{y}$ the entity-relation pairs predicted by the language model's decoding, and $y$ the real entity-relation pairs.
S5, obtain the total noise-aware entity-relation extraction loss from the contrastive learning loss and the language model's cross-entropy loss;
the total noise-aware entity-relation extraction loss in S5 is computed as:

$\mathcal{L} = \mathcal{L}_{ce} + \alpha\,\mathcal{L}_{cl}$

where $\mathcal{L}$ denotes the total noise-aware entity-relation extraction loss and $\alpha$ the weighting coefficient of the contrastive learning loss.
S6, minimize the total noise-aware entity-relation extraction loss and perform entity-relation extraction based on it to obtain the entity-relation extraction result.
Example 2:
As shown in fig. 2, in another aspect, the invention also provides a system for the noise-aware generative multimodal entity-relation extraction method, comprising:
an input processing module, configured to obtain the instruction, the vector representation of the instruction, and the image features using a language-model embedding layer, a CLIP visual encoder, and a linear layer, respectively, based on the acquired options, text, and image;
an alignment-fusion module, configured to construct the noise-aware text-image contrastive learning module and, based on the instruction, the vector representation of the instruction, and the image features, perform image-text alignment and image-text fusion with that module to obtain the contrastive learning loss and the image-text fusion instruction;
an entity-relation extraction module, configured to process the image-text fusion instruction with the language model's attention mechanism, train the language model to obtain its cross-entropy loss, obtain the total noise-aware entity-relation extraction loss from the contrastive learning loss and the cross-entropy loss, minimize that total loss, and perform entity-relation extraction based on it to obtain the entity-relation extraction result.
As shown in fig. 3, in this embodiment, the text acquired by the input processing module is, for example: "I consider this Affleck to be the most daunting Batman that I have seen." For the acquired picture corresponding to this text, the image encoder encodes the acquired image to obtain the hidden state of the image, and a linear layer maps the hidden state into the feature space to obtain the image features; the language-model embedding layer sequentially connects the acquired options, the preset image placeholder, and the acquired text to obtain the instruction. In the alignment-fusion module, noise factors are introduced into the contrastive learning loss, and the model parameters are adjusted through the alignment loss to optimize the image and text features and mitigate the influence of noise in the visual information. Case 1 corresponds to a scene with a lower noise factor and higher similarity: high similarity means high consistency, i.e., minimal interference between image and text. Conversely, case 2 corresponds to a scene with a higher noise factor and lower similarity: low similarity generally coincides with low consistency, i.e., the image and text interfere noticeably. In the entity-relation extraction module, the language model provided by the invention is a text encoder-decoder composed of a text encoder and a text decoder, and the image-text fusion instruction is obtained by replacing the vector representations of the text and the image placeholder within the instruction's vector representation with the aligned text and image features. The language model uses the noise-aware aligned image features to assist in prompting entity-relation extraction, which reduces the influence of visual-modality noise in joint multimodal entity-relation extraction and lets the visual modality better assist the extraction task.
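As a worked illustration of the two cases (the numbers are ours, for illustration only): with $\lambda = 1$ and cosine similarity $0.9$ (case 1), the noise factor is $n = 1 - 0.9 = 0.1$ and the effective temperature barely rises, $\tau_{ij} = 1.1\,\tau$; with similarity $0.2$ (case 2), $n = 0.8$ and $\tau_{ij} = 1.8\,\tau$, so the mismatched pair's similarity is divided by a temperature roughly $1.6$ times larger and contributes a much flatter, weaker alignment signal.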
The foregoing is merely illustrative of specific embodiments of the present invention, and the invention is not limited thereto; any variation or substitution that a person skilled in the art can readily conceive falls within the protection scope of the present invention.

Claims (8)

1. A noise-aware generative multimodal entity-relation extraction method, characterized by comprising the following steps:
S1, based on the acquired options, text, and image, obtaining the instruction, the vector representation of the instruction, and the image features using a language-model embedding layer, a CLIP visual encoder, and a linear layer, respectively;
S2, constructing a noise-aware text-image contrastive learning module;
S3, based on the instruction, the vector representation of the instruction, and the image features, performing image-text alignment and image-text fusion with the noise-aware text-image contrastive learning module to obtain the contrastive learning loss and the image-text fusion instruction;
S4, processing the image-text fusion instruction with the language model's attention mechanism and training the language model to obtain the language model's cross-entropy loss;
S5, obtaining the total noise-aware entity-relation extraction loss from the contrastive learning loss and the language model's cross-entropy loss;
S6, minimizing the total noise-aware entity-relation extraction loss and performing entity-relation extraction based on it to obtain the entity-relation extraction result.
2. The noise-aware generative multimodal entity-relation extraction method of claim 1, wherein S1 comprises the following steps:
S11, sequentially connecting the acquired options, a preset image placeholder, and the acquired text using the language-model embedding layer to obtain the instruction:

$E = [O; IS; T]$

where $E$ denotes the instruction, $O$ the acquired options, $IS$ the preset image placeholder, and $T$ the acquired text;
S12, obtaining the vector representation of the instruction with the pretrained language model FlanT5 via instruction fine-tuning:

$X = [x_O^{1}, \dots, x_O^{lo}, x_S^{1}, \dots, x_S^{ls}, x_T^{1}, \dots, x_T^{lx}] \in \mathbb{R}^{l \times hs}, \quad l = lo + ls + lx$

where $X$ denotes the vector representation of the instruction, $l$ the number of elements of the vectorized instruction, $x_O$ the vector representation of the options, $lo$ the number of elements of the vectorized options, $x_O^{lo}$ the $lo$-th element of the options' vector representation, $x_S$ the vector representation of the image placeholder, $ls$ the number of elements of the vectorized image placeholder, $x_S^{ls}$ the $ls$-th element of the image placeholder's vector representation, $x_T$ the vector representation of the text, $lx$ the number of elements of the vectorized text, $x_T^{lx}$ the $lx$-th element of the text's vector representation, and $hs$ the dimension of the embedding vectors;
S13, encoding the acquired image with the CLIP visual encoder to obtain the hidden state of the image:

$H_I = \mathrm{CLIP}(I) \in \mathbb{R}^{li \times d_1}$

where $H_I$ denotes the hidden state of the image, $li$ the number of image patches, $d_1$ the first feature size of the image patches, $\mathrm{CLIP}$ the CLIP visual encoder, and $I$ the acquired image;
S14, mapping the hidden state into the feature space with a linear layer to obtain the image features:

$V = \mathrm{Linear}(H_I) \in \mathbb{R}^{li \times d_2}$

where $V$ denotes the image features, $d_2$ the second feature size of the image patches, and $\mathrm{Linear}$ the linear layer.
3. The noise-aware generative multimodal entity-relation extraction method of claim 2, wherein the noise-aware text-image contrastive learning module in S2 quantifies the redundancy of the image through a noise factor to perform contrastive learning between text and image; the input of the module comprises the image features and the text features corresponding to the vector representation of the text;
the noise factor is computed as:

$n_{ij} = \lambda\,\bigl(1 - \cos(t_i, v_j)\bigr)$

where $n_{ij}$ denotes the noise factor, $\lambda$ a coefficient scaling the degree of image-text inconsistency, $\cos(\cdot,\cdot)$ the cosine similarity function, $t$ the text features corresponding to the vector representation of the text, and $v$ the image features;
the contrastive learning loss of the noise-aware text-image contrastive learning module is computed as:

$\mathcal{L}_{cl} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\cos(t_i, v_j)/\tau_{ij}}}{\sum_{k=1}^{N} e^{\cos(t_i, v_k)/\tau_{ik}}}, \qquad \tau_{ij} = \tau\,(1 + n_{ij})$

where $\mathcal{L}_{cl}$ denotes the contrastive learning loss, $e$ the base of the exponential, $t_i$ the $i$-th text feature in a batch, $v_j$ the matched $j$-th image feature in the batch, $\tau_{ij}$ the noise-aware temperature of the $i$-th text and $j$-th image, $k$ the $k$-th item in the batch, $N$ the batch size, $v_k$ the $k$-th image feature in the batch, $\tau_{ik}$ the noise-aware temperature of the $i$-th text and $k$-th image, $\tau$ the base temperature parameter, and $n_{ij}$ the noise factor of the $i$-th text and $j$-th image.
4. The noise-aware generative multimodal entity-relation extraction method of claim 3, wherein S3 comprises the following steps:
S31, obtaining the text features corresponding to the vector representation of the text within the vector representation of the instruction;
S32, inputting the image features and the text features into the noise-aware text-image contrastive learning module to obtain the contrastive learning loss;
S33, aligning the image features and the text features based on the contrastive learning loss;
S34, replacing the vector representation of the text and the vector representation of the image placeholder within the vector representation of the instruction with the aligned text features and the aligned image features to obtain the image-text fusion instruction:

$\hat{E} = [x_O; \hat{V}; \hat{T}]$

where $\hat{E}$ denotes the image-text fusion instruction, $\hat{V}$ the aligned image features, and $\hat{T}$ the aligned text features.
5. The noise-aware generative multimodal entity-relation extraction method of claim 4, wherein S4 comprises the following steps:
S41, inputting the image-text fusion instruction into a text encoder-decoder composed of a text encoder and a text decoder, and taking the text encoder-decoder as the language model;
S42, using the aligned image features to assist in prompting entity-relation extraction, and training the language model to extract entity-relation pairs from the image-text fusion instruction, obtaining the entity-relation pairs predicted by the language model's decoding and the corresponding cross-entropy loss;
the cross-entropy loss of the language model is computed as:

$\mathcal{L}_{ce} = \mathrm{CrossEntropy}(\hat{y}, y)$

where $\mathcal{L}_{ce}$ denotes the cross-entropy loss of the language model, $\hat{y}$ the entity-relation pairs predicted by the language model's decoding, and $y$ the real entity-relation pairs.
6. The noise-aware generative multimodal entity-relation extraction method of claim 5, wherein the total noise-aware entity-relation extraction loss in S5 is computed as:

$\mathcal{L} = \mathcal{L}_{ce} + \alpha\,\mathcal{L}_{cl}$

where $\mathcal{L}$ denotes the total noise-aware entity-relation extraction loss and $\alpha$ the weighting coefficient of the contrastive learning loss.
7. A system for the noise-aware generative multimodal entity-relation extraction method according to any one of claims 1-6, comprising:
an input processing module, configured to obtain the instruction, the vector representation of the instruction, and the image features using a language-model embedding layer, a CLIP visual encoder, and a linear layer, respectively, based on the acquired options, text, and image;
an alignment-fusion module, configured to construct the noise-aware text-image contrastive learning module and, based on the instruction, the vector representation of the instruction, and the image features, perform image-text alignment and image-text fusion with that module to obtain the contrastive learning loss and the image-text fusion instruction;
an entity-relation extraction module, configured to process the image-text fusion instruction with the language model's attention mechanism, train the language model to obtain its cross-entropy loss, obtain the total noise-aware entity-relation extraction loss from the contrastive learning loss and the cross-entropy loss, minimize that total loss, and perform entity-relation extraction based on it to obtain the entity-relation extraction result.
8. The system of claim 7, wherein the language model is a text encoder-decoder composed of a text encoder and a text decoder; the language model uses the aligned image features to assist in prompting entity-relation extraction.
CN202311469190.7A 2023-11-07 2023-11-07 Noise-aware generative multimodal entity-relation extraction method and system Active CN117195903B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311469190.7A CN117195903B (en) 2023-11-07 2023-11-07 Noise-aware generative multimodal entity-relation extraction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311469190.7A CN117195903B (en) 2023-11-07 2023-11-07 Noise-aware generative multimodal entity-relation extraction method and system

Publications (2)

Publication Number Publication Date
CN117195903A 2023-12-08
CN117195903B 2024-01-23

Family

ID=89005670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311469190.7A Active CN117195903B (en) 2023-11-07 2023-11-07 Noise-aware generative multimodal entity-relation extraction method and system

Country Status (1)

Country Link
CN (1) CN117195903B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113836298A (en) * 2021-08-05 2021-12-24 合肥工业大学 Text classification method and system based on visual enhancement
US20230081171A1 (en) * 2021-09-07 2023-03-16 Google Llc Cross-Modal Contrastive Learning for Text-to-Image Generation based on Machine Learning Models
CN115239937A (en) * 2022-09-23 2022-10-25 西南交通大学 Cross-modal emotion prediction method
CN116257609A (en) * 2023-01-09 2023-06-13 武汉理工大学三亚科教创新园 Cross-modal retrieval method and system based on multi-scale text alignment
CN116702035A (en) * 2023-06-02 2023-09-05 中国科学院合肥物质科学研究院 Pest identification method based on multi-mode self-supervision transducer architecture
CN116431847A (en) * 2023-06-14 2023-07-14 北京邮电大学 Cross-modal hash retrieval method and device based on multiple contrast and double-way countermeasure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YOUNGSHELL: "The connection between contrastive learning loss (InfoNCE loss) and cross-entropy loss, and the role of the temperature coefficient" (对比学习损失(InfoNCE loss)与交叉熵损失的联系，以及温度系数的作用), pages 1-4, retrieved from the Internet: https://zhuanlan.zhihu.com/p/506544456 *

Also Published As

Publication number Publication date
CN117195903B (en) 2024-01-23

Similar Documents

Publication Publication Date Title
Camgoz et al. Neural sign language translation
CN112668671B (en) Method and device for acquiring pre-training model
CN111026861B (en) Text abstract generation method, training device, training equipment and medium
US20180052828A1 (en) Machine translation method and apparatus
CN113157965B (en) Audio visual model training and audio visual method, device and equipment
WO2023280064A1 (en) Audiovisual secondary haptic signal reconstruction method based on cloud-edge collaboration
Ma et al. Towards local visual modeling for image captioning
CN112016604A (en) Zero-resource machine translation method applying visual information
CN116050496A (en) Determination method and device, medium and equipment of picture description information generation model
CN116246213B (en) Data processing method, device, equipment and medium
CN115309927B (en) Multi-label guiding and multi-view measuring ocean remote sensing image retrieval method and system
CN115563335A (en) Model training method, image-text data processing device, image-text data processing equipment and image-text data processing medium
CN116975350A (en) Image-text retrieval method, device, equipment and storage medium
Li et al. Sign language recognition and translation network based on multi-view data
US20240119716A1 (en) Method for multimodal emotion classification based on modal space assimilation and contrastive learning
Jia et al. Semantic association enhancement transformer with relative position for image captioning
CN117671460A (en) Cross-modal image-text emotion analysis method based on hybrid fusion
CN117292146A (en) Industrial scene-oriented method, system and application method for constructing multi-mode large language model
CN117195903B (en) Generating type multi-mode entity relation extraction method and system based on noise perception
CN117114063A (en) Method for training a generative large language model and for processing image tasks
CN115510193B (en) Query result vectorization method, query result determination method and related devices
CN115249361A (en) Instructional text positioning model training, apparatus, device, and medium
Huang et al. Target-Oriented Sentiment Classification with Sequential Cross-Modal Semantic Graph
Yu et al. A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes
El-Gayar Automatic Generation of Image Caption Based on Semantic Relation using Deep Visual Attention Prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant