CN117195903A - Noise-aware generative multimodal entity-relation extraction method and system - Google Patents

Noise-aware generative multimodal entity-relation extraction method and system

Info

Publication number
CN117195903A
CN117195903A · CN202311469190.7A · CN202311469190A
Authority
CN
China
Prior art keywords
text
image
representing
noise
language model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311469190.7A
Other languages
Chinese (zh)
Other versions
CN117195903B (en)
Inventor
吴艳
杨欣洁
李志慧
李阳
徐雅静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xinguangshitong Technology Group Co ltd
Original Assignee
Beijing Xinguangshitong Technology Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xinguangshitong Technology Group Co ltd filed Critical Beijing Xinguangshitong Technology Group Co ltd
Priority to CN202311469190.7A
Publication of CN117195903A
Application granted
Publication of CN117195903B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a noise-aware generative multimodal entity-relation extraction method and system, belonging to the multimodal technical field and comprising the following steps: based on the acquired options, text, and image, obtain the instruction, the vector representation of the instruction, and the image features using a language-model embedding layer, a CLIP visual encoder, and a linear layer, respectively; construct a noise-aware text-image contrastive learning module; perform image-text alignment and image-text fusion to obtain the contrastive learning loss and the image-text fusion instruction; process the image-text fusion instruction with the language model's attention mechanism and train the language model to obtain its cross-entropy loss; obtain the total noise-aware entity-relation extraction loss; and extract entity relations by minimizing that total loss to obtain the entity-relation extraction result. The invention addresses the difficulty existing methods have in handling challenging examples while preserving semantic transfer capability.

Description

Noise-aware generative multimodal entity-relation extraction method and system
Technical Field
The invention belongs to the multimodal technical field, and in particular relates to a noise-aware generative multimodal entity-relation extraction method and system.
Background
Multimodal named entity recognition (MNER) and multimodal relation extraction (MRE) aim to extract the required information with the help of additional image input, and play an important role in fields such as knowledge-graph construction and reading comprehension. Many existing techniques complete the MNER and MRE tasks separately. However, handling the two tasks in isolation ignores the interaction between them, which motivated joint multimodal entity-relation extraction.
Owing to the complexity and diversity of multimedia information, multimodal extraction often suffers from text entities that do not match the visual objects exactly. Existing multimodal extraction methods typically exploit visual objects by selecting the most salient ones with high confidence scores, which may introduce noise from irrelevant or redundant objects. Furthermore, these methods focus on designing graph alignments across modalities, mapping objects to entities between visual and textual graphs; but aligning visual and textual information across graphs rests on the assumption that text entities and visual objects largely agree. Noise in the visual information therefore risks inaccurate entity-relation extraction for the whole model.
To fully exploit the bidirectional interaction between the two tasks, prior work extracts one or more entity-relation triples together with the related entity types. Other studies use word-pair relation labels to jointly classify the relations between entities and the types of the entities involved; although this avoids the error propagation of pipeline frameworks, the semantic information of the given entity types and relation labels is still not used effectively.
Disclosure of Invention
To overcome the above shortcomings of the prior art, the noise-aware generative multimodal entity-relation extraction method and system provided by the invention incorporate visual information into an instruction to guide the generation of multiple entity-relation pairs; the method is further trained end to end, and noise-aware contrastive learning reduces interference from visual-modality noise, addressing the difficulty existing methods have in handling challenging examples while preserving semantic transfer capability.
To achieve the above object, the invention adopts the following technical scheme:
In one aspect, the noise-aware generative multimodal entity-relation extraction method provided by the invention comprises the following steps:
S1, based on the acquired options, text, and image, obtain the instruction, the vector representation of the instruction, and the image features using a language-model embedding layer, a CLIP visual encoder, and a linear layer, respectively;
S2, construct a noise-aware text-image contrastive learning module;
S3, based on the instruction, the vector representation of the instruction, and the image features, perform image-text alignment and image-text fusion with the noise-aware text-image contrastive learning module to obtain the contrastive learning loss and the image-text fusion instruction;
S4, process the image-text fusion instruction with the language model's attention mechanism and train the language model to obtain the language model's cross-entropy loss;
S5, obtain the total noise-aware entity-relation extraction loss from the contrastive learning loss and the language model's cross-entropy loss;
S6, minimize the total noise-aware entity-relation extraction loss and perform entity-relation extraction based on it to obtain the entity-relation extraction result.
The beneficial effects of the invention are as follows: the noise-aware generative multimodal entity-relation extraction method incorporates visual information into the instruction and can extract multiple entity-relation pairs at once, so that the language model understands the additional visual information and makes full use of the semantic information of the labels; for the noise produced by inconsistent text-image pairs, the invention designs a corresponding noise-aware contrastive learning module to reduce interference from visual-modality noise; and text features and image features can be adjusted dynamically according to the degree of consistency between the acquired text and image, so the scheme handles challenging examples effectively while preserving semantic transfer capability.
Further, step S1 comprises the following steps:
S11, sequentially connect the acquired options, a preset image placeholder, and the acquired text using the language-model embedding layer to obtain the instruction:

$E = [O; IS; T]$

where $E$ denotes the instruction, $O$ the acquired options, $IS$ the preset image placeholder, and $T$ the acquired text;
S12, obtain the vector representation of the instruction with the pretrained language model FlanT5 via instruction fine-tuning:

$X = [x_O^{1}, \dots, x_O^{lo}, x_S^{1}, \dots, x_S^{ls}, x_T^{1}, \dots, x_T^{lx}] \in \mathbb{R}^{l \times hs}, \quad l = lo + ls + lx$

where $X$ denotes the vector representation of the instruction, $l$ the number of elements of the vectorized instruction, $x_O$ the vector representation of the options, $lo$ the number of elements of the vectorized options, $x_O^{lo}$ the $lo$-th element of the options' vector representation, $x_S$ the vector representation of the image placeholder, $ls$ the number of elements of the vectorized image placeholder, $x_S^{ls}$ the $ls$-th element of the image placeholder's vector representation, $x_T$ the vector representation of the text, $lx$ the number of elements of the vectorized text, $x_T^{lx}$ the $lx$-th element of the text's vector representation, and $hs$ the dimension of the embedding vectors;
S13, encode the acquired image with the CLIP visual encoder to obtain the hidden state of the image:

$H_I = \mathrm{CLIP}(I) \in \mathbb{R}^{li \times d_1}$

where $H_I$ denotes the hidden state of the image, $li$ the number of image patches, $d_1$ the first feature size of the image patches, $\mathrm{CLIP}$ the CLIP visual encoder, and $I$ the acquired image;
S14, map the hidden state into the feature space with a linear layer to obtain the image features:

$V = \mathrm{Linear}(H_I) \in \mathbb{R}^{li \times d_2}$

where $V$ denotes the image features, $d_2$ the second feature size of the image patches, and $\mathrm{Linear}$ the linear layer.
The beneficial effect of this further scheme is as follows: the proposed instruction construction lets the FlanT5 language model adapt well to the multimodal extraction task without additional pre-training.
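As an illustration of steps S11-S14, the following minimal PyTorch sketch builds the instruction, embeds it with FlanT5, encodes the image with CLIP, and projects the patch states into the embedding space. It assumes the HuggingFace transformers checkpoints named below; the placeholder token, the example option/text strings, and all variable names are ours, not the patent's.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5ForConditionalGeneration, CLIPVisionModel

# S11: connect options O, image placeholder IS, and text T into instruction E.
O = "Options: person, organisation, location, misc."           # example options
IS = "<image>"                                                  # placeholder (assumed token)
T = "I consider this Affleck to be the most daunting Batman."   # example text
E = f"{O} {IS} {T}"

# S12: vector representation X of the instruction via the FlanT5 embedding layer.
tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
lm = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
ids = tok(E, return_tensors="pt").input_ids
X = lm.get_input_embeddings()(ids)            # (1, l, hs)

# S13: hidden state H_I of the image from the CLIP visual encoder.
clip = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
pixels = torch.randn(1, 3, 224, 224)          # stand-in for a preprocessed image
H_I = clip(pixel_values=pixels).last_hidden_state   # (1, li, d1)

# S14: linear layer maps H_I into the language-model feature space.
linear = nn.Linear(H_I.size(-1), X.size(-1))  # d1 -> hs (assuming d2 = hs)
V = linear(H_I)                               # (1, li, hs) image features
```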
Further, the noise-aware text-image contrastive learning module in S2 quantifies the redundancy of the image through a noise factor in order to perform contrastive learning between text and image; the input of the module comprises the image features and the text features corresponding to the vector representation of the text.
The noise factor is computed as:

$n_{ij} = \lambda\,\bigl(1 - \cos(t_i, v_j)\bigr)$

where $n_{ij}$ denotes the noise factor, $\lambda$ a coefficient scaling the degree of image-text inconsistency, $\cos(\cdot,\cdot)$ the cosine similarity function, $t$ the text features corresponding to the vector representation of the text, and $v$ the image features.
The contrastive learning loss of the noise-aware text-image contrastive learning module is computed as:

$\mathcal{L}_{cl} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\cos(t_i, v_j)/\tau_{ij}}}{\sum_{k=1}^{N} e^{\cos(t_i, v_k)/\tau_{ik}}}, \qquad \tau_{ij} = \tau\,(1 + n_{ij})$

where $\mathcal{L}_{cl}$ denotes the contrastive learning loss, $e$ the base of the exponential, $t_i$ the $i$-th text feature in a batch, $v_j$ the matched $j$-th image feature in the batch, $\tau_{ij}$ the noise-aware temperature of the $i$-th text and $j$-th image, $k$ the $k$-th item in the batch, $N$ the batch size, $v_k$ the $k$-th image feature in the batch, $\tau_{ik}$ the noise-aware temperature of the $i$-th text and $k$-th image, $\tau$ the base temperature parameter, and $n_{ij}$ the noise factor of the $i$-th text and $j$-th image.
The beneficial effect of this further scheme is as follows: by introducing the noise factor into the contrastive learning model, the model mitigates the influence of visual-modality noise in joint multimodal entity-relation extraction, so that the visual modality better assists the entity-relation extraction task.
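A minimal sketch of the noise-aware contrastive loss as reconstructed above. The mean-pooling of token-level features into one vector per example, the default values of tau and lam, and the assumption that positive pairs sit on the batch diagonal are ours.

```python
import torch
import torch.nn.functional as F

def noise_aware_contrastive_loss(t, v, tau=0.07, lam=1.0):
    """t, v: (N, d) pooled text / image features of one batch.
    Noise factor n_ij = lam * (1 - cos(t_i, v_j)) raises the per-pair
    temperature tau_ij = tau * (1 + n_ij), so inconsistent image-text
    pairs exert a flatter, weaker pull during alignment."""
    t = F.normalize(t, dim=-1)
    v = F.normalize(v, dim=-1)
    sim = t @ v.T                                  # cos(t_i, v_k) for all pairs
    tau_ij = tau * (1.0 + lam * (1.0 - sim))       # noise-aware temperatures
    logits = sim / tau_ij
    labels = torch.arange(t.size(0), device=t.device)  # positives on diagonal
    return F.cross_entropy(logits, labels)

# toy usage: 4 text/image pairs with 32-dim pooled features
L_cl = noise_aware_contrastive_loss(torch.randn(4, 32), torch.randn(4, 32))
```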
Further, step S3 comprises the following steps:
S31, obtain the text features corresponding to the vector representation of the text within the vector representation of the instruction;
S32, input the image features and the text features into the noise-aware text-image contrastive learning module to obtain the contrastive learning loss;
S33, align the image features and the text features based on the contrastive learning loss;
S34, replace the vector representation of the text and the vector representation of the image placeholder within the vector representation of the instruction with the aligned text features and the aligned image features to obtain the image-text fusion instruction:

$\hat{E} = [x_O; \hat{V}; \hat{T}]$

where $\hat{E}$ denotes the image-text fusion instruction, $\hat{V}$ the aligned image features, and $\hat{T}$ the aligned text features.
The beneficial effect of this further scheme is as follows: the noise-aware text-image contrastive learning module aligns the image with the text while mitigating the influence of irrelevant image elements; the redundancy of the image is quantified by the defined noise factor to guide the contrastive learning process, and the aligned image and text features are embedded into the vector representation of the instruction to obtain the image-text fusion instruction, laying the foundation for noise-aware generative joint multimodal entity-relation extraction.
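The replacement in S34 can be sketched as below; the slice layout and the assumption that the placeholder span holds exactly the li patch features are ours, for illustration only.

```python
import torch

def build_fusion_instruction(X, img_slice, txt_slice, v_hat, t_hat):
    """X: (1, l, hs) vector representation of the instruction from S12.
    Overwrite the image-placeholder span with the aligned image features
    v_hat and the text span with the aligned text features t_hat;
    the option span is left unchanged."""
    E_hat = X.clone()
    E_hat[:, img_slice] = v_hat   # aligned image features replace the placeholder
    E_hat[:, txt_slice] = t_hat   # aligned text features replace the text span
    return E_hat                  # image-text fusion instruction

# usage (span positions depend on how the tokenizer lays out O, IS, and T):
# E_hat = build_fusion_instruction(X, slice(lo, lo + li), slice(lo + li, l), V_hat, T_hat)
```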
Further, step S4 comprises the following steps:
S41, input the image-text fusion instruction into a text encoder-decoder composed of a text encoder and a text decoder, and take this text encoder-decoder as the language model;
S42, use the aligned image features to assist in prompting entity-relation extraction, and train the language model to extract entity-relation pairs from the image-text fusion instruction, obtaining the entity-relation pairs predicted by the language model's decoding and the corresponding cross-entropy loss;
the cross-entropy loss of the language model is computed as:

$\mathcal{L}_{ce} = \mathrm{CrossEntropy}(\hat{y}, y)$

where $\mathcal{L}_{ce}$ denotes the cross-entropy loss of the language model, $\hat{y}$ the entity-relation pairs predicted by the language model's decoding, and $y$ the real entity-relation pairs.
The beneficial effect of this further scheme is as follows: the image-text fusion instruction serves as the input of the text encoder, the aligned image features assist entity-relation extraction, and the attention mechanism inside the language model corresponding to the text encoder-decoder guides the decoding that generates the predicted entity-relation pairs.
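A minimal training-step sketch for S41-S42 under the same assumptions as the earlier snippets (`lm` and `tok` are the FlanT5 model and tokenizer from the S1 sketch, `E_hat` the fusion instruction from the previous sketch); the serialization format of the target entity-relation pairs is illustrative, not the patent's.

```python
# Gold entity-relation pairs serialized as text (illustrative format).
target = "(Affleck, person) | (Affleck, present_in, Batman)"
labels = tok(target, return_tensors="pt").input_ids

# Feed the fusion instruction through the encoder-decoder; HuggingFace
# computes the token-level cross entropy against `labels` internally.
out = lm(inputs_embeds=E_hat, labels=labels)
L_ce = out.loss
```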
Further, the total noise-aware entity-relation extraction loss in step S5 is computed as:

$\mathcal{L} = \mathcal{L}_{ce} + \alpha\,\mathcal{L}_{cl}$

where $\mathcal{L}$ denotes the total noise-aware entity-relation extraction loss and $\alpha$ the weighting coefficient of the contrastive learning loss.
The beneficial effect of this further scheme is as follows: the total noise-aware entity-relation extraction loss combines the weighted contrastive learning loss with the language model's cross-entropy loss, and text generation during decoding is guided by the attention mechanism.
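Combining the two objectives, one optimization step might look as follows, with `L_ce` and `L_cl` taken from the earlier snippets; the value of alpha and the optimizer choice are assumptions, not prescribed by the patent.

```python
import torch

optimizer = torch.optim.AdamW(lm.parameters(), lr=3e-5)

alpha = 0.1                      # contrastive-loss weight (assumed value)
L_total = L_ce + alpha * L_cl    # total noise-aware extraction loss (S5)
L_total.backward()               # S6: minimize the total loss end to end
optimizer.step()
optimizer.zero_grad()
```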
In another aspect, the invention also provides a system for the noise-aware generative multimodal entity-relation extraction method, comprising:
an input processing module, configured to obtain the instruction, the vector representation of the instruction, and the image features using a language-model embedding layer, a CLIP visual encoder, and a linear layer, respectively, based on the acquired options, text, and image;
an alignment-fusion module, configured to construct the noise-aware text-image contrastive learning module and, based on the instruction, the vector representation of the instruction, and the image features, perform image-text alignment and image-text fusion with that module to obtain the contrastive learning loss and the image-text fusion instruction;
an entity-relation extraction module, configured to process the image-text fusion instruction with the language model's attention mechanism, train the language model to obtain its cross-entropy loss, obtain the total noise-aware entity-relation extraction loss from the contrastive learning loss and the cross-entropy loss, minimize that total loss, and perform entity-relation extraction based on it to obtain the entity-relation extraction result.
The beneficial effects of the invention are as follows: the system corresponds to, and implements, the noise-aware generative multimodal entity-relation extraction method, achieving the same effect as the method; text features and image features can be adjusted dynamically according to the degree of consistency between the acquired text and image, so challenging examples are handled effectively while semantic transfer capability is preserved.
Further, the language model is a text encoder-decoder composed of a text encoder and a text decoder; the language model uses the aligned image features to assist in prompting entity-relation extraction.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed by the embodiments are briefly described below. It should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be regarded as limiting its scope; for a person skilled in the art, other related drawings may be obtained from these drawings without inventive effort.
Fig. 1 is a flow chart of the steps of the noise-aware generative multimodal entity-relation extraction method in embodiment 1 of the present invention.
Fig. 2 is a block diagram of the system of the noise-aware generative multimodal entity-relation extraction method in embodiment 2 of the present invention.
Fig. 3 is a schematic diagram of the system of the noise-aware generative multimodal entity-relation extraction method in embodiment 2 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all of them. The components of the embodiments, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments is not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by a person skilled in the art without inventive effort fall within the protection scope of the present invention.
Multimodal information extraction: a task spanning natural language processing (NLP) and computer vision (CV) that aims to extract structured information, such as entities, relations, and events, from data in multiple modalities (e.g., text, images, audio). Unlike conventional information extraction, it requires processing several types of data and integrating them for a more comprehensive understanding.
Generative pre-trained model: a large-scale natural language model trained in a self-encoding or autoregressive manner on massive unlabeled text data; it receives a text sequence as input and outputs a new, transformed text sequence.
Multimodal entity-relation extraction: aims to identify entities and their relations simultaneously from multimodal data, including extracting entities from information in different modalities such as text, images, and audio, and inferring the relations between these entities.
CLIP: integrates natural language understanding (NLU) and computer vision (CV) to achieve cross-modal understanding between text and images. Its primary goal is to enable models to understand the associations between textual descriptions and images and, based on that understanding, to perform tasks such as multimodal extraction, image generation, and image search.
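As a point of reference for the CLIP description above, here is a minimal sketch of scoring text-image associations with the HuggingFace transformers CLIP API; the checkpoint name, the blank stand-in image, and the example strings are ours, not part of the patent.

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))          # stand-in for a real photo
texts = ["a photo of Batman", "a photo of a cat"]
inputs = proc(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
probs = out.logits_per_image.softmax(dim=-1)  # text-image association scores
```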
Example 1:
As shown in fig. 1, in an embodiment, the invention provides a noise-aware generative multimodal entity-relation extraction method comprising the following steps:
S1, based on the acquired options, text, and image, obtain the instruction, the vector representation of the instruction, and the image features using a language-model embedding layer, a CLIP visual encoder, and a linear layer, respectively;
step S1 comprises the following steps:
S11, sequentially connect the acquired options, a preset image placeholder, and the acquired text using the language-model embedding layer to obtain the instruction:

$E = [O; IS; T]$

where $E$ denotes the instruction, $O$ the acquired options, $IS$ the preset image placeholder, and $T$ the acquired text;
S12, obtain the vector representation of the instruction with the pretrained language model FlanT5 via instruction fine-tuning:

$X = [x_O^{1}, \dots, x_O^{lo}, x_S^{1}, \dots, x_S^{ls}, x_T^{1}, \dots, x_T^{lx}] \in \mathbb{R}^{l \times hs}, \quad l = lo + ls + lx$

where $X$ denotes the vector representation of the instruction, $l$ the number of elements of the vectorized instruction, $x_O$ the vector representation of the options, $lo$ the number of elements of the vectorized options, $x_O^{lo}$ the $lo$-th element of the options' vector representation, $x_S$ the vector representation of the image placeholder, $ls$ the number of elements of the vectorized image placeholder, $x_S^{ls}$ the $ls$-th element of the image placeholder's vector representation, $x_T$ the vector representation of the text, $lx$ the number of elements of the vectorized text, $x_T^{lx}$ the $lx$-th element of the text's vector representation, and $hs$ the dimension of the embedding vectors;
S13, encode the acquired image with the CLIP visual encoder to obtain the hidden state of the image:

$H_I = \mathrm{CLIP}(I) \in \mathbb{R}^{li \times d_1}$

where $H_I$ denotes the hidden state of the image, $li$ the number of image patches, $d_1$ the first feature size of the image patches, $\mathrm{CLIP}$ the CLIP visual encoder, and $I$ the acquired image;
S14, map the hidden state into the feature space with a linear layer to obtain the image features:

$V = \mathrm{Linear}(H_I) \in \mathbb{R}^{li \times d_2}$

where $V$ denotes the image features, $d_2$ the second feature size of the image patches, and $\mathrm{Linear}$ the linear layer.
S2, construct a noise-aware text-image contrastive learning module;
the noise-aware text-image contrastive learning module in S2 quantifies the redundancy of the image through a noise factor to perform contrastive learning between text and image; its input comprises the image features and the text features corresponding to the vector representation of the text;
the noise factor is computed as:

$n_{ij} = \lambda\,\bigl(1 - \cos(t_i, v_j)\bigr)$

where $n_{ij}$ denotes the noise factor, $\lambda$ a coefficient scaling the degree of image-text inconsistency, $\cos(\cdot,\cdot)$ the cosine similarity function, $t$ the text features corresponding to the vector representation of the text, and $v$ the image features;
the contrastive learning loss of the noise-aware text-image contrastive learning module is computed as:

$\mathcal{L}_{cl} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\cos(t_i, v_j)/\tau_{ij}}}{\sum_{k=1}^{N} e^{\cos(t_i, v_k)/\tau_{ik}}}, \qquad \tau_{ij} = \tau\,(1 + n_{ij})$

where $\mathcal{L}_{cl}$ denotes the contrastive learning loss, $e$ the base of the exponential, $t_i$ the $i$-th text feature in a batch, $v_j$ the matched $j$-th image feature in the batch, $\tau_{ij}$ the noise-aware temperature of the $i$-th text and $j$-th image, $k$ the $k$-th item in the batch, $N$ the batch size, $v_k$ the $k$-th image feature in the batch, $\tau_{ik}$ the noise-aware temperature of the $i$-th text and $k$-th image, $\tau$ the base temperature parameter, and $n_{ij}$ the noise factor of the $i$-th text and $j$-th image.
Unlike the prior art, which addresses the alignment of multimodal visual-textual information directly with a multimodal attention mechanism, the present method accounts for noise. That alignment approach rests on the assumption that text entities and visual objects are basically consistent, so it struggles when picture and text disagree, and noise in the visual information risks inaccurate entity-relation extraction for the whole model. The invention introduces noise factors into the contrastive learning loss and adjusts the model parameters through that loss to optimize the image and text features and mitigate the influence of noise in the visual information.
S3, based on the instruction, the vector representation of the instruction, and the image features, perform image-text alignment and image-text fusion with the noise-aware text-image contrastive learning module to obtain the contrastive learning loss and the image-text fusion instruction;
step S3 comprises the following steps:
S31, obtain the text features corresponding to the vector representation of the text within the vector representation of the instruction;
S32, input the image features and the text features into the noise-aware text-image contrastive learning module to obtain the contrastive learning loss;
S33, align the image features and the text features based on the contrastive learning loss;
S34, replace the vector representation of the text and the vector representation of the image placeholder within the vector representation of the instruction with the aligned text features and the aligned image features to obtain the image-text fusion instruction:

$\hat{E} = [x_O; \hat{V}; \hat{T}]$

where $\hat{E}$ denotes the image-text fusion instruction, $\hat{V}$ the aligned image features, and $\hat{T}$ the aligned text features.
S4, process the image-text fusion instruction with the language model's attention mechanism and train the language model to obtain the language model's cross-entropy loss;
step S4 comprises the following steps:
S41, input the image-text fusion instruction into a text encoder-decoder composed of a text encoder and a text decoder, and take this text encoder-decoder as the language model;
S42, use the aligned image features to assist in prompting entity-relation extraction, and train the language model to extract entity-relation pairs from the image-text fusion instruction, obtaining the entity-relation pairs predicted by the language model's decoding and the corresponding cross-entropy loss;
the cross-entropy loss of the language model is computed as:

$\mathcal{L}_{ce} = \mathrm{CrossEntropy}(\hat{y}, y)$

where $\mathcal{L}_{ce}$ denotes the cross-entropy loss of the language model, $\hat{y}$ the entity-relation pairs predicted by the language model's decoding, and $y$ the real entity-relation pairs.
S5, obtain the total noise-aware entity-relation extraction loss from the contrastive learning loss and the language model's cross-entropy loss;
the total noise-aware entity-relation extraction loss in S5 is computed as:

$\mathcal{L} = \mathcal{L}_{ce} + \alpha\,\mathcal{L}_{cl}$

where $\mathcal{L}$ denotes the total noise-aware entity-relation extraction loss and $\alpha$ the weighting coefficient of the contrastive learning loss.
S6, minimize the total noise-aware entity-relation extraction loss and perform entity-relation extraction based on it to obtain the entity-relation extraction result.
Example 2:
As shown in fig. 2, in another aspect, the invention also provides a system for the noise-aware generative multimodal entity-relation extraction method, comprising:
an input processing module, configured to obtain the instruction, the vector representation of the instruction, and the image features using a language-model embedding layer, a CLIP visual encoder, and a linear layer, respectively, based on the acquired options, text, and image;
an alignment-fusion module, configured to construct the noise-aware text-image contrastive learning module and, based on the instruction, the vector representation of the instruction, and the image features, perform image-text alignment and image-text fusion with that module to obtain the contrastive learning loss and the image-text fusion instruction;
an entity-relation extraction module, configured to process the image-text fusion instruction with the language model's attention mechanism, train the language model to obtain its cross-entropy loss, obtain the total noise-aware entity-relation extraction loss from the contrastive learning loss and the cross-entropy loss, minimize that total loss, and perform entity-relation extraction based on it to obtain the entity-relation extraction result.
As shown in fig. 3, in this embodiment, the text acquired by the input processing module is, for example: "I consider this Affleck to be the most daunting Batman that I have seen." For the acquired picture corresponding to this text, the image encoder encodes the acquired image to obtain the hidden state of the image, and a linear layer maps the hidden state into the feature space to obtain the image features; the language-model embedding layer sequentially connects the acquired options, the preset image placeholder, and the acquired text to obtain the instruction. In the alignment-fusion module, noise factors are introduced into the contrastive learning loss, and the model parameters are adjusted through the alignment loss to optimize the image and text features and mitigate the influence of noise in the visual information. Case 1 corresponds to a scene with a lower noise factor and higher similarity: high similarity means high consistency, i.e., minimal interference between image and text. Conversely, case 2 corresponds to a scene with a higher noise factor and lower similarity: low similarity generally coincides with low consistency, i.e., the image and text interfere noticeably. In the entity-relation extraction module, the language model provided by the invention is a text encoder-decoder composed of a text encoder and a text decoder, and the image-text fusion instruction is obtained by replacing the vector representations of the text and the image placeholder within the instruction's vector representation with the aligned text and image features. The language model uses the noise-aware aligned image features to assist in prompting entity-relation extraction, which reduces the influence of visual-modality noise in joint multimodal entity-relation extraction and lets the visual modality better assist the extraction task.
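As a worked illustration of the two cases (the numbers are ours, for illustration only): with $\lambda = 1$ and cosine similarity $0.9$ (case 1), the noise factor is $n = 1 - 0.9 = 0.1$ and the effective temperature barely rises, $\tau_{ij} = 1.1\,\tau$; with similarity $0.2$ (case 2), $n = 0.8$ and $\tau_{ij} = 1.8\,\tau$, so the mismatched pair's similarity is divided by a temperature roughly $1.6$ times larger and contributes a much flatter, weaker alignment signal.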
The foregoing is merely illustrative of specific embodiments of the present invention, and the invention is not limited thereto; any variation or substitution that a person skilled in the art can readily conceive falls within the protection scope of the present invention.

Claims (8)

1. A noise-aware generative multimodal entity-relation extraction method, characterized by comprising the following steps:
S1, based on the acquired options, text, and image, obtaining the instruction, the vector representation of the instruction, and the image features using a language-model embedding layer, a CLIP visual encoder, and a linear layer, respectively;
S2, constructing a noise-aware text-image contrastive learning module;
S3, based on the instruction, the vector representation of the instruction, and the image features, performing image-text alignment and image-text fusion with the noise-aware text-image contrastive learning module to obtain the contrastive learning loss and the image-text fusion instruction;
S4, processing the image-text fusion instruction with the language model's attention mechanism and training the language model to obtain the language model's cross-entropy loss;
S5, obtaining the total noise-aware entity-relation extraction loss from the contrastive learning loss and the language model's cross-entropy loss;
S6, minimizing the total noise-aware entity-relation extraction loss and performing entity-relation extraction based on it to obtain the entity-relation extraction result.
2. The noise-aware generative multimodal entity-relation extraction method of claim 1, wherein S1 comprises the following steps:
S11, sequentially connecting the acquired options, a preset image placeholder, and the acquired text using the language-model embedding layer to obtain the instruction:

$E = [O; IS; T]$

where $E$ denotes the instruction, $O$ the acquired options, $IS$ the preset image placeholder, and $T$ the acquired text;
S12, obtaining the vector representation of the instruction with the pretrained language model FlanT5 via instruction fine-tuning:

$X = [x_O^{1}, \dots, x_O^{lo}, x_S^{1}, \dots, x_S^{ls}, x_T^{1}, \dots, x_T^{lx}] \in \mathbb{R}^{l \times hs}, \quad l = lo + ls + lx$

where $X$ denotes the vector representation of the instruction, $l$ the number of elements of the vectorized instruction, $x_O$ the vector representation of the options, $lo$ the number of elements of the vectorized options, $x_O^{lo}$ the $lo$-th element of the options' vector representation, $x_S$ the vector representation of the image placeholder, $ls$ the number of elements of the vectorized image placeholder, $x_S^{ls}$ the $ls$-th element of the image placeholder's vector representation, $x_T$ the vector representation of the text, $lx$ the number of elements of the vectorized text, $x_T^{lx}$ the $lx$-th element of the text's vector representation, and $hs$ the dimension of the embedding vectors;
S13, encoding the acquired image with the CLIP visual encoder to obtain the hidden state of the image:

$H_I = \mathrm{CLIP}(I) \in \mathbb{R}^{li \times d_1}$

where $H_I$ denotes the hidden state of the image, $li$ the number of image patches, $d_1$ the first feature size of the image patches, $\mathrm{CLIP}$ the CLIP visual encoder, and $I$ the acquired image;
S14, mapping the hidden state into the feature space with a linear layer to obtain the image features:

$V = \mathrm{Linear}(H_I) \in \mathbb{R}^{li \times d_2}$

where $V$ denotes the image features, $d_2$ the second feature size of the image patches, and $\mathrm{Linear}$ the linear layer.
3. The noise-aware generative multimodal entity-relation extraction method of claim 2, wherein the noise-aware text-image contrastive learning module in S2 quantifies the redundancy of the image through a noise factor to perform contrastive learning between text and image; the input of the module comprises the image features and the text features corresponding to the vector representation of the text;
the noise factor is computed as:

$n_{ij} = \lambda\,\bigl(1 - \cos(t_i, v_j)\bigr)$

where $n_{ij}$ denotes the noise factor, $\lambda$ a coefficient scaling the degree of image-text inconsistency, $\cos(\cdot,\cdot)$ the cosine similarity function, $t$ the text features corresponding to the vector representation of the text, and $v$ the image features;
the contrastive learning loss of the noise-aware text-image contrastive learning module is computed as:

$\mathcal{L}_{cl} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\cos(t_i, v_j)/\tau_{ij}}}{\sum_{k=1}^{N} e^{\cos(t_i, v_k)/\tau_{ik}}}, \qquad \tau_{ij} = \tau\,(1 + n_{ij})$

where $\mathcal{L}_{cl}$ denotes the contrastive learning loss, $e$ the base of the exponential, $t_i$ the $i$-th text feature in a batch, $v_j$ the matched $j$-th image feature in the batch, $\tau_{ij}$ the noise-aware temperature of the $i$-th text and $j$-th image, $k$ the $k$-th item in the batch, $N$ the batch size, $v_k$ the $k$-th image feature in the batch, $\tau_{ik}$ the noise-aware temperature of the $i$-th text and $k$-th image, $\tau$ the base temperature parameter, and $n_{ij}$ the noise factor of the $i$-th text and $j$-th image.
4. The noise-aware generative multimodal entity-relation extraction method of claim 3, wherein S3 comprises the following steps:
S31, obtaining the text features corresponding to the vector representation of the text within the vector representation of the instruction;
S32, inputting the image features and the text features into the noise-aware text-image contrastive learning module to obtain the contrastive learning loss;
S33, aligning the image features and the text features based on the contrastive learning loss;
S34, replacing the vector representation of the text and the vector representation of the image placeholder within the vector representation of the instruction with the aligned text features and the aligned image features to obtain the image-text fusion instruction:

$\hat{E} = [x_O; \hat{V}; \hat{T}]$

where $\hat{E}$ denotes the image-text fusion instruction, $\hat{V}$ the aligned image features, and $\hat{T}$ the aligned text features.
5. The noise-aware generative multimodal entity-relation extraction method of claim 4, wherein S4 comprises the following steps:
S41, inputting the image-text fusion instruction into a text encoder-decoder composed of a text encoder and a text decoder, and taking the text encoder-decoder as the language model;
S42, using the aligned image features to assist in prompting entity-relation extraction, and training the language model to extract entity-relation pairs from the image-text fusion instruction, obtaining the entity-relation pairs predicted by the language model's decoding and the corresponding cross-entropy loss;
the cross-entropy loss of the language model is computed as:

$\mathcal{L}_{ce} = \mathrm{CrossEntropy}(\hat{y}, y)$

where $\mathcal{L}_{ce}$ denotes the cross-entropy loss of the language model, $\hat{y}$ the entity-relation pairs predicted by the language model's decoding, and $y$ the real entity-relation pairs.
6. The noise-aware generative multimodal entity-relation extraction method of claim 5, wherein the total noise-aware entity-relation extraction loss in S5 is computed as:

$\mathcal{L} = \mathcal{L}_{ce} + \alpha\,\mathcal{L}_{cl}$

where $\mathcal{L}$ denotes the total noise-aware entity-relation extraction loss and $\alpha$ the weighting coefficient of the contrastive learning loss.
7. A system for the noise-aware generative multimodal entity-relation extraction method according to any one of claims 1-6, comprising:
an input processing module, configured to obtain the instruction, the vector representation of the instruction, and the image features using a language-model embedding layer, a CLIP visual encoder, and a linear layer, respectively, based on the acquired options, text, and image;
an alignment-fusion module, configured to construct the noise-aware text-image contrastive learning module and, based on the instruction, the vector representation of the instruction, and the image features, perform image-text alignment and image-text fusion with that module to obtain the contrastive learning loss and the image-text fusion instruction;
an entity-relation extraction module, configured to process the image-text fusion instruction with the language model's attention mechanism, train the language model to obtain its cross-entropy loss, obtain the total noise-aware entity-relation extraction loss from the contrastive learning loss and the cross-entropy loss, minimize that total loss, and perform entity-relation extraction based on it to obtain the entity-relation extraction result.
8. The system of claim 7, wherein the language model is a text encoder-decoder composed of a text encoder and a text decoder; the language model uses the aligned image features to assist in prompting entity-relation extraction.
CN202311469190.7A 2023-11-07 2023-11-07 Noise-aware generative multimodal entity-relation extraction method and system Active CN117195903B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311469190.7A CN117195903B (en) 2023-11-07 2023-11-07 Noise-aware generative multimodal entity-relation extraction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311469190.7A CN117195903B (en) 2023-11-07 2023-11-07 Noise-aware generative multimodal entity-relation extraction method and system

Publications (2)

Publication Number Publication Date
CN117195903A 2023-12-08
CN117195903B 2024-01-23

Family

ID=89005670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311469190.7A Active CN117195903B (en) 2023-11-07 2023-11-07 Noise-aware generative multimodal entity-relation extraction method and system

Country Status (1)

Country Link
CN (1) CN117195903B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113836298A (en) * 2021-08-05 2021-12-24 合肥工业大学 Text classification method and system based on visual enhancement
US20230081171A1 (en) * 2021-09-07 2023-03-16 Google Llc Cross-Modal Contrastive Learning for Text-to-Image Generation based on Machine Learning Models
CN115239937A (en) * 2022-09-23 2022-10-25 西南交通大学 Cross-modal emotion prediction method
CN116257609A (en) * 2023-01-09 2023-06-13 武汉理工大学三亚科教创新园 Cross-modal retrieval method and system based on multi-scale text alignment
CN116702035A (en) * 2023-06-02 2023-09-05 中国科学院合肥物质科学研究院 Pest identification method based on multi-mode self-supervision transducer architecture
CN116431847A (en) * 2023-06-14 2023-07-14 北京邮电大学 Cross-modal hash retrieval method and device based on multiple contrast and double-way countermeasure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YOUNGSHELL: "The connection between contrastive learning loss (InfoNCE loss) and cross-entropy loss, and the role of the temperature coefficient" (对比学习损失(InfoNCE loss)与交叉熵损失的联系，以及温度系数的作用), pages 1-4, retrieved from the Internet: https://zhuanlan.zhihu.com/p/506544456 *

Also Published As

Publication number Publication date
CN117195903B (en) 2024-01-23

Similar Documents

Publication Publication Date Title
Camgoz et al. Neural sign language translation
CN112668671B (en) Method and device for acquiring pre-training model
CN111026861B (en) Text abstract generation method, training device, training equipment and medium
US20180052828A1 (en) Machine translation method and apparatus
CN113157965B (en) Audio visual model training and audio visual method, device and equipment
WO2023280064A1 (en) Audiovisual secondary haptic signal reconstruction method based on cloud-edge collaboration
Ma et al. Towards local visual modeling for image captioning
CN112016604A (en) Zero-resource machine translation method applying visual information
CN116050496A (en) Determination method and device, medium and equipment of picture description information generation model
CN116246213B (en) Data processing method, device, equipment and medium
CN115309927B (en) Multi-label guiding and multi-view measuring ocean remote sensing image retrieval method and system
CN115563335A (en) Model training method, image-text data processing device, image-text data processing equipment and image-text data processing medium
CN116975350A (en) Image-text retrieval method, device, equipment and storage medium
Li et al. Sign language recognition and translation network based on multi-view data
US20240119716A1 (en) Method for multimodal emotion classification based on modal space assimilation and contrastive learning
Jia et al. Semantic association enhancement transformer with relative position for image captioning
CN117671460A (en) Cross-modal image-text emotion analysis method based on hybrid fusion
CN117292146A (en) Industrial scene-oriented method, system and application method for constructing multi-mode large language model
CN117195903B (en) Generating type multi-mode entity relation extraction method and system based on noise perception
CN117114063A (en) Method for training a generative large language model and for processing image tasks
CN115510193B (en) Query result vectorization method, query result determination method and related devices
CN115249361A (en) Instructional text positioning model training, apparatus, device, and medium
Huang et al. Target-Oriented Sentiment Classification with Sequential Cross-Modal Semantic Graph
Yu et al. A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes
El-Gayar Automatic Generation of Image Caption Based on Semantic Relation using Deep Visual Attention Prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant