CN116796287A - Pre-training method, device, equipment and storage medium for image-text understanding model - Google Patents

Pre-training method, device, equipment and storage medium for image-text understanding model

Info

Publication number
CN116796287A
CN116796287A (application number CN202310727691.4A)
Authority
CN
China
Prior art keywords
text
image
understanding model
vector
embedding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310727691.4A
Other languages
Chinese (zh)
Inventor
刘羲
董孟帆
舒畅
陈又新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310727691.4A priority Critical patent/CN116796287A/en
Publication of CN116796287A publication Critical patent/CN116796287A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Image Processing (AREA)

Abstract

The application belongs to the technical field of natural language processing and provides a pre-training method, device, equipment and storage medium for an image-text understanding model. The method comprises the following steps: masking the paired text and image in an image-text sample to obtain a text segment input sequence and an image block input sequence, and calling an image-text understanding model with a dual-tower structure; inputting the text segment input sequence to the text side of the model to obtain a text embedding vector, and inputting the image block input sequence to the visual side of the model to obtain an image embedding vector; obtaining a multimodal fusion feature based on the text embedding vector and the image embedding vector; and, according to the multimodal fusion feature, adjusting the parameters of the model with a language task that predicts the masked text segment features in the text embedding vector and a visual task that predicts the masked image block features in the image embedding vector as pre-training targets. The application improves the pre-training efficiency of the image-text understanding model in the financial field and enables analysis of financial data in the image-text modality.

Description

Pre-training method, device, equipment and storage medium for image-text understanding model
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for pre-training an image-text understanding model.
Background
In the financial field, financial data such as financial reports contain not only textual information but also rich visual-modality information such as tables and flowcharts, and this visual information plays an important role in understanding financial reports. With the continued development of multimodal research in recent years, mature pre-trained models exist for both the text and vision modalities, and multimodal pre-trained models can be used to understand and analyze image-text financial data such as financial reports.
In the related art, most multimodal pre-trained models rely on pre-training with massive amounts of image-text financial data to acquire cross-modal understanding capability; however, processing such massive data involves a heavy computational load and long training time in environments with limited computing resources.
Disclosure of Invention
The main purpose of the present application is to provide a pre-training method, device, equipment and storage medium for an image-text understanding model, aiming to improve the pre-training efficiency of the image-text understanding model in the financial field and enable analysis of financial data in the image-text modality.
In a first aspect, the present application provides a pre-training method for an image-text understanding model, the method comprising:
masking the paired text and image in a preset image-text sample to obtain a text segment input sequence and an image block input sequence, and calling a preset image-text understanding model, wherein the image-text understanding model has a dual-tower structure;
inputting the text segment input sequence to the text side of the image-text understanding model for an embedding operation to extract a text embedding vector, and inputting the image block input sequence to the visual side of the image-text understanding model for a linear embedding operation to extract an image embedding vector;
obtaining a multimodal fusion feature with an attention mechanism, based on the text embedding vector and the image embedding vector;
according to the multimodal fusion feature, adjusting parameters of the image-text understanding model with a language task that predicts the masked text segment features in the text embedding vector and a visual task that predicts the masked image block features in the image embedding vector as pre-training targets, so as to complete pre-training of the image-text understanding model.
In a second aspect, the present application also provides a pre-training device for an image-text understanding model, the device comprising:
a masking module, configured to mask the paired text and image in a preset image-text sample to obtain a text segment input sequence and an image block input sequence, and to call a preset image-text understanding model, wherein the image-text understanding model has a dual-tower structure;
an embedding module, configured to input the text segment input sequence to the text side of the image-text understanding model for an embedding operation to extract a text embedding vector, and to input the image block input sequence to the visual side of the image-text understanding model for a linear embedding operation to extract an image embedding vector;
an acquisition module, configured to obtain a multimodal fusion feature with an attention mechanism, based on the text embedding vector and the image embedding vector;
and an adjustment module, configured to adjust parameters of the image-text understanding model according to the multimodal fusion feature, with a language task that predicts the masked text segment features in the text embedding vector and a visual task that predicts the masked image block features in the image embedding vector as pre-training targets, so as to complete pre-training of the image-text understanding model.
In a third aspect, the present application also provides a computer device comprising a processor, a memory, and a computer program stored in the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the pre-training method for the image-text understanding model described above.
In a fourth aspect, the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the pre-training method for the image-text understanding model described above.
The application discloses a pre-training method and device for an image-text understanding model, a computer device, and a readable storage medium. The pre-training method masks the paired text and image in a preset image-text sample to obtain a text segment input sequence and an image block input sequence, and calls a preset image-text understanding model with a dual-tower structure; inputs the text segment input sequence to the text side of the image-text understanding model for an embedding operation to extract a text embedding vector, and inputs the image block input sequence to the visual side for a linear embedding operation to extract an image embedding vector; obtains a multimodal fusion feature with an attention mechanism, based on the text embedding vector and the image embedding vector; and, according to the multimodal fusion feature, adjusts parameters of the image-text understanding model with a language task that predicts the masked text segment features in the text embedding vector and a visual task that predicts the masked image block features in the image embedding vector as pre-training targets, so as to complete pre-training of the image-text understanding model. On the one hand, for sample data of two different modalities, text and image (for example, sample financial data in the image-text modality), lightweight embedding is performed separately on the text side and the visual side of the dual-tower image-text understanding model, which speeds up the processing of modality features. On the other hand, an attention mechanism is used for multimodal feature fusion, which shortens the computation time of cross-modal feature learning. Together, these improve the pre-training efficiency of the image-text understanding model in the financial field while preserving the performance and robustness of the model, so that the pre-trained image-text understanding model can accurately analyze financial data in the image-text modality.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flowchart of an embodiment of the pre-training method for an image-text understanding model of the present application;
FIG. 2 is a schematic structural diagram of an image-text understanding model according to an embodiment of the pre-training method of the present application;
FIG. 3 is a schematic block diagram of a pre-training device for an image-text understanding model according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of a computer device according to an embodiment of the present application.
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations.
It is to be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that, in order to clearly describe the technical solutions of the embodiments of the present application, the words "first", "second", etc. are used in the embodiments to distinguish between identical or similar items having substantially the same function and effect. Those skilled in the art will appreciate that the words "first", "second", etc. do not limit the quantity or order of execution, and that items so labeled are not necessarily different.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
The embodiments of the present application provide a pre-training method and device for an image-text understanding model, a computer device, and a readable storage medium. According to the pre-training method, on the one hand, for sample data of two different modalities, text and image (for example, sample financial data in the image-text modality), lightweight embedding is performed separately on the text side and the visual side of the dual-tower image-text understanding model, which speeds up the processing of modality features; on the other hand, an attention mechanism is used for multimodal feature fusion, which shortens the computation time of cross-modal feature learning, improves the pre-training efficiency of the image-text understanding model, and preserves the performance and robustness of the model. The pre-trained image-text understanding model can be applied to image-text analysis in the field of natural language processing, and in particular to the analysis of image-text financial data such as financial reports and to intelligent decision analysis in the financial field, providing scientific and systematic data support for strategy formulation and investment decisions.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a schematic flowchart of a pre-training method for an image-text understanding model according to an embodiment of the present application. The pre-training method is mainly applied to a pre-training device for the image-text understanding model, which may be a device with data processing capability, such as a server.
The server may be an independent server, or may be a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data, and data analysis platforms.
As shown in fig. 1, the pre-training method for the image-text understanding model includes steps S101 to S104.
Step S101, mask the paired text and image in a preset image-text sample to obtain a text segment input sequence and an image block input sequence, and call a preset image-text understanding model, wherein the image-text understanding model has a dual-tower structure.
Referring to fig. 2, fig. 2 is a schematic structural diagram of the image-text understanding model according to an embodiment of the present application. The image-text understanding model is a dual-tower image-text multimodal understanding model that combines a BERT (Bidirectional Encoder Representations from Transformers) model with a ViLT (Vision-and-Language Transformer) model. That is, the image-text understanding model is divided into a text side and a visual side: the text side corresponds to the feature encoder for text and includes the BERT model; the visual side corresponds to the feature encoder for images and includes the ViLT model. Before pre-training, the image-text understanding model is an untrained model whose parameters are random or initialized values.
To pre-train the image-text understanding model, image-text samples need to be constructed in advance. It will be appreciated that an image-text sample may be image-text financial data used for pre-training, such as financial reports or image-text investment analysis reports; each image-text sample comprises a paired image and text sharing the same semantics.
Pre-training the image-text understanding model essentially helps the model learn better cross-modal representations: the complementarity between the text modality and the image modality is used to remove redundancy between them and to learn better feature representations, so that the representations of text and image information with the same semantics become as close to each other as possible. After pre-training, the image-text understanding model can synthesize information from both the text and image modalities, infer and complete missing information, and broaden the coverage of the information contained in the input data.
Therefore, in the pre-training stage of the image-text understanding model, the paired text and image in each image-text sample are masked separately to obtain a text segment input sequence and an image block input sequence, which serve as the input of the image-text understanding model.
In some embodiments, masking the paired text and image in the preset image-text sample to obtain the text segment input sequence and the image block input sequence is specifically: segmenting the text to obtain a text segment sequence, and masking the text segment sequence according to a preset first mask proportion to obtain the text segment input sequence; and cutting the image to obtain an image block sequence, and masking the image block sequence according to a preset second mask proportion to obtain the image block input sequence.
On the text side, the text in the image-text sample is segmented and parsed with a preset OCR parser to obtain an ordered sequence of text segments, together with the position information and segment serial number (ordering) of each text segment. It will be appreciated that a text segment is represented as a list of words, and the position information of a text segment is represented as the coordinates of its text box.
The text segment sequence is then masked according to the preset first mask proportion to obtain the text segment input sequence, which carries the position information and segment serial number of each text segment. The first mask proportion can be set flexibly according to actual needs; for example, 30% of the text segments in the sequence may be randomly masked. To prevent the image-text understanding model from recovering a masked word from other tokens of the same word, whole-word masking may be employed.
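By way of illustration only, the following is a minimal Python sketch of such segment-level masking; it is not taken from the patent, and the mask token name, the 30% ratio, and the function signature are illustrative assumptions.

import random

MASK_TOKEN = "[MASK]"  # hypothetical mask token; the patent does not name one

def mask_text_segments(segments, mask_ratio=0.30, seed=None):
    """segments: a list of text segments, each a list of words from the OCR parser.
    Returns the masked segments and the indices of the segments that were masked."""
    rng = random.Random(seed)
    n_to_mask = max(1, int(len(segments) * mask_ratio))
    masked_ids = set(rng.sample(range(len(segments)), n_to_mask))
    masked = []
    for i, words in enumerate(segments):
        if i in masked_ids:
            # whole-word masking: every token of a chosen segment is replaced,
            # so the model cannot recover a word from its remaining sub-tokens
            masked.append([MASK_TOKEN] * len(words))
        else:
            masked.append(list(words))
    return masked, sorted(masked_ids)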
On the visual side, the image in the image-text sample is cut and flattened into ordered image blocks (patches) of a preset size, yielding an image block sequence together with the position information and block serial number (ordering) of each image block. It will be appreciated that the position information of an image block is represented as the coordinates of its bounding box.
Specifically, the image in the image-text sample is first resized to a height H and a width W, and then divided into square image blocks (patches) of a preset size P×P, where P is the width and height of each block and P×P serves as the hidden dimension of a block, so that the length of the image block sequence is M = HW / P². To improve the generality of the image-text understanding model, the hidden dimension can be projected from P×P into a dimension D. An image block sequence organized in this way improves the inference efficiency of the image-text understanding model.
The image block sequence is then masked according to the preset second mask proportion to obtain the image block input sequence, which carries the position information and block serial number of each image block. The second mask proportion can be set flexibly according to actual needs; for example, 75% of the image blocks in the sequence may be randomly masked, reducing the redundancy of the image.
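As an illustration (not taken from the patent), a minimal PyTorch-style sketch of cutting an image into P×P patches, giving a sequence of length M = HW/P², and randomly masking a hypothetical 75% of the patches; the tensor layout and the zeroing of masked patches are assumptions.

import torch

def patchify_and_mask(image, patch_size=16, mask_ratio=0.75):
    """image: (C, H, W) tensor with H and W divisible by patch_size."""
    c, h, w = image.shape
    p = patch_size
    m = (h // p) * (w // p)                              # sequence length M = HW / P^2
    patches = image.unfold(1, p, p).unfold(2, p, p)      # (C, H/P, W/P, P, P)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(m, c * p * p)
    n_mask = int(m * mask_ratio)
    masked_idx = torch.randperm(m)[:n_mask]
    masked_patches = patches.clone()
    masked_patches[masked_idx] = 0.0                     # masked patches zeroed out
    return masked_patches, patches, masked_idx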
Step S102, input the text segment input sequence to the text side of the image-text understanding model for an embedding operation to extract a text embedding vector, and input the image block input sequence to the visual side of the image-text understanding model for a linear embedding operation to extract an image embedding vector.
On the text side, the image-text understanding model includes the embedding layer of a BERT-family model, for example the embedding layer of the better-performing RoBERTa model. The text segment input sequence is input to the text side, and each text segment is projected into the feature space by the embedding operation of the BERT embedding layer, yielding a text embedding vector (text embedding Y). The BERT embedding layer thus provides lightweight embedding of the text segment input sequence and speeds up the processing of text-modality features.
On the visual side, the image-text understanding model includes the linear embedding layer of the ViLT model. The image block input sequence is input to the visual side; each image block is projected into a two-dimensional image feature map by the linear embedding operation of the linear embedding layer, and the feature map is flattened into a one-dimensional image embedding vector (image embedding X). The ViLT-style linear embedding thus provides lightweight embedding of the image block input sequence and speeds up the processing of visual-modality features.
It will be appreciated that the dimension of the text embedding vector is consistent with the dimension of the image embedding vector.
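For illustration, the following is a minimal, hypothetical PyTorch sketch (not taken from the patent) of such a lightweight dual-tower embedding front end: the text tower reuses only a BERT-style token embedding layer, and the visual tower is a single linear projection of flattened patches into the same hidden dimension D. Class names, vocabulary size, and dimensions are assumptions.

import torch
import torch.nn as nn

class DualTowerEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, hidden_dim=768, patch_dim=16 * 16 * 3):
        super().__init__()
        # text side: token embedding only (the embedding layer of a pre-trained
        # RoBERTa/BERT checkpoint could be loaded here instead)
        self.token_embedding = nn.Embedding(vocab_size, hidden_dim)
        # visual side: linear embedding of flattened P*P patches into D dimensions
        self.patch_embedding = nn.Linear(patch_dim, hidden_dim)

    def forward(self, token_ids, patches):
        text_emb = self.token_embedding(token_ids)    # (B, L_t, D)
        image_emb = self.patch_embedding(patches)     # (B, M, D)
        return text_emb, image_emb                    # same hidden dimension D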
In some embodiments, to introduce the ordering of text segments and image blocks, as well as the layout (position) information of the text and the image, into the image-text understanding model, both the text side and the visual side are equipped with a one-dimensional position embedding layer and a two-dimensional position embedding layer. After the text segment input sequence is input to the text side for the embedding operation, the method further comprises: extracting a position embedding vector and a segment embedding vector of each text segment. After the image block input sequence is input to the visual side for the linear embedding operation, the method further comprises: extracting a position embedding vector and a block embedding vector of each image block.
On the text side, the text segment input sequence is input to the text side of the image-text understanding model; the segment serial number of each text segment carried by the sequence passes through the one-dimensional position embedding layer to produce the segment embedding vector of the text segment, and the position information of each text segment passes through the two-dimensional position embedding layer to produce the position embedding vector of the text segment.
On the visual side, the image block input sequence is input to the visual side of the image-text understanding model; the block serial number of each image block carried by the sequence passes through the one-dimensional position embedding layer to produce the block embedding vector of the image block, and the position information of each image block passes through the two-dimensional position embedding layer to produce the position embedding vector of the image block.
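A hypothetical sketch of these two embedding layers follows (not from the patent): a 1-D embedding over segment/block serial numbers and a 2-D position embedding derived from box coordinates. Representing a box as a normalized (x0, y0, x1, y1) vector fed to a linear layer is an assumption made for illustration; the patent does not specify the box encoding.

import torch
import torch.nn as nn

class OrderAndBoxEmbedding(nn.Module):
    def __init__(self, max_positions=512, hidden_dim=768):
        super().__init__()
        self.order_embedding = nn.Embedding(max_positions, hidden_dim)  # 1-D position
        self.box_embedding = nn.Linear(4, hidden_dim)                   # 2-D position

    def forward(self, order_ids, boxes):
        # order_ids: (B, L) segment or block serial numbers
        # boxes:     (B, L, 4) normalized box coordinates
        return self.order_embedding(order_ids) + self.box_embedding(boxes)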
Step S103, obtain a multimodal fusion feature with an attention mechanism, based on the text embedding vector and the image embedding vector.
The image-text understanding model includes a multi-head attention (multimodal Transformer) layer. Based on the obtained text embedding vector and image embedding vector, the multimodal fusion feature is obtained at the multi-head attention layer using an attention mechanism.
In some embodiments, step S103 is specifically: combining the position embedding vector and segment embedding vector of the text segments with the text embedding vector to obtain a text feature vector; combining the position embedding vector and block embedding vector of the image blocks with the image embedding vector to obtain an image feature vector; and obtaining the multimodal fusion feature with an attention mechanism according to the text feature vector and the image feature vector.
On the text side, the position embedding vector and segment embedding vector of the text segments are combined with the text embedding vector to obtain the text feature vector.
On the visual side, the position embedding vector and block embedding vector of the image blocks are combined with the image embedding vector to obtain the image feature vector.
Then, at the multi-head attention (multimodal Transformer) layer, the text feature vector and the image feature vector are fused with an attention mechanism to obtain the multimodal fusion feature.
Performing multimodal feature fusion on the text feature vector and the image feature vector with an attention mechanism in this way preserves the performance and robustness of the image-text understanding model while shortening the computation time of cross-modal feature learning.
In some embodiments, obtaining the multimodal fusion feature with an attention mechanism according to the text feature vector and the image feature vector is specifically: concatenating the text feature vector and the image feature vector and inputting the result to the multi-head attention layer of the image-text understanding model, so that the multi-head attention layer performs multimodal feature fusion on the text feature vector and the image feature vector using an attention mechanism, yielding the multimodal fusion feature.
Concretely, the text feature vector and the image feature vector are concatenated (concat) to obtain a spliced vector, which is input to the multi-head attention layer of the image-text understanding model, and multimodal feature fusion is performed with the attention mechanism of the multi-head attention layer to obtain the multimodal fusion feature. Specifically, the spliced vector is linearly transformed to obtain a query matrix Q, a key matrix K, and a value matrix V, and the attention operation over Q, K, and V produces the multimodal fusion feature.
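A minimal illustrative sketch of this fusion step follows; it is not the patent's implementation. PyTorch's nn.MultiheadAttention is used here as a stand-in for the multimodal Transformer layer, with Q, K, and V all derived from the concatenated sequence (self-attention over the joint text-image sequence); names and dimensions are assumptions.

import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, hidden_dim=768, num_heads=12):
        super().__init__()
        self.attention = nn.MultiheadAttention(hidden_dim, num_heads,
                                               batch_first=True)

    def forward(self, text_features, image_features):
        # text_features: (B, L_t, D); image_features: (B, M, D)
        joint = torch.cat([text_features, image_features], dim=1)   # splice the two modalities
        fused, _ = self.attention(joint, joint, joint)              # Q = K = V = joint sequence
        text_len = text_features.size(1)
        h_text, h_image = fused[:, :text_len], fused[:, text_len:]  # per-modality hidden features
        return fused, h_text, h_image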
Step S104, according to the multimodal fusion feature, adjust parameters of the image-text understanding model with a language task that predicts the masked text segment features in the text embedding vector and a visual task that predicts the masked image block features in the image embedding vector as pre-training targets, so as to complete pre-training of the image-text understanding model.
It will be appreciated that the text segment input sequence contains masked text segments, so the text embedding vector derived from it contains masked text segment features (T1, T2 in fig. 2). Similarly, the image embedding vector contains masked image block features (V2, V3).
Using the multimodal fusion feature output by the multi-head attention layer, a masked language modeling (MLM) task that predicts the masked text segment features and an MAE-style visual task that reconstructs the masked image block features are taken as pre-training targets, and the parameters of the image-text understanding model are adjusted to complete its pre-training.
In some embodiments, step S104 is specifically: using the multimodal fusion feature, computing the loss values of the image-text understanding model's predictions for the masked text segment features and the masked image block features; and adjusting the parameters of the image-text understanding model according to the loss values to complete pre-training of the image-text understanding model.
The hidden features h_i derived from the text side of the multimodal fusion feature h are used to pre-train the image-text understanding model, yielding a first loss value for predicting the masked text segment features.
The hidden features h_t derived from the image side of the multimodal fusion feature h are used to pre-train the image-text understanding model by reconstructing the masked image blocks at the pixel level, yielding a second loss value for reconstructing the masked image block features. The second loss value may be computed with a mean squared error (MSE) loss function.
The sum of the first loss value and the second loss value is taken as the loss value of the image-text understanding model, and pre-training is complete when the loss value, optimized by gradient descent, falls to a preset threshold.
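By way of illustration, a sketch of the combined pre-training loss (not from the patent): cross-entropy is computed only at masked text positions and MSE only over masked patches, and the two are summed. Head output shapes and the boolean-mask indexing are assumptions.

import torch
import torch.nn.functional as F

def pretraining_loss(text_logits, target_ids, text_mask_pos,
                     patch_preds, target_patches, patch_mask_idx):
    # text_logits: (B, L_t, vocab); target_ids: (B, L_t)
    # text_mask_pos: (B, L_t) boolean mask of the masked text positions
    mlm_loss = F.cross_entropy(text_logits[text_mask_pos],
                               target_ids[text_mask_pos])
    # patch_preds / target_patches: (B, M, P*P*C); patch_mask_idx: (B, M) boolean
    mae_loss = F.mse_loss(patch_preds[patch_mask_idx],
                          target_patches[patch_mask_idx])
    return mlm_loss + mae_loss   # total loss = first loss value + second loss value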
Completion of pre-training means the multimodal upstream task is finished, and the model can then be applied to downstream tasks to analyze image-text financial data in the financial field. For example, image-text financial data to be analyzed, such as a financial report, is obtained; the text in the data is segmented to obtain a target text segment sequence, and the image is cut to obtain a target image block sequence. The target text segment sequence is input to the text side of the pre-trained image-text understanding model, which performs an embedding operation to obtain a target text embedding vector together with the position embedding vector and segment embedding vector of the target text segments; these are combined to obtain a target text feature vector. The target image block sequence is input to the visual side of the pre-trained model, which performs a linear embedding operation to obtain a target image embedding vector together with the position embedding vector and block embedding vector of the target image blocks; these are combined to obtain a target image feature vector. Then, the target text feature vector and target image feature vector are concatenated and input to the multi-head attention layer of the pre-trained model, which performs multimodal feature fusion with the attention mechanism and synthesizes the information from both feature vectors (removing redundancy and complementing information). The output target feature vector better represents the shared semantics of the financial image-text data, enabling accurate analysis and providing scientific and systematic data support for strategy formulation and investment decisions in the financial field.
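For illustration only, a short usage sketch stringing the hypothetical modules above together for one image-text pair; the dummy input shapes are assumptions and real inputs would come from the OCR parser and patch splitter described earlier.

import torch

token_ids   = torch.randint(0, 30522, (1, 32))   # dummy word-piece ids for 32 tokens
segment_ids = torch.arange(32).unsqueeze(0)      # 1-D order of the text segments
text_boxes  = torch.rand(1, 32, 4)               # normalized text box coordinates
patches     = torch.rand(1, 196, 16 * 16 * 3)    # 196 flattened 16x16 RGB patches
block_ids   = torch.arange(196).unsqueeze(0)     # 1-D order of the image blocks
patch_boxes = torch.rand(1, 196, 4)              # normalized patch box coordinates

embedder, positions, fusion = DualTowerEmbedding(), OrderAndBoxEmbedding(), MultimodalFusion()
text_emb, image_emb = embedder(token_ids, patches)
text_feat = text_emb + positions(segment_ids, text_boxes)
image_feat = image_emb + positions(block_ids, patch_boxes)
fused, h_text, h_image = fusion(text_feat, image_feat)   # shared-semantics target features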
According to the above pre-training method for the image-text understanding model, the paired text and image in a preset image-text sample are masked separately to obtain a text segment input sequence and an image block input sequence, and a preset image-text understanding model with a dual-tower structure is called; the text segment input sequence is input to the text side of the model for an embedding operation to extract a text embedding vector, and the image block input sequence is input to the visual side for a linear embedding operation to extract an image embedding vector; a multimodal fusion feature is obtained with an attention mechanism based on the text embedding vector and the image embedding vector; and, according to the multimodal fusion feature, the parameters of the model are adjusted with a language task that predicts the masked text segment features in the text embedding vector and a visual task that predicts the masked image block features in the image embedding vector as pre-training targets, completing pre-training. On the one hand, for sample data of the two different modalities, text and image, lightweight embedding is performed separately on the text side and the visual side of the dual-tower model, speeding up the processing of modality features; on the other hand, an attention mechanism is used for multimodal feature fusion, shortening the computation time of cross-modal feature learning. This improves the pre-training efficiency of the image-text understanding model in the financial field while preserving model performance, so that the pre-trained model can accurately analyze financial data in the image-text modality.
Referring to fig. 3, fig. 3 is a schematic block diagram of a pre-training device of an image-text understanding model according to an embodiment of the present application.
As shown in fig. 3, the apparatus 300 includes: a masking module 301, an embedding module 302, an acquisition module 303 and an adjustment module 304.
The masking module 301 is configured to mask the paired text and image in a preset image-text sample to obtain a text segment input sequence and an image block input sequence, and to call a preset image-text understanding model, wherein the image-text understanding model has a dual-tower structure;
the embedding module 302 is configured to input the text segment input sequence to the text side of the image-text understanding model for an embedding operation to extract a text embedding vector, and to input the image block input sequence to the visual side for a linear embedding operation to extract an image embedding vector;
the acquisition module 303 is configured to obtain a multimodal fusion feature with an attention mechanism, based on the text embedding vector and the image embedding vector;
and the adjustment module 304 is configured to adjust parameters of the image-text understanding model according to the multimodal fusion feature, with a language task that predicts the masked text segment features in the text embedding vector and a visual task that predicts the masked image block features in the image embedding vector as pre-training targets, so as to complete pre-training of the image-text understanding model.
It should be noted that, for convenience and brevity of description, the specific working processes of the above apparatus and of each module and unit may refer to the corresponding processes in the foregoing embodiments of the pre-training method for the image-text understanding model, and are not repeated here.
The apparatus provided by the above embodiments may be implemented in the form of a computer program which may be run on a computer device as shown in fig. 4.
Referring to fig. 4, fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device may be a personal computer (PC), a server, or another device with data processing capability.
As shown in fig. 4, the computer device includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program comprises program instructions that, when executed, cause the processor to perform any of the pre-training methods for the image-text understanding model.
The processor is used to provide computing and control capabilities to support the operation of the entire computer device.
The internal memory provides an environment for the execution of the computer program in the non-volatile storage medium; the computer program, when executed by the processor, causes the processor to perform any of the pre-training methods for the image-text understanding model.
The network interface is used for network communication such as transmitting assigned tasks and the like. It will be appreciated by persons skilled in the art that the architecture shown in fig. 4 is merely a block diagram of some of the architecture relevant to the present inventive arrangements and is not limiting as to the computer device to which the present inventive arrangements are applicable, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
It should be appreciated that the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or any conventional processor.
In one embodiment, the processor is configured to run a computer program stored in the memory to implement the following steps:
masking the paired text and image in a preset image-text sample to obtain a text segment input sequence and an image block input sequence, and calling a preset image-text understanding model, wherein the image-text understanding model has a dual-tower structure;
inputting the text segment input sequence to the text side of the image-text understanding model for an embedding operation to extract a text embedding vector, and inputting the image block input sequence to the visual side of the image-text understanding model for a linear embedding operation to extract an image embedding vector;
obtaining a multimodal fusion feature with an attention mechanism, based on the text embedding vector and the image embedding vector;
according to the multimodal fusion feature, adjusting parameters of the image-text understanding model with a language task that predicts the masked text segment features in the text embedding vector and a visual task that predicts the masked image block features in the image embedding vector as pre-training targets, so as to complete pre-training of the image-text understanding model.
In some embodiments, after the text segment input sequence is input to the text side of the image-text understanding model for the embedding operation, the processor is configured to implement:
extracting a position embedding vector and a segment embedding vector of the text segment;
and after the image block input sequence is input to the visual side of the image-text understanding model for the linear embedding operation, the processor is configured to implement:
extracting a position embedding vector and a block embedding vector of the image block.
In some embodiments, when obtaining the multimodal fusion feature with an attention mechanism based on the text embedding vector and the image embedding vector, the processor is configured to implement:
combining the position embedding vector and segment embedding vector of the text segment with the text embedding vector to obtain a text feature vector;
combining the position embedding vector and block embedding vector of the image block with the image embedding vector to obtain an image feature vector;
and obtaining the multimodal fusion feature with an attention mechanism according to the text feature vector and the image feature vector.
In some embodiments, when obtaining the multimodal fusion feature with an attention mechanism according to the text feature vector and the image feature vector, the processor is configured to implement:
concatenating the text feature vector and the image feature vector and inputting the result to the multi-head attention layer of the image-text understanding model, so that the multi-head attention layer performs multimodal feature fusion on the text feature vector and the image feature vector using an attention mechanism, yielding the multimodal fusion feature.
In some embodiments, when adjusting parameters of the image-text understanding model according to the multimodal fusion feature, with a language task that predicts the masked text segment features in the text embedding vector and a visual task that predicts the masked image block features in the image embedding vector as pre-training targets, so as to complete pre-training of the image-text understanding model, the processor is configured to implement:
using the multimodal fusion feature, computing the loss values of the image-text understanding model's predictions for the masked text segment features and the masked image block features;
and adjusting the parameters of the image-text understanding model according to the loss values to complete pre-training of the image-text understanding model.
In some embodiments, the text side includes a BERT model and the visual side includes a ViLT model.
In some embodiments, when masking the paired text and image in the preset image-text sample, the processor is configured to implement:
segmenting the text to obtain a text segment sequence, and masking the text segment sequence according to a preset first mask proportion to obtain the text segment input sequence;
and cutting the image to obtain an image block sequence, and masking the image block sequence according to a preset second mask proportion to obtain the image block input sequence.
Embodiments of the present application further provide a computer-readable storage medium having a computer program stored thereon, where the computer program includes program instructions; the method implemented when the program instructions are executed may refer to the embodiments of the pre-training method for the image-text understanding model of the present application.
The computer readable storage medium may be an internal storage unit of the computer device according to the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, which are provided on the computer device.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created from the use of blockchain nodes, and the like.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated and linked by cryptographic means, each block containing a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments. While the application has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made and equivalents will be apparent to those skilled in the art without departing from the scope of the application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (10)

1. A pre-training method for an image-text understanding model, the method comprising the steps of:
masking the paired text and image in a preset image-text sample to obtain a text segment input sequence and an image block input sequence, and calling a preset image-text understanding model, wherein the image-text understanding model has a dual-tower structure;
inputting the text segment input sequence to the text side of the image-text understanding model for an embedding operation to extract a text embedding vector, and inputting the image block input sequence to the visual side of the image-text understanding model for a linear embedding operation to extract an image embedding vector;
obtaining a multimodal fusion feature with an attention mechanism, based on the text embedding vector and the image embedding vector;
according to the multimodal fusion feature, adjusting parameters of the image-text understanding model with a language task that predicts the masked text segment features in the text embedding vector and a visual task that predicts the masked image block features in the image embedding vector as pre-training targets, so as to complete pre-training of the image-text understanding model.
2. The pre-training method for an image-text understanding model according to claim 1, wherein after the text segment input sequence is input to the text side of the image-text understanding model for the embedding operation, the method comprises:
extracting a position embedding vector and a segment embedding vector of the text segment;
and after the image block input sequence is input to the visual side of the image-text understanding model for the linear embedding operation, the method comprises:
extracting a position embedding vector and a block embedding vector of the image block.
3. The pre-training method for an image-text understanding model according to claim 2, wherein obtaining the multimodal fusion feature with an attention mechanism based on the text embedding vector and the image embedding vector comprises:
combining the position embedding vector and segment embedding vector of the text segment with the text embedding vector to obtain a text feature vector;
combining the position embedding vector and block embedding vector of the image block with the image embedding vector to obtain an image feature vector;
and obtaining the multimodal fusion feature with an attention mechanism according to the text feature vector and the image feature vector.
4. The pre-training method for an image-text understanding model according to claim 3, wherein obtaining the multimodal fusion feature with an attention mechanism according to the text feature vector and the image feature vector comprises:
concatenating the text feature vector and the image feature vector and inputting the result to the multi-head attention layer of the image-text understanding model, so that the multi-head attention layer performs multimodal feature fusion on the text feature vector and the image feature vector using an attention mechanism, yielding the multimodal fusion feature.
5. The pre-training method for an image-text understanding model according to claim 1, wherein, according to the multimodal fusion feature, adjusting parameters of the image-text understanding model with a language task that predicts the masked text segment features in the text embedding vector and a visual task that predicts the masked image block features in the image embedding vector as pre-training targets comprises:
using the multimodal fusion feature, computing the loss values of the image-text understanding model's predictions for the masked text segment features and the masked image block features;
and adjusting the parameters of the image-text understanding model according to the loss values to complete pre-training of the image-text understanding model.
6. The pre-training method for an image-text understanding model according to claim 1, wherein the text side comprises a BERT model and the visual side comprises a ViLT model.
7. The pre-training method for an image-text understanding model according to claim 1, wherein masking the paired text and image in the preset image-text sample to obtain the text segment input sequence and the image block input sequence comprises:
segmenting the text to obtain a text segment sequence, and masking the text segment sequence according to a preset first mask proportion to obtain the text segment input sequence;
and cutting the image to obtain an image block sequence, and masking the image block sequence according to a preset second mask proportion to obtain the image block input sequence.
8. A pre-training device for an image-text understanding model, the device comprising:
a masking module, configured to mask the paired text and image in a preset image-text sample to obtain a text segment input sequence and an image block input sequence, and to call a preset image-text understanding model, wherein the image-text understanding model has a dual-tower structure;
an embedding module, configured to input the text segment input sequence to the text side of the image-text understanding model for an embedding operation to extract a text embedding vector, and to input the image block input sequence to the visual side of the image-text understanding model for a linear embedding operation to extract an image embedding vector;
an acquisition module, configured to obtain a multimodal fusion feature with an attention mechanism, based on the text embedding vector and the image embedding vector;
and an adjustment module, configured to adjust parameters of the image-text understanding model according to the multimodal fusion feature, with a language task that predicts the masked text segment features in the text embedding vector and a visual task that predicts the masked image block features in the image embedding vector as pre-training targets, so as to complete pre-training of the image-text understanding model.
9. A computer device comprising a processor, a memory, and a computer program stored in the memory and executable by the processor, wherein the computer program, when executed by the processor, performs the steps of the pre-training method for the image-text understanding model according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the pre-training method for the image-text understanding model according to any one of claims 1 to 7.
CN202310727691.4A 2023-06-16 2023-06-16 Pre-training method, device, equipment and storage medium for graphic understanding model Pending CN116796287A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310727691.4A CN116796287A (en) 2023-06-16 2023-06-16 Pre-training method, device, equipment and storage medium for graphic understanding model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310727691.4A CN116796287A (en) 2023-06-16 2023-06-16 Pre-training method, device, equipment and storage medium for graphic understanding model

Publications (1)

Publication Number Publication Date
CN116796287A true CN116796287A (en) 2023-09-22

Family

ID=88043372

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310727691.4A Pending CN116796287A (en) 2023-06-16 2023-06-16 Pre-training method, device, equipment and storage medium for graphic understanding model

Country Status (1)

Country Link
CN (1) CN116796287A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117292020A (en) * 2023-11-24 2023-12-26 深圳市信润富联数字科技有限公司 Image generation method, device, electronic equipment and storage medium
CN117292020B (en) * 2023-11-24 2024-03-26 深圳市信润富联数字科技有限公司 Image generation method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112232149B (en) Document multimode information and relation extraction method and system
WO2022105125A1 (en) Image segmentation method and apparatus, computer device, and storage medium
CN109564575A (en) Classified using machine learning model to image
CN111275107A (en) Multi-label scene image classification method and device based on transfer learning
CN115146488A (en) Variable business process intelligent modeling system and method based on big data
CN115147598B (en) Target detection segmentation method and device, intelligent terminal and storage medium
EP4336378A1 (en) Data processing method and related device
CN113792741A (en) Character recognition method, device, equipment and storage medium
CN112001931A (en) Image segmentation method, device, equipment and storage medium
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN116796287A (en) Pre-training method, device, equipment and storage medium for graphic understanding model
CN114611672A (en) Model training method, face recognition method and device
CN114612902A (en) Image semantic segmentation method, device, equipment, storage medium and program product
CN113159013A (en) Paragraph identification method and device based on machine learning, computer equipment and medium
CN110909578A (en) Low-resolution image recognition method and device and storage medium
CN116630726B (en) Multi-mode-based bird classification method and system
CN115292439A (en) Data processing method and related equipment
CN117252947A (en) Image processing method, image processing apparatus, computer, storage medium, and program product
CN116843901A (en) Medical image segmentation model training method and medical image segmentation method
CN116740078A (en) Image segmentation processing method, device, equipment and medium
CN113408265B (en) Semantic analysis method, device and equipment based on human-computer interaction and storage medium
CN115115910A (en) Training method, using method, device, equipment and medium of image processing model
CN115130473A (en) Key information extraction method, model training method, related device and electronic equipment
An et al. DUFormer: Solving Power Line Detection Task in Aerial Images Using Semantic Segmentation
Yang et al. A deep learning approach for automated segmentation of magnetic bright points in the solar photosphere

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination