WO2024011814A1 - Image-text mutual retrieval method and system, device, and non-volatile readable storage medium (一种图文互检方法、系统、设备及非易失性可读存储介质)


Info

Publication number: WO2024011814A1
Authority: WO (WIPO, PCT)
Prior art keywords: layer, text, target, information, image
Application number: PCT/CN2022/134091
Other languages: English (en), French (fr)
Inventors: 李仁刚, 王立, 范宝余, 郭振华
Original Assignee: 苏州元脑智能科技有限公司
Application filed by 苏州元脑智能科技有限公司
Publication of WO2024011814A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; database structures therefor; file system structures therefor
    • G06F 16/40 Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/432 Querying; query formulation
    • G06F 16/45 Clustering; classification
    • G06F 16/483 Retrieval characterised by using metadata automatically derived from the content
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/02 Neural networks
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/084 Learning methods; backpropagation, e.g. using gradient descent

Definitions

  • The present application relates to the field of data processing technology and, more specifically, to an image-text mutual retrieval method, system, device, and non-volatile readable storage medium.
  • This application provides an image-text mutual retrieval method, which can, to a certain extent, solve the technical problem of how to improve the accuracy of image-text mutual retrieval.
  • This application also provides an image-text mutual retrieval system, device, and non-volatile readable storage medium. To achieve the above objectives, this application provides the following technical solutions:
  • an image-text mutual retrieval method, including:
  • the target text includes various sub-information representing the target information
  • the target text input information is processed based on the text processing model in the pre-trained image-text mutual retrieval neural network model to obtain the target text processing result; where the text processing model is built based on self-supervised learning, which supervises the target text based on the associated information between the various types of sub-information;
  • the target image is processed based on the image processing model in the image-text mutual retrieval neural network model to obtain the target image processing result;
  • determining the target text input information corresponding to the target text includes:
  • for each sub-information, convert the sub-information, the corresponding position information, and the first type of information into corresponding initial vector information, and use the sum of all the initial vector information as the first vector information of the sub-information;
  • Target text input information is determined based on the first vector information.
  • determining the target text input information based on the first vector information includes:
  • the second vector information and the first vector information are used as target text input information.
  • the process of determining corresponding weight values for self-supervised learning includes:
  • the target sample is determined in one of the sub-information; the first type of sample paired with the target sample and the second type of sample not paired with the target sample are determined in the other sub-information; a first distance value between the target sample and the first type of sample is determined, and a second distance value between the target sample and the second type of sample is determined;
  • the weight values for self-supervised learning are determined based on loss values.
  • the loss value of self-supervised learning is determined based on all first distance values and second distance values, including:
  • the loss function of self-supervised learning (rendered as an image in the source) is expressed in terms of: b, the self-supervised-learning batch; N, the number of paired samples; and d, the distance value.
  • the text processing model includes a neural network model based on a transformer model and self-supervised learning.
  • the text processing model includes: an input layer; a multi-head attention mechanism layer connected to the input layer; a first normalization layer connected to the input layer and the multi-head attention mechanism layer; a forward transfer layer connected to the first normalization layer; a second normalization layer connected to the forward transfer layer and the first normalization layer; a first fully connected layer, a first excitation layer, a second fully connected layer, and a self-supervised classification output layer connected in sequence to the second normalization layer; target fully connected layers connected to the second normalization layer, in one-to-one correspondence with the sub-information; a fourth fully connected layer connected to the second normalization layer; a fifth fully connected layer connected to the second normalization layer; a splicing layer connected to the first fully connected layer and all target fully connected layers; and a third fully connected layer connected to the splicing layer.
  • the image processing model is built based on the attention mechanism.
  • the image processing model includes a target number of image processing branches, and a fourth fully connected layer connected to the image processing branches;
  • the image processing branch includes: an input layer; a backbone network connected to the input layer; a fifth fully connected layer connected to the backbone network; an attention mechanism layer connected to the fifth fully connected layer; a first normalization layer connected to the attention mechanism layer; a multiplier connected to the first normalization layer; an adder connected to the multiplier and the fifth fully connected layer; a Linear layer connected to the adder; and a BiLSTM layer connected to the Linear layer;
  • the first normalization layer in each image processing branch is the same; and the BiLSTM layers in each image processing branch are connected to each other.
  • the attention mechanism layer includes: a sixth fully connected layer connected to the fifth fully connected layer, a second excitation layer connected to the sixth fully connected layer, a seventh fully connected layer connected to the second excitation layer, and a second normalization layer connected to the seventh fully connected layer; the second normalization layer is connected to the first normalization layer.
  • the loss function in the image-text mutual retrieval neural network model includes the generalized pairwise hinge loss described in the detailed description below (the formula is rendered as an image in the source).
  • the backbone network includes a ResNet network.
  • an image-text mutual retrieval system, including:
  • a first acquisition module, used to acquire a set of target texts and a set of target images to be retrieved, where the target text includes various types of sub-information representing the target information;
  • a first determination module, used to determine the target text input information corresponding to the target text;
  • a first processing module, used to process the target text input information based on the text processing model in the pre-trained image-text mutual retrieval neural network model to obtain the target text processing result, where the text processing model is built based on self-supervised learning, which supervises the target text based on the correlation information between the various types of sub-information;
  • a second processing module, used to process the target image based on the image processing model in the image-text mutual retrieval neural network model to obtain the target image processing result;
  • a second determination module, used to determine, based on the target text processing result and the target image processing result, the image retrieval result of the target text in the target images and/or the text retrieval result of the target image in the target texts;
  • an image-text mutual retrieval device, including: a memory used to store a computer program, and a processor used to implement the steps of any of the above image-text mutual retrieval methods when executing the computer program;
  • a non-volatile readable storage medium in which a computer program is stored; when the computer program is executed by a processor, the steps of any of the above image-text mutual retrieval methods are implemented.
  • This application provides an image-text mutual retrieval method: obtain a set of target texts and a set of target images to be retrieved, where the target text includes various types of sub-information that characterize the target information; determine the target text input information corresponding to the target text; process the target text input information based on the text processing model in the pre-trained image-text mutual retrieval neural network model to obtain the target text processing result, where the text processing model is built based on self-supervised learning, which supervises the target text using the associated information between the sub-information; process the target image based on the image processing model in the image-text mutual retrieval neural network model to obtain the target image processing result; and, based on the target text processing result and the target image processing result, determine the image retrieval result of the target text in the target images and/or the text retrieval result of the target image in the target texts.
  • Because the text processing model is built based on self-supervised learning, which supervises the target text using the associated information between the various types of sub-information, this application effectively uses that associated information to obtain the target text processing result; since the associated information between sub-information reflects the correlation between the various types of information in the target text, the text processing model can ensure the accuracy of target text processing and thereby the accuracy of image-text mutual retrieval.
  • The image-text mutual retrieval system, device, and non-volatile readable storage medium provided by this application solve the corresponding technical problems in the same way.
  • Figure 1 is a first flow chart of an image-text mutual retrieval method provided by an embodiment of the present application.
  • Figure 2 is a second flow chart of an image-text mutual retrieval method provided by an embodiment of the present application.
  • Figure 3 is a schematic structural diagram of the image-text mutual retrieval neural network model in this application.
  • Figure 4 is a schematic structural diagram of the attention mechanism layer.
  • Figure 5 is a schematic diagram of this application's traversal of image and text features.
  • Figure 6 is a schematic structural diagram of an image-text mutual retrieval system provided by an embodiment of the present application.
  • Figure 7 is a schematic structural diagram of an image-text mutual retrieval device provided by an embodiment of the present application.
  • Figure 8 is another schematic structural diagram of an image-text mutual retrieval device provided by an embodiment of the present application.
  • Referring to Figure 1, an image-text mutual retrieval method provided by an embodiment of the present application may include the following steps:
  • Step S101 Obtain a set of target text and a set of target images to be retrieved.
  • the target text includes various types of sub-information representing the target information.
  • a set of target text and a set of target images to be retrieved can be first obtained, so that the image corresponding to the target text can be subsequently determined in a set of target images.
  • the number of a set of target texts and a set of target images acquired and the types of target texts and target images can be determined according to actual needs.
  • the target texts and target images can be medical texts and medical images, server maintenance texts and server maintenance images, meal preparation texts and images, etc.
  • the target text in this application includes various types of sub-information of the target information, where each sub-information reflects the corresponding information of the target information at a certain level. Taking a meal preparation tutorial as the type of target information, the sub-information contained in the target text can be the type of ingredients, the preparation process, precautions, etc.; this application makes no specific limitation here.
  • Step S102 Determine the target text input information corresponding to the target text.
  • the target text input information corresponding to the target text can be determined, so that subsequent image-text mutual retrieval can be performed with the help of the target text input information.
  • Step S103: Process the target text input information based on the text processing model in the pre-trained image-text mutual retrieval neural network model to obtain the target text processing result, where the text processing model is built based on self-supervised learning, which supervises the target text based on the associated information between the various types of sub-information.
  • in a specific application scenario, the target text input information can be processed based on the text processing model in the pre-trained image-text mutual retrieval neural network model to obtain the target text processing result.
  • the text processing model is built based on self-supervised learning, and self-supervised learning supervises the target text based on the correlation information between the various types of sub-information; that is, this application effectively processes the target text based on the correlation information between its sub-information to obtain the corresponding target text processing result.
  • the structure of the image-text mutual retrieval neural network model can be determined according to actual needs and is not specifically limited in this application.
  • the training process of the neural network is divided into two stages. The first stage is the stage in which data propagates from low level to high level, that is, the forward propagation stage; the other stage is the stage in which, when the result of the current forward propagation does not match expectations, the error is propagated from the high level back to the bottom level, that is, the back propagation stage.
  • the training process of the image-text mutual retrieval neural network can therefore be as follows: initialize all network layer weights, generally using random initialization; pass the input image and text data through the layers (graph neural network, convolution layer, downsampling layer, fully connected layer, etc.) by forward propagation to obtain the output value; compute the loss function value of the network output; propagate the error back through the network, obtaining the back propagation error of each layer in turn (graph neural network layer, fully connected layer, convolution layer, etc.); have each layer adjust all of its weight coefficients according to its back propagation error, that is, update the weights; randomly select a new batch of data and return to the forward propagation step; iterate until the error between the network output value and the target value (label) is less than a certain threshold, or until the number of iterations exceeds a certain threshold, at which point training ends; finally, save the trained network parameters of all layers.
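  • A minimal PyTorch sketch of this generic two-stage training loop (illustrative only; the model, loss function, data loader, and stopping thresholds are assumptions, not taken from the source):

    import torch

    def train(model, loss_fn, loader, epochs=100, lr=1e-4, err_threshold=1e-3):
        # Network weights are randomly initialized when the model is constructed.
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for epoch in range(epochs):                            # bounded iteration count
            for images, texts, labels in loader:               # a new random batch each step
                image_feat, text_feat = model(images, texts)   # forward propagation
                loss = loss_fn(image_feat, text_feat, labels)  # loss of the network output
                optimizer.zero_grad()
                loss.backward()                                # back propagation of the error
                optimizer.step()                               # update all layer weights
            if loss.item() < err_threshold:                    # error below threshold: stop
                break
        torch.save(model.state_dict(), "model.pt")             # save trained parameters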
  • Step S104: Process the target image based on the image processing model in the image-text mutual retrieval neural network model to obtain the target image processing result.
  • after the target text input information has been processed based on the text processing model in the pre-trained image-text mutual retrieval neural network model and the target text processing result has been obtained, the image processing model in the image-text mutual retrieval neural network model can process the target image to obtain the target image processing result, so that the corresponding image-text mutual retrieval result can subsequently be determined from the two processing results.
  • Step S105: Based on the target text processing result and the target image processing result, determine the image retrieval result of the target text in the target images, and/or determine the text retrieval result of the target image in the target texts.
  • the corresponding image-text mutual retrieval result can be determined based on the target text processing result and the target image processing result.
  • for example, the image retrieval result of the target text in the target images can be determined, and/or the text retrieval result of the target image in the target texts can be determined, etc.; this application makes no specific limitation here.
  • the text and image processing process of the image-text mutual retrieval neural network model can be determined according to actual needs and is not specifically limited in this application.
  • the image-text mutual retrieval neural network model can extract features from texts or images and store the extracted features in the data set to be retrieved; receive any text data or image data given by the user as query data; extract the features of the query text data or image data; and perform distance matching between the features of the query data and the features of all samples in the data set to be retrieved, that is, compute a vector distance such as the Euclidean distance. For example, if the query data is text data, the distances to all image features in the data set to be retrieved are computed; likewise, if the query data is image data, the Euclidean distances to all text features in the data set to be retrieved are computed. The sample with the smallest distance is the recommended sample and is output.
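  • As an illustration of the distance-matching step, a minimal sketch assuming pre-extracted feature tensors (the function and variable names are illustrative):

    import torch

    def retrieve(query_feat, gallery_feats):
        # query_feat: (D,) feature of the query text or image.
        # gallery_feats: (N, D) features of the opposite modality in the
        # data set to be retrieved.
        dists = torch.cdist(query_feat[None, :], gallery_feats)[0]  # Euclidean distances
        return torch.argmin(dists).item()  # smallest distance = recommended sample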
  • This application provides an image-text mutual retrieval method: obtain a set of target texts and a set of target images to be retrieved, where the target text includes various types of sub-information that characterize the target information; determine the target text input information corresponding to the target text; process the target text input information based on the text processing model in the pre-trained image-text mutual retrieval neural network model to obtain the target text processing result, where the text processing model is built based on self-supervised learning, which supervises the target text using the associated information between the sub-information; process the target image based on the image processing model in the image-text mutual retrieval neural network model to obtain the target image processing result; and, based on the target text processing result and the target image processing result, determine the image retrieval result of the target text in the target images and/or the text retrieval result of the target image in the target texts.
  • Because the text processing model is built based on self-supervised learning, which supervises the target text using the associated information between the various types of sub-information, this application effectively uses that associated information to obtain the target text processing result; since the associated information between sub-information reflects the correlation between the various types of information in the target text, the text processing model can ensure the accuracy of target text processing and thereby the accuracy of image-text mutual retrieval.
  • Figure 2 shows a second flow chart of an image-text mutual retrieval method provided by an embodiment of the present application.
  • Step S201 Obtain a set of target text and a set of target images to be retrieved.
  • the target text includes various types of sub-information representing the target information.
  • Step S202 Determine various types of sub-information in the target text.
  • various types of sub-information in the target text can be determined in order to classify the information in the target text by sub-information, and then determine the corresponding target text input information based on the sub-information in the target text. It should be noted that the type and quantity of sub-information can be determined according to actual needs, and this application makes no specific limitation here.
  • Step S203 Determine the location information corresponding to each sub-information.
  • Step S204 Determine the first type of information corresponding to each sub-information.
  • Step S205 For each sub-information, convert the sub-information, corresponding position information, and first type information into corresponding initial vector information, and use the sum of all initial vector information as the first vector information of the sub-information.
  • Step S206 Determine target text input information based on the first vector information.
  • the position information corresponding to each sub-information can be determined, for example by using the position of the sub-information in the target text as its corresponding position information, or by using the order in which the sub-information appears in the target text as its corresponding position information; the first type of information corresponding to each sub-information is determined in order to represent the type of the sub-information; for each sub-information, the sub-information, the corresponding position information, and the first type of information are converted into corresponding initial vector information, and the sum of all the initial vector information is used as the first vector information of the sub-information; finally, the target text input information is determined based on the first vector information.
  • in the process of determining the target text input information based on the first vector information, the second type of information of the target text can be determined in order to characterize the type of the target text; the second type of information is converted into the corresponding second vector information; and the second vector information and the first vector information are used together as the target text input information.
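  • A minimal sketch of this embedding-sum construction (the vocabulary sizes and dimensions are assumptions; the source gives no concrete values):

    import torch
    import torch.nn as nn

    class TextInputEmbedding(nn.Module):
        def __init__(self, vocab=30000, max_pos=512, n_types=4, dim=768):
            super().__init__()
            self.tok = nn.Embedding(vocab, dim)    # sub-information token -> initial vector
            self.pos = nn.Embedding(max_pos, dim)  # position information -> initial vector
            self.typ = nn.Embedding(n_types, dim)  # first type information -> initial vector

        def forward(self, token_ids, type_ids):
            # token_ids, type_ids: (batch, length) integer tensors.
            pos_ids = torch.arange(token_ids.size(1), device=token_ids.device)
            # First vector information: the sum of the three initial vectors.
            return self.tok(token_ids) + self.pos(pos_ids) + self.typ(type_ids)

  • in such a sketch, the second vector information (encoding the second type of information for the whole text) could simply be prepended to this sequence, much like a classification token; this is an assumption, since the source only states that the two vectors are used together as the input.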
  • the process of determining the corresponding weight values for self-supervised learning may include: for any two sub-information in the text processing model, determine the target sample in one of the sub-information; determine, in the other sub-information, the first type of sample that is paired with the target sample and the second type of sample that is not paired with the target sample; determine the first distance value between the target sample and the first type of sample, and the second distance value between the target sample and the second type of sample; determine the loss value of self-supervised learning based on all the first distance values and second distance values; and determine the weight values of self-supervised learning based on the loss value.
  • the loss function of self-supervised learning can be used to determine the loss value based on all first distance values and second distance values.
  • in that loss function (rendered as an image in the source), b represents the self-supervised-learning batch, N represents the number of paired samples, and d represents the distance value.
  • in specific application scenarios, when applying the loss function values for self-supervised learning training, the sum of all loss function values can be used as the final loss value of self-supervised learning for training, etc.; this application makes no specific limitation here. It should be noted that self-supervised learning means that the annotation (ground truth) used for machine learning comes from the data itself rather than from manual annotation; in this application, the features of the various sub-information serve as labels for each other (for example, the encoding of the first text feature and the encoding of the second text feature are labels for each other and learn from each other) without manual participation, which is why it is called self-supervised learning.
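  • A minimal sketch of such a pairwise self-supervised loss between two sub-information feature sets. The exact formula appears only as an image in the source; a margin-based hinge over paired/unpaired distances is one consistent reading, and the margin value is assumed:

    import torch
    import torch.nn.functional as F

    def self_supervised_loss(feat_a, feat_b, margin=0.4):
        # feat_a, feat_b: (N, D) encodings of two sub-information types from one
        # batch; row i of feat_a is paired with row i of feat_b.
        d = torch.cdist(feat_a, feat_b)                          # all pairwise distances
        pos = d.diag()                                           # first distance: paired samples
        masked = d + torch.eye(len(d), device=d.device) * 1e9    # hide the paired entries
        neg = masked.min(dim=1).values                           # second distance: closest unpaired
        return F.relu(pos - neg + margin).mean()                 # averaged over the N pairs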
  • text processing models can include neural network models based on transformer models and self-supervised learning.
  • the text processing model can include an input layer; a multi-head attention mechanism layer (Masked Multihead Attention) connected to the input layer; a first normalization layer connected to the input layer and the multi-head attention mechanism layer;
  • the forward transfer layer (Feed Forward) connected to the first normalization layer; the second normalization layer connected to the forward transfer layer and the first normalization layer; the first fully connected layer (FC), the first excitation layer (ReLU), the second fully connected layer, and the self-supervised classification output layer connected in sequence to the second normalization layer; the fourth fully connected layer connected to the second normalization layer; the fifth fully connected layer connected to the second normalization layer;
  • the target fully connected layers connected to the second normalization layer, in one-to-one correspondence with the sub-information;
  • the splicing layer connected to the first fully connected layer and all target fully connected layers; and the third fully connected layer connected to the splicing layer.
  • the first text information, the second text information and the third text information in Figure 3 are the corresponding sub-information in the target text.
  • the output of the third fully connected layer is the text processing model's processing result for the target text.
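  • A rough PyTorch skeleton of these text-branch heads (dimensions, head sizes, and pooling are assumptions; only the layer wiring follows the description above):

    import torch
    import torch.nn as nn

    class TextBranch(nn.Module):
        def __init__(self, dim=768, n_sub=3, n_cls=100):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=1)  # attention, norms, feed-forward
            self.fc1 = nn.Linear(dim, dim)                      # first fully connected layer
            self.cls_head = nn.Sequential(                     # first excitation + second FC
                nn.ReLU(), nn.Linear(dim, n_cls))               # -> self-supervised classification
            self.sub_heads = nn.ModuleList(                    # one target FC per sub-information
                nn.Linear(dim, dim) for _ in range(n_sub))
            self.fc3 = nn.Linear(dim * (n_sub + 1), dim)        # third FC after the splicing layer

        def forward(self, x, sub_slices):
            # x: (batch, length, dim) embeddings; sub_slices: one slice per sub-information.
            h = self.encoder(x)
            pooled = self.fc1(h.mean(dim=1))                    # first FC output
            cla = self.cls_head(pooled)                         # classification output
            parts = [pooled] + [head(h[:, sl].mean(dim=1))      # per-sub-information features
                                for head, sl in zip(self.sub_heads, sub_slices)]
            text_feat = self.fc3(torch.cat(parts, dim=-1))      # splice, then third FC
            return text_feat, cla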
  • in the classification loss formula (rendered as an image in the source): loss_cla represents the loss function value corresponding to the transformer; K represents the dimension of cla and label; sigmoid and ln represent the operation functions; label_k represents the element at the k-th position in label; and cla_k represents the element at the k-th position in cla.
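  • Given the listed symbols (sigmoid, ln, label_k, cla_k, and dimension K), this loss is consistent with a binary cross-entropy; a hedged reconstruction:

    import torch

    def classification_loss(cla, label):
        # cla, label: (K,) tensors. A reconstruction consistent with the listed
        # symbols, since the formula itself is an image in the source.
        p = torch.sigmoid(cla).clamp(1e-7, 1 - 1e-7)    # sigmoid, kept away from 0/1
        return -(label * torch.log(p) + (1 - label) * torch.log(1 - p)).mean()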
  • Step S207: Process the target text input information based on the text processing model in the pre-trained image-text mutual retrieval neural network model to obtain the target text processing result, where the text processing model is built based on self-supervised learning, which supervises the target text based on the associated information between the various types of sub-information.
  • Step S208: Process the target image based on the image processing model in the image-text mutual retrieval neural network model to obtain the target image processing result.
  • Step S209: Based on the target text processing result and the target image processing result, determine the image retrieval result of the target text in the target images, and/or determine the text retrieval result of the target image in the target texts.
  • the image-text mutual retrieval neural network model can include an image processing model built based on the attention mechanism; the image processing model is used to process the target image.
  • the image processing model may include a target number of image processing branches and a fourth fully connected layer connected to the image processing branches; the image processing branch includes an input layer, a backbone network connected to the input layer, a fifth fully connected layer connected to the backbone network, an attention mechanism layer connected to the fifth fully connected layer, a first normalization layer connected to the attention mechanism layer, a multiplier connected to the first normalization layer, an adder connected to the multiplier and the fifth fully connected layer, a Linear layer connected to the adder, and a BiLSTM layer connected to the Linear layer;
  • the BiLSTM layers in each image processing branch are connected to each other.
  • the type of the backbone network can be determined according to actual needs.
  • the backbone network can be a ResNet backbone network, etc.
  • the text processing model and the image processing model can be connected through an output layer, a loss layer, etc.; for example, in Figure 3 the text processing model and the image processing model are connected through the Generalized Pairwise Hinge-Loss layer; this application makes no specific limitation here.
  • the image features are input to the BiLSTM network to obtain the overall features of the entire image group.
  • the encoding formula is rendered as an image in the source.
  • the image group also includes reverse-order and sequential-order images, both of which contain temporal semantic information and are encoded with the same formula.
  • in that formula (some symbol glyphs were lost in extraction): BiLSTM represents each unit of the BiLSTM network, one arrow represents the sequential order and the other the reverse order; one symbol represents the output of the i-th BiLSTM unit; another represents the image input feature, with i indexing the i-th image; and φ_att(·) represents the backbone network of this application. The average of the feature encodings output by the BiLSTM units is taken as the single output of the entire medical image group.
  • e_csi represents the output image group feature, which is used in the next retrieval step.
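  • A minimal sketch of this group encoding (the feature size is an assumption; the averaging follows the description above):

    import torch
    import torch.nn as nn

    class GroupEncoder(nn.Module):
        def __init__(self, dim=512):
            super().__init__()
            # Bidirectional LSTM: forward and reverse orders are encoded jointly.
            self.bilstm = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)

        def forward(self, img_feats):
            # img_feats: (batch, n_images, dim) backbone features of an ordered image group.
            h, _ = self.bilstm(img_feats)   # per-unit outputs, both temporal directions
            return h.mean(dim=1)            # average = single group feature e_csi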
  • the attention mechanism layer in this application includes: a sixth fully connected layer connected to the fifth fully connected layer, a second excitation layer connected to the sixth fully connected layer, a seventh fully connected layer connected to the second excitation layer, and a second normalization layer connected to the seventh fully connected layer; the second normalization layer is connected to the first normalization layer.
  • the image features are passed through the backbone network to obtain embedded features, and the embedded features are passed through a fully connected layer to obtain the final embedded features e of each image.
  • the weight of each final embedded feature e is computed through the attention structure.
  • each weight is a single number, normalized through the sigmoid (S-shaped) layer.
  • the weights of the features of all images are then fed uniformly into the softmax (normalized exponential) layer to determine which image is important.
  • the feature weight of each image after the softmax layer is multiplied by the corresponding final embedded feature e of that image.
  • the idea of the residual network is introduced: the output of the attention structure (its formula is rendered as an image in the source) adds the weighted features back onto the embedded features.
  • finally, the image features pass through the Linear fully connected layer (FC) to obtain the final image features.
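  • A sketch of the attention structure just described, including the sigmoid/softmax weighting, the residual addition, and the final Linear layer (dimensions are assumptions):

    import torch
    import torch.nn as nn

    class ImageAttention(nn.Module):
        def __init__(self, dim=512):
            super().__init__()
            self.score = nn.Sequential(
                nn.Linear(dim, dim // 4), nn.ReLU(),   # sixth FC + second excitation
                nn.Linear(dim // 4, 1), nn.Sigmoid())  # seventh FC + sigmoid normalization
            self.fc = nn.Linear(dim, dim)              # final Linear (FC) layer

        def forward(self, e):
            # e: (batch, n_images, dim) final embedded features of one image group.
            w = self.score(e)               # one weight per image, in (0, 1)
            w = torch.softmax(w, dim=1)     # across the group: which image is important
            out = e + w * e                 # weighted features plus residual connection
            return self.fc(out)             # final image features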
  • the loss function in the image-text mutual retrieval neural network model that characterizes the accuracy of image-text mutual retrieval is rendered as an image in the source; its structure is as follows.
  • each image group feature encoding and text feature encoding can be traversed to obtain the average value of the loss function, as in that formula.
  • a total of M traversals are performed, where M is the total number of paired samples in this batch.
  • the image group features are traversed (M in total); the feature selected by the traversal is the anchor sample (a stands for anchor; the symbol itself is an image in the source).
  • the text feature encoding paired with the anchor sample is the positive sample (p stands for positive).
  • α2 is a hyperparameter, fixed during training, which can be set to 0.4, etc.
  • the same traversal operation is performed for the text features: the sample selected in the traversal is the text anchor, its corresponding positive image group feature sample is the positive, and those that do not correspond are recorded as s_np.
  • Use the above loss function to perform gradient backpropagation during training and update the cascade transformer, BiLSTM, and ResNet network parameters.
  • the total loss function of the image-text mutual retrieval neural network model can be the sum of all the loss functions, etc.; this application makes no specific limitation here.
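  • A hedged sketch of this traversal as a batched generalized pairwise hinge loss (the source formula is an image; the bidirectional anchor/positive/negative structure follows the description above, with α2 as the margin):

    import torch
    import torch.nn.functional as F

    def pairwise_hinge_loss(img_feats, txt_feats, alpha2=0.4):
        # img_feats, txt_feats: (M, D); row i of each is a paired sample,
        # so the diagonal holds the anchor-positive distances.
        d = torch.cdist(img_feats, txt_feats)               # (M, M) distance matrix
        pos = d.diag()                                      # anchor-positive distances
        mask = torch.eye(len(d), device=d.device).bool()    # hide the paired entries
        # Image anchors against non-corresponding texts, then the reverse direction.
        i2t = F.relu(alpha2 + pos[:, None] - d.masked_fill(mask, 1e9)).mean()
        t2i = F.relu(alpha2 + pos[:, None] - d.t().masked_fill(mask, 1e9)).mean()
        return i2t + t2i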
  • Referring to FIG. 6, an image-text mutual retrieval system provided by an embodiment of the present application may include:
  • a first acquisition module, used to acquire a set of target texts and a set of target images to be retrieved, where the target text includes various types of sub-information representing the target information;
  • a first determination module, used to determine the target text input information corresponding to the target text;
  • a first processing module, used to process the target text input information based on the text processing model in the pre-trained image-text mutual retrieval neural network model to obtain the target text processing result, where the text processing model is built based on self-supervised learning, which supervises the target text based on the correlation information between the various types of sub-information;
  • a second processing module, used to process the target image based on the image processing model in the image-text mutual retrieval neural network model to obtain the target image processing result;
  • a second determination module, used to determine, based on the target text processing result and the target image processing result, the image retrieval result of the target text in the target images and/or the text retrieval result of the target image in the target texts.
  • the first determination module may include:
  • the first determination unit is used to determine various types of sub-information in the target text
  • the second determination unit is used to determine the location information corresponding to each sub-information
  • a third determination unit used to determine the first type of information corresponding to each sub-information
  • a first conversion unit, configured to convert, for each sub-information, the sub-information, the corresponding position information, and the first type of information into corresponding initial vector information, and to use the sum of all the initial vector information as the first vector information of the sub-information;
  • the fourth determination unit is used to determine the target text input information based on the first vector information.
  • the fourth determination unit can be specifically used to: determine the second type of information of the target text; convert the second type of information into corresponding second vector information; and use the second vector information and the first vector information as the target text input information.
  • the determination process of the corresponding weight values for self-supervised learning includes: for any two sub-information in the text processing model, determine the target sample in one of the sub-information; determine, in the other sub-information, the first type of sample paired with the target sample and the second type of sample not paired with the target sample; determine the first distance value between the target sample and the first type of sample and the second distance value between the target sample and the second type of sample; determine the loss value of self-supervised learning based on all the first distance values and second distance values; and determine the weight values of self-supervised learning based on the loss value.
  • the embodiment of the present application provides an image-text mutual retrieval system in which the loss value of self-supervised learning is determined, through the loss function of self-supervised learning, based on all first distance values and second distance values; in that loss function (rendered as an image in the source), b represents the self-supervised-learning batch, N represents the number of paired samples, and d represents the distance value.
  • An embodiment of the present application provides a picture-text mutual inspection system.
  • the text processing model includes a neural network model based on the transformer model and self-supervised learning.
  • An embodiment of the present application provides an image-text mutual retrieval system in which the text processing model includes: an input layer; a multi-head attention mechanism layer connected to the input layer; a first normalization layer connected to the input layer and the multi-head attention mechanism layer; a forward transfer layer connected to the first normalization layer; a second normalization layer connected to the forward transfer layer and the first normalization layer; a first fully connected layer, a first excitation layer, a second fully connected layer, and a self-supervised classification output layer connected in sequence to the second normalization layer; target fully connected layers connected to the second normalization layer, in one-to-one correspondence with the sub-information; a fourth fully connected layer connected to the second normalization layer; a fifth fully connected layer connected to the second normalization layer; a splicing layer connected to the first fully connected layer and all target fully connected layers; and a third fully connected layer connected to the splicing layer.
  • the embodiment of this application provides an image and text mutual inspection system, and the image processing model is built based on the attention mechanism.
  • the image processing model includes a target number of image processing branches and a fourth fully connected layer connected to the image processing branches; the image processing branch includes an input layer, a backbone network connected to the input layer, a fifth fully connected layer connected to the backbone network, an attention mechanism layer connected to the fifth fully connected layer, a first normalization layer connected to the attention mechanism layer, a multiplier connected to the first normalization layer, an adder connected to the multiplier and the fifth fully connected layer, a Linear layer connected to the adder, and a BiLSTM layer connected to the Linear layer;
  • the first normalization layer in each image processing branch is the same; and the BiLSTM layers in each image processing branch are connected to each other.
  • the attention mechanism layer includes: a sixth fully connected layer connected to the fifth fully connected layer, a second excitation layer connected to the sixth fully connected layer, a seventh fully connected layer connected to the second excitation layer, and a second normalization layer connected to the seventh fully connected layer; the second normalization layer is connected to the first normalization layer.
  • An embodiment of the present application provides an image-text mutual retrieval system in which the loss function of the image-text mutual retrieval neural network model is the one described above (rendered as an image in the source).
  • An embodiment of the present application provides an image-text mutual retrieval system in which the backbone network includes a ResNet network.
  • This application also provides an image-text mutual retrieval device and a non-volatile readable storage medium, both of which have effects corresponding to those of the image-text mutual retrieval method provided by the embodiments of this application.
  • FIG. 7 is a schematic structural diagram of a picture-text mutual inspection device provided by an embodiment of the present application.
  • An image-text mutual retrieval device provided by an embodiment of the present application includes a memory 201 and a processor 202. A computer program is stored in the memory 201, and when the processor 202 executes the computer program, the following steps are implemented: obtain a set of target texts and a set of target images to be retrieved, where the target text includes various types of sub-information representing the target information; determine the target text input information corresponding to the target text; process the target text input information based on the text processing model in the pre-trained image-text mutual retrieval neural network model to obtain the target text processing result, where the text processing model is built based on self-supervised learning, which supervises the target text based on the associated information between the various types of sub-information; process the target image based on the image processing model in the image-text mutual retrieval neural network model to obtain the target image processing result; and determine, based on the target text processing result and the target image processing result, the image retrieval result of the target text in the target images and/or the text retrieval result of the target image in the target texts.
  • An image-text mutual retrieval device provided by an embodiment of the present application includes a memory 201 and a processor 202. A computer program is stored in the memory 201, and when the processor 202 executes the computer program, the following steps are implemented: determine the various types of sub-information in the target text; determine the position information corresponding to each sub-information; determine the first type of information corresponding to each sub-information; for each sub-information, convert the sub-information, the corresponding position information, and the first type of information into corresponding initial vector information, and use the sum of all the initial vector information as the first vector information of the sub-information; and determine the target text input information based on the first vector information.
  • An image-text mutual retrieval device provided by an embodiment of the present application includes a memory 201 and a processor 202. A computer program is stored in the memory 201, and when the processor 202 executes the computer program, the following steps are implemented: determine the second type of information of the target text; convert the second type of information into corresponding second vector information; and use the second vector information and the first vector information as the target text input information.
  • An image-text mutual retrieval device provided by an embodiment of the present application includes a memory 201 and a processor 202. A computer program is stored in the memory 201, and when the processor 202 executes the computer program, the following steps are implemented: the process of determining the corresponding weight values for self-supervised learning includes: for any two sub-information in the text processing model, determine the target sample in one of the sub-information; determine, in the other sub-information, the first type of sample paired with the target sample and the second type of sample not paired with the target sample; determine the first distance value between the target sample and the first type of sample and the second distance value between the target sample and the second type of sample; determine the loss value of self-supervised learning based on all the first distance values and second distance values; and determine the weight values of self-supervised learning based on the loss value.
  • An image-text mutual retrieval device provided by an embodiment of the present application includes a memory 201 and a processor 202. A computer program is stored in the memory 201, and when the processor 202 executes the computer program, the following steps are implemented: through the loss function of self-supervised learning, determine the loss value of self-supervised learning based on all first distance values and second distance values; in that loss function (rendered as an image in the source), b represents the self-supervised-learning batch, N represents the number of paired samples, and d represents the distance value.
  • An image-text mutual retrieval device provided by an embodiment of the present application includes a memory 201 and a processor 202. A computer program is stored in the memory 201, and when the processor 202 executes the computer program, the following is implemented: the text processing model includes a neural network model built based on the transformer model and self-supervised learning.
  • An image-text mutual retrieval device provided by an embodiment of the present application includes a memory 201 and a processor 202. A computer program is stored in the memory 201, and when the processor 202 executes the computer program, the following is implemented: the text processing model includes an input layer; a multi-head attention mechanism layer connected to the input layer; a first normalization layer connected to the input layer and the multi-head attention mechanism layer; a forward transfer layer connected to the first normalization layer; a second normalization layer connected to the forward transfer layer and the first normalization layer; a first fully connected layer, a first excitation layer, a second fully connected layer, and a self-supervised classification output layer connected in sequence to the second normalization layer; target fully connected layers connected to the second normalization layer, in one-to-one correspondence with the sub-information; a fourth fully connected layer connected to the second normalization layer; a fifth fully connected layer connected to the second normalization layer; a splicing layer connected to the first fully connected layer and all target fully connected layers; and a third fully connected layer connected to the splicing layer.
  • An image-text mutual retrieval device provided by an embodiment of the present application includes a memory 201 and a processor 202. A computer program is stored in the memory 201, and when the processor 202 executes the computer program, the following is implemented: the image processing model is built based on the attention mechanism.
  • An image-text mutual retrieval device provided by an embodiment of the present application includes a memory 201 and a processor 202. A computer program is stored in the memory 201, and when the processor 202 executes the computer program, the following is implemented: the image processing model includes a target number of image processing branches and a fourth fully connected layer connected to the image processing branches; the image processing branch includes an input layer, a backbone network connected to the input layer, a fifth fully connected layer connected to the backbone network, an attention mechanism layer connected to the fifth fully connected layer, a first normalization layer connected to the attention mechanism layer, a multiplier connected to the first normalization layer, an adder connected to the multiplier and the fifth fully connected layer, a Linear layer connected to the adder, and a BiLSTM layer connected to the Linear layer; where the first normalization layer in each image processing branch is the same, and the BiLSTM layers in the image processing branches are connected to each other.
  • An image-text mutual retrieval device provided by an embodiment of the present application includes a memory 201 and a processor 202. A computer program is stored in the memory 201, and when the processor 202 executes the computer program, the following is implemented: the attention mechanism layer includes a sixth fully connected layer connected to the fifth fully connected layer, a second excitation layer connected to the sixth fully connected layer, a seventh fully connected layer connected to the second excitation layer, and a second normalization layer connected to the seventh fully connected layer; the second normalization layer is connected to the first normalization layer.
  • An image-text mutual retrieval device provided by an embodiment of the present application includes a memory 201 and a processor 202. A computer program is stored in the memory 201, and when the processor 202 executes the computer program, the loss function described above (rendered as an image in the source) is applied.
  • An image-text mutual retrieval device provided by an embodiment of the present application includes a memory 201 and a processor 202. A computer program is stored in the memory 201, and when the processor 202 executes the computer program, the following is implemented: the backbone network includes a ResNet network.
  • Referring to Figure 8, another image-text mutual retrieval device provided by an embodiment of the present application may further include: an input port 203 connected to the processor 202 for transmitting externally input commands to the processor 202; a display unit 204 connected to the processor 202 for displaying the processing results of the processor 202 to the outside world; and a communication module 205 connected to the processor 202 for communication between the image-text mutual retrieval device and the outside world.
  • the display unit 204 can be a display panel, a laser scanning display, etc.; the communication methods used by the communication module 205 include but are not limited to mobile high-definition link technology (HML), universal serial bus (USB), high-definition multimedia interface (HDMI), and wireless connections such as wireless fidelity technology (WiFi), Bluetooth communication technology, low-power Bluetooth communication technology, and communication technology based on IEEE 802.11s.
  • An embodiment of the present application provides a non-volatile readable storage medium in which a computer program is stored; when the computer program is executed by a processor, the following steps are implemented: obtain a set of target texts and a set of target images to be retrieved, where the target text includes various types of sub-information representing the target information; determine the target text input information corresponding to the target text; process the target text input information based on the text processing model in the pre-trained image-text mutual retrieval neural network model to obtain the target text processing result, where the text processing model is built based on self-supervised learning, which supervises the target text based on the associated information between the various types of sub-information; process the target image based on the image processing model in the image-text mutual retrieval neural network model to obtain the target image processing result; and determine, based on the target text processing result and the target image processing result, the image retrieval result of the target text in the target images and/or the text retrieval result of the target image in the target texts.
  • An embodiment of the present application provides a computer non-volatile readable storage medium in which a computer program is stored; when the computer program is executed by a processor, the following steps are implemented: determine the various types of sub-information in the target text; determine the position information corresponding to each sub-information; determine the first type of information corresponding to each sub-information; for each sub-information, convert the sub-information, the corresponding position information, and the first type of information into corresponding initial vector information, and use the sum of all the initial vector information as the first vector information of the sub-information; and determine the target text input information based on the first vector information.
  • An embodiment of the present application provides a computer non-volatile readable storage medium in which a computer program is stored; when the computer program is executed by a processor, the following steps are implemented: determine the second type of information of the target text; convert the second type of information into corresponding second vector information; and use the second vector information and the first vector information as the target text input information.
  • An embodiment of the present application provides a non-volatile readable storage medium in which a computer program is stored; when the computer program is executed by a processor, the following steps are implemented: the process of determining the corresponding weight values for self-supervised learning includes: for any two sub-information in the text processing model, determine the target sample in one of the sub-information; determine, in the other sub-information, the first type of sample paired with the target sample and the second type of sample not paired with the target sample; determine the first distance value between the target sample and the first type of sample and the second distance value between the target sample and the second type of sample; determine the loss value of self-supervised learning based on all the first distance values and second distance values; and determine the weight values of self-supervised learning based on the loss value.
  • An embodiment of the present application provides a non-volatile readable storage medium in which a computer program is stored; when the computer program is executed by a processor, the following steps are implemented: through the loss function of self-supervised learning, determine the loss value of self-supervised learning based on all first distance values and second distance values; in that loss function (rendered as an image in the source), b represents the self-supervised-learning batch, N represents the number of paired samples, and d represents the distance value.
  • An embodiment of the present application provides a non-volatile readable storage medium.
  • a computer program is stored in the non-volatile readable storage medium.
  • the computer program is executed by a processor, the following steps are implemented:
  • the text processing model includes a transformer model based on Neural network model built by self-supervised learning.
  • An embodiment of the present application provides a non-volatile readable storage medium.
  • a computer program is stored in the non-volatile readable storage medium.
  • the text processing model includes an input layer; and The multi-head attention mechanism layer connected to the input layer; the first normalization layer connected to the input layer and the multi-head attention mechanism layer; the forward transmission layer connected to the normalization layer; the second normalization layer connected to the forward transmission layer and the first normalization layer Standardization layer; the first fully connected layer, the first excitation layer, the second fully connected layer, and the self-supervised classification output layer that are sequentially connected to the second normalization layer; and the targets corresponding to the sub-information that are connected to the second normalization layer.
  • Fully connected layer the fourth fully connected layer connected to the second normalized layer; the fifth fully connected layer connected to the second normalized layer; the splicing layer connected to the first fully connected layer and all target fully connected layers; and splicing The third fully connected layer of layer connection.
  • An embodiment of the present application provides a non-volatile readable storage medium.
  • a computer program is stored in the non-volatile readable storage medium.
  • the computer program is executed by the processor, the following steps are implemented:
  • the image processing model is built based on the attention mechanism. .
  • An embodiment of the present application provides a non-volatile readable storage medium.
  • a computer program is stored in the non-volatile readable storage medium.
  • the image processing model includes a target number of images. processing branch, and a fourth fully connected layer connected to the image processing branch;
  • the image processing branch includes an input layer, a backbone network connected to the input layer, a fifth fully connected layer connected to the backbone network, and a fifth fully connected layer connected to the fifth fully connected layer.
  • Attention mechanism layer the first normalization layer connected to the attention mechanism layer, the multiplier connected to the first normalization layer, the adder connected to the multiplier and the fifth fully connected layer, the adder connected to the adder Linear layer, a BiLSTM layer connected to the Linear layer; among them, the first normalization layer in each image processing branch is the same; and the BiLSTM layers in each image processing branch are connected to each other.
  • An embodiment of the present application provides a non-volatile readable storage medium.
  • a computer program is stored in the non-volatile readable storage medium.
  • the attention mechanism layer includes: and The sixth fully connected layer connected to the five fully connected layers, the second excitation layer connected to the sixth fully connected layer, the seventh fully connected layer connected to the second excitation layer, and the second normalized layer connected to the seventh fully connected layer normalization layer, and the second normalization layer is connected to the first normalization layer.
  • An embodiment of the present application provides a non-volatile readable storage medium.
  • a computer program is stored in the non-volatile readable storage medium.
  • the computer program is executed by a processor, the following steps are implemented:
  • the loss function includes:
  • An embodiment of the present application provides a non-volatile readable storage medium.
  • a computer program is stored in the non-volatile readable storage medium.
  • the backbone network includes a ResNet network.
  • non-volatile readable storage media involved in this application include random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROM, or any other form of storage media known in the technical field.
  • RAM random access memory
  • ROM read-only memory
  • electrically programmable ROM electrically erasable programmable ROM
  • registers hard disks, removable disks, CD-ROM, or any other form of storage media known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Library & Information Science (AREA)
  • Image Analysis (AREA)

Abstract

Provided are an image-text mutual retrieval method, system and device, and a non-volatile readable storage medium. A group of target texts and a group of target images to be retrieved are acquired, the target text comprising various kinds of sub-information characterizing target information (S101); target text input information corresponding to the target text is determined (S102); the target text input information is processed on the basis of a text processing model in an image-text mutual retrieval neural network model to obtain a target text processing result, wherein the text processing model is built on the basis of self-supervised learning, and the self-supervised learning is used for performing supervised learning on the target text on the basis of the association information between the various kinds of sub-information (S103); the target image is processed on the basis of an image processing model in the image-text mutual retrieval neural network model to obtain a target image processing result (S104); and on the basis of the target text processing result and the target image processing result, an image retrieval result of the target text among the target images is determined, and/or a text retrieval result of the target image among the target texts is determined (S105).

Description

Image-text mutual retrieval method, system and device, and non-volatile readable storage medium
Cross-Reference to Related Applications
This application claims priority to the Chinese patent application filed with the China Patent Office on July 12, 2022, with application number 202210812205.4 and entitled "Image-text mutual retrieval method, system, device, and computer-readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of data processing technology, and more specifically, to an image-text mutual retrieval method, system and device, and a non-volatile readable storage medium.
Background
In recent years, economic globalization has advanced steadily and science and technology have developed at an unprecedented pace; in particular, the wide adoption and application of computer information technology has brought great progress in digital processing. In the information age, image data enjoys the advantages of multimedia data while being closely tied to the related content of its field. Mutual retrieval between images and texts facilitates the fast dissemination and exchange of data and improves the efficiency and quality of data processing; clearly, the more accurate the image-text mutual retrieval, the better the efficiency and quality of the corresponding data processing. In summary, how to improve the accuracy of image-text mutual retrieval is a problem to be urgently solved by those skilled in the art.
Summary
The purpose of the present application is to provide an image-text mutual retrieval method that can, to a certain extent, solve the technical problem of how to improve the accuracy of image-text mutual retrieval. The present application also provides an image-text mutual retrieval system and device and a non-volatile readable storage medium. To achieve the above purpose, the present application provides the following technical solutions:
An image-text mutual retrieval method, comprising:
acquiring a group of target texts and a group of target images to be retrieved, the target text comprising various kinds of sub-information characterizing target information;
determining target text input information corresponding to the target text;
processing the target text input information based on a text processing model in a pre-trained image-text mutual retrieval neural network model to obtain a target text processing result; wherein the text processing model is built based on self-supervised learning, and the self-supervised learning is used to perform supervised learning on the target text based on the association information between the various kinds of sub-information;
processing the target image based on an image processing model in the image-text mutual retrieval neural network model to obtain a target image processing result;
determining, based on the target text processing result and the target image processing result, an image retrieval result of the target text among the target images, and/or a text retrieval result of the target image among the target texts.
In some embodiments, determining the target text input information corresponding to the target text comprises:
determining the various kinds of sub-information in the target text;
determining position information corresponding to each piece of sub-information;
determining first type information corresponding to each piece of sub-information;
for each piece of sub-information, converting the sub-information and its corresponding position information and first type information into corresponding initial vector information, and taking the sum of all the initial vector information as first vector information of the sub-information;
determining the target text input information based on the first vector information.
In some embodiments, determining the target text input information based on the first vector information comprises:
determining second type information of the target text;
converting the second type information into corresponding second vector information;
taking the second vector information and the first vector information as the target text input information.
In some embodiments, the process of determining the corresponding weight values of the self-supervised learning comprises:
for any two of the pieces of sub-information in the text processing model, determining a target sample in one of the two pieces of sub-information, determining, in the other piece of sub-information, first-type samples paired with the target sample and second-type samples not paired with the target sample, determining a first distance value between the target sample and the first-type sample, and determining a second distance value between the target sample and the second-type sample;
determining a loss value of the self-supervised learning based on all the first distance values and the second distance values;
determining the weight values of the self-supervised learning based on the loss value.
In some embodiments, determining the loss value of the self-supervised learning based on all the first distance values and the second distance values comprises:
determining the loss value of the self-supervised learning based on all the first distance values and the second distance values through a loss function of the self-supervised learning;
wherein the loss function of the self-supervised learning comprises:

$$loss_{i,j}^{b}=\sum_{a=1}^{N}\Big[\Delta_{1}+d\big(s_{a}^{i},\,s_{ap}^{j}\big)-\min\,d\big(s_{a}^{i},\,s_{anp}^{j}\big)\Big]_{+}$$

where $loss_{i,j}^{b}$ denotes the loss function value of the i-th piece of sub-information relative to the j-th piece of sub-information in the self-supervised learning, i=1,2…n, j=1,2…n, i≠j, and n denotes the total number of kinds of sub-information; b denotes the batch of the self-supervised learning; N denotes the number of paired samples; d denotes a distance value; $s_{a}^{i}$ denotes the a-th target sample selected in the i-th piece of sub-information; $s_{ap}^{j}$ denotes the first-type sample selected in the j-th piece of sub-information that is paired with $s_{a}^{i}$; $s_{anp}^{j}$ denotes the samples selected in the j-th piece of sub-information that are not paired with $s_{a}^{i}$; $\Delta_{1}$ denotes a preset hyperparameter; and min denotes taking the minimum value; wherein the sum of all the loss function values is the loss value of the self-supervised learning.
In some embodiments, the text processing model comprises a neural network model built based on a transformer model and the self-supervised learning.
In some embodiments, the text processing model comprises an input layer; a multi-head attention mechanism layer connected to the input layer; a first standardization layer connected to the input layer and the multi-head attention mechanism layer; a feed-forward layer connected to the standardization layer; a second standardization layer connected to the feed-forward layer and the first standardization layer; a first fully connected layer, a first excitation layer, a second fully connected layer and a self-supervised classification output layer connected in sequence to the second standardization layer; target fully connected layers connected to the second standardization layer in one-to-one correspondence with the sub-information; a fourth fully connected layer connected to the second standardization layer; a fifth fully connected layer connected to the second standardization layer; a concatenation layer connected to the first fully connected layer and all the target fully connected layers; and a third fully connected layer connected to the concatenation layer.
In some embodiments, the image processing model is built based on an attention mechanism.
In some embodiments, the image processing model comprises a target number of image processing branches and a fourth fully connected layer connected to the image processing branches; each image processing branch comprises an input layer, a backbone network connected to the input layer, a fifth fully connected layer connected to the backbone network, an attention mechanism layer connected to the fifth fully connected layer, a first normalization layer connected to the attention mechanism layer, a multiplier connected to the first normalization layer, an adder connected to the multiplier and the fifth fully connected layer, a Linear layer connected to the adder, and a BiLSTM layer connected to the Linear layer;
wherein the first normalization layer in each of the image processing branches is the same one; and the BiLSTM layers in the image processing branches are interconnected.
In some embodiments, the attention mechanism layer comprises: a sixth fully connected layer connected to the fifth fully connected layer, a second excitation layer connected to the sixth fully connected layer, a seventh fully connected layer connected to the second excitation layer, and a second normalization layer connected to the seventh fully connected layer, the second normalization layer being connected to the first normalization layer.
In some embodiments, the loss function in the image-text mutual retrieval neural network model comprises:

$$loss_{vt}^{b}=\sum_{a=1}^{M}\Big[\Delta_{2}+d\big(i_{a}^{b},\,s_{ap}^{b}\big)-\min\,d\big(i_{a}^{b},\,s_{np}\big)\Big]_{+}$$

where $loss_{vt}^{b}$ denotes the loss function value between the texts and images in batch b; M denotes the number of paired samples; $\Delta_{2}$ denotes a preset hyperparameter; $i_{a}^{b}$ denotes the a-th sample selected in the target image processing results of the target images; $s_{ap}^{b}$ denotes the sample selected in the target text processing results corresponding to the target texts that is paired with $i_{a}^{b}$; min denotes taking the minimum value; and $s_{np}$ denotes the samples selected in the target text feature processing results that are not paired with $i_{a}^{b}$.
In some embodiments, the backbone network comprises a ResNet network.
An image-text mutual retrieval system, comprising:
a first acquisition module, configured to acquire a group of target texts and a group of target images to be retrieved, the target text comprising various kinds of sub-information characterizing target information;
a first determination module, configured to determine target text input information corresponding to the target text;
a first processing module, configured to process the target text input information based on a text processing model in a pre-trained image-text mutual retrieval neural network model to obtain a target text processing result, wherein the text processing model is built based on self-supervised learning, and the self-supervised learning is used to perform supervised learning on the target text based on the association information between the various kinds of sub-information;
a second processing module, configured to process the target image based on an image processing model in the image-text mutual retrieval neural network model to obtain a target image processing result;
a second determination module, configured to determine, based on the target text processing result and the target image processing result, an image retrieval result of the target text among the target images and/or a text retrieval result of the target image among the target texts.
An image-text mutual retrieval device, comprising:
a memory, configured to store a computer program; and a processor, configured to implement the steps of any one of the above image-text mutual retrieval methods when executing the computer program.
A non-volatile readable storage medium, storing a computer program that, when executed by a processor, implements the steps of any one of the above image-text mutual retrieval methods.
The image-text mutual retrieval method provided by the present application acquires a group of target texts and a group of target images to be retrieved, the target text comprising various kinds of sub-information characterizing target information; determines target text input information corresponding to the target text; processes the target text input information based on a text processing model in a pre-trained image-text mutual retrieval neural network model to obtain a target text processing result, wherein the text processing model is built based on self-supervised learning and the self-supervised learning is used to perform supervised learning on the target text based on the association information between the various kinds of sub-information; processes the target image based on an image processing model in the image-text mutual retrieval neural network model to obtain a target image processing result; and determines, based on the target text processing result and the target image processing result, an image retrieval result of the target text among the target images and/or a text retrieval result of the target image among the target texts. In the present application, after the target texts and target images to be retrieved are acquired, the target text input information is processed based on the text processing model. Because the text processing model is built based on self-supervised learning, and the self-supervised learning performs supervised learning on the target text based on the association information between the various kinds of sub-information, the present application in effect obtains the target text processing result with the help of the association information between the sub-information. Since this association information can reflect the correlation between the various kinds of information in the target text, the text processing model can guarantee the processing accuracy of the target text and thereby the accuracy of image-text mutual retrieval. The image-text mutual retrieval system and device and the non-volatile readable storage medium provided by the present application also solve the corresponding technical problems.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a first flowchart of an image-text mutual retrieval method provided by an embodiment of the present application;
Fig. 2 is a second flowchart of an image-text mutual retrieval method provided by an embodiment of the present application;
Fig. 3 is a schematic structural diagram of the image-text mutual retrieval neural network model in the present application;
Fig. 4 is a schematic structural diagram of the attention mechanism layer;
Fig. 5 is a schematic diagram of the traversal of image and text features in the present application;
Fig. 6 is a schematic structural diagram of an image-text mutual retrieval system provided by an embodiment of the present application;
Fig. 7 is a schematic structural diagram of an image-text mutual retrieval device provided by an embodiment of the present application;
Fig. 8 is another schematic structural diagram of an image-text mutual retrieval device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some rather than all of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
Referring to Fig. 1, Fig. 1 is a first flowchart of an image-text mutual retrieval method provided by an embodiment of the present application. The image-text mutual retrieval method provided by the embodiment of the present application may comprise the following steps:
Step S101: acquire a group of target texts and a group of target images to be retrieved, the target text comprising various kinds of sub-information characterizing target information.
In practical applications, a group of target texts and a group of target images to be retrieved may first be acquired, so that the images corresponding to a target text can subsequently be determined among the group of target images. It should be noted that the number of acquired target texts and target images and the types of the target texts and target images may be determined according to actual needs; for example, the target texts and target images may be medical texts and medical images, server maintenance texts and server maintenance images, or dish preparation texts and preparation images, which is not specifically limited herein. It should also be noted that the target text in the present application contains various kinds of sub-information of the target information, each kind of sub-information reflecting the corresponding information of the target information at a certain level. Taking a dish preparation tutorial as the type of the target information as an example, the sub-information contained in the target text may be ingredient types, preparation steps, precautions and the like, which is not specifically limited herein.
Step S102: determine target text input information corresponding to the target text.
In practical applications, after the group of target texts and the group of target images to be retrieved are acquired, the target text input information corresponding to the target text can be determined, so that image-text mutual retrieval can subsequently be performed with the help of the target text input information.
Step S103: process the target text input information based on the text processing model in the pre-trained image-text mutual retrieval neural network model to obtain the target text processing result; wherein the text processing model is built based on self-supervised learning, and the self-supervised learning is used to perform supervised learning on the target text based on the association information between the various kinds of sub-information.
In practical applications, after the target text input information corresponding to the target text is determined, the target text input information can be processed based on the text processing model in the pre-trained image-text mutual retrieval neural network model to obtain the target text processing result. In the present application, the text processing model is built based on self-supervised learning, and the self-supervised learning is used to perform supervised learning on the target text based on the association information between the various kinds of sub-information; in other words, the present application in effect processes the target text based on the association information between the sub-information in the target text to obtain the corresponding target text processing result. It should be noted that the structure of the image-text mutual retrieval neural network model may be determined according to actual needs, which is not specifically limited herein. In addition, the training process of a neural network is divided into two stages: the first stage is the stage in which data propagates from lower levels to higher levels, i.e., the forward propagation stage; the other stage is the stage in which, when the result of the forward propagation does not match the expectation, the error is propagated from higher levels to lower levels for training, i.e., the back-propagation stage. The training process of the image-text mutual retrieval neural network may therefore be as follows: initialize the weight values of all network layers, generally with random initialization; forward-propagate the input image and text data through the graph neural network, convolutional layers, down-sampling layers, fully connected layers and other layers to obtain an output value; compute the output value of the network and the loss function value of that output value; propagate the error back into the network and obtain in turn the back-propagation errors of the network layers: the graph neural network layer, the fully connected layers, the convolutional layers and the other layers; have each network layer adjust all the weight coefficients in the network according to its back-propagation error, i.e., update the weights; randomly select a new batch of image-text data and return to the step of obtaining the output value by forward propagation; iterate repeatedly, and end the training when the error between the output value of the network and the target value (label) is smaller than a certain threshold, or when the number of iterations exceeds a certain threshold; and save the trained network parameters of all layers.
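For illustration only, the training procedure just described can be summarized as a short PyTorch-style loop. This is a minimal sketch rather than the patented implementation: `model`, `loader` and `total_loss` are hypothetical placeholders standing in for the image-text mutual retrieval neural network model, the batch sampler, and the loss functions defined later in this document.

```python
import torch

def train(model, loader, total_loss, epochs=100, lr=1e-4, tol=1e-3):
    # All layer weights are randomly initialized when the model is constructed.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):                      # stop when the iteration threshold is exceeded
        for texts, images in loader:             # a new randomly selected batch each step
            text_feat, img_feat = model(texts, images)   # forward propagation, low to high layers
            loss = total_loss(text_feat, img_feat)       # error between output and target (label)
            opt.zero_grad()
            loss.backward()                      # back-propagate the error through every layer
            opt.step()                           # update all weight coefficients
            if loss.item() < tol:                # stop when the error falls below a threshold
                torch.save(model.state_dict(), "itr_model.pt")  # save all trained layer parameters
                return
    torch.save(model.state_dict(), "itr_model.pt")
```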
Step S104: process the target image based on the image processing model in the image-text mutual retrieval neural network model to obtain the target image processing result.
In practical applications, after the target text input information is processed based on the text processing model in the pre-trained image-text mutual retrieval neural network model to obtain the target text processing result, the target image can be processed based on the image processing model in the image-text mutual retrieval neural network model to obtain the target image processing result, so that the corresponding image-text mutual retrieval result can subsequently be determined based on the target text processing result and the target image processing result.
Step S105: determine, based on the target text processing result and the target image processing result, the image retrieval result of the target text among the target images, and/or the text retrieval result of the target image among the target texts.
In practical applications, after the target image is processed based on the image processing model in the image-text mutual retrieval neural network model to obtain the target image processing result, the corresponding image-text mutual retrieval result can be determined based on the target text processing result and the target image processing result; specifically, the image retrieval result of the target text among the target images and/or the text retrieval result of the target image among the target texts can be determined, which is not specifically limited herein. It should be noted that the process by which the image-text mutual retrieval neural network model processes texts and images may be determined according to actual needs, which is not specifically limited herein. For example, the image-text mutual retrieval neural network model may extract features from texts or images and store the extracted features in a data set to be retrieved; receive arbitrary text data or image data given by the user as query data; extract the features of the text data or image data of the query data; and match the features of the query data against all sample features in the data set to be retrieved by distance, i.e., compute vector distances, for example the Euclidean distance. That is, if the query data is text data, distances are computed against all image features in the data set to be retrieved; likewise, if the query data is image data, Euclidean distances are computed against all text features in the data set to be retrieved. The sample with the smallest distance is the recommended sample and is output.
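The distance-matching step described above is simple enough to show directly. The sketch below assumes the query and candidate features have already been extracted by the model, and uses the Euclidean distance mentioned in the text; the function name is illustrative.

```python
import torch

def retrieve(query_feat: torch.Tensor, candidate_feats: torch.Tensor) -> int:
    # query_feat: (D,) feature of the query text or image;
    # candidate_feats: (N, D) features of the opposite modality in the data set to be retrieved.
    dists = torch.cdist(query_feat.unsqueeze(0), candidate_feats).squeeze(0)  # Euclidean distances
    return int(dists.argmin())  # the sample with the smallest distance is the recommended sample
```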
The image-text mutual retrieval method provided by the present application acquires a group of target texts and a group of target images to be retrieved, the target text comprising various kinds of sub-information characterizing target information; determines target text input information corresponding to the target text; processes the target text input information based on the text processing model in the pre-trained image-text mutual retrieval neural network model to obtain the target text processing result, wherein the text processing model is built based on self-supervised learning and the self-supervised learning is used to perform supervised learning on the target text based on the association information between the various kinds of sub-information; processes the target image based on the image processing model in the image-text mutual retrieval neural network model to obtain the target image processing result; and determines, based on the target text processing result and the target image processing result, the image retrieval result of the target text among the target images and/or the text retrieval result of the target image among the target texts. In the present application, after the target texts and target images to be retrieved are acquired, the target text input information is processed based on the text processing model. Because the text processing model is built based on self-supervised learning, and the self-supervised learning performs supervised learning on the target text based on the association information between the various kinds of sub-information, the present application in effect obtains the target text processing result with the help of the association information between the sub-information. Since this association information can reflect the correlation between the various kinds of information in the target text, the text processing model can guarantee the processing accuracy of the target text and thereby the accuracy of image-text mutual retrieval.
Referring to Fig. 2, Fig. 2 is a second flowchart of an image-text mutual retrieval method provided by an embodiment of the present application.
The image-text mutual retrieval method provided by the embodiment of the present application may comprise the following steps:
Step S201: acquire a group of target texts and a group of target images to be retrieved, the target text comprising various kinds of sub-information characterizing target information.
Step S202: determine the various kinds of sub-information in the target text.
In practical applications, in the process of determining the target text input information corresponding to the target text, in order for the target text input information to better reflect the information features in the target text, the various kinds of sub-information in the target text may be determined, so as to classify the information in the target text with the help of the sub-information; the corresponding target text input information is then determined based on the sub-information in the target text. It should be noted that the types and number of the sub-information may be determined according to actual needs, which is not specifically limited herein.
Step S203: determine the position information corresponding to each piece of sub-information.
Step S204: determine the first type information corresponding to each piece of sub-information.
Step S205: for each piece of sub-information, convert the sub-information and its corresponding position information and first type information into corresponding initial vector information, and take the sum of all the initial vector information as the first vector information of the sub-information.
Step S206: determine the target text input information based on the first vector information.
In practical applications, after the various kinds of sub-information in the target text are determined, the position information corresponding to each piece of sub-information can be determined, for example by taking the position of the sub-information in the target text as its corresponding position information, or by taking the order of appearance of the sub-information in the target text as its corresponding position information; the first type information corresponding to each piece of sub-information is determined, so that the type of the sub-information is characterized with the help of the first type information; for each piece of sub-information, the sub-information and its corresponding position information and first type information are converted into corresponding initial vector information, and the sum of all the initial vector information is taken as the first vector information of the sub-information, for example by converting the sub-information and its corresponding position information and first type information into corresponding initial vector information based on the word2vec tool and then taking the sum of all the initial vector information as the first vector information of the sub-information; finally, the target text input information is determined based on the first vector information.
In specific application scenarios, in the process of determining the target text input information based on the first vector information, the second type information of the target text may be determined, so that the type of the target text is characterized with the help of the second type information; the second type information is converted into corresponding second vector information; and the second vector information and the first vector information are taken as the target text input information, as in the sketch below.
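A minimal sketch of this input construction follows; the vocabulary sizes and embedding dimension are illustrative assumptions, and the embedding tables stand in for the word2vec-style conversion described above.

```python
import torch
import torch.nn as nn

class TextInputBuilder(nn.Module):
    def __init__(self, vocab=30000, n_pos=512, n_types=16, dim=512):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)    # sub-information content
        self.pos = nn.Embedding(n_pos, dim)    # position information of the sub-information
        self.typ = nn.Embedding(n_types, dim)  # first type information of the sub-information
        self.cls = nn.Embedding(n_types, dim)  # second type information of the whole text

    def forward(self, token_ids, pos_ids, type_ids, text_type_id):
        # The sum of the three initial vectors is the first vector information of each token.
        first_vecs = self.tok(token_ids) + self.pos(pos_ids) + self.typ(type_ids)
        # The second vector information is prepended as a CLS-style vector.
        second_vec = self.cls(text_type_id).unsqueeze(1)
        return torch.cat([second_vec, first_vecs], dim=1)  # target text input information
```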
In specific application scenarios, the process of determining the corresponding weight values of the self-supervised learning may comprise: for any two pieces of sub-information in the text processing model, determining a target sample in one of them and, in the other, first-type samples paired with the target sample and second-type samples not paired with the target sample; determining the first distance value between the target sample and the first-type sample and the second distance value between the target sample and the second-type sample; determining the loss value of the self-supervised learning based on all the first distance values and second distance values; and determining the weight values of the self-supervised learning based on the loss value.
In specific application scenarios, in the process of determining the loss value of the self-supervised learning based on all the first distance values and second distance values, the loss value of the self-supervised learning may be determined based on all the first distance values and second distance values through the loss function of the self-supervised learning, wherein the loss function of the self-supervised learning may comprise:

$$loss_{i,j}^{b}=\sum_{a=1}^{N}\Big[\Delta_{1}+d\big(s_{a}^{i},\,s_{ap}^{j}\big)-\min\,d\big(s_{a}^{i},\,s_{anp}^{j}\big)\Big]_{+}$$

where $loss_{i,j}^{b}$ denotes the loss function value of the i-th piece of sub-information relative to the j-th piece of sub-information in the self-supervised learning, i=1,2…n, j=1,2…n, i≠j, and n denotes the total number of kinds of sub-information; b denotes the batch of the self-supervised learning; N denotes the number of paired samples; d denotes a distance value; $s_{a}^{i}$ denotes the a-th target sample selected in the i-th piece of sub-information; $s_{ap}^{j}$ denotes the first-type sample selected in the j-th piece of sub-information that is paired with $s_{a}^{i}$; $s_{anp}^{j}$ denotes the samples selected in the j-th piece of sub-information that are not paired with $s_{a}^{i}$; $\Delta_{1}$ denotes a preset hyperparameter; and min denotes taking the minimum value; wherein the sum of all the loss function values is the loss value of the self-supervised learning. For ease of understanding, suppose the number of kinds of sub-information is 3; then i=1,2,3 and j=1,2,3, and the loss function values of the self-supervised learning comprise $loss_{1,2}^{b}$, $loss_{1,3}^{b}$, $loss_{2,1}^{b}$, $loss_{2,3}^{b}$, $loss_{3,1}^{b}$ and $loss_{3,2}^{b}$. In specific application scenarios, when the loss function values are applied for self-supervised training, the sum of all the loss function values may be taken as the final loss value of the self-supervised learning for training, which is not specifically limited herein. It should be noted that self-supervised learning means that the ground truth used for machine learning comes from the data itself rather than from manual annotation; in the present application, the features of the various kinds of sub-information serve as labels for one another — for example, the encoding of the first text feature and the encoding of the second text feature are labels for each other and learn from each other without human involvement — hence the name self-supervised learning.
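The following sketch implements one $loss_{i,j}^{b}$ term under the reading given above (the published formula itself is only available as an image in the source): a hinge of $\Delta_1$ plus the anchor-positive distance minus the smallest anchor-unpaired distance, summed over the N pairs of the batch. The function name and the default margin are illustrative.

```python
import torch

def self_supervised_loss(feat_i: torch.Tensor, feat_j: torch.Tensor, delta1: float = 0.3) -> torch.Tensor:
    # feat_i, feat_j: (N, D) features of sub-information i and j; row a of each forms a pair.
    d = torch.cdist(feat_i, feat_j)                           # all distances d(s_a^i, s^j)
    pos = d.diagonal()                                        # d(target sample, paired first-type sample)
    masked = d + torch.eye(d.size(0), device=d.device) * 1e9  # exclude the paired entries
    hardest = masked.min(dim=1).values                        # min over unpaired (second-type) samples
    return torch.clamp(delta1 + pos - hardest, min=0).sum()
```

Summing this term over all ordered pairs (i, j) of sub-informations then gives the loss value used to determine the weight values.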
In specific application scenarios, the text processing model may comprise a neural network model built based on a transformer model and self-supervised learning.
In specific application scenarios, referring to Fig. 3, the text processing model may comprise an input layer; a multi-head attention mechanism layer (Masked Multihead Attention) connected to the input layer; a first standardization layer (Add+Normalization) connected to the input layer and the multi-head attention mechanism layer; a feed-forward layer (Feed Forward) connected to the standardization layer; a second standardization layer connected to the feed-forward layer and the first standardization layer; a first fully connected layer (FC), a first excitation layer (ReLU), a second fully connected layer and a self-supervised classification output layer connected in sequence to the second standardization layer; a fourth fully connected layer connected to the second standardization layer; a fifth fully connected layer connected to the second standardization layer; target fully connected layers connected to the second standardization layer in one-to-one correspondence with the sub-information; a concatenation layer connected to the first fully connected layer and all the target fully connected layers; and a third fully connected layer connected to the concatenation layer. It should be noted that the first text information, second text information and third text information in Fig. 3 are the corresponding sub-information in the target text; moreover, the output of the third fully connected layer is the processing result of the text processing model for the target text.
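For orientation, the wiring of this text branch might look roughly as follows. This is a deliberately simplified sketch: the attention masking and the fourth and fifth fully connected layers are omitted, and all layer sizes are assumptions, since the source does not specify them.

```python
import torch
import torch.nn as nn

class TextModel(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_subinfo=3, n_classes=100):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)  # multi-head attention layer
        self.norm1 = nn.LayerNorm(dim)                                     # first standardization layer
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)                                     # second standardization layer
        self.fc1 = nn.Linear(dim, dim)                                     # first fully connected layer
        self.cls_tail = nn.Sequential(nn.ReLU(), nn.Linear(dim, n_classes))  # first excitation + second FC
        self.target_fcs = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_subinfo))
        self.fc3 = nn.Linear(dim * (1 + n_subinfo), dim)                   # third FC after the concat layer

    def forward(self, x, sub_positions):
        # x: (B, L, dim) target text input information; position 0 holds the CLS-style vector.
        h, _ = self.attn(x, x, x)
        h = self.norm1(x + h)                      # Add + Normalization
        h = self.norm2(h + self.ffn(h))            # Feed Forward, then Add + Normalization
        f1 = self.fc1(h[:, 0])                     # feature at the CLS output position
        cla = self.cls_tail(f1)                    # self-supervised classification output
        subs = [fc(h[:, p]) for fc, p in zip(self.target_fcs, sub_positions)]
        text_feat = self.fc3(torch.cat([f1] + subs, dim=-1))  # concatenation layer -> third FC
        return text_feat, cla, subs
```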
In specific application scenarios, the output features at the output position corresponding to the CLS of the transformer, where the CLS is the second type information of the target text, may be extracted for active-learning classification. For example, before training starts, taking diagnosis data as the target text as an example, the data of all the first text information may be read to generate a diagnosis result list. The diagnosis result list is merged by category, i.e., data with the same diagnosis result are merged into one piece of data, and the number of merged pieces is counted. The output feature corresponding to the CLS of the transformer is then extracted; this feature first passes through a fully connected layer FC, is then non-linearly mapped by ReLU, and finally passes through another fully connected layer FC. This feature is named cla, and cla is used to compute the diagnosis result classification loss. The computation is as follows:
Extract the CLS feature of the medical text, and compute, between the cla feature and its corresponding label, the BCELoss (binary cross-entropy loss) used for multi-target classification, whose formula is as follows:

$$loss_{cla}=-\frac{1}{K}\sum_{k=1}^{K}\Big[label_{k}\cdot\ln\big(\mathrm{sigmoid}(cla_{k})\big)+\big(1-label_{k}\big)\cdot\ln\big(1-\mathrm{sigmoid}(cla_{k})\big)\Big]$$

where $loss_{cla}$ denotes the loss function value corresponding to the transformer; K denotes the dimension of cla and label; sigmoid and ln denote operation functions; $label_{k}$ denotes the element at the k-th position of label, and $cla_{k}$ denotes the element at the k-th position of cla.
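In PyTorch this loss corresponds to the built-in binary cross-entropy with logits; a one-line sketch (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def cla_loss(cla: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    # cla: (K,) raw head outputs; label: (K,) multi-hot diagnosis labels.
    return F.binary_cross_entropy_with_logits(cla, label)  # sigmoid + BCE, averaged over the K positions
```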
Step S207: process the target text input information based on the text processing model in the pre-trained image-text mutual retrieval neural network model to obtain the target text processing result; wherein the text processing model is built based on self-supervised learning, and the self-supervised learning is used to perform supervised learning on the target text based on the association information between the various kinds of sub-information.
Step S208: process the target image based on the image processing model in the image-text mutual retrieval neural network model to obtain the target image processing result.
Step S209: determine, based on the target text processing result and the target image processing result, the image retrieval result of the target text among the target images, and/or the text retrieval result of the target image among the target texts.
In practical applications, referring to Fig. 3, the image-text mutual retrieval neural network model may comprise an image processing model built based on an attention mechanism, and the image processing model is used to process the target images.
In specific application scenarios, the image processing model may comprise a target number of image processing branches and a fourth fully connected layer connected to the image processing branches; each image processing branch comprises an input layer, a backbone network connected to the input layer, a fifth fully connected layer connected to the backbone network, an attention mechanism layer connected to the fifth fully connected layer, a first normalization layer connected to the attention mechanism layer, a multiplier connected to the first normalization layer, an adder connected to the multiplier and the fifth fully connected layer, a Linear layer connected to the adder, and a BiLSTM layer connected to the Linear layer; wherein the first normalization layer in each image processing branch is the same one, and the BiLSTM layers in the image processing branches are interconnected. It should be noted that the type of the backbone network may be determined according to actual needs; for example, the backbone network may be a ResNet backbone network. In addition, in the image-text mutual retrieval neural network model, the text processing model and the image processing model may be connected through an output layer, a loss layer and the like; for example, in Fig. 3 the text processing model and the image processing model are connected through a Generalized Pairwise Hinge-loss layer, which is not specifically limited herein.
In the image processing model, the image features are input into the BiLSTM network to obtain the overall features of the whole image group. The formulas are as follows:

$$\overrightarrow{h_{i}}=\overrightarrow{\mathrm{BiLSTM}}\big(\phi_{att}(x_{i}),\,\overrightarrow{h_{i-1}}\big),\qquad \overleftarrow{h_{i}}=\overleftarrow{\mathrm{BiLSTM}}\big(\phi_{att}(x_{i}),\,\overleftarrow{h_{i+1}}\big)$$

As recorded above, the images also come in both forward and reverse order, and both orders carry implicit temporal semantic information, which is encoded with the above formulas, where BiLSTM represents each unit of the BiLSTM network, → denotes the forward order and ← denotes the reverse order; $h_{i}$ represents the output of the i-th BiLSTM unit; $x_{i}$ represents the image input feature, i denotes the i-th image, and $\phi_{att}()$ represents the backbone network of the present application. The average of the feature encoding outputs of the BiLSTM units is taken as the output feature of the whole medical image group, as follows:

$$e_{csi}=\frac{1}{K}\sum_{i=1}^{K}h_{i}$$

where K denotes the number of BiLSTM units, and $e_{csi}$ represents the output of the image group features, which is used for the next retrieval step.
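A minimal sketch of this aggregation, with assumed dimensions; a bidirectional nn.LSTM plays the role of the forward and reverse BiLSTM units:

```python
import torch
import torch.nn as nn

class GroupEncoder(nn.Module):
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.bilstm = nn.LSTM(dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, img_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, K, dim) per-image features from the backbone and attention stage.
        h, _ = self.bilstm(img_feats)   # (B, K, 2*hidden): forward and reverse unit outputs
        return h.mean(dim=1)            # e_csi: average of the BiLSTM unit outputs
```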
In specific application scenarios, referring to Fig. 4, the attention mechanism layer in the present application comprises: a sixth fully connected layer connected to the fifth fully connected layer, a second excitation layer connected to the sixth fully connected layer, a seventh fully connected layer connected to the second excitation layer, and a second normalization layer connected to the seventh fully connected layer, the second normalization layer being connected to the first normalization layer.
In the present application, the image features pass through the backbone network to obtain embedded features, and the embedded features pass through a fully connected layer to obtain the final embedded feature e of each image. The final embedded feature e passes through the attention structure, which computes a weight for each feature; this weight is a single number and is normalized by a sigmoid layer. The weights of the features of all the images then jointly enter a softmax layer to determine which image is important. Finally, the feature weight of each image after the softmax layer is multiplied by the corresponding final embedded feature e of that image. Meanwhile, the idea of residual networks is introduced: for each medical image, the output of its attention structure is given by the following formula:

$$e_{i}^{att}=e_{i}+w_{i}\cdot e_{i}$$

where $w_{i}$ is the softmax-normalized attention weight of the i-th image. Finally, the image feature $e_{i}^{att}$ passes through the Linear fully connected layer FC to obtain the final image feature.
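A minimal sketch of this attention structure follows; the layer widths are assumptions, and the sigmoid/softmax placement mirrors the description above.

```python
import torch
import torch.nn as nn

class ImageAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, dim),   # sixth fully connected layer
            nn.ReLU(),             # second excitation layer
            nn.Linear(dim, 1),     # seventh fully connected layer: one number per image
            nn.Sigmoid(),          # second normalization layer
        )

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # e: (B, K, dim) final embedded features of the K images in a group.
        w = torch.softmax(self.score(e), dim=1)  # first normalization layer, across the K images
        return e + w * e                         # multiplier and adder: residual attention output
```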
In specific application scenarios, the loss function characterizing the accuracy of image-text mutual retrieval in the image-text mutual retrieval neural network model may comprise:

$$loss_{vt}^{b}=\sum_{a=1}^{M}\Big[\Delta_{2}+d\big(i_{a}^{b},\,s_{ap}^{b}\big)-\min\,d\big(i_{a}^{b},\,s_{np}\big)\Big]_{+}$$

where $loss_{vt}^{b}$ denotes the loss function value between the texts and images in batch b; M denotes the number of paired samples; $\Delta_{2}$ denotes a preset hyperparameter; $i_{a}^{b}$ denotes the a-th sample selected in the target image processing results of the target images; $s_{ap}^{b}$ denotes the sample selected in the target text processing results corresponding to the target texts that is paired with $i_{a}^{b}$; min denotes taking the minimum value; and $s_{np}$ denotes the samples selected in the target text feature processing results that are not paired with $i_{a}^{b}$.
It should be noted that, as shown in Fig. 5, the data in the present application come in pairs: one text feature encoding corresponds to one image group feature encoding, i.e., one image group corresponds to one text. In the design of the loss function, for such paired data, each image group feature encoding and each text feature encoding can be traversed and the average of the loss function taken, as shown in the above formula. The traversal is performed M times in total, M meaning that there are M paired samples in this batch. The image group features $i^{b}$ are traversed first (M in total), and the one selected by the traversal is denoted $i_{a}^{b}$, where a stands for anchor (the anchor sample). The text feature encoding paired with the anchor sample is denoted $s_{ap}^{b}$, where p stands for positive. Likewise, all the remaining samples in this batch that are not paired with $i_{a}^{b}$ are denoted $s_{np}$. $\Delta_{2}$ is a hyperparameter that is fixed during training and may be set to 0.4 or the like. Likewise, the same traversal is performed for the text features: $s_{a}^{b}$ denotes the sample selected in the traversal, the positive image group feature sample corresponding to it is denoted $i_{ap}^{b}$, and the non-corresponding ones are denoted $s_{np}$. With the above loss function, gradients are back-propagated during training to update the parameters of the cascaded transformer, BiLSTM and ResNet networks. In addition, the total loss function of the image-text mutual retrieval neural network model may be the sum of all the loss functions or the like, which is not specifically limited herein. A sketch of this loss follows.
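A minimal sketch of this symmetric hinge loss is given below; since the published formula is only available as an image, the code follows the traversal just described, with $\Delta_2$ set to the example value 0.4. The function name is illustrative.

```python
import torch

def retrieval_loss(img_feats: torch.Tensor, txt_feats: torch.Tensor, delta2: float = 0.4) -> torch.Tensor:
    # img_feats, txt_feats: (M, D); row a of each modality forms a positive pair.
    d = torch.cdist(img_feats, txt_feats)                     # d(anchor, candidate) for all pairs
    pos = d.diagonal()                                        # distances of the paired samples
    masked = d + torch.eye(d.size(0), device=d.device) * 1e9  # exclude the paired entries
    img2txt = torch.clamp(delta2 + pos - masked.min(dim=1).values, min=0).mean()  # image anchors
    txt2img = torch.clamp(delta2 + pos - masked.min(dim=0).values, min=0).mean()  # text anchors
    return img2txt + txt2img  # gradients update the cascaded transformer, BiLSTM and ResNet
```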
Referring to Fig. 6, Fig. 6 is a schematic structural diagram of an image-text mutual retrieval system provided by an embodiment of the present application.
An image-text mutual retrieval system provided by an embodiment of the present application may comprise:
a first acquisition module, configured to acquire a group of target texts and a group of target images to be retrieved, the target text comprising various kinds of sub-information characterizing target information;
a first determination module, configured to determine target text input information corresponding to the target text;
a first processing module, configured to process the target text input information based on a text processing model in a pre-trained image-text mutual retrieval neural network model to obtain a target text processing result, wherein the text processing model is built based on self-supervised learning, and the self-supervised learning is used to perform supervised learning on the target text based on the association information between the various kinds of sub-information;
a second processing module, configured to process the target image based on an image processing model in the image-text mutual retrieval neural network model to obtain a target image processing result;
a second determination module, configured to determine, based on the target text processing result and the target image processing result, an image retrieval result of the target text among the target images and/or a text retrieval result of the target image among the target texts.
In an image-text mutual retrieval system provided by an embodiment of the present application, the first determination module may comprise:
a first determination unit, configured to determine the various kinds of sub-information in the target text;
a second determination unit, configured to determine the position information corresponding to each piece of sub-information;
a third determination unit, configured to determine the first type information corresponding to each piece of sub-information;
a first conversion unit, configured to, for each piece of sub-information, convert the sub-information and its corresponding position information and first type information into corresponding initial vector information, and take the sum of all the initial vector information as the first vector information of the sub-information;
a fourth determination unit, configured to determine the target text input information based on the first vector information.
In an image-text mutual retrieval system provided by an embodiment of the present application, the fourth determination unit may be specifically configured to: determine the second type information of the target text; convert the second type information into corresponding second vector information; and take the second vector information and the first vector information as the target text input information.
In an image-text mutual retrieval system provided by an embodiment of the present application, the process of determining the corresponding weight values of the self-supervised learning comprises: for any two pieces of sub-information in the text processing model, determining a target sample in one of them and, in the other, first-type samples paired with the target sample and second-type samples not paired with the target sample; determining the first distance value between the target sample and the first-type sample and the second distance value between the target sample and the second-type sample; determining the loss value of the self-supervised learning based on all the first distance values and second distance values; and determining the weight values of the self-supervised learning based on the loss value.
In an image-text mutual retrieval system provided by an embodiment of the present application, determining the loss value of the self-supervised learning based on all the first distance values and second distance values comprises:
determining the loss value of the self-supervised learning based on all the first distance values and second distance values through the loss function of the self-supervised learning, wherein the loss function of the self-supervised learning comprises:

$$loss_{i,j}^{b}=\sum_{a=1}^{N}\Big[\Delta_{1}+d\big(s_{a}^{i},\,s_{ap}^{j}\big)-\min\,d\big(s_{a}^{i},\,s_{anp}^{j}\big)\Big]_{+}$$

where $loss_{i,j}^{b}$ denotes the loss function value of the i-th piece of sub-information relative to the j-th piece of sub-information in the self-supervised learning, i=1,2…n, j=1,2…n, i≠j, and n denotes the total number of kinds of sub-information; b denotes the batch of the self-supervised learning; N denotes the number of paired samples; d denotes a distance value; $s_{a}^{i}$ denotes the a-th target sample selected in the i-th piece of sub-information; $s_{ap}^{j}$ denotes the first-type sample selected in the j-th piece of sub-information that is paired with $s_{a}^{i}$; $s_{anp}^{j}$ denotes the samples selected in the j-th piece of sub-information that are not paired with $s_{a}^{i}$; $\Delta_{1}$ denotes a preset hyperparameter; and min denotes taking the minimum value; wherein the sum of all the loss function values is the loss value of the self-supervised learning.
In an image-text mutual retrieval system provided by an embodiment of the present application, the text processing model comprises a neural network model built based on a transformer model and self-supervised learning.
In an image-text mutual retrieval system provided by an embodiment of the present application, the text processing model comprises an input layer; a multi-head attention mechanism layer connected to the input layer; a first standardization layer connected to the input layer and the multi-head attention mechanism layer; a feed-forward layer connected to the standardization layer; a second standardization layer connected to the feed-forward layer and the first standardization layer; a first fully connected layer, a first excitation layer, a second fully connected layer and a self-supervised classification output layer connected in sequence to the second standardization layer; target fully connected layers connected to the second standardization layer in one-to-one correspondence with the sub-information; a fourth fully connected layer connected to the second standardization layer; a fifth fully connected layer connected to the second standardization layer; a concatenation layer connected to the first fully connected layer and all the target fully connected layers; and a third fully connected layer connected to the concatenation layer.
In an image-text mutual retrieval system provided by an embodiment of the present application, the image processing model is built based on an attention mechanism.
In an image-text mutual retrieval system provided by an embodiment of the present application, the image processing model comprises a target number of image processing branches and a fourth fully connected layer connected to the image processing branches; each image processing branch comprises an input layer, a backbone network connected to the input layer, a fifth fully connected layer connected to the backbone network, an attention mechanism layer connected to the fifth fully connected layer, a first normalization layer connected to the attention mechanism layer, a multiplier connected to the first normalization layer, an adder connected to the multiplier and the fifth fully connected layer, a Linear layer connected to the adder, and a BiLSTM layer connected to the Linear layer;
wherein the first normalization layer in each image processing branch is the same one, and the BiLSTM layers in the image processing branches are interconnected.
In an image-text mutual retrieval system provided by an embodiment of the present application, the attention mechanism layer comprises: a sixth fully connected layer connected to the fifth fully connected layer, a second excitation layer connected to the sixth fully connected layer, a seventh fully connected layer connected to the second excitation layer, and a second normalization layer connected to the seventh fully connected layer, the second normalization layer being connected to the first normalization layer.
In an image-text mutual retrieval system provided by an embodiment of the present application, the loss function in the image-text mutual retrieval neural network model comprises:

$$loss_{vt}^{b}=\sum_{a=1}^{M}\Big[\Delta_{2}+d\big(i_{a}^{b},\,s_{ap}^{b}\big)-\min\,d\big(i_{a}^{b},\,s_{np}\big)\Big]_{+}$$

where $loss_{vt}^{b}$ denotes the loss function value between the texts and images in batch b; M denotes the number of paired samples; $\Delta_{2}$ denotes a preset hyperparameter; $i_{a}^{b}$ denotes the a-th sample selected in the target image processing results of the target images; $s_{ap}^{b}$ denotes the sample selected in the target text processing results corresponding to the target texts that is paired with $i_{a}^{b}$; min denotes taking the minimum value; and $s_{np}$ denotes the samples selected in the target text feature processing results that are not paired with $i_{a}^{b}$.
In an image-text mutual retrieval system provided by an embodiment of the present application, the backbone network comprises a ResNet network.
The present application also provides an image-text mutual retrieval device and a non-volatile readable storage medium, both of which have the corresponding effects of the image-text mutual retrieval method provided by the embodiments of the present application. Referring to Fig. 7, Fig. 7 is a schematic structural diagram of an image-text mutual retrieval device provided by an embodiment of the present application.
An image-text mutual retrieval device provided by an embodiment of the present application comprises a memory 201 and a processor 202; a computer program is stored in the memory 201, and the processor 202 implements the following steps when executing the computer program:
acquiring a group of target texts and a group of target images to be retrieved, the target text comprising various kinds of sub-information characterizing target information;
determining target text input information corresponding to the target text;
processing the target text input information based on a text processing model in a pre-trained image-text mutual retrieval neural network model to obtain a target text processing result; wherein the text processing model is built based on self-supervised learning, and the self-supervised learning is used to perform supervised learning on the target text based on the association information between the various kinds of sub-information;
processing the target image based on an image processing model in the image-text mutual retrieval neural network model to obtain a target image processing result;
determining, based on the target text processing result and the target image processing result, an image retrieval result of the target text among the target images, and/or a text retrieval result of the target image among the target texts.
An image-text mutual retrieval device provided by an embodiment of the present application comprises a memory 201 and a processor 202; a computer program is stored in the memory 201, and the processor 202 implements the following steps when executing the computer program: determining the various kinds of sub-information in the target text; determining the position information corresponding to each piece of sub-information; determining the first type information corresponding to each piece of sub-information; for each piece of sub-information, converting the sub-information and its corresponding position information and first type information into corresponding initial vector information and taking the sum of all the initial vector information as the first vector information of the sub-information; and determining the target text input information based on the first vector information.
An image-text mutual retrieval device provided by an embodiment of the present application comprises a memory 201 and a processor 202; a computer program is stored in the memory 201, and the processor 202 implements the following steps when executing the computer program: determining the second type information of the target text; converting the second type information into corresponding second vector information; and taking the second vector information and the first vector information as the target text input information.
An image-text mutual retrieval device provided by an embodiment of the present application comprises a memory 201 and a processor 202; a computer program is stored in the memory 201, and the processor 202 implements the following steps when executing the computer program: the process of determining the corresponding weight values of the self-supervised learning comprises: for any two pieces of sub-information in the text processing model, determining a target sample in one of them and, in the other, first-type samples paired with the target sample and second-type samples not paired with the target sample; determining the first distance value between the target sample and the first-type sample and the second distance value between the target sample and the second-type sample; determining the loss value of the self-supervised learning based on all the first distance values and second distance values; and determining the weight values of the self-supervised learning based on the loss value.
An image-text mutual retrieval device provided by an embodiment of the present application comprises a memory 201 and a processor 202; a computer program is stored in the memory 201, and the processor 202 implements the following steps when executing the computer program: determining the loss value of the self-supervised learning based on all the first distance values and second distance values through the loss function of the self-supervised learning, wherein the loss function of the self-supervised learning comprises:

$$loss_{i,j}^{b}=\sum_{a=1}^{N}\Big[\Delta_{1}+d\big(s_{a}^{i},\,s_{ap}^{j}\big)-\min\,d\big(s_{a}^{i},\,s_{anp}^{j}\big)\Big]_{+}$$

where $loss_{i,j}^{b}$ denotes the loss function value of the i-th piece of sub-information relative to the j-th piece of sub-information in the self-supervised learning, i=1,2…n, j=1,2…n, i≠j, and n denotes the total number of kinds of sub-information; b denotes the batch of the self-supervised learning; N denotes the number of paired samples; d denotes a distance value; $s_{a}^{i}$ denotes the a-th target sample selected in the i-th piece of sub-information; $s_{ap}^{j}$ denotes the first-type sample selected in the j-th piece of sub-information that is paired with $s_{a}^{i}$; $s_{anp}^{j}$ denotes the samples selected in the j-th piece of sub-information that are not paired with $s_{a}^{i}$; $\Delta_{1}$ denotes a preset hyperparameter; and min denotes taking the minimum value; wherein the sum of all the loss function values is the loss value of the self-supervised learning.
An image-text mutual retrieval device provided by an embodiment of the present application comprises a memory 201 and a processor 202; a computer program is stored in the memory 201, and the processor 202 implements the following steps when executing the computer program: the text processing model comprises a neural network model built based on a transformer model and self-supervised learning.
An image-text mutual retrieval device provided by an embodiment of the present application comprises a memory 201 and a processor 202; a computer program is stored in the memory 201, and the processor 202 implements the following steps when executing the computer program: the text processing model comprises an input layer; a multi-head attention mechanism layer connected to the input layer; a first standardization layer connected to the input layer and the multi-head attention mechanism layer; a feed-forward layer connected to the standardization layer; a second standardization layer connected to the feed-forward layer and the first standardization layer; a first fully connected layer, a first excitation layer, a second fully connected layer and a self-supervised classification output layer connected in sequence to the second standardization layer; target fully connected layers connected to the second standardization layer in one-to-one correspondence with the sub-information; a fourth fully connected layer connected to the second standardization layer; a fifth fully connected layer connected to the second standardization layer; a concatenation layer connected to the first fully connected layer and all the target fully connected layers; and a third fully connected layer connected to the concatenation layer.
An image-text mutual retrieval device provided by an embodiment of the present application comprises a memory 201 and a processor 202; a computer program is stored in the memory 201, and the processor 202 implements the following steps when executing the computer program: the image processing model is built based on an attention mechanism.
An image-text mutual retrieval device provided by an embodiment of the present application comprises a memory 201 and a processor 202; a computer program is stored in the memory 201, and the processor 202 implements the following steps when executing the computer program: the image processing model comprises a target number of image processing branches and a fourth fully connected layer connected to the image processing branches; each image processing branch comprises an input layer, a backbone network connected to the input layer, a fifth fully connected layer connected to the backbone network, an attention mechanism layer connected to the fifth fully connected layer, a first normalization layer connected to the attention mechanism layer, a multiplier connected to the first normalization layer, an adder connected to the multiplier and the fifth fully connected layer, a Linear layer connected to the adder, and a BiLSTM layer connected to the Linear layer; wherein the first normalization layer in each image processing branch is the same one, and the BiLSTM layers in the image processing branches are interconnected.
An image-text mutual retrieval device provided by an embodiment of the present application comprises a memory 201 and a processor 202; a computer program is stored in the memory 201, and the processor 202 implements the following steps when executing the computer program: the attention mechanism layer comprises: a sixth fully connected layer connected to the fifth fully connected layer, a second excitation layer connected to the sixth fully connected layer, a seventh fully connected layer connected to the second excitation layer, and a second normalization layer connected to the seventh fully connected layer, the second normalization layer being connected to the first normalization layer.
An image-text mutual retrieval device provided by an embodiment of the present application comprises a memory 201 and a processor 202; a computer program is stored in the memory 201, and the processor 202 implements the following steps when executing the computer program: the loss function in the image-text mutual retrieval neural network model comprises:

$$loss_{vt}^{b}=\sum_{a=1}^{M}\Big[\Delta_{2}+d\big(i_{a}^{b},\,s_{ap}^{b}\big)-\min\,d\big(i_{a}^{b},\,s_{np}\big)\Big]_{+}$$

where $loss_{vt}^{b}$ denotes the loss function value between the texts and images in batch b; M denotes the number of paired samples; $\Delta_{2}$ denotes a preset hyperparameter; $i_{a}^{b}$ denotes the a-th sample selected in the target image processing results of the target images; $s_{ap}^{b}$ denotes the sample selected in the target text processing results corresponding to the target texts that is paired with $i_{a}^{b}$; min denotes taking the minimum value; and $s_{np}$ denotes the samples selected in the target text feature processing results that are not paired with $i_{a}^{b}$.
An image-text mutual retrieval device provided by an embodiment of the present application comprises a memory 201 and a processor 202; a computer program is stored in the memory 201, and the processor 202 implements the following steps when executing the computer program: the backbone network comprises a ResNet network.
Referring to Fig. 8, another image-text mutual retrieval device provided by an embodiment of the present application may further comprise: an input port 203 connected to the processor 202 and used to transmit externally input commands to the processor 202; a display unit 204 connected to the processor 202 and used to display the processing results of the processor 202 to the outside; and a communication module 205 connected to the processor 202 and used to implement communication between the image-text mutual retrieval device and the outside. The display unit 204 may be a display panel, a laser-scanning display or the like; the communication methods adopted by the communication module 205 include but are not limited to Mobile High-Definition Link (MHL), Universal Serial Bus (USB), High-Definition Multimedia Interface (HDMI), and wireless connections: Wireless Fidelity (WiFi), Bluetooth communication, Bluetooth Low Energy communication, and IEEE 802.11s-based communication.
A non-volatile readable storage medium provided by an embodiment of the present application stores a computer program that, when executed by a processor, implements the following steps:
acquiring a group of target texts and a group of target images to be retrieved, the target text comprising various kinds of sub-information characterizing target information;
determining target text input information corresponding to the target text;
processing the target text input information based on a text processing model in a pre-trained image-text mutual retrieval neural network model to obtain a target text processing result; wherein the text processing model is built based on self-supervised learning, and the self-supervised learning is used to perform supervised learning on the target text based on the association information between the various kinds of sub-information;
processing the target image based on an image processing model in the image-text mutual retrieval neural network model to obtain a target image processing result;
determining, based on the target text processing result and the target image processing result, an image retrieval result of the target text among the target images, and/or a text retrieval result of the target image among the target texts.
A computer non-volatile readable storage medium provided by an embodiment of the present application stores a computer program that, when executed by a processor, implements the following steps: determining the various kinds of sub-information in the target text; determining the position information corresponding to each piece of sub-information; determining the first type information corresponding to each piece of sub-information; for each piece of sub-information, converting the sub-information and its corresponding position information and first type information into corresponding initial vector information and taking the sum of all the initial vector information as the first vector information of the sub-information; and determining the target text input information based on the first vector information.
A computer non-volatile readable storage medium provided by an embodiment of the present application stores a computer program that, when executed by a processor, implements the following steps: determining the second type information of the target text; converting the second type information into corresponding second vector information; and taking the second vector information and the first vector information as the target text input information.
A non-volatile readable storage medium provided by an embodiment of the present application stores a computer program that, when executed by a processor, implements the following steps: the process of determining the corresponding weight values of the self-supervised learning comprises: for any two pieces of sub-information in the text processing model, determining a target sample in one of them and, in the other, first-type samples paired with the target sample and second-type samples not paired with the target sample; determining the first distance value between the target sample and the first-type sample and the second distance value between the target sample and the second-type sample; determining the loss value of the self-supervised learning based on all the first distance values and second distance values; and determining the weight values of the self-supervised learning based on the loss value.
A non-volatile readable storage medium provided by an embodiment of the present application stores a computer program that, when executed by a processor, implements the following steps: determining the loss value of the self-supervised learning based on all the first distance values and second distance values through the loss function of the self-supervised learning;
wherein the loss function of the self-supervised learning comprises:

$$loss_{i,j}^{b}=\sum_{a=1}^{N}\Big[\Delta_{1}+d\big(s_{a}^{i},\,s_{ap}^{j}\big)-\min\,d\big(s_{a}^{i},\,s_{anp}^{j}\big)\Big]_{+}$$

where $loss_{i,j}^{b}$ denotes the loss function value of the i-th piece of sub-information relative to the j-th piece of sub-information in the self-supervised learning, i=1,2…n, j=1,2…n, i≠j, and n denotes the total number of kinds of sub-information; b denotes the batch of the self-supervised learning; N denotes the number of paired samples; d denotes a distance value; $s_{a}^{i}$ denotes the a-th target sample selected in the i-th piece of sub-information; $s_{ap}^{j}$ denotes the first-type sample selected in the j-th piece of sub-information that is paired with $s_{a}^{i}$; $s_{anp}^{j}$ denotes the samples selected in the j-th piece of sub-information that are not paired with $s_{a}^{i}$; $\Delta_{1}$ denotes a preset hyperparameter; and min denotes taking the minimum value; wherein the sum of all the loss function values is the loss value of the self-supervised learning.
A non-volatile readable storage medium provided by an embodiment of the present application stores a computer program that, when executed by a processor, implements the following steps: the text processing model comprises a neural network model built based on a transformer model and self-supervised learning.
A non-volatile readable storage medium provided by an embodiment of the present application stores a computer program that, when executed by a processor, implements the following steps: the text processing model comprises an input layer; a multi-head attention mechanism layer connected to the input layer; a first standardization layer connected to the input layer and the multi-head attention mechanism layer; a feed-forward layer connected to the standardization layer; a second standardization layer connected to the feed-forward layer and the first standardization layer; a first fully connected layer, a first excitation layer, a second fully connected layer and a self-supervised classification output layer connected in sequence to the second standardization layer; target fully connected layers connected to the second standardization layer in one-to-one correspondence with the sub-information; a fourth fully connected layer connected to the second standardization layer; a fifth fully connected layer connected to the second standardization layer; a concatenation layer connected to the first fully connected layer and all the target fully connected layers; and a third fully connected layer connected to the concatenation layer.
A non-volatile readable storage medium provided by an embodiment of the present application stores a computer program that, when executed by a processor, implements the following steps: the image processing model is built based on an attention mechanism.
A non-volatile readable storage medium provided by an embodiment of the present application stores a computer program that, when executed by a processor, implements the following steps: the image processing model comprises a target number of image processing branches and a fourth fully connected layer connected to the image processing branches; each image processing branch comprises an input layer, a backbone network connected to the input layer, a fifth fully connected layer connected to the backbone network, an attention mechanism layer connected to the fifth fully connected layer, a first normalization layer connected to the attention mechanism layer, a multiplier connected to the first normalization layer, an adder connected to the multiplier and the fifth fully connected layer, a Linear layer connected to the adder, and a BiLSTM layer connected to the Linear layer; wherein the first normalization layer in each image processing branch is the same one, and the BiLSTM layers in the image processing branches are interconnected.
A non-volatile readable storage medium provided by an embodiment of the present application stores a computer program that, when executed by a processor, implements the following steps: the attention mechanism layer comprises: a sixth fully connected layer connected to the fifth fully connected layer, a second excitation layer connected to the sixth fully connected layer, a seventh fully connected layer connected to the second excitation layer, and a second normalization layer connected to the seventh fully connected layer, the second normalization layer being connected to the first normalization layer.
A non-volatile readable storage medium provided by an embodiment of the present application stores a computer program that, when executed by a processor, implements the following steps: the loss function in the image-text mutual retrieval neural network model comprises:

$$loss_{vt}^{b}=\sum_{a=1}^{M}\Big[\Delta_{2}+d\big(i_{a}^{b},\,s_{ap}^{b}\big)-\min\,d\big(i_{a}^{b},\,s_{np}\big)\Big]_{+}$$

where $loss_{vt}^{b}$ denotes the loss function value between the texts and images in batch b; M denotes the number of paired samples; $\Delta_{2}$ denotes a preset hyperparameter; $i_{a}^{b}$ denotes the a-th sample selected in the target image processing results of the target images; $s_{ap}^{b}$ denotes the sample selected in the target text processing results corresponding to the target texts that is paired with $i_{a}^{b}$; min denotes taking the minimum value; and $s_{np}$ denotes the samples selected in the target text feature processing results that are not paired with $i_{a}^{b}$.
A non-volatile readable storage medium provided by an embodiment of the present application stores a computer program that, when executed by a processor, implements the following steps: the backbone network comprises a ResNet network.
The non-volatile readable storage media involved in the present application include random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROMs, or any other form of storage medium known in the technical field.
For descriptions of the relevant parts of the image-text mutual retrieval system, device and computer-readable storage medium provided by the embodiments of the present application, refer to the detailed descriptions of the corresponding parts of the image-text mutual retrieval method provided by the embodiments of the present application, which are not repeated here. In addition, the parts of the above technical solutions provided by the embodiments of the present application whose implementation principles are consistent with those of the corresponding technical solutions in the prior art are not described in detail, to avoid redundancy.
It should also be noted that relational terms such as first and second are used herein only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a list of elements includes not only those elements but also other elements not explicitly listed, or also includes elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application will not be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (20)

  1. An image-text mutual retrieval method, characterized by comprising:
    acquiring a group of target texts and a group of target images to be retrieved, the target text comprising various kinds of sub-information characterizing target information;
    determining target text input information corresponding to the target text;
    processing the target text input information based on a text processing model in a pre-trained image-text mutual retrieval neural network model to obtain a target text processing result; wherein the text processing model is built based on self-supervised learning, and the self-supervised learning is used to perform supervised learning on the target text based on the association information between the various kinds of the sub-information;
    processing the target image based on an image processing model in the image-text mutual retrieval neural network model to obtain a target image processing result;
    determining, based on the target text processing result and the target image processing result, an image retrieval result of the target text among the target images, and/or a text retrieval result of the target image among the target texts.
  2. The method according to claim 1, characterized in that the steps of the image processing model processing the target text input information and the target image comprise:
    extracting sample features of sample data and storing the sample features in a data set to be retrieved, the sample data comprising text sample data or image sample data;
    receiving query data, the query data being the target text or the target image input by a user;
    extracting a query feature of the query data;
    matching the query feature against the sample features in the data set to be retrieved, i.e., computing vector distances;
    outputting, as the processing result, the sample data corresponding to the sample feature with the smallest vector distance.
  3. The method according to claim 1, characterized in that determining the target text input information corresponding to the target text comprises:
    determining the various kinds of the sub-information in the target text;
    determining position information corresponding to each piece of the sub-information;
    determining first type information corresponding to each piece of the sub-information;
    for each piece of the sub-information, converting the sub-information and its corresponding position information and first type information into corresponding initial vector information, and taking the sum of all the initial vector information as first vector information of the sub-information;
    determining the target text input information based on the first vector information.
  4. The method according to claim 3, characterized in that determining the target text input information based on the first vector information comprises:
    determining second type information of the target text;
    converting the second type information into corresponding second vector information;
    taking the second vector information and the first vector information as the target text input information.
  5. The method according to claim 4, characterized in that the process of determining the corresponding weight values of the self-supervised learning comprises:
    for any two of the pieces of the sub-information in the text processing model, determining a target sample in one of the two pieces of the sub-information, determining, in the other piece of the sub-information, first-type samples paired with the target sample and second-type samples not paired with the target sample, determining a first distance value between the target sample and the first-type sample, and determining a second distance value between the target sample and the second-type sample;
    determining a loss value of the self-supervised learning based on all the first distance values and the second distance values;
    determining the weight values of the self-supervised learning based on the loss value.
  6. The method according to claim 5, characterized in that determining the loss value of the self-supervised learning based on all the first distance values and the second distance values comprises:
    determining the loss value of the self-supervised learning based on all the first distance values and the second distance values through a loss function of the self-supervised learning;
    wherein the loss function of the self-supervised learning comprises:

    $$loss_{i,j}^{b}=\sum_{a=1}^{N}\Big[\Delta_{1}+d\big(s_{a}^{i},\,s_{ap}^{j}\big)-\min\,d\big(s_{a}^{i},\,s_{anp}^{j}\big)\Big]_{+}$$

    where $loss_{i,j}^{b}$ denotes the loss function value of the i-th piece of the sub-information relative to the j-th piece of the sub-information in the self-supervised learning, i=1,2…n, j=1,2…n, i≠j, and n denotes the total number of kinds of the sub-information; b denotes the batch of the self-supervised learning; N denotes the number of paired samples; d denotes a distance value; $s_{a}^{i}$ denotes the a-th target sample selected in the i-th piece of the sub-information; $s_{ap}^{j}$ denotes the first-type sample selected in the j-th piece of the sub-information that is paired with $s_{a}^{i}$; $s_{anp}^{j}$ denotes the samples selected in the j-th piece of the sub-information that are not paired with $s_{a}^{i}$; $\Delta_{1}$ denotes a preset hyperparameter; and min denotes taking the minimum value; wherein the sum of all the loss function values is the loss value of the self-supervised learning.
  7. The method according to claim 4, characterized in that the text processing model comprises a neural network model built based on a transformer model and the self-supervised learning.
  8. The method according to claim 7, characterized in that the text processing model comprises an input layer; a multi-head attention mechanism layer connected to the input layer; a first standardization layer connected to the input layer and the multi-head attention mechanism layer; a feed-forward layer connected to the standardization layer; a second standardization layer connected to the feed-forward layer and the first standardization layer; a first fully connected layer, a first excitation layer, a second fully connected layer and a self-supervised classification output layer connected in sequence to the second standardization layer; target fully connected layers connected to the second standardization layer in one-to-one correspondence with the sub-information; a fourth fully connected layer connected to the second standardization layer; a fifth fully connected layer connected to the second standardization layer; a concatenation layer connected to the first fully connected layer and all the target fully connected layers; and a third fully connected layer connected to the concatenation layer.
  9. The method according to claim 8, characterized in that the method further comprises:
    outputting, through the third fully connected layer, the processing result of the text processing model for the target text.
  10. The method according to any one of claims 1 to 9, characterized in that the image processing model is built based on an attention mechanism.
  11. The method according to claim 10, characterized in that the image processing model comprises a target number of image processing branches and a fourth fully connected layer connected to the image processing branches; each image processing branch comprises an input layer, a backbone network connected to the input layer, a fifth fully connected layer connected to the backbone network, an attention mechanism layer connected to the fifth fully connected layer, a first normalization layer connected to the attention mechanism layer, a multiplier connected to the first normalization layer, an adder connected to the multiplier and the fifth fully connected layer, a Linear layer connected to the adder, and a BiLSTM layer connected to the Linear layer;
    wherein the first normalization layer in each of the image processing branches is the same one; and the BiLSTM layers in the image processing branches are interconnected.
  12. The method according to claim 11, characterized in that the attention mechanism layer comprises: a sixth fully connected layer connected to the fifth fully connected layer, a second excitation layer connected to the sixth fully connected layer, a seventh fully connected layer connected to the second excitation layer, and a second normalization layer connected to the seventh fully connected layer, the second normalization layer being connected to the first normalization layer.
  13. The method according to claim 12, characterized in that the loss function in the image-text mutual retrieval neural network model comprises:

    $$loss_{vt}^{b}=\sum_{a=1}^{M}\Big[\Delta_{2}+d\big(i_{a}^{b},\,s_{ap}^{b}\big)-\min\,d\big(i_{a}^{b},\,s_{np}\big)\Big]_{+}$$

    where $loss_{vt}^{b}$ denotes the loss function value between the texts and images in batch b; M denotes the number of paired samples; $\Delta_{2}$ denotes a preset hyperparameter; $i_{a}^{b}$ denotes the a-th sample selected in the target image processing results of the target images; $s_{ap}^{b}$ denotes the sample selected in the target text processing results corresponding to the target texts that is paired with $i_{a}^{b}$; min denotes taking the minimum value; and $s_{np}$ denotes the samples selected in the target text feature processing results that are not paired with $i_{a}^{b}$.
  14. The method according to claim 11, characterized in that the backbone network comprises a ResNet network.
  15. The method according to claim 1, characterized in that the step of training the image-text mutual retrieval neural network model comprises:
    forward-propagating sample data from lower levels to higher levels to obtain an output value of the image-text mutual retrieval neural network model;
    computing an output error based on the output value of the image-text mutual retrieval neural network model and the sample data;
    back-propagating the output error from higher levels to lower levels and computing back-propagation errors;
    when a preset condition is not satisfied, selecting new sample data and returning to the step of forward-propagating the sample data from lower levels to higher levels to obtain the output value of the image-text mutual retrieval neural network model; the preset condition being that the output error is smaller than a first preset threshold, or that the number of iterations exceeds a second preset threshold.
  16. The method according to claim 15, characterized in that the step of forward-propagating the sample data from lower levels to higher levels to obtain the output value of the image-text mutual retrieval neural network model comprises:
    initializing the weight values of the network layers in the image-text mutual retrieval neural network model; the network layers comprising: a graph neural layer, a convolutional layer, a down-sampling layer and a fully connected layer;
    inputting the sample data, the sample data comprising image sample data and text sample data;
    computing the forward-propagation output values of the sample data through the graph neural layer, the convolutional layer, the down-sampling layer and the fully connected layer;
    computing, based on the forward-propagation output values, the output value of the image-text mutual retrieval neural network model and the loss function value of the output value of the image-text mutual retrieval neural network model.
  17. The method according to claim 16, characterized in that the step of back-propagating the output error from higher levels to lower levels and computing the back-propagation errors comprises:
    propagating the output error back to the network layers of the image-text mutual retrieval neural network model;
    computing the back-propagation errors of the network layers;
    adjusting the weight values according to the back-propagation errors.
  18. An image-text mutual retrieval system, characterized by comprising:
    a first acquisition module, configured to acquire a group of target texts and a group of target images to be retrieved, the target text comprising various kinds of sub-information characterizing target information;
    a first determination module, configured to determine target text input information corresponding to the target text;
    a first processing module, configured to process the target text input information based on a text processing model in a pre-trained image-text mutual retrieval neural network model to obtain a target text processing result, wherein the text processing model is built based on self-supervised learning, and the self-supervised learning is used to perform supervised learning on the target text based on the association information between the various kinds of the sub-information;
    a second processing module, configured to process the target image based on an image processing model in the image-text mutual retrieval neural network model to obtain a target image processing result;
    a second determination module, configured to determine, based on the target text processing result and the target image processing result, an image retrieval result of the target text among the target images and/or a text retrieval result of the target image among the target texts.
  19. An image-text mutual retrieval device, characterized by comprising:
    a memory, configured to store a computer program;
    a processor, configured to implement the steps of the image-text mutual retrieval method according to any one of claims 1 to 17 when executing the computer program.
  20. A non-volatile readable storage medium, characterized in that a computer program is stored in the non-volatile readable storage medium, and the computer program, when executed by a processor, implements the steps of the image-text mutual retrieval method according to any one of claims 1 to 17.
PCT/CN2022/134091 2022-07-12 2022-11-24 Image-text mutual retrieval method, system and device, and non-volatile readable storage medium WO2024011814A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210812205.4 2022-07-12
CN202210812205.4A CN114896429B (zh) 2022-07-12 2022-07-12 Image-text mutual retrieval method, system, device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2024011814A1 true WO2024011814A1 (zh) 2024-01-18

Family

ID=82729397

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/134091 WO2024011814A1 (zh) 2022-07-12 2022-11-24 Image-text mutual retrieval method, system and device, and non-volatile readable storage medium

Country Status (2)

Country Link
CN (1) CN114896429B (zh)
WO (1) WO2024011814A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114896429B (zh) * 2022-07-12 2022-12-27 苏州浪潮智能科技有限公司 Image-text mutual retrieval method, system, device, and computer-readable storage medium
CN115438225B (zh) * 2022-11-08 2023-03-24 苏州浪潮智能科技有限公司 Video-text mutual retrieval method and model training method, apparatus, device, and medium therefor
CN115455171B (zh) * 2022-11-08 2023-05-23 苏州浪潮智能科技有限公司 Text-video mutual retrieval and model training method, apparatus, device, and medium
CN115438169A (zh) * 2022-11-08 2022-12-06 苏州浪潮智能科技有限公司 Text-video mutual retrieval method, apparatus, device, and storage medium
CN115618043B (zh) * 2022-11-08 2023-04-07 苏州浪潮智能科技有限公司 Text-operation-graph mutual retrieval method and model training method, apparatus, device, and medium
CN115438215B (zh) * 2022-11-08 2023-04-18 苏州浪潮智能科技有限公司 Image-text bidirectional search and matching model training method, apparatus, device, and medium
CN115730878B (zh) * 2022-12-15 2024-01-12 广东省电子口岸管理有限公司 Data-identification-based import and export goods inspection management method

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019052403A1 (zh) * 2017-09-12 2019-03-21 腾讯科技(深圳)有限公司 Training method for image-text matching model, bidirectional search method, and related apparatus
CN110059157A (zh) * 2019-03-18 2019-07-26 华南师范大学 Image-text cross-modal retrieval method, system, apparatus, and storage medium
CN112148916A (zh) * 2020-09-28 2020-12-29 华中科技大学 Supervision-based cross-modal retrieval method, apparatus, device, and medium
CN112488131A (zh) * 2020-12-18 2021-03-12 贵州大学 Self-supervised adversarial image-text cross-modal retrieval method
CN112905822A (zh) * 2021-02-02 2021-06-04 华侨大学 Attention-mechanism-based deeply supervised cross-modal adversarial learning method
CN113064959A (zh) * 2020-01-02 2021-07-02 南京邮电大学 Cross-modal retrieval method based on deep self-supervised ranking hashing
CN113657450A (zh) * 2021-07-16 2021-11-16 中国人民解放军陆军炮兵防空兵学院 Attention-mechanism-based land-battlefield image-text cross-modal retrieval method and system
CN114239805A (zh) * 2021-12-15 2022-03-25 成都卫士通信息产业股份有限公司 Cross-modal retrieval neural network and training method, apparatus, electronic device, and medium
CN114896429A (zh) * 2022-07-12 2022-08-12 苏州浪潮智能科技有限公司 Image-text mutual retrieval method, system, device, and computer-readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590867B (zh) * 2021-08-05 2024-02-09 西安电子科技大学 Cross-modal information retrieval method based on hierarchical metric learning

Also Published As

Publication number Publication date
CN114896429A (zh) 2022-08-12
CN114896429B (zh) 2022-12-27

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22950922

Country of ref document: EP

Kind code of ref document: A1