CN115455225A - Method and device for constructing an image-text semantic alignment model

Info

Publication number: CN115455225A
Application number: CN202211108881.XA
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: image, text, semantic alignment, pair, feature
Inventors: 陈畅新, 陈第
Current assignee: Youmi Technology Co ltd
Original assignee: Youmi Technology Co ltd
Legal status: Pending
Application filed by Youmi Technology Co ltd; priority to CN202211108881.XA

Classifications

    • G06F16/5846: Information retrieval of still image data; retrieval characterised by using metadata automatically derived from the content, using extracted text
    • G06F16/53: Information retrieval of still image data; querying
    • G06N3/02: Computing arrangements based on biological models; neural networks
    • G06N3/08: Neural networks; learning methods

Abstract

The invention discloses a method and a device for constructing an image-text semantic alignment model, wherein the method comprises the following steps: inputting a plurality of image-text pairs into a semantic alignment model so that the semantic alignment model analyzes each image-text pair to obtain a semantic alignment result of each image-text pair, the semantic alignment result being used for expressing the matching degree between the sample image and the sample text in the corresponding image-text pair; judging whether the semantic alignment model meets a convergence condition according to the semantic alignment results and the actual matching results of all the image-text pairs; and, if not, correcting the model parameters until an image-text semantic alignment model is obtained that meets the convergence condition and can be used for predicting the image corresponding to a text, the text corresponding to an image, and the matching degree between an image and a text. By training the semantic alignment model on a plurality of image-text pairs, an image-text semantic alignment model usable in a variety of image-text matching scenarios is obtained, so that both the efficiency of image-text matching and the diversity of image-text matching modes can be improved.

Description

Method and device for constructing image-text semantic alignment model
Technical Field
The invention relates to the technical field of image classification, and in particular to a method and a device for constructing an image-text semantic alignment model.
Background
With the development of the digital age, a large amount of image and text information exists on the Internet, and people often need to process image information and text information in their work, for example matching a plurality of images with a plurality of texts. When the number of images and texts is small, they can be matched manually; however, when the number is large, manual matching is inefficient and cannot meet the demand for processing massive image-text information. How to construct an image-text semantic alignment model, so as to improve the efficiency of image-text matching, is therefore important.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method and a device for constructing an image-text semantic alignment model, which can improve the efficiency of image-text matching and the diversity of image-text matching modes.
In order to solve the technical problem, the first aspect of the present invention discloses a method for constructing an image-text semantic alignment model, where the method includes:
inputting a plurality of pre-determined image-text pairs into a semantic alignment model to be trained, so that the semantic alignment model analyzes each image-text pair to obtain a semantic alignment result of each image-text pair, each image-text pair comprises a sample image and a sample text, and the semantic alignment result is used for representing the matching degree of the sample image and the sample text in the corresponding image-text pair;
judging whether the semantic alignment model meets a convergence condition or not according to the semantic alignment results of all the image-text pairs and the pre-labeled actual matching results of all the image-text pairs;
and if not, correcting the model parameters of the semantic alignment model, and re-executing the operation of inputting the predetermined image-text pairs into the semantic alignment model to be trained so that the semantic alignment model analyzes each image-text pair to obtain the semantic alignment result of each image-text pair, executing the operation of judging whether the semantic alignment model meets the convergence condition according to the semantic alignment results of all the image-text pairs and the pre-labeled actual matching results of all the image-text pairs until the image-text semantic alignment model meeting the convergence condition is obtained, wherein the image-text semantic alignment model is used for predicting one or more of images corresponding to any texts, texts corresponding to any images, and matching degrees between any images and any texts.
As an alternative implementation, in the first aspect of the present invention, the semantic alignment model includes an image processing structure, a text processing structure, and an alignment structure;
the semantic alignment model analyzes each image-text pair to obtain a semantic alignment result of each image-text pair, and the semantic alignment result comprises the following steps:
carrying out feature extraction operation on the sample image of each image-text pair by the image processing structure to obtain the image feature of each image-text pair, and carrying out feature extraction operation on the sample text of each image-text pair by the text processing structure to obtain the text feature of each image-text pair;
and analyzing, by the alignment structure, the image-text splicing feature obtained after splicing the image feature and the text feature of each image-text pair, to obtain a semantic alignment result of each image-text pair.
As an optional implementation manner, in the first aspect of the present invention, the semantic alignment model further includes one or more feature transformation structures, each of the feature transformation structures includes at least a full connection layer;
after the image processing structure performs the feature extraction operation on the sample image of each image-text pair to obtain the image feature of each image-text pair, and before the alignment structure analyzes the image-text splicing feature obtained after splicing the image feature and the text feature of each image-text pair to obtain the semantic alignment result of each image-text pair, the method further comprises:
performing feature conversion processing on the image features of each image-text pair by the full connection layer to update the image features of the image-text pair, wherein the feature conversion processing is used for matching feature attributes corresponding to the image features of each image-text pair with feature attributes corresponding to the text features of the image-text pair, and the feature attributes comprise feature dimensions and/or feature spaces;
wherein the output result of each preceding feature transformation structure is the input content of its succeeding neighboring feature transformation structure.
As an alternative implementation, in the first aspect of the present invention, each of the feature transformation structures further includes a nonlinear processing layer;
after the feature conversion processing is performed on the image features of each image-text pair by the full connection layer to update the image features of the image-text pair, the method further includes:
carrying out nonlinear processing on the image characteristics of each image-text pair processed by the full-connection layer by the nonlinear processing layer so as to update the image characteristics of the image-text pair;
the non-linear processing layer performs non-linear processing on the image features of each image-text pair processed by the full connection layer to update the image features of the image-text pair, and the non-linear processing layer comprises:
performing activation function operation processing on the image characteristics of each image-text pair processed by the full-connection layer by the nonlinear processing layer based on a preset activation function;
and performing, by the nonlinear processing layer, random hiding processing on the values of one or more output neurons in the neural network layer corresponding to each image-text pair based on a preset random hiding mode and a random hiding probability, so as to update the image features of the image-text pair, wherein the neural network corresponding to each image-text pair comprises the neural network layer corresponding to the image features obtained after the image-text pair is processed by the activation function operation.
As an optional implementation manner, in the first aspect of the present invention, the analyzing, by the alignment structure, the image-text splicing feature obtained after splicing the image feature and the text feature of each image-text pair to obtain the semantic alignment result of each image-text pair includes:
performing, by a vector processing structure of the alignment structure, vector conversion processing on the image-text splicing feature obtained after splicing the image feature and the text feature of each image-text pair, to obtain a target matrix corresponding to each image-text pair;
and processing the target matrix corresponding to each image-text pair by the full connection layer of the alignment structure to obtain a confidence coefficient that the semantics of the sample image of the image-text pair are matched with the semantics of the sample text, and taking the confidence coefficient as a semantic alignment result of the image-text pair.
As an optional implementation manner, in the first aspect of the present invention, the semantic alignment result of each image-text pair includes a confidence that the semantics of the sample image and the semantics of the sample text of the image-text pair match;
the step of judging whether the semantic alignment model meets the convergence condition according to the semantic alignment results of all the image-text pairs and the pre-labeled actual matching results of all the image-text pairs comprises the following steps:
calculating a prediction loss value of the semantic alignment model according to the difference degree between the semantic alignment result of each image-text pair and a target confidence degree corresponding to the pre-labeled actual matching result of the image-text pair;
judging whether the predicted loss value is smaller than a preset loss value threshold value or not;
and when the judgment result is yes, determining that the semantic alignment model meets the convergence condition, and when the judgment result is no, determining that the semantic alignment model does not meet the convergence condition.
As an optional implementation manner, in the first aspect of the present invention, all the image-text pairs include at least one positive example image-text pair and/or at least one negative example image-text pair, an actual matching result of the positive example image-text pair is a first matching result that the sample image and the sample text match, and an actual matching result of the negative example image-text pair is a second matching result that the sample image and the sample text do not match;
before the calculating a prediction loss value of the semantic alignment model according to a difference between a semantic alignment result of each image-text pair and a target confidence corresponding to a pre-labeled actual matching result of the image-text pair, the method further includes:
updating the initial confidence corresponding to each actual matching result according to a preset label smoothing coefficient to obtain a target confidence corresponding to each actual matching result;
wherein the target confidence corresponding to the first matching result and the target confidence corresponding to the second matching result are respectively:
P1=1-ε,
P2=ε/(N-1),
wherein, P1 is used to represent the target confidence corresponding to the first matching result, P2 is used to represent the target confidence corresponding to the second matching result, epsilon is used to represent the label smoothing coefficient, and N is used to represent the number of all negative case image-text pairs.
As an optional implementation manner, in the first aspect of the present invention, before the determining whether the predicted loss value is smaller than a preset loss value threshold, the method further includes:
determining the similarity between the target image characteristic and the target text characteristic of each image-text pair determined based on the semantic alignment model as the similarity corresponding to the image-text pair;
updating the prediction loss value according to the similarity corresponding to all the image-text pairs and the actual matching result of all the image-text pairs;
and, prior to the determining of the similarity between the target image feature and the target text feature of each image-text pair, the method further comprises:
for each image-text pair, segmenting a target matrix corresponding to the image-text pair output by the vector processing structure into a target image feature and a target text feature according to an input feature dimension corresponding to the input content of the vector processing structure of the semantic alignment model in the process of analyzing the image-text pair by the semantic alignment model, wherein the input content comprises image-text splicing features determined by the semantic alignment model based on the sample image and the sample text of each image-text pair.
The second aspect of the invention discloses a device for constructing a graphic and text semantic alignment model, which comprises:
an input module, configured to input a plurality of predetermined image-text pairs into a semantic alignment model to be trained, so that the semantic alignment model analyzes each image-text pair to obtain a semantic alignment result of each image-text pair, wherein each image-text pair comprises a sample image and a sample text, and the semantic alignment result is used for representing the matching degree of the sample image and the sample text in the corresponding image-text pair;
the judging module is used for judging whether the semantic alignment model meets a convergence condition or not according to the semantic alignment results of all the image-text pairs and the pre-labeled actual matching results of all the image-text pairs;
and the correction module is used for correcting the model parameters of the semantic alignment model when the judgment module judges that the semantic alignment model does not meet the convergence condition, triggering the input module to execute the operation of inputting a plurality of pre-determined image-text pairs into the semantic alignment model to be trained again so that the semantic alignment model analyzes each image-text pair to obtain the semantic alignment result of each image-text pair, triggering the judgment module to execute the operation of judging whether the semantic alignment model meets the convergence condition according to the semantic alignment results of all the image-text pairs and the pre-marked actual matching results of all the image-text pairs until the image-text semantic alignment model meeting the convergence condition is obtained, wherein the semantic alignment model is used for predicting one or more of images corresponding to any text, texts corresponding to any images, and matching degrees between any images and any texts.
As an alternative embodiment, in the second aspect of the present invention, the semantic alignment model includes an image processing structure, a text processing structure and an alignment structure;
the specific way of analyzing each image-text pair by the semantic alignment model to obtain the semantic alignment result of each image-text pair comprises the following steps:
carrying out feature extraction operation on the sample image of each image-text pair by the image processing structure to obtain the image feature of each image-text pair, and carrying out feature extraction operation on the sample text of each image-text pair by the text processing structure to obtain the text feature of each image-text pair;
and analyzing, by the alignment structure, the image-text splicing feature obtained after splicing the image feature and the text feature of each image-text pair, to obtain a semantic alignment result of each image-text pair.
As an optional implementation manner, in the second aspect of the present invention, the semantic alignment model further includes one or more feature transformation structures, each of the feature transformation structures includes at least a full connection layer;
the fully connected layer is configured to: after the image processing structure performs the feature extraction operation on the sample image of each image-text pair to obtain the image feature of each image-text pair, and before the alignment structure analyzes the image-text splicing feature obtained after splicing the image feature and the text feature of each image-text pair to obtain the semantic alignment result of each image-text pair, perform feature conversion processing on the image feature of each image-text pair to update the image feature of the image-text pair, wherein the feature conversion processing is used for matching the feature attribute corresponding to the image feature of each image-text pair with the feature attribute corresponding to the text feature of the image-text pair, and the feature attributes comprise feature dimensions and/or feature spaces;
wherein the output result of each preceding feature transformation structure is the input content of its succeeding neighboring feature transformation structure.
As an alternative embodiment, in the second aspect of the present invention, each of the feature transformation structures further includes a nonlinear processing layer;
the nonlinear processing layer is configured to: after the fully connected layer performs feature conversion processing on the image features of each image-text pair to update the image features of the image-text pair, perform nonlinear processing on the image features of each image-text pair processed by the fully connected layer, so as to update the image features of the image-text pair;
the non-linear processing layer performs non-linear processing on the image features of each image-text pair processed by the full-connection layer, and the specific way of updating the image features of the image-text pair includes:
the non-linear processing layer performs activation function operation processing on the image characteristics of each image-text pair processed by the full connection layer based on a preset activation function;
and the nonlinear processing layer performs random hiding processing on the values of one or more output neurons in the neural network layer corresponding to each image-text pair based on a preset random hiding mode and a random hiding probability, so as to update the image features of the image-text pair, wherein the neural network corresponding to each image-text pair comprises the neural network layer corresponding to the image features obtained after the image-text pair is processed by the activation function operation.
As an optional implementation manner, in the second aspect of the present invention, the specific manner in which the alignment structure analyzes the image-text splicing feature obtained after splicing the image feature and the text feature of each image-text pair to obtain the semantic alignment result of each image-text pair includes:
performing, by a vector processing structure of the alignment structure, vector conversion processing on the image-text splicing feature obtained after splicing the image feature and the text feature of each image-text pair, to obtain a target matrix corresponding to each image-text pair;
and processing the target matrix corresponding to each image-text pair by the full connection layer of the alignment structure to obtain a confidence coefficient of matching the semantics of the sample image of the image-text pair with the semantics of the sample text as a semantic alignment result of the image-text pair.
As an optional implementation manner, in the second aspect of the present invention, the semantic alignment result of each image-text pair includes a confidence that the semantics of the sample image and the semantics of the sample text of the image-text pair match;
the specific way that the judging module judges whether the semantic alignment model meets the convergence condition according to the semantic alignment results of all the image-text pairs and the pre-labeled actual matching results of all the image-text pairs comprises the following steps:
calculating a prediction loss value of the semantic alignment model according to the difference degree between the semantic alignment result of each image-text pair and a target confidence degree corresponding to the pre-labeled actual matching result of the image-text pair;
judging whether the predicted loss value is smaller than a preset loss value threshold value or not;
and when the judgment result is yes, determining that the semantic alignment model meets the convergence condition, and when the judgment result is no, determining that the semantic alignment model does not meet the convergence condition.
As an optional implementation manner, in the second aspect of the present invention, all the image-text pairs include at least one positive example image-text pair and/or at least one negative example image-text pair, an actual matching result of the positive example image-text pair is a first matching result that the sample image and the sample text match, and an actual matching result of the negative example image-text pair is a second matching result that the sample image and the sample text do not match;
the device further comprises:
the first updating module is used for updating the initial confidence coefficient corresponding to each actual matching result according to a preset label smoothing coefficient before the judging module calculates the prediction loss value of the semantic alignment model according to the difference degree between the semantic alignment result of each image-text pair and the target confidence coefficient corresponding to the pre-labeled actual matching result of the image-text pair, so as to obtain the target confidence coefficient corresponding to each actual matching result;
wherein the target confidence corresponding to the first matching result and the target confidence corresponding to the second matching result are respectively:
P1=1-ε,
P2=ε/(N-1),
wherein, P1 is used to represent the target confidence corresponding to the first matching result, P2 is used to represent the target confidence corresponding to the second matching result, epsilon is used to represent the label smoothing coefficient, and N is used to represent the number of all negative case image-text pairs.
As an optional embodiment, in the second aspect of the present invention, the apparatus further comprises:
the determining module is used for determining the similarity between the target image feature and the target text feature of each image-text pair determined based on the semantic alignment model before the judging module judges whether the predicted loss value is smaller than a preset loss value threshold value or not, and taking the similarity as the corresponding similarity of the image-text pair;
the second updating module is used for updating the prediction loss value according to the similarity corresponding to all the image-text pairs and the actual matching result of all the image-text pairs;
and, the apparatus further comprises:
and the feature segmentation module is used for, in the process of analyzing each image-text pair by the semantic alignment model, segmenting the target matrix output by the vector processing structure and corresponding to the image-text pair into a target image feature and a target text feature according to the input feature dimension corresponding to the input content of the vector processing structure of the semantic alignment model, wherein the input content comprises the image-text splicing feature determined by the semantic alignment model based on the sample image and the sample text of each image-text pair.
The third aspect of the invention discloses another device for constructing a graphic and text semantic alignment model, which comprises:
a memory storing executable program code;
a processor coupled with the memory;
the processor calls the executable program code stored in the memory to execute the method for constructing the image-text semantic alignment model disclosed by the first aspect of the invention.
The fourth aspect of the present invention discloses a computer storage medium, which stores computer instructions, and when the computer instructions are called, the computer instructions are used for executing the method for constructing the image-text semantic alignment model disclosed in the first aspect of the present invention.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
in the embodiment of the invention, a plurality of pre-determined image-text pairs are input into a semantic alignment model to be trained, so that the semantic alignment model analyzes each image-text pair to obtain a semantic alignment result of each image-text pair, each image-text pair comprises a sample image and a sample text, and the semantic alignment result is used for expressing the matching degree of the sample image and the sample text in the corresponding image-text pair; judging whether the semantic alignment model meets a convergence condition or not according to the semantic alignment results of all image-text pairs and the actual matching results of all pre-labeled image-text pairs; and if not, correcting the model parameters of the semantic alignment model until an image-text semantic alignment model meeting the convergence condition is obtained, wherein the image-text semantic alignment model is used for predicting one or more of an image corresponding to any text, a text corresponding to any image and the matching degree between any image and any text. Therefore, the image-text semantic alignment model which can be used for predicting the image corresponding to any text, the text corresponding to any image and the matching degree between any image and the text can be obtained by training the semantic alignment model through the image-text pairs, the image-text matching efficiency can be improved, and the diversity of image-text matching modes can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a method for constructing an image-text semantic alignment model according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a semantic alignment model according to an embodiment of the present invention;
Fig. 3 is a schematic flow chart of another method for constructing an image-text semantic alignment model according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a device for constructing an image-text semantic alignment model according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of another device for constructing an image-text semantic alignment model according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of another device for constructing an image-text semantic alignment model according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
The terms "first," "second," and the like in the description and claims of the present invention and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, article, or article that comprises a list of steps or elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or article.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein may be combined with other embodiments.
The invention discloses a method and a device for constructing an image-text semantic alignment model, which can train a semantic alignment model on a plurality of image-text pairs to obtain an image-text semantic alignment model that can be used for predicting the image corresponding to any text, the text corresponding to any image, and the matching degree between any image and any text, thereby improving not only the efficiency of image-text matching but also the diversity of image-text matching modes. The following are detailed below.
Example one
Referring to fig. 1, fig. 1 is a schematic flow chart of a method for constructing an image-text semantic alignment model according to an embodiment of the present invention. The method described in fig. 1 may be applied to a process of constructing an image-text semantic alignment model based on any architecture, which is not limited in the embodiment of the present invention. As shown in fig. 1, the method for constructing the image-text semantic alignment model may include the following operations:
101. and inputting a plurality of pre-determined image-text pairs into the semantic alignment model to be trained, so that the semantic alignment model analyzes each image-text pair to obtain a semantic alignment result of each image-text pair.
In this embodiment of the present invention, optionally, each image-text pair may include a sample image and a sample text, and the semantic alignment result is used to indicate the matching degree between the sample image and the sample text in the corresponding image-text pair. Further optionally, all the image-text pairs include at least one positive example image-text pair and/or at least one negative example image-text pair. A positive example image-text pair is an image-text pair in which the image and the text match, for example a sample text reading "civet cat" paired with a sample image of a civet cat. A negative example image-text pair is an image-text pair in which the image and the text do not match, for example a sample text reading "golden retriever" paired with a sample image of an Alaskan Malamute. This can reduce the occurrence of overfitting when training the semantic alignment model.
In this embodiment of the present invention, optionally, the semantic alignment result of each image-text pair may include a confidence that the semantics of the sample image of the image-text pair match the semantics of the sample text.
In an embodiment of the present invention, optionally, the set of image-text pairs for training the semantic alignment model may include a plurality of image-text pairs classified at any granularity, which is not limited in the embodiment of the present invention. Further optionally, the granularity may include a granularity based on basic categories (e.g., bird, dog, cat) and/or a granularity based on a plurality of subclasses of a basic category (e.g., cuckoo, woodpecker, swallow).
In this embodiment of the present invention, optionally, as shown in fig. 2, the semantic alignment model may include an image processing structure, a text processing structure and an alignment structure. Further optionally, the alignment structure may include a vector processing structure and a fully connected layer; the image processing structure may include an image encoder, and the text processing structure may include a text encoder. The vector processing structure is used for semantic parsing of the image features and the text features, and may include a vector transformation structure based on a self-attention mechanism. Preferably, the image encoder may be a CNN encoder, the text encoder may be a BERT encoder, and the vector transformation structure may be a Transformer structure. In this way, the degree of matching between the image coding result and the text coding result, and between the image and the text, can be improved, and adopting a vector transformation structure based on a self-attention mechanism can improve the relevance and globality of the internal information of the image features and the text features.
In this embodiment of the present invention, further optionally, as shown in fig. 2, the semantic alignment model may further include one or more feature transformation structures, each feature transformation structure at least includes a full connection layer, and further optionally, each feature transformation structure may further include a non-linear processing layer.
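To make the structure of fig. 2 concrete, the following is a minimal PyTorch sketch, assuming a small convolutional network as the image encoder, a plain embedding layer as a stand-in for the BERT text encoder, and a single Transformer encoder layer as the vector processing structure; all class names, layer sizes and hyperparameters are illustrative assumptions, not values taken from the patent:

```python
import torch
import torch.nn as nn

class SemanticAlignmentModel(nn.Module):
    """Sketch: image processing structure + text processing structure +
    feature transformation structures + alignment structure."""

    def __init__(self, d_model: int = 768, n_classes: int = 2):
        super().__init__()
        # Image processing structure (stand-in for a CNN encoder)
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, d_model),
        )
        # Text processing structure (stand-in for a BERT encoder)
        self.text_encoder = nn.Embedding(30522, d_model)
        # Feature transformation structures: fully connected layers with
        # nonlinear processing, matching the image features to the text
        # features' dimension and feature space
        self.feature_transform = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Dropout(0.2),
            nn.Linear(d_model, d_model), nn.GELU(), nn.Dropout(0.2),
        )
        # Alignment structure: self-attention vector processing structure over
        # the spliced image-text feature, plus a fully connected layer
        self.vector_processor = nn.TransformerEncoderLayer(
            d_model=2 * d_model, nhead=8, batch_first=True)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, images, token_ids):
        img = self.feature_transform(self.image_encoder(images))  # [B, d]
        txt = self.text_encoder(token_ids).mean(dim=1)            # [B, d]
        spliced = torch.cat([img, txt], dim=-1)                   # [B, 2d]
        target_matrix = self.vector_processor(spliced.unsqueeze(1)).squeeze(1)
        logits = self.classifier(target_matrix)                   # semantic alignment result
        return logits, img, txt, target_matrix
```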
As an alternative embodiment, as shown in fig. 2, the analyzing, by the semantic alignment model, each graph-text pair to obtain a semantic alignment result of each graph-text pair may include:
carrying out feature extraction operation on the sample image of each image-text pair by the image processing structure to obtain the image feature of each image-text pair, and carrying out feature extraction operation on the sample text of each image-text pair by the text processing structure to obtain the text feature of each image-text pair;
and analyzing the image-text splicing characteristics obtained after the image characteristics and the text characteristics of each image-text pair are spliced by the alignment structure to obtain a semantic alignment result of each image-text pair.
Therefore, by implementing the optional implementation mode, the image features of the sample image and the text features of the sample text in each image-text pair can be extracted separately, and the splicing result obtained after splicing the image features and the text features can be analyzed to obtain the semantic alignment result. This increases the dimensionality of the image-text features and the neural network complexity of the semantic alignment model, thereby improving the accuracy and reliability of training the image-text semantic alignment model.
In this optional implementation, optionally, as shown in fig. 2, analyzing, by the alignment structure, the image-text splicing feature obtained after splicing the image feature and the text feature of each image-text pair to obtain the semantic alignment result of each image-text pair may include:
performing, by a vector processing structure of the alignment structure, vector conversion processing on the image-text splicing feature obtained after splicing the image feature and the text feature of each image-text pair, to obtain a target matrix corresponding to each image-text pair;
and processing the target matrix corresponding to each image-text pair by the full connection layer of the alignment structure to obtain the confidence coefficient of matching the semantics of the sample image of the image-text pair with the semantics of the sample text, and taking the confidence coefficient as the semantic alignment result of the image-text pair.
Therefore, the optional implementation method can also utilize the vector processing structure to understand the semantics of the image features and the text features, and determine the image-text matching confidence coefficient of the image-text pairs through the full connection layer, so that the efficiency and the accuracy of determining the semantic alignment result of the image-text pairs through the semantic alignment model are improved.
In this optional implementation, further optionally, the processing, by the fully connected layer of the alignment structure, of the target matrix corresponding to each image-text pair to obtain a confidence that the semantics of the sample image of the image-text pair match the semantics of the sample text, as the semantic alignment result of the image-text pair, may include:
processing, by the fully connected layer of the alignment structure, the target matrix corresponding to each image-text pair to obtain one or more classification results and a confidence corresponding to each classification result, and determining the highest target confidence among the confidences corresponding to all the classification results as the confidence that the semantics of the sample image and the semantics of the sample text of the image-text pair match, to serve as the semantic alignment result of the image-text pair.
Therefore, by implementing the optional implementation mode, the highest target confidence coefficient in the confidence coefficients corresponding to the multiple classification results obtained by performing linear processing on the target matrix output by the vector processing structure by the full connection layer can be used as the confidence coefficient of image-text-to-semantic matching, so that the downstream task of the model training sample is matched with the processing mode of the full connection layer, and the accuracy and reliability of the semantic alignment model for obtaining the confidence coefficient of image-text-to-semantic matching are improved.
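For instance, if the fully connected layer outputs one logit per classification result, reading off the highest confidence could look like the following sketch (all values are hypothetical):

```python
import torch

logits = torch.tensor([[0.3, 2.1]])    # hypothetical output of the alignment structure's fully connected layer
probs = torch.softmax(logits, dim=-1)  # one confidence per classification result
confidence, _ = probs.max(dim=-1)      # highest target confidence, taken as the semantic alignment result
print(confidence.item())               # ~0.86
```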
102. And judging whether the semantic alignment model meets the convergence condition or not according to the semantic alignment results of all the image-text pairs and the actual matching results of all the pre-labeled image-text pairs.
As an optional implementation manner, judging whether the semantic alignment model meets the convergence condition according to the semantic alignment results of all the image-text pairs and the actual matching results of all the pre-labeled image-text pairs may include:
calculating a prediction loss value of a semantic alignment model according to the difference degree between the semantic alignment result of each image-text pair and a target confidence degree corresponding to the pre-labeled actual matching result of the image-text pair;
judging whether the predicted loss value is smaller than a preset loss value threshold value or not;
and when the judgment result is yes, determining that the semantic alignment model meets the convergence condition, and when the judgment result is no, determining that the semantic alignment model does not meet the convergence condition.
Therefore, by implementing the optional implementation method, the loss value of the semantic alignment model can be calculated according to the difference between the image-text-to-semantic-alignment confidence coefficient and the preset target confidence coefficient so as to judge whether the semantic alignment model meets the convergence condition, so that the accuracy and the reliability of judging whether the semantic alignment model meets the convergence condition are improved, and the matching degree of the training result of the semantic alignment model and the training purpose is further improved.
103. If the result of the determination in step 102 is negative, the model parameters of the semantic alignment model are corrected, and step 101 and step 102 are executed again.
104. And if the judgment result in the step 102 is yes, ending the current process to obtain the image-text semantic alignment model meeting the convergence condition.
In the embodiment of the present invention, optionally, the image-text semantic alignment model may be used to predict one or more of an image corresponding to any text, a text corresponding to any image, and a matching degree between any image and any text.
Therefore, by implementing the embodiment of the invention, the image-text semantic alignment model which can be used for predicting the image corresponding to any text, the text corresponding to any image and the matching degree between any image and text can be obtained by training the semantic alignment model through a plurality of image-text pairs, so that the image-text matching efficiency can be improved, and the diversity of image-text matching modes can be improved.
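Putting steps 101 to 104 together, a sketch of the overall training loop under the same assumptions as the model sketch above (the loss function, loss threshold, optimizer and data loader are hypothetical placeholders):

```python
import torch

def build_alignment_model(model, optimizer, loader, loss_fn,
                          loss_threshold=0.05, max_epochs=100):
    """Steps 101-104: analyze all image-text pairs, judge convergence against
    the labelled actual matching results, and correct the model parameters
    until the convergence condition is met."""
    for _ in range(max_epochs):
        total_loss, n_pairs = 0.0, 0
        for images, token_ids, labels in loader:      # step 101
            logits, _, _, _ = model(images, token_ids)
            loss = loss_fn(logits, labels)
            optimizer.zero_grad()
            loss.backward()                           # step 103: correct the model parameters
            optimizer.step()
            total_loss += loss.item() * labels.size(0)
            n_pairs += labels.size(0)
        if total_loss / n_pairs < loss_threshold:     # step 102: convergence condition
            break                                     # step 104: model obtained
    return model
```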
In an optional embodiment, as shown in fig. 2, after the image processing structure performs a feature extraction operation on the sample image of each image-text pair to obtain the image feature of each image-text pair, and before the alignment structure analyzes the image-text splicing feature obtained after splicing the image feature and the text feature of each image-text pair to obtain the semantic alignment result of each image-text pair, the method may further include:
performing feature conversion processing on the image features of each image-text pair by the full connection layer to update the image features of the image-text pairs, wherein the feature conversion processing is used for matching the feature attributes corresponding to the image features of each image-text pair with the feature attributes corresponding to the text features of the image-text pairs, and the feature attributes comprise feature dimensions and/or feature spaces;
wherein the output result of each preceding feature transformation structure is the input content of its succeeding neighboring feature transformation structure.
In this optional embodiment, preferably, the semantic alignment model may include 2 feature transformation structures, where a full connection layer in a first feature transformation structure is used to match a feature dimension corresponding to an image feature of each image-text pair with a feature dimension corresponding to a text feature of the image-text pair, and a full connection layer in a second feature transformation structure is used to match a feature space corresponding to an image feature of each image-text pair with a feature space corresponding to a text feature of the image-text pair.
Therefore, by implementing the optional embodiment, the feature attributes corresponding to the image features of the image-text pairs can be matched with the feature attributes corresponding to the text features through the full connection layer, so that the distribution difference between the image features and the text features is reduced, the possibility of successful splicing of the image features and the text features is improved, and the accuracy and the reliability of determining the semantic alignment result of the image-text pairs are further improved by comparing the image features and the text features on the premise of the same feature attributes.
In this optional embodiment, as an optional implementation, as shown in fig. 2, after performing a feature transformation process on the image feature of each image-text pair by the full connection layer to update the image feature of the image-text pair, the method may further include:
and carrying out nonlinear processing on the image characteristics of each image-text pair processed by the full connection layer by the nonlinear processing layer so as to update the image characteristics of the image-text pair.
In this optional embodiment, optionally, the performing, by the non-linear processing layer, non-linear processing on the image feature of each image-text pair processed by the full connection layer to update the image feature of the image-text pair may include:
performing, by the nonlinear processing layer, activation function operation processing on the image features of each image-text pair processed by the fully connected layer, based on a preset activation function;
and the nonlinear processing layer carries out random hiding processing on the value of one or more output neurons in the neural network layer corresponding to each image-text pair based on a preset random hiding mode and a random hiding probability so as to update the image characteristics of the image-text pair, wherein the neural network corresponding to each image-text pair comprises the neural network layer corresponding to the image characteristics obtained after the image-text pair is subjected to the operation processing by the activation function.
In this optional embodiment, further optionally, the performing, by the nonlinear processing layer, of random hiding processing on the values of one or more output neurons in the neural network layer corresponding to each image-text pair based on a preset random hiding mode and a random hiding probability, so as to update the image features of the image-text pair, may include:
and the nonlinear processing layer randomly converts the value of one or more output neurons in the neural network layer corresponding to each image-text pair into 0 based on a preset random hiding mode and a random hiding probability so as to update the image characteristics of the image-text pair.
In this optional embodiment, preferably, the activation function may be a GELU activation function, the random hiding manner may be a dropout hiding manner, and the random hiding probability may be 0.2.
Therefore, the optional implementation method performs the activation function operation processing on the image features after the feature conversion processing through the activation function, so that a non-linear factor can be introduced into the image features, and the semantic alignment model is further favorable for having the capability of solving the non-linear classification, and the image-text matching capability of the semantic alignment model is further improved.
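A minimal sketch of one such feature transformation structure under the preferred choices above (GELU activation, dropout with hiding probability 0.2); the dimensions and class names are illustrative assumptions:

```python
import torch.nn as nn

class FeatureTransform(nn.Module):
    """One feature transformation structure: fully connected layer + nonlinear processing layer."""

    def __init__(self, in_dim: int, out_dim: int, p_hide: float = 0.2):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)  # feature conversion processing
        self.act = nn.GELU()                  # activation function operation
        self.hide = nn.Dropout(p_hide)        # random hiding: zeroes output neurons with probability 0.2

    def forward(self, x):
        return self.hide(self.act(self.fc(x)))

# Two chained structures: the first matches the feature dimension of the image
# features to that of the text features, the second matches the feature space;
# the output of the preceding structure is the input of its succeeding neighbour.
image_projector = nn.Sequential(FeatureTransform(512, 768), FeatureTransform(768, 768))
```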
In yet another alternative embodiment, as shown in fig. 2, the semantic alignment model may further include a mosaic structure;
and before the vector processing structure of the alignment structure performs vector conversion processing on the image-text splicing feature obtained after splicing the image feature and the text feature of each image-text pair, to obtain the target matrix corresponding to each image-text pair, the method may further include:
and splicing the image features and the text features of the image-text pairs by the splicing structure according to the feature dimensions of the image features and the feature dimensions of the text features of each image-text pair to obtain image-text splicing features corresponding to each image-text pair.
For example, the feature dimension of the image feature of a certain image-text pair is [64,768], the feature dimension of the text feature is [64,768], and the feature dimension of the image-text splicing feature obtained after splicing is [64,1536].
Therefore, the optional embodiment can be implemented to splice the image features and the text features based on the feature dimensions, so that each feature dimension of the image features and each feature dimension of the text features are spliced in a one-to-one correspondence manner in the image-text feature splicing process, and the accuracy and the reliability of image-text feature splicing are improved.
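Using the dimensions from the example above, the splicing step can be sketched as follows (assuming PyTorch tensors):

```python
import torch

image_features = torch.randn(64, 768)  # feature dimension [64, 768]
text_features = torch.randn(64, 768)   # feature dimension [64, 768]
spliced = torch.cat([image_features, text_features], dim=-1)
print(spliced.shape)                   # torch.Size([64, 1536])
```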
Example two
Referring to fig. 3, fig. 3 is a schematic flow chart of another method for constructing an image-text semantic alignment model according to an embodiment of the present invention. The method described in fig. 3 may be applied to a process of constructing an image-text semantic alignment model based on any architecture, which is not limited in the embodiment of the present invention.
As shown in fig. 3, the method for constructing the image-text semantic alignment model may include the following operations:
201. and inputting a plurality of pre-determined image-text pairs into the semantic alignment model to be trained, so that the semantic alignment model analyzes each image-text pair to obtain a semantic alignment result of each image-text pair.
202. And calculating a prediction loss value of the semantic alignment model according to the difference degree between the semantic alignment result of each image-text pair and the target confidence degree corresponding to the pre-labeled actual matching result of the image-text pair.
In the embodiment of the present invention, optionally, the actual matching result of the positive example image-text pair is a first matching result that the sample image and the sample text are matched, and the actual matching result of the negative example image-text pair is a second matching result that the sample image and the sample text are not matched.
In an embodiment of the present invention, optionally, for a positive example image-text pair, when the confidence of image-text semantic matching of the pair is greater than or equal to the target confidence corresponding to the first matching result, the difference between the semantic alignment result of the pair and the target confidence corresponding to its actual matching result is 0; for a negative example image-text pair, when the confidence of image-text semantic matching of the pair is less than or equal to the target confidence corresponding to the second matching result, the difference is likewise 0. For example, if the target confidence corresponding to the first matching result is 0.8 and the confidence of image-text semantic matching of a certain positive example image-text pair is 0.9, the semantic alignment model has accurately predicted the semantic alignment result of that pair, so the difference between its semantic alignment result and the target confidence of its actual matching result is 0. This can reduce the situation in which the image-text pair matching confidence determined by the semantic alignment model deviates from the actual confidence because a label smoothing method is adopted.
As an optional embodiment, calculating a prediction loss value of the semantic alignment model according to a difference between the semantic alignment result of each image-text pair and a target confidence corresponding to an actual matching result of the pre-labeled image-text pair, may include:
and calculating a prediction loss value of the semantic alignment model according to the binary cross entropy loss function and the difference degree between the semantic alignment result of each image-text pair and a target confidence degree corresponding to the pre-labeled actual matching result of the image-text pair.
Therefore, the optional implementation method can calculate the prediction loss value of the semantic alignment model by using the binary cross entropy loss function, so that the semantic alignment model is regarded as a classification model based on two categories to calculate the model loss value, the difficulty of calculating the prediction loss value of the semantic alignment model is reduced, and the accuracy of loss calculation is improved.
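A sketch of this calculation, assuming the semantic alignment result is one matching logit per image-text pair and the targets are the smoothed confidences described below (all values hypothetical):

```python
import torch
import torch.nn.functional as F

match_logits = torch.tensor([2.0, -1.5, 0.4])       # semantic alignment results, one per image-text pair
target_confidence = torch.tensor([0.8, 0.05, 0.8])  # target confidences of the actual matching results
prediction_loss = F.binary_cross_entropy_with_logits(match_logits, target_confidence)
```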
203. And judging whether the predicted loss value is smaller than a preset loss value threshold value.
204. If the determination result in step 203 is negative, the model parameters of the semantic alignment model are corrected, and step 201, step 202, and step 203 are executed again.
205. And when the judgment result in the step 203 is yes, ending the current process to obtain the image-text semantic alignment model meeting the convergence condition.
In the embodiment of the present invention, for other descriptions of step 201, step 204, and step 205, please refer to the detailed description of step 101, step 103, and step 104 in the first embodiment, which is not repeated herein.
Therefore, by implementing the embodiment of the invention, an image-text semantic alignment model that can be used for predicting the image corresponding to any text, the text corresponding to any image, and the matching degree between any image and any text can be obtained by training the semantic alignment model on a plurality of image-text pairs, which improves both the efficiency of image-text matching and the diversity of image-text matching modes. Moreover, the loss value of the semantic alignment model is calculated according to the degree of difference between the image-text semantic alignment confidence and the preset target confidence in order to judge whether the semantic alignment model meets the convergence condition, which improves the accuracy and reliability of this judgment and further improves the degree to which the training result of the semantic alignment model matches the training purpose.
In an optional embodiment, before calculating a prediction loss value of the semantic alignment model according to a difference between a semantic alignment result of each graph-text pair and a target confidence corresponding to an actual matching result of the pre-labeled graph-text pair, the method may further include:
updating the initial confidence corresponding to each actual matching result according to a preset label smoothing coefficient to obtain the target confidence corresponding to each actual matching result.
In this optional embodiment, optionally, the target confidence corresponding to the first matching result and the target confidence corresponding to the second matching result are respectively:
P1=1-ε,
P2=ε/(N-1),
wherein, P1 is used for representing the target confidence corresponding to the first matching result, P2 is used for representing the target confidence corresponding to the second matching result, epsilon is used for representing the label smoothing coefficient, and N is used for representing the number of all negative example image-text pairs.
For example, if the initial confidence corresponding to the first matching result is 1 and the initial confidence corresponding to the second matching result is 0, then with ε = 0.2 the target confidence corresponding to the first matching result is P1 = 1 - 0.2 = 0.8, and the target confidence corresponding to the second matching result is P2 = 0.2/(N-1).
In this alternative embodiment, the target confidence corresponding to each actual matching result is the similarity label corresponding to the actual matching result.
Therefore, implementing this optional embodiment performs label smoothing on the required target confidences, namely the similarity labels, using the label smoothing coefficient, which reduces the occurrence of overfitting when training the semantic alignment model. It also allows image-text pairs whose semantics are not completely matched, as well as image-text pairs with similarity across different subclasses, to be used as training samples, further improving the robustness of the semantic alignment model.
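The formulas above translate directly into code; the following minimal sketch (the function name and example values are illustrative) computes the smoothed target confidences:

```python
def smoothed_targets(epsilon: float, num_negative_pairs: int):
    """Target confidences per the formulas above: epsilon is the label
    smoothing coefficient, num_negative_pairs is N."""
    p1 = 1.0 - epsilon                       # target for the first matching result
    p2 = epsilon / (num_negative_pairs - 1)  # target for the second matching result
    return p1, p2

# Example: smoothed_targets(0.2, 5) returns (0.8, 0.05).
```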
In another optional embodiment, before determining whether the predicted loss value is less than the preset loss value threshold, the method may further include:
determining the similarity between the target image characteristic and the target text characteristic of each image-text pair determined based on the semantic alignment model as the similarity corresponding to the image-text pair;
and updating the prediction loss value according to the corresponding similarity of all the image-text pairs and the actual matching result of all the image-text pairs.
Therefore, by implementing the optional embodiment, the similarity between the target image feature and the target text feature of each image-text pair is used as a factor for calculating the loss value of the semantic alignment model, so that the accuracy and comprehensiveness of the loss of the calculation model are improved, and the semantic alignment accuracy of the semantic alignment model is improved.
In this optional embodiment, as an optional implementation, updating the prediction loss value according to the similarity corresponding to all the image-text pairs and the actual matching result of all the image-text pairs may include:
calculating a cosine loss value of the semantic alignment model according to the cosine loss function, the similarity corresponding to all image-text pairs and the actual matching result of all image-text pairs;
and updating the predicted loss value according to the cosine loss value.
Therefore, by implementing the optional implementation method, the cosine loss value of the semantic alignment model of the image-text pair can be calculated by using the cosine loss function, and the accuracy and the reliability of calculating the prediction loss value of the semantic alignment model can be improved.
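As a hedged illustration, one way to add such a cosine loss term is via PyTorch's CosineEmbeddingLoss; the weighting coefficient alpha and the +1/-1 label encoding are assumptions, since the text does not fix them:

```python
import torch.nn as nn

def update_loss_with_cosine(pred_loss, img_feats, txt_feats, labels, alpha=1.0):
    """Update the prediction loss value with a cosine loss term (sketch).

    img_feats, txt_feats -- target image / text features, shape (B, D)
    labels -- +1 for positive example pairs, -1 for negative example pairs
    alpha  -- weighting coefficient (an assumption, not from the text)
    """
    cosine_loss = nn.CosineEmbeddingLoss()(img_feats, txt_feats, labels)
    return pred_loss + alpha * cosine_loss
```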
In this optional embodiment, as an optional implementation, before determining the similarity between the target image feature and the target text feature of each image-text pair, the method may further include:
and for each image-text pair, segmenting a target matrix corresponding to the image-text pair output by the vector processing structure into a target image feature and a target text feature according to an input feature dimension corresponding to the input content of the vector processing structure of the semantic alignment model in the process of analyzing the image-text pair by the semantic alignment model, wherein the input content comprises image-text splicing features determined by the semantic alignment model based on the sample image and the sample text of each image-text pair.
For example, for a certain image-text pair, the feature dimension of the image feature that the semantic alignment model determines based on the sample image of the image-text pair is [64,768], and the feature dimension of the text feature determined based on the sample text of the image-text pair is [64,768]; the feature dimension of the image-text splicing feature obtained after splicing the image feature and the text feature is [64,1536], and this serves as the input feature dimension corresponding to the input content of the vector processing structure. Accordingly, the feature dimensions of the target image feature and the target text feature obtained by segmenting the target matrix output by the vector processing structure for the image-text pair are likewise [64,768].
Therefore, by implementing this optional implementation, through segmenting the target matrix output by the vector processing structure, the trained image-text semantic alignment model can perform semantic alignment directly using the similarity between the image features and the text features of the image-text pair to be predicted, which reduces unnecessary splicing operations in the actual application of the image-text semantic alignment model and improves its analysis efficiency.
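Using the dimensions from the example above, segmenting the target matrix is a plain slice at the splicing boundary; the sketch below uses random data in place of the vector processing structure's real output:

```python
import torch

image_dim = 768                        # feature dimension of the image half
target_matrix = torch.randn(64, 1536)  # stand-in for the structure's output

target_image_feature = target_matrix[:, :image_dim]  # shape [64, 768]
target_text_feature = target_matrix[:, image_dim:]   # shape [64, 768]
```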
In yet another optional embodiment, before inputting the plurality of predetermined image-text pairs into the semantic alignment model to be trained so that the semantic alignment model analyzes each image-text pair to obtain the semantic alignment result of each image-text pair, the method may further include:
combining the sample image of any positive example image-text pair, among a plurality of image-text-matched positive example image-text pairs prepared in advance, with the sample text of any other positive example image-text pair to obtain a plurality of negative example image-text pairs;
and determining one or more positive example image-text pairs and one or more negative example image-text pairs as the image-text pairs for training the semantic alignment model to be trained.
Therefore, by implementing the optional embodiment, the sample images and the sample texts of the positive example image-text pairs can be shuffled and recombined to obtain the negative example image-text pairs, so that the efficiency and the quantity of obtaining the negative example image-text pairs are improved.
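A minimal sketch of this recombination, assuming each positive example pair is stored as an (image, text) tuple:

```python
import random

def make_negative_pairs(positive_pairs):
    """Pair each positive example's sample image with the sample text of
    a different, randomly chosen positive example pair (sketch)."""
    negatives = []
    for i, (image, _) in enumerate(positive_pairs):
        j = random.choice([k for k in range(len(positive_pairs)) if k != i])
        negatives.append((image, positive_pairs[j][1]))
    return negatives
```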
EXAMPLE III
Referring to fig. 4, fig. 4 is a schematic structural diagram of another apparatus for constructing an image-text semantic alignment model according to an embodiment of the present invention. The apparatus for constructing the image-text semantic alignment model described in fig. 4 may be applied to the construction process of an image-text semantic alignment model based on any architecture, which is not limited in the embodiment of the present invention.
As shown in fig. 4, the apparatus for constructing the image-text semantic alignment model may include:
the input module 301 is configured to input a plurality of pre-determined image-text pairs into a semantic alignment model to be trained, so that the semantic alignment model analyzes each image-text pair to obtain a semantic alignment result of each image-text pair, where each image-text pair includes a sample image and a sample text, and the semantic alignment result is used to indicate a matching degree between the sample image and the sample text in the corresponding image-text pair;
the judging module 302 is configured to judge whether the semantic alignment model satisfies a convergence condition according to the semantic alignment results of all image-text pairs and actual matching results of all pre-labeled image-text pairs;
a correcting module 303, configured to, when the judging module 302 determines that the semantic alignment model does not satisfy the convergence condition, correct the model parameters of the semantic alignment model, trigger the input module 301 to re-execute the operation of inputting the plurality of predetermined image-text pairs into the semantic alignment model to be trained so that the semantic alignment model analyzes each image-text pair to obtain the semantic alignment result of each image-text pair, and trigger the judging module 302 to re-execute the operation of judging whether the semantic alignment model satisfies the convergence condition according to the semantic alignment results of all image-text pairs and the pre-labeled actual matching results of all image-text pairs, until an image-text semantic alignment model satisfying the convergence condition is obtained, where the image-text semantic alignment model is used to predict one or more of: the image corresponding to any text, the text corresponding to any image, and the matching degree between any image and any text.
Therefore, by implementing the device described in fig. 4, the image-text semantic alignment model for predicting the image corresponding to any text, the text corresponding to any image, and the matching degree between any image and the text can be obtained by training the semantic alignment model through a plurality of image-text pairs, so that the image-text matching efficiency can be improved, and the diversity of image-text matching modes can be improved.
In an alternative embodiment, as shown in FIG. 4, the semantic alignment model includes an image processing structure, a text processing structure, and an alignment structure;
the specific manner of analyzing each image-text pair by the semantic alignment model to obtain the semantic alignment result of each image-text pair may include:
carrying out feature extraction operation on the sample image of each image-text pair by the image processing structure to obtain the image feature of each image-text pair, and carrying out feature extraction operation on the sample text of each image-text pair by the text processing structure to obtain the text feature of each image-text pair;
and analyzing the image characteristic of each image-text pair and the image-text splicing characteristic obtained after the text characteristic is spliced by the alignment structure to obtain a semantic alignment result of each image-text pair.
Therefore, the device described in fig. 4 can also be implemented to extract the image features of the sample image in the image-text pair and the text features of the sample text, and analyze the splicing result obtained after splicing the image-text features and the text features to obtain the semantic alignment result, so that the dimensionality of the image-text features is increased, the neural network complexity of the semantic alignment model is improved, and the accuracy and the reliability of training the image-text semantic alignment model are improved.
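A minimal sketch of this three-part structure follows; the stand-in encoders (plain linear layers), the input dimensions, and the shape of the alignment head are assumptions, since the text does not prescribe concrete networks:

```python
import torch
import torch.nn as nn

class SemanticAlignmentModel(nn.Module):
    """Sketch: image/text processing structures plus an alignment structure."""

    def __init__(self, dim=768):
        super().__init__()
        self.image_processing = nn.Linear(2048, dim)  # stand-in image processing structure
        self.text_processing = nn.Linear(512, dim)    # stand-in text processing structure
        self.alignment = nn.Sequential(               # stand-in alignment structure
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, image_input, text_input):
        img_feat = self.image_processing(image_input)      # image feature
        txt_feat = self.text_processing(text_input)        # text feature
        spliced = torch.cat([img_feat, txt_feat], dim=-1)  # image-text splicing feature
        return self.alignment(spliced).squeeze(-1)         # matching confidence
```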
In another alternative embodiment, as shown in fig. 4, the semantic alignment model further includes one or more feature transformation structures, each feature transformation structure including at least a fully connected layer;
the full connection layer is used for performing feature extraction operation on a sample image of each image-text pair in the image processing structure to obtain an image feature of each image-text pair, analyzing the image feature of each image-text pair and an image-text splicing feature obtained after text feature splicing by the alignment structure, and performing feature conversion processing on the image feature of each image-text pair to update the image feature of the image-text pair before a semantic alignment result of each image-text pair is obtained, wherein the feature conversion processing is used for matching a feature attribute corresponding to the image feature of each image-text pair with a feature attribute corresponding to the text feature of the image-text pair, and the feature attributes comprise feature dimensions and/or feature spaces;
wherein the output result of each preceding feature transformation structure is the input content of its succeeding neighboring feature transformation structure.
It can be seen that the implementation of the apparatus described in fig. 4 can also match the feature attributes corresponding to the image features of the image-text pairs with the feature attributes corresponding to the text features through the full connection layer, thereby reducing the distribution difference between the image features and the text features, improving the possibility of successful splicing of the image features and the text features, and being beneficial to further improving the accuracy and reliability of determining the semantic alignment result of the image-text pairs by comparing the image features and the text features on the premise of the same feature attributes.
In yet another alternative embodiment, as shown in FIG. 4, each feature transformation structure further comprises a non-linear processing layer;
the nonlinear processing layer is used for carrying out characteristic conversion processing on the image characteristics of each image-text pair in the full connection layer so as to update the image characteristics of the image-text pair, and then carrying out nonlinear processing on the image characteristics of each image-text pair processed by the full connection layer so as to update the image characteristics of the image-text pair;
the non-linear processing layer performs non-linear processing on the image features of each image-text pair processed by the full-connection layer, and the specific mode for updating the image features of the image-text pairs comprises the following steps:
the nonlinear processing layer performs activation function operation processing, based on a preset activation function, on the image features of each image-text pair processed by the full connection layer;
and the nonlinear processing layer carries out random hiding processing on the value of one or more output neurons in the neural network layer corresponding to each image-text pair based on a preset random hiding mode and a random hiding probability so as to update the image characteristics of the image-text pair, wherein the neural network corresponding to each image-text pair comprises the neural network layer corresponding to the image characteristics obtained after the image-text pair is operated and processed by an activation function.
It can be seen that, the implementation of the apparatus described in fig. 4 can also perform activation function operation processing on the image features after feature conversion processing by using an activation function, so that a non-linear factor can be introduced into the image features, which is favorable for making the semantic alignment model have the capability of solving non-linear classification, and further improving the image-text matching capability of the semantic alignment model.
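Reading the full connection layer, the activation function operation, and the random hiding together, one feature transformation structure can be sketched as below; the choice of ReLU and the hiding probability are assumptions:

```python
import torch.nn as nn

class FeatureTransformStructure(nn.Module):
    """One feature transformation structure (sketch): a full connection
    layer for feature conversion, then the nonlinear processing layer
    (activation function operation plus random hiding, realised here as
    ReLU and Dropout)."""

    def __init__(self, in_dim=768, out_dim=768, hide_prob=0.1):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)  # feature conversion processing
        self.act = nn.ReLU()                  # activation function operation
        self.hide = nn.Dropout(hide_prob)     # random hiding of output neurons

    def forward(self, image_features):
        return self.hide(self.act(self.fc(image_features)))

# Structures can be chained so that each one's output feeds the next:
# transform = nn.Sequential(FeatureTransformStructure(), FeatureTransformStructure())
```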
In yet another optional embodiment, as shown in fig. 4, the analyzing, by the alignment structure, of the image-text splicing feature obtained after splicing the image feature and the text feature of each image-text pair may include:
performing, by the vector processing structure of the alignment structure, vector conversion processing on the image-text splicing feature obtained after splicing the image feature and the text feature of each image-text pair to obtain a target matrix corresponding to each image-text pair;
and processing the target matrix corresponding to each image-text pair by the full connection layer of the alignment structure to obtain the confidence coefficient of matching the semantics of the sample image of the image-text pair with the semantics of the sample text, and taking the confidence coefficient as the semantic alignment result of the image-text pair.
Therefore, the device described in fig. 4 can also understand the semantics of the image features and the text features by using the vector processing structure, and determine the image-text matching confidence of the image-text pairs through the full connection layer, thereby improving the efficiency and accuracy of determining the semantic alignment result of the image-text pairs through the semantic alignment model.
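The text leaves the vector processing structure unspecified; assuming a Transformer encoder layer in that role and mean pooling before the full connection layer (both assumptions), the alignment structure might be sketched as:

```python
import torch.nn as nn

class AlignmentStructure(nn.Module):
    """Sketch of the alignment structure: a vector processing structure
    produces the target matrix; a full connection layer maps it to the
    confidence that the sample image and sample text semantics match."""

    def __init__(self, dim=1536, heads=8):
        super().__init__()
        self.vector_processing = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, spliced_features):           # shape (B, 64, 1536)
        target_matrix = self.vector_processing(spliced_features)
        pooled = target_matrix.mean(dim=1)         # simple pooling (assumption)
        return self.head(pooled).squeeze(-1)       # matching confidence per pair
```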
In yet another alternative embodiment, as shown in fig. 4, the semantic alignment result of each image-text pair includes a confidence that the semantics of the sample image and the semantics of the sample text of the image-text pair match;
the specific manner of the determining module 302 determining whether the semantic alignment model satisfies the convergence condition according to the semantic alignment results of all the image-text pairs and the pre-labeled actual matching results of all the image-text pairs may include:
calculating a prediction loss value of a semantic alignment model according to the difference degree between the semantic alignment result of each image-text pair and a target confidence degree corresponding to the pre-labeled actual matching result of the image-text pair;
judging whether the predicted loss value is smaller than a preset loss value threshold value or not;
and when the judgment result is yes, determining that the semantic alignment model meets the convergence condition, and when the judgment result is no, determining that the semantic alignment model does not meet the convergence condition.
Therefore, by implementing the optional embodiment, the loss value of the semantic alignment model can be calculated according to the difference between the image-text-to-semantic-alignment confidence coefficient and the preset target confidence coefficient so as to judge whether the semantic alignment model meets the convergence condition, so that the accuracy and reliability of judging whether the semantic alignment model meets the convergence condition are improved, and the matching degree of the training result of the semantic alignment model and the training purpose is further improved.
In yet another optional embodiment, as shown in fig. 5, all the image-text pairs include at least one positive example image-text pair and/or at least one negative example image-text pair, the actual matching result of the positive example image-text pair is a first matching result indicating that the sample image and the sample text match, and the actual matching result of the negative example image-text pair is a second matching result indicating that the sample image and the sample text do not match;
the apparatus may further include:
a first updating module 304, configured to update the initial confidence corresponding to each actual matching result according to a preset label smoothing coefficient before the judging module 302 calculates the prediction loss value of the semantic alignment model according to the difference degree between the semantic alignment result of each image-text pair and the target confidence corresponding to the pre-labeled actual matching result of the image-text pair, so as to obtain the target confidence corresponding to each actual matching result;
wherein, the target confidence corresponding to the first matching result and the target confidence corresponding to the second matching result are respectively:
P1=1-ε,
P2=ε/(N-1),
wherein, P1 is used for representing the target confidence corresponding to the first matching result, P2 is used for representing the target confidence corresponding to the second matching result, epsilon is used for representing the label smoothing coefficient, and N is used for representing the number of all negative example image-text pairs.
Therefore, by implementing the device described in fig. 5, label smoothing can be performed on the required target confidences, i.e., the similarity labels, using the label smoothing coefficient, which reduces the occurrence of overfitting when training the semantic alignment model, allows image-text pairs whose semantics are not completely matched, as well as image-text pairs with similarity across different subclasses, to be used as training samples, and further improves the semantic alignment robustness of the semantic alignment model.
In yet another alternative embodiment, as shown in fig. 5, the apparatus may further include:
a determining module 305, configured to determine, before the judging module 302 judges whether the predicted loss value is smaller than the preset loss value threshold, the similarity between the target image feature and the target text feature of each image-text pair determined based on the semantic alignment model, as the similarity corresponding to the image-text pair;
a second updating module 306, configured to update the predicted loss value according to the similarity corresponding to all the image-text pairs and the actual matching result of all the image-text pairs;
and, the apparatus may further include:
the feature segmentation module 307 is configured to segment, for each image-text pair, a target matrix corresponding to the image-text pair output by the vector processing structure into a target image feature and a target text feature according to an input feature dimension corresponding to input content of the vector processing structure of the semantic alignment model in a process of analyzing the image-text pair by the semantic alignment model, where the input content includes image-text splicing features determined by the semantic alignment model based on the sample image and the sample text of each image-text pair.
Therefore, the device described in fig. 5 can also use the similarity between the target image feature and the target text feature, obtained by segmenting the target matrix output by the vector processing structure, as a factor in calculating the loss value of the semantic alignment model, which improves the accuracy and comprehensiveness of the model loss calculation and the semantic alignment accuracy of the semantic alignment model. It also enables the trained image-text semantic alignment model to perform semantic alignment directly from the similarity between the image features and the text features of an image-text pair, reducing unnecessary splicing operations in the actual application of the image-text semantic alignment model and improving its analysis efficiency.
Example four
Referring to fig. 6, fig. 6 is a schematic structural diagram of yet another apparatus for constructing an image-text semantic alignment model according to an embodiment of the present invention. As shown in fig. 6, the apparatus for constructing the image-text semantic alignment model may include:
a memory 401 storing executable program code;
a processor 402 coupled to a memory 401;
the processor 402 calls the executable program code stored in the memory 401 to execute the steps in the method for constructing the image-text semantic alignment model described in the first embodiment or the second embodiment of the present invention.
EXAMPLE five
The embodiment of the invention discloses a computer storage medium, which stores computer instructions, and when the computer instructions are called, the computer instructions are used for executing the steps of the method for constructing the image-text semantic alignment model described in the first embodiment or the second embodiment of the invention.
Example six
An embodiment of the present invention discloses a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute the steps in the method for constructing the image-text semantic alignment model described in the first embodiment or the second embodiment.
The above-described embodiments of the apparatus are merely illustrative, and the modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above detailed description of the embodiments, those skilled in the art will clearly understand that the embodiments may be implemented by software plus a necessary general hardware platform, or by hardware. Based on this understanding, the essence of the above technical solutions, or the part contributing to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, including a Read-Only Memory (ROM), a Random Access Memory (RAM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-time Programmable Read-Only Memory (OTPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disc storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium capable of storing data.
Finally, it should be noted that the method and the device for constructing the image-text semantic alignment model disclosed in the embodiments of the present invention are only preferred embodiments of the present invention, used only for illustrating the technical solutions of the present invention rather than limiting them. Although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for constructing an image-text semantic alignment model, characterized by comprising the following steps:
inputting a plurality of pre-determined image-text pairs into a semantic alignment model to be trained, so that the semantic alignment model analyzes each image-text pair to obtain a semantic alignment result of each image-text pair, wherein each image-text pair comprises a sample image and a sample text, and the semantic alignment result is used for representing the matching degree of the sample image and the sample text in the corresponding image-text pair;
judging whether the semantic alignment model meets a convergence condition or not according to the semantic alignment results of all the image-text pairs and the pre-labeled actual matching results of all the image-text pairs;
and if not, correcting the model parameters of the semantic alignment model, and re-executing the operation of inputting the predetermined image-text pairs into the semantic alignment model to be trained so that the semantic alignment model analyzes each image-text pair to obtain the semantic alignment result of each image-text pair, executing the operation of judging whether the semantic alignment model meets the convergence condition according to the semantic alignment results of all the image-text pairs and the pre-labeled actual matching results of all the image-text pairs until the image-text semantic alignment model meeting the convergence condition is obtained, wherein the image-text semantic alignment model is used for predicting one or more of images corresponding to any texts, texts corresponding to any images, and matching degrees between any images and any texts.
2. The method for constructing the image-text semantic alignment model according to claim 1, wherein the semantic alignment model comprises an image processing structure, a text processing structure and an alignment structure;
the semantic alignment model analyzes each image-text pair to obtain a semantic alignment result of each image-text pair, and the semantic alignment result comprises the following steps:
carrying out feature extraction operation on the sample image of each image-text pair by the image processing structure to obtain the image feature of each image-text pair, and carrying out feature extraction operation on the sample text of each image-text pair by the text processing structure to obtain the text feature of each image-text pair;
and analyzing the image characteristic of each image-text pair and the image-text splicing characteristic obtained after the text characteristic is spliced by the alignment structure to obtain a semantic alignment result of each image-text pair.
3. The method for constructing an image-text semantic alignment model according to claim 2, wherein the semantic alignment model further comprises one or more feature transformation structures, each feature transformation structure comprising at least a full connection layer;
after the image processing structure performs the feature extraction operation on the sample image of each image-text pair to obtain the image feature of each image-text pair, and before the alignment structure analyzes the image feature of each image-text pair and the image-text splicing feature obtained after the text feature is spliced to obtain the semantic alignment result of each image-text pair, the method further includes:
performing feature conversion processing on the image features of each image-text pair by the full connection layer to update the image features of the image-text pair, wherein the feature conversion processing is used for matching feature attributes corresponding to the image features of each image-text pair with feature attributes corresponding to the text features of the image-text pair, and the feature attributes comprise feature dimensions and/or feature spaces;
wherein the output result of each preceding feature transformation structure is the input content of its succeeding neighboring feature transformation structure.
4. The method for constructing an image-text semantic alignment model according to claim 3, wherein each feature transformation structure further comprises a nonlinear processing layer;
and after the feature conversion processing is performed on the image features of each image-text pair by the full connection layer so as to update the image features of the image-text pair, the method further comprises the following steps:
carrying out nonlinear processing on the image characteristics of each image-text pair processed by the full connection layer by the nonlinear processing layer so as to update the image characteristics of the image-text pair;
the non-linear processing layer performs non-linear processing on the image features of each image-text pair processed by the full connection layer to update the image features of the image-text pair, and the non-linear processing layer comprises:
performing, by the nonlinear processing layer, activation function operation processing on the image features of each image-text pair processed by the full connection layer based on a preset activation function;
and the nonlinear processing layer carries out random hiding processing on the values of one or more output neurons in the neural network layer corresponding to each image-text pair based on a preset random hiding mode and a random hiding probability so as to update the image characteristics of the image-text pair, wherein the neural network corresponding to each image-text pair comprises the neural network layer corresponding to the image characteristics obtained after the image-text pair is operated and processed by the activating function.
5. The method for constructing an image-text semantic alignment model according to any one of claims 2-4, wherein the analyzing, by the alignment structure, of the image-text splicing feature obtained after splicing the image feature and the text feature of each image-text pair to obtain the semantic alignment result of each image-text pair comprises:
performing, by a vector processing structure of the alignment structure, vector conversion processing on the image-text splicing feature obtained after splicing the image feature and the text feature of each image-text pair to obtain a target matrix corresponding to each image-text pair;
and processing the target matrix corresponding to each image-text pair by the full connection layer of the alignment structure to obtain a confidence coefficient that the semantics of the sample image of the image-text pair are matched with the semantics of the sample text, and taking the confidence coefficient as a semantic alignment result of the image-text pair.
6. The method for constructing the image-text semantic alignment model according to any one of claims 1 to 4, wherein the semantic alignment result of each image-text pair comprises a confidence that the semantics of the sample image and the semantics of the sample text of the image-text pair match;
the step of judging whether the semantic alignment model meets the convergence condition according to the semantic alignment results of all the image-text pairs and the pre-labeled actual matching results of all the image-text pairs comprises the following steps:
calculating a prediction loss value of the semantic alignment model according to the difference degree between the semantic alignment result of each image-text pair and a target confidence degree corresponding to the pre-labeled actual matching result of the image-text pair;
judging whether the predicted loss value is smaller than a preset loss value threshold value or not;
and when the judgment result is yes, determining that the semantic alignment model meets the convergence condition, and when the judgment result is no, determining that the semantic alignment model does not meet the convergence condition.
7. The method for constructing an image-text semantic alignment model according to claim 6, wherein all the image-text pairs comprise at least one positive example image-text pair and/or at least one negative example image-text pair, the actual matching result of the positive example image-text pair is a first matching result indicating that the sample image and the sample text match, and the actual matching result of the negative example image-text pair is a second matching result indicating that the sample image and the sample text do not match;
before the calculating a prediction loss value of the semantic alignment model according to a difference between the semantic alignment result of each image-text pair and a target confidence corresponding to a pre-labeled actual matching result of the image-text pair, the method further includes:
updating the initial confidence corresponding to each actual matching result according to a preset label smoothing coefficient to obtain a target confidence corresponding to each actual matching result;
wherein the target confidence corresponding to the first matching result and the target confidence corresponding to the second matching result are respectively:
P1=1-ε,
P2=ε/(N-1),
wherein P1 is used to represent the target confidence corresponding to the first matching result, P2 is used to represent the target confidence corresponding to the second matching result, epsilon is used to represent the label smoothing coefficient, and N is used to represent the number of all negative example image-text pairs.
8. The method for constructing an image-text semantic alignment model according to claim 6, wherein before the determining whether the predicted loss value is smaller than a preset loss value threshold, the method further comprises:
determining the similarity between the target image characteristic and the target text characteristic of each image-text pair determined based on the semantic alignment model as the similarity corresponding to the image-text pair;
updating the prediction loss value according to the similarity corresponding to all the image-text pairs and the actual matching result of all the image-text pairs;
and, prior to said determining a similarity between the target image feature and the target text feature of each said image-text pair, the method further comprises:
for each image-text pair, segmenting a target matrix corresponding to the image-text pair output by the vector processing structure into a target image feature and a target text feature according to an input feature dimension corresponding to the input content of the vector processing structure of the semantic alignment model in the process of analyzing the image-text pair by the semantic alignment model, wherein the input content comprises image-text splicing features determined by the semantic alignment model based on the sample image and the sample text of each image-text pair.
9. An apparatus for constructing an image-text semantic alignment model, the apparatus comprising:
the input module is used for inputting a plurality of pre-determined image-text pairs into a semantic alignment model to be trained so that the semantic alignment model analyzes each image-text pair to obtain a semantic alignment result of each image-text pair, each image-text pair comprises a sample image and a sample text, and the semantic alignment result is used for representing the matching degree of the sample image and the sample text in the corresponding image-text pair;
the judging module is used for judging whether the semantic alignment model meets a convergence condition or not according to the semantic alignment results of all the image-text pairs and the pre-labeled actual matching results of all the image-text pairs;
and the correction module is used for correcting the model parameters of the semantic alignment model when the judgment module judges that the semantic alignment model does not meet the convergence condition, triggering the input module to execute the operation of inputting the predetermined image-text pairs into the semantic alignment model to be trained again so that the semantic alignment model analyzes each image-text pair to obtain the semantic alignment result of each image-text pair, triggering the judgment module to execute the operation of judging whether the semantic alignment model meets the convergence condition according to the semantic alignment results of all the image-text pairs and the pre-labeled actual matching results of all the image-text pairs until the image-text semantic alignment model meeting the convergence condition is obtained, wherein the image-text semantic alignment model is used for predicting one or more of images corresponding to any text, texts corresponding to any images, and matching degrees between any images and any texts.
10. An apparatus for constructing an image-text semantic alignment model, the apparatus comprising:
a memory storing executable program code;
a processor coupled with the memory;
the processor calls the executable program code stored in the memory to execute the method for constructing the image-text semantic alignment model according to any one of claims 1-8.
CN202211108881.XA 2022-09-13 2022-09-13 Method and device for constructing image-text semantic alignment model Pending CN115455225A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211108881.XA CN115455225A (en) 2022-09-13 2022-09-13 Method and device for constructing image-text semantic alignment model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211108881.XA CN115455225A (en) 2022-09-13 2022-09-13 Method and device for constructing image-text semantic alignment model

Publications (1)

Publication Number Publication Date
CN115455225A true CN115455225A (en) 2022-12-09

Family

ID=84303130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211108881.XA Pending CN115455225A (en) 2022-09-13 2022-09-13 Method and device for constructing image-text semantic alignment model

Country Status (1)

Country Link
CN (1) CN115455225A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115860587A (en) * 2023-03-02 2023-03-28 广州市玄武无线科技股份有限公司 Visit assessment method, device, equipment and storage medium based on image-text matching


Similar Documents

Publication Publication Date Title
CN112492343A (en) Video live broadcast monitoring method and related device
CN113596007A (en) Vulnerability attack detection method and device based on deep learning
CN112086087B (en) Speech recognition model training method, speech recognition method and device
CN107832300A (en) Towards minimally invasive medical field text snippet generation method and device
CN112257437A (en) Voice recognition error correction method and device, electronic equipment and storage medium
CN112214984A (en) Content plagiarism identification method, device, equipment and storage medium
EP4191544A1 (en) Method and apparatus for recognizing token, electronic device and storage medium
CN115168541A (en) Chapter event extraction method and system based on frame semantic mapping and type perception
CN115495553A (en) Query text ordering method and device, computer equipment and storage medium
CN115455225A (en) Method and device for constructing image-text semantic alignment model
CN112215236A (en) Text recognition method and device, electronic equipment and storage medium
CN115357699A (en) Text extraction method, device, equipment and storage medium
CN113268985B (en) Relationship path-based remote supervision relationship extraction method, device and medium
CN117251559B (en) Engineering standard specification acquisition method and system based on natural language big model
CN116644183B (en) Text classification method, device and storage medium
CN112766051A (en) Attention-based image character recognition method and device
CN116680385A (en) Dialogue question-answering method and device based on artificial intelligence, computer equipment and medium
CN115909381A (en) Text image recognition method, system and related device
CN115238124A (en) Video character retrieval method, device, equipment and storage medium
CN115828848A (en) Font generation model training method, device, equipment and storage medium
CN113268588A (en) Text abstract extraction method, device, equipment, storage medium and program product
CN113157880A (en) Element content obtaining method, device, equipment and storage medium
CN115292455B (en) Training method and device of image-text matching model
CN115100419B (en) Target detection method and device, electronic equipment and storage medium
CN114328883B (en) Data processing method, device, equipment and medium for machine reading understanding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination