CN115292455A - Training method and device of image-text matching model - Google Patents


Info

Publication number
CN115292455A
CN115292455A
Authority
CN
China
Prior art keywords
text
image
initial
training
data
Prior art date
Legal status
Granted
Application number
CN202211219395.5A
Other languages
Chinese (zh)
Other versions
CN115292455B (en)
Inventor
陈畅新
李展铿
Current Assignee
Youmi Technology Co ltd
Original Assignee
Youmi Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Youmi Technology Co ltd filed Critical Youmi Technology Co ltd
Priority to CN202211219395.5A priority Critical patent/CN115292455B/en
Publication of CN115292455A publication Critical patent/CN115292455A/en
Application granted granted Critical
Publication of CN115292455B publication Critical patent/CN115292455B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/334 — Information retrieval; querying of unstructured textual data; query execution
    • G06F16/35 — Information retrieval; clustering or classification of unstructured textual data
    • G06F16/583 — Information retrieval of still image data; retrieval using metadata automatically derived from the content

Abstract

The invention discloses a method and a device for training an image-text matching model. The method comprises: acquiring a training data set for image-text matching model training, wherein the training data set comprises a plurality of text data and a plurality of image data; inputting each text data into a target text model and each image data into a target image model to obtain text coding vectors and image coding vectors; determining initial training image-text data sets; inputting all the initial training image-text data sets into a preset initial image-text matching model to obtain an initial training data output result; determining initial loss information of the initial image-text matching model based on the initial training data output result; and determining the initial image-text matching model as the target image-text matching model if the initial loss information meets a training completion condition. The method can thus improve the efficiency of image-text matching model training, and the trained model can support image-text cross retrieval and multi-modal data classification.

Description

Training method and device of image-text matching model
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a method and a device for training an image-text matching model.
Background
In daily life, artificial intelligence has been widely applied. In the early stages of deep learning, most models concentrated on a single field such as computer vision or natural language processing, and the connection between the two fields was not deeply explored. At present, model training is usually supervised training based on manually labeled data sets, which consumes both labor cost and a great deal of time, so the efficiency of model training is low. It can be seen that how to train an image-text matching model so as to improve the efficiency of model training is a technical problem yet to be solved by those skilled in the art.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method and a device for training an image-text matching model, which can help improve the efficiency of training the image-text matching model, and can realize image-text cross retrieval and multi-modal data classification based on the trained image-text matching model.
In order to solve the above technical problem, a first aspect of the present invention discloses a method for training an image-text matching model, where the method includes:
acquiring a training data set for image-text matching model training, wherein the training data set comprises a plurality of text data and a plurality of image data;
for each piece of text data, inputting the text data into a target text model to obtain a text coding vector, and for each piece of image data, inputting the image data into a target image model to obtain an image coding vector;
for each text coding vector, determining an image coding vector matched with the text coding vector from all the image coding vectors, and determining the text coding vector and the image coding vector matched with the text coding vector as an initial training image-text data set;
inputting all the initial training image-text data sets into a preset initial image-text matching model to obtain an initial training data output result, and determining initial loss information of the initial image-text matching model based on the initial training data output result, wherein the initial loss information comprises one or more of text reconstruction loss information, contrastive learning loss information and image-text matching loss information;
and judging whether the initial loss information meets training completion conditions or not, and determining the initial image-text matching model as a target image-text matching model when the judgment result is yes.
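The five method steps above can be read as a single training driver. The following is a minimal, hypothetical Python sketch: the encoder, matching-model, and loss callables are placeholders standing in for the target text model, target image model, initial image-text matching model, and the loss computation, and the convergence threshold is an illustrative assumption, not a value taken from the patent.

```python
def train_until_converged(texts, images, encode_text, encode_image,
                          matching_model, compute_loss,
                          threshold=0.01, max_epochs=100):
    """Toy driver for the five method steps: encode each modality,
    pair the vectors, run the matching model, and stop when the
    loss meets the training-completion condition."""
    loss = float("inf")
    for _ in range(max_epochs):
        text_vecs = [encode_text(t) for t in texts]      # text coding vectors
        image_vecs = [encode_image(i) for i in images]   # image coding vectors
        # Pairing step: here texts and images are assumed pre-aligned by index.
        groups = list(zip(text_vecs, image_vecs))
        outputs = [matching_model(g) for g in groups]
        loss = compute_loss(outputs, groups)
        if loss < threshold:  # training completion condition
            break
    return loss
```

In a real system the encoders would be pretrained networks and `compute_loss` would combine the reconstruction, contrastive, and matching terms listed in the method.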
As an optional implementation manner, in the first aspect of the present invention, after the training data set for image-text matching model training is acquired, and before each text data is input into the target text model to obtain a text coding vector and each image data is input into the target image model to obtain an image coding vector, the method further includes:
performing a feature masking operation on each text data in the training data set to obtain feature-masked text data;
and the inputting, for each text data, the text data into a target text model to obtain a text coding vector includes:
for each feature-masked text data, inputting the feature-masked text data into the target text model to obtain a text coding vector, wherein the text coding vector comprises predicted text data of the feature-masked text data.
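The feature masking operation itself is not detailed in the text. A minimal sketch, assuming BERT-style random token masking; the mask symbol and the 15% ratio are illustrative assumptions:

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder symbol; the real vocabulary is not specified

def mask_tokens(tokens, mask_ratio=0.15, rng=None):
    """Replace a random fraction of the tokens with a mask symbol,
    producing the feature-masked text data; the text model is then
    trained to predict the original tokens at the masked positions."""
    rng = rng or random.Random(0)
    masked = list(tokens)
    n_mask = max(1, int(len(masked) * mask_ratio))
    for idx in rng.sample(range(len(masked)), n_mask):
        masked[idx] = MASK_TOKEN
    return masked
```

The predicted text data at the masked positions then feed the text reconstruction loss described later in the method.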
As an optional implementation manner, in the first aspect of the present invention, the inputting all the initial training graphics-text data sets to a preset initial graphics-text matching model to obtain an initial training data output result includes:
executing splicing operation on a text coding vector and an image coding vector which are included in each initial training image-text data set to obtain an initial image-text input data set;
for each initial image-text input data set, inputting the initial image-text input data set into the initial image-text matching model to obtain an initial image-text data set output result;
determining an initial training data output result according to all the initial image-text data set output results;
the initial image-text data set output result comprises a plurality of initial image-text output data sets, the number of the initial image-text output data sets is equal to that of the initial training image-text data sets, and each initial image-text output data set comprises a text data output result and an image data output result.
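The splicing operation can be read as simple vector concatenation; a sketch under that assumption:

```python
def splice(text_vec, image_vec):
    """Concatenate one text coding vector with its matched image
    coding vector to form an initial image-text input data set."""
    return list(text_vec) + list(image_vec)

def build_input_sets(initial_groups):
    """Apply the splicing operation to every initial training
    image-text data set (pairs of text and image coding vectors)."""
    return [splice(t, v) for t, v in initial_groups]
```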
As an optional implementation manner, in the first aspect of the present invention, after, for each text coding vector, an image coding vector matching the text coding vector is determined among all the image coding vectors and the text coding vector and its matching image coding vector are determined as an initial training image-text data set, and before all the initial training image-text data sets are input into the preset initial image-text matching model to obtain the initial training data output result, the method further includes:
determining at least two first training image-text data sets from all the initial training image-text data sets, and recombining the text data and the image data included in all the first training image-text data sets to obtain second training image-text data sets, wherein the data included in each first training image-text data set is different from the data included in each second training image-text data set;
determining all remaining training image-text data sets in all the initial training image-text data sets except all the first training image-text data sets and all the second training image-text data sets as target training image-text data sets;
inputting all the image-text data sets for initial training into a preset initial image-text matching model to obtain an initial training data output result, wherein the method comprises the following steps:
and inputting all the image-text data sets for the target training to a preset initial image-text matching model to obtain an initial training data output result.
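The recombination step builds mismatched (negative) pairs from at least two matched groups. One simple scheme that guarantees every second training group differs from every first training group is a cyclic shift of the images; this particular shift is an illustrative choice, not mandated by the text:

```python
def recombine(first_groups):
    """Re-pair each text with the next group's image (cyclic shift),
    so that no second training image-text data set repeats a first
    training image-text data set."""
    if len(first_groups) < 2:
        raise ValueError("need at least two first training image-text data sets")
    n = len(first_groups)
    return [(first_groups[i][0], first_groups[(i + 1) % n][1])
            for i in range(n)]
```

These mismatched pairs give the matching model negative examples, which the image-text matching loss later penalizes when scored as matches.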
As an optional implementation manner, in the first aspect of the present invention, when the initial loss information includes the text reconstruction loss information, the contrastive learning loss information, and the image-text matching loss information, the determining initial loss information of the initial image-text matching model based on the initial training data output result includes:
for each text coding vector, determining target text data matched with the text coding vector from the training data set, determining text reconstruction loss information of the text coding vector according to the text coding vector and the target text data, and determining overall text reconstruction loss information according to the text reconstruction loss information of all the text coding vectors;
for the text data output result in each initial image-text output data group, calculating a feature matching parameter between the text data output result and each image data output result, so as to obtain feature matching parameters between every text data output result and every image data output result, and determining contrastive learning loss information of the initial image-text matching model according to all the feature matching parameters;
determining image-text matching loss information of the initial image-text matching model according to the output result of the initial image-text data set and all the image-text data sets for initial training;
and determining initial loss information of the initial image-text matching model based on the text reconstruction loss information, the contrastive learning loss information and the image-text matching loss information.
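When all three terms are present, the initial loss information can be taken as a weighted sum; equal weights below are an illustrative default, since the text does not fix how the terms are combined:

```python
def combine_losses(text_recon_loss, contrastive_loss, matching_loss,
                   weights=(1.0, 1.0, 1.0)):
    """Combine the text reconstruction, contrastive learning, and
    image-text matching loss terms into the initial loss information."""
    w_r, w_c, w_m = weights
    return w_r * text_recon_loss + w_c * contrastive_loss + w_m * matching_loss
```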
As an optional implementation manner, in the first aspect of the present invention, the determining of the image-text matching loss information of the initial image-text matching model according to the initial image-text data set output result and all of the initial training image-text data sets includes:
determining a first output image-text data set identical to the initial training image-text data set from all the initial image-text data sets included in the initial image-text data set output result based on the text data output result and the image data output result included in each of the initial image-text data set output results, and determining all the output image-text data sets except all the first output image-text data sets as a second output image-text data set;
determining the output data matching degree of the initial image-text matching model according to all the first output image-text data sets, all the second output image-text data sets and all the initial training image-text data sets;
and determining the image-text matching loss information of the initial image-text matching model according to the output data matching degree and a predetermined image-text matching function.
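The predetermined image-text matching function is not specified. One common way to realize such a loss from an output-data matching degree is binary cross-entropy over per-pair match probabilities, with label 1 for first (correctly matched) output groups and 0 for second (mismatched) ones; this is a sketch under that assumption:

```python
import math

def matching_loss(match_probs, labels, eps=1e-12):
    """Binary cross-entropy between predicted match probabilities and
    the matched/mismatched labels of the output image-text data sets."""
    total = 0.0
    for p, y in zip(match_probs, labels):
        total += y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return -total / len(labels)
```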
As an optional implementation manner, in the first aspect of the present invention, the determining contrastive learning loss information of the initial image-text matching model according to all the feature matching parameters includes:
for the text data output result in each initial image-text output data set, determining, based on the initial training image-text data sets, a key image data output result matched with the text data output result, determining first matching information between the text data output result and the key image data output result, and determining second matching information between the text data output result and each other image data output result except the key image data output result;
and determining contrastive learning loss information of the initial image-text matching model according to all the first matching information, all the second matching information and a predetermined contrastive learning loss function.
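A standard contrastive learning loss function fitting this description is InfoNCE: the first matching information (text versus its key image output) acts as the positive logit and the second matching information as negatives. The temperature value below is an illustrative assumption:

```python
import math

def contrastive_loss(first_matching, second_matchings, temperature=0.07):
    """InfoNCE-style loss for one text data output result: maximize the
    key-image similarity relative to all other image output results."""
    logits = [first_matching / temperature] + \
             [s / temperature for s in second_matchings]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denom)
```

The loss approaches zero when the key-image similarity dominates all other similarities, which is the intended pull-together/push-apart behavior.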
A second aspect of the present invention discloses a training device for an image-text matching model, the device comprising:
the acquisition module is used for acquiring a training data set for image-text matching model training, wherein the training data set comprises a plurality of text data and a plurality of image data;
the input module is used for inputting, for each text data, the text data into a target text model to obtain a text coding vector, and inputting, for each image data, the image data into a target image model to obtain an image coding vector;
the determining module is used for determining an image coding vector matched with the text coding vector in all the image coding vectors aiming at each text coding vector, and determining the text coding vector and the image coding vector matched with the text coding vector as an initial training image-text data set;
the input module is also used for inputting all the image-text data sets for initial training into a preset initial image-text matching model to obtain an initial training data output result;
the determining module is further configured to determine initial loss information of the initial image-text matching model based on the initial training data output result, where the initial loss information includes one or more of text reconstruction loss information, contrastive learning loss information, and image-text matching loss information;
the judging module is used for judging whether the initial loss information meets training completion conditions or not;
and the determining module is also used for determining the initial image-text matching model as a target image-text matching model when the judging result is yes.
As an alternative embodiment, in the second aspect of the present invention, the apparatus further comprises:
the feature masking module is configured to, after the acquisition module acquires the training data set for image-text matching model training, and before the input module inputs each text data into the target text model to obtain a text coding vector and inputs each image data into the target image model to obtain an image coding vector, perform a feature masking operation on each text data in the training data set to obtain feature-masked text data;
and the manner in which the input module inputs, for each text data, the text data into the target text model to obtain a text coding vector specifically includes:
for each feature-masked text data, inputting the feature-masked text data into the target text model to obtain a text coding vector, wherein the text coding vector comprises predicted text data of the feature-masked text data.
As an optional implementation manner, in the second aspect of the present invention, the manner in which the input module inputs all the initial training image-text data sets to a preset initial image-text matching model to obtain the initial training data output result specifically includes:
executing splicing operation on a text coding vector and an image coding vector which are included in each initial training image-text data set to obtain an initial image-text input data set;
for each initial image-text input data set, inputting the initial image-text input data set into the initial image-text matching model to obtain an initial image-text data set output result;
determining an initial training data output result according to all the initial image-text data set output results;
the initial image-text data set output result comprises a plurality of initial image-text output data sets, the number of the initial image-text output data sets is equal to that of the initial training image-text data sets, and each initial image-text output data set comprises a text data output result and an image data output result.
As an optional implementation manner, in the second aspect of the present invention, the determining module is further configured to, after determining, for each text coding vector, an image coding vector matching the text coding vector from among all the image coding vectors and determining the text coding vector and the matching image coding vector as one initial training image-text data set, and before the input module inputs all the initial training image-text data sets into the preset initial image-text matching model to obtain the initial training data output result, determine at least two first training image-text data sets from all the initial training image-text data sets;
the device further comprises:
a combination module, configured to recombine the text data and the image data included in all the first training image-text data sets to obtain second training image-text data sets, where data included in each of the first training image-text data sets is different from data included in each of the second training image-text data sets;
the determining module is further configured to determine all remaining training image-text data sets in all the initial training image-text data sets except all the first training image-text data sets and all the second training image-text data sets as target training image-text data sets;
the input module inputs all the image-text data sets for initial training into a preset initial image-text matching model, and the mode of obtaining the output result of the initial training data specifically comprises the following steps:
and inputting all the image-text data sets for the target training to a preset initial image-text matching model to obtain an initial training data output result.
As an optional implementation manner, in the second aspect of the present invention, when the initial loss information includes the text reconstruction loss information, the contrastive learning loss information, and the image-text matching loss information, the manner of determining, by the determining module, the initial loss information of the initial image-text matching model based on the initial training data output result specifically includes:
for each text coding vector, determining target text data matched with the text coding vector from the training data set, determining text reconstruction loss information of the text coding vector according to the text coding vector and the target text data, and determining overall text reconstruction loss information according to the text reconstruction loss information of all the text coding vectors;
for the text data output result in each initial image-text output data group, calculating a feature matching parameter between the text data output result and each image data output result, so as to obtain feature matching parameters between every text data output result and every image data output result, and determining contrastive learning loss information of the initial image-text matching model according to all the feature matching parameters;
determining image-text matching loss information of the initial image-text matching model according to the output result of the initial image-text data set and all the image-text data sets for initial training;
and determining initial loss information of the initial image-text matching model based on the text reconstruction loss information, the contrastive learning loss information and the image-text matching loss information.
As an optional implementation manner, in the second aspect of the present invention, the manner of determining, by the determining module, the image-text matching loss information of the initial image-text matching model according to the initial image-text data set output result and all the initial training image-text data sets specifically includes:
determining a first output image-text data set identical to the initial training image-text data set from all the initial image-text data sets included in the initial image-text data set output result based on the text data output result and the image data output result included in each of the initial image-text data set output results, and determining all the output image-text data sets except all the first output image-text data sets as a second output image-text data set;
determining the output data matching degree of the initial image-text matching model according to all the first output image-text data sets, all the second output image-text data sets and all the initial training image-text data sets;
and determining the image-text matching loss information of the initial image-text matching model according to the output data matching degree and a predetermined image-text matching function.
As an optional implementation manner, in the second aspect of the present invention, the manner in which the determining module determines the contrastive learning loss information of the initial image-text matching model according to all the feature matching parameters specifically includes:
for the text data output result in each initial image-text output data set, determining, based on the initial training image-text data sets, a key image data output result matched with the text data output result, determining first matching information between the text data output result and the key image data output result, and determining second matching information between the text data output result and each other image data output result except the key image data output result;
and determining contrastive learning loss information of the initial image-text matching model according to all the first matching information, all the second matching information and a predetermined contrastive learning loss function.
A third aspect of the present invention discloses another training device for an image-text matching model, the device comprising:
a memory storing executable program code;
a processor coupled with the memory;
the processor calls the executable program codes stored in the memory to execute the method for training the image-text matching model disclosed by the first aspect of the invention.
In a fourth aspect, the present invention discloses a computer-readable storage medium storing computer instructions for executing the method for training a graph-text matching model disclosed in the first aspect of the present invention when the computer instructions are called.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
In the embodiment of the invention, a training data set for image-text matching model training is acquired, the training data set comprising a plurality of text data and a plurality of image data; each text data and each image data are input into a target text model and a target image model respectively to obtain text coding vectors and image coding vectors; initial training image-text data sets are determined; all the initial training image-text data sets are input into a preset initial image-text matching model to obtain an initial training data output result; initial loss information of the initial image-text matching model is determined based on the initial training data output result; and, if the initial loss information meets a training completion condition, the initial image-text matching model is determined as the target image-text matching model. Therefore, the efficiency of image-text matching model training can be improved, and image-text cross retrieval and multi-modal data classification can be realized based on the trained image-text matching model.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention, and that those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic flow chart of a method for training an image-text matching model according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of another method for training a graph-text matching model according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a training apparatus for an image-text matching model according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of another training apparatus for an image-text matching model according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of yet another training apparatus for an image-text matching model according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," and the like in the description and claims of the present invention and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, or product that comprises a list of steps or elements is not limited to only those steps or elements, but may alternatively include other steps or elements not expressly listed or inherent to such process, method, apparatus, or product.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The invention discloses a training method and a training device for an image-text matching model, which can improve the efficiency of image-text matching model training and can realize image-text mutual search and multi-mode data classification based on the image-text matching model. The following are detailed descriptions.
Example one
Referring to fig. 1, fig. 1 is a schematic flowchart of a method for training a graph-text matching model according to an embodiment of the present invention. The method for training the image-text matching model described in fig. 1 may be applied to a device for training the image-text matching model, or may be applied to a cloud server or a local server based on the training of the image-text matching model, which is not limited in the embodiment of the present invention. As shown in fig. 1, the method for training the graph-text matching model may include the following operations:
101. and acquiring a training data set for training the image-text matching model.
In an embodiment of the present invention, the training data set includes a plurality of text data and a plurality of image data.
In this embodiment of the present invention, optionally, the training data set may include a text image matching data set, where the text image matching data set is a data set in which a text and an image are matched. For example, the image in the text image matching data set may be a pattern of milk, and the text in the text image matching data set may be "milk".
102. For each text data, inputting the text data into a target text model to obtain a text coding vector; and for each image data, inputting the image data into a target image model to obtain an image coding vector.
In this embodiment of the present invention, optionally, the target text model may include a text encoder, a text pooling layer, and a text fully-connected layer. Inputting the text data into the target text model to obtain a text coding vector may be: inputting the text data into the text encoder to obtain a text encoding result, inputting the text encoding result into the text pooling layer to obtain a text pooling result, and inputting the text pooling result into the text fully-connected layer, whose output is the text coding vector.
In this embodiment of the present invention, optionally, the target image model may include an image encoder, an image pooling layer, and an image fully-connected layer. Inputting the image data into the target image model to obtain an image coding vector may be: inputting the image data into the image encoder to obtain an image encoding result, inputting the image encoding result into the image pooling layer to obtain an image pooling result, and inputting the image pooling result into the image fully-connected layer, whose output is the image coding vector.
In this embodiment of the present invention, optionally, after the text coding vector and the image coding vector are obtained, the text coding vector and the image coding vector are mapped into the same dimension.
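As a rough illustration of the encoder → pooling layer → fully-connected layer pipeline described above, the following sketch projects both modalities into one shared dimension. All dimensions, the mean-pooling choice, and the random stand-in features are illustrative assumptions, not taken from this disclosure:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pool(features):
    # pooling layer: (seq_len, hidden) -> (hidden,) by mean pooling
    return features.mean(axis=0)

def project(pooled, weight, bias):
    # fully-connected layer mapping pooled features into the shared space
    return pooled @ weight + bias

# Hypothetical dimensions; the disclosure does not fix them.
text_hidden, image_hidden, shared_dim = 8, 12, 4
W_t = rng.normal(size=(text_hidden, shared_dim)); b_t = np.zeros(shared_dim)
W_v = rng.normal(size=(image_hidden, shared_dim)); b_v = np.zeros(shared_dim)

text_features = rng.normal(size=(5, text_hidden))    # stand-in for text-encoder output
image_features = rng.normal(size=(9, image_hidden))  # stand-in for image-encoder output

t = project(mean_pool(text_features), W_t, b_t)   # text coding vector
v = project(mean_pool(image_features), W_v, b_v)  # image coding vector
assert t.shape == v.shape == (shared_dim,)        # mapped into the same dimension
```

The shared output dimension is what makes the later splicing and similarity computations between text and image vectors well-defined.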
103. For each text coding vector, determining, among all the image coding vectors, the image coding vector matching the text coding vector, and determining the text coding vector and its matching image coding vector as an initial training image-text data set.
In the embodiment of the present invention, optionally, within each initial training image-text data set the image coding vector matches the text coding vector, while an image coding vector and a text coding vector taken from different initial training image-text data sets do not match.
In an embodiment of the present invention, optionally, the initial training image-text data set may include a text encoding vector and an image encoding vector matching the text encoding vector. Optionally, the number of the initial training image-text data sets is multiple, and the embodiment of the present invention is not limited.
104. Inputting all the initial training image-text data sets into a preset initial image-text matching model to obtain an initial training data output result, and determining initial loss information of the initial image-text matching model based on the initial training data output result.
In the embodiment of the invention, the initial loss information comprises one or more of text reconstruction loss information, contrastive learning loss information and image-text matching loss information.
In this embodiment of the present invention, optionally, the initial loss information may include only one of the text reconstruction loss information, the contrastive learning loss information, and the image-text matching loss information, or may include any two or all three of them, which is not limited in this embodiment of the present invention.
105. Judging whether the initial loss information satisfies a training completion condition, and when the judgment result is yes, determining the initial image-text matching model as a target image-text matching model.
Optionally, when the judgment result indicates that the initial loss information does not satisfy the training completion condition, the following steps are re-triggered until the initial loss information satisfies the training completion condition: acquiring a training data set for training the image-text matching model; for each text data, inputting the text data into the target text model to obtain a text coding vector; for each image data, inputting the image data into the target image model to obtain an image coding vector; for each text coding vector, determining, among all the image coding vectors, the image coding vector matching the text coding vector, and determining the two as an initial training image-text data set; inputting all the initial training image-text data sets into the preset initial image-text matching model to obtain an initial training data output result; determining the initial loss information of the initial image-text matching model based on the initial training data output result; and judging whether the initial loss information satisfies the training completion condition.
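The re-trigger logic above amounts to a loop that repeats the acquire/encode/match/loss cycle until the completion condition holds. A minimal sketch with toy stand-ins for every component (none of these function names or the decaying-loss behavior come from the disclosure):

```python
def train(get_batch, forward, loss_fn, meets_condition, max_iters=1000):
    # Repeat the acquire -> encode -> match -> loss cycle until the
    # initial loss information satisfies the training-completion condition.
    for _ in range(max_iters):
        batch = get_batch()
        output = forward(batch)
        loss = loss_fn(output)
        if meets_condition(loss):
            return loss  # the model is now the target image-text matching model
    return loss

# Toy stand-ins: the loss halves each iteration; the condition is loss < 0.1.
state = {"loss": 1.0}
def get_batch(): return None
def forward(batch): return None
def loss_fn(output):
    state["loss"] *= 0.5
    return state["loss"]

final = train(get_batch, forward, loss_fn, lambda loss: loss < 0.1)
assert final < 0.1
```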
It can be seen that implementing the training method of the image-text matching model described in fig. 1 can obtain a text coding vector for each text data and an image coding vector for each image data, determine the image coding vector matching each text coding vector, and determine each matched pair as an initial training image-text data set. All the initial training image-text data sets are input into a preset initial image-text matching model to obtain an initial training data output result, the initial loss information of the initial image-text matching model is determined based on that result, and whether the initial loss information satisfies the training completion condition is judged; if so, the initial image-text matching model is determined as the target image-text matching model. This is beneficial to improving the intelligence and efficiency of training the image-text matching model, to improving the accuracy and reliability of the obtained image-text matching model, and further to realizing image-text mutual search and multi-mode data classification based on the image-text matching model.
Example two
Referring to fig. 2, fig. 2 is a schematic flowchart of a method for training an image-text matching model according to an embodiment of the present invention. The method for training the image-text matching model described in fig. 2 may be applied to a device for training the image-text matching model, or to a cloud server or a local server that performs the training of the image-text matching model, which is not limited in the embodiment of the present invention. As shown in fig. 2, the method for training the image-text matching model may include the following operations:
201. Acquiring a training data set for training the image-text matching model.
202. For each text data in the training data set, performing a feature masking operation on the text data to obtain feature-masked text data.
In this embodiment of the present invention, optionally, performing a feature masking operation on the text data to obtain feature-masked text data may include: performing a partial feature masking operation on the text data to obtain the feature-masked text data. For example, when the words included in the text data are "pediatric cold granules", the words included in the feature-masked text data obtained after the feature masking operation may be "pediatric XX granules", in which the word "cold" is feature-masked.
In the embodiment of the present invention, it should be noted that the feature masking operation masks only part of the words included in the text data, not all of the words included in the text data.
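A minimal sketch of such a partial masking operation; the `[MASK]` token, the masking ratio, and word-level granularity are assumptions for illustration only:

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, ratio=0.3, seed=42):
    # Mask only part of the tokens, never all of them.
    rng = random.Random(seed)
    n = max(1, min(len(tokens) - 1, int(len(tokens) * ratio)))
    positions = rng.sample(range(len(tokens)), n)
    masked = list(tokens)
    labels = {}
    for p in positions:
        labels[p] = masked[p]  # remember the original token as the prediction target
        masked[p] = MASK
    return masked, labels

tokens = "pediatric cold granules".split()
masked, labels = mask_tokens(tokens)
assert MASK in masked and len(labels) < len(tokens)  # partial masking only
```

The retained `labels` mapping is what a reconstruction loss would later be computed against.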
203. For each feature-masked text data, inputting the feature-masked text data into a target text model to obtain a text coding vector; and for each image data, inputting the image data into a target image model to obtain an image coding vector.
In an embodiment of the present invention, the text encoding vector includes predicted text data of the feature masked text data.
In this embodiment of the present invention, optionally, the predicted text data includes a prediction result for the masked features. For example, when the feature-masked text data is "pediatric XX granules", where "XX" is the masked text, the predicted text data of the feature-masked text data may be, for example, "pediatric cold granules" or "pediatric fever granules".
In this embodiment of the present invention, optionally, for each feature-masked text data, inputting the feature-masked text data into the target text model to obtain a text coding vector may include:
for each feature-masked text data, inputting the feature-masked text data into the text encoder to obtain feature-masked predicted text data corresponding to the feature-masked text data, inputting the feature-masked predicted text data into the text pooling layer to obtain a text pooling result, and inputting the text pooling result into the text fully-connected layer to obtain the text coding vector.
204. For each text coding vector, determining, among all the image coding vectors, the image coding vector matching the text coding vector, and determining the text coding vector and its matching image coding vector as an initial training image-text data set.
205. Inputting all the initial training image-text data sets into a preset initial image-text matching model to obtain an initial training data output result, and determining initial loss information of the initial image-text matching model based on the initial training data output result.
206. Judging whether the initial loss information satisfies the training completion condition, and when the judgment result is yes, determining the initial image-text matching model as a target image-text matching model.
In the embodiment of the present invention, for other descriptions of step 201 and steps 204 to 206, please refer to the detailed descriptions of steps 101 to 105 in the first embodiment, which are not repeated herein.
In the embodiment of the present invention, for example, it is assumed that the input image is img and the input text is text = [t1, t2, tmask, ..., tn], where tmask represents a masked word and n is the text length. After the image img passes through the image encoder, an image feature vencoder = VE(img) is obtained, and after passing through the image pooling layer and the image fully-connected layer, the image coding vector v is obtained. After the text passes through the text encoder, a text feature tencoder = TE(text) is obtained, and after passing through the text pooling layer and the text fully-connected layer, the text coding vector t is obtained. When a tmask exists in the text, the feature corresponding to the tmask is input into the FC_TextPred classification layer to obtain tpred = FC_TextPred(tencoder), where tpred indicates the predicted result for the feature-masked text.
Therefore, by implementing the image-text matching model training method described in fig. 2, after the training data set is obtained, a feature masking operation is performed on each text data in the training data set to obtain feature-masked text data, and the feature-masked text data is input into the target text model to obtain a text coding vector that includes the predicted text data of the feature-masked text data. The masked text can thus be predicted after the feature masking operation, which can improve the intelligence and efficiency of training the image-text matching model, improve the accuracy and reliability of the obtained image-text matching model, and further help realize image-text mutual search and multi-mode data classification based on the image-text matching model.
In an optional embodiment, inputting all the initial training image-text data sets to a preset initial image-text matching model to obtain an initial training data output result, including:
executing splicing operation on a text coding vector and an image coding vector included in each initial training image-text data set to obtain an initial image-text input data set;
inputting the initial image-text input data set into an initial image-text matching model aiming at each initial image-text input data set to obtain an output result of the initial image-text data set;
determining an initial training data output result according to the output results of all the initial image-text data sets;
the initial image-text data set output result comprises a plurality of initial image-text output data sets, the number of the initial image-text output data sets is equal to that of the initial training image-text data sets, and each initial image-text output data set comprises a text data output result and an image data output result.
In this alternative embodiment, optionally, each initial image-text output data set includes one text data output result and one image data output result.
In this alternative embodiment, for example, after obtaining the image coding vector v and the text coding vector t, v and t are spliced to obtain vt = cat (v, t). And vt is used for representing an initial image-text input data set obtained after splicing the image coding vector v and the text coding vector t.
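The splicing vt = cat(v, t) can be sketched per initial training group as follows; the vector values and the 2-dimensional size are arbitrary toy numbers, not from the disclosure:

```python
import numpy as np

groups = [  # each initial training group: (text coding vector, image coding vector)
    (np.array([0.3, 0.2]), np.array([0.1, -0.4])),
    (np.array([-0.7, 0.5]), np.array([0.9, 0.0])),
]

# Splice each group into one input vector: vt = cat(v, t)
inputs = [np.concatenate([v, t]) for t, v in groups]
assert all(vt.shape == (4,) for vt in inputs)
assert len(inputs) == len(groups)  # one spliced input per initial training group
```

Each spliced vector is what gets fed to the initial image-text matching model as one initial image-text input data set.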
In this optional embodiment, optionally, determining an initial training data output result according to output results of all the initial teletext data sets may include: and determining the output results of all the initial image-text data sets as the output results of the initial training data.
It can be seen that, by implementing this optional embodiment, for each initial training image-text data set, a splicing operation can be performed on the text coding vector and the image coding vector included in the set to obtain an initial image-text input data set; each initial image-text input data set is input into the initial image-text matching model to obtain an initial image-text data set output result; and the initial training data output result is determined according to all the initial image-text data set output results. This can improve the accuracy and reliability of determining the initial training data output result, improve the intelligence of training the image-text matching model and the accuracy and reliability of the obtained model, and further helps realize image-text mutual search and multi-mode data classification based on the image-text matching model.
In another optional embodiment, after, for each text coding vector, determining the image coding vector matching the text coding vector among all the image coding vectors and determining the text coding vector and its matching image coding vector as an initial training image-text data set, and before inputting all the initial training image-text data sets into the preset initial image-text matching model to obtain the initial training data output result, the method further includes:
determining at least two first training image-text data sets from all the initial training image-text data sets, and recombining the text data and the image data included in all the first training image-text data sets to obtain second training image-text data sets, wherein the data included in each second training image-text data set is not completely identical to the data included in any first training image-text data set;
determining all the remaining training image-text data sets except all the first training image-text data sets in all the initial training image-text data sets and all the second training image-text data sets as target training image-text data sets;
inputting all the initial training image-text data sets into a preset initial image-text matching model to obtain an initial training data output result includes:
and inputting all the image-text data sets for target training into a preset initial image-text matching model to obtain an initial training data output result.
In this alternative embodiment, optionally, the data included in each first training image-text data set is partially identical to, and partially different from, the data included in each second training image-text data set. For example, suppose one first training image-text data set includes text A and image A, and another first training image-text data set includes text B and image B. Recombining the two first training image-text data sets yields one second training image-text data set including text A and image B, and another second training image-text data set including text B and image A.
In this alternative embodiment, all the remaining training image-text data sets, i.e., all the initial training image-text data sets except all the first training image-text data sets, together with all the second training image-text data sets, are determined as the target training image-text data sets. That is, the target training image-text data sets include not only the image-text data sets obtained by recombining the text data and the image data, but also the image-text data sets as they were before recombination, namely the original image-text data sets.
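Following the text A / image A example above, a minimal sketch of one possible recombination, using a cyclic shift of the images (the disclosure does not fix a particular permutation, so the shift is an assumption):

```python
def recombine(groups):
    # groups: list of (text, image) matched pairs; returns mismatched pairs
    # by pairing each text with the next group's image (a cyclic shift).
    texts = [text for text, _ in groups]
    images = [image for _, image in groups]
    shifted = images[1:] + images[:1]
    return list(zip(texts, shifted))

first = [("text A", "image A"), ("text B", "image B")]
second = recombine(first)              # recombined, mismatched groups
target = first + second                # original groups plus recombined groups
assert ("text A", "image B") in second and ("text B", "image A") in second
```

The `target` list mirrors the target training image-text data sets: the original matched groups together with the recombined mismatched ones.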
Therefore, by implementing this optional embodiment, the text data and image data included in the first training image-text data sets can be recombined to obtain second training image-text data sets; all the remaining training image-text data sets, except all the first training image-text data sets, together with all the second training image-text data sets are determined as target training image-text data sets; and all the target training image-text data sets are input into the preset initial image-text matching model to obtain the initial training data output result. The image-text matching model can thus be trained with the recombined second training image-text data sets, which can improve the intelligence of training the image-text matching model and the accuracy and reliability of the obtained model, and further helps realize image-text mutual search and multi-mode data classification based on the image-text matching model.
In yet another optional embodiment, when the initial loss information includes the text reconstruction loss information, the contrastive learning loss information, and the image-text matching loss information, determining the initial loss information of the initial image-text matching model based on the initial training data output result includes:
for each text coding vector, determining target text data matching the text coding vector from the training data set, determining the text reconstruction loss information of that text coding vector according to the text coding vector and the target text data, and determining the overall text reconstruction loss information according to the text reconstruction loss information of all the text coding vectors;
for the text data output result in each initial image-text output data set, calculating a feature matching parameter between the text data output result and each image data output result, obtaining the feature matching parameter between every text data output result and every image data output result, and determining the contrastive learning loss information of the initial image-text matching model according to all the feature matching parameters;
determining the image-text matching loss information of the initial image-text matching model according to the initial image-text data set output results and all the initial training image-text data sets;
and determining the initial loss information of the initial image-text matching model based on the text reconstruction loss information, the contrastive learning loss information, and the image-text matching loss information.
In this alternative embodiment, determining the initial loss information of the initial image-text matching model based on the text reconstruction loss information, the contrastive learning loss information, and the image-text matching loss information may include:
determining the initial loss information of the initial image-text matching model based on a total loss calculation function together with the text reconstruction loss information, the contrastive learning loss information, and the image-text matching loss information;
the total loss calculation function may include:
L = Ltext_rebuild + Linfonce + Litm;
wherein L is the initial loss information, Ltext_rebuild is the text reconstruction loss information, Linfonce is the contrastive learning loss information, and Litm is the image-text matching loss information.
Therefore, determining the initial loss information of the initial image-text matching model through the total loss calculation function together with the text reconstruction loss information, the contrastive learning loss information, and the image-text matching loss information can improve the accuracy and reliability of the determined initial loss information.
In this alternative embodiment, it should be noted that the text reconstruction loss information is given by a text reconstruction loss function (Ltext_rebuild): part of the text is randomly masked at input time, and after passing through the text encoder the masked text needs to be predicted. That is, the text reconstruction loss information represents the loss between the predicted feature-masked text and the real text.
In this optional embodiment, optionally, the text reconstruction loss function uses a standard cross-entropy loss function: Ltext_rebuild = CrossEntropy(tpred, tmask), where Ltext_rebuild is the text reconstruction loss, tpred represents the predicted result for the masked words, and tmask represents the masked words.
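A sketch of the CrossEntropy(tpred, tmask) computation for a single masked position, over a hypothetical 4-word vocabulary (the logit values and vocabulary size are arbitrary assumptions):

```python
import numpy as np

def cross_entropy(logits, target):
    # Standard cross-entropy for one masked position: -log softmax(logits)[target]
    z = logits - logits.max()                 # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())   # log softmax
    return -log_probs[target]

tpred_logits = np.array([0.1, 0.2, 3.0, -1.0])  # prediction for the masked position
tmask_id = 2                                    # vocabulary id of the true masked word
loss = cross_entropy(tpred_logits, tmask_id)
assert loss > 0  # small here, since the true word already has the largest logit
```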
In this alternative embodiment, it should be noted that, for each text encoding vector, the target text data matched with the text encoding vector is determined from the training data set, that is, the determined target text data is the original text data corresponding to the text encoding vector.
In this alternative embodiment, for example, when the text data predicted from the text coding vector is "fruit sugar" and the target text data matching the text coding vector is "lemon sugar", a matching degree between "fruit sugar" and "lemon sugar" is calculated, and the text reconstruction loss information of the text coding vector is determined accordingly.
In this alternative embodiment, optionally, the contrastive learning loss information is used to represent the relationship between the text data and the image data. For example, assume there are 4 image-text pairs, i.e. 4 images and 4 texts, where image 1 matches text 1, image 2 matches text 2, image 3 matches text 3, and image 4 matches text 4. The loss function here indicates that the feature distance between image 1 and text 1 should be reduced, while the feature distances between image 1 and text 2, text 3, and text 4 should be increased. Similarly, the features of image 2 are drawn closer to those of text 2 and pushed farther from those of text 1, text 3, and text 4, and so on through image 4.
In this alternative embodiment, the image-text matching loss information is used to indicate the matching relationship between the text data and the image data in the initial training image-text data sets output after passing through the initial image-text matching model.
Therefore, by implementing this optional embodiment, the text reconstruction loss information of each text coding vector can be determined, and the overall text reconstruction loss information determined from all of them; the feature matching parameters between each text data output result and each image data output result can be calculated, and the contrastive learning loss information determined according to all the feature matching parameters; the image-text matching loss information can be determined according to the initial image-text data set output results and all the initial training image-text data sets; and the initial loss information of the initial image-text matching model can then be determined based on the text reconstruction loss information, the contrastive learning loss information, and the image-text matching loss information. This can improve the accuracy and reliability both of determining the initial loss information and of the image-text matching model obtained by training, and further helps realize image-text mutual search and multi-mode data classification based on the image-text matching model.
In yet another alternative embodiment, determining the image-text matching loss information of the initial image-text matching model according to the initial image-text data set output results and all the initial training image-text data sets includes:
determining a first output image-text data set identical to the initial training image-text data set from all initial image-text data sets included in the output results of the initial image-text data sets based on the text data output results and the image data output results included in the output results of each initial image-text data set, and determining all output image-text data sets except all the first output image-text data sets as second output image-text data sets;
determining the output data matching degree of the initial image-text matching model according to all the first output image-text data sets, all the second output image-text data sets, and all the initial training image-text data sets;
and determining the image-text matching loss information of the initial image-text matching model according to the output data matching degree and the predetermined image-text matching function.
In this optional embodiment, optionally, the combination of the text data and the image data included in all the first output image-text data sets is the same as the combination of the text data and the image data included in the initial training image-text data set, and the combination of the text data and the image data included in all the second output image-text data sets is not the same as the combination of the text data and the image data included in the initial training image-text data set.
In this alternative embodiment, optionally, for example, assume there are 4 image-text pairs; the first 2 image-text pairs are kept unchanged and the last two are shuffled, forming the following combinations: image-text combination 1 includes image 1 and text 1, image-text combination 2 includes image 2 and text 2, image-text combination 3 includes image 3 and text 4, and image-text combination 4 includes image 4 and text 3. Image-text combination 1 and image-text combination 2 are correctly matched image-text data sets, while image-text combination 3 and image-text combination 4 are mismatched image-text data sets. Further optionally, image-text combination 1 and image-text combination 2 are determined as positive labels, and image-text combination 3 and image-text combination 4 as negative labels; when training the initial image-text matching model, the image features and text features are aggregated and then input into a classification layer to judge whether each image-text pair matches, and training of the initial image-text matching model continues according to the matching results.
In this optional embodiment, optionally, the predetermined image-text matching function may include:
Litm=CrossEntropy(vtpred, vttrue);
wherein Litm is the image-text matching loss information of the initial image-text matching model, vtpred represents the two-class prediction obtained by splicing the image features and the text features and inputting them into the classification layer to judge whether the image-text pair matches, and vttrue represents the true matching result of the image-text pair;
further, vttrue = 1 indicates that the text data and the image data in the data group match, and vttrue = 0 indicates that they do not match.
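A sketch of the Litm = CrossEntropy(vtpred, vttrue) step: splice v and t, pass the fused vector through a toy classification layer, and score it against a positive label. The dimensions and random weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def itm_head(vt, W, b):
    # Classification layer mapping the fused feature vt to two logits
    # (class 1 = the image-text pair matches, class 0 = it does not).
    return vt @ W + b

def cross_entropy(logits, label):
    z = logits - logits.max()
    return -(z - np.log(np.exp(z).sum()))[label]

v = rng.normal(size=4)               # toy image coding vector
t = rng.normal(size=4)               # toy text coding vector
vt = np.concatenate([v, t])          # spliced image and text features

W = rng.normal(size=(8, 2)); b = np.zeros(2)
vtpred = itm_head(vt, W, b)          # two-class logits for this pair
Litm = cross_entropy(vtpred, 1)      # vttrue = 1: this pair is a positive label
assert Litm > 0
```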
Therefore, by implementing this optional embodiment, the first output image-text data sets and the second output image-text data sets can be determined based on the text data output results and the image data output results; the output data matching degree of the initial image-text matching model can be determined according to all the first output image-text data sets, all the second output image-text data sets, and all the initial training image-text data sets; and the image-text matching loss information can be determined according to the output data matching degree and the image-text matching function. This can improve the accuracy and reliability of determining the image-text matching loss information and hence of determining the initial loss information, can further improve the intelligence and accuracy of training the image-text matching model, and further helps realize image-text mutual search and multi-mode data classification based on the image-text matching model.
In yet another alternative embodiment, determining the contrastive learning loss information of the initial image-text matching model according to all the feature matching parameters includes:
determining a key image data output result matched with the text data output result in the initial training image-text data set based on the initial training image-text data set according to the text data output result in each initial image-text output data set, determining first matching information between the text data output result and the key image data output result, and determining second matching information between the text data output result and each other image data output result except the key image data output result;
and determining the contrastive learning loss information of the initial image-text matching model according to all the first matching information, all the second matching information, and a predetermined contrastive learning loss function.
In this optional embodiment, optionally, the predetermined contrastive learning loss function may take the standard InfoNCE form:
Linfonce = -log( exp(sim(t, v)/τ) / Σi exp(sim(t, vi)/τ) );
wherein Linfonce is the contrastive learning loss function, t is the text coding vector, v is the image coding vector matching t, vi ranges over all the image coding vectors in the batch, sim(·,·) is the feature similarity, and τ is the temperature coefficient.
In this alternative embodiment, further optionally, τ is used to mine difficult samples. When a mismatched image-text sample appears among the image-text data sets, setting a suitable temperature coefficient can increase the loss information contributed by the mismatched image-text data set, so that the image-text matching model learns specifically from the mismatched image-text data sets, which can improve the intelligence and accuracy of model training.
In this alternative embodiment, it should be noted that, since a plurality of image-text pairs are input, model training requires that the image coding vector v and the text coding vector t in the same image-text pair be drawn closer and closer together, while the image coding vector v and the text coding vector t from different image-text pairs be pushed farther and farther apart. For example, assuming that there are 4 image-text pairs, i.e. 4 images and 4 texts, where image 1 matches text 1, image 2 matches text 2, image 3 matches text 3, and image 4 matches text 4, the loss function here indicates that the feature distance between image 1 and text 1 should be reduced, while the feature distances between image 1 and text 2, text 3, and text 4 should be increased. Similarly, the features of image 2 are drawn closer to the features of text 2 and pushed farther from the features of text 1, text 3 and text 4, and the process is repeated until image 4 is reached.
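The pull-together/push-apart behavior described above can be sketched as a symmetric InfoNCE-style loss. This is a minimal NumPy sketch under the assumption that similarity is cosine similarity and that the loss is averaged over both the text-to-image and image-to-text directions; the patent does not fix these details.

```python
import numpy as np

def info_nce_loss(text_vecs, image_vecs, tau=0.07):
    """Symmetric InfoNCE-style contrastive loss over N matched image-text pairs.

    text_vecs, image_vecs: (N, D) arrays where row i of each forms a matched pair.
    tau: temperature coefficient; a smaller tau sharpens the softmax and
    up-weights hard (mismatched) samples.
    """
    # L2-normalise so the dot product is cosine similarity
    t = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    v = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    sim = t @ v.T / tau                      # (N, N) similarity matrix
    # text -> image direction: the diagonal entries are the positives
    log_softmax_t = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    loss_t2v = -np.mean(np.diag(log_softmax_t))
    # image -> text direction
    log_softmax_v = sim.T - np.log(np.exp(sim.T).sum(axis=1, keepdims=True))
    loss_v2t = -np.mean(np.diag(log_softmax_v))
    return (loss_t2v + loss_v2t) / 2

rng = np.random.default_rng(0)
t = rng.normal(size=(4, 8))
loss_matched = info_nce_loss(t, t)                       # identical pairs: near-zero loss
loss_random = info_nce_loss(t, rng.normal(size=(4, 8)))  # unrelated pairs: larger loss
```

With matched encodings the diagonal dominates the softmax and the loss approaches zero; with unrelated encodings the positives do not stand out and the loss stays high, which is exactly the drawing-closer/pushing-apart effect described above.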
Therefore, the implementation of the optional embodiment can determine the matching information between each text data output result and each image data output result, and determine the contrast learning loss information of the initial image-text matching model according to all the matching information and the predetermined contrast learning loss function, so that the accuracy and reliability of obtaining the contrast learning loss information can be improved, the accuracy and reliability of determining the initial loss information can be improved, the intelligence and accuracy of training the image-text matching model can be improved, and the realization of image-text mutual search and multi-mode data classification based on the image-text matching model can be further facilitated.
EXAMPLE III
Referring to fig. 3, fig. 3 is a schematic structural diagram of a training apparatus for a graph-text matching model according to an embodiment of the present invention. As shown in fig. 3, the apparatus for training the graph-text matching model may include:
an obtaining module 301, configured to obtain a training data set for image-text matching model training, where the training data set includes multiple text data and multiple image data;
an input module 302, configured to input, for each piece of text data, the piece of text data into a target text model to obtain a text coding vector, and input, for each piece of image data, the piece of image data into a target image model to obtain an image coding vector;
a determining module 303, configured to determine, for each text encoding vector, an image encoding vector matching the text encoding vector from among all image encoding vectors, and determine the text encoding vector and the image encoding vector matching the text encoding vector as an initial training image-text data set;
the input module 302 is further configured to input all the initial training image-text data sets to a preset initial image-text matching model to obtain an initial training data output result;
the determining module 303 is further configured to determine initial loss information of the initial image-text matching model based on an initial training data output result, where the initial loss information includes one or more of text reconstruction loss information, comparison learning loss information, and image-text matching loss information;
a judging module 304, configured to judge whether the initial loss information meets a training completion condition;
the determining module 303 is further configured to determine the initial image-text matching model as the target image-text matching model when the determination result is yes.
It can be seen that, by implementing the apparatus described in fig. 3, a text coding vector of each text data and an image coding vector of each image data can be obtained, an image coding vector matched with each text coding vector can be determined, the text coding vector and the matched image coding vector can be determined as an initial training image-text data set, all the initial training image-text data sets can be input to a preset initial image-text matching model to obtain an initial training data output result, initial loss information of the initial image-text matching model can be determined based on the initial training data output result, and whether the initial loss information satisfies the training completion condition can be judged; if so, the initial image-text matching model is determined as the target image-text matching model, which is beneficial to improving the intelligence and efficiency of training the image-text matching model, further beneficial to improving the accuracy and reliability of the obtained image-text matching model, and further beneficial to realizing image-text mutual search and multi-mode data classification based on the image-text matching model.
In an alternative embodiment, as shown in fig. 4, the apparatus further comprises:
a feature masking module 305, configured to, after the obtaining module 301 obtains the training data set for image-text matching model training, and before the input module 302 inputs each text data into the target text model to obtain a text coding vector and inputs each image data into the target image model to obtain an image coding vector, perform a feature masking operation on each text data in the training data set to obtain feature masked text data;
for each text data, the input module 302 inputs the text data to the target text model, and the manner of obtaining the text coding vector specifically includes:
and inputting the characteristic masking text data into a target text model aiming at each characteristic masking text data to obtain a text coding vector, wherein the text coding vector comprises predicted text data of the characteristic masking text data.
Therefore, by implementing the device described in fig. 4, after the training data set is obtained, a feature masking operation can be performed on each text data in the training data set to obtain feature masked text data, and the feature masked text data is input into the target text model to obtain a text coding vector, where the text coding vector includes predicted text data of the feature masked text data, so that the masked text can be predicted after the feature masking operation is performed on the text data, the intelligence of the training image-text matching model can be improved, the efficiency of the training image-text matching model can be improved, the accuracy and reliability of the obtained image-text matching model can be further improved, and the realization of image-text mutual search and multi-mode data classification based on the image-text matching model can be further facilitated.
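The feature masking operation above can be sketched as random token masking. This is a minimal sketch assuming a BERT-style scheme with a fixed mask token and mask ratio; the patent does not specify the masking scheme, and `MASK_TOKEN` and `mask_ratio` are illustrative choices.

```python
import random

MASK_TOKEN = "[MASK]"

def feature_mask(tokens, mask_ratio=0.15, seed=None):
    """Replace a random fraction of tokens with MASK_TOKEN.

    Returns the masked token list and the masked positions; the masked
    positions serve as reconstruction targets for the text branch, so the
    model can be trained to predict the masked text.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK_TOKEN
    return masked, positions

tokens = "a grey cat sleeping on a red sofa".split()
masked, positions = feature_mask(tokens, mask_ratio=0.25, seed=1)
```

The text coding vector produced from `masked` then carries the model's predictions for the masked positions, which is what the text reconstruction loss compares against the original tokens.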
In another alternative embodiment, as shown in fig. 4, the manner that the input module 302 inputs all the initial training teletext data sets to the preset initial teletext matching model to obtain the initial training data output result specifically includes:
executing splicing operation on a text coding vector and an image coding vector included in each initial training image-text data set to obtain an initial image-text input data set;
inputting the initial image-text input data set to an initial image-text matching model aiming at each initial image-text input data set to obtain an output result of the initial image-text data set;
determining an initial training data output result according to all initial image-text data set output results;
the initial image-text data set output result comprises a plurality of initial image-text output data sets, the number of the initial image-text output data sets is equal to that of the initial training image-text data sets, and each initial image-text output data set comprises a text data output result and an image data output result.
It can be seen that, by implementing the apparatus described in fig. 4, for each initial training image-text data set, the text coding vector and the image coding vector included in the initial training image-text data set are spliced to obtain an initial image-text input data set, for each initial image-text input data set, the initial image-text input data set is input to the initial image-text matching model to obtain an initial image-text data set output result, and according to all initial image-text data set output results, the initial training data output result is determined, which can improve the accuracy and reliability of determining the initial training data output result, and can improve the intelligence of training the image-text matching model, thereby being beneficial to improving the accuracy and reliability of training the obtained image-text matching model, and further being beneficial to realizing image-text mutual searching and multi-mode data classification based on the image-text matching model.
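The splicing operation above can be sketched as vector concatenation. Simple concatenation is an assumption: the patent only specifies a "splicing operation" on the text coding vector and image coding vector of each initial training image-text data set.

```python
import numpy as np

def splice_pair(text_vec, image_vec):
    """Concatenate a matched text/image encoding pair into one input vector
    for the initial image-text matching model."""
    return np.concatenate([text_vec, image_vec])

text_vecs = np.random.default_rng(0).normal(size=(4, 16))   # 4 text encodings
image_vecs = np.random.default_rng(1).normal(size=(4, 16))  # 4 image encodings
# One initial image-text input data set per initial training image-text data set
inputs = np.stack([splice_pair(t, v) for t, v in zip(text_vecs, image_vecs)])
```

Each row of `inputs` is one initial image-text input data set, and the number of rows equals the number of initial training image-text data sets, matching the equal-count requirement stated above.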
In yet another alternative embodiment, as shown in fig. 4, the determining module 303 is further configured to, after determining, for each text encoding vector, an image encoding vector matching the text encoding vector from among all the image encoding vectors and determining the text encoding vector and the matched image encoding vector as an initial training image-text data set, and before the input module 302 inputs all the initial training image-text data sets into the preset initial image-text matching model to obtain the initial training data output result, determine at least two first training image-text data sets from among all the initial training image-text data sets;
the device also includes:
a combination module 306, configured to recombine the text data and the image data included in all the first training image-text data sets to obtain second training image-text data sets, where data included in each first training image-text data set is different from data included in each second training image-text data set;
the determining module 303 is further configured to determine all remaining training image-text data sets in all the initial training image-text data sets except all the first training image-text data sets and all the second training image-text data sets as target training image-text data sets;
the input module 302 inputs all the initial training image-text data sets to a preset initial image-text matching model, and the manner of obtaining the initial training data output result specifically includes:
and inputting all the image-text data sets for target training into a preset initial image-text matching model to obtain an initial training data output result.
It can be seen that, by implementing the apparatus described in fig. 4, the text data and the image data included in the first training image-text data sets are recombined to obtain the second training image-text data sets, all the remaining training image-text data sets except all the first training image-text data sets in all the initial training image-text data sets, together with all the second training image-text data sets, are determined as the target training image-text data sets, and all the target training image-text data sets are input into the preset initial image-text matching model to obtain the initial training data output result, which can enrich the mismatched samples used for training, thereby being beneficial to improving the accuracy and reliability of training the obtained image-text matching model, and further being beneficial to realizing image-text mutual search and multi-mode data classification based on the image-text matching model.
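The recombination performed by the combination module can be sketched as rotating the images among the selected pairs, so that every recombined pair differs from every original pair. Rotation is one simple choice of recombination; the patent only requires that the data in each second training image-text data set differ from the data in each first training image-text data set.

```python
def recombine_pairs(pairs):
    """Build mismatched (negative) pairs from matched ones by shifting the
    images one position; for two or more distinct pairs, no recombined
    pair repeats an original pair."""
    texts = [t for t, _ in pairs]
    images = [v for _, v in pairs]
    shifted = images[1:] + images[:1]
    return list(zip(texts, shifted))

# first training image-text data sets (matched pairs)
first_sets = [("text1", "img1"), ("text2", "img2"), ("text3", "img3")]
# second training image-text data sets (deliberately mismatched pairs)
second_sets = recombine_pairs(first_sets)
```

The union of the remaining matched pairs and these mismatched pairs forms the target training image-text data sets, giving the matching model both positive and negative examples.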
In yet another alternative embodiment, as shown in fig. 4, when the initial loss information includes text reconstruction loss information, contrast learning loss information, and graph-text matching loss information, the manner of determining the initial loss information of the initial graph-text matching model by the determining module 303 based on the initial training data output result specifically includes:
for each text coding vector, determining target text data matched with the text coding vector from a training data set, determining text reconstruction loss information of the text coding vector according to the text coding vector and the target text data, and determining text reconstruction loss information according to the text reconstruction loss information of all the text coding vectors;
calculating a feature matching parameter between the text data output result and each image data output result aiming at the text data output result in each initial image-text output data set to obtain the feature matching parameter between each text data output result and each image data output result, and determining comparison learning loss information of an initial image-text matching model according to all the feature matching parameters;
determining image-text matching loss information of the initial image-text matching model according to the output result of the initial image-text data set and all image-text data sets for initial training;
and determining initial loss information of the initial image-text matching model based on the text reconstruction loss information, the contrast learning loss information and the image-text matching loss information.
It can be seen that, by implementing the apparatus described in fig. 4, the text reconstruction loss information of each text encoding vector can be determined and the overall text reconstruction loss information determined from all of them, a feature matching parameter between each text data output result and each image data output result can be calculated and the comparison learning loss information determined according to all the feature matching parameters, the image-text matching loss information can be determined according to the initial image-text data set output result and all the initial training image-text data sets, and the initial loss information of the initial image-text matching model can be determined based on the text reconstruction loss information, the comparison learning loss information, and the image-text matching loss information, which can improve the accuracy, reliability, and intelligence of determining the initial loss information, thereby being beneficial to improving the accuracy and reliability of the trained image-text matching model, and further being beneficial to realizing image-text mutual search and multi-mode data classification based on the image-text matching model.
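The combination of the three loss terms into the initial loss, and the training-completion check, can be sketched as follows. A weighted sum with equal default weights and a loss-threshold completion condition are both assumptions; the patent specifies neither the combination rule nor the completion condition.

```python
def initial_loss(text_recon_loss, contrastive_loss, match_loss,
                 weights=(1.0, 1.0, 1.0)):
    """Combine text reconstruction, contrast learning, and image-text
    matching losses into the initial loss (weighted sum assumed)."""
    w1, w2, w3 = weights
    return w1 * text_recon_loss + w2 * contrastive_loss + w3 * match_loss

def training_done(loss, threshold=0.05):
    """One possible training-completion condition: the initial loss falls
    below a preset threshold."""
    return loss < threshold

total = initial_loss(0.8, 1.2, 0.5)
```

When `training_done(total)` returns True, the initial image-text matching model would be kept as the target image-text matching model; otherwise training continues.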
In yet another alternative embodiment, as shown in fig. 4, the manner of determining the teletext matching loss information of the initial teletext matching model by the determination module 303 according to the output result of the initial teletext data set and all initial training teletext data sets specifically includes:
determining a first output image-text data set identical to the initial training image-text data set from all initial image-text data sets included in the output results of the initial image-text data sets based on the text data output results and the image data output results included in the output results of each initial image-text data set, and determining all output image-text data sets except all the first output image-text data sets as second output image-text data sets;
determining the output data matching degree of the initial image-text matching model according to all the first output image-text data sets, all the second output image-text data sets and all the image-text data sets for initial training;
and determining the image-text matching loss information of the initial image-text matching model according to the output data matching degree and the predetermined image-text matching function.
It can be seen that the implementation of the apparatus described in fig. 4 can determine the first output image-text data set and the second output image-text data set based on the text data output result and the image data output result, determine the output data matching degree of the initial image-text matching model according to all the first output image-text data sets, the second output image-text data set and all the initial training image-text data sets, determine the image-text matching loss information according to the output data matching degree and the image-text matching function, and can improve the accuracy and reliability of determining the image-text matching loss information, thereby improving the accuracy and reliability of determining the initial loss information, further being beneficial to improving the intelligence and accuracy of training the image-text matching model, and further being beneficial to realizing image-text mutual search and multi-mode data classification based on the image-text matching model.
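The image-text matching loss above can be sketched as a binary cross-entropy over the model's match predictions: label 1 for output sets that reproduce an initial training pair (first output image-text data sets), label 0 for recombined ones (second output image-text data sets). Binary cross-entropy is one common choice of image-text matching function; the patent does not name a specific function.

```python
import math

def match_loss(predicted_probs, labels):
    """Binary cross-entropy between predicted match probabilities and
    the 1/0 matched-vs-recombined labels."""
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for p, y in zip(predicted_probs, labels):
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(labels)

# Two matched pairs predicted confidently, one mismatched pair predicted low:
loss_good = match_loss([0.9, 0.95, 0.1], [1, 1, 0])
# The same labels with inverted predictions incur a much larger loss:
loss_bad = match_loss([0.1, 0.05, 0.9], [1, 1, 0])
```

The better the output data matching degree (predictions agreeing with the matched/recombined labels), the smaller this loss, which is the behavior the determining module relies on.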
In yet another alternative embodiment, as shown in fig. 4, the manner of determining, by the determining module 303, the comparative learning loss information of the initial image-text matching model according to all the feature matching parameters specifically includes:
for a text data output result in each initial image-text output data set, determining a key image data output result matched with the text data output result in the initial training image-text data set based on the initial training image-text data set, determining first matching information between the text data output result and the key image data output result, and determining second matching information between the text data output result and each other image data output result except the key image data output result;
and determining contrast learning loss information of the initial image-text matching model according to all the first matching information, all the second matching information and the predetermined contrast learning loss function.
It can be seen that the implementation of the apparatus described in fig. 4 can determine matching information between each text data output result and each image data output result, and determine contrast learning loss information of the initial image-text matching model according to all the matching information and a predetermined contrast learning loss function, which can improve the accuracy and reliability of obtaining the contrast learning loss information, thereby improving the accuracy and reliability of determining the initial loss information, further facilitating improvement of the intelligence and accuracy of training the image-text matching model, and further facilitating realization of image-text mutual search and multi-mode data classification based on the image-text matching model.
EXAMPLE IV
Referring to fig. 5, fig. 5 is a schematic structural diagram of another training apparatus for an image-text matching model disclosed in an embodiment of the present invention. As shown in fig. 5, the apparatus for training the image-text matching model may include:
a memory 401 storing executable program code;
a processor 402 coupled with the memory 401;
the processor 402 calls the executable program code stored in the memory 401 to execute the steps in the method for training the graph-text matching model described in the first embodiment of the present invention or the second embodiment of the present invention.
EXAMPLE V
The embodiment of the invention discloses a computer storage medium, which stores computer instructions, and when the computer instructions are called, the computer instructions are used for executing steps in a training method of an image-text matching model described in the first embodiment or the second embodiment of the invention.
EXAMPLE VI
An embodiment of the present invention discloses a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute the steps in the method for training a graph-text matching model described in the first embodiment or the second embodiment.
The above-described embodiments of the apparatus are merely illustrative, and the modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above detailed description of the embodiments, those skilled in the art will clearly understand that the embodiments may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. With this understanding in mind, the above technical solutions may essentially, or in the part contributing to the prior art, be embodied in the form of a software product, which may be stored in a computer-readable storage medium, including a Read-Only Memory (ROM), a Random Access Memory (RAM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-time Programmable Read-Only Memory (OTPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a magnetic disk, or any other computer-readable medium capable of storing data.
Finally, it should be noted that: the method and apparatus for training an image-text matching model disclosed in the embodiments of the present invention are only preferred embodiments of the present invention, and are only used for illustrating the technical solutions of the present invention, not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for training a graph-text matching model, the method comprising:
acquiring a training data set for image-text matching model training, wherein the training data set comprises a plurality of text data and a plurality of image data;
for each piece of text data, inputting the text data into a target text model to obtain a text coding vector, and for each piece of image data, inputting the image data into a target image model to obtain an image coding vector;
for each text coding vector, determining an image coding vector matched with the text coding vector from all the image coding vectors, and determining the text coding vector and the image coding vector matched with the text coding vector as an initial training image-text data set;
inputting all the image-text data groups for initial training into a preset initial image-text matching model to obtain an initial training data output result, and determining initial loss information of the initial image-text matching model based on the initial training data output result, wherein the initial loss information comprises one or more of text reconstruction loss information, contrast learning loss information and image-text matching loss information;
and judging whether the initial loss information meets training completion conditions or not, and determining the initial image-text matching model as a target image-text matching model when the judgment result is yes.
2. The method for training the teletext matching model according to claim 1, wherein after the training data set for teletext matching model training is obtained, before inputting the text data to a target text model for each of the text data to obtain a text coding vector, and inputting the image data to a target image model for each of the image data to obtain an image coding vector, the method further comprises:
performing a feature masking operation on each text data in the training data set to obtain feature masked text data;
and for each text data, inputting the text data into a target text model to obtain a text coding vector, including:
and inputting the characteristic masking text data into a target text model to obtain a text coding vector aiming at each characteristic masking text data, wherein the text coding vector comprises predicted text data of the characteristic masking text data.
3. The method for training the image-text matching model according to claim 2, wherein the step of inputting all the image-text data sets for initial training into a preset initial image-text matching model to obtain an initial training data output result comprises:
executing splicing operation on the text coding vector and the image coding vector included in each initial training image-text data set to obtain an initial image-text input data set;
inputting the initial image-text input data set into an initial image-text matching model aiming at each initial image-text input data set to obtain an output result of the initial image-text data set;
determining an initial training data output result according to all the initial image-text data set output results;
the initial image-text data set output result comprises a plurality of initial image-text output data sets, the number of the initial image-text output data sets is equal to that of the initial training image-text data sets, and each initial image-text output data set comprises a text data output result and an image data output result.
4. The method for training the teletext matching model according to claim 3, wherein after determining, for each text encoding vector, an image encoding vector matching the text encoding vector from among all the image encoding vectors, and determining the text encoding vector and the image encoding vector matching the text encoding vector as an initial training teletext data set, the method further comprises, before inputting all the initial training teletext data sets into a preset initial teletext matching model to obtain an initial training data output result:
determining at least two first training image-text data sets from all the initial training image-text data sets, and recombining the text data and the image data included in all the first training image-text data sets to obtain second training image-text data sets, wherein the data included in each first training image-text data set is different from the data included in each second training image-text data set;
determining all the remaining training image-text data sets except all the first training image-text data set in all the initial training image-text data sets and all the second training image-text data sets as target training image-text data sets;
inputting all the image-text data sets for initial training into a preset initial image-text matching model to obtain an initial training data output result, wherein the method comprises the following steps:
and inputting all the image-text data sets for the target training to a preset initial image-text matching model to obtain an initial training data output result.
5. The method for training the teletext matching model according to claim 4, wherein when the initial loss information includes the text reconstruction loss information, the contrast learning loss information, and the teletext matching loss information, the determining the initial loss information of the initial teletext matching model based on the initial training data output result includes:
for each text coding vector, determining target text data matched with the text coding vector from the training data set, determining text reconstruction loss information of the text coding vector according to the text coding vector and the target text data, and determining text reconstruction loss information according to the text reconstruction loss information of all the text coding vectors;
calculating a feature matching parameter between the text data output result and each image data output result aiming at the text data output result in each initial image-text output data group to obtain a feature matching parameter between each text data output result and each image data output result, and determining comparison learning loss information of the initial image-text matching model according to all the feature matching parameters;
determining image-text matching loss information of the initial image-text matching model according to the output result of the initial image-text data set and all the image-text data sets for initial training;
and determining initial loss information of the initial image-text matching model based on the text reconstruction loss information, the contrast learning loss information and the image-text matching loss information.
6. The method for training the teletext matching model according to claim 5, wherein determining teletext matching loss information for the initial teletext matching model based on the initial teletext data set output result and all of the initial training teletext data sets comprises:
determining a first output image-text data set identical to the initial training image-text data set from all the initial image-text data sets included in the initial image-text data set output result based on the text data output result and the image data output result included in each of the initial image-text data set output results, and determining all the output image-text data sets except all the first output image-text data sets as a second output image-text data set;
determining the output data matching degree of the initial image-text matching model according to all the first output image-text data sets, all the second output image-text data sets and all the initial training image-text data sets;
and determining the image-text matching loss information of the initial image-text matching model according to the output data matching degree and a predetermined image-text matching function.
7. The method for training the graph-text matching model according to claim 6, wherein the determining the contrast learning loss information of the initial graph-text matching model according to all the feature matching parameters comprises:
for the text data output result in each initial image-text output data set, determining a key image data output result matched with the text data output result in the initial training image-text data set based on the initial training image-text data set, determining first matching information between the text data output result and the key image data output result, and determining second matching information between the text data output result and each other image data output result except the key image data output result;
and determining contrast learning loss information of the initial image-text matching model according to all the first matching information, all the second matching information and a predetermined contrast learning loss function.
8. An apparatus for training an image-text matching model, the apparatus comprising:
an acquisition module, configured to acquire a training data set for image-text matching model training, the training data set comprising a plurality of text data and a plurality of image data;
an input module, configured to input each text data into a target text model to obtain a text encoding vector for that text data, and to input each image data into a target image model to obtain an image encoding vector for that image data;
a determining module, configured to determine, for each text encoding vector, the image encoding vector matched with that text encoding vector among all the image encoding vectors, and to determine the text encoding vector and its matched image encoding vector as an initial training image-text data set;
the input module being further configured to input all the initial training image-text data sets into a preset initial image-text matching model to obtain an initial training data output result;
the determining module being further configured to determine initial loss information of the initial image-text matching model based on the initial training data output result, the initial loss information comprising one or more of text reconstruction loss information, contrastive learning loss information, and image-text matching loss information;
a judging module, configured to judge whether the initial loss information satisfies a training completion condition;
and the determining module being further configured to determine the initial image-text matching model as the target image-text matching model when the judgment result is yes.
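The module pipeline of claim 8 (acquire, encode, pair, feed to the matching model, check the loss against a completion condition) can be sketched as a single function. This is a hedged illustration only: the encoders, matcher, loss function, and update rule are stand-in callables, since the claim does not fix any of them, and all parameter names are hypothetical:

```python
def train_matching_model(texts, images, text_model, image_model,
                         match_fn, matching_model, loss_fn, update_fn,
                         loss_threshold=0.05, max_rounds=100):
    """Sketch of the claimed module pipeline."""
    # input module: encode every text and image
    text_vecs = [text_model(t) for t in texts]
    image_vecs = [image_model(im) for im in images]

    # determining module: pair each text vector with its matched image vector
    pairs = [(tv, match_fn(tv, image_vecs)) for tv in text_vecs]

    model = matching_model
    for _ in range(max_rounds):
        output = model(pairs)      # initial training data output result
        loss = loss_fn(output)     # e.g. a combination of the three losses
        # judging module: training completion condition
        if loss <= loss_threshold:
            return model           # target image-text matching model
        model = update_fn(model, loss)
    return model
```

The claim itself only recites the completion check, not the parameter update, so `update_fn` is an added placeholder for whatever optimizer the full specification uses.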
9. An apparatus for training an image-text matching model, the apparatus comprising:
a memory storing executable program code;
a processor coupled with the memory;
wherein the processor calls the executable program code stored in the memory to execute the method for training an image-text matching model according to any one of claims 1-7.
10. A computer storage medium storing computer instructions which, when invoked, are used to perform the method for training an image-text matching model according to any one of claims 1-7.
CN202211219395.5A 2022-10-08 2022-10-08 Training method and device of image-text matching model Active CN115292455B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211219395.5A CN115292455B (en) 2022-10-08 2022-10-08 Training method and device of image-text matching model

Publications (2)

Publication Number Publication Date
CN115292455A (en) 2022-11-04
CN115292455B (en) 2023-03-24

Family

ID=83834570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211219395.5A Active CN115292455B (en) 2022-10-08 2022-10-08 Training method and device of image-text matching model

Country Status (1)

Country Link
CN (1) CN115292455B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111897950A (en) * 2020-07-29 2020-11-06 北京字节跳动网络技术有限公司 Method and apparatus for generating information
US20210012150A1 (en) * 2019-07-11 2021-01-14 Xidian University Bidirectional attention-based image-text cross-modal retrieval method
CN113836333A (en) * 2021-09-18 2021-12-24 北京百度网讯科技有限公司 Training method of image-text matching model, method and device for realizing image-text retrieval
CN114419351A (en) * 2022-01-28 2022-04-29 深圳市腾讯计算机系统有限公司 Image-text pre-training model training method and device and image-text prediction model training method and device
CN114549935A (en) * 2022-02-25 2022-05-27 北京百度网讯科技有限公司 Information generation method and device
CN114564606A (en) * 2022-02-24 2022-05-31 北京字跳网络技术有限公司 Data processing method and device, electronic equipment and storage medium
WO2022155790A1 (en) * 2021-01-19 2022-07-28 华为技术有限公司 Picture/text processing method, related apparatus, and electronic device
CN114996502A (en) * 2022-06-23 2022-09-02 天津理工大学 Multi-task learning model combining image-text matching and visual reasoning, visual common sense reasoning method and computer equipment
US20220301285A1 (en) * 2021-03-19 2022-09-22 Alibaba (China) Co., Ltd. Processing picture-text data
CN115115914A (en) * 2022-06-07 2022-09-27 腾讯科技(深圳)有限公司 Information identification method, device and computer readable storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant