CN117671424A - Model training method, image description method, device, medium and equipment

Info

Publication number: CN117671424A
Application number: CN202311667084.XA
Authority: CN (China)
Original language: Chinese (zh)
Inventor: 刘晨阳 (Liu Chenyang)
Assignee: Agricultural Bank of China (current and original)
Legal status: Pending
Classification: Image Analysis

Abstract

The invention discloses a model training method, an image description method, a device, a medium and equipment. The method comprises the following steps: acquiring a sample image and a tag description text of the sample image, extracting image features of the sample image, and extracting text features of the tag description text; performing position embedding on the image features of a plurality of sample images to obtain image feature data, and performing position embedding on the text features of a plurality of tag description texts to obtain text feature data; acquiring an initial image description model, wherein the initial image description model comprises a variational encoder and a generator; and training the initial image description model based on the image feature data and the text feature data to obtain a trained target image description model. The target image description model obtained by training the initial image description model can realize one-to-many mapping from image to text: with a given image as the condition, any number of hidden space vectors can be sampled to obtain an arbitrary number of description texts.

Description

Model training method, image description method, device, medium and equipment
Technical Field
The present invention relates to the field of deep learning technologies, and in particular, to a model training method, an image description method and device, a medium, and equipment.
Background
Image description is a cross-modal semantic understanding and fusion technology that takes an image as input and, through a model and computation, enables a computer to output natural-language description text corresponding to the image content. Aiming at the fusion of the image and natural-language modalities, it can be applied in various scenarios such as visual retrieval, intelligent creation, recommendation systems, automatic association, and automatic generation.
In the process of realizing the invention, it was found that the prior art has at least the following technical problem: in wide-ranging application scenarios, the generated image description is required not only to express the image content more accurately and closely, but the algorithm model is also required to provide diverse choices for expressing the image. However, an existing image description algorithm model usually takes a single index, accuracy or diversity, as its research objective, and enhancing one index easily sacrifices the other, causing an imbalance between the accuracy and the diversity of image description.
Disclosure of Invention
The invention provides a model training method, an image description method, a device, a medium and equipment, so that both the accuracy and the diversity of image description are taken into account.
According to an aspect of the present invention, there is provided a training method of an image description model, including:
acquiring a sample image and a tag description text of the sample image, extracting image features of the sample image, and extracting text features of the tag description text;
performing position embedding on the image features of a plurality of sample images to obtain image feature data, and performing position embedding on the text features of a plurality of tag description texts to obtain text feature data;
acquiring an initial image description model, wherein the initial image description model comprises a variational encoder and a generator, and training the initial image description model based on the image feature data and the text feature data to obtain a trained target image description model: parameter adjustment is carried out on the initial image description model through a target loss function and/or a target reinforcement function in the training process, wherein the target loss function comprises one or more of a generated description sentence loss term, a divergence loss term between the first processing module and the second processing module, and a Gaussian loss term, and the target reinforcement function is determined based on a preset reward function and a reward baseline.
Optionally, the variational encoder includes a first processing module, a second processing module, and a third processing module; the output end of the third processing module is connected with the second processing module, and the output ends of the first processing module, the second processing module and the third processing module are connected with the generator through a splicing unit;
the target image description model comprises an encoder formed by the trained second processing module and the trained third processing module, and a trained generator.
Optionally, the first processing module and the second processing module each include at least one attention layer, a linear transformation and a sampling layer, wherein the linear transformation and the sampling layer are connected through a plurality of variational paths; the third processing module comprises at least a self-attention layer; the generator includes at least one attention layer, a linear transformation, and a prediction output layer.
Optionally, training the initial image description model based on the image feature data and the text feature data to obtain a trained target image description model includes: inputting the text feature data into the first processing module and the second processing module, and inputting the image feature data into the third processing module to obtain a description sentence generated by the generator; generating a loss function comprising one or more of a generated description sentence loss term, a divergence loss term, and a Gaussian loss term for the first processing module and the second processing module; and carrying out parameter adjustment on the initial image description model based on the loss function to obtain a trained target image description model.
Optionally, the method for generating the target reinforcement function includes: acquiring a reward function, determining a reward value of the generated description sentence based on the reward function, and determining a reward baseline based on the median of the maximum reward value and the minimum reward value; and generating the target reinforcement function based on the reward function and the reward baseline.
Optionally, extracting the image features of the sample image includes: inputting the sample image into a pre-trained feature extraction model to obtain the image features of the sample image; or identifying an image region in the sample image, carrying out frame-selection marking on the image region in the sample image, and carrying out feature extraction on the marked sample image based on a feature extraction model to obtain the image features of the sample image.
According to another aspect of the present invention, there is provided an image description method including:
acquiring an image to be processed, and extracting image features of the image to be processed;
inputting the image features of the image to be processed into a pre-trained target image description model to obtain a plurality of description texts of the image to be processed, wherein the target image description model is obtained based on the training method of the image description model provided by the embodiments of the present invention.
According to another aspect of the present invention, there is provided a training apparatus of an image description model, including:
the feature extraction module is used for acquiring a sample image and a tag description text of the sample image, extracting image features of the sample image and extracting text features of the tag description text;
the feature processing module is used for carrying out position embedding on the image features of the plurality of sample images to obtain image feature data, and carrying out position embedding on the text features of the plurality of tag description texts to obtain text feature data;
the model training module is used for acquiring an initial image description model, the initial image description model comprising a variational encoder and a generator, and training the initial image description model based on the image feature data and the text feature data to obtain a trained target image description model: parameter adjustment is carried out on the initial image description model through a target loss function and/or a target reinforcement function in the training process, wherein the target loss function comprises one or more of a generated description sentence loss term, a divergence loss term between the first processing module and the second processing module, and a Gaussian loss term, and the target reinforcement function is determined based on a preset reward function and a reward baseline.
According to another aspect of the present invention, there is provided an image description apparatus, comprising:
the image processing module is used for acquiring an image to be processed and extracting image features of the image to be processed;
the image description module is used for inputting the image features of the image to be processed into a pre-trained target image description model to obtain a plurality of description texts of the image to be processed, wherein the target image description model is obtained based on the training method of the image description model provided by the embodiments of the present invention.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to perform the training method of the image description model or the image description method according to any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement the training method of the image description model or the image description method according to any one of the embodiments of the present invention when executed.
According to the technical scheme, an initial image description model comprising a variational encoder and a generator is provided, and the target image description model obtained by training the initial image description model can realize one-to-many mapping from image to text: with a given image as the condition, any number of hidden space vectors can be sampled to obtain an arbitrary number of description texts. In the training process of the initial image description model, a plurality of loss terms are introduced into the target loss function, improving the accuracy of the loss function and thereby further improving the training effect of the model. A reinforcement learning process is added, and a reward baseline is set in the reinforcement learning process, so that the diversity of the output description texts can be improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a training method of an image description model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an initial image description model according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an initial image description model according to an embodiment of the present invention;
FIG. 4 is a flowchart of an image description method provided by an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a training device for an image description model according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an image description device according to an embodiment of the present invention.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings; it is apparent that the described embodiments are only some embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example I
Fig. 1 is a flowchart of a training method for an image description model according to an embodiment of the present invention, where the method may be applied to training an image description model that can achieve both output accuracy and output diversity, and the method may be performed by a training device for an image description model, where the training device for an image description model may be implemented in hardware and/or software, and the training device for an image description model may be configured in an electronic device such as a mobile terminal, a computer, or a server. As shown in fig. 1, the method includes:
S110, acquiring a sample image and a tag description text of the sample image, extracting image features of the sample image, and extracting text features of the tag description text.
S120, performing position embedding on the image features of a plurality of sample images to obtain image feature data, and performing position embedding on the text features of a plurality of tag description texts to obtain text feature data.
S130, acquiring an initial image description model, wherein the initial image description model comprises a variational encoder and a generator.
S140, training the initial image description model based on the image feature data and the text feature data to obtain a trained target image description model: parameter adjustment is carried out on the initial image description model through a target loss function and/or a target reinforcement function in the training process, wherein the target loss function comprises one or more of a generated description sentence loss term, a divergence loss term between the first processing module and the second processing module, and a Gaussian loss term, and the target reinforcement function is determined based on a preset reward function and a reward baseline.
In this embodiment, the sample image and the tag description text corresponding to the sample image are obtained. Optionally, the sample image may be captured in real time by an image capturing device, may be a pre-stored historical image, may be obtained from a search engine through a search instruction, or may be generated by an image generator; the manner of obtaining the sample image is not limited here.
The sample image may be an image including any image content, such as a person, an animal, a building, a landscape, or food. In some embodiments, the target image description model is intended for a specific service scene; accordingly, the sample images may be images of that specific service scene, and information interference from other scenes can be eliminated through the images of the specific service scene, so as to accelerate the training process for the specific service scene.
Optionally, an initial image set is acquired, and the sample images in the initial image set are subjected to enhancement processing to obtain extended sample images, where the initial sample images and the extended sample images form a target image set. The enhancement processing of a sample image comprises one or more of rotation, flipping, mirroring, cropping, and the like. Expanding the number of sample images through image enhancement speeds up the acquisition of sample images.
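Illustratively, the enhancement processing may be sketched as follows, assuming PIL-style images and the torchvision functional transforms; the rotation angle range and the crop size are illustrative values, not parameters of the invention.

```python
# Minimal sketch of the sample-image enhancement step, assuming torchvision;
# the angle range and crop size below are illustrative assumptions.
import random
from torchvision.transforms import functional as F

def enhance(sample_image):
    """Return extended sample images derived from one initial sample image."""
    return [
        F.rotate(sample_image, angle=random.uniform(-30, 30)),  # rotation
        F.hflip(sample_image),                                   # mirroring
        F.vflip(sample_image),                                   # flipping
        F.center_crop(sample_image, output_size=[224, 224]),     # cropping
    ]
```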
The tag description text is a description text of a sample image; it can be edited by an expert based on the sample image, or generated based on a pre-trained single image description model that generates one description text for an input sample image. In some embodiments, quality judgment can be performed on the tag description text of a sample image through a description text quality judgment model to determine the quality data of the tag description text, and the tag description text of the sample image is marked for updating when the quality data is less than a quality threshold, so as to reduce low-quality description texts and avoid their influence on the training effect of the target image description model. Specifically, the sample image and the tag description text may be input into the description text quality judgment model to obtain the quality data output by the model. The quality data may be a value between 0 and 1, with higher quality data characterizing higher quality of the tag description text. The marked sample image and tag description text may be displayed to prompt an update of the tag description text, and the quality of the updated tag description text is judged again until the quality data of the updated tag description text is greater than or equal to the quality threshold.
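Illustratively, this quality-judgment loop may take the following form; the quality model, the re-annotation hook, and the threshold value of 0.6 are stand-ins for components the text leaves unspecified.

```python
# Hedged sketch of the tag-text quality check; quality_model, reannotate,
# and the 0.6 threshold are illustrative assumptions, not fixed by the text.
QUALITY_THRESHOLD = 0.6  # quality data lies between 0 and 1

def ensure_tag_quality(sample_image, tag_text, quality_model, reannotate):
    score = quality_model(sample_image, tag_text)
    while score < QUALITY_THRESHOLD:
        # Display the marked pair and prompt for an updated tag description.
        tag_text = reannotate(sample_image, tag_text)
        score = quality_model(sample_image, tag_text)  # judge again
    return tag_text, score
```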
The sample image and the tag description text are each provided with an association identifier so that the sample image and the tag description text have a corresponding relationship, where the association identifier can be a serial number, a character string, or the like that uniquely marks the pair.
Feature extraction is performed on each sample image and each tag description text to obtain the image features of the sample image and the text features of the tag description text, where the image features and the text features may take the form of feature vectors and meet the input conditions of the initial image description model.
In some embodiments, extracting image features of the sample image comprises: and inputting the sample image into a pre-trained feature extraction model to obtain the image features of the sample image. The feature extraction model may be a machine learning model such as a neural network model, for example, a convolutional neural network, and the model structure of the feature extraction model is not limited herein. The feature extraction model has a feature extraction function, can realize end-to-end feature extraction of an input sample image, and simplifies the extraction process of image features.
In some embodiments, extracting image features of the sample image comprises: identifying an image region in the sample image, carrying out frame-selection marking on the image region, and carrying out feature extraction on the marked sample image based on a feature extraction model to obtain the image features of the sample image. Since a sample image includes a variety of content, interference information may exist in it, for example in the background. To improve feature accuracy, one or more image regions in the sample image may be marked. A marked image region may be a region of interest of the sample image, such as an object of interest including, but not limited to, a person, an animal, or a building. The marking of an image region may be a marking frame whose framed area is taken as the image region, and the marking frame may be the minimum bounding box of the object of interest.
The sample image provided with the frame-selection mark is input into the feature extraction model to obtain the image features output by the feature extraction model. The frame-selection mark provides a reference for the feature extraction process, so that interfering parts of the image can be excluded, the computation of the feature extraction process is reduced, targeted feature extraction is achieved, and the accuracy of the image features is improved.
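Illustratively, both extraction routes may be sketched as follows, assuming a torchvision ResNet-50 stands in for the pre-trained feature extraction model and that the frame-selection mark is burned into the image as drawn boxes; both choices are assumptions for illustration only.

```python
# Sketch of image feature extraction; the ResNet-50 backbone and the
# box-drawing step are assumptions standing in for the unspecified
# feature extraction model and frame-selection mark.
import torch
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.utils import draw_bounding_boxes

weights = ResNet50_Weights.DEFAULT
backbone = resnet50(weights=weights)
backbone.fc = torch.nn.Identity()  # expose the pooled 2048-d feature vector
backbone.eval()
preprocess = weights.transforms()

@torch.no_grad()
def extract_image_features(image_u8, boxes=None):
    """image_u8: uint8 CHW tensor; boxes: optional (N, 4) xyxy region marks."""
    if boxes is not None:
        # Frame-selection marking: draw the regions of interest so the
        # extractor has a reference and interference can be suppressed.
        image_u8 = draw_bounding_boxes(image_u8, boxes, colors="red")
    batch = preprocess(image_u8).unsqueeze(0)
    return backbone(batch)  # (1, 2048) image features
```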
In some embodiments, the tag description text may be feature-coded by a text encoding model to obtain text features, which may be the embedding features of the text. Each description text is converted into text features of a particular sequence length; the text encoding model is not limited here and may be a machine learning model such as a neural network.
The sample images and the tag description texts have corresponding relationships; accordingly, the text features and the image features also have corresponding relationships, and position embedding is performed on the text features and the image features respectively based on these relationships to obtain text feature data and image feature data. The text feature data and the image feature data may each be in matrix form, where text features and image features that have a corresponding relationship are aligned at matching positions in the text feature data and the image feature data.
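Illustratively, the position-embedding step may be sketched with learned positional embeddings added to each feature sequence; the feature dimension and sequence lengths below are illustrative assumptions.

```python
# Sketch of position embedding over paired feature sequences; the 512-d
# features and sequence lengths are illustrative assumptions.
import torch
import torch.nn as nn

class PositionEmbedding(nn.Module):
    def __init__(self, max_len: int, dim: int):
        super().__init__()
        self.pos = nn.Embedding(max_len, dim)

    def forward(self, features):  # features: (batch, seq_len, dim)
        idx = torch.arange(features.size(1), device=features.device)
        return features + self.pos(idx)  # broadcast over the batch

embed = PositionEmbedding(max_len=64, dim=512)
text_feature_data = embed(torch.randn(8, 20, 512))   # aligned by association id
image_feature_data = embed(torch.randn(8, 36, 512))  # e.g. 36 region features
```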
Model training is performed on the initial image description model based on the text feature data and the image feature data to obtain the target image description model. The initial image description model includes a variational encoder and a generator. The variational encoder comprises a variational structure and optionally comprises a first processing module, a second processing module and a third processing module; the output end of the third processing module is connected with the second processing module, and the output ends of the first processing module, the second processing module and the third processing module are connected with the generator through a splicing unit. Referring to fig. 2, fig. 2 is a schematic structural diagram of an initial image description model according to an embodiment of the present invention.
The first processing module and the second processing module each comprise a variational structure. The first processing module can be a prior processing module and the second processing module a posterior processing module, where the first processing module can obtain a prior sequence hidden variable after training and the second processing module can obtain a posterior sequence hidden variable after training. The output ends of the first processing module and the second processing module are each connected with the output end of the third processing module through the splicing unit in a data-splicing manner, so that the prior and/or posterior sequence hidden variables can be fused into the image features. These hidden variables provide an information reference for the generation of the description text, so that one-to-many mapping between an image and description texts can be realized, the problem of single mapping between an image and a description text is solved, and diversity of image description is achieved.
Iterative training is performed on the initial image description model: a loss function and/or a target reinforcement function is generated in each iteration, parameter adjustment is performed on the initial image description model through the loss function and/or the target reinforcement function, and the target image description model is obtained when the training end condition is met. The target image description model can be applied in the inference stage to generate diversified description texts for a target image. Because tag description data of the target image does not exist in the inference stage, i.e., prior information does not exist, the target image description model correspondingly does not comprise the first processing module; that is, the target image description model comprises an encoder formed by the trained second processing module and the trained third processing module, and the trained generator. The output end of the third processing module is connected with the input end of the second processing module, and the output end of the third processing module is connected with the generator through the splicing unit.
On the basis of the above embodiment, the first processing module and the second processing module each comprise at least one attention layer, a linear transformation and a sampling layer, wherein the linear transformation and the sampling layer are connected through a plurality of variational paths; the third processing module comprises at least a self-attention layer; and the generator includes at least one attention layer, a linear transformation, and a prediction output layer. The at least one attention layer may comprise a self-attention layer, a cross-constrained multi-head attention layer, a multi-head attention layer, and the like, and the attention layers in the first processing module or the second processing module may be one or more. Setting a plurality of variational paths in the first processing module and the second processing module can resolve the loss of contextual semantics caused by information inconsistency between input sentences.
In some embodiments, the first processing module comprises a self-attention layer, a linear transformation and a sampling layer connected in sequence, wherein the linear transformation and the sampling layer are connected through a plurality of variational paths. The second processing module comprises a cross-constrained multi-head attention layer, a linear transformation and a sampling layer connected in sequence, wherein the linear transformation and the sampling layer are connected through a plurality of variational paths. The third processing module comprises a self-attention layer, and the output end of the self-attention layer in the third processing module is connected with the input end of the multi-head attention layer in the second processing module.
The generator comprises a cross-constrained multi-head attention layer, a linear transformation and a prediction output layer connected in sequence, where the prediction output layer can be a softmax processing layer. The output data of the first processing module, the second processing module and the third processing module are spliced and then input into the cross-constrained multi-head attention layer in the generator. Referring to fig. 3, fig. 3 is a schematic structural diagram of an initial image description model according to an embodiment of the present invention.
The second processing module in the target image description model comprises a multi-head attention layer, a linear transformation and a sampling layer, wherein the linear transformation is connected with the sampling layer through a plurality of variational paths. A third processing module in the target image description model includes a self-attention layer, and the generator includes a cross-constrained multi-head attention layer, a linear transformation, and a prediction output layer.
In some embodiments, the target image description model has the same structure as the initial image description model: the initial image description model contains model parameters to be trained, and the target image description model contains the trained model parameters. In the training process, the output data of the first processing module, the second processing module and the third processing module are spliced and then input to the generator. In the inference stage, the output data of the second processing module and the third processing module are spliced and then input to the generator.
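Illustratively, the module layout described above may be condensed into the following PyTorch sketch; standard multi-head attention stands in for the cross-constrained attention layers, the variational paths are fused by simple averaging, and all sizes are illustrative. This is a sketch of the structure, not the patented configuration.

```python
# Condensed sketch of the initial image description model's layout;
# nn.MultiheadAttention replaces the cross-constrained layers, and
# path fusion by averaging is an assumption.
import torch
import torch.nn as nn

class VariationalPath(nn.Module):
    """One linear transformation -> reparameterized sampling layer."""
    def __init__(self, dim):
        super().__init__()
        self.mu = nn.Linear(dim, dim)
        self.logvar = nn.Linear(dim, dim)

    def forward(self, h):
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # sampling layer
        return z, mu, logvar

class ProcessingModule(nn.Module):
    """Attention layer -> several variational paths (first/second module)."""
    def __init__(self, dim, n_paths=2, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.paths = nn.ModuleList([VariationalPath(dim) for _ in range(n_paths)])

    def forward(self, query, memory=None):
        memory = query if memory is None else memory  # self- vs cross-attention
        h, _ = self.attn(query, memory, memory)
        zs = [path(h)[0] for path in self.paths]
        return torch.stack(zs).mean(0)  # fuse the variational paths

class InitialImageDescriptionModel(nn.Module):
    def __init__(self, dim=512, vocab=10000):
        super().__init__()
        self.first = ProcessingModule(dim)   # prior module (text self-attn)
        self.second = ProcessingModule(dim)  # posterior module (image-conditioned)
        self.third = nn.MultiheadAttention(dim, 8, batch_first=True)  # image self-attn
        self.gen_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.out = nn.Linear(dim, vocab)     # linear -> softmax prediction output

    def forward(self, text_feats, image_feats):
        img, _ = self.third(image_feats, image_feats, image_feats)
        z_prior = self.first(text_feats)             # dropped at inference
        z_post = self.second(text_feats, memory=img)
        spliced = torch.cat([z_prior, z_post, img], dim=1)  # splicing unit
        h, _ = self.gen_attn(spliced, spliced, spliced)
        return self.out(h)  # token scores; softmax applied at decode/loss time
```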
In the training process of the initial image description model, parameter adjustment is carried out on the initial image description model through the target loss function, and reinforcement learning is carried out on the initial image description model through the target reinforcement function, which can improve the model performance of the target image description model.
In some embodiments, the training process for the initial image description model may include: inputting the text feature data into the first processing module and the second processing module, and inputting the image feature data into the third processing module to obtain a description sentence generated by the generator; generating a loss function comprising one or more of a generated description sentence loss term, a divergence loss term, and a Gaussian loss term for the first processing module and the second processing module; and carrying out parameter adjustment on the initial image description model based on the loss function to obtain a trained target image description model.
In this embodiment, a plurality of loss terms are introduced into the loss function, so that training of a model can be accelerated, and a model training effect can be improved. Illustratively, the loss function may be expressed by the following formula:
$$\mathcal{L}(\theta,\phi)=-\mathbb{E}\big[\log p_{\theta}(s\mid z,I)\big]+\sum_{i=1}^{k}D_{kl}\big(p_{i}\,\|\,q_{i}\big)+\gamma\,\mathcal{L}_{\mathrm{Gauss}}\big(\omega,\hat{\omega}\big)$$

wherein $\mathbb{E}[\log p_{\theta}(s\mid z,I)]$ is the expectation of the log-conditional likelihood $\log p_{\theta}(s\mid z,I)$ of the description sentence generated by the generator, $D_{kl}(p_{i}\|q_{i})$ is the KL divergence between the first processing module $p$ and the second processing module $q$, $\theta$ is a model parameter in the second processing module, $\phi$ is a model parameter in the first processing module, $z$ is a hidden variable, $s$ is the generated description sentence, $I$ denotes the image, and $\gamma$ is a preset hyperparameter. The first processing module $p$ and the second processing module $q$ are Gaussian mixture distributions with the same number $k$ of components; $\omega_{i}$ is the prior probability of each component $i$ in the first processing module $p$, and $\hat{\omega}_{i}$ is the prior probability of each component $i$ in the second processing module $q$, which the Gaussian loss term $\mathcal{L}_{\mathrm{Gauss}}$ constrains.
Here $z$ is the prior hidden variable and $\hat{z}$ is the posterior hidden variable; the log-likelihood accumulates over the sequence as $\log p_{\theta}(s\mid z,I)=\sum_{t=1}^{V}\log p_{\theta}(s_{t}\mid z,I,s_{<t})$, where $V$ is the sequence length of the generated description sentence and $t$ is the time step.
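Illustratively, this loss may be sketched in code using the symbols above; the closed-form diagonal-Gaussian KL and the squared-error form of the Gaussian term are assumptions, since the text only names the terms.

```python
# Sketch of the target loss: sentence term + per-component KL term +
# gamma-scaled Gaussian term. The squared-error Gaussian term is an
# assumed concrete form; the text only names the term.
import torch
import torch.nn.functional as F

def kl_diag_gaussians(mu_p, logvar_p, mu_q, logvar_q):
    """Closed-form KL(p_i || q_i) between diagonal Gaussians."""
    return 0.5 * (logvar_q - logvar_p
                  + (logvar_p.exp() + (mu_p - mu_q) ** 2) / logvar_q.exp()
                  - 1.0).sum(-1).mean()

def target_loss(logits, target_ids, prior_stats, post_stats, w_p, w_q, gamma):
    # Generated description sentence term: -E[log p_theta(s | z, I)].
    sent = F.cross_entropy(logits.transpose(1, 2), target_ids)
    # Divergence term summed over the k mixture components.
    kl = sum(kl_diag_gaussians(mp, lp, mq, lq)
             for (mp, lp), (mq, lq) in zip(prior_stats, post_stats))
    # Gaussian term on the component prior probabilities omega / omega-hat.
    gauss = ((w_p - w_q) ** 2).sum()
    return sent + kl + gamma * gauss
```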
On the basis of the above embodiment, reinforcement learning may also be performed based on the target reinforcement function during the above training process. During reinforcement learning, some prediction results are generally more likely to be sampled because of their higher reward scores, which reduces the diversity of the prediction results. In this embodiment, on the basis of setting the reward function, a reward baseline is set, and the target reinforcement function is determined through the reward function and the reward baseline, so that more inference results are retained and diversity is improved.
Optionally, the method for generating the target reinforcement function includes: acquiring a reward function, determining the reward value of a generated description sentence based on the reward function, and determining the reward baseline based on the median of the maximum reward value and the minimum reward value; the target reinforcement function is then generated based on the reward function and the reward baseline, and may be the difference between the reward function and the reward baseline. By setting the reward baseline, a distance-to-median reward is introduced while the original global reward information is kept: deviating samples are penalized by their average distance to the median, and samples that deviate from the global reward baseline are discarded, so that more samples with high prediction scores close to the median are retained, keeping as many inference results as possible while ensuring the accuracy of semantic inference.
Illustratively, the reward baseline may be expressed as:

$$b=\frac{r_{\max}+r_{\min}}{2}$$

wherein $r_{\max}$ and $r_{\min}$ are the maximum and minimum reward values among the generated description sentences, and the reward value $r(s)$ of a generated description sentence $s$ may be determined based on the reward function, for example from the difference between the generated description sentence $s$ and the tag description text. The reward function may be provided with reward conditions and sub-functions: a reward condition may be a judging condition on the difference between the generated description sentence $s$ and the tag description text, different reward conditions correspond to different sub-functions, and the corresponding reward value may be determined by identifying the reward condition met by the generated description sentence $s$ and applying the sub-function corresponding to that reward condition.
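Under these definitions, the baseline and the resulting reinforcement signal may be sketched as follows; the policy-gradient-style update term is an illustrative use, not necessarily the exact objective of the invention.

```python
# Sketch of the reward baseline (median of max and min reward) and the
# target reinforcement function (reward minus baseline); the final
# policy-gradient term is an illustrative assumption.
import torch

def reinforcement_loss(rewards: torch.Tensor, log_probs: torch.Tensor):
    """rewards, log_probs: (num_sampled_sentences,) for one image."""
    baseline = (rewards.max() + rewards.min()) / 2  # reward baseline b
    advantage = rewards - baseline                  # target reinforcement function
    # Samples far below the global baseline are penalized, so diverse
    # high-scoring description sentences are retained.
    return -(advantage.detach() * log_probs).mean()
```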
In some embodiments, reinforcement learning based on the target reinforcement function and training based on the target loss function are combined, and the iteration cycle lengths of the two can differ: the initial image description model is parameter-adjusted based on the target loss function in each iteration, and reinforcement learning is performed on the initial image description model based on the target reinforcement function after every n iterations of training.
According to the technical scheme of this embodiment, an initial image description model comprising a variational encoder and a generator is provided, and the target image description model obtained by training the initial image description model can realize one-to-many mapping from image to text: with a given image as the condition, any number of hidden space vectors can be sampled to obtain an arbitrary number of description texts. In the training process of the initial image description model, a plurality of loss terms are introduced into the target loss function, improving the accuracy of the loss function and thereby further improving the training effect of the model. A reinforcement learning process is added, and a reward baseline is set in the reinforcement learning process, so that the diversity of the output description texts can be improved.
Example II
Fig. 4 is a flowchart of an image description method according to a second embodiment of the present invention. As shown in fig. 4, the method includes:
S210, acquiring an image to be processed, and extracting image features of the image to be processed.
S220, inputting the image features of the image to be processed into a pre-trained target image description model to obtain a plurality of description texts of the image to be processed, wherein the target image description model is obtained based on the training method of the image description model provided by the embodiments of the present invention.
In this embodiment, the pre-trained target image description model includes a variational encoder and a generator. The variational encoder includes the second processing module for providing the posterior hidden variable; with the image feature map output by the third processing module as the condition, any number of hidden space vectors are sampled, and after the plurality of hidden space vectors are spliced with the image feature map output by the third processing module, an arbitrary number of description texts can be obtained through the processing of the generator.
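Illustratively, this sampling loop may be sketched as follows, reusing the illustrative model from the architecture sketch in Example I; greedy argmax decoding stands in for the real prediction output layer.

```python
# Sketch of inference-time sampling: each pass through the posterior
# module draws fresh hidden space vectors, yielding a distinct text.
# Reuses the illustrative InitialImageDescriptionModel sketched earlier;
# greedy decoding is an assumption.
import torch

@torch.no_grad()
def describe(model, image_feats, num_texts=5):
    img, _ = model.third(image_feats, image_feats, image_feats)
    token_ids = []
    for _ in range(num_texts):
        z = model.second(img, memory=img)     # sample hidden space vectors
        spliced = torch.cat([z, img], dim=1)  # splice with the feature map
        h, _ = model.gen_attn(spliced, spliced, spliced)
        token_ids.append(model.out(h).argmax(-1))  # one description text
    return token_ids
```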
In some embodiments, the output of the image description model includes a plurality of description texts and quality data for each description text. After the plurality of description texts are obtained, they can be displayed together with the quality data of each description text for screening by a user. Alternatively, a preset number of description texts are screened based on the quality data of each description text; that is, the description texts are sorted by their quality data, and the description texts at preset ranking positions are selected.
In some embodiments, the plurality of description texts output by the image description model are a preset number of description texts screened based on the quality data.
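Illustratively, the screening step may take the following form; the preset number is an illustrative value.

```python
# Sketch of screening a preset number of description texts by quality data.
def screen_descriptions(texts, quality_data, preset_number=3):
    ranked = sorted(zip(texts, quality_data), key=lambda p: p[1], reverse=True)
    return [text for text, _ in ranked[:preset_number]]
```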
According to the technical scheme of this embodiment, both the accuracy and the diversity of image description can be taken into account through the target image description model, and a plurality of high-precision description texts corresponding to the target image are obtained.
Example III
Fig. 5 is a schematic structural diagram of a training device for an image description model according to a third embodiment of the present invention. As shown in fig. 5, the apparatus includes:
a feature extraction module 310, configured to obtain a sample image and a tag description text of the sample image, extract image features of the sample image, and extract text features of the tag description text;
the feature processing module 320 is configured to perform position embedding on the image features of a plurality of sample images to obtain image feature data, and perform position embedding on the text features of a plurality of tag description texts to obtain text feature data;
a model training module 330, configured to acquire an initial image description model, the initial image description model comprising a variational encoder and a generator, and to train the initial image description model based on the image feature data and the text feature data to obtain a trained target image description model: parameter adjustment is carried out on the initial image description model through a target loss function and/or a target reinforcement function in the training process, wherein the target loss function comprises one or more of a generated description sentence loss term, a divergence loss term between the first processing module and the second processing module, and a Gaussian loss term, and the target reinforcement function is determined based on a preset reward function and a reward baseline.
According to the technical scheme, an initial image description model comprising a variational encoder and a generator is provided, and the target image description model obtained by training the initial image description model can realize one-to-many mapping from image to text: with a given image as the condition, any number of hidden space vectors can be sampled to obtain an arbitrary number of description texts. In the training process of the initial image description model, a plurality of loss terms are introduced into the target loss function, improving the accuracy of the loss function and thereby further improving the training effect of the model. A reinforcement learning process is added, and a reward baseline is set in the reinforcement learning process, so that the diversity of the output description texts can be improved.
On the basis of the above embodiment, optionally, the variational encoder includes a first processing module, a second processing module, and a third processing module; the output end of the third processing module is connected with the second processing module, and the output ends of the first processing module, the second processing module and the third processing module are connected with the generator through a splicing unit;
the target image description model comprises an encoder formed by the second processing module and the third processing module after training and a generator after training.
Optionally, the first processing module and the second processing module each include at least one attention layer, a linear transformation and a sampling layer, wherein the linear transformation and the sampling layer are connected through a plurality of variational paths; the third processing module comprises at least a self-attention layer; the generator includes at least one attention layer, a linear transformation, and a prediction output layer.
On the basis of the above embodiment, optionally, the model training module 330 is configured to input the text feature data to the first processing module and the second processing module, and input the image feature data to the third processing module, so as to obtain a description sentence generated by the generator; generate a loss function comprising one or more of a generated description sentence loss term, a divergence loss term, and a Gaussian loss term for the first processing module and the second processing module; and carry out parameter adjustment on the initial image description model based on the loss function to obtain a trained target image description model.
Optionally, the model training module 330 is further configured to acquire a reward function, determine the reward value of a generated description sentence based on the reward function, and determine the reward baseline based on the median of the maximum reward value and the minimum reward value; and generate the target reinforcement function based on the reward function and the reward baseline.
On the basis of the above embodiment, optionally, the feature extraction module 310 is configured to input the sample image into a feature extraction model trained in advance, so as to obtain an image feature of the sample image; or identifying an image area in the sample image, carrying out frame selection marking on the image area in the sample image, and carrying out feature extraction on the marked sample image based on a feature extraction model to obtain the image features of the sample image.
The training device for the image description model provided by the embodiment of the invention can execute the training method for the image description model provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example IV
Fig. 6 is a schematic structural diagram of an image description device according to a fourth embodiment of the present invention. As shown in fig. 6, the apparatus includes:
an image processing module 410, configured to obtain an image to be processed, and extract image features of the image to be processed;
the image description module 420 is configured to input the image features of the image to be processed into a pre-trained target image description model to obtain a plurality of description texts of the image to be processed, wherein the target image description model is obtained based on the training method of the image description model provided by any embodiment of the present invention.
On the basis of the above embodiment, optionally, the output of the image description model includes a plurality of description texts and quality data of each of the description texts;
the device also comprises a description text processing module, a display module and a display module, wherein the description text processing module is used for displaying the plurality of description texts and quality data of each description text; and/or screening a preset number of descriptive texts based on the quality data of each descriptive text.
The image description device provided by the embodiment of the invention can execute the image description method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example V
Fig. 7 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention. The electronic device 10 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic equipment may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 7, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the respective methods and processes described above, for example, a training method of an image description model or an image description method.
In some embodiments, the training method of the image description model or the image description method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the training method of the image description model or the image description method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the training method of the image description model or the image description method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs, where the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
The computer program for implementing the training method of the image description model or the image description method of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowcharts and/or block diagrams to be implemented. A computer program may execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
Example VI
The sixth embodiment of the present invention also provides a computer-readable storage medium storing computer instructions for causing a processor to execute the training method of an image description model or the image description method, the training method comprising:
acquiring a sample image and a tag description text of the sample image, extracting image features of the sample image, and extracting text features of the tag description text; performing position embedding on the image features of a plurality of sample images to obtain image feature data, and performing position embedding on the text features of a plurality of tag description texts to obtain text feature data; acquiring an initial image description model, wherein the initial image description model comprises a variational encoder and a generator, and training the initial image description model based on the image feature data and the text feature data to obtain a trained target image description model: parameter adjustment is carried out on the initial image description model through a target loss function and/or a target reinforcement function in the training process, wherein the target loss function comprises one or more of a generated description sentence loss term, a divergence loss term between the first processing module and the second processing module, and a Gaussian loss term, and the target reinforcement function is determined based on a preset reward function and a reward baseline.
Alternatively, the computer instructions are for causing the processor to perform the image description method, comprising: acquiring an image to be processed, and extracting image features of the image to be processed; inputting the image features of the image to be processed into a pre-trained target image description model to obtain a plurality of description texts of the image to be processed, wherein the target image description model is obtained based on the training method of the image description model provided by any embodiment of the present invention.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and a server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability in traditional physical hosts and VPS services.
It should be appreciated that steps may be reordered, added, or deleted in the various flows shown above. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved; the present invention imposes no limitation in this regard.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method of training an image description model, comprising:
acquiring a sample image and a tag description text of the sample image, extracting image features of the sample image, and extracting text features of the tag description text;
performing position embedding on the image features of a plurality of sample images to obtain image feature data, and performing position embedding on the text features of a plurality of tag description texts to obtain text feature data;
acquiring an initial image description model, wherein the initial image description model comprises a variation encoder and a generator;
training the initial image description model based on the image feature data and the text feature data to obtain a trained target image description model, wherein parameters of the initial image description model are adjusted during training through a target loss function and/or a target reinforcement function, the target loss function comprises one or more of a description statement loss term, a divergence data loss term, and a Gaussian loss term generated by the first processing module and the second processing module, and the target reinforcement function is determined based on a preset reward function and a reward baseline.
2. The method of claim 1, wherein the variation encoder comprises a first processing module, a second processing module, and a third processing module; an output end of the third processing module is connected to the second processing module, and output ends of the first processing module, the second processing module, and the third processing module are connected to the generator through a splicing (concatenation) unit;
the target image description model comprises an encoder formed by the trained second processing module and the trained third processing module, together with the trained generator.
3. The method of claim 2, wherein the first processing module and the second processing module each comprise at least one attention layer, a linear transformation layer, and a sampling layer, wherein the linear transformation layer and the sampling layer are connected by a plurality of variational paths;
the third processing module comprises at least a self-attention layer; and
the generator comprises at least one attention layer, a linear transformation layer, and a prediction output layer.
4. The method of claim 2, wherein training the initial image description model based on the image feature data and the text feature data to obtain a trained target image description model comprises:
inputting the text feature data into the first processing module and the second processing module, and inputting the image feature data into the third processing module, to obtain a description statement generated by the generator;
generating a loss function comprising one or more of a description statement loss term, a divergence data loss term, and a Gaussian loss term generated by the first processing module and the second processing module; and
adjusting parameters of the initial image description model based on the loss function to obtain the trained target image description model.
5. The method of claim 2, wherein generating the target reinforcement function comprises:
acquiring a reward function, determining a reward value of each generated description statement based on the reward function, and determining a reward baseline as the median of the maximum reward value and the minimum reward value; and
generating the target reinforcement function based on the reward function and the reward baseline.
6. An image description method, comprising:
acquiring an image to be processed, and extracting image features of the image to be processed;
inputting the image features of the image to be processed into a pre-trained target image description model to obtain a plurality of description texts of the image to be processed, wherein the target image description model is obtained by the training method of the image description model according to any one of claims 1-5.
7. A training device for an image description model, comprising:
the feature extraction module is used for acquiring a sample image and a tag description text of the sample image, extracting image features of the sample image and extracting text features of the tag description text;
the feature processing module is used for performing position embedding on the image features of the plurality of sample images to obtain image feature data, and performing position embedding on the text features of the plurality of tag description texts to obtain text feature data;
the model training module is used for acquiring an initial image description model, the initial image description model comprising a variation encoder and a generator, and for training the initial image description model based on the image feature data and the text feature data to obtain a trained target image description model, wherein parameters of the initial image description model are adjusted during training through a target loss function and/or a target reinforcement function, the target loss function comprises one or more of a description statement loss term, a divergence data loss term, and a Gaussian loss term generated by the first processing module and the second processing module, and the target reinforcement function is determined based on a preset reward function and a reward baseline.
8. An image description apparatus, comprising:
the image processing module is used for acquiring an image to be processed and extracting image features of the image to be processed;
the image description module is used for inputting the image features of the image to be processed into a pre-trained target image description model to obtain a plurality of description texts of the image to be processed, wherein the target image description model is obtained by the training method of the image description model according to any one of claims 1-5.
9. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the training method of the image description model of any one of claims 1-5 or the image description method of claim 6.
10. A computer readable storage medium storing computer instructions for causing a processor to perform the training method of the image description model of any one of claims 1-5 or the image description method of claim 6.
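For illustration only, the following is a minimal PyTorch sketch of one plausible reading of the processing modules recited in claims 2-3; the layer sizes, the reparameterisation step, and the interpretation of the mean and log-variance branches as the "variational paths" are assumptions of this sketch, not details fixed by the claims.

```python
import torch
import torch.nn as nn

class VariationalProcessingModule(nn.Module):
    """One reading of the first/second processing module of claim 3:
    attention layer -> linear transformation layers -> sampling layer."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_mu = nn.Linear(dim, dim)      # variational path 1: mean
        self.to_logvar = nn.Linear(dim, dim)  # variational path 2: log-variance

    def forward(self, x: torch.Tensor):
        h, _ = self.attention(x, x, x)                           # self-attention over features
        mu, logvar = self.to_mu(h), self.to_logvar(h)            # linear transformation layers
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # sampling layer (reparameterisation)
        return z, mu, logvar
```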
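Likewise, a hedged sketch of the composite loss of claim 4, assuming the description statement loss term is a token-level cross-entropy, the divergence data loss term is a KL divergence between the latent Gaussians of the first and second processing modules, and the Gaussian loss term is a KL divergence against a standard normal prior; these are common variational-autoencoder readings, not constructions fixed by the claim text.

```python
import torch
import torch.nn.functional as F

def gaussian_kl(mu1, logvar1, mu2, logvar2):
    """KL( N(mu1, var1) || N(mu2, var2) ), summed over latent dimensions."""
    return 0.5 * torch.sum(
        logvar2 - logvar1
        + (logvar1.exp() + (mu1 - mu2) ** 2) / logvar2.exp()
        - 1.0
    )

def target_loss(logits, target_tokens, mu1, logvar1, mu2, logvar2):
    # Description statement loss term: cross-entropy over the generated tokens.
    description_loss = F.cross_entropy(logits.flatten(0, 1), target_tokens.flatten())
    # Divergence data loss term: keep the two modules' latent posteriors close.
    divergence_loss = gaussian_kl(mu1, logvar1, mu2, logvar2)
    # Gaussian loss term: regularise one posterior toward the N(0, I) prior.
    gaussian_loss = gaussian_kl(
        mu1, logvar1, torch.zeros_like(mu1), torch.zeros_like(logvar1)
    )
    return description_loss + divergence_loss + gaussian_loss
```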
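Finally, a sketch of the target reinforcement function of claim 5, assuming a REINFORCE-style policy-gradient update; the reward function itself (for example a CIDEr scorer) is a placeholder that the claim leaves open.

```python
import torch

def reinforcement_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """log_probs: summed log-likelihood of each sampled description statement;
    rewards: the corresponding reward values from the reward function."""
    # Reward baseline: the median (midpoint) of the maximum and minimum reward.
    baseline = (rewards.max() + rewards.min()) / 2.0
    # Raise the likelihood of descriptions rewarded above the baseline and
    # lower it for those below; detach so no gradient flows into the rewards.
    return -((rewards - baseline).detach() * log_probs).mean()
```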
CN202311667084.XA 2023-12-06 2023-12-06 Model training method, image description method, device, medium and equipment Pending CN117671424A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311667084.XA 2023-12-06 2023-12-06 Model training method, image description method, device, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311667084.XA 2023-12-06 2023-12-06 Model training method, image description method, device, medium and equipment

Publications (1)

Publication Number Publication Date
CN117671424A (en) 2024-03-08

Family

ID=90078458

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311667084.XA Model training method, image description method, device, medium and equipment 2023-12-06 2023-12-06

Country Status (1)

Country Link
CN (1) CN117671424A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination