CN115424044A - Multi-mode-based image annotation method and device and electronic equipment

Info

Publication number
CN115424044A
CN115424044A (application CN202211034098.3A)
Authority
CN
China
Prior art keywords
image
text
data
training data
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211034098.3A
Other languages
Chinese (zh)
Inventor
张恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202211034098.3A
Publication of CN115424044A
Status: Pending


Classifications

    • G06V 10/761 Image or video pattern matching; proximity, similarity or dissimilarity measures
    • G06N 3/08 Neural networks; learning methods
    • G06V 10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 Image or video recognition or understanding using neural networks
    • G06V 20/40 Scenes; scene-specific elements in video content
    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06V 30/19093 Character recognition; proximity measures, i.e. similarity or distance measures
    • G06V 30/19147 Character recognition; obtaining sets of training patterns; bootstrap methods, e.g. bagging or boosting


Abstract

The disclosure provides a multi-modality-based image annotation method and apparatus and an electronic device, and relates to the technical field of image processing. The method comprises the following steps: acquiring an image to be annotated and a picture type, and generating at least two texts to be annotated according to the picture type; inputting the image to be annotated and each text to be annotated into a pre-trained image annotation model, and extracting, through the image annotation model, an image feature vector of the image to be annotated and a text feature vector corresponding to each text to be annotated; acquiring the similarity between the image feature vector and each text feature vector, and determining a target annotation text from the texts to be annotated according to the similarity; and annotating the image to be annotated according to the target annotation text. By extracting the image feature vector and the text feature vectors and determining the target annotation text according to the similarity, automatic annotation of images is achieved, and the efficiency and accuracy of image annotation are improved.

Description

Multi-mode-based image annotation method and device and electronic equipment
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to an image annotation method and apparatus based on multiple modalities, and an electronic device.
Background
In recent years, with the dramatic increase in the amount of image data, a means of efficiently labeling the content of image data is urgently needed in order to achieve effective retrieval and management of large-scale image data.
From the viewpoint of pattern recognition, labeling image data can be regarded as the problem of assigning a label to an image according to its content, that is, annotating image-modality data with text-modality data. Image data are labeled by selecting content features that represent them; if the selected content features are appropriate, training a model on the labeled image data can significantly improve its recognition accuracy. However, if the selected content features are not appropriate, the annotation will not be consistent with the image, the quality of the annotation will drop, and the recognition accuracy of a model trained on those content features and image data will suffer in turn. Because of the semantic gap problem, the related art achieves low accuracy when images are labeled with text.
Disclosure of Invention
The disclosure provides a multi-modality-based image annotation method and apparatus and an electronic device, aiming at least to solve the problem in the related art of low accuracy when images are annotated with text. The technical solution of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a multi-modality based image annotation method, including:
acquiring an image to be annotated and a picture type, and generating at least two texts to be annotated according to the picture type;
inputting the image to be annotated and each text to be annotated into a pre-trained image annotation model, and extracting, through the image annotation model, an image feature vector of the image to be annotated and a text feature vector corresponding to each text to be annotated;
acquiring the similarity between the image feature vector and each text feature vector, and determining a target annotation text from the texts to be annotated according to the similarity;
and annotating the image to be annotated according to the target annotation text.
Optionally, the image annotation model includes an image encoder and a text encoder, and extracting, through the image annotation model, the image feature vector of the image to be annotated and the text feature vector corresponding to the text to be annotated includes:
inputting the image to be annotated into the image encoder, and extracting the image feature vector through the image encoder;
and inputting the text to be annotated into the text encoder, and extracting the text feature vector through the text encoder.
Optionally, the step of obtaining the similarity between the image feature vector and each text feature vector specifically includes any one of the following:
calculating cosine similarity between the image feature vector and the text feature vector as the similarity;
calculating the Manhattan distance between the image feature vector and the text feature vector as the similarity;
and calculating Euclidean distance between the image feature vector and the text feature vector as the similarity.
Optionally, the step of determining a target annotation text from each to-be-annotated text according to the similarity specifically includes:
and if the similarity is greater than a preset similarity threshold, determining that the text to be labeled corresponding to the similarity is the target labeling text.
Optionally, the step of labeling the image to be labeled according to the target labeling text specifically includes:
and determining the target annotation text as the annotation text of the image to be annotated.
According to a second aspect of the embodiments of the present disclosure, there is provided a method for training an image annotation model, which is used for training the image annotation model in the first aspect, and includes:
matching the image training data with the text training data to obtain a training data pair, and generating a training data set according to the training data pair;
selecting at least two training data pairs to form a data batch, and inputting the data batch into an image labeling model, wherein the data batch comprises a positive training data pair and at least one negative training data pair;
extracting the features of the image training data and the text training data of the positive training data pair in the data batch according to the image labeling model to generate a first image feature vector and a first text feature vector, and extracting the features of the image training data and the text training data of the negative training data pair in the data batch according to the image labeling model to generate a second image feature vector and a second text feature vector;
forming an image feature queue according to the first image feature vector and the second image feature vector, and forming a text feature queue according to the first text feature vector and the second text feature vector;
calculating a first similarity between the first image feature vector and each text feature vector in the text feature queue, and calculating a second similarity between the first text feature vector and each image feature vector in the image feature queue;
and calculating a loss function value according to the first similarity and the second similarity, and training the image annotation model by taking the convergence of the loss function as a target to obtain the trained image annotation model.
Optionally, the acquiring of the image training data and the text training data includes:
acquiring a video frame in original video data, and preprocessing the video frame to generate first image data;
acquiring a text in the original video data, preprocessing the text to generate first text data corresponding to the first image data, and forming an original data pair by the first image data and the corresponding first text data;
performing data enhancement on the first image data and the first text data to generate the image training data and the text training data.
Optionally, the step of performing data enhancement on the first image data and the first text data to generate the image training data and the text training data includes:
transforming the first image data by at least one of the following to generate the image training data: rotation, flipping, scaling, translation, scale transformation, noise perturbation, color transformation, or occlusion;
transforming the first text data by at least one of the following to generate the text training data: synonym replacement, random synonym substitution, equivalent-word replacement, back-translation, or sentence-pattern inversion.
Optionally, calculating a loss function value according to the first similarity and the second similarity, and training the image annotation model with convergence of the loss function as the target, includes:
calculating a reference similarity between the image training data of the positive training data pair and the text training data of the positive training data pair;
and setting the loss function with the target that the reference similarity is greater than or equal to all of the first similarities and the second similarities.
According to a third aspect of the embodiments of the present disclosure, there is provided a multi-modality based image annotation apparatus, including:
a to-be-annotated text acquisition module, configured to acquire an image to be annotated and a picture type, and to generate at least two texts to be annotated according to the picture type;
the feature extraction module is used for inputting the images to be annotated and each text to be annotated into a pre-trained image annotation model, and extracting an image feature vector of the images to be annotated and a text feature vector corresponding to the text to be annotated through the image annotation model;
the target labeling text determining module is used for acquiring the similarity between the image characteristic vector and each text characteristic vector and determining a target labeling text from each text to be labeled according to the similarity;
and an annotation module, configured to annotate the image to be annotated according to the target annotation text.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a training apparatus for an image annotation model, including:
the data acquisition module is used for matching the image training data and the text training data to obtain a training data pair, and generating a training data set according to the training data pair;
the data input module is used for selecting at least two training data pairs to form a data batch and inputting the data batch into an image labeling model, wherein the data batch comprises a positive training data pair and at least one negative training data pair;
the feature extraction module is used for extracting features of image training data and text training data of positive example training data pairs in the data batch according to the image labeling model to generate a first image feature vector and a first text feature vector, and extracting features of image training data and text training data of negative example training data pairs in the data batch according to the image labeling model to generate a second image feature vector and a second text feature vector;
the feature formation module is used for forming an image feature queue according to the first image feature vector and the second image feature vector and forming a text feature queue according to the first text feature vector and the second text feature vector;
the similarity obtaining module is used for calculating first similarities between the first image feature vector and each text feature vector in the text feature queue and calculating second similarities between the first text feature vector and each image feature vector in the image feature queue;
and the training module is used for calculating a loss function value according to the first similarity and the second similarity, training the image annotation model by taking the convergence of the loss function as a target, and obtaining the trained image annotation model.
According to a fifth aspect of an embodiment of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the multi-modality based image annotation method as described in the first aspect above or the training method of the image annotation model as described in the second aspect.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the multi-modality based image annotation method as described in the first aspect or the training method of the image annotation model as described in the second aspect.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the multi-modality based image annotation method according to the first aspect or the method for training an image annotation model according to the second aspect.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
by extracting the image characteristic vector and the text characteristic vector and determining the target annotation text according to the similarity, the automatic annotation of the image is realized, and the efficiency and the accuracy of the image annotation are improved.
Data enhancement is used to expand the labeled data for training the model, which reduces manual workload, increases the scale of labeled data, and further improves the annotation quality of the multi-modal model.
Different picture types can be set according to different task targets, so that specific category labels of the images to be annotated can be obtained, improving the flexibility of annotation.
By adopting a contrastive learning scheme, the text modality and the image modality are aligned so that the modalities interact fully, giving a better effect.
A queue mechanism is introduced into the multi-modal training, which reduces the dependence on computing resources while enhancing the effect of contrastive learning.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow diagram illustrating a multi-modality based image annotation process in accordance with an exemplary embodiment.
FIG. 2 is a flowchart illustrating a method of training an image annotation model according to an exemplary embodiment.
FIG. 3 is a flowchart illustrating a method of training an image annotation model according to an exemplary embodiment.
FIG. 4 is a flowchart illustrating a multi-modality based image annotation process in accordance with an exemplary embodiment.
FIG. 5 is a flow diagram illustrating a method of image annotation model training in accordance with an exemplary embodiment.
FIG. 6 is a block diagram illustrating a multi-modality based image annotation appliance in accordance with an exemplary embodiment.
FIG. 7 is a block diagram illustrating an apparatus for training an image annotation model according to an exemplary embodiment.
FIG. 8 is a block diagram illustrating an apparatus in accordance with an example embodiment.
FIG. 9 is a block diagram illustrating an apparatus in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the disclosure as detailed in the appended claims.
It should be noted that the user information (including, but not limited to, user device information, user personal information, etc.) referred to in the present disclosure is information authorized by the user or sufficiently authorized by each party.
Against the background of the rapid development of today's internet, platforms are flooded with users' picture and video resources. For example, a great deal of picture and video data exists on short-video platforms and various self-media platforms. Each platform needs to perform content-understanding work such as labeling and classifying these pictures or videos in order to keep the platform ecosystem healthy.
At present, whether on short-video platforms or other self-media platforms, picture-type data need to be processed; for example, pictures or videos uploaded by users are labeled and classified (for instance as infants, adults, fishing, food, and the like). On the one hand, safety and ecosystem management and control are carried out according to the classification results; on the other hand, content can be distributed better according to the category information, improving the reputation and popularity of the platform. In order to achieve effective retrieval and management of large-scale image data, a means of efficiently labeling the content of image data is urgently needed.
From the viewpoint of pattern recognition, labeling image data can be regarded as the problem of assigning a label to an image according to its content, that is, annotating image-modality data with text-modality data. Selecting appropriate features that characterize the image content affects the annotation performance to a large extent. Because of the well-known semantic gap problem, the prior art achieves low accuracy when labeling images with text.
Existing methods for labeling pictures can be roughly divided into manual labeling and labeling based on limited-category model training strategies.
Manual labeling usually consumes a large amount of manpower and material resources: the original pictures are labeled and classified by hand to obtain the corresponding annotation results. This is time-consuming, labor-intensive, and inefficient.
Labeling based on a limited-category model training strategy manually labels part of the data, trains a model on that labeled portion, uses the trained model to predict labels for related picture data, and automatically labels the picture data in combination with certain strategies (such as threshold screening). However, this approach can only support annotation of a limited set of types. For example, if training data of 5 categories are manually labeled and a model is trained on them, the model can only label those 5 categories and has no extensibility or universality.
FIG. 1 is a flow diagram illustrating a multi-modality based image annotation process in accordance with an exemplary embodiment. As shown in fig. 1, the method includes:
step 101, acquiring an image to be annotated and a picture type, and generating at least two texts to be annotated according to the picture type.
In the embodiment of the application, the image to be annotated is obtained. If texts were generated at random to annotate the image, the amount of text would be too large and the annotation efficiency would drop. The embodiment of the application therefore determines, according to the requirements of the annotation task, the coarse category corresponding to the image to be annotated, namely the picture type. A picture type may contain a number of subcategories.
In a possible embodiment, the objective of the annotation task is to determine which animal the image to be annotated shows; the picture type is then "animal", and the texts to be annotated are generated according to this picture type: "cat", "dog", "horse", "elephant", and so on. The texts to be annotated are therefore more targeted, which improves the efficiency and flexibility of annotation.
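As a hedged illustration (not part of the original disclosure), the following Python sketch shows one way the candidate texts could be generated from a picture type; the category table and the template wording are assumptions for demonstration only.

```python
# Illustrative sketch: generate at least two candidate texts to be annotated from a picture type.
# The category table and the template wording are assumptions, not taken from the disclosure.
CATEGORIES_BY_PICTURE_TYPE = {
    "animal": ["cat", "dog", "horse", "elephant"],
    "food": ["noodles", "hotpot", "barbecue", "cake"],
}

def generate_candidate_texts(picture_type: str, template: str = "an image of a {}") -> list[str]:
    """Return the candidate annotation texts for the given picture type."""
    return [template.format(category) for category in CATEGORIES_BY_PICTURE_TYPE.get(picture_type, [])]

print(generate_candidate_texts("animal"))
# ['an image of a cat', 'an image of a dog', ...]
```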
Step 102, inputting the image to be annotated and each text to be annotated into a pre-trained image annotation model, and extracting an image feature vector of the image to be annotated and a text feature vector corresponding to the text to be annotated through the image annotation model.
In an embodiment of the present application, the image annotation model is a neural network model, which includes a plurality of neurons. And extracting the image features of the image to be labeled according to the image labeling model to generate an image feature vector, and extracting the text features of the text to be labeled to generate the text feature vector. In order to facilitate subsequent acquisition of similarity between an image feature vector and a text feature vector, the image feature vector and the text feature vector need to have the same number of dimensions.
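A minimal sketch (assuming PyTorch, with hypothetical backbone modules) of the point just made: the image encoder and the text encoder must project into feature vectors with the same number of dimensions so that a similarity can be computed between them. The 512-dimensional shared space is an assumption.

```python
# Sketch of a dual-encoder model whose image and text branches emit same-dimension vectors.
# The backbones and the shared dimension are illustrative assumptions.
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, image_backbone: nn.Module, text_backbone: nn.Module,
                 image_dim: int, text_dim: int, shared_dim: int = 512):
        super().__init__()
        self.image_backbone = image_backbone
        self.text_backbone = text_backbone
        self.image_proj = nn.Linear(image_dim, shared_dim)  # image features -> shared space
        self.text_proj = nn.Linear(text_dim, shared_dim)    # text features -> shared space

    def encode_image(self, images: torch.Tensor) -> torch.Tensor:
        return self.image_proj(self.image_backbone(images))

    def encode_text(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.text_proj(self.text_backbone(tokens))
```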
Step 103, obtaining the similarity between the image feature vector and each text feature vector, and determining a target annotation text from each text to be annotated according to the similarity.
In the embodiment of the application, the similarity is calculated from the image feature vector and each text feature vector; it reflects the degree of matching between the image to be annotated and the corresponding text to be annotated. The higher the similarity, the better the image to be annotated matches that text; the lower the similarity, the worse the match. In order to obtain the category that best fits the image to be annotated, the text to be annotated with the higher similarity is selected from the texts to be annotated and determined as the target annotation text.
And 104, labeling the image to be labeled according to the target labeling text.
In the embodiment of the application, after the target annotation text is obtained, it can be attached to the image to be annotated as its label, so that the annotation corresponding to the image is obtained efficiently and the annotated image can conveniently be used in other work.
According to the embodiment of the application, the image feature vector and the text feature vectors are extracted from the image to be annotated and the texts to be annotated, and the target annotation text is determined according to the similarity between the image feature vector and the text feature vectors, so that automatic annotation of the image is realized and the efficiency and accuracy of image annotation are improved. Meanwhile, different picture types can be set according to different task targets, specific category labels of the image to be annotated can be obtained, and the flexibility of annotation is improved.

Optionally, the step of obtaining the image to be annotated specifically includes:
extracting a video frame from video data to obtain the image to be annotated.
The image to be annotated can be an independent image or a video frame extracted from video data. There are various frame-extraction methods, for example uniform frame extraction, which extracts one frame at fixed time intervals; fixed-frame-number extraction, which extracts one frame every fixed number of frames; and extraction according to the difference between adjacent frames, which computes the difference between adjacent frames and, if the difference exceeds a certain threshold, indicating that motion occurred between the two frames, extracts the previous frame.
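As a hedged illustration only (the disclosure does not prescribe an implementation), the following sketch shows the three frame-extraction strategies using OpenCV; the interval, frame step, and difference threshold values are assumptions.

```python
# Sketch of uniform, fixed-frame-number and frame-difference extraction with OpenCV.
import cv2

def extract_frames(video_path, mode="uniform", interval_s=1.0, frame_step=30, diff_threshold=30.0):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back to 25 fps if the container reports none
    frames, prev_gray, prev_frame, idx = [], None, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if mode == "uniform" and idx % max(int(fps * interval_s), 1) == 0:
            frames.append(frame)              # one frame per fixed time interval
        elif mode == "fixed_step" and idx % frame_step == 0:
            frames.append(frame)              # one frame every fixed number of frames
        elif mode == "frame_diff":
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev_gray is not None and cv2.absdiff(gray, prev_gray).mean() > diff_threshold:
                frames.append(prev_frame)     # motion detected: keep the previous frame
            prev_gray, prev_frame = gray, frame
        idx += 1
    cap.release()
    return frames
```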
Optionally, the image annotation model includes an image encoder and a text encoder, and step 102 in fig. 1 specifically includes:
inputting the image to be marked into the image encoder, and extracting the image feature vector through the image encoder;
and inputting the text to be labeled into the text encoder, and extracting the text characteristic vector through the text encoder.
Optionally, the step of obtaining the similarity between the image feature vector and the text feature vector specifically includes any one of the following steps:
calculating cosine similarity between the image feature vector and the text feature vector as the similarity;
calculating the Manhattan distance between the image feature vector and the text feature vector as the similarity;
and calculating Euclidean distance between the image feature vector and the text feature vector as the similarity.
In the embodiment of the application, the similarity calculation methods are various, and an implementer can select a proper calculation method according to specific situations and train the image annotation model according to the similarity calculation method.
Cosine similarity measures the similarity between two vectors by the cosine of the angle between them. The cosine of a 0-degree angle is 1, the cosine of any other angle is no greater than 1, and its minimum value is -1. The cosine of the angle between two vectors therefore indicates whether they point in approximately the same direction: when the two vectors have the same direction, the cosine similarity is 1; when the angle between them is 90 degrees, the cosine similarity is 0; and when they point in completely opposite directions, the cosine similarity is -1. The result depends only on the direction of the vectors, not on their length. Euclidean distance is a commonly used distance, referring to the true distance between two points in a multidimensional space; the image feature vector and the text feature vector can be regarded as two such points, and the Euclidean distance is the distance between them.
According to the embodiment of the application, the similarity between the image feature vector and the text feature vector can be obtained, the calculation mode of the similarity is enriched, the similarity between the image feature vector and the text feature vector is more accurately obtained, and the accuracy of subsequently selecting the target annotation text is improved.
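A small sketch of the three similarity options named above, assuming NumPy; note that for the two distance measures a smaller value means a closer match, so in practice they would be negated or thresholded accordingly.

```python
# Cosine similarity, Manhattan distance and Euclidean distance between an image
# feature vector and a text feature vector of the same dimensionality.
import numpy as np

def similarity(image_vec: np.ndarray, text_vec: np.ndarray, kind: str = "cosine") -> float:
    if kind == "cosine":
        # 1 when the vectors point the same way, 0 at 90 degrees, -1 when opposite.
        return float(image_vec @ text_vec / (np.linalg.norm(image_vec) * np.linalg.norm(text_vec)))
    if kind == "manhattan":
        return float(np.abs(image_vec - text_vec).sum())
    if kind == "euclidean":
        return float(np.linalg.norm(image_vec - text_vec))
    raise ValueError(f"unknown similarity kind: {kind}")
```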
Optionally, step 103 in fig. 1 specifically includes:
and if the similarity is greater than a preset similarity threshold, determining the text to be labeled corresponding to the similarity as the target labeling text.
In the embodiment of the application, a similarity threshold is preset, and if the similarity between the image feature vector and the text feature vector is greater than the similarity threshold, it is determined that the matching degree of the to-be-annotated text and the to-be-annotated image is high, the content of the to-be-annotated image can be accurately reflected, and the to-be-annotated text is determined to be the target annotation text.
Optionally, step 104 in fig. 1 includes:
and determining the target annotation text as the annotation text of the image to be annotated.
In the embodiment of the application, after the target annotation text is determined, the target annotation text is annotated on the image to be annotated, so that the automatic annotation of the image to be annotated is realized, and the efficiency of image annotation is improved.
FIG. 2 is a flowchart illustrating a method of training an image annotation model according to an exemplary embodiment. As shown in fig. 2, the method is used for training the image annotation model, and includes:
step 201, matching image training data and text training data to obtain a training data pair, and generating a training data set according to the training data pair.
In the embodiment of the application, in order to train the image annotation model, image training data and text training data are prepared in advance. In order to train the capability of an image labeling model for matching texts and images, the image training data and the text training data are paired to generate training data pairs, and a training data set is generated according to the training data pairs, wherein the training data set comprises a plurality of training data pairs.
Step 202, selecting at least two training data pairs to form a data batch, and inputting the data batch into an image labeling model, wherein the data batch comprises a positive training data pair and at least one negative training data pair.
In the embodiment of the application, after the training data set is obtained, data batches of a certain size are input into the image annotation model for training. In order to sufficiently learn the features of matched and unmatched data pairs and improve the matching capability of the model, each data batch contains one positive training data pair and a plurality of negative training data pairs: the image training data and the text training data in the positive training data pair match each other, while the image training data of each negative training data pair does not match the text training data of the positive pair, and the text training data of each negative training data pair does not match the image training data of the positive pair.
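The following sketch illustrates, under assumed sampling choices that are not part of the disclosure, how such a data batch with one positive pair and several negative pairs might be assembled from a set of matched training pairs.

```python
# Illustrative batch construction: one matched (positive) image-text pair plus
# several mismatched (negative) pairs drawn from other samples in the training set.
import random

def build_data_batch(training_pairs, num_negatives=7):
    """training_pairs: list of (image, text) pairs that are known to match each other."""
    pos_index = random.randrange(len(training_pairs))
    pos_image, pos_text = training_pairs[pos_index]
    batch = [{"image": pos_image, "text": pos_text, "is_positive": True}]
    other_indices = [i for i in range(len(training_pairs)) if i != pos_index]
    for i in random.sample(other_indices, k=min(num_negatives, len(other_indices))):
        neg_image, neg_text = training_pairs[i]
        # Relative to the positive pair, this sample's image and text are both mismatches.
        batch.append({"image": neg_image, "text": neg_text, "is_positive": False})
    return batch
```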
Step 203, extracting features of the image training data and the text training data of the positive training data pair in the data batch according to the image labeling model to generate a first image feature vector and a first text feature vector, and extracting features of the image training data and the text training data of the negative training data pair in the data batch according to the image labeling model to generate a second image feature vector and a second text feature vector.
In the embodiment of the application, the image annotation model is a neural network model comprising an image encoder and a text encoder. The image encoder extracts features from the image training data in the data batch to generate the first image feature vector and the second image feature vectors, and the text encoder extracts features from the text training data in the data batch to generate the first text feature vector and the second text feature vectors.
And 204, forming an image feature queue according to the first image feature vector and the second image feature vector, and forming a text feature queue according to the first text feature vector and the second text feature vector.
And after the feature vectors are extracted, storing the first image feature vector and the second image feature vector through an image feature queue, and storing the first text feature vector and the second text feature vector through a text feature queue.
Step 205, calculating a first similarity between the first image feature vector and each text feature vector in the text feature queue, and calculating a second similarity between the first text feature vector and each image feature vector in the image feature queue.
In the embodiment of the application, the matching degree of the image training data in the positive training data pair and the text training data in each negative training data pair in the data batch is obtained by calculating the first similarity of the first image feature vector and each text feature vector in the text feature queue; and calculating the second similarity of the first text feature vector and each image feature vector in the image feature queue to obtain the matching degree of the text training data in the positive training data pair and the image training data in each negative training data pair in the data batch.
And step 206, calculating a loss function value according to the first similarity and the second similarity, and training the image annotation model by taking the convergence of the loss function as a target to obtain a trained image annotation model.
In the embodiment of the present application, the similarity between the first image feature vector and the first text feature vector of the positive training data pair needs to be maximized, while the first similarities between the first image feature vector and the second text feature vectors of the negative pairs, and the second similarities between the first text feature vector and the second image feature vectors of the negative pairs, need to be minimized. This objective is expressed through the loss function, and the image annotation model is iteratively updated according to the loss function until the loss stabilizes, at which point training is finished.
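One loss that realizes this objective is an InfoNCE-style contrastive loss; the patent does not fix the exact form, so the temperature and structure below are assumptions. The sketch assumes PyTorch and places the positive-pair similarity at index 0 of each logits row.

```python
# Hedged sketch of a contrastive loss: the reference similarity of the positive pair is
# pushed above the first similarities (image vs. negative texts) and the second
# similarities (text vs. negative images).
import torch
import torch.nn.functional as F

def contrastive_loss(pos_img, pos_txt, neg_imgs, neg_txts, temperature=0.07):
    """pos_img, pos_txt: (d,) features of the positive pair; neg_imgs, neg_txts: (n, d)."""
    pos_img, pos_txt = F.normalize(pos_img, dim=0), F.normalize(pos_txt, dim=0)
    neg_imgs, neg_txts = F.normalize(neg_imgs, dim=1), F.normalize(neg_txts, dim=1)

    ref_sim = (pos_img * pos_txt).sum()                                           # reference similarity
    img2txt = torch.cat([ref_sim.view(1), pos_img @ neg_txts.T]) / temperature    # first similarities
    txt2img = torch.cat([ref_sim.view(1), pos_txt @ neg_imgs.T]) / temperature    # second similarities

    target = torch.zeros(1, dtype=torch.long)           # index 0 holds the positive pair
    loss_i2t = F.cross_entropy(img2txt.unsqueeze(0), target)
    loss_t2i = F.cross_entropy(txt2img.unsqueeze(0), target)
    return (loss_i2t + loss_t2i) / 2
```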
According to the method and the device, the image annotation model is trained interactively on the two modalities of image and text, and by contrasting positive data pairs with negative data pairs the model learns the correspondence between text feature vectors and image feature vectors better, which improves the accuracy with which the image annotation model matches text feature vectors to image feature vectors.
FIG. 3 is a flow chart illustrating a method of training an image annotation model according to an exemplary embodiment. As shown in fig. 3, the acquiring step of the image training data and the text training data in fig. 2 includes:
step 301, acquiring a video frame in original video data, and preprocessing the video frame to generate first image data;
step 302, obtaining a text in the original video data, preprocessing the text to generate first text data corresponding to the first image data, and forming an original data pair by the first image data and the corresponding first text data;
step 303, performing data enhancement on the first image data and the first text data to generate the image training data and the text training data.
In the embodiment of the application, the image training data may be pictures collected from the network, or may be extracted from original video data collected from the network. Video frames in the original video data are first preprocessed to remove irrelevant information from the images, recover the useful real information, enhance the detectability of the relevant information, and simplify the data as far as possible, which improves the training efficiency of the subsequent image annotation model and the reliability of its feature extraction. The text in the original video data, such as video titles and video subtitles, is then preprocessed. Preprocessing of the text includes removing texts that are too long or too short, removing special characters and the like, and filtering according to certain strategies such as video quality. The first image data and the first text data are obtained after this preprocessing.
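A minimal sketch of the text-preprocessing step described above, with the length bounds and the character filter as illustrative assumptions.

```python
# Drop texts that are too long or too short and strip special characters.
import re

def preprocess_text(raw_text: str, min_len: int = 4, max_len: int = 64):
    cleaned = re.sub(r"[^\w\s]", "", raw_text).strip()   # remove special characters
    if not (min_len <= len(cleaned) <= max_len):         # discard too-short or too-long texts
        return None
    return cleaned
```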
In order to expand the data and improve data utilization, data enhancement is performed on the first image data and the first text data.
In the embodiment of the application, the text in the original video data is subjected to data cleaning to generate the first text data corresponding to the first image data, the first image data and the corresponding first text data form an original data pair, the obtained original data pair is matched accurately, and the precision of subsequent model training is improved.
Optionally, step 303 in fig. 3 specifically includes:
transforming the first image data by at least one of the following to generate the image training data: rotation, flipping, scaling, translation, scale transformation, noise perturbation, color transformation, or occlusion;
transforming the first text data by at least one of the following to generate the text training data: synonym replacement, random synonym substitution, equivalent-word replacement, back-translation, or sentence-pattern inversion.
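A minimal sketch of the two augmentation branches listed above, assuming torchvision for the image transforms and a hand-written synonym table for the text side; the transform parameters and the synonym dictionary are illustrative assumptions.

```python
# Image augmentation (rotation, flip, scaling, color) and a simple synonym-replacement
# text augmentation; the concrete choices are assumptions for demonstration.
import random
from torchvision import transforms

image_augment = transforms.Compose([
    transforms.RandomRotation(15),           # rotation transformation
    transforms.RandomHorizontalFlip(),       # flip transformation
    transforms.RandomResizedCrop(224),       # scaling / scale transformation
    transforms.ColorJitter(0.4, 0.4, 0.4),   # color transformation
])

SYNONYMS = {"infant": ["baby", "toddler"], "food": ["cuisine", "dish"]}  # assumed synonym table

def augment_text(text: str) -> str:
    """Synonym replacement: swap each known word for a randomly chosen near-synonym."""
    return " ".join(random.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in text.split())
```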
In the embodiment of the application, through data enhancement one piece of first image data is amplified into a plurality of pieces of image data, and one piece of first text data is amplified into a plurality of pieces of text data. Expanding the image data and the text data increases the number of training samples and strengthens the recognition capability of the image annotation model.

Optionally, step 206 in FIG. 2 specifically includes:
calculating the reference similarity between the image training data of the positive training data pair and the text training data of the positive training data pair;
and setting the loss function with the target that the reference similarity is greater than or equal to all of the first similarities and the second similarities.
In the embodiment of the application, in the set data batch, the matching degree of the image training data and the text training data in the positive training data pair is the highest, and the matching degree of the image training data in the positive training data pair and the text training data in the negative training data pair, and the matching degree of the text training data in the positive training data pair and the image training data in the negative training data pair are lower. The reference similarity is higher than all of the first similarity and the second similarity. In the embodiment of the application, the characteristics of the image encoder and the text encoder are more fully learned by setting a plurality of negative example training data pairs.
According to the image annotation model training method and device, the image annotation model is trained through the loss function, iterating with the target that the reference similarity is greater than or equal to all of the first similarities and second similarities. This improves the model's ability to distinguish positive training data pairs from negative ones, and improves the image annotation model's ability to match image training data with text training data.
FIG. 4 is a flow diagram illustrating a multi-modality based image annotation process in accordance with an exemplary embodiment. As shown in FIG. 4, the method includes: acquiring an image to be annotated and determining the picture type, where the text to be annotated takes the form of a template such as "an image of a {template}" in which {template} needs to be filled in; the implementer determines, according to the target of the annotation task, the picture type used to fill the template. In a possible embodiment, in which it is necessary to annotate which animal the image to be annotated shows, words such as "cat", "dog" and "horse" are generated according to the picture type "animal" and filled into {template} to generate the texts to be annotated.
Inputting a text to be labeled into a text encoder to extract features, and generating a text feature vector; and inputting the image to be marked into an image encoder to extract features, and generating an image feature vector.
And calculating the similarity of the image feature vector and the text feature vector, and if the similarity is greater than a preset similarity threshold, determining the text to be labeled as the target labeled text. Cosine similarity, manhattan distance, euclidean distance, or the like between the image feature vector and the text feature vector may be calculated as the similarity. And marking the image to be marked by using the target marking text after the target marking text is determined, so as to generate marked image data, wherein the image data can be used for training a neural network model.
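Pulling the steps of FIG. 4 together, the sketch below walks through the annotation flow end to end; the encoders, the template, the category list, and the similarity threshold are all placeholders and assumptions rather than parts of the disclosure.

```python
# End-to-end sketch: fill the template, encode image and candidate texts, compare by
# cosine similarity, and keep candidates above the preset threshold as annotation texts.
import numpy as np

def annotate(image, image_encoder, text_encoder, template="an image of a {}",
             categories=("cat", "dog", "horse"), threshold=0.3):
    candidate_texts = [template.format(c) for c in categories]   # texts to be annotated
    img_vec = image_encoder(image)                               # image feature vector
    labels = []
    for text in candidate_texts:
        txt_vec = text_encoder(text)                             # text feature vector
        cos = float(img_vec @ txt_vec /
                    (np.linalg.norm(img_vec) * np.linalg.norm(txt_vec)))
        if cos > threshold:                                      # above the similarity threshold
            labels.append(text)                                  # keep as a target annotation text
    return labels
```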
FIG. 5 is a flow diagram illustrating a method of image annotation model training in accordance with an exemplary embodiment. As shown in FIG. 5, the data batch is input into the image annotation model; the image encoder is a CNN, and the text encoder is a CNN or an RNN. Features of the image training data and the text training data of the positive example training data pair are extracted by the image encoder and the text encoder to generate a first image feature vector and a first text feature vector, and features of the image training data and the text training data of the negative example training data pairs in the data batch are extracted by the image encoder and the text encoder to generate second image feature vectors and second text feature vectors.
The first image feature vector and the second image feature vectors form an image feature queue, and the first text feature vector and the second text feature vectors form a text feature queue. The queue length depends on the available computing resources (e.g., 4096, or even larger values such as 65536); the purpose of introducing the text queue and the picture queue is to enhance the effect of contrastive learning.
And calculating first similarity of the first image feature vector and each text feature vector in the text feature queue, and calculating second similarity of the first text feature vector and each image feature vector in the image feature queue.
The reference similarity between the image training data of the positive training data pair and the text training data of the positive training data pair is then calculated.
Since the matching relationship between the text and the image in a batch is known, the loss function is to maximize the reference similarity and make the first similarity and the second similarity smaller than the reference similarity. And setting a loss function, and continuously iteratively training the image annotation model by taking the convergence of the loss function as a target until the loss function tends to be stable.
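The sketch below shows, under assumptions about queue length, temperature, and encoder details that the disclosure does not specify, how such a feature-queue mechanism could be combined with the contrastive objective in a single training step.

```python
# Hedged sketch of queue-based contrastive training: features from each batch are pushed
# into fixed-length image and text queues so that many negatives are available without
# re-encoding them, and the loss keeps the matched pair's similarity above all others.
import torch
import torch.nn.functional as F

class FeatureQueues:
    def __init__(self, dim: int, length: int = 4096):
        self.image_queue = F.normalize(torch.randn(length, dim), dim=1)
        self.text_queue = F.normalize(torch.randn(length, dim), dim=1)

    def push(self, img_feats: torch.Tensor, txt_feats: torch.Tensor):
        # Enqueue the newest features and discard the oldest ones (first in, first out).
        self.image_queue = torch.cat([img_feats, self.image_queue])[: len(self.image_queue)]
        self.text_queue = torch.cat([txt_feats, self.text_queue])[: len(self.text_queue)]

def training_step(image_encoder, text_encoder, queues, images, texts, optimizer, temperature=0.07):
    img = F.normalize(image_encoder(images), dim=1)   # (b, d) image feature vectors
    txt = F.normalize(text_encoder(texts), dim=1)     # (b, d) text feature vectors
    ref = (img * txt).sum(dim=1, keepdim=True)        # (b, 1) reference similarities
    # First similarities: each image against every text feature in the queue.
    logits_i2t = torch.cat([ref, img @ queues.text_queue.T], dim=1) / temperature
    # Second similarities: each text against every image feature in the queue.
    logits_t2i = torch.cat([ref, txt @ queues.image_queue.T], dim=1) / temperature
    targets = torch.zeros(img.size(0), dtype=torch.long)   # column 0 holds the matched pair
    loss = (F.cross_entropy(logits_i2t, targets) + F.cross_entropy(logits_t2i, targets)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    queues.push(img.detach(), txt.detach())
    return loss.item()
```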
FIG. 6 is a block diagram illustrating a multi-modality based image annotation appliance in accordance with an exemplary embodiment. Referring to fig. 6, the apparatus includes a pending annotation text acquisition module 610, a feature extraction module 620, a target annotation text determination module 630 and an annotation module 640.
The to-be-labeled text acquisition module 610 is used for acquiring an image to be labeled and a picture type and generating at least two texts to be labeled according to the picture type;
a feature extraction module 620, configured to input the image to be labeled and each text to be labeled into a pre-trained image labeling model, and extract an image feature vector of the image to be labeled and a text feature vector corresponding to the text to be labeled through the image labeling model;
a target annotation text determining module 630, configured to obtain a similarity between the image feature vector and each of the text feature vectors, and determine a target annotation text from each of the texts to be annotated according to the similarity;
and the labeling module 640 is configured to label the image to be labeled according to the target labeling text.
FIG. 7 is a block diagram illustrating an apparatus for training an image annotation model according to an exemplary embodiment. Referring to fig. 7, the apparatus includes a data collection module 710, a data input module 720, a feature extraction module 730, a feature team module 740, a similarity acquisition module 750, and a training module 760.
The data acquisition module 710 is configured to pair image training data and text training data to obtain a training data pair, and generate a training data set according to the training data pair;
the data input module 720 is configured to select at least two training data pairs to form a data batch, and input the data batch into an image annotation model, where the data batch includes a positive training data pair and at least one negative training data pair;
the feature extraction module 730 is configured to extract features of the image training data and the text training data of the positive training data pair in the data batch according to the image annotation model to generate a first image feature vector and a first text feature vector, and extract features of the image training data and the text training data of the negative training data pair in the data batch according to the image annotation model to generate a second image feature vector and a second text feature vector;
a feature grouping module 740, configured to form an image feature queue according to the first image feature vector and the second image feature vector, and form a text feature queue according to the first text feature vector and the second text feature vector;
a similarity obtaining module 750, configured to calculate a first similarity between the first image feature vector and each text feature vector in the text feature queue, and calculate a second similarity between the first text feature vector and each image feature vector in the image feature queue;
the training module 760 is configured to calculate a loss function value according to the first similarity and the second similarity, and train the image annotation model with the convergence of the loss function as a target, so as to obtain a trained image annotation model.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
FIG. 8 is a block diagram illustrating an apparatus 800 for multi-modality based image annotation, according to an exemplary embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 8, the apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, audio component 810 includes a Microphone (MIC) configured to receive external audio signals when apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the device 800. The sensor assembly 814 may also detect a change in position of the device 800 or a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in temperature of the device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communication between the apparatus 800 and other devices in a wired or wireless manner. The apparatus 800 may access a wireless network based on a communication standard, such as WiFi, an operator network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application-Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field-Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic components for performing the above-described methods.
In an exemplary embodiment, a storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the apparatus 800 to perform the method described above is also provided. Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
FIG. 9 is a block diagram illustrating an apparatus 900 for multi-modality based image annotation, according to an exemplary embodiment. For example, the apparatus 900 may be provided as a server. Referring to fig. 9, the apparatus 900 includes a processing component 922, which further includes one or more processors and memory resources, represented by memory 932, for storing instructions, such as applications, that may be executed by the processing component 922. The application programs stored in memory 932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 922 is configured to execute instructions to perform the methods described above.
The device 900 may also include a power component 926 configured to perform power management of the device 900, a wired or wireless network interface 950 configured to connect the device 900 to a network, and an input/output (I/O) interface 958. The device 900 may operate based on an operating system stored in the memory 932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (13)

1. A multi-modality based image annotation method is characterized by comprising the following steps:
acquiring an image to be annotated and a picture type, and generating at least two texts to be annotated according to the picture type;
inputting the image to be annotated and each text to be annotated into a pre-trained image annotation model, and extracting an image feature vector of the image to be annotated and a text feature vector corresponding to each text to be annotated through the image annotation model;
acquiring the similarity between the image feature vector and each text feature vector, and determining a target annotation text from each text to be annotated according to the similarity;
and annotating the image to be annotated according to the target annotation text.
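Read as a zero-shot matching pipeline, claims 1 to 4 amount to encoding the image once, encoding every candidate text, and keeping the best-scoring candidate. The Python sketch below is illustrative only: the encoder interfaces, the use of cosine similarity (one of the options listed in claim 3), and the threshold value are assumptions of this illustration, not details fixed by the claims.

```python
# Illustrative sketch only: encoder interfaces, cosine similarity, and the
# threshold value are assumptions, not details taken from the claims.
import torch
import torch.nn.functional as F

def annotate_image(image, candidate_texts, image_encoder, text_encoder,
                   threshold: float = 0.3):
    """Return the candidate annotation text whose embedding is most similar
    to the image embedding, or None if no candidate clears the threshold."""
    with torch.no_grad():
        img_vec = F.normalize(image_encoder(image), dim=-1)            # (1, d)
        txt_vecs = F.normalize(text_encoder(candidate_texts), dim=-1)  # (n, d)
    sims = (img_vec @ txt_vecs.T).squeeze(0)                           # (n,)
    best = int(sims.argmax())
    if float(sims[best]) >= threshold:  # claim 4: keep only texts above a preset threshold
        return candidate_texts[best], float(sims[best])
    return None, float(sims[best])
```

In this reading, the two encoders would correspond to the image encoder and text encoder of the pre-trained image annotation model referred to in claim 2.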
2. The method of claim 1, wherein the image annotation model comprises an image encoder and a text encoder, and the extracting, by the image annotation model, the image feature vector of the image to be annotated and the text feature vector corresponding to the text to be annotated comprises:
inputting the image to be annotated into the image encoder, and extracting the image feature vector through the image encoder;
and inputting the text to be annotated into the text encoder, and extracting the text feature vector through the text encoder.
3. The method according to claim 2, wherein the step of obtaining the similarity between the image feature vector and each of the text feature vectors specifically includes any one of:
calculating cosine similarity between the image feature vector and the text feature vector as the similarity;
calculating the Manhattan distance between the image feature vector and the text feature vector as the similarity;
and calculating Euclidean distance between the image feature vector and the text feature vector as the similarity.
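For reference, a minimal NumPy sketch of the three similarity options listed in claim 3; the variable names are illustrative. Note that cosine similarity grows as the pair becomes more similar, whereas the two distances shrink, so a threshold rule such as the one in claim 4 has to be applied with the corresponding direction in mind.

```python
# Minimal sketch of the three similarity options in claim 3.
import numpy as np

def cosine_similarity(img_vec: np.ndarray, txt_vec: np.ndarray) -> float:
    # Higher value means the image and text are more similar.
    return float(np.dot(img_vec, txt_vec) /
                 (np.linalg.norm(img_vec) * np.linalg.norm(txt_vec) + 1e-8))

def manhattan_distance(img_vec: np.ndarray, txt_vec: np.ndarray) -> float:
    # Smaller distance means the pair is more similar.
    return float(np.abs(img_vec - txt_vec).sum())

def euclidean_distance(img_vec: np.ndarray, txt_vec: np.ndarray) -> float:
    # Smaller distance means the pair is more similar.
    return float(np.linalg.norm(img_vec - txt_vec))
```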
4. The method according to claim 3, wherein the step of determining a target annotation text from each text to be annotated according to the similarity specifically comprises:
and if the similarity is greater than a preset similarity threshold, determining the text to be annotated corresponding to the similarity as the target annotation text.
5. The method according to claim 1, wherein the step of annotating the image to be annotated according to the target annotation text specifically comprises:
and determining the target annotation text as the annotation text of the image to be annotated.
6. A training method of an image annotation model is characterized by comprising the following steps:
matching the image training data and the text training data to obtain a training data pair, and generating a training data set according to the training data pair;
selecting at least two training data pairs to form a data batch, and inputting the data batch into an image annotation model, wherein the data batch comprises a positive training data pair and at least one negative training data pair;
extracting features of the image training data and the text training data of the positive training data pair in the data batch according to the image annotation model to generate a first image feature vector and a first text feature vector, and extracting features of the image training data and the text training data of the negative training data pair in the data batch according to the image annotation model to generate a second image feature vector and a second text feature vector;
forming an image feature queue according to the first image feature vector and the second image feature vector, and forming a text feature queue according to the first text feature vector and the second text feature vector;
calculating a first similarity between the first image feature vector and each text feature vector in the text feature queue, and calculating a second similarity between the first text feature vector and each image feature vector in the image feature queue;
and calculating a loss function value according to the first similarity and the second similarity, and training the image annotation model by taking the convergence of the loss function as a target to obtain the trained image annotation model.
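Claim 6 describes a contrastive objective in the spirit of CLIP/MoCo-style training: the matched (positive) pair should score higher than every mismatched (negative) pair, in both the image-to-text and text-to-image directions. The sketch below is a hedged reading of that step; the temperature value, the feature normalization, the assumption that the positive pair sits at index 0 of each queue, and the symmetric cross-entropy formulation are choices of this illustration, not requirements of the claim.

```python
# Hedged sketch of the contrastive training step of claim 6.
import torch
import torch.nn.functional as F

def contrastive_loss(first_img, first_txt, image_queue, text_queue,
                     temperature: float = 0.07):
    """first_img / first_txt: (d,) features of the positive pair.
    image_queue / text_queue: (k, d) features of the positive pair plus the
    negative pairs in the batch; index 0 is assumed to hold the positive."""
    first_img = F.normalize(first_img, dim=-1)
    first_txt = F.normalize(first_txt, dim=-1)
    image_queue = F.normalize(image_queue, dim=-1)
    text_queue = F.normalize(text_queue, dim=-1)

    # First similarities: positive image against every text in the text queue.
    logits_i2t = first_img @ text_queue.T / temperature   # (k,)
    # Second similarities: positive text against every image in the image queue.
    logits_t2i = first_txt @ image_queue.T / temperature  # (k,)

    target = torch.zeros(1, dtype=torch.long)  # the matching pair is at index 0
    loss = (F.cross_entropy(logits_i2t.unsqueeze(0), target) +
            F.cross_entropy(logits_t2i.unsqueeze(0), target)) / 2
    return loss
```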
7. The method of claim 6, wherein the step of obtaining image training data and text training data comprises:
acquiring a video frame in original video data, and preprocessing the video frame to generate first image data;
acquiring a text in the original video data, preprocessing the text to generate first text data corresponding to the first image data, and forming an original data pair by the first image data and the corresponding first text data;
performing data enhancement on the first image data and the first text data to generate the image training data and the text training data.
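A minimal sketch of the frame-extraction part of claim 7, assuming OpenCV is used for decoding; the sampling interval and resize resolution are illustrative assumptions, and the accompanying text (for example a title or caption field of the original video data) would be preprocessed separately.

```python
# Rough sketch of sampling video frames as first image data (claim 7).
import cv2

def extract_frames(video_path: str, every_n_frames: int = 30, size=(224, 224)):
    """Sample every_n_frames-th frame from the video and resize it."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n_frames == 0:
            frames.append(cv2.resize(frame, size))
        index += 1
    cap.release()
    return frames
```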
8. The method of claim 7, wherein the step of data enhancing the first image data and the first text data to generate the image training data and the text training data comprises:
transforming the first image data to generate the image training data by at least one of: rotation transformation, flip transformation, scaling transformation, translation transformation, scale transformation, noise perturbation, color transformation, or occlusion;
transforming the first text data to generate the text training data by at least one of: synonym replacement, random near-synonym replacement, Chinese equivalent-word replacement, translation conversion, or sentence-pattern inversion.
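On the image side, several of the transformations listed in claim 8 map directly onto standard torchvision transforms. The composition below is a sketch under that assumption; the parameter values are arbitrary, and noise perturbation is approximated with a small additive-Gaussian lambda because torchvision has no built-in noise transform. Text-side augmentation (synonym replacement, translation conversion, sentence-pattern inversion) would typically rely on an NLP toolkit or a translation model and is not shown here.

```python
# Illustrative image augmentation pipeline; parameters are assumptions.
import torch
from torchvision import transforms

image_augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                        # rotation
    transforms.RandomHorizontalFlip(p=0.5),                       # flip
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1),
                            scale=(0.8, 1.2)),                    # translation / scaling
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2),                       # color transformation
    transforms.ToTensor(),
    transforms.Lambda(lambda t: (t + 0.02 * torch.randn_like(t)).clamp(0, 1)),  # noise
    transforms.RandomErasing(p=0.5),                              # occlusion
])
```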
9. The method of claim 6, wherein the step of calculating a loss function value according to the first similarity and the second similarity and training the image annotation model with the convergence of the loss function as a target comprises:
calculating a reference similarity between the image training data of the positive training data pair and the text training data of the positive training data pair;
and setting the loss function with the target that the reference similarity is greater than or equal to all of the first similarities and the second similarities.
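One way to read claim 9 is as a hinge-style constraint: the reference similarity of the positive pair should be at least as large as every first and second similarity. A small sketch under that reading follows; the hinge formulation and the margin parameter are assumptions of this illustration, not details given in the claim.

```python
# Hinge-style reading of claim 9; margin and formulation are assumptions.
import torch

def reference_hinge_loss(reference_sim, first_sims, second_sims, margin: float = 0.0):
    """reference_sim: scalar similarity of the positive image/text pair.
    first_sims / second_sims: 1-D tensors of the similarities from claim 6."""
    all_sims = torch.cat([first_sims, second_sims])
    # Zero loss when reference_sim >= sim + margin, linear penalty otherwise.
    return torch.clamp(all_sims + margin - reference_sim, min=0).mean()
```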
10. A multi-modality based image annotation apparatus, comprising:
the text-to-be-annotated acquisition module is used for acquiring an image to be annotated and a picture type, and generating at least two texts to be annotated according to the picture type;
the feature extraction module is used for inputting the image to be annotated and each text to be annotated into a pre-trained image annotation model, and extracting an image feature vector of the image to be annotated and a text feature vector corresponding to each text to be annotated through the image annotation model;
the target annotation text determination module is used for acquiring the similarity between the image feature vector and each text feature vector, and determining a target annotation text from each text to be annotated according to the similarity;
and the annotation module is used for annotating the image to be annotated according to the target annotation text.
11. An apparatus for training an image annotation model, comprising:
the data acquisition module is used for matching the image training data and the text training data to obtain a training data pair, and generating a training data set according to the training data pair;
the data input module is used for selecting at least two training data pairs to form a data batch and inputting the data batch into an image annotation model, wherein the data batch comprises a positive training data pair and at least one negative training data pair;
the feature extraction module is used for extracting features of the image training data and the text training data of the positive training data pair in the data batch according to the image annotation model to generate a first image feature vector and a first text feature vector, and extracting features of the image training data and the text training data of the negative training data pair in the data batch according to the image annotation model to generate a second image feature vector and a second text feature vector;
the feature queue forming module is used for forming an image feature queue according to the first image feature vector and the second image feature vector and forming a text feature queue according to the first text feature vector and the second text feature vector;
the similarity obtaining module is used for calculating first similarities between the first image feature vector and each text feature vector in the text feature queue and calculating second similarities between the first text feature vector and each image feature vector in the image feature queue;
and the training module is used for calculating a loss function value according to the first similarity and the second similarity, training the image annotation model by taking the convergence of the loss function as a target, and obtaining the trained image annotation model.
12. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the multi-modality based image annotation method of any one of claims 1 to 5 or the training method of the image annotation model of any one of claims 6 to 9.
13. A computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the multi-modality based image annotation method of any one of claims 1 to 5 or the training method of the image annotation model of any one of claims 6 to 9.
CN202211034098.3A 2022-08-26 2022-08-26 Multi-mode-based image annotation method and device and electronic equipment Pending CN115424044A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211034098.3A CN115424044A (en) 2022-08-26 2022-08-26 Multi-mode-based image annotation method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211034098.3A CN115424044A (en) 2022-08-26 2022-08-26 Multi-mode-based image annotation method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN115424044A true CN115424044A (en) 2022-12-02

Family

ID=84200869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211034098.3A Pending CN115424044A (en) 2022-08-26 2022-08-26 Multi-mode-based image annotation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115424044A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117788958A (en) * 2024-02-26 2024-03-29 锐驰激光(深圳)有限公司 Image labeling method, device, equipment and storage medium


Similar Documents

Publication Publication Date Title
CN110602527B (en) Video processing method, device and storage medium
TWI728564B (en) Method, device and electronic equipment for image description statement positioning and storage medium thereof
WO2020134556A1 (en) Image style transfer method, device, electronic apparatus, and storage medium
CN113792207B (en) Cross-modal retrieval method based on multi-level feature representation alignment
CN109543066B (en) Video recommendation method and device and computer-readable storage medium
CN109961094B (en) Sample acquisition method and device, electronic equipment and readable storage medium
CN110874145A (en) Input method and device and electronic equipment
CN109670077B (en) Video recommendation method and device and computer-readable storage medium
CN114266840A (en) Image processing method, image processing device, electronic equipment and storage medium
CN110781323A (en) Method and device for determining label of multimedia resource, electronic equipment and storage medium
CN112115894B (en) Training method and device of hand key point detection model and electronic equipment
CN112150457A (en) Video detection method, device and computer readable storage medium
CN107133354A (en) The acquisition methods and device of description information of image
CN112148980B (en) Article recommending method, device, equipment and storage medium based on user click
CN110941727B (en) Resource recommendation method and device, electronic equipment and storage medium
CN111160047A (en) Data processing method and device and data processing device
CN110619325A (en) Text recognition method and device
CN111046927A (en) Method and device for processing labeled data, electronic equipment and storage medium
CN111814538A (en) Target object type identification method and device, electronic equipment and storage medium
CN115424044A (en) Multi-mode-based image annotation method and device and electronic equipment
CN111274389B (en) Information processing method, device, computer equipment and storage medium
CN112036307A (en) Image processing method and device, electronic equipment and storage medium
CN115146633A (en) Keyword identification method and device, electronic equipment and storage medium
CN110659726B (en) Image processing method and device, electronic equipment and storage medium
CN114067334A (en) Handwriting track recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination