CN116109732A - Image labeling method, device, processing equipment and storage medium


Publication number
CN116109732A
Authority
CN
China
Prior art keywords: image, text, preset, features, sample
Prior art date
Legal status
Pending
Application number
CN202310085834.6A
Other languages
Chinese (zh)
Inventor
章鑫锋
张荣升
陈伟杰
赵增
刘柏
吕唐杰
范长杰
胡志鹏
Current Assignee
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202310085834.6A priority Critical patent/CN116109732A/en
Publication of CN116109732A publication Critical patent/CN116109732A/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/002D [Two Dimensional] image generation
    • G06T11/60Editing figures and text; Combining figures or text

Abstract

The application provides an image labeling method, an image labeling device, a processing device and a storage medium, and relates to the field of computer technologies. The method comprises the following steps: extracting features from an image to be annotated by adopting a preset image coding model to obtain image features of the image to be annotated; determining, according to the image features of the image to be annotated, a target text feature matched with the image features from a plurality of preset text features, wherein the plurality of preset text features are obtained by extracting features from a plurality of preset texts by adopting a preset text coding model, and the preset image coding model and the preset text coding model are jointly trained on sample image-text pair data; and labeling the image to be annotated according to the target text corresponding to the target text feature. The matched target text feature is determined based on the image features of the image to be annotated, and the image is then labeled based on the target text corresponding to that feature, so art resources can be annotated directly and user experience is improved.

Description

Image labeling method, device, processing equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an image labeling method, an image labeling device, a processing device, and a storage medium.
Background
With the rapid development of internet technology, art resources on electronic devices keep increasing, and the various art resources can be saved in an art resource and collaboration management system. To manage art resources effectively, labeling them has become a research hotspot.
In the related art, the image types in art resources are rich and can include model images, original artwork, videos and other image resources in games. When an art resource is labeled, text data describing the art resource and a label set corresponding to the text data are obtained, a trained model determines the labels corresponding to the text data according to the text data and the corresponding label set, and the text data of the art resource is labeled.
However, in the related art, the trained model can only perform labeling based on the text data of the art resource, so the art resource itself cannot be labeled directly, which degrades the user experience.
Disclosure of Invention
The purpose of the present application is to overcome the above defects in the prior art by providing an image labeling method, an image labeling device, a processing device and a storage medium, so as to solve the problems in the related art that labeling depends on manual work by staff, labeling efficiency is low, and human resources are wasted unnecessarily.
In order to achieve the above purpose, the technical solution adopted in the embodiment of the present application is as follows:
in a first aspect, an embodiment of the present application provides an image labeling method, including:
extracting features from an image to be annotated by adopting a preset image coding model to obtain image features of the image to be annotated;
determining, according to the image features of the image to be annotated, a target text feature matched with the image features from a plurality of preset text features, wherein the plurality of preset text features are obtained by extracting features from a plurality of preset texts by adopting a preset text coding model, and the preset image coding model and the preset text coding model are jointly trained on sample image-text pair data;
and labeling the image to be annotated according to the target text corresponding to the target text feature.
In a second aspect, an embodiment of the present application further provides an image labeling apparatus, including:
the feature extraction module is used for extracting features from an image to be annotated by adopting a preset image coding model to obtain image features of the image to be annotated;
the determining module is used for determining, according to the image features of the image to be annotated, a target text feature matched with the image features from a plurality of preset text features, wherein the plurality of preset text features are obtained by extracting features from a plurality of preset texts by adopting a preset text coding model, and the preset image coding model and the preset text coding model are jointly trained on sample image-text pair data;
and the labeling module is used for labeling the image to be annotated according to the target text corresponding to the target text feature.
In a third aspect, embodiments of the present application further provide a processing device, including: a memory and a processor, wherein the memory stores a computer program executable by the processor, and the processor implements the image labeling method according to any one of the first aspect when executing the computer program.
In a fourth aspect, embodiments of the present application further provide a computer readable storage medium, on which a computer program is stored; when the computer program is read and executed, the image labeling method according to any one of the first aspect is implemented.
The beneficial effects of this application are as follows. The embodiment of the application provides an image labeling method, including: extracting features from an image to be annotated by adopting a preset image coding model to obtain image features of the image to be annotated; determining, according to the image features of the image to be annotated, a target text feature matched with the image features from a plurality of preset text features, wherein the plurality of preset text features are obtained by extracting features from a plurality of preset texts by adopting a preset text coding model, and the preset image coding model and the preset text coding model are jointly trained on sample image-text pair data; and labeling the image to be annotated according to the target text corresponding to the target text feature. Because the two models are jointly trained on sample image-text pair data, the image features extracted by the preset image coding model from the image to be annotated can be matched against the plurality of preset text features extracted by the preset text coding model from the plurality of preset texts. The matched target text feature is determined from the plurality of preset text features based on the image features of the image to be annotated, and the image is then labeled based on the target text corresponding to that feature. The whole process does not rely on any textual description data of the image to be annotated; the target text feature is determined from the image features to realize labeling, so art resources are annotated directly and user experience is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should therefore not be regarded as limiting the scope; other related drawings may be obtained from these drawings without inventive effort by a person skilled in the art.
Fig. 1 is a schematic flow chart of an image labeling method according to an embodiment of the present application;
fig. 2 is a schematic flow chart of an image labeling method according to an embodiment of the present application;
fig. 3 is a schematic flow chart of an image labeling method according to an embodiment of the present application;
fig. 4 is a schematic flow chart of an image labeling method according to an embodiment of the present application;
fig. 5 is a schematic flow chart of an image labeling method according to an embodiment of the present application;
fig. 6 is a schematic flow chart of an image labeling method according to an embodiment of the present application;
fig. 7 is a schematic flow chart of an image labeling method according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an image labeling device according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a processing apparatus according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments.
Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
In the description of the present application, it should be noted that terms such as "upper" and "lower", if used, indicate orientations or positional relationships based on those shown in the drawings, or the orientation or positional relationship in which the product of the application is commonly placed when in use. They are used merely for convenience and simplicity of description and do not indicate or imply that the apparatus or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore should not be construed as limiting the present application.
Furthermore, the terms first, second and the like in the description and in the claims of the present application and in the above-described figures, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that, without conflict, features in embodiments of the present application may be combined with each other.
In the related art, the image types in art resources are rich and can include model images, original artwork, videos and other image resources in games. When an art resource is labeled, text data describing the art resource and a label set corresponding to the text data are obtained, a trained model determines the labels corresponding to the text data according to the text data and the corresponding label set, and the text data of the art resource is labeled. However, in the related art, the trained model can only perform labeling based on the text data of the art resource, so the art resource itself cannot be labeled directly, which degrades the user experience.
Aiming at the above technical problems in the related art, the embodiment of the application provides an image labeling method. Because the preset image coding model and the preset text coding model are jointly trained on sample image-text pair data, the image features the preset image coding model extracts from the image to be annotated can be matched against the plurality of preset text features the preset text coding model extracts from a plurality of preset texts. The matched target text feature is determined from the plurality of preset text features based on the image features of the image to be annotated, and the image to be annotated can then be labeled based on the target text corresponding to that feature. The whole process does not require any textual description data of the image to be annotated; the target text feature is determined from the image features to realize labeling, that is, art resources are annotated directly, and user experience is improved.
An explanation of the image labeling method provided in the embodiment of the present application is given below. The image labeling method provided by the embodiment of the application is applied to a processing device, which may be a terminal or a server. If the processing device is a terminal, the terminal may be any one of the following: a desktop computer, a notebook computer, a tablet computer, a smart phone, and the like.
Fig. 1 is a schematic flow chart of an image labeling method provided in an embodiment of the present application, as shown in fig. 1, the method may include:
s101, carrying out feature extraction on an image to be marked by adopting a preset image coding model to obtain image features of the image to be marked.
If the number of the images to be marked is a plurality of images, extracting the features of the images to be marked by a preset image coding model in a sequential extraction mode or a simultaneous extraction mode to obtain the image features of each image to be marked.
It should be noted that, the image to be marked may be a model image in a game, an original picture, or other types of art resources, which is not limited in this embodiment of the present application.
In addition, the image to be annotated can be understood as an image to be annotated, and the image to be annotated is annotated, so that the image to be annotated can have corresponding text interpretation, and the image with sky can be illustrated, and the corresponding annotation can be "sky".
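As an illustrative sketch only, step S101 can be realized with an off-the-shelf visual encoder standing in for the "preset image coding model"; the patent does not name a concrete library, so the checkpoint name, the processor, and the use of the [CLS] embedding below are all assumptions:

```python
# Illustrative only: a pretrained ViT backbone standing in for the "preset image coding model".
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")  # assumed checkpoint
encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")
encoder.eval()

def encode_image(path: str) -> torch.Tensor:
    """Extract one image characterization vector (length 768) for an image to be annotated."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    # Use the [CLS] token embedding as the image feature.
    return outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)
```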
S102, determining target text features matched with the image features from a plurality of preset text features according to the image features of the image to be annotated.
The plurality of preset text features are obtained by extracting features from a plurality of preset texts by adopting a preset text coding model, and the preset image coding model and the preset text coding model are jointly trained on sample image-text pair data.
In some embodiments, similarity is calculated between the image features of the image to be annotated and each of the plurality of preset text features, and the preset text feature with the highest similarity is taken as the target text feature.
It should be noted that the sample image-text pair data includes a plurality of sample images and the sample text corresponding to each sample image. The sample text represents the meaning of the corresponding sample image, and because the preset image coding model and the preset text coding model are jointly trained on the sample image-text pair data, the two models can learn the association between images and texts.
In addition, because the preset image coding model and the preset text coding model have learned the association between images and texts, the image features extracted by the preset image coding model and the preset text features extracted by the preset text coding model can be matched against each other, and the target text feature matching the image features can be determined from the plurality of preset text features.
S103, labeling the image to be annotated according to the target text corresponding to the target text feature.
The target text feature is obtained by the preset text coding model performing feature extraction on the target text.
In the embodiment of the application, the target text corresponding to the target text feature is determined and taken as the annotation information of the image to be annotated, thereby completing the labeling of the image to be annotated.
It should be noted that the target text may consist of Chinese characters, English characters, or other types of characters, which is not specifically limited in the embodiment of the present application.
In summary, an embodiment of the present application provides an image labeling method, including: extracting features from an image to be annotated by adopting a preset image coding model to obtain image features of the image to be annotated; determining, according to the image features of the image to be annotated, a target text feature matched with the image features from a plurality of preset text features, wherein the plurality of preset text features are obtained by extracting features from a plurality of preset texts by adopting a preset text coding model, and the preset image coding model and the preset text coding model are jointly trained on sample image-text pair data; and labeling the image to be annotated according to the target text corresponding to the target text feature. Because the two models are jointly trained on sample image-text pair data, the image features the preset image coding model extracts from the image to be annotated can be matched against the plurality of preset text features the preset text coding model extracts from the plurality of preset texts. The matched target text feature is determined from the plurality of preset text features based on the image features of the image to be annotated, and the image is then labeled based on the target text corresponding to that feature. The whole process does not rely on any textual description data of the image to be annotated; the target text feature is determined from the image features to realize labeling, that is, art resources are annotated directly, and user experience is improved.
Optionally, fig. 2 is a schematic flow chart of an image labeling method provided in the embodiment of the present application. As shown in fig. 2, before the process in S102 of determining, according to the image features of the image to be annotated, a target text feature matched with the image features from a plurality of preset text features, the method may include:
s201, acquiring a plurality of initial texts.
Wherein the plurality of initial texts may be a plurality of sentences.
In practical application, a plurality of initial texts may be downloaded from the internet, or may be obtained from a preset database, or may be obtained in other manners, which is not specifically limited in the embodiment of the present application.
S202, analyzing and counting the initial texts to obtain a plurality of preset texts.
In the embodiment of the application, the plurality of initial texts can be analyzed and counted simultaneously to obtain the plurality of preset texts, which improves the efficiency of obtaining them; the preset texts may be words. Of course, the plurality of initial texts can also be analyzed and counted sequentially according to actual requirements, which is not specifically limited in the embodiment of the present application.
S203, adopting a preset text coding model to respectively perform feature extraction processing on a plurality of preset texts to obtain preset text features corresponding to each preset text.
The plurality of preset texts include the target text.
It should be noted that, a plurality of preset texts may be input into a preset text coding model, the preset text coding model may perform feature extraction processing on the plurality of preset texts, and the preset text coding model may output preset text features corresponding to each preset text.
Optionally, fig. 3 is a schematic flow chart of an image labeling method provided in the embodiment of the present application, as shown in fig. 3, a process of performing analysis statistics on a plurality of initial texts to obtain a plurality of preset texts in S202 may include:
s301, word segmentation processing is carried out on a plurality of initial texts, words with similar meanings are combined, and a plurality of combined words are obtained.
S302, sorting the plurality of the merged words according to the word frequency of each merged word to obtain a sorting result.
In some embodiments, the plurality of initial texts are a plurality of sentences, the plurality of sentences are subjected to word segmentation processing, and the plurality of sentences are divided into a plurality of words with part of speech; counting word frequency of each word to obtain frequency of each word; part-of-speech analysis is performed on each word to identify the part of speech of each word, and by way of example, nouns or verbs can be identified; and merging the semantic words to obtain merged words.
In merging words that are similar in meaning, for example, it is possible to merge "night" and "evening", merge "morning" and "morning", merge "cup" and "cup", and so forth.
S303, determining a plurality of preset texts from the plurality of combined words according to the sorting result.
Wherein each preset text is a noun.
In the embodiment of the application, the nouns among the plurality of merged words are sorted according to the word frequency of each merged word to obtain the sorting result, and a first preset number of nouns with the highest word frequencies are selected from them as the plurality of preset texts.
In addition, the plurality of preset texts may be stored in a tag library; optionally, the number of preset texts in the tag library may be in the hundreds of thousands.
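The following sketch illustrates one possible implementation of the analysis and statistics of S301-S303, assuming the jieba segmenter is used for Chinese word segmentation; the synonym merge table, the noun filter, and the frequency cutoff are illustrative placeholders, not details given by the patent:

```python
# Illustrative only: building the preset-text tag library from initial texts (S301-S303).
from collections import Counter
import jieba.posseg as pseg  # assumed Chinese segmenter; not named by the patent

SYNONYMS = {"傍晚": "夜晚", "清晨": "早晨"}  # hypothetical merge table for words with similar meanings

def build_label_library(sentences, top_k):
    counter = Counter()
    for sentence in sentences:
        for token in pseg.cut(sentence):      # word segmentation with part-of-speech tags
            if token.flag.startswith("n"):    # keep nouns only
                word = SYNONYMS.get(token.word, token.word)  # merge similar words
                counter[word] += 1
    # Sort by word frequency and keep the top_k nouns as the preset texts.
    return [word for word, _ in counter.most_common(top_k)]

# initial_texts: the sentences obtained in S201 (assumed to exist).
preset_texts = build_label_library(initial_texts, top_k=100000)
```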
Optionally, the process of acquiring a plurality of initial texts in S201 may include:
and taking the sample text in the sample image-text pair data as a plurality of initial texts.
The sample image-text data comprises: a plurality of sample images and sample text corresponding to each sample image. And extracting sample texts in the sample image-text data, and taking the sample texts as a plurality of initial texts.
Optionally, fig. 4 is a schematic flow chart of an image labeling method provided in the embodiment of the present application. As shown in fig. 4, the preset image coding model and the preset text coding model in the embodiment of the present application are obtained in the following manner:
s401, acquiring sample image-text pair data.
The sample image-text pair data may include: a plurality of sample images and sample text corresponding to each sample image.
It should be noted that the sample text is a textual description of the corresponding sample image. For example, if sample image a contains a rainbow, the sample text a corresponding to sample image a may be "rainbow"; if sample image b contains a virtual weapon, the sample text b corresponding to sample image b may be "virtual weapon".
In the embodiment of the present application, the sample image-text pair data may be downloaded from the internet, obtained from a preset database, or obtained in other manners, which is not specifically limited in the embodiment of the present application.
S402, training the initial image coding model and the initial text coding model according to the sample image-text pair data to obtain a preset image coding model and a preset text coding model.
The initial image coding model and the initial text coding model may be models with a double-tower structure. The two models are trained together according to the sample image-text pair data to obtain the preset image coding model and the preset text coding model, so that the preset image coding model and the preset text coding model can learn the association between images and texts.
Alternatively, the initial image coding model may be a ViT (Vision Transformer, a visual self-attention model) and the initial text coding model may be a unidirectional GPT-2 (Generative Pre-Training) model.
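A minimal sketch of such a double-tower structure follows, assuming PyTorch and Hugging Face Transformers; the projection heads, pooling choices, and checkpoint names are assumptions, since the patent only names the backbone types:

```python
# Illustrative only: a double-tower model with a ViT image encoder and a GPT-2 text encoder.
import torch
import torch.nn as nn
from transformers import GPT2Model, ViTModel

class DualTower(nn.Module):
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.image_tower = ViTModel.from_pretrained("google/vit-base-patch16-224")  # assumed checkpoint
        self.text_tower = GPT2Model.from_pretrained("gpt2")                         # assumed checkpoint
        # Linear projections into a shared embedding space (an assumption of this sketch).
        self.image_proj = nn.Linear(self.image_tower.config.hidden_size, embed_dim)
        self.text_proj = nn.Linear(self.text_tower.config.hidden_size, embed_dim)

    def encode_image(self, pixel_values: torch.Tensor) -> torch.Tensor:
        cls = self.image_tower(pixel_values=pixel_values).last_hidden_state[:, 0, :]
        return self.image_proj(cls)

    def encode_text(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.text_tower(input_ids=input_ids,
                                 attention_mask=attention_mask).last_hidden_state
        # GPT-2 is unidirectional, so pool with the last non-padding token of each text.
        last = attention_mask.sum(dim=1) - 1
        return self.text_proj(hidden[torch.arange(hidden.size(0)), last])
```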
Optionally, fig. 5 is a schematic flow chart of an image labeling method provided in the embodiment of the present application. As shown in fig. 5, the process in S402 of training the initial image coding model and the initial text coding model according to the sample image-text pair data to obtain the preset image coding model and the preset text coding model may include:
s501, performing feature extraction processing on a plurality of sample images in sample image-text pair data by adopting a visual coding network in an initial image coding model to obtain sample image features of the plurality of sample images.
The visual coding network in the initial image coding model can perform feature extraction processing on the plurality of sample images, and the visual coding network in the initial image coding model can output sample image features of the plurality of sample images.
S502, performing feature extraction processing on sample texts corresponding to each sample image in sample image-text pair data by adopting a text coding network in an initial text coding model to obtain sample text features of a plurality of sample texts.
The sample texts corresponding to the sample images, i.e., the plurality of sample texts, are input into the text coding network in the initial text coding model; the text coding network performs feature extraction on the plurality of sample texts and outputs the sample text features of the plurality of sample texts.
In the embodiment of the present application, the networks in the initial image encoding model and the initial text encoding model are different.
S503, calculating loss function values between sample image features of a plurality of sample images and sample text features of a plurality of sample texts, and updating parameters of an initial image coding model and parameters of an initial text coding model according to the loss function values until the loss function values meet preset conditions, so as to obtain a preset image coding model and a preset text coding model.
Wherein the loss can be back-propagated to update the gradients.
In some embodiments, a contrastive learning loss function is adopted to calculate the loss function values between the sample image features of the plurality of sample images and the sample text features of the plurality of sample texts, and the weights of the initial image coding model and of the initial text coding model are updated according to the loss function values until the newly obtained loss function values converge, so as to obtain the preset image coding model and the preset text coding model.
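A hedged training-loop sketch for S503 follows; `model` is the double-tower sketch above, `dataloader` is an assumed iterator over batches of sample image-text pairs, and `clip_style_loss` is sketched after the loss-calculation steps below:

```python
# Illustrative only: one training pass for S503 with a contrastive learning loss.
import torch

model = DualTower()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # optimizer choice is an assumption

for pixel_values, input_ids, attention_mask in dataloader:  # batches of sample image-text pairs
    image_feats = model.encode_image(pixel_values)
    text_feats = model.encode_text(input_ids, attention_mask)
    loss = clip_style_loss(image_feats, text_feats)  # sketched below
    optimizer.zero_grad()
    loss.backward()   # back-propagate the loss
    optimizer.step()  # update the weights of both towers
```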
Optionally, fig. 6 is a flowchart of an image labeling method according to an embodiment of the present application, as shown in fig. 6, a process of calculating a loss function value between sample image features of a plurality of sample images and sample text features of a plurality of sample texts in the above step S503 may include:
s601, respectively carrying out normalization processing on sample image features of a plurality of sample images and sample text features of a plurality of sample texts to obtain a plurality of processed sample image features and a plurality of processed sample text features.
It should be noted that the order in which the sample image features of the plurality of sample images and the sample text features of the plurality of sample texts are normalized is not specifically limited; of course, the two normalizations may also be performed simultaneously.
S602, performing dot multiplication processing on the plurality of processed sample image features and the plurality of processed sample text features to obtain a plurality of similarity results.
In some embodiments, each processed sample image feature is dot-multiplied with each of the plurality of processed sample text features, and each processed sample text feature is dot-multiplied with each of the plurality of processed sample image features, so as to obtain the similarity results.
For example, the sample image-text pair data may include: sample image a and its corresponding sample text x; sample image b and its corresponding sample text y; and sample image c and its corresponding sample text z. The dot products involving a and x are then a·x, a·y, a·z, x·a, x·b, and x·c. Similarly, the dot products for b and y and for c and z are analogous and are not repeated here.
Wherein each dot product result is one similarity result.
S603, adopting a contrastive learning loss function, and calculating the loss according to the plurality of similarity results.
In the embodiment of the application, the contrastive learning loss is calculated according to the plurality of similarity results: within the same sample image-text pair, the higher the similarity between the sample image feature and the sample text feature, the better; across different pairs, the lower the similarity between the sample image feature and the sample text feature, the better. This principle guides the training of the double-tower model; the weights of the two models are updated, and when the loss function value converges, the preset image coding model and the preset text coding model are obtained.
For example, take sample image a with corresponding sample text x, sample image b with corresponding sample text y, and sample image c with corresponding sample text z. The higher the similarity of a to x, the better; in addition, the lower the similarities of a and y, a and z, x and b, and x and c, the better.
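Putting S601-S603 together, a minimal contrastive-loss sketch (a CLIP-style symmetric formulation) might look as follows; the temperature value is an assumption not given in the patent:

```python
# Illustrative only: symmetric contrastive loss over a batch of matched image-text pairs.
import torch
import torch.nn.functional as F

def clip_style_loss(image_feats: torch.Tensor, text_feats: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    image_feats = F.normalize(image_feats, dim=-1)  # S601: normalization
    text_feats = F.normalize(text_feats, dim=-1)
    # S602: dot-multiply every image feature with every text feature (N x N similarities).
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(image_feats.size(0))  # the i-th image matches the i-th text
    # S603: pull matched pairs toward high similarity, mismatched pairs toward low similarity.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```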
In the embodiment of the application, the preset text coding model performs feature extraction on the plurality of preset texts to obtain the plurality of preset text features. Each preset text feature may be a text characterization vector; the length of each text characterization vector may be 768; and the plurality of preset texts may be stored in correspondence with the plurality of preset text features.
In addition, feature extraction is performed on the image to be annotated by adopting the preset image coding model to obtain the image features of the image to be annotated. The image features may be image characterization vectors, and the length of each image characterization vector may likewise be 768.
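A sketch of precomputing and storing the preset text features with the trained text tower follows; `model` and `tokenizer` are the assumed objects from the training sketches above, and the batching is illustrative:

```python
# Illustrative only: encode every preset text once and store the label index.
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_label_index(preset_texts, model, tokenizer, batch_size=256):
    features = []
    for i in range(0, len(preset_texts), batch_size):
        batch = preset_texts[i:i + batch_size]
        # Assumes tokenizer.pad_token has been set (e.g. to the GPT-2 eos token).
        enc = tokenizer(batch, padding=True, return_tensors="pt")
        feats = model.encode_text(enc["input_ids"], enc["attention_mask"])
        features.append(F.normalize(feats, dim=-1))
    return torch.cat(features)  # shape: (number of preset texts, 768)
```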
Optionally, fig. 7 is a flowchart of an image labeling method provided in the embodiment of the present application, as shown in fig. 7, a process of determining, in S102, a target text feature matched with an image feature from a plurality of preset text features according to the image feature of the image to be labeled may include:
S701, respectively performing dot multiplication on the image features and each preset text feature to obtain the similarity between the image features and each preset text feature.
In some embodiments, each of the image feature and the plurality of preset text features is subjected to dot multiplication to obtain a similarity between the image feature and each of the preset text features, i.e., each of the preset text features has a corresponding similarity for the image feature.
Wherein, the similarity may also be referred to as a similarity score, and the score of each similarity score may be between 0 and 1.
S702, determining target text features from a plurality of preset text features according to the similarity.
In the embodiment of the present application, the similarities corresponding to the preset text features are sorted to obtain a sorting result; the sorting may be from large to small or from small to large, which is not specifically limited in the embodiment of the present application.
It is worth noting that a second preset number of target text features with the highest similarities are determined according to the sorting result. For example, there may be one hundred thousand preset texts and therefore one hundred thousand preset text features; a similarity score (between 0 and 1) is calculated for each preset text feature, the scores are sorted from large to small, and the top N are selected, for example N=10; the top 10 preset text features are then the target text features, and the preset texts corresponding to them are the target texts.
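A sketch of S701-S702 under the same assumptions: dot-multiply the image feature with every stored preset text feature and keep the N highest-scoring preset texts as target texts (N=10 follows the example above):

```python
# Illustrative only: score one image feature against every preset text feature (S701-S702).
import torch
import torch.nn.functional as F

@torch.no_grad()
def annotate(image_feat, label_features, preset_texts, n=10):
    image_feat = F.normalize(image_feat, dim=-1)
    scores = (image_feat @ label_features.t()).squeeze(0)  # one similarity score per preset text
    top = torch.topk(scores, k=n)                          # sort from large to small, keep top N
    # The N preset texts with the highest scores are the target texts for annotation.
    return [(preset_texts[i], float(s)) for s, i in zip(top.values, top.indices)]
```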
In summary, an embodiment of the present application provides an image labeling method, including: extracting features from an image to be annotated by adopting a preset image coding model to obtain image features of the image to be annotated; determining, according to the image features of the image to be annotated, a target text feature matched with the image features from a plurality of preset text features, wherein the plurality of preset text features are obtained by extracting features from a plurality of preset texts by adopting a preset text coding model, and the preset image coding model and the preset text coding model are jointly trained on sample image-text pair data; and labeling the image to be annotated according to the target text corresponding to the target text feature. Because the two models are jointly trained on sample image-text pair data, the image features the preset image coding model extracts from the image to be annotated can be matched against the plurality of preset text features the preset text coding model extracts from the plurality of preset texts. The matched target text feature is determined from the plurality of preset text features based on the image features of the image to be annotated, and the image is then labeled based on the target text corresponding to that feature. The whole process does not rely on any textual description data of the image to be annotated; the target text feature is determined from the image features to realize labeling, so art resources are annotated directly and user experience is improved.
The following describes an image labeling device, a processing device, a storage medium, and the like for executing the image labeling method provided in the present application. For their specific implementation processes and technical effects, refer to the relevant content of the image labeling method above; details are not repeated below.
Fig. 8 is a schematic structural diagram of an image labeling device according to an embodiment of the present application, and as shown in fig. 8, the device may include:
the feature extraction module 801 is configured to perform feature extraction on an image to be annotated by adopting a preset image coding model to obtain image features of the image to be annotated;
a determining module 802, configured to determine, according to the image features of the image to be annotated, a target text feature matched with the image features from a plurality of preset text features, where the plurality of preset text features are obtained by extracting features from a plurality of preset texts by adopting a preset text coding model, and the preset image coding model and the preset text coding model are jointly trained on sample image-text pair data;
and the labeling module 803 is configured to label the image to be annotated according to the target text corresponding to the target text feature.
Optionally, the apparatus further includes:
The first acquisition module is used for acquiring a plurality of initial texts;
the analysis and statistics module is used for carrying out analysis and statistics on the plurality of initial texts to obtain a plurality of preset texts;
and the first feature extraction module is used for respectively carrying out feature extraction processing on the plurality of preset texts by adopting the preset text coding model to obtain preset text features corresponding to each preset text.
Optionally, the analysis and statistics module is specifically configured to perform word segmentation on the plurality of initial texts and merge words with similar meanings to obtain a plurality of merged words; sort the plurality of merged words according to the word frequency of each merged word to obtain a sorting result; and determine the plurality of preset texts from the merged words according to the sorting result, wherein each preset text is a noun.
Optionally, the preset image coding model and the preset text coding model are obtained in the following manner: obtaining the sample image-text pair data, wherein the sample image-text pair data includes: a plurality of sample images and the sample text corresponding to each sample image; and training an initial image coding model and an initial text coding model according to the sample image-text pair data to obtain the preset image coding model and the preset text coding model.
Optionally, the training module is specifically configured to perform feature extraction processing on the plurality of sample images in the sample image-text pair data by using a visual coding network in the initial image coding model, so as to obtain sample image features of the plurality of sample images; performing feature extraction processing on sample texts corresponding to each sample image in the sample image-text pair data by adopting a text coding network in the initial text coding model to obtain sample text features of a plurality of sample texts; calculating loss function values between sample image features of the plurality of sample images and sample text features of the plurality of sample texts, and updating parameters of the initial image coding model and parameters of the initial text coding model according to the loss function values until the loss function values meet preset conditions, so as to obtain the preset image coding model and the preset text coding model.
Optionally, the training module is specifically configured to normalize the sample image features of the plurality of sample images and the sample text features of the plurality of sample texts respectively, so as to obtain a plurality of processed sample image features and a plurality of processed sample text features; perform dot multiplication on the plurality of processed sample image features and the plurality of processed sample text features to obtain a plurality of similarity results; and calculate the loss function value according to the plurality of similarity results by adopting a contrastive learning loss function.
Optionally, the first obtaining module is specifically configured to take the sample texts in the sample image-text pair data as the plurality of initial texts.
Optionally, the determining module 802 is specifically configured to perform dot multiplication on the image feature and each preset text feature to obtain a similarity between the image feature and each preset text feature; and determining the target text feature from the plurality of preset text features according to the similarity.
The foregoing apparatus is used for implementing the method provided in the foregoing embodiments; its implementation principles and technical effects are similar and are not described here again.
The above modules may be one or more integrated circuits configured to implement the above apparatus, for example: one or more application specific integrated circuits (Application Specific Integrated Circuit, ASIC), one or more microprocessors (Digital Signal Processor, DSP), one or more field programmable gate arrays (Field Programmable Gate Array, FPGA), or the like. For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (Central Processing Unit, CPU) or another processor that can invoke the program code. For another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).
Fig. 9 is a schematic structural diagram of a processing device according to an embodiment of the present application, as shown in fig. 9, where the processing device may include: processor 901, memory 902. The memory 902 is used for storing a program, and the processor 901 calls the program stored in the memory 902 to execute the above method embodiment. The specific implementation manner and the technical effect are similar, and are not repeated here.
For example, the method may include:
extracting features from an image to be annotated by adopting a preset image coding model to obtain image features of the image to be annotated;
determining, according to the image features of the image to be annotated, a target text feature matched with the image features from a plurality of preset text features, wherein the plurality of preset text features are obtained by extracting features from a plurality of preset texts by adopting a preset text coding model, and the preset image coding model and the preset text coding model are jointly trained on sample image-text pair data;
and labeling the image to be annotated according to the target text corresponding to the target text feature.
Optionally, before determining the target text feature matched with the image feature from a plurality of preset text features according to the image feature of the image to be annotated, the method includes:
Acquiring a plurality of initial texts;
analyzing and counting the plurality of initial texts to obtain a plurality of preset texts;
and respectively carrying out feature extraction processing on the plurality of preset texts by adopting the preset text coding model to obtain preset text features corresponding to each preset text.
Optionally, the analyzing and counting the plurality of initial texts to obtain a plurality of preset texts includes:
performing word segmentation on the plurality of initial texts and merging words with similar meanings to obtain a plurality of merged words;
sorting the plurality of merged words according to the word frequency of each merged word to obtain a sorting result;
and determining the preset texts from the merged words according to the sorting result, wherein each preset text is a noun.
Optionally, the preset image coding model and the preset text coding model are obtained in the following manner:
obtaining the sample image-text pair data, wherein the sample image-text pair data comprises: a plurality of sample images and sample text corresponding to each sample image;
training an initial image coding model and an initial text coding model according to the sample image-text pair data to obtain the preset image coding model and the preset text coding model.
Optionally, the training the initial image coding model and the initial text coding model according to the sample image-text pair data to obtain the preset image coding model and the preset text coding model includes:
performing feature extraction processing on the plurality of sample images in the sample image-text pair data by adopting a visual coding network in the initial image coding model to obtain sample image features of the plurality of sample images;
performing feature extraction processing on sample texts corresponding to each sample image in the sample image-text pair data by adopting a text coding network in the initial text coding model to obtain sample text features of a plurality of sample texts;
calculating loss function values between sample image features of the plurality of sample images and sample text features of the plurality of sample texts, and updating parameters of the initial image coding model and parameters of the initial text coding model according to the loss function values until the loss function values meet preset conditions, so as to obtain the preset image coding model and the preset text coding model.
Optionally, the calculating a loss function value between the sample image features of the plurality of sample images and the sample text features of the plurality of sample texts includes:
Respectively carrying out normalization processing on sample image features of the plurality of sample images and sample text features of the plurality of sample texts to obtain a plurality of processed sample image features and a plurality of processed sample text features;
performing dot multiplication processing on the plurality of processed sample image features and the plurality of processed sample text features to obtain a plurality of similarity results;
and calculating the loss function value according to the plurality of similarity results by adopting a contrastive learning loss function.
Optionally, the acquiring a plurality of initial texts includes:
and taking the sample text in the sample image-text pair data as the plurality of initial texts.
Optionally, the determining, according to the image feature of the image to be annotated, a target text feature matched with the image feature from a plurality of preset text features includes:
respectively carrying out dot multiplication on the image features and each preset text feature to obtain the similarity between the image features and each preset text feature;
and determining the target text feature from the plurality of preset text features according to the similarity.
In summary, because the preset image coding model and the preset text coding model are jointly trained on sample image-text pair data, the image features the preset image coding model extracts from the image to be annotated can be matched against the plurality of preset text features the preset text coding model extracts from the plurality of preset texts. The matched target text feature is determined from the plurality of preset text features based on the image features of the image to be annotated, and the image is then labeled based on the target text corresponding to that feature. The whole process does not rely on any textual description data of the image to be annotated; labeling is realized based on the target text feature determined from the image features, so art resources are annotated directly and user experience is improved.
Optionally, the present application also provides a program product, such as a computer readable storage medium, comprising a program which, when executed by a processor, performs the above-described method embodiments.
For example, the method may include:
extracting features from an image to be annotated by adopting a preset image coding model to obtain image features of the image to be annotated;
determining, according to the image features of the image to be annotated, a target text feature matched with the image features from a plurality of preset text features, wherein the plurality of preset text features are obtained by extracting features from a plurality of preset texts by adopting a preset text coding model, and the preset image coding model and the preset text coding model are jointly trained on sample image-text pair data;
and labeling the image to be annotated according to the target text corresponding to the target text feature.
Optionally, before determining the target text feature matched with the image feature from a plurality of preset text features according to the image feature of the image to be annotated, the method includes:
acquiring a plurality of initial texts;
analyzing and counting the plurality of initial texts to obtain a plurality of preset texts;
And respectively carrying out feature extraction processing on the plurality of preset texts by adopting the preset text coding model to obtain preset text features corresponding to each preset text.
Optionally, the analyzing and counting the plurality of initial texts to obtain a plurality of preset texts includes:
performing word segmentation on the plurality of initial texts and merging words with similar meanings to obtain a plurality of merged words;
sorting the plurality of merged words according to the word frequency of each merged word to obtain a sorting result;
and determining the preset texts from the merged words according to the sorting result, wherein each preset text is a noun.
Optionally, the preset image coding model and the preset text coding model are obtained in the following manner:
obtaining the sample image-text pair data, wherein the sample image-text pair data comprises: a plurality of sample images and sample text corresponding to each sample image;
training an initial image coding model and an initial text coding model according to the sample image-text pair data to obtain the preset image coding model and the preset text coding model.
Optionally, the training the initial image coding model and the initial text coding model according to the sample image-text pair data to obtain the preset image coding model and the preset text coding model includes:
performing feature extraction processing on the plurality of sample images in the sample image-text pair data by adopting a visual coding network in the initial image coding model to obtain sample image features of the plurality of sample images;
performing feature extraction processing on sample texts corresponding to each sample image in the sample image-text pair data by adopting a text coding network in the initial text coding model to obtain sample text features of a plurality of sample texts;
calculating loss function values between sample image features of the plurality of sample images and sample text features of the plurality of sample texts, and updating parameters of the initial image coding model and parameters of the initial text coding model according to the loss function values until the loss function values meet preset conditions, so as to obtain the preset image coding model and the preset text coding model.
Optionally, the calculating a loss function value between the sample image features of the plurality of sample images and the sample text features of the plurality of sample texts includes:
Respectively carrying out normalization processing on sample image features of the plurality of sample images and sample text features of the plurality of sample texts to obtain a plurality of processed sample image features and a plurality of processed sample text features;
performing dot multiplication processing on the plurality of processed sample image features and the plurality of processed sample text features to obtain a plurality of similarity results;
and calculating the loss function value according to the plurality of similarity results by adopting a contrastive learning loss function.
Optionally, the acquiring a plurality of initial texts includes:
and taking the sample text in the sample image-text pair data as the plurality of initial texts.
Optionally, the determining, according to the image feature of the image to be annotated, a target text feature matched with the image feature from a plurality of preset text features includes:
respectively carrying out dot multiplication on the image features and each preset text feature to obtain the similarity between the image features and each preset text feature;
and determining the target text feature from the plurality of preset text features according to the similarity.
In summary, because the preset image coding model and the preset text coding model are jointly trained on sample image-text pair data, the image features the preset image coding model extracts from the image to be annotated can be matched against the plurality of preset text features the preset text coding model extracts from the plurality of preset texts. The matched target text feature is determined from the plurality of preset text features based on the image features of the image to be annotated, and the image is then labeled based on the target text corresponding to that feature. The whole process does not rely on any textual description data of the image to be annotated; labeling is realized based on the target text feature determined from the image features, so art resources are annotated directly and user experience is improved.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
The foregoing is merely a preferred embodiment of the present application and is not intended to limit it; various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall be included in its protection scope.

Claims (11)

1. An image labeling method, comprising:
extracting features of an image to be annotated using a preset image coding model to obtain image features of the image to be annotated;
determining a target text feature matching the image features from a plurality of preset text features according to the image features of the image to be annotated, wherein the plurality of preset text features are obtained by performing feature extraction on a plurality of preset texts using a preset text coding model, and the preset image coding model and the preset text coding model are jointly trained on sample image-text pair data comprising a plurality of sample images and sample texts;
and labeling the image to be annotated according to the target text corresponding to the target text feature.
2. The method according to claim 1, wherein before determining, according to the image features of the image to be annotated, the target text feature matching the image features from the plurality of preset text features, the method further comprises:
acquiring a plurality of initial texts;
performing analysis and statistics on the plurality of initial texts to obtain the plurality of preset texts;
and performing feature extraction on each of the plurality of preset texts using the preset text coding model to obtain the preset text feature corresponding to each preset text.
3. The method according to claim 2, wherein performing analysis and statistics on the plurality of initial texts to obtain the plurality of preset texts comprises:
performing word segmentation on the plurality of initial texts and merging words with similar meanings to obtain a plurality of merged words;
sorting the plurality of merged words according to the word frequency of each merged word to obtain a sorting result;
and determining the plurality of preset texts from the merged words according to the sorting result.
4. The method according to claim 1, wherein the preset image coding model and the preset text coding model are obtained by:
obtaining the sample image-text pair data, wherein the sample image-text pair data comprises a plurality of sample images and a sample text corresponding to each sample image;
and training an initial image coding model and an initial text coding model according to the sample image-text pair data to obtain the preset image coding model and the preset text coding model.
5. The method according to claim 4, wherein training the initial image coding model and the initial text coding model according to the sample image-text pair data to obtain the preset image coding model and the preset text coding model comprises:
performing feature extraction on the plurality of sample images in the sample image-text pair data using a visual coding network in the initial image coding model to obtain sample image features of the plurality of sample images;
performing feature extraction on the sample text corresponding to each sample image in the sample image-text pair data using a text coding network in the initial text coding model to obtain sample text features of a plurality of sample texts;
and calculating a loss function value between the sample image features of the plurality of sample images and the sample text features of the plurality of sample texts, and updating the parameters of the initial image coding model and of the initial text coding model according to the loss function value until the loss function value meets a preset condition, thereby obtaining the preset image coding model and the preset text coding model.
6. The method according to claim 5, wherein calculating the loss function value between the sample image features of the plurality of sample images and the sample text features of the plurality of sample texts comprises:
normalizing the sample image features of the plurality of sample images and the sample text features of the plurality of sample texts, respectively, to obtain a plurality of processed sample image features and a plurality of processed sample text features;
performing dot-product processing on the plurality of processed sample image features and the plurality of processed sample text features to obtain a plurality of similarity results;
and calculating the loss function value from the plurality of similarity results using a contrastive learning loss function.
7. The method according to claim 2, wherein acquiring the plurality of initial texts comprises:
taking the sample texts in the sample image-text pair data as the plurality of initial texts.
8. The method according to claim 1, wherein determining, according to the image features of the image to be annotated, the target text feature matching the image features from the plurality of preset text features comprises:
performing a dot product between the image features and each preset text feature to obtain the similarity between the image features and each preset text feature;
and determining the target text feature from the plurality of preset text features according to the similarities.
9. An image labeling apparatus, comprising:
a feature extraction module, used for extracting features of an image to be annotated using a preset image coding model to obtain image features of the image to be annotated;
a determining module, used for determining a target text feature matching the image features from a plurality of preset text features according to the image features of the image to be annotated, wherein the plurality of preset text features are obtained by performing feature extraction on a plurality of preset texts using a preset text coding model, and the preset image coding model and the preset text coding model are jointly trained on sample image-text pair data comprising a plurality of sample images and sample texts;
and a labeling module, used for labeling the image to be annotated according to the target text corresponding to the target text feature.
10. A processing apparatus, comprising: a memory and a processor, the memory storing a computer program executable by the processor, wherein the processor implements the image labeling method according to any one of claims 1-8 when executing the computer program.
11. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when read and executed, implements the image labeling method according to any one of claims 1-8.
CN202310085834.6A 2023-01-12 2023-01-12 Image labeling method, device, processing equipment and storage medium Pending CN116109732A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310085834.6A CN116109732A (en) 2023-01-12 2023-01-12 Image labeling method, device, processing equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310085834.6A CN116109732A (en) 2023-01-12 2023-01-12 Image labeling method, device, processing equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116109732A true CN116109732A (en) 2023-05-12

Family

ID=86261228

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310085834.6A Pending CN116109732A (en) 2023-01-12 2023-01-12 Image labeling method, device, processing equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116109732A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117076711A (en) * 2023-10-12 2023-11-17 北京汇通天下物联科技有限公司 Training method, recognition method, device and equipment for driving behavior recognition model
CN117115306A (en) * 2023-08-30 2023-11-24 苏州畅行智驾汽车科技有限公司 Image generation method and device, electronic equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination