CN113822263A - Image annotation method and device, computer equipment and storage medium - Google Patents

Image annotation method and device, computer equipment and storage medium

Info

Publication number
CN113822263A
Authority
CN
China
Prior art keywords
image
annotation
sample
model
sample image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110679659.4A
Other languages
Chinese (zh)
Inventor
宁慕楠
卢东焕
魏东
余双
马锴
郑冶枫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110679659.4A
Publication of CN113822263A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/23 Clustering techniques
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The embodiments of this application disclose an image annotation method and apparatus, a computer device, and a storage medium, and belong to the field of computer technology. The method includes: obtaining a first sample image of a source domain, a first annotation image of the first sample image, and a plurality of second sample images of a target domain; selecting a target sample image from the plurality of second sample images and obtaining a second annotation image of the target sample image; invoking an image annotation model to annotate the first sample image and the target sample image respectively, obtaining a first prediction annotation image of the first sample image and a second prediction annotation image of the target sample image; and training the image annotation model based on the difference between the first annotation image and the first prediction annotation image and the difference between the second annotation image and the second prediction annotation image. By selecting, from the target domain, the images least similar to the source domain to train the image annotation model, the model performance of the image annotation model on the target domain is improved.

Description

Image annotation method and device, computer equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to an image annotation method, an image annotation device, computer equipment and a storage medium.
Background
Image semantic segmentation divides an image into regions carrying different semantic information and can be applied to scenarios such as autonomous driving, medical image analysis, and face recognition. In these scenarios, an image annotation model is used to label the regions with different semantic information in an image, producing an annotated image. Because images from different domains may differ even within the same scenario, an image annotation model trained on source-domain images is usually migrated to a target domain so that the migrated model can perform the image segmentation task on the target domain. During this migration, how to improve the model performance of the migrated image annotation model on the target domain is a problem that needs to be solved.
Disclosure of Invention
The embodiments of this application provide an image annotation method and apparatus, a computer device, and a storage medium, which can improve the model performance of an image annotation model on a target domain. The technical solution is as follows:
in one aspect, an image annotation method is provided, and the method includes:
acquiring a first sample image of a source domain, a first annotation image of the first sample image and a plurality of second sample images of a target domain;
selecting a target sample image from the plurality of second sample images, and acquiring a second annotation image of the target sample image, wherein the target sample image is the second sample image, among the plurality of second sample images, that has the lowest similarity to the first sample image;
calling an image annotation model, and respectively annotating the first sample image and the target sample image to obtain a first prediction annotation image of the first sample image and a second prediction annotation image of the target sample image;
and training the image annotation model based on the difference between the first annotation image and the first prediction annotation image and the difference between the second annotation image and the second prediction annotation image, wherein the image annotation model is used for annotating the image of the target domain.
In one possible implementation, after the training of the image annotation model based on the difference between the first annotated image and the first predictive annotated image and the difference between the second annotated image and the second predictive annotated image, the method further comprises:
and calling the trained image annotation model, and annotating the target image of the target domain to obtain an annotated image of the target image.
In another aspect, an image annotation apparatus is provided, the apparatus comprising:
the system comprises an acquisition module, a storage module and a display module, wherein the acquisition module is used for acquiring a first sample image of a source domain, a first annotation image of the first sample image and a plurality of second sample images of a target domain;
the obtaining module is further configured to select a target sample image from the plurality of second sample images and obtain a second annotation image of the target sample image, where the target sample image is the second sample image, among the plurality of second sample images, that has the lowest similarity to the first sample image;
the labeling module is used for calling an image labeling model, labeling the first sample image and the target sample image respectively, and obtaining a first prediction labeling image of the first sample image and a second prediction labeling image of the target sample image;
a training module, configured to train the image annotation model based on a difference between the first annotation image and the first prediction annotation image and a difference between the second annotation image and the second prediction annotation image, where the image annotation model is used to annotate the image of the target domain.
In one possible implementation, the apparatus further includes:
the obtaining module is further configured to obtain first image features of the plurality of first sample images and second image features of the plurality of second sample images;
the clustering module is used for clustering the first sample images based on the acquired first image characteristics to obtain at least one first clustering center;
and the determining module is configured to, for any second sample image, determine the distances between the second image feature of that second sample image and each first cluster center, and determine the similarity between that second sample image and the first sample images based on the minimum of the determined distances, where the minimum distance corresponding to a second sample image is negatively correlated with its similarity.
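A minimal Python sketch of this selection logic, assuming pre-computed feature vectors and a hypothetical number of source cluster centers (the function and parameter names are illustrative, not taken from the patent):

```python
import numpy as np

def select_least_similar(source_feats, target_feats, n_clusters=4, n_select=1):
    """Pick the target-domain samples least similar to the source domain.

    source_feats: (N_s, D) first image features of the source samples
    target_feats: (N_t, D) second image features of the target samples
    Returns indices of the selected target sample images.
    """
    # Cluster the source features (a tiny k-means, a few iterations for brevity).
    rng = np.random.default_rng(0)
    centers = source_feats[rng.choice(len(source_feats), n_clusters, replace=False)]
    for _ in range(10):
        d = np.linalg.norm(source_feats[:, None, :] - centers[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(assign == k):
                centers[k] = source_feats[assign == k].mean(axis=0)

    # Distance from every second image feature to every first cluster center.
    d_t = np.linalg.norm(target_feats[:, None, :] - centers[None, :, :], axis=-1)
    min_dist = d_t.min(axis=1)   # smaller distance -> more similar to the source domain
    # Similarity is negatively correlated with the minimum distance, so the least
    # similar samples are those with the largest minimum distance.
    return np.argsort(-min_dist)[:n_select]
```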
In another possible implementation manner, the obtaining module includes:
the feature extraction unit is used for calling a feature extraction sub-model in the image annotation model and extracting features of each first sample image to obtain a third image feature of each first sample image;
and the fusion unit is used for fusing the third image characteristic corresponding to each first sample image with the corresponding first annotation image to obtain the first image characteristic of each first sample image.
In another possible implementation manner, the first annotation image comprises a region corresponding to at least one category; the fusion unit is configured to, for any one of the first sample images, extract a sub-labeled image corresponding to the at least one category from a first labeled image of the first sample image, where the sub-labeled image is used to indicate a pixel point belonging to the corresponding category; fusing the sub-label image corresponding to each category with the third image characteristic of the first sample image to obtain a fourth image characteristic corresponding to each category; and splicing the fourth image features corresponding to the at least one category to obtain the first image feature of the first sample image.
In another possible implementation manner, the fusion unit is configured to perform, for a sub-annotation image corresponding to any category, dot multiplication on a pixel value of each pixel point in the sub-annotation image and a corresponding feature value to obtain a product corresponding to each pixel point, where the feature value corresponding to any pixel point is a feature value in the third image feature, where the feature value is located at the same position as the pixel point; and determining the ratio between the product corresponding to each pixel point and the number of the pixel points, wherein the ratio corresponding to each pixel point forms the fourth image characteristic corresponding to the category, and the number of the pixel points is the number of the pixel points belonging to the corresponding category in the sub-annotation image.
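The per-category fusion described above amounts to a masked pooling of the third image feature. A possible sketch, assuming a channel-first feature map and binary sub-annotation images (whether the per-pixel ratios are kept as a spatial map or summed into a per-channel average is not fully specified in this passage; the sketch sums them into a masked average):

```python
import numpy as np

def fuse_masks_with_feature(feature, sub_annotations):
    """Fuse each sub-annotation image with the third image feature and splice.

    feature:         (C, H, W) third image feature of one first sample image
    sub_annotations: (K, H, W) one binary sub-annotation image per category
    Returns a (K * C,) vector used here as the first image feature.
    """
    fused = []
    for mask in sub_annotations.astype(feature.dtype):
        n_pix = mask.sum()
        if n_pix == 0:
            fused.append(np.zeros(feature.shape[0], dtype=feature.dtype))
            continue
        # Per-pixel product of the mask and the feature values, divided by the
        # number of pixels of this category; summing the ratios yields a masked
        # average per channel (the fourth image feature of this category).
        fused.append((feature * mask[None]).sum(axis=(1, 2)) / n_pix)
    return np.concatenate(fused)   # splice the per-category fourth image features
```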
In another possible implementation manner, the obtaining module includes:
the feature extraction unit is used for calling a feature extraction sub-model in the image annotation model and extracting features of each second sample image to obtain fifth image features of each second sample image;
a first obtaining unit, configured to obtain a soft annotation image of each second sample image, where the soft annotation image is obtained by annotating the second sample image with the image annotation model before training the image annotation model;
and the fusion unit is used for fusing the fifth image characteristic of each second sample image with the corresponding soft labeling image to obtain the second image characteristic of each second sample image.
In another possible implementation manner, the training module includes:
a second obtaining unit, configured to obtain at least one second clustering center corresponding to the multiple second sample images in the current iteration in a process of training the image annotation model, where the at least one second clustering center is obtained by clustering based on second image features of the second sample images;
the second obtaining unit is further configured to obtain a distance between a second image feature of each second sample image in the current iteration and the at least one second cluster center;
a training unit, configured to train the image annotation model of the current iteration based on a difference between the first annotation image and the first prediction annotation image, a difference between the second annotation image and the second prediction annotation image, and a distance corresponding to each second sample image.
In another possible implementation manner, the second obtaining unit is configured to obtain a second image feature of each second sample image in the current iteration if the current iteration is a first iteration of a training process; and clustering the plurality of second sample images based on the acquired plurality of second image characteristics to obtain the at least one second clustering center.
In another possible implementation manner, the second obtaining unit is further configured to, if the current iteration is not the first iteration of the training process, assign each second image feature to a second cluster center closest to the current iteration based on a distance between each second image feature and each second cluster center in the previous iteration; updating each second clustering center respectively based on the second image feature corresponding to each second clustering center; and determining the updated second clustering center as the second clustering center corresponding to the iteration.
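A minimal sketch of this per-iteration update of the second cluster centers, assuming the second image features are available as a matrix (names and the number of clusters are hypothetical):

```python
import numpy as np

def update_second_cluster_centers(target_feats, prev_centers=None, n_clusters=4):
    """One training-iteration update of the second cluster centers.

    target_feats: (N_t, D) second image features computed in this iteration
    prev_centers: (K, D) centers from the previous iteration, or None if this is
                  the first iteration of the training process.
    """
    if prev_centers is None:
        # First iteration: initialize by clustering the current second image
        # features (random choice of K features as centers, for brevity).
        rng = np.random.default_rng(0)
        idx = rng.choice(len(target_feats), n_clusters, replace=False)
        return target_feats[idx].copy()

    # Assign each second image feature to the closest center of the previous iteration,
    dists = np.linalg.norm(target_feats[:, None, :] - prev_centers[None, :, :], axis=-1)
    assign = dists.argmin(axis=1)
    # then update every center as the mean of the features assigned to it.
    new_centers = prev_centers.copy()
    for k in range(len(prev_centers)):
        if np.any(assign == k):
            new_centers[k] = target_feats[assign == k].mean(axis=0)
    return new_centers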
In another possible implementation manner, the obtaining module is further configured to obtain a soft annotation image of a third sample image, where the third sample image is a second sample image of the plurality of second sample images except the target sample image, and the soft annotation image is obtained by annotating the third sample image with the image annotation model before training the image annotation model;
the training unit is used for calling the image annotation model, and annotating the third sample image to obtain a third prediction annotation image of the third sample image; and training the image annotation model of the current iteration based on the difference between the first annotation image and the first prediction annotation image, the difference between the second annotation image and the second prediction annotation image, the distance corresponding to each second sample image, and the difference between the soft annotation image and the third prediction annotation image.
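One possible way to combine the loss terms listed above in a single iteration is sketched below; the weighting coefficients, the use of cross entropy for the annotation differences, and the KL divergence for the soft-annotation difference are assumptions, since the patent does not fix these choices here:

```python
import torch
import torch.nn.functional as F

def iteration_loss(pred_src, gt_src,        # first prediction vs. first annotation
                   pred_tgt, gt_tgt,        # second prediction vs. second annotation
                   pred_third, soft_third,  # third prediction vs. soft annotation (probabilities)
                   tgt_feats, centers,      # second image features and second cluster centers
                   w_dist=0.1, w_soft=0.5):
    # Supervised segmentation losses on the source image and the selected target image.
    loss_src = F.cross_entropy(pred_src, gt_src)
    loss_tgt = F.cross_entropy(pred_tgt, gt_tgt)

    # Distance of each second image feature to its nearest second cluster center.
    d = torch.cdist(tgt_feats, centers)        # (N_t, K)
    loss_dist = d.min(dim=1).values.mean()

    # Consistency with the soft annotation images of the remaining target samples.
    loss_soft = F.kl_div(F.log_softmax(pred_third, dim=1),
                         soft_third, reduction='batchmean')

    return loss_src + loss_tgt + w_dist * loss_dist + w_soft * loss_soft
```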
In another possible implementation manner, the training module is further configured to perform countermeasure training on the image annotation model and a discrimination model based on the first sample image, the first labeled image, and the plurality of second sample images, where the discrimination model is configured to discriminate whether an annotated image output by the image annotation model is an annotated image of the first sample image.
In another possible implementation manner, the training module is configured to invoke the image annotation model, and respectively annotate the first sample image and the plurality of second sample images to obtain a fourth prediction annotation image of the first sample image and a fifth prediction annotation image of each second sample image; calling the discrimination model to discriminate the fourth prediction labeling image and the fifth prediction labeling image to obtain a discrimination result; and training the image annotation model and the discrimination model based on the difference between the fourth prediction annotation image and the first annotation image and the discrimination result.
In another possible implementation manner, the labeling module is further configured to invoke the trained image labeling model, label the target image of the target domain, and obtain a labeled image of the target image.
In another aspect, a computer device is provided, which includes a processor and a memory, wherein at least one computer program is stored in the memory, and the at least one computer program is loaded by the processor and executed to implement the operations performed in the image annotation method according to the above aspect.
In another aspect, a computer-readable storage medium is provided, in which at least one computer program is stored, the at least one computer program being loaded and executed by a processor to implement the operations performed in the image annotation method according to the above aspect.
In yet another aspect, a computer program product or a computer program is provided, the computer program product or the computer program comprising computer program code, the computer program code being stored in a computer readable storage medium. The processor of the computer device reads the computer program code from the computer-readable storage medium, and executes the computer program code, so that the computer device realizes the operations performed in the image annotation method according to the above aspect.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
According to the method, apparatus, computer device, and storage medium provided by the embodiments of this application, because the images of the source domain and those of the target domain differ and the images of the target domain carry information unique to the target domain, the sample image least similar to the source-domain sample images is selected from the target domain, and the image annotation model is trained using the selected sample image, the source-domain sample image, and its annotation image. The image annotation model can thus learn the information unique to the target-domain sample images, which improves the applicability of the image annotation model in the target domain, improves its model performance on the target domain, and also improves its annotation accuracy on the target domain.
Drawings
To illustrate the technical solutions in the embodiments of this application more clearly, the drawings required for describing the embodiments are briefly introduced below. Apparently, the drawings described below show only some embodiments of this application, and a person skilled in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of a data sharing system according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a blockchain according to an embodiment of the present application;
FIG. 3 is a flow chart of a new block generation provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of an implementation environment provided by an embodiment of the present application;
FIG. 5 is a flowchart of an image annotation method provided in an embodiment of the present application;
FIG. 6 is a flowchart of an image annotation method provided in an embodiment of the present application;
FIG. 7 is a flowchart of acquiring a target sample image according to an embodiment of the present application;
FIG. 8 is a flowchart of a method for obtaining a soft annotation image and a second clustering center according to an embodiment of the present disclosure;
FIG. 9 is a flowchart of a method for training an image annotation model according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a plurality of annotation images provided by an embodiment of the present application;
FIG. 11 is a schematic structural diagram of an image annotation apparatus according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of an image annotation device according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a terminal according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application more clear, the embodiments of the present application will be further described in detail with reference to the accompanying drawings.
As used herein, the terms "first," "second," "third," "fourth," "fifth," and the like may be used herein to describe various concepts, but these concepts are not limited by these terms unless otherwise specified. These terms are only used to distinguish one concept from another. For example, a first image feature can be referred to as a second image feature, and similarly, a second image feature can be referred to as a first image feature, without departing from the scope of the present application.
As used herein, "at least one" includes one, two, or more; "a plurality" includes two or more; "each" refers to every one of the corresponding plurality; and "any" refers to any one of the plurality. For example, if the plurality of second sample images includes 3 second sample images, "each" refers to every one of the 3 second sample images, and "any" refers to any one of them, which can be the first, the second, or the third second sample image.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, intelligent transportation, and the like.
Computer Vision (CV) is a science that studies how to make machines "see". More specifically, it uses cameras and computers in place of human eyes to perform machine vision tasks such as recognition, tracking, and measurement on targets, and further performs image processing so that the processed images are more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, autonomous driving, intelligent transportation, and the like, as well as common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
Autonomous driving technology generally includes technologies such as high-precision maps, environment perception, behavior decision-making, path planning, and motion control, and it has broad application prospects.
With the research and progress of artificial intelligence technology, it is being researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart healthcare, smart customer service, internet of vehicles, and intelligent transportation.
The scheme provided by the embodiment of the application can train the image annotation model based on the artificial intelligence computer vision technology and the machine learning technology, can realize the annotation task of the image by utilizing the trained image annotation model, and can be subsequently applied to various scenes.
The image annotation method provided by the embodiment of the application can be executed by computer equipment. Optionally, the computer device is a terminal or a server. Optionally, the server is an independent physical server, or a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform, and the like. Optionally, the terminal is a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited thereto.
In some embodiments, the computer program according to the embodiments of this application may be deployed for execution on one computer device, on multiple computer devices located at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network; the computer devices distributed across multiple sites and interconnected by the communication network may constitute a blockchain system.
Referring to the data sharing system shown in fig. 1, the data sharing system 100 refers to a system for performing data sharing between nodes, the data sharing system may include a plurality of nodes 101, and the plurality of nodes 101 may refer to respective clients in the data sharing system. Each node 101 may receive input information while operating normally and maintain shared data within the data sharing system based on the received input information. In order to ensure information intercommunication in the data sharing system, information connection can exist between each node in the data sharing system, and information transmission can be carried out between the nodes through the information connection. For example, when an arbitrary node in the data sharing system receives input information, other nodes in the data sharing system acquire the input information according to a consensus algorithm, and store the input information as data in shared data, so that the data stored on all the nodes in the data sharing system are consistent.
Each node in the data sharing system has a corresponding node identifier, and each node may store the node identifiers of the other nodes in the data sharing system, so that a generated block can subsequently be broadcast to the other nodes according to their node identifiers. Each node may maintain a node identifier list, as shown in the following table, in which node names and node identifiers are stored correspondingly. The node identifier may be an IP (Internet Protocol) address or any other information that can be used to identify the node; table 1 uses the IP address only as an example.
TABLE 1
Node name Node identification
Node 1 117.114.151.174
Node 2 117.116.189.145
Node N 119.123.789.258
Each node in the data sharing system stores one identical blockchain. Referring to fig. 2, the blockchain is composed of a plurality of blocks. The starting block includes a block header and a block body: the block header stores the feature value of the input information, a version number, a timestamp, and a difficulty value, and the block body stores the input information. The next block takes the starting block as its parent block and likewise includes a block header and a block body; its block header stores the feature value of the input information of the current block, the block-header feature value of the parent block, a version number, a timestamp, a difficulty value, and so on. In this way, the block data stored in each block of the blockchain is associated with the block data stored in its parent block, which ensures the security of the input information in the blocks.
When a block in the blockchain is generated, referring to fig. 3: when the node where the blockchain is located receives input information, it verifies the input information; after verification is completed, it stores the input information in the memory pool and updates the hash tree used to record the input information. It then updates the timestamp to the time at which the input information was received, tries different random numbers, and calculates the feature value repeatedly until the calculated feature value satisfies the following formula:
SHA256(SHA256(version+prev_hash+merkle_root+ntime+nbits+x))<TARGET
where SHA256 is the feature-value algorithm used to calculate the feature value; version is the version information of the relevant block protocol in the blockchain; prev_hash is the block-header feature value of the parent block of the current block; merkle_root is the feature value of the input information; ntime is the update time of the timestamp; nbits is the current difficulty, which remains fixed for a period of time and is re-determined after that period elapses; x is a random number; and TARGET is a feature-value threshold that can be determined from nbits.
Therefore, when a random number satisfying the formula is found, the information can be stored accordingly, and the block header and block body are generated to obtain the current block. The node where the blockchain is located then sends the newly generated block to the other nodes in its data sharing system according to their node identifiers; the other nodes verify the newly generated block and, after verification is completed, add it to the blockchain they store.
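A toy Python sketch of this feature-value search, with simplified assumptions about how the block-header fields are encoded and how TARGET is chosen:

```python
import hashlib

def mine_block(version, prev_hash, merkle_root, ntime, nbits, target, max_tries=10**7):
    """Try random numbers x until SHA256(SHA256(header + x)) < TARGET."""
    header = f"{version}{prev_hash}{merkle_root}{ntime}{nbits}".encode()
    for x in range(max_tries):
        digest = hashlib.sha256(hashlib.sha256(header + str(x).encode()).digest()).hexdigest()
        if int(digest, 16) < target:
            return x, digest   # feature value satisfies the formula
    return None, None

# Usage (toy difficulty: a large TARGET so the loop finishes quickly).
x, feature_value = mine_block(version=1, prev_hash="00ab", merkle_root="3f2c",
                              ntime=1623831800, nbits=0x1f00ffff, target=2**252)
```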
Fig. 4 is a schematic diagram of an implementation environment provided by an embodiment of the present application. Referring to fig. 4, the implementation environment includes a terminal 401 and a server 402. The terminal 401 and the server 402 are connected via a wireless or wired network. Optionally, the terminal 401 has installed thereon a target application served by the server 402, and the terminal 401 can implement functions such as data transmission, message interaction, and the like through the target application. Optionally, the target application is a target application in an operating system of the terminal 401, or a target application provided by a third party. For example, the target application has a function of image annotation, that is, can annotate areas having different meanings in an image, and of course, can have other functions, such as a comment function, a shopping function, a navigation function, a game function, and the like.
In this embodiment of this application, the terminal 401 provides the server 402 with sample images of the source domain or sample images of the target domain, and the server 402 trains the image annotation model based on the obtained images. After the image annotation model is trained, the terminal 401 can complete image annotation tasks using the trained model. For example, the server 402 shares the trained image annotation model with the terminal 401, and the terminal 401 deploys the model and completes image annotation tasks based on it; or, the terminal 401 sends an image to be annotated to the server 402, the server 402 annotates the image based on the trained image annotation model, and returns the resulting annotated image to the terminal 401.
The method provided by the embodiment of the application can be used for various scenes.
For example, in an autonomous driving scenario:
due to the fact that scenes of different cities may have differences, a source domain comprises a street view image and a labeled image in the city 1, a current image labeling model is applicable to the city 1, a target domain comprises a street view image in the city 2 to be migrated, the image labeling method provided by the embodiment of the application is adopted, an image labeling model applicable to the target domain is trained on the basis of a sample image, a labeled image and a sample image of the target domain of the source domain, and then the trained image labeling model is called, so that a street view image labeling task in the city 2 can be completed, a safe street region can be identified by using the labeled image of the street view image in the city 2 in the following process, and the safe automatic driving of an automobile can be guaranteed.
As another example, in a medical scenario:
Medical images from different hospitals may differ. The medical images of a target hospital are used as the sample images of the target domain, and the already-annotated medical images and annotation images of other hospitals are used as the source domain. Using the image annotation method provided by the embodiments of this application, an image annotation model suitable for the target hospital can be trained. The trained image annotation model is then invoked to complete the annotation task for medical images in the target hospital, so that the annotated medical images can subsequently be used to identify different anatomical regions and to separate the parts requiring medical analysis from the medical images.
Fig. 5 is a flowchart of an image annotation method provided in an embodiment of the present application, which is executed by a computer device, and as shown in fig. 5, the method includes:
501. the computer device obtains a first sample image of a source domain, a first annotation image of the first sample image, and a plurality of second sample images of a target domain.
In the embodiment of the present application, the source domain and the target domain belong to different domains in the same scene, that is, there is similarity between images contained in the source domain and the target domain, but the images of the source domain and the target domain are not identical. For example, in an auto-driving scenario, the image of the source domain and the image of the target domain are both street view images, the street view image of the source domain may be a street view image in city 1, and the street view image of the target domain may be a street view image in city 2; or the street view image of the source domain is a virtual street view image, and the street view image of the target domain is a real street view image. For another example, in a medical scene, both the image of the source domain and the image of the target domain are medical images, the medical image of the source domain is a medical image belonging to hospital 1, and the medical image of the target domain is a medical image belonging to hospital 2.
The first annotation image of the first sample image is used for indicating the category to which each pixel point in the first sample image belongs, the category is used for distinguishing objects contained in the first sample image, and the pixel points belonging to different categories are used for describing different objects, for example, the first sample image is a street view image, the street view image may contain streets, bicycles, buildings and the like, that is, the streets, bicycles and buildings in the street view image all belong to different categories, and then in the first sample image, the category to which the pixel points describing streets belong is different from the category to which the pixel points describing bicycles belong. The first sample image of the source domain and the second sample image of the target domain belong to the same image type, for example, the first sample image and the second sample image both belong to a streetscape image type, or both belong to a medical image type, etc. The first sample image is an image included in the source domain, and the second sample image is an image included in the target domain.
502. The computer device selects a target sample image from the plurality of second sample images and obtains a second annotation image of the target sample image, where the target sample image is the second sample image, among the plurality of second sample images, that has the lowest similarity to the first sample image.
The second annotation image is used for indicating the category to which each pixel point in the target sample image belongs.
In this embodiment of this application, because the images of the target domain carry information that is unique to the target domain, the target sample image is selected from the plurality of second sample images to ensure that an image annotation model trained on it can learn this unique information, thereby improving the applicability of the trained image annotation model in the target domain and further improving the model performance of the image annotation model on the target domain.
503. The computer device invokes an image annotation model to annotate the first sample image and the target sample image respectively, obtaining a first prediction annotation image of the first sample image and a second prediction annotation image of the target sample image.
The image annotation model is used for annotating the image, and the first prediction annotation image and the second prediction annotation image are obtained by the image annotation model. The first prediction annotation image is used for indicating the category to which each pixel point in the first sample image predicted by the model belongs, and the second prediction annotation image is used for indicating the category to which each pixel point in the target sample image predicted by the model belongs.
504. The computer device trains an image annotation model based on the difference between the first annotation image and the first prediction annotation image and the difference between the second annotation image and the second prediction annotation image, wherein the image annotation model is used for annotating the image of the target domain.
The difference between the first annotation image and the first prediction annotation image and the difference between the second annotation image and the second prediction annotation image reflect the accuracy of the image annotation model. Training the image annotation model based on these two differences therefore improves its accuracy.
According to the method provided by the embodiments of this application, because the images of the source domain and those of the target domain differ and the images of the target domain carry information unique to the target domain, the sample image least similar to the source-domain sample images is selected from the target domain, and the image annotation model is trained using the selected sample image together with the source-domain sample image and its annotation image. The image annotation model can thus learn the information unique to the target-domain sample images, which improves the applicability of the image annotation model in the target domain, improves its model performance on the target domain, and also improves its annotation accuracy on the target domain.
On the basis of the embodiment shown in fig. 5, the image annotation model suitable for the source domain is first adversarially trained based on the first sample image and first annotation image of the source domain and the second sample images of the target domain, and the adversarially trained image annotation model is then trained further. This further training includes multiple iterations; the specific process is described in the following embodiments.
Fig. 6 is a flowchart of an image annotation method provided in an embodiment of the present application, which is executed by a computer device, and as shown in fig. 6, the method includes:
601. the computer device obtains a plurality of first sample images of a source domain, a first annotation image of each first sample image, and a plurality of second sample images of a target domain.
Optionally, the first annotated image is obtained by manual annotation, for example, the first annotated image is obtained by an expert annotating the first sample image.
Optionally, the first annotation image includes at least one region corresponding to a category, and the categories to which the plurality of pixel points located in the same region belong are the same and are all the categories corresponding to the region. In this embodiment of the application, the category of each pixel point in the first sample image is the same as the category to which the pixel point located at the same position in the corresponding first labeled image belongs, that is, the first labeled image can indicate the category to which each pixel point in the corresponding first sample image belongs. For example, the first sample image is a street view image, the category corresponding to the street view image may include streets, sidewalks, buildings, walls, fences, pedestrians, vehicles, and the like, and the first labeled image of the first sample image can indicate which category each pixel point belongs to.
Optionally, different colors are used in the first annotation image to represent the categories corresponding to the pixel points, that is, the colors in the regions corresponding to the different categories are different. For example, the first sample image is a street view image, and a red area in the first annotation image of the first sample image represents a street, a green area represents a pedestrian, and the like.
Optionally, the first labeled image is represented in a matrix form, for example, the matrix includes a plurality of numerical values, and the plurality of numerical values respectively represent categories to which the corresponding pixel points belong. For another example, a plurality of values included in the matrix are arranged according to the positions of the corresponding pixels on the first labeled image, where a value 1 is used to represent a category 1, and a value 2 is used to represent a category 2, and then the category to which each pixel in the first labeled image belongs can be determined based on the plurality of values in the matrix.
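For instance, a toy 3x4 annotation matrix of this kind could look as follows (the categories and values are purely illustrative):

```python
import numpy as np

# Each entry is the category of the pixel at the same position in the sample image,
# e.g. value 1 for category 1 (street) and value 2 for category 2 (pedestrian).
first_annotation = np.array([
    [1, 1, 1, 1],
    [1, 2, 2, 1],
    [1, 2, 2, 1],
])
print((first_annotation == 2).sum())   # number of pixels labelled as category 2 -> 4
```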
In this embodiment of this application, the plurality of first sample images of the source domain, their first annotation images, and the plurality of second sample images of the target domain can be stored in the form of a blockchain. The computer device can obtain the first sample images, the first annotation images, and the second sample images from the blockchain.
In the present embodiment, the plurality of first sample images and the plurality of first labeled images of the source domain are obtained as an example, but in another embodiment, only the first sample image of the source domain, the first labeled image of the first sample image, and the plurality of second sample images of the target domain may be obtained.
602. The computer device performs adversarial training on the image annotation model and the discrimination model based on the plurality of first sample images, the plurality of first annotation images, and the plurality of second sample images.
In this embodiment of this application, the image annotation model is used to annotate images; for example, the image annotation model is a DeepLab model (a semantic segmentation model) or a PSPNet model (a pyramid pooling model). The image annotation model has already been trained with the sample images and annotation images of the source domain, that is, the image annotation model is applicable to the source domain, and image annotation tasks can be performed on source-domain images based on it. The discrimination model is used to determine whether an annotated image output by the image annotation model is an annotated image of a first sample image, that is, to distinguish annotated images of source-domain sample images from annotated images of target-domain sample images.
In the process of adversarially training the image annotation model and the discrimination model, the image annotation model is trained to improve its accuracy so that the annotated images it outputs cannot be distinguished by the discrimination model, that is, the discrimination model cannot judge whether an annotated image output by the image annotation model is the annotated image of a first sample image; and the discrimination model is trained to improve its discrimination capability so that it can distinguish the annotated images output by the image annotation model as far as possible, that is, judge whether an annotated image output by the image annotation model is the annotated image of a first sample image. By adversarially training the image annotation model and the discrimination model in this way, the accuracy of the image annotation model can be improved, and the two models eventually tend toward an equilibrium.
Because the images of the source domain and those of the target domain differ, the current image annotation model is not applicable to the target domain. The image annotation model is therefore first trained in an adversarial manner so that the adversarially trained model remains applicable to the source domain: when the trained model is invoked to annotate a first sample image of the source domain and a second sample image of the target domain respectively, the obtained annotated images are similar, which reduces the distance between the annotated images corresponding to the target domain and those corresponding to the source domain, brings the target-domain annotated images into an ordered state, and achieves a warm-up effect for the image annotation model.
In one possible implementation, the image annotation model and the discrimination model are adversarially trained using the prediction annotation images output by the image annotation model and the discrimination results output by the discrimination model. That is, step 602 includes: invoking the image annotation model to annotate the plurality of first sample images and the plurality of second sample images respectively, obtaining a fourth prediction annotation image of each first sample image and a fifth prediction annotation image of each second sample image; invoking the discrimination model to discriminate the fourth prediction annotation images and the fifth prediction annotation images to obtain discrimination results; and training the image annotation model and the discrimination model based on the difference between each fourth prediction annotation image and the corresponding first annotation image and on the discrimination results.
The fourth prediction annotation image is obtained by labeling the first sample image by the image labeling model and is equivalent to the prediction result of the image labeling model on the first sample image, and the fifth prediction annotation image is obtained by labeling the second sample image by the image labeling model and is equivalent to the prediction result of the image labeling model on the second sample image. Optionally, the fourth prediction annotation image or the fifth prediction annotation image is used to indicate a category to which each pixel point in the corresponding sample image predicted by the model belongs.
The judgment result is used for indicating whether the predicted annotation image judged by the judgment model is the annotation image of the first sample image, namely indicating whether the fourth predicted annotation image judged by the judgment model is the annotation image of the first sample image, and also indicating whether the fifth predicted annotation image judged by the judgment model is the annotation image of the first sample image. Based on the difference between the fourth prediction labeling image and the first labeling image, the accuracy of the image labeling model can be determined, the discrimination result can also embody the accuracy of the image labeling model and the accuracy of the discrimination model, and the image labeling model and the discrimination model are trained through the difference between the fourth prediction labeling image and the first labeling image and the discrimination result so as to improve the accuracy of the image labeling model.
Optionally, when the image annotation model and the discrimination model are trained, the image annotation model and the discrimination model are trained based on the difference between the fourth prediction annotation image and the first annotation image and the difference between the discrimination result and the annotation result.
And the annotation result is used for indicating that the fourth prediction annotation image is the annotation image of the first sample image and indicating that the fifth prediction annotation image is not the annotation image of the first sample image, namely the annotation result is a real result. The accuracy of the discrimination model can be determined by determining the difference between the labeling result and the discrimination result, and the image labeling model and the discrimination model are countertrained based on the difference between the fourth prediction labeling image and the first labeling image and the difference between the discrimination result and the labeling result, so that the accuracy of the image labeling model and the accuracy of the discrimination model are improved.
Optionally, the image annotation model and the discrimination model are adversarially trained using loss values: a first loss value is determined based on the difference between the fourth prediction annotation image and the first annotation image, a second loss value is determined based on the difference between the discrimination result and the annotation result, and the image annotation model and the discrimination model are trained based on the sum of the first loss value and the second loss value. Adversarially training the image annotation model and the discrimination model with the determined loss values ensures the accuracy of the trained models.
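A possible sketch of one adversarial warm-up step is given below. It follows the common alternating scheme for output-space adversarial adaptation and is an assumption rather than the patent's exact procedure; the model objects, optimizers, and the weight w_adv are hypothetical:

```python
import torch
import torch.nn.functional as F

def warmup_step(seg_model, disc_model, opt_seg, opt_disc,
                src_img, src_label, tgt_img, w_adv=0.01):
    # --- train the image annotation (segmentation) model ---
    opt_seg.zero_grad()
    pred_src = seg_model(src_img)   # fourth prediction annotation image
    pred_tgt = seg_model(tgt_img)   # fifth prediction annotation image
    loss_seg = F.cross_entropy(pred_src, src_label)   # first loss value
    # Adversarial term: push the discriminator to judge target predictions as "source".
    d_tgt = disc_model(F.softmax(pred_tgt, dim=1))
    loss_adv = F.binary_cross_entropy_with_logits(d_tgt, torch.ones_like(d_tgt))
    (loss_seg + w_adv * loss_adv).backward()
    opt_seg.step()

    # --- train the discrimination model (second loss value) ---
    opt_disc.zero_grad()
    d_src = disc_model(F.softmax(pred_src.detach(), dim=1))
    d_tgt = disc_model(F.softmax(pred_tgt.detach(), dim=1))
    loss_disc = (F.binary_cross_entropy_with_logits(d_src, torch.ones_like(d_src)) +
                 F.binary_cross_entropy_with_logits(d_tgt, torch.zeros_like(d_tgt)))
    loss_disc.backward()
    opt_disc.step()
    return loss_seg.item(), loss_disc.item()
```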
In this embodiment of this application, the image annotation model obtained through the above adversarial training is equivalent to a warm-up network. Subsequently, on the basis of the adversarial training, the trained image annotation model (i.e., the warm-up model) can be trained again based on the plurality of first sample images of the source domain, the first annotation image of each first sample image, and the plurality of second sample images of the target domain, so as to improve the model performance of the image annotation model on the target domain.
On the basis of the adversarial training, steps 603-606 are first performed to select, from the target domain, the target sample image least similar to the first sample images of the source domain. The adversarially trained image annotation model is then trained through multiple iterations based on the plurality of first sample images of the source domain, the first annotation image of each first sample image, the target sample image of the target domain, the second annotation image of the target sample image, and the third sample images, i.e., the second sample images of the target domain other than the target sample image. This application is illustrated with a single iteration; one iteration is shown in steps 607-610.
603. The computer device obtains first image features of the plurality of first sample images and second image features of the plurality of second sample images.
The first image feature is used for describing information contained in the first sample image, and the second image feature is used for describing information contained in the second sample image. The computer equipment acquires the first image characteristics of each first sample image of the source domain and the second image characteristics of each second sample image of the target domain, so that the target sample image which is most dissimilar to the plurality of first sample images is selected from the plurality of second sample images by using the acquired image characteristics.
In one possible implementation, the process of obtaining the first image feature of the first sample image by using the first annotation image of the first sample image, that is, obtaining the first image features of the plurality of first sample images, includes the following steps 6031-6032:
6031. and calling a feature extraction sub-model in the image labeling model, and performing feature extraction on each first sample image to obtain a third image feature of each first sample image.
In the embodiment of the application, the image annotation model comprises a feature extraction submodel, and the feature extraction submodel is used for extracting the features of the image. The image features of each first sample image are extracted through the feature extraction submodel in the image labeling model so as to ensure the accuracy of the extracted third image features, and other feature extraction models do not need to be configured for feature extraction, so that resources are saved.
6032. And fusing the third image characteristic corresponding to each first sample image with the corresponding first annotation image to obtain the first image characteristic of each first sample image.
Because the first annotation image is used for indicating the category to which each pixel point in the corresponding first sample image belongs, the category to which each pixel point in the corresponding first sample image belongs is fused into the obtained first image feature by fusing the first annotation image and the corresponding third image feature, and the information contained in the first image feature is enriched, so that the accuracy of the first image feature is improved.
In a possible implementation manner, the first image feature of the first sample image is obtained by using sub-annotation images of at least one category included in the first annotation image and using a method of fusion followed by stitching, that is, the step 6032 includes the following steps 6033-6035:
6033. for any first sample image, extracting at least one sub-annotation image corresponding to the category from the first annotation image of the first sample image, wherein the sub-annotation image is used for indicating pixel points belonging to the corresponding category.
Each sub-annotation image is used for indicating the pixel points belonging to the corresponding category, that is, indicating the category to which the pixel points located at the same position as the pixel points in each sub-annotation image belong in the corresponding first sample image. Optionally, in the sub-annotation image, the pixel value of the pixel point belonging to the category corresponding to the sub-annotation image is a first numerical value, and the pixel value of the pixel point not belonging to the category corresponding to the sub-annotation image is a second numerical value. For example, the first value is 1, the second value is 0, and in any sub-labeled image, the pixel value of the pixel belonging to the corresponding category of the sub-labeled image is 1, and the pixel value of the pixel not belonging to the corresponding category of the sub-labeled image is 0.
In this embodiment of the present application, the first annotation image includes an area corresponding to at least one category, and the categories of the plurality of pixel points located in the same area are the same. By extracting the sub-labeled image corresponding to each category from the first labeled image, the sub-labeled image corresponding to any category can indicate the pixel point in the region corresponding to the category in the first labeled image.
6034. And fusing the sub-label image corresponding to each category with the third image characteristic of the first sample image to obtain a fourth image characteristic corresponding to each category.
And the fourth image characteristic is used for describing information contained in pixel points belonging to the corresponding category in the first sample image.
Optionally, the sub-annotation image is fused with the third image feature by using a pixel point as a unit, that is, the process of obtaining the fourth image feature includes: for the sub-annotation image corresponding to any category, performing point multiplication on the pixel value of each pixel point in the sub-annotation image and the corresponding characteristic value to obtain a product corresponding to each pixel point, wherein the characteristic value corresponding to any pixel point is a characteristic value which is positioned at the same position as the pixel point in the third image characteristic; and determining the ratio of the product corresponding to each pixel point to the number of the pixel points, forming the fourth image characteristics corresponding to the category by the ratio corresponding to each pixel point, wherein the number of the pixel points is the number of the pixel points belonging to the corresponding category in the sub-label image.
In the embodiment of the present application, each pixel point in the sub-annotation image corresponds to a feature value in the third image feature, and the position of the corresponding feature value in the third image feature is the same as the position of the pixel point in the sub-annotation image. By fusing, in units of pixel points, the pixel values and the feature values located at the same positions in the sub-annotation image and the third image feature, the obtained fourth image feature can highlight the features of the pixel points of the corresponding category, and the accuracy of the fourth image feature is guaranteed.
Optionally, for any first sample image and any category, the fourth image feature corresponding to the category satisfies the following relationship:

$$F_s^{(c)} = \frac{y_s^{(c)} \odot f_E(x_s)}{\left| y_s^{(c)} \right|}$$

wherein $x_s$ represents a first sample image of the source domain, $c$ represents any one of the at least one category included in the first annotation image of the first sample image $x_s$, $F_s^{(c)}$ represents the fourth image feature corresponding to the category $c$, $y_s^{(c)}$ represents the sub-annotation image corresponding to the category $c$, $\left| y_s^{(c)} \right|$ represents the number of pixel points belonging to the category $c$ in the sub-annotation image $y_s^{(c)}$, $\odot$ represents pixel-by-pixel multiplication, $f_E(x_s)$ represents the third image feature of the first sample image $x_s$, and $f_E(\cdot)$ represents the feature extraction sub-model in the image annotation model.
6035. And splicing the fourth image features corresponding to at least one category to obtain the first image features of the first sample image.
For any first sample image, after the first sample image and the fourth image features corresponding to each category are obtained, the fourth image features corresponding to at least one category are spliced, so that the features of multiple categories of the first sample image are fused into the obtained first image features, information contained in the first image features is enriched, and the accuracy of the first image features is improved.
In a possible implementation manner, the fourth image feature is a three-dimensional image feature, that is, the fourth image feature includes a plurality of two-dimensional image features, and each fourth image feature is flattened first, and then the flattened image features are spliced into the first image feature.
Flattening each fourth image feature means splicing the plurality of two-dimensional image features included in the fourth image feature to obtain one spliced two-dimensional image feature, and the spliced two-dimensional image feature is the flattened image feature of the fourth image feature. The first image feature is a two-dimensional image feature.
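For illustration only, a minimal Python (PyTorch) sketch of steps 6031-6035 is given below. It assumes that the third image feature has shape (D, H, W) and that the first annotation image stores one integer category per pixel point; for compactness, the sketch collapses each category-masked feature to its spatial average before stitching, which is one possible reading of the flattening and splicing described above.

```python
import torch

def first_image_feature(feat: torch.Tensor, label: torch.Tensor,
                        num_classes: int) -> torch.Tensor:
    """feat: (D, H, W) third image feature; label: (H, W) first annotation image."""
    parts = []
    for c in range(num_classes):
        mask = (label == c).float()                  # sub-annotation image for category c
        count = mask.sum().clamp(min=1.0)            # number of pixel points of category c
        # pixel-by-pixel product with the feature, divided by the pixel count
        fused = (feat * mask.unsqueeze(0)).sum(dim=(1, 2)) / count   # fourth image feature
        parts.append(fused)
    return torch.cat(parts)                          # stitched first image feature
```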
In one possible implementation manner, the process of obtaining the second image feature of the second sample image by using the soft labeling image of the second sample image, that is, obtaining the second image features of a plurality of second sample images, includes the following steps 6036-6038:
6036. and calling a feature extraction sub-model in the image annotation model, and performing feature extraction on each second sample image to obtain a fifth image feature of each second sample image.
This step is similar to step 6031 and will not be described further herein.
6037. And acquiring a soft annotation image of each second sample image, wherein the soft annotation image is obtained by labeling the second sample image by the image labeling model before training the image labeling model.
And the soft labeling image of any second sample image is used for indicating the category of each pixel point in the second sample image predicted by the model.
In one possible implementation, the image annotation model is used to obtain a soft annotation image, that is, the step 6037 includes: and calling an image annotation model, annotating the second sample image to obtain an annotated image of the second sample image, and determining the obtained annotated image as a soft label image of the second sample image.
The image annotation model is an image annotation model suitable for the source domain, or an image annotation model after performing countermeasure training according to the above step 602.
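For illustration only, a minimal Python (PyTorch) sketch of obtaining a soft annotation image as in step 6037 is given below; the function and variable names are assumptions made for the sketch.

```python
import torch

@torch.no_grad()
def soft_annotation(annotator, x_tgt):
    """Label a target-domain sample with the current image annotation model."""
    probs = torch.softmax(annotator(x_tgt), dim=1)   # (B, C, H, W) class probabilities
    conf, label = probs.max(dim=1)                   # per-pixel confidence and category
    return label, conf                               # soft annotation image (and confidence)
```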
6038. And fusing the fifth image characteristic of each second sample image with the corresponding soft labeling image to obtain the second image characteristic of each second sample image.
In one possible implementation manner, the sub-annotation images of at least one category included in the soft annotation image are fused and then stitched to obtain the second image feature of the second sample image, that is, the step 6038 includes: for any second sample image, extracting at least one sub-annotation image corresponding to the category from the soft annotation image of the second sample image, wherein the sub-annotation image is used for indicating pixel points belonging to the corresponding category; fusing the sub-annotation image corresponding to each category with the fifth image feature of the second sample image to obtain a sixth image feature corresponding to each category; and splicing the sixth image features corresponding to at least one category to obtain second image features of the second sample image.
Optionally, for any second sample image and any category, the sixth image feature corresponding to the category satisfies the following relationship:

$$F_t^{(c)} = \frac{\hat{y}_t^{(c)} \odot f_E(x_t)}{\left| \hat{y}_t^{(c)} \right|}$$

wherein $x_t$ represents a second sample image of the target domain, $c$ represents any one of the at least one category included in the soft annotation image of the second sample image $x_t$, $F_t^{(c)}$ represents the sixth image feature corresponding to the category $c$, $\hat{y}_t^{(c)}$ represents the sub-annotation image corresponding to the category $c$, $\left| \hat{y}_t^{(c)} \right|$ represents the number of pixel points belonging to the category $c$ in the sub-annotation image $\hat{y}_t^{(c)}$, $\odot$ represents pixel-by-pixel multiplication, $f_E(x_t)$ represents the fifth image feature of the second sample image $x_t$, and $f_E(\cdot)$ represents the feature extraction sub-model in the image annotation model.
Step 6038 is similar to step 6032 described above and will not be described further herein.
604. The computer equipment clusters the plurality of first sample images based on the obtained plurality of first image characteristics to obtain at least one first clustering center.
In this embodiment, the first sample images of the source domain may include different scenes. For example, the first sample images of the source domain are street view images, some of which are city street view images and others of which are suburban street view images. The plurality of first sample images are clustered through the first image features of the plurality of first sample images of the source domain to obtain at least one cluster, each cluster corresponding to one cluster center, that is, at least one first cluster center is obtained. Each first cluster center represents one scene of the source domain, and for any first sample image, the scene contained in the first sample image is similar to the scene represented by the corresponding first cluster center. For example, if one first cluster center is used to indicate suburban street views, the first sample images belonging to the cluster corresponding to that first cluster center are suburban street view images; if another first cluster center is used to indicate city street views, the first sample images belonging to the cluster corresponding to that first cluster center are city street view images. Each first cluster center is equivalent to an anchor point of the source domain, and the plurality of first sample images of the source domain are distributed around the corresponding anchor points.
In the embodiment of the present application, a variety of Clustering algorithms can be used in the process of Clustering the first sample image, such as a k-means algorithm, a hierarchical Clustering algorithm, or a DBSCAN (Density-Based Spatial Clustering of Applications with Noise), and the like.
In one possible implementation, the plurality of first image features are clustered to obtain at least one first cluster center of the plurality of first sample images, that is, step 604 includes: selecting k first image features from the plurality of first image features as initial first cluster centers, determining the distance between each first image feature and each initial first cluster center, assigning each first image feature to the closest initial first cluster center to obtain a plurality of clusters, determining the average value of the first image features in each cluster as an updated first cluster center, then updating the first cluster centers again in the same manner based on the k updated first cluster centers, and stopping updating the first cluster centers in response to the number of iterations reaching a threshold or the k updated first cluster centers converging, so as to obtain the final k first cluster centers.
Wherein k is any positive integer. And updating the k first clustering centers in an iterative updating mode, so that the obtained first clustering centers can represent corresponding clusters, namely the distance between each first clustering center and the first sample image in the corresponding cluster is minimum, and the accuracy of the first clustering centers is ensured.
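For illustration only, a minimal Python (PyTorch) sketch of the k-means-style procedure of step 604 is given below; the choice of k, the iteration cap, and the convergence test are illustrative assumptions.

```python
import torch

def kmeans_centers(features: torch.Tensor, k: int, iters: int = 100) -> torch.Tensor:
    """features: (N, D) first image features; returns (k, D) first cluster centers."""
    centers = features[torch.randperm(features.size(0))[:k]]      # initial centers
    for _ in range(iters):
        dists = torch.cdist(features, centers)                    # (N, k) distances
        assign = dists.argmin(dim=1)                               # nearest center per sample
        new_centers = torch.stack([
            features[assign == j].mean(dim=0) if (assign == j).any() else centers[j]
            for j in range(k)
        ])
        if torch.allclose(new_centers, centers):                   # convergence check
            break
        centers = new_centers
    return centers
```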
605. For any second sample image, the computer device determines the distance between the second image feature of the second sample image and each first cluster center, and determines the similarity between the second sample image and the first sample image based on the minimum distance in the determined distances, wherein the corresponding minimum distance of the second sample image has a negative correlation with the corresponding similarity.
The distance between the second image feature and a first cluster center is, for example, a Euclidean distance or a Mahalanobis distance. In this embodiment, for any second sample image of the target domain, the distance between the second sample image and each first cluster center can represent the degree of similarity between the second sample image and the first sample images belonging to that first cluster center. A minimum distance can then be determined from the determined distances, that is, the second sample image is most similar to the first sample images belonging to the first cluster center corresponding to the minimum distance, and the similarity between the second sample image and the first sample images of the source domain is determined based on this minimum distance. Moreover, the similarity corresponding to any second sample image and the minimum distance corresponding to the second sample image are in a negative correlation relationship, that is, the greater the minimum distance corresponding to the second sample image, the smaller the corresponding similarity; the smaller the minimum distance, the greater the corresponding similarity.
The distance between the second sample image and each first clustering center can represent the similarity between the second sample image and the first sample image belonging to each first clustering center, and the similarity is determined based on the minimum distance corresponding to each second sample image, that is, the maximum similarity between the second sample image and the plurality of first sample images of the source domain is determined, so that the accuracy of the determined similarity is ensured.
In one possible implementation, the minimum distance corresponding to any second sample image satisfies the following relationship:

$$D(x_t) = \min_{k} \left\| F_t(x_t) - A_s^{(k)} \right\|$$

wherein $D(x_t)$ represents the minimum distance corresponding to the second sample image $x_t$, $F_t(x_t)$ represents the second image feature of the second sample image $x_t$, $A_s^{(k)}$ represents the $k$-th first cluster center of the at least one first cluster center, $k$ is an integer greater than 0, and $\|\cdot\|$ represents a norm.
606. And the computer equipment selects a target sample image from the plurality of second sample images and obtains a second annotation image of the target sample image, wherein the target sample image is the second sample image with the minimum similarity with the sample image of the source domain in the plurality of second sample images.
The second annotation image is used for indicating the category to which each pixel point in the target sample image belongs, and optionally, the second annotation image is obtained by artificial annotation. After the similarity corresponding to each second sample image is determined, according to the similarity corresponding to each second sample image, a target sample image having the smallest similarity with the plurality of first sample images of the source domain is selected from the plurality of second sample images, that is, the selected target sample image is guaranteed to be least similar to the first sample image of the source domain as far as possible.
Because the image of the target domain has the unique information of the target domain, the image of the target domain which is least similar to the first sample image of the source domain is selected from the plurality of second sample images as much as possible, so that the image annotation model trained based on the image of the target domain can learn the unique information of the target domain, the applicability of the trained image annotation model in the target domain is improved, and the model expression of the image annotation model in the target domain is improved. As shown in fig. 7, a feature extraction sub-model in the image annotation model is called, a first image feature of the first sample image and a second image feature of the second sample image are respectively obtained, the first sample image is clustered based on the first image feature to obtain a first clustering center, and a target sample image having the minimum similarity with the sample image of the source domain is selected from the plurality of second sample images based on the second image feature of the second sample image and the first clustering center.
In one possible implementation, when selecting a target sample image, a target number of target sample images are selected. Here, the target number is an arbitrary number, for example, the target number is 10, or the target number is 5% of the total number of the second sample images, or the like. In the plurality of second sample images, the similarity corresponding to the target number of target sample images is smaller than the similarity corresponding to the other unselected second sample images.
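For illustration only, a minimal Python (PyTorch) sketch of steps 605-606 is given below: the minimum distance to the first cluster centers measures how similar a second sample image is to the source domain, so the samples with the largest minimum distance (smallest similarity) are selected. The selection budget n_select is an illustrative assumption.

```python
import torch

def select_target_samples(tgt_features: torch.Tensor,      # (M, D) second image features
                          src_centers: torch.Tensor,       # (k, D) first cluster centers
                          n_select: int) -> torch.Tensor:
    # D(x_t): distance from each second sample image to its nearest first cluster center
    d_min = torch.cdist(tgt_features, src_centers).min(dim=1).values
    # largest minimum distance = smallest similarity to the source domain
    return d_min.topk(n_select).indices
```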
In the embodiment of the present application, the first cluster center of the source domain is equivalent to an anchor point of the source domain, and the process of selecting the target sample image through the plurality of first cluster centers of the source domain is equivalent to the process of selecting the target sample image based on the multiple anchor points of the source domain.
After the target sample image of the target domain is obtained, the image labeling model is iteratively trained for multiple times, and the following steps 607-610 are only described by taking one iteration as an example.
607. And calling the image annotation model by the computer equipment in the process of training the image annotation model, and respectively annotating the first sample image and the target sample image to obtain a first prediction annotation image of the first sample image and a second prediction annotation image of the target sample image.
In the embodiment of the present application, the image annotation model is the image annotation model trained according to the above step 302. Calling an image annotation model, annotating a first sample image to obtain a first prediction annotation image of the first sample image, calling the image annotation model, annotating a target sample image to obtain a second prediction annotation image of the target sample image.
In one possible implementation manner, the image labeling model includes a feature extraction sub-model and a feature transformation sub-model, and the image is labeled by using the feature extraction sub-model and the feature transformation sub-model, that is, the step 607 includes: calling a feature extraction sub-model, respectively extracting features of the first sample image and the target sample image to obtain image features of the first sample image and image features of the target sample image, and calling a feature conversion sub-model to respectively convert the image features of the first sample image and the image features of the target sample image to obtain a first prediction labeling image of the first sample image and a second prediction labeling image of the target sample image.
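For illustration only, a minimal Python (PyTorch) sketch of an image annotation model composed of a feature extraction sub-model and a feature conversion sub-model is given below; the layer types and sizes are assumptions made for the sketch and do not limit the embodiment.

```python
import torch
import torch.nn as nn

class ImageAnnotationModel(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int = 64):
        super().__init__()
        # feature extraction sub-model f_E
        self.extract = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
        )
        # feature conversion sub-model: image features -> per-pixel category scores
        self.convert = nn.Conv2d(feat_dim, num_classes, 1)

    def forward(self, x):
        feat = self.extract(x)          # image features
        logits = self.convert(feat)     # per-pixel annotation scores
        return logits, feat

model = ImageAnnotationModel(num_classes=19)
logits, feat = model(torch.randn(1, 3, 64, 128))
pred_annotation = logits.argmax(dim=1)  # prediction annotation image (category per pixel)
```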
608. And the computer equipment acquires at least one second clustering center corresponding to the plurality of second sample images in the iteration, wherein the at least one second clustering center is obtained by clustering based on second image characteristics of the second sample images.
In the embodiment of the present application, in the process of training the image annotation model through multiple iterations, the image annotation model is updated every iteration, and therefore, the image annotation model may be different in different iterations. The second clustering center is obtained by clustering based on the second image feature of the second sample image in the current iteration, and the second image feature is obtained based on the image annotation model in the current iteration, so that the second image feature and the second clustering center in each iteration may have changes, and the second image feature of the second sample image in the current iteration and at least one second clustering center in the current iteration need to be determined in each iteration.
In a possible implementation manner, when the image annotation model is trained according to multiple iterations, the first iteration and the other iterations acquire the second clustering center in different manners, that is, the process of acquiring at least one second clustering center includes the following two manners:
the first mode is as follows: if the iteration is the first iteration of the training process, acquiring a second image characteristic of each second sample image in the iteration; and clustering the plurality of second sample images based on the plurality of acquired second image characteristics to obtain at least one second clustering center.
This step is similar to step 604 and will not be described herein.
The second mode is as follows: if the iteration is not the first iteration of the training process, distributing each second image feature to the second clustering center closest to the current iteration based on the distance between each second image feature and each second clustering center in the previous iteration; updating each second clustering center respectively based on the second image characteristics corresponding to each second clustering center; and determining the updated second clustering center as the second clustering center corresponding to the iteration.
According to the second image characteristics of each second sample image in the last iteration and at least one second clustering center, the second clustering centers are updated, the updated second clustering centers are used as the second clustering centers of the current iteration, the process does not need to utilize all second image characteristics for re-clustering, the calculated amount is reduced, the resources are saved, the efficiency of obtaining the second clustering centers is improved, and the accuracy of the second clustering centers is also ensured.
In one possible implementation, the process of updating the second cluster center can satisfy the following relationship:

$$A_t^{(v)} = \alpha\, \tilde{A}_t^{(v)} + (1 - \alpha)\, F_t(x_t)$$

wherein $v$ represents the $v$-th second cluster center among the at least one second cluster center and is a positive integer, $A_t^{(v)}$ represents the updated $v$-th second cluster center, $\alpha$ represents the adjustment coefficient and is a constant greater than 0 and less than 1, $\tilde{A}_t^{(v)}$ represents the $v$-th second cluster center in the last iteration, and $F_t(x_t)$ represents the second image feature of a second sample image $x_t$ assigned to that second cluster center.
609. The computer device obtains a distance between a second image feature of each second sample image in the current iteration and at least one second cluster center.
In the embodiment of the present application, when the image annotation model is trained, the second image feature of each second sample image needs to be determined again in each iteration, and the determination method is the same as that in step 603. When the second image feature of each second sample image is obtained, for a target sample image among the plurality of second sample images, the second image feature of the target sample image is obtained by using the second annotation image of the target sample image. Then, based on each second image feature and each second cluster center in the current iteration, the distance between each second image feature and the at least one second cluster center is determined, that is, for any second image feature, the distance between the second image feature and each second cluster center is determined. The manner of determining the distance between a second image feature and a second cluster center is the same as the manner of determining the distance in step 605, and is not described herein again.
610. And the computer equipment trains the image annotation model of the iteration based on the difference between the first annotation image and the first prediction annotation image, the difference between the second annotation image and the second prediction annotation image and the distance corresponding to each second sample image.
The image annotation model of the iteration is trained based on the difference between the first annotation image and the first prediction annotation image, the difference between the second annotation image and the second prediction annotation image, and the distance corresponding to each second sample image, which is equivalent to adjusting the parameters of the image annotation model based on the difference between the first annotation image and the first prediction annotation image, the difference between the second annotation image and the second prediction annotation image, and the distance corresponding to each second sample image.
In the embodiment of the application, the difference between the first annotation image and the first prediction annotation image and the difference between the second annotation image and the second prediction annotation image can both show the accuracy of the image annotation model, and the distance corresponding to each second sample image is the distance between each second sample image and each second clustering center, that is, the relationship between the second sample image and each second clustering center is considered, so that the trained image annotation model can learn the image distribution condition of the target domain, and the accuracy and the applicability of the image annotation model on the target domain are improved.
In the embodiment of the present application, in the process of iteratively training the image annotation model according to steps 607-610, each iteration utilizes the second clustering center of the target domain, where the second clustering center is equivalent to an anchor point of the target domain, and a plurality of second sample images of the target domain are distributed on corresponding anchor points, and the process of iteratively training the image annotation model is implemented through the plurality of second clustering centers of the target domain, which is equivalent to iteratively training the image annotation model based on multiple anchor points of the target domain, that is, the image annotation model is trained by using a domain adaptive semantic segmentation method based on multiple anchor points.
In one possible implementation, the image annotation model is trained based on the determined loss values, that is, the step 610 includes: and determining a third loss value based on the difference between the first labeled image and the first prediction labeled image and the difference between the second labeled image and the second prediction labeled image, determining a fourth loss value based on the corresponding distance of each second sample image, and training the image labeling model of the iteration based on the sum of the third loss value and the fourth loss value.
Wherein the fourth loss value is obtained based on a distance loss function, the second cluster center is equivalent to an anchor point of the target domain, and the distance loss function is equivalent to a soft alignment loss function based on multiple anchor points.
And determining a loss value based on the difference between the first annotation image and the first prediction annotation image, the difference between the second annotation image and the second prediction annotation image and the corresponding distance of each second sample image, so that the image annotation model can be trained according to the loss value, and the accuracy of the trained image annotation model is ensured.
Optionally, the third loss value satisfies the following relationship:

$$L_{\text{seg}} = L_{CE}(x_s, y_s) + L_{CE}(\hat{x}_t, \hat{y}_t)$$

wherein $L_{\text{seg}}$ represents the third loss value, $L_{CE}(x_s, y_s)$ represents a loss value representing the difference between the first annotation image and the first prediction annotation image, $x_s$ represents a first sample image, $y_s$ represents the first annotation image, $L_{CE}(\hat{x}_t, \hat{y}_t)$ represents a loss value representing the difference between the second annotation image and the second prediction annotation image, $\hat{x}_t$ represents a target sample image, and $\hat{y}_t$ represents the second annotation image of the target sample image.
Optionally, the fourth loss value satisfies the following relationship:

$$L_{\text{align}} = \min_{1 \le v \le V} \left\| F_t(x_t) - A_t^{(v)} \right\|$$

wherein $L_{\text{align}}$ represents the fourth loss value, $V$ represents the number of second cluster centers, $F_t(x_t)$ represents the second image feature of the second sample image $x_t$ in the current iteration, $A_t^{(v)}$ represents the $v$-th second cluster center in the current iteration, and $\|\cdot\|$ represents a norm.
In one possible implementation, the image annotation model is trained by using the first sample images of the source domain, the target sample images of the target domain, and the third sample images of the target domain other than the target sample images, that is, the step 610 includes the following steps 6101-6103:
6101. and acquiring a soft labeling image of a third sample image, wherein the third sample image is a second sample image except the target sample image in the plurality of second sample images, and the soft labeling image is obtained by labeling the third sample image by the image labeling model before training the image labeling model.
The soft labeling image of any second sample image is used for indicating the category to which each pixel point in the second sample image belongs. In the embodiment of the present application, the plurality of second sample images of the target domain are divided into the target sample image and the third sample image, the target sample image has the second annotation image, and the third sample image does not have an annotation image, so that the image annotation model is called to annotate the third sample image, the annotation image of the third sample image is obtained, and the annotation image of the third sample image is determined as the soft annotation image of the third sample image.
In one possible implementation, before step 607, the method further comprises: calling an image annotation model, respectively annotating the first sample image and the target sample image to obtain a first prediction annotation image of the first sample image and a second prediction annotation image of the target sample image, and training the image annotation model based on the difference between the first annotation image and the first prediction annotation image and the difference between the second annotation image and the second prediction annotation image.
Because the acquired first sample image and the acquired target sample image both correspond to the annotation image, the image annotation model is trained through the acquired first sample image, the acquired first annotation image, the acquired target sample image and the acquired second annotation image, so that the model representation of the image annotation model on the target domain is improved, namely the applicability of the image annotation model on the target domain is improved.
In the embodiment of the application, after the image annotation model suitable for the source domain is obtained, the image annotation model is first trained according to the step 602, and then steps 603-610 are performed to continue training the image annotation model. By training the image annotation model in this manner, the model representation of the image annotation model on the target domain is gradually improved, and the applicability of the image annotation model on the target domain is improved.
In this embodiment of the application, the image annotation model for obtaining the soft annotation image is an image annotation model obtained by training the image annotation model according to the step 602, and then training the image annotation model based on the obtained first sample image, the obtained first annotation image, the obtained target sample image, and the obtained second annotation image. As shown in fig. 8, an image annotation model is trained according to the above step 602, then the image annotation model is trained based on the first sample image, the first annotation image, the target sample image and the second annotation image, the trained image annotation model is used to annotate the third sample image to obtain an annotation image of the third sample image, the obtained annotation image of the third sample image is determined as a soft annotation image of the third sample image, a sub-model is extracted based on the features in the image annotation model to obtain the image features of the target sample image and the image features of the third sample image, that is, the second image features of the plurality of second sample images are obtained, and clustering is performed based on the second image features of the second sample images to obtain at least one second cluster center.
6102. And calling an image annotation model, and annotating the third sample image to obtain a third prediction annotation image of the third sample image.
6103. And training the image annotation model of the iteration based on the difference between the first annotation image and the first prediction annotation image, the difference between the second annotation image and the second prediction annotation image, the corresponding distance of each second sample image and the difference between the soft annotation image and the third prediction annotation image.
The trained image labeling model is used for labeling the image of the target domain.
When the image annotation model is trained based only on the distance corresponding to each second sample image, the image features output by the image annotation model are drawn close to the second cluster centers, so the unique information of the second sample images of the target domain may be lost. Therefore, the image annotation model is also trained by using the difference between the soft annotation image and the third prediction annotation image, which sharpens the image features output by the image annotation model and highlights the unique information of the sample images of the target domain, thereby improving the applicability of the image annotation model in the target domain and improving the model representation of the image annotation model in the target domain.
In one possible implementation, the image annotation model is trained with the determined loss values, that is, 6103 includes: determining a third loss value based on the difference between the first labeled image and the first prediction labeled image and the difference between the second labeled image and the second prediction labeled image, determining a fourth loss value based on the corresponding distance of each second sample image, determining a fifth loss value based on the difference between the soft labeled image and the third prediction labeled image, and training the image labeling model of the iteration based on the sum of the third loss value, the fourth loss value and the fifth loss value.
And determining a loss value based on the difference between the first annotation image and the first prediction annotation image, the difference between the second annotation image and the second prediction annotation image, the distance corresponding to each second sample image and the difference between the soft annotation image and the third prediction annotation image, so that the image annotation model can be trained according to the loss value, and the accuracy of the trained image annotation model is ensured.
Optionally, the fifth loss value is derived based on a difference between the soft annotated image and the third predictive annotated image using a pseudo-annotated loss function.
Optionally, the fifth loss value satisfies the following relationship:

$$L_{\text{pseudo}} = L_{CE}(x_t', \tilde{y}_t')$$

wherein $L_{\text{pseudo}}$ represents the fifth loss value, $L_{CE}(\cdot)$ represents the loss function, $x_t'$ represents a third sample image of the target domain, and $\tilde{y}_t'$ represents the soft annotation image of the third sample image $x_t'$.
Optionally, the sum of the third loss value, the fourth loss value, and the fifth loss value satisfies the following relationship:

$$L_{\text{semi}} = L_{\text{seg}} + L_{\text{align}} + L_{\text{pseudo}}$$

wherein $L_{\text{semi}}$ represents the sum of the third loss value, the fourth loss value and the fifth loss value, $L_{\text{seg}}$ represents the third loss value, $L_{\text{align}}$ represents the fourth loss value, and $L_{\text{pseudo}}$ represents the fifth loss value.
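For illustration only, a minimal Python (PyTorch) sketch of combining the three loss values of step 6103 is given below. The equal weighting of the three terms follows the sum given above; the tensor shapes, variable names and the averaging over samples in the alignment term are assumptions made for the sketch.

```python
import torch
import torch.nn.functional as F

def semi_loss(pred_src, y_src,            # source logits and first annotation image
              pred_sel, y_sel,            # target sample logits and second annotation image
              pred_rest, y_soft,          # third-sample logits and soft annotation categories
              tgt_features, centers):     # (M, D) second image features, (V, D) second centers
    # third loss value: supervised terms on source images and labeled target samples
    l_seg = F.cross_entropy(pred_src, y_src) + F.cross_entropy(pred_sel, y_sel)
    # fourth loss value: distance of each second image feature to its nearest second center
    l_align = torch.cdist(tgt_features, centers).min(dim=1).values.mean()
    # fifth loss value: pseudo-label term on the remaining target images
    l_pseudo = F.cross_entropy(pred_rest, y_soft)
    return l_seg + l_align + l_pseudo
```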
As shown in fig. 9, the image annotation model includes a feature extraction sub-model and a feature transformation sub-model, and the image annotation model is trained based on the first sample image, the first annotation image, the target sample image, the second annotation image, the third sample image, the soft annotation image of the third sample image, and the distance corresponding to each second sample.
In addition, the soft annotation image can include the category to which each pixel point in the corresponding third sample image belongs and the corresponding probability. When the image annotation model is trained based on the difference between the soft annotation image and the third prediction annotation image, only the pixel points whose probability is greater than a threshold are selected from the soft annotation image, and the image annotation model is trained based on the difference, at these pixel points, between the categories indicated by the soft annotation image and the third prediction annotation image. The threshold is an arbitrary value.
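For illustration only, a minimal Python (PyTorch) sketch of this thresholding is given below; the threshold value 0.9 and the ignore index 255 are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def thresholded_pseudo_loss(pred_rest, y_soft, conf, threshold=0.9):
    """pred_rest: (B, C, H, W) logits; y_soft: (B, H, W) categories; conf: (B, H, W)."""
    target = y_soft.clone()
    target[conf <= threshold] = 255                  # drop low-confidence pixel points
    return F.cross_entropy(pred_rest, target, ignore_index=255)
```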
In addition, in the process of training the image annotation model, the annotated image obtained by annotating the same sample image based on the image annotation model in different iterations may have differences, and when the image annotation model is trained based on the difference between the soft label image and the third predicted annotated image, the uncertain region of the third predicted annotated image is determined, and the image annotation model is trained by using the differences between the regions outside the uncertain region and the corresponding regions in the soft label image. The uncertain region is used for representing a difference region between marked images obtained by marking the same sample image for multiple times by the image marking model.
It should be noted that, in the embodiment of the present application, the image annotation model is trained based on the difference between the first annotation image and the first prediction annotation image, the difference between the second annotation image and the second prediction annotation image, and the distance corresponding to each second sample image; however, in another embodiment, steps 608-610 do not need to be executed, and the image annotation model can be trained in other manners based on the difference between the first annotation image and the first prediction annotation image and the difference between the second annotation image and the second prediction annotation image.
In a possible implementation manner, in the process of performing iterative training on the image annotation model according to steps 607-610, if the number of iterations is greater than a number threshold, the training of the image annotation model is stopped; or, if the sum of the third loss value, the fourth loss value, and the fifth loss value is less than a loss threshold, the training of the image annotation model is stopped.
In the embodiment of the present application, the process of training the image annotation model can be implemented based on PyTorch (an open-source machine learning framework). A loss value of the image annotation model is determined according to a plurality of loss functions, so that the image annotation model is trained based on the determined loss value. In the process of iteratively training the image annotation model, the image annotation model is updated based on a stochastic gradient descent (SGD) method, the initial learning rate of the SGD is set to 2.5×10⁻⁴, and a poly decay strategy with a power of 0.9 is used to gradually reduce the learning rate.
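For illustration only, a minimal Python (PyTorch) sketch of this optimization setup is given below; the momentum, weight decay and total number of iterations are illustrative assumptions, while the initial learning rate of 2.5×10⁻⁴ and the poly decay power of 0.9 follow the description above.

```python
import torch

# stand-in parameters; in practice these are the image annotation model parameters
model = torch.nn.Conv2d(3, 19, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=2.5e-4,
                            momentum=0.9, weight_decay=5e-4)

def poly_lr(base_lr: float, step: int, max_steps: int, power: float = 0.9) -> float:
    """Poly decay: lr = base_lr * (1 - step / max_steps) ** power."""
    return base_lr * (1.0 - step / max_steps) ** power

max_steps = 40000
for step in range(max_steps):
    for group in optimizer.param_groups:
        group["lr"] = poly_lr(2.5e-4, step, max_steps)
    # ... one training iteration (forward pass, loss, backward, optimizer.step()) ...
```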
611. And calling the trained image annotation model by the computer equipment, and annotating the target image of the target domain to obtain an annotated image of the target image.
This step is similar to step 607, and will not be described herein again.
In the embodiment of the application, after the image annotation model is trained, the trained image annotation model can be deployed in the block chain, and any computer device in the block chain can call the image annotation model to annotate the target image in the target domain.
According to the method provided by the embodiment of the application, because the images of the source domain and the target domain are different, and the image of the target domain has the unique information of the target domain, the sample image which is least similar to the sample image of the source domain is selected from the target domain, and the image annotation model is trained by using the selected sample image, the sample image of the source domain and the annotation image, so that the image annotation model can learn the unique information of the sample image of the target domain, the applicability of the image annotation model in the target domain is improved, the model expression of the image annotation model in the target domain is improved, and the annotation accuracy of the image annotation model in the target domain is also improved.
Moreover, each second sample image of the target domain does not need to be labeled, so that the cost for obtaining the labeled image is reduced, and the efficiency for training the image labeling model is improved.
Compared with the image annotation models in the related art, as shown in fig. 10, the image annotation model obtained by the image annotation method provided in the embodiment of the present application and the image annotation model 1 and image annotation model 2 provided by the related art are used to annotate different original images, and the obtained annotated images are shown in fig. 10, where the last column of images in fig. 10 are manually annotated images and are equivalent to real annotated images. Compared with the annotated images corresponding to the related-art image annotation models, the image annotation model provided by the present application has better model performance in the target domain, and the annotated images obtained through the image annotation model provided by the present application are closer to the real annotated images, that is, the accuracy of the annotated images obtained based on the image annotation model provided by the present application is higher.
Taking two semantic segmentation scenarios as examples, namely a game virtual scene → real scene setting and a synthesized virtual scene → real scene setting, the images in the virtual scene are taken as the images of the source domain, and the images in the real scene are taken as the images of the target domain. The image annotation model provided by the present application and the image annotation models 1, 2, 3 and 4 provided by the related art are used to annotate images in the different semantic segmentation scenarios. Table 2 takes the game virtual scene → real scene setting as an example, and it can be seen from Table 2 that the annotation accuracy of the image annotation model provided by the present application is higher; Table 3 takes the synthesized virtual scene → real scene setting as an example, and it can be seen from Table 3 that the annotation accuracy of the image annotation model provided by the present application is higher.
TABLE 2 (annotation accuracy, game virtual scene → real scene; table content is an image in the original document and is not reproduced here)

TABLE 3 (annotation accuracy, synthesized virtual scene → real scene; table content is an image in the original document and is not reproduced here)
Fig. 11 is a schematic structural diagram of an image annotation apparatus according to an embodiment of the present application, and as shown in fig. 11, the apparatus includes:
an obtaining module 1101, configured to obtain a first sample image of a source domain, a first annotation image of the first sample image, and a plurality of second sample images of a target domain;
the obtaining module 1101 is further configured to select a target sample image from the plurality of second sample images, and obtain a second annotation image of the target sample image, where the target sample image is a second sample image having the smallest similarity with the sample image of the source domain among the plurality of second sample images;
the annotation module 1102 is configured to invoke an image annotation model, and respectively annotate the first sample image and the target sample image to obtain a first prediction annotation image of the first sample image and a second prediction annotation image of the target sample image;
a training module 1103, configured to train an image annotation model based on a difference between the first annotation image and the first prediction annotation image, and a difference between the second annotation image and the second prediction annotation image.
In one possible implementation, as shown in fig. 12, the apparatus further includes:
an obtaining module 1101, configured to obtain first image features of the plurality of first sample images and second image features of the plurality of second sample images;
the clustering module 1104 is configured to cluster the plurality of first sample images based on the obtained plurality of first image features to obtain at least one first clustering center;
a determining module 1105, configured to determine, for any second sample image, distances between second image features of the second sample image and each first cluster center, and determine, based on a minimum distance in the determined distances, a similarity between the second sample image and a sample image of the source domain, where the minimum distance corresponding to the second sample image is in a negative correlation with the corresponding similarity.
In another possible implementation, as shown in fig. 12, the obtaining module 1101 includes:
the feature extraction unit 1111 is configured to invoke a feature extraction sub-model in the image annotation model, perform feature extraction on each first sample image, and obtain a third image feature of each first sample image;
a fusion unit 1112, configured to fuse the third image feature corresponding to each first sample image with the corresponding first annotation image, so as to obtain the first image feature of each first sample image.
In another possible implementation manner, the first annotation image comprises a region corresponding to at least one category; a fusion unit 1112, configured to, for any first sample image, extract a sub-labeled image corresponding to at least one category from the first labeled image of the first sample image, where the sub-labeled image is used to indicate a pixel point belonging to the corresponding category; fusing the sub-label image corresponding to each category with the third image characteristic of the first sample image to obtain a fourth image characteristic corresponding to each category; and splicing the fourth image features corresponding to at least one category to obtain the first image features of the first sample image.
In another possible implementation manner, the fusion unit 1112 is configured to perform dot multiplication on a pixel value of each pixel point in the sub-labeled image and a corresponding feature value to obtain a product corresponding to each pixel point for the sub-labeled image corresponding to any category, where the feature value corresponding to any pixel point is a feature value located at the same position as the pixel point in the third image feature; and determining the ratio of the product corresponding to each pixel point to the number of the pixel points, forming the fourth image characteristics corresponding to the category by the ratio corresponding to each pixel point, wherein the number of the pixel points is the number of the pixel points belonging to the corresponding category in the sub-label image.
In another possible implementation, as shown in fig. 12, the obtaining module 1101 includes:
the feature extraction unit 1111 is configured to invoke a feature extraction sub-model in the image annotation model, perform feature extraction on each second sample image, and obtain a fifth image feature of each second sample image;
a first obtaining unit 1113, configured to obtain a soft annotation image of each second sample image, where the soft annotation image is obtained by annotating the second sample image with an image annotation model before training the image annotation model;
a fusing unit 1112, configured to fuse the fifth image feature of each second sample image with the corresponding soft label image, so as to obtain a second image feature of each second sample image.
In another possible implementation, as shown in fig. 12, the training module 1103 includes:
a second obtaining unit 1131, configured to obtain, in a process of training an image annotation model, at least one second clustering center corresponding to a plurality of second sample images in the current iteration, where the at least one second clustering center is obtained by clustering based on second image features of the second sample images;
the second obtaining unit 1131 is further configured to obtain a distance between a second image feature of each second sample image in the current iteration and at least one second clustering center;
the training unit 1132 is configured to train the image annotation model of the current iteration based on a difference between the first annotation image and the first prediction annotation image, a difference between the second annotation image and the second prediction annotation image, and a distance corresponding to each second sample image.
In another possible implementation manner, the second obtaining unit 1131 is configured to obtain a second image feature of each second sample image in the current iteration if the current iteration is the first iteration of the training process; and clustering the plurality of second sample images based on the plurality of acquired second image characteristics to obtain at least one second clustering center.
In another possible implementation manner, the second obtaining unit 1131 is further configured to, if the current iteration is not the first iteration of the training process, assign each second image feature to the second cluster center closest to the current iteration based on the distance between each second image feature and each second cluster center in the previous iteration; updating each second clustering center respectively based on the second image characteristics corresponding to each second clustering center; and determining the updated second clustering center as the second clustering center corresponding to the iteration.
In another possible implementation manner, the obtaining module 1101 is further configured to obtain a soft annotation image of a third sample image, where the third sample image is a second sample image except for the target sample image in the plurality of second sample images, and the soft annotation image is obtained by annotating the third sample image with the image annotation model before training the image annotation model;
the training unit 1132 is configured to invoke an image annotation model, and perform annotation on the third sample image to obtain a third prediction annotation image of the third sample image; and training the image annotation model of the iteration based on the difference between the first annotation image and the first prediction annotation image, the difference between the second annotation image and the second prediction annotation image, the corresponding distance of each second sample image and the difference between the soft annotation image and the third prediction annotation image.
In another possible implementation manner, the training module 1103 is further configured to perform countermeasure training on an image annotation model and a discrimination model based on the first sample image, the first annotation image, and the plurality of second sample images, where the discrimination model is used to discriminate whether an annotation image output by the image annotation model is an annotation image of the first sample image.
In another possible implementation manner, the training module 1103 is configured to invoke an image annotation model, and respectively annotate the first sample image and the plurality of second sample images to obtain a fourth prediction annotation image of the first sample image and a fifth prediction annotation image of each second sample image; calling a discrimination model to discriminate the fourth prediction labeling image and the fifth prediction labeling image to obtain a discrimination result; and training the image annotation model and the discrimination model based on the difference between the fourth prediction annotation image and the first annotation image and the discrimination result.
In another possible implementation manner, the labeling module 1102 is further configured to invoke the trained image labeling model, label the target image of the target domain, and obtain a labeled image of the target image.
It should be noted that the image annotation device provided in the above embodiment is illustrated only by the division of the above functional modules; in practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the computer device may be divided into different functional modules to complete all or part of the functions described above. In addition, the image annotation device and the image annotation method provided by the above embodiments belong to the same concept; the specific implementation process is described in detail in the method embodiments and is not repeated here.
The embodiment of the present application further provides a computer device, where the computer device includes a processor and a memory, and the memory stores at least one computer program, and the at least one computer program is loaded and executed by the processor to implement the operations performed in the image annotation method of the foregoing embodiment.
Optionally, the computer device is provided as a terminal. Fig. 13 shows a block diagram of a terminal 1300 according to an exemplary embodiment of the present application. The terminal 1300 may be a portable mobile terminal such as a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. Terminal 1300 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
Terminal 1300 includes: a processor 1301 and a memory 1302.
Processor 1301 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1301 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1301 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1301 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, processor 1301 may further include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
Memory 1302 may include one or more computer-readable storage media, which may be non-transitory. The memory 1302 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 1302 is used to store at least one computer program for execution by the processor 1301 to implement the image annotation methods provided by the method embodiments herein.
In some embodiments, terminal 1300 may further optionally include: a peripheral interface 1303 and at least one peripheral. Processor 1301, memory 1302, and peripheral interface 1303 may be connected by a bus or signal line. Each peripheral device may be connected to the peripheral device interface 1303 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1304, display screen 1305, camera assembly 1306, audio circuitry 1307, positioning assembly 1308, and power supply 1309.
Peripheral interface 1303 may be used to connect at least one peripheral associated with I/O (Input/Output) to processor 1301 and memory 1302. In some embodiments, processor 1301, memory 1302, and peripheral interface 1303 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1301, the memory 1302, and the peripheral device interface 1303 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 1304 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 1304 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1304 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1304 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 1304 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1304 may also include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 1305 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1305 is a touch display screen, the display screen 1305 also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 1301 as a control signal for processing. At this point, the display 1305 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display 1305, disposed on the front panel of terminal 1300; in other embodiments, there may be at least two displays 1305, disposed on different surfaces of terminal 1300 or in a folded design; in still other embodiments, display 1305 may be a flexible display disposed on a curved surface or a folded surface of terminal 1300. The display 1305 may even be arranged in an irregular, non-rectangular shape, i.e., a shaped screen. The display 1305 may be made of materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
The camera assembly 1306 is used to capture images or video. Optionally, camera assembly 1306 includes a front camera and a rear camera. The front camera is arranged on the front panel of the terminal, and the rear camera is arranged on the back of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 1306 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuit 1307 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 1301 for processing, or inputting the electric signals to the radio frequency circuit 1304 for realizing voice communication. For stereo capture or noise reduction purposes, multiple microphones may be provided, each at a different location of terminal 1300. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 1301 or the radio frequency circuitry 1304 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuitry 1307 may also include a headphone jack.
The positioning component 1308 is used for positioning the current geographic location of the terminal 1300 to implement navigation or LBS (Location Based Service). The positioning component 1308 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the Galileo system of the European Union.
Power supply 1309 is used to supply power to the various components in terminal 1300. The power supply 1309 may be an alternating current supply, a direct current supply, a disposable battery, or a rechargeable battery. When the power supply 1309 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is a battery charged through a wired line, and a wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast-charging technology.
In some embodiments, terminal 1300 also includes one or more sensors 1310. The one or more sensors 1310 include, but are not limited to: acceleration sensor 1311, gyro sensor 1312, pressure sensor 1313, fingerprint sensor 1314, optical sensor 1315, and proximity sensor 1316.
The acceleration sensor 1311 can detect the magnitude of acceleration on three coordinate axes of the coordinate system established with the terminal 1300. For example, the acceleration sensor 1311 may be used to detect components of gravitational acceleration in three coordinate axes. The processor 1301 may control the display screen 1305 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1311. The acceleration sensor 1311 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 1312 may detect the body direction and the rotation angle of the terminal 1300, and the gyro sensor 1312 may cooperate with the acceleration sensor 1311 to acquire a 3D motion of the user with respect to the terminal 1300. Processor 1301, based on the data collected by gyroscope sensor 1312, may perform the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensor 1313 may be disposed on a side bezel of terminal 1300 and/or underlying display 1305. When the pressure sensor 1313 is disposed on the side frame of the terminal 1300, a user's holding signal to the terminal 1300 may be detected, and the processor 1301 performs left-right hand recognition or shortcut operation according to the holding signal acquired by the pressure sensor 1313. When the pressure sensor 1313 is disposed at a lower layer of the display screen 1305, the processor 1301 controls an operability control on the UI interface according to a pressure operation of the user on the display screen 1305. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 1314 is used for collecting the fingerprint of the user, and the processor 1301 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 1314, or the fingerprint sensor 1314 identifies the identity of the user according to the collected fingerprint. When the identity of the user is identified as a trusted identity, the processor 1301 authorizes the user to perform relevant sensitive operations, including unlocking a screen, viewing encrypted information, downloading software, paying, changing settings, and the like. The fingerprint sensor 1314 may be disposed on the front, back, or side of the terminal 1300. When a physical button or vendor Logo is provided on the terminal 1300, the fingerprint sensor 1314 may be integrated with the physical button or vendor Logo.
The optical sensor 1315 is used to collect the ambient light intensity. In one embodiment, the processor 1301 may control the display brightness of the display screen 1305 according to the ambient light intensity collected by the optical sensor 1315. Specifically, when the ambient light intensity is high, the display brightness of the display screen 1305 is increased; when the ambient light intensity is low, the display brightness of the display screen 1305 is reduced. In another embodiment, the processor 1301 can also dynamically adjust the shooting parameters of the camera assembly 1306 according to the ambient light intensity collected by the optical sensor 1315.
A proximity sensor 1316, also known as a distance sensor, is disposed on the front panel of terminal 1300. Proximity sensor 1316 is used to collect the distance between the user and the front face of terminal 1300. In one embodiment, when the proximity sensor 1316 detects that the distance between the user and the front face of the terminal 1300 gradually decreases, the processor 1301 controls the display 1305 to switch from the bright screen state to the dark screen state; when the proximity sensor 1316 detects that the distance between the user and the front face of the terminal 1300 gradually increases, the processor 1301 controls the display 1305 to switch from the dark screen state to the bright screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 13 is not intended to be limiting with respect to terminal 1300 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be employed.
Optionally, the computer device is provided as a server. Fig. 14 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1400 may vary considerably in configuration and performance, and may include one or more processors (CPUs) 1401 and one or more memories 1402, where the memory 1402 stores at least one computer program, and the at least one computer program is loaded and executed by the processor 1401 to implement the methods provided by the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and the server may also include other components for implementing device functions, which are not described herein again.
The embodiment of the present application further provides a computer-readable storage medium, in which at least one computer program is stored, and the at least one computer program is loaded and executed by a processor to implement the operations performed in the image annotation method of the foregoing embodiment.
Embodiments of the present application also provide a computer program product or a computer program comprising computer program code stored in a computer readable storage medium. The processor of the computer apparatus reads the computer program code from the computer-readable storage medium, and the processor executes the computer program code, so that the computer apparatus implements the operations performed in the image annotation method according to the above embodiment.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only an alternative embodiment of the present application and should not be construed as limiting the present application, and any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. An image annotation method, characterized in that the method comprises:
acquiring a first sample image of a source domain, a first annotation image of the first sample image and a plurality of second sample images of a target domain;
selecting a target sample image from the plurality of second sample images, and acquiring a second annotation image of the target sample image, wherein the target sample image is the second sample image having the minimum similarity to the first sample image among the plurality of second sample images;
calling an image annotation model, and respectively annotating the first sample image and the target sample image to obtain a first prediction annotation image of the first sample image and a second prediction annotation image of the target sample image;
and training the image annotation model based on the difference between the first annotation image and the first prediction annotation image and the difference between the second annotation image and the second prediction annotation image, wherein the image annotation model is used for annotating the image of the target domain.
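Read at its simplest, claim 1 amounts to a supervised update on two images: the source sample against its first annotation image, and the selected target sample against its second annotation image. The following sketch assumes cross-entropy as the difference measure and a PyTorch model and optimizer; both are illustrative choices, since the claim itself does not fix them.

import torch.nn.functional as F

def train_step(model, optimizer, src_img, src_label, tgt_sample_img, tgt_sample_label):
    pred_src = model(src_img)           # first prediction annotation image
    pred_tgt = model(tgt_sample_img)    # second prediction annotation image
    loss = F.cross_entropy(pred_src, src_label) + F.cross_entropy(pred_tgt, tgt_sample_label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()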
2. The method of claim 1, wherein prior to said selecting a target sample image from said plurality of second sample images, said method further comprises:
acquiring first image features of a plurality of first sample images and second image features of a plurality of second sample images;
clustering the plurality of first sample images based on the acquired plurality of first image features to obtain at least one first clustering center;
for any second sample image, determining the distance between the second image feature of the second sample image and each first clustering center, and determining the similarity between the second sample image and the first sample image based on the minimum distance among the determined distances, wherein the minimum distance corresponding to the second sample image is negatively correlated with the corresponding similarity.
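As one possible, non-limiting reading of claim 2, the source features are clustered (for instance with plain k-means), every target image is scored by its minimum distance to the resulting first clustering centers, and the image with the largest such distance, hence the smallest similarity, is selected as the target sample image. NumPy arrays, the value of k, and the fixed number of k-means rounds are assumptions of the example.

import numpy as np

def select_target_sample(first_feats, second_feats, k=4, rounds=10, seed=0):
    rng = np.random.default_rng(seed)
    centers = first_feats[rng.choice(len(first_feats), size=k, replace=False)]
    for _ in range(rounds):  # cluster the first image features
        assign = np.linalg.norm(first_feats[:, None] - centers[None], axis=-1).argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = first_feats[assign == c].mean(axis=0)
    # Minimum distance to any first clustering center; a larger distance means a smaller similarity.
    min_dist = np.linalg.norm(second_feats[:, None] - centers[None], axis=-1).min(axis=1)
    return int(min_dist.argmax())  # index of the least similar second sample image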
3. The method of claim 2, wherein said obtaining a first image feature of a plurality of said first sample images comprises:
calling a feature extraction sub-model in the image annotation model, and performing feature extraction on each first sample image to obtain a third image feature of each first sample image;
and fusing the third image feature corresponding to each first sample image with the corresponding first annotation image to obtain the first image feature of each first sample image.
4. The method of claim 3, wherein the first annotated image comprises a region corresponding to at least one category; the fusing the third image feature corresponding to each first sample image with the corresponding first annotation image to obtain the first image feature of each first sample image includes:
for any first sample image, extracting a sub-annotation image corresponding to the at least one category from a first annotation image of the first sample image, wherein the sub-annotation image is used for indicating pixel points belonging to the corresponding category;
fusing the sub-annotation image corresponding to each category with the third image feature of the first sample image to obtain a fourth image feature corresponding to each category;
and splicing the fourth image features corresponding to the at least one category to obtain the first image feature of the first sample image.
5. The method according to claim 4, wherein the fusing the sub-annotation image corresponding to each category with the third image feature of the first sample image to obtain the fourth image feature corresponding to each category comprises:
for the sub-annotation image corresponding to any category, performing dot multiplication on the pixel value of each pixel point in the sub-annotation image and the corresponding feature value to obtain a product corresponding to each pixel point, wherein the feature value corresponding to any pixel point is the feature value, located at the same position as the pixel point, in the third image feature;
and determining the ratio between the product corresponding to each pixel point and the number of the pixel points, wherein the ratios corresponding to the pixel points form the fourth image feature corresponding to the category, and the number of the pixel points is the number of pixel points belonging to the corresponding category in the sub-annotation image.
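One way to read claims 4 and 5 together is as masked average pooling: each category's sub-annotation image selects the pixels of that category, the feature values at those pixels are multiplied by the mask and averaged over the pixel count, and the per-category vectors are concatenated. The sketch below follows that reading; the array shapes and the integer-label encoding of the first annotation image are assumptions of the example.

import numpy as np

def fuse_feature_with_annotation(feature_map, annotation, num_classes):
    # feature_map: (C, H, W) third image feature; annotation: (H, W) category indices.
    fused = []
    for cls in range(num_classes):
        mask = (annotation == cls).astype(feature_map.dtype)     # sub-annotation image for this category
        count = max(float(mask.sum()), 1.0)                      # number of pixels of this category
        pooled = (feature_map * mask[None]).sum(axis=(1, 2)) / count   # fourth image feature
        fused.append(pooled)
    return np.concatenate(fused)                                 # first image feature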
6. The method of claim 2, wherein obtaining second image features of the plurality of second sample images comprises:
calling a feature extraction sub-model in the image annotation model, and performing feature extraction on each second sample image to obtain a fifth image feature of each second sample image;
acquiring a soft labeling image of each second sample image, wherein the soft labeling image is obtained by labeling the second sample image by the image labeling model before the image labeling model is trained;
and fusing the fifth image feature of each second sample image with the corresponding soft labeling image to obtain the second image feature of each second sample image.
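Claim 6 mirrors the source-side fusion but uses the soft labeling image, so a per-class probability map can take the place of the hard mask. A minimal sketch under that assumption, with the soft labels stored as (K, H, W) probabilities; the shapes and the normalisation are illustrative choices.

import numpy as np

def fuse_feature_with_soft_label(feature_map, soft_label):
    # feature_map: (C, H, W) fifth image feature; soft_label: (K, H, W) class probabilities
    # produced by the image annotation model before it is trained.
    fused = []
    for cls in range(soft_label.shape[0]):
        weight = soft_label[cls]
        mass = max(float(weight.sum()), 1e-6)
        fused.append((feature_map * weight[None]).sum(axis=(1, 2)) / mass)
    return np.concatenate(fused)    # second image feature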
7. The method of claim 1, wherein the training the image annotation model based on the difference between the first annotated image and the first predictively annotated image and the difference between the second annotated image and the second predictively annotated image comprises:
in the process of training the image annotation model, at least one second clustering center corresponding to the plurality of second sample images in the iteration is obtained, and the at least one second clustering center is obtained by clustering based on second image features of the second sample images;
obtaining a distance between a second image feature of each second sample image in the iteration and the at least one second clustering center;
and training the image annotation model of the current iteration based on the difference between the first annotation image and the first prediction annotation image, the difference between the second annotation image and the second prediction annotation image and the distance corresponding to each second sample image.
8. The method according to claim 7, wherein the obtaining at least one second clustering center corresponding to the plurality of second sample images in the current iteration comprises:
if the current iteration is the first iteration of the training process, acquiring a second image feature of each second sample image in the current iteration;
and clustering the plurality of second sample images based on the acquired plurality of second image features to obtain the at least one second clustering center.
9. The method according to claim 8, wherein the obtaining at least one second clustering center corresponding to the plurality of second sample images in the current iteration further comprises:
if the current iteration is not the first iteration of the training process, assigning each second image feature to the second clustering center closest to that second image feature, based on the distance between each second image feature and each second clustering center in the previous iteration;
updating each second clustering center respectively based on the second image feature corresponding to each second clustering center;
and determining the updated second clustering center as the second clustering center corresponding to the iteration.
10. The method of claim 7, further comprising:
obtaining a soft labeling image of a third sample image, wherein the third sample image is a second sample image except the target sample image in the plurality of second sample images, and the soft labeling image is obtained by labeling the third sample image by the image labeling model before training the image labeling model;
the training of the image annotation model of the current iteration based on the difference between the first annotated image and the first predicted annotated image, the difference between the second annotated image and the second predicted annotated image, and the distance corresponding to each second sample image includes:
calling the image annotation model, and annotating the third sample image to obtain a third predicted annotation image of the third sample image;
and training the image annotation model of the current iteration based on the difference between the first annotation image and the first prediction annotation image, the difference between the second annotation image and the second prediction annotation image, the distance corresponding to each second sample image, and the difference between the soft annotation image and the third prediction annotation image.
11. The method of claim 1, wherein before the training of the image annotation model based on the difference between the first annotated image and the first predictively annotated image and the difference between the second annotated image and the second predictively annotated image, the method further comprises:
and performing countermeasure training on the image annotation model and a discrimination model based on the first sample image, the first annotation image and the plurality of second sample images, wherein the discrimination model is used for discriminating whether the annotation image output by the image annotation model is the annotation image of the first sample image.
12. The method of claim 11, wherein the performing countermeasure training on the image annotation model and the discrimination model based on the first sample image, the first annotation image, and the plurality of second sample images comprises:
calling the image annotation model, and respectively annotating the first sample image and the plurality of second sample images to obtain a fourth prediction annotation image of the first sample image and a fifth prediction annotation image of each second sample image;
calling the discrimination model to discriminate the fourth prediction labeling image and the fifth prediction labeling image to obtain a discrimination result;
and training the image annotation model and the discrimination model based on the difference between the fourth prediction annotation image and the first annotation image and the discrimination result.
13. An image annotation apparatus, characterized in that the apparatus comprises:
an obtaining module, configured to acquire a first sample image of a source domain, a first annotation image of the first sample image, and a plurality of second sample images of a target domain;
the obtaining module is further configured to select a target sample image from the plurality of second sample images, and acquire a second annotation image of the target sample image, where the target sample image is the second sample image having the minimum similarity to the first sample image among the plurality of second sample images;
a labeling module, configured to invoke an image annotation model to label the first sample image and the target sample image respectively, to obtain a first prediction annotation image of the first sample image and a second prediction annotation image of the target sample image;
a training module, configured to train the image annotation model based on a difference between the first annotation image and the first prediction annotation image and a difference between the second annotation image and the second prediction annotation image, where the image annotation model is used to annotate the image of the target domain.
14. A computer device, characterized in that the computer device comprises a processor and a memory, wherein at least one computer program is stored in the memory, and the at least one computer program is loaded and executed by the processor to implement the operations performed in the image annotation method according to any one of claims 1 to 12.
15. A computer-readable storage medium, having at least one computer program stored therein, the at least one computer program being loaded and executed by a processor to perform the operations performed in the image annotation method according to any one of claims 1 to 12.
CN202110679659.4A 2021-06-18 2021-06-18 Image annotation method and device, computer equipment and storage medium Pending CN113822263A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110679659.4A CN113822263A (en) 2021-06-18 2021-06-18 Image annotation method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110679659.4A CN113822263A (en) 2021-06-18 2021-06-18 Image annotation method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113822263A true CN113822263A (en) 2021-12-21

Family

ID=78923807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110679659.4A Pending CN113822263A (en) 2021-06-18 2021-06-18 Image annotation method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113822263A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114519404A (en) * 2022-04-20 2022-05-20 四川万网鑫成信息科技有限公司 Image sample classification labeling method, device, equipment and storage medium
CN114519404B (en) * 2022-04-20 2022-07-12 四川万网鑫成信息科技有限公司 Image sample classification labeling method, device, equipment and storage medium
CN116612474A (en) * 2023-07-20 2023-08-18 深圳思谋信息科技有限公司 Object detection method, device, computer equipment and computer readable storage medium
CN116612474B (en) * 2023-07-20 2023-11-03 深圳思谋信息科技有限公司 Object detection method, device, computer equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination