CN112734775A - Image annotation, image semantic segmentation and model training method and device - Google Patents

Image annotation, image semantic segmentation and model training method and device

Info

Publication number
CN112734775A
Authority
CN
China
Prior art keywords
image
semantic segmentation
sample
segmentation model
target image
Prior art date
Legal status
Granted
Application number
CN202110066493.9A
Other languages
Chinese (zh)
Other versions
CN112734775B (en)
Inventor
黄超 (Huang Chao)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110066493.9A
Publication of CN112734775A
Application granted
Publication of CN112734775B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/12Edge-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/13Edge detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an image labeling method, an image semantic segmentation method, a model training method and an image labeling device, relates to the technical field of artificial intelligence, and is used for improving the efficiency of labeling sample images. In the image labeling method, edge pixel points in a sample image are detected, target image blocks in the sample image are screened according to the edge pixel points, and only the target image blocks are labeled, so that the labeling result of the sample image is obtained.

Description

Image annotation, image semantic segmentation and model training method and device
Technical Field
The application relates to the technical field of computers, in particular to the technical field of artificial intelligence, and provides a method and a device for image annotation, image semantic segmentation and model training.
Background
At present, a variety of image segmentation models have emerged for segmenting various kinds of images. Among them, an important kind of image segmentation model is the image semantic segmentation model, which generally classifies an image at the pixel level and can therefore segment the image finely.
Most image semantic segmentation models are obtained by supervised learning, that is, a large number of sample images are required to train an image semantic segmentation model in order to obtain a trained model. The following takes two image semantic segmentation methods as examples to describe how such models are trained:
In one method, an image is taken as the input of a deep network; the scale of the convolution features is continuously reduced through a number of convolution layers and activation layers to extract the deep features of the image, the scale of the convolution features is then increased through up-sampling layers, and the network finally outputs, for each pixel point, the probability of belonging to each category. In the model training stage, this semantic segmentation method requires a manual semantic annotation result for every pixel point in the whole image, and annotating the images consumes a large amount of labor cost.
In the other method, semantic segmentation is realized based on a Conditional Generative Adversarial Network (CGAN): the conditional generative adversarial network is used to generate a semantic image, a discrimination network distinguishes the real semantic image (that is, the manually annotated semantic image) from the generated semantic image, and the loss of converting an image into a semantic image is learned automatically. The semantic segmentation effect can be improved by combining the automatically learned loss with a manually defined loss. Although this semantic segmentation method can express the training loss of the model through an adversarially learned loss function, the model still requires a manual annotation result for the whole image in the training stage, and the process of annotating the images consumes a large amount of labeling cost.
It can be seen that no matter which semantic segmentation method is adopted, the images required in the process of training the model need to be manually labeled pixel by pixel, manual labeling of one sample image takes at least 10 minutes, and a large number of labeled images are required in the training process, which results in low efficiency of the labeling process.
Disclosure of Invention
The embodiment of the application provides a method and a device for sample image annotation, image semantic segmentation and model training, which are used for improving the efficiency of sample image annotation.
In one aspect, an embodiment of the present application provides an image annotation method, including:
dividing a sample image to be labeled into a plurality of image blocks;
respectively determining edge pixel points with edge characteristics in the image blocks;
screening out at least one target image block of which the number of edge pixel points meets a preset screening condition from the plurality of image blocks;
and performing category labeling on the at least one target image block to obtain a labeling result of the sample image.
In one aspect, an embodiment of the present application provides a training method for an image semantic segmentation model, including:
obtaining an annotation result of the sample image by any image annotation method;
performing multiple iterative training on the image semantic segmentation model according to the sample image;
until the image semantic segmentation model is converged, obtaining a trained image semantic segmentation model;
wherein, each iterative training in the iterative training of the image semantic segmentation model for a plurality of times comprises the following steps:
inputting a sample image into an image semantic segmentation model to obtain a semantic segmentation result, wherein the semantic segmentation result comprises the probability that each pixel point in the sample image belongs to each category;
and adjusting model parameters of an image semantic segmentation model according to the labeling result of each pixel point in at least one target image block in the sample image and the semantic segmentation result of the pixel point corresponding to the at least one target image block in the semantic segmentation result.
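For illustration only, a minimal sketch of what one such iterative training step could look like is given below, assuming a PyTorch-style segmentation model, per-pixel class labels for the annotated target image blocks, and a binary block mask; all function and variable names here are hypothetical and not prescribed by this application.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, image, labels, block_mask):
    """One sketched iteration. image: (1, 3, H, W) float tensor; labels: (1, H, W)
    long tensor of class indices (only meaningful inside annotated blocks);
    block_mask: (1, H, W) float tensor, 1 inside target image blocks, 0 elsewhere."""
    logits = model(image)                                          # (1, C, H, W) per-pixel scores
    loss_map = F.cross_entropy(logits, labels, reduction="none")   # (1, H, W) per-pixel loss
    # Only pixel points inside the annotated target image blocks contribute to the loss.
    masked_loss = (loss_map * block_mask).sum() / block_mask.sum().clamp(min=1.0)
    optimizer.zero_grad()
    masked_loss.backward()
    optimizer.step()
    return masked_loss.item()
```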
In one aspect, an embodiment of the present application provides an image semantic segmentation method, including:
acquiring a target image to be segmented;
inputting the target image into the trained image semantic segmentation model obtained by the image semantic segmentation model training method as described above, and obtaining the category to which each pixel point in the target image belongs.
In one aspect, an embodiment of the present application provides an image annotation apparatus, including:
the dividing module is used for dividing the sample image to be labeled into a plurality of image blocks;
the determining module is used for determining edge pixel points with edge characteristics in the image blocks respectively;
the screening module is used for screening at least one target image block of which the number of edge pixel points meets a preset screening condition from the plurality of image blocks;
and the labeling module is used for performing category labeling on the at least one target image block to obtain a labeling result of the sample image.
In a possible embodiment, the apparatus further comprises an obtaining module configured to:
before dividing a sample image to be labeled into a plurality of image blocks, sampling a sample video according to a preset sampling interval to obtain a plurality of candidate sample images;
determining a similarity between any two candidate sample images among the plurality of candidate sample images;
if any two candidate sample images with the similarity larger than the preset similarity exist, one of the two candidate sample images is removed;
and taking the residual candidate sample images as sample images to be labeled.
In a possible embodiment, the determining module is specifically configured to:
carrying out graying processing on the sample image to obtain a grayed sample image;
and carrying out edge detection processing on the grayed sample image to obtain edge pixel points with edge characteristics in the image blocks.
In a possible embodiment, the determining module is specifically configured to:
and determining the pixel points with the gray values of preset values in the grayed sample image as edge pixel points.
In a possible embodiment, the screening module is specifically configured to:
determining the ratio of the number of edge pixel points included in each image block to the total number of all edge pixel points of the sample image;
and determining at least one target image block from the plurality of image blocks according to the ratio corresponding to each image block.
In a possible embodiment, the screening module is specifically configured to:
sorting the ratios of the plurality of image blocks from large to small, and determining image blocks corresponding to the first N ratios as target image blocks, wherein N is a preset natural number; or,
determining an image block with a ratio not less than a preset ratio as a target image block; or,
and randomly selecting at least one target image block from the plurality of image blocks by taking the ratio of the plurality of image blocks as random probability.
In a possible embodiment, the labeling module is further configured to:
after at least one target image block with the number of edge pixel points meeting a preset screening condition is screened out from the plurality of image blocks, marking the at least one target image block as a first identifier, marking other image blocks except the at least one target image block in the plurality of image blocks as second identifiers, and obtaining a mask image, wherein the first identifier is different from the second identifier, and the mask image is used for training an image semantic segmentation model.
In a possible embodiment, the sample image is a game scene image with a preset behavior, and the labeling module is specifically configured to:
the class labeling of each pixel point in the at least one target image block to obtain a labeling result of the sample image comprises the following steps:
and labeling the game scene articles to which the pixel points belong in at least one target image block in the game scene graph according to a plurality of preset game scene article types to obtain a labeling result.
In one aspect, an embodiment of the present application provides an image semantic segmentation model training device, including:
the acquisition module is used for acquiring the labeling result of the sample image by any image labeling method;
the training module is used for carrying out repeated iterative training on the image semantic segmentation model according to the sample image;
the obtaining module is used for obtaining the trained image semantic segmentation model until the image semantic segmentation model converges;
the training module is used for executing the following process so as to realize each iterative training in the multiple iterative training of the image semantic segmentation model:
inputting a sample image into an image semantic segmentation model to obtain a semantic segmentation result, wherein the semantic segmentation result comprises the probability that each pixel point in the sample image belongs to each category;
and adjusting model parameters of an image semantic segmentation model according to the labeling result of each pixel point in at least one target image block in the sample image and the semantic segmentation result of the pixel point corresponding to the at least one target image block in the semantic segmentation result.
In one aspect, an embodiment of the present application provides an image semantic segmentation apparatus, including:
the acquisition module is used for acquiring a target image to be segmented;
and the obtaining module is used for inputting the target image into the trained image semantic segmentation model obtained by any image semantic segmentation model training method discussed above, and obtaining the category to which each pixel point in the target image belongs.
In one possible embodiment, the target image is a game scene image; the apparatus further comprises a control module, the control module further configured to:
and after the category of each pixel point in the target image is obtained, controlling the artificial intelligent game role to move to a position corresponding to a preset category according to the category of each pixel point in the target image so as to execute a corresponding task.
In one aspect, an embodiment of the present application provides a computer device, including:
at least one processor, and
a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, the at least one processor implementing any of an image annotation method, an image semantic segmentation model training method, or an image semantic segmentation method as discussed above by executing the instructions stored by the memory.
Embodiments of the present application provide a computer storage medium storing computer instructions that, when executed on a computer, cause the computer to perform any one of the image annotation methods, the image semantic segmentation model training methods, or the image semantic segmentation methods as discussed above.
Due to the adoption of the technical scheme, the embodiment of the application has at least the following technical effects:
In the embodiment of the application, when the sample image is labeled, part of the image blocks are screened out from the plurality of image blocks of the sample image according to the edge information of the sample image, and only those image blocks are labeled, so that the labeling amount can be reduced and the labeling efficiency improved. Moreover, because the screened image blocks, whose edge information meets certain requirements, contain relatively more edge information, they are likely to contain more category information; this is equivalent to screening out and labeling the more valuable image blocks in the sample image, so the impact on the training of the image semantic segmentation model can be kept small. In addition, since only part of the image blocks are screened from the sample image, compared with a mode of labeling all pixel points in the sample image, the labeling mode in the embodiment of the application relatively increases the uncertainty of the labeled objects, thereby increasing the uncertainty in the model training process, which is beneficial to improving the generalization capability of the image semantic segmentation model.
Drawings
Fig. 1 is an exemplary diagram of an application scenario of an image annotation method according to an embodiment of the present application;
fig. 2 is a flowchart of an image annotation method according to an embodiment of the present application;
fig. 3A is an exemplary diagram of a grayed sample image provided in an embodiment of the present application;
FIG. 3B is an exemplary diagram of an image after Gaussian filtering of the image of FIG. 3A;
FIG. 3C is an edge image after performing edge detection on the image of FIG. 3A;
FIG. 4A is an exemplary diagram of a mask image provided by an embodiment of the present application;
FIG. 4B is an exemplary diagram of the sample image shown in FIG. 3A after labeling;
FIG. 5 is a flowchart of a training method for an image semantic segmentation model according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an image semantic segmentation model provided in an embodiment of the present application;
fig. 7 is a flowchart of an image semantic segmentation method according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an image annotation apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an image semantic segmentation model training device according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an image semantic segmentation apparatus according to an embodiment of the present application;
fig. 11 is a first schematic structural diagram of a computer device according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a computer device according to an embodiment of the present application;
fig. 13 is a schematic structural diagram three of a computer device according to an embodiment of the present application.
Detailed Description
In order to better understand the technical solutions provided by the embodiments of the present application, the following detailed description is made with reference to the drawings and specific embodiments.
To facilitate better understanding of the technical solutions of the present application for those skilled in the art, the following terms related to the present application are introduced.
1. Artificial Intelligence (AI): the method is a theory, method, technology and application system for simulating, extending and expanding human intelligence by using a digital computer or a machine controlled by the digital computer, sensing the environment, acquiring knowledge and obtaining the best result by using the knowledge. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
2. Machine Learning (ML): the method is a multi-field cross discipline and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The special research on how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
3. Convolutional Neural Network (CNN): the method is a feedforward neural network, and the artificial neurons of the feedforward neural network can respond to peripheral units in a part of coverage range and have excellent performance on large-scale image processing. The convolutional neural network consists of one or more convolutional layers and a top fully connected layer (corresponding to the classical neural network), and also includes associated weights and pooling layers (pooling layers).
4. Depth characteristics: the image features extracted through the depth network contain abstract information of the image.
5. Semantic segmentation: assigning to each pixel in an image the class label of the object of interest to which that pixel belongs.
6. Semantic image: the result obtained after assigning a category label to each pixel in the image.
7. Mask image: in the embodiments of the present application, the mask image may be, for example, a binary image containing first-type pixel points with a first value and second-type pixel points with a second value; for example, a value of 0 for a pixel point in the binary image means that the pixel point is not selected, and a value of 1 means that the pixel point is selected.
8. Conditional Generative Adversarial Network (CGAN): an improvement on GAN in which additional conditional information is added to the generator and discriminator of the original GAN, so as to realize a conditional generative model. The additional condition information may be a category label or other auxiliary information.
9. ImageNet database: a large-scale image database containing 1000 categories.
10. MobileNetV2, a commonly used lightweight network model architecture, was trained on the ImageNet database and can be used to extract image features.
11. Image classification and category: image classification refers to an image processing method that distinguishes objects of different categories from each other based on the different features reflected in the image information. It uses a computer to perform quantitative analysis of an image and assigns every pixel point or region in the image to one of several categories, in place of human visual interpretation. A category may also be referred to as a class. In the embodiments of the present application, there may be two or more categories, such as vehicles, roads, and the like. When the image semantic segmentation model is applied to different scenes, the corresponding categories to be labeled may differ. Each object in the image is actually composed of pixels, and the category of a pixel is the category corresponding to the object it belongs to.
12. Sample image and target image: all belong to images, in the embodiment of the present application, an image used for training a model is referred to as a sample image, and an image subsequently processed by using the model is referred to as a target image.
13. Artificial intelligence game role: a game role controlled by artificial intelligence technology in a game, including a non-player character (NPC); in certain cases a player character may also be referred to as an artificial intelligence game role, for example, when it is detected that the player has performed no control operation on the player character within a preset time period, the player character may be controlled by artificial intelligence technology to perform game tasks, and the like.
14. Edge information and edge pixel points: the edge information is used for describing information of pixels with discontinuous gray level change of neighborhood pixels of the pixels in the image, the pixels with discontinuous gray level change of the neighborhood pixels of the pixels are edge pixels, and the edge information specifically comprises gray level values of the edge pixels, shapes formed by the edge pixels and the like. Edges are widely present between objects and backgrounds, and between objects and objects. Edge information in an image can be obtained by image edge detection.
In order to improve the annotation efficiency for sample images, an embodiment of the present application provides an image annotation scheme: the scheme obtains edge information of a sample image, selects some of the image blocks of the sample image based on the edge information, and performs class annotation only on the selected image blocks. Meanwhile, the selected image blocks are those with rich edge features, and image blocks with rich edge features are likely to contain rich category information; screening image blocks whose edge pixel points meet certain conditions therefore means that the annotated image blocks carry more category information, so the accuracy of subsequent model training is not affected. In addition, since not all pixel points of the sample image are annotated, the randomness of the annotated sample images is increased, which helps avoid over-fitting of the trained model and improves its generalization capability.
Based on the above design concept, an application scenario of the image annotation method according to the embodiment of the present application is described below.
The labeled sample image can be used for training an image semantic segmentation model, and the image semantic segmentation model can output the category of each pixel point in the image, so that the image labeling method in the embodiment of the application can be applied to any scene needing image labeling, for example, a game scene, and specifically, an artificial intelligent game role can be controlled according to an image segmentation result generated by the image semantic segmentation model, for example, in a gunfight game, the image content is analyzed through the semantic segmentation model, so that the position information of important targets such as houses, vehicles and the like is provided, and the artificial intelligent game role can execute game tasks such as house exploration, vehicle driving and the like according to the position information. For example, the image annotation method in the embodiment of the present application may also be applied to an automatic driving scene, and specifically, the position information of an important target may be determined according to an image segmentation result generated by an image semantic segmentation model, so as to provide a reference for vehicle traveling.
Referring to fig. 1, an application scenario diagram of an application of the image annotation method according to the embodiment of the present application is shown, where the scenario diagram includes a plurality of servers and a terminal 140.
The plurality of servers includes a first server 110, a second server 120, and a third server 130, the first server 110 being configured to implement sample image annotation. The second server 120 is configured to obtain the annotated sample image from the first server 110 and train an image semantic segmentation model based on the annotated sample image. The third server 130 is configured to obtain the trained image semantic segmentation model from the second server 120 and provide an image semantic segmentation function using it; the terminal 140 communicates with the third server 130 and uses the image semantic segmentation function. The image labeling method, the model training method and the image semantic segmentation method are described below.
It should be noted that, in fig. 1, the example of labeling the sample image, training the model, and implementing semantic segmentation of the image is implemented by the server, and actually, corresponding functions may also be implemented by the terminal. In addition, in fig. 1, the labeling of the sample image, the training of the model, and the implementation of the semantic segmentation of the image are implemented by three different devices, or may be implemented by one or two devices, which is not limited in the present application.
In addition, the terminal 140 and the third server 130 may be directly or indirectly connected through wired or wireless communication, which is not limited in this application. In addition, the terminal 140 may further be equipped with a client 141, and the client 141 and the third server 130 communicate with each other to implement a corresponding image semantic segmentation function.
For example, the client 141 is a game client, and the third server 130 may control the artificial intelligence game character to perform a corresponding task according to the trained image semantic segmentation model, update the game screen in real time, send the updated game screen to the terminal 140, and receive and present the game screen by the terminal 140. Alternatively, for example, a third server may test the game application by controlling the artificial intelligence game character and transmit the test result to the terminal 140.
The terminal may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a game device, a smart television, a smart bracelet, or the like, but is not limited thereto. The first server 110, the second server 120, and the third server 130 may be independent physical servers, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be cloud servers providing basic cloud computing services such as cloud services, a cloud database, cloud computing, a cloud function, cloud storage, Network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and a big data and artificial intelligence platform.
Based on the application scenarios discussed above, the following takes the first server to implement the image annotation method as an example, and introduces the image annotation method related to the embodiment of the present application.
Referring to fig. 2, a flowchart of an image annotation method provided in an embodiment of the present application is shown, where the method includes:
s201, a first server divides a sample image to be annotated into a plurality of image blocks.
The first server may obtain one or more sample images, a processing process of each sample image is the same, and in this embodiment, a process of labeling a sample image is described as an example.
After obtaining the sample image, the first server may divide the sample image into a plurality of image blocks of equal size according to a fixed block size, or may divide the sample image into a fixed number of image blocks, in which case the sizes of any two image blocks may be the same or different; the specific manner of dividing the sample image is not limited in the present application.
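As an illustrative sketch only (the 4x4 grid is an assumed example, not a value prescribed by this application), dividing an image into equal-size blocks could be done as follows:

```python
import numpy as np

def split_into_blocks(image: np.ndarray, rows: int = 4, cols: int = 4):
    """Split an (H, W) or (H, W, C) image into rows x cols equal-size blocks.
    Assumes H and W are divisible by rows and cols, for simplicity."""
    h, w = image.shape[:2]
    bh, bw = h // rows, w // cols
    blocks = []
    for r in range(rows):
        for c in range(cols):
            blocks.append(image[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw])
    return blocks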
S202, the first server determines edge pixel points with edge characteristics in the image blocks respectively.
The first server can detect edge information of the sample image, and edge pixel points with edge characteristics in the sample image are determined according to the edge information, so that the edge pixel points included in each image block are correspondingly determined. The edge pixel point can be understood as a pixel point with a large pixel value change.
S203, the first server screens out at least one target image block, the number of edge pixel points of which meets a preset screening condition, from the plurality of image blocks.
The first server can screen one or more target image blocks from the plurality of image blocks according to the number of edge pixel points contained in each image block. For example, the first server may screen an image block with a larger number of edge pixel points from the plurality of image blocks as a target image block.
S204, the first server carries out category labeling on at least one target image block to obtain a labeling result of the sample image.
After obtaining the at least one target image block, the first server may perform class labeling on the screened at least one target image block. For example, the first server may obtain a category labeling result for each pixel point in each target image block according to input information of a user. Alternatively, for example, the first server may automatically identify the class labeling results of all the pixel points in each target image block, so as to obtain the labeling result of the sample image.
The first server can train the image semantic segmentation model according to the annotation result, or send the annotation result and the sample image to other equipment together, so that the other equipment can train the image semantic segmentation model according to the sample image and the annotation result.
In the embodiment of the application, the edge information in the sample image can be detected, the image edge is generated, the target image blocks to be labeled are selected based on the edge information, and then the target image blocks are manually labeled, so that the labeling quantity of each sample image is reduced, and the labeling efficiency is improved. Moreover, because the two sides of the edge are likely to be objects of different types, the target image blocks are selected based on the edge information, the target image blocks with more types can be selected, and the accuracy of the subsequent training model is relatively guaranteed.
The following describes specific implementation of each step based on the embodiment of fig. 2:
before executing step S201, the first server needs to obtain the sample image, and the following describes an exemplary manner in which the first server obtains the sample image:
in a first mode, the first server can screen the images needed by the user from the network to be used as sample images.
The first server may screen the sample image required by itself from the network based on a screening rule, which may be various, such as one or two of an image quality screening rule and an image scene screening rule. The image quality screening refers to screening an image with image quality meeting certain requirements, the image quality can comprise one or two of image definition, color saturation and the like, and the image scene screening refers to screening an image meeting a target scene according to a scene related to the image. The target scene may be an application scene of an image semantic segmentation model, for example, the image semantic segmentation model is applied to a game scene, and then the first server may filter images related to the game from the network as sample images.
The second method comprises the following steps: the first server may obtain the sample image from the other device.
Specifically, the first server may obtain the sample image from the device related to the application scene according to the application scene of the image semantic segmentation model, for example, the image semantic segmentation model is applied to a game scene, and then the first server may obtain the sample image from the background server device related to the game.
In a third mode, the first server acquires the sample image by combining the first mode and the second mode.
Regardless of which way the first server acquires the sample images, what it acquires may directly be a sample video; in this embodiment, the first server may then screen the sample images from the sample video.
For example, the first server randomly samples a sample video to obtain a sample image.
Or, the first server may sample the sample video according to a preset sampling interval to obtain a plurality of candidate images, and the first server may directly use the plurality of candidate images as the sample image, so that the sample image may be obtained simply and quickly.
In order to improve the effectiveness of generating the sample image, the first server may further screen out candidate images with higher similarity from the plurality of candidate images to obtain the sample image.
Specifically, after obtaining a plurality of candidate images, the first server may determine the similarity between every two candidate images, for example, by extracting the respective image feature vectors of the two images and calculating the similarity between the two feature vectors; specifically, this similarity may be represented by the cosine similarity or the Euclidean distance between the two image feature vectors. After the similarity between every two candidate images is obtained, if any two candidate images have a similarity greater than the preset similarity, one of them is removed, that is, it is discarded and no longer used as a sample image; the remaining candidate images are used as the sample images to be labeled. The preset similarity is a preset similarity threshold, and its specific value can be set according to requirements, for example, 0.9.
In the embodiment of the application, the candidate images with higher similarity can be screened, and the condition of model overfitting caused by training the model by using the candidate images with higher similarity can be avoided.
For example, suppose the preset similarity is 0.9 and the candidate images are four images A, B, C and D, with a similarity of 0.95 between A and B, 0.2 between B and C, 0.91 between C and D, and 0.3 between A and D. The first server determines that the similarity between A and B and the similarity between C and D are both greater than the preset similarity, so it may eliminate A and C and take the remaining B and D as sample images.
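A hedged sketch of this deduplication step is shown below; the feature extractor (for instance a MobileNetV2-based embedding, as mentioned in the term list) and the greedy keep-or-drop strategy are assumptions for illustration only.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def deduplicate_candidates(candidates, extract_features, preset_similarity=0.9):
    """Keep a candidate image only if it is not too similar to any already-kept one.
    candidates: list of images; extract_features: function returning a 1-D feature vector."""
    kept_images, kept_features = [], []
    for image in candidates:
        feature = extract_features(image)
        if all(cosine_similarity(feature, f) <= preset_similarity for f in kept_features):
            kept_images.append(image)
            kept_features.append(feature)
    return kept_images
```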
After obtaining the sample image, the first server may divide the sample image to obtain a plurality of image blocks, and the dividing manner may refer to the content discussed above, and is not described herein again. After obtaining the plurality of image blocks, the process of S202 is performed, and the following describes an implementation manner for performing S202:
the first server can perform graying processing on the sample image to obtain a grayed sample image, and further extract the grayed sample image to perform edge detection processing to obtain edge points with edge features in the image blocks.
For example, please refer to fig. 3A, which is an exemplary diagram of a sample image after being grayed.
Specifically, the first server may convert the sample image into a gray-scale image and extract the edge information of the image based on a preset edge detection algorithm, such as the Canny edge detection algorithm. An example edge detection process is described below:
s1.1: and performing Gaussian filtering on the grayed sample image.
The main purpose of gaussian filtering is to reduce the noise of the grayed sample image. The grayed sample image is subjected to gaussian filtering, which can be actually understood as weighted average of the grayed sample image, that is, the gray value of each pixel point in the grayed sample image is obtained by weighted average of the gray value of the pixel point and the gray values of other pixel points in the neighborhood of the pixel point. The Gaussian filtering carries out weighted average on the gray values of all the pixel points in the grayed sample image, so that some noises in the image are filtered, the overall contour in the grayed sample image is relatively fuzzy, the processed image is relatively smooth, and the width of the contour is relatively increased.
Continuing with the example shown in FIG. 3A: after Gaussian filtering of the sample image shown in FIG. 3A, the example image shown in FIG. 3B is obtained, in which the lines across the image are smoother than in FIG. 3A.
S1.2: and calculating gradient values and gradient directions in the images after Gaussian filtering.
An edge can be understood as a set of pixel points where the gray value changes sharply; for example, where a black region meets a white region, the boundary between them is generally an edge. In a specific implementation, the change of gray values can be detected to find the edges in an image, where a gradient value represents the degree of change of the gray value and the gradient direction represents the direction of that change. The gradient value and the gradient direction can be calculated by the following formulas:
G = \sqrt{G_x^2 + G_y^2}
\theta = \arctan\left(\frac{G_y}{G_x}\right)
wherein G_x represents the gray value obtained by horizontal (lateral) edge detection, G_y represents the gray value obtained by vertical (longitudinal) edge detection, G represents the degree of change of the gray value (the gradient value), and \theta represents the gradient direction.
S1.3: non-maxima are filtered.
In S1.1, the contour width in the image is actually enlarged, which may affect the accuracy of edge detection, and therefore S1.3 is mainly used to screen out pixel points that are not edges.
Specifically, if the first server determines that the gradient value of the pixel point in the gradient direction is the maximum, the pixel point is determined to belong to a suspected edge pixel point; if the gradient value of the pixel point in the gradient direction is not the maximum value, the pixel point is determined not to be an edge pixel point, and the like, so that some pixel points which do not belong to the edge are excluded. The suspected edge pixel points can be understood as preliminarily determined as edge pixel points, but can be further determined.
In a possible embodiment, the first server may directly use the suspected edge pixel as an edge pixel, so as to obtain an edge pixel in the grayed sample image.
S1.4: the edge is determined using an upper threshold.
In order to determine more accurate edge pixel points, in the embodiment of the present application, the first server may further screen suspected edge pixel points in S1.3 through a high threshold and a low threshold. Wherein the high threshold is greater than the low threshold.
Specifically, if the first server determines that the gradient value of the suspected edge pixel is greater than the high threshold, the suspected edge pixel is determined to be an edge pixel; if the gradient value of the suspected edge pixel point is smaller than the high threshold value but larger than the low threshold value, determining that the suspected edge pixel point belongs to the edge pixel point; and if the gradient value of the suspected edge pixel point is less than or equal to the low threshold value, determining that the suspected edge pixel point does not belong to the edge pixel point. Therefore, the first server can determine all edge pixel points in the grayed sample image, so that the edge image can be obtained. An edge image may be understood as an image that identifies edge pixels and non-edge pixels. The non-edge pixel points can be understood as pixel points which do not belong to the edge pixel points in the image.
After obtaining the edge image, the first server further determines the edge pixel points according to their gray values, for example, by taking the pixel points whose gray value equals a preset value as edge pixel points. The first server can then determine the number of edge pixel points included in each image block. It should be noted that the number of edge pixel points included in an image block may be 0, 1, or more.
Continuing with the example shown in fig. 3A-3B, the first server processes the sample image to obtain an edge image as shown in fig. 3C, where the black line in fig. 3C corresponds to an edge.
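The steps S1.1 to S1.4 above correspond to the standard Canny pipeline; a sketch using OpenCV is given below, where the kernel size and thresholds are illustrative assumptions rather than values taken from this application.

```python
import cv2

def detect_edges(sample_image_bgr, low_threshold=50, high_threshold=150):
    """Return a binary edge image: non-zero at edge pixel points, 0 elsewhere."""
    gray = cv2.cvtColor(sample_image_bgr, cv2.COLOR_BGR2GRAY)  # graying processing
    blurred = cv2.GaussianBlur(gray, (5, 5), 1.4)              # S1.1 Gaussian filtering
    # cv2.Canny internally computes gradient values/directions (S1.2),
    # suppresses non-maxima (S1.3) and applies the high/low thresholds (S1.4).
    return cv2.Canny(blurred, low_threshold, high_threshold)
```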
After determining the number of edge pixel points included in each image block, S203 may be executed, and the first server may determine a target image block from a plurality of image blocks, where a specific implementation of S203 is described below:
specifically, the first server may determine edge pixel points included in each image block, and calculate a ratio of the edge pixel points included in each image block to all the edge pixel points in the sample image, so as to determine at least one target image block according to the ratio of each image block.
The following example introduces the formula for calculating the ratio:
P_i = \frac{n_i}{\sum_{n=1}^{D} n_n}
wherein P_i represents the ratio corresponding to the i-th image block, n_i represents the number of edge pixel points included in the i-th image block (grid), and D represents the number of image blocks into which the sample image is divided; for example, if the sample image is divided into 16 image blocks, the value of D is 16, and the summation index n runs from 1 to D.
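As an illustrative sketch (a binary edge image and equal-size blocks are assumed), the ratio P_i for each image block could be computed as follows:

```python
import numpy as np

def edge_pixel_ratios(edge_image: np.ndarray, rows: int = 4, cols: int = 4):
    """Return P_i for each block: edge pixel points in block i / all edge pixel points."""
    h, w = edge_image.shape[:2]
    bh, bw = h // rows, w // cols
    total_edges = max(int(np.count_nonzero(edge_image)), 1)   # avoid division by zero
    ratios = []
    for r in range(rows):
        for c in range(cols):
            block = edge_image[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            ratios.append(np.count_nonzero(block) / total_edges)
    return ratios
```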
After the first server determines the ratio corresponding to each image block, at least one target image block may be determined according to the ratio, and specific determination methods are various, and the following examples are described:
the first determining method is that the ratio values corresponding to the image blocks are sorted from large to small, and the image blocks corresponding to the first N ratio values are determined as target image blocks.
After the first server determines the ratios corresponding to the plurality of image blocks, the ratios may be sorted from large to small, the first N ratios are taken from the sorted list, and the image blocks corresponding to these N ratios are determined as target image blocks.
In the embodiment of the application, the image blocks with relatively large ratios are determined from the plurality of image blocks as target image blocks. This determination mode is relatively simple; moreover, a large ratio indicates that the image block contains more edge pixel points, and because objects of different categories are more likely to lie on the two sides of an edge, more edge pixel points indicate that the image block contains more category information. Image blocks with richer category information are thus obtained, which facilitates the subsequent training of the image semantic segmentation model.
Wherein N is a preset natural number. As an embodiment, N may be set according to the number of image blocks included in the sample image, for example, N may be set to be half of the number of the plurality of image blocks, which may make the determined number of target image blocks relatively reasonable.
The second determining method is that an image block whose ratio is not less than a preset ratio is determined as a target image block.
The first server may set a preset ratio, where the preset ratio may be set according to an actual situation of the sample image, and after determining the ratio of each image block in the plurality of image blocks, the first server may determine the image block whose ratio is greater than or equal to the preset ratio as the target image block.
In the embodiment of the application, the target image block is determined according to the preset ratio, so that the flexibility of the number of the determined target image blocks is higher under the condition that more edge pixel points of the determined target image blocks are ensured, and the image blocks containing more edge pixel points are reserved as far as possible.
The third determining method is that at least one target image block is randomly selected from the plurality of image blocks by taking the ratios of the plurality of image blocks as random probabilities.
The first server may use the ratio of each image block as the random probability of that image block, and then randomly select at least one target image block from the plurality of image blocks according to the respective random probabilities of the image blocks. When randomly selecting target image blocks, the first server may set a preset number and randomly select that number of target image blocks.
In the embodiment of the application, the first server randomly selects the target image block from the plurality of image blocks by taking the ratio as the random probability, so that the probability of randomly selecting the image blocks with more edge pixels is higher, but the image blocks with more edge pixels are not directly selected, so that the selected target image block has certain randomness, and the overfitting condition of a later-stage image semantic segmentation model is favorably avoided.
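The three determining methods could be sketched as follows; the selection count N, the preset ratio, and the sampling details are illustrative assumptions only.

```python
import numpy as np

def select_top_n(ratios, n):
    """First method: blocks corresponding to the N largest ratios."""
    return sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)[:n]

def select_by_threshold(ratios, preset_ratio):
    """Second method: blocks whose ratio is not less than the preset ratio."""
    return [i for i, p in enumerate(ratios) if p >= preset_ratio]

def select_randomly(ratios, n, seed=None):
    """Third method: sample blocks using the ratios as selection probabilities."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(ratios, dtype=float)
    probs = probs / probs.sum()          # assumes at least one edge pixel exists
    return list(rng.choice(len(ratios), size=n, replace=False, p=probs))
```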
After obtaining at least one target image block, in order to facilitate subsequent determination of which are target image blocks and which are unselected image blocks, in this embodiment of the present application, the first server may label the target image blocks in the sample image, so as to facilitate subsequent identification of which are target image blocks.
Specifically, the first server may label a first identifier on at least one target image block, which may be understood as labeling each pixel point in each target image block in the at least one target image block with the first identifier, and labeling other image blocks in the plurality of image blocks except the at least one target image block with a second identifier, thereby obtaining the mask image.
The first mark and the second mark belong to different marks, and the first mark and the second mark may belong to the same type but belong to different marks under the same type, for example, the first mark and the second mark are both represented by colors, specifically, the first mark is white, the white specifically may be represented by "1", the second mark is black, and the black specifically may be represented by "0". The first and second markers are of the same type, which may facilitate subsequent equipment to resolve the mask image. The first identifier and the second identifier may also be of different types, which is not limited in this application.
For example, continuing with the example shown in fig. 3A, the first server determines a target image block in the sample image, and the first server labels the sample image to obtain the mask image shown in fig. 4A, wherein the target image block is labeled white and the other image blocks of the plurality of image blocks except the target image block are labeled black.
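A binary mask image like the one in FIG. 4A could be produced with a sketch such as the following, where 1 marks the selected target image blocks and 0 marks the rest; the row-major block indexing is assumed for illustration.

```python
import numpy as np

def build_mask_image(image_shape, selected_blocks, rows: int = 4, cols: int = 4):
    """image_shape: (H, W); selected_blocks: indices of the screened target image blocks."""
    h, w = image_shape
    bh, bw = h // rows, w // cols
    mask = np.zeros((h, w), dtype=np.uint8)                 # second identifier (black / 0)
    for idx in selected_blocks:
        r, c = divmod(idx, cols)
        mask[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw] = 1  # first identifier (white / 1)
    return mask
```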
After the first server determines the target image blocks, S204 may be executed, and a process of performing category labeling on at least one target image block is described as follows, taking the first server labeling categories of each pixel point in one target image block as an example:
the first server may obtain a category of each pixel point in one target image block according to input information of a user, where the input information includes a category corresponding to each pixel point, for example, the user performs a labeling operation on the category of each pixel point, and the first server obtains the input information according to the labeling operation, thereby obtaining the category of each pixel point. Or, the first server may automatically identify the category of each pixel point, and label each pixel point, for example, the first server may match each target in the sample image according to the edge information of the sample image, so as to label each pixel point.
The marked type can be two or more, the marked type can be set by the first server according to requirements, and the type can be, for example, a bridge, a person, a lawn, a tree, a house, a background, a door, a window and the like.
No matter which kind of mode is adopted to label by the first server, when the first server labels every pixel point, there are a plurality of modes of labeling the pixel point that belongs to different categories, for example, the first server can label the pixel point of different categories with different colors, also can label the pixel point of different categories with different color depth degree, also can distinguish and label the pixel point of different categories with different transparencies, perhaps also can label the pixel point of different categories with different grade numbers etc., this application embodiment does not restrict this.
For example, continuing with the example shown in fig. 3A, after the first server filters the target image block in the grayed sample image shown in fig. 3A, the first server may label the selected target image block, so as to obtain a mask image as shown in fig. 4A, where white in the mask image in fig. 4A corresponds to the target image block. Further, the first server may label each pixel in the target image block, and specifically label serial numbers corresponding to the categories for sets of pixels belonging to different categories, so as to obtain a labeling result as shown in fig. 4B, please refer to fig. 4B, where the pixel corresponding to the serial number (i) belongs to a bridge, the pixel corresponding to the serial number (ii) belongs to a vehicle, and the pixel corresponding to the serial number (iii) belongs to a person. With reference to fig. 4B, it can be seen that the method in the embodiment of the present application may not need to label the entire image, so as to reduce the labeling amount.
In practice, the first server may perform the foregoing sample image annotation process on each of the plurality of sample images to obtain an annotation result corresponding to each sample image, and these sample images may be subsequently used for training the image semantic segmentation model.
In the embodiment shown in fig. 2, when the sample image is labeled, each pixel point in the sample image is not labeled, but the sample image is divided into a plurality of image blocks, and then the image blocks with rich edge information are selected to label the selected target image blocks, so that the labeling amount in the process of labeling the sample image can be relatively reduced. Moreover, the image blocks with more edge information are screened, the edge information is richer and indicates that the image blocks contain richer category information, so that the labeling quantity can be reduced, and the information of the image blocks in the sample image has certain redundancy, so that the accuracy of the later-stage image semantic segmentation model training cannot be influenced even if all the image blocks are not labeled. Furthermore, the selected image block may not contain a complete target, so that overfitting training of the image semantic segmentation model can be avoided, and the generalization capability of the image semantic segmentation model is improved.
In addition, experiments show that the embodiment of the application reduces the annotation amount by more than 50% without reducing the effect of image semantic segmentation.
In order to more clearly describe the image annotation method in the embodiment of the present application, the image annotation method in the embodiment of the present application is described below by taking an example in which the image semantic segmentation model is applied to a gun battle scene in a game scene.
S2.1, the first server obtains the image of the gunfight game.
Specifically, a sample video corresponding to the gun battle game is obtained, and gun battle game images to be labeled are collected from the sample video, where the sampling frequency is one frame every 5 seconds, so as to avoid excessively high similarity between the collected images.
Further, after the gun battle game images are obtained, gun battle game images with high similarity can be screened out, so as to obtain the sample images; the sample images may be, for example, more than 5000 gun battle game images. The manner of screening out gun battle game images with high similarity can refer to the foregoing discussion, and will not be described herein again.
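As a minimal sketch of this sampling and de-duplication step (the use of OpenCV, the histogram-correlation similarity measure and the 0.95 threshold are illustrative assumptions rather than details given by the embodiment):

```python
import cv2

def collect_sample_frames(video_path, interval_s=5, sim_threshold=0.95):
    """Sample one frame every `interval_s` seconds and drop near-duplicate frames."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = int(fps * interval_s)                 # frames to skip between samples
    kept, kept_hists = [], []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            # Compare a normalized color histogram against already-kept frames.
            hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
            hist = cv2.normalize(hist, hist).flatten()
            is_dup = any(cv2.compareHist(hist, h, cv2.HISTCMP_CORREL) > sim_threshold
                         for h in kept_hists)
            if not is_dup:
                kept.append(frame)
                kept_hists.append(hist)
        idx += 1
    cap.release()
    return kept
```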
S2.2, the first server extracts the image edge.
After the first server obtains the gun battle game images, it can preprocess them; specifically, the edge information in each gun battle game image is extracted. The edge information can be understood as the outline information of objects: because the two sides of an edge usually belong to targets of different categories, the edge information can subsequently assist in picking image blocks from the gun battle game image.
Specifically, the first server may perform graying processing on the gun battle game image, and extract the edges of the grayed gun battle game image based on the Canny edge detection algorithm, so as to obtain an edge image, in which the pixel value at an object edge is 1 and the pixel value at a non-edge position is 0.
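A minimal sketch of this preprocessing step is given below; the Canny thresholds of 100 and 200 are assumptions, not values specified by the embodiment:

```python
import cv2
import numpy as np

def extract_edge_image(image_bgr, low_thresh=100, high_thresh=200):
    """Gray the image and return a binary edge map (1 at object edges, 0 elsewhere)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low_thresh, high_thresh)   # values are 0 or 255
    return (edges > 0).astype(np.uint8)                # values are 0 or 1, as in the edge image
```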
S2.3, the first server determines the target image block.
After the first server obtains the edge image, the gun battle game image may be divided into a plurality of image blocks, for example into 4 × 4, that is, 16 image blocks, and the number of edge pixel points included in each image block is counted, where an edge pixel point is a pixel point whose value in the edge image is 1. The first server then calculates, according to the number of edge pixel points of each image block, the random probability of the image block being selected; the manner of calculating the random probability may refer to the content discussed above, and is not described here again.
The first server may randomly select the target image block according to a random probability, for example, the first server may select 8 image blocks from 16 image blocks.
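A minimal sketch of this block screening step follows; using each block's share of the total edge pixels directly as its selection probability, and sampling without replacement, are assumptions about how the random selection is realized:

```python
import numpy as np

def select_target_blocks(edge_image, grid=(4, 4), num_select=8, rng=None):
    """Split the edge map into grid blocks and sample blocks with probability
    proportional to their share of edge pixels."""
    rng = rng or np.random.default_rng()
    h, w = edge_image.shape
    bh, bw = h // grid[0], w // grid[1]
    blocks, counts = [], []
    for r in range(grid[0]):
        for c in range(grid[1]):
            block = edge_image[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            blocks.append((r, c))
            counts.append(int(block.sum()))      # edge pixels have value 1
    probs = np.array(counts, dtype=float)
    probs = probs / probs.sum() if probs.sum() > 0 else np.full(len(counts), 1 / len(counts))
    chosen = rng.choice(len(blocks), size=num_select, replace=False, p=probs)
    return [blocks[i] for i in chosen]
```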
In one possible embodiment, the first server generates a binarized mask image.
Specifically, after the target image blocks are obtained, each pixel point in the target image blocks may be labeled as a first value, and the other image blocks in the plurality of image blocks may be labeled as a second value, so as to generate the mask image.
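Continuing the sketch above, the binarized mask image could be generated as follows; labeling target-block pixels as 1 and all other pixels as 0 matches the edge-image convention but is otherwise an assumption:

```python
import numpy as np

def build_mask_image(edge_image, target_blocks, grid=(4, 4), first_value=1, second_value=0):
    """Mark pixels of the target blocks with `first_value` and all other pixels with `second_value`."""
    h, w = edge_image.shape
    bh, bw = h // grid[0], w // grid[1]
    mask = np.full((h, w), second_value, dtype=np.uint8)
    for r, c in target_blocks:
        mask[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw] = first_value
    return mask
```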
And S2.4, marking the target image block.
The first server can perform category labeling on each pixel point in the target image blocks according to manually input category labels for the pixels of the target image blocks, where the categories to be labeled specifically include: person, grassland, tree, house, background, door and window. For example, if 8 target image blocks are screened from 16 image blocks, only those 8 target image blocks need to be labeled, which reduces the labeling amount by 50% and reduces the labor cost of labeling.
Based on the image annotation method, an embodiment of the present application provides a training method for an image semantic segmentation model, which is performed by a second server, for example, and is introduced with reference to a flowchart of the training method for an image semantic segmentation model shown in fig. 5:
s501, the second server obtains the labeling result of the sample image.
The second server can obtain the sample image and the labeling result of the sample image from the first server, and can also obtain a mask image and the like. Alternatively, the second server may also obtain the annotation result of the sample image by the image annotation method discussed above. The specific process of obtaining the labeling result of the sample image can refer to the content discussed above, and is not described herein again.
It should be noted that the second server may obtain the annotation result of each sample image in the plurality of sample images in the above manner, so that the plurality of sample images and the annotation result of each sample image can be used to train the image semantic segmentation model.
And S502, the second server carries out multiple iterative training on the image semantic segmentation model according to the sample image.
The second server may perform iterative training on the image semantic segmentation model multiple times; the process of each iteration is similar, and one iteration is described below:
s3.1, the second server inputs the sample image into the image semantic segmentation model to obtain a semantic segmentation result.
In one iteration of training, the second server may input a batch of sample images; the batch contains, for example, one or more sample images, and the batch size used by the second server may be set according to training requirements and may be the same or different between iterations.
When a plurality of sample images are input, the image semantic segmentation model can output the semantic segmentation result of each sample image respectively, and the semantic segmentation result corresponding to each sample image includes the probability that each pixel point in the sample image belongs to each category. For example, if there are 7 categories in total, the image semantic segmentation model may output, for each pixel point, the probability that it belongs to each of the 7 categories.
And S3.2, the second server adjusts model parameters of the image semantic segmentation model according to the labeling result and the semantic segmentation result.
After obtaining the semantic segmentation result of the sample image, the second server may adjust the model parameters of the image semantic segmentation model according to the labeling result of each pixel point in the at least one target image block of the sample image and the semantic segmentation result of the pixel points corresponding to the at least one target image block. The second server may determine the semantic segmentation result corresponding to a pixel point according to the position, in the sample image, of the pixel point carrying a category label; alternatively, the second server may obtain the mask image from the first server and, since the mask image contains the position information of the target image blocks, determine the positions of the target image blocks based on the mask image.
How to adjust the model parameters of the image semantic segmentation model is illustrated by the following example:
The second server can determine the value of a loss function according to the labeling result of each pixel point in the at least one target image block of the sample image and the semantic segmentation result of the pixel points corresponding to the at least one target image block, and then adjust the model parameters of the image semantic segmentation model according to the value of the loss function, so as to reduce the difference between the labeling result and the semantic segmentation result output by the image semantic segmentation model.
There are various ways to express the loss function, for example, the loss function can be expressed as follows:
$$L = -\sum_{i=1}^{K}\sum_{p=1}^{P} m_{i,p}\sum_{c=1}^{C} y_{i,p,c}\,\log \hat{y}_{i,p,c}$$

wherein $K$ represents the number of sample images input to the image semantic segmentation model, $P$ is the total number of pixel points included in a sample image, $C$ is the total number of categories, $m_{i,p}$ is the value corresponding to the p-th pixel point in the mask image of the i-th sample image, $\hat{y}_{i,p,c}$ is the probability that the p-th pixel point in the i-th sample image belongs to the c-th category, and $y_{i,p,c}$ is the labeling result indicating whether the p-th pixel point in the i-th sample image belongs to the c-th category: if the category corresponding to the p-th pixel point is c, $y_{i,p,c}$ is 1, otherwise $y_{i,p,c}$ is 0. In other words, the loss is a cross-entropy masked by $m_{i,p}$, so only the pixel points inside the labeled target image blocks contribute to it.
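As an illustrative sketch of this masked loss (the use of PyTorch, the tensor shapes and the clamping constant are assumptions made for the sketch, not details given by the embodiment):

```python
import torch

def masked_cross_entropy(pred_probs, labels_onehot, mask):
    """pred_probs:    (K, P, C) probabilities output by the segmentation model
       labels_onehot: (K, P, C) one-hot labeling results y_{i,p,c}
       mask:          (K, P)    m_{i,p}, 1 inside labeled target blocks, 0 elsewhere"""
    log_probs = torch.log(pred_probs.clamp_min(1e-8))       # clamp for numerical stability
    per_pixel = -(labels_onehot * log_probs).sum(dim=-1)    # cross-entropy per pixel
    return (per_pixel * mask).sum()                         # only labeled pixels contribute
```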
And S503, until the image semantic segmentation model is converged, obtaining the trained image semantic segmentation model.
When the image semantic segmentation model has undergone multiple iterations of training, if the image semantic segmentation model converges, it is determined that training is finished and the trained image semantic segmentation model is obtained. Convergence of the image semantic segmentation model may mean that the value of the loss function is smaller than a loss threshold, or that the number of iterations reaches a preset number, and so on; the specific conditions of convergence are not limited in the present application.
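For completeness, a compact sketch of the iterative training loop with the convergence checks mentioned above; the Adam optimizer, the learning rate, the thresholds and the data layout yielded by the loader are assumptions, and masked_cross_entropy refers to the sketch given earlier:

```python
import itertools
import torch

def train(model, loader, loss_threshold=1e-3, max_iters=100000):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for it, (images, labels_onehot, masks) in enumerate(itertools.cycle(loader)):
        probs = model(images)                                    # (K, C, H, W) semantic segmentation result
        k, c, h, w = probs.shape
        loss = masked_cross_entropy(
            probs.permute(0, 2, 3, 1).reshape(k, h * w, c),      # -> (K, P, C)
            labels_onehot.reshape(k, h * w, c),                   # assumes (K, H, W, C) one-hot labels
            masks.reshape(k, h * w))                              # assumes (K, H, W) mask image
        optimizer.zero_grad()
        loss.backward()                                           # adjust model parameters via the loss
        optimizer.step()
        if loss.item() < loss_threshold or it + 1 >= max_iters:   # convergence conditions
            break
    return model
```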
In the embodiment of the application, the image semantic segmentation model is trained with sample images in which only the target image blocks carry category labels, so the labeling amount of the sample images can be reduced. In addition, because only part of the image blocks in a sample image have labeling results, when the loss function is calculated it is not necessary to compute the loss of every pixel point in the sample image; only the loss of the pixel points in the labeled image blocks is computed, which relatively reduces the amount of calculation. Furthermore, since not all pixel points in the sample image are labeled, the possibility of overfitting of the image semantic segmentation model is reduced, and the generalization performance of the trained model is better.
Referring to fig. 6, as an embodiment, the image semantic segmentation model includes a feature extraction module and a category output module, and the following describes an example process of outputting a semantic segmentation result by the image semantic segmentation model with reference to the model structure shown in fig. 6:
the feature extraction module is used for extracting the depth features of the sample images, and the category output module is used for outputting the probability that the sample images belong to each category according to the depth features.
Referring to fig. 6, the feature extraction module can be implemented by a MobileNet network, such as MobileNet V2. MobileNet V2 may be pre-trained with images in the ImageNet database. The category output module may be implemented by a convolution layer, an active layer, and an up-sampling layer, and specifically includes, as shown in fig. 6, a first active layer, a first convolution layer, a second active layer, a second convolution layer, a third active layer, a first up-sampling layer, a fourth active layer, a second up-sampling layer, a fifth active layer, a third up-sampling layer, a sixth active layer, a fourth up-sampling layer, a seventh active layer, and a fifth up-sampling layer, which are sequentially connected.
Specifically, the MobileNet V2 network is used as the feature extraction module: it extracts convolution features of the sample image and outputs a corresponding feature map. The convolution layers then perform convolution processing on the feature map, and the 5 upsampling layers expand the scale of the feature map. Each upsampling layer inserts zero points into the feature map it receives to enlarge it, and then performs a convolution operation on the enlarged feature map; for example, the width and height of the output feature map can each be 2 times those of the input feature map. In the last upsampling layer, the number of output channels is C, each channel corresponding to one category, so that the probability that each pixel point belongs to each of the different categories is obtained.
For example, the second server may process the sample image, the annotation result, and the mask image into preset sizes, respectively, for example, each into 640 × 360 × 3, where 3 represents the number of channels.
The processed sample image is input into the image semantic segmentation model shown in fig. 6, and is subjected to network processing by MobileNet V2 to obtain a first feature map.
The first feature map is passed sequentially through the first activation layer and the first convolution layer, where the convolution kernel size of the first convolution layer is 4, the stride is 2, and the number of output channels is 512, so as to obtain a second feature map.
Similarly, the second feature map is passed sequentially through the second activation layer and the second convolution layer to obtain a third feature map with 512 channels. The third feature map is input into the third activation layer and the first upsampling layer, so as to obtain a fourth feature map with 512 channels. The fourth feature map is input sequentially into the fourth activation layer and the second upsampling layer to obtain a fifth feature map with 256 channels. The fifth feature map is input sequentially into the fifth activation layer and the third upsampling layer to obtain a sixth feature map with 128 channels. The sixth feature map is input sequentially into the sixth activation layer and the fourth upsampling layer to obtain a seventh feature map with 64 channels. The seventh feature map is input into the seventh activation layer and the fifth upsampling layer to obtain a semantic segmentation image with 7 channels, whose size is 640 × 360 × 7.
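A minimal PyTorch sketch of the structure described above is given below; the use of torchvision's mobilenet_v2 backbone, ReLU as the activation, and ConvTranspose2d as the "insert zeros then convolve" upsampling are assumptions about how the layers in fig. 6 might be realized:

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class SegModel(nn.Module):
    def __init__(self, num_classes=7):
        super().__init__()
        # Feature extraction module, pre-trained on ImageNet as described above.
        self.backbone = mobilenet_v2(weights="IMAGENET1K_V1").features
        def up(cin, cout):
            # Activation followed by an upsampling layer (zero insertion + convolution, x2 size).
            return nn.Sequential(nn.ReLU(),
                                 nn.ConvTranspose2d(cin, cout, kernel_size=4, stride=2, padding=1))
        self.head = nn.Sequential(
            nn.ReLU(), nn.Conv2d(1280, 512, kernel_size=4, stride=2, padding=1),  # first conv layer
            nn.ReLU(), nn.Conv2d(512, 512, kernel_size=4, stride=2, padding=1),   # second conv layer
            up(512, 512),          # first upsampling layer
            up(512, 256),          # second upsampling layer
            up(256, 128),          # third upsampling layer
            up(128, 64),           # fourth upsampling layer
            up(64, num_classes),   # fifth upsampling layer: one channel per category
        )

    def forward(self, x):
        feats = self.backbone(x)              # first feature map
        logits = self.head(feats)
        return torch.softmax(logits, dim=1)   # per-pixel probability of each category
```

Depending on the exact strides chosen, a final interpolation may be needed so that the output resolution matches the 640 × 360 × 7 example above.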
On the basis of the image semantic segmentation model training method, an embodiment of the present application further provides an image semantic segmentation method, which is described below, taking execution by a third server as an example, with reference to the flowchart of the image semantic segmentation method shown in fig. 7:
s701, the third server obtains a target image to be segmented.
S702, the third server inputs the target image into the trained image semantic segmentation model to obtain the category of each pixel point in the target image.
The third server may train the image semantic segmentation model through the image semantic segmentation model training method discussed above, to obtain a trained image semantic segmentation model. Alternatively, the third server may obtain the trained image semantic segmentation model from the second server. The specific process of training the image semantic segmentation model may refer to the content discussed above, and is not described herein again.
The third server can input the target image into the image semantic segmentation model to obtain the category to which each pixel point in the target image belongs: the image semantic segmentation model outputs, for each pixel point, the probability that it belongs to each of a plurality of categories, and the third server can determine the category with the highest probability as the category of that pixel point.
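An illustrative sketch of this inference step, continuing the model sketch above (the preprocessing of the target image and the final resizing are assumptions):

```python
import torch

def segment_image(model, image_tensor):
    """image_tensor: (1, 3, H, W). Returns an (H, W) map of per-pixel category indices."""
    model.eval()
    with torch.no_grad():
        probs = model(image_tensor)                      # (1, C, h, w) per-category probabilities
        probs = torch.nn.functional.interpolate(probs, size=image_tensor.shape[-2:],
                                                mode="bilinear", align_corners=False)
        return probs.argmax(dim=1)[0]                    # category with the highest probability per pixel
```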
In the embodiment of the application, semantic segmentation can be performed on the target image using the trained image semantic segmentation model, and since not all pixel points in the sample images need to be labeled when training the image semantic segmentation model, the image labeling amount is relatively reduced. In addition, because the sample images are not completely labeled, the uncertainty of the labeling results of the sample images is increased, which can improve the processing capability of the image semantic segmentation model and the accuracy of the segmentation results it obtains.
Further, the third server may use the image semantic segmentation model in a specific application scenario, and after obtaining the category to which each pixel point in the target image belongs, the third server may use the category to which each pixel point in the target image belongs to execute a corresponding task.
For example, the third server may use the image semantic segmentation model in a game application scene, so that the target image corresponds to a game scene image. After determining the game scene item category to which each pixel point in the game scene image belongs, the third server may control an artificial intelligence game character to move to a position corresponding to a preset game scene item category, so as to control the character to complete a corresponding task, for example house exploration or vehicle driving.
Or, for example, the third server may use the image semantic segmentation model in an automatic driving application scenario, so that the target image corresponds to a traffic route map, and the third server may navigate the vehicle after determining the category to which each pixel point in the traffic route map belongs.
Based on the same inventive concept, an embodiment of the present application provides an image annotation apparatus, which can implement the function of the first server discussed above, with reference to fig. 8, the apparatus includes:
a dividing module 801, configured to divide a sample image to be labeled into a plurality of image blocks;
a determining module 802, configured to determine edge pixel points with edge features in the plurality of image blocks respectively;
the screening module 803 is configured to screen out, from the plurality of image blocks, at least one target image block whose number of edge pixel points meets a preset screening condition;
and the labeling module 804 is configured to perform category labeling on at least one target image block to obtain a labeling result of the sample image.
In a possible embodiment, the apparatus further includes an obtaining module 805, where the obtaining module 805 is configured to:
before dividing a sample image to be labeled into a plurality of image blocks, sampling a sample video according to a preset sampling interval to obtain a plurality of candidate sample images;
determining a similarity between any two candidate sample images among the plurality of candidate sample images;
if any two candidate sample images with the similarity larger than the preset similarity exist, one of the two candidate sample images is removed;
and taking the residual candidate sample images as sample images to be labeled.
In a possible embodiment, the determining module 802 is specifically configured to:
carrying out graying processing on the sample image to obtain a grayed sample image;
and carrying out edge detection processing on the grayed sample image to obtain edge pixel points with edge characteristics in the image blocks.
In a possible embodiment, the determining module 802 is specifically configured to:
and determining pixel points with preset gray values in the grayed sample image as edge pixel points.
In a possible embodiment, the screening module 803 is specifically configured to:
determining the ratio of the number of edge pixel points included in each image block to the total number of all edge pixel points of the sample image;
and determining at least one target image block from the plurality of image blocks according to the corresponding ratio of each image block.
In a possible embodiment, the screening module 803 is specifically configured to:
sorting the ratios of the plurality of image blocks from large to small, and determining image blocks corresponding to the first N ratios as target image blocks, wherein N is a preset natural number; or,
determining an image block with a ratio not less than a preset ratio as a target image block; or,
and randomly selecting at least one target image block from the plurality of image blocks by taking the ratio of the plurality of image blocks as random probability.
In one possible embodiment, the tagging module 804 is further configured to:
after at least one target image block with the number of edge pixel points meeting a preset screening condition is screened out from the plurality of image blocks, marking the at least one target image block as a first identifier, marking other image blocks except the at least one target image block in the plurality of image blocks as second identifiers, and obtaining a mask image, wherein the first identifier is different from the second identifier, and the mask image is used for training an image semantic segmentation model.
In a possible embodiment, the sample image is a game scene image with a preset behavior, and the labeling module 804 is specifically configured to:
labeling game scene articles to which each pixel point belongs in at least one target image block in the game scene graph according to a plurality of preset game scene article types to obtain labeling results.
Based on the same inventive concept, an embodiment of the present application provides an image semantic segmentation model training apparatus, which may be used to implement the functions of the second server discussed above, please refer to fig. 9, and the apparatus includes:
an obtaining module 901, configured to obtain an annotation result of the sample image by using any one of the image annotation methods described above;
the training module 902 is used for performing multiple iterative training on the image semantic segmentation model according to the sample image;
an obtaining module 903, configured to obtain a trained image semantic segmentation model until the image semantic segmentation model converges;
the training module 902 is configured to execute the following processes to implement each iterative training in multiple iterative training of the image semantic segmentation model:
inputting a sample image into an image semantic segmentation model to obtain a semantic segmentation result, wherein the semantic segmentation result comprises the probability that each pixel point in the sample image belongs to each category;
and adjusting model parameters of the image semantic segmentation model according to the labeling result of each pixel point in at least one target image block in the sample image and the semantic segmentation result of the pixel point corresponding to at least one target image block in the semantic segmentation result.
Based on the same inventive concept, an embodiment of the present application provides an image semantic segmentation apparatus, which may be used to implement the function of the foregoing third server, please refer to fig. 10, and the apparatus includes:
an obtaining module 1001 configured to obtain a target image to be segmented;
an obtaining module 1002, configured to input the target image into the trained image semantic segmentation model obtained by any image semantic segmentation model training method discussed above, and obtain a category to which each pixel point in the target image belongs.
In one possible embodiment, the target image is a game scene image; the apparatus further comprises a control module 1003, the control module 1003 further configured to:
after the category of each pixel point in the target image is obtained, the artificial intelligent game role is controlled to move to a position corresponding to a preset category according to the category of each pixel point in the target image so as to execute a corresponding task.
Based on the same inventive concept, the present application provides a computer device, please refer to fig. 11, which includes a processor 1101 and a memory 1102.
The processor 1101 may be a Central Processing Unit (CPU), or a digital processing unit, etc. The specific connection medium between the memory 1102 and the processor 1101 is not limited in the embodiment of the present application. In the embodiment of the present application, the memory 1102 and the processor 1101 are connected by a bus 1103 in fig. 11, the bus 1103 is indicated by a thick line in fig. 11, and the connection manner between other components is merely illustrative and not limited thereto. The bus 1103 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 11, but this is not intended to represent only one bus or type of bus.
Memory 1102 may be a volatile memory (volatile memory), such as a random-access memory (RAM); the memory 1102 may also be a non-volatile memory (non-volatile memory) such as, but not limited to, a read-only memory (rom), a flash memory (flash memory), a Hard Disk Drive (HDD) or a solid-state drive (SSD), or the memory 1102 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Memory 1102 may be a combination of the memories described above.
A processor 1101 for executing the image annotation method as any one of the above discussed methods when invoking the computer program stored in the memory 1102, and may also be used for implementing the functions of the apparatus in fig. 8, and may also be used for implementing the functions of the first server as discussed above.
Based on the same inventive concept, an embodiment of the present application provides a computer apparatus, please refer to fig. 12, which includes a processor 1201 and a memory 1202. The contents of the processor 1201 and the memory 1202 may refer to those discussed above and are not described herein.
The memory 1202 stores a computer program. The processor 1201 is configured to execute any of the image semantic segmentation model training methods as discussed above when invoking the computer program stored in the memory 1202, and may also be configured to implement the functions of the apparatus in fig. 9, and may also be configured to implement the functions of the second server as discussed above.
Based on the same inventive concept, an embodiment of the present application provides a computer device, please refer to fig. 13, which includes a processor 1301 and a memory 1302. The contents of the processor 1301 and the memory 1302 can refer to the contents discussed above and are not described in detail herein.
The memory 1302 stores a computer program. The processor 1301 is configured to execute any of the image semantic segmentation methods as discussed above when invoking the computer program stored in the memory 1302, and may also be used to implement the functionality of the apparatus in fig. 10, and may also be used to implement the functionality of the third server discussed above.
Based on the same inventive concept, embodiments of the present application provide a computer storage medium storing computer instructions that, when executed on a computer, cause the computer to perform any one of the image annotation methods, the image semantic segmentation model training methods, or the image semantic segmentation methods discussed above.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Based on the same inventive concept, the embodiments of the present application provide a computer program product, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to perform any one of the image annotation methods, the image semantic segmentation model training methods, or the image semantic segmentation methods discussed above.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the prior art may be embodied in the form of a software product stored in a storage medium, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (15)

1. An image annotation method, comprising:
dividing a sample image to be labeled into a plurality of image blocks;
respectively determining edge pixel points with edge characteristics in the image blocks;
screening out at least one target image block of which the number of edge pixel points meets a preset screening condition from the plurality of image blocks;
and performing category labeling on the at least one target image block to obtain a labeling result of the sample image.
2. The method of claim 1, wherein prior to dividing the sample image to be annotated into a plurality of image blocks, the method further comprises:
sampling the sample video according to a preset sampling interval to obtain a plurality of candidate images;
determining a similarity between any two candidate images among the plurality of candidate images;
if any two candidate images with the similarity larger than the preset similarity exist, one of the two candidate images is removed;
and taking the residual candidate images as sample images to be labeled.
3. The method as claimed in claim 1, wherein said determining edge pixels having edge characteristics in said plurality of image blocks respectively comprises:
carrying out graying processing on the sample image to obtain a grayed sample image;
and carrying out edge detection processing on the grayed sample image to obtain edge pixel points with edge characteristics in the image blocks.
4. The method as claimed in claim 1, wherein the screening out, from the plurality of image blocks, at least one target image block whose number of edge pixels meets a preset screening condition includes:
determining the ratio of the number of edge pixel points included in each image block to the total number of all edge pixel points of the sample image;
and determining at least one target image block from the plurality of image blocks according to the ratio corresponding to each image block.
5. The method as claimed in claim 4, wherein said determining at least one target image block from the plurality of image blocks according to the ratio corresponding to each image block comprises:
sorting the ratios of the plurality of image blocks from large to small, and determining image blocks corresponding to the first N ratios as target image blocks, wherein N is a preset natural number; or,
determining an image block with a ratio not less than a preset ratio as a target image block; or,
and randomly selecting at least one target image block from the plurality of image blocks by taking the ratio of the plurality of image blocks as the probability.
6. The method according to any one of claims 1 to 5, wherein after at least one target image block, from the plurality of image blocks, is screened out, where the number of edge pixel points meets a preset screening condition, the method further comprises:
and marking the at least one target image block as a first identifier, marking other image blocks except the at least one target image block in the plurality of image blocks as second identifiers, and obtaining a mask image, wherein the first identifier is different from the second identifier, and the mask image is used for training an image semantic segmentation model.
7. The method according to any one of claims 1 to 5, wherein the sample image is a game scene image having a preset behavior;
the performing category labeling on the at least one target image block to obtain a labeling result of the sample image includes:
and labeling the game scene articles to which the pixel points belong in at least one target image block in the game scene graph according to a plurality of preset game scene article types to obtain a labeling result.
8. A training method of an image semantic segmentation model is characterized by comprising the following steps:
obtaining an annotation result of the sample image by the method according to any one of claims 1 to 7;
performing multiple iterative training on the image semantic segmentation model according to the sample image;
until the image semantic segmentation model is converged, obtaining a trained image semantic segmentation model;
wherein, each iterative training in the iterative training of the image semantic segmentation model for a plurality of times comprises the following steps:
inputting a sample image into an image semantic segmentation model to obtain a semantic segmentation result, wherein the semantic segmentation result comprises the probability that each pixel point in the sample image belongs to each category;
and adjusting model parameters of an image semantic segmentation model according to the labeling result of each pixel point in at least one target image block in the sample image and the semantic segmentation result of the pixel point corresponding to the at least one target image block in the semantic segmentation result.
9. An image semantic segmentation method, comprising:
acquiring a target image to be segmented;
inputting the target image into a trained image semantic segmentation model obtained by the method according to claim 8, and obtaining the category to which each pixel point in the target image belongs.
10. The method of claim 9, wherein the target image is a game scene image;
after the obtaining of the category to which each pixel point in the target image belongs, the method further includes:
and controlling the artificial intelligent game role to move to a position corresponding to the preset game scene article category according to the game scene article category to which each pixel point in the game scene image belongs so as to execute a corresponding task.
11. An image annotation apparatus, comprising:
the dividing module is used for dividing the sample image to be labeled into a plurality of image blocks;
the determining module is used for determining edge pixel points with edge characteristics in the image blocks respectively;
the screening module is used for screening at least one target image block of which the number of edge pixel points meets a preset screening condition from the plurality of image blocks;
and the labeling module is used for performing category labeling on each pixel point in the at least one target image block to obtain a labeling result of the sample image, and the labeling result is used for training an image semantic segmentation model.
12. An image semantic segmentation model training device is characterized by comprising:
the acquisition module is used for acquiring the labeling result of the sample image by any image labeling method;
the training module is used for carrying out repeated iterative training on the image semantic segmentation model according to the sample image;
the obtaining module is used for obtaining the trained image semantic segmentation model until the image semantic segmentation model converges;
the training module is used for executing the following process so as to realize each iterative training in the multiple iterative training of the image semantic segmentation model:
inputting a sample image into an image semantic segmentation model to obtain a semantic segmentation result, wherein the semantic segmentation result comprises the probability that each pixel point in the sample image belongs to each category;
and adjusting model parameters of an image semantic segmentation model according to the labeling result of each pixel point in at least one target image block in the sample image and the semantic segmentation result of the pixel point corresponding to the at least one target image block in the semantic segmentation result.
13. An image semantic segmentation apparatus, comprising:
the acquisition module is used for acquiring a target image to be segmented;
and the obtaining module is used for inputting the target image into the trained image semantic segmentation model obtained by any image semantic segmentation model training method discussed above, and obtaining the category to which each pixel point in the target image belongs.
14. A computer device, comprising:
at least one processor, and
a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to implement the method of any one of claims 1-7 or 8 or 9-10 by executing the instructions stored by the memory.
15. A computer storage medium storing computer instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1-7 or 8 or 9-10.
CN202110066493.9A 2021-01-19 2021-01-19 Image labeling, image semantic segmentation and model training methods and devices Active CN112734775B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110066493.9A CN112734775B (en) 2021-01-19 2021-01-19 Image labeling, image semantic segmentation and model training methods and devices

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110066493.9A CN112734775B (en) 2021-01-19 2021-01-19 Image labeling, image semantic segmentation and model training methods and devices

Publications (2)

Publication Number Publication Date
CN112734775A true CN112734775A (en) 2021-04-30
CN112734775B CN112734775B (en) 2023-07-07

Family

ID=75592203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110066493.9A Active CN112734775B (en) 2021-01-19 2021-01-19 Image labeling, image semantic segmentation and model training methods and devices

Country Status (1)

Country Link
CN (1) CN112734775B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156868A (en) * 2011-03-31 2011-08-17 汉王科技股份有限公司 Image binaryzation method and device
CN103473571A (en) * 2013-09-12 2013-12-25 天津大学 Human detection method
WO2019179269A1 (en) * 2018-03-21 2019-09-26 广州极飞科技有限公司 Method and apparatus for acquiring boundary of area to be operated, and operation route planning method
CN108921836A (en) * 2018-06-28 2018-11-30 京东方科技集团股份有限公司 A kind of method and device for extracting eye fundus image mark
WO2020192469A1 (en) * 2019-03-26 2020-10-01 腾讯科技(深圳)有限公司 Method and apparatus for training image semantic segmentation network, device, and storage medium
CN112132197A (en) * 2020-09-15 2020-12-25 腾讯科技(深圳)有限公司 Model training method, image processing method, device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
伍佳;梅天灿;: "顾及区域信息的卷积神经网络在影像语义分割中的应用", 科学技术与工程, no. 21, pages 276 - 281 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113628159A (en) * 2021-06-16 2021-11-09 维库(厦门)信息技术有限公司 Full-automatic training method and device based on deep learning network and storage medium
WO2023273069A1 (en) * 2021-06-30 2023-01-05 深圳市慧鲤科技有限公司 Saliency detection method and model training method and apparatus thereof, device, medium, and program
CN113706552A (en) * 2021-07-27 2021-11-26 北京三快在线科技有限公司 Method and device for generating semantic segmentation marking data of laser reflectivity base map
CN113470051A (en) * 2021-09-06 2021-10-01 阿里巴巴达摩院(杭州)科技有限公司 Image segmentation method, computer terminal and storage medium
CN114092489A (en) * 2021-11-02 2022-02-25 清华大学 Porous medium seepage channel extraction and model training method, device and equipment
CN114092489B (en) * 2021-11-02 2023-08-29 清华大学 Porous medium seepage channel extraction and model training method, device and equipment
CN113989610A (en) * 2021-12-27 2022-01-28 广州思德医疗科技有限公司 Intelligent image labeling method, device and system
CN114625923A (en) * 2022-03-18 2022-06-14 北京百度网讯科技有限公司 Training method of video retrieval model, video retrieval method, device and equipment
CN114625923B (en) * 2022-03-18 2024-01-09 北京百度网讯科技有限公司 Training method of video retrieval model, video retrieval method, device and equipment
CN114882313A (en) * 2022-05-17 2022-08-09 阿波罗智能技术(北京)有限公司 Method and device for generating image annotation information, electronic equipment and storage medium
CN115063718B (en) * 2022-06-10 2023-08-29 嘉洋智慧安全科技(北京)股份有限公司 Fire detection method, device, equipment and storage medium
CN115063718A (en) * 2022-06-10 2022-09-16 嘉洋智慧安全生产科技发展(北京)有限公司 Fire detection method, fire detection device, fire detection apparatus, fire detection program, storage medium, and storage medium
CN115294488A (en) * 2022-10-10 2022-11-04 江西财经大学 AR rapid object matching display method
CN116258864A (en) * 2023-05-15 2023-06-13 山东昆仲信息科技有限公司 Village planning construction big data management system
CN116630629A (en) * 2023-07-21 2023-08-22 深圳思谋信息科技有限公司 Domain adaptation-based semantic segmentation method, device, equipment and storage medium
CN116630629B (en) * 2023-07-21 2023-11-03 深圳思谋信息科技有限公司 Domain adaptation-based semantic segmentation method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN112734775B (en) 2023-07-07

Similar Documents

Publication Publication Date Title
CN112734775B (en) Image labeling, image semantic segmentation and model training methods and devices
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
US11631248B2 (en) Video watermark identification method and apparatus, device, and storage medium
CN107609525B (en) Remote sensing image target detection method for constructing convolutional neural network based on pruning strategy
CN112131978B (en) Video classification method and device, electronic equipment and storage medium
CN112966691B (en) Multi-scale text detection method and device based on semantic segmentation and electronic equipment
CN108960059A (en) A kind of video actions recognition methods and device
CN109002755B (en) Age estimation model construction method and estimation method based on face image
CN106778852A (en) A kind of picture material recognition methods for correcting erroneous judgement
CN113569895A (en) Image processing model training method, processing method, device, equipment and medium
CN114283350A (en) Visual model training and video processing method, device, equipment and storage medium
CN110929848A (en) Training and tracking method based on multi-challenge perception learning model
CN109165658B (en) Strong negative sample underwater target detection method based on fast-RCNN
CN113761259A (en) Image processing method and device and computer equipment
CN111931703B (en) Object detection method based on human-object interaction weak supervision label
US20230053911A1 (en) Detecting an object in an image using multiband and multidirectional filtering
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN113762326A (en) Data identification method, device and equipment and readable storage medium
CN107958219A (en) Image scene classification method based on multi-model and Analysis On Multi-scale Features
CN112926429A (en) Machine audit model training method, video machine audit method, device, equipment and storage medium
CN113870254A (en) Target object detection method and device, electronic equipment and storage medium
CN114330234A (en) Layout structure analysis method and device, electronic equipment and storage medium
CN114067314B (en) Neural network-based peanut mildew identification method and system
CN111310820A (en) Foundation meteorological cloud chart classification method based on cross validation depth CNN feature integration
CN107633527B (en) Target tracking method and device based on full convolution neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40042652

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant