CN114926480A - Method, device and equipment for training image segmentation model and storage medium

Method, device and equipment for training image segmentation model and storage medium

Info

Publication number
CN114926480A
Authority
CN
China
Prior art keywords
image
popularization
target
feature
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210596637.6A
Other languages
Chinese (zh)
Inventor
蔡焕洽
龚丽君
李志鋒
刘威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210596637.6A
Publication of CN114926480A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a method, an apparatus, a device and a storage medium for training an image segmentation model, which can be applied to fields such as artificial intelligence and intelligent transportation and is used to solve the problem of low image segmentation accuracy and reliability for popularization images related to music, video or voice. The method comprises at least the following steps: performing feature extraction on an obtained sample popularization image to obtain an initial global feature; extracting an edge sub-feature and a position sub-feature from the initial global feature, wherein the edge sub-feature characterizes the edge boundary of the at least one popularization target, and the position sub-feature characterizes the relative position between the at least one popularization target and the sample popularization image; and obtaining a predicted segmentation region of the at least one popularization target based on a fused global feature of the initial global feature, the edge sub-feature and the position sub-feature. The region where the popularization target is located is thus constrained from three different angles, which improves image segmentation accuracy and reliability.

Description

Method, device and equipment for training image segmentation model and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for training an image segmentation model.
Background
With the continuous development of science and technology, more and more devices can provide image segmentation services, and the image segmentation services can be used for acquiring a foreground region containing a target in an image.
The device can process the image by adopting a target image segmentation model obtained by multiple rounds of iterative training to obtain a foreground region of the target in the image so as to provide image segmentation service.
In the traditional method for training an image segmentation model, the image scene of the sample image is simple, or the composition of the sample target is simple, so the process of training the image segmentation model is straightforward. For example, the device may perform feature extraction on the sample image by using an image segmentation model, predict a foreground region of the sample target in the sample image based on the extracted image features, and train the image segmentation model based on an error between the predicted foreground region and a labeled region of the sample image.
However, in the cover image of a promotional video, in order to sufficiently show information such as the performance or advantages of the promotion target, the image scene of the cover image may be complicated, and the promotion target may be composed of multiple people or objects. Therefore, when the target image segmentation model obtained by the traditional training method is used to process the cover image of a promotional video, the region where an object other than the promotion target is located may be determined as part of the foreground region together with the region where the promotion target is located, because the image scene is relatively complex; it may also happen that only the region where one component is located is determined as the foreground region, because the promotion target has multiple components. As a result, the foreground region obtained through the trained target image segmentation model is inaccurate and has low reliability.
Therefore, in the related art, the image segmentation accuracy and reliability of the cover image of the promoted video are low.
Disclosure of Invention
The embodiment of the application provides a method and a device for training an image segmentation model, computer equipment and a storage medium, which are used for solving the problems of low image segmentation accuracy and reliability of cover images of promoted videos.
In a first aspect, a method for training an image segmentation model is provided, including:
obtaining a sample popularization image set, wherein each sample popularization image comprises at least one popularization target and a corresponding segmentation label;
performing multiple rounds of iterative training on the image segmentation model to be trained based on the sample popularization image set to obtain a trained target image segmentation model, and at least performing the following operations in each round of iterative training process:
performing feature extraction on the obtained sample popularization image to obtain an initial global feature;
extracting edge sub-features and position sub-features from the initial global features, wherein the edge sub-features characterize edge boundaries of the at least one promotional object, and the position sub-features characterize relative positions between the at least one promotional object and the sample promotional image;
and obtaining a prediction segmentation region of the at least one popularization target in the sample popularization image based on the fusion global feature of the initial global feature, the edge sub-feature and the position sub-feature, and adjusting a model parameter of the image segmentation model to be trained based on an error between the prediction segmentation region and a segmentation label corresponding to the at least one popularization target.
In a second aspect, an apparatus for training an image segmentation model is provided, including:
an acquisition module: the method comprises the steps of obtaining a sample popularization image set, wherein each sample popularization image comprises at least one popularization target and a corresponding segmentation label;
a processing module: the method is used for carrying out multi-round iterative training on an image segmentation model to be trained based on the sample popularization image set to obtain a trained target image segmentation model, and at least the following operations are executed in each round of iterative training process:
the processing module is specifically configured to: performing feature extraction on the obtained sample popularization image to obtain an initial global feature;
the processing module is specifically configured to: extracting edge sub-features and position sub-features from the initial global features, wherein the edge sub-features characterize edge boundaries of the at least one promotional object, and the position sub-features characterize relative positions between the at least one promotional object and the sample promotional image;
the processing module is specifically configured to: and obtaining a prediction segmentation region of the at least one promotion target in the sample promotion image based on the fusion global feature of the initial global feature, the edge sub-feature and the position sub-feature, and adjusting a model parameter of the image segmentation model to be trained based on an error between the prediction segmentation region and a segmentation label corresponding to the at least one promotion target.
Optionally, the obtaining module is specifically configured to:
acquiring each promotion video;
taking the cover image of each promotion video as the sample promotion image;
based on a preset segmentation strategy, segmenting at least one promotion target contained in each sample promotion image to obtain corresponding segmentation labels;
and establishing a sample popularization image set based on each sample popularization image and the corresponding segmentation labels thereof.
Optionally, the processing module is specifically configured to:
performing multi-scale feature extraction on the obtained sample popularization image to obtain a plurality of intermediate global features, wherein each intermediate global feature corresponds to different resolutions, and the maximum resolution in each resolution is the same as the resolution of the sample popularization image;
and performing multi-scale fusion processing on the plurality of intermediate global features based on the maximum resolution to obtain the initial global features, wherein the resolution corresponding to the initial global features is the maximum resolution.
Optionally, the processing module is specifically configured to:
performing image segmentation on the sample popularization image based on the initial global features to obtain an initial segmentation region aiming at the at least one popularization target;
extracting regional local features associated with the initial segmentation region from the initial global features;
performing local feature adjustment on the initial global features based on the regional local features to obtain integrated global features;
and performing feature fusion on the integrated global feature, the edge sub-feature and the position sub-feature to obtain a fused global feature, and obtaining a prediction segmentation region of the at least one popularization target in the sample popularization image based on the fused global feature.
Optionally, the processing module is specifically configured to:
respectively extracting pixel features corresponding to all pixels contained in the sample popularization image from the initial global features;
respectively determining the association degree between each pixel characteristic and the local characteristic of the region;
and performing pixel feature adjustment on each pixel feature based on the obtained association degrees to obtain the integrated global features.
Optionally, the processing module is specifically configured to:
respectively taking the relevance degrees as the weight of the corresponding pixel characteristics, and respectively performing weighted fusion on the local characteristics of the region and the pixel characteristics contained in the initial segmentation region to obtain corresponding region pixel fusion characteristics;
and performing feature fusion on the obtained pixel fusion features of each region and the pixel features to obtain the integrated global feature.
Optionally, each segmentation label includes a region label, an edge label and a position label; the processing module is specifically configured to:
obtaining a predicted edge boundary of the at least one promotional object in the sample promotional image based on the edge sub-features;
obtaining a relative position between the at least one promotional object and the sample promotional image based on the position sub-feature;
and adjusting model parameters of the image segmentation model to be trained on the basis of an edge error between the predicted edge boundary and the edge label corresponding to the at least one promotion target, a position error between the relative position and the position label corresponding to the at least one promotion target, and a region error between the predicted segmentation region and the region label corresponding to the at least one promotion target.
Optionally, the processing module is further configured to:
performing multiple rounds of iterative training on an image segmentation model to be trained based on the sample popularization image set to obtain a trained target image segmentation model, and then obtaining an annotation-free popularization image set, wherein each annotation-free popularization image comprises at least one annotation-free popularization target;
respectively carrying out image segmentation on at least one unmarked popularization target contained in each unmarked popularization image by adopting the target image segmentation model to obtain corresponding target segmentation areas;
and performing multiple rounds of iterative training on the target image segmentation model based on the label-free popularization image set and the obtained target segmentation areas to obtain a trained final image segmentation model.
Optionally, the processing module is specifically configured to:
determining the confidence coefficient of each obtained target segmentation region based on a preset confidence coefficient evaluation strategy, wherein the confidence coefficient is used for representing the segmentation accuracy of the corresponding target segmentation region;
taking the target segmentation region with the confidence coefficient larger than a preset confidence coefficient threshold value in each target segmentation region as a corresponding segmentation label of the label-free popularization image;
and performing multiple rounds of iterative training on the target image segmentation model based on the obtained label-free popularization images with segmentation labels to obtain the final image segmentation model.
In a third aspect, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the method according to the first aspect.
In a fourth aspect, there is provided a computer device comprising:
a memory for storing program instructions;
a processor for calling the program instructions stored in the memory and executing the method according to the first aspect according to the obtained program instructions.
In a fifth aspect, there is provided a computer-readable storage medium having stored thereon computer-executable instructions for causing a computer to perform the method of the first aspect.
In the embodiment of the application, the image segmentation model is used to perform feature extraction on the sample popularization image; after the initial global feature is obtained, the edge sub-feature and the position sub-feature are extracted from the initial global feature, and the image segmentation model is then trained based on the fused global feature of the initial global feature, the edge sub-feature and the position sub-feature. The edge sub-feature characterizes the edge boundary of the at least one popularization target in the sample popularization image, so the region of the at least one popularization target in the sample popularization image can be predicted from the angle of the edge boundary through the edge sub-feature. The position sub-feature characterizes the relative position between the at least one popularization target and the sample popularization image, so the region of the at least one popularization target can be predicted from the angle of relative position through the position sub-feature. Through the initial global feature, the region of the at least one popularization target can be predicted from the angle of the semantic information of the at least one popularization target. Therefore, through the fused global feature of the initial global feature, the edge sub-feature and the position sub-feature, the region of the at least one popularization target in the sample popularization image can be constrained from three different angles, and the relevance of the at least one popularization target is enhanced. Meanwhile, the fused global feature, which combines features obtained from three different angles, enriches the semantic information contained in the initial global feature and allows detail correction of the region of the at least one popularization target in the sample popularization image, so that the obtained predicted segmentation region neither omits the popularization target nor mistakenly identifies an object other than the popularization target as a popularization target, which improves the accuracy and reliability of the trained target image segmentation model.
Drawings
FIG. 1A is a schematic diagram illustrating a first principle of a method for training an image segmentation model in the related art;
FIG. 1B is a schematic diagram illustrating a second principle of a method for training an image segmentation model in the related art;
fig. 1C is an application scenario of the method for training an image segmentation model according to the embodiment of the present application;
FIG. 2 is a schematic flowchart of a method for training an image segmentation model according to an embodiment of the present disclosure;
FIG. 3 is a first schematic diagram illustrating a principle of a method for training an image segmentation model according to an embodiment of the present disclosure;
fig. 4A is a schematic diagram of a principle of a method for training an image segmentation model according to an embodiment of the present application;
fig. 4B is a schematic diagram of a principle of a method for training an image segmentation model according to an embodiment of the present application;
fig. 4C is a schematic diagram of a principle of a method for training an image segmentation model according to an embodiment of the present disclosure;
fig. 4D is a schematic diagram illustrating a principle of a method for training an image segmentation model according to an embodiment of the present application;
fig. 5A is a schematic diagram of a principle diagram six of a method for training an image segmentation model according to an embodiment of the present application;
fig. 5B is a schematic diagram seven illustrating a principle of a method for training an image segmentation model according to an embodiment of the present application;
fig. 6A is a schematic diagram eight illustrating a principle of a method for training an image segmentation model according to an embodiment of the present application;
fig. 6B is a schematic diagram illustrating a principle of a method for training an image segmentation model according to an embodiment of the present application;
fig. 7A is a schematic diagram ten illustrating a principle of a method for training an image segmentation model according to an embodiment of the present application;
fig. 7B is a schematic diagram eleven illustrating a principle of a method for training an image segmentation model according to an embodiment of the present application;
FIG. 8 is a first schematic structural diagram of an apparatus for training an image segmentation model according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of a device for training an image segmentation model according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Some terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.
(1) Multi-scale (multi-scale) and multi-scale fusion:
different resolutions constitute different scales, and different layers of features may correspond to different scales.
The multi-scale fusion is to transform the features of different scales into a uniform scale and then perform fusion.
(2) Depthwise separable convolution (DWConv):
Depthwise separable convolution is an improvement on the standard convolution operation in convolutional neural networks. By decoupling the spatial dimension from the channel dimension, it reduces the number of parameters required by the convolution computation. A prototype of the depthwise separable convolution can be seen in the Inception module of convolutional neural networks: the convolution is split into two parts, where each channel is first convolved spatially and the outputs are concatenated, after which a 1 × 1 (pointwise) convolution kernel is applied across channels to obtain the feature map.
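As an illustration of the split described above, the following is a minimal PyTorch sketch of a depthwise separable convolution block; the channel sizes and kernel size are illustrative choices, not values taken from the patent.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Minimal sketch: per-channel spatial convolution followed by a
    1x1 (pointwise) channel convolution, as described above."""
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        # Spatial convolution applied to each channel independently (groups=in_channels).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2, groups=in_channels)
        # 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

# Usage example (shapes are illustrative only).
x = torch.randn(1, 32, 64, 64)
y = DepthwiseSeparableConv(32, 64)(x)
print(y.shape)  # torch.Size([1, 64, 64, 64])
```

For a k × k kernel with C_in input and C_out output channels, this replaces the C_in · C_out · k² weights of a standard convolution with C_in · k² + C_in · C_out weights, which is where the parameter saving comes from.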
The embodiment of the application relates to the field of Artificial Intelligence (AI), is designed based on Computer Vision (CV) technology and Machine Learning (ML) technology, and can be applied to the fields of cloud computing, intelligent traffic, auxiliary driving or maps and the like.
Artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique in computer science that studies the design principles and implementation of various machines in an attempt to understand the essence of intelligence and produces a new intelligent machine that can react in a manner similar to human intelligence, enabling the machine to have the functions of perception, reasoning and decision making.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation, and the like. With the development and progress of artificial intelligence, it is being researched and applied in many fields, such as smart home, intelligent customer service, virtual assistants, smart speakers, intelligent marketing, smart wearable devices, unmanned driving, automatic driving, unmanned aerial vehicles, robots, intelligent medical care, the Internet of Vehicles, and intelligent transportation. With further technological development, artificial intelligence is expected to be applied in more fields and to deliver increasingly important value. The solution provided by the embodiment of the application relates to technologies such as deep learning and augmented reality in artificial intelligence, and is further explained by the following embodiments.
Computer vision is a science that studies how to make machines "see"; it uses cameras and computers, instead of human eyes, to identify, track and measure targets, and further processes the images so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, automatic driving, intelligent transportation and other technologies, as well as common biometric identification technologies such as face recognition and fingerprint recognition.
Machine learning is a multidisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how computers can simulate human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance.
Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. The core of machine learning is deep learning, which is a technology for realizing machine learning. Machine learning generally includes techniques such as deep learning, reinforcement learning, transfer learning, inductive learning, artificial neural networks, and learning from demonstration, and deep learning includes techniques such as convolutional neural networks (CNN), deep belief networks, recurrent neural networks, autoencoders, and generative adversarial networks.
It should be noted that, in the embodiments of the present application, the data related to the promotional image or promotional video, etc. need to be approved or agreed by the user when the above embodiments of the present application are applied to specific products or technologies, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant countries and regions.
The following briefly introduces an application field of the method for training the image segmentation model provided by the embodiment of the present application.
With the continuous development of science and technology, more and more devices can provide image segmentation services, and the image segmentation services can be used for acquiring a foreground region containing a target in an image.
The device can process the image by adopting a target image segmentation model obtained by multiple rounds of iterative training to obtain a foreground region of the target in the image so as to provide image segmentation service.
In the traditional method for training an image segmentation model, the image scene of the sample image is simple, or the composition of the sample target is simple, so the process of training the image segmentation model is straightforward. For example, the device may perform feature extraction on the sample image by using an image segmentation model, predict a foreground region of the sample target in the sample image based on the extracted image features, and train the image segmentation model based on an error between the predicted foreground region and a labeled region of the sample image.
However, in the cover image of a popularization video, in order to sufficiently show information such as the performance or advantages of the popularization target, or to present a background that is helpful for the promotion, the image scene of the cover image may be complicated, and the popularization target may be composed of a large number of people or objects. Referring to fig. 1A, a cover image of a promotional video includes the anchor promoting a mobile phone, the promoted object (i.e., the mobile phone in the anchor's hand), a background desktop, objects on the desktop, and a background wall. The cover image contains many items, of which the promotion target comprises the mobile phone and the anchor, while the other items are unrelated to the promotion, or the anchor is part of a background arranged for the promotion.
Therefore, when the target image segmentation model obtained by the traditional training method is used to process the cover image of a promotional video, the region where an object other than the promotion target is located may be determined as part of the foreground region together with the region where the promotion target is located, because the image scene is relatively complex; it may also happen that only the region where one component is located is determined as the foreground region, because the promotion target has multiple components. As a result, the foreground region obtained through the trained target image segmentation model is inaccurate and has low reliability.
When the target image segmentation model obtained by the conventional training method is used to process the cover image of the promotional video shown in fig. 1A, referring to fig. 1B, it may happen that only the area where the anchor promoting the mobile phone is located is determined as the foreground area; it may also happen that the anchor promoting the mobile phone, the promoted item (i.e., the mobile phone), the background desktop, and the items on the desktop are all determined as the foreground area together, and so on.
Therefore, in the related art, the image segmentation accuracy and reliability of the cover image of the promoted video are low.
In order to solve the problem that image segmentation accuracy and reliability of cover images of promoted videos are low, the application provides a method for training an image segmentation model. According to the method, after a sample popularization image set is obtained, multi-round iterative training is carried out on an image segmentation model to be trained on the basis of the sample popularization image set, and a trained target image segmentation model is obtained, wherein each sample popularization image comprises at least one popularization target and a segmentation label corresponding to the popularization target.
In each iteration training process, at least the following operations are executed: after the obtained sample popularization image is subjected to feature extraction, and initial global features are obtained, edge sub-features and position sub-features are extracted from the initial global features, wherein the edge sub-features represent the edge boundary of at least one popularization target, and the position sub-features represent the relative position between the at least one popularization target and the sample popularization image. And obtaining a prediction segmentation region of at least one popularization target in the sample popularization image based on the fusion global features of the initial global feature, the edge sub-feature and the position sub-feature, and adjusting model parameters of the image segmentation model to be trained based on errors between the prediction segmentation region and segmentation labels corresponding to the at least one popularization target.
In the embodiment of the application, the image segmentation model is used to perform feature extraction on the sample popularization image; after the initial global feature is obtained, the edge sub-feature and the position sub-feature are extracted from the initial global feature, and the image segmentation model is then trained based on the fused global feature of the initial global feature, the edge sub-feature and the position sub-feature. The edge sub-feature characterizes the edge boundary of the at least one popularization target in the sample popularization image, so the region of the at least one popularization target in the sample popularization image can be predicted from the angle of the edge boundary through the edge sub-feature. The position sub-feature characterizes the relative position between the at least one popularization target and the sample popularization image, so the region of the at least one popularization target can be predicted from the angle of relative position through the position sub-feature. Through the initial global feature, the region of the at least one popularization target can be predicted from the angle of the semantic information of the at least one popularization target. Therefore, through the fused global feature of the initial global feature, the edge sub-feature and the position sub-feature, the region of the at least one popularization target in the sample popularization image can be constrained from three different angles, and the relevance of the at least one popularization target is enhanced. Meanwhile, the fused global feature, which combines features obtained from three different angles, enriches the semantic information contained in the initial global feature and allows detail correction of the region of the at least one popularization target in the sample popularization image, so that the obtained predicted segmentation region neither omits the popularization target nor mistakenly identifies an object other than the popularization target as a popularization target, which improves the accuracy and reliability of the trained target image segmentation model.
An application scenario of the method for training the image segmentation model provided by the present application is described below.
Please refer to fig. 1C, which is a schematic view of an application scenario of the method for training an image segmentation model according to the present application. The application scenario includes a client 101 and a server 102. Communication is possible between the client 101 and the server 102. The communication mode may be a wired communication technology, for example, communication is performed through a connection network line or a serial port line; the communication may also be performed by using a wireless communication technology, for example, communication is performed by using technologies such as bluetooth or wireless fidelity (WIFI), and the like, which is not limited specifically.
The client 101 generally refers to a device that can provide the server 102 with a sample set of promotional images or can use a trained target image segmentation model, e.g., a terminal device, a third party application accessible by the terminal device, or a web page accessible by the terminal device, etc. The terminal device includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, an intelligent household appliance, a vehicle-mounted terminal or an aircraft, etc. The server 102 generally refers to a device, such as a terminal device or a server, which can train an image segmentation model. Servers include, but are not limited to, cloud servers, local servers, or associated third party servers, etc. The client 101 and the server 102 can both adopt cloud computing to reduce the occupation of local computing resources; cloud storage can also be adopted to reduce occupation of local storage resources.
As an embodiment, the client 101 and the server 102 may be the same device, and are not limited in particular. In the embodiment of the present application, the client 101 and the server 102 are respectively different devices for example.
Based on fig. 1C, the method for training the image segmentation model provided in the embodiment of the present application is specifically described below with the server 102 as a server and the server as a main body. Please refer to fig. 2, which is a flowchart illustrating a method for training an image segmentation model according to an embodiment of the present disclosure.
S201, obtaining a sample popularization image set.
Before training the image segmentation model, the server may obtain a sample popularization image set. The server can receive each sample promotion image sent by the client and establish a sample promotion image set; or reading each sample popularization image from the storage unit and establishing a sample popularization image set; and downloading each sample popularization image from network resources, establishing a sample popularization image set, and the like, without specific limitation.
Each sample popularization image in the sample popularization image set comprises at least one popularization target and a corresponding segmentation label. All promotion targets contained in each sample promotion image may correspond to one segmentation label, or each promotion target contained in a sample promotion image may correspond to one segmentation label, and the like, without any specific limitation. The sample promotion image may be an advertisement image, and the promotion target may include an object promoted by the advertisement, a person who introduces the object in the advertisement, or a speaker, and other objects that help to introduce the object, and the like, which is not limited in particular.
As an embodiment, in order to enrich the sample popularization image set, increase the amount of sample popularization image data used for training the image segmentation model, and improve the robustness of the trained target image segmentation model, after obtaining each sample popularization image, the server may perform preprocessing such as random rotation, noise addition, and random flipping on each sample popularization image, so that the number of obtained sample popularization images is multiplied and the data volume of the sample popularization image set is increased.
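A minimal sketch of this kind of preprocessing, assuming torchvision-style transforms; the rotation angle and noise level are illustrative, and for segmentation training the same geometric transforms would also have to be applied to the corresponding segmentation labels.

```python
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Noise-addition preprocessing; the standard deviation is a hypothetical choice."""
    def __init__(self, std: float = 0.05):
        self.std = std

    def __call__(self, img: torch.Tensor) -> torch.Tensor:
        return (img + torch.randn_like(img) * self.std).clamp(0.0, 1.0)

# Each transform produces an additional variant of a sample popularization image,
# multiplying the amount of training data.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),    # random rotation
    transforms.RandomHorizontalFlip(p=0.5),   # random flipping
    transforms.ToTensor(),
    AddGaussianNoise(std=0.05),               # noise addition
])
```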
As an embodiment, the sample popularization image may be obtained from a popularization video, and the server may establish a sample popularization image set based on each obtained sample popularization image by obtaining each popularization video and using a cover image of each popularization video as a sample popularization image.
After the sample popularization image set is obtained, the server may divide at least one popularization target included in each sample popularization image based on a preset division policy, and obtain a corresponding division label. The server may use a preset segmentation program to segment at least one promotion target included in each sample promotion image, and obtain a region where the corresponding at least one promotion target is located, that is, a segmentation label. The server may also obtain, through manual annotation, a region where at least one popularization target included in each sample popularization image is located, that is, segmentation annotation. The server can also manually adjust the details after the preset segmentation program is segmented, so as to obtain the segmentation labels corresponding to the sample popularization images. And the segmentation labels are used as sample labels to perform supervised model training on the image segmentation model.
As an embodiment, after obtaining each sample popularization image and its corresponding segmentation label, the server may establish a sample popularization image set based on each sample popularization image and its corresponding segmentation label. The server may also establish a sample popularization image set after obtaining each sample popularization image, and associate each sample popularization image included in the sample popularization image set with a corresponding segmentation label, and the like after obtaining the corresponding segmentation label of each sample popularization image, which is not particularly limited.
S202, performing multiple rounds of iterative training on the image segmentation model to be trained based on the sample popularization image set to obtain a trained target image segmentation model.
After obtaining the sample popularization image set, the server may perform multiple rounds of iterative training on the image segmentation model to be trained based on the sample popularization image set, to obtain a trained target image segmentation model. For example, after each iteration training, if the image segmentation model is determined not to meet the preset training target, the model parameters of the image segmentation model are adjusted, the iteration training is continued, and if the image segmentation model is determined to meet the preset training target, the trained target image segmentation model is obtained.
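A high-level skeleton of such a multi-round iterative training procedure is sketched below; the optimizer, the plain cross-entropy loss, and the fixed epoch budget are placeholders, not the patent's specific objective or stopping criterion.

```python
import torch

def train_segmentation_model(model, dataloader, num_epochs=50, lr=1e-4, device="cuda"):
    """Generic multi-round iterative training skeleton; the loss here is a
    plain pixel-wise cross-entropy placeholder, not the patent's full
    edge/position/region objective."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    for epoch in range(num_epochs):
        for images, labels in dataloader:
            images, labels = images.to(device), labels.to(device)
            pred = model(images)            # predicted segmentation regions
            loss = criterion(pred, labels)  # error against segmentation labels
            optimizer.zero_grad()
            loss.backward()                 # adjust model parameters
            optimizer.step()
        # A preset training target could be a loss threshold instead of an epoch budget.
    return model
```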
Next, a round of iterative training process performed on a sample popularization image is described as an example, please refer to S203 to S206, and each round of iterative training process is similar and will not be described herein again.
And S203, performing feature extraction on the obtained sample popularization image to obtain an initial global feature.
The server can adopt an image segmentation model to extract the characteristics of the obtained sample popularization image to obtain initial global characteristics. The initial global feature represents semantic information contained in the sample popularization image, including semantic information of at least one popularization target contained in the sample popularization image, semantic information except the at least one popularization target, and the like.
As an embodiment, when the server performs feature extraction, the sample popularization image may be subjected to multi-scale feature extraction, so as to obtain a plurality of intermediate global features, each intermediate global feature corresponds to a different resolution, and a maximum resolution of the resolutions is the same as a resolution of the sample popularization image. The server may perform multi-scale fusion processing on the plurality of intermediate global features based on the maximum resolution to obtain an initial global feature, where the resolution corresponding to the initial global feature is the maximum resolution. Therefore, in the process of feature extraction, resolution recovery processing is not needed by the server, loss of feature information is reduced, and multi-scale fusion processing is carried out on the intermediate global features with different resolutions, so that more feature information can be fused with the obtained initial global features, and more accurate semantic information can be represented.
Referring to fig. 3, a schematic diagram of a feature extraction principle is shown, taking the feature extraction process implemented by a 9-layer network layer as an example. After the sample popularization image is input into the first layer network layer, the first layer network layer performs first-scale feature extraction on the sample popularization image to obtain a first intermediate global feature, and the first intermediate global feature and the sample popularization image have the same resolution. And inputting the first intermediate global feature into a second layer network layer, performing first-scale feature extraction on the first intermediate global feature by the second layer network layer, and updating the first intermediate global feature based on a feature extraction result.
Inputting the updated first intermediate global features into a third layer network layer, respectively extracting the first intermediate global features by the third layer network layer according to the first scale and the second scale, updating the first intermediate global features of the first scale based on the result of the feature extraction, and obtaining the second intermediate global features of the second scale. The resolution of the second scale is smaller than the resolution of the first scale, and the feature extraction of the second scale is performed on the first intermediate global feature, which may be performing down-sampling processing on the first intermediate global feature.
And inputting the updated first intermediate global features into a fourth layer network layer, performing feature extraction of the first scale on the first intermediate global features by the fourth layer network layer, and continuously updating the first intermediate global features. And meanwhile, inputting the second intermediate global feature into a fourth layer network layer, performing second-scale feature extraction on the second intermediate global feature by the fourth layer network layer, and updating the second intermediate global feature based on the result of the feature extraction.
The updated first intermediate global feature is input into the fifth network layer, which performs feature extraction on it at both the first scale and the second scale; at the same time, the updated second intermediate global feature is also input into the fifth network layer, which likewise performs feature extraction on it at both the first scale and the second scale. The feature extraction results at the first scale are fused to update the first intermediate global feature, and the feature extraction results at the second scale are fused to update the second intermediate global feature. Performing feature extraction at the first scale on the second intermediate global feature may involve upsampling it, and the updated first intermediate global feature thus fuses the original first-scale features with the second-scale features rather than being obtained directly by restoring the resolution of the second-scale features, which reduces the loss of feature information.
And outputting the first intermediate global feature as an initial global feature of the sample popularization image. In the process of feature extraction, the network layer for feature extraction of the first scale is connected with the network layer for feature extraction of the second scale in parallel, so that the resolution of the first intermediate global feature is kept the same as that of the sample popularization image, and the accuracy of semantic representation of the first intermediate global feature is improved.
As an embodiment, the image segmentation model may include a feature extraction network, the feature extraction network performs feature extraction on the sample popularization image, the feature extraction network may be obtained based on an HRNet network, the number of network layers included in the feature extraction network is not specifically limited, and when performing multi-scale feature extraction, the number of different scales is not specifically limited.
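To illustrate the idea of keeping a high-resolution branch in parallel with a lower-resolution branch and fusing across scales (in the spirit of HRNet), here is a simplified two-branch sketch; the channel counts, bilinear upsampling, and strided-convolution downsampling are assumptions, and the 9-layer configuration above is not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoScaleBlock(nn.Module):
    """Keeps a full-resolution branch alongside a half-resolution branch and
    fuses them, so the full-resolution feature never has to be recovered
    from a low-resolution one."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.conv_hi = nn.Conv2d(channels, channels, 3, padding=1)  # first-scale extraction
        self.conv_lo = nn.Conv2d(channels, channels, 3, padding=1)  # second-scale extraction
        self.down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)  # hi -> lo

    def forward(self, feat_hi, feat_lo):
        hi = F.relu(self.conv_hi(feat_hi))
        lo = F.relu(self.conv_lo(feat_lo))
        # Fuse: upsample the low-resolution result into the high-resolution branch,
        # and downsample the high-resolution input into the low-resolution branch.
        hi = hi + F.interpolate(lo, size=hi.shape[-2:], mode="bilinear", align_corners=False)
        lo = lo + self.down(feat_hi)
        return hi, lo

# The high-resolution branch output serves as the initial global feature.
x_hi = torch.randn(1, 32, 128, 128)
x_lo = torch.randn(1, 32, 64, 64)
feat_hi, feat_lo = TwoScaleBlock()(x_hi, x_lo)
```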
And S204, extracting edge sub-features and position sub-features from the initial global features.
After obtaining the initial global features, the server may extract edge sub-features from the initial global features, the edge sub-features characterizing edge boundaries of at least one promotional object.
The server may directly extract the edge sub-features from the initial global features, and may also transform the initial global features to obtain the edge sub-features, and the like, which is not limited specifically. For example, the server may use a convolutional layer and an upsampling layer to transform the initial global feature, please refer to fig. 4A, which is a network structure for extracting edge sub-features.
The first network layer may consist of a convolutional layer (Conv), a batch normalization layer (BN), and an activation layer (Rectified Linear Unit, ReLU), and the size of the convolution kernel may be 3 × 3. The second network layer includes an upsampling layer (Upsample) that can double the resolution. The third network layer includes a convolutional layer whose kernel size may be 3 × 3. The fourth network layer may include an upsampling layer that can again double the resolution.
With this network structure for extracting edge sub-features, a single-channel edge sub-feature can be output. The single-channel edge sub-feature can be visualized as a black-and-white image whose resolution is the same as that of the sample popularization image, and it intuitively represents the edge boundary of the at least one popularization target. Taking the image shown in fig. 1A as an example, referring to fig. 4B, the edge boundary of the at least one popularization target is represented by a black curve, and white represents the area enclosed by the black curve.
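A minimal sketch of an edge decoding head following the layer arrangement just described (Conv + BN + ReLU, 2× upsampling, a 3 × 3 convolution, then another 2× upsampling); the channel counts are assumptions, and whether both upsampling stages are needed depends on the resolution of the feature fed into the head.

```python
import torch
import torch.nn as nn

class EdgeDecoder(nn.Module):
    """Edge-decoding head: transforms the initial global feature into a
    single-channel edge sub-feature at the sample image resolution."""
    def __init__(self, in_channels: int = 64, mid_channels: int = 32):
        super().__init__()
        self.block1 = nn.Sequential(                   # Conv + BN + ReLU, 3x3 kernel
            nn.Conv2d(in_channels, mid_channels, 3, padding=1),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
        )
        self.up1 = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv2 = nn.Conv2d(mid_channels, 1, 3, padding=1)  # 3x3 conv to one channel
        self.up2 = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up2(self.conv2(self.up1(self.block1(x))))  # single-channel edge map
```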
After obtaining the initial global features, the server may extract location sub-features from the initial global features, the location sub-features characterizing a relative location between the at least one promotional object and the sample promotional image.
The server may directly extract the position sub-feature from the initial global feature, or may transform the initial global feature to obtain the position sub-feature, and the like, which is not limited specifically. For example, the server may use a convolutional layer to transform the initial global feature, please refer to fig. 4C, which is a network structure for extracting the location sub-feature.
The first three network layers may each consist of a convolutional layer (Conv), a batch normalization layer (BN), and an activation layer (ReLU), and the size of the convolution kernels may be 3 × 3. The fourth network layer may include an up-convolution layer, and the size of the convolution kernel may be 3 × 3.
Through this network structure for extracting position sub-features, regression estimation can be performed on the relative position of the at least one popularization target in the sample popularization image, and a 5-channel position sub-feature is output. The 5 channels respectively represent a vertex coordinate of the rectangular region where the at least one popularization target is located in the coordinate system formed by the sample popularization image, the length and the width of that rectangular region, and a classification result indicating whether the at least one popularization target exists in the rectangular region, where the classification result is 1 when it exists and 0 when it does not. Continuing with the example of the image shown in fig. 1A, referring to fig. 4D, the relative position between the at least one popularization target and the sample popularization image is represented by a black dashed rectangle.
For example, the server may extract Edge sub-features from the initial global features using an Edge decoding network (Edge Decoder). The server may extract the location sub-feature from the initial global feature using a box network (BoxNet).
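Below is a minimal sketch in the spirit of the BoxNet-style position head described above (three Conv + BN + ReLU layers followed by a convolution producing 5 channels); the channel counts and the global average pooling at the end are illustrative assumptions rather than the patent's exact configuration.

```python
import torch
import torch.nn as nn

class BoxNet(nn.Module):
    """Position head: regresses a 5-channel output (vertex x, vertex y,
    width, height, and an existence score for the promotion target)."""
    def __init__(self, in_channels: int = 64, mid_channels: int = 32):
        super().__init__()
        def conv_bn_relu(c_in, c_out):
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                                 nn.BatchNorm2d(c_out),
                                 nn.ReLU(inplace=True))
        self.body = nn.Sequential(conv_bn_relu(in_channels, mid_channels),
                                  conv_bn_relu(mid_channels, mid_channels),
                                  conv_bn_relu(mid_channels, mid_channels))
        self.head = nn.Conv2d(mid_channels, 5, 3, padding=1)  # 5-channel position sub-feature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.head(self.body(x))
        # Pool to one 5-dim vector per image: (x, y, w, h, existence logit).
        return out.mean(dim=(2, 3))
```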
S205, obtaining a prediction segmentation region of at least one promotion target in the sample promotion image based on the fusion global features of the initial global feature, the edge sub-feature and the position sub-feature.
After obtaining the edge sub-features and the location sub-features, the server may perform feature fusion on the initial global features, the edge sub-features, and the location sub-features to obtain a fused global feature. And performing image segmentation on the sample popularization image based on the fusion global features of the initial global feature, the edge sub-feature and the position sub-feature to obtain a prediction segmentation region of at least one popularization target in the sample popularization image. By fusing multiple characteristics, when the server performs image segmentation on the sample popularization image by adopting the image segmentation model, the server can predict the prediction segmentation region of at least one popularization target in the sample popularization image based on more semantic information, so that the obtained prediction segmentation region is more accurate, the training efficiency can be improved, and the image segmentation accuracy of the trained target image segmentation model is higher.
As an embodiment, the server may further adjust the initial global feature based on the roughly estimated region where the at least one promotion target is located to obtain an integrated global feature, and then perform feature fusion based on the integrated global feature, the edge sub-feature, and the position sub-feature to obtain a fused global feature.
The server may perform image segmentation on the sample promotion image based on the initial global feature to obtain an initial segmentation region for the at least one promotion target, the initial segmentation region being a rough estimate, based on the initial global feature, of the region of the at least one promotion target in the sample promotion image. The server may then extract, from the initial global feature, a region local feature associated with the initial segmentation region, which characterizes only the semantic information of the initial segmentation region, and perform local feature adjustment on the initial global feature based on the region local feature to obtain an integrated global feature. After obtaining the integrated global feature, the server may perform feature fusion on the integrated global feature, the edge sub-feature and the position sub-feature to obtain a fused global feature, and obtain the prediction segmentation region of the at least one promotion target in the sample promotion image based on the fused global feature.
As an embodiment, when adjusting the initial global feature based on the region local feature to obtain the integrated global feature, the server may first extract, from the initial global feature, the pixel feature corresponding to each pixel contained in the sample promotion image, and then determine the degree of association between each pixel feature and the region local feature. The degree of association represents the probability that the corresponding pixel belongs to the at least one promotion target: the higher the association, the more likely the pixel belongs to the target, and the lower the association, the less likely. Based on the obtained degrees of association, the server adjusts each pixel feature to obtain the integrated global feature, which strengthens its semantic representation capability, so that the prediction segmentation region of the at least one promotion target in the sample promotion image can be obtained more accurately.
As an embodiment, when adjusting each pixel feature based on the obtained degrees of association to obtain the integrated global feature, the server may use each degree of association as the weight of the corresponding pixel feature, and perform weighted fusion between the region local feature and each pixel feature contained in the initial segmentation region to obtain the corresponding region pixel fusion feature. Through this weighted fusion, each pixel feature appropriately incorporates the region local feature, so that the fused pixel carries the target-level features of the promotion target to which it belongs and its semantic representation capability is enhanced. Compared with adjusting a pixel's feature only from the features of its neighboring pixels, this approach treats the promotion target as a whole when adjusting the features of the pixels related to it, avoids losing semantic information by cutting the promotion target apart, and improves the accuracy of pixel feature adjustment.
After obtaining the region pixel fusion feature corresponding to each pixel, the server may fuse the obtained region pixel fusion features with the pixel features to obtain the integrated global feature, for example by concatenating, for each pixel and according to the pixel's position, its region pixel fusion feature with its pixel feature.
Referring to fig. 5A, the network structure for adjusting the initial global feature to obtain the integrated global feature includes a pixel representation module, a rough region characterization module, an object region characterization module, a relevance calculation module and a feature fusion module. The pixel representation module extracts, from the initial global feature, the pixel feature corresponding to each pixel contained in the sample promotion image. The rough region characterization module performs image segmentation on the sample promotion image based on the initial global feature to obtain an initial segmentation region for the at least one promotion target. The object region characterization module extracts, from the initial global feature, the region local feature associated with the initial segmentation region. The relevance calculation module determines the degree of association between each pixel feature and the region local feature. The feature fusion module fuses the obtained region pixel fusion features with the pixel features to obtain the integrated global feature.
After obtaining the initial global feature, the rough region characterization module may perform image segmentation on the sample promotion image based on the initial global feature to obtain the initial segmentation region for the at least one promotion target. The rough region characterization module may segment based on the complete initial global feature, or based on a shallow or intermediate feature within the initial global feature, which is not limited specifically. For example, the initial segmentation region is represented by a tensor of shape b × c × h × w, where b is the batch size processed in each operation, c is the number of channels, h is the number of rows and w is the number of columns.
The pixel representation module may extract the pixel feature corresponding to each pixel contained in the sample promotion image from the complete initial global feature, or from a deep feature of the initial global feature, which is not limited specifically. For example, the pixel features together form a tensor of shape b × k × h × w.
The object region characterization module extracts the region local feature associated with the initial segmentation region from the initial global feature based on the output of the rough region characterization module and the output of the pixel representation module: the two outputs are multiplied together to perform feature fusion and obtain the region local feature. For example, the region local features are represented by vectors of shape b × c × k × 1, each vector corresponding to a region class such as the foreground region or the background region.
The relevance calculation module takes the output of the pixel representation module as the querying object and the region local feature as the queried object, and multiplies the two matrices to obtain the similarity, i.e. the degree of association, between each pixel feature and the region local feature.
The feature fusion module multiplies the matrix formed by the obtained degrees of association (of shape b × c × h × w) by the matrix corresponding to the region local features, reshapes the product into shape b × k × h × w, and concatenates it with the output of the pixel representation module (also of shape b × k × h × w), thereby fusing the region pixel fusion features with the pixel features and obtaining the integrated global feature of shape b × 2k × h × w.
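A simplified PyTorch sketch of this region/pixel fusion is given below. It assumes the coarse segmentation output is turned into spatial weights with a softmax and that the degrees of association are likewise softmax-normalized over the region classes; the function name and these normalization choices are illustrative, not mandated by the description above.

```python
import torch
import torch.nn.functional as F

def integrate_global_features(pixel_feats: torch.Tensor,
                              coarse_regions: torch.Tensor) -> torch.Tensor:
    """Sketch of the region/pixel fusion module.

    pixel_feats:    (b, k, h, w) per-pixel features from the pixel representation module
    coarse_regions: (b, c, h, w) coarse segmentation logits (c region classes,
                    e.g. foreground / background)
    returns:        (b, 2k, h, w) integrated global feature
    """
    b, k, h, w = pixel_feats.shape

    pixels = pixel_feats.flatten(2)                        # (b, k, h*w)
    regions = F.softmax(coarse_regions.flatten(2), dim=2)  # (b, c, h*w) spatial weights

    # region local features: weighted sum of pixel features per region class
    region_feats = torch.bmm(regions, pixels.transpose(1, 2))   # (b, c, k)

    # degree of association between every pixel feature and every region feature
    relevance = F.softmax(torch.bmm(region_feats, pixels), dim=1)  # (b, c, h*w)

    # relevance-weighted mixture of region features for every pixel
    fused = torch.bmm(region_feats.transpose(1, 2), relevance)     # (b, k, h*w)
    fused = fused.view(b, k, h, w)

    # splice the region pixel fusion features with the original pixel features
    return torch.cat([pixel_feats, fused], dim=1)                  # (b, 2k, h, w)
```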
As an embodiment, the fused global feature can be obtained with a feature fusion network. The feature fusion network performs multi-task feature fusion and reduces noise points, making the predicted segmentation region more complete and accurate. Taking the fusion of the initial global feature, the edge sub-feature and the position sub-feature into a fused global feature as an example, referring to fig. 5B, the feature fusion network may include 5 network layers: each of the first three layers may consist of a depthwise separable convolution layer (DWConv), a batch normalization layer and an activation layer, with a convolution kernel size of 3 × 3; the fourth layer includes a depthwise separable convolution layer; and the fifth layer includes an upsampling layer, which raises the output to 4 times the original resolution. The feature fusion network can output a fused global feature with two channels, one channel representing the foreground region and the other representing the background region.
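A minimal sketch of such a feature fusion network is shown below; it assumes the initial global feature, the edge sub-feature and the position sub-feature are concatenated along the channel dimension before fusion, that the upsampling layer uses bilinear interpolation, and that the intermediate channel width is 64 — all assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """DWConv block: depthwise 3x3 convolution followed by a pointwise 1x1 convolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class FusionNet(nn.Module):
    """Illustrative 5-layer fusion network: three DWConv-BN-ReLU blocks,
    a fourth DWConv layer, then 4x upsampling to a 2-channel output
    (foreground / background)."""
    def __init__(self, in_ch: int, mid_ch: int = 64):
        super().__init__()
        blocks, ch = [], in_ch
        for _ in range(3):
            blocks += [DepthwiseSeparableConv(ch, mid_ch),
                       nn.BatchNorm2d(mid_ch),
                       nn.ReLU(inplace=True)]
            ch = mid_ch
        self.body = nn.Sequential(*blocks)
        self.out_conv = DepthwiseSeparableConv(mid_ch, 2)          # fourth layer
        self.upsample = nn.Upsample(scale_factor=4,                # fifth layer
                                    mode="bilinear", align_corners=False)

    def forward(self, global_feat, edge_feat, pos_feat):
        x = torch.cat([global_feat, edge_feat, pos_feat], dim=1)   # channel-wise concat
        return self.upsample(self.out_conv(self.body(x)))
```

When constructing the network, in_ch must equal the total number of channels of the three concatenated inputs.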
As an example, the server may employ an object-contextual representation network (OCRNet) to obtain the integrated global feature, and may employ a multi-resolution feature fusion network (RefineNet) to obtain the fused global feature of the initial global feature, the edge sub-feature and the position sub-feature.
S206, adjusting model parameters of the image segmentation model to be trained based on the error between the prediction segmentation region and the segmentation label corresponding to the at least one promotion target.
After obtaining the prediction segmentation region, the server can determine the training loss of the image segmentation model based on the error between the prediction segmentation region and the segmentation label corresponding to the at least one promotion target. When the training loss does not reach the training target, the server adjusts the model parameters of the image segmentation model and continues training; when the training loss reaches the training target, the server outputs the image segmentation model, obtaining the trained target image segmentation model.
The training target may take various forms: for example, the training target is determined to have been reached when the training loss converges; or, for another example, when all the sample promotion images contained in the sample promotion image set have been used for training; this is not limited specifically.
As an embodiment, if each segmentation label includes a region label, an edge label and a position label, the server may obtain a predicted edge boundary of the at least one promotion target in the sample promotion image based on the edge sub-feature, and obtain the relative position between the at least one promotion target and the sample promotion image based on the position sub-feature.
The training loss of the image segmentation model is then determined based on the edge error between the predicted edge boundary and the edge label corresponding to the at least one promotion target, the position error between the relative position and the position label corresponding to the at least one promotion target, and the region error between the prediction segmentation region and the region label corresponding to the at least one promotion target. When the training loss does not reach the training target, the model parameters of the image segmentation model to be trained are adjusted and training continues; when the training loss reaches the training target, the image segmentation model is output, yielding the trained target image segmentation model.
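The individual loss terms and their weights are not spelled out here; the sketch below assumes, purely for illustration, binary cross-entropy for the edge map, a smooth-L1 loss for the position map, cross-entropy for the region map, and equal weighting.

```python
import torch
import torch.nn.functional as F

def training_loss(pred_edge, edge_label,
                  pred_box, box_label,
                  pred_region, region_label,
                  weights=(1.0, 1.0, 1.0)):
    """Combined training loss over the three supervision signals.

    pred_edge / edge_label:     (b, 1, h, w) edge logits vs. float edge label map
    pred_box / box_label:       (b, 5, h, w) position map vs. position label map
    pred_region / region_label: (b, 2, h, w) region logits vs. (b, h, w) class indices
    """
    edge_loss = F.binary_cross_entropy_with_logits(pred_edge, edge_label)
    box_loss = F.smooth_l1_loss(pred_box, box_label)
    region_loss = F.cross_entropy(pred_region, region_label)
    w_e, w_b, w_r = weights
    return w_e * edge_loss + w_b * box_loss + w_r * region_loss
```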
As an embodiment, because annotating segmentation labels for images is difficult and costly, a target image segmentation model trained only on a limited set of sample promotion images may generalize poorly. To improve the robustness and generalization ability of the trained model, the server can obtain a set of label-free promotion images, each containing at least one unlabeled promotion target, and use the target image segmentation model to perform image segmentation on the at least one unlabeled promotion target contained in each label-free promotion image, obtaining the corresponding target segmentation region. The server then performs multiple rounds of iterative training on the target image segmentation model based on the label-free promotion image set and the obtained target segmentation regions, obtaining a trained final image segmentation model. Training with the label-free promotion images mitigates overfitting, improves the generalization ability of the final image segmentation model, and makes its image segmentation more accurate and reliable.
As an embodiment, when performing multiple rounds of iterative training on the target image segmentation model based on the label-free promotion image set and the obtained target segmentation regions, the server may determine the confidence of each obtained target segmentation region based on a preset confidence evaluation policy, the confidence characterizing the segmentation accuracy of the corresponding target segmentation region. Target segmentation regions whose confidence is greater than a preset confidence threshold are used as the segmentation labels of the corresponding label-free promotion images, and the target image segmentation model is then iteratively trained for multiple rounds on the label-free promotion images carrying these segmentation labels to obtain the final image segmentation model; each round is similar to the iterative process described above and is not repeated here.
For example, the preset confidence threshold can be set to 95%. A target segmentation region whose confidence exceeds this threshold can be regarded as sufficiently accurate and used as the segmentation label of the corresponding label-free promotion image to further train the target image segmentation model, thereby realizing a semi-supervised training process.
When training the final image segmentation model, the server can also perform the multiple rounds of iterative training on the target image segmentation model jointly, using both the label-free promotion images carrying segmentation labels and the sample promotion image set, to obtain the final image segmentation model.
For example, after the target image segmentation model is obtained, a number of label-free promotion images can be acquired and segmented with the target image segmentation model to obtain the corresponding target segmentation regions. The server uses the target segmentation regions whose confidence is greater than the preset confidence threshold as the segmentation labels of the corresponding label-free promotion images. The server then performs multiple rounds of iterative training on the target image segmentation model based on the sample promotion image set and the label-free promotion images carrying segmentation labels until the training loss converges, or the number of training rounds reaches a preset number, or the training time reaches a preset duration, thereby obtaining the final image segmentation model.
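A sketch of this pseudo-labelling step is given below. The preset confidence evaluation policy is assumed, for illustration, to be the mean per-pixel confidence of the predicted mask, and the model is assumed to output two-channel (foreground/background) logits; neither assumption is fixed by the text above.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_pseudo_labels(model, unlabeled_loader, threshold: float = 0.95):
    """Run the trained target segmentation model over label-free promotion
    images and keep only high-confidence predictions as pseudo segmentation
    labels (semi-supervised self-training)."""
    model.eval()
    pseudo_set = []
    for images in unlabeled_loader:                  # images: (b, 3, H, W)
        probs = F.softmax(model(images), dim=1)      # (b, 2, H, W)
        conf, masks = probs.max(dim=1)               # per-pixel confidence / class index
        for img, mask, c in zip(images, masks, conf):
            # illustrative confidence: mean per-pixel confidence of the predicted mask
            if c.mean().item() > threshold:
                pseudo_set.append((img, mask))
    return pseudo_set
```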
As an example, in the multiple rounds of iterative training, a stochastic gradient descent (SGD) method may be used to solve for the model parameters and bias parameters of the image segmentation model. The initial learning rate may be set to 0.01; in each iteration, the error is calculated and back-propagated, the gradients are computed, and all model parameters are updated. The decay of the learning rate is governed by the decrease of the training loss: if the training loss has not decreased for 5 consecutive rounds, the learning rate can be multiplied by 0.5.
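In PyTorch terms this schedule can be approximated as follows; the learning-rate figures come from the text, while the momentum value is an assumption of the sketch.

```python
import torch

def make_optimizer(model: torch.nn.Module):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    # halve the learning rate when the training loss has not decreased
    # for 5 consecutive rounds
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=5)
    return optimizer, scheduler

# per training round: loss.backward(); optimizer.step(); optimizer.zero_grad()
# and, at the end of the round: scheduler.step(round_loss)
```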
As an embodiment, after obtaining the target image segmentation model or the final image segmentation model, the server may use it to perform image segmentation on an image to be segmented: the model extracts features from the image and performs image segmentation based on the extracted features to obtain an initial segmentation result. The server may then apply morphological erosion and dilation to the initial segmentation result, for example first an opening operation (erosion followed by dilation) to remove segmentation noise points, and then a closing operation (dilation followed by erosion) on the opened result to fill fine holes inside the at least one promotion target. Finally, a filtering component can be used to smooth the edges of the result after the morphological operations, yielding a more accurate segmentation that follows the true edges, such as a foreground region containing all the promotion targets.
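A sketch of this post-processing with OpenCV is shown below; the elliptical 5 × 5 structuring element and the median filter used as the "filtering component" are assumptions of the example.

```python
import cv2
import numpy as np

def refine_mask(mask: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """Post-process a binary segmentation mask (values 0/255):
    opening removes small noise points, closing fills fine holes,
    and a median filter smooths the resulting edges."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,
                                       (kernel_size, kernel_size))
    opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)    # erode then dilate
    closed = cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel) # dilate then erode
    return cv2.medianBlur(closed, kernel_size)                 # edge-smoothing filter
```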
The method for training an image segmentation model provided by the embodiments of the present application is described below by taking, as an example, the case where the sample promotion image is a cover image of an advertisement video.
Referring to fig. 6A, an advertisement cover image is shown in which the advertisement subject is a car, two adults and a child stand in front of the car, a house is above the car, and the scene further includes a lawn, mountains, sky and the like. The cover image conveys publicity elements such as the car's large capacity and high safety factor, and suggests that its target audience is people who like traveling or families with children. The promotion targets in the advertisement cover image therefore include the car, the people and the house. When the image segmentation model is used to segment the advertisement cover image, the target segmentation regions of the car, the people and the house can be obtained. With these target segmentation regions, the association between the car and the other promotion targets can be analyzed, further improving the characterization capability of the advertisement and achieving a better promotion effect.
Referring to fig. 6B, the image segmentation model may be composed of a plurality of networks, including a feature extraction network, a feature integration network, an edge feature extraction network, a position feature extraction network, a feature fusion network and the like. When the server trains the image segmentation model on an advertisement cover image carrying segmentation labels, i.e. a sample promotion image, the advertisement cover image can be input into the feature extraction network to obtain its initial global feature. Because the advertisement cover image contains many elements and the scene is complex, the initial global feature alone has limited accuracy; if only this feature were used to segment the cover image of a target advertisement, parts of the promotion targets could easily be assigned to the background region, as shown in fig. 7A, where the two people on the left side of the car and the house above the car are segmented into the background, making the segmentation inaccurate.
Therefore, after the initial global features are obtained, feature integration is carried out on the initial global features by adopting a feature integration network to obtain integrated global features; extracting edge sub-features from the initial global features by adopting an edge feature extraction network; extracting position sub-features from the initial global features by adopting a position feature extraction network; and adopting a feature fusion network to fuse and integrate the global feature, the edge sub-feature and the position sub-feature to obtain a fusion global feature.
Obtaining a predicted edge boundary of at least one promotion target in the advertisement cover image based on the edge sub-features respectively; obtaining a relative position between the at least one promotional object and the advertisement cover image based on the position sub-feature; and obtaining a prediction segmentation area of at least one promotion target in the advertisement cover image based on the fusion global features.
If each segmentation label includes an edge label, a position label and a region label, the training loss of the image segmentation model can be determined based on the edge error between the predicted edge boundary and the edge label corresponding to the at least one promotion target, the position error between the relative position and the position label corresponding to the at least one promotion target, and the region error between the prediction segmentation region and the region label corresponding to the at least one promotion target. When the training loss does not reach the training target, the model parameters of the image segmentation model to be trained are adjusted and training continues with the next advertisement cover image; when the training loss reaches the training target, the image segmentation model is output, yielding the trained target image segmentation model.
Through the fusion of multiple features, when the trained target image segmentation model is used to segment the cover image of a target advertisement, the region of each promotion target contained in the cover image can be obtained accurately. Referring to fig. 7B, the cover image is segmented by the target image segmentation model into two regions, a foreground region and a background region, where the foreground region contains the car, the two adults and the child in front of the car, and the house above the car. The association between the car and the other promotion targets can then be analyzed, further improving the characterization capability of the advertisement and achieving a better promotion effect.
Based on the same inventive concept, the embodiment of the present application provides an apparatus for training an image segmentation model, which can implement the corresponding function of the method for training an image segmentation model. Referring to fig. 8, the apparatus includes an obtaining module 801 and a processing module 802, wherein:
the acquisition module 801: the method comprises the steps of obtaining a sample popularization image set, wherein each sample popularization image comprises at least one popularization target and a corresponding segmentation label;
the processing module 802: the method is used for carrying out multi-round iterative training on an image segmentation model to be trained based on a sample popularization image set to obtain a trained target image segmentation model, and at least the following operations are executed in each round of iterative training process:
the processing module 802 is specifically configured to: carrying out feature extraction on the obtained sample popularization image to obtain an initial global feature;
the processing module 802 is specifically configured to: extracting edge sub-features and position sub-features from the initial global features, wherein the edge sub-features represent edge boundaries of at least one popularization target, and the position sub-features represent relative positions between the at least one popularization target and the sample popularization image;
the processing module 802 is specifically configured to: obtaining a prediction segmentation region of the at least one promotion target in the sample promotion image based on the fusion global feature of the initial global feature, the edge sub-feature and the position sub-feature, and adjusting the model parameters of the image segmentation model to be trained based on the error between the prediction segmentation region and the segmentation label corresponding to the at least one promotion target.
In a possible embodiment, the obtaining module 801 is specifically configured to:
acquiring each promotion video;
respectively taking cover images of the popularization videos as sample popularization images;
based on a preset segmentation strategy, segmenting at least one promotion target contained in each sample promotion image to obtain a corresponding segmentation label;
and establishing a sample popularization image set based on each sample popularization image and the corresponding segmentation label thereof.
In a possible embodiment, the processing module 802 is specifically configured to:
performing multi-scale feature extraction on the obtained sample popularization image to obtain a plurality of intermediate global features, wherein each intermediate global feature corresponds to different resolutions, and the maximum resolution in each resolution is the same as the resolution of the sample popularization image;
and performing multi-scale fusion processing on the intermediate global features based on the maximum resolution to obtain initial global features, wherein the resolution corresponding to the initial global features is the maximum resolution.
In a possible embodiment, the processing module 802 is specifically configured to:
based on the initial global features, performing image segmentation on the sample popularization image to obtain an initial segmentation region aiming at least one popularization target;
extracting local features of regions associated with the initial segmentation regions from the initial global features;
based on the regional local features, performing local feature adjustment on the initial global features to obtain integrated global features;
and performing feature fusion on the integrated global feature, the edge sub-feature and the position sub-feature to obtain a fused global feature, and obtaining a prediction segmentation region of at least one popularization target in the sample popularization image based on the fused global feature.
In a possible embodiment, the processing module 802 is specifically configured to:
respectively extracting pixel characteristics corresponding to each pixel contained in the sample popularization image from the initial global characteristics;
respectively determining the association degree between each pixel feature and the local feature of the area;
and performing pixel feature adjustment on each pixel feature based on the obtained association degrees to obtain the integrated global feature.
In a possible embodiment, the processing module 802 is specifically configured to:
respectively taking each association degree as the weight of the corresponding pixel feature, and respectively performing weighted fusion on the local feature of the region and each pixel feature contained in the initial segmentation region to obtain the corresponding region pixel fusion feature;
and performing feature fusion on the obtained pixel fusion features of each region and the pixel features to obtain an integrated global feature.
In one possible embodiment, each segmentation label includes a region label, an edge label, and a location label; the processing module 802 is specifically configured to:
obtaining a predicted edge boundary of at least one promotion target in the sample promotion image based on the edge sub-features;
obtaining a relative position between at least one promotion target and the sample promotion image based on the position sub-features;
and adjusting the model parameters of the image segmentation model to be trained based on the edge error between the predicted edge boundary and the edge label corresponding to at least one popularization target, the position error between the relative position and the position label corresponding to at least one popularization target, and the region error between the predicted segmentation region and the region label corresponding to at least one popularization target.
In a possible embodiment, the processing module 802 is further configured to:
performing multiple rounds of iterative training on an image segmentation model to be trained based on a sample popularization image set to obtain a trained target image segmentation model, and then obtaining an annotation-free popularization image set, wherein each annotation-free popularization image comprises at least one annotation-free popularization target;
respectively carrying out image segmentation on at least one unmarked popularization target contained in each unmarked popularization image by adopting a target image segmentation model to obtain a corresponding target segmentation area;
and performing multiple rounds of iterative training on the target image segmentation model based on the label-free popularization image set and the obtained target segmentation areas to obtain a trained final image segmentation model.
In a possible embodiment, the processing module 802 is specifically configured to:
determining the confidence coefficient of each obtained target segmentation region based on a preset confidence coefficient evaluation strategy, wherein the confidence coefficient is used for representing the segmentation accuracy of the corresponding target segmentation region;
taking the target segmentation region with the confidence coefficient larger than a preset confidence coefficient threshold value in each target segmentation region as a corresponding segmentation label of the label-free popularization image;
and performing multiple rounds of iterative training on the target image segmentation model based on the obtained label-free popularization images with segmentation labels to obtain the final image segmentation model.
Referring to fig. 9, the apparatus for training an image segmentation model may be run on a computer device 900, and a current version and a historical version of a data storage program and application software corresponding to the data storage program may be installed on the computer device 900, where the computer device 900 includes a processor 980 and a memory 920. In some embodiments, the computer device 900 may include a display unit 940, the display unit 940 including a display panel 941 for displaying an interface for interaction by a user, and the like.
In one possible embodiment, the Display panel 941 may be configured in the form of a Liquid Crystal Display (LCD) or an Organic Light-Emitting Diode (OLED) or the like.
The processor 980 is configured to read the computer program and then execute a method defined by the computer program, for example, the processor 980 reads a data storage program or a file, etc., so as to run the data storage program on the computer device 900 and display a corresponding interface on the display unit 940. The Processor 980 may include one or more general purpose processors, and may further include one or more DSPs (Digital Signal processors) for performing relevant operations to implement the solutions provided in the embodiments of the present application.
The memory 920 typically includes internal memory and external memory; the internal memory may be random access memory (RAM), read-only memory (ROM), cache memory (CACHE), and the like, and the external memory can be a hard disk, an optical disk, a USB disk, a floppy disk or a tape drive. The memory 920 is used for storing computer programs, including the application program corresponding to each client, and other data, which may include data generated after the operating system or the application programs are run, including system data (e.g., configuration parameters of the operating system) and user data. In the embodiments of the present application, the program instructions are stored in the memory 920, and the processor 980 executes the program instructions stored in the memory 920 to implement any of the methods discussed in the preceding figures.
The display unit 940 is used to receive input numerical information, character information, or touch/non-touch gestures, and to generate signal inputs related to user settings and function control of the computer device 900. Specifically, in the embodiment of the present application, the display unit 940 may include a display panel 941. The display panel 941, for example a touch screen, can collect touch operations by the user (for example, operations performed by the user on or near the display panel 941 using a finger, a stylus, or any other suitable object or attachment) and drive the corresponding connection device according to a preset program.
In one possible embodiment, the display panel 941 may include a touch detection device and a touch controller. The touch detection device detects the touch orientation of the user, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 980, and can receive and execute commands sent by the processor 980.
The display panel 941 may be implemented in various forms, such as resistive, capacitive, infrared, or surface acoustic wave. In addition to the display unit 940, in some embodiments the computer device 900 may also include an input unit 930, which may include an image input device 931 and other input devices 932, where the other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, a joystick, and the like.
In addition to the above, the computer device 900 may also include a power supply 990 for powering the other modules, audio circuitry 960, near field communication module 970, and RF circuitry 910. The computer device 900 may also include one or more sensors 950, such as acceleration sensors, light sensors, pressure sensors, and the like. The audio circuit 960 specifically includes a speaker 961 and a microphone 962, and the computer device 900 may collect the user's voice through the microphone 962, perform corresponding operations, and so on.
For one embodiment, the number of the processors 980 may be one or more, and the processors 980 and the memories 920 may be coupled or relatively independent.
As an example, the processor 980 in fig. 9 may be used to implement the functionality of the acquisition module 801 and the processing module 802 in fig. 8.
As an example, the processor 980 in fig. 9 may be configured to implement the corresponding functions of the server or the terminal device discussed above.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention or portions thereof that contribute to the prior art may be embodied in the form of a software product, for example, a computer program product stored in a storage medium and including several instructions for causing a computer device to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (13)

1. A method of training an image segmentation model, comprising:
obtaining a sample popularization image set, wherein each sample popularization image comprises at least one popularization target and a corresponding segmentation label;
performing multiple rounds of iterative training on the image segmentation model to be trained based on the sample popularization image set to obtain a trained target image segmentation model, and at least performing the following operations in each round of iterative training process:
performing feature extraction on the obtained sample popularization image to obtain an initial global feature;
extracting edge sub-features and position sub-features from the initial global features, wherein the edge sub-features characterize edge boundaries of the at least one promotional object, and the position sub-features characterize relative positions between the at least one promotional object and the sample promotional image;
and obtaining a prediction segmentation region of the at least one promotion target in the sample promotion image based on the fusion global feature of the initial global feature, the edge sub-feature and the position sub-feature, and adjusting a model parameter of the image segmentation model to be trained based on an error between the prediction segmentation region and a segmentation label corresponding to the at least one promotion target.
2. The method of claim 1, wherein obtaining the sample promotional image set comprises:
acquiring each promotion video;
taking the cover image of each promotion video as the sample promotion image;
based on a preset segmentation strategy, segmenting at least one promotion target contained in each sample promotion image to obtain corresponding segmentation labels;
and establishing a sample popularization image set based on each sample popularization image and the corresponding segmentation label thereof.
3. The method according to claim 1, wherein the performing feature extraction on the obtained sample popularization image to obtain an initial global feature comprises:
performing multi-scale feature extraction on the obtained sample popularization image to obtain a plurality of intermediate global features, wherein each intermediate global feature corresponds to different resolutions, and the maximum resolution in each resolution is the same as the resolution of the sample popularization image;
and performing multi-scale fusion processing on the plurality of intermediate global features based on the maximum resolution to obtain the initial global features, wherein the resolution corresponding to the initial global features is the maximum resolution.
4. The method according to any one of claims 1 to 3, wherein the obtaining the predicted segmentation region of the at least one popularization target in the sample popularization image based on the fused global feature of the initial global feature, the edge sub-feature and the position sub-feature comprises:
performing image segmentation on the sample popularization image based on the initial global features to obtain an initial segmentation region aiming at the at least one popularization target;
extracting region local features associated with the initial segmentation region from the initial global features;
based on the regional local features, performing local feature adjustment on the initial global features to obtain integrated global features;
and performing feature fusion on the integrated global feature, the edge sub-feature and the position sub-feature to obtain a fused global feature, and obtaining a prediction segmentation region of the at least one popularization target in the sample popularization image based on the fused global feature.
5. The method of claim 4, wherein the performing local feature adjustment on the initial global feature based on the local feature of the region to obtain an integrated global feature comprises:
respectively extracting pixel features corresponding to all pixels contained in the sample popularization image from the initial global features;
respectively determining the association degree between each pixel characteristic and the local characteristic of the region;
and performing pixel feature adjustment on each pixel feature based on the obtained association degrees to obtain the integrated global feature.
6. The method according to claim 5, wherein the performing pixel feature adjustment on each pixel feature based on each obtained relevance to obtain the integrated global feature comprises:
respectively taking the relevance degrees as the weight of the corresponding pixel characteristics, and respectively performing weighted fusion on the local characteristics of the region and the pixel characteristics contained in the initial segmentation region to obtain corresponding region pixel fusion characteristics;
and performing feature fusion on the obtained pixel fusion features of each region and the pixel features to obtain the integrated global feature.
7. The method according to any one of claims 1 to 3, wherein each segmentation label comprises a region label, an edge label and a position label;
the adjusting the model parameters of the image segmentation model to be trained based on the error between the prediction segmentation region and the segmentation label corresponding to the at least one popularization target includes:
obtaining a predicted edge boundary of the at least one promotional object in the sample promotional image based on the edge sub-features;
obtaining a relative position between the at least one promotional object and the sample promotional image based on the position sub-feature;
and adjusting model parameters of the image segmentation model to be trained on the basis of an edge error between the predicted edge boundary and the edge label corresponding to the at least one promotion target, a position error between the relative position and the position label corresponding to the at least one promotion target, and a region error between the predicted segmentation region and the region label corresponding to the at least one promotion target.
8. The method according to any one of claims 1 to 3, wherein after performing multiple rounds of iterative training on the image segmentation model to be trained based on the sample popularization image set to obtain a trained target image segmentation model, the method further comprises:
acquiring a set of label-free popularization images, wherein each label-free popularization image comprises at least one label-free popularization target;
respectively carrying out image segmentation on at least one unmarked popularization target contained in each unmarked popularization image by adopting the target image segmentation model to obtain corresponding target segmentation areas;
and performing multiple rounds of iterative training on the target image segmentation model based on the label-free popularization image set and each obtained target segmentation region to obtain a trained final image segmentation model.
9. The method according to claim 8, wherein the performing a plurality of rounds of iterative training on the target image segmentation model based on the label-free popularization image set and the obtained target segmentation regions to obtain a trained final image segmentation model comprises:
determining the confidence coefficient of each obtained target segmentation region based on a preset confidence coefficient evaluation strategy, wherein the confidence coefficient is used for representing the segmentation accuracy of the corresponding target segmentation region;
taking the target segmentation region with the confidence coefficient larger than a preset confidence coefficient threshold value in each target segmentation region as a corresponding segmentation label of the label-free popularization image;
and performing multiple rounds of iterative training on the target image segmentation model based on each obtained label-free popularization image with segmentation labels to obtain the final image segmentation model.
10. An apparatus for training an image segmentation model, comprising:
an acquisition module: the method comprises the steps of obtaining a sample popularization image set, wherein each sample popularization image comprises at least one popularization target and a corresponding segmentation label;
a processing module: the method is used for carrying out multi-round iterative training on an image segmentation model to be trained based on the sample popularization image set to obtain a trained target image segmentation model, and at least the following operations are executed in each round of iterative training process:
the processing module is specifically configured to: carrying out feature extraction on the obtained sample popularization image to obtain an initial global feature;
the processing module is specifically configured to: extracting edge sub-features and position sub-features from the initial global features, wherein the edge sub-features characterize edge boundaries of the at least one promotional object, and the position sub-features characterize relative positions between the at least one promotional object and the sample promotional image;
the processing module is specifically configured to: and obtaining a prediction segmentation region of the at least one popularization target in the sample popularization image based on the fusion global feature of the initial global feature, the edge sub-feature and the position sub-feature, and adjusting a model parameter of the image segmentation model to be trained based on an error between the prediction segmentation region and a segmentation label corresponding to the at least one popularization target.
11. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 9.
12. A computer device, comprising:
a memory for storing program instructions;
a processor for calling program instructions stored in said memory and executing the method of any of claims 1 to 9 in accordance with the obtained program instructions.
13. A computer-readable storage medium having computer-executable instructions stored thereon for causing a computer to perform the method of any one of claims 1 to 9.
CN202210596637.6A 2022-05-30 2022-05-30 Method, device and equipment for training image segmentation model and storage medium Pending CN114926480A (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190311202A1 (en) * 2018-04-10 2019-10-10 Adobe Inc. Video object segmentation by reference-guided mask propagation
CN111191662A (en) * 2019-12-31 2020-05-22 网易(杭州)网络有限公司 Image feature extraction method, device, equipment, medium and object matching method
CN111414958A (en) * 2020-03-18 2020-07-14 燕山大学 Multi-feature image classification method and system for visual word bag pyramid
CN112102477A (en) * 2020-09-15 2020-12-18 腾讯科技(深圳)有限公司 Three-dimensional model reconstruction method and device, computer equipment and storage medium
CN113538480A (en) * 2020-12-15 2021-10-22 腾讯科技(深圳)有限公司 Image segmentation processing method and device, computer equipment and storage medium
US20220036124A1 (en) * 2020-07-31 2022-02-03 Sensetime Group Limited Image processing method and device, and computer-readable storage medium
CN114187311A (en) * 2021-12-14 2022-03-15 京东鲲鹏(江苏)科技有限公司 Image semantic segmentation method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI Zongmin; XU Xiyun; LIU Yujie; LI Hua: "Object region segmentation algorithm based on conditional random field pixel modeling and deep feature fusion", Journal of Computer-Aided Design & Computer Graphics, no. 06, 15 June 2018 (2018-06-15) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination