CN117011521A - Training method and related device for image segmentation model

Training method and related device for image segmentation model

Info

Publication number
CN117011521A
Authority
CN
China
Prior art keywords
image
segmentation
probability map
loss function
training
Prior art date
Legal status
Pending
Application number
CN202211572910.8A
Other languages
Chinese (zh)
Inventor
赖锦祥
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202211572910.8A
Publication of CN117011521A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a training method and a related device of an image segmentation model, which are applied to the technical field of artificial intelligence and comprise the following steps: acquiring a first image sample without a segmentation label and at least one second image sample with a segmentation label; respectively carrying out image segmentation on the first image sample and each second image sample through an initial image segmentation model to obtain a first segmentation result of the first image sample and a second segmentation result of the second image sample; determining a value of a first loss function based on the probability map of the acquired first segmentation result in the pixel dimension; determining a value of a second loss function based on a difference between each second segmentation result and the segmentation labels; and training an initial image segmentation model by combining the value of the first loss function and the value of each second loss function to obtain a target image segmentation model. The application can improve the accuracy of the segmentation result of the image segmentation model.

Description

Training method and related device for image segmentation model
Technical Field
The application relates to an artificial intelligence technology, in particular to a training method and a related device of an image segmentation model.
Background
Object localization is widely used in industrial automation, object detection, video viewing, and other fields. In the related art, methods for localizing a target object in an image either perform poorly under complex real-world conditions (for example, when the target object is deformed), yielding poor localization precision and low accuracy, or require a large amount of image annotation work at high labeling cost; as a result, the segmentation accuracy of the image segmentation model is poor.
Disclosure of Invention
The embodiment of the application provides a training method, a training device, electronic equipment, a computer readable storage medium and a computer program product for an image segmentation model, which can improve the accuracy of a segmentation result of the image segmentation model.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a training method of an image segmentation model, which comprises the following steps:
acquiring a joint training sample set, wherein the joint training sample set comprises a first image sample without a segmentation label and at least one second image sample with a segmentation label, and the segmentation label is a standard segmentation image of the second image sample;
acquiring an initial image segmentation model, performing image segmentation on the first image sample through the initial image segmentation model to obtain a first segmentation result, and performing image segmentation on each second image sample to obtain a second segmentation result;
Acquiring a probability map of the first segmentation result in a pixel dimension, and determining a value of a first loss function based on the probability map, wherein the probability map is used for indicating the probability that each pixel in the first image sample belongs to each segmentation area in the first segmentation result;
acquiring differences between the second segmentation results and the segmentation labels, and determining values of second loss functions based on the differences;
and training the initial image segmentation model by combining the value of the first loss function and the value of each second loss function to obtain a target image segmentation model.
The embodiment of the application provides a training device for an image segmentation model, which comprises the following components:
the acquisition module is used for acquiring a combined training sample set, wherein the combined training sample set comprises a first image sample without a segmentation label and at least one second image sample with a segmentation label, and the segmentation label is a standard segmentation image of the second image sample;
the initial segmentation module is used for acquiring an initial image segmentation model, performing image segmentation on the first image sample through the initial image segmentation model to obtain a first segmentation result, and performing image segmentation on each second image sample to obtain a second segmentation result;
The first determining module is used for obtaining a probability map of the first segmentation result in the pixel dimension, and determining a value of a first loss function based on the probability map, wherein the probability map is used for indicating the probability that each pixel in the first image sample belongs to each segmentation area in the first segmentation result;
the second determining module is used for obtaining differences between the second segmentation results and the segmentation labels and determining values of second loss functions based on the differences;
and the training module is used for combining the value of the first loss function and the value of each second loss function, training the initial image segmentation model and obtaining a target image segmentation model.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the training method of the image segmentation model provided by the embodiment of the application when executing the executable instructions stored in the memory.
Embodiments of the present application provide a computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, cause the processor to perform the training method of the image segmentation model provided by the embodiments of the present application.
Embodiments of the present application provide a computer program product comprising a computer program or computer-executable instructions stored in a computer-readable storage medium. The processor of the electronic device reads the computer executable instructions from the computer readable storage medium, and the processor executes the computer executable instructions, so that the electronic device executes the training method of the image segmentation model provided by the embodiment of the application.
The embodiment of the application has the following beneficial effects:
by applying the embodiment of the application, the first segmentation result of the first image sample without a segmentation label and the second segmentation result of the second image sample with a segmentation label are first determined through the initial image segmentation model; the value of the second loss function is then determined based on the difference between the second segmentation result and the segmentation label, and the value of the first loss function is determined based on the probability map of the first segmentation result, so that the initial image segmentation model is trained by combining the determined values of the two loss functions to obtain the target image segmentation model. In this way, the probability map enables accurate classification of each pixel, achieving high-quality, high-precision semantic segmentation and improving the accuracy of the segmentation result of the image segmentation model.
Drawings
FIGS. 1A-1B are schematic diagrams of architecture of a training system 100 for an image segmentation model according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an electronic device 500 implementing a training method of an image segmentation model according to an embodiment of the present application;
FIG. 3 is a flowchart of a training method of an image segmentation model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a second image sample carrying a split label according to an embodiment of the present application;
FIGS. 5A-5B are model block diagrams of an initial image segmentation model provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a probability map provided by an embodiment of the present application;
FIG. 7 is a flow chart illustrating a method for determining a value of a first loss function according to an embodiment of the present application;
FIG. 8 is a schematic diagram of an acquisition mode of a historical probability map according to an embodiment of the present application;
FIG. 9 is a flowchart of a first area determination method according to an embodiment of the present application;
FIG. 10 is a diagram of normalized results provided by an embodiment of the present application;
FIG. 11 is another schematic diagram of a method for determining a value of a first loss function according to an embodiment of the present application;
fig. 12 is a schematic diagram of a training process method of an initial image segmentation model according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of protection of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", "third" and the like are merely used to distinguish similar objects and do not denote a particular ordering of the objects; it is understood that "first", "second", "third" may, where permitted, be interchanged in a particular order or sequence so that the embodiments of the application described herein can be practiced in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
It should be noted that, in the embodiment of the present application, related data such as images collected in real time is involved, when the embodiment of the present application is applied to a specific product or technology, user permission or consent needs to be obtained, and the collection, use and processing of related data need to comply with related laws and regulations and standards of related countries and regions.
Before describing embodiments of the present application in further detail, the terms and expressions involved in the embodiments of the present application are described; the terms and expressions involved in the embodiments of the present application are subject to the following explanations.
1) Semantic segmentation: i.e. image segmentation, refers to the process of subdividing a digital image into a plurality of image sub-areas (sets of pixels), also called superpixels, i.e. the technique and process of dividing the image into several specific areas with unique properties and presenting objects of interest. It is a key step from image processing to image analysis. The purpose of semantic segmentation is to simplify or alter the representation of an image so that the image is easier to understand and analyze. Semantic segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in an image. More precisely, semantic segmentation is a process of labeling each pixel in an image, which causes pixels with the same label to have some common visual property.
For example, for an image, object pixels belonging to the same class are classified into one class and labeled with the same color, i.e., a classification pixel by pixel, such as a natural image in which all pixels belonging to a tree are labeled with one color, all people are labeled with one color pixel by pixel, the sky is labeled with one color, etc.
2) Segmented image: the result map obtained by semantic segmentation, also called a segmentation mask.
3) Cross entropy: cross entropy measures the degree of difference between two probability distributions over the same random variable; in machine learning it expresses the difference between the true probability distribution and the predicted probability distribution. The smaller the value of the cross entropy, the better the model's prediction. Cross entropy is often used together with the softmax classifier: the output is first processed so that the predicted values over the multiple classes sum to 1, and the loss is then calculated through the cross entropy (a sketch of this is given after this list of terms).
4) Semi-supervised learning: training a model using a portion of labeled data together with unlabeled data. It is generally applied to scenarios where labeled data is scarce or labels are difficult to acquire, so as to improve the capability of the model with limited labeled data.
5) Computer Vision (CV) is the science of studying how to make machines "see"; more specifically, it uses cameras and computers in place of human eyes to perform machine vision tasks such as recognition and measurement on targets, and to carry out further graphic processing so that the result is an image better suited for human observation or for transmission to instruments for detection. As a scientific discipline, research on computer vision theory and technology attempts to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, three-dimensional techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
6) Machine Learning (ML) is a multi-domain interdisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specifically studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and how it reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way of giving computers intelligence, and it is applied throughout the various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and teaching learning.
7) COCO dataset: its full name is Microsoft Common Objects in Context (MS COCO); it is a large, rich object detection, segmentation and captioning dataset. It mainly consists of targets extracted from images of complex everyday scenes, with positions calibrated through accurate segmentation. The dataset includes 91 object classes, 328,000 images and 2,500,000 labels. It is the largest semantic segmentation dataset available so far, providing 80 categories and more than 330,000 images, of which 200,000 are annotated, and the number of object instances in the whole dataset exceeds 1.5 million.
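As a concrete illustration of term 3) above, the following sketch shows how a softmax classifier turns per-pixel class scores into probabilities that sum to 1 and how the cross-entropy loss is then computed. This is an illustrative Python/PyTorch example; the tensor shapes and values are assumptions and it is not part of the patent.

```python
# Minimal sketch (not from the patent): softmax turns per-class scores into
# probabilities that sum to 1, and cross entropy measures the gap between the
# predicted distribution and the true labels. Shapes are illustrative.
import torch
import torch.nn.functional as F

num_classes = 5
logits = torch.randn(1, num_classes, 4, 4)          # raw scores per pixel (N, C, H, W)
labels = torch.randint(0, num_classes, (1, 4, 4))   # ground-truth class per pixel (N, H, W)

probs = F.softmax(logits, dim=1)                    # per-pixel class probabilities
print(probs.sum(dim=1))                             # each entry is 1.0

# Pixel-wise cross entropy between prediction and labels;
# F.cross_entropy applies log-softmax internally, so it takes the raw logits.
loss = F.cross_entropy(logits, labels)
print(loss.item())
```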
Based on the above explanation of the terms and expressions involved in the embodiments of the present application, the training system for an image segmentation model provided by the embodiments of the present application is described below. Referring to FIG. 1A, FIG. 1A is a schematic architecture diagram of a training system 100 of an image segmentation model according to an embodiment of the present application. To support an exemplary application, terminals (terminal 400-1 and terminal 400-2 are shown as examples) are connected to the server 200 through the network 300, where the network 300 may be a wide area network, a local area network, or a combination of the two, and data transmission is implemented over wireless or wired links.
In some embodiments, the terminals (such as terminal 400-1 and terminal 400-2) are configured to receive, through the human-computer interaction interface of the image segmentation clients (such as client 410-1 and client 410-2), a triggering operation for performing image segmentation on an image to be segmented, and to send an image segmentation request carrying the image to be segmented to the server 200;
in some embodiments, the server 200 is configured to receive an image segmentation request sent by a terminal, and respond to the request by training a completed target image segmentation model, and return a segmentation result for an image to be segmented to the terminal;
the server 200 is further configured to, before acquiring the image segmentation request for the image to be segmented, implement a training procedure for the initial image segmentation model: the method comprises the steps that a server acquires a combined training sample set, wherein the combined training sample set comprises a first image sample without a segmentation label and at least one second image sample with a segmentation label, and the segmentation label is a standard segmentation image of the second image sample; acquiring an initial image segmentation model, performing image segmentation on a first image sample through the initial image segmentation model to obtain a first segmentation result, and performing image segmentation on each second image sample to obtain a second segmentation result; acquiring a probability map of the first segmentation result in the pixel dimension, and determining a value of a first loss function based on the probability map, wherein the probability map is used for indicating the probability that each pixel in the first image sample belongs to each segmentation area in the first segmentation result; determining a value of a second loss function based on a difference between each second segmentation result and the segmentation labels; and training the initial image segmentation model by combining the value of the first loss function and the value of each second loss function to obtain a target image segmentation model.
In some embodiments, the server 200 may be a server cluster or a distributed system formed by a plurality of servers, for example, a blockchain system, where the plurality of servers may be configured as a blockchain network and the server 200 is a node on the blockchain network.
In the following, an exemplary application of the blockchain network is illustrated with multiple servers accessing the blockchain network to enable training of the image segmentation model.
In some embodiments, referring to FIG. 1B, FIG. 1B is a schematic architecture diagram of a training system 100 for an image segmentation model according to an embodiment of the present application. The multiple devices involved perform the training process of the image segmentation model together, such as the terminal 600 and the terminal 700; after obtaining the authorization of the blockchain management platform 900, the client 610 of the terminal 600 and the client 710 of the terminal 700 can access the blockchain network 800.
The terminal 600 sends an image segmentation model acquisition request to the blockchain management platform 900 (the terminal 700 likewise sends an image segmentation model acquisition request to the blockchain management platform 900). The blockchain management platform 900 generates a corresponding update operation according to the image segmentation model acquisition request; the update operation specifies the smart contract that needs to be invoked to realize the update operation/query operation and the parameters passed to the smart contract, the transaction also carries a digital signature signed by the requesting web page, and the update operation is sent to the blockchain network 800.
When node 210-1, node 210-2 and node 210-3 in the blockchain network 800 receive the update operation, they verify the digital signature of the update operation. After the digital signature is verified successfully, whether the client 610 has acquisition permission is confirmed according to the identity of the client 610 carried in the update operation; failure of either the digital signature verification or the permission verification results in acquisition failure. After verification succeeds, the node appends its own digital signature (for example, by encrypting the digest of the transaction using the private key of node 210-1) and continues to broadcast in the blockchain network 800.
After the nodes with the ordering function in the blockchain network 800 (node 210-1, node 210-2, node 210-3, etc.) receive the successfully verified acquisition request, they fill the acquisition request into a new block and broadcast it to the nodes in the blockchain network 800 that provide the consensus service.
The nodes in the blockchain network 800 that provide the consensus service perform a consensus process on the new block to reach agreement; the nodes that provide the ledger function append the new block to the tail of the blockchain and execute the acquisition requests in the new block: for a submitted model update request, the key-value pair corresponding to the image segmentation model in the state database is updated; for an acquisition request for the image segmentation model, the key-value pair corresponding to the image segmentation model is queried from the state database and the corresponding image segmentation model is sent to the terminal. After receiving the initial image segmentation model returned by the blockchain network 800, the terminal 600 and the terminal 700 train the image segmentation model to obtain a trained image segmentation model, and display a prompt message of successful training in the graphical interface 610-1 and the graphical interface 710-1. The terminal 600 and the terminal 700 send the trained image segmentation model to the blockchain network 800, and the blockchain network 800 calls the trained image segmentation model to perform image segmentation processing on an image to be segmented, obtaining the segmentation result of the image to be segmented.
In practical applications, the server 200 may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDNs, content Delivery Network), and basic cloud computing services such as big data and artificial intelligence platforms. Terminals (e.g., terminal 400-1 and terminal 400-2) may be, but are not limited to, smart phones, tablet computers, notebook computers, desktop computers, smart speakers, smart televisions, smart watches, etc. Terminals, such as terminal 400-1 and terminal 400-2, and server 200 may be directly or indirectly connected through wired or wireless communication, and the present application is not limited thereto.
Next, an electronic device implementing the training method of the image segmentation model provided by the embodiment of the present application will be described. Referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device 500 for implementing a training method of an image segmentation model according to an embodiment of the present application. The electronic device 500 may be the server 200 shown in fig. 1, and the electronic device 500 may also be a terminal capable of implementing the training method of the image segmentation model provided by the present application, and taking the electronic device 500 as the server shown in fig. 1 as an example, the electronic device implementing the training method of the image segmentation model in the embodiment of the present application is described, where the electronic device 500 provided in the embodiment of the present application includes: at least one processor 510, a memory 550, at least one network interface 520, and a user interface 530. The various components in electronic device 500 are coupled together by bus system 540. It is appreciated that the bus system 540 is used to enable connected communications between these components. The bus system 540 includes a power bus, a control bus, and a status signal bus in addition to the data bus. The various buses are labeled as bus system 540 in fig. 2 for clarity of illustration.
The processor 510 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor (for example, a microprocessor or any conventional processor), a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The user interface 530 includes one or more output devices 531 that enable presentation of media content, including one or more speakers and/or one or more visual displays. The user interface 530 also includes one or more input devices 532, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 550 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 550 may optionally include one or more storage devices physically located remote from processor 510.
Memory 550 includes volatile memory or non-volatile memory, and may also include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM, Read Only Memory) and the volatile memory may be a random access memory (RAM, Random Access Memory). The memory 550 described in embodiments of the present application is intended to comprise any suitable type of memory.
In some embodiments, memory 550 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551 including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks; network communication module 552 is used to reach other computing devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 include: bluetooth, wireless compatibility authentication (WiFi), and universal serial bus (USB, universal Serial Bus), etc.; a presentation module 553 for enabling presentation of information (e.g., a user interface for operating a peripheral device and displaying content and information) via one or more output devices 531 (e.g., a display screen, speakers, etc.) associated with the user interface 530; the input processing module 554 is configured to detect one or more user inputs or interactions from one of the one or more input devices 532 and translate the detected inputs or interactions.
In some embodiments, the training device for an image segmentation model provided in the embodiments of the present application may be implemented in a software manner, and fig. 2 shows a training device 555 for an image segmentation model stored in a memory 550, which may be software in the form of a program, a plug-in, or the like, and includes the following software modules: the acquisition module 5551, the initial segmentation module 5552, the first determination module 5553, the second determination module 5554, and the training module 5555 are logical, and thus may be arbitrarily combined or further split according to the implemented functions, and the functions of the respective modules will be described below.
In other embodiments, the training apparatus for an image segmentation model provided by the embodiments of the present application may be implemented by combining software and hardware. By way of example, the training apparatus for an image segmentation model provided by the embodiments of the present application may be a processor in the form of a hardware decoding processor that is programmed to perform the training method for an image segmentation model provided by the embodiments of the present application; for example, the processor in the form of a hardware decoding processor may employ one or more application specific integrated circuits (ASIC, Application Specific Integrated Circuit), DSPs, programmable logic devices (PLD, Programmable Logic Device), complex programmable logic devices (CPLD, Complex Programmable Logic Device), field-programmable gate arrays (FPGA, Field-Programmable Gate Array), or other electronic components.
In some embodiments, the terminal or the server may implement the training method of the image segmentation model provided by the embodiment of the present application by running a computer program. For example, the computer program may be a native program or a software module in an operating system; the Application program can be a local (Native) Application program (APP), namely a program which can be installed in an operating system to run, such as an instant messaging APP and a web browser APP; the method can also be an applet, namely a program which can be run only by being downloaded into a browser environment; but also an applet that can be embedded in any APP. In general, the computer programs described above may be any form of application, module or plug-in.
Based on the above description of the training system and the electronic device for the image segmentation model provided by the embodiment of the present application, the training method of the image segmentation model provided by the embodiment of the present application is described below. In practical implementation, the training method of the image segmentation model provided by the embodiment of the present application may be implemented by a terminal or a server alone, or implemented by the terminal and the server cooperatively; it is illustrated below taking execution by the server 200 in FIG. 1A (or FIG. 1B) alone as an example. Referring to FIG. 3, FIG. 3 is a flowchart of a training method of an image segmentation model according to an embodiment of the present application, and the method will be described with reference to the steps shown in FIG. 3.
In step 101, a server acquires a set of joint training samples, where the set of joint training samples includes a first image sample that does not carry a segmentation tag and at least one second image sample that carries a segmentation tag, and the segmentation tag is a standard segmentation image of the second image sample.
In actual implementation, the first image sample is an image without a segmentation label, the second image sample is an image carrying a segmentation label, and the segmentation label is the correct standard segmentation image of the second image sample. In the standard segmentation image, each pixel has a corresponding label; the label can be used to indicate the category to which the pixel belongs, and pixels with the same label share certain visual characteristics (such as the same color). Referring to FIG. 4, FIG. 4 is a schematic diagram of a second image sample carrying a segmentation label according to an embodiment of the present application, where the second image sample is shown by reference numeral 1 and the standard segmentation image of the second image sample is shown by reference numeral 2; objects of the same category in the standard segmentation image are identified with the same visual characteristics, for example the "car" shown by reference numeral 2-1 is identified in black and the "pavement" shown by reference numeral 2-2 is identified in white.
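The following minimal sketch illustrates one possible in-memory representation of a joint training sample (one unlabeled first image sample plus labeled second image samples). The class names, shapes and dtypes are assumptions for illustration, not the patent's data format.

```python
# Minimal sketch (illustrative, assumed names): a joint training sample that
# mixes one unlabeled first image sample with labeled second image samples.
from dataclasses import dataclass
from typing import Optional, List
import numpy as np

@dataclass
class ImageSample:
    image: np.ndarray                    # H x W x 3 input image
    label: Optional[np.ndarray] = None   # H x W standard segmentation image, or None

@dataclass
class JointTrainingSample:
    first_sample: ImageSample            # carries no segmentation label
    second_samples: List[ImageSample]    # each carries a segmentation label

h, w = 64, 64
joint_sample = JointTrainingSample(
    first_sample=ImageSample(image=np.zeros((h, w, 3), dtype=np.uint8)),
    second_samples=[
        ImageSample(image=np.zeros((h, w, 3), dtype=np.uint8),
                    label=np.zeros((h, w), dtype=np.int64))
        for _ in range(2)
    ],
)
assert joint_sample.first_sample.label is None
```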
In step 102, an initial image segmentation model is obtained, and through the initial image segmentation model, image segmentation is performed on a first image sample to obtain a first segmentation result, and image segmentation is performed on each second image sample to obtain a second segmentation result.
In actual implementation, the server may train the initial image segmentation model F based on the acquired first image sample and the plurality of second image samples. The initial image segmentation model F is an image segmentation model pre-trained on a public image set (such as the COCO dataset), rather than an untrained model, so the training efficiency of the initial image segmentation model can be improved. The pre-training of the initial image segmentation model is, in effect, semantic segmentation training of the initial image segmentation model based on a cross-entropy loss on the public dataset.
The encoding layer, decoding layer, embedding layer and classification layer of the initial image segmentation model F are described for actual implementation. The encoding layer is used to encode (downsample) the image sample to obtain a compressed image (that is, the image sample is reduced in dimension by the encoding layer); the decoding layer is used to decode (upsample) the compressed image to obtain a reconstructed image of the same size as the image sample (that is, the compressed image is raised in dimension by the decoding layer); the embedding layer is used to semantically map the reconstructed image to obtain a semantic image; the classification layer (which may also be referred to as a pixel classification layer) is used to perform a pixel-by-pixel segmentation operation on the semantic image to obtain a segmentation result. The segmentation result is similar to the standard segmentation image shown in FIG. 4, that is, objects of the same class in the segmentation result are identified with the same visual characteristics. It should be noted that the embedding layer is used for semantic mapping because some irrelevant features in the compressed image may be amplified, or an important feature may be dispersed, during decoding; the reconstructed image is therefore mapped into a new space through the embedding layer to obtain a semantic image with more pronounced semantics. In addition, in practical applications, the embedding layer may also be part of the classification layer.
When the initial image segmentation model is trained with a joint training sample set in the target field, a first image sample in the joint training sample is input into the initial image segmentation model and processed by the encoding layer, decoding layer, embedding layer and classification layer to obtain a first segmentation result, and at least one second image sample in the joint training sample is input into the initial image segmentation model and processed by the encoding layer, decoding layer, embedding layer and classification layer to obtain the corresponding second segmentation result. The server performs these processes in a parallel or serial manner according to the actual availability of computing resources: if the available computing resources are sufficient, parallel processing can be adopted, and if the available computing resources are lower than a preset threshold, serial processing can be adopted.
FIGS. 5A-5B are model structure diagrams of an initial image segmentation model provided by an embodiment of the present application. Referring to FIG. 5A, the initial image segmentation model shown in the figure is a single-branch serial structure, including an encoding layer, a decoding layer, an embedding layer, and a classification layer; the server sequentially inputs the first image sample and the second image sample into the initial image segmentation model and implements image segmentation in a serial manner, where the parameters in the whole process are shared (for example, if the first image sample is processed first, the parameters of the initial image segmentation model after the first image sample has been processed may be used as the initial parameters of the initial image segmentation model when the second image sample is processed). Referring to FIG. 5B, the initial image segmentation model shown in the figure is a two-branch parallel structure; each branch includes an encoding layer, a decoding layer, an embedding layer, and a classification layer. The branch structure labeled 1 is used to process the second image sample and the branch structure labeled 2 is used to process the first image sample; the server implements the segmentation operations for the first image sample and the second image sample in a parallel manner, and in the process, the parameters of the corresponding layers in the two branch structures can be shared.
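To make the four-layer structure described above concrete, the following sketch shows one plausible PyTorch layout of an encoding layer, decoding layer, embedding layer and pixel classification layer used in a shared single-branch fashion. The channel sizes and layer choices are assumptions for illustration; the patent does not prescribe a specific network.

```python
# Illustrative sketch (assumed architecture details): an encoder-decoder model
# with an embedding layer and a pixel classification layer, used in a shared
# (single-branch) fashion for both first and second image samples.
import torch
import torch.nn as nn

class SimpleSegmentationModel(nn.Module):
    def __init__(self, num_classes: int = 5):
        super().__init__()
        # Encoding layer: downsample the image into a compressed representation.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoding layer: upsample back to the input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, 16, 2, stride=2), nn.ReLU(),
        )
        # Embedding layer: map the reconstructed features into a semantic space.
        self.embedding = nn.Conv2d(16, 16, 1)
        # Classification layer: per-pixel class scores.
        self.classifier = nn.Conv2d(16, num_classes, 1)

    def forward(self, x):
        compressed = self.encoder(x)
        reconstructed = self.decoder(compressed)
        semantic = self.embedding(reconstructed)
        return self.classifier(semantic)        # (N, num_classes, H, W)

model = SimpleSegmentationModel()
first_result = model(torch.randn(1, 3, 64, 64))   # unlabeled first image sample
second_result = model(torch.randn(1, 3, 64, 64))  # labeled second image sample
print(first_result.shape)                         # torch.Size([1, 5, 64, 64])
```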
The pre-training process of the initial image segmentation model is explained next. To improve the segmentation accuracy of the image segmentation model, the server can pre-train the initial image segmentation model F on a public dataset (such as the COCO dataset); the initial image segmentation model during pre-training is denoted F_0, where F_0 has the same model structure as F but different model parameters. During the model pre-training of F_0, the training loss can be determined by a preset loss function (such as a cross-entropy loss function), and the parameters of F_0 are updated based on the training loss. Let the image sample set from the public dataset used for model pre-training be D, where an input image is x_i ∈ R^{H×W}, its corresponding segmentation label is y_i, and the output predicted segmentation result is ŷ_i; H and W are the height and width of the image, and M represents the number of image samples in the image sample set D. The mapping relation of the initial image segmentation model F or F_0 is shown as formula (1):

ŷ_i = h_θ(g_θ(f_θ(x_i)))    (1)
where θ is the model parameter of the initial image segmentation model, f_θ is the encoding layer, g_θ is the decoding layer, and h_θ is the classification layer. f_θ(x_i) denotes the encoding layer f_θ in the image segmentation model encoding the image sample x_i to obtain a compressed image, and g_θ(f_θ(x_i)) denotes the decoding layer g_θ in the image segmentation model decoding the compressed image to obtain a reconstructed image of the same size as the image sample x_i; the reconstructed image is then semantically mapped by the embedding layer to obtain a semantic image, and h_θ(g_θ(f_θ(x_i))) denotes the classification layer h_θ in the image segmentation model performing a pixel-by-pixel segmentation operation on the semantic image to obtain the segmentation result. For each category in the image sample, every pixel in the semantic image corresponds to one probability value, and the category indicated by the maximum probability value is taken as the target category to which the pixel belongs. The loss of the initial image segmentation model can be determined by one or more preset loss functions; since the image samples in the public dataset are annotated samples, the loss functions during model pre-training can include the cross-entropy loss function, that is, the training loss L of the initial image segmentation model during model pre-training is determined based on the cross-entropy loss function, as given by the following formula:

L = Σ_{i=1}^{M} CE(ŷ_i, y_i)
where CE(·) is the cross-entropy loss function, ŷ_i is the segmentation result obtained by the initial image segmentation model after image segmentation of image sample x_i, and y_i is the standard segmentation result corresponding to image sample x_i. Through multiple rounds of such training, an initial image segmentation model with basic image segmentation capability is obtained. It should be noted that the model structure of the initial image segmentation model may also take other forms, which is not limited by the embodiments of the present application.
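The pre-training of F_0 described above can be sketched as a standard supervised loop: predict ŷ_i = h_θ(g_θ(f_θ(x_i))) and minimize the cross-entropy against the standard segmentation image. The optimizer, learning rate and toy data below are assumptions; the sketch reuses the SimpleSegmentationModel defined in the earlier example.

```python
# Illustrative pre-training loop for F_0 on an annotated public dataset
# (optimizer, learning rate and batching are assumptions, not from the patent).
import torch
import torch.nn.functional as F

model_f0 = SimpleSegmentationModel(num_classes=5)           # sketch model from above
optimizer = torch.optim.Adam(model_f0.parameters(), lr=1e-3)

def pretrain_step(images, labels):
    """images: (N, 3, H, W); labels: (N, H, W) ground-truth class per pixel."""
    logits = model_f0(images)                # y_hat_i = h(g(f(x_i))), as in formula (1)
    loss = F.cross_entropy(logits, labels)   # CE between prediction and standard segmentation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One toy step with random data standing in for a public-dataset batch.
images = torch.randn(2, 3, 64, 64)
labels = torch.randint(0, 5, (2, 64, 64))
print(pretrain_step(images, labels))
```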
In actual implementation, after the server obtains the initial image segmentation model F through pre-training on the public dataset, it can continue formal model training of the initial image segmentation model F with the joint training sample set. In each round of training, a first loss arises in the segmentation process performed on the first image sample in the joint training sample set by the initial image segmentation model F, and a second loss arises in the segmentation process performed on one or more second image samples in the joint training sample set by the initial image segmentation model. The first loss is used to characterize the value of the first loss function, and the second loss is used to characterize the value of the second loss function. The server may train the initial image segmentation model F based on the first loss and the second losses corresponding to each joint training sample in the joint training sample set, finally obtaining the target image segmentation model.
In some embodiments, since the outputs of the encoding layer and decoding layer of the initial image segmentation model, pre-trained on a large number of annotated public datasets, are highly accurate, in order to reduce the consumption of the server's computing resources and improve the efficiency of model training on the actual project dataset of the target field, the server can perform fine-tuning training on only the classification layer of the initial image segmentation model using the multiple second image samples in the joint training sample set, obtaining an image segmentation model F_1 suitable for the target field. F_1 has the same model structure as F, with the same parameters of the encoding layer and decoding layer and different parameters of the classification layer. Then, to improve the segmentation accuracy of F_1 for a small number of first image samples in the target field (such as image samples in which the target object is deformed), F_1 can be trained again based on the first image samples. The training method can be as follows: the server fixes the encoding layer f_θ and decoding layer g_θ of F_1, and performs t rounds of iterative updating of the classification layer h_θ of F_1 based on the first image sample x. At this point the image segmentation model F_1 is updated to F_2, where F_2 and F_1 have the same model structure, the same model parameters of the encoding layer and decoding layer, and different model parameters of the classification layer. Here t is a positive integer greater than zero; for the first image sample x, training of the classification layer h_θ ends when the similarity between the t-th prediction result and the (t-1)-th prediction result reaches a similarity threshold, or when the two are identical. During the iterative updating of the classification layer h_θ, the training loss includes the second losses corresponding to the multiple second image samples and the first loss corresponding to the first image sample. Illustratively, starting from the initial classification layer h_θ, the iterative updating proceeds as follows. First, each second image sample x_i is predicted to obtain its second segmentation result, and the second loss of each round is computed from the difference between that result and the segmentation label. The model also predicts the first segmentation result of the first image sample x, and during iterative training the first loss of each round is computed from the prediction results, where y_{j,k} denotes the prediction result of the k-th pixel obtained at the j-th iteration (that is, the probability value that the k-th pixel belongs to the corresponding category), and y'_{j-1} denotes the corresponding prediction result of the model at the (j-1)-th iteration. The server trains the image segmentation model F_2 through the first loss and the second loss to obtain the target image segmentation model.
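The classification-layer fine-tuning described above can be sketched as follows: the encoder f_θ and decoder g_θ are frozen and only the embedding/classification parameters are updated (the embedding layer is treated here as part of the classification layer, as the patent allows), combining the supervised second loss on labeled second image samples with an unsupervised term on the unlabeled first image sample. The consistency term shown here, comparing the current prediction with the previous iteration's prediction, is a simplified, assumed stand-in for the first loss and not the patent's exact formula.

```python
# Illustrative fine-tuning of only the classification layer h_theta of F_1
# (encoder/decoder frozen). The unlabeled-consistency term is a simplified
# stand-in for the patent's first loss; details are assumptions.
import torch
import torch.nn.functional as F

model_f1 = SimpleSegmentationModel(num_classes=5)            # sketch model from above
for p in model_f1.encoder.parameters():
    p.requires_grad = False                                  # fix f_theta
for p in model_f1.decoder.parameters():
    p.requires_grad = False                                  # fix g_theta
ft_optimizer = torch.optim.SGD(
    list(model_f1.embedding.parameters()) + list(model_f1.classifier.parameters()), lr=1e-3)

prev_first_probs = None                                      # prediction of iteration t-1

def finetune_step(first_image, second_images, second_labels):
    global prev_first_probs
    # Second loss: supervised cross entropy on labeled second image samples.
    second_logits = model_f1(second_images)
    second_loss = F.cross_entropy(second_logits, second_labels)

    # First loss (simplified stand-in): keep the unlabeled prediction consistent
    # with the previous iteration's prediction, pixel by pixel.
    first_probs = F.softmax(model_f1(first_image), dim=1)
    if prev_first_probs is None:
        first_loss = torch.zeros(())
    else:
        first_loss = F.mse_loss(first_probs, prev_first_probs)
    prev_first_probs = first_probs.detach()

    loss = second_loss + first_loss
    ft_optimizer.zero_grad()
    loss.backward()
    ft_optimizer.step()
    return second_loss.item(), first_loss.item()
```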
It should be noted that, the initial image segmentation model obtained based on the pre-training of the public data set may also be migrated to a specific target field (such as device detection in industrial automation and focus detection in medical images), and the initial image segmentation model is retrained based on the target image data set of the target field to obtain an image segmentation model suitable for the target field, where the target image data set includes N (N is greater than or equal to 1 and N is a positive integer) labeled images, and each image carries a segmentation label. It will be appreciated that the pre-training process for the initial image segmentation model is based on a public dataset, and the retraining for the pre-trained initial image segmentation model is based on a target image dataset of the target domain, i.e. the two training processes differ mainly in the source of the image samples. It can be also understood that the initial image segmentation model trained on the public data set is migrated to the image data set in the target field for use, so that the training efficiency of the image segmentation model in the target field can be greatly improved, and the effect of quickly acquiring the trained image segmentation model can be obtained.
The first segmentation result and the second segmentation result are described below. The first segmentation result is a segmented image of the first image sample, also called a segmentation mask; regions of the same category in the segmentation mask use the same visual characteristics (for example, regions of the same category are identified with the same color). Illustratively, assuming that only the foreground and background of the image are segmented, two colors can be seen on the segmentation mask, such as black for the background and white for the foreground. If different categories in the image are to be distinguished, each category has its own contour and area in the segmentation mask (or area on the segmentation mask, or area ratio). That is, segmentation results with visual effects are usually distinguished by different colors, while the segmentation results read by the electronic device are usually handled at pixel granularity. For example, an H×W image may be regarded as a pixel matrix of H rows and W columns, in which each pixel contains a value representing a color; correspondingly, the segmentation result corresponding to an image of that size may also be regarded as a pixel matrix of H rows and W columns. For convenience of calculation, the related data of a pixel may be stored in multiple data forms (such as a class, a structure, or a tuple); the following tuple may be adopted: {the row of the pixel, the column of the pixel, the target probability value of the pixel, the category to which the pixel belongs, the color of the category of the pixel}, where the target probability value of the pixel is used to represent the probability that the pixel belongs to the corresponding category. For example, if the segmentation result indicates that the image includes 5 categories, each pixel has 5 probability values, each corresponding to the probability that the pixel belongs to the respective category, the sum of the 5 probability values is 1, and the maximum of the 5 probability values is taken as the target probability value of the pixel in the tuple.
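The per-pixel tuple described above ({row, column, target probability value, category, category color}) might be represented as in the following sketch; the concrete types, the palette and the way the tuple is filled from a probability array are illustrative assumptions.

```python
# Illustrative per-pixel record matching the tuple described above:
# {row, column, target probability value, category, category color}.
from dataclasses import dataclass
from typing import Tuple
import numpy as np

@dataclass
class PixelRecord:
    row: int
    col: int
    target_prob: float            # maximum probability over the categories
    category: int                 # index of the category with that probability
    color: Tuple[int, int, int]   # display color assigned to that category

PALETTE = {0: (0, 0, 0), 1: (255, 255, 255), 2: (255, 0, 0), 3: (0, 255, 0), 4: (0, 0, 255)}

def to_pixel_records(probs: np.ndarray):
    """probs: (num_classes, H, W) per-pixel class probabilities summing to 1."""
    target_prob = probs.max(axis=0)
    category = probs.argmax(axis=0)
    records = []
    for r in range(probs.shape[1]):
        for c in range(probs.shape[2]):
            cat = int(category[r, c])
            records.append(PixelRecord(r, c, float(target_prob[r, c]), cat, PALETTE[cat]))
    return records
```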
In step 103, a probability map of the first segmentation result in the pixel dimension is obtained, and a value of the first loss function is determined based on the probability map, where the probability map is used to indicate a probability that each pixel in the first image sample belongs to each segmentation region in the first segmentation result.
It should be noted that, after the server obtains the initial image segmentation model through pre-training on the public dataset, a first loss arises in the segmentation process performed on the first image sample in the joint training sample set by the initial image segmentation model, and a second loss arises in the segmentation process performed on one or more second image samples in the joint training sample set by the initial image segmentation model. The first loss is used to characterize the value of the first loss function, and the second loss is used to characterize the value of the second loss function. The server may train the initial image segmentation model based on the first loss and second losses included in each sample of the joint training sample set. The first loss function and the first loss are described in detail in this step, and the second loss function and the second loss are described in the subsequent step 104.
In practical implementation, since the first image sample does not carry the segmentation tag, the accuracy of the area of each category shown in the first segmentation result of the first image sample is low, and in order to obtain the more accurate area corresponding to the category, the server may determine the first loss of the first image sample based on the probability map corresponding to the first segmentation result.
Describing the probability map: for the first segmentation result, the server determines the probability map of the current first segmentation result from the pixel dimension, where the probability map has the same size as the first segmentation result. As can be seen from the foregoing description, each pixel in the first segmentation result has a corresponding target probability value; the server obtains the corresponding target probability value pixel by pixel, thereby forming the probability map corresponding to the first segmentation result. Referring to FIG. 6, which is an illustration of a probability map provided by an embodiment of the present application, the figure shows a probability map of size 5×5 in which the value at each pixel is the probability that the pixel belongs to its target class. It should be noted that each probability value in the current probability map is specific to a category of the image, and the probability values in the probability map are completely independent of one another. The server obtains the first loss function based on the pixel dimension and determines the value of the first loss function, that is, the first loss, based on each probability value in the probability map. It will be appreciated that the first loss function is used to constrain the area of the corresponding class in the first segmentation result from the pixel dimension; the smaller the value of the first loss function, the more accurate the area of the region for each class in the segmentation result.
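The probability map described above can be obtained from the per-pixel class probabilities by keeping, for every pixel, the maximum probability over the categories, which yields a map of the same size as the segmentation result. The following sketch assumes softmax outputs of an arbitrary model; the shapes are illustrative.

```python
# Illustrative computation of the probability map described above: the value at
# each pixel is the maximum class probability (the probability of the target class).
import torch
import torch.nn.functional as F

logits = torch.randn(1, 5, 5, 5)                   # (N, num_classes, H, W), e.g. a 5x5 result
probs = F.softmax(logits, dim=1)                   # per-pixel probabilities over the classes
probability_map, target_class = probs.max(dim=1)   # both (N, H, W)

print(probability_map.shape)                       # same spatial size as the segmentation result
print(probability_map[0])                          # a 5x5 map like the one shown in FIG. 6
```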
Continuing with the description of the manner in which the value of the first loss function is determined, in some embodiments, referring to fig. 7, fig. 7 is a flowchart illustrating the manner in which the value of the first loss function is determined according to an embodiment of the present application, and based on fig. 3, the process of obtaining the value of the first loss function in step 103 based on the probability map may be implemented in steps 201 to 205.
In step 201, the server obtains a size of the first image sample, the size including a length and a width.
In practical implementation, the server reads the size (including the length and the width) of the first image sample. It should be noted that, from the pixel dimension, the first image sample is actually a pixel matrix: the number of rows of the pixel matrix corresponds to the height of the first image sample, and the number of columns corresponds to its width. For example, a first image sample of size H×W can be regarded as a pixel matrix of H rows and W columns.
Step 202, a historical probability map corresponding to the first image sample is obtained.
In practical implementation, in order to obtain the first loss of the first segmentation result in the current training round of the initial image segmentation model, the probability map of the first image sample obtained in the previous training of the initial image segmentation model, i.e. the historical probability map, needs to be obtained, so that the area of each category's region in the first segmentation result can be made more accurate from the pixel dimension. The manner in which the historical probability map is obtained depends on the training round of the initial image segmentation model.
Referring to fig. 8, fig. 8 is a schematic diagram of a historical probability map acquisition method according to an embodiment of the present application, which is implemented in conjunction with steps 2021-2022 shown in fig. 8.
In step 2021, when the training number of the initial image segmentation model is the first, the server performs random interference on the probability map to obtain a random probability map as a historical probability map.
In practical implementation, a probability map corresponding to the first segmentation result of the first image sample is obtained through the first round of training of the initial image segmentation model. In order to determine the value of the first loss function corresponding to this first segmentation result, a perturbation factor can be determined in a random manner (for example, scaling down, amplifying the probability values of odd rows, reducing the probability values of even rows, or other perturbation modes), so as to obtain a probability map different from the one obtained in the first round of training; random perturbation is applied, based on these factors, to the probability values of the pixels in the probability map corresponding to the first segmentation result, so that a random probability map is obtained as the historical probability map.
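A non-authoritative sketch of step 2021 follows. The perturbation modes only illustrate the examples listed above, and the concrete scaling factors are assumptions:

```python
import numpy as np

def make_historical_map_first_round(prob_map: np.ndarray, seed=None) -> np.ndarray:
    """Randomly perturb the first-round probability map to obtain a 'historical' map.

    prob_map: H x W array of per-pixel target-class probabilities.
    The requirement is only that the result differs from the first-round map;
    the specific modes and factors below are illustrative.
    """
    rng = np.random.default_rng(seed)
    perturbed = prob_map.copy()
    mode = rng.integers(0, 3)
    if mode == 0:                       # scale the whole map down
        perturbed *= rng.uniform(0.8, 0.95)
    elif mode == 1:                     # amplify the probability values of odd rows
        perturbed[1::2, :] *= rng.uniform(1.05, 1.2)
    else:                               # reduce the probability values of even rows
        perturbed[0::2, :] *= rng.uniform(0.8, 0.95)
    return np.clip(perturbed, 0.0, 1.0)
```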
In step 2022, when the number of training rounds of the initial image segmentation model is not the first round, the probability map corresponding to the first image sample obtained by training the initial image segmentation model in the previous round is obtained as the historical probability map.
In actual implementation, after the server performs non-first-round training on the initial image segmentation model, the server can acquire a probability map of a first image sample obtained by the previous round of training of the current training round number as a historical probability map.
Step 203, determining a first area occupied by the foreground in the first image sample in the probability map based on the size and the probability map.
In practical implementation, since the first image sample is used, during training, for constraining the area of each category in the segmentation result, a generalized distinction between categories may be adopted: the first segmentation result may distinguish only the foreground and the background, without further distinguishing the specific categories within the foreground. The server can determine, through the probability map and its size, the area occupied in the probability map by the foreground of the first image sample (i.e. the first area); that is, the obtained area size of the foreground is the area, in the probability map, of all regions in the segmentation result other than the background.
Referring to fig. 9, fig. 9 is a flowchart of a first area determining method according to an embodiment of the present application, and the steps 2031-2032 shown in fig. 9 are described.
In step 2031, the server normalizes the probability map by combining the size of the size to obtain a standard probability map.
In practical implementation, since the probability value of each pixel in the probability map is determined independently, there is no correlation between the probability values of the pixels; in order to compute the value of the loss function in a unified area space, the server may establish the association between pixels by normalization. The server performs a normalization operation, combined with the size of the image, on the probability map whose per-pixel probability values are uncorrelated, to obtain a standard probability map; the sum of the new probability values (transformed from the original probability values) of all pixels in the standard probability map is equal to 1, and the area of the standard probability map can be understood as 1. The specific normalization mode is as follows:
In the above formula, H×W is the size of the probability map, i.e., it contains H rows and W columns, and p_{j,k} is the probability value of the pixel in row j and column k of the probability map. It will be appreciated that the normalization operation on the probability map is performed pixel by pixel. The sum of the new probability values of all pixels in the normalized standard probability map is 1. Referring to fig. 10, fig. 10 is a schematic diagram of normalized results provided by an embodiment of the present application.
Step 2032, determining, based on the standard probability map, an area occupied by the foreground in the standard probability map as a first area.
In practical implementation, according to the standard probability map obtained by the normalization method, normalized probability values of pixel points belonging to the foreground in the standard probability map are obtained, and summed to obtain the area occupied by the foreground in the standard probability map, namely the first area.
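A minimal sketch of steps 2031-2032 is given below. It assumes that "normalizing in combination with the size" means dividing each probability value by H×W so that the whole standard probability map has area 1; that reading, and the boolean foreground mask used as input, are assumptions for illustration:

```python
import numpy as np

def standard_probability_map(prob_map: np.ndarray) -> np.ndarray:
    """Normalize an H x W probability map so that the whole map has 'area' 1.

    Assumption: every value is divided by H * W, so the per-class areas sum to 1
    when the per-pixel class probabilities sum to 1.
    """
    h, w = prob_map.shape
    return prob_map / float(h * w)

def foreground_area(prob_map: np.ndarray, foreground_mask: np.ndarray) -> float:
    """Step 2032: sum the normalized probabilities of the foreground pixels."""
    std_map = standard_probability_map(prob_map)
    return float(std_map[foreground_mask].sum())
```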
Step 204, determining a second area occupied by the foreground in the historical probability map based on the size of the dimension and the historical probability map.
Accordingly, in some embodiments, the server may also determine the second area by: the server normalizes the historical probability map by combining the size of the size to obtain a historical standard probability map; and determining the area occupied by the foreground in the historical standard probability map as a second area based on the historical standard probability map.
In practical implementation, according to the normalization method, a historical standard probability map corresponding to the historical probability map can be obtained, normalized probability values of pixels belonging to the foreground in the historical standard probability map are obtained, and summation is carried out, so that the area occupied by the foreground in the historical standard probability map, namely the second area, can be obtained. The first area is different from the second area.
In step 205, a value of a first loss function is determined based on the first area and the second area.
In practical implementation, the server determines the value of the first loss function through the ratio of the first area of the foreground to the second area of the foreground, so that the ratio tends to be stable through multiple iterations as much as possible, and the accurate area of each category is obtained. It should be noted that, the formula for determining the value of the first loss function (i.e., the first loss) is as follows:
L_q = p_j · log(p_j / p_{j-1})    (4)

In the formula, j is the training round number, j ≥ 2 and j is an integer; p_j is the first area of the foreground obtained in the j-th round, and p_{j-1} is the second area of the foreground obtained in the (j-1)-th round (the previous round). During training of the model, p_j and p_{j-1} are made approximately equal as far as possible. In the actual calculation, the probability values of the pixels belonging to the background may be set to zero. It will be appreciated that L_q is used to finely correct the area of each category in the probability map against the segmentation prediction of the previous round; through continuous iteration, segmentation that is accurate in the pixel dimension is approached as far as possible, and the accurate area of each category in the probability map is determined as far as possible.
Continuing with the description of the manner in which the value of the first loss function is determined, in some embodiments, the server may determine the value of the first loss function by: the server obtains the ratio of the first area to the second area; logarithm is carried out on the ratio to obtain a logarithm result; the product of the log result and the first area is determined as the value of the first loss function.
In practical implementation, the server determines the value of the first loss function according to the above formula (4): it first obtains the ratio of the first area to the second area, then takes the logarithm (log) of the ratio, and multiplies the result of the log operation by the first area to obtain the corresponding value of the first loss function.
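Combining formula (4) with the steps just described, a hedged sketch of the first loss for the foreground could read as follows; the small epsilon is added purely for numerical safety and is not part of the original formula:

```python
import math

def first_loss(first_area: float, second_area: float, eps: float = 1e-8) -> float:
    """L_q = first_area * log(first_area / second_area), following formula (4).

    first_area:  foreground area in the current round's standard probability map (p_j)
    second_area: foreground area in the historical standard probability map (p_{j-1})
    eps:         numerical-safety constant (an addition, not in the original formula)
    """
    ratio = (first_area + eps) / (second_area + eps)
    return first_area * math.log(ratio)
```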
Continuing with the description of the manner in which the value of the first loss function is determined, in some embodiments, referring to fig. 11, fig. 11 is another schematic diagram of the manner in which the value of the first loss function is determined according to an embodiment of the present application, and is described in connection with steps 301-306 shown in fig. 11.
In step 301, the server obtains a size of the first image sample, the size including a length and a width.
In practical implementation, as can be seen from the foregoing description, the size of the first image sample identifies the number of rows and columns of the pixel matrix: for a first image sample of size H×W, the pixel matrix has H rows and W columns.
Step 302, a historical probability map corresponding to the first image sample is obtained.
In actual implementation, the server acquires the historical probability map of the first image sample according to the acquisition mode of the historical probability map.
Step 303, determining a third area occupied by each category in the probability map based on the size of the dimension and the probability map.
In practical implementation, the foreground of the image may further include a plurality of different categories, each category corresponding to the pixels of one region. Thus, the server may also determine the area that each category occupies in the probability map (i.e., the third area) based on the size and the probability map. The manner of obtaining the third area of each category based on the size and the probability map is similar to the foregoing step 203, and will not be described herein.
Step 304, determining a fourth area occupied by each category in the historical probability map based on the size of the dimension and the historical probability map.
In practical implementation, the foreground of the image may further include a plurality of different categories, each category corresponding to the pixels of one region. Thus, the server may also determine the area that each category occupies in the historical probability map (i.e., the fourth area) based on the size and the historical probability map. The manner of obtaining the fourth area of each category based on the size and the historical probability map is similar to the foregoing step 204, and will not be described herein.
Step 305, determining a sub-value of each category relative to the first loss function based on the third area and the corresponding fourth area of each category.
In practical implementation, the server obtains a value (which may be called a sub-value) of each category relative to the first loss function according to the above formula (4), according to the third area of each category in the probability map, and the fourth area of each category in the historical probability map, where one category corresponds to one sub-value of the first loss function. For example, assuming that there are 5 categories of first segmentation results, there are 5 sub-values of the first loss function.
Step 306, obtaining the weight of each category, and carrying out weighted summation based on the sub-value of each category, and taking the weighted summation result as the value of the first loss function.
In practical implementation, the server obtains the sub-value of the corresponding first loss function for each category, obtains the weight set for each category according to practical situations (in this way, a larger weight can be set for the category needing special attention), and then performs weighted summation on the sub-value of each category according to the weight, and takes the weighted summation result as the value of the first loss function, and it should be noted that the weight of each category can be set to 1.
In some embodiments, the server may determine the sub-value of each category in the first image sample relative to the first loss function by: for each category, the server performs the following processing: the server determines the ratio of the third area to the corresponding fourth area of the current category; and carrying out logarithm on the comparison value, determining the product of the logarithm result and the third area of the category, and taking the product as the sub-value of the current category relative to the first loss function.
In actual implementation, the server determines, for each category, the ratio of its third area to its fourth area, then takes the logarithm of the ratio, and multiplies the logarithm result by the third area of the category to obtain the sub-value of that category relative to the first loss function.
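For the multi-category case in steps 303-306, the per-category sub-values and their weighted sum could be sketched as follows, reusing the hypothetical first_loss helper sketched earlier; the default weight of 1 per category follows the description, everything else is illustrative:

```python
def first_loss_multiclass(third_areas: dict, fourth_areas: dict, weights: dict = None) -> float:
    """Weighted sum of the per-category sub-values of the first loss (steps 305-306).

    third_areas:  {category: area occupied in the current probability map}
    fourth_areas: {category: area occupied in the historical probability map}
    weights:      optional {category: weight}; defaults to 1 for every category.
    """
    total = 0.0
    for cat, third in third_areas.items():
        sub_value = first_loss(third, fourth_areas[cat])   # sub-value for this category
        total += (weights or {}).get(cat, 1.0) * sub_value
    return total
```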
In step 104, a difference between each second segmentation result and the corresponding segmentation label is obtained and a value of the second loss function is determined based on the difference.
In practical implementation, the second loss function is used for constraining the contour of each category in the segmentation result of the image, and the smaller the value of the second loss function is, the more accurate the contour of each category in the segmentation result of the image is. The second loss function is a loss function used in a fully supervised (also called supervised) model training process, and is determined according to the segmentation result of the image sample carrying the segmentation labels (i.e. the second image sample) and the corresponding segmentation labels.
The manner of determining the value of the second loss function is described. In some embodiments, the server may determine the value of the second loss function by: the server obtains the cross entropy between each second segmentation result and the corresponding segmentation label; summing each cross entropy to obtain a summation result; the number of second image samples is obtained and a ratio of the sum result to the number of second image samples is determined as a value of the second loss function.
In practical implementation, since the second image sample carries the segmentation label, the training can be performed in a supervised manner in the training process of the image segmentation model, and therefore, a cross entropy function can be used as a loss function in training for the second image sample, and the server acquires the cross entropy between the second segmentation result of the second image sample and the segmentation label of the second image sample. Since the number of second image samples input each time can be one or more during one training round, the average value of the cross entropy corresponding to all the second image samples in the current training round can be determined as the value of the second loss function.
Illustratively, the value L_s of the second loss function over the second image samples is determined by the following formula:

L_s = (1/N) · Σ_{n=1}^{N} ce(ŷ_n, y_n)    (5)

where ŷ_n is the second segmentation result of the n-th second image sample, y_n is its segmentation label, N is the number of second image samples, N ≥ 1, and ce(·) is the cross-entropy loss.
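Under this reading of formula (5), the second loss could be sketched as the per-pixel cross-entropy averaged over the N labeled samples; the sketch assumes the segmentation results are per-pixel class probabilities and the labels are class-index maps:

```python
import numpy as np

def cross_entropy(pred_probs: np.ndarray, label: np.ndarray, eps: float = 1e-8) -> float:
    """Mean per-pixel cross-entropy for one second image sample.

    pred_probs: H x W x C predicted class probabilities (the second segmentation result).
    label:      H x W ground-truth class indices (the segmentation label).
    """
    h, w, _ = pred_probs.shape
    picked = pred_probs[np.arange(h)[:, None], np.arange(w)[None, :], label]
    return float(-np.log(picked + eps).mean())

def second_loss(pred_list, label_list) -> float:
    """L_s: cross-entropy averaged over the N second image samples, as in the text."""
    losses = [cross_entropy(p, y) for p, y in zip(pred_list, label_list)]
    return float(sum(losses) / len(losses))
```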
In step 105, the initial image segmentation model is trained to obtain a target image segmentation model in combination with the values of the first loss function and the values of each second loss function.
In practical implementation, the server can restrict the area of each category in the segmentation result of the first image sample without the segmentation tag based on the first loss function, so that the loss of the area is close to zero, and can restrict the contour of each category in the segmentation result of the second image sample with the segmentation tag based on the second loss function, so that the loss of the contour is close to zero, therefore, the server can combine the value of the first loss function and the value of the second loss function in the training process of the initial image segmentation model, and jointly train the initial segmentation model, so that accurate segmentation can be realized even under the condition that the image sample is less (small sample).
Describing the joint training process of the initial image segmentation model, in some embodiments, referring to fig. 12, fig. 12 is a schematic diagram of a training process method of the initial image segmentation model according to an embodiment of the present application, and based on fig. 3, step 105 may be implemented by steps 1051-1053.
In step 1051, the server obtains the weights of the first loss functions and the weights of each second loss function.
In practical implementation, for each round of training of the initial image segmentation model, the input includes at least one second image sample and one first image sample, so each second image sample has a corresponding value of the second loss function in the training process, and a corresponding weight can be set for the first loss function and each second loss function. Assuming N second image samples (N ≥ 1, N an integer), one training round produces the values of N second loss functions (which may be referred to as second losses) and the value of one first loss function (which may be referred to as the first loss). These N+1 losses each have a corresponding weight; when N = 1, the weight of the first loss and that of the second loss may be equal, and the weights of the N second losses may also be equal. The server may determine the total loss of the current training round based on each loss and the corresponding weight.
The manner of determining the weight of the first loss function is described. In some embodiments, the server may determine the weight of the first loss function as follows: the server acquires the initial weight of the first loss function and the number of second image samples, and adjusts the initial weight according to this number to obtain the weight of the first loss function; the weight of the first loss function is inversely related to the number.
In practical implementation, the server acquires the initial weight of the first loss function and adjusts it in combination with the number of second image samples to obtain the weight of the first loss function. It can be understood that, during training, the weight of the first loss function is associated with the number of second image samples: the weight of the value of the first loss function decreases as the number of second image samples increases. The larger the number of second image samples, the larger the proportion taken by the second loss functions, so that the values of the second loss functions (which influence contour accuracy) dominate the segmentation, while the value of the first loss function, which constrains the area, serves as an aid.
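The text does not give a closed form for this inverse relationship; one plausible, clearly hypothetical choice is:

```python
def first_loss_weight(initial_weight: float, num_second_samples: int) -> float:
    """Assumed scheme: scale the initial weight down as the number of labeled samples grows."""
    return initial_weight / max(1, num_second_samples)
```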
Step 1052, performing weighted summation on the value of the first loss function and the value of each second loss function based on the weight of the first loss function and the weight of each second loss function, to obtain a weighted summation result.
In practical implementation, the server combines the weight of the first loss function and the weight of each second loss function to perform weighted summation on the value of the first loss function and the value of each second loss function, and a weighted summation result is obtained as the total loss of each training.
Illustratively, assuming that each training round inputs N second image samples and one first image sample, the value L_q of the first loss function is determined based on the above formula (4), and the value L_s of the second loss function is determined based on the above formula (5). One possible formula for determining the total loss L_t is as follows:

In the formula, the coefficient applied to the first loss function is its weight, and 1 is the weight of each second loss function.
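Putting the pieces together, the weighted total loss of one training round might be sketched as follows; the symbol for the first-loss weight is missing from the extracted formula, so the first_weight parameter here is an assumption, while the weight of 1 per second loss follows the description:

```python
def total_loss(first_loss_value: float, second_loss_values: list, first_weight: float) -> float:
    """L_t: weighted first loss plus the N second losses (each weighted by 1)."""
    return first_weight * first_loss_value + sum(second_loss_values)
```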
Step 1053, updating the model parameters of the initial image segmentation model based on the weighted summation result to obtain the target image segmentation model.
In actual implementation, in each round of training the server determines the total loss L_t (i.e. the weighted summation result) through the formula (6), and updates the model parameters of the initial image segmentation model in reverse (by back-propagation), i.e. trains the initial image segmentation model to obtain the target segmentation model. The training of the target segmentation model can be completed with only a small amount of data annotation (fewer than 10 annotated images, and at least only 1 annotated image).
In some embodiments, the server may implement an image segmentation operation for an image to be segmented by: the server acquires an image to be segmented; and inputting the image to be segmented into a target image segmentation model, and outputting a segmentation result obtained by segmentation.
In practical implementation, the server trains the initial image segmentation model through a small amount of second image samples carrying the segmentation labels and first image samples not carrying the segmentation labels in the target field, and after the training of the target image segmentation model is completed, the server receives the image to be segmented in the target field and can obtain a corresponding segmentation result through the target image segmentation model.
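At inference time, the trained model is simply applied to the new image; a minimal, framework-agnostic sketch with an assumed model API:

```python
def segment_image(target_model, image):
    """Apply the trained target image segmentation model to an image to be segmented.

    target_model: callable returning an H x W x C map of class probabilities (assumed API).
    Returns the per-pixel class-index map, i.e. the segmentation mask.
    """
    probs = target_model(image)
    return probs.argmax(axis=-1)
```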
In the field of industrial vision automation detection, the server receives an image to be segmented that contains a target device and exhibits deformation. The server acquires a pre-stored target image segmentation model trained on a small number of second images in this field carrying annotation information (i.e. segmentation labels) and on first images that are not deformed and not annotated, identifies the received image to be segmented, and can accurately annotate the contour and region of the target device on the segmentation result (segmentation mask).
By applying the embodiment of the application, first, based on the idea of transfer learning, model pre-training is carried out on the initial image segmentation model on a large-scale data set; then, fine-tuning training is carried out on the pre-trained model with a small amount of actual project data, which improves the efficiency of training the model. At the same time, the second loss determined from at least one second image sample carrying a segmentation label is constrained by the first loss determined from the first image sample not carrying a segmentation label, and the initial image segmentation model is jointly trained, so that high region segmentation performance can be achieved with only a small amount of data, and the positioning of the target object in the image is improved in scenes where the target object is deformed.
In the following, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
In the related art, target positioning in an image is achieved by a target positioning method based on image matching and a semantic segmentation positioning method based on deep learning. In the target positioning method based on image matching, the contour of the target object is extracted as the matching feature, and the contour extraction is obtained by calculating and analyzing the image gradient; as a result, this method suffers from poor positioning accuracy in actual complex situations. The semantic segmentation positioning method based on deep learning has high positioning accuracy in the face of complex conditions, but requires a large amount of image data annotation, and the annotation cost is high.
Based on the above, the embodiment of the application provides a training method for an image segmentation model, which is also a region positioning method based on small-sample semantic segmentation. The method can achieve positioning in scenes where the target object is deformed, and only a small amount of data annotation is needed (fewer than 10 annotated images, and at least 1 annotated image). First, based on the idea of transfer learning, the model (i.e., the image segmentation model described above) is pre-trained on a large-scale data set; then the pre-trained model is fine-tuned using a small amount of real-project (application-specific) data (i.e., small samples). In addition, the embodiment of the application provides 2 losses for small-sample fine-tuning training, so that the network can achieve high region segmentation performance with only a small amount of data.
Next, a description from the product side. In current industrial vision automation detection systems, the accurate positioning of the target device plays a key role in the subsequent detection, identification, classification, and measurement steps of the vision detection system, so the target device positioning module is widely used as a basic module. The technology can be applied to the stable positioning of the target device under illumination change, rotation change, and deformation scenes, and only a small amount of data needs to be annotated. The image segmentation model in the embodiment of the application, serving as a region positioning model based on small-sample semantic segmentation, is first pre-trained on a public data set and can achieve high segmentation accuracy even when the amount of training data in the actual application is very small. When using the small-sample segmentation model, the user only needs to collect a small amount of annotation data (fewer than 10 images, and at least only 1 image) for training, and the product can complete the segmentation and positioning of the image target object online.
Next, the model pre-training and model training of the image segmentation model will be described, taking the model structure of the image segmentation model shown in fig. 5B and the image segmentation model F shown in the foregoing formula (1) as examples.
First, in the model pre-training stage, an image sample set is selected from the public COCO data set, and the image segmentation model F is pre-trained. In the pre-training process, since the image samples in the image sample set carry segmentation labels, the cross-entropy loss of each image sample can be determined using formula (2), and semantic segmentation training is performed on the image segmentation model. Through this pre-training, an initial image segmentation model F = h_θ(g_θ(f_θ(x_i))) pre-trained on the public data set is obtained. Then, in the model training stage, the classification layer h_θ of the image segmentation model F is re-initialized to obtain an image segmentation model F_1; F and F_1 have the same model structure and the same encoding-layer and decoding-layer parameters, but different classification-layer parameters. The image sample set Support of the actual project is used to train h_θ of F_1, where the project data set Support contains N image samples x_i carrying segmentation labels (i.e., the second image samples in the foregoing). Each image sample x_i is segmented by the image segmentation model F_1 to obtain the corresponding segmentation result, from which the corresponding cross-entropy loss (i.e., the second loss in the foregoing) is obtained; the loss of the image sample set Support is calculated using the aforementioned formula (2). Then, the image Query to be segmented (i.e., the first image sample in the foregoing) is input into the image segmentation model F_1. In the image segmentation process, the classification layer h_θ of the image segmentation model F_1 is iteratively updated t times, thereby obtaining the trained image segmentation model. The loss L_query of each iteration (i.e., the first loss in the foregoing) can be obtained according to the aforementioned formula (4):

In the formula, y_{j,k} represents the prediction result of the k-th pixel obtained through the model of the j-th iteration, and y'_{j-1} is the corresponding result of the model of the (j-1)-th iteration.

Thus, in one training round, the total iteration loss of the image segmentation model is obtained by combining the Support loss and the per-iteration loss L_query; the image segmentation model is updated through this total loss to obtain the final image segmentation model.
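An extremely simplified view of this fine-tuning loop could look like the following PyTorch-style sketch. The model API, the optimizer choice, the loss wiring, and the restriction of updates to the classification-layer parameters are all assumptions, not the claimed procedure:

```python
import torch

def fine_tune_round(model, trainable_params, support_images, support_labels,
                    query_image, lam, steps, lr=1e-3):
    """Very simplified sketch of the fine-tuning round described above (all APIs assumed).

    model(x) is assumed to return per-pixel class logits of shape (C, H, W);
    trainable_params would typically be the classification-layer parameters only.
    """
    opt = torch.optim.SGD(trainable_params, lr=lr)
    ce = torch.nn.CrossEntropyLoss()
    prev_area = None
    for _ in range(steps):
        # supervised loss over the labeled Support samples (second loss)
        l_s = sum(ce(model(x).unsqueeze(0), y.unsqueeze(0))
                  for x, y in zip(support_images, support_labels)) / len(support_images)
        # area-based loss over the unlabeled Query image (first loss)
        probs = torch.softmax(model(query_image), dim=0)
        area = probs[1:].sum() / probs[0].numel()      # foreground area, normalized by H*W
        l_q = area * torch.log(area / prev_area) if prev_area is not None else area * 0.0
        prev_area = area.detach()
        loss = lam * l_q + l_s
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```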
By applying the embodiment of the application, first, based on the idea of transfer learning, model pre-training is carried out on the initial image segmentation model on a large-scale data set; then, fine-tuning training is carried out on the pre-trained model with a small amount of actual project data, which improves training efficiency. At the same time, the second loss determined from at least one second image sample carrying a segmentation label is constrained by the first loss determined from the first image sample not carrying a segmentation label, and the initial image segmentation model is jointly trained, so that high region segmentation performance can be achieved with only a small amount of data (fewer than 10 annotated images, and at least 1 annotated image), and the positioning of the target object in the image is improved in scenes where the target object is deformed.
Continuing with the description below of an exemplary architecture of the training apparatus 555 for an image segmentation model provided by embodiments of the present application implemented as a software module, in some embodiments, as shown in fig. 3, the software modules stored in the training apparatus 555 for an image segmentation model of the memory 550 may include:
An obtaining module 5551, configured to obtain a set of joint training samples, where the set of joint training samples includes a first image sample that does not carry a segmentation tag and at least one second image sample that carries a segmentation tag, and the segmentation tag is a standard segmentation image of the second image sample;
the initial segmentation module 5552 is configured to obtain an initial image segmentation model, perform image segmentation on the first image sample to obtain a first segmentation result through the initial image segmentation model, and perform image segmentation on each of the second image samples to obtain a second segmentation result;
a first determining module 5553, configured to obtain a probability map of the first segmentation result in a pixel dimension, and determine a value of a first loss function based on the probability map, where the probability map is used to indicate a probability that each pixel in the first image sample belongs to each segmentation region in the first segmentation result;
A second determining module 5554, configured to obtain differences between each of the second segmentation results and the segmentation labels, and determine a value of a second loss function based on the differences;
and the training module 5555 is configured to train the initial image segmentation model by combining the value of the first loss function and the value of each second loss function, so as to obtain a target image segmentation model.
In some embodiments, the first determining module is further configured to obtain a size of a dimension of the first image sample, the dimension including a length and a width; acquire a history probability map corresponding to the first image sample; determine a first area occupied by a foreground in the first image sample in the probability map based on the size and the probability map; determine a second area occupied by the foreground in the historical probability map based on the size and the historical probability map; and determine a value of a first loss function based on the first area and the second area.
In some embodiments, the first determining module is further configured to, when the number of training rounds of the initial image segmentation model is the first round, randomly interfere with the probability map, and obtain a random probability map as the historical probability map; when the training round number of the initial image segmentation model is not the first round, acquiring a probability map corresponding to the first image sample obtained by training the initial image segmentation model in the last round as the historical probability map.
In some embodiments, the first determining module is further configured to normalize the probability map in combination with the size of the dimension to obtain a standard probability map; and determine, based on the standard probability map, the area occupied by the foreground in the standard probability map as the first area;
In some embodiments, the first determining module is further configured to normalize the historical probability map in combination with the size of the size to obtain a historical standard probability map; and determine, based on the historical standard probability map, the area occupied by the foreground in the historical standard probability map as the second area.
In some embodiments, the first determining module is further configured to obtain a ratio of the first area to the second area; logarithm is carried out on the ratio to obtain a logarithm result; a product of the log result and the first area is determined as a value of a first loss function.
In some embodiments, the foreground of the first image sample comprises a plurality of categories, the first determining module is further configured to obtain a size of a dimension of the first image sample, the dimension comprising a length and a width; acquiring a history probability map corresponding to the first image sample; determining a third area occupied by each category in the probability map based on the size of the dimension and the probability map; determining a fourth area occupied by each category in the historical probability map based on the size and the historical probability map; determining a sub-value of each category relative to a first loss function based on the third area and the corresponding fourth area of each category; and acquiring the weight of each category, carrying out weighted summation based on the sub-value of each category, and taking the weighted summation result as the value of the first loss function.
In some embodiments, the first determining module is further configured to, for each of the categories, perform the following: determining a ratio of the third area and the corresponding fourth area of the category; and logarithming the ratio, determining the product of the logarithm result and the third area of the category, and taking the product as a sub-value of the category relative to a first loss function.
In some embodiments, the training module is further configured to obtain a weight of the first loss function and a weight of each of the second loss functions; based on the weight of the first loss function and the weight of each second loss function, carrying out weighted summation on the value of the first loss function and the value of each second loss function to obtain a weighted summation result; and updating model parameters of the initial image segmentation model based on the weighted summation result to obtain a target image segmentation model.
In some embodiments, the training module is further configured to obtain an initial weight of the first loss function, and a number of the second image samples; according to the number, the initial weight is adjusted to obtain the weight of the first loss function; wherein the weight of the first loss function is inversely related to the number.
In some embodiments, the second determining module is further configured to obtain cross entropy between each of the second segmentation results and the corresponding segmentation label; summing each cross entropy to obtain a summation result; the number of second image samples is acquired and a ratio of the sum result to the number of second image samples is determined as a value of a second loss function.
In some embodiments, the trained target image segmentation model is further used to obtain an image to be segmented; and inputting the image to be segmented into the target image segmentation model, and outputting a segmentation result obtained by segmentation.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the electronic device executes the training method of the image segmentation model according to the embodiment of the application.
An embodiment of the present application provides a computer-readable storage medium storing executable instructions that, when executed by a processor, cause the processor to perform a method for training an image segmentation model provided by an embodiment of the present application, for example, a method for training an image segmentation model as shown in fig. 7.
In some embodiments, the computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or it may be any of various devices including one or any combination of the above memories.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a hypertext markup language (HTML, hyper Text Markup Language) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one electronic device or on multiple electronic devices located at one site or, alternatively, on multiple electronic devices distributed across multiple sites and interconnected by a communication network.
In summary, the embodiment of the application has the following beneficial effects: first, based on the idea of transfer learning, model pre-training is carried out on the initial image segmentation model on a public large-scale data set; then, fine-tuning training is carried out on the pre-trained model with a small amount of actual project data, which improves the efficiency of training the model. At the same time, the second loss determined from at least one second image sample carrying a segmentation label is constrained by the first loss determined from the first image sample not carrying a segmentation label, and the initial image segmentation model is jointly trained, so that high region segmentation performance can be achieved with only a small amount of data (fewer than 10 annotated images, and at least 1 annotated image), and the positioning of the target object in the image is improved in scenes where the target object is deformed.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. A method of training an image segmentation model, the method comprising:
acquiring a joint training sample set, wherein the joint training sample set comprises a first image sample without a segmentation label and at least one second image sample with a segmentation label, and the segmentation label is a standard segmentation image of the second image sample;
acquiring an initial image segmentation model, performing image segmentation on the first image sample through the initial image segmentation model to obtain a first segmentation result, and performing image segmentation on each second image sample to obtain a second segmentation result;
acquiring a probability map of the first segmentation result in a pixel dimension, and determining a value of a first loss function based on the probability map, wherein the probability map is used for indicating the probability that each pixel in the first image sample belongs to each segmentation area in the first segmentation result;
acquiring differences between the second segmentation results and the segmentation labels, and determining values of second loss functions based on the differences;
and training the initial image segmentation model by combining the value of the first loss function and the value of each second loss function to obtain a target image segmentation model.
2. The method of claim 1, wherein the determining the value of the first loss function based on the probability map comprises:
acquiring a size of the first image sample, the size including a length and a width;
acquiring a history probability map corresponding to the first image sample;
determining a first area occupied by a foreground in the first image sample in the probability map based on the size and the probability map;
determining a second area occupied by the foreground in the historical probability map based on the size and the historical probability map;
a value of a first loss function is determined based on the first area and the second area.
3. The method of claim 2, wherein the obtaining the historical probability map corresponding to the first image sample comprises:
when the training round number of the initial image segmentation model is the first round, randomly disturbing the probability map to obtain a random probability map as the history probability map;
when the training round number of the initial image segmentation model is not the first round, acquiring a probability map corresponding to the first image sample obtained by training the initial image segmentation model in the last round as the historical probability map.
4. The method of claim 2, wherein the determining the first area occupied by the foreground in the probability map based on the size and the probability map comprises:
normalizing the probability map by combining the size of the dimension to obtain a standard probability map;
determining the area occupied by the foreground in the standard probability map as the first area based on the standard probability map;
the determining, based on the size of the dimension and the historical probability map, a second area occupied by the foreground in the historical probability map includes:
normalizing the historical probability map by combining the size of the dimension to obtain a historical standard probability map;
and determining the area occupied by the foreground in the historical standard probability map as the second area based on the historical standard probability map.
5. The method of claim 2, wherein the determining the value of the first loss function based on the first area and the second area comprises:
acquiring the ratio of the first area to the second area;
logarithm is carried out on the ratio to obtain a logarithm result;
A product of the log result and the first area is determined as a value of a first loss function.
6. The method of claim 1, wherein the foreground of the first image sample comprises a plurality of categories, the determining a value of a first loss function based on the probability map comprising:
acquiring a size of the first image sample, the size including a length and a width;
acquiring a history probability map corresponding to the first image sample;
determining a third area occupied by each category in the probability map based on the size of the dimension and the probability map;
determining a fourth area occupied by each category in the historical probability map based on the size and the historical probability map;
determining a sub-value of each category relative to a first loss function based on the third area and the corresponding fourth area of each category;
and acquiring the weight of each category, carrying out weighted summation based on the sub-value of each category, and taking the weighted summation result as the value of the first loss function.
7. The method of claim 6, wherein the determining a sub-value for each category relative to a first loss function based on the third area and the corresponding fourth area for each category comprises:
For each of the categories, the following processing is performed:
determining a ratio of the third area and the corresponding fourth area of the category;
and logarithming the ratio, determining the product of the logarithm result and the third area of the category, and taking the product as a sub-value of the category relative to a first loss function.
8. The method of claim 1, wherein said training the initial image segmentation model to obtain a target image segmentation model in combination with the values of the first loss function and the values of each of the second loss functions comprises:
acquiring the weight of the first loss function and the weight of each second loss function;
based on the weight of the first loss function and the weight of each second loss function, carrying out weighted summation on the value of the first loss function and the value of each second loss function to obtain a weighted summation result;
and updating model parameters of the initial image segmentation model based on the weighted summation result to obtain a target image segmentation model.
9. The method of claim 8, wherein the obtaining weights for the first loss function comprises:
Acquiring initial weights of the first loss function and the number of the second image samples;
according to the number, the initial weight is adjusted to obtain the weight of the first loss function;
wherein the weight of the first loss function is inversely related to the number.
10. The method of claim 1, wherein said obtaining the difference between each of the second segmentation results and the segmentation labels comprises:
acquiring cross entropy between each second segmentation result and the corresponding segmentation label;
the determining a value of a second loss function based on the difference includes:
summing each cross entropy to obtain a summation result;
the number of second image samples is acquired and a ratio of the sum result to the number of second image samples is determined as a value of a second loss function.
11. The method of claim 1, wherein the method further comprises:
acquiring an image to be segmented;
and inputting the image to be segmented into the target image segmentation model, and outputting a segmentation result obtained by segmentation.
12. An apparatus for training an image segmentation model, the apparatus comprising:
The acquisition module is used for acquiring a combined training sample set, wherein the combined training sample set comprises a first image sample without a segmentation label and at least one second image sample with a segmentation label, and the segmentation label is a standard segmentation image of the second image sample;
the segmentation module is used for acquiring an initial image segmentation model, carrying out image segmentation on the first image sample through the initial image segmentation model to obtain a first segmentation result, and carrying out image segmentation on each second image sample to obtain a second segmentation result;
the first determining module is used for obtaining a probability map of the first segmentation result in the pixel dimension, and determining a value of a first loss function based on the probability map, wherein the probability map is used for indicating the probability that each pixel in the first image sample belongs to each segmentation area in the first segmentation result;
the second determining module is used for obtaining differences between the second segmentation results and the segmentation labels and determining values of second loss functions based on the differences;
and the training module is used for combining the value of the first loss function and the value of each second loss function, training the initial image segmentation model and obtaining a target image segmentation model.
13. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the training method of the image segmentation model according to any one of claims 1 to 11 when executing the executable instructions stored in the memory.
14. A computer readable storage medium storing computer executable instructions which, when executed by a processor, implement the method of training an image segmentation model according to any one of claims 1 to 11.
15. A computer program product comprising a computer program or computer executable instructions which, when executed by a processor, implement a method of training an image segmentation model according to any one of claims 1 to 11.
CN202211572910.8A 2022-12-08 2022-12-08 Training method and related device for image segmentation model Pending CN117011521A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211572910.8A CN117011521A (en) 2022-12-08 2022-12-08 Training method and related device for image segmentation model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211572910.8A CN117011521A (en) 2022-12-08 2022-12-08 Training method and related device for image segmentation model

Publications (1)

Publication Number Publication Date
CN117011521A true CN117011521A (en) 2023-11-07

Family

ID=88571630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211572910.8A Pending CN117011521A (en) 2022-12-08 2022-12-08 Training method and related device for image segmentation model

Country Status (1)

Country Link
CN (1) CN117011521A (en)

Similar Documents

Publication Publication Date Title
CN111080628B (en) Image tampering detection method, apparatus, computer device and storage medium
WO2022105125A1 (en) Image segmentation method and apparatus, computer device, and storage medium
CN111325319B (en) Neural network model detection method, device, equipment and storage medium
CN111241989A (en) Image recognition method and device and electronic equipment
CN110929806B (en) Picture processing method and device based on artificial intelligence and electronic equipment
CN109977832B (en) Image processing method, device and storage medium
CN112016502B (en) Safety belt detection method, safety belt detection device, computer equipment and storage medium
CN113837942A (en) Super-resolution image generation method, device, equipment and storage medium based on SRGAN
CN114611672A (en) Model training method, face recognition method and device
CN114612902A (en) Image semantic segmentation method, device, equipment, storage medium and program product
CN113704276A (en) Map updating method and device, electronic equipment and computer readable storage medium
CN112183303A (en) Transformer equipment image classification method and device, computer equipment and medium
CN116796287A (en) Pre-training method, device, equipment and storage medium for graphic understanding model
CN115577768A (en) Semi-supervised model training method and device
CN114972016A (en) Image processing method, image processing apparatus, computer device, storage medium, and program product
WO2022105117A1 (en) Method and device for image quality assessment, computer device, and storage medium
CN117011521A (en) Training method and related device for image segmentation model
CN115115910A (en) Training method, using method, device, equipment and medium of image processing model
CN111461091A (en) Universal fingerprint generation method and device, storage medium and electronic device
CN114677611A (en) Data identification method, storage medium and device
CN117314756B (en) Verification and protection method and device based on remote sensing image, computer equipment and storage medium
CN117095019B (en) Image segmentation method and related device
CN117058498B (en) Training method of segmentation map evaluation model, and segmentation map evaluation method and device
US20230394633A1 (en) Image processing method and apparatus, computer device, storage medium, and program product
CN116958184A (en) Picture processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination