CN115393592A - Target segmentation model generation method and device, and target segmentation method and device - Google Patents


Info

Publication number
CN115393592A
Authority
CN
China
Prior art keywords: segmentation, information, target, mask, labeling
Prior art date
Legal status: Pending
Application number
CN202211056154.3A
Other languages
Chinese (zh)
Inventor
程天恒
陈少宇
张骞
Current Assignee
Beijing Horizon Information Technology Co Ltd
Original Assignee
Beijing Horizon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Horizon Information Technology Co Ltd
Priority to CN202211056154.3A
Publication of CN115393592A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present disclosure disclose a method and apparatus for generating a target segmentation model, a target segmentation method and apparatus, a computer-readable storage medium, and an electronic device. The generation method includes the following steps: performing target segmentation on a sample image by using a teacher model to obtain a first target segmentation result; generating segmentation labeling information based on the first target segmentation result and object labeling information; performing target segmentation on the sample image by using a student model to obtain a second target segmentation result; determining a loss value representing the error between the second target segmentation result and the segmentation labeling information; adjusting the parameters of the student model based on the loss value, and adjusting the parameters of the teacher model by using the adjusted parameters of the student model; and, when the teacher model with adjusted parameters meets the training end condition, determining that teacher model as the target segmentation model. The embodiments of the present disclosure improve the efficiency of training the target segmentation model and the target segmentation precision of the trained target segmentation model.

Description

Target segmentation model generation method and device, and target segmentation method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for generating a target segmentation model, a method and an apparatus for target segmentation, a computer-readable storage medium, and an electronic device.
Background
Most existing image-based target segmentation methods depend on complete object segmentation labels: the object segmentation labels are used as the optimization target when training an image segmentation model, and the trained model is then used to perform target segmentation on an image to obtain object-level segmentation results. Such methods require a great deal of manual effort for object segmentation labeling.
Currently, there are some target segmentation methods that simplify segmentation labeling by training a model with weak labels. For example, the object labeling frame is used to construct a segmentation constraint, and image colors are used to construct relationships between adjacent pixels, so that the object contour does not need to be accurately labeled, yielding a weakly supervised instance segmentation method that depends only on the object labeling frame. In addition, some methods replace pixel-level segmentation or polygon labeling with a number of discrete points and, together with the object labeling frame, implement weakly labeled instance segmentation.
The existing target segmentation method based on the weak annotation generally realizes preliminary target segmentation only by using the weak annotation, and has low segmentation precision.
Disclosure of Invention
The embodiment of the disclosure provides a method and a device for generating a target segmentation model, a method and a device for segmenting a target, a computer-readable storage medium and an electronic device.
An embodiment of the present disclosure provides a method for generating a target segmentation model, including: performing target segmentation on the sample image by using a teacher model to be trained to obtain a first target segmentation result; generating segmentation marking information based on the first target segmentation result and the object marking information, wherein the object marking information is obtained by marking the area where the target object is located in the sample image in advance; performing target segmentation on the sample image by using a student model to be trained to obtain a second target segmentation result; determining a loss value representing an error between the second target segmentation result and the segmentation annotation information based on a preset loss function; adjusting parameters of the student model based on the loss value, and adjusting parameters of the teacher model by using the adjusted parameters of the student model; and determining the teacher model after the parameters are adjusted as a target segmentation model in response to the fact that the teacher model after the parameters are adjusted meets the preset training end conditions.
According to another aspect of the embodiments of the present disclosure, there is provided a target segmentation method, including: acquiring a target image; and performing target segmentation on the target image by using a pre-trained target segmentation model to obtain a segmentation result, wherein the segmentation result comprises at least one group of corresponding mask information, at least one object detection frame and at least one object class information.
According to another aspect of the embodiments of the present disclosure, there is provided an apparatus for generating a target segmentation model, the apparatus including: the first segmentation module is used for performing target segmentation on the sample image by using a teacher model to be trained to obtain a first target segmentation result; the generating module is used for generating segmentation marking information based on the first target segmentation result and the object marking information, wherein the object marking information is obtained by marking the area where the target object is located in the sample image in advance; the second segmentation module is used for performing target segmentation on the sample image by using the student model to be trained to obtain a second target segmentation result; a first determining module, configured to determine a loss value representing an error between the second target segmentation result and the segmentation labeling information based on a preset loss function; the adjusting module is used for adjusting the parameters of the student model based on the loss value and adjusting the parameters of the teacher model by using the adjusted parameters of the student model; and the second determining module is used for responding to the condition that the teacher model after the parameters are adjusted accords with the preset training ending condition, and determining the teacher model after the parameters are adjusted as the target segmentation model.
According to another aspect of the embodiments of the present disclosure, there is provided a target segmentation apparatus, including: the acquisition module is used for acquiring a target image; and the third segmentation module is used for performing target segmentation on the target image by using a pre-trained target segmentation model to obtain a segmentation result, wherein the segmentation result comprises at least one group of corresponding mask information, at least one object detection box and at least one object class information.
According to another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above-mentioned method for generating a target segmentation model or the above-mentioned target segmentation method.
According to another aspect of an embodiment of the present disclosure, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; and the processor is used for reading the executable instructions from the memory and executing the instructions to realize the generation method of the target segmentation model or the target segmentation method.
Based on the method and apparatus for generating a target segmentation model, the method and apparatus for target segmentation, the computer-readable storage medium, and the electronic device provided by the embodiments of the present disclosure, a teacher model is used to perform target segmentation on a sample image to generate segmentation labeling information, a student model is used to perform target segmentation on the sample image to obtain a second target segmentation result, a loss value representing an error between the second target segmentation result and the segmentation labeling information is determined, parameters of the student model are adjusted based on the loss value, parameters of the teacher model are adjusted based on the adjusted parameters of the student model, and finally, when the teacher model meets training end conditions, the teacher model with the adjusted parameters is determined as the target segmentation model. The weakly labeled model training method is adopted, namely only the region where the target object is located in the sample image needs to be labeled, the real contour of the target object does not need to be labeled, and the labor cost consumed by labeling operation is reduced. Meanwhile, a training mode combining a teacher model and a student model is adopted, segmentation marking information is automatically generated according to an image segmentation result of the teacher model, and the student model is trained by the segmentation marking information, so that the segmentation result is fully utilized to further train the model, the teacher model and the student model respectively execute different functions, parameters of the teacher model and parameters of the student model are more specifically adjusted, the efficiency of training a target segmentation model is improved on the basis of reducing the labor cost required by model training, and the target segmentation precision of the trained target segmentation model is improved.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is a system diagram to which the present disclosure is applicable.
Fig. 2 is a flowchart illustrating a method for generating a target segmentation model according to an exemplary embodiment of the present disclosure.
Fig. 3 is a flowchart illustrating a method for generating a target segmentation model according to another embodiment of the present disclosure.
Fig. 4 is a flowchart illustrating a method for generating a target segmentation model according to another embodiment of the present disclosure.
Fig. 5 is a flowchart illustrating a method for generating a target segmentation model according to another embodiment of the present disclosure.
Fig. 6 is a flowchart illustrating a method for generating a target segmentation model according to another embodiment of the present disclosure.
Fig. 7 is a schematic diagram of a projection error provided by an exemplary embodiment of the present disclosure.
Fig. 8 is a schematic diagram of the similarity of adjacent pixels provided by an exemplary embodiment of the present disclosure.
Fig. 9 is a flowchart illustrating a method for generating a target segmentation model according to another embodiment of the present disclosure.
Fig. 10 is a flowchart illustrating a method for generating a target segmentation model according to another embodiment of the present disclosure.
Fig. 11 is a flowchart illustrating a target segmentation method according to an exemplary embodiment of the present disclosure.
Fig. 12 is a schematic structural diagram of an apparatus for generating a target segmentation model according to an embodiment of the present disclosure.
Fig. 13 is a schematic structural diagram of an apparatus for generating a target segmentation model according to another embodiment of the present disclosure.
Fig. 14 is a schematic structural diagram of a target segmentation apparatus according to an exemplary embodiment of the present disclosure.
Fig. 15 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some of the embodiments of the present disclosure, and not all of the embodiments of the present disclosure, and it is to be understood that the present disclosure is not limited by the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those within the art that the terms "first", "second", etc. in the embodiments of the present disclosure are used only for distinguishing between different steps, devices or modules, etc., and do not denote any particular technical meaning or necessary logical order therebetween.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more and "at least one" may refer to one, two or more.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the present disclosure may be generally understood as one or more, unless explicitly defined otherwise or indicated to the contrary hereinafter.
In addition, the term "and/or" in the present disclosure is only one kind of association relationship describing the association object, and indicates that three relationships may exist, for example, a and/or B, may indicate: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" in the present disclosure generally indicates that the former and latter associated objects are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The disclosed embodiments may be applied to electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with electronic devices, such as terminal devices, computer systems, servers, and the like, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network pcs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Summary of the application
The existing target segmentation method based on weak labeling generally realizes preliminary target segmentation only by using weak labeling, does not fully utilize the weak labeling, and also has insufficient multi-model optimization, thus resulting in lower segmentation precision.
In order to solve the problem, the embodiment of the disclosure performs weak supervised training by using a teacher-student model mode on the basis of weak labeling, that is, an output result of the teacher model is used as labeling information of the student model, parameters of the student model are adjusted based on a deep learning method, and parameters of the teacher model are correspondingly adjusted, so that the weak labeling is fully utilized, the cost for manually segmenting and labeling a target object can be saved, and the precision of target segmentation can be further improved.
Exemplary System
Fig. 1 illustrates an exemplary system architecture 100 to which the method or apparatus for generating a target segmentation model of embodiments of the present disclosure may be applied.
As shown in fig. 1, system architecture 100 may include terminal device 101, network 102, and server 103. Network 102 is the medium used to provide communication links between terminal devices 101 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may use terminal device 101 to interact with server 103 over network 102 to receive or send messages and the like. Various client applications, such as an image processing application, a video playing application, a search-type application, a web browser application, etc., may be installed on the terminal device 101.
The terminal apparatus 101 may be various electronic apparatuses including, but not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal, and the like, and fixed terminals such as a digital TV, a desktop computer, and the like.
The server 103 may be a server that provides various services, such as a background image processing server that processes an image uploaded by the terminal apparatus 101. The background image processing server can perform model training by using the received sample image and the object marking information to obtain a trained target segmentation model; the background image processing server may also perform target segmentation on the image uploaded by the terminal device 101 by using the trained image segmentation model to obtain a target segmentation result.
It should be noted that the target segmentation model generation method or the target segmentation method provided in the embodiments of the present disclosure may be executed by the server 103 or the terminal device 101, and accordingly, the target segmentation model generation device or the target segmentation device may be provided in the server 103 or the terminal device 101. As an example, the training process of the target segmentation model may be executed by the server 103, and the terminal device 101 may acquire the trained target segmentation model from the server 103, and perform the target segmentation process on the image by using the target segmentation model.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. In the case where the sample image, the image to be segmented, and the like do not need to be acquired from a remote location, the system architecture described above may not include a network, and only include a server or a terminal device.
Exemplary method
Fig. 2 is a flowchart illustrating a method for generating a target segmentation model according to an exemplary embodiment of the present disclosure. The embodiment can be applied to an electronic device (such as the terminal device 101 or the server 103 shown in fig. 1), and as shown in fig. 2, the method includes the following steps:
step 201, performing target segmentation on the sample image by using a teacher model to be trained to obtain a first target segmentation result.
The teacher model may be a model of various neural network structures for performing target segmentation on an image. For example, the teacher model may be a model with a neural network structure such as Mask R-CNN, CondInst, or the like. The first target segmentation result may include, but is not limited to: mask information of the projection of a target object in the sample image on the image plane, an object detection frame containing the target object, category information of the target object, and the like. The target object may be various types of objects such as a human body, an animal, a vehicle, and the like.
Step 202, generating segmentation labeling information based on the first target segmentation result and the object labeling information.
The object labeling information is obtained by labeling the area where the target object is located in the sample image in advance. Generally, the object labeling information may include an object labeling box and object type information, and the object labeling box is a generally rectangular box containing the target object.
Optionally, the first target segmentation result may include at least one object detection frame, each of which contains mask information indicating the projection of a target object on the image plane. For each object detection frame, the object labeling frame closest to that detection frame may be used to replace it, and the mask information within the object labeling frame is retained, so as to obtain segmentation labeling information including at least one object labeling frame and the mask information within each object labeling frame.
And step 203, performing target segmentation on the sample image by using the student model to be trained to obtain a second target segmentation result.
The neural network structure of the student model may be the same as that of the teacher model, that is, the student model performs target segmentation using the same target segmentation method as the teacher model. The second target segmentation result may include the same types of information as the first target segmentation result, for example, mask information, an object detection frame, and category information of the target object.
And 204, determining a loss value representing the error between the second target segmentation result and the segmentation marking information based on a preset loss function.
Wherein the loss function may comprise at least one loss term, each loss term being used to calculate a corresponding one of the loss values. As an example, the second target segmentation result may include: an object detection frame for predicting the position of the target object in the image, prediction category information for predicting the category of the target object, and mask information for segmenting the projection of the target object on the image plane.
The at least one loss term may include, but is not limited to: a regression loss (e.g., Smooth L1 loss) for calculating the error between the object detection frame and the object labeling frame included in the segmentation labeling information; a classification loss (e.g., softmax loss) for calculating the error between the prediction category information and the labeling category information included in the segmentation labeling information; and a segmentation loss for calculating the error between the mask information in the object detection frame and the mask information in the object labeling frame included in the segmentation labeling information, such as a cross-entropy loss for semantic segmentation, a Dice loss (a loss calculated by a loss function measuring the similarity between sample data and actual output data), a Focal loss (a loss calculated by a loss function including a focusing parameter, commonly used to address class imbalance in sample data), and the like. The loss values of the loss terms are weighted and summed to obtain the loss value of the preset loss function.
And step 205, adjusting parameters of the student model based on the loss value, and adjusting parameters of the teacher model by using the adjusted parameters of the student model.
Specifically, the electronic device may adjust parameters of the student model by using a gradient descent method and a back propagation method, so that the loss value is gradually reduced. The electronic device may then adjust the parameters of the teacher model using the adjusted parameters of the student model. Specifically, because the neural network structures of the student model and the teacher model are generally the same, the parameters included in the student model and the parameters included in the teacher model correspond to each other one by one, and the corresponding parameters in the teacher model can be updated based on the parameters of the student model. Optionally, the parameters of the teacher model may be updated by using an EMA (Exponential Moving Average) method, an SMA (Simple Moving Average) method, or the like. Wherein the EMA method is represented by the following formula (1):
θ_{t+1} = β·θ_t + (1-β)·θ_s    (1)
where θ_t denotes the initial parameters of the teacher model in the current iteration, i.e., the parameters updated in the previous iteration; θ_s denotes the adjusted parameters of the student model in the current iteration; θ_{t+1} denotes the updated parameters of the teacher model in the current iteration; and β is a preset coefficient.
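As an illustration, a minimal sketch of this EMA update in PyTorch is shown below; it assumes the teacher and student share the same network structure so their parameters correspond one to one, and the value of β is illustrative rather than prescribed by the text.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, beta: float = 0.999) -> None:
    # Formula (1): theta_{t+1} = beta * theta_t + (1 - beta) * theta_s,
    # applied parameter by parameter (the two models share one structure).
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(beta).add_(p_s, alpha=1.0 - beta)
```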
And step 206, in response to that the teacher model after the parameters are adjusted meets the preset training end conditions, determining the teacher model after the parameters are adjusted as a target segmentation model.
Wherein, the training end condition may include, but is not limited to, at least one of the following: the loss value of the preset loss function converges, the training time exceeds a preset duration, or the number of training iterations exceeds a preset number. It should be understood that the preset loss function is used to update the parameters of the student model; when the loss value converges, the student model satisfies the training end condition, and the teacher model whose parameters were last updated also satisfies the training end condition. The training process of the model is a process of iteratively updating the parameters of the model, and the parameters after each iterative update are used as the initial parameters of the next iteration until the training end condition is met.
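A high-level sketch of how steps 201 to 206 could be iterated is given below. It is PyTorch-style pseudocode under assumptions: the model call signatures, the `label_builder` and `loss_fn` interfaces, the `ema_update` callable (such as the sketch above), and the iteration-count stopping rule are all hypothetical placeholders, not the patent's API.

```python
import torch

def train_target_segmentation_model(teacher, student, data_loader, optimizer,
                                     loss_fn, label_builder, ema_update, max_iters=10000):
    """Illustrative loop for steps 201-206; every callable here is a placeholder interface."""
    for it, (images, object_labels) in enumerate(data_loader):
        with torch.no_grad():
            first_result = teacher(images)                       # step 201: teacher segmentation
        seg_labels = label_builder(first_result, object_labels)  # step 202: segmentation labeling info
        second_result = student(images)                          # step 203: student segmentation
        loss = loss_fn(second_result, seg_labels)                # step 204: loss value
        optimizer.zero_grad()
        loss.backward()                                          # step 205: back-propagate, update student
        optimizer.step()
        ema_update(teacher, student)                             # step 205: update teacher from student
        if it + 1 >= max_iters:                                  # step 206: a simple end condition
            break
    return teacher                                               # the target segmentation model
```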
The method for generating the target segmentation model according to the above embodiment of the present disclosure includes performing target segmentation on a sample image by using a teacher model to generate segmentation annotation information, performing target segmentation on the sample image by using a student model to obtain a second target segmentation result, determining a loss value indicating an error between the second target segmentation result and the segmentation annotation information, adjusting parameters of the student model based on the loss value, adjusting parameters of the teacher model by using the adjusted parameters of the student model, and determining the teacher model with the adjusted parameters as the target segmentation model when the teacher model meets training end conditions. The weakly labeled model training method is adopted, namely only the region where the target object is located in the sample image needs to be labeled, the real contour of the target object does not need to be labeled, and the labor cost consumed by labeling operation is reduced. Meanwhile, a training mode combining a teacher model and a student model is adopted, segmentation marking information is automatically generated according to an image segmentation result of the teacher model, and the student model is trained by the segmentation marking information, so that the segmentation result is fully utilized to further train the model, the teacher model and the student model respectively execute different functions, parameters of the teacher model and parameters of the student model are more specifically adjusted, the efficiency of training a target segmentation model is improved on the basis of reducing the labor cost required by model training, and the target segmentation precision of the trained target segmentation model is improved.
In some optional implementations, the first target segmentation result includes at least one set of first mask information and at least one first object detection box, and the first mask information in the at least one set of first mask information and the first object detection box in the at least one first object detection box correspond to each other one by one, and the object annotation information includes at least one object annotation box.
Wherein each mask value in the first mask information corresponds to one pixel. For example, each mask value in the first mask information indicates a probability that the corresponding pixel belongs to the target object, and if the mask value is 0, it indicates that the corresponding pixel is a background pixel. In general, the first mask information may be in the form of a mask map having the same size as the sample image input to the teacher model, and each pixel of the mask map corresponds to one mask value and one pixel of the sample image. The first object detection frame is a frame of a preset shape (for example, a rectangle) in which the teacher model predicts the region where the target object is located. The object labeling frame is a frame obtained by labeling the target object in the area of the sample image in advance.
As shown in fig. 3, step 202 includes:
step 2021, matching the at least one first object detection frame with the at least one object labeling frame, and determining at least one pair of the first object detection frame and the object labeling frame that are matched with each other.
Optionally, for each first object detection frame, the object labeling frame closest to the first object detection frame may be determined as the object labeling frame matched with the first object detection frame.
Step 2022, generating at least one set of segmentation label information based on the first mask information corresponding to the at least one pair of the first object detection frame and the object label frame that are matched with each other.
Each group of segmentation labeling information includes a segmented object labeling frame and the segmentation labeling mask information corresponding to that frame. The segmented object labeling frame is the object labeling frame of one of the at least one pair of mutually matched first object detection frame and object labeling frame, and the segmentation labeling mask information is the corresponding first mask information in which the mask values outside the segmented object labeling frame are set to a value (e.g., 0) indicating a background pixel.
In this embodiment, at least one object detection frame is matched with at least one object labeling frame to obtain at least one set of segmentation labeling information, so that the segmentation object labeling frame included in the segmentation labeling information can accurately represent the region of the target object in the image, and therefore, the accuracy of the projection of the target object on the image plane reflected by the segmentation labeling mask information included in the segmentation labeling information is greatly improved, and the improvement of the target segmentation precision of the trained student model according to the segmentation labeling information is facilitated.
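For illustration, a minimal sketch of building one set of segmentation labeling mask information from a matched pair is given below; the tensor layout and the [x1, y1, x2, y2] box format are assumptions made for the example.

```python
import torch

def make_segmentation_label(first_mask: torch.Tensor, matched_label_box: torch.Tensor) -> torch.Tensor:
    """Keep the teacher's mask values inside the matched object labeling frame and
    set mask values outside the frame to the background value 0.

    first_mask:        (H, W) first mask information from the teacher model.
    matched_label_box: (4,) box [x1, y1, x2, y2] of the matched object labeling frame.
    """
    x1, y1, x2, y2 = matched_label_box.round().long().tolist()
    label_mask = torch.zeros_like(first_mask)
    label_mask[y1:y2, x1:x2] = first_mask[y1:y2, x1:x2]
    return label_mask  # segmentation labeling mask information for this object
```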
In some optional implementation manners, the first target segmentation result further includes object detection confidence levels corresponding to each set of first mask information in the at least one set of first mask information, each set of first mask information in the at least one set of first mask information includes mask probability data and binarization mask information, and the binarization mask information is data obtained by performing binarization processing on the mask probability data.
Wherein the object detection confidence represents the probability that the detected target object belongs to its category. The mask probability data is a set of probability values representing the probability that each pixel in the sample image belongs to the target object, and the binarized mask information is the set of values obtained by converting each probability value included in the mask probability data into 0 or 1 (for example, values greater than or equal to a preset probability threshold are converted to 1, and values less than the threshold are converted to 0). A value of 0 indicates that the corresponding pixel is a background pixel, and a value of 1 indicates that the corresponding pixel belongs to the target object.
As shown in fig. 4, step 2021 includes:
step 20211, for each set of first mask information in at least one set of first mask information, determining a segmentation quality score corresponding to each set of first mask information based on mask probability data included in the set of first mask information, binarization mask information, and object detection confidence corresponding to the set of first mask information.
The segmentation quality score can be calculated by the following formula (2):
C_i = S_i · ( Σ_{j=1..H} Σ_{k=1..W} M_{ijk} · P̂_{ijk} ) / ( Σ_{j=1..H} Σ_{k=1..W} M_{ijk} )    (2)
where H and W are the height and width of the sample image, respectively, and i indexes the groups of first mask information included in the first target segmentation result, i.e., the target objects detected by performing target segmentation on the sample image. C_i denotes the segmentation quality score of the i-th target object, M_{ijk} denotes the binarized mask value corresponding to the pixel at coordinates (j, k), P̂_{ijk} denotes the mask probability value of that pixel, and S_i denotes the object detection confidence corresponding to the i-th target object.
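Under the reading of formula (2) given above (the detection confidence scaled by the average mask probability over the binarized foreground), a sketch of the computation could look as follows; the batched tensor shapes are assumptions for the example.

```python
import torch

def segmentation_quality_scores(mask_prob: torch.Tensor,
                                mask_bin: torch.Tensor,
                                det_conf: torch.Tensor) -> torch.Tensor:
    """Per-object segmentation quality score C_i.

    mask_prob: (N, H, W) mask probability data.
    mask_bin:  (N, H, W) binarized mask information (0/1).
    det_conf:  (N,) object detection confidences S_i.
    """
    fg_prob = (mask_prob * mask_bin).sum(dim=(1, 2))      # sum of probabilities on the foreground
    fg_area = mask_bin.sum(dim=(1, 2)).clamp(min=1.0)     # number of foreground pixels
    return det_conf * fg_prob / fg_area                   # C_i = S_i * mean foreground probability
```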
Step 20212, determine the at least one first object detection frame as a first candidate object detection frame set, and determine the at least one object labeling frame as a first candidate object labeling frame set.
Based on the first set of object candidate detection frames and the first set of object candidate labeling frames, the following matching steps (including steps 20213 to 20219) are performed.
Step 20213, determine the target first object detection frame corresponding to the maximum segmentation quality score from the candidate object detection frame set.
Step 20214, determining the target object labeling frame having the overlap area with the target first object detection frame and the maximum overlap rate from the candidate object labeling frame set.
And the coincidence rate is used for representing the coincidence degree between the first object detection frame and the object marking frame. In general, the coincidence ratio can be obtained by calculating an IoU (Intersection over Union). That is, if the target first object detection frame and the plurality of object labeling frames all have overlapping areas, the object labeling frame with the largest overlapping rate is taken as the target object labeling frame.
Step 20215, in response to that the coincidence rate meets the preset coincidence rate threshold condition and there is no first object detection frame matching the target object labeling frame, determining that the target first object detection frame matches the target object labeling frame.
The preset coincidence rate threshold condition may be: the coincidence ratio is greater than or equal to a preset coincidence ratio threshold value (e.g., 0.7). As an example, if the coincidence rate between the target first object detection frame and the target object labeling frame is greater than 0.7, and the current target object labeling frame is not matched with other first object detection frames yet, it is determined that the target first object detection frame and the target object labeling frame are matched with each other.
In the embodiment, the target first object detection frame with the largest segmentation quality score is determined, and the target object labeling frame matched with the target first object detection frame is determined based on the coincidence rate, so that the matched target first object detection frame is closer to the target object labeling frame, and the accuracy of segmentation labeling information is improved.
In some optional implementations, as shown in fig. 4, after the step 20215, the method further includes:
step 20216, remove the target first object detection frame from the first candidate object detection frame set, and obtain a second candidate object detection frame set.
Step 20217, remove the target object labeling frame from the first set of candidate object labeling frames to obtain a second set of candidate object labeling frames.
Step 20218, determine whether the second set of object candidate detection boxes and the second set of object candidate labeling boxes are both non-empty sets.
If yes, go to step 20219.
If not, the second candidate object detection frame set and the second candidate object labeling frame set do not have the matched first object detection frame and object labeling frame, that is, all the matched first object detection frames and object labeling frames are extracted, so that at least one pair of the first object detection frames and the object labeling frames which are matched with each other is obtained.
At step 20219, the second set of object detection frame candidates and the second set of object labeling frame candidates are determined as the new first set of object detection frame candidates and the new first set of object labeling frame candidates.
Based on the new first candidate object detection frame set and the first candidate object labeling frame set, the above-mentioned matching step is performed, i.e., step 20213 is re-performed.
In this embodiment, in combination with the foregoing embodiment, the matching step is executed in a circulating manner, so that at least one pair of the first object detection frame and the object labeling frame that are matched with each other is determined according to the size of the segmentation quality score, and the corresponding object labeling frame is accurately allocated to the first target segmentation result output by the teacher model, thereby helping to reduce an error between the generated segmentation labeling information and a distribution situation of a real target object in a sample image, and helping to improve the precision of training a student model.
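A compact sketch of the matching loop in steps 20213 to 20219 is given below. It uses `torchvision.ops.box_iou` for the coincidence rate; the greedy ordering by quality score is one way to realize the loop, and the 0.7 threshold is the example value mentioned above.

```python
import torch
from torchvision.ops import box_iou

def match_detections_to_labels(det_boxes: torch.Tensor,       # (D, 4) first object detection frames
                               quality_scores: torch.Tensor,  # (D,) segmentation quality scores
                               label_boxes: torch.Tensor,     # (G, 4) object labeling frames
                               iou_thresh: float = 0.7):
    """Greedily pair detections with labeling frames in decreasing quality-score order."""
    ious = box_iou(det_boxes, label_boxes)                     # coincidence rates, shape (D, G)
    unmatched_labels = set(range(label_boxes.shape[0]))
    pairs = []
    for d in torch.argsort(quality_scores, descending=True).tolist():
        if not unmatched_labels:                               # every labeling frame is matched
            break
        g = max(unmatched_labels, key=lambda j: ious[d, j].item())
        if ious[d, g] >= iou_thresh:                           # coincidence-rate threshold condition
            pairs.append((d, g))
            unmatched_labels.remove(g)
    return pairs                                               # matched (detection, labeling frame) index pairs
```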
In some optional implementations, the second target segmentation result includes at least one set of second mask information, at least one second object detection box, and at least one object detection category information, where each second mask information in the at least one set of second mask information, each second object detection box in the at least one second object detection box, and each object detection category information in the at least one object detection category information correspond to each other one by one; the object labeling information further comprises at least one object labeling category information corresponding to at least one object labeling frame, namely, each object labeling frame corresponds to one object labeling category information. The loss values include a first loss value, a second loss value, and a third loss value.
As shown in fig. 5, step 204 includes:
step 2041, based on a preset first loss function, determines a first loss value representing an error between the at least one second object detection box and the at least one object labeling box.
Specifically, the first loss function is typically a regression loss, such as a Smooth L1 loss function, or the like.
Step 2042, determining a second loss value representing an error between the at least one object detection category information and the at least one object labeling category information based on a preset second loss function.
In particular, the second loss function is typically a categorical loss function, such as a softmax loss function or the like.
Step 2043, based on a preset third loss function, determines a third loss value representing an error between the at least one set of second mask information and the at least one set of segmentation markup information.
In particular, the third loss function may comprise at least one loss term. As an example, the at least one loss term may include a cross-entropy loss, a Dice loss, a Focal loss, and the like for performing semantic segmentation, and the loss terms may represent an error between each set of the second mask information and the segmentation annotation mask information included in the corresponding segmentation annotation information.
It should be noted that, the first loss function may calculate an error between each pair of the corresponding second object detection frame and the corresponding object labeling frame, so that at least one error value may be obtained, and the first loss value may be obtained by performing operations such as summing and averaging on the error values. Similarly, the second loss function may calculate an error between each pair of corresponding second object detection category information and object labeling category information, and the third loss function may determine an error between each pair of second mask information and segmentation labeling information, and thus, the second loss value and the third loss value may also be obtained by, for example, summing and averaging.
In the embodiment, errors in multiple aspects can be comprehensively reflected by calculating the first loss value, the second loss value and the third loss value, and the training process of the student model is supervised from multiple aspects, so that the target segmentation precision of the trained student model is improved.
In some alternative implementations, as shown in fig. 6, step 2043 includes:
step 20431, based on the first loss sub-function included in the preset third loss function, a first sub-loss value representing a projection error between the segmented object labeling frames respectively included in the segmentation labeling information of the at least one set of the second mask information and the at least one set of the segmentation labeling information is determined.
As shown in fig. 7, 701 is a mask map representing a target object presented according to a set of second mask information, 702 is a segmented object labeling block diagram, l x1 And l x2 The projections, l, on the x-axis of the target object 7011 and the segmented object labeling box 7021, respectively y1 And l y2 The projections on the y-axis of the target object 7011 and the divided object 7021 are labeled, respectively. l x1 、l x2 、l y1 、l y2 May be represented by a vector. The first sub-loss function is shown as the following equation (3):
L 1 =L(l x1 ,l x2 )+L(l y1 ,l y2 ) (3)
alternatively, L may be a Dice loss.
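The sketch below illustrates one way formula (3) could be computed. The max-projection used to obtain l_x and l_y from the mask (taking the maximum mask value along each column and each row) is an assumption, not something stated explicitly in the text, and the Dice variant is only one possible choice for L.

```python
import torch

def dice_loss(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Dice loss between two non-negative vectors of equal length.
    inter = (a * b).sum()
    return 1.0 - (2.0 * inter + eps) / (a.sum() + b.sum() + eps)

def projection_loss(pred_mask: torch.Tensor, label_box_mask: torch.Tensor) -> torch.Tensor:
    """Formula (3): L_1 = L(l_x1, l_x2) + L(l_y1, l_y2).

    pred_mask:      (H, W) mask probabilities from the second mask information.
    label_box_mask: (H, W) binary mask that is 1 inside the segmented object labeling frame.
    """
    l_x1, l_x2 = pred_mask.max(dim=0).values, label_box_mask.max(dim=0).values  # projections on the x-axis
    l_y1, l_y2 = pred_mask.max(dim=1).values, label_box_mask.max(dim=1).values  # projections on the y-axis
    return dice_loss(l_x1, l_x2) + dice_loss(l_y1, l_y2)
```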
Step 20432, based on a second loss sub-function included in the third loss function, determining a second sub-loss value representing the adjacent-mask relationship error between each set of the at least one set of second mask information and the segmentation labeling mask information included in the corresponding set of segmentation labeling information.
Specifically, for a set of second mask information, the second mask information may include mask probability data, i.e., a set of probability values representing the probability that each pixel belongs to the target object. For each pixel contained in the second object detection frame corresponding to that second mask information, the similarity between the mask probability value of that pixel and the mask probability values of its 8 adjacent pixels can be determined. For example, as shown in fig. 8, let a pixel p(i, j) in the second object labeling frame have mask probability value P̂_{i,j} and an adjacent pixel q(k, l) have mask probability value P̂_{k,l}, where i, j, k, l are pixel coordinates. The probability that pixels p and q are similar is given by the following formula (4):
P(y_e = 1) = P̂_{i,j} · P̂_{k,l} + (1 - P̂_{i,j}) · (1 - P̂_{k,l})    (4)
where P(y_e = 1) is the probability that pixels p and q are similar. As shown in fig. 8, y_e on the line between pixels p and q indicates whether the two pixels are labeled as similar: y_e = 1 indicates that p and q are similar, i.e., both pixels belong to the target object or both are background pixels; correspondingly, y_e = 0 indicates that the two pixels are dissimilar, i.e., one is a background pixel and the other belongs to the target object, and P(y_e = 0) = 1 - P(y_e = 1). The second loss sub-function is shown in the following formula (5):
L_2 = -(1/N) · Σ_{e ∈ E_in} [ y_e · log P(y_e = 1) + (1 - y_e) · log P(y_e = 0) ]    (5)
where the value of y_e is determined by the segmentation labeling mask information, i.e., according to the labeled binarized mask value corresponding to each pixel in the segmentation labeling mask information: if the labeled binarized mask values corresponding to pixels p and q are both 1 or both 0, then y_e = 1, otherwise y_e = 0. N denotes the total number of pixels contained in the segmented object labeling frame, e denotes a pair of adjacent pixels, and E_in denotes the set of all adjacent pixel pairs.
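A sketch of formulas (4) and (5) over a precomputed list of adjacent pixel pairs could look as follows. The (E, 2, 2) pair-coordinate layout and the choice of normalizer are assumptions for the example; in practice the pairs would be built from the 8-neighborhood of every pixel inside the segmented object labeling frame.

```python
import torch

def pairwise_similarity_loss(mask_prob: torch.Tensor,   # (H, W) mask probability data (second mask info)
                             label_bin: torch.Tensor,   # (H, W) binarized segmentation labeling mask
                             pairs: torch.Tensor        # (E, 2, 2) pixel coordinates of adjacent pairs (p, q)
                             ) -> torch.Tensor:
    """Formulas (4) and (5): binary cross-entropy on the similarity of adjacent pixel pairs."""
    p_i, p_j = pairs[:, 0, 0], pairs[:, 0, 1]
    q_k, q_l = pairs[:, 1, 0], pairs[:, 1, 1]
    m_p, m_q = mask_prob[p_i, p_j], mask_prob[q_k, q_l]
    # Formula (4): probability that p and q are similar (both foreground or both background).
    p_sim = (m_p * m_q + (1.0 - m_p) * (1.0 - m_q)).clamp(1e-6, 1.0 - 1e-6)
    # y_e from the segmentation labeling mask: 1 if the labeled values of p and q agree, else 0.
    y_e = (label_bin[p_i, p_j] == label_bin[q_k, q_l]).float()
    n_pixels = label_bin.numel()  # N in formula (5), taken here as the number of pixels considered
    # Formula (5): cross-entropy over all adjacent pairs, normalized by N.
    return -(y_e * p_sim.log() + (1.0 - y_e) * (1.0 - p_sim).log()).sum() / n_pixels
```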
Step 20433, based on a third loss sub-function included in the third loss function, determining a third sub-loss value representing, for each set of the at least one set of second mask information, the error with respect to the color similarity relationship between adjacent pixels within the segmented object labeling frame included in the corresponding set of segmentation labeling information.
Specifically, the third loss sub-function is shown in the following formula (6):
L_3 = -(1/N) · Σ_{e ∈ E_in} 1_{S_e ≥ τ} · log P(y_e = 1)    (6)
where N, e, E_in and P(y_e = 1) are the same as the corresponding parameters in formula (5) above. S_e denotes the color similarity of two adjacent pixels; for example, the distance between the vectors representing the colors of the two pixels may be determined, and the color similarity determined according to that distance. τ denotes a color similarity threshold for two adjacent pixels, and 1_{S_e ≥ τ} is the similarity indicator function, which equals 1 when S_e ≥ τ and 0 otherwise; therefore, pairs with S_e < τ do not contribute to formula (6).
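Formula (6) keeps only pairs whose color similarity passes the threshold τ. The sketch below re-uses the pair layout from the previous sketch; the exponential color-similarity measure, the color space, and the values of τ and θ are assumptions made for illustration.

```python
import torch

def color_pairwise_loss(mask_prob: torch.Tensor,   # (H, W) mask probability data
                        image_lab: torch.Tensor,   # (3, H, W) pixel colors (e.g. CIELAB), an assumption
                        pairs: torch.Tensor,       # (E, 2, 2) adjacent pixel coordinates as before
                        tau: float = 0.3,          # illustrative color-similarity threshold
                        theta: float = 2.0) -> torch.Tensor:
    """Formula (6): -(1/N) * sum over pairs with S_e >= tau of log P(y_e = 1)."""
    p_i, p_j = pairs[:, 0, 0], pairs[:, 0, 1]
    q_k, q_l = pairs[:, 1, 0], pairs[:, 1, 1]
    m_p, m_q = mask_prob[p_i, p_j], mask_prob[q_k, q_l]
    p_sim = (m_p * m_q + (1.0 - m_p) * (1.0 - m_q)).clamp(1e-6, 1.0)       # P(y_e = 1), formula (4)
    color_dist = (image_lab[:, p_i, p_j] - image_lab[:, q_k, q_l]).norm(dim=0)
    s_e = torch.exp(-color_dist / theta)            # color similarity derived from the color distance
    keep = (s_e >= tau).float()                     # indicator 1_{S_e >= tau}
    n_pixels = mask_prob.numel()
    return -(keep * p_sim.log()).sum() / n_pixels
```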
Step 20434, based on a fourth loss sub-function included in the third loss function, determining a fourth sub-loss value representing the mask error between each set of the at least one set of second mask information and the segmentation labeling mask information included in the corresponding set of segmentation labeling information.
Specifically, for a set of second mask information corresponding to a set of segmentation markup mask information, a mask error indicates an error between the second mask information and the corresponding segmentation markup mask information. Optionally, an error between the second mask information and the segmentation marking mask information may be determined through cross entropy loss, dice loss, and the like, so as to obtain a fourth sub-loss value.
Step 20435, a third loss value is determined based on at least one of the first, second, third, and fourth sub-loss values.
Specifically, at least one of the first sub-loss value, the second sub-loss value, the third sub-loss value, and the fourth sub-loss value may be summed or weighted-summed to obtain the third loss value.
It should be noted that, since the third loss value is determined based on at least one of the first sub-loss value, the second sub-loss value, the third sub-loss value and the fourth sub-loss value, step 2043 may include at least one of the above-mentioned steps 20431 to 20434.
In the embodiment, the third loss value is obtained by calculating through a plurality of methods for calculating the loss value, the segmentation marking information is fully utilized, and the second mask information is constrained from a plurality of aspects, so that the target segmentation precision of the trained student model is further improved.
In some alternative implementations, as shown in fig. 9, step 20434 includes:
at step 204341, the row average mask value and the column average mask value corresponding to at least one set of second mask information are determined.
Specifically, each set of second mask information includes mask values each corresponding to one pixel of the sample image, and for each line of pixels, an average value of the mask values (e.g., binarized mask values) corresponding to the line of pixels is calculated as a line average mask value; and calculating the average value of the mask values respectively corresponding to the pixels of each column as a column average mask value for each column of pixels.
Step 204342, determine the row average annotation mask value and the column average annotation mask value of the segmentation annotation mask information included in each segmentation annotation information of the at least one set of segmentation annotation information.
Specifically, each mask value included in each set of segmentation labeling mask information corresponds to one pixel of the sample image, and for each row of pixels, an average value of segmentation labeling mask values corresponding to the row of pixels is calculated as a row average labeling mask value; and calculating the average value of the segmentation marking mask values respectively corresponding to the pixels of each column as a column average marking mask value.
Step 204343, based on the fourth penalty subfunction, determines a first error value representing the error between the row mean mask value and the row mean annotation mask value, and a second error value representing the error between the column mean mask value and the column mean annotation mask value.
The first error value and the second error value may be determined in various manners. For example, by calculating the Dice penalty between the row average mask value and the row average annotation mask value, and between the column average mask value and the column average annotation mask value.
Step 204344 determines a fourth sub-loss value based on the first error value and the second error value.
Specifically, the first error value and the second error value may be summed or weighted-summed to obtain a fourth sub-loss value.
Alternatively, the fourth loss sub-function is as shown in the following formula (7):

L_4 = Dice(\hat{M}_{h-avg}, M_{h-avg}) + Dice(\hat{M}_{v-avg}, M_{v-avg})    (7)

wherein Dice(\hat{M}_{h-avg}, M_{h-avg}) denotes the first error value and Dice(\hat{M}_{v-avg}, M_{v-avg}) denotes the second error value; \hat{M} denotes the predicted mask information, i.e., the second mask information, and M denotes the segmentation labeling mask information; \hat{M}_{v-avg} is obtained by averaging the mask values (typically binarized mask values) of each column included in the second mask information, and M_{v-avg} is obtained by averaging the mask values of each column included in the segmentation labeling mask information; \hat{M}_{h-avg} is obtained by averaging the mask values (typically binarized mask values) of each row included in the second mask information, and M_{h-avg} is obtained by averaging the mask values of each row included in the segmentation labeling mask information. For an image of size W x H, \hat{M}_{v-avg} and M_{v-avg} have dimension W x 1, while \hat{M}_{h-avg} and M_{h-avg} have dimension H x 1.

It should be noted that formula (7) is calculated for a single target object; if the sample image contains a plurality of target objects, an L_4 can be determined for each target object, and the individual L_4 values are summed or averaged to obtain the fourth sub-loss value.
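For concreteness, the sketch below mirrors the row/column averaging of formula (7) for a single target object. The exact Dice formulation and the NumPy interface are assumptions; the disclosure only states that a Dice-style error between the averaged vectors may be used.

    import numpy as np

    def dice_error(a, b, eps=1e-6):
        # Dice-style error between two 1-D average-mask vectors.
        return 1.0 - (2.0 * np.sum(a * b) + eps) / (np.sum(a) + np.sum(b) + eps)

    def fourth_sub_loss(pred_mask, gt_mask):
        # pred_mask: H x W predicted (second) mask, values in [0, 1]
        # gt_mask:   H x W binary segmentation labeling mask
        pred_v, gt_v = pred_mask.mean(axis=0), gt_mask.mean(axis=0)  # column averages, length W
        pred_h, gt_h = pred_mask.mean(axis=1), gt_mask.mean(axis=1)  # row averages, length H
        # First error value (rows) plus second error value (columns).
        return dice_error(pred_h, gt_h) + dice_error(pred_v, gt_v)

    # With several target objects, the per-object values are summed or averaged.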
in the embodiment, the average mask value and the average labeling mask value are respectively calculated for each row of pixels and each column of pixels, so that the second mask information can be constrained by a smaller calculated amount, and the efficiency of training the student model is improved.
In some alternative implementations, as shown in fig. 10, step 203 includes:
Step 2031, the size of the sample image is adjusted to the target size.
Specifically, the size of the sample image may be adjusted to a manually set target size, or may be automatically adjusted to a randomly selected target size.
Step 2032, target segmentation is performed on the resized sample image by using the student model to be trained to obtain a second target segmentation result.
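A minimal sketch of steps 2031 to 2032 is given below (only the resizing stage is shown); the nearest-neighbour interpolation, the candidate sizes, and the NumPy interface are illustrative assumptions rather than values taken from the disclosure.

    import numpy as np

    def resize_nearest(image, target_hw):
        # Nearest-neighbour resize of an H x W x C image to target_hw = (H', W');
        # a stand-in for whatever resizing operation the training pipeline uses.
        h, w = image.shape[:2]
        th, tw = target_hw
        rows = (np.arange(th) * h / th).astype(int)
        cols = (np.arange(tw) * w / tw).astype(int)
        return image[rows][:, cols]

    sample = np.zeros((720, 1280, 3), dtype=np.uint8)   # sample image
    target = (512, 512)                                 # manually set target size, or
    target = tuple(int(s) for s in np.random.choice([416, 512, 640], size=2))  # random target size
    resized = resize_nearest(sample, target)
    # Step 2032 then feeds `resized` to the student model to be trained.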
In this embodiment, the student model is trained with the resized sample image, which effectively improves the student model's ability to process images of different sizes, so that the trained target segmentation model can segment images of various sizes and its generalization ability is improved.
Fig. 11 is a flowchart illustrating a target segmentation method according to an exemplary embodiment of the present disclosure. The embodiment can be applied to an electronic device (such as the terminal device 101 or the server 103 shown in fig. 1), and as shown in fig. 11, the method includes the following steps:
Step 1101, acquiring a target image.
In this embodiment, the electronic device may acquire the target image locally or remotely. The target image is an image taken of a target object, which may be various types of objects such as a human body, an animal, a vehicle, and the like.
Step 1102, performing target segmentation on the target image by using a pre-trained target segmentation model to obtain a segmentation result.
In this embodiment, the electronic device may perform target segmentation on the target image by using a pre-trained target segmentation model to obtain a segmentation result. The segmentation result comprises at least one group of corresponding mask information, at least one object detection frame and at least one object class information. That is, each set of mask information corresponds to one object detection box and one object class information. The target segmentation model is obtained by training in advance based on the method described in any one of the embodiments shown in fig. 2-10.
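As a hedged sketch of how such a model might be invoked at inference time, the wrapper below assumes a callable target segmentation model that returns masks, detection boxes, and class identifiers; this interface is an assumption, since the disclosure does not prescribe an API.

    def segment_image(model, image):
        # `model` is assumed to be the trained target segmentation model
        # (the parameter-adjusted teacher) returning aligned lists of
        # mask information, object detection boxes, and object class information.
        masks, boxes, class_ids = model(image)
        # Each group of mask information corresponds to one detection box
        # and one piece of object class information.
        return [{"mask": m, "box": b, "class": c} for m, b, c in zip(masks, boxes, class_ids)]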
In the target segmentation method provided by the above embodiment of the present disclosure, the target segmentation model trained with the method described in the embodiment corresponding to fig. 2 makes fuller use of the weak labels than other related target segmentation models trained with weak labels, so the method provided by the embodiment of the present disclosure achieves higher target segmentation precision.
Exemplary devices
Fig. 12 is a schematic structural diagram of a device for generating a target segmentation model according to an exemplary embodiment of the present disclosure. This embodiment can be applied to an electronic device, and as shown in fig. 12, the apparatus for generating a target segmentation model includes: the first segmentation module 1201 is used for performing target segmentation on the sample image by using the teacher model to be trained to obtain a first target segmentation result; a generating module 1202, configured to generate segmentation labeling information based on a first target segmentation result and object labeling information, where the object labeling information is obtained by labeling an area where a target object in a sample image is located in advance; a second segmentation module 1203, configured to perform target segmentation on the sample image by using a student model to be trained, to obtain a second target segmentation result; a first determining module 1204, configured to determine a loss value representing an error between the second target segmentation result and the segmentation annotation information based on a preset loss function; an adjusting module 1205 for adjusting the parameters of the student model based on the loss value, and adjusting the parameters of the teacher model by using the adjusted parameters of the student model; the second determining module 1206 is configured to determine, in response to that the parameter-adjusted teacher model meets a preset training end condition, the parameter-adjusted teacher model as a target segmentation model.
In this embodiment, the first segmentation module 1201 may perform target segmentation on the sample image by using a teacher model to be trained, so as to obtain a first target segmentation result.
The teacher model may be a model of various neural network structures for performing target segmentation on an image, for example, a model with a neural network structure such as Mask R-CNN or CondInst. The first target segmentation result may include, but is not limited to: mask information indicating a projection of the target object in the sample image onto the image plane, an object detection frame containing the target object, category information of the target object, and the like. The target object may be various types of objects such as a human body, an animal, or a vehicle.
In this embodiment, the generating module 1202 may generate the segmentation labeling information based on the first target segmentation result and the object labeling information. The object labeling information is obtained by labeling, in advance, the region where the target object is located in the sample image. Generally, the object labeling information may include an object labeling box and object category information, where the object labeling box is typically a rectangular box containing the target object.
In this embodiment, the second segmentation module 1203 may perform target segmentation on the sample image by using a student model to be trained, so as to obtain a second target segmentation result.
The neural network structure of the student model may be the same as that of the teacher model, that is, the student model performs target segmentation using the same target segmentation method as the teacher model. The second target segmentation result may include the same types of information as the first target segmentation result, for example, mask information, an object detection box, and category information of the target object.
In this embodiment, the first determining module 1204 may determine a loss value representing an error between the second target segmentation result and the segmentation mark information based on a preset loss function.
The loss function may include at least one loss term, each loss term being used to calculate a corresponding loss value. As an example, the second target segmentation result may include: an object detection frame predicting the position of the target object in the image, prediction category information predicting the category of the target object, and mask information segmenting the projection of the target object on the image plane.
In this embodiment, the adjustment module 1205 can first adjust the parameters of the student model based on the loss values. Specifically, the adjusting module 1205 can adjust the parameters of the student model by using a gradient descent method and a back propagation method, so that the loss value is gradually reduced.
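A minimal sketch of such an update is shown below; the dictionary-of-arrays parameter interface, the learning rate, and the function name are assumptions made only for illustration.

    def sgd_step(student_params, grads, lr=0.02):
        # Plain gradient-descent update of the student parameters; `grads` is
        # assumed to hold the gradients obtained by back-propagating the loss
        # value, keyed by the same parameter names.
        for name in student_params:
            student_params[name] = student_params[name] - lr * grads[name]
        return student_params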
The adjusted parameters of the student model may then be used by the adjustment module 1205 to adjust the parameters of the teacher model. Specifically, because the neural network structures of the student model and the teacher model are generally the same, the parameters included in the student model and the parameters included in the teacher model are in one-to-one correspondence, and the corresponding parameters in the teacher model can be updated based on the parameters of the student model.
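One common way to realize this update, shown here as an assumption rather than a rule prescribed by the disclosure, is an exponential moving average over the one-to-one corresponding parameters; the momentum value of 0.999 is purely illustrative.

    def update_teacher_from_student(teacher_params, student_params, momentum=0.999):
        # Exponential-moving-average update: each teacher parameter is nudged
        # toward the corresponding (already adjusted) student parameter.
        for name, student_value in student_params.items():
            teacher_value = teacher_params[name]
            teacher_params[name] = momentum * teacher_value + (1.0 - momentum) * student_value
        return teacher_params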
In this embodiment, the second determining module 1206 may determine the teacher model after the parameter adjustment as the target segmentation model in response to that the teacher model after the parameter adjustment meets the preset training end condition.
The training end condition may include, but is not limited to, at least one of the following: the loss value of the preset loss function converges, the training time exceeds a preset duration, or the number of training iterations exceeds a preset number. It should be understood that the preset loss function is used to update the parameters of the student model; when the loss value converges, the student model satisfies the training end condition, and the teacher model whose parameters were last updated also satisfies the training end condition. The training process of the model is a process of iteratively updating the parameters of the model, and the parameters after each iterative update are used as the initial parameters of the next iteration until the training end condition is met.
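The end-of-training check could be sketched as follows; every threshold (convergence tolerance, patience, time budget, iteration budget) is an illustrative assumption, since the disclosure only names the kinds of conditions that may be used.

    import time

    def training_finished(loss_history, start_time, step,
                          tol=1e-4, patience=20, max_seconds=24 * 3600, max_steps=100000):
        # Loss convergence: the loss has barely changed over the last `patience` updates.
        converged = (len(loss_history) > patience and
                     abs(loss_history[-1] - loss_history[-1 - patience]) < tol)
        # Training time exceeds the preset duration.
        timed_out = (time.time() - start_time) > max_seconds
        # Number of training iterations exceeds the preset number.
        too_many_steps = step > max_steps
        return converged or timed_out or too_many_steps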
Referring to fig. 13, fig. 13 is a schematic structural diagram of an apparatus for generating a target segmentation model according to another exemplary embodiment of the present disclosure.
In some optional implementations, the first target segmentation result includes at least one set of first mask information and at least one first object detection box, the first mask information in the at least one set of first mask information corresponds one to one with the first object detection boxes in the at least one first object detection box, and the object labeling information includes at least one object labeling box; the generating module 1202 includes: a first determining unit 12021, configured to match the at least one first object detection box with the at least one object labeling box and determine at least one pair of mutually matched first object detection box and object labeling box; and a generating unit 12022, configured to generate at least one set of segmentation labeling information based on the first mask information corresponding to the at least one pair of mutually matched first object detection box and object labeling box, where each set of segmentation labeling information includes a segmented object labeling box and segmentation labeling mask information corresponding to the segmented object labeling box.
In some optional implementations, the first target segmentation result further includes an object detection confidence corresponding to each set of first mask information in the at least one set of first mask information, each set of first mask information in the at least one set of first mask information includes mask probability data and binarization mask information, and the binarization mask information is data obtained by binarizing the mask probability data; the first determining unit 12021 includes: a first determining subunit 120211, configured to determine, for each set of first mask information in the at least one set of first mask information, a segmentation quality score corresponding to that set of first mask information based on the mask probability data and binarization mask information included in that set of first mask information and the object detection confidence corresponding to that set of first mask information; a second determining subunit 120212, configured to determine the at least one first object detection frame as a first candidate object detection frame set, determine the at least one object labeling frame as a first candidate object labeling frame set, and perform the following matching steps based on the first candidate object detection frame set and the first candidate object labeling frame set; a third determining subunit 120213, configured to determine, from the first candidate object detection frame set, a target first object detection frame corresponding to the maximum segmentation quality score; a fourth determining subunit 120214, configured to determine, from the first candidate object labeling frame set, a target object labeling frame that has an overlapping region with the target first object detection frame and has the largest coincidence rate; and a fifth determining subunit 120215, configured to determine that the target first object detection frame and the target object labeling frame match each other in response to the coincidence rate meeting a preset coincidence rate threshold condition and no first object detection frame matched with the target object labeling frame existing.
In some optional implementations, the first determining unit 12021 further includes: a first removing subunit 120216, configured to remove the target first object detection frame from the first candidate object detection frame set to obtain a second candidate object detection frame set; a second removing subunit 120217, configured to remove the target object labeling frame from the first candidate object labeling frame set to obtain a second candidate object labeling frame set; and a sixth determining subunit 120218, configured to, in response to the second candidate object detection frame set and the second candidate object labeling frame set both being non-empty sets, determine the second candidate object detection frame set and the second candidate object labeling frame set as a new first candidate object detection frame set and a new first candidate object labeling frame set, and perform the matching steps to obtain at least one pair of mutually matched first object detection frame and object labeling frame.
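To make the matching procedure concrete, the sketch below implements the greedy pairing described above. The use of IoU as the coincidence rate, the particular segmentation quality score, and the handling of a detection frame that fails the threshold are assumptions that the disclosure does not fix.

    import numpy as np

    def seg_quality_score(mask_prob, mask_bin, det_conf):
        # Illustrative quality score: mean mask probability over the binarized
        # foreground, weighted by the object detection confidence.
        fg = mask_bin > 0
        return float(det_conf * (mask_prob[fg].mean() if fg.any() else 0.0))

    def iou(a, b):
        # Overlap ratio of two boxes given as (x1, y1, x2, y2).
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / (union + 1e-6)

    def match_boxes(det_boxes, quality_scores, gt_boxes, iou_thresh=0.5):
        # Greedy matching: repeatedly take the unmatched detection frame with the
        # highest segmentation quality score, pair it with the unmatched labeling
        # frame it overlaps most, and keep the pair if the overlap passes the threshold.
        det_left, gt_left, pairs = set(range(len(det_boxes))), set(range(len(gt_boxes))), []
        while det_left and gt_left:
            d = max(det_left, key=lambda i: quality_scores[i])
            g = max(gt_left, key=lambda j: iou(det_boxes[d], gt_boxes[j]))
            if iou(det_boxes[d], gt_boxes[g]) >= iou_thresh:
                pairs.append((d, g))
                gt_left.remove(g)
            det_left.remove(d)
        return pairs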
In some optional implementations, the second target segmentation result includes at least one set of second mask information, at least one second object detection box, and at least one object detection category information, where each second mask information in the at least one set of second mask information, each second object detection box in the at least one second object detection box, and each object detection category information in the at least one object detection category information correspond to each other one by one; the object labeling information also comprises at least one object labeling category information corresponding to at least one object labeling frame, and the loss value comprises a first loss value, a second loss value and a third loss value; the first determination module 1204 includes: a second determining unit 12041, configured to determine a first loss value representing an error between the at least one second object detection box and the at least one object labeling box based on a preset first loss function; a third determining unit 12042, configured to determine, based on a preset second loss function, a second loss value representing an error between the at least one object detection category information and the at least one object labeling category information; a fourth determining unit 12043, configured to determine a third loss value representing an error between the at least one set of second mask information and the at least one set of segmentation annotation information, based on a preset third loss function.
In some optional implementations, the fourth determining unit 12043 includes: a seventh determining subunit 120431, configured to determine, based on a first loss sub-function included in the preset third loss function, a first sub-loss value representing a projection error between the at least one set of second mask information and the segmented object labeling boxes respectively included in the segmentation labeling information of the at least one set of segmentation labeling information; and/or an eighth determining subunit 120432, configured to determine, based on a second loss sub-function included in the third loss function, a second sub-loss value representing an adjacent mask relationship error between the at least one set of second mask information and the segmentation labeling mask information respectively included in the segmentation labeling information of the at least one set of segmentation labeling information; and/or a ninth determining subunit 120433, configured to determine, based on a third loss sub-function included in the third loss function, a third sub-loss value representing an adjacent pixel color similarity relationship error between the at least one set of second mask information and the segmented object labeling boxes respectively included in the segmentation labeling information of the at least one set of segmentation labeling information; and/or a tenth determining subunit 120434, configured to determine, based on a fourth loss sub-function included in the third loss function, a fourth sub-loss value representing a mask error between the at least one set of second mask information and the segmentation labeling mask information respectively included in the segmentation labeling information of the at least one set of segmentation labeling information; and an eleventh determining subunit 120435, configured to determine a third loss value based on at least one of the first sub-loss value, the second sub-loss value, the third sub-loss value, and the fourth sub-loss value.
In some optional implementations, the tenth determining subunit 120434 is further configured to: determine a row average mask value and a column average mask value corresponding to each set of the at least one set of second mask information; determine a row average annotation mask value and a column average annotation mask value of the segmentation labeling mask information respectively included in each segmentation labeling information of the at least one set of segmentation labeling information; determine, based on the fourth loss sub-function, a first error value representing the error between the row average mask value and the row average annotation mask value and a second error value representing the error between the column average mask value and the column average annotation mask value; and determine the fourth sub-loss value based on the first error value and the second error value.
In some optional implementations, the second segmentation module 1203 includes: an adjusting unit 12031 for adjusting the size of the sample image to a target size; and the segmenting unit 12032 is configured to perform target segmentation on the sample image with the adjusted size by using the student model to be trained, so as to obtain a second target segmentation result.
The device for generating the target segmentation model provided by the embodiment of the present disclosure adopts a weakly labeled model training method: only the region where the target object is located in the sample image needs to be labeled, and the real contour of the target object does not need to be labeled, which reduces the labor cost of the labeling operation. Meanwhile, a training mode combining a teacher model and a student model is adopted: segmentation labeling information is automatically generated from the image segmentation result of the teacher model, and the student model is trained with this segmentation labeling information, so that the segmentation result is fully utilized to further train the model. Because the teacher model and the student model perform different functions, their parameters are adjusted in a more targeted manner, which improves the efficiency of training the target segmentation model while reducing the labor cost required for model training, and improves the target segmentation precision of the trained target segmentation model.
Fig. 14 is a schematic structural diagram of a target segmentation apparatus according to an exemplary embodiment of the present disclosure. This embodiment can be applied to an electronic device, as shown in fig. 14, the object segmentation apparatus includes: an acquisition module 1401 for acquiring a target image; the third segmentation module 1402 is configured to perform target segmentation on the target image by using a pre-trained target segmentation model to obtain a segmentation result, where the segmentation result includes at least one set of corresponding mask information, at least one object detection box, and at least one object category information.
In this embodiment, the acquisition module 1401 may acquire the target image locally or remotely. The target image is an image taken of a target object, which may be various types of objects such as a human body, an animal, a vehicle, and the like.
In this embodiment, the third segmentation module 1402 may perform target segmentation on the target image by using a pre-trained target segmentation model to obtain a segmentation result. The segmentation result comprises at least one group of corresponding mask information, at least one object detection frame and at least one object category information. That is, each set of mask information corresponds to one object detection box and one object class information. The target segmentation model is obtained by training in advance based on the method described in the embodiment corresponding to fig. 2.
In the target segmentation apparatus provided by the above embodiment of the present disclosure, the target segmentation model trained with the method described in the embodiment corresponding to fig. 2 makes fuller use of the weak labels than other related target segmentation models trained with weak labels, so the apparatus provided by the embodiment of the present disclosure achieves higher target segmentation precision.
Exemplary electronic device
Next, an electronic apparatus according to an embodiment of the present disclosure is described with reference to fig. 15. The electronic device may be either or both of the terminal device 101 and the server 103 as shown in fig. 1, or a stand-alone device separate from them, which may communicate with the terminal device 101 and the server 103 to receive the collected input signals therefrom.
FIG. 15 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure.
As shown in fig. 15, an electronic device 1500 includes one or more processors 1501 and memory 1502.
The processor 1501 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 1500 to perform desired functions.
The memory 1502 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and/or cache memory. Non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 1501 may execute the program instructions to implement the method for generating a target segmentation model or the target segmentation method of the various embodiments of the present disclosure described above, and/or other desired functions. Various contents such as an input signal, a signal component, and a noise component may also be stored in the computer-readable storage medium.
In one example, the electronic device 1500 may further include: an input device 1503 and an output device 1504, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
For example, when the electronic device is the terminal device 101 or the server 103, the input device 1503 may be a device such as a camera, a mouse, or a keyboard, and inputs an image, various commands, or the like. When the electronic device is a stand-alone device, the input device 1503 may be a communication network connector for receiving input images, various commands, and the like from the terminal device 101 and the server 103.
The output device 1504 may output various information including the target segmentation model, the segmentation result, and the like to the outside. The output devices 1504 can include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, among others.
Of course, for simplicity, only some of the components of the electronic device 1500 relevant to the present disclosure are shown in fig. 15, omitting components such as buses, input/output interfaces, and the like. In addition, electronic device 1500 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage medium
In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the generation method of the object segmentation model or the steps in the object segmentation method according to various embodiments of the present disclosure described in the above-mentioned "exemplary methods" section of this specification.
The computer program product may write program code for carrying out operations for embodiments of the present disclosure in any combination of one or more programming languages, including an object oriented programming language such as Java, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer readable storage medium having stored thereon computer program instructions, which, when executed by a processor, cause the processor to perform the generation method of the object segmentation model or the steps in the object segmentation method according to various embodiments of the present disclosure described in the above "exemplary methods" section of this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments; however, it should be noted that the advantages, effects, and the like mentioned in the present disclosure are merely examples and are not limiting, and should not be considered essential to the various embodiments of the present disclosure. Furthermore, the specific details disclosed above are provided only for the purpose of illustration and ease of understanding, and are not intended to limit the present disclosure to implementations that must adopt those specific details.
In the present specification, the embodiments are described in a progressive manner, and each embodiment focuses on differences from other embodiments, and the same or similar parts in each embodiment are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and reference may be made to the partial description of the method embodiment for relevant points.
The block diagrams of devices, apparatuses, and systems involved in the present disclosure are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, or configurations must be made in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, these devices, apparatuses, and systems may be connected, arranged, or configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and may be used interchangeably therewith. The words "or" and "and" as used herein mean, and may be used interchangeably with, the word "and/or," unless the context clearly indicates otherwise. The word "such as" as used herein means, and may be used interchangeably with, the phrase "such as but not limited to."
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (13)

1. A generation method of an object segmentation model comprises the following steps:
performing target segmentation on the sample image by using a teacher model to be trained to obtain a first target segmentation result;
generating segmentation labeling information based on the first target segmentation result and the object labeling information, wherein the object labeling information is obtained by labeling a region where a target object is located in the sample image in advance;
performing target segmentation on the sample image by using a student model to be trained to obtain a second target segmentation result;
determining a loss value representing an error between the second target segmentation result and the segmentation annotation information based on a preset loss function;
adjusting parameters of the student model based on the loss value, and adjusting parameters of the teacher model by using the adjusted parameters of the student model;
and determining the teacher model after the parameters are adjusted as a target segmentation model in response to the fact that the teacher model after the parameters are adjusted meets the preset training end conditions.
2. The method according to claim 1, wherein the first target segmentation result includes at least one set of first mask information and at least one first object detection box, and first mask information in the at least one set of first mask information and a first object detection box in the at least one first object detection box correspond to each other one by one, and the object labeling information includes at least one object labeling box;
generating segmentation labeling information based on the first target segmentation result and the object labeling information, wherein the generation comprises the following steps:
matching the at least one first object detection frame with the at least one object labeling frame, and determining at least one pair of the first object detection frame and the object labeling frame which are matched with each other;
and generating at least one group of segmentation marking information based on the first mask information corresponding to the at least one pair of the first object detection frame and the object marking frame which are matched with each other, wherein each group of segmentation marking information comprises a segmentation object marking frame and segmentation marking mask information corresponding to the segmentation object marking frame.
3. The method according to claim 2, wherein the first target segmentation result further includes object detection confidences corresponding to each of the at least one set of first mask information, each of the at least one set of first mask information includes mask probability data and binarization mask information, and the binarization mask information is data obtained by binarizing the mask probability data;
the matching the at least one first object detection frame and the at least one object labeling frame, and determining at least one pair of the first object detection frame and the object labeling frame which are matched with each other, includes:
for each group of first mask information in the at least one group of first mask information, determining a segmentation quality score corresponding to each group of first mask information based on mask probability data, binarization mask information and object detection confidence degree corresponding to the group of first mask information, wherein the mask probability data, the binarization mask information and the object detection confidence degree are included in the group of first mask information;
determining the at least one first object detection box as a first candidate object detection box set, determining the at least one object labeling box as a first candidate object labeling box set, and performing the following matching steps based on the first candidate object detection box set and the first candidate object labeling box set:
determining a target first object detection frame corresponding to the maximum segmentation quality score from the first candidate object detection frame set;
determining a target object labeling frame which has a coincidence area with the target first object detection frame and has the largest coincidence rate from the first candidate object labeling frame set;
and determining that the target first object detection frame is matched with the target object labeling frame in response to the fact that the coincidence rate meets a preset coincidence rate threshold condition and no first object detection frame matched with the target object labeling frame exists.
4. The method of claim 3, wherein after said determining that said target first object detection box and said target object labeling box match each other, said method further comprises:
removing the target first object detection frame from the first candidate object detection frame set to obtain a second candidate object detection frame set;
removing the target object labeling frame from the first candidate object labeling frame set to obtain a second candidate object labeling frame set;
and in response to that the second candidate object detection frame set and the second candidate object labeling frame set are both non-empty sets, determining the second candidate object detection frame set and the second candidate object labeling frame set as a new first candidate object detection frame set and a new first candidate object labeling frame set, and executing the matching step to obtain at least one pair of a first object detection frame and an object labeling frame which are matched with each other.
5. The method of claim 2, wherein the second target segmentation result includes at least one set of second mask information, at least one second object detection box, and at least one object detection category information, each second mask information of the at least one set of second mask information, each second object detection box of the at least one second object detection box, and each object detection category information of the at least one object detection category information corresponding to one another; the object labeling information further comprises at least one object labeling category information corresponding to the at least one object labeling box, and the loss value comprises a first loss value, a second loss value and a third loss value;
the determining a loss value representing an error between the second target segmentation result and the segmentation labeling information based on a preset loss function includes:
determining a first loss value representing an error between the at least one second object detection frame and the at least one object labeling frame based on a preset first loss function;
determining a second loss value representing an error between the at least one object detection category information and the at least one object labeling category information based on a preset second loss function;
determining a third loss value representing an error between the at least one set of second mask information and the at least one set of segmentation annotation information based on a preset third loss function.
6. The method according to claim 5, wherein the determining a third loss value representing an error between the at least one set of second mask information and the at least one set of segmentation annotation information based on a preset third loss function comprises:
determining a first sub-loss value representing a projection error between the at least one set of second mask information and a segmented object labeling frame respectively included by segmentation labeling information in the at least one set of segmentation labeling information based on a first loss sub-function included by a preset third loss function; and/or,
determining a second sub-loss value representing an adjacent mask relation error between the at least one set of second mask information and segmentation labeling mask information included in the segmentation labeling information of the at least one set of segmentation labeling information, respectively, based on a second loss sub-function included in the third loss function; and/or,
determining a third sub-loss value representing an adjacent pixel color similarity relationship error between the at least one set of second mask information and the segmented object labeling boxes respectively included in the segmentation labeling information in the at least one set of segmentation labeling information based on a third loss sub-function included in the third loss function; and/or,
determining a fourth sub-loss value representing a mask error between the at least one set of second mask information and segmentation label mask information included in the segmentation label information of the at least one set of segmentation label information, respectively, based on a fourth loss sub-function included in the third loss function;
determining a third loss value based on at least one of the first, second, third, and fourth sub-loss values.
7. The method according to claim 6, wherein the determining a fourth sub-loss value representing a mask error between the at least one set of second mask information and the segmentation annotation mask information included in the segmentation annotation information of the at least one set of segmentation annotation information, respectively, based on a fourth loss sub-function included in the third loss function, comprises:
determining a row average mask value and a column average mask value corresponding to each of the at least one set of second mask information;
determining a row average annotation mask value and a column average annotation mask value of segmentation annotation mask information respectively included in each segmentation annotation information in the at least one group of segmentation annotation information;
determining, based on the fourth loss sub-function, a first error value representing an error between the row mean mask value and the row mean annotation mask value, and a second error value representing an error between the column mean mask value and the column mean annotation mask value;
a fourth sub-loss value is determined based on the first error value and the second error value.
8. The method according to any one of claims 1 to 7, wherein the performing target segmentation on the sample image by using the student model to be trained to obtain a second target segmentation result comprises:
resizing the sample image to a target size;
and performing target segmentation on the sample image with the adjusted size by using the student model to be trained to obtain a second target segmentation result.
9. A method of object segmentation, comprising:
acquiring a target image;
and performing target segmentation on the target image by using a pre-trained target segmentation model to obtain a segmentation result, wherein the segmentation result comprises at least one group of corresponding mask information, at least one object detection frame and at least one object class information.
10. An apparatus for generating an object segmentation model, comprising:
the first segmentation module is used for performing target segmentation on the sample image by using a teacher model to be trained to obtain a first target segmentation result;
a generating module, configured to generate segmentation labeling information based on the first target segmentation result and object labeling information, where the object labeling information is obtained by labeling a region where a target object in the sample image is located in advance;
the second segmentation module is used for carrying out target segmentation on the sample image by utilizing a student model to be trained to obtain a second target segmentation result;
a first determining module, configured to determine a loss value representing an error between the second target segmentation result and the segmentation labeling information based on a preset loss function;
the adjusting module is used for adjusting the parameters of the student model based on the loss value and adjusting the parameters of the teacher model by using the adjusted parameters of the student model;
and the second determining module is used for responding to the condition that the teacher model after the parameters are adjusted accords with the preset training end condition, and determining the teacher model after the parameters are adjusted as a target segmentation model.
11. An object segmentation apparatus comprising:
the acquisition module is used for acquiring a target image;
and the third segmentation module is used for performing target segmentation on the target image by using a pre-trained target segmentation model to obtain a segmentation result, wherein the segmentation result comprises at least one group of corresponding mask information, at least one object detection frame and at least one object class information.
12. A computer-readable storage medium, having stored thereon a computer program for execution by a processor to perform the method of any of claims 1-9.
13. An electronic device, the electronic device comprising:
a processor;
a memory for storing executable instructions of the processor;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any one of claims 1-9.
CN202211056154.3A 2022-08-30 2022-08-30 Target segmentation model generation method and device, and target segmentation method and device Pending CN115393592A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211056154.3A CN115393592A (en) 2022-08-30 2022-08-30 Target segmentation model generation method and device, and target segmentation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211056154.3A CN115393592A (en) 2022-08-30 2022-08-30 Target segmentation model generation method and device, and target segmentation method and device

Publications (1)

Publication Number Publication Date
CN115393592A true CN115393592A (en) 2022-11-25

Family

ID=84124720

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211056154.3A Pending CN115393592A (en) 2022-08-30 2022-08-30 Target segmentation model generation method and device, and target segmentation method and device

Country Status (1)

Country Link
CN (1) CN115393592A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071608A (en) * 2023-03-16 2023-05-05 浙江啄云智能科技有限公司 Target detection method, device, equipment and storage medium
CN116071608B (en) * 2023-03-16 2023-06-06 浙江啄云智能科技有限公司 Target detection method, device, equipment and storage medium
CN117132607A (en) * 2023-10-27 2023-11-28 腾讯科技(深圳)有限公司 Image segmentation model processing method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
US11436739B2 (en) Method, apparatus, and storage medium for processing video image
KR102478000B1 (en) Image processing methods, training methods, devices, devices, media and programs
CN108304835B (en) character detection method and device
CN108229419B (en) Method and apparatus for clustering images
CN115393592A (en) Target segmentation model generation method and device, and target segmentation method and device
WO2019129032A1 (en) Remote sensing image recognition method and apparatus, storage medium and electronic device
CN111476309A (en) Image processing method, model training method, device, equipment and readable medium
CN111027563A (en) Text detection method, device and recognition system
CN109117781B (en) Multi-attribute identification model establishing method and device and multi-attribute identification method
CN110321845B (en) Method and device for extracting emotion packets from video and electronic equipment
CN110378278B (en) Neural network training method, object searching method, device and electronic equipment
CN112016559A (en) Example segmentation model training method and device and image processing method and device
US11915500B2 (en) Neural network based scene text recognition
CN109961032B (en) Method and apparatus for generating classification model
CN111723815B (en) Model training method, image processing device, computer system and medium
CN112084920B (en) Method, device, electronic equipment and medium for extracting hotwords
CN112712036A (en) Traffic sign recognition method and device, electronic equipment and computer storage medium
CN114565812A (en) Training method and device of semantic segmentation model and semantic segmentation method of image
CN115456089A (en) Training method, device, equipment and storage medium of classification model
CN109919214B (en) Training method and training device for neural network model
CN111639591B (en) Track prediction model generation method and device, readable storage medium and electronic equipment
CN111523351A (en) Neural network training method and device and electronic equipment
CN114037990A (en) Character recognition method, device, equipment, medium and product
CN113971733A (en) Model training method, classification method and device based on hypergraph structure
CN113742590A (en) Recommendation method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination