WO2020238902A1 - Image segmentation method, model training method, apparatus, device, and storage medium

Image segmentation method, model training method, apparatus, device, and storage medium

Info

Publication number
WO2020238902A1
WO2020238902A1 · PCT/CN2020/092356 · CN2020092356W
Authority
WO
WIPO (PCT)
Prior art keywords
affine transformation
video frame
sample
information
transformation information
Prior art date
Application number
PCT/CN2020/092356
Other languages
English (en)
French (fr)
Inventor
陈思宏
Original Assignee
Tencent Technology (Shenzhen) Company Limited (腾讯科技(深圳)有限公司)
Priority date
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited
Publication of WO2020238902A1
Priority to US17/395,388 (published as US11900613B2)

Classifications

    • G06T 7/174: Segmentation; Edge detection involving the use of two or more images
    • G06T 7/11: Region-based segmentation
    • G06F 18/24: Classification techniques
    • G06N 3/045: Combinations of networks
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/08: Learning methods
    • G06T 3/02: Affine transformations
    • G06T 3/4046: Scaling of whole images or parts thereof, e.g. expanding or contracting, using neural networks
    • G06T 7/0012: Biomedical image inspection
    • G06T 7/248: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving reference images or patches
    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06T 2207/10016: Video; Image sequence
    • G06T 2207/10132: Ultrasound image
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/20224: Image subtraction
    • G06T 2207/30048: Heart; Cardiac

Definitions

  • This application relates to the field of computer technology, in particular to an image segmentation method and device, and a model training method and device.
  • Semantic segmentation of images or videos is one of the hotspots in the field of computer vision research.
  • Semantic segmentation technology refers to a computer device segmenting all the regions in an image that belong to given categories and providing their category information.
  • The computer device needs to predict the key points of each frame of the video.
  • According to the key points of each frame, the computer device uses a template to calculate the difference between each frame image and the template to obtain transformation parameters, performs affine transformation based on the transformation parameters to obtain an ROI (region of interest), and then performs target segmentation on the ROI.
  • The prediction of the key points of a subsequent video frame depends on the target segmentation result of the previous video frame, so a prediction deviation in the first frame will directly cause positioning offsets in the subsequent series of video frames, resulting in low semantic segmentation accuracy for the target object.
  • This application provides an image segmentation method, model training method, device, equipment and storage medium, which can improve the accuracy of semantic segmentation.
  • an image segmentation method applied to a computer device including:
  • an image segmentation device comprising:
  • the acquisition module is used to acquire the current frame in the video frame sequence and the historical affine transformation information transmitted by the previous video frame;
  • An affine transformation module configured to perform affine transformation on the current frame according to the historical affine transformation information to obtain a candidate region image corresponding to the current frame;
  • a feature extraction module configured to perform feature extraction on the candidate region image to obtain a feature map corresponding to the candidate region image
  • a semantic segmentation module configured to perform semantic segmentation based on the feature map to obtain a segmentation result corresponding to the target in the current frame
  • the parameter correction module is configured to correct the historical affine transformation information according to the feature map to obtain updated affine transformation information, and use the updated affine transformation information as the historical affine transformation information corresponding to the subsequent video frame in the video frame sequence.
  • a computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor executes the following steps:
  • the historical affine transformation information is corrected according to the feature map to obtain updated affine transformation information, and the updated affine transformation information is used as the historical affine transformation information corresponding to the subsequent video frame in the video frame sequence.
  • a computer device includes a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the following steps:
  • the historical affine transformation information is corrected according to the feature map to obtain updated affine transformation information, and the updated affine transformation information is used as the historical affine transformation information corresponding to the subsequent video frame in the video frame sequence.
  • the above-mentioned image segmentation method, device, computer-readable storage medium and computer equipment perform affine transformation on the current frame according to the historical affine transformation information of the previous video frame to obtain the candidate region image corresponding to the current frame.
  • the historical affine transformation information of the previous video frame is a modified parameter, which can greatly improve the accuracy of obtaining the candidate region image. Semantic segmentation of the feature map corresponding to the candidate region image can accurately obtain the segmentation result corresponding to the target in the current frame.
  • the historical affine transformation information is corrected according to the feature map, and the corrected affine transformation information is transferred to the subsequent video frame for use in the subsequent video frame. In this way, the positioning of the current frame can be corrected, the error caused by the wrong positioning to the subsequent segmentation processing is reduced, and the accuracy of the semantic segmentation processing of the video is greatly improved.
  • a model training method which is applied to a computer device, and the method includes:
  • the model parameters of the target segmentation model are adjusted to continue training until the training stop condition is met.
  • a model training device including:
  • a sample acquisition module for acquiring video frame samples, sample label information corresponding to the video frame samples, and standard affine transformation information corresponding to the video frame samples;
  • a determining module configured to input the video frame samples into a target segmentation model for training, and determine the prediction affine transformation information corresponding to the video frame samples through the target segmentation model;
  • An output module configured to output the prediction affine transformation difference information corresponding to the video frame sample and the prediction segmentation result corresponding to the target in the video frame sample through the target segmentation model;
  • the determining module is further configured to determine standard affine transformation difference information according to the difference between the predicted affine transformation information and the standard affine transformation information;
  • the construction module is further configured to construct an affine transformation information correction loss function according to the standard affine transformation difference information and the predicted affine transformation difference information;
  • the construction module is further configured to determine a segmentation loss function according to the predicted segmentation result and the sample label information
  • the model parameter adjustment module is used to adjust the model parameters of the target segmentation model according to the affine loss function, the affine transformation information correction loss function, and the segmentation loss function, and to continue training until the training stop condition is satisfied.
  • a computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor executes the following steps:
  • the model parameters of the target segmentation model are adjusted and the training is continued until the training stop condition is met.
  • a computer device includes a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the following steps:
  • the model parameters of the target segmentation model are adjusted and the training is continued until the training stop condition is met.
  • The above model training method, device, computer-readable storage medium, and computer device introduce affine transformation supervision information, that is, standard affine transformation information, to improve the accuracy of positioning prediction; by correcting and training the predicted affine transformation information, the segmentation error caused by incorrect positioning is reduced.
  • the affine loss function, affine transformation information correction loss function, and segmentation loss function are superimposed and optimized together, so that each part influences and improves each other during the training process, so that the trained target segmentation model has accurate video semantic segmentation performance.
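  • A minimal sketch of this superimposed objective is given below, assuming smooth-L1 regression terms, a cross-entropy segmentation term, and unit weights; none of these choices are specified in this text.

```python
import torch.nn.functional as F

def combined_loss(pred_affine, gt_affine,            # predicted vs. standard affine parameters
                  pred_affine_diff, gt_affine_diff,  # predicted vs. standard affine difference
                  seg_logits, seg_labels,            # per-pixel segmentation logits and labels
                  w_affine=1.0, w_corr=1.0, w_seg=1.0):
    """Sketch of superimposing the three losses described above (assumed forms)."""
    loss_affine = F.smooth_l1_loss(pred_affine, gt_affine)          # affine loss
    loss_corr = F.smooth_l1_loss(pred_affine_diff, gt_affine_diff)  # correction loss
    loss_seg = F.cross_entropy(seg_logits, seg_labels)              # segmentation loss
    return w_affine * loss_affine + w_corr * loss_corr + w_seg * loss_seg
```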
  • FIG. 1 is an application environment diagram of the image segmentation method and/or model training method in an embodiment
  • Figure 2 is a schematic flowchart of an image segmentation method in an embodiment
  • FIG. 3 is a schematic diagram of the structure of a video frame sequence in an embodiment
  • FIG. 4 is a schematic flowchart of a step of obtaining historical affine transformation information transmitted by a current frame in a video frame sequence and a previous video frame in an embodiment
  • Figure 5 is an overall frame diagram of a target segmentation model in an embodiment
  • FIG. 6 is a schematic diagram of the architecture of a target segmentation model for performing target segmentation on the left ventricle in a cardiac ultrasound detection video in an embodiment
  • FIG. 7 is a schematic flowchart of the training steps of the target segmentation model in an embodiment
  • FIG. 8 is a flowchart of obtaining templates in an embodiment
  • FIG. 9 is a schematic flowchart of a model training method in an embodiment
  • FIG. 10 is a schematic diagram of the architecture of the target segmentation model in the model training process in an embodiment
  • FIG. 11 is a schematic flowchart of an image segmentation method in a specific embodiment
  • Figure 12 is a structural block diagram of an image segmentation device in an embodiment
  • Figure 13 is a structural block diagram of an image segmentation device in another embodiment
  • Figure 14 is a structural block diagram of a model training device in an embodiment
  • Figure 15 is a structural block diagram of a computer device in an embodiment.
  • Fig. 1 is an application environment diagram of an image segmentation method and/or model training method in an embodiment.
  • the image segmentation method and/or model training method is applied to a semantic segmentation system.
  • the semantic segmentation system includes a collector 110 and a computer device 120.
  • the collector 110 and the computer device 120 may be connected through a network or through a transmission line.
  • the computer device 120 may be a terminal or a server.
  • the terminal may be a desktop terminal or a mobile terminal.
  • the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, etc.; the server may be implemented by an independent server or a server cluster composed of multiple servers.
  • the collector 110 can collect video in real time and transmit the video to the computer device 120.
  • The computer device 120 can obtain the current frame in the video frame sequence and the historical affine transformation information transmitted by the previous video frame; perform affine transformation on the current frame according to the historical affine transformation information to obtain the candidate area image corresponding to the current frame; perform feature extraction on the candidate area image to obtain the feature map corresponding to the candidate area image; perform semantic segmentation based on the feature map to obtain the segmentation result corresponding to the target in the current frame; and correct the historical affine transformation information according to the feature map to obtain updated affine transformation information, using the updated affine transformation information as the historical affine transformation information corresponding to the subsequent video frame in the video frame sequence.
  • the computer device 120 may directly obtain the video, and perform target segmentation on each video frame in the video frame sequence corresponding to the video according to the above steps.
  • an image segmentation method is provided.
  • the method is applied to the computer device 120 in FIG. 1 as an example.
  • the image segmentation method includes the following steps:
  • S202 Acquire the current frame in the video frame sequence and the historical affine transformation information transmitted by the previous video frame.
  • the video frame sequence is a sequence composed of more than one video frame according to the generation timing corresponding to each video frame.
  • the video frame sequence includes a plurality of video frames arranged according to the generation sequence.
  • a video frame is the basic unit that constitutes a video, and a video can include multiple video frames.
  • the video frame sequence may be a sequence composed of video frames collected in real time, for example, it may be a video frame sequence obtained in real time by a camera of a collector, or a video frame sequence corresponding to a stored video.
  • the current frame is the currently processed video frame, such as the i-th frame;
  • The previous video frame is a video frame whose generation time is earlier than the current frame; it can be the frame immediately before the current frame or a frame several frames earlier, and it can also be called a historical video frame of the current frame.
  • the historical affine transformation information is the affine transformation information delivered by the previous video frame for affine transformation of the current frame.
  • "transmitted by the previous video frame” can be understood as: the computer device transmits according to the previous video frame, or corresponds to the previous video frame.
  • Affine transformation, also known as affine mapping, refers to the process of performing a linear transformation on a space vector matrix and then performing a translation transformation to obtain another space vector matrix. Linear transformation includes convolution operations.
  • the affine transformation information is information required to perform affine transformation, and may be affine transformation parameters or instructions for instructing how to perform affine transformation.
  • The affine transformation parameters refer to the reference parameters required for linear transformation or translation transformation of the image, such as the rotation angle (angle), the horizontal translation in pixels (shift_x), the vertical translation in pixels (shift_y), and the scaling factor (scale).
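  • As a hedged illustration of how these four parameters can define an affine transformation, the sketch below assembles them into a 2x3 matrix and applies it with PyTorch's affine_grid/grid_sample; the exact parameterization and normalization conventions are assumptions, since the patent does not fix them.

```python
import math
import torch
import torch.nn.functional as F

def apply_affine(frame, angle, shift_x, shift_y, scale):
    """Warp a frame of shape (1, C, H, W) with rotation (degrees), translation
    (normalized to [-1, 1] image coordinates) and a zoom factor. The choice of
    normalized translation and the direction of the zoom are assumptions."""
    rad = math.radians(angle)
    cos, sin = math.cos(rad), math.sin(rad)
    # 2x3 affine matrix combining rotation, scaling and translation.
    theta = torch.tensor([[cos / scale, -sin / scale, shift_x],
                          [sin / scale,  cos / scale, shift_y]],
                         dtype=frame.dtype).unsqueeze(0)
    grid = F.affine_grid(theta, list(frame.shape), align_corners=False)
    return F.grid_sample(frame, grid, align_corners=False)
```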
  • the computer equipment can obtain the historical affine transformation information of the current frame and the previous video frame in the process of detecting the video.
  • the historical affine transformation information of the previous video frame refers to a parameter that has been corrected and used for affine transformation of the current frame obtained when the image segmentation method is performed on the previous video frame.
  • The computer device can obtain the historical affine transformation information in the following way: when the computer device performs target segmentation on the previous video frame, it can correct the affine transformation information of the previous video frame according to the feature map corresponding to the previous video frame to obtain updated affine transformation information, and the updated affine transformation information can be used as the historical affine transformation information of the current frame.
  • Similarly, the historical affine transformation information can also be corrected according to the feature map of the current frame to obtain updated affine transformation information, and the updated affine transformation information is used as the historical affine transformation information corresponding to the subsequent video frame in the video frame sequence.
  • In this way, the affine transformation information can be continuously corrected and transmitted, so that the positioning of the current frame can be corrected and the error caused by incorrect positioning to the subsequent segmentation processing can be reduced, improving the accuracy of the semantic segmentation processing of the video.
  • the "current frame” used in this application is used to describe the current video frame processed by this method, and the "current frame” is a relatively changing video frame. For example, when processing the next video frame of the current frame, you can Take this next video frame as the new "current frame”.
  • the computer device may use the historical affine transformation information transferred in the previous frame of the current frame as the affine transformation information corresponding to the current frame to perform affine transformation.
  • the next video frame may use the historical affine transformation information transferred in the current frame as the affine transformation information corresponding to the next frame.
  • each video frame can use the historical affine transformation information transferred in the previous frame as the affine transformation information corresponding to the frame to perform affine transformation.
  • the computer device may also use the historical affine transformation information transmitted in the previous Nth (N is a positive integer, and N is greater than 1) frame of the current frame as the affine transformation information corresponding to the current frame.
  • the next video frame may use the historical affine transformation information transferred from the previous N-1 frame of the current frame as the affine transformation information corresponding to the next frame.
  • each video frame can use the historical affine transformation information transmitted in the previous Nth frame as the affine transformation information corresponding to the frame to perform affine transformation.
  • For example, suppose N is 3 and the video frame sequence is F1, F2, F3, F4, F5, F6, and so on. If the current frame being processed by the computer device is F4, then F4 can use the historical affine transformation information transmitted by video frame F1 as its affine transformation information to perform affine transformation; video frame F5 can use the historical affine transformation information transmitted by video frame F2 as its affine transformation information to perform affine transformation; video frame F6 can use the historical affine transformation information transmitted by video frame F3 as its affine transformation information to perform affine transformation; and so on.
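  • As a hedged illustration of this stride-N hand-off, the sketch below buffers the corrected affine transformation information so that the current frame consumes what the Nth previous frame produced; the class name and interface are hypothetical, not part of the patent.

```python
from collections import deque

class AffineInfoBuffer:
    """Hypothetical buffer: the current frame reads the affine transformation
    information written N frames earlier (N = 1 gives the frame-to-next-frame case)."""

    def __init__(self, n: int, initial_infos):
        # initial_infos: affine information regressed for the first N (initial) frames.
        self.queue = deque(initial_infos, maxlen=n)

    def get_for_current_frame(self):
        # Historical affine information transmitted by the Nth previous frame.
        return self.queue[0]

    def push_updated(self, updated_info):
        # Corrected information of the current frame, consumed N frames later.
        self.queue.append(updated_info)
```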
  • step S202 that is, the step of obtaining historical affine transformation information of the current frame and the previous video frame in the video frame sequence includes the following steps:
  • S402 Acquire an initial video frame in the sequence of video frames.
  • the initial video frame is the first video frame in the video frame sequence.
  • The initial video frame can be the first video frame in the video frame sequence, or the Nth frame in the video frame sequence (for example, the first frame where the focus stability reaches a preset condition, or the first frame where the target appears), and it can also be the first N (N is a positive integer greater than 1) video frames in the video frame sequence.
  • Because the affine transformation information of each subsequent video frame refers to the affine transformation information of a previous video frame, the initial video frame is the first video frame in the video frame sequence. When the computer device, in executing the image segmentation method, uses the historical affine transformation information transmitted by the Nth previous frame (N is a positive integer greater than 1) of the current frame as the affine transformation information corresponding to the current frame, the video frames from the first frame of the video frame sequence up to the Nth frame can all be called initial video frames.
  • S404 Extract image features of the initial video frame through the first convolutional neural network.
  • A convolutional neural network (CNN) is a type of feedforward neural network that includes convolution calculations and has a deep structure.
  • The sharing of convolution kernel parameters in the hidden layers of a convolutional neural network and the sparsity of inter-layer connections enable the convolutional neural network to learn grid-like features (such as pixels and audio) with a small amount of computation. Convolutional neural networks usually include convolutional layers and pooling layers, which can perform convolution and pooling on the input image to map the original data to a hidden-layer feature space.
  • the image feature is a spatial vector matrix that can represent the image information of the initial video frame obtained after processing by a convolutional neural network.
  • The image segmentation method is executed by the target segmentation model. The computer device can input the video frame sequence into the target segmentation model, and process the initial video frame through the first convolutional neural network in the target segmentation model to extract features from the initial video frame and obtain the corresponding image features.
  • S406 Input the image feature to the first fully connected network, and process the image feature through the first fully connected network, and output the affine transformation information through at least one output channel of the first fully connected network.
  • a fully connected network can also be called a fully connected layer (FC).
  • the fully connected layer functions as a "classifier" in the entire convolutional neural network.
  • the fully connected layer can map the image features learned by the convolutional layer and the pooling layer to the sample label space.
  • the computer device may input the image features to the first fully connected network, process the image features through the first fully connected network, and output the affine transformation information through at least one output channel of the first fully connected network.
  • the target segmentation model includes a region affine network (Region Affine Networks, RAN for short), and the RAN network includes a convolutional neural network and a fully connected network.
  • The computer device inputs the initial video frame of the video frame sequence into the RAN network, uses the lightweight MobileNet-V2 network as a generator to extract the image features of the initial video frame, and then regresses 4 affine transformation parameters through a fully connected network with 4 output channels: the rotation angle, the horizontal translation in pixels, the vertical translation in pixels, and the scaling factor.
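  • As a hedged sketch of the regional affine network just described, the code below pairs torchvision's MobileNet-V2 feature extractor with a fully connected head that has 4 output channels; the hidden-layer size and pooling choice are assumptions not stated in this text.

```python
import torch.nn as nn
from torchvision import models

class RegionAffineNetwork(nn.Module):
    """MobileNet-V2 generator followed by a fully connected head regressing
    4 affine parameters (angle, shift_x, shift_y, scale). Layer sizes are assumed."""

    def __init__(self):
        super().__init__()
        backbone = models.mobilenet_v2(weights=None)
        self.features = backbone.features           # convolutional feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(                    # first fully connected network
            nn.Linear(1280, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 4),                      # 4 output channels
        )

    def forward(self, frame):
        x = self.pool(self.features(frame)).flatten(1)
        return self.fc(x)                           # angle, shift_x, shift_y, scale
```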
  • The computer device may use the affine transformation information output by the first fully connected network as the affine transformation information corresponding to the initial video frame, and perform affine transformation according to that affine transformation information to obtain the candidate region image corresponding to the initial video frame.
  • This is because the initial video frame has no previous video frame that can be referenced, and therefore there is no historical affine transformation information of a previous video frame available for it.
  • the target segmentation model may introduce supervision information corresponding to the affine transformation information as training samples for model training during training.
  • the supervision information corresponding to the affine transformation may be standard affine transformation information corresponding to the video frame samples.
  • the standard affine transformation information refers to the affine transformation information needed to convert video frame samples into templates.
  • the standard affine transformation information can be obtained by performing reflection similarity calculation on the position information of the sample key points included in the video frame samples and the position information of the template key points included in the template. Among them, how the template is obtained and the training process of the target segmentation model will be described in detail in the subsequent model training method.
  • In this way, the regional affine network in the target segmentation model can learn the information of the template, so as to accurately regress the affine transformation information of the initial video frame relative to the template.
  • The image features of the initial video frame are extracted through the convolutional neural network and processed through the first fully connected network, so that more accurate affine transformation information corresponding to the initial video frame can be predicted, which helps to improve the accuracy of target segmentation in subsequent processing.
  • the historical affine transformation information of the previous video frame is read in the buffer.
  • S204 Perform affine transformation on the current frame according to the historical affine transformation information to obtain a candidate region image corresponding to the current frame.
  • the computer device performs affine transformation on the current frame according to historical affine transformation information, which may be to correct the position, size, and orientation of the target in the current frame according to the affine transformation information to obtain the corresponding candidate Area image.
  • the candidate region image can also be called a region of interest (ROI).
  • The computer device may input the sequence of video frames to the target segmentation model, and execute the image segmentation method through the target segmentation model.
  • the target segmentation model is a model used for semantic segmentation of the target object in the video, and may be a machine learning model.
  • the target segmentation model may include multiple network structures, and different network structures include model parameters corresponding to their respective networks, and different network structures are used to perform different actions.
  • the computer device can input the video frame sequence into the target segmentation model, and perform affine transformation on the current frame according to the historical affine transformation information through the RAN network included in the target segmentation model to obtain the corresponding Candidate area image.
  • S206 Perform feature extraction on the candidate area image to obtain a feature map corresponding to the candidate area image.
  • the feature map is also called feature map, which is a spatial vector matrix obtained after convolution and/or pooling of an image through a convolutional neural network, and can be used to represent the image information of the image.
  • the computer device may perform feature extraction on the candidate region image to obtain a feature map corresponding to the candidate region image.
  • the computer device may perform feature extraction on the candidate region image through the second convolutional neural network in the target segmentation model to obtain a feature map corresponding to the candidate region image.
  • The convolutional neural network may be MobileNet-V2, a VGG (Visual Geometry Group) network, a ResNet (deep residual learning) network, or the like.
  • the second convolutional neural network can share parameters with the first convolutional neural network, so it can be regarded as the same convolutional neural network.
  • The terms "first" and "second" here are mainly used to distinguish convolutional neural networks located at different positions in the target segmentation model and used to process different data.
  • the feature map obtained by performing feature extraction on the image of the candidate region incorporates the optical flow information included in the video frame sequence.
  • the optical flow information is the motion change information of the image, which can be used to indicate the movement of each pixel in the video frame sequence in the video frame in the embodiment of the present application, including the motion change information of the target to be detected in the video frame.
  • the optical flow information corresponding to the previous video frame may be determined by the position corresponding to each pixel in the previous video frame and the position corresponding to each pixel in the current frame.
  • the target segmentation area where the target object in the current frame is located can be determined by the optical flow information corresponding to the previous video frame.
  • the target segmentation area where the target object is located in the current frame can be jointly predicted based on the optical flow information and the target segmentation area where the target object is located in the previous video frame.
  • Based on the idea of generative adversarial networks (Generative Adversarial Nets, GAN), a discriminator can be designed to introduce these two kinds of information at the same time. That is, during model training, either a CNN feature or an optical flow feature can be input to the discriminator, and the discriminator determines whether the currently input feature is an optical flow feature or a CNN feature.
  • When the discriminator cannot distinguish between CNN features and optical flow features, the second convolutional neural network can generate feature maps that incorporate optical flow information.
  • the more detailed training process between the discriminator and the second convolutional neural network will be described in detail in the embodiment of the subsequent model training stage.
  • the feature map obtained by performing feature extraction on the candidate region image is fused with the optical flow information included in the video frame sequence, which can avoid errors in the segmentation result, thereby generating a reasonable segmentation result with progressive time sequence.
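  • The adversarial idea above can be illustrated with the hypothetical sketch below: a small discriminator tries to tell the second convolutional neural network's feature maps from optical flow features, and training the generator to fool it encourages feature maps that incorporate optical flow information. The architecture and loss form are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDiscriminator(nn.Module):
    """Hypothetical discriminator: predicts whether an input feature map came
    from the optical flow branch (label 1) or the CNN branch (label 0)."""

    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 1),
        )

    def forward(self, feat):
        return self.net(feat)  # raw logit

def discriminator_loss(disc, cnn_feat, flow_feat):
    """One discriminator step; a generator step would instead push the CNN
    features toward the 'flow' label so the two become indistinguishable."""
    logits_cnn = disc(cnn_feat.detach())
    logits_flow = disc(flow_feat.detach())
    return (F.binary_cross_entropy_with_logits(logits_cnn, torch.zeros_like(logits_cnn))
            + F.binary_cross_entropy_with_logits(logits_flow, torch.ones_like(logits_flow)))
```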
  • S208 Perform semantic segmentation based on the feature map to obtain a segmentation result corresponding to the target in the current frame.
  • Semantic segmentation refers to a computer device segmenting the regions in an image that belong to given categories and providing their category information.
  • the segmentation result may be a target segmentation area formed by pixels belonging to the target object in the current frame.
  • the computer device may perform pixel dimension detection on the feature map, that is, detect each pixel in the candidate area image based on the feature map corresponding to the candidate area image, and output the detection result corresponding to the target in the current frame.
  • the computer device can identify the category corresponding to each pixel in the image of the candidate area, and form the target area according to each pixel of the corresponding target category. That is to distinguish the target object from the candidate area image.
  • the computer device may perform semantic segmentation on the image features of the candidate region through the fully convolutional neural network in the target segmentation model, and output the detection result corresponding to the target in the current frame.
  • Step S208, that is, performing semantic segmentation based on the feature map to obtain the segmentation result corresponding to the target in the current frame, includes: performing up-sampling processing on the feature map through a fully convolutional neural network to obtain an intermediate image; classifying each pixel in the intermediate image at the pixel level through the fully convolutional neural network to obtain the category corresponding to each pixel; and outputting the segmentation result of semantic segmentation of the target in the current frame according to the category corresponding to each pixel.
  • Fully Convolutional Neural Networks are usually used to classify input images pixel by pixel.
  • Fully convolutional neural networks can usually use deconvolution layers to upsample the feature map of the last convolution layer and restore it to the same size as the input image, so that a prediction can be generated for each pixel.
  • the spatial information in the original input image is retained, and finally pixel-by-pixel classification is performed on the up-sampled feature map.
  • Pixel-level refers to the pixel dimension; pixel-level classification refers to classification processing in the pixel dimension, which is a fine classification method.
  • Pixel-level classification of each pixel in the intermediate image can also be referred to as pixel-level classification of the intermediate image, that is, generating a prediction for each pixel in the intermediate image to obtain the category corresponding to each pixel.
  • The computer device can perform up-sampling on the feature map corresponding to the current frame through the fully convolutional neural network in the target segmentation model to obtain the intermediate image, and then perform pixel-level classification on each pixel in the intermediate image through the fully convolutional neural network to obtain the category corresponding to each pixel. For example, if pixels belonging to the target object in the candidate region image are assigned category 1 and pixels not belonging to the target object are assigned category 0, then the region formed by all pixels of the candidate region image with category 1 is the target segmentation area. The target area can thus be segmented from the candidate region image, for example by highlighting the target segmentation area in red or green.
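  • A minimal sketch of this segmentation branch is shown below, assuming two transposed-convolution upsampling stages (the patent does not fix the upsampling structure): the feature map is upsampled, every pixel is classified, and the pixels predicted as category 1 form the target segmentation area.

```python
import torch.nn as nn

class SegmentationHead(nn.Module):
    """Upsample the feature map back toward the candidate-region resolution and
    classify every pixel (0 = background, 1 = target). Stage count is assumed."""

    def __init__(self, in_channels: int, num_classes: int = 2):
        super().__init__()
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.classifier = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, feature_map):
        logits = self.classifier(self.upsample(feature_map))  # (N, classes, H, W)
        mask = logits.argmax(dim=1)   # per-pixel category; the '1' region is the target area
        return logits, mask
```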
  • The step of outputting the segmentation result of semantic segmentation of the target in the current frame includes: determining the pixels corresponding to the target category in the intermediate image; and segmenting, from the intermediate image, a target segmentation area composed of the pixels of the target category and including the target object.
  • When the fully convolutional neural network of the target segmentation model is trained, it can be trained based on video frame samples and the sample label information labeling the target objects in those samples; the trained fully convolutional neural network then has the ability to classify pixels.
  • The sample labeling information for labeling the target object in the video frame sample can mark the pixels corresponding to the target object as "1" and other pixels as "0" to distinguish the target object from non-target objects.
  • the computer device may determine the pixels corresponding to the target category in the intermediate image through the fully convolutional neural network in the target segmentation model.
  • the pixels belonging to the target category are labeled, such as red or green, etc., so as to segment the target segmentation area composed of pixels of the corresponding target category and including the target object from the intermediate image.
  • the target object can be accurately located in the current frame, and the area occupied by the target object in the current frame can be accurately determined.
  • the computer device may divide and display the target object in the video frame according to the detection result of each video frame, so as to realize the effect of automatically dividing the target in the video composed of continuous video frames.
  • The feature map is classified at the pixel level through the fully convolutional neural network to obtain the category corresponding to each pixel. Therefore, according to the category corresponding to each pixel, the target segmentation area where the target is located in the current frame can be accurately determined at the pixel level, which greatly improves the ability to segment the target object.
  • S210 Correct the historical affine transformation information according to the feature map to obtain updated affine transformation information, and use the updated affine transformation information as historical affine transformation information corresponding to a subsequent video frame in the video frame sequence.
  • correcting historical affine transformation information refers to adjusting historical affine transformation parameters to obtain updated affine transformation parameters.
  • The computer device can correct the historical affine transformation information according to the feature map to obtain updated affine transformation information, and the updated affine transformation information can be used as the affine transformation information corresponding to the subsequent video frame in the video frame sequence.
  • the computer device may process the feature map corresponding to the current frame through the second fully connected network included in the target segmentation model, and correct the affine transformation information to obtain updated affine transformation information.
  • the second fully connected network included in the target segmentation model can be trained to output the affine transformation difference result, and then based on the affine transformation difference result and the historical affine transformation passed by the previous video frame Information, calculate the updated affine transformation information delivered by the current frame.
  • the computer device can directly transfer the updated affine transformation information to the subsequent video frame for affine transformation of the subsequent video frame.
  • Step S210, that is, correcting the historical affine transformation information according to the feature map to obtain updated affine transformation information and using the updated affine transformation information as the historical affine transformation information corresponding to the subsequent video frame in the video frame sequence, includes the following steps: processing the feature map through the second fully connected network and outputting the affine transformation difference result through at least one output channel of the second fully connected network; calculating the updated affine transformation information transmitted by the current frame according to the affine transformation difference result and the historical affine transformation information transmitted by the previous video frame; and using the updated affine transformation information transmitted by the current frame as the historical affine transformation information corresponding to the subsequent video frame in the video frame sequence.
  • the second fully connected network and the first fully connected network are the same fully connected network, or are different fully connected networks.
  • the same fully connected network refers to the parameter sharing of the first fully connected network and the second fully connected network;
  • the different fully connected network refers to the first fully connected network and the second fully connected network having their own model parameters.
  • the second fully connected network included in the target segmentation model may be trained to output affine transformation difference results.
  • the feature map corresponding to the current frame can be processed through the second fully connected network in the target segmentation model, and the affine transformation difference result can be returned.
  • the difference result is the difference rate after normalization processing.
  • the computer device may calculate the updated affine transformation information delivered by the current frame based on the difference result of the affine transformation and the historical affine transformation information delivered by the previous video frame. For example, when the affine transformation information is the affine transformation parameter, the computer device can calculate the updated affine transformation information by the following formula:
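  • The formula itself is not reproduced in this text. Assuming the affine transformation difference result is a normalized rate for each parameter, one plausible form of the update is a multiplicative correction of each historical parameter, sketched below; this form is an assumption, not the patent's stated formula.

```python
def update_affine_info(historical_params, diff_rates):
    """Plausible update (assumption): new parameter = old parameter * (1 + rate),
    applied element-wise to (angle, shift_x, shift_y, scale)."""
    return [p * (1.0 + d) for p, d in zip(historical_params, diff_rates)]
```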
  • the computer device can use the calculated updated affine transformation information as the historical affine transformation information transferred in the current frame, that is, transfer the updated affine transformation information to the subsequent video frame in the video frame sequence, For subsequent video frames to perform affine transformation according to the updated affine transformation information.
  • The supervision information of the second fully connected network during the training process may be the difference information between the standard affine transformation information and the affine transformation information corresponding to the current frame.
  • the feature map is processed through the second fully connected network to correct the affine transformation information used in the current frame to obtain updated affine transformation information.
  • the updated affine transformation information is used for backward transfer, which can correct the positioning of the current frame and reduce segmentation errors caused by incorrect positioning.
  • the second fully connected network included in the target segmentation model may be trained to output corrected and updated affine transformation information.
  • the computer device can directly transfer the updated affine transformation information to the subsequent video frame for affine transformation of the subsequent video frame.
  • In this case, the supervision information of the second fully connected network during the training process may be the standard affine transformation information corresponding to the current frame.
  • the above-mentioned image segmentation method performs affine transformation on the current frame according to the historical affine transformation information transmitted by the previous video frame to obtain the candidate region image corresponding to the current frame.
  • the historical affine transformation information transmitted by the previous video frame is a modified parameter, which can greatly improve the accuracy of image acquisition of the candidate area. Semantic segmentation of the feature map corresponding to the candidate region image can accurately obtain the segmentation result corresponding to the target in the current frame.
  • the historical affine transformation information is corrected according to the feature map, and the corrected affine transformation information is transferred to the subsequent video frame for use in the subsequent video frame. In this way, the positioning of the current frame can be corrected, the error caused by the wrong positioning to the subsequent segmentation processing is reduced, and the accuracy of the semantic segmentation processing of the video is greatly improved.
  • In an embodiment, the image segmentation method is executed by a target segmentation model and includes the following steps: obtaining the current frame in the video frame sequence and the historical affine transformation information of the previous video frame; performing affine transformation on the current frame based on the historical affine transformation information through the regional affine network in the target segmentation model to obtain the candidate region image corresponding to the current frame; performing feature extraction on the candidate region image through the second convolutional neural network in the target segmentation model to obtain the feature map corresponding to the candidate region image; performing semantic segmentation on the feature map through the fully convolutional neural network in the target segmentation model to obtain the segmentation result corresponding to the target in the current frame; and correcting the historical affine transformation information through the second fully connected network in the target segmentation model to obtain updated affine transformation information, using the updated affine transformation information as the historical affine transformation information corresponding to the subsequent video frame in the video frame sequence.
  • the target object in the video can be automatically and accurately segmented through the trained target segmentation model, which has strong real-time performance.
  • The end-to-end network has a high degree of engineering readiness, which makes it easy to migrate to mobile devices, and it has a high adaptive capacity.
  • the overall framework diagram includes a regional affine network (RAN) 510, a second convolutional neural network (generator) 520, a fully convolutional neural network 530, and a second fully connected network 540.
  • the regional affine network 510 includes a first convolutional neural network (generator) 512 and a first fully connected network 514.
  • When performing target segmentation on the target object in the video, each video frame in the video frame sequence is input frame by frame. If the current frame is the initial video frame, feature extraction is performed on the initial video frame through the first convolutional neural network 512, and the obtained image features are input into the first fully connected network 514 to regress the current affine transformation information.
  • the regional affine network 510 performs affine transformation on the initial video frame according to the current affine transformation information to obtain the corresponding candidate region image (ROI). Then, feature extraction is performed on the candidate region image through the second convolutional neural network 520 to obtain a feature map corresponding to the candidate region image.
  • the feature map enters two task branches.
  • In the segmentation task branch, the segmentation prediction map is obtained after up-sampling through the fully convolutional neural network 530 and the segmentation result is output; in the positioning task branch, the second fully connected network regresses the affine transformation difference result. Then, the affine transformation information corresponding to the current frame is corrected according to the affine transformation difference result to obtain updated affine transformation information, and the updated affine transformation information is passed to the next frame.
  • For the next video frame, the RAN network performs affine transformation according to the updated affine transformation information to obtain the ROI corresponding to the next video frame, and the second convolutional neural network 520 performs feature extraction on the candidate region image to obtain the corresponding feature map.
  • The feature map enters two task branches. In the segmentation task branch, the segmentation prediction map is obtained after up-sampling through the fully convolutional neural network 530 and the segmentation result is output; in the positioning task branch, the second fully connected network regresses the affine transformation difference result.
  • the affine transformation information corresponding to the next video frame is corrected according to the affine transformation difference result to obtain the updated affine transformation information, and the updated affine transformation information is transferred to the subsequent video frame.
  • the effect of segmenting the target in the video is finally realized.
  • the video frame sequence belongs to a detection video obtained by medical detection of a biological tissue, for example, it may be a cardiac ultrasound detection video.
  • the target in the video frame is the left ventricle, and the detection result is the segmentation of the left ventricle in the video frame.
  • FIG. 6 is a schematic diagram of the architecture of performing target segmentation on the left ventricle in the cardiac ultrasound detection video in an embodiment.
  • The previous frame is frame t-1, and the current frame is frame t.
  • When performing target segmentation on the cardiac ultrasound detection video, for the previous video frame (frame t-1), the generator and the fully connected network in the RAN network generate the predicted affine transformation information, and affine transformation is performed according to that information to obtain the candidate region image (ROI) of the previous video frame. The generator then extracts image features, which enter the segmentation task branch and the positioning task branch respectively, yielding the segmentation result for frame t-1 and the affine transformation difference parameter. The affine transformation difference parameter is transferred to the current frame (frame t), and the regional affine network performs affine transformation on the current frame based on the difference parameter and the predicted affine transformation information to obtain the candidate region image (ROI) of the current frame. The generator again extracts image features, which enter the segmentation task branch and the positioning task branch respectively to obtain the segmentation result for frame t and the affine transformation difference parameter for the next frame.
  • the image segmentation method is performed by a target segmentation model, and the training steps of the target segmentation model include:
  • S602 Obtain video frame samples, sample label information corresponding to the video frame samples, and standard affine transformation information corresponding to the video frame samples.
  • the video frame samples, the sample label information corresponding to the video frame samples, and the standard affine transformation information corresponding to the video frame samples are training data.
• the sample labeling information corresponding to the video frame sample may include sample key point position information for labeling the key points in the video frame sample, and sample area position information for labeling the target object in the video frame sample.
  • the key points in the video frame sample are key points used to determine the target object, and the number of key points may be 3, 4, or other numbers.
• When the target object in the video frame sequence is the left ventricle, the key points in the corresponding video frame sample can be the tip of the left ventricle and both ends of the left ventricular mitral valve.
• The sample key point position information can be the position information of the tip of the left ventricle and the two ends of the left ventricular mitral valve; the sample area position information may be the position information of the area where the left ventricle is located in the video frame sample.
  • the standard affine transformation information is the affine transformation information of the video frame sample relative to the template, that is to say, the video frame sample can be subjected to affine transformation to obtain the template according to the standard affine transformation information.
  • the template is an image that can represent a standard video frame based on statistics of multiple video frame samples.
• step S602, namely obtaining video frame samples, sample label information corresponding to the video frame samples, and standard affine transformation information corresponding to the video frame samples, includes the following steps: obtaining video frame samples and corresponding sample labeling information, the sample labeling information including sample key point position information and sample area position information; determining, according to the video frame samples, the sample key point position information, and the sample area position information, the template image and the template key point position information corresponding to the template image; and calculating the standard affine transformation information corresponding to the video frame samples according to the sample key point position information and the template key point position information.
  • the computer device may obtain multiple video frame samples from local or other computer devices.
  • the video frame samples are manually labeled or machine labeled to mark the key points of the sample and the location area of the target object in the video frame sample.
  • the computer device can determine the template and the key point position information of the template in the template according to a plurality of video frame samples including sample labeling information.
  • the computer device may average the key point position information in multiple video frame samples to obtain the template key point position information.
  • the computer device can determine the area frame that includes the target object based on the key points in each video frame sample, and expand the area frame by a certain range to obtain the ROI of this video frame sample. Then calculate the average size of the ROI corresponding to all the video frame samples, and adjust the ROI corresponding to all the video frame samples to the average size.
  • the template can be obtained by averaging all the ROI images adjusted to the average size.
  • the key point position information of the template can be obtained by averaging the position information of the key points in each ROI image.
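• A minimal sketch of the template construction described above (resize all ROIs to their mean size, average the images, and average the keypoint positions) is given below; the array layouts and the use of OpenCV's resize are illustrative assumptions, not the patented implementation.

```python
import numpy as np
import cv2

def build_template(roi_images, roi_keypoints):
    """roi_images: list of HxW grayscale ROI crops (one per video frame sample).
    roi_keypoints: list of (K, 2) keypoint coordinates inside each ROI."""
    # Average ROI size over all samples.
    mean_h = int(round(np.mean([img.shape[0] for img in roi_images])))
    mean_w = int(round(np.mean([img.shape[1] for img in roi_images])))

    resized, scaled_kps = [], []
    for img, kps in zip(roi_images, roi_keypoints):
        h, w = img.shape[:2]
        resized.append(cv2.resize(img, (mean_w, mean_h)).astype(np.float32))
        # Scale keypoints into the resized coordinate frame.
        scaled_kps.append(np.asarray(kps, dtype=np.float32) *
                          np.array([mean_w / w, mean_h / h], dtype=np.float32))

    template = np.mean(resized, axis=0)        # pixel-wise average image
    template_kps = np.mean(scaled_kps, axis=0) # average keypoint positions
    return template, template_kps
```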
  • FIG. 8 is a flowchart of obtaining a template in an embodiment.
• the computer equipment can collect a variety of standard cardiac views through the collector in advance, such as A2C (apical-2-chamber, two-chamber view), A3C (apical-3-chamber, three-chamber view), A4C (apical-4-chamber, four-chamber view), A5C (apical-5-chamber, five-chamber view), etc.
• The area frame can be expanded to the left and downward by a certain proportion, such as 50% of its length and width.
• The area around the area frame is then expanded on the basis of this frame by a certain percentage, such as 5% of the length and width, to obtain the ROI of this view. The ROIs of all views are adjusted to one scale (the average size of all ROIs), and the template is obtained by averaging.
• The computer device can calculate the reflection similarity according to the size and key point position information of each video frame sample and the size and key point position information of the template, to obtain a transformation matrix that contains the affine transformation information.
  • the affine transformation information calculated by this method is the standard affine transformation information corresponding to the video frame sample.
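• As a hedged illustration of computing the standard affine transformation information from keypoint correspondences, a similarity transform can be estimated between the sample keypoints and the template keypoints, for example with OpenCV; whether this exactly matches the "reflection similarity" calculation of the embodiment is an assumption.

```python
import numpy as np
import cv2

def standard_affine_from_keypoints(sample_kps, template_kps):
    """Estimate a 2x3 similarity transform mapping sample keypoints to template
    keypoints; this stands in for the standard affine transformation information
    (supervision) of one video frame sample."""
    src = np.asarray(sample_kps, dtype=np.float32).reshape(-1, 1, 2)
    dst = np.asarray(template_kps, dtype=np.float32).reshape(-1, 1, 2)
    matrix, _ = cv2.estimateAffinePartial2D(src, dst)  # similarity (rotation, scale, shift)
    return matrix                                       # 2x3 standard transform
```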
• The template image and the template key point position information corresponding to the template image can be determined according to the video frame samples, the sample key point position information, and the sample area position information. Therefore, each video frame sample can be compared with the template to determine the standard affine transformation information, which can be used as the supervision information for subsequent model training, so that the target segmentation model can learn the template information, thereby greatly improving the prediction accuracy of affine transformation information.
  • S604 Input the video frame sample into the target segmentation model for training, and determine the prediction affine transformation information corresponding to the video frame sample through the target segmentation model.
  • the computer device may input the video frame samples into the target segmentation model, execute the aforementioned image segmentation method according to the target segmentation model, and obtain the predicted affine transformation information corresponding to the video frame samples through the RAN network.
  • S606 Construct an affine loss function according to the predicted affine transformation information and the standard affine transformation information.
  • the affine loss function is used to evaluate the degree of difference between the predicted affine transformation information and the standard affine transformation information.
• The affine loss function is responsible for training the RAN network, so that the RAN network in the target segmentation model can generate accurate affine transformation information relative to the template. In this way, the introduction of affine supervision information makes the affine parameter prediction more accurate.
  • the computer device may construct an affine loss function based on the predicted affine transformation information and the standard affine transformation information.
• The computer device can calculate the loss between the predicted affine transformation information and the standard affine transformation information through a distance function, such as the L1-Norm (L1-norm, also known as Manhattan distance) function, that is, construct the affine loss function between the predicted affine transformation information and the standard affine transformation information based on the L1-Norm function.
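• A minimal PyTorch sketch of the L1-Norm based affine loss described above (predicted versus standard affine transformation parameters); the tensor shape is an illustrative assumption.

```python
import torch

def affine_loss(pred_theta, std_theta):
    """L1 (Manhattan-distance) loss between predicted and standard
    affine transformation parameters, e.g. tensors of shape (N, 4)."""
    return torch.nn.functional.l1_loss(pred_theta, std_theta)
```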
• The computer device can input the video frame samples into the target segmentation model, execute the aforementioned image segmentation method according to the target segmentation model, and output the predicted affine transformation difference information corresponding to the video frame samples and the predicted segmentation result corresponding to the target in the video frame samples.
  • the computer device may perform affine transformation on the video frame samples according to the predicted affine transformation information through the RAN network in the target segmentation model to obtain corresponding sample candidate region images.
• Feature extraction is performed on the sample candidate region image through the second convolutional neural network in the target segmentation model to obtain the corresponding sample feature map.
  • the sample feature map is semantically segmented, and the predicted segmentation result corresponding to the target in the video frame sample is obtained.
  • the predicted affine transformation information is corrected based on the sample feature map to obtain the predicted affine transformation difference information corresponding to the video frame samples.
  • S610 Determine standard affine transformation difference information according to the difference between the predicted affine transformation information and the standard affine transformation information.
  • the standard affine transformation difference information is the supervision information of the affine transformation correction module in the target segmentation model, that is, the supervision information of the second fully connected network in the training process.
  • the computer device may determine the standard affine transformation difference information according to the difference between the predicted affine transformation information and the standard affine transformation information. For example, when the affine transformation information is the affine transformation parameter, the computer device can calculate the standard affine transformation difference information by the following formula:
• $\Delta\theta_t = \theta_t - \hat{\theta}_t$, where $\Delta\theta_t$ represents the standard affine transformation difference parameter; $\hat{\theta}_t$ represents the affine transformation parameter corresponding to the current frame, that is, the predicted affine transformation parameter; and $\theta_t$ represents the standard affine transformation parameter.
  • S612 Construct an affine transformation information correction loss function according to the standard affine transformation difference information and the predicted affine transformation difference information.
  • the affine transformation information correction loss function is used to evaluate the degree of difference between the predicted affine transformation difference information and the standard affine transformation difference information.
• The affine transformation information correction loss function is responsible for training the second fully connected network, so that the second fully connected network in the target segmentation model can generate the affine transformation difference information that corrects the predicted affine transformation information.
  • the computer device may construct the affine transformation information correction loss function based on the standard affine transformation difference information and the predicted affine transformation difference information.
• The computer device can calculate the loss between the standard affine transformation difference information and the predicted affine transformation difference information through a distance function, such as the L1-Norm function, that is, construct the affine transformation information correction loss function based on the L1-Norm function.
• It should be noted that, in addition to distance functions such as the L1-Norm function, other functions can also be used to construct the affine transformation information correction loss function, as long as the function can measure the degree of difference between the standard affine transformation difference information and the predicted affine transformation difference information, such as the L2-Norm function.
  • the predicted affine transformation difference information is used to determine the updated affine transformation information, and is transmitted to the subsequent video frame in the video frame sequence.
• When the affine transformation information is the affine transformation parameter, the updated affine transformation parameter can be calculated by the following formula:
• $\hat{\theta}_t' = \hat{\theta}_t + \Delta\hat{\theta}_t$, where $\hat{\theta}_t'$ represents the updated affine transformation parameter passed on by the current frame, $\Delta\hat{\theta}_t$ represents the predicted affine transformation difference parameter, and $\hat{\theta}_t$ represents the predicted affine transformation parameter.
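• The two formulas above can be combined into a short sketch: the standard difference is the standard parameters minus the predicted parameters, the correction loss compares the predicted difference with this supervision, and the updated parameters passed to the next frame add the predicted difference back; the tensor shapes and names are assumptions.

```python
import torch.nn.functional as F

def correction_targets_and_update(pred_theta, std_theta, pred_delta):
    """pred_theta: predicted affine parameters for the current frame.
    std_theta: standard (supervision) affine parameters.
    pred_delta: affine transformation difference predicted by the second FC network."""
    std_delta = std_theta - pred_theta       # standard affine transformation difference
    correction_loss = F.l1_loss(pred_delta, std_delta)
    updated_theta = pred_theta + pred_delta  # passed on to the next video frame
    return correction_loss, updated_theta
```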
  • S614 Determine a segmentation loss function according to the predicted segmentation result and sample label information.
  • the segmentation loss function is used to evaluate the degree of difference between the predicted segmentation result and the sample label information.
• The segmentation loss function is responsible for training the fully convolutional neural network, so that the fully convolutional neural network in the target segmentation model can accurately segment the target object from the input video frame.
  • the computer device may determine the segmentation loss function according to the predicted segmentation result and sample label information.
• S616 Adjust the model parameters of the target segmentation model according to the affine loss function, the affine transformation information correction loss function, and the segmentation loss function, and continue training until the training stop condition is met.
  • the training stop condition is a condition for ending model training.
• The training stop condition may be that a preset number of iterations is reached, or that the performance index of the target segmentation model after adjusting the model parameters reaches a preset index. Adjusting the model parameters of the target segmentation model is to train the target segmentation model.
• The computer device can jointly adjust the model parameters of each network structure in the target segmentation model according to the affine loss function, the affine transformation information correction loss function, and the segmentation loss function, and continue training until the training stop condition is met.
  • the computer device can adjust the model parameters in the direction of reducing the difference between the corresponding prediction result and the reference parameter.
• By continuously inputting video frame samples, the predicted affine transformation information, the predicted affine transformation difference information, and the predicted segmentation result are obtained. The model parameters are then adjusted according to the difference between the predicted affine transformation information and the standard affine transformation information, the difference between the predicted affine transformation difference information and the standard affine transformation difference information, and the difference between the predicted segmentation result and the sample label information, so as to train the target segmentation model and obtain a trained target segmentation model.
• In the above model training method, on the one hand, affine transformation supervision information, that is, standard affine transformation information, is introduced in the model training process to improve the accuracy of position prediction; on the other hand, the predicted affine transformation information can be corrected through training, thereby reducing segmentation errors caused by incorrect positioning.
  • the affine loss function, affine transformation information correction loss function, and segmentation loss function are superimposed and optimized together, so that each part influences and improves each other during the training process, so that the trained target segmentation model has accurate video semantic segmentation performance.
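• A sketch of the joint optimization described above, summing the three losses and backpropagating once; the optimizer choice and the (optional) weighting coefficients are assumptions for illustration.

```python
def training_step(optimizer, affine_loss, correction_loss, seg_loss,
                  w_aff=1.0, w_corr=1.0, w_seg=1.0):
    """Superimpose the affine, correction and segmentation losses and update
    all network structures of the target segmentation model jointly."""
    total = w_aff * affine_loss + w_corr * correction_loss + w_seg * seg_loss
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```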
  • the model training method includes the following steps:
• The first video frame sample and the second video frame sample are different video frame samples.
• The first video frame sample is the video frame before the second video frame sample, that is, the generation time of the first video frame sample is earlier than that of the second video frame sample.
  • the first video frame sample and the second video frame sample may be adjacent video frames.
  • S804 Obtain sample labeling information corresponding to the first video frame sample and the second video frame sample, and standard affine transformation information corresponding to the first video frame sample, respectively.
  • the computer device may separately obtain sample labeling information corresponding to the first video frame sample and the second video frame sample, and standard affine transformation information corresponding to the first video frame sample.
  • the sample labeling information may include sample key point location information and sample area location information.
  • the obtaining step of the standard affine transformation information refer to the obtaining step described in the foregoing embodiment.
  • FIG. 10 is a schematic diagram of the architecture of the target segmentation model in the model training process in an embodiment.
• The computer device can input two adjacent video frame samples as a sample pair into the target segmentation model, and process the first video frame sample through the target segmentation model to obtain the predicted affine transformation information corresponding to the first video frame sample.
  • S808 Construct an affine loss function according to the predicted affine transformation information and the standard affine transformation information.
  • the computer device may construct an affine loss function based on the predicted affine transformation information and the standard affine transformation information.
• The computer device can calculate the loss between the predicted affine transformation information and the standard affine transformation information through a distance function, such as the L1-Norm function, that is, construct the affine loss function between the predicted affine transformation information and the standard affine transformation information based on the L1-Norm function. In addition to distance functions such as the L1-Norm function, other functions that can measure the degree of difference between the predicted affine transformation information and the standard affine transformation information can also be used.
  • S810 Perform affine transformation on the first video frame sample according to the predicted affine transformation information to obtain a first sample candidate region image, and perform feature extraction on the first sample candidate region image to obtain a first sample feature map.
• The computer device can perform affine transformation on the first video frame sample according to the predicted affine transformation information to obtain the first sample candidate region image, and perform feature extraction on the first sample candidate region image through the Generator (which can be implemented by a convolutional neural network) to obtain the first sample feature map corresponding to the first video frame sample.
  • S812 Perform semantic segmentation based on the feature map of the first sample to obtain a prediction segmentation result corresponding to the target in the first video frame sample.
• The first sample feature map then enters two task branches, one of which is the segmentation task branch.
• The target segmentation model can perform semantic segmentation processing on the first sample feature map through the fully convolutional neural network: after two rounds of upsampling through the fully convolutional neural network, a per-pixel prediction is made to obtain the predicted segmentation result corresponding to the target in the first video frame sample.
  • the second task branch is the positioning task branch.
• The first sample feature map regresses a new affine transformation difference parameter through a fully connected layer with 4 output channels, that is, the predicted affine transformation difference information.
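• The two task branches described above can be sketched as a small PyTorch module: a fully convolutional head that upsamples twice for per-pixel prediction, and a fully connected head with 4 output channels that regresses the affine transformation difference; the channel counts and layer layout are illustrative assumptions.

```python
import torch.nn as nn

class TwoBranchHead(nn.Module):
    """Segmentation branch (two upsampling steps, per-pixel prediction)
    and positioning branch (FC layer with 4 output channels)."""
    def __init__(self, in_channels=256, num_classes=2):
        super().__init__()
        self.seg_branch = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(in_channels, in_channels // 2, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(in_channels // 2, num_classes, 1),  # per-pixel class scores
        )
        self.pos_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_channels, 4),                    # affine transformation difference
        )

    def forward(self, feat):
        return self.seg_branch(feat), self.pos_branch(feat)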
  • S816 Determine standard affine transformation difference information according to the difference between the predicted affine transformation information and the standard affine transformation information.
• The computer device may determine the standard affine transformation difference information according to the difference between the predicted affine transformation information and the standard affine transformation information, that is, $\Delta\theta_t = \theta_t - \hat{\theta}_t$, where $\theta_t$ represents the standard affine transformation parameter and $\hat{\theta}_t$ represents the predicted affine transformation parameter.
  • S818 Construct an affine transformation information correction loss function based on the standard affine transformation difference information and the predicted affine transformation difference information.
  • the computer device may construct the affine transformation information correction loss function based on the standard affine transformation difference information and the predicted affine transformation difference information.
• The computer device can calculate the loss between the standard affine transformation difference information and the predicted affine transformation difference information through a distance function, such as the L1-Norm function, that is, construct the affine transformation information correction loss function based on the L1-Norm function.
• It should be noted that, in addition to distance functions such as the L1-Norm function, other functions can also be used to construct the affine transformation information correction loss function, as long as the function can measure the degree of difference between the standard affine transformation difference information and the predicted affine transformation difference information, such as the L2-Norm function.
  • the predicted affine transformation difference information is used to determine the updated affine transformation information, and is transmitted to the subsequent video frame in the video frame sequence.
• When the affine transformation information is the affine transformation parameter, the updated affine transformation information can be calculated by the following formula:
• $\hat{\theta}_t' = \hat{\theta}_t + \Delta\hat{\theta}_t$, where $\hat{\theta}_t'$ represents the updated affine transformation information delivered by the current frame, $\Delta\hat{\theta}_t$ represents the predicted affine transformation difference parameter, and $\hat{\theta}_t$ represents the predicted affine transformation parameter.
  • S820 Determine corresponding optical flow information according to the first video frame sample and the second video frame sample, and determine the optical flow feature map according to the optical flow information and the first sample feature map.
  • the computer device may determine the corresponding optical flow information according to the first video frame sample and the second video frame sample.
• The computer device can calculate the optical flow information corresponding to the first video frame sample through the Lucas-Kanade (a two-frame differential optical flow estimation method) optical flow method.
  • the computer device can calculate the optical flow characteristic map according to the optical flow information and the first sample characteristic map.
  • the optical flow feature map may be considered to be a feature map corresponding to the second video frame sample predicted by the first video frame sample, which incorporates the optical flow information.
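• A hedged sketch of deriving an optical-flow feature map: dense flow is estimated between the two frames (Farneback dense flow is used here as a stand-in for the Lucas-Kanade method named in the embodiment) and the first sample feature map is warped with that flow; resizing the flow to the feature resolution is also an assumption of this sketch.

```python
import cv2
import numpy as np
import torch
import torch.nn.functional as F

def optical_flow_feature_map(frame1_gray, frame2_gray, feat1):
    """frame1_gray/frame2_gray: HxW uint8 frames; feat1: (1, C, h, w) feature map
    of the first video frame sample. Returns a feature map warped by the flow,
    approximating the second frame's features."""
    flow = cv2.calcOpticalFlowFarneback(frame1_gray, frame2_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)   # (H, W, 2)
    h, w = feat1.shape[-2:]
    flow = cv2.resize(flow, (w, h))                                  # match feature size
    flow[..., 0] *= w / frame1_gray.shape[1]
    flow[..., 1] *= h / frame1_gray.shape[0]

    # Build a sampling grid in normalized [-1, 1] coordinates for grid_sample.
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    grid_x = 2.0 * (xs + flow[..., 0]) / max(w - 1, 1) - 1.0
    grid_y = 2.0 * (ys + flow[..., 1]) / max(h - 1, 1) - 1.0
    grid = torch.from_numpy(np.stack([grid_x, grid_y], axis=-1)).float()[None]
    return F.grid_sample(feat1, grid, align_corners=True)
```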
  • S822 Use the optical flow feature map and the second sample feature map as the sample input of the discriminator in the target segmentation model, and perform classification processing on the sample input through the discriminator to obtain the predicted category of the sample input.
  • the target segmentation network further includes a discriminator (Discriminator) in the model training stage.
• The computer device can use the optical flow feature map and the second sample feature map as the sample input of the discriminator in the target segmentation model: either one of the two is input, and the Discriminator is used to determine whether the input feature is the optical flow feature map or the second sample feature map.
  • the second sample feature map is the sample feature map corresponding to the second video frame sample, and may also be referred to as a CNN feature map.
  • S824 Construct an adversarial loss function according to the prediction category and the reference category corresponding to the sample input.
  • the reference category corresponding to the sample input may be categories corresponding to the optical flow feature map and the second sample feature map, such as the optical flow category and the feature category.
• The Discriminator is essentially a binary classification network. The computer device can use binary cross entropy as the loss function of the Discriminator to determine whether the sample input is an optical flow feature map. That is, according to the prediction category and the reference category corresponding to the sample input, the adversarial loss function of the target segmentation model is constructed based on the cross entropy function.
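• A minimal sketch of this binary cross entropy adversarial loss: the discriminator classifies whether an input feature map came from the optical flow path or from the CNN path; the discriminator architecture is not shown, and assigning labels 1/0 to the two categories is an assumption.

```python
import torch
import torch.nn.functional as F

def adversarial_loss(discriminator, flow_feat, cnn_feat):
    """Two-class cross entropy (here as BCE on a single logit) for telling
    the optical flow feature map apart from the second sample (CNN) feature map."""
    logits_flow = discriminator(flow_feat)   # reference label 1: optical flow category
    logits_cnn = discriminator(cnn_feat)     # reference label 0: CNN feature category
    loss = F.binary_cross_entropy_with_logits(logits_flow, torch.ones_like(logits_flow)) + \
           F.binary_cross_entropy_with_logits(logits_cnn, torch.zeros_like(logits_cnn))
    return loss
```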
  • S826 Construct a segmentation loss function according to the optical flow feature map, the second sample feature map, and the reference feature map; the reference feature map is a feature map obtained by feature extraction of the target in the second video frame sample.
  • the computer device may perform feature extraction on the target in the second video frame sample to obtain a reference feature map. Furthermore, the computer device can construct a segmentation loss function based on the optical flow feature map, the second sample feature map, and the reference feature map.
  • the computer device can construct the segmentation loss function through the following formula:
• $L_{seg} = f_{dice}(F'_{CNN}, F_{CNN}) + f_{bce}(F'_{CNN}, F_{CNN}) + f_{mse}(F'_{CNN}, F'_{OF})$
• where F′_CNN and F′_OF respectively represent the second sample feature map and the optical flow feature map obtained by optical flow; F_CNN represents the reference feature map; and f_dice, f_bce, and f_mse respectively represent the Dice calculation formula, the binary cross entropy calculation formula, and the mean square error calculation formula.
• The larger f_mse is, the larger the gap between the second sample feature map and the optical flow feature map is, and the greater the penalty applied to the Generator when updating its parameters, so that the Generator is encouraged to generate feature maps that better conform to the optical flow characteristics.
• f_dice and f_bce encourage the Generator to produce feature maps that better match the manually labeled information.
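• A sketch of a segmentation loss combining the three terms described above (Dice and binary cross entropy against the reference map, mean square error between the CNN and optical flow feature maps); treating the feature maps as probability maps of the same shape is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def dice_term(pred, target, eps=1e-6):
    """Soft Dice loss between a predicted probability map and a reference map."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def segmentation_loss(f_cnn_pred, f_of_pred, f_reference):
    """f_cnn_pred: second sample feature map F'_CNN (values in [0, 1]);
    f_of_pred: optical flow feature map F'_OF; f_reference: reference map F_CNN."""
    l_dice = dice_term(f_cnn_pred, f_reference)
    l_bce = F.binary_cross_entropy(f_cnn_pred, f_reference)
    l_mse = F.mse_loss(f_cnn_pred, f_of_pred)   # penalizes the gap to the optical flow map
    return l_dice + l_bce + l_mse
```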
• The computer device can jointly adjust the model parameters of each network structure in the target segmentation model according to the affine loss function, the affine transformation information correction loss function, the adversarial loss function, and the segmentation loss function, and continue training until the training stop condition is met.
  • a combination of cross training and joint training may be used for training.
• For example, the computer device may first train the generator for a period of time and then fix the parameters obtained from that training, temporarily not updating them. The discriminator is then trained, after which its parameters are fixed and the generator is trained again; once the training results are stable, all network structures are trained jointly. The training stop condition at this point can also be regarded as a convergence condition: the loss function of the discriminator no longer drops, the output of the discriminator stabilizes at about (0.5, 0.5), and the discriminator cannot distinguish between the optical flow feature map and the CNN feature map.
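• The cross-training scheme just described can be sketched as alternating phases in which either the generator or the discriminator parameters are frozen; the phase length and the helper names (gen_step, disc_step) are assumptions for this sketch.

```python
def set_requires_grad(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def alternating_phase(generator, discriminator, gen_step, disc_step,
                      data_loader, train_generator=True, num_batches=100):
    """Train one network while the other is fixed; gen_step/disc_step are
    callables that compute the corresponding losses and call optimizer.step()."""
    set_requires_grad(generator, train_generator)
    set_requires_grad(discriminator, not train_generator)
    for i, batch in enumerate(data_loader):
        if i >= num_batches:
            break
        if train_generator:
            gen_step(batch)
        else:
            disc_step(batch)
```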
• As the generator and the discriminator contend with each other, the entire network eventually reaches a convergent state.
  • the generator will eventually generate the common part of the CNN feature and the optical flow information, and the discriminator will not be able to distinguish the difference between the optical flow feature and the CNN feature.
  • the discriminator can be removed, and the generator will generate a feature map fused with optical flow information.
  • each generator in the target segmentation model can share parameters. That is to say, the three generators in Fig. 9 can be regarded as the same generator.
• In the above model training method, on the one hand, affine transformation supervision information, that is, standard affine transformation information, is introduced in the model training process to improve the accuracy of position prediction; on the other hand, the predicted affine transformation information can be corrected through training, thereby reducing segmentation errors caused by incorrect positioning.
• The adversarial learning method with optical flow information is used to achieve consistency of the network in time sequence, which makes the training more targeted and yields better performance.
• The affine loss function, the affine transformation information correction loss function, the adversarial loss function, and the segmentation loss function are superimposed and optimized together, so that each part influences and improves the others during the training process, and the target segmentation model obtained by training can segment the target object from the video accurately and smoothly.
  • a model training method is provided. This embodiment mainly takes the method applied to the computer device in FIG. 1 as an example.
• The model training method includes the following steps: obtaining video frame samples, sample label information corresponding to the video frame samples, and standard affine transformation information corresponding to the video frame samples; inputting the video frame samples into the target segmentation model for training, and determining, through the target segmentation model, the predicted affine transformation information corresponding to the video frame samples; constructing the affine loss function based on the predicted affine transformation information and the standard affine transformation information; outputting, through the target segmentation model, the predicted affine transformation difference information corresponding to the video frame samples and the predicted segmentation result corresponding to the target in the video frame samples; determining the standard affine transformation difference information according to the difference between the predicted affine transformation information and the standard affine transformation information; constructing the affine transformation information correction loss function according to the standard affine transformation difference information and the predicted affine transformation difference information; determining the segmentation loss function according to the predicted segmentation result and the sample label information; and adjusting the model parameters of the target segmentation model according to the affine loss function, the affine transformation information correction loss function, and the segmentation loss function, and continuing training until the training stop condition is met.
  • a cardiac ultrasound detection video is taken as an example to illustrate the training process of the target segmentation model in detail.
• Two adjacent video frame samples can be input into the RAN network as a sample pair.
• The current frame is corrected by the affine transformation of the RAN network in terms of target position, size, and orientation, and an ROI image with a distribution similar to the template is obtained.
• The corrected ROI image removes a lot of interference, such as the similarity between other heart chambers and the left ventricle, and the influence of image marks and artifacts.
  • the generator is used again to extract features from the ROI image.
  • the output features enter two task branches.
• In the segmentation task branch, the output features are upsampled twice to obtain the segmentation prediction map, and the segmentation results are output;
• in the positioning task branch, the feature regresses a new affine transformation difference result through a fully connected layer with 4 output channels.
• The affine transformation information generated in the first stage is corrected a second time by the regressed difference.
• The supervision information for the affine transformation difference result in the second stage can be calculated by the following formula: $\Delta\theta_t = \theta_t - \hat{\theta}_t$, where $\Delta\theta_t$ represents the standard affine transformation difference information, $\hat{\theta}_t$ represents the affine transformation parameter corresponding to the current frame, that is, the predicted affine transformation parameter, and $\theta_t$ represents the standard affine transformation parameter.
  • the affine transformation difference parameters predicted by the current frame in the second stage will be used to calculate the updated affine transformation information and propagate to the next video frame.
• The next video frame is directly subjected to affine transformation according to the above parameters to obtain the ROI. Features are then extracted from the ROI through the Generator, and the segmentation result and the affine transformation difference are predicted again.
• In the second stage, on the basis of the first stage, a secondary correction of the affine transformation information is carried out, as shown in the above formula; that is, the second stage predicts the change of the affine transformation information relative to the first stage.
  • the next video frame can be calculated from the optical flow information of the previous video frame.
  • the discriminator has two inputs: one is derived from the features extracted by the generator for the next frame of ROI, and the other is derived from the features of the ROI of the current frame based on optical flow information.
  • the discriminator determines whether the input feature is a feature of optical flow transformation (Flow Field) or a CNN feature.
• The introduction of the Discriminator prompts the generator to generate segmentation features that carry both the optical flow information and the CNN information of the current frame. Therefore, the following loss function can be used for the segmentation task branch:
• $L_{seg} = f_{dice}(F'_{CNN}, F_{CNN}) + f_{bce}(F'_{CNN}, F_{CNN}) + f_{mse}(F'_{CNN}, F'_{OF})$
• where F′_CNN and F′_OF respectively represent the second sample feature map and the optical flow feature map obtained by optical flow; F_CNN represents the reference feature map; and f_dice, f_bce, and f_mse respectively represent the Dice calculation formula, the binary cross entropy calculation formula, and the mean square error calculation formula.
• The larger f_mse is, the larger the gap between the second sample feature map and the optical flow feature map is, and the greater the penalty applied to the generator when updating its parameters, so that the generator generates feature maps that better conform to the optical flow characteristics.
• f_dice and f_bce prompt the generator to produce feature maps that better match the manually labeled information.
• Binary cross entropy is used as the loss function of the discriminator to determine whether the input is an optical flow feature.
  • the generator will eventually generate the common part of the CNN feature and the optical flow information, while the discriminator will not be able to distinguish the difference between the optical flow feature and the CNN feature.
  • the discriminator will be removed, and the generator will generate a feature map incorporating optical flow information
  • cardiac B-mode ultrasound is currently a more common early screening method.
• Clinically, the area of the left ventricle in the four-chamber view and the two-chamber view over the cardiac cycle is often used to estimate the ejection fraction with the Simpson method, as an important source of information for diagnosing cardiac function.
  • the computer-aided automatic segmentation of the left ventricle is an important basis for calculating cardiac function indicators (such as ejection fraction).
  • the boundary of the left ventricle object is blurred, and the edge is easily lost due to artifact images, which seriously affects the accuracy of segmentation.
  • the change of the left ventricle is strongly related to time, and the sudden change of the left ventricular contour caused by the prediction error can easily lead to the miscalculation of clinical indicators.
  • the implementation of ultrasound video screening places great demands on network size and real-time performance.
• The embodiments of this application provide an end-to-end video target segmentation model based on Region Affine Networks, which introduces the target structure information of the previous video frame (that is, the historical affine transformation information transmitted by the previous video frame) into the current frame to improve segmentation performance;
• Region Affine Networks is a prediction network with supervision information that can learn affine transformation information; the introduction of affine supervision information makes the prediction of affine transformation parameters more accurate.
  • the two-stage positioning network can correct the transformation error transmitted by the previous video frame twice, increase the network robustness, and reduce the segmentation error caused by the affine transformation information error.
  • the adversarial learning network based on optical flow information can promote the segmentation results to approach the gradual nature of the time series transformation during training, making the segmentation results more reasonable.
  • the entire network is end-to-end training, and each part complements each other and improves each other.
  • the introduction of target structure information reduces noise interference, reduces the difficulty of segmentation, and uses a lightweight coding network to obtain excellent segmentation results.
  • video sequence analysis and time smoothing are all concentrated in the training stage, which reduces the operation and processing of the model during use, greatly reduces the time-consuming of target segmentation, and improves efficiency.
  • the image segmentation method provided by the embodiments of the present application can be used in clinical cardiac ultrasound detection with the Simpson method to screen for heart disease, can free the hands of the physician, and reduce the repetitive labor and subjective differences caused by the physician's annotation. Due to the small network structure and good real-time performance of the target segmentation model, the end-to-end network engineering degree is high, and it is easy to migrate to mobile devices.
• The segmentation results obtained by segmenting the left ventricle in the cardiac ultrasound detection video in the embodiments of this application can be used as an automated solution for clinically measuring the ejection fraction from cardiac B-mode ultrasound combined with the Simpson method; the end-to-end network introduces timing information and target structure location information, which can obtain segmentation results that better conform to the regularities of the video; the adversarial learning network adaptively increases the smoothness of video segmentation, making the segmentation results more reasonable; and this image segmentation method realizes a lightweight network with high segmentation performance, strong real-time performance, and a high degree of engineering readiness.
  • the image segmentation method includes the following steps:
  • S1004 Extract image features of the initial video frame through the first convolutional neural network.
  • S1006 Input the image feature to the first fully connected network, process the image feature through the first fully connected network, and output affine transformation information through at least one output channel of the first fully connected network.
  • S1010 Perform affine transformation on the current frame according to the historical affine transformation information to obtain a candidate region image corresponding to the current frame.
  • S1012 Perform feature extraction on the candidate region image through the second convolutional neural network in the target segmentation model to obtain a feature map corresponding to the candidate region image; the feature map fuses the optical flow information included in the video frame sequence.
  • S1014 Up-sampling the feature map through a fully convolutional neural network to obtain an intermediate image.
  • S1016 Perform pixel-level classification on each pixel in the intermediate image through a fully convolutional neural network to obtain a category corresponding to each pixel.
  • S1020 From the intermediate image, segment a target segmentation area composed of pixels corresponding to the target category and including the target object.
  • S1022 Process the feature map through the second fully connected network, and output the affine transformation difference result through at least one output channel of the second fully connected network.
  • S1026 Use the updated affine transformation information delivered by the current frame as historical affine transformation information corresponding to a subsequent video frame in the video frame sequence.
  • the above-mentioned image segmentation method performs affine transformation on the current frame according to the historical affine transformation information transmitted by the previous video frame to obtain the candidate region image corresponding to the current frame.
  • the historical affine transformation information transmitted by the previous video frame is a modified parameter, which can greatly improve the accuracy of image acquisition of the candidate area. Semantic segmentation of the feature map corresponding to the candidate region image can accurately obtain the segmentation result corresponding to the target in the current frame.
  • the historical affine transformation information is corrected according to the feature map, and the corrected affine transformation information is transferred to the subsequent video frame for use in the subsequent video frame. In this way, the positioning of the current frame can be corrected, the error caused by the wrong positioning to the subsequent segmentation processing is reduced, and the accuracy of the semantic segmentation processing of the video is greatly improved.
  • Fig. 11 is a schematic flowchart of an image segmentation method in an embodiment. It should be understood that although the various steps in the flowchart of FIG. 11 are displayed in sequence as indicated by the arrows, these steps are not necessarily performed in sequence in the order indicated by the arrows. Unless specifically stated in this article, the execution of these steps is not strictly limited in order, and these steps can be executed in other orders. Moreover, at least part of the steps in FIG. 11 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed at the same time, but can be executed at different times. The execution of these sub-steps or stages The sequence is not necessarily performed sequentially, but may be performed alternately or alternately with at least a part of other steps or sub-steps or stages of other steps.
  • an image segmentation device 1100 which includes an acquisition module 1101, an affine transformation module 1102, a feature extraction module 1103, a semantic segmentation module 1104, and a parameter correction module 1105.
  • the obtaining module 1101 is used to obtain historical affine transformation information of the current frame and the previous video frame in the video frame sequence.
  • the affine transformation module 1102 is used to perform affine transformation on the current frame according to historical affine transformation information to obtain a candidate region image corresponding to the current frame.
  • the feature extraction module 1103 is configured to perform feature extraction on the candidate region image to obtain a feature map corresponding to the candidate region image.
  • the semantic segmentation module 1104 is used to perform semantic segmentation based on the feature map to obtain the segmentation result corresponding to the target in the current frame.
• The parameter correction module 1105 is used to correct the historical affine transformation information according to the feature map to obtain updated affine transformation information, and use the updated affine transformation information as the historical affine transformation information corresponding to the subsequent video frame in the video frame sequence.
• The acquiring module 1101 is further configured to, when the current frame is the initial video frame, acquire the initial video frame in the video frame sequence; extract the image features of the initial video frame through the first convolutional neural network; input the image features to the first fully connected network, process the image features through the first fully connected network, and output affine transformation information through at least one output channel of the first fully connected network; and use the affine transformation information as the historical affine transformation information corresponding to the initial video frame.
  • the feature map obtained by performing feature extraction on the candidate region image is fused with the optical flow information included in the video frame sequence.
  • the semantic segmentation module 1104 is also used to perform up-sampling processing on the feature map through a fully convolutional neural network to obtain an intermediate image; through the fully convolutional neural network, each pixel in the intermediate image is separately classified at the pixel level, Obtain the category corresponding to each pixel; according to the category corresponding to each pixel, output the segmentation result of semantic segmentation of the target in the current frame.
  • the semantic segmentation module 1104 is also used to determine the pixels in the intermediate image corresponding to the target category; from the intermediate image, segment the target segmentation area composed of pixels of the corresponding target category and including the target object.
• The parameter correction module 1105 is further configured to process the feature map through the second fully connected network, and output the affine transformation difference result through at least one output channel of the second fully connected network; calculate the updated affine transformation information of the current frame according to the affine transformation difference result and the historical affine transformation information of the previous video frame; and use the updated affine transformation information of the current frame as the historical affine transformation information corresponding to the subsequent video frame in the video frame sequence.
  • the feature extraction module 1103 is also used to extract features of the candidate region image through the second convolutional neural network in the target segmentation model to obtain a feature map corresponding to the candidate region image.
  • the semantic segmentation module 1104 is also used to perform semantic segmentation processing on the feature map through the fully convolutional neural network in the target segmentation model to obtain the segmentation result corresponding to the target in the current frame.
  • the parameter correction module 1105 is also used to correct the historical affine transformation information through the second fully connected network in the target segmentation model to obtain updated affine transformation information.
• The image segmentation device further includes a model training module 1106, which is used to: obtain video frame samples, sample label information corresponding to the video frame samples, and standard affine transformation information corresponding to the video frame samples; input the video frame samples into the target segmentation model for training, and obtain the predicted affine transformation information corresponding to the video frame samples through the target segmentation model; construct the affine loss function according to the predicted affine transformation information and the standard affine transformation information; output, through the target segmentation model, the predicted affine transformation difference information corresponding to the video frame samples and the predicted segmentation result corresponding to the target in the video frame samples; determine the standard affine transformation difference information according to the difference between the predicted affine transformation information and the standard affine transformation information; construct the affine transformation information correction loss function according to the standard affine transformation difference information and the predicted affine transformation difference information; determine the segmentation loss function according to the predicted segmentation result and the sample label information; and adjust the model parameters of the target segmentation model according to the affine loss function, the affine transformation information correction loss function, and the segmentation loss function, and continue training until the training stop condition is met.
  • the above-mentioned image segmentation device performs affine transformation on the current frame based on the historical affine transformation information of the previous video frame to obtain a candidate region image corresponding to the current frame.
  • the historical affine transformation information of the previous video frame is a modified parameter, which can greatly improve the accuracy of obtaining the candidate region image. Semantic segmentation of the feature map corresponding to the candidate region image can accurately obtain the segmentation result corresponding to the target in the current frame.
  • the historical affine transformation information is corrected according to the feature map, and the corrected affine transformation information is transferred to the subsequent video frame for use in the subsequent video frame. In this way, the positioning of the current frame can be corrected, the error caused by the wrong positioning to the subsequent segmentation processing is reduced, and the accuracy of the semantic segmentation processing of the video is greatly improved.
  • a model training device 1300 which includes a sample acquisition module 1301, a determination module 1302, a construction module 1303, an output module 1304, and a model parameter adjustment module 1305.
  • the sample acquisition module 1301 is used to acquire video frame samples, sample label information corresponding to the video frame samples, and standard affine transformation information corresponding to the video frame samples.
  • the determining module 1302 is configured to input the video frame samples into the target segmentation model for training, and determine the predicted affine transformation information corresponding to the video frame samples through the target segmentation model.
  • the construction module 1303 is used to construct an affine loss function according to the predicted affine transformation information and the standard affine transformation information.
  • the output module 1304 is configured to output the prediction affine transformation difference information corresponding to the video frame sample and the prediction segmentation result corresponding to the target in the video frame sample through the target segmentation model.
  • the determining module 1302 is further configured to determine standard affine transformation difference information according to the difference between the predicted affine transformation information and the standard affine transformation information.
  • the construction module 1303 is also used to construct the affine transformation information correction loss function according to the standard affine transformation difference information and the predicted affine transformation difference information.
  • the construction module 1303 is also used to determine the segmentation loss function according to the predicted segmentation result and sample label information.
• The model parameter adjustment module 1305 is used to adjust the model parameters of the target segmentation model according to the affine loss function, the affine transformation information correction loss function, and the segmentation loss function, and continue training until the training stop condition is met.
  • the sample acquisition module 1301 is also used to acquire video frame samples and corresponding sample annotation information;
  • the sample annotation information includes sample key point position information and sample area position information; according to the video frame samples, sample key point position information and The position information of the sample area determines the template image and the template key point position information corresponding to the template image; according to the sample key point position information and the template key point position information, the standard affine transformation information corresponding to the video frame sample is calculated.
• The sample acquisition module 1301 is also used to acquire a first video frame sample and a second video frame sample, the first video frame sample being the video frame before the second video frame sample, and to separately acquire sample labeling information corresponding to the first video frame sample and the second video frame sample, and standard affine transformation information corresponding to the first video frame sample.
• The determining module 1302 is also used to input the first video frame sample and the second video frame sample as a sample pair into the target segmentation model for training, and process the first video frame sample through the target segmentation model to obtain the predicted affine transformation information corresponding to the first video frame sample.
  • the output module 1304 is also used to perform affine transformation on the first video frame sample according to the predicted affine transformation information to obtain the first sample candidate area image, and perform feature extraction on the first sample candidate area image to obtain the first sample Feature map; perform semantic segmentation based on the feature map of the first sample to obtain the prediction segmentation result corresponding to the target in the first video frame sample; correct the predicted affine transformation information according to the feature map of the first sample to obtain the first video The prediction affine transformation difference information corresponding to the frame samples.
  • the model training device also includes a confrontation module 1306 for determining the corresponding optical flow information according to the first video frame sample and the second video frame sample, and determining the optical flow feature map according to the optical flow information and the first sample feature map ;
  • the optical flow feature map and the second sample feature map are used as the sample input of the discriminator in the target segmentation model, and the sample input is classified by the discriminator to obtain the predicted category of the sample input.
• The construction module 1303 is also used to construct the adversarial loss function according to the prediction category and the reference category corresponding to the sample input, and to construct the segmentation loss function according to the optical flow feature map, the second sample feature map, and the reference feature map; the reference feature map is a feature map obtained by performing feature extraction on the target in the second video frame sample.
• The model parameter adjustment module 1305 is also used to adjust the model parameters of the target segmentation model according to the affine loss function, the affine transformation information correction loss function, the adversarial loss function, and the segmentation loss function, and continue training until the training stop condition is met.
• The above model training device, on the one hand, introduces affine transformation supervision information, that is, standard affine transformation information, into the model training process to improve the accuracy of position prediction; on the other hand, the predicted affine transformation information can be corrected through training, thereby reducing segmentation errors caused by incorrect positioning.
  • affine loss function, affine transformation information correction loss function, and segmentation loss function are superimposed and optimized together, so that each part influences and improves each other during the training process, so that the trained target segmentation model has accurate video semantic segmentation performance.
  • Fig. 15 shows an internal structure diagram of a computer device in an embodiment.
  • the computer device may be the computer device in FIG. 1.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus.
  • the memory includes a non-volatile storage medium and internal memory.
  • the non-volatile storage medium of the computer device stores an operating system and may also store a computer program.
  • when the computer program is executed by the processor, it can cause the processor to implement the image segmentation method and/or the model training method.
  • a computer program may also be stored in the internal memory.
  • when this computer program is executed by the processor, it can cause the processor to execute the image segmentation method and/or the model training method.
  • FIG. 15 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer equipment to which the solution of the present application is applied.
  • the computer device may include more or fewer components than those shown in the figure, or combine certain components, or have a different arrangement of components.
  • the image segmentation device and/or model training device provided in the present application may be implemented in the form of a computer program, and the computer program may run on the computer device as shown in FIG. 15.
  • the memory of the computer device can store various program modules that make up the image segmentation device, such as the acquisition module, affine transformation module, feature extraction module, semantic segmentation module, and parameter correction module shown in FIG. 12.
  • the computer program composed of each program module causes the processor to execute the steps in the image segmentation method of each embodiment of the application described in this specification.
  • Another example is the sample acquisition module, determination module, construction module, output module, and model parameter adjustment module shown in FIG. 14.
  • the computer program composed of each program module causes the processor to execute the steps in the model training method of each embodiment of the application described in this specification.
  • the computer device shown in FIG. 15 may execute step S202 through the acquisition module in the image segmentation apparatus shown in FIG. 12.
  • the computer device may execute step S204 through the affine transformation module.
  • the computer device can execute step S206 through the feature extraction module.
  • the computer device may execute step S208 through the semantic segmentation module.
  • the computer device can execute step S210 through the parameter correction module.
  • a computer device is provided, including a memory and a processor, where the memory stores a computer program which, when executed by the processor, causes the processor to execute the steps of the image segmentation method and/or the model training method.
  • the steps of the image segmentation method and/or model training method may be steps in the image segmentation method and/or model training method of each of the foregoing embodiments.
  • a computer-readable storage medium is provided, storing a computer program.
  • when the computer program is executed by a processor, the processor is caused to execute the steps of the image segmentation method and/or the model training method.
  • the steps of the image segmentation method and/or model training method may be steps in the image segmentation method and/or model training method of each of the foregoing embodiments.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to an image segmentation method, a model training method, an apparatus, a device and a storage medium. The image segmentation method includes: acquiring a current frame in a video frame sequence and historical affine transformation information passed on by a preceding video frame; performing an affine transformation on the current frame according to the historical affine transformation information to obtain a candidate region image corresponding to the current frame; performing feature extraction on the candidate region image to obtain a feature map corresponding to the candidate region image; performing semantic segmentation based on the feature map to obtain a segmentation result corresponding to the target in the current frame; and correcting the historical affine transformation information according to the feature map to obtain updated affine transformation information, the updated affine transformation information being used as the historical affine transformation information corresponding to a subsequent video frame in the video frame sequence. The solution provided by the present application can improve the accuracy of image segmentation.

Description

图像分割方法、模型训练方法、装置、设备及存储介质
本申请要求于2019年05月29日提交的申请号为201910455150.4、发明名称为“图像分割方法和装置、模型训练方法和装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机技术领域,特别是涉及一种图像分割方法和装置、模型训练方法和装置。
背景技术
对图像或视频进行语义分割(semantic segmentation)是计算机视觉研究领域的热点之一,语义分割技术是指计算机设备将图片中属于一大类的区域都分割出来并给出其类别信息。
相关技术中对视频进行语义分割的方式,计算机设备需要对视频的每一帧进行关键点预测,得到每一帧的关键点。计算机设备通过模板,根据每一帧的关键点计算每一帧图像与模板的差异来获取变换参数,基于该变换参数进行仿射(Affine)变换得到ROI(region of interest,感兴趣区域),随后在ROI上进行目标分割。
然而在上述语义分割方式中,在后的视频帧的关键点的预测依赖于在前的视频帧的目标分割结果,首帧的预测偏差会直接导致后续一系列视频帧的定位偏移,导致对目标对象的语义分割准确性低。
发明内容
本申请提供一种图像分割方法、模型训练方法、装置、设备及存储介质,能够提高语义分割的准确性。
根据本申请的一个方面,提供了一种图像分割方法,应用于计算机设备中,所述方法包括:
获取视频帧序列中的当前帧、及在前的视频帧所传递的历史仿射变换信息;
依据所述历史仿射变换信息对所述当前帧进行仿射变换,得到与所述当前帧对应的候选区域图像;
对所述候选区域图像进行特征提取,得到所述候选区域图像对应的特征图;
基于所述特征图进行语义分割,得到所述当前帧中的目标对应的分割结果;
根据所述特征图对所述历史仿射变换信息进行修正,得到更新的仿射变换信息,并将所述更新的仿射变换信息作为所述视频帧序列中在后的视频帧所对应的历史仿射变换信息。
根据本申请的一个方面,提供了一种图像分割装置,所述装置包括:
获取模块,用于获取视频帧序列中的当前帧、及在前的视频帧所传递的历史仿射变换信息;
仿射变换模块,用于依据所述历史仿射变换信息对所述当前帧进行仿射变换,得到与所述当前帧对应的候选区域图像;
特征提取模块,用于对所述候选区域图像进行特征提取,得到所述候选区域图像对应的特征图;
语义分割模块,用于基于所述特征图进行语义分割,得到所述当前帧中的目标对应的分 割结果;
参数修正模块,用于根据所述特征图对所述历史仿射变换信息进行修正,得到更新的仿射变换信息,并将所述更新的仿射变换信息作为所述视频帧序列中在后的视频帧所对应的历史仿射变换信息。
根据本申请的一个方面,提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时,使得所述处理器执行以下步骤:
获取视频帧序列中的当前帧、及在前的视频帧的历史仿射变换信息;
依据所述历史仿射变换信息对所述当前帧进行仿射变换,得到与所述当前帧对应的候选区域图像;
对所述候选区域图像进行特征提取,得到所述候选区域图像对应的特征图;
基于所述特征图进行语义分割,得到所述当前帧中的目标对应的分割结果;
根据所述特征图对所述历史仿射变换信息进行修正,得到更新的仿射变换信息,将所述更新的仿射变换信息作为所述视频帧序列中在后的视频帧所对应的历史仿射变换信息。
根据本申请的一个方面,提供了一种计算机设备,所述计算机设备包括存储器和处理器,所述存储器存储有计算机程序,所述计算机程序被所述处理器执行时,使得所述处理器执行以下步骤:
获取视频帧序列中的当前帧、及在前的视频帧的历史仿射变换信息;
依据所述历史仿射变换信息对所述当前帧进行仿射变换,得到与所述当前帧对应的候选区域图像;
对所述候选区域图像进行特征提取,得到所述候选区域图像对应的特征图;
基于所述特征图进行语义分割,得到所述当前帧中的目标对应的分割结果;
根据所述特征图对所述历史仿射变换信息进行修正,得到更新的仿射变换信息,将所述更新的仿射变换信息作为所述视频帧序列中在后的视频帧所对应的历史仿射变换信息。
上述图像分割方法、装置、计算机可读存储介质和计算机设备,依据在前的视频帧的历史仿射变换信息,对当前帧进行仿射变换,得到与当前帧对应的候选区域图像。在前的视频帧的历史仿射变换信息是经过修正后的参数,这样可大大提高候选区域图像获取的准确性。对与候选区域图像对应的特征图进行语义分割,可以准确得到当前帧中的目标对应的分割结果。并且,根据该特征图对历史仿射变换信息进行修正,将修正后的仿射变换信息传递至在后的视频帧,以供在后的视频帧使用。这样可对当前帧的定位起到纠正作用,减少了错误定位给后续的分割处理所带来误差,大大提高了对视频进行语义分割处理的准确性。
根据本申请的一个方面,提供了一种模型训练方法,应用于计算机设备中,所述方法包括:
获取视频帧样本、所述视频帧样本对应的样本标注信息、及所述视频帧样本对应的标准仿射变换信息;
将所述视频帧样本输入至目标分割模型中进行训练,通过所述目标分割模型,确定与所述视频帧样本对应的预测仿射变换信息;
依据所述预测仿射变换信息和所述标准仿射变换信息构建仿射损失函数;
通过所述目标分割模型,输出与所述视频帧样本对应的预测仿射变换差异信息、及所述视频帧样本中目标对应的预测分割结果;
根据所述预测仿射变换信息和所述标准仿射变换信息间的差异,确定标准仿射变换差异 信息;
依据所述标准仿射变换差异信息和所述预测仿射变换差异信息,构建仿射变换信息修正损失函数;
根据所述预测分割结果和所述样本标注信息,确定分割损失函数;
依据所述仿射损失函数、所述仿射变换信息修正损失函数、及所述分割损失函数,调整所述目标分割模型的模型参数继续训练,直至满足训练停止条件时停止训练。
根据本申请的一个方面,提供了一种模型训练装置,所述装置包括:
样本获取模块,用于获取视频帧样本、所述视频帧样本对应的样本标注信息、及所述视频帧样本对应的标准仿射变换信息;
确定模块,用于将所述视频帧样本输入至目标分割模型中进行训练,通过所述目标分割模型,确定与所述视频帧样本对应的预测仿射变换信息;
构建模块,用于依据所述预测仿射变换信息和所述标准仿射变换信息构建仿射损失函数;
输出模块,用于通过所述目标分割模型,输出与所述视频帧样本对应的预测仿射变换差异信息、及所述视频帧样本中目标对应的预测分割结果;
所述确定模块还用于根据所述预测仿射变换信息和所述标准仿射变换信息间的差异,确定标准仿射变换差异信息;
所述构建模块还用于依据所述标准仿射变换差异信息和所述预测仿射变换差异信息,构建仿射变换信息修正损失函数;
所述构建模块还用于根据所述预测分割结果和所述样本标注信息,确定分割损失函数;
模型参数调整模块,用于依据所述仿射损失函数、所述仿射变换信息修正损失函数、及所述分割损失函数,调整所述目标分割模型的模型参数继续训练,直至满足训练停止条件时停止训练。
根据本申请的一个方面,提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时,使得所述处理器执行以下步骤:
获取视频帧样本、所述视频帧样本对应的样本标注信息、及所述视频帧样本对应的标准仿射变换信息;
将所述视频帧样本输入至目标分割模型中进行训练,通过所述目标分割模型,确定与所述视频帧样本对应的预测仿射变换信息;
依据所述预测仿射变换信息和所述标准仿射变换信息构建仿射损失函数;
通过所述目标分割模型,输出与所述视频帧样本对应的预测仿射变换差异信息、及所述视频帧样本中目标对应的预测分割结果;
根据所述预测仿射变换信息和所述标准仿射变换信息间的差异,确定标准仿射变换差异信息;
依据所述标准仿射变换差异信息和所述预测仿射变换差异信息,构建仿射变换信息修正损失函数;
根据所述预测分割结果和所述样本标注信息,确定分割损失函数;
依据所述仿射损失函数、所述仿射变换信息修正损失函数、及所述分割损失函数,调整所述目标分割模型的模型参数并继续训练,直至满足训练停止条件时停止训练。
根据本申请的一个方面,提供了一种计算机设备,所述计算机设备包括存储器和处理器,所述存储器存储有计算机程序,所述计算机程序被所述处理器执行时,使得所述处理器执行 以下步骤:
获取视频帧样本、所述视频帧样本对应的样本标注信息、及所述视频帧样本对应的标准仿射变换信息;
将所述视频帧样本输入至目标分割模型中进行训练,通过所述目标分割模型,确定与所述视频帧样本对应的预测仿射变换信息;
依据所述预测仿射变换信息和所述标准仿射变换信息构建仿射损失函数;
通过所述目标分割模型,输出与所述视频帧样本对应的预测仿射变换差异信息、及所述视频帧样本中目标对应的预测分割结果;
根据所述预测仿射变换信息和所述标准仿射变换信息间的差异,确定标准仿射变换差异信息;
依据所述标准仿射变换差异信息和所述预测仿射变换差异信息,构建仿射变换信息修正损失函数;
根据所述预测分割结果和所述样本标注信息,确定分割损失函数;
依据所述仿射损失函数、所述仿射变换信息修正损失函数、及所述分割损失函数,调整所述目标分割模型的模型参数并继续训练,直至满足训练停止条件时停止训练。
上述模型训练方法、装置、计算机可读存储介质和计算机设备,在模型训练过程中一方面引入仿射变换监督信息,也就是标准仿射变换信息,以提高方位预测的准确性;另一方面可通过对预测仿射变换信息进行纠正训练,从而减少错误定位带来的分割误差。训练时将仿射损失函数、仿射变换信息修正损失函数、及分割损失函数叠加一起优化,使得各个部分在训练过程中相互影响,相互提升,这样训练得到的目标分割模型具有准确的视频语义分割性能。
附图说明
图1为一个实施例中目标分割方法和/或模型训练方法的应用环境图;
图2为一个实施例中图像分割方法的流程示意图;
图3为一个实施例中视频帧序列的结构示意图;
图4为一个实施例中获取视频帧序列中的当前帧、及在前的视频帧所传递的历史仿射变换信息步骤的流程示意图;
图5为一个实施例中目标分割模型的整体框架图;
图6为一个实施例中对心脏超声检测视频中的左心室进行目标分割的目标分割模型的架构示意图;
图7为一个实施例中目标分割模型的训练步骤的流程示意图;
图8为一个实施例中模板的获取流程图;
图9为一个实施例中模型训练方法的流程示意图;
图10为一个实施例中在模型训练过程中目标分割模型的架构示意图;
图11为一个具体实施例中图像分割方法的流程示意图;
图12为一个实施例中图像分割装置的结构框图;
图13为另一个实施例中图像分割装置的结构框图;
图14为一个实施例中模型训练装置的结构框图;
图15为一个实施例中计算机设备的结构框图。
具体实施方式
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行进一步详细说明。应当理解,此处所描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。
图1为一个实施例中图像分割方法和/或模型训练方法的应用环境图。参照图1,该图像分割方法和/或模型训练方法应用于语义分割系统。该语义分割系统包括采集器110和计算机设备120。采集器110和计算机设备120可以通过网络连接,也可以通过传输线连接。计算机设备120可以是终端或服务器。其中,终端可以是台式终端或移动终端,移动终端具体可以手机、平板电脑、笔记本电脑等中的至少一种;服务器可以用独立的服务器或者是多个服务器组成的服务器集群来实现。
采集器110可以实时采集视频,将视频传输至计算机设备120,计算机设备120可以获取视频帧序列中的当前帧、及在前的视频帧所传递的历史仿射变换信息;依据历史仿射变换信息对当前帧进行仿射变换,得到与当前帧对应的候选区域图像;对候选区域图像进行特征提取,得到候选区域图像对应的特征图;基于特征图进行语义分割,得到当前帧中的目标对应的分割结果;根据特征图对历史仿射变换信息进行修正,得到更新的仿射变换信息,并将更新的仿射变换信息作为视频帧序列中在后的视频帧所对应的历史仿射变换信息。
需要说明的是,上述的应用环境只是一个示例,在一些实施例中,计算机设备120可以直接获取视频,对视频对应的视频帧序列中的各个视频帧按照上述步骤进行目标分割。
如图2所示,在一个实施例中,提供了一种图像分割方法。本实施例以该方法应用于上述图1中的计算机设备120来举例说明。参照图2,该图像分割方法包括如下步骤:
S202,获取视频帧序列中的当前帧、及在前的视频帧的历史仿射变换信息。
其中,视频帧序列是多于一帧的视频帧按照各视频帧所对应的生成时序而构成的序列。视频帧序列包括:按照生成时序排列的多个视频帧。视频帧是构成视频的基本单位,一段视频可以包括多个视频帧。视频帧序列可以是实时采集的视频帧所构成的序列,例如可以是通过采集器的摄像头实时获取的视频帧序列,也可以是存储的视频对应的视频帧序列。
当前帧是当前处理的视频帧,比如第i帧;在前的视频帧是生成时间在当前帧之前的视频帧,可以是当前帧的前一帧或当前帧的第前几帧的视频帧,也可称作当前帧的历史视频帧。
历史仿射变换信息是在前的视频帧所传递的用于当前帧进行仿射变换的仿射变换信息。此处的“在前的视频帧所传递的”可以理解为:计算机设备根据在前的视频帧所传递的,或者,在前的视频帧所对应的。仿射变换,又称仿射映射,是指对一个空间向量矩阵进行线性变换后再进行平移变换得到另一个空间向量矩阵的过程,线性变换包括卷积运算。仿射变换信息是用于进行仿射变换所需的信息,可以是仿射变换参数、或用于指示如何进行仿射变换的指令。其中,仿射变换参数是指图像进行线性变换或平移变换所需的参考参数,比如旋转角度(angle)、横轴方向的平移像素(Shift x),纵轴方向的平移像素(Shift y)以及缩放系数(Scale)等信息。
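For readers who want a concrete picture of the four affine transformation parameters mentioned above (rotation angle, horizontal shift, vertical shift, scale), the sketch below turns them into a 2x3 matrix and warps a frame with OpenCV. The patent does not fix the exact matrix construction, so the center-based rotation/scale and the pixel-unit shifts used here are assumptions for illustration only.

```python
import cv2

def apply_affine(frame, angle_deg, shift_x, shift_y, scale, out_size=None):
    """Warp `frame` with a rotation and isotropic scale about the image center
    plus a translation in pixels (one assumed parameterization of the
    (angle, shift_x, shift_y, scale) affine transformation information)."""
    h, w = frame.shape[:2]
    # 2x3 matrix combining rotation and scale about the image center.
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle_deg, scale)
    # Add the translation component.
    m[0, 2] += shift_x
    m[1, 2] += shift_y
    out_w, out_h = out_size if out_size is not None else (w, h)
    return cv2.warpAffine(frame, m, (out_w, out_h), flags=cv2.INTER_LINEAR)

# Example: roi = apply_affine(frame, angle_deg=5.0, shift_x=-12, shift_y=8, scale=1.3)
```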
计算机设备可以在对视频进行检测的过程中,获取当前帧以及在前的视频帧的历史仿射变换信息。其中,在前的视频帧的历史仿射变换信息,是指依据对在前的视频帧执行该图像分割方法时所得到的已经修正的、且用于当前帧进行仿射变换的参数。计算机设备可通过以下方式得到历史仿射变换信息:计算机设备在对在前的视频帧进行目标分割时,可根据在前的视频帧所对应的特征图,对在前的视频帧对应的仿射变换信息进行修正,得到更新的仿射 变换信息,该更新的仿射变换信息即可作为当前帧的历史仿射变换信息。
可以理解,在对整个视频帧序列进行目标分割的过程中,计算机设备对当前帧执行图像分割方法时,同样可根据当前帧的特征图对该历史仿射变换信息进行修正,得到更新的仿射变换信息,并将更新的仿射变换信息作为视频帧序列中在后的视频帧所对应的历史仿射变换信息。这样,在对视频帧序列进行目标分割时,可不断修正并传递仿射变换信息。这样可对当前帧的定位起到纠正作用,减少了错误定位给后续的分割处理所带来误差,以提高对视频进行语义分割处理的准确性。
可以理解,本申请所使用的“当前帧”用于描述当前本方法所处理的视频帧,“当前帧”是一个相对变化的视频帧,比如在处理当前帧的下一个视频帧时,则可以将该下一个视频帧作为新的“当前帧”。
在一个实施例中,计算机设备可将当前帧的前一帧所传递的历史仿射变换信息作为当前帧对应的仿射变换信息,以进行仿射变换。相应的,下一帧视频帧可将当前帧所传递的历史仿射变换信息作为下一帧对应的仿射变换信息。依次类推,每一帧视频帧均可将前一帧所传递的历史仿射变换信息作为该帧对应的仿射变换信息以进行仿射变换。
可以理解,在另一些实施例中,计算机设备还可将当前帧的前第N(N为正整数,且N大于1)帧所传递的历史仿射变换信息作为当前帧对应的仿射变换信息,以进行仿射变换。相应的,下一帧视频帧可将当前帧的前第N-1帧所传递的历史仿射变换信息作为下一帧对应的仿射变换信息。依次类推,每一帧视频帧均可将前第N帧所传递的历史仿射变换信息作为该帧对应的仿射变换信息以进行仿射变换。
举例说明,参考图3,对于视频帧序列[F1,F2,F3,F4,F5,F6],计算机设备当前所处理的当前帧为F4,那么当前帧F4可使用在前的视频帧F1所传递的历史仿射变换信息作为对应的仿射变换信息以进行仿射变换;视频帧F5可使用在前的视频帧F2所传递的历史仿射变换信息作为对应的仿射变换信息以进行仿射变换;视频帧F6可使用在前的视频帧F3所传递的历史仿射变换信息作为对应的仿射变换信息以进行仿射变换,等等依次类推。
在一个实施例中,当当前帧为初始视频帧时,步骤S202,也就是获取视频帧序列中的当前帧、及在前的视频帧的历史仿射变换信息的步骤包括以下步骤:
S402,获取视频帧序列中的初始视频帧。
其中,初始视频帧是视频帧序列中开始的视频帧。初始视频帧可以是视频帧序列中的第一帧视频帧,也可以是视频帧序列中的第N帧(比如第一个对焦稳定性达到预设条件的帧,第一个出现目标的帧),也可以是视频帧序列中最靠前的前N(N为正整数,且N大于1)帧视频帧。
可以理解,当计算机设备在执行该图像分割方法,每后一视频帧的仿射变换信息会参考前一帧视频帧的仿射变换信息时,则该初始视频帧为视频帧序列中最开始的视频帧。当计算机设备在执行该图像分割方法,计算机设备将当前帧的前第N(N为正整数,且N大于1)帧所传递的历史仿射变换信息作为当前帧对应的仿射变换信息时,则从该视频帧序列的第一帧开始至前第N帧视频帧均可称作初始视频帧。
S404,通过第一卷积神经网络提取初始视频帧的图像特征。
其中,卷积神经网络(Convolutional Neural Network,简称CNN)是一类包含卷积计算且具有深度结构的前馈神经网络(Feedforward Neural Networks)。卷积神经网络中的隐含层 内的卷积核参数共享和层间连接的稀疏性的特定,使得卷积神经网络能够以较小的计算量对格点化特征(例如像素和音频)进行学习。卷积神经网络通常包括卷积层和池化层,可对输入的图像进行卷积和池化处理,以将原始数据映射到隐层特征空间。而图像特征是通过卷积神经网络处理后所得到的能够表示该初始视频帧的图像信息的空间向量矩阵。
可选地,该图像分割方法通过目标分割模型执行,计算机设备可将视频帧序列输入至目标分割模型中,通过目标分割模型中的第一卷积神经网络对初始视频帧进行处理,提取初始视频帧中的特征,得到相应的图像特征。
S406,将图像特征输入至第一全连接网络,并通过第一全连接网络对图像特征进行处理,通过第一全连接网络的至少一个输出通道输出仿射变换信息。
全连接网络(Fully Connected Netwok)也可称作全连接层(fully connected layers,FC),全连接层在整个卷积神经网络中起到“分类器”的作用。全连接层可将卷积层和池化层所学到的图像特征映射到样本标记空间。
可选地,计算机设备可将图像特征输入至第一全连接网络,并通过该第一全连接网络对图像特征进行处理,通过第一全连接网络的至少一个输出通道输出仿射变换信息。
在一个实施例中,目标分割模型包括区域仿射网络(Region Affine Networks,简称RAN),该RAN网络包括卷积神经网络和全连接网络。可选地,计算机设备在RAN网络中输入视频帧序列中的初始视频帧,通过轻量级MobileNet-V2网络(轻量化网络)作为Generator(生成器)抽取初始视频帧的图像特征,再通过一个输出通道(channel)为4的全连接网络回归出4个仿射变换参数,这4个参数分别为旋转角度、横轴方向的平移像素,纵轴方向的平移像素以及缩放系数。
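A minimal PyTorch sketch of the region affine network described in this paragraph: a lightweight MobileNet-V2 trunk as the generator, followed by a fully connected head with four output channels regressing the rotation angle, the two shifts and the scale. The use of torchvision's `mobilenet_v2`, the global average pooling and the single linear layer are assumptions; the text only specifies a lightweight MobileNet-V2 generator and a 4-channel fully connected regression.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class RegionAffineNetwork(nn.Module):
    """Generator backbone + 4-channel fully connected regression head (sketch)."""
    def __init__(self):
        super().__init__()
        self.backbone = mobilenet_v2().features          # convolutional trunk only
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Four outputs: rotation angle, shift along x, shift along y, scale.
        self.fc = nn.Linear(1280, 4)

    def forward(self, frame):
        feat = self.backbone(frame)                      # (B, 1280, H/32, W/32)
        vec = self.pool(feat).flatten(1)                 # (B, 1280)
        return self.fc(vec)                              # (B, 4) affine parameters

# theta = RegionAffineNetwork()(torch.randn(1, 3, 224, 224))
```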
S408,将输出的仿射变换信息作为初始视频帧对应的历史仿射变换信息。
可选地,计算机设备可将第一全连接网络所输出的仿射变换信息作为该初始视频帧所对应的仿射变换信息,并依据该仿射变换信息进行仿射变换,得到与初始视频帧对应的候选区域图像。
可以理解,对于初始视频帧而言,初始视频帧并不存在与之对应的可参考的在前的视频帧,因而也没有在前的视频帧的历史仿射变换信息供其使用。
在一个实施例中,目标分割模型在训练时可引入与仿射变换信息对应的监督信息作为训练样本进行模型训练。其中,与仿射变换对应的监督信息可以是与视频帧样本对应的标准仿射变换信息。该标准仿射变换信息是指将视频帧样本转换成模板(Template)所需要的仿射变换信息。该标准仿射变换信息可通过视频帧样本所包括的样本关键点位置信息和模板所包括的模板关键点位置信息进行反射相似度计算所得到。其中,关于该模板是如何获得的、以及目标分割模型的训练过程,在后续的模型训练方法中会进行详细的介绍。
这样,通过引入与仿射变换信息对应的监督信息来训练目标分割模型,可使得该目标分割模型中的区域仿射网络学习到模板的信息,从而可准确地回归出初始视频帧相对于模板的仿射变换信息。
上述实施例中,通过卷积神经网络提取初始视频帧的图像特征,并通过第一全连接网络对图像特征进行处理,可预测出与初始视频帧对应的、且准确性更高的仿射变换信息,从而有助于提高后续处理中对目标进行分割的准确性。
在一个实施例中,当当前帧不为初始视频帧时,在缓存中读取在前的视频帧的历史仿射变换信息。
S204,依据历史仿射变换信息对当前帧进行仿射变换,得到与当前帧对应的候选区域图像。
可选地,计算机设备依据历史仿射变换信息对当前帧进行仿射变换,可以是依据仿射变换信息对当前帧中的目标所对应的位置、尺寸及方位等进行了纠正,得到对应的候选区域图像。其中,候选区域图像也可称作感兴趣区域(ROI)。
在一个实施例中,计算机设备可将视频帧序列输入至目标分割网络,通过该目标分割模型执行该图像分割方法。其中,目标分割模型是用于对视频中的目标对象进行语义分割的模型,可以是机器学习模型。该目标分割模型可包括多个网络结构,不同的网络结构包括各自网络所对应的模型参数,不同的网络结构用于执行不同的动作。
在一个实施例中,计算机设备可将视频帧序列输入至目标分割模型中,通过目标分割模型所包括的RAN网络,依据历史仿射变换信息对当前帧进行仿射变换,得到与当前帧对应的候选区域图像。
S206,对候选区域图像进行特征提取,得到候选区域图像对应的特征图。
其中,特征图又称feature map,是通过卷积神经网络对图像进行卷积和/或池化处理后所得到的空间向量矩阵,可用于表示该图像的图像信息。可选地,计算机设备可对候选区域图像进行特征提取,得到候选区域图像对应的特征图。
在一个实施例中,计算机设备可通过目标分割模型中的第二卷积神经网络,对候选区域图像进行特征提取,得到候选区域图像对应的特征图。该卷积神经网络可以是MobileNet-V2、VGG(Visual Geometry Group,视觉集合组)网络、或ResNet(deep Residual learning,深度残差学习)网络等。
其中,第二卷积神经网络可以和第一卷积神经网络共享参数,因而可以认为是相同的卷积神经网络,此处用“第一”“第二”主要是用于区分处于目标分割模型中的不同位置处、且用于处理不同数据的卷积神经网络。
在一个实施例中,对候选区域图像进行特征提取所得到的特征图,融合了视频帧序列所包括的光流信息。
其中,光流信息是图像的运动变化信息,在本申请实施例中可用于表示视频帧序列中各像素点在视频帧中移动的信息,包括视频画面中待检测目标的运动变化信息。在本申请实施例中,前一帧视频帧所对应的光流信息可通过前一帧视频帧中的各像素所对应的位置、及当前帧中各像素所对应的位置来确定。
在一个实施例中,可假定相邻两帧视频帧中目标对应的变化是较为微小的,因而当前帧中目标对象所在的目标分割区域可以通过前一帧视频帧所对应的光流信息确定。比如,当前帧中目标对象所在的目标分割区域,可根据光流信息,以及前一帧视频帧中目标对象所在的目标分割区域共同预测。
为使得目标分割模型中的卷积神经网络在对候选区域图像进行特征提取时,可以融合对应的光流信息,使得提取出的特征图融合光流信息,那么在对目标分割模型的卷积神经网络进行训练时,可引入判别器(discriminator)来共同训练。其中,生成器和判别器共同构成生成式对抗网络(Generative Adversarial Nets,GAN)。
在模型训练阶段,对于当前帧所对应的特征图有两种特征形式:一种是通过第二卷积神经网络基于当前帧所对应的候选区域图像而提取的特征图,可称作CNN特征;另一种是通过光流信息基于上一帧视频帧的特征图进行变换而得到的特征图,可称作光流特征。为此,可 设计判别器将这两种信息同时引入。也就是说,在模型训练过程中,可分别将CNN特征和光流特征中的任意一种输入至判别器中,判别器判断当前输入的特征是属于光流特征还是CNN特征。通过不断调整第二卷积神经网络的参数和判别器的参数,使得判别器无法分辨CNN特征和光流特征的区别,那么此时的第二卷积神经网络就可以生成融合了光流信息的特征图。其中,关于判别器和第二卷积神经网络之间更详细的训练过程,在后续模型训练阶段的实施例中将会有详细的描述。
上述实施例中,对候选区域图像进行特征提取所得到的特征图融合了视频帧序列所包括的光流信息,可避免分割结果出现误差,从而产生具有时序渐进性的合理分割结果。
S208,基于特征图进行语义分割,得到当前帧中的目标对应的分割结果。
其中,语义分割是指计算机设备将图片中属于一大类的区域都分割出来并给出其类别信息。分割结果可以是当前帧中属于目标对象的像素点构成的目标分割区域。
可选地,计算机设备可对特征图进行像素维度的检测,也就是基于候选区域图像所对应的特征图,对候选区域图像中每个像素进行检测,输出当前帧中的目标对应的检测结果。在一个实施例中,计算机设备可识别候选区域图像中各个像素各自所对应的类别,根据对应目标类别的各像素点构成目标区域。也就是将目标对象从候选区域图像中区分开来。
在一个实施例中,计算机设备可以通过目标分割模型中的全卷积神经网络对候选区域图像特征进行语义分割,输出当前帧中的目标对应的检测结果。
在一个实施例中,步骤S208,也就是基于特征图进行语义分割,得到当前帧中的目标对应的分割结果的步骤包括:通过全卷积神经网络对特征图进行上采样处理,得到中间图像;通过全卷积神经网络对中间图像中的各像素分别进行像素级分类,得到各像素所对应的类别;依据各像素所对应的类别,输出对当前帧中的目标进行语义分割的分割结果。
其中,全卷积神经网络(Fully Convolutional Networks,简称FCN)通常用于对输入图像进行逐像素分类。全卷积神经网络通常可采用反卷积层对最后一个卷积层的feature map进行上采样(Upsample),使它恢复到输入图像相同的尺寸,从而可以对每个像素都产生了一个预测,同时保留了原始输入图像中的空间信息,最后在上采样的特征图上进行逐像素分类。
像素级是指像素维度;像素级分类是指在像素维度上进行分类处理,是一种精细的分类方式。对中间图像中的各像素分别进行像素级分类,也可称作对中间图像进行像素级的分类,是对中间图像中的每个像素都产生一个预测,进而得到中间图像中每个像素各自所对应的类别。
可选地,计算机设备可通过目标分割模型中的全卷积神经网络对当前帧所对应的特征图进行上采样处理,得到中间图像,通过全卷积神经网络对中间图像中的各像素分别进行像素级分类,得到各像素所对应的类别。比如若候选区域图像中属于目标对象的像素点的类别为1,不属于目标对象的像素点的类别为0,则候选区域图像所有类别为1的像素点所构成的区域为目标分割区域,据此可将目标区域从候选区域图像中分割出来。比如通过红色或绿色突出显示目标分割区域。
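A hedged sketch of the segmentation branch described above: the ROI feature map is upsampled back to the candidate-region resolution, every pixel gets a foreground probability, and thresholding yields the binary target region. The two-step bilinear upsampling, the channel sizes and the single foreground channel are illustrative choices rather than the exact fully convolutional network of the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentationHead(nn.Module):
    """Upsample the ROI feature map and classify every pixel (sketch)."""
    def __init__(self, in_channels=1280):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, 64, kernel_size=3, padding=1)
        self.classify = nn.Conv2d(64, 1, kernel_size=1)   # one foreground logit per pixel

    def forward(self, feat, out_size):
        x = F.relu(self.reduce(feat))
        x = F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)
        x = self.classify(x)
        x = F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)
        return torch.sigmoid(x)                            # per-pixel foreground probability

# prob = SegmentationHead()(feat, out_size=(256, 256))
# mask = prob > 0.5   # pixels of the target class form the target segmentation region
```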
在一个实施例中,依据各像素所对应的类别,输出对当前帧中的目标进行语义分割的分割结果的步骤包括:确定中间图像中对应目标类别的像素;从中间图像中,分割出由对应目标类别的各像素所组成的、且包括目标对象的目标分割区域。
在一个实施例中,在对该目标分割模型的全卷积神经网络进行训练时,可依据视频帧样本、及对视频帧样本中的目标对象进行标注的样本标注信息来训练该全卷积神经网络,训练 得到的该全卷积神经网络具备对像素进行分类的能力。其中,对视频帧样本中的目标对象进行标注的样本标注信息,可以是将对应目标对象的像素标记为“1”,将其他的像素标记为“0”,以此来区分目标对象个非目标对象。
在一个实施例中,计算机设备可通过目标分割模型中的全卷积神经网络确定中间图像中对应目标类别的像素。并对属于目标类别的像素进行标注,比如将标注成红色或绿色等,以此从中间图像中,分割出由对应目标类别的各像素所组成的、且包括目标对象的目标分割区域。这样可实现在当前帧中准确地定位到目标对象,并可以准确地确定目标对象在当前帧中所占的面积大小。
在一个实施例中,计算机设备可以根据每一个视频帧的检测结果在视频帧中分割显示目标对象,以实现在连续的视频帧构成的视频中对目标进行自动分割的效果。
上述实施例中,通过全卷积神经网络对特征图进行像素级分类,可得到各像素各自所对应的类别,从而依据各像素所对应的类别,可从像素级别准确地确定出当前帧中的目标所在的目标分割区域,大大提高了对目标对象的分割能力。
S210,根据特征图对历史仿射变换信息进行修正,得到更新的仿射变换信息,将更新的仿射变换信息作为视频帧序列中在后的视频帧所对应的历史仿射变换信息。
其中,对历史仿射变换信息进行修正是指调整历史仿射变换参数,得到更新的仿射变换参数。可选地,计算机设备可根据特征图对历史仿射变换信息进行修正,得到更新的仿射变换信息,该更新的仿射变换信息可作为视频帧序列中在后的视频帧所对应的仿射变换信息。
在一个实施例中,计算机设备可通过目标分割模型所包括的第二全连接网络,对当前帧所对应的特征图进行处理,对该仿射变换信息进行修正,得到更新的仿射变换信息。
在一个实施例中,该目标分割模型所包括的第二全连接网络,可被训练成输出仿射变换差异结果,再依据仿射变换差异结果和在前的视频帧所传递的历史仿射变换信息,计算得到当前帧所传递的更新的仿射变换信息。计算机设备则可直接将该更新的仿射变换信息传递至在后的视频帧,供在后的视频帧进行仿射变换使用。
在一个实施例中,步骤S210,也就是根据特征图对历史仿射变换信息进行修正,得到更新的仿射变换信息,并将更新的仿射变换信息作为视频帧序列中在后的视频帧所对应的历史仿射变换信息的步骤包括以下步骤:通过第二全连接网络,对特征图进行处理,通过第二全连接网络的至少一个输出通道输出仿射变换差异结果;依据仿射变换差异结果和在前的视频帧所传递的历史仿射变换信息,计算得到当前帧所传递的更新的仿射变换信息;将当前帧所传递的更新的仿射变换信息,作为视频帧序列中在后的视频帧所对应的历史仿射变换信息。
其中,第二全连接网络和第一全连接网络是相同的全连接网络,或者,是不同的全连接网络。其中,相同的全连接网络是指第一全连接网络和第二全连接网络的参数共享;不同的全连接网络是指第一全连接网络和第二全连接网络的具有各自的模型参数。
可选地,该目标分割模型所包括的第二全连接网络,可被训练成输出仿射变换差异结果。在这种情况下,可通过目标分割模型中的第二全连接网络对与当前帧对应的特征图进行处理,回归出仿射变换差异结果。可选地,该差异结果是进行归一化处理后的差异率。
进一步地,计算机设备可依据仿射变换差异结果和在前的视频帧所传递的历史仿射变换信息,计算得到当前帧所传递的更新的仿射变换信息。比如,当仿射变换信息为仿射变换参数时,计算机设备可通过以下公式计算得到更新的仿射变换信息:
（原文此处为公式及符号图像 PCTCN2020092356-appb-000001 至 appb-000004，未在文本中复现。）该公式所涉及的量分别为：当前帧所传递的更新的仿射变换参数；仿射变换差异结果；以及当前帧所对应的仿射变换参数，也就是在前的视频帧所传递的历史仿射变换参数。即更新的仿射变换参数由仿射变换差异结果与当前帧所对应的仿射变换参数计算得到。
进而,计算机设备可将计算得到的更新的仿射变换信息,作为当前帧所传递的历史仿射变换信息,也就是将该更新的仿射变换信息传递至视频帧序列中在后的视频帧,以供在后的视频帧依据该更新的仿射变换信息进行仿射变换。
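The correction step in this and the preceding paragraphs can be sketched as a small fully connected head plus a parameter update. Because the exact update formula is only given as an image in the source, the multiplicative rate-style composition below is an explicit assumption (an additive update would be the obvious alternative); the pooling and layer sizes are likewise illustrative.

```python
import torch
import torch.nn as nn

class AffineCorrectionHead(nn.Module):
    """Second fully connected network: regresses a 4-channel affine transformation
    difference result from the ROI feature map (sketch)."""
    def __init__(self, in_channels=1280):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_channels, 4)

    def forward(self, feat):
        return self.fc(self.pool(feat).flatten(1))         # (B, 4) difference result

def update_affine(current_theta, delta_theta):
    # ASSUMPTION: treat the difference result as a normalized rate and compose it
    # multiplicatively with the current parameters; the original formula is not
    # reproduced in the text, so this is only one plausible reading.
    return current_theta * (1.0 + delta_theta)
```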
可以理解,当目标分割模型所包括的第二全连接网络被训练成输出仿射变换差异结果时,该第二全连接网络在训练过程中的监督信息可以是标准仿射变换信息和当前帧所对应的仿射变换信息的差异信息。
上述实施例中,通过第二全连接网络对特征图进行处理,以纠正当前帧所使用的仿射变换信息,得到更新的仿射变换信息。更新的仿射变换信息用于向后传递,这样可对当前帧的定位起到纠正作用,减少了错误定位带来的分割误差。
在一个实施例中,该目标分割模型所包括的第二全连接网络可被训练成输出经纠正过的更新的仿射变换信息。计算机设备则可直接将该更新的仿射变换信息传递至在后的视频帧,供在后的视频帧进行仿射变换使用。
可以理解,当目标分割模型所包括的第二全连接网络被训练成输出经纠正过的更新的仿射变换信息时,该第二全连接网络在训练过程中的监督信息可以是当前帧所对应的标准仿射变换信息。
上述图像分割方法,依据在前的视频帧所传递的历史仿射变换信息,对当前帧进行仿射变换,得到与当前帧对应的候选区域图像。在前的视频帧所传递的历史仿射变换信息是经过修正后的参数,这样可大大提高候选区域图像获取的准确性。对与候选区域图像对应的特征图进行语义分割,可以准确得到当前帧中的目标对应的分割结果。并且,根据该特征图对历史仿射变换信息进行修正,将修正后的仿射变换信息传递至在后的视频帧,以供在后的视频帧使用。这样可对当前帧的定位起到纠正作用,减少了错误定位给后续的分割处理所带来误差,大大提高了对视频进行语义分割处理的准确性。
在一个实施例中,该图像分割方法通过目标分割模型执行,该图像分割方法包括以下步骤:获取视频帧序列中的当前帧、及在前的视频帧的历史仿射变换信息;通过目标分割模型中的区域仿射网络,依据历史仿射变换信息对当前帧进行仿射变换,得到与当前帧对应的候选区域图像;通过目标分割模型中的第二卷积神经网络,对候选区域图像进行特征提取,得到候选区域图像对应的特征图;通过目标分割模型中的全卷积神经网络,对特征图进行语义分割处理,得到当前帧中的目标对应的分割结果;通过目标分割模型中的第二全连接网络对历史仿射变换信息进行修正,得到更新的仿射变换信息,并将更新的仿射变换信息作为视频帧序列中在后的视频帧所对应的历史仿射变换信息。
这样,通过已训练好的目标分割模型可自动化、且准确的分割出视频中的目标对象,具有极强的实时性。并且端到端网络工程化程度高,极易迁移到移动设备中,自适应能力高。
如图5所示,为一个实施例中目标分割模型的整体框架图。参照图5,整体框架图包括区域仿射网络(RAN)510、第二卷积神经网络(generator)520、全卷积神经网络530以及第二全连接网络540。其中,区域仿射网络510包括第一卷积神经网络(generator)512和第一全连接网络514。
在对视频中的目标对象进行目标分割时,按帧输入视频帧序列中的各个视频帧,若当前帧为初始视频帧,则通过第一卷积神经网络512对初始视频帧进行特征提取,得到图像特征,并将图像特征输入至第一全连接网络514中回归出当前的仿射变换信息。通过区域仿射网络510依据当前的仿射变换信息对初始视频帧进行仿射变换,得到对应的候选区域图像(ROI)。再通过第二卷积神经网络520对候选区域图像进行特征提取,得到候选区域图像对应的特征图。该特征图进入两个任务分支,在分割任务分支中,通过全卷积神经网络530进行上采样处理后得到分割预测图,输出分割结果;在定位任务分支中,通过第二全连接网络回归出仿射变换差异结果。再依据仿射变换差异结果纠正当前帧所对应的仿射变换信息,得到更新的仿射变换信息,将该更新的仿射变换信息传递至下一帧。
如图5所示,在下一帧视频帧中,RAN网络依据更新的仿射变换信息对下一帧视频帧进行仿射变换,得到下一帧视频帧所对应的ROI区域,并通过第二卷积神经网络520对候选区域图像进行特征提取,得到候选区域图像对应的特征图。该特征图进入两个任务分支,在分割任务分支中,通过全卷积神经网络530进行上采样处理后得到分割预测图,输出分割结果;在定位任务分支中,通过第二全连接网络回归出仿射变换差异结果。再依据仿射变换差异结果纠正下一帧视频帧帧所对应的仿射变换信息,得到更新的仿射变换信息,将该更新的仿射变换信息传递至在后的视频帧。依次类推,最终实现对视频中的目标进行分割的效果。
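Putting the pieces of this walkthrough together, a per-frame inference loop might look like the sketch below: the initial frame gets its affine information from the region affine network, and every frame is warped, encoded, segmented and corrected, with the updated affine information passed to the next frame. All callables are placeholders for the corresponding sub-networks, so this is a sketch of the data flow rather than the patented implementation.

```python
import torch

@torch.no_grad()
def segment_video(frames, ran, warp, generator, seg_head, correction_head, update_affine):
    """Frame-by-frame inference following the Figure 5 description (sketch)."""
    results, theta = [], None
    for frame in frames:
        if theta is None:                         # initial video frame
            theta = ran(frame)                    # regress its affine information directly
        roi = warp(frame, theta)                  # affine transform -> candidate region image
        feat = generator(roi)                     # feature map of the candidate region
        results.append(seg_head(feat, out_size=roi.shape[-2:]))   # segmentation result
        delta = correction_head(feat)             # affine transformation difference result
        theta = update_affine(theta, delta)       # corrected info carried to the next frame
    return results
```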
在一个实施例中,视频帧序列属于对生物组织进行医学检测得到的检测视频,比如可以是心脏超声检测视频。视频帧中的目标为左心室,检测结果为分割出视频帧中的左心室。
如图6所示,图6为一个实施例中对心脏超声检测视频中的左心室进行目标分割的架构示意图。在该示意图中,前一帧为t-1帧;当前帧为t帧。参照图6,对心脏超声检测视频进行目标分割,对于前一帧视频帧,可通过RAN网络中的生成器和全连接网络生成预测的仿射变换信息,再依据该仿射变换信息进行仿射变换,得到前一帧视频帧的候选区域图像ROI。再通过生成器提取图像特征后分别进入分割任务分支和定位任务分支,得到分割结果t-1、及仿射变换差异参数。该仿射变换差异参数传递到当前帧,区域仿射网络依据仿射变换差异参数和预测的仿射变换信息来对当前帧进行仿射变换,得到候选区域图像ROI。再通过生成器提取图像特征后分别进入分割任务分支和定位任务分支,得到分割结果t、及仿射变换差异参数。以此循环类推,从而实现了对心脏超声检测视频中的左心室进行标注分割。（原文中相应的符号与公式以图像 PCTCN2020092356-appb-000005 至 appb-000012 给出,未在文本中复现。）
参考图7,在一个实施例中,该图像分割方法通过目标分割模型执行,该目标分割模型的训练步骤包括:
S602,获取视频帧样本、视频帧样本对应的样本标注信息、及视频帧样本对应的标准仿射变换信息。
其中,视频帧样本、视频帧样本对应的样本标注信息、及视频帧样本对应的标准仿射变换信息为训练数据。视频帧样本对应的样本标注信息可以是对视频帧样本中的关键点进行标注的样本关键点位置信息、及对视频帧样本中的目标对象进行标注的样本区域位置信息。其中,视频帧样本中的关键点是用于确定目标对象的关键点,关键点的数量可以是3个、4个或其他数量等。
以心脏超声检测视频为例,视频帧序列中的目标对象为左心室,那么相应的视频帧样本中的关键点可以是左心室尖端、及左心室二尖瓣膜两端,样本关键点位置信息可以是左心室 尖端、及左心室二尖瓣膜两端对位置信息;样本区域位置信息可以是视频帧样本中左心室所在的区域的位置信息。
标准仿射变换信息是视频帧样本相对于模版的仿射变换信息,也就是说视频帧样本依据该标准仿射变换信息,可进行仿射变换得到模板。其中,模板是依据多个视频帧样本统计出的可以代表标准视频帧的图像。
在一个实施例中,步骤S602,也就是获取视频帧样本、视频帧样本对应的样本标注信息、及视频帧样本对应的标准仿射变换信息对步骤包括以下步骤:获取视频帧样本和相应的样本标注信息;样本标注信息包括样本关键点位置信息和样本区域位置信息;根据视频帧样本、样本关键点位置信息和样本区域位置信息,确定模板图像及模板图像对应的模板关键点位置信息;根据样本关键点位置信息和模板关键点位置信息,计算得到与视频帧样本对应的标准仿射变换信息。
可选地,计算机设备可从本地或其他计算机设备处获取多个视频帧样本。并对该视频帧样本采用人工标注或机器标注的方式标注出样本关键点和目标对象在视频帧样本中的位置区域。
进而计算机设备可根据多个包括样本标注信息的视频帧样本,确定模板、以及模板中的模板关键点位置信息。可选地,计算机设备可对多个视频帧样本中的关键点位置信息求平均后得到模板关键点位置信息。
比如,计算机设备可依据对每个视频帧样本中的关键点确定包括有目标对象的区域框,将该区域框外扩一定的范围,得到这个视频帧样本的ROI。再计算所有视频帧样本对应的ROI的平均尺寸,并将所有视频帧样本对应的ROI调整到平均尺寸。对所有调整到平均尺寸的ROI图像求平均即可得到模板。各个ROI图像中的关键点的位置信息求平均即可得到模板的关键点位置信息。
下面以心脏超声检测视频为例详细说明书模板的获取步骤,参考图8,图8为一个实施例中模板的获取流程图。如图8所示,计算机设备可预先通过采集器采集多种标准的心脏切面,比如A2C(apical-2-chamber,A2C,二腔切面)、A3C(apical-3-chamber,A3C,三腔切面)、A4C(apical-4-chamber,A4C,四腔切面)、A5C(apical-5-chamber,A5C,五腔切面)等作为原始图片,也就是作为视频帧样本,再将每张切面图中的3个关键点紧密外扩得到区域框,考虑到各种标准切面中的左心室都在右上方位置,为了获得更多心脏结构信息,可将区域框往左边、往下边各外扩一定比例,比如长宽的50%。最后,区域框四周在这个框基础上外扩一定比例,比如长宽的5%,得到这张切面图的ROI。所有切面图的ROI调整尺寸到一个尺度(该尺寸为所有ROI的平均尺寸),求平均则得到模板。
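The template construction just described can be sketched as follows: expand the tight keypoint box to the left and downward by 50% of its width/height, pad about 5% on all sides, crop each sample's ROI, resize all ROIs to their mean size and average them. The percentages come from the example above; the cropping details, the `keypoints` array layout (N x 2 pixel coordinates) and the use of NumPy/OpenCV are assumptions.

```python
import numpy as np
import cv2

def keypoint_roi(image, keypoints, expand_left=0.5, expand_down=0.5, margin=0.05):
    """ROI around the annotated keypoints, expanded left/down and padded (sketch)."""
    x0, y0 = keypoints.min(axis=0)
    x1, y1 = keypoints.max(axis=0)
    w, h = x1 - x0, y1 - y0
    x0 -= expand_left * w                     # extra context to the left ...
    y1 += expand_down * h                     # ... and below the left ventricle
    bw, bh = x1 - x0, y1 - y0
    x0, x1 = x0 - margin * bw, x1 + margin * bw
    y0, y1 = y0 - margin * bh, y1 + margin * bh
    x0, y0 = max(int(x0), 0), max(int(y0), 0)
    x1, y1 = min(int(x1), image.shape[1]), min(int(y1), image.shape[0])
    return image[y0:y1, x0:x1]

def build_template(images, keypoints_list):
    """Average all ROIs (resized to their mean size) to obtain the template."""
    rois = [keypoint_roi(img, kps) for img, kps in zip(images, keypoints_list)]
    mean_h = int(np.mean([r.shape[0] for r in rois]))
    mean_w = int(np.mean([r.shape[1] for r in rois]))
    resized = [cv2.resize(r, (mean_w, mean_h)) for r in rois]
    return np.mean(np.stack(resized).astype(np.float32), axis=0)
```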
进一步地,计算机设备可依据各个视频帧样本的尺寸、关键点位置信息,以及模板的尺寸、模板关键点位置信息,进行反射相似度计算,得到变换矩阵,该变换矩阵中包括仿射变换信息,通过该方法计算得到的仿射变换信息即为与该视频帧样本对应的标准仿射变换信息。
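One way to obtain the standard affine transformation information described here is to fit a similarity transform between the sample keypoints and the template keypoints and decompose it into (angle, shift_x, shift_y, scale). The sketch below uses OpenCV's `estimateAffinePartial2D`; the patent only states that a similarity computation over the two keypoint sets yields the transformation matrix, so this particular estimator and decomposition are assumptions.

```python
import cv2
import numpy as np

def standard_affine_from_keypoints(sample_kps, template_kps):
    """Fit sample keypoints onto template keypoints (both N x 2 arrays) and return
    the standard affine parameters (angle in degrees, shift_x, shift_y, scale)."""
    m, _ = cv2.estimateAffinePartial2D(sample_kps.astype(np.float32),
                                       template_kps.astype(np.float32))
    # m = [[s*cos(a), -s*sin(a), tx], [s*sin(a), s*cos(a), ty]]
    scale = float(np.hypot(m[0, 0], m[1, 0]))
    angle = float(np.degrees(np.arctan2(m[1, 0], m[0, 0])))
    return angle, float(m[0, 2]), float(m[1, 2]), scale
```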
上述实施例中,根据视频帧样本、样本关键点位置信息和样本区域位置信息,可确定模板图像及模板图像对应的模板关键点位置信息。从而可将每张视频帧样本均与模板进行比较,以确定标准仿射变换信息,该标准仿射变换信息可作为后续模型训练的监督信息,用以使得目标分割模型可学习到模板的信息,从而大大提高仿射变换信息的预测准确性。
S604,将视频帧样本输入至目标分割模型中进行训练,通过目标分割模型,确定与视频帧样本对应的预测仿射变换信息。
可选地,计算机设备可将视频帧样本输入到目标分割模型中,根据目标分割模型执行前述的图像分割方法,通过RAN网络获取与视频帧样本对应的预测仿射变换信息。
S606,依据预测仿射变换信息和标准仿射变换信息构建仿射损失函数。
其中,仿射损失函数用于评估预测仿射变换信息和标准仿射变换信息之间的差异程度。仿射损失函数承担了训练得到好的RAN网络的责任,使目标分割模型中的RAN网络可以生成相对于模板来说准确的仿射变换信息,这样引入仿射监督信息的使得仿射参数预测更加准确。
可选地,计算机设备可依据预测仿射变换信息和标准仿射变换信息构建仿射损失函数。在一个实施例中,计算机设备可通过距离函数,比如L1-Norm(L1-范数,又称曼哈顿距离)函数来计算预测仿射变换信息和标准仿射变换信息的损失,也就是基于L1-Norm函数来构建预测仿射变换信息和标准仿射变换信息的仿射损失函数。可以理解,在本申请实施例中,也可采用其他的函数来构建损失函数,只要该函数可以用来衡量预测仿射变换信息和标准仿射变换信息之间的差异程度即可,比如L2-Norm(又称欧几里德距离)函数等。
S608,通过目标分割模型,输出与视频帧样本对应的预测仿射变换差异信息、及视频帧样本中目标对应的预测分割结果。
可选地,计算机设备可将视频帧样本输入到目标分割模型中,根据目标分割模型执行前述的图像分割方法,输出与视频帧样本对应的预测仿射变换差异信息、及视频帧样本中目标对应的预测分割结果。
在一个实施例中,计算机设备可通过目标分割模型中的RAN网络,依据预测仿射变换信息对视频帧样本进行仿射变换,得到对应的样本候选区域图像。通过目标分割模型中的第二卷积神经网络并对样本候选区域图像进行特征提取,得到对应的样本特征图。通过目标分割模型中的全卷积神经网络,对样本特征图进行语义分割,得到视频帧样本中的目标对应的预测分割结果。通过目标分割模型中的第二全连接网络,基于样本特征图对预测仿射变换信息进行修正,得到与视频帧样本对应的预测仿射变换差异信息。
S610,根据预测仿射变换信息和标准仿射变换信息间的差异,确定标准仿射变换差异信息。
其中,标准仿射变换差异信息是作为目标分割模型中仿射变换修正模块的监督信息,也就是作为第二全连接网络在训练过程中的监督信息。可选地,计算机设备可根据预测仿射变换信息和标准仿射变换信息间的差异,确定标准仿射变换差异信息。比如,当仿射变换信息为仿射变换参数时,计算机设备可通过以下公式计算标准仿射变换差异信息:
（原文此处为公式及符号图像 PCTCN2020092356-appb-000013 至 appb-000015，未在文本中复现。）该公式所涉及的量分别为：标准仿射变换差异参数；当前帧所对应的仿射变换参数，也就是预测仿射变换参数；以及 θ_t，表示标准仿射变换参数。即标准仿射变换差异参数由标准仿射变换参数与预测仿射变换参数之间的差异计算得到。
S612,依据标准仿射变换差异信息和预测仿射变换差异信息,构建仿射变换信息修正损失函数。
其中,仿射变换信息修正损失函数用于评估预测仿射变换差异信息和标准仿射变换差异信息之间的差异程度。仿射变换信息修正损失函数承担了训练得到好的第二全连接网络的责任,使目标分割模型中的第二全连接网络可以生成对预测仿射变换信息进行修正后的仿射变换差异信息。
可选地,计算机设备可依据标准仿射变换差异信息和预测仿射变换差异信息,构建仿射变换信息修正损失函数。在一个实施例中,计算机设备可通过距离函数,比如L1-Norm函数来计算标准仿射变换差异信息和预测仿射变换差异信息的损失,也就是基于L1-Norm函数来构建仿射变换信息修正损失函数。可以理解,在本申请实施例中,也可采用其他的函数来构建仿射变换信息修正损失函数,只要该函数可以用来衡量标准仿射变换差异信息和预测仿射变换差异信息之间的差异程度即可,比如L2-Norm函数等。
可以理解,该预测仿射变换差异信息用于确定更新的仿射变换信息,并传递至视频帧序列中在后的视频帧。当仿射变换信息为仿射变换参数时,可通过以下公式计算更新的仿射变换参数:
（原文此处为公式及符号图像 PCTCN2020092356-appb-000016 至 appb-000019，未在文本中复现。）该公式所涉及的量分别为：当前帧所传递的更新的仿射变换参数；预测仿射变换差异参数；以及预测仿射变换参数。
S614,根据预测分割结果和样本标注信息,确定分割损失函数。
其中,分割损失函数用于评估预测分割结果和样本标注信息之间的差异程度。分割损失函数承担了训练得到好的全卷积神经网络的责任,使目标分割模型中的全卷积神经网络可以准确地从输入的视频帧中分割出目标对象。可选地,计算机设备可根据预测分割结果和样本标注信息,确定分割损失函数。
S616,依据仿射损失函数、仿射变换信息修正损失函数、及分割损失函数,调整目标分割模型的模型参数并继续训练,直至满足训练停止条件时停止训练。
其中,训练停止条件是结束模型训练的条件。训练停止条件可以是达到预设的迭代次数,或者是调整模型参数后的目标分割模型的性能指标达到预设指标。调整目标分割模型的模型参数,是对目标分割模型的模型参数进行调整。
可选地,计算机设备可依据仿射损失函数、仿射变换信息修正损失函数、及分割损失函数,共同调整目标分割模型中各个网络结构的模型参数并继续训练,直至满足训练停止条件时停止训练。
可以理解,对于每个损失函数,计算机设备可朝着减小相应的预测结果和参考参数之间的差异的方向,调整模型参数。这样,通过不断的输入视频帧样本,得到预测仿射变换信息、预测仿射变换差异信息、及预测分割结果,根据预测仿射变换信息与标准仿射变换信息之间的差异、预测仿射变换差异信息与标准仿射变换差异信息之间的差异、及预测分割结果和样本标注信息之间的差异调整模型参数,以训练目标分割模型,得到训练好的目标分割模型。
上述实施例中,在模型训练过程中一方面引入仿射变换监督信息,也就是标准仿射变换信息,以提高方位预测的准确性;另一方面可通过对预测仿射变换信息进行纠正训练,从而减少错误定位带来的分割误差。训练时将仿射损失函数、仿射变换信息修正损失函数、及分割损失函数叠加一起优化,使得各个部分在训练过程中相互影响,相互提升,这样训练得到的目标分割模型具有准确的视频语义分割性能。
参考图9,在一个实施例中,该模型训练方法包括以下步骤:
S802,获取第一视频帧样本和第二视频帧样本;第一视频帧样本为第二视频帧样本在前的视频帧。
其中,第一视频帧样本和第二视频帧样本是不同的视频帧样本。第一视频帧样本为第二视频帧样本在前的视频帧,也就是说第一视频帧样本的生成时间在第二视频帧之前。在一个实施例中,第一视频帧样本和第二视频帧样本可以是相邻的视频帧。
S804,分别获取与第一视频帧样本及第二视频帧样本各自对应的样本标注信息、及与第一视频帧样本对应的标准仿射变换信息。
可选地,计算机设备可分别获取与第一视频帧样本及第二视频帧样本各自对应的样本标注信息、及与第一视频帧样本对应的标准仿射变换信息。其中,样本标注信息可包括样本关键点位置信息和样本区域位置信息。标准仿射变换信息的获取步骤可参考前述实施例中所描述的获取步骤。
S806,将第一视频帧样本和第二视频帧样本作为样本对输入至目标分割模型中进行训练,通过目标分割模型对第一视频帧样本进行处理,得到与第一视频帧样本对应的预测仿射变换信息。
可选地,参考图10,图10为一个实施例中在模型训练过程中目标分割模型的架构示意图。如图10所示,计算机设备可将相邻的前后两帧视频帧样本作为样本对输入至目标分割模型中。通过目标分割模型对第一视频帧样本进行处理,得到与第一视频帧样本对应的预测仿射变换信息（原文中以符号图像 PCTCN2020092356-appb-000020 表示）。
S808,依据预测仿射变换信息和标准仿射变换信息构建仿射损失函数。
可选地,计算机设备可依据预测仿射变换信息和标准仿射变换信息构建仿射损失函数。在一个实施例中,计算机设备可通过距离函数,比如L1-Norm函数来计算预测仿射变换信息和标准仿射变换信息的损失,也就是基于L1-Norm函数来构建预测仿射变换信息和标准仿射变换信息的仿射损失函数。可以理解,在本申请实施例中,也可采用其他的函数来构建损失函数,只要该函数可以用来衡量预测仿射变换信息和标准仿射变换信息之间的差异程度即可,比如L2-Norm函数等。
S810,依据预测仿射变换信息对第一视频帧样本进行仿射变换,得到第一样本候选区域图像,并对第一样本候选区域图像进行特征提取,得到第一样本特征图。
可选地,参考图10上半部分,计算机设备可依据预测仿射变换信息对第一视频帧样本进行仿射变换,得到第一样本候选区域图像,并通过Generator(生成器,可通过卷积神经网络实现)对第一样本候选区域图像进行特征提取,得到与第一视频帧样本对应的第一样本特征图。
S812,基于第一样本特征图进行语义分割,得到第一视频帧样本中的目标对应的预测分割结果。
可选地,参考图10,该第一样本特征图进行两个任务分支,其中一个任务分支是分割任务分支。目标分割模型可通过全卷积神经网络对第一样本特征图进行语义分割处理,通过全卷积神经网络进行两次上采样处理后,基于各个像素预测,得到第一视频帧样本中的目标对应的预测分割结果。
S814,根据第一样本特征图对预测仿射变换信息进行修正,得到与第一视频帧样本对应的预测仿射变换差异信息。
可选地,参考图10,第二个任务分支就是定位任务分支,在定位任务分支中,第一样本特征图通过channel为4的全连接层回归出新的仿射变换差异参数,也就是预测仿射变换差异信息。
S816,根据预测仿射变换信息和标准仿射变换信息间的差异,确定标准仿射变换差异信息。
可选地,计算机设备可根据预测仿射变换信息和标准仿射变换信息间的差异,确定标准仿射变换差异信息;其中,θ_t 表示标准仿射变换参数。（原文此处的计算公式以图像 PCTCN2020092356-appb-000021 给出,未在文本中复现。）
S818,依据标准仿射变换差异信息和预测仿射变换差异信息,构建仿射变换信息修正损失函数。
可选地,计算机设备可依据标准仿射变换差异信息和预测仿射变换差异信息,构建仿射变换信息修正损失函数。在一个实施例中,计算机设备可通过距离函数,比如L1-Norm函数来计算标准仿射变换差异信息和预测仿射变换差异信息的损失,也就是基于L1-Norm函数来构建仿射变换信息修正损失函数。可以理解,在本申请实施例中,也可采用其他的函数来构建仿射变换信息修正损失函数,只要该函数可以用来衡量标准仿射变换差异信息和预测仿射变换差异信息之间的差异程度即可,比如L2-Norm函数等。
可以理解,该预测仿射变换差异信息用于确定更新的仿射变换信息,并传递至视频帧序列中在后的视频帧。当仿射变换信息为仿射变换参数时,可通过以下公式计算更新的仿射变换信息:
（原文此处为公式及符号图像 PCTCN2020092356-appb-000022 至 appb-000025，未在文本中复现。）该公式所涉及的量分别为：当前帧所传递的更新的仿射变换信息；预测仿射变换差异参数；以及预测仿射变换参数。
S820,根据第一视频帧样本和第二视频帧样本,确定对应的光流信息,并依据光流信息和第一样本特征图,确定光流特征图。
可选地,计算机设备可根据第一视频帧样本和第二视频帧样本,确定对应的光流信息。比如,计算机设备可通过Lucas-kanade(是一种两帧差分的光流计算方法)光流方法计算第一视频帧样本所对应的光流信息。进而,计算机设备可依据光流信息和第一样本特征图,计算得到光流特征图。其中,该光流特征图可认为是融合了光流信息的、通过第一视频帧样本所预测的第二视频帧样本对应的特征图。
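The flow-warped ("optical flow") feature map described here can be sketched as follows. The text names the Lucas-Kanade two-frame method; since warping a whole feature map needs a dense field, the sketch substitutes Farneback dense flow and warps the previous ROI feature map with `grid_sample`, assuming the flow has already been brought to the feature-map resolution and that backward warping is intended. Both substitutions are assumptions.

```python
import cv2
import numpy as np
import torch
import torch.nn.functional as F

def dense_flow(prev_gray, next_gray):
    """Dense optical flow between two grayscale ROIs (Farneback as a stand-in
    for the Lucas-Kanade computation mentioned in the text)."""
    return cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

def warp_features_with_flow(prev_feat, flow):
    """Warp the previous frame's feature map (1, C, H, W) by a flow field
    (H, W, 2) to obtain the optical-flow feature map (sketch)."""
    _, _, h, w = prev_feat.shape            # flow is assumed to match (h, w)
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Backward warping: pixel (x, y) in the next frame samples (x - u, y - v)
    # from the previous feature map, in normalized [-1, 1] coordinates.
    src_x = (xs - flow[..., 0]) / (w - 1) * 2.0 - 1.0
    src_y = (ys - flow[..., 1]) / (h - 1) * 2.0 - 1.0
    grid = torch.from_numpy(np.stack([src_x, src_y], axis=-1)).float().unsqueeze(0)
    return F.grid_sample(prev_feat, grid, mode="bilinear", align_corners=True)
```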
S822,将光流特征图和第二样本特征图作为目标分割模型中判别器的样本输入,并通过判别器对样本输入进行分类处理,得到样本输入的预测类别。
可选地,该目标分割网络在模型训练阶段还包括判别器(Discriminator)。计算机设备可将光流特征图和第二样本特征图作为目标分割模型中判别器的样本输入,输入两种中的任意一种,通过Discriminator判断输入的特征是光流特征图还是第二样本特征图。其中,第二样本特征图是第二视频帧样本所对应的样本特征图,也可称作CNN特征图。
S824,依据预测类别及样本输入所对应的参考类别,构建对抗损失函数。
其中,样本输入所对应的参考类别可以是光流特征图和第二样本特征图分别对应的类别,比如光流类别和特征类别。Discriminator本质是一个二分类网络,计算机设备可使用二分类交叉熵(cross entropy)作为Discriminator的损失函数,以判断样本输入是否为光流特征图。也就是,根据预测类别及样本输入所对应的参考类别,依据交叉熵函数构建目标分割模型的对抗损失函数。
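A minimal sketch of the discriminator and its binary cross entropy loss: a small convolutional classifier over ROI feature maps that predicts whether its input is a flow-warped feature map or a CNN feature map. The architecture, channel sizes and the label convention (1 for flow features, 0 for CNN features) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeatureDiscriminator(nn.Module):
    """Binary classifier over ROI feature maps: flow feature vs. CNN feature (sketch)."""
    def __init__(self, in_channels=1280):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))

    def forward(self, feat):
        return self.net(feat)                       # raw logit

bce = nn.BCEWithLogitsLoss()

def discriminator_loss(disc, cnn_feat, flow_feat):
    """Two-class cross entropy for the discriminator update (features detached
    so only the discriminator parameters are trained by this loss)."""
    logit_flow = disc(flow_feat.detach())
    logit_cnn = disc(cnn_feat.detach())
    return (bce(logit_flow, torch.ones_like(logit_flow)) +
            bce(logit_cnn, torch.zeros_like(logit_cnn)))
```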
S826,依据光流特征图、第二样本特征图、及参考特征图,构建分割损失函数;参考特征图为对第二视频帧样本中的目标进行特征提取所得到的特征图。
可选地,计算机设备可对第二视频帧样本中的目标进行特征提取,得到参考特征图。进而计算机设备可依据光流特征图、第二样本特征图、及参考特征图,构建分割损失函数。
在一个实施例中,计算机设备可通过以下公式构建分割损失函数:
（原文此处的分割损失函数公式以图像 PCTCN2020092356-appb-000026 给出,未在文本中复现。）其中,F′_CNN、F′_OF 分别代表第二样本特征图和通过光流获取的光流特征图,F_CNN 代表参考特征图;f_dice、f_bce、f_mse 分别表示 Dice 计算公式、二分类交叉熵计算公式、均方差(mean square error)计算公式。其中,f_mse 越大,表示第二样本特征图和光流特征图的差距越大,从而加重惩罚 Generator 完成参数更新,使得 Generator 产生更加符合光流特征的特征图;f_dice 和 f_bce 则是促使 Generator 产生更加贴合人工标注信息的特征图。
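One plausible reading of the segmentation loss just described, shown as code: Dice and binary cross entropy terms pull the CNN feature-derived prediction toward the reference (annotation-based) map, and a mean squared error term penalizes disagreement between the CNN map and the optical-flow map. Because the exact pairing of arguments in the original formula is only given as an image, this particular combination and the assumption that all maps are probabilities in [0, 1] are illustrative.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

def segmentation_loss(cnn_map, flow_map, reference_map):
    """Assumed combination of f_dice, f_bce and f_mse (all inputs in [0, 1])."""
    bce = F.binary_cross_entropy(cnn_map, reference_map)
    mse = F.mse_loss(cnn_map, flow_map)   # larger gap -> stronger penalty on the generator
    return dice_loss(cnn_map, reference_map) + bce + mse
```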
S828,依据仿射损失函数、仿射变换信息修正损失函数、对抗损失函数、及分割损失函数,调整目标分割模型的模型参数并继续训练,直至满足训练停止条件时停止训练。
可选地,计算机设备可依据仿射损失函数、仿射变换信息修正损失函数、对抗损失函数、及分割损失函数,共同调整目标分割模型中各个网络结构的模型参数并继续训练,直至满足训练停止条件时停止训练。
在一个实施例中,目标分割模型在训练时,可采用交叉训练和共同训练相结合的方式进行训练。比如,参考图10,计算机设备可先训练生成器一段时间后,固定训练得到的参数,暂时不再回传。再训练判别器,之后再固定判别器的参数,进而再训练生成器,等训练结果稳定后再结合各个网络结构一起训练。那么此时的训练停止条件,也可认为是收敛条件,可以是,判别器的损失函数不再下降,判别器的输出稳定在(0.5,0.5)左右,判别器无法分辨出光流特征图和CNN特征图的区别。
可以理解,当生成器和判别器两者抗衡之后,整个网络达到收敛状态,生成器最终将产生CNN特征和光流信息共有部分的特征,而判别器将分不清光流特征和CNN特征的区别。在模型的使用阶段,可移除判别器,此时生成器将产生融合了光流信息的特征图。
在一个实施例中,目标分割模型中的各个生成器可共享参数。也就是说,上述图9中的三个生成器可认为是相同的生成器。
上述实施例中,在模型训练过程中一方面引入仿射变换监督信息,也就是标准仿射变换信息,以提高方位预测的准确性;另一方面可通过对预测仿射变换信息进行纠正训练,从而减少错误定位带来的分割误差。再者,采用了带有光流信息的对抗学习方式实现网络在时序上的一致性,使得训练时针对性更强,性能更佳。这样,训练时依据仿射损失函数、仿射变换信息修正损失函数、对抗损失函数、及分割损失函数叠加一起优化,使得各个部分在训练过程中相互影响,相互提升,这样训练得到的目标分割模型可以准确且平滑地从视频中分割出目标对象。
在一个实施例中,提供了一种模型训练方法。本实施例主要以该方法应用于图1中的计算机设备来举例说明,该模型训练方法包括以下步骤:获取视频帧样本、视频帧样本对应的样本标注信息、及视频帧样本对应的标准仿射变换信息;将视频帧样本输入至目标分割模型中进行训练,通过目标分割模型,确定与视频帧样本对应的预测仿射变换信息;依据预测仿射变换信息和标准仿射变换信息构建仿射损失函数;通过目标分割模型,输出与视频帧样本对应的预测仿射变换差异信息、及视频帧样本中目标对应的预测分割结果;根据预测仿射变换信息和标准仿射变换信息间的差异,确定标准仿射变换差异信息;依据标准仿射变换差异信息和预测仿射变换差异信息,构建仿射变换信息修正损失函数;根据预测分割结果和样本标注信息,确定分割损失函数;依据仿射损失函数、仿射变换信息修正损失函数、及分割损失函数,调整目标分割模型的模型参数并继续训练,直至满足训练停止条件时停止训练。
关于模型训练方法中各个步骤的详细说明可参考前述实施例中机器翻译模型的模型训练 步骤的说明,训练方式是一致的,在此不做重复说明。
在一个实施例中,以心脏超声检测视频为例,详细说明该目标分割模型的训练过程。参考图9,在训练时,可将前后两帧视频帧样本作为样本对输入到RAN网络中。第一阶段中,当前帧经过RAN网络的仿射变换对目标位置、尺寸以及方位进行了纠正,得到与模板分布相似的ROI图像,经过纠正的ROI图像减少了很多干扰,如其他心腔与左心室的相似性,图像标记以及伪影等带来的影响等。第二阶段中,再次使用Generator对ROI图像进行特征抽取,输出的特征进入两个任务分支,在分割任务分支中,输出的特征通过两次上采样后得到分割预测图,输出分割结果;在定位任务分支中,特征通过channel为4的全连接层回归出新的仿射变换差异结果。第二阶段通过回归差值的方式对第一阶段产生的仿射变换信息起二次修正作用。
其中,当仿射变换信息为仿射变换参数时,第二阶段的仿射变换差异结果的监督信息可通过下列公式计算:
（原文此处为公式及符号图像 PCTCN2020092356-appb-000027 至 appb-000029，未在文本中复现。）该公式所涉及的量分别为：标准仿射变换差异信息；当前帧所对应的仿射变换参数,也就是预测仿射变换参数;以及 θ_t,表示标准仿射变换参数。
由于该差值较小,为了加速网络收敛,可以使用L1-Norm函数算损失值。当前帧在第二阶段预测的仿射变换差异参数将用于计算更新的仿射变换信息并传播到下一帧视频帧中,下一帧视频帧根据上述参数直接进行仿射变换得到ROI,同理,ROI经过Generator提取特征,再次预测出分割结果和仿射变换差异结果。以第一阶段为基础,第二阶段进行二次仿射变换信息修正,如上面公式所示。第二阶段预测出相对于第一阶段的仿射变换信息变化值（原文此处以符号图像 PCTCN2020092356-appb-000030 表示预测仿射变换参数）。
同理,下一帧的视频帧所对应的ROI经过生成器提取特征,再次预测出分割结果和仿射变换差异结果。除此之外,渐进式变化是视频中目标变化的重要特征。在心脏超声检测视频帧中,左心室会随着时间逐渐扩大或者缩小,基本不存在突然变化的情况。然而,由于分割目标边界信息模糊以及伪影的干扰,尽管加入了时序、方位以及结构等先验信息,在某些视频帧上仍然会出现由于误分割引起的左心室容积突变。针对这种情况,在模型训练时可引入光流信息。假定左心室中相邻两帧的变化是较为微小的,下一帧视频帧可以通过上一帧视频帧的光流信息计算得到。在训练时,对于当前帧应该有两种特征形式:一种是通过CNN网络基于当前帧提取的特征,另一种是通过光流信息基于上一帧的特征变换而来的特征。为此,可设计判别器将这两种信息同时引入。如图9所示,判别器(Discriminator)的输入有两种:一种来源于生成器对下一帧ROI提取的特征,一种来源于利用光流信息基于当前帧ROI特征变换而来的下一帧ROI特征,输入两种中的任意一种,判别器判断输入的特征属于光流变换的特征(Flow Field)还是CNN特征。这样,引入判别器,促使生成器产生具备光流信息和CNN本帧信息的分割特征。因此,分割任务分支可采用如下损失函数:
（原文此处的分割损失函数公式以图像 PCTCN2020092356-appb-000031 给出,未在文本中复现。）其中,F′_CNN、F′_OF 分别代表第二样本特征图和通过光流获取的光流特征图,F_CNN 代表参考特征图;f_dice、f_bce、f_mse 分别表示 Dice 计算公式、二分类交叉熵计算公式、均方差计算公式。其中,f_mse 越大,表示第二样本特征图和光流特征图的差距越大,从而加重惩罚生成器完成参数更新,使得生成器产生更加符合光流特征的特征图;f_dice 和 f_bce 则是促使生成器产生更加贴合人工标注信息的特征图。
此外,对于判别器,使用二分类交叉熵作为损失函数用于判断输入是否为光流特征。两者抗衡之后,当整个网络达到收敛状态,生成器最终将产生CNN特征和光流信息共有部分的特征,而判别器将分不清光流特征和CNN特征的区别。模型使用时,判别器将被移除,生成器将产生融合了光流信息的特征图
下面结合应用场景,比如心脏早期筛查场景,对心脏超声检测视频中的左心室作为目标,通过该图像分割方法实现对左心室的分割来进行详细说明:
临床中,心脏早期筛查是预防以及诊断心脏疾病的重要措施。鉴于其筛查快速,价格低廉,信息丰富的优势,心脏B型超声是目前普遍性较高的早期筛查手段。在心脏超声检测中,临床上常以心动周期超声中左心室在四腔切面和二腔切面的面积,配合Simpson法(辛普森法)估量射血分数,作为诊断心功能的一个重要信息来源。而基于计算机辅助的左心室自动分割是计算心功能指标(如射血分数)的重要依据。然而,左心室物体边界模糊,且容易受伪影影像造成边缘缺失,严重影响了分割准确性。同时,左心室的变化和时间强烈相关,预测错误带来的左心室轮廓突变极容易导致临床指标的误计算。同时,超声视频筛查的落地对网络大小、实时性有很大的需求。
考虑到上述困难,本申请实施例中提出了基于Region Affine Networks的端到端视频目标分割模型,将在前的视频帧帧的目标结构信息(也就是在前的视频帧所传递的历史仿射变换信息)引入到当前帧,提升了分割性能;同时Region Affine Networks是有监督信息的可学习仿射变换信息的预测网络,仿射监督信息的引入使得仿射变化参数预测更加准确。并且,基于二阶段定位网络能够二次纠正在前的视频帧所传递的变换错误,增加网络鲁棒性,减少因为仿射变换信息错误带来的分割误差。同时,基于光流信息的对抗学习网络,在训练时可促使分割结果贴近时序变换渐进性,使得分割结果更加合理。整个网络端到端训练,各个部分相辅相成,相互提高。目标结构信息的引入减少噪声干扰,降低分割难度,使用轻量级的编码网络即可得到优异的分割结果。同时,视频的时序分析、时间平滑处理全部集中在训练阶段,减少了模型在使用过程中的操作处理,大大减少了目标分割的耗时,提高了效率。
本申请实施例所提供的图像分割方法可以用于临床中心脏超声检测配合Simpson法筛查心脏疾病,可以解放医师的双手,减少医师标注带来的重复劳动以及主观差异。由于实现该目标分割模型的各个网络结构小、实时性好,端到端网络工程化程度高,极易迁移到移动设备中。
本申请实施例中对心脏超声检测视频中的左心室进行分割所得到的分割结果,可作为临床上心脏B型超声结合Simpson法测量射血分数的自动化方案;专为视频单物体设计的端到端网络,引入了时序信息、目标的结构位置信息,能得到更加符合视频规律的分割结果;对抗学习网络自适应地增加了视频分割的平滑度,使得分割结果更加合理;该图像分割方法实现了高分割性能的轻量级网络,实时性极强,工程化程度高。
在一个实施例中,如图11所示,该图像分割方法包括以下步骤:
S1002,当当前帧为初始视频帧时,获取视频帧序列中的初始视频帧。
S1004,通过第一卷积神经网络提取初始视频帧的图像特征。
S1006,将图像特征输入至第一全连接网络,通过第一全连接网络对图像特征进行处理,通过第一全连接网络的至少一个输出通道输出仿射变换信息。
S1008,将输出的仿射变换信息作为初始视频帧对应的历史仿射变换信息。
当当前帧不为初始视频帧时,从缓存中读取在前的视频帧对应的历史仿射变换信息。
S1010,依据历史仿射变换信息对当前帧进行仿射变换,得到与当前帧对应的候选区域图像。
S1012,通过目标分割模型中的第二卷积神经网络,对候选区域图像进行特征提取,得到候选区域图像对应的特征图;特征图融合了视频帧序列所包括的光流信息。
S1014,通过全卷积神经网络对特征图进行上采样处理,得到中间图像。
S1016,通过全卷积神经网络对中间图像中的各像素分别进行像素级分类,得到各像素所对应的类别。
S1018,确定中间图像中对应目标类别的像素。
S1020,从中间图像中,分割出由对应目标类别的各像素所组成的、且包括目标对象的目标分割区域。
S1022,通过第二全连接网络对特征图进行处理,通过第二全连接网络的至少一个输出通道输出仿射变换差异结果。
S1024,依据仿射变换差异结果和在前的视频帧所传递的历史仿射变换信息,计算得到当前帧所传递的更新的仿射变换信息。
S1026,将当前帧所传递的更新的仿射变换信息,作为视频帧序列中在后的视频帧所对应的历史仿射变换信息。
上述图像分割方法,依据在前的视频帧所传递的历史仿射变换信息,对当前帧进行仿射变换,得到与当前帧对应的候选区域图像。在前的视频帧所传递的历史仿射变换信息是经过修正后的参数,这样可大大提高候选区域图像获取的准确性。对与候选区域图像对应的特征图进行语义分割,可以准确得到当前帧中的目标对应的分割结果。并且,根据该特征图对历史仿射变换信息进行修正,将修正后的仿射变换信息传递至在后的视频帧,以供在后的视频帧使用。这样可对当前帧的定位起到纠正作用,减少了错误定位给后续的分割处理所带来误差,大大提高了对视频进行语义分割处理的准确性。
图11为一个实施例中图像分割方法的流程示意图。应该理解的是,虽然图11的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,图11中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
如图12所示,在一个实施例中,提供了图像分割装置1100,包括获取模块1101、仿射变换模块1102、特征提取模块1103、语义分割模块1104和参数修正模块1105。
获取模块1101,用于获取视频帧序列中的当前帧、及在前的视频帧的历史仿射变换信息。
仿射变换模块1102,用于依据历史仿射变换信息对当前帧进行仿射变换,得到与当前帧对应的候选区域图像。
特征提取模块1103,用于对候选区域图像进行特征提取,得到候选区域图像对应的特征图。
语义分割模块1104,用于基于特征图进行语义分割,得到当前帧中的目标对应的分割结果。
参数修正模块1105,用于根据特征图对历史仿射变换信息进行修正,得到更新的仿射变换信息,并将更新的仿射变换信息作为视频帧序列中在后的视频帧所对应的历史仿射变换信息。
在一个实施例中,当当前帧为初始视频帧时,获取模块1101还用于当当前帧为初始视频帧时,获取视频帧序列中的初始视频帧;通过第一卷积神经网络提取初始视频帧的图像特征;将图像特征输入至包括第一全连接网络,通过第一全连接网络对图像特征进行处理,通过第一全连接网络的至少一个输出通道输出仿射变换信息;将输出的仿射变换信息作为初始视频帧对应的历史仿射变换信息。
在一个实施例中,对候选区域图像进行特征提取所得到的特征图融合有视频帧序列所包括的光流信息。
在一个实施例中,语义分割模块1104还用于通过全卷积神经网络对特征图进行上采样处理,得到中间图像;通过全卷积神经网络对中间图像中的各像素分别进行像素级分类,得到各像素所对应的类别;依据各像素所对应的类别,输出对当前帧中的目标进行语义分割的分割结果。
在一个实施例中,语义分割模块1104还用于确定中间图像中对应目标类别的像素;从中间图像中,分割出由对应目标类别的各像素所组成的、且包括目标对象的目标分割区域。
在一个实施例中,参数修正模块1105还用于通过第二全连接网络对特征图进行处理,通过第二全连接网络的至少一个输出通道输出仿射变换差异结果;依据仿射变换差异结果和在前的视频帧的历史仿射变换信息,计算得到当前帧的更新的仿射变换信息;将当前帧的更新的仿射变换信息,作为视频帧序列中在后的视频帧所对应的历史仿射变换信息。
在一个实施例中,特征提取模块1103还用于通过目标分割模型中的第二卷积神经网络,对候选区域图像进行特征提取,得到候选区域图像对应的特征图。语义分割模块1104还用于通过目标分割模型中的全卷积神经网络,对特征图进行语义分割处理,得到当前帧中的目标对应的分割结果。参数修正模块1105还用于通过目标分割模型中的第二全连接网络对历史仿射变换信息进行修正,得到更新的仿射变换信息。
如图13所示,在一个实施例中,该图像分割装置还包括模型训练模块1106,用于获取视频帧样本、视频帧样本对应的样本标注信息、及视频帧样本对应的标准仿射变换信息;将视频帧样本输入至目标分割模型中进行训练,通过目标分割模型,获取与视频帧样本对应的预测仿射变换信息;依据预测仿射变换信息和标准仿射变换信息构建仿射损失函数;通过目标分割模型,输出与视频帧样本对应的预测仿射变换差异信息、及视频帧样本中目标对应的预测分割结果;根据预测仿射变换信息和标准仿射变换信息间的差异,确定标准仿射变换差异信息;依据标准仿射变换差异信息和预测仿射变换差异信息,构建仿射变换信息修正损失函数;根据预测分割结果和样本标注信息,确定分割损失函数;依据仿射损失函数、仿射变换信息修正损失函数、及分割损失函数,调整目标分割模型的模型参数并继续训练,直至满足训练停止条件时停止训练。
上述图像分割装置,依据在前的视频帧的历史仿射变换信息,对当前帧进行仿射变换,得到与当前帧对应的候选区域图像。在前的视频帧的历史仿射变换信息是经过修正后的参数,这样可大大提高候选区域图像获取的准确性。对与候选区域图像对应的特征图进行语义分割, 可以准确得到当前帧中的目标对应的分割结果。并且,根据该特征图对历史仿射变换信息进行修正,将修正后的仿射变换信息传递至在后的视频帧,以供在后的视频帧使用。这样可对当前帧的定位起到纠正作用,减少了错误定位给后续的分割处理所带来误差,大大提高了对视频进行语义分割处理的准确性。
如图14所示,在一个实施例中,提供了模型训练装置1300,包括样本获取模块1301、确定模块1302、构建模块1303、输出模块1304和模型参数调整模块1305。
样本获取模块1301,用于获取视频帧样本、视频帧样本对应的样本标注信息、及视频帧样本对应的标准仿射变换信息。
确定模块1302,用于将视频帧样本输入至目标分割模型中进行训练,通过目标分割模型,确定与视频帧样本对应的预测仿射变换信息。
构建模块1303,用于依据预测仿射变换信息和标准仿射变换信息构建仿射损失函数。
输出模块1304,用于通过目标分割模型,输出与视频帧样本对应的预测仿射变换差异信息、及视频帧样本中目标对应的预测分割结果。
确定模块1302还用于根据预测仿射变换信息和标准仿射变换信息间的差异,确定标准仿射变换差异信息。
构建模块1303还用于依据标准仿射变换差异信息和预测仿射变换差异信息,构建仿射变换信息修正损失函数。
构建模块1303还用于根据预测分割结果和样本标注信息,确定分割损失函数。
模型参数调整模块1305,用于依据仿射损失函数、仿射变换信息修正损失函数、及分割损失函数,调整目标分割模型的模型参数并继续训练,直至满足训练停止条件时停止训练。
在一个实施例中,样本获取模块1301还用于获取视频帧样本和相应的样本标注信息;样本标注信息包括样本关键点位置信息和样本区域位置信息;根据视频帧样本、样本关键点位置信息和样本区域位置信息,确定模板图像及模板图像对应的模板关键点位置信息;根据样本关键点位置信息和模板关键点位置信息,计算得到与视频帧样本对应的标准仿射变换信息。
在一个实施例中,样本获取模块1301还用于获取第一视频帧样本和第二视频帧样本;第一视频帧样本为第二视频帧样本在前的视频帧;分别获取与第一视频帧样本及第二视频帧样本各自对应的样本标注信息、及与第一视频帧样本对应的标准仿射变换信息。确定模块1302还用于将第一视频帧样本和第二视频帧样本作为样本对输入至目标分割模型中进行训练,通过目标分割模型对第一视频帧样本进行处理,得到与第一视频帧样本对应的预测仿射变换信息。输出模块1304还用于依据预测仿射变换信息对第一视频帧样本进行仿射变换,得到第一样本候选区域图像,并对第一样本候选区域图像进行特征提取,得到第一样本特征图;基于第一样本特征图进行语义分割,得到第一视频帧样本中的目标对应的预测分割结果;根据第一样本特征图对预测仿射变换信息进行修正,得到与第一视频帧样本对应的预测仿射变换差异信息。该模型训练装置还包括对抗模块1306,用于根据第一视频帧样本和第二视频帧样本,确定对应的光流信息,并依据光流信息和第一样本特征图,确定光流特征图;将光流特征图和第二样本特征图作为目标分割模型中判别器的样本输入,通过判别器对样本输入进行分类处理,得到样本输入的预测类别。构建模块1303还用于依据预测类别及样本输入所对应的参考类别,构建对抗损失函数;依据光流特征图、第二样本特征图、及参考特征图,构建分割 损失函数;参考特征图为对第二视频帧样本中的目标进行特征提取所得到的特征图。模型参数调整模块1305还用于依据仿射损失函数、仿射变换信息修正损失函数、对抗损失函数、及分割损失函数,调整目标分割模型的模型参数并继续训练,直至满足训练停止条件时停止训练。
上述模型训练装置,在模型训练过程中一方面引入仿射变换监督信息,也就是标准仿射变换信息,以提高方位预测的准确性;另一方面可通过对预测仿射变换信息进行纠正训练,从而减少错误定位带来的分割误差。训练时将仿射损失函数、仿射变换信息修正损失函数、及分割损失函数叠加一起优化,使得各个部分在训练过程中相互影响,相互提升,这样训练得到的目标分割模型具有准确的视频语义分割性能。
图15示出了一个实施例中计算机设备的内部结构图。该计算机设备可以是图1中的计算机设备。如图15所示,该计算机设备包括该计算机设备包括通过系统总线连接的处理器、存储器、和网络接口。其中,存储器包括非易失性存储介质和内存储器。该计算机设备的非易失性存储介质存储有操作系统,还可存储有计算机程序,该计算机程序被处理器执行时,可使得处理器实现图像分割方法和/或模型训练方法。该内存储器中也可储存有计算机程序,该计算机程序被处理器执行时,可使得处理器执行图像分割方法和/或模型训练方法。
本领域技术人员可以理解,图15中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
在一个实施例中,本申请提供的图像分割装置和或/模型训练装置可以实现为一种计算机程序的形式,计算机程序可在如图15所示的计算机设备上运行。计算机设备的存储器中可存储组成该图像分割装置的各个程序模块,比如,图12所示的获取模块、仿射变换模块、特征提取模块、语义分割模块和参数修正模块。各个程序模块构成的计算机程序使得处理器执行本说明书中描述的本申请各个实施例的图像分割方法中的步骤。还比如,图14所示的样本获取模块、确定模块、构建模块、输出模块和模型参数调整模块。各个程序模块构成的计算机程序使得处理器执行本说明书中描述的本申请各个实施例的模型训练方法中的步骤。
例如,图15所示的计算机设备可以通过如图12所示的图像分割装置中的获取模块执行步骤S202。计算机设备可通过仿射变换模块执行步骤S204。计算机设备可通过特征提取模块执行步骤S206。计算机设备可通过语义分割模块执行步骤S208。计算机设备可通过参数修正模块执行步骤S210。
在一个实施例中,提供了一种计算机设备,包括存储器和处理器,存储器存储有计算机程序,计算机程序被处理器执行时,使得处理器执行上述图像分割方法和/或模型训练方法的步骤。此处图像分割方法和/或模型训练方法的步骤可以是上述各个实施例的图像分割方法和/或模型训练方法中的步骤。
在一个实施例中,提供了一种计算机可读存储介质,存储有计算机程序,计算机程序被处理器执行时,使得处理器执行上述图像分割方法和/或模型训练方法的步骤。此处图像分割方法和/或模型训练方法的步骤可以是上述各个实施例的图像分割方法和/或模型训练方法中的步骤。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,所述的程序可存储于一非易失性计算机可读取存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。
以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
以上所述实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对本申请专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请专利的保护范围应以所附权利要求为准。

Claims (20)

  1. 一种图像分割方法,应用于计算机设备中,所述方法包括:
    获取视频帧序列中的当前帧、及在前的视频帧的历史仿射变换信息;
    依据所述历史仿射变换信息对所述当前帧进行仿射变换,得到与所述当前帧对应的候选区域图像;
    对所述候选区域图像进行特征提取,得到所述候选区域图像对应的特征图;
    基于所述特征图进行语义分割,得到所述当前帧中的目标对应的分割结果;
    根据所述特征图对所述历史仿射变换信息进行修正,得到更新的仿射变换信息,将所述更新的仿射变换信息作为所述视频帧序列中在后的视频帧所对应的历史仿射变换信息。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述特征图对所述历史仿射变换信息进行修正,得到更新的仿射变换信息,将所述更新的仿射变换信息作为所述视频帧序列中在后的视频帧所对应的历史仿射变换信息,包括:
    通过第二全连接网络对所述特征图进行处理,通过所述第二全连接网络的至少一个输出通道输出仿射变换差异结果;
    依据所述仿射变换差异结果和所述在前的视频帧的历史仿射变换信息,计算得到更新的仿射变换信息;
    将所述更新的仿射变换信息,作为所述视频帧序列中在后的视频帧所对应的历史仿射变换信息。
  3. 根据权利要求1所述的方法,其特征在于,所述特征图融合有所述视频帧序列所包括的光流信息。
  4. 根据权利要求1所述的方法,其特征在于,所述获取视频帧序列中的当前帧、及在前的视频帧的历史仿射变换信息,包括:
    当所述当前帧为初始视频帧时,获取所述视频帧序列中的所述初始视频帧;
    通过第一卷积神经网络提取所述初始视频帧的图像特征;
    将所述图像特征输入至第一全连接网络,通过所述第一全连接网络对所述图像特征进行处理,通过所述第一全连接网络的至少一个输出通道输出的仿射变换信息;
    将输出的所述仿射变换信息作为所述初始视频帧对应的历史仿射变换信息。
  5. 根据权利要求1所述的方法,其特征在于,所述基于所述特征图进行语义分割,得到所述当前帧中的目标对应的分割结果,包括:
    通过全卷积神经网络对所述特征图进行上采样处理,得到中间图像;
    通过所述全卷积神经网络对所述中间图像中的各个像素分别进行像素级分类,得到所述各个像素所对应的类别;
    依据所述各个像素所对应的类别,输出对所述当前帧中的目标进行语义分割的分割结果。
  6. 根据权利要求5所述的方法,其特征在于,所述依据所述各个像素所对应的类别,输出对所述当前帧中的目标进行语义分割的分割结果,包括:
    确定所述中间图像中属于目标类别的像素;
    从所述中间图像中,分割出由属于所述目标类别的各个所述像素所组成的、且包括所述目标的目标分割区域。
  7. 根据权利要求1所述的方法,其特征在于,所述方法通过目标分割模型执行;
    所述对所述候选区域图像进行特征提取,得到所述候选区域图像对应的特征图,包括:
    通过所述目标分割模型中的第二卷积神经网络,对所述候选区域图像进行特征提取,得到所述候选区域图像对应的特征图;
    所述基于所述特征图进行语义分割,得到所述当前帧中的目标对应的分割结果,包括:
    通过所述目标分割模型中的全卷积神经网络,对所述特征图进行语义分割处理,得到所述当前帧中的目标对应的分割结果;
    所述根据所述特征图对所述历史仿射变换信息进行修正,得到更新的仿射变换信息,包括:
    通过所述目标分割模型中的第二全连接网络对所述历史仿射变换信息进行修正,得到更新的仿射变换信息。
  8. 根据权利要求1至7中任一项所述的方法,其特征在于,所述方法通过目标分割模型执行,所述目标分割模型的训练步骤包括:
    获取视频帧样本、所述视频帧样本对应的样本标注信息、及所述视频帧样本对应的标准仿射变换信息;
    将所述视频帧样本输入至所述目标分割模型中进行训练,通过所述目标分割模型,确定与所述视频帧样本对应的预测仿射变换信息;
    依据所述预测仿射变换信息和所述标准仿射变换信息构建仿射损失函数;
    通过所述目标分割模型,输出与所述视频帧样本对应的预测仿射变换差异信息、及所述视频帧样本中目标对应的预测分割结果;
    根据所述预测仿射变换信息和所述标准仿射变换信息间的差异,确定标准仿射变换差异信息;
    依据所述标准仿射变换差异信息和所述预测仿射变换差异信息,构建仿射变换信息修正损失函数;
    根据所述预测分割结果和所述样本标注信息,确定分割损失函数;
    依据所述仿射损失函数、所述仿射变换信息修正损失函数、及所述分割损失函数,调整所述目标分割模型的模型参数并继续训练,直至满足训练停止条件时停止训练。
  9. 一种模型训练方法,应用于计算机设备中,所述方法包括:
    获取视频帧样本、所述视频帧样本对应的样本标注信息、及所述视频帧样本对应的标准仿射变换信息;
    将所述视频帧样本输入至目标分割模型中进行训练,通过所述目标分割模型,确定与所述视频帧样本对应的预测仿射变换信息;
    依据所述预测仿射变换信息和所述标准仿射变换信息构建仿射损失函数;
    通过所述目标分割模型,输出与所述视频帧样本对应的预测仿射变换差异信息、及所述视频帧样本中目标对应的预测分割结果;
    根据所述预测仿射变换信息和所述标准仿射变换信息间的差异,确定标准仿射变换差异信息;
    依据所述标准仿射变换差异信息和所述预测仿射变换差异信息,构建仿射变换信息修正损失函数;
    根据所述预测分割结果和所述样本标注信息,确定分割损失函数;
    依据所述仿射损失函数、所述仿射变换信息修正损失函数、及所述分割损失函数,调整所述目标分割模型的模型参数并继续训练,直至满足训练停止条件时停止训练。
  10. 根据权利要求9所述的方法,其特征在于,所述获取视频帧样本、所述视频帧样本对应的样本标注信息、及所述视频帧样本对应的标准仿射变换信息,包括:
    获取视频帧样本和相应的样本标注信息;所述样本标注信息包括样本关键点位置信息和样本区域位置信息;
    根据所述视频帧样本、所述样本关键点位置信息和所述样本区域位置信息,确定模板图像及所述模板图像对应的模板关键点位置信息;
    根据所述样本关键点位置信息和所述模板关键点位置信息,计算得到与所述视频帧样本对应的标准仿射变换信息。
  11. 根据权利要求9或10所述的方法,其特征在于,所述获取视频帧样本、所述视频帧样本对应的样本标注信息、及所述视频帧样本对应的标准仿射变换信息,包括:
    获取第一视频帧样本和第二视频帧样本;所述第一视频帧样本为所述第二视频帧样本在前的视频帧;
    分别获取与所述第一视频帧样本及所述第二视频帧样本各自对应的样本标注信息、及与所述第一视频帧样本对应的标准仿射变换信息;
    所述将所述视频帧样本输入至所述目标分割模型中进行训练,通过所述目标分割模型,确定与所述视频帧样本对应的预测仿射变换信息,包括:
    将所述第一视频帧样本和所述第二视频帧样本作为样本对输入至目标分割模型中进行训练,通过所述目标分割模型对所述第一视频帧样本进行处理,得到与所述第一视频帧样本对应的预测仿射变换信息;
    所述通过所述目标分割模型,输出与所述视频帧样本对应的预测仿射变换差异信息、及所述视频帧样本中目标对应的预测分割结果,包括:
    依据所述预测仿射变换信息对所述第一视频帧样本进行仿射变换,得到第一样本候选区域图像,并对所述第一样本候选区域图像进行特征提取,得到第一样本特征图;
    基于所述第一样本特征图进行语义分割,得到所述第一视频帧样本中的目标对应的预测分割结果;
    根据所述第一样本特征图对所述预测仿射变换信息进行修正,得到与所述第一视频帧样本对应的预测仿射变换差异信息。
  12. 根据权利要求11所述的方法,其特征在于,所述方法还包括:
    根据所述第一视频帧样本和所述第二视频帧样本,确定对应的光流信息,并依据所述光流信息和所述第一样本特征图,确定光流特征图;
    将所述光流特征图和所述第二样本特征图作为所述目标分割模型中判别器的样本输入,并通过所述判别器对所述样本输入进行分类处理,得到所述样本输入的预测类别;
    依据所述预测类别及所述样本输入所对应的参考类别,构建对抗损失函数;
    所述根据所述预测分割结果和所述样本标注信息,确定分割损失函数包括:
    依据所述光流特征图、所述第二样本特征图、及参考特征图,构建分割损失函数;所述参考特征图为对所述第二视频帧样本中的目标进行特征提取所得到的特征图;
    所述依据所述仿射损失函数、所述仿射变换信息修正损失函数、及所述分割损失函数,调整所述目标分割模型的模型参数并继续训练,直至满足训练停止条件时停止训练,包括:
    依据所述仿射损失函数、所述仿射变换信息修正损失函数、所述对抗损失函数、及所述分割损失函数,调整所述目标分割模型的模型参数并继续训练,直至满足训练停止条件时停止训练。
  13. 一种图像分割装置,所述装置包括:
    获取模块,用于获取视频帧序列中的当前帧、及在前的视频帧的历史仿射变换信息;
    仿射变换模块,用于依据所述历史仿射变换信息对所述当前帧进行仿射变换,得到与所述当前帧对应的候选区域图像;
    特征提取模块,用于对所述候选区域图像进行特征提取,得到所述候选区域图像对应的特征图;
    语义分割模块,用于基于所述特征图进行语义分割,得到所述当前帧中的目标对应的分割结果;
    参数修正模块,用于根据所述特征图对所述历史仿射变换信息进行修正,得到更新的仿射变换信息,将所述更新的仿射变换信息作为所述视频帧序列中在后的视频帧所对应的历史仿射变换信息。
  14. 根据权利要求13所述的装置,其特征在于,所述参数修正模块,还用于通过第二全连接网络对特征图进行处理,通过所述第二全连接网络的至少一个输出通道输出仿射变换差异结果;依据仿射变换差异结果和所述在前的视频帧的历史仿射变换信息,计算得到所述当前帧的更新的仿射变换信息;将所述更新的仿射变换信息,作为所述视频帧序列中在后的视频帧所对应的历史仿射变换信息。
  15. 根据权利要求13所述的装置,其特征在于,所述特征图融合有所述视频帧序列所包括的光流信息。
  16. 一种模型训练装置,所述装置包括:
    样本获取模块,用于获取视频帧样本、所述视频帧样本对应的样本标注信息、及所述视频帧样本对应的标准仿射变换信息;
    确定模块,用于将所述视频帧样本输入至目标分割模型中进行训练,通过所述目标分割模型,确定与所述视频帧样本对应的预测仿射变换信息;
    构建模块,用于依据所述预测仿射变换信息和所述标准仿射变换信息构建仿射损失函数;
    输出模块,用于通过所述目标分割模型,输出与所述视频帧样本对应的预测仿射变换差异信息、及所述视频帧样本中目标对应的预测分割结果;
    所述确定模块,还用于根据所述预测仿射变换信息和所述标准仿射变换信息间的差异,确定标准仿射变换差异信息;
    所述构建模块,还用于依据所述标准仿射变换差异信息和所述预测仿射变换差异信息,构建仿射变换信息修正损失函数;
    所述构建模块,还用于根据所述预测分割结果和所述样本标注信息,确定分割损失函数;
    模型参数调整模块,用于依据所述仿射损失函数、所述仿射变换信息修正损失函数、及所述分割损失函数,调整所述目标分割模型的模型参数并继续训练,直至满足训练停止条件时停止训练。
  17. 根据权利要求16所述的装置,其特征在于,
    所述样本获取模块,还用于获取第一视频帧样本和第二视频帧样本;所述第一视频帧样本为所述第二视频帧样本在前的视频帧;分别获取与所述第一视频帧样本及所述第二视频帧样本各自对应的样本标注信息、及与所述第一视频帧样本对应的标准仿射变换信息;
    所述确定模块,还用于将所述第一视频帧样本和所述第二视频帧样本作为样本对输入至所述目标分割模型中进行训练,通过所述目标分割模型对第一视频帧样本进行处理,得到与所述第一视频帧样本对应的预测仿射变换信息;
    所述输出模块,还用于依据所述预测仿射变换信息对所述第一视频帧样本进行仿射变换,得到第一样本候选区域图像,并对所述第一样本候选区域图像进行特征提取,得到第一样本特征图;基于所述第一样本特征图进行语义分割,得到所述第一视频帧样本中的目标对应的预测分割结果;根据所述第一样本特征图对预测仿射变换信息进行修正,得到与所述第一视频帧样本对应的预测仿射变换差异信息。
  18. 根据权利要求16所述的装置,其特征在于,所述装置还包括:
    对抗模块,用于根据所述第一视频帧样本和所述第二视频帧样本,确定对应的光流信息,并依据所述光流信息和所述第一样本特征图,确定光流特征图;将所述光流特征图和所述第二样本特征图作为所述目标分割模型中判别器的样本输入,并通过所述判别器对样本输入进行分类处理,得到所述样本输入的预测类别;
    所述构建模块,还用于依据所述预测类别及所述样本输入所对应的参考类别,构建对抗损失函数;依据所述光流特征图、所述第二样本特征图、及所述参考特征图,构建分割损失函数;所述参考特征图为对所述第二视频帧样本中的目标进行特征提取所得到的特征图;
    所述模型参数调整模块还用于依据所述仿射损失函数、所述仿射变换信息修正损失函数、所述对抗损失函数、及所述分割损失函数,调整所述目标分割模型的模型参数并继续训练,直至满足训练停止条件时停止训练。
  19. 一种计算机可读存储介质,存储有计算机程序,所述计算机程序被处理器执行时,使得所述处理器执行如权利要求1至12中任一项所述方法的步骤。
  20. 一种计算机设备,包括存储器和处理器,所述存储器存储有计算机程序,所述计算机程序被所述处理器执行时,使得所述处理器执行如权利要求1至12中任一项所述方法的步骤。
PCT/CN2020/092356 2019-05-29 2020-05-26 图像分割方法、模型训练方法、装置、设备及存储介质 WO2020238902A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/395,388 US11900613B2 (en) 2019-05-29 2021-08-05 Image segmentation method and apparatus, model training method and apparatus, device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910455150.4 2019-05-29
CN201910455150.4A CN110188754B (zh) 2019-05-29 2019-05-29 图像分割方法和装置、模型训练方法和装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/395,388 Continuation US11900613B2 (en) 2019-05-29 2021-08-05 Image segmentation method and apparatus, model training method and apparatus, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2020238902A1 true WO2020238902A1 (zh) 2020-12-03

Family

ID=67718434

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/092356 WO2020238902A1 (zh) 2019-05-29 2020-05-26 图像分割方法、模型训练方法、装置、设备及存储介质

Country Status (3)

Country Link
US (1) US11900613B2 (zh)
CN (1) CN110188754B (zh)
WO (1) WO2020238902A1 (zh)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112906463A (zh) * 2021-01-15 2021-06-04 上海东普信息科技有限公司 基于图像的火情检测方法、装置、设备及存储介质
CN113034580A (zh) * 2021-03-05 2021-06-25 北京字跳网络技术有限公司 图像信息检测方法、装置和电子设备
CN113177483A (zh) * 2021-04-30 2021-07-27 北京百度网讯科技有限公司 视频目标分割方法、装置、设备以及存储介质
CN113923493A (zh) * 2021-09-29 2022-01-11 北京奇艺世纪科技有限公司 一种视频处理方法、装置、电子设备以及存储介质
CN114693934A (zh) * 2022-04-13 2022-07-01 北京百度网讯科技有限公司 语义分割模型的训练方法、视频语义分割方法及装置
CN115474084A (zh) * 2022-08-10 2022-12-13 北京奇艺世纪科技有限公司 一种视频封面图像的生成方法、装置、设备和存储介质
CN117078761A (zh) * 2023-10-07 2023-11-17 深圳市爱博医疗机器人有限公司 细长型医疗器械自动定位方法、装置、设备以及介质
CN117132587A (zh) * 2023-10-20 2023-11-28 深圳微创心算子医疗科技有限公司 超声扫描导航方法、装置、计算机设备和存储介质

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188754B (zh) 2019-05-29 2021-07-13 腾讯科技(深圳)有限公司 图像分割方法和装置、模型训练方法和装置
CN110942463B (zh) * 2019-10-30 2021-03-16 杭州电子科技大学 一种基于生成对抗网络的视频目标分割方法
CN110838132B (zh) * 2019-11-15 2022-08-05 北京字节跳动网络技术有限公司 基于视频流的物体分割方法、装置、设备及存储介质
CN111027600B (zh) * 2019-11-25 2021-03-23 腾讯科技(深圳)有限公司 图像类别预测方法和装置
CN111177460B (zh) * 2019-12-20 2023-04-18 腾讯科技(深圳)有限公司 提取关键帧的方法及装置
CN111210439B (zh) * 2019-12-26 2022-06-24 中国地质大学(武汉) 通过抑制非感兴趣信息的语义分割方法、设备及存储设备
CN113111684B (zh) * 2020-01-10 2024-05-21 字节跳动有限公司 神经网络模型的训练方法、装置和图像处理系统
US11645505B2 (en) * 2020-01-17 2023-05-09 Servicenow Canada Inc. Method and system for generating a vector representation of an image
CN111507997B (zh) * 2020-04-22 2023-07-25 腾讯科技(深圳)有限公司 图像分割方法、装置、设备及计算机存储介质
CN111539439B (zh) * 2020-04-30 2021-01-05 宜宾电子科技大学研究院 一种图像语义分割方法
CN115668278A (zh) * 2020-05-29 2023-01-31 华为技术有限公司 图像处理方法及相关设备
CN111666905B (zh) * 2020-06-10 2022-12-02 重庆紫光华山智安科技有限公司 模型训练方法、行人属性识别方法和相关装置
CN111695512B (zh) * 2020-06-12 2023-04-25 嘉应学院 一种无人值守文物监测方法及装置
CN111915480B (zh) * 2020-07-16 2023-05-23 抖音视界有限公司 生成特征提取网络的方法、装置、设备和计算机可读介质
CN111968123B (zh) * 2020-08-28 2024-02-02 北京交通大学 一种半监督视频目标分割方法
CN112598645B (zh) * 2020-12-23 2022-07-01 深兰智能科技(上海)有限公司 轮廓检测方法、装置、设备及存储介质
CN115082574B (zh) * 2021-03-16 2024-05-14 上海软逸智能科技有限公司 网络模型训练方法和脏器超声切面编码生成方法、装置
CN113223104B (zh) * 2021-04-16 2023-03-24 山东师范大学 一种基于因果关系的心脏mr图像插补方法及系统
CN113361519B (zh) * 2021-05-21 2023-07-28 北京百度网讯科技有限公司 目标处理方法、目标处理模型的训练方法及其装置
CN113453032B (zh) * 2021-06-28 2022-09-30 广州虎牙科技有限公司 手势互动方法、装置、系统、服务器和存储介质
CN113570607B (zh) * 2021-06-30 2024-02-06 北京百度网讯科技有限公司 目标分割的方法、装置及电子设备
CN113435432B (zh) * 2021-08-27 2021-11-30 腾讯科技(深圳)有限公司 视频异常检测模型训练方法、视频异常检测方法和装置
CN113741459A (zh) * 2021-09-03 2021-12-03 阿波罗智能技术(北京)有限公司 确定训练样本的方法和自动驾驶模型的训练方法、装置
CN114792106A (zh) * 2021-09-30 2022-07-26 上海商汤智能科技有限公司 视频语义分割方法、装置、电子设备及存储介质
CN114241407B (zh) * 2021-12-10 2023-05-23 电子科技大学 一种基于深度学习的近距离屏幕监控方法
CN115272165B (zh) * 2022-05-10 2023-09-26 推想医疗科技股份有限公司 图像的特征提取方法、图像分割模型的训练方法和装置
CN115861393B (zh) * 2023-02-16 2023-06-16 中国科学技术大学 图像匹配方法、航天器着陆点定位方法及相关装置
CN116128715B (zh) * 2023-02-20 2023-07-18 中国人民解放军军事科学院系统工程研究院 一种图形仿射变换方法及装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101482923A (zh) * 2009-01-19 2009-07-15 刘云 视频监控中人体目标的检测与性别识别方法
CN102216957A (zh) * 2008-10-09 2011-10-12 埃西斯创新有限公司 图像中对象的视觉跟踪以及图像分割
CN107146239A (zh) * 2017-04-21 2017-09-08 武汉大学 卫星视频运动目标检测方法及系统
CN108122234A (zh) * 2016-11-29 2018-06-05 北京市商汤科技开发有限公司 卷积神经网络训练及视频处理方法、装置和电子设备
CN108492297A (zh) * 2017-12-25 2018-09-04 重庆理工大学 基于深度级联卷积网络的mri脑肿瘤定位与瘤内分割方法
CN110188754A (zh) * 2019-05-29 2019-08-30 腾讯科技(深圳)有限公司 图像分割方法和装置、模型训练方法和装置

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
MX2007002087A (es) 2004-08-20 2007-10-16 Lewis Hyman Inc Metodo y aparato para poner un forro para visillo de ventana.
US8073216B2 (en) * 2007-08-29 2011-12-06 Vanderbilt University System and methods for automatic segmentation of one or more critical structures of the ear
JP5045320B2 (ja) * 2007-09-05 2012-10-10 ソニー株式会社 画像処理装置、および画像処理方法、並びにコンピュータ・プログラム
CN101719279A (zh) * 2009-12-23 2010-06-02 西北工业大学 星空图像背景运动估计方法
CN102456225B (zh) * 2010-10-22 2014-07-09 深圳中兴力维技术有限公司 一种运动目标检测与跟踪方法和系统
CN102740096A (zh) * 2012-07-13 2012-10-17 浙江工商大学 一种基于时空结合的动态场景立体视频匹配方法
CN104823444A (zh) * 2012-11-12 2015-08-05 行为识别系统公司 用于视频监控系统的图像稳定技术
US9129399B2 (en) * 2013-03-11 2015-09-08 Adobe Systems Incorporated Optical flow with nearest neighbor field fusion
CA2918295A1 (en) * 2013-07-15 2015-01-22 Tel Hashomer Medical Research, Infrastructure And Services Ltd. Mri image fusion methods and uses thereof
US10552962B2 (en) * 2017-04-27 2020-02-04 Intel Corporation Fast motion based and color assisted segmentation of video into region layers
CN108596184B (zh) * 2018-04-25 2021-01-12 清华大学深圳研究生院 图像语义分割模型的训练方法、可读存储介质及电子设备

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102216957A (zh) * 2008-10-09 2011-10-12 埃西斯创新有限公司 图像中对象的视觉跟踪以及图像分割
CN101482923A (zh) * 2009-01-19 2009-07-15 刘云 视频监控中人体目标的检测与性别识别方法
CN108122234A (zh) * 2016-11-29 2018-06-05 北京市商汤科技开发有限公司 卷积神经网络训练及视频处理方法、装置和电子设备
CN107146239A (zh) * 2017-04-21 2017-09-08 武汉大学 卫星视频运动目标检测方法及系统
CN108492297A (zh) * 2017-12-25 2018-09-04 重庆理工大学 基于深度级联卷积网络的mri脑肿瘤定位与瘤内分割方法
CN110188754A (zh) * 2019-05-29 2019-08-30 腾讯科技(深圳)有限公司 图像分割方法和装置、模型训练方法和装置

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112906463A (zh) * 2021-01-15 2021-06-04 上海东普信息科技有限公司 基于图像的火情检测方法、装置、设备及存储介质
CN113034580B (zh) * 2021-03-05 2023-01-17 北京字跳网络技术有限公司 图像信息检测方法、装置和电子设备
CN113034580A (zh) * 2021-03-05 2021-06-25 北京字跳网络技术有限公司 图像信息检测方法、装置和电子设备
CN113177483A (zh) * 2021-04-30 2021-07-27 北京百度网讯科技有限公司 视频目标分割方法、装置、设备以及存储介质
CN113177483B (zh) * 2021-04-30 2023-07-11 北京百度网讯科技有限公司 视频目标分割方法、装置、设备以及存储介质
CN113923493A (zh) * 2021-09-29 2022-01-11 北京奇艺世纪科技有限公司 一种视频处理方法、装置、电子设备以及存储介质
CN113923493B (zh) * 2021-09-29 2023-06-16 北京奇艺世纪科技有限公司 一种视频处理方法、装置、电子设备以及存储介质
CN114693934A (zh) * 2022-04-13 2022-07-01 北京百度网讯科技有限公司 语义分割模型的训练方法、视频语义分割方法及装置
CN114693934B (zh) * 2022-04-13 2023-09-01 北京百度网讯科技有限公司 语义分割模型的训练方法、视频语义分割方法及装置
CN115474084A (zh) * 2022-08-10 2022-12-13 北京奇艺世纪科技有限公司 一种视频封面图像的生成方法、装置、设备和存储介质
CN115474084B (zh) * 2022-08-10 2023-10-31 北京奇艺世纪科技有限公司 一种视频封面图像的生成方法、装置、设备和存储介质
CN117078761A (zh) * 2023-10-07 2023-11-17 深圳市爱博医疗机器人有限公司 细长型医疗器械自动定位方法、装置、设备以及介质
CN117078761B (zh) * 2023-10-07 2024-02-27 深圳爱博合创医疗机器人有限公司 细长型医疗器械自动定位方法、装置、设备以及介质
CN117132587A (zh) * 2023-10-20 2023-11-28 深圳微创心算子医疗科技有限公司 超声扫描导航方法、装置、计算机设备和存储介质
CN117132587B (zh) * 2023-10-20 2024-03-01 深圳微创心算子医疗科技有限公司 超声扫描导航方法、装置、计算机设备和存储介质

Also Published As

Publication number Publication date
CN110188754A (zh) 2019-08-30
CN110188754B (zh) 2021-07-13
US11900613B2 (en) 2024-02-13
US20210366126A1 (en) 2021-11-25

Similar Documents

Publication Publication Date Title
WO2020238902A1 (zh) 图像分割方法、模型训练方法、装置、设备及存储介质
US11908580B2 (en) Image classification method, computer-readable storage medium, and computer device
US11551333B2 (en) Image reconstruction method and device
Sun et al. Deep RGB-D saliency detection with depth-sensitive attention and automatic multi-modal fusion
WO2020186942A1 (zh) 目标检测方法、系统、装置、存储介质和计算机设备
CN111260055A (zh) 基于三维图像识别的模型训练方法、存储介质和设备
CN110475505A (zh) 利用全卷积网络的自动分割
CN113159073B (zh) 知识蒸馏方法及装置、存储介质、终端
CN110599528A (zh) 一种基于神经网络的无监督三维医学图像配准方法及系统
CN111325739A (zh) 肺部病灶检测的方法及装置,和图像检测模型的训练方法
CN111429421A (zh) 模型生成方法、医学图像分割方法、装置、设备及介质
WO2024021523A1 (zh) 基于图网络的大脑皮层表面全自动分割方法及系统
CN111951288A (zh) 一种基于深度学习的皮肤癌病变分割方法
CN104484886A (zh) 一种mr图像的分割方法及装置
CN110930378A (zh) 基于低数据需求的肺气肿影像处理方法及系统
Pan et al. Prostate segmentation from 3d mri using a two-stage model and variable-input based uncertainty measure
CN113902945A (zh) 一种多模态乳腺磁共振图像分类方法及系统
CN116342516A (zh) 基于模型集成的儿童手骨x光图像骨龄评估方法及系统
US20220351009A1 (en) Method and system for self-supervised learning of pillar motion for autonomous driving
Varghese et al. Unpaired image-to-image translation of structural damage
Midwinter et al. Unsupervised defect segmentation with pose priors
Pham et al. Seunet-trans: A simple yet effective unet-transformer model for medical image segmentation
Arora et al. Modified UNet++ model: a deep model for automatic segmentation of lungs from chest X-ray images
Graves et al. Siamese pyramidal deep learning network for strain estimation in 3D cardiac cine-MR
Joshi et al. Efficient diffeomorphic image registration using multi-scale dual-phased learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20812654

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20812654

Country of ref document: EP

Kind code of ref document: A1