WO2023060637A1 - Measurement method and measurement apparatus based on deep learning of tight box mark - Google Patents

Measurement method and measurement apparatus based on deep learning of tight box mark

Info

Publication number
WO2023060637A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
category
output
network
tight frame
Prior art date
Application number
PCT/CN2021/125152
Other languages
French (fr)
Chinese (zh)
Inventor
王娟
夏斌
Original Assignee
深圳硅基智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳硅基智能科技有限公司 filed Critical 深圳硅基智能科技有限公司
Publication of WO2023060637A1 publication Critical patent/WO2023060637A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present disclosure generally relates to the field of recognition technology based on deep learning, and specifically relates to a measurement method and a measurement device based on deep learning of tight frames.
  • the image often includes information of various targets, and the information of the target in the image can be automatically analyzed based on image processing technology. For example, in the medical field, tissue objects in medical images can be identified, and then the size of the tissue objects can be measured to monitor changes in the tissue objects.
  • the present disclosure is made in view of the above-mentioned situation, and an object thereof is to provide a measurement method and a measurement device based on tight-frame deep learning that can identify a target and accurately measure the target.
  • the first aspect of the present disclosure provides a measurement method based on deep learning of a tight frame, which is a measurement method that uses a network module trained based on the tight frame of the target to identify the target so as to achieve measurement.
  • the measurement method comprises: acquiring an input image comprising at least one target, the at least one target belonging to at least one category of interest; inputting the input image into the network module to obtain a first output and a second output, wherein the first output includes the probability that each pixel in the input image belongs to each category, the second output includes the offset between the position of each pixel in the input image and the tight frame of the target of each category, and the offset in the second output is used as the target offset; the network module includes a backbone network, a segmentation network for image segmentation based on weakly supervised learning, and a regression network based on bounding box regression, the backbone network is used to extract the feature map of the input image, the segmentation network takes the feature map as input to obtain the first output, and the regression network takes the feature map as input to obtain the second output; and identifying the target based on the first output and the second output to obtain the tight frame of the target of each category so as to realize measurement.
  • a network module including a backbone network, a segmentation network based on weakly supervised learning for image segmentation, and a regression network based on bounding box regression is constructed.
  • the network module is trained based on the tight frame of the target.
  • the backbone network receives the input image and extracts a feature map consistent with the resolution of the input image; the feature map is input into the segmentation network and the regression network to obtain the first output and the second output, and the tight frame of the target in the input image is then obtained based on the first output and the second output so as to realize measurement.
  • the network module trained based on the tight frame of the target can accurately predict the tight frame of the target in the input image, and the target can then be accurately measured based on its tight frame.
  • the network module is trained by the following method: constructing training samples, wherein the input image data of the training samples includes multiple images to be trained, the multiple images to be trained include images containing targets belonging to at least one category, and the label data of the training samples includes the gold standard of the category to which each target belongs and the gold standard of the tight frame of each target; obtaining, through the network module and based on the input image data of the training samples, the predicted segmentation data output by the segmentation network and the predicted offset output by the regression network; determining a training loss of the network module based on the label data corresponding to the training samples, the predicted segmentation data and the predicted offset; and training the network module based on the training loss to optimize the network module.
  • an optimized network module can be obtained.
  • determining the training loss of the network module based on the label data corresponding to the training samples, the predicted segmentation data and the predicted offset includes: obtaining the segmentation loss of the segmentation network based on the predicted segmentation data and the label data corresponding to the training samples; obtaining the regression loss of the regression network based on the predicted offset corresponding to the training samples and the real offset based on the label data, wherein the real offset is the offset between the position of each pixel of the image to be trained and the tight frame of the target in the label data; and obtaining the training loss of the network module based on the segmentation loss and the regression loss.
  • the predicted segmentation data of the segmentation network can be approximated to the label data by the segmentation loss
  • the predicted offset of the regression network can be approximated to the real offset by the regression loss.
  • the target offset is an offset normalized based on the average width and average height of targets of each category, or an offset normalized based on the average size of targets of each category.
  • multi-instance learning is used to obtain multiple bags to be trained by category based on the gold standard of the tight frame of the target in each image to be trained, and the segmentation loss is obtained based on the multiple bags to be trained of each category, wherein the multiple bags to be trained include multiple positive bags and multiple negative bags, all the pixel points on each of multiple straight lines connecting two opposite sides of the gold standard of the tight frame of the target are divided into one positive bag, and the multiple straight lines include at least one set of first parallel lines parallel to each other and second parallel lines respectively perpendicular to each set of first parallel lines.
  • each negative bag is a single pixel point in the area outside the gold standards of the tight frames of all targets of a category.
  • the segmentation loss includes a unary term and a pairwise term.
  • the unary term describes the degree to which each bag to be trained belongs to each real category.
  • the pairwise term describes the degree to which a pixel of the image to be trained and the pixels adjacent to it belong to the same category.
  • in this way, the segmentation loss can be obtained based on the positive and negative bags of multi-instance learning; the tight frame is constrained by both the positive and negative bags through the unary loss, and the predicted segmentation results are smoothed through the pairwise loss.
  • the angle of the first parallel line is the angle between the extension line of the first parallel line and the extension line of any side of the gold standard of the tight frame of the target that does not intersect the first parallel line, and the angle of the first parallel line is greater than -90° and less than 90°.
  • pixels falling within the gold standard of the tight frame of at least one target are selected from the images to be trained by category as positive samples of each category, and the matching tight frame corresponding to each positive sample is obtained to screen the positive samples of each category based on the matching tight frame; the screened positive samples of each category are then used to optimize the regression network, wherein the matching tight frame is, among the gold-standard tight frames that the positive sample falls into, the one with the smallest real offset relative to the position of the positive sample. In this way, the regression network can be optimized by using the positive samples of each category screened based on the matching tight frame.
  • pixel points are selected from the pixels of the image to be trained by category according to the expected intersection-over-union ratio corresponding to the pixels of the image to be trained.
  • the pixels whose expected intersection-over-union ratio is greater than a preset expected intersection-over-union ratio are used to optimize the regression network, wherein multiple frames of different sizes constructed with a pixel of the image to be trained as the center point are used, the maximum value of the intersection-over-union ratios between these frames and the matching tight frame of the pixel is taken as the expected intersection-over-union ratio, and the matching tight frame is, among the gold-standard tight frames that the pixel falls into, the one with the smallest real offset relative to the position of the pixel. In this way, positive samples that meet the preset expected intersection-over-union ratio can be obtained.
  • the expected intersection-over-union ratio of a pixel k satisfies the formula: eIoU(k) = max_{B ∈ B(k)} IoU(B, B*_k), where B(k) is the set of frames of different sizes constructed with the pixel k as the center point and B*_k is the matching tight frame of the pixel k.
  • thereby, the expected intersection-over-union ratio can be obtained.
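As an illustration of the screening rule described above, the expected intersection-over-union ratio can be computed as the best IoU achievable by any of several frames centred on the pixel. The following is a minimal Python sketch; the candidate frame sizes, the threshold and the helper names are assumptions for illustration, not part of the disclosure.

```python
def iou(box_a, box_b):
    """Intersection over union of two frames given as (xl, yt, xr, yb)."""
    ixl, iyt = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ixr, iyb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ixr - ixl) * max(0.0, iyb - iyt)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def expected_iou(pixel, matching_box, widths=(16, 32, 64, 128), heights=(16, 32, 64, 128)):
    """Maximum IoU between the matching tight frame and frames of several sizes
    centred on the pixel; the candidate sizes here are placeholders."""
    x, y = pixel
    candidates = [(x - w / 2, y - h / 2, x + w / 2, y + h / 2)
                  for w in widths for h in heights]
    return max(iou(c, matching_box) for c in candidates)

# a pixel is kept as a positive sample only if its expected IoU exceeds a preset threshold
keep = expected_iou((120, 96), (100, 80, 180, 150)) > 0.5
```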
  • the target is identified based on the first output and the second output to obtain the tight frame of the target of each category as follows: the position of the pixel with the highest local probability of belonging to each category is obtained from the first output as the first position, and the tight frame of the target of each category is obtained based on the target offset of the corresponding category at the position corresponding to the first position in the second output.
  • an object or objects of each category can be identified.
  • the sizes of multiple targets of the same category differ from each other by less than 10 times.
  • the accuracy of target recognition can be further improved.
  • the backbone network includes an encoding module and a decoding module
  • the encoding module is configured to extract image features at different scales
  • the decoding module is configured to map the image features extracted at different scales back to the resolution of the input image to output the feature map.
  • the input image is a fundus image
  • the target is an optic cup and/or an optic disc.
  • the target is identified based on the first output and the second output to obtain the tight frame of the optic cup and/or the optic disc so as to realize measurement as follows: the position of the pixel with the highest probability of belonging to each category is obtained from the first output as the first position, and the tight frame of the target of each category is obtained based on the target offset of the corresponding category at the position corresponding to the first position in the second output.
  • the optic cup and/or optic disc can be identified.
  • in the measurement method involved in the first aspect of the present disclosure, optionally, measurement is performed based on the tight frame of the optic cup and/or the tight frame of the optic disc in the fundus image to obtain the size of the optic cup and/or the optic disc, and the cup-to-disc ratio is obtained based on the sizes of the optic cup and the optic disc in the fundus image.
  • the ratio of the optic cup to the optic disc can be obtained.
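A minimal sketch of how the cup-to-disc ratio could be derived from the two tight frames; whether the vertical ratio, the horizontal ratio, or both are reported is an assumption of this sketch.

```python
def tight_frame_size(box):
    """Width and height of a tight frame given as (xl, yt, xr, yb)."""
    xl, yt, xr, yb = box
    return xr - xl, yb - yt

def cup_to_disc_ratio(cup_box, disc_box):
    """Vertical and horizontal cup-to-disc ratios computed from the two tight frames
    (which of the two ratios is reported is an assumption of this sketch)."""
    cup_w, cup_h = tight_frame_size(cup_box)
    disc_w, disc_h = tight_frame_size(disc_box)
    return cup_h / disc_h, cup_w / disc_w
```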
  • the second aspect of the present disclosure provides a measurement device based on deep learning of a tight frame, which is a measurement device that uses a network module trained based on the tight frame of the target to identify the target so as to achieve measurement.
  • the tight frame is the minimum circumscribed rectangle of the target
  • the measurement device includes an acquisition module, a network module and a recognition module; the acquisition module is configured to acquire an input image comprising at least one target, and the at least one target belongs to at least one category of interest
  • the network module is configured to receive the input image and obtain a first output and a second output based on the input image, the first output includes the probability that each pixel in the input image belongs to each category, the The second output includes the offset of the position of each pixel in the input image and the tight frame of the target of each category, and the offset in the second output is used as the target offset
  • the network module includes a backbone network, a segmentation network for image segmentation based on weakly supervised learning, and a regression network based on bounding box regression; the backbone network is used to extract the feature map of the input image, and the segmentation network and the regression network take the feature map as input to obtain the first output and the second output respectively; the recognition module is configured to identify the target based on the first output and the second output to obtain the tight frame of the target of each category so as to realize measurement.
  • a network module including a backbone network, a segmentation network based on weakly supervised learning for image segmentation, and a regression network based on bounding box regression is constructed.
  • the network module is trained based on the tight frame of the target.
  • the backbone network receives the input image and extracts a feature map consistent with the resolution of the input image; the feature map is input into the segmentation network and the regression network to obtain the first output and the second output, and the tight frame of the target in the input image is then obtained based on the first output and the second output so as to realize measurement.
  • the network module trained based on the tight frame of the target can accurately predict the tight frame of the target in the input image, and the target can then be accurately measured based on its tight frame.
  • the input image is a fundus image
  • the target is an optic cup and/or an optic disc.
  • the target is identified based on the first output and the second output to obtain the tight frame of the optic cup and/or the optic disc so as to realize measurement as follows: the position of the pixel with the highest probability of belonging to each category is obtained from the first output as the first position, and the tight frame of the target of each category is obtained based on the target offset of the corresponding category at the position corresponding to the first position in the second output.
  • the optic cup and/or optic disc can be identified.
  • according to the present disclosure, it is possible to provide a measurement method and a measurement device based on tight-frame deep learning that can identify a target and accurately measure the target.
  • FIG. 1 is a schematic diagram showing an application scenario of a tight-frame-based deep learning measurement method involved in an example of the present disclosure.
  • FIG. 2( a ) is a schematic diagram showing a fundus image related to an example of the present disclosure.
  • FIG. 2( b ) is a schematic diagram showing a recognition result of a fundus image according to an example of the present disclosure.
  • FIG. 3 is a schematic diagram showing one example of a network module involved in an example of the present disclosure.
  • FIG. 4 is a schematic diagram showing another example of a network module related to an example of the present disclosure.
  • FIG. 5 is a flowchart illustrating a training method of a network module related to an example of the present disclosure.
  • FIG. 6 is a schematic diagram showing positive bags involved in an example of the present disclosure.
  • FIG. 7 is a schematic diagram showing a frame constructed centering on a pixel point involved in an example of the present disclosure.
  • FIG. 8 is a flow chart showing a measurement method of tight-frame-based deep learning related to an example of the present disclosure.
  • FIG. 9 is a block diagram illustrating a measurement device for tight-frame-based deep learning according to an example of the present disclosure.
  • FIG. 10 is a flowchart illustrating a measurement method for a fundus image according to an example of the present disclosure.
  • FIG. 11 is a block diagram showing a measurement device for a fundus image according to an example of the present disclosure.
  • the present disclosure relates to a measurement method and a measurement device based on tight frame deep learning, which can identify targets and improve the accuracy of target measurement.
  • the measurement method involved in the present disclosure may also be referred to as an identification method or an auxiliary measurement method.
  • the measurement method involved in the present disclosure may be applicable to any application scenario in which the width and/or height of an object in an image is accurately measured.
  • the measurement method involved in the present disclosure is a measurement method for realizing measurement by using a network module trained based on a target tight frame to identify a target.
  • the tight frame can be the smallest bounding rectangle of the target.
  • the target is in contact with the four sides of the tight frame and does not overlap with the area outside the tight frame (that is, the target is tangent to the four sides of the tight frame).
  • the tight frame can represent the width and height of the object.
  • training the network module based on the tight frame of the target can reduce the time and labor cost of collecting pixel-level annotation data, and the network module can accurately identify the tight frame of the target.
  • the input images referred to in the present disclosure may come from cameras, CT scans, PET-CT scans, SPECT scans, MRI, ultrasound, X-rays, angiograms, fluorographs, images taken by capsule endoscopy, or combinations thereof.
  • the input image may be an image of a tissue object (eg, a fundus image).
  • the input image can be a natural image.
  • the natural image may be an image observed or captured in a natural scene.
  • FIG. 1 is a schematic diagram showing an application scenario of a tight-frame-based deep learning measurement method involved in an example of the present disclosure.
  • FIG. 2( a ) is a schematic diagram showing a fundus image related to an example of the present disclosure.
  • FIG. 2( b ) is a schematic diagram showing a recognition result of a fundus image according to an example of the present disclosure.
  • the measurement method involved in the present disclosure can be applied to the application scenario as shown in FIG. 1 .
  • the image of the target object 51, including the corresponding position of the target, can be collected by the acquisition device 52 (such as a camera) as an input image (see FIG. 1).
  • the input image is input to the network module 20 to identify the target in the input image and obtain the tight frame B of the target (see FIG. 1), and the target can then be measured based on the tight frame B.
  • among the tight frames, the tight frame B11 is the tight frame of the optic disc, and the tight frame B12 is the tight frame of the optic cup. In this case, the optic cup and the optic disc can be measured based on the tight frames.
  • the network module 20 involved in the present disclosure may be based on multitasking.
  • the network module 20 may be a deep learning based neural network.
  • the network module 20 may include two tasks: one task may be a segmentation network 22 for image segmentation based on weakly supervised learning (described later), and the other task may be a regression network 23 based on bounding box regression (described later).
  • segmentation network 22 may segment an input image to obtain objects (eg, optic cup and/or optic disc).
  • segmentation network 22 may be based on Multiple-Instance Learning (MIL) and supervised using tight frame labels.
  • the problem solved by segmentation network 22 may be a multi-label classification problem.
  • the input image may contain objects of at least one category of interest (may be simply referred to as category).
  • the segmentation network 22 is able to recognize input images containing objects of at least one class of interest.
  • the input image may also be free of any objects.
  • the number of objects of each category of interest may be greater than or equal to one.
  • regression network 23 may be used to predict tight boxes by category. In some examples, the regression network 23 may further predict the tight frame by predicting the offset of the tight frame relative to each pixel of the input image.
  • the network module 20 may also include a backbone network 21 .
  • the backbone network 21 can be used to extract the feature map of the input image (that is, the original image input to the network module 20).
  • backbone network 21 may extract high-level features for object representation.
  • the resolution of the feature map can be consistent with the input image (ie, the feature map can be single-scale and consistent with the size of the input image). As a result, the accuracy of recognition or measurement of a target whose size does not vary greatly can be improved.
  • image features of different scales can be continuously fused to obtain a feature map consistent with the scale of the input image.
  • feature maps may serve as input to segmentation network 22 and regression network 23 .
  • backbone network 21 may include an encoding module and a decoding module.
  • the encoding module can be configured to extract image features at different scales.
  • the decoding module may be configured to map image features extracted at different scales back to the resolution of the input image to output a feature map. Thereby, it is possible to obtain a feature map matching the resolution of the input image.
  • FIG. 3 is a schematic diagram showing one example of the network module 20 related to the example of the present disclosure.
  • the network module 20 may include a backbone network 21 , a segmentation network 22 and a regression network 23 .
  • the backbone network 21 can receive an input image and output a feature map.
  • the feature maps can be used as input to segmentation network 22 and regression network 23 to obtain corresponding outputs.
  • the segmentation network 22 may use the feature map as input to obtain a first output
  • the regression network 23 may use the feature map as input to obtain a second output.
  • the input image can be input to the network module 20 to obtain the first output and the second output.
  • the first output may be the result of image segmentation prediction.
  • the second output may be the result of the bounding box regression prediction.
  • the first output may include the probability that each pixel in the input image belongs to each category. In some examples, the probability that each pixel belongs to each category can be obtained through an activation function. In some examples, the first output can be a matrix. In some examples, the size of the matrix corresponding to the first output may be M × N × C, where M × N may represent the resolution of the input image, M and N may correspond to the rows and columns of the input image respectively, and C may represent the number of categories. For example, for a fundus image whose targets belong to the two categories of optic cup and optic disc, the size of the matrix corresponding to the first output may be M × N × 2.
  • the value corresponding to the pixel at each position in the input image in the first output may be a vector, and the number of elements in the vector may be consistent with the number of categories.
  • for the pixel at each position, the corresponding value in the first output may be a vector p_k.
  • the vector p_k may include C elements.
  • C may be the number of categories.
  • the values of the elements of the vector p_k may be values from 0 to 1.
  • the second output may include an offset between the position of each pixel point in the input image and the tight bounding box of each category of objects. That is, the second output may include the offset of the tight box for the object of the definite class.
  • what the regression network 23 predicts may be the offset of a tight box for an object of a definite class.
  • the offset in the second output may be used as the target offset.
  • the target offset may be a normalized offset.
  • the object offset may be an offset normalized based on the average size of objects of each class.
  • the object offset may be an offset normalized based on the average width and average height of objects of each category.
  • the target offset and the predicted offset may correspond to the real offset (described later). That is, if the real offset used when training the network module 20 (which can be referred to as the training phase) is normalized, then the target offset (corresponding to the measurement phase) and the predicted offset (corresponding to the training phase) are also normalized accordingly. As a result, the accuracy of recognition or measurement of a target whose size does not vary greatly can be improved.
  • the average size of the objects can be obtained by averaging the average width and average height of the objects.
  • the average size of objects, or the average width and average height of objects, may be empirical values.
  • the average size of the target can be obtained by statistically collecting samples corresponding to the input image.
  • the width and height of the tight boxes of objects in the sample's label data may be averaged separately by class to obtain an average width and an average height.
  • the average width and average height may be averaged to obtain an average size of objects of that class.
  • the samples may be training samples (described later). Thereby, the average width and the average height, or the average size of the objects, can be acquired through the training samples.
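A small Python sketch of how the per-category average width S_c1, average height S_c2 and average size could be collected from the tight frames in the training-sample label data; the data layout is an assumption.

```python
from collections import defaultdict

def average_box_stats(label_boxes):
    """label_boxes: list of (category, (xl, yt, xr, yb)) taken from the training-sample
    label data.  Returns per-category average width S_c1, average height S_c2 and
    average size (the mean of the two)."""
    widths, heights = defaultdict(list), defaultdict(list)
    for category, (xl, yt, xr, yb) in label_boxes:
        widths[category].append(xr - xl)
        heights[category].append(yb - yt)
    stats = {}
    for c in widths:
        s_c1 = sum(widths[c]) / len(widths[c])
        s_c2 = sum(heights[c]) / len(heights[c])
        stats[c] = (s_c1, s_c2, (s_c1 + s_c2) / 2)
    return stats
```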
  • the second output can be a matrix.
  • the size of the matrix corresponding to the second output can be M × N × A, where A can represent the size of all target offsets, M × N can represent the resolution of the input image, and M and N can correspond to the rows and columns of the input image respectively.
  • A can be C × 4.
  • C can represent the number of categories.
  • for example, when C is 2, the size of the matrix corresponding to the second output may be M × N × 8.
  • the value corresponding to the pixel at each position in the input image in the second output may be a vector.
  • the vector may be denoted v_k and may include C elements, where C can be the number of categories, and each element in v_k can represent the target offset of the target of the corresponding category.
  • thereby, the target offset and the corresponding category can be conveniently represented.
  • the elements of v_k may be 4-dimensional vectors.
  • the backbone network 21 may be based on a U-net network.
  • the coding module of the backbone network 21 may include a unit layer and a pooling layer (pooling layers).
  • the decoding module of the backbone network 21 may include a unit layer, an up-sampling layer (up-sampling layers, Up-sampling) and a skip connection unit (skip connection units, Skip-connection).
  • the unit layers may include convolutional layers, batch normalization layers, and rectified linear unit layers (ReLu).
  • the pooling layers (Pooling) may be max pooling layers (Max-pooling).
  • skip connection units may be used to combine image features from deep layers and image features from shallow layers.
  • segmentation network 22 may be a feed-forward neural network.
  • segmentation network 22 may include multiple layers of units.
  • segmentation network 22 may include multiple unit layers and convolutional layers (Conv).
  • the regression network 23 may include dilated convolution layers (Dilated Conv) and rectified linear unit layers (ReLU).
  • regression network 23 may include dilated convolutional layers, rectified linear unit layers, and convolutional layers.
  • FIG. 4 is a schematic diagram showing another example of the network module 20 related to the example of the present disclosure.
  • as shown in FIG. 4, the network layers in the network module 20 are distinguished by the numbers on the arrows: arrow 1 represents the network layer composed of a convolutional layer, a batch normalization layer and a rectified linear unit layer (that is, the unit layer), arrow 2 represents the network layer composed of a dilated convolution layer and a rectified linear unit layer, arrow 3 represents a convolution layer, arrow 4 represents a max pooling layer, arrow 5 represents an upsampling layer, and arrow 6 represents a skip connection unit.
  • an input image with a resolution of 256 × 256 can be input to the network module 20; image features are extracted through the unit layers (see arrow 1) and max pooling layers (see arrow 4) at different levels of the encoding module, and image features of different scales are continuously fused through the unit layers (see arrow 1), upsampling layers (see arrow 5) and skip connection units (see arrow 6) at different levels of the decoding module to obtain the feature map 221 with the same scale as the input image; the feature map 221 is then input into the segmentation network 22 and the regression network 23 respectively to obtain the first output and the second output.
  • the segmentation network 22 can be composed, in turn, of a unit layer (see arrow 1) and a convolutional layer (see arrow 3), and the regression network 23 can be composed, in turn, of a plurality of network layers each consisting of a dilated convolutional layer and a rectified linear unit layer (see arrow 2), and a convolutional layer (see arrow 3).
  • the unit layer can be composed of convolution layer, batch normalization layer and rectified linear unit layer.
  • the size of the convolution kernel of the convolution layers in the network module 20 may be set to 3 × 3. In some examples, the kernel size of the max pooling layers in the network module 20 may be set to 2 × 2 with a stride of 2. In some examples, the scale factor of the upsampling layers in the network module 20 may be set to 2. In some examples, as shown in FIG. 4, the dilation factors of the multiple dilated convolutional layers in the network module 20 can be set to 1, 1, 2, 4, 8 and 16 in sequence (see the numbers above the arrows labeled 2). In some examples, as shown in FIG. 4, the number of max pooling layers may be five; thus, the size of the input image is reduced by a factor of 32 at the deepest level (32 being 2 to the fifth power).
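The layout described above can be sketched in PyTorch as follows. This is a reduced illustration, not the disclosed implementation: the channel widths, the two-level encoder depth (FIG. 4 uses five pooling levels), the three input channels and the sigmoid activation on the segmentation head are assumptions.

```python
import torch
import torch.nn as nn

def unit(in_ch, out_ch):
    # "unit layer": 3x3 convolution + batch normalization + ReLU
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class TightBoxNet(nn.Module):
    """Reduced sketch of the backbone / segmentation / regression layout."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.enc1, self.enc2 = unit(3, 32), unit(32, 64)
        self.pool = nn.MaxPool2d(2, stride=2)
        self.bottom = unit(64, 64)
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.dec2, self.dec1 = unit(64 + 64, 64), unit(64 + 32, 32)
        # segmentation head: unit layer + convolution -> per-category probabilities
        self.seg_head = nn.Sequential(unit(32, 32), nn.Conv2d(32, num_classes, 3, padding=1))
        # regression head: dilated convolutions (dilation 1,1,2,4,8,16) + convolution
        reg = []
        for d in (1, 1, 2, 4, 8, 16):
            reg += [nn.Conv2d(32, 32, 3, padding=d, dilation=d), nn.ReLU(inplace=True)]
        reg += [nn.Conv2d(32, 4 * num_classes, 3, padding=1)]
        self.reg_head = nn.Sequential(*reg)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottom(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))   # feature map at input resolution
        first_output = torch.sigmoid(self.seg_head(d1))       # per-pixel, per-category probabilities
        second_output = self.reg_head(d1)                      # per-pixel target offsets (4 per category)
        return first_output, second_output
```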
  • the measurement method involved in the present disclosure is a measurement method that uses the network module 20 trained based on the tight frame of the target to recognize the target and realize the measurement.
  • the training method of the network module 20 involved in the present disclosure (which may be referred to as the training method for short) will be described in detail with reference to the accompanying drawings.
  • FIG. 5 is a flowchart showing a training method of the network module 20 according to an example of the present disclosure.
  • the segmentation network 22 and the regression network 23 in the network module 20 can be trained simultaneously on an end-to-end basis.
  • the segmentation network 22 and the regression network 23 in the network module 20 can be jointly trained to simultaneously optimize the segmentation network 22 and the regression network 23 .
  • the segmentation network 22 and the regression network 23 can adjust the network parameters of the backbone network 21 through backpropagation, so that the feature map output by the backbone network 21 can better express the characteristics of the input image and input the segmentation Network 22 and Regression Network 23.
  • both the segmentation network 22 and the regression network 23 perform processing based on the feature maps output by the backbone network 21 .
  • segmentation network 22 may be trained using multi-instance learning.
  • the pixel points used for training the regression network 23 may be selected by using the expected cross-over-union ratio corresponding to the pixel points of the image to be trained (described later).
  • the training method may include constructing training samples (step S120), inputting the training samples into the network module 20 to obtain prediction data (step S140), and determining the training loss of the network module 20 based on the training samples and the prediction data and optimizing the network module 20 based on the training loss (step S160).
  • an optimized (also called trained) network module 20 can be obtained.
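A minimal PyTorch-style sketch of the joint, end-to-end optimisation of steps S120 to S160; the data-loader format and the two loss callables (for example, as sketched further below) are assumptions.

```python
import torch

def train(network, loader, segmentation_loss, regression_loss, epochs=50, lr=1e-3):
    """Jointly optimise the backbone, segmentation network and regression network."""
    optimiser = torch.optim.Adam(network.parameters(), lr=lr)
    for _ in range(epochs):
        for images, label_boxes in loader:             # training samples (step S120)
            first_out, second_out = network(images)    # prediction data (step S140)
            loss = segmentation_loss(first_out, label_boxes) \
                 + regression_loss(second_out, label_boxes)   # training loss (step S160)
            optimiser.zero_grad()
            loss.backward()        # back-propagation updates all three sub-networks
            optimiser.step()
```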
  • the training samples may include input image data and label data.
  • the input image data may include multiple images to be trained.
  • the image to be trained may be a fundus image to be trained.
  • the plurality of images to be trained may include images containing objects. In some examples, the plurality of images to be trained may include images containing objects and images not containing objects. In some examples, a target may belong to at least one category. In some examples, the number of objects of each category in the image to be trained may be greater than or equal to one. For example, taking the fundus image as an example, if the optic cup and the optic disc are identified or measured, the target in the fundus image may be an optic disc and an optic cup. The examples of the present disclosure do not intend to limit the number of objects, the categories to which the objects belong, and the number of objects of each category.
  • the label data may include the gold standard of the category to which the target belongs (the gold standard of the category may also sometimes be referred to as the real category) and the gold standard of the tight frame of the target (the gold standard of the tight frame may also sometimes be referred to as the real tight frame).
  • the tight frame of the target in the label data or the category of the target in the training method can be the gold standard by default.
  • a labeling tool can be used to mark the tight frame (that is, the smallest bounding rectangle) of the target in the training image, and set the corresponding category for the tight frame to represent the true category of the target to obtain label data .
  • the prediction data corresponding to the training samples can be obtained through the network module 20 based on the input image data of the training samples.
  • the predicted data may include predicted segmentation data output by the segmentation network 22 and predicted offsets output by the regression network 23.
  • the predicted segmentation data may correspond to the first output.
  • the predicted offset may correspond to the second output (ie, may correspond to the target offset). That is, the predicted segmentation data may include the probability that each pixel in the image to be trained belongs to each category, and the predicted offset may include the offset between the position of each pixel in the image to be trained and the tight frame mark of each category.
  • the predicted offset may be an offset normalized based on the average size of objects of each category.
  • the sizes of multiple objects of the same category may differ from each other by less than 10 times.
  • the sizes of multiple objects of the same category may differ from each other by 1 time, 2 times, 3 times, 5 times, 7 times, 8 times or 9 times, etc.
  • the accuracy of target recognition or measurement can be further improved.
  • for a pixel at position (x, y), the offset t relative to the tight frame b = (xl, yt, xr, yb) of a target of the c-th category may satisfy formula (1): t = ((xl - x)/S_c1, (yt - y)/S_c2, (xr - x)/S_c1, (yb - y)/S_c2).
  • xl, yt can represent the position of the upper left corner of the tight frame of the target.
  • xr, yb can represent the position of the lower right corner of the tight frame of the target.
  • c can represent the index of the category to which the target belongs.
  • S_c1 can represent the average width of the target of the c-th category.
  • S_c2 can represent the average height of the target of the c-th category. Thereby, a normalized offset can be obtained.
  • S_c1 and S_c2 may both be the average size of objects of the c-th category.
  • the tight frame of the target can also be represented by the position of the lower left corner and the upper right corner, or the tight frame of the target can be represented by the position, length and width of any corner .
  • normalization may also be performed in other ways, for example, the offset may be normalized by using the length and width of the tight frame of the target.
  • the pixel points in formula (1) may be the pixel points of the image to be trained or the input image. That is, formula (1) can be applied to the real offset corresponding to the image to be trained in the training phase and the target offset corresponding to the input image in the measurement phase.
  • the pixels can be the pixels in the image to be trained
  • the tight frame b of the target can be the gold standard of the tight frame of the target in the image to be trained
  • the offset t can be the real offset (also can be called the gold standard for offset).
  • the regression loss of the regression network 23 can be subsequently obtained based on the predicted offset and the actual offset.
  • the pixel point is the pixel point in the image to be trained
  • the offset t is the prediction offset
  • the tight frame of the predicted target can be deduced according to the formula (1).
  • the pixel points can be the pixel points in the input image
  • the offset t can be the target offset
  • the tight frame of the target in the input image can be deduced according to formula (1) and the target offset (that is, the target offset and the position of the pixel point can be substituted into formula (1) to obtain the tight frame of the target).
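A small Python sketch of formula (1) as reconstructed above and of its inversion used in the measurement phase; the sign convention follows that reconstruction and the function names are illustrative only.

```python
def encode_offset(pixel, box, s_c1, s_c2):
    """Normalised offset t of a pixel (x, y) relative to a tight frame (xl, yt, xr, yb)
    of a category with average width s_c1 and average height s_c2."""
    x, y = pixel
    xl, yt, xr, yb = box
    return ((xl - x) / s_c1, (yt - y) / s_c2, (xr - x) / s_c1, (yb - y) / s_c2)

def decode_offset(pixel, offset, s_c1, s_c2):
    """Invert formula (1): recover a tight frame from a pixel position and its target
    offset, as done in the measurement phase."""
    x, y = pixel
    t1, t2, t3, t4 = offset
    return (x + t1 * s_c1, y + t2 * s_c2, x + t3 * s_c1, y + t4 * s_c2)

# measurement-phase sketch: the pixel with the highest probability for a category
# (the first position) would be taken from the first output, and the target offset
# at that position in the second output would be decoded into the tight frame.
```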
  • step S160 the training loss of the network module 20 can be determined based on the label data corresponding to the training samples, the predicted segmentation data and the predicted offset, and then the network module 20 is trained based on the training loss to optimize the network module 20 .
  • the network module 20 may include a segmentation network 22 and a regression network 23 .
  • the training loss of the network module 20 can be obtained.
  • the network module 20 can be optimized based on the training loss.
  • the training loss may be the sum of segmentation loss and regression loss.
  • the segmentation loss may indicate the extent to which the pixels in the image to be trained in the predicted segmentation data belong to each real category, and the regression loss may indicate the closeness of the predicted offset to the actual offset.
  • FIG. 6 is a schematic diagram showing positive bags involved in an example of the present disclosure.
  • the segmentation loss of the segmentation network 22 can be obtained based on the predicted segmentation data and label data corresponding to the training samples. In this way, the predicted segmentation data of the segmentation network 22 can be approximated to the label data by the segmentation loss.
  • segmentation loss can be obtained using multi-instance learning. In multi-instance learning, multiple bags to be trained can be obtained by category based on the real tight frame of the target in each image to be trained (that is, each category can correspond to multiple bags to be trained). Segmentation loss can be obtained based on multiple bags to be trained for each category.
  • the plurality of bags to be trained may include a plurality of positive bags and a plurality of negative bags. Thus, the segmentation loss can be obtained based on the positive and negative bags of multi-instance learning. It should be noted that, unless otherwise specified, the following positive and negative bags are for each category.
  • multiple positive bags may be obtained based on the area within the real tight frame of the target.
  • the area A2 in the image P1 to be trained is the area within the real tight frame B21 of the target T1.
  • all the pixel points on each of the multiple straight lines connecting two opposite sides of the real tight frame of the target can be divided into a positive bag (that is, a straight line can correspond to a positive bag) .
  • the two ends of each straight line may be at the upper end and the lower end, or the left end and the right end of the real tight frame.
  • the pixel points on the straight line D1, straight line D2, straight line D3, straight line D4, straight line D5, straight line D6, straight line D7 and straight line D8 can each be divided into a positive bag.
  • the examples of the present disclosure are not limited thereto, and in other examples, other ways may also be used to divide positive bags.
  • the pixels at a specific position of the real tight frame can be divided into a positive bag.
  • the plurality of straight lines may include at least one set of first parallel lines that are parallel to each other.
  • the plurality of straight lines may include one set of first parallel lines, two sets of first parallel lines, three sets of first parallel lines, or four sets of first parallel lines, and the like.
  • the number of straight lines in the first parallel line may be greater than or equal to two.
  • the plurality of straight lines may include at least one set of first parallel lines that are parallel to each other and second parallel lines that are respectively perpendicular to each set of first parallel lines. Specifically, if the multiple straight lines include a set of first parallel lines, then the multiple straight lines may also include a set of second parallel lines perpendicular to the set of first parallel lines; if the multiple straight lines include multiple sets of first parallel lines, Then the multiple straight lines may further include multiple sets of second parallel lines perpendicular to each set of first parallel lines.
  • a group of first parallel lines may include parallel straight lines D1 and straight lines D2, and a group of second parallel lines corresponding to the group of first parallel lines may include parallel straight lines D3 and straight lines D4, wherein the straight line D1 Can be perpendicular to straight line D3;
  • Another group of first parallel lines can include parallel straight line D5 and straight line D6, and a group of second parallel lines corresponding to this group of first parallel lines can include parallel straight line D7 and straight line D8, wherein, The straight line D5 may be perpendicular to the straight line D7.
  • the number of straight lines in the first parallel line and the second parallel line may be greater than or equal to two.
  • the plurality of straight lines may include multiple sets of first parallel lines (ie, the plurality of straight lines may include parallel lines at different angles).
  • in this way, it is possible to optimize the segmentation network 22 by dividing positive bags at different angles. Thereby, the accuracy of the segmentation data predicted by the segmentation network 22 can be improved.
  • the angle of the first parallel line may be the angle between the extension line of the first parallel line and the extension line of any non-intersecting side of the real tight frame, and the angle of the first parallel line may be greater than -90 ° and less than 90°.
  • the included angle may be -89°, -75°, -50°, -25°, -20°, 0°, 10°, 20°, 25°, 50°, 75° or 89°.
  • if the extension line of the non-intersecting side rotates clockwise by less than 90° to the extension line of the first parallel line, the angle formed can be greater than 0° and less than 90°; if the extension line of the non-intersecting side rotates counterclockwise by less than 90° (that is, clockwise by more than 270°) to the extension line of the first parallel line, the angle formed can be greater than -90° and less than 0°; and if the non-intersecting side is parallel to the first parallel line, the included angle can be 0°.
  • the angles of the straight lines D1, D2, D3 and D4 can be 0°, and the angles of the straight lines D5, D6, D7 and D8 (that is, the angle C1) can be 25°.
  • the angle of the first parallel can be a hyperparameter that can be optimized during training.
  • the angle of the first parallel line can also be described in a manner of rotation of the image to be trained.
  • the angle of the first parallel line may be the angle of rotation.
  • the angle of the first parallel line can be the rotation angle by which the image to be trained is rotated so that any side of the image to be trained that does not intersect the first parallel line becomes parallel to the first parallel line.
  • the rotation angle can be greater than -90° and less than 90°.
  • the rotation angle for clockwise rotation can be positive.
  • the rotation angle for counterclockwise rotation can be negative.
  • the examples of the present disclosure are not limited thereto.
  • the angle of the first parallel line may also be in other ranges according to the way of describing the angle of the first parallel line. For example, if the description is based on the side of the real tight frame intersecting the first parallel line, the angle of the first parallel line may also be greater than 0° and less than 180°.
  • multiple negative bags may be obtained based on regions outside the true tight bounding box of the object.
  • the area A1 in the image P1 to be trained is an area outside the real tight frame B21 of the target T1 .
  • each negative bag may be a single pixel in an area outside the real tight frames of all objects of a category (that is, one pixel may correspond to one negative bag).
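A Python sketch of how the positive and negative bags could be built for one category, using the 0° parallel lines (rows and columns inside the gold-standard tight frame); bags for other angles could be obtained by rotating the probability map first, for example with scipy.ndimage.rotate. The integer box coordinates and function names are assumptions.

```python
import numpy as np

def positive_bags(prob_map, box):
    """prob_map: H x W predicted probabilities for one category.
    box: gold-standard tight frame (xl, yt, xr, yb) with integer coordinates.
    Returns one bag per row and per column of the region inside the frame."""
    xl, yt, xr, yb = box
    region = prob_map[yt:yb + 1, xl:xr + 1]
    rows = [region[i, :] for i in range(region.shape[0])]   # lines joining left/right sides
    cols = [region[:, j] for j in range(region.shape[1])]   # lines joining top/bottom sides
    return rows + cols

def negative_bags(prob_map, boxes):
    """Each pixel outside every gold-standard tight frame of the category is a
    negative bag of size one."""
    mask = np.ones(prob_map.shape, dtype=bool)
    for xl, yt, xr, yb in boxes:
        mask[yt:yb + 1, xl:xr + 1] = False
    return prob_map[mask]          # flat array; each element is its own bag
```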
  • a segmentation loss may be obtained based on a number of bags to be trained for each category.
  • the segmentation loss may include a unary term (also known as a unary loss) and a pairwise term (also known as a pairwise loss).
  • the unary term may describe the degree to which each bag to be trained belongs to each real category. In this case, the tight frame can be constrained by both the positive and negative bags through the unary loss.
  • the pairwise term may describe the degree to which a pixel of the image to be trained and the pixels adjacent to it belong to the same category. In this case, the pairwise loss smooths the predicted segmentation results.
  • the segmentation loss can be obtained by category, and the total segmentation loss can be obtained based on the segmentation losses of the categories.
  • the total segmentation loss L_seg can satisfy the formula: L_seg = Σ_{c=1}^{C} L_c.
  • L_c can represent the segmentation loss of category c.
  • C can represent the number of categories. For example, if the optic cup and the optic disc in the fundus image are identified, C can be 2, and if only the optic cup or only the optic disc is identified, then C can be 1.
  • the segmentation loss L_c for category c can satisfy the formula: L_c = φ_c + λ·ψ_c.
  • φ_c can represent the unary term and ψ_c can represent the pairwise term.
  • P can represent the probability, predicted by the segmentation network 22, that each pixel point belongs to each category.
  • λ can represent a weight factor.
  • the weight factor λ can be a hyperparameter, which can be optimized during the training process.
  • the weight factor λ can be used to balance the two losses (i.e., the unary term and the pairwise term).
  • since each positive bag of a category includes at least one pixel belonging to the category, the pixel with the highest probability of belonging to the category in each positive bag can be used as a positive sample of the category; since there is no pixel belonging to the category in any negative bag of the category, even the pixel with the highest probability in a negative bag is a negative sample of the category.
  • the unary term φ_c corresponding to category c can satisfy the formula:
  • P_c(b) can represent the probability that a bag to be trained b belongs to category c (which can also be called the degree to which the bag belongs to category c), b can represent a bag to be trained, B+ can represent the collection of multiple positive bags, B- can represent the collection of multiple negative bags, max can represent the maximum value function, |B+| can represent the cardinality of the collection of multiple positive bags (that is, the number of elements in the collection), β can represent a weight factor, and γ can represent a focusing parameter.
  • the value of the unary term is at its minimum when P_c(b) corresponding to each positive bag is equal to 1 and P_c(b) corresponding to each negative bag is equal to 0; that is, the unary loss is then the smallest.
  • weighting factor ⁇ may be between 0-1.
  • the focus parameter ⁇ may be greater than or equal to zero.
  • P_c(b) may be the maximum probability of belonging to category c among the pixels of a bag to be trained.
  • the maximum probability of belonging to a category among the pixels of a bag to be trained (that is, P_c(b)) may be obtained based on a smooth maximum approximation function.
  • the smooth maximum approximation function may be at least one of an α-softmax function and an α-quasimax function.
  • max may represent the maximum value function, i.e., max{x_1, ..., x_n}.
  • n may represent the number of elements (and may correspond to the number of pixels in the bag to be trained).
  • x_i can represent the value of the i-th element (and can correspond to the probability that the pixel at the i-th position of the bag to be trained belongs to a category).
  • the α-softmax function can satisfy the formula: α-softmax(x_1, ..., x_n) = (Σ_i x_i·e^(α·x_i)) / (Σ_i e^(α·x_i)), wherein α can be a constant.
  • the larger ⁇ is, the closer to the maximum value of the maximum function.
  • the α-quasimax function can satisfy the formula: α-quasimax(x_1, ..., x_n) = (1/α)·ln(Σ_i e^(α·x_i)) - (ln n)/α, wherein α can be a constant. In some examples, the larger α is, the closer the approximation is to the maximum value of the maximum function.
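The two smooth maximum approximations, together with one plausible focal-style unary term built on top of them, can be sketched as follows. The unary term shown here is an assumption consistent with the description (it is minimised when positive-bag probabilities approach 1 and negative-bag probabilities approach 0), not the exact formula of the disclosure.

```python
import math
import torch

def alpha_softmax(x, alpha=4.0):
    """Smooth approximation of max(x): sum_i x_i*exp(alpha*x_i) / sum_i exp(alpha*x_i)."""
    w = torch.softmax(alpha * x, dim=-1)
    return (w * x).sum(dim=-1)

def alpha_quasimax(x, alpha=4.0):
    """Smooth approximation of max(x): (logsumexp(alpha*x) - log n) / alpha."""
    n = x.shape[-1]
    return (torch.logsumexp(alpha * x, dim=-1) - math.log(n)) / alpha

def unary_term(pos_bag_probs, neg_bag_probs, beta=0.25, gamma=2.0):
    """Assumed focal-style unary term; pos_bag_probs / neg_bag_probs are 1-D tensors of
    P_c(b) per bag, which would themselves be obtained with a smooth maximum above."""
    eps = 1e-6
    pos = -((1 - pos_bag_probs) ** gamma * torch.log(pos_bag_probs + eps)).sum()
    neg = -(neg_bag_probs ** gamma * torch.log(1 - neg_bag_probs + eps)).sum()
    return (beta * pos + (1 - beta) * neg) / max(pos_bag_probs.numel(), 1)
```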
  • the pairwise term may describe the degree to which a pixel of the image to be trained and its neighbors belong to the same category. That is, the pairwise term can evaluate the closeness of the probability that adjacent pixels belong to the same class.
  • the pairwise term ψ_c for category c can satisfy the formula: ψ_c = (1/|ε|)·Σ_{(k, k') ∈ ε} (p_kc - p_k'c)^2.
  • ε can represent the set of all pairs of adjacent pixels.
  • (k, k') can represent a pair of adjacent pixels.
  • k and k' can represent the positions of the two pixels of an adjacent pixel pair.
  • p_kc can represent the probability that the pixel at the k-th position belongs to category c.
  • p_k'c can represent the probability that the pixel at the k'-th position belongs to category c.
  • adjacent pixel points may be eight-neighborhood or four-neighborhood pixel points.
  • adjacent pixel points of each pixel point in the image to be trained may be acquired to obtain a set of adjacent pixel point pairs.
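A sketch of the pairwise term for one category using the four-neighbourhood; the squared-difference form is an assumption consistent with the description above.

```python
def pairwise_term(prob_c):
    """Average squared difference of four-neighbourhood pixel pairs in the predicted
    probability map of one category (a torch tensor of shape H x W)."""
    dy = (prob_c[1:, :] - prob_c[:-1, :]) ** 2   # vertically adjacent pairs
    dx = (prob_c[:, 1:] - prob_c[:, :-1]) ** 2   # horizontally adjacent pairs
    return (dy.sum() + dx.sum()) / (dy.numel() + dx.numel())
```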
  • training loss can include regression loss.
  • the regression loss of the regression network 23 can be obtained based on the predicted offset corresponding to the training samples and the actual offset corresponding to the label data. In this case, the predicted offset of the regression network 23 can be approximated to the true offset by the regression loss.
  • the real offset may be the offset between the position of the pixel of the image to be trained and the real tight frame of the target in the label data.
  • the real offset may be an offset normalized based on the average size of objects of each category. For specific content, refer to the relevant description about the offset in the above formula (1).
  • corresponding pixel points in the image to be trained may be selected as positive samples to train the regression network 23 . That is, the regression network 23 can be optimized by using positive samples. Specifically, the regression loss can be obtained based on the positive samples, and then the regression network 23 can be optimized using the regression loss.
  • the regression loss L_reg can satisfy the formula: L_reg = Σ_{c=1}^{C} (1/M_c)·Σ_{i=1}^{M_c} s(t_ic - v_ic).
  • C can represent the number of categories.
  • M_c can represent the number of positive samples of the c-th category.
  • t_ic can represent the true offset corresponding to the i-th positive sample of the c-th category.
  • v_ic can represent the predicted offset corresponding to the i-th positive sample of the c-th category.
  • s(x) can represent the sum of the smooth L1 losses of all elements in x.
  • s(t_ic - v_ic) can represent the degree to which the predicted offset corresponding to the i-th positive sample of the c-th category agrees with the true offset corresponding to that positive sample, measured using the smooth L1 loss.
  • the positive samples may be pixels in the image to be trained that are selected for training the regression network 23 (that is, for calculating the regression loss). Thereby, the regression loss can be obtained.
  • the true offset corresponding to the positive sample may be the offset corresponding to the true tight frame. In some examples, the true offset corresponding to the positive sample may be the offset corresponding to the matching tight box. Therefore, it can be applied to the situation where positive samples fall into multiple real tight frames.
  • the smooth L1 loss function can satisfy the formula: smooth_L1(x) = 0.5·x²/δ if |x| < δ, and |x| - 0.5·δ otherwise.
  • δ can represent a hyperparameter, which is used to switch between the L1-like and L2-like behaviour of the smooth L1 loss function.
  • x can represent the variable of the smooth L1 loss function.
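A sketch of the regression loss using PyTorch's built-in smooth L1 loss; the per-category data layout is an assumption.

```python
import torch.nn.functional as F

def regression_loss(pred_offsets, true_offsets, positive_masks):
    """pred_offsets / true_offsets: per-category tensors of shape (num_pixels, 4);
    positive_masks: per-category boolean tensors marking the screened positive samples.
    Sums over categories the mean smooth-L1 loss of the positive samples, matching the
    L_reg formula sketched above."""
    total = 0.0
    for c in pred_offsets:
        m = positive_masks[c]
        if m.any():
            total = total + F.smooth_l1_loss(
                pred_offsets[c][m], true_offsets[c][m],
                reduction='sum', beta=1.0) / m.sum()
    return total
```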
  • corresponding pixel points in the image to be trained may be selected as positive samples to train the regression network 23 .
  • the positive samples can be the pixels in the image to be trained that fall into at least one real tight frame of the target (that is, the pixels that fall into the real tight frame of at least one target can be selected from the image to be trained pixel as a positive sample).
  • optimizing the regression network 23 based on the pixels falling within the true tight frame of at least one object can improve the efficiency of the regression network 23 optimization.
  • pixels falling within at least one real tight bounding box of an object may be selected from the image to be trained by category as positive samples of each category.
  • the regression loss of each category can be obtained based on the positive samples of each category.
  • the aforementioned positive samples of each category may be screened, and the regression network 23 may be optimized based on the screened positive samples. That is, the positive samples used to calculate the regression loss can be filtered positive samples.
  • the matching tight frame corresponding to each positive sample can be obtained, and the positive samples of each category can then be filtered based on the matching tight frame. In this way, the regression network 23 can be optimized by using the positive samples of each category screened based on matching tight frames.
  • the real tight frame that a pixel (for example, a positive sample) falls into can be filtered to obtain a matching tight frame for the pixel.
  • the matching tight frame may be the real tight frame that the pixel of the image to be trained falls into and that has the smallest real offset relative to the position of the pixel.
  • for a positive sample, the matching tight frame may be the real tight frame, among those the positive sample falls into, with the smallest real offset relative to the position of the positive sample.
  • if the pixel falls into the real tight frame of only one object to be measured, that real tight frame is used as the matching tight frame (that is, the matching tight frame can be the real tight frame that the pixel falls into); if the pixel falls into the real tight frames of multiple objects to be measured, the real tight frame with the smallest real offset relative to the position of the pixel is taken as the matching tight frame. In this way, the matching tight frame corresponding to the pixel can be obtained.
  • the smallest real offset (that is, the real tight frame with the smallest real offset) can be obtained by comparing the L1 norms of the real offsets.
  • the smallest real offset can be obtained based on the L1 norm, and the matching tight frame can then be obtained.
  • the absolute values of the elements of each real offset among the multiple real offsets can be summed to obtain multiple offset values, and the real offset with the smallest offset value, found by comparing the multiple offset values, can be taken as the smallest real offset.
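A minimal sketch of the matching-tight-frame selection just described: the true offset of each candidate real tight frame is encoded according to formula (1), the L1 norm (sum of absolute elements) of each offset is compared, and the frame with the smallest value is returned. The helper names and the box format (xl, yt, xr, yb) are assumptions.

```python
import numpy as np

def encode_offset(x, y, box, avg_w, avg_h):
    """Normalized true offset of tight box b = (xl, yt, xr, yb) relative to pixel (x, y),
    following formula (1) with the category's average width and height."""
    xl, yt, xr, yb = box
    return np.array([(x - xl) / avg_w, (y - yt) / avg_h,
                     (xr - x) / avg_w, (yb - y) / avg_h])

def matching_tight_box(x, y, boxes, avg_w, avg_h):
    """Among the true tight boxes the pixel falls into, return the one whose
    normalized offset has the smallest L1 norm (sum of absolute elements)."""
    containing = [b for b in boxes if b[0] <= x <= b[2] and b[1] <= y <= b[3]]
    if not containing:
        return None  # pixel has no matching tight box
    l1_values = [np.abs(encode_offset(x, y, b, avg_w, avg_h)).sum() for b in containing]
    return containing[int(np.argmin(l1_values))]

# Example: a pixel falling inside two overlapping true tight boxes
boxes = [(10, 10, 60, 60), (30, 30, 80, 90)]
print(matching_tight_box(40, 40, boxes, avg_w=50.0, avg_h=60.0))  # -> (10, 10, 60, 60)
```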
  • the positive samples of each category may be screened by using the expected intersection-over-union ratio corresponding to the pixel points (for example, positive samples).
  • pixels far away from the center of the true tight frame or the matching tight frame can be filtered out. In this way, it is possible to reduce the adverse effect of pixels away from the center on the optimization of the regression network 23 and to improve the efficiency of the optimization of the regression network 23 .
  • the expected intersection-over-union ratio corresponding to a positive sample can be obtained based on its matching tight frame, and the positive samples of each category can be screened based on the expected intersection-over-union ratio. Specifically, after obtaining the positive samples of each category, the matching tight frame corresponding to each positive sample can be obtained, the expected intersection-over-union ratio corresponding to the positive sample can then be obtained based on the matching tight frame, the positive samples of each category can be screened based on the expected intersection-over-union ratio, and finally the regression network 23 can be optimized by using the screened positive samples of each category. But the examples of the present disclosure are not limited thereto.
  • the pixel points of the image to be trained can also be screened by category directly using the expected intersection-over-union ratios corresponding to those pixel points (that is, the pixels of the image to be trained can be filtered using the expected intersection-over-union ratio without first selecting the pixels that fall into the real tight frame of at least one target as positive samples). In addition, the pixels that do not fall into any real tight frame (that is, pixels without a matching tight frame) can be identified. In this way, subsequent screening of such pixels can be facilitated. For example, the expected intersection-over-union ratio of such a pixel may be set to 0 to identify it. Specifically, the pixels of the image to be trained can be screened by category based on the expected intersection-over-union ratios corresponding to the pixels, and the regression network 23 can be optimized based on the screened pixels.
  • the regression network 23 may be optimized by selecting, from the pixels of the image to be trained, the pixels whose expected intersection-over-union ratio is greater than a preset expected intersection-over-union ratio. In some examples, the regression network 23 may be optimized by selecting, from the positive samples of each category, the positive samples whose expected intersection-over-union ratio is greater than the preset expected intersection-over-union ratio. In this way, pixels (for example, positive samples) meeting the preset expected intersection-over-union ratio can be obtained. In some examples, the preset expected intersection-over-union ratio may be greater than 0 and less than or equal to 1. For example, the preset expected intersection-over-union ratio may be 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, or 1. In some examples, the preset expected intersection-over-union ratio may be a hyperparameter, and it can be adjusted during the training of the regression network 23.
  • the expected intersection ratio corresponding to the pixel point can be obtained based on the matching tight frame of the pixel point (for example, the positive sample). In some examples, if the pixel does not correspond to a matching tight frame, the pixel may be ignored or the expected intersection ratio corresponding to the pixel may be set to 0. In this case, it is possible to make pixels that do not have a matching tight frame not used for the training of the regression network 23 or reduce the contribution to the regression loss. It should be noted that, unless otherwise specified, the following description of the expected intersection ratio corresponding to a pixel point is also applicable to the expected intersection ratio corresponding to a positive sample.
  • the expected intersection-over-union ratio may be the maximum value among the intersection-over-union ratios (Intersection over Union, IoU) between the matching tight frame of the pixel point and multiple borders constructed with the pixel point as the center.
  • the examples of the present disclosure are not limited thereto.
  • the expected intersection ratio may be the maximum value of the intersection ratios between the real tight frame of the pixel and multiple borders constructed around the pixel.
  • a plurality of frames may be constructed with a pixel point of the image to be trained as the center point, and the maximum value among the intersection ratios of the plurality of frames and the matching tight frame labels of the pixel point may be obtained as the expected intersection ratio.
  • the multiple borders may be of different sizes. Specifically, each frame in the plurality of frames may have a different width or height from other frames.
  • FIG. 7 is a schematic diagram showing a frame constructed centering on a pixel point involved in an example of the present disclosure. In order to describe the expected intersection-over-union ratio more clearly, it will be described below in conjunction with FIG. 7.
  • the pixel M1 has a tight matching frame B31
  • the frame B32 is an exemplary frame constructed centering on the pixel M1 .
  • let W be the width of the matching tight frame
  • let H be the height of the matching tight frame
  • let (r_1·W, r_2·H) represent the position of the pixel within the matching tight frame
  • r_1, r_2 are the relative position of the pixel in the matching tight frame and satisfy the condition: 0 < r_1, r_2 < 1.
  • Multiple borders can be constructed based on pixels.
  • the position of the pixel M1 can be expressed as (r 1 W, r 2 H), and the width and height of the matching tight frame B31 can be W and H respectively.
  • the tight matching frame may be divided into four regions by the two centerlines of the tight matching frame.
  • the four areas may be an upper left area, an upper right area, a lower left area, and a lower right area.
  • the center line D9 and center line D10 of the matching tight frame B31 can divide the matching tight frame B31 into an upper left area A3 , an upper right area A4 , a lower left area A5 and a lower right area A6 .
  • the pixel point M1 may be a point in the upper left area A3.
  • w_1 and h_1 can represent the width and height of the first boundary condition
  • w_2 and h_2 can represent the width and height of the second boundary condition
  • w_3 and h_3 can represent the width and height of the third boundary condition
  • w_4 and h_4 can represent the width and height of the fourth boundary condition.
  • the intersection-over-union ratios corresponding to the above four boundary conditions can satisfy formula (2):
  • IoU_1(r_1, r_2) = 4·r_1·r_2,
  • IoU_2(r_1, r_2) = 2·r_1 / (2·r_1·(1 - 2·r_2) + 1),
  • IoU_3(r_1, r_2) = 2·r_2 / (2·r_2·(1 - 2·r_1) + 1),
  • IoU_4(r_1, r_2) = 1 / (4·(1 - r_1)·(1 - r_2)),
  • IoU 1 (r 1 ,r 2 ) can represent the IoU ratio corresponding to the first boundary condition
  • IoU 2 (r 1 ,r 2 ) can represent the IoU ratio corresponding to the second boundary condition
  • IoU 3 (r 1 , r 2 ) can represent the intersection ratio corresponding to the third boundary condition
  • IoU 4 (r 1 , r 2 ) can represent the intersection ratio corresponding to the fourth boundary condition.
  • the intersection and union ratio corresponding to each boundary condition can be obtained.
  • the largest intersection and union ratio among multiple boundary conditions is the expected intersection and union ratio.
  • when r_1, r_2 satisfy the condition 0 < r_1, r_2 ≤ 0.5 (that is, the pixel lies in the upper left area), the expected intersection-over-union ratio can satisfy formula (3): EIoU(r_1, r_2) = max{IoU_1(r_1, r_2), IoU_2(r_1, r_2), IoU_3(r_1, r_2), IoU_4(r_1, r_2)}.
  • the expected intersection-over-union ratios for pixels located in other regions can be obtained based on a similar method for the upper-left region.
  • for r_1 satisfying the condition 0.5 < r_1 < 1, the r_1 in formula (3) can be replaced by 1 - r_1, and for r_2 satisfying the condition 0.5 < r_2 < 1, the r_2 in formula (3) can be replaced by 1 - r_2.
  • in this way, the expected intersection-over-union ratio of pixels located in other regions can be obtained. That is, the pixels located in other regions can be mapped to the upper left region through coordinate conversion, and the expected intersection-over-union ratio can then be obtained in the same manner as for the upper left region. Therefore, for r_1, r_2 satisfying the condition 0 < r_1, r_2 < 1, the expected intersection-over-union ratio can satisfy formula (4): EIoU(r_1, r_2) = max{IoU_1, IoU_2, IoU_3, IoU_4} evaluated at (min(r_1, 1 - r_1), min(r_2, 1 - r_2)).
  • IoU_1(r_1, r_2), IoU_2(r_1, r_2), IoU_3(r_1, r_2) and IoU_4(r_1, r_2) can be obtained by formula (2).
  • thereby, the expected intersection-over-union ratio can be obtained.
  • the expected intersection ratio corresponding to the pixel point can be obtained based on the matching tight frame of the pixel point (eg positive sample).
  • the examples of the present disclosure are not limited thereto.
  • the expected intersection ratio corresponding to the pixel point can be obtained based on the real tight frame corresponding to the pixel point (such as a positive sample), and the pixel points of each category can be screened based on the expected intersection ratio.
  • the expected intersection-over-union ratio may be the maximum value among the expected intersection-over-union ratios corresponding to each real tight frame. For obtaining the expected intersection-over-union ratio corresponding to a pixel based on the real tight frame, reference may be made to the relevant description of obtaining the expected intersection-over-union ratio corresponding to the pixel based on the matching tight frame of the pixel.
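Putting formula (2) and the region mapping together, the expected intersection-over-union ratio of a pixel with respect to its matching tight frame, and a threshold-based screening of positive samples, could be sketched as follows. The threshold value of 0.5 and the function names are assumptions; mapping a pixel onto the upper left region via min(r, 1 - r) follows the coordinate-conversion description above.

```python
def expected_iou(r1: float, r2: float) -> float:
    """Expected IoU for a pixel at relative position (r1, r2), 0 < r1, r2 < 1,
    inside its matching tight box (formula (2) plus the region mapping)."""
    r1, r2 = min(r1, 1.0 - r1), min(r2, 1.0 - r2)  # map onto the upper left region
    iou1 = 4.0 * r1 * r2
    iou2 = 2.0 * r1 / (2.0 * r1 * (1.0 - 2.0 * r2) + 1.0)
    iou3 = 2.0 * r2 / (2.0 * r2 * (1.0 - 2.0 * r1) + 1.0)
    iou4 = 1.0 / (4.0 * (1.0 - r1) * (1.0 - r2))
    return max(iou1, iou2, iou3, iou4)

def keep_positive_sample(x, y, box, threshold: float = 0.5) -> bool:
    """Screen a positive sample: keep it only if its expected IoU with its
    matching tight box b = (xl, yt, xr, yb) exceeds the preset threshold."""
    xl, yt, xr, yb = box
    r1 = (x - xl) / (xr - xl)
    r2 = (y - yt) / (yb - yt)
    return expected_iou(r1, r2) > threshold

# A pixel at the center of its matching box has expected IoU 1.0
print(expected_iou(0.5, 0.5))                           # 1.0
print(keep_positive_sample(35, 35, (10, 10, 60, 60)))   # True
```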
  • FIG. 8 is a flow chart showing a measurement method of tight-frame-based deep learning related to an example of the present disclosure.
  • the measurement method may include obtaining an input image (step S220), inputting the input image into the network module 20 to obtain a first output and a second output (step S240), and based on the first output and The second output identifies the objects to obtain the tight frames of the objects of each category (step S260).
  • the input image may include at least one object.
  • at least one object may belong to at least one category of interest (category of interest may be simply referred to as category).
  • the input image may also not include objects. In this case, it is possible to judge an input image in which no object exists.
  • the first output may include the probability that each pixel in the input image belongs to each category.
  • the second output may include an offset between the position of each pixel point in the input image and the tight bounding box of each category of objects.
  • the offset in the second output may be used as the target offset.
  • the network module 20 may include a backbone network 21 , a segmentation network 22 and a regression network 23 .
  • segmentation network 22 may be a network for image segmentation based on weakly supervised learning.
  • regression network 23 may be a network based on bounding box regression.
  • backbone network 21 may be used to extract feature maps of input images.
  • segmentation network 22 may take the feature map as input to obtain a first output
  • regression network 23 may take the feature map as input to obtain a second output.
  • the resolution of the feature map may be consistent with the input image. For details, refer to the relevant description of the network module 20 .
  • the first output may include the probability that each pixel in the input image belongs to each category
  • the second output may include the offset between the position of each pixel in the input image and the tight frame of each category of objects.
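As an architectural illustration only, a minimal PyTorch-style sketch of a network module with a shared backbone, a segmentation head and a regression head is given below. The plain convolutional backbone, the channel counts, the sigmoid activation and all layer names are assumptions; the description above only requires that the backbone produce a feature map at the input resolution, that the first output contain per-pixel class probabilities, and that the second output contain four offsets per pixel and per category.

```python
import torch
import torch.nn as nn

class TightBoxNet(nn.Module):
    """Backbone + segmentation head + regression head (illustrative sketch)."""

    def __init__(self, num_classes: int, feat: int = 32):
        super().__init__()
        # Backbone: kept at the input resolution here for simplicity; a real
        # encoder-decoder would downsample and map features back to that resolution.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True),
        )
        # First output: probability that each pixel belongs to each category.
        self.seg_head = nn.Conv2d(feat, num_classes, 1)
        # Second output: 4 offsets (left, top, right, bottom) per pixel and per category.
        self.reg_head = nn.Conv2d(feat, 4 * num_classes, 1)

    def forward(self, image: torch.Tensor):
        fmap = self.backbone(image)                      # same resolution as the input
        first_output = torch.sigmoid(self.seg_head(fmap))
        second_output = self.reg_head(fmap)
        return first_output, second_output

# Example: a crop with two categories (e.g. optic cup and optic disc)
net = TightBoxNet(num_classes=2)
probs, offsets = net(torch.randn(1, 3, 128, 128))
print(probs.shape, offsets.shape)  # torch.Size([1, 2, 128, 128]) torch.Size([1, 8, 128, 128])
```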
  • an object offset of a category corresponding to a pixel at a corresponding position may be selected from the second output, and a tight frame of each category of objects may be obtained based on the object offset. Therefore, the target can be accurately measured based on the tight frame of the target.
  • the position of the pixel with the highest local probability of belonging to each category can be obtained from the first output as the first position, and the tight frame of the target of each category can be obtained based on the position corresponding to the first position in the second output and the target offset of the corresponding category.
  • a non-maximum suppression method (Non-Maximum Suppression, NMS) may be used to obtain the first position.
  • the number of first positions corresponding to each category may be greater than or equal to one. But the example of the present disclosure is not limited thereto.
  • the position of the pixel with the highest probability of belonging to each category can be obtained from the first output as the first position, and the tight frame of the target of each category can be obtained based on the position corresponding to the first position in the second output and the target offset of the corresponding category. That is, the first position can be obtained by using the maximum value method. In some examples, the first position may also be obtained by using a smooth maximum suppression method.
  • tight boxes for objects of various categories may be obtained based on the first position and the object offset.
  • the first position and the target offset can be substituted into equation (1) to infer the tight frame of the target.
  • the first position can be used as the position (x, y) of the pixel point in the formula (1) and the target offset can be used as the offset t to obtain the tight frame b of the target.
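A sketch of this decoding step: the first position is taken as the pixel with the maximum probability for each category (the maximum value method mentioned above, i.e. one target per category), and formula (1) is inverted to recover the tight frame from the target offset at that position. Array layouts and helper names are assumptions.

```python
import numpy as np

def decode_tight_box(x, y, t, avg_w, avg_h):
    """Invert formula (1): recover b = (xl, yt, xr, yb) from pixel (x, y) and offset t."""
    tl, tt, tr, tb = t
    return (x - tl * avg_w, y - tt * avg_h, x + tr * avg_w, y + tb * avg_h)

def infer_tight_boxes(first_output, second_output, avg_sizes):
    """first_output: (C, H, W) class probabilities; second_output: (C, 4, H, W) offsets.
    Returns one tight box per category using the maximum value method."""
    boxes = []
    for c, (avg_w, avg_h) in enumerate(avg_sizes):
        y, x = np.unravel_index(np.argmax(first_output[c]), first_output[c].shape)
        t = second_output[c, :, y, x]                  # target offset at the first position
        boxes.append(decode_tight_box(x, y, t, avg_w, avg_h))
    return boxes

# Example with two categories on a 64x64 map
C, H, W = 2, 64, 64
probs = np.random.rand(C, H, W)
offsets = np.random.rand(C, 4, H, W)
print(infer_tight_boxes(probs, offsets, avg_sizes=[(40.0, 40.0), (20.0, 20.0)]))
```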
  • the measuring method may further include measuring the size of each target based on the tight frame of the target (not shown). Thereby, the target can be accurately measured based on the tight frame of the target.
  • the dimensions of the object may be the width and height of the tight box of the object.
  • FIG. 9 is a block diagram illustrating a measurement device 100 based on tight-framework deep learning according to an example of the present disclosure.
  • the measurement device 100 may include an acquisition module 10 , a network module 20 and an identification module 30 .
  • acquisition module 10 may be configured to acquire an input image.
  • network module 20 may be configured to receive an input image and obtain a first output and a second output based on the input image.
  • the identification module 30 may be configured to identify objects based on the first output and the second output to obtain tight frames of objects of each category.
  • the measurement device 100 may further include a measurement module (not shown). The measurement module can be configured to measure the size of each target based on the tight frame of the target. Thereby, the target can be accurately measured based on the tight frame of the target.
  • the dimensions of the object may be the width and height of the tight box of the object.
  • the measurement method and measurement device 100 involved in this disclosure construct a network module 20 including a backbone network 21, a segmentation network 22 for image segmentation based on weakly supervised learning, and a regression network 23 based on frame regression.
  • the network module 20 is trained based on the tight frame of the target.
  • the backbone network 21 receives an input image (such as a fundus image) and extracts a feature map consistent with the resolution of the input image, and inputs the feature map into the segmentation network 22 and the regression network 23 respectively to obtain the first output and the second output, Then, based on the first output and the second output, a tight frame of the target in the input image is obtained to realize the measurement.
  • the network module 20 based on the training of the target's tight frame can accurately predict the target's tight frame in the input image, and then can accurately measure based on the target's tight frame.
  • predicting the normalized offset through the regression network 23 can improve the accuracy of identifying or measuring objects with small size changes.
  • by using the expected intersection-over-union ratio to screen the pixels used for optimizing the regression network 23, it is possible to reduce the negative impact of pixels far away from the center on the optimization of the regression network 23 and to improve the efficiency of the regression network 23 optimization.
  • the regression network 23 predicts the offset of a definite category, which can further improve the accuracy of target recognition or measurement.
  • the measurement method involved in the present disclosure will be further described in detail by taking the input image as an example of a fundus image.
  • the measurement method for fundus images can also be referred to as the measurement method for fundus images based on deep learning of tight frames.
  • the fundus images described in the examples of the present disclosure are used to illustrate the technical solutions of the present disclosure more clearly, and do not constitute limitations on the technical solutions provided in the present disclosure.
  • the measurement method for the input image, the measurement device 100, and the corresponding training method are all applicable to the fundus image.
  • FIG. 2( a ) shows a fundus image captured by a fundus camera.
  • the measurement method for the fundus image involved in this embodiment can use the network module 20 trained based on the tight frame of the target to identify at least one target in the fundus image so as to realize the measurement.
  • the fundus image may include at least one object, which may be the optic cup and/or optic disc. That is, the network module 20 that can be trained based on the tight frame of the target can identify the optic cup and/or optic disc in the fundus image so as to realize the measurement of the optic cup and/or optic disc. Thereby, the optic cup and/or optic disc in the fundus image can be measured based on the tight frame.
  • FIG. 10 is a flowchart illustrating a measurement method for a fundus image according to an example of the present disclosure.
  • the measurement method for the fundus image may include acquiring the fundus image (step S420), inputting the fundus image into the network module 20 to obtain the first output and the second output (step S440), and based on The first output and the second output identify the target to obtain the tight frame of the optic cup and/or optic disc in the fundus image to achieve measurement (step S460).
  • a fundus image may be acquired.
  • a fundus image may include at least one object.
  • at least one object may be identified to identify the object and the category to which the object belongs (ie, the category of interest).
  • the target for each category can be the optic cup or optic disc.
  • the fundus image may also not include the optic disc or cup. In this case, it is possible to judge a fundus image in which no optic disc or optic cup exists.
  • the fundus image may be input into the network module 20 to obtain the first output and the second output.
  • the first output can include the probability that each pixel in the fundus image belongs to each category (that is, the optic cup and/or optic disc), and the second output can include the target offset between the position of each pixel in the fundus image and the tight frame of each category of targets.
  • the offset in the second output may be used as the target offset.
  • the backbone network 21 in the network module 20 can be used to extract the feature map of the fundus image.
  • the feature map may be consistent with the resolution of the fundus image.
  • the decoding module in the network module 20 is configured to map the image features extracted at different scales back to the resolution of the fundus image to output a feature map. For details, refer to the relevant description of the network module 20 .
  • the training samples of the network module 20 may include fundus image data (that is, multiple fundus images to be trained) and label data corresponding to the fundus image data.
  • the label data may include a gold standard for the class to which the optic cup and/or optic disc belongs, and a gold standard for the tight frame of the optic cup and/or optic disc.
  • the target may be identified based on the first output and the second output to obtain a tight frame of the optic cup and/or optic disc in the fundus image to achieve measurement. Thereby, the optic cup and/or optic disc can be accurately measured subsequently based on the tight frame.
  • the target offset of the category corresponding to the pixel point at the corresponding position can be selected from the second output, and the tight frame of the optic cup and/or optic disc can be obtained based on the target offset.
  • the position of the pixel point with the highest probability belonging to each category can be obtained from the first output as the first position, based on the position corresponding to the first position in the second output and the target offset of the corresponding category to obtain Tight frame for optic cup and/or optic disc.
  • the first position may be obtained using a maximum value method. For details, refer to the related description of step S260.
  • a tight frame for the optic cup and/or optic disc may be obtained based on the first position and the target offset. For details, refer to the related description of step S260.
  • the measurement method for the fundus image may further include obtaining a ratio of the optic cup to the optic disc based on the tight frames of the optic cup and the optic disc in the fundus image (not shown).
  • the ratio of the optic cup to the optic disc can be accurately measured based on the tight framing of the optic cup and optic disc.
  • the optic cup and/or optic disc can be measured based on the tight frame of the optic cup and/or the tight frame of the optic disc in the fundus image to obtain the size of the optic cup and/or optic disc (the size may be, for example, the vertical diameter and the horizontal diameter).
  • the size of the optic cup and/or optic disc can be accurately measured.
  • the cup and/or disc size can be obtained by taking the height of the tight frame as the vertical diameter of the optic cup and/or optic disc and the width of the tight frame as the horizontal diameter of the optic cup and/or disc.
  • the ratio of the optic cup to the optic disc (also referred to as the cup-to-disc ratio) can be obtained.
  • the cup-to-disc ratio can be obtained based on the tight frame, so that the cup-to-disk ratio can be accurately measured.
  • the cup-to-disk ratio may include a vertical cup-to-disk ratio and a horizontal cup-to-disk ratio.
  • the vertical cup-to-disk ratio may be the ratio of the vertical diameters of the optic cup and optic disc.
  • the horizontal cup-to-disk ratio may be the ratio of the horizontal diameters of the optic cup and optic disc.
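A small sketch of this measurement step: given the tight frames of the optic cup and the optic disc, the vertical and horizontal diameters are read off as the heights and widths of the frames, and the vertical and horizontal cup-to-disc ratios follow directly. The box format (xl, yt, xr, yb) and the function name are assumptions.

```python
def cup_to_disc_ratios(cup_box, disc_box):
    """cup_box / disc_box: tight frames as (xl, yt, xr, yb).
    Width is taken as the horizontal diameter, height as the vertical diameter."""
    cup_w, cup_h = cup_box[2] - cup_box[0], cup_box[3] - cup_box[1]
    disc_w, disc_h = disc_box[2] - disc_box[0], disc_box[3] - disc_box[1]
    vertical_cdr = cup_h / disc_h      # ratio of vertical diameters
    horizontal_cdr = cup_w / disc_w    # ratio of horizontal diameters
    return vertical_cdr, horizontal_cdr

# Example tight frames (in pixels)
print(cup_to_disc_ratios(cup_box=(110, 120, 170, 185), disc_box=(80, 90, 220, 230)))
```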
  • FIG. 11 is a block diagram showing a measurement device 200 for a fundus image according to an example of the present disclosure.
  • the measuring device 200 for fundus images may include an acquisition module 50 , a network module 20 and an identification module 60 .
  • acquisition module 50 may be configured to acquire fundus images.
  • network module 20 may be configured to receive a fundus image and obtain a first output and a second output based on the fundus image.
  • the identification module 60 may be configured to identify the target based on the first output and the second output to obtain a tight frame of the optic cup and/or optic disc in the fundus image for measurement. For details, refer to the relevant description in step S460.
  • the measuring device 200 may further include a cup-to-disk ratio module (not shown).
  • the cup-to-disk ratio module may be configured to obtain a cup-to-disc ratio based on the tight framing of the cup and disc in the fundus image. For details, refer to the related description of obtaining the ratio of the optic cup and optic disc based on the tight frame of the optic cup and optic disc in the fundus image.

Abstract

A measurement method based on deep learning of a tight box mark, comprising: obtaining an input image (S220); inputting the input image into a network module (20) trained on the basis of the tight box mark of a target to obtain a first output and a second output (S240), the first output comprising a probability that each pixel point in the input image belongs to each category, the second output comprising an offset between the position of each pixel point in the input image and the tight box mark of the target of each category, the network module (20) comprising a backbone network (21) for extracting a feature map of the input image, a segmentation network (22) based on weak supervised learning, and a regression network (23) based on bounding box regression, the segmentation network (22) taking the feature map as an input to obtain the first output, the regression network (23) taking the feature map as an input to obtain the second output, and the resolution of the feature map being consistent with that of the input image; and recognizing the target on the basis of the first output and the second output to obtain the tight box mark of the target of each category (S260). Therefore, the target can be recognized and accurately measured.

Description

Measurement method and measurement device based on deep learning of tight frames

Technical Field

The present disclosure generally relates to the field of recognition technology based on deep learning, and specifically relates to a measurement method and a measurement device based on deep learning of tight frames.
Background Art

Images often include information about various targets, and the information about a target in an image can be analyzed automatically based on image processing technology. For example, in the medical field, tissue objects in medical images can be identified, and the size of the tissue objects can then be measured to monitor changes in the tissue objects.

In recent years, artificial intelligence technology represented by deep learning has developed significantly, and its application to target recognition or measurement has attracted more and more attention. Researchers use deep learning techniques to identify or further measure targets in images. Specifically, in some research based on deep learning, labeled data is often used to train a deep-learning-based neural network to recognize and segment the target in an image, after which the target can be measured. However, such target recognition or measurement methods often require accurate pixel-level annotation data for training the neural network, and collecting pixel-level annotation data often requires a lot of manpower and material resources. In addition, although some target recognition methods are not based on pixel-level annotation data, they merely recognize targets in the image; the recognition of target boundaries is not accurate enough, or the accuracy near the target boundary is often low, which makes them unsuitable for scenarios that require precise measurement. In this case, the accuracy of measuring targets in images still needs to be improved.
Summary of the Invention

The present disclosure is made in view of the above situation, and its object is to provide a measurement method and a measurement device based on deep learning of tight frames that can identify a target and accurately measure the target.
To this end, a first aspect of the present disclosure provides a measurement method based on deep learning of tight frames, which is a measurement method that uses a network module trained based on the tight frame of a target to identify the target so as to realize measurement, the tight frame being the minimum bounding rectangle of the target. The measurement method includes: acquiring an input image including at least one target, the at least one target belonging to at least one category of interest; inputting the input image into the network module to obtain a first output and a second output, the first output including the probability that each pixel in the input image belongs to each category, the second output including the offset between the position of each pixel in the input image and the tight frame of the target of each category, the offset in the second output being taken as the target offset, wherein the network module includes a backbone network, a segmentation network for image segmentation based on weakly supervised learning, and a regression network based on bounding box regression, the backbone network is used to extract a feature map of the input image, the segmentation network takes the feature map as input to obtain the first output, and the regression network takes the feature map as input to obtain the second output, the feature map being consistent with the resolution of the input image; and identifying the target based on the first output and the second output to obtain the tight frame of the target of each category.

In the present disclosure, a network module including a backbone network, a segmentation network for image segmentation based on weakly supervised learning, and a regression network based on bounding box regression is constructed. The network module is trained based on the tight frame of the target. The backbone network receives an input image and extracts a feature map consistent with the resolution of the input image; the feature map is input into the segmentation network and the regression network respectively to obtain the first output and the second output, and the tight frame of the target in the input image is then obtained based on the first output and the second output so as to realize measurement. In this case, the network module trained based on the tight frame of the target can accurately predict the tight frame of the target in the input image, and accurate measurement can then be performed based on the tight frame of the target.
In addition, in the measurement method according to the first aspect of the present disclosure, optionally, the network module is trained by the following method: constructing training samples, the input image data of the training samples including multiple images to be trained, the multiple images to be trained including images containing targets belonging to at least one category, and the label data of the training samples including the gold standard of the category to which the target belongs and the gold standard of the tight frame of the target; obtaining, through the network module and based on the input image data of the training samples, the predicted segmentation data output by the segmentation network and the predicted offset output by the regression network corresponding to the training samples; determining the training loss of the network module based on the label data, the predicted segmentation data and the predicted offset corresponding to the training samples; and training the network module based on the training loss so as to optimize the network module. Thus, an optimized network module can be obtained.

In addition, in the measurement method according to the first aspect of the present disclosure, optionally, determining the training loss of the network module based on the label data, the predicted segmentation data and the predicted offset corresponding to the training samples includes: obtaining the segmentation loss of the segmentation network based on the predicted segmentation data and the label data corresponding to the training samples; obtaining the regression loss of the regression network based on the predicted offset corresponding to the training samples and the real offset based on the label data, wherein the real offset is the offset between the position of a pixel of the image to be trained and the gold standard of the tight frame of the target in the label data; and obtaining the training loss of the network module based on the segmentation loss and the regression loss. In this case, the predicted segmentation data of the segmentation network can be made to approximate the label data through the segmentation loss, and the predicted offset of the regression network can be made to approximate the real offset through the regression loss.

In addition, in the measurement method according to the first aspect of the present disclosure, optionally, the target offset is an offset normalized based on the average width and average height of the targets of each category, or an offset normalized based on the average size of the targets of each category. As a result, the accuracy of identifying or measuring targets whose size does not vary greatly can be improved.
In addition, in the measurement method according to the first aspect of the present disclosure, optionally, multi-instance learning is used to obtain multiple bags to be trained by category based on the gold standard of the tight frame of the target in each image to be trained, and the segmentation loss is obtained based on the multiple bags to be trained of each category, wherein the multiple bags to be trained include multiple positive bags and multiple negative bags; all pixels on each of multiple straight lines connecting two opposite sides of the gold standard of the tight frame of the target are divided into one positive bag, the multiple straight lines including at least one group of mutually parallel first parallel lines and mutually parallel second parallel lines respectively perpendicular to each group of first parallel lines; and a negative bag is a single pixel in the region outside the gold standards of the tight frames of all targets of a category. The segmentation loss includes a unary term and a pairwise term, the unary term describing the degree to which each bag to be trained belongs to the gold standard of each category, and the pairwise term describing the degree to which a pixel of the image to be trained and its adjacent pixels belong to the same category. In this case, the segmentation loss can be obtained based on the positive bags and negative bags of multi-instance learning, the tight frame is constrained by both the positive bags and the negative bags through the unary loss, and the predicted segmentation result is smoothed through the pairwise loss.

In addition, in the measurement method according to the first aspect of the present disclosure, optionally, the angle of the first parallel lines is the angle between the extension line of a first parallel line and the extension line of any non-intersecting side of the gold standard of the tight frame of the target, and the angle of the first parallel lines is greater than -90° and less than 90°. In this case, positive bags at different angles can be divided to optimize the segmentation network. Thus, the accuracy of the predicted segmentation data of the segmentation network can be improved.
In addition, in the measurement method according to the first aspect of the present disclosure, optionally, pixels falling within the gold standard of the tight frame of at least one target are selected from the image to be trained by category as positive samples of each category, the matching tight frame corresponding to each positive sample is obtained so that the positive samples of each category can be screened based on the matching tight frame, and the regression network is then optimized by using the screened positive samples of each category, wherein the matching tight frame is the gold-standard tight frame, among those the positive sample falls into, with the smallest real offset relative to the position of the positive sample. In this way, the regression network can be optimized by using the positive samples of each category screened based on matching tight frames.

In addition, in the measurement method according to the first aspect of the present disclosure, optionally, let the position of a pixel be expressed as (x, y) and the tight frame of a target corresponding to the pixel be expressed as b = (xl, yt, xr, yb); the offset of the tight frame b of the target relative to the position of the pixel is expressed as t = (tl, tt, tr, tb), where tl, tt, tr, tb satisfy the formulas: tl = (x - xl)/S_c1, tt = (y - yt)/S_c2, tr = (xr - x)/S_c1, tb = (yb - y)/S_c2, in which (xl, yt) represents the position of the upper left corner of the tight frame of the target, (xr, yb) represents the position of the lower right corner of the tight frame of the target, S_c1 represents the average width of the targets of the c-th category, and S_c2 represents the average height of the targets of the c-th category. Thereby, a normalized offset can be obtained.
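As a complement to the formulas above, the following sketch builds the dense true-offset map of one tight frame of category c over an image grid, applying tl = (x - xl)/S_c1, tt = (y - yt)/S_c2, tr = (xr - x)/S_c1, tb = (yb - y)/S_c2 at every pixel. The array layout and the function name are assumptions for illustration.

```python
import numpy as np

def true_offset_map(height, width, box, avg_w, avg_h):
    """Dense true offsets t = (tl, tt, tr, tb) of tight box b = (xl, yt, xr, yb)
    relative to every pixel (x, y), normalized by the category's average size."""
    xl, yt, xr, yb = box
    y, x = np.mgrid[0:height, 0:width].astype(float)
    tl = (x - xl) / avg_w
    tt = (y - yt) / avg_h
    tr = (xr - x) / avg_w
    tb = (yb - y) / avg_h
    return np.stack([tl, tt, tr, tb], axis=0)   # shape (4, height, width)

# Example: offsets for a 100x100 image and one tight box of category c
offsets = true_offset_map(100, 100, box=(20, 30, 70, 90), avg_w=50.0, avg_h=60.0)
print(offsets.shape)  # (4, 100, 100)
```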
In addition, in the measurement method according to the first aspect of the present disclosure, optionally, pixels whose expected intersection-over-union ratio is greater than a preset expected intersection-over-union ratio are selected by category from the pixels of the image to be trained, using the expected intersection-over-union ratios corresponding to the pixels of the image to be trained, in order to optimize the regression network, wherein multiple borders of different sizes are constructed with a pixel of the image to be trained as the center point, the maximum value among the intersection-over-union ratios between the multiple borders and the matching tight frame of the pixel is taken as the expected intersection-over-union ratio, and the matching tight frame is the gold-standard tight frame, among those the pixel of the image to be trained falls into, with the smallest real offset relative to the position of the pixel. In this way, positive samples meeting the preset expected intersection-over-union ratio can be obtained.
In addition, in the measurement method according to the first aspect of the present disclosure, optionally, the expected intersection-over-union ratio satisfies formula (4): EIoU(r_1, r_2) = max{IoU_1, IoU_2, IoU_3, IoU_4} evaluated at (min(r_1, 1 - r_1), min(r_2, 1 - r_2)), where r_1, r_2 are the relative positions of the pixel of the image to be trained in the matching tight frame, 0 < r_1, r_2 < 1, IoU_1(r_1, r_2) = 4r_1r_2, IoU_2(r_1, r_2) = 2r_1/(2r_1(1 - 2r_2) + 1), IoU_3(r_1, r_2) = 2r_2/(2r_2(1 - 2r_1) + 1), and IoU_4(r_1, r_2) = 1/(4(1 - r_1)(1 - r_2)). Thus, the expected intersection-over-union ratio can be obtained.
In addition, in the measurement method according to the first aspect of the present disclosure, optionally, identifying the target based on the first output and the second output to obtain the tight frame of the target of each category is: obtaining, from the first output, the position of the pixel with the highest local probability of belonging to each category as the first position, and obtaining the tight frame of the target of each category based on the position corresponding to the first position in the second output and the target offset of the corresponding category. In this case, one target or multiple targets of each category can be identified.

In addition, in the measurement method according to the first aspect of the present disclosure, optionally, the sizes of multiple targets of the same category differ from each other by less than a factor of 10. As a result, the accuracy of target recognition can be further improved.

In addition, in the measurement method according to the first aspect of the present disclosure, optionally, the backbone network includes an encoding module and a decoding module, the encoding module being configured to extract image features at different scales, and the decoding module being configured to map the image features extracted at different scales back to the resolution of the input image to output the feature map. Thereby, a feature map consistent with the resolution of the input image can be obtained.

In addition, in the measurement method according to the first aspect of the present disclosure, optionally, the input image is a fundus image, and the target is the optic cup and/or the optic disc. Thereby, a tight frame of the optic cup and/or optic disc can be obtained.

In addition, in the measurement method according to the first aspect of the present disclosure, optionally, identifying the target based on the first output and the second output to obtain the tight frame of the optic cup and/or optic disc in the fundus image so as to realize measurement is: obtaining, from the first output, the position of the pixel with the highest probability of belonging to each category as the first position, and obtaining the tight frame of the target of each category based on the position corresponding to the first position in the second output and the target offset of the corresponding category. Thereby, the optic cup and/or optic disc can be identified.

In addition, in the measurement method according to the first aspect of the present disclosure, optionally, the optic cup and/or optic disc is measured based on the tight frame of the optic cup in the fundus image and/or the tight frame of the optic disc in the fundus image to obtain the size of the optic cup and/or optic disc, and the ratio of the optic cup to the optic disc is obtained based on the sizes of the optic cup and the optic disc in the fundus image. Thus, the ratio of the optic cup to the optic disc can be obtained.
A second aspect of the present disclosure provides a measurement device based on deep learning of tight frames, which is a measurement device that uses a network module trained based on the tight frame of a target to identify the target so as to realize measurement, the tight frame being the minimum bounding rectangle of the target. The measurement device includes an acquisition module, a network module and an identification module; the acquisition module is configured to acquire an input image including at least one target, the at least one target belonging to at least one category of interest; the network module is configured to receive the input image and obtain a first output and a second output based on the input image, the first output including the probability that each pixel in the input image belongs to each category, the second output including the offset between the position of each pixel in the input image and the tight frame of the target of each category, the offset in the second output being taken as the target offset, wherein the network module includes a backbone network, a segmentation network for image segmentation based on weakly supervised learning, and a regression network based on bounding box regression, the backbone network is used to extract a feature map of the input image, the segmentation network takes the feature map as input to obtain the first output, and the regression network takes the feature map as input to obtain the second output, the feature map being consistent with the resolution of the input image; and the identification module is configured to identify the target based on the first output and the second output to obtain the tight frame of the target of each category.

In the present disclosure, a network module including a backbone network, a segmentation network for image segmentation based on weakly supervised learning, and a regression network based on bounding box regression is constructed. The network module is trained based on the tight frame of the target. The backbone network receives an input image and extracts a feature map consistent with the resolution of the input image; the feature map is input into the segmentation network and the regression network respectively to obtain the first output and the second output, and the tight frame of the target in the input image is then obtained based on the first output and the second output so as to realize measurement. In this case, the network module trained based on the tight frame of the target can accurately predict the tight frame of the target in the input image, and accurate measurement can then be performed based on the tight frame of the target.

In addition, in the measurement device according to the second aspect of the present disclosure, optionally, the input image is a fundus image, and the target is the optic cup and/or the optic disc. Thereby, a tight frame of the optic cup and/or optic disc can be obtained.

In addition, in the measurement device according to the second aspect of the present disclosure, optionally, identifying the target based on the first output and the second output to obtain the tight frame of the optic cup and/or optic disc in the fundus image so as to realize measurement is: obtaining, from the first output, the position of the pixel with the highest probability of belonging to each category as the first position, and obtaining the tight frame of the target of each category based on the position corresponding to the first position in the second output and the target offset of the corresponding category. Thereby, the optic cup and/or optic disc can be identified.

According to the present disclosure, a measurement method and a measurement device based on deep learning of tight frames that can identify a target and accurately measure the target are provided.
Brief Description of the Drawings

The present disclosure will now be explained in further detail by way of example only with reference to the accompanying drawings, in which:

FIG. 1 is a schematic diagram showing an application scenario of the measurement method based on deep learning of tight frames according to an example of the present disclosure.

FIG. 2(a) is a schematic diagram showing a fundus image according to an example of the present disclosure.

FIG. 2(b) is a schematic diagram showing a recognition result of a fundus image according to an example of the present disclosure.

FIG. 3 is a schematic diagram showing one example of a network module according to an example of the present disclosure.

FIG. 4 is a schematic diagram showing another example of a network module according to an example of the present disclosure.

FIG. 5 is a flowchart illustrating a training method of a network module according to an example of the present disclosure.

FIG. 6 is a schematic diagram showing positive bags according to an example of the present disclosure.

FIG. 7 is a schematic diagram showing borders constructed centering on a pixel according to an example of the present disclosure.

FIG. 8 is a flowchart showing the measurement method based on deep learning of tight frames according to an example of the present disclosure.

FIG. 9 is a block diagram illustrating a measurement device based on deep learning of tight frames according to an example of the present disclosure.

FIG. 10 is a flowchart illustrating a measurement method for a fundus image according to an example of the present disclosure.

FIG. 11 is a block diagram showing a measurement device for a fundus image according to an example of the present disclosure.
具体实施方式Detailed ways
以下,参考附图,详细地说明本公开的优选实施方式。在下面的说明中,对于相同的部件赋予相同的符号,省略重复的说明。另外,附图只是示意性的图,部件相互之间的尺寸的比例或者部件的形状等可以与实际的不同。需要说明的是,本公开中的术语“包括”和“具有”以及它们的任何变形,例如所包括或所具有的一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可以包括或具有没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。本公开所描述的所有方法可以以任何合适的顺序执行,除非在此另有指示或者与上下文明显矛盾。Hereinafter, preferred embodiments of the present disclosure will be described in detail with reference to the drawings. In the following description, the same reference numerals are given to the same components, and repeated descriptions are omitted. In addition, the drawings are only schematic diagrams, and the ratio of dimensions between components, the shape of components, and the like may be different from the actual ones. It should be noted that the terms "comprising" and "having" and any variations thereof in the present disclosure, such as a process, method, system, product or device that includes or has a series of steps or units, are not necessarily limited to the clearly listed instead, may include or have other steps or elements not explicitly listed or inherent to the process, method, product or apparatus. All methods described in this disclosure can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context.
The present disclosure relates to a measurement method and a measurement apparatus based on deep learning of tight frames, which can identify a target and improve the accuracy of target measurement. The measurement method involved in the present disclosure may also be referred to as a recognition method, an auxiliary measurement method, or the like. The measurement method involved in the present disclosure is applicable to any application scenario in which the width and/or height of a target in an image is to be measured accurately.
The measurement method involved in the present disclosure uses a network module trained on tight frames of targets to recognize targets and thereby realize measurement. A tight frame may be the minimum bounding rectangle of a target. In this case, the target touches the four sides of the tight frame and does not extend into the region outside the tight frame (that is, the target is tangent to the four sides of the tight frame). Thus, the tight frame can represent the width and the height of the target. In addition, training the network module on tight frames of targets reduces the time and labor cost of collecting pixel-level annotation data, while still allowing the network module to identify the tight frame of a target accurately.
The input image involved in the present disclosure may come from a camera, a CT scan, a PET-CT scan, a SPECT scan, MRI, ultrasound, X-ray, an angiogram, a fluorogram, an image captured by a capsule endoscope, or a combination thereof. In some examples, the input image may be an image of a tissue object (for example, a fundus image). In some examples, the input image may be a natural image, that is, an image observed or captured in a natural scene. Targets in natural images can thus be measured; for example, the size of a human face or the height of a pedestrian in a natural image can be measured. In the following, examples of the present disclosure are described by taking a fundus image acquired by a fundus camera as the input image, and such description does not limit the scope of the present disclosure.
FIG. 1 is a schematic diagram showing an application scenario of the measurement method based on deep learning of tight frames according to an example of the present disclosure. FIG. 2(a) is a schematic diagram showing a fundus image according to an example of the present disclosure. FIG. 2(b) is a schematic diagram showing a recognition result of the fundus image according to an example of the present disclosure. In some examples, the measurement method involved in the present disclosure can be applied in the application scenario shown in FIG. 1. In this scenario, an acquisition device 52 (for example, a camera) captures an image of the position of a target object 51 that contains the target, and this image serves as the input image (see FIG. 1). The input image is input into the network module 20 to identify the target in the input image and obtain the tight frame B of the target (see FIG. 1), and the target can then be measured based on the tight frame B. Taking a fundus image as an example, inputting the fundus image shown in FIG. 2(a) into the network module 20 yields the recognition result shown in FIG. 2(b). The recognition result may include tight frames for targets of two categories, the optic cup and the optic disc, where the tight frame B11 is the tight frame of the optic disc and the tight frame B12 is the tight frame of the optic cup. In this case, the optic cup and the optic disc can be measured based on their tight frames.
The network module 20 involved in the present disclosure may be multi-task based. In some examples, the network module 20 may be a neural network based on deep learning. In some examples, the network module 20 may include two tasks: one task may be a segmentation network 22 (described later) for image segmentation based on weakly supervised learning, and the other task may be a regression network 23 (described later) based on bounding-box regression.
In some examples, the segmentation network 22 may segment the input image to obtain targets (for example, the optic cup and/or the optic disc). In some examples, the segmentation network 22 may be based on multiple-instance learning (MIL) and supervised by tight frames. In some examples, the problem solved by the segmentation network 22 may be a multi-label classification problem. In some examples, the input image may contain targets of at least one category of interest (which may be referred to simply as a category). Thus, the segmentation network 22 can recognize input images containing targets of at least one category of interest. In some examples, the input image may also contain no target at all. In some examples, the number of targets of each category of interest may be one or more.
In some examples, the regression network 23 may be used to predict tight frames per category. In some examples, the regression network 23 may predict a tight frame by predicting the offsets of the tight frame relative to the positions of the individual pixel points of the input image.
In some examples, the network module 20 may further include a backbone network 21. The backbone network 21 may be used to extract a feature map of the input image (that is, the original image input into the network module 20). In some examples, the backbone network 21 may extract high-level features for object representation. In some examples, the resolution of the feature map may be consistent with that of the input image (that is, the feature map may be single-scale and of the same size as the input image). This can improve the accuracy of recognizing or measuring targets whose sizes do not vary greatly. In some examples, image features of different scales may be fused repeatedly to obtain a feature map whose scale is consistent with that of the input image. In some examples, the feature map may serve as the input of the segmentation network 22 and the regression network 23.
In some examples, the backbone network 21 may include an encoding module and a decoding module. In some examples, the encoding module may be configured to extract image features at different scales. In some examples, the decoding module may be configured to map the image features extracted at different scales back to the resolution of the input image to output the feature map. A feature map with the same resolution as the input image can thereby be obtained.
FIG. 3 is a schematic diagram showing one example of the network module 20 according to an example of the present disclosure. In some examples, as shown in FIG. 3, the network module 20 may include the backbone network 21, the segmentation network 22, and the regression network 23. The backbone network 21 may receive the input image and output the feature map. The feature map may serve as the input of the segmentation network 22 and the regression network 23 to obtain the corresponding outputs. Specifically, the segmentation network 22 may take the feature map as input to obtain a first output, and the regression network 23 may take the feature map as input to obtain a second output. In this case, the input image can be fed into the network module 20 to obtain the first output and the second output.
In some examples, the first output may be the result of image segmentation prediction. In some examples, the second output may be the result of bounding-box regression prediction.
In some examples, the first output may include the probability that each pixel point in the input image belongs to each category. In some examples, the probability that each pixel point belongs to each category may be obtained through an activation function. In some examples, the first output may be a matrix. In some examples, the size of the matrix corresponding to the first output may be M×N×C, where M×N may represent the resolution of the input image, M and N may correspond to the rows and columns of the input image respectively, and C may represent the number of categories. For example, for a fundus image whose targets belong to the two categories optic cup and optic disc, the size of the matrix corresponding to the first output may be M×N×2.
In some examples, the value corresponding to the pixel point at each position of the input image in the first output may be a vector, and the number of elements of the vector may be equal to the number of categories. For example, for the pixel point at the k-th position of the input image, the corresponding value in the first output may be a vector p_k; the vector p_k may include C elements, where C may be the number of categories. In some examples, the element values of the vector p_k may be numbers between 0 and 1.
In some examples, the second output may include the offsets between the position of each pixel point in the input image and the tight frame of the target of each category. That is, the second output may include tight-frame offsets for targets of specified categories; in other words, what the regression network 23 predicts may be the tight-frame offsets of targets of specified categories. In this case, even when targets of different categories overlap considerably, the tight frames of the targets of the respective categories can still be distinguished and obtained. Recognition or measurement of highly overlapping targets of different categories can thus be supported. In some examples, the offsets in the second output may be taken as the target offsets.
In some examples, the target offset may be a normalized offset. In some examples, the target offset may be an offset normalized based on the average size of the targets of each category. In some examples, the target offset may be an offset normalized based on the average width and the average height of the targets of each category. The target offset and the predicted offset (described later) may correspond to the real offset (described later). That is, if the real offsets are normalized when training the network module 20 (which may be referred to simply as the training stage), then the target offsets used when the network module 20 makes predictions (which may be referred to simply as the measurement stage) and the predicted offsets (corresponding to the training stage) are also normalized accordingly. This can improve the accuracy of recognizing or measuring targets whose sizes do not vary greatly.
In some examples, the average size of targets may be obtained by averaging the average width and the average height of the targets. In some examples, the average size of targets, or the average width and average height of targets, may be empirical values. In some examples, the average size of targets may be obtained by collecting statistics over the samples corresponding to the input images. In some examples, the widths and heights of the tight frames of targets in the label data of the samples may be averaged per category to obtain the average width and the average height. In some examples, the average width and the average height may be averaged to obtain the average size of the targets of that category. In some examples, the samples may be training samples (described later). The average width and average height, or the average size of targets, can thus be obtained from the training samples.
In some examples, the second output may be a matrix. In some examples, the size of the matrix corresponding to the second output may be M×N×A, where A may represent the size of all the target offsets, M×N may represent the resolution of the input image, and M and N may correspond to the rows and columns of the input image respectively. In some examples, if one target offset is a 4×1 vector (that is, it can be represented by 4 numbers), then A may be C×4, where C may represent the number of categories. For example, for a fundus image whose targets belong to the two categories optic cup and optic disc, the size of the matrix corresponding to the second output may be M×N×8.
In some examples, the value corresponding to the pixel point at each position of the input image in the second output may be a vector. For example, for the pixel point at the k-th position of the input image, the corresponding value in the second output may be expressed as v_k = [v_k1, v_k2, …, v_kC], where C may be the number of categories and each element of v_k may represent the target offset of the target of one category. The target offsets and the corresponding categories can thus be represented conveniently. In some examples, the elements of v_k may be 4-dimensional vectors.
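By way of a non-limiting illustration (not part of the original disclosure; the shapes, names, and random values below are hypothetical), the following Python sketch shows how the first and second outputs could be organized for a 256×256 fundus image with the two categories optic cup and optic disc:

```python
import numpy as np

M, N, C = 256, 256, 2  # resolution and number of categories (optic cup, optic disc)

# First output: per-pixel probability of belonging to each category,
# e.g. obtained from the segmentation head through a sigmoid activation.
first_output = np.random.rand(M, N, C)          # values in [0, 1]

# Second output: per-pixel, per-category tight-frame offsets (tl, tt, tr, tb),
# i.e. an M x N x (C*4) matrix, reshaped here as M x N x C x 4.
second_output = np.random.randn(M, N, C, 4)

# For the pixel at position k = (row, col), the first output is a C-vector p_k
# and the second output is a list of C four-dimensional offset vectors v_k.
row, col = 120, 140
p_k = first_output[row, col]        # shape (C,)
v_k = second_output[row, col]       # shape (C, 4)
print(p_k.shape, v_k.shape)
```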
In some examples, the backbone network 21 may be based on the U-net network. In this embodiment, the encoding module of the backbone network 21 may include unit layers and pooling layers. The decoding module of the backbone network 21 may include unit layers, up-sampling layers, and skip-connection units.
In some examples, a unit layer may include a convolutional layer, a batch normalization layer, and a rectified linear unit (ReLU) layer. In some examples, the pooling layers may be max-pooling layers. In some examples, the skip-connection units may be used to combine image features from deep layers with image features from shallow layers.
In addition, the segmentation network 22 may be a feed-forward neural network. In some examples, the segmentation network 22 may include a plurality of unit layers. In some examples, the segmentation network 22 may include a plurality of unit layers and convolutional layers (Conv). In addition, the regression network 23 may include dilated convolutional layers (Dilated Conv) and rectified linear unit layers. In some examples, the regression network 23 may include dilated convolutional layers, rectified linear unit layers, and convolutional layers.
FIG. 4 is a schematic diagram showing another example of the network module 20 according to an example of the present disclosure. It should be noted that, in order to describe the network structure of the network module 20 more clearly, the network layers in the network module 20 are distinguished in FIG. 4 by the numbers on the arrows: arrow 1 denotes a network layer composed of a convolutional layer, a batch normalization layer, and a rectified linear unit layer (that is, a unit layer); arrow 2 denotes a network layer composed of a dilated convolutional layer and a rectified linear unit layer; arrow 3 denotes a convolutional layer; arrow 4 denotes a max-pooling layer; arrow 5 denotes an up-sampling layer; and arrow 6 denotes a skip-connection unit.
As one example of the network module 20, as shown in FIG. 4, an input image with a resolution of 256×256 may be input into the network module 20. Image features are extracted through the unit layers (see arrow 1) and max-pooling layers (see arrow 4) at different levels of the encoding module, and image features of different scales are fused repeatedly through the unit layers (see arrow 1), up-sampling layers (see arrow 5), and skip-connection units (see arrow 6) at different levels of the decoding module to obtain a feature map 221 whose scale is consistent with that of the input image. The feature map 221 is then input into the segmentation network 22 and the regression network 23 respectively to obtain the first output and the second output.
In addition, as shown in FIG. 4, the segmentation network 22 may consist of unit layers (see arrow 1) followed by a convolutional layer (see arrow 3), and the regression network 23 may consist of a plurality of network layers each composed of a dilated convolutional layer and a rectified linear unit layer (see arrow 2), followed by a convolutional layer (see arrow 3). Here, a unit layer may consist of a convolutional layer, a batch normalization layer, and a rectified linear unit layer.
In some examples, the kernel size of the convolutional layers in the network module 20 may be set to 3×3. In some examples, the kernel size of the max-pooling layers in the network module 20 may be set to 2×2 with a stride of 2. In some examples, the scale factor of the up-sampling layers in the network module 20 may be set to 2. In some examples, as shown in FIG. 4, the dilation factors of the successive dilated convolutional layers in the network module 20 may be set to 1, 1, 2, 4, 8, and 16 (see the numbers above arrow 2). In some examples, as shown in FIG. 4, the number of max-pooling layers may be 5, so that the size of the input image is divisible by 32 (32 being 2 to the 5th power).
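As a non-limiting illustration of this overall structure, the following PyTorch-style sketch (the framework, module names, depth, and channel widths are assumptions chosen for brevity, not the original disclosure) builds a U-net-like backbone whose feature map matches the input resolution, a small segmentation head, and a dilated-convolution regression head:

```python
import torch
import torch.nn as nn

def unit(in_ch, out_ch):
    # "unit layer": 3x3 convolution + batch normalization + ReLU (arrow 1)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class Backbone(nn.Module):
    # U-net-like encoder/decoder; the depth and channel widths here are illustrative only.
    def __init__(self, in_ch=3, base=16):
        super().__init__()
        self.enc1 = unit(in_ch, base)
        self.enc2 = unit(base, base * 2)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)      # arrow 4
        self.up = nn.Upsample(scale_factor=2, mode="nearest")  # arrow 5
        self.dec1 = unit(base * 2, base)
        self.dec2 = unit(base * 2, base * 2)  # after the skip connection (arrow 6)

    def forward(self, x):
        f1 = self.enc1(x)                             # full resolution
        f2 = self.enc2(self.pool(f1))                 # 1/2 resolution
        d1 = self.dec1(self.up(f2))                   # back to full resolution
        return self.dec2(torch.cat([d1, f1], dim=1))  # feature map, same H x W as input

class SegmentationHead(nn.Module):
    # unit layer followed by a per-category convolution; sigmoid gives the first output
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.body = nn.Sequential(
            unit(in_ch, in_ch),
            nn.Conv2d(in_ch, num_classes, kernel_size=3, padding=1))

    def forward(self, feat):
        return torch.sigmoid(self.body(feat))   # first output: per-pixel probabilities

class RegressionHead(nn.Module):
    # stack of dilated convolutions (dilation 1, 1, 2, 4, 8, 16) + final convolution
    def __init__(self, in_ch, num_classes):
        super().__init__()
        layers = []
        for d in (1, 1, 2, 4, 8, 16):
            layers += [nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=d, dilation=d),
                       nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(in_ch, num_classes * 4, kernel_size=3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, feat):
        return self.body(feat)                   # second output: per-pixel, per-category offsets

# Assemble the network module and run a 256x256 image through it.
backbone, seg, reg = Backbone(), SegmentationHead(32, 2), RegressionHead(32, 2)
x = torch.randn(1, 3, 256, 256)
feat = backbone(x)
first_output, second_output = seg(feat), reg(feat)
print(feat.shape, first_output.shape, second_output.shape)
```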
As described above, the measurement method involved in the present disclosure uses the network module 20, trained on tight frames of targets, to recognize targets and thereby realize measurement. Hereinafter, the training method of the network module 20 involved in the present disclosure (which may be referred to simply as the training method) is described in detail with reference to the accompanying drawings. FIG. 5 is a flowchart showing the training method of the network module 20 according to an example of the present disclosure.
In some examples, the segmentation network 22 and the regression network 23 in the network module 20 may be trained simultaneously in an end-to-end manner.
In some examples, the segmentation network 22 and the regression network 23 in the network module 20 may be trained jointly so as to optimize the segmentation network 22 and the regression network 23 at the same time. In some examples, through joint training, the segmentation network 22 and the regression network 23 may adjust the network parameters of the backbone network 21 through back-propagation, so that the feature map output by the backbone network 21 can better express the features of the input image before being fed into the segmentation network 22 and the regression network 23. In this case, both the segmentation network 22 and the regression network 23 operate on the feature map output by the backbone network 21.
In some examples, the segmentation network 22 may be trained using multiple-instance learning. In some examples, the pixel points used for training the regression network 23 may be selected using the expected intersection-over-union corresponding to the pixel points of the image to be trained (described later).
In some examples, as shown in FIG. 5, the training method may include constructing training samples (step S120), inputting the training samples into the network module 20 to obtain prediction data (step S140), and determining the training loss of the network module 20 based on the training samples and the prediction data and optimizing the network module 20 based on the training loss (step S160). An optimized (which may also be called trained) network module 20 can thus be obtained.
In some examples, in step S120, the training samples may include input image data and label data. In some examples, the input image data may include a plurality of images to be trained; for example, an image to be trained may be a fundus image to be trained.
In some examples, the plurality of images to be trained may include images containing targets. In some examples, the plurality of images to be trained may include both images containing targets and images containing no target. In some examples, a target may belong to at least one category. In some examples, the number of targets of each category in an image to be trained may be greater than or equal to 1. For example, taking fundus images as an example, if the optic cup and the optic disc are to be recognized or measured, the targets in a fundus image may be one optic disc and one optic cup. The examples of the present disclosure do not specifically limit the number of targets, the categories to which the targets belong, or the number of targets of each category.
In some examples, the label data may include the gold standard of the category to which a target belongs (the gold-standard category may sometimes also be called the real category) and the gold standard of the tight frame of the target (the gold-standard tight frame may sometimes also be called the real tight frame). It should be noted that, unless otherwise specified, the tight frame of a target and the category to which a target belongs in the label data of the training method may be assumed to be gold standards by default.
In some examples, an annotation tool may be used to annotate the tight frame (that is, the minimum bounding rectangle) of a target in an image to be trained, and a corresponding category may be set for the tight frame to indicate the real category to which the target belongs, thereby obtaining the label data.
In some examples, in step S140, the network module 20 may obtain the prediction data corresponding to the training samples based on the input image data of the training samples. The prediction data may include the predicted segmentation data output by the segmentation network 22 and the predicted offsets output by the regression network 23.
In addition, the predicted segmentation data may correspond to the first output, and the predicted offsets may correspond to the second output (that is, to the target offsets). That is, the predicted segmentation data may include the probability that each pixel point in the image to be trained belongs to each category, and the predicted offsets may include the offsets between the position of each pixel point in the image to be trained and the tight frame of the target of each category. In some examples, corresponding to the target offsets, the predicted offsets may be offsets normalized based on the average size of the targets of each category. This can improve the accuracy of recognizing or measuring targets whose sizes do not vary greatly. Preferably, the sizes of multiple targets of the same category may differ from each other by less than a factor of 10; for example, the sizes of multiple targets of the same category may differ from each other by a factor of 1, 2, 3, 5, 7, 8, or 9. The accuracy of target recognition or measurement can thereby be further improved.
In order to describe more clearly the offset between the position of a pixel point and the tight frame of a target, as well as the normalized offset, they are described below with reference to formulas. It should be noted that the predicted offset, the target offset, and the real offset are all offsets of this kind, and the following formula (1) applies equally to all of them.
Specifically, let the position of a pixel point be denoted by (x, y), let a tight frame of a target corresponding to the pixel point be denoted by b = (xl, yt, xr, yb), and let the offset of the tight frame b of the target relative to the position of the pixel point (that is, the offset between the position of the pixel point and the tight frame of the target) be denoted by t = (tl, tt, tr, tb). Then tl, tt, tr, and tb may satisfy formula (1):
tl = (x − xl) / S_c1,
tt = (y − yt) / S_c2,
tr = (xr − x) / S_c1,
tb = (yb − y) / S_c2,
where (xl, yt) may represent the position of the upper-left corner of the tight frame of the target, (xr, yb) may represent the position of the lower-right corner of the tight frame of the target, c may represent the index of the category to which the target belongs, S_c1 may represent the average width of the targets of the c-th category, and S_c2 may represent the average height of the targets of the c-th category. Normalized offsets can thus be obtained. In some examples, S_c1 and S_c2 may both be the average size of the targets of the c-th category.
However, the examples of the present disclosure are not limited thereto. In other examples, the tight frame of a target may also be represented by the positions of its lower-left and upper-right corners, or by the position of any one corner together with a length and a width. In addition, in other examples, the normalization may also be performed in other ways; for example, the offsets may be normalized by the length and width of the tight frame of the target.
In addition, the pixel point in formula (1) may be a pixel point of the image to be trained or of the input image. That is, formula (1) is applicable both to the real offsets corresponding to the image to be trained in the training stage and to the target offsets corresponding to the input image in the measurement stage.
Specifically, for the training stage, the pixel point may be a pixel point in the image to be trained, the tight frame b of the target may be the gold standard of the tight frame of the target in the image to be trained, and the offset t is then the real offset (which may also be called the gold standard of the offset). The regression loss of the regression network 23 can subsequently be obtained based on the predicted offsets and the real offsets. In addition, if the pixel point is a pixel point in the image to be trained and the offset t is a predicted offset, the predicted tight frame of the target can be derived inversely from formula (1).
In addition, for the measurement stage, the pixel point may be a pixel point in the input image and the offset t may be the target offset; the tight frame of the target in the input image can then be derived inversely from formula (1) and the target offset (that is, the target offset and the position of the pixel point can be substituted into formula (1) to obtain the tight frame of the target). The tight frame of the target in the input image can thus be obtained.
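As a non-limiting illustration of formula (1) and its inverse (the function names and numerical values below are hypothetical), the following Python sketch computes the normalized offset of a tight frame relative to a pixel position and then recovers the tight frame from that offset, as would be done in the measurement stage:

```python
from typing import Tuple

def box_to_offset(x: float, y: float,
                  box: Tuple[float, float, float, float],
                  s_c1: float, s_c2: float) -> Tuple[float, float, float, float]:
    """Formula (1): normalized offset of the tight frame (xl, yt, xr, yb) relative to
    pixel position (x, y), given per-category average width s_c1 and height s_c2."""
    xl, yt, xr, yb = box
    return ((x - xl) / s_c1, (y - yt) / s_c2,
            (xr - x) / s_c1, (yb - y) / s_c2)

def offset_to_box(x: float, y: float,
                  offset: Tuple[float, float, float, float],
                  s_c1: float, s_c2: float) -> Tuple[float, float, float, float]:
    """Inverse of formula (1): recover the tight frame from a pixel position and a
    (target or predicted) offset."""
    tl, tt, tr, tb = offset
    return (x - tl * s_c1, y - tt * s_c2,
            x + tr * s_c1, y + tb * s_c2)

# Round trip with a hypothetical optic-disc tight frame and average size.
t = box_to_offset(120, 140, (100, 110, 180, 200), s_c1=80.0, s_c2=90.0)
print(t, offset_to_box(120, 140, t, 80.0, 90.0))
```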
In some examples, in step S160, the training loss of the network module 20 may be determined based on the label data, the predicted segmentation data, and the predicted offsets corresponding to the training samples, and the network module 20 may then be trained based on the training loss so as to optimize the network module 20.
As described above, the network module 20 may include the segmentation network 22 and the regression network 23. In some examples, the training loss of the network module 20 may be obtained based on the segmentation loss and the regression loss, and the network module 20 can thus be optimized based on the training loss. In some examples, the training loss may be the sum of the segmentation loss and the regression loss. In some examples, the segmentation loss may indicate the degree to which the pixel points of the image to be trained in the predicted segmentation data belong to the respective real categories, and the regression loss may indicate how close the predicted offsets are to the real offsets.
FIG. 6 is a schematic diagram showing positive bags according to an example of the present disclosure.
In some examples, the segmentation loss of the segmentation network 22 may be obtained based on the predicted segmentation data and the label data corresponding to the training samples. The segmentation loss thus drives the predicted segmentation data of the segmentation network 22 to approximate the label data. In some examples, the segmentation loss may be obtained using multiple-instance learning. In multiple-instance learning, a plurality of bags to be trained may be obtained per category based on the real tight frames of the targets in each image to be trained (that is, each category may correspond to a plurality of bags to be trained). The segmentation loss may be obtained based on the plurality of bags to be trained of each category. In some examples, the plurality of bags to be trained may include a plurality of positive bags and a plurality of negative bags. The segmentation loss can thus be obtained based on the positive bags and negative bags of multiple-instance learning. It should be noted that, unless otherwise specified, the positive bags and negative bags described below are all per category.
In some examples, the plurality of positive bags may be obtained based on the region inside the real tight frame of a target. As shown in FIG. 6, the region A2 in the image to be trained P1 is the region inside the real tight frame B21 of the target T1.
In some examples, all the pixel points on each of a plurality of straight lines connecting two opposite sides of the real tight frame of a target may be grouped into one positive bag (that is, one straight line may correspond to one positive bag). Specifically, the two ends of each straight line may lie on the upper and lower sides, or on the left and right sides, of the real tight frame. As an example, as shown in FIG. 6, the pixel points on each of the straight lines D1, D2, D3, D4, D5, D6, D7, and D8 may be grouped into one positive bag. However, the examples of the present disclosure are not limited thereto; in other examples, positive bags may also be formed in other ways. For example, the pixel points at specific positions of the real tight frame may be grouped into one positive bag.
In some examples, the plurality of straight lines may include at least one set of mutually parallel first parallel lines. For example, the plurality of straight lines may include one set of first parallel lines, two sets of first parallel lines, three sets of first parallel lines, four sets of first parallel lines, and so on. In some examples, the number of straight lines in a set of first parallel lines may be greater than or equal to 2.
In some examples, the plurality of straight lines may include at least one set of mutually parallel first parallel lines and, for each set of first parallel lines, a set of mutually parallel second parallel lines perpendicular to it. Specifically, if the plurality of straight lines includes one set of first parallel lines, the plurality of straight lines may further include one set of second parallel lines perpendicular to that set of first parallel lines; if the plurality of straight lines includes multiple sets of first parallel lines, the plurality of straight lines may further include multiple sets of second parallel lines, each perpendicular to the corresponding set of first parallel lines. As shown in FIG. 6, one set of first parallel lines may include the parallel straight lines D1 and D2, and the corresponding set of second parallel lines may include the parallel straight lines D3 and D4, where the straight line D1 may be perpendicular to the straight line D3; another set of first parallel lines may include the parallel straight lines D5 and D6, and the corresponding set of second parallel lines may include the parallel straight lines D7 and D8, where the straight line D5 may be perpendicular to the straight line D7. In some examples, the number of straight lines in each set of first parallel lines and second parallel lines may be greater than or equal to 2.
As described above, in some examples, the plurality of straight lines may include multiple sets of first parallel lines (that is, the plurality of straight lines may include parallel lines at different angles). In this case, positive bags at different angles can be formed to optimize the segmentation network 22, and the accuracy of the predicted segmentation data of the segmentation network 22 can thereby be improved.
In some examples, the angle of a set of first parallel lines may be the angle between the extension of a first parallel line and the extension of either side of the real tight frame that it does not intersect, and the angle of the first parallel lines may be greater than −90° and less than 90°. For example, the angle may be −89°, −75°, −50°, −25°, −20°, 0°, 10°, 20°, 25°, 50°, 75°, 89°, and so on. Specifically, if the included angle is formed by rotating the extension of the non-intersected side clockwise by less than 90° to reach the extension of the first parallel line, the angle may be greater than 0° and less than 90°; if it is formed by rotating the extension of the non-intersected side counterclockwise by less than 90° (that is, clockwise by more than 270°), the angle may be greater than −90° and less than 0°; and if the non-intersected side is parallel to the first parallel lines, the angle may be 0°. As shown in FIG. 6, the angle of the straight lines D1, D2, D3, and D4 may be 0°, and the angle of the straight lines D5, D6, D7, and D8 (that is, the angle C1) may be 25°. In some examples, the angle of the first parallel lines may be a hyperparameter that can be optimized during training.
Alternatively, the angle of the first parallel lines may also be described in terms of a rotation of the image to be trained; the angle of the first parallel lines is then the rotation angle. Specifically, the angle of the first parallel lines may be the rotation angle by which the image to be trained is rotated so that any side of the image to be trained that does not intersect the first parallel lines becomes parallel to the first parallel lines, where the angle of the first parallel lines may be greater than −90° and less than 90°, a clockwise rotation may be a positive angle, and a counterclockwise rotation may be a negative angle. However, the examples of the present disclosure are not limited thereto; in other examples, depending on how the angle of the first parallel lines is described, the angle may also lie in other ranges. For example, if the description is based on the side of the real tight frame that the first parallel lines do intersect, the angle of the first parallel lines may be greater than 0° and less than 180°.
In some examples, a plurality of negative bags may be obtained based on the region outside the real tight frames of the targets. As shown in FIG. 6, the region A1 in the image to be trained P1 is the region outside the real tight frame B21 of the target T1. In some examples, a negative bag may be a single pixel point in the region outside the real tight frames of all targets of a category (that is, one pixel point may correspond to one negative bag).
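As a non-limiting illustration of this bag construction (not part of the original disclosure; it assumes first and second parallel lines at an angle of 0°, i.e. the rows and columns crossing the tight frame, and hypothetical function names), a Python sketch may look as follows; rotated lines would be handled analogously by rotating the image to be trained first:

```python
import numpy as np

def build_bags(image_shape, tight_boxes):
    """Build MIL bags for one category.
    image_shape: (H, W) of the image to be trained.
    tight_boxes: list of real tight frames (xl, yt, xr, yb) in integer pixels.
    Returns (positive_bags, negative_bags), where each bag is a list of (y, x) pixels.
    """
    H, W = image_shape
    inside = np.zeros((H, W), dtype=bool)
    positive_bags = []
    for xl, yt, xr, yb in tight_boxes:
        inside[yt:yb + 1, xl:xr + 1] = True
        # one positive bag per row crossing the box (connects the left and right sides)
        for y in range(yt, yb + 1):
            positive_bags.append([(y, x) for x in range(xl, xr + 1)])
        # one positive bag per column crossing the box (connects the top and bottom sides)
        for x in range(xl, xr + 1):
            positive_bags.append([(y, x) for y in range(yt, yb + 1)])
    # every pixel outside all tight frames of this category is its own negative bag
    negative_bags = [[(y, x)] for y, x in zip(*np.where(~inside))]
    return positive_bags, negative_bags

pos, neg = build_bags((64, 64), [(10, 12, 30, 40)])
print(len(pos), len(neg))
```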
As described above, in some examples, the segmentation loss may be obtained based on the plurality of bags to be trained of each category. In some examples, the segmentation loss may include a unary term (which may also be called a unary loss) and a pairwise term (which may also be called a pairwise loss). In some examples, the unary term may describe the degree to which each bag to be trained belongs to its real category. In this case, the unary loss allows the tight frame to constrain the prediction through the positive bags and the negative bags simultaneously. In some examples, the pairwise term may describe the degree to which a pixel point of the image to be trained and its neighboring pixel points belong to the same category. In this case, the pairwise loss smooths the predicted segmentation result.
In some examples, a per-category segmentation loss may be obtained for each category, and the segmentation loss (that is, the total segmentation loss) may be obtained based on the per-category segmentation losses. In some examples, the total segmentation loss L_seg may satisfy the formula:
$$L_{seg} = \sum_{c=1}^{C} L_c,$$
where L_c may represent the segmentation loss of category c, and C may represent the number of categories. For example, if the optic cup and the optic disc in a fundus image are recognized, C may be 2; if only the optic cup or only the optic disc is recognized, C may be 1.
In some examples, the segmentation loss L_c of category c may satisfy the formula:
$$L_c = \varphi_c\!\left(P;\ \mathcal{B}^{+},\ \mathcal{B}^{-}\right) + \lambda\,\psi_c\!\left(P\right),$$
where φ_c may represent the unary term, ψ_c may represent the pairwise term, P may represent the degree (which may also be called the probability) predicted by the segmentation network 22 that each pixel point belongs to each category, B⁺ may represent the set of positive bags, B⁻ may represent the set of negative bags, and λ may represent a weight factor. The weight factor λ may be a hyperparameter that can be optimized during training. In some examples, the weight factor λ may be used to balance the two losses (that is, the unary term and the pairwise term) against each other.
In general, in multiple-instance learning, if each positive bag of a category contains at least one pixel point belonging to that category, the pixel point with the highest probability of belonging to that category in each positive bag may be taken as a positive sample of that category; if no pixel point belonging to the category exists in any negative bag of that category, then even the pixel point with the highest probability in a negative bag is a negative sample of that category. Based on this, in some examples, the unary term φ_c corresponding to category c may satisfy the formula:
$$\varphi_c = -\frac{1}{\left|\mathcal{B}^{+}\right|}\sum_{b\in\mathcal{B}^{+}}\left(1-P_c(b)\right)^{\gamma}\log P_c(b)\;-\;\beta\sum_{b\in\mathcal{B}^{-}}P_c(b)^{\gamma}\log\!\left(1-P_c(b)\right),$$
where P_c(b) may represent the probability that a bag to be trained belongs to category c (which may also be called the degree of belonging to category c, or the probability of the bag to be trained), b may represent a bag to be trained, B⁺ may represent the set of positive bags, B⁻ may represent the set of negative bags, max may represent the maximum-value function, |B⁺| may represent the cardinality of the set of positive bags (that is, the number of elements of the set), β may represent a weight factor, and γ may represent a focusing parameter. In some examples, the value of the unary term is smallest when P_c(b) equals 1 for the positive bags and P_c(b) equals 0 for the negative bags; that is, the unary loss is then minimal. In some examples, the weight factor β may be between 0 and 1. In some examples, the focusing parameter γ may be greater than or equal to 0.
In some examples, P_c(b) may be the maximum, over the pixel points of a bag to be trained, of the probability of belonging to category c. In some examples, P_c(b) may satisfy the formula P_c(b) = max_{k∈b}(p_kc), where p_kc may represent the probability that the pixel point at the k-th position of the bag to be trained b belongs to category c.
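The unary term as reconstructed above can be written down directly. The following Python sketch (illustrative only; it uses the hard maximum for P_c(b), a toy 2×2 image, and assumed values for β and γ) computes it for one category from per-pixel probabilities and lists of positive and negative bags:

```python
import numpy as np

def unary_term(prob_c, positive_bags, negative_bags, beta=0.25, gamma=2.0, eps=1e-6):
    """Unary MIL loss for one category c.
    prob_c: (H, W) array of predicted probabilities of belonging to category c.
    positive_bags / negative_bags: lists of bags, each bag a list of (y, x) pixels.
    """
    loss = 0.0
    # positive bags: the most confident pixel of each bag should approach 1
    for bag in positive_bags:
        p = max(prob_c[y, x] for y, x in bag)          # P_c(b) = max_{k in b} p_kc
        loss += -((1.0 - p) ** gamma) * np.log(p + eps)
    loss /= max(len(positive_bags), 1)
    # negative bags: even the most confident pixel of each bag should approach 0
    for bag in negative_bags:
        p = max(prob_c[y, x] for y, x in bag)
        loss += -beta * (p ** gamma) * np.log(1.0 - p + eps)
    return float(loss)

# Toy example: one positive bag (first row) and one negative bag (a single pixel).
prob_c = np.array([[0.9, 0.2], [0.1, 0.05]])
print(unary_term(prob_c, [[(0, 0), (0, 1)]], [[(1, 1)]]))
```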
In some examples, the maximum probability, among the pixel points of a bag to be trained, of belonging to a category (that is, P_c(b)) may be obtained based on a smooth maximum approximation function. A more stable maximum probability can thereby be obtained.
In some examples, the smooth maximum approximation function may be at least one of the α-softmax function and the α-quasimax function. In some examples, for the maximum-value function f(x) = max_{1≤i≤n} x_i, max may represent the maximum-value function, n may represent the number of elements (which may correspond to the number of pixel points in the bag to be trained), and x_i may represent the value of an element (which may correspond to the probability that the pixel point at the i-th position of the bag to be trained belongs to a category). In this case, the α-softmax function may satisfy the formula:
$$f_{\alpha}(x)=\frac{\sum_{i=1}^{n} x_i\, e^{\alpha x_i}}{\sum_{i=1}^{n} e^{\alpha x_i}},$$
where α may be a constant. In some examples, the larger α is, the closer the result is to the maximum value given by the maximum-value function. In addition, the α-quasimax function may satisfy the formula:
$$f_{\alpha}(x)=\frac{1}{\alpha}\log\!\left(\sum_{i=1}^{n} e^{\alpha x_i}\right)-\frac{\log n}{\alpha},$$
where α may be a constant. In some examples, the larger α is, the closer the result is to the maximum value given by the maximum-value function.
As described above, in some examples, the pairwise term may describe the degree to which a pixel point of the image to be trained and its neighboring pixel points belong to the same category; that is, the pairwise term may evaluate how close the probabilities of neighboring pixel points belonging to the same category are to each other. In some examples, the pairwise term ψ_c corresponding to category c may satisfy the formula:
$$\psi_c=\frac{1}{\left|\varepsilon\right|}\sum_{\left(k,k'\right)\in\varepsilon}\left(p_{kc}-p_{k'c}\right)^{2},$$
where ε may represent the set of all pairs of neighboring pixel points, (k, k′) may represent a pair of neighboring pixel points, k and k′ may represent the positions of the two pixel points of the pair, p_kc may represent the probability that the pixel point at the k-th position belongs to category c, and p_k′c may represent the probability that the pixel point at the k′-th position belongs to category c. In some examples, the neighboring pixel points may be those of an 8-neighborhood or a 4-neighborhood. In some examples, the neighboring pixel points of each pixel point in the image to be trained may be collected to obtain the set of pairs of neighboring pixel points.
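A direct implementation of this smoothness term (following the reconstruction above; a 4-neighborhood is assumed here for brevity) could look like the following Python sketch:

```python
import numpy as np

def pairwise_term(prob_c):
    """Mean squared difference of category-c probabilities over neighboring pixel pairs.
    The 4-neighborhood is covered by taking each pixel's right and down neighbors once."""
    dx = (prob_c[:, 1:] - prob_c[:, :-1]) ** 2   # horizontal neighbor pairs
    dy = (prob_c[1:, :] - prob_c[:-1, :]) ** 2   # vertical neighbor pairs
    num_pairs = dx.size + dy.size
    return float((dx.sum() + dy.sum()) / num_pairs)

prob_c = np.random.rand(64, 64)
print(pairwise_term(prob_c))
```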
As described above, the training loss may include the regression loss. In some examples, the regression loss of the regression network 23 may be obtained based on the predicted offsets corresponding to the training samples and the real offsets corresponding to the label data. In this case, the regression loss drives the predicted offsets of the regression network 23 to approximate the real offsets.
In some examples, the real offset may be the offset between the position of a pixel point of the image to be trained and the real tight frame of the target in the label data. In some examples, corresponding to the predicted offsets, the real offsets may be offsets normalized based on the average size of the targets of each category. For details, refer to the description of offsets in formula (1) above.
In some examples, corresponding pixel points may be selected from the pixel points in the image to be trained as positive samples for training the regression network 23. That is, the regression network 23 may be optimized using the positive samples. Specifically, the regression loss may be obtained based on the positive samples, and the regression network 23 may then be optimized using the regression loss.
In some examples, the regression loss may satisfy the formula:
$$L_{reg}=\sum_{c=1}^{C}\frac{1}{M_c}\sum_{i=1}^{M_c} s\!\left(t_{ic}-v_{ic}\right),$$
where C may represent the number of categories, M_c may represent the number of positive samples of the c-th category, t_ic may represent the real offset corresponding to the i-th positive sample of the c-th category, v_ic may represent the predicted offset corresponding to the i-th positive sample of the c-th category, and s(x) may represent the sum of the smooth L1 losses of all the elements of x. In some examples, for x = t_ic − v_ic, s(t_ic − v_ic) may represent, computed with the smooth L1 loss, the degree to which the predicted offset of the i-th positive sample of the c-th category agrees with the real offset of that positive sample. Here, a positive sample may be a pixel point of the image to be trained that is selected for training the regression network 23 (that is, for computing the regression loss). The regression loss can thus be obtained.
In some examples, the real offset corresponding to a positive sample may be the offset corresponding to a real tight frame. In some examples, the real offset corresponding to a positive sample may be the offset corresponding to its matching tight frame (described later). This accommodates the case in which a positive sample falls within a plurality of real tight frames.
In some examples, the smooth L1 loss function may satisfy the formula:
$$\mathrm{smooth}_{L1}(x)=\begin{cases}0.5\,\sigma^{2}x^{2}, & \left|x\right|<1/\sigma^{2}\\ \left|x\right|-0.5/\sigma^{2}, & \text{otherwise,}\end{cases}$$
where σ may represent a hyperparameter used to switch between the smooth L1 loss function and the smooth L2 loss function, and x may represent the variable of the smooth L1 loss function.
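Putting the two formulas together, the following Python sketch (a non-limiting illustration assuming the reconstructed forms above, with hypothetical σ and random data) computes the regression loss over the positive samples of each category:

```python
import numpy as np

def smooth_l1(x, sigma=3.0):
    """Element-wise smooth L1 loss with the sigma switch point."""
    x = np.abs(x)
    quadratic = x < 1.0 / sigma ** 2
    return np.where(quadratic, 0.5 * (sigma * x) ** 2, x - 0.5 / sigma ** 2)

def regression_loss(true_offsets, pred_offsets):
    """true_offsets / pred_offsets: one array of shape (M_c, 4) per category c,
    holding the real and predicted offsets of the M_c positive samples."""
    loss = 0.0
    for t_c, v_c in zip(true_offsets, pred_offsets):
        if len(t_c) == 0:
            continue
        # s(t - v): sum of smooth L1 over the 4 offset elements, averaged over the M_c samples
        loss += smooth_l1(t_c - v_c).sum(axis=1).mean()
    return float(loss)

t = [np.random.randn(5, 4), np.random.randn(3, 4)]        # two categories
v = [a + 0.1 * np.random.randn(*a.shape) for a in t]      # slightly perturbed predictions
print(regression_loss(t, v))
```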
如上所述,在一些示例中,可以从待训练图像中的像素点选择相应的像素点作为正样本对回归网络23进行训练。As mentioned above, in some examples, corresponding pixel points in the image to be trained may be selected as positive samples to train the regression network 23 .
在一些示例中,正样本可以是待训练图像中至少落入一个目标的真实紧框标内的像素点(也即,可以从待训练图像中选择至少落入一个目标的真实紧框标内的像素点作为正样本)。在这种情况下,基于落入至少一个目标的真实紧框标内的像素点对回归网络23进行优化,能够提高回归网络23优化的效率。在一些示例中,可以按类别从待训练图像中选择至少落入一个目标的真实紧框标内的像素点作为各个类别的正样本。在一些示例中,基于各个类别的正样本可以获取各个类别的回归损失。In some examples, the positive samples can be the pixels in the image to be trained that fall into at least one real tight frame of the target (that is, the pixels that fall into the real tight frame of at least one target can be selected from the image to be trained pixel as a positive sample). In this case, optimizing the regression network 23 based on the pixels falling within the true tight frame of at least one object can improve the efficiency of the regression network 23 optimization. In some examples, pixels falling within at least one real tight bounding box of an object may be selected from the image to be trained by category as positive samples of each category. In some examples, the regression loss of each category can be obtained based on the positive samples of each category.
As described above, pixels falling within the true tight box of at least one target can be selected from the image to be trained, by category, as the positive samples of each category. In some examples, the positive samples of each category may be further screened, and the regression network 23 may be optimized based on the screened positive samples. That is, the positive samples used to compute the regression loss may be the screened positive samples.
In some examples, after the positive samples of each category are obtained (that is, after the pixels falling within the true tight box of at least one target are selected from the image to be trained as positive samples), the matching tight box of each positive sample can be obtained, and the positive samples of each category can then be screened based on the matching tight boxes. The regression network 23 can thus be optimized using the positive samples of each category screened based on the matching tight boxes.
In some examples, the true tight boxes that a pixel (for example, a positive sample) falls into can be screened to obtain the matching tight box of that pixel. In some examples, the matching tight box may be, among the true tight boxes that a pixel of the image to be trained falls into, the one whose true offset relative to the position of that pixel is smallest. For a positive sample, the matching tight box may be, among the true tight boxes that the positive sample falls into, the one whose true offset relative to the position of the positive sample is smallest.
Specifically, within one category, if a pixel (for example, a positive sample) falls within the true tight box of only one object to be measured, that true tight box is taken as the matching tight box (that is, the matching tight box may be the true tight box the pixel falls into); if the pixel falls within the true tight boxes of multiple objects to be measured, then among those true tight boxes, the one whose true offset relative to the position of the pixel is smallest may be taken as the matching tight box. The matching tight box corresponding to the pixel can thus be obtained.
In some examples, the smallest true offset (that is, the true tight box with the smallest true offset) can be obtained by comparing the L1 norms of the true offsets. In this case, the smallest true offset can be obtained based on the L1 norm, and the matching tight box can then be obtained. Specifically, for each of the multiple true offsets, the absolute values of its elements may be summed to obtain an offset value, and the true offset whose offset value is smallest, found by comparing these offset values, is taken as the smallest true offset.
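A minimal sketch of this L1-norm comparison is given below (the function name and the (K, 4) array layout are illustrative assumptions):

```python
import numpy as np

def matching_tight_box_index(true_offsets):
    # true_offsets: (K, 4) array of the true offsets of one pixel with respect to
    # the K true tight boxes it falls into; returns the index of the matching
    # tight box, i.e. the box whose true offset has the smallest L1 norm.
    offsets = np.asarray(true_offsets, dtype=float)
    l1_norms = np.abs(offsets).sum(axis=1)   # |tl| + |tt| + |tr| + |tb| per box
    return int(np.argmin(l1_norms))
```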
In some examples, the positive samples of each category may be screened using the expected intersection-over-union (IoU) corresponding to each pixel (for example, each positive sample). In this case, pixels far from the center of the true tight box or the matching tight box can be filtered out. This reduces the adverse effect of off-center pixels on the optimization of the regression network 23 and improves the efficiency of the optimization.
In some examples, the expected IoU corresponding to a positive sample can be obtained based on its matching tight box, and the positive samples of each category can be screened based on the expected IoU. Specifically, after the positive samples of each category are obtained, the matching tight box of each positive sample can be obtained, the expected IoU of the positive sample can be obtained based on the matching tight box, the positive samples of each category can be screened based on the expected IoU, and finally the regression network 23 can be optimized using the screened positive samples of each category. However, the examples of the present disclosure are not limited thereto. In some examples, the pixels of the image to be trained may be screened by category using the expected IoU corresponding to each pixel (that is, the pixels of the image to be trained may be screened using the expected IoU without first selecting, as positive samples, the pixels that fall within the true tight box of at least one target). In addition, pixels that do not fall within any true tight box (that is, pixels that have no matching tight box) can be marked, which facilitates screening them out later; for example, the expected IoU of such a pixel may be set to 0 to mark it. Specifically, the pixels of the image to be trained can be screened by category based on their expected IoU, and the regression network 23 can be optimized based on the screened pixels.
In some examples, the pixels whose expected IoU is greater than a preset expected IoU may be selected from the pixels of the image to be trained to optimize the regression network 23. In some examples, the positive samples whose expected IoU is greater than a preset expected IoU may be selected from the positive samples of each category to optimize the regression network 23. Pixels (for example, positive samples) satisfying the preset expected IoU can thus be obtained. In some examples, the preset expected IoU may be greater than 0 and less than or equal to 1, for example 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, or 1. In some examples, the preset expected IoU may be a hyperparameter and may be adjusted during the training of the regression network 23.
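For illustration, this threshold screening might look like the following sketch (the threshold value of 0.7 and the data layout are assumptions of the sketch; as noted above, the preset expected IoU is a tunable hyperparameter):

```python
def filter_by_expected_iou(pixels, expected_ious, preset_eiou=0.7):
    # pixels: candidate positive samples, e.g. a list of (x, y) tuples;
    # expected_ious: their expected IoU values, with 0 used for pixels that have
    # no matching tight box; only pixels above the preset threshold are kept.
    return [p for p, eiou in zip(pixels, expected_ious) if eiou > preset_eiou]
```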
In some examples, the expected IoU corresponding to a pixel (for example, a positive sample) can be obtained based on the matching tight box of that pixel. In some examples, if a pixel has no matching tight box, the pixel may be ignored, or its expected IoU may be set to 0. In this case, pixels without a matching tight box are not used for training the regression network 23, or their contribution to the regression loss is reduced. It should be noted that, unless otherwise specified, the following description of the expected IoU of a pixel also applies to the expected IoU of a positive sample.
In some examples, the expected IoU may be the maximum of the intersection-over-union (IoU) values between the matching tight box of a pixel and multiple boxes constructed with that pixel as their center. The expected IoU can thus be obtained. However, the examples of the present disclosure are not limited thereto; in other examples, the expected IoU may be the maximum of the IoU values between the true tight box of a pixel and multiple boxes constructed with that pixel as their center. In some examples, multiple boxes may be constructed with a pixel of the image to be trained as their center point, and the maximum of the IoU values between these boxes and the matching tight box of the pixel may be taken as the expected IoU. In some examples, the multiple boxes may differ in size; specifically, each of the boxes may differ from the other boxes in width or height.
FIG. 7 is a schematic diagram showing boxes constructed around a pixel according to an example of the present disclosure. To describe the expected IoU more clearly, the following description refers to FIG. 7. As shown in FIG. 7, pixel M1 has a matching tight box B31, and box B32 is an exemplary box constructed with pixel M1 as its center.
In some examples, let W be the width of the matching tight box, H be its height, and (r_1 W, r_2 H) denote the position of the pixel, where r_1, r_2 are the relative position of the pixel within the matching tight box and satisfy 0 < r_1, r_2 < 1. Multiple boxes can be constructed around the pixel. As an example, as shown in FIG. 7, the position of pixel M1 can be expressed as (r_1 W, r_2 H), and the width and height of the matching tight box B31 can be W and H, respectively.
In some examples, the matching tight box can be divided into four regions by its two center lines: an upper-left region, an upper-right region, a lower-left region, and a lower-right region. For example, as shown in FIG. 7, the center lines D9 and D10 of the matching tight box B31 can divide it into an upper-left region A3, an upper-right region A4, a lower-left region A5, and a lower-right region A6.
The expected IoU is described below taking a pixel in the upper-left region (that is, r_1, r_2 satisfying 0 < r_1, r_2 ≤ 0.5) as an example. For example, as shown in FIG. 7, pixel M1 may be a point in the upper-left region A3.
First, multiple boxes centered on the pixel are constructed. Specifically, for r_1, r_2 satisfying 0 < r_1, r_2 ≤ 0.5, the four boundary conditions corresponding to pixel M1 may be:
w_1 = 2 r_1 W, h_1 = 2 r_2 H;
w_2 = 2 r_1 W, h_2 = 2 (1 - r_2) H;
w_3 = 2 (1 - r_1) W, h_3 = 2 r_2 H;
w_4 = 2 (1 - r_1) W, h_4 = 2 (1 - r_2) H;
where w_1 and h_1 denote the width and height under the first boundary condition, w_2 and h_2 those under the second boundary condition, w_3 and h_3 those under the third boundary condition, and w_4 and h_4 those under the fourth boundary condition.
Next, the IoU between the box under each boundary condition and the matching tight box is computed. Specifically, the IoU values corresponding to the four boundary conditions above can satisfy formula (2):
IoU_1(r_1, r_2) = 4 r_1 r_2,
IoU_2(r_1, r_2) = 2 r_1 / (2 r_1 (1 - 2 r_2) + 1),
IoU_3(r_1, r_2) = 2 r_2 / (2 r_2 (1 - 2 r_1) + 1),
IoU_4(r_1, r_2) = 1 / (4 (1 - r_1)(1 - r_2)),
where IoU_1(r_1, r_2), IoU_2(r_1, r_2), IoU_3(r_1, r_2), and IoU_4(r_1, r_2) denote the IoU values corresponding to the first, second, third, and fourth boundary conditions, respectively. The IoU corresponding to each boundary condition can thus be obtained.
Finally, the largest of the IoU values under the multiple boundary conditions is the expected IoU. In some examples, for r_1, r_2 satisfying 0 < r_1, r_2 ≤ 0.5, the expected IoU can satisfy formula (3):
EIoU(r_1, r_2) = max{IoU_1(r_1, r_2), IoU_2(r_1, r_2), IoU_3(r_1, r_2), IoU_4(r_1, r_2)}.
In addition, the expected IoU of pixels located in the other regions (that is, the upper-right, lower-left, and lower-right regions) can be obtained by a method similar to that for the upper-left region. In some examples, for r_1 satisfying 0.5 ≤ r_1 < 1, r_1 in formula (3) may be replaced by 1 - r_1, and for r_2 satisfying 0.5 ≤ r_2 < 1, r_2 in formula (3) may be replaced by 1 - r_2. The expected IoU of pixels in the other regions can thus be obtained. That is, pixels in the other regions can be mapped to the upper-left region by a coordinate transformation, and the expected IoU can then be obtained in the same way as for the upper-left region. Therefore, for r_1, r_2 satisfying 0 < r_1, r_2 < 1, the expected IoU can satisfy formula (4):
EIoU(r_1, r_2) = max{IoU_1(r_1', r_2'), IoU_2(r_1', r_2'), IoU_3(r_1', r_2'), IoU_4(r_1', r_2')}, where r_1' = min(r_1, 1 - r_1) and r_2' = min(r_2, 1 - r_2),
where IoU_1(r_1, r_2), IoU_2(r_1, r_2), IoU_3(r_1, r_2), and IoU_4(r_1, r_2) can be obtained from formula (2). The expected IoU can thus be obtained.
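A sketch of the closed-form computation of formulas (2) to (4) is given below (pure Python; the function name expected_iou is illustrative):

```python
def expected_iou(r1: float, r2: float) -> float:
    # Expected IoU for a pixel at relative position (r1, r2) inside its matching
    # tight box, 0 < r1, r2 < 1, following formulas (2)-(4) above.
    # Map pixels in the other three quadrants onto the upper-left one.
    r1, r2 = min(r1, 1.0 - r1), min(r2, 1.0 - r2)
    iou1 = 4.0 * r1 * r2
    iou2 = 2.0 * r1 / (2.0 * r1 * (1.0 - 2.0 * r2) + 1.0)
    iou3 = 2.0 * r2 / (2.0 * r2 * (1.0 - 2.0 * r1) + 1.0)
    iou4 = 1.0 / (4.0 * (1.0 - r1) * (1.0 - r2))
    return max(iou1, iou2, iou3, iou4)
```

For example, a pixel at the center of its matching tight box (r_1 = r_2 = 0.5) yields an expected IoU of 1, and the value decreases as the pixel moves toward a corner of the box.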
As described above, in some examples, the expected IoU of a pixel (for example, a positive sample) can be obtained based on the matching tight box of that pixel. However, the examples of the present disclosure are not limited thereto. In other examples, when screening the positive samples of each category or the pixels of the image to be trained, the matching tight box need not be obtained. Specifically, the expected IoU of a pixel (for example, a positive sample) can be obtained based on the true tight boxes corresponding to the pixel, and the pixels of each category can be screened based on the expected IoU. In this case, the expected IoU may be the maximum of the expected IoU values corresponding to the individual true tight boxes. For obtaining the expected IoU of a pixel based on a true tight box, reference may be made to the description of obtaining the expected IoU of a pixel based on its matching tight box.
Hereinafter, the measurement method according to the present disclosure is described in detail with reference to the drawings. The network module 20 involved in the measurement method can be trained by the training method described above. FIG. 8 is a flowchart showing a measurement method based on deep learning with tight boxes according to an example of the present disclosure.
In some examples, as shown in FIG. 8, the measurement method may include acquiring an input image (step S220), inputting the input image into the network module 20 to obtain a first output and a second output (step S240), and identifying targets based on the first output and the second output to obtain the tight boxes of the targets of each category (step S260).
In some examples, in step S220, the input image may include at least one target. In some examples, the at least one target may belong to at least one category of interest (a category of interest may be referred to simply as a category). Specifically, if the input image includes one target, that target may belong to one category of interest; if the input image includes multiple targets, those targets may belong to at least one category of interest. In some examples, the input image may also contain no target. In this case, an input image in which no target exists can be judged as such.
In some examples, in step S240, the first output may include the probability that each pixel of the input image belongs to each category. In some examples, the second output may include the offset between the position of each pixel of the input image and the tight box of the target of each category. In some examples, the offsets in the second output may be taken as the target offsets. In some examples, the network module 20 may include a backbone network 21, a segmentation network 22, and a regression network 23. In some examples, the segmentation network 22 may perform image segmentation based on weakly supervised learning. In some examples, the regression network 23 may be based on bounding-box regression. In some examples, the backbone network 21 may be used to extract a feature map of the input image. In some examples, the segmentation network 22 may take the feature map as input to obtain the first output, and the regression network 23 may take the feature map as input to obtain the second output. In some examples, the resolution of the feature map may be consistent with that of the input image. For details, refer to the description of the network module 20.
As described above, the first output may include the probability that each pixel of the input image belongs to each category, and the second output may include the offset between the position of each pixel of the input image and the tight box of the target of each category. In some examples, in step S260, the target offset of the corresponding category at the corresponding position may be selected from the second output based on the first output, and the tight box of the target of each category may be obtained based on that target offset. The target can subsequently be measured precisely based on its tight box.
In some examples, the positions of the pixels with the largest local probability of belonging to each category may be obtained from the first output as first positions, and the tight boxes of the targets of each category may be obtained based on the target offsets of the corresponding category at the positions in the second output corresponding to the first positions. In this case, one or more targets of each category can be identified. In some examples, non-maximum suppression (NMS) may be used to obtain the first positions. In some examples, the number of first positions corresponding to each category may be greater than or equal to 1. However, the examples of the present disclosure are not limited thereto. For an input image with only one target per category, in some examples, the position of the pixel with the largest probability of belonging to each category may be obtained from the first output as the first position, and the tight box of the target of each category may be obtained based on the target offset of the corresponding category at the position in the second output corresponding to the first position. That is, the first position can be obtained by taking the maximum. In some examples, the first position may also be obtained by soft non-maximum suppression.
In some examples, the tight box of the target of each category can be obtained based on the first position and the target offset. In some examples, the first position and the target offset can be substituted into formula (1) to recover the target's tight box. Specifically, the first position can be used as the pixel position (x, y) in formula (1) and the target offset as the offset t, so as to obtain the target's tight box b.
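As an illustration of this inference step, the following sketch selects, for each category, the pixel with the largest probability (the single-target case) and inverts the normalized offset into a tight box. The probability threshold, array shapes, and function names are assumptions of the sketch rather than details taken from the disclosure:

```python
import numpy as np

def decode_tight_box(x, y, offset, mean_w, mean_h):
    # Invert the normalized offset t = (tl, tt, tr, tb) predicted at pixel (x, y)
    # into a tight box b = (xl, yt, xr, yb); mean_w and mean_h play the role of
    # S_c1 and S_c2, the average width and height of the class.
    tl, tt, tr, tb = offset
    return (x - tl * mean_w, y - tt * mean_h, x + tr * mean_w, y + tb * mean_h)

def detect_single_target_per_class(prob_map, offset_map, mean_sizes, prob_thresh=0.5):
    # prob_map: (C, H, W) array from the first output; offset_map: (C, 4, H, W)
    # array from the second output; mean_sizes: list of (mean_w, mean_h) per class.
    # A plain arg-max per class is used here instead of NMS; prob_thresh is an
    # assumed cut-off for deciding that a class is absent from the image.
    boxes = []
    for c in range(prob_map.shape[0]):
        flat_idx = int(np.argmax(prob_map[c]))
        y, x = np.unravel_index(flat_idx, prob_map[c].shape)
        if prob_map[c, y, x] < prob_thresh:
            boxes.append(None)          # class judged absent
            continue
        mean_w, mean_h = mean_sizes[c]
        boxes.append(decode_tight_box(x, y, offset_map[c, :, y, x], mean_w, mean_h))
    return boxes
```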
In some examples, the measurement method may further include measuring the size of each target based on its tight box (not shown). The target can thus be measured precisely based on its tight box. In some examples, the size of a target may be the width and height of its tight box.
Hereinafter, the measurement device 100 based on deep learning with tight boxes according to the present disclosure is described in detail with reference to the drawings. The measurement device 100 may also be referred to as a recognition device, an auxiliary measurement device, or the like. The measurement device 100 according to the present disclosure is used to implement the measurement method described above. FIG. 9 is a block diagram showing a measurement device 100 based on deep learning with tight boxes according to an example of the present disclosure.
As shown in FIG. 9, in some examples, the measurement device 100 may include an acquisition module 10, a network module 20, and a recognition module 30.
In some examples, the acquisition module 10 may be configured to acquire an input image; for details, refer to the description of step S220. In some examples, the network module 20 may be configured to receive the input image and obtain a first output and a second output based on the input image; for details, refer to the description of the network module 20. In some examples, the recognition module 30 may be configured to identify targets based on the first output and the second output to obtain the tight boxes of the targets of each category; for details, refer to the description of step S260. In some examples, the measurement device 100 may further include a measurement module (not shown). The measurement module may be configured to measure the size of each target based on its tight box, so that the target can be measured precisely based on its tight box. In some examples, the size of a target may be the width and height of its tight box.
In the measurement method and measurement device 100 according to the present disclosure, a network module 20 is constructed that includes a backbone network 21, a segmentation network 22 for image segmentation based on weakly supervised learning, and a regression network 23 based on bounding-box regression, and the network module 20 is trained based on the tight boxes of targets. The backbone network 21 receives an input image (for example, a fundus image) and extracts a feature map whose resolution is consistent with that of the input image; the feature map is fed into the segmentation network 22 and the regression network 23 to obtain the first output and the second output, respectively; the tight boxes of the targets in the input image are then obtained based on the first output and the second output, thereby realizing measurement. In this case, the network module 20 trained on the targets' tight boxes can accurately predict the tight boxes of the targets in the input image, and accurate measurement can then be performed based on those tight boxes. In addition, by having the regression network 23 predict normalized offsets, the accuracy of recognizing or measuring targets whose sizes vary little can be improved. In addition, screening the pixels used to optimize the regression network 23 by the expected IoU reduces the adverse effect of off-center pixels on the optimization and improves its efficiency. In addition, the regression network 23 predicts category-specific offsets, which can further improve the accuracy of target recognition or measurement.
Hereinafter, the measurement method according to the present disclosure is described in further detail taking a fundus image as the input image. The measurement method for fundus images may also be referred to as a fundus-image measurement method based on deep learning with tight boxes. The fundus images described in the examples of the present disclosure serve to explain the technical solution of the present disclosure more clearly and do not limit the technical solution provided by the present disclosure. Unless otherwise specified, the measurement method for input images, the measurement device 100, and the corresponding training method are all applicable to fundus images. As an example of a fundus image, FIG. 2(a) shows a fundus image captured by a fundus camera.
In the measurement method for fundus images according to this embodiment, the network module 20 trained based on the targets' tight boxes can be used to identify at least one target in the fundus image so as to realize measurement. The fundus image may include at least one target, and the at least one target may be the optic cup and/or the optic disc. That is, the network module 20 trained based on the targets' tight boxes can identify the optic cup and/or optic disc in the fundus image so as to measure them. The optic cup and/or optic disc in the fundus image can thus be measured based on tight boxes. In other examples, microaneurysms in the fundus image may also be identified so as to measure them.
FIG. 10 is a flowchart showing a measurement method for fundus images according to an example of the present disclosure.
In some examples, as shown in FIG. 10, the measurement method for fundus images may include acquiring a fundus image (step S420), inputting the fundus image into the network module 20 to obtain a first output and a second output (step S440), and identifying targets based on the first output and the second output to obtain the tight boxes of the optic cup and/or optic disc in the fundus image so as to realize measurement (step S460).
In some examples, in step S420, a fundus image may be acquired. In some examples, the fundus image may include at least one target. In some examples, the at least one target may be identified to determine the target and the category to which it belongs (that is, the category of interest). For fundus images, the categories of interest (or simply categories) may be the optic cup and/or the optic disc, and the target of each category may be the optic cup or the optic disc. Specifically, if the optic disc or the optic cup in the fundus image is to be identified, the category of interest may be the optic cup or the optic disc; if both the optic disc and the optic cup in the fundus image are to be identified, the categories of interest may be the optic cup and the optic disc. In some examples, the fundus image may also contain neither an optic disc nor an optic cup. In this case, a fundus image in which no optic disc or optic cup exists can be judged as such.
In some examples, in step S440, the fundus image may be input into the network module 20 to obtain the first output and the second output. The first output may include the probability that each pixel of the fundus image belongs to each category (that is, the optic cup and/or optic disc), and the second output may include the offset between the position of each pixel of the fundus image and the tight box of the target of each category. In some examples, the offsets in the second output may be taken as the target offsets. For fundus images, the backbone network 21 of the network module 20 can be used to extract a feature map of the fundus image. In some examples, the feature map may have the same resolution as the fundus image. The decoding module of the network module 20 is configured to map the image features extracted at different scales back to the resolution of the fundus image to output the feature map. For details, refer to the description of the network module 20.
In addition, for fundus images, the training samples of the network module 20 may include fundus image data (that is, multiple fundus images to be trained) and label data corresponding to the fundus image data. The label data may include the gold standard of the category to which the optic cup and/or optic disc belongs and the gold standard of the tight box of the optic cup and/or optic disc.
In some examples, in step S460, the targets may be identified based on the first output and the second output to obtain the tight boxes of the optic cup and/or optic disc in the fundus image, thereby realizing measurement. The optic cup and/or optic disc can subsequently be measured precisely based on the tight boxes. In some examples, the target offset of the corresponding category (that is, the optic cup and/or optic disc) at the corresponding position may be selected from the second output based on the first output, and the tight box of the optic cup and/or optic disc may be obtained based on that target offset. For fundus images, preferably, the position of the pixel with the largest probability of belonging to each category may be obtained from the first output as the first position, and the tight box of the optic cup and/or optic disc may be obtained based on the target offset of the corresponding category at the position in the second output corresponding to the first position. In some examples, the first position may be obtained by taking the maximum. For details, refer to the description of step S260.
In some examples, the tight box of the optic cup and/or optic disc can be obtained based on the first position and the target offset. For details, refer to the description of step S260.
In some examples, the measurement method for fundus images may further include obtaining the ratio of the optic cup to the optic disc based on the tight boxes of the optic cup and the optic disc in the fundus image (not shown). The cup-to-disc ratio can thus be measured precisely based on the tight boxes of the optic cup and optic disc.
In some examples, after the tight box of the optic cup and/or optic disc is obtained in step S460, the optic cup and/or optic disc may be measured based on the tight box of the optic cup and/or the tight box of the optic disc in the fundus image to obtain the size of the optic cup and/or optic disc (the size may be, for example, the vertical diameter and the horizontal diameter). The size of the optic cup and/or optic disc can thus be measured precisely. In some examples, the height of the tight box may be taken as the vertical diameter of the optic cup and/or optic disc and the width of the tight box as the horizontal diameter, so as to obtain the size of the optic cup and/or optic disc.
In some examples, after the sizes of the optic cup and optic disc are obtained based on the tight boxes, the ratio of the optic cup to the optic disc (which may be referred to simply as the cup-to-disc ratio) can be obtained. In this case, the cup-to-disc ratio is obtained based on tight boxes and can thus be measured precisely.
In some examples, the cup-to-disc ratio may include a vertical cup-to-disc ratio and a horizontal cup-to-disc ratio. The vertical cup-to-disc ratio may be the ratio of the vertical diameters of the optic cup and the optic disc, and the horizontal cup-to-disc ratio may be the ratio of their horizontal diameters. In some examples, let the tight box of the optic cup in the fundus image be b_oc = (xl_oc, yt_oc, xr_oc, yb_oc) and the tight box of the optic disc be b_od = (xl_od, yt_od, xr_od, yb_od), where the first two values of b_oc and b_od denote the position of the upper-left corner of the tight box and the last two values denote the position of the lower-right corner; then
the vertical cup-to-disc ratio can satisfy the formula: VCDR = (yb_oc - yt_oc) / (yb_od - yt_od),
and the horizontal cup-to-disc ratio can satisfy the formula: HCDR = (xr_oc - xl_oc) / (xr_od - xl_od).
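These two ratios follow directly from the tight boxes, for example as in the following sketch (the function name is illustrative):

```python
def cup_to_disc_ratios(b_oc, b_od):
    # b_oc, b_od: tight boxes (xl, yt, xr, yb) of the optic cup and the optic disc.
    vcdr = (b_oc[3] - b_oc[1]) / (b_od[3] - b_od[1])  # vertical cup-to-disc ratio
    hcdr = (b_oc[2] - b_oc[0]) / (b_od[2] - b_od[0])  # horizontal cup-to-disc ratio
    return vcdr, hcdr
```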
Hereinafter, the measurement device 200 for fundus images according to the present disclosure is described in detail with reference to the drawings. The measurement device 200 for fundus images may also be referred to as a fundus-image measurement device 200 based on deep learning with tight boxes. The measurement device 200 for fundus images according to the present disclosure is used to implement the measurement method for fundus images described above. FIG. 11 is a block diagram showing a measurement device 200 for fundus images according to an example of the present disclosure.
As shown in FIG. 11, in some examples, the measurement device 200 for fundus images may include an acquisition module 50, a network module 20, and a recognition module 60. In some examples, the acquisition module 50 may be configured to acquire a fundus image; for details, refer to the description of step S420. In some examples, the network module 20 may be configured to receive the fundus image and obtain a first output and a second output based on the fundus image; for details, refer to the descriptions of the network module 20 and step S440. In some examples, the recognition module 60 may be configured to identify targets based on the first output and the second output to obtain the tight boxes of the optic cup and/or optic disc in the fundus image so as to realize measurement; for details, refer to the description of step S460. In some examples, the measurement device 200 may further include a cup-to-disc ratio module (not shown). The cup-to-disc ratio module may be configured to obtain the ratio of the optic cup to the optic disc based on the tight boxes of the optic cup and optic disc in the fundus image; for details, refer to the description of obtaining the cup-to-disc ratio based on the tight boxes of the optic cup and optic disc in the fundus image.
Although the present disclosure has been described above in detail with reference to the drawings and examples, it should be understood that the above description does not limit the present disclosure in any way. Those skilled in the art may make modifications and variations to the present disclosure as needed without departing from its true spirit and scope, and such modifications and variations all fall within the scope of the present disclosure.

Claims (19)

  1. A measurement method based on deep learning with tight boxes, characterized in that it is a measurement method in which a target is identified, so as to realize measurement, by a network module trained based on the tight box of the target, the tight box being the minimum enclosing rectangle of the target, the measurement method comprising: acquiring an input image comprising at least one target, the at least one target belonging to at least one category of interest; inputting the input image into the network module to obtain a first output and a second output, the first output comprising the probability that each pixel of the input image belongs to each category, the second output comprising the offset between the position of each pixel of the input image and the tight box of the target of each category, the offset in the second output being taken as a target offset, wherein the network module comprises a backbone network, a segmentation network for image segmentation based on weakly supervised learning, and a regression network based on bounding-box regression, the backbone network being configured to extract a feature map of the input image, the segmentation network taking the feature map as input to obtain the first output, and the regression network taking the feature map as input to obtain the second output, wherein the feature map is consistent with the resolution of the input image; and identifying the target based on the first output and the second output to obtain the tight box of the target of each category.
  2. The measurement method according to claim 1, characterized in that:
    the network module is trained by the following method:
    constructing training samples, wherein input image data of the training samples comprises a plurality of images to be trained, the plurality of images to be trained comprising images containing targets belonging to at least one category, and label data of the training samples comprises a gold standard of the category to which the target belongs and a gold standard of the tight box of the target; obtaining, by the network module and based on the input image data of the training samples, predicted segmentation data output by the segmentation network and predicted offsets output by the regression network corresponding to the training samples; determining a training loss of the network module based on the label data corresponding to the training samples, the predicted segmentation data, and the predicted offsets; and training the network module based on the training loss to optimize the network module.
  3. The measurement method according to claim 2, characterized in that:
    the determining the training loss of the network module based on the label data corresponding to the training samples, the predicted segmentation data, and the predicted offsets comprises: obtaining a segmentation loss of the segmentation network based on the predicted segmentation data and the label data corresponding to the training samples; obtaining a regression loss of the regression network based on the predicted offsets corresponding to the training samples and true offsets based on the label data, wherein a true offset is the offset between the position of a pixel of the image to be trained and the gold standard of the tight box of the target in the label data; and obtaining the training loss of the network module based on the segmentation loss and the regression loss.
  4. The measurement method according to any one of claims 1 to 3, characterized in that:
    the target offset is an offset normalized based on the average width and average height of the targets of each category, or the target offset is an offset normalized based on the average size of the targets of each category.
  5. The measurement method according to claim 3, characterized in that:
    multiple-instance learning is used to obtain, by category, a plurality of bags to be trained based on the gold standards of the tight boxes of the targets in each image to be trained, and the segmentation loss is obtained based on the plurality of bags to be trained of each category, wherein the plurality of bags to be trained comprise a plurality of positive bags and a plurality of negative bags; all pixels on each of a plurality of straight lines connecting two opposite sides of the gold standard of the target's tight box are grouped into one positive bag, the plurality of straight lines comprising at least one group of mutually parallel first parallel lines and mutually parallel second parallel lines respectively perpendicular to each group of first parallel lines; a negative bag is a single pixel in the region outside the gold standards of the tight boxes of all targets of a category; the segmentation loss comprises a unary term and a pairwise term, the unary term describing the degree to which each bag to be trained belongs to the gold standard of each category, and the pairwise term describing the degree to which a pixel of the image to be trained and its neighboring pixels belong to the same category.
  6. The measurement method according to claim 5, characterized in that:
    the angle of a first parallel line is the angle between the extension of the first parallel line and the extension of any non-intersecting side of the gold standard of the target's tight box, and the angle of the first parallel line is greater than -90° and less than 90°.
  7. The measurement method according to claim 2, characterized in that:
    pixels falling within the gold standard of the tight box of at least one target are selected, by category, from the image to be trained as positive samples of each category, and the matching tight box corresponding to each positive sample is obtained so as to screen the positive samples of each category based on the matching tight boxes, the regression network then being optimized using the screened positive samples of each category, wherein the matching tight box is, among the gold standards of the tight boxes that the positive sample falls into, the gold-standard tight box whose true offset relative to the position of the positive sample is smallest.
  8. The measurement method according to claim 1, 3 or 7, characterized in that:
    letting the position of a pixel be denoted (x, y), the tight box of a target corresponding to the pixel be denoted b = (xl, yt, xr, yb), and the offset of the target's tight box b relative to the position of the pixel be denoted t = (tl, tt, tr, tb), then tl, tt, tr, tb satisfy the formulas:
    tl = (x - xl) / S_c1,
    tt = (y - yt) / S_c2,
    tr = (xr - x) / S_c1,
    tb = (yb - y) / S_c2,
    where xl, yt denote the position of the upper-left corner of the target's tight box, xr, yb denote the position of the lower-right corner of the target's tight box, S_c1 denotes the average width of the targets of the c-th category, and S_c2 denotes the average height of the targets of the c-th category.
  9. The measurement method according to claim 2, characterized in that:
    pixels whose expected intersection-over-union is greater than a preset expected intersection-over-union are selected, by category and using the expected intersection-over-union corresponding to each pixel of the image to be trained, from the pixels of the image to be trained to optimize the regression network, wherein a plurality of boxes of different sizes are constructed with a pixel of the image to be trained as their center point, and the maximum of the intersection-over-union values between the plurality of boxes and the matching tight box of that pixel is taken as the expected intersection-over-union, the matching tight box being, among the gold standards of the tight boxes that the pixel of the image to be trained falls into, the gold-standard tight box whose true offset relative to the position of the pixel is smallest.
  10. The measurement method according to claim 9, characterized in that:
    the expected intersection-over-union satisfies the formula:
    EIoU(r_1, r_2) = max{IoU_1(r_1', r_2'), IoU_2(r_1', r_2'), IoU_3(r_1', r_2'), IoU_4(r_1', r_2')}, with r_1' = min(r_1, 1 - r_1) and r_2' = min(r_2, 1 - r_2),
    where r_1, r_2 are the relative position of the pixel of the image to be trained within the matching tight box, 0 < r_1, r_2 < 1, IoU_1(r_1, r_2) = 4 r_1 r_2, IoU_2(r_1, r_2) = 2 r_1 / (2 r_1 (1 - 2 r_2) + 1), IoU_3(r_1, r_2) = 2 r_2 / (2 r_2 (1 - 2 r_1) + 1), and IoU_4(r_1, r_2) = 1 / (4 (1 - r_1)(1 - r_2)).
  11. The measurement method according to claim 1, characterized in that:
    the identifying the target based on the first output and the second output to obtain the tight box of the target of each category is:
    obtaining, from the first output, the positions of the pixels with the largest local probability of belonging to each category as first positions, and obtaining the tight box of the target of each category based on the target offset of the corresponding category at the position in the second output corresponding to the first position.
  12. The measurement method according to claim 1, characterized in that:
    the sizes of multiple targets of the same category differ from one another by less than a factor of 10.
  13. The measurement method according to claim 1, characterized in that:
    the backbone network comprises an encoding module and a decoding module, the encoding module being configured to extract image features at different scales, and the decoding module being configured to map the image features extracted at different scales back to the resolution of the input image to output the feature map.
  14. The measurement method according to claim 1, characterized in that:
    the input image is a fundus image, and the target is an optic cup and/or an optic disc.
  15. The measurement method according to claim 14, characterized in that:
    the identifying the target based on the first output and the second output to obtain the tight box of the target of each category is:
    obtaining, from the first output, the position of the pixel with the largest probability of belonging to each category as a first position, and obtaining the tight box of the target of each category based on the target offset of the corresponding category at the position in the second output corresponding to the first position.
  16. The measurement method according to claim 14, characterized in that:
    the optic cup and/or optic disc are measured based on the tight box of the optic cup in the fundus image and/or the tight box of the optic disc in the fundus image to obtain the size of the optic cup and/or optic disc, and the ratio of the optic cup to the optic disc is obtained based on the sizes of the optic cup and the optic disc in the fundus image.
  17. A measurement device based on deep learning with tight boxes, characterized in that it is a measurement device in which a target is identified, so as to realize measurement, by a network module trained based on the tight box of the target, the tight box being the minimum enclosing rectangle of the target, the measurement device comprising an acquisition module, a network module, and a recognition module; the acquisition module is configured to acquire an input image comprising at least one target, the at least one target belonging to at least one category of interest; the network module is configured to receive the input image and obtain a first output and a second output based on the input image, the first output comprising the probability that each pixel of the input image belongs to each category, the second output comprising the offset between the position of each pixel of the input image and the tight box of the target of each category, the offset in the second output being taken as a target offset, wherein the network module comprises a backbone network, a segmentation network for image segmentation based on weakly supervised learning, and a regression network based on bounding-box regression, the backbone network being configured to extract a feature map of the input image, the segmentation network taking the feature map as input to obtain the first output, and the regression network taking the feature map as input to obtain the second output, wherein the feature map is consistent with the resolution of the input image; and the recognition module is configured to identify the target based on the first output and the second output to obtain the tight box of the target of each category.
  18. The measurement device according to claim 17, wherein:
    the input image is a fundus image, and the target is an optic cup and/or an optic disc.
  19. The measurement device according to claim 18, wherein:
    said identifying the target based on the first output and the second output to obtain the tight frame mark of the target of each category comprises:
    obtaining, from the first output, the position of the pixel with the highest probability of belonging to each category as a first position, and obtaining the tight frame mark of the target of each category based on the target offsets of the corresponding category at the position in the second output that corresponds to the first position.
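To make the network module of claim 17 concrete, the following PyTorch sketch shows one way such a module could be wired up. It is an illustration only, not the patented implementation: the two-layer stride-1 backbone, the channel widths, the sigmoid activation and the (left, top, right, bottom) channel layout of the regression head are assumptions; from the claim it takes only that the feature map keeps the input resolution and that the two heads produce per-pixel class probabilities (first output) and per-pixel tight-frame offsets (second output).

```python
import torch
import torch.nn as nn

class TightFrameNet(nn.Module):
    """Sketch of a backbone + segmentation head + box-regression head network.

    Layer sizes and the offset channel layout are assumptions, not taken from the patent.
    """

    def __init__(self, num_classes: int, in_channels: int = 3, feat_channels: int = 32):
        super().__init__()
        # Backbone: stride-1 convolutions so the feature map keeps the input resolution.
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Segmentation branch: per-pixel probability of belonging to each category.
        self.seg_head = nn.Conv2d(feat_channels, num_classes, kernel_size=1)
        # Regression branch: 4 offsets (left, top, right, bottom) per category and pixel.
        self.reg_head = nn.Conv2d(feat_channels, 4 * num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor):
        feat = self.backbone(x)                        # same H x W as the input image
        first_output = torch.sigmoid(self.seg_head(feat))
        second_output = self.reg_head(feat)
        return first_output, second_output
```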
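Claims 15 and 19 decode a tight frame mark by taking, for each category, the pixel with the highest probability as the first position and applying the target offsets of that category at that position. A minimal NumPy sketch of that step follows; the (C, H, W) probability layout, the (4·C, H, W) offset layout and the (left, top, right, bottom) ordering and sign convention are assumptions.

```python
import numpy as np

def decode_tight_frames(first_output, second_output):
    """Decode one tight frame mark (x_min, y_min, x_max, y_max) per category.

    first_output:  (C, H, W) per-pixel class probabilities.
    second_output: (4*C, H, W) per-pixel offsets to each category's tight frame,
                   assumed ordered (left, top, right, bottom) per category.
    """
    num_classes, height, width = first_output.shape
    boxes = {}
    for c in range(num_classes):
        # First position: the pixel with the highest probability for category c.
        y, x = np.unravel_index(np.argmax(first_output[c]), (height, width))
        # Target offsets of category c at that position give the distances to the box edges.
        left, top, right, bottom = second_output[4 * c:4 * c + 4, y, x]
        boxes[c] = (x - left, y - top, x + right, y + bottom)
    return boxes
```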
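Claim 16 turns the decoded tight frame marks of the optic cup and the optic disc into a cup-to-disc ratio. The sketch below assumes the vertical diameters of the two boxes are compared and uses hypothetical coordinates; the claim only requires that a ratio be computed from the measured sizes.

```python
def vertical_cup_to_disc_ratio(cup_box, disc_box):
    """Cup-to-disc ratio from the two tight frame marks.

    Each box is (x_min, y_min, x_max, y_max) in pixels; comparing the vertical
    diameters is an assumption made for this sketch.
    """
    cup_height = cup_box[3] - cup_box[1]      # vertical diameter of the optic cup
    disc_height = disc_box[3] - disc_box[1]   # vertical diameter of the optic disc
    return cup_height / disc_height

# Example with hypothetical boxes decoded from a fundus image:
#   cup = (310, 295, 388, 360), disc = (270, 250, 430, 415)
#   vertical_cup_to_disc_ratio(cup, disc) -> 65 / 165 ≈ 0.39
```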

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111184059 2021-10-11
CN202111184059.7 2021-10-11

Publications (1)

Publication Number Publication Date
WO2023060637A1 (en) 2023-04-20

Family

ID=78871496

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/125152 WO2023060637A1 (en) 2021-10-11 2021-10-21 Measurement method and measurement apparatus based on deep learning of tight box mark

Country Status (2)

Country Link
CN (6) CN115359070A (en)
WO (1) WO2023060637A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116681892B (en) * 2023-06-02 2024-01-26 山东省人工智能研究院 Precise image segmentation method based on an improved multi-center polar mask model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018125580A1 (en) * 2016-12-30 2018-07-05 Konica Minolta Laboratory U.S.A., Inc. Gland segmentation with deeply-supervised multi-level deconvolution networks
CN110321815A (en) * 2019-06-18 2019-10-11 中国计量大学 Road crack recognition method based on deep learning
CN110349148A (en) * 2019-07-11 2019-10-18 电子科技大学 Image object detection method based on weakly supervised learning
CN111652140A (en) * 2020-06-03 2020-09-11 广东小天才科技有限公司 Method, device, equipment and medium for accurately segmenting questions based on deep learning
CN112883971A (en) * 2021-03-04 2021-06-01 中山大学 SAR image ship target detection method based on deep learning
CN112966684A (en) * 2021-03-15 2021-06-15 北湾科技(武汉)有限公司 Cooperative learning character recognition method under attention mechanism

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8705826B2 (en) * 2008-05-14 2014-04-22 Agency For Science, Technology And Research Automatic cup-to-disc ratio measurement system
WO2014031086A1 (en) * 2012-08-24 2014-02-27 Agency For Science, Technology And Research Methods and systems for automatic location of optic structures in an image of an eye, and for automatic retina cup-to-disc ratio computation
CN106214120A (en) * 2016-08-19 2016-12-14 靳晓亮 Glaucoma screening method
CN106530280B (en) * 2016-10-17 2019-06-11 东软医疗系统股份有限公司 Method and device for locating an organ in an image
CN107689047B (en) * 2017-08-16 2021-04-02 汕头大学 Method and device for automatically cutting fundus image and readable storage medium thereof
US9946960B1 (en) * 2017-10-13 2018-04-17 StradVision, Inc. Method for acquiring bounding box corresponding to an object in an image by using convolutional neural network including tracking network and computing device using the same
US11087130B2 (en) * 2017-12-29 2021-08-10 RetailNext, Inc. Simultaneous object localization and attribute classification using multitask deep neural networks
CN109829877A (en) * 2018-09-20 2019-05-31 中南大学 Automatic cup-to-disc ratio evaluation method for retinal fundus images
CN113012093B (en) * 2019-12-04 2023-12-12 深圳硅基智能科技有限公司 Training method and training system for glaucoma image feature extraction
CN111862187B (en) * 2020-09-21 2021-01-01 平安科技(深圳)有限公司 Cup-to-disc ratio determination method, apparatus, device and storage medium based on a neural network
CN112232240A (en) * 2020-10-21 2021-01-15 南京师范大学 Detection and identification method for objects spilled on roads based on an optimized intersection-over-union (IoU) function
CN112668579A (en) * 2020-12-24 2021-04-16 西安电子科技大学 Weakly supervised semantic segmentation method based on adaptive affinity and class distribution
CN113326763B (en) * 2021-05-25 2023-04-18 河南大学 Remote sensing target detection method based on bounding box consistency

Also Published As

Publication number Publication date
CN115331050A (en) 2022-11-11
CN113780477A (en) 2021-12-10
CN113920126B (en) 2022-07-22
CN115359070A (en) 2022-11-18
CN115423818A (en) 2022-12-02
CN113780477B (en) 2022-07-22
CN115578577A (en) 2023-01-06
CN113920126A (en) 2022-01-11

Similar Documents

Publication Publication Date Title
JP2019032773A (en) Image processing apparatus, and image processing method
EP3499414B1 (en) Lightweight 3d vision camera with intelligent segmentation engine for machine vision and auto identification
CN107103320B (en) Embedded medical data image identification and integration method
CN113591795B (en) Lightweight face detection method and system based on mixed attention characteristic pyramid structure
CN112733950A (en) Power equipment fault diagnosis method based on combination of image fusion and target detection
JP5549345B2 (en) Sky detection apparatus and method used in image acquisition apparatus
CN104077577A (en) Trademark detection method based on convolutional neural network
JP2006524394A (en) Delineation of human contours in images
CN113435282B (en) Unmanned aerial vehicle image ear recognition method based on deep learning
CN109670501B (en) Object identification and grasping position detection method based on deep convolutional neural network
CN108334955A (en) Copy of ID Card detection method based on Faster-RCNN
CN113724231A (en) Industrial defect detection method based on semantic segmentation and target detection fusion model
US20200327557A1 (en) Electronic detection of products and arrangement of products in a display structure, electronic detection of objects and arrangement of objects on and around the display structure, electronic detection of conditions of and around the display structure, and electronic scoring of the detected product and object arrangements and of the detected conditions
CN114897816A (en) Mask R-CNN mineral particle identification and particle size detection method based on improved Mask
CN113033315A (en) Rare earth mining high-resolution image identification and positioning method
CN112069985A (en) High-resolution field image rice ear detection and counting method based on deep learning
WO2023060637A1 (en) Measurement method and measurement apparatus based on deep learning of tight box mark
CN112668445A (en) Vegetable type detection and identification method based on yolov5
CN113688846B (en) Object size recognition method, readable storage medium, and object size recognition system
CN113008380B (en) Intelligent AI body temperature early warning method, system and storage medium
CN112924037A (en) Infrared body temperature detection system and detection method based on image registration
CN112488165A (en) Infrared pedestrian identification method and system based on deep learning model
CN114373156A (en) Non-contact type water level, flow velocity and flow intelligent monitoring system based on video image recognition algorithm
JP2008084109A (en) Eye opening/closing determination device and eye opening/closing determination method
Zhou et al. Wireless capsule endoscopy video automatic segmentation

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21960366

Country of ref document: EP

Kind code of ref document: A1